
[Alibaba Qwen's latest multimodal model] Qwen2-VL: Seeing the World More Clearly

Source: original post, 2024/9/20 20:48:07

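After a year of relentless effort, today we are thrilled to release Qwen2-VL! Qwen2-VL is the latest version of the vision-language models based on Qwen2 in the Qwen model family. Compared with Qwen-VL, Qwen2-VL has the following capabilities: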

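SoTA understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
Understanding videos of 20 minutes or longer: Qwen2-VL can understand videos over 20 minutes long, supporting high-quality video-based question answering, dialogue, content creation, and more.
Agent that can operate mobile phones, robots, and other devices: With its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots and operate them automatically based on the visual environment and text instructions.
Multilingual support: To serve global users, besides English and Chinese, Qwen2-VL now understands text in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, and others.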

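We have open-sourced Qwen2-VL-2B and Qwen2-VL-7B under the Apache 2.0 license and released an API for Qwen2-VL-72B. The open-source models are integrated into Hugging Face Transformers, vLLM, and other third-party frameworks. We hope you enjoy them!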

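Performance

We evaluated our model's visual capabilities across several key dimensions: complex college-level problem solving, mathematical ability, document and table comprehension, multilingual text-image understanding, general scenario question answering, video comprehension, and agent-based interaction. Overall, our 72B model delivers top-tier performance on most metrics and often surpasses closed-source models such as GPT-4o and Claude 3.5-Sonnet. Notably, it has a significant advantage in document understanding.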

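At the 7B scale, we have retained support for image, multi-image, and video inputs, delivering competitive performance at a more cost-effective model size. Specifically, our model excels at document understanding tasks such as DocVQA and at multilingual text understanding from images as measured by MTVQA, where it reaches state-of-the-art performance.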

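In addition, we are excited to introduce a smaller 2B model, optimized for potential mobile deployment. Despite its compact size, this model performs strongly in image, video, and multilingual understanding. Compared with other models of similar scale, it stands out in video-related tasks, document understanding, and general scenario question answering.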


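Model Capabilities

1. Enhanced Recognition Capabilities

Qwen2-VL now has stronger object recognition: beyond identifying plants and landmarks, it understands complex relationships among multiple objects in a scene. We have also significantly improved the model's ability to recognize handwritten text and multiple languages in images, making it more accessible to users worldwide.

Example: Multi-object Recognition (1/6)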

User
Output the color and number of each box.
Qwen2-VL
The image shows a stack of colorful blocks with numbers on them. Here is the color and number of each block from top to bottom:

Top row (single block): Blue with the number 9
Second row (two blocks): Light blue with the number 7, and green with the number 8
Third row (three blocks): Purple with the number 4, pink with the number 5, and light green with the number 6
Bottom row (four blocks): Red with the number 0, orange with the number 1, yellow with the number 2, and light green with the number 3

Example: Plant Identification (2/6)

User
What kind of flower is this?

Qwen2-VL
The flowers in the picture are forget-me-nots (Myosotis). These small, delicate flowers are known for their vibrant colors, typically blue, pink, or white, and are often used in bouquets and gardens for their charming appearance.

Example: Document Parsing with Dense Formulas (3/6)

User
Convert the content in the image to Markdown


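I wanted to paste the model's output from the original post here, but the editor could not handle it, so please refer to the original post for this example.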

Model Architecture Updates:

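Naive Dynamic Resolution: Unlike its predecessor, Qwen2-VL can handle arbitrary image resolutions, mapping each image to a dynamic number of visual tokens and offering a more human-like visual processing experience.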
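To make "a dynamic number of visual tokens" concrete, here is a minimal sketch of how a token count could follow from image size. It assumes that one visual token covers a 28x28 pixel area, which is inferred from the `256*28*28` pixel budgets in the Quickstart comments of this post, and it uses the default 4 to 16384 token range mentioned there. The function name `estimate_visual_tokens` and the rounding policy are hypothetical; this is not the model's actual resizing code.

```python
import math

PIXELS_PER_TOKEN_SIDE = 28  # assumption: one visual token covers a 28x28 pixel area


def estimate_visual_tokens(height, width,
                           min_pixels=4 * 28 * 28,
                           max_pixels=16384 * 28 * 28):
    """Hypothetical helper: estimate how many visual tokens an image maps to."""
    side = PIXELS_PER_TOKEN_SIDE
    # Snap each dimension to the nearest multiple of 28, never below one cell.
    h = max(side, round(height / side) * side)
    w = max(side, round(width / side) * side)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor to fit the pixel budget.
        scale = math.sqrt(height * width / max_pixels)
        h = max(side, math.floor(height / scale / side) * side)
        w = max(side, math.floor(width / scale / side) * side)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor to reach the minimum.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / side) * side
        w = math.ceil(width * scale / side) * side
    # One token per 28x28 cell of the resized image.
    return (h // side) * (w // side)


print(estimate_visual_tokens(224, 224))   # small square image
print(estimate_visual_tokens(448, 896))   # wider image, more tokens
print(estimate_visual_tokens(1120, 1120, max_pixels=1280 * 28 * 28))  # capped budget
```

Under these assumptions a larger or wider image simply yields more tokens, and lowering `max_pixels` trades detail for speed and memory.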

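Multimodal Rotary Position Embedding (M-ROPE): The position embedding is decomposed into multiple components that capture 1D textual, 2D visual, and 3D video positional information, strengthening the model's multimodal processing capabilities.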
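The decomposition can be pictured as giving every token a triple of position indices (temporal, height, width) instead of a single index. The sketch below is an illustration under stated assumptions, not the model's implementation: text tokens reuse one index for all three components, an image is treated as a single frame whose tokens vary only in height and width, video frames advance the temporal index, and the segment after a visual block resumes from the largest index used so far. The function name `mrope_position_ids` and the segment tuple format are hypothetical.

```python
def mrope_position_ids(segments):
    """Hypothetical sketch: build (temporal, height, width) position ids.

    segments is a list of ("text", n_tokens) or ("vision", frames, rows, cols).
    An image is a vision segment with frames == 1.
    """
    ids = []
    start = 0  # next free position index
    for seg in segments:
        if seg[0] == "text":
            n = seg[1]
            # Text: all three components share one index, like ordinary 1-D RoPE.
            ids.extend((start + i, start + i, start + i) for i in range(n))
            start += n
        else:
            _, frames, rows, cols = seg
            for t in range(frames):
                for r in range(rows):
                    for c in range(cols):
                        # Vision: each component tracks its own axis.
                        ids.append((start + t, start + r, start + c))
            # Assumption: the next segment resumes after the largest id used so far.
            start += max(frames, rows, cols)
    return ids


# Two text tokens, a 2x2-token image, then one more text token.
for triple in mrope_position_ids([("text", 2), ("vision", 1, 2, 2), ("text", 1)]):
    print(triple)
```

For pure text the three components are identical, so the scheme reduces to a standard one-dimensional position index.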

We offer three models, with 2 billion, 7 billion, and 72 billion parameters.

Quickstart

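We provide a toolkit that makes it easier to handle various types of visual input, including base64, URLs, and interleaved images and videos. You can install it with the following command: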

pip install qwen-vl-utils
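As a rough illustration of the input types listed above, the sketch below builds a single user turn that interleaves a base64-encoded image, an image URL, a local video, and a text question. It only constructs the message structure and does not call the toolkit. The helper names `to_data_uri` and `build_messages` are hypothetical, and the `data:image;base64,` prefix, the `file://` path form, and the `"video"` entry are assumptions about what the toolkit accepts, so check its documentation before relying on them.

```python
import base64


def to_data_uri(image_bytes):
    """Encode raw image bytes as a base64 data URI string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    # Assumption: the toolkit recognizes this prefix for inline base64 images.
    return "data:image;base64," + encoded


def build_messages(image_bytes, image_url, video_path, question):
    """Hypothetical helper: one user turn mixing image, video and text entries."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": to_data_uri(image_bytes)},
                {"type": "image", "image": image_url},
                # Assumption: local files are referenced with a file:// prefix.
                {"type": "video", "video": "file://" + video_path},
                {"type": "text", "text": question},
            ],
        }
    ]


# Placeholder bytes, URL and path, used only to show the structure.
messages = build_messages(
    b"abc", "https://example.com/a.jpg", "/tmp/clip.mp4", "Compare these inputs."
)
print([entry["type"] for entry in messages[0]["content"]])
```

The resulting list has the same shape as the `messages` variable in the chat example, so it could be passed through the same preparation steps.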

Below is a code snippet showing how to use the chat model with transformers and qwen_vl_utils:

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# import torch
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
)
# Move the inputs to the device the model was loaded on
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
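The trimming step near the end of the snippet is easy to overlook: the generated sequences begin with a copy of the prompt tokens, so slicing off the first `len(in_ids)` entries leaves only the newly generated tokens to decode. Here is a minimal illustration with plain Python lists; the token ids are made up, and the helper name `trim_prompt` is hypothetical.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the echoed prompt from the front of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token ids (made up for illustration, not real vocabulary entries).
prompts = [[101, 7, 8], [101, 9]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 55]]

new_tokens = trim_prompt(prompts, outputs)
print(new_tokens)  # [[42, 43], [55]]
```

Without this step, the decoded text would repeat the whole prompt before the model's answer.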