Paper: https://arxiv.org/abs/2308.12966
1. Qwen-VL model parameter distribution
Vision Encoder 1.9B; VL Adapter 0.08B; LLM 7.7B; Total 9.6B
Language model: Qwen-7B
Vision model: ViT-bigG
V-L Adapter: single-layer cross-attention
Data: image input: each image is wrapped in <img> and </img> tags; region-level descriptions and detection information are added as well
Training:
1. Pre-training stage: data processing:

Data filtering and processing during pre-training
2. Multi-task pre-training: the vision encoder's input resolution is raised from 224×224 to 448×448, the language model is unfrozen, and all parameters are trained

Data used for multi-task pre-training
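To see what the resolution change means for sequence length, the sketch below counts the patches the ViT produces at each resolution. It assumes a patch stride of 14, the value the Qwen-VL paper reports for its vision encoder; `num_patches` is a hypothetical helper name used only for illustration.

```python
def num_patches(image_size: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches for a square image."""
    grid = image_size // patch_size
    return grid * grid


low = num_patches(224)   # 16 x 16 grid
high = num_patches(448)  # 32 x 32 grid
print(low, high, high // low)  # 256 1024 4
```

Under this assumption, doubling the side length quadruples the number of patch features, which is why a fixed-length compression step in front of the LLM becomes useful.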
The figure shows how the multi-task training data is organized: the black parts are the data input, and the blue parts are the output, which is what the loss is computed on.

Data format for multi-task training
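Computing the loss only on the output (blue) tokens is commonly done by masking the input positions in the label sequence. The sketch below illustrates the idea in plain Python. The ignore value -100 follows the default of PyTorch's cross-entropy loss and is an assumption here, not something confirmed from the Qwen-VL training code; `build_labels` and `masked_nll` are hypothetical helpers, and the next-token shift is left out for brevity.

```python
import math

IGNORE_INDEX = -100  # assumption: PyTorch's default ignore_index


def build_labels(input_ids, output_mask):
    """Keep the token id where output_mask is True, otherwise mask it out."""
    return [tok if is_out else IGNORE_INDEX
            for tok, is_out in zip(input_ids, output_mask)]


def masked_nll(token_probs, labels):
    """Mean negative log-likelihood over unmasked positions only.

    token_probs[t] is the probability the model assigned to the
    correct token at position t (targets assumed already aligned).
    """
    kept = [-math.log(p) for p, y in zip(token_probs, labels)
            if y != IGNORE_INDEX]
    return sum(kept) / len(kept)


input_ids = [11, 12, 13, 21, 22]                 # 3 input tokens, 2 output tokens
output_mask = [False, False, False, True, True]  # True marks output tokens
labels = build_labels(input_ids, output_mask)
print(labels)  # [-100, -100, -100, 21, 22]

# Only the last two probabilities contribute to the loss.
loss = masked_nll([0.9, 0.1, 0.2, 0.5, 0.25], labels)
```

The probabilities at the three input positions have no effect on `loss`, which mirrors the black/blue split described above.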
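3. Vision-language adapter: its main job is to compress the image feature sequence to a fixed length of 256 while aligning it with the LLM's text representations. The adapter randomly initializes 256 vectors as query vectors and uses the image features produced by the ViT as the keys and values of the attention layer. The query vectors attend over the image keys and values, and the attention output is returned.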

Qwen-VL training pipeline
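The compression performed by the adapter can be illustrated with a minimal single-head cross-attention in plain Python. This is a sketch of the general technique, not the Qwen-VL implementation: it omits the linear projections, layer norms, positional embeddings, and multiple heads, and it uses toy sizes (4 queries of dimension 8 instead of 256 queries). The function names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


random.seed(0)
dim, num_queries = 8, 4  # toy sizes; the notes describe 256 queries
queries = [[random.gauss(0, 0.02) for _ in range(dim)]
           for _ in range(num_queries)]

for num_patches in (16, 64):  # two different input lengths
    feats = [[random.gauss(0, 1) for _ in range(dim)]
             for _ in range(num_patches)]
    compressed = cross_attention(queries, feats, feats)
    print(num_patches, "->", len(compressed), "x", len(compressed[0]))
```

Whatever the number of input patch features, the output has one vector per query, which is what lets the adapter hand the LLM a fixed-length image sequence.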
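How image information is fused:
Steps:
- Encode the image with the vision transformer;
- Feed the image features into the vision-language adapter layer, which further compresses and encodes them;
- Assign the resulting image representation to the span reserved for the image in the LLM's hidden states, then feed the hidden states, which now contain both image and text representations, into the LLM for encoding;
Reference code: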
# -------------------------------------【step1】-------------------------------------------
# 1. The tokenizer converts the image and text in the query into input_ids; the image is encoded first:
self.visual = VisionTransformer(**config.visual)
...
if past_key_values is None and torch.any(input_ids == self.config.visual['image_start_id']):
    # (batch index, position) of every image start token and image end token
    bos_pos = torch.where(input_ids == self.config.visual['image_start_id'])
    eos_pos = torch.where(input_ids == self.config.visual['image_start_id'] + 1)
    # every start token must be paired with an end token in the same sample
    assert (bos_pos[0] == eos_pos[0]).all()
    img_pos = torch.stack((bos_pos[0], bos_pos[1], eos_pos[1]), dim=1)
    images = []
    for i, a, b in img_pos:
        # the ids between the two markers hold the image path as UTF-8 bytes,
        # followed by padding ids (image_start_id + 2)
        image = input_ids[i][a + 1 : b - 1].tolist()
        image = image[ : image.index(self.config.visual['image_start_id'] + 2)]
        images.append(bytes(image).decode('utf-8'))
    # print(images)
    # ['demo.jpg']
    # These image paths are then passed to self.visual.encode, the encode function of VisionTransformer
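Step 1 can recover the image path from `input_ids` because the path's UTF-8 bytes are stored as token ids between the start and end markers, followed by padding. The round trip can be sketched in plain Python as below. The marker ids (1000, 1001, 1002), the 16-slot span, and both function names are hypothetical values chosen for illustration; the notes describe a reserved span of 256 positions.

```python
IMG_START, IMG_END, IMG_PAD = 1000, 1001, 1002  # hypothetical marker ids
SPAN = 16  # toy span length; the notes describe 256 reserved positions


def encode_image_path(path: str) -> list:
    """Store the path as byte-valued ids, padded to SPAN, between markers."""
    data = list(path.encode("utf-8"))
    if len(data) >= SPAN:
        raise ValueError("path does not fit in the reserved span")
    return [IMG_START] + data + [IMG_PAD] * (SPAN - len(data)) + [IMG_END]


def decode_image_paths(input_ids: list) -> list:
    """Recover every image path stored between start and end markers."""
    paths = []
    pos = 0
    while IMG_START in input_ids[pos:]:
        a = input_ids.index(IMG_START, pos)
        b = input_ids.index(IMG_END, a)
        span = input_ids[a + 1 : b]
        span = span[: span.index(IMG_PAD)]  # drop the padding
        paths.append(bytes(span).decode("utf-8"))
        pos = b + 1
    return paths


ids = [5, 6] + encode_image_path("demo.jpg") + [7, 8]
print(decode_image_paths(ids))  # ['demo.jpg']
```

Because the span length is fixed, the same positions can later be overwritten with the image features without changing the sequence length.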
# -------------------------------------【step2】-------------------------------------------
# 2. The vision encode step mainly calls VisionTransformer's forward and the adapter to obtain the image representation
def forward(self, x: torch.Tensor):
    x = x.to(
        dtype=self.transformer.get_cast_dtype(),
        device=self.transformer.get_cast_device(),
    )
    # to patches
    x = self.conv1(x)  # shape = [*, width, grid, grid]
    x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
    x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]
    x = x + get_abs_pos(self.positional_embedding, x.size(1))
    x = self.ln_pre(x)
    x = x.permute(1, 0, 2)  # NLD -> LND
    # Image representation after the VisionTransformer blocks
    x = self.transformer(x)
    x = x.permute(1, 0, 2)  # LND -> NLD
    # Pass through the vision-language adapter module
    x = self.attn_pool(x)
    x = self.ln_post(x)
    x = x @ self.proj
    # Return the result
    return x
# Excerpt: get_abs_pos and get_2d_sincos_pos_embed are helper functions not shown here.
import torch
from torch import nn
from torch.nn.init import trunc_normal_

class Resampler(nn.Module):
    """
    A 2D perceiver-resampler network with one cross attention layers by
    (grid_size**2) learnable queries and 2d sincos pos_emb
    Outputs:
        A tensor with the shape of (grid_size**2, embed_dim)
    """
    def __init__(
        self,
        grid_size,
        embed_dim,
        num_heads,
        kv_dim=None,
        norm_layer=nn.LayerNorm
    ):
        super().__init__()
        self.num_queries = grid_size ** 2
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.pos_embed = nn.Parameter(
            torch.from_numpy(get_2d_sincos_pos_embed(embed_dim, grid_size)).float()
        ).requires_grad_(False)
        # Randomly initialized query vectors
        self.query = nn.Parameter(torch.zeros(self.num_queries, embed_dim))
        trunc_normal_(self.query, std=.02)
        if kv_dim is not None and kv_dim != embed_dim:
            self.kv_proj = nn.Linear(kv_dim, embed_dim, bias=False)
        else:
            self.kv_proj = nn.Identity()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.ln_q = norm_layer(embed_dim)
        self.ln_kv = norm_layer(embed_dim)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    def forward(self, x, attn_mask=None):
        # Get the absolute position embedding
        pos_embed = get_abs_pos(self.pos_embed, x.size(1))
        # Project the feature dimension: the vision transformer's features may not
        # match the dimension of the adapter's query vectors
        x = self.kv_proj(x)
        x = self.ln_kv(x).permute(1, 0, 2)
        N = x.shape[1]  # N is the batch size
        q = self.ln_q(self.query)
        # Cross attention between q and the vision transformer output x
        out = self.attn(
            self._repeat(q, N) + self.pos_embed.unsqueeze(1),  # add position encoding to the query vectors
            x + pos_embed.unsqueeze(1),  # add position encoding to x again; the vision transformer already added one after patchifying
            x,
            attn_mask=attn_mask)[0]
        # Return the result
        return out.permute(1, 0, 2)

    def _repeat(self, query, N: int):
        return query.unsqueeze(1).repeat(1, N, 1)
# -------------------------------------【step3】-------------------------------------------
# 3. Assign the features encoded by the vision transformer and adapter to the LLM's hidden_states, then feed them into the LLM for encoding:
# images has shape (bsz, 256, hidden_size)
# a + 1 : b has length 256: these are the 256 positions the tokenizer reserved for the image encoding when tokenizing the input.
# Once the image encoding is available, it is written into the positions reserved for it in hidden_states
# hidden_states
for idx, (i, a, b) in enumerate(img_pos):
    hidden_states[i][a + 1 : b] = images[idx]
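The assignment in step 3 can be tried out on plain Python lists. The sketch below uses nested lists in place of tensors and a toy span of 3 positions instead of 256; `splice_image_features` is a hypothetical helper, and the length check is an addition for illustration that the excerpt above does not contain.

```python
def splice_image_features(hidden_states, img_pos, images):
    """Overwrite each reserved span (a+1 .. b-1) with the image features.

    hidden_states: batch of sequences of vectors (nested lists)
    img_pos: (batch_index, start_marker_pos, end_marker_pos) per image
    images: one feature sequence per image, same order as img_pos
    """
    for idx, (i, a, b) in enumerate(img_pos):
        reserved = b - a - 1
        if len(images[idx]) != reserved:
            raise ValueError("image features do not match the reserved span")
        hidden_states[i][a + 1 : b] = images[idx]
    return hidden_states


hidden = [[[0.0, 0.0] for _ in range(7)]]  # batch 1, seq len 7, hidden size 2
img_pos = [(0, 1, 5)]                      # start marker at 1, end marker at 5
images = [[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]]  # fills positions 2, 3, 4

splice_image_features(hidden, img_pos, images)
print(hidden[0])
```

The positions of the start and end markers keep their original embeddings; only the slots between them are replaced, so the sequence length is unchanged.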