QwenVL

GitHub: QwenLM/Qwen-VL, the official repo of Qwen-VL (通義千問-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.

Paper: https://arxiv.org/abs/2308.12966

1. Qwen-VL model parameter distribution

Vision Encoder 1.9B; VL Adapter 0.08B; LLM 7.7B; Total 9.6B
Language model: Qwen-7B
Vision model: ViT-bigG
V-L Adapter: single-layer cross-attention
Data: image input: each image is wrapped in <img> and </img> tags, and region descriptions and detection information are added
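To make the region and detection information a bit more concrete, here is a minimal sketch of how one bounding box could be turned into text. The `<ref>`/`<box>` tags and the normalization of coordinates to the range [0, 1000) follow my reading of the Qwen-VL paper and README, not anything stated in this post; the helper name `format_region` and the integer rounding rule are assumptions made for illustration.

```python
def format_region(label, box, image_size):
    """Render one labeled box as a Qwen-VL style string (illustrative helper)."""
    x1, y1, x2, y2 = box
    width, height = image_size

    def norm(value, extent):
        # Map a pixel coordinate into [0, 1000) with integer arithmetic.
        # The clamping to 999 is an assumption, not taken from the official code.
        return min(999, max(0, value * 1000 // extent))

    top_left = f"({norm(x1, width)},{norm(y1, height)})"
    bottom_right = f"({norm(x2, width)},{norm(y2, height)})"
    return f"<ref>{label}</ref><box>{top_left},{bottom_right}</box>"


print(format_region("dog", (64, 128, 320, 480), (640, 640)))
# <ref>dog</ref><box>(100,200),(500,750)</box>
```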

Training:

1. Pre-training stage: data processing:

[Figure: data filtering and processing during pre-training]

2. Multi-task pre-training: the vision encoder input resolution is raised from 224×224 to 448×448, the language model is unfrozen, and all parameters take part in training.
[Figure: data used for multi-task pre-training]
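A quick calculation shows why the resolution change matters for sequence length. This sketch assumes a patch size of 14 (the stride of ViT-bigG as described in the paper; the post itself does not state it), and `num_patches` is a helper written just for this example.

```python
def num_patches(image_size, patch_size=14):
    """Number of ViT patches for a square image (patch_size=14 is an assumption)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size
    return grid * grid


for size in (224, 448):
    print(size, num_patches(size))
# 224 256
# 448 1024
```

Under that assumption, the 448×448 input yields 1024 patch tokens, four times as many as at 224×224, which is the sequence the adapter then has to shorten.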

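The figure shows how the multi-task training data is organized: the black parts are the input, and the blue parts are the output, which is what the loss is computed on.
[Figure: data format for multi-task pre-training]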

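3. Vision-language adapter: its main job is to compress the image feature sequence to a fixed length of 256 and, at the same time, align it with the LLM's text representation. The adapter randomly initializes 256 vectors as query vectors and uses the image features produced by the ViT as the keys and values of the attention layer. The query vectors attend over the image keys and values, and the attention output is returned.
[Figure: Qwen-VL training pipeline]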
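The compression idea can be sketched in a few lines of PyTorch. This is a simplified stand-in, not the model's actual adapter: `ToyAdapter`, the embedding size of 64, and the 4 heads are made up for the example, and the positional embeddings and LayerNorm layers used by the real module are left out.

```python
import torch
import torch.nn as nn


class ToyAdapter(nn.Module):
    """Single cross-attention layer with learnable queries (illustrative only)."""

    def __init__(self, num_queries=256, embed_dim=64, num_heads=4):
        super().__init__()
        # Randomly initialized query vectors, learned during training.
        self.query = nn.Parameter(torch.randn(num_queries, embed_dim) * 0.02)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, image_feats):
        # image_feats: (batch, seq_len, embed_dim), e.g. the ViT output
        batch = image_feats.shape[0]
        q = self.query.unsqueeze(0).expand(batch, -1, -1)
        # Queries attend over the image features used as keys and values.
        out, _ = self.attn(q, image_feats, image_feats)
        return out  # (batch, num_queries, embed_dim)


adapter = ToyAdapter()
feats = torch.randn(2, 1024, 64)  # 1024 image tokens per sample
compressed = adapter(feats)
print(compressed.shape)  # torch.Size([2, 256, 64])
```

Whatever the input length, the output length equals the number of queries, which is why the LLM always sees a fixed 256 positions per image.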

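How image information is fused
The concrete steps:

  1. Encode the image with the vision transformer;
  2. Feed the image features into the vision-language adapter layer, which further compresses and encodes them;
  3. Write the resulting image features into the span reserved for the image in the LLM's hidden states, then feed these hidden states, which now hold both image and text representations, into the LLM;
    Reference code: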
# -------------------------------------【step1】-------------------------------------------
# 1. The tokenizer converts the images and text in the query into input_ids; the image is encoded first:
self.visual = VisionTransformer(**config.visual)
...
if past_key_values is None and torch.any(input_ids == self.config.visual['image_start_id']):
    bos_pos = torch.where(input_ids == self.config.visual['image_start_id'])
    eos_pos = torch.where(input_ids == self.config.visual['image_start_id'] + 1)
    assert (bos_pos[0] == eos_pos[0]).all()
    img_pos = torch.stack((bos_pos[0], bos_pos[1], eos_pos[1]), dim=1)
    images = []
    for i, a, b in img_pos:
        image = input_ids[i][a + 1 : b - 1].tolist()
        image = image[ : image.index(self.config.visual['image_start_id'] + 2)]
        images.append(bytes(image).decode('utf-8'))
# print(images)
# ['demo.jpg']
    # self.visual.encode is the encode function of VisionTransformer
    images = self.visual.encode(images)
# -------------------------------------【step2】-------------------------------------------
# 2. encode mainly calls the forward of VisionTransformer, which runs the ViT and the adapter to obtain the image representation
def forward(self, x: torch.Tensor):
    x = x.to(
        dtype=self.transformer.get_cast_dtype(),
        device=self.transformer.get_cast_device(),
    )
    # to patches
    x = self.conv1(x)  # shape = [*, width, grid, grid]
    x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
    x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]

    x = x + get_abs_pos(self.positional_embedding, x.size(1))

    x = self.ln_pre(x)

    x = x.permute(1, 0, 2)  # NLD -> LND
    # image representation after the VisionTransformer blocks
    x = self.transformer(x)
    x = x.permute(1, 0, 2)  # LND -> NLD

    # pass through the vision-language adapter module
    x = self.attn_pool(x)
    x = self.ln_post(x)
    x = x @ self.proj

    # return the result
    return x

class Resampler(nn.Module):
    """
    A 2D perceiver-resampler network with one cross attention layers by
        (grid_size**2) learnable queries and 2d sincos pos_emb
    Outputs:
        A tensor with the shape of (grid_size**2, embed_dim)
    """
    def __init__(
            self,
            grid_size,
            embed_dim,
            num_heads,
            kv_dim=None,
            norm_layer=nn.LayerNorm
    ):
        super().__init__()
        self.num_queries = grid_size ** 2
        self.embed_dim = embed_dim
        self.num_heads = num_heads

        self.pos_embed = nn.Parameter(
            torch.from_numpy(get_2d_sincos_pos_embed(embed_dim, grid_size)).float()
        ).requires_grad_(False)

        # randomly initialized query vectors
        self.query = nn.Parameter(torch.zeros(self.num_queries, embed_dim))
        trunc_normal_(self.query, std=.02)

        if kv_dim is not None and kv_dim != embed_dim:
            self.kv_proj = nn.Linear(kv_dim, embed_dim, bias=False)
        else:
            self.kv_proj = nn.Identity()

        self.attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.ln_q = norm_layer(embed_dim)
        self.ln_kv = norm_layer(embed_dim)
        
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    def forward(self, x, attn_mask=None):
        # get the absolute positional embedding
        pos_embed = get_abs_pos(self.pos_embed, x.size(1))
        # project the dimension: the ViT features may not match the dimension of the adapter's query vectors
        x = self.kv_proj(x)
        x = self.ln_kv(x).permute(1, 0, 2)

        N = x.shape[1]  # N is the batch size
        q = self.ln_q(self.query)
        # cross-attention between q and x, the output of the vision transformer
        out = self.attn(
            self._repeat(q, N) + self.pos_embed.unsqueeze(1),  # queries: add the positional embedding to the learned query vectors
            x + pos_embed.unsqueeze(1),  # keys: add a positional embedding to x again (the ViT already added one after patchifying)
            x,  # values
            attn_mask=attn_mask)[0]
        # return the result
        return out.permute(1, 0, 2)

    def _repeat(self, query, N: int):
        return query.unsqueeze(1).repeat(1, N, 1)
# -------------------------------------【step3】-------------------------------------------
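# 3. Write the features encoded by the vision transformer and the adapter into the LLM's hidden_states, then feed them into the LLM:
# images has shape (bsz, 256, hidden_size)
# a + 1 : b has length 256: these are the 256 positions the tokenizer reserved for the image when tokenizing the input.
# Once the image features are available, they are written into the positions reserved for them in hidden_states.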
for idx, (i, a, b) in enumerate(img_pos):
    hidden_states[i][a + 1 : b] = images[idx]
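Steps 1 and 3 can be tried end to end on toy data. The sketch below mimics the excerpt: the image path is stored as UTF-8 bytes inside a 256-token span between a start and an end token, recovered from `input_ids`, and then that span of the hidden states is overwritten. The token ids (1000, 1001, 1002), the helper names, and the all-ones tensor standing in for the visual encoder output are assumptions made for illustration, not values from the real tokenizer or model.

```python
import torch

# Hypothetical special-token ids; the real ones come from the model config.
IMG_START, IMG_END, IMG_PAD = 1000, 1001, 1002
NUM_IMG_TOKENS = 256


def encode_image_span(path):
    """Store the path bytes in a fixed 256-token span, padded with IMG_PAD."""
    payload = list(path.encode("utf-8"))
    if len(payload) >= NUM_IMG_TOKENS:
        raise ValueError("path too long for the reserved span")
    padding = [IMG_PAD] * (NUM_IMG_TOKENS - len(payload))
    return [IMG_START] + payload + padding + [IMG_END]


def find_image_spans(input_ids):
    """Return rows of (batch index, start position, end position)."""
    bos = torch.where(input_ids == IMG_START)
    eos = torch.where(input_ids == IMG_END)
    assert (bos[0] == eos[0]).all()
    return torch.stack((bos[0], bos[1], eos[1]), dim=1).tolist()


def decode_path(input_ids, i, a, b):
    """Recover the path stored between the start and end tokens."""
    span = input_ids[i][a + 1 : b].tolist()
    return bytes(span[: span.index(IMG_PAD)]).decode("utf-8")


# Two text tokens, one image span, one more text token.
ids = torch.tensor([[5, 6] + encode_image_span("demo.jpg") + [7]])
spans = find_image_spans(ids)
paths = [decode_path(ids, i, a, b) for i, a, b in spans]
print(paths)  # ['demo.jpg']

hidden_size = 8
hidden_states = torch.zeros(ids.shape[0], ids.shape[1], hidden_size)
# Stand-in for the visual encoder output: (num_images, 256, hidden_size).
image_feats = torch.ones(len(paths), NUM_IMG_TOKENS, hidden_size)
for idx, (i, a, b) in enumerate(spans):
    hidden_states[i][a + 1 : b] = image_feats[idx]
```

After the loop, only the 256 reserved positions carry image features; the text positions and the two boundary tokens keep their original hidden states.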
3. Simulating the patch embedding operation with einops.

Reference: "Manipulating tensors intuitively with einops to solve the Patch Embedding problem", Zhihu (zhihu.com)
