Transfer Learning: Questions

1. Describe BERT.

  • A bidirectional, two-stage (pre-training, then fine-tuning) model that uses WordPiece tokenization.
  • Special tokens: [CLS], [SEP].
  • BERT_Base (12 layers), BERT_Large (24 layers).
  • Pre-training: Task #1: Masked LM; Task #2: Next Sentence Prediction.
  • For comparison, GPT's special tokens: start and end tokens (<s>, <e>) and a delimiter token ($).

  • Fine-tuning: Two Sentence Classification, Single Sentence Classification, Question Answering, Single Sentence Tagging.

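2. Can you implement WordPiece? (I had forgotten the steps, so the task was changed to building a vocabulary from several files, i.e. word2idx and idx2word.) Also: the BPE algorithm.

WordPiece can be seen as a variant of BPE. The difference is that WordPiece chooses each new subword by likelihood, rather than taking the next most frequent pair as BPE does.

Algorithm:

  1. Prepare a sufficiently large training corpus.
  2. Decide on the desired subword vocabulary size.
  3. Split each word into a sequence of characters.
  4. Train a language model on the data from step 3.
  5. From all possible subword units, pick the one that, once added to the language model, most increases the likelihood of the training data, and add it as a new unit.
  6. Repeat step 5 until the vocabulary size from step 2 is reached or the likelihood gain falls below a threshold.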
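
The steps above can be sketched roughly as follows. This is a minimal illustration, not the original WordPiece implementation: the function name learn_merges, the toy word counts, and the score count(ab) / (count(a) * count(b)) used as a stand-in for "likelihood gain" are all assumptions of this sketch, and the "##" continuation prefix is omitted for brevity. With wordpiece_score=False the same loop performs plain BPE (most frequent pair), which makes the difference between the two visible.

```python
from collections import Counter


def learn_merges(word_freqs, num_merges, wordpiece_score=False):
    """Learn subword merges from a {word: count} dict (illustrative sketch).

    wordpiece_score=False: merge the most frequent adjacent pair (BPE).
    wordpiece_score=True: merge the pair with the highest
    count(ab) / (count(a) * count(b)), a likelihood-style score.
    """
    # Step 3: split every word into a sequence of characters.
    splits = {word: list(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        unit_counts = Counter()
        for word, freq in word_freqs.items():
            units = splits[word]
            for unit in units:
                unit_counts[unit] += freq
            for a, b in zip(units, units[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single unit

        def score(pair):
            if wordpiece_score:
                a, b = pair
                return pair_counts[pair] / (unit_counts[a] * unit_counts[b])
            return pair_counts[pair]

        # Step 5: pick the best pair (sorted first so ties break deterministically).
        best = max(sorted(pair_counts), key=score)
        merges.append(best)

        # Apply the chosen merge to every word.
        for word, units in splits.items():
            merged, i = [], 0
            while i < len(units):
                if i < len(units) - 1 and (units[i], units[i + 1]) == best:
                    merged.append(units[i] + units[i + 1])
                    i += 2
                else:
                    merged.append(units[i])
                    i += 1
            splits[word] = merged
    return merges, splits


word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
print(learn_merges(word_freqs, 1)[0])                        # BPE choice
print(learn_merges(word_freqs, 1, wordpiece_score=True)[0])  # WordPiece-style choice
```

On this toy data BPE merges the frequent pair ("u", "g") first, while the likelihood-style score prefers ("g", "s"), whose parts rarely occur apart.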
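WordPiece tokenization with a given vocabulary:

def whitespace_tokenize(text):
    """Strips the text and splits it on whitespace."""
    text = text.strip()
    if not text:
        return []
    return text.split()


class WordpieceTokenizer(object):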
    """Runs WordPiece tokenization."""

    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.

        This uses a greedy longest-match-first algorithm to perform tokenization
        using the given vocabulary.

        For example:
          input = "unaffable"
          output = ["un", "##aff", "##able"]

        Args:
          text: A single token or whitespace separated tokens. This should have
            already been passed through `BasicTokenizer`.

        Returns:
          A list of wordpiece tokens.
        """

        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue

            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                end = len(chars)
                cur_substr = None
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end

            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens

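Byte Pair Encoding (applying learned merges, as in the GPT-2 tokenizer; the two methods below belong to the tokenizer class):

def get_pairs(word):
    """Return the set of adjacent symbol pairs in a word (a tuple of symbols)."""
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs

    def bpe(self, token):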
        if token in self.cache:
            return self.cache[token]
        word = tuple(token)
        pairs = get_pairs(word)

        if not pairs:
            return token

        while True:
            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    j = word.index(first, i)
                except ValueError:
                    new_word.extend(word[i:])
                    break
                else:
                    new_word.extend(word[i:j])
                    i = j

                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                    new_word.append(first + second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)
        word = " ".join(word)
        self.cache[token] = word
        return word

    def _tokenize(self, text):
        """ Tokenize a string. """
        bpe_tokens = []
        for token in re.findall(self.pat, text):
            token = "".join(
                self.byte_encoder[b] for b in token.encode("utf-8")
            )  # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
            bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" "))
        return bpe_tokens
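Building a character-level vocabulary (idx_to_char and char_to_idx):

# corpus_chars holds the raw text of the dataset; a sample from its beginning:
# '想要有直升機\n想要和你飛到宇宙去\n想要和你融化在一起\n融化在宇宙里\n我每天每天每'

# The dataset has a little over 60,000 characters. To make printing easier, replace newlines with spaces and use only the first 10,000 characters to train the model.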
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
corpus_chars = corpus_chars[0:10000]

idx_to_char = list(set(corpus_chars))
char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
vocab_size = len(char_to_idx)
vocab_size # 1027
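
For the word-level variant asked for in the question (word2idx and idx2word built from several files), a minimal sketch might look like this. The function names build_vocab and encode, the special tokens <pad> and <unk>, the whitespace tokenization, and the temporary demo files are all assumptions made for illustration.

```python
import os
import tempfile
from collections import Counter


def build_vocab(paths, min_freq=1, specials=("<pad>", "<unk>")):
    """Count whitespace-separated words across files and index them."""
    counter = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counter.update(line.split())
    idx2word = list(specials)
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    for word, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and word not in specials:
            idx2word.append(word)
    word2idx = {word: idx for idx, word in enumerate(idx2word)}
    return word2idx, idx2word


def encode(tokens, word2idx):
    """Map tokens to ids, falling back to <unk> for unseen words."""
    return [word2idx.get(tok, word2idx["<unk>"]) for tok in tokens]


# Usage with two small temporary files.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in [("a.txt", "the cat sat\nthe dog sat\n"), ("b.txt", "the cat ran\n")]:
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    word2idx, idx2word = build_vocab(paths)

print(idx2word)
print(encode(["the", "bird"], word2idx))
```

Sorting by frequency before indexing keeps the mapping reproducible across runs, unlike iterating over a set.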

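3. Describe BERT. Partway through my answer the interviewer interrupted and asked me to estimate roughly how many parameters one BERT layer has.

Describe BERT: see Question 1.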

# BertEmbeddings:
self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id) 
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

# BertAttention(BertSelfAttention+ BertSelfOutput):
#### BertSelfAttention
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
#### BertSelfOutput
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)

# BertIntermediate
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)

# BertOutput
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)

# BertPooler
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
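
From the modules listed above, the parameter count can be estimated by hand: one encoder layer has four H×H projections (query, key, value, attention output) plus two H×4H feed-forward matrices, so roughly 12·H² weights, or about 7M for H=768. The sketch below does the exact arithmetic. The default sizes (vocabulary of 30522, 512 positions, 2 token types, intermediate size 3072), the LayerNorm after the embeddings, and all_head_size equal to hidden_size are assumptions matching the commonly used BERT-base configuration; they are not stated in the listing above.

```python
def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameters of LayerNorm: one scale and one shift per feature."""
    return 2 * n


def bert_param_counts(vocab_size=30522, hidden=768, intermediate=3072,
                      num_layers=12, max_positions=512, type_vocab=2):
    # BertEmbeddings: word + position + token-type tables, then LayerNorm (assumed).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm(hidden)
    # BertAttention: query, key, value, then BertSelfOutput dense + LayerNorm.
    attention = 3 * linear(hidden, hidden) + linear(hidden, hidden) + layer_norm(hidden)
    # BertIntermediate + BertOutput (dense + LayerNorm).
    feed_forward = (linear(hidden, intermediate) + linear(intermediate, hidden)
                    + layer_norm(hidden))
    per_layer = attention + feed_forward
    pooler = linear(hidden, hidden)
    total = embeddings + num_layers * per_layer + pooler
    return {"embeddings": embeddings, "per_layer": per_layer,
            "pooler": pooler, "total": total}


counts = bert_param_counts()
print(counts["per_layer"])  # 7087872, i.e. about 7.1M per encoder layer
print(counts["total"])      # 109482240, i.e. about 110M for the base configuration
```

Dropout layers contribute no parameters, which is why they do not appear in the count.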

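4. What is add & norm in BERT, and what is it for?

  • add: residual connection. It mainly prevents problems such as vanishing gradients during backpropagation when the model is deep.
  • norm: Layer Normalization. It addresses large shifts in the distribution of activations inside the network, which slow down learning.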
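
As a minimal sketch of the two steps on a single token vector, written in plain Python rather than PyTorch: the sublayer output is added to its input, and the sum is normalized over the feature dimension, i.e. LayerNorm(x + Sublayer(x)). The function names and sample numbers are assumptions for illustration, and the learnable scale and shift are left at their initial values of 1 and 0.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-12):
    """Normalize one feature vector to zero mean and unit variance."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    gamma = gamma if gamma is not None else [1.0] * n  # learnable scale
    beta = beta if beta is not None else [0.0] * n     # learnable shift
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    residual = [a + b for a, b in zip(x, sublayer_out)]
    return layer_norm(residual)


x = [1.0, 2.0, 3.0]             # input to the sublayer
sublayer_out = [0.5, 0.5, 0.5]  # e.g. output of self-attention or the feed-forward block
print(add_and_norm(x, sublayer_out))
```

Because the input x is carried through the addition unchanged, gradients have a direct path around the sublayer.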

5. Do you know BERT's extensions, such as RoBERTa, SpanBERT, XLM and ALBERT? Introduce a few models other than BERT.

RoBERTa: see Question 8.

SpanBERT:

XLM:

ALBERT: see Question 13.

Transformer-XL (extra-long), slide topics:

  • Transformer
  • State Reuse for Segment-Level Recurrence
  • Incoherent Positional Encoding
  • Relative Positional Encoding
  • Segment-Level Recurrence in Inference
  • Contributions

XLNet, slide topics:

  • Permutation Language Model
  • Formulation Reparameterizing
  • Two-Stream Self-Attention
  • Contributions

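6. Which module of the BERT source code contains the masking, and how does BERT mask?

Multi-Head Attention (BertSelfAttention): the attention mask, precomputed once for all layers in BertModel's forward(), is added to the raw attention scores before the softmax, so masked positions (such as padding) receive near-zero attention probability.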

import math

import torch
from torch import nn


class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        mixed_query_layer = self.query(hidden_states)

        # If this is instantiated as a cross-attention module, the keys
        # and values come from an encoder; the attention mask needs to be
        # such that the encoder's padding tokens are not attended to.
        if encoder_hidden_states is not None:
            mixed_key_layer = self.key(encoder_hidden_states)
            mixed_value_layer = self.value(encoder_hidden_states)
            attention_mask = encoder_attention_mask
        else:
            mixed_key_layer = self.key(hidden_states)
            mixed_value_layer = self.value(hidden_states)

        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask (precomputed for all layers in BertModel forward() function)
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
        return outputs
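
The additive mask used in forward() above can be illustrated on one row of scores in plain Python. This is a sketch under assumptions: the function name masked_softmax is hypothetical, and the constant -10000.0 for masked positions is assumed here as a typical large negative value; the exact constant depends on the library version.

```python
import math


def masked_softmax(scores, attention_mask):
    """Softmax over one row of attention scores with an additive mask.

    attention_mask holds 1 for real tokens and 0 for padding.
    """
    # Turn the 0/1 mask into an additive one: 0.0 to keep, -10000.0 to hide (assumed value).
    additive = [(1.0 - m) * -10000.0 for m in attention_mask]
    masked = [s + a for s, a in zip(scores, additive)]
    # Numerically stable softmax.
    peak = max(masked)
    exps = [math.exp(v - peak) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 3.0]    # raw scores of one query against three keys
attention_mask = [1, 1, 0]  # the third key is padding
print(masked_softmax(scores, attention_mask))
```

Adding a large negative number before the softmax, rather than zeroing probabilities afterwards, keeps the remaining probabilities normalized to one.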

7. Estimate BERT's parameter count.

Estimating BERT's parameter count: see Question 3.

  • BERT_{BASE}: L=12, H=768, A=12, Total Parameters=110M.
  • BERT_{LARGE}: L=24, H=1024, A=16, Total Parameters=340M.

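8. How do RoBERTa and BERT differ in pre-training?

RoBERTa:
  • Static vs. Dynamic Masking: BERT masks each sequence once during data preprocessing, whereas RoBERTa generates a new masking pattern every time a sequence is fed to the model.
  • Model Input Format and Next Sentence Prediction: RoBERTa drops the NSP objective and packs full sentences into each input.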
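
The contrast between static and dynamic masking can be sketched as follows. This is a simplified illustration: the helper mask_tokens, the 30% rate (chosen so that a short sentence shows some masks), and the sample sentence are assumptions, and only [MASK] replacement is modeled.

```python
import random


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token by [MASK] independently with probability mask_prob."""
    return [("[MASK]" if rng.random() < mask_prob else tok) for tok in tokens]


tokens = "the quick brown fox jumps over the lazy dog".split()
num_epochs = 3

# Static masking: the pattern is drawn once in preprocessing and reused every epoch.
static_example = mask_tokens(tokens, random.Random(0), mask_prob=0.3)
static_epochs = [static_example for _ in range(num_epochs)]

# Dynamic masking: a fresh pattern is drawn each time the sequence is fed to the model.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng, mask_prob=0.3) for _ in range(num_epochs)]

for epoch in range(num_epochs):
    print("static :", " ".join(static_epochs[epoch]))
    print("dynamic:", " ".join(dynamic_epochs[epoch]))
```

With dynamic masking the model sees different prediction targets for the same sentence over the course of training.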

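9. Introduce RoBERTa. Why use wwm?

RoBERTa: see Question 8.
Whole Word Masking (wwm) is an upgraded version of BERT released by Google on May 31, 2019. It mainly changes how training samples are generated in the pre-training stage. In short, the original WordPiece tokenization splits a complete word into several subwords, and when training samples are generated these subwords are masked independently at random. With whole word masking, if some of the WordPiece subwords of a word are masked, the other subwords of the same word are masked as well.

Note that "mask" here is meant in the broad sense (replace with [MASK]; keep the original token; replace with a random token), not only the case where a token is replaced by the [MASK] label. For more details and examples, see: #4

Likewise, in the official BERT-base, Chinese released by Google, Chinese text is split at character granularity, without considering the Chinese word segmentation (CWS) of traditional NLP. We applied whole word masking to Chinese, trained on Chinese Wikipedia (both Simplified and Traditional), and used HIT's LTP as the word segmentation tool, so that all the characters that make up the same word are masked together.

The following illustrates how whole word masking generates samples. Note: to keep things easy to follow, only replacement by the [MASK] label is considered.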
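
A minimal sketch of whole word masking on WordPiece output, under the same simplification (only [MASK] replacement). The function name whole_word_mask, the sample tokens, and the selection loop are assumptions for illustration and do not reproduce the original data-generation script; the input is assumed to contain no special tokens such as [CLS] or [SEP].

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Mask whole words: all WordPiece pieces of a chosen word are masked together."""
    rng = random.Random(seed)
    # Group token indices into words; a piece starting with "##" continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    # Target number of masked tokens (at least one).
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    rng.shuffle(words)
    masked = list(tokens)
    covered = 0
    for group in words:
        if covered >= num_to_mask:
            break
        for i in group:
            masked[i] = "[MASK]"
        covered += len(group)
    return masked


tokens = ["the", "phil", "##am", "##mon", "played", "well"]
print(whole_word_mask(tokens, seed=1))
```

Whichever word is drawn, "phil", "##am" and "##mon" are either all masked or all left intact, which is the property that distinguishes wwm from per-piece masking.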

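10. Differences between BERT, GPT and ELMo (model architecture, training method).

  • BERT: bidirectional, two-stage pre-trained model: Pre-training (MLM, NSP) + Fine-Tuning (...)
  • GPT: unidirectional, two-stage pre-trained model: Pre-training (LM) + Fine-Tuning (...)
  • ELMo: pre-trained in two directions: Pre-training (two unidirectional LMs) + supervised NLP tasks (using the representations from each LSTM layer)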

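11. Why does BERT use only the Transformer Encoder and not the Decoder?

BERT's pre-training uses a Masked Language Model (auto-encoding, AE), which predicts the [MASK]ed tokens from the context on both sides. The Decoder, by contrast, is a unidirectional language model (LM). So BERT uses the Encoder rather than the Decoder.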
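
The difference in visible context can be made concrete with the self-attention masks of the two stacks. In this sketch (function names are assumptions), entry [i][j] is 1 when position i may attend to position j.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position sees every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i sees only positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


print("encoder (bidirectional):")
for row in bidirectional_mask(4):
    print(row)

print("decoder (causal):")
for row in causal_mask(4):
    print(row)
```

Predicting a [MASK]ed token from both sides requires the first pattern; the second hides everything to the right of the current position.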

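12. How do XLNet and BERT differ? Covers autoregressive vs. auto-encoding models, including XLNet's permutation language model and two-stream attention.

BERT: auto-encoding (AE). XLNet: autoregressive (AR), using permutation language modeling to bring in the bidirectional context that auto-encoding provides.

Slide topics:

  • Auto-Regressive (AR)
  • Auto-Encoding (AE)
  • Permutation Language Model
  • Two-Stream Self-Attention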
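
As a rough sketch of how a permutation language model and two-stream attention fit together: given one sampled factorization order, each position may attend only to positions that come earlier in that order. The content stream additionally sees the position's own token, while the query stream does not, since it is the stream used to predict that token. The function name permutation_masks and the example order are assumptions for illustration.

```python
def permutation_masks(order):
    """Build content-stream and query-stream attention masks.

    order lists the positions in the sampled factorization order,
    e.g. [2, 0, 1] means position 2 is predicted first, then 0, then 1.
    mask[i][j] == 1 means position i may attend to position j.
    """
    n = len(order)
    rank = {pos: t for t, pos in enumerate(order)}
    # Content stream: earlier positions in the order, plus the token itself.
    content = [[1 if rank[j] <= rank[i] else 0 for j in range(n)] for i in range(n)]
    # Query stream: strictly earlier positions only (the token itself is hidden).
    query = [[1 if rank[j] < rank[i] else 0 for j in range(n)] for i in range(n)]
    return content, query


content, query = permutation_masks([2, 0, 1])
print("content stream:", content)
print("query stream:  ", query)
```

The token order of the input is never changed; only the masks encode the sampled factorization order.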

13. Do you know ALBERT? Factorized embedding parameterization + cross-layer parameter sharing + SOP (sentence-order prediction) + engineering details.

ALBERT: A Lite BERT. Slide topics:

  • GLUE Results
  • Concluding Remarks
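
The saving from the embedding factorization named in the question can be checked with simple arithmetic: a V×H embedding table is replaced by a V×E table followed by an E×H projection. The sizes below (V=30000, H=768, E=128) are assumed as a typical base-size configuration, and the function name is hypothetical.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Word-embedding parameters, with or without factorization."""
    if embedding_size is None:
        # BERT-style: one V x H table.
        return vocab_size * hidden_size
    # ALBERT-style: V x E table plus E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30000, 768, 128
full = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(full)        # 23040000
print(factorized)  # 3938304
```

Cross-layer parameter sharing reduces the count further, because the encoder layer's weights are stored once rather than once per layer.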