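1. Describe BERT.

A bidirectional, two-stage (pre-training, then fine-tuning) pre-trained model that tokenizes with WordPiece.

Special tokens: [CLS], [SEP].
BERT_Base (12 layers), BERT_Large (24 layers).
Pre-training: Task #1: Masked LM; Task #2: Next Sentence Prediction.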

For comparison, OpenAI GPT's special tokens: start and end tokens (<s>, <e>), delimiter token ($).
Fine-tuning: Sentence Pair Classification, Single Sentence Classification, Question Answering, Single Sentence Tagging.

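2. Can you implement WordPiece? I had forgotten the steps, so the task was changed to building a vocabulary from several files, i.e. word2idx and idx2word. The BPE algorithm.

WordPiece can be viewed as a variant of BPE. The difference is that WordPiece creates each new subword based on probability (likelihood) rather than taking the next most frequent byte pair.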

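Algorithm:

(1) Prepare a sufficiently large training corpus.
(2) Decide on the desired subword vocabulary size.
(3) Split words into character sequences.
(4) Train a language model on the data from step 3.
(5) From all possible subword units, choose the one that most increases the likelihood of the training data when added to the language model, and add it as a new unit.
(6) Repeat step 5 until the vocabulary size set in step 2 is reached or the likelihood gain falls below some threshold.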
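
The difference in how the next merge is chosen can be sketched on a toy corpus. This is only an illustration: the score count(ab) / (count(a) * count(b)) is a commonly described stand-in for the likelihood gain in step 5, not something the notes above specify, and the corpus and function names are made up for the example.

```python
from collections import Counter

def count_symbols_and_pairs(words):
    """Count single symbols and adjacent symbol pairs over a {symbols: freq} corpus."""
    symbols, pairs = Counter(), Counter()
    for word, freq in words.items():
        for s in word:
            symbols[s] += freq
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return symbols, pairs

def best_pair_bpe(words):
    """BPE: merge the most frequent adjacent pair."""
    _, pairs = count_symbols_and_pairs(words)
    return max(pairs, key=pairs.get)

def best_pair_wordpiece(words):
    """WordPiece-style (assumed score): highest count(ab) / (count(a) * count(b))."""
    symbols, pairs = count_symbols_and_pairs(words)
    return max(pairs, key=lambda p: pairs[p] / (symbols[p[0]] * symbols[p[1]]))

# Hypothetical corpus: each word pre-split into characters, mapped to its frequency.
corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

print(best_pair_bpe(corpus))        # most frequent pair
print(best_pair_wordpiece(corpus))  # pair whose parts rarely occur apart
```

On this corpus BPE picks ("u", "g") because it is the most frequent pair, while the likelihood-style score prefers ("g", "s") because "s" never appears without a preceding "g".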

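def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    return text.split()


class WordpieceTokenizer(object):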
    """Runs WordPiece tokenization."""

    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.

        This uses a greedy longest-match-first algorithm to perform tokenization
        using the given vocabulary.

        For example:
          input = "unaffable"
          output = ["un", "##aff", "##able"]

        Args:
          text: A single token or whitespace separated tokens. This should have
            already been passed through `BasicTokenizer`.

        Returns:
          A list of wordpiece tokens.
        """

        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue

            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                end = len(chars)
                cur_substr = None
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end

            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens

Byte Pair Encoding:

    def bpe(self, token):
        if token in self.cache:
            return self.cache[token]
        word = tuple(token)
        pairs = get_pairs(word)

        if not pairs:
            return token

        while True:
            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    j = word.index(first, i)
                except ValueError:
                    new_word.extend(word[i:])
                    break
                else:
                    new_word.extend(word[i:j])
                    i = j

                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                    new_word.append(first + second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)
        word = " ".join(word)
        self.cache[token] = word
        return word

    def _tokenize(self, text):
        """ Tokenize a string. """
        bpe_tokens = []
        for token in re.findall(self.pat, text):
            token = "".join(
                self.byte_encoder[b] for b in token.encode("utf-8")
            )  # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
            bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" "))
        return bpe_tokens

# '想要有直升機\n想要和你飛到宇宙去\n想要和你融化在一起\n融化在宇宙里\n我每天每天每'

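# This dataset has more than 60,000 characters. To make printing easier, we replace newlines with spaces and then train the model on only the first 10,000 characters.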
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
corpus_chars = corpus_chars[0:10000]

idx_to_char = list(set(corpus_chars))
char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
vocab_size = len(char_to_idx)
vocab_size # 1027
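
The fallback task from the question, building word2idx and idx2word from several files, could look roughly like the sketch below. It assumes whitespace-separated tokens; the file names, file contents, special tokens and the `build_vocab` helper are all hypothetical.

```python
import os
import tempfile
from collections import Counter

def build_vocab(paths, min_freq=1, specials=("<pad>", "<unk>")):
    """Build word2idx / idx2word from whitespace-tokenized text files."""
    counter = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counter.update(line.split())
    # Special tokens first, then words by descending frequency (ties alphabetical).
    words = sorted((w for w, c in counter.items() if c >= min_freq),
                   key=lambda w: (-counter[w], w))
    idx2word = list(specials) + words
    word2idx = {w: i for i, w in enumerate(idx2word)}
    return word2idx, idx2word

# Usage with two small temporary files (made-up contents).
tmp_dir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "the cat sat\nthe dog sat\n"), ("b.txt", "the cat ran\n")]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(path)

word2idx, idx2word = build_vocab(paths)
print(idx2word)
print(word2idx.get("bird", word2idx["<unk>"]))  # unseen words map to <unk>
```

Sorting by frequency keeps the mapping deterministic across runs, which a plain `set` does not guarantee.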

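3. Describe BERT. Partway through, the interviewer interrupted and asked me to estimate roughly how many parameters one BERT layer has.

Describe BERT: see question 1.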

# BertEmbeddings:
self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

# BertAttention (BertSelfAttention + BertSelfOutput):
#### BertSelfAttention
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
#### BertSelfOutput
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)

# BertIntermediate
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)

# BertOutput
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)

# BertPooler
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
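
From the module list above, the size of one encoder layer can be tallied directly. The sketch assumes BERT-Base dimensions (hidden_size=768, intermediate_size=3072), that all_head_size equals hidden_size, and that each LayerNorm holds a scale and a shift vector; the helper names are illustrative.

```python
def linear_params(n_in, n_out):
    """Weight matrix plus bias of an nn.Linear-style layer."""
    return n_in * n_out + n_out

def layer_norm_params(hidden):
    """Scale and shift vectors."""
    return 2 * hidden

def bert_layer_params(hidden, intermediate):
    """Parameter count of one encoder layer, grouped by the modules listed above."""
    return {
        "BertSelfAttention": 3 * linear_params(hidden, hidden),  # query, key, value
        "BertSelfOutput": linear_params(hidden, hidden) + layer_norm_params(hidden),
        "BertIntermediate": linear_params(hidden, intermediate),
        "BertOutput": linear_params(intermediate, hidden) + layer_norm_params(hidden),
    }

breakdown = bert_layer_params(hidden=768, intermediate=3072)
per_layer = sum(breakdown.values())
for name, count in breakdown.items():
    print(f"{name}: {count:,}")
print(f"one layer: {per_layer:,}")  # 7,087,872
```

As a quick mental estimate, the attention block contributes about 4H² weights and the feed-forward block about 8H², so one layer is roughly 12H², or about 7.1M for H=768.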

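4. What is add & norm in BERT, and what does it do?

add: residual connection. Its main purpose is to avoid problems such as vanishing gradients during backpropagation when the model is deep.
norm: Layer Normalization. It addresses large shifts in the distribution of data within the network, which slow down learning.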
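
A minimal sketch of the two steps on a single vector, in plain Python. It omits the learned scale and shift of a real LayerNorm, and `toy_sublayer` is a made-up stand-in for the attention or feed-forward sublayer.

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization: LayerNorm(x + sublayer(x))."""
    residual = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual)

def toy_sublayer(x):
    """Hypothetical sublayer; in BERT this would be attention or the feed-forward net."""
    return [0.5 * v for v in x]

out = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print(out)
```

Because the input is added back before normalization, the gradient always has a direct path around the sublayer, and the normalization keeps each token's activations on a comparable scale from layer to layer.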

5. Do you know BERT's extensions, such as RoBERTa, SpanBERT, XLM, and ALBERT? Introduce a few models other than BERT.

RoBERTa:

SpanBERT:

XLM:

ALBERT:

Transformer-XL (extra-long):

Transformer

Transformer-XL (extra-long)

State Reuse for Segment-Level Recurrence

Incoherent Positional Encoding

Relative Positional Encoding

Segment-Level Recurrence in Inference

Contributions

XLNet:

Permutation Language Model

Formulation Reparameterizing

Two-Stream Self-Attention

Contributions

6. Which module in the BERT source handles masking, and how does BERT mask?

Multi-Head Attention (BertSelfAttention):

class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        mixed_query_layer = self.query(hidden_states)

        # If this is instantiated as a cross-attention module, the keys
        # and values come from an encoder; the attention mask needs to be
        # such that the encoder's padding tokens are not attended to.
        if encoder_hidden_states is not None:
            mixed_key_layer = self.key(encoder_hidden_states)
            mixed_value_layer = self.value(encoder_hidden_states)
            attention_mask = encoder_attention_mask
        else:
            mixed_key_layer = self.key(hidden_states)
            mixed_value_layer = self.value(hidden_states)

        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask (precomputed for all layers in BertModel forward() function)
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
        return outputs
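
The line `attention_scores + attention_mask` only works because the mask has already been turned into an additive form before it reaches this module. The sketch below shows the idea for a single query in plain Python. The constant -10000.0 is an assumption about how the 0/1 padding mask is converted, since that conversion is not shown in the excerpt above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def to_additive_mask(attention_mask, neg=-10000.0):
    """Map 1 (real token) to 0.0 and 0 (padding) to a large negative number."""
    return [(1.0 - m) * neg for m in attention_mask]

raw_scores = [2.0, 1.0, 0.5, 3.0]   # one query against four keys
attention_mask = [1, 1, 1, 0]       # the last position is padding

additive = to_additive_mask(attention_mask)
masked_scores = [s + a for s, a in zip(raw_scores, additive)]
probs = softmax(masked_scores)
print(probs)  # the padded position gets (numerically) zero probability
```

Adding a large negative number before the softmax drives the padded position's weight to zero without any branching, which is why the module can treat the mask as just another tensor to add.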

7. Estimate BERT's parameter count.

Estimating BERT's parameter count: see question 3.

$BERT_{BASE}$: L=12, H=768, A=12, Total Parameters=110M.
$BERT_{LARGE}$: L=24, H=1024, A=16, Total Parameters=340M.
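
These totals can be approximated from L and H alone. The sketch below makes several assumptions that are not stated above: a vocabulary of 30,522 WordPiece tokens, 512 positions, 2 token types, an intermediate size of 4H, a LayerNorm after the embeddings, and a pooler on top. Note that A does not appear, because splitting H across heads does not change the number of weights.

```python
def linear(n_in, n_out):
    """Weight matrix plus bias."""
    return n_in * n_out + n_out

def bert_params(num_layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Rough total parameter count of a BERT encoder under the stated assumptions."""
    intermediate = 4 * hidden
    # Word, position and token-type embeddings, plus the embedding LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    layer = (4 * linear(hidden, hidden)      # query, key, value, attention output
             + linear(hidden, intermediate)  # feed-forward up-projection
             + linear(intermediate, hidden)  # feed-forward down-projection
             + 2 * 2 * hidden)               # two LayerNorms
    pooler = linear(hidden, hidden)
    return embeddings + num_layers * layer + pooler

base = bert_params(num_layers=12, hidden=768)
large = bert_params(num_layers=24, hidden=1024)
print(f"BERT-Base:  {base:,}")   # 109,482,240, i.e. about 110M
print(f"BERT-Large: {large:,}")  # 335,141,888, close to the quoted 340M
```

In an interview the short form is usually enough: embeddings are about V·H, each layer is about 12H², so the total is roughly V·H + 12·L·H².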

8. How do RoBERTa and BERT differ in pre-training?

RoBERTa

Static vs. Dynamic Masking
Model Input Format and Next Sentence Prediction
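
The first heading can be illustrated with a small sketch. With static masking, the mask pattern for a sequence is drawn once during preprocessing and reused; with dynamic masking, a new pattern is drawn each time the sequence is fed to the model. The 15% rate, the sentence and the helper name are assumptions for the example, and only the choice of positions is shown.

```python
import random

def mask_positions(num_tokens, rng, mask_prob=0.15):
    """Pick the positions to mask for one pass over a sequence."""
    k = max(1, round(num_tokens * mask_prob))
    return sorted(rng.sample(range(num_tokens), k))

tokens = "the quick brown fox jumps over the lazy dog again and again".split()

# Static masking: one pattern drawn during preprocessing, reused every epoch.
static_pattern = mask_positions(len(tokens), random.Random(0))
static_epochs = [static_pattern for _ in range(3)]

# Dynamic masking: a fresh pattern each time the sequence is seen.
rng = random.Random(0)
dynamic_epochs = [mask_positions(len(tokens), rng) for _ in range(3)]

print("static: ", static_epochs)
print("dynamic:", dynamic_epochs)
```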

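9. Introduce RoBERTa. Why use wwm?

Same as above:
Whole Word Masking (wwm; rendered in Chinese as 全詞Mask or 整詞Mask) is an upgraded version of BERT released by Google on May 31, 2019. It mainly changes how training samples are generated in the pre-training stage. In short, the original WordPiece tokenization splits a complete word into several subwords, and when training samples are generated these subwords are masked independently at random. With whole word masking, if some of a word's WordPiece subwords are masked, the remaining subwords of the same word are masked as well.

Note that "mask" here is meant in the broad sense (replace with [MASK]; keep the original token; replace with a random token), not only the case where a token is replaced with the [MASK] label. For a more detailed explanation and examples, see: #4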
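
The grouping step behind whole word masking can be sketched as follows, assuming English WordPiece output where continuation pieces start with "##". For brevity the sketch only replaces pieces with [MASK] and ignores the keep and random-replace cases; the function names are illustrative.

```python
import random

def group_wordpieces(tokens):
    """Group token indices so each '##' continuation joins the word before it."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

def whole_word_mask(tokens, num_words, rng):
    """Replace every piece of `num_words` randomly chosen words with [MASK]."""
    chosen = rng.sample(group_wordpieces(tokens), num_words)
    masked = list(tokens)
    for group in chosen:
        for i in group:
            masked[i] = "[MASK]"
    return masked

tokens = ["the", "un", "##aff", "##able", "man", "left"]
print(group_wordpieces(tokens))
print(whole_word_mask(tokens, num_words=1, rng=random.Random(0)))
```

Sampling over word groups instead of individual pieces is the whole change: a word is either fully masked or left intact.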

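Similarly, because the BERT-base, Chinese model officially released by Google splits Chinese at character granularity, it does not take traditional Chinese word segmentation (CWS) into account. We applied whole word masking to Chinese, trained on Chinese Wikipedia (both Simplified and Traditional), and used HIT's LTP as the word segmenter, so that all characters forming the same word are masked together.

The text below shows an example of whole word masking. Note: for ease of understanding, the example only considers replacement with the [MASK] label.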