Reimplementing the Basic Functionality of CountVectorizer

sklearn.feature_extraction.text provides four classes for text feature extraction:

CountVectorizer
TfidfVectorizer
TfidfTransformer
HashingVectorizer
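
As a rough illustration of how these four classes relate, the sketch below checks that TfidfVectorizer gives the same result as CountVectorizer followed by TfidfTransformer, and that HashingVectorizer produces a fixed-width matrix without storing a vocabulary. The toy corpus `docs` and the value `n_features=16` are arbitrary assumptions for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = ["dog cat fish", "dog cat cat", "fish bird", "bird"]  # toy corpus (assumption)

# Two-step route: raw counts first, then TF-IDF weighting of those counts
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both steps internally
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

same_tfidf = np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())
print(same_tfidf)

# HashingVectorizer maps tokens to columns with a hash function,
# so the number of columns is fixed up front and no vocabulary is kept
hashed = HashingVectorizer(n_features=16).fit_transform(docs)
print(hashed.shape)
```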

CountVectorizer converts the words in a collection of texts into a term-frequency matrix. Its fit_transform method counts how many times each word occurs in each document.

Parameters
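
A minimal sketch of three commonly used constructor parameters (`ngram_range`, `stop_words`, `binary`). The toy documents are assumptions made up for illustration, and this is not a complete list of parameters.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog chased the cat", "the cat cat slept"]  # toy corpus (assumption)

# ngram_range=(1, 2): extract both unigrams and bigrams
bigram_cv = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(sorted(bigram_cv.vocabulary_))

# stop_words="english": drop words found in the built-in English stop-word list
stop_cv = CountVectorizer(stop_words="english").fit(docs)
stop_vocab = sorted(stop_cv.vocabulary_)
print(stop_vocab)

# binary=True: record presence (0 or 1) instead of the number of occurrences
binary_counts = CountVectorizer(binary=True).fit_transform(docs).toarray()
print(binary_counts)
```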

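Attributes

Attribute	Description
vocabulary_	The vocabulary; a dict mapping each term to its column index
get_feature_names_out()	The vocabulary terms of all documents, in column order (a method, not an attribute; it replaces get_feature_names(), which was removed in scikit-learn 1.2)
stop_words_	The terms that were ignored during fitting (because of max_df, min_df, or max_features)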

Methods

Method	Description
fit_transform(X)	Fit the model and return the document-term matrix
fit(raw_documents[, y])	Learn the vocabulary dictionary from the document collection
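
To make the split between fit and transform concrete, here is a small sketch with made-up training and test documents. The vocabulary is learned once by fit, and transform then counts only those learned terms, so a word that fit never saw ("bird" here) is silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["dog cat fish", "dog cat cat"]  # toy training corpus (assumption)
new_docs = ["cat bird cat"]                   # "bird" is not in the training corpus

cv = CountVectorizer()
cv.fit(train_docs)                             # learn the vocabulary only
new_counts = cv.transform(new_docs).toarray()  # count learned terms in new text

print(sorted(cv.vocabulary_))  # terms learned from train_docs
print(new_counts)              # one row per new document, one column per learned term
```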

Getting Started Example

from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]  # each list element, e.g. "dog cat fish", is a string representing one document
cv = CountVectorizer()  # create the bag-of-words vectorizer
cv_fit = cv.fit_transform(texts)
# The line above is equivalent to the two lines below
# cv.fit(texts)
# cv_fit = cv.transform(texts)

print(cv.get_feature_names_out())  # ['bird' 'cat' 'dog' 'fish'] the vocabulary built from the documents (get_feature_names() was removed in scikit-learn 1.2)

print(cv.vocabulary_)              # {'dog': 2, 'cat': 1, 'fish': 3, 'bird': 0} dict form; key: term, value: index of that term (feature), which is also its column in the count matrix
# https://blog.csdn.net/weixin_38278334/article/details/82320307

print(cv_fit)
# (0, 3)  1   document 0, vocabulary index 3 ('fish'), count 1
# (0, 1)  1
# (0, 2)  1
# (1, 1)  2
# (1, 2)  1
# (2, 0)  1
# (2, 3)  1
# (3, 0)  1

print(cv_fit.toarray())  # .toarray() converts the sparse matrix into a dense NumPy array
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]

print(cv_fit.toarray().sum(axis=0))  # total count of each term across all documents
#[2 3 2 2]

Reimplementation

The features include:

Text preprocessing such as stop-word removal
fit
transform
n-gram support
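
The n-gram step can be sketched on its own as a sliding window over a token list. The helper name `make_ngrams` is hypothetical and used only for illustration. Like the class below, it produces n-grams of exactly length n, not a range of lengths.

```python
def make_ngrams(tokens, n=1):
    """Return all n-grams of exactly length n, each joined with a space."""
    # range() is empty when the token list is shorter than n
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["dog", "cat", "fish"]
unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
too_long = make_ngrams(tokens, 4)

print(unigrams)
print(bigrams)
print(too_long)
```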

import numpy as np

with open('data.txt', 'r', encoding='utf-8') as f:
    data = [i.strip() for i in f.readlines()]

class MyCountVectorizer(object):
    def __init__(self, n=1, remove_stop_words=False):
        self.n = n
        self.remove_stop_words = remove_stop_words
        # Per-instance state; class-level mutable attributes would be shared by all instances
        self.vocabulary = {}
        self.corpus = []
        
    def clean(self, corpus):
        if self.remove_stop_words:
            # Load stopword list
            with open('stopwords.txt', 'r', encoding='utf-8') as f:
                stop_words = {w.strip() for w in f}
        for text in corpus:
            # Lower case
            text = text.lower()
            # Remove special punctuation
            for c in """!"'#$%&()*+,-./:;<=>?@[\\]^_`{|}~“”‘’""":
                text = text.replace(c, ' ')
            if self.remove_stop_words:
                word_ls = [word for word in text.split(' ') if word and word.isalnum() and len(word)>1 and (word not in stop_words)]
            else:
                word_ls = [word for word in text.split(' ') if word and word.isalnum() and len(word)>1]
            # Build this document's n-grams; self.corpus holds one n-gram list per document
            n_gram_word_ls = []
            for idx in range(len(word_ls)):
                if idx + self.n > len(word_ls):
                    break
                n_gram_word = ' '.join(word_ls[idx: idx + self.n])
                n_gram_word_ls.append(n_gram_word)
            self.corpus.append(n_gram_word_ls)    
    
    def fit(self, corpus):
        # Create a dictionary of terms which map to columns of the term-frequency matrix.
        self.clean(corpus)
        for row in self.corpus:
            for word in row:
                if word not in self.vocabulary:
                    self.vocabulary[word] = len(self.vocabulary)
        return
    
    def transform(self):
        # Create a term-frequency matrix of appropriate size (document size * vocabulary size)
        tf_matrix = []
        size = len(self.vocabulary)
        for doc in self.corpus:
            # Count how often the word appears in the document
            word_count = {}
            for word in doc:
                word_count[word] = word_count.get(word, 0) + 1
            # Construct the term-frequency vector of the row
            row = [0 for i in range(size)]
            for word, value in word_count.items():
                row[self.vocabulary[word]] = value
            tf_matrix.append(row)
        tf_matrix = np.array(tf_matrix)
        return tf_matrix
    
    def get_vocab(self):
        # Returns the dictionary of terms
        return self.vocabulary
    
cv = MyCountVectorizer(1, True)
cv.fit(data)
print(cv.get_vocab())
term_frequency_matrix = cv.transform()
print(term_frequency_matrix.shape)
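
The counting logic can be cross-checked without any input files by running a compact version on the four-document example from the getting started section. This is only a sketch: the helper name `count_matrix` is hypothetical, tokens are split on whitespace with no cleaning, and the vocabulary is sorted alphabetically (as sklearn does), whereas MyCountVectorizer assigns indices in first-seen order.

```python
from collections import Counter

import numpy as np


def count_matrix(docs):
    """Build a sorted vocabulary and a dense document-term count matrix."""
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({token for tokens in tokenized for token in tokens})
    vocab = {term: idx for idx, term in enumerate(terms)}
    matrix = np.zeros((len(docs), len(vocab)), dtype=int)
    for row, tokens in enumerate(tokenized):
        # Counter gives the number of occurrences of each term in this document
        for term, count in Counter(tokens).items():
            matrix[row, vocab[term]] = count
    return vocab, matrix


texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
vocab, matrix = count_matrix(texts)
print(vocab)
print(matrix)
print(matrix.sum(axis=0))  # total count of each term across all documents
```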

References:
sklearn: CountVectorizer Explained in Detail
