Reimplementing the Basic Functionality of CountVectorizer

sklearn.feature_extraction.text provides four text feature extraction classes:

  • CountVectorizer
  • TfidfVectorizer
  • TfidfTransformer
  • HashingVectorizer
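
The classes in the list above are related: with default settings, TfidfVectorizer behaves like CountVectorizer followed by TfidfTransformer. A minimal sketch of that relationship, assuming scikit-learn and NumPy are installed; the toy `texts` list and the variable names are chosen here for illustration only:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# Two-step route: raw counts first, then TF-IDF weighting of those counts
counts = CountVectorizer().fit_transform(texts)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both internally
one_step = TfidfVectorizer().fit_transform(texts)

# Both routes should give the same weighted matrix
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

HashingVectorizer differs from the other three in that it keeps no vocabulary at all, so it is not part of this comparison.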

CountVectorizer converts the words in a collection of texts into a term-frequency matrix: its fit_transform function counts how many times each word occurs in each document.

Parameters

Attributes

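Attribute Description
vocabulary_ The vocabulary; a dict mapping each term to its column index
get_feature_names() The vocabulary terms of all texts, as a list (a method rather than an attribute; replaced by get_feature_names_out() in scikit-learn 1.0 and later)
stop_words_ The terms that were ignored, for example those filtered out by max_df, min_df or max_features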

Methods

Method Description
fit_transform(X) Fit the model and return the document-term matrix
fit(raw_documents[, y]) Learn the vocabulary dictionary from the document collection
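
Because fit only learns the vocabulary, a fitted vectorizer can later be applied to other documents with transform; words that were not seen during fit are simply ignored. A minimal sketch of this behaviour, assuming scikit-learn is installed (the texts and variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
new_texts = ["dog dog horse"]  # "horse" never appears in train_texts

cv = CountVectorizer()
cv.fit(train_texts)                   # learns the columns: bird, cat, dog, fish
new_counts = cv.transform(new_texts)  # counts only the known words

print(new_counts.toarray())  # [[0 0 2 0]]: two "dog", and "horse" is dropped
```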

Introductory example

from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird", 'bird'] # each list element, e.g. "dog cat fish", is a string representing one document
cv = CountVectorizer() # create the bag-of-words model
cv_fit = cv.fit_transform(texts)
# The line above is equivalent to the following two lines
# cv.fit(texts)
# cv_fit=cv.transform(texts)

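print(cv.get_feature_names_out())    # ['bird' 'cat' 'dog' 'fish'] the vocabulary built from the documents; on scikit-learn versions before 1.0 use cv.get_feature_names(), which returns a list

print(cv.vocabulary_)            # {'dog': 2, 'cat': 1, 'fish': 3, 'bird': 0} a dict; key: term, value: index of that term (feature), which is also its column number in the term-frequency matrix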
# See https://blog.csdn.net/weixin_38278334/article/details/82320307

print(cv_fit)
#(0,3)1   document 0 (the first list element), the term at vocabulary index 3, its count
#(0,1)1
#(0,2)1
#(1,1)2
#(1,2)1
#(2,0)1
#(2,3)1
#(3,0)1

print(cv_fit.toarray()) # .toarray() converts the sparse matrix into a dense NumPy array
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]

print(cv_fit.toarray().sum(axis=0))  # total count of each term across all documents
#[2 3 2 2]

Reimplementation

Features include:

  • Text preprocessing such as stop-word removal
  • fit
  • transform
  • n-gram support
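
The n-gram support in the list above amounts to sliding a window of n tokens over each document and joining every window into one term. A minimal standalone sketch of that step; `make_ngrams` is a hypothetical helper written for illustration and is not part of the class below:

```python
def make_ngrams(tokens, n):
    """Return the space-joined n-grams found by sliding a window of size n."""
    # range() is empty when the document has fewer than n tokens
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "dog cat fish".split()
unigrams = make_ngrams(tokens, 1)  # ['dog', 'cat', 'fish']
bigrams = make_ngrams(tokens, 2)   # ['dog cat', 'cat fish']
too_long = make_ngrams(tokens, 4)  # [] because there are only 3 tokens

print(unigrams, bigrams, too_long)
```
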
import numpy as np

with open('data.txt', 'r', encoding='utf-8') as f:
    data = [i.strip() for i in f.readlines()]

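class MyCountVectorizer(object):
    def __init__(self, n=1, remove_stop_words=False):
        self.n = n  # n-gram size
        self.remove_stop_words = remove_stop_words
        # Keep state per instance; class-level dict/list attributes would be shared by every instance
        self.vocabulary = {}
        self.corpus = []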
        
    def clean(self, corpus):
        # Load the stop-word list only when stop-word removal is enabled
        stop_words = set()
        if self.remove_stop_words:
            with open('stopwords.txt', encoding='utf-8') as f:
                stop_words = {w.strip() for w in f}
        # Start from an empty corpus so that repeated calls do not accumulate documents
        self.corpus = []
        for text in corpus:
            # Lower case
            text = text.lower()
            # Replace punctuation with spaces
            for c in """!"'#$%&()*+,-./:;<=>?@[\\]^_`{|}~“”‘’""":
                text = text.replace(c, ' ')
            # Keep alphanumeric tokens longer than one character that are not stop words
            word_ls = [word for word in text.split(' ')
                       if word and word.isalnum() and len(word) > 1 and word not in stop_words]
            # Build n-grams by sliding a window of size n over the tokens
            n_gram_word_ls = [' '.join(word_ls[idx: idx + self.n])
                              for idx in range(len(word_ls) - self.n + 1)]
            # self.corpus holds one list of n-gram terms per document
            self.corpus.append(n_gram_word_ls)
    
    def fit(self, corpus):
        # Create a dictionary of terms which map to columns of the term-frequency matrix.
        self.clean(corpus)
        for row in self.corpus:
            for word in row:
                if word not in self.vocabulary:
                    self.vocabulary[word] = len(self.vocabulary)
        return
    
    def transform(self):
        # Create a term-frequency matrix of appropriate size (document size * vocabulary size)
        tf_matrix = []
        size = len(self.vocabulary)
        for doc in self.corpus:
            # Count how often the word appears in the document
            word_count = {}
            for word in doc:
                word_count[word] = word_count.get(word, 0) + 1
            # Construct the term-frequency vector of the row
            row = [0 for i in range(size)]
            for word, value in word_count.items():
                row[self.vocabulary[word]] = value
            tf_matrix.append(row)
        tf_matrix = np.array(tf_matrix)
        return tf_matrix
    
    def get_vocab(self):
        # Returns the dictionary of terms
        return self.vocabulary
    
cv = MyCountVectorizer(1, True)
cv.fit(data)
print(cv.get_vocab())
term_frequency_matrix = cv.transform()
print(term_frequency_matrix.shape)
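
Note that transform() above takes no arguments, so it can only vectorize the corpus that was passed to fit(). A minimal sketch of how unseen documents could be counted against an already fitted vocabulary, assuming the documents have been cleaned and tokenized beforehand; `transform_docs` and the toy vocabulary are hypothetical and not part of the class above:

```python
import numpy as np

def transform_docs(tokenized_docs, vocabulary):
    """Count vocabulary terms in tokenized documents; unknown terms are skipped."""
    matrix = np.zeros((len(tokenized_docs), len(vocabulary)), dtype=int)
    for row, doc in enumerate(tokenized_docs):
        for term in doc:
            col = vocabulary.get(term)
            if col is not None:  # ignore terms that were not seen during fit
                matrix[row, col] += 1
    return matrix

# Toy vocabulary in first-seen order, the order in which fit() above assigns indices
vocabulary = {'dog': 0, 'cat': 1, 'fish': 2, 'bird': 3}
new_docs = [['dog', 'dog', 'horse'], ['bird', 'cat']]
new_matrix = transform_docs(new_docs, vocabulary)
print(new_matrix)
# [[2 0 0 0]
#  [0 1 0 1]]
```

Unlike sklearn's CountVectorizer, which sorts the vocabulary alphabetically, this sketch keeps whatever column order the vocabulary dict defines.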

References:
sklearn: CountVectorizer explained in detail
