sklearn.feature_extraction.text provides four text feature extraction classes:
- CountVectorizer
- TfidfVectorizer
- TfidfTransformer
- HashingVectorizer
CountVectorizer converts the words in a collection of texts into a term-frequency matrix: its fit_transform function counts how many times each word occurs in each document.
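To make "counting words" concrete: with default settings, CountVectorizer lowercases the text and extracts tokens with the regular expression `(?u)\b\w\w+\b`, so punctuation acts as a separator and single-character tokens are dropped. The sketch below imitates only that tokenization step with the standard library. `simple_tokenize` is a hypothetical helper name used for illustration, not part of scikit-learn.

```python
import re

# Default token pattern documented for CountVectorizer:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def simple_tokenize(text):
    """Hypothetical helper: lowercase the text, then keep matching tokens."""
    return TOKEN_PATTERN.findall(text.lower())


print(simple_tokenize("A dog, a CAT & I saw fish."))
# ['dog', 'cat', 'saw', 'fish']
```

The single-character tokens "a" and "i" disappear, which is also why the reimplementation later in this post filters on `len(word) > 1`.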
Parameters

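Attributes
| Attribute | Description |
|---|---|
| vocabulary_ | The vocabulary: a dict mapping each term to its column index |
| get_feature_names() | All terms in the vocabulary, as a list. This is a method, not an attribute; it was renamed get_feature_names_out() (which returns an array) in scikit-learn 1.0, and the old name was removed in 1.2 |
| stop_words_ | Terms that were ignored during fitting (because of max_df, min_df, or max_features) |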
Methods
| Method | Description |
|---|---|
| fit_transform(X) | Fits the model and returns the document-term matrix |
| fit(raw_documents[, y]) | Learns the vocabulary dictionary from the documents |
Getting-started example
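from sklearn.feature_extraction.text import CountVectorizer
texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]  # each list element, e.g. "dog cat fish", is a string representing one document
cv = CountVectorizer()  # create the bag-of-words vectorizer
cv_fit = cv.fit_transform(texts)
# The line above is equivalent to the following two lines
# cv.fit(texts)
# cv_fit = cv.transform(texts)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print(cv.get_feature_names_out().tolist())  # ['bird', 'cat', 'dog', 'fish'] the vocabulary built from the documents, as a list
print(cv.vocabulary_)  # {'dog': 2, 'cat': 1, 'fish': 3, 'bird': 0} a dict; key: term, value: index of that term (feature), which is also its column number in the count matrix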
[https://blog.csdn.net/weixin_38278334/article/details/82320307](https://blog.csdn.net/weixin_38278334/article/details/82320307)
print(cv_fit)
#(0,3)1 document 0 (first list element), term at vocabulary index 3 ('fish'), count 1
#(0,1)1
#(0,2)1
#(1,1)2
#(1,2)1
#(2,0)1
#(2,3)1
#(3,0)1
print(cv_fit.toarray())  # .toarray() converts the sparse result into a dense NumPy array
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]
print(cv_fit.toarray().sum(axis=0))  # total count of each term across all documents
#[2 3 2 2]
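The numbers above can be cross-checked without scikit-learn. The following is a minimal standard-library sketch, assuming whitespace tokenization is enough for this toy corpus; the variable names (`docs`, `vocab`, `matrix`, `triples`, `totals`) are illustrative only. It builds the sorted vocabulary, the dense document-term matrix, the sparse (row, column, count) view, and the column sums.

```python
from collections import Counter

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# Tokenize on whitespace (sufficient for this toy corpus).
docs = [text.split() for text in texts]

# Sorted vocabulary: each term's position is its column index.
all_terms = sorted({term for doc in docs for term in doc})
vocab = {term: idx for idx, term in enumerate(all_terms)}

# Dense document-term matrix: one row per document, one column per term.
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts.get(term, 0) for term in all_terms])

# Sparse view: keep only the non-zero (row, column, count) triples.
triples = [(r, c, v)
           for r, row in enumerate(matrix)
           for c, v in enumerate(row) if v]

# Column sums: total occurrences of each term across all documents.
totals = [sum(col) for col in zip(*matrix)]

print(vocab)    # {'bird': 0, 'cat': 1, 'dog': 2, 'fish': 3}
print(matrix)   # [[0, 1, 1, 1], [0, 2, 1, 0], [1, 0, 0, 1], [1, 0, 0, 0]]
print(totals)   # [2, 3, 2, 2]
```

The `triples` list holds the same eight non-zero entries that `print(cv_fit)` shows, which is why a sparse representation pays off once the vocabulary grows and most cells are zero.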
Reimplementation
The reimplementation covers:
- Text preprocessing such as stop-word removal
- fit
- transform
- n-gram support
import numpy as np

# Read the corpus: one document per line
with open('data.txt', 'r', encoding='utf-8') as f:
    data = [line.strip() for line in f]


class MyCountVectorizer(object):
    def __init__(self, n=1, remove_stop_words=False):
        self.n = n  # n-gram size
        self.remove_stop_words = remove_stop_words
        # Keep state per instance; class-level dicts/lists would be shared by all instances
        self.vocabulary = {}
        self.corpus = []

    def clean(self, corpus):
        stop_words = set()
        if self.remove_stop_words:
            # Load stopword list (assumed to be UTF-8, one word per line)
            with open('stopwords.txt', encoding='utf-8') as f:
                stop_words = {w.strip() for w in f}
        for text in corpus:
            # Lower case
            text = text.lower()
            # Replace punctuation with spaces (raw string avoids invalid escape sequences)
            for c in r"""!"'#$%&\()*+,-./:;<=>?@[\]^_`{|}~“”‘’""":
                text = text.replace(c, ' ')
            # Keep alphanumeric tokens longer than one character that are not stop words
            word_ls = [word for word in text.split(' ')
                       if word and word.isalnum() and len(word) > 1 and word not in stop_words]
            # Build n-grams with a sliding window of size n
            n_gram_word_ls = []
            for idx in range(len(word_ls) - self.n + 1):
                n_gram_word = ' '.join(word_ls[idx: idx + self.n])
                n_gram_word_ls.append(n_gram_word)
            self.corpus.append(n_gram_word_ls)

    def fit(self, corpus):
        # Create a dictionary of terms which map to columns of the term-frequency matrix.
        self.clean(corpus)
        for row in self.corpus:
            for word in row:
                if word not in self.vocabulary:
                    self.vocabulary[word] = len(self.vocabulary)
        return self

    def transform(self):
        # Create a term-frequency matrix of appropriate size (document size * vocabulary size)
        tf_matrix = []
        size = len(self.vocabulary)
        for doc in self.corpus:
            # Count how often each word appears in the document
            word_count = {}
            for word in doc:
                word_count[word] = word_count.get(word, 0) + 1
            # Construct the term-frequency vector of the row
            row = [0] * size
            for word, value in word_count.items():
                row[self.vocabulary[word]] = value
            tf_matrix.append(row)
        return np.array(tf_matrix)

    def get_vocab(self):
        # Returns the dictionary of terms
        return self.vocabulary


cv = MyCountVectorizer(1, True)
cv.fit(data)
print(cv.get_vocab())
term_frequency_matrix = cv.transform()
print(term_frequency_matrix.shape)
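The n-gram step inside `clean` is easy to miss, so here it is in isolation: a window of `n` tokens slides over the document, and each window is joined with a space to form one vocabulary entry. `make_ngrams` is a hypothetical helper written for this illustration; it is a sketch of the same idea, not code from scikit-learn.

```python
def make_ngrams(tokens, n):
    """Hypothetical helper: join every run of n consecutive tokens with a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["dog", "cat", "fish"]
print(make_ngrams(tokens, 1))  # ['dog', 'cat', 'fish']
print(make_ngrams(tokens, 2))  # ['dog cat', 'cat fish']
print(make_ngrams(tokens, 4))  # []
```

A document shorter than `n` tokens contributes no n-grams at all. Note also that this scheme produces n-grams of exactly one size, whereas CountVectorizer's `ngram_range=(1, 2)` would keep unigrams and bigrams together.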
References:
sklearn: CountVectorizer explained in detail