Python: classifying IMDB reviews with Word2Vec and sklearn

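Until now I have mostly worked on object tracking. Over the past few days I was reading this book and came across NLP again. Both deal with sequences of data, so they should have something in common, and I decided to poke around a little.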

In NLP, the simplest way to map words to vectors is bag of words: crude and direct, and sometimes surprisingly effective.
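To make the idea concrete, here is a minimal bag-of-words sketch in plain Python. The two sample sentences and the helper name `bag_of_words` are made up for illustration; a real pipeline would more likely use something like sklearn's CountVectorizer.

```python
from collections import Counter


def bag_of_words(docs):
    """Map each document to a vector of word counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is every distinct word, in sorted order.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


# Two made-up sentences, for illustration only.
docs = ["the movie was great", "the movie was bad bad"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['bad', 'great', 'movie', 'the', 'was']
print(vectors)  # [[0, 1, 1, 1, 1], [2, 0, 1, 1, 1]]
```

Each document becomes one fixed-length vector regardless of how long the text is, which is what makes the method so easy to feed into a classifier.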

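But bag of words cannot express the relationships or the degree of similarity between words, so it is still a rather coarse tool. Hence the idea of using Word2Vec to map words into a vector space first, and then training a classifier on top.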
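A small sketch of why that matters: in a bag-of-words style encoding every word sits on its own axis, so two different words always have cosine similarity 0, while dense vectors can place related words close together. The dense vectors below are hand-picked numbers, not the output of any trained model.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# One-hot vectors: every word gets its own axis.
one_hot = {
    "good":  np.array([1.0, 0.0, 0.0]),
    "great": np.array([0.0, 1.0, 0.0]),
    "awful": np.array([0.0, 0.0, 1.0]),
}

# Hand-picked dense vectors (NOT from a trained model), chosen so that
# "good" and "great" point in a similar direction.
dense = {
    "good":  np.array([0.9, 0.1]),
    "great": np.array([0.8, 0.3]),
    "awful": np.array([-0.7, 0.2]),
}

print(cosine(one_hot["good"], one_hot["great"]))  # 0.0
print(cosine(dense["good"], dense["great"]))      # close to 1
print(cosine(dense["good"], dense["awful"]))      # negative
```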

This time the work is mainly a combination of a tutorial from the book and word2vec.

Modules needed

sklearn
nltk
gensim

Nothing exotic here: just install them from within anaconda, and fall back to pip if conda doesn't have them.

Data source

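Crawling web pages is out of the question, I can't handle anything that big anyway, so I just grabbed a ready-made dataset. (I refuse to admit that I only did this project because I knew about this dataset.)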
In this dataset, reviews rated 7 or higher (out of 10) on IMDB are labeled as positive samples, and reviews rated 4 or lower as negative samples.

Data preprocessing

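I borrowed the stopwords set from nltk, i.e. words like i, you, is that carry little information yet show up everywhere with very high frequency, and used it to remove them from the training set.
pyprind is there to show progress.
There is also an extraction step for emoticons such as :-). But all of this, stopwords and emoticons alike, is built around English and does not work for Chinese, or maybe similar work exists and I'm just not aware of it. (I am concerned about how hard it would be to come up with a general recognizer for things like (～￣▽￣)～ and (???).)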
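The preprocessing steps described above can be sketched on a single string as follows. The tiny `STOP` set is a stand-in so the example runs without downloading the nltk corpus, and the sample sentences are made up. Emoticons are matched on the original-case text here so that `:D` and `:P` are able to match.

```python
import re

# A tiny stand-in stopword set; the real nltk list is much longer.
STOP = {"i", "you", "is", "this", "a", "the"}


def tokenize(text):
    # Strip HTML tags.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :-) ;( =D before punctuation is removed.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace runs of non-word characters with a space,
    # then append the emoticons with their "nose" removed.
    text = (re.sub(r'\W+', ' ', text.lower()) + ' '
            + ' '.join(emoticons).replace('-', ''))
    return [w for w in text.split() if w not in STOP]


print(tokenize("<br />This movie is great :-)"))  # ['movie', 'great', ':)']
print(tokenize("bad ;-("))                        # ['bad', ';(']
```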

Converting the raw txt files into a csv

import pyprind
import pandas as pd
import os
from nltk.corpus import stopwords
import re
import numpy as np


stop = stopwords.words('english')


def tokenizer(text):
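    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags
    # match emoticons before lowercasing, otherwise :D and :P can never match
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, collapse non-word characters, then append the emoticons
    text = re.sub(r'\W+', ' ', text.lower()) + ' ' +\
        ' '.join(emoticons).replace('-', '')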
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized


basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            token = tokenizer(text=txt)
            rows.append([token, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv')

Building a word2vec model on this dataset

import pyprind
import gensim.models
import re

inpath = 'movie_data.csv'
outpath = 'wordVectTrainResult'
pbar = pyprind.ProgBar(100000)
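class csvStream(object):
    """Re-iterable stream of token lists, one per csv row."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, 'r', encoding='utf-8') as csv:
            next(csv)  # skip header
            for line in csv:
                # crude slice: skip the first 4 characters of the row
                # and drop the trailing ',label\n'
                text = line[4:-3]
                # remove quotes, brackets, digits and backspace characters
                text = re.sub(r'[\'"\[\]\d\b]', '', text)
                # strip leftover leading commas and spaces
                text = text.lstrip(', ')
                pbar.update()
                yield text.split(', ')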


lineIterator = csvStream(inpath)
model = gensim.models.Word2Vec()
model.build_vocab(lineIterator)
print('vocabulary building finished, start training...')
model.train(lineIterator,total_examples=model.corpus_count,epochs=1)
model.save(outpath)

The model is saved to the file wordVectTrainResult in the current directory. Just load it whenever you need it.

Training the classifier

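Pity my little laptop: running a grid search to pick the best parameters is definitely out, so I went with mini-batch SGD training. sklearn ships a ready-made classifier for this that can be used directly.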
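The appeal of mini-batch SGD is that only one batch has to sit in memory at a time. The sketch below is a plain-NumPy stand-in for that idea (a hand-written logistic regression, not sklearn's SGDClassifier) on synthetic two-cluster data. The learning rate, batch size, number of batches and the data itself are all assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_batch(size):
    """Synthetic stand-in data: two Gaussian clusters, labels 0 and 1."""
    y = rng.integers(0, 2, size=size)
    centers = np.where(y[:, None] == 1, 2.0, -2.0)
    X = centers + rng.normal(size=(size, 2))
    return X, y


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


w = np.zeros(2)
b = 0.0
lr = 0.1  # assumed learning rate

# Each iteration sees one mini-batch only, then throws it away.
for _ in range(20):
    X, y = make_batch(100)
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)  # gradient of the mean log loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

# Evaluate on a fresh batch that was never used for training.
X_test, y_test = make_batch(1000)
accuracy = np.mean((sigmoid(X_test @ w + b) > 0.5) == y_test)
print(accuracy)
```

The weights are updated after every batch, so the model can keep learning from a stream of data that would never fit in memory as a whole.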
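The most important problem at this point is how to turn the per-word word2vec vectors into per-sentence (per-paragraph) vectors for the training data. The samples differ in length, so simply stacking the word vectors does not give usable training data.
The crudest approach is to use the mean of the word vectors in each sample sentence (paragraph) as the vector of the whole sentence in word space. I looked up what the experts online had to say. One of them suggested using bag of words on top of the word2vec vocabulary... I silently glanced at my little laptop... Another one simply posted a paper, From Word Embeddings To Document Distances (ICML-15)... Never mind, let's see what the simplest, crudest approach gives us.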
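As a minimal sketch of the averaging idea, here it is on hypothetical 3-dimensional embeddings. The dictionary `toy_vectors` and the helper `average_vector` are made up for illustration and stand in for a trained word2vec model. Documents of different lengths end up as vectors of the same size.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained model.
toy_vectors = {
    "good":  np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 2.0, 0.0]),
    "plot":  np.array([2.0, 1.0, 1.0]),
}


def average_vector(words, vectors, dim):
    """Mean of the vectors of the known words; unknown words are skipped."""
    total = np.zeros(dim)
    found = 0
    for word in words:
        if word in vectors:
            total += vectors[word]
            found += 1
    # Fall back to the zero vector when no word is known.
    return total / found if found else total


short_doc = ["good", "movie"]
long_doc = ["good", "plot", "unknownword", "movie", "good"]
print(average_vector(short_doc, toy_vectors, 3).tolist())  # [0.5, 1.0, 1.0]
print(average_vector(long_doc, toy_vectors, 3).tolist())   # [1.0, 0.75, 1.25]
```

The price of this simplicity is that word order and the contribution of individual words are lost in the mean.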


# load the trained word2vec model
import gensim.models

inpath = 'wordVectTrainResult'
model = gensim.models.Word2Vec.load(inpath)

# start with the IMDB data
import re
from nltk.corpus import stopwords
from sklearn.linear_model import SGDClassifier
import pyprind
import numpy as np
import matplotlib.pyplot as plt

stop = stopwords.words('english')
# BatchNum*BatchSize must be smaller than 50000
BatchSize = 1000

def tokenizer(text):
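    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags
    # match emoticons before lowercasing, otherwise :D and :P can never match
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, collapse non-word characters, then append the emoticons
    text = re.sub(r'\W+', ' ', text.lower()) + ' ' +\
        ' '.join(emoticons).replace('-', '')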
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized


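def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            # crude slice for the text; the label is the last character before '\n'
            text, label = line[4:-3], int(line[-2])
            # remove quotes, brackets, digits and backspace characters
            text = re.sub(r'[\'"\[\]\d\b]', '', text)
            # strip leftover leading commas and spaces
            text = text.lstrip(', ')
            yield text.split(', '), label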


def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y


# loss='log_loss' is logistic regression (called loss='log' before scikit-learn 1.1);
# with partial_fit there is no need to set a number of iterations
clf = SGDClassifier(loss='log_loss', random_state=1)
ACC = []

classes = np.array([0, 1])
pbar = pyprind.ProgBar(21)


def doc_vectors(docs):
    # represent each document by the mean word2vec vector of its known words
    X = []
    for line in docs:
        wordAveVec = np.zeros(model.wv.vector_size)
        found = 0
        for word in line:
            try:
                wordAveVec = wordAveVec + model.wv[word]
                found += 1
            except KeyError:
                # skip only this out-of-vocabulary word, not the rest of the line
                pass
        # max(found, 1) avoids dividing by zero when no word is in the vocabulary
        X.append(wordAveVec / max(found, 1))
    return X


# note: clf is not re-created inside the loop, so training accumulates
# across successive BatchNum values
for BatchNum in range(25, 46):
    doc_stream = stream_docs(path='movie_data.csv')
    for _ in range(BatchNum):
        X_raw, y_train = get_minibatch(doc_stream, size=BatchSize)
        if not X_raw:
            break
        clf.partial_fit(doc_vectors(X_raw), y_train, classes=classes)

    # whatever is left in the stream becomes the test set
    X_raw_test, y_test = get_minibatch(doc_stream, size=(50000-BatchNum*BatchSize))
    ACC.append(clf.score(doc_vectors(X_raw_test), y_test))
    pbar.update()
x = range(25,46)
plt.plot(x, ACC)
plt.xlabel('BatchNum')
plt.ylabel('Accuracy')
plt.grid()
plt.show()

因?yàn)樵谇皫状螠y試的時候發(fā)現(xiàn)訓(xùn)練樣本和測試樣本的比值對最后測試準(zhǔn)確度影響很大。所以就做了個50%-50%到90%-10%的遍歷，看看比值對最終結(jié)果的影響。

[Figure: Accuracy vs. BatchNum]

(⊙ω⊙)!
Looks like there's something to it.

It jitters quite a bit, but the overall trend is upward, ending at roughly 75%.

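My feeling is that, because the word vectors were trained on all 50000 samples, a classifier trained on too few samples is basically guessing, and the closer the number of training samples gets to 50000, the higher the accuracy.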

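But compared with skipping word2vec and running SGD directly on bag of words (87%), the gap is still quite noticeable. The cause is most likely that a single mean vector is used to represent a whole document.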

Combining word2vec with bag of words should give better results. I'll add that when I have time.
