[Repost] Google's trained Word2Vec model in Python

From here: http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/

12 Apr 2016

In this post I’m going to describe how to get Google’s pre-trained Word2Vec model up and running in Python to play with.

As an interface to word2vec, I decided to go with a Python package called gensim. gensim appears to be a popular NLP package, and has some nice documentation and tutorials, including for word2vec.

You can download Google’s pre-trained model here. It’s 1.5GB! It includes word vectors for a vocabulary of 3 million words and phrases that they trained on roughly 100 billion words from a Google News dataset. The vector length is 300 features.

Loading this model using gensim is a piece of cake; you just need to pass in the path to the model file (update the path in the code below to wherever you’ve placed the file).
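A minimal sketch of the loading step, assuming a recent gensim release (4.x), where word2vec-format files are read through `KeyedVectors.load_word2vec_format`. The `load_google_vectors` helper is hypothetical and only wraps that one call; older gensim versions exposed the loader elsewhere, so check the documentation for the version you have installed.

```python
def load_google_vectors(path, binary=True, limit=None):
    """Load word vectors stored in the word2vec format.

    path   -- location of the model file (wherever you placed the download)
    binary -- True for the binary .bin format, False for the text format
    limit  -- optionally read only the first `limit` vectors to save memory
    """
    # Imported inside the function so the helper can be defined
    # even on a machine where gensim is not installed yet.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=binary, limit=limit)
```

With the download unpacked, a call might look like `model = load_google_vectors("./model/GoogleNews-vectors-negative300.bin")`, where the folder and file name are assumptions about where and how you saved the file. Passing something like `limit=500000` loads only the first 500,000 entries, which can help on a machine with little RAM.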

However, if you’re running 32-bit Python (like I was) you’re going to get a memory error!

This is because gensim allocates a big matrix to hold all of the word vectors, and if you do the math…

…that’s a big matrix!
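The arithmetic behind that remark can be sketched as follows. The vocabulary size and vector length come from the model description above; the 4 bytes per feature is an assumption (32-bit floats), since the post does not state the storage type.

```python
VOCAB_SIZE = 3_000_000   # words and phrases in the model
VECTOR_SIZE = 300        # features per word vector
BYTES_PER_FEATURE = 4    # assumption: each feature is a 32-bit float


def matrix_bytes(vocab_size, vector_size, bytes_per_feature):
    """Size in bytes of a dense vocab_size x vector_size matrix."""
    return vocab_size * vector_size * bytes_per_feature


total = matrix_bytes(VOCAB_SIZE, VECTOR_SIZE, BYTES_PER_FEATURE)
print(f"{total:,} bytes = {total / 2**30:.2f} GiB")
# prints "3,600,000,000 bytes = 3.35 GiB"
```

Under that assumption the matrix alone needs about 3.35 GiB, while the entire address space of a 32-bit process is at most 4 GiB, so the allocation has very little chance of succeeding.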

Assuming you’ve got a 64-bit machine and a decent amount of RAM (I’ve got 16GB; maybe you could get away with 8GB?), your best bet is to switch to 64-bit Python. I had a little trouble with this–see my notes down at the end of the post.
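If you are not sure which build of Python you are running, a quick check like the one below can tell you. This is only a sketch using the standard library; `python_bits` is a hypothetical helper name.

```python
import struct
import sys


def python_bits():
    """Return the pointer width of the running interpreter, in bits."""
    # "P" is the struct format code for a C pointer (void *),
    # so its size is 4 bytes on 32-bit builds and 8 bytes on 64-bit builds.
    return struct.calcsize("P") * 8


print(f"Running {python_bits()}-bit Python {sys.version.split()[0]}")
```

A result of 32 means the interpreter itself is 32-bit, even if the operating system and the machine are 64-bit.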

Inspecting the Model

I have a small Python project on GitHub called inspect_word2vec that loads Google’s model, and inspects a few different properties of it.

If you’d like to browse the 3M word list in Google’s pre-trained model, you can just look at the text files in the vocabulary folder of that project. I split the word list across 50 files, and each text file contains 100,000 entries from the model. I split it up like this so your editor wouldn’t completely choke (hopefully) when you try to open them. The words are stored in their original order–I haven’t sorted the list alphabetically. I don’t know what determined the original order.
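Splitting a word list like this can be sketched in a few lines. This is not the code from the inspect_word2vec project, just an illustration of the idea; the helper name `write_vocab_chunks` and the `vocabulary_NN.txt` file naming are assumptions.

```python
import os


def write_vocab_chunks(words, out_dir, chunk_size=100_000):
    """Write `words` to numbered text files, `chunk_size` entries per file.

    The original order of `words` is preserved. Returns the list of
    file paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(words), chunk_size):
        file_number = start // chunk_size + 1
        # Hypothetical naming scheme: vocabulary_01.txt, vocabulary_02.txt, ...
        path = os.path.join(out_dir, f"vocabulary_{file_number:02d}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            for word in words[start:start + chunk_size]:
                handle.write(word + "\n")
        paths.append(path)
    return paths
```

With a model loaded through gensim 4.x, the word list in its stored order should be available as `model.index_to_key`, so a call could look like `write_vocab_chunks(model.index_to_key, "./vocabulary")`.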

Here are some of the questions I had about the vocabulary, which I answered in this project:

Does it include stop words?

Answer: Some stop words like “a”, “and”, “of” are excluded, but others like “the”, “also”, “should” are included.

Does it include misspellings of words?

Answer: Yes. For instance, it includes both “mispelled” and “misspelled”–the latter is the correct one.

Does it include commonly paired words?

Answer: Yes. For instance, it includes “Soviet_Union” and “New_York”.

Does it include numbers?

Answer: Not directly; e.g., you won’t find “100”. But it does include entries like “###MHz_DDR2_SDRAM” where I’m assuming the ‘#’ are intended to match any digit.
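Questions like the ones above boil down to membership tests against the vocabulary. The sketch below shows the idea with a tiny stand-in set rather than the real model: `toy_vocab` is made up for illustration and simply mirrors the answers reported above, and `check_vocabulary` is a hypothetical helper.

```python
def check_vocabulary(vocab, words):
    """Map each word to True or False depending on whether it is in vocab.

    `vocab` can be any container that supports the `in` operator.
    """
    return {word: word in vocab for word in words}


# Stand-in for the real 3M-entry vocabulary (NOT data read from the model).
toy_vocab = {
    "the", "also", "should",
    "mispelled", "misspelled",
    "Soviet_Union", "New_York",
    "###MHz_DDR2_SDRAM",
}

queries = ["a", "the", "mispelled", "New_York", "100"]
for word, present in check_vocabulary(toy_vocab, queries).items():
    print(f"{word!r}: {'included' if present else 'not found'}")
```

Against the real model loaded with gensim 4.x, the same helper could be called as `check_vocabulary(model.key_to_index, queries)`, assuming `model` is the loaded `KeyedVectors` object.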

Here’s a selection of 30 “terms” from the vocabulary. Pretty weird stuff in there!

Al_Qods

Surendra_Pal

Leaflet

guitar_harmonica

Yeoval

Suhardi

VoATM

Streaming_Coverage

Vawda

Lisa_Vanderpump

Nevern

Saleema

Saleemi

rbracken@centredaily.com

yellow_wagtails

P_&C;

CHICOPEE_Mass._WWLP

Gardiners_Rd

Nevers

Stocks_Advance_Paced

IIT_alumnus

Popery

Kapumpa

fashionably_rumpled

WDTV_Live

ARTICLES_##V_##W

Yerga

Weegs

Paris_IPN_Euronext

##bFM_Audio_Simon
