Getting Started with the LDA Document Topic Model

I. Introduction to LDA

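LDA (Latent Dirichlet Allocation) is a generative topic model for documents, also described as a three-level Bayesian probabilistic model with word, topic, and document layers. "Generative" means that we regard every word in a document as the result of a two-step process: choose a topic with some probability, then choose a word from that topic with some probability. A document's distribution over topics is multinomial, and a topic's distribution over words is multinomial; LDA places a Dirichlet prior on each, which is where the name comes from.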
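The two-step process described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the four-word vocabulary, the two topics, and the values of `alpha` and `n_words` are made up for the example, and the helper names `sample_dirichlet` and `generate_document` are hypothetical, not part of any library.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(topic_word, vocab, n_words, alpha, rng):
    """Generate one document by following the LDA generative story."""
    n_topics = len(topic_word)
    # Document-topic distribution for this document.
    theta = sample_dirichlet([alpha] * n_topics, rng)
    words = []
    for _ in range(n_words):
        k = rng.choices(range(n_topics), weights=theta)[0]  # choose a topic
        w = rng.choices(vocab, weights=topic_word[k])[0]    # choose a word from it
        words.append(w)
    return theta, words


# Hypothetical toy vocabulary and topic-word distributions.
vocab = ["church", "pope", "charles", "prince"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # a made-up "religion" topic
    [0.0, 0.0, 0.5, 0.5],  # a made-up "royalty" topic
]

rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, n_words=8, alpha=0.5, rng=rng)
print(theta)
print(words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and each document's `theta`.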

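LDA is an unsupervised machine learning technique that can be used to uncover latent topic information in a large document collection or corpus. It uses the bag-of-words approach, which treats each document as a word-frequency vector and so turns text into numeric data that is easy to model. The bag-of-words approach ignores the order of words; this simplifies the problem and also leaves room for improved models. Each document is represented as a probability distribution over topics, and each topic as a probability distribution over words.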
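As a small illustration of the bag-of-words idea, the sketch below turns a tokenized document into a count vector over a fixed vocabulary. The two sample sentences and the helper name `bag_of_words` are made up for the example; words outside the vocabulary are simply dropped.

```python
from collections import Counter


def bag_of_words(tokens, vocab):
    """Return the count of each vocabulary word in tokens, in vocab order."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


vocab = ["church", "pope", "years", "people", "mother"]
doc_a = "pope visits church church".split()
doc_b = "church visits pope church".split()

vec_a = bag_of_words(doc_a, vocab)
vec_b = bag_of_words(doc_b, vocab)
print(vec_a)           # [2, 1, 0, 0, 0]
print(vec_a == vec_b)  # True: word order is not recorded
```

The two documents differ only in word order, so they map to the same vector, which is exactly the information the bag-of-words approach discards.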

II. Installing the lda library

pip install lda

After installation, the lda directories can be found under Lib/site-packages in the Python installation directory.

III. Understanding the dataset


The dataset is in the tests folder of the lda installation directory and consists of three files: reuters.ldac, reuters.titles, and reuters.tokens.
reuters.titles contains the titles of the 395 documents.
reuters.tokens contains every word that appears in these 395 documents, 4258 in total.
reuters.ldac has 395 lines; line i gives the frequency of each word in document i. Take line 0 as an example. It describes document 0, whose title in reuters.titles is "UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20".
The data on line 0 is:
159 0:1 2:1 6:1 9:1 12:5 13:2 20:1 21:4 24:2 29:1 ……
The leading number 159 means that document 0 contains 159 distinct words (each occurring one or more times).
0:1 means word 0 occurs 1 time; reuters.tokens shows that word 0 is church.
2:1 means word 2 occurs 1 time; reuters.tokens shows that word 2 is years.
6:1 means word 6 occurs 1 time; reuters.tokens shows that word 6 is told.
9:1 means word 9 occurs 1 time; reuters.tokens shows that word 9 is year.
12:5 means word 12 occurs 5 times; reuters.tokens shows that word 12 is charles.
……
Word ids that are not listed (1, 3, 4, 5, 7, 8, 10, 11, ……) occur 0 times.

Note:
The original text of the 395 documents is not included. The three files above were produced by processing those 395 documents.
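The line format of reuters.ldac described above can be parsed with a few lines of Python. This is a minimal sketch, not the loader the lda package itself uses; the helper name `parse_ldac_line` is hypothetical, and the sample string contains only the first ten pairs quoted above rather than the full line.

```python
def parse_ldac_line(line):
    """Parse 'N id:count id:count ...' into (N, {word_id: count})."""
    fields = line.split()
    n_unique = int(fields[0])  # number of distinct words in the document
    counts = {}
    for pair in fields[1:]:
        word_id, count = pair.split(":")
        counts[int(word_id)] = int(count)
    return n_unique, counts


# Only the first ten pairs of line 0; the real line lists 159 pairs.
line = "159 0:1 2:1 6:1 9:1 12:5 13:2 20:1 21:4 24:2 29:1"
n_unique, counts = parse_ldac_line(line)

# Word ids that are absent from the line have a count of 0.
first_five = [counts.get(i, 0) for i in range(5)]
print(n_unique)    # 159
print(first_five)  # [1, 0, 1, 0, 0]
```

Filling in zeros for the missing ids is what turns this sparse representation into one row of a dense document-term matrix.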

IV. Implementation

(I) Loading the data

(1) Inspecting word frequencies in the documents

import numpy as np
import lda
import lda.datasets

# document-term matrix
X = lda.datasets.load_reuters()
print("type(X): {}".format(type(X)))
print("shape: {}\n".format(X.shape))
print(X[:5, :5])        # first five rows, first five columns

Output:

type(X): <class 'numpy.ndarray'>
shape: (395, 4258)

[[ 1  0  1  0  0]
 [ 7  0  2  0  0]
 [ 0  0  0  1 10]
 [ 6  0  1  0  0]
 [ 0  0  0  2 14]]

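Comparing this with the first 5 columns of the first 5 lines of reuters.ldac:
Row 0, first 5 columns: the counts for word ids 0, 1, 2, 3, 4 are indeed 1, 0, 1, 0, 0.
Row 1, first 5 columns: the counts for word ids 0, 1, 2, 3, 4 are indeed 7, 0, 2, 0, 0.
……

(2) Inspecting the vocabulary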

# the vocab
vocab = lda.datasets.load_reuters_vocab()
print("type(vocab): {}".format(type(vocab)))
print("len(vocab): {}\n".format(len(vocab)))
print(vocab[:5])

Output:

type(vocab): <class 'tuple'>
len(vocab): 4258

('church', 'pope', 'years', 'people', 'mother')

As shown, reuters.tokens contains 4258 words; the first five are church, pope, years, people, and mother.

(3) Inspecting the document titles

# titles for each story
titles = lda.datasets.load_reuters_titles()
print("type(titles): {}".format(type(titles)))
print("len(titles): {}\n".format(len(titles)))
print(titles[:5])       # print the titles of the first five documents

Output:

type(titles): <class 'tuple'>
len(titles): 395

('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20', 
'1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21',
"2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23", 
'3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25', 
'4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25')

(4) Checking how often word 0 appears in the first 5 documents

doc_id = 0
word_id = 0
while doc_id < 5:
    print("doc id: {} word id: {}".format(doc_id, word_id))
    print("-- count: {}".format(X[doc_id, word_id]))
    print("-- word : {}".format(vocab[word_id]))
    print("-- doc  : {}\n".format(titles[doc_id]))
    doc_id += 1

Output:

doc id: 0 word id: 0
-- count: 1
-- word : church
-- doc  : 0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20

doc id: 1 word id: 0
-- count: 7
-- word : church
-- doc  : 1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21

doc id: 2 word id: 0
-- count: 0
-- word : church
-- doc  : 2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23

doc id: 3 word id: 0
-- count: 6
-- word : church
-- doc  : 3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25

doc id: 4 word id: 0
-- count: 0
-- word : church
-- doc  : 4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25

(II) Training the model

Use 20 topics and 500 iterations.

model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(X)          # model.fit_transform(X) is also available
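To give a feel for what the iterations do, here is a toy collapsed Gibbs sampler for LDA in plain Python. The lda package is documented as using collapsed Gibbs sampling, but the code below is not its implementation: it is a slow, simplified sketch, and the function name `gibbs_lda`, the four tiny documents, and the values of `alpha` and `eta` are all assumptions made for illustration.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, n_iter=50, alpha=0.1, eta=0.01, seed=1):
    """Toy collapsed Gibbs sampler; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]               # document-topic counts
    nkw = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # tokens assigned to each topic
    z = []                                             # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1

    # Each iteration resamples the topic of every token once.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + eta) / (nk[t] + vocab_size * eta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1

    # Smoothed, normalized estimates of the two distributions.
    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(ndk, docs)
    ]
    topic_word = [
        [(c + eta) / (nk[k] + vocab_size * eta) for c in row]
        for k, row in enumerate(nkw)
    ]
    return doc_topic, topic_word


# Four made-up documents over a 4-word vocabulary (word ids 0..3).
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2], [0, 1, 0, 1, 1], [3, 2, 3, 3, 2]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=4)
for row in doc_topic:
    print([round(p, 3) for p in row])
```

The two returned tables play the same roles as `model.doc_topic_` and `model.topic_word_` in the sections that follow: one row per document over topics, and one row per topic over words.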

(III) Topic-word distribution

Look at the weights of the first 3 words in every topic (20 topics in total).

topic_word = model.topic_word_
print("type(topic_word): {}".format(type(topic_word)))
print("shape: {}".format(topic_word.shape))
print(vocab[:3])
print(topic_word[:, :3])    # first 3 columns of all 20 rows

Output:

type(topic_word): <class 'numpy.ndarray'>
shape: (20, 4258)
('church', 'pope', 'years')
[[2.72436509e-06 2.72436509e-06 2.72708945e-03]
 [2.29518860e-02 1.08771556e-06 7.83263973e-03]
 [3.97404221e-03 4.96135108e-06 2.98177200e-03]
 [3.27374625e-03 2.72585033e-06 2.72585033e-06]
 [8.26262882e-03 8.56893407e-02 1.61980569e-06]
 [1.30107788e-02 2.95632328e-06 2.95632328e-06]
 [2.80145003e-06 2.80145003e-06 2.80145003e-06]
 [2.42858077e-02 4.66944966e-06 4.66944966e-06]
 [6.84655429e-03 1.90129250e-06 6.84655429e-03]
 [3.48361655e-06 3.48361655e-06 3.48361655e-06]
 [2.98781661e-03 3.31611166e-06 3.31611166e-06]
 [4.27062069e-06 4.27062069e-06 4.27062069e-06]
 [1.50994982e-02 1.64107142e-06 1.64107142e-06]
 [7.73480150e-07 7.73480150e-07 1.70946848e-02]
 [2.82280146e-06 2.82280146e-06 2.82280146e-06]
 [5.15309856e-06 5.15309856e-06 4.64294180e-03]
 [3.41695768e-06 3.41695768e-06 3.41695768e-06]
 [3.90980357e-02 1.70316633e-03 4.42279319e-03]
 [2.39373034e-06 2.39373034e-06 2.39373034e-06]
 [3.32493234e-06 3.32493234e-06 3.32493234e-06]]

Sum the weights in each row (each sum should equal 1).

for n in range(20):
    sum_pr = sum(topic_word[n,:])   # sum of all weights in row n; equals 1
    print("topic: {} sum: {}".format(n, sum_pr))

Output:

topic: 0 sum: 1.0000000000000875
topic: 1 sum: 1.0000000000001148
topic: 2 sum: 0.9999999999998656
topic: 3 sum: 1.0000000000000042
topic: 4 sum: 1.0000000000000928
topic: 5 sum: 0.9999999999999372
topic: 6 sum: 0.9999999999999049
topic: 7 sum: 1.0000000000001694
topic: 8 sum: 1.0000000000000906
topic: 9 sum: 0.9999999999999195
topic: 10 sum: 1.0000000000001261
topic: 11 sum: 0.9999999999998876
topic: 12 sum: 1.0000000000001268
topic: 13 sum: 0.9999999999999034
topic: 14 sum: 1.0000000000001892
topic: 15 sum: 1.0000000000000984
topic: 16 sum: 1.0000000000000768
topic: 17 sum: 0.9999999999999146
topic: 18 sum: 1.0000000000000364
topic: 19 sum: 1.0000000000001434
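The sums above differ from 1 only in the last few decimal places, which is ordinary floating-point rounding. Rather than reading 20 printed values, the check can be done in one step with a tolerance. The sketch below uses a random row-normalized matrix as a stand-in for `model.topic_word_`, so the name `fake_topic_word` and its contents are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for model.topic_word_: 20 random rows, each normalized to sum to 1.
fake_topic_word = rng.random((20, 4258))
fake_topic_word /= fake_topic_word.sum(axis=1, keepdims=True)

row_sums = fake_topic_word.sum(axis=1)      # one sum per topic
all_ok = bool(np.allclose(row_sums, 1.0))   # compare within a small tolerance
print(all_ok)  # True
```

With the real model, replacing `fake_topic_word` by `topic_word` performs the same check on the trained distributions.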

(IV) Finding the top-N words of each topic

For each topic, find the 5 words with the largest weight.

n = 5
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n+1):-1]
    print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))

Output:

*Topic 0
- government british minister west group
*Topic 1
- church first during people political
*Topic 2
- elvis king wright fans presley
*Topic 3
- yeltsin russian russia president kremlin
*Topic 4
- pope vatican paul surgery pontiff
*Topic 5
- family police miami versace cunanan
*Topic 6
- south simpson born york white
*Topic 7
- order church mother successor since
*Topic 8
- charles prince diana royal queen
*Topic 9
- film france french against actor
*Topic 10
- germany german war nazi christian
*Topic 11
- east prize peace timor quebec
*Topic 12
- n't told life people church
*Topic 13
- years world time year last
*Topic 14
- mother teresa heart charity calcutta
*Topic 15
- city salonika exhibition buddhist byzantine
*Topic 16
- music first people tour including
*Topic 17
- church catholic bernardin cardinal bishop
*Topic 18
- harriman clinton u.s churchill paris
*Topic 19
- century art million museum city
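The slice `[:-(n+1):-1]` used above can be hard to read at first. The sketch below walks through it on a made-up five-word topic and compares it with an equivalent, more explicit form; the probabilities in `topic_dist` are invented for the example.

```python
import numpy as np

vocab = ("church", "pope", "years", "people", "mother")
topic_dist = np.array([0.10, 0.40, 0.05, 0.15, 0.30])  # made-up weights
n = 3

order = np.argsort(topic_dist)  # indices in ascending order of weight
print(order.tolist())           # [2, 0, 3, 4, 1]

# Form used in the article: walk backwards from the end, taking n items.
top_a = np.array(vocab)[order][:-(n + 1):-1]

# Equivalent form: reverse the order first, then keep the first n indices.
top_b = np.array(vocab)[order[::-1][:n]]

print(top_a.tolist())  # ['pope', 'mother', 'people']
print(top_b.tolist())  # ['pope', 'mother', 'people']
```

Both forms return the words sorted from the largest weight downwards; which one to use is a matter of readability.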

(V) Document-topic distribution

There are 395 documents in total; find the most probable topic for each of the first 10.

doc_topic = model.doc_topic_
print("type(doc_topic): {}".format(type(doc_topic)))
print("shape: {}".format(doc_topic.shape))
for n in range(10):
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}".format(n, topic_most_pr))

Output:

type(doc_topic): <class 'numpy.ndarray'>
shape: (395, 20)
doc: 0 topic: 8
doc: 1 topic: 1
doc: 2 topic: 14
doc: 3 topic: 8
doc: 4 topic: 14
doc: 5 topic: 14
doc: 6 topic: 14
doc: 7 topic: 14
doc: 8 topic: 14
doc: 9 topic: 8
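The loop above calls `argmax` one document at a time. The same result can be obtained for all documents at once by passing `axis=1`, and `max(axis=1)` additionally shows how strongly each document leans towards its top topic. The small matrix below is a made-up stand-in for `model.doc_topic_`.

```python
import numpy as np

# Made-up document-topic matrix: 3 documents, 3 topics, rows sum to 1.
doc_topic = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

best_topic = doc_topic.argmax(axis=1)  # most probable topic per document
best_prob = doc_topic.max(axis=1)      # probability of that topic

print(best_topic.tolist())  # [1, 0, 2]
print(best_prob.tolist())   # [0.7, 0.6, 0.6]
```

A top probability close to 1 indicates a document dominated by a single topic, while a lower value suggests a mix of several topics.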

(VI) Visualization

(1) Plotting the word probability distributions of topics 0, 5, 9, 14, and 19

import matplotlib.pyplot as plt

f, ax = plt.subplots(5, 1, figsize=(8, 6), sharex=True)
for i, k in enumerate([0, 5, 9, 14, 19]):
    print(i, k)
    ax[i].stem(topic_word[k, :], linefmt='b-',
               markerfmt='bo', basefmt='w-')
    ax[i].set_xlim(-50, 4350)
    ax[i].set_ylim(0, 0.08)
    ax[i].set_ylabel("Prob")
    ax[i].set_title("topic {}".format(k))

ax[4].set_xlabel("word")

plt.tight_layout()
plt.show()

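Output:

[Figure 2.png: five stacked stem plots, one each for topics 0, 5, 9, 14, and 19; x-axis "word", y-axis "Prob"]

(2) Plotting the topic distributions of documents 1, 3, 4, 8, and 9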

f, ax = plt.subplots(5, 1, figsize=(8, 6), sharex=True)
for i, k in enumerate([1, 3, 4, 8, 9]):
    ax[i].stem(doc_topic[k, :], linefmt='r-',
               markerfmt='ro', basefmt='w-')
    ax[i].set_xlim(-1, 21)
    ax[i].set_ylim(0, 1)
    ax[i].set_ylabel("Prob")
    ax[i].set_title("Document {}".format(k))

ax[4].set_xlabel("Topic")

plt.tight_layout()
plt.show()

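Output:

[Figure 3.png: five stacked stem plots, one each for documents 1, 3, 4, 8, and 9; x-axis "Topic", y-axis "Prob"]

V. Complete code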

import numpy as np
import lda
import lda.datasets

# document-term matrix
X = lda.datasets.load_reuters()
print("type(X): {}".format(type(X)))
print("shape: {}\n".format(X.shape))
print(X[:5, :5])        # first five rows, first five columns

# the vocab
vocab = lda.datasets.load_reuters_vocab()
print("type(vocab): {}".format(type(vocab)))
print("len(vocab): {}\n".format(len(vocab)))
print(vocab[:5])

# titles for each story
titles = lda.datasets.load_reuters_titles()
print("type(titles): {}".format(type(titles)))
print("len(titles): {}\n".format(len(titles)))
print(titles[:5])       # print the titles of the first five documents

print("\n************************************************************")
doc_id = 0
word_id = 0
while doc_id < 5:
    print("doc id: {} word id: {}".format(doc_id, word_id))
    print("-- count: {}".format(X[doc_id, word_id]))
    print("-- word : {}".format(vocab[word_id]))
    print("-- doc  : {}\n".format(titles[doc_id]))
    doc_id += 1

topicCnt = 20
model = lda.LDA(n_topics = topicCnt, n_iter = 500, random_state = 1)
model.fit(X)          # model.fit_transform(X) is also available

print("\n************************************************************")
topic_word = model.topic_word_
print("type(topic_word): {}".format(type(topic_word)))
print("shape: {}".format(topic_word.shape))
print(vocab[:3])
print(topic_word[:, :3])    # first 3 columns of all 20 rows

for n in range(20):
    sum_pr = sum(topic_word[n,:])   # sum of all weights in row n; equals 1
    print("topic: {} sum: {}".format(n, sum_pr))

print("\n************************************************************")
n = 5
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n+1):-1]
    print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))

print("\n************************************************************")
doc_topic = model.doc_topic_
print("type(doc_topic): {}".format(type(doc_topic)))
print("shape: {}".format(doc_topic.shape))
for n in range(10):
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}".format(n, topic_most_pr))

print("\n************************************************************")
import matplotlib.pyplot as plt

f, ax = plt.subplots(5, 1, figsize=(8, 6), sharex=True)
for i, k in enumerate([0, 5, 9, 14, 19]):
    print(i, k)
    ax[i].stem(topic_word[k, :], linefmt='b-',
               markerfmt='bo', basefmt='w-')
    ax[i].set_xlim(-50, 4350)
    ax[i].set_ylim(0, 0.08)
    ax[i].set_ylabel("Prob")
    ax[i].set_title("topic {}".format(k))

ax[4].set_xlabel("word")

plt.tight_layout()
plt.show()

print("\n************************************************************")
f, ax = plt.subplots(5, 1, figsize=(8, 6), sharex=True)
for i, k in enumerate([1, 3, 4, 8, 9]):
    ax[i].stem(doc_topic[k, :], linefmt='r-',
               markerfmt='ro', basefmt='w-')
    ax[i].set_xlim(-1, 21)
    ax[i].set_ylim(0, 1)
    ax[i].set_ylabel("Prob")
    ax[i].set_title("Document {}".format(k))

ax[4].set_xlabel("Topic")

plt.tight_layout()
plt.show()

VI. References

(1) https://blog.csdn.net/eastmount/article/details/50824215

(2) http://chrisstrelioff.ws/sandbox/2014/11/13/getting_started_with_latent_dirichlet_allocation_in_python.html

VII. Recommended reading

《LDA漫游指南》

