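Analyzing the Star Wars Original Trilogy Scripts

0 Introduction

Star Wars is a great film series that tells a story set in a galaxy far, far away, and it has had a profound influence on popular culture worldwide. Kaggle hosts the scripts of the original Star Wars trilogy; the dataset is small, but it makes for a fun exercise in text analysis.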

1 Importing packages

# Jupyter magic command: render plots inline in the notebook
%matplotlib inline
# Data processing, import and export
import pandas as pd

# Base plotting library
import matplotlib.pyplot as plt
# Nicer-looking plots
import seaborn as sns
sns.set_style("whitegrid") # set the seaborn theme

# Word clouds
from wordcloud import WordCloud  
from imageio import imread

# Machine learning
import gensim # for building word2vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Stop-word removal and string-to-list conversion
import string
from nltk.corpus import stopwords 
stop = stopwords.words('english') # stop words (needs the nltk 'stopwords' corpus, e.g. nltk.download('stopwords'))

2 Loading the data

Load the scripts of Star Wars Episodes IV, V and VI from 3 txt files, each into its own DataFrame.

# Load the data from the txt files
SW_IV = pd.read_table('data/SW_EpisodeIV.txt', delim_whitespace=True, header=0, escapechar='\\')
SW_V = pd.read_table('data/SW_EpisodeV.txt', delim_whitespace=True, header=0, escapechar='\\')
SW_VI = pd.read_table('data/SW_EpisodeVI.txt', delim_whitespace=True, header=0, escapechar='\\')
# Inspect the DataFrame
SW_IV.sample(10)

3 Data processing

Before the analysis, do some simple preprocessing: strip the stop words from the dialogue column and convert each string into a list of words.

print("Stop words:\n{0}".format(stop))

Stop words:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

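def prep_text(series):
    """
        Remove stop words and convert each string into a list of words
        
        Args:
            series: Series
    
        Returns:
            Series
    """
    # Steps: replace apostrophes with spaces, drop stop words, lowercase, strip punctuation, split.
    # Note: stop words are matched before lowercasing, so capitalized ones (e.g. "The") are kept.
    return series.str.replace('\'', ' ').apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]).lower().translate(str.maketrans("", "", string.punctuation)).split())

SW_IV['clean_text'] = prep_text(SW_IV['dialogue'])
SW_V['clean_text'] = prep_text(SW_V['dialogue'])
SW_VI['clean_text'] = prep_text(SW_VI['dialogue'])
SW_IV.head()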
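
To make the order of operations in prep_text easier to follow, here is a minimal sketch of the same steps applied to a single string. The name clean_line and the tiny STOP set are assumptions made for illustration; the article itself uses nltk's full English stop list on a pandas Series.

```python
import string

# Hypothetical mini stop list; the article uses nltk's full English list.
STOP = {"i", "me", "the", "a", "is", "you", "to"}

def clean_line(line, stop=STOP):
    """Apply the same steps as prep_text to one string."""
    line = line.replace("'", " ")                       # split contractions apart
    kept = [w for w in line.split() if w not in stop]   # case-sensitive stop-word filter
    text = " ".join(kept).lower()                       # lowercase only afterwards
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

print(clean_line("Give me regular reports."))
print(clean_line("The Force is strong"))
```

The second call keeps "the" because the capitalized "The" does not match the lowercase stop list before lowercasing, which is likely why some stop words still show up in later results.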

Merge the 3 DataFrames into a single new DataFrame.

SW = pd.concat([SW_IV, SW_V, SW_VI], ignore_index=True)
# Inspect the DataFrame info
SW.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2523 entries, 0 to 2522
Data columns (total 3 columns):
character 2523 non-null object
dialogue 2523 non-null object
clean_text 2523 non-null object
dtypes: object(3)
memory usage: 59.2+ KB

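4 Statistical analysis

Next, we run some simple statistics on the 4 DataFrames to see who has the most lines in the trilogy, and use word clouds to see which words occur most often.

Define two functions, character_count and SWCloud, to show the 20 characters with the most lines and to generate a word cloud, respectively.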

def character_count(df):
    """
        Show the 20 characters with the most lines
        
        Args:
            df: DataFrame
    """
    print(df.groupby('character').size().sort_values(ascending=False)[0:20])
    
    top20 = list(df.groupby('character').size().sort_values(ascending=False)[0:20].index)
    
    df_top20 = df[df['character'].isin(top20)]
    
    sns.countplot(y="character", 
                data=df_top20,
                palette="GnBu_d", 
                order = df_top20['character'].value_counts().index);

def SWCloud(df, cloud_mask, ep=''):
    """
        Generate a word cloud and export it as a jpg image
        
        Args:
            df: DataFrame
            cloud_mask: string, fileName
            ep: string, episode
    """
    text = []

    for line in df['clean_text']:
        text.extend(line)
    
    join_text = " ".join(text)
    
    mask = imread(cloud_mask)
    
    cloud = WordCloud(
        background_color = 'white',
        mask = mask,
        max_words = 1024,
        max_font_size = 100
    )
    
    word_cloud = cloud.generate(join_text)
    # Forward slash avoids the invalid '\S' escape and works on every platform
    word_cloud.to_file('output/SW_' + ep + '_Cloud.jpg')
    
    plt.figure(figsize=(8,8))
    plt.imshow(word_cloud) 
    plt.axis('off');
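
A word cloud is essentially a picture of word frequencies, so it can help to look at the raw counts directly. The sketch below counts tokens with collections.Counter; the clean_text list here is made-up sample data standing in for the clean_text column, and word_frequencies is a hypothetical helper, not part of the article's code.

```python
from collections import Counter

# Made-up token lists standing in for the clean_text column.
clean_text = [
    ["give", "regular", "reports"],
    ["use", "force", "luke"],
    ["force", "strong", "luke"],
]

def word_frequencies(token_lists):
    """Count how often each token occurs across all lines."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts

freq = word_frequencies(clean_text)
print(freq.most_common(3))
```

A mapping like this can also be handed to WordCloud.generate_from_frequencies instead of joining the tokens back into one string.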

First, the 20 characters with the most lines in Episode IV, A New Hope, along with its word cloud.

character_count(SW_IV)

character
LUKE 254
HAN 153
THREEPIO 119
BEN 82
LEIA 57
VADER 41
RED LEADER 37
BIGGS 34
TARKIN 28
OWEN 25
TROOPER 19
GOLD LEADER 14
WEDGE 14
OFFICER 11
RED TEN 8
GOLD FIVE 7
INTERCOM VOICE 6
GREEDO 6
JABBA 6
FIRST TROOPER 6
dtype: int64

SWCloud(SW_IV, 'img/r2d2.png', 'IV')

Next comes The Empire Strikes Back, the episode in which Lord Vader says the famous line "I am your father" to Luke.

character_count(SW_V)

character
HAN 182
LUKE 128
LEIA 114
THREEPIO 92
LANDO 61
VADER 56
YODA 36
PIETT 23
CREATURE 21
BEN 15
RIEEKAN 13
WEDGE 8
DECK OFFICER 7
VEERS 7
ZEV 6
EMPEROR 5
OZZEL 5
NEEDA 5
JANSON 4
DACK 4
dtype: int64

SWCloud(SW_V, 'img/yoda.png', 'V')

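Last comes Return of the Jedi. In this episode, because Lord Vader falls to the light side of the Force, and because the Empire has no money for railings, the Rebels win, press their advantage and seize more than half of the Empire's territory. At this critical hour, Grand Admiral Thrawn rises to prominence...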

character_count(SW_VI)

character
HAN 124
LUKE 112
THREEPIO 90
LEIA 56
VADER 43
LANDO 40
EMPEROR 39
JABBA 20
BEN 18
ACKBAR 14
YODA 13
WEDGE 11
PIETT 8
BOUSHH 7
COMMANDER 7
JERJERROD 7
STORMTROOPER 6
BIB 6
NINEDENINE 6
CONTROLLER 5
dtype: int64

SWCloud(SW_VI, 'img/vader.jpg', 'VI')

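Across the original trilogy, the character with the most lines is, unsurprisingly, the protagonist Luke Skywalker, followed by Han Solo, who gets no less screen time than the lead, with the chatterbox C-3PO in third place. The crowd favorites Chewbacca and R2-D2 only ever make strange noises, so they live on only in other characters' lines.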

character_count(SW)
character
LUKE           494
HAN            459
THREEPIO       301
LEIA           227
VADER          140
BEN            115
LANDO          101
YODA            49
EMPEROR         44
RED LEADER      38
BIGGS           34
WEDGE           33
PIETT           31
TARKIN          28
JABBA           26
OWEN            25
CREATURE        22
TROOPER         19
GOLD LEADER     14
ACKBAR          14
dtype: int64
SWCloud(SW, 'img/rebel alliance.png')

5 TF-IDF

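Now let's see whether the Rebels use different words from the Empire. For convenience, the minor characters are grouped into 3 classes: Imperials, Rebels and Neutrals. A New Hope is used as the example.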

def character_group(name: str) -> str:
    """
        Group the minor characters into classes
        
        Args:
            name: string, character name
            
        Returns:
            string, main character name & secondary character type
    """
    rebel = ('BASE VOICE', 'CONTROL OFFICER', 'MAN', 'PORKINS', 'REBEL OFFICER', 'RED ELEVEN',
             'RED TEN', 'RED SEVEN', 'RED NINE', 'RED LEADER', 'BIGGS', 'GOLD LEADER',
             'WEDGE', 'GOLD FIVE', 'REBEL', 'DODONNA', 'CHIEF', 'TECHNICIAN', 'WILLARD',
             'GOLD TWO', 'MASSASSI INTERCOM VOICE')
    imperial = ('CAPTAIN', 'CHIEF PILOT', 'TROOPER', 'OFFICER', 'DEATH STAR INTERCOM VOICE',
                'FIRST TROOPER', 'SECOND TROOPER', 'FIRST OFFICER', 'OFFICER CASS', 
                'INTERCOM VOICE', 'MOTTI', 'TAGGE', 'TROOPER VOICE', 'ASTRO-OFFICER',
                'VOICE OVER DEATH STAR INTERCOM', 'SECOND OFFICER', 'GANTRY OFFICER', 
                'WINGMAN', 'IMPERIAL OFFICER', 'COMMANDER', 'VOICE')
    neutral = ('WOMAN', 'BERU', 'CREATURE', 'DEAK', 'OWEN', 'BARTENDER', 'CAMIE', 'JABBA', 
               'AUNT BERU', 'GREEDO', 'NEUTRAL', 'HUMAN', 'FIXER')

    if name in rebel:
        return 'Rebels'
    elif name in imperial:
        return 'Imperials'
    elif name in neutral:
        return 'Neutrals'
    else:
        return name
SW_IV['group_character'] = SW_IV['character'].apply(character_group)
    
print(SW_IV.groupby('group_character').size().sort_values(ascending=False))
    
sns.countplot(y="group_character", data=SW_IV, 
            palette="GnBu_d", 
            order = SW_IV['group_character'].value_counts().index);

group_character
LUKE 254
HAN 153
Rebels 139
THREEPIO 119
BEN 82
Imperials 79
Neutrals 58
LEIA 57
VADER 41
TARKIN 28
dtype: int64

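Extract the relevant words with TF-IDF: each word gets a value in every row (one row per line of dialogue) that indicates its importance in that line.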

tfidf_vec = TfidfVectorizer(max_df=0.1, max_features=200, stop_words='english')

features = tfidf_vec.fit_transform(SW_IV['dialogue'])
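# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
X = pd.DataFrame(data=features.toarray(), 
                 index=SW_IV.group_character, 
                 columns=tfidf_vec.get_feature_names())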
X.sample(10)
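
To show where such values come from, here is a minimal hand-rolled sketch of TF-IDF on toy data. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which, as far as I know, matches scikit-learn's defaults; the function name tfidf and the sample documents are made up, and the max_df, max_features and stop_words options used in the article are not reproduced.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n = len(docs)
    df = Counter()                      # number of documents containing each term
    for doc in docs:
        df.update(set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)               # raw term counts in this document
        # Assumed smoothed idf: ln((1 + n) / (1 + df)) + 1
        raw = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

# Made-up tokenized lines of dialogue.
docs = [["use", "force", "luke"], ["force", "strong"], ["regular", "reports"]]
weights = tfidf(docs)
print(weights[0])
```

In the first document, "force" gets a lower weight than "use" or "luke" because it also appears in the second document.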

Use PCA to project each row onto a 2D plot.

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

df_reduced = pd.DataFrame(X_reduced)
df_reduced['group_character'] = X.index
df_reduced.head(10)
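
For intuition about what PCA computes, the following rough sketch finds only the first principal direction of a small point set by power iteration on the covariance matrix. This is an illustrative stand-in, not how scikit-learn implements PCA; the function name top_component and the sample points are assumptions.

```python
import math

def top_component(rows, iterations=200):
    """First principal direction of `rows`, via power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] + [0.5] * (d - 1)          # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Coordinates of each point along the principal direction
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Made-up points lying on the line y = x.
direction, scores = top_component([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
print(direction, scores)
```

PCA(n_components=2) keeps the two directions of largest variance; this sketch stops at the first one to stay short.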

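Assign each character a color:

  • core Rebel characters are shown in blue;
  • other Rebels are shown in cyan;
  • Lord Vader (and Tarkin, per the mapping in the function) is shown in red;
  • other Imperials are shown in magenta;
  • neutrals are shown in black.

def character_to_color(name: str):
    """
        Return the color assigned to a character
    
        Args:
            name: string
    
        Returns:
            string, matplotlib color code
    """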
    color = {'LUKE': 'b', 'HAN': 'b', 'THREEPIO': 'b', 'BEN': 'b', 'LEIA': 'b',
             'VADER': 'r', 'TARKIN': 'r', 
             'Imperials': 'm', 'Rebels': 'c', 'Neutrals': 'k'}
    return color[name]
df_reduced['color'] = df_reduced['group_character'].apply(character_to_color)

plt.figure(figsize=(10, 10))
plt.scatter(x=df_reduced[0], y=df_reduced[1],
            color=df_reduced['color'], alpha=0.5)
plt.savefig('output/displaying_lines.jpg');

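It is easy to see that blue and cyan are spread widely across the plane, while red and magenta are concentrated mainly in the lower left. This suggests that the Rebels use a broader vocabulary than the Empire.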
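
The scatter plot gives only a visual impression, so one rough way to cross-check the vocabulary claim is to count distinct words per group. The sketch below uses made-up (group, tokens) pairs in place of the group_character and clean_text columns, and vocabulary_breadth is a hypothetical helper.

```python
from collections import defaultdict

# Made-up (group, tokens) pairs; not taken from the actual scripts.
lines = [
    ("Rebels", ["stay", "on", "target"]),
    ("Rebels", ["cover", "me", "target"]),
    ("Imperials", ["open", "fire"]),
    ("Imperials", ["open", "fire"]),
]

def vocabulary_breadth(lines):
    """Return {group: (distinct_words, total_words)}."""
    vocab = defaultdict(set)
    totals = defaultdict(int)
    for group, tokens in lines:
        vocab[group].update(tokens)
        totals[group] += len(tokens)
    return {g: (len(vocab[g]), totals[g]) for g in vocab}

breadth = vocabulary_breadth(lines)
for group, (distinct, total) in breadth.items():
    print(group, distinct, total, round(distinct / total, 2))
```

Raw distinct-word counts favor groups with more lines, so the ratio of distinct to total words is printed alongside as a fairer comparison.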

Notably, there is a single magenta point near the top: "there is a traitor in the Empire". Let's track it down.

df_reduced[(df_reduced[0]>0.1) & (df_reduced[1]>0.55) & (df_reduced[1]<0.6)]
SW_IV.loc[714]

character FIRST TROOPER
dialogue Give me regular reports.
clean_text [give, regular, reports]
group_character Imperials
Name: 714, dtype: object

6 Word2Vec

Build a Word2Vec model with gensim.

sentences_IV = SW_IV['clean_text']

# Written against gensim 3.x; in gensim >= 4.0 the `iter` argument is named `epochs`
model = gensim.models.Word2Vec(min_count=3, window=5, iter=20)
model.build_vocab(sentences_IV)
model.train(sentences_IV, total_examples=model.corpus_count, epochs=model.epochs)

(84136, 142920)

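Word2Vec builds a vector for every word in the corpus, which lets us talk about how close different words are. The score reported by most_similar is a cosine similarity: similar words score close to 1, and opposite words score close to -1.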

model.wv.most_similar('force')

[('system', 0.9998038411140442),
('he', 0.9998025298118591),
('the', 0.9997953176498413),
('us', 0.9997825622558594),
('going', 0.9997814893722534),
('want', 0.9997798800468445),
('her', 0.9997773170471191),
('main', 0.9997760057449341),
('get', 0.9997740983963013),
('one', 0.9997738599777222)]

model.wv.most_similar(negative=['force'])

[('hello', -0.992935836315155),
('makes', -0.9964836239814758),
('worse', -0.9966601133346558),
('artoodetoo', -0.9969137907028198),
('moving', -0.9970685839653015),
('identification', -0.9971023797988892),
('rock', -0.9971140027046204),
('gonna', -0.9974701404571533),
('cover', -0.9974837899208069),
('over', -0.9975569248199463)]
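
The scores returned by most_similar are cosine similarities between word vectors. A minimal sketch of that measure on plain Python lists, with made-up vectors and a hypothetical helper name cosine_similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # orthogonal
print(same, opposite, unrelated)
```

Only the direction of the vectors matters, not their length, which is why the first pair scores about 1 even though one vector is twice as long as the other.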

Build a vocabulary list containing the main characters.

characters = SW_IV['group_character'].str.lower().unique()
# gensim 3.x API; in gensim >= 4.0 the vocabulary is exposed as model.wv.key_to_index
vocab = list(model.wv.vocab)
vocab = list(filter(lambda x: x in characters, vocab))
vocab

['vader', 'luke', 'ben', 'threepio', 'han', 'leia', 'tarkin']

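Build the list of vectors that represent the vocabulary.

# Look the vectors up through model.wv; indexing the model object directly is deprecated
X = model.wv[vocab]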

Cluster the data around 3 centers with the K-Means algorithm.

cluster_num = 3

kmeans = KMeans(n_clusters=cluster_num, random_state=0).fit(X)
cluster = kmeans.predict(X)
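
For readers unfamiliar with K-Means, here is a minimal sketch of the basic loop (Lloyd's algorithm) on made-up 2D points. The initial centers are passed in explicitly to keep it deterministic, whereas scikit-learn's KMeans picks its own initialization; the function and variable names are assumptions for illustration.

```python
import math

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate between assigning points and updating centers."""
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its members
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Made-up points forming three obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0), (10.0, 1.0)]
labels, centers = kmeans(points, centers=[points[0], points[2], points[4]])
print(labels, centers)
```

With only 7 character vectors, as in the article, the resulting clusters depend heavily on the initialization, so they are best read as exploratory.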

Use PCA to reduce the vectors to 2 dimensions, then visualize them.

pca = PCA(n_components=2, random_state=11, whiten=True)
clf = pca.fit_transform(X)

tmp = pd.DataFrame(clf, index=vocab, columns=['x', 'y'])

tmp.head(3)
# Row i of X belongs to vocab[i], so cluster ids and labels can be assigned column-wise.
# Labeling with vocab (rather than the characters array) keeps each label on its own point.
tmp['cluster'] = cluster
tmp['c'] = vocab

for i in range(cluster_num):
    values = tmp[tmp['cluster'] == i]
    plt.scatter(values['x'], values['y'], alpha = 0.5)

for word, row in tmp.iterrows():
    x, y, cat, character = row
    pos = (x, y)
    plt.annotate(character, pos)
    
plt.axis('off')
plt.title('Star Wars Episode IV')
plt.savefig('output/w2v_map.jpg')
plt.show();