Dou Wei (竇唯)

1.1 API Analysis

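The comment section of NetEase Cloud Music has long been a talking point, and plenty of listeners became fans because of the quality of its comments. I recently came across an article that used SnowNLP to run sentiment analysis on comments scraped from Cloud Music, so I took the opportunity to work out how to scrape the comments and analyze their sentiment myself.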

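First, watching the requests made by a song's comment page in the browser's developer tools shows that the comments are loaded via Ajax, and that the params and encSecKey parameters of the POST request are encrypted. Others have already published ways around this. The article mentioned above, however, points to an unencrypted Cloud Music API (=.=):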

http://music.163.com/api/v1/resource/comments/R_SO_4_5279713?limit=20&offset=0

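In this URL, the digits after R_SO_4_ are the song id, while limit and offset are the page size and the offset used for paging. This API alone is not enough, though: an API that returns a list of songs is also needed, otherwise song ids would have to be looked up and entered by hand. Happily, a search API turned up as well: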

http://music.163.com/api/search/get/web?csrf_token=&hlpretag=&hlposttag=&s=%E7%AA%A6%E5%94%AF&type=1&offset=0&total=true&limit=

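In this URL, the value after s= is the URL-encoded search term, and type selects the kind of result (1 = songs, 10 = albums, 100 = artists, 1000 = playlists, 1006 = lyrics, 1014 = videos, 1009 = radio stations, 1002 = users).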
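The two URLs described above can be assembled from their parts rather than by string concatenation. The following is a minimal sketch that only builds the URLs and sends no requests; the helper names build_search_url and build_comment_url and the English keys of SEARCH_TYPES are made up for illustration, while the paths, parameters and type codes are the ones quoted in this article.

```python
from urllib.parse import urlencode

API_ROOT = 'http://music.163.com/api'

# Result-type codes as listed in the article; the English keys are illustrative.
SEARCH_TYPES = {
    'song': 1, 'album': 10, 'artist': 100, 'playlist': 1000,
    'lyric': 1006, 'video': 1014, 'radio': 1009, 'user': 1002,
}


def build_search_url(keyword, kind='song', limit=30, offset=0):
    """Build a search URL; urlencode percent-encodes the keyword as UTF-8."""
    query = {
        'csrf_token': '', 'hlpretag': '', 'hlposttag': '',
        's': keyword, 'type': SEARCH_TYPES[kind],
        'offset': offset, 'total': 'true', 'limit': limit,
    }
    return API_ROOT + '/search/get/web?' + urlencode(query)


def build_comment_url(song_id, limit=20, offset=0):
    """Build the comment URL for one song (the id follows the R_SO_4_ prefix)."""
    query = urlencode({'limit': limit, 'offset': offset})
    return '{0}/v1/resource/comments/R_SO_4_{1}?{2}'.format(API_ROOT, song_id, query)


print(build_comment_url(5279713))
print(build_search_url('窦唯'))  # the search term that the example URL encodes
```

The first call reproduces the comment URL shown earlier, and the second reproduces the search URL with limit=30 filled in.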

With these two APIs in hand, we can start writing the crawler.

Warning:
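The code in this article was written for Win10 + Py3.7. It was a one-off job and I underestimated the data volume (nearly 160,000 records were scraped in the end), so little thought went into efficiency or exception handling. Treat it as a reference only.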

1.2 Crawler

As usual, start by importing the libraries the crawler needs.

import requests

import re
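import urllib.parse  # urllib.parse.quote is used below; a bare "import urllib" does not guarantee the submodule is loaded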
import math
import time
import random

import pandas as pd
import sqlite3

Build the request headers.

my_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Host': 'music.163.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'
}

Next come six functions for the crawler:

getJSON(url, headers): fetch JSON from the target URL
countPages(total, limit): work out how many pages to fetch from the total record count
parseSongInfo(song_list): parse song information
getSongList(key, limit=30): get the list of songs
parseComment(comments): parse comments
getSongComment(id, limit=20): get the comments on a song
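As a quick check of the paging arithmetic that countPages relies on, the sketch below lists the offset of every page needed to cover a given number of records. The helper page_offsets is hypothetical and not part of the crawler; the total of 11722 is the comment count this article reports for its most-commented song.

```python
import math


def page_offsets(total, limit):
    """Return the offset of every page needed to cover `total` records."""
    pages = math.ceil(total / limit)  # same rounding-up rule as countPages
    return [n * limit for n in range(pages)]


offsets = page_offsets(11722, 100)
print(len(offsets), offsets[:3], offsets[-1])  # 118 [0, 100, 200] 11700
```

Rounding up matters because the last page is usually only partly filled: here the final request at offset 11700 returns the remaining 22 records.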

def getJSON(url, headers):
    """ Get JSON from the destination URL
    @ param url: destination url, str 
    @ param headers: request headers, dict
    @ return json: result, json
    """
    res = requests.get(url, headers=headers) 
    res.raise_for_status()  # raise an exception on an HTTP error status
    res.encoding = 'utf-8'  
    json = res.json()
    return json

def countPages(total, limit):
    """ Count pages
    @ param total: total num of records, int
    @ param limit: limit per page, int
    @ return page: num of pages, int
    """
    page = math.ceil(total/limit) 
    return page

def parseSongInfo(song_list):
    """ Parse song info
    @ param song_list: list of songs, list
    @ return song_info_list: result, list
    """
    song_info_list = []
    
    for song in song_list:
        song_info = []
        song_info.append(song['id'])
        song_info.append(song['name'])
        artists_name = ''
        artists = song['artists']
        for artist in artists:
            artists_name += artist['name'] + ','
        song_info.append(artists_name)
        song_info.append(song['album']['name'])
        song_info.append(song['album']['id'])
        song_info.append(song['duration'])
        
        song_info_list.append(song_info)
        
    return song_info_list

def getSongList(key, limit=30):
    """ Get a list of songs
    @ param key: key word, str
    @ param limit: limit per page, int, default 30
    @ return result: result, DataFrame
    """
    total_list = []
    key = urllib.parse.quote(key)  # URL-encode the search term
    url = 'http://music.163.com/api/search/get/web?csrf_token=&hlpretag=&hlposttag=&s=' + key +  '&type=1&offset=0&total=true&limit='
    # get the total number of pages
    first_page = getJSON(url, my_headers)
    song_count = first_page['result']['songCount']
    page_num = countPages(song_count, limit)
    # scrape every record that matches the search term
    for n in range(page_num):
        url = 'http://music.163.com/api/search/get/web?csrf_token=&hlpretag=&hlposttag=&s=' + key +  '&type=1&offset=' + str(n*limit) + '&total=true&limit=' + str(limit)
        tmp = getJSON(url, my_headers)
        song_list = parseSongInfo(tmp['result']['songs'])
        total_list += song_list
        print('Page {0}/{1} done'.format(n+1, page_num))
        time.sleep(random.randint(2, 4)) 
        
    df = pd.DataFrame(data = total_list, columns=['song_id', 'song_name', 'artists', 'album_name', 'album_id', 'duration'])
    return df

def parseComment(comments):
    """ Parse song comment
    @ param comments: list of comments, list
    @ return comments_list: result, list
    """
    comments_list = []
    
    for comment in comments:
        comment_info = []
        comment_info.append(comment['commentId'])
        comment_info.append(comment['user']['userId'])
        comment_info.append(comment['user']['nickname'])
        comment_info.append(comment['user']['avatarUrl'])
        comment_info.append(comment['content'])
        comment_info.append(comment['likedCount'])
        comments_list.append(comment_info)
        
    return comments_list

def getSongComment(id, limit=20):
    """ Get Song Comments
    @ param id: song id, int
    @ param limit: limit per page, int, default 20
    @ return result: result, DataFrame
    """
    total_comment = []
    url = 'http://music.163.com/api/v1/resource/comments/R_SO_4_' + str(id) +  '?limit=20&offset=0'
    # get the total number of pages
    first_page = getJSON(url, my_headers)
    total = first_page['total']
    page_num = countPages(total, limit)
    # scrape every comment on this song
    for n in range(page_num):
        url = 'http://music.163.com/api/v1/resource/comments/R_SO_4_' + str(id) +  '?limit=' + str(limit) + '&offset=' + str(n*limit)
        tmp = getJSON(url, my_headers)
        comment_list = parseComment(tmp['comments'])
        total_comment += comment_list 
        print('Page {0}/{1} done'.format(n+1, page_num))
        time.sleep(random.randint(2, 4)) 
        
    df = pd.DataFrame(data = total_comment, columns=['comment_id', 'user_id', 'user_nickname', 'user_avatar', 'content', 'likeCount'])
    df['song_id'] = str(id)  # add the song_id column
    return df
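Since the functions above do no exception handling, a single failed request aborts a long crawl. One possible remedy, sketched here under assumptions and not taken from the original code, is to wrap each fetch in a small retry helper with exponential back-off. The name with_retry and its parameters are hypothetical.

```python
import time


def with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    Re-raises the last error if every attempt fails. The sleep function is
    injectable so the helper can be exercised without real waiting.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as err:  # broad on purpose in this sketch
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise last_error


# Demonstration with a stand-in for a flaky request: fails twice, then succeeds.
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError('temporary failure')
    return {'ok': True}

print(with_retry(flaky, sleep=lambda seconds: None))  # {'ok': True}
```

Inside the crawler this could be used as tmp = with_retry(lambda: getJSON(url, my_headers)). In real use it would be safer to catch only requests.RequestException, not every exception.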

Before scraping any data, connect to the database.

conn = sqlite3.connect('netease_cloud_music.db')

Set the search term and scrape every record that matches it.

artist='竇唯'  # search term
song_df = getSongList(artist, 100)
song_df = song_df[song_df['artists'].str.contains(artist)]  # keep only rows whose artists include the search term
song_df.drop_duplicates(subset=['song_id'], keep='first', inplace=True)  # de-duplicate
song_df.to_sql(name='song', con=conn, if_exists='append', index=False)

Read from the database every song whose artists field contains 竇唯. This gives the song_id DataFrame.

sql = '''
    SELECT song_id
    FROM song
    WHERE artists LIKE '%竇唯%'
'''
song_id = pd.read_sql(sql, con=conn)
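The query above hard-codes the artist name in the SQL text. As an alternative sketch, the same filter can be written with a bound parameter so the search term is passed separately from the statement. The function song_ids_for_artist and the demo rows are made up for illustration; only the song table and its song_id and artists columns come from the article.

```python
import sqlite3


def song_ids_for_artist(conn, artist):
    """Return the ids of songs whose artists field contains `artist`."""
    sql = 'SELECT song_id FROM song WHERE artists LIKE ? ORDER BY song_id'
    rows = conn.execute(sql, ('%' + artist + '%',)).fetchall()
    return [row[0] for row in rows]


# Demonstration on an in-memory database with made-up rows.
demo = sqlite3.connect(':memory:')
demo.execute('CREATE TABLE song (song_id INTEGER, artists TEXT)')
demo.executemany('INSERT INTO song VALUES (?, ?)',
                 [(1, 'A,'), (2, 'A,B,'), (3, 'C,')])
print(song_ids_for_artist(demo, 'A'))  # [1, 2]
```

pandas accepts the same placeholder style through the params argument of pd.read_sql, so the DataFrame version would only need the tuple passed alongside the SQL string.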

Scrape the comments on every song in the song_id DataFrame and save them to the database.

comment_df = pd.DataFrame()
for index, id in zip(song_id.index, song_id['song_id']):
    print('Scraping song {0}/{1}, {2}'.format(index+1, len(song_id['song_id']), id))
    tmp_df = getSongComment(id, 100)
    comment_df = pd.concat([comment_df, tmp_df])
comment_df.drop_duplicates(subset=['comment_id'], keep='first', inplace=True)
comment_df.to_sql(name='comment', con=conn, if_exists='append', index=False)
print('Saved to the database!')

Once all the steps above have finished, the database will have gained nearly 160,000 records.

1.3 Data Overview

Read from the database all comments on songs whose artists field contains 竇唯, giving the comment DataFrame.

sql = '''
    SELECT *
    FROM comment
    WHERE song_id IN (
        SELECT song_id
        FROM song
        WHERE artists LIKE '%竇唯%'
    )
'''
comment = pd.read_sql(sql, con=conn)

The nunique() method reports how many distinct values each column of comment holds. It shows 159232 comments in total, from 70254 users.

comment.nunique()

comment_id 159232
user_id 70254
user_nickname 68798
user_avatar 80094
content 136898
likeCount 616
song_id 445
dtype: int64

Next, look at the top 10 songs by number of comments, the top 10 users by number of comments posted, and the top 10 comments by number of likes.

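song = pd.read_sql('SELECT song_id, song_name FROM song', con=conn)  # song is assumed to be the song table read back from the database
song_top10_num = comment.groupby('song_id').size().sort_values(ascending=False)[0:10]
song_top10 = song[song['song_id'].isin(song_top10_num.index)].iloc[:, 0:2]
song_top10['num'] =  song_top10_num.tolist()  # counts are assigned by position, not matched on song_id
print(song_top10)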

index	song_id	song_name	num
0	5279713	高級動物	11722
4	5279715	悲傷的夢	9316
5	77169	暮春秋色	7464
8	5279714	噢乖	6477
13	526468453	送別2017	5605
28	512298988	重返魔域	4677
124	27853979	殃金咒	4493
327	26031014	雨吁	3965
377	34248413	既然我們是兄弟	3845
435	28465036	天宮圖	3739
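Counts taken from a grouped result only line up with song names if the two are paired by song id, not by row position. The sketch below shows that pairing without pandas, using made-up ids and names. top_songs is a hypothetical helper, and the cast to str reflects that getSongComment stores song_id as a string while the search results carry it as a number.

```python
from collections import Counter


def top_songs(comment_song_ids, song_names, n=10):
    """Return (song_id, song_name, comment_count) for the n most-commented songs."""
    counts = Counter(str(sid) for sid in comment_song_ids)        # comments per song
    names = {str(sid): name for sid, name in song_names.items()}  # id -> name lookup
    return [(sid, names.get(sid, '?'), num) for sid, num in counts.most_common(n)]


# Made-up data: one song id per comment, mixing str and int ids on purpose.
demo = top_songs(['7', '7', 8, '9', '7', 8],
                 {7: 'song A', 8: 'song B', 9: 'song C'}, n=2)
print(demo)  # [('7', 'song A', 3), ('8', 'song B', 2)]
```

Because every count is looked up by id, the result stays correct whatever order the songs were stored in.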

user_top10 = comment.groupby('user_id').size().sort_values(ascending=False)[0:10]
print(user_top10)

user_id	comments
42830600	549
33712056	322
51625217	273
284151966	242
2159884	234
271253793	234
388206024	233
263344124	232
84030184	209
131005965	204

comment_top10 = comment.sort_values(['likeCount'], ascending=False)[0:10]
print(comment_top10[['comment_id', 'likeCount']])

index	comment_id	likeCount
11252	51694054	35285
10522	133265373	15409
10211	148045985	12886
146129	40249220	9234
10038	157500246	7670
38728	6107434	7393
48826	658314395	5559
31101	7875585	5248
146213	35287069	4900
37307	231408710	4801

1.4 Sentiment Analysis

Import the libraries for sentiment analysis and visualization.

import numpy as np

import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False 

import jieba
from snownlp import SnowNLP
from wordcloud import WordCloud

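SnowNLP is used for the sentiment analysis here. It is a natural language processing library for Chinese text that makes sentiment analysis of Chinese straightforward, though its documentation cautions that the training data currently consists mainly of reviews from buying and selling goods, so results on other kinds of text may not be very good and this remains to be solved. An example: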

test = '竇唯只要出來把自己的老作品演繹一遍，就能日進斗金，可人家沒這么干！人家還在自己坐著地鐵！什么是人民藝術家？這就是！！'
c = SnowNLP(test)
c.sentiments
# 0.9988789161400798

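The score lies in the interval [0, 1]: the closer it is to 1, the more positive the sentiment, and the closer to 0, the more negative. Scores above 0.5 are generally treated as positive and scores below 0.5 as negative. Below, two columns are added to comment: the sentiment score of the comment text and a positive/negative label (1 = positive, -1 = negative).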

comment['semiscore'] = comment['content'].apply(lambda x: SnowNLP(x).sentiments)
comment['semilabel'] = comment['semiscore'].apply(lambda x: 1 if x > 0.5 else -1)
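The labelling rule can be checked in isolation. This small sketch restates the lambda above as a named function and tallies the labels for a handful of scores. sentiment_label is a hypothetical name, and every score except the first (the example result quoted earlier) is made up.

```python
from collections import Counter


def sentiment_label(score, threshold=0.5):
    """Same rule as the lambda above: strictly above the threshold is positive."""
    return 1 if score > threshold else -1


scores = [0.9988789161400798, 0.5, 0.12, 0.73]  # only the first value is from the article
labels = [sentiment_label(s) for s in scores]
counts = Counter(labels)
print(labels)                  # [1, -1, -1, 1]
print(counts[1], counts[-1])   # 2 2
```

Note that a score of exactly 0.5 falls on the negative side under this rule, because the comparison is strict.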

A histogram of the sentiment scores follows. It is easy to see that comments on Dou Wei's music are mostly positive:

plt.hist(comment['semiscore'], bins=np.arange(0, 1.01, 0.01), label='semiscore', color='#1890FF')
plt.xlabel("semiscore")
plt.ylabel("number")
plt.title("The semi-score of comment")
plt.show()

Looking at the sentiment labels, comments with positive sentiment are nearly twice as numerous as those with negative sentiment.

semilabel = comment['semilabel'].value_counts()
semilabel = semilabel.loc[[1, -1]]

plt.bar(semilabel.index, semilabel.values, tick_label=semilabel.index, color='#2FC25B')
plt.xlabel("semilabel")
plt.ylabel("number")
plt.title("The semi-label of comment")
plt.show()

1.5 Word Cloud

Finally, use jieba for Chinese word segmentation (for more on jieba, see a concise jieba Chinese word segmentation tutorial) and draw a word cloud:

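text = ''.join(str(s) for s in comment['content'] if s is not None)  # merge all comments into one long text
jieba.add_word('竇唯')  # register a custom word
word_list = jieba.cut(text, cut_all=False)  # segment into words
with open('stopwords.txt', encoding='UTF-8') as f:  # load the stop-word list
    stopwords = set(line.strip() for line in f)
clean_list = [seg for seg in word_list if seg not in stopwords]  # drop stop words
clean_text = ' '.join(clean_list)  # assumption: clean_text is the cleaned tokens joined into one string
# build the word cloud
cloud = WordCloud(  
    font_path = r'F:\fonts\FZBYSK.TTF',  # raw string, otherwise '\f' is read as a form feed
    background_color = 'white',  
    max_words = 1000,  
    max_font_size = 64       
) 
word_cloud = cloud.generate(clean_text) 
# draw the word cloud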
plt.figure(figsize=(16, 16))
plt.imshow(word_cloud)  
plt.axis('off')  
plt.show()
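Tokens made up only of punctuation or whitespace can slip past a stop-word list and end up in the cloud. The sketch below shows one way to filter them and count what remains. The helpers is_punctuation_only and clean_tokens are hypothetical, the tokens are made up, and the check by Unicode category is an assumption about what counts as noise.

```python
import unicodedata
from collections import Counter


def is_punctuation_only(token):
    """True if every character is punctuation, a symbol, a separator or a control code."""
    return all(unicodedata.category(ch)[0] in ('P', 'S', 'Z', 'C') for ch in token)


def clean_tokens(tokens, stopwords):
    """Drop stop words and tokens that contain no letters or digits."""
    stop = set(stopwords)
    return [t for t in tokens if t not in stop and not is_punctuation_only(t)]


tokens = ['喜歡', '、、、、', '的', '搖滾', '喜歡', ' ']
freq = Counter(clean_tokens(tokens, stopwords=['的']))
print(freq.most_common(2))  # [('喜歡', 2), ('搖滾', 1)]
```

A frequency table like freq can also be handed to WordCloud directly through its generate_from_frequencies method, which skips the library's own tokenization.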

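In the resulting word cloud (a stray 、、、、 slipped in, probably a special-character issue), the most prominent items are lyrics from Dou Wei's 高級動物. Together with its comment count of 11722, this makes clear how much people love the song. Next come words such as 喜歡 (like), 聽不懂 (can't understand) and 好聽 (sounds good), which to some extent reflect how people judge Dou Wei's music. Keyword extraction based on the TF-IDF algorithm is then applied to the comments, giving the top 30 keywords: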

import jieba.analyse as anls  # provides the TF-IDF based extract_tags

for x, w in anls.extract_tags(clean_text, topK=30, withWeight=True):
    print('{0}: {1}'.format(x, w))

喜歡: 0.07174921826661623
搖滾: 0.06222465433996381
好聽: 0.048331581166697744
仙兒: 0.04814604948274102
王菲: 0.04271112348151552
竇仙: 0.027324893954643947
聽不懂: 0.01956956751188709
幸福: 0.014775956892430308
成仙: 0.01465450183828875
汪峰: 0.014175488038594907
大仙: 0.013705819518861267
高級: 0.013225888298888759
黑夢: 0.013076421076696725
前奏: 0.012872688959687885
黑豹: 0.012540924545728218
聽歌: 0.012455923064269991
艷陽天: 0.012455923064269991
動物: 0.012396754282072616
聽聽: 0.012369319024839337
聽懂: 0.01160376390830011
吉他: 0.01142745810497296
忘詞: 0.011296092030755316
歌曲: 0.011181124179616048
希望: 0.01089713506654457
理解: 0.010537493766491456
厲害: 0.0104225740491279
哀傷: 0.009602942087618863
竇靖童: 0.009406198340815812
電影: 0.009266377909595709
送別: 0.008950847971089923

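The leading keywords include 喜歡 (like), 搖滾 (rock), 好聽 (sounds good) and 聽不懂 (can't understand). Three personal names also appear: Dou Wei's ex-wife, his daughter, and another leading figure of Chinese rock. Some song titles (such as 高級動物) and album titles (such as 黑夢) make the list as well. Unfortunately none of Dou Wei's later works do, which probably has something to do with 聽不懂. Four keywords contain the character 仙 (immortal), echoing the saying that "Dou Wei has become an immortal". The most amusing Easter egg is the keyword 忘詞 (forgetting the lyrics): evidently people still vividly remember Dou Wei forgetting his lyrics at that 1994 concert.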
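To make the weights in the keyword list easier to read, here is a rough sketch of how a TF-IDF score can be computed: a word's share of all tokens multiplied by an inverse document frequency looked up from a table. This illustrates the general technique and is not jieba's code. The function tfidf_keywords, the tokens and the IDF values are all made up, and falling back to the median IDF for unknown words is an assumption.

```python
from collections import Counter


def tfidf_keywords(tokens, idf, top_k=3):
    """Rank words by term frequency times inverse document frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Words missing from the IDF table get the table's median value (assumption).
    default_idf = sorted(idf.values())[len(idf) // 2] if idf else 1.0
    weights = {w: c / total * idf.get(w, default_idf) for w, c in counts.items()}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


tokens = ['喜歡', '喜歡', '搖滾', '聽歌']
idf = {'喜歡': 2.0, '搖滾': 6.0, '聽歌': 3.0}  # made-up values
for word, weight in tfidf_keywords(tokens, idf):
    print('{0}: {1}'.format(word, weight))
# 搖滾: 1.5
# 喜歡: 1.0
# 聽歌: 0.75
```

The example shows why a frequent but common word can rank below a rarer one: 喜歡 occurs twice, but its low IDF pulls its weight under that of 搖滾.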

NetEase Cloud Music Comment Crawler and Sentiment Analysis