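From the entry users as I subjectively defined them (see "A Simple Analysis of Jianshu Entry Users"), I selected 8,143 users, crawled their timelines, and collected all of their activity for a visual analysis.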
-
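Monthly trend of all activity. Liking articles, posting comments, and rewarding articles follow largely the same trend. As you would expect from a community of creators, the ratio of publishing to reading on Jianshu is quite high:
Rewarding of articles peaked in March of this year, and the middle to second half of 2016 was the peak period for users joining Jianshu:
With a logarithmic vertical axis: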
-
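Distribution of all activity across the 24 hours of the day. Activity picks up between 6 and 8 a.m., and there is a clear small peak in reading and publishing between 9 and 11 p.m.: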
-
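Distribution of all activity by day of the week; the differences are small:
With the vertical axis zoomed in:
Looked at another way, as each weekday's share of the week's activity, Saturday shows a fairly noticeable dip, but the differences are still small: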
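The weekday-share view can be sketched in a few lines. This is only a minimal illustration using the standard library, not the pandas code used for the charts; the function name `weekday_shares` and the `(action_type, timestamp)` record layout are assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime


def weekday_shares(records):
    """For each action type, return the share of its events on each ISO weekday.

    `records` is an iterable of (action_type, datetime) pairs (hypothetical layout).
    Weekdays are numbered 1 (Monday) to 7 (Sunday), as with isoweekday().
    """
    counts = defaultdict(Counter)
    for action_type, ts in records:
        counts[action_type][ts.isoweekday()] += 1

    shares = {}
    for action_type, per_day in counts.items():
        total = sum(per_day.values())  # always > 0: the type has at least one record
        shares[action_type] = {day: per_day[day] / total for day in range(1, 8)}
    return shares


# Made-up sample: 2017-07-03 is a Monday, 2017-07-08 a Saturday, 2017-07-09 a Sunday
sample = [
    ('like_note', datetime(2017, 7, 3, 9)),
    ('like_note', datetime(2017, 7, 3, 21)),
    ('like_note', datetime(2017, 7, 8, 10)),
    ('like_note', datetime(2017, 7, 9, 10)),
    ('share_note', datetime(2017, 7, 8, 22)),
]
shares = weekday_shares(sample)
```

Dividing by each action type's own total is what makes rare actions (such as rewards) comparable with common ones (such as likes) on one chart.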
-
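All activity summed by day of the month. There is a slight drop at the end of the month, but the anomaly on the 31st is simply because some months have no 31st:
Viewed as each action type's share of that day's activity, the distribution is quite even: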
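The dip on the 31st is a counting effect: fewer months contain that day. The sketch below counts how many months of a year contain each day number and uses that to turn raw totals into a per-month average. It is an alternative normalisation for illustration only, not what the charts above use; `months_containing_each_day` and `per_month_average` are hypothetical names.

```python
import calendar
from collections import Counter


def months_containing_each_day(year):
    """Count, for each day-of-month number, how many months of `year` contain it."""
    counts = Counter()
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            counts[day] += 1
    return counts


def per_month_average(day_totals, year):
    """Divide each day-of-month total by the number of months that contain that day."""
    months = months_containing_each_day(year)
    return {day: total / months[day] for day, total in day_totals.items()}


months_2017 = months_containing_each_day(2017)
# Made-up totals: 120 events on the 1st and 70 on the 31st over the whole year
averaged = per_month_average({1: 120, 31: 70}, 2017)
```

In this made-up example the raw totals differ (120 versus 70), but the per-month averages are identical, which is the same effect the article attributes to the 31st.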
-
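Activity over time for several of the most active users (those with the most activity in total). The authors 梅話三弄 and 云兒飄過 are also, respectively, the users who rewarded other authors the most and the second most often (more than 1,300 rewards):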
-
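The user who rewarded others the third most often:
Distribution of the number of rewards given per user. The spread is very wide, but 34.4% of users have given at least one reward, which is a high proportion.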
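The share of users who have ever given a reward can be computed as below. This is a minimal standard-library sketch with made-up data; `reward_stats` and the `(user_id, action_type)` record layout are assumptions for the example, and only the action name `reward_note` comes from the crawled data.

```python
from collections import Counter


def reward_stats(records, all_user_ids):
    """Return (rewards given per user, share of users with at least one reward).

    `records` is an iterable of (user_id, action_type) pairs (hypothetical layout).
    """
    reward_counts = Counter(
        user_id for user_id, action_type in records if action_type == 'reward_note')
    rewarding_users = sum(1 for user_id in all_user_ids if reward_counts[user_id] > 0)
    return reward_counts, rewarding_users / len(all_user_ids)


# Made-up sample: users a and c have rewarded, b and d have not
sample = [
    ('a', 'reward_note'),
    ('a', 'reward_note'),
    ('b', 'like_note'),
    ('c', 'reward_note'),
    ('d', 'share_note'),
]
reward_counts, reward_share = reward_stats(sample, ['a', 'b', 'c', 'd'])
```

`reward_counts.most_common()` then gives the ranking of the most generous users.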
-
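Distribution of total active days across all users (a day counts as active if the user has any activity on it). The distribution is uneven. Do older users simply have more active days?
Grouping users by the month they joined and averaging their active days shows that users who joined around 2016 have the most active days. Older users do not necessarily have more active days; they may also gradually lose interest in the platform and drift away.
Distribution of the activity ratio (active days divided by total days since joining). Most users are active less than once a week:
Correlation between a user's total days on Jianshu (horizontal axis) and activity ratio (vertical axis). Older users are somewhat less active, while new users are at the height of their enthusiasm.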
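The activity ratio for a single user can be sketched as follows. This is a minimal standard-library illustration; the function name `active_ratio` and its parameters are assumptions for the example. The `+ 1` counts the joining day itself, mirroring the pandas code in the next section.

```python
from datetime import datetime


def active_ratio(action_times, joined_at, now):
    """Active days divided by total days since joining.

    A day counts as active if it has at least one action on it.
    """
    active_days = len({ts.date() for ts in action_times})
    days_since_join = (now - joined_at).days + 1  # include the joining day
    return active_days / days_since_join


# Made-up sample: four actions on three distinct days, ten days after joining
joined = datetime(2017, 7, 1)
now = datetime(2017, 7, 10, 12)
actions = [
    datetime(2017, 7, 1, 8),
    datetime(2017, 7, 1, 22),
    datetime(2017, 7, 3, 9),
    datetime(2017, 7, 8, 21),
]
ratio = active_ratio(actions, joined, now)
```

A ratio of 1/7 (about 0.14) corresponds to being active once a week on average.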
Code
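# ------------------- Continuation of the code from "A Simple Analysis of Jianshu Entry Users" ------------------- #
# Fetch the activity feed. Action types:
# share_note       published an article
# like_comment     liked a comment
# like_note        liked an article
# comment_note     posted a comment
# like_collection  followed a collection
# reward_note      rewarded an article
# like_user        followed an author
# join_jianshu     joined Jianshu
# like_notebook    liked a notebook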
# Database setup
from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client.jianshu
# Helper decorator: write the wrapped function's result to MongoDB and
# record the first argument (the user id) of any call that fails
from functools import wraps
def store_and_record_error(errors, coll):
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            try:
                # func_args = inspect.getcallargs(f, *args, **kwargs)
                res = f(*args, **kwargs)
                coll.insert_many(res)
                # return res
                # del res  # returning res directly uses too much memory
                return None
            except Exception:  # a bare except would also swallow KeyboardInterrupt
                errors.append(args[0])
                print('error', args[0])
                return None  # []
        return wrapper
    return decorator
import re
# requests, BeautifulSoup, retry, host and headers are assumed to come from the earlier script
errors = []
coll = db.user_active_new
@store_and_record_error(errors, coll)
@retry(Exception, delay=1, backoff=2, tries=2)
def get_active_from_single_user(id='45a15c9b5a22'):
    first_url = host + '/users/{id}/timeline?_pjax=%23list-container'.format(id=id)
    res = requests.get(first_url, headers=headers)
    # timetuple = time.strptime('2017-07-08T08:36:25', '%Y-%m-%dT%H:%M:%S')
    # datetime.datetime(*timetuple[0:6])
    infos = []  # all activity of this user
    match = re.compile(r'<li id="feed-(\d+)">')
    num = 2
    while True:
        soup = BeautifulSoup(res.text, 'lxml')
        info = [{'id': id,  # several activity records per page  # todo: consider memory usage
                 'action_time': i['data-datetime'].split('+')[0],
                 'action_type': i['data-type']}
                for i in soup.select('.content .author .name span')]
        if not info:
            print('over', id)
            break
        infos.extend(info)
        # the next page is requested with max_id set just below the last feed id on this page
        max_id = int(match.findall(res.text)[-1]) - 1
        next_url = host + '/users/{id}/timeline?page={page_num}&max_id={max_id}'.format(
            id=id, max_id=max_id, page_num=num)
        res = requests.get(next_url, headers=headers)
        # print(num)
        num += 1
    return infos
# 4,000 users take about 75 minutes
# pool = Pool(30)
# iter map: not needed for now
# https://stackoverflow.com/questions/28375508/python-multiprocessing-tracking-the-process-of-pool-map-operation
# https://stackoverflow.com/questions/34827250/how-to-keep-track-of-status-with-multiprocessing-and-pool-map
# https://stackoverflow.com/questions/26520781/multiprocessing-pool-whats-the-difference-between-map-async-and-imap
# The users variable (the entry users crawled earlier) must already exist,
# and pool must have been created (see the commented-out line above)
all_active = pool.map(get_active_from_single_user, users['id'].values)
# Mongo Command: db.getCollection('user_active_new').distinct('id')
import pandas as pd
def read_data_from_mongo(coll):
    cursor = coll.find({}, {'_id': False})
    all_active = pd.DataFrame(list(cursor))  # , dtype=int
    all_active.action_time = pd.to_datetime(all_active.action_time)
    # categorical columns take far less memory than plain strings
    all_active.action_type = all_active.action_type.astype('category')
    # inspect the underlying codes: all_active.action_type.cat.codes
    all_active.id = all_active.id.astype('category')
    all_active.info(memory_usage='deep')
    return all_active
coll = db.user_active_new
all_active = read_data_from_mongo(coll)
# ----------------------- Check memory usage ( --------------------------- #
import sys  # dir() / globals() / locals() / vars() / whos
# copy the items first: the loop variables themselves are added to locals() while iterating
for var, obj in list(locals().items()):
    print(var, sys.getsizeof(obj))
print('Memory usage: %s MB' % (sys.getsizeof(all_active) / 1048576))
import os, psutil
process = psutil.Process(os.getpid())
process.memory_info()[0] / float(2 ** 20)  # resident set size in MB
process.memory_percent()
# ----------------------- Check memory usage ) --------------------------- #
# Data analysis
import matplotlib.pyplot as plt
# number of activity records per user, most active first
id_active_times = all_active.groupby('id').size().to_frame(name='active_times').reset_index()
id_active_times.sort_values('active_times', inplace=True, ascending=False)
id_active_times.active_times.max()
# ISO weekday: 1 = Monday ... 7 = Sunday
weekday = all_active.action_time.dt.dayofweek + 1
all_active['weekday'] = weekday
weekday = all_active.groupby('weekday').size()
weekday.plot.bar()
plt.show()
hours = all_active.action_time.dt.hour
all_active['hours'] = hours
hours = all_active.groupby('hours').size()
hours.plot.bar()
plt.show()
import plotly.offline
import plotly.graph_objs as go
# One-hot encode action_type so that every action type gets its own column
all_active = pd.concat([
    all_active,
    pd.get_dummies(all_active['action_type'])], axis=1)
# all_active.groupby('weekday')['share_note'].sum()
# One bar chart per action type
data = all_active.groupby('weekday')
for d, t in zip(
        ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
         'like_user', 'join_jianshu', 'like_notebook'],
        ['Published article', 'Liked comment', 'Liked article', 'Posted comment', 'Followed collection',
         'Rewarded article', 'Followed author', 'Joined Jianshu', 'Liked notebook']):
    temp = data[d].sum()
    fig = {
        'data': [go.Bar(x=temp.index.values,
                        y=temp.values)],
        'layout': {'yaxis': {'title': t}},
    }
    plotly.offline.plot(fig, filename='basic_bar_%s.html' % d, show_link=False)
# All action types in one bar chart
x = []
for d, t in zip(
        ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
         'like_user', 'join_jianshu', 'like_notebook'],
        ['Published article', 'Liked comment', 'Liked article', 'Posted comment', 'Followed collection',
         'Rewarded article', 'Followed author', 'Joined Jianshu', 'Liked notebook']):
    temp = data[d].sum()
    x.append(go.Bar(x=temp.index.values,
                    y=temp.values,
                    name=t))
plotly.offline.plot(x, filename='basic_bar.html', show_link=False)
# Line chart
x = []
for d, t in zip(
        ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
         'like_user', 'join_jianshu', 'like_notebook'],
        ['Published article', 'Liked comment', 'Liked article', 'Posted comment', 'Followed collection',
         'Rewarded article', 'Followed author', 'Joined Jianshu', 'Liked notebook']):
    temp = data[d].sum()
    # optional: show each weekday's share of the action type's total instead of raw counts
    temp = temp / temp.sum()
    x.append(go.Scatter(x=temp.index.values,
                        y=temp.values,
                        mode='lines',
                        name=t))
plotly.offline.plot(x, filename='basic_bar.html', show_link=False)
# Each action type by hour of day
data = all_active.groupby('hours')
x = []
for d, t in zip(
        ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
         'like_user', 'join_jianshu', 'like_notebook'],
        ['Published article', 'Liked comment', 'Liked article', 'Posted comment', 'Followed collection',
         'Rewarded article', 'Followed author', 'Joined Jianshu', 'Liked notebook']):
    temp = data[d].sum()
    # temp = temp / temp.sum()
    x.append(go.Scatter(x=temp.index.values,
                        y=temp.values,
                        mode='lines',
                        name=t))
plotly.offline.plot(x, filename='basic_bar.html', show_link=False)
# Each action type by day of month
all_active['day'] = all_active.action_time.dt.day
data = all_active.groupby('day')
x = []
for d, t in zip(
        ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
         'like_user', 'join_jianshu', 'like_notebook'],
        ['Published article', 'Liked comment', 'Liked article', 'Posted comment', 'Followed collection',
         'Rewarded article', 'Followed author', 'Joined Jianshu', 'Liked notebook']):
    # mean of a 0/1 column = this action type's share of the day's activity; use .sum() for raw counts
    temp = data[d].mean()
    # temp = temp / temp.sum()
    x.append(go.Scatter(x=temp.index.values,
                        y=temp.values,
                        mode='lines+markers',
                        name=t))
plotly.offline.plot(x, filename='basic_bar.html', show_link=False)
# Users joining Jianshu, by year and month
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
join_jianshu = all_active[all_active['action_type'] == 'join_jianshu'].copy()
# test: a = pd.datetime.strptime('2014-10-09 11:34:45', '%Y-%m-%d %H:%M:%S')
join_jianshu['year_month'] = join_jianshu['action_time'].dt.strftime('%Y-%m')
data = join_jianshu.groupby('year_month').size()
plotly.offline.plot([go.Scatter(
    x=data.index.values,
    y=data.values,
    mode='lines+markers',
    name='lines'
)], filename='lineZ.html', show_link=False)
# All action types by year and month
all_active['year_month'] = all_active['action_time'].dt.strftime('%Y-%m')
# bug: categories is not json serializable:
# all_active['year_month'] = all_active['year_month'].astype('category')
data = all_active.groupby('year_month')
x = []
for d, t in zip(
        ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
         'like_user', 'join_jianshu', 'like_notebook'],
        ['Published article', 'Liked comment', 'Liked article', 'Posted comment', 'Followed collection',
         'Rewarded article', 'Followed author', 'Joined Jianshu', 'Liked notebook']):
    temp = data[d].sum()
    x.append(go.Scatter(x=temp.index.values,
                        y=temp.values,
                        mode='lines+markers',
                        name=t))
fig = {'data': x,
       # optional: logarithmic vertical axis
       # 'layout': {'xaxis': {'title': 'Year-month'}, 'yaxis': {'type': 'log'}}
       }
plotly.offline.plot(fig, filename='basic_bar.html', show_link=False)
# todo: share of published articles in the last month
# Find the five most active users and chart each one's activity by year and month
most_active_user = all_active.groupby('id').size().sort_values(ascending=False).index[:5]
for mu in most_active_user:
    data = all_active[all_active['id'] == mu].groupby('year_month')
    x = []
    for d, t in zip(
            ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
             'like_user', 'join_jianshu', 'like_notebook'],
            ['Published article', 'Liked comment', 'Liked article', 'Posted comment', 'Followed collection',
             'Rewarded article', 'Followed author', 'Joined Jianshu', 'Liked notebook']):
        temp = data[d].sum()  # .mean()  # note the difference in meaning: count vs. share
        x.append(go.Scatter(x=temp.index.values,
                            y=temp.values,
                            mode='lines+markers',
                            name=t))
    fig = {'data': x,
           # optional: logarithmic vertical axis
           # 'layout': {'xaxis': {'title': 'Year-month'}, 'yaxis': {'type': 'log'}}
           }
    plotly.offline.plot(fig, filename='most_active_user_%s.html' % mu, show_link=False)
# Share of active days since registration
all_active['just_date'] = all_active.action_time.dt.date
active_days = all_active.groupby('id')['just_date'].nunique()
active_days.sort_values(ascending=False, inplace=True)
# Distribution of total active days per user (a day is active if it has any activity)
plotly.offline.plot([go.Bar(y=active_days)], filename='active_days.html', show_link=False)
# Total days from registration until now, to compare against total active days
now = pd.Timestamp.now()  # pd.datetime was removed from recent pandas versions
join_during_time = all_active[
    all_active['action_type'] == 'join_jianshu'
][['id', 'action_time']].copy()
join_during_time['during_time'] = now - join_during_time['action_time']
join_during_time['during_time'] = join_during_time.during_time.dt.days + 1  # count the joining day
join_during_time = pd.merge(join_during_time,
                            active_days.reset_index(name='active_days'))
join_during_time['ratio'] = join_during_time['active_days'] / join_during_time['during_time']
# Pie chart
labels = ['Below 10%', '10% to 50%', '50% to 90%', 'Above 90%']
ratio = join_during_time['ratio']
# cumulative counts above each threshold, then differences so each slice matches its label
gt10, gt50, gt90 = [(ratio > i).sum() for i in [0.1, 0.5, 0.9]]
values = [len(ratio) - gt10, gt10 - gt50, gt50 - gt90, gt90]
trace = go.Pie(labels=labels, values=values)
plotly.offline.plot([trace], filename='active_days.html', show_link=False)
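# Histogram and joint (correlation) plot
import matplotlib.pyplot as plt
import seaborn as sns
join_during_time['ratio'].hist()
sns.jointplot(data=join_during_time, x='during_time', y='ratio', kind='reg', color='g')
plt.show()  # sns.plt is no longer available in current seaborn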
# Average active days, grouped by registration year and month
join_during_time['year_month'] = join_during_time['action_time'].dt.strftime('%Y-%m')
temp = join_during_time.groupby('year_month')['active_days'].mean()
plotly.offline.plot([go.Scatter(x=temp.index.values,
                                y=temp.values,
                                mode='lines+markers')],
                    filename='average_active_days.html',
                    show_link=False)
# Activity of one of the top rewarding users
# index[2] picks the user with the third-most rewards given; use 0 or 1 for the top two
most_reward_user = all_active.groupby('id')['reward_note'].sum().sort_values(ascending=False).index[2]
data = all_active[all_active['id'] == most_reward_user].groupby('year_month')
x = []
for d, t in zip(
        ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
         'like_user', 'join_jianshu', 'like_notebook'],
        ['Published article', 'Liked comment', 'Liked article', 'Posted comment', 'Followed collection',
         'Rewarded article', 'Followed author', 'Joined Jianshu', 'Liked notebook']):
    temp = data[d].sum()
    x.append(go.Scatter(x=temp.index.values,
                        y=temp.values,
                        mode='lines+markers',
                        name=t))
fig = {'data': x,
       # optional: logarithmic vertical axis
       # 'layout': {'xaxis': {'title': 'Year-month'}, 'yaxis': {'type': 'log'}}
       }
plotly.offline.plot(fig, filename='most_reward_user_%s.html' % most_reward_user, show_link=False)
# Share of users who have given at least one reward
reward = all_active.groupby('id')['reward_note'].sum().sort_values(ascending=False)
print((reward > 0).sum() / len(reward))
plotly.offline.plot([go.Bar(y=reward)], filename='reward.html', show_link=False)
Other
- Crawling popular articles (first two pages)
def get_hot_articles_from_single_user(id='45a15c9b5a22'):
    # requests, BeautifulSoup, host and headers are assumed to come from the earlier script
    hot_first_url = host + '/u/{id}?order_by=top&_pjax=%23list-container'.format(id=id)
    hot_next_url = host + '/u/{id}?order_by=top&page={page_num}'.format(id=id, page_num=2)
    user_hot_articles = []
    for u in [hot_first_url, hot_next_url]:
        res = requests.get(u, headers=headers)
        soup = BeautifulSoup(res.text, 'lxml')
        for info in soup.select('.content'):
            title = info.select_one('.title').text
            time_ = info.select_one('.time')['data-shared-at']
            details = info.select_one('.meta')
            read, comments = [i.text.strip() for i in details.find_all('a')]
            like_money = [i.text.strip() for i in details.find_all('span')]
            # an article without rewards has only the like count
            if len(like_money) == 1:
                like = like_money[0]
                money = 0
            else:
                like, money = like_money
            user_hot_articles.append({
                'title': title,
                'time': time_,
                'read': read,
                'comments': comments,
                'like': like,
                'money': money,
            })
    # the original snippet ended without returning the collected articles
    return user_hot_articles


















