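Keywords
K-means, ARIMA
Preface
The main work in January was as follows:
Finer-grained data preprocessing
Filter out MACs seen at only one place, filter out MACs seen on fewer than 10 days, and further subdivide the place list;
Data indexing
Keep two copies of the raw data, stored under different indexes, to make later lookups easier
a. timestamp, place -> mac
b. date, mac -> time range: place
Statistics on the distribution of people counts
Clustering preparation
Represent each person's time distribution over places as an ndarray (after data processing)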
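1. Data preprocessing and indexing
The first part of the work was only a simple modification of the earlier code and is not very significant, so I won't record it in detail here~
For the data indexing part, here is a detailed record of how a place id is looked up from a date and a mac:
Input parameters:
start_time start time
end_time end time
mac the MAC address to look up
Output
stime1,etime1,pid1 stay period 1
stime2,etime2,pid2 stay period 2
...
stimen,etimen,pidn stay period n
Data fragment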
2017-09-11 00:00:00,0,141,dormitory,382dd1da2381,N
2017-09-11 00:00:00,0,142,dormitory,5844988f54a5,N
2017-09-11 00:00:00,0,145,dormitory,c8f23075fa06,N
2017-09-11 00:00:00,0,148,dormitory,1c77f6ab931e,N
2017-09-11 00:00:00,0,149,dormitory,10f681e38ca9,N
2017-09-11 00:00:00,0,149,dormitory,4c49e3406f61,N
2017-09-11 00:00:00,0,150,dormitory,bc201040b118,N
2017-09-11 00:00:00,0,150,dormitory,6021013f5b85,N
2017-09-11 00:00:00,0,150,dormitory,a444d1108e48,N
2017-09-11 00:00:00,0,151,dormitory,c8f230a5c86f,N
2017-09-11 00:00:00,0,151,dormitory,483c0cc230cc,N
2017-09-11 00:00:00,0,151,dormitory,bc7574a0e1fa,N
2017-09-11 00:00:00,0,158,edu,8056f2ea0cd9,N
2017-09-11 00:00:00,0,168,edu,74042bcb3a77,N
2017-09-11 00:00:00,0,181,canteen,40f02f4c670d,N
2017-09-11 00:00:00,0,193,edu,8844773c62e3,N
2017-09-11 00:00:00,0,240,canteen,4c1a3d3f0f21,N
2017-09-11 00:01:00,0,141,dormitory,382dd1da2381,N
2017-09-11 00:01:00,0,142,dormitory,5844988f54a5,N
2017-09-11 00:01:00,0,145,dormitory,c8f23075fa06,N
2017-09-11 00:01:00,0,148,dormitory,1c77f6ab931e,N
2017-09-11 00:01:00,0,149,dormitory,10f681e38ca9,N
2017-09-11 00:01:00,0,150,dormitory,bc201040b118,N
2017-09-11 00:01:00,0,150,dormitory,6021013f5b85,N
2017-09-11 00:01:00,0,150,dormitory,a444d1108e48,N
2017-09-11 00:01:00,0,151,dormitory,483c0cc230cc,N
2017-09-11 00:01:00,0,151,dormitory,bc7574a0e1fa,N
2017-09-11 00:01:00,0,151,dormitory,c8f230a5c86f,N
2017-09-11 00:01:00,0,158,edu,8056f2ea0cd9,N
2017-09-11 00:01:00,0,168,edu,74042bcb3a77,N
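To make the record layout concrete, here is a minimal sketch that loads a few rows of the fragment above with pandas. The column names are the ones the script below passes to `read_csv`; the `load_records` helper, the `dtype` and `parse_dates` arguments are my own assumptions for illustration, not part of the original pipeline.

```python
import io
import pandas as pd

# Three rows copied from the data fragment above.
SAMPLE = """2017-09-11 00:00:00,0,141,dormitory,382dd1da2381,N
2017-09-11 00:00:00,0,158,edu,8056f2ea0cd9,N
2017-09-11 00:01:00,0,141,dormitory,382dd1da2381,N
"""

# time, time range, place id, place type, mac, holiday flag
COLUMNS = ['timestamp', 'timerange', 'pid', 'ptype', 'mac', 'isholiday']


def load_records(source):
    """Hypothetical helper: read the records, keep mac as text, parse the time."""
    return pd.read_csv(source, names=COLUMNS, dtype={'mac': str},
                       parse_dates=['timestamp'])


records = load_records(io.StringIO(SAMPLE))
print(records.dtypes)
print(records[records['mac'] == '382dd1da2381'])
```

Reading the mac column as text avoids surprises if an address happens to consist of digits only.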
Python code:
# -*- coding: UTF-8 -*-
import pandas as pd

__author__ = 'SuZibo'

"""
Data dimension transformation 2
date, mac -> time range: place
Data format after re-indexing:
start time 1, end time 1, place id 1
start time 2, end time 2, place id 2
start time 3, end time 3, place id 3
...
Also yields the trajectory array within the given time window:
[142, 202, 142, 202, 200, 202, 200, 142, 142](example)
Input parameters: MAC address, start time, end time
"""

start_time = '2017-09-11 00:00:00'
end_time = '2017-09-18 00:00:00'
mac = '205d4717e6de'


def findpathByMacDate(mac, start_time, end_time):
    # Read the source data and name the columns
    # (time, time range, place id, place type, mac, holiday flag)
    records = pd.read_csv('./macdata/normalinfo_trans.txt',
                          names=['timestamp', 'timerange', 'pid', 'ptype', 'mac', 'isholiday'],
                          dtype={'mac': str})
    # Select this mac's records inside the time window.
    # start_time is inclusive so that a record stamped exactly at start_time is kept.
    records_select = records[(records['mac'] == mac) &
                             (records['timestamp'] >= start_time) &
                             (records['timestamp'] < end_time)]
    # Re-index the data (from 0 to n-1)
    records_select = records_select.reset_index(drop=True)
    # place list: the place trajectory of this mac
    place = []
    if records_select.empty:
        # Nothing to write for this mac (the original crashed here)
        return place
    times = records_select['timestamp']
    pids = records_select['pid'].astype(int)
    n = len(records_select)
    # starts: index of the first record of every stay,
    # i.e. 0 plus every index where the place differs from the previous record
    starts = [0] + [i + 1 for i in range(n - 1) if pids[i] != pids[i + 1]]
    # ends: index of the last record of every stay
    ends = [s - 1 for s in starts[1:]] + [n - 1]
    filepath = './macdata/path/' + mac + '_pathinfo' + '.txt'
    with open(filepath, 'w') as rs:
        # One line per stay: start time, end time, place id
        for s, e in zip(starts, ends):
            line = str(times[s]) + ',' + str(times[e]) + ',' + str(pids[s])
            print(line)
            rs.write(line + '\n')
            place.append(int(pids[s]))
    print(place)
    return place


findpathByMacDate(mac, start_time, end_time)
Re-indexing the data looked troublesome at first, but with pandas filtering and aggregation the code turned out to be short and easy to write~
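As an illustration of how far pandas can shorten this step, the stay segmentation can also be sketched without an explicit loop. This is a sketch under my own assumptions, not the script used above: the `stay_segments` name and the toy records are made up, and the input is assumed to be one mac's records already sorted by time.

```python
import pandas as pd


def stay_segments(df):
    """Collapse consecutive records with the same pid into (stime, etime, pid) rows."""
    # A new run starts whenever pid differs from the previous row.
    run_id = (df['pid'] != df['pid'].shift()).cumsum().rename('run')
    return (df.groupby(run_id)
              .agg(stime=('timestamp', 'first'),
                   etime=('timestamp', 'last'),
                   pid=('pid', 'first'))
              .reset_index(drop=True))


# Made-up records for a single mac, one per minute.
toy = pd.DataFrame({
    'timestamp': ['2017-09-11 08:00:00', '2017-09-11 08:01:00',
                  '2017-09-11 08:02:00', '2017-09-11 08:03:00',
                  '2017-09-11 08:04:00'],
    'pid': [142, 142, 202, 202, 142],
})
segments = stay_segments(toy)
print(segments)
print(segments['pid'].tolist())  # the trajectory array
```

The `shift`/`cumsum` pair labels each run of identical place ids, so one `groupby` yields the start time, end time and place of every stay.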
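2. People distribution statistics
Work:
Count macs along three dimensions: date, time period and place type (place). A bar chart shows two dimensions at a time with the third one fixed; the third dimension can be switched when displaying, which makes features easier to observe
Input: start_time, end_time
Per-day output: the number of macs for each place type
Per-period output: the number of macs for each place type
Columns of the output file:
dormitory, canteen, teaching building, stadium/student activity center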
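The per-day count described above amounts to counting distinct macs per (day, place type). A minimal pandas sketch of that idea follows; the `mac_count_by_day` name and the toy rows are assumptions for illustration, and the column names are borrowed from the indexing script in section 1.

```python
import pandas as pd


def mac_count_by_day(records):
    """Distinct macs per day (rows) and place type (columns)."""
    day = records['timestamp'].str[:10].rename('day')
    return (records.groupby([day, 'ptype'])['mac']
                   .nunique()
                   .unstack(fill_value=0))


# Made-up records in the same layout as the source data.
toy = pd.DataFrame({
    'timestamp': ['2017-09-11 00:00:00', '2017-09-11 00:01:00',
                  '2017-09-11 00:00:00', '2017-09-11 12:00:00',
                  '2017-09-12 09:00:00'],
    'ptype': ['dormitory', 'dormitory', 'dormitory', 'canteen', 'edu'],
    'mac': ['mac_a', 'mac_a', 'mac_b', 'mac_a', 'mac_b'],
})
counts = mac_count_by_day(toy)
print(counts)
```

Replacing the day slice with an hour slice such as `str[11:13]` would give the per-period variant in the same way.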
Python code
# -*- coding: UTF-8 -*-
import datetime

__author__ = 'SuZibo'

"""
Count macs along three dimensions: date, time period and place type (place). A bar chart shows two dimensions at a time with the third one fixed; the third dimension can be switched when displaying, which makes features easier to observe
Input: start_time, end_time
Per-day output: the number of macs for each place type
Per-period output: the number of macs for each place type
Columns of the output file:
dormitory, canteen, teaching building, stadium/student activity center
"""

# Place ids of each place type (kept for reference; the data file already carries the type)
dormitory =[141,142,145,146,148,149,150,151,152,153]
canteen =[171,172,173,174,175,176,177,178,179,180,181,182,240]
edu =[54,60,133,134,136,154,155,156,157,158,159,160,161,162,164,165,166,167,168,169,193,194,195,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,227,230,231,233,234,235,236,237,238,239]
stadium =[183,184,185,186,187,188,189,190,191,232]

stime = '2017-09-11 00:00:00'
etime = '2017-09-12 00:00:00'
# Lower and upper time bounds used inside the inner loop

weekdaylist = []
start_date = '2017-09-11'
end_date = '2017-11-13'
# Lower and upper date bounds of the outer loop
sdate = datetime.datetime.strptime(start_date, '%Y-%m-%d')
edate = datetime.datetime.strptime(end_date, '%Y-%m-%d')
while sdate < edate:
    weekdaylist.append(sdate.strftime('%Y-%m-%d'))
    sdate += datetime.timedelta(days=1)


def getMacCountInfoByDay(stime, etime):
    # People distribution statistics for stime <= timestamp < etime
    # dic: place type -> {mac -> {day -> number of records}}
    dic = {'dormitory': dict(), 'canteen': dict(), 'edu': dict(), 'stadium': dict()}
    with open('../macinfo/macdata/normalinfo_trans.txt') as file:
        for line in file:
            line = line.rstrip('\n').split(',')
            day = line[0][5:10]
            # stime is inclusive so that records stamped exactly at stime are not lost
            if stime <= line[0] < etime and line[3] in dic:
                per_mac = dic[line[3]].setdefault(line[4], dict())
                per_mac[day] = per_mac.get(day, 0) + 1
    # mac_count: [start time, dormitory, canteen, edu, stadium],
    # where each count is the number of distinct macs seen at that place type
    mac_count = [stime]
    for ptype in ('dormitory', 'canteen', 'edu', 'stadium'):
        mac_count.append(len(dic[ptype]))
    # Return the mac_count list
    return mac_count


fmt = '%Y-%m-%d %H:%M:%S'
one_day = datetime.timedelta(days=1)
with open('./plotdata/maccountbyday.txt', 'w') as rs:
    for _ in weekdaylist:
        # Run getMacCountInfoByDay once per day to get the mac counts
        # by day and by place type between sdate and edate
        counts = getMacCountInfoByDay(stime, etime)
        # Write the returned mac_count list to the file
        rs.write(counts[0][0:10] + ',' + ','.join(str(c) for c in counts[1:]) + '\n')
        # Move the window forward by one day
        stime = str(datetime.datetime.strptime(stime, fmt) + one_day)
        etime = str(datetime.datetime.strptime(etime, fmt) + one_day)
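The people statistics rely heavily on the get method of Python dictionaries
A brief description of the dictionary get method:
Syntax
Syntax of get():
dict.get(key, default=None)
Parameters
key -- the key to look up in the dictionary.
default -- the value to return if the given key does not exist.
Return value
Returns the value for the given key; if the key is not in the dictionary, returns the default, which is None unless another default is given.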
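The counting idiom built on `get` can be sketched as follows. The helper name `count_records_per_place` is made up for illustration; the three input lines are copied from the data fragment in section 1.

```python
def count_records_per_place(lines):
    """Count records per place type using dict.get with a default of 0."""
    counts = {}
    for line in lines:
        fields = line.strip().split(',')
        ptype = fields[3]
        # get() returns 0 the first time a place type is seen,
        # so no separate "is the key present" branch is needed.
        counts[ptype] = counts.get(ptype, 0) + 1
    return counts


sample = [
    '2017-09-11 00:00:00,0,141,dormitory,382dd1da2381,N',
    '2017-09-11 00:00:00,0,158,edu,8056f2ea0cd9,N',
    '2017-09-11 00:01:00,0,141,dormitory,382dd1da2381,N',
]
counts = count_records_per_place(sample)
print(counts)                    # {'dormitory': 2, 'edu': 1}
print(counts.get('stadium'))     # None: key absent, no default given
print(counts.get('stadium', 0))  # 0: key absent, default supplied
```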
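3. Building the person time distribution matrix
Work:
Use male_dor,famale_dor,postgraduate_dor,net,hospital,canteen,edu,lab,stadium,activity,administration,library as attributes
Build a matrix of how long each person appears at each place type (indexed by mac)
Python code: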
# -*- coding: UTF-8 -*-
from pandas import DataFrame

__author__ = 'SuZibo'

"""
Build the time feature matrix of each person (distribution over places)
Place lists
male_dor=[141,145,146,149,151]
#male dormitories
famale_dor=[148,150,152,153]
#female dormitories
postgraduate_dor=[142]
#postgraduate dormitory
net=[217,229]
#network center
hospital=[192]
#campus hospital
canteen=[171,172,173,174,175,176,177,178,179,180,181,182,240]
#canteens
edu=[54,60,133,134,136,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,230,231,239]
#teaching buildings
lab=[155,156,157,158,159,160,161,162,164,165,166,167,168,169,233,234,235,236,237,238]
#laboratories
stadium=[189,190,191]
#stadium
activity=[183,184,185,186,187,188,232]
#student activity center
administration=[221,222,223]
#administration building
library=[193,194,195,227]
#library
"""

mac_time_dic = dict()
# Dictionary holding the time statistics of each mac. The source data has one record
# per minute, so the accumulated record count is exactly the duration (in minutes)

with open('../macinfo/macdata/normalinfo_trans_v2.txt') as file:
    for line in file:
        line = line.rstrip('\n').split(',')
        if line[4] not in mac_time_dic:
            mac_time_dic[line[4]] = dict()
        mac_time_dic[line[4]][line[3]] = mac_time_dic[line[4]].get(line[3], 0) + 1
# {mac1:{place1:m,place2:n,...},...}
# {'10b1f8f3a4d0': {'famale_dor': 10, 'male_dor': 507, 'hospital': 10, 'activity': 10, 'library': 4, 'edu': 41, 'canteen': 86...},...}

# 'lab' is listed among the attributes above but was missing from the original
# column list, which silently dropped the lab durations; it is included here
frame = DataFrame(list(mac_time_dic.values()),
                  columns=['male_dor', 'famale_dor', 'postgraduate_dor', 'net', 'hospital',
                           'canteen', 'edu', 'lab', 'stadium', 'activity',
                           'administration', 'library'],
                  index=list(mac_time_dic.keys()))
# Convert to a DataFrame indexed by mac
frame = frame.dropna(how='all')
# Drop the rows in which every column is NA
frame = frame.fillna(0)
# Fill the remaining NA entries with 0
frame.to_csv('./data/user_time_array_includex.csv')
# CSV with the mac index and the header
frame.to_csv('./data/user_time_array.csv', index=False, header=False)
# CSV with the bare matrix only
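For comparison, the same duration matrix can be sketched with `pandas.crosstab`, which counts records per (mac, place type) pair directly. This is an alternative sketch, not the script above: `time_matrix`, the shortened column list and the toy records are assumptions for illustration, and one record is assumed to stand for one minute as in the source data.

```python
import pandas as pd


def time_matrix(records, columns):
    """Minutes per (mac, place type), assuming one record per minute."""
    table = pd.crosstab(records['mac'], records['ptype'])
    # Fix the column order and add all-zero columns for unseen place types.
    return table.reindex(columns=columns, fill_value=0)


# Made-up records: only the two columns needed here.
toy = pd.DataFrame({
    'mac': ['mac_a', 'mac_a', 'mac_a', 'mac_a', 'mac_b', 'mac_b', 'mac_b'],
    'ptype': ['male_dor', 'male_dor', 'male_dor', 'canteen',
              'edu', 'edu', 'library'],
})
columns = ['male_dor', 'canteen', 'edu', 'lab', 'library']
matrix = time_matrix(toy, columns)
print(matrix)

# Plain ndarray for clustering; rows follow the order of matrix.index.
features = matrix.to_numpy()
print(features.shape)
```

Passing the column list explicitly keeps the feature order stable across runs, which matters once the index is dropped and only the ndarray is kept.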
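4. Generating the person frequency distribution matrix
Continuing from 3: Android and iOS behave differently. With Wi-Fi on, the former stays connected after the screen locks, while the latter drops the wireless connection a short while after the screen locks. Measuring a person's features by duration is therefore not accurate enough, so I want to build the person-by-place vector matrix in units of visit frequency instead.
P.S.: for certain areas I also want to split the day into time periods in order to tell groups of people apart, for example 7:00-22:00 versus all other hours for the teaching buildings
So some further data operations were added on top of the work above
Python code 1:
Place frequency statistics that do not need a time-period split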
# -*- coding: UTF-8 -*-

__author__ = 'SuZibo'

"""
Build the frequency feature matrix of each person (distribution over places)
Place lists
male_dor=[141,145,146,149,151]
famale_dor=[148,150,152,153]
postgraduate_dor=[142]
net=[217,229]
hospital=[192]
canteen=[171,172,173,174,175,176,177,178,179,180,181,182,240]
edu=[54,60,133,134,136,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,230,231,239]
lab=[155,156,157,158,159,160,161,162,164,165,166,167,168,169,233,234,235,236,237,238]
stadium=[189,190,191]
activity=[183,184,185,186,187,188,232]
administration=[221,222,223]
library=[193,194,195,227]
Final data structure: {'teaching building (07:00-22:00)': 1, 'teaching building (other hours)': 0, 'male dormitory': 0, 'postgraduate dormitory': 0, 'female dormitory': 0, 'student activity center (07:00-21:00)': 0, 'student activity center (other hours)': 0, 'administration building (07:00-21:00)': 0, 'administration building (other hours)': 0, 'lab building (07:00-21:00)': 0, 'lab building (other hours)': 0, 'canteen (07:00-23:00)': 0, 'canteen (other hours)': 0}
edu,edu1,male_dor,postgraduate_dor,famale_dor,activity,activity1,administration,administration1,lab,lab1,canteen,canteen1,library,hospital,stadium
"""

mac_count_dic = dict()
with open('../macinfo/macdata/normalinfo_trans_v2.txt') as file:
    for line in file:
        line = line.rstrip('\n').split(',')
        day = line[0][5:10]
        if line[4] not in mac_count_dic:
            mac_count_dic[line[4]] = dict()
        if line[3] not in mac_count_dic[line[4]]:
            mac_count_dic[line[4]][line[3]] = dict()
        mac_count_dic[line[4]][line[3]][day] = mac_count_dic[line[4]][line[3]].get(day, 0) + 1
# Build the nested dictionary
# mac_count_dic['mac']['place'][set of dates]

# Place types that are not split by time period
columns = ['male_dor', 'famale_dor', 'postgraduate_dor', 'net', 'hospital', 'stadium']
with open('./data/user_count_array_includex.csv', 'w') as rs:
    # Walk through the dictionary; dis is the per-place dictionary of one mac
    for mac, dis in mac_count_dic.items():
        # Frequency = number of distinct days on which the mac appeared at the
        # place type, 0 if it never appeared there (dict.has_key is gone in Python 3)
        counts = [len(dis.get(col, {})) for col in columns]
        rs.write(','.join([str(mac)] + [str(c) for c in counts]) + '\n')
# mac,male_count,famale_count,...
# indexed by mac
In the same way, obtain the frequency dictionary for the 7:00-22:00 period and the one for the remaining (extra) period
Build three DataFrame objects, named df1, df2 and df3
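The script for the time-period split is not shown here, so the following is only a sketch of how it could look, assuming a half-open window of 07:00:00 up to but not including 22:00:00. The function name `split_day_counts`, the boundary handling and the toy lines (including the mac `aaaaaaaaaaaa`) are my assumptions.

```python
from collections import defaultdict


def split_day_counts(lines, start='07:00:00', end='22:00:00'):
    """Return (inside, extra): mac -> place type -> set of days, split by time of day."""
    inside = defaultdict(lambda: defaultdict(set))
    extra = defaultdict(lambda: defaultdict(set))
    for line in lines:
        fields = line.strip().split(',')
        timestamp, ptype, mac = fields[0], fields[3], fields[4]
        day, clock = timestamp[5:10], timestamp[11:19]
        # Zero-padded HH:MM:SS strings compare correctly as text.
        target = inside if start <= clock < end else extra
        target[mac][ptype].add(day)
    return inside, extra


# Made-up records for one mac in a teaching building.
toy = [
    '2017-09-11 08:00:00,0,196,edu,aaaaaaaaaaaa,N',
    '2017-09-11 09:30:00,0,196,edu,aaaaaaaaaaaa,N',
    '2017-09-12 10:00:00,0,196,edu,aaaaaaaaaaaa,N',
    '2017-09-12 23:15:00,0,196,edu,aaaaaaaaaaaa,N',
]
inside, extra = split_day_counts(toy)
edu_count = len(inside['aaaaaaaaaaaa']['edu'])        # days seen during 07:00-22:00
edu_extra_count = len(extra['aaaaaaaaaaaa']['edu'])   # days seen outside that window
print(edu_count, edu_extra_count)
```

Storing a set of days per place type makes the frequency simply the size of the set, the same quantity the nested day dictionary yields in the script above.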
Python code 2:
Merging the DataFrame objects
# -*- coding: UTF-8 -*-
import csv
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime
import time
__author__ = 'SuZibo'
df1 = pd.read_csv('./data/user_count_array_includex_1.csv',names=['canteen','edu','lab','activity','administration','library'])
df2 = pd.read_csv('./data/user_count_array_includex_1_extra.csv',names=['canteen_extra','edu_extra','lab_extra','activity_extra','administration_extra','library_extra'])
df3 = pd.read_csv('./data/user_count_array_includex.csv',names=['male_dor','famale_dor','postgraduate_dor','net','hospital','stadium'])
# print len(df1)
# print len(df2)
# print len(df3)
df = df2.join(df1)
# print df
df = df.join(df3)
df = df.dropna(how='all')
df = df.fillna(0)
# print df
df.to_csv('./data/user_TimeArray_includex.csv')
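# Write a CSV that includes the index
df.to_csv('./data/user_TimeArray.csv',index=False,header=False)
# Write a CSV without the index
This completes the generation of the person frequency vector matrix
Matrix fragment: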
,canteen_extra,edu_extra,lab_extra,activity_extra,administration_extra,library_extra,canteen,edu,lab,activity,administration,library,male_dor,famale_dor,postgraduate_dor,net,hospital,stadium
483b38cac86d,15,10,0,0,0,3,15.0,9.0,0.0,0.0,0.0,3.0,0,12,0,3,1,1
786256354ae3,9,1,1,0,0,0,9.0,1.0,1.0,0.0,0.0,0.0,5,0,0,0,1,0
908d6c7faa0c,7,13,0,0,0,2,7.0,13.0,0.0,0.0,0.0,2.0,0,6,0,0,0,0
4c49e31c7c69,20,10,3,6,13,19,20.0,10.0,3.0,6.0,13.0,19.0,0,1,22,0,4,3
58449877c1c5,3,7,8,0,2,1,3.0,7.0,8.0,0.0,2.0,1.0,0,0,4,0,0,2
64cc2e771dd3,21,10,6,0,2,6,21.0,10.0,6.0,0.0,2.0,6.0,3,38,0,0,4,1
9cb2b2c7ad65,3,10,2,0,10,0,3.0,10.0,2.0,0.0,10.0,0.0,0,0,0,1,0,0
742344e4ff39,10,3,1,0,0,1,10.0,3.0,1.0,0.0,0.0,1.0,5,0,0,0,0,0
1ccde57a678a,7,4,0,0,0,6,7.0,4.0,0.0,0.0,0.0,6.0,11,0,0,0,1,0
ecdf3ad00c44,15,9,3,0,0,0,15.0,9.0,3.0,0.0,0.0,0.0,0,13,0,1,0,0
f431c39cf8cc,8,4,0,0,0,0,8.0,4.0,0.0,0.0,0.0,0.0,12,0,0,2,1,1
f40e22420be9,18,32,14,0,11,8,17.0,32.0,12.0,0.0,11.0,8.0,5,0,0,0,0,1
68fb7eee63e9,13,6,0,0,1,1,13.0,6.0,0.0,0.0,1.0,1.0,0,15,0,0,2,0
205d47642a4c,17,12,4,2,0,1,17.0,12.0,4.0,2.0,0.0,1.0,27,4,0,0,2,12
b0e235c341d5,13,11,0,1,0,1,13.0,11.0,0.0,1.0,0.0,1.0,0,18,0,0,0,0
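The mix of integer and `.0` columns in the fragment above is what a left join on the index typically produces: a mac missing from the right-hand frame gets NaN, which turns that column into floats before `fillna(0)` runs. The sketch below reproduces this on made-up frames; the index labels and values are assumptions, and whether this is the exact cause in the real data is not verified here.

```python
import pandas as pd

# Made-up frames indexed by mac, mirroring df2 (extra hours) and df1 (main hours).
df_extra = pd.DataFrame({'edu_extra': [10, 1]}, index=['mac_a', 'mac_b'])
df_main = pd.DataFrame({'edu': [9]}, index=['mac_a'])

# DataFrame.join aligns on the index and defaults to a left join,
# so every mac of df_extra is kept and mac_b gets NaN for 'edu'.
joined = df_extra.join(df_main)
print(joined.dtypes)

filled = joined.fillna(0)
print(filled)
```

Note that a left join also drops any mac that appears only in the right-hand frame; `how='outer'` would keep those rows instead.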
In the next post I plan to describe and study the ARIMA model