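Preface
To get a better understanding of HTTP and web crawlers, I decided to build a crawler myself and do some simple data analysis on top of it. There were two options: write a simple crawler entirely by hand with Python's urllib plus an HTML parsing library (BeautifulSoup or similar), or learn a mature, powerful framework such as Scrapy. All things considered, I chose the latter. A wheel I reinvent myself will certainly not be as good as one built by others, and if I ever really need a crawler, Scrapy will be the more reliable choice.
What to crawl? For my first crawling exercise I wanted a site whose data is well structured and clean, ideally organized as one record per row, and that made me think of the PAT problem set.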
GitHub repository
How I understand crawlers
Simply put, when we browse a web page we are actually sending a request (Request) to the page's server. After receiving the request, the server returns data (Response), which includes HTML (the earliest version of HTTP could only return HTML, which is of course no longer the case). The browser then renders that HTML, and the result is the page we see. A crawler leaves the browser out of the picture: it sends requests automatically and collects the server's response data.
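The request/response cycle described above can be sketched with nothing but the Python 3 standard library. This is only an illustration, not part of the project: `HelloHandler` and `fetch_once` are made-up names, and the "web server" is a tiny local stand-in so the example runs without touching any real site.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a real web server."""

    def do_GET(self):
        body = b"<html><body><h1>hello</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_once():
    """Start a local server, send one GET request, return (status, content type, html)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: the OS picks a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    # An empty ProxyHandler makes sure the request goes straight to the local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        with opener.open(url) as response:
            html = response.read().decode("utf-8")
            return response.status, response.headers.get("Content-Type"), html
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_once())
```

A crawler does essentially what `fetch_once` does on the client side, only against real URLs and many times over.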
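Building a real, complex crawler is of course not this simple. You also have to think about cookies, anti-crawling measures, simulated login, and so on. This project is only an introduction, though, so there is no hurry; I can learn those things gradually as I run into them.
Using Scrapy
I won't repeat how to install Scrapy or what it is here; there are plenty of great resources online.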
scrapy startproject patSpider
This command creates a Scrapy project called patSpider. Running tree on it shows the following project structure:

In the spiders folder, create a Python file with a class that inherits from CrawlSpider; that class is a spider. (Note that one Scrapy project can hold more than one spider. Each spider has a unique name that tells it apart from the others, and running scrapy crawl followed by that name from the project directory starts it.)
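First, look at the network traffic of the PAT login page (using Chrome's developer tools), because we need to simulate a login. Logging in simply means submitting the data the server expects (username, password, and so on) in the form of a request. Note that there is also an authenticity_token field: we extract its value from the first response and submit it with the next request. (Copying the value by hand would also work, but the code would no longer be reusable. What if the server changes the value after a while?)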
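To make the token step concrete, here is a minimal sketch that pulls hidden form fields out of a login page using only the Python 3 standard library. The HTML in `SAMPLE_LOGIN_HTML` is a made-up approximation of a login form, not a copy of the PAT page, and `HiddenInputParser` and `build_login_form` are hypothetical helper names.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name -> value for every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("type") == "hidden" and "name" in attr:
            self.hidden[attr["name"]] = attr.get("value") or ""


def build_login_form(login_html, handle, password):
    """Merge the server-issued hidden fields with the user's credentials."""
    parser = HiddenInputParser()
    parser.feed(login_html)
    form = dict(parser.hidden)
    form["user[handle]"] = handle
    form["user[password]"] = password
    return form


# Assumed shape of the login form; the token value is invented.
SAMPLE_LOGIN_HTML = """
<form action="/users/sign_in" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="user[handle]">
  <input type="password" name="user[password]">
</form>
"""

if __name__ == "__main__":
    print(build_login_form(SAMPLE_LOGIN_HTML, "alice", "secret"))
```

Because the token is read from the page on every run, the code keeps working when the server issues a different value.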

Look at the fields listed under Form Data; these are all the fields we need to submit.
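Next, look at the HTML of the PAT Advanced Level problem list that we want to crawl, because we will parse the HTML according to this format. Each <tr> row holds six <td> cells that together describe one problem (pass status, problem ID, title, number of passes, number of submissions, pass rate). In a moment we will parse the HTML following this pattern.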
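The six-cells-per-problem pattern can be tried out on a small sample before writing the spider. The sketch below uses the Python 3 standard library instead of Scrapy selectors; the table in `SAMPLE_TABLE_HTML` and all numbers in it are invented, and `CellCollector` and `rows_from_html` are hypothetical names. The field names match the ones used by the spider's item.

```python
from html.parser import HTMLParser

FIELDS = ("does_pass", "id", "title", "pass_times", "submit_times", "pass_rate")


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data.strip()


def rows_from_html(html):
    """Group the cells six at a time into one dict per problem."""
    collector = CellCollector()
    collector.feed(html)
    cells = collector.cells
    rows = []
    # Ignore a trailing incomplete group instead of failing on it.
    for start in range(0, len(cells) - len(cells) % 6, 6):
        row = dict(zip(FIELDS, cells[start:start + 6]))
        if not row["does_pass"]:
            row["does_pass"] = "Not submit"  # empty status cell: never submitted
        rows.append(row)
    return rows


# Invented sample data in the assumed layout of the problem list.
SAMPLE_TABLE_HTML = """
<table>
<tr><td><span>Y</span></td><td><a href="#">1001</a></td><td><a href="#">Sample problem A</a></td><td>100</td><td>400</td><td>0.25</td></tr>
<tr><td></td><td><a href="#">1002</a></td><td><a href="#">Sample problem B</a></td><td>50</td><td>250</td><td>0.20</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in rows_from_html(SAMPLE_TABLE_HTML):
        print(row)
```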

# -*- coding: utf-8 -*-
from scrapy import FormRequest
from scrapy import Request
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider
from patSpider.items import *
import pickle
from patSpider.pipelines import *


class pat_Spider(CrawlSpider):
    name = "pat"
    items = []
    call_times = 0
    # allowed_domains = []
    # URLs the spider crawls. The problem list has only two pages, so the URL
    # of the second page is simply listed here.
    start_urls = ["https://www.patest.cn/contests/pat-a-practise",
                  "https://www.patest.cn/contests/pat-a-practise?page=2"
                  ]

    # Send requests to the site. None of these methods needs to be called
    # explicitly; Scrapy calls them once the spider is started.
    # post_login is the callback that submits the form data. A request callback
    # is the function that runs once a request has fetched (that is, downloaded)
    # its response; for the first request that function is post_login.
    # See: https://doc.scrapy.org/en/1.3/topics/request-response.html#topics-request-response-ref-request-callback-arguments
    # start_requests(self) overrides the method inherited from CrawlSpider. It
    # runs automatically, so there is no need to worry about where it is called;
    # in this code it is the first method to run.
    # The logic of the three methods below: request the login page first; once
    # the first response arrives, submit the form data, at which point we hold
    # the site's cookie. From then on the cookie is sent along with every
    # request, which keeps us logged in.
    # A good introduction to cookie-based login: http://m.itdecent.cn/p/887af1ab4200
    def start_requests(self):
        return [Request("https://www.patest.cn/users/sign_in", meta={'cookiejar': 1}, callback=self.post_login)]

    def post_login(self, response):
        post_headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Content-Type": "application/x-www-form-urlencoded",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36",
            "Referer": "https://www.patest.cn/users/sign_in",
            "Upgrade-Insecure-Requests": "1"
        }
        # Take the token from the login page instead of hard-coding it.
        authenticity_token = response.xpath('//input[@name="authenticity_token"]/@value').extract()[0]
        # print(authenticity_token)
        return [FormRequest.from_response(response,
                                          url="https://www.patest.cn/users/sign_in",
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          headers=post_headers,
                                          formdata={
                                              'utf8': '✓',
                                              'authenticity_token': authenticity_token,
                                              'user[handle]': 'suncun',
                                              # I have hidden the password
                                              'user[password]': '********',
                                              'user[remember_me]': '0',
                                              # label of the login button ("Log in")
                                              'commit': "登录"
                                          },
                                          callback=self.after_login,
                                          dont_filter=True
                                          )]

    def after_login(self, response):
        # Reuse the logged-in cookie jar and name the callback explicitly.
        for url in self.start_urls:
            yield Request(url, meta={'cookiejar': response.meta['cookiejar']}, callback=self.parse)

    # Note: this method is called automatically, usually once per requested URL.
    # By the time execution reaches this point, we already have a response that
    # was returned after logging in.
    # Extracting the data from this response gives us the results we need.
    # Pay particular attention to XPath syntax and to the difference between
    # Selector and SelectorList objects.
    def parse(self, response):
        print(response.body)
        self.call_times += 1
        data_selector = response.xpath('//tr/td')
        i = 0
        while i < len(data_selector):
            # Six consecutive <td> cells describe one problem.
            six_lines = data_selector[i:i + 6]
            i += 6
            item = PatspiderItem()
            if len(six_lines[0].xpath('.//span/text()').extract()) == 0:
                item['does_pass'] = 'Not submit'
            else:
                item['does_pass'] = six_lines[0].xpath('.//span/text()').extract()[0]
            item['id'] = six_lines[1].xpath('.//a/text()').extract()[0]
            item['title'] = six_lines[2].xpath('.//a/text()').extract()[0]
            item['pass_times'] = six_lines[3].xpath('./text()').extract()[0]
            item['submit_times'] = six_lines[4].xpath('./text()').extract()[0]
            item['pass_rate'] = six_lines[5].xpath('./text()').extract()[0]
            self.items.append(item)
            # Do not use 'return' here: each item is handed to the pipelines
            # while the spider is still running, so 'yield' lets collecting
            # and processing happen at the same time.
            yield item
        # On the last call to parse(), serialize the collected items so the
        # data analysis script can load them later.
        if self.call_times == len(self.start_urls):
            with open('items_list', 'wb') as tmp_f:
                pickle.dump(self.items, tmp_f)
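The login flow described in the spider's comments (fetch the login page, post the form, then keep sending the cookie) can also be illustrated outside Scrapy. The following is a rough Python 3 sketch under stated assumptions: `LoginHandler` is an invented local server that sets a `session` cookie on POST /login and requires it on GET /problems, and the standard library's `CookieJar` plays the role that the `cookiejar` meta key plays in the spider. It does not reproduce how the PAT site actually authenticates users.

```python
import threading
import urllib.error
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical site: POST /login sets a cookie, GET /problems requires it."""

    def _reply(self, status, body, cookie=None):
        self.send_response(status)
        if cookie:
            self.send_header("Set-Cookie", cookie)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # a real site would check the credentials here
        self._reply(200, b"welcome", cookie="session=ok; Path=/")

    def do_GET(self):
        if "session=ok" in self.headers.get("Cookie", ""):
            self._reply(200, b"problem list")
        else:
            self._reply(403, b"please log in")

    def log_message(self, format, *args):
        pass  # keep the example quiet


def demo():
    """Return the status of /problems without a session and with one."""
    server = HTTPServer(("127.0.0.1", 0), LoginHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]
    no_proxy = urllib.request.ProxyHandler({})
    anonymous = urllib.request.build_opener(no_proxy)
    # The cookie processor stores Set-Cookie values and replays them later.
    session = urllib.request.build_opener(
        no_proxy, urllib.request.HTTPCookieProcessor(CookieJar()))
    try:
        try:
            anonymous.open(base + "/problems").close()
            before = 200
        except urllib.error.HTTPError as err:
            before = err.code
            err.close()
        form = urllib.parse.urlencode(
            {"user[handle]": "alice", "user[password]": "secret"}).encode("ascii")
        session.open(base + "/login", data=form).close()  # data= makes this a POST
        with session.open(base + "/problems") as response:
            after = response.status
        return before, after
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The anonymous opener is rejected, while the opener that carries the cookie jar gets the page, which is the same effect the spider relies on when it passes `cookiejar` along in `meta`.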
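Simple data analysis
I looked at the hardest problems (the ones with the lowest pass rate), how many problems I have passed in total, how many I have not attempted, and so on.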
import json
import matplotlib.pyplot as plt
import pickle


def total_submit_data(items):
    '''
    :param items: all the data of pat, type: list of dict
    :return: (cnt_submit, cnt_pass)
    '''
    cnt_submit = 0
    cnt_pass = 0
    for item in items:
        cnt_submit += int(item['submit_times'])
        cnt_pass += int(item['pass_times'])
    print('total submit times: %d, total pass times: %d' % (cnt_submit, cnt_pass))
    print('rate: %f' % (cnt_pass * 1.0 / cnt_submit))
    return cnt_submit, cnt_pass


def top_k_hard(items, k):
    '''
    :param items: all the data of pat, type: list of dict
    :param k: self defined number, ex: if k = 10, the function will return
              information about the 10 hardest problems
    :return: list(dict)
    '''
    size = len(items)
    if k > size:
        k = size
        print('k is larger than the number of problems, reducing k to: %d' % k)
    # Lowest pass rate first.
    new_items = sorted(items, key=lambda x: float(x['pass_rate']))
    # print(new_items[0:k])
    return new_items[0:k]


def self_practice_data(items):
    '''
    user: suncun(myself)
    pass_word: ***********
    this function aims to show the number of problems I've passed,
    # of problems tried but not passed yet, # of problems never tried
    :param items: all the data of pat, type: list of dict
    :return:
    '''
    print(items)
    cnt_pass = 0
    cnt_not_try = 0
    cnt_not_pass = 0
    total_problems = len(items)
    for item in items:
        situation = item['does_pass']
        if situation == 'Not submit':
            cnt_not_try += 1
        elif situation == 'Y':
            cnt_pass += 1
        else:
            cnt_not_pass += 1
    print('there are %d problems in total, and I\'ve passed %d of them' % (total_problems, cnt_pass))
    print('tried but not passed %d problems, still %d problems not tried yet' % (cnt_not_pass, cnt_not_try))


if __name__ == '__main__':
    # The file was written in binary mode, so read it in binary mode as well.
    # Unpickling the items requires the patSpider package to be importable.
    with open('../items_list', 'rb') as f:
        items = pickle.load(f)
    # total_submit_data(items)
    # print(top_k_hard(items, 10))
    self_practice_data(items)
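The handoff between the spider and the analysis script is just a pickle file, and it can be checked in isolation. The sketch below is an assumption-laden illustration in Python 3: it uses plain dicts with invented values instead of the project's item class, `save_items`, `load_items`, and `hardest` are hypothetical names, and `heapq.nsmallest` is swapped in as an alternative to sorting the whole list when only the k hardest problems are wanted.

```python
import heapq
import os
import pickle
import tempfile


def save_items(items, path):
    """Write the items in binary mode, as pickle requires."""
    with open(path, "wb") as f:
        pickle.dump(items, f)


def load_items(path):
    """Read the items back, again in binary mode."""
    with open(path, "rb") as f:
        return pickle.load(f)


def hardest(items, k):
    """Return the k items with the lowest pass rate, lowest first."""
    return heapq.nsmallest(k, items, key=lambda item: float(item["pass_rate"]))


def demo():
    # Invented sample data; the real file holds one entry per PAT problem.
    sample = [
        {"id": "1001", "pass_rate": "0.25"},
        {"id": "1002", "pass_rate": "0.20"},
        {"id": "1003", "pass_rate": "0.40"},
    ]
    path = os.path.join(tempfile.mkdtemp(), "items_list")
    save_items(sample, path)
    return hardest(load_items(path), 2)


if __name__ == "__main__":
    print(demo())
```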
Screenshots of some of the analysis results:
