
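Preface

I wanted to understand HTTP and web crawlers, so I decided to build a crawler myself and do some simple data analysis on top of it. There were two options: write a simple crawler from scratch with Python's urllib plus an HTML parsing library (BeautifulSoup or similar), or learn a mature, full-featured framework such as scrapy. On balance I chose the latter: a wheel I reinvent myself will not be as good as one others have already built, and if I ever need a crawler for real work, scrapy will be the more dependable choice.
What to crawl? For a first attempt I wanted a site whose data is clean and regularly formatted, ideally one record per row, which made me think of the PAT problem set.
GitHub address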

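My understanding of crawlers

Put simply, when we browse a web page we are sending a request (Request) to the page's server. After receiving the request, the server returns data (Response), which includes HTML (the earliest version of HTTP could only return HTML; that is of course no longer the case). The browser then renders that HTML, and the result is the page we see. A crawler leaves the browser out: it sends the requests automatically and collects the server's response data.
A real, complex crawler is of course not this simple: you also have to deal with cookies, anti-crawler measures, simulated login and so on. This project is only an introduction, though, so there is no rush; those topics can be picked up gradually with more practice.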
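The request/response cycle described above can be tried without a browser. The following is only a minimal Python 3 sketch using the standard library: DemoHandler, fetch and run_demo are hypothetical names introduced for illustration, and the tiny local server stands in for a real site so the example runs offline.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a real web server."""

    def do_GET(self):
        body = b"<html><body><h1>hello crawler</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(url):
    """Send one GET request and return (status code, decoded HTML)."""
    request = Request(url, headers={"User-Agent": "demo-crawler"})
    # Ignore any system proxy so the local demo server is reached directly.
    opener = build_opener(ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return response.status, response.read().decode("utf-8")


def run_demo():
    """Start the local server, fetch one page from it, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return fetch("http://127.0.0.1:%d/" % server.server_port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, html = run_demo()
    print(status, html)
```

Everything a crawler framework adds (scheduling, cookies, retries, parsing helpers) sits on top of this basic exchange.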

Using scrapy

I won't repeat scrapy's installation and introduction here; there are plenty of great resources online.

 scrapy startproject patSpider

This creates a scrapy project called patSpider. Running tree shows that the project is structured like this:

tree: project structure
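For reference, a project generated by scrapy startproject typically looks roughly like the listing below. This is a sketch of the usual layout rather than the author's actual tree output, and the exact set of generated files varies between Scrapy versions.

```text
patSpider/
    scrapy.cfg
    patSpider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```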

In the spiders folder, create a Python file containing a class that inherits from CrawlSpider; that class is a spider. (Note that one scrapy project can hold more than one spider. Each spider has a unique name to tell it apart, and running scrapy crawl followed by the spider's name from the project directory starts that spider.)

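First, look at the network traffic of the PAT login page (using Chrome's developer tools), because we need to simulate a login. Logging in just means submitting the data the server needs (user name, password and so on) in the request's form. Note that there is also an authenticity_token field: we extract it from the first response and submit it with the next request. (Copying the value by hand would also work, but the code would no longer be reusable: what if the server changes the value some time later?)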
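The token hand-off can be illustrated without scrapy. The following is a minimal Python 3 sketch using only the standard library; SAMPLE_LOGIN_PAGE, its token value, InputCollector and build_login_form are made up for illustration and do not reproduce the real PAT login page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical login page; the real page and its token value will differ.
SAMPLE_LOGIN_PAGE = """
<form action="/users/sign_in" method="post">
  <input type="hidden" name="authenticity_token" value="abc123==">
  <input type="text" name="user[handle]">
  <input type="password" name="user[password]">
</form>
"""


class InputCollector(HTMLParser):
    """Collect name -> value for every <input> that has a name attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            if "name" in attributes:
                # Inputs without a value attribute are stored as "".
                self.fields[attributes["name"]] = attributes.get("value") or ""


def build_login_form(html, handle, password):
    """Reuse the page's hidden fields (including the token), add credentials."""
    collector = InputCollector()
    collector.feed(html)
    form = dict(collector.fields)
    form["user[handle]"] = handle
    form["user[password]"] = password
    return form


form = build_login_form(SAMPLE_LOGIN_PAGE, "demo_user", "demo_password")
encoded = urlencode(form)  # the body of an application/x-www-form-urlencoded POST
print(encoded)
```

Because the token is read from the page each time, the login keeps working when the server issues a new value.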

Screenshot from 2017-06-04 20-08-11.png

Look at the fields under Form Data: these are all the fields we have to submit.
Next, look at the HTML of the PAT Advanced Level problem list we want to crawl, since that layout is what we parse against. Each problem's information is six consecutive <td> cells inside one <tr> (passed or not, problem number, problem title, pass count, submission count, pass rate). We will parse the HTML following this pattern.
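The "six cells per problem" pattern amounts to chunking a flat list. The following is a minimal Python 3 sketch with made-up cell values; FIELDS reuses the field names from the spider code, while rows_from_cells and the sample data are hypothetical.

```python
# Field names follow the item keys used by the spider in this post.
FIELDS = ("does_pass", "id", "title", "pass_times", "submit_times", "pass_rate")


def rows_from_cells(cells, width=len(FIELDS)):
    """Group a flat list of cell texts into one dict per table row."""
    rows = []
    # Stop before a trailing partial group so a short row cannot raise.
    for start in range(0, len(cells) - width + 1, width):
        group = cells[start:start + width]
        row = dict(zip(FIELDS, group))
        # An empty first cell means the problem was never submitted.
        row["does_pass"] = row["does_pass"] or "Not submit"
        rows.append(row)
    return rows


# Made-up cell texts: two complete rows followed by one stray cell.
sample_cells = [
    "Y", "1001", "Sample Problem A", "30", "100", "0.30",
    "", "1002", "Sample Problem B", "5", "50", "0.10",
    "stray",
]
rows = rows_from_cells(sample_cells)
print(rows)
```

Dropping an incomplete trailing group is a design choice of this sketch; it keeps extra cells, such as a table footer, from raising an IndexError.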

image.png

patSpider/patSpider/spiders/problem_info_spider.py

from scrapy import FormRequest
from scrapy import Request
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider
from patSpider.items import *
import pickle
from patSpider.pipelines import *

class pat_Spider(CrawlSpider):
    name = "pat"
    items = []
    call_times = 0
    # allowed_domains = []
    # URLs to crawl; the list has only two pages, so the second page's URL is given directly
    start_urls = ["https://www.patest.cn/contests/pat-a-practise",
                  "https://www.patest.cn/contests/pat-a-practise?page=2"
                  ]
    # Send the requests. These methods are never called explicitly; scrapy
    # calls them automatically once the spider starts.
    # post_login is the callback that submits the form data. A request's
    # callback is the function scrapy calls with the response once that
    # request has been downloaded.
    # See callback: https://doc.scrapy.org/en/1.3/topics/request-response.html#topics-request-response-ref-request-callback-arguments
    # start_requests(self) overrides the method inherited from CrawlSpider.
    # It runs automatically and is the first of these methods to execute.
    # The three methods work together: request the login page, submit the form
    # once the first response arrives (the site then sets its cookies), and
    # pass the same cookiejar along with later requests to stay logged in.
    # A good introduction to cookie login: http://m.itdecent.cn/p/887af1ab4200
    def start_requests(self):
        return [Request("https://www.patest.cn/users/sign_in", meta={'cookiejar': 1}, callback=self.post_login)]

    def post_login(self, response):
        post_headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Content-Type": "application/x-www-form-urlencoded",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36",
            "Referer": "https://www.patest.cn/users/sign_in",
            # header values are sent as text, so use a string rather than an int
            "Upgrade-Insecure-Requests": "1"
        }
        authenticity_token = response.xpath('//input[@name="authenticity_token"]/@value').extract()[0]
        # print(authenticity_token)
        return [FormRequest.from_response(response,
                                          url="https://www.patest.cn/users/sign_in",
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          headers=post_headers,
                                          formdata={
                                              'utf8': '?',
                                              'authenticity_token': authenticity_token,
                                              'user[handle]': 'suncun',
                                              # password hidden
                                              'user[password]': '********',
                                              'user[remember_me]': '0',
                                              'commit': "登錄"
                                          },
                                          callback=self.after_login,
                                          dont_filter=True
                                          )]


    def after_login(self, response):
        for url in self.start_urls:
            yield Request(url, meta={'cookiejar': response.meta['cookiejar']})

    # This method is called automatically, normally once per requested URL.
    # By the time execution reaches here, we hold a response that was fetched
    # while logged in; extracting data from it gives the results we need.
    # Pay particular attention to XPath syntax and to the Selector and
    # SelectorList objects.
    def parse(self, response):
        print(response.body)
        self.call_times += 1
        data_selector = response.xpath('//tr/td')
        i = 0
        while i < len(data_selector):
            six_lines = data_selector[i:i+6 ]
            i += 6
            item = PatspiderItem()
            if len(six_lines[0].xpath('.//span/text()').extract()) == 0:
                item['does_pass'] = 'Not submit'
            else:
                item['does_pass'] = six_lines[0].xpath('.//span/text()').extract()[0]
            item['id'] = six_lines[1].xpath('.//a/text()').extract()[0]
            item['title'] = six_lines[2].xpath('.//a/text()').extract()[0]
            item['pass_times'] = six_lines[3].xpath('./text()').extract()[0]
            item['submit_times'] = six_lines[4].xpath('./text()').extract()[0]
            item['pass_rate'] = six_lines[5].xpath('./text()').extract()[0]
            self.items.append(item)
            # do not use 'return' cause the item is piped to 'pipelines'
            # when the Spider is working. yield can make data collecting and
            # processing at the same time.
            yield item
        # On the last call to parse(), pickle the collected items so the data analysis can load them later
        if self.call_times == len(self.start_urls):
            with open('items_list', 'wb') as tmp_f:
                pickle.dump(self.items, tmp_f)

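Simple data analysis

I looked at the hardest problems (those with the lowest pass rate), how many problems I have passed in total, how many I have not attempted, and so on.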

import json
import matplotlib.pyplot as plt
import pickle

def total_submit_data(items):
    '''
    :param items: all the data of pat type:list of dic
    :return: (cnt_submit, cnt_pass)
    '''
    cnt_submit = 0
    cnt_pass = 0
    for item in items:
        cnt_submit += int(item['submit_times'])
        cnt_pass += int(item['pass_times'])
    print('total submit times: %d, total pass times: %d' % (cnt_submit, cnt_pass))
    print('rate: %f' % (cnt_pass * 1.0 / cnt_submit))
    return cnt_submit,cnt_pass

def top_k_hard(items, k):
    '''
    :param items: all the data of pat, type: list of dic
    :param k: self defined number, ex: if k = 10, the function will return
    information of top 10 most hard problems
    :return: list(dic)
    '''
    size = len(items)
    if k > size:
        k = size
        print('k is larger than the number of items, reducing k to: %d' % k)
    new_items = sorted(items, key=lambda x:float(x['pass_rate']))
    # print new_items[0:k]
    return new_items[0:k]

def self_practice_data(items):
    '''
    user: suncun(myself)
    pass_word: ***********
    this function aim to show, number of problems I've passed,
    # of problems tried but not passed yet,# of problems never tried
    :param items: all the data of pat, type: list of dic
    :return:
    '''
    print(items)
    cnt_pass = 0
    cnt_not_try = 0
    cnt_not_pass = 0
    total_problems = len(items)
    for item in items:
        situation = item['does_pass']
        if situation == 'Not submit':
            cnt_not_try += 1
        elif situation == 'Y':
            cnt_pass += 1
        else:
            cnt_not_pass += 1
    print('there are %d problems in total, and I\'ve passed %d problems' % (total_problems, cnt_pass))
    print('tried but not passed %d problems, still %d problems not tried yet' % (cnt_not_pass, cnt_not_try))


if __name__ == '__main__':
    items = []
    # the spider wrote this file with pickle in 'wb' mode, so read it in binary mode
    with open('../items_list', 'rb') as f:
        items = pickle.load(f)
    # total_submit_data(items)
    # print top_k_hard(items, 10)
    self_practice_data(items)
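The same tallies can be checked without crawled data. The following is only a Python 3 sketch over made-up rows that follow the item format above; it swaps in heapq.nsmallest and collections.Counter for the sort and the manual counters, and hardest and practice_summary are hypothetical names, not the author's code.

```python
import heapq
from collections import Counter

# Made-up rows in the same shape as the crawled items.
sample_items = [
    {"id": "1001", "does_pass": "Y", "pass_rate": "0.30"},
    {"id": "1002", "does_pass": "Not submit", "pass_rate": "0.10"},
    {"id": "1003", "does_pass": "N", "pass_rate": "0.22"},
    {"id": "1004", "does_pass": "Y", "pass_rate": "0.45"},
]


def hardest(items, k):
    """Return the k items with the lowest pass rate, hardest first."""
    # nsmallest returns every item, sorted, when k exceeds len(items).
    return heapq.nsmallest(k, items, key=lambda item: float(item["pass_rate"]))


def practice_summary(items):
    """Count passed, tried-but-not-passed and never-tried problems."""
    counts = Counter()
    for item in items:
        status = item["does_pass"]
        if status == "Not submit":
            counts["not_tried"] += 1
        elif status == "Y":
            counts["passed"] += 1
        else:
            counts["tried_not_passed"] += 1
    return counts


print([item["id"] for item in hardest(sample_items, 2)])
print(dict(practice_summary(sample_items)))
```

Returning values instead of printing inside the functions makes the results easier to reuse, for example as input to a matplotlib chart.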

Screenshot of part of the analysis results:

image.png

Python Scrapy: Crawling PAT Site Data (1.0: Crawling the Problem Data)
