
Want to scrape data from a website? Log in first! For most large sites, logging in is the first hurdle between you and their data. Follow along to learn how to simulate a website login.

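Why simulate a login?

Websites on the internet come in two kinds: those that require a login and those that don't. (Yes, that is stating the obvious!)

For sites that don't require a login, we can simply fetch the data directly, which is quick and painless. For sites that show data only after login, or show only part of it to anonymous visitors, we have no choice but to log in. (Unless you hack straight into their database, and hacker moves should be used with caution!)

So for sites that require a login, we need to simulate one: partly to get the pages and data available after login, and partly to obtain the post-login cookies so we can reuse them in later requests.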

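Approaches to simulating a login

Mention simulated login and most people's first reaction is: Pfft, how hard can that be? Open a browser, enter the URL, find the username and password fields, type in the credentials, click the login button, done!

Nothing wrong with that approach; it is exactly how a selenium-based simulated login works.

Beyond that, Requests can send a request that carries the cookies of an already logged-in session, which effectively bypasses the login step.

We can also use Requests to send a POST request that carries the information the site's login requires.

Those are the three common approaches to simulating a login. Scrapy uses the latter two; the first one is, after all, specific to selenium.

Scrapy's approaches to simulating a login:

1. Send requests that carry the cookies of an already logged-in session
2. Send a POST request that carries the information the site's login requires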

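Simulated login examples

Logging in by carrying cookies

Every login method has its pros, cons, and use cases. Here is when carrying cookies makes sense:

1. The cookie has a very long expiry, so after logging in once we don't have to worry about the session expiring. This is common on sloppily built sites.
2. We can fetch all the data we need before the cookie expires.
3. We can combine it with other programs, for example using selenium to log in and save the resulting cookies locally, then reading those local cookies before Scrapy sends its requests.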
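
Scenario 3 can be sketched roughly as follows. This is only a minimal sketch under assumptions: the file name cookies.json and the helper load_cookies are hypothetical, and the file is assumed to hold a JSON list of objects with name and value keys, which is the shape selenium's get_cookies() returns.

```python
import json
import os
import tempfile


def load_cookies(path):
    """Read cookies saved as a JSON list of {"name": ..., "value": ...} objects
    and return a plain dict that can be passed to scrapy.Request(cookies=...)."""
    with open(path, encoding="utf-8") as f:
        saved = json.load(f)
    return {c["name"]: c["value"] for c in saved}


# Demo with made-up data standing in for what a selenium session would save.
demo_path = os.path.join(tempfile.mkdtemp(), "cookies.json")
with open(demo_path, "w", encoding="utf-8") as f:
    json.dump(
        [
            {"name": "id", "value": "42", "domain": ".example.com"},
            {"name": "t", "value": "abc"},
        ],
        f,
    )

cookies = load_cookies(demo_path)
print(cookies)
```

In a spider, the returned dict could then be handed to the Request built in start_requests via its cookies argument.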

Below we demonstrate this method by logging in to Renren (renren.com), a site most of us forgot about long ago.

First, create a Scrapy project:

> scrapy startproject login

To let the crawl go through, first set the robots.txt option in settings to False:

ROBOTSTXT_OBEY = False

Next, create a spider:

> scrapy genspider renren renren.com

Open renren.py in the spiders directory. It contains the following code:

# -*- coding: utf-8 -*-
import scrapy


class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/']

    def parse(self, response):
        pass

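As we know, start_urls holds the first page addresses to crawl; they are the starting pages of our scrape. Suppose I want to scrape my Renren profile page. After logging in to Renren and opening my profile, the URL is http://www.renren.com/972990680/profile . If I put this URL straight into start_urls and request it directly, do you think it will work?

It won't, right? We haven't logged in yet, so the profile page isn't visible at all.

So where should the login code go?

One thing is certain: we must log in before the framework requests the pages in start_urls.

Let's open the source of the Spider class and find this piece of code: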

    def start_requests(self):
        cls = self.__class__
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

    def make_requests_from_url(self, url):
        """ This method is deprecated. """
        return Request(url, dont_filter=True)

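From this source we can see that the method takes each URL from start_urls and builds a Request object to fetch it. That means we can override start_requests to do some work of our own, namely adding the cookies when the Request object is constructed.

The spider with the overridden start_requests method looks like this: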

# -*- coding: utf-8 -*-
import scrapy
import re

class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    # Profile page URL
    start_urls = ['http://www.renren.com/972990680/profile']

    def start_requests(self):
        # Cookies copied from a request in Chrome's developer tools after logging in
        cookiesstr = "anonymid=k3miegqc-hho317; depovince=ZGQT; _r01_=1; JSESSIONID=abcDdtGp7yEtG91r_U-6w; ick_login=d2631ff6-7b2d-4638-a2f5-c3a3f46b1595; ick=5499cd3f-c7a3-44ac-9146-60ac04440cb7; t=d1b681e8b5568a8f6140890d4f05c30f0; societyguester=d1b681e8b5568a8f6140890d4f05c30f0; id=972990680; xnsid=404266eb; XNESSESSIONID=62de8f52d318; jebecookies=4205498d-d0f7-4757-acd3-416f7aa0ae98|||||; ver=7.0; loginfrom=null; jebe_key=8800dc4d-e013-472b-a6aa-552ebfc11486%7Cb1a400326a5d6b2877f8c884e4fe9832%7C1575175011619%7C1%7C1575175011639; jebe_key=8800dc4d-e013-472b-a6aa-552ebfc11486%7Cb1a400326a5d6b2877f8c884e4fe9832%7C1575175011619%7C1%7C1575175011641; wp_fold=0"
        # Split each pair on the first "=" only, so values that contain "=" stay intact
        cookies = {i.split("=", 1)[0]: i.split("=", 1)[1] for i in cookiesstr.split("; ")}

        # Request that carries the cookies
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookies
        )

    def parse(self, response):
        # Find the keyword "閑歡" on the profile page and print the matches
        print(re.findall("閑歡", response.body.decode()))

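I first logged in to Renren properly with my account, then used Chrome's developer tools to copy the cookies from one of the requests and added them to the Request object. In the parse method I then search the page for the keyword "閑歡" and print the matches.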
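
The dict comprehension in the spider splits each pair on "=". A slightly more defensive version could look like the sketch below; the helper name parse_cookie_header and the sample string are made up for illustration. It splits on the first "=" only and skips empty or malformed fragments. If a name occurs twice, as jebe_key does in the string above, the later value wins.

```python
def parse_cookie_header(header):
    """Turn a "k1=v1; k2=v2" Cookie header string into a dict."""
    cookies = {}
    for pair in header.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)  # split on the first "=" only
        cookies[name] = value
    return cookies


sample = "id=972990680; ver=7.0; token=abc=="
print(parse_cookie_header(sample))
```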

Let's run the spider:

> scrapy crawl renren

In the run log we can see the following lines:

2019-12-01 13:06:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.renren.com/972990680/profile?v=info_timeline> (referer: http://www.renren.com/972990680/profile)
['閑歡', '閑歡', '閑歡', '閑歡', '閑歡', '閑歡', '閑歡']
2019-12-01 13:06:55 [scrapy.core.engine] INFO: Closing spider (finished)

We can see that the information we wanted has been printed.

We can add COOKIES_DEBUG = True to the settings to see how cookies are passed along.
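
For reference, the relevant lines in the project's settings.py might look like this. COOKIES_ENABLED is Scrapy's switch for the cookies middleware and already defaults to True; it is listed only as a reminder that COOKIES_DEBUG has nothing to log when cookies are turned off.

```python
# settings.py (fragment)
COOKIES_ENABLED = True  # default value; the cookies middleware must be active
COOKIES_DEBUG = True    # log the Cookie and Set-Cookie headers of requests and responses
```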

With this setting in place, the log shows entries like the following:

2019-12-01 13:06:55 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://www.renren.com/972990680/profile?v=info_timeline>
Cookie: anonymid=k3miegqc-hho317; depovince=ZGQT; _r01_=1; JSESSIONID=abcDdtGp7yEtG91r_U-6w; ick_login=d2631ff6-7b2d-4638-a2f5-c3a3f46b1595; ick=5499cd3f-c7a3-44ac-9146-60ac04440cb7; t=d1b681e8b5568a8f6140890d4f05c30f0; societyguester=d1b681e8b5568a8f6140890d4f05c30f0; id=972990680; xnsid=404266eb; XNESSESSIONID=62de8f52d318; jebecookies=4205498d-d0f7-4757-acd3-416f7aa0ae98|||||; ver=7.0; loginfrom=null; jebe_key=8800dc4d-e013-472b-a6aa-552ebfc11486%7Cb1a400326a5d6b2877f8c884e4fe9832%7C1575175011619%7C1%7C1575175011641; wp_fold=0; JSESSIONID=abc84VF0a7DUL7JcS2-6w

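Logging in by sending a POST request

We will use logging in to GitHub as the example for this method.

First, create a spider named github:

> scrapy genspider github github.com

To simulate a login with a POST request, we first need to know the login URL and the parameters the login requires. Using the browser's developer tools, we can see that the login request looks like this: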

github_login_request.png

From the request we can see that the login URL is https://github.com/session , and the login parameters are:

commit: Sign in
utf8: ✓
authenticity_token: bbpX85KY36B7N6qJadpROzoEdiiMI6qQ5L7hYFdPS+zuNNFSKwbW8kAGW5ICyvNVuuY5FImLdArG47358RwhWQ==
ga_id: 101235085.1574734122
login: xxx@qq.com
password: xxx
webauthn-support: supported
webauthn-iuvpaa-support: unsupported
required_field_f0e5: 
timestamp: 1575184710948
timestamp_secret: 574aa2760765c42c07d9f0ad0bbfd9221135c3273172323d846016f43ba761db

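That is quite a pile of parameters. Phew!

Apart from our username and password, everything else has to be taken from the login page. One parameter deserves attention: required_field_f0e5. Its name is different on every page load, so it is clearly generated dynamically, but its value is always empty. That saves us one parameter: we can simply not send it.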
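
If a site did insist on receiving such a dynamically named field, its current name could be looked up by its stable prefix. A minimal sketch, assuming the suffix is a short hex string as in required_field_f0e5; the HTML fragment is invented for illustration and the real markup may differ. With Scrapy selectors, an XPath such as //input[starts-with(@name, 'required_field_')]/@name should serve the same purpose.

```python
import re

# Made-up fragment imitating a login form; not copied from the real page.
html = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="required_field_f0e5" hidden="hidden">
  <input type="text" name="login">
</form>
"""

# The suffix changes on every page load, so match on the stable prefix.
match = re.search(r'name="(required_field_[0-9a-f]+)"', html)
field_name = match.group(1) if match else None
print(field_name)
```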

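The other parameters are located on the page as shown in the figure below:

github_login_params.png

We use XPath to extract each parameter. The code is below (I have replaced the username and password with xxx; put in your real username and password when you run it):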

# -*- coding: utf-8 -*-
import scrapy
import re

class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    # Login page URL
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # Extract the request parameters from the login form
        commit = response.xpath("//input[@name='commit']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        ga_id = response.xpath("//input[@name='ga_id']/@value").extract_first()
        webauthn_support = response.xpath("//input[@name='webauthn-support']/@value").extract_first()
        webauthn_iuvpaa_support = response.xpath("//input[@name='webauthn-iuvpaa-support']/@value").extract_first()
        # Dynamically named field; left out because its value is always empty
        # required_field_4ed5 = response.xpath("//input[@name='required_field_4ed5']/@value").extract_first()
        timestamp = response.xpath("//input[@name='timestamp']/@value").extract_first()
        timestamp_secret = response.xpath("//input[@name='timestamp_secret']/@value").extract_first()

        # Build the POST parameters
        post_data = {
            "commit": commit,
            "utf8": utf8,
            "authenticity_token": authenticity_token,
            "ga_id": ga_id,
            "login": "xxx@qq.com",
            "password": "xxx",
            "webauthn-support": webauthn_support,
            "webauthn-iuvpaa-support": webauthn_iuvpaa_support,
            # "required_field_4ed5": required_field_4ed5,
            "timestamp": timestamp,
            "timestamp_secret": timestamp_secret
        }

        # Print the parameters
        print(post_data)

        # Send the POST request
        yield scrapy.FormRequest(
            "https://github.com/session",  # login request URL
            formdata=post_data,
            callback=self.after_login
        )

    # Runs after a successful login
    def after_login(self, response):
        # Find the string "Issues" on the page and print the matches
        print(re.findall("Issues", response.body.decode()))

We send the POST request with FormRequest. Running the spider produces an error; let's look at the error output:

2019-12-01 15:14:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/login> (referer: None)
{'commit': 'Sign in', 'utf8': '✓', 'authenticity_token': '3P4EVfXq3WvBM8fvWge7FfmRd0ORFlS6xGcz5mR5A00XnMe7GhFaMKQ8y024Hyy5r/RFS9ZErUDr1YwhDpBxlQ==', 'ga_id': None, 'login': 'xxx@qq.com', 'password': 'xxx', 'webauthn-support': 'unknown', 'webauthn-iuvpaa-support': 'unknown', 'timestamp': '1575184487447', 'timestamp_secret': '6a8b589266e21888a4635ab0560304d53e7e8667d5da37933844acd7bee3cd19'}
2019-12-01 15:14:47 [scrapy.core.scraper] ERROR: Spider error processing <GET https://github.com/login> (referer: None)
Traceback (most recent call last):
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/cxhuan/Documents/python_workspace/scrapy_projects/login/login/spiders/github.py", line 40, in parse
    callback=self.after_login
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/http/request/form.py", line 32, in __init__
    querystr = _urlencode(items, self.encoding)
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/http/request/form.py", line 73, in _urlencode
    for k, vs in seq
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/http/request/form.py", line 74, in <listcomp>
    for v in (vs if is_listlike(vs) else [vs])]
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/utils/python.py", line 107, in to_bytes
    'object, got %s' % type(text).__name__)
TypeError: to_bytes must receive a unicode, str or bytes object, got NoneType
2019-12-01 15:14:47 [scrapy.core.engine] INFO: Closing spider (finished)

The error suggests that one of the parameter values came back as None. Looking at the printed parameters, ga_id is indeed None. Let's change the code so that when ga_id is None we send an empty string instead.

The modified code:

ga_id = response.xpath("//input[@name='ga_id']/@value").extract_first()
if ga_id is None:
    ga_id = ""
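
The same guard can be applied to every parameter at once, so that any field missing from the page is sent as an empty string. A minimal sketch; the helper name fill_missing is made up, and the sample dict is a shortened stand-in for post_data. Scrapy's extract_first also accepts a default argument, so extract_first(default="") is another way to avoid None for a single field.

```python
def fill_missing(params):
    """Return a copy of params with every None value replaced by ""."""
    return {k: ("" if v is None else v) for k, v in params.items()}


post_data = {"commit": "Sign in", "ga_id": None, "login": "xxx@qq.com"}
print(fill_missing(post_data))
```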

Run the spider again and look at the result this time:

Set-Cookie: _gh_sess=QmtQRjB4UDNUeHdkcnE4TUxGbVRDcG9xMXFxclA1SDM3WVhqbFF5U0wwVFp0aGV1UWxYRWFSaXVrZEl0RnVjTzFhM1RrdUVabDhqQldTK3k3TEd3KzNXSzgvRXlVZncvdnpURVVNYmtON0IrcGw1SXF6Nnl0VTVDM2dVVGlsN01pWXNUeU5XQi9MbTdZU0lTREpEMllVcTBmVmV2b210Sm5Sbnc0N2d5aVErbjVDU2JCQnA5SkRsbDZtSzVlamxBbjdvWDBYaWlpcVR4Q2NvY3hwVUIyZz09LS1lMUlBcTlvU0F0K25UQ3loNHFOZExnPT0%3D--8764e6d2279a0e6960577a66864e6018ef213b56; path=/; secure; HttpOnly

2019-12-01 15:25:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/> (referer: https://github.com/login)
['Issues', 'Issues']
2019-12-01 15:25:18 [scrapy.core.engine] INFO: Closing spider (finished)

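We can see that the information we wanted has been printed: the login succeeded.

For form requests, Scrapy's FormRequest also offers another method, from_response, which picks up the form in the page automatically; we only need to pass in the username and password to send the request.

Here is the source of this method: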

    @classmethod
    def from_response(cls, response, formname=None, formid=None, formnumber=0, formdata=None,
                      clickdata=None, dont_click=False, formxpath=None, formcss=None, **kwargs):

        kwargs.setdefault('encoding', response.encoding)

        if formcss is not None:
            from parsel.csstranslator import HTMLTranslator
            formxpath = HTMLTranslator().css_to_xpath(formcss)

        form = _get_form(response, formname, formid, formnumber, formxpath)
        formdata = _get_inputs(form, formdata, dont_click, clickdata, response)
        url = _get_form_url(form, kwargs.pop('url', None))

        method = kwargs.pop('method', form.method)
        if method is not None:
            method = method.upper()
            if method not in cls.valid_form_methods:
                method = 'GET'

        return cls(url=url, method=method, formdata=formdata, **kwargs)

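The method takes quite a few parameters, all related to locating the form. If the login page contains only one form, Scrapy can locate it easily. But what if the page has several forms? Then we use these parameters to tell Scrapy which one is the login form.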

Of course, this method relies on the action attribute of the page's form containing the URL that the login request is submitted to.
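
The action attribute is often a relative path rather than a full URL. As far as I know, Scrapy resolves it against the page URL in the same way the standard library's urljoin does. The sketch below only illustrates that resolution step and does not call Scrapy itself; the /session action is an assumption inferred from the login URL seen earlier, and the second URL is made up.

```python
from urllib.parse import urljoin

page_url = "https://github.com/login"

# A relative action is resolved against the URL of the page that holds the form.
relative_target = urljoin(page_url, "/session")
# An absolute action is used as is.
absolute_target = urljoin(page_url, "https://example.com/auth")

print(relative_target)
print(absolute_target)
```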

In the GitHub example, the login page has only one login form, so we only need to pass in the username and password. The code:

# -*- coding: utf-8 -*-
import scrapy
import re

class Github2Spider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['http://github.com/login']

    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            response,  # the form is located in the response automatically
            formdata={"login": "xxx@qq.com", "password": "xxx"},
            callback=self.after_login
        )

    # Runs after a successful login
    def after_login(self, response):
        # Find the string "Issues" on the page and print the matches
        print(re.findall("Issues", response.body.decode()))

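After running the spider we see the same result as before.

Isn't this way of sending the request much simpler? No need to painstakingly dig out every request parameter. Pretty amazing, isn't it?

Summary

This article introduced several ways to simulate a website login with Scrapy; try them out for yourself. It does not cover sites with CAPTCHAs, which are a complex and difficult topic of their own that I will write about another time when I get the chance.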

Web Scraping in Practice: Simulating Login with Scrapy
