'''
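CrawlSpider is a subclass of Spider. The Spider class is designed to crawl only the pages listed in start_urls, whereas CrawlSpider defines Rule objects that provide a convenient mechanism for following links: it extracts links from the crawled pages and keeps crawling them.

Creating the spider file

scrapy genspider -t crawl <spider_name> <domain>

Class the spider inherits from: CrawlSpider

rules: holds the Rule objects (a tuple or a list)

Rule: a custom extraction rule; a Request object is built automatically for every extracted URL.
It sets the callback that parses the response and whether to follow (extract further URLs from the response).
process_links: intercepts the links extracted by the Rule; it receives and returns a list of Link objects.
LinkExtractor: an object that defines the (regex) rules for extracting URLs.
Note: if no callback is set in a Rule, follow defaults to True.
Note: do not implement the parse method in a CrawlSpider; Scrapy's documentation warns that CrawlSpider uses parse for its own rule logic.
Note: to process the response of a start URL, override the parse_start_url method.
When CrawlSpider is a good fit:
1. The page structure is fairly simple.
2. Most pages are static.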
'''
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from chinazcrawlspider.items import ChinazcrawlspiderItem

class ChinazSpider(CrawlSpider):
    name = 'chinaz'
    allowed_domains = ['chinaz.com']
    start_urls = ['http://top.chinaz.com/hangyemap.html']
    # Holds the custom link-extraction Rule objects (a list or a tuple).
    # CrawlSpider builds a Request for every URL extracted by the rules and hands it to the engine.
    """
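    LinkExtractor: rules (regular expressions) for extracting links
    Commonly used parameters:
    allow=(): regex pattern(s) a URL must match to be extracted
    deny=(): regex pattern(s) for URLs to exclude (takes precedence over allow)
    allow_domains=(): domains from which URLs may be extracted
    deny_domains=(): domains whose URLs are never extracted (takes precedence over allow_domains)
    restrict_xpaths=(): XPath expression(s) selecting the regions in which to look for links
    restrict_css=(): CSS selector(s) selecting the regions in which to look for links
    unique=True: keep only one copy of duplicate URLs
    strip=True: strip leading and trailing whitespace from extracted attribute values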
    """
    """
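    Rule
    link_extractor: a LinkExtractor object
    callback=None: the callback used to parse each response
    follow=None: whether to follow links found in the responses (e.g. next pages that match the rule)
    process_links: optional callback that intercepts the extracted links before requests are built
    (e.g. to fix URLs that cannot be taken directly from the tag, or to splice URL anchors)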
    """
    rules = (
        # Rule for category and pagination links
        Rule(
            LinkExtractor(
                # Regex the URL must match; dots are escaped so they match literally
                allow=r'http://top\.chinaz\.com/hangye/index_.*?\.html',
                # XPath regions in which to look for URLs that match the regex
                restrict_xpaths=(
                    '//div[@class="Taright"]',       # category links
                    '//div[@class="ListPageWrap"]',  # pagination links
                ),
            ),
            callback='parse_item',
            follow=True,  # keep following next pages that match allow
        ),
    )
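    # Do not define a parse() method in a CrawlSpider

    def parse_start_url(self, response):
        """
        Override this method to process the response of a start URL.
        :param response: response for a URL in start_urls
        :return: items or requests produced from that response
        """
        # Delegate to parse_item; the bare attribute access `self.parse_item` did nothing
        return self.parse_item(response)
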
    def parse_item(self, response):
        """
        Parse the site entries on a listing page.
        :param response: response for a category or pagination page
        :return: yields one ChinazcrawlspiderItem per listed site
        """
        webInfos = response.xpath('//ul[@class="listCentent"]/li')
        for webInfo in webInfos:
            web_item = ChinazcrawlspiderItem()
            # Cover image
            web_item['coverImage'] = webInfo.xpath('.//div[@class="leftImg"]/a/img/@src').extract_first('')
            # Title
            web_item['title'] = webInfo.xpath('.//h3[@class="rightTxtHead"]/a/text()').extract_first('')
            # Domain name
            web_item['domenis'] = webInfo.xpath(
                './/h3[@class="rightTxtHead"]/span[@class="col-gray"]/text()').extract_first('')
            # Weekly rank
            web_item['weekRank'] = webInfo.xpath('.//div[@class="RtCPart clearfix"]/p[1]/a/text()').extract_first('')
            # Backlink count
            web_item['ulink'] = webInfo.xpath('.//div[@class="RtCPart clearfix"]/p[4]/a/text()').extract_first('')
            # Site description
            web_item['info'] = webInfo.xpath('.//p[@class="RtCInfo"]/text()').extract_first('')
            # Score: first run of digits, or '' when none is found (avoids an IndexError)
            web_item['score'] = webInfo.xpath('.//div[@class="RtCRateCent"]/span/text()').re_first(r'\d+', '')
            # Rank
            web_item['rank'] = webInfo.xpath('.//div[@class="RtCRateCent"]/strong/text()').extract_first('')
            print(web_item)

            yield web_item
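
The allow argument of LinkExtractor is a regular expression, so an unescaped dot matches any character, not just a literal dot. The following is a minimal sketch using only the standard re module that compares an unescaped and an escaped form of the listing-page pattern. It assumes the pattern is applied with search semantics. Both sample URLs and the helper name matches are made up for illustration.

```python
import re

# Pattern with unescaped dots: each '.' is a wildcard for any character.
loose = re.compile(r'http://top.chinaz.com/hangye/index_.*?.html')
# Pattern with escaped dots: each '\.' matches only a literal dot.
strict = re.compile(r'http://top\.chinaz\.com/hangye/index_.*?\.html')

# Illustrative URLs (assumptions, not taken from the site).
good = 'http://top.chinaz.com/hangye/index_yule.html'
odd = 'http://topXchinaz.com/hangye/index_yule_html'


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


print(matches(loose, good), matches(strict, good))
print(matches(loose, odd), matches(strict, odd))
```

In practice allowed_domains and restrict_xpaths already narrow the candidates, so the escaped form is mainly a safeguard against accidental matches.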

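A process_links callback receives the list of links extracted by a Rule and returns the list that should actually be requested. The sketch below shows one possible callback that strips URL fragments (anchors) and drops the duplicates this creates. StubLink is a hypothetical stand-in that models only the url attribute of Scrapy's Link object, so the example runs without Scrapy. The name clean_links is likewise an assumption.

```python
from dataclasses import dataclass
from urllib.parse import urldefrag


@dataclass
class StubLink:
    """Hypothetical stand-in for scrapy.link.Link; only `url` is modeled."""
    url: str


def clean_links(links):
    """Strip the fragment from each link URL and drop duplicates, keeping order."""
    seen = set()
    kept = []
    for link in links:
        link.url = urldefrag(link.url).url  # remove '#anchor' if present
        if link.url not in seen:
            seen.add(link.url)
            kept.append(link)
    return kept


# Illustrative input (made-up URLs).
sample = [
    StubLink('http://example.com/list#top'),
    StubLink('http://example.com/list'),
    StubLink('http://example.com/page2#items'),
]
print([link.url for link in clean_links(sample)])
```

With Scrapy installed, a function like this could be passed as Rule(..., process_links=clean_links), or defined as a spider method and referenced by its name as a string.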
Scrapy general-purpose spider: CrawlSpider
