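First, a bit of rambling:
I have recently been learning the Scrapy framework, and because I have no Python programming background it has been hard going.
I found a free video course online, watched it, and jumped straight in.
The tutorial is "Python最火爬蟲框架Scrapy入門與實踐" (an introduction to and practice with Scrapy, Python's most popular crawler framework), and it really is well taught.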
1. Create the project
First, use the scrapy command to create a crawler project:
scrapy startproject bingScrapy
Then change into the bingScrapy directory and generate the spider file:
cd bingScrapy
scrapy genspider bingScrapy ioliu.cn
The spider uses https://bing.ioliu.cn/ as the source of Bing wallpapers. Thanks to its author for providing such an excellent project.
2. Write the spider files
bingScrapy\spiders\bingScrapy.py
# -*- coding: utf-8 -*-
import scrapy

from bingScrapy.items import BingscrapyItem


class BingscrapySpider(scrapy.Spider):
    name = 'bingScrapy'
    allowed_domains = ['ioliu.cn']
    start_urls = ['https://bing.ioliu.cn/?p=1/']

    def parse(self, response):
        # One node per wallpaper card on the listing page
        container = response.xpath("//div[@class='container']/div[@class='item']/div")
        # href of the "next page" link (second anchor in the pager)
        next_page = response.xpath("//div[@class='page']/a[2]/@href").extract_first()
        self.logger.debug('next page: %s', next_page)
        if next_page:
            # Resolve the relative href against the current page URL
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
        for i in container:
            item = BingscrapyItem()
            item['time'] = i.xpath(".//div[@class='description']/p[1]/em[1]/text()").extract_first()
            item['name'] = i.xpath(".//div[@class='description']/h3/text()").extract_first()
            item['image_urls'] = i.xpath(".//img/@src").extract()
            yield item
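The spider follows a next-page link whose href is taken from the page itself. As a minimal sketch of how such an href resolves against the page URL, the snippet below uses `urljoin` from the standard library. The sample href values are hypothetical (the real pager markup may differ), and `next_page_url` is an illustrative helper, not part of the project.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL.

    Returns None when there is no href (for example on the last page).
    """
    if not href:
        return None
    return urljoin(page_url, href)


# Hypothetical hrefs; the real values on the site may look different.
current = "https://bing.ioliu.cn/?p=1"
for href in ["/?p=2", "?p=2", "https://bing.ioliu.cn/?p=2", None]:
    print(repr(href), "->", next_page_url(current, href))
```

Resolving against the current URL handles root-relative, query-only and absolute hrefs alike, whereas prepending a fixed host only works for the root-relative form.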
bingScrapy\items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class BingscrapyItem(scrapy.Item):
    name = scrapy.Field()        # wallpaper title
    time = scrapy.Field()        # date shown on the card, used as the folder name
    image_urls = scrapy.Field()  # list of image URLs read by the images pipeline
bingScrapy\pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import re

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class BingscrapyPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request every image URL; if a single URL were passed instead of
        # a list, you could yield one request without looping.
        for image_url in item['image_urls']:
            # src = re.sub(r'1920x1080', '1920x1200', image_url)
            # The item comes from the spider and is handed to file_path
            # below through the request's meta.
            yield scrapy.Request(image_url, meta={'item': item})

    # Rename the file. Without this override the image name is a hash,
    # that is, a meaningless string of characters.
    # The keyword-only `item` parameter is accepted because newer Scrapy
    # versions pass it; it is not used here.
    def file_path(self, request, response=None, info=None, *, item=None):
        item = request.meta['item']
        name = item['name']
        # Replace characters that Windows does not allow in file names;
        # without this step names come out garbled or the download fails.
        name = re.sub(r'[\\/:*?"<>|]', '_', name)
        # The key to storing images in separate folders: one folder per date
        folder = item['time']
        filename = u'full/{0}/{1}{2}'.format(folder, name, '.jpg')
        return filename
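The renaming step in `file_path` can be tried on its own, outside Scrapy. The sketch below is a standalone approximation: `image_path` is a hypothetical helper, the sample title and date are made up, and stripping surrounding whitespace from the name and folder is an added assumption rather than something the pipeline above does.

```python
import re

# Characters that Windows does not allow in file names.
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def image_path(name, folder, ext=".jpg"):
    """Build a 'full/<folder>/<name><ext>' path with unsafe characters replaced."""
    safe_name = ILLEGAL_CHARS.sub("_", name.strip())
    safe_folder = ILLEGAL_CHARS.sub("_", folder.strip())
    return "full/{0}/{1}{2}".format(safe_folder, safe_name, ext)


# Made-up sample values, only to show the substitution.
print(image_path('Lake: "Blue"?', " 2019-01-01 "))
```

Scrapy treats the returned path as relative to `IMAGES_STORE`, so each date ends up as its own subfolder under `full`.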
bingScrapy\settings.py
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
# Raw string: in a normal string literal '\b' would be read as a backspace
IMAGES_STORE = r'D:\bing'
ITEM_PIPELINES = {
    'bingScrapy.pipelines.BingscrapyPipeline': 300,
}
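A note on the `IMAGES_STORE` path: in an ordinary Python string literal a backslash can start an escape sequence, and `\b` happens to be the backspace character. The short check below illustrates the difference between a plain literal, a raw string and a forward-slash path; the variable names are only for illustration.

```python
plain = 'D:\bing'    # '\b' is parsed as a single backspace character
raw = r'D:\bing'     # raw string keeps the backslash and the 'b'
forward = 'D:/bing'  # forward slashes avoid the problem altogether

print(repr(plain))    # shows the \x08 backspace inside the string
print(repr(raw))
print(len(plain), len(raw))
```

Either the raw string or the forward-slash form gives the intended directory; the plain literal silently points somewhere else.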
With that, the crawler is basically complete!
3. Run it and check the results
scrapy crawl bingScrapy


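That is the result of my recent study. The code may not be as rigorous or efficient as it could be, but for a non-professional I am fairly happy with it.
One reminder: if you run into errors while running the code, searching Google for the error message is usually the quickest fix.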