Scraping Internship Postings with Python (A First Look at Scrapy)

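Original source: Cer_ml

1. Goal

I have a course project due in the next couple of days: scrape internship postings from the internship boards of the Shuimu community (newsmth) and Peking University's Weiming BBS, and store them in a MongoDB database.
I had been wanting to learn the Scrapy framework anyway, so I happily decided to build it with Scrapy.

2. Introduction

Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It uses the Twisted asynchronous networking library to handle network communication. Overall architecture: [diagram not included]

The most important resource for learning Scrapy is the official documentation, which is also the main reference for this article.
I won't go into installing Scrapy here: once its dependencies are in place, a single pip command does it.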

pip install scrapy

3. Getting started

3.1 First, create a new Scrapy project.

Change into your target directory and run the following command to create the project intern.

$ scrapy startproject intern

The directory layout is as follows:

.
├── scrapy.cfg
└── intern
  ├── __init__.py
  ├── items.py
  ├── pipelines.py
  ├── settings.py
  └── spiders
    └── __init__.py

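Keep this layout firmly in mind.

scrapy.cfg: the global configuration file
intern/: the project's Python module
intern/items.py: the project's items file, which defines the structure used to store scraped data
intern/pipelines.py: the project's pipelines file, which cleans, filters, and stores the scraped data
intern/settings.py: the project's settings file
intern/spiders: the directory that holds the spiders

3.2 Write items.py.

Define the item's fields as follows: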

import scrapy


class InternItem(scrapy.Item):
    title = scrapy.Field()
    href = scrapy.Field()
    author = scrapy.Field()
    time = scrapy.Field()
    content = scrapy.Field()
    is_dev = scrapy.Field()
    is_alg = scrapy.Field()
    is_fin = scrapy.Field()
    base_url_index = scrapy.Field()

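Defining fields is simple: assign scrapy.Field() to each one.
Usage: to access an item's title, index it like a Python dict, i.e. item['title'].

3.3 Write the spider.

Now we finally get to writing the spider. Take the spider for the Shuimu community as an example. In the spiders directory, create smSpider.py.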

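import scrapy
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher  # import path used by older Scrapy releases
from selenium import webdriver  # webdriver.PhantomJS exists only in Selenium 3 and earlier

# getPlatform() is the author's own helper (not shown in this article);
# the code below expects it to return 'linux' or 'win'.


class SMSpider(scrapy.spiders.CrawlSpider):
    '''
    To build a spider, subclass scrapy.spider.BaseSpider and define three
    main, mandatory members:
    name: the spider's identifier. It must be unique, so different spiders
        need different names.
    start_urls: the list of URLs the spider starts from. The first pages
        downloaded are these URLs; further URLs are derived from them.
    parse(): the spider's callback. It is called with the Response object
        returned for each URL as its only argument, and is responsible for
        parsing the returned data, extracting items, and following more URLs.
    '''
    name = "sm"
    base_url = 'http://www.newsmth.net/nForum/board/Intern'
    # Listing pages 1 to 3. Written out explicitly because, on Python 3, a
    # comprehension inside a class body cannot see base_url.
    start_urls = [base_url, base_url + '?p=2', base_url + '?p=3']
    platform = getPlatform()

    def __init__(self, *args, **kwargs):
        super(SMSpider, self).__init__(*args, **kwargs)
        if self.platform == 'linux':
            self.driver = webdriver.PhantomJS()
        elif self.platform == 'win':
            self.driver = webdriver.PhantomJS(
                executable_path='F:/runtime/python/phantomjs-2.1.1-windows/bin/phantomjs.exe')
        self.driver.set_page_load_timeout(10)
        # Shut PhantomJS down when the spider closes
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.quit()

    def parse(self, response):
        ...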

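Let's go through this code step by step, from the simple parts to the deeper ones.
First, SMSpider inherits from CrawlSpider, which in turn inherits from BaseSpider (called scrapy.Spider in current releases). BaseSpider is usually enough; CrawlSpider adds support for crawling Rules, but I don't actually use them here. Three members must be defined:
name: the spider's name (unique).
start_urls: the list of URLs the spider starts crawling from.
parse(): the spider's parsing method. It is called with a response object, which is the response to the request for one of those URLs.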
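As a small illustration of how the start_urls list in the spider is put together, here is a minimal sketch that builds the same list outside Scrapy. The helper name build_start_urls and its parameters are made up for this example; only base_url and the ?p= page parameter come from the spider code.

```python
def build_start_urls(base_url, first_extra_page=2, stop_page=4):
    """Return the first listing page plus pages first_extra_page..stop_page-1."""
    urls = [base_url]
    urls.extend(base_url + '?p=' + str(i)
                for i in range(first_extra_page, stop_page))
    return urls


base_url = 'http://www.newsmth.net/nForum/board/Intern'
start_urls = build_start_urls(base_url)
for url in start_urls:
    print(url)
```

Building the list in a function also sidesteps a Python 3 scoping rule: a comprehension written directly in a class body cannot see other class-level names such as base_url.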
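While scraping Shuimu I found that its internship listings are loaded dynamically.

In other words, the page source does not contain the internship postings we want. This is where combining Selenium and PhantomJS comes in. Selenium is widely used for automated testing; it can imitate a user's actions in a browser, such as clicking buttons. PhantomJS is a browser without a UI. Used together, they make it easy to scrape dynamically loaded pages.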

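Back in the SMSpider code, we detect the current operating system and then load PhantomJS through Selenium's webdriver. On Linux no path is needed; on Windows you have to give the path to the executable. At the end of __init__() we also connect a handler through the signal dispatcher, so that PhantomJS is shut down when the spider exits.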
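The article does not show the getPlatform() helper that the spider calls. One possible way to write such a helper, assuming it only needs to return the two labels the spider checks for, is sketched below. The names classify_platform and get_platform and the 'other' label are hypothetical, not the author's code.

```python
import sys


def classify_platform(sys_platform):
    """Map a sys.platform string to the labels the spider checks for."""
    if sys_platform.startswith('linux'):
        return 'linux'
    if sys_platform.startswith('win'):
        return 'win'
    # e.g. 'darwin' (macOS); the spider has no branch for this case
    return 'other'


def get_platform():
    """Classify the platform this interpreter is running on."""
    return classify_platform(sys.platform)


print(get_platform())
```

Using startswith rather than a substring test matters here: 'darwin' contains 'win' but should not be treated as Windows.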

self.driver.set_page_load_timeout(10)

This line keeps PhantomJS from hanging on a single request: every page must finish loading within 10 seconds.
The parse method itself:

# Additional imports needed at the top of smSpider.py:
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from intern.items import InternItem

def parse(self, response):
    self.driver.get(response.url)
    print(response.url)
    # Wait until a table tag appears
    try:
        element = WebDriverWait(self.driver, 30).until(
            EC.presence_of_all_elements_located((By.TAG_NAME, 'table')))
        print('element:\n', element)
    except Exception as e:
        print(Exception, ":", e)
        print("wait failed")
    page_source = self.driver.page_source
    bs_obj = BeautifulSoup(page_source, "lxml")
    print(bs_obj)
    table = bs_obj.find('table', class_='board-list tiz')
    print(table)
    if table is None:
        return  # the listing table never loaded; nothing to scrape here
    print("find message ====================================\n")
    intern_messages = table.find_all('tr', class_=False)
    for message in intern_messages:
        title, href, time, author = '', '', '', ''
        td_9 = message.find('td', class_='title_9')
        if td_9:
            title = td_9.a.get_text()
            href = td_9.a['href']
        td_10 = message.find('td', class_='title_10')
        if td_10:
            time = td_10.get_text()
        td_12 = message.find('td', class_='title_12')
        if td_12:
            author = td_12.a.get_text()
        item = InternItem()
        print('title:', title)
        print('href:', href)
        print('time:', time)
        print('author:', author)
        item['title'] = title
        item['href'] = href
        item['time'] = time
        item['author'] = author
        item['base_url_index'] = 0
        # Nested crawl: fetch the full text of each internship posting
        root_url = 'http://www.newsmth.net'
        if href != '':
            content = self.parse_content(root_url + href)
            item['content'] = content
        yield item

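This code first waits for the dynamically loaded target tag to appear, then scrapes the list of internship postings, and then does a nested crawl of each posting's full content. I use bs4 here to parse the HTML; you could also use Scrapy's native XPath or selectors. I won't cover them in detail: it is worth knowing several approaches and mastering one. The scraped values are stored in the item with assignments like item['title'] = title. Note that the method ends with yield, not return: parse works as a generator, scraping and processing postings one at a time.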
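To make the difference between yield and return concrete, here is a minimal sketch of the generator pattern outside Scrapy. The function parse_rows and the sample rows are made up for illustration; plain dicts stand in for the parsed table rows and for InternItem.

```python
def parse_rows(rows):
    """Yield one item dict per row, skipping rows that have no title."""
    for row in rows:
        title = row.get('title', '')
        if not title:
            continue  # header or separator row: nothing to emit
        yield {'title': title, 'href': row.get('href', '')}


# Made-up sample data standing in for the rows of the listing table
rows = [
    {'title': 'Backend intern', 'href': '/article/1'},
    {},  # a row without a title cell
    {'title': 'Data intern', 'href': '/article/2'},
]

items = parse_rows(rows)  # nothing has run yet; this is a generator object
first = next(items)       # runs the loop only as far as the first yield
remaining = list(items)   # consumes the rest of the rows
print(first['title'], len(remaining))
```

Because the function yields, the caller receives each item as soon as it is ready instead of waiting for the whole page to be processed.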
The code that scrapes a posting's full content:

def parse_content(self, url):
    self.driver.get(url)
    # Wait until a table tag appears
    try:
        element = WebDriverWait(self.driver, 30).until(
            EC.presence_of_all_elements_located((By.TAG_NAME, 'table')))
        print('element:\n', element)
    except Exception as e:
        print(Exception, ":", e)
        print("wait failed")
    page_source = self.driver.page_source
    bs_obj = BeautifulSoup(page_source, "lxml")
    content_td = bs_obj.find('td', class_='a-content')
    if content_td is None or content_td.p is None:
        return ''  # page did not load or has an unexpected layout
    return content_td.p.get_text()

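3.4 Write pipelines.py.

Next, we want to store the scraped data in MongoDB. This job can be handed to a pipeline. Pipelines are one of the reasons I like Scrapy: you can throw the scraped data, in the form of items, into a pipeline to filter, deduplicate, store, or otherwise process it. The order of the pipelines is set in settings.py, which makes them even more flexible.
Here is MongoDBPipeline: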

import pymongo
from scrapy.conf import settings
from scrapy.exceptions import DropItem


class MongoDBPipeline(object):
    def open_spider(self, spider):
        self.client = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT'])
        self.db = self.client[settings['MONGODB_DB']]
        self.collection = self.db[settings['MONGODB_COLLECTION']]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Iterating over an item yields field names, so look up each value.
        # Only None and '' count as missing; 0 and False are valid values.
        # This also covers the empty-title case.
        for field in item:
            if item[field] is None or item[field] == '':
                raise DropItem("Missing {0}!".format(field))
        self.collection.insert_one(dict(item))
        return item

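Some explanation.
First we create the class MongoDBPipeline; it does not need to inherit from any predefined pipeline class. It does need a process_item method that takes an item and the spider and returns the processed item. open_spider and close_spider are callbacks invoked when the spider opens and closes. Since we use MongoDB here, we connect a Mongo client when the spider opens and close it again when the spider closes. All the database settings are kept in settings.py, as follows: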

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "intern"
MONGODB_COLLECTION = "items"

Parameters defined in settings.py can be read like this:

from scrapy.conf import settings
settings['xxxxxx']
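The scrapy.conf import is deprecated and is not available in recent Scrapy releases. An alternative is the from_crawler hook, which Scrapy calls to build a pipeline and which gives access to the settings through crawler.settings. The sketch below shows only that part of the pipeline; FakeCrawler is a stand-in invented for this example so the hook can be tried outside Scrapy, and the default values are assumptions.

```python
class MongoDBPipeline(object):
    """Reads its MongoDB settings from the crawler instead of a global import."""

    def __init__(self, server, port, db_name, collection_name):
        self.server = server
        self.port = port
        self.db_name = db_name
        self.collection_name = collection_name

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook, if it is defined, to build the pipeline.
        # crawler.settings holds the values from settings.py.
        s = crawler.settings
        return cls(
            server=s.get('MONGODB_SERVER', 'localhost'),
            port=int(s.get('MONGODB_PORT', 27017)),
            db_name=s.get('MONGODB_DB'),
            collection_name=s.get('MONGODB_COLLECTION'),
        )


class FakeCrawler(object):
    """Stand-in for a real crawler, only for trying the hook outside Scrapy."""
    settings = {
        'MONGODB_SERVER': 'localhost',
        'MONGODB_PORT': 27017,
        'MONGODB_DB': 'intern',
        'MONGODB_COLLECTION': 'items',
    }


pipeline = MongoDBPipeline.from_crawler(FakeCrawler())
print(pipeline.server, pipeline.port, pipeline.db_name, pipeline.collection_name)
```

In a real project, open_spider would then create the client from self.server and self.port rather than from a module-level settings object.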
After writing MongoDBPipeline, you also need to register it in settings.py, as follows:

ITEM_PIPELINES = {    
    'intern.pipelines.TagPipeline': 100, 
    'intern.pipelines.MongoDBPipeline':300                  
}

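The smaller the number, the earlier the pipeline runs. The values are integers, conventionally in the range 0 to 1000. This makes it very easy to set the order of the pipelines and to switch an individual pipeline on or off.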
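The ordering rule can be checked with a few lines of plain Python. This is only an illustration of the rule, not Scrapy's internal code; the helper name execution_order is made up, and the dict is deliberately written in the opposite order to show that only the numbers matter.

```python
ITEM_PIPELINES = {
    'intern.pipelines.MongoDBPipeline': 300,
    'intern.pipelines.TagPipeline': 100,
}


def execution_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    ordered = sorted(pipelines.items(), key=lambda pair: pair[1])
    return [path for path, priority in ordered]


for path in execution_order(ITEM_PIPELINES):
    print(path)
```

With these values TagPipeline sees each item first, and MongoDBPipeline stores whatever TagPipeline returns.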

4. Running

From the project directory, run the following command:

scrapy crawl sm

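Our SMSpider now happily starts scraping data.

5. Next steps

There is a lot more to learn about the Scrapy framework, such as writing extensions and middlewares and using the Crawler API.
On the crawling side, further topics to study include:

Using proxies
Simulating login
Over the next while I will be building a Sina Weibo crawler, and I will share anything new I learn along the way.
Source code for this article: github
If you like it, please give it a star.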

