
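Thanks to Teacher Zeng for the patient explanations and careful answers.

The main goal of this assignment was still to get familiar with crawler code, so I only modified the parts that actually run.
Although I revised and annotated all of the code based on the assignments by Tang Yao (湯堯) and joe, I have to admit that as a beginner my understanding is still shallow, and I look forward to Teacher Zeng's patient explanation in tonight's class.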

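The assignment for this class is as follows:

Choose the URL selected in the second class assignment

Crawl every element on that page that can be crawled; at minimum, crawl the main body of the articles

Optionally, try crawling with lxml

URL from the previous assignment
http://m.itdecent.cn/u/6eca8e1506ce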

Code:

# Import the libraries we need
#coding: utf-8

import os
import time
import urllib2
import urlparse
from bs4 import BeautifulSoup # parses the web page (including Chinese text)

# Define the page download function, with error handling

def download(url, retry=2): # fetch the page at url; retry is the number of retries left
    print ("downloading:", url) # log the URL being fetched
    
    # Set header information to imitate a browser request
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
    
    # Fetching can fail, so catch the error with try-except
    try:
        request = urllib2.Request(url, headers=header) # build the request with the URL and browser-like headers
        html = urllib2.urlopen(request).read() # fetch the page
    except urllib2.URLError as e:  # error handling
        print ("download error: ", e.reason) # print the reason for the error
        html = None # return an empty value on failure
        if retry > 0: # retries remain, so another attempt is allowed
            if hasattr(e, 'code') and 500 <= e.code <600: # retry only on 5xx server errors (the page could not be served), not when the link itself is wrong
                print (e.code) # print the error code
                return download(url, retry -1)
    time.sleep(1) # wait 1 s to avoid putting pressure on the server and to avoid being blocked
    return html
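
The download function above is written for Python 2 (urllib2). As a rough sketch only, the same retry-on-5xx idea might look like this in Python 3. The `opener` and `delay` parameters are my own additions (they are not in the original homework) so that the function can be tried without touching the network, and the shortened User-Agent string is just a placeholder.

```python
import time
import urllib.error
import urllib.request


def download(url, retry=2, opener=urllib.request.urlopen, delay=1.0):
    """Return the page body as bytes, or None if the download fails.

    `opener` and `delay` are hypothetical parameters added for this sketch.
    """
    print('downloading:', url)
    # Placeholder User-Agent; the original post uses a longer browser string.
    headers = {'User-Agent': 'Mozilla/5.0'}
    request = urllib.request.Request(url, headers=headers)
    try:
        html = opener(request).read()
    except urllib.error.URLError as e:
        print('download error:', e.reason)
        # Only HTTPError carries a status code; plain URLError does not.
        code = getattr(e, 'code', None)
        if retry > 0 and code is not None and 500 <= code < 600:
            time.sleep(delay)  # pause before trying again
            return download(url, retry - 1, opener, delay)
        return None
    time.sleep(delay)  # be polite to the server between requests
    return html
```

With the defaults, `download('http://m.itdecent.cn/u/6eca8e1506ce?page=1')` would make a real request; passing a fake `opener` is a convenient way to check the retry branch offline.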

url_root = 'http://m.itdecent.cn' # site root, used to resolve relative article links
url_seed = 'http://m.itdecent.cn/u/6eca8e1506ce?page=%d' # seed page address; the "page=%d" placeholder lets the loop below step through the pages

# Collect the article pages that actually need to be crawled

crawled_url = set() # article pages to crawl
i = 1
flag = True # whether to keep crawling
while flag:
    url = url_seed % i # fill in the page=%d placeholder in url_seed
    i += 1 # next page to crawl (i = i + 1)

    html = download(url) # download the page
    if html is None: # nothing was downloaded, so stop
        break

    soup = BeautifulSoup(html, "html.parser") # parse the downloaded page
    links = soup.find_all('a',{'class': 'title'}) # get the title elements (a tags whose class is title)
    if len(links) == 0: # no valid data left on the page, so stop crawling
        flag = False

    for link in links: # collect the valid article addresses
        realUrl = urlparse.urljoin(url_root, link.get('href')) # resolve to an absolute URL before comparing
        if realUrl not in crawled_url:
            crawled_url.add(realUrl) # record pages that have not been seen yet
        else:
            print ('end')
            flag = False # a repeated link ends the crawl
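
One detail in the loop above is easy to miss: the set stores absolute URLs, so each href has to be joined with the site root before the "already seen" check, otherwise a relative href never matches anything in the set. A small sketch of that step in Python 3 (where `urlparse.urljoin` lives in `urllib.parse`) is below. The helper name `collect_new_links` and the `/p/...` paths are made up for illustration.

```python
from urllib.parse import urljoin


def collect_new_links(base_url, hrefs, seen):
    """Resolve hrefs against base_url and return only the unseen ones.

    `seen` is a set of absolute URLs and is updated in place.
    """
    new_links = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # relative -> absolute
        if absolute not in seen:
            seen.add(absolute)
            new_links.append(absolute)
    return new_links


seen = set()
# Hypothetical hrefs, standing in for what find_all('a', ...) would yield.
first = collect_new_links('http://m.itdecent.cn', ['/p/abc', '/p/def'], seen)
second = collect_new_links('http://m.itdecent.cn', ['/p/abc'], seen)
print(first)
print(second)
```

The second call returns an empty list because `/p/abc` was already recorded, which is the condition the crawl loop uses as a stop signal.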

# Output

('downloading:', 'http://m.itdecent.cn/u/6eca8e1506ce?page=1')
('downloading:', 'http://m.itdecent.cn/u/6eca8e1506ce?page=2')
('downloading:', 'http://m.itdecent.cn/u/6eca8e1506ce?page=3')
('downloading:', 'http://m.itdecent.cn/u/6eca8e1506ce?page=4')
('downloading:', 'http://m.itdecent.cn/u/6eca8e1506ce?page=5')
('downloading:', 'http://m.itdecent.cn/u/6eca8e1506ce?page=6')
('downloading:', 'http://m.itdecent.cn/u/6eca8e1506ce?page=7')
('downloading:', 'http://m.itdecent.cn/u/6eca8e1506ce?page=8')
('downloading:', 'http://m.itdecent.cn/u/6eca8e1506ce?page=9')

# Count all the articles that could be collected

paper_num = len(crawled_url)
print('total paper num: ',paper_num)

# Output

('total paper num: ', 273)

# Crawl each article and save it by title and content

if not os.path.exists('spider_res/'): # check the output directory
    os.mkdir('spider_res') # create it if it is missing

for link in crawled_url:  # crawl article by article
    html = download(link)
    if html is None: # skip articles that failed to download
        continue
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find('h1', {'class': 'title'}).text  # article title (text of the h1 tag whose class is title)
    content = soup.find('div', {'class': 'show-content'}).text # article body (text of the div tag whose class is show-content)

    # Replace characters in the title that are not allowed in file names
    title = title.strip()
    for ch in '\\/:*?"<>|':
        title = title.replace(ch, ' ')
    print (title) # print it to see what the title looks like

    file_name = 'spider_res/' + title + '.txt' # output file name and format
    if os.path.exists(file_name):
        # os.remove(file_name) # delete the file
        continue # skip files that already exist and move on to the next article

    file = open(file_name, 'wb') # open the file for writing in binary mode
    content = unicode(content).encode('utf-8', errors='ignore') # encode the Unicode text as UTF-8, dropping characters that cannot be encoded
    file.write(content)
    file.close()
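
The chain of `replace` calls for cleaning the title could also be written as one regular expression. This is only a sketch: the helper name `safe_filename`, the `'untitled'` fallback, and the choice to also replace `\`, `/` and `*` are my own assumptions, not part of the original homework.

```python
import re

# Characters that commonly cause trouble in file names (assumed set).
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, replacement=' '):
    """Return title with problem characters replaced and edges trimmed."""
    cleaned = _INVALID_CHARS.sub(replacement, title).strip()
    # Fall back to a fixed name if nothing usable is left (hypothetical choice).
    return cleaned or 'untitled'


print(safe_filename('  a|b:c?  '))
print(safe_filename('x/y'))
```

Doing the cleaning before building the path also means the "file already exists" check and the actual write use the same name.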

The crawled files are below

spider_res

References:
Tang Yao (湯堯) - Crawler Intro 03 assignment

joe's assignment has not been published yet, but this time I really did copy from joe's assignment, so I owe joe a big thank-you.

Crawler Assignment 3 - 0706
