Jianshu Unofficial Big Data (Part 2)

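PS: This is important. The "big data" in my articles is not the big-data topic that is so popular right now. A few days ago I read an article on the subject; put simply, any data that one computer, or your current setup, cannot process can be called big data. There is no fixed data volume.
The crawler ran all night and has collected more than 1.7 million records so far. This morning I decided it was not efficient enough, and since I don't know how to write a distributed crawler, I had to stop it and change the code. Attentive readers will guess that I am about to explain resumable crawling (after an interruption, do we have to start from the beginning again?). Today's version is only pseudo-resumable crawling, but it should give you an idea to build on.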

Crawling the hot and city topic URLs

import requests
from lxml import etree
import pymongo

client = pymongo.MongoClient('localhost', 27017)
jianshu = client['jianshu']
topic_urls = jianshu['topic_urls']

host_url = 'http://m.itdecent.cn'
hot_urls = ['http://m.itdecent.cn/recommendations/collections?page={}&order_by=hot'.format(str(i)) for i in range(1,40)]
city_urls = ['http://m.itdecent.cn/recommendations/collections?page={}&order_by=city'.format(str(i)) for i in range(1,3)]

def get_channel_urls(url):
    html = requests.get(url)
    selector = etree.HTML(html.text)
    infos = selector.xpath('//div[@class="count"]')
    for info in infos:
        part_url = info.xpath('a/@href')[0]
        article_amounts = info.xpath('a/text()')[0]
        focus_amounts = info.xpath('text()')[0].split('·')[1]
        # print(part_url,article_amounts,focus_amounts)
        topic_urls.insert_one({'topicurl':host_url + part_url,'article_amounts':article_amounts,
                              'focus_amounts':focus_amounts})

# for hot_url in hot_urls:
#     get_channel_urls(hot_url)

for city_url in city_urls:
    get_channel_urls(city_url)

This part of the code crawls the topic URLs and stores them in the topic_urls collection. The remaining crawling details are simple, so I won't go into them.

Crawling article authors and their followers

import requests
from lxml import etree
import time
import pymongo

client = pymongo.MongoClient('localhost', 27017)
jianshu = client['jianshu']
author_urls = jianshu['author_urls']
author_infos = jianshu['author_infos']

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Connection':'keep-alive'
}

def get_article_url(url,page):
    link_view = '{}?order_by=added_at&page={}'.format(url,str(page))
    try:
        html = requests.get(link_view,headers=headers)
        selector = etree.HTML(html.text)
        infos = selector.xpath('//div[@class="name"]')
        for info in infos:
            author_name = info.xpath('a/text()')[0]
            authorurl = info.xpath('a/@href')[0]
            full_url = 'http://m.itdecent.cn' + authorurl
            # Skip authors already stored in author_urls. A single query
            # avoids loading the whole collection for every author.
            if author_urls.find_one({'author_url': full_url}):
                continue
            # print(full_url,author_name)
            author_infos.insert_one({'author_name':author_name,'author_url':full_url})
            get_reader_url(authorurl)
        time.sleep(2)  # pause between pages to avoid hammering the site
    except requests.exceptions.ConnectionError:
        pass

# get_article_url('http://m.itdecent.cn/c/bDHhpK',2)
def get_reader_url(url):
    link_views = ['http://m.itdecent.cn/users/{}/followers?page={}'.format(url.split('/')[-1],str(i)) for i in range(1,100)]
    for link_view in link_views:
        try:
            html = requests.get(link_view,headers=headers)
            selector = etree.HTML(html.text)
            infos = selector.xpath('//li/div[@class="info"]')
            for info in infos:
                author_name = info.xpath('a/text()')[0]
                authorurl = info.xpath('a/@href')[0]
                # print(author_name,authorurl)
                author_infos.insert_one({'author_name': author_name, 'author_url': 'http://m.itdecent.cn' + authorurl})
        except requests.exceptions.ConnectionError:
            pass
# get_reader_url('http://m.itdecent.cn/u/7091a52ac9e5')

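1 Jianshu is fairly friendly to crawlers: adding a User-Agent header was enough (but please don't crawl maliciously, and help keep the network safe).
2 Two errors occurred along the way, and adding two try blocks fixed them. I had considered whether errors might occur: when I tried it by hand, paging past the last page on Jianshu automatically jumped to the second page, so I set a very large page limit and did not expect any errors.
3 After an error I don't want to crawl duplicate data, and one user can publish many articles, so I added a check in get_article_url. Roughly: if the crawled URL is already in the user collection, I skip visiting it, storing it, and crawling its followers.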
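
As an alternative to a very large fixed page limit, the wrap-around behaviour described in note 2 could be used as a stop signal. The following is only a minimal sketch under assumptions: `iter_pages` and `fake_fetch` are hypothetical names that are not part of the original code, and `fetch_page` stands in for whatever function returns the list of author links found on a page. It also assumes the listing does not change while it is being crawled; on a live listing, new articles can shift items between pages.

```python
from typing import Callable, Iterator, List


def iter_pages(fetch_page: Callable[[int], List[str]],
               max_pages: int = 5000) -> Iterator[List[str]]:
    """Yield page contents until a page is empty or repeats an earlier page."""
    seen_pages = set()
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        fingerprint = tuple(items)
        # An empty page, or one identical to a page already seen,
        # is taken to mean we have run past the last real page.
        if not items or fingerprint in seen_pages:
            break
        seen_pages.add(fingerprint)
        yield items


# Fake listing for illustration: 3 real pages, and any later page
# falls back to page 2, mimicking the behaviour described above.
def fake_fetch(page: int) -> List[str]:
    pages = {1: ['/u/a', '/u/b'], 2: ['/u/c'], 3: ['/u/d']}
    return pages.get(page, pages[2])


pages_crawled = list(iter_pages(fake_fetch))
```

With the fake listing, the loop stops at the fourth request because its content matches page 2, so only the three real pages are yielded.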

Entry point

import sys
sys.path.append("..")
from multiprocessing import Pool
from channel_extract import topic_urls
from page_spider import get_article_url

db_topic_urls = [item['topicurl'] for item in topic_urls.find()]
shouye_url = ['http://m.itdecent.cn/c/bDHhpK']
x = set(db_topic_urls)
y = set(shouye_url)
rest_urls = x - y

def get_all_links_from(channel):
    for num in range(1,5000):
        get_article_url(channel,num)

if __name__ == '__main__':

    pool = Pool(processes=4)
    pool.map(get_all_links_from,rest_urls)

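1 Today the crawler was still working on the home page (because num was previously set to 17000, since the home page has so many articles). Thinking it over, most home-page articles are pushed there from other topics, so I no longer crawl it. To resume, I subtract one set from the other to remove the home-page link, then crawl the rest.
2 Why call this pseudo-resumable crawling? Because after the next error the crawl still has to start over (unless I change the program). Even so, it offers an idea: use set subtraction to crawl only what remains.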
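
The set-subtraction idea could be taken one step further by recording each finished channel, so that a restart skips them without editing the program. This is only a sketch under assumptions: the JSON checkpoint file and the helpers `load_done`, `mark_done` and `remaining` are hypothetical and not part of the original crawler, and the channel URLs below are made up.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the set of channel URLs already finished, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding='utf-8') as f:
        return set(json.load(f))


def mark_done(path, channel):
    """Record one finished channel so that a later run can skip it."""
    done = load_done(path)
    done.add(channel)
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sorted(done), f)


def remaining(all_channels, path):
    """Set difference: every channel not yet marked as finished."""
    return set(all_channels) - load_done(path)


# Demo with made-up channel URLs and a temporary checkpoint file.
checkpoint = os.path.join(tempfile.mkdtemp(), 'done_channels.json')
channels = ['http://example.com/c/1',
            'http://example.com/c/2',
            'http://example.com/c/3']
mark_done(checkpoint, 'http://example.com/c/1')
todo = remaining(channels, checkpoint)
```

In the real crawler, `mark_done` would be called once a channel's page loop finishes, and `remaining` would take the place of the hand-written `x - y`. Several worker processes writing to one file could race with each other, so a MongoDB collection of finished channels would probably be a better fit there.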
