
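Starting today I'm teaching myself hands-on Python web scraping. I picked up a good book and want to share what I learn with everyone; I'd also suggest writing and practicing as much code as you can. I got a lot out of today, and Python keeps getting more interesting. I worked through the book's exercises and then tried things out on my own. Some of the book's code no longer matches what actually works, but by following the book's approach, and after a day of research, I finally extracted 10 pages of news listings into a Word document ^_^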

I. Getting the headers for Baidu News

II. Fetching the page source

import requests

# Send a browser-like User-Agent with the request.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=考察'

# .text gives the response body decoded to a str (the page's HTML).
res = requests.get(url, headers=headers).text
print(res)
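The post mentions collecting 10 pages, while the request above fetches only one. A minimal sketch of building one URL per page follows. The helper `build_news_url` and the `pn` offset parameter are assumptions that do not appear in the original code, so check them against the real page URLs before relying on them.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.baidu.com/s'


def build_news_url(keyword, page=0, per_page=10):
    """Return a search URL for `keyword` (hypothetical helper).

    The 'pn' result offset is an assumption about how the site pages
    its results; it is not taken from the original post.
    """
    params = {
        'tn': 'news',
        'rtt': 1,
        'bsst': 1,
        'cl': 2,
        'wd': keyword,           # urlencode percent-encodes the Chinese keyword
        'pn': page * per_page,   # assumed paging parameter
    }
    return BASE_URL + '?' + urlencode(params)


# One URL per page, for pages 0 through 9.
urls = [build_news_url('考察', page) for page in range(10)]
print(urls[0])
```

Each URL can then be passed to `requests.get(url, headers=headers)` in a loop, ideally with a short pause between requests.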

III. Writing a regular expression to extract the news information

import re

res = '''
<div class="news-source">
    <div class="c-img c-img1 c-img-circle news-source-icon_1tdlx c-gap-right-xsmall">
        <span class="c-img-border c-img-circle"></span>
        <img class="source-img_33bs5" src="https://timg01.bdimg.com/timg?pacompress=&imgtype=0&sec=1439619614&autorotate=1&di=834cbb7d72ef5d6290e356c3a9b82679&quality=90&size=b870_10000&src=http%3A%2F%2Fpic.rmb.bdstatic.com%2Fb5aef7a1e77791d0387d001e5fa2d184.png">
    </div>
    <span class="c-color-gray c-font-normal c-gap-right">網易新聞</span>
    <span class="c-color-gray2 c-font-normal">2020年12月27日 18:37</span>
</div>
'''

# A lazy (.*?)</div> alone would stop at the inner </div> that closes the icon
# block, so skip past that first </div> and capture the <span> elements after it.
p_info = '<div class="news-source">.*?</div>(.*?)</div>'

# re.S lets "." match newlines, which this multi-line HTML requires.
info = re.findall(p_info, res, re.S)
print(info)
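Two details decide whether a pattern like this returns what you expect: the `re.S` flag, and where a lazy `(.*?)` stops when `<div>` elements are nested. The sketch below uses a simplified, made-up HTML fragment (the class names `icon`, `src`, and `date` are placeholders, not Baidu's) to show both effects.

```python
import re

html = '''
<div class="news-source">
    <div class="icon"><img src="x.png"></div>
    <span class="src">Example News</span>
    <span class="date">2020-12-27 18:37</span>
</div>
'''

pattern = '<div class="news-source">(.*?)</div>'

# Without re.S, "." cannot cross a newline, so nothing matches.
no_flag = re.findall(pattern, html)

# With re.S the pattern matches, but the lazy group ends at the first
# "</div>", which closes the inner icon block rather than the outer one.
with_flag = re.findall(pattern, html, re.S)

# Skipping the inner block first leaves the two <span> elements in the group.
fixed = re.findall('<div class="news-source">.*?</div>(.*?)</div>', html, re.S)

print(no_flag)                        # []
print('Example News' in with_flag[0])  # False
print('Example News' in fixed[0])      # True
```

Targeting the `<span>` elements directly, as the later sections do, avoids the nesting problem altogether.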

IV. Writing a regular expression to extract the news link

import re

res = '''
<h3 class="news-title_1YtI1">
    <a href="https://finance.ifeng.com/c/82Z0Nx2QiJ6" target="_blank" class="news-title-font_1xS-F" data-click="{
        'f0':'77A717EA',
        'f1':'9F63F1E4',
        'f2':'4CA6DE6E',
        'f3':'54E5243F',
        't':'1609115182',
    }"><em>阿里巴巴</em>某某某某某某,由...</a>
</h3>
'''

# Capture the href value of the <a> tag that follows each title <h3>.
p_href = '<h3 class="news-title_1YtI1">.*?<a href="(.*?)"'
href = re.findall(p_href, res, re.S)
print(href)  # ['https://finance.ifeng.com/c/82Z0Nx2QiJ6']
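The pattern above only matches when `href` is the first attribute of the `<a>` tag. If another attribute comes first, the lazy `.*?` keeps scanning with `re.S` until it finds a later `<a href="`, which silently drops a link and shifts the results. The sketch below shows this on a made-up fragment (the `example.com` URLs are placeholders) together with a more tolerant variant.

```python
import re

html = '''
<h3 class="news-title_1YtI1">
    <a target="_blank" href="https://example.com/a" class="t">First</a>
</h3>
<h3 class="news-title_1YtI1">
    <a href="https://example.com/b" target="_blank">Second</a>
</h3>
'''

# Strict form: requires href to be the first attribute of the <a> tag.
strict = re.findall('<h3 class="news-title_1YtI1">.*?<a href="(.*?)"', html, re.S)

# Tolerant form: [^>]*? skips other attributes inside the same <a ...> tag.
p_href = '<h3 class="news-title_1YtI1">.*?<a[^>]*?href="(.*?)"'
links = re.findall(p_href, html, re.S)

print(strict)  # ['https://example.com/b']
print(links)   # ['https://example.com/a', 'https://example.com/b']
```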

V. Writing a regular expression to extract the news title

import re

res = '''
<h3 class="news-title_1YtI1">
    <a target="_blank" class="news-title-font_1xS-F" data-click="{
        'f0':'77A717EA',
        'f1':'9F63F1E4',
        'f2':'4CA6DE6E',
        'f3':'54E5243F',
        't':'1609115182',
    }"><em>阿里巴巴</em>在港公告:董事會已授權增加本公司的股份回購計劃總額,由...</a>
</h3>
'''

# ".*?>" skips to the end of the opening <a ...> tag; the group is the link text.
p_title = '<h3 class="news-title_1YtI1">.*?>(.*?)</a>'
title = re.findall(p_title, res, re.S)
print(title)  # ['<em>阿里巴巴</em>在港公告:董事會已授權增加本公司的股份回購計劃總額,由...']

VI. Cleaning the data and printing the output

1. Cleaning the news titles

import re

res = '''
<h3 class="news-title_1YtI1">
    <a target="_blank" class="news-title-font_1xS-F" data-click="{
        'f0':'77A717EA',
        'f1':'9F63F1E4',
        'f2':'4CA6DE6E',
        'f3':'54E5243F',
        't':'1609115182',
    }">    <em>阿里巴巴</em>在港公告:董事會已授權增加本公司的股份回購計劃總額,由...</a>
</h3>
'''

p_title = '<h3 class="news-title_1YtI1">.*?>(.*?)</a>'
title = re.findall(p_title, res, re.S)

# strip() removes whitespace such as spaces and newlines.
# It only trims the start and end of a string, never the middle.
for i in range(len(title)):  # len(title) is the number of titles found
    title[i] = title[i].strip()
    print(title[i])
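After `strip()` the title still contains the `<em>` tags, because `strip()` only touches the ends of the string. A minimal sketch of combining it with `re.sub`, as the complete code later does, is shown below; the helper name `clean_title` and the sample strings are made up for illustration.

```python
import re

raw_titles = [
    '\n    <em>Alibaba</em> announces buyback ...  ',
    '<em>News</em>\n',
]


def clean_title(raw):
    """Trim surrounding whitespace, then drop any HTML tags (hypothetical helper)."""
    return re.sub('<.*?>', '', raw.strip())


cleaned = [clean_title(t) for t in raw_titles]
print(cleaned)  # ['Alibaba announces buyback ...', 'News']
```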

2. Cleaning the news source and date

import re

res = '''
<div class="news-source">
    <div class="c-img c-img1 c-img-circle news-source-icon_1tdlx c-gap-right-xsmall">
        <span class="c-img-border c-img-circle"></span>
        <img class="source-img_33bs5" src="https://timg01.bdimg.com/timg?pacompress=&imgtype=0&sec=1439619614&autorotate=1&di=834cbb7d72ef5d6290e356c3a9b82679&quality=90&size=b870_10000&src=http%3A%2F%2Fpic.rmb.bdstatic.com%2Fb5aef7a1e77791d0387d001e5fa2d184.png">
    </div>
    <span class="c-color-gray c-font-normal c-gap-right">網易新聞</span>
    <span class="c-color-gray2 c-font-normal">2020年12月27日 18:37</span>
</div>
'''

p_source = '<span class="c-color-gray c-font-normal c-gap-right">(.*?)</span>'
source = re.findall(p_source, res, re.S)
for i in range(len(source)):
    # Remove any HTML tags left inside the captured text.
    source[i] = re.sub('<.*?>', '', source[i])
    print(source[i])

p_date = '<span class="c-color-gray2 c-font-normal">(.*?)</span>'
date = re.findall(p_date, res, re.S)
for j in range(len(date)):
    print(date[j])
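The date is captured as plain text. If it needs to be sorted or compared later, it can be turned into a `datetime` object. This is only a sketch: the helper `parse_date` is hypothetical, and the idea that some entries may use a different shape (for example a relative time such as "3小时前") is an assumption, which is why anything that does not match returns `None`.

```python
from datetime import datetime


def parse_date(text):
    """Parse 'YYYY年MM月DD日 HH:MM'; return None for any other shape."""
    try:
        return datetime.strptime(text.strip(), '%Y年%m月%d日 %H:%M')
    except ValueError:
        return None


print(parse_date('2020年12月27日 18:37'))  # 2020-12-27 18:37:00
print(parse_date('3小时前'))               # None
```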

The complete code:

import requests
import re

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
url = 'https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&rsv_dl=ns_pc&word=考察'
res = requests.get(url, headers=headers).text

p_href = '<h3 class="news-title_1YtI1">.*?<a href="(.*?)"'
p_title = '<h3 class="news-title_1YtI1">.*?>(.*?)</a>'
p_source = '<span class="c-color-gray c-font-normal c-gap-right">(.*?)</span>'
p_date = '<span class="c-color-gray2 c-font-normal">(.*?)</span>'

href = re.findall(p_href, res, re.S)
title = re.findall(p_title, res, re.S)
source = re.findall(p_source, res, re.S)
date = re.findall(p_date, res, re.S)

# Clean the data and print it. zip() stops at the shortest list, so a field
# missing from the page cannot raise an IndexError.
for i, (t, d, s, h) in enumerate(zip(title, date, source, href), start=1):
    t = re.sub('<.*?>', '', t.strip())  # trim whitespace, then drop tags such as <em>
    print(str(i) + '.' + t + '(' + d + '-' + s + ')')
    print(h)
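The introduction mentions saving the results to a Word document, but that step is not shown in the code. As a stand-in that needs only the standard library, here is a sketch that writes the four lists as CSV text instead. CSV is a swapped-in format, not the method the post used, and the helper `to_csv` and its column names are made up for illustration.

```python
import csv
import io


def to_csv(titles, dates, sources, hrefs):
    """Return the four parallel lists as CSV text, one news item per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['title', 'date', 'source', 'href'])
    # zip() stops at the shortest list, so incomplete items are left out.
    writer.writerows(zip(titles, dates, sources, hrefs))
    return buffer.getvalue()


text = to_csv(
    ['T1', 'T2'],
    ['2020-12-27', '2020-12-28'],
    ['Src A', 'Src B'],
    ['https://example.com/1'],  # one link short on purpose
)
print(text)
```

The returned string can be written to disk with `open('news.csv', 'w', encoding='utf-8', newline='')`.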

I'm new to web scraping and used Baidu News as a test case. The code still has room for improvement, so corrections are welcome. Thanks!


Python web scraping 4: [Complete code] Getting the title, source, date, and link of Baidu News items
