Python Hands-On Plan study notes: scraping dynamic web pages

The code###

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import requests
import time

info = []

headers = {
    'cookie': '_gat=1; gr_user_id=7e4ae08a-42a2-4645-bcdc-c5121ebf4c28; _knewone_v2_session=bUswaVJNb2poakxGZVUwRmQ0V0dqWm9sOFpQVDBjMWF1b0grazh0Q1l6dGpub3BHK1JSanV2OGc0ZmhDZitBMkNCZFY5T2JZQ1ZxN0ZpM0dONExsN3AwREtLcmw3dC9Ub1hKbnk5N3NabG1oTGdHU01EM0lHZU5nQWRGRGdaZUhhV1crYTY4Zlk4NGdUSjhmak1jSFVNSSt1YzNEOVY5bS9zVC8rbkJHT282M0dheXpwT0FNWDFoV1lZQ0hNcnp5LS1qbEM3TStTRDNVOFFLWkF3UEh6QzNBPT0%3D--62e609c4894070dfadfe1ddfd816eae455c0803e; _ga=GA1.2.1202148329.1464331142; Hm_lvt_b44696b80ba45a90a23982e53f8347d0=1464331143; Hm_lpvt_b44696b80ba45a90a23982e53f8347d0=1464331216; gr_session_id_e7b7e334c98d4530928513e7439f9ed2=65a4dbb6-0a13-4293-8745-06637ceba521',
    'referer': 'https://knewone.com/things',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36',
    'x-csrf-token': 'W+zjfj7CxkXZvvltEUXvvLWVrEMMdQZFgV7wgdQx5On71c+Iqo9YcmlK1u+Cdz7V92NR59vOyBQaOfzbMcKCeA==',
}


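def getText(item):
    # Return the tag's text with newlines removed, or '' if the tag is missing.
    return '' if item is None else item.get_text().replace('\n', '')


def getPic(item):
    # Return the image's src attribute, or '' if the tag or attribute is missing.
    return '' if item is None else item.get('src', '')


def getUrl(item):
    # Build an absolute URL from the link's relative href.
    return '' if item is None else 'https://knewone.com' + item.get('href', '')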



def getInfo(url, data=None):
    # Fetch one listing page using the browser-like headers defined above.
    web_data = requests.get(url, headers=headers)
    web_data.encoding = "utf-8"
    if web_data.status_code == 200:
        soup = BeautifulSoup(web_data.text, 'html.parser')
        titles = soup.select('#things_list > article > section > h4 > a')
        pics = soup.select('#things_list > article > header > a > img')
        favos = soup.select('span.fanciers_count')
        links = soup.select('#things_list > article > header > a')
        if data is None:
            # zip() stops at the shortest of the four lists.
            for title, pic, favo, link in zip(titles, pics, favos, links):
                data = {
                    'title': getText(title),
                    'pic': getPic(pic),
                    'favo': getText(favo),
                    'link': getUrl(link)
                }
                print(data)
    # Pause once per page request rather than once per item.
    time.sleep(2)


def get_more_page(start, end):
    url = 'https://knewone.com/things?page='
    for i in range(start, end):
        getInfo(url + str(i))


get_more_page(1, 50)

Problems encountered###

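[Screenshot: GBK encoding error]

1. This is a well-worn problem. The main cause is PyCharm's encoding settings: change both the IDE encoding and the Project encoding to UTF-8.

[Screenshot: UTF-8 encoding settings]

This one was a real pit (it took me ages to sort out /(ㄒoㄒ)/~~). Whenever the page contains special characters such as ° or Böhm, the scrape fails with an error.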
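The following is a minimal sketch of this kind of failure, assuming the error in the screenshot is the common `UnicodeEncodeError` raised when text is written to a GBK console. The `printable` helper and the sample string are hypothetical and not part of the original script; the helper is a fallback for when the console encoding cannot be changed.

```python
def printable(text, encoding='gbk'):
    """Replace characters the target codec cannot encode with '?'."""
    return text.encode(encoding, errors='replace').decode(encoding)


# Hypothetical scraped title containing an emoji, which GBK cannot encode.
sample = 'B\u00f6hm lamp \U0001F600'

try:
    sample.encode('gbk')          # what a GBK console does implicitly on print()
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False             # this is the error that aborts the scrape

cleaned = printable(sample)       # now safe to print on a GBK console
```

Switching the IDE and project to UTF-8, as described above, removes the need for the fallback, because UTF-8 can encode every character.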
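Some additional fixes:
a. Add the shebang and the encoding declaration at the top of the script. This only addresses the encoding problem inside the IDE.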

#!/usr/bin/env python
# -*- coding: utf-8 -*-

b. Set the encoding on the response. This helps when the output comes out garbled. Pay particular attention to the page's actual encoding, even though most pages use UTF-8.

    web_data = requests.get(url, headers=headers)
    web_data.encoding = "utf-8"
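To see why setting the encoding matters, here is a small sketch that imitates what happens when response bytes are decoded with the wrong codec. It uses only the standard library; `raw` stands in for `web_data.content`, and `decode_body` is a hypothetical helper that mimics how `requests` turns bytes into `.text` using `.encoding`.

```python
# Stand-in for web_data.content: the UTF-8 bytes of "Böhm 25°".
raw = 'B\u00f6hm 25\u00b0'.encode('utf-8')


def decode_body(body, encoding):
    """Decode response bytes with the given codec, as .text does with .encoding."""
    return body.decode(encoding, errors='replace')


wrong = decode_body(raw, 'latin-1')   # garbled: each UTF-8 byte becomes one character
right = decode_body(raw, 'utf-8')     # correct: matches the original text
```

When a server sends a text response without a charset, `requests` may fall back to ISO-8859-1, which is the situation `wrong` imitates. Setting `web_data.encoding = "utf-8"` before reading `web_data.text` corresponds to the `right` case.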

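2. Another small pitfall: when calling .get() or .get_text() on a tag, some results need [0] first and others do not. soup.select() returns a list of tags, so a single match has to be indexed with [0], whereas a tag obtained by iterating over that list can be used directly.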
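A short sketch of the difference, using a made-up HTML fragment shaped like the selectors in the script (the markup and file names are assumptions, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the '#things_list > article > header > a' selector.
html = """
<div id="things_list">
  <article><header><a href="/things/1"><img src="a.png"></a></header></article>
  <article><header><a href="/things/2"><img src="b.png"></a></header></article>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# select() always returns a list, even for a single match, so index it first.
links = soup.select('#things_list > article > header > a')
first_href = links[0].get('href')

# select_one() returns a single Tag (or None), so no [0] is needed.
first = soup.select_one('#things_list > article > header > a')
same_href = first.get('href')

# Inside a loop each element is already a Tag.
all_hrefs = [a.get('href') for a in links]
```

Calling `.get()` directly on the result of `select()` fails because the list itself has no such method, which is the usual reason `[0]` turns out to be needed.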
