Getting bolder: using Python to scrape CSDN articles and save them as PDFs

Preface

Hi everyone, Mowang (魔王) here!

What you need for this tutorial:

  • wkhtmltopdf [software]
  • The source code for this tutorial

Third-party libraries:

  • requests >>> pip install requests
  • parsel >>> pip install parsel
  • pdfkit >>> pip install pdfkit

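Development environment:

  • Version: Python 3.8
  • Editor: PyCharm

To install a module, press Win + R, type cmd, and run pip install <module name>. If the command fails with red error text, the likely cause is a network timeout; switching to a domestic PyPI mirror usually fixes it.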
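
The mirror switch mentioned above can be sketched in Python as follows. This is only an illustration: the helper name build_pip_command is hypothetical, and the Tsinghua mirror URL is assumed here as one commonly used domestic mirror, not something this article prescribes. pip's -i option selects the package index to install from.

```python
import sys

# Assumed example mirror (Tsinghua University PyPI mirror); other domestic mirrors work the same way
MIRROR = 'https://pypi.tuna.tsinghua.edu.cn/simple'


def build_pip_command(package, index_url=MIRROR):
    """Return the argument list for installing `package` from `index_url`."""
    # `python -m pip` makes sure pip runs for the interpreter executing this script
    return [sys.executable, '-m', 'pip', 'install', package, '-i', index_url]


if __name__ == '__main__':
    for name in ('requests', 'parsel', 'pdfkit'):
        # Print the command instead of running it, so nothing is installed by accident
        print(' '.join(build_pip_command(name)))
```

The printed commands can be pasted into cmd, or passed to subprocess.run if you prefer to install from a script.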

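Scraping workflow:

I. Work out what data we want and where it can be obtained

Capturing traffic with the browser's developer tools shows that the data we want is returned simply by requesting the URL shown in the address bar.

II. Implementation steps:

Fetching multiple articles (collect the URLs of all articles)

  1. Send a request to the article list page
  2. Get the data: the page source as text
  3. Parse the data: extract the article URLs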
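
The three steps above can be sketched without touching the network. To keep the sketch dependency-free it uses the standard library's html.parser as a stand-in for parsel, and it parses a made-up HTML snippet instead of a real response. The helper names (list_page_urls, ArticleLinkParser, extract_article_urls) and the sample markup are assumptions for illustration; only the list-page URL pattern comes from this article.

```python
from html.parser import HTMLParser

# List-page URL pattern used in this article
BASE = 'https://blog.csdn.net/fei347795790/article/list/{page}'


def list_page_urls(first, last):
    """Build the list-page URLs for pages first..last (inclusive)."""
    return [BASE.format(page=page) for page in range(first, last + 1)]


class ArticleLinkParser(HTMLParser):
    """Collect the href of every <a> tag that sits inside an <h4> tag."""

    def __init__(self):
        super().__init__()
        self.in_h4 = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h4':
            self.in_h4 = True
        elif tag == 'a' and self.in_h4:
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'h4':
            self.in_h4 = False


def extract_article_urls(html):
    """Return the article URLs found in the page source `html`."""
    parser = ArticleLinkParser()
    parser.feed(html)
    return parser.links


# Made-up sample markup standing in for a real list page
SAMPLE_HTML = '''
<div class="article-list">
  <div><h4><a href="https://blog.csdn.net/fei347795790/article/details/110070943">Post A</a></h4></div>
  <div><h4><a href="https://blog.csdn.net/fei347795790/article/details/1">Post B</a></h4></div>
  <p><a href="https://blog.csdn.net/about">not an article</a></p>
</div>
'''

if __name__ == '__main__':
    print(list_page_urls(4, 5))
    print(extract_article_urls(SAMPLE_HTML))
```

In a real run, the text returned by requests.get for each list page would take the place of SAMPLE_HTML.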

Fetching the content of a single article

  1. Send a request to the article URL
  2. Get the data: the page source
  3. Parse the data: extract the article content
  4. Save the data: first save it as an HTML file, then convert the HTML file to PDF
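
Step 4 can be sketched as below, assuming the article content has already been extracted as an HTML fragment. The helper names save_html and html_to_pdf, the template, and the folder name are hypothetical. pdfkit.configuration and pdfkit.from_file are the pdfkit calls used for the conversion, and they only work once wkhtmltopdf is installed.

```python
import os

# Minimal page skeleton that wraps the extracted article fragment
HTML_TEMPLATE = '''<!doctype html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>{title}</title>
</head>
<body>
{article}
</body>
</html>'''


def save_html(title, article_html, folder='html'):
    """Write the article fragment into a complete HTML file and return its path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f'{title}.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(HTML_TEMPLATE.format(title=title, article=article_html))
    return path


def html_to_pdf(html_path, pdf_path, wkhtmltopdf_path):
    """Convert a saved HTML file to PDF with pdfkit (requires wkhtmltopdf)."""
    import pdfkit  # imported here so save_html works even without pdfkit installed

    config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path)
    pdfkit.from_file(html_path, pdf_path, configuration=config)


if __name__ == '__main__':
    path = save_html('demo', '<h1>Demo</h1><p>Article body goes here.</p>')
    print(path)
    # Uncomment once wkhtmltopdf is installed, and point the path at your own copy:
    # html_to_pdf(path, 'demo.pdf', r'C:\01-Software-installation\wkhtmltopdf\bin\wkhtmltopdf.exe')
```

Saving the HTML first gives you a chance to strip sidebars and ads before conversion, which rendering a live URL directly does not.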

Code

import os
import re  # regular expressions, used to clean the title for use as a file name
import subprocess

import parsel  # HTML parsing
import requests  # HTTP requests

# Path to the wkhtmltopdf executable; adjust it to your own installation
wkhtmltopdf = 'C:\\01-Software-installation\\wkhtmltopdf\\bin\\wkhtmltopdf.exe'
# Request headers make the script look like a browser, so the server is less likely to reject it
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36'
}
os.makedirs('pdf_1', exist_ok=True)  # output folder for the PDFs
for page in range(4, 6):  # list pages 4 and 5
    url = f'https://blog.csdn.net/fei347795790/article/list/{page}'  # article list page
    # Send a GET request with the requests module
    response = requests.get(url=url, headers=headers)
    # print(response)  # <Response [200]> means the request succeeded
    selector = parsel.Selector(response.text)  # build a Selector from the page source
    # CSS selectors pick elements by tag and attribute; a::attr(href) reads the href attribute of an <a> tag
    href = selector.css('#articleMeList-blog > div.article-list > div > h4 > a::attr(href)').getall()
    # print(href)
    for index in href:
        try:
            print(index)
            html_data = requests.get(url=index, headers=headers).text
            selector_1 = parsel.Selector(html_data)
            title = selector_1.css('#articleContentId::text').get()
            # Replace characters that Windows does not allow in file names
            title = re.sub(r'[\\/:*?"<>|]', '_', title)
            # Let wkhtmltopdf render the article URL straight to a PDF
            subprocess.run([wkhtmltopdf, index, f'pdf_1\\{title}.pdf'])
        except Exception as e:
            print(e)


import requests

url = 'https://blog.csdn.net/phoenix/web/v1/comment/submit'
like_url = 'https://blog.csdn.net//phoenix/web/v1/article/like'
headers = {
    # Paste the cookie string from your own logged-in CSDN browser session here
    'cookie': '<your CSDN cookie>',
    'origin': 'https://blog.csdn.net',
    'referer': 'https://blog.csdn.net/fei347795790/article/details/110070943',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
    'x-tingyun-id': 'im-pGljNfnc;r=268943811',
}
data = {
    'commentId': '',
    'content': '自游老師真帥',
    'articleId': '124196275',
}
like_data = {
    'articleId': '110070943'
}
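# Uncomment the next line to submit the comment instead of the like
# response = requests.post(url=url, data=data, headers=headers)
response = requests.post(url=like_url, data=like_data, headers=headers)
print(response)  # <Response [200]> means the request succeeded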
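
Printing the response object only shows the status code. A slightly more informative check might look like the sketch below. The helper name summarize_response is hypothetical, and the sample bodies are made up, because this article does not document what the API actually returns.

```python
import json


def summarize_response(status_code, body_text):
    """Return a short, readable summary of an HTTP response."""
    ok = 200 <= status_code < 300
    try:
        payload = json.loads(body_text)
    except ValueError:
        payload = None  # the body was not JSON
    return {'ok': ok, 'status': status_code, 'json': payload}


if __name__ == '__main__':
    # Made-up response bodies for illustration only
    print(summarize_response(200, '{"code": 200, "message": "success"}'))
    print(summarize_response(403, '<html>Forbidden</html>'))
```

With requests, the call would be summarize_response(response.status_code, response.text).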

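Closing remarks

That's all for this article!

If you have suggestions or questions, leave a comment or send me a private message. Let's keep working at it together!

If you enjoyed it, follow me, or like, bookmark, and comment on the article!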
