
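We often need to crawl files from the web and extract their data for comparison. The files are usually PDFs that have to be converted to Excel spreadsheets. Python can handle the whole workflow, from collecting the files to extracting the data.
I. First, scrape the page content and download the PDF files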

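import os
import time

import requests
from lxml import etree

SITE = 'http://www.innocom.gov.cn/'
OUT_DIR = '名单'  # folder that stores the downloaded PDFs


def main(page):
    if page == 1:
        # First page
        url = "http://www.innocom.gov.cn/gxjsqyrdw/gswj/list.shtml"
    else:
        # Later pages carry the page number in the file name
        url = 'http://www.innocom.gov.cn/gxjsqyrdw/gswj/list' + '_' + str(page) + '.shtml'
    resp = requests.get(url)
    time.sleep(60)  # wait between requests to keep the load on the server low
    xhtml = etree.HTML(resp.content.decode("utf-8"))
    # Locate the links whose text contains "拟认定" (proposed for recognition)
    hrefs = xhtml.xpath('/html/body/div[2]/div[1]/div[3]/ul/li/a[contains(text(), "拟认定")]/@href')
    res = []
    for href in hrefs:
        # Build the URL of the announcement page
        detail_url = SITE + href
        resp = requests.get(detail_url)
        time.sleep(60)
        xhtml = etree.HTML(resp.content.decode("utf-8"))
        links = xhtml.xpath('//*[@id="content"]//@href')
        if not links:
            continue  # no attachment link found on this page
        # The PDF link is relative to the announcement page: keep everything up to the last '/'
        base = detail_url[:detail_url.rfind('/') + 1]
        res.append(base + links[0])
        print(base + links[0])
    # Download each PDF once, after all links on this page have been collected
    os.makedirs(OUT_DIR, exist_ok=True)  # create the folder for the files
    for n, pdf_url in enumerate(res):
        r = requests.get(pdf_url, stream=True)
        # The page number in the name keeps later pages from overwriting earlier files
        with open(os.path.join(OUT_DIR, f"{page}_{n}.pdf"), 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)


if __name__ == '__main__':
    for i in range(1, 15):
        main(i)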
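
The pagination rule and the relative-link handling used above can also be written with the standard library's `urljoin`, which resolves `./file.pdf`, `../file.pdf` and `/file.pdf` style links without manual string slicing. The following is only a sketch: the helper names `list_page_url` and `resolve` are made up for illustration, and the announcement and PDF paths are hypothetical, not taken from the real site.

```python
from urllib.parse import urljoin

LIST_ROOT = "http://www.innocom.gov.cn/gxjsqyrdw/gswj/"


def list_page_url(page):
    """Page 1 is list.shtml; page n (n > 1) is list_n.shtml."""
    name = "list.shtml" if page == 1 else f"list_{page}.shtml"
    return urljoin(LIST_ROOT, name)


def resolve(page_url, href):
    """Turn an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Hypothetical paths, for illustration only
detail = resolve(list_page_url(1), "./202101/t20210101_1.shtml")
pdf = resolve(detail, "./P020210101.pdf")
print(detail)
print(pdf)
```

Because `urljoin` resolves each link against the page it was found on, the same call works for both the announcement links on the list page and the PDF link on the announcement page.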

II. Parse the PDFs into Excel files
1. Parsing with the tabula module

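import tabula
import pandas as pd

# read_pdf returns a list of DataFrames, one per table found; replace "1.pdf" with a downloaded file
tables = tabula.read_pdf("1.pdf", encoding='utf-8', pages='all')
df = pd.concat(tables, ignore_index=True)
print(df)
# Write the combined table to an Excel file (needs the openpyxl package)
df.to_excel("1.xlsx", index=False)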

2. Using Adobe Acrobat to batch-convert every PDF in a folder

import os
import winerror
from win32com.client.dynamic import Dispatch, ERRORS_BAD_CONTEXT

ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)
my_dir = r"C:\Users\sq\Desktop\名单"
# Only convert PDF files; the folder may also hold .xlsx output from an earlier run
file_list = [f for f in os.listdir(my_dir) if f.lower().endswith(".pdf")]
print(file_list)
for name in file_list:
    src = os.path.join(my_dir, name)
    stem = os.path.splitext(name)[0]
    AvDoc = None
    try:
        AvDoc = Dispatch("AcroExch.AVDoc")

        if AvDoc.Open(src, ""):
            pdDoc = AvDoc.GetPDDoc()
            jsObject = pdDoc.GetJSObject()
            # Other conversion IDs can be used to save in other formats
            jsObject.SaveAs(os.path.join(my_dir, f'{stem}.xlsx'), "com.adobe.acrobat.xlsx")

    except Exception as e:
        print(str(e))

    finally:
        # Close without saving, then release the COM objects
        if AvDoc is not None:
            AvDoc.Close(True)
        jsObject = None
        pdDoc = None
        AvDoc = None
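
The batch loop pairs each PDF in the folder with an .xlsx file of the same name. That pairing can be checked on its own, without Acrobat installed, using `pathlib`. This is a minimal sketch under that assumption; `conversion_jobs` is a hypothetical helper name, not part of any library.

```python
from pathlib import Path


def conversion_jobs(folder):
    """Return (source_pdf, target_xlsx) path pairs for every PDF in folder."""
    folder = Path(folder)
    jobs = []
    for src in sorted(folder.iterdir()):
        # Skip sub-folders and anything that is not a PDF,
        # for example .xlsx files left over from an earlier run
        if src.is_file() and src.suffix.lower() == ".pdf":
            jobs.append((src, src.with_suffix(".xlsx")))
    return jobs
```

Printing the result of `conversion_jobs(my_dir)` before starting the conversion is a cheap way to confirm which files will be read and written.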

