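In sales, the most important asset is data. That data usually comes from websites, B2B platforms, and the catalogs published by trade shows.
The tool to learn here is BeautifulSoup for scraping websites: a few short snippets of code, run for five minutes, can save you six hours of manual data entry.
First, here is the code: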
# coding: utf-8
# Python 2 script (uses urllib2 and the print statement).
__author__ = 'lixiang'

from bs4 import BeautifulSoup
import urllib2
import re
from openpyxl import Workbook

urls = ['', '', '']  # site URLs withheld

# Step 1: collect the link to every exhibitor page from the listing pages.
links = []
for url in urls:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    source = response.read()
    response.close()
    soup = BeautifulSoup(source)
    urlLink = soup.find_all(href=re.compile("custom_exhibitor"))
    for tag in urlLink:
        links.append(tag['href'])

# Step 2: visit each exhibitor page and write its table into a worksheet.
count = 2  # stays above 1 until the header row has been written
wb = Workbook()
ws = wb.active
for url in links:
    text = []   # header cells (th)
    text1 = []  # data cells (td)
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    source = response.read()
    response.close()
    soup = BeautifulSoup(source)
    thtext = soup.find_all("th")
    tdtext = soup.find_all("td")
    length = len(thtext)
    for i in range(length):
        text.append(thtext[i].string)
    for j in range(length):
        try:
            b = tdtext[j].string.lstrip()
        except AttributeError:
            # .string is None when the cell has no single text child
            b = tdtext[j].string
        text1.append(b)
    print text1[1]  # progress output
    if count > 1:
        # write the header row once, taken from the first page
        ws.append(text)
        count = count - 1
    ws.append(text1)
wb.save('文件名.xlsx')  # placeholder file name
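The loop above pairs each `th` with the `td` at the same index and falls back to the raw value when `.string` is `None`. Here is a minimal sketch of that step as a standalone helper. It assumes Python 3 (the listing itself targets Python 2), and `clean_cell` and `build_row` are hypothetical names that do not appear in the script. It also pads the row when a page has fewer `td` cells than `th` headers, a case that would raise `IndexError` in the loop above.

```python
def clean_cell(value):
    """Strip leading whitespace from a cell's text; pass None through unchanged."""
    # Tag.string can be None, e.g. when a tag has several children.
    return value.lstrip() if value is not None else None


def build_row(headers, cells):
    """Return one cleaned cell per header, padding with None if cells run short."""
    padded = list(cells[:len(headers)])
    padded += [None] * (len(headers) - len(padded))
    return [clean_cell(c) for c in padded]
```

With BeautifulSoup, the inputs would be something like `[th.string for th in soup.find_all("th")]` and the matching list for `td`.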
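What I'm fairly happy with is that the code above can actually scrape the data. A few questions remain, though: how to make the source more readable, for example by restructuring it into classes, and how to use multithreading to speed up the crawler.
Those are matters for the next iteration.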
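As a rough idea of what that iteration might look like, here is a minimal sketch that combines both points: a small class plus a thread pool. It assumes Python 3 and the standard library's `concurrent.futures`. The class name `ExhibitorCrawler`, its methods, and the injected `fetch` and `parse` callables are all hypothetical, not part of the script above.

```python
from concurrent.futures import ThreadPoolExecutor


class ExhibitorCrawler:
    """Hypothetical wrapper: fetch and parse are injected so each can be swapped or tested."""

    def __init__(self, fetch, parse, max_workers=8):
        self.fetch = fetch              # callable: url -> page source
        self.parse = parse              # callable: page source -> row (list of cells)
        self.max_workers = max_workers  # number of pages downloaded at once

    def scrape_one(self, url):
        """Download one page and turn it into a row."""
        return self.parse(self.fetch(url))

    def scrape_all(self, urls):
        """Scrape pages concurrently; results keep the order of `urls`."""
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            return list(pool.map(self.scrape_one, urls))
```

In real use, `fetch` would wrap the download step and `parse` the BeautifulSoup calls. Threads help here because most of the time is spent waiting on the network. Keeping the `ws.append` calls in the main thread, after `scrape_all` returns, avoids sharing the workbook across threads.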
Thanks to the internet, and thanks to knowledge. I suppose this is what efficiency looks like.