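I have been learning Python web scraping for a while. Today I practice on the Shanghai company data from Liepin (liepin.com), run a little data analysis on it, and share the results.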

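I. Data acquisition

1. Crawler approach

① Start from the Shanghai company list page, https://www.liepin.com/company/020-000/.
② That listing shows at most 100 pages, so the crawl goes industry by industry; no single industry exceeds 100 pages.
③ Collect the URL of each industry page.
④ Page through every industry listing, which yields all listing pages for Shanghai companies.
⑤ Extract the URLs of all company detail pages from those listing pages.
⑥ Parse each detail page to get the company's detailed data.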

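2. Scrapy crawler

The four functions correspond to the last four steps. The spider switches the User-Agent at random and uses no proxy; the full run of 11,548 records took about 35 minutes.
The Spider module is shown below (the other modules are essentially unused):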

# -*- coding: utf-8 -*-

import random
import re

import requests
import scrapy
from bs4 import BeautifulSoup
from lxml import etree

from LiePinWang.items import LiepinwangItem

# Pool of User-Agent headers; one entry is picked at random per request.
hds = [
    {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'},
    {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11'},
    {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)'},
    {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0'},
    {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'},
    {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'},
    {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'},
    {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11'},
    {'User-Agent': 'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11'},
    {'User-Agent': 'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11'},
]

class LiepinSpider(scrapy.Spider):
    name = "liepin"

    def start_requests(self):
        # Step 3: collect the URL of every industry listing page.
        url = 'https://www.liepin.com/company/020-000/'
        html = requests.get(url, headers=random.choice(hds)).text
        soup = BeautifulSoup(html, 'lxml')
        href_list = []
        for link in soup.select('#region > div.wrap > div.top-bar > div.industry-box > div > a'):
            href_list.append(link['href'])
        href_list.pop()  # the last link of this group is dropped, as in the original logic
        for link in soup.select('#region > div.wrap > div.top-bar > div.industry-box > div > div > a'):
            href_list.append(link['href'])
        href_list.pop()  # same for the second group
        # The first link is skipped as well, as in the original logic.
        for industry_href in href_list[1:]:
            yield scrapy.Request(url=industry_href, headers=random.choice(hds), callback=self.next_page)

    def next_page(self, response):
        # Step 4: read the page count and request every page of this industry.
        # response.text is used directly, so each page is downloaded only once.
        base_url = response.url
        match = re.search(r'<span.*?"addition">(.*?)<span.*?"redirect">', response.text, re.S)
        pages = re.sub(r'\D', '', match.group(1)) if match else ''
        if pages:
            for i in range(int(pages)):
                yield scrapy.Request(url=base_url + 'pn' + str(i), headers=random.choice(hds),
                                     callback=self.get_company_url)
        else:
            # No pager found: parse the page already downloaded. Requesting the same
            # URL again would be dropped by Scrapy's duplicate filter.
            yield from self.get_company_url(response)

    def get_company_url(self, response):
        # Step 5: collect the detail-page URL of every company on a listing page.
        soup = BeautifulSoup(response.text, 'lxml')
        company_urls = soup.select('#region > div.wrap > div.company-list.clearfix > div > div.item-top.clearfix > div > p.company-name > a')
        for company_url in company_urls:
            yield scrapy.Request(url=response.urljoin(company_url['href']), headers=random.choice(hds),
                                 callback=self.parse_detail)

    def parse_detail(self, response):
        # Step 6: parse one company detail page into an item.
        selector = etree.HTML(response.text)
        aside = '//*[@id="company"]/div[2]/div/aside/div[2]'

        def first(path):
            # First XPath match, or None when the node is missing.
            result = selector.xpath(path)
            return result[0] if result else None

        item = LiepinwangItem()
        item['companyname'] = first('//*[@id="company"]/div[2]/section/div/h1/text()')
        position = first('//*[@id="company"]/div[2]/div/div/div[2]/h2/small/text()')
        # Keep only the digits; a missing node no longer discards the whole item.
        item['position_total'] = re.sub(r'\D', '', position) if position else None
        item['welfares'] = selector.xpath('//*[@id="company"]/div[2]/section/div/div/ul/li/text()') or None
        item['industry'] = first(aside + '/ul[1]/li[1]/a/text()')
        item['companysize'] = first(aside + '/ul[1]/li[2]/text()')
        item['address'] = first(aside + '/ul[1]/li[3]/text()')
        item['poi'] = first(aside + '/ul[1]/li[3]/@data-point')
        item['time'] = first(aside + '/ul[2]/li[2]/text()')
        item['capital'] = first(aside + '/ul[2]/li[3]/text()')
        field = first(aside + '/ul[1]/li[1]/text()')
        item['field'] = field.strip() if field else None
        yield item
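
The random User-Agent switching can also live outside the spider. Below is a minimal sketch of a Scrapy downloader middleware that does it, assuming a custom setting named USER_AGENTS (a hypothetical name, not a built-in Scrapy setting) that holds a list of User-Agent strings. It is one possible arrangement, not the setup used for the crawl above.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on every request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENTS is a hypothetical custom setting defined in settings.py.
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)
        return None  # let Scrapy continue processing the request
```

To try it, the class would be registered under DOWNLOADER_MIDDLEWARES in settings.py, with a module path that matches wherever the class is saved in the project.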

The dataset is small, so saving the scraped data as CSV is enough. I then convert the CSV to Excel format, which makes cleaning and analysis easier.
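
One way to get the CSV straight from Scrapy is a feed export. The fragment below is a sketch for settings.py, assuming Scrapy 2.1 or later (where the FEEDS setting is available). The file name companies.csv is a placeholder, and the field list simply mirrors the item keys used in the spider.

```python
# settings.py (fragment)
FEEDS = {
    'companies.csv': {          # placeholder output file name
        'format': 'csv',
        # utf-8-sig writes a BOM so that Excel detects UTF-8 and shows Chinese text correctly.
        'encoding': 'utf-8-sig',
        # Column order of the CSV; the names are the item keys filled in parse_detail.
        'fields': [
            'companyname', 'industry', 'field', 'companysize', 'address',
            'poi', 'time', 'capital', 'position_total', 'welfares',
        ],
    },
}
```

With this in place, running the spider named liepin writes one row per scraped item.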

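II. Data cleaning

1. Number the records, which makes statistics easier.
2. Clean abnormal records and unify the data formats.
3. Convert the coordinates. Liepin uses the "Mars" coordinate system (GCJ-02), so the points are converted to the WGS84 coordinate system for analysis. This uses existing code by someone else: https://github.com/wandergis/coordTransform_py.
4. Use QGIS to match each company to its administrative district and its subdistrict/town (region).
5. Take the industry classification data from Liepin and group every industry into 13 major categories.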
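
For the coordinate conversion in step 3, here is a self-contained sketch of the widely circulated approximate GCJ-02 to WGS84 formulas. It is written for illustration and is not copied from the linked repository, so the constants should be checked against that code before use. The function takes plain longitude and latitude floats; how the scraped poi value is split into the two numbers is not shown in the source and is left out here.

```python
import math

A = 6378245.0                  # semi-major axis used by the GCJ-02 offset formulas
EE = 0.00669342162296594323    # eccentricity squared


def _transform_lat(x, y):
    ret = -100.0 + 2.0 * x + 3.0 * y + 0.2 * y * y + 0.1 * x * y + 0.2 * math.sqrt(abs(x))
    ret += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    ret += (20.0 * math.sin(y * math.pi) + 40.0 * math.sin(y / 3.0 * math.pi)) * 2.0 / 3.0
    ret += (160.0 * math.sin(y / 12.0 * math.pi) + 320.0 * math.sin(y * math.pi / 30.0)) * 2.0 / 3.0
    return ret


def _transform_lng(x, y):
    ret = 300.0 + x + 2.0 * y + 0.1 * x * x + 0.1 * x * y + 0.1 * math.sqrt(abs(x))
    ret += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    ret += (20.0 * math.sin(x * math.pi) + 40.0 * math.sin(x / 3.0 * math.pi)) * 2.0 / 3.0
    ret += (150.0 * math.sin(x / 12.0 * math.pi) + 300.0 * math.sin(x / 30.0 * math.pi)) * 2.0 / 3.0
    return ret


def gcj02_to_wgs84(lng, lat):
    """Approximately undo the GCJ-02 offset; returns (lng, lat) in WGS84."""
    # Points outside this bounding box around China carry no offset.
    if not (73.66 < lng < 135.05 and 3.86 < lat < 53.55):
        return lng, lat
    dlat = _transform_lat(lng - 105.0, lat - 35.0)
    dlng = _transform_lng(lng - 105.0, lat - 35.0)
    radlat = math.radians(lat)
    magic = 1 - EE * math.sin(radlat) ** 2
    sqrtmagic = math.sqrt(magic)
    dlat = (dlat * 180.0) / ((A * (1 - EE)) / (magic * sqrtmagic) * math.pi)
    dlng = (dlng * 180.0) / (A / sqrtmagic * math.cos(radlat) * math.pi)
    # The forward offset, evaluated at the GCJ-02 point, is subtracted once.
    return lng - dlng, lat - dlat


# A point in central Shanghai, used only to show the call.
print(gcj02_to_wgs84(121.4737, 31.2304))
```

Subtracting the offset once is an approximation, which is normally adequate for assigning points to districts.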
After cleaning, the data looks roughly like this:

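III. Data analysis and visualization

1. Number and share of companies by industry

On Liepin alone, the industry with the most companies is Internet/Games/Software at 25.5%, more than a quarter of the Shanghai companies listed. Shanghai has far fewer internet companies than Beijing, but the sector still holds the largest share.
Finance follows closely. Shanghai is China's financial center, so a large number of financial companies is expected.
Third is Real Estate/Construction/Property Management. Shanghai's housing prices are among the highest in the country, so this sector is well represented too.

2. Number of companies by size

Small and medium-sized companies with 100-499 employees are the most common, at 3,673, followed by small companies with 1-49 and 50-99 employees.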

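3. Company benefits

The word cloud here was made with WordArt (https://wordart.com/create), a tool recommended by the PPT expert Awen.
The top benefits are social insurance plus housing fund ("five insurances and one fund"), paid annual leave, performance bonuses, and promotion opportunities. Most of them come down to money and time off.
Each company lists many benefits, so the benefit data is first processed and counted in Python. That part is simple, so the code is not included.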
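
For readers who want a starting point, here is a minimal sketch of such a count. It is not the omitted code. It assumes each company's benefits arrive as one comma-separated string, which is a guess about the exported CSV, and the sample rows are made up for illustration.

```python
from collections import Counter


def count_benefits(rows, sep=','):
    """Count how many companies list each benefit.

    rows: iterable of strings, one per company, with benefits joined by sep.
    """
    counts = Counter()
    for row in rows:
        # A set removes repeats within one company, so each company counts once per benefit.
        benefits = {b.strip() for b in (row or '').split(sep) if b.strip()}
        counts.update(benefits)
    return counts


# Hypothetical sample rows, not taken from the scraped data.
sample = [
    'social insurance and housing fund,paid annual leave,performance bonus',
    'social insurance and housing fund,promotion opportunities',
    'paid annual leave,social insurance and housing fund',
]
for benefit, n in count_benefits(sample).most_common(2):
    print(benefit, n)
```

The resulting benefit and count pairs are the kind of word-frequency input a word cloud is built from.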

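4. Number of companies by subdistrict/town

Among the regions, Lujiazui has the most companies (537) and is the center of Shanghai's financial district. Zhangjiang is second (481) and is where Shanghai has the most programmers. Hongmei Road is third (409). To see the industry mix in these company-heavy regions, I cross-tabulated region against industry.
The results show that nearly 60% of the companies in Lujiazui are financial companies, while the share of Internet/Games/Software companies there is low.
In Zhangjiang, Internet/Games/Software companies account for 47%. In third-place Hongmei Road the share is also 47%, and the rest of its industry mix is similar to Zhangjiang's.
The cross-tabulation also shows that Weifang Xincun and Huamu have high shares of financial companies, above 40%. On the map both regions are close to Lujiazui, so its influence may explain why they also have many financial companies.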
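
The cross-tabulation of region against industry can be sketched in a few lines of standard-library Python. This assumes the cleaned data can be read as (region, industry) pairs; the function name and the toy records are illustrative only and are not the scraped figures.

```python
from collections import Counter, defaultdict


def industry_share_by_region(records):
    """Return {region: {industry: share}}; the shares within a region sum to 1."""
    counts = defaultdict(Counter)
    for region, industry in records:
        counts[region][industry] += 1
    shares = {}
    for region, industry_counts in counts.items():
        total = sum(industry_counts.values())
        shares[region] = {ind: n / total for ind, n in industry_counts.items()}
    return shares


# Made-up toy records with placeholder region names.
records = [
    ('Region A', 'Finance'), ('Region A', 'Finance'), ('Region A', 'Finance'),
    ('Region A', 'Internet/Games/Software'), ('Region A', 'Real Estate'),
    ('Region B', 'Internet/Games/Software'), ('Region B', 'Finance'),
]
shares = industry_share_by_region(records)
print(round(shares['Region A']['Finance'], 2))
```

Each inner dictionary is one row of the cross-table, so comparing two regions means comparing their rows industry by industry.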

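5. Company density by subdistrict/town

Some regions have many companies but also cover a large area, so a high count does not mean the companies are densely packed. This section introduces a new metric, density per unit area, to see where companies are actually most concentrated.

Density per unit area = number of companies in the subdistrict/town / area of the subdistrict/town

The top three by density are Nanjing West Road, Huaihai West Road, and Nanjing East Road. These regions are small but full of office buildings, mostly occupied by small and medium-sized companies, and one office building may hold dozens or even over a hundred companies. For example, the Nanjing West Road region covers only 1.6 square kilometers but has 195 companies, so its density is very high.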
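
The density metric is a single division. A small sketch, with hypothetical function names, that computes it and ranks regions by it:

```python
def density(company_count, area_km2):
    """Companies per square kilometer."""
    if area_km2 <= 0:
        raise ValueError('area must be positive')
    return company_count / area_km2


def rank_by_density(regions):
    """Sort (name, company_count, area_km2) tuples by density, densest first."""
    return sorted(regions, key=lambda r: density(r[1], r[2]), reverse=True)


# Figures quoted above for the Nanjing West Road region: 195 companies on 1.6 square kilometers.
print(round(density(195, 1.6), 1))  # 121.9
```

That works out to roughly 122 companies per square kilometer for Nanjing West Road.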

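Thank you for reading this far. If you found this interesting or useful, please give it a like, and feel free to leave any questions in the comments below.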
