Python Web Scraping Notes, Part 2: Scraping Einstein Quotes

This note covers a few ways to use BeautifulSoup.


Locating data

Searching

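One of the main uses of BS is locating tags in HTML. The most commonly used functions are find() and findAll() (spelled find_all() in bs4, where findAll is kept as a legacy alias). The two work almost the same way; the difference is that find() returns only the first matching tag, while findAll() returns every match. Their main parameters are: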

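tag : the tag to look for, given as a string or as a list of tag names
attributes : the attributes of the tag to look for, given as a dictionary whose values may be lists, for example
.find("span", { "class" : ["green", "red"] })
These two are by far the most commonly used parameters.
text : the text content of the tag (named string in newer bs4 versions); it must match the whole content rather than part of it, but a regular expression can be used for partial matching
keyword : keyword arguments that filter on an attribute, much like attributes. A list of tag names, or a list of values for one attribute, is an "or" match, while each additional keyword adds a condition that must also hold, which makes it an "and" match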
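
The parameters above can be tried out on a small document. The following is a minimal sketch, assuming a recent bs4 release and the built-in "html.parser"; the HTML snippet and all variable names are made up for illustration and do not come from the site scraped later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, used only to exercise the search parameters.
html = """
<div id="demo">
  <span class="green">alpha</span>
  <span class="red">beta</span>
  <span class="red" id="last">gamma ray</span>
  <p class="red">delta</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match only; find_all() returns every match.
first_span = soup.find("span")
all_spans = soup.find_all("span")

# A list of values for one attribute matches any of them ("or").
green_or_red = soup.find_all("span", {"class": ["green", "red"]})

# A dict literal with a repeated key keeps only the last value, so
# {"class": "green", "class": "red"} would behave like this filter:
red_only = soup.find_all("span", {"class": "red"})

# A list of tag names is also an "or" match.
spans_and_ps = soup.find_all(["span", "p"])

# string= (older spelling: text=) must equal the whole tag text...
exact = soup.find_all("span", string="gamma")
# ...unless a compiled regular expression is used for a partial match.
fuzzy = soup.find_all("span", string=re.compile("gamma"))

# Keyword arguments add conditions that must all hold ("and").
# class_ has a trailing underscore because class is a Python keyword.
red_and_last = soup.find_all("span", class_="red", id="last")

print(first_span.get_text(), len(all_spans), len(green_or_red))
```

Note that `exact` comes back empty because "gamma" is only part of the text "gamma ray", while the regular expression in `fuzzy` finds it.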

Navigating

BS can also move between nodes:
.children : the direct child nodes
.descendants : all nodes below this one, at any depth
.parent : the parent node
.next_siblings .next_sibling : all of / the first of the sibling (same-level) nodes after this node, not including the node itself
.previous_siblings .previous_sibling : all of / the first of the sibling nodes before this node, not including the node itself
.find_next_sibling() .find_previous_sibling() : as above, but these are methods and accept the same filters as find()
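
A short sketch of these navigation helpers, assuming a recent bs4 release. The list markup and variable names are invented for the example, and the HTML is written without whitespace between tags so that every sibling is a tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with no whitespace between tags.
html = "<ul><li>a</li><li>b <b>bold</b></li><li>c</li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul")
first, second, third = ul.find_all("li")

# .children yields direct children only; .descendants walks the whole
# subtree (text nodes included, filtered out here with isinstance).
child_names = [c.name for c in ul.children]
descendant_tags = [d.name for d in ul.descendants if isinstance(d, Tag)]

# .parent, .next_sibling and .previous_sibling are attributes, not methods.
parent_name = first.parent.name
next_of_first = first.next_sibling
prev_of_first = first.previous_sibling  # None: nothing comes before it

# .next_siblings / .previous_siblings are generators; the latter walks backwards.
after_first = [s.get_text() for s in first.next_siblings]
before_third = [s.get_text() for s in third.previous_siblings]

# find_next_sibling() is a method and takes the same filters as find().
nearest_li = first.find_next_sibling("li")

print(child_names, descendant_tags, after_first, before_third)
```

On pretty-printed HTML the whitespace between tags is itself a text node, so .next_sibling may return a string rather than a tag; passing a tag name to find_next_sibling() or find_previous_sibling() avoids that.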

Finding Einstein's quotes

The site we will scrape this time is Quotes to Scrape, the test site used in the Scrapy tutorial. It contains quotes from famous people, and we will collect everything on it attributed to Einstein. The page source looks like this:

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" />
            
            <a class="tag" href="/tag/change/page/1/">change</a>
            
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            
            <a class="tag" href="/tag/world/page/1/">world</a>
            
        </div>
    </div>

The scraper code is as follows:

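# obj is the BeautifulSoup object for the page; text= is the older spelling of string=
word_list = obj.findAll("small", { "class" : "author", "itemprop" : "author" }, text="Albert Einstein")
# an empty list simply prints nothing, so no length check is needed
for i in word_list:
    # the quote is the span just before the span that wraps the author
    print(i.parent.find_previous_sibling("span").get_text())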

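The code is very simple; obj is a bs object that has already been built from the page. After findAll narrows things down to the author tags, the navigation methods described above lead to the target text. Of course, you first need to analyze the page structure to find a suitable way to locate it. That is a simple application of bs; here are my results: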

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“Try not to become a man of success. Rather become a man of value.”
“If you can't explain it to a six year old, you don't understand it yourself.”
“If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”
“Logic will get you from A to Z; imagination will get you everywhere.”
“Any fool can know. The point is to understand.”
“Life is like riding a bicycle. To keep your balance, you must keep moving.”
“If I were not a physicist, I would probably be a musician. I often think in music. I live my daydreams in music. I see my life in terms of music.”
“Anyone who has never made a mistake has never tried anything new.”
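
For readers who want to run the idea end to end without the live site, here is a self-contained sketch under some assumptions: the HTML is a cut-down copy of the page structure shown earlier, the second quote block is a made-up placeholder, and the helper name `quotes_by` is hypothetical. Downloading and paging through the real site is not shown.

```python
from bs4 import BeautifulSoup

# Cut-down copy of the page structure. The second block is a made-up
# placeholder, added only to show that other authors are skipped.
html = """
<div class="quote">
    <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
    <span>by <small class="author" itemprop="author">Albert Einstein</small>
    <a href="/author/Albert-Einstein">(about)</a>
    </span>
</div>
<div class="quote">
    <span class="text">“Placeholder quote.”</span>
    <span>by <small class="author" itemprop="author">Another Author</small>
    <a href="/author/Another-Author">(about)</a>
    </span>
</div>
"""


def quotes_by(obj, author):
    """Return the quote text of every block whose author matches exactly."""
    results = []
    for small in obj.find_all("small", {"class": "author"}, string=author):
        # <small> sits inside the "by ..." span; the quote text is in the
        # span with class "text" that comes before that span.
        text_span = small.parent.find_previous_sibling("span", {"class": "text"})
        if text_span is not None:
            results.append(text_span.get_text())
    return results


obj = BeautifulSoup(html, "html.parser")
einstein = quotes_by(obj, "Albert Einstein")
for quote in einstein:
    print(quote)
```

Filtering the sibling lookup by tag name and class makes the step less dependent on whitespace or extra elements between the two spans.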

So, aren't Einstein's words full of wisdom? Have fun, everyone!
