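This note covers some ways to use BeautifulSoup.
Locating data
Searching
One of the most useful things BS does is locate tags in an HTML document. The most commonly used functions are find() and findAll(). They work almost the same way; the difference is that find() returns only the first matching tag, while findAll() returns every matching tag. Their main parameters are:
tag : the tag to look for, given as a string or a list of strings (several tag names)
attributes : the attributes the tag must have, given as a dictionary, for example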
.find("span", { "class" : ["green", "red"] })
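These two are by far the most commonly used parameters.
text : matches on the tag's text content. A plain string must equal the whole text, not just part of it, but a regular expression can be used for partial matching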
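keyword : similar to attributes, but passed as keyword arguments. Several values listed for one attribute in attributes are combined with "or", whereas several keyword arguments are combined with "and"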
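The four parameters above can be tried out on a small document. This is only a minimal sketch: the markup, the class names, and the id "hero" are made up for illustration. The calls keep the findAll and text spellings used in this note; newer bs4 releases prefer find_all and string, which behave the same way.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = """
<div>
  <span class="green">Anna</span>
  <span class="red">Boris</span>
  <span class="green" id="hero">Pierre</span>
  <h1>Title</h1>
  <h2>Subtitle</h2>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# tag: find() returns only the first match ...
first_span = soup.find("span")
# ... while findAll() returns all of them; a list of names matches any name.
headings = soup.findAll(["h1", "h2"])

# attributes: several values for one attribute are OR-ed together.
green_or_red = soup.findAll("span", {"class": ["green", "red"]})

# text: a plain string must equal the whole text; a regex allows partial matches.
exact = soup.findAll(text="Pierre")
fuzzy = soup.findAll(text=re.compile("^(An|Bo)"))

# keyword: every keyword argument must match (AND).
# "class" is a reserved word in Python, so bs4 uses class_ instead.
green_hero = soup.findAll("span", class_="green", id="hero")

print(first_span.get_text())
print([h.name for h in headings])
print([s.get_text() for s in green_or_red])
print([str(s) for s in exact], [str(s) for s in fuzzy])
print([s.get_text() for s in green_hero])
```

When text is used without a tag name, the results are the matching strings themselves rather than the tags that contain them.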
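Navigating
BS can also move from one node to another:
.children : the direct child nodes, one level down
.descendants : all descendant nodes, at every level
.parent : the parent node
.next_siblings .next_sibling : all of / the first of the following sibling nodes (nodes at the same level), not including the node itself. These are properties, so they take no parentheses
.previous_siblings .previous_sibling : all of / the nearest of the preceding sibling nodes, not including the node itself. These are also properties
.find_next_sibling() .find_previous_sibling() : same as above, but these are methods and accept the same filters as find()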
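A short sketch of these navigation attributes follows. The list markup is hypothetical, and it is written without whitespace between tags so that no whitespace-only text nodes appear among the siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup, invented for this illustration.
html = "<ul><li>one</li><li>two <b>bold</b></li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

ul = soup.find("ul")
second = ul.findAll("li")[1]

# .children yields the direct children only.
children = [child.name for child in ul.children]

# .descendants walks every level and also yields text nodes,
# so keep only the Tag objects here.
descendant_tags = [d.name for d in ul.descendants if isinstance(d, Tag)]

# .parent goes one level up.
parent_name = second.parent.name

# Properties (no parentheses): one sibling, or a generator over all of them.
next_one = second.next_sibling.get_text()
previous_all = [s.get_text() for s in second.previous_siblings]

# Method form: accepts the same filters as find().
previous_li = second.find_previous_sibling("li").get_text()

print(children)
print(descendant_tags)
print(parent_name, next_one, previous_all, previous_li)
```

On real pages the sibling properties often return whitespace text nodes between tags, which is one reason the find_*_sibling() methods, which skip to the next matching tag, tend to be more convenient.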
Finding Einstein's quotes
The site we are scraping this time is Quotes to Scrape, a test site provided by the Scrapy project. It contains quotes from famous people, and the goal is to collect everything said by Einstein. The page's HTML looks like this:
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" / >
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
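Before any searching can happen, the HTML has to be turned into a BS object. A minimal sketch, assuming the page source is already available as a string (here a shortened copy of the markup above) and that Python's built-in "html.parser" is an acceptable parser:

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup shown above; in a real scraper this string
# would be the downloaded page source.
html = """
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
</div>
"""

# "html.parser" ships with Python, so no extra parser package is required.
obj = BeautifulSoup(html, "html.parser")

# A quick look at the structure: the author sits inside the second <span>.
quote = obj.find("div", {"class": "quote"})
author = quote.find("small", {"class": "author"})
print(author.get_text())
print(author.parent.name, len(quote.findAll("span")))
```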
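The scraper code is as follows:
# obj is an already-parsed BeautifulSoup object
word_list = obj.findAll("small", { "class" : "author", "itemprop" : "author" }, text="Albert Einstein")
if word_list:
    for i in word_list:
        # the author's parent <span> comes right after the <span class="text"> holding the quote
        print(i.parent.find_previous_sibling().get_text())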
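The code is very simple. obj is an already-parsed BS object. findAll first locates every author tag whose text is "Albert Einstein"; from each of those, .parent moves up to the enclosing span, and find_previous_sibling() steps back to the span that holds the quote text. Of course, you need to analyze the page structure beforehand to find a suitable way to locate things. That is a simple application of BS, and here are my results: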
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“Try not to become a man of success. Rather become a man of value.”
“If you can't explain it to a six year old, you don't understand it yourself.”
“If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”
“Logic will get you from A to Z; imagination will get you everywhere.”
“Any fool can know. The point is to understand.”
“Life is like riding a bicycle. To keep your balance, you must keep moving.”
“If I were not a physicist, I would probably be a musician. I often think in music. I live my daydreams in music. I see my life in terms of music.”
“Anyone who has never made a mistake has never tried anything new.”
How about that, aren't Einstein's words full of wisdom? Have fun, everyone!