Web Scraping Basics: Getting Nodes and Parsing Nodes (Courseware)

Uploaded by: x*** | IP location: Guizhou | Upload date: 2022-11-04 | Format: PPTX | Pages: 28 | Size: 189.33 KB


Document summary

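Learn Python hands-on: practice brings real understanding!

Python Web Scraping Basics: Parsing Web Pages with BeautifulSoup4

The basic workflow of a web crawler

1. Send the request: send a request to the server through a URL. The request can carry additional header information.
2. Get the response: if the server responds normally, you receive a response (the content of the requested page), such as HTML, a JSON string, or binary data (video, images).
3. Parse the content: HTML is parsed with a web page parser, JSON data is converted into a JSON object, and binary data is saved to a file.
4. Save the data: save to a local file or to a database (MySQL, Redis, MongoDB, and so on).

Requests covers sending the request and getting the response; BeautifulSoup4 covers parsing the content.

After fetching an HTML page with the requests library and converting it to a string, you still need to parse the HTML and extract the useful information, which calls for a library that handles HTML and XML. The beautifulsoup4 library, also called the BeautifulSoup library or the bs4 library, parses and processes HTML and XML.

BeautifulSoup4

BeautifulSoup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that parses a document and hands you the data you want to scrape. Because it is simple to use, a complete application takes very little code. BeautifulSoup is as well established in Python as lxml and html5lib, and it can use either of them as its underlying parser. BeautifulSoup 3 is no longer developed, but it has been ported to BS4, so use BeautifulSoup4; the package is imported as bs4 (import bs4). BeautifulSoup turns a complex HTML document into a tree in which every node is a Python object. Parsing a page comes down to two things: getting nodes, and extracting information from nodes.

Extracting information

Task: extract the text of the first paragraph, Hello

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p>Hello</p><p>BeautifulSoup</p>', 'lxml')
    print(soup)            # <html><body><p>Hello</p><p>BeautifulSoup</p></body></html>
    print(type(soup.p))    # <class 'bs4.element.Tag'>
    print(soup.p)          # <p>Hello</p>   (the node, a Tag)
    print(soup.p.name)     # p              (the tag name)
    print(soup.p.string)   # Hello          (the node text)

When a tag is unique, you can get its node directly by using the tag name as an attribute of soup. If there are several tags of the same kind, as with the two paragraphs here, soup.p refers only to the first one.

Task: extract attributes from a hyperlink

    html = """<li><a class='season green' href="spring.html">Spring</a></li>"""
    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.string)    # Spring
    print(soup.li.string)   # Spring
    print(soup.a['href'])   # spring.html
    print(soup.a['class'])  # ['season', 'green']
    print(soup.a.attrs)     # a dict: {'class': ['season', 'green'], 'href': 'spring.html'}

The three elements of a node are the tag name (name), the attributes (attrs), and the text (string). In <a class='season green' href="spring.html">Spring</a>, the tag name is a, the attributes are class and href, and the text is Spring.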
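The three elements of a node (name, attrs, string) can be sketched in one short script. This is a minimal illustration, not part of the original slides: the sample HTML is made up, and it assumes Python's built-in html.parser instead of lxml so that nothing besides bs4 needs to be installed. It also shows a case the slides do not cover: string is None when a tag has more than one child, in which case get_text() is one way to collect all the text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the <li> has three children (a link, bare text, a <b> tag).
html = '<li><a class="season green" href="spring.html">Spring</a> is <b>warm</b></li>'
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser, not lxml

link = soup.a
name = link.name            # tag name: 'a'
attrs = link.attrs          # attribute dict; class is split into a list
text = link.string          # text of a tag with exactly one child: 'Spring'

missing = link.get("id")    # .get() returns None for an absent attribute
                            # (link["id"] would raise KeyError instead)

li_string = soup.li.string  # None, because <li> has several children
li_text = soup.li.get_text()  # concatenates all nested text

print(name, attrs, text)
print(missing, li_string, repr(li_text))
```

Using get() for optional attributes and get_text() for mixed content avoids the two most common surprises when moving from toy snippets to real pages.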

Getting nodes

The main ways to get nodes

BeautifulSoup offers several ways to get nodes. The two main ones are (1) the find_all and find methods and (2) CSS selectors. Because each node has three elements (tag name, attributes, text), find_all and find search on those same three. As the names suggest, find_all returns every node that matches, and find returns the first node that matches. For example, in <a class='season red' id='summer' href="summer.html">Summer</a>, the tag name is a, the attributes are class, id, and href, and the text is Summer.

Extracting Summer from the HTML

    import re
    from bs4 import BeautifulSoup

    html = """<ul>There are four seasons in a year:
    <li><a class='season green' href="spring.html">Spring</a></li>
    <li><a class='season red' id='summer' href="summer.html">Summer</a></li>
    <li><a class='season yellow' href="autumn.html">Autumn</a></li>
    <li><a class='season white' href="winter.html">Winter</a></li>
    </ul>"""
    soup = BeautifulSoup(html, 'lxml')

The signature is find_all(name, attrs, recursive, string, limit, **kwargs), where name filters on the tag name, attrs on the attributes, and string on the text. A filter can take five kinds of value: a string, a regular expression, a list, a function, or the Boolean True.

Every print call in the three methods below outputs Summer.

Method 1: find all seasons (tag a), then index the result

    print(soup.find_all('a')[1].string)
    print(soup.find_all('a', class_='season')[1].string)
    print(soup.find_all('a', {'class': 'season'})[1].string)
    print(soup.find_all('a', attrs={'class': 'season'})[1].string)  # full form

Method 2: precise lookup with find

    print(soup.find('a', id='summer').string)
    print(soup.find('a', href="summer.html").string)
    print(soup.find('a', {'id': 'summer'}).string)
    print(soup.find('a', {'href': 'summer.html'}).string)
    print(soup.find(string='Summer').string)
    print(soup.find(string=re.compile('Sum')).string)

Method 2, continued: the same lookups with find_all

    print(soup.find_all('a', id='summer')[0].string)
    print(soup.find_all('a', href="summer.html")[0].string)
    print(soup.find_all('a', {'id': 'summer'})[0].string)
    print(soup.find_all('a', {'href': 'summer.html'})[0].string)
    print(soup.find_all(string='Summer')[0].string)
    print(soup.find_all(string=re.compile('Sum'))[0].string)

Method 3: CSS selectors

    print(soup.select('a')[1].string)
    print(soup.select('ul a')[1].string)
    print(soup.select('.season')[1].string)
    print(soup.select('.red')[0].string)
    print(soup.select_one('.red').string)        # select_one returns the first match
    print(soup.select('ul .season')[1].string)
    print(soup.select('#summer')[0].string)
    print(soup.select('li>.season')[1].string)   # search direct children only
    print(soup.select('ul>li')[1].string)        # search direct children only
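The slides list five kinds of value that a find_all filter accepts (string, regular expression, list, function, True) but only demonstrate the string form. The sketch below tries each kind on the four-seasons list, plus the recursive and limit parameters from the signature. It is illustrative only: it assumes Python's built-in html.parser rather than lxml, uses English labels in the sample HTML, and has_id is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

html = """<ul>There are four seasons in a year:
<li><a class="season green" href="spring.html">Spring</a></li>
<li><a class="season red" id="summer" href="summer.html">Summer</a></li>
<li><a class="season yellow" href="autumn.html">Autumn</a></li>
<li><a class="season white" href="winter.html">Winter</a></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser, not lxml


def has_id(tag):
    """Hypothetical helper: keep only tags that carry an id attribute."""
    return tag.has_attr("id")


by_string = soup.find_all("a")                # string: exact tag name
by_regex = soup.find_all(re.compile("^l"))    # regex: tag names starting with 'l'
by_list = soup.find_all(["ul", "li"])         # list: any of the listed names
by_function = soup.find_all(has_id)           # function: called once per tag
by_true = soup.find_all(True)                 # True: every tag in the document

with_id = soup.find_all("a", id=True)         # True also works on attributes
first_two = soup.find_all("a", limit=2)       # stop after two matches
direct_links = soup.ul.find_all("a", recursive=False)  # direct children only

print(len(by_string), len(by_regex), len(by_list), len(by_true))
print(by_function[0].string, [a.string for a in first_two], direct_links)
```

The last call returns an empty list because the a tags are grandchildren of ul, not direct children, which is the same distinction the '>' combinator makes in CSS selectors.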

