Python BeautifulSoup4

For handling HTML and XML, BeautifulSoup4 [1] is the obvious choice. The reason is simple: it is convenient.
Let's run through it quickly, from installation to a cheat sheet.

Installation

Install the required pieces with the commands below:

sudo pip3 install beautifulsoup4
sudo apt-get install python3-lxml

lxml not only provides a fast HTML parser, it is also what lets bs4 parse XML [2].
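If lxml may not be installed everywhere the script runs, one option is to fall back to Python's built-in 'html.parser'. The sketch below is only an illustration of that idea: pick_parser is a hypothetical helper name, not part of bs4.

```python
import bs4


def pick_parser():
    """Return 'lxml' when it can be imported, else the built-in 'html.parser'."""
    try:
        import lxml  # noqa: F401  (imported only to check availability)
        return 'lxml'
    except ImportError:
        return 'html.parser'


parser = pick_parser()  # hypothetical helper, defined above
soup = bs4.BeautifulSoup('<p>hello</p>', parser)
text = soup.p.text
print(parser, text)
```

The fallback only covers HTML. XML parsing still needs lxml.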

Cheat Sheet

import urllib.request
import re

import bs4

dom = urllib.request.urlopen('http://makerj.tistory.com').read()
soup = bs4.BeautifulSoup(dom, 'lxml')  # use lxml-xml if you want to parse XML
print(soup.prettify())  # print well formatted dom (nice indentation)

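soup.find('title')  # first <title> tag, or None if there is none
head = soup.find('title').parent  # parent node of <title>
title_string = soup.find('title').text  # text content of the tag
title_string_data_mydata = soup.find('title')['data-mydata']  # attribute lookup; KeyError if missing

soup.findAll('div')  # every <div>; findAll is the older alias of find_all
soup.findAll('div', limit=5)  # stop after the first 5 matches
soup.findAll('div', class_='container')  # filter by CSS class
soup.findAll('div', {'class': 'container'})  # same filter, given as an attrs dict
soup.findAll('div', {'id': 'wow', 'class': 'card'})  # must match both id and class

child_child_string = soup.find('div', id='some-primary').div.span.text  # first nested div, then its first span
filtered_child = soup.find('table', class_='comments').find('tr', class_='adminuser')  # chained search

soup.findAll('span', class_=re.compile(r'size_\d*'))  # class matched by a regular expression

soup.select('input[name="password"]')  # CSS selector; returns a list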
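The cheat sheet above fetches a live page, so its results depend on that page. To try the same calls offline, here is a minimal sketch against a made-up HTML string. The markup and variable names are assumptions for illustration, and it uses the built-in 'html.parser' so it runs without lxml. It also uses find_all, the current spelling of findAll, and \d+ instead of \d* so that a class like size_x is not matched.

```python
import re

import bs4

# Hypothetical page, invented for this sketch.
HTML = """
<html><head><title data-mydata="42">Demo</title></head>
<body>
  <div id="some-primary"><div><span>deep text</span></div></div>
  <div class="container"><span class="size_10">a</span></div>
  <div class="container card" id="wow"><span class="size_x">b</span></div>
  <form><input name="password" type="password"></form>
</body></html>
"""

soup = bs4.BeautifulSoup(HTML, 'html.parser')

title = soup.find('title')
print(title.parent.name)     # tag name of the parent node
print(title.text)            # text inside <title>
print(title['data-mydata'])  # subscript access raises KeyError if absent
print(title.get('missing'))  # .get() returns None instead of raising

# class_ matches any one of a tag's classes, so both container divs match.
containers = soup.find_all('div', class_='container')

# Regular expression filter on the class attribute.
sized = soup.find_all('span', class_=re.compile(r'size_\d+'))

# Attribute-style navigation walks to the first matching descendant.
deep = soup.find('div', id='some-primary').div.span.text

password_inputs = soup.select('input[name="password"]')
print(len(containers), [s.text for s in sized], deep, len(password_inputs))
```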

Pass the parser name to the constructor explicitly.

HTML: bs4.BeautifulSoup(dom, 'lxml')
XML: bs4.BeautifulSoup(dom, 'lxml-xml')
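As a rough illustration of the XML case, the sketch below parses a made-up XML string with 'lxml-xml'. The document and the entry_texts helper are hypothetical. Because 'lxml-xml' needs lxml, the helper returns None when bs4 reports that the parser is unavailable.

```python
import bs4

# Hypothetical XML document, invented for this sketch.
XML = '<Feed><Entry id="1">first</Entry><Entry id="2">second</Entry></Feed>'


def entry_texts(xml_source):
    """Return the text of every <Entry>, or None when lxml is not installed."""
    try:
        soup = bs4.BeautifulSoup(xml_source, 'lxml-xml')
    except bs4.FeatureNotFound:
        return None
    # The XML parser keeps tag names case-sensitive, so search for 'Entry'.
    return [entry.text for entry in soup.find_all('Entry')]


texts = entry_texts(XML)
print(texts)
```

An HTML parser would lowercase the tag names, which is one reason to pick the XML parser for XML input.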

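In practice, using BeautifulSoup4 comes down to using the following three methods well.

soup.find() Searches through all the nodes. Regular expressions can be used as well. Returns only the first match, or None if nothing matches.
soup.findAll() Same as find(), but the return value is always a list. If no node is found, it returns an empty list. In current bs4 the method is spelled find_all(); findAll() is the older alias.
soup.select() Works like findAll() but takes a CSS selector. A best friend for anyone used to jQuery or querySelector.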
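The difference in return values is easiest to see side by side. This is a small sketch on a made-up snippet, using the built-in 'html.parser' so it runs without lxml, and find_all, the current spelling of findAll.

```python
import bs4

soup = bs4.BeautifulSoup(
    '<ul><li class="a">one</li><li>two</li></ul>', 'html.parser')

# No <table> exists: find() gives None, find_all() gives an empty list.
missing_one = soup.find('table')
missing_all = soup.find_all('table')

first_li = soup.find('li')          # only the first match
all_li = soup.find_all('li')        # every match, as a list
by_css = soup.select('ul > li.a')   # CSS selector, also a list
one_css = soup.select_one('li.a')   # first CSS match, or None

# Guard against None before touching attributes on a find() result.
if missing_one is None:
    print('no table on this page')
print(first_li.text, [li.text for li in all_li], [li.text for li in by_css])
```

Because find() can return None, chained calls such as soup.find('table').find('tr') raise AttributeError when the first lookup fails, so a None check is worth adding in real scrapers.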

[1] Hereafter, bs4.
[2] The official documentation describes it as "The only currently supported XML parser".

License: Attribution, NonCommercial, NoDerivatives
