Python Web Scraping for Beginners: Fetching Web Page Data Automatically

Once you learn web scraping, you can collect publicly available data automatically instead of copying it by hand.

The Three Basic Steps

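import requests
from bs4 import BeautifulSoup

# 1. Fetch: download the page (a timeout keeps the script from hanging)
resp = requests.get("https://example.com", timeout=10)
resp.raise_for_status()  # stop on HTTP errors such as 404 or 500
# 2. Parse: build a searchable tree from the HTML
soup = BeautifulSoup(resp.text, 'html.parser')
# 3. Extract: print the text of every <h2> heading
for h2 in soup.find_all('h2'):
    print(h2.text)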
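
The same three steps can also be split into small functions, so that the parsing part can be tried on a plain string without touching the network. This is only a sketch: the names `fetch_html` and `extract_headings` are made up for illustration, and returning None on a failed request is one possible choice, not the only reasonable one.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Step 1: download the page; return None on a network or HTTP error."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print(f"request failed: {exc}")
        return None
    return resp.text


def extract_headings(html, tag="h2"):
    """Steps 2 and 3: parse the HTML and collect the text of each matching tag."""
    soup = BeautifulSoup(html, "html.parser")
    return [node.get_text(strip=True) for node in soup.find_all(tag)]


if __name__ == "__main__":
    html = fetch_html("https://example.com")
    if html is not None:
        for heading in extract_headings(html):
            print(heading)
```

Keeping the download separate from the parsing also means a saved copy of a page can be re-parsed as often as needed without sending another request.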

Common Methods

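soup.find('tag')       # first matching element, or None if there is none
soup.find_all('tag')   # all matching elements
soup.select('.class')  # elements matching a CSS selector
tag.text               # text content
tag['href']            # attribute value (raises KeyError if the attribute is missing)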
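
To see what each of these calls returns, here is a small sketch run against an inline HTML string. The page content, the `nav` class, and the variable names are all invented for the example, so nothing is downloaded.

```python
from bs4 import BeautifulSoup

# Hypothetical page, inlined so the example runs without a network request.
HTML = """
<html><body>
  <h1 id="title">Sample Page</h1>
  <a class="nav" href="/home">Home</a>
  <a class="nav" href="/about">About</a>
  <a class="nav">No link</a>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

first_link = soup.find("a")        # the first <a> only
all_links = soup.find_all("a")     # every <a>, as a list-like result
nav_links = soup.select(".nav")    # CSS selector: elements with class="nav"
missing = soup.find("table")       # None, because the page has no <table>

title_text = soup.find("h1").text  # "Sample Page"
first_href = first_link["href"]    # "/home"; would raise KeyError if absent

# .get() returns None for a missing attribute instead of raising.
hrefs = [a.get("href") for a in nav_links]

print(title_text, first_href, hrefs)
```

Since `find` returns None when nothing matches, it is worth checking the result before reading `.text` or an attribute from it on pages you do not control.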

⚠️ Scraping Ethics

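Follow robots.txt, add time.sleep between requests to limit your request rate, do not collect private data, and do not put a burden on the site's servers.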
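
The first two rules can be sketched with the standard library's `urllib.robotparser` and `time.sleep`. Everything specific here is an assumption made for illustration: the robots.txt content is inlined rather than downloaded, `my-crawler` is a made-up user agent, and `build_parser`, `allowed_urls`, and `polite_fetch` are hypothetical helper names.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# For a real site, use parser.set_url(".../robots.txt") and parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

USER_AGENT = "my-crawler"  # made-up name for this example


def build_parser(robots_text):
    """Parse robots.txt rules from a string."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_fetch(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    candidates = [
        "https://example.com/index.html",
        "https://example.com/private/data",
        "https://example.com/news",
    ]
    urls = allowed_urls(parser, candidates)
    # Use the site's Crawl-delay if it declares one, else fall back to 1 second.
    delay = parser.crawl_delay(USER_AGENT) or 1.0
    # A stand-in fetch function; swap in a real download call as needed.
    print(polite_fetch(urls, fetch=lambda url: f"fetched {url}", delay_seconds=delay))
```

Passing the download step in as `fetch` keeps the throttling logic independent of how pages are retrieved. The other two rules, avoiding private data and avoiding load on the site, depend on judgment about what you collect and how often, so they are not something a code check alone can enforce.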