# The Power of Web Scraping
Web scraping lets you extract data from websites programmatically. Price monitoring, research, content aggregation – the applications are endless.
## Essential Tools

```shell
pip install requests beautifulsoup4 lxml
```
## Basic Scraping Example

```python
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # fail loudly on 4xx/5xx responses
    return BeautifulSoup(resp.text, "lxml")

soup = scrape_page("http://quotes.toscrape.com")
quotes = soup.find_all("div", class_="quote")

for quote in quotes:
    text = quote.find("span", class_="text").get_text()
    author = quote.find("small", class_="author").get_text()
    print(f"{author}: {text[:50]}...")
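`find`/`find_all` work well, but BeautifulSoup also supports CSS selectors via `select` and `select_one`, which often read more compactly. A quick sketch on an inline snippet — the markup below is a made-up stand-in for the quotes page, parsed with the stdlib `html.parser` so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Stand-in markup mimicking the quotes page structure (not real site output)
html = """
<div class="quote">
  <span class="text">To be or not to be.</span>
  <small class="author">Shakespeare</small>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.quote" matches each quote container; select_one drills into it
for quote in soup.select("div.quote"):
    text = quote.select_one("span.text").get_text(strip=True)
    author = quote.select_one("small.author").get_text(strip=True)
    print(f"{author}: {text}")
```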
## Scraping Multiple Pages

```python
import time

def scrape_all_pages(base_url):
    results = []
    page = 1
    while True:
        url = f"{base_url}/page/{page}"
        soup = scrape_page(url)
        items = soup.find_all("article")
        if not items:
            break
        for item in items:
            results.append({
                "title": item.find("h2").get_text(strip=True),
                "link": item.find("a")["href"],
            })
        page += 1
        time.sleep(1)  # Be polite: pause between requests
    return results
```
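Real-world crawls hit transient failures: timeouts, connection resets, intermittent 5xx responses. A minimal retry helper with exponential backoff — a sketch, where `fetch` is any zero-argument callable (e.g. a lambda wrapping `scrape_page`) and the attempt counts and delays are illustrative assumptions:

```python
import time

def with_retries(fetch, max_attempts=3, base_delay=1.0):
    """Call fetch(), retrying on any exception with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Usage: `soup = with_retries(lambda: scrape_page(url))`. Catching bare `Exception` keeps the sketch short; in practice you would retry only on network-related errors.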
## Ethics and Best Practices
- Always check robots.txt before scraping
- Respect rate limits – add delays between requests
- Use a public API if one is available
- Check Terms of Service for scraping restrictions
- Never scrape personal or sensitive data
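The robots.txt check can be automated with the standard library's `urllib.robotparser`. A sketch — the rules below are a made-up example, parsed inline to keep it offline; against a live site you would call `rp.set_url(".../robots.txt")` followed by `rp.read()`:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Example rules parsed directly; normally fetched from the site's /robots.txt
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyScraper/1.0", "https://example.com/quotes"))     # → True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/x"))  # → False
```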
## Saving Scraped Data

```python
import csv, json

# Save to CSV (explicit UTF-8, since scraped text is often non-ASCII)
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(results)

# Save to JSON
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)
```
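Loading the CSV back later is symmetric via `csv.DictReader`. A self-contained round-trip sketch — it uses an in-memory buffer and one made-up sample row so it runs without touching disk; swap in `open("data.csv", ...)` for real files:

```python
import csv, io

rows = [{"title": "First Post", "link": "/posts/1"}]  # sample data standing in for `results`

# Write to an in-memory buffer instead of a file
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "link"])
writer.writeheader()
writer.writerows(rows)

# Read it back: DictReader yields one dict per data row
buf.seek(0)
loaded = list(csv.DictReader(buf))
print(loaded)  # → [{'title': 'First Post', 'link': '/posts/1'}]
```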
## Advanced: Selenium for Dynamic Content
Some websites load content with JavaScript. For these, use Selenium to control a real browser.
```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")
driver.implicitly_wait(5)  # wait up to 5 s for elements before giving up

elements = driver.find_elements(By.CLASS_NAME, "product-card")
for el in elements:
    print(el.text)

driver.quit()
```
Web scraping is a powerful skill. Use it responsibly and ethically!
