Web Scraping with Python: Extract Data from Any Website

The Power of Web Scraping

Web scraping lets you extract data from websites programmatically. Common applications include price monitoring, research, and content aggregation.

Essential Tools

pip install requests beautifulsoup4 lxml

Basic Scraping Example

import requests
from bs4 import BeautifulSoup

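def scrape_page(url):
    """Fetch a URL and return its HTML parsed into a BeautifulSoup tree."""
    # Some sites reject the default requests User-Agent, so send a browser-like one.
    headers = {"User-Agent": "Mozilla/5.0"}
    # The timeout keeps a stalled server from hanging the script indefinitely.
    resp = requests.get(url, headers=headers, timeout=10)
    return BeautifulSoup(resp.text, "lxml")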

soup = scrape_page("http://quotes.toscrape.com")
quotes = soup.find_all("div", class_="quote")

for quote in quotes:
    text = quote.find("span", class_="text").get_text()
    author = quote.find("small", class_="author").get_text()
    print(f"{author}: {text[:50]}...")
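
To see what the selectors above match without making a network request, the following sketch parses a small hand-written HTML fragment. The fragment is an assumption that imitates the markup of a quote block, `extract_quotes` is a hypothetical helper name, and the built-in "html.parser" backend is used so that lxml is not required. It also shows one way to guard against blocks where an expected tag is missing.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the structure of a quote block (an assumption).
SAMPLE_HTML = """
<div class="quote">
  <span class="text">Stay curious.</span>
  <small class="author">Jane Doe</small>
</div>
<div class="quote">
  <span class="text">This block has no author tag.</span>
</div>
"""

def extract_quotes(html):
    """Return a list of {"author", "text"} dicts, skipping incomplete blocks."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for quote in soup.find_all("div", class_="quote"):
        text = quote.find("span", class_="text")
        author = quote.find("small", class_="author")
        # find() returns None when nothing matches, so check before get_text().
        if text is None or author is None:
            continue
        rows.append({
            "author": author.get_text(strip=True),
            "text": text.get_text(strip=True),
        })
    return rows

sample_rows = extract_quotes(SAMPLE_HTML)
print(sample_rows)
```

Only the first block is returned, because the second one lacks the author tag and is skipped rather than raising an AttributeError.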

Scraping Multiple Pages

import time

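def scrape_all_pages(base_url):
    """Collect title/link pairs from /page/1, /page/2, ... until a page has no articles."""
    results = []
    page = 1
    while True:
        url = f"{base_url}/page/{page}"
        soup = scrape_page(url)
        items = soup.find_all("article")
        if not items:
            break  # No articles here, so we are past the last page
        for item in items:
            heading = item.find("h2")
            anchor = item.find("a", href=True)
            if heading is None or anchor is None:
                continue  # Skip articles that lack a title or a link
            results.append({
                "title": heading.get_text(strip=True),
                "link": anchor["href"]
            })
        page += 1
        time.sleep(1)  # Be polite!
    return results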
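
The "link" values collected above come straight from each href attribute, so they may be relative paths such as /posts/hello. A minimal sketch of resolving them against the page they were found on, using the standard library's urljoin; `absolutize_links` and the sample URLs are hypothetical names chosen for illustration.

```python
from urllib.parse import urljoin

def absolutize_links(rows, page_url):
    """Return copies of rows with each "link" resolved against page_url."""
    return [{**row, "link": urljoin(page_url, row["link"])} for row in rows]

# Made-up sample rows in the same shape as the scraped results.
sample = [
    {"title": "Relative", "link": "/posts/hello"},
    {"title": "Absolute", "link": "https://example.org/x"},
]
fixed = absolutize_links(sample, "https://example.com/page/2")
for row in fixed:
    print(row["title"], row["link"])
```

Links that are already absolute pass through unchanged, so the helper can be applied to every row.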

Ethics and Best Practices

  • Always check robots.txt before scraping
  • Respect rate limits – add delays between requests
  • Use a public API if one is available
  • Check Terms of Service for scraping restrictions
  • Never scrape personal or sensitive data
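
The robots.txt check in the list above can be automated with Python's standard urllib.robotparser module. The sketch below is one possible approach, not a complete compliance check: `parser_from_site` and `parser_from_lines` are hypothetical helper names, and the sample rules are invented so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

def parser_from_site(base_url):
    """Download and parse base_url/robots.txt (makes a network request)."""
    parser = RobotFileParser()
    parser.set_url(base_url.rstrip("/") + "/robots.txt")
    parser.read()
    return parser

def parser_from_lines(lines):
    """Parse robots.txt rules that are already in memory."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser

# Invented rules for illustration; a real site publishes its own.
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

rules = parser_from_lines(SAMPLE_RULES)
allowed = rules.can_fetch("*", "https://example.com/blog/")
blocked = rules.can_fetch("*", "https://example.com/private/data")
delay = rules.crawl_delay("*")  # None when the site sets no Crawl-delay
print(allowed, blocked, delay)
```

In a scraper, call can_fetch() before each request and, when crawl_delay() returns a number, use it as the pause between requests instead of a fixed one second.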

Saving Scraped Data

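import csv
import json

# results is the list of dicts returned by scrape_all_pages()

# Save to CSV (explicit UTF-8 so non-ASCII text is written the same way on every platform)
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(results)

# Save to JSON (ensure_ascii=False keeps accented characters readable in the file)
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)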
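
To confirm that the saved files can be loaded again, a round trip is a quick check. The sketch below writes sample rows to a temporary folder and reads them back; `save_and_reload` and the sample data are hypothetical, and it assumes every row has exactly the "title" and "link" keys used above.

```python
import csv
import json
import tempfile
from pathlib import Path

def save_and_reload(results, folder):
    """Write results to CSV and JSON in folder, then load both files back."""
    csv_path = Path(folder) / "data.csv"
    json_path = Path(folder) / "data.json"

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "link"])
        writer.writeheader()
        writer.writerows(results)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    # Read both files back with the same encoding they were written in.
    with open(csv_path, newline="", encoding="utf-8") as f:
        from_csv = list(csv.DictReader(f))
    with open(json_path, encoding="utf-8") as f:
        from_json = json.load(f)
    return from_csv, from_json

# Made-up row with non-ASCII text to exercise the encoding.
sample_results = [{"title": "Café déjà vu", "link": "https://example.com/1"}]
with tempfile.TemporaryDirectory() as tmp:
    from_csv, from_json = save_and_reload(sample_results, tmp)
print(from_csv == sample_results, from_json == sample_results)
```

Note that csv.DictReader returns every value as a string, so numeric fields would need converting after loading.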

Advanced: Selenium for Dynamic Content

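Some websites load content with JavaScript after the initial page load, so requests only receives the initial HTML without that content. For these, use Selenium (pip install selenium) to control a real browser.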

from selenium import webdriver
from selenium.webdriver.common.by import By

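driver = webdriver.Chrome()
try:
    driver.implicitly_wait(5)  # Wait up to 5 seconds for elements to appear
    driver.get("https://example.com")

    # "product-card" is a placeholder class name; use one from your target page
    elements = driver.find_elements(By.CLASS_NAME, "product-card")
    for el in elements:
        print(el.text)
finally:
    driver.quit()  # Always close the browser, even if scraping fails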

Web scraping is a powerful skill. Use it responsibly and ethically!
