Web Scraping in Python Using BeautifulSoup
Web scraping is the process of extracting data from websites. BeautifulSoup is a powerful Python library for parsing HTML and XML documents and extracting data from them.
1️⃣ Installing Required Libraries
Before starting, install the required libraries:
- requests: fetches web pages.
- beautifulsoup4: parses HTML and extracts data.
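Both can be installed from PyPI:
pip install requests beautifulsoup4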
2️⃣ Basic Web Scraping with BeautifulSoup
Example: Extracting Titles from a Website
import requests
from bs4 import BeautifulSoup
# Step 1: Fetch the Web Page
URL = "https://example.com"
response = requests.get(URL)
# Step 2: Parse HTML Content
soup = BeautifulSoup(response.text, "html.parser")
# Step 3: Extract Data (e.g., Page Title)
title = soup.title.text
print(f"Page Title: {title}")
3️⃣ Extracting Specific Elements
You can extract elements using tags, classes, and IDs.
Example: Extracting All Headings (<h1>)
headings = soup.find_all("h1")
for heading in headings:
    print(heading.text)
Example: Extracting Paragraphs (<p>)
paragraphs = soup.find_all("p")
for para in paragraphs:
    print(para.text)
Example: Extracting Links (<a> Tags)
links = soup.find_all("a")
for link in links:
    print(link.get("href"))
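Note that href values are often relative (e.g. /about). The standard library's urljoin can resolve them against the page URL:
from urllib.parse import urljoin

for link in soup.find_all("a"):
    href = link.get("href")
    if href:  # skip <a> tags that have no href attribute
        print(urljoin(URL, href))  # turn relative paths into absolute URLs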
4️⃣ Extracting Elements Using Classes and IDs
Sometimes, websites use CSS classes or IDs to structure content.
Example: Extracting Data Using a Class Name
data = soup.find_all("div", class_="article-content")
for item in data:
    print(item.text)
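BeautifulSoup also supports CSS selectors through select(), which is often more concise for class- and ID-based lookups:
# Equivalent to find_all("div", class_="article-content")
for item in soup.select("div.article-content"):
    print(item.text)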
Example: Extracting Data Using an ID
specific_element = soup.find("div", id="main-section")
print(specific_element.text)
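Note that find() returns None when no matching element exists, and calling .text on None raises an AttributeError. A small guard keeps the script from crashing:
specific_element = soup.find("div", id="main-section")
if specific_element is not None:
    print(specific_element.text)
else:
    print("No element with id='main-section' found")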
5️⃣ Extracting Data from Tables
If a website contains tabular data, you can extract it using the <table>, <tr>, and <td> tags.
Example: Extracting Table Data
table = soup.find("table")
rows = table.find_all("tr")
for row in rows:
    cols = row.find_all("td")
    cols = [col.text.strip() for col in cols]
    print(cols)
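Column headers usually live in <th> tags. A slightly fuller sketch, assuming a simple table with one header row, collects the headers and body rows separately:
table = soup.find("table")
header_cells = [th.text.strip() for th in table.find_all("th")]
rows = []
for row in table.find_all("tr"):
    cells = [td.text.strip() for td in row.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped
        rows.append(cells)
print(header_cells)
print(rows)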
6️⃣ Handling Dynamic Content (JavaScript-Rendered Pages)
Some websites load content dynamically using JavaScript. BeautifulSoup alone cannot execute JavaScript, so a tool that renders the page first, such as Selenium (or Scrapy paired with a rendering plugin), is needed.
Example: Using Selenium for JavaScript-Rendered Content
from selenium import webdriver
from bs4 import BeautifulSoup
# Set up Selenium WebDriver
driver = webdriver.Chrome() # Or use webdriver.Firefox()
driver.get("https://example.com")
html = driver.page_source # Get rendered HTML
soup = BeautifulSoup(html, "html.parser")
data = soup.find("div", class_="dynamic-content")
print(data.text)
driver.quit()
✅ Best for: Scraping sites that use JavaScript to load content dynamically.
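For scripts that run on a server or without a display, Chrome can also be started headless. A minimal sketch using Selenium's options API:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
html = driver.page_source  # rendered HTML, ready for BeautifulSoup
driver.quit()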
7️⃣ Saving Scraped Data to a File
Example: Saving Data to a CSV File
import csv
# Sample scraped data
data = [("Title1", "https://example1.com"), ("Title2", "https://example2.com")]
# Write to CSV
with open("scraped_data.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "URL"])  # Column headers
    writer.writerows(data)
print("Data saved to scraped_data.csv")
8️⃣ Respecting robots.txt (Ethical Web Scraping)
Before scraping a website, check its robots.txt file by visiting:
https://example.com/robots.txt
If a page is disallowed, you should avoid scraping it.
Example: Checking robots.txt Using Python
URL = "https://example.com/robots.txt"
response = requests.get(URL)
print(response.text)
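Python's standard library can also check the rules programmatically via urllib.robotparser:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# can_fetch() reports whether the given user agent may request a URL
print(rp.can_fetch("*", "https://example.com/some-page"))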
9️⃣ Handling Headers and Proxies
Some websites block requests that look automated. Setting realistic headers (such as a browser-like User-Agent) and, when needed, routing requests through proxies can reduce the chance of being blocked.
Example: Using Headers to Avoid Blocks
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get("https://example.com", headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
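Proxies are passed to requests as a dictionary mapping each scheme to a proxy URL. The address below is only a placeholder; substitute a proxy you actually control:
proxies = {
    "http": "http://10.10.1.10:3128",   # placeholder address
    "https": "http://10.10.1.10:3128",  # replace with your own proxy
}
response = requests.get("https://example.com", headers=headers, proxies=proxies, timeout=10)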
🔹 Summary: Web Scraping Steps
| Step | Description |
|---|---|
| 1️⃣ | Install requests and beautifulsoup4 |
| 2️⃣ | Fetch the webpage using requests.get() |
| 3️⃣ | Parse the HTML with BeautifulSoup |
| 4️⃣ | Extract elements using find() or find_all() |
| 5️⃣ | Handle dynamic content with Selenium (if needed) |
| 6️⃣ | Save data to a file (CSV, JSON, or database) |
| 7️⃣ | Follow ethical guidelines (robots.txt) |