Web Scraping Wikipedia tables with Beautiful Soup
A complete Python tutorial on web scraping with the Beautiful Soup library.
In this article, you will learn to perform web scraping using Beautiful Soup and Requests in Python 3.
You are going to scrape a Wikipedia table to fetch all of its information, filter it (if necessary), and store it in a CSV file.
- Benefits of Web Scraping
- Beautiful Soup vs Selenium vs Scrapy
- Importance of DOM in Web Scraping
- Parsing HTML Table with Beautiful Soup
- Parsing headers from Wikipedia table
- Parsing rows of data from Wikipedia table
- Converting data into CSV
Benefits of Web Scraping
The 21st century is the age of data. Every organization depends on close analysis of various data sources in order to grow its business.
With web scraping, you can gather large amounts of relevant data from many sources with little manual effort. Real estate listings, job listings, price tracking on e-commerce websites, and stock market trends are just a few of the areas where web scraping has become a go-to tool.
Beautiful Soup vs Selenium vs Scrapy
When it comes to using Python for web scraping, there are three libraries that developers typically consider for their scraping pipeline: Beautiful Soup, Selenium, and Scrapy.
Each of these libraries has its own pros and cons. You should choose the library that is best suited to your requirements.
The pros and cons of each of these libraries are described below.
1. Easy to learn
2. User friendly
2. Needs dependencies
2. Not built for scraping
1. Difficult to set up
Importance of DOM in Web Scraping
In order to scrape the necessary content, it is imperative that you understand the HTML DOM properly.
The HTML DOM is an Object Model for HTML. It defines:
- HTML elements as objects
- Properties for all HTML elements
- Methods for all HTML elements
- Events for all HTML elements
When a web page is loaded, the browser creates a Document Object Model of the page.
An HTML page consists of different tags, the table tag among them. We are interested in scraping the table tag of an HTML page.
Let’s dig deeper into the components of a table tag in HTML.
    <table>
      <thead>
        <tr>
          <th>Month</th>
          <th>Savings</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>January</td>
          <td>$100</td>
        </tr>
        <tr>
          <td>February</td>
          <td>$80</td>
        </tr>
      </tbody>
    </table>
The above HTML code renders a two-column table with the header row Month and Savings, followed by the data rows January, $100 and February, $80.
Observe the following:
- The entire table is defined within the table tag
- Header resides in the thead tag
- Data resides in the tbody tag
- Each table row is defined within a tr tag
- Each table header is defined with a th tag
- Each table data/cell is defined with a td tag
Now using the above information, we can scrape our Wikipedia tables.
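As a minimal sketch of how these tags map onto Beautiful Soup calls, the snippet below parses the sample Month/Savings table from this section. The variable names (sample_html, header_cells, body_rows) are illustrative and not part of any library.

```python
from bs4 import BeautifulSoup

# The sample table from this section, kept as a string for illustration.
sample_html = """
<table>
  <thead>
    <tr><th>Month</th><th>Savings</th></tr>
  </thead>
  <tbody>
    <tr><td>January</td><td>$100</td></tr>
    <tr><td>February</td><td>$80</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
table = sample_soup.find("table")

# Header cells live in <th> tags inside <thead>.
header_cells = [th.get_text(strip=True) for th in table.thead.find_all("th")]

# Data cells live in <td> tags, one <tr> per row, inside <tbody>.
body_rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.tbody.find_all("tr")
]

print(header_cells)
print(body_rows)
```

The same three steps (locate the table, read the th cells, read the td cells row by row) are what the rest of this article applies to the Wikipedia page.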
Parsing HTML Table with Beautiful Soup
The first step involves scraping an entire Wikipedia page and then identifying the table that we would like to store as CSV.
For this article, we will scrape the table of tropical cyclones formed in February 2020.
Step 1 - Make a GET request to the Wikipedia page and fetch all the content.
    import requests as r

    wiki_page_request = r.get("https://en.wikipedia.org/wiki/Tropical_cyclones_in_2020")
    wiki_page_text = wiki_page_request.text
The wiki_page_text variable now contains the HTML content of the page.
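The request above does not check whether the download actually succeeded. One possible hardening, sketched here under the assumption that you want failures to raise instead of being parsed as a page, is a small helper. The name fetch_html and its session and timeout parameters are hypothetical; raise_for_status and the timeout argument are part of the Requests library.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors."""
    # Any object with a requests-style .get() works here; by default a
    # new requests.Session is created.
    http = session if session is not None else requests.Session()
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.text
```

With this helper, the text of the page could be obtained with fetch_html("https://en.wikipedia.org/wiki/Tropical_cyclones_in_2020"). Accepting the session as a parameter also makes the function easy to exercise without network access.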
We will pass the content through Beautiful Soup. This should give us a BeautifulSoup object, which represents the document as a nested data structure.
    from bs4 import BeautifulSoup
    import requests as r

    wiki_page_request = r.get("https://en.wikipedia.org/wiki/Tropical_cyclones_in_2020")
    wiki_page_text = wiki_page_request.text

    # New code below
    soup = BeautifulSoup(wiki_page_text, 'html.parser')
Let’s experiment with the soup variable, which is a BeautifulSoup object.
    soup.title
    # <title>Tropical cyclones in 2020 - Wikipedia</title>

    soup.title.name
    # 'title'

    soup.title.string
    # 'Tropical cyclones in 2020 - Wikipedia'

    soup.a
    # <a id="top"></a>
This way you can interact with various elements of HTML using the Beautiful Soup object.
Let’s find the table that we want to scrape.

    # This returns a list containing all the tables in the HTML
    soup.find_all('table')

    # How many tables are there in this HTML?
    len(soup.find_all('table'))
    # 18
We are interested in the table with the caption Tropical cyclones formed in February 2020. Let’s read that particular table.
    # First drop the tables that have no caption (caption is None)
    table_soup = soup.find_all('table')
    filtered_table_soup = [table for table in table_soup if table.caption is not None]

    required_table = None

    for table in filtered_table_soup:
        if str(table.caption.string).strip() == 'Tropical cyclones formed in February 2020':
            required_table = table
            break
The required_table variable should now hold the HTML for just the Tropical cyclones formed in February 2020 table:
    <table class="wikitable sortable">
      <caption>Tropical cyclones formed in February 2020 </caption>
      <tbody>
        <tr>
          <th width="5%">Storm name </th>
          <th width="15%">Dates active </th>
          <th width="10%">Max wind<br/>km/h (mph) </th>
          <th width="5%">Pressure<br/>(hPa) </th>
          <th width="30%">Areas affected </th>
          <th width="10%">Damage<br/>(<a class="mw-redirect" href="/wiki/USD" title="USD">USD</a>) </th>
          <th width="5%">Deaths </th>
          <th width="5%">Refs </th>
        </tr>
        <tr>
        ...
    </table>
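The caption lookup can also be wrapped in a reusable function. The sketch below is one possible variant, not the only way to do it: the name find_table_by_caption is hypothetical, and it compares the caption's full text (via get_text) rather than .string, because .string is None when a caption contains nested tags.

```python
from bs4 import BeautifulSoup


def find_table_by_caption(soup, caption_text):
    """Return the first <table> whose caption equals caption_text, or None."""
    for table in soup.find_all("table"):
        caption = table.caption
        if caption is None:
            continue  # skip tables without a <caption>
        # get_text() returns all text inside the caption, including text
        # in nested tags; split/join collapses newlines and extra spaces.
        normalized = " ".join(caption.get_text().split())
        if normalized == caption_text:
            return table
    return None


# Small hand-written document used only to demonstrate the helper.
demo_soup = BeautifulSoup(
    "<table><tr><td>x</td></tr></table>"
    "<table><caption>Tropical cyclones formed in <i>February</i> 2020\n"
    "</caption><tr><td>y</td></tr></table>",
    "html.parser",
)
match = find_table_by_caption(demo_soup, "Tropical cyclones formed in February 2020")
print(match is not None)
```

Returning None when nothing matches lets the caller decide how to handle a page whose captions have changed.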
Parsing headers from Wikipedia table
Let’s move on to parsing the headers of the table.
As discussed earlier, each table header is defined with a th tag. So, we can just look up all the th elements within required_table.
    headers = [header.text.strip() for header in required_table.find_all('th')]
    # ['Storm name', 'Dates active',
    # 'Max windkm/h (mph)', 'Pressure(hPa)',
    # 'Areas affected', 'Damage(USD)', 'Deaths', 'Refs']
headers is now a list containing all the header names.
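Some of the header names come out fused together, for example 'Max windkm/h (mph)', because the br tag inside the cell contributes no text. If you prefer a space there, one option is to replace each br with a space before reading the text. This is a sketch on a hand-written two-cell sample; the helper name header_text is hypothetical.

```python
from bs4 import BeautifulSoup


def header_text(th):
    """Return the text of a header cell, with <br> tags turned into spaces.

    Note: this modifies the parsed tree in place.
    """
    for br in th.find_all("br"):
        br.replace_with(" ")
    # Collapse runs of whitespace, including the trailing newline.
    return " ".join(th.get_text().split())


sample = BeautifulSoup(
    "<table><tr><th>Storm name\n</th><th>Max wind<br/>km/h (mph)\n</th></tr></table>",
    "html.parser",
)
clean_headers = [header_text(th) for th in sample.find_all("th")]
print(clean_headers)
```

On the real table, the same helper could be applied to every th returned by required_table.find_all('th').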
Parsing rows of data from Wikipedia table
Let’s now parse the rows containing the data. As discussed above, each table data/cell is defined with a td tag, and the entire row resides within a tr tag.
Now, we want to store every row as a list so that it can be easily converted to a CSV file. For this purpose, we will find all the tr tags and loop through each tr tag to find its td tags.
    rows = []

    # Find all `tr` tags
    data_rows = required_table.find_all('tr')

    for row in data_rows:
        value = row.find_all('td')
        beautified_value = [ele.text.strip() for ele in value]
        # Skip rows without any `td` cells, such as the header row
        if len(beautified_value) == 0:
            continue
        rows.append(beautified_value)
Now our variable rows contains all the rows of the table as a list of lists.
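The introduction mentions filtering the data before storing it. One possible approach is sketched below using the Month/Savings sample values rather than real cyclone data; the extra ['Total'] row is made up to show a row with missing cells, and the names complete_rows, records, and january_only are illustrative.

```python
sample_headers = ["Month", "Savings"]
# The last row is a hypothetical incomplete row, added for illustration.
sample_rows = [["January", "$100"], ["February", "$80"], ["Total"]]

# Keep only rows that have one value per header; rows with merged or
# missing cells would otherwise end up misaligned in the CSV.
complete_rows = [row for row in sample_rows if len(row) == len(sample_headers)]

# Pair each value with its header so rows can be filtered by column name.
records = [dict(zip(sample_headers, row)) for row in complete_rows]
january_only = [rec for rec in records if rec["Month"] == "January"]

print(complete_rows)
print(january_only)
```

The length check is a simple guard; whether dropping such rows is appropriate depends on the table being scraped.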
Converting data into CSV
Now that we have both our headers and the data rows, the only task that remains is to convert them to a CSV file.
    import csv

    with open('world_cyclones.csv', 'w', newline="") as output:
        writer = csv.writer(output)
        writer.writerow(headers)
        writer.writerows(rows)
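To check that a file was written as expected, you can read it back with csv.reader. The sketch below does this with the Month/Savings sample values and a temporary directory, so it does not touch world_cyclones.csv. The explicit encoding="utf-8" is an assumption on my part, chosen because scraped names may contain non-ASCII characters.

```python
import csv
import os
import tempfile

sample_headers = ["Month", "Savings"]
sample_rows = [["January", "$100"], ["February", "$80"]]

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")

    with open(csv_path, "w", newline="", encoding="utf-8") as output:
        writer = csv.writer(output)
        writer.writerow(sample_headers)
        writer.writerows(sample_rows)

    # Read the file back to confirm the header and rows round-trip.
    with open(csv_path, newline="", encoding="utf-8") as source:
        read_back = list(csv.reader(source))

print(read_back)
```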
Beautiful Soup offers a lot of useful functionality for parsing HTML data. It is user-friendly and well documented. In fact, Beautiful Soup can handle most of your parsing needs for static websites.
In this article, you have learned how to scrape Wikipedia tables using Python, requests, and Beautiful Soup. You learned how to:
- Inspect the DOM structure of Wikipedia tables.
- Download the page HTML content using the Python requests library with a GET request.
- Parse the downloaded HTML with Beautiful Soup to extract relevant information.
To learn more about Python HTTP Methods, check out our blog.