Web Scraping Wikipedia tables with Beautiful Soup

A complete Python tutorial on performing web scraping with the Beautiful Soup library.


In this article, you will learn to perform web scraping using the Beautiful Soup and Requests libraries in Python 3.

How?

You are going to scrape a Wikipedia table in order to fetch all of its information, filter it (if necessary), and store it in a CSV file.

Sections Covered

  1. Benefits of Web Scraping
  2. Beautiful Soup vs Selenium vs Scrapy
  3. Importance of DOM in Web Scraping
  4. Parsing HTML Table with Beautiful Soup
  5. Parsing headers from Wikipedia table
  6. Parsing rows of data from Wikipedia table
  7. Converting data into CSV
  8. Conclusion
  9. YouTube Video

Benefits of Web Scraping

The 21st century is the age of data. Every organization depends on minute analysis of various data sources in order to grow its business.

With web scraping, one can accumulate tons of relevant data from various sources with ease, skipping the manual effort. Real estate listings, job listings, price tracking on e-commerce websites, stock market trends, and more - web scraping has become the go-to tool for each of these objectives.

Beautiful Soup vs Selenium vs Scrapy

When it comes to using Python for web scraping, there are three libraries that developers commonly consider for their scraping pipeline: Beautiful Soup, Selenium, and Scrapy.

Each of these libraries has pros and cons of its own. One should choose the library that is best suited for their requirements.

The pros and cons of each of these libraries are described below.

  Library                 Advantages                    Disadvantages
  Beautiful Soup (BS4)    1. Easy to learn              1. Inefficient
                          2. User-friendly              2. Needs dependencies
  Selenium                1. Versatile                  1. Inefficient
                          2. Scrapes JavaScript too     2. Not built for scraping
  Scrapy                  1. Portable                   1. Difficult to set up
                          2. Efficient

Pros and Cons of each web scraping library

Importance of DOM in Web Scraping

In order to scrape the necessary content, it is imperative that you understand HTML DOM properly.

The HTML DOM is an Object Model for HTML. It defines:

  • HTML elements as objects
  • Properties for all HTML elements
  • Methods for all HTML elements
  • Events for all HTML elements

When a web page is loaded, the browser creates a Document Object Model of the page.

An HTML page consists of different tags - head, body, div, img, table, etc. We are interested in scraping the table tag of an HTML page.

Let’s dig deeper into the components of a table tag in HTML.

<table>
  <thead>
    <tr>
      <th>Month</th>
      <th>Savings</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>January</td>
      <td>$100</td>
    </tr>
    <tr>
      <td>February</td>
      <td>$80</td>
    </tr>
  </tbody>
</table>

The above HTML code will generate the following table.

Simple HTML Table

Observe the following:

  1. The entire table is defined within a <table> tag.
  2. The header resides in the <thead> tag.
  3. The data resides in the <tbody> tag.
  4. Each table row is defined within a <tr> tag.
  5. Each table header is defined with a <th> tag.
  6. Each table data/cell is defined with a <td> tag.

Now, using the above information, we can scrape our Wikipedia tables.
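
Before tackling Wikipedia, here is a minimal sketch that applies these tags to the simple Month/Savings table above (assuming Beautiful Soup 4 is installed):

from bs4 import BeautifulSoup

html = """
<table>
  <thead>
    <tr><th>Month</th><th>Savings</th></tr>
  </thead>
  <tbody>
    <tr><td>January</td><td>$100</td></tr>
    <tr><td>February</td><td>$80</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Each table header is a <th> tag
print([th.text for th in soup.find_all('th')])
# ['Month', 'Savings']

# Each data row is a <tr> tag containing <td> cells
for tr in soup.tbody.find_all('tr'):
    print([td.text for td in tr.find_all('td')])
# ['January', '$100']
# ['February', '$80']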


Parsing HTML Table with Beautiful Soup

The first step involves scraping an entire Wikipedia page and then identifying the table that we would like to store as a CSV.

For this article, we will scrape all the tropical cyclones of February 2020.

Tropical cyclones of February 2020

Step 1 - Make a GET request to the Wikipedia page and fetch all the content.

import requests as r

wiki_page_request = r.get("https://en.wikipedia.org/wiki/Tropical_cyclones_in_2020")
wiki_page_text = wiki_page_request.text

The wiki_page_text variable contains the content from the page.
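
Before moving on, it is worth confirming that the request actually succeeded. One lightweight option is the response's raise_for_status() method, which raises an exception for 4xx/5xx responses:

# Optional sanity check: stop early if the page could not be fetched
wiki_page_request.raise_for_status()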

Step 2 - Pass the content through Beautiful Soup.

This should give us a BeautifulSoup object, which represents the document as a nested data structure.

from bs4 import BeautifulSoup
import requests as r

wiki_page_request = r.get("https://en.wikipedia.org/wiki/Tropical_cyclones_in_2020")
wiki_page_text = wiki_page_request.text

# New code below
soup = BeautifulSoup(wiki_page_text, 'html.parser')

Let’s experiment with the soup variable, which is a BeautifulSoup object.

soup.title
# <title>Tropical cyclones in 2020 - Wikipedia</title>

soup.title.name
# 'title'

soup.title.string
# 'Tropical cyclones in 2020 - Wikipedia'

soup.a
# <a id="top"></a>

This way, you can interact with various elements of the HTML using the BeautifulSoup object.

Let’s find the table that we want to scrape.

"""This returns a list containing all
the tables in the HTML"""
soup.find_all('table')

""" How many tables are there in this HTML?"""
len(soup.find_all('table'))
#18

We are interested in the table with the caption Tropical cyclones formed in February 2020. Let’s read that particular table.

# First filter out tables without a caption (caption is None)
table_soup = soup.find_all('table')
filtered_table_soup = [table for table in table_soup if table.caption is not None]

required_table = None

for table in filtered_table_soup:
    if str(table.caption.string).strip() == 'Tropical cyclones formed in February 2020':
        required_table = table
        break    

We should be able to see the HTML for just the Tropical cyclones formed in February 2020 table in our required_table variable.

<table class="wikitable sortable">
    <caption>Tropical cyclones formed in February 2020
    </caption>
    <tbody>
    <tr>
        <th width="5%">Storm name
        </th>
        <th width="15%">Dates active
        </th>
        <th width="10%">Max wind<br/>km/h (mph)
        </th>
        <th width="5%">Pressure<br/>(hPa)
        </th>
        <th width="30%">Areas affected
        </th>
        <th width="10%">Damage<br/>(<a class="mw-redirect" href="/wiki/USD" title="USD">USD</a>)
        </th>
        <th width="5%">Deaths
        </th>
        <th width="5%">Refs
        </th>
    </tr>
    <tr>
       ...
</table>

Parsing headers from Wikipedia table

Let’s move on to parsing the headers of the table.

As discussed earlier, each table header is defined with a th tag. So, we can simply look up all the th elements within required_table.

headers = [header.text.strip() for header in required_table.find_all('th')]
# ['Storm name', 'Dates active',
# 'Max windkm/h (mph)', 'Pressure(hPa)',
# 'Areas affected', 'Damage(USD)', 'Deaths', 'Refs']

Our headers variable is now a list containing all the header names.
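
Notice that .text joins the text fragments around <br/> tags with no separator, which is why you see headers like 'Max windkm/h (mph)'. If you would rather have a space there, get_text() accepts a separator argument:

headers = [header.get_text(separator=' ', strip=True) for header in required_table.find_all('th')]
# ['Storm name', 'Dates active', 'Max wind km/h (mph)', ...]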


Parsing rows of data from Wikipedia table

Let’s now parse the rows containing the data. As discussed above, each table data/cell is defined with a td tag, and the entire row resides within a tr tag.

Now, we want to store every row as a list so that it can be easily converted to a CSV file. For this purpose, we will find all the tr tags and loop through each of them to collect the td tags.

rows = []

# Find all `tr` tags
data_rows = required_table.find_all('tr')

for row in data_rows:
    value = row.find_all('td')
    beautified_value = [ele.text.strip() for ele in value]
    # Skip rows with no data cells (e.g. the header row)
    if len(beautified_value) == 0:
        continue
    rows.append(beautified_value)

Now our variable rows contains all the rows of the table as lists.
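
As a quick sanity check, you can inspect what was parsed (the exact values depend on the live page):

# How many data rows were parsed?
print(len(rows))

# The first row, as a list of cell strings
print(rows[0])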


Converting data into CSV

Now that we have both our headers and the data rows, the only task that remains is to convert them to a CSV file.

import csv

with open('world_cyclones.csv', 'w', newline='', encoding='utf-8') as output:
    writer = csv.writer(output)
    writer.writerow(headers)
    writer.writerows(rows)
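
To verify the result, you can read the file back with the same csv module. A minimal check, assuming the file was written as above:

import csv

# The first line of the CSV should match the headers we wrote
with open('world_cyclones.csv', newline='') as f:
    reader = csv.reader(f)
    print(next(reader))
# ['Storm name', 'Dates active',
# 'Max windkm/h (mph)', 'Pressure(hPa)',
# 'Areas affected', 'Damage(USD)', 'Deaths', 'Refs']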

Conclusion

Beautiful Soup has a lot of useful functionality for parsing HTML data. It is user-friendly and has well-explained documentation. In fact, Beautiful Soup can help you with most of your parsing of static websites.

In this article, you have learned how to scrape Wikipedia tables using Python, requests, and Beautiful Soup. You learned how to:

  1. Inspect the DOM structure of Wikipedia tables.
  2. Download the page's HTML content using the Python requests library with a GET request.
  3. Parse the downloaded HTML with Beautiful Soup to extract the relevant information.

To learn more about Python HTTP Methods, check out our blog.


YouTube Video
