
Python Web Scraping Using Selenium and Beautiful Soup: A Step-by-Step Tutorial

Published Mar 10, 2024 | Updated Apr 24, 2025
Explore how to perform Python web scraping using Selenium and Beautiful Soup in this beginner-friendly, step-by-step guide.

In today’s data-centric world, the ability to extract information from websites has become an essential skill for developers, data analysts, and researchers. However, many websites don’t offer public APIs for easy data access, which leads us to an increasingly popular technique: web scraping.

In this comprehensive guide, we’ll learn what web scraping in Python is, when it is important, and how to use Selenium and Beautiful Soup together to build a web scraper. We’ll also go through some common errors that we may face while performing web scraping.

Let’s start the discussion with a brief introduction to web scraping.

What is web scraping?

Web scraping is the process of extracting information automatically from websites using software. Instead of manually copying and pasting content, web scrapers programmatically visit pages, parse the HTML content, and retrieve the data we’re interested in. This allows us to gather large volumes of information quickly and efficiently.

For example, a data scientist might use web scraping to collect housing prices for a market analysis, while a developer might scrape sports scores for an automated report.

We need to perform web scraping when:

  • The target website does not offer a public API
  • We need to extract data from multiple pages or dynamically loaded content
  • Manual data collection is too slow or inefficient
  • We’re working on automation tasks or building datasets for machine learning

Next, we’ll discuss what Selenium and Beautiful Soup are and why we use them together for web scraping.


What are Selenium and Beautiful Soup?

When we’re talking about web scraping in Python, two of the most widely used tools are Selenium and Beautiful Soup. Each serves a unique purpose, and together, they form a powerful duo for extracting data from both static and dynamic websites.

Selenium: Automating web browsers

Selenium is a web automation framework that enables us to control a browser through code. Originally developed for testing web applications, Selenium has become a go-to tool for scraping websites that rely on JavaScript to render content.

With Selenium, we can:

  • Launch and control a browser (e.g., Chrome, Firefox)
  • Navigate through pages and simulate user behavior (clicking buttons, submitting forms, scrolling)
  • Wait for elements to load before scraping
  • Extract dynamically generated HTML content

Selenium is especially useful when traditional scraping tools fail due to JavaScript-heavy pages or interactive elements.
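
To make this concrete, here’s a minimal sketch of Selenium in action. It assumes Chrome is installed locally, and the page and element it targets (example.com and its h1 heading) are purely illustrative:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()  # launch a Chrome window we can control from code
driver.get('https://example.com')  # navigate to a page

# Wait for an element to load before scraping it
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'h1')))
print(driver.find_element(By.TAG_NAME, 'h1').text)

driver.quit()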

Beautiful Soup: Parsing HTML made easy

Beautiful Soup is a Python library designed to parse HTML and XML documents quickly and intuitively. It allows us to search and navigate through HTML elements using tags, classes, IDs, and more.

With Beautiful Soup, we can:

  • Search for specific elements using tag names or attributes
  • Extract text, links, tables, or structured content
  • Clean and manipulate scraped HTML data

Beautiful Soup is best suited for working with static pages or the HTML output generated by tools like Selenium.
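
As a quick illustration, here’s a small, self-contained sketch of Beautiful Soup at work. The HTML snippet is made up for the example:

from bs4 import BeautifulSoup

html = """
<ul class="medals">
  <li><a href="/countries/USA">United States</a> <span class="total">113</span></li>
  <li><a href="/countries/CHN">China</a> <span class="total">89</span></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all('li'):  # search for elements by tag name
    country = item.find('a').text  # extract the link text
    total = item.find('span', class_='total').text  # search by tag and class attribute
    print(country, total)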

Why use Selenium and Beautiful Soup for web scraping?

While Selenium and Beautiful Soup are powerful on their own, using them together gives us a more complete and flexible web scraping toolkit. Here’s how:

  • Handling JavaScript-rendered pages: Many websites load data dynamically using JavaScript, which Beautiful Soup alone can’t access. Selenium renders the full page like a real browser, allowing Beautiful Soup to then parse the updated content.
  • Automating complex user interactions: Selenium can replicate user actions like clicking buttons, selecting from dropdowns, and submitting forms. This allows us to reach hidden or dynamically loaded content before scraping.
  • Streamlined HTML parsing: While Selenium offers basic scraping tools, Beautiful Soup makes HTML parsing much easier. Its syntax is more concise and efficient for searching and extracting data from HTML structures.
  • Best of both worlds: Together, these tools let us scrape almost any site - static or dynamic - with precision. Selenium handles the interaction and rendering, and Beautiful Soup focuses on clean data extraction.

This workflow allows us to scrape modern, interactive websites that load data dynamically, which would otherwise be inaccessible with static scrapers.

With these technologies covered, let’s explore how to perform web scraping using Selenium and Beautiful Soup in the next section.

Web Scraping using Selenium and Beautiful Soup

In this section, we’ll walk through a hands-on tutorial that shows how to extract medal data from Olympedia.org, a website that provides detailed historical data on Olympic events and athletes.

The Olympedia.org site has a fairly simple layout structured around a navigation bar at the top, as the main wayfinding element, with dropdowns for several categories such as “Athletes” and “Countries”.

An image that shows the navbar of the Olympedia.org website

Under the “Statistics” dropdown we can select “Medals by Country”, which leads us to a page with a table of medal counts by country for every Olympic games ever contested. Above the table are several dropdowns that we can use to filter the results (e.g., Olympic year, discipline, gender, etc.).

By selecting the year of a given Olympics and a gender, we can highlight the total medals won as well as the breakdown by medal type for that year. To collect the data required for our chart, we must extract the values for team USA for every summer Olympics, by gender. In other words, we must select each (summer Olympic) year from the dropdown in turn to update the table with the medal information for that event, for both men and women.

Let’s start the process.

Step 1: Launching a browser using Selenium

Selenium is fundamentally an automation library, which provides tools for interacting with webpages and their elements hands-free. The first step of our data collection script is to create a driver object, an instance of a browser that we can manipulate with Selenium methods.

We start with our import statements:

from selenium.webdriver import Safari
from selenium.webdriver.common.by import By

Note: In this example, we use Safari, but there are drivers available for other browsers, such as Chrome and Firefox. We also import the By class, which we’ll use in Step 2 to tell Selenium how to locate elements (by id, tag name, and so on).

Next, we instantiate a driver object and assign the URL for the medals page:

driver = Safari()
driver.get('http://www.olympedia.org/statistics/medal/country')

With these simple lines of code, we’ve launched a new Safari window, primed for automation.
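
A quick note if you’re not on macOS: Safari’s driver ships only with macOS (and requires enabling “Allow Remote Automation” in Safari’s Develop menu), so a Chrome-based driver is a common substitute. Here’s a hedged sketch of the equivalent setup with Chrome; recent Selenium releases can download and manage the matching driver binary automatically:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')  # optional: run without a visible browser window
driver = webdriver.Chrome(options=options)
driver.get('http://www.olympedia.org/statistics/medal/country')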

Step 2: Retrieving form elements

Once we have our driver instantiated and pointed at our target, we must locate the elements and options necessary to update the table. The Selenium library has many tools for locating elements; circumstances may dictate a preferred approach in some cases, but there are often several ways to achieve the same objective. Here we’ve chosen to identify elements by their “id” string, using the .find_element() method together with the By.ID locator.

We can examine the source code of the page to identify an “id”, “class name” or any other feature by right-clicking the page in the browser window and selecting “inspect element”.

An image that shows how to inspect an element

In this view, we can navigate through all the elements and identify the “ids” we need. The dropdowns for the Olympic year and gender are labeled edition_select and athlete_gender respectively. We assign those elements to variables with the following lines:

year_dd = driver.find_element(By.ID, 'edition_select')
gender_dd = driver.find_element(By.ID, 'athlete_gender')

The next step is to collect the options for those dropdowns, and we can do so with another locate method:

year_options = year_dd.find_elements(By.TAG_NAME, 'option')
gender_options = gender_dd.find_elements(By.TAG_NAME, 'option')
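
As an aside, Selenium also ships a Select helper built specifically for working with <select> dropdowns. We stick with clicking the option elements directly in this tutorial, but here’s a brief sketch of the alternative; the visible label shown is hypothetical, so check the actual option text on the page:

from selenium.webdriver.support.ui import Select

# Wrap the dropdown element and pick an option by its visible label
year_select = Select(driver.find_element(By.ID, 'edition_select'))
year_select.select_by_visible_text('1900 Summer Olympics')  # hypothetical label text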

Step 3: Navigating the website using Beautiful Soup

So far, we’ve identified the page and the form elements we need to update the tables we’re targeting. We’ve set up our automated browser window and assigned variables to the elements in question. Now, we’re in the transition phase and we’re passing the baton to the Beautiful Soup library.

In the code below, we structure this handoff within a set of nested loops, cycling through men and women in the outer loop and, in the inner loop, clicking through the years for every summer games. We execute each selection by looping over each of our option lists and calling the .click() method on the option object to submit that form selection.

for gender in gender_options[1:]: # index 0 is omitted because it contains placeholder text
    gender.click()
    for year in year_options[2:]: # skipping first two options to start with 1900
        year.click()

Once we’ve made our selections, we can hand the page over to Beautiful Soup by reading the .page_source attribute on our driver object and parsing the content of this iteration of the page (this also requires importing BeautifulSoup from the bs4 package):

the_soup = BeautifulSoup(driver.page_source, 'html.parser')
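
One practical caveat: each .click() causes the table to refresh, so reading driver.page_source immediately can occasionally capture the previous state of the page. A crude but effective safeguard is a short pause before parsing (an explicit WebDriverWait, covered later in this article, is the more robust option). This is a hedged sketch rather than part of the original script:

import time

year.click()
time.sleep(1)  # give the table a moment to refresh before reading the page source
the_soup = BeautifulSoup(driver.page_source, 'html.parser')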

Step 4: Parsing the HTML content

With the page content in hand, we must now locate the table elements of interest, so we can copy only those items to our output file. In order to isolate this content, we utilize two versions of Beautiful Soup’s search methods. First, we can grab the start of the row containing team USA results with the .find() method.

In this instance, we pass a regular expression (built with Python’s re module) as an argument to ensure we match the correct object. Next, we can use another variation of a search method, .find_all_next(tag, limit=n), to extract the medal counts. This method pulls all of the matching elements that follow our reference element, and the optional limit argument gives us the flexibility to specify how many elements (beyond our reference) we’re interested in capturing.

import re

head = the_soup.find(href=re.compile('USA'))
medal_values = head.find_all_next('td', limit=5)

Step 5: Organizing the data

At this point, we’ve completed the scaffolding for our browser automation and with the head.find_all_next('td', limit=5) object we have access to the medal counts for each medal type as well as the overall total for that year. Now, all that remains is to bundle our data and set up our export pipeline.

First, we process the data we’ve sourced by reading the .string attribute on the elements we’ve captured and assigning the result to a variable, val_lst. Then, we supplement the medal values with the year and gender values and append the entire row to a running list, usa_lst.

gender_val = gender.get_attribute('text')  # the selected gender option's label
year_val = year.get_attribute('text')      # the selected year option's label
try:
    head = the_soup.find(href=re.compile('USA'))
    medal_values = head.find_all_next('td', limit=5)
    val_lst = [x.string for x in medal_values[1:]] # the first cell is the link with the country abbreviation and flag
except AttributeError:
    val_lst = ['0' for x in range(4)] # covers years in which team USA did not compete
val_lst.append(gender_val)
val_lst.append(year_val)
usa_lst.append(val_lst)  # usa_lst is an empty list created before the loops begin

Having completed our data collection, we can close out the browser with:

driver.quit()

Finally, we can loop through all of our compiled data, usa_lst, and write it out to a CSV file using Python’s csv module. A basic export can be modeled as follows:

import csv

output_f = open('output.csv', 'w', newline='')
output_writer = csv.writer(output_f)
for row in usa_lst:
    output_writer.writerow(row)
output_f.close()

Here is the extracted data:

An image that visualizes the extracted data

If you want, you can take a look at olympic_data.py for the full solution.
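
If that file isn’t handy, here’s a hedged sketch of how the fragments above can fit together into a single script. Treat it as an outline under the assumptions noted in the comments, not a definitive implementation:

import csv
import re

from bs4 import BeautifulSoup
from selenium.webdriver import Safari
from selenium.webdriver.common.by import By

driver = Safari()
driver.get('http://www.olympedia.org/statistics/medal/country')

# Form elements identified in Step 2
year_dd = driver.find_element(By.ID, 'edition_select')
gender_dd = driver.find_element(By.ID, 'athlete_gender')
year_options = year_dd.find_elements(By.TAG_NAME, 'option')
gender_options = gender_dd.find_elements(By.TAG_NAME, 'option')

usa_lst = []  # one row per (gender, year) combination

for gender in gender_options[1:]:  # index 0 is a placeholder
    gender.click()
    gender_val = gender.get_attribute('text')
    for year in year_options[2:]:  # start with 1900
        year.click()
        year_val = year.get_attribute('text')
        the_soup = BeautifulSoup(driver.page_source, 'html.parser')
        try:
            head = the_soup.find(href=re.compile('USA'))
            medal_values = head.find_all_next('td', limit=5)
            val_lst = [x.string for x in medal_values[1:]]
        except AttributeError:
            val_lst = ['0' for x in range(4)]  # years team USA did not compete
        val_lst.append(gender_val)
        val_lst.append(year_val)
        usa_lst.append(val_lst)

driver.quit()

with open('output.csv', 'w', newline='') as output_f:
    output_writer = csv.writer(output_f)
    for row in usa_lst:
        output_writer.writerow(row)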

Congratulations, you’ve successfully built a web scraper using Selenium and Beautiful Soup!

Next, let’s have a look at some common errors that we can face while performing web scraping.

Common errors while performing web scraping

Here are some common web scraping errors and their solutions:

1. Handling dynamic content

Many modern websites load content using JavaScript, which may not be immediately visible to Beautiful Soup.

Solution:

Use time.sleep() or Selenium’s WebDriverWait to pause until the content is ready.
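
For instance, here’s a hedged sketch of an explicit wait; the element id is the one from our tutorial, so substitute whatever your target page uses:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for the dropdown to be present before parsing the page
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, 'edition_select')))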

2. Dealing with anti-scraping measures

Websites may detect and block bots using CAPTCHAs or by monitoring unusual traffic.

Solution:

  • Rotate IP addresses using proxies
  • Randomize request intervals (see the sketch after this list)
  • Use headless mode wisely
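
A tiny sketch of randomized intervals; the delay range is arbitrary, so tune it to the target site and keep it polite:

import random
import time

# Pause for a random interval between requests to avoid a robotic access pattern
time.sleep(random.uniform(2, 6))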

3. Website structure changes

Web pages can change layout or element IDs, which may break our scraper.

Solution:

Regularly inspect the site and update our selectors. Use descriptive and flexible element locators like CSS selectors or XPaths.
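
For example, here’s a hedged sketch of CSS-selector-based locators in both libraries; the selector itself is illustrative, not taken from Olympedia’s markup:

# Selenium: locate the first data row inside a table with class "table"
row = driver.find_element(By.CSS_SELECTOR, 'table.table tbody tr')

# Beautiful Soup: the same idea with the .select() method
rows = the_soup.select('table.table tbody tr')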

Conclusion

Web scraping is a powerful skill that enables developers, analysts, and data enthusiasts to gather valuable information from websites and turn it into actionable insights. In this tutorial, we explored how to combine Selenium and Beautiful Soup, two of the most widely used tools for web scraping in Python, to automate the collection of data from a dynamically rendered web page.

Whether we’re building a portfolio project, conducting market research, or automating a data-driven workflow, mastering these tools will serve us well. With this solid foundation, we’ll be more than ready to build efficient, reliable, and ethical web scraping scripts using Python.

If you want to learn more about web scraping using Beautiful Soup, check out the Learn Web Scraping with Beautiful Soup course on Codecademy.

Frequently Asked Questions

1. Can I use Selenium and Beautiful Soup together to scrape any website?

In most cases, yes: the combination of Selenium and Beautiful Soup can handle both static and dynamic sites. That said, always ensure you’re complying with the target website’s terms of service and robots.txt file.

2. Why is my scraper returning empty data?

This could be due to the page not fully loading before parsing. Implement explicit waits to ensure the content is ready.

3. How can I speed up the scraping process?

Minimize the use of Selenium where possible and use headless browsers to reduce overhead. However, always balance speed with respect to the target server.

Codecademy Team

The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.

Meet the full team