Python Web Scraping Using Selenium and Beautiful Soup: A Step-by-Step Tutorial
In today’s data-centric world, the ability to extract information from websites has become an essential skill for developers, data analysts, and researchers. However, many websites don’t offer public APIs for easy data access, which leads us to an increasingly popular technique: web scraping.
In this comprehensive guide, we’ll learn what web scraping in Python is, when it’s useful, and how to use Selenium and Beautiful Soup together to build a web scraper. We’ll also walk through some common errors we may face while scraping.
Let’s start the discussion with a brief introduction to web scraping.
What is web scraping?
Web scraping is the process of extracting information automatically from websites using software. Instead of manually copying and pasting content, web scrapers programmatically visit pages, parse the HTML content, and retrieve the data we’re interested in. This allows us to gather large volumes of information quickly and efficiently.
For example, a data scientist might use web scraping to collect housing prices for a market analysis, while a developer might scrape sports scores for an automated report.
We need to perform web scraping when:
- The target website does not offer a public API
- We need to extract data from multiple pages or dynamically loaded content
- Manual data collection is too slow or inefficient
- We’re working on automation tasks or building datasets for machine learning
Next, we’ll discuss what Selenium and Beautiful Soup are and why we use them together for web scraping.
What are Selenium and Beautiful Soup?
When we’re talking about web scraping in Python, two of the most widely used tools are Selenium and Beautiful Soup. Each serves a unique purpose, and together, they form a powerful duo for extracting data from both static and dynamic websites.
Selenium: Automating web browsers
Selenium is a web automation framework that enables us to control a browser through code. Originally developed for testing web applications, Selenium has become a go-to tool for scraping websites that rely on JavaScript to render content.
With Selenium, we can:
- Launch and control a browser (e.g., Chrome, Firefox)
- Navigate through pages and simulate user behavior (clicking buttons, submitting forms, scrolling)
- Wait for elements to load before scraping
- Extract dynamically generated HTML content
Selenium is especially useful when traditional scraping tools fail due to JavaScript-heavy pages or interactive elements.
Beautiful Soup: Parsing HTML made easy
Beautiful Soup is a Python library designed to parse HTML and XML documents quickly and intuitively. It allows us to search and navigate through HTML elements using tags, classes, IDs, and more.
With Beautiful Soup, we can:
- Search for specific elements using tag names or attributes
- Extract text, links, tables, or structured content
- Clean and manipulate scraped HTML data
Beautiful Soup is best suited for working with static pages or the HTML output generated by tools like Selenium.
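To make this concrete, here’s a minimal sketch that parses a small, made-up HTML snippet (the markup below is purely illustrative, not from any real site):

from bs4 import BeautifulSoup

html = '<table><tr><td class="country">USA</td><td>39</td></tr></table>'  # made-up markup
soup = BeautifulSoup(html, 'html.parser')

cell = soup.find('td', class_='country')  # search by tag name and class
print(cell.string)                  # -> USA
print(cell.find_next('td').string)  # -> 39

The same .find() and navigation calls work on any parsed document, which is why Beautiful Soup pairs so naturally with HTML handed over from a browser.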
Why use Selenium and Beautiful Soup for web scraping?
While Selenium and Beautiful Soup are powerful on their own, using them together gives us a more complete and flexible web scraping toolkit. Here’s how:
- Handling JavaScript-rendered pages: Many websites load data dynamically using JavaScript, which Beautiful Soup alone can’t access. Selenium renders the full page like a real browser, allowing Beautiful Soup to then parse the updated content.
- Automating complex user interactions: Selenium can replicate user actions like clicking buttons, selecting from dropdowns, and submitting forms. This allows us to reach hidden or dynamically loaded content before scraping.
- Streamlined HTML parsing: While Selenium offers basic scraping tools, Beautiful Soup makes HTML parsing much easier. Its syntax is more concise and efficient for searching and extracting data from HTML structures.
- Best of both worlds: Together, these tools let us scrape almost any site - static or dynamic - with precision. Selenium handles the interaction and rendering, and Beautiful Soup focuses on clean data extraction.
This workflow allows us to scrape modern, interactive websites that load data dynamically, which would otherwise be inaccessible with static scrapers.
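In miniature, the handoff looks like this. The URL is a placeholder standing in for any JavaScript-rendered page, and the sketch assumes Chrome is installed (Selenium 4.6+ downloads the matching driver automatically):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()        # Selenium renders the page like a real browser
driver.get('https://example.com')  # placeholder URL

soup = BeautifulSoup(driver.page_source, 'html.parser')  # parse the fully rendered HTML
print(soup.title.string)

driver.quit()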
With these technologies covered, let’s explore how to perform web scraping using Selenium and Beautiful Soup in the next section.
Web Scraping using Selenium and Beautiful Soup
In this section, we’ll walk through a hands-on tutorial that shows how to extract medal data from Olympedia.org, a website that provides detailed historical data on Olympic events and athletes.
The Olympedia.org site has a fairly simple layout: a navigation bar at the top serves as the main wayfinding element, with dropdowns for several categories such as “Athletes” and “Countries”.
Under the “Statistics” dropdown we can select “Medals by Country”, which leads us to a page with a table of medal counts by country for every Olympic games ever contested. Above the table are several dropdowns that we can use to filter the results (e.g., Olympic year, discipline, or gender).
By selecting the year of a given Olympics and a gender, we can see the total medals won as well as the breakdown by medal type for that year. To collect the data we need, we must extract the values for team USA for every summer Olympics, by gender. In other words, we must select each summer Olympic year from the dropdown in turn, updating the table with the medal information for that event, for both men and women.
Let’s start the process.
Step 1: Launching a browser using Selenium
Selenium is fundamentally an automation library, which provides tools for interacting with webpages and their elements hands-free. The first step of our data collection script is to create a driver object, an instance of a browser that we can manipulate with Selenium methods.
We start with our import statements (including the libraries we’ll need later for locating elements, parsing, and export):
from selenium import webdriver
from selenium.webdriver import Safari
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import csv
import re
Note: In this example, we use Safari, but there are drivers available for other browsers, such as Firefox.
Next, we instantiate a driver object and assign the URL for the medals page:
driver = Safari()
driver.get('http://www.olympedia.org/statistics/medal/country')
With these simple lines of code, we’ve launched a new Safari window, primed for automation.
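Safari ships only with macOS. On other platforms, an equivalent setup with Firefox might look like the sketch below; with Selenium 4.6 or newer, the driver binary is fetched automatically:

from selenium.webdriver import Firefox

driver = Firefox()  # Selenium 4.6+ downloads the matching geckodriver on first run
driver.get('http://www.olympedia.org/statistics/medal/country')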
Step 2: Retrieving form elements
Once we have our driver instantiated and pointed at our target, we must locate the elements and options necessary to update the table. The Selenium library has many tools for locating elements; circumstances may dictate a preferred approach in some cases, but often there are several ways to achieve any objective. Here we’ve chosen to use the .find_element() method with the By.ID locator, which allows us to identify an element by its “id” string. (Older tutorials use the .find_element_by_id() shortcut, but it was removed in Selenium 4, so the examples below use the current API.)
We can examine the source code of the page to identify an “id”, “class name” or any other feature by right-clicking the page in the browser window and selecting “inspect element”.
In this view, we can navigate through all the elements and identify the “ids” we need. The dropdowns for the Olympic year and gender are labeled edition_select and athlete_gender, respectively. We assign those elements to variables with the following lines:
year_dd = driver.find_element(By.ID, 'edition_select')
gender_dd = driver.find_element(By.ID, 'athlete_gender')
The next step is to collect the options for those dropdowns, and we can do so with another locate method:
year_options = year_dd.find_elements(By.TAG_NAME, 'option')
gender_options = gender_dd.find_elements(By.TAG_NAME, 'option')
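As a quick, optional sanity check, we can print the visible labels of the options we just collected to confirm we grabbed the right elements:

print([opt.text for opt in year_options[:5]])  # first few Olympic years
print([opt.text for opt in gender_options])    # labels come from the live page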
Step 3: Navigating the website using Beautiful Soup
So far, we’ve identified the page and the form elements we need to update the tables we’re targeting. We’ve set up our automated browser window and assigned variables to the elements in question. Now, we’re in the transition phase and we’re passing the baton to the Beautiful Soup library.
In the code below, we structure this handoff within a set of nested loops, cycling through men and women first and, in the interior loop, clicking through the years for every summer games. We execute each selection by simply looping over each of our option lists and calling the .click() method on the option object to submit that form selection.
usa_lst = []  # collects one row of medal data per (gender, year) combination

for gender in gender_options[1:]:  # index 0 is omitted because it contains placeholder text
    gender.click()
    gender_val = gender.get_attribute('text')  # gender label, bundled into each row later
    for year in year_options[2:]:  # skipping the first two options to start with 1900
        year.click()
Once we’ve made our selections, we can hand the page source to Beautiful Soup by reading the .page_source attribute on our driver object, then parse the content of this iteration of the page:
the_soup = BeautifulSoup(driver.page_source, 'html.parser')
Step 4: Parsing the HTML content
With the page content in hand, we must now locate the table elements of interest so we can copy only those items to our output file. To isolate this content, we utilize two of Beautiful Soup’s search methods. First, we can grab the start of the row containing team USA results with the .find() method. In this instance, we use a regular expression as an argument to ensure we get the correct object. Next, we can use another variation of a search method, .find_all_next(tag, limit=n), to extract the medal counts. This method pulls every matching element that follows a given element, and the optional limit argument gives us the flexibility to specify how many elements (beyond our reference) we’re interested in capturing.
head = the_soup.find(href=re.compile('USA'))
head.find_all_next('td', limit=5)
Step 5: Organizing the data
At this point, we’ve completed the scaffolding for our browser automation, and with the head.find_all_next('td', limit=5) object we have access to the medal counts for each medal type as well as the overall total for that year. Now, all that remains is to bundle our data and set up our export pipeline.
First, we process the data we’ve sourced by reading the .string attribute on the elements we’ve captured and assigning the result to a variable, val_lst. Then, we supplement the medal values with the year and gender values and append the entire row to our list:
try:
    year_val = year.get_attribute('text')
    head = the_soup.find(href=re.compile('USA'))
    medal_values = head.find_all_next('td', limit=5)
    val_lst = [x.string for x in medal_values[1:]]  # the first cell is the link with the country abbreviation and flag
except AttributeError:  # .find() returns None for years team USA did not compete
    val_lst = ['0' for x in range(4)]
val_lst.append(gender_val)
val_lst.append(year_val)
usa_lst.append(val_lst)
Having completed our data collection, we can close out the browser with:
driver.quit()
Finally, we can loop through all of our compiled data, usa_lst, and write it out to a CSV. A basic export can be modeled as follows:
output_f = open('output.csv', 'w', newline='')
output_writer = csv.writer(output_f)
for row in usa_lst:
    output_writer.writerow(row)
output_f.close()
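If we want column names in the output, a small variant writes a header row first. The names below assume the table’s cells arrive in gold/silver/bronze/total order, so verify them against the live page before relying on them:

with open('output.csv', 'w', newline='') as output_f:
    output_writer = csv.writer(output_f)
    output_writer.writerow(['Gold', 'Silver', 'Bronze', 'Total', 'Gender', 'Year'])  # assumed column order
    output_writer.writerows(usa_lst)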
If you want, you can take a look at olympic_data.py for the full solution.
Congratulations, you’ve successfully built a web scraper using Selenium and Beautiful Soup!
Next, let’s have a look at some common errors that we can face while performing web scraping.
Common errors while performing web scraping
Here are some common web scraping errors and their solutions:
1. Handling dynamic content
Many modern websites load content using JavaScript, which may not be immediately visible to Beautiful Soup.
Solution:
Use time.sleep() or Selenium’s WebDriverWait to delay parsing until the content is ready.
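For example, an explicit wait blocks only until the element actually appears, which is more reliable than a fixed sleep; here we reuse the edition_select dropdown from our tutorial:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the dropdown to appear before interacting with it
year_dd = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'edition_select'))
)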
2. Dealing with anti-scraping measures
Websites may detect and block bots using CAPTCHAs or by monitoring unusual traffic.
Solution:
- Rotate IP addresses using proxies
- Randomize request intervals (see the sketch after this list)
- Use headless mode wisely
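Randomizing request intervals can be as simple as sleeping for a random duration between page interactions:

import random
import time

# Pause for a random 2-5 seconds between interactions to mimic human pacing
time.sleep(random.uniform(2, 5))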
3. Website structure changes
Web pages can change layout or element IDs, which may break our scraper.
Solution:
Regularly inspect the site and update our selectors. Use descriptive and flexible element locators like CSS selectors or XPaths.
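For instance, a CSS selector that targets the element type plus its id reads clearly and tolerates changes elsewhere on the page; the selector below reuses the edition_select id from our tutorial:

from selenium.webdriver.common.by import By

# Matches the year dropdown regardless of how the surrounding layout shifts
year_dd = driver.find_element(By.CSS_SELECTOR, 'select#edition_select')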
Conclusion
Web scraping is a powerful skill that enables developers, analysts, and data enthusiasts to gather valuable information from websites and turn it into actionable insights. In this tutorial, we explored how to combine Selenium and Beautiful Soup, two of the most widely used tools for web scraping in Python, to automate the collection of data from a dynamically rendered web page.
Whether we’re building a portfolio project, conducting market research, or automating a data-driven workflow, mastering these tools will serve us well. With this solid foundation, we’ll be more than ready to build efficient, reliable, and ethical web scraping scripts using Python.
If you want to learn more about web scraping using Beautiful Soup, check out the Learn Web Scraping with Beautiful Soup course on Codecademy.
Frequently Asked Questions
1. Can I use Selenium and Beautiful Soup together to scrape any website?
While the combination of Selenium and Beautiful Soup is powerful, always ensure you’re complying with the target website’s terms of service and robots.txt file.
2. Why is my scraper returning empty data?
This could be due to the page not fully loading before parsing. Implement explicit waits to ensure the content is ready.
3. How can I speed up the scraping process?
Minimize the use of Selenium where possible and use headless browsers to reduce overhead. However, always balance speed with respect to the target server.
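As an illustration, headless Chrome can be enabled with a browser option; the flag below is the modern syntax, though it has varied across Chrome versions:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')  # run Chrome without opening a visible window
driver = webdriver.Chrome(options=options)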