Learn how to use BeautifulSoup to scrape simple HTML websites and put all of the data in one place.

Before we get started, a quick note on **prerequisites**: This course requires knowledge of <a href="https://www.codecademy.com/learn/learn-python-3">Python</a>. Some understanding of the Python library <a href="https://www.codecademy.com/learn/data-processing-pandas">Pandas</a> will also be helpful later on in the lesson, but isn't strictly necessary. If you haven't already, check out those courses before taking this one. Okay, let's get scraping!

In Data Science, we can do a lot of exciting work with the right dataset. Once we have interesting data, we can use <a href="https://www.codecademy.com/learn/data-processing-pandas">Pandas</a> or <a href="https://www.codecademy.com/learn/data-visualization-python">Matplotlib</a> to analyze or visualize trends. But how do we get that data in the first place?

If it's provided to us in a well-organized CSV or JSON file, we're lucky! Most of the time, we need to go out and search for it ourselves.

Often, you'll find the perfect website that has all the data you need, but there's no way to download it. This is where BeautifulSoup comes in handy to scrape the HTML. If we find the data we want to analyze online, we can use BeautifulSoup to grab it and turn it into a structure we can understand. This Python library, which takes its name from <a href="https://www.youtube.com/watch?v=FWxFsJUlBbw">a song in Alice in Wonderland</a>, allows us to easily and quickly take information from a website and put it into a DataFrame.


Introduction

When we scrape websites, we have to make sure we are following some guidelines so that we are treating the websites and their owners with respect.

Always check a website's Terms and Conditions before scraping. Read the statement on the legal use of data. Usually, the data you scrape should not be used for commercial purposes.

Do not spam the website with a ton of requests. A large number of requests can break a website that is unprepared for that level of traffic. As a general rule of good practice, make no more than one request per second to any one website.

If the layout of the website changes, you will have to change your scraping code to follow the new structure of the site.
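
The one-request-per-second guideline above can be sketched as a small helper. This is only an illustration, not part of the lesson's code: `fetch_politely` and its `fetch` parameter are hypothetical names, and the fetch function is passed in so the pacing logic can be tried without touching a real site.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        pages.append(fetch(url))
    return pages


# Stand-in fetch function so the example runs offline.
fake_pages = fetch_politely(["page1", "page2"], lambda url: url.upper(), delay_seconds=0.1)
print(fake_pages)
```

With the `requests` library, the `fetch` argument could be something like `lambda url: requests.get(url).text`.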


Rules of Scraping

In order to get the HTML of the website, we need to make a _request_ to get the content of the webpage. To learn more about requests in a general sense, you can check out <a href="https://www.codecademy.com/articles/http-requests" target="_blank">this article</a>.

Python has a `requests` library that makes getting content really easy. All we have to do is import the library, and then feed in the URL we want to `GET`:

```py
import requests

webpage = requests.get('https://www.codecademy.com/articles/http-requests')
print(webpage.text)
```

This code will print out the HTML of the page.
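
A request can also fail, for example with a 404 page. One way to guard against that, shown here as a sketch rather than as the lesson's own code, is to check the response before using it. The helper name `fetch_html` and the 10-second timeout are assumptions made for this example.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text of a page, raising an error for 4xx/5xx responses."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError when the status code signals failure.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://www.codecademy.com/articles/http-requests')` would return the same HTML string that `webpage.text` holds above.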

We don't want to unleash a bunch of requests on any one website in this lesson, so for the rest of this lesson we will be scraping a local HTML file and pretending it's an HTML file hosted online.

Requests

When we printed out all of that HTML from our request, it seemed pretty long and messy. How could we pull out the relevant information from that long string?

BeautifulSoup is a Python library that makes it easy for us to traverse an HTML page and pull out the parts we're interested in. We can import it by using the line:

```py
from bs4 import BeautifulSoup
```

Then, all we have to do is convert the HTML document to a BeautifulSoup object!

If this is our HTML file, `rainbow.html`:

```html
<body>
  <div>red</div>
  <div>orange</div>
  <div>yellow</div>
  <div>green</div>
  <div>blue</div>
  <div>indigo</div>
  <div>violet</div>
</body>
```

```py
with open("rainbow.html") as rainbow_file:
    soup = BeautifulSoup(rainbow_file, "html.parser")
```

`"html.parser"` is one option for parsers we could use. There are other options, like `"lxml"` and `"html5lib"` that have different advantages and disadvantages, but for our purposes we will be using `"html.parser"` throughout.

With the requests skills we just learned, we can use a website hosted online as that HTML:

```py
webpage = requests.get("http://rainbow.com/rainbow.html")
soup = BeautifulSoup(webpage.content, "html.parser")
```

When we use BeautifulSoup in combination with `pandas`, we can turn websites into DataFrames that are easy to manipulate and gain insights from. 
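
To see the conversion on its own, here is a minimal sketch that parses a shortened copy of the rainbow markup from a string. The first argument to `BeautifulSoup` is the markup itself (a string or an open file), and the variable name `rainbow_html` is made up for this example.

```python
from bs4 import BeautifulSoup

rainbow_html = """
<body>
  <div>red</div>
  <div>orange</div>
  <div>yellow</div>
</body>
"""

soup = BeautifulSoup(rainbow_html, "html.parser")
print(type(soup).__name__)  # BeautifulSoup
print(soup.div)             # <div>red</div>
```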


The BeautifulSoup Object


BeautifulSoup breaks the HTML page into several types of objects.

#### Tags


A Tag corresponds to an HTML Tag in the original document. These lines of code:

```py
soup = BeautifulSoup('<div id="example">An example div</div><p>An example p tag</p>', "html.parser")
print(soup.div)
```

Would produce output that looks like:
```
<div id="example">An example div</div>
```

Accessing a tag from the BeautifulSoup object in this way will get the _first_ tag of that type on the page. 

You can get the name of the tag using `.name` and a dictionary representing the attributes of the tag using `.attrs`:

```py
print(soup.div.name)
print(soup.div.attrs)
```

```
div
{'id': 'example'}
```

#### NavigableStrings

NavigableStrings are the pieces of text that are in the HTML tags on the page. You can get the string inside of the tag by calling `.string`:

```py
print(soup.div.string)
```

```
An example div
```
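
The two object types can be checked directly. The sketch below rebuilds the example div and uses `isinstance` to confirm which class each piece belongs to. A tag's attributes can also be read with dictionary-style indexing, and a `NavigableString` behaves like a regular Python string.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<div id="example">An example div</div>', "html.parser")
div = soup.div
text = div.string

print(isinstance(div, Tag))               # True
print(isinstance(text, NavigableString))  # True
print(div["id"])                          # example
print(str(text).upper())                  # AN EXAMPLE DIV
```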

Object Types

To navigate through a tree, we can call the tag names themselves. Imagine we have an HTML page that looks like this:

```html
<h1>World's Best Chocolate Chip Cookies</h1>
<div class="banner">
  <h1>Ingredients</h1>
</div>
<ul>
  <li> 1 cup flour </li>
  <li> 1/2 cup sugar </li>
  <li> 2 tbsp oil </li>
  <li> 1/2 tsp baking soda </li>
  <li> ½ cup chocolate chips </li> 
  <li> 1/2 tsp vanilla </li>
  <li> 2 tbsp milk </li>
</ul>
```

If we made a `soup` object out of this HTML page, we have seen that we can get the first `h1` element by calling:

```py
print(soup.h1)
```

```
<h1>World's Best Chocolate Chip Cookies</h1>
```

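We can get the children of a tag by accessing the `.children` attribute. The whitespace between the tags also counts as text children; those blank entries are left out of the output below: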

```py
for child in soup.ul.children:
	print(child)
```

```
<li> 1 cup flour </li>
<li> 1/2 cup sugar </li>
<li> 2 tbsp oil </li>
<li> 1/2 tsp baking soda </li>
<li> ½ cup chocolate chips </li> 
<li> 1/2 tsp vanilla </li>
<li> 2 tbsp milk </li>
```

We can also navigate _up_ the tree of a tag by accessing the `.parents` attribute:

```py
for parent in soup.li.parents:
	print(parent)
```
This loop will first print:
```
<ul>
<li> 1 cup flour </li>
<li> 1/2 cup sugar </li>
<li> 2 tbsp oil </li>
<li> 1/2 tsp baking soda </li>
<li> ½ cup chocolate chips </li>
<li> 1/2 tsp vanilla </li>
<li> 2 tbsp milk </li>
</ul>
```
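Then, it will print the tag that contains the `ul` (so, the `body` tag of the document, if the page has one). Then, it will print the tag that contains the `body` tag (so, the `html` tag of the document). Finally, it will print the `BeautifulSoup` object itself, which sits at the top of the tree.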
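
Here is a small runnable sketch of both directions of navigation, using a shortened, made-up version of the recipe page wrapped in `html` and `body` tags. It keeps only `Tag` children, and it prints the name of each parent instead of its full contents.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<html><body>
<ul>
  <li>1 cup flour</li>
  <li>2 tbsp milk</li>
</ul>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# .children also yields the whitespace between tags, so keep only Tag objects.
items = [child for child in soup.ul.children if isinstance(child, Tag)]
print([item.string for item in items])  # ['1 cup flour', '2 tbsp milk']

# .parents walks upward and ends at the BeautifulSoup object itself.
parent_names = [parent.name for parent in soup.li.parents]
print(parent_names)  # ['ul', 'body', 'html', '[document]']
```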



Navigating by Tags

When we're telling our Python script what HTML tags to grab, we need to know the structure of the website and what we're looking for.

Many browsers, including Chrome, Firefox, and Safari, have DevTools that help you inspect a webpage and see what HTML elements it is composed of.

First, learn <a href="https://www.codecademy.com/articles/use-devtools" target="_blank">how to use DevTools</a>.

Then, when you're preparing to scrape a website, first inspect the HTML to see where the info you are looking for is located on the page.

Website Structure

Beautiful Soup offers two methods for traversing the HTML tags on a webpage, `.find()` and `.find_all()`. Both methods can take just a tag name as a parameter but will return slightly different information.

`.find()` returns the first tag that matches the parameter or `None` if there are no tags that match.

```py
print(soup.find("h1"))
```
```html
<h1>World's Best Chocolate Chip Cookies</h1>
```

Note that this produces the same result as directly accessing `h1` through the `soup` object:

```py
print(soup.h1)
```

If we want to find all of the occurrences of a tag, instead of just the first one, we can use `.find_all()`. `.find_all()` returns a list of all the tags that match; if no tags match, it returns an empty list.

```py
print(soup.find_all("h1"))
```
```html
[<h1>World's Best Chocolate Chip Cookies</h1>, <h1>Ingredients</h1>]
```
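
Because `.find()` can return `None`, it is worth checking the result before using it. The sketch below uses a two-heading snippet made up for this example to show what each method returns when tags do and do not match.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Cookies</h1><h1>Ingredients</h1>", "html.parser")

first_heading = soup.find("h1")
missing = soup.find("table")
all_headings = soup.find_all("h1")
no_tables = soup.find_all("table")

print(first_heading.string)  # Cookies
print(missing is None)       # True
print(len(all_headings))     # 2
print(no_tables)             # []
```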

`.find()` and `.find_all()` are far more flexible than just accessing elements directly through the `soup` object. With these methods, we can use regexes, attributes, or even functions to select HTML elements more intelligently.

#### Using Regex

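Regular expressions (regex) are a way to match patterns in text. We cover Regular Expressions in more depth here. They are invaluable for finding tags on a webpage.

What if we want every `<ol>` *and* every `<ul>` that the page contains? We will use the `.compile()` function from the `re` module. We will use the regex `"[ou]l"`, which means "match either `o` or `u`, followed by `l`". Beautiful Soup tests the pattern against each tag name with `search()`, so it also matches names that merely contain `ol` or `ul`, such as `col`; write `"^[ou]l$"` to match only those two tags.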

We can select both of these types of elements with a regex in our `.find_all()`:

```py
import re
soup.find_all(re.compile("[ou]l"))
```

What if we want all of the `h1` - `h6` heading tags that the page contains? Regex to the rescue again! The expression `"h[1-6]"` means `h` followed by any digit from 1 to 6.

```py
import re
soup.find_all(re.compile("h[1-6]"))
```
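
To see how a compiled pattern is applied to tag names, here is a short sketch on a snippet invented for this example. It compares an unanchored pattern with an anchored one. The unanchored pattern also picks up `colgroup` and `col`, because their names contain `ol`.

```python
import re
from bs4 import BeautifulSoup

html = "<colgroup><col></colgroup><ul><li>a</li></ul><ol><li>b</li></ol>"
soup = BeautifulSoup(html, "html.parser")

# The pattern is searched for anywhere inside each tag name.
loose = [tag.name for tag in soup.find_all(re.compile("[ou]l"))]
# ^ and $ force the pattern to match the whole tag name.
strict = [tag.name for tag in soup.find_all(re.compile("^[ou]l$"))]

print(loose)   # ['colgroup', 'col', 'ul', 'ol']
print(strict)  # ['ul', 'ol']
```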

#### Using Lists

We can also just specify all of the elements we want to find by supplying the function with a list of the tag names we are looking for:

```py
soup.find_all(['h1', 'a', 'p'])
```

#### Using Attributes

We can also try to match the elements with relevant attributes. We can pass a dictionary to the `attrs` parameter of `find_all` with the desired attributes of the elements we're looking for. If we want to find all of the elements with the `"banner"` class, for example, we could use the command:

```py
soup.find_all(attrs={'class':'banner'})
```

Or, we can specify multiple different attributes! What if we wanted a tag with a `"banner"` class and the id `"jumbotron"`?

```py
soup.find_all(attrs={'class':'banner', 'id':'jumbotron'})
```
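
The sketch below tries these attribute searches on a few made-up divs. It also shows the `class_` keyword argument, which Beautiful Soup provides because `class` is a reserved word in Python. Note that a tag with several classes still matches a search for one of them.

```python
from bs4 import BeautifulSoup

html = """
<div class="banner" id="jumbotron">Welcome</div>
<div class="banner large">Sale</div>
<div id="footer">Bye</div>
"""
soup = BeautifulSoup(html, "html.parser")

banners = soup.find_all(attrs={"class": "banner"})
jumbotron = soup.find_all(attrs={"class": "banner", "id": "jumbotron"})
same_banners = soup.find_all("div", class_="banner")

print([tag.get_text() for tag in banners])       # ['Welcome', 'Sale']
print([tag.get_text() for tag in jumbotron])     # ['Welcome']
print([tag.get_text() for tag in same_banners])  # ['Welcome', 'Sale']
```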

#### Using A Function

If our selection starts to get really complicated, we can separate out all of the logic that we're using to choose a tag into its own function. Then, we can pass that function into `.find_all()`!

```py
def has_banner_class_and_hello_world(tag):
    # class can hold several values, so tag.get("class") returns a list like ["banner"]
    return "banner" in tag.get("class", []) and tag.string == "Hello world"

soup.find_all(has_banner_class_and_hello_world)
```

This command would find an element that looks like this:
```html
<div class="banner">Hello world</div>
```
but not an element that looks like this:
```html
<div>Hello world</div>
```
Or this:
```html
<div class="banner">What's up, world!</div>
```
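
To try a function-based filter end to end, here is a self-contained sketch that runs one against the three example divs. The function name `is_hello_banner` is made up for this example. It reads the class with `tag.get("class", [])`, which returns a list of class names, or an empty list when the tag has no class.

```python
from bs4 import BeautifulSoup

html = """
<div class="banner">Hello world</div>
<div>Hello world</div>
<div class="banner">What's up, world!</div>
"""
soup = BeautifulSoup(html, "html.parser")


def is_hello_banner(tag):
    # Match only tags that have the banner class and exactly this text.
    return "banner" in tag.get("class", []) and tag.string == "Hello world"


matches = soup.find_all(is_hello_banner)
print(matches)  # [<div class="banner">Hello world</div>]
```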


Find

Another way to capture your desired elements with the `soup` object is to use CSS selectors. The `.select()` method will take in all of the CSS selectors you normally use in a `.css` file!

```html
<h1 class='results'>Search Results for: <span class='searchTerm'>Funfetti</span></h1>
<div class='recipeLink'><a href="spaghetti.html">Funfetti Spaghetti</a></div>
<div class='recipeLink' id="selected"><a href="lasagna.html">Lasagna de Funfetti</a></div>
<div class='recipeLink'><a href="cupcakes.html">Funfetti Cupcakes</a></div>
<div class='recipeLink'><a href="pie.html">Pecan Funfetti Pie</a></div>
```
If we wanted to select all of the elements that have the class `'recipeLink'`, we could use the command:
```py
soup.select(".recipeLink")
```
If we wanted to select the element that has the id `'selected'`, we could use the command:
```py
soup.select("#selected")
```
Let's say we wanted to loop through all of the links to these funfetti recipes that we found from our search. 

```py
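from urllib.parse import urljoin

for link in soup.select(".recipeLink > a"):
  # base_url is assumed to hold the search page's URL; the hrefs are relative to it
  webpage = requests.get(urljoin(base_url, link["href"]))
  new_soup = BeautifulSoup(webpage.content, "html.parser")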
```
This loop will go through each link in each `.recipeLink` div and create a soup object out of the webpage it links to. So, it would first make soup out of the page at `spaghetti.html`, then the page at `lasagna.html`, and so on.
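
The selector part of that loop can be tried without making any requests. In this sketch, `base_url` is a hypothetical address for the search results page (it is not from the lesson). Each link's relative `href` is read with `link["href"]` and resolved against that address with `urljoin` from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

html = """
<div class='recipeLink'><a href="spaghetti.html">Funfetti Spaghetti</a></div>
<div class='recipeLink' id="selected"><a href="lasagna.html">Lasagna de Funfetti</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Hypothetical address of the search results page.
base_url = "https://example.com/recipes/search.html"

recipe_urls = [urljoin(base_url, link["href"]) for link in soup.select(".recipeLink > a")]
print(recipe_urls)
# ['https://example.com/recipes/spaghetti.html', 'https://example.com/recipes/lasagna.html']

# select_one() returns just the first match instead of a list.
selected = soup.select_one("#selected")
print(selected.a.get_text())  # Lasagna de Funfetti
```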


Select for CSS Selectors

When we use BeautifulSoup to select HTML elements, we often want to grab the text inside of the element, so that we can analyze it. We can use `.get_text()` to retrieve the text inside of whatever tag we want to call it on. 

```html
<h1 class="results">Search Results for: <span class='searchTerm'>Funfetti</span></h1>
```

If this is the HTML that has been used to create the `soup` object, we can make the call:
```py
soup.get_text()
```

Which will return:
```
'Search Results for: Funfetti'
```

Notice that this combined the text inside of the outer `h1` tag with the text contained in the `span` tag inside of it! Using `get_text()`, it looks like both of these strings are part of just one longer string. If we wanted to separate out the texts from different tags, we could specify a separator character. This command would use a `|` character to separate:

```py
soup.get_text('|')
```

Now, the command returns:
```
'Search Results for: |Funfetti'
```
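
Scraped text often carries extra whitespace. As a sketch, the snippet below adds stray spaces to the example heading and then uses the `strip=True` argument of `.get_text()` to trim each piece of text before joining it. The `.stripped_strings` generator gives the same pieces as separate strings.

```python
from bs4 import BeautifulSoup

html = "<h1 class='results'> Search Results for: <span class='searchTerm'>Funfetti</span> </h1>"
soup = BeautifulSoup(html, "html.parser")

joined = soup.get_text("|", strip=True)
pieces = list(soup.stripped_strings)

print(joined)                # Search Results for:|Funfetti
print(soup.span.get_text())  # Funfetti
print(pieces)                # ['Search Results for:', 'Funfetti']
```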


Reading Text

Amazing! Now you know the basics of how to use BeautifulSoup to turn websites into data. If you take our <a href="https://www.codecademy.com/learn/intro-to-data-visualization-with-python">Data Visualization</a> or <a href="https://www.codecademy.com/learn/data-processing-pandas">Data Manipulation</a> courses, you can see how you might analyze this data and find patterns!

Now you can see how far the rabbit hole goes by finding some interesting data you want to analyze on the web. But remember to be respectful to site owners if you test out your scraping chops on real sites.


Review

Web Scraping with Beautiful Soup

Learn how to take data that's displayed on websites and put it into Python using the Beautiful Soup library!

Beautiful Soup

Use BeautifulSoup to scrape a site that contains over 1700 expert ratings of different chocolate bars. Then, put the data you find into Pandas and analyze the results!

Explore the webpage displayed in the browser. What elements could be useful to scrape here? Which elements do we _not_ want to scrape?

Let's make a request to this site to get the raw HTML, which we can later turn into a BeautifulSoup object.

The URL is:

```
https://content.codecademy.com/courses/beautifulsoup/cacao/index.html
```

You can pass this into the `.get()` method of the `requests` module to get the HTML.

Create a BeautifulSoup object called `soup` to traverse this HTML.

Use `"html.parser"` as the parser, and the content of the response you got from your request as the document.

If you want, print out the `soup` object to explore the HTML. 

So many table rows! You're probably very relieved that we don't have to scrape this information by hand.

How many terrible chocolate bars are out there? And how many earned a perfect 5? Let's make a histogram of this data.

The first thing to do is to put all of the ratings into a list.

Use a command on the `soup` object to get all of the tags that contain the ratings.

Create an empty list called `ratings` to store all the ratings in.

Loop through the ratings tags and get the text contained in each one. Add it to the ratings list.

As you do this, convert the rating to a float, so that the ratings list will be numerical. This should help with calculations later.
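
The steps above can be sketched on a tiny stand-in table. The markup here is hypothetical: the real page's tag and class names may differ, so inspect it first. The sketch also skips any cell whose text is not a number, such as a header cell.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's tag and class names may differ.
html = """
<table>
  <tr><td class="Rating">Rating</td></tr>
  <tr><td class="Rating">3.75</td></tr>
  <tr><td class="Rating">2.5</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

ratings = []
for cell in soup.select("td.Rating"):
    text = cell.get_text(strip=True)
    try:
        ratings.append(float(text))
    except ValueError:
        # Skip cells such as a header whose text is not a number.
        continue

print(ratings)  # [3.75, 2.5]
```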

Using Matplotlib, create a histogram of the ratings values:

```py
plt.hist(ratings)
```

Remember to show the plot using `plt.show()`!

Your plot will show up at `localhost` in the web browser. You will have to navigate away from the cacao ratings webpage to see it.

We want to now find the 10 most highly rated chocolatiers. One way to do this is to make a DataFrame that has the chocolate companies in one column, and the ratings in another. Then, we can do a `groupby` to find the ones with the highest average rating.

First, let's find all the tags on the webpage that contain the company names.

Just like we did with ratings, we now want to make an empty list to hold company names.

Loop through the tags containing company names, and add the text from each tag to the list you just created.

Create a DataFrame with a column "Company" corresponding to your companies list, and a column "Rating" corresponding to your ratings list.

Use `.groupby` to group your DataFrame by Company and take the average of the grouped ratings.

Then, use the `.nlargest` command to get the 10 highest rated chocolate companies. Print them out.

Look at the hint if you get stuck on this step!
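
As a rough sketch of the DataFrame steps, here is the same idea on made-up data. The company names and ratings below are invented stand-ins for the scraped lists, and the ratings column is named `Rating` to match the plotting code later in this project.

```python
import pandas as pd

# Made-up companies and ratings, standing in for the scraped lists.
companies = ["Acme", "Acme", "Bolt", "Bolt", "Cocoa Co"]
ratings = [3.0, 4.0, 2.5, 3.5, 3.75]

df = pd.DataFrame({"Company": companies, "Rating": ratings})

# Average rating per company, then the two highest averages.
mean_ratings = df.groupby("Company").Rating.mean()
top_two = mean_ratings.nlargest(2)
print(top_two)
```

On the real data, `nlargest(10)` would give the ten highest-rated companies.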

We want to see if the chocolate experts tend to rate chocolate bars with higher levels of cocoa to be better than those with lower levels of cocoa.

It looks like the cocoa percentages are in the table under the Cocoa Percent column (note we are looking at cocoa, not cacao!).

Using the same methods you used in the last couple of tasks, create a list that contains all of the cocoa percentages. Store each percent as a float, after stripping off the `%` character.
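
Stripping the `%` character and converting to a float can be done with plain string methods. In this sketch, the helper name `percent_to_float` and the sample strings are made up for illustration.

```python
def percent_to_float(text):
    """Convert a string such as '63%' to the float 63.0."""
    # Trim surrounding whitespace first, then drop the trailing percent sign.
    return float(text.strip().rstrip("%"))


cocoa_percents = [percent_to_float(value) for value in ["63%", " 70% ", "72.5%"]]
print(cocoa_percents)  # [63.0, 70.0, 72.5]
```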

Add the cocoa percentages as a column called `"CocoaPercentage"` in the DataFrame that has companies and ratings in it.

Make a scatterplot of ratings (`your_df.Rating`) vs percentage of cocoa (`your_df.CocoaPercentage`).

You can do this in Matplotlib with these commands:

```py
plt.scatter(df.CocoaPercentage, df.Rating)
plt.show()
```

Call `plt.clf()` to clear the figure between showing your histogram and this scatterplot.

Remember that your plots will show up at the address `localhost` in the web browser.


Is there any correlation here? We can use some NumPy commands to draw a line of best fit over the scatterplot.

Copy this code and paste it after you create the scatterplot, but before you call `.show()`:

```py
z = np.polyfit(df.CocoaPercentage, df.Rating, 1)
line_function = np.poly1d(z)
plt.plot(df.CocoaPercentage, line_function(df.CocoaPercentage), "r--")
```

We have explored a couple of the questions about chocolate that inspired us when we looked at this chocolate table.

What other kinds of questions can you answer here? Try to use a combination of BeautifulSoup and Pandas to explore some more.

For inspiration:
Where are the best cocoa beans grown?
Which countries produce the highest-rated bars?

Chocolate Scraping with Beautiful Soup

Make requests repeatedly until you no longer receive information

Read the Terms and Conditions of the website

Stop making requests after you have enough information

Change your scraping code when the website code changes

`<div id='user'><p>MirandaRights</p></div>`

soup = BeautifulSoup("<div class='tweet'><span>New year, new me. </span></div><div id='user'><p>MirandaRights</p></div>")

print(soup.find(id="user").get_text())


`<div class='tweet'><span>New year, new me. </span></div>`

soup = BeautifulSoup("<div class='tweet'><span>New year, new me. </span></div><div id='user'><p>MirandaRights</p></div>")

print(soup.div.get_text())


soup = BeautifulSoup("seuss.html")

print(soup._______)

`<class 'bs4.element.Tag'> <class 'bs4.element.Tag'>`

`<class 'bs4.element.Tag'> <class 'bs4.element.NavigableString'>`

soup = BeautifulSoup("""
<h1>Syllabus</h1>
<div><h3>Unit 1: Variables</h3><p>Learn the basics!</p></div>
<div><h3>Unit 2: Loops</h3> <p>Repeat stuff!</p></div>
<div><h3>Unit 3: Review</h3></div>
""")

for child in soup.div.children:
  print(type(child))


Practice your knowledge of BeautifulSoup with this multiple choice quiz.

#### Why Learn Beautiful Soup?

Many of your coding projects may require you to pull a bunch of information from an HTML or XML page. This task can be really tedious and boring, that is, until you learn how to scrape the web with an HTML parser! That's where Beautiful Soup comes in. This Python package allows you to parse HTML and XML pages with ease and pull all sorts of data off the web.

Say you want to pull all of the tweets from your favorite movie star and run some analysis on their word usage: scrape 'em! Maybe you want to make a digital collage of all the images you've posted of your dog to Instagram: parse 'em! And if you want to pull a list of all of your friend's favorite books from Goodreads: Beautiful Soup 'em!

#### Take Away Skills

After this course, you will be able to parse HTML and XML files for all sorts of information.

Can't download the data you need? Learn how to pull data right from the page by web scraping with the Python library Beautiful Soup.

Learn Web Scraping with Beautiful Soup
