An Excel interface surrounded by illustrative tables of data, charts, and formulas.

One of the most effective ways to identify and communicate trends in data is through **data visualization**. In Excel, we can create a wide variety of visualizations in the same sheets we've been using for data storage and analysis. In this lesson, you'll learn to visualize data by:

- using **conditional formatting** to visually explore the values in tabular data
- creating **column charts** and **pie charts** to compare the sizes of categories
- creating **histograms** and **scatterplots** to visualize numeric data
- creating **line charts** and **sparklines** to identify trends in data
- using Excel's **recommended chart type** feature to easily explore different possible visualizations 

Each exercise will come with an Excel spreadsheet for you to download. The spreadsheet will give you instructions to practice what you've learned, explore customization options, and check your answers.


Introduction to Visualizing Data

Often, we think of data visualization as being all about charts and graphs. In Excel, data visualization starts with tools to help us explore the data tables themselves! 

For example, let's take a look at a table of average gas mileage (mpg) for different types of vehicles over time (from the [Bureau of Transportation Statistics](https://www.bts.gov/content/productions-production-shares-and-production-weighted-fuel-economies-new-domestic-and)).

![A table in Excel with four columns: year, car, car SUV, and pickup truck. The rows are labeled with years 2010 through 2015. The values are average miles per gallon.](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/visualizing-data/ex2_slides/assets/ex2_a.png)

What if we wanted to see at a glance whether gas mileage was improving over time, or which years met a certain threshold for efficiency? We'll look at two techniques: color scales and cell rules.

#### Color scales

In many cases, we want to see which rows and columns of a table have larger or smaller values. For example, we might want to see if vehicles have tended to get more efficient over time or identify the most and least efficient vehicles in the MPG dataset. A way to solve this using data visualization is by creating a **heatmap**, which colors each cell in the data table depending on its value.

In the heatmap below, for example, the smallest values are colored dark red, the highest values are dark green, and the values in-between are shaded based on where they fall in that spectrum (with yellow in the middle). 

The car column is all shades of green, showing that cars are more efficient than the other vehicle types (yellow for car SUVs and red for pickup trucks). Within the car column, the green gets darker over time, indicating that cars have continually improved in average MPG!

![The same MPG table. The car column is shaded green, starting light in 2010 and getting dark by 2015. The Car SUV column is shaded yellow, getting slightly lighter over time, and the pickup truck is shaded red, getting slightly lighter over time.](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/visualizing-data/ex2_slides/assets/ex2_d.png)

We've placed a slideshow in the Learning Environment illustrating how to apply a heatmap in Excel using conditional formatting!
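To see the idea behind a color scale outside of Excel, here is a minimal Python sketch. It assumes a three-color scale using pure red, yellow, and green, with the midpoint halfway between the smallest and largest values. Excel's actual colors and midpoint rule may differ, and the MPG values below are made up for illustration.

```python
# Illustrative colors (assumptions, not Excel's exact palette).
RED, YELLOW, GREEN = (255, 0, 0), (255, 255, 0), (0, 255, 0)


def blend(start, end, t):
    """Mix two RGB colors; t=0 gives start, t=1 gives end."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))


def color_for(value, lowest, highest):
    """Return an RGB color on a red-yellow-green scale for one cell."""
    if highest == lowest:
        return YELLOW  # every cell is identical, so use the middle color
    position = (value - lowest) / (highest - lowest)  # 0.0 to 1.0
    if position <= 0.5:
        return blend(RED, YELLOW, position * 2)
    return blend(YELLOW, GREEN, (position - 0.5) * 2)


# Hypothetical MPG values, not the real table.
mpg_values = [18.0, 22.0, 26.0, 30.0, 34.0]
lowest, highest = min(mpg_values), max(mpg_values)
for mpg in mpg_values:
    print(mpg, color_for(mpg, lowest, highest))
```

The smallest value comes out red, the largest green, and everything else is shaded according to where it falls between them.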

#### Cell rules

Sometimes, we have more specific questions about the values in our data. For example, we might want to see if every value in a table meets a particular threshold or not. 

In our dataset, let's assume that a minimum of 26 mpg is considered "good" gas mileage for a vehicle. We can visualize this by coloring cells that meet this criterion green and all other cells red.

The result would look something like this:

![The same MPG dataset. Now, every cell is green if it is larger than 26 and red if it is smaller than 26](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/visualizing-data/ex2_slides/assets/ex2_i.png)

In Excel, we can do this by defining custom rules to conditionally format the cells. The full process is illustrated in the slideshow in the learning environment.
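The logic behind a cell rule is a simple test applied to every cell. As a rough sketch in Python (the sample values are hypothetical, and we assume a value of exactly 26 counts as "good" because 26 is described as the minimum):

```python
GOOD_MPG = 26  # threshold from the example above


def cell_color(mpg):
    """Return the fill color for one cell under the 26 mpg rule."""
    return "green" if mpg >= GOOD_MPG else "red"


# Hypothetical values for illustration only.
for mpg in [31.5, 26.0, 24.8, 19.2]:
    print(mpg, cell_color(mpg))
```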



Conditional Formatting

A common task in data visualization is to compare the sizes of categories. For example, in our vehicle data we might want to compare how many cars are being produced as opposed to SUVs or trucks. **Column charts** and **pie charts** are simple ways to visualize categorical data and make comparisons.

For this exercise, we'll look at a dataset containing data on the production shares (%) for different types of vehicles in 2020 (from the [Bureau of Transportation Statistics](https://www.bts.gov/content/productions-production-shares-and-production-weighted-fuel-economies-new-domestic-and)).

![A table in Excel. There are two columns: vehicle type and percent. The vehicle types are: Car, 30.9%; Car SUV, 13%; Pickup truck, 14.4%; Van, 2.9%; Truck SUV, 38.7%](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/visualizing-data/ex3_slides/assets/ex3_a.png)

#### Column charts

In column charts, each category is represented by a column in the plot and we can compare them by assessing the length of the columns or bars. Note that these are commonly also called bar charts, but the tool in Excel calls them column charts.

![A chart titled Production share by vehicle type. Horizontally, there are columns labeled car, car SUV, pickup truck, van, truck SUV. The heights of the columns are measured by a vertical axis going from 0 to 45 in steps of 5. From tallest to shortest, the columns are truck SUV, car, pickup truck, car SUV, van.](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/visualizing-data/ex3_slides/assets/ex3_h.png)

#### Pie charts

In pie charts, each category is viewed as a "slice" or **sector** of a pie, with the size (or area) of the slice corresponding to the relative size of that category. Because the slices together make up the whole pie, pie charts should *only* be used to visualize categories that are pieces of some whole. In other words, the areas of the sectors (slices) should add up to 100%.

In our dataset, the different vehicle categories are all part of the same whole: the collection of all vehicles in 2020. 

![A pie chart that pictures each category as a slice of pie, with the size of the slice relative to how big that category is.](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/visualizing-data/ex3_slides/assets/ex3_k.png)

Let's look at a related dataset that might not be very suitable for pie charts. This table shows the production share of cars from 2015 to 2020:

![A table in Excel with Year and Percent columns. The Year column goes from 2015 to 2020. The percent values are: 47, 43, 41, 36, 32, 30.](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/visualizing-data/ex3_slides/assets/ex3_p.png)

While we can certainly plot a pie chart, it would be misleading, because these percents are not pieces of the same whole. They are each "slices" of their own year, and shouldn't be placed together in a pie. Make sure to evaluate your data to see if a pie chart is the right choice for visualization!
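One quick sanity check before choosing a pie chart is to add up the percentages. The Python sketch below uses the values from the two tables above. The helper name `is_whole` and the 0.5-point rounding tolerance are our own assumptions for illustration.

```python
def is_whole(percents, tolerance=0.5):
    """True if the percentages add up to roughly 100."""
    return abs(sum(percents) - 100) <= tolerance


# Production shares by vehicle type in 2020 (from the first table).
shares_2020 = {
    "Car": 30.9,
    "Car SUV": 13.0,
    "Pickup truck": 14.4,
    "Van": 2.9,
    "Truck SUV": 38.7,
}

# Production share of cars by year (from the second table).
car_share_by_year = {2015: 47, 2016: 43, 2017: 41, 2018: 36, 2019: 32, 2020: 30}

print(is_whole(shares_2020.values()))        # pieces of one whole
print(is_whole(car_share_by_year.values()))  # not pieces of one whole
```

The 2020 shares total about 100 (allowing for rounding), while the yearly car shares total 229, which is a sign that they do not belong in the same pie. A total near 100 is necessary but not sufficient, so it is still worth asking whether the categories really describe parts of one whole.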

We've placed a slideshow illustrating how to create both types of charts in the learning environment. When you're ready to practice this yourself, move on to the next set of instructions!



Column and Pie Charts

Column charts and pie charts are excellent for comparing the sizes of categories. But they won't help us understand more general numeric columns. For example, suppose we wanted to use the following table of [housing price data](https://www.kaggle.com/datasets/camnugent/california-housing-prices) (in thousands of US dollars) to analyze the impact of wealth inequality on housing.

![A table in Excel with Income and Housing Price columns. The Income column has values roughly 40, 20, 30, 28, 15, 50, 44, 11, 31. The Housing Price column has values 2489,975,1399,875,518,2320,1907,561,3667.](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/visualizing-data/ex4_slides/assets/ex4_a.png)

We'll look at two methods for visualizing numeric data in this exercise: histograms and scatterplots. We'll use histograms to visualize one column at a time, and scatterplots to understand the connection between the two columns.

#### Histogram

Histograms are used to understand the "shape" of a single column of numeric data. A histogram breaks the data-points into **bins** and then counts the number of data points in each bin.

To better understand what a histogram shows, let's visualize the median household income data for just the first 9 rows of data shown above:

![A chart in Excel. The horizontal axis has broken the range from 11.8 to 50.1 into five equal pieces, and above each piece there is a column representing the number of datapoints in that piece of the range.](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/visualizing-data/ex4_slides/assets/ex4_j.png)

The smallest and largest household incomes are 11.8 and 50.1. To create this chart, Excel broke that range into five equal pieces. The range of values for each piece is labelled along the x-axis. The column above each piece represents the number of data points that fall within that range. For example, the largest column has a height of 3, indicating that there are 3 data points in the table between 27.1 and 34.7 (not including 27.1). Check the table above to confirm that Excel got that right!
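The binning described above can be sketched in a few lines of Python. This is only an illustration of equal-width binning, not Excel's exact algorithm. The income values are hypothetical stand-ins chosen to match the table roughly; only the minimum (11.8) and maximum (50.1) come from the lesson.

```python
import math


def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min..max.

    Each bin includes its upper edge but not its lower edge, except the
    first bin, which also includes the minimum value.
    Assumes the values are not all identical.
    """
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    counts = [0] * n_bins
    for value in values:
        index = math.ceil((value - lowest) / width) - 1
        index = min(max(index, 0), n_bins - 1)  # keep min and max in range
        counts[index] += 1
    return counts


# Hypothetical incomes for illustration.
incomes = [40.2, 20.5, 30.1, 28.3, 15.4, 50.1, 44.0, 11.8, 31.2]
counts = histogram_counts(incomes, 5)
print(counts)
```

With these stand-in values, the middle bin (roughly 27.1 to 34.7) holds 3 data points, matching the tallest column in the chart.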

#### Scatterplots

Scatterplots are used for visualizing **correlation** between two columns. As one column increases, does the other also increase? Decrease? Fluctuate? For example, in our dataset, we might want to know if higher-income households own more expensive houses: as income increases, does housing price also increase? It's important to remember that the existence of a correlation like this does not mean the one column is directly influencing the other.

Here is what the scatterplot looks like using the first 9 rows of data shown above, with household income on the x-axis and housing price on the y-axis:

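![A scatterplot with household income on the x-axis and housing price on the y-axis, with one dot for each row of the table.](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/visualizing-data/ex4_slides/assets/narrative-scatter.png)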

To create this chart, Excel has placed a dot for each row of the table. The dot is positioned horizontally by the income value in that row and vertically by the housing price for that row. For example, the first dot in the bottom left corresponds to the ninth row with an income of 11.8 and housing price of 561. 

We can see a general trend from this plot: as incomes increase (moving to the right in the plot), so do housing prices (moving up in the plot). When two variables trend in the same direction like this, we call it a positive correlation. If one variable increases while the other decreases, we call it a negative correlation.
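A scatterplot shows correlation visually, but it can also be summarized as a single number between -1 and 1 (the Pearson correlation coefficient, which Excel exposes through its `CORREL` function). The sketch below computes it by hand in Python. The incomes are the rounded values from the table description above, so treat the result as approximate.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Rounded incomes and housing prices from the first 9 rows of the table.
incomes = [40, 20, 30, 28, 15, 50, 44, 11, 31]
prices = [2489, 975, 1399, 875, 518, 2320, 1907, 561, 3667]

r = pearson(incomes, prices)
print(round(r, 2))  # a positive value means a positive correlation
```

A value above 0 agrees with what we saw in the plot: higher incomes tend to go with higher housing prices in these rows.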

We've placed a slideshow illustrating how to create these plots in Excel in the learning environment! When you're ready to try it yourself, move on to the next set of instructions.





Histograms and Scatterplots

The last two visualizations we'll look at are charts that help us see the change in a numeric column over time.

For this exercise, we'll look at a dataset containing data on the total production (thousands of vehicles) for different types of vehicles over time (from the [Bureau of Transportation Statistics](https://www.bts.gov/content/productions-production-shares-and-production-weighted-fuel-economies-new-domestic-and)). Here are the first few columns of this dataset:

![A table in Excel. There are columns corresponding to the years 2010 through 2013, and rows corresponding to the different vehicle types.](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/visualizing-data/lines/assets/ex5_a.png)

#### Line charts

Line charts plot points in the data connected with a line to visualize the trend. Multiple lines can be plotted on the same graph to compare trends between groups. For example, here is a line chart of the production numbers above:

![A chart in Excel titled Total productions by vehicle type. The horizontal axis lists the years 2010 through 2013. The vertical axis lists numbers in thousands from 0 to 9 thousand. Each vehicle type is associated with a different color line, and a line is drawn for each vehicle type with the height of the line indicating the size of production in that year.](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/visualizing-data/ex5_slides/assets/ex5_p1.png)

It looks like most vehicle types increased production over these four years, but Vans stayed fairly constant in comparison to the rest.

#### Sparklines

Sparklines are a feature in Excel that places a trend line (like a mini line chart) next to each row of a table. For example, here is what the sparklines would look like for our vehicle production data:

![The original vehicle production table. Next to the 2013 column is a column of trendlines. The lines go up when the numbers in that row increase, and down when the numbers in that row decrease.](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/visualizing-data/ex5_slides/assets/ex5_o2.png)

Sparklines can be helpful when we want to see general trends for individual categories. In the line chart, the fluctuations in Van production weren't visible, since they were so small compared to the fluctuations for cars. In the sparkline, we can see them much more clearly.
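Why do small fluctuations show up in a sparkline but not in the shared line chart? One way to think about it is that each sparkline is stretched to fit its own row's smallest and largest values. The Python sketch below draws text sparklines under that assumption. The production numbers are hypothetical, and Excel's own axis settings can be changed.

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # eight bar heights, shortest to tallest


def sparkline(values):
    """Draw one row as text, scaled to that row's own min and max."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return BLOCKS[0] * len(values)  # flat row
    top = len(BLOCKS) - 1
    return "".join(
        BLOCKS[round((v - lowest) / (highest - lowest) * top)] for v in values
    )


# Hypothetical production numbers (thousands of vehicles).
production = {
    "Car": [5000, 6000, 7000, 8000],
    "Van": [60, 62, 58, 61],
}
for vehicle, row in production.items():
    print(f"{vehicle:4} {sparkline(row)}")
```

Even though the van numbers barely move compared to cars, their sparkline uses the full height, so the dip in the third year is easy to spot.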

We've placed a slideshow illustrating both methods in the learning environment. When you're ready to practice this yourself, move on to the next set of instructions!

Line Charts and Sparklines

What if you aren't sure which chart type is best for your data? After all, different types of graphs can bring out different aspects of the data. Conveniently, Excel provides a quick and easy way to preview various types of charts with its Recommended Charts functionality! 

**Note**: Just because Excel recommends or doesn't recommend a chart doesn't mean that chart should or shouldn't be used. Make sure to use your own discretion to see what works well for your specific case!

We've placed a slideshow in the learning environment illustrating how to view recommended charts and demonstrating that some recommended charts are not very useful!

Recommended Charts

In this lesson, you've learned how to use

- conditional formatting for visualizing tabular data
- column charts and pie charts for visualizing categorical data
- histograms and scatterplots for visualizing numeric data
- line charts and sparklines for identifying trends over time

You've also learned how to

- select the right type of chart for your data
- interpret the results of each type of chart
- customize the charts with titles and axis labels

Congratulations!

Let's Review

Visualizing Data

Want to work with Google Sheets as well as Microsoft Excel? Check out our Google Sheets cheatsheet to help make that transition!

Microsoft Excel to Google Sheets Cheat Sheet

Congratulations! You've successfully completed Learn Microsoft Excel for Data Analysis! You've learned the essential techniques of data analysis in Microsoft Excel, including

- using sorting and filtering to explore datasets
- using functions and PivotTables to transform and summarize data
- selecting and customizing visualizations to communicate your findings
- importing and validating data
- protecting data to maintain data integrity

Your journey in Data Analysis isn't over. Here are our recommendations for next steps:

### Data Literacy

If you want to learn more about conceptual aspects of data science, like interpreting data properly or selecting the right visualization, check out [Principles of Data Literacy](https://www.codecademy.com/learn/principles-of-data-literacy). This is a fully conceptual course that will help you learn how to think about data.

### SQL and Databases

Most companies store their data in databases, not in Excel tables. If you want to learn how to sort, filter, and summarize data in database tables, check out [Learn SQL](https://www.codecademy.com/learn/learn-sql). This course will teach you how to use the Structured Query Language that most companies use to work with their databases!

### Data Science Foundations

Hooked on data analysis, and want to learn the industry standard tools? Check out our [Data Science Foundations](https://www.codecademy.com/learn/paths/data-science-foundations) Skill Path. This Skill Path starts with the two courses above, and then teaches you more fundamental tools and concepts that data scientists use every day.

Once again, congratulations for completing Learn Microsoft Excel for Data Analysis. This is a huge accomplishment, and we're excited to see what you do next!



You've completed Learn Microsoft Excel for Data Analysis! What's next?

Learn Microsoft Excel for Data Analysis Next Steps

### What will you learn?

In this course, you will learn the basics of handling, analyzing, and visualizing data in Excel. 

After this course, you will be able to:

- explore datasets using sorting and filtering
- transform datasets with functions and PivotTables
- select the right visualization to communicate your findings
- import and format data
- protect sheets to maintain data integrity

Learning is social. Whatever you're working on, be sure to connect with the Codecademy community in the [forums](https://discuss.codecademy.com/c/get-help). Remember to check in with the community regularly, including for things like asking for peer reviews on your project work and providing reviews to others in the [projects category](https://discuss.codecademy.com/c/project/1833), which can help to reinforce what you’ve learned. 

### What should you know before starting this course?

This course does not require any prerequisite knowledge in either Excel or data analysis!

### What will you do?

Throughout this course, you will be applying the concepts you learn in real data analysis projects:

- You'll [explore GDP data](https://www.codecademy.com/courses/analyze-data-with-microsoft-excel/projects/explore-gdp-in-excel) to investigate the rise of the computing industry
- You'll [visualize hotel data](https://www.codecademy.com/courses/analyze-data-with-microsoft-excel/projects/visualizing-hotel-data-in-excel) to better understand reservation cancellation patterns
- You'll [import, prepare, and protect hotel data](https://www.codecademy.com/courses/analyze-data-with-microsoft-excel/projects/handling-hotel-data-in-excel) to make sure analyses are using valid, unaltered data
- At the end of the course, you'll combine your analysis, visualization, and handling skills to [analyze Bitcoin price data](https://www.codecademy.com/courses/analyze-data-with-microsoft-excel/projects/capstone-project-bitcoin-prices-in-excel) to compare potential upsides to volatility

We're excited for you to start your journey as a data analyst in Microsoft Excel!

A brief overview of what you will learn in this course.

Welcome to Learn Microsoft Excel for Data Analysis

Before you start analyzing data with Excel, you'll need to have Excel installed on your computer!

If you don't already have Excel installed, you can follow these steps:

1. Navigate to [the Excel website](https://www.microsoft.com/en-us/microsoft-365/excel).
2. Select `Buy now` or `Try for free` if you are not ready to purchase right away.
3. Select install once you have selected the option you would like.
4. Follow any on-screen installation prompts.

### Opening an Excel File

Once you have Excel installed, download [your first Excel spreadsheet](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/exploring-data/welcome-to-the-course.xlsx)! Once that is downloaded, you can open it in Excel by double-clicking the file or selecting File->Open inside Excel.

### What if I use Google Sheets?

We recommend using Excel for this course, as our instructions will sometimes use Excel-specific functions. However, everything taught in this course, including the exercise spreadsheets, can also be done using Google Sheets. We've created [a Google Sheets cheat sheet](https://www.codecademy.com/learn/paths/analyze-data-with-microsoft-excel/tracks/analyze-data-with-microsoft-excel/modules/excel-google-sheets/cheatsheet) to help you out if you decide to go that route!

### Let’s Get Started!
In the next lesson, you'll dive directly into analyzing data with Excel!


Download and Install Excel

Use Excel to analyze GDP data for the United States.

We've been given two tables of GDP data for the US from 1987 to 1997. These are on the "Raw Data" sheet. Take a look through them. The first has billions contributed to GDP broken down by category and industry. The second summarizes the first table by recording only billions contributed by the larger categories. When you're ready to start analyzing, select the "Initial Exploration" sheet and move to the next task.

Our ultimate goal is to zoom in on computers, but it is important to start by getting a general sense of the dataset we have. Let's start by looking at the largest and smallest contributors to GDP by category. Sort the copy of Table 2 on the "Initial Exploration" sheet. Your goal is to group all the data for each year together. Within each group, the data should be sorted by Billions.

To get a quick sense of what the largest and smallest contributors have been over the decade, let's look at the beginning, middle, and end of the decade. Filter the table you just sorted so only the years 1987, 1992, and 1997 are visible.
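If it helps to see the logic of these two steps outside of Excel, here is a small Python sketch of a two-level sort followed by a filter. The rows and category names are hypothetical placeholders, not the real GDP data, and we assume "sorted by Billions" means largest first.

```python
# Hypothetical (year, category, billions) rows, not the real dataset.
rows = [
    (1992, "A", 5.0),
    (1987, "B", 2.0),
    (1988, "A", 3.5),
    (1997, "B", 4.0),
    (1987, "A", 3.0),
    (1997, "A", 6.0),
    (1992, "B", 2.5),
    (1988, "B", 2.2),
]

# Level 1: group years together (oldest first).
# Level 2: within each year, largest Billions first.
sorted_rows = sorted(rows, key=lambda row: (row[0], -row[2]))

# Filter: keep only the beginning, middle, and end of the decade.
visible = [row for row in sorted_rows if row[0] in {1987, 1992, 1997}]

for row in visible:
    print(row)
```

After sorting, the first and last visible rows of each year group are that year's largest and smallest categories.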

Scroll through the filtered table and identify the largest and smallest categories for each of the three years. Record these in the "Summary of Exploration" table and write down any observations you have in the "Statement" box.

Let's dig deeper into the largest and smallest categories. Comparing raw numbers will only tell us so much, since both Finance and Manufacturing increased over the course of the decade in terms of raw dollar amounts. What we'd really like to calculate is the portion of the total GDP that these categories contribute. 

So let's convert raw dollars to percentages. On the "Percentage" sheet, write a formula in `C25` to sum the 1987 column of data, and drag that formula over through `M25`. 

(Excel might display a "formula omits adjacent cell" error. This is fine: the year at the top of the column should not be included in the sum.)

Write a formula in `C28` that calculates the percentage of 1987's GDP contributed by Agriculture, Forestry, Fishing, And Hunting (in `C9`).

Add a `$` sign to the formula in `C28` so that when you drag it down the column, it updates to a percentage formula for each category.

Drag the formula in `C28` down through `C43`.

Select the range `C28:C43` and drag it over through column `M`.
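The calculation behind these steps is "each category's value divided by that year's total". Here is a rough Python sketch with hypothetical numbers and placeholder category names. The lookup `totals[year]` plays the same role as the locked total row in the Excel formula: the category changes from row to row, but the total it is divided by stays the same within a year.

```python
# Hypothetical contributions in billions, not the real GDP table.
contributions = {
    "Category A": {1987: 200.0, 1988: 220.0},
    "Category B": {1987: 300.0, 1988: 330.0},
    "Category C": {1987: 500.0, 1988: 550.0},
}
years = [1987, 1988]

# One total per year (like the sum row at the bottom of each column).
totals = {
    year: sum(row[year] for row in contributions.values()) for year in years
}

# Each category's share of that year's total.
shares = {
    category: {year: row[year] / totals[year] for year in years}
    for category, row in contributions.items()
}

for category, row in shares.items():
    print(category, {year: f"{share:.1%}" for year, share in row.items()})
```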

Filter the percentage table to look at the categories you identified on the prior sheet. What trends do you notice?

Write a brief summary of what you notice in the Summary Statement box.

Okay, we've got a basic idea of what the major categories look like. But we've been asked to analyze the impact of computers! Computers aren't their own category, but they do appear in a separate table that breaks GDP down by individual industry. This table is on the "Computers" sheet. You're going to create a new "Computers" category. First, copy the data table to the "Backup Table" sheet, so you can recover the original if you need it!

Go back to the "Computers" sheet. Find the industries that list "Computer" in their description and manually alter the corresponding "Category" to "Computers".

Select the entire table on the "Computers" sheet and create a pivot table in `G9` that reports the total GDP contribution for each category by year.

Sort the pivot table by the 1987 values from largest to smallest. Has anything changed from the initial analysis when computers are turned into their own category? Write a brief explanation in the "Changes in 1987" box.

Record the rank of the "Computers" category in 1987 in `H2`.

Sort the pivot table by 1997 from largest to smallest and record the rank of the "Computers" category in `H3`.

Write an overall summary of what you've discovered on the "Final Summary" sheet, including any limitations of the data or your analysis.

Explore GDP in Excel

Learn to restructure and summarize your data using Excel's PivotTable tool!

Formulas, filtering, and sorting are powerful tools! But combining them to answer complex data questions can be tricky. Excel solves this problem for many common data questions through the `PivotTable` tool.

In this lesson you'll learn how to take a question about data and

- Determine if a pivot table is appropriate
- Translate the question into pivot table settings
- Create the pivot table in Excel
- Update the pivot table with new data




Introduction to Pivot Tables

Two tables. The first table has 5 columns. 
Column 1 is blank.
Column 2 has the header "Month" and is highlighted with the label "Rows"
Column three has the header "Losses" and is highlighted with the label "Values"
Column 4 is blank
Column 5 has the header "State" and is highlighted with the label "Columns".
The table has two rows with data. Both rows have month "February" and state "FL". One row has the losses of 10,000 and the other has losses of 25,000.

The second table has rows labeled with months January through April and columns labeled with states AZ, FL, and TX. The cell in the row February and column FL has the sum of the two losses from the first table 10,000+25,000.

Pivot tables won't help you get your furniture through a door (unfortunately). They will help you *group* and *summarize* your data. Let's look at an example of how powerful this can be.

Suppose we're trying to calculate the total dollars lost to tornados in each state by month. Let's use the small example table below.

|   | Year | Month    | Losses ($) | Length (mi) | State |
|---|------|----------|--------|--------|-------|
| 1 | 2020 | March    | 5,000  | .4     | TX    |
| 2 | 2020 | February | 10,000 | 1.5    | FL    |
| 3 | 2020 | January      | 50,000 | .8     | AZ    |
| 4 | 2020 | February | 25,000 | .9     | FL    |
| 5 | 2020 | January      | 30,000 | 1      | AZ    |
| 6 | 2020 | January      | 1,000 | 1.2      | TX    |

We could answer this question by hand in three steps:

1. Group together all rows with the same state and month.
2. Collect the `Losses` values for all the rows in each group.
3. Sum the `Losses` for each group to report the total losses by group.

A **pivot table** is a separate table collecting the results of this process. That might sound a bit abstract, so let's go through this step-by-step.

We start by creating a table of all the possible `Month` and `State` groups.

First, we use the values of the `State` column as column headers for the pivot table. Note that we don't repeat values. Even though `AZ` appears twice in the state column, it only appears once in the headers of our new table.

|          | AZ | FL | TX |
|----------|----|----|----|
| |    |    |    |
|    |    |    |    |
|     |    |    |    |

Second, we use the values of the `Month` column to label the rows of the new table. Once again, each month will only appear once.

|          | AZ | FL | TX |
|----------|----|----|----|
| January |    |    |    |
| February    |    |    |    |
| March      |    |    |    |


The cells of this table correspond to the *groups* we need to form to answer the original question. Now, we need to gather and summarize the values in each group to fill in the corresponding cell.

For example, the upper-left cell of the new table is in the row `January` and the column `AZ`. When we look back at the original table, there are two rows in this group. To report the total losses, we collect the `Losses`  values for those rows and add them up: `50,000+30,000 = 80,000`.

|          | AZ | FL | TX |
|----------|----|----|----|
| January | 80,000   |    |    |
| February    |    |    |    |
| March      |    |    |    |

If we move one cell down, we'll be in the row `February` and column `AZ`. There are no rows of the original table corresponding to this group. We leave this cell blank, to tell someone reading the table that there were no rows of data in this group.

Here's the final pivot table &mdash; do you agree with the rest of our calculations?

|          | AZ | FL | TX |
|----------|----|----|----|
| January | 80,000   |    | 1,000   |
| February    |    |  35,000  |    |
| March      |    |   |  5,000  |

This process is the same for any pivot table. Starting with a table of data, you pick

- *column labels*: a column or columns of the original table to serve as column labels (or headers) for the pivot table
- *row labels*: a column or columns of the original table to serve as row labels for the pivot table
- *cell values*: a column or columns of the original table to provide the values for the cells of the pivot table
- *summary method*: a method for taking multiple values corresponding to a single group and producing one value for the pivot table

In our initial example, we picked

- *column labels*: `State`
- *row labels*: `Month`
- *cell values*: `Losses`
- *summary method*: `sum` (or total)
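The group, collect, and sum process can also be written as a short Python sketch. It uses the six rows of the example table, groups them by `Month` and `State`, and sums `Losses` for each group. This only illustrates the idea, not how Excel implements its PivotTable tool.

```python
from collections import defaultdict

# (year, month, losses in $, length in mi, state) from the example table.
rows = [
    (2020, "March", 5_000, 0.4, "TX"),
    (2020, "February", 10_000, 1.5, "FL"),
    (2020, "January", 50_000, 0.8, "AZ"),
    (2020, "February", 25_000, 0.9, "FL"),
    (2020, "January", 30_000, 1.0, "AZ"),
    (2020, "January", 1_000, 1.2, "TX"),
]

# Group by (month, state) and sum the losses in each group.
totals = defaultdict(int)
for _year, month, losses, _length, state in rows:
    totals[(month, state)] += losses
pivot = dict(totals)

# Print the pivot table: months as rows, states as columns.
months = ["January", "February", "March"]
states = ["AZ", "FL", "TX"]
print(f"{'':10}" + "".join(f"{state:>8}" for state in states))
for month in months:
    # Groups with no rows are left blank, just like in the table above.
    cells = [pivot.get((month, state), "") for state in states]
    print(f"{month:10}" + "".join(f"{cell:>8}" for cell in cells))
```

Changing the summary method is a one-line change here (for example, counting rows instead of adding losses), which mirrors choosing a different summary setting for a pivot table.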


What is a Pivot Table

Designing a pivot table by hand is one thing. Calculating the values is a task for Excel! Excel's `PivotTable` tool takes our pivot table design and dynamically creates the corresponding pivot table within a spreadsheet.

We've provided a slideshow in the Learning Environment that illustrates the process of creating a pivot table in Excel. 

A quick note: you might notice that when Excel automatically references a range in the PivotTable dialogs, it puts the sheet name followed by `!` before the range (for example, `Sheet1!A1:F7`). This sheet name syntax is for referencing ranges that may be on a different sheet than the one you are working on. If you are typing in a range on the sheet you have currently selected, you can type in ranges as usual without the sheet name.

When you're ready to try creating pivot tables yourself, move on to the next set of instructions!

Pivot Tables in Excel

Unlike formulas, pivot tables in Excel do not update automatically if the original table changes. There are two methods for updating a pivot table manually, depending on how the original table has changed:

1. **Refresh**: if the size and location of the source dataset have not changed, select `Refresh` from the `PivotTable Analyze` tab.
2. **Change data source**: if the size and/or location of the source dataset have changed, select `Change Data Source` from the `PivotTable Analyze` tab and select the new source range.

Note that the `PivotTable Analyze` tab will only appear if you have selected a cell within the pivot table.

We've placed a slideshow illustrating both cases in the Learning Environment. When you're ready to try this yourself, get started on the next set of instructions!


Updating Pivot Tables

Congratulations on finishing this lesson! You've taken your Excel skills to the next level with the PivotTable tool, and now have everything you need to start exploring datasets and answering data questions. You've learned how to

- Translate a data question into pivot table settings
- Implement those settings in Excel's PivotTable tool
- Update pivot tables with new and corrected data

Pivot Tables

By date (oldest to newest) and then by letter (A to Z)

By letter (A to Z) and then by date (oldest to newest)

By date (newest to oldest) and then by letter (A to Z)

Filter by `Date before 02/01/2022` and `Letter = a`.

Questions that require grouping and summarizing a dataset to answer.

Questions about what future data might look like.

Test your knowledge of exploring data in Excel!

Exploring Data Quiz

Use Excel to explore a dataset with formulas, sorting, and filtering tools.

Excel lets us store, organize, and analyze our data all in the same place. This makes Excel a great one-stop-shop for exploring a dataset prior to performing a formal analysis. In this lesson, we'll cover three standard Excel methods for data exploration:
* calculating averages and other statistics using **formulas**
* organizing data using **sorting**
* zooming in on the data we care about by **filtering**

Each exercise in this lesson will come with an interactive Excel spreadsheet for you to download. The spreadsheet will give you instructions to practice what you've learned, provide hints, and give feedback on your answers.

Even old hands at Excel sometimes get stuck on seemingly "simple" tasks, so if you get stuck, please consult the solutions or ask a question in the forums, even if you think you "should" be able to do it. 

Introduction to Exploring Data

A schematic of Excel described from the bottom up in the exercise narrative text.

Let's start by taking a tour of the Excel interface! When you first open up Excel, the sheer number of features can be a little overwhelming. But once you understand the structure behind Excel's layout, you can find the features you need and put the flexibility of Excel to use. 

Excel is structured around two basic purposes: 

1. Storing data and analyses of data
2. Performing analyses of data

Let's walk through the schematic of Excel in the learning environment to the right. At the bottom of the schematic are tabs corresponding to three **sheets** named `Sheet1`, `Sheet2`, and `Sheet3`.

The actual sheets appear above the tabs with the sheet names. The sheets are where Excel stores data tables and the results of data analysis, like charts or calculations. The data analysis tools themselves are stored above the sheets.

The **formula bar** is directly above the sheets. This is where you can view and edit any calculations you are using in your analysis.

The area above the formula bar displays different analytics options depending on which tab is open. In the schematic, the **Home** tab is open and the **Insert** and **Data** tabs are closed. Each icon on a tab is connected to some analytics function like sorting a table or creating a chart.

Depending on your version of Excel, icons may appear differently, but they should always be in the same location. We'll give you directions in each exercise to find the tools you need! 


Introduction to Excel

Let's start by exploring numeric data! In the learning environment, we've loaded a dataset containing the number of new cars produced in the US by year (from the [Bureau of Transportation Statistics](https://www.bts.gov/content/productions-production-shares-and-production-weighted-fuel-economies-new-domestic-and)). Often, exploring numeric data starts with answering basic statistical questions about the data such as

* what is the largest number of new cars in a year?
* what is the smallest number of new cars in a year?
* what is the average number of new cars in a year?

Excel can help us answer all of these. But first, we'll have to know a little more about how cells in a spreadsheet connect to each other. The connections between cells are based on cell "addresses" or "references".


Cell addresses are made up of a  `column letter` followed by a `row number`. For example, the cell containing the number 10 in the following image is named B2 since it is in column B and row 2. The number 10 is the **content** or **value** of the cell.
![columns A through C and rows 1 through 3 of Excel with cell B2 highlighted and containing the number 10](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/exploring-data/cellname.png)

In addition to numbers, cells can also contain **formulas**, which are how we'll calculate our maximum, minimum, and average. A formula in Excel always starts with the `=` sign, so Excel knows that it should do some kind of computation. 

For example, since B2 contains the number 10, the formula `=2+B2` would produce the value `12`, since `2+10 = 12`. The cell name B2 gets replaced by its value when Excel does the computation.

We can also do subtraction, division, and multiplication in formulas using

* `-` for subtraction
* `/` for division
* `*` for multiplication

This last one might seem like an unusual symbol, but it is pretty commonly used for multiplication on computers.

Here's an example of using `*` for multiplication. What would you change if you wanted to multiply B2 by 3?

![columns A through C and rows 1 through 3 of Excel with cell B2 containing the number 10 and cell C2 containing the formula =2*B2](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/exploring-data/cellformulaformula.png)
![columns A through C and rows 1 through 3 of Excel with cell B2 containing the number 10 and cell C2 containing the number 20](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/exploring-data/cellformularesult.png)

Now, you might be thinking to yourself that you could multiply `2` by `10` without a spreadsheet. That's right! But the magic of Excel formulas is that they update automatically when data changes. If the cell `B2` was changed to contain the number `4`, the cell `C2` would automatically update to contain the number `8`, since `2*4 = 8`. These dynamic formulas are the real power of Excel since they let us automatically update analyses with new data.


### Ranges of Cells

To calculate an average we need to reference a whole collection of cells instead of a single cell at a time. To reference a range of data, we type the name of the first cell in the range (usually the top-left cell), a colon, and then the name of the last cell (usually the bottom-right cell).

For example, the range `B1:C4` includes the cells in Columns B-C and Rows 1-4.
![columns A through C and rows 1 through 8 of Excel with cell A1 containing the formula =B1:C4 and the cells B1 through B4 and C1 through C4 highlighted](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/exploring-data/rangetable.png)
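
If it helps to see the idea outside of Excel, here is a minimal Python sketch of what a range reference covers. The helper name `expand_range` is made up for this illustration, and it assumes single-letter columns; it is not how Excel itself is implemented.

```python
def expand_range(ref):
    """Return the cell addresses covered by a range such as 'B1:C4'.

    Assumption: columns are single letters (A-Z), which is enough here.
    """
    start, end = ref.split(":")
    first_col, first_row = start[0], int(start[1:])
    last_col, last_row = end[0], int(end[1:])
    cells = []
    # Walk row by row, and left to right within each row.
    for row in range(first_row, last_row + 1):
        for col in range(ord(first_col), ord(last_col) + 1):
            cells.append(f"{chr(col)}{row}")
    return cells


cells = expand_range("B1:C4")
print(cells)       # ['B1', 'C1', 'B2', 'C2', 'B3', 'C3', 'B4', 'C4']
print(len(cells))  # 8
```

The range is just shorthand for every cell in the rectangle between the two corners.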

To calculate the maximum, minimum, and average of this range, we use the special Excel functions `=MAX()`, `=MIN()`, and `=AVERAGE()`, and we put the range containing the data between the parentheses.

In the example in the learning environment, the data is contained in the range `B2:B7`, so the full formulas for calculating these three statistics are

* `=MAX(B2:B7)`
* `=MIN(B2:B7)`
*  `=AVERAGE(B2:B7)`

To display the average in cell `B8`, we

1. Select cell `B8`
2. Type the formula into the cell
3. Select `return` or `enter`
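
For comparison, the same three statistics can be sketched in Python. The numbers below are invented stand-ins for the cells `B2` through `B7`, not values from the course dataset.

```python
# Hypothetical values standing in for cells B2 through B7.
new_cars = [1200, 1500, 900, 1800, 1600, 1400]

largest = max(new_cars)                   # like =MAX(B2:B7)
smallest = min(new_cars)                  # like =MIN(B2:B7)
average = sum(new_cars) / len(new_cars)   # like =AVERAGE(B2:B7)

print(largest)   # 1800
print(smallest)  # 900
print(average)   # 1400.0
```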

You can watch the full process in the slideshow in the learning environment. When you are ready to practice this yourself, follow the next set of instructions!




Summarizing Numeric Data

The raw data we get often isn't in precisely the form we would like to analyze. For example, our new vehicle data contains the number of new vehicles for each type of vehicle, but what if we wanted to see each type as a percent of total new vehicles instead?

We could calculate percentages one cell at a time using individual formulas, but this would result in a lot of time writing formulas instead of analyzing our data. Also, if we ever decided to change our analysis slightly, we'd be stuck updating each formula individually again. Fortunately, Excel lets us move, or drag, formulas while automatically updating cell references.

Let's take a look at what this means. Suppose the **cell** `B2` contains the **formula** `=A4`. If we move this formula from `B2` to `D1`, we move it two columns to the right and one row up. Excel then updates the formula from `=A4` to `=C3`, following the same "path": two columns to the right and one row up. One way to think of it is that Excel moves **everything** two steps over and one step up.

![a screenshot of an Excel table with columns A-D and rows 1-4. Cell B2 has the formula =A4. There are arrows demonstrating this formula being moved two columns over and one row up, so that cell D1 has the formula =C3.](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/exploring-data/narrative.svg)

Sometimes, we won't want the column or row to update. For example, let's take a look at calculating the percentages of the new vehicle data in the next table.

|   | A            | B                      | C       |
|---|--------------|------------------------|---------|
| 1 | Vehicle type | # of new vehicles | Percentage |
| 2 | car     | 1000                   |         |
| 3 | SUV       | 1500                   |         |
| 4 | truck       | 5000                   |         |
| 5 | TOTAL:       | 7500                   |         |

We can calculate the first percentage in column `C` by dividing the number of new cars by the total number of new vehicles. In Excel, we would type the formula `=B2/B5` in the cell `C2`.

|   | A            | B                      | C       |
|---|--------------|------------------------|---------|
| 1 | Vehicle type | # of new vehicles | Percentage |
| 2 | car     | 1000                   |     =B2/B5    |
| 3 | SUV       | 1500                   |         |
| 4 | truck       | 5000                   |         |
| 5 | TOTAL:       | 7500                   |         |


What happens when we move this formula to the cell `C3`? The column stays the same, and the row moves one cell down. Excel will update the references in the formula `=B2/B5` to follow the same path:

- `B2` becomes `B3`
- `B5` becomes `B6`

Now the cell `C3` will have the formula `=B3/B6`.

|   | A            | B                      | C       |
|---|--------------|------------------------|---------|
| 1 | Vehicle type | # of new vehicles | Percentage |
| 2 | car     | 1000                   |     =B2/B5    |
| 3 | SUV       | 1500                   |    =B3/B6     |
| 4 | truck       | 5000                   |         |
| 5 | TOTAL:       | 7500                   |         |

This isn't what we want to calculate! The formula `=B3/B6` takes the number of new SUVs in `B3` and divides it by whatever is in cell `B6`, not by the total which is in cell `B5`.

We want Excel to update the reference `B2` to `B3`, so that we're calculating the SUV percentage. But we don't want Excel to update the reference to `B5`, since the total stays the same no matter which vehicle type we're working with.

To stop Excel from updating this row, we need to place a `$` before the original row reference. The first formula in `C2` becomes `=B2/B$5`. When we move it down to `C3` it updates to `=B3/B$5`. This takes the number of new SUVs and divides it by the total, which is what we want to do!

|   | A            | B                      | C       |
|---|--------------|------------------------|---------|
| 1 | Vehicle type | # of new vehicles | Percentage |
| 2 | car     | 1000                   |     =B2/B$5    |
| 3 | SUV       | 1500                   |    =B3/B$5     |
| 4 | truck       | 5000                   |    |
| 5 | TOTAL:       | 7500                   |         |


Note that if we wanted to stop the column from updating instead of the row, we would need to put the `$` sign in front of the column letter in the formula instead of the row number.
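
To make the updating rule concrete, here is a rough Python sketch of how references shift when a formula moves. The function `move_formula` and its regular expression are assumptions made for this illustration (single-letter columns only, no bounds checks); Excel's real behavior covers many more cases.

```python
import re

# One A1-style reference with optional "$" locks, e.g. B2, B$5, $B5, $B$5.
REFERENCE = re.compile(r"(\$?)([A-Z])(\$?)(\d+)")


def move_formula(formula, rows=0, cols=0):
    """Rewrite each reference as if the formula moved by (rows, cols)."""

    def shift(match):
        col_lock, col, row_lock, row = match.groups()
        if not col_lock:  # no "$" before the letter: the column updates
            col = chr(ord(col) + cols)
        if not row_lock:  # no "$" before the number: the row updates
            row = str(int(row) + rows)
        return f"{col_lock}{col}{row_lock}{row}"

    return REFERENCE.sub(shift, formula)


print(move_formula("=A4", rows=-1, cols=2))  # =C3
print(move_formula("=B2/B5", rows=1))        # =B3/B6
print(move_formula("=B2/B$5", rows=1))       # =B3/B$5
```

The last two lines mirror the percentage example: without the `$`, the total reference drifts to `B6`; with it, the reference stays on row 5.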

### Errors and Best Practices

Forgetting to use a `$` when necessary sometimes results in an Excel error. These errors are displayed in the cell containing the reference.

The two most common errors are `#DIV/0!` and `#VALUE!`. Let's look at how these could occur in our example, where Excel updated the percentage formula to `=B3/B6`, pointing to a cell `B6` that isn't in our table.

The `#DIV/0!` error indicates that Excel is trying to divide by `0`. This could happen because the cell `B6` contains a `0` or, more likely, because the cell `B6` is empty. Excel assumes that empty cells are the same as `0`.

The `#VALUE!` error likely indicates that the cell `B6` contains some non-numeric value that Excel cannot use in the mathematical formula `=B3/B6`.

Even if no errors appear, our cell references might still be updating in a way we don't want. If there was a nonzero number in cell `B6`, Excel would calculate `=B3/B6` without an error. Unless we're careful, we could use the wrong number in our analysis without realizing it.
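
The three outcomes described above can be imitated in a small Python sketch. The function `divide_cells` is hypothetical and deliberately simplified: it treats an empty cell as `None` and any text as unusable, which only approximates Excel's rules.

```python
def divide_cells(numerator, denominator):
    """Imitate the result of a formula like =B3/B6 (simplified sketch)."""
    # Empty cells (None here) are treated the same as 0.
    numerator = 0 if numerator is None else numerator
    denominator = 0 if denominator is None else denominator
    # Text cannot be used in arithmetic.
    if isinstance(numerator, str) or isinstance(denominator, str):
        return "#VALUE!"
    if denominator == 0:
        return "#DIV/0!"
    return numerator / denominator


print(divide_cells(1500, None))   # #DIV/0!  (B6 is empty)
print(divide_cells(1500, "n/a"))  # #VALUE!  (B6 contains text)
print(divide_cells(1500, 3))      # 500.0    (no error, but the wrong answer)
print(divide_cells(1500, 7500))   # 0.2      (the intended SUV percentage)
```

The third call is the dangerous case: nothing looks wrong, yet the result has nothing to do with the total.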

In general, it is always good practice to audit every formula we intend to move, looking at every cell referenced to ask:

1. Should the row of this reference update when moved up/down?
2. Should the column of this reference update when moved left/right?

The learning environment has a slideshow illustrating each step of creating and moving a formula in Excel. You can advance the slides using the controls at the bottom. Once you feel ready to do this yourself, move on to the next set of instructions!

Transform Data

Since Excel is often used to store and display data, we can scroll to explore, too. If the data is randomly arranged, it's hard to get any insight by just scrolling. But if the data is organized in a predictable way, we might be able to spot patterns that can guide our more formal analysis later on.
		
For example, look at the tornado data in the learning environment (from the [National Weather Service](https://www.spc.noaa.gov/wcm/)). Notice that it is unsorted and difficult to make sense of. Once it is sorted, however, we can see all tornado measurements for each state together. This makes it much easier to visually inspect.


To sort a table of data in Excel by one of its columns, select a cell containing data for that column and then choose an order for sorting from the Sort & Filter menu on the Home tab. 

For example, here's how we would sort this tornado data alphabetically by state:
![An image of the Sort menu in Excel](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/exploring-data/sort-by-state.png)

This method would result in a table starting with the tornados in Florida, since F comes before M and O alphabetically.

The option to sort `A to Z` or `Z to A` will only be displayed for text-based data. For numeric and time data, the options are `Sort Smallest to Largest` and `Sort Largest to Smallest`; for dates, they are `Sort Oldest to Newest` and `Sort Newest to Oldest`. Sometimes, Excel might interpret the data type incorrectly and display the wrong option. We'll show you how to fix that in a later lesson.

Often, we want to sort by more than one column. For example, we might want to organize our tornado data so that all the tornados in each state are grouped together and, within each state, sorted by date. The natural thing to try would be to sort by state and then sort by date. But we actually need to use the reverse order: sort by date first, and then sort by state.

This might seem unintuitive, but let's look at a small example to see why this happens. Consider the following table:

| Date      | State |
| ----------- | ----------- |
|   1/4/20    |FL       |
| 2/5/20| AL|
| 1/31/20    |FL       |
| 1/11/20| AL |
| 2/6/20 |FL |

If we sort newest-to-oldest by date, we'll get the table rearranged by date:

| Date      | State |
| ----------- | ----------- |
| 2/6/20 |FL |
| 2/5/20| AL|
| 1/31/20    |FL       |
| 1/11/20| AL |
|   1/4/20    |FL       |




When we sort by state next, AL will come before FL. But which of the two AL rows will be first? 

The way Excel deals with cells that have the same value is by maintaining the order they were already in. Since we've already sorted by date from newest to oldest, the `2/5/20` row for AL will come before the `1/11/20` row. Here's what the whole sorted table looks like:

| Date      | State |
| ----------- | ----------- |
| 2/5/20| AL|
| 1/11/20| AL |
| 2/6/20 |FL |
| 1/31/20    |FL       |
|   1/4/20    |FL       |
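
Python's built-in `sorted` also keeps tied rows in their existing order, so it can reproduce the two-step sort above. This is only a sketch of the idea using the five rows from the example table.

```python
from datetime import date

rows = [
    (date(2020, 1, 4), "FL"),
    (date(2020, 2, 5), "AL"),
    (date(2020, 1, 31), "FL"),
    (date(2020, 1, 11), "AL"),
    (date(2020, 2, 6), "FL"),
]

# Step 1: sort by the secondary column (date, newest to oldest).
by_date = sorted(rows, key=lambda row: row[0], reverse=True)
# Step 2: sort by the primary column (state, A to Z).
# Rows with the same state keep the order they had after step 1.
by_state = sorted(by_date, key=lambda row: row[1])

for day, state in by_state:
    print(f"{day.month}/{day.day}/{day.year % 100}", state)
# 2/5/20 AL
# 1/11/20 AL
# 2/6/20 FL
# 1/31/20 FL
# 1/4/20 FL
```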

In the learning environment, we've shown what happens if you sort a portion of this tornado dataset by time, and then by date, and then by state. When you're ready to try this yourself, move on to the next set of instructions.


Four images to illustrate how sorting data works. The first one is unsorted tornado data, with columns for date, time, state, length, and width. The rows are in no apparent order. The second image is sorted by time from earliest to latest. The time column now starts with the earliest hour and minute in that column and ends with the latest. The third image takes the time-sorted table and sorts by date, newest to oldest. Now the data is organized so rows on the same day occur together. Within a day, the rows are still sorted by time. The last image sorts by state from A to Z. Now the rows are organized so all rows corresponding to the same state occur together. Within those sub-groups, they are organized by date and then by time.

Sorting Data

Sorting is excellent for organizing data, but finding specific data in a sorted table can still take a lot of scrolling. To answer questions about data with specific features, we're better off using **filters**.
		
Filtering a table consists of comparing each row of data to a list of criteria. Only those rows that match the criteria are included in the filtered table. For example, we could filter the tornado dataset by the following criteria:
* state = AR
* date is before 5/1/20

Applying this filter would result in a table of only those tornados in Arkansas before May 1st.

Sorting and filtering are often used together. For example, if we're interested in the length of tornados in Arkansas before May, we could take our filtered dataset and sort by length to see the longest and shortest tornados.

Let's look at an example of taking a question about data and breaking it down into filters. Imagine we've been asked to find out when the 2020 tornado season started in California (CA).

To break this question down, we start by identifying which columns of the dataset are being restricted. Since we're asked to find out *when* the tornado season *started*, this question will require some manipulation of the Date column. Since we're asked specifically about California, we'll also need to work with the State column.

Since the other columns aren't mentioned in the question, we won't need to filter or sort by them at all.

For the date, we're not asked to find a specific date, just to find the *start*, which will be the first date. This means we will want to sort the Date column from oldest to newest (furthest in the past to most recent).

For the state, we're given a specific value we want that column to have. For a tornado to be in California, it has to have CA in the state column. 

To get the answer to this question we'll

* filter by state = CA
* sort Date from oldest to newest

The answer will be the date in the first row of this filtered and sorted table. We've put a walkthrough of applying these filters in the learning environment. Note that we're only showing a portion of the tornado data so you can see what happens. Follow along and see how the table changes with each step.
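
The same filter-then-sort plan can be sketched in Python. The rows below are invented for illustration and are not taken from the tornado dataset.

```python
from datetime import date

# Hypothetical rows; the real dataset has more columns and many more rows.
tornados = [
    {"date": date(2020, 4, 12), "state": "AR"},
    {"date": date(2020, 3, 16), "state": "CA"},
    {"date": date(2020, 1, 10), "state": "AR"},
    {"date": date(2020, 2, 22), "state": "CA"},
    {"date": date(2020, 5, 6), "state": "CA"},
]

# Filter: keep only the rows where state = CA.
california = [row for row in tornados if row["state"] == "CA"]
# Sort: Date from oldest to newest.
california.sort(key=lambda row: row["date"])

# The answer is the date in the first remaining row.
season_start = california[0]["date"]
print(season_start)  # 2020-02-22
```

Note that the earliest date in the whole table belongs to Arkansas, which is why the filter has to be part of the plan.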

Filter Data

In this lesson, you've learned how to explore datasets in Excel using three key methods:

* **formulas** to summarize and transform data
* **sorting** to organize data for visual inspection
* **filtering** to zoom in on data points with specified features

Specifically, you've learned how to

* use built-in Excel functions to find out the average, maximum, and sum of a set of numbers
* use draggable cell references and formulas to determine percents and baseline comparisons from raw data
* use sorting on multiple columns to gain insight into trends over time
* use filtering to answer questions about datapoints matching specific criteria

These are the essential skills you need to explore data in Excel and prepare for formal analysis. Congratulations on completing the lesson!

A spreadsheet interface surrounded by tables, charts, and functions.

Exploring Data

Investigate hotel cancellation rates in Excel.

Start by using conditional formatting and sparklines to get a general sense of the data. On the `Explore` sheet, you'll find a table breaking down the number of reservations by planned arrival month, reservation status, and lead time.

Apply a color scale to the table of data in `D9:O17`.

Create a sparkline in `P9` that draws a trend line for the `Cancelled` and `Long` row.

Record any observations you make in the `Notes` section.

Let's look at `Lead Time` on its own. In particular, let's take a look at the raw numbers instead of the short, medium, and long categories. These numbers are all listed on the `Lead Time` sheet. Create a histogram of the lead time data.

Take a note of any patterns you notice in the `Notes` section.

Let's break this analysis of lead time down by reservation status (note that we've grouped `No Show` in with `Cancellation`).

On the `Lead Time by Status` sheet, create a histogram of the `Cancelled` column.

On the same sheet, create a histogram of the `Kept` column.

Make a note of any observations you have in the `Notes` section.

Since it seems as if lead time might be connected somehow with cancellations, let's look at a scatterplot to see if there's any correlation.

The `Lead Time to Cancellations` sheet has a table of average lead time and number of cancellations for each month in the dataset. Create a scatterplot comparing these two columns.

Record any observations you have in the `Notes` section.

A correlation is just a correlation. Let's see if lead time and cancellations are both just being influenced by seasonal fluctuations.

The same monthly data is on the `Over Time` sheet, although the numbers have been normalized to fit in the same range. Create a line graph plotting both average lead time and the number of cancellations.

Let's end by breaking our analysis down by distribution channel: direct sales, corporate sales, and travel agent sales. There are three tables on the `By Distribution Channel` sheet.

The first table on the left has the average lead time by distribution channel. Create a bar chart of this data.

The middle table has the breakdown of kept/cancelled reservations by each distribution channel. Create a stacked bar chart of this data.

The final table on the right has the percentage of reservations that get cancelled for each distribution type and lead time category. Create a clustered column chart of this data.


Visualizing Hotel Data in Excel

Trends within the table and extremely large or small values.

To quickly identify any values above a certain threshold.

No - there are too many categories to be able to distinguish between them. 

No - the categories do not make up part of a whole.

Temperature and sales in 150 ice cream stores

Average gas price over the last 10 years.

Importing a CSV makes sure that all of Excel's features will work and allows you to properly specify data types during the import process.

Importing a CSV automatically inspects and validates the data.

It is actually better to open the CSV directly.

Filter the columns one at a time by `(Blanks)`.

Filter all the columns simultaneously by `(Blanks)`.

Scroll through the table looking for missing data.

Sort the data from least to most efficient and look for excessively large or small values.

Scroll through the table looking for very large or small values.

Use a heatmap to color in large or small values.

To prevent yourself and coworkers from accidentally overwriting important data and formulas.

To stop competitors from accessing confidential data.

To prevent coworkers from accidentally overwriting important data and formulas.

Test your knowledge of handling data in Excel!

Handling Data

Use Excel to inspect and format hotel booking and cancellation data.

Import the `bookings` CSV file to a new sheet. Because this data contains no decimals, you should use `,` as your delimiter even if your computer uses `,` for decimals. Make sure your imported data starts in cell `A1`. If it isn't already, name the new sheet `bookings` (right- or control-click the sheet name to change it).

If you aren't there already, click over to the imported data on the `bookings` tab. 

Since it is hard to spot extra whitespace in text data, let's start by trimming whitespace from the `hotel`, `status`, and `day` columns. 

Write a formula in `F2` that trims whitespace from the hotel data in `A2`. Drag the formula down by double-clicking the green box in the lower right of `F2`.

Write a formula in `G2` that trims whitespace from the status data in `B2`. Drag the formula down by double-clicking the green box in the lower right.

Write a formula in `H2` that trims whitespace from the day data in `D2`. Drag the formula down by double-clicking the green box in the lower right.

Open up the filter menu on the `hotel` column. It looks like two different encodings were used: one with the word hotel and one without.

Write a formula in `I2` that truncates the trimmed hotel data in `F2` to just one letter to maintain consistency. Drag the formula down by double-clicking the green box in the lower right.

Let's look for missing data. Filter each column one at a time on `(Blanks)` (some columns might not have any!). Make a note of what data is missing in the summary table on the `Import and Inspect` sheet.

Let's take a look for suspicious data. Remove any filters. Sort `arrival` and note any suspiciously early or late data in the summary table on the `Import and Inspect` sheet.

Sort `Number of special requests` and note any suspiciously large or small entries in the summary table on the `Import and Inspect` sheet.

After inspecting the data, you've been asked to produce a couple of reports. These are on the `Format Numbers` sheet. 

The top table has the monthly percentage of total reservations by hotel type and reservation status.

This table is hard to read due to the formatting. Format all the numbers in this table to be percentages with no decimal points.

The second table on this sheet is a list of all the reservations with `5` special requests (the largest non-suspect number of special requests). 

When the table was copied onto this sheet, the dates got formatted as numbers. Reformat the dates as `Short Date`.

Sort the second table to make it easier to read.

The hotel chain is experimenting with lowering charges for special requests. 

They've asked for a table that calculates revenue based on new levels of charges. This table is on the `Format and Protect` sheet. Try a few different average charges to see how the average revenue changes.

Since we're going to be presenting this to others, let's write a note in `A17` explaining why we're excluding 2020 from the average calculation.

We'd like to protect the sheet so that the data doesn't accidentally get altered. Before doing that, we want to make sure that `F10` will still be editable. Unlock `F10`.

Hide the `bookings` sheet to clean up the file for presentation.

Handling Hotel Data in Excel

Learn to load, validate, and present data in Excel.

Even the most advanced analytics techniques are only as good as the data being used and the clarity with which results are presented. So handling your data properly is one of the most important skills you can learn in data analysis.

In this lesson, you'll learn how to prepare for analysis by

* importing external datasets from CSV files
* cleaning and validating data prior to analysis

You'll also learn how to present your analysis by

* formatting tables for human readability
* protecting data when sharing spreadsheets with others

Introduction to Handling Data

Many datasets are stored in formats other than Excel's **XLSX** format. One of the most common storage formats used across the data science world is the **CSV** or **comma-separated values** format. 

A CSV file is a text file containing data where the columns of data are separated by commas. Here's a sample of what a CSV file might look like if you open it up in a text editor:

```
month,sales,costs
Jan,1000,500
Feb,2000,800
```

Here's what this same CSV file looks like interpreted as a table:

| month | sales | costs |
|-------|-------|------|
| Jan   | 1000  | 500  |
| Feb   | 2000  | 800  |
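
To see how the text turns into rows and columns, here is a short sketch using Python's standard `csv` module on the sample above. It only illustrates the format; Excel's import tool works through menus instead of code.

```python
import csv
import io

# The sample CSV text from above, held in memory instead of a file.
csv_text = "month,sales,costs\nJan,1000,500\nFeb,2000,800\n"

# The first line becomes the column names; each later line becomes one row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    print(row["month"], row["sales"], row["costs"])
# Jan 1000 500
# Feb 2000 800
```

Every value comes back as text, which is one reason specifying data types during import matters.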

### Other Delimiters

CSV files typically use commas as **delimiters**, the symbols that indicate where one column ends and another begins. However, in some cases this doesn't work very well. For example, in some countries the `,` is used as a decimal marker. In these countries, using a `,` as a delimiter will cause problems, since Excel can't tell the difference between `,` as a decimal and `,` as a delimiter. In cases like this, it's better to use another symbol, like the semicolon `;`, as a delimiter. Even with a different delimiter, the format is still called CSV.
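
As a rough illustration of why the delimiter choice matters, the Python sketch below reads a semicolon-delimited file whose numbers use `,` as the decimal marker. The sample values are made up for this example.

```python
import csv
import io

# Hypothetical file: ";" separates columns, "," marks decimals.
csv_text = "month;sales\nJan;1000,50\nFeb;2000,75\n"

# Tell the reader which symbol is the delimiter.
reader = csv.reader(io.StringIO(csv_text), delimiter=";")
header, *records = list(reader)

# Convert decimal commas to decimal points before treating values as numbers.
sales = [float(value.replace(",", ".")) for _, value in records]

print(header)  # ['month', 'sales']
print(sales)   # [1000.5, 2000.75]
```

Had the reader split on `,` instead, `1000,50` would have been broken into two separate columns.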

Excel provides an import tool that lets us select which symbol is being used as a delimiter. We've placed a slideshow illustrating this process in the learning environment. Note that some of the buttons and menus might look slightly different depending on your operating system and version of Excel.

When you're ready to practice this yourself, get started on the instructions below!


Importing Data

After importing a data file, we're often eager to get started with exploratory data analysis. Unfortunately, raw data is often messy data. If we don't take some time to inspect and clean a dataset, we may get meaningless or misleading results.

For example, let's take a look at this table of monthly sales:

![An Excel table with Month and Sales columns. The month column has entries: jan, jan, february, february, april, march, mar, and january. The sales column has entries: 1000, 100, 50, 30, 30, 22, 1000, 500](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/handling-data/clean-text/table.png)

Suppose we want to know the sales associated with `january`. If we just dive in and filter the table on `month equals january`, we'll get a value of 500.

![The same Excel table, now filtered to only show the single row with month january and sales 500](https://static-assets.codecademy.com/Courses/Spreadsheets/learn-excel-for-data-analysis/handling-data/clean-text/clean-text-0.png)

But there are several other rows of the original table that ought to be associated with `january`! Unfortunately, Excel can't tell that `jan` and `january` really mean the same thing. Even harder to spot is the fact that one of the rows labeled `jan   ` has three spaces at the end. We can't see this extra whitespace easily, but Excel will treat it as a different category.

In a small dataset, these inconsistencies can be edited by hand. For larger datasets, Excel provides text cleaning functions that we can write once and then drag down a column like any other formula. You can even nest these functions (like in the last slide to the right), though nesting is outside the scope of this course.
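
Here is a minimal Python sketch of the same cleanup idea: trim the whitespace, then shorten every label to a consistent length. The month and sales values come from the table above; which `jan` row carries the trailing spaces is an assumption, and the helper `clean` is made up for this example.

```python
months = ["jan   ", "jan", "february", "february", "april", "march", "mar", "january"]
sales = [1000, 100, 50, 30, 30, 22, 1000, 500]


def clean(label):
    """Trim surrounding whitespace, then keep only the first three letters."""
    return label.strip()[:3]


# Matching on the raw labels finds only the single "january" row.
raw_total = sum(s for m, s in zip(months, sales) if m == "january")
# Matching on the cleaned labels finds every January row.
clean_total = sum(s for m, s in zip(months, sales) if clean(m) == "jan")

print(raw_total)    # 500
print(clean_total)  # 1600
```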

We've placed a slideshow demonstrating common text-cleaning functions in the learning environment. When you're ready to try these out yourself, move on to the next set of instructions!

Cleaning Text Data

Before moving on to analysis, we also need to inspect and validate numeric data. There's no one rule for what valid numeric data looks like, since the definition of "invalid" data depends on what we expect "valid" data to look like. For example, an entry of `-500` might be suspicious in a column of Celsius temperatures but reasonable in a column measuring the sea floor relative to sea level.
	
While there is no one rule for cleaning data, there are some common questions that can help guide our inspections:

1. Are there any blank or missing data points?
2. What range of numbers (or dates) are we expecting?
3. Are there any data points that don't meet those expectations?

We've already learned the tools we need to answer these questions! For example:

* **missing data points**: use the `Filter` menu and select `(Blanks)`
* **unexpected values**: use sorting or functions to inspect minimum and maximum values or trends over time
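For the function-based route, a minimal sketch might look like the formulas below. They assume the numeric column occupies `B2:B100`, which is a hypothetical range chosen only for illustration.

```excel
=COUNTBLANK(B2:B100)
=MIN(B2:B100)
=MAX(B2:B100)
```

`COUNTBLANK` returns the number of empty cells in the range, so any result above zero signals missing data. `MIN` and `MAX` return the smallest and largest values, which we can compare against the range of numbers we were expecting.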

Of course, identifying invalid or messy data is only the first half of the process. Once we identify these data points, we have to decide what to do with them. There is a whole science to handling missing data, so be sure to check out our course on [Handling Missing Data](https://www.codecademy.com/learn/handling-missing-data) before working with missing data.

For this course, imagine that you were asked to review the datasets to the right. You've opened them in Excel and are inspecting the data. What would you do in each case? Think critically about whether you can (or should) use the data, and about which types of analysis would and wouldn't be valid.

Flip the cards to read our answer.

When you're ready to practice inspecting data for missing or suspect entries, move on to the next set of instructions!

Inspecting Numeric Data

Because Excel is used to visually explore datasets and present them to others, we often want to change the way tables are displayed to make them easier to read.

We've loaded a table of data in the first slide in the learning environment. This table is perfectly accurate, but it isn't very readable. We want to present tables that provide insight, not tables that provide headaches!

We can solve this problem using Excel's formatting tools. These tools change the way data is displayed without changing the underlying data, which preserves data integrity. Play through the slideshow in the learning environment to see how we can reformat this data using Excel. What do you think of the end result?

We've emphasized that formatting doesn't change the value of a cell. Let's look at an example of why this is important to remember.

In the Excel table below, the cell `B3` is adding the values in `B1` and `B2`.

|   | A    | B   |
|---|------|-----|
| 1 |      | 1.1 |
| 2 |      | 1.4 |
| 3 | sum: | 2.5 |

If we format these numbers to have no decimal places, the table will be displayed like this:

|   | A    | B |
|---|------|---|
| 1 |      | 1 |
| 2 |      | 1 |
| 3 | sum: | 3 |

It seems as if Excel is calculating `1+1=3`! But this is just a consequence of rounding the displayed values. The underlying summation is still `1.1+1.4 = 2.5`.
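To make the distinction concrete, here is a small sketch comparing a plain sum of `B1` and `B2` with a formula that rounds each value before adding. The second formula is an illustration only and is not part of the exercise spreadsheet.

```excel
=B1+B2
=ROUND(B1, 0)+ROUND(B2, 0)
```

The first formula returns `2.5`, which is displayed as `3` when the cell is formatted with no decimal places. The second returns `2`, because `ROUND` changes the values themselves rather than just their appearance. Number formatting never has this effect.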

When you're ready to practice formatting for yourself, get started on the next set of instructions!


Formatting Data

Okay, so we've explored a dataset and have some preliminary results to share with a team member. But our team member doesn't use Excel much, and we don't want them to accidentally overwrite a crucial formula. 
	
Excel provides built-in tools for protecting different aspects of a spreadsheet. It is important to note that none of these will stop a malicious actor from accessing or altering data. All anyone has to do to circumvent protected sheets and cells is to open the file in another program like Google Sheets.

However, protected and hidden sheets are still useful for preventing accidental data overwrites by ourselves or by others, controlling input to a spreadsheet, and cleaning up a spreadsheet for presentation.

Three common options for protecting and hiding data in Excel are:

- Protecting an entire sheet from editing.
- Unprotecting specific cells to allow user input or data updates.
- Hiding sheets to keep raw data and background calculations out of the way.

We've placed a slideshow in the learning environment illustrating how to do each of these in Excel.

### Best practices for protecting and hiding ranges

Protected and hidden ranges can be very helpful. They can also be very frustrating. Pretty much every Excel user has a story of hours lost hunting for hidden data in a large spreadsheet. We recommend the following practices for avoiding these frustrations:

- Give hidden sheets meaningful names. Nobody wants to spend hours hunting for data hidden on `Sheet45321`.
- Highlight any unlocked cells in an otherwise protected sheet. Highlighting both warns users that those cells can be overwritten and directs them to the cells they are supposed to change. For example, this is why we included "input your answer here" messages in the exercise spreadsheets for this course!
- Establish norms for spreadsheet hygiene with your team. If everyone follows the same conventions for protecting data, it will be easy to open and use each other's spreadsheets!


Protecting Data

Congratulations! You're now ready to analyze data from start to finish in Excel. You've learned how to prepare for analysis by:

* importing raw data into Excel from CSVs
* cleaning inconsistent text data
* inspecting numeric data for missing or potentially invalid entries

After performing your analysis, you've learned how to:

* make tables human-readable with custom formatting
* protect data from being accidentally overwritten
* clean up the presentation of a file by hiding unneeded sheets


Use statistics, visualizations, and pivot tables to analyze Bitcoin data.

A shot of a screen with Bitcoin price being tracked.

Let's get started by importing the data into Excel. Open up `bitcoin-analysis.xlsx`. If your computer uses `.` for decimals, import `data.csv`. If your computer uses `,` for decimals, import `data-semicolon.csv`.

Let's inspect our data before diving into an analysis. 

First, trim whitespace from any text columns to be sure that those are clean.

Next, check any text data columns for inconsistent entries and reconcile them using formulas.

Lastly, check all text columns for blank/missing entries.

Now that our text data is nice and clean, let's look at the date and numeric columns. 

Start by checking all of these columns for missing data.

Make a note of any missing data *and* any data questions that might be influenced by the missing data.

Next, inspect each column for unusually large or small values. Make a note of any such values you notice.

In an actual business situation, you would probably want to look for correct values for any missing or suspect data first. To practice dealing with these values in analysis, and because the dataset isn't too suspect, we'll move on with this data as-is.

Manuela has asked you to analyze the potential upside to Bitcoin and its volatility. Let's start with the upside.

Find the first 2021 opening value and last 2021 closing value for each symbol. Record these on a sheet named `Upside`, and calculate the percentage change for each. Create a bar chart to visualize the change.
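As a hint for the calculation, percentage change is the difference between the closing and opening values divided by the opening value. The sketch below assumes a hypothetical layout on the `Upside` sheet, with the opening value in `B2` and the closing value in `C2`. Your sheet may be arranged differently.

```excel
=(C2-B2)/B2
```

Formatting the result cell as a percentage displays the value as, for example, `25%` instead of `0.25`.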

Now, let's look into volatility. Find the largest high and smallest low value for each symbol, and record these on a new sheet titled `Spread`. Use formulas to calculate each of the lows as a percentage of the high. Create a bar chart to visualize these percentages.
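One possible way to set this up with formulas is sketched below. It assumes a hypothetical layout in which the imported data has symbols in `A2:A1000`, highs in `D2:D1000`, and lows in `E2:E1000`, and the `Spread` sheet has a symbol name in `A2`. It also assumes a version of Excel that includes `MAXIFS` and `MINIFS` (Excel 2019 or Microsoft 365). Sorting and filtering work just as well.

```excel
=MAXIFS(D2:D1000, A2:A1000, "BTC")
=MINIFS(E2:E1000, A2:A1000, "BTC")
=C2/B2
```

The first two formulas return the largest high and smallest low for rows whose symbol matches `"BTC"`, a placeholder that should be replaced with whatever symbol names appear in the data. If those results are stored in `B2` and `C2` on the `Spread` sheet, the third formula gives the low as a fraction of the high, which can then be formatted as a percentage. On a separate sheet, the ranges would also need the data sheet's name in front of them.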

Create a pivot table on `Exploratory Data Analysis` that records the average closing value per month for each symbol. Use the months as row labels.

Apply the same color scale individually to each column to see how they vary over time.

Create another pivot table that has each day as a row, each symbol as a column, and records the high for that day in the values. 

Excel will automatically add `Month` to the rows when you add the date. Remove it by unchecking it or dragging it out of the Rows box.

Note that you'll be missing some values for SP500 over weekends. Create a line chart of these highs.

On each tab, adjust the formatting of the numbers to make them more readable.

Make sure all charts have titles and labels.

Add a brief note to explain the key takeaway from each sheet.

Add a final recommendation to a `Final Recommendation` sheet.

Analyze Bitcoin Data in Excel

### Why Learn Microsoft Excel?

Microsoft Excel is one of the most popular data tools in the business world. Excel is designed as a one-stop shop for storing and analyzing data, with built-in tools to help you transform data of all kinds into business insights. It is used for everything from complex financial modeling to tracking shipment orders, and knowledge of Excel is foundational to any career in data analysis or business intelligence.

### Take-Away Skills

In this course, you’ll learn to take raw data and produce insightful analyses using Excel's import tools, formulas, sorting and filtering tools, pivot tables, and charts. You will also learn best practices for managing, protecting, and cleaning data in Excel.

Get started with Excel for Data Analysis!

Learn how to leverage Excel to explore data.

Use Excel to analyze Bitcoin price behavior.

See where you can take your new Microsoft Excel skills!

Next Steps

Take your data storytelling skills to the next level by learning how to use Microsoft Excel for data analysis and visualization.

Learn Microsoft Excel for Data Analysis
