Learn the basics of adding columns to existing DataFrames in Pandas.

In the previous lesson, you learned what a DataFrame is and how to select subsets of data from one.

In this lesson, you'll learn how to modify an existing DataFrame.  Some of the skills you'll learn include:
- Adding columns to a DataFrame
- Using lambda functions to calculate complex quantities
- Renaming columns

Modifying DataFrames

Sometimes, we want to add a column to an existing DataFrame.  We might want to add new information or perform a calculation based on the data that we already have.

One way that we can add a new column is by giving a list of the same length as the existing DataFrame.

Suppose we own a hardware store called The Handy Woman and have a DataFrame containing inventory information:

<div class="narrative-table-container">

|Product ID|Product Description|Cost to Manufacture|Price|
|--------------|---------------------------|----------------------------|-|
|1                  | 3 inch screw            | 0.50                            | 0.75 |
|2                  | 2 inch nail               | 0.10                            | 0.25 |
|3                 | hammer                    | 3.00                            | 5.50 |
|4                  | screwdriver            | 2.50                            | 3.00|

</div>

<br>

It looks like the actual quantity of each product in our warehouse is missing!

Let's use the following code to add that information to our DataFrame.
```py
df['Quantity'] = [100, 150, 50, 35]
```
Our new DataFrame looks like this:

<div class="narrative-table-container">

|Product ID|Product Description|Cost to Manufacture|Price|Quantity|
|--------------|---------------------------|----------------------------|--------|-|
|1                  | 3 inch screw            | 0.50                            | 0.75 |100|
|2                  | 2 inch nail               | 0.10                            | 0.25 |150|
|3                 | hammer                    | 3.00                            | 5.50 |50|
|4                  | screwdriver            | 2.50                            | 3.00|35|

</div>
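
As a minimal sketch of the list approach, the snippet below rebuilds the inventory table by hand (the values are copied from the table above) and adds the `Quantity` column. It also shows what happens when the list is the wrong length; the column name `Too Short` is made up for illustration.

```py
import pandas as pd

# Rebuild the inventory table from the example above.
df = pd.DataFrame({
    'Product ID': [1, 2, 3, 4],
    'Product Description': ['3 inch screw', '2 inch nail',
                            'hammer', 'screwdriver'],
    'Cost to Manufacture': [0.50, 0.10, 3.00, 2.50],
    'Price': [0.75, 0.25, 5.50, 3.00],
})

# One value per row: the list has 4 items and the DataFrame has 4 rows.
df['Quantity'] = [100, 150, 50, 35]

# A list of the wrong length is rejected rather than padded.
try:
    df['Too Short'] = [1, 2, 3]  # hypothetical column, only 3 values
    length_error = None
except ValueError as err:
    length_error = err

print(df)
print('Error for the 3-item list:', length_error)
```

The list is matched to the rows by position, so the first value goes to the first row, the second value to the second row, and so on.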

Adding a Column I

We can also add a new column that is the same for all rows in the DataFrame.
Let's return to our inventory example:

<div class="narrative-table-container">

|Product ID|Product Description|Cost to Manufacture|Price|
|--------------|---------------------------|----------------------------|--------|
|1                  | 3 inch screw            | 0.50                            | 0.75 |
|2                  | 2 inch nail               | 0.10                            | 0.25 |
|3                 | hammer                    | 3.00                            | 5.50 |
|4                  | screwdriver            | 2.50                            | 3.00|

</div>
<br>

Suppose we know that all of our products are currently in-stock.  We can add a column that says this:
```py
df['In Stock?'] = True
```
Now all of the rows have a column called `In Stock?` with value `True`.

<div class="narrative-table-container">

|Product ID|Product Description|Cost to Manufacture|Price|In Stock?|
|--------------|---------------------------|----------------------------|--------|-|
|1                  | 3 inch screw            | 0.50                            | 0.75 |True|
|2                  | 2 inch nail               | 0.10                            | 0.25 |True|
|3                 | hammer                    | 3.00                            | 5.50 |True|
|4                  | screwdriver            | 2.50                            | 3.00|True|

</div>
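
Here is a small, self-contained sketch of the same idea. It uses a trimmed-down version of the inventory table, and the `Store` column is a hypothetical extra added only to show that any single value (not just `True`) is repeated for every row.

```py
import pandas as pd

# A trimmed-down version of the inventory table.
df = pd.DataFrame({
    'Product Description': ['3 inch screw', '2 inch nail',
                            'hammer', 'screwdriver'],
    'Price': [0.75, 0.25, 5.50, 3.00],
})

# A single value is copied into every row of the new column.
df['In Stock?'] = True
df['Store'] = 'The Handy Woman'  # hypothetical column for illustration

print(df)
```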

Adding a Column II

Finally, you can add a new column by performing a function on the existing columns.

Maybe we want to add a column to our inventory table with the amount of sales tax that we need to charge for each item.  The following code multiplies each `Price` by `0.075`, the sales tax for our state:
```py
df['Sales Tax'] = df.Price * 0.075
```
Now our table has a column called `Sales Tax`:

<div class="narrative-table-container">

|Product ID|Product Description|Cost to Manufacture|Price|Sales Tax|
|--------------|---------------------------|----------------------------|--------|-|
|1                  | 3 inch screw            | 0.50                            | 0.75 | 0.06|
|2                  | 2 inch nail               | 0.10                            | 0.25 | 0.02|
|3                 | hammer                    | 3.00                            | 5.50 | 0.41|
|4                  | screwdriver            | 2.50                            | 3.00| 0.22|

</div>
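
The sketch below runs the same calculation on a hand-built copy of the table. It also adds a `Margin` column, which is not part of the lesson's example; it is included as an assumption to show that a column whose name contains spaces has to be selected with bracket notation.

```py
import pandas as pd

df = pd.DataFrame({
    'Product Description': ['3 inch screw', '2 inch nail',
                            'hammer', 'screwdriver'],
    'Cost to Manufacture': [0.50, 0.10, 3.00, 2.50],
    'Price': [0.75, 0.25, 5.50, 3.00],
})

# Multiply every value in the Price column by the tax rate.
df['Sales Tax'] = df.Price * 0.075

# 'Cost to Manufacture' contains spaces, so df.Cost to Manufacture
# is not valid Python; use brackets instead.
df['Margin'] = df['Price'] - df['Cost to Manufacture']  # hypothetical column

print(df)
```

The values printed for `Sales Tax` are unrounded (for example `0.4125` for the hammer); the table above shows them rounded to two decimal places.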

Adding a Column III

In the previous exercise, we learned how to add columns to a DataFrame.

Often, the column that we want to add is related to existing columns, but requires a calculation more complex than multiplication or addition.

For example, imagine that we have the following table of customers.

<div class="narrative-table-container"> <div class="narrative-table-scroll">

|Name|Email|
|--------|--------|
|JOHN SMITH|john.smith@gmail.com|
|Jane Doe|jdoe@yahoo.com|
|joe schmo|joeschmo@hotmail.com|

</div></div>

It’s a little annoying that the capitalization is different for each row.  Perhaps we’d like to make it more consistent by making all of the letters uppercase.

We can use the `apply` method to apply a function to every value in a particular column.  For example, this code overwrites the existing `'Name'` column by applying the function `str.upper` to every value in `'Name'`.

```py
df['Name'] = df.Name.apply(str.upper)
```
The result:

<div class="narrative-table-container"> <div class="narrative-table-scroll">

|Name|Email|
|--------|--------|
|JOHN SMITH|john.smith@gmail.com|
|JANE DOE|jdoe@yahoo.com|
|JOE SCHMO|joeschmo@hotmail.com|

</div></div>
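
For reference, here is a runnable sketch of that example on a hand-built copy of the customer table. It also shows Pandas' built-in string accessor, `.str.upper()`, which is an alternative not used in this lesson; the variable names are made up for illustration.

```py
import pandas as pd

df = pd.DataFrame({
    'Name': ['JOHN SMITH', 'Jane Doe', 'joe schmo'],
    'Email': ['john.smith@gmail.com', 'jdoe@yahoo.com',
              'joeschmo@hotmail.com'],
})

# Approach from the lesson: call str.upper on each value.
upper_with_apply = df.Name.apply(str.upper)

# Alternative: the .str accessor applies string methods to a whole column.
upper_with_str = df.Name.str.upper()

print(upper_with_apply.tolist())
print(upper_with_str.tolist())

# Overwrite the original column with the uppercase version.
df['Name'] = upper_with_apply
```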

Performing Column Operations

A *lambda function* is a way of defining a function in a single line of code. Usually, we would assign them to a variable.

For example, the following lambda function multiplies a number by 2 and then adds 3:
```py
mylambda = lambda x: (x * 2) + 3
print(mylambda(5))
```
The output:
```py
> 13
```

Lambda functions work with all types of variables, not just integers! Here is an example that takes in a string, assigns it to the temporary variable `x`, and then converts it into lowercase:
```py
stringlambda = lambda x: x.lower()
print(stringlambda("Oh Hi Mark!"))
```
The output:
```py
> oh hi mark!
```

Learn more about lambda functions in <a href="https://www.codecademy.com/articles/lambda-functions" target="_blank">this article</a>!

Reviewing Lambda Function

We can make our lambdas more complex by using a modified form of an if statement.

Suppose we want to pay workers time-and-a-half for overtime (any work above 40 hours per week).  The following function will convert the number of hours into time-and-a-half hours using an if statement:
```py
def myfunction(x):
    if x > 40:
        return 40 + (x - 40) * 1.50
    else:
        return x
```
Below is a lambda function that does the same thing:
```py
myfunction = lambda x: 40 + (x - 40) * 1.50 if x > 40 else x
```
In general, the syntax for an if statement in a lambda function is:
```py
lambda x: [OUTCOME IF TRUE] if [CONDITIONAL] else [OUTCOME IF FALSE]
```
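
To check that the lambda behaves like the longer function, we can try it on a few inputs. This is only a quick sketch; the name `overtime` is chosen here for illustration.

```py
# Same logic as the lambda above, under a hypothetical name.
overtime = lambda x: 40 + (x - 40) * 1.50 if x > 40 else x

print(overtime(35))  # 35   (under 40 hours, returned unchanged)
print(overtime(40))  # 40   (exactly 40 is not greater than 40)
print(overtime(50))  # 55.0 (40 + 10 overtime hours * 1.5)
```

The whole expression before `if` is the outcome when the condition is true, so the lambda reads as `(40 + (x - 40) * 1.50) if x > 40 else x`.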

Reviewing Lambda Function: If Statements

In Pandas, we often use lambda functions to perform complex operations on columns. For example, suppose that we want to create a column containing the email provider for each email address in the following table:

<div class="narrative-table-container"> <div class="narrative-table-scroll">

|Name|Email|
|--------|--------|
|JOHN SMITH|john.smith@gmail.com|
|Jane Doe|jdoe@yahoo.com|
|joe schmo|joeschmo@hotmail.com|

</div></div>

We could use the following code with a lambda function and the [string method `.split()`](https://www.codecademy.com/courses/learn-python-3/lessons/string-methods/exercises/splitting-strings):
```py
df['Email Provider'] = df.Email.apply(
    lambda x: x.split('@')[-1]
    )
```

The result would be:

<div class="narrative-table-container"> <div class="narrative-table-scroll">

|Name|Email|Email Provider|
|--------|--------|--------------------|
|JOHN SMITH|john.smith@gmail.com|gmail.com|
|Jane Doe|jdoe@yahoo.com|yahoo.com|
|joe schmo|joeschmo@hotmail.com|hotmail.com|

</div></div>
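
The following self-contained sketch rebuilds the table above and runs the same lambda, so you can see each step. Splitting on `'@'` produces a list such as `['jdoe', 'yahoo.com']`, and `[-1]` picks its last item.

```py
import pandas as pd

df = pd.DataFrame({
    'Name': ['JOHN SMITH', 'Jane Doe', 'joe schmo'],
    'Email': ['john.smith@gmail.com', 'jdoe@yahoo.com',
              'joeschmo@hotmail.com'],
})

# What the lambda does to a single value:
print('jdoe@yahoo.com'.split('@'))  # ['jdoe', 'yahoo.com']

# Apply it to every value in the Email column.
df['Email Provider'] = df.Email.apply(
    lambda x: x.split('@')[-1]
)

print(df['Email Provider'].tolist())
```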

Applying a Lambda to a Column

We can also operate on multiple columns at once. If we use `apply` without specifying a single column and add the argument `axis=1`, the input to our lambda function will be an entire row, not a column.  To access particular values of the row, we use the syntax `row.column_name` or `row['column_name']`.

Suppose we have a table representing a grocery list:

<div class="narrative-table-container"> <div class="narrative-table-scroll">

|Item | Price | Is taxed? |
|-------|--------|--------------|
|Apple|1.00  | No            |
|Milk|4.20  | No            |
|Paper Towels|5.00  | Yes            |
|Light Bulbs|3.75  | Yes            |

</div></div>

If we want to add in the price with tax for each line, we’ll need to look at two columns: `Price` and `Is taxed?`.

If `Is taxed?` is `Yes`, then we’ll want to multiply `Price` by 1.075 (for 7.5% sales tax).

If `Is taxed?` is `No`, we’ll just have `Price` without multiplying it.

We can create this column using a lambda function and the keyword argument `axis=1`:
```py
df['Price with Tax'] = df.apply(lambda row:
     row['Price'] * 1.075
     if row['Is taxed?'] == 'Yes'
     else row['Price'],
     axis=1
)
```
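
Here is a complete, runnable version of that example, using a hand-built copy of the grocery list. It is a sketch that only assumes the table contents shown above.

```py
import pandas as pd

df = pd.DataFrame({
    'Item': ['Apple', 'Milk', 'Paper Towels', 'Light Bulbs'],
    'Price': [1.00, 4.20, 5.00, 3.75],
    'Is taxed?': ['No', 'No', 'Yes', 'Yes'],
})

# With axis=1, the lambda receives one row at a time.
df['Price with Tax'] = df.apply(
    lambda row: row['Price'] * 1.075
    if row['Is taxed?'] == 'Yes'
    else row['Price'],
    axis=1
)

print(df)
```

Because `Is taxed?` contains a space and a question mark, it can only be read with `row['Is taxed?']`, not with the `row.column_name` form.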



Applying a Lambda to a Row

When we get our data from other sources, we often want to change the column names.  For example, we might want all of the column names to follow variable name rules, so that we can use **`df.column_name`** (which tab-completes) rather than **`df['column_name']`** (which takes up extra space).

You can change all of the column names at once by setting the `.columns` property to a different list.  This is great when you need to change all of the column names at once, but be careful! You can easily mislabel columns if you get the ordering wrong.  Here's an example:
```py
df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
df.columns = ['First Name', 'Age']
```
This command edits the **existing** DataFrame `df`.
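
To illustrate the warning about ordering, the sketch below renames the columns correctly and then repeats the assignment with the two names swapped. The variable name `mislabeled` is made up for this example.

```py
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
df.columns = ['First Name', 'Age']
print(list(df.columns))

# Same data, but the new names are listed in the wrong order.
mislabeled = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
mislabeled.columns = ['Age', 'First Name']

# Pandas does not complain: the 'Age' column now holds the names.
print(mislabeled['Age'].tolist())
```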

Renaming Columns

You can also rename individual columns by using the `.rename` method.  Pass a dictionary like the one below to the `columns` keyword argument:
```py
{'old_column_name1': 'new_column_name1', 'old_column_name2': 'new_column_name2'}
```
Here's an example:
```py
df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
df.rename(columns={
    'name': 'First Name',
    'age': 'Age'},
    inplace=True)
```
The code above will rename `name` to `First Name` and `age` to `Age`.

Using `rename` with only the `columns` keyword will create a **new** DataFrame, leaving your original DataFrame unchanged.  That's why we also passed in the keyword argument **`inplace=True`**.  Using `inplace=True` lets us edit the **original** DataFrame.

There are several reasons why `.rename` is preferable to `.columns`:
-  You can rename just one column
- You can be specific about which column names are getting changed (with `.columns` you can accidentally switch column names if you're not careful)

**Note:** If you misspell one of the original column names, this command won't fail.  It just won't change anything.
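
The sketch below demonstrates these points: `.rename` returns a new DataFrame unless `inplace=True` is used, and a misspelled key is silently ignored. It also uses the `errors='raise'` argument of `.rename`, which is not covered in this lesson, to turn a misspelling into a `KeyError`. The variable names and the typo `'nmae'` are made up for illustration.

```py
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})

# Without inplace=True, the original DataFrame is left unchanged.
renamed = df.rename(columns={'name': 'First Name'})
print(list(df.columns))       # ['name', 'age']
print(list(renamed.columns))  # ['First Name', 'age']

# A misspelled original name does nothing and raises no error.
unchanged = df.rename(columns={'nmae': 'First Name'})
print(list(unchanged.columns))  # ['name', 'age']

# Asking Pandas to be strict turns the typo into a KeyError.
try:
    df.rename(columns={'nmae': 'First Name'}, errors='raise')
    strict_error = None
except KeyError as err:
    strict_error = err
print('Strict rename failed with:', strict_error)
```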

Renaming Columns II

Great job!  In this lesson, you learned how to modify an existing DataFrame.  Some of the skills you've learned include:
- Adding columns to a DataFrame
- Using lambda functions to calculate complex quantities
- Renaming columns

Let's practice what you just learned!

Review

Use Pandas to create and manipulate tables so that you can process your data faster and get your insights sooner.

Introduction to Pandas

Analyze the inventory of Petal Power gardening store.

Data for all of the locations of Petal Power is in the file `inventory.csv`.  Load the data into a DataFrame called `inventory`.

Inspect the first 10 rows of `inventory`.

The first 10 rows represent data from your Staten Island location.  Select these rows and save them to `staten_island`.

A customer just emailed you asking what products are sold at your Staten Island location.  Select the column `product_description` from `staten_island` and save it to the variable `product_request`.

Another customer emails to ask what types of seeds are sold at the Brooklyn location.

Select all rows where `location` is equal to `Brooklyn` and `product_type` is equal to `seeds` and save them to the variable `seed_request`.

Add a column to `inventory` called `in_stock` which is `True` if `quantity` is greater than 0 and `False` if `quantity` equals 0.

Petal Power wants to know how valuable their current inventory is.

Create a column called `total_value` that is equal to `price` multiplied by `quantity`.

The Marketing department wants a complete description of each product for their catalog.

The following lambda function combines `product_type` and `product_description` into a single string:
```py
combine_lambda = lambda row: \
    '{} - {}'.format(row.product_type,
                     row.product_description)
```
Paste this function into `script.py`.

Using `combine_lambda`, create a new column in `inventory` called `full_description` that has the complete description of each product.

Petal Power Inventory

```py
office_supply_store['remaining_inventory'] = office_supply_store.initial_inventory - office_supply_store.number_sold
```

```py
office_supply_store.remaining_inventory = office_supply_store.initial_inventory - office_supply_store.number_sold
```

```py
column['remaining_inventory'] = column['initial_inventory'] - column['number_sold']
```

```py
office_supply_store['remaining_inventory'] = column['initial_inventory'] - column['number_sold']
```

```py
mylambda = lambda x: x**2 + 3*x \
    if x > 7 \
    else 2*x - 10
```

```py
customers = pd.DataFrame([
  ['Annie Hall', 'anniehall@snailmail.com', '201-555-0213'],
  ['Geoffrey Adams', 'gadams@yahee.com', '622-555-0994'],
  ['Casey Flanders', 'flandersc@netscope.com', '413-555-9431'],
  ['Hiroshi Tanaka', 'hiroshit@snailmail.com', '718-555-1985'],
  ['Sam Waterson', 'waterboy@hitmail.com', '594-555-8321']
],
  columns = ['name', 'email', 'phone_number']
)
customers['purchase_volume'] = [20, 6, 15, 9, 11]
customers['area_code'] = customers.phone_number.apply(
    lambda x: x.split('-')[0]
    )
```

```py
attendance['call-parents'] = attendance.apply(lambda row:
  'Yes'
  if row['days-absent'] > 10
  else 'No',
  axis=1
)
```

```py
attendance['call-parents'] = attendance.apply(
  if 'days-absent' > 10: 'Yes'
  else: 'No',
)
```

```py
attendance['call-parents'] = attendance.apply(lambda row:
  if row['days-absent'] > 10
  'Yes'
  else 'No',
  axis=1
)
```

```py
attendance['call-parents'] = attendance.apply(lambda row:
  'Yes'
  if row['days-absent'] > 10
  else 'No',
)
```

\begin{array}{l}
\textbf{Attendance}\\
\begin{array}{l|l|l}
& \textbf{student-name} & \textbf{days-absent} \\
0 & \text{Dean} & 11 \\
1 & \text{Henry} & 7 \\
2 & \text{Nora} & 18 \\
3 & \text{Ayesha} & 9 \\
4 & \text{Cole} & 10 \\
5 & \text{Dimitri} & 3
\end{array} 
\end{array}

This code will not work because the lambda function takes an input of `x` but the `if` statement checks against the value of `age`.

This code will not work because the `if` statement is in the wrong position. It needs to come before `'Welcome to the Site!'` line.

The `lambda` definition needs to come before `=` sign. The correct structure would be `lambda x: age_check = ` 

```py
age_check = lambda x: 'Welcome to the site!' \
    if age > 13 \
    else 'Sorry you are too young to enter this site.'
```

```py
grades = grades.rename(columns={
    'id': 'Name',
    'unit-1': 'Ecology',
    'unit-2': 'Cells',
    'unit-3': 'Genetics'})
```

```py
grades = grades.rename({
    'id': 'Name',
    'unit-1': 'Ecology',
    'unit-2': 'Cells',
    'unit-3': 'Genetics'})
```

```py
grades.columns = ['Name', 'Cells', 'Ecology', 'Genetics']
```

```py
 grades.columns = {
    'id': 'Name',
    'unit-1': 'Ecology',
    'unit-2': 'Cells',
    'unit-3': 'Genetics'}
```

\begin{array}{l}
\textbf{Grades} \\
\begin{array}{l|l|l|l|l}
& \textbf{id} & \textbf{unit-1} & \textbf{unit-2} & \textbf{unit-3} \\
0 & \text{Dean} & 81 & 79 & 89 \\
1 & \text{Henry} & 72 & 78 & 75 \\
2 & \text{Nora} & 82 & 91 & 85 \\
3 & \text{Ayesha} & 84 & 89 & 90 \\
4 & \text{Cole} & 75 & 80 & 80 \\
5 & \text{Dimitri} & 91 & 95 & 96
\end{array} 
\end{array} 

The code will not run because `Grade for the Year` has fewer elements than the other columns.

This code will not run because `Grade for the Year` is an invalid column name. 

This code will not run because DataFrames can not contain strings such as the student's names.

```py
grades = pd.DataFrame({
  'name': ['Chloe', 'Grace', 'Jeremy', 'Isla'],
  'unit_1_grade': [95, 82, 83, 75],
  'unit_2_grade': [91, 74, 89, 84],
  'Grade for the Year': [93, 78, 86] })
```

```py
customers[customers.column == 'age']
```

```py
customers = pd.DataFrame([
  ['Jesse Sternberg', '193 6th Avenue', 31],
  ['Amy Lauder', '546 Marblehead Way', 43],
  ['Gerri Sanderson', '65 New York Street', 35],
  ['Austin Barnes', '2888 North Ogden Avenue', 28]],
  columns = ['name', 'address', 'age'])
```

```py 
clinic_df[clinic_df.month == 'May']
```

```py
clinic_df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west'])
```

```py 
df.reset_index(inplace=True, drop=True)
``` 

\begin{array}{l}
\textbf{Initial Data Frame:} \\
\begin{array}{l|l|l|l}
 & \textbf{First Name} & \textbf{Last Name} & \textbf{Occupation} \\
1 & \text{Ashley} & \text{Smith} & \text{Engineer} \\
4 & \text{Matt} & \text{Kravitz} & \text{Teacher} \\
9 & \text{Jacob} & \text{Rogers} & \text{Customer Service}
\end{array} \\
\\
\textbf{Final Data Frame:}\\
\begin{array}{l|l|l|l}
 & \textbf{First Name} & \textbf{Last Name} & \textbf{Occupation} \\
0 & \text{Ashley} & \text{Smith} & \text{Engineer} \\
1 & \text{Matt} & \text{Kravitz} & \text{Teacher} \\
2 & \text{Jacob} & \text{Rogers} & \text{Customer Service}
\end{array}
\end{array}

```py 
content = pd.read_csv('content_inventory.csv')
```

```py 
content = pd.read_csv('content_inventory')
```

```py 
content = pd.from_csv(content_inventory.csv)
```

```py 
content = pd.from_csv('content_inventory.csv')
```

Creating, Loading, and Selecting Data with Pandas

Learn the basics of creating, loading, and selecting data in Pandas using DataFrames.

Pandas is a Python module for working with tabular data (i.e., data in a table with rows and columns). Pandas has a lot of the same functionality as SQL or Excel, but it adds the power of Python.

In order to get access to the Pandas module, we'll need to install the module and then import it into a Python file. To learn how to install Python for data analysis on your personal computer, please refer to the following Codecademy resources:

* <a href="https://www.codecademy.com/articles/introducing-jupyter-notebook" target="_blank">Introducing Jupyter Notebook</a>
* <a href="https://www.codecademy.com/articles/setting-up-jupyter-notebook" target="_blank">Setting Up Jupyter Notebook</a>
* <a href="https://www.codecademy.com/articles/getting-started-with-jupyter" target="_blank">Getting Started with Jupyter Notebook</a>

Otherwise, let's move on! The `pandas` module is usually imported at the top of a Python file under the alias `pd`.

```py
import pandas as pd
```

If we need to access the `pandas` module, we can do so by operating on `pd`.

In this lesson, you’ll learn the basics of working with a single table in Pandas, such as:
- Creating a table from scratch
- Loading data from another file
- Selecting certain rows or columns of a table

----
**Note:** In order for Codecademy to properly display data from Pandas, we need to import another special library:
```py
import codecademylib3
```
When you're on Codecademy.com, we'll always provide this import for you at the top of `script.py`.

When you're not on Codecademy.com, you won't need it.

Importing the Pandas Module

A DataFrame is an object that stores data as rows and columns. You can think of a DataFrame as a spreadsheet or as a SQL table. You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.

DataFrames have rows and columns. Each column has a name, which is a string. Each row has an index, which by default is an integer starting at 0. DataFrames can contain many different data types: strings, ints, floats, tuples, etc.

You can pass in a dictionary to `pd.DataFrame()`. Each key is a column name and each value is a list of column values.  The columns must all be the same length or you will get an error.  Here’s an example:

```py
df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
    'age': [34, 28, 51]
})
```

This command creates a DataFrame called `df1` that looks like this:

<div class="narrative-table-container"> <div class="narrative-table-scroll">

|name|address|age|
|-|-|-|
|John Smith|123 Main St.|34|
|Jane Doe|456 Maple Ave.|28|
|Joe Schmo|789 Broadway|51|

</div>
</div>

Create a DataFrame I

You can also add data using lists. 

For example, you can pass in a list of lists, where each one represents a row of data.  Use the keyword argument `columns` to pass a list of column names.  

```py
df2 = pd.DataFrame([
    ['John Smith', '123 Main St.', 34],
    ['Jane Doe', '456 Maple Ave.', 28],
    ['Joe Schmo', '789 Broadway', 51]
    ],
    columns=['name', 'address', 'age'])
```

This command produces a DataFrame `df2` that looks like this:

<div class="narrative-table-container">

|name|address|age|
|-|-|-|
|John Smith|123 Main St.|34|
|Jane Doe|456 Maple Ave.|28|
|Joe Schmo|789 Broadway|51|

</div>
<br>

In this example, we were able to control the ordering of the columns because we used lists.
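
As a quick check, the sketch below builds the same table both ways and compares the results. It also shows the error that Pandas raises when the lists in a dictionary have different lengths; the two-row dictionary at the end is made up for illustration.

```py
import pandas as pd

# Dictionary form: one key per column.
df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
    'age': [34, 28, 51]
})

# List-of-lists form: one inner list per row.
df2 = pd.DataFrame([
    ['John Smith', '123 Main St.', 34],
    ['Jane Doe', '456 Maple Ave.', 28],
    ['Joe Schmo', '789 Broadway', 51]
    ],
    columns=['name', 'address', 'age'])

# Both approaches should describe the same table.
print(df1.equals(df2))

# Columns of different lengths are rejected.
try:
    pd.DataFrame({'name': ['John Smith', 'Jane Doe'], 'age': [34]})
    length_error = None
except ValueError as err:
    length_error = err
print('Error for mismatched lengths:', length_error)
```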

Create a DataFrame II

We now know how to create our own DataFrame.  However, most of the time, we'll be working with datasets that already exist.  One of the most common formats for big datasets is the *CSV*.

*CSV (comma separated values)* is a text-only spreadsheet format.  You can find CSVs in lots of places:
* Online datasets (here's an example from <a href="https://catalog.data.gov/dataset?res_format=CSV" target="_blank" rel="noopener noreferrer">data.gov</a>)
* Export from Excel or Google Sheets
* Export from SQL


The first row of a CSV contains column headings. All subsequent rows contain values. Each column heading and each value is separated by a comma:

```
column1,column2,column3
value1,value2,value3
```
That example CSV represents the following table:

<div class="narrative-table-container">

|column1|column2|column3|
|-----------|------------|------------|
|value1    |value2    |value3    |

</div>

Comma Separated Values (CSV)

When you have data in a CSV, you can load it into a DataFrame in Pandas using `.read_csv()`:

```py
pd.read_csv('my-csv-file.csv')
```

In the example above, the function `pd.read_csv()` is called. The name of the CSV file, `my-csv-file.csv`, is passed in as an argument, and the function returns a DataFrame.

We can also save data to a CSV, using `.to_csv()`.

```py
df.to_csv('new-csv-file.csv')
```

In the example above, the `.to_csv()` method is called on `df` (which represents a DataFrame object).  The name of the CSV file is passed in as an argument (`new-csv-file.csv`). By default, this method will save the CSV file in your current directory.
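
The sketch below shows a full round trip without needing a file on disk. It is an illustration under a few assumptions: the CSV text is the small example from earlier, and `io.StringIO` objects stand in for real files (both `pd.read_csv()` and `.to_csv()` accept them in place of a filename). The `index=False` argument, which is not covered above, leaves the row index out of the saved CSV.

```py
import io
import pandas as pd

# The example CSV from earlier, held in memory instead of in a file.
csv_text = "column1,column2,column3\nvalue1,value2,value3\n"

# Load: the first line becomes the column headings.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Save: write the DataFrame back out as CSV text.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())

# Reading the saved text again gives back the same table.
reloaded = pd.read_csv(io.StringIO(buffer.getvalue()))
```

With a real file, the calls look the same except that a filename such as `'my-csv-file.csv'` replaces the `io.StringIO` object.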

Loading and Saving CSVs

When we load a new DataFrame from a CSV, we want to know what it looks like.

If it's a small DataFrame, you can display it by typing `print(df)`.

If it's a larger DataFrame, it's helpful to be able to inspect a few items without having to look at the entire DataFrame.

The method `.head()` gives the first 5 rows of a DataFrame.  If you want to see more rows, you can pass in the positional argument `n`.  For example,  `df.head(10)` would show the first 10 rows.

The method **`df.info()`** prints a summary of the DataFrame, including each column's name, data type, and number of non-null values.
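
Here is a short sketch of these inspection methods on a made-up DataFrame with 20 rows. The column names `id` and `score` are hypothetical and chosen only for this example.

```py
import pandas as pd

# A made-up DataFrame with 20 rows.
df = pd.DataFrame({
    'id': list(range(1, 21)),
    'score': [x * 1.5 for x in range(1, 21)],
})

first_five = df.head()    # first 5 rows by default
first_ten = df.head(10)   # first 10 rows

print(first_five)
print(len(first_ten))

# Prints column names, non-null counts, and data types.
df.info()
```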

Inspect a DataFrame

Now we know how to create and load data. Let's select parts of those datasets that are interesting or important to our analyses.

Suppose you have the DataFrame called `customers`, which contains the ages of your customers: 
<div class='narrative-table-container'>

|name|age|
|-|-|
|Rebecca Erikson|35|
|Thomas Roberson|28|
|Diane Ochoa|42|
|...|...|

</div>
<br>

Perhaps you want to take the average or plot a histogram of the ages.  In order to do either of these tasks, you'd need to select the column.

There are two possible syntaxes for selecting all values from a column:
1. Select the column as if you were selecting a value from a dictionary using a key. In our example, we would type `customers['age']` to select the ages.
2. If the name of a column follows all of the rules for a variable name (doesn't start with a number, doesn't contain spaces or special characters, etc.), then you can select it using the following notation: `df.MySecondColumn`.  In our example, we would type `customers.age`.

---
When we select a single column, the result is called a *Series*.
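
As a sketch, the code below selects the `age` column both ways and then uses the resulting Series. It assumes a `customers` DataFrame containing only the three rows visible in the table above.

```py
import pandas as pd

# Only the three rows shown in the table above.
customers = pd.DataFrame({
    'name': ['Rebecca Erikson', 'Thomas Roberson', 'Diane Ochoa'],
    'age': [35, 28, 42],
})

ages_bracket = customers['age']  # dictionary-style selection
ages_dot = customers.age         # attribute-style selection

print(type(ages_bracket))        # a pandas Series
print(ages_bracket.tolist() == ages_dot.tolist())
print(ages_bracket.mean())       # average of 35, 28, and 42 is 35.0
```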

Select Columns

Selecting Multiple Columns

Let's revisit our `orders` from ShoeFly.com:

<div class="narrative-table-container">

|id|first_name|last_name|email|shoe_type|shoe_material|shoe_color|
| --- | --- | --- | --- | --- | --- | --- |
|54791|Rebecca|Lindsay|RebeccaLindsay57@hotmail.com|clogs|faux-leather|black|
|53450|Emily|James|EmilyJames25@gmail.com|ballet flats|faux-leather|navy|
|91987|Joyce|Waller|Joyce.Waller@gmail.com|sandals|fabric|black|
|14437|Justin|Erickson|Justin.Erickson@outlook.com|clogs|faux-leather|red|
|...| | | | | | |

</div>

<br>

Maybe our Customer Service department has just received a message from Joyce Waller, so we want to know exactly what she ordered. We want to select this single row of data.

DataFrames are zero-indexed, meaning that we start with the 0th row and count up from there. Joyce Waller's order is the third row in the table, so its index is `2`.

We select it using the following command:

```py
orders.iloc[2]
```

When we select a single row, the result is a *Series* (just like when we select a single column).
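
The sketch below rebuilds the four visible rows of `orders` and selects Joyce Waller's row. The variable name `joyce` is made up for illustration.

```py
import pandas as pd

# The four rows shown in the table above.
orders = pd.DataFrame([
    [54791, 'Rebecca', 'Lindsay', 'RebeccaLindsay57@hotmail.com',
     'clogs', 'faux-leather', 'black'],
    [53450, 'Emily', 'James', 'EmilyJames25@gmail.com',
     'ballet flats', 'faux-leather', 'navy'],
    [91987, 'Joyce', 'Waller', 'Joyce.Waller@gmail.com',
     'sandals', 'fabric', 'black'],
    [14437, 'Justin', 'Erickson', 'Justin.Erickson@outlook.com',
     'clogs', 'faux-leather', 'red'],
], columns=['id', 'first_name', 'last_name', 'email',
            'shoe_type', 'shoe_material', 'shoe_color'])

# Position 2 is the third row, because counting starts at 0.
joyce = orders.iloc[2]

print(type(joyce))         # a pandas Series
print(joyce['shoe_type'])  # sandals
```

In the resulting Series, the column names of the DataFrame become the labels, so individual values can be read with `joyce['shoe_type']`.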

Select Rows

Selecting Multiple Rows

You can select a subset of a DataFrame by using logical statements:
```py
df[df.MyColumnName == desired_column_value]
```
We have a large DataFrame with information about our customers.  A few of the many rows look like this:

<div class="narrative-table-container">
<div class="narrative-table-scroll">

|name|address|phone|age|
|-|-|-|-|
|Martha Jones|123 Main St.|234-567-8910|28|
|Rose Tyler|456 Maple Ave.|212-867-5309|22|
|Donna Noble|789 Broadway|949-123-4567|35|
|Amy Pond|98 West End Ave.|646-555-1234|29|
|Clara Oswald|54 Columbus Ave.|714-225-1957|31|
|...|...|...|...|

</div>
</div>
<br>

Suppose we want to select all rows where the customer's age is 30.  We would use:
```py
df[df.age == 30]
```
In Python, `==` is how we test if a value is exactly equal to another value.

We can use other logical statements, such as:
* Greater Than, `>` &mdash; Here, we select all rows where the customer's age is greater than 30:
```py
df[df.age > 30]
```
* Less Than, `<` &mdash;
Here, we select all rows where the customer's age is less than 30:
```py
df[df.age < 30]
```
* Not Equal, `!=` &mdash;
This snippet selects all rows where the customer's name is *not* `Clara Oswald`:
```py
df[df.name != 'Clara Oswald']
```
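
To see why this syntax works, it helps to look at the comparison on its own. The sketch below uses only the five visible rows of the customer table (with just the `name` and `age` columns); the variable names are made up for illustration.

```py
import pandas as pd

df = pd.DataFrame({
    'name': ['Martha Jones', 'Rose Tyler', 'Donna Noble',
             'Amy Pond', 'Clara Oswald'],
    'age': [28, 22, 35, 29, 31],
})

# The comparison produces one True/False value per row.
over_30_mask = df.age > 30
print(over_30_mask.tolist())     # [False, False, True, False, True]

# Indexing with that result keeps only the rows marked True.
over_30 = df[over_30_mask]
print(over_30['name'].tolist())  # ['Donna Noble', 'Clara Oswald']
```

Writing `df[df.age > 30]` simply combines these two steps into one line.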

Select Rows with Logic I

You can also combine multiple logical statements, as long as each statement is in parentheses.

For instance, suppose we wanted to select all rows where the customer's age was under 30 *or* the customer's name was "Martha Jones":

<div class="narrative-table-container"> <div class="narrative-table-scroll">

|name|address|phone|age|
|-|-|-|-|
|Martha Jones|123 Main St.|234-567-8910|28|
|Rose Tyler|456 Maple Ave.|212-867-5309|22|
|Donna Noble|789 Broadway|949-123-4567|35|
|Amy Pond|98 West End Ave.|646-555-1234|29|
|Clara Oswald|54 Columbus Ave.|714-225-1957|31|
|...| | | |

</div></div>
<br>

We could use the following code:
```py
df[(df.age < 30) |
   (df.name == 'Martha Jones')]
```
In Pandas, `|` means "or" and `&` means "and" when combining logical statements like these.
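
Here is a runnable sketch of both operators, again using only the five visible rows with the `name` and `age` columns. The variable names and the `&` condition are made up for illustration.

```py
import pandas as pd

df = pd.DataFrame({
    'name': ['Martha Jones', 'Rose Tyler', 'Donna Noble',
             'Amy Pond', 'Clara Oswald'],
    'age': [28, 22, 35, 29, 31],
})

# "or": a row is kept if either condition is true.
under_30_or_martha = df[(df.age < 30) |
                        (df['name'] == 'Martha Jones')]
print(under_30_or_martha['name'].tolist())
# ['Martha Jones', 'Rose Tyler', 'Amy Pond']

# "and": a row is kept only if both conditions are true.
late_twenties = df[(df.age > 25) & (df.age < 30)]
print(late_twenties['name'].tolist())
# ['Martha Jones', 'Amy Pond']
```

The parentheses matter: `|` and `&` are evaluated before comparisons such as `<` and `==`, so leaving them out changes what is being compared.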

Select Rows with Logic II

Suppose we want to select the rows where the customer's name is either "Martha Jones", "Rose Tyler" or "Amy Pond".

<div class="narrative-table-container">
<div class="narrative-table-scroll">

|name|address|phone|age|
|-|-|-|-|
|Martha Jones|123 Main St.|234-567-8910|28|
|Rose Tyler|456 Maple Ave.|212-867-5309|22|
|Donna Noble|789 Broadway|949-123-4567|35|
|Amy Pond|98 West End Ave.|646-555-1234|29|
|Clara Oswald|54 Columbus Ave.|714-225-1957|31|
|...|...|...|...|

</div>
</div>
<br>

We could use the `.isin()` method to check that `df.name` is one of a list of values:
```py
df[df.name.isin(['Martha Jones',
     'Rose Tyler',
     'Amy Pond'])]
```
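
The sketch below runs that selection on the five visible rows. It also shows the `~` operator, which is not covered in this lesson, to select the rows that are *not* in the list. The variable names are made up for illustration.

```py
import pandas as pd

df = pd.DataFrame({
    'name': ['Martha Jones', 'Rose Tyler', 'Donna Noble',
             'Amy Pond', 'Clara Oswald'],
    'age': [28, 22, 35, 29, 31],
})

wanted = ['Martha Jones', 'Rose Tyler', 'Amy Pond']

# Rows whose name appears in the list.
selected = df[df['name'].isin(wanted)]
print(selected['name'].tolist())
# ['Martha Jones', 'Rose Tyler', 'Amy Pond']

# ~ flips each True/False value, selecting everyone else.
others = df[~df['name'].isin(wanted)]
print(others['name'].tolist())
# ['Donna Noble', 'Clara Oswald']
```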

Select Rows with Logic III

When we select a subset of a DataFrame using logic, we end up with non-consecutive indices.  This is inelegant, and because a row's index label no longer matches its position, it is easy to pick the wrong row with `.iloc`.

We can fix this using the method `.reset_index()`.  For example, here is a DataFrame called `df` with non-consecutive indices:

<div class="narrative-table-container"> <div class="narrative-table-scroll">

|          | First Name | Last Name|
|--------|-----------------|---------------|
|0        | John              | Smith        |
|4        | Jane               | Doe          |
| 7       | Joe                  | Schmo      |

</div></div>
<br>

If we use the command `df.reset_index()`, we get a new DataFrame with a new set of indices:

<div class="narrative-table-container"> <div class="narrative-table-scroll">

| | index| First Name | Last Name|
|-|--------|-----------------|---------------|
|0|0        | John              | Smith        |
|1|4        | Jane               | Doe          |
|2| 7       | Joe                  | Schmo      |

</div></div>
<br>

Note that the old indices have been moved into a new column called `'index'`.  Unless you need those values for something special, it's probably better to use the keyword `drop=True` so that you don't end up with that extra column.  If we run the command `df.reset_index(drop=True)`, we get a new DataFrame that looks like this:

<div class="narrative-table-container"> <div class="narrative-table-scroll">

|          | First Name | Last Name|
|--------|-----------------|---------------|
|0        | John              | Smith        |
|1        | Jane               | Doe          |
| 2       | Joe                  | Schmo      |

</div></div>
<br>

Using `.reset_index()` will return a new DataFrame, but we usually just want to modify our existing DataFrame. If we use the keyword `inplace=True` we can just modify our existing DataFrame.
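
The three variations can be compared side by side in the sketch below, which rebuilds the example DataFrame with the indices 0, 4, and 7. The variable names `with_old_index` and `clean` are made up for illustration.

```py
import pandas as pd

# The example DataFrame with non-consecutive indices.
df = pd.DataFrame({
    'First Name': ['John', 'Jane', 'Joe'],
    'Last Name': ['Smith', 'Doe', 'Schmo'],
}, index=[0, 4, 7])

# New DataFrame; old indices are kept in a column called 'index'.
with_old_index = df.reset_index()
print(list(with_old_index.columns))  # ['index', 'First Name', 'Last Name']

# New DataFrame; old indices are discarded.
clean = df.reset_index(drop=True)
print(list(clean.index))             # [0, 1, 2]

# df itself still has the old indices at this point.
print(list(df.index))                # [0, 4, 7]

# Modify df directly instead of creating a new DataFrame.
df.reset_index(drop=True, inplace=True)
print(list(df.index))                # [0, 1, 2]
```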




Setting indices

You've completed the lesson!  You've just learned the basics of working with a single table in Pandas, including:
- Creating a table from scratch
- Loading data from another file
- Selecting certain rows or columns of a table

Let's practice what you've learned.


Analyze the A/B test data from ShoeFly.com using aggregate measures.

Examine the first few rows of `ad_clicks`.

Your manager wants to know which ad platform is getting you the most views.

How many views (i.e., rows of the table) came from each `utm_source`?

If the column `ad_click_timestamp` is not null, then someone actually clicked on the ad that was displayed.

Create a new column called `is_click`, which is `True` if `ad_click_timestamp` is not null and `False` otherwise.

We want to know the percent of people who clicked on ads from each `utm_source`.

Start by grouping by `utm_source` and `is_click` and counting the number of `user_id`'s in each of those groups.  Save your answer to the variable `clicks_by_source`.

Now let's pivot the data so that the columns are `is_click` (either `True` or `False`), the index is `utm_source`, and the values are `user_id`.

Save your results to the variable `clicks_pivot`.

Create a new column in `clicks_pivot` called `percent_clicked` which is equal to the percent of users who clicked on the ad from each `utm_source`.

Was there a difference in click rates for each source?

The column `experimental_group` tells us whether the user was shown Ad A or Ad B.

Were approximately the same number of people shown both ads?

Using the column `is_click` that we defined earlier, check to see if a greater percentage of users clicked on Ad A or Ad B.

The Product Manager for the A/B test thinks that the clicks might have changed by day of the week.

Start by creating two DataFrames: `a_clicks` and `b_clicks`, which contain only the results for `A` group and `B` group, respectively.

For each group (`a_clicks` and `b_clicks`), calculate the percent of users who clicked on the ad by `day`.

Compare the results for `A` and `B`.  What happened over the course of the week?

Do you recommend that your company use Ad A or Ad B?

A/B Testing for ShoeFly.com

\begin{array}{|l|l|}
\textbf{product} & \textbf{price} \\
\text{baseball} & 0.50 \\
\text{football} & 2.10 \\
\text{basketball} & 2.50 \\
\text{helmet} & 6.25 \\
\text{cleats} & 3.50 \\
\text{hockey stick} & 12.00
\end{array}

```py
customer_purchases.name.nunique()
```

```py
customer_purchases.name.unique()
```

```py
customer_purchases.name.count()
```

```py
movie_ratings.groupby('movie').rating.mean()
```

```py
movie_ratings.rating.groupby('movie').mean()
```

```py
movie_ratings.movie.groupby('rating').mean()
```

```py
movie_ratings.groupby('rating').movie.mean()
```

```py
movie_ratings.groupby('movie', 'critic').rating.mean()
```

movie_ratings.groupby('critic').rating.max().reset_index()

```py
movie_review_pivot = movie_ratings.pivot(
    columns = 'movie',
    index = 'critic',
    values ='rating')
```

```py
movie_review_pivot = movie_ratings.pivot(
    columns = 'critic',
    index = 'movie',
    values = 'rating')
```

```py
movie_review_pivot = movie_ratings.pivot(
    columns = 'movie',
    index = 'rating',
    values = 'critic')
```

```py
movie_review_pivot = movie_ratings.pivot(
    columns = 'rating',
    index = 'movie',
    values = 'critic')
```

\begin{array}{l}
\textbf{Initial Table:} \\
\begin{array}{l|l|l}
\textbf{critic} & \textbf{movie} & \textbf{rating} \\
\text{Korey} & \text{Movie A} & 94 \\
\text{Korey} & \text{Movie B} & 88 \\
\text{Korey} & \text{Movie C} & 60 \\
\text{Goodwin} & \text{Movie A} & 75 \\
\text{Goodwin} & \text{Movie B} & 96 \\
\text{Goodwin} & \text{Movie C} & 70 \\
\text{Martin} & \text{Movie A} & 80 \\
\text{Martin} & \text{Movie B} & 86 \\
\text{Martin} & \text{Movie C} & 72 \\
\text{Patrick} & \text{Movie A} & 84 \\
\text{Patrick} & \text{Movie B} & 94 \\
\text{Patrick} & \text{Movie C} & 65 \\
\end{array} \\
\\
\textbf{Final Table:} \\
\begin{array}{l|l|l|l}
\textbf{critic} & \textbf{Movie A} & \textbf{Movie B} & \textbf{Movie C} \\
\text{Goodwin} & 75 & 96 & 70 \\ 
\text{Korey} & 94 & 88 & 60 \\ 
\text{Martin} & 80 & 86 & 72 \\
\text{Patrick} & 84 & 94 & 65 \\
\end{array} \\
\end{array}

```py
checkouts.groupby(['location']).book_title.count().reset_index()
```

```py
checkouts.groupby(['book_title'])['location'].count().reset_index()
```

```py 
checkouts.groupby(['location', 'book_title'].count().reset_index() 
```

```py
checkouts.groupby(['location'].count('book_title').reset_index() 
```

Practice what you've learned about aggregates with Pandas in this multiple choice quiz!

Aggregates in Pandas

Learn the basics of aggregate functions in Pandas.

In this lesson, you will learn about *aggregates* in Pandas.  An *aggregate* statistic is a way of creating a single number that describes a group of numbers. Common aggregate statistics include mean, median, and standard deviation.

You will also learn how to rearrange a DataFrame into a *pivot table*, which is a great way to compare data across two dimensions.

Introduction

Aggregate functions summarize many data points (i.e., a column of a DataFrame) into a smaller set of values.

Some examples of this type of calculation include:

* The DataFrame `customers` contains the names and ages of all of your customers.  You want to find the median age:
```py
print(customers.age)
# >> [23, 25, 31, 35, 35, 46, 62]
print(customers.age.median())
# >> 35
```
* The DataFrame `shipments` contains address information for all shipments that you've sent out in the past year.  You want to know how many different states you have shipped to (and how many shipments went to the same state).
```py
print(shipments.state)
# >> ['CA', 'CA', 'CA', 'CA', 'NY', 'NY', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ']
print(shipments.state.nunique())
# >> 3
```
* The DataFrame `inventory` contains a list of types of t-shirts that your company makes.  You want a list of the colors that your shirts come in.
```py
print(inventory.color)
# >> ['blue', 'blue', 'blue', 'blue', 'blue', 'green', 'green', 'orange', 'orange', 'orange']
print(inventory.color.unique())
# >> ['blue', 'green', 'orange']
```

The general syntax for these calculations is:
```py
df.column_name.command()
```

The following table summarizes some common commands:

|Command|Description|
|---------------|---------------|
|`mean`|Average of all values in column|
|`std`|Standard deviation|
|`median`| Median |
|`max`| Maximum value in column|
|`min`| Minimum value in column|
|`count`| Number of values in column|
|`nunique`| Number of unique values in column|
|`unique`|List of unique values in column|
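To see several of these commands side by side, here is a small sketch. The two DataFrames are made up for illustration, with values that echo the examples above, and the variable names on the left are hypothetical.

```py
import pandas as pd

# Small illustrative DataFrames
customers = pd.DataFrame({'age': [23, 25, 31, 35, 35, 46, 62]})
inventory = pd.DataFrame(
    {'color': ['blue', 'blue', 'green', 'orange', 'orange']})

mean_age = customers.age.mean()        # average of the column
oldest = customers.age.max()           # largest value
n_ages = customers.age.count()         # number of non-null values
n_colors = inventory.color.nunique()   # number of distinct values
colors = inventory.color.unique()      # array of the distinct values

print(mean_age, oldest, n_ages, n_colors, colors)
```

Each call follows the same `df.column_name.command()` pattern; only the command changes.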

Calculating Column Statistics

When we have a bunch of data, we often want to calculate aggregate statistics (mean, standard deviation, median, percentiles, etc.) over certain subsets of the data.

Suppose we have a grade book with columns `student`, `assignment_name`, and `grade`.  The first few lines look like this:

<div class="narrative-table-container">

|student|assignment_name|grade|
|-|-|-|
|Amy|Assignment 1|75|
|Amy|Assignment 2|35|
|Bob|Assignment 1|99|
|Bob|Assignment 2| 35|
|...|
</div>

We want to get an average grade for each student across all assignments.  We could do some sort of loop, but Pandas gives us a much easier option: the method `.groupby`.

For this example, we'd use the following command:
```py
grades = df.groupby('student').grade.mean()
```
The output might look something like this:

<div class="narrative-table-container">

|student|grade|
|-|-|
|Amy|80|
|Bob|90|
|Chris|75|
|...|
</div>

In general, we use the following syntax to calculate aggregates:
```py
df.groupby('column1').column2.measurement()
```
where:
* `column1` is the column that we want to group by (`'student'` in our example)
* `column2` is the column that we want to perform a measurement on (`grade` in our example)
* `measurement` is the measurement function we want to apply (`mean` in our example)

For more on the groupby method, [review the pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby).
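The grade book example can be sketched end to end as follows. The four rows are the ones shown in the table above; a real grade book would have more rows, so the averages here differ from the sample output.

```py
import pandas as pd

# The first few grade book rows from the table above
df = pd.DataFrame({
    'student': ['Amy', 'Amy', 'Bob', 'Bob'],
    'assignment_name': ['Assignment 1', 'Assignment 2',
                        'Assignment 1', 'Assignment 2'],
    'grade': [75, 35, 99, 35]})

# One average per student; the result is indexed by student name
grades = df.groupby('student').grade.mean()
print(grades)
```

With only these four rows, Amy's average is 55 and Bob's is 67.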

Calculating Aggregate Functions I

After using `groupby`, we often need to clean our resulting data.

As we saw in the previous exercise, a `groupby` followed by an aggregate on a single column produces a new Series, not a DataFrame.  For our ShoeFly.com example, the indices of the Series were different values of `shoe_type`, and the name property was `price`.

Usually, we'd prefer that those indices were actually a column.  In order to get that, we can use `reset_index()`. This will transform our Series into a DataFrame and move the indices into their own column.

You'll often see a `groupby` statement followed by `reset_index`:
```py
df.groupby('column1').column2.measurement()\
    .reset_index()
```

When we use groupby, we often want to rename the column we get as a result. For example, suppose we have a DataFrame `teas` containing data on types of tea:

<div class="narrative-table-container">

|id|tea|category|caffeine|price|
|-|-|-|-|-|
|0|earl grey|black|38|3|
|1|english breakfast|black|41|3|
|2|irish breakfast|black|37|2.5|
|3|jasmine|green|23|4.5|
|4|matcha|green|48|5|
|5|camomile|herbal|0|3|
|...||||||

</div>

We want to find the number of each `category` of tea we sell. We can use:

```py
teas_counts = teas.groupby('category').id.count().reset_index()
```

This yields a DataFrame that looks like:

<div class="narrative-table-container">

| |category|id|
|-|-|-|
|0|black|3|
|1|green|4|
|2|herbal|8|
|3|white|2|
|...| | |
</div>

The new column contains the counts of each category of tea sold. We have 3 black teas, 4 green teas, and so on. However, this column is called `id` because we used the `id` column of `teas` to calculate the counts. We actually want to call this column `counts`. Remember that we can rename columns:

```py
teas_counts = teas_counts.rename(columns={"id": "counts"})
```

Our DataFrame now looks like:

<div class="narrative-table-container">

| |category|counts|
|-|-|-|
|0|black|3|
|1|green|4|
|2|herbal|8|
|3|white|2|
|...| | |
</div>
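Putting the count and the rename together, a minimal sketch might look like this. It uses only the six teas listed in the first table, so the counts are smaller than in the sample output above.

```py
import pandas as pd

# The six rows shown in the teas table above
teas = pd.DataFrame({
    'id': [0, 1, 2, 3, 4, 5],
    'tea': ['earl grey', 'english breakfast', 'irish breakfast',
            'jasmine', 'matcha', 'camomile'],
    'category': ['black', 'black', 'black', 'green', 'green', 'herbal'],
    'caffeine': [38, 41, 37, 23, 48, 0],
    'price': [3, 3, 2.5, 4.5, 5, 3]})

# Count the ids in each category, then turn the index into a column
teas_counts = teas.groupby('category').id.count().reset_index()

# The count column inherits the name 'id', so give it a clearer name
teas_counts = teas_counts.rename(columns={'id': 'counts'})
print(teas_counts)
```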

Calculating Aggregate Functions II

Sometimes, the operation that you want to perform is more complicated than `mean` or `count`.  In those cases, you can use the `apply` method and lambda functions, just like we did for individual column operations.  Note that the input to our lambda function will be a Series containing all of the values for one group, not a single value.

A great example of this is calculating percentiles.  Suppose we have a DataFrame of employee information called `df` that has the following columns:
- `id`: the employee's id number
- `name`: the employee's name
- `wage`: the employee's hourly wage
- `category`: the type of work that the employee does

Our data might look something like this:
<div class="narrative-table-container">

|id|name|wage|category|
|-|-|-|-|
|10131|Sarah Carney|39|product|
|14189|Heather Carey|17|design|
|15004|Gary Mercado|33|marketing|
|11204|Cora Copaz|27|design|
|...|

</div>

If we want to calculate the 75th percentile (i.e., the point at which 75% of employees have a lower wage and 25% have a higher wage) for each `category`, we can use the following combination of `apply` and a lambda function:

```py
import numpy as np

# np.percentile can calculate any percentile over an array of values
high_earners = df.groupby('category').wage\
    .apply(lambda x: np.percentile(x, 75))\
    .reset_index()
```

The output, `high_earners`, might look like this:

<div class="narrative-table-container">

| |category|wage|
|-|-|-|
|0|design|23|
|1|marketing|35|
|2|product|48|
|...| | |

</div>
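Here is a runnable sketch of the percentile calculation, using only the four employees listed in the table above, so the numbers differ from the sample output. The `same_result` line is an addition for comparison: it assumes the built-in `quantile` method with its default settings, which takes a fraction instead of a percentage.

```py
import numpy as np
import pandas as pd

# The four employee rows shown in the table above
df = pd.DataFrame({
    'id': [10131, 14189, 15004, 11204],
    'name': ['Sarah Carney', 'Heather Carey', 'Gary Mercado', 'Cora Copaz'],
    'wage': [39, 17, 33, 27],
    'category': ['product', 'design', 'marketing', 'design']})

# The lambda receives the wages of one category at a time
high_earners = df.groupby('category').wage\
    .apply(lambda x: np.percentile(x, 75))\
    .reset_index()
print(high_earners)

# quantile() takes a fraction (0.75) rather than a percentage (75)
same_result = df.groupby('category').wage.quantile(0.75).reset_index()
print(same_result)
```

For the two design wages (17 and 27), the 75th percentile is interpolated between them, giving 24.5.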



Calculating Aggregate Functions III

Sometimes, we want to group by more than one column.  We can easily do this by passing a list of column names into the `groupby` method.

Imagine that we run a chain of stores and have data about the number of sales at different locations on different days:

<div class="narrative-table-container"> 

|Location | Date | Day of Week | Total Sales|
|-|-|-|-|
| West Village | February 1 | W | 400 |
|West Village| February 2| Th| 450|
|Chelsea | February 1 | W | 375|
|Chelsea | February 2| Th | 390|
</div>

We suspect that sales are different at different locations on different days of the week.  In order to test this hypothesis, we could calculate the average sales for each store on each day of the week across multiple months.  The code would look like this:
```py
df.groupby(['Location', 'Day of Week'])['Total Sales'].mean().reset_index()
```
The results might look something like this:

<div class="narrative-table-container">

|Location|Day of Week|Total Sales|
|-|-|-|
|Chelsea|M|402.50|
|Chelsea|Tu|422.75|
|Chelsea|W|452.00|
|...|
|West Village|M|390|
|West Village|Tu|400|
|...|
</div>
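A small sketch of grouping by two columns is shown below. The first four rows come from the table above; the last two rows (a second week of Wednesday sales) are invented so that each average covers more than one value.

```py
import pandas as pd

df = pd.DataFrame({
    'Location': ['West Village', 'West Village', 'Chelsea', 'Chelsea',
                 'West Village', 'Chelsea'],
    'Date': ['February 1', 'February 2', 'February 1', 'February 2',
             'February 8', 'February 8'],   # February 8 rows are invented
    'Day of Week': ['W', 'Th', 'W', 'Th', 'W', 'W'],
    'Total Sales': [400, 450, 375, 390, 420, 385]})

# One row per (Location, Day of Week) pair, with the mean of Total Sales
avg_sales = df.groupby(['Location', 'Day of Week'])['Total Sales']\
    .mean().reset_index()
print(avg_sales)
```

Bracket notation (`['Total Sales']`) is needed here because the column name contains a space.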

Calculating Aggregate Functions IV

When we perform a `groupby` across multiple columns, we often want to change how our data is stored.  For instance, recall the example where we are running a chain of stores and have data about the number of sales at different locations on different days:

<div class="narrative-table-container">

|Location | Date | Day of Week | Total Sales|
|-|-|-|-|
| West Village | February 1 | W | 400 |
|West Village| February 2| Th| 450|
|Chelsea | February 1 | W | 375|
|Chelsea | February 2| Th | 390|
</div>
We suspected that there might be different sales on different days of the week at different stores, so we performed a `groupby` across two different columns (`Location` and `Day of Week`).  This gave us results that looked like this:

<div class="narrative-table-container"> 

|Location | Day of Week | Total Sales |
|-|-|-|
| Chelsea | M          | 300                    |
| Chelsea | Tu          | 310                    |
| Chelsea | W   | 375                    |
| Chelsea | Th        | 390                    |
|... | | |
| West Village | Th| 450                    |
| West Village | F      | 390                    |
| West Village | Sa | 250                    |
|...| | | |
</div>
In order to test our hypothesis, it would be more useful if the table was formatted like this:

<div class="narrative-table-container">

|Location | M | Tu | W | Th | F | Sa | Su |
|-|-|-|-|-|-|-|-|
| Chelsea | 300 | 310 | 375 | 390 | 300 | 150| 175 |
| West Village | 300 | 310 |400 | 450 | 390 | 250 | 200|
|...| | | | | | | | |

</div>

Reorganizing a table in this way is called **pivoting**.  The new table is called a **pivot table**.

In Pandas, the command for pivot is:
```py
df.pivot(columns='ColumnToPivot',
         index='ColumnToBeRows',
         values='ColumnToBeValues')
```

For our specific example, we would write the command like this:
```py
# First use the groupby statement:
unpivoted = df.groupby(['Location', 'Day of Week'])['Total Sales'].mean().reset_index()
# Now pivot the table
pivoted = unpivoted.pivot(
    columns='Day of Week',
    index='Location',
    values='Total Sales')
```

Just like with `groupby`, the output of a pivot command is a new DataFrame, but the column passed as `index` (here, `Location`) becomes the index rather than a regular column, so we usually follow up with **`.reset_index()`**.

For more on the pivot function, [see the pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html). 
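The following sketch shows the effect of `pivot` and of the follow-up `reset_index` on a tiny table. The four rows are taken from the sales tables above, and `pivoted_flat` is a hypothetical name.

```py
import pandas as pd

# Already-grouped data: one row per (Location, Day of Week) pair
unpivoted = pd.DataFrame({
    'Location': ['Chelsea', 'Chelsea', 'West Village', 'West Village'],
    'Day of Week': ['W', 'Th', 'W', 'Th'],
    'Total Sales': [375, 390, 400, 450]})

pivoted = unpivoted.pivot(
    columns='Day of Week',
    index='Location',
    values='Total Sales')
# 'Location' is now the index; the days of the week are the columns
print(pivoted)

pivoted_flat = pivoted.reset_index()
# 'Location' is an ordinary column again
print(pivoted_flat)
```

Note that `pivot` expects each index/column pair to appear only once, which is why the `groupby` step comes first.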

Pivot Tables

This lesson introduced you to aggregates in Pandas.  You learned:
* How to calculate *aggregate* statistics over groups of rows that share the same value using `groupby`.
* How to rearrange a DataFrame into a *pivot table*, a great way to compare data across two dimensions.

Review 

A merge where only matching rows are included.

A merge where all rows are included, whether they match or not.

A merge where all rows from the first DataFrame are included, but only matching rows from the second DataFrame.

A merge where all rows from the second DataFrame are included, but only matching rows from the first DataFrame.

```py
merged_df = pd.merge(
    df_one,
    df_two,
    how = 'outer'
)
```

```py
merged_df = pd.merge(
    df_one,
    df_two,
    type = 'outer'
)
```

```py
merged_df = pd.outermerge(
    df_one,
    df_two
)
```

```py
merged_df = pd.merge.outer(
    df_one,
    df_two
)
```

```py
pets_owners = pd.merge(
    pets,
    owners.rename(columns = {'id':'owner_id'})
)
```

```py
pets_owners = pd.merge(
    pets,
    owners
)
```

```py
pets_owners = pd.merge(
    pets.rename(columns = {'owner_id':'id'}),
    owners
)
```

```py
pets_owners = pd.merge(
    pets,
    owners,
    how = 'inner'
)
```

```py
scheduled_appointments = pd.merge(
    vets.rename(columns = {'id':'doctor-id'}),
    appointments,
    how = 'inner'
)
```

```py
scheduled_appointments = pd.merge(
    vets.rename(columns = {'id':'doctor-id'}),
    appointments,
    how = 'outer'
)
```

```py
scheduled_appointments = pd.merge(
    vets.rename(columns = {'id':'doctor-id'}),
    appointments,
    how = 'right'
)
```

```py
scheduled_appointments = pd.merge(
    vets.rename(columns = {'id':'doctor-id'}),
    appointments,
    how = 'left'
)
```

\begin{array}{l}
\textbf{appointments} \\
\begin{array}{l|l|l|l|l}
\textbf{id} & \textbf{doctor-id} & \textbf{pet-id} & \textbf{date} & \textbf{treatment} \\
1 & 1 & 6 & \text{11/13/17} & \text{cushing test} \\
2 & 1 & 2 & \text{11/14/17} & \text{allergy test} \\
3 & 2 & 1 & \text{11/13/17} & \text{check up} \\
4 & & 3 & \text{11/15/17} & \text{x-rays} \\
5 & & 4 & \text{11/17/17} & \text{check up} \\
6 & 4 & 5 & \text{11/16/17} & \text{blood work} \\
\end{array} \\
\\
\textbf{vets} \\
\begin{array}{l|l|l|l}
\textbf{id} & \textbf{first-name} & \textbf{last-name} & \textbf{specialty} \\
1 & \text{Susan} & \text{Hitt} & \text{Dogs} \\
2 & \text{Gregory} & \text{Lofton} & \text{Dogs} \\
3 & \text{Laura} & \text{Mulberry} & \text{Cats} \\
4 & \text{Ellen} & \text{Dory} & \text{Fish} \\
\end{array}
\end{array} 


```py
appointments_all = pd.concat([greg_appointments, susan_appointments])
```

```py
appointments_all = pd.merge(
    greg_appointments, 
    susan_appointments)
```

```py
appointments_all = pd.merge(
    greg_appointments, 
    susan_appointments
    how = 'outer')
```

```py
appointments_all = pd.merge(
    greg_appointments, 
    susan_appointments
    how = 'concatenation')
```

Learn Pandas: Multiple Tables Quiz

Investigate a sales funnel using Pandas merges.

Inspect the DataFrames using `print` and `head`:
- `visits` lists all of the users who have visited the website
- `cart` lists all of the users who have added a t-shirt to their cart
- `checkout` lists all of the users who have started the checkout
- `purchase` lists all of the users who have purchased a t-shirt

Combine `visits` and `cart` using a *left merge*.

How many of the timestamps are `null` for the column `cart_time`?

What do these null rows mean?

What percent of users who visited Cool T-Shirts Inc. ended up *not* placing a t-shirt in their cart?

**Note:** To calculate percentages, divide the count by the total and multiply by 100. In Python 3, the `/` operator always returns a float, even when both operands are integers. In Python 2, `/` on two integers truncates the decimal part, so there you would first convert the numerator or the denominator with `float()`.

Repeat the left merge for `cart` and `checkout` and count `null` values.  What percentage of users put items in their cart, but did not proceed to checkout?

Merge all four steps of the funnel, in order, using a series of *left merges*.  Save the results to the variable `all_data`.

Examine the result using `print` and `head`.

What percentage of users proceeded to checkout, but did not purchase a t-shirt?

Which step of the funnel is weakest (i.e., has the highest percentage of users not completing it)?

How might Cool T-Shirts Inc. change their website to fix this problem?

Using the giant merged DataFrame `all_data` that you created, let's calculate the average time from initial visit to final purchase.  Add a column that is the difference between `purchase_time` and `visit_time`.

Examine the results by printing the new column to the screen.


Calculate the average time to purchase by applying the `.mean()` function to your new column.


Page Visits Funnel

In this lesson, you'll learn how to combine information from multiple DataFrames.

In order to efficiently store data, we often spread related information across multiple tables.

For instance, imagine that we own an e-commerce business and we want to track the products that have been ordered from our website.

We could have one table with all of the following information:
- `order_id`
- `customer_id`
- `customer_name`
- `customer_address`
- `customer_phone_number`
- `product_id`
- `product_description`
- `product_price`
- `quantity`
- `timestamp`

However, a lot of this information would be repeated.  If the same customer makes multiple orders, that customer's name, address, and phone number will be reported multiple times.  If the same product is ordered by multiple customers, then the product price and description will be repeated.  This will make our orders table big and unmanageable.

So instead, we can split our data into three tables:

* `orders` would contain the information necessary to describe an order: `order_id`, `customer_id`, `product_id`, `quantity`, and `timestamp`
* `products` would contain the information to describe each product: `product_id`, `product_description` and `product_price`
* `customers` would contain the information for each customer: `customer_id`, `customer_name`, `customer_address`, and `customer_phone_number`

In this lesson, we will learn the Pandas commands that help us work with data stored in multiple tables.


Introduction: Multiple DataFrames

Suppose we have the following three tables that describe our eCommerce business:
- `orders` &mdash; a table with information on each transaction: 

<div class="narrative-table-container">

<table>
<tr><th>order_id</th><th>customer_id</th><th>product_id</th><th>quantity</th><th>timestamp</th></tr>
<tr><td>1</td><td>2</td><td>3</td><td>1</td><td>2017-01-01</td></tr>
<tr><td>2</td><td>2</td><td>2</td><td>3</td><td>2017-01-01</td></tr>
<tr><td>3</td><td>3</td><td>1</td><td>1</td><td>2017-01-01</td></tr>
<tr><td>4</td><td>3</td><td>2</td><td>2</td><td>2017-02-01</td></tr>
<tr><td>5</td><td>3</td><td>3</td><td>3</td><td>2017-02-01</td></tr>
<tr><td>6</td><td>1</td><td>4</td><td>2</td><td>2017-03-01</td></tr>
<tr><td>7</td><td>1</td><td>1</td><td>1</td><td>2017-02-02</td></tr>
<tr><td>8</td><td>1</td><td>4</td><td>1</td><td>2017-02-02</td></tr>
</table>

</div>
<br>

- `products` &mdash; a table with product IDs, descriptions, and prices:

<div class="narrative-table-container">

<table>
<tr><th>product_id</th><th>description</th><th>price</th></tr>
<tr><td>1</td><td>thing-a-ma-jig</td><td>5</td></tr>
<tr><td>2</td><td>whatcha-ma-call-it</td><td>10</td></tr>
<tr><td>3</td><td>doo-hickey</td><td>7</td></tr>
<tr><td>4</td><td>gizmo</td><td>3</td></tr>
</table>

</div>
<br>

- `customers` &mdash; a table with customer names and contact information:

<div class="narrative-table-container">

<table>
<tr><th>customer_id</th><th>customer_name</th><th>address</th><th>phone_number</th></tr>
<tr><td>1</td><td>John Smith</td><td>123 Main St.</td><td>212-123-4567</td></tr>
<tr><td>2</td><td>Jane Doe</td><td>456 Park Ave.</td><td>949-867-5309</td></tr>
<tr><td>3</td><td>Joe Schmo</td><td>798 Broadway</td><td>112-358-1321</td></tr>
</table>

</div>
<br>

If we just look at the `orders` table, we can't really tell what's happened in each order.  However, if we refer to the other tables, we can get a more complete picture.

Let's examine the order with an `order_id` of `1`.  It was purchased by Customer 2.  To find out the customer's name, we look at the `customers` table and look for the item with a `customer_id` value of `2`.  We can see that Customer 2's name is Jane Doe and that she lives at 456 Park Ave. 

Doing this kind of matching is called **merging** two DataFrames.


Inner Merge I

It is easy to do this kind of matching for one row, but hard to do it for multiple rows. 

Luckily, Pandas can efficiently do this for the entire table. We use the `pd.merge()` function.

`pd.merge()` looks for columns that are common between two DataFrames and then looks for rows where those columns' values are the same.  It then combines the matching rows into a single row in a new table.

We can call `pd.merge()` with two tables like this:
```py
new_df = pd.merge(orders, customers)
```
This will match up all of the customer information to the orders that each customer made.
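As a minimal sketch, the example below merges a few rows of the `orders` and `customers` tables from earlier in the lesson. Only some rows and columns are included to keep it short.

```py
import pandas as pd

# A subset of the orders and customers tables shown earlier
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 6],
    'customer_id': [2, 2, 3, 1],
    'product_id': [3, 2, 1, 4],
    'quantity': [1, 3, 1, 2]})

customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'customer_name': ['John Smith', 'Jane Doe', 'Joe Schmo']})

# customer_id is the only column the two tables share,
# so rows are matched on it
new_df = pd.merge(orders, customers)
print(new_df)
```

Each order row now carries the name of the customer who placed it; for example, order 1 is matched with Jane Doe.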

Inner Merge II

In addition to using `pd.merge()`, each DataFrame has its own `.merge()` method.  For instance, if you wanted to merge `orders` with `customers`, you could use:
```py
new_df = orders.merge(customers)
```
This produces the same DataFrame as if we had called `pd.merge(orders, customers)`.

We generally use this when we are joining more than two DataFrames together because we can "chain" the commands.  The following command would merge `orders` to `customers`, and then the resulting DataFrame to `products`:
```py
big_df = orders.merge(customers)\
    .merge(products)
```
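A runnable version of the chained merge might look like the sketch below, again using a shortened subset of the three tables from earlier in the lesson.

```py
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [2, 2, 3],
    'product_id': [3, 2, 1],
    'quantity': [1, 3, 1]})

customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'customer_name': ['John Smith', 'Jane Doe', 'Joe Schmo']})

products = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'description': ['thing-a-ma-jig', 'whatcha-ma-call-it',
                    'doo-hickey', 'gizmo'],
    'price': [5, 10, 7, 3]})

# The first merge matches on customer_id, the second on product_id
big_df = orders.merge(customers)\
    .merge(products)
print(big_df)
```

The result has one row per order, with both the customer's name and the product's description and price attached.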

Inner Merge III

In the previous example, the `.merge()` function "knew" how to combine tables based on the columns that were the same between two tables.  For instance, `products` and `orders` both had a column called `product_id`.  This won't always be true when we want to perform a merge.

Generally, the `products` and `customers` DataFrames would not have the columns `product_id` or `customer_id`.  Instead, they would both be called `id` and it would be implied that the id was the `product_id` for the `products` table and `customer_id` for the `customers` table.  They would look like this:

#### Customers
<div class="narrative-table-container"> <div class="narrative-table-scroll">
<table>
<tr><th>id</th><th>customer_name</th><th>address</th><th>phone_number</th></tr>
<tr><td>1</td><td>John Smith</td><td>123 Main St.</td><td>212-123-4567</td></tr>
<tr><td>2</td><td>Jane Doe</td><td>456 Park Ave.</td><td>949-867-5309</td></tr>
<tr><td>3</td><td>Joe Schmo</td><td>798 Broadway</td><td>112-358-1321</td></tr>
</table>
</div></div>

#### Products
<div class="narrative-table-container"> <div class="narrative-table-scroll">
<table>
<tr><th>id</th><th>description</th><th>price</th></tr>
<tr><td>1</td><td>thing-a-ma-jig</td><td>5</td></tr>
<tr><td>2</td><td>whatcha-ma-call-it</td><td>10</td></tr>
<tr><td>3</td><td>doo-hickey</td><td>7</td></tr>
<tr><td>4</td><td>gizmo</td><td>3</td></tr>
</table>
</div></div>

<br>
**How would this affect our merges?**

Because the `id` columns would mean something different in each table, our default merges would be wrong.  

One way that we could address this problem is to use `.rename()` to rename the columns for our merges.  In the example below, we will rename the column `id` to `customer_id`, so that `orders` and `customers` have a common column for the merge.
```py
pd.merge(
    orders,
    customers.rename(columns={'id': 'customer_id'}))
```
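To try the rename approach on concrete data, here is a short sketch. It assumes an `orders` table with a `customer_id` column and a `customers` table whose key column is simply `id`, as described above; the variable name `merged` is illustrative.

```py
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [2, 2, 3],
    'quantity': [1, 3, 1]})

# Here the key column is just called 'id'
customers = pd.DataFrame({
    'id': [1, 2, 3],
    'customer_name': ['John Smith', 'Jane Doe', 'Joe Schmo']})

# rename() returns a renamed copy, so customers itself keeps its 'id' column
merged = pd.merge(
    orders,
    customers.rename(columns={'id': 'customer_id'}))
print(merged)
```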

Merge on Specific Columns

In the previous exercise, we learned how to use `.rename()` to merge two DataFrames whose columns don't match.

If we don't want to do that, we have another option.  We could use the keywords `left_on` and `right_on` to specify which columns we want to perform the merge on.  In the example below, the "left" table is the one that comes first (`orders`), and the "right" table is the one that comes second (`customers`).  This syntax says that we should match the `customer_id` from orders to the `id` in customers.
```py
pd.merge(
    orders,
    customers,
    left_on='customer_id',
    right_on='id')
```

If we use this syntax and both tables have a column called `id`, the merged table would end up with two columns of that name, one from the first table and one from the second.  To tell them apart, Pandas appends a suffix to each name, changing them to `id_x` and `id_y`.

It will look like this:

<div class="narrative-table-container"> <div class="narrative-table-scroll">

|id_x|customer_id|product_id|quantity|timestamp|id_y|customer_name|address|phone_number|
|-|-|-|-|-|-|-|-|-|
|1|2|3|1|2017-01-01 00:00:00|2|Jane Doe|456 Park Ave.|949-867-5309|
|2|2|2|3|2017-01-01 00:00:00|2|Jane Doe|456 Park Ave.|949-867-5309|
|3|3|1|1|2017-01-01 00:00:00|3|Joe Schmo|798 Broadway|112-358-1321|
|4|3|2|2|2017-02-01 00:00:00|3|Joe Schmo|798 Broadway|112-358-1321|
|5|3|3|3|2017-02-01 00:00:00|3|Joe Schmo|798 Broadway|112-358-1321|
|6|1|4|2|2017-03-01 00:00:00|1|John Smith|123 Main St.|212-123-4567|
|7|1|1|1|2017-02-02 00:00:00|1|John Smith|123 Main St.|212-123-4567|
|8|1|4|1|2017-02-02 00:00:00|1|John Smith|123 Main St.|212-123-4567|


</div></div>

The new column names `id_x` and `id_y` aren't very helpful for us when we read the table. We can help make them more useful by using the keyword `suffixes`. We can provide a list of suffixes to use instead of "_x" and "_y".

For example, we could use the following code to make the suffixes reflect the table names:
```py
pd.merge(
    orders,
    customers,
    left_on='customer_id',
    right_on='id',
    suffixes=['_order', '_customer']
)
```
The resulting table would look like this:

<div class="narrative-table-container"> <div class="narrative-table-scroll">

|id_order|customer_id|product_id|quantity|timestamp|id_customer|customer_name|address|phone_number|
|-|-|-|-|-|-|-|-|-|
|1|2|3|1|2017-01-01 00:00:00|2|Jane Doe|456 Park Ave.|949-867-5309|
|2|2|2|3|2017-01-01 00:00:00|2|Jane Doe|456 Park Ave.|949-867-5309|
|3|3|1|1|2017-01-01 00:00:00|3|Joe Schmo|798 Broadway|112-358-1321|
|4|3|2|2|2017-02-01 00:00:00|3|Joe Schmo|798 Broadway|112-358-1321|
|5|3|3|3|2017-02-01 00:00:00|3|Joe Schmo|798 Broadway|112-358-1321|
|6|1|4|2|2017-03-01 00:00:00|1|John Smith|123 Main St.|212-123-4567|
|7|1|1|1|2017-02-02 00:00:00|1|John Smith|123 Main St.|212-123-4567|
|8|1|4|1|2017-02-02 00:00:00|1|John Smith|123 Main St.|212-123-4567|
</div>
</div>
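The sketch below compares the default suffixes with custom ones on a shortened version of the data. It assumes, as in the tables above, that both `orders` and `customers` have a column named `id`; the suffixes are passed as a tuple here, and the variable names are illustrative.

```py
import pandas as pd

orders = pd.DataFrame({
    'id': [1, 2, 3],
    'customer_id': [2, 2, 3],
    'quantity': [1, 3, 1]})

customers = pd.DataFrame({
    'id': [1, 2, 3],
    'customer_name': ['John Smith', 'Jane Doe', 'Joe Schmo']})

# Default suffixes: the two id columns become id_x and id_y
default_merge = pd.merge(
    orders,
    customers,
    left_on='customer_id',
    right_on='id')
print(default_merge.columns.tolist())

# Custom suffixes: the same columns become id_order and id_customer
named_merge = pd.merge(
    orders,
    customers,
    left_on='customer_id',
    right_on='id',
    suffixes=('_order', '_customer'))
print(named_merge.columns.tolist())
```

In both results, the customer's `id` always equals `customer_id`, since that is the pair of columns the rows were matched on.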

Merge on Specific Columns II

Working with Multiple DataFrames

### Why Learn Pandas?
Pandas provides tools for working with tabular data, i.e., data that is organized into tables that have rows and columns. Pandas offers much of the same functionality as SQL or Excel, but adds the power of Python.

### Take-Away Skills:
After learning Pandas, you’ll be able to ingest, clean, and aggregate large quantities of data, and then use that data with other Python modules like SciPy (for statistical analysis) or Matplotlib (for visualization).

This course will cover how to create Pandas DataFrames, calculate aggregates, and merge multiple tables.

Learn the basics of Pandas, an industry standard Python library that provides tools for data manipulation and analysis.