See how data literacy has been used in courtrooms, NASA launch sites, 19th-century London, and more!

Welcome to this course on data literacy! First things first, let’s answer a crucial question: Why is data literacy important? In other words, why should anyone aim to be data literate?

There are a lot of good answers! In this lesson, we’ll see how data literacy helped 19th-century doctors end cholera epidemics and discover the root cause of the disease. We’ll explore how it has helped reveal discrimination in hard-to-measure settings like hiring practices and advance medical knowledge by improving clinical trial data quality.

Data literacy also helps us to produce readable work for other people. As we’ll see, even when good data is there, the inability to tell a clear story can have dire consequences. 

It’s no secret that data is an incredibly powerful tool. With all that's at stake, it’s also not a surprise that understanding a data-driven conclusion can feel overwhelming sometimes – both as an audience member and as an analyst. No matter which side we find ourselves on, data literacy is about how well we read, interpret, and communicate with data. 

Let’s dive in with some case studies about data literacy triumphs and failures. 


An illustration of a space shuttle with large trails of smoke behind it.

Welcome to Data Literacy

**Garbage in, garbage out** is a data-world phrase that means “our data-driven conclusions are only as strong, robust, and well-supported as the data behind them.”

For example: we have a lot of data on heart attacks, but there’s room for improvement when it comes to data quality. Heart disease is the leading cause of death in women, but as of 2021, women account for only 38% of participants in relevant research studies. 

There are key differences between men’s and women’s heart attacks that impact how they’re treated, but our data doesn’t yet adequately outline those differences. This leads ultimately to worse outcomes in treatment and a higher post-heart attack mortality rate for women.

How does data literacy factor in? Part of understanding and communicating with data means **asking the right questions so that we end up with useful, relevant data**. We can already answer lots of questions about heart attacks, but we won’t learn the ins and outs of women’s heart attacks by studying mostly men. 

Part of practicing good data literacy means asking…
* Do we have sufficient data to answer the question at hand?
* Can my data answer my exact question?


Bad data, represented as a can of garbage labeled "bad data", flows into a laptop. The laptop screen shows code (both on the laptop and in the cloud) and is labeled "great model." Finally, another arrow leads from the laptop to garbage shooting out into the world, labeled "bad predictions."

Data Gaps

One question the data on heart attacks might prompt is “why did the trials have only 38% female participation?”

In part, for historical reasons: in the 1950s, pregnant women in Europe and Canada were prescribed a drug called thalidomide for morning sickness. This drug resulted in severe birth defects and was taken off the market. As a result, in 1977 the US Food and Drug Administration (FDA) recommended excluding from early-stage clinical trials all women who could become pregnant. While intended to protect women, the recommendation put them at risk in a different way, limiting our knowledge of the effects of drugs on women's bodies.

The FDA reversed these recommendations in the 1990s, and today government-funded clinical trials **must** include women and members of minority groups. Yet the trials don't need to include these groups at representative levels, and the majority of drug trials in the US aren’t government-funded anyway.

In this case, participation might also be impacted by media representations. In typical TV or movie heart attacks, we almost always see a man clutching at his arm or chest. Not only do women have heart attacks too (we wouldn’t know it from watching TV), they are also more likely than men to experience symptoms other than chest pain.

(In fact, in the top 20 “heart attack” movies* on IMDB, only two heart attacks happen to women: one is fake, and the other is a disguised murder. So… zero real heart attacks in women in a list of top 20 “heart attack” movies!)

It might seem like a stretch from data literacy to TV heart attacks, but sound science means examining bias and controlling variables wherever possible.

Part of practicing good data literacy means asking…
* Who participated in the data?
* Who is left out?
* Who made the data?


*top movies with keyword “heart attack” where there is actually a heart attack mentioned or shown in the movie – not _The Exorcist_, which is on that list because people have had heart attacks while watching it... yikes!

A timeline titled "A short history of women’s participation in US clinical trials." The text is as follows:
1950s: Thalidomide drug is prescribed to treat morning sickness. It causes severe birth defects.
1961: Thalidomide is taken off the market and banned. 
1977: US Food & Drug Administration bars women of child-bearing potential from participating in early stage clinical trials. This leads to an overall drop in women’s inclusion in clinical trials. 
1993: After much campaigning by health advocates, Congress mandates adequate inclusion of women in government-sponsored clinical trials.
2001: Report by the Institute of Medicine establishes the importance of sex-based biology, further advocating for the need to include women in clinical trials at representative levels. 
2021: Despite lots of progress, women account for only 38% of participants in heart disease clinical trials.

Addressing Bias

Now let’s check out a case study that showcases the value of data literacy in the legal system.

Big, amorphous injustices like hiring discrimination are hard to prove in court. Hiring discrimination is a pattern of biased behavior towards candidates. This bias results in qualified candidates not getting hired because of traits unrelated to the job, such as race, gender, or religion.

Throughout the 1900s, companies in the US were able to justify hiring on a case-by-case basis. After all, it’s legal to hire or not hire candidates based in part on soft qualities such as “fit” and “office culture.” But if these qualities are a mask for factors like a candidate’s race, gender, or religion, the company has broken anti-discrimination laws.

Usually, a lawyer would have to show the many individual cases that proved a company was discriminatory. Instead, lawyer Elaine W. Shoben shifted the burden of proof to companies. How was she able to do this with data literacy? She used the power of statistics! **Statistics helps us test the likelihood of an event happening by random chance versus systematically.**

What does that actually mean? For example, you’re likely to see more cars on the road at 8am on Wednesday than at 8am on Sunday. This isn’t a random occurrence – the increase in cars is systematically explainable by rush hour, which exists because of standard business hours. Statistically, we are more likely to see many cars during rush hour than at other times.

We'll see in the next exercise exactly how Elaine Shoben used statistics to change how we assess bias in hiring.


Two animations of cars on the road. The top shows a road with 2 lanes of multicolored cars. There are many cars driving with no space between them, and the image is labeled "8 am Wednesday." The bottom image shows the same view of 2 lanes of multicolor cars, but there are only a few cars driving with lots of space between them. It's labeled "8 am Sunday. "

What is Statistics?

So how did Elaine Shoben show that discrimination was at play in hiring decisions? It’s a bit heavy on the legal jargon, but we can break it down to see how it works.

1. First, she said that we can use statistics to see if the hiring results of subjective interviews are so unlikely that they couldn’t have happened by chance. In other words, is it even possible (in statistical terms) that the pattern of who got the job could be based on random chance?
2. If the results **couldn’t** happen by chance, then the alternative is that they must happen by “purposeful exclusion.” In other words, it would mean people are excluded from the job by discriminatory hiring practices.
3. If the employers are aware of the “exclusionary effect,” and they continue to use that same hiring process, then they’re showing a “reckless disregard” for the rights of individual candidates not to be discriminated against in the hiring process. (Read it a few times if you need to!)
4. Once we acknowledge that, the burden shifts to employers to show why their hiring requirements are valid and necessary. We no longer assume the hiring practices are legitimate and make job candidates prove otherwise. 
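
Step 1 can be made concrete with a small calculation. The sketch below is not Shoben's actual method or data. It assumes a hypothetical applicant pool in which 40% of qualified applicants are women, and a company that made 50 hires, only 5 of them women. It then asks how likely a result at least that lopsided would be if every hire were an independent random draw from the pool (a simple binomial model, which is itself an assumption).

```python
from math import comb


def prob_at_most(k, n, p):
    """Probability of k or fewer successes in n independent draws,
    each succeeding with probability p (binomial distribution)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Hypothetical numbers, for illustration only.
share_of_pool = 0.40  # fraction of qualified applicants who are women
hires = 50            # total people hired
women_hired = 5       # women among those hires

p_chance = prob_at_most(women_hired, hires, share_of_pool)
print(f"Chance of {women_hired} or fewer women in {hires} random hires: {p_chance:.8f}")
```

A very small probability doesn't prove discrimination on its own. In the argument above, it is what makes "random chance" an implausible explanation, so that the employer has to justify the hiring process.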

Statistics at work! That’s definitely a bit of legal jargon, but how cool is it to use statistics to reveal a systematic pattern of discrimination, rather than trying to piece together a case from individual experiences? That’s really what stats is all about.


Statistics At Work

Okay, we’ve walked through recognizing data quality and bias in healthcare and using statistics to answer big legal questions. Where else does data literacy come into play?

Data visualization is one of the most visible and obvious places we interact with data. It helps us to explore and understand data-driven arguments and is a powerful tool for communication. 

While most data viz we see is of the “everyday” variety, in this case study we’ll look at a highly consequential visualization: one of the charts that NASA-contracted engineers used to make the argument that the Challenger space shuttle should not take off on January 28, 1986.

The Challenger space shuttle carried seven US astronauts who were supposed to deploy a satellite and study Halley’s Comet while they were in orbit. Less than two minutes after takeoff, however, the shuttle exploded, killing all seven crew members.

The explosion was caused by the failure of two O-rings: rubber rings that sealed the joints between segments of one of the shuttle’s solid rocket boosters. Before the launch, engineers were concerned about how the low-temperature forecast would affect the O-rings’ ability to make a proper seal. 

The engineers made their arguments in favor of postponing the launch using, in part, a series of data visualizations that showed launch success rates at various temperatures. Tragically, their arguments did not prevent the launch from proceeding.


High Stakes Visualizations

Before we pick apart this visualization, it’s worth saying that hindsight is 20/20. If it were as simple as “obviously, the O-rings were going to fail,” then the Challenger would never have been launched. This event was the culmination of problems that had built up over several years rather than an isolated incident, so there were many other factors at play. 

Following the incident, a Presidential Commission was initiated to investigate the causes of the catastrophe. The commission determined that it was directly the result of O-ring failure. However, they also concluded that management from both NASA and Morton Thiokol (the company NASA had contracted to design and maintain its rocket boosters) had ignored evidence that indicated there was significant risk of O-ring failure at lower temperature launches. Additionally, the commission noted that they had failed to adequately test the equipment they were using, despite consistent requests from engineers for several years preceding the incident.

In short, it is unlikely that this particular visualization played a pivotal role in the decision-making conversation that ended with management deciding to launch as scheduled.

From a data literacy standpoint, though, we can definitely see how a better visualization would make the trend of the data more apparent. The engineers had the data to know that O-rings began to fail at lower temperatures. But their visualization was not created in a way that made that danger clear. 

The visualization of rocket launches was organized by date, which made it hard to see the pattern of launch failures at lower temperatures (see the top-right image). When Edward Tufte later organized the rockets by temperature, that pattern became much more obvious (see the lower image). Additionally, including all of the rocket symbols for decoration didn’t make the argument clearer, but instead added distracting visuals to the page. 

The visualization would have been easier to interpret with fewer distracting lines and a more direct link between temperature and launch failures.  

While most of us will (thankfully) never be in the position of making or interpreting life-or-death data visualizations, good data literacy helps us to make informed decisions every day. Should I bring an umbrella? Should I postpone my trip to avoid public health risks? Should I buy stock in Blockbuster? Whatever the questions, improving our data literacy can help us reach the answers.


A comparison of the original visualization, which organizes rockets by launch date, and the revised visualization, which arranges rockets by temperature. In the revised graph, there are 5 O-ring issues in the 8 coldest rockets (out of 8 O-ring issues in 48 rockets). The pattern is much more obvious than in the original graph, where there is no cluster of O-ring failures apparent. 

The Challenger Visualizations

Wow, we’ve made it through a lot of content! Let’s kick off the final section: analysis, or turning data into useful information. The key question of analysis is, “what’s the takeaway?” 

Let’s start with a humble reminder that humans have some limitations when it comes to numbers. 

We’re generally very good when it comes to numbers we can count, or numbers we use in context. For example, money makes a lot of sense in everyday amounts like coffee, bills, or rent. We can visualize what those numbers mean and understand the consequences of them increasing or decreasing by, say, 20%.

But numbers without everyday context are another story. Think about the GDP of a country, or the personal wealth of an evil billionaire. How would that number change if we added 20% to it? We can probably do the calculation without too much challenge, but what does the change in that number actually mean in real life terms? 

With really big (or really small) numbers, it takes extra care and attention to understand how big or how small the quantity is.

On that note, powers of ten make a big difference, especially at a large scale! A million vs. a billion? Really different! (1 million seconds is equal to about 11 days. 1 billion seconds is equal to about 32 years 🤯 ) 
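
The seconds comparison is easy to check for ourselves. Here's a minimal sketch of the arithmetic; the only assumption is using 365.25 days as the average length of a year.

```python
SECONDS_PER_DAY = 60 * 60 * 24
DAYS_PER_YEAR = 365.25  # average year length, including leap years

million_in_days = 1_000_000 / SECONDS_PER_DAY
billion_in_years = 1_000_000_000 / SECONDS_PER_DAY / DAYS_PER_YEAR

print(f"1 million seconds is roughly {million_in_days:.1f} days")
print(f"1 billion seconds is roughly {billion_in_years:.1f} years")
```

Both numbers differ by "just" three zeros, yet one is a long vacation and the other is a large part of a lifetime.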

**Part of an analyst’s job is to provide context and clarifications to make sure that audiences are not only reading the correct numbers, but understanding what they mean.**

Visual comparison of 1 small yellow circle (top) to 1000 small yellow circles in a 20-by-50 grid (bottom). The single circle represents 1 million (1 with 6 zeroes) and the thousand circles represent 1 billion (1 with 9 zeroes). 

Numeracy

In the world of data, we’ll hear time and time again that “correlation does not equal causation.” In other words, while two events might be connected or related, that doesn’t mean they’re in a cause-and-effect relationship.

A “causal link” exists when one event actually causes another. Establishing one means showing that the relationship is cause and effect rather than coincidence. One of the most important ways this has been applied in the last few centuries has been in epidemiology, the study of diseases. Discovering correct causal links has meant big things for the prevention and treatment of diseases. 

Let’s take a look at one of the earliest instances of successful causal analysis in medicine, which starts with a man called John Snow. (Not the fantasy-famous Lord of the North, but a real-life nineteenth century London doctor.)

Until Dr. Snow’s discovery in the mid-nineteenth century, people believed that cholera was caused by vapors rising from the burial grounds of plague victims from two centuries earlier. (A good try, but cholera is actually a waterborne disease caused by bacteria found in sewage. It causes severe dehydration and has a fatality rate of over 50% when untreated.)

By studying earlier cholera epidemics and organizing his data analysis around his hunch that cholera was waterborne, Dr. Snow was able to link an 1854 cholera outbreak in London to a contaminated water pump – effectively proving a causal relationship between contaminated water and cholera **before scientists understood that bacteria cause disease**!

How’d he do it? Let’s find out in Part 2. 


A black and white printed map of London in 1854, showing case rates of cholera as small black bars at the locations they occurred, like a bar chart scattered around a map. The bars are clustered around Broad Street, and the Broad Street water pump is marked near the geographical center of the bars. 
Description of the geography shown on the map: The map centers on Broad Street in Soho. The north edge of the view shows Oxford Street from Regent Circus to Soho Square. The south side of the map is bounded just south of Piccadilly Circus. 

Causal Analysis and John Snow's cholera theory: Part 1

Dr. John Snow’s causal analysis breakthrough started with how he visualized his data: he organized cholera death records by location rather than by time, which was more common. He made a map, and discovered that the deaths centered around a water pump on Broad Street.

From there, Dr. Snow used death records that seemed to contradict his theory to strengthen his explanation. For instance, a woman who died of cholera in a completely different neighborhood had just visited her aunt’s house near Broad Street and drunk water from the pump. 

Dr. Snow also found that a workhouse and a brewery near the pump both had few or no cholera deaths. Upon investigation, he learned that the workhouse had its own water supply, and that the brewers not only had access to a well at the brewery, but that they drank only malt liquor and never visited the Broad Street pump.

Snow advised that the handle be taken off the Broad Street pump to prevent people from drinking the contaminated water. The handle was removed, and this action coincided with the end of that outbreak. The number of deaths was already trailing off (more than 75% of residents had left the area to avoid “choleric vapors”), but this public health intervention prevented the disease from recurring as people returned, and the epidemic ended.

The built-in test cases helped Snow to isolate variables and prove that the key variable was that people who developed cholera had drunk water from the contaminated pump. From there, repeated studies of cholera and modern lab experiments have only confirmed the causal link he discovered.

In modern lab science, we use controlled experiments to isolate variables and prove causation. Controlled experiments are often not possible outside of lab settings, though, so data scientists do the best they can to isolate and control variables and get comfortable working with some amount of error. 


Same image as the last exercise. A black and white printed map of London in 1854, showing case rates of cholera as small black bars at the locations they occurred, like a bar chart scattered around a map. The bars are clustered around Broad Street, and the Broad Street water pump is marked near the geographical center of the bars. Description of the geography shown on the map: The map centers on Broad Street in Soho. The north edge of the view shows Oxford Street from Regent Circus to Soho Square. The south side of the map is bounded just south of Piccadilly Circus.

Causal Analysis and John Snow's cholera theory: Part 2

Wowie, we’ve covered a lot! The road to confident data literacy is full of fascinating examples, and knowledge that helps us make sense of one of the most powerful tools of our age: data.

In this lesson, we covered data quality and data ethics, and saw how interrogating data quality can lead to better outcomes in healthcare. Along with that, we saw how recognizing and addressing bias leads to stronger data and deeper truth.

We learned how people have used statistics to solve important but nebulous legal problems like hiring discrimination. Knowing how to use statistics has huge consequences for learning about issues that are too big to fully address one-by-one.

We saw how visualizations can make or break a data conclusion when it comes to communicating findings. While most of us aren’t making or interpreting life-or-death data viz, learning from the Challenger space shuttle explosion can help everyone to make more intentional, meaningful decisions around data visualization.

Finally, we talked about data analysis. We covered the importance of context when it comes to interpreting data: not just what the numbers are, but what they mean. And we learned about correlation and causation, capping things off with a triumph for modern medicine based on good data visualization and causal analysis.

Hot dog! That’s a lot of reasons to get excited about data literacy. 


Image of two eyes overlaid on many different types of visualizations used to represent data.

Wrap Up

Case Studies in Data Literacy

Dive into the world of data by learning how data is collected, how its quality is assessed, and what's at stake when things go right — and wrong. See how data literacy has been used in courtrooms, NASA launch sites, 19th-century London, and more!

Explore the world of data through case studies, and learn about collection methods and data quality.

Introduction to Data


Data, data everywhere, but where do we get started?

>We chose it because we deal with huge amounts of data. Besides, it sounds really cool.

_– Larry Page, Co-founder of Google_

The founders of Google were playing with the mathematical term, "googol", which is a 1 with 100 zeros after it (a number so big that it is pretty much incomprehensible to people). And Google knew they were working with an incomprehensible amount of data.

But what does all of that data look like? And what does it mean to work with a dataset?

Data can mean a lot of things, but within data science, it typically means a collection of organized observations.

There are two types of organization: methodology and shape.

The methodology is how the data was collected. We will get more into that later in this lesson.

The most common shape for data is a spreadsheet or table. The things we are measuring (variables) are in the columns, and the individual instances (observations) are in the rows. We can read each column “down” the table (viewing multiple observations), and each row “across” the table (viewing multiple variables). 

This isn't the only way to organize data, but it is the most common.

In this lesson, we're going to cover some basics of working with data. Some of this might be familiar if you've been working with data for a while, but some of this might also be new.

These ideas apply to all datasets you will work with. We will use an example of creating a tree census.


Introduction to Data Types and Quality

For your new role as a tree census taker, you'll start with height and species. 'Height' and 'Species' are our **variables**. The height of each tree can "vary" from one tree to another (hence the name).

Each individual tree is called an **entity**, **observation**, or **instance** (there are a lot of names for this). We'll stick with observations, but know that these three terms are used interchangeably.
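
To make the table shape concrete, here is a minimal sketch that stores a few observations as a list of rows. The tree values are made up for illustration, and a plain Python list of dictionaries is just one possible way to hold a table.

```python
# Hypothetical tree observations, for illustration only.
# Each dictionary is one row (observation); each key is one column (variable).
trees = [
    {"Height (ft)": 42.0, "Species": "Pin Oak"},
    {"Height (ft)": 55.5, "Species": "London Plane"},
    {"Height (ft)": 38.2, "Species": "Honeylocust"},
]

# Reading "across" a row: one observation, every variable.
first_tree = trees[0]

# Reading "down" a column: one variable, every observation.
heights = [tree["Height (ft)"] for tree in trees]

print(first_tree)
print(heights)
```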

In a well-organized dataset, each variable describes one characteristic of our observations. However, it can be surprisingly difficult to define good variables. A good variable measures only one characteristic, and its name should be the characteristic itself rather than one of the characteristic's possible values. Let's look at an example.

In our tree dataset, we are interested in the type of environment each tree is in: we are looking at trees along city streets, along highways, and in undeveloped areas. We also want to know if trees are standing alone or with others.

There are many ways to organize this. We could:
1. Make 3 new variables: 'City', 'Highway', 'Undeveloped' and input 'alone' or 'group' in the values.
2. Make 2 new variables: 'Location' and 'Single' and input the location type in the 'Location' variable and 0 or 1 in the 'Single' variable.

Option 1 might seem ok during the collection phase, but it will cause trouble when we start analyzing the data. For example, finding all of the 'City' or 'Highway' trees and then segmenting them by whether they stand alone would be a challenge.

You may have already noticed that 'City', 'Highway', and 'Undeveloped' can be grouped together as a characteristic (and there are categories like 'Park' or 'Yard' that are missing). Rather than naming our variables for the categories themselves, we are better off having one variable named 'Location' and entering all the possible values. This will make analysis easier later on, and we can add new categories if we need to (like 'Park').

Looks like Option 2 is the better organization for us.

But what about 'Alone' and 'Group'? Well, we will talk more about this later, but for now, just know that the variable name will be 'Single', and we will fill it in with 1 for True/Yes and 0 for False/No.
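
Here is a rough sketch of why Option 2 is easier to analyze. It converts a few made-up rows from the Option 1 layout into the Option 2 layout. The helper name `to_option_2` and all of the sample values are hypothetical.

```python
# Hypothetical rows recorded in the Option 1 layout, for illustration only.
# Each tree has a value in exactly one of the three location columns.
option_1_rows = [
    {"Species": "Pin Oak", "City": "alone", "Highway": None, "Undeveloped": None},
    {"Species": "London Plane", "City": None, "Highway": "group", "Undeveloped": None},
    {"Species": "Honeylocust", "City": None, "Highway": None, "Undeveloped": "alone"},
]

LOCATION_COLUMNS = ["City", "Highway", "Undeveloped"]


def to_option_2(row):
    """Convert one Option 1 row into the Option 2 layout."""
    for location in LOCATION_COLUMNS:
        value = row[location]
        if value is not None:
            return {
                "Species": row["Species"],
                "Location": location,
                "Single": 1 if value == "alone" else 0,
            }
    raise ValueError("row has no location recorded")


option_2_rows = [to_option_2(row) for row in option_1_rows]

# With one 'Location' column, filtering is a single comparison per variable.
city_trees_alone = [
    row for row in option_2_rows
    if row["Location"] == "City" and row["Single"] == 1
]
print(city_trees_alone)
```

Adding a new category like 'Park' now means entering a new value, not adding a new column.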


In all tidy datasets, every column represents a variable and every variable appears in only one column.
In this untidy dataset, the titles of some columns form a group and should be encoded as a characteristic (Location). One characteristic (Alone) is encoded in multiple columns.
This distinction is visually represented with trees and in a table. In both cases, the untidy dataset has missing values. 

The Shape of Data

In our tree census, we are collecting data about two types of variables: one that we measure (height) and one that we categorize (species).

The difference between measuring and categorizing is so important that the data itself is termed differently:
* Variables that are measured are **Numerical** variables
* Variables that are categorized are **Categorical** variables.

#### Numerical variables

Numerical variables are a combination of the measurement and the unit. Without the unit, a numerical variable is just a number.

Imagine I go into a cafe and ask the barista for 3. Three what? ☕? 🍩? 💵?
Or my friend asks how far Toledo is and I say 300. 300 miles? Kilometers? Minutes? Without units, numbers don't mean anything.

There are two ways to get a number: by counting and by measuring. Counting gives us whole numbers and **discrete** variables. Measuring gives us values that can fall between whole numbers and **continuous** variables.

In our tree census, we are measuring the height of our trees in feet (indicated in the variable name, 'Height (ft)'), a continuous variable.

#### Categorical variables

Categorical variables describe characteristics with words or relative values.

In the tree census, tree species are described with words like London Plane, Honeylocust, or Pin Oak. This is the best description and encodes all the information we need about the species. This kind of categorical variable is a **nominal variable**, which literally means a named value.

We also captured whether or not our trees grew alone. In our 'Single' variable, there were only two options: Yes and No. This is called a **dichotomous variable**. Dichotomous variables have only 2 logical possibilities, such as "on/off", "yes/no", "true/false", or "0/1". There's no middle ground and no 3rd option. If there is a logical third option, it's not a dichotomous variable.

Finally, let's say that we wanted to capture how "pretty" we thought each tree was. This isn't really a thing we can measure, but we can subjectively say on a scale of 1 to 5, how pretty we think each tree is. The prettiest trees are a 5, the least pretty trees are a 1.

That ranking is inherently ordered and therefore called an **ordinal variable**.

Ordinal variables are really popular in survey design: "On a scale of 1-5, how much do you agree with this statement?" This is called a **Likert** scale. They also show up in the Olympics and other competitions where someone wins 1st, 2nd, or 3rd place.

Ordinal variables can get a little confusing because they are often represented as numbers. But they don't represent measurements or counts; they represent categories. For example, if an Olympian wins Gold and Bronze medals, it doesn't make sense to say that they averaged Silver. The same is true of Likert scales: there's no average between "Very pretty" and "Pretty."
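
One way to work with an ordinal variable is to keep the labels and store their order separately. The sketch below assumes a hypothetical 5-point 'Prettiness' scale (the label 'Somewhat Pretty' is made up to fill out the scale) and a handful of made-up ratings. It sorts the ratings by rank and then summarizes them with the most common label and the middle label, instead of an arithmetic average.

```python
from collections import Counter

# Hypothetical ordered scale, lowest to highest.
SCALE = ["Least Pretty", "Somewhat Pretty", "Pretty", "Very Pretty", "Prettiest"]
RANK = {label: position for position, label in enumerate(SCALE, start=1)}

# Made-up ratings for five trees.
ratings = ["Pretty", "Prettiest", "Pretty", "Least Pretty", "Very Pretty"]

# Order is meaningful, so labels can be sorted by their rank.
sorted_ratings = sorted(ratings, key=RANK.get)

# Summaries that respect categories: the most common label and the
# middle label (the median, since there is an odd number of ratings).
most_common_label, _ = Counter(ratings).most_common(1)[0]
middle_label = sorted_ratings[len(sorted_ratings) // 2]

print(sorted_ratings)
print(most_common_label, middle_label)
```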


Three trees. One is in first place, a 5 on our scale, and labeled 'Prettiest'. The next is in second place, a 3 on our scale, and labeled 'Pretty'. The last is in third place, a 1 on our scale, and labeled 'Least Pretty'.

Variable Types

> All happy datasets are alike; each unhappy dataset is unhappy in its own way.

_- Leo Tolstoy (if he had written a Data Science book)_

Ok, Tolstoy was writing about families in _Anna Karenina_, but families are just like data. Clean datasets are all alike, but every messy dataset is messy in its own unique way. That's why cleaning data involves a lot of critical thinking when considering the nuances of the dataset you are working with.

Fortunately, there are some patterns in what can go wrong, and the first step in cleaning data is knowing what to look for.

#### What is a messy dataset?
Imagine we are outside collecting the data about our trees. We have our iPad and our tape measure. Our fingers are cold, we are distracted by a beautiful bird 🦜, and we're ready for lunch 🍕, but we just have to measure and categorize these last 3 trees. The last 3 entries look like this:

![trees dataset](https://static-assets.codecademy.com/Courses/data-literacy/data-types/Trees_Ex4.png)

Yikes! What a mess. But we're hungry, so we decide to fix the issues after lunch. They never get fixed. Six weeks later, we are back at our desk ready to analyze our data. Oh no! We have over 10,000 observations and quite a few problems.

#### Messy Data Problems

Different problems need to be handled differently, so let's categorize them:
* Typos like Tuuullip for Tulip
* Missing data like the Pin Oak (tree ID 11222) that doesn't have a height
* Inconsistent coding: the Prettiness value for one Pin Oak (tree ID 18564) is 'three' rather than '3', and the Single value for all of our trees is 'no' rather than '0'.

If we don't fix these issues, we will likely end up with problems in our analysis. For example:
* Tulip trees might be divided into 2 categories
* We might get an inaccurate average height for Pin Oaks because we are missing a data point
* Our computer might return an error message when we try to group trees by their Prettiness value or find all of the trees that grow alone.

Finding and solving these problems requires detective work. For now, we will fix these issues manually, but know that if you work with data, you will see these issues again. We cover how to deal with these issues and more in [How to Clean Data with Python](https://www.codecademy.com/learn/practical-data-cleaning) and in the course of our Data Scientist Career Paths.
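
For a taste of what automating these fixes might look like, here is a minimal sketch in plain Python. The rows are modeled on the three problems described above, but the heights, the Tulip tree's ID, and the lookup tables are all made up for illustration. This is one possible approach, not the method taught in the linked course.

```python
# Rows modeled on the problems described above. Heights and the first
# tree's ID are hypothetical.
raw_rows = [
    {"Tree ID": 10001, "Species": "Tuuullip", "Height (ft)": "61", "Prettiness": "4", "Single": "no"},
    {"Tree ID": 11222, "Species": "Pin Oak", "Height (ft)": "", "Prettiness": "2", "Single": "no"},
    {"Tree ID": 18564, "Species": "Pin Oak", "Height (ft)": "48", "Prettiness": "three", "Single": "no"},
]

# Hypothetical lookup tables for known problems.
SPECIES_FIXES = {"Tuuullip": "Tulip"}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
YES_NO = {"yes": 1, "no": 0}


def clean(row):
    """Return a cleaned copy of one raw row."""
    # Typos: replace known misspellings, keep everything else as entered.
    species = SPECIES_FIXES.get(row["Species"], row["Species"])

    # Missing data: keep it explicitly missing (None) instead of guessing.
    height_text = row["Height (ft)"].strip()
    height = float(height_text) if height_text else None

    # Inconsistent coding: accept 'three' as well as '3'.
    prettiness_text = row["Prettiness"].strip().lower()
    if prettiness_text in NUMBER_WORDS:
        prettiness = NUMBER_WORDS[prettiness_text]
    else:
        prettiness = int(prettiness_text)

    # Inconsistent coding: store yes/no as 1/0.
    single = YES_NO[row["Single"].strip().lower()]

    return {
        "Tree ID": row["Tree ID"],
        "Species": species,
        "Height (ft)": height,
        "Prettiness": prettiness,
        "Single": single,
    }


cleaned_rows = [clean(row) for row in raw_rows]
for row in cleaned_rows:
    print(row)
```

Notice that the missing height stays missing. Deciding what to do about it is a separate question.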


Dealing with Messy Data

In our dataset, we had some missing values. There are various types of "missing-ness" that affect how we treat the missing data. 

Remember that when we were collecting our data, we were hungry, our fingers were cold, and we were distracted. Even though there's a reason we didn't enter some values (we were hungry and tired), it's not a systemic reason. There's no deeper meaning to why the data is missing: it just wasn't entered properly. This kind of missingness is called **Missing Completely at Random.**

However, we don't always know if there is a deeper meaning, so we have to treat missing data like a mystery to solve. For example, we might notice that all of the Redwood trees are missing Height values. Well, that's interesting! We can predict if a tree is missing its Height value based on what Species it is. 

More generally, _we can predict if one value is missing based on the value in another variable_. This kind of missingness is called **Missing at Random.** It is a confusing label because it's not really missing at "random" in the normal meaning of the word. If we dig a little deeper into how the data was collected, we might uncover a story about the data collection or about Redwoods. For example, our tape measures might have been too short to measure them. 
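
One way to look for this pattern is to compare how often a value is missing within each group of another variable. The sketch below does that in plain Python; the records and the `missing_rate_by_group` helper are hypothetical and only meant to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical records; None marks a missing Height value.
trees = [
    {"species": "Redwood", "height": None},
    {"species": "Redwood", "height": None},
    {"species": "Pin Oak", "height": 55.0},
    {"species": "Pin Oak", "height": 61.0},
    {"species": "Tulip", "height": 42.0},
    {"species": "Tulip", "height": None},
]


def missing_rate_by_group(records, group_key, value_key):
    """Return the share of missing values in value_key for each group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record[value_key] is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


rates = missing_rate_by_group(trees, "species", "height")
for species, rate in rates.items():
    print(f"{species}: {rate:.0%} of heights missing")
```

If one group's rate is far higher than the others, as it is for the Redwoods in this made-up data, that is a clue worth investigating rather than proof of a cause.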

Finally, data can be structurally missing, meaning that we wouldn't expect a value there to begin with. For example, let's say we are also collecting data about fruit on our trees. Some trees will have visible fruit. For those trees, we can count how many fruits are visible. If there's no visible fruit, we can't count how many there are. The number of fruits will be **Structurally Missing**.

What should I do about missing data?

Well, for structurally missing data, we can just ignore it, since we don't expect there to be values there anyhow.
For Missing at Random and Missing Completely at Random, there is an entire science behind what to do with these values. Learn more in our course on [Handling Missing Data](https://www.codecademy.com/learn/handling-missing-data).

When trying to recover missing data or work around it, the most important thing to consider is that anything you do will affect your analysis. Once data goes missing, it can't be recovered, so whatever decision you make becomes a part of your result and your analysis (even doing nothing will affect the analysis). Because of this, it is best practice to keep track of which values were missing just in case you ever need to revisit your data.
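
Here is a minimal sketch of keeping track of missing values while working around them. Filling with the mean is used purely as an example of a decision we might make, not as a recommendation; the records and the `fill_with_flag` helper are hypothetical.

```python
# Hypothetical heights; None marks a value that was never recorded.
trees = [
    {"tree_id": 1, "height": 40.0},
    {"tree_id": 2, "height": None},
    {"tree_id": 3, "height": 50.0},
]


def fill_with_flag(records, key, fill_value):
    """Fill missing values in copies of the records and flag which were filled."""
    result = []
    for record in records:
        updated = dict(record)  # copy, so the original data stays untouched
        was_missing = record[key] is None
        updated[f"{key}_was_missing"] = was_missing
        if was_missing:
            updated[key] = fill_value
        result.append(updated)
    return result


# One possible choice of fill value: the mean of the heights we did observe.
observed = [tree["height"] for tree in trees if tree["height"] is not None]
mean_height = sum(observed) / len(observed)

filled = fill_with_flag(trees, "height", mean_height)
for tree in filled:
    print(tree)
```

Because the flag column travels with the data, we can always separate measured values from filled ones if we revisit the analysis later.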


Table of data with missing values highlighted. Missing values are interspersed throughout, illustrating how damaging missing data can be.

Working with Missing Data

Great! We collected our measurements and made decisions about handling missing data. Now we need to ask ourselves if the dataset we have really describes the world. We need to know if it is accurate. Accuracy is a measure of how well records reflect reality.

While doing some [Exploratory Data Analysis](https://www.codecademy.com/learn/eda-exploratory-data-analysis-python) you notice that the trees you measured are overall taller than the trees I measured. That's interesting. You're not really sure why that is, so we compare how we measured the trees. 

We realize that you measured starting from the ground and I measured starting from where the roots become the trunk. It's not a huge difference, but it's enough to affect the accuracy of our data. The tree heights are not accurate because we don't know how tall each tree really is. We could also say that the height variable is not *reliable*. Without a standard measurement unit and a standard method, comparing trees, or even getting an average tree height, is impossible. 
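
A quick comparison like the one that raised our suspicion can be sketched as follows. The `collector` field, the heights, and the `mean_by_group` helper are invented for illustration.

```python
from statistics import mean

# Hypothetical heights; the "collector" field and all values are made up.
measurements = [
    {"collector": "you", "height": 51.0},
    {"collector": "you", "height": 62.0},
    {"collector": "you", "height": 43.0},
    {"collector": "me", "height": 50.0},
    {"collector": "me", "height": 60.0},
    {"collector": "me", "height": 40.0},
]


def mean_by_group(records, group_key, value_key):
    """Return the mean of value_key for each distinct value of group_key."""
    groups = {}
    for record in records:
        groups.setdefault(record[group_key], []).append(record[value_key])
    return {group: mean(values) for group, values in groups.items()}


means = mean_by_group(measurements, "collector", "height")
gap = means["you"] - means["me"]
print(means, "gap:", gap)
```

A gap between collectors does not prove that the methods differ (one of us may simply have measured taller trees), but it is a good prompt to compare how the measurements were taken.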

Standardization is essential for accuracy – but it's not the only way that accuracy can be compromised. 

There are a lot of ways a dataset can have low accuracy, but it all comes down to the question of: "are these measurements (or categorizations) correct?"
It requires a critical evaluation of your specific dataset to identify what the issues are, but there are a few ways to think about it. 
* First, thinking about the data against expectations and common sense is crucial for spotting issues with accuracy. You can do this by inspecting the distribution and outliers to get clues about what the data looks like. 
* Second, critically considering how error could have crept in during the data collection process will help you group and evaluate the data to uncover systematic inconsistencies.
* Finally, identifying ways that duplicate values could have been created goes a long way towards ensuring that reality is only represented once in your data. A useful technique is to distinguish between what was human collected versus programmatically generated and use that distinction to segment the data. 
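
Two of these checks, looking for implausible values and looking for duplicates, can be sketched in a few lines of Python. The records and the plausible height range below are assumptions made for this example; a real range has to come from knowledge about the trees being measured.

```python
from collections import Counter

# Hypothetical records: tree 102 was entered twice and tree 104 looks implausible.
trees = [
    {"tree_id": 101, "height": 48.0},
    {"tree_id": 102, "height": 55.0},
    {"tree_id": 102, "height": 55.0},
    {"tree_id": 103, "height": 61.0},
    {"tree_id": 104, "height": 6100.0},
]

# Assumed plausible range for this example only.
MIN_HEIGHT, MAX_HEIGHT = 1.0, 400.0

# Duplicates: any tree ID that appears more than once.
id_counts = Counter(tree["tree_id"] for tree in trees)
duplicate_ids = sorted(tree_id for tree_id, count in id_counts.items() if count > 1)

# Expectations and common sense: heights outside the plausible range.
out_of_range = [
    tree["tree_id"]
    for tree in trees
    if not MIN_HEIGHT <= tree["height"] <= MAX_HEIGHT
]

print("Duplicate IDs:", duplicate_ids)
print("Out-of-range heights:", out_of_range)
```

Flagged records are candidates for review, not automatic deletions: a surprising value may be an error, or it may be a genuinely unusual tree.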

Holding these perspectives in mind is important for both numeric and categorical variables. In fact, they often provide clues about each other. 

As far as resolving accuracy issues, there are no simple solutions, and every solution has to be tailored to that specific dataset. In the end, the only way to improve a dataset's accuracy is to use real-world knowledge to be sure that the dataset reflects reality. But even then, the reason that we collect data is generally to learn something new about the world. Sometimes the data will surprise you, but distinguishing between a new finding and inaccuracy is the work of a skilled data scientist. 


Six trees. The tree measurements start and stop at different points on each tree. Some trees are measured from the roots, some are measured from the trunk. These differences cause issues with accuracy.

Accuracy

It's not just typos, mistakes, missing data, poor measurement, and duplicated observations that make a dataset low quality. We also have to make sure that our data actually measures what we think it is measuring. This is the **validity** of our dataset. 

Validity is a special kind of quality measure because it's not just about the dataset, it's about the relationship between the dataset and its purpose. A dataset can be valid for one question and invalid for another. 

Let's think again about our trees dataset. After we finished collecting the data, we thought of another question we wanted to answer: how old are our trees?
 
We know that you can measure the age of a tree by counting the rings, but we didn't do that. Let's say that we did measure the width of the tree. 

We decide that since the number of rings and the width are related, we will use width as a proxy for age. With that decision, we just compromised the **validity** of our dataset. Our data doesn't measure age, it measures width. And even though there is a relationship between the number of rings and the width, it's not a direct relationship, so width cannot be substituted for age without affecting the validity of our dataset and measures.


Now let's say that we want to know how much our trees grow every year. We found a dataset for the same region from 20 years ago. We use the locations to match up the old and new measurements. But this data can tell us how much they grow every 20 years, not every year. If we try to use these two datasets to measure yearly growth, we will compromise the **validity** of the dataset again.

Using proxies and inappropriate time spans are just two ways to compromise the validity of a dataset. There are infinite ways in which a given dataset is not valid for answering a given question. The best way to spot issues with the validity of a dataset is to ask: Does this variable measure what I think it does?


Validity

Great! You've cleaned the data, decided what to do about missing data points, and resolved any accuracy issues. You've made sure that all the questions can be answered by the data you have and gotten all stakeholders to agree on the research questions.

You are ready to do your analysis. Then you notice that all of the records are from New York State. You've been hired to work on the census for all of the North Atlantic. That includes multiple states in the U.S. and many regions of Canada. Where is the rest of the data? 

You go back to the dataset and start to think about when it was collected. Right, Spring of 2020 - the border was closed. Census takers collected data in the region that they could: the areas that were convenient. This is a **convenience sample**. It's great for preliminary understanding, but not good for representing a broader population.

If we were to create a model to predict tree prettiness based on the variables in our dataset, it might only be relevant for trees in New York. We've introduced **bias** into our dataset by constraining our sample. 

Convenience samples aren't the only source of sampling error, but they are common. The goal of a sample is to represent a population. Any time a sample does NOT reflect the entire population, it is a sampling error. 

Best practice is to create a **sample** that represents the entire **population**. 

The **population** is all of the trees in the North Atlantic region. The **sample** is the trees that we have data about (it will almost never be all of them). 

The sample should look like the population in as many characteristics as possible. Therefore, our sample needs to include many different kinds of trees from many different locations. 

There are a lot of techniques for creating representative samples, but they all have the same goal: to find a mix of observations that contains all of the features in the larger population. 
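
As one example of such a technique, the sketch below draws a stratified sample: it takes the same fraction of trees from every region so that each region's share in the sample matches its share in the population. The population, the region names, and the `stratified_sample` helper are hypothetical, and stratifying by region is only one of many possible choices.

```python
import random
from collections import Counter


def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every group defined by key."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    groups = {}
    for record in records:
        groups.setdefault(record[key], []).append(record)

    sample = []
    for members in groups.values():
        # Take at least one record per group so no group disappears.
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


# Hypothetical population: 60 trees in New York and 40 in Quebec.
population = (
    [{"tree_id": i, "region": "New York"} for i in range(60)]
    + [{"tree_id": i, "region": "Quebec"} for i in range(60, 100)]
)

sample = stratified_sample(population, "region", fraction=0.1)
sample_counts = Counter(tree["region"] for tree in sample)
print(dict(sample_counts))
```

A convenience sample of this population might contain only New York trees. The stratified sample keeps the 60/40 split between regions, although it only balances the one characteristic we chose to stratify on.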


Representative Samples

Congratulations! You've done the hard work of collecting your data and preparing it for analysis. Along the way you've addressed some data quality issues and considered the relationship between your data and your questions. 

While preparing the North Atlantic Tree Census, you've:
* Defined your variables to create tidy datasets
* Classified the variables based on their types 
* Reconciled messy data 
* Decided how to deal with missing data
* Addressed issues of accuracy
* Aligned your questions to the available data to ensure validity
* Created representative samples

Anyone working with data should keep these techniques in mind. In the end, your human oversight and critical consideration of the data will have the biggest impact on data quality.

Image of 6 trees rooted in the ground with a night sky above them.

Review of Data Types and Quality

Data Types and Quality

Some researchers measured distances in meters and others measured in yards, and all recorded the values without units. 

Some researchers use iPads to record the data and others use pencil and paper before inputting the values into the system.

Some measurements are made mechanically and some are made digitally.

A researcher measures the weight of some inanimate objects in the morning and other inanimate objects in the evening. 

Blood pressure measured daily: `110/90, 120/92, 115/86, 120/90`

Country of birth: `Italy, Mexico, United States, Thailand, Spain`

Survey responses: `Strongly Disagree, Disagree, Somewhat Disagree, Somewhat Agree, Agree, Strongly Agree`

Test your knowledge of data types and data quality.

[World Health Organization (WHO)]: https://www.who.int/data/gho/
[Mental health]: https://www.who.int/health-topics/mental-health#tab=tab_1
[Road traffic mortality]: https://www.who.int/data/gho/data/themes/topics/topic-details/GHO/road-traffic-mortality
[FiveThirtyEight]: https://data.fivethirtyeight.com/
[Political polls]: https://projects.fivethirtyeight.com/polls/
[The best NBA players]: https://projects.fivethirtyeight.com/nba-player-ratings/
[Data.gov]: https://www.data.gov/
[Maritime limits and boundaries]: https://catalog.data.gov/dataset/maritime-limits-and-boundaries-of-united-states-of-america#topic=ocean_navigation
[Storm Events Database]: https://catalog.data.gov/dataset/ncdc-storm-events-database2
[Census data for the United States of America]: https://www.census.gov/data.html
[Census data for EU countries]: https://ec.europa.eu/eurostat/web/population-demography/population-housing-censuses
[Data Unicef]: https://data.unicef.org/
[Google Dataset Search]: https://datasetsearch.research.google.com/
[Orchids]: https://datasetsearch.research.google.com/search?query=Value%20of%20the%20import%20and%20export%20of%20orchids%20in%20the%20Netherlands%202008-2020&docid=L2cvMTFweDF5bnRzOQ%3D%3D
[Biodiversity at U.S. national parks]: https://datasetsearch.research.google.com/search?query=national%20parks&docid=L2cvMTFqbl82ZmdmeQ%3D%3D
[Revenue of the cosmetic & beauty industry in the U.S.]: https://datasetsearch.research.google.com/search?query=cosmetics&docid=L2cvMTFuZmJqOWtsXw%3D%3D

[Image of data points on a screen and a clipboard with a pencil.]: https://static-assets.codecademy.com/Courses/data-literacy/data-collection/1682.svg
[Image of computer screen with plots and charts about website user data.]: https://static-assets.codecademy.com/Courses/data-literacy/data-collection/user-data.jpg
[Image of computer screen with Google search homepage open in a web browser.]: https://static-assets.codecademy.com/Courses/data-literacy/data-collection/search.jpg

![Image of data points on a screen and a clipboard with a pencil.]
 

Most people are aware that data visualizations, machine learning algorithms, and analyses require data. But where does that data come from? Does everyone have access to data for analysis? 

In this article, we'll cover:

* The importance of ethics in data collection.
* How data may be collected.
* Common sources of freely available datasets.

### Data ethics and privacy

Whenever we talk about data collection, we need to discuss data ethics. Much of the data available to us comes from individuals and would be considered **personally identifiable**, meaning we could use it to identify someone. Many people use the acronym PII (pronounced "pie" &mdash; yum!) for the term "personally identifiable information". Examples of PII are address, email, phone number, social security numbers, credit card numbers, and medical records. We all have an obligation to protect personally identifiable information.

Ethical issues regarding data collection may be divided into the following categories:

* **Consent**: Individuals must be informed and give their consent for information to be collected.
* **Ownership**: Anyone collecting data must be aware that individuals have ownership over their information.
* **Intention**: Individuals must be informed about what information will be taken, how it will be stored, and how it will be used.
* **Privacy:** Information about individuals must be kept secure. This is especially important for any and all personally identifiable information.

### Data collection

We collect, process, and analyze data to better understand our world and make more informed decisions. The first step in any data work is to collect the data itself. Data can come from a lot of places, including research, governments, technology, observation, or directly from individuals &mdash; the list is endless! 

We collect this data in many different ways. One way is to seek out information that doesn't yet exist and measure it directly. This can include activities like surveys, observational studies, or recording the results of an experiment. This kind of data might be considered _static_, meaning the information is collected once and does not change. Think about conducting a survey by mail: the survey results are collected and recorded only once.

![Image of computer screen with plots and charts about website user data.]

Data can also be live and ever-changing based on the most up-to-date information. For example, apps and websites can track clicks and time spent on pages across multiple users at the same time without a human actively recording all the data points. Unlike the static data of more traditional methods, sensors and trackers can also continuously update data to include new information in a live feed. Think about weather predictions: the data that goes into weather predictions are updated continuously to get the most accurate predictions.

Finally, rather than collect measurements directly, we can also use existing data that was collected by others or for some other purpose. There are lots of databases that are freely available for public use. We can even compile data from a variety of sources and join them together before an analysis.

### Data sources

Many organizations house all kinds of data. Datasets are often kept private or can only be accessed for a fee. This may be done for reasons like protecting the identity of individuals, keeping valuable information from competitors, or making a profit from data collection.

The following list has links and descriptions for websites that provide free access to some interesting datasets. The companies and organizations on this list provide public access to data, allowing anyone with internet access to view this information. The websites vary in how they provide data access: some may have a CSV or Excel file of data that can easily be downloaded to a computer, while others allow access to a database via an API.
 
* [World Health Organization (WHO)]: Data available on the WHO's site cover a variety of health-related topics, such as COVID-19, air pollution, and even brain health. There are fact sheets and direct access to various datasets, including: 
 * [Mental health]
 * [Road traffic mortality]
* [FiveThirtyEight]: This is a very popular analysis website that provides direct access to some of their datasets. Topics include sports, politics, science & health, culture, and economics. Check out some of these interesting finds:
 * [Political polls]
 * [The best NBA players]
* [Data.gov]: The U.S. government has its own open data collection. The site includes information on agriculture, climate, energy, and many other topics. Here are a few unique datasets:
 * [Maritime limits and boundaries]
 * [Storm Events Database]
 * [Census data for the United States of America]
 * [Census data for EU countries]
* [Data Unicef]: UNICEF’s Data and Analytics team provides global access to data on children. This organization believes that the right data in the right hands can help us make informed and equitable decisions. You can view a variety of topics and data from various countries.

![Image of computer screen with Google search homepage open in a web browser.]

Looking for something specific? [Google Dataset Search] works like a Google search bar for datasets. We think the following datasets look really interesting!
* [Orchids] &mdash; Did you know the total value of trees, plants, and flowers exported from the Netherlands in 2020 was nearly 9.8 billion euros? 
* [Biodiversity at U.S. national parks] &mdash; Did you know that Haliaeetus leucocephalus (also known as a Bald Eagle) can be found in just about every U.S. National Park? Check out this data file to explore animal and plant species that have been identified and verified by evidence in national parks.
* [Revenue of the cosmetic & beauty industry in the U.S.] &mdash; Talk about big money: the revenue of the U.S. cosmetic industry was estimated to amount to about 49.2 billion U.S. dollars in 2019.

Learn about where we get data and what we need to consider ethically.

Data Collection

## Overview
In this project, you'll practice using summary statistics from real data. 

You will:

* interpret summary statistics in context
* make choices about which summary statistics are appropriate based on the situation
* answer free-response questions and compare your answers to a sample solution available after you submit your answer
* answer toggle questions that can be clicked on to toggle an explanation or solution
* explore some really interesting data!

As an added bonus, some toggle prompts are labeled _"Just for Fun"_ and don't have to be answered. These prompts provide extra interesting details about the data for your enjoyment!

## Motivation
You just got a very cool job in the film industry! Your first assignment is to do some research on the content produced by streaming services in the last few years. You're putting together information for a report on the kinds of films being produced as well as any patterns that might be worth exploring further.

## Dataset
You decide to start your research by exploring some data about films and documentaries produced by Netflix. The dataset you'll be using is a modified version of one found on the website [Kaggle](https://www.kaggle.com/luiscorter/netflix-original-films-imdb-scores). Your dataset includes 503 films with the following variables:

* `title`: title of the film
* `genre`: genre of the film
* `language`: primary language of the film
* `year`: year the film premiered
* `runtime`: length of the film in minutes
* `score`: film rating of 1 to 10 (worst to best) from the website [IMDb](https://www.imdb.com/)

## Individual Variables
#### Language
You decide to start with the `language` variable. The table that follows gives the count of films in each language.

![Table showing counts of languages. English: 360, Spanish: 29, Hindi: 27, French: 15, other single: 51, multiple: 21.](https://static-assets.codecademy.com/Courses/data-literacy/stats/project/language.svg)
 

Using the table, try answering the following individual questions. Then use your answers to write up a brief summary about the primary languages of the films.

 
<details>
<summary>There are clearly a lot of films that have English as their primary language. Of the 503 films, what proportion have English as their primary language? (Click to Toggle Correct Answer)</summary>
There are 360 films with English as the primary language. To get the proportion, we divide 360 by the total of 503 films: 360 &#247; 503 = 0.72. About 0.72 of the films have English as their primary language.
</details>
 

<details>
<summary>What is the ratio of English-language films to films in a single language that is NOT English? (Click to Toggle Correct Answer)</summary>
We know there are 360 English-language films, but we have to do a little work to find the number of other single-language films. We can add the four categories for Spanish, Hindi, French, and "other single" (29 + 27 + 15 + 51 = 122). Or we can subtract the English and "multiple" categories from the total (503 - 360 - 21 = 122). This means the ratio of English films to non-English single-language films is 360 to 122. Since 360 &#247; 122 is 2.95, this means there are almost 3 English films for every non-English film.
</details>
 

<details>
<summary>What proportion of the films has multiple primary languages? (Click to Toggle Correct Answer)</summary>
There are 21 films that have multiple primary languages. Dividing 21 by 503 gives a proportion of 0.04.
</details>
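
If you'd like to check this kind of arithmetic with code, here is a minimal Python sketch. The counts are copied from the language table; the variable names are just illustrative choices.

```python
# Counts copied from the language table.
language_counts = {
    "English": 360,
    "Spanish": 29,
    "Hindi": 27,
    "French": 15,
    "other single": 51,
    "multiple": 21,
}

total_films = sum(language_counts.values())

# Proportion: each count divided by the total number of films.
proportions = {
    language: count / total_films for language, count in language_counts.items()
}

# Ratio: English films compared with all other single-language films.
other_single_language = sum(
    count
    for language, count in language_counts.items()
    if language not in ("English", "multiple")
)
english_ratio = language_counts["English"] / other_single_language

for language, proportion in proportions.items():
    print(f"{language}: {proportion:.2f}")
print(f"English to other single-language films: {english_ratio:.2f} to 1")
```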
 

<FreeResponseQuestion id="2db85de2b8e640b28886d69b6110923b"/>


<details>
<summary>Just for Fun: About 4% of the films have multiple primary languages. Find out which genres tend to have this interesting feature. (Click to Toggle the Information)</summary>
Of the 21 films with multiple primary languages, 16 are documentaries. This makes a lot of sense if the documentary makers are using a different language than the documentary subjects.
</details>
 

#### IMDb Score
You are excited to take a look at all the IMDb scores! Are most films rated around the same score? Are there some extremely low or high scores?

![Histogram plot showing the distribution of IMDb scores that range from 1 to 10. The bars form a bell shape centered around 6, with most values between 4 and 8.](https://static-assets.codecademy.com/Courses/data-literacy/stats/project/imdb-distribution.svg)

**Mean:** 6.3

**Standard Deviation:** 1.0

<FreeResponseQuestion id="431e39f4e5214663937930c2aefdb5d7"/>


<details>
<summary>Just for Fun: Find out which films got the highest and lowest IMDb scores! (Click to Toggle the Information)</summary>

Highest Scoring Films

|TITLE|GENRE|LANGUAGE|YEAR|RUNTIME|SCORE|
|---|---|---|---|---|---|
|David Attenborough: A Life on Our Planet|Documentary|English|2020|83|9.0|
|Emicida: AmarElo - It's All For Yesterday|Documentary|other single|2020|89|8.6|
|Springsteen on Broadway|Other|English|2018|153|8.5|
|Taylor Swift: Reputation Stadium Tour|Other|English|2018|125|8.4|
|Ben Platt: Live from Radio City Music Hall|Other|English|2020|85|8.4|

Lowest Scoring Films

|TITLE|GENRE|LANGUAGE|YEAR|RUNTIME|SCORE|
|---|---|---|---|---|---|
|Enter the Anime|Documentary|multiple|2019|58|2.5|
|Dark Forces|Horror/Thriller|Spanish|2020|81|2.6|
|The App|Action/Sci-Fi|other single|2019|79|2.6|
|The Open House|Horror/Thriller|English|2018|94|3.2|
|Kaali Khuhi|Other|Hindi|2020|90|3.4|

</details>
 

#### Genre
Your colleague is helping you with your research. They will be writing up the summary for the `genre` variable.

<FreeResponseQuestion id="2a92ee2f2fb34fb781920dda3647d16f"/>


#### Runtime
Your colleague used analytics software to create a summary of the `runtime` variable. The software program has a default setting for numeric variables that outputs a distribution plot, the mean, and the standard deviation. The analytics software produced the following plot and statistics for the runtimes:

![Histogram plot showing the distribution of runtimes that range from nearly 0 to just over 200 minutes. There is a long tail of low values that lead to a bell-shaped group of bars centered around 100 minutes.](https://static-assets.codecademy.com/Courses/data-literacy/stats/project/runtime-distribution.svg)

**Mean:** 92.5

**Standard Deviation:** 28.4

<FreeResponseQuestion id="f5ba9dd795574fa0aff1a9444d7bbda1"/>
<FreeResponseQuestion id="a4a973c65e75483d8183d47cb706f7f2"/>

<details>
<summary>Just for Fun: Based on the distribution plot, what is your guess for the median and IQR of runtimes? Find out if your guess was close! (Click to Toggle the Median and IQR)</summary>

**Mean:** 92.5

**Standard Deviation:** 28.4

The mean describes a typical runtime as being in the low 90s. The standard deviation describes the distribution as having wide variability, with runtimes an average of almost 30 minutes different from the mean. These measurements are not wrong, but they don't help us do a good job of summarizing what we're seeing in the distribution.

**Median:** 97.0

**IQR:** 21.8

In contrast, the median describes a higher runtime as most typical. The low IQR indicates that half the values aren't very far from the center value. These descriptions better match the large number of values near 100 that we see in the distribution plot.

Since the mean is less than the median, it seems like the left-skew is more influential on the mean than the high potential outlier is.

</details>
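
To see how a long tail of low values can pull the mean below the median, here is a minimal sketch using Python's built-in `statistics` module. The runtimes are invented for illustration and are not taken from the Netflix dataset, and `quantiles` uses its default method, which is only one of several conventions for computing quartiles.

```python
from statistics import mean, median, quantiles, stdev

# Invented runtimes in minutes: a cluster near 100 plus a tail of short films.
runtimes = [10, 25, 60, 90, 95, 97, 98, 100, 102, 105, 110]

runtime_mean = mean(runtimes)
runtime_median = median(runtimes)
runtime_stdev = stdev(runtimes)

# quantiles(..., n=4) returns the three cut points Q1, Q2, and Q3.
q1, q2, q3 = quantiles(runtimes, n=4)
runtime_iqr = q3 - q1

print(f"mean = {runtime_mean:.1f}, standard deviation = {runtime_stdev:.1f}")
print(f"median = {runtime_median}, IQR = {runtime_iqr}")
```

The few very short films drag the mean down and inflate the standard deviation, while the median and IQR stay anchored to the cluster of values near 100.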
 


<details>
<summary>Just for Fun: Find out which movie has that really high runtime. (Click to Toggle the Information)</summary>

|TITLE|GENRE|LANGUAGE|YEAR|RUNTIME|SCORE|
|---|---|---|---|---|---|
|The Irishman|Drama|English|2019|209|7.8|

This film is about 3 and a half hours long!
</details>
 

## Relationships
Next, you want to explore some relationships between variables. Aggregating data across genres and languages may give you some insights.

#### IMDb Score by Genre
You want to know if any genres have particularly high IMDb scores, so you look at a table of means and standard deviations of IMDb scores for each genre.

![An 8 by 3 table showing genres, means, and standard deviations of IMDb scores. Row 1: Action/Sci-Fi, 5.7, 1.0. Row 2: Animation, 6.6, 0.9. Row 3: Comedy, 5.8, 0.8. Row 4: Documentary, 7.0, 0.8. Row 5: Drama, 6.3, 0.8. Row 6: Horror/Thriller, 5.6, 1.0. Row 7: Other, 6.4, 1.0. Row 8: Romance/Romantic Comedy, 5.9, 0.6.](https://static-assets.codecademy.com/Courses/data-literacy/stats/project/imdbscore-genre.svg)

<FreeResponseQuestion id="d642ebfa2a984dc59ed549cad85ba54c"/>


#### Runtime by Language
You are wondering if there are any differences in the length of films across languages. 
![A 6 by 3 table showing languages, means, and standard deviations of runtimes. Row 1: English, 92, 29. Row 2: Spanish, 93, 25. Row 3: Hindi, 115, 17. Row 4: French, 91, 13. Row 5: other single, 95, 24. Row 6: multiple, 73, 36.](https://static-assets.codecademy.com/Courses/data-literacy/stats/project/runtime-language.svg)

<FreeResponseQuestion id="50d9c6f66e9d40209570201e25fe220f"/>

#### Runtime and IMDb Score
For your final exploration, you've just added the following plot and summary statistic to your report.
![Scatter plot of IMDb scores by runtimes. The points do not form any line or obvious pattern. Most points are in a mass centered between 50 and 150 minutes and scores of 4 to 8.](https://static-assets.codecademy.com/Courses/data-literacy/stats/project/scatter-scrn.svg)

**Correlation Coefficient:** 0.92

Your colleague is concerned there may be an error. The plot and the correlation coefficient don't match. 

<FreeResponseQuestion id="ddfd122b0624447bbf5451a68e07c9f7"/>
<FreeResponseQuestion id="2d0576e376204d4ba94f014d01fa4894"/>


<details>
<summary>Just for Fun: The correlation coefficient was incorrect. What do you think the true correlation coefficient is? Find out if your guess is correct! (Click to Toggle the Information)</summary>
The true correlation coefficient is -0.04. The sign indicates a negative relationship, but the value is so close to zero that we should conclude there's really no linear relationship between the variables.
</details>
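
For a sense of how a correlation coefficient relates to the shape of a scatter plot, the sketch below computes Pearson's correlation coefficient by hand for two small, invented datasets (not the Netflix data). The `pearson_r` helper is written out for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(variation_x * variation_y)


# Invented data for illustration.
runtimes = [60, 80, 100, 120, 140]
patternless_scores = [6.0, 5.0, 7.0, 5.0, 6.0]  # no trend as runtime increases
linear_scores = [5.0, 5.5, 6.0, 6.5, 7.0]       # rises steadily with runtime

r_patternless = pearson_r(runtimes, patternless_scores)
r_linear = pearson_r(runtimes, linear_scores)

print(f"patternless: r = {r_patternless:.2f}")
print(f"linear:      r = {r_linear:.2f}")
```

A coefficient near 1 goes with points that line up along a rising line, while a shapeless cloud like the one in the plot goes with a coefficient near 0.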
 

## Final Thoughts
You finished your first movie research report! You did a lot of exploring and learned a lot about the different features of Netflix content. These findings could be the final step or the starting point of a deeper analysis. Great job!

Explore Netflix data with your new understanding of summary statistics!

Movie Statistics Project

The frequency and proportion of each color in the dataset.

The mean and standard deviation for the color variable.

The median and interquartile range of the color variable.

The distributions for the living room and dining room have the same center value, but the distribution for the living room has a wider spread.

The center of the distribution for the kitchen is 1.9 hours.

The average distance from the center for values in the bathroom's distribution is 1.2 hours.

The distributions for the living room and dining room have the same spread, but the distribution for the living room has a higher center value.

The distribution may be considered skewed because it is asymmetrical with a longer tail on the left side.

This distribution may be considered normal because it is symmetrical and bell-shaped.

There is no change in frequency across this distribution.

Half of the ages in the distribution are less than 65.4 and the other half are greater than 65.4. The middle 50% of the data spans 7.2 years. 

The second quartile is greater than 65.4.

Quiz yourself on basic statistics concepts!

Statistical Thinking

Learn about useful descriptive statistics and their practical application.

Let's imagine we are working for the city government of the fictional city of Melody Metropolis. The mayor of Melody Metropolis wants to know more about the musicians who currently live in the city. The learning environment shows a dataset we have on musicians living in the city as of last year. How would you describe this dataset? See if you can answer any of the following questions:

* What does a typical musician's income look like? 
* Is there a wide range of musician ages?
* What proportion of the musicians in the dataset play guitar?

We can try to make generalizations by looking over the rows and columns, but it's difficult to answer these questions precisely. We need some kind of "data vocabulary" that can help us measure and describe the variables in the dataset. *Summary statistics* can be used for exactly this purpose! 

With a basic understanding of summary statistics, we can communicate and understand a lot more specific information about the musicians in the city. But learning statistics is often associated with a lot of negativity:

* Memorization of lots of math formulas
* Long calculations done by hand
* Confusing or meaningless interpretations

None of these struggles need to be part of learning to use statistics. In this lesson, we'll gain a conceptual understanding of how summary statistics can easily help us communicate and interpret our dataset.

Introduction

To start our summary for the mayor, let's describe some of the categorical variables in the musician dataset &mdash; those variables that contain qualitative information on the city's musicians. First, let's look at information about the `title` variable, which tells us the job title each musician holds. 

The following table shows:

* **frequency**: the count of musicians for each job title
* **proportion**: the frequency divided by the total number of musicians
* **percentage**: the proportion converted from a decimal to a percentage

![A 6 by 4 table with columns labeled job title, frequency, proportion, and percentage. Row 1: performer, 333, 0.35, 35%. Row 2: manager 113, 0.12, 12%. Row 3: producer, 87, 0.09, 9%. Row 4: educator, 239, 0.25, 25%. Row 5: composer, 186, 0.19, 19%. Row 6: total, 958, 1.00, 100%.](https://static-assets.codecademy.com/Courses/data-literacy/stats/ex2-jobs-table-narr.svg)

From the table, we can learn about the different job titles of musicians in the city. There are 333 performers out of a total of 958 musicians. The proportion of performers is 333 &#247; 958 = 0.35. To make this even easier to understand, we can convert the proportion to a percentage by multiplying it by 100: 35% of musicians in the city are performers.

We can also compare one category to another by checking the ratio of their frequencies. For example, there are far fewer managers than performers. Their ratio is 333 performers to 113 managers, which can be simplified by dividing: 333 &#247; 113 = 2.95. This means there are almost 3 performers for every manager in the city.
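
The arithmetic behind the table can be reproduced in a few lines of code. The following is a minimal sketch in Python that uses the frequencies from the table; the variable names are illustrative and not part of the lesson's dataset.

```python
# Frequencies taken from the job-title table above.
frequency = {
    "performer": 333,
    "manager": 113,
    "producer": 87,
    "educator": 239,
    "composer": 186,
}

total = sum(frequency.values())  # total number of musicians

# Proportion = frequency / total; percentage = proportion * 100.
proportion = {title: count / total for title, count in frequency.items()}
percentage = {title: p * 100 for title, p in proportion.items()}

for title, count in frequency.items():
    print(f"{title:<10} {count:>4} {proportion[title]:.2f} {percentage[title]:.0f}%")

# Ratio of performers to managers: one frequency divided by the other.
performers_per_manager = frequency["performer"] / frequency["manager"]
print(f"performers per manager: {performers_per_manager:.2f}")
```

The same division works for any pair of categories, so it can be reused to check the ratios that follow.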

Try finding some other ratios from the table:

<details>
<summary>What is the ratio of educators to composers? (Click to Toggle Correct Answer)</summary>
The ratio is 239 educators to 186 composers. Simplified, this ratio is 239 &#247; 186 = 1.28, which is a little more than 1 educator for every composer. There are not many more educators than composers in the city.
</details>
 

<details>
<summary>What is the ratio of performers to producers? (Click to Toggle Correct Answer)</summary>
The ratio is 333 performers to 87 producers. Simplified, this ratio is 333 &#247; 87 = 3.83, which means there are about 4 performers for every producer.
</details>

Describing Categorical Variables

Now that we've learned about some of the categorical variables in our musician dataset, it's time to explore some numeric variables &mdash; those with quantitative data. There are a lot of ways we can describe the **distribution** of a numeric variable. A distribution is a function that shows all possible values of a variable and how frequently each value occurs. This may sound pretty technical, but visualizing the distribution can make it easy to understand.

In the learning environment, the distribution of musician ages is plotted with age on the x-axis and frequency on the y-axis. From this plot, we can see:

* Ages range from about 15 to 70.
* There are few musicians under 30 or over 50 years old.
* There are a lot of musicians between the ages of 30 and 50.

This distribution might be considered bell-shaped or hill-shaped and symmetrical. This is actually a very common pattern and is called a **normal distribution**.

![Plot of a normal distribution. The bars in the plots become larger and then smaller moving left to right, forming a bell shape.](https://static-assets.codecademy.com/Courses/data-literacy/stats/ex3-normal-graphic.svg)

Viewing a plot or knowing a variable is normally distributed gives us some general information, but still nothing specific. We need exact measurements to describe where the center of the distribution is and how wide the values are spread away from that center. There are several sets of statistics we may use for these measurements, and we will need to know when to use which combination.


A histogram plot of the musician age distribution with age in the x-axis and frequency on the y-axis. Age ranges from 0 to 70 years. Frequency ranges from 0 to 200. The frequency bars are in a bell shape. 

Describing Numeric Variables

Let's return to our visualization of the distribution of musician ages.

![Bar plot showing the distribution of ages. Ages range from about 15 to 70. The frequencies determine the heights of the bars. The bars form a bell-shaped pattern.](https://static-assets.codecademy.com/Courses/data-literacy/stats/ex4-age-narr_corr.png)

* What would you say is the typical age of a musician in Melody Metropolis? 
* Are most musicians about this age, or are there lots of musicians of many different ages?

To answer these questions more specifically, we should take some measurements of our variable.

* The **mean**, also called the average, describes the center of a numeric distribution by adding all values and dividing by the count.
* The **standard deviation** describes the spread of values in a numeric distribution relative to the mean. It is calculated by finding the average squared distance from each data point to the mean and then taking the square root of the result.

The mean age of musicians is 40.6 years and the standard deviation is about 9.3 years. We might interpret this standard deviation as moderate variability in age. Had the standard deviation been 1 year, we might say there's hardly any variability in age. The plot of this narrow distribution might look something like the following:

![Bar plot showing the distribution of ages when the standard deviation is 1. Ages range from about 37 to 43 years. The frequencies determine the heights of the bars. The bars form a very narrow bell-shaped pattern.](https://static-assets.codecademy.com/Courses/data-literacy/stats/ex4-age-lowsd.svg)

The mean and standard deviation are common choices, especially for normal distributions. Their mathematical formulas have special properties that make them easy to use in other contexts, such as statistical testing. However, the mean and standard deviation are not always the best measurements to describe a distribution.
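
To make the two definitions concrete, here is a small sketch in Python that follows them step by step. The ages are made up for illustration and are not the Melody Metropolis data. The sketch divides by the count, as described above; many statistical tools divide by the count minus one when working with a sample, which gives a slightly larger result.

```python
import math

# Hypothetical ages for illustration; not the Melody Metropolis data.
ages = [28, 35, 38, 41, 44, 47, 54]

# Mean: add all values and divide by the count.
mean_age = sum(ages) / len(ages)

# Standard deviation: average the squared distances from the mean,
# then take the square root of that average.
squared_distances = [(age - mean_age) ** 2 for age in ages]
variance = sum(squared_distances) / len(ages)
std_dev = math.sqrt(variance)

print(f"mean: {mean_age:.1f} years, standard deviation: {std_dev:.1f} years")
```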

Mean and Standard Deviation

As we're moving through the numeric variables in our musician dataset, we come across some interesting details when we inspect the income variable.

![Plot showing the distribution of the income variable. The incomes range from about $10,000 to $90,000. The bars indicating frequencies are bell-shaped but asymmetrical with a long tail on the right.](https://static-assets.codecademy.com/Courses/data-literacy/stats/ex5-income-large.svg)

![Table showing the mean of $34,795 and the standard deviation of $11,971 for the income variable](https://static-assets.codecademy.com/Courses/data-literacy/stats/ex5-mean-table.svg)

1. We notice that the shape of the distribution is different than the shape of the age distribution. There are quite a few musicians with higher incomes that are creating a longer tail on the right side.
2. We also notice that the mean indicates that the typical income is $34,795. This value seems a little high since most of the incomes seem to be between $15,000 and $40,000.

What we have learned is that the income distribution is skewed. A **skewed** distribution is asymmetrical with a steep change in frequency on one side and a flatter, trailing change in frequency on the other. Specifically, the income distribution is right-skewed (also called positively-skewed) because the tail is on the right side.

So why does the mean seem wrong? Remember, the mean is the sum of all the values in the dataset divided by the total count. That sum is made very large by all the higher incomes in that right tail. This makes the mean a greater number than we would like it to be. When the data are skewed, the mean may not be the best measure of a typical observation.

There are a number of ways to deal with this issue. We will handle the problem with the income data by taking some alternative measurements.

Skewed Distributions

Let's find an alternative measure to the mean. We want to find a value that represents the typical musician income, but we don't want to use the actual values in the computation because the data are skewed.

One method would be to find the middle value when all values are arranged from smallest to largest. This value is called the **median**, but it's also referred to as the 50th percentile or the second quartile (Q2).

Let's look at some simple data in the learning environment where the median (Q2) is 13. Half the data points are less than 13, and half are greater than 13. 

These data have a range of 22 (28-6), running from 6 to 28. We could use this as our measure of spread, but what if the highest number wasn't 28 but 280? The median would still be 13, but now the range is 274 (280-6), which doesn't tell us a lot about the bulk of the data. 

A better measurement might be the **interquartile range (IQR)**. A quartile is simply a marker for a quarter (25%) of the data.
* The first quartile marks 25% (Q1 = 10).
* The second quartile marks 50% (Q2 = 13 &mdash; the median).
* The third quartile marks 75% (Q3 = 22).

The IQR is the difference between Q3 and Q1 (22-10 = 12), marking the range for just the middle 50% of the data.
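
The comparison between the range and the IQR can be checked with a short sketch in Python. The values below are hypothetical, chosen only so that the quartiles match the ones quoted above (Q1 = 10, Q2 = 13, Q3 = 22); the data in the learning environment may differ. Quartile conventions also vary slightly between tools; this sketch relies on the default method of `statistics.quantiles` from the Python standard library.

```python
import statistics

# Hypothetical values chosen to match the quartiles quoted above
# (Q1 = 10, median = 13, Q3 = 22); the lesson's actual data may differ.
data = [6, 10, 11, 13, 18, 22, 28]

def summarize(values):
    """Return (median, range, IQR) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return q2, max(values) - min(values), q3 - q1

print(summarize(data))

# Replace the largest value with an extreme one: the range balloons,
# while the median and IQR stay where they were.
with_extreme = data[:-1] + [280]
print(summarize(with_extreme))
```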

Let's find out how the median and IQR work out for our income data.

![Table showing the median of income is $32,978 and the IQR is $17,150.](https://static-assets.codecademy.com/Courses/data-literacy/stats/ex6-income-stats.svg)

This looks better &mdash; the median of $32,978 is lower than the mean of $34,795 and seems more typical.

Median and IQR

Image of two distribution plots titled "Income Distribution." The top plot shows a right-skewed distribution with a median of $32,978 and a mean of $34,795. The bottom plot shows the same distribution with three additional incomes from Paul McCartney, BTS, and Beyonce. This plot has a median of $33,011 and a mean of $228,235.

For our income data, the difference between the mean ($34,795) and median ($32,978) was only about $2,000. You may be wondering: Is the difference ever larger?

Let's imagine some very famous celebrity musicians have all decided to move to Melody Metropolis. We know celebrities make much more money than the typical musician in our dataset. In the learning environment, we've added three new incomes to the distribution:

* **$48 million**: Paul McCartney, British musician of the Beatles 
* **$57 million**: BTS, South Korean K-pop band
* **$81 million**: Beyoncé, American singer-songwriter

The second plot shows that the median appears almost unaffected by the addition of these three gigantic incomes: the median moves from $32,978 to $33,011. However, the mean makes a drastic change from $34,795 to $228,235. The mean is now well beyond even the maximum in the original distribution. An income of $228,235 is definitely not a great measure of the center of our income distribution.

These celebrity incomes are examples of **outliers**, extreme values that are distant from the rest of the distribution. Just as with skewness, outliers tend to more heavily influence the mean than the median. This same pattern occurs with measures of spread: the standard deviation is more influenced by outliers and skewness than the interquartile range (IQR). 

Because the median and IQR are NOT heavily influenced by extreme values, we say they are **robust**. Robust statistics are often a better choice to measure the center and spread of a distribution that is skewed or has outliers.
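
As a rough illustration of robustness, the sketch below adds the three celebrity incomes to a small set of made-up incomes and compares the statistics before and after. The starting incomes are an assumption for demonstration (evenly spaced values, not the lesson's dataset), so the exact numbers will not match the plots, but the pattern is the same.

```python
import statistics

# Hypothetical incomes (USD): 40 evenly spaced values from 15,000 to 54,000.
incomes = list(range(15_000, 55_000, 1_000))

# The three celebrity incomes quoted in the lesson.
celebrities = [48_000_000, 57_000_000, 81_000_000]
with_outliers = incomes + celebrities

def describe(values):
    """Return center and spread measures for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.pstdev(values),
        "iqr": q3 - q1,
    }

before = describe(incomes)
after = describe(with_outliers)

# The mean and standard deviation jump; the median and IQR barely move.
for name in before:
    print(f"{name:<8} before: {before[name]:>12,.0f}  after: {after[name]:>12,.0f}")
```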

Source:

[The World’s Top-Earning Musicians Of 2019](https://www.forbes.com/sites/zackomalleygreenburg/2019/12/06/the-worlds-top-earning-musicians-of-2019/?sh=87be291164e7)

Outliers and Robust Measures

One measure that we haven't covered that is usually talked about alongside the mean and median is the mode. The mode is defined as the value with the highest frequency, but we can also think of the mode as the value where the peak of the distribution occurs. While not great for computations, the mode can help us identify interesting features in a variable.

For instance, there might be more than one mode, such as in our distribution of years of experience. In the following plot, we can see there's one peak near the 10-year mark and another near the 30-year mark. We would call this distribution _bimodal_ because it has two modes.

![Histogram plot showing frequencies for the experience variable. There are two peaks: one at around 10 years and one at around 30 years. The title is "Bimodal distribution".](https://static-assets.codecademy.com/Courses/data-literacy/stats/ex8-bimodal.svg)

Sometimes bimodal distributions occur when there are differences across categories of another variable. Given that the city seems to have a lot of young people in bands, let's see if this pattern is reflected when we find the mean of each category of the `band` variable.

![Table showing mean years of experience for those in a band and those not. Those in a band have a mean of 14.4 years of experience. Those not in a band have a mean of 26.2 years of experience.](https://static-assets.codecademy.com/Courses/data-literacy/stats/ex8-table.svg)

These means are very different and very close to the locations of the modes in our plot. This indicates that there may be some differences in experience level between these two groups that are showing up in our distribution plots as two peaks.

By making this separation and then summarizing with the mean, we have **aggregated** our data. In this case, we have aggregated by summarizing a numeric variable (`experience`) across each value of a categorical variable (`band`).
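
The two steps of aggregation (separate, then summarize) can be sketched in Python as follows. The records are invented for illustration; only the variable names `band` and `experience` come from the lesson.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; each has a band membership flag and years of experience.
musicians = [
    {"band": "yes", "experience": 12},
    {"band": "yes", "experience": 15},
    {"band": "yes", "experience": 16},
    {"band": "no", "experience": 24},
    {"band": "no", "experience": 27},
    {"band": "no", "experience": 30},
]

# Step 1: separate the numeric values by category.
experience_by_band = defaultdict(list)
for musician in musicians:
    experience_by_band[musician["band"]].append(musician["experience"])

# Step 2: summarize each group with the mean.
mean_experience = {band: mean(values) for band, values in experience_by_band.items()}

for band, value in mean_experience.items():
    print(f"in a band: {band:<3} mean experience: {value:.1f} years")
```

Swapping `mean` for another summary, such as the median, would aggregate the same groups with a different statistic.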

We have aggregated some other data in tables in the learning environment. Do you see any interesting patterns?

An image of 3 tables titled "Aggregate Statistics." The first table shows median income by title: performer $40,387, manager $27,984, producer $36,897, educator $29,387, composer $35,027. The second table shows median and IQR of income by band membership: yes $31,683 $22,234, no $32,370 $15,984. The third table shows mean experience by instrument: voice 19 years, guitar 14 years, piano 34 years, drums 10 years, saxophone 28 years, violin 36 years.

Aggregate Data

Aggregating data is a way of exploring variable relationships. We specifically looked at relationships between a numeric variable and a categorical variable, but we should also examine relationships between two numeric variables.

For example, we might wonder: Does musician income vary with years of experience? To start, we can take a look at a **scatter plot** with experience on the x-axis and income on the y-axis. Each point in the plot represents a musician, and the coordinates of that point are the musician's experience (x) and income (y).

![Scatter plot showing income in USD by experience in years. The points form a pattern moving from the lower left to the upper right of the plot. They are not in a perfect line though. There is a range of incomes for every amount of experience.](https://static-assets.codecademy.com/Courses/data-literacy/stats/ex9-IEscatter.svg)

The cloud of points in the plot has a pattern. The points move from the lower left to the upper right part of the plot. In other words, lower levels of experience tend to be associated with lower incomes, and higher levels of experience tend to be associated with higher incomes. The points don't form a perfect line though &mdash; there is some variation.

We can describe this relationship more precisely by measuring the **correlation coefficient**. This number ranges from -1 to +1 and tells us two things about a linear relationship:

1. **Direction**: A positive coefficient means that higher values in one variable are associated with higher values in the other. A negative coefficient means higher values in one variable are associated with lower values of the other.
2. **Strength**: The farther the coefficient is from 0, the stronger the relationship and the more the points in a scatter plot look like a line.

The correlation coefficient for income and experience is 0.74 &mdash; the relationship is positive and moderately strong.
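
The lesson does not say which correlation coefficient was used; the sketch below assumes the Pearson coefficient, which is the usual choice for linear relationships. The experience and income values are made up for illustration, so the result will not equal 0.74.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together, and how much each varies on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)

# Hypothetical data: years of experience and income in thousands of USD.
experience = [2, 5, 8, 12, 20, 25, 30]
income = [22, 30, 27, 35, 33, 48, 41]

r = pearson_r(experience, income)
print(f"correlation coefficient: {r:.2f}")
```

Points that fall exactly on an upward-sloping line give a coefficient of +1, and points on a downward-sloping line give -1; real data usually lands somewhere in between.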

Variable Relationships

Congratulations! We've finished our summary report for the mayor of Melody Metropolis, and you've just completed the lesson on statistical thinking! We covered a lot of information, including:

* Summary statistics give us measurements to describe our data.
* Categorical variables can be described with frequencies and proportions.
* Numeric variables can be described using measures of center and spread.
* Sometimes we need robust statistics like the median and IQR because our data is skewed or has outliers.
* We can aggregate data to get a look at relationships between pairs of variables.
* Scatter plots and correlation coefficients help us describe linear relationships between pairs of numeric variables.

And hopefully, the most important discovery you made is that learning statistics doesn't have to be a struggle. We gain valuable skills just by understanding some basic concepts. Enjoy using your new data vocabulary!

Image of a report with pie charts and percentages. The report is on a desk with other objects and papers.

Review

We'll cover common pitfalls in visualizations...

Have you ever used a map that was just… wrong? As in, “this entire building doesn’t exist on the map” wrong, or “it almost had me drive down a one-way street” wrong. You notice pretty quickly, right?

What about a map where countries are the wrong size? How long do you think it would take to notice that something was off?

Anyone who has ever learned geography from a map has learned something “wrong.” Mapmakers have known for hundreds of years that it’s just not possible to transfer the earth’s 3D information to a 2D map without changing the shape or size of the continents and oceans – something has to be sacrificed.

(Check out this GIF showing a Mercator projection versus the actual size of countries. The Mercator projection, invented in **1569** and widely used to teach geography through the 20th century and today, severely distorts the size of land masses near the poles.)

The same idea is true in data visualization. Data visualizations take information from the living, breathing world that we inhabit and show it to us in just a few square inches of screen or paper. That always involves prioritizing speed, or accuracy, or sample size, or cost, or another factor, at the expense of something else. 

While the best graphs really do teach us something new, and help us understand a deeper truth using data, graphs can also be misleading, either intentionally or unintentionally. We shouldn’t ever assume that a data visualization shows us the truth, the whole truth, and nothing but the truth. 

This lesson will help us recognize the elements of misleading and confusing graphs so that we can avoid making them ourselves. 


A gif of a world map showing the difference between the Mercator projection and actual relative size of each country. In a Mercator map, country sizes are normal at the equator but much larger than reality at the poles. This change is therefore most obvious with Russia, Canada, and Antarctica. 

Axes and scaling are like the page layout and spacing of a paper book: they’re not the most exciting parts, but they do present plenty of opportunities to make it harder to read.

Let’s start with axes – the x-axis (horizontal, left-right) and y-axis (vertical, top-bottom). A common misleading aspect of an axis is a **break**. A break starts the count at a number that’s not zero, or jumps ahead – this can distort the amount of difference between data points by removing context, and make small differences in data seem bigger. 

Here’s an illustration of that idea: it would be almost impossible to tell at a glance if there were 100, 105, or 110 people standing in a room – but you’d be able to easily tell the difference between 0, 5, and 10 people standing in a room. Using a break on an axis can have the same effect, amplifying the **change** rather than the **context** because it alters the proportions in the visualization. Check out the graphs to the right to see what this looks like in practice. 

So what to do? If you’re looking at a graph, take a second to check where the axis starts. If there’s a break, factor that in as you think about what the numbers mean. 

If you’re making the graph, instead of using a big break…
* Keep enough context to view differences **in proportion** to a meaningful amount, OR
* Make two graphs, one without a break and one “zoomed in”, OR
* Choose a visualization type that shows the change, rather than the raw numbers
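
To see why a break exaggerates differences, it can help to compare the heights of the bars as they are actually drawn. The sketch below uses the attendance values from the example charts (100, 105, and 110). The starting point of 95 for the shortened axis is an assumption chosen for illustration, not a value taken from the charts.

```python
# Attendance values from the example charts in this lesson.
attendance = [100, 105, 110]

def drawn_heights(values, axis_start):
    """Height of each bar as drawn when the y-axis begins at axis_start."""
    return [value - axis_start for value in values]

full_axis = drawn_heights(attendance, axis_start=0)   # axis starts at zero
truncated = drawn_heights(attendance, axis_start=95)  # hypothetical shortened axis

# Compare the tallest bar to the shortest bar in each chart.
full_ratio = full_axis[-1] / full_axis[0]
truncated_ratio = truncated[-1] / truncated[0]

print(f"axis from 0:  tallest bar is {full_ratio:.1f}x the shortest")
print(f"axis from 95: tallest bar is {truncated_ratio:.1f}x the shortest")
```

The data is identical in both cases; only the drawn proportions change, which is what a viewer's eye picks up first.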


Two bar charts side by side. Both are titled "Event Attendance", with "Number of Attendees" on the y-axis and "Event Date" on the x-axis. Each chart has three bars (yellow, blue, orange) in increasing height from left to right. The difference is the y-scale on each chart. 

On the lefthand chart, the scale goes from 0 to 150 in intervals of 50. The bars show 100, 105, and 110, so are all clustered near the 100 line. The bars look relatively similar in height.

On the righthand chart, the scale starts at 0 but has a break (shown by a zigzag in the axis). The numbers pick back up at 100 and increase by 5s, so the axis ticks are 0, 100, 105, 110. As such, the heights of bars representing those numbers (100, 105, 110) stretch over the whole vertical space of the graph. The bars look relatively much more different in height than in the left graph. 

Axes

Two line graphs side-by-side. The lefthand graph is titled "Painkiller prescribing information, Linear y-axis." The x-axis shows "Hours from dosing" from 0 to 12. The y-axis shows "Concentration of painkiller in bloodstream" from 0 to 140, at evenly-spaced intervals of 20. There are 5 lines representing different doses of the drug. The three lowest doses (10, 20, and 40 mg) are relatively similar and never reach above a concentration of 40 in the bloodstream. The two highest doses (80 and 160 mg) show significant spikes in concentration, reaching up to concentrations of 90 and 120 (2 or 3 times more than the lower doses). 

The righthand graph is titled "Painkiller prescribing information, Log y-axis." The x-axis shows "Hours from dosing" from 0 to 12. The y-axis shows "Concentration of painkiller in bloodstream" from 0 to 100 on a log scale. This means the axis runs from 0 to 100, with 10 about halfway between those two numbers. As such, all the lines appear more or less flattened out, and all of them are clustered nearer to the center of the graph. It's still obvious that higher doses result in a higher concentration in the bloodstream, but the big spike that is clear in the linear graph is completely invisible on the log scale. 


Scaling refers to the distances between numbers on an axis. Almost all graphs use a **linear scale**, where the numbers count up by a consistent interval. Whether that interval is tenths of a centimeter or millions of dollars, if it stays the same along the whole axis, it’s a linear scale. 

The other scaling option is a **logarithmic scale**, a.k.a. log scale. The log scale is common for showing exponential growth that won’t fit on the page with a linear scale, but it’s almost never a good choice for a general audience. Unless people use log scales regularly, they tend to have trouble interpreting them correctly. 
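
One way to build intuition for the difference is to calculate where a value would sit on each kind of axis. The sketch below is a simplified illustration in Python; the axis limits (1 to 100) and the sample values are assumptions, not numbers from any particular graph.

```python
import math

def linear_position(value, low, high):
    """Fraction of the way up a linear axis running from low to high."""
    return (value - low) / (high - low)

def log_position(value, low, high):
    """Fraction of the way up a log axis running from low to high (low > 0)."""
    return (math.log10(value) - math.log10(low)) / (math.log10(high) - math.log10(low))

# Hypothetical readings on an axis that runs from 1 to 100.
for value in [10, 30, 90]:
    linear = linear_position(value, 1, 100)
    log = log_position(value, 1, 100)
    print(f"value {value:>3}: linear axis {linear:.2f}, log axis {log:.2f}")
```

On the log axis, 10 sits halfway between 1 and 100, and the larger values are squeezed together near the top. That squeezing is why big differences can look small on a log scale.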

Check out the graphs in the learning environment to see how the pharmaceutical company Purdue infamously used this misinterpretation to their advantage in the early 2000s. The linear scale shows how the concentration of a painkiller drug spikes sharply in the bloodstream at higher doses – the log scale makes it look like all doses behave pretty similarly. (These are reproductions of the original graphs, but we can definitely see how differently they represent the same numbers.)

In general, just like it’s always worth checking for a break, it’s always worth checking how a graph is scaled. 

Last thing about axes and scaling: **generally, we measure time horizontally**, putting that variable on the x-axis. For the vast majority of circumstances, this makes the most sense and helps readers to intuit what the graph measures. 


Scaling

Color is often the first thing we register when looking at data visualizations. There are three types of color scales, used for the three major types of relationships we can visualize with color. 

Sequential scales are colors in a sequence – often, this is the same hue with more and more white added to or taken away from the color. Sequential scales are used to show a variable increasing or decreasing in intensity or amount, like income, depth, or percent of population that owns a chinchilla. 

![sequential scale](https://static-assets.codecademy.com/Courses/data-literacy/misleading-confusing-charts/Color-scale-1.svg)

Divergent scales are anchored by colors from opposite sides of the color wheel, a.k.a. complementary colors. A divergent scale is used to visualize data where the middle is a baseline, and either side represents a contrasting change. For example, divergent scales do a good job of showing a positive/negative swing in voting or polling, temperatures above and below freezing, or gains and losses over time. 

![divergent scale](https://static-assets.codecademy.com/Courses/data-literacy/misleading-confusing-charts/Color-scale-2.svg)

Categorical scales use a variety of colors to differentiate categories without assigning a rank or order to them. In other words, “purple” doesn’t necessarily mean more than “green” – the two are just different colors. Categorical scales are for categorical data, like types of vegetables in a supermarket, or different treatments tested in a controlled study, or organizational blocks on a calendar. 

![categorical scale](https://static-assets.codecademy.com/Courses/data-literacy/misleading-confusing-charts/Color-scale-3.svg)

Sequential color scale: light blue, medium blue, dark blue.

Diverging color scale: orange, light warm gray, medium blue.

Categorical color scale: orange, deep purple, light green.

Color Scales

Two data visualizations comparing the number of men and women in specific situations. The first uses a light aqua/turquoise color for women and blue-ish purple for men to compare how representation at 7 levels of work (from entry level to C-suite) changes when women's performance is slightly undervalued. The second uses a light taupe/tan color for men and a dusty rose color for women to compare the number of deputy mayors, commissioners and agency heads in the Cabinets of Bill de Blasio and his 3 predecessor mayors of NYC. 

The first is labeled "From the op-ed “This Is How Everyday Sexism Could Stop You From Getting That Promotion,” published in The New York Times in October 2021." The second is labeled "From the article “Gender in De Blasio’s Cabinet” published in The New York Times in May 2014."

Once we’ve picked the right color scale, there are still a few more considerations to be made to reduce confusion.  

First up, we tend to view darker colors as “more” and lighter colors as “less.” For example, if we’re visualizing which US states have the most pet ferrets, California – with the most pet ferrets of any state – should be the darkest state on the map. When this scale is reversed, people will tend to just read the graph wrong rather than reading the legend carefully.

We also come to data visualizations with pre-existing associations for certain colors. These can be culturally specific (red means bad vs. red means lucky), or influenced by the norms for a particular field (red means negative financial balance).

Sometimes it’s good to stick with what’s recognizable: it would be confusing for US voters if a major newspaper decided to visualize Democrats in red and Republicans in blue, since these colors are overwhelmingly associated with the opposite party.

But in other cases, switching up colors that have existing cultural associations can reduce harmful stereotyping. Using pink for women and blue for men reinforces an outdated, binary view of women as soft and passive, and men as strong and unemotional. This design choice will not only turn off some viewers, it may also distract on a graph where gender is a relevant variable but not the whole focus. 

It would be confusing to just reverse this stereotypical color palette, but there are plenty of good alternatives – check out two examples from The New York Times on the right. The important thing is to be consistent with the alternative palette. 

Color Associations

A good title is one of the best and fastest tools for making a more understandable visualization. Lots of confusion can be saved with a descriptive title. 

If the graph doesn’t have a good title (or even a title!), viewers have to do more legwork to first figure out what each axis measures and then what the data points show. 

The title can be a question that the visualization answers, like, “Who speaks more in Disney movies, male or female characters?”

Titles can also be a statement of what the visualization shows, like “Comparing denim inseam lengths through the decades” or “Millennials really do spend more on rent than on avocados” or “The effect of hunger on mood level.”

Like a good title, annotations on a graph also help the viewer to understand what’s going on. Annotations are perfect for calling out points of interest, explaining outliers, or including background information that a viewer won’t necessarily know from just looking at the graph. 

Check out this _Live Births_ graph from FlowingData to see how much value the annotations add. They...
* add detail to the highest and lowest points on the graph
* explain what the 0% baseline means
* provide a caveat for the 2021 data
* reinforce in words that the percents on the y-axis show "more births" and "fewer births" 

Just a few lines of thoughtful annotation here and there give the audience so much more ability to interpret the graph at a deeper level!

A line chart titled "Live Births from 2004 to 2021 compared to monthly counts in 2003." The y-axis shows "Change in live birth count since 2003," measured from -15% to +10% in intervals of 5%. 0% is annotated "This baseline represents equal number of births to 2003 monthly counts." Above 0% is labeled with an up arrow and "More births," and below is labeled with a down arrow and "Fewer births." The x-axis shows each year from 2004-2021. The line is predictably jagged as births rise and fall within each year, and the jagged line rises from 2004, peaks at +10% in 2007, falls to about -5% by late 2011, stays there until 2016, then falls again to a low of -16% in early 2021. 

The line is colored orangey-red above 0%, a grayish-neutral in the middle, and teal-blue below 0%.

Annotations on the chart are as follows:
2007, highest point on the line: "In November 2007, births peaked relatively at 353,660, which was 10% higher than during the same month in 2003."
2021, lowest point on the line: "Births have been going down for the past decade, but the pandemic pushed it down faster. Births in January 2021 were 16% lower than in 2003."
2021, most recent point on the line (-7%): "June 2021 births show a possible rise."
2021, x-axis label: "Data for 2021 is provisional. The counts these % changes are based on were rounded to the nearest thousand."

Labels and Titles

That was a short primer on some common misleading parts of data visualizations – pitfalls that get between **the actual truth of the data** and **how the viewer understands it**.

People make misleading or confusing graphs on purpose and by accident. We’ve seen how intentional changes to the scaling can really distort what the graph seems to say, and also how going with the status quo on color palettes can be helpful or harmful. Most of us aren’t out here trying to make misleading graphs – very often, it’s just a question of making a better or worse decision while designing the data viz.

And improving our design skills – knowing when to use different color palettes and how to label our axes clearly – not only makes our graphs better, **it also reduces bias**! When we follow best practices in design, we’re not “going rogue” and just making design decisions that make the graph look the most extreme, or suit our needs at that moment. We’re following established practices that help to standardize graphs so that we can focus on what the data’s really saying. 

Long story short: good data viz is always a combination of quality data and effective design choices. If we can avoid confusion or missteps along the way, that’ll always be a good thing! 


A horizontal diagram of 3 steps, with arrows between each step.

The first is a cloud with blue and purple dots representing floating data. The text above it reads “data in the wild”.

The second is a window to represent the transformations we make on data to turn it into data viz and insights. The window reads “avoiding confusion / deception.” The text above it reads “good design practices.”

The third is a scatterplot that uses the same dots from the cloud, but now ordered so there’s a clear correlation in the data. The plot has title and axis labels, a dotted line showing the general break between purple and blue data dots, and a speech bubble callout as an annotation on the graph. The title is “Effective data viz”.

Conclusion

Misleading and Confusing Graphs

## Building Effective Data Visualizations

You've joined Shinji, Paola, and Raj in Dr. Dinkle's ecology lab! The lab is swamped with its latest project after a successful round of funding (award-winning board game designer Claude Tuber wants to turn the lab's work into the tabletop role-playing game of a generation). They're excited to have your data visualization skills!

In a few weeks the lab will present data on birdfeeder birds to the Backyard Birders Association of nearby Dandelion City. One component is a visualization of the wingspans of the three most common backyard birds in Dandelion City: the Northern Cardinal, American Robin, and Blue Jay.

The lab's paid intern Janis is trying her chops at data visualization, and will be drafting all of the visuals for this presentation. With your knowledge of data visualization, your task is to help her make the most effective visualization of wingspan data to present to the backyard birders of Dandelion City.

Here's the data we're working with:

![Table of the wingspan data used in this project](https://static-assets.codecademy.com/Courses/data-literacy/data-viz-basics/project/data-table-2.svg)

### Chart Type

First up... picking the right chart type for our data.

<FreeResponseQuestion id="55d4ba541fd34e8bac360ee4b42e1b0e"/>

 

### Color Palette

Now that we've gotten the chart type to align with our data, let's move on to the next important consideration: color.


<FreeResponseQuestion id="0c77ecaede214186b3b8110261664ce8"/>


### Annotations

With chart type and color palette sorted, let's move on to annotations that will help our audience make the most of the final chart.

<FreeResponseQuestion id="31742e77c7ad4d43b1b2ae88281aec95"/>


### Design

And finally, let's finish up with some design considerations.

<FreeResponseQuestion id="c348059ceb534bbeb0d0be531c3bd8fc"/>


With your guidance, Janis has produced a readable and beautiful visualization for the presentation to the Backyard Birders, and the lab's work moves smoothly along. Strong work!

Help Janis the intern make a visualization to share with Dandelion City's Backyard Birders Association!

Data Viz Basics Project

Scatterplot, univariate map, violin plot. 

chart-makers have conscious or unconscious bias, or a lack of knowledge or skill.

people are trying to spread misinformation. 

people don't put in enough effort to make truthful, beautiful visualizations. 

the chart-maker lacks expertise in the subject. 

sometimes good, and sometimes bad. Color associations can pull on both helpful prior knowledge and harmful stereotypes.

helpful: cultural associations help visualizations to just _make sense_ and reduce the amount of work the audience has to do to understand what's going on. 

harmful: choosing colors with established cultural associations reinforces the status quo in a negative way. 

(1) having the same information shown with multiple different visual cues in a chart

(2) it helps organize and prioritize information on a chart.

(1) leveling content to an accessible level

(2) unnecessarily complicated information can discourage or confuse audiences. 

(1) making multiple graphs that show the same information 

(2) you never know what audience members will respond to, so giving them more options is helpful.

(1) having all conclusions from a graph available in written context as well

(2) alt text is a necessary component of web design. 

the audience's reading level and familiarity with the subject matter.

how best to make sure they see your expertise on the subject matter. 

what they'll find most visually appealing.

You've finished Data Viz Basics and Misleading & Confusing Charts -- now test your knowledge!

Data Viz Basics Quiz

We'll cover chart types, aesthetic properties like shape and color, communicating to an audience, and universal design -- among other things! Buckle up y'all, this is data viz basics. 

Welcome! 

The field of data visualization has taken off in the last twenty years or so. People in all kinds of professional fields make data viz – from news media to business analysis – and most of us consume multiple visualizations per day, whether we notice it or not! 

Data visualizations are how we communicate data. We don’t read the numbers off a spreadsheet or list every number in a trend to communicate a point -- we make a visualization to show them. 

In this module, we’ll cover two major ideas:
1. We visualize data to accomplish a goal – to explore and see something new in the data, to strengthen an argument with a visual aid, to share a data-driven idea with someone else, and more. 

2. Our design decisions can either help or hinder how effectively a data viz communicates. We tell stories with data viz (yes, even if it’s a simple visualization) using good design choices in chart type, color, labels, context, and more. 

There is so much we could say about data viz! It definitely won’t all fit here though, so in this Data Viz Basics lesson, we’re only talking about visualizations that are static (no moving parts or animations) and fixed (no interactivity).

A person presenting a data visualization.

The first step of making a data visualization is choosing a chart type. Chart type isn’t our only tool when it comes to visualizing data, but it’s an important one for communicating about the relationship we want to show. 

In this context, a “relationship” in the data could mean something like…
1. “the shop’s sales of Gouda were higher in 2021 than any year since 2006”
2. “30% of people ordered pizza with pineapple”
3. “most people in the sample have a shoe size between 6 and 10.5”
4. “as temperature increases, ice cream sales increase”

The first example is a change over time – that can be perfect for a line chart or a bar chart. 

The second example compares a part to the whole: 30% of people got pizza with pineapple, out of 100% of people who ordered pizza. A pie chart is the classic (sometimes controversial) choice, but newer options include waffle and donut charts. Yum!

The third example is a distribution – the spread of data points in one variable. A histogram is the classic choice for visualizing a distribution.

The fourth example is a direct comparison of two variables to help understand a trend. This is perfect for a scatterplot, with or without a trend line.

There’s often more than one possible chart we can use for a dataset. But different charts emphasize different questions, arguments, or relationships in the data, and whichever we choose should help translate that data relationship into a visual relationship. 
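To make that concrete, here's a rough sketch of the idea as a lookup table. The category names and the `suggest_charts` helper are made up for illustration; the chart suggestions are just the ones named in this exercise, not a complete or official list.

```python
# Hypothetical lookup: relationship in the data -> chart types named in this exercise.
CHART_SUGGESTIONS = {
    "change over time": ["line chart", "bar chart"],
    "part to whole": ["pie chart", "waffle chart", "donut chart"],
    "distribution": ["histogram"],
    "two-variable trend": ["scatterplot"],
}


def suggest_charts(relationship):
    """Return candidate chart types for a relationship, or an empty list if unknown."""
    return CHART_SUGGESTIONS.get(relationship.lower(), [])


print(suggest_charts("Distribution"))
print(suggest_charts("part to whole"))
```

Notice that most relationships map to more than one chart. The lookup narrows the options, but the final pick still depends on what we want to emphasize.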

Next up, we’ll get to know these charts a little better.


Four charts. The first is a line chart titled "Gouda Sales at Stinky Vicki's Cheese Shop." The x-axis runs from 2005 to 2021. The line runs in a jagged U-shape overall, with the lowest point in 2020, with a callout reading "supply chain disruptions." The highest point is in 2021 and second highest is in 2006.

The second is a pie chart titled "Pizza Preference at Pat's Pizza." The larger part is labeled 70%, "no pineapple on pizza," and the smaller is labeled 30%, "pineapple on pizza."

The third is a scatterplot of yellow dots titled "Ice cream sales & temperature." The y-axis is labeled temperature in Celsius and the x-axis is labeled ice cream sales in dollars. The dots are in a positive correlation, with most of the dots in the area from lower left to upper right (low temp and low sales to higher temps and higher sales). There's a callout on one dot at a relatively high dollar value and not-that-warm temperature that reads "unusually warm early spring day."

The fourth graph is a histogram, vertical bars next to each other with heights that form an inverted U shape. The title reads "shoe size distribution of student athletes." The y-axis shows frequency from 0 to 80 and x-axis shows shoe size from 4 to 14. The tallest bar is size 7.5 with a frequency of about 72. 

From Data Type to Chart Type

Four univariate charts.  
The first is a simple map of Italy titled "Cities in Italy." Italy is shown in yellow, with nine dots scattered around the country representing major cities. They are unlabeled.

The second is a box plot titled "Eggs in chicken coop (per week)". The box plot is a horizontal pink rectangle with dotted lines extending out from either side, like a rectangular bead strung on a wire. Each dotted line is capped off with a vertical line. The left end cap is labeled 10. This is the lowest value in the dataset. The left edge of the box starts around 16; this is the 1st quartile. The median is shown as a vertical dotted line near the middle of the rectangle, labeled 28. The right edge of the box, i.e. the third quartile, is around 32. Finally, the right end cap (max value) is at 37.

The third chart is a bar chart titled "Wingspans of Common Feeder Birds." The y-axis is labeled "Wingspan (in)" and runs from 0 to 14 at an interval of 2 inches. The x-axis is labeled "Species." From left to right, the four bars are as follows: 1. orange bar for Black-capped Chickadee at 7 inches, 2. pink bar for Northern Cardinal at 11 inches, 3. yellow bar for Goldfinch at 8 inches, 4. blue bar for House Sparrow at 9 inches. 

The fourth graph is a histogram, vertical bars next to each other with heights that form an inverted U shape. The title reads "shoe size distribution of student athletes." The y-axis, labeled "Frequency," shows frequency from 0 to 80 and x-axis, labeled "Shoe Size," shows shoe size from 4 to 14. The tallest bar is size 7.5 with a frequency of about 72.

One big consideration when choosing a chart type is how many variables we’re comparing. **Univariate** charts help us visualize a change in **one variable**.

Often that means measuring “how much,” which can either be a **count** or a **distribution**.

A common chart for counts is the bar graph. If we want to compare an amount between different categories, like “how many of each coin are in the piggy bank” or “how many birds were saved by species,” a bar chart translates the difference in count to a difference in bar height. Remember, **the data relationship is translated to a visual relationship**.
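Here's a minimal sketch of that translation, using only Python's standard library. The piggy bank contents are made up, and the text "bars" stand in for what a charting library would draw.

```python
from collections import Counter

# Made-up piggy bank contents, for illustration only.
coins = ["penny", "dime", "penny", "quarter", "nickel", "penny", "dime"]

# Count how many times each category appears.
counts = Counter(coins)

# One row per category; the bar length is proportional to the count.
for coin, count in counts.most_common():
    print(f"{coin:<8}{'#' * count} ({count})")
```

The difference in count becomes a difference in bar length, which is exactly the data-to-visual translation a bar chart performs.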

Another common univariate chart is the histogram. Histograms show the distribution, or spread, of a variable.

Histograms are a great way to show the concept of a normal (or skewed) distribution. We can visualize the answer to questions like…
* “how does foot size vary across the population?”
* “what is the distribution of pregnancy length across the human population?”
* “how is income distributed in my country?”
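Under the hood, a histogram sorts values into bins and counts how many land in each one. Here's a small sketch with made-up shoe sizes; the bin width of 2 is an arbitrary choice for this example, and picking a different width would change the shape we see.

```python
from collections import Counter

# Made-up shoe sizes, for illustration only.
shoe_sizes = [6, 6.5, 7, 7.5, 7.5, 8, 8, 8.5, 9, 9.5, 10, 10.5, 12]

BIN_WIDTH = 2  # each bin covers two shoe sizes, e.g. 6 up to (but not including) 8


def bin_start(value, width=BIN_WIDTH):
    """Return the lower edge of the bin that a value falls into."""
    return int(value // width) * width


# Count how many values fall into each bin.
histogram = Counter(bin_start(size) for size in shoe_sizes)

for start in sorted(histogram):
    print(f"{start:>2}-{start + BIN_WIDTH:<2} {'#' * histogram[start]}")
```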

A density curve also visualizes a distribution, without putting data in bins like a histogram does. 

A more “math-forward” way to visualize distributions is a box plot or violin plot. These visualizations make percentile and quartile values obvious. 
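The values a box plot draws can be computed directly. This sketch uses made-up weekly egg counts, chosen to roughly match the box plot described in this exercise. It relies on `statistics.quantiles` with its default method; other tools use slightly different quartile conventions, so their box edges may not match exactly.

```python
import statistics

# Made-up weekly egg counts, for illustration only.
eggs_per_week = [10, 14, 16, 22, 27, 28, 29, 31, 32, 35, 37]


def five_number_summary(values):
    """Return the five values a basic box plot draws: min, Q1, median, Q3, max."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {"min": ordered[0], "q1": q1, "median": median, "q3": q3, "max": ordered[-1]}


summary = five_number_summary(eggs_per_week)
print(summary)
```

The box spans `q1` to `q3`, the line inside it sits at `median`, and the end caps sit at `min` and `max`.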

Last up, outside the counting and distribution category, let’s consider a univariate map. This would be a map where the only variable is geographic, i.e. a map that just shows us location and distance.


Univariate Charts

3 charts.

The first is a line chart titled "Imported and Domestic Sales at Stinky Vicki's Cheese Shop." The x-axis runs from 2005 to 2021. Imported (solid orange) and domestic (dotted pink) lines run in a jagged U-shape overall. Imported has the lowest point in 2020, while domestic cheese sales rise during this time. Both are about level in 2021.

The second is a scatterplot of yellow dots titled "Ice cream sales & temperature." The y-axis is labeled temperature in Celsius and the x-axis is labeled ice cream sales in dollars. The dots are in a positive correlation, with most of the dots in the area from lower left to upper right (low temp and low sales to higher temps and higher sales). There's a callout on one dot at a relatively high dollar value and not-that-warm temperature that reads "unusually warm early spring day."

The third is a map of Italy titled "Italian pasta shape origins." Italy is shown in yellow, with nine dots scattered around the country representing major cities. The names of pastas are overlaid according to their general region of origin. From north to south on the mainland: ravioli, farfalle, penne, tortellini, gnocchi, bucatini, lasagna, and orecchiette, with spaghetti and casarecce on Sicily. Lorighittas and Fregola are on Sardinia.

Next up, bivariate and multivariate charts! These charts show the relationships between two or more variables. 

The classic bivariate example is the scatter plot &mdash; one variable on the x-axis, another on the y-axis, and each point helps us compare the two variables by its position on the graph.

Scatterplots translate the relationship between two variables in the data into an easy-to-see spatial relationship. Because we’re relying on the idea that each variable increases as we move along its axis (to the right on the x-axis, up on the y-axis), the scatterplot only makes sense for numeric variables, not categorical ones.
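One common way to put a number on the trend a scatterplot shows is the Pearson correlation coefficient. The sketch below computes it by hand for made-up temperature and ice cream sales figures. It's an illustration of the idea, not a substitute for looking at the plot itself.

```python
import math

# Made-up daily observations, for illustration only.
temperature_c = [12, 15, 18, 21, 24, 27, 30]
ice_cream_sales = [110, 135, 160, 210, 240, 300, 330]


def pearson_r(xs, ys):
    """Pearson correlation: +1 is a perfect upward line, -1 a perfect downward line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


r = pearson_r(temperature_c, ice_cream_sales)
print(f"r = {r:.2f}")
```

The calculation only works because both variables are numeric: there is no meaningful "mean" of a categorical variable like bird species.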

A line chart is another common bivariate chart, often measuring a variable changing over time. A stock chart, for example, measures the value of a company over time. 

A line chart with multiple lines for different variables is a multivariate chart. For an example, check out the line chart that plots both imported and domestic cheese sales. 

Last but not least, let’s think about a bivariate map. It shows a basic geographical map plus an additional variable &mdash; this example shows roughly where different pasta shapes originated in Italy. We can also map precipitation, altitude or depth, median income, museum locations, or combinations of variables... the list is endless. 

Charts often rely on visual signifiers besides chart type to visualize additional variables in the data. For example, the lines on a multivariate line chart are distinguished by pattern and color, and a scatter plot can use color, shape, or dot size to make a third variable apparent. Read on for info on color, shape, and more!


Bi- and Multivariate Charts

A bubble chart from The New York Times titled "The Facebook Offering: How It Compares."

From left to right (the x-axis), the graph measures time from 195 to 2012. From bottom to top (the y-axis) the graph measures company value in billions of dollars from 0 to 25. 

The marks on the graph are circles in different sizes. The size of each circle corresponds to its company value. Most circles are small, below 5 billion, and spread pretty evenly over time with a cluster in 2000. The largest and highest circle by far is Facebook in 2012, at $104 billion.  The second-largest circle is just over $25 billion in 2004, labeled "Google." A relatively small circle around $4 billion and 1982 is annotated "Apple." 

Text at the top of the visualization reads "Facebook: Facebook's offer price was $38 a share, giving the company a valuation of $104 billion, nearly four times larger than Google in 2004." The graph is from May 2012.




We’ve covered how we use chart type to highlight a relationship in the data. Now we’ll talk about how we use aesthetic properties to further clarify and visualize the “details” of the data. 

Aesthetic properties are the attributes we use to communicate data visually:
* Position
* Size
* Shape
* Color / pattern 

Most of us are already familiar with this concept, even if we’ve never seen it put into words before. Let’s walk through a visualization together to see how its aesthetic properties help us understand the data story. 

This visualization published by The New York Times in 2012 shows different tech IPOs (initial public offering, or when a formerly-private company becomes available as a publicly traded stock).

Take a minute to look at the graph and see how position, size, shape, and color are used. What does it mean for a data point to be up and to the right versus down and to the left? What do bigger and smaller signify? How does shape come into play? What does color mean?

Okay, let’s walk through it together now. 

We can start with position: from left to right (the x-axis), the graph measures time. From bottom to top (the y-axis) the graph measures company value in billions of dollars. 

As far as shape goes, there’s only one here – circles.

Moving on to size, it looks like the size of each circle tracks with its company value. Companies with a larger IPO amount get a bigger circle. 

Finally, color corresponds to time. Earlier corresponds to red, and later to blue. The middle portion, around 1995, is purple. This visually separates the three decades into three general zones. 

Aesthetic Properties I: the Menu

Let’s take a closer look at this graph. There’s a connection here between size and y-position (how high or low a circle is): they actually tell us the same information twice!

This is an example of **information redundancy**, or encoding the same information in different visual properties. We already know that Facebook has the largest company value because it’s the highest circle on the chart. Its large size gives us another way to visually compare it to the other data points.

Info redundancy is also helpful for prioritizing values. There are lots and lots of smaller companies on this graph – if every circle were the size of Google’s circle, the bottom part of the graph would be an unreadable ball pit. Or, if all the circles were the size of the smallest ones, the chart would lose some of its emphasis on Facebook's large IPO value. **Information redundancy helps key data points to stand out.**

Color and x-position are also redundant on this graph, making the chart a little easier and faster to interpret. The three color groups in the graph help break up the three-ish decades shown, giving us a sense in one glance that red circles are part of an early group, purples are in a middle group, and blues are the latest.

We’ll dive deeper into accessibility later, but for now note that information redundancy is also an important practice to ensure that colorblind viewers can access all the information in a chart. 

To sum up, information redundancy visualizes the same information using multiple different aesthetic properties. It’s important for readability, organization and prioritization of information, and accessibility. 
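Here's a sketch of what information redundancy looks like when we map data rows onto aesthetic properties. The three rows use approximate values from the chart description in this lesson. The color cutoff years and the square-root size rule are assumptions made for this example, not details taken from the original graphic.

```python
import math

# Approximate (year, value in $ billions) pairs based on the chart description in this lesson.
ipos = [
    {"company": "Apple", "year": 1982, "value": 4},
    {"company": "Google", "year": 2004, "value": 25},
    {"company": "Facebook", "year": 2012, "value": 104},
]


def color_for(year):
    """Assumed cutoffs: the lesson only says early is red, mid-1990s purple, late blue."""
    if year < 1990:
        return "red"
    if year < 2000:
        return "purple"
    return "blue"


def encode(ipo):
    """Map one data row onto aesthetic properties, with two deliberate redundancies."""
    return {
        "label": ipo["company"],
        "x": ipo["year"],                   # position encodes time
        "y": ipo["value"],                  # position encodes value
        "radius": math.sqrt(ipo["value"]),  # size encodes value again (circle area ~ value)
        "color": color_for(ipo["year"]),    # color encodes time again
    }


marks = [encode(ipo) for ipo in ipos]
for mark in marks:
    print(mark)
```

Each row has two variables (year and value) but four visual cues. A viewer who can't distinguish the colors can still read time from the x-position.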


Aesthetic Properties II: Information Redundancy

The best data visualizations help us to understand what’s in the data, draw meaningful conclusions, and make decisions about the next steps. This requires context, and **different context is appropriate for different audiences**.

Let’s walk through an imaginary-world example: 

Shinji, Paola and Raj work together in an ecology lab. Their lab is applying for funding for a field research trip next year. This week, each of them will present the lab’s work and data to a different potential funder: 

* Sir Avon Rattleborough, retired ecologist and expert field researcher
* Claude Tuber, board game developer and eccentric venture capitalist
* Milana Diamante, heiress and amateur biologist 

The three labmates know they’ll have to communicate differently to each of these potential funders. **They’ll communicate the same information, but each lab member will personalize their chart with a title and annotations that work best for their intended audience.**

With this in mind, let’s get some background on the funders and see which title and annotation is most effective for each graph…

**Sir Avon** has over 50 years of ecology research under his belt, and lives for the details. He’s not familiar with the lab’s work specifically, but he knows the lingo. Sir Avon always wants to see proof. 

**Claude** is a big picture thinker: he’s wondering if the lab’s research can be turned into a fun and educational board game for a broad audience. He’s excited about it but knows nothing about ecology. Claude has a limited attention span. 

**Milana** is an enthusiastic citizen scientist in her free time, and is eager to support a worthy cause. She’s never done field research herself, but has some background knowledge and has been following the lab’s work for a couple years. Milana loves to ask follow-up questions. 

The three characters are shown above three slightly different graphs. Each graph is a scatterplot that plots Primary Feather Count on the y-axis and Wing Ratio on the x-axis. There are three different types of points plotted: orange circles with generally lower wing ratio and higher primary feather count, blue triangles with generally lower wing ratio and lower primary feather count, and yellow squares with generally higher wing ratio and lower primary feather count.

From left to right:
Sir Avon Rattleborough is a white-haired person wearing a tan field vest and pants, holding a field notes book and a magnifying glass. The graph below him is titled "Wing shape predicts ecological niche." The annotations on the graph are a callout pointing at one point that reads "Blue-tailed Hawk is a crossover species, PFC=18" and a note in the bottom righthand corner that reads "error margin = +/- 10mm".

Claude Tuber has wavy orange hair and wears a pink suit decorated with chess pieces. The graph below him is titled "What lifestyle does this bird live? Its wings will tell us!" Instead of focusing on the individual points in the scatterplot, three ovals in orange, blue and yellow have been drawn around the points in those corresponding colors. The points themselves are faded out so that the focus is on the general ovals, which are labeled "Passive soaring flight," "Acrobatic flight" and "Active soaring flight."

Finally, Milana Diamante is a white-haired person wearing casual everyday clothes, with a pair of binoculars around her neck. The graph below her is titled "Bird wing shape predicts ecosystem role." The x-axis label reads "Wing ratio (span x area)" instead of just "Wing ratio." There is a yellow callout on the graph that reads "Blue-Tailed Hawk has traits of Passive & Acrobatic" and a note on the bottom right of the chart that says "See appendix for flight and error details."


Consider the Audience

We tend to think of context as “outside” a data visualization, but Shinji, Paola and Raj had the right idea by including appropriate context using titles and annotations. Each of them also did a great job of **considering their specific audiences when making decisions about what context to include or not.** 

Viewers need context to understand what a data visualization means and why it matters.

Paola predicted that, as a scientist, Sir Avon would want to know the technical details like the amount of error in the measurements. She didn’t take up space on the graph with definitions, since she knew they weren’t necessary in this case. 

Shinji decided to use a question and answer format for their title to help communicate the takeaway of the graph to Claude in an accessible way. Shinji went for a more aggregated, less detailed approach to help keep the conclusion simple and digestible.

Finally, Raj made good choices in his visualization for Milana: a descriptive but slightly-less-technical title, and a pointer towards definitions of terms he thinks Milana may not know and would be interested to learn.

In each case, the lab member made sure to…
* Provide necessary details
* Include context that’s helpful for the specific audience
* Avoid “chart junk”: excess graphics, annotations, and general lines that don’t actually contain information


Context is Key

Comparisons of a colorblind friendly palette and harder-to-read palette. Each palette is shown as it appears to those with unaffected vision, as well as with red-weak, green-weak, and blue-weak vision.  The accessible palette is clearly differentiable across the vision spectrum, while the harder to read palette is muddied in all except unaffected vision. 

Now that we can make some visualizations, let’s talk accessibility to make sure our work reaches everyone who wants to interact with it!

The most commonly discussed accessibility concern is color, since color blindness affects 1 in 12 males and 1 in 200 females. That’s more common than we tend to think! There are a few different types of color blindness: check out the images for simulations of each form.

The big takeaway when designing for color accessibility is to think not only about the **hue** of a color (e.g. red, green, or purple), but the **value** as well (e.g. bright red, light green, dark purple). **Good color comparisons use high contrast values, not just different hues.** 
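One way to check whether two colors differ in value, and not just in hue, is the contrast ratio defined in the Web Content Accessibility Guidelines (WCAG). WCAG defines it for text against a background, so treat it here as a rough proxy for light-dark difference, not as a color blindness simulation. The example colors are arbitrary picks.

```python
def relative_luminance(rgb):
    """WCAG relative luminance of an sRGB color given as (r, g, b) in 0-255."""
    def linearize(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(channel) for channel in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """WCAG contrast ratio: 1 means identical lightness, 21 is black on white."""
    lighter, darker = sorted(
        (relative_luminance(color_a), relative_luminance(color_b)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


# Different hues, similar value: pure red vs. a medium green.
similar_value = contrast_ratio((255, 0, 0), (0, 128, 0))

# Different hues AND different value: dark blue vs. light yellow.
different_value = contrast_ratio((0, 0, 128), (255, 255, 128))

print(f"red vs. green:         {similar_value:.1f}")
print(f"dark blue vs. yellow:  {different_value:.1f}")
```

The red and green pair looks distinct to unaffected vision but has a low contrast ratio, so it leans entirely on hue. The blue and yellow pair stays distinguishable even when hue information is lost.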

It’s also important to use readable fonts in readable sizes, and make sure they’re web-accessible if online. (If you don’t do this, □ □ □ □ □ □ □ will happen!)

Finally, for online data visualizations, make sure to include alt text as we would for any other web image. Alt text ensures that users experiencing a visualization through a screen reader won’t miss out on whatever information it contains. 

To recap, here’s a checklist for baseline accessibility:
* Colorblind-friendly color palettes
* Large enough font size
* Readable, web-accessible font type
* Alt text on data visualization images online


Accessibility Basics

In the last exercise we covered accessibility guidelines specifically related to vision access – using color palettes, fonts, and alt-text to ensure that people across the vision spectrum can access our work.

The same accessibility goal – making our work available and easier to access for more people – is a great principle to keep in mind no matter what. (This is actually called “universal design.”)

We can apply it when it comes to…
* Readability: keep the reading level to a high school level whenever possible
* Prior knowledge: define unfamiliar terms and avoid unnecessary jargon
* Information overload: introduce new information with intentional pacing and organization

Remember the ecology lab’s pitches for funding? Each lab member took a different approach based on their audience. The difference between the pitches for Sir Avon and Milana is a perfect example of “leveling” content for the intended audience: swapping the more technical term “ecological niche” with the synonym “ecosystem role” makes the graph’s title readable with more everyday words, without changing the meaning or even sacrificing detail.

Finally, with any of these practices, there may be situations where it’s not the right call. Trinh probably shouldn’t avoid all technical language on the graphs in his peer-reviewed paper, and there may be no need for Anabelle to define industry terms when presenting a chart to work colleagues in her department. But for most audiences, and especially for broad or unknown audiences, keeping accessibility in mind will help everyone get the most out of our visualizations. 


Further Accessibility: Universal Design

Last up, we’re talking authorship. 

When Paola made the visualization for Sir Avon, at first she included fewer details and took out her annotations about points of interest. She assumed there was no way she – a student scientist – would know something that Sir Avon didn’t already know. 

But she was wrong! By the time Paola was able to make a data visualization about bird wing shapes, she was becoming an expert on the subject. Sir Avon, while a celebrated ecologist in general, did not share her depth of specific, current knowledge on this topic. 

Dr. Dinkle, the lab’s leader, helped Paola to own her role as the author – to recognize that the amount of time she had invested in data cleaning and data analysis, as well as data visualization, qualified her to make decisions about what should be included. It was not only worth sharing her opinion, the graph actually suffered without those annotations! Her perspective on the data helped Sir Avon to grasp the key takeaways. 

Paola’s experience is true for many people. It can feel unnatural to “speak for” the data, and many authors worry that they will influence their audiences by including annotations. But that’s just it! **Data does not speak for itself**, and often, the author of a data viz is the person in the best position to create insightful, helpful annotations. 

In fact, this context isn’t only helpful – very often, it’s the ethical way to present information. Providing context or even a written summary of what a graph shows helps to limit false conclusions, misinterpretations, and misinformation. 


Sir Avon Rattleborough's character is at the top. 

Below are two versions of the graph Paola presented to him. One as shown before: The title is "Wing shape predicts ecological niche." The annotations on the graph are a callout pointing at one point that reads "Blue-tailed Hawk is a crossover species, PFC=18" and a note in the bottom righthand corner that reads "error margin = +/- 10mm". 

The other graph has both annotations removed, taking away key context about a specific point.

Owning the Role of Author

We made it! This lesson covered everything you need to know to start designing data visualizations that look great, communicate effectively, and treat data ethically.

Here’s a recap of what we talked about…
* Choosing the right chart type for our data
* Mapping variables onto aesthetic properties to communicate using position, shape, size, color
* Making effective color choices for data, design, and accessibility
* Designing accessible visualizations
* Scaffolding the visualization with context for the audience
* Owning the role of author

That's a big chunk of information. Pat yourself on the back for making it to the end of this lesson, then go forth and visualize!

Data Visualization Basics

[algorithms]: https://www.codecademy.com/resources/docs/general/algorithm
[google search]: https://www.google.com/search?q=cases+where+drivers+drove+into+lakes+and+rivers+because+their+GPS+instructed+them+to&rlz=1C5CHFA_enUS742US742&oq=cases+where+drivers+drove+into+lakes+and+rivers+because+their+GPS+instructed+them+to&aqs=chrome..69i57.233j0j7&sourceid=chrome&ie=UTF-8
[A Reuters article from 2018]: https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G
[The Gender Shades project]: http://gendershades.org/index.html
[Buolamwini et al. 2018, _Proceedings of Machine Learning Research_]: http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf

[Image of bus driving into a lake]: https://static-assets.codecademy.com/Courses/data-literacy/analyses/1683.svg
[Bar plot showing compositions of the three benchmarking datasets. Over 80% of the Adience dataset is equal parts lighter males and females. The IJB-A dataset is about 60% lighter male and 20% lighter female, with less than 10% darker female. The new PPB dataset is nearly equal quarters of lighter male, lighter female, darker male, and darker female.]: https://static-assets.codecademy.com/Courses/data-literacy/analyses/FacialBenchmarks.svg

Would you follow your GPS anywhere? Even into a lake? That may sound ridiculous, but a quick [google search] brings up dozens of cases where drivers drove into lakes and rivers because their GPS instructed them to. Following GPS instructions against your better judgment is one example of **automation bias**. 

![Image of bus driving into a lake]

As humans, we have many biases, both implicit and explicit. Biases are systematic errors in thinking influenced by cultural and personal experiences. Biases distort our perception and cause us to make incorrect decisions. One bias that many humans share is automation bias. Automation bias stems from the idea that computers or machines are more trustworthy than humans because they are more objective. Automation bias is at the root of why people follow their GPS into trouble, even when contradictory information is available.

Computers, data, and [algorithms] are not completely objective. Data analysis can help us make better decisions, but it is not immune to bias. Humans create technologies and algorithms, so human biases are often encoded into them. Just as we need to pay attention to other information streams (our eyes and ears) when we drive with GPS, we need to consult other information sources when we evaluate data analysis results or reports.

If we want to be responsible when we use data and algorithms, we need to understand the different types of bias that show up at each stage of analyzing data. Let’s take a closer look at some types of bias that impact data analysis and data-driven decision-making.

## Bias in data collection

Before we can analyze data or use machine learning algorithms, we need to collect data. Data collection is subject to **selection bias** (also called sample bias). Selection bias occurs when study subjects (_i.e._, the sample) are not representative of the population. It can result from poor study design, for example when the sample is too small or is not randomized. Selection bias can also crop up when the only data available is influenced by **historical bias**, a systematic influence based on historic social and cultural beliefs.

[A Reuters article from 2018] highlights how Amazon produced a machine-learning algorithm that suffered from such a selection bias. The company designed the algorithm to help recruiters hire top talent. The model was trained on thousands of resumes from people who were or were not hired by Amazon. It learned 50,000 phrases associated with resumes and began to ignore common phrases, such as the names of programming languages. However, the algorithm also learned to downgrade resumes that contained the word "women’s." This included resumes that referenced women’s colleges, teams, or committees.

This is an example of selection bias because the data used to train the algorithm were not representative of the modern applicant pool. The majority of Amazon's past applicants and employees were male. This means a larger proportion of the successful resumes in the training data came from male applicants. Amazon did not explicitly train the algorithm to use gender. Yet, the algorithm still found and used gender-associated terms to weed out women candidates. 
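
A toy sketch can make this mechanism concrete. The records below are invented for illustration, and the scoring rule (the share of past resumes containing a word that ended in a hire) is an assumption chosen for simplicity. It is not a description of Amazon’s actual model.

```python
from collections import defaultdict

# Hypothetical, hand-made history: (words found on a resume, was the person hired?).
# None of these records are real; they only mimic a skewed hiring history.
history = [
    ({"python", "chess"}, True),
    ({"java", "chess"}, True),
    ({"python", "debate"}, True),
    ({"java", "debate"}, True),
    ({"python", "chess"}, False),
    ({"java", "debate"}, False),
    ({"python", "women's", "chess"}, False),
    ({"java", "women's", "debate"}, False),
]


def hire_rate_by_word(records):
    """For each word, return the fraction of resumes containing it that were hired."""
    seen = defaultdict(int)
    hired = defaultdict(int)
    for words, was_hired in records:
        for word in words:
            seen[word] += 1
            hired[word] += int(was_hired)
    return {word: hired[word] / seen[word] for word in seen}


rates = hire_rate_by_word(history)
for word in sorted(rates):
    print(f"{word}: {rates[word]:.0%} of past resumes with this word were hired")
```

In this made-up history, gender is never recorded, yet the word "women's" ends up with the lowest score simply because of who was hired in the past. Any model that learns from such scores inherits that pattern.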

We can do our best to avoid selection bias by doing everything possible to have a representative sample, not just a convenient one. For example, it’s a good idea to include data inputs from multiple sources to diversify data. This is easier said than done, however, and we need to acknowledge and address historical bias in data sources and work towards building frameworks to increase inclusivity. 
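
One simple way to check whether a sample is representative is to compare each group’s share of the sample with its share of the population. The sketch below assumes we already know the population shares; the group names and numbers are hypothetical and chosen only to illustrate the comparison.

```python
from collections import Counter


def group_shares(labels):
    """Return each group's share of the list as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)
    return {group: count / total for group, count in counts.items()}


def representation_gaps(sample_labels, population_shares):
    """Return sample share minus population share for every population group."""
    sample_shares = group_shares(sample_labels)
    return {
        group: sample_shares.get(group, 0.0) - expected
        for group, expected in population_shares.items()
    }


# Hypothetical numbers, invented for illustration only.
population_shares = {"women": 0.5, "men": 0.5}
sample_labels = ["men"] * 80 + ["women"] * 20

gaps = representation_gaps(sample_labels, population_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.0%} relative to the population")
```

A large positive or negative gap signals that a group is over- or underrepresented. A check like this only covers the groups we think to measure, so it complements careful study design rather than replacing it.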

## Bias in building and optimizing algorithms

**Algorithmic bias** arises when an algorithm produces systematic and repeatable errors that lead to unfair outcomes, such as privileging one group over another. Algorithmic bias can be initiated through selection bias and then reinforced and perpetuated by other bias types.

Facial recognition software is an area where algorithmic bias can do a lot of harm. This software is sold to police departments and used to identify suspects in surveillance footage. If the software systematically makes more mistakes for people of certain races or genders, people in those groups will be incorrectly pursued more often, which has serious, negative outcomes for individuals.
 
[The Gender Shades project] tested commercial facial recognition software for these kinds of biases. IBM, Microsoft, and Face++ are three companies that offer facial recognition software with a binary gender classifier feature. Researchers assessed the accuracy of these classifiers and discovered that they suffered from algorithmic bias. The algorithms were good at classifying the gender of lighter males, okay at classifying darker males and lighter females, and very bad at classifying darker females.

Each software product used proprietary algorithms, and none of the companies reported performance results on benchmarking datasets. However, the developers probably tested the software on one of two commonly used benchmarking datasets: Adience or IJB-A. These datasets include few dark-skinned people and especially low proportions of dark-skinned females. Testing an algorithm with a non-representative dataset leads to **evaluation bias**. Testing with a non-representative benchmarking dataset would give high overall accuracy scores, even if the algorithms were inaccurate for certain groups.
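
The arithmetic behind evaluation bias can be sketched with a small example. The benchmark counts below are hypothetical (they are not taken from the Gender Shades results); they only mimic a benchmark dominated by one group.

```python
# Hypothetical benchmark: group -> (number of images, number classified correctly).
benchmark = {
    "lighter male": (600, 594),
    "lighter female": (200, 186),
    "darker male": (150, 132),
    "darker female": (50, 33),
}

total_images = sum(images for images, _ in benchmark.values())
total_correct = sum(correct for _, correct in benchmark.values())

# Overall accuracy weights each group by how many images it contributes.
overall = total_correct / total_images

# Per-group accuracy looks at each group separately.
per_group = {
    group: correct / images for group, (images, correct) in benchmark.items()
}

# Balanced accuracy gives every group equal weight, regardless of its size.
balanced = sum(per_group.values()) / len(per_group)

print(f"Overall accuracy: {overall:.1%}")
for group, group_accuracy in per_group.items():
    print(f"  {group}: {group_accuracy:.0%}")
print(f"Accuracy with equal weight per group: {balanced:.1%}")
```

Because the largest group dominates the total, the overall score stays high even though the smallest group is classified far less accurately. Reporting accuracy per group makes that gap visible.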
 
Another key point when it comes to algorithmic bias in facial recognition software is that the algorithms are proprietary, making them “black boxes.” In addition to not knowing what data were used to train and test an algorithm, we can’t know how it was designed or how it works. As a result, it’s impossible to inspect the algorithms themselves; we can only evaluate their outputs.
 
Avoiding algorithmic bias relies on transparency, especially concerning data used for training and testing an algorithm. In response to the poor performance of facial recognition with darker females, the researchers developed a new benchmarking dataset, the Pilot Parliaments Benchmark (PPB), that is more representative of the full spectrum of humanity. This is a big step forward, as long as the new dataset is actually used by companies making and selling facial recognition software.

![Bar plot showing compositions of the three benchmarking datasets. Over 80% of the Adience dataset is equal parts lighter males and females. The IJB-A dataset is about 60% lighter male and 20% lighter female, with less than 10% darker female. The new PPB dataset is nearly equal quarters of lighter male, lighter female, darker male, and darker female.]

Data for this plot came from [Buolamwini et al. 2018, _Proceedings of Machine Learning Research_].

## Bias in interpreting results and drawing conclusions 

Bias also influences the final stages of data analysis: interpreting results and drawing conclusions. The following bias types are ones we should watch out for when evaluating or generating data reports:

- **Confirmation bias** is our tendency to seek out information that supports our views. Confirmation bias influences data analysis when we consciously or unconsciously interpret results in a way that supports our original hypothesis. To limit confirmation bias, we can clearly state our hypotheses and goals before starting an analysis, and then honestly evaluate how they influenced our interpretation and reporting of results.

- **Overgeneralization bias** is inappropriately extending observations made with one dataset to other datasets, leading to overinterpreting results and unjustified extrapolation. To limit overgeneralization bias, be thoughtful when interpreting data, only extend results beyond the dataset used to generate them when it is justified, and only extend results to the proper population. 

- **Reporting bias** is the human tendency to report or share only “positive” results, meaning those that affirm our beliefs or hypotheses. Editors, publishers, and readers are also subject to reporting bias, as positive results are published, read, and cited more often. To limit reporting bias, report negative results and cite others who do, too.
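
A small simulation can illustrate how reporting bias distorts the overall picture. This is a minimal sketch under stated assumptions: every number below (the true effect of zero, the number of studies, the sample size, and the reporting threshold) is hypothetical and chosen only for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

TRUE_EFFECT = 0.0        # assumption: the treatment actually does nothing
STUDIES = 1000           # hypothetical number of independent studies
SAMPLE_SIZE = 30         # hypothetical number of measurements per study
PUBLISH_THRESHOLD = 0.3  # assumption: only estimates above this get reported


def run_study():
    """Return one study's estimated effect: the mean of noisy measurements."""
    return mean(rng.gauss(TRUE_EFFECT, 1.0) for _ in range(SAMPLE_SIZE))


all_estimates = [run_study() for _ in range(STUDIES)]
published = [estimate for estimate in all_estimates if estimate > PUBLISH_THRESHOLD]

print(f"Studies run: {len(all_estimates)}, studies reported: {len(published)}")
print(f"Average effect across all studies: {mean(all_estimates):+.2f}")
print(f"Average effect across reported studies: {mean(published):+.2f}")
```

Even though the true effect is zero, a reader who sees only the reported studies would conclude that the effect is clearly positive. Including the "negative" results pulls the average back toward the truth.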

## Conclusions

Data and machine learning algorithms are now ubiquitous. They influence decisions about who is hired or fired, accepted into schools, or allowed to rent houses. They even influence which neighborhoods are more heavily policed and who is granted parole. Therefore, we must recognize that data and algorithms can be biased, just like the humans who create and train them. Learning more about the types of bias that influence how algorithms function will improve our ability to perform and interpret data analyses and will help us make more informed decisions.

Bias is everywhere in data. The key to combatting bias is knowing what to look out for.

Bias in Data Analysis

Quiz yourself on different types of data analysis!

Data Analyses and Conclusions

### Why Principles of Data Literacy?

This no-code course introduces the foundational hows and whys of data. How do statistics help us draw conclusions from data? Why is good design critical for communicating data stories through data visualization? What are the different kinds of analysis we can perform on a dataset? This course will help you feel empowered to answer these questions (and more!) and to work with data confidently.

### Take-Away Skills

You will learn how to evaluate data quality, interpret statistical conclusions, create and read data visualizations, and analyze data responsibly.

Discover the world of data in this fully conceptual course where you will learn how to think about, visualize, and analyze data.