A huge part of data science involves acquiring raw data and getting it into a form ready for analysis. Some have estimated that data scientists spend 80% of their time cleaning and manipulating data, and only 20% of their time actually analyzing it or building models from it.
When we receive raw data, we have to do a number of things before we’re ready to analyze it, possibly including:
- diagnosing the “tidiness” of the data: how much data cleaning we will have to do
- reshaping the data: getting the right rows and columns for effective analysis
- combining multiple files
- changing the types of values: for example, fixing a column where numerical values are stored as strings
- dropping or filling missing values: dealing with data that is incomplete or missing
- manipulating strings to represent the data better
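To make the list above concrete, here is a minimal sketch of how these steps might look with dplyr and tidyr. The tables `scores_a` and `scores_b`, their column names, and all of the values are hypothetical and invented purely for illustration; they are not the lesson's actual dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical raw tables (made-up values), standing in for two separate files.
scores_a <- tibble(
  full_name   = c("Ada Lopez", "Ben Wu"),
  age         = c(16, 17),
  fractions   = c("78%", "91%"),
  probability = c("85%", NA)
)
scores_b <- tibble(
  full_name   = "Cy Adams",
  age         = 16,
  fractions   = "64%",
  probability = "70%"
)

# Combine the tables by stacking their rows.
all_scores <- bind_rows(scores_a, scores_b)

# Diagnose: inspect column names, types, and the first few values.
glimpse(all_scores)

clean_scores <- all_scores %>%
  # Reshape: one row per student per exam instead of one column per exam.
  pivot_longer(
    cols = c(fractions, probability),
    names_to = "exam",
    values_to = "score"
  ) %>%
  # Change types: strip the "%" sign and convert the strings to numbers.
  mutate(score = as.numeric(gsub("%", "", score))) %>%
  # Missing values: drop rows that have no score.
  drop_na(score) %>%
  # Manipulate strings: split the full name into two columns.
  separate(full_name, into = c("first_name", "last_name"), sep = " ")

print(clean_scores)
```

Dropping the missing score is only one option; depending on the analysis, filling it with a default value could be the better choice.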
We will go through the techniques data scientists use to accomplish these goals by looking at some “unclean” datasets and trying to get them into a good, clean state. Along the way we will use the powerful tidyverse packages dplyr and tidyr to get our data squeaky clean!
We have provided an example of data representing exam scores from 1000 students in an online math class.
These data frames, which you can view in the rendered notebook, are hard to work with. They’re separated into multiple tables, and the values don’t lend themselves well to analysis. Try to think about how you would plot the average exam score against the age of the students in the class. This would not be easy!
In the next exercises, we’ll transform this data so that performing a task like that visualization would be simple.
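As a rough sketch of why the tidy form helps, the average score by age becomes a short pipeline once each row holds one student's score on one exam. The data frame `tidy_scores` and its columns are hypothetical, with made-up values, and only illustrate the target shape.

```r
library(dplyr)

# Hypothetical tidy data (made-up values): one row per student per exam.
tidy_scores <- tibble(
  student = c("A", "A", "B", "B", "C", "C"),
  age     = c(16, 16, 17, 17, 16, 16),
  exam    = rep(c("fractions", "probability"), times = 3),
  score   = c(78, 85, 91, 88, 64, 70)
)

# Average exam score for each age, ready to hand to a plotting function.
avg_by_age <- tidy_scores %>%
  group_by(age) %>%
  summarise(mean_score = mean(score))

print(avg_by_age)
```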