A fundamental part of data science is statistics.
Some people say that data science is statistics dressed up for the 21st century, and there’s some truth in that. Statistics has been practiced for centuries, but with the arrival of electronic computing in the middle of the 20th century it took on a new form.
As a refresher, statistics is the practice of applying mathematical calculations to sets of data to derive meaning. Statistics can give us a quick summary of a dataset, such as its average value or how consistent its values are.
There are two types of statistics: descriptive statistics and inferential statistics. Descriptive statistics describe a dataset using mathematically calculated values, such as the mean and standard deviation. For instance, the graph below from FiveThirtyEight charts the wage gap between American men and women in 2014. An example of a descriptive statistic would be that at the 90th income percentile, women make 80.6% of what men make on average.
These values are useful when summarizing data collections.
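As a minimal sketch of how descriptive statistics such as the mean and standard deviation are calculated, the example below uses Python's built-in statistics module. The wage figures are made up for illustration and do not come from the FiveThirtyEight data.

```python
import statistics

# Hypothetical hourly wages in dollars (illustrative values only).
wages = [12.5, 15.0, 15.0, 18.25, 22.0, 27.5, 40.0]

mean_wage = statistics.mean(wages)      # the average value
median_wage = statistics.median(wages)  # the middle value when sorted
spread = statistics.stdev(wages)        # sample standard deviation

print(f"mean:   {mean_wage:.2f}")
print(f"median: {median_wage:.2f}")
print(f"stdev:  {spread:.2f}")
```

A small standard deviation relative to the mean would suggest that the values are consistent, while a large one suggests they are widely spread.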
On the other hand, inferential statistics are statistical calculations that enable us to draw conclusions about a larger population than the data we have in hand. For instance, from looking at the graph we can infer that at the 99th income percentile, women make less than 78% of what men make on average. We might also infer that the wage gap is smallest at the 10th income percentile because the minimum wage for men and women is the same.
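One common form of inference is estimating a population's mean from a sample. The sketch below is only an illustration under stated assumptions: the population is simulated rather than real, and the interval uses a normal approximation, which is reasonable for a sample of this size but not for very small samples.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

# Hypothetical population of 100,000 simulated incomes (not real data).
population = [random.gauss(50_000, 12_000) for _ in range(100_000)]

# In practice we usually see only a sample, not the whole population.
sample = random.sample(population, 200)

sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))

# z-value that leaves 2.5% in each tail of a standard normal distribution.
z = statistics.NormalDist().inv_cdf(0.975)

low = sample_mean - z * standard_error
high = sample_mean + z * standard_error

print(f"sample mean: {sample_mean:.0f}")
print(f"approximate 95% interval for the population mean: {low:.0f} to {high:.0f}")
print(f"actual population mean: {statistics.mean(population):.0f}")
```

The sample mean by itself is a descriptive statistic. The interval around it is inferential, because it makes a claim about the population that the sample was drawn from.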
In statistics, you will see many different types of distributions. The one on the right is called a normal distribution. Normal distributions are very common. In a normal distribution, the mean sets the center of the distribution and the standard deviation sets its width.
Play with different values for the mean and standard deviation and see how the distribution is affected.
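The same experiment can be sketched in code. This example assumes Python's statistics.NormalDist class; the describe helper and the specific mean and standard deviation values are hypothetical choices for illustration.

```python
from statistics import NormalDist

def describe(mean, sd):
    """Summarize a normal distribution with the given mean and standard deviation."""
    dist = NormalDist(mu=mean, sigma=sd)
    low, high = mean - sd, mean + sd
    peak = dist.pdf(mean)                    # height of the curve at its center
    share = dist.cdf(high) - dist.cdf(low)   # probability within one sd of the mean
    return peak, low, high, share

for mean, sd in [(0, 1), (0, 2), (5, 1)]:
    peak, low, high, share = describe(mean, sd)
    print(f"mean={mean}, sd={sd}: peak height {peak:.3f}, "
          f"{share:.1%} of values between {low} and {high}")
```

Changing the mean moves the center of the curve without changing its shape. Increasing the standard deviation makes the curve wider and flatter, while the share of values within one standard deviation of the mean stays the same.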
How do you think changing these values affects descriptive statistics? How about inferential statistics?