You’ve already seen how statistics and probability help us understand our data. But how do we actually interact with datasets? This is where programming comes in.
With the right instructions, we can use the processing power of computers to analyze and visualize millions of data points. Once these instructions are written, we can then run them at any time, allowing us to replicate, update, and automate our analyses as we receive new data.
A set of instructions like this is called a program or an algorithm. It is written in a computer programming language, and the written instructions are usually referred to as code for short. In future Codecademy courses, you will learn all the details of writing code to analyze data. For now, let’s go over a common example of how computer programming can help us understand a dataset and begin to make predictions! The dataset we’ll be looking at is a collection of flipper and bill measurements for three different species of penguins, collected by Dr. Kristen Gorman and the Palmer Station in Antarctica.
When you opened this exercise, some code running behind the scenes loaded this data and created a visualization of the flipper and bill measurements for three penguin species. Take a look at the visualization. You might notice that there isn’t much overlap between species on the plot. For example, Chinstrap penguins seem to usually have longer bills than Adelie penguins, and so Chinstrap penguins appear above Adelie penguins in the visualization. Regions like these that predominantly feature one species over another are called clusters in data science, and are often used when building computer models to make predictions about data.
For example, if we find a new penguin that has 180 mm long flippers and a 35 mm long bill, we might conclude (based on these clusters) that our penguin is more likely to be an Adelie penguin than either of the other two species. In our code running behind the scenes, we’ve built a computer model to do this kind of prediction automatically.
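To make the idea of predicting from clusters concrete, here is a minimal sketch of one simple approach: find the center of each species' cluster, then assign a new penguin to the species whose center is closest. This is not necessarily the model used in this exercise. The function names are our own, and the handful of measurements below are made-up values for illustration, not rows from the real dataset.

```python
from math import dist

# Made-up illustrative measurements as (flipper_mm, bill_mm) pairs.
# These are NOT real rows from the penguin dataset.
SAMPLES = {
    "Adelie": [(185, 38), (190, 39), (195, 40)],
    "Chinstrap": [(193, 48), (196, 49), (199, 50)],
    "Gentoo": [(212, 46), (217, 47), (222, 48)],
}


def centroid(points):
    """Return the average (flipper_mm, bill_mm) of a list of points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


# One center per species, computed from the sample points above.
CENTROIDS = {species: centroid(points) for species, points in SAMPLES.items()}


def predict_species(flipper_mm, bill_mm):
    """Return the species whose cluster center is closest to the new penguin."""
    new_penguin = (flipper_mm, bill_mm)
    return min(CENTROIDS, key=lambda species: dist(CENTROIDS[species], new_penguin))


# The penguin from the example: 180 mm flippers and a 35 mm bill.
print(predict_species(180, 35))
```

With these sample values, the 180 mm and 35 mm penguin lands closest to the Adelie center. A real model would be built from the full dataset and would usually account for the two measurements being on different scales, which this sketch ignores.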
Change the flipper and bill measurements in the terminal and run the code. An algorithm we’ve loaded for you will attempt to predict which species of penguin matches the measurements you chose. How well do you think it does? What happens if you enter values that are between two regions, or values outside of the existing regions altogether?
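One way to think about those last two questions is to look at how far a set of measurements is from each cluster, rather than only at the final prediction. The sketch below assumes a simple model that picks the nearest cluster center, which may not be what this exercise uses. The center values and the function name are hypothetical and chosen only for illustration.

```python
from math import dist

# Hypothetical cluster centers as (flipper_mm, bill_mm), invented for illustration.
CENTERS = {
    "Adelie": (190, 39),
    "Chinstrap": (196, 49),
    "Gentoo": (217, 47),
}


def ranked_distances(flipper_mm, bill_mm):
    """Return (species, distance) pairs sorted from nearest to farthest center."""
    point = (flipper_mm, bill_mm)
    return sorted(
        ((species, dist(center, point)) for species, center in CENTERS.items()),
        key=lambda pair: pair[1],
    )


# A point between two centers: the two nearest distances are similar.
between = ranked_distances(193, 43)
# A point far from every center: a nearest-center model still names a species.
outside = ranked_distances(300, 90)

for label, ranking in (("between", between), ("outside", outside)):
    nearest, runner_up = ranking[0], ranking[1]
    print(label, nearest[0], round(nearest[1], 1), runner_up[0], round(runner_up[1], 1))
```

A model like this always returns an answer, even for measurements that look nothing like any penguin in the data. When the nearest and second-nearest distances are similar, or when every distance is large, the prediction deserves less trust.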
In future Codecademy courses, you will learn to do all of this yourself! In the meantime, you are welcome to explore the code we loaded for you by opening the code.py file in this exercise. Otherwise, select next whenever you are ready to move on to the next exercise.