Let’s explore another useful chart: the scatterplot. Scatterplots are great for directly comparing two continuous numeric variables, helping us to visualize the relationship or correlation between them.
- If the points are scattered from bottom-left to top-right in the graph, we’d be able to say “As X increases, Y also increases.” This indicates a positive correlation.
- If the points are scattered from top-left to bottom-right in the graph, we’d be able to say “As X increases, Y decreases.” This indicates a negative correlation.
- If the points are scattered in a horizontal line, we can say “As X increases, Y remains constant.” This indicates no correlation. A cloud of scattered points with no clear directional grouping also indicates no correlation.
Closely grouped points in a line or curve are evidence of a stronger and more consistent relationship between the variables, while a looser cloud of points indicates a weaker relationship.
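To make these patterns concrete, here is a minimal sketch that plots one example of each case. The data is synthetic and generated only for illustration, and the Pearson correlation coefficient shown in each title (computed with np.corrcoef) is an addition for reference, not something the lesson requires.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data, made up purely for illustration
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=200)
noise = rng.normal(0, 2, size=200)

y_positive = 2 * x + noise            # tends to rise as x rises
y_negative = -2 * x + noise           # tends to fall as x rises
y_none = rng.normal(0, 1, size=200)   # generated independently of x


def pearson_r(a, b):
    """Return the Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


# One panel per pattern, sharing the same x values
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharex=True)
panels = [("Positive", y_positive), ("Negative", y_negative), ("No", y_none)]
for ax, (label, y) in zip(axes, panels):
    ax.scatter(x, y, alpha=0.5)
    ax.set_title(f"{label} correlation (r = {pearson_r(x, y):.2f})")
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
fig.tight_layout()
plt.show()
```

Values of r near 1 or -1 correspond to tightly grouped points along a line, while values near 0 correspond to a loose cloud.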
With that in mind, how do we make a scatterplot in matplotlib? With the scatterplot function, of course! Use plt.scatter() with the following parameters:
- x and y: the continuous numeric variables to be compared
- color: marker color, as a color code, color name, or hex code
- alpha: marker opacity, as a number between 0 (transparent) and 1 (opaque)
Only x and y are required parameters, but adjusting the formatting of the markers helps make patterns in the graph more apparent. For scatterplots with many overlapping points, for example, lowering the alpha to make the points semi-transparent can turn the scatterplot from a single-colored blob into something that reads like a heatmap, with darker areas where more points overlap. We can also adjust marker shape and size using the marker and s parameters, respectively. By default, the marker is a circle ('o') and s is rcParams['lines.markersize'] ** 2, which is 36 (in points squared) under matplotlib's default settings.
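As a quick illustration of the marker and s parameters, the sketch below draws the same points twice: once with the defaults and once with custom formatting. The data is random and the specific values (a triangle marker, s=80, a blue hex code) are arbitrary choices for the example.

```python
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

# Random points, made up purely for illustration
rng = np.random.default_rng(seed=1)
x = rng.normal(0, 1, size=100)
y = rng.normal(0, 1, size=100)

# When s is not given, the marker area defaults to lines.markersize squared
default_s = mpl.rcParams["lines.markersize"] ** 2

fig, (ax_default, ax_custom) = plt.subplots(1, 2, figsize=(10, 4))

# Left: only the required x and y parameters
ax_default.scatter(x, y)
ax_default.set_title(f"Defaults (marker='o', s={default_s:g})")

# Right: triangle markers, larger area, hex color, semi-transparent
custom = ax_custom.scatter(x, y, marker="^", s=80, color="#1f77b4", alpha=0.4)
ax_custom.set_title("marker='^', s=80, alpha=0.4")

fig.tight_layout()
plt.show()
```

Note that s is an area in points squared, so doubling s does not double the marker's width.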
Let’s check out an example: suppose we have a dataset of tree measurements. We might expect increased age to also increase the trunk circumference (a positive correlation!) since trees get bigger as they get older. To visualize the relationship between age and tree trunk circumference, the following plot could be used:
plt.scatter(data.tree_age, data.trunk_circumference, color='yellowgreen', alpha=0.5)
plt.title('Effect of Tree Age on Trunk Circumference in Red Oaks')
plt.ylabel('Trunk Circumference (cm)')
plt.xlabel('Tree Age (years)')
plt.show()
We’d expect to see a positive correlation, since we know that as oak trees get older, their trunks get thicker. Let’s check out some music-related correlations in the Jupyter notebook.
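The tree dataset itself isn't shown here, so to run the example end to end we can build a stand-in. The sketch below assumes data is a pandas DataFrame with tree_age and trunk_circumference columns; the values are synthetic, and the growth rate of about 2.5 cm per year is an invented number, not a real measurement.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the tree dataset (values are invented)
rng = np.random.default_rng(seed=42)
tree_age = rng.uniform(5, 100, size=150)
# Assumed relationship: roughly 2.5 cm of circumference per year, plus noise
trunk_circumference = 2.5 * tree_age + rng.normal(0, 20, size=150)

data = pd.DataFrame({
    "tree_age": tree_age,
    "trunk_circumference": trunk_circumference,
})

# Same plotting calls as in the lesson's example
plt.scatter(data.tree_age, data.trunk_circumference, color='yellowgreen', alpha=0.5)
plt.title('Effect of Tree Age on Trunk Circumference (synthetic data)')
plt.ylabel('Trunk Circumference (cm)')
plt.xlabel('Tree Age (years)')
plt.show()
```

Because the synthetic circumference was built to grow with age, the points run from bottom-left to top-right, which is the positive pattern described earlier.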
Instructions
Run the Setup cells to load in the necessary packages and the spotify_data_by_genres CSV. Run the cell below to check out the columns in the dataset to get a sense of what it contains.
Write the code to make a scatterplot comparing danceability and valence. danceability is a measure of how “danceable” a song is, from 0 (not dance-y, like an audiobook or funeral dirge) to 1 (very dance-y, like disco or house music). valence is a measure of the song’s mood, from 0 (sad) to 1 (happy). What kind of correlation do you expect?
Adjust the alpha to 0.25 to make the relationship more obvious. Play around with other alpha values to see how the graph’s readability changes!
Change the color to 'teal' and the alpha to 0.15.
OPTIONAL: Experiment with other Spotify categories in the x and y parameters, if you want!
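To see why the alpha value matters when many points overlap, here is a small side-by-side sketch. It does not use the Spotify data: x and y are synthetic values between 0 and 1, generated only to produce a dense cloud, and the point count of 3000 is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Dense synthetic data in the 0-to-1 range (not the Spotify dataset)
rng = np.random.default_rng(seed=7)
n = 3000
x = rng.beta(5, 3, size=n)
y = np.clip(0.6 * x + rng.normal(0.15, 0.15, size=n), 0, 1)

# Draw the same points with three different opacities
alphas = [1.0, 0.25, 0.15]
fig, axes = plt.subplots(1, len(alphas), figsize=(13, 4), sharey=True)
collections = []
for ax, a in zip(axes, alphas):
    collections.append(ax.scatter(x, y, color="teal", alpha=a))
    ax.set_title(f"alpha = {a}")
    ax.set_xlabel("x (synthetic)")
axes[0].set_ylabel("y (synthetic)")
fig.tight_layout()
plt.show()
```

With fully opaque markers the dense center is a solid block of color, while the lower alpha values let overlapping points build up darker regions where the data is concentrated.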