When it comes to categorical variables, the measures of central tendency and spread that worked for describing numeric variables, like mean and standard deviation, generally become unsuitable. Unlike numeric values, categorical values are discrete and often have no intrinsic ordering.
Instead, a good way to summarize categorical variables is to generate a frequency table containing the count of each distinct value. For example, we may be interested in knowing how many of the New York City rental listings are from each borough. Relatedly, we can also find which borough has the most listings.
The pandas library offers the .value_counts() method for generating the counts of all values in a DataFrame column:
# Counts of rental listings in each borough
df.borough.value_counts()
Manhattan    3539
Brooklyn     1013
Queens        448
By default, it returns the results sorted in descending order by count, where the top element is the mode, or the most frequently appearing value. In this case, the mode is Manhattan with 3,539 rental listings.
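To make the link between the frequency table and the mode concrete, here is a minimal sketch. It assumes a small, made-up DataFrame (the six rows below are hypothetical and are not the actual rental listings data), and the variable names borough_counts, top_borough, and top_count are chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the rental listings DataFrame
df = pd.DataFrame({
    "borough": ["Manhattan", "Brooklyn", "Manhattan",
                "Queens", "Manhattan", "Brooklyn"]
})

# Frequency table: one count per distinct value, sorted descending by default
borough_counts = df["borough"].value_counts()
print(borough_counts)

# Because of the descending sort, the first entry is the mode
top_borough = borough_counts.index[0]
top_count = borough_counts.iloc[0]
print(top_borough, top_count)

# Cross-check with pandas' own mode calculation
print(df["borough"].mode()[0])
```

Reading the mode off the top of the frequency table works here because the counts have no tie for first place. When several values share the highest count, .mode() returns all of them, so it can be the safer check.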
Using the movies DataFrame, find the number of movies in each genre and save the counts to a variable called genre_counts. Print genre_counts to see the result.