Thinking about Data

Correlation Coefficient

A correlation coefficient is a value between -1 and +1 that measures the strength and direction of a linear relationship between two variables. A value near +1 indicates a strong positive correlation, a value near -1 indicates a strong negative correlation, and a value close to 0 suggests little to no linear relationship.
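As a minimal sketch, the Pearson correlation coefficient (one common choice of correlation coefficient; the source does not name a specific one) can be computed by hand. The function name `pearson_r` and the sample data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations from each mean.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(ss_x * ss_y)

hours = [1, 2, 3, 4, 5]          # hypothetical hours studied
scores = [52, 58, 65, 71, 80]    # hypothetical exam scores (rise with hours)
errors = [9, 7, 6, 4, 1]         # hypothetical error counts (fall with hours)

r_pos = pearson_r(hours, scores)  # close to +1
r_neg = pearson_r(hours, errors)  # close to -1
```

With this data, `r_pos` is about 0.997 and `r_neg` is about -0.985.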

Robust Statistics

The median and interquartile range (IQR) are robust statistics because they are not heavily affected by outliers or skewed data, unlike the mean and standard deviation.
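The effect can be illustrated with a small sketch that adds one extreme value to a hypothetical dataset and compares how far the mean and the median move.

```python
import statistics

incomes = [30, 32, 35, 38, 40]   # hypothetical values, in thousands
with_outlier = incomes + [500]   # same data plus one extreme value

# How much each measure of center moves when the outlier is added.
mean_shift = statistics.mean(with_outlier) - statistics.mean(incomes)
median_shift = statistics.median(with_outlier) - statistics.median(incomes)
```

Here the mean jumps by 77.5 (from 35 to 112.5), while the median moves by only 1.5 (from 35 to 36.5).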

Outliers

Outliers are values that are much higher or lower than most of the data. They are far from the rest of the distribution and can affect how data is analyzed.
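One common convention for flagging outliers, which is an assumption here and not something the text above prescribes, is to mark any value more than 1.5 times the IQR below the first quartile or above the third quartile. A sketch with hypothetical data:

```python
import statistics

values = [10, 12, 12, 13, 14, 15, 16, 95]  # hypothetical data

# Quartile cut points; the exact values depend on the quantile method used.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Assumed convention: the 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
```

With this data, only 95 falls outside the fences.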

Data Aggregation

Data can be aggregated by summarizing a numeric variable for each category in a dataset. This helps in comparing values across different groups.
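A minimal sketch of aggregation using only the standard library: group a numeric value by category, then summarize each group with its mean. The `sales` rows and region names are hypothetical.

```python
from collections import defaultdict
import statistics

# Hypothetical rows of (category, numeric value).
sales = [
    ("north", 120),
    ("south", 80),
    ("north", 100),
    ("south", 90),
    ("east", 150),
]

# Collect the numeric values belonging to each category.
groups = defaultdict(list)
for region, amount in sales:
    groups[region].append(amount)

# Summarize each group with a single number (here, the mean).
mean_by_region = {
    region: statistics.mean(amounts) for region, amounts in groups.items()
}
```

The result maps each region to one summary value (north 110, south 85, east 150), which makes the groups directly comparable.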

Distribution

A distribution represents all possible values of a variable and how often each value occurs. It helps describe patterns in data by showing how values are spread across a dataset.

Mean

The mean, or average, represents the center of a numeric distribution. It is calculated by adding all values in a dataset and dividing the sum by the total number of values.
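As a quick sketch with hypothetical data, the mean is the sum of the values divided by their count:

```python
values = [4, 8, 15, 16, 23, 42]  # hypothetical data

# Sum of all values divided by how many values there are.
mean = sum(values) / len(values)
```

Here the sum is 108 and there are 6 values, so `mean` is 18.0.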

Standard Deviation

The standard deviation measures how spread out values are in a numeric distribution. It is the square root of the average squared distance of each value from the mean, and can be read roughly as the typical distance of a value from the mean.
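A sketch of the calculation with hypothetical data. It shows both the population form (divide by the number of values) and the sample form (divide by one less than the number of values); which one applies depends on whether the data is the whole population or a sample from it.

```python
import math

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

mean = sum(values) / len(values)

# Squared distance of each value from the mean.
squared_devs = [(v - mean) ** 2 for v in values]

# Population standard deviation: divide by n.
pop_std = math.sqrt(sum(squared_devs) / len(values))

# Sample standard deviation: divide by n - 1.
sample_std = math.sqrt(sum(squared_devs) / (len(values) - 1))
```

For this data the mean is 5.0, `pop_std` is 2.0, and `sample_std` is about 2.138.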

Skewed Distribution

A skewed distribution is asymmetrical, with a rapid change in frequency on one side and a slower, trailing change on the other, creating a long tail.

Median

The median represents the center of a numeric distribution by identifying the middle value when all data points are arranged in order from smallest to largest. When the number of data points is even, the median is the average of the two middle values.
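A minimal sketch of the calculation; the function below is illustrative, and for an even number of values it follows the usual convention of averaging the two middle values.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two values around the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = median([7, 1, 5])       # sorted: 1, 5, 7
even_median = median([7, 1, 5, 3])   # sorted: 1, 3, 5, 7
```

Here `odd_median` is 5 and `even_median` is 4.0.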

Categorical Variables

Categorical variables can be described using frequencies, proportions, or ratios to summarize how often each category appears in a dataset.
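As a sketch with a hypothetical list of category labels, frequencies are counts per category and proportions are those counts divided by the total:

```python
from collections import Counter

colors = ["red", "blue", "red", "green", "red", "blue"]  # hypothetical data

# Frequency: how many times each category appears.
frequencies = Counter(colors)

# Proportion: each frequency as a share of all observations.
total = len(colors)
proportions = {color: count / total for color, count in frequencies.items()}
```

In this example "red" appears 3 times out of 6, so its proportion is 0.5, and the proportions across all categories sum to 1.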

Interquartile Range (IQR)

The interquartile range (IQR) measures the spread of values as the difference between the third quartile (Q3) and the first quartile (Q1), which is the range covered by the middle 50% of the data.
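A sketch using Python's `statistics.quantiles` on hypothetical data. Several conventions exist for computing quartiles, so other tools may give slightly different Q1 and Q3 for the same data.

```python
import statistics

values = [1, 3, 5, 7, 9, 11, 13]  # hypothetical data

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(values, n=4)

# The IQR is the distance between the third and first quartiles.
iqr = q3 - q1
```

With the default method, Q1 is 3.0 and Q3 is 11.0 for this data, giving an IQR of 8.0.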

Scatter Plots and Correlation Coefficients

Scatter plots and correlation coefficients help show relationships between two numeric variables. Scatter plots visualize the data, while correlation coefficients measure the strength and direction of the relationship.

Summary Statistics

Summary statistics are used to measure and describe the variables in a dataset, providing an overview of the data.
