Linear Regression is the workhorse of applied Data Science: it has long been among the methods scientists use most often, and it can be applied to a wide variety of datasets and questions. Unlike many more recently developed Machine Learning methods, which are often used mainly for prediction, Linear Regression models can be used both to predict new data points and to help us understand the relative impact one variable has on another. For example, one well-designed regression model can answer both of the following questions:

  • How, and to what extent, does advertising in print media affect the total sales of a product?
  • What are the predicted total sales for a product, given the amount spent on print advertisement this month?
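To make these two questions concrete, here is a minimal sketch in R. It does not use the lesson's advertising dataset: the toy_ads data frame, its print_spend column, and all of its values are made up for illustration, and only the sales column name is borrowed from the lesson.

```r
# Hypothetical toy data: values are invented for illustration only
# and are NOT taken from the lesson's advertising dataset.
toy_ads <- data.frame(
  print_spend = c(10, 20, 30, 40, 50),  # assumed column name
  sales       = c(24, 47, 63, 88, 104)
)

# Fit a simple linear regression of sales on print advertising spend
model <- lm(sales ~ print_spend, data = toy_ads)

# Question 1: how, and to what extent, does print spend relate to sales?
# The print_spend coefficient is the estimated change in sales
# for each one-unit increase in print spend.
print(coef(model))

# Question 2: what sales does the model predict for a given print spend?
new_month <- data.frame(print_spend = 35)
predicted_sales <- predict(model, newdata = new_month)
print(predicted_sales)
```

The same fitted model object answers both questions: coef() supports the explanatory reading, and predict() supports the forecasting one.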

In this lesson, you will learn how to harness the malleability and explanatory power of Linear Regression models by following the four primary steps of statistical model building: confirming data assumptions, building a model on training data, assessing model fit, and analyzing model results. Using two real-life datasets, conversion and advertising, this lesson will also focus on the application of regression modeling for use in marketing data science. Let’s get started!


Two datasets, conversion and advertising, are included in the workspace for you to explore. The conversion dataset is an example of data that a data scientist would receive from a large social media platform, like Facebook or Twitter, and can be used to answer questions about the performance of newsfeed advertisements. The advertising dataset is an example of data that would be recorded by a company's marketing department; a marketing data scientist would often be given this data and expected to answer questions about the return on investment for various media channels, like podcast or TV advertisements.

Familiarize yourself with both before moving forward! Some suggestions:

  • Use str(dataset_name) to print out a list of variables and their associated datatypes
  • Use a combination of mean() and R’s variable selection syntax to calculate the average value of a data variable. For instance, we could calculate the mean of sales in the advertising dataset using mean(advertising$sales)
  • Use ggplot's geom_bar and geom_point functions to visually analyze the distribution of data values.
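The suggestions above might look like the following in practice. This is a sketch only: toy_ads stands in for the real advertising dataset, its values are invented, and the print_spend and channel columns are hypothetical names chosen for illustration. It assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Hypothetical stand-in for the advertising dataset; values are invented.
toy_ads <- data.frame(
  print_spend = c(10, 20, 30, 40, 50),                      # assumed column
  channel     = c("tv", "podcast", "tv", "print", "print"),  # assumed column
  sales       = c(24, 47, 63, 88, 104)
)

# List the variables and their associated datatypes
str(toy_ads)

# Average value of one variable, selected with the $ syntax
avg_sales <- mean(toy_ads$sales)
print(avg_sales)

# Bar chart: number of rows per category of a discrete variable
bar_plot <- ggplot(toy_ads, aes(x = channel)) +
  geom_bar()

# Scatter plot: relationship between two numeric variables
point_plot <- ggplot(toy_ads, aes(x = print_spend, y = sales)) +
  geom_point()

# In a script, plots are drawn when printed explicitly
print(bar_plot)
print(point_plot)
```

To explore the real data, swap toy_ads for conversion or advertising and replace the assumed column names with the ones that str() reports.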
