Linear Regression is the workhorse of applied Data Science: it has long been one of the methods scientists use most, and it can be applied to a wide variety of datasets and questions. Unlike many more recently developed Machine Learning methods, Linear Regression models can be used both to predict new data points and to explain the relative impact one variable has on another. For example, one well-designed regression model can answer:
- How, and to what extent, does advertising in print media affect the total sales of a product?
- What are the predicted total sales for a product, given the amount spent on print advertisement this month?
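As a rough illustration of how one fitted model can speak to both questions, here is a minimal sketch in R. The data frame `print_ads_demo`, its column `print_spend`, and all of its values are made up for illustration; they are not the lesson's datasets, and the steps later in the lesson are still needed before trusting any real model.

```r
# Made-up illustration data (NOT the lesson's advertising dataset):
# `print_spend` is a hypothetical column name for print advertising spend.
print_ads_demo <- data.frame(
  print_spend = c(10, 20, 30, 40, 50),
  sales       = c(25, 45, 65, 85, 105)
)

# Fit a simple linear regression of sales on print advertising spend
model <- lm(sales ~ print_spend, data = print_ads_demo)

# Question 1: how, and to what extent, does print spend relate to sales?
# The slope estimates the change in sales per one-unit increase in spend.
slope <- coef(model)[["print_spend"]]
print(slope)

# Question 2: what sales does the model predict for a given spend?
# The spend value 60 is an arbitrary example input.
predicted_sales <- predict(model, newdata = data.frame(print_spend = 60))
print(predicted_sales)
```

The same `model` object serves both purposes: `coef()` supports explanation, while `predict()` supports forecasting.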
In this lesson, you will learn how to harness the malleability and explanatory power of Linear Regression models by following the four primary steps of statistical model building: confirming data assumptions, building a model on training data, assessing model fit, and analyzing model results. Using two real-life datasets, conversion and advertising, this lesson will also focus on applying regression modeling to marketing data science. Let’s get started!
Instructions
Two datasets, conversion and advertising, are included in the workspace for you to explore. The conversion dataset is an example of data that a scientist would receive from a large social media platform, like Facebook or Twitter, and can be used to answer questions about the performance of newsfeed advertisements. The advertising dataset is an example of data that would be recorded by a company’s marketing department; a marketing data scientist would often be given this data and expected to answer questions about the return on investment for various media channels, like podcast or TV advertisements.
Familiarize yourself with both datasets before moving forward! Some suggestions:
- Use str(dataset_name) to print out a list of variables and their associated data types.
- Use a combination of mean() and R’s variable selection syntax to calculate the average value of a data variable. For instance, we could calculate the mean of sales in the advertising dataset using mean(advertising$sales).
- Use ggplot’s geom_bar and geom_point functions to visually analyze the distribution of data values.
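The suggestions above can be tried together in a short sketch like the one below. It uses a small stand-in data frame so that it runs on its own; the name `advertising_demo`, the `tv` column, and every value are assumptions made for illustration (only the `sales` variable is named in this lesson). In the workspace, you would swap in the real `advertising` or `conversion` dataset.

```r
# Stand-in for the lesson's advertising dataset; values are made up and
# the `tv` column name is an assumption (only `sales` appears in the text).
advertising_demo <- data.frame(
  tv    = c(10, 20, 30, 40, 50),
  sales = c(12, 19, 33, 38, 48)
)

# Print the variables and their associated data types
str(advertising_demo)

# Select one variable with `$` and calculate its average
avg_sales <- mean(advertising_demo$sales)
print(avg_sales)

# Scatter plot of one variable against another (needs the ggplot2 package)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(advertising_demo, aes(x = tv, y = sales)) +
    geom_point()
  print(p)
}
```

For a categorical variable, replacing `geom_point()` with `geom_bar()` and mapping only `x` would show the count of rows in each category instead.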