
Congratulations! In this lesson you’ve learned to:

• Fit a multiple linear regression model in Python
• Write and interpret a multiple regression model
• Understand what binary and quantitative predictor coefficients mean visually and in context
• Check the assumption that multicollinearity isn’t present

### Instructions

A new dataset called family has been loaded for you. The data is adapted from the Family Income and Expenditure Survey (FIES) of the Philippine Statistics Authority (PSA), a survey conducted every three years on family income and expenditure in the Philippines. We’ll work with the following variables from this dataset:

• income (income)
• total food expenditure (food)
• total housing and water expenditure (housing)
• source of income (source)

The income and expenditure variables are measured in thousands of Philippine pesos. Try practicing multiple regression in script.py using the following instructions. Sample solutions are provided in solutions.py.

1. Create a heat map of the correlations between the quantitative variables in the family dataset. Do any pairs have high correlations?
2. Fit a model for income using food, housing, and source as predictors and inspect a summary of the results. The binary variable source has values Entrepreneurial Activities and Wage/Salaries. According to the summary, which value of source is coded as 1 and which is coded as 0?
3. Write out the regression equation from the coefficients. Did you remember that you can print just the coefficients using .params?
4. Interpret the intercept of the equation. Is this interpretation practical?
5. Interpret the coefficient on the variable source in terms of expected income. How is the intercept different between groups?
6. Interpret the coefficient on food. Is there an increase or decrease in income associated with an increase in food expenditure?
7. Interpret the coefficient on housing. Is there an increase or decrease in income associated with an increase in housing expenditure?
8. Create a scatter plot with housing on the x-axis and income on the y-axis, colored by source. Looking only at the Wage/Salaries group, use the regression equation from step 3 to add three lines to the plot for when food expenditure is 10,000, 100,000, and 200,000 pesos, giving each line a different color. Remember that food is measured in thousands of pesos, so 10,000 pesos is food = 10. Why did we have to look at only one value of source to produce these lines?
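The arithmetic behind the lines in step 8 can be sketched in plain Python. The coefficients in `coefs` are hypothetical placeholders, not the values your model will produce, and the helper names `predicted_income` and `line_for` are invented for this example. Substitute the numbers from your own `.params` output.

```python
# Hypothetical coefficients, made up for illustration only.
# Replace them with the values from your fitted model's .params.
coefs = {"intercept": 50.0, "food": 1.5, "housing": 2.0, "wage_salaries": -30.0}

def predicted_income(housing, food, wage_salaries, b=coefs):
    """Expected income (thousands of pesos) from the regression equation."""
    return (b["intercept"] + b["food"] * food
            + b["housing"] * housing + b["wage_salaries"] * wage_salaries)

def line_for(food, wage_salaries, b=coefs):
    """Intercept and slope of the income-vs-housing line at fixed food and source."""
    intercept = b["intercept"] + b["food"] * food + b["wage_salaries"] * wage_salaries
    return intercept, b["housing"]

# Food expenditure of 10,000, 100,000, and 200,000 pesos is food = 10, 100, 200.
lines = {food: line_for(food, wage_salaries=1) for food in (10, 100, 200)}
for food, (intercept, slope) in lines.items():
    print(f"food = {food}: income = {intercept:.1f} + {slope:.1f} * housing")
```

Each fixed value of food gives its own intercept while the slope on housing stays the same, so the three lines are parallel. With matplotlib, each pair can be drawn over the scatter plot by plotting `intercept + slope * x` against a range of housing values `x`.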