Python provides a module named datetime to deal with dates and times. After importing the datetime module, you can create a date, a time, or a combined date and time using the date(), time(), and datetime() classes respectively.
import datetime

# a date, using keyword or positional arguments
feb_16_2019 = datetime.date(year=2019, month=2, day=16)
feb_16_2019 = datetime.date(2019, 2, 16)
print(feb_16_2019)  # 2019-02-16

# a time of day
time_13_48min_5sec = datetime.time(hour=13, minute=48, second=5)
time_13_48min_5sec = datetime.time(13, 48, 5)
print(time_13_48min_5sec)  # 13:48:05

# a date and time together
timestamp = datetime.datetime(year=2019, month=2, day=16, hour=13, minute=48, second=5)
timestamp = datetime.datetime(2019, 2, 16, 13, 48, 5)
print(timestamp)  # 2019-02-16 13:48:05
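As a small follow-up sketch, the components of these objects can be read back through attributes, and a separate date and time can be merged with datetime.datetime.combine(). The variable names d, t, and combined are illustrative only.

```python
import datetime

# Illustrative values matching the example above
d = datetime.date(2019, 2, 16)
t = datetime.time(13, 48, 5)

# Components are available as read-only attributes
print(d.year, d.month, d.day)      # 2019 2 16
print(t.hour, t.minute, t.second)  # 13 48 5

# Merge a date and a time into a single datetime object
combined = datetime.datetime.combine(d, t)
print(combined)  # 2019-02-16 13:48:05
```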
In seaborn, distributions can be visualized using .histplot(), .kdeplot(), and .boxplot(), among other visualization functions. The main parameters are data and x.
data is an optional parameter for the name of the pandas DataFrame. x is the column name for the variable of interest. For histograms the y-axis shows the frequency, and for KDE plots it shows the probability density; a box plot drawn with x shows the values themselves along the x-axis.
For box plots, setting the y parameter to a grouping variable will show a box plot for each group on the same plotting grid.
import seaborn as sns

# histogram of heights
sns.histplot(data=df, x='height')

# KDE plot of heights
sns.kdeplot(data=df, x='height')

# box plot of heights
sns.boxplot(data=df, x='height')

# box plots of heights by age group
sns.boxplot(data=df, x='height', y='age_range')
In seaborn, a scatter plot can be created with .scatterplot(). The main parameters are data, x, and y.
data is an optional parameter for the name of the pandas DataFrame. x is the column name for the x-axis of the plot. y is the column name for the y-axis of the plot.
A scatter plot with a regression line can be created with .regplot(). This function takes the same parameters as .scatterplot() and produces the same plot, but with a regression line drawn on the scatter plot. By default, a 95% confidence interval is included as a shaded region around the line.
import seaborn as sns

# scatter plot of bird count by temperature
sns.scatterplot(data=df, x='bird_count', y='temperature')

# same plot with regression line
sns.regplot(data=df, x='bird_count', y='temperature')
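To illustrate the confidence interval mentioned above, the following sketch draws .regplot() twice on made-up data (the values are hypothetical): once with the default shaded band and once with ci=None, which draws the regression line without the band.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical observations
df = pd.DataFrame({
    "bird_count": [3, 5, 8, 10, 12, 15],
    "temperature": [50, 55, 61, 64, 70, 75],
})

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

# Default: regression line with a 95% confidence band
sns.regplot(data=df, x="bird_count", y="temperature", ax=left)

# ci=None: regression line only, no shaded band
sns.regplot(data=df, x="bird_count", y="temperature", ci=None, ax=right)
```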
Correlation ranges from negative one to positive one and measures the strength of the linear association between two quantitative variables. A correlation close to negative one indicates a strong negative linear relationship, where large values of one variable are associated with small values of the other. A correlation close to positive one indicates a strong positive linear relationship, where large values of one variable are associated with large values of the other. A correlation of 0 indicates there is no linear relationship. The figure shows pairs of variables with correlations ranging from negative one to one.
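As a rough illustration of these three cases, the sketch below computes the Pearson correlation with NumPy's np.corrcoef on small made-up arrays. The last pair has a clear U-shaped relationship, yet its correlation is zero because the relationship is not linear.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# y rises exactly linearly with x: correlation of +1
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# y falls exactly linearly with x: correlation of -1
r_neg = np.corrcoef(x, -3 * x + 10)[0, 1]

# symmetric U shape: related, but not linearly, so correlation is 0
u_x = np.array([-2, -1, 0, 1, 2])
u_y = u_x ** 2
r_zero = np.corrcoef(u_x, u_y)[0, 1]

print(r_pos, r_neg, r_zero)
```

With a pandas DataFrame, the .corr() method offers a similar calculation across columns.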
In Python, we can use scikit-learn to run a simple linear regression. The sample code shows each step for running a simple linear regression with a pandas DataFrame called df:
from sklearn.linear_model import LinearRegression

# format variables
X = df['input_var'].to_numpy().reshape(-1, 1)
y = df['output_var'].to_numpy().reshape(-1, 1)

# run regression
lm = LinearRegression()
model = lm.fit(X, y)

# view intercept
lm.intercept_

# view slope coefficient
lm.coef_
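The steps above can be tried end to end with a small made-up DataFrame. In this sketch the data is hypothetical and constructed to lie exactly on the line output_var = 5 * input_var - 200, so the fitted values are easy to check. Because y is reshaped to two dimensions, intercept_ and coef_ come back as arrays and need indexing to get plain numbers.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on output_var = 5 * input_var - 200
df = pd.DataFrame({"input_var": [60, 62, 65, 70, 72]})
df["output_var"] = 5 * df["input_var"] - 200

# format variables as 2-D arrays, one row per observation
X = df["input_var"].to_numpy().reshape(-1, 1)
y = df["output_var"].to_numpy().reshape(-1, 1)

# run regression
lm = LinearRegression().fit(X, y)

# with a 2-D y, intercept_ has shape (1,) and coef_ has shape (1, 1)
intercept = lm.intercept_[0]
slope = lm.coef_[0][0]
print(round(intercept, 2), round(slope, 2))
```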
To use a simple linear regression model to make a prediction, we plug the slope and intercept into the equation for a line (y = mx + b). For example, suppose we fit a linear model to predict weight based on height and calculate an intercept of -200 and a slope of 5. The equation is:
weight = 5 * height - 200
Therefore, a person who is 60 inches tall would be expected to weigh 100 pounds:
weight = 5 * 60 - 200 = 100
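The same arithmetic can be sketched in Python. The function name predict_weight is hypothetical; the slope and intercept are the values from the example in the text.

```python
# Values taken from the worked example in the text
intercept = -200
slope = 5

def predict_weight(height_in_inches):
    """Apply y = mx + b with the fitted slope and intercept."""
    return slope * height_in_inches + intercept

print(predict_weight(60))  # 100
```

With a fitted scikit-learn model, calling its predict() method on a 2-D input such as [[60]] performs the same calculation using the stored coefficient and intercept.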