# Basic Data Analysis

### Summary Statistics in R

Because R is built primarily for statistical computing, functions for summary statistics are included in base R.

• Use `mean()` and `median()` to calculate the average of a vector.
• Use `min()`, `max()`, and `range()` to see the range of a vector.
• Use `sd()` or `var()` to calculate the spread of a vector.
• Use `table()` to view the frequency of each value in a vector.
```
## AVERAGE
mean(dat) # mean
median(dat) # median

## RANGE
min(dat) # minimum value
max(dat) # maximum value
range(dat) # minimum and maximum

## FREQUENCY
table(dat) # frequency of each value
```
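As a minimal sketch of the functions listed above, the following applies each one to a small vector. The values in `dat` are made up for illustration and do not come from any real dataset.

```r
# Hypothetical example vector; the values are made up for illustration
dat <- c(2, 4, 4, 6, 9)

## AVERAGE
mean(dat)    # 5
median(dat)  # 4

## RANGE
min(dat)     # 2
max(dat)     # 9
range(dat)   # 2 9

## SPREAD
var(dat)     # 7 (sample variance, divides by n - 1)
sd(dat)      # square root of the variance, about 2.65

## FREQUENCY
table(dat)   # counts: 2 -> 1, 4 -> 2, 6 -> 1, 9 -> 1
```

Note that `var()` and `sd()` compute the sample versions of these statistics, so they divide by `n - 1` rather than `n`.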

### ggplot() Initializes a ggplot Object

Invoking the `ggplot()` function returns an object that serves as the base of a ggplot2 visualization.

`viz <- ggplot()`

`viz # renders blank plot`

Data is bound to a ggplot2 visualization by passing a data frame as the first argument in the `ggplot()` function call. Layers are added to the plot object by chaining function calls after `ggplot()` with the `+` operator. These functions have access to the data frame and can use its column names as variables.

For example, consider a data frame `sales` with the columns `cost` and `profit`. The following code binds `sales` to a new ggplot object and adds a layer of points:

`viz <- ggplot(data=sales) + geom_point(aes(x=cost, y=profit))`

`viz # renders plot`

In the example above:

• The ggplot object, or canvas, was initialized with the data frame `sales` assigned to it.
• The subsequent `geom_point` layer used the `cost` and `profit` columns to define the scales of the axes for that particular geom. Notice that it referred to those columns by their column names.
• Writing the variable name `viz` on its own line prints the ggplot object, which renders the plot.
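The steps described above can be sketched as a self-contained script. The `sales` data frame here is hypothetical: the source does not give its contents, so the values are made up for illustration.

```r
library(ggplot2)

# Hypothetical data frame; the values are made up for illustration
sales <- data.frame(
  cost   = c(10, 20, 30, 40),
  profit = c(3, 8, 11, 18)
)

# Bind the data frame to the plot, then add a layer of points
viz <- ggplot(data = sales) +
  geom_point(aes(x = cost, y = profit))

viz # printing the object renders the plot
```

Because `sales` is bound to the plot, `geom_point()` can refer to `cost` and `profit` directly, without writing `sales$cost` or `sales$profit`.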

### ggplot2 Aesthetics

In ggplot2, aesthetics are the instructions that determine the visual properties of a plot and its geometries.

Examples of ggplot2 aesthetics include:

• scales for the x and y axes
• color of the data points on the plot based on a property or on a color preference
• the size or shape of different geometries

Aesthetics are set either manually or by aesthetic mappings. Aesthetic mappings “map” variables from the bound data frame to visual properties in the plot. These mappings are provided in two ways using the `aes()` mapping function:

1. At the canvas level: All subsequent layers on the canvas will inherit the aesthetic mappings defined when the ggplot object was created with `ggplot()`.
2. At the geom level: Only that layer will use the aesthetic mappings provided.

For example, the following code assigns `aes()` mappings for the `x` and `y` scales at the canvas level:

`viz <- ggplot(data=airquality, aes(x=Ozone, y=Temp)) + geom_point() + geom_smooth()`

In the example above:

• The aesthetic mapping is wrapped in the `aes()` aesthetic mapping function as an additional argument to `ggplot()`.
• Both of the subsequent geom layers, `geom_point()` and `geom_smooth()`, use the scales defined inside the aesthetic mapping assigned at the canvas level.

You could create the same plot by setting the aesthetics at the geom level, as follows:

`viz <- ggplot(data=airquality) +  geom_point(aes(x=Ozone, y=Temp)) +  geom_smooth(aes(x=Ozone, y=Temp))`
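The difference between mapping an aesthetic and setting it manually can be sketched with the same built-in `airquality` data. The variable names `mapped` and `fixed` are chosen here for illustration, as is the choice of `Month` for the color mapping.

```r
library(ggplot2)

# Mapped: color varies with the Month column, so it goes inside aes()
mapped <- ggplot(data = airquality, aes(x = Ozone, y = Temp)) +
  geom_point(aes(color = factor(Month)))

# Set manually: every point gets the same fixed color, so it goes outside aes()
fixed <- ggplot(data = airquality, aes(x = Ozone, y = Temp)) +
  geom_point(color = "blue")
```

Both plots inherit the `x` and `y` mappings from the canvas level. Only the color of the point layer differs: in `mapped` it depends on the data, while in `fixed` it is a constant.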

### Creating Regression Models in R

The `lm()` function fits a linear regression model in R. The `glm()` function fits a generalized linear model; with the argument `family = binomial`, it fits a logistic regression model.

These functions take a formula `Y ~ X`, where `Y` is the outcome variable and `X` is the predictor variable. Additional predictor variables can be added using `+`.

A summary of these models can be printed using the `summary()` function.

```
## Linear regression model
temp_lm <- lm(temp ~ month + region, data = world)
summary(temp_lm) # print summary

## Logistic regression model (family = binomial is required; the default is gaussian)
winning_glm <- glm(win ~ ranking + home + starting_players, data = team, family = binomial)
summary(winning_glm) # print summary
```
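Since the `world` and `team` data frames above are not defined in this section, here is a runnable sketch of the same two calls using the built-in `mtcars` dataset. The choice of variables (`mpg`, `wt`, `hp`, `am`) is an assumption made for illustration only.

```r
# Linear regression: miles per gallon as a function of weight and horsepower
mpg_lm <- lm(mpg ~ wt + hp, data = mtcars)
summary(mpg_lm) # print summary

# Logistic regression: whether a car has a manual transmission (am = 1)
# family = binomial is what makes glm() fit a logistic model
am_glm <- glm(am ~ wt, data = mtcars, family = binomial)
summary(am_glm) # print summary
```

The fitted coefficients can also be pulled out directly with `coef(mpg_lm)` instead of reading them from the printed summary.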

### Making Predictions from Regression Objects in R

To make predictions of the outcome variable using a regression model, we need a dataset whose column names match the names of the predictor variables in the model. Once the data to make predictions about is established, we can use the `predict()` function to generate predictions. This produces one predicted outcome for each observation in the new dataset.

```
## Create linear regression model
lm1 <- lm(y ~ x1 + x2 + x3, data = dat)

## Establish data to make predictions about
pred_data <- data.frame(
  x1 = c(0, 1, -1),
  x2 = c(1, 6, 5),
  x3 = c(10, -4, 9)
)

## Make predictions
predict(lm1, pred_data)
```
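The example above depends on a data frame `dat` that is not defined here, so the following is a runnable sketch of the same workflow on the built-in `mtcars` dataset. The variable choices and the new observations in `new_cars` are assumptions made for illustration. The sketch also recomputes the linear predictions by hand from the coefficients, and shows `type = "response"`, which makes `predict()` return probabilities for a logistic model.

```r
# Fit a linear model on the built-in mtcars data
mpg_lm <- lm(mpg ~ wt + hp, data = mtcars)

# New observations: column names must match the predictors (wt, hp)
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5), hp = c(100, 150, 200))

# One prediction per row of new_cars
preds <- predict(mpg_lm, newdata = new_cars)

# The same values computed by hand from the fitted coefficients
b <- coef(mpg_lm)
by_hand <- b[["(Intercept)"]] + b[["wt"]] * new_cars$wt + b[["hp"]] * new_cars$hp

# For a logistic model, type = "response" returns predicted probabilities
am_glm <- glm(am ~ wt, data = mtcars, family = binomial)
probs <- predict(am_glm, newdata = data.frame(wt = c(2.5, 3.5)), type = "response")
```

Without `type = "response"`, `predict()` on a `glm` object returns values on the scale of the linear predictor (log-odds for a logistic model) rather than probabilities.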