Linear Modeling Exercise

Setup

library(tidyverse)
library(modelr)

options(repr.matrix.max.rows=15, repr.plot.height=4.5)

For this exercise, we will use the starwars data set, which ships with dplyr (loaded as part of the tidyverse):

starwars

Plotting Values

Plot the mass by height for each character:

ggplot(starwars) +
    aes(x=height, y=mass) +
    geom_point()

What 'shape' does this data have? Does it look like it might fit on a line?

Modeling a Line

The lm function fits a line to some data. It returns an lm object, which we can then save. Its first parameter is a formula. Formulas show up from time to time in R; this one, mass ~ height, says we want to model the mass in terms of the height. lm will learn a model

$$y = mx + b$$

where, in this case, $y$ is the mass and $x$ is the height; $b$ is the intercept, and $m$ is the slope or coefficient.

sw_linear = lm(mass ~ height, starwars)
summary(sw_linear)
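To make the connection between the fitted object and the equation of a line concrete, the sketch below pulls the intercept and slope out of the model with coef() and reproduces one prediction by hand. The 180 cm height is a made-up input chosen only for illustration, and the variable names (fit, intercept, slope) are not part of the exercise.

```r
library(dplyr)  # provides the starwars data set

# Fit the same model as above: mass in terms of height.
# Rows with a missing mass or height are dropped by lm automatically.
fit <- lm(mass ~ height, data = starwars)

# coef() returns a named vector with entries "(Intercept)" and "height".
intercept <- coef(fit)[["(Intercept)"]]
slope <- coef(fit)[["height"]]

# Prediction by hand for a hypothetical character 180 cm tall ...
by_hand <- intercept + slope * 180

# ... and the same prediction computed by predict().
from_predict <- predict(fit, newdata = data.frame(height = 180))

# The two should agree up to floating-point tolerance.
all.equal(by_hand, unname(from_predict))
```

Computing the prediction both ways is a quick check that the model really is just a line: one intercept plus one slope times the height.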

From a frequentist perspective, is this model statistically significant?
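The printed summary contains everything needed to answer this, but the same numbers can also be extracted programmatically. The following is a minimal sketch using base R accessors on the summary object; the variable names are illustrative, and which significance threshold to compare against is left to you.

```r
library(dplyr)  # provides the starwars data set

fit <- lm(mass ~ height, data = starwars)
fit_summary <- summary(fit)

# Coefficient table: one row per term, with columns for the estimate,
# standard error, t value, and p-value.
coef_table <- coef(fit_summary)

# p-value for the height coefficient.
p_height <- coef_table["height", "Pr(>|t|)"]

# Proportion of variance in mass explained by the model.
r_squared <- fit_summary$r.squared

p_height
r_squared
```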

Let's plot it. The add_predictions function from modelr adds predictions from a model to a data frame, in a pred column:

sw_withline = starwars %>%
    add_predictions(sw_linear)
sw_withline

ggplot(sw_withline) +
    aes(x=height) +
    geom_point(mapping=aes(y=mass)) +
    geom_line(mapping=aes(y=pred), color="blue")

Outliers

Notice that one data point sits far away from all the others and does not fit the line.

How might we figure out what it is? Can you find it?
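One possible approach, offered as a hint rather than the intended solution, is to rank the characters by how far they fall from the fitted line. The sketch below assumes modelr's add_residuals, which adds a resid column in the same way add_predictions adds pred; the choice to show the top three rows is arbitrary.

```r
library(dplyr)   # starwars data set and data verbs
library(modelr)  # add_residuals

fit <- lm(mass ~ height, data = starwars)

largest_residuals <- starwars %>%
    add_residuals(fit) %>%            # resid = observed mass - predicted mass
    select(name, height, mass, resid) %>%
    arrange(desc(abs(resid))) %>%     # rows with a missing residual sort last
    head(3)

largest_residuals
```

Sorting directly on mass, or filtering on a threshold read off the scatter plot, would work just as well for this data.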

What happens if we remove it prior to modeling? First, create a version of the data with the weird point removed:

Now, train a model on this new data:

Is your new model significant? Does it fit better (higher $R^2$)?

Now, plot the data with your new model. Use the full data, not the version with the outlier removed; add_predictions can still add predictions from the new model to it:

What does this tell us about the suitability of the model?