Linear Modeling Exercise

Setup

In [ ]:
library(tidyverse)
library(modelr)
In [ ]:
options(repr.matrix.max.rows=15, repr.plot.height=4.5)

For this exercise, we will use the starwars data set, which ships with dplyr (loaded as part of the tidyverse):

In [ ]:
starwars

Plotting Values

Plot the mass by height for each character:

In [ ]:
ggplot(starwars) +
    aes(x=height, y=mass) +
    geom_point()

What 'shape' does this data have? Does it look like it might fit on a line?

Modeling a Line

The lm function fits a linear model to some data. It returns an lm object, which we can then save in a variable. Its first parameter is a formula, a construct that shows up from time to time in R; here, mass ~ height says we want to model the mass in terms of the height. The function will fit a model of the form

y=mx+b

where, in this case, y is the mass, and x is the height; b is the intercept, and m the slope or coefficient.
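
As a quick check of how the fitted coefficients map onto m and b, here is a minimal sketch on made-up data that lies exactly on the line y = 2x + 1. The data frame and the names toy and toy_fit are illustrative assumptions, not part of the exercise.

```r
# Made-up data lying exactly on the line y = 2x + 1 (illustration only).
toy <- data.frame(x = 1:10)
toy$y <- 2 * toy$x + 1

# Model y in terms of x, using the same formula syntax as in the exercise.
toy_fit <- lm(y ~ x, data = toy)

# coef() returns a named vector: "(Intercept)" is b, and "x" is m.
coef(toy_fit)
```

Because the toy data has no noise, the fitted intercept and slope should match the b and m used to generate it, up to floating-point error.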

In [ ]:
sw_linear = lm(mass ~ height, starwars)
summary(sw_linear)

From a frequentist perspective, is this model statistically significant?
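
One way to approach this kind of question is to look at the p-values and R² that summary reports. The sketch below pulls them out programmatically for a different model, fitted to R's built-in mtcars data; the names car_fit, car_summary, and slope_p are illustrative only. The same fields should be available on summary(sw_linear).

```r
# A different linear model, on the built-in mtcars data (illustration only).
car_fit <- lm(mpg ~ wt, data = mtcars)
car_summary <- summary(car_fit)

# The coefficient table is a matrix with one row per term;
# the "Pr(>|t|)" column holds the p-value for each coefficient.
car_summary$coefficients

# p-value for the slope, and the fraction of variance explained.
slope_p <- car_summary$coefficients["wt", "Pr(>|t|)"]
slope_p
car_summary$r.squared
```

A small p-value for the slope (conventionally below 0.05) is the usual frequentist evidence that the relationship is not just noise.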

Let's plot it. The add_predictions function from modelr adds predictions from a model to a data frame, in a pred column:

In [ ]:
sw_withline = starwars %>%
    add_predictions(sw_linear)
sw_withline
In [ ]:
ggplot(sw_withline) +
    aes(x=height) +
    geom_point(mapping=aes(y=mass)) +
    geom_line(mapping=aes(y=pred), color="blue")

Outliers

Now, there is one weird data point that doesn't fit the line.

How might we figure out what it is? Can you find it?
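
One possible approach, sketched here on the built-in mtcars data rather than on starwars, is to compute each row's residual (observed value minus predicted value) with modelr's add_residuals and sort by its absolute size. The names car_fit and worst_fit are illustrative assumptions.

```r
library(tidyverse)
library(modelr)

car_fit <- lm(mpg ~ wt, data = mtcars)

worst_fit <- mtcars %>%
    rownames_to_column("car") %>%    # keep the car names as a regular column
    add_residuals(car_fit) %>%       # adds a resid column
    arrange(desc(abs(resid))) %>%    # largest misfit first
    select(car, wt, mpg, resid) %>%
    head(3)
worst_fit
```

Sorting or filtering directly on the plotted variables is another option when the odd point is obvious from the plot.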

In [ ]:

What happens if we remove it prior to modeling? First, create a version of the data with the weird point removed:

In [ ]:

Now, train a model on this new data:

In [ ]:

Is your new model significant? Does it fit better (higher R²)?

Now, plot the data with your new model. Use the full data, not the version with the weird point removed; you can still add predictions from the new model to it:
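
The general pattern of fitting on one data frame and predicting on another can be sketched as follows, again on the built-in mtcars data. The weight cutoff and the names light_cars, light_fit, and all_with_pred are arbitrary choices for illustration, not part of the exercise.

```r
library(tidyverse)
library(modelr)

# Fit on a subset of the data (the cutoff of 5 is arbitrary).
light_cars <- mtcars %>% filter(wt < 5)
light_fit <- lm(mpg ~ wt, data = light_cars)

# Predict for every row of the full data, including rows left out of the fit.
all_with_pred <- mtcars %>%
    add_predictions(light_fit)

ggplot(all_with_pred) +
    aes(x = wt) +
    geom_point(mapping = aes(y = mpg)) +
    geom_line(mapping = aes(y = pred), color = "blue")
```

add_predictions only needs the data frame to contain the predictor columns named in the formula, so the model does not have to have been trained on that same data frame.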

In [ ]:

What does this tell us about the suitability of the model?