library(tidyverse)
library(modelr)
options(repr.matrix.max.rows=15, repr.plot.height=4.5)
For this, we will use the starwars data set:
starwars
Plot the mass by height for each character:
ggplot(starwars) +
aes(x=height, y=mass) +
geom_point()
What 'shape' does this data have? Does it look like it might fit on a line?
The lm function fits a line to some data. It returns an lm object, which we can then save. Its first parameter is a formula. Formulas show up from time to time in R; this one says we want to model the mass in terms of the height. It will learn a model
$$y = mx + b$$
where, in this case, y is the mass and x is the height; b is the intercept, and m is the slope (the coefficient on height).
sw_linear = lm(mass ~ height, starwars)
summary(sw_linear)
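To connect the fitted object back to the line it describes, here is a minimal sketch that pulls the intercept and slope out of the model and computes one prediction by hand. The names fit, b, m and h, and the example height of 180, are illustrative choices and not part of the original notebook.

```r
library(dplyr)  # provides the starwars data set

# Refit the same model so this block runs on its own.
# lm() drops rows with a missing mass or height by default.
fit <- lm(mass ~ height, data = starwars)

# coef() returns a named vector: "(Intercept)" is b, "height" is m.
b <- coef(fit)[["(Intercept)"]]
m <- coef(fit)[["height"]]

# Predict the mass of a hypothetical character who is 180 cm tall,
# once by hand and once with predict().
h <- 180
by_hand <- m * h + b
by_predict <- predict(fit, newdata = data.frame(height = h))

# The two routes should agree.
all.equal(by_hand, unname(by_predict))
```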
From a frequentist perspective, is this model statistically significant?
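One way to approach this question is to read the p-value for the height coefficient and the R-squared value out of the summary object, instead of scanning the printed table. The sketch below assumes the conventional 0.05 threshold, which is a convention and not something the model dictates; the variable names are illustrative.

```r
library(dplyr)  # provides the starwars data set

fit <- lm(mass ~ height, data = starwars)
fit_summary <- summary(fit)

# The coefficient table is a matrix; rows are terms, columns are statistics.
p_value <- fit_summary$coefficients["height", "Pr(>|t|)"]

# Fraction of the variance in mass explained by height.
r_squared <- fit_summary$r.squared

# Compare against the (assumed) conventional significance threshold.
p_value < 0.05
r_squared
```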
Let's plot it. The add_predictions function from modelr adds predictions from a model to a data frame, in a pred column:
sw_withline = starwars %>%
add_predictions(sw_linear)
sw_withline
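To make it clearer what add_predictions does, the following sketch compares its pred column with a direct call to predict() on the same data. The names fit, with_pred and direct are illustrative. Rows with a missing height get a missing prediction in both cases.

```r
library(dplyr)   # provides the starwars data set and %>%
library(modelr)  # provides add_predictions()

fit <- lm(mass ~ height, data = starwars)

# add_predictions() appends a column (named "pred" by default).
with_pred <- starwars %>% add_predictions(fit)

# The same numbers, computed directly with base R's predict().
direct <- predict(fit, newdata = starwars)

# Both routes should give the same values, row for row.
all.equal(unname(with_pred$pred), unname(direct))
```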
ggplot(sw_withline) +
aes(x=height) +
geom_point(mapping=aes(y=mass)) +
geom_line(mapping=aes(y=pred), color="blue")
Now, there is this one weird data point that doesn't fit.
How might we figure out what it is? Can you find it?
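One possible way to track the point down is to sort the characters by mass and look at the top few rows, since the odd point sits far above the rest on the mass axis. This is only a sketch; the name heaviest is illustrative, and looking at the largest residuals would be another reasonable route.

```r
library(dplyr)  # provides the starwars data set

# Sort from heaviest to lightest (missing masses go last)
# and keep only the columns needed to identify the character.
heaviest <- starwars %>%
  arrange(desc(mass)) %>%
  select(name, height, mass) %>%
  head(3)

heaviest
```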
What happens if we remove it prior to modeling? First, create a version of the data with the weird point removed:
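A minimal sketch of one way to do this, assuming the weird point is the single character with the largest mass. The name sw_trimmed is a hypothetical choice. The is.na(mass) clause keeps rows with a missing mass, which filter would otherwise drop silently.

```r
library(dplyr)  # provides the starwars data set

# Keep every row except the one with the maximum mass.
# Rows with a missing mass are kept explicitly.
sw_trimmed <- starwars %>%
  filter(is.na(mass) | mass < max(mass, na.rm = TRUE))

# Exactly one row should have been removed.
nrow(starwars) - nrow(sw_trimmed)
```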
Now, train a model on this new data:
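Fitting the new model follows the same pattern as before, only with the trimmed data. The sketch below rebuilds the trimmed data so that it runs on its own; sw_trimmed and sw_trimmed_linear are hypothetical names, and removing the row with the largest mass is an assumption about which point is the weird one.

```r
library(dplyr)  # provides the starwars data set

# Assumed outlier removal: drop the single heaviest character.
sw_trimmed <- starwars %>%
  filter(is.na(mass) | mass < max(mass, na.rm = TRUE))

# Same formula as the original model, different data.
sw_trimmed_linear <- lm(mass ~ height, data = sw_trimmed)
summary(sw_trimmed_linear)
```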
Is your new model significant? Does it fit better (higher R²)?
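Now, plot the data with your new model. Plot the full data, including the weird point, not the version with it removed. You can still add predictions from the new model to the full data, because add_predictions works on any data frame that has a height column: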
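As a sketch of how this could look, the block below fits the model on the trimmed data and then adds its predictions to the full starwars data before plotting. All new names (sw_trimmed, sw_trimmed_linear, sw_full_with_pred, p) are hypothetical, and the outlier is again assumed to be the row with the largest mass.

```r
library(dplyr)    # provides the starwars data set
library(ggplot2)
library(modelr)   # provides add_predictions()

# Model trained without the assumed outlier.
sw_trimmed <- starwars %>%
  filter(is.na(mass) | mass < max(mass, na.rm = TRUE))
sw_trimmed_linear <- lm(mass ~ height, data = sw_trimmed)

# Predictions from the new model, attached to the FULL data.
sw_full_with_pred <- starwars %>%
  add_predictions(sw_trimmed_linear)

# Points show every character; the blue line is the new model.
# ggplot2 will warn about rows with missing height or mass.
p <- ggplot(sw_full_with_pred) +
  aes(x = height) +
  geom_point(mapping = aes(y = mass)) +
  geom_line(mapping = aes(y = pred), color = "blue")
p
```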
What does this tell us about the suitability of the model?