library(tidyverse)              # ggplot2, dplyr, tidyr, and friends
library(modelr)                 # add_predictions(), add_residuals()
library(tidyr)                  # already attached by tidyverse; loaded explicitly for gather()
library(alr3)                   # provides the heights data set (mothers and daughters)
options(repr.plot.height=4.5)   # plot display height in the notebook
head(heights)
ggplot(heights) +
aes(x=Mheight) +
geom_histogram()
ggplot(heights) +
aes(x=Dheight) +
geom_histogram()
What if we want to view these side by side?
tall_heights = gather(heights, "var", "height")
head(tall_heights)
ggplot(tall_heights) +
aes(x=height) +
geom_histogram() +
facet_wrap(~ var)
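As an aside, gather() is superseded in tidyr 1.0 and later. Assuming a recent tidyr is available, an equivalent reshape is:
tall_heights = pivot_longer(heights, everything(),
names_to="var", values_to="height")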
But this doesn't show the relationship between the variables yet, so let's plot them together.
ggplot(heights) +
aes(x=Mheight, y=Dheight) +
geom_point() +
geom_rug()
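The cloud suggests a positive association; base R's cor() quantifies it before we fit anything:
cor(heights$Mheight, heights$Dheight)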
This looks like a line might fit. Let's try it!
The lm function fits a linear model.
height_model = lm(Dheight ~ Mheight, data=heights)
summary(height_model)
'Goodness of Fit' statistics: the summary reports the residual standard error, R-squared (the proportion of variance in Dheight explained by Mheight), and an F-test for the overall fit.
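To pull those numbers out programmatically rather than reading the printout, the summary object exposes them directly (a minimal sketch; broom::glance(height_model) is a tidier alternative if broom is installed):
s = summary(height_model)
s$r.squared   # proportion of variance explained
s$sigma       # residual standard error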
Let's plot the model!
heights_with_preds = heights %>% add_predictions(height_model)
head(heights_with_preds)
ggplot(heights_with_preds) +
aes(x=Mheight, y=Dheight) +
geom_point() +
geom_line(mapping=aes(y=pred), color="blue") +
geom_rug()
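Since this model has a single explanatory variable, ggplot2 can draw an equivalent fitted line (with a confidence ribbon) directly via geom_smooth(method="lm"), without going through modelr:
ggplot(heights) +
aes(x=Mheight, y=Dheight) +
geom_point() +
geom_smooth(method="lm")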
To check the error distribution, we plot the error (residual) against the fitted (predicted) value.
heights_with_mod = heights %>%
add_predictions(height_model) %>%
add_residuals(height_model)
head(heights_with_mod)
ggplot(heights_with_mod) +
aes(x=pred, y=resid) +
geom_hline(yintercept=0, linetype="dashed") +
geom_point() + geom_smooth()
No observable pattern: the residuals appear independent of the predicted values, so the model assumptions seem to hold.
If they don't hold, the goodness-of-fit statistics and inference are invalid.
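For contrast, here is what a violation can look like. This is a sketch on simulated data (the names fake and fake_mod are made up for the illustration): noise whose spread grows with x produces a tell-tale funnel in the residual-vs-fitted plot.
fake = tibble(x = runif(500, 0, 10),
y = 2 * x + rnorm(500, sd = x))   # error spread grows with x
fake_mod = lm(y ~ x, data=fake)
fake %>%
add_predictions(fake_mod) %>%
add_residuals(fake_mod) %>%
ggplot() +
aes(x=pred, y=resid) +
geom_hline(yintercept=0, linetype="dashed") +
geom_point() + geom_smooth()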
Let's take a quick look at the absolute residual vs. the actual value:
ggplot(heights_with_mod) +
aes(x=Dheight, y=abs(resid)) +
geom_hline(yintercept=0, linetype="dashed") +
geom_point() + geom_smooth()
And let's plot the residuals against the explanatory variable:
ggplot(heights_with_mod) +
aes(x=Mheight, y=resid) +
geom_hline(yintercept=0, linetype="dashed") +
geom_point() + geom_smooth()
What about normality? To check whether the residuals are normally distributed, we use a 'Q-Q' plot:
ggplot(heights_with_mod) +
aes(sample=resid) +
geom_qq()
The points form a straight line, indicating normally-distributed errors.
Our model looks sound.
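Recent ggplot2 versions also provide geom_qq_line(), which draws the reference line the points should follow, making the straightness easier to judge:
ggplot(heights_with_mod) +
aes(sample=resid) +
geom_qq() +
geom_qq_line()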
Another question: how well does the model predict data it has never seen?
To test a predictive model, we split the data into training and test sets.
First, let's flag roughly 10% of the rows as test data.
height_parts = heights %>%
mutate(IsTest = runif(n()) > 0.9)
summary(height_parts)
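One caveat: runif() makes the split random, so it changes on every run. For a reproducible split we could seed the random number generator first (a sketch; the seed value is arbitrary):
set.seed(42)
height_parts = heights %>%
mutate(IsTest = runif(n()) > 0.9)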
train_heights = height_parts %>%
filter(!IsTest) %>%
select(-IsTest)
Fit a model on the training data:
train_mod = lm(Dheight ~ Mheight, train_heights)
summary(train_mod)
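As a sanity check, coefficients fit on roughly 90% of the data should land close to the full-data fit:
coef(train_mod)
coef(height_model)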
Now test on our testing data, which the model couldn't see:
test_heights = height_parts %>%
filter(IsTest) %>%
select(-IsTest) %>%
add_predictions(train_mod) %>%
add_residuals(train_mod)
head(test_heights)
What is the model's root mean squared error (RMSE)?
sqrt(mean(test_heights$resid ^ 2))
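modelr also provides an rmse() helper that computes the same quantity from a model and a data frame; comparing against the training RMSE shows how much, if at all, we overfit:
rmse(train_mod, test_heights)    # should match the manual calculation above
rmse(train_mod, train_heights)   # training error, for comparison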