This notebook contains our analysis of the recommender evaluation results.
It proceeds in a few steps:
1. Load the MovieLens user data and compute per-user profile statistics from the crossfold training data.
2. Load the per-user recommender evaluation results.
3. Model per-user 'difficulty' as a function of profile size.
4. Analyze accuracy by gender and by age, controlling for profile size.
Libraries:
library(MASS)
library(plyr)
library(dplyr)
library(readr)
library(ggplot2)
library(tidyr)
library(modelr)
library(tibble)
options(repr.plot.height=5)
options(repr.matrix.max.rows=10)
options(repr.matrix.max.columns=10)
First, we need to read the user data from the underlying MovieLens data:
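# users.dat is '::'-delimited; reading it with delim=":" produces an empty column
# between each real field, and the '_' entries in col_types skip those.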
users.meta = read_delim("data/ml-1m/users.dat", delim=":",
col_names=c("user", "gender", "age", "occupation", "zip"),
col_types="i_c_c_c_c") %>%
mutate(gender=as.factor(gender),
age=as.factor(age),
occupation=as.factor(occupation))
users.meta
Now, we want to get user profile statistics. In order to do that, we need each user's train profile; we can load that from the training data output by the cross-folding process.
For each partition, we need to load the train ratings (historical profile) for all the test users in that partition. So we will load the test data (to get the users) and the train data (to get the histories), then merge them.
train.ratings = ldply(1:5, function(part) {
message("reading part ", part)
test.fn = sprintf("build/ML-1M.out/part0%d.test.csv", part)
train.fn = sprintf("build/ML-1M.out/part0%d.train.csv", part)
test = suppressMessages(read_csv(test.fn, col_names=c("user", "item", "rating", "timestamp")))
train = suppressMessages(read_csv(train.fn, col_names=c("user", "item", "rating", "timestamp")))
test %>%
select(user) %>%
distinct() %>%
mutate(part=as.factor(part), proto=as.factor('UserPct')) %>%
inner_join(train)
})
Quick summary just to see what this data looks like:
train.ratings %>%
select(part, proto, rating) %>%
summary()
Now we want to compute per-user profile statistics:
user.stats = train.ratings %>%
group_by(user) %>%
summarize(nratings = n(), meanRating=mean(rating), ratingVar=var(rating))
user.stats
Join these statistics with our user metadata table, so that we have one table of user information:
users = users.meta %>% inner_join(user.stats)
users
Now that we have user information, we can read the per-user recommender evaluation results.
user.results.all = read_csv("build/movielens-user-results.csv", guess_max=10000) %>%
rename(user=User) %>%
mutate(DataSet=as.factor(DataSet),
Partition=as.factor(Partition),
Algorithm=as.factor(Algorithm)) %>%
inner_join(users)
head(user.results.all)
We will often want to use the standard-crossfold user results, so filter those down into a table:
user.results = user.results.all %>%
filter(DataSet == 'ML-1M')
Users with more ratings are easier to predict for; we want to be able to remove that effect. We will focus for now on nDCG.
There are two ways we can try to model this: one is to consider all algorithms together and predict accuracy directly from e.g. nratings; the other is to aggregate by user to produce a per-user 'difficulty' score and model that. We will do the latter to avoid statistical non-independence problems, since otherwise each user would contribute one observation per algorithm.
user.ndcg = user.results %>%
group_by(user) %>%
summarize(nDCG=mean(nDCG)) %>%
inner_join(users)
user.ndcg
Let's plot to see what this is looking like:
ggplot(user.ndcg) +
aes(x=nratings, y=nDCG) +
geom_point() +
geom_rug()
Oof, that distribution is concentrated. Let's log the number of ratings, since log is often a good transform for counts:
ggplot(user.ndcg) +
aes(x=nratings, y=nDCG) +
geom_point() +
geom_rug() +
scale_x_log10()
Looking better. Let's see how the difficulty (nDCG) is distributed (Q-Q plot):
ggplot(user.ndcg) +
aes(sample=nDCG) +
geom_qq()
That isn't quite normal; however, a log transform looks like it will be too aggressive. Let's try a square root:
ggplot(user.ndcg) +
aes(sample=sqrt(nDCG)) +
geom_qq()
That looks close to normal.
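As a quick numeric sanity check (a sketch, not part of the original analysis), we can run a Shapiro-Wilk test on the square-root-transformed values; shapiro.test only accepts up to 5000 observations, so we test a random sample:
# Sketch: Shapiro-Wilk normality check on sqrt(nDCG).
# shapiro.test() is limited to 3-5000 observations, so sample if there are more users.
sqrt.ndcg = sqrt(user.ndcg$nDCG)
set.seed(20)  # arbitrary seed, only so the sample is reproducible
shapiro.test(sample(sqrt.ndcg, min(5000, length(sqrt.ndcg))))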
Let's use log10 of the nratings as our predictor, and square root of nDCG as our dependent variable. But we will try more basic models too so we can see the improvement.
user.ndcg.mod.raw = lm(nDCG ~ nratings, data=user.ndcg)
summary(user.ndcg.mod.raw)
user.ndcg.mod.log = lm(nDCG ~ log10(nratings), data=user.ndcg)
summary(user.ndcg.mod.log)
user.ndcg.mod = lm(sqrt(nDCG) ~ log10(nratings), data=user.ndcg)
summary(user.ndcg.mod)
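One hedged way to put these fits side by side (a sketch, not in the original notebook): AIC is comparable for the two models that share the raw nDCG response, while the square-root model has a different response scale, so for it we just report adjusted R-squared.
# Sketch: compare the two raw-nDCG models by AIC (lower is better);
# the sqrt-response model is on a different scale, so only its adjusted R^2 is shown.
AIC(user.ndcg.mod.raw, user.ndcg.mod.log)
summary(user.ndcg.mod)$adj.r.squared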
This is the best model. Let's fill in our data with it. We compute residuals manually, because the linear model's residuals are in square-root space. For checking the model, we want:
$$\epsilon_i = \sqrt{\mathrm{nDCG}_i} - y_i$$
However, we also want the error on the original nDCG scale, which is
$$e_i = \mathrm{nDCG}_i - y_i^2$$
user.ndcg.preds = user.ndcg %>%
add_predictions(user.ndcg.mod) %>%
rename(predRoot=pred) %>%
mutate(pred = predRoot * predRoot, resid=sqrt(nDCG) - predRoot, error=nDCG-pred)
head(user.ndcg.preds)
ggplot(user.ndcg.preds) +
aes(x=nratings, y=nDCG) +
geom_point() +
geom_smooth() +
geom_line(mapping=aes(y=pred), color="red") +
geom_rug() +
scale_x_log10() + scale_y_sqrt()
We will be good citizens and plot our residuals.
ggplot(user.ndcg.preds) +
aes(x=pred, y=resid) +
geom_point() +
geom_rug() +
geom_hline(yintercept=0, linetype="dashed")
That is mostly noise, with the exception of that one line at the bottom; our initial plot of the outcome distribution suggested there is a chunk of 0 nDCG values, and those will cause funny business like this.
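A quick way to check that suspicion (a sketch, not in the original notebook) is to count the users whose average nDCG is exactly zero:
# Sketch: how big is the cluster of zero-nDCG users behind that line of residuals?
user.ndcg %>%
    summarize(n_users = n(), n_zero = sum(nDCG == 0), frac_zero = mean(nDCG == 0))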
ggplot(user.ndcg.preds) +
aes(sample=resid) +
geom_qq()
Not bad! We don't strictly need normal residuals, since we are not looking for an inferentially-valid linear model, but it is a good idea to check anyway.
Let's analyze accuracy by gender.
# FIXME re-add rank.ndcg
gender.results = user.results %>%
select(Algorithm, gender, MAP=AvgPrec, MRR=RecipRank, RMSE, nDCG) %>%
gather("Metric", "value", -Algorithm, -gender) %>%
group_by(Algorithm, gender, Metric) %>%
summarize(value=mean(value)) %>%
ungroup()
head(gender.results)
overall.results = user.results %>%
select(Algorithm, MAP=AvgPrec, MRR=RecipRank, RMSE, nDCG) %>%
gather("Metric", "value", -Algorithm) %>%
group_by(Algorithm, Metric) %>%
summarize(value=mean(value)) %>%
ungroup() %>%
mutate(gender = 'Any')
combined.results = rbind(gender.results, overall.results)
head(combined.results)
ggplot(combined.results) +
aes(x=gender, y=value, fill=Algorithm) +
geom_bar(stat="identity", position="dodge") +
facet_wrap(~ Metric, scales="free")
Now we want to control for the size of the user's profile. Since male users rate more movies than female users, and long profiles are easier to predict for, it is possible that the higher accuracy for male users is really an effect of profile size.
We will control for that by using our linear model to predict 'difficulty' from profile size, and then looking at the remaining effect of gender on accuracy.
gender.corr.ndcg = user.results %>%
select(user, Algorithm, gender, nDCG) %>%
inner_join(select(user.ndcg.preds, user, Pred.nDCG=pred)) %>%
mutate(Corr.nDCG = nDCG - Pred.nDCG) %>%
group_by(gender, Algorithm) %>%
summarize(nDCG=mean(nDCG), Corr.nDCG=mean(Corr.nDCG))
head(gender.corr.ndcg)
ggplot(gender.corr.ndcg) +
aes(x=gender, y=Corr.nDCG, fill=Algorithm) +
geom_bar(stat="identity", position="dodge")
After controlling for the number of ratings, we still see increased accuracy for male users.
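If we want a rough significance check on that gap (a sketch, not part of the original analysis, and the user.gender.corr name is just for this sketch), we can average the corrected nDCG per user, so each user counts once rather than once per algorithm, and run a Welch t-test between genders:
# Sketch: per-user corrected nDCG (averaged over algorithms), then a Welch t-test by gender.
user.gender.corr = user.results %>%
    select(user, gender, nDCG) %>%
    inner_join(select(user.ndcg.preds, user, Pred.nDCG=pred)) %>%
    mutate(Corr.nDCG = nDCG - Pred.nDCG) %>%
    group_by(user, gender) %>%
    summarize(Corr.nDCG = mean(Corr.nDCG)) %>%
    ungroup()
t.test(Corr.nDCG ~ gender, data=user.gender.corr)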
Michael also tried controlling for gender effects by subtracting each gender's overall mean metric value, but it was kinda weird and we don't think it tells us anything meaningful. So ignore it (the code is kept below for history).
gender.effects = user.results %>%
select(gender, MAP=AvgPrec, MRR=RecipRank, RMSE, nDCG) %>%
gather("Metric", "value", -gender) %>%
group_by(gender, Metric) %>%
summarize(effect=mean(value, na.rm=TRUE)) %>%
ungroup()
gender.effects
controlled.gender.results = user.results %>%
select(Algorithm, gender, MAP=AvgPrec, MRR=RecipRank, RMSE, nDCG) %>%
gather("Metric", "value", -Algorithm, -gender) %>%
inner_join(gender.effects) %>%
mutate(offset = value - effect) %>%
select(-effect) %>%
group_by(Algorithm, gender, Metric) %>%
summarize(offset=mean(offset)) %>%
ungroup()
head(controlled.gender.results)
ggplot(controlled.gender.results) +
aes(x=gender, y=offset, fill=Algorithm) +
geom_bar(stat="identity", position="dodge") +
facet_wrap(~ Metric, scales="free")
We now want to examine accuracy as a function of age. We're basically going to do the same thing we did for gender.
Start by aggregating the results:
# FIXME Re-add rank nDCG
age.results = user.results %>%
select(Algorithm, age, MAP=AvgPrec, MRR=RecipRank, nDCG) %>%
gather("Metric", "value", -Algorithm, -age) %>%
group_by(Algorithm, age, Metric) %>%
summarize(value=mean(value))
age.results
And then we can plot them:
ggplot(age.results) +
aes(x=age, y=value, fill=Algorithm) +
geom_bar(stat="identity", position="dodge") +
facet_grid(Metric ~ ., scales="free_y")
Let's control for profile size again.
age.corr.ndcg = user.results %>%
select(user, Algorithm, age, nDCG) %>%
inner_join(select(user.ndcg.preds, user, Pred.nDCG=pred)) %>%
mutate(Corr.nDCG = nDCG - Pred.nDCG) %>%
group_by(age, Algorithm) %>%
summarize(nDCG=mean(nDCG), Corr.nDCG=mean(Corr.nDCG))
head(age.corr.ndcg)
ggplot(age.corr.ndcg) +
aes(x=age, y=Corr.nDCG, fill=Algorithm) +
geom_bar(stat="identity", position="dodge")
The discrepancy in performance across age groups is robust to controlling for profile size.
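As with gender, a rough check (again a sketch, not in the original notebook; the user.age.corr name is ours) is a one-way ANOVA of per-user corrected nDCG across the age groups:
# Sketch: per-user corrected nDCG (averaged over algorithms) by age group, then a one-way ANOVA.
user.age.corr = user.results %>%
    select(user, age, nDCG) %>%
    inner_join(select(user.ndcg.preds, user, Pred.nDCG=pred)) %>%
    mutate(Corr.nDCG = nDCG - Pred.nDCG) %>%
    group_by(user, age) %>%
    summarize(Corr.nDCG = mean(Corr.nDCG)) %>%
    ungroup()
summary(aov(Corr.nDCG ~ age, data=user.age.corr))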