Week 15 — What Next? (12/5–9)#
This is the last week of class. We’re going to recap, and talk about what’s next, both for learning and for putting what you’ve learned to practical use.
Revisions#
Dec. 5: clarified that there is no week 15 quiz — there was previously a disagreement between the heading and the text.
🧐 Content Overview#
| Element | Length |
|---|---|
| | 6m23s |
| | 7m13s |
| | 11m16s |
| | 13m50s |
| | 9m55s |
| | 7m28s |
| | 6m25s |
| | 10m25s |
This week has 1h13m of video and 0 words of assigned readings. This week’s videos are available in a Panopto folder.
📅 Deadlines#
Makeup Midterm, December 10
Assignment 7, December 11
🎥 Recap#
This video reviews the concepts we have discussed this term and puts them into the broader context of data science.
- Welcome. This is the last week of CS 533. I want to start us off by doing some recap of what we've learned this semester.
- The learning outcomes for this week are to tie the class content together,
- to give you a brief summary of what we've done,
- and to show how it forms a broader picture into which the further topics you're going to study throughout your graduate degree will fit.
- We're also going to tie up a couple of loose ends. We're going to talk about time series operations and modeling correlated regression errors.
- And we're going to talk about some of the things you need to do to take data science analysis,
- results, and outputs into publication or into production.
- And then finally, I'm going to talk about some topics to study further to expand your data science skills.
- So I want to bring up the data science workflow again,
- because it provides a context in which a lot of the things that we've been talking about fit.
- We talked about how to describe data. We talked about how to integrate it, source it, and transform it.
- We talked about the various kinds of biases you have to pay attention to throughout this workflow.
- We have different kinds of tasks, such as modeling and inference.
- I also want to return us, though, to our definition from the beginning of this semester:
- that data science is the use of data to provide quantitative insights on questions of scientific, business, or social interest.
- So the goal of all of this, what we're doing with data science, is to produce insights.
- Now, there's a lot of overlap.
- There are a lot of predictive tools that are not necessarily being used to provide insights back to the operators,
- but are being used to, say, generate predictions about customer transactions,
- about whether something is fraud, whether something is a cybersecurity attack,
- where the model itself is not necessarily providing insights.
- But we need to gain insights about whether it's working, how effective it is, and where we need to go try to improve it.
- And the techniques that we're using are applicable to other purposes as well,
- such as training machine learning models that can do various types of predictions.
- We've talked about data management. We've primarily focused on reading from static files,
- and we've done a little bit with obtaining data from the web. We've been processing and integrating our data with Pandas,
- a really useful utility skill. To take this a step further, learn to actually work with a database.
- In the CS 510 databases class, you learn how to design data models and relational databases, how to query them, and how to put data in them.
- You're also going to see in other classes information about managing data that is particular to different applications and types of data.
- We talked about some mathematical fundamentals: probability theory,
- and a little bit of linear algebra, and we've used bits and pieces of these throughout the class.
- You can learn a lot more about these by taking MATH 562 and by taking the linear algebra class as well.
- We've talked about statistical inference. We did some basic parametric pairwise comparisons with t-tests.
- We learned how to bootstrap confidence intervals and p-values.
- We learned some sampling theory that underlies a lot of statistical inference and underlies the bootstrap.
- And we talked about doing inference with regression models.
- You'll learn a lot more about this in the probability and statistics class. We've done some predictions:
- we've predicted continuous outcomes with regression models, and we've predicted binary outcomes with classification models.
- You will learn a lot more about different models for doing these kinds of predictions in the machine learning class,
- and they're also going to come up throughout a number of your other data science classes,
- such as recommender systems, natural language processing, information retrieval, and social media mining.
- We talked about evaluating your predictive model, your classifier or your regression-based predictor.
- We talked about how to do train/test splits and why we need them.
- We talked about metrics for assessing the effectiveness of your model.
- We talked about strategies for choosing your model's hyperparameters. Again, you're going to see a lot more of this in CS 534.
- We also introduced unsupervised learning.
- We've looked at two different unsupervised learning techniques. This is in contrast to supervised learning,
- where we have a classification outcome or label that we're trying to predict with a classifier, or a value
- that we're trying to predict with a regression model.
- In unsupervised learning, we don't have a supervision signal or a label that we're trying to predict; we're trying to allow the model
- to learn structure just from the input features.
- We've seen how to do this with low-dimensional embedding via matrix decomposition, and we've seen how to do this with clustering.
- Again, you'll see more about this in machine learning, and it'll appear in other classes.
- We talked about workflows. We talked about the data science pipeline.
- We talked about how to break code into scripts and modules, how to use Git, and some design patterns for your code workflows.
- You're going to use this throughout your data science work, both in your studies and in your applied work when you're done.
- I encourage you to take what you've learned here and refer back to the material in order to structure your assignments for other classes.
- So to wrap up, this class has been designed to lay a conceptual foundation for the rest of your data science studies.
- As you take other classes, you have these concepts. You know what a classifier is.
- You know how to evaluate one. You know how predictive models and statistical inference fit in the broader space.
- You have a good working base to start to work with data: to be able to ask questions, to think about how
- to answer them, and to represent their answers. Other classes are going to build on these concepts and ideas,
- and there are a number that you can take in the computer science department,
- the math department, and other departments as you complete either your master's degree or your PhD.
🚩 Makeup Exam#
The makeup midterm is December 10 and operates on the same schedule as the other midterms. It covers the entire semester, with an emphasis on material since the second midterm. The rules and format are the same as for Midterm A.
I encourage you to come and take the exam, as it is the single best way to prepare for the final. You can leave the exam at any time and take it with you; I will only grade it if you turn it in. There is no problem with showing up, looking at it, even filling it out, and deciding you like your current grades well enough to not turn it in.
Warning
If you turn in the makeup exam to be graded, its grade will replace the lower of your Midterm A and B grades, even if that lowers your final grade. Only turn it in if you think you did better than your worst normal midterm!
🎥 Data and Concept Drift#
This video introduces a fundamental assumption of predictive modeling and the way drift can affect it.
- In this video I want to introduce two concepts you need to be on the lookout for when you're building models, particularly in production environments:
- drift of your data and of your concepts. The learning outcomes are for you to know critical assumptions of machine learning
- evaluation and deployment, and to understand how models can degrade over time.
- So a fundamental assumption that machine learning, and our machine learning evaluation, makes
- is that the test data
- looks like the training data: specifically, that the test data and the training data are drawn from the same distribution.
- One of the really crucial parts of this is that the conditional probability,
- the probability of y given a particular set of covariates or features x, is the same
- between the two. Otherwise you couldn't do machine learning, because the idea is to learn that conditional probability.
- Remember that we're trying to do estimates of conditional expectation or estimates of conditional probability.
- If it's not the same between the training data and the test data, then you can't learn from
- the training data to predict the test data. But also, if the probability of y is not consistent, then we're going to be looking at different data:
- the class weights may be wrong.
- If the probabilities of the covariates are different, effectively we're learning from data that's distributed differently.
- It might work, but it very well may not. We have this fundamental assumption:
- if you're trying to learn from training data to predict test data, you're assuming the test data looks like the training data.
- There's also another assumption when the purpose of testing
- your model is to determine if the model is going to work well when you actually go try to deploy it.
- If you're trying to test a spam filter's accuracy, your goal is not to build a spam filter that works well on a test dataset.
- The goal is to build a spam filter that's going to correctly classify spams as they come into your system.
- And so there we're making the crucial assumption that the live data the model is actually going to see
- is drawn from the same distribution as our test data. Because if it's not, then what we've tested it on isn't what it's actually going to run on,
- and so we haven't actually evaluated the model for the deployed task,
- if the actual data it's going to see comes from a different distribution than the data that we're testing on.
- So there are a few different ways that this can drift. The class prevalence can change:
- it might just be that spams become more likely. The feature distributions can change:
- emails or messages start using more of a particular kind of text, or the geographic
- distribution of the senders changes, or other features that you're using change. But also the relationship can change:
- the way people write spams may change, and so how you predict a spam from the features changes.
- One particular example: December purchasing is different.
- Take a model that is trained on January through November
- and try to predict December. December looks different because of the holiday season.
- Or suppose you've got a model that was working great; you had seasonal effects, you had cyclical effects;
- you were doing really well at forecasting your sales revenues for years.
- COVID has so fundamentally changed how we go about purchasing, both in person and online,
- that you have a massive distribution shift, and your model is no longer going to be valid.
- So how do you deal with this? I'm going to talk briefly about two ways to start trying to deal with it.
- The first thing to do is to be aware of the assumptions that
- you're making and the potential problems of a violation of those assumptions.
- Then you know to go look for them. Offline, one thing we can do, if our data have timestamps, is to temporally split the data,
- because in production, your goal isn't to predict a randomly selected set of data points from the other data points.
- Your goal is to predict the next data point from historical data points.
- So what you can do is simulate that. Say you're trying to predict sales:
- you can set November as your test data and use as training data everything before November.
- You don't want your model to look into the future;
- that violates the fidelity of the simulation. But this is the basic idea:
- you train on the data before the test data, and you test on that data.
- If you need to do hyperparameter tuning, you can use the month before as tuning data, or, because of seasonal effects,
- you may want to use the same month the year before for your hyperparameter tuning. But this is the idea:
- you do this temporal split, and the upside is that you get increased fidelity to what you're actually trying to solve:
- can you predict the next data,
- given the historical data? The drawback is that we're not randomly sampling, and so we have to be a lot more careful
- about the statistical inference of the results. But the trade-off is often worth it for understanding problems that are
- going to be deployed in a temporal setting where we expect data distributions to change over time.
- Then when you go online, you need to continuously monitor what your system is doing.
- If you have the ability to measure its online accuracy (it's making predictions of whether incoming messages on your platform are spam,
- and you can test whether each prediction was right or not), watch that.
- But even just watch other metrics, like the frequency with which it's classifying something as spam.
- If that jumps or if that drops, that doesn't necessarily mean your model performance is different, but it means something's changed.
- So if fifty percent of your messages are usually spam,
- or the spam classifier usually flags 50 percent of messages as spam, and suddenly it starts only flagging 40 percent,
- you should look into it. That's a signal that you need to look into why,
- to see if you've got a dataset shift that's causing your model to no longer perform properly.
- Also, you need to regularly retrain and reevaluate your model.
- You can't just say, "oh, I've got a model, it's got this accuracy, good, let's put it in production
- and run it for two years." You need to retrain the model with new data.
- And you also need to periodically reevaluate your model:
- if we've been running it for a year, collecting more data, and we test it on some new data, is it still giving the accuracy that we expect?
- So to wrap up: machine learning training and evaluation assume that the training data and test data match each other and match real life.
- You can't always rely on that. You need to pay attention to shifts in your data and in its distributions,
- shifts that either cause the model not to be able to learn from training data to predict test data,
- or that cause the test data to no longer be an accurate assessment of what's going to happen when you try to use your model for real.
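The temporal split described in the video can be sketched in Pandas. The sales data below is entirely made up for illustration; the column names and the November cutoff are assumptions, not something from a real dataset:

```python
import pandas as pd

# Hypothetical daily sales data for January through November 2021
dates = pd.date_range("2021-01-01", "2021-11-30", freq="D")
sales = pd.DataFrame({"date": dates, "revenue": range(len(dates))})

# Temporal split: train on everything before November, test on November.
# This simulates the deployed task (predicting the next period from
# history) instead of predicting randomly held-out points.
train = sales[sales["date"] < "2021-11-01"]
test = sales[sales["date"] >= "2021-11-01"]

# Tuning data could be October, or (for seasonal data) November 2020.
# The model must never see dates at or after the test period's start.
assert train["date"].max() < test["date"].min()
```

The key design point is that the split is by date, not by random sampling, so the evaluation respects the direction of time.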
Resources#
A unifying view on dataset shift in classification (available through Boise State library)
🎥 Time Series Operations#
Time is an important kind of data that we haven’t spent much time with — this video discusses the fundamental Pandas operations for working with time-series data.
- In this video, I want to introduce the concept of a time series and some basic time series operations in Pandas.
- There's also a notebook in the resources that gives you a demonstration of the Pandas code that actually works with time series.
- The learning outcomes are for you to be able to summarize and plot time series data
- with Pandas, and to understand that time series data is often not independent.
- So a time series is a sequence of observations over time.
- These observations may be periodically sampled, like having one measurement for each day.
- Or they may not be: it might be that you record a sequence of events, like every time someone uses the card scanner at a door.
- Typically, the observations are the same kind of thing, just as the instances in a normal data frame are the same kind of thing.
- The observations we're looking at in a time series are homogeneous,
- like the price of a stock, the revenue of a country, or some kind of user activity. In mathematical notation,
- we typically refer to a time with the letter t, and around the current time t we have t - 1 and t + 1.
- We often speak of time steps. We might measure time in years, months, days, seconds, whatever is appropriate to our particular application.
- And we can talk about the value at time t, x_t.
- We can talk about x_{t-1}, etc., for the previous value, the value at the previous time step.
- This particularly applies when we have a periodically sampled time series.
- To show you an example of a time series, this is from the MovieLens data that we've been working with.
- It is, by month, the number of ratings in the month.
- So our time steps are months, and our value
- is the number of ratings for movies in month t.
- We can see those values. There were some spikes of growth; there were some spikes of activity here.
- We've got a few one-month spikes. We had a significant jump there.
- And there's been declining month-over-month activity as we go into 2018 and 2019.
- So there are many ways, as we talked about earlier, that we can represent time: we can have string dates, years, months;
- we can have timestamps. To work with a time series in Pandas, though,
- we have to convert the time to a datetime.
- We can't use Pandas time series operations without converting our time into a datetime somehow first, from whatever format it's in.
- So the setup steps to be able to start working with time series data in Pandas are to first create or convert a datetime column.
- It might be that you need to put together multiple columns;
- it might just be that you need to convert an existing column, so that you have a timestamp
- column of type datetime for each of the instances in your data frame.
- Then typically you're going to want to index the data frame by timestamp and then sort it by index using the sort_index method.
- You can use a lot of datetime operations without setting an index,
- but setting an index is often the most convenient way to work with time series data in Pandas.
- Time series data is an exception to our general guideline to prefer unique indexes (Pandas can have memory
- and performance problems with non-unique indexes). You might have multiple events that happened at the same time,
- and the utility of a time series index for looking up things by time and by date overrides
- the general concern about avoiding indexes with duplicate keys.
- So the first operation we can perform on time series data is resampling. Resampling works like groupby,
- but it works on time intervals. In this code, what we're doing is resampling by an interval of one month.
- We could resample by day; we could resample by week.
- This is telling Pandas that we want to group the data by one-month periods. The data frame is indexed by a timestamp,
- but our groups are going to be October of 2013,
- November of 2013, and so on. Within that, we're going to do exactly the same thing we do with any groupby:
- we're going to pick a column and we're going to count it. This is going to count the number of ratings in each month.
- The results of a resample work exactly like a groupby,
- except that rather than being grouped by the distinct values of a column,
- rows are grouped by a time period of the index. You can also group by a column:
- there's an `on` option for resample that allows you to specify the column that you want
- to do the resampling on if you have not indexed your data frame by its timestamp column.
- So this is our first operation: we can resample our data.
- This works particularly well for data that is events,
- but it can also work well for data that is already sampled:
- if you've got daily measurements of something and you want to take it by month, you can resample that by month.
- You can also upsample and have it fill in missing values for time periods that you're missing,
- if you want to increase the resolution of the data.
- Another thing we can do is plot. Matplotlib and Seaborn both render timestamps on the x-axis relatively well.
- So if I just do a line plot and give it a series, it's going to mark the x-axis by an appropriate time-derived
- value. In this case, because my data starts in 1996 and runs until relatively recently,
- it's marking the x-axis by year.
- So they know what to do with timestamp columns,
- and they will render the x-axis with appropriate labels for time series data.
- Another operation is a range select. Time series indexes support range operations, so we can select by range:
- we can say we want everything from January 1st through December 31st.
- Now, one key thing to note is that when you're doing a range selection on a time series index,
- unlike basically every other slicing context in Python, it includes the end point.
- So this query is going to include December 31st, and it's going to include all day on December 31st.
- It's not going to do the usual stop-right-before.
- We can also just pass in a partial date in order to select by a period.
- So if we want to select July 2010, we can do that with `.loc`:
- we can look up `'2010-07'`, and it's going to give us the ratings that are in July.
- Another set of operations we can perform are diff and shift.
- The diff operation computes x_t - x_{t-1}.
- It's basically the opposite of cumsum:
- we've got our data points, and it computes each one minus the one before it.
- The first one gets NaN; your second value becomes the second minus the first, et cetera.
- Shift is an operation that just shifts data points by one or more time steps.
- A shift of one, which is the default if you don't specify a parameter, is what we call a lag operator:
- it moves points down one, so that each one has its previous value. So diff is just `x - x.shift()`, x minus the lag of x.
- A shift of minus one is a lead, so each one has the next value, if you want to compare each value with the one that comes after it.
- So this diff operation, and shift, which is its building block, allow you to compare data to the previous data point.
- This is particularly useful for data that is periodically sampled, such as day by day or by seconds.
- It's not quite as useful for events, but you can convert timestamped event data into period data using resampling.
- So there are a few different effects that we want to think about in time series data.
- One is a trend, which is a period-over-period change in values.
- So the price of something is going up; it'll often be noisy around that,
- but the overall trend is up as you go through time.
- You can have linear trends; you can have exponential trends.
- You might hear the value R0 used in discussing disease transmission in epidemiology:
- that's a multiplier for an exponential trend.
- We can have seasonal and cyclic effects, which are periodic effects in the data. Seasonal ones are tied to the time of year:
- like holidays; if you're in a commerce setting, the holidays look really, really different from June.
- There can also be cyclical effects, like on a weekly cycle, that affect your data:
- people behave differently on weekends than during the week, etc.
- There are also shocks, which are events that impact the time series and change the data going forward.
- These can have short-term effects or they can have continued effects. A shock is an outside event:
- if your time series is, say, the price of a stock, a shock would be an event that affects the stock price.
- One thing that's important to know about time series data is that it is often what we call autocorrelated, which means correlated with itself,
- because x_{t+1} and x_t are not independent.
- Today's weather is probably more strongly correlated with tomorrow's weather than with the weather in three months.
- So if you have observations of a variable over time, especially if you're trying to predict the future with today,
- they usually violate independence, and we're going to need to take special care in order to model what's going on with such data.
- Even if we have other predictive variables,
- there's a non-independence from one step to another that we need to account for when we're doing statistical modeling.
- So to wrap up: repeated data over time requires particular handling, and there are distinct operations for working with it.
- Pandas provides access to datetime columns and datetime indexes to allow you to do various time series operations.
Resources#
📓 Time Series Example#
The MovieLens Time Series notebook demonstrates basic time series operations in Pandas.
🎥 Correlated Errors#
Regression models assume that their errors are independent. This video introduces two kinds of non-independence and methods for addressing them: grouped observations, addressed with a mixed-effects model, and temporal auto-correlation, addressed with ARIMA models.
- In this video, we're going to talk about correlated errors.
- In the previous video we talked about time series, and we talked about autocorrelation, where a value is correlated with the value before it.
- This video, we're going to talk more generally about the idea of correlated errors and how they can be addressed.
- I want you to be able to know when you need to look for a model that can handle correlated errors,
- and to know a few of the models you would need to go study.
- So linear regression has the assumption that our errors in particular are independent, but often they aren't.
- If we've got time series data, we have this autocorrelation that will manifest in non-independent errors
- if we don't control for it. In within-subjects designs in experimental design, each participant gets more than one treatment,
- and you want to compare how the treatments performed within the same person. The observations are not independent
- because there's a person-level effect: the person has a certain propensity toward the outcome that you're trying to measure,
- and then the treatments also have an effect. You can have other data that can be grouped as well.
- If you're trying to study medical outcomes and you're assigning people to a treatment or a control group, people within the same hospital,
- people that are being treated by the same medical staff,
- might have different outcomes than people who are being treated by different medical staff.
- And so if you just pool all of the participants together, you can have this non-independence:
- participants at one hospital are more likely to look like each other than they are to look like participants
- at some random other hospital.
- For an example to talk about this more concretely, suppose you're trying to measure search engine effectiveness.
- So you have two search algorithms. Maybe you're working for Google, or you're working for Microsoft on Bing, or you're working
- somewhere else where you have a search setting, and you've got two search algorithms:
- maybe one's your current one and one's your proposed new one, or you're trying out two different possibilities.
- Users issue queries, the search systems return results, and
- you measure the accuracy of the result list that you get back from each search system.
- So your data point, your instance, consists of a query and the responses from one of the search systems,
- along with which system generated those responses. It might be that we somehow have experts assessing how good the results are.
- It might be that we're looking at click logs: we actually have users; they issue a query;
- one of our systems returns results; and we see how likely they are to click a result, and how high up in the search list
- the result they click is, and we use that as the basis to measure how good the searches are.
- The problem is that these results are not independent.
- We have these queries, and the same query may appear multiple times, and not all queries are as easy to answer.
- Maybe one query has more documents that are relevant.
- Maybe one query has no documents that are relevant. Maybe a query has language that matches your documents, or matches too many documents, so
- it's hard to figure out which one in particular the user is looking for.
- There are many different reasons for query difficulty; you can take the information retrieval class to learn a lot more about that.
- But this is the setup I'm going to use as an example: we have queries,
- a query might happen multiple times (people will probably search Google for "pandas time series" a lot),
- and you're trying to measure the effectiveness of your two systems on these repeated queries.
- So a naive solution, without taking the correlated errors into account (you might try to do a t-test,
- but let's build a linear model), is a model where alpha is your intercept and you
- have a variable that says whether the result list is provided by the new system or the old system:
- y_i = alpha + beta_new * new_i + epsilon_i.
- If you have more than two systems, it would be a dummy coding of which system is providing the results.
- And we have our errors, and we're trying to predict our outcome; the outcome is some metric, and it doesn't really matter which one for our purposes right now.
- So beta_new is going to be greater than zero if the new system performs better than the old one.
- We're going to want to check the confidence interval of this. But the problem is, as I just said,
- these epsilons are not going to be i.i.d. normal, because the epsilons for the same
- query are going to be more like each other than the epsilons for a different query.
- That's a lack of independence. The solution to this is to use what's called stratified errors, or group effects;
- there are a lot of different names for similar concepts, but the idea is to allow the query to be a predictor variable, too.
- Now, one way to do this is something called repeated-measures ANOVA,
- but repeated-measures ANOVA requires that you have a very balanced and matched design.
- A mixed-effects model is a little more flexible. What it does is: we've got our model from before,
- we've got our intercept, we've got our beta, but then we also have a query effect.
- This is a per-query value; the way it's actually fit, these are effectively coefficients
- on a one-hot encoding of query as a categorical variable.
- And then we have our error. So what this.
- This is called a mixed effects model because it combines two types of effects.
- The fixed effects, which are what we control: the experimental condition.
- The thing we're wanting to test, we are looking for the coefficients on the fixed effects.
- In our example, it's which search algorithm we're using.
- And then we have the random effects, which are the experiment's natural sources of variance; in this case,
- it's the queries. As I said, this is actually coefficients on a one-hot encoding of the query.
- And so we're effectively making the query a predictor.
- And what's going to happen is the intrinsic relative
- difficulty of the query is going to be encoded here.
- And this is going to capture the concept of given a query of a particular difficulty.
- Does the new system perform better than the old one,
- and that's going to give us a much more accurate picture of the performance of the new search system.
- And ideally, once this is done (as we saw, you always check your assumptions),
- we have a much better chance that the epsilon-sub-i's are going to be i.i.d. normal.
- Not necessarily; it still might not be enough to capture everything that's going on in the model.
- But if the primary source of variance that you have a problem with, your primary
- source of structure or correlation in your errors, is effects within a group,
- where your observations come in groups and there can be a group-level effect,
- The mixed effects model allows you to control for those group level effects to get a much more accurate estimate of your actual experimental effect.
- So the basic idea here is that when you've got correlation or structure in your residuals, you can
- try to model that directly. Random effects capture these natural but known sources of variance between observations.
- And then the performance becomes the system effectiveness, plus or minus the query difficulty.
- And then, as I said, hopefully our remaining errors are i.i.d. normal, and now our linear model interpretation works again.
- So we've taken something that didn't work as a linear model: you've got structure in your errors because you've got this within-query correlation,
- or non-independence. But we've now taken that effect out, and now our errors are just the remaining error.
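To make the one-hot idea concrete, here is a minimal numpy sketch. The dataset, effect sizes, and variable names are invented for illustration, and strictly speaking this fits the query effects as plain dummy coefficients (the simplest version of the idea); in real work you would typically reach for a dedicated mixed-effects routine such as those in statsmodels or lme4.

```python
import numpy as np

rng = np.random.default_rng(42)
n_queries, reps = 20, 5

# Simulate per-query difficulty (the group effect) and a true
# system effect: the new system improves the metric by 0.3.
difficulty = rng.normal(0.0, 1.0, n_queries)
true_system_effect = 0.3

rows = []
for q in range(n_queries):
    for system in (0, 1):          # 0 = old system, 1 = new system
        for _ in range(reps):
            y_val = (2.0 + difficulty[q]
                     + true_system_effect * system
                     + rng.normal(0.0, 0.2))
            rows.append((q, system, y_val))

q_ids = np.array([r[0] for r in rows])
system = np.array([r[1] for r in rows], dtype=float)
y = np.array([r[2] for r in rows])

# Design matrix: intercept, system indicator, and one-hot query dummies
# (dropping the first query to avoid collinearity with the intercept).
query_dummies = (q_ids[:, None] == np.arange(1, n_queries)[None, :]).astype(float)
X = np.column_stack([np.ones_like(y), system, query_dummies])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"estimated system effect: {coef[1]:.3f}")  # should be close to 0.3
```

Because the query dummies soak up the per-query difficulty, the system coefficient is estimated cleanly even though the raw errors were correlated within queries.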
- So when do you need to use this kind of an analysis design?
- Basically, any time your data points come in groups, you need to think about using this kind of an analysis design.
- Well, any time you're trying to do a regression and your data points come in groups,
- you want to think about this kind of analysis design. You might have multiple measurements for the same object.
- The key thing, though, is whether you're trying to understand the difference between measurements
- or the difference between objects.
- If you're just measuring objects with multiple devices, then the object would be a fixed effect and the measurement device would be a random effect.
- Remember, the fixed effect is the thing you're trying to study is the thing you're controlling and manipulating in your experiment.
- And the random effect is the other sources of variance that you need to account for.
- Also, if you have an extra feature that's being shared between instances,
- you might want to model that as a random effect as well.
- If you don't care about learning the effect of the feature, and you just need to control for it,
- then it can work as a random effect in the mixed effects model.
- You can also have nonlinear mixed effects models for doing categorical regression, Poisson regression, etc.
- So now, to go back to the autocorrelation that we talked about in the time series video: time series data is often correlated with itself.
- So it raises the question: can we predict x at time t with x at time t minus one?
- Also, removing that prediction, removing the autocorrelation, may let a linear model work for other effects in the data.
- And so the idea of autoregression is that you predict a data point with its history. So for x sub t,
- we try to predict it with a linear model: beta-zero plus gamma-one times x at t minus one.
- So basically each value is a linear function of the previous value. You can generalize this to more than one previous value.
- So the AR(k) model, the autoregressive model of order k, is going to look at the k previous points, summing from one to k.
- And what this does is that if there was a shock or a change, and it changes the value of your time series,
- that effect will accumulate and carry forward. If you've got a change,
- then that changes the value at time t; then time t plus one, since it is based on time t,
- carries the value forward, and then it carries forward to t plus two, etc.
- A moving average model uses the error from the previous prediction to predict the next point.
- So x sub t is our intercept, plus a coefficient times the error at t minus one, and then it has its own error.
- And likewise, we can have an MA(k), which looks at the previous k errors.
- What this does is that an adjustment or a shock wears off:
- it'll affect the first few points, and then its effect will wear off,
- And the data will regress back to its mean. Now, we can put the autoregressive model and the moving average model together into a model called ARIMA.
- An ARIMA(p, d, q) model combines an AR(p) model and a moving average of order q,
- so the model can capture both long-term effects of a change and short-term wearing-off effects of a change.
- And then it can also be applied to diffs.
- So if d is one, then rather than modeling the time series data itself, we're applying the ARMA model to the diff of the time series data.
- If it's two, then the diff of the diffs. Oftentimes the autoregression and moving average require the data to be what's
- called stationary, and sometimes diffing, often diffing your time series, can make it at least stationary enough.
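As a sketch of just the autoregressive piece (not full ARIMA fitting, which dedicated libraries such as statsmodels handle), here is a small numpy example that simulates an AR(1) series and recovers the coefficient by regressing each value on the previous one; the series and coefficient are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) series: x_t = 0.7 * x_{t-1} + noise.
n, phi = 2000, 0.7
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal(0.0, 1.0)

# "Each value is a linear function of the previous value": regress
# x_t on x_{t-1} with ordinary least squares.
X = np.column_stack([np.ones(n - 1), x[:-1]])
coef, *_ = np.linalg.lstsq(X, x[1:], rcond=None)
print(f"estimated AR coefficient: {coef[1]:.2f}")  # should be near 0.7
```

Generalizing the design matrix to the k previous values gives the AR(k) idea from the video.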
- We can also think of the autoregression and the integration parts as a type of feature engineering.
- So if our data points are these x values, we're engineering the feature "the value at the previous timestep", or we're
- engineering the feature "difference from the previous timestep":
- at time t we have the feature x at t minus one, minus x at t minus two.
- That's one way you can think about the autoregressive part and the integration part of the ARIMA model.
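That feature-engineering view is easy to demonstrate with pandas; the column names and toy values here are my own.

```python
import pandas as pd

ts = pd.DataFrame({"x": [10.0, 12.0, 11.0, 15.0, 14.0]})

# Lag feature: the value at the previous timestep (the AR part).
ts["x_lag1"] = ts["x"].shift(1)

# Difference feature: change from the previous timestep (the I part).
ts["x_diff"] = ts["x"].diff()

print(ts)
```

The first row of each engineered column is NaN, since there is no earlier timestep to look back to.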
- Now, we can integrate this with prediction, because we can use ARIMA to model the natural time series behavior of the data,
- and then we can combine that with additional variables or features:
- maybe "is it during a holiday", or you've got a feature that turns to one at some point in time
- and then stays one throughout the rest, to capture a change that happened in the system.
- And if you use it with this design, then beta-sub-one is going to be
- the influence of that feature after we've controlled for the normal temporal behavior of the data.
- And then hopefully our epsilon-sub-t's might be i.i.d. normal now,
- and we can get back to the point where our linear model works again.
- Now, I've tried to give you the rough shape and the rough structure, so that you know what you need to go study and you have a starting point.
- What I told you is not enough to go effectively apply ARIMA.
- There are resources; I'm pointing you to one extended slide deck that talks a lot more about ARIMA.
- If you need to use ARIMA for time series modeling, you do need to go do some more study.
- There's also a time series analysis class that you can take in the math department and spend an entire semester learning how to do time
- series analysis with models like ARIMA.
- So to wrap up: we often have structured errors in a regression problem.
- Structured errors mean our errors aren't independent; we're violating a key assumption for the inferential validity of linear regression.
- But in some cases, we can model that structure, either with the random effects of a mixed effects model,
- modeling the effect of which group a data point is in, or with an ARIMA model, modeling temporal
- effects of the data, in order to yield a model that is inferentially valid
- again.
Resources#
Time Series Analysis slides — much more in-depth treatment
🎥 Publishing Projects#
This video talks about going from an analysis and its notebooks to a publishable paper.
- We've talked a lot about how to do the analysis part of data science.
- I now want to take a little bit of time to talk about what you need to do in order to take that and actually put it into a publication.
- So we're going to talk about that:
- I want you to understand what's needed to go from analysis to a published report, and also to be able to outline a research paper.
- So there are a lot of different audiences that a document that comes out of a data science project can be for. It might be an internal document,
- just for your collaborators, to update them on some analysis you're doing as part of a project.
- It might be a report that's intended to be used by decision makers in order to make a data-informed decision.
- It might be a report for others elsewhere in your organization, or for other organizations.
- It might be a scientific publication for the scientific community, or it might be a document intended to educate the lay public.
- So there are a variety of formats that these can come in. One is that it could be a written document that's either electronic or printed.
- It can be a presentation that you're giving live or that you're recording as a prerecorded video.
- It could be an interactive online demo or dashboard. These happen for product dashboards and internal monitoring all the time.
- They also happen sometimes in journalism, where a
- news organization will make an interactive data visualization to allow people to explore the data that underlies some of their reporting.
- The goals of a publication in any of these forms, for any of these audiences, are that the reader needs to be able to understand what you did,
- what you learned, and then what they should do or take away from it.
- In general, not every analysis is going to have actionable "here's what you need to do"
- kinds of insights, but your readers need to understand what the takeaways are and why they should believe them.
- That's part of being able to communicate your methods: to show
- your work, showing the evidence behind the conclusions that you're drawing.
- So the typical outline for a lot of publications, particularly scientific publications,
- but a lot of other publications are going to have many of these same outline elements,
- is that you first have an introduction that sets the stage for why you're doing this.
- Oftentimes an introduction will also
- foreshadow the conclusions. The report of a data analysis, particularly a scientific publication, is not a mystery novel.
- You should not keep your reader guessing until the end what the conclusion is. You can just say it in the introduction:
- we're trying to solve this problem; here's the bullet-point summary of what we found.
- And then you get into the details, then you've got background and related work and background and related work are not the same thing,
- but they're often put together in one section. Background is background information that people need to understand what you did.
- Background on the problem. Background on the methods that you're using to apply it.
- You know, there is common knowledge that you can assume, and what that common knowledge is varies from audience to audience.
- But the background is the material that readers need,
- on top of the common assumed knowledge, in order to understand what it is that you did.
- Related work is other work on similar or related problems; it might be working on the same problem with different scope, different methods, etc.
- It might be using the same methods for a different problem,
- but it fills in the place of the other knowledge that we have about the problem space that you're trying to study.
- So people have a context for how your analysis is filling in a gap in our current knowledge.
- So the background is the prerequisite knowledge and the related work is the adjacent knowledge.
- And between them, your readers will be able to understand what you did and understand how what you did contributes to knowledge.
- The methods are where you explain what you did. What's the data? What are the statistical methods you used?
- What are the machine learning methods you used? As I said, in some variants, methods may sometimes be split.
- So if you've got something where maybe you have a new algorithm, and then you're doing experiments on it,
- those will usually be separate: you've got your "here's my new algorithm" section,
- and then maybe in an experimental results section you'll have methods and then results.
- Then you have your results. So the methods are "here's the experiment I'm going to run, the data I'm going to use, how I'm setting up the experiment".
- Results are "OK, so I did that; what do we learn?"
- It's where you have your key figures and charts, et cetera. You're presenting the results of running the experiment.
- Discussion, then, is where you talk about what the results mean.
- So you have the individual results: a chart that shows the accuracy of a model,
- a chart that shows what's happening in some data over time.
- Discussion is where you pull it together and you connect the results that you have
- back to the original problem context that you put forward in your introduction.
- And then finally, in the conclusion, you summarize the takeaways of the paper, and you often point to some future work, maybe some variants on this.
- Some research communities put related work at the end right before the conclusion.
- Here's everything I did, by the way. Here's what other people did. Now we conclude.
- If you're doing something that's not a scientific publication, but an institutional report,
- a lot of those lead with an executive summary: one to two pages with the key points and takeaways.
- It does not get into all of the methodological details, but it summarizes:
- "OK, here's what you need to know, if you trust me that I did all the details
- right and the analysis right. Here's what you need to know, or here's the recommended course of action."
- And then the rest of the report backs that up.
- And so if they have a question of how you know the things you're telling them,
- or they need to know where your recommendations are coming from, they can go read the rest of the report.
- And in some writing you might merge discussion with results: you might talk about the results and what they mean together.
- You might merge discussion with conclusion, with a longer conclusion that has the discussion integrated.
- But these are the key pieces usually in this order for the typical kinds of research papers,
- but also other kinds of documents that come out of data science projects.
- Another point you need to pay attention to when you're going into publication is rendering your plots,
- especially if you're going to a publication that might be printed, or anything
- that you're delivering as PDFs. The key thing is that you want high-quality images.
- And when practical, vector images are good. Python plotting software can export an image to PDF; LaTeX can pull in the PDF,
- and it's not rendered to pixels. It actually includes all of the paths, the circles, the letters.
- And so you can zoom in on that PDF as far as you want, and you're never hitting pixelation.
- If you have a very complex image, it can overwhelm PDF:
- you wind up with a PDF that just takes forever to render, because you've got an image that's trying to render five million data points.
- So in those cases, or if you're working with software where you don't have a good vector image path,
- use a high-resolution image in PNG. You don't want to use JPEG. Never use JPEG for a figure, because JPEG
- loses information in the compression. Use PNG or TIFF, at at least 300 DPI.
- Six hundred is better. Your typical printer is around 300 DPI.
- And so if your image is 300 DPI, most people will be able to print it
- well; if it's six hundred DPI, basically anyone can print it well,
- and it will also look very good on high-resolution screens. You want to make sure you have clear labels and captions, distinct colors, shapes, et cetera.
- So the image is clear, readable, self-contained.
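A minimal matplotlib sketch of the export advice above: save a vector PDF for LaTeX, and a 300-DPI PNG as the raster fallback. The sample figure and file names are invented, and the Agg backend is used so this runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")           # headless backend; no display needed
import matplotlib.pyplot as plt

# A 5-inch-wide figure, to be scaled down slightly into a paper column.
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot([0, 1, 2], [1, 3, 2], marker="o")
ax.set_xlabel("time step")
ax.set_ylabel("metric")

tmp = tempfile.mkdtemp()
fig.savefig(os.path.join(tmp, "figure.pdf"))           # vector: no pixelation
fig.savefig(os.path.join(tmp, "figure.png"), dpi=300)  # raster fallback at 300 DPI
plt.close(fig)
```

The PDF stays sharp at any zoom level; the PNG is the fallback when a vector path is impractical.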
- Oftentimes I'll draw seven different versions of an image, or I'll have a bunch of different figures,
- but then I'll combine them together. I'll have the one clean picture that has all the pieces, all the bells and whistles.
- For the final version,
- I pay particular attention to the precise wording and consistency of the labels and everything that's going to go into the final publication.
- Then you're also going to want to experiment with dimensions and aspect ratio.
- So if I've got a two-column paper, so that my columns are about three to three point seventy-five or so inches wide,
- I might have a five-inch-wide image and then scale it down a little bit to fit in that column, and it'll look really good in the final paper.
- Another thing, though, that you need to be able to do is to check the writing in the presentation, step outside yourself.
- Forget that you did all of the work and wrote the document.
- Forget, or set aside, all of the knowledge you have as the person who did the work, and read what you wrote and ask:
- do I understand this, as if I were not the one who created it?
- Can I understand it? Could I reproduce it? Is there information that's missing?
- Other readers help you with this, and you should get input from other people who weren't you.
- But as a first pass, and in order to make the best use of their time,
- you need to be able to set aside your knowledge as the creator and evaluate whether what you wrote is clear and complete,
- And in a coherent order. And this applies to any kind of writing you're doing,
- both presenting the results in a report from a paper, but also in other writing like documentation.
- Writing a read me about how to run your experiment.
- You need to be able to step back and ask: if I followed the steps of these instructions, one after another,
- would I get the result? Or are there steps that are missing?
- Are there pieces of knowledge that are missing,
- so that if I just read these instructions in this README file, I would not have enough information in order to complete the expected activity?
- So to wrap up internal and external publications require special attention to writing and the visual presentation.
- Hopefully this gives you a little bit of a start for learning how to do that.
Resources#
PlotNine is a good plotting library for preparing consistent, publication-ready graphics.
The book gender example also demonstrates the current evolution of my own practices for preparing for publication.
🎥 Production Applications#
How do you put the results of your data science project into product?
- In this video, I want to talk with you a little bit about some of the things you need to consider, and the
- design patterns, in order to be able to take your data science results and put them into production.
- The learning outcomes are for you to understand how data science outcomes can be used in
- business settings, and to think about what you need to do to put models and outcomes into production.
- So data science can serve a variety of purposes in business.
- We can use data-driven reports to inform decisions. We can do forecasting for internal purposes, to inform internal decision making and
- internal planning.
- We can also have data science outputs for real-time decisions, either internal, for making internal business decisions, or customer-facing:
- part of your e-commerce platform, providing recommendations, doing fraud detection,
- various things in order to make your product and your customer experience work smoothly.
- So one of the first things is that reproducibility is crucial.
- If you're running regular forecasts of future business activity or future demand,
- so you can do inventory planning, then to be able to rerun those reports,
- you need to have a reproducible pipeline, so you can rerun this week's report quickly and easily without having to do a bunch of manual labor.
- Also, if you have something that's online, making online decisions, you need to be able to retrain it as new data comes in.
- So for online use, we're building an app or a data science product that's going to be making decisions
- in an online fashion, as new users come in or as new decision requirements come in.
- There's a variety of modalities for delivering. It might be a Web app, a mobile app, a desktop app.
- It might be something that just lives in server side infrastructure, like the spam filter that's built into email, infrastructure, etc.
- But in terms of the technology structures: there are some exceptions, but mobile and desktop apps often use Web technology to connect to models.
- So even if you're targeting mobile apps, even if you're targeting desktop apps,
- learning to build Web based services for data science outputs is going to be a really useful skill.
- Multiple audiences can use these. You might have an internal reporting dashboard that shows your customer volume, or the
- throughput on your assembly line, or other aspects of the functioning and health of your factory.
- You can use it for internal decision making, and then also either to help your customers make
- decisions, or as you're making decisions about your customers in an online, interactive setting.
- So one architecture that's common for these kinds of applications is what's called a service oriented architecture.
- And what a service-oriented architecture means is that your infrastructure is split into different services, different individual pieces.
- So you've got a Web server, and your customers or your users come in with a computer
- with a Web browser, or maybe with a mobile platform, and talk to a Web server that serves up the application.
- The mobile platform might talk directly to the backend.
- Then the Web server talks to various other services in order to fill out
- the user experience.
- Those services then go and get data from various databases in your backend, in order to serve up the responses to the requests.
- So a lot of organizations use this.
- Amazon uses service oriented architectures extensively.
- So when you go to a page on Amazon.com, there's one service that's providing product details, another service
- providing "people also bought" recommendations, and another service handling the shopping cart.
- And each of these services works independently, and the Web server puts them together into a composite experience.
- So from a data science perspective, a lot of what's happening is you need two pieces: you need a UI
- in the Web server (the UI might come from the service itself,
- but you need a user interface component), and then you need the service itself, which is going to serve up the data science.
- Usually it's going to be some kind of a prediction:
- it's going to serve up the results of running the model that you've built and trained, on the new data,
- to answer that particular request. So a lot of your work is going to be building that service.
- And one way to design it.
- And this goes well with the service oriented architecture is each different model that you have can live in a different service.
- So to deploy, the predictions are made available with a Web service.
- You might use other things: something like ZeroMQ, or Thrift, or some other RPC
- protocol, or you might use an HTTP REST API, for the Web server to talk to these services.
- One example is TensorFlow Serving, if you're building your machine learning model with TensorFlow.
- There's a program called TensorFlow Serving that allows you to upload your saved model and serve requests based on that model.
- So you can train the model offline, on other hardware, so that your online system can just keep running.
- You're not using its CPU power to train the model.
- It's just dedicated to serving up responses: serving up recommendations, making decisions about new incoming messages, etc.
- On other hardware, you train your model, and you save that trained model to disk somehow.
- Or you might actually upload the model directly; some setups have a model server. But you save it somehow and you make it available to the service,
- which will then reload the model and start serving up predictions from the new model that you just trained.
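As a rough sketch of what such a prediction service looks like, here is a self-contained example using only the Python standard library. The "trained model" is just a made-up linear function, and a real deployment would use a proper framework or something like TensorFlow Serving; the point is the shape of the interaction: a request comes in over HTTP, the service runs the model, and JSON comes back.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# A stand-in "trained model": coefficients from some offline training run.
MODEL = {"intercept": 0.5, "coef": 2.0}

def predict(x: float) -> float:
    return MODEL["intercept"] + MODEL["coef"] * x

class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect a JSON body like {"x": 1.5}; return {"prediction": ...}.
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["x"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

# Start the service on an ephemeral port and make one request against it.
server = ThreadingHTTPServer(("127.0.0.1", 0), PredictionHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"x": 1.5}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
print(result)  # {'prediction': 3.5}
server.shutdown()
```

Swapping the model dictionary for a freshly reloaded saved model is the retraining-and-redeploy loop described above.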
- So a few useful capabilities for building out this kind of infrastructure: first, to be able to train
- models on live or freshly exported data. So you need to be able to get your current data from the database,
- so you've got all the current customer transactions, and then you train your statistical model on them.
- As part of that process, it's useful to be able to hold out test data,
- so you can train this model and test it again before you deploy it, to make sure you didn't accidentally train a model that performs badly.
- It can be useful to version your models, so that you have the ability, in particular, to roll back to an old model version.
- Maybe you train the model, you test the model, it works; you put it in production, and suddenly your spam filter
- metrics change. Being able, as a stopgap, to roll back to the previously trained model
- means you can then go figure out what went wrong with your model, without leaving your customers with the new, bad experience.
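A minimal sketch of that versioning-with-rollback idea, using pickle files in a temporary directory. The file layout, version scheme, and "models" here are invented for illustration; production systems often use a dedicated model registry instead.

```python
import pickle
import tempfile
from pathlib import Path

model_dir = Path(tempfile.mkdtemp())

def save_version(model, version: int):
    """Persist a trained model under an explicit version number."""
    with open(model_dir / f"model-v{version}.pkl", "wb") as f:
        pickle.dump(model, f)

def load_version(version: int):
    """Reload a specific saved version."""
    with open(model_dir / f"model-v{version}.pkl", "rb") as f:
        return pickle.load(f)

save_version({"weights": [0.1, 0.2]}, 1)  # last week's model
save_version({"weights": [0.3, 0.4]}, 2)  # newly trained model

current = load_version(2)
# ...metrics look wrong in production: roll back as a stopgap.
current = load_version(1)
print(current["weights"])  # [0.1, 0.2]
```

Because every version stays on disk, rolling back is just loading the previous file while you investigate the new model offline.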
- The exact details of this depend a lot on your institution, your product, and your infrastructure.
- But these are some of the things that you need to keep in mind, and skills that are useful to learn in order to build this.
- So: Web backend programming, to build the service; Web frontend programming, to be able to build dashboards and the user interface,
- the pieces that are going to make your service visible and available to users. In some organizations,
- those pieces are going to be handled by other people on your team, or by a completely different team.
- But in many cases it can be useful, and
- you at least need to be able to talk with those people. And then, also, be able to do performance measurement and tuning,
- to be able to debug if your model is performing too slowly. So to wrap up: many data science projects
- result in online production capabilities. This is often done by training a model and deploying it as a Web service or some other kind of network service.
- And it's useful to learn at least some of the skills in order to do that,
- or at least to be able to talk with the other folks in your organization who are handling that deployment and monitoring.
🎥 Topics to Learn#
This video goes over some useful topics to learn to fill out more of your data science education.
- So you've learned a lot this semester, I hope. In this video, I want to briefly overview
- some more things to go learn, and give you pointers to places where you might be able to go learn them.
- The learning outcomes are for you to know some topics and some software to study for expanding your data science skills.
- You do not need to study all of this to be a competent data scientist. The goal is also to give you some input,
- some data points, to be able to help select the classes you want to take to complete your graduate degree.
- So, to learn more about machine learning and statistics: we've just introduced the basics.
- As I said early in the semester, almost every week of this class could be an entire class on its own.
- CS 534 is the machine learning class.
- You're going to learn a lot more about different machine learning models, how to build them, how to optimize them, how to evaluate them.
- We've learned the concepts: what they do, what they are, how a couple of them work.
- 534 is going to teach you a lot more about the insides of machine learning models. Math 562 is Probability and Statistics II;
- if you want to learn a lot more about statistical inference, that class will teach you.
- 572 is going to teach you a lot more about the computational side of statistics, particularly a lot of simulation things:
- Monte Carlo simulations, the kinds of things you do with simulation, the kinds of things you're doing with bootstrapping.
- That class is going to go into a lot more detail. Time series analysis:
- if you want to learn a lot more about that, we have an entire class on it.
- If you want to learn more about working with text, two key topics to be looking at are natural language processing and information retrieval;
- the computer science department has classes on both of those.
- For working on specific application areas: if you want to do work on social media, you can take CS 539, the social media mining class.
- If you want to do work on information retrieval, we have two classes on it:
- 537, which covers how information retrieval works, and 637, which is an advanced research class on information retrieval.
- If you want to learn about recommendation and personalization, you can take CS 538,
- which discusses how to build and evaluate recommender systems to recommend products to people.
- Software engineering skills are really useful for being able to take your data science and connect it into product.
- So to that end, there's CS 573, the advanced software engineering class; it's useful for learning the process of software engineering.
- It's also worth studying programming practices and software engineering in general on your own, and practicing: practice,
- practice, practice. Write code, do modeling, build web applications (or other applications) around the data science analysis that you're doing.
- But also then reflect and think about things as you write code.
- Is this readable? What could I do to try to make this more readable, more efficient?
- What do I think I need to go learn in order to expand my ability to do that?
- So we have a class on deep learning. See a six thirty three also learning software.
- TensorFlow and PyTorch are both useful pieces of software to learn in order to go beyond what you can do in scikit-learn.
- They are often associated with deep learning, and they're used extensively for deep learning,
- but they're not just deep learning packages. What they really are is very effective optimization engines for differentiable models:
- you set up a differentiable model, and they optimize it.
- A lot of those models are deep neural nets, but there are a lot of other kinds of models.
- I've used TensorFlow, for example, to build advanced matrix factorization models.
- There's nothing neural going on; it's just a matrix decomposition,
- but it takes advantage of TensorFlow's optimizers to build a recommendation engine around that.
- So either TensorFlow or PyTorch is very useful software as you're trying to build your own machine learning models,
- and it's a lot easier to use one of those pieces of software than to build your own optimizer from scratch.
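To make the "optimization engine for differentiable models" idea concrete, here is a minimal NumPy sketch of gradient-descent matrix factorization on a made-up matrix. In TensorFlow or PyTorch, the hand-derived gradients below would be computed automatically by the framework:

```python
import numpy as np

# Minimal sketch: factor a "ratings" matrix R into low-rank factors P and Q
# by gradient descent on squared error. TensorFlow or PyTorch would derive
# these gradients automatically; here they are written out by hand.
rng = np.random.default_rng(0)
R = rng.uniform(1, 5, size=(20, 12))    # toy user x item matrix
k = 4                                   # number of latent features
P = rng.normal(scale=0.1, size=(20, k))
Q = rng.normal(scale=0.1, size=(12, k))

def rmse():
    return np.sqrt(np.mean((R - P @ Q.T) ** 2))

start = rmse()
lr = 0.002
for _ in range(2000):
    err = R - P @ Q.T
    P_step = 2 * err @ Q       # negative gradient of squared error w.r.t. P
    Q_step = 2 * err.T @ P     # negative gradient w.r.t. Q
    P += lr * P_step
    Q += lr * Q_step

print(f"RMSE before: {start:.3f}, after: {rmse():.3f}")
```

There is nothing neural here; it is just iterative optimization of a differentiable model, which is exactly what those frameworks automate and accelerate.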
- I personally do a lot of work with Bayesian inference. That's a useful statistical paradigm and philosophy to learn.
- Unfortunately, we don't have a class here that really teaches it.
- But the book Statistical Rethinking by Richard McElreath is a good book for learning it. Also, the software Stan is really good;
- it is used a lot for Bayesian inference, but that's not the only thing you can do with it.
- It's useful for a lot of scenarios: what it does is let you draw from complex distributions connected to data
- (not quite arbitrarily complex distributions, but very flexible ones).
- PyMC3 is similar software that works directly in Python; Stan has a good bridge to Python as well.
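To illustrate what "drawing from complex distributions connected to data" means, here is a toy random-walk Metropolis sampler in NumPy. This is a much cruder algorithm than the samplers Stan and PyMC3 actually use (such as NUTS), and the model and data here are made up for illustration:

```python
import numpy as np

# Toy Metropolis sampler for the posterior of a normal mean (known sd = 1)
# with a Normal(0, 10) prior. The goal is the same as in Stan or PyMC3:
# draw samples from a distribution defined by a model plus observed data.
rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=50)   # observed data, true mean 2

def log_post(mu):
    log_prior = -0.5 * (mu / 10.0) ** 2
    log_lik = -0.5 * np.sum((data - mu) ** 2)
    return log_prior + log_lik

samples = []
mu = 0.0
for _ in range(5000):
    prop = mu + rng.normal(scale=0.5)            # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(mu):
        mu = prop                                # accept the proposal
    samples.append(mu)

post = np.array(samples[1000:])                  # drop burn-in
print(f"posterior mean of mu: {post.mean():.2f}")
```

The posterior mean lands near the sample mean of the data, as expected with a weak prior; Stan and PyMC3 do this far more efficiently for much richer models.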
- It's also really useful to learn causal inference. We don't have a dedicated causal inference class,
- but you will learn a lot of causal material in the advanced econometrics class in the economics department.
- Also, there's the book Counterfactuals and Causal Inference that you may wish to read in order to learn causal inference techniques.
- If you want to learn more about critical perspectives on data science, there are a variety of resources I can point you to.
- The book Data Feminism that I listed in the Class Resources is a good primer on critical thinking about data.
- If you want an example of some of the kinds of things that critical data scholars think about,
- there's a short little book called Data, Now Bigger and Better!
- that is a selection of essays on data from a critical, and particularly an anthropological, perspective.
- Also, if you want to learn more about the underlying philosophies and mechanisms by which science in general, not just data science, works,
- Sergio Sismondo's introduction to science and technology studies is very good, and a keyword that you can go look for to find much,
- much more reading is "critical data studies".
- So to wrap up: this class has laid a foundation for you to integrate further knowledge.
- Never stop learning new things. There is so much more to learn, as a student and as a practitioner, to be able to do good and effective data science,
- and to bring data to bear on the questions you have and the decisions you or your organization need to make.
🎥 General Tips#
Some final closing tips and suggestions for you to think about as you take the next steps in your data science career.
- In this video, I want to give you a few general tips,
- and wrap up the semester with some suggestions and some advice for doing data science and for learning more.
- So that's our learning outcome: some concluding tips for your data science work.
- The first is to never take your eye off of questions. Good questions are fundamental.
- John Tukey, who developed a lot of important statistical concepts,
- is quoted as saying that it's far better to have an approximate answer to the right question than an exact answer to the wrong question.
- If you have a good question that you can get insight into, even if you don't have a crisp and precise answer for it, you can often refine the answer.
- That's also often more illuminating for what you want to learn and for
- what you need to do than a very precise answer to a question
- that is not ultimately the question you want to ask.
- So focus a lot of energy on defining your question properly and appropriately, and on understanding
- its relationship to context (we'll talk about that in just a bit), what your answer means for it, and its limitations.
- The second piece of advice is to work reproducibly. I've talked about this a little bit as we've been wrapping up.
- There are many reasons that you should make your analysis reproducible.
- One really important one is that when you write up your results, whether for a publication, your thesis, or a report, you need to know what you did.
- If you've been doing a bunch of different things, writing scripts, processing data here and there,
- it's easy to lose track of exactly what steps got you from source data to conclusions.
- If you put it into a reproducible pipeline where it's automated, ideally you can rerun the whole thing with one command or a few commands.
- Then it's easy to make sure that you have accurately described all of the steps, because you can check:
- do I describe the steps that are in my pipeline? And then you can try re-running the pipeline to make sure it still works.
- Do you still get the same conclusions? If not, either you missed a step,
- or what you are reporting is the result of running a different set of steps, or a set of steps in a different order.
- So in addition to all of the external scientific benefits of reproducibility,
- and the benefits of reproducibility for being able to maintain your model in the long run in production settings,
- just for documentation purposes, reproducibility helps you make sure that you've correctly described what it is that you did.
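As a sketch of the "rerun the whole thing with one command" idea, here is a toy Python driver. Real projects often use a dedicated workflow tool instead, and the step names and data here are invented for illustration:

```python
# A minimal pipeline sketch: each analysis step is a function, and a driver
# runs them in a fixed order, so the path from source data to conclusions is
# explicit and repeatable with a single command.

def load_data(ctx):
    ctx["raw"] = [4, 5, 3, 4, 2, 5, 4, 3]        # stand-in for reading a file

def clean_data(ctx):
    ctx["clean"] = [x for x in ctx["raw"] if x is not None]

def summarize(ctx):
    ctx["mean"] = sum(ctx["clean"]) / len(ctx["clean"])

PIPELINE = [load_data, clean_data, summarize]

def run_all():
    ctx = {}
    for step in PIPELINE:
        print(f"running {step.__name__}")
        step(ctx)
    return ctx

if __name__ == "__main__":
    results = run_all()
    print(f"mean rating: {results['mean']:.2f}")
```

Because every step is listed in one place, checking "did I describe everything I did?" reduces to reading the pipeline definition and re-running it.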
- Third: never lose sight of the context of what you're doing.
- There's a reason why, when we talked at the very beginning of the semester about developing questions,
- we started with goals: we have questions that are going to advance those goals, and analyses to answer the questions.
- And the analysis is of data.
- We focus on a question we're trying to answer, and the question has to be precise in order for us to answer it.
- But that question, as with the Tukey quote, may well be an approximation of something we really care about.
- So once we define the question, we don't want to just focus on the question.
- We need to keep the big picture in mind, to remember why we're asking it.
- If we need to further refine the question, we have a touchstone to look to in figuring out how to revise it.
- If we need to adjust it because our data can't answer that question well,
- what new question is going to bridge between the data and our goals?
- That way we know where we're going. It's also crucial for being able to contextualize our results:
- why do we care about the answer to this question? What does it mean for what we know about the world, or for what we're trying to do in the world?
- It's crucial for helping identify limitations. If you know the goal, and you have a question,
- having the goal gives you the context to talk about:
- OK, so our question can do this, but there are these other things that are important to our goal that we can't do right now.
- That's fine; it gives you a way to document them. It's also super useful for generating ideas for next steps.
- Our question has taken us two steps toward our goal, and the goal is still ten steps away:
- that's a lot of future work that's already written for us. But then also, we don't want to overlook detail.
- So when we're specifying our questions and results, we need to be precise and specific about what exactly we're measuring.
- This might at first glance seem to contradict the Tukey quote at the beginning,
- since he said it's better to have the approximate answer to the right question than a precise answer to the wrong question.
- But they're not really in conflict, because whenever we're measuring something, we're measuring something very precise.
- Computers and measurements are very precise things, and we need to understand precisely what it is we measured.
- We need to understand the relationship between the precise thing we computed,
- say, a measurement of how many people
- clicked "I've never seen this number before" in our interface for dealing with incoming calls,
- and the vaguer question that we're trying to answer, so we understand the approximation.
- It's not enough to say, OK, I have the right question and I have an approximate answer to it.
- We need to understand the approximation, to the best of our ability: the relationship between the precise thing we measured,
- how it approximates the question we're trying to answer, and how that's then going to advance the goal we have.
- We can't overlook the details; we need to understand precisely what it is that we're doing.
- Different people approach connecting data and goals in different ways. Some are very top-down thinkers.
- They start with the big-picture goals; they're very focused on big-picture things.
- That's often my default mode of operation: I'm a relatively big-picture person.
- But in order to be a big-picture person effectively, you have to be able to connect that big picture with specific,
- concrete, measurable things that are going to advance the big picture.
- If you don't do that, then you're going to wind up either not being able to make progress, because you can't
- actually define something actionable that's going to advance the big picture,
- or not having a clear sense of whether you're advancing it, or making
- a lot of unsubstantiated statements, because you're not connecting to the details.
- Some people start from the details, a more bottom-up development of ideas.
- Neither of these is wrong. Different faculty in the department tend to start from top-down or bottom-up places.
- You need to be able to learn from people who communicate, who structure ideas, in both ways.
- Working bottom-up, you start more with the data and the details of the problem, and you build that into a big picture.
- There, you need to be able to see the context, and not lose the context for what it is that you're doing.
- Both are good ways to approach problems, and they are complementary perspectives that can work very well together.
- But it's something to be aware of: you can't lose sight of either the context, why you're doing the thing that you're doing, or of the details,
- understanding precisely what it is you did, to be able to reason about how that relates to your context and your goals.
- Be curious. There's so much to learn.
- I am continually learning new things about programming, about statistics, and about the problems that I'm trying to solve with these techniques.
- So keep learning. I've given you some pointers to things to read in these videos,
- and I've given you pointers elsewhere in the course materials. There's so much to go study.
- This class is intended to open the door to the world of data science, so that you can walk through it and have
- a basic framework in which to fit all the new information that you're going to need to acquire over the next years.
- Pause and reflect. This applies to a lot of things. It applies to the work you're doing: reflect on what this result means.
- Reflect on whether your code is efficient, and whether it is readable.
- Does this chart make sense? Reflect on your practices of work:
- how am I organizing my work to go from my goals to my analysis?
- How am I organizing my to-do list? Something I also tell my students is to take time to reflect on their practices.
- How do you organize your work as a student, overall? And reflect:
- is this working for me? Am I having problems with my productivity, problems with getting things done,
- problems with communication? Or maybe not.
- "Problems" might be a harsh way to frame it. Are there places I can improve?
- What's working about how I'm doing my work as a student? What's not working, and how can I improve?
- There's a risk of spending so much time on your process for getting things done
- that you never get things done. But it's important to reflect on our work and on its outcomes.
- How could this paper be better? How could this report be better?
- How can I do this kind of project better the next time? It is done, it's published, it's good;
- how can I do even better next time? So to wrap up: good data science requires ongoing reflection, study, and practice.
- Never stop learning, and never stop paying attention to the things around you.
- Ideas don't live well in little boxes.
- There may be someone working on some completely different problem,
- but they had an insight into their problem which you can go apply to yours.
- So take advantage of the opportunity to go to a wide range of talks, to read a wide range of papers and a wide range of books,
- etc., to get ideas for how you can do better work, and how you can better understand your problems,
- your problem space, the world around you, your customers, and the needs of your organization, in order to do better data science tomorrow than you did yesterday.
🚩 No Quiz 15#
There is no quiz 15 — makeup midterm instead.
Note
We have had a few weeks without quizzes (3, by my count). When setting the final grade, I am going to ensure that you are held harmless for weeks without quizzes (basically, give you the quiz grade you would have received if we had a full 15 quizzes, and you had a perfect score on the skipped quizzes).
📃 Farewell#
It’s been grand!
I would love to hear feedback on how to further improve the course; I have tried to make some corrections as a result of the midterm assessment process, and will be keeping those notes for next year, but further suggestions, either in the course evaluations or on Piazza, are welcome.
I hope to see many of you in future courses!
📩 Assignment 7#
Assignment 7 is due Sunday, December 11, 2022.
🚩 Final Exam#
The final exam will be on Canvas. It will be available for 72 hours beginning on Monday, December 12, 2022.
Format#
The exam follows the same format as the midterms.
Study Tips#
Review the previous quizzes, assignments, and midterms.
Read through the makeup midterm, even if you do not plan to turn it in.
Review lecture slides to see where you are unclear on concepts and need to review.
Skim assigned readings, particularly the section headings to remind yourself what was in them.
Review the course glossary, keeping in mind that it does contain terms we haven’t gotten to yet.