Week 15 — What Next? (12/6–10)
This is the last week of class. We’re going to recap, and talk about what’s next, both for learning and for putting what you’ve learned to practical use.
🧐 Content Overview
This week has 1h13m of video and 0 words of assigned readings. This week’s videos are available in a Panopto folder and as a podcast.
🎥 Recap
This video reviews the concepts we have discussed this term and puts them into the broader context of data science.
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
RECAP
Learning Outcomes (Week)
Wrapping up!
Tie together the class content again
Apply Pandas time series operations and model correlated regression errors
Take the results of data science analysis into production or publication
Know some topics to study further to expand your data science skills
Photo by Lumitar on Unsplash
The Data Science Workflow
[Diagram: Raw Source Data → Transform / Prepare (ETL) → Prepared Data, which feeds Data Description, Inference (producing Findings), and Modeling (producing Model + Predictions)]
What is Data Science?
The use of data to provide quantitative insights on questions of scientific, business, or social interest.
Data Management
Reading from static files
Processing and integrating with Pandas
To learn more:
CS 510 Databases
Application- and type-specific data in other classes
Mathematical Fundamentals
Probability Theory
Linear algebra (a little)
To learn more:
Math 562 (Probability and Statistics)
Math 503 (Advanced Linear Algebra)
Inference
Basic parametric pairwise comparisons (t-tests)
Bootstrapping
Sampling theory
Linear regression models (OLS & logistic)
To learn more:
Math 562 (Probability and Statistics)
Prediction
Regression: continuous outcomes
Classification: categorical (esp. binary) outcomes
To learn more:
CS 534 (Machine Learning)
Many other data science classes
Evaluation and Tuning
Train/test splits
Classification and continuous prediction metrics
Hyperparameter tuning
To learn more:
CS 534 (Machine Learning)
Unsupervised Learning
Lower-dimensional embedding (matrix decomposition)
Clustering
To learn more:
CS 534 (Machine Learning)
Will appear in other data science classes
Workflows
Data science pipeline
Breaking code into separate scripts & modules
Design patterns for code workflows
You will apply this throughout your classes!
Wrapping Up
This class is designed to lay a conceptual foundation for your future data science studies.
Other classes will build on these concepts and ideas!
Photo by Dave Heere on Unsplash
- Welcome. This is the last week of CS 533. I want to start us off by recapping what we've learned this semester.
- Our learning outcomes for this week are to tie the class content together:
- to give you a brief summary of what we've done and how it forms a broader picture
- into which the further topics you're going to study throughout your graduate degree will fit.
- We're also going to tie up a couple of loose ends. We're going to talk about time series operations and modeling correlated regression errors.
- And we're going to talk about some of the things you need to do to take data science analysis
- results and outputs into publication or into production.
- And then finally, I'm going to talk about some topics to study further to expand your data science skills.
- I want to bring up the data science workflow again,
- because it provides a context in which a lot of the things that we've been talking about fit.
- We talk about how to describe data. We talk about how to integrate it, source it and transform it.
- We talk about the various kinds of biases you have to pay attention to throughout this workflow.
- We have different kinds of tasks, such as modeling and inference.
- I also want to return us, though, to our question at the beginning of this semester or our definition from the beginning of this semester,
- that data science is the use of data to provide quantitative insights on questions of scientific, business, or social interest.
- So the goal of all of this, what we're doing with data science, is to produce insights.
- Now, there's a lot of overlap.
- There are a lot of predictive tools that are not necessarily being used to provide insights back to the operators,
- but are being used to, say, generate predictions about customer transactions,
- about whether something is fraud, whether something is a cybersecurity attack,
- where the model itself is not necessarily providing insights.
- But we need to gain insights about whether it's working, how effective it is, and where we need to go to try to improve it.
- But also the techniques that we're using are applicable to other purposes as well,
- such as training machine learning models that can do various types of predictions.
- We've talked about data management, where we've primarily focused on reading from static files.
- We did a little bit with obtaining data from the web. We've been processing and integrating our data with Pandas,
- a really useful utility skill. To take this a step further is to be able to work with a database.
- And so in CS 510 (Databases), you learn how to design data models and relational databases, how to query them, and how to put data in them.
- You're also going to see in other classes information about managing data that is particular to different applications and types of data.
- We talked about some mathematical fundamentals, probability theory.
- We talked a little bit about some linear algebra and we've used bits and pieces of it throughout the class.
- You can learn a lot more about these by taking Math 562 and the linear algebra class as well.
- We've talked about statistical inference. We did some basic parametric pairwise comparisons with t-tests.
- We learned how to bootstrap confidence intervals and p-values.
- We learned some sampling theory that underlies a lot of statistical inference and underlies the bootstrap.
- And we talked about doing inference with regression models.
- You'll learn a lot more about this in the probability and statistics class. We've done some predictions.
- We've predicted continuous outcomes with regression models. We've predicted binary outcomes with classification models.
- You will learn a lot more about different models for doing these kinds of predictions in the machine learning class.
- And they're also going to come up throughout a number of your other data science classes,
- such as recommender systems, natural language processing, information retrieval and social media mining.
- We talked about evaluating your predictive model, whether it's a classifier or a regression-based predictor.
- We talked about how to do train/test splits and why we need those.
- We talked about metrics for assessing the effectiveness of your model.
- We talked about strategies for choosing your model's hyperparameters. Again, you're going to see a lot more of this in CS 534.
- We also introduced unsupervised learning.
- We've looked at two different unsupervised learning techniques. As opposed to the predictions we do in
- supervised learning, where we have a classification outcome or a label
- we're trying to predict with a classifier, or a value we're trying to predict with a regression model,
- here we don't have a supervision signal or a label that we're trying to predict. We're trying to allow the model
- to learn structure just from the input features.
- We've seen how to do this with lower-dimensional embedding through matrix decomposition, and we've seen how to do this with clustering.
- Again, you'll see more about this in machine learning and it'll appear in other classes.
- We talked about workflows. We talked about the data science pipeline.
- We talked about how to break code into scripts and modules, how to use Git, and some design patterns for your code workflows.
- You're going to use this throughout your data science work, both in your studies and in your applied work when you're done.
- I encourage you to take what you've learned here, refer back to the material in order to structure your assignments for other classes.
- So to wrap up, this class has been designed to lay a conceptual foundation for the rest of your data science studies.
- So as you take other classes, you have these concepts. You know what a classifier is.
- You know how to evaluate one. You know how predictive models and statistical inference fit in the broader space.
- You have a good working base to start to work with data, to be able to ask questions, and to think about
- how to answer them and present the answers. Other classes are going to build on these concepts and ideas.
- And there's a number that you can take in the computer science department,
- the math department and other departments as you complete either your master's degree or your Ph.D.
🚩 Makeup Exam
The makeup midterm is December 7 in class. It covers the entire semester, with an emphasis on material since the 2nd midterm.
The rules and format are the same as for Midterm A.
I encourage you to come and take the exam, as it is the single best way to prepare for the final. You can at any time leave the exam, and take it with you; I will only grade it if you turn it in, and there is no problem with showing up, looking at it, even filling it out, and deciding you like your current grades well enough to not turn it in.
Warning
If you turn in the makeup exam to be graded, its grade will replace the lower of your Midterm A and B grades, even if that lowers your final grade.
Only turn it in if you think you did better than your worst normal midterm!
🎥 Data and Concept Drift
This video introduces a fundamental assumption of predictive modeling and the way drift can affect it.
CONCEPT AND DATA DRIFT
Learning Outcomes
Know crucial assumptions of machine learning evaluation and deployment
Understand how models can degrade over time
Photo by guille pozzi on Unsplash
Fundamental Assumption
Deployment Assumption
Drifts and Shifts
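As a compact summary of the assumptions and the kinds of drift discussed in this video, they can be written in terms of probability distributions. The subscripts (train, test, live, old, new) are notation introduced here for illustration and do not come from the slides.

```latex
\begin{align*}
\text{Fundamental assumption:}\quad & P_{\text{train}}(X, Y) = P_{\text{test}}(X, Y) \\
\text{Deployment assumption:}\quad & P_{\text{test}}(X, Y) = P_{\text{live}}(X, Y) \\
\text{Class prevalence changes:}\quad & P_{\text{old}}(Y) \ne P_{\text{new}}(Y) \\
\text{Feature distribution changes:}\quad & P_{\text{old}}(X) \ne P_{\text{new}}(X) \\
\text{Relationship changes:}\quad & P_{\text{old}}(Y \mid X) \ne P_{\text{new}}(Y \mid X)
\end{align*}
```

Here X is the set of features (covariates) and Y is the outcome or label; the last line is the case where how you predict Y from X has itself changed.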
Offline Solution: Temporal Splitting
Random train-test split ensures train/test comparable
Assumes test is uniformly drawn from same distribution!
Alternative: temporal split
Select temporally-contiguous test data (e.g. 1 month)
Train on data before test data (no time travel!)
Benefit: simulates actual use
Drawback: temporal data no longer random, inference harder
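A minimal sketch of a temporal split with Pandas, assuming a data frame with a datetime column. The `sales` frame, its column names, and the choice of November as the test month and October as tuning data are all made up for illustration.

```python
import pandas as pd

# Hypothetical daily sales data (column names are assumptions for illustration).
sales = pd.DataFrame({
    "date": pd.date_range("2021-01-01", "2021-11-30", freq="D"),
})
sales["revenue"] = range(len(sales))

# Temporally contiguous test period: November.
test_start = pd.Timestamp("2021-11-01")
# Optional tuning period: the month just before the test data.
tune_start = pd.Timestamp("2021-10-01")

test = sales[sales["date"] >= test_start]
tune = sales[(sales["date"] >= tune_start) & (sales["date"] < test_start)]
# Train only on data from before the tuning and test periods (no time travel).
train = sales[sales["date"] < tune_start]

print(len(train), len(tune), len(test))
```

Because every training row is earlier than every tuning row, and every tuning row is earlier than every test row, the model never sees the future during training.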
Online Solution: Continuous Monitoring
Instrument your system in production
Watch key metrics over time
Click-through rate
Classification rate
Regularly re-train and re-evaluate
Train model on new data
Evaluate model on new data
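One way to watch a key metric over time is to log each decision with a timestamp and track the daily classification rate. This is only a sketch: the log data, the baseline window, and the 0.1 alert threshold are invented for illustration, and a real system would choose them for its own setting.

```python
import pandas as pd

# Hypothetical log: 4 messages per day for 10 days; 1 means "flagged as spam".
flags = [1, 0, 1, 0] * 5 + [1, 0, 0, 0] * 5
log = pd.DataFrame({
    "timestamp": pd.date_range("2021-03-01", periods=len(flags),
                               freq=pd.Timedelta(hours=6)),
    "flagged": flags,
})

# Fraction of messages flagged on each day.
daily_rate = log.resample("D", on="timestamp")["flagged"].mean()

# Use the first five days as a baseline (an arbitrary choice for this sketch).
baseline = daily_rate.iloc[:5].mean()

# Flag days whose rate moves more than 0.1 away from the baseline.
threshold = 0.1
alerts = daily_rate[(daily_rate - baseline).abs() > threshold]
print(alerts)
```

A day appearing in `alerts` does not prove the model has degraded; it means something changed and is worth investigating.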
Wrapping Up
ML training and evaluation assumes that the training and test data match real life.
You can’t always rely on that.
Photo by Jackson Douglas on Unsplash
- In this video I want to introduce two concepts you need to be on the lookout for when you're building models, particularly in production:
- the drift of your data and your concepts. The learning outcomes are for you to know crucial assumptions of machine learning
- evaluation and deployment, and to understand how models can degrade over time.
- So a fundamental assumption that machine learning evaluation makes, or that machine learning itself makes,
- is that the test data
- looks like the training data. Specifically, the test and training data are drawn from the same distribution.
- And one of the really crucial parts of this is that the conditional probability,
- the probability of Y given a particular set of covariates or features X, is the same
- between the two. Otherwise you couldn't do machine learning, because the idea is to learn that conditional probability.
- Remember that we're trying to estimate a conditional expectation or a conditional probability.
- If it's not the same in the training data and the test data, then you can't learn from
- the training data to predict the test data. But also, if the probability of Y is not consistent, then we're going to be looking at different data:
- the class weights may be wrong.
- If the distribution of the covariates is different, effectively we're learning from data that's distributed differently.
- It might work, but it very well may not. So we have this fundamental assumption:
- if you're trying to learn from training data to predict test data, you're assuming the test data looks like the training data.
- There's then another assumption when the purpose of testing
- your model is to determine if the model is going to work well when you actually deploy it.
- So if you're trying to test a spam filter's accuracy, your goal is not to build a spam filter that works well on a test dataset.
- The goal is to build a spam filter that's going to correctly classify spam as it comes into your system.
- And so there we're making the crucial assumption that the data it's actually going to see, the live data,
- is drawn from the same distribution as our test data. Because if it's not, then what we've tested it on isn't what it's actually going to run on.
- And so we haven't actually evaluated the model for its actual task
- if the data it's going to see comes from a different distribution than the data that we're testing on.
- So there are a few different ways that this can drift. The class prevalence can change:
- it might just be that spam becomes more likely. The feature distributions can change:
- emails or messages start using more of a particular kind of text, or the geographic
- distribution of the senders changes, or other features that you're using change. But the relationship can also change:
- the way people write spam may change, and so how you predict spam from the features changes.
- One particular example: December purchasing is different.
- Take a model that is trained on January through November.
- You try to predict December, and December looks different because of the holiday season.
- Also, say you've got a model that was working great. You had seasonal effects, you had cyclical effects,
- and you were doing really well at forecasting your sales revenues for four years.
- COVID has so fundamentally changed how we go about purchasing, both in person and online,
- that you have a massive distribution shift and your model is no longer going to be valid.
- So how do you deal with this? I'm going to talk briefly about two ways to start trying to deal with it.
- The first thing, step zero, is to be aware of the assumptions that
- you're making and the potential problems of a violation of those assumptions,
- so that you know to go look for them. But then offline, one thing we can do, if our data have timestamps, is split it temporally,
- because in production, your goal isn't to predict a randomly selected set of data points from the other data points.
- Your goal is to predict the next data point from historical data points.
- So what you can do is simulate that. Say you're trying to predict sales.
- You can set November as your test data and use everything before November as training data.
- You don't want to use December, because you don't want your model to look into the future;
- that violates the fidelity of the simulation. This is the basic idea:
- you train on the data before the test data and you test on that data.
- If you need to do hyperparameter tuning, you can use the month before as tuning data or, because of seasonal effects,
- you may want to use the same month the year before for your hyperparameter tuning. But this is the idea.
- With this temporal split, the upside is that you get increased fidelity to what you're actually trying to solve:
- can you predict the next data
- given the historical data? The drawback is that we're not randomly sampling the test data, and so we have to be a lot more careful
- about the statistical inference on the results. But the trade-off is often worth it for models that are
- going to be deployed in a temporal setting where we expect data distributions to change over time.
- Then when you go online, you need to continuously monitor what your system is doing.
- If you have the ability to measure its online accuracy (say it's making predictions of whether incoming messages on your platform are spam,
- and you can test whether that was right or not), watch that.
- But even just watch other metrics, like the frequency with which it's classifying something as spam.
- If that jumps or if that drops, that doesn't necessarily mean your model performance is different, but it means something's changed.
- So if the spam classifier usually flags 50 percent of your messages as spam,
- and suddenly it starts flagging only 40 percent,
- that's a signal that you need to look into why,
- to see if you've got a dataset shift that's causing your model to no longer perform properly.
- Also, you need to regularly retrain and reevaluate your model.
- You can't just say, "Oh, I've got a model. It's got this accuracy. Good. Let's put it in production.
- Let's run it for two years." You need to retrain the model with new data.
- And then also, you need to periodically reevaluate your model to see whether it is still giving the accuracy you expect.
- If we've been running it for a year, collecting more data, and we test it on some new data, is it still giving the accuracy that we expect?
- So to wrap up: machine learning training and evaluation assume that the training data and test data match each other and match real life.
- You can't always rely on that. You need to pay attention to shifts in your data and in its distributions
- that either cause the model not to be able to learn from training data to predict test data,
- or that cause the test data to no longer be an accurate assessment of what's going to happen when you try to use your model for real use.
🎥 Time Series Operations
Time is an important kind of data that we haven’t spent much time with — this video discusses the fundamental Pandas operations for working with time-series data.
TIME SERIES
Learning Outcomes
Summarize and plot time series data with Pandas
Understand that time series data is often not independent
Photo by Bruno Figueiredo on Unsplash
Time Series
Time Representation
Many representations:
String dates
Years & months
Timestamps (seconds, ms, etc.)
Pandas: must convert to a datetime
pd.to_datetime
Pandas Setup Steps
Create/convert datetime column with instance times
Index data frame by timestamp
Sort index
ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')
rts = ratings.set_index('timestamp').sort_index()
Time series exception to “prefer unique indexes” guideline
Operation: Resampling
Like groupby, but on time intervals
Compute aggregate functions for time intervals
‘on’ option allows a non-index column
monthly_ratings = rts.resample('1M')['rating'].count()
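To illustrate the `on` option, here is a small self-contained sketch that resamples a frame that has not been indexed by its timestamp. The tiny `ratings` frame is made up for illustration. It uses the `'MS'` (month start) frequency alias, since recent Pandas versions deprecate the month-end alias `'M'` in favor of `'ME'`.

```python
import pandas as pd

# Hypothetical ratings data; the values are invented for illustration.
ratings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2013-10-03", "2013-10-20",
        "2013-11-05", "2013-11-06", "2013-11-30",
        "2014-01-02",
    ]),
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

# Group by calendar month using the 'timestamp' column instead of the index,
# then count the ratings in each month, as with any groupby.
monthly = ratings.resample("MS", on="timestamp")["rating"].count()
print(monthly)
```

Note that December 2013 has no ratings but still appears in the result with a count of 0: resampling produces every interval between the first and last timestamp, not just the ones that have data.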
Operation: Plotting
Matplotlib & Seaborn render timestamp X axes well
sns.lineplot(data=monthly_ratings)
Operation: Range Select
Time series indexes support range operations
Select by range: rts.loc['2010-01-01':'2010-12-31']
Includes end point (unlike normal slice!)
Select by period: rts.loc['2010-07']
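A small sketch of both kinds of selection, using a made-up frame with a sorted datetime index (the timestamps and values are assumptions chosen to show the boundary behavior):

```python
import pandas as pd

# Hypothetical ratings indexed by timestamp, including points just outside 2010.
idx = pd.to_datetime([
    "2009-12-31 23:00",
    "2010-01-01 00:00",
    "2010-07-15 12:00",
    "2010-12-31 18:30",
    "2011-01-01 00:00",
])
rts = pd.DataFrame({"rating": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx).sort_index()

# Range select: the end point is included, covering all of December 31.
year_2010 = rts.loc["2010-01-01":"2010-12-31"]

# Period select: a partial date string selects everything in that month.
july = rts.loc["2010-07"]

print(year_2010)
print(july)
```

The rating at 18:30 on December 31 is part of `year_2010`, while the one at midnight on January 1, 2011 is not.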
Operation: Diff and Shift
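A minimal sketch of `diff` and `shift` on a short, made-up daily series. It also checks the two relationships described in the video: `diff` is the series minus its lag, and `cumsum` undoes `diff` once the first value is restored.

```python
import pandas as pd

# Hypothetical daily measurements.
x = pd.Series(
    [10, 12, 15, 11],
    index=pd.date_range("2021-01-01", periods=4, freq="D"),
)

d = x.diff()        # x_t - x_{t-1}; the first value is NaN
lag = x.shift(1)    # each point gets the previous value (lag)
lead = x.shift(-1)  # each point gets the next value (lead)

# diff is the series minus its lag.
same = d.equals(x - lag)

# cumsum reverses diff if we put the first value back.
rebuilt = d.fillna(x.iloc[0]).cumsum()

print(pd.DataFrame({"x": x, "diff": d, "lag": lag, "lead": lead}))
```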
Time Effects
Autocorrelation
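As a rough illustration of autocorrelation, the sketch below compares independent noise with a cumulative sum of that noise, where each value depends on the previous one. The simulated data is an assumption for illustration, not course data; `Series.autocorr` computes the correlation between a series and a lagged copy of itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Independent observations: each value is unrelated to the one before it.
noise = pd.Series(rng.normal(size=500))

# A series where each value is the previous value plus a new random step.
walk = noise.cumsum()

# Lag-1 autocorrelation: correlation between x_t and x_{t-1}.
noise_ac = noise.autocorr(lag=1)
walk_ac = walk.autocorr(lag=1)

print(f"noise: {noise_ac:.3f}, walk: {walk_ac:.3f}")
```

The independent noise should have a lag-1 autocorrelation near zero, while the cumulative series should be strongly correlated with its own previous value, which is the kind of non-independence that statistical models of time series data need to account for.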
Wrapping Up
Repeated data over time requires particular handling and has distinct operations.
Pandas provides these through datetime columns and indexes.
Photo by Javier Esteban on Unsplash
- In this video, I want to introduce the concept of a time series and some basic time series operations in Pandas.
- There's also a notebook in the resources that gives you a demonstration of the Pandas code that actually works with time series.
- Learning outcomes are for you to be able to summarize and plot time series data
- with Pandas and understand that time series data is often not independent.
- So a time series is a sequence of observations over time.
- These observations may be periodically sampled, like having one measurement for each day.
- Or they may not be. It might be that you record a sequence of events, like every time someone uses the card scanner at a door.
- Typically, the observations are the same kind of thing, just as the instances in a normal data frame are the same kind of thing.
- The observations we're looking at in a time series are homogeneous,
- like the price of a stock, the revenue of a company, or some kind of user activity. In mathematical notation,
- we typically refer to a time with the letter t. We have the current time as t, and then t-1 and t+1.
- We often speak of time steps, so we might measure time in years, months, days, seconds, or whatever is appropriate to our particular application.
- And we can talk about the value at time t, x_t.
- We can talk about x_{t-1}, etc., for the value at the previous time step.
- This particularly applies when we have periodically sampled time steps.
- So to show you an example of a time series, this is from the MovieLens data that we've been working with.
- It is, by month, the number of ratings in each month.
- So our time steps are months, and our value
- is the number of ratings for movies in month t.
- We can see those values. There were some spikes of activity here.
- We've got a few one-month spikes. We had a significant jump there.
- And there's been declining month-over-month activity as we go into 2018 and 2019.
- As we talked about earlier, there are many ways we can represent time: we can have string dates, years and months,
- or timestamps. To work with a time series in Pandas, though,
- we have to convert it to a datetime.
- We can't use Pandas time series operations without first converting our time into a datetime from whatever format it's in.
- So the setup steps to be able to start working with time series data in Pandas are to first create or convert a datetime column.
- It might be that you need to put it together from multiple columns.
- It might just be that you need to convert an existing column so that you have a timestamp
- column of type datetime for each of the instances in your data frame.
- Then typically you're going to want to index the data frame by timestamp and then sort it by index using the sort_index method.
- You can use a lot of datetime operations without setting an index,
- but setting an index is often the most convenient way to work with time
- series data in Pandas. Time series data is an exception to our general guideline to prefer unique indexes, which exists because Pandas
- can have memory and performance problems with non-unique indexes. You might have multiple events that happened at the same time.
- The utility of a time series index for looking things up by time and by date overrides
- the general concern to avoid indexes
- with duplicate keys. So, a few operations that we can perform on time series data. One is resampling, and resampling works like groupby,
- but it works on time intervals. In this code here, what we're doing is resampling by an interval of one month.
- We could resample by day, or we could resample by week.
- But this is telling it that we want to group the data by one-month periods. The data is indexed by a timestamp,
- but our groups are going to be October of 2013,
- November of 2013, and so on. Then within that, we're going to do exactly the same thing we do with any groupby:
- we're going to pick a column and count it. This is going to count the number of ratings in each month.
- So the results of a resample work exactly like the results of a groupby;
- it's just that, rather than being grouped by the distinct values of a column,
- they're grouped by a time period of the index. You can also resample by a column:
- there's an `on` option for resample that allows you to specify the column that you want
- to do the resampling on if you have not indexed your data frame by its timestamp column.
- So this is our first operation: we can resample our data.
- This works well for data that is events.
- It can also work well for data that is already periodically sampled:
- if you've got daily measurements of something and you want to take it by month, you can resample that by month.
- You can also upsample and have it fill in missing values for time periods that you're missing,
- if you want to increase the resolution of the data.
- Another thing we can do is plot. Matplotlib and Seaborn both render timestamps on their x axes relatively well.
- So if I just call sns.lineplot and give it a series, it's going to mark the x axis with an appropriate
- time-derived value. In this case, because my data starts in 1996 and runs until relatively recently,
- it's marking the x axis by year.
- So they know what to do with time data and with timestamp columns,
- and they will render the x axis with appropriate labels for time series data.
- Another operation is range selection. Time series indexes support range operations, so we can select by range:
- you can say we want everything from January 1st through December 31st.
- Now, one key thing to note is that when you're doing a range selection on a time series index,
- unlike basically every other slicing context in Python, it includes the end point.
- So this query is going to include December 31st, and it's going to include all day on December 31st.
- It's not going to be the usual stop right before.
- We can also just pass in a partial datetime in order to select by a period.
- So if we want to select July 2010, we can use .loc
- and look up '2010-07'. It's going to give us the ratings that are in July.
- Another set of operations we can perform are diff and shift.
- The diff operation computes x_t - x_{t-1}.
- It's basically the opposite of cumsum.
- We've got our data points, and what it does is compute each one minus the one before it.
- The first one gets NaN. So your first value becomes NaN, then your second value becomes the second minus the first, et cetera.
- Shift is an operation that just shifts data points by one or more time steps.
- A shift of one, which is the default if you don't specify a parameter, is what we call a lag operator.
- It moves points down one so that each one has its previous value. So diff is just x minus x.shift(), or x minus the lag of x.
- A shift of minus one is a lead. Each one has the next value, if you want to compare each value with the one that's going to come after it.
- So this diff operation, and shift, which is its building block, allow you to compare data to the previous data point.
- This is particularly useful for data that is periodically sampled, whether by days or by seconds.
- It's not quite as useful for events, but you can convert timestamped event data into period data
- just by resampling. So there are a few different effects that we want to think about in time series data.
- One is a trend, which is a period-over-period change in values.
- So the price of something is going up, and it'll often be noisy around that,
- but the overall trend is up as you go through time.
- You can have linear trends. You can have exponential trends.
- You might hear the value R0 (R-naught) used sometimes in discussing disease transmission in epidemiology.
- That's a multiplier for an exponential trend.
- We can have seasonal and cyclic effects, which are periodic effects in the data. Seasonal ones are tied to the time of year.
- So if you're in a commerce setting, the holidays look really different than June.
- There can also be cyclical effects, like on a weekly cycle, that affect your data.
- People behave differently on weekends than during the week, etc.
- There are also shocks, which are events that impact the time series and change data going forward.
- These can have short-term effects or they can have continued effects. But a shock is like an outside event.
- If your time series is, say, the price of a stock, a shock would be an event that affects the stock price.
- One thing that's important to know about time series data is that it's often what we call autocorrelated, which means correlated with itself,
- because x_{t+1} and x_t are not independent.
- Today's weather is probably more correlated with tomorrow's weather than it is with the weather in three months.
- So if you have observations of a variable over time, especially one where you're trying to predict the future from today,
- it usually violates independence. And so we're going to need to take special care in order to model what's going on when we have such data.
- Even if we have other predictive variables,
- there's a non-independence from one step to another that we need to account for when we're doing statistical modeling.
- So to wrap up: repeated data over time requires particular handling and has distinct operations for working with it.
- Pandas provides datetime columns and datetime indexes to allow you to do various time series operations.
🎥 Publishing Projects
This video talks about going from an analysis and its notebooks to a publishable paper.
PUBLISHING PROJECTS
Learning Outcomes
Understand what is needed to go from analysis to publishable reports.
Outline a research paper.
Photo by Good Good Good on Unsplash
Publication Audiences
Data science document products can be for:
Collaborators
Decision-makers
Other organizations
Scientific community
Lay public
Formats
Written document (electronic or printed)
Presentation (live or recorded)
Interactive online demo/dashboard
Publication Goals
Reader needs to understand:
What you did
What you learned
What they should do / take away
(In general. Not everything is actionable.)
Typical Outline
Introduction
Background & Related Work
Methods
Results
Discussion
Conclusion
Variants:
Some communities put related work at the end
Institutional reports often lead with Executive Summary
1-2 page summary w/ key points
Discussion may be merged w/ Results or Conclusion
Methods may split
Rendering Plots
High-quality images
When practical: vector images (PDF good for LaTeX)
Otherwise: high-resolution images (at least 300dpi, 600 better)
Complex images can overwhelm PDF
Clean and clear labels and captions
Distinct colors, shapes, etc.
Experiment with dimensions & aspect ratio
E.g. 5”-wide image, scale down to 3.75” columns
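A minimal sketch of saving one figure in both a vector format and a high-resolution raster format with Matplotlib. The data, labels, and file names are placeholders for illustration, and the files are written to a temporary directory so the example does not touch your project.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

# 5-inch-wide figure; experiment with the height to find a good aspect ratio.
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot([1, 2, 3, 4], [2, 3, 5, 4], marker="o", label="Example series")
ax.set_xlabel("Time step")  # clear axis labels (placeholder text)
ax.set_ylabel("Value")
ax.legend()
fig.tight_layout()

out_dir = tempfile.mkdtemp()
pdf_path = os.path.join(out_dir, "figure.pdf")
png_path = os.path.join(out_dir, "figure.png")

fig.savefig(pdf_path)           # vector output, good for LaTeX
fig.savefig(png_path, dpi=300)  # raster output at 300 dpi
plt.close(fig)
```

The vector PDF scales cleanly to any column width; the PNG is a fallback for complex images or tools that cannot embed PDF.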
Step Outside Yourself
Forget you did the work and wrote the document.
Can you understand it? Could you reproduce it? Is info missing?
Applies everywhere:
Reports
Papers
README
Wrapping Up
Internal and external publications require special attention to writing and visual presentation.
Photo by Kari Shea on Unsplash
- We've talked a lot about how to do the analysis part of data science.
- I now want to take a little bit of time to talk about what you need to do in order to take that and actually put it into a publication.
- So, for the learning outcomes:
- I want you to understand what's needed to go from analysis to a published report and also to be able to outline a research paper.
- There are a lot of different audiences that a document coming out of a data science project can be for. It might be an internal document
- just for your collaborators, to update them on some analysis you're doing as part of a project.
- It might be a report that's intended to be used by decision-makers in order to make a data-informed decision.
- It might be a report for people elsewhere in your organization, or for other organizations.
- It might be a scientific publication for the scientific community, or it might be a document intended to educate the lay public.
- So there are a variety of formats that these can come in. One is that it could be a written document that's either electronic or printed.
- It can be a presentation that you're giving live or that you're recording as a prerecorded video.
- It could be an interactive online demo or dashboard. These happen for product dashboards and internal monitoring all the time.
- They also happen sometimes in journalism, where
- a news organization will make an interactive data visualization to allow people to explore the data that underlies some of their reporting.
- The goal of a publication, in any of these forms and for any of these audiences, is that the reader needs to be able to understand what you did,
- what you learned, and then what they should do or take away from it.
- This is in general; not every analysis is going to have actionable, "here's what you need to do"
- kinds of insights, but your readers need to understand what the takeaways are and why they should believe them.
- That's part of the point of communicating your methods: to show
- your work, showing the evidence behind the conclusions that you're drawing.
- So the typical outline for a lot of publications, particularly scientific publications
- (but a lot of other publications are going to have many of these same outline elements),
- is that you first have an introduction that sets the stage for why you're doing this.
- Oftentimes an introduction will also
- foreshadow the conclusions. The report of a data analysis, particularly a scientific publication, is not a mystery novel.
- You should not keep your reader guessing until the end about what the conclusion is. You can just say it in the introduction:
- we're trying to solve this problem, and here's the bullet-point summary of what we found.
- And then you get into the details. Next you've got background and related work. Background and related work are not the same thing,
- but they're often put together in one section. Background is background information that people need to understand what you did.
- Background on the problem. Background on the methods that you're using to approach it.
- You know, there is common knowledge that you can assume. And what that common knowledge is varies from audience to audience.
- But the background is the material that they need,
- on top of the commonly assumed knowledge, in order to understand what it is that you did.
- Related work is other work on similar or related problems. It might be working on the same problem with different scope, different methods, etc.
- It might be using the same methods for a different problem,
- but it fills in the other knowledge that we have about the problem space that you're trying to study,
- so people have context for how your analysis is filling in a gap in our current knowledge.
- So the background is the prerequisite knowledge and the related work is the adjacent knowledge.
- And between them, your readers will be able to understand what you did and understand how what you did contributes to knowledge.
- The methods are where you explain what you did. What's the data? What statistical methods did you use?
- What machine learning methods did you use? As I say on the variants side, methods may sometimes be split.
- So if you've got something where maybe you have a new algorithm and then you're doing experiments on it,
- those would usually be separate methods sections. You've got your "here's my new algorithm,"
- and then maybe in an experimental results section, you'll have methods and then results.
- Then you have your results. So the methods are: here's the experiment I'm going to run, the data I'm going to use, how I'm setting up the experiment.
- Results are: OK, so I did that. What did we learn?
- It's where you have your key figures and charts, et cetera. You're presenting the results of running the experiment.
- Discussion, then, is where you talk about what the results mean.
- So you have the individual results: a chart that shows the accuracy of a model,
- a chart that shows what's happening in some data over time.
- Discussion is where you pull it together and you connect the results that you have
- back to the original problem context that you put forward in your introduction.
- And then finally, in the conclusion, you summarize the takeaways of the paper and you often point to what some future work may be. There are some variants on this.
- Some research communities put related work at the end, right before the conclusion.
- Here's everything I did; by the way, here's what other people did; now we conclude.
- If you're doing something that's not a scientific publication, but an institutional report,
- a lot of those lead with an executive summary. That's one to two pages with the key points and takeaways.
- It does not get into all of the methodological details, but it summarizes:
- OK, here's the key. Here's what you need to know, if you trust me that I did all the details
- right and did the analysis right. Here's what you need to know, and here's the recommended course of action.
- And then the rest of the report backs that up.
- So if they have a question about how you know the things you're telling them,
- or they need to know where the recommendations are coming from, they can go read the rest of the report.
- In some writing you might merge discussion with results: you talk about the results and about what they mean together.
- You might merge discussion with conclusion, and have a longer conclusion with the discussion integrated.
- But these are the key pieces usually in this order for the typical kinds of research papers,
- but also other kinds of documents that come out of data science projects.
- Another point you need to pay attention to when you're going to publication is rendering your plots,
- especially if you're going to a publication that might be printed, or anything
- where you're delivering a PDF. The key thing is that you want high-quality images.
- When practical, vector images are good. Python plotting software can export an image to PDF, and LaTeX can pull in the PDF.
- Then it's not rendered to pixels; it actually includes all of the paths, the circles, the letters.
- And so you can zoom in on that PDF as far as you want and you never hit pixelation.
- If you have a very complex image, it can overwhelm the PDF:
- you wind up with a PDF that just takes forever to render because you've got an image that's trying to render five million data points.
- So in those cases, or if you're working with software where you don't have a good vector image path,
- use high-resolution images in PNG. You don't want to use JPEG; never use JPEG for a figure, because
- JPEG loses information in the compression. Use PNG or TIFF, at a resolution of at least 300 DPI.
- Six hundred is better. Your typical printer is around 300 DPI,
- and so if your image is 300 DPI, most people will be able to print it
- well. If it's 600 DPI, basically anyone can print it
- well, and it will also look very good on high-resolution screens. You want to make sure you have clear labels and captions, distinct colors, shapes, et cetera,
- so the image is clear, readable, and self-contained.
- Oftentimes I'll draw seven different versions of an image, or a whole bunch of different figures,
- but then I'll combine them together, and I'll have the one clean picture that has all the pieces, all the bells and whistles.
- Everything's very, very.
- I pay particular attention to the precise wording and consistency of the labels and everything that's gonna go into the final publication.
- Then you are also going to want to experiment with dimensions and aspect ratio.
- So if I've got a two-column paper, so that my columns are about 3.75 or so inches wide,
- I might have a five-inch-wide image, and then I'll scale it down a little bit to fit in that column, and it'll look really good in the final paper.
- Another thing, though, that you need to be able to do to check the writing and the presentation is to step outside yourself.
- Forget that you did all of the work and wrote the document.
- Forget, or set aside, all of the knowledge you have as the person who did the work, read what you wrote, and ask:
- Do I understand this? If I was not the one who created this,
- can I understand it? Could I reproduce it? Is there information that's missing?
- Other readers help you with this, and you should get input from other people who aren't you.
- But as a first pass, in order to make the best use of their time that you can,
- you need to be able to set aside your knowledge as the creator and evaluate whether what you wrote is clear and complete
- and in a coherent order. And this applies to any kind of writing you're doing:
- both presenting the results in a report or a paper, but also other writing like documentation.
- When writing a README about how to run your experiment,
- you need to be able to step back and ask: if I followed the steps of these instructions, one after another,
- would I get the result? Or are there steps that are missing?
- Are there pieces of knowledge that are missing,
- so that if I just read the instructions in this README file, I would not have enough information to complete the expected activity?
- So to wrap up: internal and external publications require special attention to writing and visual presentation.
- Hopefully this gives you a little bit of a start for learning how to do that.
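The plot-rendering advice in this video can be sketched with Matplotlib as follows. This is only an illustration under stated assumptions: the sine and cosine data, the axis labels, the file names, and the temporary output directory are all made up for the example, and other plotting libraries have their own export options.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical output location; a real project would use its own figure directory.
out_dir = Path(tempfile.mkdtemp())

x = np.linspace(0, 10, 200)
fig, ax = plt.subplots(figsize=(5, 3))  # 5 inches wide; scale down in the paper
# Distinct colors and line styles keep the series readable even in grayscale.
ax.plot(x, np.sin(x), color="tab:blue", linestyle="-", label="sin(x)")
ax.plot(x, np.cos(x), color="tab:orange", linestyle="--", label="cos(x)")
ax.set_xlabel("Time (s)")
ax.set_ylabel("Amplitude")
ax.legend()
fig.tight_layout()

pdf_path = out_dir / "figure.pdf"
png_path = out_dir / "figure.png"
fig.savefig(pdf_path)           # vector output: paths and text, no pixels
fig.savefig(png_path, dpi=600)  # raster fallback for very complex plots
plt.close(fig)
```

The PDF version is the one to include from LaTeX when practical; the 600 DPI PNG is the fallback when a plot has too many points to render comfortably as vector graphics.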
Resources
PlotNine is a good plotting library for preparing consistent, publication-ready graphics.
The book gender example also demonstrates the current evolution of my own practices for preparing for publication.
🎥 Production Applications
How do you put the results of your data science project into product?
PRODUCTION APPLICATIONS
Learning Outcomes
Understand how data science outcomes can be used.
Think about how to put models and outcomes into production.
Photo by C D-X on Unsplash
Using Data Science
Data-driven reports to inform decisions
Regular forecasts for internal purposes
Data science outputs for real-time decisions
Internal
Customer-facing
Reproducibility
Reproduction is crucial:
Regular reports / predictions re-run
Daily / weekly / monthly reports
Retraining models for online use
Online Use
Many modalities: web, mobile app, desktop app, server infra
Mobile & desktop often use web tech to connect to models
HTTP (often REST) API
Some low-latency exceptions
Multiple audiences
Internal reporting
Internal decision-making
Customer decision-making
Service-Oriented Architecture
Client
Web Server
Services
Databases
Deployment
Predictions made available via a web service
E.g. TensorFlow Serving
Model trained offline on other hardware
Model-training script saves trained model to disk
Web service loads trained model & serves predictions
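The deployment steps listed above can be sketched as follows: an offline training function writes a fitted model to disk, and a separate service object loads it and answers prediction requests. This is a minimal sketch, not the course's implementation. The synthetic data, the scikit-learn `LogisticRegression` model, the use of `joblib` for saving, and the names `train_and_save` and `PredictionService` are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical shared location that both the trainer and the service can read.
MODEL_PATH = Path(tempfile.mkdtemp()) / "model.joblib"


def train_and_save(path):
    """Offline training script: fit a model and save it to disk."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic labels
    model = LogisticRegression().fit(X, y)
    joblib.dump(model, path)


class PredictionService:
    """Service side: load the saved model, then answer requests."""

    def __init__(self, path):
        self.path = path
        self.reload()

    def reload(self):
        # Called at startup, and again whenever a newly trained model is saved.
        self.model = joblib.load(self.path)

    def predict(self, features):
        return int(self.model.predict(np.asarray([features]))[0])


train_and_save(MODEL_PATH)  # runs on the training hardware
service = PredictionService(MODEL_PATH)  # runs in the online system
print(service.predict([2.0, 2.0]))
```

Keeping training and serving in separate pieces of code means the online system never spends its own compute on training; it only reloads the file when a new model is ready.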
Useful Infrastructure Capabilities
Train models on live or freshly-exported data
Hold out test data to evaluate new model before deployment
Version your models (including retrain w/ same hyperparams!)
Roll back to old model version
Details depend on institution & infrastructure.
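One way to picture the versioning and rollback capabilities above is a small file-based registry. Everything here is hypothetical: the `ModelRegistry` class, the `CURRENT` pointer file, the pickle format, and the dictionaries standing in for trained models are assumptions for the sketch, and `test_score` is assumed to come from evaluating the new model on held-out test data.

```python
import pickle
import tempfile
from pathlib import Path


class ModelRegistry:
    """Hypothetical registry: one pickle file per version plus a pointer file."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.pointer = self.root / "CURRENT"

    def versions(self):
        return sorted(int(p.stem[1:]) for p in self.root.glob("v*.pkl"))

    def current_version(self):
        return int(self.pointer.read_text()) if self.pointer.exists() else None

    def publish(self, model, test_score, min_score):
        """Deploy a new version only if it passes the held-out test."""
        if test_score < min_score:
            raise ValueError(f"test score {test_score:.3f} is below {min_score}")
        version = max(self.versions(), default=0) + 1
        with open(self.root / f"v{version}.pkl", "wb") as f:
            pickle.dump(model, f)
        self.pointer.write_text(str(version))
        return version

    def rollback(self):
        """Point CURRENT at the most recent earlier version."""
        older = [v for v in self.versions() if v < self.current_version()]
        if not older:
            raise RuntimeError("no earlier version to roll back to")
        self.pointer.write_text(str(older[-1]))
        return older[-1]

    def load_current(self):
        with open(self.root / f"v{self.current_version()}.pkl", "rb") as f:
            return pickle.load(f)


registry = ModelRegistry(tempfile.mkdtemp())
registry.publish({"threshold": 0.5}, test_score=0.91, min_score=0.85)  # v1
registry.publish({"threshold": 0.4}, test_score=0.93, min_score=0.85)  # v2
registry.rollback()  # v2 misbehaves in production, so go back to v1
print(registry.current_version(), registry.load_current())
```

Old versions stay on disk after a rollback, so the problematic model can still be loaded and investigated offline.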
Skills to Learn
Web back-end programming (to build services)
Web front-end programming (to build dashboards)
Performance measurement & tuning
Wrapping Up
Many data science projects result in online production capabilities.
Often done by training a model and deploying it in a web service.
Photo by Massimo Botturi on Unsplash
- In this video, I want to talk with you a little bit about some of the things you need to consider, and some of the
- design patterns, in order to be able to take your data science results and put them in production.
- So the learning outcomes are for you to understand how data science outcomes can be used in
- business settings and think about what you need to do to put models and outcomes into production.
- So there are a variety of purposes data science can serve in a business.
- We can use data-driven reports to inform decisions. We can do forecasting for internal purposes, to inform internal decision-making and
- internal planning.
- We can also have data science outputs for real-time decisions, either internal, for making internal business decisions, or customer-facing:
- part of your e-commerce platform, providing recommendations, doing fraud detection,
- various things in order to make your customer product, your customer experience, work smoothly.
- So one of the first things is that reproducibility is crucial.
- If you're running regular forecasts of future business activity or future demand
- so you can do inventory planning, you need to be able to rerun those reports.
- You need to have a reproducible pipeline so you can rerun this week's report quickly and easily without having to do a bunch of manual labor.
- Also, if you have something that's online, making online decisions, you need to be able to retrain it as new data comes in.
- So for online use, we're building an app or a data science product that's going to be making decisions
- in an online fashion as new users come in or as new decision requirements come in.
- There's a variety of modalities for delivering it. It might be a web app, a mobile app, a desktop app.
- It might be something that just lives in server-side infrastructure, like the spam filter that's built into email infrastructure, etc.
- In terms of the technology structures, there are some exceptions, but mobile and desktop apps often use web technology to connect to models.
- So even if you're targeting mobile apps, even if you're targeting desktop apps,
- learning to build web-based services for data science outputs is going to be a really useful skill.
- Multiple audiences can use these. You might have an internal reporting dashboard that talks about your customer volume, or about the
- throughput on your assembly line or other aspects of the functioning and health of your factory setting.
- You can use it for internal decision-making, and then also either to help your customers make
- decisions or as you're making decisions about your customers in an online interactive setting.
- So one architecture that's common for these kinds of applications is what's called a service-oriented architecture.
- And what a service-oriented architecture means is that your infrastructure is split into different services, different individual pieces.
- So you've got a web server, and your customers or your users come in with a computer
- with a web browser, or maybe with a mobile platform, and talk to a web server that serves up the application.
- The mobile platform might talk directly to the back end.
- Then the web server talks to various other services in order to fill out
- the user experience.
- Those services then go and get data from various databases in your back end in order to serve up the responses to the requests.
- So a lot of organizations use this.
- Amazon uses service-oriented architectures extensively.
- So when you go to a page on Amazon.com, there's one service that's providing product details, another service
- providing "people also bought" recommendations, and another service handling the shopping cart.
- And so each of these services is working independently, and the web server is putting them together into a composite experience.
- So from a data science perspective, a lot of what's happening is that you need two pieces. You need a UI,
- in the web server, or the UI might come from the service itself,
- but you need a user interface component. And then you need the service itself, which is going to serve up the data science.
- Usually it's going to be some kind of a prediction.
- It's going to serve up the results of running the model that you've trained, the machine learning model that you've built and trained, on the new data,
- to answer that particular request. So a lot of your work is going to be building a service.
- And one way to design it,
- which goes well with the service-oriented architecture, is that each different model that you have can live in a different service.
- So to deploy, the predictions are made available via a web service that your web server can call.
- You might use other things: something like ZeroMQ, or Thrift, or some other RPC
- protocol, or you might use an HTTP REST API for the web server to talk to these services.
- One example is TensorFlow Serving. If you're building your machine learning model with TensorFlow,
- there's a program called TensorFlow Serving that allows you to upload your saved model and service requests based on that model.
- So you train the model offline, on other hardware, so that your online system can just keep running.
- You're not using its CPU power to train the model;
- it's dedicated to serving up responses, serving up recommendations, making decisions about new incoming messages, etc.
- On other hardware, you train your model, and you save that trained model to disk somehow.
- Or you might actually upload the model directly, if you have a model server. But you save it somehow and you make it available to the service,
- which will then reload the model and start serving up predictions from the new model that you just trained.
- So a few useful capabilities for building out this kind of infrastructure are to be able to train
- models on live or freshly exported data. You need to be able to get your current data from the database,
- so you've got all the current customer transactions, and then you train your statistical model on them.
- As part of that process, it's useful to be able to hold out test data, so
- you can train the model and test it before you deploy it, to make sure you didn't accidentally train a model that performs badly.
- It can be useful to version your models, so that you have the ability, in particular, to roll back to an old model version.
- Maybe you train the model and test it, but you find out when you put it in production that suddenly your spam filter
- metrics change. You want to be able, as a stopgap, to roll back to the previously trained model,
- so you can then go figure out what went wrong with your model without leaving your customers with the new bad experience.
- The exact details of this depend a lot on your institution, your product, your infrastructure.
- But these are some of the things that you need to keep in mind. There are also skills that are useful to learn in order to build this:
- web back-end programming, to build the service; and web front-end programming, to build dashboards and the user interface,
- the pieces that are going to make your service visible and available to users. In some organizations,
- that is going to be handled by other people on your team or by a completely different team.
- But in many cases, it can be useful to learn.
- You at least need to be able to talk with those people. And then you also need to be able to do performance measurement and tuning, to understand and
- to be able to debug if your model's performing too slowly. So to wrap up: many data science projects
- result in online production capabilities. This is often done by training a model and deploying it as a web service or some other kind of network service.
- And it's useful to be able to learn at least some of the skills in order to do that,
- or at least be able to talk with the other folks in your organization that are handling that deployment and monitoring.
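To make the web-service idea concrete, here is a minimal sketch of a prediction service with an HTTP JSON API, using only the Python standard library. It is an illustration, not a production design: the `/predict` path, the JSON field names, and the hand-written linear `predict` function standing in for a trained model are all assumptions. A real deployment would more likely use a web framework or a dedicated model server.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a trained model: a hypothetical linear score."""
    weights = [0.5, -0.25]
    return sum(w * x for w, x in zip(weights, features))


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def call_service(port, features):
    """What the web server (the client of this service) would do."""
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/predict",
        data=json.dumps({"features": features}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
result = call_service(server.server_port, [2.0, 2.0])
server.shutdown()
server.server_close()
print(result)  # {'prediction': 0.5}
```

Because the model sits behind a plain HTTP interface, a web front end, a mobile app, or another internal service can all request predictions in the same way.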
🎥 Topics to Learn
This video goes over some useful topics to learn to fill out more of your data science education.
THINGS TO LEARN
Learning Outcomes
Know topics and software to study for expanding your data science skills.
Select the classes to take to complete your graduate degree.
Photo by Reinhart Julian on Unsplash
Machine Learning and Statistics
CS 534: Machine Learning
Math 562: Probability and Statistics II
Math 572: Computational Statistics
Math 573: Time Series Analysis
Working with Text
CS 536: Natural Language Processing
CS 537: Introduction to Information Retrieval
Read about NLP and IR
Application Areas
Social media: CS 539, read papers in ICWSM & CSCW
Information retrieval: CS 537 & 637
Recommendation & personalization: CS 538
Software Development
CS 573: Advanced Software Engineering
Study programming practices & software engineering
Practice, practice, practice
Think about your code’s readability & effectiveness
Advanced ML
CS 633: Deep Learning
Software: TensorFlow, PyTorch
“Modular differentiable programming”
Bayesian Inference
I’m a Bayesian. Mostly.
Book: Statistical Rethinking (McElreath)
Software: Stan or PyMC3
Causal Inference
Book: Counterfactuals and Causal Inference (Morgan & Winship)
ECON 522: Advanced Econometrics
Critical Perspectives
Book: Data Feminism (D’Ignazio & Klein)
Book: Data, Now Bigger And Better
Book: Introduction to Science and Technology Studies (Sismondo)
Keyword: Critical Data Studies
Wrapping Up
This class lays a foundation for you to integrate further knowledge.
Never stop learning new things.
Photo by Annie Spratt on Unsplash
- So you've learned a lot this semester, I hope. In this video I want to briefly overview
- some more things to go learn, and give you pointers to places where you might be able to go learn them.
- The learning outcomes are for you to know some topics and some software to study for expanding your data science skills
- (you do not need to study all of this to be a competent data scientist), and also to have
- some data points to help select the classes you want to take to complete your graduate degree.
- So to learn more about machine learning and statistics, we've just introduced the basics.
- As I said early in the semester, almost every week of this class could be an entire class on its own.
- So CS 534 is the machine learning class.
- You're going to learn a lot more about different machine learning models, how to build them, how to optimize them, how to evaluate them.
- We've learned the concepts: what they do, what they are, how a couple of them work.
- 534 is going to teach you a lot more about the insides of machine learning models. Math 562 is Probability and Statistics II.
- If you want to learn a lot more about statistical inference, that class will teach you.
- Math 572 is going to teach you a lot more about the computational side of statistics, particularly a lot of simulation:
- Monte Carlo simulations that are useful for the kinds of things you do with simulation, the kinds of things you're doing with bootstrapping.
- That class is going to go into a lot more detail. On time series analysis,
- if you want to learn a lot more about that, we have an entire class on it.
- If you want to learn more about working with text, two key topics to be looking at are natural language processing and information retrieval.
- The computer science department has classes on both of those.
- For working on specific application areas: if you want to do work on social media, you can take CS 539, the social media mining class.
- If you want to do work on information retrieval, we have two classes on it:
- CS 537, which covers how information retrieval works, and CS 637, which is an advanced research class on information retrieval.
- If you want to learn about recommendation and personalization, you can take CS 538,
- which discusses how to build and evaluate recommender systems to recommend products to people.
- Software engineering skills are really useful for being able to take your data science and connect it into product.
- So to that end, there's CS 573, the advanced software engineering class. It's useful for learning the process of software engineering.
- It's also worth studying programming practices and software engineering in general on your own, and practicing:
- practice, practice, practice. Write code, do modeling, build web applications or other applications around the data science analysis that you're doing,
- but also then reflect and think about things as you write code.
- Think: is this readable? What can I learn? What do I know? What could I do to try to make this more readable, more efficient?
- What do I think I need to go learn in order to expand my ability to do that? Next, advanced machine learning.
- So we have a class on deep learning, CS 633. Also, learn the software:
- TensorFlow and PyTorch are each useful pieces of software to learn to go beyond what you can do in scikit-learn.
- They are often associated with deep learning, and they're used extensively for deep learning,
- but they're not just deep learning packages. What they really are is very effective optimization engines that are useful for
- modular differentiable programming: you set up a differentiable model and they optimize it.
- A lot of those models are deep neural nets, but there are a lot of other kinds of models.
- I've used TensorFlow, for example, to build advanced matrix factorization.
- There's nothing neural going on.
- It's just a matrix decomposition, but it's taking advantage of TensorFlow's optimizers to build a recommendation engine around that.
- And so either TensorFlow or PyTorch is very useful software as you're trying to build your own machine learning models.
- And it's a lot easier to use one of those pieces of software than to build your own optimizer from scratch.
- I personally do a lot of work with Bayesian inference. That's a useful statistical paradigm and philosophy to learn.
- Unfortunately, we don't really have a class here that teaches it.
- But the book Statistical Rethinking by Richard McElreath is a good book for learning that. Also, for software, Stan is really good for doing this.
- It's used a lot for Bayesian inference, but that's not the only thing you can do with it.
- It's useful for a lot of scenarios. What it does is
- allow you to draw from complex distributions (not arbitrarily complex,
- but complex) connected to data.
- PyMC3 is similar software that works directly in Python. Stan has a good bridge to Python as well.
- It's also really useful to learn causal inference. We don't have a dedicated causal inference class,
- but you will learn a lot of causal material in the advanced econometrics class in the economics department.
- Also, there's the book Counterfactuals and Causal Inference that you may wish to read in order to learn causal inference techniques.
- If you want to learn more about critical perspectives on data science, there's a variety of resources I can point you to.
- The book Data Feminism, which I listed in the class resources, is a good primer on critical thinking on data.
- Also, if you want to get an example of some of the kinds of things that critical data scholars think about,
- there's a short little book called Data, Now Bigger and Better.
- It is a selection of essays on data from a critical, and particularly an anthropological, perspective.
- Also,
- if you want to learn more about the underlying philosophies and mechanisms by which science in general, not just data science, works,
- Sergio Sismondo's Introduction to Science and Technology Studies is very good. And a keyword that you can go look for to find much,
- much more reading is "critical data studies". So to wrap up: this class has laid a foundation for you to integrate further knowledge.
- Never stop learning new things. There is so much more to learn as a student and as a practitioner to be able to do good and effective data science
- and bring data to bear on questions that you have and decisions that you or your organization needs to make.
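The "set up a differentiable model and let the framework optimize it" idea from this video can be sketched with a small matrix factorization in PyTorch. This is not the instructor's recommender code: the synthetic ratings matrix, the rank, the learning rate, and the number of steps are assumptions chosen only to keep the example short.

```python
import torch

torch.manual_seed(0)

# Synthetic ratings (an assumption for illustration): a noisy rank-2 matrix.
n_users, n_items, rank = 30, 20, 2
true_users = torch.randn(n_users, rank)
true_items = torch.randn(n_items, rank)
ratings = true_users @ true_items.T + 0.1 * torch.randn(n_users, n_items)

# The model is just two embedding matrices; there are no neural network layers.
user_emb = torch.nn.Parameter(0.1 * torch.randn(n_users, rank))
item_emb = torch.nn.Parameter(0.1 * torch.randn(n_items, rank))
optimizer = torch.optim.Adam([user_emb, item_emb], lr=0.05)

losses = []
for step in range(500):
    optimizer.zero_grad()
    predicted = user_emb @ item_emb.T            # differentiable model
    loss = ((predicted - ratings) ** 2).mean()   # mean squared error
    loss.backward()                              # gradients via autodiff
    optimizer.step()                             # the optimizer updates both matrices
    losses.append(loss.item())

print(f"MSE at first step: {losses[0]:.3f}, at last step: {losses[-1]:.3f}")
```

The only modeling work here is writing down the prediction and the loss; the gradient computation and the update rule come from the library, which is what makes these packages useful beyond deep learning.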
🎥 General Tips
Some final closing tips and suggestions for you to think about as you take the next steps in your data science career.
GENERAL TIPS
Learning Outcomes
Some concluding tips for your data science work.
Photo by Sam Dan Truong on Unsplash
Questions
Good questions are fundamental
Tukey:
Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.
Photo by Jon Tyson on Unsplash
Work Reproducibly
For many, many reasons.
But accurate description is easier if work is reproducible (and internally reproduced).
Never Lose Context
Keep the big picture in mind
Goals — Questions — Analysis — Data
Keeping the goal in mind helps with:
Contextualizing results
Identifying limitations
Generating next ideas
Photo by Adrian Dascal on Unsplash
Don’t Overlook Detail
Be precise and specific in specifying questions and results
Always know what, precisely, you are measuring
Communicate precise results & definitions
Put them in context
Photo by dorota dylka on Unsplash
Be Curious
Learn more about statistics
Learn more about applications
Learn more about anything interesting
Photo by Siora Photography on Unsplash
Pause and Reflect
Photo by Lukasz Saczek on Unsplash
Wrapping Up
Good data science requires ongoing reflection, study, and practice.
Try to do better science tomorrow than you did yesterday.
Photo by Jan Tinneberg on Unsplash
- In this video, I want to give you a few general tips, and
- wrap up the semester with some suggestions and some advice for doing data science and for learning more.
- So that's our learning outcome. Some concluding tips for your data science work.
- The first is to never take your eye off of questions. Good questions are fundamental.
- John Tukey, who developed a lot of important statistical concepts,
- is quoted as saying that it's far better to have an approximate answer to the right question than an exact answer to the wrong question.
- If you have a good question that you can get insight into, even if you don't have a crisp and precise answer for it, you can often refine the answer.
- But that's also often more illuminating for what you want to learn and for
- what you need to do than a very precise answer to a question
- that is not ultimately the question you want to ask. So
- focus a lot of energy on defining your question properly and appropriately, and understanding
- its relationship to context (I'm going to talk about that in just a bit), what your answer means for it, and its limitations.
- The second piece of advice is to work reproducibly. I've talked about this a little bit
- as we've been wrapping up. There are many reasons that you should make your analysis reproducible.
- One really important
- one is that when you write up your results, whether for a publication, for your thesis, or for a report, you need to know what you did.
- And if you've been doing a bunch of different things, writing scripts, processing data here and there,
- it's easy to lose track of exactly what steps got you from source data to conclusions.
- If you put it into a reproducible pipeline where it's automated, ideally so you can rerun the whole thing with one command or a few commands,
- then it's easy to make sure that you have accurately described all of the steps, because you can check:
- do I describe the steps that are in my pipeline? And then you can try rerunning the pipeline to make sure it still works.
- Do you still get the same conclusions? If not, either you missed a step, or
- what you're reporting is the result of running a different set of steps or a set of steps in a different order.
- So in addition to all of the external scientific benefits of reproducibility,
- and the benefits of reproducibility for being able to maintain your model in the long run in production settings,
- just for documentation purposes, reproducibility helps you make sure that you've correctly described what it is that you did.
- Never lose sight of the context of what you're doing.
- There's a reason why, when we talked at the very beginning of the semester about developing questions,
- we started with goals, then questions that are going to advance those goals, then analysis to answer the questions,
- and the analysis is of data.
- We focus on a question and try to answer it, and the question has to be precise in order for us to answer it.
- But that question, as with the Tukey quote, may well be an approximation for something we really care about.
- And so, once we define the question, we don't want to just focus on the question.
- We need to keep the big picture in mind to remember why.
- If we need to do further refinements of the question, we have a touchstone to look to to figure out how we need to revise the question.
- If we need to adjust it because our data can't answer that question,
- well, what new question is going to bridge between the data and our goals?
- So we have somewhere we're going. Context is also crucial for being able to contextualize our results.
- Why do we care about the answer to this question? What does it mean for what we know about the world or for what we're trying to do in the world?
- It's crucial for helping identify limitations. If you know the goal and you have a question,
- having the goal gives you a context in which to talk about them:
- OK, so our question can do this, but there are these other things that are important to our goal that we can't do right now.
- That's fine. It gives you a way to document them. It's also then super useful for generating next ideas.
- Our question has taken us two steps towards our goal. Our goal is still 10 steps away.
- That's a lot of future work that's already written for us. But then also, we don't want to overlook detail.
- So when we're specifying our questions and results, we need to be precise and specific about what exactly we're measuring.
- This might at first glance seem to contradict the Tukey quote at the beginning, where he said
- it's better to have the vague right question than the precise wrong question.
- But they're not really in conflict, because whenever we're measuring something, we're measuring something very precise.
- Computers and measurements are very precise things. We need to understand precisely what that is.
- And we need to understand the relationship between the precise thing we computed,
- say, a measurement of how many people
- clicked "I've never seen this number before" in our interface for dealing with incoming calls,
- and the vaguer question that we're trying to answer, so that we understand the approximation.
- It's not enough to say, well, OK, I have the right question and I have an approximate answer to it.
- We need to understand, to the best of our ability, how the precise thing we measured
- is an approximation of the question we're trying to answer, and how that's then going to advance the goal we have.
- We can't overlook the details; we need to understand precisely what it is that we're doing. Different people approach
- connecting data and goals in different ways. Some are very top-down thinkers.
- They start with the big picture goals. They're very focused on big picture things.
- That's often my default mode of operation. I'm a relatively big picture person.
- But in order to be a big picture person, effectively, you have to be able to connect that big picture with specific,
- concrete, measurable things that are going to advance the big picture.
- If you don't do that, then you're going to wind up either not being able to make progress, because you can't
- actually define something actionable that's going to advance the big picture, or
- you're not going to have a clear sense of whether you're advancing it, or you're going to make
- a lot of unsubstantiated statements because you're not connecting to the details.
- Some people start from the details, with a more bottom-up development of ideas.
- Neither of these is wrong. Different faculty in the department tend to start from top-down or bottom-up places.
- You need to be able to learn from people who communicate and structure ideas in both ways.
- Bottom up, you start more with the data and the details of the problem, and you build that into a big picture.
- And there you need to be able to see the context and not lose the context for what it is that you're doing.
- Both are good ways to approach problems and are complementary perspectives that can work very well together.
- But it's something to be aware of. You can't lose sight of either the context (why you're doing the thing that you're doing) or the details:
- you need to understand precisely what you did to be able to reason about how that relates to your context and your goals.
- Be curious. There's so much to learn.
- I am continually learning new things about programming, about statistics, about the problems that I'm trying to solve with these techniques.
- And so learn about that. I've given you some pointers of some things to read in these videos.
- I've given you pointers elsewhere in the course materials. There's so much to go study.
- This class is intended to open the door to the world of data science, so that you can walk through and have
- a basic framework in which to fit all the new information that you're going to need to be acquiring over the next years.
- Pause and reflect. This applies to a lot of things. It applies to the work you're doing: reflect on what this result means.
- Reflect on: is my code efficient, is my code readable?
- Does this chart make sense? Reflect on your practices of work.
- How am I organizing my work to go from my goals to my analysis?
- How am I organizing my to-do list? Something I also tell my students is to take time to reflect on their practices.
- How do you organize your work overall as a student? And reflect:
- Is this working for me? Am I having problems with my productivity, problems with getting things done,
- problems with communication? Or maybe
- "problems" is a harsh way to frame it. Are there places I can improve?
- What's working about how I'm doing my work as a student? What's not working and how can I improve?
- There's a risk of spending so much time on your process for getting things done
- that you never get things done. But it's important to reflect on our work and on its outcomes.
- How could this paper be better? How could this report be better?
- How can I do this kind of project better the next time? It is done, it's published, it's good.
- How can I do even better the next time? So to wrap up, good data science requires ongoing reflection, study, and practice.
- Never stop learning, and never stop paying attention to a variety of things.
- Ideas don't live well in little boxes.
- There may be someone working on some completely different problem,
- but they had an insight into their problem which you can go apply to yours.
- So take advantage of the opportunity to go to a wide range of talks, to read a wide range of papers and a wide range of books,
- etc., to get ideas for how you can do better work and how you can better understand your problem
- space, the world around you, your customers, and the needs of your organization, in order to do better data science tomorrow than you did yesterday.
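The advice in this video about working reproducibly (one command reruns everything, and you can check that the steps you describe are the steps that ran) can be illustrated with a minimal sketch. Everything here is hypothetical and not from the course materials: the step names `load_raw`, `prepare`, and `analyze`, the seeded synthetic data standing in for a real source file, and the `fingerprint` helper are assumptions made only for illustration.

```python
import hashlib
import json
import random


def load_raw(seed=20211206):
    """Stand-in for reading source data; seeded so every run sees the same values."""
    rng = random.Random(seed)
    return [rng.gauss(10.0, 2.0) for _ in range(200)]


def prepare(values):
    """Hypothetical cleaning step: keep only positive measurements."""
    return [v for v in values if v > 0]


def analyze(values):
    """Hypothetical analysis step: summarize the prepared data."""
    n = len(values)
    return {"n": n, "mean": round(sum(values) / n, 6)}


# The ordered list of steps doubles as documentation of what was done.
PIPELINE = [load_raw, prepare, analyze]


def run_pipeline():
    """Run every step in order, feeding each result into the next step."""
    result = None
    for step in PIPELINE:
        result = step() if result is None else step(result)
    return result


def fingerprint(result):
    """Hash the findings so two runs can be compared exactly."""
    text = json.dumps(result, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    first = run_pipeline()
    second = run_pipeline()
    print("steps:", [step.__name__ for step in PIPELINE])
    print("findings:", first)
    print("reproduced:", fingerprint(first) == fingerprint(second))
```

In a real project the steps would more likely be separate scripts driven by a workflow tool, but the idea is the same: the step list tells you what to describe in the write-up, and a rerun that produces a different fingerprint signals a missed step or a changed order.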
🚩 No Quiz 15
There is no quiz for this week.
📃 Farewell
It’s been grand!
I would love to hear feedback on how to further improve the course; I have tried to make some corrections as a result of the midterm assessment process, and will be keeping those notes for next year, but further suggestions either in the course evaluations or by Piazza are welcome.
I hope to see many of you in future courses!
📩 Assignment 7
Assignment 7 is due Sunday, December 12, 2021.
🚩 Final Exam
The final exam will be on Tuesday, Dec. 14, 2021 at 9:30 AM.
Exam Rules
You may have 1 note sheet, letter- or A4-sized, two-sided.
You should not need a calculator, but may bring one if you wish.
You may answer in either pen or pencil.
Study Tips
Review the previous quizzes, assignments, and midterms.
Sit for the makeup midterm, even if you do not plan to turn it in.
Review lecture slides to see where you are unclear on concepts and need to review.
Skim assigned readings, particularly the section headings to remind yourself what was in them.
Review the course glossary, keeping in mind that it does contain terms we haven’t gotten to yet.