Week 15 — What Next?

This is the last week of class. We're going to recap, and talk about what's next, both for learning and for putting what you've learned to practical use.

Recap
Data and Concept Drift
Time Series Operations
Time Series Example
Weekly Quiz 15
Correlated Errors
Publishing Projects
Production Applications
Topics to Learn
General Tips
Farewell
Makeup Exam
Assignment 7
Final Exam

This week's videos are also available as a Panopto playlist.

Recap

This video reviews the concepts we have discussed this term and puts them into the broader context of data science.

Video

Slides

CS 533 Intro to Data Science, Michael Ekstrand: Recap
Learning Outcomes (Week): Wrapping up! Tie together the class content again. Apply Pandas time series operations and model correlated regression errors. Take the results of data science analysis into production or publication. Know some topics to study further to expand your data science skills.
The Data Science Workflow (diagram labels): Raw Source Data; Transform / Prepare (ETL); Prepared Data; Data Description; Inference; Findings; Modeling; Model + Predictions.
What is Data Science? The use of data to provide quantitative insights on questions of scientific, business, or social interest.
Data Management: Reading from static files. Processing and integrating with Pandas. To learn more: CS 510 Databases; application- and type-specific data in other classes.
Mathematical Fundamentals: Probability theory. Linear algebra (a little). To learn more: Math 562 (Probability and Statistics); Math 503 (Advanced Linear Algebra).
Inference: Basic parametric pairwise comparisons (t-tests). Bootstrapping. Sampling theory. Linear regression models (OLS & logistic). To learn more: Math 562 (Probability and Statistics).
Prediction: Regression (continuous outcomes). Classification (categorical, esp. binary, outcomes). To learn more: CS 534 (Machine Learning); many other data science classes.
Evaluation and Tuning: Train/test splits. Classification and continuous prediction metrics. Hyperparameter tuning. To learn more: CS 534 (Machine Learning).
Unsupervised Learning: Lower-dimensional embedding (matrix decomposition). Clustering. To learn more: CS 534 (Machine Learning); will appear in other data science classes.
Workflows: Data science pipeline. Breaking code into separate scripts & modules. Design patterns for code workflows. You will apply this throughout your classes!
Wrapping Up: This class is designed to lay a conceptual foundation for your future data science studies. Other classes will build on these concepts and ideas!

Data and Concept Drift

This video introduces a fundamental assumption of predictive modeling and the way drift can affect it.

Video

Slides

CS 533 Intro to Data Science, Michael Ekstrand: Concept and Data Drift
Learning Outcomes: Know crucial assumptions of machine learning evaluation and deployment. Understand how models can degrade over time.
Fundamental Assumption
Deployment Assumption
Drifts and Shifts
Offline Solution: Temporal Splitting. A random train-test split ensures that train and test data are comparable, but assumes the test data is uniformly drawn from the same distribution! Alternative: temporal split. Select temporally-contiguous test data (e.g. 1 month) and train on data before the test data (no time travel!). Benefit: simulates actual use. Drawback: temporal data is no longer random, so inference is harder.
Online Solution: Continuous Monitoring. Instrument your system in production. Watch key metrics over time (click-through rate, classification rate). Regularly re-train and re-evaluate: train the model on new data, evaluate the model on new data.
Wrapping Up: ML training and evaluation assumes that the training and test data match real life. You can’t always rely on that.
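
The temporal split described in the slides can be sketched in Pandas roughly as follows. This is a minimal illustration on synthetic data: the `events` frame, its column names, and the `temporal_split` helper are assumptions made for the example and do not come from the course materials.

```python
import numpy as np
import pandas as pd

# Synthetic event log: one row per day for 120 days (illustrative data only).
rng = np.random.default_rng(42)
events = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=120, freq="D"),
    "feature": rng.normal(size=120),
})
events["label"] = (events["feature"] + rng.normal(scale=0.5, size=120) > 0).astype(int)


def temporal_split(df, time_col, test_start):
    """Hypothetical helper: train on rows strictly before test_start, test on the rest."""
    cutoff = pd.Timestamp(test_start)
    train = df[df[time_col] < cutoff]
    test = df[df[time_col] >= cutoff]
    return train, test


# Hold out the final 30 days as a temporally contiguous test set.
train, test = temporal_split(events, "timestamp", "2020-03-31")
```

Because every training row precedes every test row, a model fit on `train` never sees information from the period it is evaluated on, which is the "no time travel" condition.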

Resources

A unifying view on dataset shift in classification (available through Boise State library)

Time Series Operations

Time is an important kind of data that we haven't spent much time with — this video discusses the fundamental Pandas operations for working with time-series data.

Video

Slides

CS 533 Intro to Data Science, Michael Ekstrand: Time Series
Learning Outcomes: Summarize and plot time series data with Pandas. Understand that time series data is often not independent.
Time Series
Time Representation: Many representations: string dates; years & months; timestamps (seconds, ms, etc.). Pandas: must convert to a datetime with pd.to_datetime.
Pandas Setup Steps: Create/convert a datetime column with instance times. Index the data frame by timestamp. Sort the index. Time series are an exception to the “prefer unique indexes” guideline.
    ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')
    rts = ratings.set_index('timestamp').sort_index()
Operation: Resampling. Like groupby, but on time intervals. Compute aggregate functions for time intervals. The ‘on’ option allows a non-index column.
    monthly_ratings = rts.resample('1M')['rating'].count()
Operation: Plotting. Matplotlib & Seaborn render timestamp X axes well.
    sns.lineplot(data=monthly_ratings)
Operation: Range Select. Time series indexes support range operations. Select by range (includes the end point, unlike a normal slice!):
    rts.loc['2010-01-01':'2010-12-31']
Select by period:
    rts.loc['2010-07']
Operation: Diff and Shift
Time Effects
Autocorrelation
Wrapping Up: Repeated data over time requires particular handling and has distinct operations. Pandas provides these through datetime columns and indexes.
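
To try the operations from the slides without the MovieLens data, here is a small self-contained sketch. The synthetic `ratings` table is an assumption standing in for the real data, and the example resamples with the month-start alias `'MS'` rather than the slides' `'1M'`, since recent Pandas releases have been changing the month-end alias.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ratings table (assumed data): 500 ratings with
# Unix timestamps in seconds, spread uniformly over the year 2010 (UTC).
rng = np.random.default_rng(0)
ratings = pd.DataFrame({
    "timestamp": rng.integers(1262304000, 1293840000, size=500),
    "rating": rng.integers(1, 6, size=500),
})

# Setup steps: convert to datetime, index by timestamp, sort the index.
ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], unit="s")
rts = ratings.set_index("timestamp").sort_index()

# Resampling: count ratings per calendar month, labeled by month start.
monthly_ratings = rts.resample("MS")["rating"].count()

# Range select (end point included) and select by period.
first_half = rts.loc["2010-01-01":"2010-06-30"]
july = rts.loc["2010-07"]

# Diff and shift: month-over-month change and the previous month's count.
monthly_change = monthly_ratings.diff()
previous_month = monthly_ratings.shift(1)
```

`diff()` is equivalent to subtracting `shift(1)` from the series, so the first entry of both `monthly_change` and `previous_month` is missing.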

Resources

Pandas time series analysis

Time Series Example

The MovieLens Time Series notebook demonstrates basic time series operations in Pandas.

Weekly Quiz 15

Take the Week 15 Quiz in Blackboard.

Correlated Errors

Regression models assume that observations, and therefore their errors, are independent. This video introduces two kinds of non-independence and methods for addressing them: grouped observations, addressed with a mixed-effects model, and temporal autocorrelation, addressed with ARIMA models.

Video

Slides

CS 533 Intro to Data Science, Michael Ekstrand: Correlated Errors
Learning Outcomes: Understand the idea of correlated errors and how they can be addressed. Know when you need to look for a model that can handle correlated error. Know two models to study further when needed.
Correlated Errors: Observations are not necessarily independent. Time series: autocorrelated. Within-subjects designs: grouped. Other data may also be grouped, e.g. medical data collected by hospital.
Grouping Example: Example problem: measuring search engine effectiveness. Two search algorithms. Users issue queries. Results are assessed for relevance (experts or click logs). Problem: results are not independent, because some queries are harder than others.
Naïve Solution
Stratified Errors / Group Effects
Fixed and Random Effects
Basic Idea: Modeling Additional Effects. Random effects capture natural sources of variance shared between observations. Performance: system effectiveness + query difficulty. Remaining variance (residuals) is hopefully i.i.d. normal! Our linear model works again. This applies generally: if we know an effect in the variance structure, we can remove it (“control for” it).
When To Use Stratified Errors: Any time your data points come in groups. Multiple measurements for the same object, when you are trying to understand the difference between measurements. If you are trying to understand the difference between objects, they are fixed effects! When you have an external feature shared between instances.
Autocorrelation
Autoregression
Moving Average
ARIMA: AutoRegressive Integrated Moving Average, written ARIMA(p, q, r) in these slides. AR(p) model. MA(r) model. Applied to q-order diffs (1 = diff, 2 = diff of diffs). AR & I can be viewed as a type of feature engineering. Note that many references and libraries write the order as (p, d, q) instead, with d the differencing order and q the moving-average order.
Integration with Prediction
Wrapping Up: We often have structured errors in a regression problem. In some cases, modeling that structure yields a valid model.
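
The remark that the AR and I parts of ARIMA can be viewed as feature engineering can be illustrated with plain Pandas and NumPy. The sketch below is an assumption-laden toy, not a full ARIMA fit: it simulates a series whose first differences follow an AR(1) process, differences it (the "I" step), adds the lag-1 difference as a feature (the "AR" step), and fits ordinary least squares. There is no moving-average term, and the simulated coefficient `phi` is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 500
phi = 0.6  # true AR(1) coefficient of the differenced series (simulation choice)

# Simulate AR(1) increments, then integrate them (cumulative sum) to get the series.
noise = rng.normal(size=n)
increments = np.zeros(n)
for t in range(1, n):
    increments[t] = phi * increments[t - 1] + noise[t]
series = pd.Series(increments.cumsum())

# "I" step: first-order differencing recovers the increments.
diffs = series.diff()

# "AR" step: the previous difference becomes an ordinary regression feature.
frame = pd.DataFrame({"y": diffs, "lag1": diffs.shift(1)}).dropna()

# Ordinary least squares on the engineered feature (intercept + lag-1 difference).
X = np.column_stack([np.ones(len(frame)), frame["lag1"].to_numpy()])
coef, *_ = np.linalg.lstsq(X, frame["y"].to_numpy(), rcond=None)
intercept, phi_hat = coef
```

With enough data, `phi_hat` should land near the simulated `phi`. For real work, a dedicated time series library such as the statsmodels tools listed in the resources is the better choice, since it also handles the moving-average part and proper inference.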

Resources

Linear Mixed Effects in statsmodels
Time Series Analysis in statsmodels
Time Series Analysis slides — much more in-depth treatment

Publishing Projects

This video talks about going from an analysis and its notebooks to a publishable paper.

Video

Slides

CS 533 Intro to Data Science, Michael Ekstrand: Publishing Projects
Learning Outcomes: Understand what is needed to go from analysis to publishable reports. Outline a research paper.
Publication Audiences: Data science document products can be for: collaborators; decision-makers; other organizations; the scientific community; the lay public.
Formats: Written document (electronic or printed). Presentation (live or recorded). Interactive online demo/dashboard.
Publication Goals: The reader needs to understand what you did, what you learned, and what they should do or take away. (In general. Not everything is actionable.)
Typical Outline: Introduction; Background & Related Work; Methods; Results; Discussion; Conclusion. Variants: Some communities put related work at the end. Institutional reports often lead with an Executive Summary (1-2 page summary w/ key points). Discussion may be merged w/ Results or Conclusion. Methods may split.
Rendering Plots: High-quality images. When practical: vector images (PDF is good for LaTeX). Otherwise: high-resolution images (at least 300 dpi, 600 is better). Complex images can overwhelm PDF. Clean and clear labels and captions. Distinct colors, shapes, etc. Experiment with dimensions & aspect ratio, e.g. a 5”-wide image scaled down to 3.75” columns.
Step Outside Yourself: Forget you did the work and wrote the document. Can you understand it? Could you reproduce it? Is info missing? Applies everywhere: reports, papers, README.
Wrapping Up: Internal and external publications require special attention to writing and visual presentation.
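
As a rough illustration of the plot-rendering advice, the following sketch saves one Matplotlib figure both as a vector PDF and as a 300 dpi PNG. The plotted numbers, labels, and file names are made up for the example, and the output goes to a temporary directory so that nothing is written next to your notebooks.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# A 5-inch-wide figure, as in the slide's example; it can be scaled down
# to the column width when placed in the document.
fig, ax = plt.subplots(figsize=(5, 3))

# Illustrative data only; distinct markers keep the lines distinguishable in grayscale.
ax.plot([1, 2, 3, 4], [2.0, 3.5, 3.0, 4.5], marker="o", label="Model A")
ax.plot([1, 2, 3, 4], [1.5, 2.5, 3.5, 4.0], marker="s", label="Model B")
ax.set_xlabel("Week")
ax.set_ylabel("Score")
ax.legend()

out_dir = tempfile.mkdtemp()
pdf_path = os.path.join(out_dir, "scores.pdf")
png_path = os.path.join(out_dir, "scores.png")

# Vector output for LaTeX, plus a high-resolution raster fallback.
fig.savefig(pdf_path, bbox_inches="tight")
fig.savefig(png_path, dpi=300, bbox_inches="tight")
plt.close(fig)
```

Saving from a script rather than copying images out of a notebook also makes it easy to regenerate every figure when the data or styling changes.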

Resources

PlotNine is a good plotting library for preparing consistent, publication-ready graphics.
The book gender example also demonstrates the current evolution of my own practices for preparing for publication.

Production Applications

How do you put the results of your data science project into production?

Video

Slides

CS 533 Intro to Data Science, Michael Ekstrand: Production Applications
Learning Outcomes: Understand how data science outcomes can be used. Think about how to put models and outcomes into production.
Using Data Science: Data-driven reports to inform decisions. Regular forecasts for internal purposes. Data science outputs for real-time decisions (internal or customer-facing).
Reproducibility: Reproduction is crucial. Regular reports / predictions are re-run (daily / weekly / monthly reports). Retraining models for online use.
Online Use: Many modalities: web, mobile app, desktop app, server infra. Mobile & desktop often use web tech to connect to models: HTTP (often REST) API, with some low-latency exceptions. Multiple audiences: internal reporting, internal decision-making, customer decision-making.
Service-Oriented Architecture (diagram labels): Client; Web Server; Services; Databases.
Deployment: Predictions made available via a web service (e.g. TensorFlow Serving). Model trained offline on other hardware. Model-training script saves trained model to disk. Web service loads trained model & serves predictions.
Useful Infrastructure Capabilities: Train models on live or freshly-exported data. Hold out test data to evaluate new model before deployment. Version your models (including retrain w/ same hyperparams!). Roll back to old model version. Details depend on institution & infrastructure.
Skills to Learn: Web back-end programming (to build services). Web front-end programming (to build dashboards). Performance measurement & tuning.
Wrapping Up: Many data science projects result in online production capabilities. Often done by training a model and deploying it in a web service.
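
The train-offline, save-to-disk, load-in-the-service pattern from the Deployment slide can be sketched with nothing but the standard library. Everything here is hypothetical: the toy `train` function stands in for real model training, the `model-v<N>.json` naming scheme is just one possible versioning convention, and the web service itself is left out so that only the save, load, and rollback steps are shown.

```python
import json
import os
import tempfile


def train(xs, ys):
    """Fit y = slope * x + intercept by least squares (stand-in for real training)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def save_model(model, model_dir, version):
    """Training-script side: write one file per model version."""
    path = os.path.join(model_dir, f"model-v{version}.json")
    with open(path, "w") as f:
        json.dump({"version": version, **model}, f)
    return path


def load_model(model_dir, version):
    """Service side: load a specific version, which also makes rollback possible."""
    with open(os.path.join(model_dir, f"model-v{version}.json")) as f:
        return json.load(f)


def predict(model, x):
    """What the web service would compute for each incoming request."""
    return model["slope"] * x + model["intercept"]


# Two training runs on different (made-up) data produce two versions on disk.
model_dir = tempfile.mkdtemp()
save_model(train([0, 1, 2, 3], [1, 3, 5, 7]), model_dir, version=1)
save_model(train([0, 1, 2, 3], [1, 2, 3, 4]), model_dir, version=2)

current = load_model(model_dir, version=2)      # serve the newest model
rolled_back = load_model(model_dir, version=1)  # or roll back to the old one
```

Keeping each version as a separate file is what makes the rollback a one-line change in the service; real deployments usually add a held-out evaluation step before a new version is promoted.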

Topics to Learn

This video goes over some useful topics to learn to fill out more of your data science education.

Video

Slides

CS 533 Intro to Data Science, Michael Ekstrand: Things to Learn
Learning Outcomes: Know topics and software to study for expanding your data science skills. Select the classes to take to complete your graduate degree.
Machine Learning and Statistics: CS 534: Machine Learning. Math 562: Probability and Statistics II. Math 572: Computational Statistics. Math 573: Time Series Analysis.
Working with Text: CS 536: Natural Language Processing. CS 537: Introduction to Information Retrieval. Read about NLP and IR.
Application Areas: Social media: CS 539, read papers in ICWSM & CSCW. Information retrieval: CS 537 & 637. Recommendation & personalization: CS 538.
Software Development: CS 573: Advanced Software Engineering. Study programming practices & software engineering. Practice, practice, practice. Think about your code’s readability & effectiveness.
Advanced ML: CS 633: Deep Learning. Software: TensorFlow, PyTorch. “Modular differentiable programming”.
Bayesian Inference: I’m a Bayesian. Mostly. Book: Statistical Rethinking (McElreath). Software: Stan or PyMC3.
Causal Inference: Book: Counterfactuals and Causal Inference (Morgan & Winship). ECON 522: Advanced Econometrics.
Critical Perspectives: Book: Data Feminism (D’Ignazio & Klein). Book: Data, Now Bigger And Better. Book: Introduction to Science and Technology Studies (Sismondo). Keyword: Critical Data Studies.
Wrapping Up: This class lays a foundation for you to integrate further knowledge. Never stop learning new things.

General Tips

Some final closing tips and suggestions for you to think about as you take the next steps in your data science career.

Video

Slides

CS 533 Intro to Data Science, Michael Ekstrand: General Tips
Learning Outcomes: Some concluding tips for your data science work.
Questions: Good questions are fundamental. Tukey: “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”
Work Reproducibly: For many, many reasons. But accurate description is easier if work is reproducible (and internally reproduced).
Never Lose Context: Keep the big picture in mind: Goals, Questions, Analysis, Data. Keeping the goal in mind helps with contextualizing results, identifying limitations, and generating next ideas.
Don’t Overlook Detail: Be precise and specific in specifying questions and results. Always know what, precisely, you are measuring. Communicate precise results & definitions. Put them in context.
Be Curious: Learn more about statistics. Learn more about applications. Learn more about anything interesting.
Pause and Reflect
Wrapping Up: Good data science requires ongoing reflection, study, and practice. Try to do better science tomorrow than you did yesterday.

Farewell

It's been grand! I would love to hear more feedback on how to improve this material, because I hope to keep using it for a while.

Video

Makeup Exam

The makeup midterm is due Saturday, Dec. 12 at 5:00 PM.

Grade Replacement

If you turn in the makeup exam to be graded, its grade will replace the lower of your Midterm A and B grades, even if that lowers your final grade. Only turn it in if you think you did better than your worst normal midterm!

Assignment 7

Assignment 7 is due December 13, 2020.

Final Exam

The final exam will be released at 5 PM on Monday, Dec. 14, and is due at 5 PM on Thursday, Dec. 17.