Week 1 – Questions (8/22–26)
The key learning outcomes for this week are:
Ask and refine questions that can be answered with data
Install and run the software required for the course
Write and run basic Python code in a Jupyter notebook
Begin to think about the complexity of meaningful questions
This week draws from chapters 1–4 of Python for Data Analysis and chapters 1–2 of Think Like a Data Scientist.
If you already know Python, the Python parts should mostly be review.
🧐 Content Overview
This week has 1h20m of video and 0 words of assigned readings. This week's videos are available in a Panopto folder.
📅 Deadlines
This week has the following deadlines:
🎥 Introduction
This video introduces the course and our learning outcomes for the term.
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
INTRODUCTION
Video Outcomes
Introduce the class subject
Understand course learning outcomes
Understand structure of this class
Know where to get help
What is Data Science?
The use of data to provide quantitative insights on questions of scientific, business, or social interest.
Data and Meaning
“By itself, some piece of data has no meaning. Data is only given meaning – as evidence – by the people who make use of it.”
Sergio Sismondo, An Introduction to Science and Technology Studies, 2nd Edition. Wiley-Blackwell, 2010, page 133.
Learning Outcomes
Explore a data set to determine whether and how it might illuminate questions of interest.
Define and operationalize a research question such that a data analysis could produce meaningful knowledge.
Use best practices to carry out analyses in a documented, reproducible, and efficient fashion.
Present the results of a data analysis with appropriate visuals and written argument.
Identify weaknesses in a data analysis and assess their impact on the correctness and utility of the results.
Assess ethical implications of an analysis in terms of both classical human subject research ethics and contemporary concerns such as fairness and bias.
Understand the space of data science techniques and applications, and relate future learning to this framework.
Course Components
Videos and readings – this is the primary content delivery
Web site is your guide!
In-class meetings – application, discussion, exercises
Quiz before class on Thursday
Assignments
Exams
2 midterms + final
Getting Help
In-class from peers and instructor
Course forum on Piazza – encouraged!
Office hours
Remote office hours available by appointment
Please use Piazza for course inquiries
Use the Internet
Documentation and resources
Public Q&A sites (within reason)
Wrapping Up
This class is designed to:
Provide flexibility in learning
Use each modality to its best advantage
to achieve our learning outcomes.
Let's learn!
We're in the Internet!
- Hello and welcome to CS 533 : Introduction to Data Science.
- This is the introduction video that's going to give you an overview of what we're
- going to be talking about this semester and provide you with guidance for how to get started with the course.
- Our learning outcomes for this video are to introduce the class subject (what is data science, for
- our purposes?), for you to understand the learning outcomes of the course, for you to understand
- how this class is structured at a high level (we'll be supplementing this with a discussion in class),
- and then also to know where to get help throughout the semester as you're in the course.
- To get started, though, I want to talk about what data science is, and there are many different
- people that are going to have many different definitions of data science.
- But the definition I've been using as I've built out this class is that data science is the use
- of data to provide quantitative insights on questions of scientific, business, or social interest.
- There may be many different things to which we want to apply data science.
- Maybe in a business context where we want to gain
- understanding about the effectiveness of our business processes.
- We may want to evaluate a change to some aspect of
- how we carry out our business or how we manufacture or provide our products.
- We may want to understand the impacts of some of our business decisions.
- We may have a scientific question where we are trying to produce
- generalizable knowledge to understand the world around us.
- We may have social interests, particularly if we are deploying data science in a nonprofit, a
- government agency, or an educational setting, where we're trying to understand social dynamics,
- we're trying to understand the effectiveness and the impact of policies and programs, and various
- other purposes to inform our social mission with quantitative insights. To give you some more
- about how I go about thinking about this and some more of the background of how I've been
- designing the class, I find this quote from Sergio Sismondo's
- An Introduction to Science and Technology Studies useful.
- In that book, he says, "By itself, some piece of data has no meaning.
- Data is only given meaning, as evidence, by the people who make use of it."
- And this means that we can't just go find a bunch of data and say, "Oh, we have the answer."
- We have to do work to get our data into a form
- where it can actually provide answers to the questions we care about.
- We need to do work to frame our questions in a way that we can actually answer them with the
- data that we have available or that we can obtain.
- And this is a human process. And it's also a social process, which is one of the reasons why
- we're going to be using our in-class time for discussion and application exercises. The
- data gains its meaning and becomes evidence for particular conclusions or particular answers
- because we go through the process of interpretation, and we go through the process of presenting
- our interpretations to others and discussing and debating how we came to the conclusions, the
- conclusions we came to, and the level of support they have from the data.
- So we're going to be practicing that a lot in this class: going from
- data to evidentiary meaning as a human process.
- In pursuit of all of this there are a number of learning outcomes for this course.
- The first one is that I want you to be able to explore a data set.
- Someone gives you some data and you first need to be able to get your bearings in it: understand
- what data you have, be able to assess whether or not it would
- answer particular questions and what questions you might be able to answer with it.
- You need to be able to define a question that we can answer.
- This is not an easy process; we're going to be talking quite a bit more about it in the next two videos.
- But taking a goal that we have and turning it into a question that we can actually answer through a
- data analysis is a process that takes work, and we're going to be learning that. I want you to be
- able to then actually carry out your analysis in a
- way that is documented, so other people can understand what you did.
- It's reproducible, so others can do the same analysis.
- And it also doesn't unnecessarily waste computing resources to come to the answers.
- You make efficient use of the resources we have available to us, and then we can also scale to
- doing analysis on very large data sets. It's not enough to just do an analysis; I want you to be
- able to present the results of your analysis, both through
- visuals (charts and graphics) and through written argument,
- to be able to communicate to other people (your peers in the class, myself,
- and in the future your advisors and your supervisors at work) what it is that you learned from the data
- and why your conclusions are a reliable and defensible interpretation of the data.
- I want you to be able to identify weaknesses in data analysis.
- No data analysis is perfect. There are going to be weaknesses and downsides, and we'll have to
- make trade-offs when we make decisions of how we analyze the data, but also not all weaknesses are created equal.
- I want you to be able to assess the impact a weakness has on the correctness and utility of the
- results. Some weaknesses are fatal flaws, so we can no longer trust the results we get.
- Other weaknesses, however, are caveats that we need to acknowledge and account for when we
- apply the results, but they don't fundamentally undermine their validity.
- I also want you to be able to reason about the ethical implications of your work.
- There's a variety of different frameworks and perspectives that we can take on the ethics of
- data science work that we're going to be talking about this semester, but I want you to be able
- to think about and account for, and assess the ethical implications of data science work.
- And then finally, I want you to understand how the broader picture of data science, the
- various techniques and applications, fits into a framework, to give you a map of the space.
- And particularly with the role this class plays in Boise State's graduate curriculum, I want to
- give you a framework that you can use to relate what you're going to learn in other classes, such
- as machine learning or large-scale data analysis or recommender systems or social media mining,
- together and develop a coherent picture of what it is to do data science.
- To support these outcomes, there are a number of components of the class. The first is the videos and readings.
- You're watching one of those videos right now. This is going to be the primary mechanism for content delivery.
- I'm not planning to lecture live in class. I'll be doing a little bit of lecture style things
- here and there as we need to clear up things that are confusing or things you have questions about.
- But our primary content delivery is going to be through these prerecorded lecture videos.
- The website is your guide to all of this. Each week is going to have a page that lays out the
- videos, the readings that you need to do in order to be prepared for class,
- prepared for the assignments, and to learn the material.
- Our in-class meetings are going to be focused on discussing and applying the material.
- Some of that will be open discussion. Some of that will be guided
- application exercises carried out with your classmates.
- We're going to be having some quizzes; in particular, each Thursday there's going to be a quiz before class
- that assesses your understanding of the material and gives both you and me an initial check on
- how well you're understanding it and whether you're ready to apply it in our in-class work and in the assignment.
- We're going to have assignments that come at a relatively
- steady pace throughout the semester: one every other week.
- They give you a more extended place to practice and develop your skills and demonstrate your
- mastery of the material that we're going to be covering this semester.
- And then finally we have two midterm exams and a final exam to give an additional check of your
- conceptual knowledge of what it is that we're doing in this class.
- As we go through the semester, I expect you're probably going to have questions and you're going to need some help.
- And there's a variety of places you can get it. You can get it in our class
- meetings, both from your peers in the class and from me.
- You can get it at any time through the class forum on Piazza.
- I strongly encourage you to ask questions and answer each other's questions on Piazza.
- When you ask a question, I'm probably not going to answer it immediately; I'm going to give some
- time in order to see if others have an answer first.
- Oftentimes helping other people answer questions strengthens our own understanding of the material.
- So I encourage you to make use of Piazza both to get your questions answered and to
- develop and practice your understanding of the material through helping each other.
- I am going to be having office hours. I'm going to be having them physically in my office so long as health permits.
- And I will also be able to have remote office hours by appointment.
- I do ask that if you have any inquiries about the class that you direct them through Piazza.
- If it's something that's just for me, perhaps about grades or
- something, you can send a private message to the instructors on Piazza.
- That's going to let me keep all of my course communications in one
- place, so I don't accidentally lose something in my email.
- And finally, you can use the internet as a resource. There's a lot of documentation and resources.
- Many of our readings are going to come from various places on the internet.
- There are many more sites where you can search for solutions and help.
- Also I encourage you to make use of public Q&A sites.
- When you're a practicing data scientist after this
- class, the internet is at your disposal for getting your work done.
- There's no reason not to go ahead and practice making use of the resources that are available to you in this class.
- I just ask that you do it in a way that's in keeping with the principles of academic integrity.
- You can go onto a public Q&A site and say, "I'm trying to do this thing and I'm having a
- problem," and ask a specific question about the problem that you're getting stuck on.
- I don't want you to just copy a piece of the assignment description, paste it into a question
- answer site, and ask, "How do I do this?" Demonstrate your learning of the material in how you go
- about identifying where it is that you're stuck and asking a question that's going to help
- uncover the missing piece that you need in order to move forward.
- So with all of that this class is designed to provide flexibility in
- learning and to use each modality to its best advantage.
- There are trade-offs; I'm not going to pretend like there are no trade-offs to how I've designed this class.
- But with this design we can do the content delivery (me lecturing) in a way where you can go
- back and rewatch videos later, you can catch up if you
- have to miss some material, you can speed up the video if I talk too slow.
- We're going to focus our in-class time on things that can
- only be done through synchronous discussion,
- where we're actually engaging with the material together and talking to each other about it.
- And these things together, with the various parts of the class, I
- hope are going to allow us to achieve our learning outcomes for the semester.
- I hope you have a great semester. I hope you learn a lot.
- I also expect to learn from you. Every time I teach, it's a learning experience for me as well,
- and I learn more about the subject and about how to teach and communicate
- about it from my students. So let's learn together.
🎥 Asking Questions
In this video, I introduce questions in their broader context of using data to advance goals.
I also introduce the idea of operationalization, which will be a key concept throughout the class.
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
ASKING QUESTIONS
Learning Outcomes
Understand components of a good question
Understand how questions relate to goals and analyses
Think about data for a question
Key term: operationalization
What is Data Science?
The use of data to provide quantitative insights on questions of scientific, business, or social interest.
Example: Department Interest
Context: changed assignment structures in CS121
Goal: Assess whether new change improved CS121
What does “improve” mean?
What data could inform the assessment?
What could we do with that data to measure improvement?
Operationalization
We call this operationalization
Start with a goal
Refine through intermediate questions
Define a specific measurement to take
What, precisely, is collected
How, precisely, it is computed
operationalize, v.:
The process of doing this.
operationalization, n.:
The result of this process.
Example: Department Interest
Goal: Assess whether new change improved CS121
Question: Are students better prepared to excel at work?
Question: Are students better prepared for the next class?
Goals, Questions, and Analysis
Questions do not have 1 level
May be multiple levels
Answering the question should advance the goal.
Or higher-level question
Carrying out the analysis should answer the question.
Operationalized Questions
They are specific
They are answerable
They can be answered with available data
Questions that fail this are not bad!
Example: Department Interest
Goal: Assess whether new change improved CS121
Question: Are students better prepared for next class?
Question: Are students more likely to pass CS 221?
Data: student grades for CS121 & 221
Caveats: many
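As a rough illustration of this operationalization, the pass-rate comparison could be sketched as follows. The records, the `pass_rate` helper, and the choice of C- as the passing threshold are all assumptions made for the example; they are not course data or department policy.

```python
# Hypothetical records: (CS 121 version the student took, CS 221 letter grade).
# Every value here is invented for illustration.
records = [
    ("old", "B"), ("old", "D"), ("old", "C-"), ("old", "F"),
    ("new", "A-"), ("new", "C"), ("new", "D"), ("new", "B+"),
]

# Assumption: "pass" means C- or better; the real threshold is a policy choice.
PASSING = {"A", "A-", "B+", "B", "B-", "C+", "C", "C-"}


def pass_rate(records, version):
    """Fraction of students from one CS 121 version who passed CS 221."""
    grades = [grade for v, grade in records if v == version]
    return sum(grade in PASSING for grade in grades) / len(grades)


old_rate = pass_rate(records, "old")
new_rate = pass_rate(records, "new")
print(f"old 121: {old_rate:.2f}   new 121: {new_rate:.2f}")
```

Even this small sketch forces the decisions the slide calls out: what precisely is collected (one CS 221 grade per student, tagged by CS 121 version) and how precisely it is computed (the share at or above a chosen threshold). It says nothing about the many caveats, such as the two groups of students differing in other ways.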
Clarifying Questions
Your boss, client, etc. sets goals, may have questions
But they need to be refined before meeting data!
Refine this through clarifying questions!
Wrapping Up
Multiple layers to translate between data and high-level goals or objectives.
Questions bridge this gap.
Multiple levels of questions may be needed.
- In this video, we're going to talk about asking questions.
- What makes a good question? How does a question relate to the broader context of what we're trying to do in this class?
- The learning outcomes for this video are for you to understand what makes a good question,
- understand how it relates to goals and analysis, and start to think about data for a question.
- We're also going to introduce a key term, operationalization, that is going to come up throughout the rest of the class.
- To set the stage, I want to review the definition of data science that I introduced in the class introduction video: we're learning
- about how to use data to provide quantitative insights on questions of scientific, business, or social interest.
- But in order to do that effectively, we need to be able to write good questions, refine those questions,
- and connect them both to the data we might be able to use to provide these quantitative insights and to the goals,
- the business purposes or scientific purposes, for which we're asking the questions in the first place.
- So I want to work through this with you with an example.
- So suppose in the Boise State Computer Science Department, we have our introductory classes:
- CS 121, 221, and 321. Suppose we make some change to
- CS 121, like we change the way we do the assignments, and we want to assess whether this new change improved
- CS 121. So we have a business purpose here: we're making a change to one of our courses,
- and we want to see if that change is improving the course in some way.
- But in order to do that, we need to identify a number of things, such as: what does it mean to improve
- CS 121? What data could we use to try to inform this assessment of whether we improved
- CS 121? And what could we do with that data to measure improvement?
- This process is called operationalization.
- We have a goal: assess whether we improved 121. That, in turn, is in service of the broader goal of delivering a high-quality undergraduate education.
- Then we refine through intermediate questions (I'm going to show some of those in a bit) to determine a specific measurement to take.
- And at the end of the day, if we have fully operationalized the goal or a question,
- we know precisely what data we're going to collect, or has been collected, and what measurement
- or measurements we're going to compute over that data in order to try to answer our question.
- We use the term in a couple of senses. First, operationalize is a verb:
- it's the process of doing this. Then operationalization, the noun formed from that verb,
- is the result of this process.
- So the specific measurement and analysis that we're going to do over specific data can be called an operationalization of the question.
- So we have our goal of assessing whether some change improves
- CS 121. We can ask an intermediate question:
- OK, so what does it mean to improve it? Well, are students better prepared to go excel in the workplace?
- Well, this is the freshman class. It's a while until the students are going out on the job market
- or we have information on how well equipped they were. So can we ask a shorter-term question
- that's going to help us get to that? Are students better prepared for the next class?
- And we call this intermediate question a proxy.
- So if our goal is to better prepare them for doing the work we're training them for, the proxy can be:
- well, are they better prepared for the next class?
- So questions don't have one level, and there's a path here from our goal (improve education, deliver a high-quality education),
- through the subgoal of assessing whether this change that was intended to improve the educational effectiveness of our introductory
- programming class actually did so, all the way down to the data that we can use in order to try to measure it.
- We can also have multiple levels of questions, as we've already seen.
- Are they prepared for their work? Well, that's a long timeframe.
- It's difficult to measure that on the timeframe we need in order to iterate on class structures.
- So we step down one level. We use this proxy:
- are they better prepared for the next class?
- So if we want to think about the quality of our questions, we need a way to assess whether a question is good.
- And there are a couple of ways we do that. One is looking upward.
- The question should advance the goal, and we should be able to look at the goal and look at the question and say, yes,
- answering this question does move us forward in this goal.
- No one question is going to be the complete answer to our goal.
- But "are students better prepared for the next class?" moves us one step closer.
- We can say yes, if students are better prepared for the next class,
- that is probably evidence that we have improved the effectiveness of the introductory class.
- Also, though, carrying out the analysis should answer the question.
- We want to work our questions down to the point where we have a question that's specific.
- It's clear that the question will advance either the top-level goal or a higher-level question that in turn advances the goal.
- But also it's specific enough that we can look at a data analysis plan:
- here's the data we're going to use, here's the measurements we're going to take, here's the analysis we're going to perform. And we can say, yes,
- doing this data analysis plan will answer this question, or at least answer the question in a useful sense.
- And so if we can make those connections, so that we can see that doing the analysis will answer the question and
- answering the question will advance the goal, then we have a connection.
- We have a connectedness between the analysis that we can actually do over the data
- and the question or the goal that we're trying to advance through this data analysis.
- So a fully operationalized question is going to be specific, and it's going to be answerable with the available data.
- Now, there are lots of useful questions that we can't answer with available data.
- That does not mean they're bad or we should ignore them. They're incredibly useful for contextualizing the limits of a data analysis that we do.
- We have a data analysis. It can answer one question that will advance the goal.
- There are three other questions related to the goal that cannot be answered by our analysis.
- Well, that's useful in our report to talk about the limitations: we can answer this question,
- but we can't answer these others. Maybe we can think about how to answer those other questions.
- But when we're trying to get down to a question that we can answer with data
- (and remember, we're talking about data science as quantitative insights into these questions),
- we want to see: can we actually answer the question with data? And can we match the analysis plan to the question to the goal?
- So to go back to our example of trying to measure the effectiveness of improving 121: are students better prepared for the next class?
- Well, we can make that more specific. Are they more likely to pass
- CS 221? Now we have a very specific question.
- We can answer it with the student grades from CS 221. We can look at students who took
- our new CS 121 and students who took our old CS 121,
- and we can compare the pass rates. Now, there are many caveats.
- There are a lot of challenges to doing this properly. It can only measure one piece of what's going on.
- But it's a specific question that we can answer with data:
- are students in the new version of our intro class more or less likely to pass the next class?
- We'll get to talk more about this question in the next video.
- Now, to get to this kind of a question, I've given you the example and worked through it here.
- In practice, you're going to need to work with your boss, your client, your advisor, or other stakeholders,
- whoever is going to be acting on the results of your data analysis, which may be yourself,
- to get to these fully operationalized questions. They're going to have goals.
- They may have some high-level questions; they may have some specific questions that can't map to the data.
- One of the key ways to be able to do this refinement is through clarifying questions.
- So if the department chair came to you and said,
- "I would like you to help me measure the effect of this improvement to CS 121," well, then we can ask questions.
- What do we mean by improve? What would be evidence that we did improve
- CS 121? And so we're going to have practice in the synchronous time.
- That's one of the things we're going to do this week: thinking about clarifying questions.
- These are clarifying questions that you can ask your client
- (we're going to use the term client generally for whoever you're doing the data
- analysis for) to figure out what they actually want and what you can do with the data
- that's going to advance their goals. So to wrap up, there are multiple layers to translate between our high-level goals
- (deliver a high-quality undergraduate education) and what we can actually do with data
- (measure whether this change increased students' ability to pass the next class).
- Questions bridge this gap, and we can have multiple layers of questions in order to get from a high-level goal to something we can do with data.
- You're going to be doing this a lot through the rest of the semester.
🎥 Questioning Questions
We make our operationalizations better by questioning them. What do they capture? Who or what do they prioritize?
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
QUESTIONING QUESTIONS
Learning Outcomes
Identify stakeholders and subgroups for a problem
Evaluate the operationalization of a question to see who and what it prioritizes
Key concept: understand a metric by asking what makes it improve
Example from Last Video
Goal: Assess whether new change improved CS121
Question: Are students better prepared for next class?
Question: Are students more likely to pass CS 221?
Data: student grades for CS121 & 221
Caveats: many
Measuring the Question
Option 1: Pass Rate
What fraction of students pass CS221?
Does the new 121 method improve it?
Option 2: Grade
What is the average grade in CS221?
Does the new 121 method improve it?
Letter or course points?
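To make the "letter or course points?" choice concrete, here is a minimal sketch of Option 2 computed both ways. The 4-point mapping in `GPA_POINTS` and the five results are assumptions for illustration; institutions use different scales, and none of this is real course data.

```python
from statistics import mean

# Assumed 4-point scale; treat this mapping as illustrative only.
GPA_POINTS = {
    "A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7, "D": 1.0, "F": 0.0,
}

# Hypothetical CS 221 results: (letter grade, course points out of 100).
results = [("A", 95), ("A", 99), ("B", 84), ("C-", 70), ("D", 62)]

# Average computed the way a GPA would be, from letter grades.
mean_letter = mean(GPA_POINTS[letter] for letter, _ in results)

# Average computed from the underlying course points.
mean_points = mean(points for _, points in results)

print(f"mean grade points: {mean_letter:.2f}")
print(f"mean course points: {mean_points}")
```

The two averages answer slightly different questions. With letter grades, a 95 and a 99 both count as an A and are indistinguishable; with course points, that difference moves the average.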
Understanding Measurements
How do I improve this measurement?
Improving Metrics
Option 1: Pass Rate
Increase the number of people who pass the class
The only way to improve it is helping those w/ the most difficulty
Option 2: Grade
Make students do better
Can be improved at the top (A- → A)
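One way to ask "how do I improve this measurement?" is to change one student's outcome at a time and see which metric moves. The sketch below does that with four invented students; the grade scale, the C- passing threshold, and the helper names are assumptions for the example.

```python
from statistics import mean

# Assumed 4-point scale and passing threshold (C- or better); both are illustrative.
GPA_POINTS = {
    "A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7, "D": 1.0, "F": 0.0,
}
PASS_THRESHOLD = GPA_POINTS["C-"]


def pass_rate(grades):
    """Option 1: fraction of students at or above the passing threshold."""
    return sum(GPA_POINTS[g] >= PASS_THRESHOLD for g in grades) / len(grades)


def mean_grade(grades):
    """Option 2: average grade on the assumed 4-point scale."""
    return mean(GPA_POINTS[g] for g in grades)


# Hypothetical CS 221 grades for four students, plus two "what if" variants.
baseline = ["A-", "B", "C-", "D"]
top_helped = ["A", "B", "C-", "D"]      # the A- student now earns an A
edge_helped = ["A-", "B", "C-", "C-"]   # the D student now earns a C-

for name, grades in [("baseline", baseline),
                     ("top helped", top_helped),
                     ("edge helped", edge_helped)]:
    print(f"{name:12s} pass rate = {pass_rate(grades):.2f}, "
          f"mean grade = {mean_grade(grades):.3f}")
```

Under these assumptions, helping the student at the top changes the mean grade but leaves the pass rate where it was, while helping the student on the edge moves both. That is the sense in which each metric prioritizes a different group of students.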
Measurement
You get what you measure.
Stakeholders
Students
Faculty
Department
Employers
University
High-performing students
Underprepared students
Marginalized students
How Do I Improve This?
Clarifies metric behavior
What changes to improve it?
What can remain the same while it is improved?
How can it be gamed or manipulated?
Metrics are always lossy and reductive, but this question helps evaluate and improve.
Wrapping Up
It is crucial to appropriately define measurements, and question our definitions.
How do we improve a metric?
Who or what does a metric prioritize?
Photo: Nicolos Tissot on Unsplash
- Welcome back. In this video, I want to talk with you about what I'm going to call questioning questions. I want us to dig deeper:
- we have a question, and we have a candidate operationalization. How do we understand what it's actually going to be doing, and how do we evaluate it?
- So our learning outcomes here are to be able to identify stakeholders and subgroups for a
- problem and to evaluate the operationalization of a question to see who and what it prioritizes.
- A key concept I want you to take away from this lecture is that we can understand
- a metric or a measurement by asking what it takes to make it improve.
- So to return to the example from the previous video, we want to assess whether a new change has improved
- our introductory computer science class.
- One way we can try to do that is to look at the grades of students who have taken the next class,
- CS 221, and see if they're more likely to pass the next class.
- We can measure this by looking at the grades. But this is not the only way that
- we can try to measure whether we've improved CS 121.
- So this option, I'm going to call it option one: we're looking at the pass rate. What fraction of students pass
- CS 221, for whatever definition of pass? Say they get a C-minus or better.
- We could raise it: we could say what fraction get a B, or get a B-minus or better.
- But what fraction of students pass, and does the new CS 121 method improve it?
- Another way we could try to look at it would be to look at the grades students receive in 221.
- So what is the average grade in CS 221 for students with the new intro,
- compared with the previous intro? Does the new technique in intro improve it?
- We then have a question: do we want to look at letter grades, and maybe compute the average with the same formula used for GPA?
- Or do we want to look at course points, so actually look at whether they got a 95 or a 99?
- Either way, we can look at the movement in the average grades. Now we have to figure out
- which of these we want to do. And a key tool that I want you to become familiar with for evaluating and understanding different measurements,
- metrics, statistics, etc., is to ask: how do I improve this measurement?
- So we're computing a statistic: we're computing a pass rate, or we're computing an average grade.
- How do I make this measurement better or worse? And this is going to give us a lot of insight into how the measurement works and what it prioritizes.
- This way of thinking about a measurement or a problem,
- I think, is going to serve you well throughout this class and throughout the rest of your education and work.
- So let's talk about these measurements and how to improve them.
- So if we want to improve the pass rate, and we're not changing CS 221 itself
- (we could improve the pass rate by just passing everybody in 221,
- but we're leaving 221 alone and trying to improve the pass rate
- by making a change to the class that prepares students for it),
- the only way to improve this is by helping those students who are going to have the most difficulty with
- CS 221. If a student goes through 121 and they're going to get a B or an A in
- CS 221,
- no change that we make to their 121 experience is going to improve our metric.
- The only way to improve it is by helping more students move
- from a D to a C-minus. Option two, measuring the grade: we can improve this by making students do better,
- or rather, by enabling students to do better.
- If a student that would have gotten an A-minus under the previous 121 is now prepared to the point where they'll get an A,
- we improve it, and we improve it just as much as if we move a student from a C-minus to a C.
- And so there's more opportunity to make the grade better.
- But we can improve this metric by helping only the students who were already well-prepared for CS 221.
- So this brings us to a key point: you get what you measure.
- We're going to see this really clearly once we start optimizing machine learning models:
- when you set something up as your optimization criterion, that's what you get.
- If you evaluate changes to the introductory class by people passing the next class,
- that structures the pedagogy, the course design, and the teaching evaluations to favor preparing students to pass the next class.
- If you measure the increase in average grade, that structures them to improve average grades,
- but it might focus the attention more on helping the students who were going to get a pretty good grade
- anyway.
- So we're looking here at two subgroups. One metric, the pass rate, clearly favors the students who are on the edge:
- you can only improve it by helping the students who are on the edge. The average grade, you can improve by helping across the board.
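The contrast between the two metrics can be sketched in a few lines of Python. This is only an illustration: the six grades are made up, the grade-point values assume a common 4.0 scale, and the passing threshold of C minus is taken from the "D to a C minus" example in the video.

```python
# Assumed grade-point values (a common 4.0 scale; not taken from the course).
GRADE_POINTS = {
    "A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7, "D": 1.0, "F": 0.0,
}
PASSING_POINTS = GRADE_POINTS["C-"]  # assumption: C- or better is a pass


def pass_rate(grades):
    """Fraction of students whose grade is at or above the passing threshold."""
    passed = sum(1 for g in grades if GRADE_POINTS[g] >= PASSING_POINTS)
    return passed / len(grades)


def mean_grade_points(grades):
    """Average grade, computed the way a GPA would be."""
    return sum(GRADE_POINTS[g] for g in grades) / len(grades)


# Hypothetical CS 221 grades for the same six students in three scenarios.
baseline = ["A-", "B", "C", "C-", "D", "F"]
help_top = ["A", "B", "C", "C-", "D", "F"]      # one student moves A- -> A
help_edge = ["A-", "B", "C", "C-", "C-", "F"]   # one student moves D -> C-

for name, grades in [("baseline", baseline), ("help top", help_top), ("help edge", help_edge)]:
    print(f"{name:10s} pass rate={pass_rate(grades):.2f} mean={mean_grade_points(grades):.2f}")
```

In this made-up data, moving a student from A minus to A raises the average grade but leaves the pass rate unchanged, while moving a student from D to C minus raises both.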
- There are a variety of different stakeholders in the design of an introductory class.
- There are the students themselves, who are going to be learning. There are the faculty who have to teach it:
- either they teach it directly, or they depend on students being prepared by that class as a prerequisite for the class
- they do teach. The department, obviously, is a stakeholder because it has an interest in providing a good education
- that students want to come for and that employers want to hire for.
- Employers want students to come out of the program well-prepared.
- The university has an interest in having programs that produce well-prepared students
- and that are also attractive to students, to increase enrollment numbers.
- But then even within the broad stakeholder categories, we have different subgroups.
- So just within students, we can talk about high-performing students.
- We can talk about students who are underprepared for one or another class.
- We can talk about students who are in some way or another marginalized.
- And they may experience changes in the structure of the class, the delivery of the class, or the assessment differently.
- So even once we've identified a stakeholder group, not every subset of that stakeholder group is going to experience
- what we're trying to study, or be reflected in the data, in the same way.
- We need to be able to identify these different groups to understand what it is that we're actually measuring.
- So I want to return, though, to this key question: ask, how do I improve this?
- When we're faced with a metric, that really helps us clarify how a metric or a measurement behaves.
- What changes to improve it? Also, what can remain the same
- while it is improved? And then another important question is, how can it be gamed or manipulated?
- Metrics are always what we call lossy and reductive.
- Lossy means that no measurement captures everything about a phenomenon we care about.
- And reductive means that we're taking a complex phenomenon
- and reducing it down to one or a handful of measurements. We always lose something there.
- But this question helps us evaluate what we're losing, iterate and improve our metrics, and also
- assess whether those weaknesses are actual challenges to validity or just something we need to keep in mind.
- So to wrap up, it's crucial to appropriately define our measurements and then to question our definitions.
- A useful way to do that is to ask how we improve a metric that we're thinking about using.
- And then also we want to look at who or what a metric prioritizes,
- and whether that prioritization is consistent with what we want to accomplish through organizational, business, or scientific goals.
Note
In this video, I talk about asking what it takes to improve a metric. It would be more
precise to ask what would change a metric; whether increasing or decreasing a metric is
an improvement depends on the metric and application context. The metric itself does not
improve or get worse without that context.
When I re-record this video for a future revision, I'll change my language.
Textbook Chapters - Questions
The first videos of this class go with chapters 1 and 2 of Think Like a Data Scientist.
👨‍🏫 Tuesday Class
On Tuesday, we will meet on Zoom and do the following:
🚩 Week 1 Quiz
The Week 1 individual quiz (in Canvas) is due at 8am Thursday.
It is only over material prior to this point.
You will have 30 minutes from when you start the quiz to complete it.
These quizzes are never going to be terribly long; this one is particularly designed to be pretty easy, as an
initial quick check and to give you experience with the quiz process.
Tip
I recommend watching the remaining videos before class Thursday as well, but they will not
be tested over in the quiz.
🎥 Our First Python Notebook
This video shows you how to start Jupyter, create a notebook, and run Python code.
It also shows you how to prepare a notebook to submit as an assignment.
CS 533INTRO TO DATA SCIENCE
Michael Ekstrand
OUR FIRST PYTHON NOTEBOOK
Learning Outcomes
Start Jupyter
Run Python code
Prepare a notebook for submission
USING ONYX
- In this video, I'm going to show you how to start up the Jupyter environment that we're going
- to be using for our Python programming and write some of our first Python code.
- I'll also walk through the steps needed to turn a notebook into a PDF
- that you can submit on Blackboard for one of the assignments. I've already installed Anaconda;
- you can find the instructions for that on the course web site. On Windows, when we've installed Anaconda, we get
- a new kind of prompt available in the Start menu, so I'm going to start the Anaconda PowerShell Prompt.
- This starts up a PowerShell command line
- that has Anaconda activated. The process for doing this
- on Linux or on Mac is slightly different, although you can also start the prompt from the Anaconda Navigator.
- I will show you in another video how to activate Anaconda when we have it installed on Onyx, which will also apply to other Linux systems.
- So I'm in my Anaconda prompt, in my home directory. I'm going to go to the directory I've created for working on CS 533.
- So I'm going to cd into Documents, CS 533, assignments, and here I'm going to start the Jupyter environment with jupyter notebook.
- We're going to be doing our work in what we call notebooks. They're a part of Jupyter. We can start this at the command line with jupyter notebook.
- It's going to start up the Jupyter system and open it up in our web browser.
- The web browser is the interface that we use to work with Jupyter and interact with notebooks.
- So if I had some notebook files in here, they would be listed in the notebook list and we could open them.
- In the assignment, you'll download the starter notebook and save it in the directory you're working in.
- When you run jupyter notebook, it will appear. But right now, I'm going to create a new notebook.
- I'm going to create a new Python 3 notebook because Python 3 is what we're using in this class.
- It's a new notebook and it's untitled, so I'm going to give it a name here.
- I'm just going to call it demo notebook because it's the notebook that I'm using to demonstrate.
- If I go back to our notebook list, we now see it, with the extension .ipynb. The .ipynb file is the source file for the notebook.
- You're going to be submitting those as one of the things you submit in your assignment. Now, a notebook is made up of cells.
- Right now we have one cell here, so I want to put some code in it.
- I'm just going to write the string hello, world. A string is enclosed in quotes; in this case, it's enclosed in double quotes.
- And I am going to hit Shift+Enter. Shift+Enter runs the cell that we're currently in.
- Now it's labeled In [1]: it's the first cell that we ran. And it has an output, Out[1], that says hello, world.
- When you run a cell and the last line of the cell is an expression that has some value, Jupyter will show you that value.
- So because the last and only line of the cell is the string value
- hello, world, it shows me the value hello, world.
- If I put in a number, 5, it will show me the number 5.
- It only shows the value of the last line, but it lets us very quickly just see an object.
- We don't even need to worry about print statements. If we do want to create output, we can
- call print, and it will print the output the way Python would usually print it.
- It shows up as text. Jupyter is going to show the output of our program here.
- It's formatted a little differently because it's output; it's not just showing the result of an expression.
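The difference between a displayed value and printed output can be approximated outside a notebook. The sketch below assumes that what the notebook shows for a plain string is its `repr()`; that is an approximation of the notebook's display logic, and the variable names are only for illustration.

```python
greeting = "hello, world"

# print() writes the characters of the string itself, without quotes.
print(greeting)

# When an expression is the last line of a cell, the notebook displays the
# value rather than printing it. For a string this looks like its repr(),
# which includes the quotes.
shown_by_notebook = repr(greeting)
print(shown_by_notebook)
```

In a notebook cell, typing `greeting` alone on the last line plays the role of the second `print` here.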
- These code cells are not the only cells that we can have in a notebook.
- So I'm going to insert a new cell above this one,
- and I am going to change its type to Markdown.
- In a Markdown cell, we don't write Python code; we write text.
- So I wrote the text: this is the demonstration notebook to show you how to run Python code.
- And if I run this cell, it just renders it as text.
- Now I can edit it. I'm going to double-click to edit it again.
- This supports all Markdown features, so we can give this notebook a heading.
- We always want to begin our notebook with a level-one heading, which is done with a single hash, that gives the title of the notebook,
- so that when we convert it to another format, we're going to have the title right there.
- Markdown cells support a variety of formatting features,
- such as bold and italics,
- and also bulleted lists and numbered
- lists. I'm going to stick another cell in here. While I used the menu before,
- I can also hit the A key and it will add a new cell above, and M changes it to Markdown.
- There are keys that will let us navigate the notebook quickly.
- Also, the notebook is what we call modal:
- the interface has two modes. If the cell is surrounded in green, I'm editing the contents of this cell.
- We can also show mathematical expressions like y = x + b: you put them in dollar signs and they're going to be rendered.
- I'm going to hit Shift+Enter again, and now the math is showing up like math.
- When the cell is highlighted in blue, we're not editing the contents of the cell, but rather we are moving around cells.
- So the up and down arrows move between cells, and keys like A will add a cell instead of typing an a in the cell.
- Once we're on a cell, we can hit Enter to edit the cell and Escape to change back to the mode where we navigate cells.
- So now we have this notebook. Control-S saves the notebook. Jupyter has its own set of menus so we can do a variety of things like, again, save,
- save the notebook with a new name, or make a copy of the notebook.
- We're also going to submit the notebook. In order to give you feedback,
- I want a PDF of your notebook so that I can use Blackboard's PDF markup tools to give you feedback on your assignments.
- Jupyter has the direct ability to create a PDF, but unfortunately it requires an entire LaTeX installation
- and additional software on top of that in order to go from a notebook to a PDF.
- So what we're going to do instead is go into the notebook's print preview.
- I click File in the notebook interface and go to Print Preview.
- This shows a trimmed-down version of the notebook that's not interactive.
- It doesn't have any of the interface. And now we can print this version of the notebook.
- When you go to print, your browser is going to let you save as a PDF.
- So we're just going to use that Save as PDF. I'm going to go put it in my assignments directory
- as the demo notebook PDF, and close this window.
- Now, if I go to that directory, I'm going to see both my .ipynb file,
- which is the actual notebook file itself, and the PDF file I just exported.
- I can look at the contents of that PDF file and it looks like we expected.
- We see the notebook title at the top, where we wrote that level-one heading.
- We see all of our output. When you're submitting an assignment,
- what I want you to submit is both the .ipynb notebook file and the PDF file.
- When we're done with a notebook, we go to the File menu and choose Close and Halt.
- This closes the notebook tab, but it also shuts down the Python instance that's running in the background to let us run the code in the notebook.
- If you don't Close and Halt, you're going to wind up with a bunch of Python instances kicking around that you may not want.
- We're going to see more features of Jupyter as we go through the class, including things to manage the Python processes that are running.
- But now you've seen the first steps of how you can open Jupyter.
- You can create a notebook. You've seen notebook cells, and you've seen how we can take this notebook and create output
- that you're going to submit when you submit the results of an assignment.
Note
I have seen cases where the instructions to export to a PDF do not work correctly in Firefox; they result in a mangled
PDF file that does not correctly display charts. If you encounter this problem, I have documented some fixes over in
the common problems page.
👨‍🏫 Thursday class
In class on Thursday, we will:
Meet our teams
Take the Week 1 team quiz (over the same material as the individual quiz)
Debug software installations
Activity to dive deeper into defining problems and questions
Textbook Chapters - Python
The Python material we are working on this week is a subset of the material in chapters 1–4 of
the textbook. I don't expect you to get through all 3 chapters
thoroughly this week, and we will be introducing more Python features as we need them throughout the
semester. I will note specific chapters and sections relevant to videos in their Resources
subsections.
🎥 Data Types and Control Flow
This video introduces fundamental Python data types and operations, along with variables and basic control flow.
CS 533INTRO TO DATA SCIENCE
Michael Ekstrand
DATA TYPES AND OPERATIONS
Learning Outcomes
Understand basic Python data types and operations
Store Python objects in variables
Write simple Python code to do arithmetic
Perform basic operations with lists and dictionaries
More resources linked!
Numbers
Python supports numbers like other programming languages
Integers: 1, 2, 47
Floating-point (decimal-ish): 3.5, 1.9, 2.4492
Can write in scientific notation: 6.02214e23
Can be negative: -3.14
Numeric Operations
Normal arithmetic operations behave how you expect:
+ - / *
Power: 2**5
Division is always floating-point: 5 / 2 is 2.5
Floor (integer) division: 5 // 2 is 2
- In this video, we're going to talk about data types and operations in Python to get you started
- on being able to write some of your own Python code.
- Our learning outcomes for this video are for you to understand basic Python data types and operations,
- to be able to work with Python variables, storing objects in them, to write some simple Python code
- to do arithmetic, and to perform basic operations with lists and dictionaries. Also, this slide deck is a notebook.
- So rather than the little embedded slides widget like we have for a lot of the videos, for this one there will be a link to the notebook.
- You can download it and run the code yourself.
- There are also going to be more resources linked in the class notes that I'll talk about briefly at the end of this video.
- Python supports a couple of primary types of numbers. First, we can write integers just by writing the number.
- There's no decimal point in there. And if we run that, it's just a Python line.
- A Python line is called a statement. A statement can have something like an if or something like that,
- or it can just be what we call an expression. An expression is a set of operations that results in a value.
- So just writing the number itself is an expression. So we can write an integer.
- We can write a floating point number with a decimal point. These are stored in floating point format.
- There are a couple of nuances about that, which we'll talk about when we talk in more detail about different types of data.
- We can also use scientific notation with the E notation:
- 6.02e23 is Avogadro's number, a mole. We write e23, and that means times ten to the twenty-third power.
- We can also do arithmetic on these numbers. The usual arithmetic operations (addition, subtraction, etc.) work as we would expect.
- They work just like they do
- when you're writing them in math, and like they work in other programming languages. We can add 5 and 2. Order of operations is respected,
- so in 3 times 6 plus 2, the multiplication happens first.
- We can also use parentheses to change the grouping, so we can add 2 and 3 before multiplying by 6.
- It works like you would expect from almost any other programming language.
- No surprises here if you're familiar with Java or Perl or something else.
- Those are our basic arithmetic operations.
- If we want to raise something to a power, the star-star operator (**) is what Python uses.
- So two to the fifth power is 2 ** 5.
- We can also get a number of other mathematical operations from two different Python modules, math and numpy.
- NumPy has duplicates of most of the math ones, so I usually just work with numpy.
- You have to import a module before you can use its functions. So I'm going to import the numpy module here and give it a shorter alias, np.
- It is very common in Jupyter notebooks that we import numpy as np, so then
- we can just write np.log to compute the natural logarithm of the number 20.
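The numeric operations described so far can be collected in one short sketch. To keep it runnable without extra packages it uses the standard-library `math` module instead of numpy; `np.log(20)` returns the same value for a single number. The variable names are illustrative only.

```python
import math

# Floating-point number written in scientific notation: 6.02214 times 10**23
avogadro = 6.02214e23

# Order of operations matches ordinary arithmetic
times_then_plus = 3 * 6 + 2      # multiplication first
plus_then_times = (2 + 3) * 6    # parentheses change the grouping

# Power, true division, and floor (integer) division
power = 2 ** 5
true_division = 5 / 2
floor_division = 5 // 2

# Natural logarithm of 20 (math.log here; np.log behaves the same for a scalar)
log_20 = math.log(20)

print(times_then_plus, plus_then_times, power, true_division, floor_division)
print(log_20)
```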
- We can also store values in variables.
- We just give them a name, so x = 7 stores the value 7 in the variable x. There is no declaration necessary, unlike in Java.
- Just assign the value to a variable; then we can use it.
- So x + 5 is going to return 12, because x currently stores the value 7.
- Now if we change the variable, so we say x = 2,
- OK, it's changed the variable. In Python, the variables are all stored in a common memory space,
- and the Jupyter notebook runs the cells in the order we ran them.
- It shows us here that number, in the In label, which is the order in which that cell was run: 9, 10, 11. So we've changed the value of x here.
- If we go back up and rerun this cell, it's going to use the new value of x.
- This is important to keep in mind, and it's an easy way to get your notebook very confused if you've been running cells out of order.
- When we're developing a notebook, working on a data analysis, we'll often run things out of order and try things out.
- But it's important to keep things clear and consistent in your notebook so that if you were to rerun the notebook from top to bottom,
- it runs and produces the correct results. You can do experiments, but before you go,
- say, to submit your notebook to me in an assignment, or before you go to submit it to your client or use it for your final analysis,
- make sure that if you rerun it from top to bottom,
- you get the right results, so that you can be confident that you're actually computing the results you want
- and there's not something that's just an artifact of the order in which you happened to run the cells.
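A small sketch of why execution order matters: the same expression gives a different answer once the variable has been reassigned. In a notebook, each statement below could sit in its own cell; here they simply run top to bottom.

```python
x = 7
first_result = x + 5     # x currently holds 7

x = 2                    # reassigning x changes what any later run sees
second_result = x + 5    # the same expression now uses the new value

print(first_result, second_result)
```

Rerunning the `x + 5` cell after the reassignment is what produces the second value, which is why a top-to-bottom rerun is a useful final check.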
- So we've seen numbers and we've seen variables. We can also write strings.
- We can put them in quotes. Python takes both double and single quotes.
- There's no difference between them. The backslash is an escape character.
- So if we want to have double quotes in a double-quoted string, we can do that by prefixing them with a backslash.
- One of the fundamental string operations is to concatenate two strings. If you have strings,
- the plus operator, the same operator we use for addition with numbers, concatenates the two strings.
- So hello plus world is hello world. There's a bunch of other operations.
- For example, split separates a string into a list.
- By default, if you don't tell it how to split, it uses whitespace. So this is going to split the string
- hello world into a list of two items: hello and world.
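A brief sketch of these string operations follows; the example strings other than "hello world" are made up for illustration.

```python
# Double and single quotes create the same kind of string.
escaped = "She said \"hi\""     # backslash escapes the inner double quotes
unescaped = 'She said "hi"'     # single quotes outside avoid the escaping

# The plus operator concatenates strings.
greeting = "hello" + " " + "world"

# split() with no argument splits on whitespace and returns a list.
words = greeting.split()

print(escaped == unescaped, greeting, words)
```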
- Python is strict about types: every object, every value has a type,
- and it won't auto-convert them. If you've programmed in Perl or JavaScript or PHP and you take a string and a number and concatenate them,
- it tries to convert the number to a string. Python won't do that.
- So if we try to add a number to a string, it's going to give us a type error that tells us what's going on.
- One of the skills you're going to need to develop in this class is the ability to read error messages.
- This error message tells us a few important things. It tells us that the error is a TypeError.
- Other errors you're going to see are ValueErrors, IndexErrors, KeyErrors, et cetera.
- But this is telling us a TypeError, which means that we're trying to do something with the wrong type of data.
- It then tells us two other things. It gives us this traceback of the code,
- so it shows us where in the code it went wrong. We're not calling any library functions or anything like that;
- we're just trying to add a string and a number. We have our one line of code, so it's showing us that it happened on line 1, 'Maroon ' + 5.
- And then it tells us a little bit more about the error: can only concatenate str (not "int") to str.
- What this is telling us is that concatenation only works on strings.
- You can't concatenate things that aren't strings to a string.
- And what we have here, we have a string and we have a number.
- So if we want to put 5 at the end of our string, we can convert it with the str function.
- str is a function that takes an object and returns a string representation of that object intended for human consumption.
- If we do this, then we get a string, that will concatenate correctly, and we get the string Maroon 5.
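The failing and working versions can be sketched side by side. The `try`/`except` is added here only so the example keeps running after the error; in the video the error is simply shown in the notebook.

```python
try:
    band = "Maroon " + 5          # str + int: Python refuses to auto-convert
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Converting the number with str() first makes the concatenation valid.
band = "Maroon " + str(5)
print(band)
```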
- So we've now seen three different kinds of operations that we can perform on python values.
- We've seen operators like plus: the binary operators that go between their two operands, so we can say 6 + 7.
- There's quite a few of these operators. We've seen a function, which in this case comes from a module:
- np.log. A function takes a value and returns some other value.
- We can compute the natural log of 10. And we've seen a method, which is a function that's attached to an object:
- split. The log function isn't attached to any particular object;
- it's just a function hanging around. But the method
- split is going to work on the hello world string
- and split it. If you're familiar with Java, they're like methods in Java:
- they operate on a particular object. The Java equivalent of a Python function would be a static method.
- So we've seen these three different kinds of operations. In this class, we're going to learn how to write functions.
- Eventually, we're going to learn how to write our own methods. But we aren't going to need that for the vast majority of this class.
- You can also define how operators work on custom data types in Python.
- We're not going to do that; learning to do that is outside the scope of this class.
- But it is how some of the libraries that we're going to be using work on the inside.
- So we've got these different kinds of operations. There are a few other things that we can do,
- and a few other data types you can work with. Now, that split method returns a list:
- hello, world. In Python, we write lists with square brackets and commas separating the values.
- So I can make a list that consists of these three values: Martin, Cross, and Gripps.
- I can also save a list in a variable. Now I have the variable rowdy3
- that contains a list of these three names. We can then add to the list.
- So if we do rowdy3.append with Vogel, we're going to now have a list that contains Martin, Cross, Gripps, and Vogel.
- Now, notice that in this code I did rowdy3.append and then I just wrote rowdy3.
- That's because, if you remember from the previous video, Jupyter shows the value of the last expression in your cell.
- rowdy3.append doesn't return anything. The list append method
- adds something to the end of a list, and it actually modifies that particular list.
- We don't have a new list here. It modified our list object and stuck Vogel at the end of it,
- and it doesn't return anything.
- So what I often like to do when I do an operation like this is, at the end of the cell,
- I just put the variable that I've been modifying so that it'll then show me what's currently in the variable.
- So we can see that after we appended Vogel to the list stored in the variable rowdy3,
- the list now consists of four items and includes our new entry at the end.
- Lists are indexed starting with zero, so rowdy3[0] gives us Martin.
- We can index backwards from the end: rowdy3[-1] gives us Vogel. A slice takes multiple elements from a list.
- So rowdy3[1:3] gives us the elements at positions 1 and 2.
- What it does is it starts at the first index of the slice and gives
- you all of the elements up to, but not including, the last index of the slice.
- So 0 is the first item, 1 is the second item.
- So it's giving us items 1 and 2, and then 3 is one past the end.
- We call this a half-open interval because it includes the left side and not the right side.
- They're very, very common when we're using zero-based indexing in a data structure because it's a very convenient way to express a range.
- Also, the length of the range is the end minus the beginning: 3 minus 1,
- so it's going to give us a list of length 2. One more thing we can do here: the len function
- is a standard Python function that will give you the length of anything that has a length, like a list.
- A number of other data structures have lengths. Most data structures that can contain other data structures will have a length.
- It will also work on a string. The length of rowdy3 is currently four.
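The list operations from this part of the video, as a compact sketch. The variable name `rowdy3` and the spellings of the names are reconstructed from the audio and may differ from the original notebook.

```python
rowdy3 = ["Martin", "Cross", "Gripps"]

# append modifies the list in place and returns None.
append_result = rowdy3.append("Vogel")

first = rowdy3[0]        # indexing starts at 0
last = rowdy3[-1]        # negative indexes count back from the end
middle = rowdy3[1:3]     # half-open slice: positions 1 and 2, not 3

print(rowdy3)
print(first, last, middle, len(middle), len(rowdy3))
```

The slice length is the end minus the start (3 - 1), which is one reason half-open intervals pair well with zero-based indexing.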
- We can also loop over a list. This loop here is going to loop over each person in the list
- stored in rowdy3 and print the person. So we get our four people:
- Martin, Cross, Gripps, and Vogel. What if we want to know the position of each item in the list as we go through the loop?
- The enumerate function wraps a list and gives us each item along with its position in the list as we go through the loop.
- And then this string here that's prefixed with an f: we call this an f-string.
- When you put an f right before the opening quote of a string,
- you can then use curly braces and variable names to include variable values in the string.
- It's one of the ways that Python lets you easily build up strings that contain additional data.
- So we're going to run this loop,
- and now we see each person is prefixed with their member number.
- It's starting from zero because, as we saw before, Python always starts from zero.
- So the first one is member zero. This enumerate function is just giving us the positions along with each item.
- Python for loops operate over what it calls iterables. An iterable is just an object that you can use in a for loop.
- Lists are iterable. But if you want to loop over a sequence of numbers, like you might in Java,
- so you want to go 0, 1, and 2, you use a range.
- So this is going to print 0, 1, and 2, because, again, Python does not include the upper bounds of ranges and slices.
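A sketch of the three loops described here. The list contents are reconstructed from the audio, and the exact wording of the f-string label is an assumption rather than the text used in the video.

```python
rowdy3 = ["Martin", "Cross", "Gripps", "Vogel"]

# Loop directly over the items of a list.
for person in rowdy3:
    print(person)

# enumerate yields (position, item) pairs; the f-string fills in both values.
labels = []
for position, person in enumerate(rowdy3):
    labels.append(f"Member {position}: {person}")
    print(labels[-1])

# range(3) produces 0, 1, 2: the upper bound is not included.
numbers = []
for i in range(3):
    numbers.append(i)
print(numbers)
```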
- So we're going to go 0, 1, and 2. A tuple is another container.
- It's like a list, except its size can't be changed. It's used for representing things like pairs.
- So I'm going to create a variable called coords and I'm going to store the tuple (3, 5) in it.
- coords[0], then, is the first element of the tuple, 3. coords[1]
- is 5. If we did the len of this tuple, we would get 2. A tuple can be unpacked.
- This tuple has two elements, and we can say x, y = coords to unpack the tuple into two different variables; then x is going to be 3.
- So the parentheses with the comma pack the tuple, and assigning it to variables separated with a comma unpacks the tuple.
- The tuple size has to match. So if we say x, y, z = coords and try to run that, it's going to tell us: not enough values to unpack.
- It expects three values, x, y, and z, but coords only has two values.
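Packing, indexing, and unpacking a tuple, including the mismatched case, can be sketched as follows. The `try`/`except` and the `a, b, c` names are only there so the example runs to completion.

```python
coords = (3, 5)          # parentheses with a comma pack a tuple

first = coords[0]
second = coords[1]

x, y = coords            # unpacking: the number of names must match the size

try:
    a, b, c = coords     # three names, but only two values
except ValueError as err:
    unpack_error = str(err)
    print("ValueError:", unpack_error)

print(first, second, x, y, len(coords))
```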
- A dictionary is another data structure, one that maps keys to values. The keys are often strings, but not always;
- you can use numbers, tuples, any data structure that can't be changed, as the key for a dictionary.
- So we're going to map some different animals to what they eat here. We created a variable;
- assigning to a variable doesn't return a value, so there's nothing to print here.
- Then we can look up a value by its key. So diets['rabbit'] gives us plants.
- Dictionaries are like lists, except we can look values up by any key we want instead of having to look them up by position.
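A minimal dictionary sketch. Only the rabbit-to-plants pair comes from the video; the variable name `diets` is reconstructed from the audio, and the other entries and the `grid` dictionary are made up to round out the example.

```python
# Map animals (keys) to what they eat (values).
diets = {
    "rabbit": "plants",
    "lion": "meat",       # hypothetical entry
    "bear": "both",       # hypothetical entry
}

rabbit_food = diets["rabbit"]    # look up a value by its key

# Immutable objects such as tuples can also be used as keys.
grid = {(0, 0): "origin"}

print(rabbit_food, grid[(0, 0)], len(diets))
```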
- Everything in Python is an object, which has a type. We saw this when we saw the type error
- when we tried to add the number 5 to the string Maroon.
- The types we've seen in this video are integers, strings, lists, tuples, and dictionaries.
- There's a lot more to do with these.
- I refer you to the readings, and we're also going to be introducing various features of them as we go throughout the class.
- Now, another thing that's important to understand is that in Python, variables store references to objects.
- This is how Java works as well. But this matters particularly for mutable objects.
- So we have our list, rowdy3. Now, if we assign the list to another variable, rowdy5, and then we append Amanda
- and show the result of rowdy5, OK, we have our list, and now we've added Amanda to the end of it.
- The rowdy5 and rowdy3 variables are references to the same list object.
- When we look at rowdy3, it's now going to show five elements, including Amanda,
- because when we assign the variable, it doesn't make a copy of the list.
- All it does is create another variable that also refers to the same list.
- We modify this list object, and append modifies the object in place, as we call it.
- That means it modifies the object itself. It does not return a new object.
- The object changes, and any variable that's referring to that object sees the change. This in-place
- distinction is going to be important throughout the semester because some of the libraries we use
- offer options for whether you want to modify something in place or return a new object that has the new data.
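The reference behavior described above can be sketched as follows. The variable names and list contents are placeholders chosen for this example.

```python
# Placeholder data, for illustration only.
group = ['a', 'b', 'c', 'd']

# Assignment copies the reference, not the list.
same_group = group
same_group.append('Amanda')   # append modifies the list in place

print(group)                  # the change is visible through both names
print(same_group is group)    # both names refer to one object

# To get an independent list, make an explicit copy.
separate = group.copy()
separate.append('extra')
print(len(group), len(separate))
```

The copy() call is one way to avoid sharing; whether you want sharing or a copy depends on the situation.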
- So there's a variety of resources to learn Python.
- I'm going to be making some videos, but we're not going to have time in the videos to go into every piece of Python you might need.
- Chapters two and three in the textbook
- talk about basic Python operations and data structures. There's the Python tutorial that I'm providing you a link to.
- That's a relatively comprehensive tutorial from the Python developers about the key Python language features.
- If you really want to dive in depth, there's a book, Learn Python the Hard Way, which is quite comprehensive.
- I'm also going to be providing in the resources section of the class site some additional notebooks that walk
- through and demonstrate different Python features and give you information about the different operations.
- For example, I'm planning on one such notebook that goes over a bunch of different things you
- can do with lists, more than I've had a chance to get into in this video.
- So to wrap up, Python supports many different data types. Everything is an object, and variables store references to objects.
- If you do an operation that modifies an object, all variables that refer to the same object are going to see the change.
- You can perform a number of standard arithmetic operations on Python variables.
- And there are many, many more operations that we're going to be seeing as we go throughout the semester.
🎥 Control Structures
This video introduces Python control structures and code layout.
CONTROL STRUCTURES
Learning Outcomes
Write basic Python control structures
Understand Python block syntax
Know standard practice for whitespace
Key concept: Python uses whitespace to detect blocks, such as the bodies of loops or conditionals.
For Loops
for i in range(5):
    print(f'iteration {i}')
print('done')
Iterate over an iterable
range(5) iterates 0, 1, 2, 3, 4
Whitespace delimits blocks
Blocks
Python blocks begin with ':' at the end of a line
if, else, elif, for, while, def, class
Block content is indented one level
Standard practice is 4 spaces (Jupyter defaults to this)
Block ends when indentation returns to previous level
for i in range(5):
    print(f'iteration {i}')
print('done')
Comments
Python comments
Begin with '#'
Continue until end of line
# this is a comment
print('foo') # can comment here too
If Statements
x = 5
if x >= 10:
    print('big')
elif x >= 5:
    print('medium')
else:
    print('small')
These are false:
False (Boolean value)
None (equiv. to null)
0
Empty list, set, or tuple ([])
Empty string('')
Most other things are true.
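A short sketch of the truth-value rules listed above, using the built-in bool() to show the conversion Python applies in an if statement. The example values are chosen for illustration.

```python
# Values that Python treats as false in a condition.
falsy = [False, None, 0, [], (), set(), '']
for value in falsy:
    print(repr(value), bool(value))

# Most other values are treated as true, including non-empty containers.
truthy = [True, 1, -1, [0], (None,), 'False']
for value in truthy:
    print(repr(value), bool(value))

# A common idiom: test a list for emptiness directly.
items = []
if not items:
    status = 'empty'
else:
    status = 'has items'
print(status)
```

Note that the string 'False' is true, because it is a non-empty string.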
While Loops
while condition:
    # do something
    pass
Iterates until condition is false
'pass' does nothing
Here only to make it valid Python
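The slide's loop uses a placeholder condition, so it cannot be run as written. A runnable sketch with a made-up countdown might look like this:

```python
# Count down from 3; the loop stops once the condition becomes false.
remaining = 3
steps = []
while remaining > 0:
    steps.append(remaining)
    remaining = remaining - 1
print(steps)

# 'pass' as a placeholder for a branch that is not written yet.
if remaining == 0:
    pass  # nothing to do for now
else:
    print('unexpected value')
```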
Wrapping Up
Python provides the usual control structures (if, for, while)
Blocks are based on indentation
'#' starts a comment
- Well, in this video, I'm going to talk with you about some of the basic control structures that we have
- in Python and the syntax that Python uses for indicating different pieces of code.
- The learning outcomes for this video are for you to be able to write basic Python control structures,
- understand the Python block syntax, and know the standard practice for using whitespace in Python.
- One of the key concepts here is that Python uses whitespace to detect blocks such as the bodies of loops or conditionals.
- Other languages such as Java, JavaScript, C, etc., all use squiggly braces.
- Python uses indentation as a syntactically significant indicator of what a block is, such as the body of a loop.
- So in the first Python intro video, we saw a for loop. The for loop iterates over an iterable.
- So the syntax is that we have for, then i, the variable name, then in, and then range(5).
- I'm using range here, but this is the iterable expression.
- That's something that we can loop over, and then within that loop we're gonna print, and we're using an f-string.
- Remember, the f-string says to use variables in the string.
- And so we're printing out the iteration number through each iteration of this loop.
- And then at the end of the loop, we're gonna print done, and that's gonna happen once,
- because whitespace delimits blocks.
- The colon and the indentation indicate that we're in a new block.
- And then when the indentation stops and goes back out to the same level
- as where the for loop started, that indicates the end of the block.
- So, as I said, a Python block begins with a colon at the end of the line.
- And there are several different Python keywords that can start a block: the if family, if,
- else, and elif; the loops, for and while; and then the keywords for defining functions and classes, def and class. Block content is indented one level.
- The standard practice here is to use four spaces.
- Jupyter, along with most modern Python editing environments, defaults its configuration to four spaces for you automatically.
- Python does not strictly mandate this convention.
- All it mandates is that you are consistent. You can't, say, mix tabs and spaces in the same file.
- But the almost universally standard Python practice is to indent with four spaces,
- and then the block ends when the indentation returns to the previous level.
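To make the block rules concrete, here is a small sketch with one block nested inside another. The comments mark where each block starts and ends; the even/odd logic is just an illustration.

```python
labels = []
for i in range(4):
    # This line is inside the for block (one level, 4 spaces).
    if i % 2 == 0:
        # This line is inside the if block (two levels, 8 spaces).
        labels.append(f'{i} is even')
    else:
        labels.append(f'{i} is odd')
    # Back to one level: the if/else has ended, but we are still in the loop.
    print(labels[-1])
# Back to zero levels: the loop has ended, so this runs once.
print('done')
```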
- You can also have comments in Python. A comment begins with a hash and continues until the end of the line.
- You can have a line where all it is is a comment. You can also put a comment at the end of a line that contains some code.
- When we're writing Jupyter notebooks,
- we're going to put a lot of the discussion and the explanation in Markdown cells in the Jupyter notebook rather than in comments.
- But comments are very useful when you start writing Python scripts.
- They're also useful when you want to write,
- just really briefly, why a particular line in one of your code cells is working the way that it's working.
- The if statement is structured
- like the for statement in terms of how the blocks work.
- We open with if, and we don't need any parentheses. We have if and we have an expression. In this case,
- we're gonna say if x is greater than or equal to 10, then we're going to print big. elif is the Python else if.
- You can have as many of these as you want, and you don't have to have one. You don't have to have an else either.
- But it's not elseif or else if, it's just elif. elif is the Python syntax for else if.
- So if it's not greater than or equal to 10 but it is greater than or equal to five, it's going to print medium.
- And that's the one we're actually going to run in this case because x is equal to five.
- And then finally, else, we're going to print small. So in Python there are several things that are considered false
- for the purposes of an if statement.
- The Boolean value False, which you write with an uppercase F, is considered false, as is None, which is a special Python value.
- That is Python's version of null; it means no data here.
- Zero is false. Empty containers (empty lists, sets, tuples) and empty strings are all also false.
- Most other things are true.
- That's how ifs work in Python. They aren't strict like in Java, where
- it has to be a Boolean. Python does not require it to be a Boolean.
- It requires it to be something that can be converted to a Boolean.
- And these are the things that Python converts to false when it's doing that Boolean conversion. A while loop iterates until a condition becomes false.
- I've put a pass statement here. The pass is not part of the while loop. The pass is just here to make this syntactically valid Python,
- because while loops can't be empty. In general, blocks cannot be empty.
- So pass is a Python statement that does nothing.
- It's just needed when you need to make something syntactically valid. Maybe you're in the middle of testing some code.
- Maybe you're working on an if and you don't have it all figured out yet. So you just want to make one of the branches of the if
- do nothing. For now, you can just say pass. So to wrap up, Python provides the usual control structures for a programming language:
- if, for, and while. Blocks, crucially, are based on indentation, and the standard there is to use four spaces.
- The hash sign starts a comment. There are some limitations to working with for loops,
- or, well, to working with any loops in Python.
Further Python study
If you would like further study on Python fundamentals, especially if programming in general is new to you, here are some resources:
🎥 Scientific Python
This video introduces NumPy and numpy.ndarray, the fundamental numeric array data structure for scientific computing.
SCIENTIFIC PYTHON
Learning Outcomes
Understand the limitations of core Python data types for data science
Know the three key array data types:
ndarray
Series
DataFrame
Perform basic vectorized operations
Lists of Numbers
Python:
numbers = [0.3, 9.2, 1.0, 6.7]
Remember that everything is an object?
List of 4 pointers (8 bytes each), with header (16 bytes)
To 4 objects: 8 bytes (double) + 16 bytes (object header)
Total: 144 bytes
Elements can have different types!
Sum our Numbers
total = 0
for x in numbers:
    total = total + x
What's wrong?
Python is slow: convenient, but slow
Interpreted
Dynamically typed
Pointers to objects cause cache misses
Enter NumPy
NumPy provides efficient numeric array types (ndarrays)
import numpy as np
numbers = np.array([0.3, 9.2, 1.0, 6.7])
All elements have the same type
Stored directly in the array: no indirection, contiguous
Many ways to load or create arrays without going through lists
Sum our Numbers
total = np.sum(numbers)
Shorter (although we could have used Python's sum earlier)
Implemented in compiled language
NumPy array internal layout is compatible with C and/or Fortran
Don't loop over arrays.
Vectorization
NumPy lets us perform operations on arrays:
scale = np.linspace(0, 1, 4)
numbers + scale
Result:
array([0.3 , 9.53333333, 1.66666667, 7.7 ])
linspace creates an array from 0 to 1 (both inclusive), evenly spaced into 4 elements (same size as numbers)
+ does elementwise addition: adds corresponding elements
Efficiently in compiled code
About Arrays
Each array has 3 key things:
A data type (dtype): what kind of elements?
A shape (tuple of ints): how big?
May be multidimensional, e.g. (100, 50) for a 100x50 matrix
Elements: the data itself
Pandas
Pandas builds on NumPy with two new data types:
Series โ an array with an associated index (element labels)
DataFrame - a table where each column is a series
You'll see these briefly in A0, and more next week!
Use of Lists and Loops
We'll still sometimes use lists and loops
Lists of arrays or data frames
Looping over input files or groups of data
But we avoid looping over individual data points.
Wrapping Up
NumPy provides efficient arrays
These are the backbone of our data processing
Prefer 'vectorized' operations whenever possible
Practice: put these in a notebook!
- In this video, I'm going to introduce some of the fundamental structures and principles of doing scientific computing in Python.
- In the last couple of videos, I've briefly introduced Python's core structures and core data types.
- But a lot of our work is going to be working with an additional set of structures,
- a set of libraries known as scientific Python or as the PyData stack.
- So the learning outcomes of this video are to understand the limitations of core Python data types for data science, to know
- the three key array data types (we're focusing primarily on the NumPy ndarray,
- and I'll also briefly introduce Series and DataFrame; we're going to see a lot more about those next week), and
- to be able to perform basic vectorized operations. So in Python, we can write a list of numbers like this.
- So numbers equals, and I'm using the list syntax that we talked about in the earlier video.
- And I've got four numbers in here that I'm storing in this list in the variable numbers.
- Now, this seems like a perfectly natural thing to do.
- But remember, I said in the previous video that everything in Python is an object.
- So this isn't just a list of numbers. If we wrote this in Java or C, we would have an array of numbers,
- and it stores the numbers one after the other. But in Python, that's not how it works, because everything is an object.
- What our list stores is pointers to numbers.
- So we've got a list, and it's got a pointer to 0.3 and a pointer to 9.2, et cetera.
- So the list itself has these pointers, which are eight bytes each,
- and then there are the numbers themselves.
- A double-precision floating point number takes eight bytes. But the numbers aren't just numbers, they're objects.
- And every Python object
- (this is all on a 64-bit system) has at least 16 bytes of header information.
- And so this whole list of numbers takes 144 bytes, because the list has a header
- and it has pointers, and the pointers point to objects that have headers in addition to the data.
- Also, the elements of a list can be different types. So when you go over the list, there's no guarantee that everything is a number.
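One way to get a feel for this overhead is the standard library's sys.getsizeof. This is only a sketch: the exact numbers depend on the Python version and platform, and getsizeof counts bookkeeping that the simplified 144-byte estimate leaves out, so the totals will probably not match it exactly.

```python
import sys

numbers = [0.3, 9.2, 1.0, 6.7]

# Size of the list object itself (header plus one pointer per element).
list_bytes = sys.getsizeof(numbers)

# Size of each float object (header plus the 8-byte double).
element_bytes = sum(sys.getsizeof(x) for x in numbers)

total_bytes = list_bytes + element_bytes
raw_bytes = 8 * len(numbers)   # what 4 bare doubles would need

print(list_bytes, element_bytes, total_bytes, raw_bytes)

# Elements of a list can have different types.
mixed = [0.3, 'maroon', None]
print([type(x).__name__ for x in mixed])
```

Whatever the exact figures on your machine, the list plus its float objects takes several times the 32 bytes of raw numeric data.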
- So if we want to sum our numbers, there is a Python function called sum that will do a sum.
- But it's basically doing this. So we'll initialize a variable called total.
- We'll then loop over all of our numbers and we'll add each one to the total.
- And that's gonna make the total equal the total of the numbers. This works, it works just fine.
- And for a list of four numbers, it's completely fine.
- But there are a couple of issues here. One is that Python, the language itself, is rather slow.
- It's quite convenient, but it's slow, and it's slow for two reasons.
- One is that it is interpreted: the Python code is compiled to an internal data structure,
- but then there's C code that runs in a loop interpreting that data structure.
- It's also dynamically typed. So remember, I said the values in the list can have different types.
- We wrote a set of numbers there, but Python isn't guaranteed that they're all numbers.
- And so rather than saying, okay, I have a number, I'm going to keep adding it,
- what it says is, I have a thing and I'm going to try to add it to the thing I already have.
- And it has to go look up how to do that, and it does that every time for each number.
- This is all very slow. Also, since it's pointers, if you've taken the computer architecture class,
- that may ring a few alarm bells for you, because rather than just having an array of numbers which will be loaded into our cache and very quickly accessed,
- we have an array of pointers, and each pointer has to go off and look up the number in memory.
- And those numbers might be stored next to each other, but they might be stored all over the heap.
- We're gonna have cache misses, which make this slow process even slower. So we can write code like this and it works fine,
- but it's not an efficient way to do computation, especially as we get to larger and larger data sets. With a few hundred
- or a few thousand numbers, you're gonna be fine. When you've got a million numbers, when you have a hundred million or a billion numbers,
- then things start to really get slow. So NumPy is a Python package that provides efficient data types for doing numeric computation.
- And NumPy underlies almost all of the rest of the scientific Python, data science, and machine learning software for Python.
- It has a data type called an ndarray. There's a variety of different ways you can create one,
- but here we're going to just create one using the array constructor, and then we're going to pass it our list.
- So we're creating the list in this case. We are going to see later many ways to load arrays without having to go through a list.
- I'm just doing this here so I can demonstrate how the array works.
- But all the elements are of the same type in an array, and they're also stored directly in the array.
- So this ndarray stores the floats one right after another,
- eight bytes each. And so we don't have the indirection through pointers. We don't have all of the overhead of storing all of these different objects.
- It's just storing the numbers, one right after another. You can have an ndarray of objects, and that's going to store the pointers.
- And that's useful in a few cases, especially for treating strings consistently with numbers.
- But it really shines when we're dealing with arrays of numbers for various scientific computing applications.
- So if we want to sum our numbers, we can use the NumPy sum function, and it's much shorter.
- Python has a sum function, as I mentioned, that we could have used, but NumPy's is also implemented in a compiled language.
- And when you have a NumPy array that's storing numbers, whether they're integers
- or floating point numbers, it's stored internally in a format that's compatible with C or Fortran.
- And so what a lot of NumPy functions
- are doing is passing the array to C code or Fortran code or
- C++ code that has a compiled loop that works on that data type and is able to very,
- very efficiently sum up those numbers. We don't have the cache issues from the indirection.
- We don't have the overhead of Python's interpreted code.
- We don't have the overhead of having to deal with the possibility that the elements of the array might be of different types.
- They're all the same type. We can work over them in a loop in compiled
- machine code. So in general, don't loop. You can loop over a NumPy ndarray.
- It's iterable just like a list. But in general, you don't want to do that.
- You want to set up your code so that NumPy can do the looping for you.
- And effectively, what we wind up using Python as is a scripting language to tell the underlying C, C++, and Fortran code
- what to do.
- And the fact that Python is a slow language doesn't matter very much because the vast majority of our processing time won't be spent in Python.
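A small side-by-side sketch of the two approaches, assuming NumPy is installed. With only four numbers there is no visible speed difference; the point is that both give the same answer and the array stores just the raw values.

```python
import numpy as np

values = [0.3, 9.2, 1.0, 6.7]
numbers = np.array(values)

# Explicit Python loop: works, but every step runs in the interpreter.
loop_total = 0.0
for x in values:
    loop_total = loop_total + x

# NumPy runs the equivalent loop in compiled code.
array_total = np.sum(numbers)

print(loop_total, array_total)
print(numbers.nbytes)   # 4 elements * 8 bytes each
```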
- NumPy also has a feature called vectorization.
- There are a lot of operations that operate on an entire array at a time.
- So I can create another array.
- The linspace function here
- creates an array of four values that are evenly spaced from zero to one inclusive.
- And then the plus operator here: remember, plus between two numbers is going to add them; between two strings,
- it's going to concatenate them. Plus between two arrays requires them to be of compatible shapes,
- and it adds the corresponding elements of the arrays to each other and returns a new array.
- So if we have one array of numbers and we have another array of numbers,
- and we want to add them together, we just add the two arrays, and it does that addition, again, in a loop written in C or Fortran.
- And it does it very, very quickly.
- You can also add a single number, an integer or a floating point number, to an array, and it'll add it to every element of the array.
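The following sketch, assuming NumPy is installed, puts the three cases just described together: array plus array, array plus a single number, and two arrays whose shapes are not compatible. The three-element array is made up for the last case.

```python
import numpy as np

numbers = np.array([0.3, 9.2, 1.0, 6.7])
scale = np.linspace(0, 1, 4)   # 0, 1/3, 2/3, 1

# Elementwise addition of two arrays with the same shape.
shifted = numbers + scale

# Adding a single number applies it to every element.
plus_one = numbers + 1

print(shifted)
print(plus_one)

# Arrays with incompatible shapes cannot be added.
try:
    numbers + np.array([1.0, 2.0, 3.0])
except ValueError as err:
    shape_error = str(err)
    print(shape_error)
```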
- But this is the key point for making scientific computing with Python fast.
- We set up our code, and throughout the class we're gonna be trying to set it up, so that we use vectorization as much as possible.
- And we vectorize over as much data at a time as possible so we can allow the optimized loops in NumPy,
- in pandas, SciPy, and scikit-learn to do the work, and put as much of the work as possible into those compiled loops.
- So we're not spending a lot of time in slow Python code. Each array has three key things.
- It has a data type, called a dtype, and that says what kind of elements are in the array.
- NumPy has data types for your standard integers of various sizes
- and single- and double-precision floating point numbers. It also has dtypes for working with
- dates and times, strings, and then for storing arrays
- that hold pointers to arbitrary Python objects.
- The array also has a shape, which is a tuple of integers that says how big the array is.
- The array may be multidimensional. So ndarray stands for n-dimensional array,
- and it can be one, two, three, four, whatever dimensional. So if we have a 100 by 50 matrix, it's stored in a NumPy ndarray of shape
- (100, 50). And then there's the data itself.
- That's the elements of the array, the data points themselves that are stored in the array.
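A brief sketch of inspecting these properties, assuming NumPy is installed. The arrays are made up for illustration; np.zeros is used only as a convenient way to get a 100 by 50 matrix.

```python
import numpy as np

numbers = np.array([0.3, 9.2, 1.0, 6.7])
print(numbers.dtype, numbers.shape)

counts = np.array([1, 2, 3])
print(counts.dtype)   # an integer type; the exact width is platform-dependent

matrix = np.zeros((100, 50))
print(matrix.shape, matrix.ndim, matrix.size)
```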
- So then pandas, which we're going to see next week, builds on top of arrays with two new data types.
- A series is an array with an associated index that allows us to look up values.
- So an ndarray, like a Python list, is indexed using numbers starting from zero: zero, one, two, three, four, five.
- But a lot of times we're gonna have some other natural index. If you've taken databases, it's equivalent to the primary key.
- So a series is an array with an associated index that might be other numbers,
- that might be strings, but is some other way of accessing the points.
- It also has an efficient representation, so that you can have a series that's indexed zero through n minus one, where n is the length of the series,
- and it does not take up a lot of space to do that. And then a data frame is a table where each column is a series,
- and they all share the same index. And we're gonna see those a lot, because when we load in a set of data points, that's gonna be in a data frame.
- Now, in assignment zero, you're going to briefly see both of these data structures.
- I walk you through everything you have to do with them in assignment zero. And we're going to introduce them a lot more
- when we're talking about how to describe data next week.
- But the ndarray, the NumPy array data structure, is the fundamental core that all of these others are built on.
- The series augments it with an index. The data frame collects multiple series together with column names, like a spreadsheet table.
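As a preview of these two pandas types, here is a minimal sketch, assuming pandas is installed. The animals, weights, and column names are invented for illustration and are not from the course data.

```python
import pandas as pd

# Hypothetical data, for illustration only.
weights = pd.Series([2.0, 190.0, 300.0], index=['rabbit', 'lion', 'bear'])
print(weights['lion'])   # look up by label instead of position

# With no index given, pandas uses 0 through n-1.
plain = pd.Series([0.3, 9.2, 1.0, 6.7])
print(plain.index)

# A DataFrame is a table whose columns are Series sharing one index.
animals = pd.DataFrame({
    'weight_kg': weights,
    'diet': pd.Series(['plants', 'meat', 'both'], index=['rabbit', 'lion', 'bear']),
})
print(animals)
print(type(animals['diet']).__name__)
```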
- So we're still going to sometimes use Python native lists and loops.
- Oftentimes, it's going to be because for some reason, we need a list of arrays or data frames.
- Also, we may need to loop if we have, say,
- 20 input files that we need to put together to be our data set, or if we've got different groups of data; we're going to loop over those.
- But the big thing we avoid doing is looping over individual data points.
- We load in a few hundred thousand records. They're going to be in a data frame.
- We don't loop over the rows of a data frame if we can avoid it, because there's almost always a more efficient way to do that computation
- that pushes a lot of it into the C, C++, and Fortran code that underlies NumPy, pandas, et cetera.
- So to wrap up, NumPy provides efficient array data structures that are more compact in memory.
- They don't take up nearly as much space, and they're also much more efficient to compute over.
- These are going to be the backbone of our data processing throughout the rest of the class.
- And we want to prefer vectorized operations that perform these loops in native compiled machine code whenever possible. For a little bit of practice,
- I encourage you to take the example code from these slides and go and try it
- in a notebook so you can get a little more practice creating notebooks and running code.
- I will see you in class.
🚩 Assignment 0
Complete and submit Assignment 0 by midnight on August 28.