Week 5 — Filling In (9/19–23)
This week introduces one new statistical concept — the hypothesis test — and is otherwise about practice and solidifying concepts.
I’m also going to take a step back and give some more context to some of the things we’re talking about.
Our learning outcomes are:
Compute and interpret hypothesis tests
Avoid p-hacking and HARKing
Understand how to read and interpret Python errors
Understand how the quantitative techniques we are learning in this class fit in a broader landscape of epistemologies
🧐 Content Overview
This week has 1h14m of video and 2653 words of assigned readings. This week’s videos are available in a Panopto folder.
Week 5 Quiz is due on Thursday at 8AM.
Assignment 2 is due on Sunday, September 25, 2022 at 11:59 PM.
Midterm A is next week, on September 28.
📓 Assignment 1 Solution
The Assignment 1 solution is on Piazza.
📃 Course Glossary
If you haven’t yet, I highly recommend consulting the course glossary.
Please post on Piazza if you have suggested additions!
The glossary is also likely to be useful in studying for the exam next week.
📓 Writing Functions
I’ve used Python functions in a few of my example notebooks.
The function notebook talks more about them, how to write them, and how to use them.
🎥 Comparing Distributions
This video describes how to use Q-Q plots to compare data against a distribution.
CS 533 INTRO TO DATA SCIENCE
Draw a Q-Q plot to compare two distributions
Assess distribution fit with a Q-Q plot
Photo by Wesley Tingey on Unsplash
The Q-Q Plot
Y-axis: quantile in data
X-axis: quantile from distribution
Straight line indicates match
X can be:
Values from a theoretical distribution
2nd data set
line=45 shows ‘match’ line
fit=True aligns data to work w/ line
Also shows how to build your own
Don’t do that
Q-Q plots let us compare data against a distribution (or other data)
Data is on a straight line when the distribution matches.
- In this video, I want to introduce ways of comparing distributions visually. Our outcomes are to be able to draw a Q-Q plot to compare two distributions, and to assess distribution fit with one of these plots.
- When we get some data points, we can draw them with a histogram. We know how to do this, and we can look at them.
- But if we want to ask, say, are these points normally distributed? That's hard to tell from the histogram.
- These look like they're probably normal; I can envision drawing an approximately normal curve through them.
- OK. But that's not a very good test.
- Particularly because of the binning, and also because of random little clusters here and there, it's difficult to accurately tell from a histogram whether what we're looking at is normal.
- Histograms are not the only way we can show distributions, though. If we want to compare two distributions, we can use something called a Q-Q plot. A Q-Q plot gives us an easy way to compare whether the data are likely drawn from the same distribution.
- Each point corresponds to a data point. On the x-axis, we show the quantile corresponding to that point in the theoretical distribution, the reference distribution. In this case we want to test for normality, so the reference is a standard normal distribution: where would the first point in the data set probably be if I drew this many points from a normal distribution?
- The y-axis is the value in the actual data, in this case standardized to have a mean of zero and a standard deviation of one, but otherwise unchanged.
- If the data come from the same distribution as the reference, so we're using normal as our reference and the data are normal, then we're going to see a straight line through the data points.
- That's because when we line up the data points with where we would expect them to be if they were normal, they match: the data match what we would expect if this data were drawn from a normal distribution.
- If we saw some points over here curving away from the line, that would indicate that we don't have a normal distribution. We do see a few points here that are a little off the line.
- That's OK, especially out at the tails; it's common to see a couple of points on each side that aren't quite on the line.
- But the fact that the middle of the plot is lining up gives us confidence that what we're looking at is a normal distribution.
- This plot design allows us to directly compare two distributions in a way that's much, much more precise than trying to eyeball a histogram.
- The x coordinates can be values from a theoretical distribution, such as a normal. You can use this for any kind of distribution: an exponential, a t, a Weibull, pick your distribution.
- But they can also come from a second data set. If we have two data sets and we want to see if they were drawn from the same distribution, we can do that with a Q-Q plot as well.
- To plot one of these, the statsmodels package provides the qqplot function, and it has a couple of options. The line='45' option shows a match line; it draws that red line we saw. And the fit=True option standardizes the data, so that the 45-degree line from corner to corner is the line where matching data fall.
- I'm writing a notebook, the one-sample notebook in this week's material, that demonstrates how to draw a Q-Q plot. It also shows how to build your own: we use statsmodels' qqplot in there, but then I also work through the pieces using standard NumPy and Matplotlib features to get to a Q-Q plot.
- I hope that makes it all make a little more sense, but most of the time I just use statsmodels' qqplot to get my Q-Q plots.
- To conclude: Q-Q plots let us compare data from two different distributions. It might be observed data and a theoretical distribution, or it might be two different samples of data points. On the Q-Q plot, when the distributions match, the data fall in a straight line.
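The "build your own" step the notebook walks through can be sketched like this. It is a minimal illustration (the function name, simulated data, and seed are mine, not the notebook's): compute the normal quantiles for each plotting position with SciPy, standardize the data as `fit=True` does, and the resulting pairs are the Q-Q coordinates. In practice, `sm.qqplot(data, line='45', fit=True)` from statsmodels does all of this and draws the plot.

```python
import numpy as np
from scipy import stats

def qq_points(data):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    data = np.sort(np.asarray(data))
    n = len(data)
    # standardize the observed data (this is what fit=True does)
    obs = (data - data.mean()) / data.std(ddof=1)
    # plotting positions: where each order statistic falls in (0, 1)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theo = stats.norm.ppf(probs)  # standard normal quantiles
    return theo, obs

rng = np.random.default_rng(42)
theo, obs = qq_points(rng.normal(5, 2, size=200))
# for normal data the points hug the 45-degree line,
# so their correlation is very close to 1
print(np.corrcoef(theo, obs)[0, 1])
```

Plotting is then just a scatter of `theo` against `obs` plus the line y = x.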
🎥 Testing Hypotheses
CS 533 INTRO TO DATA SCIENCE
Define null and alternative hypotheses
Carry out a T-test for a hypothesis
Bootstrap a p-value
Photo by Amplitude Magazin on Unsplash
Do Gentoo penguins have longer flippers than Chinstrap penguins?
What we can do:
Measure difference in average flipper length
Compute probability we would see that big a difference if they didn’t have longer flippers
This is called a “p-value”.
Each test makes assumptions.
These are often quite strong!
T-test: data are i.i.d. normal
Pretty robust to violations
Independence: NOT so robust
Bootstrapping the p-value
Pitfalls to p-values
- In this video, I want to talk with you about testing hypotheses.
- We've talked about estimating values and computing the precision of an estimate by looking at the width of a confidence interval. We've seen how to do that both parametrically, using the standard error of the mean, and with the bootstrap.
- In this video, we're going to talk about null and alternative hypotheses, t-tests for hypotheses, and bootstrapping p-values.
- Suppose we have the question: do Gentoo penguins have longer flippers than Chinstrap penguins?
- What we can do to try to assess this question is measure the difference in average flipper length.
- And then we can ask: suppose they didn't have different lengths, so that Gentoo penguins and Chinstrap penguins are the same in terms of their flipper length. How likely would I be to find this big a difference in flipper length under those conditions?
- That assumption of no difference is called the null hypothesis. For most statistical tests, the null hypothesis is that everything is the same: there's no actual difference. The probability that we would see this big a difference, given that there's no difference, is called a p-value.
- So to do the test, we define two hypotheses: the null hypothesis, in this case that the mean flipper length for a Gentoo is the same as for a Chinstrap, and the alternative hypothesis, that they aren't.
- Then, in this case, we can use what's called a two-sample t-test, which computes the probability that we would see this big a difference in our sample means given that there is no true difference: H0, the true means are the same.
- If p is low (a common threshold is 0.05), then we say that the data reject the null hypothesis. If p is large, then we say the data could not reject the null hypothesis.
- This biases our discovery procedure towards not claiming things: we only say we found something when p is small enough to reject the null.
- There are many different tests, each with their own H0, the null hypothesis.
- For the one-sample t-test, the null hypothesis is that the mean is zero, or we can set it to any other particular constant.
- For the two-sample t-test, we have two different samples and their sample means; the null hypothesis is that the true means are the same.
- For the paired t-test, we again have two measurements, but rather than having measurements from one sample and measurements from another sample as in the two-sample t-test, we have one sample and we take two measurements for each element in the sample.
- One way this comes up is in what we call between-subjects or within-subjects experiments.
- In a between-subjects experiment, say you're trying to test the effectiveness of a computer interface, or whether the new widget you stuck on your web site is improving sales: you take a sample of your users, have half of them use the new widget and half of them use the old widget, compute the sales, and then test. It's between subjects because you have different subjects in each condition.
- In a within-subjects experiment, the subjects are in both conditions. Maybe you have two different boxes for how you want to finish checking out a shopping cart, you're doing a study in a lab, and you want everybody to try both: everybody completes the shopping experience with one interface and completes it with the other, and you measure maybe the speed, maybe the error rate.
- When you do this, you do what's called counterbalancing: half of them do the interfaces in one order and half in the other, in case there's an order or learning effect.
- The idea here is that for each user, you have their speed with one interface and their speed with the other interface, so the measurements are paired. In that case, you do the paired t-test.
- What you're actually testing is a one-sample t-test: you test whether the mean of the differences between the two measurements for the same person is zero. If it's zero, then there's no actual difference; that's our null hypothesis.
- ANOVA extends this beyond two levels. One sample, two samples, paired: what if we have five samples? Then we use ANOVA. What if we have five measurements per subject? Then we use what's called repeated-measures ANOVA. The null hypothesis for ANOVA is that all means are equal.
- Each of these tests makes assumptions, though, and they're often quite strong. The t-test makes normality assumptions about the data; it's relatively robust to violations of those.
- Independence is a killer in any of these tests. This is why you have to use the paired t-test when you have paired data: the samples are not independent.
- If I am particularly slow at doing a kind of task, then even if the new interface makes me faster at it than the old interface, my speeds will be lower than someone else's. Pairing controls for that, because it's just looking at the difference between the speeds: how much did it speed me up? How much did it speed you up?
- ANOVA also depends on quite a few assumptions; getting into the details of ANOVA is beyond the scope of what we have time for, particularly this week.
- We can also bootstrap a p-value. The idea here is that we have some statistic; I'm going to call it T, but it might be the mean, it might be some kind of normalized statistic, whatever statistic we're trying to compute.
- The bootstrap's job is to resample, but the goal here is to simulate the distribution of the statistic under the null hypothesis.
- So we transform the data to follow the null hypothesis in some way, we bootstrap the transformed data, we compute the statistic from each bootstrap sample, and then we look at the probability that this bootstrapped statistic is at least as large as our sample statistic.
- The tricky part of doing this is properly sampling from the null hypothesis, and that's something you're going to have to figure out for each bootstrap that you want to do.
- For the two-sample test, the t-test's H0 is that the means are the same. But we can also think about a somewhat richer null hypothesis: that Gentoo and Chinstrap flippers have the same distribution, even though we're only going to measure the mean.
- Since the null hypothesis is that they have the same distribution, what I can do is pool all the measurements: take all of the Gentoo and Chinstrap penguin flipper lengths and stick them all together in one big list.
- Because if there's no difference in flipper length between a Gentoo and a Chinstrap, then if I'm sampling a flipper length, it shouldn't matter which penguin I'm sampling it from.
- We compute our bootstrap samples from the pooled data: one that's the length of our Chinstrap sample, and one that's the length of our Gentoo sample.
- And we compute the p-value: the fraction of the bootstrap runs where the magnitude of the difference exceeds our observed difference.
- If the observed difference we have is common under this procedure, then the penguins are probably not that different. But if it's very rare for this bootstrap procedure to give us the value we saw, then that's evidence that there's a difference in the flipper lengths of the penguins.
- The notebook I'm going to give you to go with this video shows the code for how to actually bootstrap the p-value, in addition to the t-test.
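The pooled-bootstrap procedure described above can be sketched as follows. This is an illustration, not the notebook's code: the function name is mine, and the flipper lengths are simulated stand-ins rather than the actual penguins data.

```python
import numpy as np

def boot_pvalue(xs, ys, nboot=10000, seed=42):
    """Bootstrap P(|mean difference| >= observed) under H0: same distribution."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    obs = abs(xs.mean() - ys.mean())
    pooled = np.concatenate([xs, ys])   # under H0, species doesn't matter
    rng = np.random.default_rng(seed)
    diffs = np.empty(nboot)
    for i in range(nboot):
        # resample one group of each original size from the pooled data
        bx = rng.choice(pooled, size=len(xs), replace=True)
        by = rng.choice(pooled, size=len(ys), replace=True)
        diffs[i] = abs(bx.mean() - by.mean())
    # fraction of bootstrap runs at least as extreme as what we observed
    return np.mean(diffs >= obs)

rng = np.random.default_rng(20220919)
chinstrap = rng.normal(196, 7, size=60)   # hypothetical flipper lengths (mm)
gentoo = rng.normal(217, 6, size=60)
print(boot_pvalue(chinstrap, gentoo))
```

With a 20 mm simulated difference, essentially no bootstrap run is as extreme as the observed difference, so the p-value comes out near zero.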
- Another issue we run into is multiple tests. Usually we don't just run one test in a paper or report; we have many.
- The threshold we compare the p-value to (the probability of seeing a result this extreme if the null hypothesis is true) is what we call the significance level alpha: our threshold for saying the effect is probably real.
- If alpha is 0.05, and we run 20 tests where the null hypothesis is true in all of them, we will probably find one that looks statistically significant. If we run 100, we'll probably find five. This is what's called a false discovery.
- There are ways to correct for this. The Bonferroni correction is a conservative but sound correction that scales down our significance level by the number of tests we run, and not just the ones that are reported in the paper.
- There are other corrections that work in various situations to correct for these multiple comparisons. One is the Benjamini-Hochberg procedure, which works if all of our tests are independent; there's another variant that can deal with dependency relationships between our different tests.
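The Bonferroni arithmetic is simple enough to show directly; the p-values below are made up for illustration. In practice, statsmodels packages this correction (and Benjamini-Hochberg) as `statsmodels.stats.multitest.multipletests(pvals, method='bonferroni')` or `method='fdr_bh'`.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Return (adjusted p-values, reject flags) under the Bonferroni correction."""
    p = np.asarray(pvals, dtype=float)
    # multiplying each p-value by the number of tests is equivalent to
    # comparing the raw p-values against alpha / m; cap at 1
    adj = np.minimum(p * len(p), 1.0)
    return adj, adj < alpha

# three pairwise comparisons, say: only the first survives correction
adj, reject = bonferroni([0.004, 0.03, 0.20])
print(adj)     # adjusted values: 0.012, 0.09, 0.6
print(reject)  # True, False, False
```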
- So, for example, if we compare all pairs of penguin species (Chinstrap to Gentoo, Chinstrap to Adelie, Adelie to Gentoo), those tests aren't all independent.
- But one thing we need to think about is what the results are for. If p < 0.05 is our threshold and we're saying "we found something, and that's our evidence," then we need to be correcting, most of the time.
- But as I said back in the first video this week, my perspective on this topic is multi-evidentiary. A p-value is not a conclusive judgment that we found something; a p-value is one piece of evidence that this is worth continuing to look at, and one piece of support for the results we're finding.
- If the p-value is high, the results we're seeing are completely consistent with the null hypothesis, and that means we probably haven't found something. But if it's low, there are still a bunch of other reasons why we might have seen what we've seen, so we should keep some skepticism. It's one piece of evidence among many; even an uncorrected p-value is still evidence, depending on how we treat it in the process of drawing our conclusions.
- So, a few pitfalls of p-values. There's the multiple-comparisons one. There's also the issue of running experiments on the same data set over and over: should we be correcting for all of the p-values ever run on that data set?
- P-values are also designed for prospective experiments. The mathematics around p-values, and also around confidence intervals and so on, came out of sampling theory, where we're thinking about things prospectively: I'm going to go take a sample and compute a statistic, and I want to know what the distribution of that statistic is.
- But in data science, we're often working retrospectively. We have an existing data set, and even if we're going to sample from it, we're often looking at the whole data set before we do our sampling.
- That doesn't mean p-values aren't useful. We can still compute a p-value; it's still a piece of evidence. It's just not the judge's final sentence.
- Also, full validity for what a p-value is telling us requires us to plan the whole analysis before we look at the data. The p-value computes the probability of our statistic being this large, given that the null hypothesis is true. But when we look at our data first, we do an exploratory analysis, we plot some charts, we say "oh, this looks like a relationship, these means look different," and then we go do a t-test.
- What we're actually computing then is the probability of the statistic, given that the null hypothesis is true and that the null hypothesis looks false, because our choice of what to test was influenced by that information. So the p-value is not measuring the process that we actually used to derive it.
- Also, the null hypothesis is usually not precisely true. Penguin means might be slightly different, just not different enough to matter.
- Again, this does not mean we should not compute p-values, and it does not mean they're useless. It means they need to be taken as one piece of evidence in the context of the totality of the evidence that we bring to understanding what it is we're trying to understand.
- Statistics in actual application is a lot of reasoned judgment calls. There are many, many ways to do statistics poorly; I have a book on my shelf called Statistics Done Wrong: The Woefully Complete Guide.
- Reasoned judgment doesn't mean anything goes, but it does mean there's not going to be a bright-line rule to say "yes, we found something." We have our pieces of evidence: we have our confidence intervals, we have our p-values, we have the other things we compute from the bootstrap.
Read XKCD #882: Significant.
This is called p-hacking: running tests until we find one that is significant.
This video discusses the t-test in more detail, and the different kinds of t-tests that we can run.
It also introduces degrees of freedom.
CS 533 INTRO TO DATA SCIENCE
Select the appropriate form of T-test for a value
Understand what the T-test does
Photo by Jonathan Farber on Unsplash
Distribution Under the Null
Degrees of Freedom
Penguin T-test Results
One sample, mean: one-sample t-test
Paired measurements: paired t-test
Scipy ttest_rel, or use 1samp on differences
Two independent samples: two-sample independent t-test
Warning: assumes equal variance by default – turn off with equal_var=False
This is the usual practice for a significance test:
Compare to distribution under null hypothesis
Can bootstrap them too
Significance tests compute the probability of a statistic under the null hypothesis.
The T-test does this with the t-statistic, a standardized mean.
- In this video, I'm going to talk with you a little bit more about t-tests: give you more of an idea of what they do and how they work, and also when you need to use the different kinds and their Python functions.
- The learning outcomes are for you to be able to select the appropriate form of t-test for some data that you have, and to understand what the t-test does.
- So suppose we have some i.i.d. normals: observations of a variable that are independent and identically distributed, and normal. We want to test the null hypothesis that the true mean is equal to some fixed value. Often we're going to see if it's equal to zero, but we can test against any other particular value.
- So the null hypothesis is that the true mean of the population is equal to our value, say zero, and we want to see: does the data support or reject the null hypothesis?
- What we do is calculate a test statistic called a t-statistic, which is the difference between the sample mean and the target mean, divided by the standard error.
- This gives us a normalized version of the difference in means: it's normalized by the natural variance from the sampling distribution that we would expect when computing the mean.
- Remember, the standard error gives us the standard deviation of our sampling distribution: the sampling distribution of the sample mean is normal, with standard deviation equal to the standard error.
- So if the difference in means is small with respect to that width, then the difference is probably due to sampling error. But if the difference is big, then it's substantially less likely that it's due to sampling, and more likely that the population mean doesn't actually equal our target, such as zero.
- What we do with this t-statistic is compute the probability that repeating this procedure a bunch of times would give t-statistics at least as large in magnitude as the observed one, if the null hypothesis is true. This is a probability given H0.
- Since we're asking whether the magnitude of t is going to be this large if the means are actually equal, this is called a two-sided t-test; it's the common one.
- The code for this is the ttest_1samp function from SciPy.
- If the null hypothesis is true, this t-statistic follows a distribution called a t-distribution with n - 1 degrees of freedom. So we can look and see where the observed t value falls in this distribution.
- If our t-statistic is about 0.5 under this particular distribution, and we consider where it would be on the left as well, then there's a lot of probability mass outside those values. A value this extreme has got to be common just due to sampling variation; that value is completely expected.
- But if we go all the way out to a t-statistic of 3.5, there is not much probability mass outside the 3.5s. That means that if the null hypothesis were true, seeing a t-statistic of 3.5 would be very unlikely, which means the data probably didn't come from the null hypothesis; they probably came from something else. So we would reject the null hypothesis.
- That's what happens with the t-test: you compute this t-statistic from your data, which is a standardized difference in means, and you compare it to the t-distribution, which gives us the distribution of t-statistics under the null hypothesis that there is no difference in means.
- As the degrees of freedom go to infinity, the t-distribution approaches normal; it's basically an adjusted normal.
- I want to talk just a moment about these degrees of freedom. The degrees of freedom are the number of observations in a series that can freely vary for the purposes of computing a statistic: we have n observations; how many of them can freely vary?
- If we're just trying to compute the mean, all of our values can vary: change any value in our data set, and it's going to change the mean.
- But if we have the sample mean, and we're trying to compute the standard deviation, then not all of them can vary: out of n data points, only n - 1 can vary. Because if we have x1 through x(n-1) and we have the mean, we can compute the last value.
- So for the purpose of the standard deviation, one way we can see it is that computing the intermediate statistic, the mean (which we need in order to compute the standard deviation, which we need in order to compute the t-statistic), uses up one degree of freedom. Once we fix it, only n - 1 of our samples can vary.
- So we have n - 1 degrees of freedom for computing the standard error, which then goes into the t-statistic.
- For the one-sample t-test, the degrees of freedom are n - 1; we use that to pick which t-distribution we're going to use, and we see where our value lies on that distribution.
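This "used up" degree of freedom is exactly what NumPy's `ddof` argument controls; a tiny illustration with made-up numbers:

```python
import numpy as np

# ddof = "delta degrees of freedom": how many were used up by
# intermediate statistics. ddof=1 divides by n - 1, which is the sample
# standard deviation (pandas' .std() uses ddof=1 by default;
# np.std defaults to ddof=0).
x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.std(x, ddof=0))  # divides by n:     sqrt(5/4)
print(np.std(x, ddof=1))  # divides by n - 1: sqrt(5/3)
```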
- Now, the two-sample t-test. The one-sample t-test asks: is the mean of the population my sample came from equal to some fixed value? The two-sample t-test asks: I have two samples; did they come from populations with the same mean?
- So let's say we have the Adelie penguins and the Gentoo penguins, and we have the mean flipper length of these two samples. We want to see: do the populations of Adelie penguins and Gentoo penguins probably have the same mean, or do they probably have different means?
- This depends on the samples being independent: no relationship between data points, and each data point is in one of the two categories. If you have relationships, then you need the paired t-test. But independence makes sense here, since a penguin is either an Adelie or a Gentoo, so we're going to use the independent t-test.
- The null hypothesis is mu1 = mu2: the means of the two populations represented by these two samples are the same. Our t-statistic is the difference of the means, divided by a combination of their standard deviations and sample sizes.
- The degrees of freedom for this are significantly more complicated if we allow the two populations to have different variances. If we assume they have the same variance, we get a much simpler version of the two-sample t-test; but in general, they might have different variances, and in that case the degrees of freedom are relatively complicated.
- I don't recommend actually calculating this yourself; we have t-test functions that will calculate the t-test for us. This just gives you an idea of what those functions are doing, so you can better understand what's actually happening when you run a t-test.
- So we run this on our Adelie and Gentoo penguins: we get a t-statistic of -5.78, we can plot that on our t-distribution, and we get a p-value of 6.05e-8. This data would be very, very shocking to find if Adelie and Gentoo penguins had the same flipper length. So we reject the null.
- A paired t-test is when we have two measurements from the same samples. Rather than having a measurement from one sample and a measurement from another sample, like the penguins, each measurement in one group is paired with a measurement in the other. For example, we've got two tests in this class: your score on test one and your score on test two would be a sample of paired observations.
- What we do then is compute the difference between these observations, say your test-two score minus your test-one score. That gives us one sample, and we compare its mean to zero.
- So a paired t-test is a one-sample t-test on the differences, with the null hypothesis that the mean difference is equal to zero.
- This is really useful for testing if there is a difference when you take the same student and give them two tests, or the same patient and give them two treatments, one in week one and one in week two. It tests whether there's a difference between the two observations or the two treatments on the same patient or research subject.
- You're going to use this in Assignment 2 to compare ratings from two different sources for the same movies.
- So, to run these tests: if you have one sample and you want to test its mean, use the one-sample t-test.
- If you have paired measurements (you've got the movies, and you have the all-critics score and the top-critics score, two measurements for the same movie) and you want to see if they're different, you use a paired t-test. SciPy gives you a paired t-test function, but you can also compute the differences and use the one-sample t-test.
- If you have two independent samples, you use the two-sample independent t-test. SciPy calls this ttest_ind. By default, both SciPy's and statsmodels' two-sample t-test functions assume equal variance. You can turn this off in SciPy by setting equal_var=False when you call the ttest_ind function, and it will use the more sophisticated t-statistic and the more sophisticated degrees of freedom.
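The three forms can be sketched with SciPy as follows; the score and flipper-length numbers are simulated placeholders, not the assignment or penguins data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(533)
all_critics = rng.normal(6.5, 1.0, size=40)
top_critics = all_critics + rng.normal(0.3, 0.5, size=40)  # paired: same movies
gentoo = rng.normal(217, 6, size=35)
adelie = rng.normal(190, 7, size=45)

# paired: two measurements per movie
paired = stats.ttest_rel(top_critics, all_critics)
# ...which is exactly the one-sample test on the differences
one_samp = stats.ttest_1samp(top_critics - all_critics, popmean=0)
print(paired.pvalue, one_samp.pvalue)  # identical

# two independent samples; don't assume equal variance
ind = stats.ttest_ind(gentoo, adelie, equal_var=False)
print(ind.statistic, ind.pvalue)
```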
- So there are a lot of other tests, but this is the usual practice for a significance test:
- you compute a statistic, and then you compare that statistic to a known distribution
- that describes how it would be distributed under the null hypothesis.
- In these cases, we compute a t statistic, and the t distribution describes how that statistic is distributed when the null hypothesis is true.
- We can also use these statistics to bootstrap,
- although sometimes we might not use the standardized value like the t statistic, and instead directly bootstrap with the means.
- To wrap up: significance tests compute the probability of a statistic under the null hypothesis.
- Fundamentally, under the hood, that's what they do.
- The tricky parts are that you need to have the statistic, and you need to know its distribution under the null hypothesis.
- The t-test does this with the t statistic, which is a standardized version of
- the difference of the observed mean from the mean that you're trying to compare against.
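As a sketch of how these SciPy calls fit together, here is a minimal example with synthetic data made up for illustration (the variable names and numbers are not from the assignment):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic paired scores: e.g., "all critics" and "top critics"
# ratings for the same 50 movies (made up for illustration).
all_critics = rng.normal(7.0, 1.0, size=50)
top_critics = all_critics + rng.normal(0.2, 0.5, size=50)

# Paired t-test: are the two paired measurements different?
paired = stats.ttest_rel(all_critics, top_critics)

# Equivalent formulation: one-sample t-test that the mean difference is zero.
diffs = top_critics - all_critics
one_sample = stats.ttest_1samp(diffs, 0)

# The two formulations give the same p-value (statistic up to sign).
print(paired.pvalue, one_sample.pvalue)

# Independent two-sample t-test; equal_var=False requests Welch's t-test,
# which does not assume the two groups have equal variance.
group_a = rng.normal(5.0, 1.0, size=40)
group_b = rng.normal(5.5, 2.0, size=40)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)
print(welch.statistic, welch.pvalue)
```

Running the paired and one-sample versions side by side is a good way to convince yourself they really are the same test.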
📓 Tying It Together
I will be adding a notebook reading here to tie together some Week 4 and 5 material.
In this video, I talk about how the quantitative data science methods we are learning fit into a broader picture of source of knowledge.
CS 533 INTRO TO DATA SCIENCE
WAYS OF KNOWING
Understand different sources of knowledge
Identify sources of knowledge appropriate to a question
Photo by Jaredd Craig on Unsplash
Some Types of Knowledge
Analytical / Proof
No form is the “best”. And, we often need multiple together!
These are both empirical
Numeric measurements and estimates.
Predicting numeric outcomes
Observational studies – what can we see?
Experimental studies – what happens when we act?
Simulation studies – what happens in synthetic environments?
Numbers. What Are They Good For?
How often does X happen?
When X happens, is Y likely to happen?
Why does X happen?
What do the situations where X happens look like?
What new X should we look for?
Interpretation and analysis.
Why did someone do X?
What do people want?
Engages directly with source material.
Note: surveys are not qualitative.
Goal: identify employee concerns and opportunities to improve
Method: semi-structured interviews
Resulting Data: interviews with employees
Don’t have time to interview them all.
Goal is not statistical representation.
Example Method: Grounded Theory
Break interviews into distinct statements / observations
Group observations together into themes
If it is like an existing theme: add it
If it is not like an existing theme: start a new one
Merge themes if/as necessary
Stop at saturation – more data does not produce more concepts
Outcome: themes with illustrative quotations
Qualitative methods are empirical
Conclusions are grounded in and supported by data
The data is just analyzed differently!
Qualitative can demonstrate:
Existence of phenomena
Mechanisms for phenomena
Data is “thick”, but low in quantity
Qual and quant can work together
Qualitative study with small group to identify concerns
Quantitative survey to measure prevalence
Quantitative analysis to identify stores with low employee retention
Qualitative interviews to see what is happening at those stores
Numbers Have No Meaning
What does 30 mean?
We need context to interpret the meaning
Typical size of this class
High for graduate classes, average for department
Class size affects student-teacher engagement
We need context to
Identify quantities of interest
Mathematical proofs and derivations.
Always true (given assumptions)
May be probabilistic
Example: law of large numbers
On average, the average of a random sample is approximately the true average
Analyze and explicate power structures.
Who has power in a situation?
What are the effects of that power?
How does perspective & power affect questions and conclusions?
Example: employer-employee power dynamic affects response to survey.
Who is making decisions?
What to ask
Who to ask
What to do with the results
Asking employees is not sharing power
A plausible or scientifically acceptable general principle or body of principles offered to explain phenomena.
Theory drives analysis – what questions do we ask?
Theory comes from all these sources
Data refines, confirms, and rejects theory
Knowledge is said to be socially constructed.
Knowledge comes through processes
These processes are social
And therefore have biases, etc.
Does not mean knowledge is fake, made up, or unreliable.
Quantitative measurements and analysis are not the only, or the best, way to produce knowledge.
Multiple sources together give us knowledge and insight.
- Hello again. In this video, I want to talk to you about ways of knowing, what's called epistemology. Our learning outcomes for this video
- are for us to understand different sources of knowledge and identify sources of knowledge that are appropriate to a question.
- Because of the way that we're going about work in this class, we said at the very beginning that
- data science is the practice of using data to gain quantitative insight into questions of social, scientific, or business interest.
- It is not the only way to obtain knowledge; it is not the only source of knowledge.
- And so I want us in this video to think a little bit about some of these other sources of knowledge and how they relate to each other.
- So some types of knowledge we can encounter are: quantitative knowledge, which we've been working on in this class
- and will work on for the rest of the class, and that's around numbers and quantities; qualitative knowledge,
- which looks at the actual content. There, we aren't counting things
- or computing statistics; we're looking at what texts or videos or images actually say. Then there's analytical knowledge, or proofs.
- These are mathematical derivations.
- There's no data in them that's driving them.
- For example, the fact that the t statistic is distributed under the t distribution, under the null hypothesis:
- that's mathematically proven. We didn't do a bunch of simulations or experiments to figure that out.
- There's theory building to put together knowledge from a variety of sources and from a variety
- of experiments or studies or investigations into a coherent theory of how a subject works.
- And then there's critical knowledge, which pokes at and investigates the structure, particularly the power structures:
- how a process for producing knowledge, or a thing we're studying, is put
- together, how it got to where it is, and what impact that may have on the results that we get from it.
- One thing that's important to note is both quantitative and qualitative forms of knowledge are empirical.
- They're getting from the data to the conclusions, or to the insights that we have on
- the question. Quantitative methods do not have a monopoly on being data driven.
- No one form of knowledge is the best.
- And we often need multiple forms of knowledge together to gain a full understanding of something that we're trying to study.
- So quantitative methods are the things that we've been doing this semester: numeric measurements and estimates.
- We could ask how many of something there are, or what fraction have some property.
- We can try to predict numeric outcomes. There are lots of different things we can do with numbers.
- And there are various sub-methods. For example, observational studies
- look at what we can see. We have data that we've collected:
- either we've gone and collected it in the field,
- or we have it from a dataset. We've observed some process, and we can look for correlations,
- we can look for connections, and we can try to gain insight based on what we can observe in the world.
- Experimental studies look at what happens when we intervene, when we act, when we change something:
- how does what we're studying respond? So an observational study can be: we look at our penguins and we can see their flipper lengths,
- we can see their body masses. Experimental studies let us get at,
- well, what happens when we give some penguins one diet and other penguins another diet,
- to isolate the impact of diet on, say, penguin growth?
- And a simulation study looks at what happens in synthetic environments,
- and these allow us to run a lot more analysis and studies than we can in an experiment because we don't have to collect data.
- We don't have to use experimental resources in order to create and observe changes.
- We can just generate a bunch of random numbers
- according to some structured process, and these are really good for understanding, in particular, the behavior of our methods.
- So if we have a method for studying what would happen in an experiment, we can simulate a bunch of different experiments,
- maybe with different outcomes, to see what our method would do, because we got to control the experimental results, because we made them up.
- This is a really powerful tool for studying the statistical properties themselves.
- And one of the things, especially for those of you who are in the data science Ph.D.
- program, but also as you get deeper in your data science knowledge,
- that's going to be very important for you is to be able to level up
- from using the methods that we're learning to study phenomena
- (we're going to use our linear models to study something about penguins, maybe) to being able to study the methods themselves.
- What can we say about how a linear model works, or about how a t-test or the bootstrap works? Simulation studies are really useful for that.
- These quantitative methods can address some really useful questions. We can see how often something happens.
- We can see when it happens: when X happens,
- how likely is it for Y to happen? We can get at these kinds of how often, how many, how likely questions.
- What's hard to answer is why does X happen, or when X happens,
- what do those situations actually look like?
- So, an example, and I'm going to talk a little bit more about it as we go through the slides:
- we've got a large company with a number of locations, say we're a supermarket chain,
- and we want to understand employee retention. We can look at, say, how often employees leave, or leave within their first year.
- We can ask: when some particular change happens at the store, a change in personnel policies,
- how likely are employees to leave? Or if one employee leaves,
- are there more that are likely to follow? But what the numbers can't give us is, for any given employee leaving,
- what was the situation? Why did they leave? They also don't give us insight into what new phenomena we should look for.
- The numbers and the quantitative methods can't tell us on their own
- what questions to even ask of the data. So this gets us to qualitative methods.
- And qualitative methods are about interpretation and analysis.
- It lets us look at questions like: why did someone do X? Quantitative
- lets us ask how often it happened. Qualitative lets us start to get at
- why they did it, why it happened, or what people want.
- And the key thing about qualitative methods is that they engage directly with some kind of source material:
- written text, images, interviews and interview transcripts.
- You've got some source material, often textual,
- but sometimes it's visual, sometimes it's video, and you're engaging directly with it and interpreting it.
- A human does the interpretation; you can't import qualitative, that's not a thing.
- There's a very human process of engaging with the material in order to derive knowledge.
- One quick note is that surveys, while they get data from humans, are not on their own qualitative.
- If you ask someone a Likert-style survey, strongly disagree to strongly agree, on a bunch of questions,
- you're getting quantitative data. If you have open free-form text and you start to analyze what they said,
- that's qualitative data.
- So in our example problem, if we want to understand employee concerns and opportunities to improve how our organization is treating employees,
- we could go do some semi-structured interviews. A semi-structured interview
- is an interview where you have a plan going into it, but it's not a rigid "answer
- these questions in a row and we're done" kind of plan. You have goals for the interview.
- You have key questions thought of in advance.
- But the conversation in an interview can also go in the directions that seem natural as the interviewer leads through it.
- And the resulting data is that you've got all of these interviews with employees.
- You're probably going to transcribe them, because when doing the analysis it's often a
- lot easier to work with the transcribed text rather than the audio.
- But it's expensive. If you're going to sit down and have a half-hour interview with each employee, that takes a long time.
- You don't have time to interview them all.
- One key thing: when we're doing quantitative methods, our goal is statistical representation, that the statistics we compute are
- accurate estimates of the underlying parameters. The goal in qualitative analysis is not statistical representation.
- The goal is to make sure that the different
- subgroups and characteristics and dynamics that you might be able to find in the data are all represented at least once.
- We're not going to be getting statistical counts out of this.
- And so statistical representation is just not the right way to frame thinking about qualitative responses.
- So, one particular method. There are many methods for qualitative analysis.
- I have spent my research career of 10-plus years now getting very good at a lot of quantitative methods.
- You can spend that much time, and the entire rest of your career, getting very good at qualitative methods.
- They require just as much skill as the quantitative methods; they are no easier to do well.
- But one particular qualitative method is called grounded theory,
- and the idea of grounded theory is to start from the text itself and build up to ideas and figure out what themes are in the text
- and how they connect to each other in a way that is as close to the text as possible and is derived from the text as possible.
- And to get the ideas and the concepts from the text rather than from your preconceived ideas going into it of what should be in the text.
- So what you do when you're doing grounded theory, if you're going to do grounded theory on these interviews,
- is that you would break them into distinct statements or observations and then you would group these observations together into themes.
- So you would take an observation, and if it's like an existing theme that you have, you would add it to that theme.
- And if it's not like any of the existing themes, you would start a new theme. You take an observation and ask:
- is this like any of the five kinds of things I've seen so far? Yes?
- Put it in the pile. (Actually physically moving things around into piles is useful for this.)
- And if not, we're going to start a new one. So you have this matching process going on.
- And then as you're going through the process, or once you get to the end of the process, you may merge
- themes. You may say, oh, these things seemed separate
- as I was coding, but they really do seem like they go together, or maybe they're subthemes of another theme.
- Now, in statistics, we have this notion of, OK, we have a statistically significant sample or a statistically significant result.
- We know if we found it; we can talk about our sample size. With this kind of a method,
- what you use as your stopping condition is what's called saturation. (Unlike stopping early to get statistical significance, stopping at saturation is valid.)
- The idea of saturation is that more data does not reveal more concepts.
- So you've grouped your things together, you found 12 themes, and doing this grounded theory process with another
- few interviews does not generate any new themes.
- If you've got confidence in the process, then what this means is that you've hit conceptual saturation.
- You have identified the themes that are coming from your data. More data isn't adding new themes.
- So we have the themes; our goal here was to extract themes, and now we have them.
- The outcome of this process is our themes with illustrative quotations.
- So we've put these observations into the different themes, and these observations themselves,
- the things that our interviewees, our employees, said,
- those are quotations that illustrate the dynamic going on in that theme.
- As I said before, qualitative methods are empirical. The conclusions are grounded in and supported by data.
- The data is just analyzed differently to get to those conclusions. Qualitative is very useful for demonstrating the existence of a phenomenon.
- People do this, or there's this reason why people leave:
- yes, we have at least one person who said that was the reason they left in our data set.
- It can also help us elucidate the mechanisms for phenomena, because it allows us to dive much more deeply into each
- specific example, to understand what the factors were that caused this employee to leave, or that cause them to continue to stay.
- And the data is what's sometimes called thick, but low in quantity.
- We don't have very many of these interviews. Relatively speaking, we might have 10 or 20.
- We might even have 50. But we're not going to have the hundreds of data points we might have from a survey or from some other quantitative method.
- But each of those individual data points,
- an interview with an employee, is much richer and gives us much more detail on individual samples.
- You can also mix methods, what's called mixed methods, research.
- There's a variety of ways that you can combine quantitative and qualitative methods together.
- I'm going to talk briefly about two.
- One is where you do a qualitative study first. You can do a qualitative study with a small group to identify what concerns people have.
- You may do some interviews with employees to understand some of their concerns around workplace safety or workplace environment.
- And then you can follow that up with a quantitative study to see how widely held the different
- concerns that came up in your qualitative study are. Qualitative demonstrates existence;
- quantitative measures prevalence. Another approach is quantitative first.
- You could do a quantitative analysis to identify the stores with low employee retention.
- That's something quantitative analysis can do well. And then a qualitative analysis can help you understand why.
- So go to some of those stores, interview some of the employees,
- interview some of the managers and understand what's happening at the stores that have low employee retention.
- These are just two ways that we can use qualitative methods and quantitative methods together: either to have the qualitative methods
- generate the hypotheses that we're going to test and measure the prevalence of quantitatively,
- or to better understand the dynamics and the nuances of what's happening in an effect that we see quantitatively.
- So, 30. What does 30 mean?
- Numbers on their own don't mean anything. We have to have context to interpret their meaning.
- Thirty is the typical size of this class. (This fall, we only have twenty-three,
- but the typical size of this class is 30.) OK, that gives me a little more meaning.
- At least we know what we're talking about now. In our department,
- that's large for a graduate class, but it's about average for the department's classes overall.
- OK, that helps. It gives me context to understand that we're looking at a large graduate class.
- Why do we care about class size? Well, class size affects teacher-student engagement.
- There are many other reasons why we might care about class size,
- but it's easier to individually engage with students when the class size is relatively small.
- That gives us more context for understanding why we even care about this value.
- So we need context to interpret quantities and identify quantities of interest.
- Qualitative methods can give us a lot of that context in a really, really rich way.
- Now I want to talk briefly about analytical methods. Analytical methods deal with mathematical proofs and derivations.
- As I said, they're usually quantitative, but they're not empirical. We're not looking at what data is telling us.
- We're looking at the intrinsic, fundamental properties of a computation, of a method, etc. They're always true, given their assumptions.
- If we have a proof that given some assumptions X follows, that's not something that depends on conditions:
- when the assumptions hold, assuming that the proof is correct, it holds. It may be probabilistic.
- The proof we have may be about how often something happens, like confidence intervals.
- Confidence intervals don't always contain the mean, but we have a probability:
- we have a proof about how often they do contain the mean.
- Another set of methods I want to bring up a little bit is critical methods, including what's called critical theory.
- The key idea of these methods is to analyze and explicate the power structures that
- are at play in the thing that we're studying, or in our process of studying it.
- Who has power in a given situation? What are the effects of that power and that distribution of power?
- We can also ask: how do perspective, and the power or lack thereof that comes with it, affect
- the questions we ask and the conclusions that we come to from them?
- For example, if we're trying to understand employee retention, the fact that it's the employer or the employer's agent talking to the employee
- creates a power dynamic that can affect the employee's response to the survey. If they think
- the things they say might affect their employment or the terms of their employment,
- that might affect what they say. That power dynamic can, in a very direct way, affect the data
- and the results that we get out of it. So one application for thinking about power is to look at who is making the decisions.
- This looks at the overall process itself.
- We're doing this study to understand employee retention, and what is causing low employee retention,
- or what's making our employees want to stay with the organization.
- One thing critical methods are going to get at is: who is making the decisions about the study?
- Who's deciding what questions to ask in our survey or our interviews?
- Who is deciding who to talk to? And who's going to be deciding what to do with the results?
- Because if it is a collection of employees that are deciding what to do with the results,
- that's very different than the organization's management deciding what to do with the results.
- One thing that's also important to note: asking employees questions about retention
- may be done in very, very good faith.
- We want to make the organization a good place to work, so we're going to ask our employees
- what they see going on, and how they think about their work.
- But asking the employees on its own is not sharing power, in the way that critical methods
- think about who has power and how it's distributed.
- So then I want to talk a little bit about theory, not in the critical theory sense, but in the theory-building sense.
- The dictionary defines theory as "a plausible or scientifically acceptable general principle or body of principles offered to explain phenomena."
- It's a coherent framework that links together our findings from a lot of different studies, from a lot of different experiments,
- from different sources of knowledge, into a coherent explanation or way of thinking about a phenomenon or a set of problems. And theory drives analysis,
- because if we're doing good data science, if we're doing good quantitative analysis, our questions aren't just made up.
- They come from a reason we have, from theory, why that's a relevant question to ask.
- And theory, in turn, comes from all of these sources, because data
- and analytical and critical inquiry refine, confirm, and sometimes reject theories.
- The process by which that happens is long and complicated, and is the subject of study
- of both the philosophy of science and the discipline called science and technology studies.
- But theory and our different modes of inquiry have this
- interactive relationship: we have a theory that tells us things to go look for, and we go look for them.
- Those, in turn, come back and help us refine our general understanding of how the thing we're studying works.
- Lastly, I want to say a little bit about a science and technology studies concept.
- It's often said that knowledge is socially constructed, and this becomes relevant particularly when we
- try to understand how our tools work, and their limitations and their possibilities.
- It also becomes very important when we start to think about the social ramifications
- of data science and the ramifications of social structures on data science.
- What this means is that knowledge comes through processes, and those processes are social.
- Social construction does not mean that knowledge is fake or made up or unreliable, or, "oh, it's socially constructed, so we can dismiss it."
- No, that's not what it means. All it means is that what we know about the world, we know through social processes.
- Because you're in this class learning how to do data analysis.
- I am teaching you a particular way of approaching data, a particular way of applying statistical tests and interpreting the results.
- That is a social process. We're having these dialogs and conversations about it.
- I learned these through a social process. When we go and publish papers,
- we subject them to a peer review process. This is a social process: our peers look at the work and assess it.
- And as a community, we come to our notions, sometimes agreed upon, sometimes contested, about what kinds of questions we should ask,
- how we should go about answering them, and how we know if what we've found is true.
- While our tools for doing that may be very reliable, in the sense that they produce knowledge that's useful for predicting future behavior
- and for understanding what's going on around us,
- they're still social processes; they were developed socially, and
- they therefore have the various biases and limitations of a social process.
- And understanding that, I find, helps me
- respect science and statistics and the quantitative methods we're studying
- more. Rather than putting them on a pedestal that they can never live up to, treating them as the ultimate source of knowledge,
- where some basically say,
- "if we can't show it with a randomized controlled trial and a statistical test, we're not going to believe it,"
- it puts them in the context of what they can actually do. They can generate a lot of quantitative insight that can help us understand the world better.
- We have processes that we think are probably pretty good.
- Understanding the social nature of these processes is a useful tool for us to be able to think about how to make them better.
- And so it's important to recognize the social nature of our methods, but not to let that paralyze us,
- and, as I said, not to let it dismiss them. It doesn't mean we're not finding useful knowledge.
- It just means there are social processes that got us to where we are today.
- So to wrap up: quantitative measurements and analysis are not the only, or the best, way to produce knowledge.
- There is no best way to produce knowledge. Multiple sources of knowledge together give us knowledge and insight.
🚩 Week 5 Quiz
The Week 5 quiz is about material through this point.
The subsequent videos are to help you better understand and contextualize material.
📓 One Sample Notebook
The One Sample notebook demonstrates how to compute a one-sample t-test, and draw a Q-Q plot to compare a distribution with normal.
🎥 Python Errors
This video discusses common Python errors and how to read errors.
CS 533 INTRO TO DATA SCIENCE
Read a Python error message
Understand common kinds of Python errors
Photo by Wesley Tingey on Unsplash
Stack Trace Components
Location in the code
Error information (type + message)
Common Error Types
NameError – trying to use a variable that does not exist
AttributeError – trying to use a class member (method or field) that does not exist
KeyError – trying to look up a key (in a dictionary or Pandas index) that does not exist
Sometimes arises when Pandas doesn’t correctly figure out how to process the thing you’re using to index
IndexError – accessing an item by a position (in a list or with Pandas .iloc) that is out of bounds
Common Error Types
FileNotFoundError – trying to access a file that does not exist
Reading: the file does not exist
Writing: usually the directory doesn’t exist
OSError – other error reading/writing a file
Reading an Error
What kind of thing went wrong? (error type)
What specifically went wrong? (message sometimes helps)
Where did it go wrong? (stack trace)
Key is to understand what went wrong, in your code.
Copying errors into Google often doesn’t help.
Need to understand problem, and translate back to your context
Understanding helps you learn, but sometimes hard
Python errors are reported by exceptions.
The stack trace shows you where in the code the exception occurred.
Photo by Quino Al on Unsplash
- Hello. In this video, I want to talk with you just a little bit about how to read and interpret Python error messages. The
- learning outcomes are to be able to read a Python error message and understand
- common kinds of Python errors and the things that tend to cause them.
- So, you get a Python exception; Python reports errors typically through exceptions.
- We write a lot of code and it gives us an error. So how do we read this?
- There are a few things we want to look at. First, it's going to tell us, near the top,
- the type of error. This is an AttributeError. Then at the bottom, it tells us, again, the error type,
- and it gives us a message that often gives us more information about what precisely went wrong. In this case,
- it tells us the DataFrame object has no attribute dfa. I ran this in a notebook where ratings is a data frame.
- There is no attribute dfa, and the AttributeError is telling us that.
- So, starting at the bottom here: this is one of the most important things to look at.
- This is going to tell you what went wrong. And understanding what went wrong is the first piece of being able to understand why your code broke.
- Then the other piece we have is the traceback. The code says "Traceback (most recent call last)".
- So this tells you where the error happened, and the most recent call, as it says, is last.
- So if we go from the bottom up, we're going to find this happened in the
- Pandas core generic code, in getattr; that's deep in Pandas.
- And then it happened in our code: ratings.dfa. A good place to start in understanding a stack trace is to look
- for the last entry that's in code you wrote, since that's the innermost one,
- because that's where in your code the error happened.
- But the stack trace lets us work back up to understand why we got the error that we did, and will often give us insight.
- So, the data frame has no attribute dfa, but where did we actually try to access that?
- Oh, we accessed it in our code by saying ratings.dfa, and there's no such attribute.
- That's giving us the error. Being able to read the error message gives us a lot of useful information about what went wrong and being able to read the
- error message and think about what that means in terms of your code is a crucial skill to being able to debug your python.
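As a sketch of the kind of error described above, the following reproduces an AttributeError on a small data frame. The `ratings` name and the misspelled attribute `dfa` are stand-ins for the notebook example, not the actual course data:

```python
import pandas as pd

# A tiny stand-in for the ratings data frame from the lecture example.
ratings = pd.DataFrame({"user": [1, 2], "rating": [4.0, 5.0]})

try:
    ratings.dfa  # no such column, method, or field -> AttributeError
except AttributeError as e:
    # The message names both the object type and the missing attribute.
    print(type(e).__name__, "-", e)
```

Run without the try/except, this produces the full traceback discussed above, ending with the AttributeError and its message.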
- So there are a few components here, as we've seen: the error type, the location in the code where it happened,
- and additional information to go with the error.
- Not all error types, and not all code that throws errors of a particular type, will put very useful information in the error message.
- It really varies from error to error and from library to library. But a lot of them do put useful information into that error message,
- once you know how to interpret it. There are a number of types of errors that you're going to see rather commonly.
- The first one is a NameError, and a NameError means you're trying to use a variable that doesn't exist.
- There are several different kinds of errors for referencing things that don't exist.
- A NameError means you tried to use a variable x, and x does not exist. An AttributeError means you're trying to use a class member,
- a method or a field on a class or object, that does not exist.
- Python calls those attributes. So when we have a data frame,
- and we say .mean, what we're doing is getting the attribute mean,
- which happens to be a function, because that's how Python implements methods. An AttributeError means we're trying to access one of these attributes.
- It can be a method, it can be a data field, but it doesn't actually exist.
- A KeyError happens when we try to look up a key, either in a dictionary or in a pandas index.
- So when using .loc, a KeyError will come up if the key doesn't exist.
- One way this arises is when you're passing something to .loc,
- or just to the square brackets on a data frame, and pandas isn't correctly interpreting the type of object that you're giving it.
- That's one way this can be caused.
- But the KeyError means it can't find the thing it's looking for in something that's looked up by key.
- With pandas indexes,
- if you're passing in a list or an array of index values, all of them have to be in the index, or you're going to get a KeyError.
- An IndexError happens when you're accessing an item by position, zero through n minus one,
- either in a list, or with pandas' .iloc, or something else that uses this kind of zero-based indexing,
- and the index is out of bounds. It could be a NumPy array as well.
- So these are all different kinds of 'doesn't exist'. A NameError means the variable doesn't exist.
- An AttributeError means the class member doesn't exist. A KeyError means the key in the dictionary or index doesn't exist.
- An IndexError means you've given it a numeric position that's out of bounds.
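A minimal sketch of these four "doesn't exist" errors, using only made-up names and standard-library objects:

```python
def error_name(thunk):
    """Call a zero-argument function and report the exception type it raises."""
    try:
        thunk()
        return None
    except Exception as e:
        return type(e).__name__

print(error_name(lambda: undefined_variable))      # NameError: no such variable
print(error_name(lambda: (3.0).no_such_method()))  # AttributeError: no such attribute
print(error_name(lambda: {"a": 1}["b"]))           # KeyError: no such key
print(error_name(lambda: [1, 2, 3][10]))           # IndexError: position out of bounds
```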
- A few other errors you're going to see are FileNotFoundError, which means you're trying to access a file that doesn't exist.
- If this happens when you're reading a file, that usually means the file doesn't exist.
- If you get this when writing a file, what it usually means is that you're trying to write into a directory that does not yet exist.
- The file doesn't have to exist for you to write it, but the directory you're writing it into does.
- An OSError often happens when we're reading or writing a file,
- when there are other errors in the process, either of opening the file or of actually reading its data.
- One way you might see this is if you're trying to write to a file on Windows that you already have open somewhere else.
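A sketch of the two FileNotFoundError cases; the file and directory names are hypothetical, and a fresh temporary directory is used so the paths are guaranteed missing:

```python
import os
import tempfile

# Work in a fresh temporary directory so the paths do not exist.
workdir = tempfile.mkdtemp()

try:
    open(os.path.join(workdir, "missing.csv"))  # reading: the file must exist
except FileNotFoundError as e:
    print("read failed:", e)

try:
    # writing: the file may be new, but the *directory* must exist
    open(os.path.join(workdir, "no-such-dir", "out.csv"), "w")
except FileNotFoundError as e:
    print("write failed:", e)
```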
- So to read an error, you want to understand three things. You want to understand what kind of thing went wrong:
- look at the error type. Do you have a NameError? A KeyError? A FileNotFoundError?
- These all have different causes. Then you want to see what specifically went wrong.
- The error message might tell you; for a FileNotFoundError, it will usually tell you what file was not found.
- And then you want to ask where it went wrong, from the stack trace.
- The key thing here is to understand what went wrong in your code so you can fix it.
- Understanding this can really help you learn, as you understand not just how do I fix the error, but why did the error occur?
- Oftentimes that gives more insight into how the code, the libraries that we're using, and Python itself actually work.
- Copying errors into Google can help you find useful resources.
- But it's often not a good strategy for solving the problem, because
- someone else may have encountered that error, and good resources will give you the kinds of things that help you fix that kind of error,
- but there's a good chance no one else has encountered that exact error,
- and the solution that worked for them might not be applicable in your context.
- And so it's important, when we get an error message, to understand why it went wrong so we can figure out how to fix it in our particular context.
- So to conclude: Python errors are reported by exceptions.
- When we get an exception, we get a stack trace that includes a lot of useful information about what exception happened and where in the code it occurred.
- Learning to read these is going to help us better understand the system that we're using and be able to fix and debug our code.
🎥 Python Libraries
CS 533INTRO TO DATA SCIENCE
Know the relationships between key libraries in Python
Photo by Shunya Koide on Unsplash
NumPy provides array features
ndarray (n-dimensional array)
Many manipulation functions
Array broadcasting support
Backbone of vectorization
If You Need
Array functions – NumPy
Advanced array / basic stats – SciPy
Signal processing, etc.
Labeled data management – Pandas
Statistical Models - statsmodels
- In this video, I'm going to briefly talk to you about the relationships between some of the different Python libraries that
- we've been using. Our learning outcome is to know the relationships between these key data science libraries.
- So NumPy is the foundation of most of what we're doing, and it provides the array features.
- It has the ndarray, the n-dimensional array data structure, that we use to represent vectors and matrices.
- You can also use it for tensors, which are three-dimensional: they have rows, columns, and depth.
- There are many manipulation functions to do
- vector operations (vector-vector, vector-matrix, vector-scalar, matrix-matrix) and to reshape and manipulate arrays and matrices.
- It also has what's called array broadcasting support.
- So you can multiply a matrix by a vector, and that doesn't do a matrix-vector multiply;
- if the vector is as long as the matrix is wide, it will
- apply the same vector to every row. NumPy really is the backbone of vectorization,
- and the textbook talks a lot more about how to use NumPy's various vectorization capabilities.
- It gives us the vectorization operations, and it's the backbone of the rest of the libraries.
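A quick sketch of the broadcasting behavior just described; the arrays are made up:

```python
import numpy as np

m = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])    # shape (3, 2)
v = np.array([10.0, 100.0])   # shape (2,): as long as m is wide

# Broadcasting: the same vector is applied elementwise to every row.
print(m * v)   # rows: [10, 200], [30, 400], [50, 600]

# Contrast with an actual matrix-vector product:
print(m @ v)   # [210, 430, 650]
```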
- So we put them together. SciPy builds on top of NumPy and adds additional scientific capabilities and some primitive statistical capabilities.
- Pandas builds on top of NumPy to give us the data structures that allow us to assign labels to columns and rows.
- So we have our column names, we have row names. A NumPy matrix has to be homogeneous:
- every element has to be the same type. In a pandas data frame, each column can have a different type.
- The columns have names. We can have row indexes.
- It also gives us some additional data types, such as a categorical data type that NumPy does not natively support.
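A sketch of those pandas features, with made-up column names: a data frame whose columns have different types, including a categorical column that NumPy has no native equivalent for:

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["Alien", "Up"],                      # string (object) column
    "rating": [4.5, 4.0],                          # float column
    "genre": pd.Categorical(["scifi", "family"]),  # categorical dtype
})

# Each column keeps its own dtype; a NumPy matrix could not mix these.
print(movies.dtypes)
```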
- It also provides additional date/time capabilities. statsmodels
- builds on top of pandas and SciPy to give much more sophisticated statistical modeling capabilities.
- There's overlap: some functions appear in both NumPy and SciPy, in NumPy and pandas, or in SciPy and statsmodels.
- But statsmodels builds on all of these to give us relatively sophisticated statistical modeling capabilities.
- If you're coming from R, statsmodels is where you're going to find a lot of the linear models,
- a lot of the hypothesis tests, and the things that you might be looking for from R. Then, for plotting:
- Matplotlib is the low-level plotting library on which most other plotting infrastructure,
- particularly static plotting infrastructure, in Python is built. It works directly with NumPy. Seaborn builds on top of Matplotlib.
- It also uses SciPy and pandas and statsmodels to draw some of its more sophisticated plots, but it pulls all of this together
- to let you do some of your data visualization. So if you need basic array functions, you're going to get those from NumPy.
- If you need to do more advanced things with arrays, and some basic statistical functions,
- then you're going to reach for SciPy. SciPy has signal processing;
- it's got a whole bunch of different things, and it also has sparse matrices, which you're going to see later.
- And it has a whole bunch of statistical distributions. If you need to work with labeled data, and basically
- any of the observational data that we're going to be pulling in for our work in this class, you're going to reach for pandas.
🎥 Learning More
In this video I talk about how I go about expanding my own data science knowledge and techniques, with the goal
of giving you ideas for how you can continue learning beyond this class.
- Hi, no slides this time. I want to talk to you for just a few minutes about where to learn more. We've introduced a number of Python things,
- and we're going to see more Python things. We've introduced some basic inferential techniques,
- and we're going to see more inferential techniques, as well as techniques of other kinds.
- But this class is certainly not able to cover everything you're ever going to need to know.
- You're not going to cover everything you're ever going to need to know in a master's degree or a PhD.
- So where do you go to learn more? One immediate thing is to take more classes. For many of you,
- this is your first class in a data-science-oriented master's or PhD program.
- You're going to take, say, the machine learning class in the spring.
- You're going to take maybe natural language processing, maybe recommender systems.
- Maybe you're going to take one of the information retrieval classes or social network analysis.
- There's a variety of classes you can take: classes in mathematics and statistics, or econometrics in the econ department.
- There's a variety of classes that can help fill in some knowledge.
- But beyond that, there's a lot you can do to study and learn more.
- There's lots of books I'm referencing some in this class.
- I'm going to be adding links to more as we go through the class and get to more techniques that are worth pointing you to a book for.
- But beyond that, one of the things that I do is read a lot, particularly research papers.
- I read a lot of things on the Internet, and I pay attention to the statistical methods.
- So I'm very active on Twitter, for example, and I follow a number of people who tweet about statistical methodology.
- Some of them are very advanced and sophisticated statisticians, and I pay attention to what they have to say.
- What are they saying about statistical methods? Oftentimes I need a lot more background to fully understand the argument that they're weighing in on.
- But I'll see if it seems relevant to anything I'm working on and maybe go and get some of that background.
- And so one source is seeking out
- Blogs, social media feeds, etc, from people who are regularly posting information about statistical techniques or data science techniques.
- You have to be careful. There's a lot of bad information out there. But there's a lot of quite good.
- There's a lot of quite good and well-informed statisticians and data scientists who are posting things that you might want to pay attention to.
- And some of them are also very, very funny. Also, as I'm reading research papers, I pay attention to their methods.
- Sometimes it's just to better understand the methods.
- Sometimes a paper will have an interesting method that might be useful for me to adapt to my own research.
- I also reflect on my processes as I'm working on a project.
- I ask what's working, what's not? Where are the holes?
- Where is this project at the edge of my knowledge, where I really need to understand something else to push it forward?
- I pay attention when I'm at conferences to what methods do people seem to be applying?
- What methods do I maybe need to add to my tool box?
- Because people are getting a lot of value out of them and doing the kinds of research that I try to do.
- I also spend time trying to understand my tools: reading the documentation for things like pandas and NumPy, and trying things out.
- Oftentimes I'll spin up a notebook just to try something out, to see how it works and how it works on my data.
- Looking at examples: I'm pretty opinionated in my programming, so there are a lot of examples I don't like,
- but looking at examples and trying to understand how other people did things and
- why they did them that way, and seeing what I can incorporate into my own process.
- But a lot of it is practice. And also, I think as I'm working:
- is there a better way I can do this? Is there a way I can do this more efficiently?
- Sometimes I don't care; OK, that was fast enough. But paying attention to, and reflecting on,
- how is this working? How am I actually doing
- this computation? That really goes a long way toward identifying the places where I might need to go level up my skills.
- So at the end of the day, what I do to keep learning more of this material is pay attention to the work, both my work and the work of others.
- Pay attention to the details. How are we getting these results?
- What assumptions is this making? Are those assumptions valid? Can I apply this method, or another technique, in another context?
- It's a matter of asking myself all of that, and then reflecting on my own practice.
- As I'm working on a project: what am I happy about in how this project worked out?
- What am I dissatisfied with?
- What do I need to go learn in order to address the parts of the project that dissatisfied me? And I continually learn more things:
- I spend time thinking, try things out on new projects, and build up my skills incrementally over time.
There are a few things you can do to keep practicing the material:
The HETREC data contains two data sets besides the movie data: Delicious bookmarks and Last.FM listening records.
Download this data set and apply some of our exploratory techniques to it.
Download the SBA data from Week 4’s activity and describe the distributions of more of the variables.
Apply the inference techniques from Week 4 to statistically test the differences you observed in Assignment 1.
📓 More Examples
Some more examples from my own work (these are not all cleaned up to our checklist standards):
📩 Assignment 2
Assignment 2 is due on Sunday, September 25, 2022.