Week 2 — Description (8/29–9/2)#

The learning outcomes for this week are:

To load a data file into Pandas and describe its basic structural characteristics.
To identify the type of a variable and descriptive statistics appropriate to it.
To describe the distribution of a variable numerically and visually.
To describe how data was collected.
To reason about limitations of the collection of data.

This week’s material uses chapters 5 and 10 of 📖 Python for Data Analysis.

🧐 Content Overview

Element	Length
🎥 Describing Data	6m27s
📃 What is a Dataset?	1950 words
🎥 Pandas Basics	8m26s
🎥 Variables and Types	15m56s
📃 Datasheets for Datasets	4100 words
🎥 Groups and Aggregates	14m12s
🎥 Descriptive Statistics	16m32s
🎥 Describing Distributions	11m59s
🎥 Sources and Bias	8m46s
🎥 Codings and Encodings	8m57s

This week has 1h31m of video and 6050 words of assigned readings. This week’s videos are available in a Panopto folder.

📅 Deadlines

Week 2 quiz at 8am on Thursday

🎥 Describing Data

This video introduces the week’s topic: describing data and data sets. We discuss the pipeline by which phenomena become a data set.

Video (6m27s)

Slides

Welcome back. This is the beginning of the material for week two, in which we're going to be talking about describing data.
The learning outcomes for this week are for you to be able to load a data
file into Pandas and describe its basic structural characteristics.
How big is the data file, for example? What data types are in it?
We want to be able to identify the type of a variable and descriptive statistics that are appropriate to that variable.
We also want to be able to describe the distribution of a variable,
to describe how data was collected and to start to reason about the limitations of that collection process and of the representation of data.
So before we get into describing data, I want to talk a little bit about what the data is that we're talking about.
Let's start with a definition from the Oxford Dictionary: data is facts or statistics collected together for reference or analysis.
That's a good enough definition for what we're going to be talking about.
So we have some data points, some facts of some kind or another, that have been assembled or collected together.
So where does data come from? In the broader scheme of what we're trying to do, there are a lot of ways to think about it.
There are a lot of philosophical questions about what data is and where it comes from.
But for our purposes, we can think about it this way:
there's something that we want to learn about.
It might be objects or entities themselves; maybe we want to learn something about people or animals.
It might be a process, like a social process or a natural process, that we want to learn about.
That conceptual thing results in some kind of phenomenon:
either a naturally occurring phenomenon that we can observe, or an experiment
that we can use to observe and elucidate what it is that we're looking at.
I want to make this distinction here between the thing and what we can observe,
because it might be that our observations are a layer removed from the thing that we're trying to study.
For example, it's impossible to directly observe anything about someone's internal mental state.
But we can make observations of what they do or of what they tell us about what they're thinking.
Those observations are really valuable, but it's important to maintain the distinction. If someone tells us they are feeling happy,
that observation is just that: they told us they are feeling happy. It's not directly the underlying mental state.
We have some phenomenon or experiment that then produces raw data: directly what was observed or what was measured as a result of this process.
The raw data is then transformed, cleaned up, documented and labeled to produce a data set that is basically ready to use for some purpose.
We then use that data set to make inferences or analysis or whatever else we're going to do,
which hopefully then give us answers to what it is that we're trying to study, at least partial answers, at least some answers.
But we have these multiple steps here. We have the thing we could observe. We have the observations themselves.
We then have the collection and organization and preparation of the observations into something that's usable
for an inference task or a prediction task or whatever it is that we're trying to do with our data science tools.
One way to summarize it is that the data is the messy pile. The data set is when it's cleaned up and it's ready for us to actually be able to use it.
There are a number of definitions of a data set. One of the readings that I've assigned goes over some of those definitions.
But the common themes across the definitions are that it's data that's collected or curated for a purpose.
It's mostly ready to use and it's documented for that purpose or for that task.
That doesn't mean it's the only purpose or task that it can be used for.
But usually it was created or assembled for some particular purpose or task,
and it's documented in the context of that. So when we get some data,
whether it's raw data or a processed, ready-to-use data set, there are a few things that we need to know.
One is how much data we have: how many records are there, how many columns, how big is the file?
What kinds of data do we have? We're going to talk more about a number of these.
What is the data about? What are the things, the entities, the objects that the data is about?
How was it collected? How was it recorded?
What are the biases in the data? There might be bias in the selection process,
the recording process, et cetera, for the data.
We want to know what it is that we do know about the process, which we will call the data generating process.
The data generating process is the combination of the underlying phenomenon,
the observation mechanism, and actually recording those observations into data.
As I said, the readings that we have for this week discuss this more.
So to wrap up: data sets arise from curating or collecting data, often resulting from some observations, for a particular purpose.
There are layers between the thing that we actually want to study and the data that we have available.
And this week we're going to be leveraging the readings.
In the first week, it was primarily the videos, with the textbook and the Python tutorials to supplement and give you more of an on-ramp. This week,
the readings are a fairly fundamental piece of what it is that we're going to be discussing and working on.

📃 What is a Dataset?

Read What is a Dataset? by Leigh Dodds. This article looks at how the term ‘data set’ is defined by different communities, and the common themes to these definitions.

This article does reference some concepts we haven’t yet gotten to. For example, when they talk about whether the data has been labeled for a particular task, they are referring to labels we would use for a prediction or classification task, which we will discuss in a few weeks when we talk about regression and classification. Don’t feel like you need to understand every concept in this article in your first read; understand what you can, and we’ll return to it.

🌠 MovieLens Data

Download the MovieLens 25M Dataset. The Zip file is 250MB, and the files take about 1.2GB uncompressed.

You can use the smaller 20M or latest-small versions for practice if you want to play with a smaller version.

I will be using this dataset in the demo notebooks for this week.

Data Citation

Harper, F. M. and Konstan, J. A. (2015) ‘The MovieLens Datasets: History and Context’, ACM Transactions on Interactive Intelligent Systems, 5(4), pp. 19:1–19:19. DOI 10.1145/2827872.

This paper is not an assigned reading — it is here for your information.

🎥 Pandas Basics

In this video, I introduce Pandas and the Pandas DataFrame data structure. We see how to load a CSV file and inspect the resulting data frame.

Video (8m26s)

In this video I'm going to introduce loading a data file into Pandas and starting to see what the shape and the structure of the data is.
The learning outcomes of this video are for you to be able to import Python libraries,
load a data file into Pandas, examine the size and data types of a data frame,
and understand the relationship between a data frame and a series in particular.
And we're going to start introducing the concept of an index. We're going to see a lot more about that in a later video.
Most useful Python functions are in modules.
There are some functions that are just built in, but most of the time we're going to need functions in various modules.
And before we can use a module, we have to import it.
There are some standard imports; basically every one of our notebooks is going to have imports at the top.
We import NumPy and Pandas, which give us our basic scientific computing facilities.
Common practice is to import these with aliases, np and pd, so that we can refer to them later by shorter names.
A lot of the files we're going to be working with, particularly early on,
are distributed in a format called comma-separated values. A comma-separated values (CSV) file
consists of one line per record, and the values are separated by commas.
That's where it gets its name: you've got a comma between the different values.
Also, sometimes, and in this case it does, the file will have a header,
so the first row is the names of the columns. It doesn't always have a header, but often it does,
and it's very convenient when it does. Pandas lets us read a CSV file through a function called read_csv.
And so we call read_csv and we give it ml-25m/movies.csv,
the CSV file from the data set that I have you download in the information for the week.
We get a data frame, and following the convention that I showed you in a video last week,
I then just write the variable I saved it in, so that we can immediately see it. Jupyter
formats a Pandas data frame nicely:
it shows us the first five rows and the last five rows, with an ellipsis in the middle indicating that there's elided data.
It also tells us how big it is: about sixty two thousand rows, three columns.
So we immediately get the answer to one of the questions I said we want to know: how much data do we have?
Right here, we already have that answer. We've got about sixty two thousand rows and three columns.
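
As a minimal sketch of the loading step just described: the example below reads a tiny inline CSV instead of the real file, so it runs without the download. The three rows are a hypothetical stand-in that only borrows the movies.csv column names; with the real data you would pass the path ml-25m/movies.csv to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for ml-25m/movies.csv: same column names, only three rows.
csv_text = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
"""

# read_csv accepts a path or a file-like object; the first line is used as the header.
movies = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple, answering "how much data do we have?"
n_rows, n_cols = movies.shape
print(n_rows, n_cols)
```

In a notebook, writing `movies` alone on the last line of a cell displays the formatted table.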
Another way we can look at the data frame is with the info method, which prints out information about the data frame.
In particular, it tells us what the index is.
We have a range index, and it tells us the information about the different columns.
The range index goes from
zero to sixty two thousand
four hundred and twenty two, for sixty two thousand four hundred and twenty three entries.
A range index just means we're looking up the data by
index, zero through n minus one. We have three columns. One of them, movieId, is an int64, and the other two are object.
These store strings; object is the data type Pandas uses to store a string.
It can't just store the string directly in the column. The NumPy arrays we talked about last week
store data compactly, but for strings it has to store a pointer to the string.
And so Pandas uses an array that stores pointers to objects, which can be any object.
We happen to know they're strings in this case.
And we have, as I said, sixty two thousand four hundred and twenty three rows, which answers our initial question.
How much data do we have? Sixty two thousand four hundred twenty three rows, each with three columns.
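
The inspection described above can be sketched as follows, again on hypothetical stand-in rows rather than the real file. One assumption worth flagging: how string columns are labeled (object versus a dedicated string type) can vary by Pandas version, so the checks at the end use type predicates instead of a hard-coded dtype name.

```python
import io

import pandas as pd

# Hypothetical stand-in rows using the movies.csv column names.
csv_text = "movieId,title,genres\n1,Toy Story (1995),Comedy\n2,Jumanji (1995),Adventure\n"
movies = pd.read_csv(io.StringIO(csv_text))

# info() prints the index type, each column's non-null count and dtype, and memory use.
movies.info()

# dtypes exposes the same per-column types as a Series we can inspect in code.
print(movies.dtypes)

# Type predicates avoid depending on how a given Pandas version names string columns.
id_is_integer = pd.api.types.is_integer_dtype(movies["movieId"])
title_is_numeric = pd.api.types.is_numeric_dtype(movies["title"])
print(id_is_integer, title_is_numeric)
```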
And what kinds of data do we have?
We have a movie ID that's an integer, and we have title and genres that are strings.
What is the data about? The data is about movies, and each row is a movie. The Datasheets for Datasets
paper talks about these as instances: each row is an instance, and it represents something.
So in this case, the row is information about a movie and it represents a movie.
Each column of the data frame is a series.
As we mentioned last week, a series is an array that has an index associated with it.
We get a column by accessing the data frame like a dictionary. You can treat a data frame basically as a dictionary that contains columns.
And so we can get the title column out of the movies data frame.
It shows us the titles, and at the bottom it says this series has the name title and length sixty two thousand four hundred twenty three.
It's indexed from zero to sixty two thousand four hundred twenty two, and it has a dtype of object.
We're going to learn a lot more about indexes in another video. But a series, as I said, is an array with an index.
All columns in the data frame share the same index. That's an important link between the different columns.
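
A small sketch of the frame-and-series relationship, built from an in-memory data frame with illustrative values (not the real data set):

```python
import pandas as pd

# Illustrative values only; the column names follow the movies table.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})

# Dictionary-style access returns the column as a Series.
titles = movies["title"]
print(type(titles).__name__)  # Series
print(titles.name, len(titles))

# Every column shares the data frame's index.
same_index = (
    titles.index.equals(movies.index)
    and movies["movieId"].index.equals(movies.index)
)
print(same_index)
```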
So let's load up another frame, the ratings frame.
We can look at its info. It has four columns.
It has twenty five million instances. This is why this is called the MovieLens
25M data set: it contains twenty five million ratings.
Just over twenty five million. Twenty five million. Ninety four.
Each row contains a user ID that's an integer, int64; a movie ID that's also an integer; a rating that's a floating point value, float64,
which is double precision floating point; and a timestamp, which is also an integer of type int64.
The whole thing takes seven hundred and sixty two point nine megabytes of memory. Note that there's not a plus here.
If you remember, the movies frame had a plus after the memory figure. That's because by default, it just measures the memory taken up by the Pandas data frame itself.
If a column has an object type, it does not measure the size of the objects.
So for movies, that was an underestimate because we have all these strings. It was not measuring how much memory is taken up by the strings.
But here we don't have any strings. We don't have any other object types. It's just ints and floats.
So it can tell us directly: this data frame takes seven hundred and sixty two point nine megabytes.
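
The memory figure can be checked with simple arithmetic: four 8-byte columns at roughly 25 million rows. The sketch below uses a hypothetical three-row ratings table (values are illustrative) and then shows the shallow-versus-deep difference for an object column, which is what the plus sign in the info output refers to.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature ratings table with the four column types described above.
ratings = pd.DataFrame({
    "userId": np.array([1, 1, 2], dtype=np.int64),
    "movieId": np.array([1, 3, 1], dtype=np.int64),
    "rating": np.array([4.0, 3.5, 5.0], dtype=np.float64),
    "timestamp": np.array([1147880044, 1147868817, 1141415820], dtype=np.int64),
})

# Each int64 or float64 value takes 8 bytes, so column data is rows * columns * 8.
column_bytes = int(ratings.memory_usage(index=False).sum())
print(column_bytes)

# The same arithmetic at roughly 25 million rows, expressed in MiB.
approx_mib = 25_000_000 * 4 * 8 / 2**20
print(round(approx_mib, 1))

# For an object column, the default count covers only the pointers;
# deep=True also counts the Python strings they point to.
titles = pd.DataFrame({
    "title": pd.Series(["Toy Story (1995)", "Jumanji (1995)"], dtype=object),
})
shallow = int(titles.memory_usage(index=False).sum())
deep = int(titles.memory_usage(index=False, deep=True).sum())
print(shallow, deep)
```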
Data can also refer to other data.
In this ratings file we just loaded, ratings are instances themselves, but each connects a user to a movie.
So we have the rating, but it also references two other kinds of entities or objects: users and movies.
The rating doesn't just exist on its own; it's provided by a user for a movie.
These work a lot like foreign keys in relational databases. We're going to see later how to do a merge so that we can, say,
link ratings to the movie information that they're associated with.
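
As a preview of that kind of linking, here is a minimal sketch using two small hypothetical tables. It assumes the shared key column is named movieId in both, as in MovieLens; the left join is one reasonable choice, not necessarily the one used later in the course.

```python
import pandas as pd

# Hypothetical miniature tables; movieId plays the role of a foreign key.
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 1],
    "rating": [4.0, 3.5, 5.0],
})
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})

# A left join keeps every rating and attaches the matching movie's columns.
rated = ratings.merge(movies, on="movieId", how="left")
print(rated[["userId", "title", "rating"]])
```

Movie 2 has no ratings here, so it does not appear in the result; every rating row does.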
So to wrap up: a data frame consists of columns. Each column is a series, an array with an index.
We can quickly find out how many rows, the instances of the data, there are in a data frame.
We can find out what columns there are and what data types those columns have.
We'll be talking later this week about more things we can do with that, and also more about understanding what the data being stored in these types is.

Resources

Notebook (used for slides as well)
Textbook section 5.1

🎥 Variables and Types

In this video, we learn the different types of data (variables) that we will encounter.

Video (15m56s)

Slides

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
VARIABLES, OBSERVATIONS, AND TYPES

Learning Outcomes
Know the relationship between Pandas structures and statistical variables
Identify the type of a statistical variable
Photo by Fabio Santaniello Bruun on Unsplash

Data Pipeline
Things → Phenomena / Experiment → “Raw” Data → Data Set → Inferences → Answers

Observations
An observation (sometimes called a sample)
Of the values of one or more variables
Pertaining to a single object/instance
Stored in Pandas as a DataFrame
Variables are columns
Data: Palmer Penguins

Penguin Variables
Species: penguin species (Adelie, Gentoo, Chinstrap)
Island: island measured (3 options)
BillLength: length of bill (mm)
BillDepth: depth of bill (mm)
FlipLength: length of flipper (mm)
BodyMass: penguin body mass (g)
Sex: penguin sex (M or F)
Each about one penguin

Continuous Variables

Discrete Variables
A discrete variable takes on distinct values
Can have values with no intermediates
Might not even have order
Basically all other types of data!
Many types (next slides)
Stored as: lots of things

Integer
Integers are discrete
Counts, etc.
Order (4 < 5)
Example: number of penguins measured
Often treated continuously, to e.g. compute means
Average number of cats per cat-owning household (1.8) [1]
Stored as: int, sometimes float
[1] From American Veterinary Medical Association U.S. Pet Ownership Statistics

Categorical
Variable that takes on one of a fixed set of unordered values
Can compare for equality
Cannot sort or do arithmetic
Can sort for convenience, but reflects convention, not meaning
Example: species (Adelie, Gentoo, Chinstrap)
Example: user (userId in MovieLens)
Stored as: string (object), int, Pandas Category

Boolean / Logical
Variable that is True or False
Special case of categorical
Usually stored as int or bool (which is a type of int)

Ordinal
Variable that takes on one of a fixed set of ordered values
Can sort, compare for inequality
Example: movie ratings (1–5 stars)
Example: class grades
Example: Likert scale (strongly agree – strongly disagree)
Really can’t do arithmetic
But we sometimes do anyway
Stored as: int, float, string (with externally-defined order), Pandas Category (in ordered mode)

Other Types
Time – usually continuous
Text – categorical-ish, usually converted to categorical or count
Images – matrix of ints or reals, will extract features
Money – often stored as int or float, but be very careful

Pandas Type Insufficient
Knowing the Python/NumPy/Pandas type is not sufficient
What is an ‘int64’?
Categorical? E.g. MovieLens userId, movieId
Continuous measured with integer precision?
Ordinal?
Logical 0/1?
Integers with missing values load as floats!

A Bit About Entities
What are the things being observed?
Databases: we call these entities
Sometimes straightforward: penguins
Sometimes complex and linked
Ratings
About movies
# of ratings could be a variable for a movie!

Wrapping Up
Many kinds of variables, broadly divided into continuous and discrete.
Conceptual variable types do not map 1:1 to Pandas types.
Need to be documented!
Photo by Omer Salom on Unsplash

In this video, I want to talk with you about variables and observations and types.
In the previous video we saw how to load a data file into Pandas, and how to see how many rows we have and what Python types the data has.
In this video, we're going to go from the Python types to the conceptual types,
and start to talk about what kinds of data we collect and store in these Pandas data frames.
So the learning outcomes for this video are to know the relationship between Pandas
structures and statistical variables and to identify the type of a statistical variable.
If we refer back to our data pipeline, we have things that produce observable phenomena, which produce raw data,
raw observations that we then process into a data set.
So the core idea, when we have a data set that's processed and ready to use for a task,
is that what we usually have is a table of observations. Each row is one observation.
The datasheets reading calls this an instance. Sometimes this is called a sample.
And this is an observation of the values of one or more variables pertaining to a single object.
So I'm showing here three rows from a data set, the Palmer Penguins data set.
It contains measurements of penguins in Antarctica, and each one has several different variables.
Each row represents one observation.
The first row represents that we observed an Adelie penguin on Torgersen Island with a bill length of thirty nine point one millimeters,
a depth of eighteen point seven, et cetera. We store this as a Pandas data frame, with each variable as a column.
But the variables have their own conceptual properties that we're going to start talking about here.
So for each penguin we have its species; there are three different species of penguin that are observed in this data set. We have the island;
there are three different islands on which they're measured.
We have measurements of the penguin's bill, its length and depth, the length of its flipper, the body mass of the penguin, and then the penguin's sex.
Each data point, each observation, is about one penguin. And we have these different observations.
We also have the documentation, which you'll find a link to online and in the slides.
The documentation tells us things like the bill length:
that's the length of the bill in millimeters. The penguin body mass is in grams.
When we have a data set, it's important to document it.
When we have data, whether it's organized and curated and processed into a data set or it's the raw observations,
there are first-order things that we need to know besides how much data we have
and what the columns are: things like what the units are. Without units, we can't properly interpret the data.
A bill length of thirty nine: thirty nine what? It's probably not thirty nine feet.
That would be a very large penguin. So when we're producing data, we need to document all of these things, and when we're consuming data,
we need to find the answers to all of these things. The variables we have can take on a variety of different types,
and the first type I want to talk about is a continuous variable. A continuous variable can take on any value.
There might be a range that limits the values that the variable can take on.
Mathematically, continuous variables correspond to real numbers.
And the key idea here, what really makes it continuous, is that for any two values, we could have a value between them.
So if we have two penguins, let's say one has a flipper length of forty millimeters and another has a flipper length of forty five,
we could have a penguin with a flipper length of forty two millimeters.
And no matter how close together they are, we could conceivably have a penguin with a flipper length that's in between them.
That's what makes it continuous.
Now, observations are often discretized, so even if we have something continuous like the penguin's flipper length,
often our observations will be discrete and noisy because we don't have infinitely precise rulers with which to measure penguin flippers.
But what makes it continuous is that conceptually, it's a continuous variable,
even if our actual measurements of it might be discretized. Typically it's stored as a float.
Occasionally it'll be stored as an int: if we only measure to integer precision, we might store it as an int instead of a float.
It is important to note, though, that floating point storage is also imprecise.
It's fine for the vast majority of what we're doing, because if you're taking any kind of a physical or natural measurement,
the measurement instrument is going to have some imprecision in it.
The imprecision of storing in a floating point number is far less than the imprecision of most physical instruments,
unless you're doing something like high energy physics in a particle accelerator. Most of us aren't doing that here.
So for most of our purposes, for measuring, say, physical quantities, we don't need to worry about the floating point imprecision.
The one exception is for storing money. A discrete variable, on the other hand, takes on distinct values.
Most of our variables are basically going to fall into the category of either continuous, a real number, or discrete, which is almost everything else.
A discrete variable can have values that have no intermediates.
If I have four eggs or five eggs, I mean, I could crack an egg and have half the contents,
but I'm talking about distinct eggs. I have four or five eggs.
Discrete variables might not have an order. There are many different types of discrete variables,
and I'm going to walk through some of those in these remaining slides.
The first is an integer. Integers are discrete, and they're typically things like counts;
counting something is our canonical example of an integer. They have an order: four is less than five.
But you can't have a value in the middle.
So, for example, the number of penguins measured is an integer; the size of something is usually an integer.
We often treat an integer continuously. For example, the American Veterinary Medical Association computed that for households that own cats,
the average number of cats per household is one point eight. Now, you can't actually have point eight of a cat.
So in terms of the individual observations, if you were going to observe how many cats are in each household,
the number of cats would be an integer.
You can't have point eight cats.
But we can then treat it as a continuous value, so we can talk about the average number of cats per household, and that's meaningful to talk about
even though nobody actually has one point eight cats. Integers are usually stored as ints,
and sometimes as floats, particularly
if you have missing values. Pandas can't represent missing values in its default integer type,
so it will store those ints as floats if it finds any missing values.
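
A short sketch of that loading behavior, using a hypothetical three-household table. The `Int64` conversion at the end is an opt-in nullable integer type in Pandas, shown as one possible workaround rather than something the video covers.

```python
import io

import pandas as pd

# Hypothetical data: household B has no recorded cat count.
csv_text = "household,cats\nA,2\nB,\nC,1\n"
pets = pd.read_csv(io.StringIO(csv_text))

# The counts are conceptually integers, but the missing value forces a float column.
print(pets["cats"].dtype)  # float64
n_missing = int(pets["cats"].isna().sum())

# Treating the count continuously: mean() skips the missing value.
mean_cats = pets["cats"].mean()
print(n_missing, mean_cats)

# Opt-in nullable integer type that can hold missing values.
nullable = pets["cats"].astype("Int64")
print(nullable.dtype)
```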
A categorical variable takes on one of a fixed set of unordered values.
We can compare them for equality, but that's about all we can do.
There's no order, so we can't meaningfully sort them, and we can't do arithmetic.
We can sort for convenience. The penguin species is one example of a categorical variable.
We can sort by species in alphabetical order, but that's just convention,
the convention of the English alphabet. It's not intrinsic to the meaning of the different penguin species.
It's not that Adelie is less than Chinstrap is less than Gentoo.
They just happen to come in a particular order in the alphabet. Another example is the user in the MovieLens data we saw in the previous video.
The user ID column is an integer, but that's not a count.
It's a user, right? It's an identifier. It's actually a kind of categorical variable:
which user in the system are we talking about? They're each assigned a numeric identifier for computational convenience.
But if we have user seventy five and user three hundred and forty two, asking what is user seventy
five plus user three hundred forty two is a completely meaningless question.
They're categorical variables. There are no arithmetic or ordering operations that we can do between them.
Categorical variables can be stored as strings. They can be stored as integers.
Pandas also has a category type that's useful for storing categorical variables.
A boolean or logical variable is a special case of a categorical that has just the two values true and false.
Usually it's stored as an int or a bool, and bool is just a special type of int.
The convention typically is that one is true and zero is false.
And then ordinal variables: like categoricals, they take on one of a fixed set of values,
but those values are ordered. This is the key difference between an ordinal and a categorical, and so we can sort them and we can compare for inequality.
A few examples of these are class grades: A is a better grade than B.
They're in this order, but you can't directly do math on them.
What is A minus B? We can assign numbers to them to try to do math,
but they're just in an order. Another example is a Likert scale.
If you've taken a survey that asks you to rate from strongly disagree to strongly agree with something, that is ordinal.
Also a movie rating: if you go and rate a movie, or if you rate a product five stars or four stars, as on Netflix,
you're using a number, but it's ordinal in the sense that
we know that if you rate something four stars, you're saying you like it more than you like a three star thing.
But we don't know, if you have a five star and a four star and a three star movie:
is the five star movie just as much better than the four star as the four star is than the three star?
Or does it just tell us which order they're in? Intrinsically,
all it tells us is the order that you put these movies in.
But sometimes we do arithmetic anyway; Amazon, for example, computes the average rating for a product
even though ratings are ordinal. Or you have a GPA:
that's computing the average of your ordinal grades. Ordinal variables are stored as ints or floats, or as strings with an externally defined order.
If we have an ordinal variable stored in a string,
we have to have something that tells us what order those values actually go in.
Also, the Pandas category type has an ordered mode: you can tell it, this is a categorical variable and it is ordered.
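
The difference between the two category modes can be sketched as below. The grade and species values are illustrative, and the category order C < B < A is an assumption chosen for the example.

```python
import pandas as pd

grades = pd.Series(["B", "A", "C", "A"])

# Ordered categorical: the categories list runs from lowest to highest.
ordinal = pd.Series(pd.Categorical(grades, categories=["C", "B", "A"], ordered=True))
print(ordinal.max())  # A
sorted_grades = ordinal.sort_values().tolist()
print(sorted_grades)

# Unordered categorical: equality works, but order comparisons raise TypeError.
species = pd.Series(pd.Categorical(["Adelie", "Gentoo", "Adelie"]))
is_adelie = species == "Adelie"
try:
    species < "Gentoo"
    order_comparison_allowed = True
except TypeError:
    order_comparison_allowed = False
print(is_adelie.tolist(), order_comparison_allowed)
```

Declaring the order explicitly is what lets sorting follow the meaning of the grades rather than the alphabet.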
There are other types of data we're going to encounter. Time we are usually going to treat as continuous.
It might be stored written out as a date,
but if we're actually going to work with time, we're probably going to convert it into a continuous variable.
Common encodings for that are either a number of seconds or a number of years, sometimes a number of milliseconds.
Text is categorical-ish, but we usually convert it into categorical or count variables.
We'll talk more about that later when we actually get to processing text. Images are stored as matrices of ints or reals.
We may also extract features from them that become other kinds of variables. Money is often stored as an int or float,
but we have to be careful because the imprecision of floating point numbers can
cause a problem when we're using money for the purposes of
creating financial transactions. For the kinds of things we might be doing with money here in this class,
it's not going to be a problem. Nobody loses money if you have a little imprecision in computing
the average price of a ton of potatoes.
It really becomes a problem when you feed that back into actual financial
transaction systems, because it is impossible to precisely represent ten cents (0.10 dollars) in binary floating point.
It's just a hair under or over ten cents when you store it in a floating point value.
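
A minimal illustration of the ten-cent problem using only the Python standard library. Using `Decimal` or integer cents are two common workarounds, offered here as a sketch rather than as the course's recommendation.

```python
from decimal import Decimal

# Binary floats cannot store 0.10 exactly, so small errors show up in sums.
print(0.1 + 0.2 == 0.3)  # False

# Converting the float to Decimal reveals the exact value actually stored.
stored_value = Decimal(0.10)
print(stored_value)

# Decimal built from strings keeps exact decimal amounts.
exact_total = Decimal("0.10") + Decimal("0.20")
print(exact_total == Decimal("0.30"))  # True

# Another common choice is to count integer cents instead of fractional dollars.
cents_total = 10 + 20
print(cents_total)
```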
The other thing I want to highlight here, though,
is that knowing the Python, NumPy, or Pandas data type of a variable is not sufficient to know its type in a statistical sense.
Suppose we have a variable that's an int64. Well, what is that?
Is it a categorical variable, like our movie and user IDs in MovieLens?
Is it a continuous variable that happens to be measured with integer precision? In the penguins data set, when you download it,
the body mass values are integers because they were just measuring to the whole gram.
They didn't measure fractional grams. But conceptually, mass really is a continuous value.
It's just that our measurements aren't that precise.
Is it ordinal? Or is it zeros and ones that are representing a logical variable?
As I said before, integers with missing values also load as floats.
So we can look at the data, we can look at the data types, to try to start to get a sense of what it is.
But knowing that is not sufficient to know what type of data we're dealing with for the purposes of handling it properly.
I want to talk a little bit about entities or instances, which I introduced last time and which are talked about in the reading.
We want to be clear, when we have a data frame or a data table,
about what the things being observed in this set of observations are. If you've taken the databases class,
we called these entities. Sometimes this is pretty straightforward.
For example, in this penguin data set, each row represents the measurements of one penguin.
But sometimes they're complex and linked, such as in the rating data table.
Each instance is a rating, but that's a rating about a movie.
And we can also derive things: we could count the number of ratings for each movie, and that could be a variable for a movie.
We could do this aggregation; we're going to see how to do aggregations in a little bit.
That gives us a new variable, number of ratings, which becomes a variable for observations of movies.
So to wrap up,
there are many different kinds of variables, broadly divided into continuous and discrete, with several specific types of discrete variables.
These conceptual variable types do not map one-to-one to Pandas data types.
You need more information in order to know how to properly interpret and work with a variable.
So the data source that you're working with needs to be documented.
And if you're creating a data source, you need to document what all of the columns mean and how they're being encoded and stored.

Note

In this video, I present categorical and ordinal variables as fixed and finite. While this describes any categorical and ordinal variables you are likely to encounter, this is not strictly speaking correct. It is possible for a categorical or ordinal variable to be (countably) infinite.

Resources#

Palmer Penguins data set

📃 Datasheets for Datasets#

Read Datasheets for Datasets by Timnit Gebru et al.

📖 Textbook Chapters#

This week’s material uses chapters 5 and 10 of 📖 Python for Data Analysis. We aren’t getting to everything in those chapters though.

🎥 Groups and Aggregates#

In this video, we discuss how to compute aggregate statistics in Pandas.

Video (14m12s)

Slides

In this video, I'm going to show you how to do grouping and aggregate operations in Pandas. The learning
outcomes are for you to be able to compute aggregate values from a Pandas series,
compute grouped aggregate values from a Pandas data frame, and also be able to order a data frame
or pick the rows with the largest values for some series,
and then finally join two Pandas data frames to get context for the results that we just computed in the first part.
So we have a data frame. This is the MovieLens data that we used in some of the earlier videos.
We've got this ratings table that has the user ID, movie ID, rating, and timestamp columns.
It's twenty-five million rows by four columns. So, an aggregate.
Suppose we want to ask the question: what is the mean rating? Of all of the rating values users have ever given,
what's the mean value? This is the code we would use to do this.
And there's a few pieces. We're using this data frame,
and we're then selecting a column.
Remember, this is how we select a column. The result of that whole operation is a series.
And so then we call the mean method on the series and we get the mean rating.
Think back to the previous video and take a moment to think about what the conceptual problem here is. This is a common operation,
but there is a little conceptual problem with it in terms of what it actually means.
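A minimal sketch of that operation follows. The tiny frame is a made-up stand-in for the ratings table, and the column names `userId`, `movieId`, `rating`, and `timestamp` are assumptions about how the file is laid out.

```python
import pandas as pd

# A tiny stand-in for the ratings table (the real one has about 25 million rows).
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 10],
    "rating": [4.0, 3.5, 5.0, 2.0, 3.0],
    "timestamp": [964982703, 964981247, 964982224, 964983815, 964982931],
})

rating_series = ratings["rating"]    # selecting one column yields a Series
mean_rating = rating_series.mean()   # aggregate the Series to a single value
print(type(rating_series).__name__, mean_rating)
```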
There's a variety of different aggregate functions that we have in Pandas.
We've got mean, median, and mode. We've got the minimum and the maximum.
You can sum, you can count. You can compute standard deviation and variance.
There are several others as well.
These are all methods on a Pandas series. If you have a series,
you've got the series object, dot, and then this function, with parentheses to call it, and you're going to compute that aggregate statistic.
So let's see these in action. I'm going to compute the mean rating, and we get three point five.
I can compute the sum. There's an alternate form: all of these are also available as functions in the NumPy module
that take an array, and a series is a kind of array. For some of the functions,
there are slight differences between the Pandas versions and the NumPy versions, but mean and sum are the same.
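The two forms can be compared side by side. This is a small sketch with made-up values; it only exercises `mean` and `sum`, the two functions the video says behave identically in both libraries.

```python
import numpy as np
import pandas as pd

rating = pd.Series([4.0, 3.5, 5.0, 2.0, 3.0], name="rating")

# Method form on the Series.
method_mean = rating.mean()
method_sum = rating.sum()

# Function form from NumPy, passing the Series as an array-like.
function_mean = np.mean(rating)
function_sum = np.sum(rating)

print(method_mean, function_mean)
print(method_sum, function_sum)
print(rating.min(), rating.max(), rating.median())
```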
So if we want to get the size of a series, there's a couple of different ways: we can ask the series for its size, or call len on it.
Those are the same operation, and they will give us the total length of the series, including missing values.
If we've got a series that has missing values (we haven't seen missing values yet,
but they're going to come up later) and we want to count how many values we actually have,
that's what the series count method does. So we can see those.
The size and the len are the same. Also, a series is an array, and arrays in the NumPy world have a shape. We can get the shape,
which is the same as the size except as a tuple, because arrays can have more than one dimension.
This weird syntax here, where we have a number with a comma after it inside parentheses,
is the Python syntax for a tuple consisting of exactly one value.
It's a little bit of a weird syntax, but it comes up in a few places.
Then we can count the number of ratings, and since we don't have any missing ratings, it returns the same number.
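To see the difference between size and count, the sketch below uses a short made-up series with one missing value (the MovieLens ratings have none, so the two numbers agree there).

```python
import numpy as np
import pandas as pd

# One missing value, so that size and count disagree.
values = pd.Series([4.0, np.nan, 5.0, 2.0])

print(values.size)     # total length, missing values included
print(len(values))     # same number as size
print(values.shape)    # a one-element tuple, written with a trailing comma
print(values.count())  # number of non-missing values only
```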
Another thing we can do that's a form of aggregate is to get a quantile. The quantile method takes a parameter that is a fraction.
What it does is find the value at that fraction of the way through the data:
if we sorted the series from smallest to largest
and went that fraction along it (so 0.5 would be the middle, the median value; we're going to see the median in the next video),
what's the value that's there? So we can see those run.
The median rating is three point five. If we ask for the 0.2 quantile, we get 3.0.
What this means is that 0.2 of the way across, the value is 3.0: 80 percent of the ratings are 3.0 or higher
on a five-star scale. So think a little bit about why that might be.
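A short sketch of `quantile` on made-up ratings follows. The eleven values were chosen so that the 0.5 and 0.2 positions land exactly on data points; they are not the MovieLens ratings.

```python
import pandas as pd

rating = pd.Series([1.0, 2.0, 3.0, 3.0, 3.5, 3.5, 4.0, 4.0, 4.5, 5.0, 5.0])

median_rating = rating.quantile(0.5)   # halfway through the sorted values
low_quantile = rating.quantile(0.2)    # one fifth of the way through

print(median_rating, rating.median())
print(low_quantile)
```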
So far we've seen aggregates that work over a single series: we take the series
and get one value. But sometimes we want to be able to group and compute aggregates per group.
So remember, this data frame has movies:
the ratings are for movies, and they're provided by users. So rather than just the overall mean rating,
maybe what we want to do is find the number of ratings per movie.
This would give us a measure of popularity. We could say the movie that's rated the most frequently is the most popular.
We could also look at the average rating per movie. And we can do this with group by.
So groupby returns an object that allows us to perform grouped operations on a data frame.
We give it the column name that we want to group by, in this case movie ID. We can group by more than one column at a time;
we're only doing one for now.
Then, in this grouped object, we're going to say we only want to work on one column.
Otherwise, it's going to count the ratings and the timestamps and so on, and they're all going to be the same count.
So we're going to group by movie ID, and then say that within each group
we only want to work with the rating. And then what we want to do is count it. All of the aggregate
functions that we've seen before work on a per-group basis as well.
Do note, though, that we are grouping the whole data frame by movie ID before we select the column.
If we did it the other way around and selected the rating first,
we wouldn't have a movie ID to group by, because we've pulled the rating out of the frame.
This order is important. So we group by movie ID, which is another column in the frame, and we use the rating column.
So let's see this in action. We want to count the number of ratings per movie, and what it gives us is a series whose index
is the movie ID and whose value is the number of ratings for that movie.
We haven't really worked with indexes much yet,
but that's what it's doing here. It's resulting in a series that's indexed by movie ID.
And this is the thing a series adds on top of a normal NumPy array: we have this index that tells us, oh, this is for movie one,
this is for movie two, and so on, up to movie two hundred nine thousand one hundred and seventy-one.
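The grouped count can be sketched on a small made-up ratings frame. The column names are assumptions, and the values are chosen only so the per-group results are easy to check by hand.

```python
import pandas as pd

ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 10, 20],
    "rating": [4.0, 3.5, 5.0, 2.0, 3.0, 4.5],
})

# Group the whole frame first, then pick the column to aggregate.
ratings_per_movie = ratings.groupby("movieId")["rating"].count()
mean_per_movie = ratings.groupby("movieId")["rating"].mean()

print(ratings_per_movie)   # a Series indexed by movieId
print(mean_per_movie)
```

Both results are series whose index is the grouping column, which is what makes them easy to line up with other per-movie data later.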
We can also compute multiple aggregates at the same time.
There's an agg function that allows you to specify multiple aggregate functions to call. You specify them by name.
So here I'm doing the group by that we did before, and then I'm
calling agg to say I want to aggregate the values of this column,
but I'm giving it a list of two different aggregation functions, mean and count.
When I run this, I get a data frame that's indexed by movie ID, but then it has two columns, and the columns are named after the functions.
So we have a mean column that's the result of mean, and a count column
that's the result of count. And because I know that I did this on the rating column, I know these are the mean and the count of the ratings.
So we can see that this movie ID has a mean rating of three point eighty-nine,
and the number of ratings is fifty-seven thousand three hundred and nine.
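A minimal sketch of `agg` with a list of function names, again on a made-up frame with assumed column names:

```python
import pandas as pd

ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 10, 20],
    "rating": [4.0, 3.5, 5.0, 2.0, 3.0, 4.5],
})

# One column per aggregate, named after the function that produced it.
movie_stats = ratings.groupby("movieId")["rating"].agg(["mean", "count"])
print(movie_stats)
```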
So sometimes you want to sort data. The sort_values method will re-sort an entire data frame
by a specific column; you give it a column name.
We could re-sort this whole data frame by, say, the number of ratings.
Sometimes, though, the reason we want to sort
is that we want to know, say, the five movies with the most ratings,
in which case we don't necessarily need to sort the entire thing. The nlargest and nsmallest methods
let us just get the rows with the n largest or smallest values for a particular column.
So if I go over and do this:
I want to get the 10 movies with the most ratings, so I can call nlargest and tell it I want 10 and I want to do it by count.
And it gives me the 10 movies with the most ratings, sorted in decreasing order of count.
And we see movie ID three hundred and fifty-six has eighty-one thousand ratings, with a mean of 4.05.
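The sorting and top-n operations can be sketched as follows. The statistics frame here is hypothetical; in the video it comes from the grouped aggregate.

```python
import pandas as pd

# Hypothetical per-movie statistics, indexed by movie ID.
movie_stats = pd.DataFrame(
    {"mean": [3.9, 3.2, 4.1, 2.5], "count": [120, 15, 300, 4]},
    index=pd.Index([1, 2, 3, 4], name="movieId"),
)

by_count = movie_stats.sort_values("count", ascending=False)  # full sort
top_two = movie_stats.nlargest(2, "count")                    # only the top rows
bottom_one = movie_stats.nsmallest(1, "count")

print(by_count)
print(top_two)
print(bottom_one)
```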
But this doesn't tell us what movie that is. What can we do?
Remember, we have this movies table, too, that gives us the movie titles. We can join the tables together.
The simplest way to join is to join on a common index. There's a set_index method that lets you set a column as the index.
You can also specify columns to join by. We're going to see more of this later; in particular,
I'm going to make a notebook that walks you through the different indexing operations.
And you can also read about them in more detail in the textbook.
So I'm going to take our movies frame, which has a movie ID column, and I'm going to join it with movie stats.
Movie stats, remember, is the result of our aggregate; its index is the movie ID. So when I call join,
I'm going to tell it I want to join movies with movie stats, and I'm going to tell it on movie ID. Movies doesn't have a useful index;
its index is just the positions. But when I use the on keyword in join, what it does
is tell it to take the left
table, movies, use its movie ID column, and join it with the index in the other table.
So movie stats has an index, and it expects the movie ID column in movies to match up with the index in movie stats.
And so the resulting frame has our title and our genres,
and then it has the mean and the count for each of these movies.
So now if I call nlargest on this movie info frame, I see that the most frequently rated movie, with 81,000 ratings, is Forrest Gump.
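Here is a minimal sketch of that join. The titles and statistics are made up, and the frame and column names (`movies`, `movie_stats`, `movie_info`, `movieId`) are assumptions modeled on the names used in the video.

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],   # made-up titles
})
movie_stats = pd.DataFrame(
    {"mean": [3.9, 3.2, 4.1], "count": [120, 15, 300]},
    index=pd.Index([1, 2, 3], name="movieId"),
)

# on="movieId" matches the left frame's movieId column against the
# right frame's index.
movie_info = movies.join(movie_stats, on="movieId")
most_rated = movie_info.nlargest(1, "count")

print(movie_info)
print(most_rated)
```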
So another thing you can do: the movie-level rating statistics that we computed, this count and
this mean, those are just more variables.
Remember, earlier we talked about how some of the variables you might observe are actually aggregates from other things.
Well, these are just more variables. So now, if we have an observation of a movie, it has an ID, it has a title,
it has genres, and it has the number of people who've rated it and the mean rating.
These can also be aggregated. In the downloads for this video, you're going to find the notebook that I was just using, for practice.
What I'd like you to do is to go in and compute the mean number of ratings per movie.
Maybe do some additional exploration as well.
That's going to let you start to see how we can build from these aggregates into additional structures.
And I also want to emphasize that a data frame is just a data frame.
We give it meaning in terms of observations, but the fact that a data frame resulted from an aggregate doesn't make it special in any way.
We can aggregate the results of an aggregate, because it's just another data frame.
Everything's a data frame or a series in Pandas. So to wrap up: aggregates combine a series or array into a single value.
That's what it means to aggregate. We can do this over an entire series.
We can also do this on a group-by-group basis,
if we have another column that provides us with grouping information, so we can compute the mean, or a sum, or whatever, per group.
You might want this if, for example, you have records of financial transactions and you want to compute
what your total profit was in each month.
So you could group by year, maybe group by month,
maybe group by year and month, and take a sum of the profit margin on each of your transactions.
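The financial example can be sketched like this. The transaction records and the column names `year`, `month`, and `profit` are hypothetical, chosen only to show grouping by more than one column.

```python
import pandas as pd

# Hypothetical transaction records; column names are illustrative only.
transactions = pd.DataFrame({
    "year": [2021, 2021, 2021, 2022, 2022],
    "month": [1, 1, 2, 1, 1],
    "profit": [10.0, 5.0, 7.5, 3.0, 4.0],
})

# Passing a list of columns groups by each distinct (year, month) pair.
monthly_profit = transactions.groupby(["year", "month"])["profit"].sum()
print(monthly_profit)
```

Grouping by month alone would merge January 2021 with January 2022, which is why the pair of columns is used here.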
And then finally, join combines frames: we can put two frames together in order to get context for values.
We're going to see a lot of other uses for join later,
but in this context it lets us get context for understanding what's going on in a value.

Resources#

Notebook for this video
Textbook 10.1–10.2

🎥 Descriptive Statistics#

In this video, we discuss descriptive statistics for numeric variables.

Video (16m32s)

Slides

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
DESCRIPTIVE STATISTICS

Learning Outcomes
Know what a statistic is
Identify whether a mean or a median is more appropriate
Describe the central tendency and spread of a data series
Photo by Fabio Santaniello Bruun on Unsplash

Statistic
A statistic is a value computed from a collection of data
Often summarizes (observations of) a variable

Descriptive Questions
Where is the variable centered? How large does it tend to be? Called measure of central tendency
How spread out is it?

Mean

What is a Mean?
What does the mean measure?
If every instance had the same value, what would it be?
“Points per player”
How do you change it?
Increase total (score more points)
It doesn’t matter where – one person can do all of it!

Spread - Variance and Standard Deviation

What is Variance?
What does variance measure?
The mean squared distance from the mean
If mean is center, how far away do values tend to be?
If mean is expectation, how far off does it tend to be?
Square penalizes large differences more
Standard deviation translates back to original units

Computing Statistics
Pandas: Series.mean() Series.std() Series.var()
Note: np.std and np.var compute population std and var, not sample
Change with ddof=1

Outliers
Outliers are particularly large or small values
Outliers draw the mean towards them! (also affect SD)

Median
What value divides upper half from lower half?
Sort values
Pick middle one
If even number: take mean of middle 2
0 0 0 2 2 7 8 8 15 45
How to increase? Increase small values.

Spread – Range and IQR
The range is max – min
The inter-quartile range is distance between 1st and 3rd quartiles (width of “middle 50%”)
0 0 0 2 2 | 7 8 8 15 45

Quick Summary
>>> movie_info['count'].describe()
count    59047.000000
mean       423.393144
std       2477.885821
min          1.000000
25%          2.000000
50%          6.000000
75%         36.000000
max      81491.000000
Name: count, dtype: float64

Mean, Median, and Skew
What question do we want to answer?
If we distributed the points equally, how many would each have?
If we randomly selected a player, are they equally likely to have more or less points?

Mode
The most common value
doesn’t work great for continuous values
fantastic for categorical variables!

Wrapping Up
Mean and median describe where a value tends to be.
Standard deviation, variance, range, and IQR measure how spread out it is.
Mean-based computationally useful; median-based robust to outliers.
Photo by Brylee Hawkins on Unsplash

In this video I'm going to talk about descriptive statistics. Our learning outcomes are, first, to know what a statistic is.
We talk about statistics; what is a statistic? Second, to identify whether the mean or the median is more appropriate for summarizing some data.
And also to be able to describe both the central tendency and the spread of a data series.
So a statistic is a value that's computed from a collection of data.
It often summarizes a variable, or more particularly summarizes observations of a variable.
You've probably seen means, medians, et cetera, before in other contexts. Those are all examples of statistics.
We can get additional statistics,
but each is one value that summarizes the observations of a variable, and it becomes useful for a variety of things.
So when we have a variable, when we have a set of observations, there's a few different questions we want to ask of it.
The readings talk about some conceptual questions, but these are getting at something more direct:
how are the actual data values themselves laid out?
So one is: where is the variable centered? This is called a measure of central tendency.
If it's a numeric variable, it measures how large a value tends to be.
We also want to ask how spread out it is around that value.
So, ways to do this. The mean of a data series is the sum divided by the number of values.
And so we have some data points here.
These are the scores of each of the players on the Chicago Bulls in game six of the 1998 NBA finals.
We add up all of the values and get eighty-seven. There's ten of them,
so we have a mean of eight point seven. This is often informally called an average.
When someone says average, they're usually talking about the mean.
But average itself is not a very specific term. It just means one of these measures of central tendency.
We often use it in informal discussion, but when we want to be precise,
average is not a good enough term. We need something like mean. The mean measures:
if every instance had the same value, what would it be?
If the total is some resource or quantity and it was evenly distributed among all the instances, how much would each have?
So it's a points-per-player kind of value. And think back to the question
that I told you to ask in the first week. When we have a statistic or a metric:
how do I change this? Or, if one direction is defined as better, how do I improve it?
How do you move this? Well, the way you move this is you increase the total: score more points.
But crucially, it does not matter where the total is increased among your data points.
One value can get all of the increase in total to produce an increase in the mean.
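This property is easy to check with the ten scores listed on the slide. The sketch below adds ten points in two different ways and compares the resulting means; the variable names are illustrative.

```python
import pandas as pd

# The ten scores listed on the slide for this video.
scores = pd.Series([0, 0, 0, 2, 2, 7, 8, 8, 15, 45])

total = scores.sum()
mean_score = scores.mean()          # total divided by the number of values

# Add ten points in two different ways.
top_scorer_adds_ten = scores.copy()
top_scorer_adds_ten.iloc[-1] += 10  # all ten points go to the largest value
everyone_adds_one = scores + 1      # one extra point for each of ten players

print(total, mean_score)
print(top_scorer_adds_ten.mean(), everyone_adds_one.mean())
```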
So we see there's an outlier here: Michael Jordan scored forty-five points.
He could score 10 more points, and that would have the same effect on the mean as every player scoring
one more point. Next, we want to measure how spread out the values are. One measure is the standard deviation.
So this is the sample standard deviation. What it is: we take the mean
(x-bar is the mean from the previous slide), we subtract the mean from each value, and then we square it. Squaring does two things.
One, it makes everything positive. We want to make all the values positive,
and there's usually two ways to do it: take the absolute value or take the square. But also, squaring emphasizes larger values
more, which is one of the reasons it's really useful.
A third reason squaring is useful is that in a variety of contexts you'll see in future classes, particularly around machine learning,
it's really useful to have differentiable statistics, and you can take the derivative of the square.
If we didn't have this square root, we would get the sample variance, s squared.
What this measures is the mean squared distance from the mean.
If we just wanted to measure the mean distance without squaring, the total signed distance from the mean is
zero, because we're subtracting the mean from every value.
You can push through the algebra of that; I'll leave that as an exercise for you in order to better understand the algebra of these statistics.
But the sum of a bunch of values, each minus the mean, is zero.
So we need the differences to be positive. What this measures is: if the mean is the center, or the expectation, of our values,
how far away do values tend to be? If the variance or the standard deviation is small,
that means the values are tightly clustered around the mean.
And if it's large, that means they're spread out quite a bit around the mean.
Taking the square root of the variance means that the result is back in the original units, rather than the square of the units.
And we can see here the standard deviation. So I've plotted here the mean,
and then we've got the standard deviation: these points are one standard deviation away from the mean in each direction. We
see that's pretty spread out in this particular data set, because we have quite a bit of spread.
There's a number of values that are clustered over here, but then we've got this value over here.
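The computation described above can be written out by hand. This is a sketch in plain Python using the ten scores from the slide; it follows the usual sample formula (divide by n minus one) and also checks that the signed deviations cancel out.

```python
import math

scores = [0, 0, 0, 2, 2, 7, 8, 8, 15, 45]
n = len(scores)
mean = sum(scores) / n

deviations = [x - mean for x in scores]
total_deviation = sum(deviations)           # zero, up to floating-point rounding

sample_variance = sum(d ** 2 for d in deviations) / (n - 1)
sample_sd = math.sqrt(sample_variance)      # back in the original units (points)

print(round(total_deviation, 9))
print(sample_variance, sample_sd)
```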
To compute these, we use some of the methods that we talked about in the groups and aggregates video: mean, std for the standard deviation, and var for the variance.
One note is that the NumPy versions of standard deviation and variance compute a slightly different statistic:
the population standard deviation and variance, instead of the sample standard deviation and variance.
You can change this by passing the ddof option to them when you call them.
The difference is that they divide by n instead of n minus one.
But when we're computing the standard deviation or the variance of a set of data that we've collected,
that's a set of observations of the things we care about, we generally are going to want the sample standard deviation or variance.
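The difference between the two defaults shows up clearly on a small series. A minimal sketch, converting the series to a plain NumPy array first so that NumPy's own default applies:

```python
import numpy as np
import pandas as pd

scores = pd.Series([0, 0, 0, 2, 2, 7, 8, 8, 15, 45])
array = scores.to_numpy()

pandas_sd = scores.std()               # sample SD: divides by n - 1
numpy_sd = np.std(array)               # population SD: divides by n
numpy_sample_sd = np.std(array, ddof=1)  # ddof=1 switches NumPy to n - 1

print(pandas_sd, numpy_sd, numpy_sample_sd)
```

With only ten values the gap is noticeable; with millions of ratings it becomes negligible, but it is still worth knowing which one a function computes.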
So outliers are particularly large or small values. They draw the mean towards them, and they also affect the standard deviation,
which is one of the reasons why the standard deviation was so large.
So we've got Michael Jordan's score over here, and that pulls the mean quite a bit.
The red line, again, is our mean. The blue line
is the mean we would compute if we didn't have Michael Jordan's score of forty-five,
which is so much larger than everybody else's. So we can see that one value is pulling the mean quite a bit to the right.
This is one of the downsides of using the mean.
We have a heavily skewed distribution here. Skew is what we call it when
there's a lot of stuff in one place and there's a few values that are way off in one direction.
We can have outliers without skew: we can have some very large and some very small outliers.
If they're relatively balanced, they actually don't affect the mean too much, because one pulls it way high and the other pulls it way low.
It's when the outliers tend to skew off in one direction that the mean starts to become a problem.
So the median is another way of measuring where values tend to be. The way we compute it is we sort the values and pick the middle one.
If there's an even number of values, we take the mean of the middle two.
So we have ten values, and right in here is going to be the dividing line between them.
The middle two are two and seven. The mean of two and seven is four point five.
So the median, x-tilde, of our series is 4.5.
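The sort-and-pick-the-middle procedure can be sketched by hand and compared against Pandas. The scores are the ten values from the slide, deliberately given out of order.

```python
import pandas as pd

scores = [45, 0, 8, 2, 15, 0, 7, 2, 8, 0]   # same ten scores, unsorted

ordered = sorted(scores)
n = len(ordered)
if n % 2 == 1:
    manual_median = ordered[n // 2]          # odd count: the single middle value
else:
    # even count: mean of the two middle values
    manual_median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

pandas_median = pd.Series(scores).median()
print(manual_median, pandas_median)
```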
Now, if we ask: how do we move the median?
Remember, we can move the mean by just scoring more points, even if Michael Jordan is the one who scores all of the additional points.
But the only way to increase the median is to increase the small values.
And similarly, you decrease the median by decreasing the large values.
So no matter how many more points Michael Jordan scores,
we don't move the median. We can only move the median by having the players who scored the fewest points score more points.
We could have that seven score another point, and then we would move the median just a little bit.
But primarily, the smallest values
need to increase in order to increase our median, or at least the values in the smaller half of the distribution.
One really common use for medians is when we're talking about income and wealth
statistics. Mean income for a region is almost never reported.
You usually report the median income, because income is usually a skewed distribution.
A few people have a very large income. A lot more people have a significantly smaller income.
The mean would be pretty high
because you have these large incomes; the median ends up reflecting the typical experience
of people in the population when we have skewed values. Standard deviation is a mean-based measure of spread.
A measure of spread more connected to the median is the range, which we often want to compute in general:
the maximum minus the minimum. And then there's the inter-quartile range.
So the median is the fiftieth percentile, or the 0.5 quantile.
The inter-quartile range is the distance between the first and third quartiles,
the 0.25 and 0.75 positions. If you split the data in half at the median, they're the medians of the two halves.
In our data set, we've got it split into two halves. Our lower quartile value is zero,
and our upper one is eight, so the inter-quartile range is eight. This gives you the width of the middle 50 percent of the data.
And so it gives you a measure of how spread out the data is that is similarly robust to outliers, like the median is.
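The sketch below computes the range and the inter-quartile range for the same ten scores using the median-of-halves approach described in the video. It also shows that the Pandas `quantile` method, which interpolates between neighbouring values by default, gives a slightly different lower quartile on such a small data set.

```python
import statistics
import pandas as pd

scores = sorted([0, 0, 0, 2, 2, 7, 8, 8, 15, 45])

value_range = max(scores) - min(scores)

# Median-of-halves quartiles, as described in the video (even-length data).
half = len(scores) // 2
lower_quartile = statistics.median(scores[:half])
upper_quartile = statistics.median(scores[half:])
iqr_halves = upper_quartile - lower_quartile

# Pandas interpolates between neighbouring values by default.
series = pd.Series(scores)
iqr_pandas = series.quantile(0.75) - series.quantile(0.25)

print(value_range, lower_quartile, upper_quartile, iqr_halves)
print(iqr_pandas)
```

There are several accepted conventions for computing quartiles, so small differences like this between tools are expected and shrink as the data set grows.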
If we want to, we can get a quick summary of a data series with the describe method.
The describe method returns its results rather than printing them,
which makes for a few subtle differences in what you're going to see when you run it in a notebook.
But it gives us the count, and it gives us the mean and standard deviation,
so we have our mean-based metrics of central tendency and spread.
Then it gives us the min and the max, and then the quartile points, so that we can see the median.
The 50% row is our median. And then we can also see the inter-quartile range, two to thirty-six.
This particular data set is a heavily,
heavily skewed data set. One way we can see that it's skewed is that the median is significantly less than the mean.
That's an indication of skew: the mean is pulled way up.
Also, if you look at it, the seventy-fifth percentile is thirty-six,
but the mean is four hundred and twenty-three.
So if you pick a movie at random, it's very unlikely to have the mean or larger number of ratings.
I don't know exactly what quantile 423 is going to be at,
but it's probably somewhere over 80 percent, possibly even over 90.
So this is the number of ratings per movie, and 80 to 90 percent of the movies have fewer than the mean number of ratings.
This really emphasizes the difference between the mean and the median.
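The same pattern can be reproduced on a tiny made-up series. The values below are hypothetical, chosen to be strongly right-skewed; they are not the MovieLens counts. The last line answers the question of what share of the values fall below the mean.

```python
import pandas as pd

# Hypothetical, strongly right-skewed counts (not the real MovieLens values).
counts = pd.Series([1, 1, 2, 2, 3, 6, 10, 36, 400, 8000], name="count")

summary = counts.describe()          # a Series of summary statistics
print(summary)

mean_count = summary["mean"]
median_count = summary["50%"]
share_below_mean = (counts < mean_count).mean()   # fraction of values under the mean

print(mean_count, median_count, share_below_mean)
```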
So how do you pick? The mean works well for centered values, where there are no excessively large or small values, especially ones skewed in one direction.
In that case, the mean is going to be approximately equal to the median.
A lot of other computations, as well as our ability to try to predict future values, really depend on the mean.
And also, the mean is the measure of central tendency such that if we take the total deviation from it, we get zero.
Now, the median is significantly more robust to outliers.
So if we're just trying to describe data and we have a strong skew,
with outliers on one side of the data, then it gives us a statistic that is not as strongly affected by them.
And as I indicated,
if we think back to the question of how do I change this value, when we're using the statistic as our evaluation criterion, as our target or our goal,
that becomes very important. If our goal is to raise the mean number of ratings,
we could do that by just getting a bunch more ratings for the most popular movies.
But if our goal is to raise the median number of ratings, we can only do that by getting more people watching
and rating less popular movies. That becomes a huge difference. The median divides high from low: the median value
is such that if you picked a random observation, it's equally likely to be greater or less than the median.
That's not true for the mean. But the median doesn't tell you how far away the values are on its own,
so it's limited in its ability to generate predictions.
So when we think about which one we want to use, it really comes down to the question we want to answer,
and then also the other things we're going to use it for. The mean is answering:
if we distributed the points equally, how many would each player have? And the median gets at the distribution of players around the value:
we want to find the value that players are equally likely to have more or less than.
Another one, quickly, is the mode. It's the most common value. It doesn't really work for continuous variables.
It's really, really useful for categorical variables. If you've got a categorical variable that has, like, three codes,
with the mode you know which one is the most common. That's a super, super valuable thing.
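For a categorical variable, the mode can be read off with `value_counts` or `mode`. A short sketch with made-up species labels (loosely modeled on the penguins data, not taken from it):

```python
import pandas as pd

# Hypothetical species labels.
species = pd.Series(["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo"])

counts = species.value_counts()   # frequency of each category, largest first
modes = species.mode()            # a Series, because ties are possible
most_common = modes.iloc[0]

print(counts)
print(most_common)
```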
It's also useful for integers and any other discrete variable. So to wrap up: the mean and the median describe where a value tends to be.
Standard deviation, variance, range, and inter-quartile range measure how spread out it is.
The mean is very computationally useful. We're going to need it a lot.
But it's very sensitive to outliers. The median and median-based statistics like the IQR are more robust to outliers.
One of the things we're also going to see later on is that we can do data transformations
to get data to be less skewed, and then compute the mean in the transformed
space. We can wind up with methods that give us the computational
benefits of the mean while also not having the outliers cause as many problems.

Terms#

This video introduces the following terms:

🎥 Describing Distributions#

This video introduces the concept of distributions, and how we can see the shape and layout of our data.

Video (11m59s)

Slides

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
DESCRIBING DISTRIBUTIONS

Learning Outcomes
Provide numerical descriptions of a distribution
Visualize the distribution of a variable
Photo by Leohoho on Unsplash

What Questions?
What is the average value? Mean, median
How spread out is the data? Standard deviation, IQR
Is it skewed?
What does it look like?

Numeric Descriptions
Previous video!
>>> numbers.describe()
count    10000.000000
mean         5.006208
std          0.998461
min          1.410392
25%          4.330468
50%          4.990636
75%          5.678746
max          9.081399
dtype: float64

Histograms
Shows how common different values are
X-axis: values
Y-axis: frequency (count or relative)
Bins controls division points
Pick for clarity and integrity!
import matplotlib.pyplot as plt
This data is symmetrical (not skewed)
>>> plt.hist(numbers, bins=25)

Real Data – Average Movie Rating
>>> plt.hist(movie_info['mean'])
>>> movie_info['mean'].describe()
count    59047.000000
mean         3.071374
std          0.739840
min          0.500000
25%          2.687500
50%          3.150000
75%          3.500000
max          5.000000
Name: mean, dtype: float64
Slight left skew
Mean less than median
Longer tail on the left of the histogram

Real Data – Movie Rating Count
>>> plt.hist(movie_info['mean'], bins=100)
>>> movie_info['count'].describe()
count    59047.000000
mean       423.393144
std       2477.885821
min          1.000000
25%          2.000000
50%          6.000000
75%         36.000000
max      81491.000000
Name: count, dtype: float64
Very strong right skew
Mean much greater than median
Most movies have far fewer ratings than mean!
Hard to really see in a histogram

Alternate Histogram
hist = movie_info['count'].value_counts()
plt.scatter(hist.index, hist)
plt.yscale('log')
plt.ylabel('Number of Movies')
plt.xscale('log')
plt.xlabel('Number of Ratings')
Points rather than bars
One point per rating count
Plotted on logarithmic axis
The mode is 1 (10^0)
Power law distribution (almost)

Artifact
Movie mean rating, more bins
Certain values more common!
1, 2, 3, 4, 5, 2.5, 3.5…
These are exact rating values
Because so many movies have only one rating!
>>> plt.hist(movie_info['mean'], bins=50)

Categorical Distributions
Categorical distributions with bar charts
spec_counts = penguins['species'].value_counts()
plt.bar(spec_counts.index, spec_counts)
plt.xlabel('Species')
plt.ylabel('# of Penguins')
Adelie is most common

In this video I'm going to talk with you about how to describe distributions.
I said one of the learning outcomes for this week is that we can describe the distribution of a variable both numerically and graphically.
We saw some of that in the descriptive statistics in the previous video.
We're going to go deeper into that in this video. The learning outcomes for this particular video are to be able to provide
numerical descriptions of a distribution and to visualize the distribution of a variable.
So when we ask how a variable is distributed,
we want to break that down into four specific questions.
First, what is the average value: the mean or median, whichever is appropriate?
How spread out is the data? These two things really give us a lot.
Where is the data, and how spread out is it? That tells us a lot about it.
But then we want to look at whether it is skewed. Do values tend high,
or do they tend low, with respect to that average value?
And then what does the data actually look like visually?
For numeric descriptions, the descriptive statistics from the previous video give us quite a few.
The describe method gives us a quick way to generate several of them.
So I generated ten thousand random numbers, and they've got a mean and a median both of about five.
Their spread is a standard deviation of about one.
We can visualize these numbers in what we call a histogram. A histogram shows how common different values are.
The x axis is the values themselves, and the y axis is how frequent each value is.
It can be either a count, the number of observations that are about that value, or it can be relative:
the percent or fraction of observations. This one is a count:
the y axis goes up to twelve hundred. When we have a continuous variable, the number of bins controls the division points.
So I said bins equals twenty-five. That divides
the range of the variable, as we've observed it, into twenty-five bins of equal width.
And for each bin it shows how many values fall into that bin.
So the largest bin is just over twelve hundred, centered right at five.
This data is also symmetrical. There's no skew to it.
There's a little bit of variation in the shapes of the two sides, just from the way things fall into the bins,
but there's no significant skew. It's evenly distributed around the mean, and the mean and the median are approximately equal, reflecting that.
The hist function comes from the library matplotlib.
The import for that is import matplotlib.pyplot, and the typical name for it is plt.
That gives us functions to do plotting. Matplotlib is one of the fundamental plotting libraries for Python.
A lot of other plotting libraries, such as Seaborn, plotnine, et cetera, are built on top of matplotlib.
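The lecture's random numbers are not distributed with these notes, so the following is only a sketch of how a similar series could be generated and plotted. The seed and the choice of a normal distribution with mean 5 and standard deviation 1 are assumptions made to match the summary statistics above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: stand-in for the lecture's `numbers`, drawn from a normal
# distribution with mean 5 and standard deviation 1.
rng = np.random.default_rng(20220829)  # arbitrary seed for repeatability
numbers = pd.Series(rng.normal(loc=5, scale=1, size=10_000))

# Numeric description: count, mean, std, min, quartiles, max
print(numbers.describe())

# Histogram with 25 equal-width bins. plt.hist returns the per-bin counts,
# the 26 bin edges, and the drawn bar objects.
counts, edges, _ = plt.hist(numbers, bins=25)
plt.xlabel('Value')
plt.ylabel('Count')
```

Passing density=True to plt.hist would instead scale the bars so that their total area is 1, which is one way to get a relative rather than count-based y axis.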
So if we want to look at some real data, we can look at the average rating for movies.
Remember, we have this movie data set, and we've computed each movie's average rating.
Well, how are those average ratings distributed? The mean is 3.07 and the median is 3.15.
There's a little bit of what we call a left skew. The direction of the skew is the direction in which
the longer tail goes, so left skew means we have a longer tail on the left and the data is more bunched up on the right.
We can see a longer tail on the left of the histogram. That gives us a slight left skew.
The mean is slightly less than the median. That's also an indicator of left skew.
But it's not very skewed. The mean and the median are pretty close in this case.
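To make the mean-versus-median check concrete, here is a small sketch on synthetic data. The exponential distribution and its mirrored copy are hypothetical stand-ins chosen only because their skew directions are known; they are not the movie data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed

# Hypothetical right-skewed data: exponential values with a long right tail.
right = pd.Series(rng.exponential(scale=10, size=5_000))
# Mirroring the values puts the long tail on the left instead.
left = 100 - right

for name, s in [('right-skewed', right), ('left-skewed', left)]:
    # Mean above the median suggests right skew; mean below suggests left skew.
    # Series.skew() gives a numeric skewness whose sign tells the same story.
    print(name, 'mean:', s.mean(), 'median:', s.median(), 'skew:', s.skew())
```

The mean is pulled toward the long tail while the median is not, which is why comparing the two is a quick first check.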
We can also look at the movie rating count.
For each movie, we've got these two variables: its average rating and the number of people who rated it.
And so we can look at the distribution of the movie rating count. This distribution is very, very heavily skewed,
with a very strong right skew. The mean is much greater than the median: 423 versus 6.
This means most movies are going to have far fewer ratings than the mean.
The histogram shows us that it's skewed and shows us this huge spike at the small values,
but it's hard to really see what's happening with the distribution here.
So here is an alternate way that we can plot the distribution.
The word histogram just means the tabulation of how frequent each value is.
So we can compute the frequency of each value.
The Pandas method value_counts does that. What it computes
is a series where the index is the values of the original series, and the value is the number of times that value appeared.
If we do that, we can then make a scatterplot. A scatterplot is where we plot some value on the x axis versus some value on the y axis.
The index is an array, so our x axis in the scatterplot is the index, which is the actual values themselves.
And the y axis is the value in the array, which is how many times that value appeared in the original array.
Then we're going to rescale it using what we call a log scale.
You'll notice the x and y axes are labeled ten to the zero, ten to the one, ten to the two, ten to the three.
Rather than the values being evenly spaced,
the logarithms are evenly spaced. Effectively, since we're using base ten logarithms here,
this shows us the order of magnitude of each of the values rather than the values themselves.
And so we can see that one, ten to the zero, is the mode.
It's the most frequent value; that's the top point. We can also see that the first part of it is almost a line.
We have a plot like this where
the x axis is our value and the y axis is how frequent that value is:
in this case, how many movies have that many ratings. And it's on
a log-log scale. When we see a line on a log-log scale in this kind of chart, that indicates that what we're looking at is
what's called a power law distribution, or something that's close to a power law distribution.
This is a common distribution that arises when we're talking about the popularity of various results of human activity.
It shows up in how frequently different words are used. It shows up in a lot of different human activity contexts.
If you look at, say, a social network, and you look at the popularity of different accounts,
accounts like Lady Gaga and Beyoncé have very, very popular Twitter accounts.
I have a moderately popular but much less popular Twitter account.
And a lot of people are down around 100 or 200 followers.
It's very common for distributions of that kind of activity to look like this.
And this kind of chart, where we have the scatterplot on log-scaled x and y axes, is a good way
to see it and a good way to get a handle on what the data actually looks like.
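The log-log frequency plot can be tried without the movie data. In this sketch, rating_counts is a hypothetical stand-in drawn from NumPy's Zipf generator, which follows a power law; the exponent and seed are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: synthetic stand-in for movie_info['count'], drawn from a
# Zipf (power-law) distribution rather than real rating counts.
rng = np.random.default_rng(7)
rating_counts = pd.Series(rng.zipf(a=2.0, size=20_000))

# value_counts tabulates the histogram: the index holds the distinct values,
# and the values hold how often each occurred (most frequent first).
hist = rating_counts.value_counts()

# One point per distinct value, with both axes on a base-10 log scale.
plt.scatter(hist.index, hist)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Number of Ratings')
plt.ylabel('Number of Movies')
```

Because value_counts sorts by frequency, hist.index[0] is the mode, and the points for small values should fall close to a straight line on these axes.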
So we have this strong power law skew. One of the artifacts of this: I'm replotting our mean ratings here, except with more bins,
50 bins instead of the default 10. We see the same basic shape.
We also see that a few values are much, much more common. Those values are 1, 2, 3, 4, 5, 2.5, 3.5.
The reason is that these are exact rating values. 3.5 is an exact rating value:
you can rate a movie 3.5, but you can't rate a movie 3.78.
So look at how many movies have one rating
or two ratings. If a movie only has one rating, its mean rating is going to be that rating.
Since one rating is by far the most common popularity level of a movie,
we're going to have a lot of movies whose mean is exactly one of the possible rating values, like three.
And so we see these spikes in the distribution when we look at it in a more fine-grained way,
just because there are so many movies that don't have very many ratings.
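A tiny made-up table shows where the spikes come from. The ratings below are hypothetical, on an assumed half-star scale, and the movie_info name simply mirrors the one used in the slides.

```python
import pandas as pd

# Hypothetical ratings on a half-star scale; this is not the MovieLens data.
ratings = pd.DataFrame({
    'movie': ['A', 'A', 'A', 'B', 'C', 'D', 'D'],
    'rating': [4.0, 3.5, 2.0, 3.5, 5.0, 1.0, 4.5],
})

# Per-movie mean rating and number of ratings
movie_info = ratings.groupby('movie')['rating'].agg(['mean', 'count'])
print(movie_info)

# For a movie with a single rating, the mean is that rating,
# so it lands exactly on one of the allowed rating values.
singles = movie_info[movie_info['count'] == 1]
print(singles)
```

Movies A and D have several ratings and end up with in-between means, while B and C sit exactly on rating values; with many single-rating movies, those exact values pile up in the histogram.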
So we've seen numerical distributions, whether they're continuous, integer, or other kinds of count data.
For a categorical distribution, our go-to is what's called a bar chart.
I'm again using value_counts here to count the number of penguins:
with the penguin data set from the earlier video, count the number of penguins of each species.
Then I'm plotting a bar chart that shows us the number of penguins of each species.
We can see that the Adelie penguin is the most common here.
The bar chart is going to be a really simple way to view the distribution of a categorical variable.
Seaborn, which is an additional library we're going to see later, provides really convenient ways to do this.
But here I'm showing you the raw matplotlib code so that you can see how the chart actually gets generated.
It gets generated by doing the value count, counting how many times each species appears.
And then we plot a bar chart whose x axis is the species and whose y axis is the number of times that species appears in the data set.
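For a version that runs without the data file, the sketch below builds a small hypothetical species column (the counts are made up, not the real penguin counts) and also shows the relative frequencies that value_counts can produce.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the penguin data: only a species column,
# with made-up counts.
penguins = pd.DataFrame({
    'species': ['Adelie'] * 5 + ['Gentoo'] * 4 + ['Chinstrap'] * 2,
})

# Counts per species, most common first
spec_counts = penguins['species'].value_counts()
# Same tabulation as fractions of all rows
spec_fracs = penguins['species'].value_counts(normalize=True)
print(spec_fracs)

plt.bar(spec_counts.index, spec_counts)
plt.xlabel('Species')
plt.ylabel('# of Penguins')
```

Plotting spec_fracs instead of spec_counts would give the same bars on a relative scale.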

Resources#

Notebook

🎥 Data Sources and Bias#

Video (8m46s)

In this video, I want to talk some about data sources, where our data is coming from, and particularly
introduce the concept of bias and start to talk about where biases can come from in our data.
Our learning outcomes are to understand what bias means and start to identify the sources of bias in observations of a variable.
One of the goals of a lot of our data science work, especially as we develop more sophisticated tools, is going to be to estimate things.
In statistical terminology, what we say is that we're estimating the value of a parameter.
I introduced the term statistic in a previous video.
A parameter is some property
of the world or of the population that we're trying to study. Our goal is to estimate that with some statistic.
So if we have our data pipeline, we have the things that we're trying to study, and
we have observable phenomena or experimental results that come out of those and become raw and then processed data.
The goal is to be able to use
the processed data to compute a statistic that allows us to estimate the value of the parameter back in the world.
For example, suppose we want to understand the approval of our company.
We want to estimate a parameter: either the net approval,
the number of people who approve of our company minus the number who disapprove,
or maybe the percentage of the residents of the society who have a positive opinion of our company.
We could compute a statistic:
we could take a sample of people and look at the percentage of that sample
that has a positive opinion of our company. The goal of this process is that the statistic is approximately the parameter.
Bias is when the statistic systematically differs from the parameter.
There are a few sources of this. One is selection bias, where some people are more likely to be contacted than others in our survey.
If the people who are
more likely to be contacted are either more or less likely to have a positive opinion than those who aren't contacted,
that's a source of bias. Response bias is when some people are more likely to respond.
One survey method is called random digit dialing, where you dial random phone numbers.
If some people are more likely to pick up the phone than others,
or if some people are more likely than others to respond to the survey once they find out what the call is,
that's also going to induce a bias. And then measurement bias is when the way that we measure the results skews one way or another.
In our example here, this could arise if the way that we frame the question
biases what people say, positively or negatively, or how they respond.
Then we have the response, they've answered our questions, but there's a bias in how their opinion translates into data.
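A small simulation can show what "systematically differs" looks like. Everything here is hypothetical: the 40% approval rate, the population size, and the assumption that approvers are twice as likely to answer the survey are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(533)  # arbitrary seed

# Hypothetical population in which about 40% approve of the company.
N = 100_000
approves = rng.random(N) < 0.40
parameter = approves.mean()  # the true value we want to estimate

# Simple random sample: every person is equally likely to be included.
simple = rng.choice(approves, size=2_000, replace=False)

# Response bias (assumed): approvers answer 20% of the time, others 10%.
p_respond = np.where(approves, 0.2, 0.1)
responded = rng.random(N) < p_respond
biased = approves[responded]

print('parameter:    ', parameter)
print('simple sample:', simple.mean())
print('biased sample:', biased.mean())
```

The simple sample's statistic lands near the parameter, while the biased sample overstates approval no matter how many people respond; collecting more data does not fix a systematic difference.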
These biases can come up at these stages of the pipeline in almost any kind of data collection process.
Controlling for them and counteracting them is a significant field of study; reputable political pollsters and
reputable survey organizations have very good mechanisms for quantifying and reducing these sources of bias.
But as we go from the population of people or objects we're trying to study
through to the data that we actually get, these are the places where bias can come into the process.
Bias also may not affect all groups equally.
We may have a group that shows up more or less frequently in the data than they are in the population.
There may be a measurement skew, so that the way that we're measuring our data
responds to the thing we're trying to measure differently between different groups.
One example of this: standardized tests like the SAT and the ACT are intended to measure your academic preparedness for college.
But there are two things that go into how well you're going to do on the SAT or
the ACT. One is your raw academic preparedness:
how good are you at engaging with the kind of material that they're testing your ability to engage with?
The other is your preparedness for the test itself. There are a lot of test preparation resources that help you prepare for the test.
Then there are other things, like just how much time you have available to study.
One of the outcomes of that is that socioeconomic status becomes a very strong factor in standardized test scores.
So suppose you have two students who, given the same economic situation,
the same level of stress, and the same level of preparedness, would be able to engage equally well with the material.
That, ideally, is what you want to test if you're, say, seeing whether someone is going to be an effective college student.
The one who has more economic security doesn't have to work as many hours that take time from their studies.
They have the ability to afford more test prep resources.
They're going to score higher on the standardized test than the person who, because of their social situation,
their economic situation, or their background, goes into the test less prepared.
If you swapped these students' circumstances, the scores would swap.
There's no difference in the students' academic ability to engage with the material and to do the work.
The measurement instrument, the standardized test, is responding differently to the thing it wants to measure based on the
socioeconomic status and surrounding circumstances of the student we're trying to measure.
So one of the things we immediately need to do with this, in line with our theme this week
of describing data, is to clearly and fully document the data collection process.
This is a major focus of the Datasheets reading, and it does a few things. First,
it forces us to think about it. If we're creating the data, we think it through; if we're using an existing data set,
we're trying to find the answers to these questions. It then enables further and future reuse of the data,
because if we've carefully documented the collection process, the data processing, etc., that results in the data,
then other people who come across the data, future users who may want to reproduce our analysis or apply the data to a different problem,
will have the information they need to assess what the likely biases are and whether those biases are likely to affect their problem.
It also creates a basis for the future: if we discover additional potential biases through research,
it lets us go back and ask, based on the documentation of how this data was collected,
how likely is it for that to be a problem for this data as well? So the takeaway I want you to have:
I want you to start thinking about how bias can affect our data.
From a statistical perspective, bias is the systematic deviation of our estimate from the thing we're trying to estimate.
But document your data. Look for the documentation of the data that you're using.
So to wrap up, the goal
is for our data to accurately reflect the population and for the statistics we compute from it to accurately and reliably approximate parameters.
They're never going to exactly equal the quantity of interest,
but hopefully they're pretty close, and hopefully there are no systematic differences in one way or another.
But there are various sources of bias: sampling bias, response bias, and measurement bias, just to name three.

Resources#

Olteanu, Alexandra and Castillo, Carlos and Diaz, Fernando and Kiciman, Emre, Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries (December 20, 2016). Frontiers in Big Data 2:13. doi: 10.3389/fdata.2019.00013, Available at SSRN: http://dx.doi.org/10.2139/ssrn.2886526

🚩 Week 2 Quiz#

The Week 2 quiz will be over the material above this point.

🎥 Codings and Encodings#

This video talks more about how data is encoded, and what we need to document about that.

Video (8m57s)

In this video I'm going to talk with you about coding and encoding data. The learning outcomes are to recognize when a variable might
need a codebook or a dictionary to give it more explanation, to understand the difference between a variable and its encoding,
and to transform one encoding to another. We'll start to think about what we're going to need in order to do that.
So the data we've been talking about needs to be encoded. It needs to be stored somehow.
For the variables we talked about in the earlier video, we actually have to record those values in some way.
It's important to recognize that the encoding and the value that is encoded in that encoding are not the same.
We could have the number twenty-seven and we can write it in multiple different ways.
This is a warm-up example. We could write the digits 2 and 7.
We could write out the words "twenty-seven". We could write it in hexadecimal, 0x1B. The value is twenty-seven, but we have different ways of writing it.
To encode numeric data, we can encode it as a binary integer.
For each of these I'm showing the hexadecimal values of the bytes used to encode it.
So we can encode it as a binary integer, or we could encode it as a floating-point
binary number. Looks very different, doesn't it?
We could encode it as a decimal number: those are the ASCII codes for the digits
2 and 7. When we save a CSV file, it's stored as text,
so it's storing the decimal digits. There's another format called binary-coded
decimal that's used on some mainframes and other systems for efficiently storing the actual decimal values.
Encodings can also be lossy. Floating point, for example, loses precision.
And we can record things as integers, which truncates whatever decimal part they may have had.
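These encodings can be reproduced with Python's standard struct module. The 32-bit widths and little-endian byte order below are assumptions made for the sketch; the bytes shown in the video may use a different size or byte order.

```python
import struct

value = 27

# Assumption: 32-bit, little-endian layouts for the binary encodings.
as_int32 = struct.pack('<i', value)    # binary integer
as_float32 = struct.pack('<f', value)  # IEEE 754 floating point
as_text = str(value).encode('ascii')   # decimal digits as ASCII characters
as_bcd = bytes([(2 << 4) | 7])         # binary-coded decimal: 4 bits per digit

for label, raw in [('int32', as_int32), ('float32', as_float32),
                   ('text', as_text), ('BCD', as_bcd)]:
    print(label, raw.hex())

# Lossy encodings: a 32-bit float cannot store 0.1 exactly, and converting
# to an integer drops the fractional part.
round_trip = struct.unpack('<f', struct.pack('<f', 0.1))[0]
print(round_trip)
print(int(27.9))
```

All four byte strings represent the same value, twenty-seven, which is the distinction between a value and its encoding.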
So this is just for encoding numbers. We've had these four ways of encoding, but that's the syntactic encoding.
Knowing that this is stored as decimal characters, or that this is stored as a 32-bit integer, is not enough to interpret it,
because we need to know how it was measured. We need to know what the units are.
Is this millimeters? Feet?
Crocodiles? It may have been transformed somehow: some data sets center the values or take the logarithm of their values.
We also need to know if there are any sentinel values, because it's not uncommon to get a data set where it's numbers,
but then there's a special value that you use to indicate, say, unknown data.
So we might have the number of classes a student took in a day: one, two, seven, whatever.
And if we don't know, we record ninety-nine. Data recorded today tends to actually just exclude the value.
But there's lots of historical data out there, and
lots of historical data processing systems, that use specific values to indicate things such as unknown.
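As a sketch of why sentinel values matter, the hypothetical series below uses 99 for "unknown", following the example above, and shows how it distorts a mean until it is converted to a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: classes taken per day, with 99 recorded for "unknown".
classes = pd.Series([1, 2, 7, 99, 3, 99])

# Misleading: the sentinel is averaged in as if it were a real count.
print(classes.mean())

# Replace the sentinel with NaN so Pandas treats it as missing.
cleaned = classes.replace(99, np.nan)
print(cleaned.mean())  # NaN values are skipped by default
```

When loading a file, the na_values parameter of pd.read_csv can do the same conversion at read time, provided the documentation says which values are sentinels.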
We need to know if any of those values are in the data. When we have categorical data, we need to know how the data is what we call coded.
With categorical data, there are a few different values the variable can take on.
We call these codes or levels of the categorical variable. We need to know what they are.
We also need to know how they're stored: as numbers or as strings? In some data sets
we have a string; for our penguins, we just wrote down the string. But in some there'll be numbers: zero,
one, two, three, four. Then we need the codebook to tell us what those numbers are.
In any case, we need to know how the data was recorded and what the codes are. We also need to know how they're defined:
what does a particular value for this categorical variable mean?
It's useful to know what rules were used to decide which code to apply. For some things that might be fairly straightforward and obvious,
but for others it might not be. Also, there's a meta question around coding categorical data: who decided the definitions, and how
was this set of codes decided upon and given the definitions that they were given?
One example of this is in census data.
The categories for race, the way race is collected, have changed throughout the history of the United States.
The set of categories, whether you have to pick one or whether it's check all that apply, etc., has changed over time.
And it becomes a very political process to define how we record race when we are collecting census data.
That then has strong effects on how we understand the representation and distribution of race in the country. So, a few examples.
In the penguin data set we looked at, the species is a categorical variable, and it's written down
just with the name of the species: Adelie, Chinstrap, or Gentoo.
There's a raw version of the data that has the full biological species name.
These come from biological taxonomy. There's another data set that is used for some things: credit information for German loan applicants.
It has various variables. One of its variables is the status of the applicant's checking account.
It has the codes A11 for overdrawn, A12 for between zero and 200 deutschmarks,
A13 for either at least 200 deutschmarks or having had their salary deposited for a year,
and A14 meaning no checking account. So we've got these categorical codes.
If you see A12 in the data, you do not know what it means without looking at the codebook.
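In code, a codebook is often just a lookup table. The sketch below maps the checking-account codes to short labels; the labels are abbreviated paraphrases for illustration rather than the dataset's official wording, and the status column is hypothetical.

```python
import pandas as pd

# Abbreviated, paraphrased labels for illustration; consult the dataset's
# own codebook for the exact definitions.
checking_codes = {
    'A11': 'overdrawn',
    'A12': 'low balance',
    'A13': 'higher balance or salary deposits',
    'A14': 'no checking account',
}

# Hypothetical column of raw codes as they might appear in the file
status = pd.Series(['A12', 'A14', 'A11', 'A14'])

# Series.map looks up each code in the dictionary.
labels = status.map(checking_codes)
print(labels)
```

Any code missing from the dictionary would map to NaN, which is a useful signal that the codebook and the data disagree.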
Even if it does look obvious, it's good to look at the codebook. When we actually go to record categorical data,
a lot of times, especially in the raw data files that we load, it's going to be directly stored as a string or an integer.
We're going to have a column for the categorical variable and it has the value in there.
But for computational purposes, we're often going to need to encode it differently, because you can't compute on A13. There are a couple of different encodings.
One is one-hot encoding, where each different code or level gets a variable.
It's a logical variable, but we encode it with an integer, zero or one.
So for the German credit data, we're going to have A11, A12, A13, and A14
as all different variables. For the penguins, we would have three variables, one for each species, and exactly one of them is one.
So an Adelie penguin would have one in the Adelie variable and zero in Chinstrap and Gentoo.
Another option is what's called dummy coding, which is very similar, except one of the codes doesn't get a variable.
So all zeroes in the variables for the categorical variable
means it's the omitted one, and a one in any of them means that one.
Why we need that is going to come up when we start talking about linear modeling,
but it's a very common statistical way of encoding a variable. The variables that we use for this are called indicator variables.
So if we're transforming our penguins into either one-hot or dummy-coded variables, we say that we have an indicator variable for Chinstrap.
It's one if the penguin is a Chinstrap and zero if it is not.
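Pandas can build both encodings with get_dummies. This is a sketch on a hypothetical four-penguin series; which level dummy coding omits is a choice, and drop_first simply drops the first level in sorted order.

```python
import pandas as pd

# Hypothetical species values for four penguins
species = pd.Series(['Adelie', 'Chinstrap', 'Gentoo', 'Adelie'], name='species')

# One-hot encoding: one 0/1 indicator column per species
one_hot = pd.get_dummies(species, dtype=int)
print(one_hot)

# Dummy coding: the first level (Adelie) gets no column, so an Adelie
# penguin is represented by zeroes in all remaining indicators.
dummy = pd.get_dummies(species, drop_first=True, dtype=int)
print(dummy)
```

In the one-hot table every row sums to one; in the dummy-coded table a row of all zeroes identifies the omitted level.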
So data has to be coded and encoded in order for us to process and analyze it; we actually have to store it somehow.
And the process of coding affects the data that we have when we do an analysis.
When we do an inference, the way that the data was coded affects how we view and how we understand the things that the data are actually about.
Sometimes this is relatively straightforward, as with the penguin species,
although the way the penguins got divided into their various species and got those species names is a historical social process.
But with other things that have very contested social definitions, such as how you indicate race,
it becomes a very strong lens that affects how we understand the underlying
reality that the data is supposed to represent. The representation,
the coding, the codebook, and
the definitions need to be documented thoroughly in order for us to properly understand the data that we're working with.

Resources#

float.exposed - explore floating-point numbers

📓 Tutorial Notebooks#

The tutorial notebooks contain more info on the Python code, including more systematic overviews of the different Python and Pandas features. I recommend you particularly pay attention to the Python and Data tutorials.

We will not get to all Pandas features you might need in videos. These notebooks, the 📖 Python for Data Analysis textbook, and the Pandas User Guide all provide additional information on Pandas, NumPy, and SciPy.

✅ Practice#

In a few videos, I have used the Palmer Penguins data set.

Download ../resources/data/penguins.csv (provided under CC-0)
Look at the documentation and references in the source repository
Describe the distributions of the different variables numerically and graphically
See how many questions from Datasheets for Datasets you can find answers to in the penguin data documentation

For more practice, you can look at the paper for the MovieLens data and try to answer Datasheets questions for it too!

📩 Assignment 1#

Start working on Assignment 1. It is due at the end of Week 3.

CS 533 Fall 2022
