Tricks with Boolean SeriesΒΆ
This notebook discusses Boolean series, and various useful things you can do with them.
First letβs import our libraries:
import pandas as pd
import numpy as np
rng = np.random.default_rng(20201103)
Computing ProbabilityΒΆ
If we have a logical series:
xb = pd.Series([True, False, True, True, True, False, False, True, False, True])
xb
0 True
1 False
2 True
3 True
4 True
5 False
6 False
7 True
8 False
9 True
dtype: bool
We can count the number of True
values with sum
:
xb.sum()
6
We can count the fraction of True
values, or the probability of True
, with mean()
:
xb.mean()
0.6
The NumPy equivalents also work:
np.mean(xb)
0.6
Creating with Logical OperationsΒΆ
If you do a logical operation (==
, <
, >
, <=
, or >=
) on a series, comparing it to either a fixed value or another series, you will get a Boolean series.
This is very useful for creating the outcome vector for a logistic regression, as the boolean series will be treated as 1 (True
) and 0 (False
), and can be predicted with either StatsModels or SciKit-Learnβs logistic regression or other classifier.
For example, if we draw some random numbers:
xs = pd.Series(rng.random(size=1000))
xs
0 0.163352
1 0.333021
2 0.054169
3 0.662442
4 0.517284
...
995 0.882045
996 0.897472
997 0.161948
998 0.850922
999 0.222618
Length: 1000, dtype: float64
xs.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x22e1f5277f0>
xs.describe()
count 1000.000000
mean 0.491382
std 0.285871
min 0.000805
25% 0.235146
50% 0.489305
75% 0.737922
max 0.997661
dtype: float64
We can get a series that is True
when the number is at least 0.8:
xs_ge = xs >= 0.8
xs_ge
0 False
1 False
2 False
3 False
4 False
...
995 True
996 True
997 False
998 True
999 False
Length: 1000, dtype: bool
We can verify that values start at 0.8:
xs[xs_ge].min()
0.8050954039782324
Since the values are drawn uniformly in the range \([0,1)\), then approximately 20% of the values should be at least 0.8 (\(P(X \ge 0.8) = 0.2\)). Letβs check:
xs_ge.mean()
0.183
Logical OperationsΒΆ
The bitwise negation operator, ~
, negates a boolean series:
~xb
0 False
1 True
2 False
3 False
4 False
5 True
6 True
7 False
8 True
9 False
dtype: bool
The np.logical_not
function also does:
np.logical_not(xb)
0 False
1 True
2 False
3 False
4 False
5 True
6 True
7 False
8 True
9 False
dtype: bool
We can combine them with bitwise and (&
) or or (|
).
Letβs find all the values between 0.8 and 0.9:
xs_in_range = xs_ge & (xs < 0.9)
xs[xs_in_range].describe()
count 87.000000
mean 0.849424
std 0.028517
min 0.805095
25% 0.826880
50% 0.845260
75% 0.878704
max 0.897472
dtype: float64
We can find everything except \([0.2,0.8)\):
xs_lohi = xs_ge | (xs < 0.2)
xs[xs_lohi].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x22e1edadeb0>
Look, we cut out the middle!