Random Numbers

This notebook provides some more details on obtaining random numbers in Python and Pandas, and how to work with random number generators (RNGs).

We are just going to need a few packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import seedbank

Generating Random Numbers

Computers do not generate truly random numbers. Instead, they generate pseudorandom numbers: numbers that, for our purposes, behave as if they were random, in that we cannot predict (without knowing the generator’s state) what the next number will be.

As we have seen, we often seed the random number generator so that its output is repeatable and we can reproduce our results.

Let’s create a random number generator with a particular seed:

rng = np.random.default_rng(42)

We can then get a number in the range \(0 \le n \lt 50\):

rng.integers(50)
4

If we ask for a second number, we get a different one:

rng.integers(50)
38

If, however, we create another generator with the same seed, we get the same first number:

rng2 = np.random.default_rng(42)
rng2.integers(50)
4

The generator object stores a state, an internal value that is used to generate the next number and that is updated each time we generate more numbers. Without knowing what the state is, it is difficult to predict what the next number will be. Such numbers, assuming they obey reasonable statistical properties, are a sufficiently close approximation to random that we can use them for simulations, bootstraps, and other tasks that need random numbers or random decisions. There are quite a few tests for determining if an RNG’s output is “good enough”, which are outside the scope of this class.
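To make the idea of generator state concrete, here is a minimal sketch that snapshots a generator's state, draws a few numbers, restores the snapshot, and draws again. It relies on the bit_generator.state property of NumPy generators; the variable names are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Snapshot the internal state (a plain dict) before drawing anything.
saved_state = rng.bit_generator.state

first = rng.integers(50, size=3)

# Restoring the snapshot rewinds the generator, so the same draws repeat.
rng.bit_generator.state = saved_state
replay = rng.integers(50, size=3)

print(first, replay)
print(np.array_equal(first, replay))
```

In everyday code, re-creating the generator from its seed is usually simpler than saving and restoring state; the point here is only that the state fully determines what comes next.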

Modern Approach: NumPy Generators

The modern approach to doing random number generation in scientific Python code is to use a Generator object from NumPy. The default_rng() function creates such a generator, as we have seen above:

rng = np.random.default_rng(20221013)

NumPy actually supports multiple different random number generation algorithms, and default_rng creates one using the current default settings. Future versions of NumPy may change those defaults, so sequences of numbers are only repeatable with the same NumPy version (and actually on the same OS & CPU architecture).
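If a change of default algorithm is a concern, one option is to name the bit generator explicitly. The sketch below assumes the PCG64 and Philox classes from numpy.random and wraps each in a Generator; it guards against a change in the default algorithm, not necessarily against every other source of variation across versions or platforms.

```python
import numpy as np

seed = 20221013

# Name the bit generator explicitly instead of relying on the default.
pinned_a = np.random.Generator(np.random.PCG64(seed))
pinned_b = np.random.Generator(np.random.PCG64(seed))

# A different algorithm with the same seed produces its own stream.
philox = np.random.Generator(np.random.Philox(seed))

draws_a = pinned_a.integers(1000, size=5)
draws_b = pinned_b.integers(1000, size=5)
draws_philox = philox.integers(1000, size=5)

# Show which algorithm the installed NumPy uses by default.
print(type(np.random.default_rng(seed).bit_generator).__name__)
print(np.array_equal(draws_a, draws_b))
print(draws_philox)
```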

This object supports many different methods for producing different kinds of output. We have already seen integers():

rng.integers(1000)
0

It can also generate multiple integers at once as an array:

rng.integers(1000, size=10)
array([222, 196, 114, 502, 953, 182, 213,  59, 367, 111], dtype=int64)

If we want random floating-point numbers in the half-open range \([0,1)\), we can use random():

rng.random()
0.8157251358428466

Like integers, and any other methods returning numbers, it can take a size:

rng.random(size=5)
array([0.68121396, 0.40387077, 0.87274614, 0.9671598 , 0.76114327])
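Draws from the unit interval can be rescaled to any other interval. The sketch below shows the arithmetic by hand and the equivalent uniform() method; the bounds low and high are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(20221013)
low, high = -2.0, 3.0  # illustrative bounds

# Rescale unit-interval draws by hand...
scaled = low + (high - low) * rng.random(size=1000)

# ...or let uniform() do the same job.
direct = rng.uniform(low, high, size=1000)

print(scaled.min(), scaled.max())
print(direct.min(), direct.max())
```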

The Generator class has methods to draw numbers from many distributions, like the normal:

rng.normal(0, 1)
-0.5737262994785399

We can see that these look pretty normal if we plot a histogram of many draws:

sns.displot(x=rng.normal(5, 2, size=10000), kde=True)
plt.show()
[Figure: histogram with KDE curve of 10,000 draws from a normal distribution with mean 5 and standard deviation 2]

Another is the exponential:

sns.displot(x=rng.exponential(10, size=10000))
plt.show()
[Figure: histogram of 10,000 draws from an exponential distribution with scale 10]
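A numeric check can complement the plots. This sketch draws a larger sample (100,000 values, an arbitrary choice) and compares summary statistics with the parameters we passed: normal() takes a mean and a standard deviation, and exponential() takes a scale, which is the distribution's mean.

```python
import numpy as np

rng = np.random.default_rng(20221013)

normal_draws = rng.normal(5, 2, size=100_000)
exp_draws = rng.exponential(10, size=100_000)

# normal(loc, scale): sample mean and SD should be close to 5 and 2.
print(normal_draws.mean(), normal_draws.std())

# exponential(scale): the scale is the mean, so this should be close to 10.
print(exp_draws.mean())
```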

The choice() method uses the random number generator to choose from a list or array of values. Let’s make such an array:

vals = np.arange(0, 100, 5)
vals
array([ 0,  5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80,
       85, 90, 95])

And randomly pick 5 elements:

rng.choice(vals, 5)
array([70, 60, 30, 10, 40])

By default, the sampling is with replacement when we use choice. To make it without replacement:

rng.choice(vals, 5, replace=False)
array([85, 75, 25, 95, 20])
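The difference between the two modes is easy to check in code. The sketch below also uses permutation(), another Generator method, to shuffle the whole array; the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(20221013)
vals = np.arange(0, 100, 5)

# Without replacement, no value can appear twice.
unique_pick = rng.choice(vals, 5, replace=False)

# With replacement, we can draw more items than the array holds.
oversample = rng.choice(vals, 50)

# permutation() returns a shuffled copy and leaves vals untouched.
shuffled = rng.permutation(vals)

print(unique_pick)
print(len(np.unique(oversample)), "distinct values in 50 draws")
print(shuffled)
```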

There are several other ways that Generator can generate random outputs. I recommend reading its documentation.

If you use SeedBank, you can create Generator objects using its seedbank.numpy_rng() function:

rng = seedbank.numpy_rng(95)
rng.integers(1000)
449

This function has some additional functionality, like reusing SeedBank’s state from seedbank.initialize() if no generator seed is provided.

Old Style: NumPy Functions

NumPy also exposes many functions directly from the numpy.random module that do not require you to have a random number generator object. They work the same way as Generator methods, and some of them have the same names.

To generate some random floats according to the standard normal distribution:

np.random.randn(5)
array([ 0.91286021,  0.24710817,  0.23935922, -1.46121369, -0.75626629])

You can think about these functions as if they used a global Generator that’s initialized when you start up Python.

Note

The legacy random functions actually use a numpy.random.RandomState, which is similar to a Generator but uses an older design that is not getting updates in new NumPy versions. You can usually just use generators, but you may sometimes interact with libraries that need a RandomState and cannot operate with a Generator.

To initialize the seed of this shared, global RNG, you can use numpy.random.seed():

np.random.seed(20201013)
np.random.randn(5)
array([ 0.29024655,  0.51061925,  0.10230704,  0.38427783, -0.42088693])
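As a quick check of how the global seed behaves, this sketch re-seeds the shared RNG and compares the draws, then builds a separate numpy.random.RandomState with the same seed. The variable names are illustrative.

```python
import numpy as np

np.random.seed(20201013)
first = np.random.randn(5)

# Re-seeding resets the shared generator, so the draws repeat.
np.random.seed(20201013)
second = np.random.randn(5)

# A private RandomState with the same seed yields the same stream
# without touching the global one.
private = np.random.RandomState(20201013)
third = private.randn(5)

print(np.array_equal(first, second))
print(np.array_equal(first, third))
```

Because every caller shares the global state, any library that draws from it between the seed and your own draws will change what you get, which is one reason to prefer explicit generator objects.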

Alternatively, you can use seedbank.initialize(), which initializes both NumPy’s global RNG and several others:

seedbank.initialize(20201013)
SeedSequence(
    entropy=20201013,
)
np.random.randn(5)
array([-0.88487992,  0.64911734, -0.91608294,  0.67446197, -0.85210414])

Note

This produces different numbers because SeedBank uses a slightly different algorithm from numpy.random.seed() to convert the seed you provide into an actual RNG seed. Both, however, will initialize the RNG to produce reproducible results.

The numpy.random.choice() function also works like numpy.random.Generator.choice().

Random Sampling in Pandas

Pandas provides the pandas.DataFrame.sample() method, and similar methods, to facilitate random sampling from data frames and series. By default this sampling is without replacement: its replace parameter has the opposite default from choice().

By default, it also uses the NumPy global random number generator:

movies = pd.read_table('../data/hetrec2011-ml/movies.dat', sep='\t', encoding='latin1')
movies.sample(n=5)
id title imdbID spanishTitle imdbPictureURL year rtID rtAllCriticsRating rtAllCriticsNumReviews rtAllCriticsNumFresh ... rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore rtAudienceRating rtAudienceNumRatings rtAudienceScore rtPictureURL
4735 5049 48 Hrs. 83511 Límite 48 horas http://ia.media-imdb.com/images/M/MV5BMTc2Mjc4... 1982 48_hrs 7.4 33 31 ... 93 0 4 3 1 75 3.2 10471 61 http://content8.flixster.com/movie/26/94/26949...
8815 34482 The Browning Version 109340 La versión Browning http://ia.media-imdb.com/images/M/MV5BMTY5MjM0... 1994 1055873-browning_version 6.7 16 13 ... 81 0 3 1 2 33 3.5 322 63 http://content9.flixster.com/movie/28/01/28019...
7884 8918 Eulogy 349416 Eulogy http://ia.media-imdb.com/images/M/MV5BMTIxMzIz... 2004 eulogy 4.6 34 11 ... 32 3.9 9 1 8 11 3.6 2744 71 http://content6.flixster.com/movie/27/09/27091...
4721 5035 Wuthering Heights 32145 Cumbres borrascosas http://ia.media-imdb.com/images/M/MV5BMTYyNDY3... 1939 1024192-wuthering_heights 7.9 16 16 ... 100 0 1 1 0 100 3.9 2525 84 http://content7.flixster.com/movie/31/37/31379...
1647 1850 I Love You, Don't Touch Me! 130019 I Love You, Don't Touch Me! http://ia.media-imdb.com/images/M/MV5BMTI5MTk0... 1997 i_love_you_dont_touch_me 5.4 10 4 ... 40 0 1 1 0 100 3.1 352 44 http://content6.flixster.com/movie/10/85/09/10...

5 rows × 21 columns

You can also pass a generator using the random_state option:

movies.sample(n=5, random_state=rng)
id title imdbID spanishTitle imdbPictureURL year rtID rtAllCriticsRating rtAllCriticsNumReviews rtAllCriticsNumFresh ... rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore rtAudienceRating rtAudienceNumRatings rtAudienceScore rtPictureURL
2306 2525 Albino Alligator 115495 Albino Alligator http://ia.media-imdb.com/images/M/MV5BMTg1MjQ0... 1996 albino_alligator 5.5 17 8 ... 47 5.4 6 2 4 33 3 1198 42 http://content8.flixster.com/movie/10/83/84/10...
5519 5857 So Fine 83099 Profesor a mi medida http://ia.media-imdb.com/images/M/MV5BMjAxMTUy... 1981 so_fine 0 4 3 ... 75 0 1 1 0 100 2.9 107 15 http://content6.flixster.com/movie/10/85/39/10...
4957 5278 Fraternity Vacation 89167 Fraternity Vacation http://ia.media-imdb.com/images/M/MV5BMTc5Mjkz... 1985 fraternity_vacation 0 3 0 ... 0 0 2 0 2 0 3 279 47 http://content9.flixster.com/movie/27/76/27769...
587 613 Jane Eyre 780362 Jane Eyre http://ia.media-imdb.com/images/M/MV5BMTQ0Njkx... 2006 NaN \N \N \N ... \N \N \N \N \N \N \N \N \N \N
9311 48982 Flushed Away 424095 Ratónpolis http://ia.media-imdb.com/images/M/MV5BOTIwOTc5... 2006 flushed_away 6.7 128 92 ... 71 6.7 33 25 8 75 3.3 58694 68 http://content6.flixster.com/movie/10/88/03/10...

5 rows × 21 columns
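The same options can be tried on a small frame without the movie data. The sketch below uses a hypothetical toy DataFrame and assumes a Pandas version recent enough to accept a Generator for random_state; an integer seed is also accepted and makes a single call repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movies table.
toy = pd.DataFrame({"id": range(10), "score": np.linspace(0, 9, 10)})

# An integer random_state makes the sample repeatable.
a = toy.sample(n=3, random_state=42)
b = toy.sample(n=3, random_state=42)
print(a.equals(b))

# Generators created from the same seed also give matching samples.
c = toy.sample(n=3, random_state=np.random.default_rng(7))
d = toy.sample(n=3, random_state=np.random.default_rng(7))
print(c.equals(d))

# Sampling more rows than exist requires replace=True.
e = toy.sample(n=15, replace=True, random_state=42)
print(len(e))
```

Passing one shared Generator, as in the call above with rng, advances that generator, so repeated calls return different samples while the notebook as a whole stays reproducible.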

Recommendations

For quick-and-dirty code, I usually just call seedbank.initialize() at the top of my notebook, create a generator with the defaults for NumPy sampling, and then use Pandas’ defaults for Pandas sampling. The initialization code looks like this:

seedbank.initialize(20221013)
rng = seedbank.numpy_rng()

For more advanced code, particularly if I am writing library code to be used by others, I do everything by creating a Generator object and passing it around to other code that needs it.
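One way this pattern can look in practice is sketched below. The function bootstrap_means() and its parameters are hypothetical, written only to illustrate the idea; it leans on the fact that default_rng() accepts None, an integer seed, or an existing Generator.

```python
import numpy as np


def bootstrap_means(data, n_boot=1000, rng=None):
    """Return n_boot bootstrap estimates of the mean of data.

    rng may be None, an integer seed, or an existing Generator.
    """
    # default_rng() returns an existing Generator unchanged and builds
    # a new one from a seed or from None.
    rng = np.random.default_rng(rng)
    data = np.asarray(data)
    # Resample indices with replacement, one row per bootstrap replicate.
    idx = rng.integers(len(data), size=(n_boot, len(data)))
    return data[idx].mean(axis=1)


rng = np.random.default_rng(20221013)
sample = rng.normal(5, 2, size=200)

# The caller owns the generator and hands it to the helper.
means = bootstrap_means(sample, n_boot=500, rng=rng)
print(means.shape, means.mean())
```

Taking the generator as a parameter keeps the function free of hidden global state: callers decide whether runs are seeded, and tests can pass a fixed seed.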

Warning

None of the random number generators here are cryptographically secure. If you need random numbers that are unpredictable for security or cryptographic purposes, you will need to use a cryptographic random number generator, which is out of scope for this class.