"However, (current) computers are deterministic. So, how can we create randomness? \n",
"\n",
"Can you think of ideas to create true randomness?\n",
"1. You can sample a noisy measurement (temperature,...)\n",
"\n",
"Methods that create (sequences) of true random numbers have three problems:\n",
"\n",
"1. \n",
"2. \n",
"3. \n",
"1. The methods are slow.\n",
"2. The methods are non-repeatable, which spells trouble for debugging.\n",
"3. The quality of the methods depends on our understanding of the physical process.\n",
"\n",
"For these reasons, in most cases where programmers or scientists need random numbers, we use so called PseudoRandomNumberGenerators (short: PRNGs).\n",
"PRNGs are algorithms that, given a chosen initial state (seed), produce a deterministic sequence of numbers that behave as if they were drawn from the target distribution.\n",
%% Cell type:markdown id: tags:
## (Pseudo) Random Number Generation
Central to data-analysis is the modelling, simulation, and examination of noisy data. Noise is, by definition, random.
Also, many methods used for the examination of our data models are computationally more efficient with a stochastic ansatz.
Thus, our ability to simulate noisy data and its examination will depend on our ability to create (pseudo) randomness within our computer codes.
However, (current) computers are deterministic. So, how can we create randomness?
Can you think of ideas to create true randomness?
1. You can sample a noisy measurement (temperature,...)
Methods that create (sequences) of true random numbers have three problems:
1.
2.
3.
1. The methods are slow.
2. The methods are non-repeatable, which spells trouble for debugging.
3. The quality of the methods depends on our understanding of the physical process.
For these reasons, in most cases where programmers or scientists need random numbers, we use so called PseudoRandomNumberGenerators (short: PRNGs).
PRNGs are algorithms that, given a chosen initial state (seed), produce a deterministic sequence of numbers that behave as if they were drawn from the target distribution.
The first step to be able to reproduce more complicated probability distributions is the ability to reproduce the simplest one: the uniform distribution.
Many algorithms exist. However, as we will see in this notebook, not all algorithms are created equally well.
### Linear congruential random number generators
One of the oldest and simplest methods to generate pseudo-random numbers are so-called linear congruential generators that recursively perform the operation
$$ X_{n+1} = (a X_n + C)\mod m$$
where $0< m$ it the "modulus", $0<a<m$ is a "multiplier", $c$ is a "increment" and $X_0$ is the seed. Naturally, if $X_{n + i} = X_{n}$ for any $i$, the sequence will repeat.
For appropriate choices of the above variables, the recurrance will have very long (and known) periods, a good uniform distribution in the range $[0, m]$, and very low correlation of subsequent samples.
However, not all linear congruential generators are created equal.
### The RANDU algorithm
Lets consider the RANDU algorithm. The algorithm was very popular in the 1960s and 1970s and was used, for instance, by the IBM Scientific Subroutine Library for IBM System/360 computers.
The algorithm sets $a = 65539, m = 2^{31}$ and $c=0$.
Lets implement it in a simple python-function.
%% Cell type:code id: tags:
``` python
def RANDU_generator(size=1, seed=1):
# 0x80000000 is a "smarter" way to write 2^31 (in python powers are written as '**', ie. 2**3).
# we also hash the seed to enable non-iteger seeds.
seed = hash(seed) % 0x80000000
# create an empty list to store created samples
samples = []
for i in range(size):
sample = ((65539 * seed) % 0x80000000)
# store the samples as the seed for the next iteration
seed = sample
# rescale the sample to the unit-interval [0, 1]
# and append it to the list.
samples.append(sample / 0x80000000)
return samples
```
%% Cell type:markdown id: tags:
Let's see if the generated numbers do, indeed, seem uniform. First, let's consider histograms for different numbers of "samples".
%% Cell type:code id: tags:
``` python
# The following "magic" command enabled the "widget" backend of the matplotlib library.
# The command makes plots interactive in jupyter notebooks
This looks pretty good. If we were to check, for instance, the correlation between subsequent samples, and the period length, we would also not find issues.
Let's see if we can create a 2D uniform distribution.
# Most mathematical operations just work as expected with numpy arrays if the arrays have consistent shapes: we can add or substract them: np.array(a) +/- np.array(b), for instance, or compute their square (np.array(a)**2).
# The key advantage of numpy arrays that we need less `for`-loops in python. Python `for`-loops are notoriously slow.
A 3D scatter plot of the samples also appears to look ok. However, what happens if we turn the camera?
%% Cell type:code id: tags:
``` python
%matplotlib widget
fig = plt.figure()
# here we create a 3D plot. You will likely not need to do so again in this data lab. 3D plots in matplotlib are a little bit annoying and, sadly, rarely look "good".
Oh no - look at the correlation! The points are only distributed on a number of parallel planes - 15, in fact.
These correlations are now also present in any scientific work that made use of the RANDU-algorihtm, often unknowingly, as the algorithm was shipped by standard FORTRAN compilers of the time.
The general problem outlined here (plane-like correlation in some high-dimensional representations of the random numbers) is shared by all LCG's. This was known early in the development of modern computers/programming languages (Marsaglia, 1968). However, better choices can achieve much higher numbers of planes that yield random-enough results for many applications and more elaborate algorithms have been designed since. Still, even today, it is sometimes found that very popular random number generators exhibit similar flaws.
An important lesson is also that for us, as data analysts, implementing core functionality such as uniform random number generation, is not part of our job-description, nor our area of expertise.
Thus, we should try to make use of existing software that has been used, and checked, by many scientists and mathematicians, possibly and probably smarter than us.
Still, it is good to keep in mind that all pseudo random number generators do not yield true randomness and will, on some level, have such issues.
Numpy uses a different kind of (pseudo) random generator by default. Look [here](https://numpy.org/doc/stable/reference/random/bit_generators/pcg64.html#numpy.random.PCG64) for more details.
%% Cell type:code id: tags:
``` python
# again we define a seed to make our results reproduceable
seed = 1
# "default_rng" is the default random number generator of numpy
# All stochastic methods that make use of this RNG will now be defined by this seed
# here we create a 3D plot. You will likely not need to do so again in this data lab. 3D plots in matplotlib are a little bit annoying and, sadly, rarely look "good".