Updated lecture notes

fa949b3e · Michael Jasper Martins · 4c0c088f · fa949b3e
Commit fa949b3e authored 4 months ago by Michael Jasper Martins
--- a/Lecture_rngs.ipynb
+++ b/Lecture_rngs.ipynb
@@ -12,12 +12,13 @@
    "However, (current) computers are deterministic. So, how can we create randomness? \n",
    "\n",
    "Can you think of ideas to create true randomness?\n",
+    "1. You can sample a noisy measurement (temperature,...)\n",
    "\n",
    "Methods that create (sequences) of true random numbers have three problems:\n",
    "\n",
-    "1. \n",
-    "2. \n",
-    "3. \n",
+    "1. The methods are slow.\n",
+    "2. The methods are non-repeatable, which spells trouble for debugging.\n",
+    "3. The quality of the methods depends on our understanding of the physical process.\n",
    "\n",
    "For these reasons, in most cases where programmers or scientists need random numbers, we use so called PseudoRandomNumberGenerators (short: PRNGs).\n",
    "PRNGs are algorithms that, given a chosen initial state (seed), produce a deterministic sequence of numbers that behave as if they were drawn from the target distribution.\n",
 %% Cell type:markdown id: tags:
  
 ## (Pseudo) Random Number Generation
  
 Central to data-analysis is the modelling, simulation, and examination of noisy data. Noise is, by definition, random.
 Also, many methods used for the examination of our data models are computationally more efficient with a stochastic ansatz.
 Thus, our ability to simulate noisy data and its examination will depend on our ability to create (pseudo) randomness within our computer codes.
 However, (current) computers are deterministic. So, how can we create randomness?
  
 Can you think of ideas to create true randomness?
+1. You can sample a noisy measurement (temperature,...)
  
 Methods that create (sequences) of true random numbers have three problems:
  
-1.
-2.
-3.
+1. The methods are slow.
+2. The methods are non-repeatable, which spells trouble for debugging.
+3. The quality of the methods depends on our understanding of the physical process.
  
 For these reasons, in most cases where programmers or scientists need random numbers, we use so called PseudoRandomNumberGenerators (short: PRNGs).
 PRNGs are algorithms that, given a chosen initial state (seed), produce a deterministic sequence of numbers that behave as if they were drawn from the target distribution.
 The first step to be able to reproduce more complicated probability distributions is the ability to reproduce the simplest one: the uniform distribution.
 Many algorithms exist. However, as we will see in this notebook, not all algorithms are created equally well.
  
 ### Linear congruential random number generators
  
 One of the oldest and simplest methods to generate pseudo-random numbers are so-called linear congruential generators that recursively perform the operation
 $$ X_{n+1} = (a X_n + C)\mod m$$
 where $0< m$ it the "modulus", $0<a<m$ is a "multiplier", $c$ is a "increment" and $X_0$ is the seed. Naturally, if $X_{n + i} =  X_{n}$ for any $i$, the sequence will repeat.
 For appropriate choices of the above variables, the recurrance will have very long (and known) periods, a good uniform distribution in the range $[0, m]$, and very low correlation of subsequent samples.
 However, not all linear congruential generators are created equal.
  
 ### The RANDU algorithm
  
 Lets consider the RANDU algorithm. The algorithm was very popular in the 1960s and 1970s and was used, for instance, by the IBM Scientific Subroutine Library for IBM System/360 computers.
 The algorithm sets $a = 65539, m = 2^{31}$ and $c=0$.
 Lets implement it in a simple python-function.
  
 %% Cell type:code id: tags:
  
 ``` python
 def RANDU_generator(size=1, seed=1):
    # 0x80000000 is a "smarter" way to write 2^31 (in python powers are written as '**', ie. 2**3).
    # we also hash the seed to enable non-iteger seeds.
    seed = hash(seed) % 0x80000000
  
    # create an empty list to store created samples
    samples = []
    for i in range(size):
        sample = ((65539 * seed) % 0x80000000)
        # store the samples as the seed for the next iteration
        seed = sample
  
        # rescale the sample to the unit-interval [0, 1]
        # and append it to the list.
        samples.append(sample / 0x80000000)
    return samples
 ```
  
 %% Cell type:markdown id: tags:
  
 Let's see if the generated numbers do, indeed, seem uniform. First, let's consider histograms for different numbers of "samples".
  
 %% Cell type:code id: tags:
  
 ``` python
 # The following "magic" command enabled the "widget" backend of the matplotlib library.
 # The command makes plots interactive in jupyter notebooks
 %matplotlib widget
  
 import matplotlib.pyplot as plt
  
 fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(9,3), sharey=True)
  
 for i, size in enumerate([1000,10000,1000000]):
    samples = RANDU_generator(size=size, seed=1)
    axes[i].hist(samples, bins=100, density=True, label=f"Draws: {size}")
    axes[i].set_xlabel("X")
    axes[i].legend()
  
 axes[0].set_ylabel("Density")
 plt.show()
 ```
  
 %% Output
  

  
 %% Cell type:markdown id: tags:
  
 This looks pretty good. If we were to check, for instance, the correlation between subsequent samples, and the period length, we would also not find issues.
 Let's see if we can create a 2D uniform distribution.
  
 %% Cell type:code id: tags:
  
 ``` python
 import numpy as np
 ndim = 2
 size = 10000
  
 # first, we simple draw a 1D sequence of samples
 samples = RANDU_generator(size=int(size * ndim), seed=1)
  
 # then, we turn the samples into a numpy array. This is not neccessary, we could do all we need with native python,
 # however, numpy is (typically) the first import of any data-analysis script, as it provides many mathematical methods and constants,
 # for instant pi (np.pi), functions such as exponential, logarithm, sine and cosine (np.exp(), np.log(), np.sin(), np.cos()), and *much* better/faster
 # support for large arrays of data.
 samples = np.array(samples)
 print("samples: ", samples, "\narray shape:", samples.shape, "\narray length:",len(samples))
  
 # It is often useful to change the shape of numpy arrays. This can be done with the reshape method:
 samples = samples.reshape((int(size), ndim))
 print("Reshaped array:")
 print("samples: ", samples, "\narray shape:", samples.shape, "\narray length:",len(samples))
  
 # Most mathematical operations just work as expected with numpy arrays if the arrays have consistent shapes: we can add or substract them: np.array(a) +/- np.array(b), for instance, or compute their square (np.array(a)**2).
 # The key advantage of numpy arrays that we need less `for`-loops in python. Python `for`-loops are notoriously slow.
 ```
  
 %% Output
  
    samples:  [3.05189751e-05 1.83109660e-04 8.23987182e-04 ... 5.25819865e-01
     7.08105346e-01 5.16253293e-01]
    array shape: (20000,)
    array length: 20000
    Reshaped array:
    samples:  [[3.05189751e-05 1.83109660e-04]
     [8.23987182e-04 3.29593616e-03]
     [1.23597323e-02 4.44949684e-02]
     ...
     [4.96117560e-01 4.87469663e-02]
     [8.27423760e-01 5.25819865e-01]
     [7.08105346e-01 5.16253293e-01]]
    array shape: (10000, 2)
    array length: 10000
  
 %% Cell type:code id: tags:
  
 ``` python
 import matplotlib.pyplot as plt
  
 fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(9,3))
  
 samples_x = samples[:, 0]
 samples_y = samples[:, 1]
  
 axes[0].hist(samples_x, bins=100, density=True)
 axes[0].set_xlabel("x")
 axes[0].set_ylabel("density")
 axes[0].set_title("Histrogram")
 axes[1].hist(samples_x, bins=100, density=True)
 axes[1].set_xlabel("y")
 axes[1].set_ylabel("density")
 axes[1].set_title("Histrogram")
 axes[2].scatter(samples_x, samples_y, s=1)
 axes[2].set_xlabel("x")
 axes[2].set_ylabel("y")
 axes[2].set_title("Scatter plot")
 plt.tight_layout()
 plt.show()
 ```
  
 %% Output
  

  
 %% Cell type:markdown id: tags:
  
 Still, this looks quite good. Now lets see about $n_d = 3$, and consider how well they get distributed in space.
  
 %% Cell type:code id: tags:
  
 ``` python
 ndim = 3
 size = 10000
  
 # first, we again draw a 1D sequence of samples
 samples = RANDU_generator(size=int(size * ndim), seed=1)
 samples = np.array(samples)
 # and reshape them as desired
 samples = samples.reshape((size, ndim))
 ```
  
 %% Cell type:markdown id: tags:
  
 A 3D scatter plot of the samples also appears to look ok. However, what happens if we turn the camera?
  
 %% Cell type:code id: tags:
  
 ``` python
 %matplotlib widget
 fig = plt.figure()
 # here we create a 3D plot. You will likely not need to do so again in this data lab. 3D plots in matplotlib are a little bit annoying and, sadly, rarely look "good".
 ax = fig.add_subplot(projection="3d")
 ax.scatter(samples[:, 0], samples[:, 1], samples[:, 2], s=0.1)
 ax.set_xlabel("x")
 ax.set_ylabel("y")
 ax.set_zlabel("z")
 ax.zaxis.labelpad=-0.7
 plt.tight_layout()
 plt.show()
 ```
  
 %% Output
  

  
 %% Cell type:markdown id: tags:
  
 Oh no - look at the correlation! The points are only distributed on a number of parallel planes - 15, in fact.
 These correlations are now also present in any scientific work that made use of the RANDU-algorihtm, often unknowingly, as the algorithm was shipped by standard FORTRAN compilers of the time.
  
 The general problem outlined here (plane-like correlation in some high-dimensional representations of the random numbers) is shared by all LCG's. This was known early in the development of modern computers/programming languages (Marsaglia, 1968). However, better choices can achieve much higher numbers of planes that yield random-enough results for many applications and more elaborate algorithms have been designed since. Still, even today, it is sometimes found that very popular random number generators exhibit similar flaws.
  
 An important lesson is also that for us, as data analysts, implementing core functionality such as uniform random number generation, is not part of our job-description, nor our area of expertise.
 Thus, we should try to make use of existing software that has been used, and checked, by many scientists and mathematicians, possibly and probably smarter than us.
 Still, it is good to keep in mind that all pseudo random number generators do not yield true randomness and will, on some level, have such issues.
  
 Let's consider of-the-shelf software, namely (again) numpy.
 Numpy uses a different kind of (pseudo) random generator by default. Look [here](https://numpy.org/doc/stable/reference/random/bit_generators/pcg64.html#numpy.random.PCG64) for more details.
  
 %% Cell type:code id: tags:
  
 ``` python
 # again we define a seed to make our results reproduceable
 seed = 1
 # "default_rng" is the default random number generator of numpy
 # All stochastic methods that make use of this RNG will now be defined by this seed
 np.random.default_rng(seed)
  
 fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(9,3), sharey=True)
  
 for i, size in enumerate([1000,10000,1000000]):
    # lets draw samples using numpy
    samples = np.random.uniform(size=size)
    axes[i].hist(samples, bins=100, density=True, label=f"Draws: {size}")
    axes[i].set_xlabel("X")
    axes[i].legend()
  
 axes[0].set_ylabel("Density")
 plt.show()
 ```
  
 %% Output
  

  
 %% Cell type:code id: tags:
  
 ``` python
 ndim = 3
 size = 10000
  
 samples = np.random.uniform(size=(size, ndim))
  
 fig = plt.figure()
 # here we create a 3D plot. You will likely not need to do so again in this data lab. 3D plots in matplotlib are a little bit annoying and, sadly, rarely look "good".
 ax = fig.add_subplot(projection="3d")
 ax.scatter(samples[:, 0], samples[:, 1], samples[:, 2], s=0.1)
 ax.set_xlabel("x")
 ax.set_ylabel("y")
 ax.set_zlabel("z")
 ax.zaxis.labelpad=-0.7
 plt.tight_layout()
 plt.show()
 ```
  
 %% Output
  

  
 %% Cell type:markdown id: tags:
  
 This looks much better! Note that there are many more sophisticated methods to assess the quality of pseudo random number generators.
  
 %% Cell type:markdown id: tags: