Sampling from Known Distributions
A primer on basic sampling methods for simulations and other statistical applications.
Oct. 21, 2024

Sampling from a probability distribution is an essential procedure in data modelling and simulation. For example, we may want to simulate foot traffic in a business in order to observe patterns and better allocate resources. Generating samples is straightforward if we have a model of the data we intend to simulate (i.e., we know its distribution). In this post, we review some basic sampling methods.

Inverse Transform Sampling

To simplify our approach, let $f$ be a probability density function (PDF) over the real line $\mathbb{R}$. If our target PDF is continuous over $\mathbb{R}$ and its cumulative distribution function (CDF) is strictly increasing, then we have favourable conditions to apply inverse transform sampling. Define the CDF as $F(x) = \int_{-\infty}^{x} f(t)\, dt$. Recall that for a real number $x$, evaluating the CDF at $x$ gives the probability that a random variable distributed according to $f$ takes on a value less than or equal to $x$.

Given these assumptions, our CDF is one-to-one and onto, which means that we can uniquely define its inverse $F^{-1}$ at every point in $[0, 1]$. Concretely, for any $x$, let $F(x) = y$, where $y$ is a probability, and so we have $F^{-1}(y) = x$. This inverse function, also known as a quantile function, is the key to this sampling method.
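For example, the exponential distribution with rate $\lambda > 0$ has PDF $f(x) = \lambda e^{-\lambda x}$ and CDF $F(x) = 1 - e^{-\lambda x}$ for $x \geq 0$; solving $y = F(x)$ for $x$ gives the closed-form quantile function $F^{-1}(y) = -\ln(1 - y)/\lambda$.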

Intuitively, inverse transform sampling (ITS) follows from the fact that the uniform distribution over $[0, 1]$ can be mapped (or transformed) to our target distribution via $F^{-1}$. That is, once we know the quantile function, all we need is a way to generate samples uniformly over $[0, 1]$; we then apply the mapping to produce samples distributed according to our target distribution.
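As a concrete illustration, here is a minimal Python sketch of ITS for the exponential distribution above, where the quantile function is available in closed form (the function name and parameters are illustrative):

```python
import numpy as np

def sample_exponential(rate, n, seed=None):
    """Draw n samples from Exp(rate) via inverse transform sampling."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, size=n)  # step 1: uniform samples over [0, 1]
    return -np.log(1.0 - u) / rate     # step 2: map through F^{-1}(y) = -ln(1 - y) / rate

samples = sample_exponential(rate=2.0, n=10_000)
print(samples.mean())  # should be close to the true mean 1 / rate = 0.5
```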

Simulation: ITS


Enable javascript to load interactive content.

This simulation uses inverse transform sampling to generate samples according to a distribution defined by a mixture of Gaussians. A Gaussian (normal) distribution has a PDF parameterized by a mean $\mu$ and standard deviation $\sigma$ given by $$p(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}},$$ and we can define a finite mixture as a weighted (convex) combination of $n$ Gaussians with means $\mu_1, \mu_2, \ldots, \mu_n$ and standard deviations $\sigma_1, \sigma_2, \ldots, \sigma_n$ as $$f(x) = \sum_{i=1}^{n} w_i \cdot p(x, \mu_i, \sigma_i),$$ where for each $i = 1, 2, \ldots, n$ the weight $w_i$ is an element of $[0, 1]$ and we have $\sum_{i=1}^{n} w_i = 1$.
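As a sketch of how this definition translates to code (using NumPy; the function names are illustrative), the mixture PDF can be evaluated directly as the weighted sum:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """PDF of a Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def mixture_pdf(x, weights, mus, sigmas):
    """Weighted (convex) combination of Gaussian PDFs."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))
```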

Since integration is a linear operator, we can derive a CDF $F$ for our mixture distribution as a weighted sum of Gaussian CDFs: $$F(x) = \sum_{i=1}^{n} \frac{w_i}{2} \cdot \left[1 + \operatorname{erf}\!\left(\frac{x - \mu_i}{\sigma_i \sqrt{2}}\right)\right],$$ where $\operatorname{erf}$ is the error function.
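Continuing the sketch, SciPy exposes the error function as scipy.special.erf, so the mixture CDF follows the same pattern:

```python
import numpy as np
from scipy.special import erf

def mixture_cdf(x, weights, mus, sigmas):
    """CDF of the Gaussian mixture: a weighted sum of Gaussian CDFs."""
    return sum(0.5 * w * (1.0 + erf((x - m) / (s * np.sqrt(2.0))))
               for w, m, s in zip(weights, mus, sigmas))
```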

Unfortunately, the inverse $F^{-1}$ of our mixture CDF cannot be evaluated in closed form, but we can approximate it with numerical methods. In this case, given some probability $y$, we can use Halley's root-finding algorithm to solve for the $x$ satisfying $F(x) - y = 0$, which is exactly $F^{-1}(y) = x$; note that this approximation gets closer to the true value with more iterations of the root-finding algorithm.
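Below is a minimal sketch of this inversion, reusing numpy (as np), gaussian_pdf, mixture_pdf, and mixture_cdf from the snippets above. It applies Halley's update $x \leftarrow x - \frac{2 g g'}{2 (g')^2 - g\, g''}$ to $g(x) = F(x) - y$, whose derivatives are the mixture PDF and its derivative; a production implementation would add convergence checks and bracketing safeguards.

```python
def mixture_pdf_deriv(x, weights, mus, sigmas):
    """Derivative of the mixture PDF (i.e., the second derivative of the CDF)."""
    return sum(w * (-(x - m) / s**2) * gaussian_pdf(x, m, s)
               for w, m, s in zip(weights, mus, sigmas))

def inverse_mixture_cdf(y, weights, mus, sigmas, iters=20):
    """Approximate F^{-1}(y) via Halley's method on g(x) = F(x) - y."""
    x = sum(w * m for w, m in zip(weights, mus))  # start at the mixture mean
    for _ in range(iters):
        g = mixture_cdf(x, weights, mus, sigmas) - y
        g1 = mixture_pdf(x, weights, mus, sigmas)        # g'(x)  = f(x)
        g2 = mixture_pdf_deriv(x, weights, mus, sigmas)  # g''(x) = f'(x)
        x -= (2.0 * g * g1) / (2.0 * g1**2 - g * g2)     # Halley update
    return x

# ITS from the mixture: map uniform draws through the approximate quantile function.
rng = np.random.default_rng(0)
weights, mus, sigmas = [0.3, 0.7], [-2.0, 1.5], [0.5, 1.0]
samples = [inverse_mixture_cdf(u, weights, mus, sigmas) for u in rng.uniform(size=1_000)]
```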