- Sampling from Known Distributions
- A primer on basic sampling methods for simulations and other statistical applications.
- Oct. 21, 2024
- Interactive
Sampling from a probability distribution is an essential procedure in data modelling and simulation. For example, we may want to simulate foot traffic in a business in order to observe patterns and better allocate resources. Generating samples is straightforward if we have a model of the data we intend to simulate (i.e., we know its distribution). In this post, we review some basic sampling methods.
Inverse Transform Sampling
To simplify our approach, let be a probability density function (PDF) over the real line If our target PDF is continuous over and its cumulative distribution function (CDF) is strictly increasing, then we have favourable conditions to apply inverse transform sampling. Define the CDF with: Recall that if is a real number, evaluating the CDF at gives the probability that a random variable takes on a value less than or equal to
Given these assumptions, our CDF is one-to-one and onto, which means that we can uniquely define its inverse at every point in . Concretely, for any , let , where is a probability, and so we have . This inverse function, also known as a quantile function, is the key to this sampling method.
Intuitively, inverse transform sampling (ITS) follows from the fact that the uniform distribution over can be mapped (or transformed) to our target distribution via . That is, once we know the quantile function all we need is a way to uniformly generate samples over , then we apply the mapping to generate samples according to our target distribution.
Simulation: ITS
This simulation uses inverse transform sampling to generate samples according to a distribution defined by a mixture of Gaussians. A Gaussian (normal) distribution has a PDF parameterized by a mean and standard deviation given by and we can define a finite mixture as a weighted (convex) combination of Gaussians with means and standard deviations as where for each the weight is an element of and we have .
Since integration is a linear operator, we can derive a CDF for our mixture distribution as a weighted sum of Gaussian CDFs: where is the error function.
Unfortunately, the inverse of our mixture cannot be evaluated exactly, but we can use numerical methods to approximate this value. In this case, given some probability we can use Halley's root-finding algorithm to solve for such that , which satisfies ; note that this approximation gets closer to the true value given more iterations of the root-finding algorithm.