Probability Integral Transform & Quantile Function Theorem

Introduction

In this brief blog post we present simple proofs of two important theorems: the Probability Integral Transformation and the Quantile Function Theorem. Both are fundamental results in statistics, computational mathematics, machine learning and beyond.

We shed light on the background of these two well-known theorems. In particular, we investigate the “why-it-works” question. To this end, we take a closer look at the difference between continuous and discrete distributions. We furthermore demonstrate why the Probability Integral Transformation Theorem does, in general, not work for discrete distributions. In addition, we illustrate under what circumstances a discrete distribution at least approximately reproduces the assertion of the Probability Integral Transform Theorem, providing heuristics and illustrations along the way.

Let us first state the actual theorems in the next section. Afterwards, we are going to illustrate both and finally we will also prove them. The proofs provided in this blog post are based on results of Angus [1.].

Theorems

Let X be a real-valued random variable defined on a probability space (\Omega, \mathcal{A}, \mathbb{P}). Let the distribution function F: \mathbb{R} \rightarrow [0,1] of X be defined by x \mapsto F(x):= \mathbb{P}(\{\omega: \ X(\omega)\leq x\}) for x\in \mathbb{R}.

Theorem I (Probability Integral Transformation):
If X has a continuous distribution function F, then the random variable Y:=F(X) is distributed according to \text{Uniform}(0,1).

\square

The second theorem that we are going to prove is as follows.

Theorem II (Quantile Function):
Let F be a distribution function. If F^{-1}: \ ]0,1[ \rightarrow \mathbb{R} is defined by F^{-1}(y)=\inf\{ x \ | \ F(x)\geq y \} for 0<y<1 and U is distributed according to \text{Uniform}(0,1), then X:=F^{-1}(U) has distribution function F.

\square

But why are both theorems so important?

The Quantile Function Theorem is of utmost importance whenever simulations need to be performed. Many statistical software packages apply this theorem to generate samples from probability distributions. Besides that, distribution transformations are the basis of statistical testing, an important field within frequentist statistics. The Probability Integral Transformation is also an important basis for the definition and interpretation of copulas.

Wrapping it up, we can state that both theorems are fundamental results applied in many important areas, and it is definitely worth understanding them. To this end, we will first try to understand what both theorems actually do with a hands-on approach.

Illustrations

Probability Integral Transformation

Let us draw a large sample from a standard normally distributed random variable X \sim \text{Normal}(0,1). We can use a computer and R, for instance, to perform this kind of task.

X <- rnorm(n=10^6, mean=0, sd=1)
hist(X)

A simulated sample of a standard normally distributed random variable X is shown in Fig. 1.

Fig. 1: Histogram of Standard Normal Distribution

If we now apply the corresponding distribution function F:\mathbb{R}\rightarrow [0,1] of the standard normal distribution to the output of the random variable X, we get Y=F(X), which is governed by the uniform distribution on [0,1], as we can see in the histogram in Fig. 2.

Again, we use a computer and R to perform also the second step.

Y <- pnorm(X, mean=0, sd=1)
hist(Y)

Realize that the range of any distribution function F is contained in [0,1], which is exactly the set of values a random variable governed by \text{Uniform}(0,1) can take.

Fig. 2: Histogram of Uniform Distribution on [0,1] generated by X\sim \text{Normal}(0,1) and its distribution function F, i.e. Y=F(X)

The demonstrated effect is not a coincidence but a consequence of the Probability Integral Transform Theorem. That is, we could take X to be any continuous random variable (i.e. one with a continuous distribution function) and the result would always be the same.

For instance, let us do this again with the exponential distribution and with the Cauchy distribution. Both have continuous distribution functions:

X <- rexp(n=10^6, rate = 1)
hist(X)
Y <- pexp(X, rate=1)
hist(Y)
X <- rcauchy(n=10^6, location=0, scale=1)
hist(X)
Y <- pcauchy(X, location=0, scale=1)
hist(Y)

You will notice that similar outputs as in Fig. 2 are generated, which indicates that Y=F(X)\sim \text{Uniform}(0,1).
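
If we do not want to rely on eyeballing histograms alone, a quick heuristic check is a Kolmogorov–Smirnov test of the transformed sample against the standard uniform distribution, using the base R function ks.test (a sketch, shown here for the exponential case):

# Sketch: compare the transformed sample against Uniform(0,1).
# A large p-value is consistent with Y being uniformly distributed.
X <- rexp(n=10^4, rate=1)
Y <- pexp(X, rate=1)
ks.test(Y, "punif")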

However, if we do the same exercise with a discrete (i.e. not continuous) random variable such as the Binomial distribution X\sim \text{Binomial}(n, p) (n = number of independent Bernoulli trials, p = success probability), the resulting histogram is quite different. The discrete distribution function of X is denoted by G_X.

X <- rbinom(n=10^6, size=2, prob=0.5)
hist(X)
Y <- pbinom(X, size=2, prob=0.5)
hist(Y)

Fig. 3: Histogram of Y=G_X(X) for X\sim \text{Binomial}(n, p) with n=2 independent Bernoulli trials and p=0.5

Without continuity, the controlled correspondence between an element x\in \mathbb{R} of the domain and its image F(x)=:u\in [0,1] is lost. Let us take a closer look at this.

The discrete distribution function G_X of the discrete random variable X is as follows:

Fig. 4: Plot of discrete distribution function G_X of X

In the next table we have listed all four (i.e. 2^2) possible outcomes:

1. Bernoulli Trial | 2. Bernoulli Trial
Success            | Success
Failure            | Failure
Success            | Failure
Failure            | Success
Tab. 1: Possible outcomes of two independent Bernoulli trials with p=0.5

Failure in both trials occurs in one out of four possible cases, i.e. with probability 0.25. There are two ways to get one success and one failure, i.e. probability 0.5. Two successes in a row occur with probability 0.25. Putting these pieces together, one ends up with the discrete distribution function of X shown in Fig. 4.
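
These probabilities can also be obtained directly in R from the probability mass function and the cumulative distribution function of the binomial distribution:

dbinom(0:2, size=2, prob=0.5)   # point probabilities: 0.25 0.50 0.25
pbinom(0:2, size=2, prob=0.5)   # accumulated probabilities: 0.25 0.75 1.00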

Let us consider how this is reflected in the R simulation. To this end, we use the table command in R:

X <- rbinom(n=10^6, size=2, prob = 0.5)
table(X)
Y <- pbinom(X, size=2, prob = 0.5)
table(Y)

We can see that the R variable X contains about 25% 0s, about 50% 1s and about 25% 2s. These values are then entered into the distribution function G_X, which translates them into accumulated probabilities p\in \{0.25, 0.75, 1\}. Hence, we get about 25% of p=0.25, about 50% of p=0.75 and about 25% of p=1. This matches exactly the histogram in Fig. 3.

Ultimately, the above example is not uniformly distributed since the increases of the discrete distribution function are not even (i.e. ‘uneven jumps’). In our example we have an increase of 0.25 twice and an increase of 0.5 once.

In continuous distribution functions these increases behave in a controlled manner: the distribution function balances the likelihood of events with the corresponding pace of the accumulation. For instance, the simulated standard normal sample shown in Fig. 1 contains only a few values in the tail left of -3. This, however, is compensated by the almost flat curve of the standard normal distribution function around -3, which has the effect that a wider range of the domain of X is needed to fill up the corresponding probability bucket. Note that in Fig. 2, for instance, each probability bucket comprises \frac{1}{20} of the mass, since 20 buckets have been used to cluster the entire probability mass of 1.
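
As a small numerical illustration of this compensation (using the quantile function qnorm, which we will meet again below), compare the width of the x-interval that fills one probability bucket of size 1/20 out in the tail with the width of a bucket near the center:

qnorm(0.10) - qnorm(0.05)   # x-range needed for the bucket (0.05, 0.10]: approx. 0.36
qnorm(0.55) - qnorm(0.50)   # x-range needed for the bucket (0.50, 0.55]: approx. 0.13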

The binomial distribution converges towards a normal distribution if we increase the number of trials (in R, this parameter is called size). This implies that the ‘uneven jumps’ of a discrete binomial distribution become less and less meaningful. Indeed, if we repeat the simulation as follows, the result looks quite promising:

size <- 10^5
X <- rbinom(n=10^6, size=size, prob = 0.5)
hist(X)
Y <- pbinom(X, size=size, prob = 0.5)
hist(Y)

Ultimately, all continuous distributions need to be discretized due to the limitations of a computer. So, if we vary the size from 10^3 to 10^5, we get the histograms shown in Fig. 5 – Fig. 7. Apparently, the convergence of the binomially distributed X towards the normal distribution implies the convergence of Y=G_X(X) towards the standard uniform distribution:

Fig. 5: Histogram of Y = G_X(X) with size set to 10^3
Fig. 6: Histogram of Y = G_X(X) with size set to 10^4
Fig. 7: Histogram of Y = G_X(X) with size set to 10^5

That is, if a discrete distribution is either close enough to a continuous distribution or has even jumps (i.e. is discrete uniform), then the corresponding histogram of Y:=G_X(X) will show a pattern similar to the one described in Theorem I.
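
As a minimal check of the ‘even jumps’ case, consider a discrete uniform distribution on \{1, \dots, 100\}, whose distribution function G(k)=k/100 increases by the same amount at every jump (the choice of 100 support points is ours, purely illustrative):

# Discrete uniform on {1, ..., 100}: G(k) = k/100, so Y = G(X) = X/100.
X <- sample(1:100, size=10^6, replace=TRUE)
Y <- X / 100
hist(Y)   # looks approximately flat, i.e. close to Uniform(0,1)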

Quantile Function Theorem

This time we do not end up with a uniform sample but start with one. Hence, let us now draw a large uniformly distributed sample U \sim \text{Uniform}(0,1). We use R to perform the simulation.

U <- runif(n=10^6, min=0, max=1)
hist(U)

Starting off with a uniformly distributed sample U, we need to “invert” the steps performed to illustrate the Probability Integral Transform Theorem. That is, we now apply the generalized inverse distribution function of the intended distribution to U \sim \text{Uniform}(0,1).

If we would like to obtain the standard normal distribution, for instance, U needs to be fed into the following (generalized) inverse F^{-1} of the standard normal distribution, shown in Fig. 8:

Fig. 8: Inverse F^{-1} of distribution function F of Standard Normal distribution

Take a while and think about what the function F^{-1} actually does: it takes a value in (0,1) (i.e. a probability) and assigns a real value to it. For instance, the probability 0.5 is assigned to the real value 0. Probabilities close to 0 are mapped to large negative values, while probabilities close to 1 are mapped to large positive values. That is, the curve is very steep in the tails: a small increase in probability corresponds to a large step along the real line, which makes a lot of sense given the thin tails of a normal distribution.
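
A few evaluations of qnorm, the quantile function of the standard normal distribution in R, make this concrete:

qnorm(c(0.001, 0.025, 0.5, 0.975, 0.999), mean=0, sd=1)
# approx. -3.09 -1.96  0.00  1.96  3.09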

In R we need to take the following steps:

X <- qnorm(U, mean=0, sd=1)
hist(X)

The histogram of X=F^{-1}(U), as shown in Fig. 9, is governed by the standard normal distribution. Just as desired.

Fig. 9: Histogram of a sample X = F^{-1}(U) \sim \text{Normal}(0,1)

Please note, however, that the inverse distribution function F^{-1} is not necessarily the ordinary inverse function: distribution functions are in general not strictly increasing but only non-decreasing. For further details we highly recommend P. Embrechts’ & M. Hofert’s paper ‘A note on generalized inverses’. The difference between inverse and generalized inverse functions will also play a key role in the proof of the Probability Integral Transformation Theorem.
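
To make the definition F^{-1}(y)=\inf\{ x \ | \ F(x)\geq y \} tangible, here is a minimal sketch that approximates the generalized inverse of an arbitrary distribution function on a grid; the helper gen_inverse and the grid bounds are our own, purely illustrative choices:

# Minimal sketch: approximate inf{x : F(x) >= u} on a fixed grid.
# 'cdf' is any distribution function, e.g. pnorm; the bounds are assumptions.
gen_inverse <- function(cdf, u, lower = -10, upper = 10, n = 10^5) {
  xs <- seq(lower, upper, length.out = n)          # evaluation grid
  Fs <- cdf(xs)                                    # distribution function on the grid
  sapply(u, function(p) xs[min(which(Fs >= p))])   # smallest grid point with F >= p
}
gen_inverse(pnorm, c(0.025, 0.5, 0.975))   # close to qnorm(c(0.025, 0.5, 0.975))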

In addition, Theorem II (Quantile Function) does NOT require the distribution function to be continuous. At first sight, this might seem strange since the Quantile Function Theorem is in some sense an inversion of the Probability Integral Transformation.

Let us test the Binomial distribution example (as explained above) with respect to the Quantile Function Theorem: we draw a large uniformly distributed sample U \sim \text{Uniform}(0,1) and put it into the generalized inverse distribution function G_X^{-1}: \ ]0,1[ \rightarrow \mathbb{R} of \text{Binomial}(n, p) with n=2 and p=0.5. The graph of the corresponding quantile function is illustrated in Fig. 10.

Fig. 10: Plot of quantile function of G_X with X \sim \text{Binomial}(2,0.5)

Again, let us take the time to think about the meaning of this generalized inverse (i.e. quantile function) of the \text{Binomial}(n=2, p=0.5) distribution function. It maps each probability u\in (0,1) to an integer: every u \in (0, 0.25] is mapped to the integer 0, every u \in (0.25, 0.75] to the integer 1 and every u \in (0.75, 1) to the integer 2.
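
This mapping can be read off directly from qbinom in R:

qbinom(c(0.10, 0.25, 0.30, 0.75, 0.80), size=2, prob=0.5)
# returns 0 0 1 1 2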

Let us do this in R:

U <- runif(n=10^6, min=0, max=1)
hist(U)
X <- qbinom(U, size=2, prob = 0.5)
hist(X)

The result depicted in Fig. 11 is as expected.

Fig. 11: Histogram of X=G_X^{-1}(U) that is distributed according to \text{Binomial}(2,0.5)

Apparently, we have obtained the desired Binomial distribution by first generating a standard uniform sample and then applying the quantile function to it.
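
A short comparison of the empirical frequencies with the theoretical binomial probabilities confirms this:

table(X) / length(X)              # approx.  0.25 0.50 0.25
dbinom(0:2, size=2, prob=0.5)     # exactly  0.25 0.50 0.25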

Proofs

This section is based on the paper [1.] by John E. Angus. The problem with the usual attempts to prove the theorem is that it is often not made clear what happens when the generalized inverse of the distribution function is applied. For instance, the proof provided on Wikipedia reads as follows:

<<Given any random continuous variable X, define Y=F_X(X). Then:

\begin{align*}
F_Y(y) &= P(Y \leq y) \\
&= P(F_X(X) \leq y) \\
&= P(X \leq F_X^{-1}(y)) \\
&= F_X(F_X^{-1}(y)) \\
&= y
\end{align*}

F_Y is just the CDF of a Uniform(0,1) random variable. Thus, Y has a uniform distribution on the interval [0,1].>> (Wikipedia, retrieved Nov. 2020)

The cited “proof” above seems quite straightforward. However, it is not clarified what is meant by applying the “inverse” F_X^{-1}(y). Lemma 1 in the next subsection will make this step precise.

Probability Integral Transformation

The following lemma is the key to the proof of Theorem I. The principal idea is that the sample space can be decomposed into \Omega = \{\omega: X(\omega) \leq x\} \sqcup \{\omega: X(\omega) > x\}, where \sqcup stands for the disjoint union.

In addition, recall that a distribution function is monotone (non-decreasing): x \leq y \Rightarrow F(x) \leq F(y). In particular, X(\omega) \leq x implies F(X(\omega)) \leq F(x).

Lemma 1:
Let X have a distribution function F. Then for all real x\in \mathbb{R} the following holds:

\ \  \mathbb{P}(\{F(X) \leq F(x)\} )= F(x).

Proof of Lemma 1:
Decompose the event \{\omega : F(X(\omega)) \leq F(x)\} as follows:

\begin{align*}
&\{\omega: F(X(\omega)) \leq F(x) \} \\
&= \left[ \{\omega: F(X(\omega)) \leq F(x) \} \cap \{\omega: X(\omega) \leq x \} \right] \ \sqcup \\
&\quad \ \left[ \{\omega: F(X(\omega)) \leq F(x) \} \cap \{\omega: X(\omega) > x \} \right]
\end{align*}

Note that \{\omega: X(\omega) \leq x\} \subseteq \{\omega: F(X(\omega)) \leq F(x)\} since F is monotone, so the first intersection equals \{\omega: X(\omega) \leq x\}. For the second intersection, monotonicity yields F(X(\omega)) \geq F(x) whenever X(\omega) > x; hence, on the event \{\omega: X(\omega) > x\}, the condition F(X(\omega)) \leq F(x) can only be fulfilled with equality, i.e. \{\omega: X(\omega) > x\} \cap \{\omega: F(X(\omega)) \leq F(x)\} = \{\omega: X(\omega) > x\} \cap \{\omega: F(X(\omega)) = F(x)\}. Considering these facts we can derive:

(1)   \begin{align*}
&\{\omega: F(X(\omega)) \leq F(x) \} \\
&= \{\omega: X(\omega) \leq x \} \ \sqcup \ \left[ \{ \omega: X(\omega) > x \} \cap \{ \omega: F(X(\omega)) = F(x) \} \right]
\end{align*}

Taking probabilities in (1), the assertion follows since the event \{ \omega: X(\omega) > x \} \cap \{ \omega: F(X(\omega)) = F(x) \} has probability zero. Indeed, if X(\omega) > x and F(X(\omega)) = F(x), then F is constant on the interval [x, X(\omega)]; hence this event is contained in an interval over which F does not increase and therefore carries no probability mass.

\square

Now, let us prove the Probability Integral Transformation Theorem.

Proof Theorem I (Probability Integral Transformation):
Let u\in (0,1). Since F is continuous with \lim_{x\to -\infty}F(x)=0 and \lim_{x\to \infty}F(x)=1, the intermediate value theorem yields a real x such that F(x)=u. Then, by Lemma 1, \mathbb{P} ( \{ \omega: Y(\omega) \leq u \} ) = \mathbb{P}( \{ \omega: F( X(\omega) )  \leq  F(x) \} ) = F(x) = u. This implies that Y is distributed according to \text{Uniform}(0,1).

\square

Quantile Function Theorem

Proof Theorem II (Quantile Function Theorem):
Recall that F^{-1}(u)= \inf\{ t \ | \ F(t) \geq u \} for u\in (0,1). The key observation is: for any x\in \mathbb{R} and any u\in (0,1), F(x)\geq u if, and only if, x\geq F^{-1}(u).

Indeed, suppose that x \geq F^{-1}(u). The set \{ t \ | \ F(t) \geq u \} is an interval of the form [F^{-1}(u), \infty): it is unbounded to the right since F is non-decreasing, and it contains its left-hand endpoint F^{-1}(u) since F is right-continuous. Hence x belongs to this set, i.e. F(x) \geq u.

Conversely, suppose that u \leq F(x); then x \geq \inf\{ t \ | \ F(t) \geq u \} = F^{-1}(u). It now follows that \mathbb{P}( \{ F^{-1}(U) \leq x \} ) = \mathbb{P}( \{ U \leq F(x) \} ) = F(x), where the last equality holds because U \sim \text{Uniform}(0,1). This completes the proof.

\square

Literature

1.
Angus, J. E. The Probability Integral Transform and Related Results. SIAM Review 36, 652–654 (1994).