In this brief blog post we present simple proofs for two very important theorems: the probability integral transformation and the quantile function theorem. Both theorems are rather important in statistics, computational math, machine learning and beyond.
We shed light on the background of these two well-known theorems. In particular, we investigate the “why-it-works” question. To this end, we take a closer look into the difference between continuous and discrete distributions. We furthermore demonstrate why the Probability Integral Transformation Theorem in general does not work for discrete distribution. In addition, we also illustrate under what circumstances a discrete distribution at least converges towards the assertion of the Probability Integral Transform Theorem. To this end, we provide heuristics and illustrations.
Let us first state the actual theorems in the next section. Afterwards, we are going to illustrate both and finally we will also prove them. The proofs provided in this blog post are based on results of Angus [1.].
Let be a real-valued random variable defined on a probability space . Let be defined by with be the distribution function of .
Theorem I (Probability Integral Transformation):
If has distribution function which is continuous, then the r.v. has the distribution function of .
The second theorem that we are going to prove is as follows.
Theorem II (Quantile Function):
Let be a distribution function . If is defined by , and has the distribution of , then has distribution function .
But why are both theorem so important?
The Quantile Function Theorem is of utter importance whenever simulations need to be performed. In almost all computer algebra systems this theorem is applied to generate samples of probability distributions. Beside that distribution transformations are the basis for any statistical testing, which is an important field within Frequentist’s statistics. The Probability Integral Transformation is also an important basis for the definition or interpretation of a copula.
Wrapping it up, we can state that both theorem are fundamental results that are applied in many important areas and it is definitely worth understanding them. To this end, we will first try to understand what both theorems are actually doing with a hands-on approach.
Probability Integral Transformation
X <- rnorm(n=10^6, mean=0, sd=1) hist(X)
A simulated sample of a standard normally distributed random variable is shown in the next Fig. 1.
If we now apply the corresponding distribution function of the standard normal distribution to the output of the random variable , we get which is clearly governed by the uniform distribution on as we can see in the next histogram in Fig. 2.
Again, we use a computer and R to perform also the second step.
Y <- pnorm(X, mean=0, sd=1) hist(Y)
Realize that the range of any distribution function is , which equals the domain of any random variable governed by .
The demonstrated effect is not a coincidence but based on the Probability Integral Transform Theorem. That is, we could set to be governed by any continuous random variable (i.e. one with continuous distribution function) and the result would always be the same.
X <- rexp(n=10^6, rate = 1) hist(X) Y <- pexp(X, rate=1) hist(Y)
X <- rcauchy(n=10^6, location=0, scale=1) hist(X) Y <- pcauchy(X, rate=1) hist(Y)
You will notice that similar outputs as in Fig. 2 are generated, which indicates that .
However, if we do the same exercise with a discrete random (i.e. not continuous) variable such as the Binomial distribution ( = sequence of independent Bernoulli experiments, = relative frequency of successes = probability), the resulting histogram is quite different. The discrete distribution function of is denoted by .
X <- rbinom(n=10^6, size=2, prob=0.5) hist(X) Y <- pbinom(X, size=2, prob=0.5) hist(Y)
Without continuity the controlled correspondence between an element of the domain and the image gets lost. Let us have a closer look into this.
The discrete distribution function of the discrete random variable is as follows:
In the next table we have listed all four (i.e. ) possible outcomes:
|1. Bernoulli Trial||2. Bernoulli Trial|
Failure for both trials occur in one out of four possible cases, i.e. with probability . There are two ways to get one success and one failure, i.e. with probability . Two successes in a row occur with probability . Putting these pieces together one ends with the above discrete distribution function of shown in Fig. 4.
Let us consider how this is reflected in the R simulation. To this end, we use the table command in R:
X <- rbinom(n=10^6, size=2, prob = 0.5) table(X) Y <- pbinom(X, size=2, prob = 0.5) table(Y)
We can see that the R variable contains about 25% of s, about 50% of s and about 25% of s. The R variable (containing these figures) is then entered into the distribution function , which translates that input into accumulated probabilities . Hence, we get about 25% of , about 50% of and about 25% of . This equals exactly the output of the histogram in Fig. 3.
Ultimately, the above example is not uniformly distributed since the increase of the discrete distribution function values are not even (i.e. ‘uneven jumps’). In our above example we have an increase of 0.25 two times and 0.5 once.
In continuous distribution functions these increases behave in a controlled manner and the distribution function balances the likelihood of events with the corresponding pace of the accumulation. For instance, the simulated standard normal distribution as shown in Fig. 1 contains only few samples in the tails left of -3. This, however, is compensated by the almost flat distribution function curve of the standard normal distribution around the area of -3. This has the effect that a wider range of the domain of is needed to fill up the corresponding probability bucket. Note that in Fig. 2, for instance, this probability bucket comprises since 20 buckets have been used to cluster the entire probability mass of 1.
The binomial distribution converges towards a normal distribution if we increase the number of trials (in R it is called the size). This implies that the ‘uneven jumps’ of a discrete Binomial distribution are getting less and less meaningful. Indeed, if we repeat a simulation as follows the result looks quite promising:
size <- 10^5 X <- rbinom(n=10^6, size=size, prob = 0.5) hist(X) Y <- pbinom(X, size=size, prob = 0.5) hist(Y)
Ultimately, all continuous distributions need to be discretized due to the limitations of a computer. So, if we change the size from to , we get the following histograms shown in Fig. 5 – Fig. 7. Apparently, the approximation of , that is distributed according to the binomial distribution, to the normal distribution ultimately implies the approximation of toward the standard uniform distribution:
That is, if a discrete distribution either converges close enough to a continuous distribution or if the distribution has even jumps (i.e. discrete uniform) then the corresponding histogram of will show a similar pattern as described in Theorem I.
Quantile Function Theorem
This time we do not generate an uniform sample but we start with one. Hence, let us now draw a large uniformly distributed sample . We use R to perform the simulation.
U <- runif(n=10^6, min=0, max=1) hist(U)
Starting off with a uniformly distributed sample , we need to “invert” the steps performed to illustrate the Probability Integral Transform Theorem. That is, we now have to apply the generalized inverse distribution function of the intended distribution and apply to it.
If we would like to receive the standard normal distribution, for instance, needs to be fed into the following (generalized) inverse –shown in Fig. 8– of the Standard Normal distribution:
Take a while and think about what the function actually does – it takes a value in (i.e. a probability) and assigns it to a real value. For instance, the probability 0.5 is assigned to the real value 0. Values in the left tail of the distribution get a rather small absolute number while values in the right tail get a number very close to zero. That is, the increase in probability is quite small in the tails, which makes a lot of sense given the nature of a normal distribution.
In R we need to take the following steps:
X <- qnorm(U, mean=0, sd=1) hist(X)
The histogram of , as shown in Fig. 9, is governed by the standard normal distribution. Just as desired.
Please note, however, that the inverse distribution function does not necessarily mean the inverse function. Distribution functions are in general not strictly increasing but non-decreasing. For further details we highly recommend P. Embrecht’s & M. Hofert’s paper ‘A note on generalized inverses’. The difference between inverse and generalized inverse functions will also play a key role in the proof of the Probability Integral Transformation Theorem.
In addition, Theorem II (Quantile Function) does NOT require the distribution function to be continuous. At first sight, this might seem strange since the Quantile Function Theorem is just a kind of inversion of the other.
Let us test the Binomial distribution example (as explained above) with respect to the Quantile Function Theorem: We draw an uniformly distributed large sample and put it into the generalized inverse distribution function of with and . The graph of the corresponding quantile function is illustrated in Fig. 10.
Again, let us take the time to think about the meaning of this generalized inverse of the Binomial (cumulative) distribution function (i.e. quantile function). It assigns a probability to each relevant integer. In our example, it assigns probability to the integer , probability to the integer and to the integer .
Let us do this in R:
U <- runif(n=10^6, min=0, max=1) hist(U) X <- qbinom(U, size=2, prob = 0.5) hist(X)
The result depicted in Fig. 11 is as expected.
Apparently, we have received the desired Binomial distribution by first generating a standard uniform sample and then applying the quantile function to it.
This section is based on the paper [1.] by John E. Angus. The problem with the usual attempts to prove the theorem is that it is usually not made clear what happens when the generalized inverse of the distribution function. For instance, the proof provided on Wikipedia is as follows:
<<Given any random continuous variable , define . Then:
is just the CDF of a Uniform(0,1) random variable. Thus, has a uniform distribution on the interval . >> — Wikipedia retrieved in Nov. 2020.
The cited “proof” above seems to be quite straight-forward. However, it is not clarified what is meant by applying the “inverse” . In the next subsection we are going to learn more about that step in Lemma 1.
Probability Integral Transformation
The following lemma is the key to the proof of Theorem I. The principle idea is that the basic set can be decomposed into , where stands for the disjoint union.
In addition, recall that a distribution function is monotone: if , we have .
Let have a distribution function . Then for all real the following holds:
Proof of Lemma 1:
Decompose the event as follows:
Note that since is monotone. In addition, . Again, due to the monotonicity of we see that . Hence, the intersection is empty. Considering these facts we can derive:
Taking probabilities in , the assertion follows since the event has probability zero.
Now, let us prove the Probability Integral Transformation Theorem.
Proof Theorem I (Probability Integral Transformation):
Let . Since is continuous, there exist a real such that . Then by Lemma 1, . This implies that is distributed as .
Quantile Function Transformation
Proof Theorem II (Quantile Function Theorem):
First, suppose that . Note that for any such that and any , if, and only if, .
Suppose that , then is an interval that contains its left-hand endpoint since is non-decreasing and right-continuous.
Conversely, suppose that , then . It follows now that , completing the proof.