Probability Theory

Lecture notes that I scribbled for STAT5005: Advanced Probability Theory (2020 Fall) taught by Prof. Xiao Fang at CUHK.

The textbook is Probability: Theory and Examples (4th Edition) by Rick Durrett.

\( \renewcommand{\vec}[1]{\boldsymbol{#1}} \DeclareMathOperator*{\E}{\mathbb{E}} \DeclareMathOperator*{\Var}{\mathrm{Var}} \DeclareMathOperator*{\Cov}{\mathrm{Cov}} \DeclareMathOperator*{\argmin}{\mathrm{arg\,min\;}} \DeclareMathOperator*{\argmax}{\mathrm{arg\,max\;}} \def\ZZ{{\mathbb Z}} \def\NN{{\mathbb N}} \def\RR{{\mathbb R}} \def\CC{{\mathbb C}} \def\QQ{{\mathbb Q}} \def\FF{{\mathbb FF}} \def\EE{{\mathbb E}} \newcommand{\tr}{{\rm tr}} \newcommand{\sign}{{\rm sign}} \newcommand{\1}{𝟙} \newcommand{\inprod}[2]{\left\langle #1, #2 \right\rangle} \newcommand{\set}[1]{\left\{#1\right\}} \require{physics} \)

1 Measure Theory #

1.1 Probability Space #

Probability Space: \((\Omega, \mathcal{F}, P)\):

\(\Omega\): Sample space, the set of all possible outcomes of random experiments

\(\mathcal{F}\): Event space, a set of events. An event \(E\) is a subset of \(\Omega\)

\(P\): Probability function: \(\mathcal{F} \to [0,1]\)

Example (Throw a standard die)

The sample space \(\Omega=\{1,2,3,4,5,6\}\). The event space is the set of all subsets of \(\Omega\) (including \(\emptyset\)). The probability function maps each event to a number (probability), e.g., \(\{5\}\) is mapped to \(\frac{1}{6}\), while \(\{2,3\}\) is mapped to \(\frac{1}{3}\).

\((\Omega, \mathcal{F})\) is called a measurable space, i.e., a space on which we can put a measure.

A function \(\mu: \mathcal{F} \to \RR \) is called a measure if:

  1. \(\mu(A)\geq 0,\ \forall A\in \mathcal{F}\)
  2. \(\mu(\emptyset)=0\)
  3. If \(A_1,A_2, \dots \in \mathcal{F}\) are disjoint (\(A_i \cap A_j=\emptyset,\ i\neq j\)), then: \[\mu \left( \bigcup_{i=1}^{\infty}A_i \right) = \sum_{i=1}^{\infty} \mu(A_i)\]

If \(\mu(\Omega)=1\), we call \(\mu\) a probability measure, denoted by \(P\).

Properties of \(\mu\):

Let \(\mu\) be a measure on \((\Omega, \mathcal{F})\), then:

  1. If \(A \subset B\), then \(\mu(A)\leq \mu(B)\). [Monotonicity]
  2. \(\forall A,B, \mu(A)+\mu(B) = \mu(A\cup B)+\mu(A\cap B)\). [Addition Law]
  3. If \(A \subset \bigcup_{i=1}^{\infty} A_i\), then \(\mu(A)\leq \sum_{i=1}^{\infty}\mu(A_i)\). [Sub-Additivity] (Union Bound)
  4. If \(A_n \uparrow A\) (i.e., \(A_1 \subset A_2 \subset \dots \) and \(\bigcup_i A_i=A\)), then \(\mu(A_n)\uparrow \mu(A)\) (i.e., \(\mu(A_1)\leq \mu(A_2)\leq \dots \) and \(\lim_{i\to \infty} \mu(A_i)=\mu(A)\)). [Continuity from below]
  5. If \(A_n \downarrow A\) (i.e., \(A_1 \supset A_2 \supset \dots \) and \(\bigcap_i A_i=A\)) and \(\mu(A_1)<\infty\), then \(\mu(A_n)\downarrow \mu(A)\) (i.e., \(\mu(A_1)\geq \mu(A_2)\geq \dots \) and \(\lim_{i\to \infty} \mu(A_i)=\mu(A)\)). [Continuity from above]


Proof of (1): Note that \(A\) and \(B \setminus A\) are disjoint, we have: \[\mu(B)=\mu(A\cup (B\setminus A)) \stackrel{(3)}= \mu(A)+\mu(B\setminus A)\stackrel{(1)}\geq \mu(A)\]

Proof of (2): Write each term as a sum of disjoint events: \[ \left\{\begin{aligned}\mu(A) &= \mu((A\setminus B) \cup (A\cap B)) \stackrel{(3)}{=} \mu(A\setminus B) + \mu(A\cap B)\\ \mu(B) &= \mu(B\setminus A)+\mu(A\cap B) \end{aligned}\right. \]

\[\left\{\begin{aligned}\mu(A\cup B) &= \mu((A\setminus B) \cup (A\cap B) \cup (B\setminus A)) \stackrel{(3)}{=} \mu(A\setminus B) + \mu(A\cap B) + \mu(B\setminus A)\\ \mu(A\cap B) &= \mu(A\cap B) \end{aligned}\right. \]

Proof of (3): Write \(A\) as a disjoint union of events: \[A=A\cap (\bigcup_{i=1}^{\infty}A_i) = (A\cap A_1) \cup (A\cap (A_2\setminus A_1)) \cup (A\cap (A_3 \setminus (A_1\cup A_2))) \cup \dots \] Then we have:

\[\mu(A)=\mu(A\cap A_1) + \mu(A\cap (A_2 \setminus A_1)) + \ldots \leq \mu(A_1)+\mu(A_2)+ \dots \]

Proof of (4): The inequalities are obvious.

\[\begin{aligned}\mu(A) &= \mu \left( \bigcup_{i=1}^{\infty} A_i \right)\\ &=\mu \left( A_1 \cup (A_2 \setminus A_1) \cup (A_3 \setminus A_2) \cup \dots \right)\\ &= \mu(A_1) + \mu(A_2\setminus A_1)+\mu(A_3\setminus A_2) + \dots\\ &= \lim_{i\to \infty} \mu(A_i)\ \ \text{(by definition of limit.)} \end{aligned} \]

Proof of (5): Consider \(A_1 \setminus A_n \uparrow A_1\setminus A\), by the above conclusion, we have \(\mu(A_1 \setminus A_n) \uparrow \mu(A_1\setminus A)\), so \(\mu(A_1)-\mu(A_n) \uparrow \mu(A_1)-\mu(A)\), so \(\mu(A_n) \uparrow \mu(A)\)

Example (Undergraduate probability: discrete probability spaces)

Let \(\Omega\) be a countable set (finite or countably infinite), \(\mathcal{F}\) be the set of all subsets of \(\Omega\). Let \(P(A)=\sum_{\omega\in A} p(\omega)\) where \(p(\omega)\geq 0\) and \(\sum_{\omega\in \Omega}p(\omega)=1\).

One can verify that \(P\) is a probability measure on this space.

(Any discrete probability space can be constructed like this.)

For the discrete scenario, the above definition of probability space is quite enough. However, for continuous case, non-measurable sets can be constructed. That’s why we need to further define \(\sigma\)-field and Borel sets.

We assume \(\mathcal{F}\) (a collection of subsets of \(\Omega\)) is a \(\sigma\)-field or \(\sigma\)-algebra, that is, it satisfies:

  1. \(\Omega\in \mathcal{F}\)
  2. If \(E\in \mathcal{F}\) then \(E^C \triangleq \Omega \setminus E \in \mathcal{F}\)
  3. If \(E_i \in \mathcal{F}\), for \(i=1,2,3, \dots \), then \(\bigcup_{i=1}^{\infty} E_i\in \mathcal{F}\)


The smallest \(\sigma\)-field is \(\{\emptyset, \Omega\}\). The largest \(\sigma\)-field is all subsets of \(\Omega\).

The unions in (3) can be changed to intersections, i.e., given (1) and (2), (3) is equivalent to:

If \(E_i \in \mathcal{F}\), for \(i=1,2,3, \dots \), then \(\bigcap_{i=1}^{\infty} E_i\in \mathcal{F}\), because \(\left(\bigcap_{i=1}^{\infty} E_i\right)^c = \bigcup_{i=1}^{\infty} E_i^c\).

Therefore, a \(\sigma\)-algebra is a non-empty set of sets that is closed under countable unions, countable intersections, and complements.

Let \(\mathcal{A}\) be a collection of subsets of \(\Omega\), define \(\sigma\)-field generated by \(\mathcal{A}\) to be: \[\sigma(\mathcal{A})\triangleq \bigcap_{\substack{\mathcal{A}\subset \mathcal{F}\\ \mathcal{F} \text{ is a } \sigma\text{-field}}} \mathcal{F}\] In fact, this is the smallest \(\sigma\)-field that contains \(\mathcal{A}\).

Example 1

If \(\mathcal{A}=\{A\}\), where \(A \subset \Omega\), then \(\sigma(\mathcal{A})=\{\emptyset, \Omega, A, A^c\}\).

Example 2

Consider \(\Omega=\RR =(-\infty, \infty)\). \(\mathcal{A}=\left\{ (a,b]: -\infty<a<b<\infty \right\}\). \(\sigma(\mathcal{A})\) is called the Borel \(\sigma\)-field on \(\RR\), denoted by \(\mathcal{B}(\RR)\), or \(\mathcal{B}\) for shorthand.

Another equivalent definition of \(\mathcal{B}\)

\(\mathcal{B}=\sigma(\{(a,b): -\infty<a<b<\infty\}) = \sigma(\{\text{Open sets in }\RR \})\)

I.e., the Borel algebra on \(\RR\) is the smallest \(\sigma\)-algebra containing all open sets.


First show \(\sigma(\mathcal{A}) \subseteq \sigma(\mathcal{A}’)\). Since \((a,b]=\bigcap_{n=1}^{\infty}(a,b+\frac{1}{n})\), we have:

\((a,b] \in \sigma(\mathcal{A}’) \Rightarrow \mathcal{A} \subset \sigma(\mathcal{A}’) \Rightarrow \sigma(\mathcal{A})\subseteq \sigma(\mathcal{A}’)\)

Then show \(\sigma(\mathcal{A}’) \subseteq \sigma(\mathcal{A})\), use \((a,b)=\bigcup_{n=1}^{\infty}(a,b-\frac{1}{n}]\), and argue similarly.

Borel \(\sigma\)-field on \(\RR ^d\), denoted by \(\mathcal{B}(\RR^d)\) or \(\mathcal{B}^d\), is defined to be \[\mathcal{B}^d=\sigma(\{(a_1,b_1]\times (a_2,b_2]\times \dots \times (a_d,b_d]: -\infty<a_i<b_i<\infty\})\]


Note that in spite of the notation, \(\mathcal{B}^{d} = \mathcal{B} \times \mathcal{B} \times \dots \times \mathcal{B}\) is not the Cartesian product of the \(\mathcal{B}\)s; the Cartesian product is the set of measurable rectangles, while the product \(\sigma\)-field is the minimal \(\sigma\)-field over the measurable rectangles (see definition below).

Product of \(\sigma\)-algebra

Let \((S_1, \Sigma_1)\) and \((S_2, \Sigma_2)\) be measurable spaces. Let \(S\) denote the Cartesian product \(S = S_1 \times S_2\). For \(i=1,2\), let \(\rho_i\) denote the \(i\)-th coordinate map, i.e., \(\rho_1(s_1,s_2)=s_1\), \(\rho_2(s_1,s_2)=s_2\) for \(s_i \in S_i\). Then the product of \(\Sigma_1\) and \(\Sigma_2\) is defined as \[\begin{split} \Sigma &= \Sigma_1 \times \Sigma_2\\ &\triangleq \sigma(\rho_1, \rho_2)\\ &=\sigma(\set{\rho_1^{-1}(E_1): E_1 \in \Sigma_1} \cup \set{\rho_2^{-1}(E_2): E_2 \in \Sigma_2})\\ &=\sigma(\set{\set{(s_1, s_2) \in S: \rho_1(s_1,s_2) \in E_1}: E_1 \in \Sigma_1} \cup \set{\set{(s_1, s_2) \in S: \rho_2(s_1,s_2) \in E_2}: E_2 \in \Sigma_2})\\ &=\sigma(\set{E_1 \times S_2: E_1 \in \Sigma_1} \cup \set{S_1 \times E_2: E_2 \in \Sigma_2})\\ &=\sigma(\set{E_1 \times E_2: E_1 \in \Sigma_1, E_2 \in \Sigma_2}) \end{split} \]

The last equation requires a proof.

Lemma: If \(\forall A \in \mathcal{A}\), \(A \in \sigma(\mathcal{B})\), then we have \(\sigma(\mathcal{A}) \subseteq \sigma(\mathcal{B})\). Proof: Obviously \(\mathcal{A} \subseteq \sigma(\mathcal{B})\), by definition of \(\sigma(\mathcal{A})\), it is the intersection of all \(\sigma\)-algebra that contains it, so \(\sigma(\mathcal{B})\) is one of them. QED.

So we need to proof: (1) \(E_1 \times S_2 \in RHS\), (2) \(S_1 \times E_2 \in RHS\), and (3) \(E_1 \times E_2 \in LHS\).

(1) and (2) are true since \(S_2 \in \Sigma_2\) and \(S_1 \in \Sigma_1\) by definition. Because \(E_1 \times E_2 = (E_1 \times S_2) \cap (S_1 \times E_2)\), so (3) is true by the closeness of intersection of \(\sigma\)-algebra. QED.

Next we focus on discussing \(P: \mathcal{R}\to [0,1]\) on \((\RR ,\mathcal{R})\). The aim is to show that probability measures on \(\RR\) are determined by distribution functions. This means the cumulative distribution function (CDF) we learned in the elementary probability course actually determines a probability measure on \(\RR\).

\(F: \RR \to \RR \) is called a Stieltjes measure function if:

  1. \(F\) is non-decreasing
  2. \(F\) is right-continuous, i.e., \(\lim_{y\downarrow x} F(y)=F(x)\)

Theorem: Every Stieltjes measure function \(F\) determines a unique measure \(\mu\) on \((\RR , \mathcal{R})\) such that: \[\mu((a,b])=F(b)-F(a), \forall -\infty<a<b<\infty\]


The complete proof is long, so we only prove the uniqueness. We need Dynkin’s \(\pi\)-\(\lambda\) theorem.

Define \(\mathcal{P} = \{(a,b]: -\infty <a\leq b<\infty\}\) (a \(\pi\)-system), we need to prove that if two measure \(\mu_1\) and \(\mu_2\) agree on \(\mathcal{P}\), then they agree on \(\mathcal{R}=\sigma(\mathcal{P})\).

To this end, define \(L=\{A \in \mathcal{R}: \mu_1(A) = \mu_2(A)\}\).

If \(\mu_1, \mu_2\) agree on \(\mathcal{P}\), i.e., \(\forall (a,b], \mu_1((a,b]) = \mu_2((a,b])\), then \(\mathcal{P} \subset \mathcal{L}\).

We can also verify that \(\mathcal{L}\) is a \(\lambda\)-system, by the Dynkin’s \(\pi\)-\(\lambda\) theorem, \(\sigma(\mathcal{P})\subset \mathcal{L}\).

By definition, \(\mathcal{L} \subset \mathcal{R}=\sigma(\mathcal{P})\), so \(\mathcal{L}=\mathcal{R}\).

1.2 Random Variables \(X\) and their Distributions \(\mathcal{L}(X)\) #

Random Variable #


Consider \((\Omega,\mathcal{F},P)\), \(X:\Omega\to \RR \), How do define \(\Pr[X\leq x], \forall x \in \RR \)?

We can regard \(X\leq x\) as a set \(\{\omega \in \Omega: X(\omega)\leq x\} \subset \Omega\), and define \(\Pr[X\leq x] \triangleq P(\{\omega \in \Omega: X(\omega)\leq x\})\).

Since \(P: \mathcal{F}\to [0,1]\), we need to make sure \(\{\omega \in \Omega: X(\omega)\leq x\} \in \mathcal{F}\).

This requires \(X\) to be measurable.

Let \((\Omega_1,\mathcal{F}_1)\) and \((\Omega_2,\mathcal{F}_2)\) be measurable spaces. \(f: \Omega_1 \to \Omega_2\) is called measurable if:

\(\forall A \in \mathcal{F}_2, f^{-1}(A) \in \mathcal{F}_1\)

where \(f^{-1}(A)\triangleq \{\omega \in \Omega_1: f(\omega) \in A\}\)

When we need to emphasize the \(\sigma\)-field, we say \(f\) is \(\mathcal{F}_1 / \mathcal{F}_2\) measurable. Usually \(f\) is a R.V. and by default \(\Omega_2=\RR\), \(\mathcal{F}_2=\mathcal{B}(\RR)\), so we also omit the second measurable space and say \(f\) is \(\mathcal{F}_1\) measurable, denoted by \(f \in \mathcal{F}_1\).

Notation Given \(f: X \to Y\), we have:

  • \(f(A) = \{ f(x) \mid x\in A\} \text{ for } A \subset X\)
  • \(f^{-1}(B) = \{ x \mid f(x) \in B\} \text{ for } B \subset Y\)
  • \(f(\mathcal{A}) = \{ f(A) \mid A \in \mathcal{A} \}\)
  • \(f^{-1}(\mathcal{B}) = \{ f^{-1}(B) \mid B \in \mathcal{B} \}\)


The definition of measurable functions depends on two measurable spaces. Some literature calls the function \(f: \Omega_1 \to \Omega_2\) a measurable map from \((\Omega_1,\mathcal{F}_1)\) to \((\Omega_2,\mathcal{F}_2)\), or sometimes abuses the notation and says \(f: (\Omega_1,\mathcal{F}_1) \to (\Omega_2,\mathcal{F}_2)\) is measurable.

Fact 1

\(\sigma(f) \triangleq \{f^{-1}(A):A\in \mathcal{F}_2\}\) is a \(\sigma\)-field in \(\Omega_1\).

\(\sigma(f) \triangleq \{f^{-1}(A):A\in \mathcal{F}_2\}\) is called the \(\sigma\)-field generated by \(f\). Or equivalently, the smallest \(\sigma\)-field for \(f\) to be measurable.

\[f \text{ is measurable} \Longleftrightarrow \sigma(f) \subset \mathcal{F}_1\]


Let \(f(\omega_1)=a, \forall \omega_1 \in \Omega_1\). Then \(\sigma(f)=\{\emptyset, \Omega_1\}\subset \mathcal{F}_1\). So constant function is always measurable.

Let \(f(\omega_1)=a, \forall \omega_1 \in A \subset \Omega_1\), let \(f(\omega_1)=b, \forall \omega_1 \in A^c \subset \Omega_1\), \(a\neq b\). Then \(\sigma(f)=\{\emptyset, \Omega_1, A, A^c\}\).

Fact 2

\(\{B \subset \Omega_2: f^{-1}(B) \in \mathcal{F}_1\}\) is a \(\sigma\)-field in \(\Omega_2\).

This means if \(\mathcal{F}_2 = \sigma(\mathcal{A}_2)\), then to check whether \(f\) is measurable, we only need to check \(\forall A \in \mathcal{A}_2, f^{-1}(A) \in \mathcal{F}_1\).

For example, \(f: (\Omega_1, \mathcal{F}_1)\to (\RR, \mathcal{B})\) is measurable if \(f^{-1}((-\infty, x]) \in \mathcal{F}_1, \forall x \in \RR\). (It is easy to verify from the definition that \(\mathcal{B}=\sigma(\{(-\infty, x]: x \in \RR\})\).)


Because if \(f^{-1}(A) \in \mathcal{F}_1, \forall A \in \mathcal{A}_2\), let \(\mathcal{B}=\{B\subset \Omega_2: f^{-1}(B)\in \mathcal{F}_1\}\), then \(\mathcal{A}_2 \subset \mathcal{B}\). Because \(\mathcal{F}_2\) is the smallest \(\sigma\)-field that contains \(\mathcal{A}_2\), and \(\mathcal{B}\) is a \(\sigma\)-field, so \(\mathcal{F}_2 \subset \mathcal{B}\). According to the definition of \(\mathcal{B}\), we have \(\forall A \in \mathcal{F}_2, f^{-1}(A) \in \mathcal{F}_1\).

Fact 3

Let \((\Omega_1, \mathcal{F}_1), (\Omega_2, \mathcal{F}_2), (\Omega_3, \mathcal{F}_3)\) be measurable spaces. If \(f_1: \Omega_1 \to \Omega_2\), \(f_2: \Omega_2 \to \Omega_3\) are both measurable, then:

\(f_2 \circ f_1: \Omega_1 \to \Omega_3\) is measurable.


\((f_2 \circ f_1)^{-1}(A) = f_1^{-1}(f_2^{-1}(A))\). Since \(f_2\) is measurable, so for all \(A \in \mathcal{F}_3\), \(f_2^{-1}(A) \in \mathcal{F}_2\), similarly, since \(f_1\) is measurable, so \(f_1^{-1}(f_2^{-1}(A)) \in \mathcal{F}_1\).

If there is a measure \(\mu_1\) on \((\Omega_1, \mathcal{F}_1)\), and \(f: \Omega_1 \to \Omega_2\) is measurable. Then \(\mu_1\) induces a measure \(\mu_2\) on \((\Omega_2, \mathcal{F}_2)\) such that:

\[\forall A \in \mathcal{F}_2: \quad \mu_2(A) = \mu_1(f^{-1}(A))\]

Next we focus on functions from \((\Omega, \mathcal{F}, P)\) to \((\RR, \mathcal{B})\) or \((\RR^d, \mathcal{B}^d)\).

If \(f: \Omega \to \RR \) is measurable, then \(f\) is called a real-valued (or 1-dimensional) random variable, usually denoted by \(X\) (\(Y, Z, \dots \)).

If \(f: \Omega \to \RR^d\) is measurable, then \(f\) is called a \(d\)-dimensional random variable (or random vector), usually denoted by \(X=(X_1, X_2, \dots, X_d)^T\).

If \(P\) is a probability measure on \((\Omega, \mathcal{F})\), then its induced measure on \((\RR, \mathcal{B})\) (or \((\RR^d, \mathcal{B}^d)\)) is called the probability distribution of random variables (or random vectors).

The \(\sigma\)-field generated by random variable \(X\): \(\sigma(X) \triangleq \{f^{-1}(A):A\in \mathcal{B}\}\).

The \(\sigma\)-field generated by multiple random variables (or random vectors) \(X_1, X_2, \dots, X_k\):

\[\sigma(X_1, X_2, \dots, X_k) \triangleq \{(X_1, X_2, \dots, X_k)^{-1}(A):A\in \mathcal{B}\}\]

Fact 4

\(\vec{X}=(X_1,X_2, \dots, X_d)^T\) is a random vector if and only if \(X_i\) is a random variable, for all \(i=1, \dots , d\).



Since \(\vec{X}: \Omega\to \RR^d\) is measurable, we have \(\forall A \in \mathcal{R}^d, \vec{X}^{-1}(A) \in \mathcal{F}\).

For any \((a,b] \in \mathcal{R}\), we have:

\[X_i^{-1}((a,b]) = \vec{X}^{-1}(\underbrace{\RR}_{1\text{st}} \times \dots \times \RR \times \underbrace{(a,b]}_{i\text{-th}} \times \RR \times \dots \times \underbrace{\RR}_{d\text{-th}}) \in \mathcal{F}\]


For any \((a_1,b_1] \times \dots \times (a_d, b_d] \in \mathcal{R}^d\), we have:

\[\vec{X}^{-1}((a_1,b_1] \times \dots \times (a_d, b_d]) = \left[ X_1^{-1}((a_1,b_1]) \right] \cap \dots \cap \left[ X_d^{-1}((a_d,b_d]) \right] \in \mathcal{F}\]

Fact 5

If \(X_1, \dots ,X_n\) are random variables and \(f: (\RR^n, \mathcal{R}^n) \to (\RR, \mathcal{R})\) is measurable, then \(f(X_1, \dots ,X_n)\) is a random variable.


It follows from Fact 3 and Fact 4.

This means usual algebraic operations like \(X_1+X_2\), \(X_1 \cdot X_2\), etc. are all measurable (random variables).

E.g., to prove \(X_1+X_2\) is a random variable, we only need to show \(f: (\RR^2, \mathcal{R}^2) \to (\RR, \mathcal{R})\) and \(f(X_1, X_2)=X_1+X_2\). Since \(\sigma(\{(-\infty, x): x \in \RR \}) = \mathcal{R}\), we only need to show for all \(x \in \RR\), \(f^{-1}((-\infty, x)) \in \mathcal{R}^2\), i.e., \(\{(X_1, X_2): X_1+X_2 < x\} \in \mathcal{R}^2, \forall x \in \RR \).

Note that \(X_1+X_2<x\) if and only if there exists a rational number \(r \in (X_1, x-X_2)\). So \(\{(X_1, X_2): X_1+X_2 < x\} = \bigcup_{r \in \QQ}\{(X_1, X_2): X_1<r, X_2<x-r\} \in \mathcal{R}^2\).

If \(f: (\RR, \mathcal{R})\to (\RR , \mathcal{R})\) is measurable, it is called Borel-measurable.

If \(f\) is a continuous function, or a piecewise-continuous function, then it is Borel-measurable. (That’s why we can safely apply elementary functions on random variables, e.g., if \(X\) is a R.V., then so is \(\sin(X)\).)

If \(X_1, X_2, \dots \) are random variables, then so are \(\inf\limits_{n\geq 1} X_n\), \(\sup\limits_{n\geq 1} X_n\), \(\limsup\limits_{n\to \infty} X_n\), \(\liminf\limits_{n\to \infty} X_n\).

Distribution #

Let \((\Omega, \mathcal{F}, P)\) be a probability space. Let \(X: \Omega \to \RR \) be a real valued random variable. The induced measure \[\mu(A) = P(\{\omega \in \Omega: X(\omega) \in A\}) \triangleq \Pr[X \in A]\] is called the probability measure of \(X\).

If the induced measure is on \((\RR, \mathcal{R})\), then it is called the probability distribution of \(X\). The distribution function of \(X\) is defined to be:

\[F_X(x) = \Pr[X\leq x] \triangleq P(\{\omega \in \Omega: X(\omega) \leq x\}) \quad \forall x \in \RR\]

Properties of distribution function \(F\):

  1. \(F\) is non-decreasing.
  2. \(F\) is right-continuous.
  3. \(\lim\limits_{x\to -\infty} F(x)=0\); \(\lim\limits_{x\to \infty} F(x)=1\).

These properties are inherited from the properties of measures.

Proposition: If \(X\) has a continuous distribution function \(F\), then \(Y=F(X)\) has the uniform distribution.

Proof: For \(0<y<1\), denote \(F^{-1}\) to be the largest value among the preimage:

\[\Pr[Y\leq y] = \Pr[F(X)\leq y] = \Pr[X\leq F^{-1}(y)] = F(F^{-1}(y)) = y.\]

The last equality is due to the continuity of \(F\).

The next theorem provides a way of constructing a random variable with an arbitrary distribution.

Theorem: Let \(\Omega=(0,1)\), \(\mathcal{F}=\{\text{Borel sets}\}\), \(P=\text{Lebesgue measure}\). Define \(X: \Omega\to \RR \) to be \[X(\omega) = F^{-1}(\omega) \quad \forall \omega \in \Omega\] where \[F^{-1}(\omega) = \inf \{y: F(y)\geq \omega\} = \sup \{y: F(y)<\omega\}\]

Then the distribution function of \(X\) is \(F\).

Proof: Note that \[\Pr[X\leq x] = P(\{\omega: F^{-1}(\omega)\leq x\})\] \(F(x)=P(\{\omega: \omega \leq F(x)\})\)

The right-hand sides are equal by the definition of \(F^{-1}\), hence \(\Pr[X\leq x] = F(x)\).

\(X\) and \(Y\) are said to be equal in distribution if \(F_X(x) = F_Y(x)\) for all \(x \in \RR\).

The support of a random variable \(X\) with definition function \(F\) is defined to be: \(\{x \in \RR : F(x+\epsilon)-F(x-\epsilon)>0, \forall \epsilon\geq 0\}\).

Classification of distribution functions.

Note that the set of discontinuity points of \(F\): \(\{x \in \RR : \lim_{y \uparrow x} F(y)\neq F(x)\}\) is countable, denoted by \(\{a_1, a_2, \dots \}\). Let \(b_i=F(a_i)-F(a_i-) >0\).

  1. If \(\sum\limits_{i=1}^{\infty} b_i=1\), then \(F\) is called a discrete distribution.
  2. If \(\sum\limits_{i=1}^{\infty} b_i=0\), then \(F\) is called a continuous distribution.
    1. If \(F(x)=\int_{-\infty}^x f(y)\; dy, \forall x\), then \(F\) is absolutely continuous, with density function \(f\).

Theorem: Any distribution function \(F\) can be written as \[F=c_1 \underbrace{F_d}_{\text{discrete}} + c_2\underbrace{F_a}_{\text{abso. cont.}}+c_3 \underbrace{F_s}_{{\text{singular}}}\] where \(c_1,c_2,c_3\geq 0\), \(c_1+c_2+c_3=1\). \(F_s\) is a singular distribution function, meaning that \(F_s’\) exists hand equals to 0 almost everywhere.

1.3 Expectation \(\E[X]\): Definitions, Properties, Inequalities #

Definition of Expectation #

Let \(X\) be a random variable defined on \((\Omega, \mathcal{F}, P)\), its expectation is defined in four steps.

Definition 1 Given a set \(A \in \mathcal{F}\), define \[X(\omega)=\1_A(\omega)=\left\{ \begin{array}{l@{\quad:\quad}l} 1 & \text{if } \omega \in A\\ 0 & \text{if } \omega \not\in A \end{array}\right. \] Such a random variable is called an indicator random variable, denoted by \(\1_A\) and its expectation is defined to be: \[\E[X] = \E[\1_A] = P(A)\]

Definition 2 Let \(X=\sum_{i=1}^n a_i \1_{A_i}\), where \(A_1, \dots ,A_n \in \mathcal{F}\) are disjoint and \(a_1, \dots , a_n \in \RR \). Such a random variable is called a simple random variable and its expectation is defined to be: \[\E[X] = \sum_{i=1}^n a_i P(A_i)\]


Definition 2 is well-defined, i.e., if \(\{A_i\}\) and \(\{B_i\}\) are two different finite partitions of \(\Omega\), such that \(\sum_{i=1}^n a_i \1_{A_i} = \sum_{i=1}^m b_i \1_{B_i}\), then \(\sum_{i=1}^n a_i P(A_i) = \sum_{i=1}^m b_i P(B_i)\) (Proof skipped). This means we can safely ignore the underlying partitions of \(\Omega\), and simply denote the random variable as \(X\).

Definition 3 For a non-negative random variable, i.e., \(X(\omega)\geq 0, \forall \omega \in \Omega\), define \[\E[X] = \sup_{\substack{Y: 0\leq Y\leq X \\ Y \text{ is a simple random variable}}} \E[Y]\] Where the comparison between random variables is defined as: \[0\leq Y\leq X \Longleftrightarrow 0\leq Y(\omega) \leq X(\omega), \forall \omega \in \Omega\] Note: it can be \(+\infty\).

Definition 4 For an arbitrary random variable \(X\), write \(X=X^+ - X^-\) and \(|X| = X^+ + X^-\) where \[X^+(\omega) = \max \{X(\omega), 0\}, \qquad X^-(\omega) = \max \{-X(\omega), 0\}\]

\(\E[X]\) exists and set \(\E[X] = \E[X^+]-\E[X^-]\) whenever the subtraction makes sense, i.e., if \(\E[X^+]<\infty, \E[X^-]<\infty\). Otherwise if \(\E[X^+]=\E[X^-]=\infty\), then we say \(\E[X]\) does not exist.

If both \(\E[X^+]\) and \(\E[X^-]\) are finite, then \(\E[X]\) and \(\E[|X|] = \E[X^+ + X^-] \underbrace{=}_{\text{proved later}} \E[X^+] + \E[X^-]\) are also finite.


\[X=\left\{ \begin{array}{l@{\quad:\quad}l} 0 & \text{on } \Omega_0\\ \infty & \text{on } \Omega_0^c \end{array}\right. \]

If \(P(\Omega_0^c)>0\), then \[\E[X] = \sup\limits_{\substack{Y: 0\leq Y\leq X \\ Y \text{ is a simple random variable}}} \E[Y] = \sup_M M\cdot P(\Omega_0^c) = \infty\] If \(P(\Omega_0^c = 0)\), then \(\E[X]=0\).


According to the above example, sets with probability 0 can be ignored in computing expectations.

In particular, if \(X=Y\) almost surely (a.s.) then \(\E[X]=\E[Y]\) if exists.

Properties of Expectation #

Suppose \(X,Y\geq 0\) or \(\E[|X|], \E[|Y|]<\infty\), we have:

  1. If \(X\geq Y\) a.s., then \(\E[X]\geq \E[Y]\) (monotonicity)
  2. \(\E[aX] = a\E[X]\) (linearity)
  3. \(\E[X+Y]=\E[X]+\E[Y]\) (linearity)

Proof is long so is skipped. See the any advanced probability textbook.

Monotone Convergence Theorem (MCT) #

Whether \(\E \left[ \lim_{n\to \infty} X_n \right] = \lim_{n\to \infty} \E[X_n]\)? Not necessarily.

Counter Example

\[ X_n = \left\{ \begin{array}{l@{\quad:\quad}l} n & \text{if } \omega \in (0,\frac{1}{n})\\ 0 & \text{otherwise} \end{array}\right. \]

On the one hand, \(\lim_{n\to \infty} X_n = 0\), so \(\E[\lim_{n\to \infty} X_n] = \E[0]=0\).

On the other hand, \(\E[X_n]=n\cdot P((0,\frac{1}{n})) = 1\), so \(\lim_{n\to \infty} \E[X_n] = \lim_{n\to \infty} 1 = 1\).

Theorem: Monotone Convergence Theorem (MCT)

If \(X_n\geq 0, \forall n\), and \(X_n \uparrow X\) (i.e., \(X_n(\omega) \uparrow X(\omega), \forall \omega \in \Omega)\), then \(\E[X_n] \uparrow \E[X]\) (i.e., \(\lim_{n\to \infty} \E[X_n] = \E[\lim_{n\to \infty} X_n]\)).


By monotonicity, \(\{\E[X_n]\}_{n=1}^{\infty}\) is a sequence of non-negative non-decreasing numbers. It either converges to a value \(a\), or diverges to \(\infty\). We need to show \(\E[X] = a\) or \(\infty\).

Case 1: if \(\lim_{n\to \infty} \E[X_n] = \infty\), since \(\E[X]\geq \E[X_n], \forall n\), so \(\E[X] = \infty\). Case 2: if \(\lim_{n\to \infty} \E[X_n] = a\), since \(\E[X]\geq \E[X_n], \forall n\), so \(\E[X]\geq a\). We now show \(\E[X]\leq a\).

By definition 3: \[\E[X] = \sup_{\substack{Y: 0\leq Y\leq X \\ Y \text{ is a simple random variable}}} \E[Y]\]

It suffices to show that \(\E[Y]\leq a=\lim_{n\to \infty} \E[X_n]\), for all simple R.V. \(Y\leq X\).

Fix a \(Y\), and suppose \(Y = \sum_{i=1}^m b_i \1_{B_i}\), where \(B_i\) are disjoint. Then it suffices to prove that \(\lim_{n\to \infty} \E[X_n]\geq \sum_{i=1}^m b_i P[B_i]\), for all finite partitions \(\{B_i\}\) of \(\Omega\) with \(b_i\leq X(\omega), \forall \omega \in B_i\).

To that end, choose any \(\epsilon>0\), and define \(B_{in}= \{\omega \in B_i: X_n(\omega)\geq b_i -\epsilon\}\). Therefore \(\E[X_n]\geq \sum_{i=i}^m (b_i-\epsilon)P(B_{in})\). As \(n\to \infty\), the RHS converges to \(\sum_{i=i}^m (b_i-\epsilon)P(B_{i}) = \sum_{i=1}^m b_i P(B_i) - \epsilon\) (because of the continuity of probabilities and \(B_{in} \uparrow B_i\)). Hence, \(\lim_{n\to \infty} \E[X_n]\geq \sum_{i=1}^m b_i P(B_i) - \epsilon\). Since this holds for any \(\epsilon>0\), we have \(\lim_{n\to \infty} \E[X_n] \geq \sum_{i=1}^m b_i P(B_i)\), as required.

(In this proof, I don’t understand why \(B_{in} \uparrow B_i\).)


The Monotone Convergence Theorem can be generalized for negative R.V.s., but additionally we need to require that \(\E[X_1]> -\infty\).

The proof can be reduced to the above non-negative case: when discussing Case 2, we need to show \(\lim_{n\to \infty} \E[X_n]\geq \E[X]\). If \(\E[X_1] = +\infty\), this is trivial, so assume \(\E[X_1]\) is finite. Then by replacing \(X_n\) by \(X_n-X_1\), and \(X\) by \(X-X_1\), it suffices to assume that \(X_n\) and \(X\) are non-negative.

Fatou’s Lemma #

If the requirements of MCT do not hold, then Fatou’s lemma gives an inequality.

If \(X_n\geq 0 \quad \forall n\), then \[\liminf_{n\to \infty } \E[X_n] \geq \E \left[ \liminf_{n\to \infty } X_n \right]\] where the limit inferior of a sequence \((x_n)\) is defined by: \[\liminf_{n\to \infty} x_n \triangleq \lim_{n\to \infty} \left( \inf_{m\geq n} x_m \right)\]


\begin{align} \liminf_{n\to \infty } \E[X_n] &\geq \liminf_{n\to \infty} \E\left[\inf_{k\geq n} X_k\right]\\ &= \lim_{n\to \infty } \E\left[ \inf_{k\geq n} X_k \right]\\ &= \E \left[ \lim_{n\to \infty } \inf_{k\geq n} X_k \right] \text{ by MCT because } \inf_{k\geq n} X_k \text{ is non-decreasing}\\ &= \E \left[ \liminf_{n\to \infty } X_n \right] \end{align}

Dominated Convergence Theorem (DCT) #

If \(X_n\to X\) almost surely, and for all \(n\), \(|X_n|\leq Y\) for some \(Y\) with \(\E[Y]<\infty \). Then \(\E[X_n]\to \E[X]\) (i.e., \(\lim_{n\to \infty }\E[X_n] = \E[\lim_{n\to \infty }X_n]\)).

Special Case

If \(|X_n|\leq M\), then \(\lim_{n\to \infty }\E[X_n] = \E[\lim_{n\to \infty }X_n]\). This is called Bounded Convergence Theorem (BCT).


Note that \(X_n+Y\geq 0\). By Fatou’s Lemma:

\[\liminf_{n\to \infty } \E[X_n+Y] \geq \E[\liminf_{n\to \infty } (X_n +Y)] = \E[X+Y] \]

By linearity of expectation,

\[\liminf_{n\to\infty} \E[X_n] + \E[Y] \geq \E[X]+\E[Y] \Longrightarrow \liminf_{n\to\infty} \E[X_n] \geq \E[X]\]

To prove the other direction:

\[\begin{split} \limsup_{n\to\infty} \E[X_n-Y] &= -\liminf_{n\to\infty} \left[ -\E[X_n-Y] \right]\\ &= -\liminf_{n\to\infty} \left[ \E[-X_n+Y] \right]\\ &\leq - \E[\liminf_{n\to\infty} (-X_n + Y)]\\ &= -\E[-X+Y] = \E[X-Y] \end{split} \] Again by linearity, we have

\[\liminf_{n\to\infty} \E[X_n] \leq \E[X]\]

1.4 Independence #

Definition of Independence #

Definition Events \(A_1, \dots, A_n\) are independent if \[P\left(\bigcap_{i \in I} A_i\right) = \prod_{i \in I} P(A_i), \quad \forall I \subset \{1, \dots, n\}\]

Or equivalently, \[P \left( \bigcap_{i=1}^n B_i \right) = \prod_{i=1}^n P(B_i)\] Where \(B_i=A_i\) or \(B_i=\Omega\).


E.g., if \(A,B,C\) are independent, then all of the following equations are satisfied: \(P(A \cap B)=P(A)P(B)\), \(P(A \cap C)=P(A)P( C)\), \(P(B \cap C)=P(B)P( C)\), and \(P(A \cap B \cap C)=P(A)P(B)P( C)\).

It is not enough to only check the first 3 equations. If we have \(P(A_i \cap A_j)=P(A_i)P(A_j)\) for all \(i\neq j\), then \(A_1, \dots, A_n\) are called pairwise independent.

Definition Collections of events (e.g., \(\sigma\)-fields) \(\mathcal{A}_1, \dots, \mathcal{A}_n\) are independent if \[P(A_1 \cap \dots \cap A_n) = P(A_1)\times \dots \times P(A_n)\] For all \(A_i \in \mathcal{A}_i\) or \(A_i=\Omega\).

Definition Random variables \(X_i, \dots, X_n\) are independent if \(\sigma(X_1), \dots, \sigma(X_n)\) are independent, where \(\sigma(X) = \{X^{-1}(B): B \in \mathcal{B}\} = \sigma(\{X^{-1}((-\infty, x]): x \in \RR\})\).

Or equivalently, if for all \(j \in \NN\), for all distinct \(i_1, \dots, i_j \in \{1, \dots, n\}\), and for all Borel sets \(B_1, \dots, B_j\), we have \[\Pr\left[X_{i_1} \in B_1, X_{i_2} \in B_2, \dots, X_{i_j} \in B_j\right] = \Pr[X_{i_1} \in B_1] \Pr[X_{i_2} \in B_2] \dots \Pr[X_{i_j} \in B_j]\]


If \(X, Y\) are independent, then for all \(x,y\)

\(\Pr[X\leq x, Y \leq y] = \Pr[X\leq x] \cdot \Pr[Y \leq y]\)

(This is in fact “if and only if”, but “\(\Longleftarrow\)” is proved later.)

Theorem (Independence is preserved under deterministic transformations.) Let \(X\) and \(Y\) be independent R.V.s. Let \(f,g:\RR\to\RR\) be Borel-measurable functions. Then the R.V.s \(f(X)\) and \(g(Y)\) are independent.

Theorem If \(\mathcal{A}_1, \dots, \mathcal{A}_n\) are independent, and each \(\mathcal{A}_i\) is a \(\pi\)-system, then \(\sigma(\mathcal{A}_1), \dots, \sigma(\mathcal{A}_n)\) are independent.


Need to show: \(P(B_1 \cap \dots \cap B_n) = P(B_1)\times \dots \times P(B_n)\), for all \(B_i \in \sigma(\mathcal{A}_i)\).

Fix \(B_i \in \mathcal{A}_i\) or \(B_i = \Omega\) for \(i=2, \dots, n\).

Define \(\mathcal{L}_1 = \{B_1 \in \sigma(\mathcal{A}_1): P(B_1 \cap \dots \cap B_n) = P(B_1) \times \dots \times P(B_n)\}\).

By independence of \(\mathcal{A}_i\), \(A_1 \subset \mathcal{L}_1\).

Moreover, we can verify that \(\mathcal{L}_1\) is a \(\lambda\)-system.

By the \(\pi\)-\(\lambda\) theorem, \(\sigma(\mathcal{A}_1) \subset \mathcal{L}_1\).

Therefore \(P(B_1 \cap \dots \cap B_n) = P(B_1) \times \dots \times P(B_n)\) for \(B_1 \in \sigma(\mathcal{A}_1)\), \(B_2 \in \mathcal{A}_2\) (or \(B_2 = \Omega\)), …, \(B_n \in \mathcal{A}_n\) (or \(B_n = \Omega\)).

Repeat the same argument to \(2, \dots, n\).

Theorem Random variables \(X_1, \dots, X_n\) are independent if and only if \[\Pr[X_1\leq x_1, \dots, X_n\leq x_n] = \prod_{i=1}^n \Pr[X_i\leq x_i], \quad \forall x_i \in (-\infty, \infty]\] (I.e., the joint CDF = product of marginal CDFs)


Note that \(\sigma(X_i) = \sigma(\{\{X_i \leq x_i\}: x_i \in (-\infty, \infty]\})\). And \(\{\{X_i \leq x_i\}: x_i \in (-\infty, \infty]\})\) is a \(\pi\)-system. The equation implies that these \(n\) \(\pi\)-systems are independent. Therefore the theorem follows from the previous theorem.

Corollary If \(X_1, \dots, X_n\) has density \(f(x_1, \dots, x_n)\) and \(f(x_1, \dots, x_n)=g_1(x_1)\cdots g_n(x_n)\), \(g_i\geq 0\) and measurable. Then \(X_1, \dots, X_n\) are independent.


\[\begin{split} &\Pr[X_1\leq x_1, \dots, X_n\leq x_n]\\ =& \int_{-\infty}^{x_1} \dots \int_{-\infty}^{x_n} f(y_1, \dots, y_n)\; dy_1 \dots dy_n\\ =& \int_{-\infty}^{x_1} \dots \int_{-\infty}^{x_n} g(y_1) \cdots g(y_n)\; dy_1 \dots dy_n\\ \stackrel{\text{Fubini}}{=}& \int_{-\infty}^{x_1} g_1(y_1)\; dy_1 \dots \int_{-\infty}^{x_n} g_n(y_n)\; dy_n\\ =& \frac{\Pr[X_1\leq x_1]}{\int_{-\infty}^{+\infty} \dots \int_{-\infty}^{+\infty} g(y_2) \cdots g(y_n)\; dy_2 \dots dy_n} \dots \frac{\Pr[X_n\leq x_n]}{\int_{-\infty}^{+\infty} \dots \int_{-\infty}^{+\infty} g(y_1) \cdots g(y_{n-1})\; dy_1 \dots dy_{n-1}}\\ \stackrel{\text{Fubini}}{=}& \frac{\Pr[X_1\leq x_1] \cdots \Pr[X_n\leq x_n]}{\left[\int_{-\infty}^{+\infty} \dots \int_{-\infty}^{+\infty} g(y_1) \cdots g(y_n)\; dy_1 \dots dy_n\right]^{n-1}}\\ =& \Pr[X_1\leq x_1] \cdots \Pr[X_n\leq x_n] \end{split} \]

Corollary Suppose \(X_1, \dots, X_n\) are discrete R.V.s, then \(X_1, \dots, X_n\) are independent if and only if \[\Pr[X_1=x_1, \dots, X_n = x_n]= \prod_{i=1}^n \Pr[X_i=x_i], \quad \forall x_i\]

Theorem If \(X\) and \(Y\) are independent, \(\E[|X|], \E[|Y|] < \infty\), or \(X, Y \geq0\), then \(\E[XY] = \E[X]\E[Y]\).

(Proof skipped)

Theorem If \(X_1, X_2, \dots, X_n\) are independent, and \(I \subset \{1,2, \dots, n\}\), then \(\sigma(X_i:i \in I)\) and \(\sigma(X_j: j \in I^c)\) are independent.


(This looks obvious, but requires a proof.)

Note that

\[\begin{split} \sigma(X_i: i \in I) &= \sigma \left( \bigcap_{i \in I}A_i: A_i \in \sigma(X_i) \right)\\ \sigma(X_j: j \in I^c) &= \sigma \left( \bigcap_{i \in I^c}B_j: B_j \in \sigma(X_j) \right) \end{split} \]

The sets in \(\sigma(\dots)\) on the RHS are \(\pi\)-systems. And the two \(\pi\)-systems are independent by the condition. Hence the result follows.

(This result can be generalized into infinite number of R.V.s)

Corollary If \(X_1, \dots,X_n\) are independent, \(g, h\) are measurable. Then \(g(X_i: i \in I)\) and \(h(X_j: j \in I^c)\) are independent. Therefore if \(\E[|g|]<\infty, \E[|h|]<\infty\), then \(\E[g(X_i: i \in I) \cdot h(X_j: j \in I^c)] = \E[g]\E[h]\).

Theorem If \(\mathcal{A}\) is independent of itself, then \(P(A)=0 \text{ or } 1\) for any \(A \in \mathcal{A}\).

(This is just a simple fact, but it is used to proof the next theorem.)


\[P(A) = P(A \cap A) = P(A)\cdot P(A)= P(A)^2\]

So \(P(A)\) must be 0 or 1.

Kolmogorov’s Zero–One Law Suppose \(X_1, X_2, \dots\) are independent. Let \[\mathcal{T} = \bigcap_{n=1}^{\infty} \sigma \left( X_n, X_{n+1}, \dots \right)\] be the tail \(\sigma\)-field. Then \(\forall A \in \mathcal{T}\), we have \(P(A)=0 \text{ or } 1\).


Because \(\mathcal{T} \subset \sigma(X_1, X_2, \dots)\), it suffices to show \(\mathcal{T}\) is independent of \(\sigma(X_1, X_2, \dots)\), so that \(\mathcal{T}\) is independent of itself, and the theorem follows from the previous theorem.

Step 1: Since \(X_1, X_2, \dots\) are independent, so \(\sigma(X_1, \dots, X_n)\) and \(\sigma(X_{n+1}, X_{n+2}, \dots)\) are independent, for all \(n\).

Step 2: For all \(A \in \mathcal{T}\), by definition, \(A \in \sigma(X_{n+1}, X_{n+2}, \dots)\) for all \(n\). So by step 1, for all \(S \in \sigma(X_1, \dots, X_n)\), \(A\) and \(S\) are independent. So \(\mathcal{T}\) is independent of \(\sigma(X_1, X_2, \dots)\).

Example 1

\[ \left\{ \lim_{N\to\infty} \frac{X_1+ \dots +X_N}{N} \text{ exists}\right\} \in \mathcal{T}\]


LHS is equivalent to

\[ \left\{ \limsup_{N\to\infty} \frac{X_1+ \dots +X_N}{N} - \liminf_{N\to\infty} \frac{X_1+ \dots +X_N}{N} =0 \right\}\]

This is equivalent to for any fixed \(n\)

\[ \left\{ \limsup_{N\to\infty} \frac{X_n+ \dots +X_N}{N} - \liminf_{N\to\infty} \frac{X_n+ \dots +X_N}{N} =0 \right\}\]

This is an element in \(\sigma(X_n, X_{n+1}, \dots)\), for all \(n\). \(\blacksquare\)

According to 0-1 law \[\Pr\left[\lim_{N\to\infty} \frac{X_1 + \dots + X_N}{N}\text{ exists}\right]= 0\text{ or } 1\]

2 Law of Large Numbers (LLN) #

The goal is to proof:

If \(X_1, X_2, \dots\) are independent and identically distributed (i.i.d.) and \(\E[X_i]=\mu\), then \(\frac{S_n}{n} \to \mu\). The convergence differs in the weak and the strong law of large numbers.

2.1 Weak Law of Large Numbers (WLLN) #

Definition We say \(\{Y_n\}_{n=1}^{\infty}\) converges in probability to \(Y\) if \(\forall \epsilon>0\), we have \[\Pr[|Y_n-Y|>\epsilon] \to 0 \text{ as } n \to \infty\]

Definition For \(p>0\), \(\{Y_n\}_{n=1}^{\infty}\) converges in \(L_p\) to \(Y\) if \[\E \left[ |Y_n-Y|^p \right]\to 0 \text{ as } n\to \infty\]

Fact If \(Y_n\to Y\) in \(L_p\) for some \(p>0\), then \(Y_n\to Y\) in probability.


By Markov Inequality,

\[\Pr[|Y_n-Y|>\epsilon]\leq \frac{\E[|Y_n-Y|^p]}{\epsilon^p}\to 0\]

Definition The covariance of two square-integrable random variables \(X, Y\) is defined as

\[\Cov(X,Y) = \E[(X-\E[X])(Y-\E[Y])] = \E[XY]-\E[X]\E[Y]\]

We say \(X\) and \(Y\) are uncorrelated if \(\Cov(X,Y)=0\).

Theorem (WWLN with finite second moment) If \(X_1, X_2, \dots\) are uncorrelated, and \(\E[X_i]=\mu_i\), \(\Var\left[X_i\right]\leq c < \infty\). Let \(S_n = \sum_{i=1}^n X_i\), then \[\frac{S_n-\sum_{i=1}^n \mu_i}{n} \to 0 \text{ in } L_2 \text{ and in probability}\]


\[\begin{split} &\E\left[ \left( \frac{S_n-\sum_{i=1}^n \mu_i}{n} \right)^2\right] \\ =&\Var[\frac{S_n}{n}]\\ =&\frac{\Var[S_n]}{n^2}\\ =&\frac{\sum_{i,j=1}^n \Cov(X_i, X_j)}{n^2}\\ =&\frac{\sum_{i=1}^n \Var[X_i]}{n^2} \leq \frac{c}{n} \to 0 \end{split} \]

The last equation uses the fact that \(\Cov(X_i, X_j)=0, i \neq j\).

Theorem If \((S_n)_{n=1}^{\infty}\) is a sequence of random variables with \(\sigma_n^2=\Var[S_n]\) and \(\frac{\sigma_n^2}{b_n^2} \to 0\), then \[\frac{S_n-\E[S_n]}{b_n} \to 0 \text{ in } L_2 \text{ and in probability}\]


\[\E \left[ \left( \frac{S_n-\E[S_n]}{b_n} \right)^2 \right] = \frac{\Var(S_n)}{b_n^2} = \frac{\sigma_n^2}{b_n^2} \to 0\]

Example (Coupon Collector’s Problem)

\(n\) cards, select one by one at random with replacements. Compute the average time one needs to get the whole collection. The answer is \(\Theta(n \log n)\).

Let \(n\geq1\) be an integer and \(X_1, X_2, \dots\) be independent and uniformly distributed on \(\{1, \dots, n\}\). Define for \(k=0,1, \dots, n\) \[\tau_k^n = \inf\{m\geq0 : |\{X_1, \dots, X_m\}| = k\}\] where \(|\cdot|\) denotes the size of a set. Let \(T_n=\tau_n^n\), we have \[\frac{T_n}{n \log n} \to 1 \text{ in probability}\]


Let \(Y_k = \tau_k^n - \tau_{k-1}^n, k=1,2,\dots,n\), then \(Y_k\) follows a geometric distribution with parameter \(\frac{n-(k-1)}{n}\) and

\[\E[Y_k] = \frac{n}{n-k+1}, \quad \Var[Y_k] = \frac{k-1}{n}\cdot \frac{n^2}{(n-k+1)^2}\]

Moreover, \(Y_1, \dots, Y_n\) are independent and \(T_n = \sum_{k=1}^{n} Y_k\), therefore \[\E[T_n] = n \cdot (1+\frac{1}{2}+ \dots + \frac{1}{n}) \sim n \log n\]


\[\Var[T_n] = \sum_{k=1}^n \Var[Y_k] \leq \sum_{k=1}^{n} \frac{n^2}{(n-k+1)^2} \leq Cn^2\]

for some constant \(C\). Choosing \(b_n = n\log n\) in the above theorem, we obtain

\[\frac{T_n - n \log n}{n \log n} \to 0 \text{ in probability}\]

Theorem (WLLN for Triangular Arrays) Assume for each \(n \geq 1\), \(\{X_{n,k} : 1\leq k \leq n\}\) are independent. Let \(b_n > 0\) and \(\overline{X}_{n,k} = X_{n,k} \1\{|X_{n,k}|\leq b_n\}\). If

  1. \(\sum\limits_{k=1}^n \Pr[|X_{n,k}|> b_n] \to 0\), and
  2. \(b_n^{-2} \sum\limits_{k=1}^n \E\left[(\overline{X}_{n,k})^2\right] \to 0\)

then with \(S_n = \sum_{k=1}^{n} X_{n,k}\) and \(a_n = \sum_{k=1}^{n} \E[\overline{X}_{n,k}]\),

\[\frac{S_n-a_n}{b_n}\to 0 \text{ in probability}\]


Let \(\overline{S}_n = \sum_{k=1}^{n} \overline{X}_{n,k}\), we have

\[\begin{split} &\Pr\left[ \left| \frac{S_n-a_n}{b_n} \right| > \epsilon \right]\\ \leq& \Pr \left[ \left\{ \bigcup_{k=1}^n \left[ |X_{n,k}| > b_n \right] \right\} \cup \left\{ \left| \frac{\overline{S}_n-a_n}{b_n} \right| > \epsilon \right\}\right]\\ \leq& \sum_{k=1}^{n} \Pr[|X_{n,k}|> b_n] + \Pr\left[ \left| \frac{\overline{S}_n - a_n}{b_n} \right| > \epsilon \right] \end{split} \]

The first inequality is due to event inclusion. The second is due to Union Bound.

The first term tends to 0 by (1). For the second term, by Chebyshev Inequality:

\[\begin{split} \Pr\left[ \left| \frac{\overline{S}_n - a_n}{b_n} \right| > \epsilon \right] &\leq \frac{1}{\epsilon^2} \Var\left[\frac{\overline{S}_n}{b_n}\right] \text{ (Chebyshev)}\\ &= \frac{1}{\epsilon^2 b_n^2} \sum_{k=1}^{n} \Var\left[ \overline{X}_{n,k}\right]\\ &\leq\frac{1}{\epsilon^2 b_n^2} \sum_{k=1}^{n} \E\left[ (\overline{X}_{n,k})^2\right] \end{split} \] Which tends to 0 by (2).

Theorem (WLLN without Moment Assumption) Let \(X_1, X_2, \dots\) be i.i.d. with

\[x\cdot \Pr[|X_1| > x] \to 0, \text{ as } x\to \infty\]

Let \(S_n= \sum_{i=1}^{n} X_i\), \(\mu_n = \E[X_1 \1_{\{|X_1| \leq n\}}]\). Then

\[\frac{S_n}{n}-\mu_n \to 0 \text{ in probability.}\]


In the previous theorem, let \(X_{n,k}=X_k\) and \(b_n=n\). Then \(\overline{X}_{n,k} = X_k \1_{\{|X_k|\leq n\}}\). We have

\[\sum_{k=1}^{n} \Pr[|X_{n,k}| > b_n] = \sum_{k=1}^{n} \Pr[|X_k| > n] = n\cdot \Pr[|X_1|> n] \to 0\]


\[b_n^{-2} \sum_{k=1}^{n} \E\left[\overline{X}_{n,k}^2\right] = n^{-2} \sum_{k=1}^{n} \E \left[ \left( X_k \1_{\{|X_k| \leq n\}} \right)^2 \right] = n^{-1} \E \left[ \overline{X}_1^2 \1_{\{|X_1|\leq n\}} \right]\]

From Lemma 2.2.8 of the textbook: If \(Y\geq 0\), \(p>0\), then \(\E[Y^p] = \int_{0}^{\infty} py^{p-1} \Pr[Y>y]\;dy\). We have

\[n^{-1} \E \left[ \overline{X}_1^2 \1_{\{|X_1|\leq n\}} \right] = \underbrace{n^{-1} \int_{0}^{n}}_{\text{average}} 2\underbrace{y \Pr[|X_1|>y]}_{0 \text{ as } y\to\infty}\; dy \to 0\]

From the previous theorem, the result follows.

Theorem (WLLN with Finite First Moment) Let \(X_1, X_2, \dots\) be i.i.d. with \(\E[|X_1|] < \infty\). Let \(S_n = \sum_{i=1}^{n} X_i\), \(\mu = \E[X_1]\), then

\[\frac{S_n}{n} \to \mu \text{ in probability.}\]


By Dominated Convergence Theorem (DCT),

\[x \Pr[|X_1|>x] \leq \E \left[ |X_1| \1_{\{|X_1|>x\}} \right] \to 0\]

Again by DCT,

\[\E \left[ X_1 \1_{\{|X_1|\leq n\}} \right] \to \mu\]

The theorem follows from the previous theorem.

2.2 Strong Law of Large Numbers (SLLN) #

Definition A sequence of R.V.s \(Y_n\) converges to \(Y\) almost surely (a.s.) if \[P \left( \left\{ \omega: \lim_{n\to\infty} Y_n(\omega)=Y(\omega) \right\} \right) = 1\]

Fact \(Y_n \to Y \text{ a.s. } \Longrightarrow Y_n\to Y \text{ in probability}\)


For all \(\epsilon>0\),

\[\begin{split} &\limsup_{n\to\infty} \Pr[|Y_n-Y| > \epsilon]\\ \leq& \limsup_{n\to\infty} P \left( \bigcup_{m=n}^{\infty} \{|Y_m-Y|>\epsilon\} \right)\\ =& P \left( \bigcap_{n=1}^{\infty}\bigcup_{m=n}^{\infty} \left\{ |Y_m-Y|>\epsilon \right\} \right)\\ \leq& P \left( \left\{ \omega: Y_n(\omega)\to Y(\omega) \right\} \right)\\ =& 0 \end{split} \]

Definition Let \(A_1, A_2, \dots \in \mathcal{F}\) be a sequence of events. The event \(\{A_n \text{ i.o.}\}\), called \(A_n\) occurs infinitely often, is defined as:

\[\begin{split} \{A_n \text{ i.o.}\} &= \{\omega \in \Omega: \omega \text{ is in infinitely many } A_i\text{’s}\}\\ &=\bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k\\ &= \limsup_{n\to\infty} A_n\\ &= \left\{\omega \in \Omega: \exists \text{ a subsequence } n_k, \text{ s.t. } \omega \in A_{n_k}, \forall k\right\} \end{split} \]

Fact \(Y_n \to Y\text{ a.s.} \Longleftrightarrow \forall \epsilon>0, \Pr \left[ |Y_n-Y|>\epsilon \text{ i.o.} \right]=0\).


The RHS is equivalent to \(\Pr[|Y_n-Y|\leq \epsilon \text{ i.o.}] = 1\), which is equivalent to

\[\Pr \left[ \limsup_{n\to\infty} |Y_n-Y|\leq \epsilon \right]=1\]

“\(\Longrightarrow\)”: It follow that the event \(\{\omega: Y_n(\omega)\to Y(\omega)\}\) implies \(\{\omega: \limsup_{n\to\infty}|Y_n(\omega)-Y(\omega)|\leq \epsilon\}\).

“\(\Longleftarrow\)”: We need to show \(\Pr[\limsup_{n\to\infty} |Y_n-Y| = 0] = 1\):

\[\begin{split} &\Pr \left[ \limsup_{n\to\infty} |Y_n-Y| = 0 \right]\\ =& \Pr \left[ \lim_{\epsilon \downarrow 0} \left\{ \limsup_{n\to\infty} |Y_n-Y|\leq\epsilon \right\} \right]\\ =& \lim_{\epsilon \downarrow 0} \underbrace{\Pr \left[ \limsup_{n\to\infty} |Y_n-Y|\leq\epsilon \right]}_{=1}\\ =&1 \end{split} \]

The second equality is due to the continuity of probability measure.

Theorem (Borel-Cantelli Lemma)

  1. If \(\sum\limits_{n=1}^{\infty} P(A_n)<\infty\), then \(P(A_n \text{ i.o.})=0\);
  2. If \(\sum\limits_{n=1}^{\infty} P(A_n)=\infty\) and \(\{A_n\}_{n=1}^{\infty}\) are independent, then \(P(A_n \text{ i.o.})=1\).


  1. We have \[P(A_n \text{ i.o.}) = P \left( \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k \right) = \lim_{n \to \infty} P \left( \bigcup_{k=n}^{\infty} A_k \right) \leq \lim_{n\to\infty} \sum_{k=n}^{\infty} P(A_k) = 0\]

    The second equality is due to the continuity. The last equality is because of the convergence of \(\sum_{n=1}^{\infty} P(A_n)\).

  2. We have \[\begin{split} P(A_n \text{ i.o.}) &= 1-P \left( \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k^c \right)\\ &= 1- \lim_{n \to \infty} P \left( \bigcap_{k=n}^{\infty} A_k^c \right)\\ &= 1-\lim_{n \to \infty} \prod_{k=n}^{\infty} P(A_k^c) \text{\quad (by independence)}\\ &= 1-\lim_{n \to \infty} \prod_{k=n}^{\infty} (1-P(A_k)) \\ &\geq 1-\lim_{n \to \infty} \prod_{k=n}^{\infty} e^{-P(A_k)} \\ &= 1-\lim_{n \to \infty} e^{-\sum_{k=n}^{\infty} P(A_k)} \\ &= 1 \end{split} \]

    The inequality is due to \(1+x\leq e^x\).

Theorem (SLLN, easier version)

Let \(X_1, X_2, \dots\) be i.i.d. with \(\E[X_i]=\mu\), \(\E[X_i^4]<\infty\). Let \(S_n = \sum_{i=1}^{n} X_i\), then

\[\frac{S_n}{n} \to \mu \text{ a.s.}\]


WLOG, assume \(\mu=0\), otherwise consider \(X_i-\mu\).

According to Borel-Cantelli Lemma (1), need to show

\[\sum_{n=1}^{\infty} \Pr \left[ \left| \frac{S_n}{n} \right| > \epsilon\right] <\infty, \forall \epsilon>0\]

Notice that

\[\begin{split} \Pr \left[ \left| \frac{S_n}{n} \right| > \epsilon \right] &=\Pr \left[ \left| \frac{S_n}{n} \right|^4 > \epsilon^4 \right] \\ &\stackrel{\text{Markov}}{\leq} \frac{\E[S_n^4]}{\epsilon^4 n^4} = \frac{\sum_{i,j,k,l=1}^{n} \E[X_i X_j X_k X_l]}{\epsilon^4 n^4}\\ &\leq \frac{\sum_{i,j=1}^{n} \E[X_i^2 X_j^2]}{n^4 \epsilon^4}\\ &\leq \frac{C}{\epsilon^4 n^2} \end{split} \]

It converges when sum over \(n\) to infinity.

Some useful theorems (applications of B-C lemma)

Theorem If \(Y_n \to Y \text{ a.s.}\), then \(Y_n \to Y\) in probability (already proved above). If \(Y_n\to Y\) in probability, then there exists a subsequence \(n_k\) such that \(Y_{n_k}\to Y \text{ a.s.}\).


Since \(Y_n \to Y\) in probability, we can choose a subsequence \(n_1<n_2<n_3<\dots\) such that

\[\Pr \left[ |Y_{n_k}-Y| > \frac{1}{2^{k}} \right] \leq \frac{1}{2^{k}}\]

So we have

\[\sum_{k=1}^{\infty} \Pr \left[ |Y_{n_k}-Y| > \frac{1}{2^{k}} \right] < \infty\]

By B-C lemma (1), \(\Pr[|Y_{n_k}-Y| > \frac{1}{2^{k}} \text{ i.o.}]=0\). This implies \(\Pr[|Y_{n_k}-Y| > \epsilon \text{ i.o.}]=0, \forall \epsilon>0\), so almost sure convergence holds.

Theorem If \(A_1, A_2, \dots\) are pairwise independent and \(\sum_{i=1}^{\infty} \Pr[A_i]=\infty\). Then as \(n\to\infty\)

\[\frac{\sum_{i=1}^n \1_{A_i}}{\sum_{i=1}^n \Pr[A_i]} \to 1 \text{ a.s.}\]


Let \(S_n=\sum_{i=1}^{n} \1_{A_i}\), \(b_n = \sum_{i=1}^{n} \Pr[A_i]\). Then by Chebyshev Inequality, for all \(\epsilon>0\)

\[\begin{split} \Pr\left[ \left| \frac{S_n}{b_n}-1 \right| > \epsilon\right] &\leq \frac{\Var\left[\frac{S_n}{b_n}\right]}{\epsilon^2} = \frac{\sum_{i=1}^{n} \Var[\1_{A_i}]}{\epsilon^2 b_n^2} = \frac{\sum_{i=1}^{n} \Pr[A_i](1-\Pr[A_i])}{\epsilon^2 b_n^2}\\ &\leq \frac{\sum_{i=1}^{n} \Pr[A_i]}{\epsilon^2 \left( \sum_{i=1}^{n} \Pr[A_i] \right)^2} = \frac{1}{\epsilon^2 b_n} \end{split} \]

Choose a subsequence \(n_k = \min \{m: b_m \geq k^2\}\). This is possible because \(\sum_{i=1}^{\infty} \Pr[A_i]=\infty\). Then

\[\sum_{k=1}^{\infty} \Pr\left[ \left| \frac{S_{n_k}}{b_{n_k}} -1 \right| > \epsilon\right] \leq \sum_{k=1}^{\infty} \frac{1}{\epsilon^2 b_{n_k}} \leq \sum_{k=1}^{\infty} \frac{1}{\epsilon^2 k^2} < \infty\]

By B-C lemma (1), \(\frac{S_{n_k}}{b_{n_k}} \to 1 \text{ a.s.}\).

By the definition of \(n_k\), it is easy to see that \(n_1 < n_2 < \dots\), so for any sufficiently large integer \(m\), there exists \(k\) such that \(n_{k-1}\leq m \leq n_k\). And

\[\underbrace{\frac{b_{n_{k-1}}}{b_{n_{k}}}}_{\to 1}\cdot \underbrace{\frac{S_{n_{k-1}}}{b_{n_{k-1}}}}_{\to 1 \text{ a.s.}} = \frac{S_{n_{k-1}}}{b_{n_{k}}} \leq \frac{S_m}{b_{m}} \leq \frac{S_{n_k}}{b_{n_{k-1}}} = \underbrace{\frac{S_{n_k}}{b_{n_k}}}_{\to 1 \text{ a.s.}} \cdot \underbrace{\frac{b_{n_k}}{b_{n_{k-1}}}}_{\to 1}\]

Both sides tend to 1 a.s., so \(\frac{S_m}{b_{m}}\) in the middle must tend to 1 a.s..

Theorem (SLLN) Let \(X_1, X_2, \dots\) be i.i.d. with \(\E[|X_i|]<\infty\), \(\E[X_i]=\mu\). Let \(S_n=\sum_{i=1}^{n} X_i\), then

\[\frac{S_n}{n}\to\mu \text{ a.s.}\]


WLOG, assume \(X_i\geq0\), otherwise, consider \(X_i=X_i^+ - X_i^-\).

Step 1 (truncation): Let \(Y_k=X_k \1_{\{|X_k|\leq k\}}\), \(T_n=\sum_{i=1}^{n} Y_i\), then

\[\sum_{i=1}^{\infty} \Pr[X_i \neq Y_i] = \sum_{i=1}^{\infty} \Pr[|X_1|>i]\leq \int_{0}^{\infty} \Pr[|X_1|>x] \dd{x} = \E[|X_1|]<\infty\]

By B-C lemma (1), we have \(\Pr[X_i \neq Y_i \text{ i.o.}] = 0\).

Therefore it suffices to prove \(\frac{T_n}{n}\to \mu \text{ a.s.}\) (this step is not obvious, but can be proved).

Step 2 (2nd moment calculation): Fix \(\alpha>1\), choose \(k(n)= \lfloor \alpha^n \rfloor \geq \frac{\alpha^n}{2}\).

We will show \(\frac{T_{k(n)}}{k(n)}\to\mu \text{ a.s.}\).

First claim (prove it later) that \(\forall \epsilon>0\),

\[\sum_{n=1}^{\infty} \Pr \left[ \left| \frac{T_{k(n)}-\E[T_{k(n)}]}{k(n)} \right| >\epsilon\right] <\infty\]

From claim and B-C lemma (1), we have

\[\frac{T_{k(n)}-\E[T_{k(n)}]}{k(n)} \to 0 \text{ a.s.}\]


\[\frac{\E[T_{k(n)}]}{k(n)} = \frac{\E \left[ \sum_{m=1}^{k(n)} Y_m \right]}{k(n)} = \frac{\sum_{m=1}^{k(n)} \E[X_1 \1_{\{|X_1|\leq m\}}]}{k(n)} \to \mu\]

Where the convergence is due to \(\E[X_1 \1_{\{|X_1|\leq m\}}] \to \mu\) as \(m\to\infty\) (Dominated Convergence Theorem (DCT)).


\[\frac{T_{k(n)}}{k(n)}\to\mu \text{ a.s.}\]

Step 3 (subsequence method): For \(k(n) \leq m < k(n+1)\), we have

\[\underbrace{\frac{k(n)}{k(n+1)}}_{\to \frac{1}{\alpha}} \underbrace{\frac{T_{k(n)}}{k(n)}}_{\to \mu} \leq \frac{T_m}{m} \leq \underbrace{\frac{k(n+1)}{k(n)}}_{\to \alpha} \underbrace{\frac{T_{k(n+1)}}{k(n+1)}}_{\to \mu}\]

So \(\limsup_{m\to\infty} \frac{T_m}{m} \leq \mu \alpha \text{ a.s.}\), and \(\liminf_{m\to\infty} \frac{T_m}{m} \geq \mu \frac{1}{\alpha} \text{ a.s.}\), for any \(\alpha>1\).

Letting \(\alpha \downarrow 1\), we have \(\lim_{m\to\infty} \frac{T_m}{m} = \mu \text{ a.s.}\).


Proof of the claim:

\[\begin{split} &\sum_{n=1}^{\infty} \Pr \left[ \left| \frac{T_{k(n)}-\E[T_{k(n)}]}{k(n)} \right| >\epsilon\right] <\infty\\ \leq& \epsilon^{-2} \sum_{n=1}^{\infty} \frac{\Var\left[T_{k(n)}\right]}{k^2(n)} = \epsilon^{-2} \sum_{n=1}^{\infty} \frac{1}{k^2(n)} \sum_{m=1}^{k(n)} \Var\left[Y_m\right]\\ =& \epsilon^{-2} \sum_{m=1}^{\infty} \Var\left[Y_m\right] \sum_{n: k(n)\geq m} \frac{1}{k^2(n)} \quad \text{(Fubini)}\\ \leq& \epsilon^{-2} \sum_{m=1}^{\infty} \Var\left[Y_m\right] 4\cdot \sum_{n: k(n)\geq m} \alpha^{-2n}\\ \leq& \frac{4}{\epsilon^2 (1-\alpha^{-2})} \sum_{m=1}^{\infty} \frac{\Var\left[Y_m\right]}{m^2} \end{split} \]

Note that

\[\begin{split} \sum_{m=1}^{\infty} \frac{\Var\left[Y_m\right]}{m^2} &\leq \sum_{m=1}^{\infty} \frac{\E[Y^2_m]}{m^2}\\ &= \sum_{m=1}^{\infty} \frac{\int_{0}^{m} 2y \Pr[|Y_m|>y] \dd{y}}{m^2}\\ &= \sum_{m=1}^{\infty} \frac{\int_{0}^{\infty} 2y \1_{\{y < m\}} \Pr[|X_1|>y] \dd{y}}{m^2}\\ &= \int_{0}^{\infty} \Pr[|X_1|>y] \cdot 2y \cdot \left( \sum_{m:m\geq y} \frac{1}{m^2} \right) \dd{y} \quad \text{(Fubini)}\\ &\leq C \int_{0}^{\infty} \Pr[|X_1|>y] \dd{y} = C\E[|X_1|] < \infty \end{split} \]

The last inequality is due to \(\sum_{m:m\geq y} \frac{1}{m^2} \approx \int_{y}^{\infty} \frac{1}{x^2} \dd{x} \approx \frac{1}{y}\).

2.3 Convergence of Random Series #


3 Central Limit Theorems (CLT) #

3.1 Convergence of Distributions #

Definition (Distribution) Given a R.V. \(X\) on \((\Omega, \mathcal{F}, P)\), its distribution is the function \(\mu\) defined on \(\mathcal{B}\) (the Borel subsets of \(\RR\)) by \[\mu(B) = \Pr[X \in B] = P(X^{-1}(B)), \quad B \in \mathcal{B}\]

The distribution of \(X\) is usually denoted as \(\mu\) or \(\mathcal{L}(X)\). We also write \(X \sim \mu\) to indicate that \(\mu\) is the distribution of \(X\).

\(\mathcal{L}(X)\) is determined by the cumulative distribution function of \(X\):

\[F_X(x) = \Pr[X\leq x]\]

Definition \(F_n\) is a sequence of distribution functions (d.f.). \(F\) is a d.f.. \(F_n\) is said to converge weakly to \(F\), denoted by \(F_n \Rightarrow F\), if \(F_n (y) \to F (y)\) for every continuity point of \(F\).

Definition \(X_n\) is a sequence of random variables. \(X\) is a random variable. \(X_n\) converges to \(X\) weakly (or in distribution), denoted by \(X_n \Rightarrow X\) (or \(X_n \stackrel{d}{\to} X\)), if \(F_{X_n} \Rightarrow F_{X}\).

Fact \(X_n \to X\) in probability implies \(X_n \Rightarrow X\).


Assume \(X_n \sim F_n\) and \(X \sim F\). Let \(x\) be a continuity point of \(F\), then for all \(\epsilon>0\) \[F_n(x) = \Pr[X_n \leq x]\leq \Pr[|X_n-X|>\epsilon] + \Pr[X\leq X+\epsilon]\]

By condition, the first term in the RHS tends to 0. The proof follows by letting \(\epsilon\to0\).

Theorem (Skorokhod’s Theorem) If \(F_n \Rightarrow F_{\infty}\), then we can construct R.V.s \(Y_n\), \(1\leq n \leq \infty\) on the same probability space such that \(Y_n\) has distribution function \(F_n\) for \(1\leq n \leq\infty\) and \(Y_n \to Y_{\infty}\) a.s..


Let \(\Omega=(0,1)\), \(\mathcal{F}=\mathcal{B}\), \(P\) is Lebesgue measure. Define \(Y_n: \Omega\to \RR\) to be



\[F_n^{-1}(\omega) = \inf\{y: F_n(y)\geq \omega\} = \sup\{y: F_n(y)<\omega\}\]

We have proved before that \(Y_n \sim F_n\).

Next we need to show \(F_n^{-1}(\omega) \to F_{\infty}^{-1}(\omega)\) a.s..

Theorem (Slutsky’s Theorem) If \(X_n \Rightarrow X\) and \(Y_n \to c\) in probability, where \(c\) is a constant. Suppose the random variables are defined on the same probability space. Then \[X_n + Y_n \Rightarrow X + c,\quad X_n Y_n \Rightarrow cX\] (Proof skipped)

How to prove \(X_n \Rightarrow X\)?

Method 1: Direct computation of \(F_{X_n}\), typically difficult.

Method 2:

Theorem (A sufficient and necessary condition for convergence in distribution) \(X_n \Rightarrow X\) if and only if for all \(g: \RR\to\RR\) bounded and continuous, we have \(\E[g(X_n)] \to \E[g(X)]\).


Assume \(X_n \sim F_n\), \(X \sim F\).

“\(\Longrightarrow\)” By Skorokhod’s theorem, we can construct \(Y_n, Y\) which have distribution \(F_n, F\), respectively. And \(Y_n \to Y\) a.s.. Since \(g\) is continuous, so \(g(Y_n) \to g(Y)\) a.s.. We have

\[E[g(X_n)] = \E[g(Y_n)] \stackrel{\text{BCT}}{\longrightarrow} \E[g(Y)] = \E[g(X)]\]

“\(\Longleftarrow\)” For any continuity point \(x\) of \(F\) and \(\epsilon>0\), define a continuous function \[g(y)=\left\{ \begin{array}{l@{\quad:\quad}l} 1 & \text{if } y\leq x\\ 0 & \text{if } y\geq x+\epsilon\\ \text{linear} & \text{if } x\leq y \leq x+\epsilon \end{array}\right. \]

We have

\[\begin{split} F_n(x) &= \Pr\left[X_n \leq x\right] = \E \left[ \1_{\{X_n \leq x\}} \right] \leq \E[g(X_n)]\\ &\leq \E[g(X)] \leq \E \left[ \1_{\{X \leq x+ \epsilon\}} \right]\\ &= F(x+\epsilon) \downarrow F(x) \text{ as } \epsilon \downarrow 0. \end{split} \]

Therefore \(\limsup_{n\to\infty} F_n(x)=F(x)\). Lower bound is proved similarly.

Theorem (CLT assuming finite third moment) Let \(X_1, X_2, \dots\) be a sequence of i.i.d. R.V.s such that \(\E[X_i]=\mu\), \(\E[X_i^2]=\sigma^2\) and \(\E[|X_i|^3]<\infty\). Let

\[W_n = \sum_{i=1}^{n} \frac{X_i-\mu}{\sigma \sqrt{n}}\]


\[W_n \Rightarrow Z \sim N(0,1)\]

(Roughly, for large \(n\), \(S_n=X_1+ \dots + X_n \Rightarrow N(n\mu, n \sigma^2)\).)

4 Martingales #

4.1 Conditional Expectation #

Definition #


Expectation of \(X\) conditioning on \(Z=z_i\):

\[\E[X \mid Z=z_i] = \sum_{j=1}^{\infty} x_j\Pr[X=x_j \mid Z=z_i]\]


Roll a die, \(\Omega=\{1,2,3,4,5,6\}\). \(X\): resulting number, \(Z\): 1 if odd, 2 if even. Then

\[\E[X \mid Z=1] = 3 \qquad \E[X \mid Z=2]=4\]

We now consider \(\E[X\mid Z]\), regarded as a R.V. on \(\{Z=z_i\}\), takes value \(\E[X \mid Z=z_i], \forall i\).

Note that:

  1. \(\E[X \mid Z]\) is measurable with respect to \(\sigma(Z)\).
  2. \(\E[X\mid Z]\) on \(\{Z=z_i\}\) equals the expectation of \(X\) on \(\{Z=z_i\}\). Denote \(\mathcal{A} = \sigma(Z)\).

Definition Let \((\Omega, \mathcal{F}, P)\) be a probability space, and let \(X\) be a R.V. that is \(\mathcal{F}/\mathcal{B}\)-measurable and \(\E[|X|]<\infty\). \(\mathcal{A}\) is a \(\sigma\)-field and \(\mathcal{A} \subset \mathcal{F}\). Then function \(\E[X \mid \mathcal{A}]: \Omega \to \RR\) called the conditional expectation of \(X\) given \(\mathcal{A}\), if

  1. \(\E[X \mid \mathcal{A}]\) is measurable w.r.t. \(\mathcal{A}\), i.e., it is \(\mathcal{A}/\mathcal{B}\)-measurable.
  2. \(\forall A \in \mathcal{A}\), \(\E[X \1_A] = \E[\E[X\mid \mathcal{A}]\cdot \1_A]\)

\(\E[X \mid Z]\) is defined as \(\E[X \mid \sigma(Z)]\).

In other words, \(Y=\E[X \mid \mathcal{A}]\) if and only if \(Y\) is \(\mathcal{A}\)-measurable, and for all \(A \in \mathcal{A}\),

\[\int_{A} Y \dd{P} = \int_{A} X \dd{P}\]

The above definition is a bit strange. It is easier to understand it through an discrete example.

In the above discrete setting, if \(A=\{Z=z_i\}\), then (2) becomes

\[\E\left[X \1_{\{Z=z_i\}}\right] = \E\left[c_i \cdot \1_{\{Z=z_i\}}\right]\]

where \(c_i\) is the value of \(\E[X\mid \mathcal{A}]\) on event \(A\). Therefore

\[\sum_{j=1}^{\infty} x_j \Pr[X=x_j,Z=z_i] = c_i \Pr[Z=z_i]\]

So we have

\[c_i = \sum_{j=1}^{\infty} x_j \Pr[X=x_j \mid Z=z_i] = \E[X \mid Z=z_i]\]

“Uniqueness” of the conditional expectation:

Proof of a.s. Uniqueness

Suppose \(Y_1\) and \(Y_2\) both satisfy definition (1) and (2). Let \(A=\{Y_1-Y_2 \geq \epsilon\} \in \mathcal{A}\) for some constant \(\epsilon>0\). From (2) we have

\[\begin{split} 0&=\E[X \1_A] - \E[X \1_A]\\ &= \E[Y_1 \1_A] - \E[Y_2 \1_A]\\ &= \E[(Y_1-Y_2)\1_A]\\ &=\E[\epsilon\1_A]\\ &\geq\epsilon \Pr[A] \end{split} \]

So \(\Pr[A]=0\). So \(\Pr[Y_1>Y_2] = \lim_{\epsilon \downarrow} \Pr[Y_1-Y_2\geq \epsilon] = 0\).

Similarly, \(\Pr[Y_1<Y_2]=0\). Therefore \(Y_1=Y_2\) a.s.

Properties #

  1. Let \(A=\Omega\) in definition (2), we have \[\E[X] = \E[\E[X \mid \mathcal{A}]]\]

  2. If \(X\) is \(\mathcal{A}\) measurable, then \(X\) always satisfies definition (2). By the uniqueness, we have \(\E[X \mid \mathcal{A}] = X\).

  3. \(\E[|\E[X\mid \mathcal{A}]|] \leq \E[|X|]\)


    Let \(Y=\E[X\mid \mathcal{A}]\), \(A=\{Y>0\} \in \mathcal{A}\). From definition (2), we have

    \[\E \left[ Y \1_{\{Y>0\}} \right] = \E \left[ X \1_{\{Y>0\}} \right] \leq \E \left[ |X| \1_{\{Y>0\}} \right]\]


    \[\E \left[ -Y \1_{\{Y\leq0\}} \right] = \E \left[ -X \1_{\{Y\leq0\}} \right] \leq \E \left[ |X| \1_{\{Y\leq0\}} \right]\]


    \[\E[|Y|] = \E \left[ Y \1_{\{Y>0\}} \right] + \E \left[ -Y \1_{\{Y\leq0\}} \right] \leq \E \left[ |X| \1_{\{Y>0\}} \right] + \E \left[ |X| \1_{\{Y\leq0\}} \right] = \E[|X|]\]

  4. If \(\sigma(X)\) and \(\mathcal{A}\) are independent, then \(\E[X \mid \mathcal{A}] = \E[X]\).


    For all \(A \in \mathcal{A}\), we have

    \[\E \left[ X \1_A \right] = \E[X] \Pr[A] = \E \left[ (\E[X]) \1_{A} \right]\]

    The result follows by the uniqueness.

  5. If \(\E[|X|] < \infty\), \(\E[|Y|] < \infty\), \(a\) is a constant, then

    \[\E[aX+Y \mid \mathcal{A}] = a\E[X\mid \mathcal{A}] + \E[Y \mid \mathcal{A}]\]

    (Conditioned linearity, proof skipped.)

  6. If \(X\leq Y\), then \(\E[X \mid \mathcal{A}]\leq \E[Y \mid \mathcal{A}]\)

    (Conditioned monotonicity, proof skipped.)

  7. (Conditioned DCT) If \(|X_n| \leq Y\) for all \(n\), \(\E[|Y|] \leq \infty\), and \(X_n \to X\) a.s., then

    \[\E[X_n \mid \mathcal{A}] \to \E[X \mid \mathcal{A}] \text{ a.s.}\]

Theorem If \(X \in \mathcal{A}\), \(\E[|XY|]<\infty\), \(\E[|Y|]<\infty\), then

\[\E[XY \mid \mathcal{A}] = X\E[Y \mid \mathcal{A}]\]

Theorem (Tower Property) If \(\mathcal{A}_1 \subset \mathcal{A}_2\), then

  1. \(\E[\E[X \mid \mathcal{A}_1] \mid \mathcal{A}_2] = \E[X \mid \mathcal{A}_1]\)
  2. \(\E[\E[X \mid \mathcal{A}_2] \mid \mathcal{A}_1] = \E[X \mid \mathcal{A}_1]\)

I.e., the smaller \(\sigma\)-field always wins.


(1) Let \(Y = \E[X \mid \mathcal{A}_1]\), \(Z=\E[\E[X \mid \mathcal{A}_1] \mid \mathcal{A}_2]\). By definition, for all \(A \in \mathcal{A}_1\),

\[\int_{A} X \dd{P} = \int_A Y \dd{P}\]

Similarly, for all \(A \in \mathcal{A}_2\),

\[\int_{A} Y \dd{P} = \int_A Z \dd{P}\]

Since \(\mathcal{A}_1 \subset \mathcal{A}_2\), we have for all \(A \in \mathcal{A}_1\),

\[\int_{A} X \dd{P} =\int_{A} Y \dd{P} = \int_A Z \dd{P}\]

The proof follows by definition.

(2) Let \(Y = \E[X \mid \mathcal{A}_1]\), \(Z=\E[\E[X \mid \mathcal{A}_2] \mid \mathcal{A}_1]\), \(W = \E[X \mid \mathcal{A}_2]\). By definition, for all \(A \in \mathcal{A}_1\),

\[\int_{A} X \dd{P} = \int_A Y \dd{P}, \quad \text{and } \int_{A} W \dd{P} = \int_A Z \dd{P}\]

Similarly, for all \(A \in \mathcal{A}_2\),

\[\int_{A} X \dd{P} = \int_A W \dd{P}\]

The proof follows by definition.

Theorem (Triangular Inequality) Suppose \(\E[X^2]<\infty\), then for any \(Y \in \mathcal{A}\) with \(\E[Y^2]<\infty\), we have

\[\E\left[(X-\E[X \mid \mathcal{A}])^2\right] \leq \E\left[(X-Y)^2\right]\]

In other words, \(\E[X\mid \mathcal{A}]\) is the variable \(Y\) that minimizes the mean square error \(\E[(X-Y)^2]\).

This result gives a “geometric interpretation” of \(\E(X \mid \mathcal{A})\). \(L^2(\mathcal{F}) = \{Y \in \mathcal{F}: \E[Y^2]<\infty\}\) is a Hilbert space. In this case, \(\E[X \mid \mathcal{A}]\) is the projection of \(X\) onto \(L^2(\mathcal{F})\). That is, the point in the subspace closest to \(X\).


\[\begin{split} \E\left[(X-Y)^2\right] &= \E\left[(X-\E[X \mid \mathcal{A}]+\E[X \mid \mathcal{A}]-Y)^2\right]\\ &= \E\left[(X-\E[X \mid \mathcal{A}])^2\right] + \E\left[(Y-\E[X \mid \mathcal{A}])^2\right] - 2\E\left[(X-\E[X \mid \mathcal{A}])(Y-\E[X \mid \mathcal{A}])\right]\\ &\geq \E\left[(X-\E[X \mid \mathcal{A}])^2\right] \end{split} \]

where the third term equals 0 because:

Denote \(Z \triangleq Y-\E[X \mid \mathcal{A}] \in \mathcal{A}\), then

\[\begin{split} &\E\left[(X-\E[X \mid \mathcal{A}])(Y-\E[X \mid \mathcal{A}])\right] \\ =& \E\left[(X-\E[X \mid \mathcal{A}]) Z \right] \\ =& \E[XZ] - \E[Z \E[X \mid \mathcal{A}]] \\ =& \E[XZ] - \E[\E[ZX \mid \mathcal{A}]] \\ =& \E[XZ] - \E[ZX] \\ =& 0 \end{split} \]

It is easy to see \(\E[(X-Y)^2]\) is minimized when \(Z=0\).

Example #

4.2 Martingales Definitions and Properties #

Definition Let \(\{\mathcal{F}_n, n\geq 1\}\) be a sequence of increasing \(\sigma\)-fields (i.e., a filtration: \(\mathcal{F}_1 \subset \mathcal{F}_2 \subset \mathcal{F}_3 \subset \dots\)). Let \(\{S_n, n\geq1\}\) be a sequence of random variables. \(\{S_n\}\) is called a martingale with respect to \(\{\mathcal{F}_n\}\) if:

  1. \(\E[|S_n|]<\infty, \forall n\geq1\)
  2. \(S_n \in \mathcal{F}_n, \forall n\geq1\)
  3. \(\E[S_{n+1} \mid \mathcal{F}_{n}] = S_{n}, \forall n\geq1\)


  1. We write \(\{S_n, \mathcal{F}_n\}\) is a martingale for simplicity.

  2. \(\mathcal{F}_n\) is usually taken as \(\sigma(S_1,S_2, \dots, S_n)\).

  3. \(\E[S_{n+m} \mid \mathcal{F}_n]=S_n, \forall n\geq1, m\geq1\).


    \[\begin{split} \E[S_{n+2}\mid \mathcal{F}_n]&=\E[\E[S_{n+2}\mid \mathcal{F}_{n+1}] \mid \mathcal{F}_n] \quad \text{(Tower property)}\\ &= \E[S_{n+1} \mid \mathcal{F}_n]\\ &= S_n \end{split} \] The proof follows by induction.

  4. \(\E[S_1] = \E[S_2] = \E[S_3]= \dots\)

Definition Suppose \(\{S_n\}\) is a martingale w.r.t. \(\set{\mathcal{F}_n}\). Let

\[X_1 = S_1, \quad X_2=S_2-S_1, \quad \dots, X_n=S_n-S_{n-1}, \dots\]

Then \(\set{X_n}\) are called martingale differences.

5 Questions #

5.1 Example of R.V. series that converge in probability but not almost surely? #

Divide \(\Omega=[0,1]\) in halves and define events \(E_1=[0,\frac{1}{2}]\), \(E_2=[\frac{1}{2},1]\). Then cut \(\Omega\) into three intervals, and similarly define \(E_3, E_4, E_5\). Repeat this process and define \(X_n = \1_{\{E_n\}}\) and \(X=0\). Then \(X_n \to X\) in probability but not almost surely.

To show convergence in probability, for all \(\epsilon>0\),

\[\begin{split} &\lim_{n \to \infty} \Pr[|X_n-X|>\epsilon]\\ =& \lim_{n \to \infty} \Pr[X_n=1]\\ =& \lim_{n \to \infty} P(E_n)\\ =& 0 \end{split} \]

The sequence of \(X_n\) does not converge to 0 almost everywhere (in fact it converges nowhere) because each point will land in infinitely many of our smaller intervals, and lie outside of infinitely many. Hence, we can find sub-sequences of \(\{X_n\}\) which send our given point to 0 and subsequences which send it to 1.

The difference between the two convergences is important, but largely for philosophical reasons. There is not application that requires the strong law of large number. See here.