Written by Li Zhuohua
on September 15, 2021

Stochastic Models

Lecture notes that I scribbled for SEEM5580: Advanced Stochastic Models (2021 Fall) taught by Prof. Xuefeng Gao at CUHK.

The textbook is Stochastic Processes (2nd Edition) by Sheldon M. Ross.

Table of Contents

1 Preliminaries
- 1.1 Probability
- 1.2 Random Variables
- 1.3 Expectation
- 1.4 Moment Generating Function (MGF)
- 1.5 Characteristic Function (CF)
- 1.6 Exponential Distribution
- 1.7 Stochastic Process
2 Poisson Processes
- 2.1 Counting Process
- 2.2 Poisson Process Definition 1
- 2.3 Poisson Process Definition 2
- 2.4 Two Equivalent Definitions of Poisson Process
- 2.5 Interarrival and waiting times
- 2.6 Poisson Process Definition 3
- 2.7 Why Poisson?
- 2.8 Non-homogeneous Poisson Process
- 2.9 Compound Poisson Random Variables and Process
- 2.10 Conditional Poisson Process
- 2.11 Introduction to Renewal Process
3 Discrete-Time Markov Chain (DTMC)
4 Continuous-Time Markov Chain (CTMC)
- 4.1 Definition of CTMC
- 4.2 Birth and Death Process
- 4.3 Kolmogorov Differential Equations
- 4.4 Limiting Distribution and Stationary Distribution
5 Martingales
6 Brownian Motion

1 Preliminaries #

1.1 Probability #

Axioms #

\(\Pr[E]\in [0,1]\)
\(\Pr[\Omega]=1\)
For any sequence of events \(\left\{ E_i \right\}_{i=1}^{\infty}\) which are mutually exclusive (i.e., \(E_i \cap E_j=\emptyset\)), then \[\Pr \left[ \bigcup_{i=1}^{\infty} E_i \right] = \sum_{i=1}^{\infty} \Pr[E_i]\]

Independence #

Two events \(A\) and \(B\) are called independent if \(\Pr[A\cap B] = \Pr[A]\cdot \Pr[B]\).

If \(\Pr[A]=1\), then event \(A\) holds almost surely.

1.2 Random Variables #

Definition of Random Variables #

A random variable \(X\) is a mapping: \(\Omega \to \RR \), satisfying:

\[\left\{ \omega\in \Omega: X(\omega)\leq a \right\} \in \mathcal{F}, \forall a \in R\]

Note

This is a special case (real-valued random variables) of the general measure-theoretic definition.

We will use the abbreviation:

\[\left\{ X\leq a \right\} \triangleq \left\{ \omega\in \Omega: X(\omega)\leq a \right\} \]

We also call a random variable as R.V.

According to the definition, \(\left\{ X\leq a \right\}\) is an event, so it is a subset of \(\Omega\).

The cumulative distribution function (CDF) of a R.V. \(X\) is defined by:

\[F(x)=\Pr[X\leq x], x\in \RR \]

A R.V. is discrete if the possible values it can take is countable. In this case, \(F(x)=\sum_{y_i\leq x}\Pr[X=y_i]\).

A R.V. is continuous if there exists a function \(f(x)\) called probability density function (PDF) such that:

\[F(a)=\Pr[X\leq a]=\int_{-\infty}^a f(x) \dd{x}, a \in \RR\]

And so \(f(x)=\frac{d}{dx}F(x)\) (only consider differentiable cases)

Random Vectors #

The joint CDF of two R.V.s \(X\) and \(Y\) is given by: \[F(x,y)=\Pr[X\leq x,Y\leq y], x,y \in \RR \] where \(\Pr[X\leq x, Y\leq y]=\Pr[\left\{ X\leq x \right\} \cap \left\{ Y\leq y \right\}\).

The marginal CDF is defined as:

\[F_X(x) = \Pr[X\leq x] = F(x,\infty)=\lim_{y_n\to \infty} F(x,y_n)\]

Two R.V.s are independent if \(F(x,y)=F_X(x)\cdot F_Y(y)\) for \(x,y\in \RR \). I.e., \(\Pr[X\leq x, Y\leq y]=\Pr[X\leq x] \cdot \Pr[Y\leq y]\).

1.3 Expectation #

Definition #

The expectation of a R.V. is defined by: \[\E[X]=\int_{-\infty}^{+\infty}x \dd{F(x)} = \left\{ \begin{aligned} &\sum_x x \Pr[X=x] &\text{if } X \text{ is discrete}\\ &\int_{-\infty}^{+\infty}x f(x)\dd{x} &\text{if } X \text{ is continuous} \end{aligned} \right. \] whenever the integral exists.

For a function \(h: \RR \to \RR \): \[\E[h(X)]=\int_{-\infty}^{+\infty}h(x) \dd{F(x)} = \left\{ \begin{aligned} &\sum_x h(x) \Pr[X=x] &\text{if } X \text{ is discrete}\\ &\int_{-\infty}^{+\infty}h(x) f(x)\dd{x} &\text{if } X \text{ is continuous} \end{aligned} \right. \]

For a function \(h: \RR^n \to \RR \): \[\E[h(X_1,X_2, \dots , X_n)] = \int_{\RR^n}h(x_1,x_2, \dots ,x_n)\dd{F(x_1, \dots , x_n)}\]

Variance of \(X\) #

\[\Var[X] = \E[X^2]-\left( \E[X] \right)^2 = \E \left[ \left( X-\E[X] \right)^2 \right]\]

Covariance of two R.V.s \(X\) and \(Y\) #

Covariance measures the dependence between R.V.s. (There are also other ways to do so.)

\[\Cov[X,Y] = \E[X \cdot Y] - \E[X]\cdot \E[Y]\]

If \(\Cov[X,Y]=0\), \(X\) and \(Y\) are uncorrelated.

Properties #

\(\E[\alpha X+\beta Y]=\alpha\E[X]+\beta\E[Y], \alpha, \beta \in \RR \).
\(\Var\left[\sum_{i=1}^n X_i\right]=\sum_{i=1}^n \Var[X_i] + \sum_{i\neq j}\Cov[X_i, Y_i]\).
If \(X\) and \(Y\) are independent, then \(\Cov[X,Y]=0\), \(\E[XY]=\E[X]\E[Y]\).

Note

Covariance only measures one particular type of dependence, therefore these two are not equivalent. If two R.V.s are independence, their covariance is 0, but having a covariance of 0 does not imply the variables are independent.

Wikipedia has some examples that show this.

1.4 Moment Generating Function (MGF) #

The moment generating function of a R.V. \(X\) is defined by: #

\[\Psi_X(t) = \E \left[ e^{tX} \right] = \int_{-\infty}^{+\infty} e^{tx}\dd{F(x)}, \ \ t\in \RR \]

whenever this expectation exists.

The meaning of “moment” #

\[\Psi_X’(t) = \dv{t} \E \left[ e^{tX} \right] = \E \left[ \dv{t} e^{tX} \right] = \E \left[ Xe^{tX} \right]\]

The interchanging of derivative and expectation can be done using the dominated convergence theorem.

Let \(t=0\), we get the “first moment of \(X\)”: \(\Psi_X’(0) = \E[X]\)

Similarly,

\[\Psi_X’’(t) = \dv[2]{t} \E \left[ e^{tX} \right] = \E \left[ \dv[2]{t} e^{tX} \right] = \E \left[ X^2e^{tX} \right]\]

And \(\Psi_X’’(0) = \E[X^2]\), which is the second moment of \(X\).

In general, we have \(\Psi_X^{(n)}(0)=\E \left[ X^n \right]\), the n-th moment of \(X\).

Theorem: If the MGF \(\Psi_X(t), \Psi_Y(t)\) are finite for all \(t\in \RR \) and \(\Psi_X(t)=\Psi_Y(t)\) for all \(t\in \RR \), then \[\Pr[X\leq x] = \Pr[Y\leq x] \text{ for all } x \in \RR \]

Therefore, when a MGF exists, it uniquely determines a distribution.

Exercise

If \(X_1 \sim N(\mu_1, \sigma_1^2), X_2 \sim N(\mu_2, \sigma_2^2)\) and \(X_1,X_2\) are independent, then \(X_1+X_2\sim N(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)\)

Proof:

First prove the MGF of the normal distribution \(N(\mu,\sigma^2)\) is \(e^{t\mu + \frac{1}{2}\sigma^2t^2}\):

\[\begin{aligned} \E \left[ e^{tX} \right] &=\int_{-\infty}^{+ \infty}e^{tx}\dd{F(x)}\\ &= \int_{-\infty}^{ +\infty}e^{tx} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \dd{x}\\ &=\frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{ +\infty} e^{-\frac{(x-\mu)^2-2\sigma^2tx}{2\sigma^2}} \dd{x}\\ &=\frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{ +\infty} e^{-\frac{(x-(\mu+\sigma^2t))^2-2\mu\sigma^2t-\sigma^4t^2}{2\sigma^2}} \dd{x}\\ &=e^{t\mu+\frac{1}{2}\sigma^2t^2}\frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{ +\infty} e^{-\frac{(x-(\mu+\sigma^2t))^2}{2\sigma^2}} \dd{x}\\ &=e^{t\mu+\frac{1}{2}\sigma^2t^2} \end{aligned} \]

Then just verify: \[\begin{aligned}\E \left[ e^{t(X_1+X_2)} \right] &= \E \left[ e^{tX_1} \right] \E \left[ e^{tX_2} \right] \text{ (because of independence)}\\ &= e^{t(\mu_1+\mu_2)+\frac{1}{2}(\sigma_1^2+\sigma_2^2)t^2} \end{aligned} \]

Drawback #

MGF may not exist.

Example

Consider \(X\) with density function \(f(x)=\frac{1}{\pi}\frac{1}{1+x^2}, x\in \RR \), then

\[\E \left[ e^{tX} \right] = \int_{-\infty}^{+\infty} e^{tx}\frac{1}{\pi}\frac{1}{1+x^2}\dd{x} = \infty, t\neq 0\]

(This is a special case of Cauchy distribution.)

1.5 Characteristic Function (CF) #

Definition #

The CF of a R.V. \(X\) is defined by \[\Phi_X(t)=\E \left[ e^{itX} \right] = \E \left[ \cos(tX)+i\sin(tX) \right],\ \ t\in \RR \]

MGF may not exist but CF always exist, because \(e^{itX}\) is well-defined and well-bounded.

When CDF/PDF are hard to write, CF is important.

The CF uniquely determined the distribution of a R.V.

1.6 Exponential Distribution #

Definition #

A continuous R.V. \(X\) is said to have exponential distribution with parameter \(\lambda>0\) if its PDF is: \[f(x)=\left\{\begin{aligned} & \lambda e^{-\lambda x} &x>0\\ & 0 & \text{otherwise} \end{aligned}\right. \]

Its MGF: \(\frac{\lambda}{\lambda-t},\ \ t<\lambda\) Mean: \(\E[X] = \frac{1}{\lambda}\) Variance: \(\Var[X] = \frac{1}{\lambda^2}\)

Properties #

Memoryless Property \[\Pr[X>t+s\mid X>s] = \Pr[X>t], \ \ \forall t,s\geq 0\]

Proof

\[\begin{aligned}\Pr[X>t+s\mid X>s] &=\frac{\Pr[X>t+s, X>s]}{\Pr[X>s]}\\ &=\frac{\Pr[X>t+s]}{\Pr[X>s]}\\ &=\frac{1-\int_{-\infty}^{t+s}\lambda e^{-\lambda x}\dd{x}}{1-\int_{-\infty}^{s}\lambda e^{-\lambda x}\dd{x}}\\ &=\frac{e^{-\lambda(t+s)}}{e^{-\lambda s}}\\ &=e^{-\lambda t} = \Pr[X>t] \end{aligned} \]
If \(X\sim \exp (\lambda)\) and \(Y\sim \exp (\mu)\), \(X\) and \(Y\) are independent, then: \[\min \left\{ X,Y \right\}\sim \exp (\lambda+\mu) \]

Proof

\[\begin{aligned} \Pr[\min \left\{ X,Y \right\} > x] &=\Pr[X>x,Y>x]\\ &=\Pr[X>x] \Pr[Y>x]\\ &=e^{-x(\lambda+\mu)} \end{aligned} \]

Also, \(\Pr[X<Y] = \frac{\lambda}{\lambda+\mu}\)

Proof

\[\begin{aligned} \Pr[X < Y] &=\int_0^\infty \Pr[X<Y \mid Y=y]f_Y(y) \, dy \\ &= \int_0^\infty \int_0^y f_X(x)f_Y(y) \,\, dx dy \\ &= \int_0^\infty \mu e^{-\mu y}\left(1- e^{-\lambda y} \right) \,dy \\ &=\mu \int_0^\infty e^{-\mu y} \, dy - \mu \int_0^\infty e^{ - (\lambda+\mu)y} \, dy \\ &= 1- \frac{\mu}{\lambda+\mu} \\ &= \frac{\lambda}{\lambda+\mu} \end{aligned} \]

1.7 Stochastic Process #

Definition #

A stochastic process \(X=\{X_t: t\in T\}\) is a collection of random variables defined on the same probability space.

\(T\): Index set, time \(t\in T\)
\(X_t\): State of the process at time \(t\)

\(X\) is called a discrete-time process if \(T\) is countable, e.g., \(T=\{1,2,3, \dots \}\). In this case, \(X=\{X_n:n=0,1,2, \dots \}\).

\(X\) is called a continuous-time process if \(T\) is continuous, e.g., \(T=[0,+\infty)\). In this case, \(X=\{X_t: t\geq 0\}\).

You can think that a process denotes an evolution of some system in time.

Since \(X_t: \Omega\to \RR \), therefore \(X: \Omega \times T\to \RR \). So for a fixed \(\omega \in \Omega\), \(X_t(\omega)\) becomes a function of \(t\in T\), called a sample path of \(X\).

Example (Random Walk)

Let \(\{\xi_i:i=1,2, \dots \}\) be a sequence of i.i.d. R.V.s. Define \(S_0=0\), and \(S_n = \sum_{i=1}^n \xi_i\), for \(n\geq 1\).

Then \(\{S_n:n=0,1,2, \dots \}\) is called a random walk.

(P.S.: \(\xi_i\) can have different distribution.)

When \(\xi_i\) has distribution \(\Pr[\xi_i=1]=\Pr[\xi_i=-1]=\frac{1}{2}\), \(\{S_n:n=0,1,2, \dots \}\) is called a simple symmetric random walk.

A stochastic process \(X\) is said to have independent increments, if for any \(0\leq t_0<t_1<\dots <t_n\), \(X(t_1)-X(t_0), X(t_2)-X(t_1), \dots ,X(t_n)-X(t_{n-1})\) are (mutually, not just pairwise) independent, for all \(n\geq 1\).

\(X\) is said to have stationary increments if for any fixed \(t>0\), \(X(t+s)-X(s)\) have the same distribution, for all \(s\geq 0\). (I.e., the distribution depends only on the length of the time interval \(t\).)

2 Poisson Processes #

2.1 Counting Process #

A stochastic process \(N=\{N(t):t\geq 0\}\) is called a counting process if \(N(t)\) represents the number of events occurred up to time \(t\) (including \(t\)).

A counting process satisfies:

\(N(0)=0\)
\(N(t)\) is integer-valued
\(N(s)\leq N(t)\) for \(s\leq t\)

For \(s < t\), \(N(t)-N(s)\) equals the number of events that have occurred in the interval \((s.t]\).

A sample path of Poisson process is a right-continuous function, and has left limits:

Figure 1: A sample path of a Poisson process

2.2 Poisson Process Definition 1 #

A counting process \(N=\{N(t): t\geq 0\}\) is called a Poisson process if:

\(N(0)=0\)
\(N\) has independent increments
For any \(s,t\geq 0\), \(\Pr[N(t+s)-N(s)=n] = \frac{(\lambda t)^n}{n!}e^{-\lambda t}\), for \(n=0,1,2, \dots \). I.e., \(N(t+s)-N(s)\) has a Poisson distribution with mean \(\lambda t\).

\(\lambda\): rate of arrival (events occurring)

\[\E[N(t)]=\lambda t \Longrightarrow \lambda = \frac{\E[N(t)]}{t}\]

2.3 Poisson Process Definition 2 #

A counting process \(N=\{N(t):t\geq 0\}\) is called a Poisson process with rate \(\lambda>0\), if

\(N(0)=0\)
\(N\) has stationary and independent increments
As \(h\to 0\), \(\Pr[N(h)=0]=1-\lambda h +o(h)\), \(\Pr[N(h)=1]= \lambda h +o(h)\), \(\Pr[N(h)\geq2]=o(h)\) (infinitesimal behavior)

2.4 Two Equivalent Definitions of Poisson Process #

Proof (Definition 1 => Definition 2)

From \(\Pr[N(t+s)-N(s)=n] = \frac{(\lambda t)^n}{n!}e^{-\lambda t}\), it is easy to see the distribution of \(N(t+s)-N(s)\) is irrelevant with \(s\), so N is stationary.

Let \(s, n = 0\), we get \(\Pr[N(t)=0] = e^{-\lambda t}\). Denote \(P_0(t)\triangleq \Pr[N(t)=0]\), then

\(\Pr[N(h)=0]= e^{-\lambda h} = 1-\lambda h +o(h)\), this is from the Taylor Expansion of \(e^x\).

\(\Pr[N(h)=1]= \lambda h e^{-\lambda h} = \lambda h (1-\lambda h +o(h)) = \lambda h + o(h)\)

\(\Pr[N(h)=2]= \frac{1}{2}\lambda^2 h^2 e^{-\lambda h} = \frac{1}{2}\lambda^2 h^2 (1-\lambda h +o(h)) = o(h)\)

Proof (Definition 2 => Definition 1)

Denote: \(P_n(t)\triangleq\Pr[N(t)=n]\).

We first show \(P_0(t)=e^{-\lambda t}, t\geq 0\):

\[\begin{aligned}P_0(t+h) &= \Pr[N(t+h)=0]\\ &= \Pr[N(t)=0, N(t+h)-N(t)=0]\\ &= \Pr[N(t)=0] \cdot \Pr[N(t+h)-N(t)=0] \ \ \text{(because of independent increments)}\\ &= P_0(t)[1-\lambda h+o(h)]\\ \end{aligned} \]

So \(\frac{P_0(t+h)-P_0(t)}{h} = -\lambda P_0(t) + o(h)\), i.e., \(P_0’(t) = -\lambda P_0(t) \Rightarrow \ln P_0(t) = -\lambda t \Rightarrow P_0(t)=e^{-\lambda t}\).

Then we consider \(n>0\):

\[\begin{aligned}P_n(t+h)=&\Pr[N(t+h)=n]\\ =&\Pr[N(t)=n, N(t+h)-N(t)=0] + \Pr[N(t)=n-1, N(t+h)-N(t)=1]\\ &+ \Pr[N(t+h)\leq n-2, N(t+h)-N(t)\geq 2]\\ =& P_n(t)\cdot \Pr[N(h)=0]+P_{n-1}(t)\cdot \Pr[N(h)=1] +o(h)\\ =& P_n(t)(1-\lambda h + o(h)) + P_{n-1}(t)(\lambda h +o(h)) + o(h)\end{aligned} \]

The third equation uses the independent increments property of \(N\).

Compute the derivative: \[\frac{P_n(t+h)-P_n(t)}{h} = -\lambda P_n(t)+\lambda P_{n-1}(t)+\frac{o(h)}{h}\]

I.e., \(P_n’(t) = -\lambda P_n(t)+\lambda P_{n-1}(t)\), so we get a differential equation:

\[\left\{\begin{aligned}P_n’(t)&=-\lambda P_n(t)+\lambda P_{n-1}(t),\ \ t>0\\ P_n(0)&=\Pr[N(0)=n]=0 \end{aligned}\right. \]

For \(n=1\), we have \(P_1’(t)=-\lambda P_1(t) + \lambda e^{-\lambda t}\), this is a First-order Linear Differential Equation with Constant Coefficients.

Solve it and get \(P_1(t) = \lambda t e^{-\lambda t}\)

Then prove by induction (details are skipped), get \(P_n(t) = \frac{(\lambda t)^n}{n!}e^{-\lambda t}\).

According to the stationary condition, we have \(\Pr[N(t+s)-N(s)=n] = \frac{(\lambda t)^n}{n!}e^{-\lambda t}\).

2.5 Interarrival and waiting times #

In the above we consider Poisson process in terms of counting. Now we change our perspective: in terms of arrival times.

Consider a Poisson process \(\{N(t), t\geq 0\}\) with rate \(\lambda\geq0\). Denote \(X_1\) the arrival time of the first event, \(X_n\) the time between the \(n-1\)-th arrival and \(n\)-th arrival.

\((X_n, n=1,2, \dots)\) is called the sequence of interarrival times.

Also we compute the partial sum \(S_n\): the arrival time of n-th event.

\[S_n = \sum_{i=1}^n X_i, n=1,2, \dots \]

What are the distributions of \(X_n\) and \(S_n\)?

\[\Pr[X_1>t]=\Pr[N(t)=0]=e^{-\lambda t},\ \ t>0 \Rightarrow X_1 \sim \exp (\lambda)\]

\[\begin{aligned}\Pr[X_2>t \mid X_1 = s] &=\Pr[N(t+s)-N(s)=0 \mid N(s)=1, N(u)=1, 0<u<s]\\ &=\Pr[N(t+s)-N(t)=0]\\ &=\Pr[N(t)=0]\\ &=e^{-\lambda t},\ \ t>0 \Rightarrow X_2 \sim \exp (\lambda) \end{aligned} \]

And \(X_2\) is independent of \(X_1\).

Proposition

\((X_n: n=1,2, \dots)\) is a sequence of i.i.d. exponential random variable with \(\E[X_i]=\frac{1}{\lambda}\)

Note

This is another perspective of Poisson process, it is important because it tells us the exact time of arrivals.

\[\Pr[S_n\leq t]=\Pr[N(t)\geq n] = \sum_{k=n}^{\infty} \frac{(\lambda t)^k}{k!}e^{-\lambda t}\]

\(S_n\) follows a Gamma distribution (Erlang distribution), a generalization of Poisson, its density is given by

\[f_n(t)=\lambda e^{-\lambda t}\frac{(\lambda t)^{n-1}}{(n-1)!}\]

2.6 Poisson Process Definition 3 #

Given a sequence of i.i.d. exponential random variables \(\{X_i, i=1,2, \dots \}\) with \(\E[X_i]=\frac{1}{\lambda}\). Let \(S_n=\sum_{i=1}^n X_i\) for \(n\geq 1\) and \(S_0=0\). Define \(N(t)=\max \{n: S_n\leq t\}, t\geq 0\). Then \(\{N(t):t\geq 0\}\) is a Poisson process with rate \(\lambda\).

Note

Definition 3 implies definition 1 and 2. Proof is complex and so skipped.

In the definition, the exponential distributions are important, otherwise it will not be a Poisson.

By this definition, we can simulate sample paths by generating \(X_i\). This can also be used to test whether a coming process is Poisson (Compute the interarrival times and use statistical tests to see whether they are close to exponential.).

Example

Compute \(\Pr[X_1\leq s \mid N(t)=1], s<t\)

\[\begin{aligned}\Pr[X_1\leq s \mid N(t)=1] &= \frac{\Pr[X_1\leq s, N(t)=1]}{\Pr[N(t)=1]}\\ &=\frac{\Pr[N(s)=1, N(t)-N(s)=0]}{\lambda t e^{-\lambda t}}\\ &=\frac{\Pr[N(s)=1]\Pr[N(t)-N(s)=0]}{\lambda t e^{-\lambda t}}\\ &=\frac{\lambda s e^{-\lambda s} e^{-\lambda (t-s)}}{\lambda t e^{-\lambda t}}\\ &=\frac{s}{t} \end{aligned} \]

Therefore, given that before time \(t\) there is exactly one event happened, the event happening time \(X_1\) is uniformly distributed over \((0,t)\). This is a special property of Poisson.

Generalization

Given \(N(t)=n\), what is the joint distribution of \((S_1,S_2, \dots, S_n)\)?

First we introduce the concept of “order statistics”:

Let \(Y_1,Y_2, \dots , Y_n\) be \(n\) i.i.d. R.V.s with PDF \(f\). And \(Y_{(1)}, Y_{(2)}, \dots , Y_{(n)}\) are the order statistics corresponding to them if \(Y_{(k)}\) is the \(k\)th smallest value among \(Y_1, \dots ,Y_n\), for \(k=1,2, \dots ,n\). The joint density of \(Y_{(1)}, Y_{(2)}, \dots , Y_{(n)}\) is given by: \[f(y_1,y_2, \dots ,y_n)=n\,! \prod_{i=1}^n f(y_i), \ \ y_1<y_2< \dots <y_n\]

Then we compute the conditional density function of \(S_1, \dots ,S_n\) given that \(N(t)=n\).

Let \(0<t_1<t_2<\dots <t_{n+1} = t\) and select \(h_i\) small enough such that \(t_i+h_i<t_{i+1}, \ i=1, \dots ,n\).

\[\begin{aligned}&\Pr[t_i\leq S_i\leq t_i+h_i, i=1,2, \dots, n \mid N(t)=n]\\ =& \frac{\Pr[N(t_i+h_i)-N(t_i)=1, i=1, \dots, n, \text{no events elsewhere}]}{\Pr[N(t)=n]}\\ =& \frac{\lambda h_1 e^{-\lambda h_1} \dots \lambda h_n e^{-\lambda h_n} e^{-\lambda(t-h_1-h_2-\dots-h_n)}}{\frac{e^{-\lambda t}(\lambda t)^n}{n\,!}}\\ =& \frac{n\,!}{t^n} h_1 h_2 \dots h_n \end{aligned} \]

Hence,

\[\frac{\Pr[t_i\leq S_i\leq t_i+h_i, i=1,2, \dots, n \mid N(t)=n]}{h_1 h_2 \dots h_n} = \frac{n\,!}{t^n}\]

By letting \(h_t\to 0\), we get the conditional density function:

\[f(t_1, \dots ,t_n)=\frac{n\,!}{t^n}, 0<t_1<\dots <t_n\]

2.7 Why Poisson? #

Note

The Poisson distribution is a limiting case of the binomial distribution which arises when the number of trials n increases indefinitely whilst the product \(\lambda= np\), which is the expected value of the number of successes from the trials, remains constant.

In one word, \(Bin(n,p) \approx Poi(\lambda), np \approx \lambda\) when \(n\) is large, \(p\) is small.

Proof

Split time into \(n\) pieces, for each time slot, an event happens with probability \(p\), so the expectation of the number of events happen is \(\lambda = np\).

According to Binomial distribution, the probability of \(k\) events happen is \(C_n^k p^k (1-p)^{n-k}\). As \(n\to \infty\), we have:

\[\begin{aligned}&\lim_{n\to \infty} C_n^k p^k (1-p)^{n-k}\\ =&\lim_{n\to \infty} C_n^k \left( \frac{\lambda}{n} \right)^k \left( 1- \frac{\lambda}{n} \right)^{n-k}\\ =&\lim_{n\to \infty} \frac{n}{n} \cdot \frac{n-1}{n} \dots \frac{n-k+1}{n} \frac{\lambda^k}{k\,!} \left( 1-\frac{\lambda}{n} \right)^n \left( 1-\frac{\lambda}{n} \right)^{-k} \end{aligned} \]

Since \(\lim_{n\to \infty}\frac{n}{n} \cdot \frac{n-1}{n} \dots \frac{n-k+1}{n}=1 \), \(\lim_{n\to \infty}\left( 1-\frac{\lambda}{n} \right)^{-k} =1\), and \(\lim_{n\to \infty} \left( 1-\frac{\lambda}{n} \right)^n = e^{-\lambda}\), we have:

\[\Pr[k \text{ events happen}] = \frac{\lambda^k}{k\,!} e^{-\lambda}\]

2.8 Non-homogeneous Poisson Process #

Intuition

The number of customers coming from 10 am to 11 am is larger than those from 1 am to 2 am. So in some scenarios, \(\lambda\) may not be a constant anymore.

A counting process \(\{N(t), t\geq 0\}\) is called a non-homogeneous or non-stationary Poisson process with rate function \(\lambda(t)\), if

\(N(0)=0\)
\(N\) has independent increments
As \(h\downarrow 0\), \(\Pr[N(t+h)-N(t)=0]=1-\lambda(t) h +o(h)\), \(\Pr[N(t+h)-N(t)=1]= \lambda(t) h +o(h)\), \(\Pr[N(t+h)-N(t)\geq 2]=o(h)\), for any \(t\geq 0\)

Theorem: For a non-homogeneous Poisson process with rate function \(\lambda(t),\ t\geq 0\), we have:

\(N(t+s)-N(s)\) follows a Poisson distribution with mean \(\int_s^{t+s} \lambda(u)\dd{u}=\E[N(t+s)-N(s)]\), i.e., for any \(t,s\geq 0\)

\[\Pr[N(t+s)-N(s)=n]=\frac{\left( \int_s^{t+s} \lambda(u)\dd{u} \right)^n}{n\,!} e^{-\int_s^{t+s} \lambda(u)\dd{u}}, n=0,1,2, \dots \]

Proof

Similar to the proof of Definition 2 => Definition 1.

2.9 Compound Poisson Random Variables and Process #

Intuition

We may be only interested in the rewards/costs instead of counting.

Let \(\{X_i: i=1,2, \dots \}\) be a sequence of i.i.d. R.V.s with a common distribution \(F\) (CDF). Let \(N\) be a Poisson R.V. with mean \(\lambda\), and it is independent of \(\{X_i, i=1,2, \dots \}\). \(W=\sum_{i=1}^N X_i\) is called a compound Poisson random variable, with Poisson parameter \(\lambda\), and component distribution \(F\).

Note

If \(X_i=1\), then \(W=N\), which is just a Poisson R.V.

Compute the M.G.F. of \(W\):

\[\begin{aligned}\Psi_W(t)&=\E \left[ e^{tW} \right]=\E \left[ e^{t \sum_{i=1}^N X_i} \right]\\ &=\sum_{n=0}^{\infty}\E \left[ e^{t\sum_{i=1}^n X_i} \right] \Pr[N=n]\\ &=\sum_{n=0}^{\infty} \left[ \lambda \Psi_X(t) \right]^n \frac{e^{-\lambda}}{n\,!}\\ &=e^{\lambda (\Psi_X(t)-1)} \end{aligned} \]

So \(\E[W]=\Psi_W’(0)=\lambda \Psi_X’(0)=\lambda \E[X_i]\), and \(\Var[W]=\lambda \E[X_i^2]\)

A stochastic process \( \left\{ X(t): t\geq 0 \right\}\) is called a compound Poisson process if \(X(t)=\sum_{i=1}^{N(t)} X_i, t\geq 0\), where \(\left\{ N(t), t\geq 0 \right\}\) is a Poisson process and \(\left\{ X_i, i=1, 2, \dots \right\}\) is a sequence of i.i.d. R.V.s that is independent of \(\left\{ N(t), t\geq 0 \right\}\).

This is not a counting process any more.

Example: An insurance company

An insurance company receives claims that arrive according to a Poisson process with rate \(\lambda\). Suppose the amount of claims form a set of i.i.d. R.V.s that is independent of the claim arrival process. If \(X(t)\) denotes the total amount of claims by time \(t\), then \(\left\{ X(t):t\geq 0 \right\}\) is a compound Poisson process.

\(Y(t)=y_0 + c\cdot t - \sum_{i=1}^{N(t)} X_i\)

where \(Y(t)\) is the profit, \(c\cdot t\) is the premium.

If \(Y(t)<0\), there will be a bankruptcy. One interesting problem is that, given some time, what is the chance that there will be a bankruptcy?

2.10 Conditional Poisson Process #

Intuition

For a Poisson process \(\{N(t), t\geq 0\}\) we have \(\Var[N(t)]=\E[N(t)]\)

For many real-world systems, it is not the case that \(\frac{\Var[N(t)]}{\E[N(t)]}=1\).

E.g. for deterministic counting process, \(\frac{\Var[N(t)]}{\E[N(t)]} < 1\).

In what condition we have \(\frac{\Var[N(t)]}{\E[N(t)]} > 1\)?

We need more uncertainty, “over dispersion of count data”.

Let \(\Lambda >0\) be a R.V. with distribution function \(G(x)=\Pr[\Lambda<x]\) (CDF). Let \(N\) be a counting process such that given \(\Lambda=\lambda\), \(\{N(t), t\geq 0\}\) is a Poisson process with rate \(\lambda\). Then \(N\) is called a conditional Poisson process.

\[\begin{aligned}\Pr[N(t+s)-N(s)=n]&=\int_0^{\infty} \Pr[N(t+s)-N(s)=n \mid \Lambda=\lambda]\dd{G(\lambda)}\\ &=\int_0^{\infty} \frac{(\lambda t)^n}{n\,!} e^{-\lambda t}\dd{G(\lambda)}\\ \end{aligned} \]

Note

\(\{N(t), t\geq 0\}\) is not a Poisson process. The increments are still stationary (because \(N(t+s)-N(s)\) depends only on the time increment \(t\)), but not independent (because one can verify that \(\Pr[N(t+s)=0]\neq \Pr[N(s)=0]\cdot \Pr[N(t+s)-N(s)=0]]\)).

Suppose \(\{N_1(t), t\geq 0\}\) is a Poisson process with rate \(\lambda=1\). Define \(\tilde{N}(t)\triangleq N_1(\lambda t), t\geq 0\)

Fact

\(\{\tilde{N}(t), t\geq 0\}\) is a Poisson process with rate \(\lambda\).

Similarly, define \(N(t)\triangleq N_1(\Lambda t), t\geq 0\) (\(\Lambda\) is independent of \(N_1\)）(\(\Lambda t\) is called “random time change”.) Then \(\{N(t), t\geq 0\}\) is a conditional Poisson process.

\[\E[N(t)]=\E[N_1(\Lambda t)] = \int_0^{\infty} \E[N_1(\Lambda t) \mid \Lambda =\lambda]\dd{G(\lambda)} = \int_0^{\infty} \lambda t \dd{G(\lambda)}=t \E[\Lambda]\]

\[\E[N^2(t)]=\E[N_1^2(\Lambda t)] = \int_0^{\infty} \E[N_1^2(\Lambda t) \mid \Lambda=\lambda]\dd{G(\lambda)} = \int_0^{\infty} [(\lambda t)^2+(\lambda t)] \dd{G(\lambda)}=t^2 \E[\Lambda^2] + t\E[\Lambda]\]

\[\Var[N(t)] = t^2 \E[\Lambda^2] + t\E[\Lambda] -t^2(\E[\Lambda])^2= t\E[\Lambda] + t^2 \Var[\Lambda]\]

\[\frac{\Var[N(t)}{\E[N(t)]} = 1+\frac{\Var[\Lambda]}{\E[\Lambda]}t \geq 1\]

Note

Conditional Poisson process is a special case of doubly stochastic Poisson process (or Cox process), where \(\lambda(t)\) itself is a stochastic process.

Conditional Poisson process is the case where \(\lambda(t)\) is a constant R.V. \(\Lambda\).

Exercise

For some currently unknown reasons, the average rate at which seismic shocks occur in a certain region changes over time (let’s say it is either \(\lambda_1\) or \(\lambda_2\).). Suppose it is \(\lambda_1\) for \(p\) percent of time and \(\lambda_2\) for the remaining time. Given \(n\) shocks in the first \(t\) times units, compute the probability it is a \(\lambda_1\) season.

Suppose \[\Lambda=\left\{\begin{aligned}&\lambda_1 & \text{with probability } p\\ &\lambda_2 &\text{with probability } 1-p \end{aligned} \right. \]

Compute \(\Pr[\Lambda=\lambda_1 \mid N(t)=n]\) (\(N(t)\) is a conditional Poisson process)

\[\begin{aligned}\Pr[\Lambda=\lambda_1 \mid N(t)=n] &= \frac{\Pr[\Lambda=\lambda_1, N(t)=n]}{\Pr[N(t)=n]}\\ &=\frac{\Pr[N(t)=n \mid \Lambda=\lambda_1] \Pr[\Lambda=\lambda_1]}{\Pr[N(t)=n]}\ \ \text{(Bayes’ theorem)}\\ &=\frac{p e^{-\lambda_1 t}(\lambda_1 t)^n}{p e^{-\lambda_1 t}(\lambda_1 t)^n + e^{-\lambda_2 t}(\lambda_2 t)^n(1-p)}\end{aligned} \]

2.11 Introduction to Renewal Process #

A generalization of Poisson process. It is also a counting process, but allows interarrival times to be general: i.i.d. non-negative R.V.s, not only exponential distribution.

Let \(\{X_n:n=1,2, \dots \}\) be a sequence of i.i.d. non-negative R.V.s with a common distribution function \(F\). Assume \(F(0)=\Pr[X_n=0]<1\) and \(\E[X_n]=\mu \in (0,\infty)\). Define \(S_0=0\) and \(S_n=\sum_{i=1}^n X_i\) for \(n\geq 1\). Let \(N(t)=\sup\{n: S_n\leq t\}\) for \(t\geq 0\). Then the counting process \(\{N(t), t\geq 0\}\) is called a renewal process.

Note

\((X_n)\): sequence of interarrival times.

Interarrival times can be zero, as long as the probability is not 1. (Otherwise the model is not interesting.)

The “\(\sup\)” can be changed to “\(\max\)”. Because in a finite time \(t\), there cannot be an infinite number of renewals occurred. Because:

\[\Pr \left[ \lim_{n\to \infty} \frac{S_n}{n} = \mu \right]=1 \ \ \text{(Strong law of large numbers)}\]

So \(S_n \to \infty\) as \(n\to \infty\) with probability 1, i.e., if \(S_n\leq t\), then \(n\) must be finite. Therefore, only a finite number of events occurred on \([0,t]\) for any \(t\).

So the “\(\sup\)” cannot be infinite, so it can be replaced with “\(\max\)”.

Compute the distribution of \(N(t)\):

\[\begin{split} \Pr[N(t)=n]&=\Pr[N(t)\geq n] - \Pr[N(t)\geq n+1]\\ &=\Pr[S_n\leq t]-\Pr[S_{n+1}\leq t]\\ &=F_n(t)-F_{n+1}(t) \ \ \text{(denote } F_n(t)=\Pr[S_n\leq t]\text{)} \end{split} \]

How to compute \(F_n(t)\)?

\[\begin{aligned}F_1(t)&=\Pr[X_1\leq t]=F(t)\\ F_2(t)&=\Pr[X_1+X_2\leq t]=F*F_1(t)\\ &\vdots\\ F_{n+1}(t)&=\Pr[S_n+X_{n+1}\leq t]=F*F_{n}(t)\end{aligned} \]

Note

\(F_n(t)\) is actually the \(n\)-fold convolution of \(F\) with itself.

Renew function: \[\begin{split} m(t)=\E[N(t)] = \sum_{n=1}^{\infty} \Pr[N(t)\geq n] =\sum_{n=1}^{\infty} \Pr[S_n\leq t] = \sum_{n=1}^{\infty} F_n(t) \end{split} \]

3 Discrete-Time Markov Chain (DTMC) #

3.1 Definition of DTMC #

Let \(\{X_n: n=0,1,2, \dots \}\) be a discrete-time stochastic process taking on a finite or countable number of values, say, \(S = \{0,1,2, \dots \} \) (this is called state space). Suppose

\[\Pr[X_{n+1}=j \mid X_n=i, X_{n-1}=i_{n-1}, \dots, X_1 = i_1, X_0 = i_0] = \Pr[X_{n+1}=j \mid X_n=i] \text{ (Markov property)}\]

for all states \(i_0,i_1, \dots ,i_{n-1}, i, j\) and \(n\geq 0\). Then \(\{X_n:0,1,2, \dots \}\) is called a discrete-time Markov chain (“time-homogeneous”) (it is called a chain because the state space is countable or finite, this will be generalized in Markov process).

Fact

Conditional independence.

\[\Pr[A\mid B,C] = \Pr[A\mid B] \Longleftrightarrow \Pr[A,C \mid B] = \Pr[A\mid B] \Pr[C \mid B]\]

The conditional distribution of future state \(X_{n+1}\), given the past state \(X_0, \dots ,X_{n-1}\) and the present state \(X_n\), is independent of the past states and depends only on the present state.

Denote: \(P_{ij} \triangleq \Pr[X_{n+1}=j \mid X_n=i] \text{ (transition probability)}\)

Note

We only consider the time-homogeneous scenario, so the transition probability does not depend on the time \(n\).

Properties of transition probabilities:

\(P_{ij}\geq 0, i,j \in S\)
\(\sum_{j \in S} P_{ij}=1, \forall i \in S\)

Transition probability matrix:

\[P=(P_{ij})_{i, j \in S} =\begin{bmatrix} P_{00} & P_{01} & P_{02} & \dots \\ P_{10} & P_{11} & P_{12} & \dots \\ \vdots & \vdots & \vdots & \ddots\\ \end{bmatrix} \]

Note

The row sum of \(P\) is always 1. However, this is not the case for column sums.

Consider the following markov chain with 2 states. State 1 leads to 1 with probability 1 and state 2 leads to state 1 with probability 1. Then the sums of the columns of the transition matrix are 2 and 0.

Example (Random Walk)

Let \((X_i)_{i=1}^{\infty}\) be i.i.d. with \(\Pr[X_i=k]=a_k, k=0, \pm 1, \pm 2, \dots \)

Let \(S_0=0, S_n=\sum_{i=1}^n X_i, n\geq 1\), then \(\{S_n: n=0, 1, \dots \}\) is a DTMC.

Proof

Need to prove: \(\Pr[S_{n+1}=j | S_n=i, \dots, S_1=i_1, S_0=0] = \Pr[S_{n+1}=j \mid S_n = i]\)

\[\begin{split} \text{LHS}&=\Pr[S_{n+1}=j | S_n=i, \dots, S_1=i_1, S_0=0]\\ &=\Pr[X_{n+1}=j-i \mid S_n=i, \dots, S_0 = 0]\\ &=\Pr[X_{n+1}=j-i ] = a_{j-i}\\ \text{RHS}&=\Pr[S_{n+1}=j \mid S_n = i]\\ &= \Pr[X_{n+1} = j-i \mid S_n=i] = a_{j-i} \end{split} \]

Example (Inventory Model for non-perishable items)

\(X_n\): inventory level at end of week \(n\), \(X_0=3\)

\(D_n\): the demand in the \(n\)-th week. \((D_n)_{n=1}^{\infty}\) are i.i.d with distribution:

\(n\) 0 1 2 3

\(\Pr[D_n]\) \(\frac{1}{8}\) \(\frac{1}{4}\) \(\frac{1}{2}\) \(\frac{1}{8}\)

Ordering policy: if \(X_n\leq 1\), order up to 4, otherwise do not order.

Question: \(X_n: n=0,1, \dots \) a DTMC?

Yes. State space: \(S=\{0,1,2,3,4\}\)

\[X_{n+1}=\left\{ \begin{array}{l@{\quad:\quad}l} 4 -D_{n+1}& \text{if } X_n\leq 1\\ (X_n -D_{n+1})^+ & \text{otherwise} \end{array}\right. \]

\(\Pr[X_{n+1}=j \mid X_n=i] = P_{ij}\)

\(P\) is a 5 by 5 matrix.

\(n\)	0	1	2	3
\(\Pr[D_n]\)	\(\frac{1}{8}\)	\(\frac{1}{4}\)	\(\frac{1}{2}\)	\(\frac{1}{8}\)

Elements of DTMC:

State space \(S\)
(One-step) transition matrix
The distribution of \(X_0\)

A toy example

\(S=\{a,b\}\), \(P=\begin{bmatrix}\frac{1}{2} & \frac{1}{2}\\ 0 & 1\end{bmatrix}\)

Can draw state transition diagram:

Questions: How to compute:

\(\Pr[X_n=j \mid X_0=a], j \in S\), i.e., \(n\)-step transition probability.

\(\E \left[ \sum\limits_{i=0}^n f(X_i) \right]\), where \(f: S\to \RR \) is the “cost function”.

\(\lim\limits_{n\to \infty} \Pr[X_n=j \mid X_0=a], j \in S\)

\(\lim\limits_{n\to \infty} \frac{1}{n}\E \left[ \sum\limits_{i=0}^n f(X_i) \right]\)

We will answer these in the following chapters.

3.2 Chapman–Kolmogorov Equation #

Now compute \(n\)-step transition probabilities.

Write \(n\)-step transition probabilities: \(P_{ij}^{(n)}=\Pr[X_n=j \mid X_0=i], i,j \in S\)

Theorem:

\[P_{ij}^{(n+m)}=\sum\limits_{k\in S} P_{ik}^{(n)}\cdot P_{kj}^{(m)} \quad \forall i,j \in S, n,m \geq 0\]

Proof:

\[\begin{split} \Pr[X_{n+m}=j \mid X_0=i]&= \sum_{k \in S} \Pr[X_{n+m}=j, X_n=k \mid X_0=i]\\ &=\sum_{k \in S} \Pr[X_{n+m}=j \mid X_n=k, X_0=i] \Pr[X_n=k \mid X_0=i]\\ &=\sum_{k\in S} P_{kj}^{(m)} \cdot P_{ik}^{(n)} \end{split} \]

Write \(P^{(n)}=\left( P_{ij}^n \right)_{i,j \in S}\) the n-step transition matrix. \[P^{(n+m)} = P^{(n)}\cdot P^{(m)}\] So \(P^{(2)}=P^{(1)}\cdot P^{(1)} = P^2\), by induction, we have \(P^{(n)}=P^n, n\geq 1\).

What about unconditional probabilities？ We need to know the distribution of \(X_0\):

\[\Pr[X_n=j] = \sum_{i \in S} \Pr[X_n=j \mid X_0=i] \Pr[X_0=i]\]

3.3 Classification of States #

Intuition

To compute \(\lim\limits_{n\to \infty} \Pr[X_n=j \mid X_0=a], j \in S\)，or \(\lim\limits_{n\to \infty} P^n\), can we simplify the computation?

State \(j\) is said to accessible from state \(i\) if there exists some \(n\geq 0\), such that \(P_{ij}^{(n)}>0\). We write \(j \leftarrow i\).

State \(j\) and state \(i\) communicate if \(i \leftrightarrow j\).

Lemma Communication is an equivalence relation, that is:

\(i\leftrightarrow i\)
if \(i\leftrightarrow j\), then \(j\leftrightarrow i\)
if \(i \leftrightarrow j\), \(j\leftrightarrow k\), then \(i\leftrightarrow k\)

Proof of 3 There exists \(n,m\geq 0\) such that \(P_{ij}^{(n)}>0\), \(P_{jk}^{(m)}>0\), then \(P_{ik}^{n+m}=\sum_{l\in S}P_{il}^{(n)}P_{lk}^{(m)}\geq P_{ij}^{(n)}P_{jk}^{(m)}>0\).

Two states that communicate are said to be in the same class.

A DTMC is said to be irreducible if there is only one class, i.e., all states communicate with each other.

The example given above whose \(P=\begin{bmatrix}\frac{1}{2} & \frac{1}{2}\\ 0 & 1\end{bmatrix}\) is not irreducible, because states \(a,b\) are not in the same class (since \(a\) is not accessible from \(b\)).

Classification of states: #

Recurrent
- Positive recurrent
- Null recurrent
Transient

Define the hitting time \(\tau_j = \inf \{n\geq 1: X_n=j\}\), for \(j \in S\). (Meaning: the time step at which the state first transits into \(j\).) (This is a random quantity, but not a random variable because it is possible to be infinite).

Define \(f_{jj}=\Pr[\tau_j < \infty \mid X_0=j] = \sum_{n=1}^{\infty} \Pr[X_n=j, X_k\neq j, k=1, \dots, n-1 \mid X_0=j]\).

Definition: A state \(j\) is called recurrent if \(f_{jj}=1\), and called transient if \(f_{jj}<1\).

Theorem: A state \(j\) is recurrent if and only if \(\sum\limits_{n=1}^{\infty} P_{jj}^{(n)}=\infty\).

Example

In the above example where \(P=\begin{bmatrix}\frac{1}{2} & \frac{1}{2}\\ 0 & 1\end{bmatrix}\)

For state \(b\), \(P_{bb}^{(n)}=1 \quad \forall n\geq 1\) (state \(b\) is absorbing because \(P_{bb}=1\)), so \(b\) is recurrent.

For state \(a\), \(P_{aa}^{(n)}=\Pr[X_n=a \mid X_0=a]= \left( \frac{1}{2} \right)^n\), so \(\sum\limits_{n=1}^{\infty} P_{aa}^{(n)}=\sum\limits_{n=1}^{\infty} \frac{1}{2^{n}}=1<\infty\), so \(a\) is transient.

Using definition we get the same results:

For state \(b\), \(\Pr[\tau_b<\infty \mid X_0=b] = 1\)

For state \(a\), \(\Pr[\tau_a<\infty \mid X_0=a]=\frac{1}{2}\), because given \(X_0=a\), we have:

\[\tau_a=\left\{ \begin{array}{l@{\quad:\quad}l} 1 & \text{w.p. } \frac{1}{2}\\ +\infty & \text{w.p. }\frac{1}{2} \end{array}\right. \]

Proof of theorem

If state \(j\) is recurrent, then the number of visits to state \(j\) denoted by \(N_j\), is \(\infty\) with probability one.

Note that \[\begin{aligned}\sum\limits_{n=1}^{\infty} P_{jj}^{(n)}&=\sum\limits_{n=1}^{\infty} \Pr[X_n=j \mid X_0=j]\\ &= \sum\limits_{n=1}^{\infty} \E[\1\{X_n=j\} \mid X_0=j]\\ &=\E \left[ \sum\limits_{n=1}^{\infty} \1\{X_n=j\} \mid X_0=j \right]\\ &= \E[N_j \mid X_0=j] =\infty\end{aligned} \]

(The third equation is from Monotone Convergence Theorem (MCT).)

If state \(j\) is transient, we need to show \(\sum\limits_{n=1}^{\infty} P_{jj}^{(n)}=\E[N_j \mid X_0=j]<\infty\).

Each time when the DTMC returns to state \(j\), there is a positive probability \(1-f_{jj}>0\) that it will never return again.

Define “success” if it will return and “failure” if it will return. Then \(N_j\) is the trial number on which the finite success occurs.

\(N_j\) follows the geometric distribution with success probability \(1-f_{jj}\). \(\E[N_j \mid X_0=j]=\frac{1}{1-f_{jj}}<\infty\).

Corollary With probability one, a transient state will only be visited for a finite number of times.

Corollary A finite-state DTMC has at least one recurrent state. (Important)

Corollary If \(i\) is recurrent and \(i \leftrightarrow j\), then \(j\) is recurrent.

Proof

There exists some \(m,n\geq 0\) such that \(P_{ij}^{(n)}>0\) and \(P_{ji}^{(m)}>0\) \[P_{jj}^{(m+n+s)} = \sum_{k,l \in S} P_{jk}^{(m)} P_{kl}^{(s)} P_{lj}^{(n)} \geq P_{ji}^{(m)} P_{ii}^{(s)} P_{ij}^{(n)}\]

\[\sum_{n=1}^{\infty} P_{jj}^{(n)} \geq \sum_{s=1}^{\infty} P_{jj}^{(m+n+s)}\geq P_{ji}^{(m)}\left(\sum_{s=1}^{\infty} P_{ii}^{(s)}\right) P_{ij}^{(n)} = \infty\]

This means “recurrent” and “transient” have “class property”.

Example (Random Walk)

State space \(S=\ZZ \), \(P_{i, i+1}=p\), \(P_{i,i-1}=1-p\)

The DTMC is irreducible (because all states communicate).

To show state 0 is recurrent (or not), we need to compute:

\[\sum_{n=1}^{\infty} P_{00}^{(n)}=\sum\limits_{n=1}^{\infty} P_{00}^{(2n)}\]

This is because of the property of random walk: it is impossible to have odd number of steps, starting with state 0 and getting back to state 0.

\[P_{00}^{(2n)}=\Pr[X_{2n}=0 \mid X_0=0] = C_{2n}^n \cdot p^n (1-p)^{n} = \frac{(2n)!}{n! \cdot n!} p^n (1-p)^n \]

Using Stirling’s Formula: \(n! \sim \left( \frac{n}{e} \right)^n \sqrt{2\pi n}\), we have:

\[P_{00}^{(2n)} \sim (4p(1-p))^{n} \frac{1}{\sqrt{\pi n}}\]

If \(p=\frac{1}{2}\), \(\sum\limits_{n=1}^{\infty} P_{00}^{(2n)}\sim \sum\limits_{n=1}^{\infty} \frac{1}{\sqrt{\pi n}}=\infty\), so state 0 is recurrent, so every state is recurrent.

If \(p\neq \frac{1}{2}\), \(\sum\limits_{n=1}^{\infty} P_{00}^{(2n)} < \infty\), so state 0 is transient, so every state is transient. (this is possible because there are infinite many states.)

Decomposition of State Space #

For example, if we have a DTMC:

Let states \(a,b\) be in class \(P_1\), states \(c,d\) in class \(P_2\), and \(e\) in class \(Q\). The transition matrix:

\[P = \begin{bmatrix} P_1 & \vec{0} & \vec{0} \\ \vec{0} & P_2 & \vec{0} \\ Q_1 & Q_2 & Q_3 \end{bmatrix} \]

From the graph it is easy to see: \(a,b\) are recurrent, \(c,d\) are recurrent, \(e\) is transient. \(a,b\) and \(c,d\) do not communicate.

In general, let \(T\) be the set of transient states, and \(c_i\) be the recurrent classes.

\[\begin{array}{ccccc} & & \begin{array}{ccccc} c_1 & c_2 & c_3 & \dots & T \end{array}\\ P= & \begin{array}{c} c_1\\ c_2\\ c_3\\ \vdots\\ T \end{array} & \left[\begin{array}{ccccc} P_1 & 0 & 0 & \dots & \\ 0 & P_2 & 0 & \dots & \\ 0 & 0 & P_3 & \dots & \\ \vdots & \vdots & \vdots & \ddots & \\ Q_1 & Q_2 & Q_3 & \dots & Q_T \end{array}\right] \end{array} \]

The advantage of doing so is to simplify the computation, e.g., we only need to look at the sub-matrix if state is in \(P_1\).

Define \(\mu_{jj}=\E[\tau_j \mid X_0=j], \quad j \in S\) (average number of steps needed to revisit state \(j\))

Definition If state \(j\) is recurrent, then it is said to be positive recurrent if \(\mu_{jj}<\infty\), and null recurrent if \(\mu_{jj}=\infty\).

3.4 Stationary Distribution and Limiting Distribution #

Definition A probability distribution \(\pi=(\pi_i)_{i \in S}\) is called a stationary distribution for a DTMC with transition probability matrix \(P\), if

\[\left\{ \begin{array}{l@{\quad:\quad}l} \pi_j = \sum_{i \in S} \pi_i P_{ij} \Longleftrightarrow \underbrace{\pi}_{\text{row vector}} = \pi P & \text{for all } j \in S\\ \sum_{i \in S} \pi_i = 1 & \\ \pi_i \geq 0 & \text{for all } i \in S \end{array}\right. \]

Suppose \(X_0 \sim \pi\), i.e., \(\Pr[X_0 = i]=\pi_i, i \in S\), then \(X_n\sim \pi, n\geq 1\).

\[\Pr[X_1=j] = \sum_{i \in S} \Pr[X_1=j \mid X_0=i] \Pr[X_0=i] = \sum_{i \in S} \pi_i P_{ij} = \pi_{j}, j \in S\]

By induction, \(X_n \sim \pi\).

Stationary distribution is useful because we usually focus on long-term behaviors of a Markov chain. (e.g., whether it is converged.)

Example 1

\(P=\begin{bmatrix} 1 & 0\\ 0 & 1\end{bmatrix}\), then \((\alpha, 1-\alpha)\) is a stationary distribution, \(\alpha\geq 0\), so the stationary distribution is not unique.

Example 2

\(P=\begin{bmatrix} 0 & 1\\ 1 & 0\end{bmatrix}\), then \(\pi = (\frac{1}{2}, \frac{1}{2})\) is the unique stationary distribution.

Example 3 (Simple Random Walk)

\(P_{i, i-1} = P_{i, i+1} = \frac{1}{2}, i \in \ZZ \)

\(\pi_j = \pi_{j-1} \frac{1}{2} + \pi_{j+1}\frac{1}{2} \Longrightarrow \pi_j=\pi_i, \forall i,j \in S\).

\(1=\sum_{i \in S} \pi_i \Longrightarrow \pi_i = \frac{1}{\infty} = 0\), so stationary distribution doesn’t exist.

Therefore, stationary can be non-existing, or not unique.

For the existence of stationary distributions, we have the following conclusions:

For a finite-time DTMC, there always exists at least one stationary distribution.

(hint: use linear programming and duality theory)

(Because the row sum of \(P\) is always 1, so \(P \vec{1} = \vec{1}\), so 1 is always an eigenvalue of \(P\).)
If a DTMC has at least one positive recurrent state, then a stationary distribution exists. (proof skipped)

Note

A stationary distribution exists doesn’t mean the Markov chain converges to this distribution starting from any states.

However, if you have a sufficiently “nice” Markov chain, you should expect the convergence, i.e., \(\Pr[X_n=j | X_0=i] = \pi_j,\quad n\to \infty\) (i.e., the initial state is eventually forgotten).

(Corresponding to previous examples)

Example 1

\(P^n = I = P\) \[\lim_{n\to \infty} \Pr[X_n=j \mid X_0=i] = \left\{ \begin{array}{l@{\quad:\quad}l} 1 & i=j\\ 0 & i\neq j \end{array}\right. \]

Example 2

\(\lim_{n\to \infty} P^n\) does not exist, because:

\(P^{2n} = \begin{bmatrix} 1 & 0\\ 0 & 1\end{bmatrix}\) \(P^{2n+1} = \begin{bmatrix} 0 & 1\\ 1 & 0\end{bmatrix}\)

We say the states have period 2, denoted by \(d(a)=d(b)=2\).

Definition: state \(i\) is said to have period \(d\), if \(d\) is the greatest common divisor of these \(n\) with \(P_{ii}^{(n)}>0, n \geq 1\), that is, \(d(i) = \gcd \{n\geq 1 : P_{ii}^{(n)}>0\}\).

I.e., a state \(i\) has period \(k\) if any return to state \(i\) must occur in multiples of \(k\) time steps.

If \(d(i)=1\), then we say state \(i\) is aperiodic.

Example (Simple Random Walk)

\(d(i)=2, i \in S=\ZZ\)

Lemma 1 If \(i \leftrightarrow j\), then \(d(i)=d(j)\). (period is also a “class property”) (proof on the textbook)

Theorem (proof skipped) Suppose \(i \leftrightarrow j\), then

\(\lim_{n\to \infty} \frac{1}{n} \sum_{k=1}^n P_{ij}^{(k)} = \frac{1}{\mu_{jj}}\)

Intuition:

\[\begin{split} &\frac{1}{n}\sum_{k=1}^n P_{ij}^{(k)}\\ =&\frac{1}{n}\sum_{k=1}^n \Pr[X_k=j \mid X_0=i]\\ =&\frac{1}{n} \sum_{k=1}^n \E[\1_{X_k=j} \mid X_0=i]\\ =&\frac{1}{n} \E \left[ \sum_{k=1}^n \1_{X_k=j} \mid X_0=i \right] \end{split} \]

The above expectation is the expected number of times that state \(j\) is reached starting from state \(i\) in \(n\) steps.

As \(n\to \infty\), this is the long-run average proportion of time spent at state \(j\).
If \(j\) is aperiodic, then \(\lim_{n\to \infty} P_{ij}^{(n)} = \frac{1}{\mu_{jj}}\)
If \(j\) has period \(d(j)\), then \(\lim_{n\to \infty} P_{jj}^{n d(j)} = \frac{d(j)}{\mu_{jj}}\)

Intuition of 2,3:

If \(a\) is not aperiodic, e.g, \(d(a)=2\), then \(P_{aa}^{(2n)}=1\), \(P_{aa}^{(2n-1)}=0\), \(\lim_{n\to \infty} P_{aa}^{(n)}\) doesn’t exist.

But \(\lim_{n\to \infty} P_{aa}^{(2n)} = 1 = \frac{d(a)}{\mu_{aa}} = 1\)

Recall \(\mu_{jj} = \E[\tau_j \mid X_0=j]\) is the expected number of steps needed to return to \(j\). And \(\mu_{jj}<\infty \Longleftrightarrow j \text{ is positive recurrent}\).

So \(j\) is positive recurrent \(\Longleftrightarrow \lim_{n\to \infty} P_{jj}^{(n d(j))}>0\).

Similarly, \(j\) is null recurrent \(\Longleftrightarrow \lim_{n\to \infty} P_{jj}^{(nd(j))} = 0\).

Example (simple symmetric random walk) When \(p=\frac{1}{2}\), \(P_{00}^{(2n)} \sim \frac{1}{\sqrt{\pi n}}\).

\(\lim_{n\to \infty}P_{00}^{(2n)}=0\), so state 0 is null recurrent.

Theorem An irreducible, aperiodic DTMC belongs to one of the following two classes:

Either all states are transient or all null recurrent. In this case, \(\lim_{n\to \infty} P_{ij}^{(n)}=0\) for all \(i, j \in S\), and there exists no stationary distribution.
Or all states are positive recurrent. In this case, \(\lim_{n\to \infty} P_{ij}^{(n)}\) exists for all \(i,j \in S\), and denote \(\pi_j = \lim_{n\to \infty} P_{ij}^{(n)}, j \in S\). Then \(\pi=(\pi_{j})_{j \in S}\) is the unique stationary distribution.

Proof

We first proof 2.

First show positive recurrent is a class property. That is, if \(j\) is positive recurrent and \(j \leftrightarrow k\), then \(k\) is positive recurrent.

\(\lim_{n\to \infty} P_{ij}^{(n)} = \pi_j>0\) exists since \(i\leftrightarrow j\) (because this DTMC is irreducible) and due to the above theorem.

Since \(j \leftrightarrow k\), let \(m\) be such that \(P_{jk}^{(m)}>0\) then \(P_{ik}^{(n+m)}\geq P_{ij}^{(n)} P_{jk}^{(m)}\).

Take limit on both sides, \(\lim_{n\to \infty} P_{ik}^{(n+m)}\geq \pi_j P_{jk}^{(m)}>0\), so \(k\) is positive recurrent.

Note

\(\pi_j\) does not depend on \(i\), so the transition matrix will be like this (values in each column are the same): \[P^n \to \begin{bmatrix}\pi_0 & \pi_1 &\pi_2 &\pi_3 & \dots \\ \pi_0 & \pi_1 &\pi_2 &\pi_3 & \dots \\ \pi_0 & \pi_1 &\pi_2 &\pi_3 & \dots \\ \vdots & \vdots & \vdots &\vdots & \ddots \end{bmatrix} \]

We next show \(\pi = (\pi_j)_{j \in S}\) is a stationary distribution, where \(\pi_j = \lim_{n\to \infty} P_{ij}^{(n)}\).

Because \(P_{ij}^{(n+1)} = \sum_{k \in S} P_{ik}^{(n)}P_{kj}\), so \[\begin{split} \pi_j &= \lim_{n\to \infty} \sum_{k \in S} P_{ik}^{(n)} P_{kj}\\ &=\sum_{k \in S} \lim_{n\to \infty} P_{ik}^{(n)} P_{kj} \quad\text{ correct only if S is finite}\\ &=\sum_{k \in S} \pi_k P_{kj} \end{split} \] Also, \(\sum_{j \in S} \pi_j = \sum_{j \in S} \lim_{n\to \infty} P_{ij}^{(n)} = \lim_{n\to \infty} \sum_{j \in S} P_{ij}^{(n)} =1, \pi_j \geq 0\)

So \(\pi\) is a stationary distribution.

In the above, \(S\) must be finite, so we can exchange the summation and limit.

If \(S\) is infinite, let \(S=\{0,1,2, \dots \}\), then \(\sum_{j=0}^M P_{ij}^{(n)} \leq \sum_{j=0}^{\infty} P_{ij}^{(n)} = 1, \forall M\), take limit on both sides:

\[\lim_{n\to \infty} \sum_{j=0}^M P_{ij}^{(n)} = \sum_{j=0}^M \pi_j \leq 1\]

Implying that

\[\sum_{j=0}^\infty \pi_j \leq 1\]

Now

\[P_{ij}^{(n+1)} = \sum_{k=0}^{\infty } P_{ik}^{(n)} P_{kj} \geq \sum_{k=0}^M P_{ik}^{(n)} P_{kj}\]

Send \(n\to \infty \), we have \(\pi_j \geq \sum_{k=0}^M \pi_k P_{kj}\).

Then send \(M\to \infty \), we have \(\pi_j \geq \sum_{k=0}^\infty \pi_k P_{kj}\).

We want to prove that this must be an equality. Assume there exists \(\pi_j>\sum_{k=0}^\infty \pi_k P_{kj}\) for some \(j\), then

\[\begin{split} \sum_{j=0}^{\infty} \pi_j &> \sum_{j=0}^{\infty} \sum_{k=0}^{\infty} \pi_k P_{kj}\\ &=\sum_{k=0}^{\infty} \pi_k \sum_{j=0}^{\infty} P_{kj}\\ &=\sum_{k=0}^{\infty} \pi_k \end{split} \]

Contradiction. So \(\pi_j=\sum_{k=0}^\infty \pi_k P_{kj}, \quad j \in S\)

Note

Note that here we only showed:

\(\sum_{j=0}^\infty \pi_j \leq 1\)

\(\pi_j=\sum_{k=0}^\infty \pi_k P_{kj}, \quad j \in S\)

So \(\left\{\frac{\pi_j}{\sum_{i=0}^{\infty} \pi_i}, j = 0,1,2, \dots \right\}\) is a stationary distribution.

But we cannot show \(\pi\) is a stationary distribution yet, because (1) is not an equality.

Next, show the uniqueness.

Let \((\nu_j)_{j \in S}\) be any stationary distribution, then \(\nu_j = \sum_{i=0}^{\infty} \nu_i P_{ij}^{(n)}\geq \sum_{i=0}^M \nu_i P_{ij}^{(n)}\), let \(n\to \infty \), we have \(\nu_j \geq \sum_{i=0}^M \nu_i \pi_j \stackrel{M\to \infty}{\Longrightarrow} \nu_j \geq \pi_j\)

To prove the other direction, write

\[\nu_j = \sum_{i=0}^M \nu_i P_{ij}^{(n)} + \sum_{i=M+1}^\infty \nu_i P_{ij}^{(n)} \leq \sum_{i=0}^M \nu_i P_{ij}^{(n)} + \sum_{i=M+1}^{\infty} \nu_i\]

Let \(n\to \infty\), we have \(\nu_j \leq \sum_{i=0}^M \nu_i \pi_j + \sum_{i=M+1}^{\infty} \nu_i\)

Let \(m\to \infty\), we have \(\nu_j \leq \sum_{i=0}^M \nu_i \pi_j = \pi_j\)

So \(\nu_j = \pi_{j}, j \in S\).

Finally we prove 1. If transient or null recurrent: \(\lim_{n\to \infty} P_{ij}^{(n)}=\frac{1}{\mu_{jj}}=0\)

So \(\pi_j = \sum_{i \in S} \pi_i P_{ij}^{(n)} = 0\), so no stationary distribution exists.

3.5 Law of Large Numbers for DTMC (Time Averages) #

If a DTMC \(\{X_n: n=0,1,2, \dots \}\) is irreducible and positive recurrent, and \(f: S\to \RR \) is bounded, then

\[\frac{1}{n} \sum_{k=1}^n \textstyle{\E_i} [f(X_k)] \stackrel{n\to \infty}{\longrightarrow} \pi(f) \triangleq \sum_{j \in S} f(j) \pi(j) \quad \forall i \in S\]

where \(\E_i[f(X_k)] = \E[f(X_k) \mid X_0=i]\)

\(f\) is the cost function, so the long-term average reward converges to \(\E[f(Y)]\) where \(Y\sim \pi\).

4 Continuous-Time Markov Chain (CTMC) #

4.1 Definition of CTMC #

Let \(X=\{X(t), t\geq 0\}\) be a continuous time stochastic process with discrete state space \(S\). If \[\Pr[X(t+s)=j \mid X(s)=i, X(u) = x(u), 0 \leq u < s] = \Pr[X(t+s)=j \mid X(s)=i]\] for all \(s,t\geq 0\) and \(i, j, x(u) \in S\), then \(\{X(t), t\geq 0\}\) is called a continuous time Markov Chain.

Same with DTMC, we only consider time-homogeneous Markov chain, i.e.,

\[\Pr[X(t+s) \mid X(s)=i] = P_{ij}(t) = \Pr[X(t) \mid X(0)=i]\]

The transition probability is independent of \(s\) (called stationary or homogeneous). \(P_{ij}(t)\) and the initial state \(p_i=\Pr[X(0)=i]\) uniquely determine all the probability distribution of the CTMC.

Example

Poisson process is a CTMC: \[\begin{split} &\Pr[X(t+s)=j \mid X(s)=i, X(u)=x(u), 0\leq u < s]\\ =&\Pr[X(t+s)-X(s)=j-i \mid X(s)=i, \dots]\\ =&\Pr[X(t+s)-X(s)=j-i]\\ =&\Pr[X(t+s)=j \mid X(s)=i] = P_{ij}(t) \end{split} \]

One difference between DTMC and CTMC is that, we consider not only the transition probabilities but also the holding time.

Remember that the interarrival time of Poisson process follows an exponential distribution. This is in fact a general property of CTMC.

Denote \(\tau_i\) the holding-time at state \(i\) for a CTMC \(X=\{X(t), t\geq 0\}\). Then \(\tau_i\) follows an exponential distribution with some rate \(v_i\).

To prove this, verify the memoryless property: \(\Pr[\tau_i > t+s \mid \tau_i > s] = \Pr[\tau_i > t]\).

\[\begin{split} \text{LHS} &= \Pr[X(l)=i, s< l \leq t+s \mid X(s)=i, X(u)=i, 0\leq u <s]\\ &=\Pr[X(l)=i, s< l \leq t+s \mid X(s)=i] \quad \text{(Markov property)}\\ &=\Pr[X(l)=i, 0<l\leq t \mid X(0)=i] \quad \text{(Time-homogeneous)}\\ &=\text{RHS} \end{split} \]

Construction of CTMCs (CTMC = DTMC with exponential transition/holding times)

Each time the process enters state \(i\)

The amount of time it spends in that state before making a transition into a different state (holding time) is exponentially distributed with rate \(v_i\), and
When the process leaves state \(i\), it will next enter state \(j(\neq i)\) with some probability, call it \(P_{ij}\), where \(\sum_{j \neq i} P_{ij}=1\). Note that \(P_{ii}\) is always 0.

We can parameterize of a CTMC with \((P, v)\) where \(P = (P_{ij})_{i,j \in S}\), and \(v = (v_i)_{i \in S}\).

Assume throughout \(0\leq v_i < \infty\) (because if \(v_i=\infty\), \(\E[\tau_i] = \frac{1}{v_i}=0\), which is not interesting), a CTMC is called regular if there is finite number of transition over any finite time window.

Define the (infinitesimal) generator matrix \(Q=(q_{ij})_{i,j \in S}\) as follows: \[\left\{ \begin{array}{l@{\quad:\quad}l} q_{ij}=v_i \cdot P_{ij} & i \neq j, i,j \in S\\ q_{ii}=-v_i & i \in S \end{array}\right. \] where \(q_{ij}\) is the transition rate (from \(i\) to \(j\)), and \(v_i\) is the rate of leaving state \(i\).

\((P, v)\) and \(Q\) are equivalent, one can derive the other.

The row sum of \(Q\) is 0: \[\sum_{j \in S} q_{ij} = \sum_{j\neq i} q_{ij} + q_{ii} = \sum_{j \neq i} v_i P_{ij} - v_i = v_i \sum_{j \neq i} P_{ij} - v_i = 0\]

Example

A two-state CTMC with \(Q = \begin{bmatrix}-\alpha & \alpha \\ \beta & -\beta \end{bmatrix}\), \(\alpha, \beta>0\).

\(v_0 = |-\alpha|=\alpha\), \(v_1 = \beta\).

\(P_{01} = \frac{q_{01}}{v_0}=\frac{\alpha}{\alpha}=1\), \(P_{00}=0\), \(P_{01}=1\), \(P_{11}=0\).

\(P=\begin{bmatrix}0 & 1\\ 1 & 0\end{bmatrix}\). \(\tau_0 \sim \exp(\alpha), \tau_1 \sim \exp(\beta)\).

We can draw the “rate diagram”:

4.2 Birth and Death Process #

Definition A CTMC with \(S=\{0,1,2, \dots\}\) for which \(q_{ij}=0\) whenever \(|i-j|>1\) is called a birth and death process.

Denote \(\lambda_i = q_{i,i+1}\) as the birth rate, \(\mu_i = q_{i, i-1}\) as the death rate.

So the transitions from state \(i\) can only go to either state \(i-1\) or state \(i+1\). When the state increases by 1 we say a birth occurs, and when it decreases by 1 we say a death occurs.

\[Q = \begin{bmatrix} -\lambda_0 & \lambda_0 & 0 & \dots\\ \mu_1 & -(\lambda_1 + \mu_1) & \lambda_1 & \dots\\ \vdots & \vdots & \vdots & \ddots \\ \end{bmatrix} \]

So \(v_i = \lambda_i + \mu_i\), \(P_{i, i+1} = \frac{q_{i, i+1}}{v_i}= \frac{\lambda_1}{\lambda_1+\mu_i}\), \(P_{i, i-1} = \frac{q_{i, i-1}}{v_i}= \frac{\mu_1}{\lambda_1+\mu_i}\), \(P_{i,j}=0, j\neq i \pm 1\).

At state \(i\), the holding time is \(\exp(v_i)=\exp(\lambda_i+\mu_i)\), because:

The “birth clock” \(\xi_i \sim \exp(\lambda_i)\), the “death clock” \(\eta_i \sim \exp(\mu_i)\), and they are independent. So the holding time \(\tau_i = \min \{\xi_i, \eta_i\} \sim \exp(\lambda_i + \mu_i)\). \[\begin{split} P_{i,i+1} &= \Pr[\xi_i<\eta_i]=\frac{\lambda_i}{\lambda_i+\mu_i}\\ P_{i,i-1} &= \Pr[\xi_i>\eta_i]=\frac{\nu_i}{\lambda_i+\mu_i} \end{split} \]

Example (The M/M/s Queue)

The name “X/Y/Z”: “X” is the arrival process/interarrival time distribution, “Y” is the service time distribution, “Z” is the number of servers.

“M” stands for “Markov”, meaning that the distribution is exponential.

First “M”: Arrival process is Poisson process with rate \(\lambda\), or equivalently the interarrival time is exponential with rate \(\lambda\).

Second “M”: Service time distribution is exponential with rate \(\mu\).

“s”: There are \(s\) servers.

Service discipline: FIFO (First-In-First-Out), FCFS (First-Come-First-Serve).

Denote \(X(t)\) as the number of customers/jobs in the system at time \(t\). State space \(S=\{0,1,2, \dots\}\). \(\{X(t): t \geq 0\}\) is a birth and death process.

Birth rate \(\lambda_i = \lambda\), death rate \(\mu_i = s\cdot \mu\) if \(i>s\), and \(\mu_i = i\cdot\mu\) if \(i \leq s\).

A birth and death process with death rate \(\mu_i=0\) for all \(i\) is called a pure birth process.

Example (Yule Process)

Consider a pure birth process resulting from a population where each member acts independently and gives birth at rate \(\lambda>0\). No one ever dies. Let \(X(t)\) denote the population size at time \(t\). Then \(\{X(t):t\geq 0\}\) is a pure birth process with \(\lambda_i=i \cdot \lambda\), \(i\geq 1\).

The holding time \(T_i \sim \exp(i\lambda)\) are independent with each other.

Now compute: \(P_{ij}(t)=\Pr[X(t)=j \mid X(0)=i]\)

Consider \(i=1\):

\[\begin{split} P_{ij}(t)&=\Pr[X(t)=j \mid X(0)=1] \\ &=\Pr[X(t)\geq j \mid X(0)=1] - \Pr[X(t)\geq j+1 \mid X(0)=1] \end{split} \]

Now compute: \(\Pr[X(t)\geq j \mid X(0)=1] = \Pr[T_1+T_2+ \dots + T_{j-1} \leq t \mid X(0)=1]\)

For \(j=2\), \(\Pr[X(t)\geq 2 \mid X(0)=1] = \Pr[T_1\leq t] = 1-e^{\lambda t}\).

For \(j=3\): \[\begin{split} \Pr[T_1+T_2\leq t] &= \int_{0}^{\infty} \Pr[T_2+T_1 \leq t \mid T_1 = x] f_{T_1}(x)\dd{x}\\ &=\int_{0}^{\infty} \Pr[T_2+x \leq t] f_{T_1}(x)\dd{x}\\ &= \int_{0}^{\infty} \left[ 1-e^{-2\lambda(t-x)} \right] \lambda e^{-\lambda x}\dd{x}\\ &= \left( 1-e^{-\lambda t} \right)^2 \end{split} \]

In general, we have \(\Pr[T_1+ \dots + T_{j-1} \leq t] = \left( 1-e^{-\lambda t} \right)^{j-1}\).

So \(P_{1j}(t) = \left( 1-e^{-\lambda t} \right)^{j-1} - \left( 1-e^{-\lambda t} \right)^j = e^{-\lambda t} \cdot \left( 1-e^{-\lambda t} \right)^{j-1}, \quad j=1,2,3, \dots\).

Therefore, starting with a single individual, the population size at time \(t\) will have a geometric distribution with mean \(e^{\lambda t}\).

So if starting with \(i\) individuals, the population size at \(t\) will be the sum of \(i\) i.i.d. geometric R.V.s, which follows a negative binomial distribution.

\[P_{ij}(t) = C_{j-1}^{i-1} e^{-\lambda ti} \left( 1-e^{-\lambda t} \right)^{j-i}\]

4.3 Kolmogorov Differential Equations #

Consider a CTMC with generator matrix \(Q\), and the transition probability matrix is \(P(t)=(P_{ij}(t))_{i,j \in S}\), \(t\geq 0\).

Lemma 1 For all \(s,t\geq 0\), we have for all \(i,j \in S\) \[P_{ij}(t+s)=\sum_{k \in S} P_{ik}(t)\cdot P_{kj}(s)\] or in matrix form: \(P(t+s)=P(t)P(s)\).

Proof

\[\begin{split} P_{ij}(t+s) &= \Pr[X(t+s)=j \mid X(0)=i]\\ &= \sum_{k \in S} \Pr[X(t+s)=j, X(t)=k \mid X(0)=i]\\ &= \sum_{k \in S} \Pr[X(t+s)=j \mid X(t)=k, X(0)=i] \cdot \Pr[X(t)=k \mid X(0)=i]\\ &= \sum_{k \in S} \Pr[X(s)=j \mid X(0)=k] \cdot \Pr[X(t)=k \mid X(0)=i]\\ &= \sum_{k \in S} P_{ik}(t) \cdot P_{kj}(s) \end{split} \]

Lemma 2

\(\lim_{h \to 0} \frac{P_{ii}(h)-1}{h} = q_{ii} = -v_i\)
\(\lim_{h \to 0} \frac{P_{ik}(h)}{h} = q_{ik}\) for \(k\neq i\).

In matrix notation, we have \[\lim_{h\to 0} \frac{P(h)-P(0)}{h}=Q\]

where \(P(0)=I\).

LHS is like a derivative, so denote by \(P’(0)\).

Proof

For (1), we want to compute \(\frac{\Pr[X(h)=i \mid X(0)=i]-1}{h}\).

Since from state \(i\), the holding time \(\tau_i\) follows an exponential distribution with rate \(v_i\), we have:

\[\begin{split} \Pr[\text{0 transition by time } h \mid X(0)=i] &= \Pr[\tau_i > h]\\ &= 1-\Pr[\tau_i\leq h]\\ &= e^{-v_i h} = 1 - v_i h + o(h) \end{split} \]

\[\begin{split} &\Pr[\text{2 or more transitions by time } h | X(0)=i]\\ =&\sum_{j \neq i} \Pr[\text{2 or more transitions by time } h\mid \text{1st transition to }j, X(0)=i]P_{i,j}\\ =& \sum_{j\neq i}\Pr[\tau_i+\tau_j\leq h] P_{i,j}\\ \leq& \sum_{j\neq i}\Pr[\tau_i \leq h]\Pr[\tau_j \leq h] P_{i,j} \quad \text{since the event } \{\tau_i+\tau_j\leq h\} \subset \{\tau_i\leq h, \tau_j\leq h\}\\ =& \sum_{j \neq i} (v_i h +o(h))(v_j h +o(h)) P_{i,j}\\ =& \sum_{j \neq i} o(h) P_{i,j} = o(h) \end{split} \]

We can also know from the above that

\[\Pr[\text{1 transition by } h \mid X(0)=i] = 1- (1-v_i h+o(h)) - o(h) = v_i h + o(h)\]

In \(\Pr[X(h)=i \mid X(0)=i]\), there are two possibilities, either 0 transition happens, or 2 or more transitions happen and get back to state \(i\). So we have:

\[\Pr[X(h)=i \mid X(0)=i] = 1-v_i h +o(h) + o(h) = 1-v_i h + o(h)\]

So we can easily get \(\lim_{h \to 0} \frac{P_{ii}(h)-1}{h} = -v_i\).

For (2), similarly, in \(\Pr[X(h)=k \mid X(0)=i], k\neq i\), either there is exactly 1 transition to state \(k\), the probability is \((v_i h + o(h))P_{i,k}\), or 2 or more transitions happen. In summary, we have:

\[\Pr[X(h)=k \mid X(0)=i] = v_i P_{i,k} h +o(h) + o(h) = v_i P_{i,k} h + o(h)\]

So we can get \(\lim_{h \to 0} \frac{P_{ik}(h)}{h} = q_{ik}\).

Theorem 1 (Kolmogorov Backward Equation)

For all \(i, j \in S\) and \(t \geq 0\), \[P_{ij}’(t)=\sum_{k \in S} q_{ik}\cdot P_{kj}(t)\] using matrix notation, we have \(P’(t)=Q\cdot P(t)\).

Proof

Here we assume the state space \(S\) is finite, the general case is in the textbook.

\[\begin{split} P_{ij}(t+h) &= \sum_{k \in S} P_{ik}(h)P_{kj}(t)\\ &= \sum_{k \neq i} P_{ik}(h)P_{kj}(t) + P_{ii}(h)P_{ij}(t) \end{split} \] \[\begin{split} \frac{P_{ij}(t+h) - P_{ij}(t)}{h} &= \frac{\sum_{k \neq i} P_{ik}(h)P_{kj}(t) + (P_{ii}(h) -1) P_{ij}(t)}{h} \end{split} \]

Taking limits on both sides, we have

\[\begin{split} P_{ij}’(t) &= \lim_{h \to 0} \frac{\sum_{k\neq i} P_{ik}(h) P_{kj}(t)}{h} + \lim_{h\to 0} \frac{(P_{ii}(h)-1)P_{ij}(t)}{h}\\ &= \sum_{k \neq i}\lim_{h\to 0}\frac{P_{ik}(h)}{h}P_{kj}(t) + q_{ii}P_{ij}(t)\\ &= \sum_{k\neq i} q_{ik}P_{kj}(t)+q_{ii}P_{ij}(t)\\ &= \sum_{k \in S} q_{ik}P_{kj}(t) \end{split} \]

The sum and limit can be exchanged because we only consider the case where \(S\) is finite.

Theorem 2 (Kolmogorov Forward Equation)

Under suitable conditions (including birth-death process and finite state CTMC), we have \[P_{ij}’(t)=\sum_{k \in S} P_{ik}(t) \cdot q_{kj}, \quad i, j \in S\] or \(P’(t)=P(t)\cdot Q\).

(proof is just swapping \(h\) and \(t\) in the above proof)

Solution to these equations

\[P(t) = e^{tQ}, t\geq 0\]

(Similar to ODE: \(f’(t)=c\cdot f(t), f(0)=1 \Rightarrow f(t)=e^{ct}, t\geq 0\).)

Matrix Exponential

If \(A\) is a matrix, then:

\[e^A = \sum_{n=0}^{\infty} \frac{A^n}{n!}\]

\(P’(0)=Q\), because the row sum of \(P\) is 1 (constant), after taking the derivative, the row sum of \(Q\) is zero.

In discrete case: One step generator: let \(h=0\), we get \(P-I\), this is the analog of \(Q\) in DTMC.

Example 1

Define \(\phi_j(t)=\Pr[X(t)=j], j \in S, t\geq0\). Given \((\phi_j(0))_{j \in S}\), compute \((\phi_j(t))_{j \in S}\) (denote as a row vector \(\vec{\phi}(t)\)).

\[\phi_j(t) = \Pr[X(t)=j] = \sum_{i \in S} \Pr[X(t)=j \mid X(0)=i] \phi_i(0) = \sum_{i \in S} \phi_i(0) P_{ij}(t)\] Or equivalently

\[\phi(t) = \phi(0)\cdot P(t)\]

By Kolmogorov differential equation:

\[\frac{d \phi(t)}{dt} = \phi(0) P’(t) \stackrel{\text{forward equation}}{=} \phi(0) P(t) Q = \phi(t) Q\]

Solve this differential equation (similar with ODE) we can get the answer.

(Note: “forward” means “forward in time”. In this example we know the initial state and compute the state afterwards.)

Example 2

Denote \(u_i(t)=\E[f(X(T)) \mid X(t)=i], i \in S, t \in [0, T]\), where \(T\) is given. Compute \(\vec{u}(t)\).

Obviously, \(u_i(T) = f(i), i \in S\).

By the definition of expectation:

\[u_i(t) = \sum_{j \in S} f(j) \Pr[X(T)=j \mid X(t)=i] = \sum_{j \in S} f(j) P_{ij}(T-t) \Longleftrightarrow \vec{u}(t) = P(T-t) \vec{f}\]

where \(\vec{u}(t) = (u_i(t))_{i \in S}\) and \(\vec{f} = (f(i))_{i \in S}\).

So we can get the differential equations:

\[\left\{ \begin{array}{l} \frac{d \vec{u}(t)}{dt} = (-1) P’(T-t) \vec{f} \stackrel{\text{backward equation}}{=} (-1) Q P(T-t) \vec{f} = -Q \vec{u}(t) \\ \vec{u}(T) = \vec{f} \end{array}\right. \]

Solve the above equations (similar with ODE) and get the answer.

(Note: “backward” means “backward in time”. Because here we know the state at terminal time instead of the initial time.)

We can also change variable \(v(t)=u(T-t)\), and use the forward equation.

Example 3

Consider the two-state CTMC with generator \(Q=\begin{bmatrix}-\lambda & \lambda \\ \mu & -\mu\end{bmatrix}\), compute \(P(t)=\begin{bmatrix}P_{00}(t) & P_{01}(t) \\ P_{10}(t) & P_{11}(t)\end{bmatrix}\).

This example can be solved as analytical formulas (rare, most of time you need a computer to compute the numerical solutions).

\[\begin{bmatrix}P’_{00}(t) & P’_{01}(t) \\ P’_{10}(t) & P’_{11}(t)\end{bmatrix} = \begin{bmatrix}P_{00}(t) & P_{01}(t) \\ P_{10}(t) & P_{11}(t)\end{bmatrix} \begin{bmatrix}-\lambda & \lambda \\ \mu & -\mu\end{bmatrix}\]

\[\left\{ \begin{array}{l} P’_{00}(t) = -\lambda P_{00}(t) + \mu P_{01}(t) = -\lambda P_{00}(t)+ \mu (1-P_{00}(t)) = -(\lambda + \mu) P_{00}(t) + \mu \\ P_{00}(0)=1 \end{array}\right. \]

Solve the above equations, get

\[\left\{ \begin{array}{l} P_{00}(t)=\frac{\mu}{\lambda+\mu}+\frac{\lambda}{\lambda+\mu} e^{-(\lambda+\mu)t} \\ P_{01}(t) = 1-P_{00}(t) = \frac{\lambda}{\lambda+\mu}-\frac{\lambda}{\lambda+\mu} e^{-(\lambda+\mu)t} \\ P_{11}(t)=\frac{\lambda}{\lambda+\mu}+\frac{\mu}{\lambda+\mu} e^{-(\lambda+\mu)t} \\ P_{10}(t) = 1-P_{11}(t) = \frac{\mu}{\lambda+\mu}-\frac{\mu}{\lambda+\mu} e^{-(\lambda+\mu)t} \end{array}\right. \]

Note that the solution converges to some constant as \(t\to\infty\). E.g., \(P_{00}(t) \to \frac{\mu}{\lambda+\mu}\), \(P_{01}(t) \to \frac{\lambda}{\lambda+\mu}\).

In general, it is hard to solve the analytical forms, how to find out the limiting probabilities (or stationary distributions)?

4.4 Limiting Distribution and Stationary Distribution #

Definition A probability distribution \(\pi=(\pi)_{i \in S}\) is said to be a stationary distribution for a CTMC with generator matrix \(Q=(q_{ij})_{i,j \in S}\), if

\(\sum_{i \in S} \pi_i q_{ij} = 0\) for all \(j \in S\). In matrix form: \(\pi Q=0\) (\(\pi\) is a row vector)
\(\sum_{i \in S} \pi_i = 1\)
\(\pi_i\geq 0\) for all \(i \in S\)

Note

Recall that \(P-I\) is the discrete analog of \(Q\), note that \(\pi = \pi P\), so \(\pi(P-I)=0\).

If \(X(0) \sim \pi\) (\(\Pr[X(0)=j]= \pi_j, j \in S\)), then \(X(t) \sim \pi, t>0\).

Proof

Easy to see that if \(X(0) \sim \pi\), then \(X(t) \sim \pi P(t)\). Now prove \(\pi P(t) = \pi\).

This follows from \(\frac{d}{dt}(\pi P(t)) = \pi P’(t) = \pi Q P(t) = 0\), and \(P(0) = I\).

We can actually use \(\pi = \pi P\) as the definition, but we do not because most of the time, \(P(t)\) is unknown or hard to compute.

Example

\(Q=\begin{bmatrix}-\lambda & \lambda \\ \mu & -\mu\end{bmatrix}\), compute \(\pi\).

\[\pi\cdot Q = 0 \implies \begin{bmatrix}\pi_0 & \pi_1\end{bmatrix} \cdot \begin{bmatrix}-\lambda & \lambda \\ \mu & -\mu\end{bmatrix} = 0 \]

Solve and get the same result as the above example:

\[\left\{ \begin{array}{l} \pi_0 = \frac{\mu}{\lambda+\mu} \\ \pi_1 = \frac{\lambda}{\lambda+\mu} \end{array}\right. \]

Question: Is it always the case that \(\lim_{t \to \infty} P_{ij}(t) = \pi_j, i, j \in S\)?

If \(\lim_{t \to \infty} P_{ij}(t)\) exists and equals \(v_j\), then \(v = (v_j)_{j \in S}\) is called the limiting distribution.

Theorem If the limiting distribution \(v = (v_j)_{j \in S}\) exists, then \(v\) is a stationary distribution of the CTMC.

(Note: However, to determine whether it is converged needs more technicalities.)

Proof

Assume \(S\) is finite (to avoid some technicalities).

(1) To prove \(vQ=0\), using the backward equation:

\[P_{ij}’(t)=\sum_{k \in S} q_{ik} P_{kj}(t) \]

Take limits on both sides

\[\lim_{t \to \infty} P_{ij}’(t)=\sum_{k \in S} q_{ik} \lim_{t \to \infty} P_{kj}(t) = \underbrace{\sum_{k \in S} q_{ik}}_{0} v_j = 0\]

According to the forward equation: \[P_{ij}’(t)=\sum_{k \in S} P_{ik}(t) q_{kj}\] Take limits on both sides

\[0 = \lim_{t \to \infty} P_{ij}’(t)=\sum_{k \in S} \lim_{t \to \infty} P_{ik}(t) q_{kj} = \sum_{k \in S} v_k q_{kj}\]

So \(vQ=0\).

(2) To prove \(\sum_{k \in S} v_k = 1\)

\[\sum_{k \in s} v_k = \sum_{k \in S} \lim_{t \to \infty} P_{ik}(t) = \lim_{t\to\infty} \sum_{k \in S} P_{ik}(t) = 1\]

(3) \(v_j\) is the limit of \(P_{ij}(t)\) so is obviously non-negative.

\[0=\sum_{k \in S} \pi_k q_{kj} = \pi_j \underbrace{q_{jj}}_{-v_j} + \sum_{k \neq j} \pi_k q_{kj}\]

so \(\pi_j v_j = \sum_{k \neq j} \pi_k q_{kj}\) (This is called “balance equation”).

(Note that this \(v\) is the leaving rate, not the limiting distribution in the above theorem.)

LHS: the rate at which the CTMC leaves state \(j\) in the long run (“rate out”).

RHS: the rate at which the CTMC enters state \(j\) in the long run (“rate in”).

In some applications, \(\pi_j\) is big, meaning that \(j\) is more important. E.g., in website browsing time, a big \(\pi_j\) means page \(j\) is more popular.

Example

Consider a birth and death process with rate diagram:

“Rate Process Leaves” = “Rate Process Enters”:

State 0: \(\lambda_0 \pi_0 = \mu_1 \pi_1 \Rightarrow \pi_1 = \frac{\lambda_0}{\mu_{1}}\pi_0\)

State 1: \(\lambda_1 \pi_1 + \mu_1 \pi_1 = \lambda_0 \pi_0 + \pi_2\mu_2 \Rightarrow \pi_2 = \frac{\lambda_1}{\mu_2} \pi_1\)

In general, \(\pi_n = \frac{\lambda_0 \lambda_{1} \dots \lambda_{n-1}}{\mu_1 \mu_2 \dots \mu_n} \pi_0\).

Since \(\sum_{i = 0}^{\infty}\pi_i=1\), we have

\[1 = \pi_0 + \pi_0 \sum_{n=1}^{\infty} \frac{\lambda_0 \lambda_1 \dots \lambda_{n-1}}{\mu_1 \mu_2 \dots \mu_n}\]

so \(\pi_0 = (1+\sum_{n=1}^{\infty} \frac{\lambda_0 \lambda_1 \dots \lambda_{n-1}}{\mu_1 \mu_2 \dots \mu_n})^{-1}\)

For a Birth-Death process, the stationary distribution exists if \[k = \sum_{n=1}^{\infty} \frac{\lambda_0 \lambda_1 \dots \lambda_{n-1}}{\mu_1 \mu_2 \dots \mu_n} < +\infty\]

In this case \(\pi_0 = \frac{1}{1+k}\), \(\pi_n = \frac{\lambda_0 \dots \lambda_{n-1}}{\mu_1 \dots \mu_n} \pi_0, n \geq 1\).

Example

Consider a M/M/1 queue, which is a Birth-Death process where \(\lambda_i \equiv \lambda, i=0,1,2, \dots\), \(\mu_i \equiv \mu, i=0,1,2, \dots\).

\(X(t)\): number of customers in the system at time \(t\).

A stationary distribution exists for the CTMC \(\{X(t):t\geq 0\}\) if \(k = \sum_{n=1}^{\infty} \left( \frac{\lambda}{\mu} \right)^n < +\infty\). Or equivalently \(\frac{\lambda}{\mu}<1\).

In this case, \(\pi_0 = \frac{1}{1+k} = \frac{1}{\sum_{n=0}^{\infty} \rho^n}=1-\rho\) (denote \(\rho = \frac{\lambda}{\mu}\), the utilization of the server), \(\pi_n = (1-\rho)\cdot \rho^n, n\geq 0\).

Steady-State Analysis

Average number of customers in the system (\(\E[Q]\))?

\[\sum_{n=0}^{\infty} n \cdot \pi_n = \sum_{n=0}^{\infty} n(1-\rho)\rho^n = \underbrace{\sum_{n=0}^{\infty} n(1-\rho)\rho^{n-1}}_{\text{mean of geometric distribution}} \cdot \rho = \frac{\rho}{1-\rho}\]
Long-run fraction of time the system/server is busy/idle?

Busy (at least one person in queue): \(1-\pi_0 = \rho\).

Idle (no one in queue): \(\pi_0 = 1-\rho\).
Average waiting time of customers (\(\E[W]\))?

According to the Little’s Law, \(\E[Q] = \lambda\E[W]\).

Determining the existence of stationary distribution (\(\lim_{t \to \infty} P_{ij}(t) \stackrel{?}{=} \pi_j\))

A DTMC is ergodic if it is irreducible, aperiodic and positive recurrent.
A CTMC is irreducible if the underlying/embedded DTMC is irreducible.
An irreducible CTMC is positive recurrent \(\Longleftrightarrow\) there exists a stationary distribution \(\pi = (\pi_i)_{i \in S}\) with \(\pi_i>0\) for all \(i \in S\).
If an irreducible CTMC has a finite state space, then it is positive recurrent.

5 Martingales #

5.1 Conditional Expectation and Martingales #

Let \(X\) and \(Y\) be jointly discrete random variables.

\(\Pr[X=x \mid Y=y] = \frac{\Pr[X=x, Y=y]}{\Pr[Y=y]}\), provided that \(\Pr[Y=y]>0\).

\(\E[X \mid Y=y] = \sum_x \Pr[X=x \mid Y=y] \cdot x = h(y)\)

\(\E[X\mid Y] = h(Y)\), a random variable.

Theorem Suppose \(\E[|X|< +\infty]\), then

\[\E[X] = \E[\underbrace{\E[X\mid Y]}_{h(Y)}]\]

In particular, if \(Y\) is a discrete random variable,

\[\E[X] = \sum_y \E[X \mid Y=y] \cdot \Pr[Y=y] = \sum_y h(y) \cdot \Pr[Y=y]\]

Proof

\[\begin{split} & \sum_y \E[X \mid Y=y]\cdot \Pr[Y=y]\\ =&\sum_y \sum_x x\cdot \Pr[X=x \mid Y=y]\cdot \Pr[Y=y]\\ =&\sum_x x \sum_y \Pr[X=x \mid Y=y]\cdot \Pr[Y=y] \text{ (Fubini-Tonelli)}\\ =&\sum_x x\cdot \Pr[X=x]\\ =&\E[X] \end{split} \]

Properties

\(\E[X\mid Y]\geq 0\) (with probability one) if \(x\geq 0\) (with probability one).
\(\E[a \mid Y]=a\) for any constant \(a\).
\(\E[X\mid Y] = \E[X]\) if \(X\) and \(Y\) are independent.
\(\E[X\cdot g(Y) \mid Y] = g(Y)\cdot \E[X\mid Y]\) (“pulling out” constants).
\(\E[\sum_{i=1}^n X_i \mid Y] = \sum_{i=1}^n \E[X_i \mid Y]\).
\(\E[X \mid X, Y] = X\).
\(\E[X \mid W] = \E[\E[X \mid W, Y] \mid W] = \E[\E[X\mid W] \mid W, Y]\) (Tower rule).

Proof of (7)

We prove \(\E[X \mid W=w] = \E[\E[X \mid W=w, Y] \mid W=w]\).

\[\begin{split} \text{RHS} &= \sum_y \E[X \mid W=w, Y=y] \cdot \Pr[Y=y \mid W=w]\\ &= \sum_y \sum_x x\cdot \Pr[X=x \mid W=w, Y=y] \cdot \Pr[Y=y \mid W=w]\\ &= \sum_y \sum_x x\cdot \Pr[X=x, Y=y \mid W=w]\\ &= \sum_x x \sum_y \Pr[X=x, Y=y \mid W=w]\\ &= \sum_x x \Pr[X=x \mid W=w] = \E[X \mid W=w] \end{split} \]

Fact For any random variable \(X\) with \(\E[X]< +\infty\), \(\E[(X-c)^2]\) is minimized at \(c=\E[X]\).

Conditional expectation as orthogonal projection.

\[\left\{ \begin{array}{l} \min \E[(X-\xi)^2] \\ \xi = g(Y) \end{array}\right. \Longrightarrow \xi^{*} = \E[X\mid Y] \text{ (Best prediction)} \]

5.2 Martingales #

Definition A stochastic process \(\{Z_n: n\geq 1\}\) is called a martingale if

\(\E[|Z_n|] < +\infty\) for all \(n\). (This is just a technical requirement.)
\(\E[Z_{n+1} \mid Z_1, \dots, Z_n] = Z_n\) for all \(n\). (This is the main part.)

Lemma For a martingale \(\{Z_n: n\geq1\}\), we have \(\E[Z_n] = \E[Z_1]\) for all \(n\).

Proof

From (2), we have for all \(n\):

\[\E[Z_n] = \E[\E[Z_{n+1} \mid Z_1, \dots, Z_n]] = \E[Z_{n+1}]\]

Example 1 (Random Walk)

Let \(X_1, X_2, \dots\) be i.i.d. random variables with \(\E[X_i]=0\) and \(\E[|X_i|]<+\infty\). Let \(Z_n = \sum_{i=1}^{n} X_i, n\geq 1\). Then \(\{Z_n: n\geq 1\}\) is a martingale.

Proof

Check (1): \(\E[Z_n]= \E[\sum_{i=1}^{n} X_i] \leq \sum_{i=1}^{n} \E[|X_i|] < +\infty\), for all \(n\).

Check (2): \[\begin{split} \E[Z_{n+1} \mid Z_1, \dots, Z_n] &= \E[Z_n + X_{n+1} \mid Z_1, \dots, Z_n]\\ &= \E[Z_n \mid Z_1, \dots, Z_n] + \E[X_{n+1} \mid Z_1, \dots, Z_n]\\ &= Z_n + \E[X_{n+1}]\\ &= Z_n \end{split} \]

(Note: If \(\E[X_i] \neq 0\), this is still a Markov chain, but not a martingale. Martingale models a fair game.)

Example 2 (Geometric Random Walk)

Let \(X_1, X_2, \dots\) be i.i.d. random variables with \(\E[X_i]=1\) and \(\E[|X_i|]<+\infty\). Let \(Z_n = \prod_{i=1}^{n} X_i, n\geq 1\). Then \(\{Z_n: n\geq 1\}\) is a martingale.

Proof

Check (1): \(\E[|Z_n|] = \E[|X_1 \cdot \dots \cdot X_n|] = \prod_{i=1}^n \E[|X_i|] < \infty\), for all \(n\).

Check (2):

\[\begin{split} \E[Z_{n+1} \mid Z_1, \dots, Z_n] &= \E[Z_n \cdot X_{n+1} \mid Z_1, \dots, Z_n]\\ &= Z_n \cdot \E[X_{n+1} \mid Z_1, \dots, Z_n]\\ &= Z_n\cdot \E[X_{n+1}]\\ &= Z_n \end{split} \]

Example 3 (Doob-type martingale)

Let \(X, Y_1, Y_2, \dots\) be random variables with \(\E[|X|] < +\infty\). Let \(Z_n = \E[X \mid Y_1, \dots, Y_n]\) for \(n \geq1\). Then \(\{Z_n: n\geq 1\}\) is a martingale.

Proof

Check (1): \(\E[|Z_n|] = \E[|\E[X \mid Y_1, \dots, Y_n]|] \leq \E[\E[|X| \mid Y_1, \dots, Y_n]] = \E[|X|] < +\infty\).

Inequality is due to \(|\E[X]| \leq \E[|X|]\).

Check (2):

First we need an intermediate step, using the tower rule:

\[\E[Z_{n+1} \mid Y_1, \dots, Y_n] = \E[\E[X\mid Y_1, \dots, Y_{n+1}] \mid Y_1, \dots, Y_n] = \E[X \mid Y_1, \dots, Y_n] = Z_n\]

Then

\[\E[Z_{n+1} \mid Z_1, \dots, Z_n] = \E[\E[Z_{n+1} \mid Y_1, \dots, Y_n] \mid Z_1, \dots, Z_n] = \E[Z_n \mid Z_1, \dots, Z_n] = Z_n\]

(If \(Y_1, \dots, Y_n\) is known, then \(Z_1, \dots, Z_n\) is known, so \(Z_1, \dots, Z_n\) have less information than \(Y_1, \dots, Y_n\).)

5.3 Stopping Times and the Martingale Stopping Theorem #

Definition A positive, integer-valued, possibly infinite random variable \(N\) is said to be a random time for the process \(\{Z_n, n\geq 1\}\) if the event \(\{N=n\}\) is determined by \(Z_1, \dots, Z_n\) for any \(n\). (Intuition: can’t look into the future.)

If \(\Pr[N<\infty]=1\), then the random time \(N\) is called a stopping time.

Example

\(N=\inf\{n\geq1 : Z_n=c\}\) for a constant \(c\). (\(N\) is the hitting time when the process first hits \(c\).)

\(\{N=n\} \Longleftrightarrow \{Z_1 \neq c, Z_2 \neq c, \dots, Z_{n-1}\neq c, Z_n=c\}\), so \(N\) is a random time.

Counter-Example

\(N=\sup\{n\geq1 : Z_n=c\}\)

\(\{N=1\} \Longleftrightarrow \{Z_1=c, Z_k\neq c, \text{ for } k\geq 2\}\), this is not determined by only \(Z_1\), you need information about the future, so \(N\) is NOT a random time.

Let \(N\) be a random time for \(\{Z_n: n\geq 1\}\). Define \[\overline{Z}_n = Z_{\min\{n, N\}}=\left\{ \begin{array}{l@{\quad:\quad}l} Z_n & \text{if } n\leq N\\ Z_N & \text{if } n> N \end{array}\right. \] \(\{\overline{Z}_n: n\geq1\}\) is called the stopped process of \(\{Z_n: n\geq 1\}\).

In the above example, if \(c=1\), and the process \(Z_n\) is as the figure below, then \(N=\inf\{n\geq1 : Z_n=1\} = 2\).

Therefore \(\overline{Z}_1 = Z_1 = 0\), \(\overline{Z}_2 = Z_2 = 1\), \(\overline{Z}_3 = Z_2 = 1\), etc.

(Why this is important? Because of the following result.)

Proposition If \(N\) is a random time for a martingale \(\{Z_n: n\geq 1\}\), then the stopped process \(\{\overline{Z}_n :n\geq 1\}\) is also a martingale.

Proof

Define \(I_n=\1_{n\leq N}\), which is determined by \(Z_1, \dots, Z_{n-1}\).

(Note that the subscript is \(n-1\) instead of \(n\). An event being determined is equivalent to its complement being determined. The complement event of \(\{n\leq N\}\) is \(\{N\leq n-1\}\) because N is integer. \(\{N\leq n-1\} = \bigcup_{k=1}^{n-1} \{N=k\}\) is determined by \(Z_1, \dots, Z_{n-1}\).)

One can verify \(\overline{Z}_n = \overline{Z}_{n-1} + I_n\cdot(Z_n-Z_{n-1})\). (Easy exercise, skipped)

Now prove an intermediate step: \(\E[Z_n \mid Z_1, \dots, Z_{n-1}] = \overline{Z}_{n-1}\):

\[\begin{split} \E[\overline{Z}_n \mid Z_1, \dots, Z_{n-1}] &= \E[\overline{Z}_{n-1} + I_n(Z_n-Z_{n-1}) \mid Z_1, \dots, Z_{n-1}] \\ &= \E[\overline{Z}_{n-1} \mid Z_1, \dots, Z_{n-1}] + \E[I_n(Z_n-Z_{n-1}) \mid Z_1, \dots, Z_{n-1}] \\ &= \overline{Z}_{n-1} + I_n\E[Z_n-Z_{n-1} \mid Z_1, \dots, Z_{n-1}] \\ &= \overline{Z}_{n-1} + I_n(\underbrace{\E[Z_n \mid Z_1, \dots, Z_{n-1}] - Z_{n-1}}_{0}) \\ &= \overline{Z}_{n-1} \end{split} \]

The third equation holds because \(\overline{Z}_{n-1}\), \(I_n\) are determined by \(Z_1, \dots, Z_{n-1}\).

\[\begin{split} &\E[\overline{Z}_n \mid \overline{Z}_1, \dots, \overline{Z}_{n-1}]\\ =& \E[\E[\overline{Z}_n \mid Z_1, \dots, Z_{n-1}] \mid \overline{Z}_1, \dots, \overline{Z}_{n-1}]\\ =& \E[\overline{Z}_{n-1} \mid \overline{Z}_1, \dots, \overline{Z}_{n-1}] = \overline{Z}_{n-1} \end{split} \]

The first equality holds because the stopped process contains less information than the original process.

Theorem (Martingale Stopping Theorem/Optional Sampling Theorem)

Let \(N\) be a stopping time for the martingale \(\{Z_n : n\geq 1\}\). If either

\(\{\overline{Z}_n : n\geq 1\}\) are uniformly bounded, or
\(N\) is bounded, or
\(\E[N]<+\infty\) and there exists \(M<\infty\) such that \(\E[|Z_{n+1}-Z_n| \mid Z_1, \dots, Z_n] \leq M\).

Then \(\E[Z_N] = \E[Z_1]\).

Proof

Assume (1)

We know \(\E[\overline{Z}_n] = \E[\overline{Z}_1] = \E[Z_1], n\geq 1\) (first equality: stopped process is martingale, second equality: \(\overline{Z}_1 = Z_{\min\{1, N\}} = Z_1\)).

Since \(N\) is a stopping time, so \(\Pr[N< \infty] = 1\), hence as \(n \to \infty\)

\[\overline{Z}_n = Z_{\min\{n, N\}} \to Z_N \text{ with probability 1}\]

So

\[\begin{split} \E[Z_N] &= \E[\lim_{n\to\infty} \overline{Z}_n]\\ &= \lim_{n\to\infty} \E[\overline{Z}_n]\\ &= \E[Z_1] \end{split} \]

The second equality is due to Dominated Convergence Theorem (DCT), since \(\overline{Z}_n\) is bounded, i.e., \(|\overline{Z}_n| \leq k, \forall n\geq1\).

Example (Gambler’s Ruin Problem, P186 in the textbook)

Consider a simple symmetric random walk \(\{Z_n: n\geq 0\}\) with \(Z_0=i>0\).

Define \(N=\inf\{n\geq0 : Z_n= M \text{ or } Z_n=0\}\), where \(M>i\).

Compute \(\Pr[Z_N=M]\).

(1) Verify \(N\) is a stopping time. (exercise, solution)

(2) \(\E[Z_N] = \E[Z_0]=i \Longrightarrow M\Pr[Z_N=M] + 0 \Pr[Z_N=0]=i\), so \(\Pr[Z_N=M] = \frac{i}{M}\).

The first equation is due to the Martingale Stopping Theorem, because \(|\overline{Z}_n| \leq M, n\geq 0\) (uniformly bounded holds).

5.4 Azuma’s Inequality for Martingales #

Theorem Let \(\{Z_n: n\geq1\}\) be a martingale with \(\mu=\E[Z_n]\). Let \(Z_0=\mu\). Suppose that there exists constants \(\alpha_i, \beta_i \geq 0\) such that

\[-\alpha_i \leq Z_i- Z_{i-1}\leq \beta_i, i\geq1\]

Then for any \(n\geq1, a>0\) we have

\(\Pr[Z_n-\mu\geq a]\leq \exp \left( -\frac{2a^2}{\sum_{i=1}^{n} (\alpha_i + \beta_i)^2} \right)\)
\(\Pr[Z_n-\mu < -a]\leq \exp \left( -\frac{2a^2}{\sum_{i=1}^{n} (\alpha_i + \beta_i)^2} \right)\)

(Proof is in the textbook and so skipped)

\(\{Z_i-Z_{i-1}, i\geq1\}\) is called a martingale difference sequence.

Note

There is also a weaker version of Azuma’s inequality, where the assumption is that \(|Z_i-Z_{i-1}| \leq c_i\), i.e., the interval must be symmetric. This will lead to moving the constant 2 to the denominator.

The weaker version is also quite common, e.g., on the Wikipedia. Don’t be confused.

Corollary (Hoeffding Inequality)

Suppose \(Z_n = \sum_{i=1}^{n} X_i\) where \((X_i)_{i\geq 1}\) are independent random variables with \(E[X_i]=0\) and \(\E[|X_i|]<\infty\) for all \(i\).

Suppose \(-\alpha_i \leq X_i \leq \beta_i, i\geq1\), then for \(a>0\)

\[\Pr \left[ \frac{1}{n} \sum_{i=1}^{n} X_i \geq a \right] \leq \exp \left( \frac{-2a^2n^2}{\sum_{i=1}^{n} (\alpha_i + \beta_i)^2} \right)\] \[\Pr \left[ \frac{1}{n} \sum_{i=1}^{n} X_i < -a \right] \leq \exp \left( \frac{-2a^2n^2}{\sum_{i=1}^{n} (\alpha_i + \beta_i)^2} \right)\]

5.5 Martingale Convergence Theorem #

Definition A stochastic process \(\{Z_n: n\geq1\}\) with \(\E[|Z_n|]<\infty\) for all \(n\geq1\) is called a submartingale if

\[\E[Z_{n+1} \mid Z_1, \dots, Z_n]\geq Z_n\]

(So \(\E[Z_{n+1}]\geq \E[Z_n]\))

and a supermartingale if

\[\E[Z_{n+1} \mid Z_1, \dots, Z_n] \leq Z_n\]

(So \(\E[Z_{n+1}] \leq \E[Z_n]\))

Theorem If \(N\) is a stopping time for \(\{Z_n: n\geq1\}\) such that any one of the three conditions of Martingale Stopping Theorem holds, then \(\E[Z_N] \geq \E[Z_1]\) for a submartingale. \(\E[Z_N] \leq \E[Z_1]\) for a supermartingale.

(Proof skipped, similar to the Martingale Stopping Theorem)

Lemma If \(\{Z_i: i\geq 1\}\) is a submartingale and \(N\) is a stopping time such that \(\Pr[N\leq n]=1\) for a given \(n\), then

\[\E[Z_1]\leq \E[Z_N]\leq \E[Z_n]\]

Proof

Since \(N\) is bounded, so we can apply the above theorem to get the first inequality.

For the second inequality,

\[\begin{split} \E[Z_n]&=\E[\E[Z_n \mid Z_1, \dots, Z_N]]\\ &= \sum_{k=1}^{n} \E[\E[Z_n \mid Z_1, \dots, Z_N] \mid N=k] \Pr[N=k]\\ &= \sum_{k=1}^{n} \E[\E[Z_n \mid Z_1, \dots, Z_k] \mid N=k] \Pr[N=k]\\ &\geq \sum_{k=1}^{n} \E[Z_k \mid N=k] \Pr[N=k] \text{ (submartingale property)}\\ &= \E[Z_N] \end{split} \]

(Note that for the submartingale property \(\E[Z_{n+1} \mid Z_1, \dots, Z_n]\geq Z_n\), this is actually true that \(\E[Z_n \mid Z_1, \dots, Z_k]\geq Z_k\), for all \(k < n\). For a martingale, the equality also holds (Exercise 6.1).)

Lemma If \(\{Z_n: n\geq1\}\) is a martingale and \(f\) is a convex function with \(\E[f(Z_n)]<\infty\) for all \(n\), then \(\{f(Z_n): n\geq1\}\) is a submartingale.

Proof

\[\begin{split} \E[f(Z_{n+1}) \mid Z_1, \dots, Z_n] &\geq f(\E[Z_{n+1} \mid Z_1, \dots, Z_n]) \text{ (Jensen’s inequality for conditional expectation)}\\ &= f(Z_n) \end{split} \]

\[\begin{split} &\E[f(Z_{n+1}) \mid f(Z_1), \dots, f(Z_n)]\\ =& \E[\E[f(Z_{n+1}) \mid Z_1, \dots, Z_n] \mid f(Z_1), \dots, f(Z_n)] \text{ (Tower property)}\\ \geq& \E[f(Z_n) \mid f(Z_1), \dots, f(Z_n)]\\ =& f(Z_n) \end{split} \]

Lemma (Kolmogorov’s Inequality for Submartingales)

If \(\{Z_n : n\geq1\}\) is a non-negative submartingale, then for all \(a>0\)

\[\Pr[\max\{Z_1, \dots, Z_n\}>a] \leq \frac{\E[Z_n]}{a}\]

(Cannot be directly proved by Markov Inequality, as expectation and max cannot be swapped: \(\E[\max\{Z_1, \dots, Z_n\}] \neq \max\{\E[Z_1], \dots, \E[Z_n]\}\).)

Proof

Define \(N=\inf \{i: Z_i>a, i\leq n\}\). If \(Z_i\leq a\) for all \(i\leq n\), we let \(N=n\). Then \(N\) is a stopping time with \(\Pr[N\leq n]=1\). (\(N\) is the time that the process first exceeds \(a\).)

Then \(\Pr[\max \{Z_1, \dots, Z_n\} >a] = \Pr[Z_N>a] \leq \frac{\E[Z_N]}{a} \leq \frac{\E[Z_n]}{a}\).

(First inequality: Markov Inequality, second inequality: submartingale property)

Corollary Let \(\{Z_n: n\geq1\}\) be a martingale, then for all \(a>0\),

\(\Pr[\max\{|Z_1|, \dots, |Z_n|\}>a]\leq \frac{\E[|Z_n|]}{a}\)
\(\Pr[\max\{|Z_1|, \dots, |Z_n|\}>a]\leq \frac{\E[Z_n^2]}{a^2}\)

Proof

(1) Since \(f(x)=|x|\) is convex, so \(\{|Z_i|: i\geq1\}\) is a non-negative submartingale. This follows Kolmogorov’s Inequality for Submartingales.

(2) Since \(f(x)=x^2\) is convex, and \(\Pr[\max\{|Z_1|, \dots, |Z_n|\}>a] = \Pr[\max\{Z_1^2, \dots, Z_n^2\}>a^2]\).

Theorem (Martingale Convergence Theorem)

If \(\{Z_n : n\geq1\}\) is a martingale such that for some \(M<\infty\), \(\E[|Z_n|]\leq M\) for all \(n\), then with probability one, \(\lim_{n\to \infty} Z_n\) exists and is finite.

(Note: \(\{Z_n\}\) converges “almost surely” (See definition in Strong Law of Large Numbers (SLLN)). Intuitively, it means for almost every \(\omega \in \Omega\), \(\lim_{n\to\infty} Z_n(\omega) = Z_{\infty}(\omega)\).)

Proof

Assume that \(\E[Z_n^2]\leq M\) for all \(n\). (This is a stronger assumption, to make the proof easier.)

\(\{Z_n^2:n\geq1\}\) is a submartingale. So \(\E[Z_n^2]\) is non-decreasing in \(n\). So \(\lim_{n\to\infty} \E[Z_n^2] = \mu\) exists.

We want to show that \(\{Z_n: n\geq1\}\) is a Cauchy Sequence with probability 1.

That is, with probability 1, \(|Z_{m+k}-Z_m| \to 0\), as \(m,k \to\infty\) (for every \(\omega \in \Omega\)).

(Because \(X_n \stackrel{\text{a.s.}}{\to} X\) if and only if \(\{X_n\}\) is a Cauchy Sequence with probability 1, if and only if \(\lim_{n\to\infty} \Pr[\sup_{k>0} |X_{n+k}-X_n|>\epsilon]=0\). See proof in Here.)

Now fix \(\epsilon>0\) and \(n>1\),

\[\begin{split} &\Pr[|Z_{m+k}-Z_m| > \epsilon \text{ for some } k\leq n]\\ \leq&\frac{1}{\epsilon^2} \E\left[|Z_{m+n} -Z_m|^2\right] \text{ (Corollary (2))}\\ =& \frac{1}{\epsilon^2} \left( \E\left[Z_{m+n}^2\right] + \E\left[Z_m^2\right] -2\E[Z_{m+n}\cdot Z_m] \right)\\ =& \frac{1}{\epsilon^2} \left( \E\left[Z_{m+n}^2\right] -\E\left[Z_m^2\right] \right) \end{split} \]

The last equality is due to:

\[\begin{split} \E[Z_{m+n}\cdot Z_m] &= \E[\E[Z_{m+n}\cdot Z_m \mid Z_1, \dots, Z_m]]\\ &= \E[Z_m \cdot \E[Z_{m+n} \mid Z_1, \dots, Z_m]]\\ &= \E[Z_m^2] \text{ (Exercise 6.1)} \end{split} \]

Sending \(n \to \infty\) on both sides, we have

\[\Pr[|Z_{m+k}-Z_m| > \epsilon \text{ for some } k] \leq \frac{1}{\epsilon^2} \left[ \mu - \E[Z_m^2] \right]\]

Sending \(m\to\infty\) on both sides, we get

\[\Pr[|Z_{m+k}-Z_m| > \epsilon \text{ for some } k] \to 0 \text{ as } m\to\infty\]

Then with probability 1, \(\{Z_n: n\geq1\}\) is a Cauchy sequence, because \(\Pr[\sup_{k}{|Z_{m+k}-Z_m|}>\epsilon] \to 0\) as \(m\to\infty\).

Denote \(Z_{\infty} = \lim_{n\to\infty} Z_n\), then \(\E[|Z_{\infty}|]\leq M\), because

\[\E[|Z_{\infty}|] =\E\left[\lim_{n\to\infty} |Z_n|\right] \leq \liminf_{n\to\infty} \E[|Z_n|]\leq M\]

(The inequality is Fatou’s Lemma)

6 Brownian Motion #

6.1 Definition of Brownian Motion #

Definition A continuous-time stochastic process \(\{X(t): t\geq 0\}\) is called a Brownian motion if

\(X(0)=0\).
\(\{X(t): t\geq 0\}\) has stationary and independent increments.
For each fixed \(t\), \(X(t)\) follows a Normal distribution with mean 0 and variance \(\sigma^2 t\).

When \(\sigma=1\), \(\{X(t): t\geq0\}\) is called a standard Brownian motion. We will suppose throughout that \(\sigma=1\).

Properties

With probability one, \(\{X(t): t\geq0\}\) has continuous sample paths but the sample paths are nowhere differentiable.

(Proof is skipped.)
\(\{X(t): t\geq0\}\) is a Markov process.

Proof

\[\begin{split} &\Pr[X(t+s)\leq a \mid X(s)=x, X(u)=x(u), 0< u <s]\\ =&\Pr[X(t+s)-X(s)\leq a-x \mid X(s)=x, X(u)=x(u), 0< u <s]\\ =&\Pr[X(t+s)-X(s)\leq a-x \mid X(s)=x] \text{ (independent increments)}\\ =&\Pr[X(t+s)\leq a \mid X(s)=x] \end{split} \]
The joint distribution of \((X(t_1), \dots, X(t_n))\) for \(t_1<t_2< \dots < t_n\) is multivariate normal.

Proof

Since \(X(t) \sim N(0,t)\), and its density is \(f_t(x)=\frac{1}{\sqrt{2\pi t}} e^{-\frac{x^2}{2t}}\).

The event \(\{X(t_1)=x_1, \dots, X(t_n)=x_n\}\) can be rewritten in terms of independent increment events:

\[\{X(t_1)=x_1, X(t_2)-X(t_1)=x_2-x_2, \dots, X(t_n)-X(t_{n-1})=x_n-x_{n-1}\}\]

The joint distribution is a multivariate Normal because

\[(\underbrace{X(t_1)}_{\sim N(0,t_1)}, \underbrace{X(t_2)-X(t_1)}_{\sim N(0,t_2-t_1)}, \dots, \underbrace{X(t_n)-X(t_{n-1})}_{\sim N(0,t_n-t_{n-1})}) \sim N(\vec{0}, \Sigma)\]

And the covariance matrix is

\[\Sigma = \begin{bmatrix}t_1& & &\\ & t_2-t_1&& \\ & & \ddots & \\ & & & t_n-t_{n-1}\end{bmatrix}\]

So \((X(t_1), \dots, X(t_n))\) is a multivariate Normal, and the joint density function is \[f(x_1, \dots, x_n)=f_{t_1}(x_1) \cdot f_{t_2-t_1}(x_2-x_1) \cdot \dots \cdot f_{t_n-t_{n-1}}(x_n-x_{n-1})\]
\(\Cov(X(s), X(t))=\min\{s,t\}\).

Proof

Assume \(s<t\),

\[\begin{split} \Cov(X(s), X(t)) &= \Cov(X(s), X(t)-X(s)+X(s))\\ &= \Cov(X(s),X(s)) + \Cov(X(s), X(t)-X(s))\\ &= s+ 0\\ &=s \end{split} \]

Brownian motion is a kind of random walk. Let \(t=n \Delta t\), then

\[X(t)=\underbrace{X(n\Delta t)-X((n-1)\Delta t)}_{\sim N(0, \Delta t)} + \dots + \underbrace{X(\Delta t) -X(0)}_{\sim N(0, \Delta t)}\]

All increments follow i.i.d. \(N(0, \Delta t)\), so it is a random walk.

Brownian motion is a special case of Gaussian processes.

Similar to Gaussian distribution, there is a Central Limit Theorem for Gaussian processes.

Example

Compute the conditional distribution/density of \(X(s)\) given \(X(t)=b\) where \(s<t\).

For fixed \(t\), \(X(t) \sim N(0,t)\), the density of \(X(t)\) is \(f_t(x) = \frac{1}{\sqrt{2\pi t}} e^{-\frac{x^2}{2t}}\).

The conditional density is

\[\begin{split} &\Pr[X(s)\approx x \mid X(t)=b]\\ =& \frac{\Pr[X(s)\approx x, X(t) \approx b]}{\Pr[X(t)\approx b]}\\ =& \frac{\Pr[X(s)\approx x, X(t)-X(s) \approx b-x]}{\Pr[X(t)\approx b]}\\ =&\frac{f_s(x) \cdot f_{t-s}(b-x)}{f_t(b)}\\ =& \frac{\frac{1}{\sqrt{2\pi s}} e^{-\frac{x^2}{2s}} \frac{1}{\sqrt{2\pi (t-s)}} e^{-\frac{(b-x)^2}{2(t-s)}}}{\frac{1}{\sqrt{2\pi t}} e^{-\frac{x^2}{2t}}}\\ =& \frac{1}{\sqrt{2\pi \frac{s}{t} (t-s)}} e^{-\frac{(x-\frac{bs}{t})^2}{2 \frac{s}{t}(t-s)}} \end{split} \]

This is a Normal distribution with \(\E[X(s) \mid X(t)=b] = b\cdot \frac{s}{t}\) and variance \(\Var[X(s) \mid X(t)=b]=\frac{s}{t}(t-s)\).

So given \(X(t)=b\), the average position of the Brownian motion at time \(s\) is a point between the origin and \((t, b)\).

6.2 Hitting Times of Brownian Motions #

Given a (standard) Brownian motion \(\{X(t) : t\geq0\}\). For \(a>0\).

Define \(T_a=\inf\{t\geq0 : X(t)=a\}\).

We want to compute the distribution of \(T_a\).

\[\begin{split} \Pr[X(t)\geq a] &= \underbrace{\Pr[X(t)\geq a \mid T_a\leq t]}_{=\frac{1}{2}}\Pr[T_a \leq t] + \underbrace{\Pr[X(t)\geq a \mid T_a> t]}_{=0}\Pr[T_a > t] \\ &= \frac{1}{2} \Pr[T_a \leq t] \end{split} \]

\(\Pr[X(t)\geq a \mid T_a > t] = 0\) because the sample path is continuous.

\(\Pr[X(t)\geq a \mid T_a\leq t] = \frac{1}{2}\) because of the “reflection principle”.

\[\Pr[T_a\leq t] = 2\Pr[X(t)\geq a]=2 \frac{1}{\sqrt{2\pi t}} \int_{a}^{\infty} e^{-\frac{x^2}{2t}} \dd{x}\]

We can also change variable: let \(y=\frac{x}{\sqrt{t}}\), then

\[\Pr[T_a\leq t] = \sqrt{\frac{2}{\pi}} \int_{\frac{a}{\sqrt{t}}}^{\infty} e^{-\frac{y^2}{2}} \dd{y}\]

We can also get

\[\Pr[T_a < \infty] = \lim_{t\to \infty} \Pr[T_a \leq t] = \sqrt{\frac{2}{\pi}} \int_{0}^{\infty} e^{-\frac{y^2}{2}} \dd{y} = 1\]

Then we compute \(\E[T_a]\).

\[\begin{split} \E[T_a] &= \int_{0}^{\infty} \Pr[T_a > t] \dd{t} = \int_{0}^{\infty} (1-\sqrt{\frac{2}{\pi}} \int_{\frac{a}{\sqrt{t}}}^{\infty} e^{-\frac{y^2}{2}} \dd{y} ) \dd{t}\\ &= \int_{0}^{\infty} \sqrt{\frac{2}{\pi}} \int_{0}^{\frac{a}{\sqrt{t}}} e^{-\frac{y^2}{2}} \dd{y} \dd{t}\\ &= \sqrt{\frac{2}{\pi}} \int_{0}^{\infty} e^{-\frac{y^2}{2}} \int_{0}^{\frac{a^2}{y^2}} \dd{t} \dd{y}\\ &= \sqrt{\frac{2}{\pi}} \int_{0}^{\infty} e^{-\frac{y^2}{2}} \frac{a^2}{y^2} \dd{y}\\ &= \sqrt{\frac{2}{\pi}} a^2 \int_{0}^{\infty} \frac{e^{-\frac{y^2}{2}}}{y^2} \dd{y}\\ &\geq \sqrt{\frac{2}{\pi}}a^2 \int_{0}^{1} \frac{e^{-\frac{1}{2}}}{y^2} \dd{y}\\ &= \infty \end{split} \]

Note

Brownian motion is symmetric, so for \(a<0\),

\[\Pr[T_a \leq t] = \Pr[T_{-a}\leq t] = \sqrt{\frac{2}{\pi}} \int_{\frac{|a|}{\sqrt{t}}}^{\infty} {e^{-\frac{y^2}{2}}} \dd{y}\]

We can compute the maximum of a Brownian motion. For \(a>0\),

\[\Pr\left[\max_{0\leq s\leq t} X(s) \geq a\right] = \Pr[T_a \leq t]\]

For a simple random walk, compute \(T_1=\inf\{n\geq0 : S_n=1\}\). (exercise)

6.3 Variation on Brownian Motions #

Brownian Motion reflected at the origin #

If \(\{X(t):t\geq0 \}\) is a Brownian motion, then \(\{Z(t):t\geq0\}=\{|X(t)|:t\geq0\}\) is called a Brownian Motion reflected at the origin.

Compute the distribution of \(Z(t)\):

For \(y \geq0\), \[\begin{split} \Pr[Z(t)\leq y] &= \Pr[|X(t)| \leq y]\\ &= \Pr[X(t)\leq y] - \Pr[X(t)\leq -y]\\ &= \Pr[X(t)\leq y] - \Pr[X(t)\geq y]\\ &= \Pr[X(t)\leq y] - (1-\Pr[X(t)\leq y])\\ &= 2\Pr[X(t) \leq y] -1\\ &= \sqrt{\frac{2}{\pi t}} \int_{-\infty}^{y} e^{-\frac{x^2}{2t}} \dd{x} -1 \end{split} \]

\[\E[Z(t)]=\E[|X(t)|] = \sqrt{\frac{2t}{\pi}}\].

Geometric Brownian Motion #

If \(\{X(t) :t\geq0\}\) is a Brownian Motion, then \(\{Y(t):t\geq0\} = \{e^{X(t)}:t\geq0\}\) is called a geometric Brownian motion.

The moment generating function of \(X(t)\) is

\[\E[e^{s\cdot X(t)}] = e^{-\frac{ts^2}{2}}, s \in \RR\]

\[\E[Y(t)] = \E[e^{X(t)}]= e^{\frac{t}{2}}\]

\[\Var[Y(t)] = \E\left[Y(t)^2\right] -\left(e^{\frac{t}{2}}\right)^2= \E\left[e^{2X(t)}\right] - e^t = e^{2t}-e^t\]

Links to this note

← → Top