For any sequence of events \(\left\{ E_i \right\}_{i=1}^{\infty}\) which are mutually exclusive (i.e., \(E_i \cap E_j=\emptyset\)), then
\[\Pr \left[ \bigcup_{i=1}^{\infty} E_i \right] = \sum_{i=1}^{\infty} \Pr[E_i]\]
The joint CDF of two R.V.s \(X\) and \(Y\) is given by:
\[F(x,y)=\Pr[X\leq x,Y\leq y], x,y \in \RR \]
where \(\Pr[X\leq x, Y\leq y]=\Pr[\left\{ X\leq x \right\} \cap \left\{ Y\leq y \right\}\).
The expectation of a R.V. is defined by:
\[\E[X]=\int_{-\infty}^{+\infty}x \dd{F(x)} = \left\{ \begin{aligned}
&\sum_x x \Pr[X=x] &\text{if } X \text{ is discrete}\\
&\int_{-\infty}^{+\infty}x f(x)\dd{x} &\text{if } X \text{ is continuous} \end{aligned} \right.
\]
whenever the integral exists.
For a function \(h: \RR \to \RR \):
\[\E[h(X)]=\int_{-\infty}^{+\infty}h(x) \dd{F(x)} = \left\{ \begin{aligned}
&\sum_x h(x) \Pr[X=x] &\text{if } X \text{ is discrete}\\
&\int_{-\infty}^{+\infty}h(x) f(x)\dd{x} &\text{if } X \text{ is continuous} \end{aligned} \right.
\]
For a function \(h: \RR^n \to \RR \):
\[\E[h(X_1,X_2, \dots , X_n)] = \int_{\RR^n}h(x_1,x_2, \dots ,x_n)\dd{F(x_1, \dots , x_n)}\]
If \(X\) and \(Y\) are independent, then \(\Cov[X,Y]=0\), \(\E[XY]=\E[X]\E[Y]\).
Note
Covariance only measures one particular type of dependence, therefore these two are not equivalent. If two R.V.s are independence, their covariance is 0, but having a covariance of 0 does not imply the variables are independent.
And \(\Psi_X’’(0) = \E[X^2]\), which is the second moment of \(X\).
In general, we have \(\Psi_X^{(n)}(0)=\E \left[ X^n \right]\), the n-th moment of \(X\).
Theorem: If the MGF \(\Psi_X(t), \Psi_Y(t)\) are finite for all \(t\in \RR \) and \(\Psi_X(t)=\Psi_Y(t)\) for all \(t\in \RR \), then
\[\Pr[X\leq x] = \Pr[Y\leq x] \text{ for all } x \in \RR \]
Therefore, when a MGF exists, it uniquely determines a distribution.
Exercise
If \(X_1 \sim N(\mu_1, \sigma_1^2), X_2 \sim N(\mu_2, \sigma_2^2)\) and \(X_1,X_2\) are independent, then \(X_1+X_2\sim N(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)\)
Proof:
First prove the MGF of the normal distribution \(N(\mu,\sigma^2)\) is \(e^{t\mu + \frac{1}{2}\sigma^2t^2}\):
A continuous R.V. \(X\) is said to have exponential distribution with parameter \(\lambda>0\) if its PDF is:
\[f(x)=\left\{\begin{aligned}
& \lambda e^{-\lambda x} &x>0\\
& 0 & \text{otherwise}
\end{aligned}\right.
\]
A stochastic process \(X=\{X_t: t\in T\}\) is a collection of random variables defined on the same probability space.
\(T\): Index set, time \(t\in T\)
\(X_t\): State of the process at time \(t\)
\(X\) is called a discrete-time process if \(T\) is countable, e.g., \(T=\{1,2,3, \dots \}\). In this case, \(X=\{X_n:n=0,1,2, \dots \}\).
\(X\) is called a continuous-time process if \(T\) is continuous, e.g., \(T=[0,+\infty)\). In this case, \(X=\{X_t: t\geq 0\}\).
You can think that a process denotes an evolution of some system in time.
Since \(X_t: \Omega\to \RR \), therefore \(X: \Omega \times T\to \RR \). So for a fixed \(\omega \in \Omega\), \(X_t(\omega)\) becomes a function of \(t\in T\), called a sample path of \(X\).
Example (Random Walk)
Let \(\{\xi_i:i=1,2, \dots \}\) be a sequence of i.i.d. R.V.s. Define \(S_0=0\), and \(S_n = \sum_{i=1}^n \xi_i\), for \(n\geq 1\).
Then \(\{S_n:n=0,1,2, \dots \}\) is called a random walk.
(P.S.: \(\xi_i\) can have different distribution.)
When \(\xi_i\) has distribution \(\Pr[\xi_i=1]=\Pr[\xi_i=-1]=\frac{1}{2}\), \(\{S_n:n=0,1,2, \dots \}\) is called a simple symmetric random walk.
A stochastic process \(X\) is said to have independent increments, if for any \(0\leq t_0<t_1<\dots <t_n\), \(X(t_1)-X(t_0), X(t_2)-X(t_1), \dots ,X(t_n)-X(t_{n-1})\) are (mutually, not just pairwise) independent, for all \(n\geq 1\).
\(X\) is said to have stationary increments if for any fixed \(t>0\), \(X(t+s)-X(s)\) have the same distribution, for all \(s\geq 0\). (I.e., the distribution depends only on the length of the time interval \(t\).)
A stochastic process \(N=\{N(t):t\geq 0\}\) is called a counting process if \(N(t)\) represents the number of events occurred up to time \(t\) (including \(t\)).
A counting process satisfies:
\(N(0)=0\)
\(N(t)\) is integer-valued
\(N(s)\leq N(t)\) for \(s\leq t\)
For \(s < t\), \(N(t)-N(s)\) equals the number of events that have occurred in the interval \((s.t]\).
A sample path of Poisson process is a right-continuous function, and has left limits:
A counting process \(N=\{N(t): t\geq 0\}\) is called a Poisson process if:
\(N(0)=0\)
\(N\) has independent increments
For any \(s,t\geq 0\), \(\Pr[N(t+s)-N(s)=n] = \frac{(\lambda t)^n}{n!}e^{-\lambda t}\), for \(n=0,1,2, \dots \). I.e., \(N(t+s)-N(s)\) has a Poisson distribution with mean \(\lambda t\).
\(\lambda\): rate of arrival (events occurring)
\[\E[N(t)]=\lambda t \Longrightarrow \lambda = \frac{\E[N(t)]}{t}\]
A counting process \(N=\{N(t):t\geq 0\}\) is called a Poisson process with rate \(\lambda>0\), if
\(N(0)=0\)
\(N\) has stationary and independent increments
As \(h\to 0\), \(\Pr[N(h)=0]=1-\lambda h +o(h)\), \(\Pr[N(h)=1]= \lambda h +o(h)\), \(\Pr[N(h)\geq2]=o(h)\) (infinitesimal behavior)
2.4 Two Equivalent Definitions of Poisson Process #
Proof (Definition 1 => Definition 2)
From \(\Pr[N(t+s)-N(s)=n] = \frac{(\lambda t)^n}{n!}e^{-\lambda t}\), it is easy to see the distribution of \(N(t+s)-N(s)\) is irrelevant with \(s\), so N is stationary.
Let \(s, n = 0\), we get \(\Pr[N(t)=0] = e^{-\lambda t}\). Denote \(P_0(t)\triangleq \Pr[N(t)=0]\), then
\(\Pr[N(h)=0]= e^{-\lambda h} = 1-\lambda h +o(h)\), this is from the Taylor Expansion of \(e^x\).
\(\Pr[N(h)=1]= \lambda h e^{-\lambda h} = \lambda h (1-\lambda h +o(h)) = \lambda h + o(h)\)
For \(n=1\), we have \(P_1’(t)=-\lambda P_1(t) + \lambda e^{-\lambda t}\), this is a First-order Linear Differential Equation with Constant Coefficients.
Solve it and get \(P_1(t) = \lambda t e^{-\lambda t}\)
Then prove by induction (details are skipped), get \(P_n(t) = \frac{(\lambda t)^n}{n!}e^{-\lambda t}\).
According to the stationary condition, we have \(\Pr[N(t+s)-N(s)=n] = \frac{(\lambda t)^n}{n!}e^{-\lambda t}\).
In the above we consider Poisson process in terms of counting. Now we change our perspective: in terms of arrival times.
Consider a Poisson process \(\{N(t), t\geq 0\}\) with rate \(\lambda\geq0\). Denote \(X_1\) the arrival time of the first event, \(X_n\) the time between the \(n-1\)-th arrival and \(n\)-th arrival.
\((X_n, n=1,2, \dots)\) is called the sequence of interarrival times.
Also we compute the partial sum \(S_n\): the arrival time of n-th event.
\[S_n = \sum_{i=1}^n X_i, n=1,2, \dots \]
What are the distributions of \(X_n\) and \(S_n\)?
Given a sequence of i.i.d. exponential random variables \(\{X_i, i=1,2, \dots \}\) with \(\E[X_i]=\frac{1}{\lambda}\).
Let \(S_n=\sum_{i=1}^n X_i\) for \(n\geq 1\) and \(S_0=0\).
Define \(N(t)=\max \{n: S_n\leq t\}, t\geq 0\).
Then \(\{N(t):t\geq 0\}\) is a Poisson process with rate \(\lambda\).
Note
Definition 3 implies definition 1 and 2. Proof is complex and so skipped.
In the definition, the exponential distributions are important, otherwise it will not be a Poisson.
By this definition, we can simulate sample paths by generating \(X_i\). This can also be used to test whether a coming process is Poisson (Compute the interarrival times and use statistical tests to see whether they are close to exponential.).
Example
Compute \(\Pr[X_1\leq s \mid N(t)=1], s<t\)
\[\begin{aligned}\Pr[X_1\leq s \mid N(t)=1] &= \frac{\Pr[X_1\leq s, N(t)=1]}{\Pr[N(t)=1]}\\
&=\frac{\Pr[N(s)=1, N(t)-N(s)=0]}{\lambda t e^{-\lambda t}}\\
&=\frac{\Pr[N(s)=1]\Pr[N(t)-N(s)=0]}{\lambda t e^{-\lambda t}}\\ &=\frac{\lambda s e^{-\lambda s} e^{-\lambda (t-s)}}{\lambda t e^{-\lambda t}}\\
&=\frac{s}{t}
\end{aligned}
\]
Therefore, given that before time \(t\) there is exactly one event happened, the event happening time \(X_1\) is uniformly distributed over \((0,t)\). This is a special property of Poisson.
Generalization
Given \(N(t)=n\), what is the joint distribution of \((S_1,S_2, \dots, S_n)\)?
First we introduce the concept of “order statistics”:
Let \(Y_1,Y_2, \dots , Y_n\) be \(n\) i.i.d. R.V.s with PDF \(f\). And \(Y_{(1)}, Y_{(2)}, \dots , Y_{(n)}\) are the order statistics corresponding to them if \(Y_{(k)}\) is the \(k\)th smallest value among \(Y_1, \dots ,Y_n\), for \(k=1,2, \dots ,n\). The joint density of \(Y_{(1)}, Y_{(2)}, \dots , Y_{(n)}\) is given by:
\[f(y_1,y_2, \dots ,y_n)=n\,! \prod_{i=1}^n f(y_i), \ \ y_1<y_2< \dots <y_n\]
Then we compute the conditional density function of \(S_1, \dots ,S_n\) given that \(N(t)=n\).
Let \(0<t_1<t_2<\dots <t_{n+1} = t\) and select \(h_i\) small enough such that \(t_i+h_i<t_{i+1}, \ i=1, \dots ,n\).
The Poisson distribution is a limiting case of the binomial distribution which arises when the number of trials n increases indefinitely whilst the product \(\lambda= np\), which is the expected value of the number of successes from the trials, remains constant.
In one word, \(Bin(n,p) \approx Poi(\lambda), np \approx \lambda\) when \(n\) is large, \(p\) is small.
Proof
Split time into \(n\) pieces, for each time slot, an event happens with probability \(p\), so the expectation of the number of events happen is \(\lambda = np\).
According to Binomial distribution, the probability of \(k\) events happen is \(C_n^k p^k (1-p)^{n-k}\). As \(n\to \infty\), we have:
The number of customers coming from 10 am to 11 am is larger than those from 1 am to 2 am. So in some scenarios, \(\lambda\) may not be a constant anymore.
A counting process \(\{N(t), t\geq 0\}\) is called a non-homogeneous or non-stationary Poisson process with rate function \(\lambda(t)\), if
\(N(0)=0\)
\(N\) has independent increments
As \(h\downarrow 0\), \(\Pr[N(t+h)-N(t)=0]=1-\lambda(t) h +o(h)\), \(\Pr[N(t+h)-N(t)=1]= \lambda(t) h +o(h)\), \(\Pr[N(t+h)-N(t)\geq 2]=o(h)\), for any \(t\geq 0\)
Theorem: For a non-homogeneous Poisson process with rate function \(\lambda(t),\ t\geq 0\), we have:
\(N(t+s)-N(s)\) follows a Poisson distribution with mean \(\int_s^{t+s} \lambda(u)\dd{u}=\E[N(t+s)-N(s)]\), i.e., for any \(t,s\geq 0\)
Similar to the proof of Definition 2 => Definition 1.
2.9 Compound Poisson Random Variables and Process #
Intuition
We may be only interested in the rewards/costs instead of counting.
Let \(\{X_i: i=1,2, \dots \}\) be a sequence of i.i.d. R.V.s with a common distribution \(F\) (CDF).
Let \(N\) be a Poisson R.V. with mean \(\lambda\), and it is independent of \(\{X_i, i=1,2, \dots \}\).
\(W=\sum_{i=1}^N X_i\) is called a compound Poisson random variable, with Poisson parameter \(\lambda\), and component distribution \(F\).
Note
If \(X_i=1\), then \(W=N\), which is just a Poisson R.V.
So \(\E[W]=\Psi_W’(0)=\lambda \Psi_X’(0)=\lambda \E[X_i]\), and \(\Var[W]=\lambda \E[X_i^2]\)
A stochastic process \( \left\{ X(t): t\geq 0 \right\}\) is called a compound Poisson process if \(X(t)=\sum_{i=1}^{N(t)} X_i, t\geq 0\), where \(\left\{ N(t), t\geq 0 \right\}\) is a Poisson process and
\(\left\{ X_i, i=1, 2, \dots \right\}\) is a sequence of i.i.d. R.V.s that is independent of \(\left\{ N(t), t\geq 0 \right\}\).
This is not a counting process any more.
Example: An insurance company
An insurance company receives claims that arrive according to a Poisson process with rate \(\lambda\).
Suppose the amount of claims form a set of i.i.d. R.V.s that is independent of the claim arrival process.
If \(X(t)\) denotes the total amount of claims by time \(t\), then \(\left\{ X(t):t\geq 0 \right\}\) is a compound Poisson process.
\(Y(t)=y_0 + c\cdot t - \sum_{i=1}^{N(t)} X_i\)
where \(Y(t)\) is the profit, \(c\cdot t\) is the premium.
If \(Y(t)<0\), there will be a bankruptcy. One interesting problem is that, given some time, what is the chance that there will be a bankruptcy?
For a Poisson process \(\{N(t), t\geq 0\}\) we have \(\Var[N(t)]=\E[N(t)]\)
For many real-world systems, it is not the case that \(\frac{\Var[N(t)]}{\E[N(t)]}=1\).
E.g. for deterministic counting process, \(\frac{\Var[N(t)]}{\E[N(t)]} < 1\).
In what condition we have \(\frac{\Var[N(t)]}{\E[N(t)]} > 1\)?
We need more uncertainty, “over dispersion of count data”.
Let \(\Lambda >0\) be a R.V. with distribution function \(G(x)=\Pr[\Lambda<x]\) (CDF).
Let \(N\) be a counting process such that given \(\Lambda=\lambda\), \(\{N(t), t\geq 0\}\) is a Poisson process with rate \(\lambda\).
Then \(N\) is called a conditional Poisson process.
\(\{N(t), t\geq 0\}\) is not a Poisson process. The increments are still stationary (because \(N(t+s)-N(s)\) depends only on the time increment \(t\)), but not independent (because one can verify that \(\Pr[N(t+s)=0]\neq \Pr[N(s)=0]\cdot \Pr[N(t+s)-N(s)=0]]\)).
Suppose \(\{N_1(t), t\geq 0\}\) is a Poisson process with rate \(\lambda=1\).
Define \(\tilde{N}(t)\triangleq N_1(\lambda t), t\geq 0\)
Fact
\(\{\tilde{N}(t), t\geq 0\}\) is a Poisson process with rate \(\lambda\).
Similarly, define \(N(t)\triangleq N_1(\Lambda t), t\geq 0\) (\(\Lambda\) is independent of \(N_1\)ļ¼(\(\Lambda t\) is called “random time change”.)
Then \(\{N(t), t\geq 0\}\) is a conditional Poisson process.
Conditional Poisson process is a special case of doubly stochastic Poisson process (or Cox process), where \(\lambda(t)\) itself is a stochastic process.
Conditional Poisson process is the case where \(\lambda(t)\) is a constant R.V. \(\Lambda\).
Exercise
For some currently unknown reasons, the average rate at which seismic shocks occur in a certain region changes over time (let’s say it is either \(\lambda_1\) or \(\lambda_2\).). Suppose it is \(\lambda_1\) for \(p\) percent of time and \(\lambda_2\) for the remaining time. Given \(n\) shocks in the first \(t\) times units, compute the probability it is a \(\lambda_1\) season.
Suppose
\[\Lambda=\left\{\begin{aligned}&\lambda_1 & \text{with probability } p\\
&\lambda_2 &\text{with probability } 1-p
\end{aligned}
\right.
\]
Compute \(\Pr[\Lambda=\lambda_1 \mid N(t)=n]\) (\(N(t)\) is a conditional Poisson process)
A generalization of Poisson process. It is also a counting process, but allows interarrival times to be general: i.i.d. non-negative R.V.s, not only exponential distribution.
Let \(\{X_n:n=1,2, \dots \}\) be a sequence of i.i.d. non-negative R.V.s with a common distribution function \(F\).
Assume \(F(0)=\Pr[X_n=0]<1\) and \(\E[X_n]=\mu \in (0,\infty)\).
Define \(S_0=0\) and \(S_n=\sum_{i=1}^n X_i\) for \(n\geq 1\).
Let \(N(t)=\sup\{n: S_n\leq t\}\) for \(t\geq 0\).
Then the counting process \(\{N(t), t\geq 0\}\) is called a renewal process.
Note
\((X_n)\): sequence of interarrival times.
Interarrival times can be zero, as long as the probability is not 1. (Otherwise the model is not interesting.)
The “\(\sup\)” can be changed to “\(\max\)”. Because in a finite time \(t\), there cannot be an infinite number of renewals occurred. Because:
\[\Pr \left[ \lim_{n\to \infty} \frac{S_n}{n} = \mu \right]=1 \ \ \text{(Strong law of large numbers)}\]
So \(S_n \to \infty\) as \(n\to \infty\) with probability 1, i.e., if \(S_n\leq t\), then \(n\) must be finite. Therefore, only a finite number of events occurred on \([0,t]\) for any \(t\).
So the “\(\sup\)” cannot be infinite, so it can be replaced with “\(\max\)”.
Let \(\{X_n: n=0,1,2, \dots \}\) be a discrete-time stochastic process taking on a finite or countable number of values, say, \(S = \{0,1,2, \dots \} \) (this is called state space).
Suppose
for all states \(i_0,i_1, \dots ,i_{n-1}, i, j\) and \(n\geq 0\).
Then \(\{X_n:0,1,2, \dots \}\) is called a discrete-time Markov chain (“time-homogeneous”) (it is called a chain because the state space is countable or finite, this will be generalized in Markov process).
The conditional distribution of future state \(X_{n+1}\), given the past state \(X_0, \dots ,X_{n-1}\) and the present state \(X_n\), is independent of the past states and depends only on the present state.
The row sum of \(P\) is always 1. However, this is not the case for column sums.
Consider the following markov chain with 2 states. State 1 leads to 1 with probability 1 and state 2 leads to state 1 with probability 1. Then the sums of the columns of the transition matrix are 2 and 0.
Example (Random Walk)
Let \((X_i)_{i=1}^{\infty}\) be i.i.d. with \(\Pr[X_i=k]=a_k, k=0, \pm 1, \pm 2, \dots \)
Let \(S_0=0, S_n=\sum_{i=1}^n X_i, n\geq 1\), then \(\{S_n: n=0, 1, \dots \}\) is a DTMC.
Proof
Need to prove: \(\Pr[S_{n+1}=j | S_n=i, \dots, S_1=i_1, S_0=0] = \Pr[S_{n+1}=j \mid S_n = i]\)
To compute \(\lim\limits_{n\to \infty} \Pr[X_n=j \mid X_0=a], j \in S\)ļ¼or \(\lim\limits_{n\to \infty} P^n\), can we simplify the computation?
State \(j\) is said to accessible from state \(i\) if there exists some \(n\geq 0\), such that \(P_{ij}^{(n)}>0\). We write \(j \leftarrow i\).
State \(j\) and state \(i\) communicate if \(i \leftrightarrow j\).
Lemma Communication is an equivalence relation, that is:
\(i\leftrightarrow i\)
if \(i\leftrightarrow j\), then \(j\leftrightarrow i\)
if \(i \leftrightarrow j\), \(j\leftrightarrow k\), then \(i\leftrightarrow k\)
Proof of 3 There exists \(n,m\geq 0\) such that \(P_{ij}^{(n)}>0\), \(P_{jk}^{(m)}>0\), then \(P_{ik}^{n+m}=\sum_{l\in S}P_{il}^{(n)}P_{lk}^{(m)}\geq P_{ij}^{(n)}P_{jk}^{(m)}>0\).
Two states that communicate are said to be in the same class.
A DTMC is said to be irreducible if there is only one class, i.e., all states communicate with each other.
The example given above whose \(P=\begin{bmatrix}\frac{1}{2} & \frac{1}{2}\\ 0 & 1\end{bmatrix}\) is not irreducible, because states \(a,b\) are not in the same class (since \(a\) is not accessible from \(b\)).
Define the hitting time \(\tau_j = \inf \{n\geq 1: X_n=j\}\), for \(j \in S\). (Meaning: the time step at which the state first transits into \(j\).) (This is a random quantity, but not a random variable because it is possible to be infinite).
Definition: A state \(j\) is called recurrent if \(f_{jj}=1\), and called transient if \(f_{jj}<1\).
Theorem: A state \(j\) is recurrent if and only if \(\sum\limits_{n=1}^{\infty} P_{jj}^{(n)}=\infty\).
Example
In the above example where \(P=\begin{bmatrix}\frac{1}{2} & \frac{1}{2}\\ 0 & 1\end{bmatrix}\)
For state \(b\), \(P_{bb}^{(n)}=1 \quad \forall n\geq 1\) (state \(b\) is absorbing because \(P_{bb}=1\)), so \(b\) is recurrent.
For state \(a\), \(P_{aa}^{(n)}=\Pr[X_n=a \mid X_0=a]= \left( \frac{1}{2} \right)^n\), so \(\sum\limits_{n=1}^{\infty} P_{aa}^{(n)}=\sum\limits_{n=1}^{\infty} \frac{1}{2^{n}}=1<\infty\), so \(a\) is transient.
Using definition we get the same results:
For state \(b\), \(\Pr[\tau_b<\infty \mid X_0=b] = 1\)
For state \(a\), \(\Pr[\tau_a<\infty \mid X_0=a]=\frac{1}{2}\), because given \(X_0=a\), we have:
If state \(j\) is transient, we need to show \(\sum\limits_{n=1}^{\infty} P_{jj}^{(n)}=\E[N_j \mid X_0=j]<\infty\).
Each time when the DTMC returns to state \(j\), there is a positive probability \(1-f_{jj}>0\) that it will never return again.
Define “success” if it will return and “failure” if it will return. Then \(N_j\) is the trial number on which the finite success occurs.
\(N_j\) follows the geometric distribution with success probability \(1-f_{jj}\). \(\E[N_j \mid X_0=j]=\frac{1}{1-f_{jj}}<\infty\).
Corollary With probability one, a transient state will only be visited for a finite number of times.
Corollary A finite-state DTMC has at least one recurrent state. (Important)
Corollary If \(i\) is recurrent and \(i \leftrightarrow j\), then \(j\) is recurrent.
Proof
There exists some \(m,n\geq 0\) such that \(P_{ij}^{(n)}>0\) and \(P_{ji}^{(m)}>0\)
\[P_{jj}^{(m+n+s)} = \sum_{k,l \in S} P_{jk}^{(m)} P_{kl}^{(s)} P_{lj}^{(n)} \geq P_{ji}^{(m)} P_{ii}^{(s)} P_{ij}^{(n)}\]
If \(p=\frac{1}{2}\), \(\sum\limits_{n=1}^{\infty} P_{00}^{(2n)}\sim \sum\limits_{n=1}^{\infty} \frac{1}{\sqrt{\pi n}}=\infty\), so state 0 is recurrent, so every state is recurrent.
If \(p\neq \frac{1}{2}\), \(\sum\limits_{n=1}^{\infty} P_{00}^{(2n)} < \infty\), so state 0 is transient, so every state is transient. (this is possible because there are infinite many states.)
The advantage of doing so is to simplify the computation, e.g., we only need to look at the sub-matrix if state is in \(P_1\).
Define \(\mu_{jj}=\E[\tau_j \mid X_0=j], \quad j \in S\) (average number of steps needed to revisit state \(j\))
Definition If state \(j\) is recurrent, then it is said to be positive recurrent if \(\mu_{jj}<\infty\), and null recurrent if \(\mu_{jj}=\infty\).
3.4 Stationary Distribution and Limiting Distribution #
Definition A probability distribution \(\pi=(\pi_i)_{i \in S}\) is called a stationary distribution for a DTMC with transition probability matrix \(P\), if
\[\left\{
\begin{array}{l@{\quad:\quad}l}
\pi_j = \sum_{i \in S} \pi_i P_{ij} \Longleftrightarrow \underbrace{\pi}_{\text{row vector}} = \pi P & \text{for all } j \in S\\
\sum_{i \in S} \pi_i = 1 & \\
\pi_i \geq 0 & \text{for all } i \in S
\end{array}\right.
\]
Suppose \(X_0 \sim \pi\), i.e., \(\Pr[X_0 = i]=\pi_i, i \in S\), then \(X_n\sim \pi, n\geq 1\).
Stationary distribution is useful because we usually focus on long-term behaviors of a Markov chain. (e.g., whether it is converged.)
Example 1
\(P=\begin{bmatrix} 1 & 0\\ 0 & 1\end{bmatrix}\), then \((\alpha, 1-\alpha)\) is a stationary distribution, \(\alpha\geq 0\), so the stationary distribution is not unique.
Example 2
\(P=\begin{bmatrix} 0 & 1\\ 1 & 0\end{bmatrix}\), then \(\pi = (\frac{1}{2}, \frac{1}{2})\) is the unique stationary distribution.
\(1=\sum_{i \in S} \pi_i \Longrightarrow \pi_i = \frac{1}{\infty} = 0\), so stationary distribution doesn’t exist.
Therefore, stationary can be non-existing, or not unique.
For the existence of stationary distributions, we have the following conclusions:
For a finite-time DTMC, there always exists at least one stationary distribution.
(hint: use linear programming and duality theory)
(Because the row sum of \(P\) is always 1, so \(P \vec{1} = \vec{1}\), so 1 is always an eigenvalue of \(P\).)
If a DTMC has at least one positive recurrent state, then a stationary distribution exists. (proof skipped)
Note
A stationary distribution exists doesn’t mean the Markov chain converges to this distribution starting from any states.
However, if you have a sufficiently “nice” Markov chain, you should expect the convergence, i.e., \(\Pr[X_n=j | X_0=i] = \pi_j,\quad n\to \infty\) (i.e., the initial state is eventually forgotten).
We say the states have period 2, denoted by \(d(a)=d(b)=2\).
Definition: state \(i\) is said to have period \(d\), if \(d\) is the greatest common divisor of these \(n\) with \(P_{ii}^{(n)}>0, n \geq 1\), that is, \(d(i) = \gcd \{n\geq 1 : P_{ii}^{(n)}>0\}\).
I.e., a state \(i\) has period \(k\) if any return to state \(i\) must occur in multiples of \(k\) time steps.
If \(d(i)=1\), then we say state \(i\) is aperiodic.
Example (Simple Random Walk)
\(d(i)=2, i \in S=\ZZ\)
Lemma 1 If \(i \leftrightarrow j\), then \(d(i)=d(j)\). (period is also a “class property”) (proof on the textbook)
Theorem (proof skipped) Suppose \(i \leftrightarrow j\), then
The above expectation is the expected number of times that state \(j\) is reached starting from state \(i\) in \(n\) steps.
As \(n\to \infty\), this is the long-run average proportion of time spent at state \(j\).
If \(j\) is aperiodic, then \(\lim_{n\to \infty} P_{ij}^{(n)} = \frac{1}{\mu_{jj}}\)
If \(j\) has period \(d(j)\), then \(\lim_{n\to \infty} P_{jj}^{n d(j)} = \frac{d(j)}{\mu_{jj}}\)
Intuition of 2,3:
If \(a\) is not aperiodic, e.g, \(d(a)=2\), then \(P_{aa}^{(2n)}=1\), \(P_{aa}^{(2n-1)}=0\), \(\lim_{n\to \infty} P_{aa}^{(n)}\) doesn’t exist.
But \(\lim_{n\to \infty} P_{aa}^{(2n)} = 1 = \frac{d(a)}{\mu_{aa}} = 1\)
Recall \(\mu_{jj} = \E[\tau_j \mid X_0=j]\) is the expected number of steps needed to return to \(j\). And \(\mu_{jj}<\infty \Longleftrightarrow j \text{ is positive recurrent}\).
So \(j\) is positive recurrent \(\Longleftrightarrow \lim_{n\to \infty} P_{jj}^{(n d(j))}>0\).
Example (simple symmetric random walk)
When \(p=\frac{1}{2}\), \(P_{00}^{(2n)} \sim \frac{1}{\sqrt{\pi n}}\).
\(\lim_{n\to \infty}P_{00}^{(2n)}=0\), so state 0 is null recurrent.
Theorem An irreducible, aperiodic DTMC belongs to one of the following two classes:
Either all states are transient or all null recurrent. In this case, \(\lim_{n\to \infty} P_{ij}^{(n)}=0\) for all \(i, j \in S\), and there exists no stationary distribution.
Or all states are positive recurrent. In this case, \(\lim_{n\to \infty} P_{ij}^{(n)}\) exists for all \(i,j \in S\), and denote \(\pi_j = \lim_{n\to \infty} P_{ij}^{(n)}, j \in S\).
Then \(\pi=(\pi_{j})_{j \in S}\) is the unique stationary distribution.
Proof
We first proof 2.
First show positive recurrent is a class property. That is, if \(j\) is positive recurrent and \(j \leftrightarrow k\), then \(k\) is positive recurrent.
\(\lim_{n\to \infty} P_{ij}^{(n)} = \pi_j>0\) exists since \(i\leftrightarrow j\) (because this DTMC is irreducible) and due to the above theorem.
Since \(j \leftrightarrow k\), let \(m\) be such that \(P_{jk}^{(m)}>0\) then \(P_{ik}^{(n+m)}\geq P_{ij}^{(n)} P_{jk}^{(m)}\).
Take limit on both sides, \(\lim_{n\to \infty} P_{ik}^{(n+m)}\geq \pi_j P_{jk}^{(m)}>0\), so \(k\) is positive recurrent.
Note
\(\pi_j\) does not depend on \(i\), so the transition matrix will be like this (values in each column are the same):
\[P^n \to \begin{bmatrix}\pi_0 & \pi_1 &\pi_2 &\pi_3 & \dots \\
\pi_0 & \pi_1 &\pi_2 &\pi_3 & \dots \\
\pi_0 & \pi_1 &\pi_2 &\pi_3 & \dots \\
\vdots & \vdots & \vdots &\vdots & \ddots
\end{bmatrix}
\]
We next show \(\pi = (\pi_j)_{j \in S}\) is a stationary distribution, where \(\pi_j = \lim_{n\to \infty} P_{ij}^{(n)}\).
In the above, \(S\) must be finite, so we can exchange the summation and limit.
If \(S\) is infinite, let \(S=\{0,1,2, \dots \}\), then \(\sum_{j=0}^M P_{ij}^{(n)} \leq \sum_{j=0}^{\infty} P_{ij}^{(n)} = 1, \forall M\), take limit on both sides:
So \(\left\{\frac{\pi_j}{\sum_{i=0}^{\infty} \pi_i}, j = 0,1,2, \dots \right\}\) is a stationary distribution.
But we cannot show \(\pi\) is a stationary distribution yet, because (1) is not an equality.
Next, show the uniqueness.
Let \((\nu_j)_{j \in S}\) be any stationary distribution, then \(\nu_j = \sum_{i=0}^{\infty} \nu_i P_{ij}^{(n)}\geq \sum_{i=0}^M \nu_i P_{ij}^{(n)}\), let \(n\to \infty \), we have \(\nu_j \geq \sum_{i=0}^M \nu_i \pi_j \stackrel{M\to \infty}{\Longrightarrow} \nu_j \geq \pi_j\)
Let \(X=\{X(t), t\geq 0\}\) be a continuous time stochastic process with discrete state space \(S\). If
\[\Pr[X(t+s)=j \mid X(s)=i, X(u) = x(u), 0 \leq u < s] = \Pr[X(t+s)=j \mid X(s)=i]\]
for all \(s,t\geq 0\) and \(i, j, x(u) \in S\), then \(\{X(t), t\geq 0\}\) is called a continuous time Markov Chain.
Same with DTMC, we only consider time-homogeneous Markov chain, i.e.,
The transition probability is independent of \(s\) (called stationary or homogeneous). \(P_{ij}(t)\) and the initial state \(p_i=\Pr[X(0)=i]\) uniquely determine all the probability distribution of the CTMC.
Example
Poisson process is a CTMC:
\[\begin{split}
&\Pr[X(t+s)=j \mid X(s)=i, X(u)=x(u), 0\leq u < s]\\
=&\Pr[X(t+s)-X(s)=j-i \mid X(s)=i, \dots]\\
=&\Pr[X(t+s)-X(s)=j-i]\\
=&\Pr[X(t+s)=j \mid X(s)=i] = P_{ij}(t)
\end{split}
\]
One difference between DTMC and CTMC is that, we consider not only the transition probabilities but also the holding time.
Remember that the interarrival time of Poisson process follows an exponential distribution. This is in fact a general property of CTMC.
Denote \(\tau_i\) the holding-time at state \(i\) for a CTMC \(X=\{X(t), t\geq 0\}\). Then \(\tau_i\) follows an exponential distribution with some rate \(v_i\).
To prove this, verify the memoryless property: \(\Pr[\tau_i > t+s \mid \tau_i > s] = \Pr[\tau_i > t]\).
\[\begin{split}
\text{LHS} &= \Pr[X(l)=i, s< l \leq t+s \mid X(s)=i, X(u)=i, 0\leq u <s]\\
&=\Pr[X(l)=i, s< l \leq t+s \mid X(s)=i] \quad \text{(Markov property)}\\
&=\Pr[X(l)=i, 0<l\leq t \mid X(0)=i] \quad \text{(Time-homogeneous)}\\
&=\text{RHS}
\end{split}
\]
Construction of CTMCs (CTMC = DTMC with exponential transition/holding times)
Each time the process enters state \(i\)
The amount of time it spends in that state before making a transition into a different state (holding time) is exponentially distributed with rate \(v_i\), and
When the process leaves state \(i\), it will next enter state \(j(\neq i)\) with some probability, call it \(P_{ij}\), where \(\sum_{j \neq i} P_{ij}=1\). Note that \(P_{ii}\) is always 0.
We can parameterize of a CTMC with \((P, v)\) where \(P = (P_{ij})_{i,j \in S}\), and \(v = (v_i)_{i \in S}\).
Assume throughout \(0\leq v_i < \infty\) (because if \(v_i=\infty\), \(\E[\tau_i] = \frac{1}{v_i}=0\), which is not interesting), a CTMC is called regular if there is finite number of transition over any finite time window.
Define the (infinitesimal) generator matrix \(Q=(q_{ij})_{i,j \in S}\) as follows:
\[\left\{
\begin{array}{l@{\quad:\quad}l}
q_{ij}=v_i \cdot P_{ij} & i \neq j, i,j \in S\\
q_{ii}=-v_i & i \in S
\end{array}\right.
\]
where \(q_{ij}\) is the transition rate (from \(i\) to \(j\)), and \(v_i\) is the rate of leaving state \(i\).
\((P, v)\) and \(Q\) are equivalent, one can derive the other.
The row sum of \(Q\) is 0:
\[\sum_{j \in S} q_{ij} = \sum_{j\neq i} q_{ij} + q_{ii} = \sum_{j \neq i} v_i P_{ij} - v_i = v_i \sum_{j \neq i} P_{ij} - v_i = 0\]
Example
A two-state CTMC with \(Q = \begin{bmatrix}-\alpha & \alpha \\ \beta & -\beta \end{bmatrix}\), \(\alpha, \beta>0\).
Definition A CTMC with \(S=\{0,1,2, \dots\}\) for which \(q_{ij}=0\) whenever \(|i-j|>1\) is called a birth and death process.
Denote \(\lambda_i = q_{i,i+1}\) as the birth rate, \(\mu_i = q_{i, i-1}\) as the death rate.
So the transitions from state \(i\) can only go to either state \(i-1\) or state \(i+1\). When the state increases by 1 we say a birth occurs, and when it decreases by 1 we say a death occurs.
At state \(i\), the holding time is \(\exp(v_i)=\exp(\lambda_i+\mu_i)\), because:
The “birth clock” \(\xi_i \sim \exp(\lambda_i)\), the “death clock” \(\eta_i \sim \exp(\mu_i)\), and they are independent. So the holding time \(\tau_i = \min \{\xi_i, \eta_i\} \sim \exp(\lambda_i + \mu_i)\).
\[\begin{split}
P_{i,i+1} &= \Pr[\xi_i<\eta_i]=\frac{\lambda_i}{\lambda_i+\mu_i}\\
P_{i,i-1} &= \Pr[\xi_i>\eta_i]=\frac{\nu_i}{\lambda_i+\mu_i}
\end{split}
\]
Example (The M/M/s Queue)
The name “X/Y/Z”: “X” is the arrival process/interarrival time distribution, “Y” is the service time distribution, “Z” is the number of servers.
“M” stands for “Markov”, meaning that the distribution is exponential.
First “M”: Arrival process is Poisson process with rate \(\lambda\), or equivalently the interarrival time is exponential with rate \(\lambda\).
Second “M”: Service time distribution is exponential with rate \(\mu\).
“s”: There are \(s\) servers.
Service discipline: FIFO (First-In-First-Out), FCFS (First-Come-First-Serve).
Denote \(X(t)\) as the number of customers/jobs in the system at time \(t\). State space \(S=\{0,1,2, \dots\}\). \(\{X(t): t \geq 0\}\) is a birth and death process.
Birth rate \(\lambda_i = \lambda\), death rate \(\mu_i = s\cdot \mu\) if \(i>s\), and \(\mu_i = i\cdot\mu\) if \(i \leq s\).
A birth and death process with death rate \(\mu_i=0\) for all \(i\) is called a pure birth process.
Example (Yule Process)
Consider a pure birth process resulting from a population where each member acts independently and gives birth at rate \(\lambda>0\). No one ever dies. Let \(X(t)\) denote the population size at time \(t\). Then \(\{X(t):t\geq 0\}\) is a pure birth process with \(\lambda_i=i \cdot \lambda\), \(i\geq 1\).
The holding time \(T_i \sim \exp(i\lambda)\) are independent with each other.
Now compute: \(P_{ij}(t)=\Pr[X(t)=j \mid X(0)=i]\)
Therefore, starting with a single individual, the population size at time \(t\) will have a geometric distribution with mean \(e^{\lambda t}\).
So if starting with \(i\) individuals, the population size at \(t\) will be the sum of \(i\) i.i.d. geometric R.V.s, which follows a negative binomial distribution.
Consider a CTMC with generator matrix \(Q\), and the transition probability matrix is \(P(t)=(P_{ij}(t))_{i,j \in S}\), \(t\geq 0\).
Lemma 1 For all \(s,t\geq 0\), we have for all \(i,j \in S\)
\[P_{ij}(t+s)=\sum_{k \in S} P_{ik}(t)\cdot P_{kj}(s)\]
or in matrix form: \(P(t+s)=P(t)P(s)\).
\(\lim_{h \to 0} \frac{P_{ik}(h)}{h} = q_{ik}\) for \(k\neq i\).
In matrix notation, we have
\[\lim_{h\to 0} \frac{P(h)-P(0)}{h}=Q\]
where \(P(0)=I\).
LHS is like a derivative, so denote by \(P’(0)\).
Proof
For (1), we want to compute \(\frac{\Pr[X(h)=i \mid X(0)=i]-1}{h}\).
Since from state \(i\), the holding time \(\tau_i\) follows an exponential distribution with rate \(v_i\), we have:
\[\begin{split}
\Pr[\text{0 transition by time } h \mid X(0)=i] &= \Pr[\tau_i > h]\\
&= 1-\Pr[\tau_i\leq h]\\
&= e^{-v_i h} = 1 - v_i h + o(h)
\end{split}
\]
\[\begin{split}
&\Pr[\text{2 or more transitions by time } h | X(0)=i]\\
=&\sum_{j \neq i} \Pr[\text{2 or more transitions by time } h\mid \text{1st transition to }j, X(0)=i]P_{i,j}\\
=& \sum_{j\neq i}\Pr[\tau_i+\tau_j\leq h] P_{i,j}\\
\leq& \sum_{j\neq i}\Pr[\tau_i \leq h]\Pr[\tau_j \leq h] P_{i,j} \quad \text{since the event } \{\tau_i+\tau_j\leq h\} \subset \{\tau_i\leq h, \tau_j\leq h\}\\
=& \sum_{j \neq i} (v_i h +o(h))(v_j h +o(h)) P_{i,j}\\
=& \sum_{j \neq i} o(h) P_{i,j} = o(h)
\end{split}
\]
We can also know from the above that
\[\Pr[\text{1 transition by } h \mid X(0)=i] = 1- (1-v_i h+o(h)) - o(h) = v_i h + o(h)\]
In \(\Pr[X(h)=i \mid X(0)=i]\), there are two possibilities, either 0 transition happens, or 2 or more transitions happen and get back to state \(i\). So we have:
\[\Pr[X(h)=i \mid X(0)=i] = 1-v_i h +o(h) + o(h) = 1-v_i h + o(h)\]
So we can easily get \(\lim_{h \to 0} \frac{P_{ii}(h)-1}{h} = -v_i\).
For (2), similarly, in \(\Pr[X(h)=k \mid X(0)=i], k\neq i\), either there is exactly 1 transition to state \(k\), the probability is \((v_i h + o(h))P_{i,k}\), or 2 or more transitions happen. In summary, we have:
\[\Pr[X(h)=k \mid X(0)=i] = v_i P_{i,k} h +o(h) + o(h) = v_i P_{i,k} h + o(h)\]
So we can get \(\lim_{h \to 0} \frac{P_{ik}(h)}{h} = q_{ik}\).
Theorem 1 (Kolmogorov Backward Equation)
For all \(i, j \in S\) and \(t \geq 0\),
\[P_{ij}’(t)=\sum_{k \in S} q_{ik}\cdot P_{kj}(t)\]
using matrix notation, we have \(P’(t)=Q\cdot P(t)\).
Proof
Here we assume the state space \(S\) is finite, the general case is in the textbook.
The sum and limit can be exchanged because we only consider the case where \(S\) is finite.
Theorem 2 (Kolmogorov Forward Equation)
Under suitable conditions (including birth-death process and finite state CTMC), we have
\[P_{ij}’(t)=\sum_{k \in S} P_{ik}(t) \cdot q_{kj}, \quad i, j \in S\]
or \(P’(t)=P(t)\cdot Q\).
(proof is just swapping \(h\) and \(t\) in the above proof)
Solution to these equations
\[P(t) = e^{tQ}, t\geq 0\]
(Similar to ODE: \(f’(t)=c\cdot f(t), f(0)=1 \Rightarrow f(t)=e^{ct}, t\geq 0\).)
Matrix Exponential
If \(A\) is a matrix, then:
\[e^A = \sum_{n=0}^{\infty} \frac{A^n}{n!}\]
\(P’(0)=Q\), because the row sum of \(P\) is 1 (constant), after taking the derivative, the row sum of \(Q\) is zero.
In discrete case: One step generator: let \(h=0\), we get \(P-I\), this is the analog of \(Q\) in DTMC.
Example 1
Define \(\phi_j(t)=\Pr[X(t)=j], j \in S, t\geq0\). Given \((\phi_j(0))_{j \in S}\), compute \((\phi_j(t))_{j \in S}\) (denote as a row vector \(\vec{\phi}(t)\)).
Note that the solution converges to some constant as \(t\to\infty\). E.g., \(P_{00}(t) \to \frac{\mu}{\lambda+\mu}\), \(P_{01}(t) \to \frac{\lambda}{\lambda+\mu}\).
In general, it is hard to solve the analytical forms, how to find out the limiting probabilities (or stationary distributions)?
4.4 Limiting Distribution and Stationary Distribution #
Definition A probability distribution \(\pi=(\pi)_{i \in S}\) is said to be a stationary distribution for a CTMC with generator matrix \(Q=(q_{ij})_{i,j \in S}\), if
\(\sum_{i \in S} \pi_i q_{ij} = 0\) for all \(j \in S\). In matrix form: \(\pi Q=0\) (\(\pi\) is a row vector)
\(\sum_{i \in S} \pi_i = 1\)
\(\pi_i\geq 0\) for all \(i \in S\)
Note
Recall that \(P-I\) is the discrete analog of \(Q\), note that \(\pi = \pi P\), so \(\pi(P-I)=0\).
If \(X(0) \sim \pi\) (\(\Pr[X(0)=j]= \pi_j, j \in S\)), then \(X(t) \sim \pi, t>0\).
Proof
Easy to see that if \(X(0) \sim \pi\), then \(X(t) \sim \pi P(t)\). Now prove \(\pi P(t) = \pi\).
This follows from \(\frac{d}{dt}(\pi P(t)) = \pi P’(t) = \pi Q P(t) = 0\), and \(P(0) = I\).
We can actually use \(\pi = \pi P\) as the definition, but we do not because most of the time, \(P(t)\) is unknown or hard to compute.
so \(\pi_j v_j = \sum_{k \neq j} \pi_k q_{kj}\) (This is called “balance equation”).
(Note that this \(v\) is the leaving rate, not the limiting distribution in the above theorem.)
LHS: the rate at which the CTMC leaves state \(j\) in the long run (“rate out”).
RHS: the rate at which the CTMC enters state \(j\) in the long run (“rate in”).
In some applications, \(\pi_j\) is big, meaning that \(j\) is more important. E.g., in website browsing time, a big \(\pi_j\) means page \(j\) is more popular.
Example
Consider a birth and death process with rate diagram:
For a Birth-Death process, the stationary distribution exists if
\[k = \sum_{n=1}^{\infty} \frac{\lambda_0 \lambda_1 \dots \lambda_{n-1}}{\mu_1 \mu_2 \dots \mu_n} < +\infty\]
In this case \(\pi_0 = \frac{1}{1+k}\), \(\pi_n = \frac{\lambda_0 \dots \lambda_{n-1}}{\mu_1 \dots \mu_n} \pi_0, n \geq 1\).
Example
Consider a M/M/1 queue, which is a Birth-Death process where \(\lambda_i \equiv \lambda, i=0,1,2, \dots\), \(\mu_i \equiv \mu, i=0,1,2, \dots\).
\(X(t)\): number of customers in the system at time \(t\).
A stationary distribution exists for the CTMC \(\{X(t):t\geq 0\}\) if \(k = \sum_{n=1}^{\infty} \left( \frac{\lambda}{\mu} \right)^n < +\infty\). Or equivalently \(\frac{\lambda}{\mu}<1\).
In this case, \(\pi_0 = \frac{1}{1+k} = \frac{1}{\sum_{n=0}^{\infty} \rho^n}=1-\rho\) (denote \(\rho = \frac{\lambda}{\mu}\), the utilization of the server), \(\pi_n = (1-\rho)\cdot \rho^n, n\geq 0\).
Steady-State Analysis
Average number of customers in the system (\(\E[Q]\))?
\[\sum_{n=0}^{\infty} n \cdot \pi_n = \sum_{n=0}^{\infty} n(1-\rho)\rho^n = \underbrace{\sum_{n=0}^{\infty} n(1-\rho)\rho^{n-1}}_{\text{mean of geometric distribution}} \cdot \rho = \frac{\rho}{1-\rho}\]
Long-run fraction of time the system/server is busy/idle?
Busy (at least one person in queue): \(1-\pi_0 = \rho\).
Idle (no one in queue): \(\pi_0 = 1-\rho\).
Average waiting time of customers (\(\E[W]\))?
According to the Little’s Law, \(\E[Q] = \lambda\E[W]\).
Determining the existence of stationary distribution (\(\lim_{t \to \infty} P_{ij}(t) \stackrel{?}{=} \pi_j\))
A DTMC is ergodic if it is irreducible, aperiodic and positive recurrent.
A CTMC is irreducible if the underlying/embedded DTMC is irreducible.
An irreducible CTMC is positive recurrent \(\Longleftrightarrow\) there exists a stationary distribution \(\pi = (\pi_i)_{i \in S}\) with \(\pi_i>0\) for all \(i \in S\).
If an irreducible CTMC has a finite state space, then it is positive recurrent.
Let \(X_1, X_2, \dots\) be i.i.d. random variables with \(\E[X_i]=0\) and \(\E[|X_i|]<+\infty\). Let \(Z_n = \sum_{i=1}^{n} X_i, n\geq 1\). Then \(\{Z_n: n\geq 1\}\) is a martingale.
Proof
Check (1): \(\E[Z_n]= \E[\sum_{i=1}^{n} X_i] \leq \sum_{i=1}^{n} \E[|X_i|] < +\infty\), for all \(n\).
(Note: If \(\E[X_i] \neq 0\), this is still a Markov chain, but not a martingale. Martingale models a fair game.)
Example 2 (Geometric Random Walk)
Let \(X_1, X_2, \dots\) be i.i.d. random variables with \(\E[X_i]=1\) and \(\E[|X_i|]<+\infty\). Let \(Z_n = \prod_{i=1}^{n} X_i, n\geq 1\). Then \(\{Z_n: n\geq 1\}\) is a martingale.
Proof
Check (1): \(\E[|Z_n|] = \E[|X_1 \cdot \dots \cdot X_n|] = \prod_{i=1}^n \E[|X_i|] < \infty\), for all \(n\).
Let \(X, Y_1, Y_2, \dots\) be random variables with \(\E[|X|] < +\infty\). Let \(Z_n = \E[X \mid Y_1, \dots, Y_n]\) for \(n \geq1\). Then \(\{Z_n: n\geq 1\}\) is a martingale.
(If \(Y_1, \dots, Y_n\) is known, then \(Z_1, \dots, Z_n\) is known, so \(Z_1, \dots, Z_n\) have less information than \(Y_1, \dots, Y_n\).)
5.3 Stopping Times and the Martingale Stopping Theorem #
Definition A positive, integer-valued, possibly infinite random variable \(N\) is said to be a random time for the process \(\{Z_n, n\geq 1\}\) if the event \(\{N=n\}\) is determined by \(Z_1, \dots, Z_n\) for any \(n\). (Intuition: can’t look into the future.)
If \(\Pr[N<\infty]=1\), then the random time \(N\) is called a stopping time.
Example
\(N=\inf\{n\geq1 : Z_n=c\}\) for a constant \(c\). (\(N\) is the hitting time when the process first hits \(c\).)
\(\{N=n\} \Longleftrightarrow \{Z_1 \neq c, Z_2 \neq c, \dots, Z_{n-1}\neq c, Z_n=c\}\), so \(N\) is a random time.
Counter-Example
\(N=\sup\{n\geq1 : Z_n=c\}\)
\(\{N=1\} \Longleftrightarrow \{Z_1=c, Z_k\neq c, \text{ for } k\geq 2\}\), this is not determined by only \(Z_1\), you need information about the future, so \(N\) is NOT a random time.
Let \(N\) be a random time for \(\{Z_n: n\geq 1\}\). Define
\[\overline{Z}_n = Z_{\min\{n, N\}}=\left\{
\begin{array}{l@{\quad:\quad}l}
Z_n & \text{if } n\leq N\\
Z_N & \text{if } n> N
\end{array}\right.
\]
\(\{\overline{Z}_n: n\geq1\}\) is called the stopped process of \(\{Z_n: n\geq 1\}\).
In the above example, if \(c=1\), and the process \(Z_n\) is as the figure below, then \(N=\inf\{n\geq1 : Z_n=1\} = 2\).
(Why this is important? Because of the following result.)
Proposition If \(N\) is a random time for a martingale \(\{Z_n: n\geq 1\}\), then the stopped process \(\{\overline{Z}_n :n\geq 1\}\) is also a martingale.
Proof
Define \(I_n=\1_{n\leq N}\), which is determined by \(Z_1, \dots, Z_{n-1}\).
(Note that the subscript is \(n-1\) instead of \(n\). An event being determined is equivalent to its complement being determined. The complement event of \(\{n\leq N\}\) is \(\{N\leq n-1\}\) because N is integer. \(\{N\leq n-1\} = \bigcup_{k=1}^{n-1} \{N=k\}\) is determined by \(Z_1, \dots, Z_{n-1}\).)
One can verify \(\overline{Z}_n = \overline{Z}_{n-1} + I_n\cdot(Z_n-Z_{n-1})\). (Easy exercise, skipped)
Now prove an intermediate step: \(\E[Z_n \mid Z_1, \dots, Z_{n-1}] = \overline{Z}_{n-1}\):
The second equality is due to Dominated Convergence Theorem (DCT), since \(\overline{Z}_n\) is bounded, i.e., \(|\overline{Z}_n| \leq k, \forall n\geq1\).
Example (Gambler’s Ruin Problem, P186 in the textbook)
Consider a simple symmetric random walk \(\{Z_n: n\geq 0\}\) with \(Z_0=i>0\).
Define \(N=\inf\{n\geq0 : Z_n= M \text{ or } Z_n=0\}\), where \(M>i\).
Compute \(\Pr[Z_N=M]\).
(1) Verify \(N\) is a stopping time. (exercise, solution)
Theorem Let \(\{Z_n: n\geq1\}\) be a martingale with \(\mu=\E[Z_n]\). Let \(Z_0=\mu\). Suppose that there exists constants \(\alpha_i, \beta_i \geq 0\) such that
\(\{Z_i-Z_{i-1}, i\geq1\}\) is called a martingale difference sequence.
Note
There is also a weaker version of Azuma’s inequality, where the assumption is that \(|Z_i-Z_{i-1}| \leq c_i\), i.e., the interval must be symmetric. This will lead to moving the constant 2 to the denominator.
The weaker version is also quite common, e.g., on the Wikipedia. Don’t be confused.
Suppose \(Z_n = \sum_{i=1}^{n} X_i\) where \((X_i)_{i\geq 1}\) are independent random variables with \(E[X_i]=0\) and \(\E[|X_i|]<\infty\) for all \(i\).
Suppose \(-\alpha_i \leq X_i \leq \beta_i, i\geq1\), then for \(a>0\)
Definition A stochastic process \(\{Z_n: n\geq1\}\) with \(\E[|Z_n|]<\infty\) for all \(n\geq1\) is called a submartingale if
\[\E[Z_{n+1} \mid Z_1, \dots, Z_n]\geq Z_n\]
(So \(\E[Z_{n+1}]\geq \E[Z_n]\))
and a supermartingale if
\[\E[Z_{n+1} \mid Z_1, \dots, Z_n] \leq Z_n\]
(So \(\E[Z_{n+1}] \leq \E[Z_n]\))
Theorem If \(N\) is a stopping time for \(\{Z_n: n\geq1\}\) such that any one of the three conditions of Martingale Stopping Theorem holds, then
\(\E[Z_N] \geq \E[Z_1]\) for a submartingale. \(\E[Z_N] \leq \E[Z_1]\) for a supermartingale.
(Note that for the submartingale property \(\E[Z_{n+1} \mid Z_1, \dots, Z_n]\geq Z_n\), this is actually true that \(\E[Z_n \mid Z_1, \dots, Z_k]\geq Z_k\), for all \(k < n\). For a martingale, the equality also holds (Exercise 6.1).)
Lemma If \(\{Z_n: n\geq1\}\) is a martingale and \(f\) is a convex function with \(\E[f(Z_n)]<\infty\) for all \(n\), then \(\{f(Z_n): n\geq1\}\) is a submartingale.
(Cannot be directly proved by Markov Inequality, as expectation and max cannot be swapped: \(\E[\max\{Z_1, \dots, Z_n\}] \neq \max\{\E[Z_1], \dots, \E[Z_n]\}\).)
Proof
Define \(N=\inf \{i: Z_i>a, i\leq n\}\). If \(Z_i\leq a\) for all \(i\leq n\), we let \(N=n\). Then \(N\) is a stopping time with \(\Pr[N\leq n]=1\). (\(N\) is the time that the process first exceeds \(a\).)
(1) Since \(f(x)=|x|\) is convex, so \(\{|Z_i|: i\geq1\}\) is a non-negative submartingale. This follows Kolmogorov’s Inequality for Submartingales.
(2) Since \(f(x)=x^2\) is convex, and \(\Pr[\max\{|Z_1|, \dots, |Z_n|\}>a] = \Pr[\max\{Z_1^2, \dots, Z_n^2\}>a^2]\).
Theorem (Martingale Convergence Theorem)
If \(\{Z_n : n\geq1\}\) is a martingale such that for some \(M<\infty\), \(\E[|Z_n|]\leq M\) for all \(n\), then with probability one, \(\lim_{n\to \infty} Z_n\) exists and is finite.
(Note: \(\{Z_n\}\) converges “almost surely” (See definition in Strong Law of Large Numbers (SLLN)). Intuitively, it means for almost every \(\omega \in \Omega\), \(\lim_{n\to\infty} Z_n(\omega) = Z_{\infty}(\omega)\).)
Proof
Assume that \(\E[Z_n^2]\leq M\) for all \(n\). (This is a stronger assumption, to make the proof easier.)
\(\{Z_n^2:n\geq1\}\) is a submartingale. So \(\E[Z_n^2]\) is non-decreasing in \(n\). So \(\lim_{n\to\infty} \E[Z_n^2] = \mu\) exists.
We want to show that \(\{Z_n: n\geq1\}\) is a Cauchy Sequence with probability 1.
That is, with probability 1, \(|Z_{m+k}-Z_m| \to 0\), as \(m,k \to\infty\) (for every \(\omega \in \Omega\)).
(Because \(X_n \stackrel{\text{a.s.}}{\to} X\) if and only if \(\{X_n\}\) is a Cauchy Sequence with probability 1, if and only if \(\lim_{n\to\infty} \Pr[\sup_{k>0} |X_{n+k}-X_n|>\epsilon]=0\). See proof in Here.)
So \((X(t_1), \dots, X(t_n))\) is a multivariate Normal, and the joint density function is
\[f(x_1, \dots, x_n)=f_{t_1}(x_1) \cdot f_{t_2-t_1}(x_2-x_1) \cdot \dots \cdot f_{t_n-t_{n-1}}(x_n-x_{n-1})\]