Thursday, October 27, 2011

Lecture 7

I played to you some Markov music, taken from the web site Mathematics, Poetry, and Music by Carlos Pasquali. The tune is very monotonous and simple-sounding because the distribution of the nth note depends only on the (n−1)th note. More interesting music could be obtained by letting the state be the last k notes, k>1. You might enjoy reading the Wikipedia article on Markov chains. It has a nice section about various applications of Markov chains to physics, chemistry, testing, information sciences, queueing theory, the Internet, statistics, economics and finance, social sciences, mathematical biology, games, music, baseball and text generation.

There are lots of other crazy things that people have done with Markov chains. For example, at the Garkov web site the author attempts to use a Markov chain to re-create Garfield comic strips. A much more practical use of Markov chains is in the important area of speech recognition. So-called hidden Markov models are used to help accurately turn speech into text, essentially by comparing the sound that is heard at time t with the possibilities that are most likely, given the sound heard at time t−1 and the modelling assumption that some underlying Markov chain is creating the sounds.

Today's lecture was an introduction to the ideas of an invariant measure and invariant distribution. I mentioned that the invariant distribution is also called a stationary distribution, equilibrium distribution, or steady-state distribution.

Several times today I referred to cards and card-shuffling (which have fascinated me since childhood, and in my undergraduate days when I was secretary of the Pentacle Club). It is always good to have a few standard models in your mathematical locker (such as card-shuffling, birth-death chains, random walk on Z and Z^2, etc.) with which you can test your intuition and make good guesses about what might or might not be true. The state space of a deck of cards is of size 52! (=80,658,175,170,943,878,571,660,636,856,403,766,975,289,505,440,883,277,824,000,000,000,000) (all the possible orders). Clearly, the equilibrium distribution is one in which each state has equal probability pi_i = 1/52!. I mentioned today that m_i = 1/pi_i (which is intuitively obvious, and we prove rigorously in Lecture 9). This means that if you start with a new deck of cards, freshly unwrapped, and then shuffle once every 5 seconds (night and day), it will take you about 1.28 x 10^{61} years (on average) before the deck returns to its starting condition. Notice that this result holds for any sensible meaning of a shuffle, provided that it has the effect of turning the state space into one closed class (an irreducible chain).
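If you want to check that arithmetic for yourself, here is a one-line Mathematica calculation (my own little check, assuming one shuffle every 5 seconds and a 365.25-day year). It confirms the figure of roughly 10^{61} years:

(* mean return time = 52! shuffles, at one shuffle per 5 seconds, converted to years *)
years = N[(52! * 5)/(60*60*24*365.25)]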

A very important question in many applications of Markov chains is "how long does it take to reach equilibrium?" (i.e. how large must n be so that the distribution of X_n is almost totally independent of X_0?). You might enjoy reading about the work of mathematician/magician Persi Diaconis in answering the question "how many shuffles does it take to randomize a deck of cards?".

Tuesday, October 25, 2011

Lecture 6

There is now a comments thread at the end of the course page, where you may make comments or ask questions. If you subscribe then you will be sent an email whenever I announce in the comments that the course notes have been updated.

This lecture marks the halfway point of the course. We celebrated by proving George Polya's famous theorem (1921): that symmetric random walk is recurrent on Z and Z^2, but transient on Z^d (d>2). In fact, we proved transience only for d=3, but it is easy to see that this implies transience for all d>2 (Example Sheet 2, #2).
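You can get a feel for this numerically using the criterion from Lecture 5 (Theorem 5.5): i is recurrent if and only if sum_n p_{ii}^{(n)} = infinity. For symmetric random walk on Z the standard formula is p_{00}^{(2n)} = Binomial[2n,n]/4^n, and on Z^2 the return probability at time 2n is the square of this; on Z^3 the corresponding sum converges. Here is a small Mathematica sketch (my own illustration, not part of the notes) showing the partial sums growing without bound in d=1 and, much more slowly, in d=2:

p00[n_] := Binomial[2 n, n]/4^n                           (* P(return at time 2n) for SRW on Z *)
Table[N[Sum[p00[n], {n, 1, M}]], {M, {10, 100, 1000}}]    (* d = 1: grows like 2 Sqrt[M/Pi] *)
Table[N[Sum[p00[n]^2, {n, 1, M}]], {M, {10, 100, 1000}}]  (* d = 2: grows like Log[M]/Pi *)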
I mentioned and recommended Polya's famous little book "How to Solve It". I also quoted this story (from A. Motter):
While in Switzerland Polya loved to take afternoon walks in the local garden. One day he met a young couple also out walking and chose another path. He continued to do this, yet he met the same couple six more times as he strolled in the garden. He mentioned to his wife: how could it be possible to meet them so many times when he randomly chose different paths through the garden?
I talked a bit about electrical networks in which each edge of an infinite graph is replaced with a 1 ohm resistor. I am thinking here not only of graphs such as the regular rectangular or honeycomb lattice, but also graphs as strange as the Penrose tiling (which gives us a non-periodic graph). I told you that the resistance between two nodes in such a network tends to infinity (as the nodes are chosen a distance from one another that tends to infinity) if and only if a symmetric random walk on the same graph is recurrent. I will say more about why this is so in Lecture 12.

Meanwhile, you can apply today's lecture to Example Sheet 1 #15. It is pretty easy to see from the hint that if we know that symmetric random walk on Z^2 is recurrent then symmetric random walk on a honeycomb lattice must also be recurrent. But proving the converse is harder! In fact, one can show that random walk on every sort of sensibly-drawn planar graph is recurrent. To prove this we must decide what we mean by a "sensibly-drawn planar graph" and then find a way to embed any such graph in some other graph on which we know random walk is recurrent. You can learn more about the details of this in Peter Doyle and Laurie Snell's Random walks and electric networks, 2006 (freely available to download).

The material in Section 6.5 (Feasibility of wind instruments) was written after a short email correspondence with Peter Doyle. The story about cities which beggar their neighbours is my own fanciful invention, but it makes the same point without needing to brush up on the theory of fluid dynamics.

This map shows the sources of visitors to this Markov Chains course web site over the past 4 days (20-23 October). Visitors to the course page are averaging about 45 per day and on this blog page about 15 per day. It seems that about 25% of the hits to the course page are from outside Cambridge.

Thursday, October 20, 2011

Lecture 5


Today's lecture was on recurrence and transience. First let me clear up two points that arose from some students' questions.

1. We make the definition that state i is recurrent if P_i(V_i = ∞)=1. It is defined to be transient otherwise, i.e. if P_i(V_i = ∞)<1. Later (in Theorem 5.3) we show that if i is transient then actually P_i(V_i = ∞)=0 (but this is a consequence, not part of our starting definition of transience).

2. In the proof of Theorem 5.4 we use the fact that p_{ii}^{(n+m+r)} ≥ p_{ij}^{(n)} p_{jj}^{(r)} p_{ji}^{(m)}. Please don't think that we are using any summation notation! (We never use the summation convention in this course.) The right-hand side is simply a product of three terms, and the inequality is a simple consequence of the fact that one way to go i→i in n+m+r steps is first to take n steps to go i→j, then r steps to go j→j, and finally m steps to go j→i. There is a ≥ because there are other ways to go i→i in n+m+r steps.
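If you like, you can check this inequality numerically. Here is a toy Mathematica check, with a 3-state transition matrix and values of n, r, m, i, j chosen arbitrarily by me:

P = {{0, 1, 0}, {0, 2/3, 1/3}, {1/2, 1/2, 0}};    (* an arbitrary 3-state chain *)
{n, r, m} = {2, 3, 4}; {i, j} = {1, 2};
lhs = MatrixPower[P, n + r + m][[i, i]];          (* p_{ii}^{(n+r+m)} *)
rhs = MatrixPower[P, n][[i, j]] MatrixPower[P, r][[j, j]] MatrixPower[P, m][[j, i]];
lhs >= rhs                                        (* True *)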

In Theorem 5.5 we gave an important way to check whether a state is recurrent or transient, in terms of the summability of the p_{ii}^{(n)}. This criterion will be used in Lecture 6. There are other ways to check for transience. One other way is to solve the RHE for the minimal non-negative solution to
y_j = sum_k p_{jk} y_k, j ≠ i, and
y_i = 1.
This minimal solution satisfies y_j = P_j(hit i) for j ≠ i. Now check the value of sum_k p_{ik} y_k, which is P_i(return to i). If it is <1 then i is transient. This is essentially the content of Theorem 5.9, which I have put in my published notes but am not going to discuss in lectures. However, you may find it helpful to read the theorem. Its proof is simple.
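To see how this works in practice, here is a sketch of my own (not from the notes). Take the random walk on {0,1,2,...} with p_{0,1}=1 and, for j ≥ 1, p_{j,j+1}=p and p_{j,j-1}=1−p, with p=3/5. The minimal non-negative solution is y_j = ((1−p)/p)^j for j ≥ 1, so sum_k p_{0k} y_k = y_1 = 2/3 < 1 and state 0 is transient. The Mathematica code below finds the minimal solution by iterating the equations starting from y = 0 (with the state space truncated at a large state nn), and recovers y_1 ≈ 2/3:

(* Iterate y_j <- q y_(j-1) + p y_(j+1) for j >= 1, with y_0 = 1, starting from y = 0.  *)
(* The iterates increase to the minimal non-negative solution, here y_j = (q/p)^j.      *)
p = 3/5; q = 1 - p; nn = 400; iters = 4000;
y = ConstantArray[0., nn + 1]; y[[1]] = 1.;      (* y[[j+1]] approximates y_j = P_j(hit 0) *)
Do[y = Join[{1.}, Table[q y[[j]] + p y[[j + 2]], {j, 1, nn - 1}], {0.}], {iters}];
{y[[2]], N[q/p]}      (* y_1 vs the exact value q/p; since p_{0,1}=1 this is sum_k p_{0k} y_k < 1 *)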

I talked for a few minutes about my research on on-line bin packing, in the paper Markov chains, computer proofs, and average-case analysis of best fit bin packing. In this research we consider items whose sizes are uniformly chosen amongst the integers 1,2,...,8, say, which arrive in a stream, and as each item arrives it must be packed in a bin. Initially there are infinitely many empty bins, of some size, say 11. In the Markov chain that models the process of on-line best-fit bin packing the state can be represented as (x1,x2,...,x10), where xi is the number of bins that we have started, but which are not yet full, and which have a gap of i. It is interesting to ask whether there is infinitely often a return to the state (0,0,...,0) in which there are no partially-full bins present (i.e. whether the Markov chain is recurrent). You might like to view these seminar slides for more details. (They were for a faculty colloquium aimed at a general audience of mathematicians, and so it should be well within your knowledge of mathematics to understand them.)

In the on-line bin packing research (and many other problems in queueing theory) researchers often prove results about recurrence and transience using some more sophisticated ideas than in Theorems 5.4 and 5.9 of today's lecture. One of these ideas is Foster's criterion. This says that an irreducible Markov chain is recurrent if we can find a function f : I → R (called a Lyapounov function) and a finite subset of the state space, say J, such that (a) E[ f(X_1) | X_0=i ] ≤ f(i) for all i not in J, and (b) for each M>0 the set of states for which f(i)≤M is finite. Part (a) is essentially saying that outside J there is always drift back to states where f is smaller.
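As a toy illustration of part (a) (my own example, not from the paper): for the reflecting random walk on {0,1,2,...} with p_{j,j+1}=p and p_{j,j-1}=1−p for j ≥ 1, take f(i)=i and J={0}. Then for i ≥ 1 the drift is E[ f(X_1) | X_0=i ] − f(i) = 2p − 1, which is ≤ 0 whenever p ≤ 1/2; and (b) holds since f(i)=i has finite level sets. A one-line symbolic check in Mathematica:

Clear[p, i]
Simplify[(p (i + 1) + (1 - p) (i - 1)) - i]    (* drift = 2p - 1, which is <= 0 when p <= 1/2 *)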

Tuesday, October 18, 2011

Lecture 4


In today's lecture we had the definition of a stopping time. This brings to my mind a small riddle (which I think I heard from David Kendall). "How do you make perfect toast? Answer: Wait until it smokes – then 10 seconds less."

Stopping times play a large role in probability theory. One very important idea is the following. Consider Example Sheet 1 # 10, the gambler's ruin problem played on {0,1,...,10} in the fair game case (p=q=1/2). If the gambler has i and bets j (1≤j≤i), she is equally likely to next have i−j or i+j. So E(X_{n+1} | X_n)=X_n, no matter how much she bets. This implies that E(X_{n+1} | X_0)=X_0 no matter how she bets. It is a theorem (Doob's optional sampling theorem) that E(X_T | X_0)=X_0 for any stopping time T (such that ET< infinity, as indeed must be the case for any stopping time in this problem). Thus, we see that there is no way that the gambler can make any expected profit (or loss), no matter how clever a strategy she uses in choosing the sizes of her bets and when to stop gambling.
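You can see this in a quick simulation. The sketch below (my own, with an arbitrarily chosen "bold play" strategy: bet Min[x,10−x]) plays the fair game many times from X_0 = 2 and averages the final fortune X_T; by optional stopping the average should be close to 2, whatever strategy is used:

(* Fair gambler's ruin on {0,...,10}, start at 2, bold play: bet Min[x, 10 - x]. *)
play[] := Module[{x = 2, bet},
   While[0 < x < 10,
     bet = Min[x, 10 - x];
     x += If[RandomReal[] < 1/2, bet, -bet]];
   x];
N[Mean[Table[play[], {10^5}]]]    (* should be close to E(X_T | X_0 = 2) = 2 *)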

The optional sampling theorem also gives us a quick way to answer the first part of Example Sheet 1 #10. If T is the first time that X_n hits either 0 or 10, then E(X_T | X_0=2)=2 implies P_2(hit 0)·0 + P_2(hit 10)·10 = 2. Hence P_2(hit 10) = 2/10 = 1/5 (and P_2(hit 0) = 4/5). That's even easier than solving the RHEs!

In lectures today I forgot to make the comment that the interchange of E_i and sum_{j in I} that occurs halfway down page 14 is an application of Fubini's theorem. This theorem says that we can reverse the order of sums, i.e.

sum_i sum_j a_{ij} = sum_j sum_i a_{ij}

when these are sums over countable sets and the sum is absolutely convergent. Fubini's theorem is presented in more generality in Part II Probability and Measure.

Section 4.1 is of course a general version of Example Sheet 1 #12. When doing #12, make sure you understand how the condition of "minimal non-negative solution" is being used.

I spent the final 10 minutes of this lecture describing Arthur Engel's probabilistic abacus (a chip firing game) for calculating absorption probabilities in a finite-state Markov chain in which all entries of P are rational. My slides and commentary are in Appendix C, along with an exposition of Peter Doyle's proof that the algorithm really works. I first heard about this abacus in 1976 when Laurie Snell was visiting the Statistical Laboratory. Snell (1925-2011) was a student of Joe Doob, one of the 'greats' of probability theory, whose optional sampling theorem I have mentioned above. Snell is the author (with John Kemeny) of several classic textbooks (including one called "Finite Markov Chains"). He is the founder of Chance News (which can be fun to browse). A particular memory that I have of Professor Snell is that he liked to go to London to play roulette at the casinos there. This struck me as a very peculiar recreation for an expert in probability. But I think it was for fun - he never claimed to make money this way.

You might be amused to know that Question #10 on Example Sheet 1 is actually an old tripos question from 1972 (Paper IV, 10C). I took IB in 1973, so you and I share at least one question upon which we have practised as students.

Until now there were several typos in the published notes for Lecture 4. Please make sure you have a current copy, which I think is now fully accurate.

Thursday, October 13, 2011

Lecture 3


You should now be able to do all of Example Sheet 1 (excepting that #12 will be easier after seeing Section 4.1 in the next lecture).

I mentioned that Theorem 3.4 in this lecture is similar to the result that you were taught in Probability IA, concerning the probability of ultimate extinction of a branching process. Remember that in a branching process each individual independently produces offspring for the next generation, according to a distribution in which there are k offspring with probability p_k (k=0,1,...). Given that we start with one individual, the probability of ultimate extinction, say u, is the minimal solution to
u = G(u) = sum_k p_k u^k
where G is the probability generating function for the number of offspring.

Do you remember the proof that u is the minimal solution to u=G(u) , and do you see how similar it is to the proof in Theorem 3.4?
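As a quick numerical example (my own choice of offspring distribution, not one from the lectures), suppose the offspring distribution is Binomial(3, 0.6), so the mean number of offspring is 1.8 > 1. Iterating u -> G(u) from u=0 gives exactly the sequence u_n = P(extinct by generation n), which increases to the minimal root of u = G(u), here about 0.1:

g[u_] := Sum[PDF[BinomialDistribution[3, 6/10], k] u^k, {k, 0, 3}]   (* the p.g.f. G *)
FixedPoint[g, 0., 1000, SameTest -> (Abs[#1 - #2] < 10^-10 &)]       (* minimal root of u = G(u) *)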

I have called the equations of the form x=Px (i.e. x_i = sum_j p_{ij}x_j) "right-hand equations", because x appears on the right-hand side of P. Later in the course we will find a use for "left-hand equations", of the form x=xP. So far as I can tell, this terminology does not appear in any modern books on Markov chains. However, it was language used by David Kendall in the course I took from him in 1972 and I have always found it to be helpful terminology.

I hope you enjoy question #13 on Example Sheet 1 (which is a new question this year). It contains a result that many people find surprising. In a Red-Black game in which p<q the strategy of bold play is optimal (but not necessarily uniquely so). This fact is proved in the Part II course Optimization and Control (see Section 4.3 "Optimal gambling" in the Optimization and Control course notes.) Part of that course is about Markov Decision Processes, which are Markov chains in which we have some control over the transitions that occur, and we try to minimize (or maximize) costs (or rewards) that accrue as we move through states.

Question #14 on the examples sheet extends the idea of a gambling game between 2 players to a game amongst 3 players. It is remarkable that there is such a nice formula for the expected number of games until one of the players first becomes bankrupt.

Tuesday, October 11, 2011

Lecture 2 (more)


The fact that we can solve for p_{11}^{(n)} in the stated form follows from the fact that p_{11}^{(n)} satisfies a recurrence relation given by

q(P)_{11} = 0

where q(x)=det(xI−P) is the characteristic polynomial. By the Cayley-Hamilton theorem we know q(P)=0 (see IB Linear Algebra). You know how to solve such recurrence relations (from IA Differential Equations). I have already said something about this in my earlier blog entry for Lecture 2. In particular, although some matrices cannot be diagonalized, every square matrix P can be written in Jordan normal form, P=UJU^{−1} (see IB Linear Algebra), and P^n=UJ^nU^{−1} is then of the form that takes account of repeated eigenvalues.
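As a quick illustration (using the matrix of Example Sheet 1 #7 with p=1/16, as in part (a) of the program below), you can check q(P)=0 in Mathematica; this is exactly what justifies solving a linear recurrence for p_{11}^{(n)}:

(* Cayley-Hamilton: q(P) = 0, where q(x) = Det[x I - P]. *)
Clear[P, x]
P = {{0, 1, 0}, {0, 2/3, 1/3}, {1/16, 15/16, 0}};
q[x_] = Det[x IdentityMatrix[3] - P];
coeffs = CoefficientList[q[x], x];
Sum[coeffs[[k + 1]] MatrixPower[P, k], {k, 0, 3}]    (* the zero matrix *)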

I believe that a good mathematician these days should not only understand theory and proofs, but also be able to use modern computing tools to quickly work out examples.

I highly recommend that you install Mathematica and use it while you study and work on examples sheets (in many courses). I personally use it on almost a daily basis. You can download a free copy:

http://www.damtp.cam.ac.uk/internal/computing/software/mathematica/

It is very easy to install and learn to use. The time you spend learning to use it (and similarly MATLAB) is a very good investment.

Below is a short Mathematica program that does Example Sheet 1, #7. I think that a well-motivated student might like to do this question by hand, and then subsequently check the answer using Mathematica. By the way, I would not expect an examiner to set a question in tripos that is as difficult to do by hand as doing all of (a), (b) and (c) in #7. However, it would be fair to ask the answer for just one value of p.

Clear[P,p,p11]
P={{0,1,0},{0,2/3,1/3},{p,1-p,0}};
(* Solution to (a) *)
mu=Eigenvalues[P /. p->1/16]
p11[n_]=a mu[[1]]^n+ b mu[[2]]^n+ c mu[[3]]^n;
Solve[{p11[0]==1,p11[1]==0,p11[2]==0},{a,b,c}]
p11[n] /. %[[1]] //Expand
Out[1]= {1,-(1/4),-(1/12)}
Out[2]= {{a->1/65,b->-(2/5),c->18/13}}
Out[3]= 1/65-1/5 (-1)^n 2^(1-2 n)+1/13 (-1)^n 2^(1-2 n) 3^(2-n)
(* Solution to (b) *)
mu=Eigenvalues[P /. p->1/6]
p11[n_]=a mu[[1]]^n+ b mu[[2]]^n+ c mu[[3]]^n;
Solve[{p11[0]==1,p11[1]==0,p11[2]==0},{a,b,c}]
p11[n] /. %[[1]] //ComplexExpand
Out[4]= {1,-(1/6)+I/6,-(1/6)-I/6}
Out[5]= {{a->1/25,b->12/25-(9 I)/25,c->12/25+(9 I)/25}}
Out[6]= 1/25+1/25 2^(3-n/2) 3^(1-n) Cos[(3 n \[Pi])/4]+1/25 2^(1-n/2) 3^(2-n) Sin[(3 n \[Pi])/4]
(* Solution to (c) *)
mu=Eigenvalues[P /. p->1/12]
p11[n_]=a +(b+c n)mu[[2]]^n;
Solve[{p11[0]==1,p11[1]==0,p11[2]==0},{a,b,c}]
p11[n] /. %[[1]]

Out[7]= {1,-(1/6),-(1/6)}
Out[8]= {{a->1/49,b->48/49,c->-(6/7)}}
Out[9]= 1/49+(-(1/6))^n (48/49-(6 n)/7)

Lecture 2


Following this lecture you should be able to do Example Sheet 1, #1−9 (except 8 (b), although you can probably guess what is meant).

Someone has rightly pointed out that at the bottom of page 2 of the notes, the line should end p_{ i_{n−1} i_n }. This has now been corrected in the notes.

Following the lecture someone asked me a question about the eigenvectors of matrices, specifically, "Is the set of eigenvalues obtained from left-hand eigenvectors the same as that obtained from right-hand eigenvectors?" (The answer is yes.) This question tells me that I might have referred to facts of linear algebra that are fairly new to you, or only briefly covered in the IA Vectors and Matrices course. In the IB Linear Algebra course you will learn more. For example, the schedules include "Algebraic and geometric multiplicity of eigenvalues. Statement and illustration of Jordan normal form." It will be interesting for you to think again about calculation of P^n once you know that it is always possible to write P=UJU^{−1}, where J is an almost diagonal matrix, the columns of U are right-hand (generalized) eigenvectors, and the rows of U^{−1} are left-hand eigenvectors. Having said this, we shall not actually need any advanced results from linear algebra in our course. Most of our proofs are probabilistic rather than matrix-algebraic. Today's discussion of P^n is perhaps the one exception, since if you wish to fully understand the solution of the recurrence relations when the characteristic polynomial has repeated roots then the representation P=UJU^{−1} is helpful.

This lecture was mostly about how to calculate the elements of P^n by solving recurrence relations. We ended the lecture with definitions of "i communicates with j", the idea of class, and closed and open classes. If the Markov chain consists of only one class (and so every state can be reached from every other) then the Markov chain is said to be irreducible.

Notice that if P is m x m and irreducible then Q=(1/m)(I+P+P^2+···+P^{m-1}) is a transition matrix all of whose elements are positive (can you see why? A hint is the pigeonhole principle).
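For instance (a small check of mine), take the 3-state cycle. It is irreducible, but every power P^n is a permutation matrix and so has zero entries; nevertheless Q is strictly positive:

P = {{0, 1, 0}, {0, 0, 1}, {1, 0, 0}}; m = 3;      (* the 3-cycle *)
Q = (1/m) Sum[MatrixPower[P, k], {k, 0, m - 1}];
Q // MatrixForm                                    (* every entry is 1/3 > 0 *)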

Here now is a sidebar on some interesting results in matrix algebra that are related to today's topics. We said in this lecture that if P is m x m and has m distinct eigenvalues, 1, mu_2, ..., mu_m, then

p_{ij}^{(n)} = a_1 + a_2 mu_2^n + ··· + a_m mu_m^n

for some constants a_1, a_2, ..., a_m.

We would like to know more about the eigenvalues mu_2, ..., mu_m. In particular, let |mu_j| denote the modulus of mu_j. If |mu_j|<1 for all j>1 then p_{ij}^{(n)} tends to a_1 as n tends to infinity (as we see happening in Example 2.1).
We claim the following.
  1. |mu_j| ≤ 1 for all j>1.
  2. Suppose there exists n such that P^n is strictly positive. Then |mu_j|<1 for all j>1, and so p_{ij}^{(n)} tends to a_1 as n tends to infinity.
These facts are consequences of Perron-Frobenius theory. This theory (dating from about 100 years ago and useful in many branches of mathematics) says the following.

Suppose A is a square matrix which is non-negative and irreducible (in the sense that for all i,j we have (A^n)_{ij}>0 for some n). Then there exists a positive real number, say lambda, such that (i) lambda is an eigenvalue of A, (ii) lambda has multiplicity 1, (iii) both the left- and right-hand eigenvectors corresponding to lambda are strictly positive, (iv) no other eigenvector of A is strictly positive, (v) all eigenvalues of A are in modulus no greater than lambda, (vi) if, moreover, A is strictly positive then all other eigenvalues of A are in modulus strictly less than lambda, and (vii) min_i sum_j a_{ij} ≤ lambda ≤ max_i sum_j a_{ij}, i.e. lambda lies between the minimum and maximum of the row sums of A.

So in the case that A =P (i.e. the transition matrix of an irreducible Markov chain), (vii) implies that lambda=1 (and (1,1,...,1) is the corresponding right-hand eigenvector).

Of the claims made earlier, 1 follows from (v) and (vii). To see 2, we have from (vi) that if P^n is strictly positive then all its eigenvalues different from 1 are in modulus strictly less than 1. But if mu is an eigenvalue of P then mu^n is an eigenvalue of P^n (and if mu ≠ 1 then mu^n cannot be the eigenvalue 1 of P^n, since by (ii) that eigenvalue has multiplicity 1 and is already accounted for by the eigenvalue 1 of P). Hence for every eigenvalue mu of P other than 1 we must have |mu|<1.

As mentioned above, if P is irreducible then Q=(1/m)(I+P+P^2+···+P^{m−1}) must be positive (i.e. a matrix of positive elements). Thus from Perron-Frobenius theory, Q has a largest eigenvalue 1 and all its other eigenvalues are in modulus strictly less than 1. From these observations it follows that

Q^n tends to a limit as n tends to infinity.

Notice that a Markov chain with transition matrix Q can be obtained by inspecting our original chain at times 0, 0+Y_1, 0+Y_1+Y_2, 0+Y_1+Y_2+Y_3, ..., where the Y_i are i.i.d. random variables, each being uniformly distributed over the numbers 0,1,...,m−1.

Thursday, October 6, 2011

Lecture 1

Lecture 1 was on Thursday October 6 at 10am. It was attended by almost all IB students. Some notes are already on line. Something fun is Appendix C.

The introduction of a jumping frog (whom I called Fred) into Example 1.1 is window-dressing. However, the frog and lily pad metaphor was used by Ronald A. Howard in his classic book Dynamic Programming and Markov Processes (1960). I read this book long ago and the image has stuck and been helpful. It gets you thinking in pictures, which is good.

Markov chains are a type of mathematics that I find to be highly visual, by which I mean that the problems, and even the proofs (I'll give examples later), can be run through my mind (and yours) in a very graphic way - almost like watching a movie play out.

Our course only deals with Markov chains in which the state space is countable. Also, we focus on a discrete-time process, X_0, X_1, X_2, .... It is really not fundamentally different to deal with Markov processes that have an uncountable state space (like the non-negative real numbers) and which move in continuous time. The mathematics becomes only mildly more tricky. An important and very interesting continuous-time Markov process is Brownian motion. This is a process (X_t)_{t ≥ 0} which is continuous and generalizes the idea of random walk. It is very useful in financial mathematics.

Someone asked me afterwards to say more about simulation of a Markov chain. I am talking about simulation of a Markov chain using a computer. Think about how we might simulate the outcome of tossing a fair coin on a computer. What we do is ask the computer to provide a sample of a uniform random variable on [0,1]. In Mathematica, for example, this could be done with the function call U=Random[ ]. If U<1/2 we say Tails, and if U≥1/2 we say Heads.
Similarly, if we wished to simulate the two-state Markov chain in Example 1.4 we would, if X_n=1, set X_{n+1}=1 if U lies in the interval [0,1−alpha) and set X_{n+1}=2 if U lies in [1−alpha,1]. If X_n=2, we would set X_{n+1}=1 if U lies in [0,beta) and set X_{n+1}=2 if U lies in [beta,1]. The content of Section 1.4 generalizes this idea in an obvious way. Simulation of random processes via a computer is an important tool. For example, we might like to run a computer simulation of a road traffic network to see what delays occur when we try different strategies for sequencing the traffic lights at intersections.
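Here is how that recipe looks as a short Mathematica sketch (my own illustration; the values alpha=0.3, beta=0.2 and the run length are arbitrary). Later in the course we will see that over a long run the fraction of time spent in state 1 should be close to beta/(alpha+beta) = 0.4.

(* Simulate the two-state chain of Example 1.4: one uniform U per step, as described above. *)
alpha = 0.3; beta = 0.2; nsteps = 1000;
step[x_] := Module[{u = RandomReal[]},
   If[x == 1, If[u < 1 - alpha, 1, 2], If[u < beta, 1, 2]]];
path = NestList[step, 1, nsteps];
N[Count[path, 1]/Length[path]]    (* fraction of time spent in state 1 over this run *)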

Section 1.6 is about the calculation of P^n when P is a 2x2 matrix. We found P^n by writing P=UDU^{−1}, where D=DiagonalMatrix[{1,1−alpha−beta}]. We did not need to know U explicitly. But, as an exercise, we can easily find it. Notice that for any stochastic matrix P there is always a right-hand eigenvector (1,1,...,1) (a column vector), since the row sums of P are 1. So there is always an eigenvalue of 1. The other right-hand eigenvector is (alpha,−beta), with eigenvalue 1−alpha−beta. So we could take U={{1,alpha},{1,−beta}}, whose columns are these right-hand eigenvectors. Can you also find the left-hand eigenvectors? They are of course the rows of U^{−1}.
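Here is a quick symbolic verification of this diagonalization (a sketch of mine, using Mathematica):

(* Verify P = U.D.Inverse[U] for the general two-state chain, where the columns of U *)
(* are the right-hand eigenvectors (1,1) and (alpha,-beta).                          *)
Clear[alpha, beta]
P = {{1 - alpha, alpha}, {beta, 1 - beta}};
U = {{1, alpha}, {1, -beta}};                      (* columns (1,1) and (alpha,-beta) *)
d = DiagonalMatrix[{1, 1 - alpha - beta}];
Simplify[U.d.Inverse[U] - P]                       (* the zero matrix *)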

In Section 1.6 I gave two methods of finding p_{11}^{(n)}. The first method obviously generalizes to larger matrices. The second method is specific to this 2x2 example, but it is attractive because it gets to the answer so quickly.

Notice that throughout my blog I use pseudo-LaTeX notation for mathematics. So x_i is "x sub i", alpha is the Greek character alpha, and so forth. (I say pseudo-LaTeX, because in actual-LaTeX one has to put $ signs around things, e.g. $x_i$, and put a \ in front of a Greek character name, e.g. \alpha). LaTeX is the language in which my lecture notes (and almost all modern mathematical papers) are written. When I want to do some algebra or work with mathematical objects I sometimes use Mathematica's notation. So DiagonalMatrix[{1,1−alpha−beta}] is the 2x2 diagonal matrix which has 1 and 1−alpha−beta on the diagonal. Also, {{1,alpha},{1,−beta}} is the 2x2 matrix whose rows are {1,alpha} and {1,−beta}.


There was one typo in the notes that were on this site until today. You should note this if you downloaded a copy of the notes before today. In Theorem 1.3 the statement should be "if and only if" not just "if". As I said, this theorem is not particularly deep, but it gives us practice in using the notation and understanding how Markov chains work.

You will need the ideas in Section 1.2 to do Example Sheet 1 #2, #3 and #4.