User:Gill110951/Probability notation


Essay on Probability Notation

On a suggestion by an editor on the Wikipedia Monty Hall problem arbitration case, I will write down some ideas here about popular notations in probability theory and its applications.

Probabilists, and those who use probability, such as statisticians, engineers, financial engineers, computer scientists, physicists, epidemiologists, bioscientists ... need to talk both formally and informally about probabilities of events, about random variables, about the distributions of random variables, about their means, their probability densities or mass functions. Then there are conditional probabilities and conditional expectations. And lots more besides, but this is enough to start with.

Probability theory is part of mathematics and mathematicians also talk about and use probability. The mathematical point of view, and mathematical conventions of notation, can be very different from the statistician's point of view, the physicist's point of view, and so on. Many notational conventions in this broad field, for instance those used in physics and in statistics, were invented by physicists and statisticians long before probability became an established and respectable branch of modern mathematics. Resistance is futile, you will be assimilated ... did not apply. The optimal notation for communicating ideas between users of probability, and in particular between statisticians, does not have to follow the rules of modern mathematical good taste ... which are of course also a convention belonging to some time and place in the history of science. Computer science has already started to influence the notation of statistics as well as its practice.

Let's get started.

Probabilities

Probabilities are very often represented by writing plain vanilla P(...), meaning the probability of .... Some writers like Pr(...) and others Prob(...). Some mathematical writers like to distinguish typographically that the P of the plain vanilla notation is standing for a generic probability rather than one particular function (actually, probability measure) by using special alphabets (more precisely: typefaces), for instance bold P, "open" (blackboard bold) ℙ, or Gothic (Fraktur) 𝔓. Or just by using Roman rather than Italic. Note that Pr and Prob, just like sin, cos and log, are customarily typeset in Roman, not in Italic. They are abbreviations of real words, not sequences of mathematical symbols. "Prob" is not a sequence of four mathematical variables P, r, o, b.

...of what?

The things that have probabilities are (are called) events. In elementary probability theory, A, B, C are typical events. So it should be no surprise that P(A) stands for the probability of (the event) A. For instance, the probability that I will ever finish this essay.

It was a surprise to me to learn that this notation is actually relatively new (anything which happened in the lifetime of my parents I tend to consider not terribly long ago, and anything in my own lifetime recent). See this nice page [1].

Random Variables

Now it gets complicated

Then we come to the other big player: the random variable. As mathematicians know, in the axiomatic (Kolmogorov) theory by which probability theory can be built on set theory and seen as a special case of measure theory, random variables are actually deterministic functions, but if this doesn't make sense to you, that's fine. In fact: forget I ever said it, please!

Random variables are random things which can take on different values, by chance. And in fact we're interested in the particular chances with which they take on particular values. In elementary expositions, random variables are typically called X, Y, Z. Note that we are at the far end of the alphabet from all the events, and this is deliberate! It's deliberate, and it's done to help beginners. Later we sophisticated folk will stop adhering to this convention, because we have internalized the concepts and don't need explicit prompting all the time.

An event happens or not. Your throw of five poker dice results in a full house or it doesn't. A random variable results in a number. When you toss 5 ordinary dice the total number of dots on the five top faces can be any sum from 5 to 30. This number is sometimes called the outcome of the random variable.

The relationship between random variables and events is simple: something you say about a random variable's outcome is an event. For instance, if X is the total number of dots when you toss 5 dice, then that outcome might or might not exceed 10. So "X > 10" is an event. One might want to know its probability. Or to do other "event things" with it, for instance, give it a name, a name like A! It's customary to use curly brackets, writing {X > 10}, which is actually also a shorthand version of a formal set-theoretic notation for the same thing, when we are building it Kolmogorov-wise on set theory. But that is something which most readers won't want or need to know. Anyway, for various good reasons we typically write {X > 10} for the event "total number of dots, or outcome of tossing five dice, exceeds 10".

Events have probabilities. Usually we relax and leave out brackets which a pedantic person can insert by hand if they are so inclined, and just write P(X > 10) (no curly brackets inside the round ones) for the probability (of the event that) the random variable in question exceeds 10.
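For readers who like to see such things run, here is a minimal Python sketch (my own illustration, nothing canonical about it) of the event {X > 10} and a simulated estimate of its probability:

```python
import random

# Simulate the random variable X = total number of dots on five fair dice,
# and estimate P(X > 10), the probability of the event {X > 10}.
def five_dice_total():
    return sum(random.randint(1, 6) for _ in range(5))

trials = 100_000
hits = sum(five_dice_total() > 10 for _ in range(trials))
print("P(X > 10) is approximately", hits / trials)
```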

Roughly up to the 1930s, a "random variable" was a rather vague notion, and it was only people like physicists or statisticians who ever talked about random variables for their own devious purposes. Mathematicians did occasionally talk about probabilities (and certainly also did a lot of both difficult and important mathematics involving probability mass functions and probability density functions, see below).

This is one of the reasons why Kolmogorov is one of my great scientific heroes. He made an honest mathematician out of me, a statistician. Now I'm proud to be both.

Again I was stunned, this time to learn just how recently the notation X for a random variable and x for one of its values was introduced [2]. This maybe explains why physicists, who tend to get by pretty well with 19th century mathematics (except when they do quantum theory, for which they use a load of highly abstract mathematics), use such funny notation (and have such funny ideas) concerning probability and statistics.

More complicated still: distributions and mass functions

Obviously, "X = 10" is also an event, so of course we (well - we mathematicians that is) usually denote it by {X = 10}, and its probability by P(X = 10). Now we might be interested in doing things with probabilities of our random variable taking arbitrary values, i.e., we want to keep the value open for the time being, and only for instance see what happens when it is actually 10, or actually 11, or something else, later. It's customary (well - in the social circles which I frequent) to use corresponding small letters to denote possible values of random variables whose name was the corresponding capital letter. Let the mathematical variable x stand for an arbitrary possible value of the random variable X. Then {X = x} is an event, and P(X = x) is a probability. What the probability is depends, obviously, both on which random variable we are talking about (total number of dots on five dice? the total of the three lowest?) and on the value we are talking about, x, which could in principle be any number you like, even negative, even not a whole number. For the five dice example, P(X = x) is something interesting when x is 5, 6, 7, ... or 30; and otherwise it is zero. And notice in this example that those 30 - 5 + 1 = 26 interesting probabilities add up to 1.

Those 26 interesting probabilities for the possible total number of dots on five dice determine what is called the probability distribution of the random variable in question. Note: I don't say they are the distribution, though some people, for instance most physicists, would use that language. For a mathematician a probability distribution is something a bit abstract; I would say it stands for the collection of all the probabilities of all the events which you could ever write down concerning X only.

We use the words probability mass function to stand for the collection of values of probabilities of elementary events concerning a random variable like X, I mean the events that it takes on a particular specific value, the events {X = x}. Each one gets a probability. For silly values of x the value is zero, for a relatively small collection of interesting values it is an interesting probability, and those interesting probabilities all add up to 1.

A typical probability mass function might well be denoted by p, and its values by p(x). In terms of a picture, p stands for a complete graph, p(x) for the y-coordinate of the point on the graph above x-coordinate x. In computer or database terms, p is a whole table, p(x) is the value found by looking up the table in a particular position.
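In a language like Python the table metaphor can be taken literally; here is a sketch (mine, not anything official) of the lookup-table p for the five-dice example:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The mass function p of X = total dots on five fair dice, literally a table:
# a dictionary mapping each possible value x to p(x).
counts = Counter(sum(dice) for dice in product(range(1, 7), repeat=5))
p = {x: Fraction(n, 6**5) for x, n in counts.items()}

print(p[10])            # p(10), one of the 26 interesting values
print(p.get(31, 0))     # p(31) = 0, a "silly" value of x
print(sum(p.values()))  # the interesting probabilities add up to 1
```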

Now when we start to have to talk about a number of different random variables at the same time, notation either gets heavy, or it gets dangerous (if not downright illegal according to more pedantic schools of thought). How to associate particular probability mass functions with particular random variables?

The mathematician's solution is by use of subscripts. So p_X(x) stands for the probability mass function of the random variable X, looked up at the value x. For instance, p_X(10), or p_Y(3). These two last mentioned objects stand for the probabilities (of the events that) X = 10 and Y = 3 respectively.

Ha, did you notice? I got lazy and stopped writing the curly brackets.

The lazy physicist avoids writing down things which are obvious from the context. So almost everyone (well, everyone who thinks like many physicists think and also like many statisticians think) will understand that p(x) stands for p_X(x) while p(y) stands for p_Y(y).

(I'm sorry for those non-science types who just get dizzy when they see stuff like this anyway.)

The trouble with the lazy notation comes when we start substituting particular values. What are we talking about when we write p(3)? p_X(3), or p_Y(3), or something else? And why can't we also use y, as well as x, to stand for possible values of X? Well, the answer is, of course we can! There's nothing wrong at all with p_X(y). As long as you know the difference between small letters and capital letters, and are aware of the convention (but beware, conventions may be broken!) that small letters stand for possible outcomes of random variables, i.e., for possible numbers, while the capital letters stand for random variables themselves ...

<rant>Nowadays many university students seem not to distinguish between capital letters and small letters. For that matter, I can hardly distinguish any of their handwritten letters from one another, to be honest! They also are not particularly interested in distinguishing between subscripts and superscripts and the ordinary things in between, whether by size or by position. They have a bl***y hard time learning maths, and I have a bl***y hard time marking their exam papers.</rant>

Seriously, I think that at some point mathematics made a notational choice which is somehow like the difference in computer programming between the paradigms of object oriented programming and procedural programming. It would be fun to write a bit about this. The physicists somehow stayed with procedural, while the mathematicians went object oriented. It seemed the right thing at the time.

And then densities

Some random variables aren't restricted to outcomes that are whole numbers. At least in an abstract world, their values could be more or less any number, or at least any number within some range. For instance, if I toss infinitely many fair coins I get an infinite sequence of heads and tails. Coding heads by a one and tails by a zero, and stringing all the zeros and ones together while putting zero and a dot in front, I get a binary fraction, for instance 0.1100101... stands for Head, Head, Tail, Tail, Head, Tail, Head,... . The random outcome can therefore be any number between zero and one, represented in binary!

(The astute reader will notice I am ignoring outcomes that finish in ones forever, such as 0.1100101...1111111..., which we do not use to represent numbers in binary. The really astute reader will easily figure out that these can be enumerated and altogether have zero probability, so we might as well forget about them anyway.)

Let's call that infinite sequence of zeros and ones, the outcome of tossing one fair coin forever, encoded, by binary trickery, as a single real number between 0 and 1, the random variable U. We can see if the first coin fell heads or tails by looking to see if U is larger or smaller than 1/2. Double U and you get a number between 0 and 2. If it is bigger than 1, subtract 1. Call the result V. You can see whether the second coin fell heads or tails by looking to see if V is larger or smaller than 1/2. And so on. You don't even have to learn anything about binary numbers! I am using the U for uniform here, but you can call it X if you prefer, what's in a name?

Unlike our old friend the total dots on five dice, U has exactly probability zero to take precisely any particular value u. Because the chance of tossing precisely Head, Head, Tail, Tail, Head, Tail, Head, ... is half times half times half times half ... which obviously must be zero. (Certainly it is smaller than 1/2 multiplied by itself any number of times you like. And here I am assuming so-called standard mathematics. Infinitesimals have been banned.) So the probability mass function of U is zero for all values of u.

Yet U does have interesting probabilities to take values in interesting ranges: for instance, P(U ≤ 1/4) is easily figured out to be exactly 1/4 (coincidence???) because the event in question, {U ≤ 1/4}, consists of all those infinite sequences of coin tosses which begin with Tail, Tail.
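A sketch of this encoding in Python, truncating the infinite sequence at 53 tosses (the precision of a double; a compromise forced on us by finite computers):

```python
import random

# Encode a (truncated) infinite sequence of fair coin tosses as a binary
# fraction: Head = 1, Tail = 0. The k-th toss contributes bit k, worth 1/2**k.
def uniform_from_coins(n_tosses=53):
    return sum(random.randint(0, 1) / 2**k for k in range(1, n_tosses + 1))

trials = 100_000
hits = sum(uniform_from_coins() <= 0.25 for _ in range(trials))
print("P(U <= 1/4) is approximately", hits / trials)  # close to 1/4
```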

This is where so-called probability density functions come in. We can compute the probability that takes values within some range by integrating some function over that range, just as the probability that our old friend takes a value in some range can be found by adding its mass function over the range. Instead of a probability mass function we now talk about a probability density function. It really is a density, it means probability per unit of whatever. A random height, measured in feet, has a probability density, which has units of "probability per foot". Convert to inches and your height goes up by a factor 12 but your probability density goes down by a factor 12. At least, that's assuming infinite precision height. Which could be a mathematical fiction, but then all physics is a mathematical fiction, so apparently mathematical fictions can be pretty useful for understanding and controlling (and destroying, even) the real world.

The probability density of the distribution of the random variable U is 1 for u between 0 and 1, and 0 elsewhere.

If "height" was by definition always a whole number of, say, centimeters, then a random height would have a mass function, not a density function, and we would be pretty stuck with how to convert it to inches.

It's common to use notation like f_X(x) for the probability density function of the random variable X evaluated at the value x. So again, f is a graph or a lookup-table, and f(3.14159...) is the number which you find when you evaluate it / look it up at the value 3.14159... (supposing that by "3.14159..." I mean precisely π).

Mass functions are densities

From an advanced point of view, mass functions are just densities: densities with respect to something called counting measure, while an ordinary density is actually a density with respect to Lebesgue measure. (The counting measure of a set is the number of points in the set; the Lebesgue measure of an interval on the line is the length of the interval.) So it can be convenient, and it certainly is legal, to use the same notation, some prefer p and some prefer f, for both objects. You just have to know whether to add values, or integrate values, when you come to calculate probabilities (and more generally, compute expectation values ... but that's another subject).
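For the mathematically inclined, the unification can be written out; a small LaTeX aside (μ being the measure in question):

```latex
P(X \in A) \;=\; \int_A f \,\mathrm{d}\mu \;=\;
\begin{cases}
  \sum_{x \in A} p(x), & \mu = \text{counting measure (add the mass function),}\\[4pt]
  \int_A f(x)\,\mathrm{d}x, & \mu = \text{Lebesgue measure (integrate the density).}
\end{cases}
```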


Conditioning

This section is going to have a lot of pretty formulas, culminating, as you might guess, with Bayes' theorem, in different versions: for events, for discrete random variables, and for continuous random variables. For very good reasons the "formula" will look much the same for each of those cases. But in each case it is a formula about completely different kinds of things, so belonging to rather different contexts.

Conditional probabilities, conditional distributions, conditional densities

If A and B are any two events, we define the probability of A given B, written P(A|B), by requiring the chain rule to be true: the probability of A and B together should be the probability of B times the probability of A given B. This forces

P(A|B) = P(A and B) / P(B),

at least, as long as P(B) is not zero. (If it were zero, then the probability of A and B, which cannot be bigger, must also be zero; zero divided by zero is not defined).
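A tiny worked instance in Python (my own numbers: one fair die, B the event "the outcome is even", A the event "the outcome is six"):

```python
from fractions import Fraction

outcomes = range(1, 7)  # one fair die, each outcome with probability 1/6

def P(event):
    return Fraction(sum(1 for w in outcomes if event(w)), 6)

P_B = P(lambda w: w % 2 == 0)     # P(B) = 1/2
P_A_and_B = P(lambda w: w == 6)   # P(A and B) = P(six, which is even) = 1/6
print(P_A_and_B / P_B)            # P(A|B) = 1/3
```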

Bayes' theorem for events

Suppose events B_j, for j = 1, ..., n, are mutually exclusive and exhaustive. That means to say that just one of them has to happen. Let A be some other event. Then by the definition of conditional probability and the fact that probabilities add up over mutually exclusive ways in which an event can happen,

P(B_j | A) = P(A | B_j) P(B_j) / ( P(A | B_1) P(B_1) + ... + P(A | B_n) P(B_n) ).

Zero times undefined equals zero. Zero divided by zero equals undefined.

In applications, one can think of the B_j as being a collection of mutually exclusive causes of some event A. The formula shows how the initial probabilities of the B_j are converted to conditional probabilities given A, on learning that the event A has happened.

The big expression with the summation in the denominator (that means: downstairs) of the right hand side is the same, whatever event B_j we look at on the left hand side. It's a fact that the conditional probabilities on the left hand side have to add up to 1, just as the a priori probabilities P(B_j) must do too (adding over the index j). So we can also write Bayes' theorem in the simpler form

P(B_j | A) ∝ P(A | B_j) P(B_j),

where the symbol ∝ means proportional to, and in this case, the proportionality is to be understood as holding as j varies.
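The proportional form is exactly how one computes in practice. A Python sketch with made-up numbers, for three exclusive and exhaustive causes B_1, B_2, B_3 of the evidence A:

```python
prior = [0.5, 0.3, 0.2]        # P(B_j), adding up to 1 (invented numbers)
likelihood = [0.1, 0.4, 0.8]   # P(A | B_j) (also invented)

unnormalised = [pb * la for pb, la in zip(prior, likelihood)]
normalizing_factor = sum(unnormalised)   # the "complicated mess", i.e. P(A)
posterior = [u / normalizing_factor for u in unnormalised]
print(posterior, sum(posterior))         # P(B_j | A), adding up to 1
```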

Joint (simultaneous) mass function

If X and Y are two random variables which both can only take on values from some discrete sets, e.g. whole numbers, then we conventionally write expressions like p(x, y) to stand for the joint probability mass function of X and Y,

p(x, y) = P(X = x and Y = y).

Bayes' theorem for discrete random variables

Since X = x and Y = y are events, we already know what we mean by the probability that X = x given that Y = y, namely

P(X = x | Y = y) = P(X = x and Y = y) / P(Y = y) = p(x, y) / p(y).

The last expression here is a bit dangerous. You know when I write p(y) that I mean the mass function of the random variable Y evaluated at the point y. But what is, for instance, p(3)?

Keeping the conditioning by Y = y fixed, but varying x, one can read this formula as defining the conditional mass function of X given Y, more precisely, given Y = y. If you prefer your notation to be pretty explicit, you might like to write p_{X|Y}(x|y) for this quantity. If you are lazy or like to invite danger, you might write just p(x|y).

Bayes' theorem can be applied to the situation where the events (remember, they should be exclusive and exhaustive) are all the different events X = x as x varies; and at the same time, we could take the event A to be the event Y = y, for one specific value of y. Applying the notational short cuts we find the formula

p(x|y) = p(y|x) p(x) / p(y)

or effectively, and easier to read and remember and understand,

p(x|y) ∝ p(y|x) p(x),

where the proportionality is as x varies, while y is kept fixed. When you add over all x, for fixed y, the conditional probabilities p(x|y) must add up to 1, and this fact can be used to simplify computations (drop complicated but constant factors) and determine the proportionality constant afterwards.
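That normalize-at-the-end trick, as a Python sketch (invented prior p(x) and invented likelihood p(y|x), evaluated at the one fixed observed y):

```python
p_x = {0: 0.2, 1: 0.5, 2: 0.3}           # prior mass function p(x), invented
p_y_given_x = {0: 0.9, 1: 0.5, 2: 0.1}   # p(y|x) at the observed, fixed y

unnorm = {x: p_y_given_x[x] * p_x[x] for x in p_x}   # p(y|x) p(x), x varying
Z = sum(unnorm.values())                             # proportionality constant
p_x_given_y = {x: u / Z for x, u in unnorm.items()}
print(p_x_given_y, sum(p_x_given_y.values()))        # p(x|y), adds up to 1
```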

Joint probability density function

With the usual short-cut notations, we write f(x,y) for the joint probability density function of a pair of random variables X and Y both of which take values continuously over some intervals. It has units "probability per unit X and per unit Y". For instance, if X is the height and Y is the weight of a random wikipedia editor, then their joint probability density would have the units probability per cm and per gram if we measure height and weight in centimeters and grams. You can find the probability that your random wikipedia editor has height and weight within some region of the plane by integrating that density over the region.

Bayes' theorem for continuous random variables

There is some trickiness involved in defining the conditional distribution of a random variable X given that another random variable Y takes the value y, say, when Y is continuously distributed: individual exact values have probability zero; but the probability that Y falls within a specific range of values can be found by integrating a probability density over that range. Actually the trickiness is easy to explain: we want to be able to compute joint probabilities of events concerning X and Y by first computing said probability given each specific value of Y separately, using "the" conditional probability distribution of X given Y = y, and then averaging over y taken from the original (so-called, marginal) probability distribution of Y. It turns out that this can indeed be done, and moreover in essentially only one way; the quotation marks around "the" are superfluous.

This shows how devious mathematicians often are: they simply define things to be what they have to be, in order to make the theorems which they want to be true, to be true by definition. And they do this in such a smart way that nobody realises they are not communicating with the gods, but just playing tricks on you.

Recall that we write f(x,y) for the joint probability density function of a pair of random variables X and Y - it has units "probability per unit X and per unit Y". And again we get Bayes' theorem, expressed right away in proportionality terms,

f(x|y) ∝ f(y|x) f(x),

where the proportionality is as x varies, while y is kept fixed. When you integrate over all x, for fixed y, the conditional probability density f(x|y) must integrate to 1, and this fact can again be used to simplify computations (drop complicated but constant factors) and determine the proportionality constant afterwards.
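The same trick works numerically on a grid; a sketch in Python under an entirely invented model (X uniform on (0,1); given X = x, Y normal with mean x and standard deviation 0.1):

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 1001)   # a grid of possible values x
dx = xs[1] - xs[0]

prior = np.ones_like(xs)                              # f(x) = 1 on (0, 1)
y_obs = 0.7                                           # the fixed observed y
likelihood = np.exp(-0.5 * ((y_obs - xs) / 0.1)**2)   # f(y|x), constants dropped

unnorm = likelihood * prior                # f(y|x) f(x), as x varies
posterior = unnorm / (unnorm.sum() * dx)   # force the integral over x to be 1
print((posterior * dx).sum())              # 1.0, as a density must
print(xs[np.argmax(posterior)])            # posterior mode, near 0.7
```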

Wonderfully, the same formulas hold for a pair of random variables, one of which is discrete and one of which is continuous, and both ways round, provided we keep in our minds what must be a probability density, and what is actually a probability. This gives more good reasons to use the same symbol, p or f as a matter of taste, for both probability densities and probability mass functions, whether conditional, simultaneous (joint), or marginal.

Example: What Bayes did

The smartness of Bayes was not to discover Bayes' theorem, which is completely obvious, but to figure out how it works in a situation with continuous random variables, or with continuous and discrete random variables. For this purpose he needed to use state-of-the-art calculus (differentiation, integration), which had only recently been invented by Newton and / or Leibniz (roughly depending on whether you were a Brit or someone from the Continent).

He considered someone throwing ten white billiard balls and one black billiard ball independently of one another onto a rectangular billiard table, in such a way that each ball's position measured from the left end of the table, in units of one table length, could be thought of as a random variable U uniformly distributed between zero and one. In particular, initially the black ball's position has this probability distribution. An equation for that density is f(u)=1 where u is the position of the black ball. (That is the notation of the previous section except U takes the place of X.)

But suppose I now tell you that 7 white balls ended up to the left of the black ball, and 3 to the right? Now it seems likely that the black ball is somewhere roughly 7 tenths along the table. Can we be more precise? What would happen as we threw more and more white balls onto the table? (taking care all the time that the balls don't hit one another yet end up completely independently of one another each at a completely random position on the table). Mathematicians are good at imagining completely impossible things! (Even 7 of them, and even before breakfast).

Thomas figured out that the conditional probability density of the position of the black ball, or f(u | Y=7), must be proportional to u^7 (1-u)^3, since the probability that any particular 7 balls fall to the left of position u and 3 to the right is u^7 (1-u)^3. The number of ways in which this can happen is not interesting, for present purposes, saving us quite a few headaches. A bit of the rocket-science of his day, calculus, told him that the conditional probability density of the position of the black ball must be 1320 u^7 (1-u)^3 (the constant 1320 being one over the integral of u^7 (1-u)^3, so that the density integrates to 1). From this together with the super-rocket-science numerical integration methods of his teachers at the University of Edinburgh, he could go on to compute things like the probability that the black ball was in the rightmost quarter of the table, or P(U > 3/4 | Y). And hence make wily bets on that question.
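One can check Thomas' answer by brute force; a Python sketch of the billiard table (10,000 kept trials being my own arbitrary choice):

```python
import random

# Throw 1 black and 10 white balls, all uniform on (0, 1); keep the black
# ball's position u whenever exactly 7 whites land to its left. The kept
# values should follow the density 1320 u^7 (1-u)^3, whose mean is 8/12 = 2/3.
kept = []
while len(kept) < 10_000:
    u = random.random()
    whites_left = sum(random.random() < u for _ in range(10))
    if whites_left == 7:
        kept.append(u)

print(sum(kept) / len(kept))                    # close to 2/3
print(sum(u > 0.75 for u in kept) / len(kept))  # estimate of P(U > 3/4 | Y=7)
```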

His real interest was probably to use the same techniques to prove that the probability that God exists, given all the wonderful things we know about the world, must be, for all practical purposes, equal to one. After all, his business was being a Right Reverend.

Example: Monty Hall problem

In the Monty Hall problem we have a contestant, who chooses a door; and a host, who prior to the show has hidden a car behind one of the doors, and later opens a door different from the one chosen by the player, revealing a goat. There are three doors.

It's convenient to represent these choices of actions by random variables. And convenient to give them mnemonic names. A convenient choice could be C for the door hiding the Car, S for the door initially Selected by the player, H for the door opened by the Host revealing a goat. We have some capital letters here from earlier on in the alphabet than usual, but so what, comfort and convenience go first. It might have been fun to call the door chosen by the player P but then we would have to use something more complicated than P for probabilities, to avoid mixup. Let's just remember these are all random variables, none of them are events.

We're told that the player chose Door 1, and the host opened Door 3. Let's use Bayes' theorem to figure out the chance that the car is behind the other door, Door 2.

Bayes' theorem can be rewritten in yet another way, called Bayes' rule, through the use of the notion of odds. Odds are just ratios of probabilities. Just as the proportionality versions exhibited above emphasize, by dividing two "versions" of Bayes' theorem, the complicated mess in the denominator vanishes. We don't need to know it, since we know that probabilities add to 1, anyway. The constant which we are ignoring is called the normalizing factor: it's precisely what you need to divide by, to force the bl***y things to add up to 1. Suppose I'm interested in the relative probabilities of two events A and A', before and after getting the information that an event B has happened. Bayes' rule is

P(A|B) / P(A'|B) = ( P(A) / P(A') ) × ( P(B|A) / P(B|A') ).

The posterior odds for hypotheses (or scenarios) A versus A', given evidence B, is equal to their prior odds times the likelihood ratio or Bayes factor of the evidence ... a complicated name for the second term on the right hand side: the ratio of the chances of evidence B under each of the two hypotheses A and A'.

It couldn't be simpler. The prior odds must be important. The likelihood ratio must be important. And all you have to do is multiply them! You do have to be smart to come up with the idea of looking at ratios, to get this simplicity. I don't know who should be credited with that, I suspect this was post Bayes, post Laplace ... probably somewhere in the 20th century. Probably an Anglo-Saxon with a gambling addiction, since I don't think any other language has the concept odds. (Actually, it seems this was maybe Cournot, a French economist of the mid 19th century, from whom the great French mathematician Borel and others also got it).

Let A be the event C=1 and A' be the event C=2. Is the car behind Door 1 or Door 2? This is what interests the player right now, after he chose Door 1, and he's seen the host reveal a goat behind Door 3.

Let B be the event S=1 and H=3; the player chose Door 1 and the host opened Door 3.

To get further we have to start making some assumptions. It seems pretty reasonable to suppose that the choice of door of the player is independent of the location of the car. Most people find it reasonable to suppose that initially, the car is equally likely to be behind each of the three doors. This tells us that the prior odds on the car being behind Door 1 versus Door 2 are 1:1. We only have to figure out the ratio of P(B|A) to P(B|A'); in words, the ratio of the chances that S=1 and H=3 given C=1, and that S=1 and H=3 given C=2.

We think of these (conditional) chances, under two different scenarios, as being built up in two steps (chain rule!). In each scenario the location of the Car is fixed: either Door 1, or Door 2 (the third scenario is not interesting, in view of what is going to happen later). First step, the player Selects his door. We didn't talk about how he did this, but it doesn't matter. His choice of door is independent of the location of the car, so wherever the car is hidden, the chance he chooses Door 1 is the same, and the ratio of two equal probabilities is 1. This leaves the second and last step: the Host chooses his door to open, given the known location of the car (Door 1 or Door 2) and given that the player has chosen Door 1. When the car is behind Door 2 the host has no choice, but when the car is behind Door 1, the host does have a choice. The chance he opens Door 3 in this situation is by definition P(H=3|S=C=1). Let's call this chance q. The Bayes factor, for the car being behind Door 1 versus its being behind Door 2, is therefore q:1, and the posterior odds on the car being behind Door 1 against being behind Door 2 are therefore q:1 as well. The player had better switch, since whatever that probability q might be, it's not more than 1, so whether the player knows it or not, it's never bad to switch.

Many people are also happy to suppose q=1/2. They are then also very happy to know the posterior odds on the car being behind their initially chosen door, and to know that they are 1/2:1, or 1:2. They might notice that this is the same as their prior odds of initially having picked a goat-door. And then they might wonder if they could have figured out in advance that the posterior odds would be the same, i.e., that the specific choice of door which they made, and the specific choice of door which the host made (as far as he could choose), are completely irrelevant. A good explanation is symmetry.
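And here is the whole Monty Hall computation done by brute-force simulation, a Python sketch in which the host's bias q = P(H=3 | S=1, C=1) can be dialled at will:

```python
import random

def posterior_odds(q, trials=100_000):
    """Estimate the odds car-behind-Door-1 : car-behind-Door-2, given that
    the player chose Door 1 (S=1) and the host opened Door 3 (H=3).
    q = P(H=3 | S=1, C=1), the host's bias when he has a free choice."""
    car1 = car2 = 0
    for _ in range(trials):
        c = random.randint(1, 3)   # car hidden uniformly; player picks Door 1
        if c == 1:
            h = 3 if random.random() < q else 2   # host may choose
        elif c == 2:
            h = 3                                  # host is forced
        else:
            h = 2                                  # car behind Door 3: forced
        if h == 3:                 # condition on the evidence B
            car1 += (c == 1)
            car2 += (c == 2)
    return car1 / car2

print(posterior_odds(0.5))  # about 0.5: odds 1:2, switching doubles your chances
print(posterior_odds(1.0))  # about 1.0: odds 1:1, switching still never hurts
```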

Great expectations

Expectation value

Here I want to write just a little about the mathematicians' E(X) and the physicists' <x>.

The mathematical expectation of a random variable X can be computed from its mass function or density by multiplying p(x) or f(x) respectively by x, and then summing or integrating as appropriate, over all values of x. Ordinary folk are told that this is the definition; sophisticated mathematicians know that these formulae are theorems, which are proved by working out special cases of the definition, which we sophisticated mathematicians like to keep secret from ordinary folk. To mention an amusing example: the result that E(g(X)) can be calculated by integrating or summing g(x) times density or mass function, as well as by computing the mass function or density of the random variable Y = g(X), and then computing the expectation value of Y, is sometimes disrespectfully called the law of the unconscious statistician, since statisticians know by instinct that this is true, and use it all the time, without realising that they are applying a not entirely trivial mathematical theorem.
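The unconscious statistician can be caught in the act in a few lines of Python (one fair die and g(x) = x squared being my own choice of example):

```python
from collections import Counter
from fractions import Fraction

p = {x: Fraction(1, 6) for x in range(1, 7)}   # mass function of X, a fair die

lotus = sum(x**2 * px for x, px in p.items())  # sum of g(x) p(x), g(x) = x**2

q = Counter()                                  # mass function of Y = X**2
for x, px in p.items():
    q[x**2] += px
direct = sum(y * qy for y, qy in q.items())    # E(Y), computed from q

print(lotus, direct, lotus == direct)          # 91/6 both ways
```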

Physicists were taking expectation values, called ensemble averages, long ago, in statistical mechanics, which has nothing to do with statistics, but a lot to do with probability theory, and which goes back to the nineteenth century. Statistical mechanics is about seeing gases or fluids as a whole load of little particles all doing pretty mechanistic, Newtonian, things, but at the macroscopic level having properties like temperature and pressure.

Actually, when physicists use the word statistics they usually mean what a mathematician would mean with the word probability.

Physicists are usually not too interested in making subtle notational distinctions so some property of a particle is called x and its average over a whole lot of particles is called <x>.

In probability theory, the expectation used to be always called the mathematical expectation to distinguish it from moral expectation. I guess you want or hope for the moral expectation, but you get the mathematical expectation (at least, on average). The Russian literature often used to write M(X) for E(X), to emphasize that it was Mathematical expectation and not just any expectation. (But my Russian friends tell me this is not true. They didn't have morals anyway, M stands for mathematical expectation, full stop).

Conditional Expectations

A conditional expectation is just an expectation with respect to a conditional probability distribution. E(X|Y=y) stands for the expectation value of the random variable X, computed with respect to its conditional distribution given that Y=y. In short, the conditional expectation of X given Y=y. It's a number.

Now E(X|Y=y) depends on y; it is a function of y, just like any other function such as square, absolute value, logarithm, whatever. One can apply a function to a random variable. For instance, if X is a positive random variable then log(X) is another random variable, related to the first by taking the value log(x) whenever X takes the value x. Now comes an important and tricky notation: I define E(X|Y) to be the same function of Y as E(X|Y=y) is of y. This new thing, called the conditional expectation of X given Y, is a random variable. In a small number of words, it's the expected value of X given the value actually taken by Y, whatever that is.

Then comes a beautiful theorem: E(E(X|Y))=E(X). One can compute the expectation value of X in two steps: firstly by averaging the values x of X with respect to the conditional probabilities of X=x given Y=y, for each fixed y, separately; and then secondly averaging this result over y with respect to the marginal probabilities of Y=y.

Beautiful, easy and tricky. For instance, E(X|Y) is not the same thing as E(X|Y=Y), even though it seems that I told you that it was defined by computing E(X|Y=y) and then substituting Y for y. But E(X|Y) is a random variable and simultaneously a certain function of Y. Regarding E(X|Y=Y), since Y=Y is certain, E(X|Y=Y)=E(X), a certain number.
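The theorem can be verified on a toy example; a Python sketch with an invented joint mass function p(x, y) of two binary random variables:

```python
from fractions import Fraction

p = {(0, 0): Fraction(1, 4), (1, 0): Fraction(1, 4),   # invented joint
     (0, 1): Fraction(1, 8), (1, 1): Fraction(3, 8)}   # mass function p(x, y)

p_y = {}                                               # marginal of Y
for (x, y), pxy in p.items():
    p_y[y] = p_y.get(y, 0) + pxy

def E_X_given(y):                                      # E(X | Y=y), a number
    return sum(x * pxy for (x, yy), pxy in p.items() if yy == y) / p_y[y]

tower = sum(E_X_given(y) * py for y, py in p_y.items())  # E(E(X|Y))
direct = sum(x * pxy for (x, y), pxy in p.items())       # E(X)
print(tower, direct, tower == direct)                    # 5/8 both ways
```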

Beyond Probability

Then the statisticians hopelessly complicated things

(Frequentist/classical) Statisticians have an overpowering need to distinguish between random variables and the parameters of their distributions. And they also need to introduce estimators, which are recipes, and estimates, which are the results of applying the recipe to some data. We also want to talk about sets of possible parameter values. So along come θ, θ̂, Θ, ... and it gets impossible to maintain the clean probabilist's conventions. Sophisticated statisticians will keep notation simple by ignoring a lot of fine distinctions, which are hopefully clear from the context ... if you belong to the club.

Then we get confidence intervals and all that stuff, and more confusion still.

The whole essay was supposed to be about elementary probability notation, not about statistics, so I think I have a good excuse to stop this section here.

Then the Bayesian statisticians hopelessly oversimplified everything

Now everything is a random variable. OMG.

Seriously though, maybe some Bayesian statistician can take over the story here. (Some of my best friends are Bayesian, and some of my best friends are physicists too! That's why I like to tease them).

Higher dimensions, more conflicts

At some point we need to start talking about collections of random variables. These are conventionally, and somewhat mysteriously, called random vectors in elementary statistics texts, and a new riot of typographical distinctions comes in. E.g. bold X for random vectors with components X_1, ..., X_n. Alternatively one decorates them with arrows on top.

To begin with, however, the notion of a vector is unnecessary; to begin with we are mostly just collecting random variables together in a list. It is even more unnecessary to suppose that the vector is a column vector (which is the usual convention, but which then necessitates using the transpose symbol all over the place when writing out the vector in a horizontal line of text).

The point of a vector (see the pun?) is that you can add them and multiply them by scalars, and more such stuff.

Random matrices become of course a typographical nightmare.

My opinion is to start off making distinctions when it is important to see them, but later to use lazy notation, which means that any embellishments which are superfluous for understanding, since the immediate context makes them obvious, should be dropped. This corresponds to the "higher" point of view that random variables are just random things taking values in whatever "space" you care to imagine. They just have to be well-behaved enough that one can meaningfully talk about probabilities of "events" involving them (measure theory!).

So when lecturing, after a week or so of adding arrows on top of anything which could be called a vector and/or a list, I start forgetting to draw the arrows anymore. Anything random is just written X.

And then we go quantum

Now there is another level of discussion: the objects which quantum physicists call observables, most of which are not observable at all. They are represented by certain kinds of operators on complex Hilbert space. When you do try to observe an observable you see something random, and you disturb the system you are interfering with. The values you can observe are actually eigenvalues (or generalized eigenvalues in a certain sense) of those operators. Being random, you might want to represent them with random variables and do probability with them, or even statistics.

Things are complicated because you can add and multiply operators ... while you can also add and multiply results of measurements on physical systems. Things are especially complicated at the conceptual level, precisely because there is a certain degree of compatibility between the world of the operators, and the world of the random variables. This led another hero, John von Neumann, to believe he had proved a no-go theorem for hidden variables, and all the physicists believed him, though it was patently obvious (for instance to yet another hero, John Bell) that he had got confused by the deceptive similarity of the + sign when located between two random variables, and when located between two Hilbert space operators.

But perhaps we should not delve deeper into this here. Though I would like to be disrespectful and mention the law of the unconscious quantum physicist: the probability distribution of measurements of the observable X^2 is the same as the probability distribution of the squares of measurements of the observable X.

To further complicate matters, quantum physicists often use notations like Â, B̂ for their observables (remember, these are actually operators on a Hilbert space) corresponding to classical physical observables A, B; and hence <ÂB̂> stands for a somewhat theoretical mean value of ÂB̂; I say somewhat theoretical, since you might have to do quite a wild thought experiment to actually measure that observable on separate quantum particles.

The functional analysts generalize beyond recognition

Since the expectation E(X) depends on the probability measure P(.), and we always know whether we are talking about events or random variables, we can spare a letter. Some people write E(X) and E(A). Others like P(A) and P(X). The brackets are also superfluous in these cases. This gets really fun when we start transforming probability distributions by transforming the random variables: f(P) instead of P ∘ f⁻¹, the distribution of f under P. fP means P transformed by f. Pf means the P-expectation of the random variable f.

One can go in the opposite direction and start adding notation which till now was being silently skipped. For instance P({ω : 3 < X(ω) < 4}) instead of just P(3<X<4).

But more of this does not belong in this essay.

Richard Gill (talk) 18:53, 21 March 2011 (UTC)