I need a break from exasperation. These days I get too easily exasperated at stupidity - and much of it is my own stupidity when I haven’t understood something clearly enough. And sometimes the snark and poor-taste humour isn’t quite enough to offset things. So it’s time to put your feet up and let me guide you into a blissful and relaxing coma as I write another more technical piece.
We’re all very familiar with the mantra correlation is not causation, which is usually used to dismiss some apparent association (correlation) between two disparate data sets. Two classic covid examples of where association has been questioned are:
The claim that lockdowns caused cases (and deaths) to fall based on the fact that the curves “turned around” after lockdowns had been introduced somewhere.
The claim that certain ~~warp speed~~ warped pharmaceutical products were dangerous based on the massive uplift in reports to monitoring systems like VAERS.
But what IS correlation?
You can’t be interested in the fundamental issues with quantum mechanics, as I am, without having to deal with the notion of correlation. In QM you can have systems that are more highly correlated than is possible if these were systems which operated under the laws of classical physics. The usual term to describe these quantum-correlated systems is entanglement.
It’s more than just theoretical because you can use this ‘extra’ correlation to do stuff, and there are tons of applications (most of which have only been demonstrated in very difficult lab experiments).
Correlation refers to the ability to be able to ‘predict’ the properties of one variable from knowing the properties of another. Let’s sharpen up that “refers to” there a bit.
Suppose we have a device to measure temperature, say, and another that records the lies told by politicians about climate change. You’d find that an increase in the value of the temperature measurements was associated with an increase in the number of climate lies told by politicians. There’s a clear correlation.
In this particular case it’s clear that the causative direction is from temperature to lies - although it’s also possible that the hot air emitted by politicians is causing the temperature to increase.
So let’s try to pin these notions down a bit by thinking of flipping 2 coins. Let’s suppose we have a nickel and a dime and we flip the nickel first.
We have four possible outcomes (heads or tails each) for our two coin flips:
(H,H)
(T,T)
(H,T)
(T,H)
Not the most exciting example in the world but, curiously, it turns out to be quite important1.
You can see that with the assumption of fair coins you get one of these 4 outcomes from the 2 flips with equal likelihood. In other words, you’ll get the result (H,T), for example, with a probability of 1/4.
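If you'd like to see that in code rather than take my word for it, here's a minimal Python sketch (the variable names are my own) enumerating the four outcomes:

```python
from itertools import product

# Two fair coins: enumerate every (nickel, dime) outcome.
outcomes = list(product("HT", repeat=2))

# Each flip is fair, so each joint outcome has probability 1/2 * 1/2 = 1/4.
probs = {outcome: 0.5 * 0.5 for outcome in outcomes}

for outcome, p in probs.items():
    print(outcome, p)  # each of the 4 outcomes comes out at 0.25
```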
But what if some pesky engineering student had hidden devices in these coins such that getting H on the first flip made it more likely you’d get H on the 2nd flip, and similarly for the result T?
What the student has done here is to build in some mechanism that ensures some correlation between the results obtained.
How do we ‘deal’ with that? How do we analyse it to understand the effect a bit better?
It’s Probably Probabilistic
Probability is notoriously difficult to notate well2. We usually use a kind of shorthand that can be a bit confusing. Let’s write the notation to begin with in a more complete way. We’re going to use P to mean ‘probability’
The words “what is the probability that we obtain H with the nickel and T with the dime?” can be written in a longer notation as
P(result of nickel throw = H, result of dime throw = T)
Which, of course, we’re going to shorten a bit to
P(H,T)
What’s that comma doing in there? The comma, small and insignificant though it might seem, is standing in for the word AND.
If we want to talk about all of the results we use the same notation - in this case the P thing would be called a joint probability distribution and we might write it as P(n,d). The n here would stand for “the result obtained with the nickel”, for example.
P(H,T) refers to a specific result3
P(n,d) refers to all of the 4 possible results
I did mention that the notation can be a bit of a bugger, didn’t I?
Assuming you’re all still awake at the back there, let’s press on.
Before I do, however, I’m going to throw some mud about. Strictly speaking, correlation refers to a specific technical quantity, and there’s a difference between dependence and correlation (as defined in this technical way).
Correlation, technically, refers to a statistical property - which is why I’ve started rambling on about coin flips. However, when it’s used in a lot of popular discussions what people are really talking about is the notion of dependence.
I can, for example, write the equation for a circle of radius r (centred at the origin) as follows

x² + y² = r²
The x and y variables here are not “free” to have any value - they are dependent variables. If you fix on a particular value for y, that will constrain the value of x you have.
I’m going to be a little bit fast and loose and adopt the more colloquial understanding - that correlation and dependence are, essentially, synonymous. For the majority of instances with statistical quantities, this is fine.
Nickels and Dimes
So let’s take a closer look at our nickel and dime - and we’re going to assume interference by that pesky engineering student. The hidden mechanism has been set such that if a head is obtained with the nickel then there is a 3/4 probability of getting a head with the dime, and similarly with the result ‘tail’ for the nickel.
One of the best ways to approach simple probability questions is to draw what’s known as a probability ‘tree’. Here’s what it looks like for our nickel and dime
The nickel is flipped and we get H or T with equal probability of 1/2 (we assume the nickel is ‘fair’). The mechanism then kicks in for the flip of the dime and if we got a nickel head we have a probability of 3/4 of getting a dime head (obviously we have a 1/4 probability of getting a dime tail after getting the nickel head).
What we’d like are the probabilities P(H,H), P(H,T), P(T,H) and P(T,T).
The idea of the tree is to end up with all of the possible outcomes as the branches.
These diagrams make it possible to almost ‘read off’ the probabilities we need. The simple rule is that we multiply along the branches.
So, if we’ve flipped the nickel and obtained a head (the probability of this happening is 1/2) and THEN we flip the dime - the probability of getting a head with the dime is now 3/4. So the overall probability is going to be

P(H,H) = 1/2 × 3/4 = 3/8
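If you prefer code to trees, multiplying along the branches looks like this in Python (a sketch; the dictionary names are just my own labels):

```python
# Multiply along the branches of the probability tree.
P_nickel = {"H": 0.5, "T": 0.5}          # the nickel is fair
P_dime_given_nickel = {                   # the hidden mechanism in the dime
    "H": {"H": 0.75, "T": 0.25},          # after a nickel head
    "T": {"H": 0.25, "T": 0.75},          # after a nickel tail
}

joint = {
    (n, d): P_nickel[n] * P_dime_given_nickel[n][d]
    for n in "HT" for d in "HT"
}

for outcome, p in sorted(joint.items()):
    print(outcome, p)  # P(H,H)=0.375, P(H,T)=0.125, P(T,H)=0.125, P(T,T)=0.375
```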
The second set of probabilities in the diagram (those for the dime) are conditional on the result obtained for the nickel.
Notice that we’re not able to completely predict the result of the dime flip. Perhaps the best way to think about this is in terms of placing a bet; if you were going to place a bet on the result of the dime flip, knowing the nickel flip, where would you put your money?
The variables we have here are n and d (which stand for the result of the flip for the nickel and dime, respectively). The values that these variables can each have are just H and T.
We’d say that the variables n and d are correlated.
Conditional Probability
The coin example has the notion of things being conditional explicitly built into it. The second flip is not independent of the first flip, but conditional upon the result obtained with that first flip.
Remember that comma in our probability notation? We wrote P(H,H) to mean the probability of nickel head AND dime head. The notation we use when we have things being conditional is to use a vertical line | so that:
The probability of getting H with the dime given that the nickel flip was H = P(H|H)
In the diagram above, the second set of probabilities I’ve written down are the following, and they are conditional probabilities
P(H|H) = 3/4
P(H|T) = 1/4
P(T|H) = 1/4
P(T|T) = 3/4
Try not to be too bamboozled by the visual appearance here - we can think of the math notation as just being ‘lazy’. I don’t want to have to keep writing out a phrase like “the probability the dime flip was heads given that the nickel flip was tails”.
We say that the second result is conditioned upon the first result - and this gives us our clue about how to (technically) approach the notion of correlation4.
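Going the other way - if you already had the joint probabilities, you could recover these conditional probabilities by dividing out the nickel’s probability, since P(d|n) = P(n,d)/P(n). A quick Python sketch (my own variable names):

```python
# Recover conditional probabilities P(d|n) = P(n,d) / P(n) from the joint.
joint = {("H", "H"): 0.375, ("H", "T"): 0.125,
         ("T", "H"): 0.125, ("T", "T"): 0.375}

# Marginal probability of each nickel result.
P_n = {n: sum(p for (ni, _), p in joint.items() if ni == n) for n in "HT"}

# conditional[(d, n)] is P(dime = d | nickel = n).
conditional = {(d, n): joint[(n, d)] / P_n[n] for n in "HT" for d in "HT"}

print(conditional[("H", "H")])  # P(H|H) = 0.75
print(conditional[("H", "T")])  # P(H|T) = 0.25
```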
What did we do to get our final joint probabilities? We multiplied along the branches.
What we’ve done, in math terms, is to apply the equation

P(n,d) = P(d|n) × P(n)
This article is aimed at those who gave up on maths - I know that for quite a few of you you’ll already know and understand everything I’m writing - but if you’re one of those who did give up on the maths then don’t let this equation give you the heebie-jeebies.
You can, I hope, see that if the probability of the first flip being H was 1/2 and the probability of the second being H was 3/4, then the overall probability of H and H is going to be just the product of these two things.
But that’s all the formula above is saying - just that.
Yes, Yes, Yes, but how do I tell when stuff is correlated?
The conditional probability is really the key here. Look at P(d|n) and think about what would happen if there were NO correlation between the flips. The result of the second flip would be indifferent to the result we got with the first - it would have nothing to do with the first flip at all.
In this case, with no correlation, we’d be able to write

P(d|n) = P(d)
What we’re saying here is that the results of the dime throw are not conditional upon the results of the nickel throw.
This means that, when the results are not correlated, we’d be able to write

P(n,d) = P(n) × P(d)
This means that if we know the probability distributions involved we can check to see if the results are going to be correlated in some way.
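That check can be sketched in a few lines of Python (the function name is my own) by comparing the joint distribution with the product of its marginals:

```python
def is_independent(joint, tol=1e-9):
    """Check whether P(n,d) factorises as P(n) * P(d)."""
    P_n = {n: sum(p for (ni, _), p in joint.items() if ni == n) for n in "HT"}
    P_d = {d: sum(p for (_, di), p in joint.items() if di == d) for d in "HT"}
    return all(abs(joint[(n, d)] - P_n[n] * P_d[d]) < tol
               for n in "HT" for d in "HT")

fair = {(n, d): 0.25 for n in "HT" for d in "HT"}
engineered = {("H", "H"): 0.375, ("H", "T"): 0.125,
              ("T", "H"): 0.125, ("T", "T"): 0.375}

print(is_independent(fair))        # True  - no correlation
print(is_independent(engineered))  # False - the hidden mechanism shows up
```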
Simples.
Maybe. But what can we infer if we do find some correlation between variables?
This is usually where the correlation is not causation mantra gets applied. Just because we see some association (correlation) between two sets of data it does not mean that there is a causative relationship. But, equally, we could turn it round a bit and ask whether causation exists in the absence of any association (or correlation)?
A Conditional Aside
Once you have 2 or more variables in a problem involving statistics, there are various probabilities you can look at. We’ve seen that we can have joint probabilities and we can also have conditional probabilities. The question naturally arises as to which one is the “right” one to examine in a given situation?
This depends on the probability question you’re trying to answer.
Let’s take the issue of cops shooting black people and treat it statistically. What would be the variables and their values? We might have the variables g (for ‘gun’ with the values 1 = fired and 0 = not fired), c (for colour of the suspect and let’s just go with 1 = black and 0 = white) and a (for arrest situation with 1 = suspect being arrested and 0 = no arrest in process). This might be one way we could approach the analysis.
When you’re trying to answer the question whether cops treat black people differently to white people with respect to shooting them, then what is the correct probability to examine? With 3 possible variables there are quite a few different probabilities we could have.
There’s the full joint probability P(g,c,a). There are the marginal joint probabilities P(g,c), P(g,a), P(c,a). And then there are a host of conditional probabilities that can be constructed like P(g, c|a) or P(g|c,a) or P(g|c) and so on. These are all “answering” different probability questions.
One pertinent question to ask would be the following :
Given an arrest is underway is there a correlation between skin colour of the suspect and the firing of the cop’s gun?
If we examine the statistics and find that

P(g|c,a) = P(g|a)
Then the variable c (skin colour) is playing no role in the firing of a gun during an arrest; it is not correlated with the result “gun fired”.
If this is the case, then the cops are showing no bias or favour based on skin colour in the event of an arrest situation (with respect to firing their guns).
But when you read some newspaper article claiming that black people are “more likely” to be shot by the cops - do you know which of the various probability distributions they’re talking about?
Until you do know which of the myriad possibilities they’re talking about, you cannot properly interpret the statement “more likely”.
The ‘variables’ and ‘values’ I’ve sketched above for the cop shooting situation need quite a bit of finessing - they’re not quite right and almost certainly an incomplete list with which to understand the situation - but hopefully you get the idea I’m trying to convey here.

For example, a correlation could be found, but it still wouldn’t necessarily mean the cops are acting unfairly or being discriminatory. You’d probably need to fine tune it with the type of arrest situation - just to give one possibility. You might also have to examine the decidedly politically incorrect notion that there might be differences between the behaviour of black and white people. This certainly needs to be “on the table” as a possibility to be examined (and understood if such differences are found).

But that would require objectivity - a difficult thing to strive for in our politically charged times. Being objective about things these days could easily lose you your job.
The idea here is that once we have more than one thing fluctuating and going all statistical on us, then there isn’t just one kind of probability involved. Which one you choose is going to depend on the specific probability question you’re trying to answer. And it’s easy to choose the wrong one here. Whenever you read any media article which contains the word probability (or phrases like “more likely”) then press the crap out of the big red caution button in your mind.
It’s All Uncertain
Another way to approach correlation, and it’s the one I probably like the best, is to think about uncertainty.
I don’t just mean any old “uncertainty” but the specific way we can define uncertainty in a technical way. This is a parameter that quantifies the ‘amount’ of uncertainty we have in a statistical quantity.
If it’s to be a good parameter of uncertainty we’d like it to have a couple of properties5
(a) - it’s got to be positive (how can you have less than zero uncertainty about something?)
(b) - if we have two completely independent (that is, not correlated at all) fluctuating things then the total amount of uncertainty we have is just the sum of the uncertainty for each individually.
It turns out that there is a unique mathematical function (the logarithm) that satisfies these sensible properties.
It’s this uniqueness with respect to the very sensible properties we require that makes me give this technical parameter a big thumbs up.
It was Claude Shannon who first started thinking about how to characterise uncertainty - and one of the big first applications was in understanding crypto. The idea here is that if you want to keep a message as secret as possible then you must ensure that you have maximised the uncertainty about the message given that you have the ciphertext. Or another way of phrasing this is to say that you want the attacker to be able to recover no information about the message given that he has the ciphertext. Shannon’s amazing insight and genius was to recognise the fundamental connection between uncertainty and information.
The importance of this uncertainty parameter is that it allowed us to turn things like “to be able to recover no information” into more than just wordplay; you could now mathematically analyse this stuff.
So how do we use this uncertainty parameter to characterize correlation?
We’re going to imagine sending Alice off to one room with the nickel and Bob off to another room with the dime (men, of course, should be paid more than women6).
Alice is going to flip the coin every time a red light flashes and record her results in sequence. Bob is going to do the same. The purpose of the light here is to ensure that the nickel is flipped first - because we’re using our engineered nickel and dime.
Alice is just going to get a whole set of results that look entirely like tossing a (fair) coin. So is Bob. To each of them individually it looks like they’re just flipping a normal (fair) coin.
When they meet afterwards they can compare the data and see that, actually, things were not quite as random as they first thought. They can now see their data is correlated. The correlation is not perfect - but it’s there and can be measured.
There is less uncertainty in the joint data than they would expect if they really had two completely independent coins (without this hidden mechanism).
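We can play Alice and Bob in simulation. Individually each stream of results looks like a fair coin, but comparing the streams side by side reveals the correlation (a sketch assuming the 3/4 matching mechanism described above; function and variable names are my own):

```python
import random

random.seed(42)  # reproducible runs

def engineered_session(trials):
    """Simulate Alice's nickel and Bob's dime with the hidden mechanism."""
    alice, bob = [], []
    for _ in range(trials):
        nickel = random.choice("HT")        # fair first flip
        if random.random() < 0.75:          # dime matches the nickel 3/4 of the time
            dime = nickel
        else:
            dime = "T" if nickel == "H" else "H"
        alice.append(nickel)
        bob.append(dime)
    return alice, bob

alice, bob = engineered_session(100_000)

alice_heads = alice.count("H") / len(alice)                      # ~0.5: looks fair alone
matches = sum(a == b for a, b in zip(alice, bob)) / len(alice)   # ~0.75: correlation
print(alice_heads, matches)
```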
The symbol that gets used to denote uncertainty is usually H. I don’t know why.
We have
Fair independent coins : total uncertainty = H(A) + H(B) 7
Engineered coins : total uncertainty is less than H(A) + H(B)
The ‘amount’ of correlation we have can then be seen to be just the difference between the uncertainty of the individual data (Alice and Bob alone in their rooms) and the uncertainty of the joint data (Alice and Bob compare notes).
The ‘information content’ of the correlation can then be defined to be

I(A,B) = H(A) + H(B) - H(A,B)
where H(A,B) is the uncertainty in the joint data.
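Putting numbers on our engineered coins - a Python sketch using the standard Shannon entropy in bits (the variable names are my own):

```python
from math import log2

def H(probs):
    """Shannon uncertainty (entropy) in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

joint = {("H", "H"): 0.375, ("H", "T"): 0.125,
         ("T", "H"): 0.125, ("T", "T"): 0.375}

# Marginals: individually Alice and Bob each see a fair coin.
P_A = [sum(p for (a, _), p in joint.items() if a == s) for s in "HT"]
P_B = [sum(p for (_, b), p in joint.items() if b == s) for s in "HT"]

H_A, H_B = H(P_A), H(P_B)        # 1 bit each
H_AB = H(joint.values())          # about 1.811 bits
info = H_A + H_B - H_AB           # about 0.189 bits of correlation
print(H_A, H_B, H_AB, info)
```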
This is a really nice way of approaching correlation - thinking about its information content.
One of the really fascinating things is that this fundamental uncertainty parameter, H, turns out to be (mathematically) the same as the entropy that is useful in physics.
It’s another reason why I like this approach to correlation.
It was ‘im wot dun it - he caused it
Of course, none of this really addresses the question that everyone would really like to answer. Two things appear to be correlated (an association is found). Did one “cause” the other?
I’ve talked about the alarming rise in childhood autism before. It’s gone a bit off scale from instances like 1 in 10,000 to 1 in 36 in just a few decades. Those numbers, if true, represent a crazy big jump - and hard to fully ‘explain away’ by mumbling something about being more aware or better diagnoses.
Some people have suggested that the heavy childhood vaccine schedule, which has been significantly ramped up over the same time period, might be responsible. That’s a very plausible possibility.
Criticizing vaccines is a bit like criticizing Muhammad - some people get all a bit emotional over it.
One common argument you’ll hear against the vaccine/autism hypothesis is that correlation does not imply causation. This is true. And to support this argument they’ll show that the rise in organic food consumption has shown a similar increasing trend over the same time period. Look, they’ll say, the consumption of organic food is correlated with the rise in autism.
You could probably find any number of things which have increased over the same time period - the number of people identifying as trans might be one. But here it’s interesting because, as we know, there is a higher proportion of autism in the trans-identifying group than in the general population. Which way is the potential causation going here?
The point is that for any given trend you could probably find umpteen unrelated things which have a similar trend. So what? How does that invalidate the notion that whatever is causing this rise in autism will itself show an increasing trend?
In other words, there will be something that is creating this rising trend - and that will be correlated with the autism data. There will be a gazillion, unrelated, things that are irrelevant which also possess a similar trend line - but one (or maybe a few) things will be very relevant - and show this correlation.
The problem with the “correlation does not equal causation” argument is that it tends to de-emphasize that something will be (relevantly) correlated with the data; there’s always a reason why the data is what it is.
I don’t know what’s causing this explosion of childhood autism - it’s a very real problem (as even the most devoted adherents of the holy church of vaccinology will admit). It will become, in time, an insane economic burden as the very necessary caregivers (family) age.
But where should we ‘look’?
Is it vaccines? Is it some other environmental toxin? Some food additive? The fact that people are delaying having kids, on average, these days?
Whatever it is - one key piece of initial evidence will be in charts which are correlated. That will give us the potential first clue.
You Can Wake Up Now
And that’s all for today folks. I have just touched upon the basics of how correlation (or dependence) is approached whenever we have statistical quantities. Unfortunately, when you have fluctuating data, you need to use probabilities - and that’s a bit of an embuggeration.
The real problem, as it is with so many things that require a statistical approach, is that of inference; what can we infer from the data?
Another approach I quite like is to try to ‘reduce’ things down to a communication channel. Someone is trying to send us a message, but by the time it gets to us various things have happened; noise has been added. How do we reconstruct the intended message from the (noisy) data we have?
Is what we have just ‘noise’ or is there any correlation left with the original message?
1. This is essentially identical (in probability and math terms) to a binary communication channel, which is a very important and fundamental notion in information theory. I should say it becomes ‘identical’ when you add in the possibility of correlation between the two coin flips and that we might have biased coins.

2. The thing with notation is that we’re trying to condense down a lot of verbiage and concept into a small, visually appealing, thing. This really helps when you have to do stuff like algebra and being able to see patterns. The flip side of this is that it makes math look like a crazy mix of hieroglyphs.

3. In a slightly more precise notation we might write P(H,T) = P(n=H, d=T).

4. Technically, I’m talking about statistical dependence here - but I kind of like the more colloquial understanding involved in using the term correlation, so I’m going to be naughty and keep talking about correlation.

5. There’s actually a third one to do with continuity, but we’ll leave that one out. It’s just saying that whatever parameter we have, it should be a continuous function of the probability involved.

6. If there are any woke wunderkind reading this, please be informed that this is something known as a joke - you may not be familiar with the concept.

7. Remember that a property of this uncertainty is that it’s additive for independent things.