How to Skew Correlation in Research

Updated 1/1/2019

In one of the various social media outlets I hang around, a conspiracy theorist posted a link to this video:

Ok, cool.

Here’s the problem with that:
The study behind the supposed “cover-up” used heavily cherry-picked data to reach false conclusions. Even within the confines of their own study, the authors only managed to show a slight correlation once they threw out the majority of the data points. When you restrict your data to a small, hand-picked subset, you can see correlation where it doesn’t actually exist. If you bias your data, you can show anything.

Now, I’m certainly not someone with a statistics-heavy background, but I do have a bit of background in probability.

In this case, the “researchers,” in attempting to show an MMR-autism correlation, had to throw out everyone except African American children younger than 36 months who were born in Georgia, a subgroup they singled out only after the data had been collected. Even ignoring that show-stopping flaw, the data collection techniques were inherently low quality, which further undermines the legitimacy of the study.
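To see why picking a subgroup after the fact is such a problem, here is a minimal sketch in Python. It uses synthetic data only, not anything from the study: the two measurements are independent by construction, and the group size, sample count, and seed are arbitrary values I picked for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(spread_x * spread_y)

rng = random.Random(2019)  # fixed seed so the run is repeatable

# Synthetic data only: two measurements that are independent by construction.
N_PEOPLE = 2000
GROUP_SIZE = 20
xs = [rng.gauss(0, 1) for _ in range(N_PEOPLE)]
ys = [rng.gauss(0, 1) for _ in range(N_PEOPLE)]

r_everyone = pearson(xs, ys)

# Slice the same data into 100 arbitrary subgroups of 20 and check each one.
subgroup_rs = [
    pearson(xs[i:i + GROUP_SIZE], ys[i:i + GROUP_SIZE])
    for i in range(0, N_PEOPLE, GROUP_SIZE)
]

# Report only the subgroup with the strongest correlation, in either direction.
r_best_subgroup = max(subgroup_rs, key=abs)

print(f"correlation, everyone:      {r_everyone:+.3f}")
print(f"correlation, best subgroup: {r_best_subgroup:+.3f}")
```

With enough small subgroups to choose from, at least one of them should show a sizable correlation purely by chance, even though the full data set shows essentially none.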

Science Based Medicine has already written about why the study is flawed, and further writing from Scienceblogs gives some more information. Rather than repeat them, I’ll talk about research statistics in general.

Since anti-vaxxers always want to discuss their research on the correlation between MMR vaccines and autism, it’s safe to assume that they must have some background in research, right?

Let’s talk about a hypothetical experiment where I am trying to show correlation and how we can shift it around if we want.

[Image: Pascal’s Binomial Distribution]

Let’s suppose I have a set of M&Ms of all different colors available. I want to determine the correlation between drawing a brown M&M and drawing a blue M&M on the next draw. As anti-vaxxers should know (since they seem to love posting about statistics and correlations), each of these events follows a discrete binomial distribution. The correlation is the covariance of the two events (the expected value of the product of the two outcomes, each with its mean subtracted out) divided by the product of the two events’ standard deviations. The probability of each color choice uses the traditional binomial probability equation that they are certainly familiar with from their research.

[Image: Correlation Equation]
[Image: Binomial Distribution Equation]
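Written out, the standard textbook forms of these two equations look like the sketch below. The symbol names are just labels picked for this sketch: X and Y are the two events being compared, the mu and sigma terms are their means and standard deviations, n is the number of draws, k is the number of successes, and p is the probability of success on a single draw.

```latex
% Pearson correlation: covariance divided by the product of standard deviations
\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}
           = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y}

% Binomial probability mass function: k successes in n independent draws
P(K = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}, \qquad k = 0, 1, \dots, n
```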

Now, if you look at the overall probability mass function and start changing the denominators because you’re throwing out samples you dislike, you’ll see that this changes your individual event probabilities once you solve for the probability mass function of your experiment. As you can probably guess, changing the probability mass function changes your binomial distribution, and that change carries through every other equation, so your correlation goes up or down depending on how you select the data.
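Here is a rough sketch of that effect using the M&M example. Everything specific in it is an assumption made for illustration: six equally likely colors, two draws per trial with replacement (so the draws really are independent), an arbitrary seed, and one particular rule for which "disliked" trials get thrown out.

```python
import math
import random

# Assumed for illustration: six colors, all equally likely.
COLORS = ["brown", "blue", "red", "yellow", "green", "orange"]

def pearson(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(spread_x * spread_y)

def summarize(pairs):
    """Return P(first is brown), P(second is blue), and their correlation."""
    browns = [b for b, _ in pairs]
    blues = [u for _, u in pairs]
    n = len(pairs)
    return sum(browns) / n, sum(blues) / n, pearson(browns, blues)

rng = random.Random(36)  # fixed seed so the run is repeatable

# Each trial draws two M&Ms with replacement and records two 0/1 indicators:
# (first draw was brown, second draw was blue).
N_TRIALS = 20000
trials = []
for _ in range(N_TRIALS):
    first, second = rng.choice(COLORS), rng.choice(COLORS)
    trials.append((int(first == "brown"), int(second == "blue")))

# Throw out the samples we dislike: every trial where a blue M&M
# followed something other than a brown one.
kept = [(b, u) for b, u in trials if not (b == 0 and u == 1)]

p_brown_all, p_blue_all, r_all = summarize(trials)
p_brown_kept, p_blue_kept, r_kept = summarize(kept)

print(f"all data:      n={len(trials)}  P(brown)={p_brown_all:.3f}  "
      f"P(blue)={p_blue_all:.3f}  r={r_all:+.3f}")
print(f"cherry-picked: n={len(kept)}  P(brown)={p_brown_kept:.3f}  "
      f"P(blue)={p_blue_kept:.3f}  r={r_kept:+.3f}")
```

Discarding trials shrinks the denominator, which shifts the estimated probability of each event, and the correlation between "brown" and "blue next" moves away from zero even though the draws never depended on each other.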

This is how they mathematically cherry-picked data to make their correlation coefficient go up.

I invite comments below.