Friday, June 26, 2009

Figures Lie, and Liars will Figure

This started as a reply to C-Lauff's comment on my commentary on the Iran elections, but the response ran really long, and is now a post. He added a link to an article that suggests that the vote totals looked fishy. If you click on the comments there, though, you'll see a number of them arguing that the analysis itself is fishy. I'm going to give my take on it in the reply to C-Lauff, below:


Yeah, I saw that article. The method seems OK. I'm about 99% sure that they used binomial distributions for this. However, I'm not sure about their interpretation. If you go to http://stattrek.com/Tables/Binomial.aspx you can fill in numbers yourself.

I'll walk through the calculations. Essentially, what they're saying is, "OK. We've got a list of 116 numbers. Each digit (0-9) should show up roughly 11.6 times, or 10% of the time." Deviations from this 10% should be random (and roughly normally distributed).

The proof offered in the editorial (I haven't looked at the raw analysis) suggests that having one digit show up 20 times is unlikely, that having a 2nd digit show up only 5 times is unlikely, and that having both occur is akin to fraud.

We can separate this into 2 calculations. The first: with 116 trials, at a probability of 10%, how likely is it that we get a single digit 20 or more times? For one particular digit, the odds are against you - only 1.105%. However, for all 10 digits, the equation is a little different. The odds of having no digit show up 20 or more times are (1-.01105)^10. Run the numbers, and you get the probability that at least one digit shows up 20 or more times as 10.52%.
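For anyone who would rather not punch numbers into the online table, here is a minimal Python sketch of this first step. The helper names (`binom_pmf`, `binom_tail_ge`) are my own, and it assumes what the post assumes: 116 trials, a 10% chance per digit, and the 10 digits treated as independent (an approximation, since the counts have to add up to 116).

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X == k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

# One particular digit showing up 20 or more times in 116 trials.
p_one_digit = binom_tail_ge(20, 116, 0.1)

# At least one of the 10 digits doing so, treating digits as independent.
p_any_digit = 1 - (1 - p_one_digit) ** 10

print(f"one particular digit: {p_one_digit:.3%}")  # post quotes 1.105%
print(f"any of the 10 digits: {p_any_digit:.2%}")  # post quotes 10.52%
```

The jump from about 1% to about 10% comes entirely from asking the question about any digit rather than one named in advance.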

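Now, of the remaining 96 trials, how likely is it that you get a single digit 5 or fewer times? For a specific digit (now at a 1-in-9 chance per trial, since one digit is already accounted for), the answer is 3.734%. However, for all of the 9 remaining digits, the calculation is (1-.03734)^9 = 1-p. Solve for p, and the odds of at least one of them showing up 5 or fewer times come to about 29%.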

Now, you multiply the results together to see how often both happen, and you get roughly 3.05% of the time, which is what they say (the less than 4 out of 100 times they say in the article).
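The two steps and the final multiplication can be sketched in Python as below. This is a reconstruction rather than the article's actual code. In particular, the 1/9 per-digit probability over the remaining 96 trials is my assumption about how the 3.734% figure was obtained, and the function names are made up for illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Step 1: some digit shows up 20+ times in 116 trials (10% each, 10 digits).
p_high_one = 1 - binom_cdf(19, 116, 0.1)
p_high_any = 1 - (1 - p_high_one) ** 10

# Step 2: among the other 96 trials, some remaining digit shows up <= 5 times.
# Assumption: each of the 9 remaining digits has a 1/9 chance per trial.
p_low_one = binom_cdf(5, 96, 1 / 9)
p_low_any = 1 - (1 - p_low_one) ** 9

# Step 3: multiply, treating the two events as independent.
p_both = p_high_any * p_low_any

print(f"low digit, one particular digit: {p_low_one:.3%}")  # post quotes 3.734%
print(f"low digit, any of 9 digits:      {p_low_any:.1%}")  # post quotes about 29%
print(f"high and low together:           {p_both:.2%}")     # post quotes 3.05%
```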

They go on to say some stuff about sequential digits, and run the same calculation (the odds of 72 or fewer successes out of 116, given a 70% success rate), and get another value of 4.12% (the less than 4.2% they say in the article).

Now, what they're saying is that the odds of both the first condition (the 3.05%) and the 2nd condition (the 4.12%) occurring are slim. I get roughly 0.13%, a little lower than the 0.5% they say in the article. However, that doesn't really indicate fraud, in my opinion. Think about all the possibilities in life. Any single one happening is ultra-rare, right? We're biased to pull out things that support our assertions.
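Here is a rough Python sketch of the sequential-digit tail and the combined figure. It takes the 3.05% from the earlier step as a given, uses the 72-of-116 and 70% numbers quoted above, and treats the two conditions as independent, which is my reading of the multiplication rather than something the article spells out.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# 72 or fewer "successes" out of 116 when each has a 70% chance.
p_pairs = binom_cdf(72, 116, 0.7)

# Digit-frequency result carried over from the earlier step (value from the post).
p_digits = 0.0305

# Both conditions at once, treated as independent.
p_all = p_digits * p_pairs

print(f"sequential-digit condition: {p_pairs:.2%}")  # post quotes 4.12%
print(f"both conditions together:   {p_all:.2%}")    # post gets roughly 0.13%
```

Note that both inputs were picked after looking at the data, so the small product says more about how the question was chosen than about the vote counts.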

Look at our UPL fantasy baseball stats. Look at the ones digit of the total runs scored. As of today (6/26), we have 4 teams with 3, 2 teams each with 9, 7, and 2 runs, and only 1 team each with 1 and 8. No teams have 4, 5, or 6 runs.

What are the odds of having 4 or more teams with the same ones digit in their runs total? The numbers suggest that will happen 22.87% of the time. Easy enough. Now, if I wanted to fish something out of thin air, and build a statistical argument around it, I can easily do it.
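The 22.87% figure can be reproduced with the same recipe as the election numbers. The sketch below assumes 12 teams (the counts listed above add up to 12), a 10% chance for each ones digit, and independence across the 10 digits. The variable names are mine.

```python
from math import comb

def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

teams = 12

# One particular digit ending 4 or more of the 12 run totals.
p_one_digit = binom_tail_ge(4, teams, 0.1)

# Any of the 10 digits doing so, treating digits as independent.
p_any_digit = 1 - (1 - p_one_digit) ** 10

print(f"one particular digit: {p_one_digit:.2%}")
print(f"any of the 10 digits: {p_any_digit:.2%}")  # post quotes 22.87%
```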

Look at the number of times that you get the same number in the tens and ones column... you see 09, 17, and 53 twice. What are the odds of seeing the exact same set of 2-digit numbers? 1 in 100. What are the odds of seeing it 3 or more times out of 12? 5 times out of a million. What are the odds that the sequence of numbers that we currently have in the UPL show up (given those two conditions)? About 1 in a million. Clearly there is fraud going on. And I'm certain that this is the case because I'm not in first place (which has happened 5 out of 8 seasons).

So, aside from reminding everyone how unlikely it is that I won't come back to win the baseball league, I'm sort of pointing out how in any data set, you can pretty much fish out whatever you want, if you keep looking hard enough. And once you fish out the conclusion you want, you can come up with stats to back it up, particularly if you just use random numbers as your basis. However (and this is the key), smart people will look at the theoretical explanation for why the stats someone poses really matter.

Overall, I suppose my question is, do you still think that the article gives a strong case for fraud, or a weak case for fraud? In the original article, I'm just not seeing it.

-Chairman

2 comments:

Westy said...

All I know is that if the boys at 538 think there was fraud, there probably was.

Chairman said...

I wasn't really commenting on whether or not there was fraud. I was commenting on how good of a case the original article built.

In terms of "proving" that there was fraud, I thought that the original article was crap. I agree with their math. I think that they stretched big time with their interpretation (hence my example w/ the UPL runs scored numbers, which are a 1 in a million occurrence).

There's an equally absurd way to pass their test: make up numbers, and then use a random number generator on the ones and tens digits. If this passes, then their analysis isn't all that rigorous.
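A quick way to try that idea is a small simulation. This is only a sketch under my own assumptions: 116 made-up totals per run, ones and tens digits drawn uniformly at random, a "digit flag" when some ones digit appears 20+ times and another 5 or fewer times, and a "pair flag" when 72 or fewer of the tens/ones pairs are distinct and non-adjacent. Adjacency here wraps around (0 and 9 count as neighbors), which I chose because it gives the 70% rate quoted in the post. The article may define it differently.

```python
import random
from collections import Counter

def simulate(trials=2000, n=116, seed=0):
    """Return (digit-flag rate, pair-flag rate, overall non-adjacent rate)."""
    rng = random.Random(seed)
    digit_flags = pair_flags = non_adjacent_total = 0
    for _ in range(trials):
        tens = [rng.randrange(10) for _ in range(n)]
        ones = [rng.randrange(10) for _ in range(n)]
        counts = Counter(ones)
        most = max(counts.values())
        least = min(counts.get(d, 0) for d in range(10))
        if most >= 20 and least <= 5:
            digit_flags += 1
        # Distinct and non-adjacent digits, with wraparound (assumption).
        non_adjacent = sum(
            1 for t, o in zip(tens, ones) if (t - o) % 10 not in (0, 1, 9)
        )
        non_adjacent_total += non_adjacent
        if non_adjacent <= 72:
            pair_flags += 1
    return (digit_flags / trials, pair_flags / trials,
            non_adjacent_total / (trials * n))

digit_rate, pair_rate, non_adjacent_rate = simulate()
print(f"digit flag raised in {digit_rate:.1%} of purely random runs")
print(f"pair flag raised in  {pair_rate:.1%} of purely random runs")
print(f"non-adjacent pairs overall: {non_adjacent_rate:.1%}")
```

Random digits sail through most of the time, and they also trip each flag every so often purely by chance.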