## Wednesday, February 4, 2009

### Testing Balanced Dice

If you're like me, at this point you've seen a whole bunch of fellow players keeping track of their "lucky" dice, trying to "train" dice by storing them best-side-up, stuff like that. I got to wondering recently what it would take to practically test whether you've got a fairly balanced die or not, and fortunately I've read enough statistics at this point to finally track down what it would take. (I was a bit torn about whether to put this post here or in my math blog; I do think it belongs here a bit more, because it's a specific application of a pretty well-known "Pearson's chi-square" hypothesis testing procedure.)

Testing a d6: Let's say you've got a d6. Roll it 30 times. Keep a tally of how many times each face comes up, from 1 to 6. (Note that we expect the number of appearances from each face to be about 5; 30/6 = 5). At the end, go through the counts and subtract 5 from each, square them all, and then add them all up. For a fair die, the total at the end should be no more than 55.

Testing a d20: Alternatively, say you're looking at a d20. Roll it 100 times. Again, keep a tally of how many times each face comes up, from 1 to 20. For each of the counts, go through and subtract 5, then square them, then add them all up at the end. For a fair die, the total at the end of this process should be at most 150.
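The tallying and arithmetic above can also be handed to a short script. Here is a minimal sketch in Python, assuming the rolls have been typed into a list. The function name `sum_squared_error` and the 30 sample rolls are made up for illustration; they are not data from an actual die.

```python
from collections import Counter

def sum_squared_error(rolls, sides):
    """Tally each face, subtract the expected count, square, and add up."""
    expected = len(rolls) / sides          # e.g., 30 / 6 = 5
    tally = Counter(rolls)
    # Faces that never came up count as zero appearances.
    return sum((tally[face] - expected) ** 2 for face in range(1, sides + 1))

# Made-up record of 30 rolls of a d6 (illustration only).
rolls = [2, 5, 2, 6, 1, 4, 2, 3, 6, 2,
         4, 2, 6, 5, 3, 2, 1, 4, 6, 2,
         3, 6, 5, 4, 1, 3, 2, 5, 6, 4]

sse = sum_squared_error(rolls, sides=6)
print(sse)          # compare against 55 for a d6 rolled 30 times
print(sse <= 55)    # True means the die passes this quick test
```

For a d20, the same function would be called with `sides=20` on 100 rolls, and the total compared against 150 instead.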

Comments: The process specified above uses the minimum number of rolls you can get away with (5 times the number of sides; more on that below). It has a significance level of 5%; that is, there's a 5% chance for a die that's actually perfectly balanced to fail this test (Type I error). There's also some chance for a crooked die to accidentally pass the test, but that probability is a sliding function of how crooked the die is (Type II error). A graph could be shown for that possibility, but I've omitted it here (usually referred to as the "power curve" for the test).

How does this work? As mentioned above, it's an application of the well-known "Pearson's chi-square test" in statistics. The process results in a random number whose probability follows a "chi-square" curve after some large number of rolls. Formally, the different faces of the die are called the potential "outcomes" of rolling the die; the count of how many times each one comes up is called the "frequency" of that outcome; and the number of times you expect to see each one is called the "expected frequency". For the "chi-square" function to be applicable, you've got to have an expected frequency of 5 or more for each possible outcome (hence the requirement for a number of rolls of at least 5 times the number of sides).

The numerical process after you're done rolling can be referred to as the (very, very common) method of finding the "sum squared error" (abbreviated SSE). The "error" is how far off each frequency was from the expected value of 5 (hence count - 5); the "squared error" is when you square that value (thus making everything positive, among other niceties); and the "sum squared error" (SSE) is when you add all of those up. If the die showed every face exactly 5 times, the SSE would be exactly zero; the more crooked it is, the more error, and hence the larger the SSE value at the end.

Normally in Pearson's procedure, you'd take each "squared error" and divide by the expected frequency, then add those, then check a table of chi-squared values to see how likely that result was (compared to your initial expectation). But since we expect every side of our dice to be equally likely, there's a simplification that I've done above. For example, for a d6 test at a 5% significance level (degrees of freedom one less than sides on the die, so df = 5), I go to a table of chi-square values and look up X²(5, 0.05) and see 11.070. That means I would normally reject the fair-die hypothesis (i.e., the null-hypothesis Hₒ, "there is no difference from a fair die") if X² > 11.070. Under Pearson's procedure this would imply the following (letting "O" indicate the observed frequency of each outcome, and "E" indicate the expected frequency of each side, e.g., 5 in each case here):

$$\Sigma (O-E)^2 / E > 11.070$$

$$\Sigma (O-E)^2 / 5 > 11.070$$

$$\Sigma (O-E)^2 > 11.070 \times 5$$

$$SSE > 55.35$$

A similar simplification is done for the d20 process.
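For reference, the same steps for the d20 can be written out as follows, assuming the table value X²(19, 0.05) = 30.143 and E = 100/20 = 5:

$$\Sigma (O-E)^2 / E > 30.143$$

$$\Sigma (O-E)^2 / 5 > 30.143$$

$$\Sigma (O-E)^2 > 30.143 \times 5$$

$$SSE > 150.715$$

Rounding down gives the limit of 150 used in the d20 procedure.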

Now, if you wanted to improve the accuracy of the test you could obviously roll the die more times. You would then be able to reduce the chance for either a Type I or Type II error, but never totally avoid either possibility. (In practice, we normally keep the Type I error "significance level" fixed, and work to reduce the Type II error, thus improving the "power" of the test).
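How much the extra rolling helps can be estimated by simulation. The following is a rough Monte Carlo sketch, not part of the procedure described here: the function names, the seed, and the "loaded" die (whose 6 comes up twice as often as any other face) are all hypothetical choices for illustration. It estimates how often a die fails the d6 test (critical value 11.070) at several numbers of rolls.

```python
import random

def sse_of_sample(weights, rolls, rng):
    """Roll a die with the given face weights and return the SSE of the tallies."""
    sides = len(weights)
    expected = rolls / sides
    counts = [0] * sides
    for face in rng.choices(range(sides), weights=weights, k=rolls):
        counts[face] += 1
    return sum((c - expected) ** 2 for c in counts)

def rejection_rate(weights, rolls, critical_value, trials=1000, seed=1):
    """Fraction of simulated experiments in which the die fails the test."""
    rng = random.Random(seed)
    limit = critical_value * rolls / len(weights)   # X * E
    fails = sum(sse_of_sample(weights, rolls, rng) > limit for _ in range(trials))
    return fails / trials

fair = [1, 1, 1, 1, 1, 1]
loaded = [1, 1, 1, 1, 1, 2]   # hypothetical bias: the "6" is twice as likely

for rolls in (30, 120, 480):
    print(rolls,
          rejection_rate(fair, rolls, 11.070),
          rejection_rate(loaded, rolls, 11.070))
```

Under these assumptions, the fair die's failure rate should stay near the 5% significance level at every sample size (Type I error), while the loaded die's failure rate should climb toward 100% as the number of rolls grows (increasing power).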

I'll end this with how you can form your own test, for any die, to whatever power level you desire. I'll assume you keep the significance level at 5%, and let E be the expected number of times each face should appear in your experiment (E = rolls/sides; having E ≥ 5 is a requirement!). Then you will reject the Hₒ "balanced die hypothesis" if the SSE you get at the end is greater than X ∙ E, where X is the table lookup for X²(sides-1, 0.05). For a d4, X = 7.815; d6, X = 11.070; d8, X = 14.067; d10, X = 16.919; d12, X = 19.675; and for the d20, X = 30.143. Have fun.
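The recipe in the last paragraph can be sketched in Python as below. The table holds only the critical values quoted above; the names `CRITICAL_VALUE`, `sse_limit`, and `reject_balanced` are illustrative, not from any library.

```python
# Critical values X²(sides - 1, 0.05), copied from the paragraph above.
CRITICAL_VALUE = {4: 7.815, 6: 11.070, 8: 14.067,
                  10: 16.919, 12: 19.675, 20: 30.143}

def sse_limit(sides, rolls):
    """Largest SSE a die may show before we reject it: X * E."""
    expected = rolls / sides
    if expected < 5:
        raise ValueError("need at least 5 expected appearances per face")
    return CRITICAL_VALUE[sides] * expected

def reject_balanced(counts):
    """counts[i] is the tally for face i + 1; True means the die fails the test."""
    sides = len(counts)
    rolls = sum(counts)
    expected = rolls / sides
    sse = sum((c - expected) ** 2 for c in counts)
    return sse > sse_limit(sides, rolls)

print(sse_limit(6, 30))     # limit for a d6 rolled 30 times
print(sse_limit(20, 100))   # limit for a d20 rolled 100 times
```

If SciPy happens to be installed, `scipy.stats.chi2.ppf(0.95, sides - 1)` should reproduce the table values to within rounding, which would extend the sketch to dice not listed here.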

Pearson's chi-squared test at Wikipedia

Edit: Someone in the comments asked for a look at an actual working example -- included below in a photo of my scratch paper from the last time I tested a d20.

1. Cool post, Dan. While I don't consider myself superstitious in everyday life, there is something about D&D dice!

I swear there are nights where my d20s never see the high side of 10!

I think I will test a couple to see if there is any fact, or simply perception, to my "evidence".

I stumbled onto your blog a few months back, and have read through all of your posts, and it is bookmarked in my Daily folder.

Keep it up!

2. So how much empirical testing did you put this through? I presume you tested some number of fair dice, but did you test it on an intentionally weighted die?

I have an old d10 kicking around that once fell on the floor at my friend's house when his dog scooped it up and started chewing on it. We rescued the pieces and I glued the larger chunks back into a reasonable semblance of a d10. I'm tempted to try this test on it to see where it lies.

3. @ Greymist:
"I swear there are nights where my d20s never see the high side of 10!"

Maybe you accidentally used some 0-9 twice d20s. ;>] I've had to check some of mine lately, as I too have been rolling for @#\$%.

@ Delta:
Nice post. I am always curious about the fairness of certain dice, and those that are favored by folks. Too bad I hated probs and stats back in college. Math minor here, but man did I struggle with P&S....ugh. Nice work though. Thanks for the info and something to try next time my dice blow.

4. "So how much empirical testing did you put this through? I presume you tested some number of fair dice, but did you test it on an intentionally weighted die?"

No, none whatsoever. The math theory is really not subject to empirical testing; if you think there are gaps in the mathematical reasoning I'd like to hear about them.

Logically, if you were going to really "test the test" you'd have to know exactly how weighted the die was in the first place, but the only way to generate a solid theory about that is to use a statistical test exactly like this one. So you'd be right back where you started.

But, I'd be interested in hearing exactly how crooked your smashed-up-glued-together die is. :)

5. I've always found the statistical holiness of 5% confidence to be amusing, considering that's the chance of rolling a natural 20 (or, I guess a 1 would be more appropriate).

6. Ok, lookie here. I barely got out of math class. I want to test some six siders. Can you add a comment here with an example of the actual math used, not the formula?

Yes, I know this is dumb. I just want to make sure I'm doing it properly since I was a history major lol.
thanks,
Musashi

7. @mylittlesoldier, here's the work shown for a d6 I just tested.

Roll the d6 thirty times, keep a tally of the number of times you see each face:

1: six times, 2: five times, 3: five times, 4: seven times, 5: four times, 6: three times.

Subtract five from each of those occurrences, so I get: 1, 0, 0, 2, -1, -2.

Square each of those, so I get: 1, 0, 0, 4, 1, 4.

Sum those numbers and I get 10. Ten is less than 55, so this is probably a fair-rolling d6.
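The same worked example can be checked in a few lines of Python, using the tallies listed above. As a small extension, the sketch also divides by E to recover the unsimplified Pearson statistic from the post, so it can be compared directly with the table value 11.070.

```python
# Tallies from the worked example: face -> number of appearances in 30 rolls.
counts = {1: 6, 2: 5, 3: 5, 4: 7, 5: 4, 6: 3}

expected = sum(counts.values()) / len(counts)     # 30 / 6 = 5
errors = [c - expected for c in counts.values()]  # 1, 0, 0, 2, -1, -2
squared = [e ** 2 for e in errors]                # 1, 0, 0, 4, 1, 4
sse = sum(squared)                                # 10

chi_square = sse / expected                       # 10 / 5 = 2, versus 11.070

print(sse, sse <= 55)
print(chi_square, chi_square <= 11.070)
```

Both comparisons give the same verdict, since the limit of 55 is just 11.070 multiplied by E = 5 and rounded down.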

8. Paul is, of course, correct. For another view, here's a shot of my working notepaper from the last time I was testing dice (in this case a d20).

You can see that I rolled the d20 100 times and tallied the results each time (columns 1-3). Then I subtracted 5 (the ideal tally for each face, 100/20) from each of those (column 4). Then I squared those and added them all up (column 5). Since this came in below 150, I accepted the die as most likely reasonably balanced.

If you're testing a d6, do the exact same procedure but just roll it 30 times -- at the end, accept it as balanced if the total is 55 or less.

9. Ok, since I had a little time to spare and two new d20s lying around, I put them to the test. One was a Pegasus Oblivion Yellow, the other one a Q-Workshop Black & White Forest. 100 rolls as you suggested. The Oblivion got a result of 180, the Forest 80. Actually I suspected the Forest to be more inaccurate, since it is heavily carved. That was interesting. Also my cat is now interested in playing with dice too :-)

10. Just curious, but what if the die is biased at only a minimal number of sides? Your test shows that a *series* is fitting a prediction, but does it capture *individual* deviations from the expected series? Take your d20. Out of 100 rolls you only got 1 four. What if this bias were to continue, while all other sides remained very close to the expected outcome? I guess I'm asking, what happens if the deviation at a single value is spread equally among all other members of the series? Or is this simply an example of a Type II error?

11. Hi, Mel -- To make a long story short: yes, Pearson's chi-square test "captures individual deviations". The squaring operation in the algorithm means that it becomes increasingly sensitive to deviations as the sample size goes up. For example:

Having one face appear only once over 100 rolls is not really that surprising. The chance for that to happen, for a given face on a fair d20, is (by the binomial formula): P(X=x) = nCx*p^x*(1-p)^(n-x) = 100C1*0.05^1*(1-0.05)^(100-1) = 100*0.05*0.95^99 ≈ 100*0.05*0.006 = 0.03 = 3%. So the chance that one of 20 faces does this is roughly 20*3% = 60% (approximating here to avoid a much tougher calculation).
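The binomial arithmetic can be reproduced with Python's standard library. This is a minimal sketch; the helper name `binomial_pmf` is made up for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Chance of exactly k successes in n independent trials with success chance p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# A given face of a fair d20 (p = 1/20) appearing exactly once in 100 rolls.
p_once = binomial_pmf(1, 100, 0.05)
print(round(p_once, 3))        # about 0.031, i.e., roughly 3%

# Rough figure for "some face does this": 20 times the single-face chance.
# The 20 events are not independent and can overlap, so this is only a crude bound.
print(round(20 * p_once, 2))
```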

Let's say we bump up the number of rolls to 1,000 (E=50 per face) and keep the sampling ratios as you suggest: say one face gets 10 appearances -- in fact, let's say 12 to be generous and keep the math easy -- while the other rolls are spread evenly, 52 each for the other 19 faces. (Note 1x12+19x52 = 12+988 = 1000). There's a lesser, but still nonzero chance of this happening on a fair die (I'll let you work that out if you want.)

But if this is what we observed in the experiment and applied Pearson's chi-square test, then we'd get a sum-squared error of SSE = sum(O-E)^2 = 19*(52-50)^2+(12-50)^2 = 19*(2)^2+(-38)^2 = 19*4+1444 = 76+1444 = 1520. And so this would radically fail our test (remember, limit of just 150 for a d20)!

To recap: The test is increasingly sensitive to deviations like you're talking about over much larger trials (as is the case for any statistical hypothesis testing procedure).

12. Thanks for your response! Let me add that I enjoy reading your blog. Your post on normal distributions and variance (as well as your discussion of range penalties for archery) was one of my favorite posts in OSR blogdom. Really got me thinking!

13. Thanks for the kind words! :D

14. @ Delta: Commenting on what you said below:

"And so this would radically fail our test (remember, limit of just 150 for a d20)!"

The limit increases as the sampling size increases, because the formula is X*E, where X=~30 for a d20 and E=rolls/sides. In the case of 1000 rolls, the limit becomes ~1500, which roughly matches your 1520 result :)

15. Holy smoke is that super embarrassing! Thanks for pointing out that glitch, Rezanah. I totally didn't read my own summary close enough.

But the broader point remains: As sample size (number of rolls) goes up, the probability for a certain proportion of skewed rolls goes down, and the associated test correspondingly gets more sensitive to such fluctuations. Again, the concrete thing you can point to is the squaring operation that makes the sum-squared-error (SSE) explode faster than the sample size.

A more correct example: Say we look at a few different sample sizes (n), each where we're rolling a d20, and one face shows up 1.2% of the time (O1), the others being balanced (O2).

(1) For n=100, E=5, O1=1.2, O2=5.2; so SSE=15.2 < X*E=150.72 -- passing the test easily by a factor of about x10.

(2) For n=1,000, E=50, O1=12, O2=52; so SSE=1,520 ~ X*E=1,507.15 -- very close to the critical value for the test (as noted by Rezanah).

(3) For n=10,000, E=500, O1=120, O2=520; so SSE=152,000 > X*E=15,071.5 -- now failing the test radically by a factor of x10.

In summary: As the sample size gets bigger, the test definitely does get more sensitive to proportional fluctuations (although not quite as fast as I asserted above).
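The three cases above can be recomputed with a short Python sketch. The function name `skewed_sse` is hypothetical; the 1.2% share, the d20, and the critical value 30.143 are the figures used in this comment.

```python
X_D20 = 30.143   # critical value X²(19, 0.05) used in this thread

def skewed_sse(n, sides=20, low_share=0.012):
    """SSE when one face shows up low_share of the time and the rest split evenly."""
    expected = n / sides
    low = n * low_share                    # O1
    other = (n - low) / (sides - 1)        # O2
    return (low - expected) ** 2 + (sides - 1) * (other - expected) ** 2

for n in (100, 1_000, 10_000):
    sse = skewed_sse(n)
    limit = X_D20 * n / 20                 # X * E
    print(n, round(sse, 2), round(limit, 2), round(sse / limit, 2))
```

The last column is the ratio of SSE to the limit. Because the SSE grows with the square of n while the limit grows only in proportion to n, that ratio goes up tenfold with each tenfold increase in rolls.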

Huge "thank you" to Rezanah, much appreciate the assist. If anyone else sees anything else I need to polish, please likewise tell me...

16. As I look at the comments to your excellent Feb 4, 2009 article I finally understand that for the d6 30 roll test that the SSE limit is 55.35. If the number of test rolls were to increase to 60 or 300 rolls then the SSE limit would increase proportionally to 110.7 and 553.5 respectively. Am I correct?

Similarly, in the original article you mention a graph to show the possibility of a crooked die. I would be glad to see it if readily available and you think useful.

Great article!!

17. JohnF said: "If the number of test rolls were to increase to 60 or 300 rolls then the SSE limit would increase proportionally to 110.7 and 553.5 respectively. Am I correct?"

Hey JohnF, you have it exactly right. (Even though I embarrassingly forgot that myself up in my 6/27 comment.) And you're on-target that the usefulness of increasing the die-rolls is to make the test more sensitive, i.e., greater "power" which could be shown in a graph.

At the moment, I don't have that graph. Let me think about that for a bit. Thanks for the kind words!

18. Actually, I think I was being flippant about the power-curve graphs, because (a) that's, like, real hard, and (b) I'm not sure how to do it immediately for a chi-squared test. I will continue to think about how to do that.

19. Hmm, I think this is actually a fairly meaningless test until you determine with what probability an *unfair* die will pass the test.

You've shown how often a fair die will pass the test, but that doesn't directly translate into a confidence that the dice is fair -- a die with a very, very slight unfairness will pass the test only slightly less often. As the unfairness increases, it'll be less and less likely to pass, so the test can discriminate between a perfectly balanced die and a blatantly unbalanced one. But to be useful you definitely need to know what the sensitivity really is.

20. @ starwed: What you're talking about is the "power" of the test, as mentioned above. See the last link at the very end of the blog post for a complete treatment of that.

21. Hi, Delta! Thanks for this post! But I was reading through a post in boardgamegeek.com (http://boardgamegeek.com/thread/576612/testing-dice-for-fairness) and someone brought up the following point: "There are practically no games where the imbalance of a standard die would significantly influence the game. There are just too few die rolls." What do you think? Should I keep using my unbalanced die or buy casino ones?

1. That particular comment is just silly and totally wrong in every respect. In fact, fewer dice rolls actually means more opportunity for the imbalance to make a difference (not less). Any save against poison shows that a single die-roll can have a monumental effect on the game. (It's a separate issue that it takes a large number of rolls to scientifically confirm that a die is biased).

So as far as what action to take, it depends on how much you care about your game or its outcome. For D&D, I did test all eight of my d20's with 100-rolls each, and if they're not obviously biased, then I keep them. For Book of War, I actually did buy a set of casino d6's just last week, partly to show off the product to other people, and partly because the head-to-head competition gets people's dander up a little bit more when things go bad.

Also keep in mind that casino security doesn't do hundreds of trials of rolling, they just stack the d6's next to each other and visually confirm the corners are square and meet up evenly without visible gaps. For a die to be radically biased, it would likely need obvious cracking/chipping from the die.

2. And to follow-up: My understanding is if you care about true-rolling polyhedral dice, the best bet is to just get the precision-edged Gamescience dice and leave it at that (on sale through gamestation.net).

22. Thanks for this. I just built an Excel Spreadsheet for testing my D20's. I have uploaded a copy here if others want to play around with it as well.

1. That's great, thanks for posting that!

23. How would this work for dice with non-uniform distribution, such as “average” dice which have sides 2-3-3-4-4-5 (to avoid the extreme 1 and 6 results)?

1. Good question; pretty much the same except that the expectation of appearances now differs per value. Example: Say you roll this die 30 times. Keep a similar table of frequency appearances for 2-3-4-5; for 2 and 5 subtract 5, but for 3 and 4 subtract 10 (respective expectations); then square and do the rest as normal. Basically you tally the squared error in any case (difference of frequency and expectation, whatever that may be for each outcome).

Alternatively, you could make a little mark to actually distinguish each of the 6 faces and do it normally (with equal expectations).
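One caveat worth spelling out: with unequal expectations, the shortcut of comparing the raw SSE to X*E no longer applies directly, because Pearson's formula divides each squared error by its own E. If "the rest as normal" is read as the full Pearson procedure, a sketch might look like the following. The observed tallies are made up for illustration, the function name is hypothetical, and the critical value 7.815 is the df = 3 table entry (four distinct outcomes minus one), the same one quoted for a d4 in the post.

```python
def pearson_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all outcomes, each with its own expectation."""
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in expected)

faces = [2, 3, 3, 4, 4, 5]    # the "average" die
rolls = 30

# Expected appearances per value: 5 for the 2 and 5, 10 for the 3 and 4.
expected = {v: rolls * faces.count(v) / len(faces) for v in sorted(set(faces))}

# Made-up tallies from 30 rolls (illustration only).
observed = {2: 4, 3: 11, 4: 9, 5: 6}

statistic = pearson_statistic(observed, expected)
print(expected)
print(statistic, statistic <= 7.815)   # passes if at or below X²(3, 0.05)
```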

24. How can I translate this formula to another dice: For example, what is the total expected for a D10

1. Search for where "d10" is referenced in the original blog post above. Pick a per-side expectation E, minimum E = 5; so in this case roll the die at least 10*5 = 50 times and record the results. For a fair die, the SSE at the end should be no more than X*E = 16.919*5 = 85.

* Noting again that this is a lower-power test; if the die fails, then you can be fairly sure it's broken; but lots of biased dice will pass the test anyway. For a high-power test you'd want to use lots of rolls, maybe around E = 100 (so roll the die 1,000 times and see if the SSE remains below 16.919*100 = 1692).

25. While you (rightfully) cautioned readers about the limitations of the chi-squared Goodness of Fit test in the follow-up article, I would like to say that as a statistician I do think there's still some practical usefulness to lower sample size tests like the ones you propose here. Not quite so low - I'd suggest about twice as many, so 60 rolls of a d6 or 300 rolls to test a d20 - but on the same order of magnitude.

While this isn't extremely powerful, it's enough to find gross imbalances (e.g., a d20 that rolls below 10 three-quarters of the time) while being quick enough to test a large number of dice on a lazy Sunday afternoon. Making an analogy to professional statistical analysis, using a lower sample size keeps the "cost" of doing the test low, which is desirable if you don't require the extra precision granted by a larger sample. Also, bear in mind that if you're willing to discard a die more easily (say, alpha of .9 instead of .95) then the power increases - and after all, they're only plastic pieces in a game, so accidentally putting an "okay" die into the "unbalanced" pile isn't the end of the world.

1. Interesting comment, and basically true, but I probably wouldn't act on those suggestions for a couple reasons:

One: Consider the "d20 rolls mostly below 10" case; it's pretty unlikely for that to be a hypothesis, because manufacturers put numbers with large differences next to each other (1 adjacent to 19, 2 next to 20, etc.), so that pattern doesn't show up on an unbalanced die.

Second: Monday on my math blog (madmath.com) I'll post the first case I've ever found of dice failing the test. In this case it's a whole boxful of cheap, rather obviously unbalanced d6's; and in fact if I'd rolled fewer than about 500 dice I don't think it would have failed the test (at alpha = 0.05).

This actually came up because I used those dice in a class experiment, effectively "how many hits can we get on armor value 5 [in Book of War]?", and the proportion was way less than expected. But if I'd asked for "armor 4" (i.e., proportion less than half, like the d20-below-10 thought experiment), I never would have noticed the difference, because likewise the "1" and the "4" faces are about equally overbalanced to mask this effect.

But: Great intuition, and thanks for pinging me on this, because I've been working on writing up this failed test just the last day or so!

26. hello there
I tried your formula but I hit some blocks on the way.

I tried following the steps for the D6
I kept tabs on the results of the die for 30 rolls.
Then I subtracted 5 from each of those results,
then I squared the result of each subtraction,
then I summed the results of the squares.
For one die I get a result of 122.

so is something wrong in the formula or the dice?

1. That definitely suggests that your die is unbalanced! Is there anything visually, obviously wrong with it? If you try the test again (another 30 rolls), do you get another very high number at the end?

2. Heya Delta, thanks for getting back to me.
Well, actually those are custom dice from Chessex; I had the 6-side pips changed to my army's icon. Does this affect the dice? I have done this before for dice by Chessex too, but from 20 dice that I bought, 12 of them have the results of 55.

3. It seems like that data says yes, it does affect the balance of those dice -- that's pretty compelling evidence, actually. It may be that the army icon is larger/dub deeper than the other pips and makes that side unbalanced. (Similar to the dice that failed me here with a giant "1" pip.)

That's kind of too bad, but having the test fail on you repeatedly like that is an extremely strong sign. Is that special "6" side showing up more commonly than the other sides (or something else)?

4. "dub" -> "dug"

5. By special "6" side showing up more, do you mean when rolling the dice 30 times? I have thrown away the calculation and I will try them all again.

6. Right; I'm just taking a stab at guessing why it's so unbalanced. I'm guessing that the "6" side is coming up more than the others (or maybe it's coming up less).

7. This comment has been removed by the author.

8. My next guess was going to be that the "1" was coming up more often; the "6" side is probably shallowly engraved, so it's heavy on that side (and hence rolls to the bottom).