Delta's D&D Hotspot: Testing Balanced Dice

2009-02-04

Testing Balanced Dice

If you're like me, at this point you've seen a whole bunch of fellow players keeping track of their "lucky" dice, trying to "train" dice by storing them best-side-up, stuff like that. I got to wondering recently what it would take to practically test whether you've got a fairly balanced die or not, and fortunately I've read enough statistics at this point to finally track down what it would take. (I was a bit torn about whether to put this post here or in my math blog; I do think it belongs here a bit more, because it's a specific application of a pretty well-known "Pearson's chi-square" hypothesis testing procedure.)

Testing a d6: Let's say you've got a d6. Roll it 30 times. Keep a tally of how many times each face comes up, from 1 to 6. (Note that we expect the number of appearances from each face to be about 5; 30/6 = 5.) At the end, go through the counts and subtract 5 from each, square them all, and then add them all up. For a fair die, the total at the end should be no more than 55.

Testing a d20: Alternatively, say you're looking at a d20. Roll it 100 times. Again, keep a tally of how many times each face comes up, from 1 to 20. For each of the counts, go through and subtract 5, then square them, then add them all up at the end. For a fair die, the total at the end of this process should be at most 150.
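The two recipes above can be sketched in a few lines of Python. This is only a minimal sketch: the function names `sum_squared_error` and `passes_simple_test` and the sample tallies are made up for illustration, while the limits 55 and 150 are the ones quoted above.

```python
from collections import Counter

# Limits quoted above: 30 rolls of a d6, 100 rolls of a d20.
SIMPLE_LIMITS = {6: 55, 20: 150}


def sum_squared_error(rolls, sides, expected=5):
    """Tally each face, subtract the expected count, square, and add up."""
    tally = Counter(rolls)
    return sum((tally[face] - expected) ** 2 for face in range(1, sides + 1))


def passes_simple_test(rolls, sides):
    """Apply the 5-rolls-per-face recipe (hypothetical helper for illustration)."""
    if len(rolls) != 5 * sides:
        raise ValueError("this recipe assumes exactly 5 rolls per face")
    return sum_squared_error(rolls, sides) <= SIMPLE_LIMITS[sides]


# Made-up d6 session: faces 1..6 came up 6, 5, 5, 7, 4, 3 times.
example_rolls = [1] * 6 + [2] * 5 + [3] * 5 + [4] * 7 + [5] * 4 + [6] * 3
print(sum_squared_error(example_rolls, 6))
print(passes_simple_test(example_rolls, 6))
```

In practice you would type in your actual rolls in place of `example_rolls`.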

Comments: The process specified above uses the minimum number of rolls you can get away with (5 times the number of sides; more on that below). It has a significance level of 5%; that is, there's a 5% chance for a die that's actually perfectly balanced to fail this test (Type I error). There's also some chance for a crooked die to accidentally pass the test (Type II error), but that probability is a sliding function of how crooked the die is. A graph of that relationship (usually referred to as the "power curve" for the test) could be shown, but I've omitted it here.
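One way to get a feel for the 5% figure is to simulate it. The sketch below is an illustration only, assuming Python's built-in pseudo-random generator is a good enough stand-in for a fair d6; the function names, trial count, and seed are arbitrary choices, not part of the original procedure.

```python
import random


def sse_for_fair_d6(rng, rolls=30, sides=6):
    """Roll a simulated fair die and return the sum squared error of its tallies."""
    counts = [0] * sides
    for _ in range(rolls):
        counts[rng.randrange(sides)] += 1
    expected = rolls / sides
    return sum((c - expected) ** 2 for c in counts)


def estimate_type_i_rate(trials=20000, limit=55, seed=1):
    """Fraction of simulated fair-d6 tests whose SSE exceeds the limit."""
    rng = random.Random(seed)
    failures = sum(1 for _ in range(trials) if sse_for_fair_d6(rng) > limit)
    return failures / trials


rate = estimate_type_i_rate()
print(f"fair d6 failed the test in {rate:.1%} of simulated runs")
```

The printed rate should land in the neighborhood of 5%, though not exactly on it, since the chi-square curve is only an approximation at 30 rolls.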

How does this work? As mentioned above, it's an application of the well-known "Pearson's chi-square test" in statistics. The process results in a random number whose distribution approximately follows a "chi-square" curve once the number of rolls is large enough. Formally, the different faces of the die are called the potential "outcomes" of rolling the die; the count of how many times each one comes up is called the "frequency" of that outcome; and the number of times you expect to see each one is called the "expected frequency". For the "chi-square" function to be applicable, you've got to have an expected frequency of 5 or more for each possible outcome (hence the requirement for a number of rolls of at least 5 times the number of sides).

The numerical process after you're done rolling can be referred to as the (very, very common) method of finding the "sum squared error" (abbreviated SSE). The "error" is how far off each frequency was from the expected value of 5 (hence count - 5); the "squared error" is when you square that value (thus making everything positive, among other niceties); and the "sum squared error" (SSE) is when you add all of those up. If the die showed every face exactly 5 times, the SSE would be exactly zero; the more crooked it is, the more error, and hence the larger the SSE value at the end.

Normally in Pearson's procedure, you'd take each "squared error" and divide by the expected frequency, then add those, then check a table of chi-squared values to see how likely that result was (compared to your initial expectation). But since we expect every side of our dice to be equally likely, there's a simplification that I've done above. For example, for a d6 test at a 5% significance level (degrees of freedom one less than sides on the die, so df = 5), I go to a table of chi-square values and look up X²(5, 0.05) and see 11.070. That means I would normally reject the fair-die hypothesis (i.e., the null hypothesis H₀, "there is no difference from a fair die") if X² > 11.070. Under Pearson's procedure this would imply the following (letting "O" indicate the observed frequency of each outcome, and "E" indicate the expected frequency of each side, e.g., 5 in each case here):

$$\Sigma (O-E)^2 / E > 11.070$$

$$\Sigma (O-E)^2 / 5 > 11.070$$

$$\Sigma (O-E)^2 > 11.070 \times 5$$

$$SSE > 55.35$$

A similar simplification is done for the d20 process.
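For reference, the same steps for the d20 would look as follows, taking df = 19, E = 5, and the table value X²(19, 0.05) = 30.143 used elsewhere in this post:

$$\Sigma (O-E)^2 / 5 > 30.143$$

$$\Sigma (O-E)^2 > 30.143 \times 5$$

$$SSE > 150.715$$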

Now, if you wanted to improve the accuracy of the test you could obviously roll the die more times. You would then be able to reduce the chance for either a Type I or Type II error, but never totally avoid either possibility. (In practice, we normally keep the Type I error "significance level" fixed, and work to reduce the Type II error, thus improving the "power" of the test).
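To illustrate how more rolls improve the power, here is a small simulation sketch. Everything about the loaded die is an assumption made up for the example (face 6 showing 25% of the time, the others 15% each), as are the function names, trial count, and seed; only the 11.070 table value comes from the text above.

```python
import random

CRITICAL_D6 = 11.070  # chi-square table value for df = 5 at the 5% level


def fails_test(rng, weights, rolls):
    """Simulate one test of a die with the given face weights; True if rejected."""
    sides = len(weights)
    expected = rolls / sides
    counts = [0] * sides
    for face in rng.choices(range(sides), weights=weights, k=rolls):
        counts[face] += 1
    sse = sum((c - expected) ** 2 for c in counts)
    return sse > CRITICAL_D6 * expected


def estimate_power(weights, rolls, trials=2000, seed=1):
    """Fraction of simulated tests in which the loaded die is caught."""
    rng = random.Random(seed)
    return sum(fails_test(rng, weights, rolls) for _ in range(trials)) / trials


# Hypothetical loaded d6: face 6 shows 25% of the time, the rest 15% each.
loaded = [0.15, 0.15, 0.15, 0.15, 0.15, 0.25]
for n in (30, 300):
    print(n, "rolls -> caught in", f"{estimate_power(loaded, n):.0%}", "of runs")
```

With this particular made-up die, the 30-roll test should miss it most of the time, while the 300-roll test should catch it in the large majority of runs.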

I'll end this with how you can form your own test, for any die, to whatever power level you desire. I'll assume you keep the significance level at 5%, and let E be the expected number of times each face should appear in your experiment (E = rolls/sides; having E ≥ 5 is a requirement!). Then you will reject the H₀ "balanced die hypothesis" if the SSE you get at the end is greater than X ∙ E, where X is the table lookup for X²(sides-1, 0.05). For a d4, X = 7.815; d6, X = 11.070; d8, X = 14.067; d10, X = 16.919; d12, X = 19.675; and for the d20, X = 30.143. Have fun.
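The general recipe can be sketched as a small Python helper. This is a sketch under stated assumptions: the names `CRITICAL_VALUES`, `sse_limit`, and `reject_balanced` are invented for the example, and the table holds only the six values listed above.

```python
# Chi-square critical values at the 5% level, copied from the text (df = sides - 1).
CRITICAL_VALUES = {4: 7.815, 6: 11.070, 8: 14.067, 10: 16.919, 12: 19.675, 20: 30.143}


def sse_limit(sides, rolls):
    """Largest SSE at which the die is still accepted: X * E, with E = rolls / sides."""
    expected = rolls / sides
    if expected < 5:
        raise ValueError("need at least 5 expected appearances per face")
    return CRITICAL_VALUES[sides] * expected


def reject_balanced(counts):
    """Return True if the tallies (one per face) fail the test at the 5% level."""
    sides = len(counts)
    rolls = sum(counts)
    expected = rolls / sides
    sse = sum((c - expected) ** 2 for c in counts)
    return sse > sse_limit(sides, rolls)


print(sse_limit(6, 30))     # limit for 30 rolls of a d6
print(sse_limit(10, 50))    # limit for 50 rolls of a d10
print(reject_balanced([6, 5, 5, 7, 4, 3]))  # made-up d6 tallies
```

Note that the limit grows in proportion to the number of rolls. For other die sizes or significance levels, a statistics library's chi-square quantile function (for example `scipy.stats.chi2.ppf(0.95, sides - 1)`) could supply X instead of the hard-coded table.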

Pearson's chi-squared test at Wikipedia

Edit: Someone in the comments asked for a look at an actual working example -- included below in a photo of my scratch paper from the last time I tested a d20.

40 comments:

Anonymous, February 5, 2009 at 12:11 AM
Cool post, Dan. While I don't consider myself superstitious in everyday life, there is something about D&D dice!

I swear there are nights where my d20s never see the high side of 10!

I think I will test a couple to see if there is any fact, or simply perception, to my "evidence".

I stumbled onto your blog a few months back, and have read through all of your posts, and it is bookmarked in my Daily folder.

Keep it up!
Anonymous, February 5, 2009 at 8:06 AM
So how much empirical testing did you put this through? I presume you tested some number of fair dice, but did you test it on an intentionally weighted die?

I have an old d10 kicking around that once fell on the floor at my friend's house when his dog scooped it up and started chewing on it. We rescued the pieces and I glued the larger chunks back into a reasonable semblance of a d10. I'm tempted to try this test on it to see where it lies.
yoyorobbo, February 5, 2009 at 10:21 AM
@ Greymist:
"I swear there are nights where my d20s never see the high side of 10!"

Maybe you accidentally used some 0-9 twice d20s. ;>] I've had to check some of mine lately, as I too have been rolling for @#$%.

@ Delta:
Nice post. I am always curious about the fairness of certain dice, and those that are favored by folks. Too bad I hated probs and stats back in college. Math minor here, but man did I struggle with P&S....ugh. Nice work though. Thanks for the info and something to try next time my dice blow.
Delta, February 5, 2009 at 10:26 AM
"So how much empirical testing did you put this through? I presume you tested some number of fair dice, but did you test it on an intentionally weighted die?"

No, none whatsoever. The math theory is really not subject to empirical testing; if you think there are gaps in the mathematical reasoning I'd like to hear about them.

Logically, if you were going to really "test the test" you'd have to know exactly how weighted the die was in the first place, but the only way to generate a solid theory about that is to use a statistical test exactly like this one. So you'd be right back where you started.

But, I'd be interested in hearing exactly how crooked your smashed-up-glued-together die is. :)
K. Bailey, February 14, 2009 at 2:25 AM
I've always found the statistical holiness of 5% confidence to be amusing, considering that's the chance of rolling a natural 20 (or, I guess a 1 would be more appropriate).
Anonymous, August 1, 2010 at 12:31 PM
Ok, lookie here. I barely got out of math class. I want to test some six siders. Can you add a comment here with an example of the actual math used, not the formula?

Yes, I know this is dumb. I just want to make sure I'm doing it properly since I was a history major lol.
thanks,
Musashi
Paul, August 2, 2010 at 3:26 PM
@mylittlesoldier, here's the work shown for a d6 I just tested.

Roll the d6 thirty times, keep a tally of the number of times you see each face:

1: six times, 2: five times, 3: five times, 4: seven times, 5: four times, 6: three times.

Subtract five from each of those occurrences, so I get: 1, 0, 0, 2, -1, -2.

Square each of those, so I get: 1, 0, 0, 4, 1, 4.

Sum those numbers and I get 10. Ten is less than 55, so this is probably a fair-rolling d6.
Delta, August 4, 2010 at 2:44 AM
Paul is, of course, correct. For another view, here's a shot of my working notepaper from the last time I was testing dice (in this case a d20).

You can see that I rolled the d20 100 times and tallied the results each time (columns 1-3). Then I subtracted 5 (the ideal tally for each face, 100/20) from each of those (column 4). Then I squared those and added them all up (column 5). Since this came in below 150, I accepted the die as most likely reasonably balanced.

If you're testing a d6, do the exact same procedure but just roll it 30 times -- at the end, accept it as balanced if the total is 55 or less.
Unknown, October 15, 2010 at 6:25 PM
Ok, since I had a little time to spare and two new d20s lying around, I put them to the test. One was a Pegasus Oblivion Yellow, the other one a Q-Workshop Black & White Forest. 100 rolls as you suggested. The Oblivion got a result of 180, the Forest 80. Actually I suspected the Forest to be more inaccurate, since it is heavily carved. That was interesting. Also my cat is now interested in playing with dice too :-)
Mel, June 26, 2011 at 10:14 AM
Just curious, but what if the die is biased at only a minimal number of sides? Your test shows that a *series* is fitting a prediction, but does it capture *individual* deviations from the expected series? Take your d20. Out of 100 rolls you only got 1 four. What if this bias were to continue, while all other sides remained very close to the expected outcome? I guess I'm asking, what happens if the deviation at a single value is spread equally among all other members of the series? Or is this simply an example of a Type II error?
Delta, June 27, 2011 at 2:55 AM
Hi, Mel -- To make a long story short: yes, Pearson's chi-square test "captures individual deviations". The squaring operation in the algorithm means that it becomes increasingly sensitive to deviations as the sample size goes up. For example:

Having one face appear only once over 100 rolls is not really that surprising. The chance for that to happen, for a given face on a fair d20, is (by the binomial formula): P(X=x) = nCx*p^x*(1-p)^(n-x) = 100C1*0.05^1*(1-0.05)^(100-1) = 100*0.05*0.95^99 = 100*0.05*0.006 = 0.03 = 3%. So the chance that one of 20 faces does this is roughly 20*3% = 60% (approximating here to avoid a much tougher calculation).
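As a quick check on the arithmetic above, the binomial term can be computed directly. This is a minimal sketch; the helper name `binomial_pmf` is just for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Chance that one particular face of a fair d20 shows exactly once in 100 rolls.
p_once = binomial_pmf(1, 100, 0.05)
print(f"{p_once:.4f}")
```

The result comes out at roughly 3%, in line with the hand calculation.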

Let's say we bump up the number of rolls to 1,000 (E=50 per face) and keep the sampling ratios as you suggest: say one face gets 10 appearances -- in fact, let's say 12 to be generous and keep the math easy -- while the other rolls are spread evenly, 52 each for the other 19 faces. (Note 1x12+19x52 = 12+988 = 1000). There's a lesser, but still nonzero chance of this happening on a fair die (I'll let you work that out if you want.)

But if this is what we observed in the experiment and applied Pearson's chi-square test, then we'd get a sum-squared error of SSE = sum(O-E)^2 = 19*(52-50)^2+(12-50)^2 = 19*(2)^2+(-38)^2 = 19*4+1444 = 76+1444 = 1520. And so this would radically fail our test (remember, limit of just 150 for a d20)!

To recap: The test is increasingly sensitive to deviations like you're talking about over much larger trials (as is the case for any statistical hypothesis testing procedure).
Mel, June 27, 2011 at 6:01 PM
Thanks for your response! Let me add that I enjoy reading your blog. Your post on normal distributions and variance (as well as your discussion of range penalties for archery) was one of my favorite posts in OSR blogdom. Really got me thinking!
Delta, June 28, 2011 at 1:57 AM
Thanks for the kind words! :D
Rezanah, July 2, 2011 at 8:09 PM
@ Delta: Commenting on what you said below:

"And so this would radically fail our test (remember, limit of just 150 for a d20)!"

The limit increases as the sampling size increases, because the formula is X*E, where X=~30 for a d20 and E=rolls/sides. In the case of 1000 rolls, the limit becomes ~1500. Which falls roughly in line with your 1520 result :)
Delta, July 3, 2011 at 1:47 PM
Holy smoke is that super embarrassing! Thanks for pointing out that glitch, Rezanah. I totally didn't read my own summary closely enough.

But the broader point remains: As sample size (number of rolls) goes up, the probability for a certain proportion of skewed rolls goes down, and the associated test correspondingly gets more sensitive to such fluctuations. Again, the concrete thing you can point to is the squaring operation that makes the sum-squared-error (SSE) explode faster than the sample size.

A more correct example: Say we look at a few different sample sizes (n), each where we're rolling a d20, and one face shows up 1.2% of the time (O1), the others being balanced (O2).

(1) For n=100, E=5, O1=1.2, O2=5.2; so SSE=15.2 < X*E=150.72 -- passing the test easily by a factor of about x10.

(2) For n=1,000, E=50, O1=12, O2=52; so SSE=1,520 ~ X*E=1,507.15 -- very close to the critical value for the test (as noted by Rezanah).

(3) For n=10,000, E=500, O1=120, O2=520; so SSE=152,000 > X*E=15,071.5 -- now failing the test radically by a factor of x10.
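The three cases can be reproduced with a short Python sketch. The function name `skewed_d20_case` and its parameters are made up for illustration; the 1.2% share and the 30.143 table value are the ones used in the cases listed here.

```python
X_D20 = 30.143  # chi-square table value for df = 19 at the 5% level


def skewed_d20_case(n, low_share=0.012, sides=20):
    """SSE and critical limit when one face gets low_share of n rolls, rest split evenly."""
    expected = n / sides
    o1 = low_share * n                 # observed count on the starved face
    o2 = (n - o1) / (sides - 1)        # observed count on each other face
    sse = (o1 - expected) ** 2 + (sides - 1) * (o2 - expected) ** 2
    return sse, X_D20 * expected


for n in (100, 1_000, 10_000):
    sse, limit = skewed_d20_case(n)
    print(f"n={n}: SSE={sse:.1f}, limit={limit:.2f}, ratio={sse / limit:.2f}")
```

The ratio of SSE to limit grows by a factor of 10 each time n does, since the SSE scales with n squared while the limit scales only with n.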

An Excel spreadsheet of the details can be found here: http://www.superdan.net/download/PearsonSensitivity.xls

In summary: As the sample size gets bigger, the test definitely does get more sensitive to proportional fluctuations (although not quite as fast as I asserted above).

Huge "thank you" to Rezanah, much appreciate the assist. If anyone else sees anything else I need to polish, please likewise tell me...
JohnFarris, July 23, 2011 at 11:48 PM
As I look at the comments to your excellent Feb 4, 2009 article I finally understand that for the d6 30 roll test the SSE limit is 55.35. If the number of test rolls were to increase to 60 or 300 rolls then the SSE limit would increase proportionally to 110.7 and 553.5 respectively. Am I correct?

Similarly, in the original article you mention a graph to show the possibility of a crooked die. I would be glad to see it if readily available and you think useful.

Great article!!
Delta, July 24, 2011 at 12:37 AM
JohnF said: "If the number of test rolls were to increase to 60 or 300 rolls then the SSE limit would increase proportionally to 110.7 and 553.5 respectively. Am I correct?"

Hey JohnF, you have it exactly right. (Even though I embarrassingly forgot that myself up in my 6/27 comment.) And you're on-target that the usefulness of increasing the die-rolls is to make the test more sensitive, i.e., greater "power" which could be shown in a graph.

At the moment, I don't have that graph. Let me think about that for a bit. Thanks for the kind words!
Delta, July 24, 2011 at 2:16 AM
Actually, I think I was being flippant about the power-curve graphs, because (a) that's, like, real hard, and (b) I'm not sure how to do it immediately for a chi-squared test. I will continue to think about how to do that.
starwed, July 10, 2012 at 2:06 PM
Hmm, I think this is actually a fairly meaningless test until you determine with what probability an *unfair* die will pass the test.

You've shown how often a fair die will pass the test, but that doesn't directly translate into a confidence that the die is fair -- a die with a very, very slight unfairness will pass the test only slightly less often. As the unfairness increases, it'll be less and less likely to pass, so the test can discriminate between a perfectly balanced die and a blatantly unbalanced one. But to be useful you definitely need to know what the sensitivity really is.
Delta, July 11, 2012 at 1:07 AM
@ starwed: What you're talking about is the "power" of the test, as mentioned above. See the last link at the very end of the blog post for a complete treatment of that.
News Caster, January 5, 2013 at 8:04 PM
Hi, Delta! Thanks for this post! But I was reading through a post in boardgamegeek.com (http://boardgamegeek.com/thread/576612/testing-dice-for-fairness) and someone brought up the following point: "There are practically no games where the imbalance of a standard die would significantly influence the game. There are just too few die rolls." What do you think? Should I keep using my unbalanced die or buy casino ones?
Lance, January 23, 2013 at 10:13 PM
Thanks for this. I just built an Excel Spreadsheet for testing my D20's. I have uploaded a copy here if others want to play around with it as well.

D20 Testing Spreadsheet (Google Docs) - https://docs.google.com/spreadsheet/ccc?key=0Aik7Xd__ctlTdE03ZkpqMXByQ1dkX1FEcnE4ajU3Ync
Unknown, December 5, 2014 at 11:55 AM
How would this work for dice with non-uniform distribution, such as “average” dice which have sides 2-3-3-4-4-5 (to avoid the extreme 1 and 6 results)?
Nayara Costa, June 17, 2015 at 8:02 PM
How can I translate this formula to another die? For example, what is the total expected for a D10?
Dan, January 29, 2016 at 10:01 AM
While you (rightfully) cautioned readers about the limitations of the chi-squared Goodness of Fit test in the follow-up article, I would like to say that as a statistician I do think there's still some practical usefulness to lower sample size tests like the ones you propose here. Not quite so low - I'd suggest about twice as many, so 60 rolls of a d6 or 300 rolls to test a d20 - but on the same order of magnitude.

While this isn't extremely powerful, it's enough to find gross imbalances (e.g., a d20 that rolls below 10 three-quarters of the time) while being quick enough to test a large number of dice on a lazy Sunday afternoon. Making an analogy to professional statistical analysis, using a lower sample size keeps the "cost" of doing the test low, which is desirable if you don't require the extra precision granted by a larger sample. Also, bear in mind that if you're willing to discard a die more easily (say, a confidence level of .90 instead of .95, i.e., an alpha of .10 instead of .05) then the power increases - and after all, they're only plastic pieces in a game, so accidentally putting an "okay" die into the "unbalanced" pile isn't the end of the world.
Unknown, February 29, 2016 at 12:25 AM
hello there
I tried your formula but I hit some blocks on the way.

I tried following the steps for the D6
I kept tabs on the results of the die for 30 rolls.
Then I subtracted 5 from each of those results,
then I squared the result of each subtraction,
then I summed the squares.
For one die I got a result of 122.

so is something wrong in the formula or the dice?