Lately I've been playing an online game (a variant of Sudoku) and I've been capturing each puzzle after I have finished it. Occasionally, I am presented with a puzzle that I've already captured. So, naturally, this got me to wondering how many different puzzles the site is drawing from. It's happened enough times that I suspect the pool is actually pretty limited.
For regular Sudoku, there are about 5.5 billion essentially different grids, if you exclude rotations, reflections, column/row reordering, relabeling, etc., but since I am not looking at these things when I compare a new puzzle to my set of captured puzzles, I should be looking at roughly 6.7 x 10^21 different grids. So even getting a single repeat should basically never happen unless the pool is a tiny, tiny fraction of the true solution space.
But how can I estimate its size, and what's the best way to do so?
I see two methods. The first is to look at the fraction of repeats I have seen and calculate the pool size that would give me some threshold probability of seeing that many repeats. Depending on the threshold used, I could associate that estimate with a level of confidence that the true value was no larger than the estimate.
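For the first method, a simple point estimate (leaving aside the confidence-threshold part) falls out of the birthday problem: with k draws from a pool of N, the expected number of repeats is roughly k(k-1)/(2N) as long as repeats are rare. Inverting that gives a method-of-moments estimate. A minimal sketch (the function name is mine):

```python
def pool_size_estimate(num_draws: int, num_repeats: int) -> float:
    """Method-of-moments estimate from a birthday-type collision count:
    E[repeats] ~ k*(k-1)/(2*N) for k draws, valid when repeats are rare,
    so N ~ k*(k-1)/(2*repeats)."""
    if num_repeats == 0:
        # With zero repeats N can only be lower-bounded, not estimated.
        raise ValueError("no repeats observed; N can only be lower-bounded")
    return num_draws * (num_draws - 1) / (2 * num_repeats)


# e.g. 1000 captured puzzles with 5 repeats suggests a pool of ~100,000:
# pool_size_estimate(1000, 5)  -> 99900.0
```

Note the approximation only holds while k is small compared to N; once repeats are common it overestimates.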
That would be an essentially static model looking at a single data point. But if I were to track WHEN the repeats occurred, it seems like I could get a much better estimate. As my collection of captured puzzles grows, the frequency of repeats should increase. So if I look at the spacing between repeats, that spacing should drop as time goes on and the rate at which it drops should give me an estimate of the total pool size.
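One way to formalize the sequential idea is to write down the likelihood of the whole repeat/no-repeat sequence and maximize it over N: at any draw, if d distinct puzzles have been seen so far, the chance of a repeat is d/N and of a new puzzle (N-d)/N. A rough sketch of the resulting maximum-likelihood estimate (the names and the 50x search range are my own arbitrary choices, and it assumes at least one repeat has occurred):

```python
from math import log

def neg_log_likelihood(pool_size: int, num_draws: int, num_distinct: int) -> float:
    """Negative log-likelihood of seeing num_distinct distinct items in
    num_draws uniform draws from a pool of pool_size items. Each new-item
    draw contributes log(1 - j/pool_size) for j = 0..d-1; each of the
    (num_draws - num_distinct) repeat draws contributes -log(pool_size),
    up to a constant that does not depend on pool_size."""
    ll = -(num_draws - num_distinct) * log(pool_size)
    for j in range(num_distinct):
        ll += log(1 - j / pool_size)
    return -ll

def mle_pool_size(num_draws: int, num_distinct: int) -> int:
    """Coarse integer search for the pool size minimizing the negative
    log-likelihood. Assumes at least one repeat was seen; otherwise the
    optimum just sits at the (arbitrary) top of the search range."""
    candidates = range(num_distinct, 50 * num_distinct)
    return min(candidates, key=lambda n: neg_log_likelihood(n, num_draws, num_distinct))
```

For example, 200 draws with 180 distinct puzzles (20 repeats) lands the estimate somewhere in the high hundreds.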
My guess is that this is not an overly difficult problem. More to the point, I suspect it is probably one that has been extensively studied because I can see the potential for applications that need to make estimates in various rare-event scenarios.
I might throw together a Python script to generate a stream of strings randomly chosen from a pool of N strings, and another script to analyze the stream to try to determine what N is and see how quickly the estimate converges on the correct value.
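A sketch of that experiment, assuming the analyzer estimates N after every draw by numerically inverting the expected distinct count E[distinct] = N*(1 - (1 - 1/N)^k), which stays valid even once repeats are common (all names are mine):

```python
import random

def simulate_stream(pool_size, num_draws, seed=0):
    """Draw uniformly with replacement from a pool of pool_size items
    (items are just integers here; real puzzles would be strings)."""
    rng = random.Random(seed)
    return [rng.randrange(pool_size) for _ in range(num_draws)]

def invert_distinct(num_draws, num_distinct):
    """Solve E[distinct] = N * (1 - (1 - 1/N)^k) for N by bisection.
    Returns inf when every draw was distinct (N is then unbounded)."""
    if num_distinct >= num_draws:
        return float("inf")
    lo, hi = float(num_distinct), 1e15
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid * (1 - (1 - 1 / mid) ** num_draws) < num_distinct:
            lo = mid  # expected distinct too low -> pool must be bigger
        else:
            hi = mid
    return (lo + hi) / 2

def running_estimates(stream):
    """Re-estimate N after every draw to watch the convergence."""
    seen = set()
    estimates = []
    for k, item in enumerate(stream, start=1):
        seen.add(item)
        estimates.append(invert_distinct(k, len(seen)))
    return estimates
```

With, say, a pool of 1000 and 2000 draws, the final estimate comes out close to 1000; plotting the running estimates would show how fast it settles down.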
Anyone interested in doing the same and comparing the rates of convergence?