Stats problem that some might find interesting.

Thread Starter

WBahn

Joined Mar 31, 2012
32,702
Lately I've been playing an online game (a variant of Sudoku) and I've been capturing each puzzle after I have finished it. Occasionally, I am presented with a puzzle that I've already captured. So, naturally, this got me to wondering how many different puzzles the site is drawing from. It's happened enough times that I suspect the pool is actually pretty limited.

For regular Sudoku, there are about 5.5 billion essentially different grids if you exclude rotations, reflections, column/row reordering, relabeling, etc., but since I am not looking at these things when I compare a new puzzle to my set of captured puzzles, I should be looking at well over 10^21 different grids. So even getting a single repeat should basically never happen unless the pool is a tiny, tiny fraction of the true solution space.

But how can I estimate its size, and what's the best way to do so?

I see two methods. The first is to look at the fraction of repeats I have seen and calculate the pool size that would give me some threshold probability of seeing that many repeats. Depending on the threshold used, I could associate that estimate with a level of confidence that the true value was no larger than the estimate.

That would be an essentially static model looking at a single data point. But if I were to track WHEN the repeats occurred, it seems like I could get a much better estimate. As my collection of captured puzzles grows, the frequency of repeats should increase. So if I look at the spacing between repeats, that spacing should drop as time goes on and the rate at which it drops should give me an estimate of the total pool size.
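The first (static) method can be sketched in a few lines, assuming uniform random draws with replacement. The function names, the bisection bounds, and the cap of 10^6 are all my own placeholder choices, not anything known about the site:

```python
def expected_distinct(pool_size, draws):
    """E[# of distinct puzzles] after `draws` uniform draws with
    replacement from a pool of `pool_size` puzzles."""
    return pool_size * (1.0 - (1.0 - 1.0 / pool_size) ** draws)

def estimate_pool_size(draws, distinct, hi=10**6):
    """Smallest pool size whose expected distinct count reaches the
    observed count, found by bisection (expected_distinct increases
    with pool size). If no repeats were seen at all, the estimate is
    unbounded and the search simply returns the cap `hi`."""
    lo = distinct  # the pool is at least as big as what we've seen
    if expected_distinct(lo, draws) >= distinct:
        return lo
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if expected_distinct(mid, draws) < distinct:
            lo = mid
        else:
            hi = mid
    return hi
```

For example, seeing only 43 distinct puzzles in 100 draws points to a pool of roughly 50. Attaching a confidence level as described would need the full distribution of the distinct count, not just its mean, but the point estimate is a start.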

My guess is that this is not an overly difficult problem. More to the point, I suspect it is probably one that has been extensively studied because I can see the potential for applications that need to make estimates in various rare-event scenarios.

I might throw together a Python script to generate a stream of strings randomly chosen from a pool of N strings, and another script to analyze the stream to try to determine what N is and see how quickly the estimate converges on the correct value.
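A minimal version of that generator might look like the sketch below; the pool size of 450 and the seed are arbitrary placeholders, and the model is again uniform draws with replacement:

```python
import random

def simulate_stream(pool_size, num_draws, seed=1):
    """Draw num_draws puzzle IDs uniformly (with replacement) from a
    pool of pool_size distinct IDs; return the draw indices at which
    a repeat of an already-seen puzzle occurred."""
    rng = random.Random(seed)
    seen, repeats = set(), []
    for i in range(num_draws):
        puzzle = rng.randrange(pool_size)
        if puzzle in seen:
            repeats.append(i)
        else:
            seen.add(puzzle)
    return repeats

# The gaps between successive repeats should shrink as the
# collection of captured puzzles grows.
repeats = simulate_stream(450, 300)
gaps = [b - a for a, b in zip(repeats, repeats[1:])]
```

The analysis side would then fit N to the observed repeat times; the shrinking-gap behavior is exactly the signal the second (dynamic) method would exploit.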

Anyone interested in doing the same and comparing the rates of convergence?
 
Last edited:

BobTPH

Joined Jun 5, 2013
11,463
I can’t see coming up with any certain result unless you know how the puzzle is chosen. It sounds like you are assuming random selection from a small set of possibilities, but this may not be the case. Possibly, it is generating the puzzles from an algorithm and not all puzzles are equally likely.
 

Futurist

Joined Apr 8, 2025
720
According to Wikipedia there are about 6 x 10^21 possible Sudoku arrangements, and after filtering out rotations etc, that drops to about 5 x 10^9. So where are you getting this 10^23 from @WBahn ?

This is an excellent thing to "discuss" with Copilot too.

Also, regarding programming stuff for this, it strikes me as very well suited to a functional language.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,702
I can’t see coming up with any certain result unless you know how the puzzle is chosen. It sounds like you are assuming random selection from a small set of possibilities, but this may not be the case. Possibly, it is generating the puzzles from an algorithm and not all puzzles are equally likely.
As is often (as in, almost always) the case, assumptions have to be made and those assumptions should be laid out. When possible, it's then a good idea to test those assumptions. For instance, if puzzles are not equally likely, then that should show up, eventually, as some puzzles being chosen disproportionately often. Of course, it may require a lot more data than is feasible to collect to test that hypothesis with any confidence.
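One rough way to eyeball that assumption from the captured data is to tally how many puzzles were seen once, twice, three times, and so on; this is just a sketch of the idea, not a formal test:

```python
from collections import Counter

def repeat_profile(stream):
    """Tally how many puzzles appeared once, twice, three times, ...
    Under uniform draws this tail falls off smoothly; one puzzle
    showing up far more often than the rest hints at biased draws."""
    per_puzzle = Counter(stream)
    return Counter(per_puzzle.values())

# e.g. a stream in which one puzzle ('A') is heavily favored:
profile = repeat_profile(list("ABCADAEAFA"))  # five singletons, one 5-peat
```

A proper chi-square goodness-of-fit test against the uniform hypothesis would be the next step, but as noted above, it takes a lot of data before such a test has any power.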
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,702
According to Wikipedia there are about 6 x 10^21 possible Sudoku arrangements, and after filtering out rotations etc, that drops to about 5 x 10^9. So where are you getting this 10^23 from @WBahn ?

This is an excellent thing to "discuss" with Copilot too.

Also, regarding programming stuff for this, it strikes me as very well suited to a functional language.
That was a typo; I meant to (and thought I had) type 10^21 (as an approximation to 6.67E21). Thanks for catching it.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,702
I've read that a "true" sudoku puzzle is symmetrical. None of the apps or online games I've seen seem to care about that and appear more random. Requiring symmetry dramatically reduces the pool.

https://en.wikipedia.org/wiki/Mathematics_of_Sudoku
I haven't seen the claim that a sudoku puzzle has to be symmetric to be somehow "true" -- I guess that would depend on what the original definition was. I've always seen the ordinary puzzle with no symmetry requirement referred to as either "proper" or "classic" sudoku, with various subcategories imposing additional constraints, such as the diagonals also being a complete set of symbols, or symmetries of one kind or another. Even just saying that a puzzle is symmetric leaves a lot of ambiguity, since there are 26 different kinds of symmetry, some of which are not obvious upon casual inspection.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,702
I've been tracking data on repeats for about a month now. I haven't run an analysis yet but my gut feel is that their pool only has about four to five hundred different puzzles (assuming uniform random draws from the pool). I've seen some behavior that I think challenges the notion of randomness in the draws, but I'm not at all sure how strong it is.

But it got me to thinking about a separate puzzle challenge that I'll throw out.

What is the minimum number of bits needed to encode a Sudoku layout (the final filled-in grid, not the clues)?

If you want to participate, I ask that you do the following:

State how many bits are needed, then justify that number by describing how your encoding works.

Feel free to respond to another poster's approach by pointing out how it might be possible to reduce the number of bits by taking into account something that was overlooked (you don't have to be quantitative about that).

This is almost certainly a solved problem, so I ask that no one go hunt down the answer and post it. Only post what you come up with on your own, though building on prior offered approaches is completely legitimate.

I'll start off with a few approaches. The first is the worst brute-force approach, to establish an upper bound. The other two are still pretty brute-force, but should lower the upper limit somewhat.

BRUTE FORCE: 324 bits
Simply encode each of the 81 symbols using a 4-bit value, ordered from top-left to bottom-right in row major fashion.

There's a lot of low hanging fruit to trim some bits off. For instance, the last row doesn't need to be encoded at all, since the prior eight uniquely determine it, so that would eliminate 36 bits and get you down to 288 bits right there.
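The last-row trick can be sketched in one line; the function name is my own, and it relies only on the fact that every completed column contains 1..9 exactly once:

```python
def infer_last_row(first_eight_rows):
    """Every column of a completed grid contains 1..9 exactly once and
    therefore sums to 45, so the missing ninth entry of each column is
    45 minus that column's sum over the first eight rows."""
    return [45 - sum(col) for col in zip(*first_eight_rows)]
```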

AUTOCOMPRESSION: 168 bits
To encode a sequence of nine symbols, the first symbol takes 4 bits (nine possibilities). With eight symbols left, the next four symbols take 3 bits each (eight, seven, six, and five possibilities). The next two take 2 bits each (four and three possibilities), the next takes 1 bit (two possibilities), and the final symbol is fully determined. That's 4 + 4(3) + 2(2) + 1 = 21 bits per row (or column, or 3x3 block), and only eight of the nine rows need be encoded, requiring 168 bits.

PERMUTATIONS: 152 bits
Each row (or column, or 3x3 block) is a unique permutation of nine symbols, of which there are 9! = 362,880 such permutations. So, assuming that you can algorithmically go from an integer from 0 to 362,879 to the corresponding permutation, it would require 19 bits (2^19 = 524,288) to represent a permutation, and you would need 8 such values (the ninth being determined by the prior 8), which would only require 152 bits total.

Note that this isn't a fully-articulated encoding, since I need to establish how to go from the permutation serial number to the actual permutation it maps to, but I think I have a good idea how it can be done.
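One standard way to do that mapping (not necessarily what the original poster has in mind) is the factorial number system, often called the Lehmer code; a sketch:

```python
from math import factorial

def perm_rank(perm):
    """Rank of a permutation in lexicographic order, 0 to n!-1.
    For n = 9, ranks fit in 19 bits since 9! = 362880 < 2^19."""
    symbols = sorted(perm)
    rank = 0
    for sym in perm:
        idx = symbols.index(sym)
        rank += idx * factorial(len(symbols) - 1)
        symbols.pop(idx)
    return rank

def perm_unrank(rank, n=9):
    """Inverse mapping: serial number back to the permutation of 1..n."""
    symbols = list(range(1, n + 1))
    perm = []
    while symbols:
        f = factorial(len(symbols) - 1)
        idx, rank = divmod(rank, f)
        perm.append(symbols.pop(idx))
    return perm
```

Eight ranked rows at 19 bits apiece give the 152-bit total claimed above.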

I have some ideas on how to squeeze it down further, but they are going to require that I think about it quite a bit.
 