Stats problem that some might find interesting.

Thread Starter

WBahn

Joined Mar 31, 2012
32,776
Lately I've been playing an online game (a variant of Sudoku) and I've been capturing each puzzle after I have finished it. Occasionally, I am presented with a puzzle that I've already captured. So, naturally, this got me to wondering how many different puzzles the site is drawing from. It's happened enough times that I suspect the pool is actually pretty limited.

For regular Sudoku, there are a little over 5 billion unique grids if you exclude rotations, reflections, column/row reordering, relabeling, etc., but since I am not looking at these things when I compare a new puzzle to my set of captured puzzles, I should be looking at well over 10^21 different grids. So even getting a single repeat should basically never happen unless the pool is a tiny, tiny fraction of the true solution space.

But how can I estimate its size, and what's the best way to do so?

I see two methods. The first is to look at the fraction of repeats I have seen and calculate the pool size that would give me some threshold probability of seeing that many repeats. Depending on the threshold used, I could associate that estimate with a level of confidence that the true value was no larger than the estimate.

That would be an essentially static model looking at a single data point. But if I were to track WHEN the repeats occurred, it seems like I could get a much better estimate. As my collection of captured puzzles grows, the frequency of repeats should increase. So if I look at the spacing between repeats, that spacing should drop as time goes on and the rate at which it drops should give me an estimate of the total pool size.

My guess is that this is not an overly difficult problem. More to the point, I suspect it is probably one that has been extensively studied because I can see the potential for applications that need to make estimates in various rare-event scenarios.

I might throw together a Python script to generate a stream of strings randomly chosen from a pool of N strings, and another script to analyze the stream to try to determine what N is and see how quickly the estimate converges on the correct value.
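
Something along these lines, as a first sketch (the estimator here is just the crude birthday-collision approximation N ~ M(M-1)/(2C), which only holds while repeats are still rare):

Code:
import random

def generate_stream(pool_size, num_draws, seed=0):
    """Simulate the site: each game is a uniform random draw
    (with replacement) from a fixed pool of puzzle IDs."""
    rng = random.Random(seed)
    return [rng.randrange(pool_size) for _ in range(num_draws)]

def estimate_pool_size(stream):
    """Estimate N from repeats: with M draws and C collisions,
    E[C] ~ M(M-1)/(2N) while repeats are rare, so N ~ M(M-1)/(2C)."""
    seen = set()
    collisions = 0
    for puzzle in stream:
        if puzzle in seen:
            collisions += 1
        else:
            seen.add(puzzle)
    m = len(stream)
    if collisions == 0:
        return None  # no repeats yet; we can only lower-bound N
    return m * (m - 1) / (2 * collisions)

for m in (50, 100, 200, 400, 800):
    print(m, estimate_pool_size(generate_stream(450, m, seed=m)))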

Anyone interested in doing the same and comparing the rates of convergence?
 
Last edited:

BobTPH

Joined Jun 5, 2013
11,488
I can’t see coming up with any certain result unless you know how the puzzle is chosen. It sounds like you are assuming random selection from a small set of possibilities, but this may not be the case. Possibly, it is generating the puzzles from an algorithm and not all puzzles are equally likely.
 

Futurist

Joined Apr 8, 2025
734
According to Wikipedia there are about 6 x 10^21 possible Sudoku arrangements, and after filtering out rotations etc, that drops to about 5 x 10^9. So where are you getting this 10^23 from @WBahn ?

This is an excellent thing to "discuss" with Copilot too.

Also, regarding programming stuff for this, it strikes me as very well suited to a functional language.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,776
I can’t see coming up with any certain result unless you know how the puzzle is chosen. It sounds like you are assuming random selection from a small set of possibilities, but this may not be the case. Possibly, it is generating the puzzles from an algorithm and not all puzzles are equally likely.
As is often (as in, almost always) the case, assumptions have to be made and those assumptions should be laid out. When possible, it's then a good idea to test those assumptions. For instance, if puzzles are not equally likely, then that should show up, eventually, as the same puzzle being chosen multiple times. Of course, it may require a lot more data than is feasible to collect to test that hypothesis with any confidence.
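
For instance, a quick simulation along these lines (a sketch; the 10x-weighted subset is just an arbitrary example of non-uniformity) shows skewed draws producing noticeably more repeats than uniform draws from a pool of the same size:

Code:
import random
from collections import Counter

def count_repeats(draws):
    """Number of draws that duplicated an already-seen puzzle."""
    return sum(c - 1 for c in Counter(draws).values() if c > 1)

rng = random.Random(1)
pool = list(range(500))
# Uniform draws versus a skewed pool where 50 puzzles are 10x more likely.
weights = [10 if p < 50 else 1 for p in pool]

uniform = [rng.choice(pool) for _ in range(300)]
skewed = rng.choices(pool, weights=weights, k=300)

print("uniform repeats:", count_repeats(uniform))
print("skewed repeats: ", count_repeats(skewed))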
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,776
According to Wikipedia there are about 6 x 10^21 possible Sudoku arrangements, and after filtering out rotations etc, that drops to about 5 x 10^9. So where are you getting this 10^23 from @WBahn ?

This is an excellent thing to "discuss" with Copilot too.

Also, regarding programming stuff for this, it strikes me as very well suited to a functional language.
That was a typo. I meant to (and thought I had) type 10^21 (as an approximation to 6.67E21). Thanks for catching it.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,776
I've read that a "true" sudoku puzzle is symmetrical. None of the apps or online games I've seen seem to care about that and appear more random. Requiring symmetry dramatically reduces the pool.

https://en.wikipedia.org/wiki/Mathematics_of_Sudoku
I haven't seen the claim that a sudoku puzzle has to be symmetric to be somehow "true" -- I guess that would depend on what the original definition was. I've always seen the ordinary puzzle with no symmetry requirement referred to as either "proper" or "classic" sudoku, with various subcategories imposing additional constraints, such as the diagonals also being a complete set of symbols or symmetries of one kind or another. Even just saying that a puzzle is symmetric leaves a lot of ambiguity, since there are 26 different kinds of symmetry, some of which are not obvious upon casual inspection.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,776
I've been tracking data on repeats for about a month now. I haven't run an analysis yet but my gut feel is that their pool only has about four to five hundred different puzzles (assuming uniform random draws from the pool). I've seen some behavior that I think challenges the notion of randomness in the draws, but I'm not at all sure how strong it is.

But it got me to thinking about a separate puzzle challenge that I'll throw out.

What is the fewest number of bits needed to encode a Sudoku layout (the final filled-in puzzle, not the clues)?

If you want to participate, I ask that you do the following:

State how many bits are needed, then justify that number by describing how your encoding works.

Feel free to respond to another poster's approach by pointing out how it might be possible to reduce the number of bits by taking something that was overlooked into account (you don't have to be quantitative about that).

This is almost certainly a solved problem, so I ask that no one go hunt down the answer and post it. Only post what you come up with on your own, though building on prior offered approaches is completely legitimate.

I'll start off with a few approaches. The first is a worst-case brute-force approach to establish an upper bound. The other two are still pretty brute-force, but should lower the upper limit somewhat.

BRUTE FORCE: 324 bits
Simply encode each of the 81 symbols using a 4-bit value, ordered from top-left to bottom-right in row major fashion.

There's a lot of low hanging fruit to trim some bits off. For instance, the last row doesn't need to be encoded at all, since the prior eight uniquely determine it, so that would eliminate 36 bits and get you down to 288 bits right there.

AUTOCOMPRESSION: 168 bits
To encode a sequence of nine symbols, the first symbol can be encoded with 4 bits. That leaves eight candidate symbols, so the next four symbols can each be identified with just 3 bits (pools of 8, 7, 6, and 5). The next two can be encoded with 2 bits each (pools of 4 and 3), the one after that with a single bit (pool of 2), and the remaining symbol is then determined. This requires 21 bits per row (or column, or 3x3 block), and only eight of the nine rows need be encoded, requiring 168 bits.
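
A sketch of the per-position bookkeeping (a hypothetical helper, not a full codec) that shows where the 21 bits per row come from:

Code:
import math

def row_field_widths(row):
    """Index each digit within the shrinking pool of unused digits.
    Bit widths per position are 4,3,3,3,3,2,2,1 (last digit is free)."""
    pool = sorted(row)
    fields = []
    for digit in row[:-1]:           # the ninth digit is determined
        width = math.ceil(math.log2(len(pool)))
        fields.append((pool.index(digit), width))
        pool.remove(digit)
    return fields

fields = row_field_widths([8, 6, 4, 7, 2, 9, 3, 5, 1])
print(sum(w for _, w in fields))     # 21 bits per row, x8 rows = 168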

PERMUTATIONS: 152 bits
Each row (or column, or 3x3 block) is a unique permutation of nine symbols, of which there are 9! = 362880 such permutations. So, assuming that you can algorithmically go from an integer from 0 to 362879 to the corresponding permutation, it would require 19 bits to represent a permutation and you would need 8 such values (the ninth being determined by the prior 8), which would only require 152 bits total.

Note that this isn't a fully-articulated encoding, since I need to establish how to go from the permutation serial number to the actual permutation it maps to, but I think I have a good idea how it can be done.
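
One standard way to do that mapping is the factorial number system (a Lehmer code); a minimal sketch, not necessarily the scheme I'll end up using:

Code:
from math import factorial

def unrank_permutation(index, symbols):
    """Map an integer in [0, n!) to a unique permutation of symbols."""
    pool = list(symbols)
    out = []
    for position in range(len(pool) - 1, -1, -1):
        digit, index = divmod(index, factorial(position))
        out.append(pool.pop(digit))
    return "".join(out)

def rank_permutation(perm, symbols):
    """Inverse: permutation -> integer in [0, n!)."""
    pool = list(symbols)
    index = 0
    for position, ch in zip(range(len(pool) - 1, -1, -1), perm):
        index += pool.index(ch) * factorial(position)
        pool.remove(ch)
    return index

p = unrank_permutation(12345, "123456789")
assert rank_permutation(p, "123456789") == 12345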

I have some ideas on how to squeeze it down further, but they are going to require that I think about it quite a bit.
 
I accept your challenge: 101 bits.

I presume, like you said, that multiple folks all smarter than me got better results and published them twenty years ago, but I did this just for my education and entertainment. All original work, done solely by me over the last few days, that I am now sharing. No plagiarism, no AI, no motivation to impress anybody, etc. Plenty of wholesome fun, and I even learned something new. Really just hoping that any reader enjoys this kind of topic as much as I do.

Please find herein discussion and encode/decode examples of my contribution.

External Resources:

  • I downloaded "Enumerating possible Sudoku grids" by Felgenhauer and Jarvis. They report a universe of 6670903752021072936960 valid grids. The log2() of this number is 72.498, so 73 is the best number of bits that one can get. They introduce the terminology of one 9x9 grid made of nine 3x3 blocks.

  • I used qqwing to generate random solution grids:

qqwing --generate --difficulty expert --nopuzzle --solution --compact

  • I variously used bc, python3, and Win7 calc.exe for their ability to handle arbitrary-precision integers.


OK ....... Let's have some fun .........


Here is the layout of the grid. I named the 9 blocks according to compass directions. "O" or "Origin" is in the center.

Code:
+--------+--------+--------+
|        |        |        |
|   NW   |    N   |   NE   |
|        |        |        |
+--------+--------+--------+
|        |        |        |
|   W    |    O   |    E   |
|        |        |        |
+--------+--------+--------+
|        |        |        |
|   SW   |    S   |   SE   |
|        |        |        |
+--------+--------+--------+
The eight blocks that are processed are on the perimeter. The ninth one that is ignored when encoding and synthesized when decoding is O. The raw count of possible grids is (9!)^8.

Processing combines a pair of 4-block paths running clockwise from NW and SE. From NW, go across the top through N and NE then down to E. From SE, go across the bottom through S and SW then up to W. Notice that the paths are like a Knight chess-piece move or that "L" shaped piece in Tetris. This gives the same raw count of ((9!)^4)x((9!)^4); it's just done in two pieces.

Code:
+--------+--------+--------+
|        |        |        |
|   NW ---->  N ----> NE   |
|        |        |    |   |
+--------+--------+----|---+
|        |        |    v   |
|   W    |    O   |    E   |
|   ^    |        |        |
+---|----+--------+--------+
|   |    |        |        |
|   SW <----  S <---- SE   |
|        |        |        |
+--------+--------+--------+

The idea behind any reduction is to find where the known numbers of one group constrain another group. Then, we can infer the constrained group's values from a smaller pool of candidates. This results in not needing 9! possibilities in a block, and allows us to break the 9! of a block into smaller 6! and 3! pieces. The result is a product of many small factors, which is smaller than a product of a few large factors.

The transformation used here maps the 81 digits of a 9x9 grid to a total of (8+8+1)=17 named regions.

The center block, O, is fully constrained. It thus has 0! or 1 possibility.

NW and SE are unconstrained, with 9! possibilities in each. No change here.

Horizontally merge N into NE, and S into SW. This turns each pair of 3x3 blocks into a set of three 6-rows. This gives (6!)^3 instead of (9!)^2, for a reduction of about 350x.

Code:
+--------+--------+--------+
|        |_______N34_______|
|   NW   |_______N2________|
|        |       N1        |
+--------+--------+--------+
|        |        |        |
|   W    |    O   |    E   |
|        |        |        |
+--------+--------+--------+
|________S1_______|        |
|________S2_______|   SE   |
|        S43      |        |
+--------+--------+--------+
Further, consider the two 6-rows of N1 and N2 together. The left half of N1 holds the lower 3 numbers of N; the left half of N2, the middle 3. So, the left half of N34 is constrained to the remaining 3 numbers. Their right halves likewise constrain the right half of N34. So, N34 can be split into a pair of 3-rows, N3 and N4. Doing this converts the 6! of N34 to (3!)x(3!) in its two halves, for a cumulative reduction of 7056x relative to the original (9!)^2 of the two blocks. Similarly split S43.

Code:
+--------+--------+--------+
|        |___N3___|___N4___|
|   NW   |_______N2________|
|        |       N1        |
+--------+--------+--------+
|        |        |        |
|   W    |    O   |    E   |
|        |        |        |
+--------+--------+--------+
|________S1_______|        |
|________S2_______|   SE   |
|   S4   |   S3   |        |
+--------+--------+--------+
After horizontally merging and propagating, all 9 numbers of all top and bottom blocks are known. This leaves only three numbers available for each column of W and E. For example, the leftmost column of W (shown as W3) has three known numbers each above it and below it in the leftmost columns of NW and SW, respectively. This leaves only 3 possible numbers for W3. W then becomes three 3-columns, or (3!)^3, for a reduction of 1680x from 9!. Repeat this on E.

Code:
+--------+--------+--------+
|        |___N3___|___N4___|
|   NW   |_______N2________|
|        |       N1        |
+--+--+--+--------+--+--+--+
|  |  |  |        |  |  |  |
|W3|W2|W1|    O   |E1|E2|E3|
|  |  |  |        |  |  |  |
+--+--+--+--------+--+--+--+
|________S1_______|        |
|________S2_______|   SE   |
|   S4   |   S3   |        |
+--------+--------+--------+
Each L-shaped clockwise path has 8 permutations: one 9!, two 6!, and five 3!.

These give (9!)x((6!)^2)x((3!)^5) or 1462797729792000 per each of two 4-block paths. Square for the total due to both, to get 2139777198284629044363264000000 for the grid. This is a 31-digit decimal number. Its log2() is 100.755, so it requires 101 bits to represent. If the goal is to describe a grid in the fewest number of text characters, then the log36() of 19.489 says that a 20-digit base-36 representation suffices. All of this is 320762715x the number of valid grids, or 28.257 bits greater than theoretical.
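
For anyone who wants to check that arithmetic, a few lines of python3 reproduce it:

Code:
from math import factorial, log, log2

path = factorial(9) * factorial(6) ** 2 * factorial(3) ** 5
total = path ** 2

print(path)            # 1462797729792000 per 4-block path
print(total)           # 2139777198284629044363264000000 for the grid
print(log2(total))     # 100.755..., so 101 bits
print(log(total, 36))  # 19.489..., so 20 base-36 digits
print(total / 6670903752021072936960)  # ~3.2076e+08 x the valid-grid count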

Storing the permutations in discrete form gives vector components of 19, 10, and 3 bits for a total of (1x19)+(2x10)+(5x3) or 54 bits required for each path, or 108 for both. Yes, this is 7 bits larger, but not the end of the world. I do not consider it because it has missing codes. (For example, 6 of the 8 values in 3 bits map to one of the 3! permutations, but the other 2 do not.) I only mention it here to acknowledge any apple/orange comparisons.


OK, so that's the math.


The encoding/decoding uses letters to symbolically represent the actual numeric digits. 6-groups use UPPERCASE; 3-groups, lowercase. This is done solely for clarity. It is best described by using an example:

Consider an example second-top row of digits 649725813. It is split as 649 in the middle row of NW, and 725813 in N2.​
  • It does not matter to the "permutation math" what the digits of N2 are, but only which of the 6! permutations of those 6 numerals is present.
  • I use "A" to represent the smallest numeral of N2, and "F" for the largest. This gives the permutation of "ABCDEF" if all six digits of N2 happen to be in ascending order from left to right, as 123578, and "FEDCBA" represents the descending-order 875321.
  • This example's 725813 maps to the "EBDFAC" permutation (according to the particular mapping that I happened to use). So, the second row appears visually here as 649|EBDFAC.
  • Likewise, "abc" through "cba" describe different orderings of the 3-groups' digits.

Again, permutations can use any symbols to represent a subset of numerals. UPPERCASE and lowercase were chosen here for visual clarity.
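
In code, the whole mapping is a small lookup (a sketch):

Code:
def to_pattern(digits):
    """Map a group of distinct numerals to its symbolic permutation:
    smallest numeral -> 'A', next -> 'B', etc. (lowercase for 3-groups)."""
    letters = "ABCDEF" if len(digits) == 6 else "abc"
    rank = {d: letters[i] for i, d in enumerate(sorted(digits))}
    return "".join(rank[d] for d in digits)

print(to_pattern("725813"))  # EBDFAC
print(to_pattern("729"))     # bac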


To encode:

Note: Block O is processed here, just for completeness in printing the examples. Only the other 8 perimeter blocks require processing.​

  • Distribute the 81 digits of a 9x9 grid to the 17 named regions.

  • Do the domain mapping:

  • The 9-groups (NW, O, and SE) do not need more processing.
  • Map the 6-groups (N1, N2, S1, S2) from numerals to 6-UPPERCASE permutations.
  • Map the 3-groups (N3, N4, E1, E2, E3, S3, S4, W1, W2, W3) from numerals to 3-lowercase permutations.

  • Find the ordinal value of each of the 17 9-numeric, 6-UPPERCASE, and 3-lowercase text strings. The result is a collection of 17 numbers.

  • Combine the 16 numbers (all of the 17 except for O) in the right order to get the final 101-bit number.
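
The combine step is mixed-radix packing. A sketch, using the group order from the table in the encode example below (that ordering reproduces the example's base-10 value):

Code:
def pack(values, bases):
    """Mixed-radix packing: fold per-group ordinals into one integer,
    most significant group first."""
    n = 0
    for v, b in zip(values, bases):
        n = n * b + v
    return n

def unpack(n, bases):
    """Inverse of pack(); used by the decode side."""
    out = []
    for b in reversed(bases):
        n, v = divmod(n, b)
        out.append(v)
    return list(reversed(out))

# Order: NW N1 N2 N3 N4 E1 E2 E3 SE S1 S2 S3 S4 W1 W2 W3
bases = [362880, 720, 720, 6, 6, 6, 6, 6] * 2
vals = [309996, 417, 481, 2, 3, 4, 0, 5,
        289810, 545, 222, 1, 3, 2, 5, 2]
n = pack(vals, bases)
print(n)               # 1827942056972749719918906854720
print(n.bit_length())  # 101
assert unpack(n, bases) == vals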

To decode:

Note: Block O is not supplied; it will be synthesized in the last step.​

  • Break the 101-bit number into its 16 constituent numerical parts.

  • Translate each of the 16 numbers to its associated permutation text string.

  • Infer the digits from the text strings, and put them in the grid:

The 9-groups (NW and SE) are each a single 3x3 block described by a 9-numeral string. Fill in the blocks using their strings.​

The 6-groups (N1, N2, S1, S2) have a mapping described by a 6-UPPERCASE string. The 6 numerals of a 6-group are the ones that are not in the associated row of the neighboring 9-block (either NW or SE). Determine and sort the 6 numerals. Map the lowest numeral to the "A" position of the 6-UPPERCASE string; the second, to the "B" position, etc..​

The 3-groups (N3, N4, E1, E2, E3, S3, S4, W1, W2, W3) have a mapping described by a 3-lowercase string. Some 3-groups are controlled by other 3-groups, so the order of reconciliation matters.​

First, determine the horizontal 3-groups (N3, N4, S3, S4). Afterward the vertical groups (W3, W2, W1, E1, E2, E3) are fully constrained.​

As with the 6-groups, determine and sort the 3 numerals, and then map the lowest numeral to the "a" position of the 3-lowercase string, etc..​

  • Combine the 72 numerals into an 8-block "perimeter grid" (all but the center O block).

  • Fill O according to what is missing in the W/E rows and N/S columns. O is a collection of nine 1-groups, or single digits. Just as the 6-groups are each constrained by 3 other digits, and the 3-groups are each constrained by 6 other digits, the 1-groups of O are each constrained by 12 other digits. Each digit in O is constrained by rows of W and E, and columns of N and S. Because of all of this, each digit of O can be processed in the same way as the 6- and 3-groups. That is, O does not need a separate step that uses a different algorithm.
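
In code, the O fill is one set difference per cell (a sketch; the grid here is a 9x9 list of lists with 0 marking the nine empty center cells):

Code:
def fill_center(grid):
    """Fill the center block: each empty cell takes the one digit
    not already present in its row or its column."""
    digits = set(range(1, 10))
    for r in range(3, 6):
        for c in range(3, 6):
            used = set(grid[r]) | {grid[i][c] for i in range(9)}
            grid[r][c] = (digits - used).pop()
    return grid

Applied to the perimeter grid of the decode example below, this reproduces the 342 / 781 / 956 center block.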


Examples

Encode


Generate a random grid:

864527139​
729813654​
351496782​
673452918​
182397546​
549168273​
746395281​
931278465​
825614937​

This is what it looks like in a grid:

Code:
+--------+--------+--------+
|   864  |   729  |   351  |
|   527  |   813  |   496  |
|   139  |   654  |   782  |
+--------+--------+--------+
|   673  |   182  |   549  |
|   452  |   397  |   168  |
|   918  |   546  |   273  |
+--------+--------+--------+
|   746  |   931  |   825  |
|   395  |   278  |   614  |
|   281  |   465  |   937  |
+--------+--------+--------+
Separate the digits into the groups:

Code:
+--------+--------+--------+
|   864  |__ 729 _|_ 351 __|
|   527  |____ 813496 _____|
|   139  |     654782      |
+--+--+--+--------+--+--+--+
| 6| 7| 3|   182  | 5| 4| 9|
| 4| 5| 2|   397  | 1| 6| 8|
| 9| 1| 8|   546  | 2| 7| 3|
+--+--+--+--------+--+--+--+
|_____ 746931 ____|  825   |
|_____ 395278 ____|  614   |
|   281  |  465   |  937   |
+--------+--------+--------+
Determine the 6-group UPPERCASE and 3-group lowercase permutations:

Code:
+--------+--------+--------+
|   864  |__ bac _|_ bca __|
|   527  |____ EABCFD _____|
|   139  |     DCBEFA      |
+--+--+--+--------+--+--+--+
| b| c| b|   182  | c| a| c|
| a| b| a|   397  | a| b| b|
| c| a| c|   546  | b| c| a|
+--+--+--+--------+--+--+--+
|_____ ECDFBA ____|  825   |
|_____ BFCADE ____|  614   |
|   bca  |  acb   |  937   |
+--------+--------+--------+
The mapping used by my implementation translates the above permutations to the following integers:

Code:
+--------+--------+--------+
|        |__ 2  __|__ 3  __|
| 309996 |______ 481 ______|
|        |       417       |
+--+--+--+--------+--+--+--+
|  |  |  |        |  |  |  |
|2 |5 |2 | 30356  |4 |0 |5 |
|  |  |  |        |  |  |  |
+--+--+--+--------+--+--+--+
|______ 545 ______|        |
|______ 222 ______| 289810 |
|   3    |  1     |        |
+--------+--------+--------+
The digits (permutation integers) and their bases are below:

Code:
group    NW     N1   N2 N3 N4 E1 E2 E3      SE     S1   S2 S3 S4 W1 W2 W3
value  309996  417  481  2  3  4  0  5    289810  545  222  1  3  2  5  2
base   362880  720  720  6  6  6  6  6    362880  720  720  6  6  6  6  6
The base-10 representation is 1827942056972749719918906854720, and the base-36 representation is 4X7ZLEJSG8O9DAL5P1KW.


Decode

Start by getting the integer representation of a random grid:

Generate a random grid:

653894712​
742135968​
198627453​
561342897​
439781526​
827956341​
915278634​
284563179​
376419285​

Encode it, and its integer base-10 representation is 1318584014390052175401511224476.

Break the number into its 16 constituent integer parts:

Code:
group    NW     N1   N2 N3 N4 E1 E2 E3      SE     S1   S2 S3 S4 W1 W2 W3
value  223615  501    4  3  4  5  4  5    213187  606  105  2  1  1  5  2
base   362880  720  720  6  6  6  6  6    362880  720  720  6  6  6  6  6
Here are the integers shown in their groups:

Code:
+--------+--------+--------+
|        |__ 3  __|__ 4  __|
| 223615 |______ 4   ______|
|        |       501       |
+--+--+--+--------+--+--+--+
|  |  |  |        |  |  |  |
|2 |5 |1 |        |5 |4 |5 |
|  |  |  |        |  |  |  |
+--+--+--+--------+--+--+--+
|______ 606 ______|        |
|______ 105 ______| 213187 |
|   1    |  2     |        |
+--------+--------+--------+
The mapping used by my implementation translates the above integers to the following permutations:

Code:
+--------+--------+--------+
|   653  |__ bca _|_ cab __|
|   742  |____ ABCFDE _____|
|   198  |     EAFCDB      |
+--+--+--+--------+--+--+--+
| b| c| a|        | c| c| c|
| a| b| c|        | b| a| b|
| c| a| b|        | a| b| a|
+--+--+--+--------+--+--+--+
|_____ FACBDE ____|  634   |
|_____ AFCDEB ____|  179   |
|   acb  |  bac   |  285   |
+--------+--------+--------+
Map the UPPERCASE 6-group and lowercase 3-group text from their respective "A" and "a" representations to their available digits:

Code:
+--------+--------+--------+
|   653  |__ 894 _|_ 712 __|
|   742  |____ 135968 _____|
|   198  |     627453      |
+--+--+--+--------+--+--+--+
| 5| 6| 1|        | 8| 9| 7|
| 4| 3| 9|        | 5| 2| 6|
| 8| 2| 7|        | 3| 4| 1|
+--+--+--+--------+--+--+--+
|_____ 915278 ____|  634   |
|_____ 284563 ____|  179   |
|   376  |  419   |  285   |
+--------+--------+--------+
Distribute the groups' content into the 8 perimeter 3x3 blocks:

Code:
+--------+--------+--------+
|   653  |   894  |   712  |
|   742  |   135  |   968  |
|   198  |   627  |   453  |
+--------+--------+--------+
|   561  |        |   897  |
|   439  |        |   526  |
|   827  |        |   341  |
+--------+--------+--------+
|   915  |   278  |   634  |
|   284  |   563  |   179  |
|   376  |   419  |   285  |
+--------+--------+--------+
Determine the content of the center O block:

Code:
+--------+--------+--------+
|   653  |   894  |   712  |
|   742  |   135  |   968  |
|   198  |   627  |   453  |
+--------+--------+--------+
|   561  |   342  |   897  |
|   439  |   781  |   526  |
|   827  |   956  |   341  |
+--------+--------+--------+
|   915  |   278  |   634  |
|   284  |   563  |   179  |
|   376  |   419  |   285  |
+--------+--------+--------+
This is also described by 9 rows of text:

653894712​
742135968​
198627453​
561342897​
439781526​
827956341​
915278634​
284563179​
376419285​


A note on translating between permutations and their ordinal position:

You said that you are working on implementing some translation algorithm ideas around the permutations. I defer to you if that is something that you want to do. I don't want to take away from your enjoyment. I mean, this is your thread, after all.​

I did not design any permutation lookup algorithms:​

  • Initially, I just said "Shoot. Use python." I made a dictionary with entries 3, 6, and 9 having a list of permutations of corresponding size. This makes going back and forth between permutation and ordinal number trivial and fast. My older machine does all of the table generation and lookups and reverse lookups of two examples, using barely a dozen lines of code for that, in under 200ms.

  • What I discovered is that you do not need to compute any permutations at all. That is, going back and forth between integers and text strings of permuted characters is just a case of integer/ASCII translation. Looked at in just the right way, integer/permutation conversion is a case of base conversion seen in functions like itoa/atoi and others. The functions themselves, "left as an exercise for the reader," were pretty short, and easy to write.
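
For anyone following along, the table version really is just a few lines (a sketch; note that lexicographic ordinals need not match the particular mapping my examples used):

Code:
from itertools import permutations

# One table per group size: ordinal -> permutation (lexicographic),
# plus the reverse map for permutation -> ordinal.
PERMS = {n: list(permutations(range(n))) for n in (3, 6, 9)}
RANK = {n: {p: i for i, p in enumerate(perms)} for n, perms in PERMS.items()}

p = PERMS[6][481]     # ordinal -> permutation
print(p, RANK[6][p])  # round-trips to 481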

Anyway, had a lot of fun. Thanks for sharing the invitation.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,776
I haven't walked through your work, but it sure looks like you've put some serious time and effort into it.

Some of the thoughts that I've had on reducing the underlying span are things like the following:

Only work with puzzles in which the top row is 123456789.

This leaves you with 7 rows that have to be encoded (the eighth is then determined by elimination) and 8 columns. That right there gets you down to being able to use no more than 3 bits for each symbol. But we can do better than that as the number of remaining symbols on a given row decreases. The first pass using this method requires 110 bits to encode an arbitrary grid. Encoding the mapping from the ordered top row to the actual top row can be done in another 21 bits, bringing the total to 131 bits.

But I think this is leaving a lot of leverage on the table.

If I take a puzzle and rotate it and/or mirror it, I get another valid puzzle. So can this be leveraged to encode a smaller number of base puzzles and then encode a mirror/rotate vector? There are additional reorderings, such as rows and columns of blocks, that can be encoded. For instance, can my base puzzle assume not only a top row of 123456789, but also a first column, with the resulting rows then being reordered according to a supplied vector?

I can see this getting me down into the 100 bit realm. Perhaps. I don't know if I can get it to beat your 101 bits.
 

joeyd999

Joined Jun 6, 2011
6,252
@WBahn, I am interested in your problem, but I do not possess the statistical prowess to even begin to analyze it. I posed your post, exactly as written, to Grok (out of my own curiosity). I am posting Grok's answer in its entirety here, correcting all the LaTeX as necessary (this is real work!). If you are not interested, don't read it, but please do not chastise me for the post.

I'm genuinely curious: Is Grok correct? Did it even understand the question properly? Is it just AI slop? I have no way of knowing without investing a great deal more time.

Side note: I've suggested there be a rule for AI-generated content on AAC so that the grounds for these kinds of posts are clearer.

----------------------

This is a classic statistics problem known as support size estimation (or estimating the cardinality/population size \( N \) of a discrete uniform distribution from i.i.d. samples with replacement). It’s exactly the inverse of the coupon collector problem and a generalization of the birthday problem (where you estimate the number of “days” from observed collisions). Your intuition is spot-on: repeats are extremely unlikely unless the pool \( N \) is a tiny fraction of the full Sudoku grid space ( \( \sim 6.67 \times 10^{21} \) ), so observing them quickly pins down a much smaller \( N \).


Your two proposed approaches are both valid and have been studied extensively (in ecology for species estimation, database duplicate detection, streaming algorithms, etc.). The dynamic method (tracking when repeats occur) is indeed better than a single static snapshot because it uses the full sequence of data and lets you update the estimate in real time as your collection grows. The frequency of new puzzles slows predictably as the number of seen puzzles \( k \) approaches \( N \).

Best practical estimator (simple and highly accurate)

After \( M \) puzzles played, let \( D \) = number of distinct puzzles seen.


The expected number of distinct puzzles is exactly


\[ \mathbb{E}[D] = N \left(1 - \left(1 - \frac{1}{N}\right)^M \right). \]

Solve numerically for \( N \) so the right-hand side equals your observed \( D \). This is the method-of-moments estimator and is standard for this problem (see Stats StackExchange).


  • It is monotonic in \( N \), so you can use binary search, Newton’s method, or any solver (e.g., scipy.optimize in Python).
  • When collisions are still rare (\( M \ll \sqrt{N} \)), a quick birthday-problem approximation is

\[ N \approx \frac{M(M-1)}{2C}, \]

where \( C = M - D \) is the total number of “collision events” (times you drew a previously seen puzzle). A slightly better closed-form approximation is \( N \approx -M / \ln(1 - D/M) \).

Even better: full maximum-likelihood estimator using collision timing

Your idea of tracking when each repeat occurs gives strictly more information. Record the sequence as you play:


  • Maintain a counter \( k \) = current number of distinct puzzles seen so far (starts at 0).
  • For each new puzzle \( j = 1 \dots M \):
    • If it is new: contribute \( \frac{N - k}{N} \) to the likelihood and increment \( k \).
    • If it is a repeat: contribute \( \frac{k}{N} \) to the likelihood (no change to \( k \) ).

The full likelihood is the product of these \( M \) terms (or take log for numerical stability):


\[ \log L(N) = \sum_{\text{new draws } j} \log\left(\frac{N - k_{j-1}}{N}\right) + \sum_{\text{repeat draws } j} \log\left(\frac{k_{j-1}}{N}\right), \]

where \( k_{j-1} \) is the number of distinct puzzles before draw \( j \).


Find the integer \( N \geq D \) that maximizes \( \log L(N) \). This is the maximum-likelihood estimator and uses the exact timing/spacing of your repeats (exactly what you suggested). In practice:


  • Loop over plausible \( N \) values (start near the simple estimator above and widen until \( L \) drops).
  • Or use a derivative-free optimizer on the (negative) log-likelihood.
  • As \( M \) grows, re-run on the full sequence — the estimate sharpens dramatically once you start seeing repeats regularly.

This sequential likelihood is easy to update online: every new puzzle just multiplies in one extra factor.

Confidence / uncertainty

  • The variance of \( D \) is known analytically:

\[ \text{Var}(D) = N\left(1-\frac{1}{N}\right)^M + N^2\left(1-\frac{1}{N}\right)\left(1-\frac{2}{N}\right)^M - \left[N\left(1-\frac{1}{N}\right)^M\right]^2. \]

You can plug in your \( \hat{N} \) and invert for approximate confidence intervals on \( N \).


  • For more rigor you can compute a profile-likelihood interval (values of \( N \) where \( \log L(N) \) is within 1.92 of the maximum for a 95% CI) or bootstrap by simulating new sequences from your \( \hat{N} \).

Extra information you can use (frequency counts)

If you also record how many times each puzzle appeared (frequency-of-frequencies: \( f_1 \) = number seen once, \( f_2 \) = seen twice, etc.), you can plug into capture-recapture estimators such as Chao’s:


\[ \hat{N} \approx D + \frac{f_1^2}{2f_2}. \]

These are very good when many puzzles are still rare (low sampling fraction).

Practical advice for your Sudoku variant

  1. Store each finished puzzle (or a hash/fingerprint) and maintain a set of seen ones + a counter for each.
  2. After every batch of games, compute the simple \( D \)-based estimate and (optionally) the full MLE.
  3. The estimate will be noisy at first but will converge quickly once repeats become common.
  4. If you ever want me to compute a specific estimate, just share your current \( M \) , \( D \) , and (ideally) the sequence of new/repeat flags or the frequency counts — I can run the numbers instantly.

Your static “fraction of repeats” idea is a coarser version of the above (it essentially looks only at final \( D/M \) ), while the dynamic spacing approach is the gold standard. This exact setup has been solved many times in the literature because it appears in random sampling of databases, web APIs returning random items, ecology, etc. Once you have a few dozen repeats the pool size \( N \) will be pinned down to within a few percent. Happy puzzling — and let me know what the estimate turns out to be!
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,776
@WBahn, I am interested in your problem, but I do not possess the statistical prowess to even begin to analyze it. I posed your post, exactly as written, to Grok (out of my own curiosity). I am posting Grok's answer in its entirety here, correcting all the LaTeX as necessary (this is real work!). If you are not interested, don't read it, but please do not chastise me for the post.

I'm genuinely curious: Is Grok correct? Did it even understand the question properly? Is it just AI slop? I have no way of knowing without investing a great deal more time.

Side note: I've suggested there be a rule for AI-generated content on AAC so that the grounds for these kinds of posts are clearer.

----------------------

This is a classic statistics problem known as support size estimation (or estimating the cardinality/population size \( N \) of a discrete uniform distribution from i.i.d. samples with replacement). It’s exactly the inverse of the coupon collector problem and a generalization of the birthday problem (where you estimate the number of “days” from observed collisions). Your intuition is spot-on: repeats are extremely unlikely unless the pool \( N \) is a tiny fraction of the full Sudoku grid space ( \( \sim 6.67 \times 10^{21} \) ), so observing them quickly pins down a much smaller \( N \).


Your two proposed approaches are both valid and have been studied extensively (in ecology for species estimation, database duplicate detection, streaming algorithms, etc.). The dynamic method (tracking when repeats occur) is indeed better than a single static snapshot because it uses the full sequence of data and lets you update the estimate in real time as your collection grows. The frequency of new puzzles slows predictably as the number of seen puzzles \( k \) approaches \( N \).

Best practical estimator (simple and highly accurate)

After \( M \) puzzles played, let \( D \) = number of distinct puzzles seen.


The expected number of distinct puzzles is exactly


\[ \mathbb{E}[D] = N \left(1 - \left(1 - \frac{1}{N}\right)^M \right). \]

Solve numerically for \( N \) so the right-hand side equals your observed \( D \). This is the method-of-moments estimator and is standard for this problem (see Stats StackExchange).


  • It is monotonic in \( N \), so you can use binary search, Newton’s method, or any solver (e.g., scipy.optimize in Python).
  • When collisions are still rare (\( M \ll \sqrt{N} \)), a quick birthday-problem approximation is

\[ N \approx \frac{M(M-1)}{2C}, \]

where \( C = M - D \) is the total number of “collision events” (times you drew a previously seen puzzle). A slightly better closed-form approximation is \( N \approx -M / \ln(1 - D/M) \).

Even better: full maximum-likelihood estimator using collision timing

Your idea of tracking when each repeat occurs gives strictly more information. Record the sequence as you play:


  • Maintain a counter \( k \) = current number of distinct puzzles seen so far (starts at 0).
  • For each new puzzle \( j = 1 \dots M \):
    • If it is new: contribute \( \frac{N - k}{N} \) to the likelihood and increment \( k \).
    • If it is a repeat: contribute \( \frac{k}{N} \) to the likelihood (no change to \( k \) ).

The full likelihood is the product of these \( M \) terms (or take log for numerical stability):


\[ \log L(N) = \sum_{\text{new draws } j} \log\left(\frac{N - k_{j-1}}{N}\right) + \sum_{\text{repeat draws } j} \log\left(\frac{k_{j-1}}{N}\right), \]

where \( k_{j-1} \) is the number of distinct puzzles before draw \( j \).


Find the integer \( N \geq D \) that maximizes \( \log L(N) \). This is the maximum-likelihood estimator and uses the exact timing/spacing of your repeats (exactly what you suggested). In practice:


  • Loop over plausible \( N \) values (start near the simple estimator above and widen until \( L \) drops).
  • Or use a derivative-free optimizer on the (negative) log-likelihood.
  • As \( M \) grows, re-run on the full sequence — the estimate sharpens dramatically once you start seeing repeats regularly.

This sequential likelihood is easy to update online: every new puzzle just multiplies in one extra factor.

Confidence / uncertainty

  • The variance of \( D \) is known analytically:

\[ \text{Var}(D) = N\left(1-\frac{1}{N}\right)^M + N^2\left(1-\frac{1}{N}\right)\left(1-\frac{2}{N}\right)^M - \left[N\left(1-\frac{1}{N}\right)^M\right]^2. \]

You can plug in your \( \hat{N} \) and invert for approximate confidence intervals on \( N \).


  • For more rigor you can compute a profile-likelihood interval (values of \( N \) where \( \log L(N) \) is within 1.92 of the maximum for a 95% CI) or bootstrap by simulating new sequences from your \( \hat{N} \).

Extra information you can use (frequency counts)

If you also record how many times each puzzle appeared (frequency-of-frequencies: \( f_1 \) = number seen once, \( f_2 \) = seen twice, etc.), you can plug into capture-recapture estimators such as Chao’s:


\[ \hat{N} \approx D + \frac{f_1^2}{2f_2}. \]

These are very good when many puzzles are still rare (low sampling fraction).

Practical advice for your Sudoku variant

  1. Store each finished puzzle (or a hash/fingerprint) and maintain a set of seen ones + a counter for each.
  2. After every batch of games, compute the simple \( D \)-based estimate and (optionally) the full MLE.
  3. The estimate will be noisy at first but will converge quickly once repeats become common.
  4. If you ever want me to compute a specific estimate, just share your current \( M \) , \( D \) , and (ideally) the sequence of new/repeat flags or the frequency counts — I can run the numbers instantly.

Your static “fraction of repeats” idea is a coarser version of the above (it essentially looks only at final \( D/M \) ), while the dynamic spacing approach is the gold standard. This exact setup has been solved many times in the literature because it appears in random sampling of databases, web APIs returning random items, ecology, etc. Once you have a few dozen repeats the pool size \( N \) will be pinned down to within a few percent. Happy puzzling — and let me know what the estimate turns out to be!
The reasoning that Grok gave sounds pretty solid -- but then a lot of AI slop sounds very solid at first blush. Similarly, the equations it provides look reasonable with just a cursory inspection, which is another way of saying that I haven't examined them closely and so they could be completely bogus. But the first equation seems to behave the way I would expect in the limits of both large and small M and N.
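
For instance, a quick check of the limiting behavior of that first equation (just plugging in numbers, not a derivation):

Code:
def expected_distinct(n, m):
    """E[D] = N(1 - (1 - 1/N)^M): expected distinct puzzles
    after M uniform draws from a pool of N."""
    return n * (1 - (1 - 1 / n) ** m)

print(expected_distinct(10**6, 100))  # ~99.995, M << N: nearly every draw is new
print(expected_distinct(450, 10**5))  # ~450.0, M >> N: the pool is exhausted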
 

joeyd999

Joined Jun 6, 2011
6,252
The reasoning that Grok gave sounds pretty solid -- but then a lot of AI slop sounds very solid at first blush. Similarly, the equations it provides look reasonable with just a cursory inspection, which is another way of saying that I haven't examined them closely and so they could be completely bogus. But the first equation seems to behave the way I would expect in the limits of both large and small M and N.
Well, if it's just slop, maybe it'll help point you in the right direction.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,776
Well, if it's just slop, maybe it'll help point you in the right direction.
This is often the case. I always assume that what an LLM produces is good-sounding snake oil. Sometimes it is, sometimes it isn't. But in trying to vet the answers it gives, I often end up searching along lines that I otherwise wouldn't have and getting closer to what I believe is the correct answer, regardless of whether it agrees with what the LLM spews forth or not. Sometimes it does, sometimes it doesn't. I've also learned not to allow myself to let confidence build in a conversation just because prior parts of the conversation have proven accurate -- they can, and do, go completely off the rails at the drop of a hat.
 

joeyd999

Joined Jun 6, 2011
6,252
This is often the case. I always assume that what an LLM produces is good-sounding snake oil. Sometimes it is, sometimes it isn't. But in trying to vet the answers it gives, I often end up searching along lines that I otherwise wouldn't have and getting closer to what I believe is the correct answer, regardless of whether it agrees with what the LLM spews forth or not. Sometimes it does, sometimes it doesn't. I've also learned not to allow myself to let confidence build in a conversation just because prior parts of the conversation have proven accurate -- they can, and do, go completely off the rails at the drop of a hat.
This is not an argument, so please don't take it as such:

For my own personal work, I confirm the agent's assertions. Yes, this takes effort on my part, but it's less effort when I have actual assertions to check and leads to follow.

When I find errors or contradictions, I push back with another prompt. The agent usually acknowledges the error, and provides an explanation and a corrected answer (usually better than the original, if not outright correct).

Subsequent transactions tend to be more accurate, at least in the short term.

But, how is this different than when dealing with real people?

Edit: the time saved via this process is worth $30 a month to me.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,776
This is not an argument, so please don't take it as such:

For my own personal work, I confirm the agent's assertions. Yes, this takes effort on my part, but it's less effort when I have actual assertions to check and leads to follow.

When I find errors or contradictions, I push back with another prompt. The agent usually acknowledges the error, and provides an explanation and a corrected answer (usually better than the original, if not outright correct).

Subsequent transactions tend to be more accurate, at least in the short term.

But, how is this different than when dealing with real people?

Edit: the time saved via this process is worth $30 a month to me.
I also have found that the LLM will change course when errors are pointed out to it, but there are several different vectors that it seems to take. First, they all seem to be sycophants, praising me for being right to push back -- to the point of making me want to walk away. In dealing with humans, I've generally found that people who take that approach are usually untrustworthy and will say anything that they think will make me like them. No thanks. But, I also realize that LLMs are not human, so the "motivations" are different.

I used to find that the LLM would acknowledge errors and take another shot at it, and often it seemed to be doing a better job. But I also found that I could often get it to finally agree with my assertions and accept what I was telling it as given fact, even if what I was telling it was intentionally and demonstrably wrong.

Lately, they seem to be taking a more defensive attitude (for lack of a better description). After one told me explicitly to "put X on line Y of form Z", I pointed out where the instructions (this happened to be a tax-related conversation) very clearly said not to do that, and it came back and said something like, "Oh, I see how it might appear that I told you to do that, but..." Ah, no, it didn't appear like it, it was explicit. They are becoming more like politicians or school children (assuming there's a difference in this regard) and will spout anything in order to claim that they weren't really wrong, that what they said was correct, just misinterpreted. Even more troubling, I've had a couple of instances lately where they take this a step further and insist that they are right, that the instructions and the publications, which were very clear, were wrong, and then claim that the instructions used a word incorrectly and that it should be interpreted to mean something completely at odds with what the entire context of the instructions says.

For me, the bottom line is that LLMs can be used effectively if you are willing to account for what they are and what they aren't. My big concern, which we have seen repeatedly ever since they came on the scene, is that the "typical" user is more than willing to just accept whatever they say at face value without even bothering to ask if it makes sense, let alone vet it. And this is not anything new -- salesmen and con artists have known that slickly packaged crap that is confidently presented will be accepted as golden nuggets by a significant fraction of the population.
 

joeyd999

Joined Jun 6, 2011
6,252
"Oh, I see how it might appear that I told you to do that, but..."
In the case of tax forms, I actually think the LLM might have been correct.

I have experienced this with my own forms and my own actual, living, breathing, human accountants.

And, I get just as nervous.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,776
In the case of tax forms, I actually think the LLM might have been correct.

I have experienced this with my own forms and my own actual, living, breathing, human accountants.

And, I get just as nervous.
Nope. Pretty sure the LLM was wrong.

It explicitly said that income from renting out equipment that is tangible personal property does NOT go on Schedule C, but rather on Line 3 of Part I of Schedule E. Part I of Schedule E is titled "Income or Loss from Rental Real Estate and Royalties". Immediately below the title, it states, "Note: If you are in the business of renting personal property, use Schedule C. See instructions."

The instructions state:

"Do not use Schedule E to report income and expenses from the rental of personal property, such as equipment or vehicles. Instead, use Schedule C if you are in the business of renting personal property. You are in the business of
renting personal property if the primary purpose for renting the property is income or profit and you are involved in the rental activity with continuity and regularity.

If your rental of personal property is not a business, see the instructions for Schedule 1 (Form 1040), lines 8l and 24b, to find out how to report the income and expenses."

When this was quoted to it, it said that a "business" in the instructions is not the same as a "business" used elsewhere in the instructions and that, since an LLC is a disregarded entity, it must be reported in Part I of Schedule E.

When I pointed out that the instructions flatly state NOT to report income and expenses from the rental of personal property, such as equipment or vehicles -- period, regardless of whether the rental was considered a business activity or not -- and that they then tell you to use Schedule C if it's a business and Schedule 1 if it isn't, it responded that the IRS wording was incorrect and insisted that it must be reported on Schedule E.

When asked how, if it goes on Schedule E and not Schedule C, the determination of whether the business had a profit or loss could be accurate when the rental income is what makes the difference, it said that the profit/loss on Schedule C would be accurate. Uh... how?

An even more comical circle jerk happened regarding which NAICS code to use for a new business I stood up last year, with it contradicting itself: telling me to use a code that didn't exist in the IRS list, then telling me that I had to use a code that was in that list, then telling me to use the code that it had already told me to use, even though that code was not in the list, then telling me that using a code that is not in the list could significantly delay the processing of my return and potentially trigger an audit, then telling me that all of the tax software uses a broader set of codes than the list provided in the instructions and the IRS expects you to use the broader list, then later telling me that tax software is restricted to using only the IRS-provided list.

At one point, it stated that its claim that we are expected to use the broader list was verified by the IRS FAQ, plus a webpage that it provided a link to, plus four specific publications. The FAQ never mentions NAICS codes at all, the webpage didn't exist, and none of the publications, all of which were about unrelated things, contained the term NAICS. When this was pointed out, it acknowledged that the website didn't exist and that none of the other resources touch on NAICS codes. When asked why it provided those as verification that we are supposed to use the broader list of NAICS codes, it responded that it only seemed that way.
 

joeyd999

Joined Jun 6, 2011
6,252
Nope. Pretty sure the LLM was wrong.

It explicitly said that income from renting out equipment that is tangible personal property does NOT go on Schedule C, but rather on Line 3 of Part I of Schedule E. Part I of Schedule E is titled "Income or Loss from Rental Real Estate and Royalties". Immediately below the title, it states, "Note: If you are in the business of renting personal property, use Schedule C. See instructions."

The instructions state:

"Do not use Schedule E to report income and expenses from the rental of personal property, such as equipment or vehicles. Instead, use Schedule C if you are in the business of renting personal property. You are in the business of
renting personal property if the primary purpose for renting the property is income or profit and you are involved in the rental activity with continuity and regularity.

If your rental of personal property is not a business, see the instructions for Schedule 1 (Form 1040), lines 8l and 24b, to find out how to report the income and expenses."

When this was quoted to it, it said that a "business" in the instructions is not the same as a "business" used elsewhere in the instructions and that, since an LLC is a disregarded entity, it must be reported in Part I of Schedule E.

When I pointed out that the instructions flatly state NOT to report income and expenses from the rental of personal property, such as equipment or vehicles -- period, regardless of whether the rental was considered a business activity or not -- and that they then tell you to use Schedule C if it's a business and Schedule 1 if it isn't, it responded that the IRS wording was incorrect and insisted that it must be reported on Schedule E.

When asked how, if it goes on Schedule E and not Schedule C, the determination of whether the business had a profit or loss could be accurate when the rental income is what makes the difference, it said that the profit/loss on Schedule C would be accurate. Uh... how?

An even more comical circle jerk happened regarding which NAICS code to use for a new business I stood up last year, with it contradicting itself: telling me to use a code that didn't exist in the IRS list, then telling me that I had to use a code that was in that list, then telling me to use the code that it had already told me to use, even though that code was not in the list, then telling me that using a code that is not in the list could significantly delay the processing of my return and potentially trigger an audit, then telling me that all of the tax software uses a broader set of codes than the list provided in the instructions and the IRS expects you to use the broader list, then later telling me that tax software is restricted to using only the IRS-provided list.

At one point, it stated that its claim that we are expected to use the broader list was verified by the IRS FAQ, plus a webpage that it provided a link to, plus four specific publications. The FAQ never mentions NAICS codes at all, the webpage didn't exist, and none of the publications, all of which were about unrelated things, contained the term NAICS. When this was pointed out, it acknowledged that the website didn't exist and that none of the other resources touch on NAICS codes. When asked why it provided those as verification that we are supposed to use the broader list of NAICS codes, it responded that it only seemed that way.
Just for poops and grins, and not that it matters, SuperGrok provided me the correct answer on the first, short prompt. So, obviously, YMMV and caveat emptor and believe nothing and confirm everything. Good rules for life in general.

But I am more interested in the symbolic math capabilities. If you are ever inclined, I'd surely like to know if those answers are correct for the specific problem with which you started this thread.
 

JohnSan

Joined Sep 15, 2018
124
Back to Sudoku.

So, down to some limit, does the number of starting values given in a grid determine how difficult it is to complete?

That seems logical to me.

With any given number of starting values (let's say 23, for instance), does the complexity of solving the grid vary with the initial positions of those values, or would all grids starting with 23 values need the same solving techniques? I.e., be the same complexity.

(I can't follow the math shown above).

Or am I way off?
 