Stats problem that some might find interesting.

Thread Starter

WBahn

Joined Mar 31, 2012
32,877
Back to Sudoku.

So, down to a limit, the fewer the starting values given in a grid, determine how difficult it is to complete?

That seems logical to me.

With any given number of starting values (let's say 23 for instance), does the complexity of solving the grid vary with the initial position of these values, or would all grids starting with 23 values need the same solving techniques? IE. Be the same complexity.

(I can't follow the math shown above).

Or am I way off?
Way off.

First, the math above, and the problem being posed, only deals with representing the final grid of values and doesn't address the starting clues.

The difficulty of solving a puzzle, given a set of initial clues, is dependent on several factors.

For a given grid of symbols, there are many sets of initial clues that are locally minimal. A local minimum is a set of clues that is sufficient such that there is exactly one unique grid that is consistent with them and such that if you were to remove any one of those clues there would be multiple valid solutions. But not all of those sets are the same size. So, if I give you two different sets of starting clues for the same grid, one that has, say, twenty symbols and one that has twenty-three, it's entirely possible that the one with twenty-three symbols is a local minima while the one with twenty actually has three more than is necessary to solve the puzzle and may, therefore, be easier to solve.

Another way to view this is to consider a puzzle with a given starting set of clues. Now I want to add an additional clue to make it easier to solve. Are all of the possible symbols I could add equally valuable as a clue? Nope. Some might make it considerably easier, while others might have virtually no impact. Think of most puzzles that you work -- don't you find that there is almost always some low hanging fruit? Spaces that you almost immediately can fill in with very little effort? Would adding those symbols to the clue set therefore make the puzzle much easier to solve? Not much at all, since you would have filled in that symbol almost immediately anyway. But, again thinking of most of the puzzles you work, particularly the more challenging ones where you get to a point where you really have to work to figure out another symbol. Don't you find yourself saying that if you could only figure out the value of this or that symbol, it would all break open? Well, what if that symbol were added to the initial clue set? That would make the puzzle much easier to solve.
 

JohnSan

Joined Sep 15, 2018
126
For a given grid of symbols, there are many sets of initial clues that are locally minimal. A local minimum is a set of clues that is sufficient such that there is exactly one unique grid that is consistent with them and such that if you were to remove any one of those clues there would be multiple valid solutions. But not all of those sets are the same size. So, if I give you two different sets of starting clues for the same grid, one that has, say, twenty symbols and one that has twenty-three, it's entirely possible that the one with twenty-three symbols is a local minima while the one with twenty actually has three more than is necessary to solve the puzzle and may, therefore, be easier to solve.
If these are the starting values for the same grid and 23 is the 'minima', by only giving 20 of these values, how can that grid possibly be resolved? Let alone be easier?

I can see, if there is a minima starting set and some values are added, it may or may not be easier to solve, but just have fewer values to figure out.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,877
If these are the starting values for the same grid and 23 is the 'minima', by only giving 20 of these values, how can that grid possibly be resolved? Let alone be easier?

I can see, if there is a minima starting set and some values are added, it may or may not be easier to solve, but just have fewer values to figure out.
23 might be the minima for a particular set of starting values.

But there might be a different set of starting values for that same grid that only needs 20.

Imagine that I begin with the set of 23 that is minimal. That only means that if I remove one of those 23, the puzzle becomes unsolvable (i.e., has multiple valid solutions).

But what if I, instead, add an additional symbol? That symbol may add enough information that I can now remove four (not any four, but a specific four) of the symbols from the original set before I once again have a minimal set.

But, more likely, the minimal set of 23 and the minimal set of 20 may have few, if any, members in common.

For most solution grids, there are millions, and sometimes billions, of distinct minimal clue sets. The number of clues in each set range from 17 to 40.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,877
I finally got around to throwing together that Python script to estimate the size of the pool that they are drawing their puzzles from. Assuming that they are randomly drawing them from a fixed pool, I calculated the probability of seeing the specific history or repeats that I've actually seen. The odds of each are extremely low -- on the order of 1e-200. But looking at the relative probabilities allows me to not only choose the most likely size, but also to put bounds on that size to a particular confidence level. The current result, after collecting a bit over 500 distinct games (I didn't start tracking repeats until I had an archive of nearly 150 -- I was capturing solved grids for a different purpose initially) is that the best estimate for the pool size is 877 boards with a 90% confidence that the true number of boards lies between 824 and 948.

The relative probabilities are shown below:

pool_size_distribution.png

I also generated the progressive estimates, along with the lower and upper limits of the 90% confidence interval, as the data was collected and those are shown below:

pool_size_progressive_estimate.png
As expected, the data starts off pretty erratic, but starts settling down pretty firmly after about 100 observations and is starting to get smoother after about 500 observations. But there is still an upward trend, so I suspect that the true value is higher than the current estimate.

As a test, both of the basic correctness of the data and also to see if the assumption about fixed pool size was reasonable (or at least not too unreasonable), I generated a test data set where I drew from a pool of 877 items and started with an archive of the same size as the real data and drew the same number of items from the pool. The results were pretty consistent with the live estimate, coming in at 904 <866, 953>, compares pretty favorably with 877 <824, 948>. To judge how favorably, I'd could do a Monte Carlo experiment and see what the distributions of the estimates and the bounds look like. That should be pretty easy to set up and run at this point, so maybe I'll do that sometime soon.
 

joeyd999

Joined Jun 6, 2011
6,307
I asked Grok to write a Python script based upon the original prompt.

Would you be interested in running it to compare your results? I would like to know if the results match, or if Grok is just off in la la land.

Sodoku Pool Size Estimator:
import math
from typing import List, Optional

def estimate_pool_size_simple(M: int, D: int) -> int:
    """
    Simple and fast estimator using expected number of distinct puzzles.
    Solves E[D] = N * (1 - (1-1/N)^M) ≈ observed D
    """
    if D >= M:
        return D  # No collisions yet → lower bound only
    
    # Binary search for N
    low = D
    high = 10**18  # Huge upper bound
    best_n = D
    best_error = float('inf')
    
    while low <= high:
        mid = (low + high) // 2
        expected_d = mid * (1 - (1 - 1/mid)**M) if mid > 1 else M
        error = abs(expected_d - D)
        
        if error < best_error:
            best_error = error
            best_n = mid
        
        if expected_d < D:
            low = mid + 1
        else:
            high = mid - 1
    
    return best_n


def estimate_pool_size_chao(f1: int, f2: int, D: int) -> int:
    """Chao estimator using frequency of frequencies (good when many singles)"""
    if f2 == 0:
        return D + f1 * (f1 - 1) // 2  # fallback
    return D + (f1 * f1) // (2 * f2)


def log_likelihood(N: int, sequence: List[bool]) -> float:
    """Full maximum likelihood given sequence of new/repeat flags"""
    if N < len([x for x in sequence if x]):
        return -float('inf')
    ll = 0.0
    k = 0  # number of distinct seen so far
    for is_new in sequence:
        if is_new:
            ll += math.log((N - k) / N)
            k += 1
        else:
            ll += math.log(k / N)
    return ll


def estimate_pool_size_mle(sequence: List[bool], n_candidates: int = 1000) -> int:
    """
    Maximum Likelihood Estimator using the full sequence of new/repeat outcomes.
    sequence[i] = True if the (i+1)th puzzle was new, False if repeat.
    """
    D = sum(1 for x in sequence if x)
    M = len(sequence)
    
    # Rough starting point
    N_guess = estimate_pool_size_simple(M, D)
    
    best_N = N_guess
    best_ll = -float('inf')
    
    # Search around the guess (widen range as needed)
    search_range = max(1000, N_guess // 5)
    low = max(D, N_guess - search_range)
    high = N_guess + search_range * 2
    
    for N in range(low, high + 1):
        ll = log_likelihood(N, sequence)
        if ll > best_ll:
            best_ll = ll
            best_N = N
    
    return best_N


# ====================== Example Usage ======================
if __name__ == "__main__":
    # Replace with your actual data
    M = 500          # total puzzles played
    D = 420          # distinct puzzles seen
    
    print("Simple estimator:", estimate_pool_size_simple(M, D))
    
    # Example Chao (if you track frequencies)
    f1 = 300  # puzzles seen exactly once
    f2 = 80   # seen exactly twice
    print("Chao estimator:", estimate_pool_size_chao(f1, f2, D))
    
    # Full MLE example (best but requires sequence)
    # sequence = [True, True, False, True, ...]  # True=new, False=repeat
    # print("MLE estimator:", estimate_pool_size_mle(sequence))
How to use it

  1. Run the script and fill in your current M and D.
  2. For higher accuracy, track the full sequence list (True = new puzzle, False = repeat) and use the MLE function.
  3. If you also keep counts of how many times each puzzle appeared, the Chao estimator is excellent.
 

joeyd999

Joined Jun 6, 2011
6,307
It also offered:

Would you like me to:


  • Add confidence intervals?
  • Make a version that reads data from a CSV/file?
  • Optimize the MLE search with scipy (faster for very large data)?
  • Include a simulation function so you can test how well the estimators perform?
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,877
If I get time I look at the script that Grok wrote.

I gave ChatGPT the following prompt about a week ago:

I'm playing an online game that, in theory, has and astronomical number of games. But I noticed that I was getting games that I knew I had seen before. So, I started capturing each completed gameboard and storing it in an archive. Then, when I tried a new game, I would check to see if I already had the game in the archive. If I did, I added the number of games in the archive to a list (which I will provide at the end) when the repeat occurred and then requested a new game. The repeat was not added to the archive, which contains distinct gameboards. I didn't start tracking until I already had 148 games in the archive, so I don't know how many repeats I had prior to that. I presently have 523 games in the archive and there have been 259 (again, this from the point where I started with 148 games in the archive. I'm trying to estimate the total number of games in the pool. Of course, I'm having to assume that the pool of games is fixed and that games are being drawn randomly from it with replacement. In addition to estimating the number of games under this assumption, I'm also interested in estimating the range of values that I can be 90% confident that the total number of games lies within. I'm also testing how likely it is that the assumptions I'm making are valid. Can you do that for me? Here is the list that I have compiled.

148
148
150
150
151
156
163
167
173
173
177
179
181
This list of numbers contained 259 entries.

I just noticed a typo in which I got ahead of myself and failed to finish a sentence above. Despite that, the response that ChatGPT gave was actually pretty impressive. It was able to correctly "understand" how to interpret the data, presented pretty cogent explanation of how to approach the problem, pointed out the trends in the data the reinforce a relatively small pool size, such as the frequency or repeats increasing as the size of the archive grows. It correctly determined how many items were pulled from the pool. It didn't say so directly, but it was using a log-likelihood estimation approach to estimate the mean and standard deviation of the assumed Gaussian distribution, which I didn't walk through in detail but that looks reasonable. It came up with a best estimate of 930 for the pool size with a 90% confidence interval of 815 to 1045. So quite in line with my later estimate.

The fact that it was able to "understand" the structure of my data and correctly (it seems) account for the fact that the data collection did not start from the beginning, and thus repeat information couldn't and shouldn't be assumed for the prior draws impressed me quite a bit.

So after getting this warm fuzzy about how well the LLM was able to do, I was asking it questions about a completely unrelated topic a few days later (don't recall what the topic was) and it made an absolutely glaring error when it said something very much like, "So out of 600 attempts, 150 of them were successes, yielding a success rate of 150/600 = 12.5%". It then used that 12.5% for other stuff and when it was asked if it was sure that the result was 12.5%, it indicated that it was. Only when I asked it very explicitly how 150 divided by 600 yielded 12.5% did it correct the error. I then spend time trying to get it to explain why and how it made the error. At first, it wanted to just brush it off and focus on the corrected answer, but after several probing questions, it finally revealed a bit of insight into how it actually works. When certain triggers are met it invokes other tools to essentially do the math for it and then uses the results to feed its response, but if those triggers aren't met, it uses token prediction according to what it has seen in it's training data. These triggers can be met for one part of a response but not others, so it suggested that the wrong number came because it's training set included examples of similar strings of text that resulted in that value being the most likely predicted next token. It could only talk in generalities because it had no recollection of how prior prompts were produced, only the produced response.

That jives pretty closely with what a colleague of mine, who is very much into the deep end of related technologies, told me a month or so ago when I was talking to him.

Then, again in an unrelated conversation, I was trying to find out if the comedian and ventriloquist Jeff Dunham would be performing in or near Colorado Springs any time in the near future. It first responded that the best I could hope for was that he would be in one of the two larger cities, Denver or Boulder (let's ignore the fact that Colorado Springs is about five times the size of Boulder). It offered to find tour dates in the general region and I took it up on it. It then said that he would be performing in Colorado Springs on March 11, 2026 and offered to get ticket and seating information. So I asked whether it was sure that I would be possible for me to go to the March 11, 2026 performance, and it said that their should be no problem, as long as the show didn't sell out before I purchased my tickets. So I asked it if I needed a time machine, and it proceeded to explain that no time travel was needed since the show isn't until March 11, 2026 and since today is June 8, 2026, the show is almost a year away in the future. So I asked it what today's date was, and it responded with June 8, 2026 and THEN finally noticed that March 11, 2026 is in the past.

So just as I see a glimmer of quality, that illusion gets dashed on the rocks of reality.
 
Top