Stats problem that some might find interesting.

Thread Starter

WBahn

Joined Mar 31, 2012
32,876
Back to Sudoku.

So, down to a limit, the fewer the starting values given in a grid, determine how difficult it is to complete?

That seems logical to me.

With any given number of starting values (let's say 23 for instance), does the complexity of solving the grid vary with the initial position of these values, or would all grids starting with 23 values need the same solving techniques? IE. Be the same complexity.

(I can't follow the math shown above).

Or am I way off?
Way off.

First, the math above, and the problem being posed, only deals with representing the final grid of values and doesn't address the starting clues.

The difficulty of solving a puzzle, given a set of initial clues, is dependent on several factors.

For a given grid of symbols, there are many sets of initial clues that are locally minimal. A local minimum is a set of clues that is sufficient such that there is exactly one unique grid that is consistent with them and such that if you were to remove any one of those clues there would be multiple valid solutions. But not all of those sets are the same size. So, if I give you two different sets of starting clues for the same grid, one that has, say, twenty symbols and one that has twenty-three, it's entirely possible that the one with twenty-three symbols is a local minima while the one with twenty actually has three more than is necessary to solve the puzzle and may, therefore, be easier to solve.

Another way to view this is to consider a puzzle with a given starting set of clues. Now I want to add an additional clue to make it easier to solve. Are all of the possible symbols I could add equally valuable as a clue? Nope. Some might make it considerably easier, while others might have virtually no impact. Think of most puzzles that you work -- don't you find that there is almost always some low hanging fruit? Spaces that you almost immediately can fill in with very little effort? Would adding those symbols to the clue set therefore make the puzzle much easier to solve? Not much at all, since you would have filled in that symbol almost immediately anyway. But, again thinking of most of the puzzles you work, particularly the more challenging ones where you get to a point where you really have to work to figure out another symbol. Don't you find yourself saying that if you could only figure out the value of this or that symbol, it would all break open? Well, what if that symbol were added to the initial clue set? That would make the puzzle much easier to solve.
 

JohnSan

Joined Sep 15, 2018
126
For a given grid of symbols, there are many sets of initial clues that are locally minimal. A local minimum is a set of clues that is sufficient such that there is exactly one unique grid that is consistent with them and such that if you were to remove any one of those clues there would be multiple valid solutions. But not all of those sets are the same size. So, if I give you two different sets of starting clues for the same grid, one that has, say, twenty symbols and one that has twenty-three, it's entirely possible that the one with twenty-three symbols is a local minima while the one with twenty actually has three more than is necessary to solve the puzzle and may, therefore, be easier to solve.
If these are the starting values for the same grid and 23 is the 'minima', by only giving 20 of these values, how can that grid possibly be resolved? Let alone be easier?

I can see, if there is a minima starting set and some values are added, it may or may not be easier to solve, but just have fewer values to figure out.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,876
If these are the starting values for the same grid and 23 is the 'minima', by only giving 20 of these values, how can that grid possibly be resolved? Let alone be easier?

I can see, if there is a minima starting set and some values are added, it may or may not be easier to solve, but just have fewer values to figure out.
23 might be the minima for a particular set of starting values.

But there might be a different set of starting values for that same grid that only needs 20.

Imagine that I begin with the set of 23 that is minimal. That only means that if I remove one of those 23, the puzzle becomes unsolvable (i.e., has multiple valid solutions).

But what if I, instead, add an additional symbol? That symbol may add enough information that I can now remove four (not any four, but a specific four) of the symbols from the original set before I once again have a minimal set.

But, more likely, the minimal set of 23 and the minimal set of 20 may have few, if any, members in common.

For most solution grids, there are millions, and sometimes billions, of distinct minimal clue sets. The number of clues in each set range from 17 to 40.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,876
I finally got around to throwing together that Python script to estimate the size of the pool that they are drawing their puzzles from. Assuming that they are randomly drawing them from a fixed pool, I calculated the probability of seeing the specific history or repeats that I've actually seen. The odds of each are extremely low -- on the order of 1e-200. But looking at the relative probabilities allows me to not only choose the most likely size, but also to put bounds on that size to a particular confidence level. The current result, after collecting a bit over 500 distinct games (I didn't start tracking repeats until I had an archive of nearly 150 -- I was capturing solved grids for a different purpose initially) is that the best estimate for the pool size is 877 boards with a 90% confidence that the true number of boards lies between 824 and 948.

The relative probabilities are shown below:

pool_size_distribution.png

I also generated the progressive estimates, along with the lower and upper limits of the 90% confidence interval, as the data was collected and those are shown below:

pool_size_progressive_estimate.png
As expected, the data starts off pretty erratic, but starts settling down pretty firmly after about 100 observations and is starting to get smoother after about 500 observations. But there is still an upward trend, so I suspect that the true value is higher than the current estimate.

As a test, both of the basic correctness of the data and also to see if the assumption about fixed pool size was reasonable (or at least not too unreasonable), I generated a test data set where I drew from a pool of 877 items and started with an archive of the same size as the real data and drew the same number of items from the pool. The results were pretty consistent with the live estimate, coming in at 904 <866, 953>, compares pretty favorably with 877 <824, 948>. To judge how favorably, I'd could do a Monte Carlo experiment and see what the distributions of the estimates and the bounds look like. That should be pretty easy to set up and run at this point, so maybe I'll do that sometime soon.
 

joeyd999

Joined Jun 6, 2011
6,307
I asked Grok to write a Python script based upon the original prompt.

Would you be interested in running it to compare your results? I would like to know if the results match, or if Grok is just off in la la land.

Sodoku Pool Size Estimator:
import math
from typing import List, Optional

def estimate_pool_size_simple(M: int, D: int) -> int:
    """
    Simple and fast estimator using expected number of distinct puzzles.
    Solves E[D] = N * (1 - (1-1/N)^M) ≈ observed D
    """
    if D >= M:
        return D  # No collisions yet → lower bound only
    
    # Binary search for N
    low = D
    high = 10**18  # Huge upper bound
    best_n = D
    best_error = float('inf')
    
    while low <= high:
        mid = (low + high) // 2
        expected_d = mid * (1 - (1 - 1/mid)**M) if mid > 1 else M
        error = abs(expected_d - D)
        
        if error < best_error:
            best_error = error
            best_n = mid
        
        if expected_d < D:
            low = mid + 1
        else:
            high = mid - 1
    
    return best_n


def estimate_pool_size_chao(f1: int, f2: int, D: int) -> int:
    """Chao estimator using frequency of frequencies (good when many singles)"""
    if f2 == 0:
        return D + f1 * (f1 - 1) // 2  # fallback
    return D + (f1 * f1) // (2 * f2)


def log_likelihood(N: int, sequence: List[bool]) -> float:
    """Full maximum likelihood given sequence of new/repeat flags"""
    if N < len([x for x in sequence if x]):
        return -float('inf')
    ll = 0.0
    k = 0  # number of distinct seen so far
    for is_new in sequence:
        if is_new:
            ll += math.log((N - k) / N)
            k += 1
        else:
            ll += math.log(k / N)
    return ll


def estimate_pool_size_mle(sequence: List[bool], n_candidates: int = 1000) -> int:
    """
    Maximum Likelihood Estimator using the full sequence of new/repeat outcomes.
    sequence[i] = True if the (i+1)th puzzle was new, False if repeat.
    """
    D = sum(1 for x in sequence if x)
    M = len(sequence)
    
    # Rough starting point
    N_guess = estimate_pool_size_simple(M, D)
    
    best_N = N_guess
    best_ll = -float('inf')
    
    # Search around the guess (widen range as needed)
    search_range = max(1000, N_guess // 5)
    low = max(D, N_guess - search_range)
    high = N_guess + search_range * 2
    
    for N in range(low, high + 1):
        ll = log_likelihood(N, sequence)
        if ll > best_ll:
            best_ll = ll
            best_N = N
    
    return best_N


# ====================== Example Usage ======================
if __name__ == "__main__":
    # Replace with your actual data
    M = 500          # total puzzles played
    D = 420          # distinct puzzles seen
    
    print("Simple estimator:", estimate_pool_size_simple(M, D))
    
    # Example Chao (if you track frequencies)
    f1 = 300  # puzzles seen exactly once
    f2 = 80   # seen exactly twice
    print("Chao estimator:", estimate_pool_size_chao(f1, f2, D))
    
    # Full MLE example (best but requires sequence)
    # sequence = [True, True, False, True, ...]  # True=new, False=repeat
    # print("MLE estimator:", estimate_pool_size_mle(sequence))
How to use it

  1. Run the script and fill in your current M and D.
  2. For higher accuracy, track the full sequence list (True = new puzzle, False = repeat) and use the MLE function.
  3. If you also keep counts of how many times each puzzle appeared, the Chao estimator is excellent.
 

joeyd999

Joined Jun 6, 2011
6,307
It also offered:

Would you like me to:


  • Add confidence intervals?
  • Make a version that reads data from a CSV/file?
  • Optimize the MLE search with scipy (faster for very large data)?
  • Include a simulation function so you can test how well the estimators perform?
 
Top