Analyzing Random Numbers

bogosort

Joined Sep 24, 2011
696
Well one interesting example is the digits of pi. They pass every test for true randomness yet in two dimensions we know that pi is the ratio of the circumference of a circle to its diameter. So in one respect the digits are random, yet when we think about it in another way it's perfect and can only be one way.
On the contrary, I'd say that there is nothing random about the digits of pi, as it has tiny Kolmogorov complexity: with a few short words and symbols we can perfectly describe any length sequence from pi, including infinite. A test that certifies a sequence from pi as random is a failed test. In fact, we can easily use induction on sequences from pi to prove that any test for randomness will always give an infinite number of false positives.

The mean of an infinite number of samples of a true random variable is 1/2 of the sum of the two extremes (given every number between is generated also).
I don't know how you calculate the mean or pick the extremes from an infinite sequence, but in the finite case this is almost never true for sequences chosen from a random distribution, whether Gaussian or uniform or whatever. In fact, I'd say that if a sequence has this property, it's a good sign that the sequence is not random.
 

Ya’akov

Joined Jan 27, 2019
9,136
Though it is not precisely addressing this topic, the discussion reminded me of a wonderful video.

In the late 50s and early 60s, the Physical Science Study Committee produced a series of really fantastic films for secondary education. They have some of the best demos I've seen, and they feature distinguished lecturers.

They are all worth watching ("Frames of Reference" is really great), but this one fits here. Give it a watch; it's really good.

 

nsaspook

Joined Aug 27, 2009
13,260
Though it is not precisely addressing this topic, the discussion reminded me of a wonderful video.

In the late 50s and early 60s, the Physical Science Study Committee produced a series of really fantastic films for secondary education. They have some of the best demos I've seen, and they feature distinguished lecturers.

They are all worth watching ("Frames of Reference" is really great), but this one fits here. Give it a watch; it's really good.

Yes, I've seen most of them, a great set.
https://archive.org/details/frames_of_reference
 

wayneh

Joined Sep 9, 2010
17,498
In the late 50s and early 60s, the Physical Science Study Committee produced a series of really fantastic films for secondary education. They have some of the best demos I've seen, and they feature distinguished lecturers.
The existence of really great lectures on various topics is what got me thinking about the future of education. I see no reason to make students suffer through sub-par lectures when great presentations of the topic are already in the can.

At the U. of Chicago when I was there, all classes and professors received a rating at the end of every quarter. There were maybe 30 questions about the time spent, the value of the information, the quality of the professor and so on. And believe me, the 5-star professors were head-and-shoulders above the 3-star ones. The 2-star ones got culled.

From that experience on, it seems silly to me to make students sit through a 3-star lecturer when instead you could easily video the 5-star lecturer and be done with it. If the topic evolves a few years later, you refresh and remake it.

Some may say it's better to have a live lecturer. I call BS. A lecture hall environment, where you interact with other students and perhaps a teacher, has value but there's no reason the lecture itself can't be canned.
 

Ya’akov

Joined Jan 27, 2019
9,136
Some of my colleagues recorded their lectures then used the lecture time as recitation instead. This was far better and resulted in much more actual instruction.
 

MrAl

Joined Jun 17, 2014
11,461
On the contrary, I'd say that there is nothing random about the digits of pi, as it has tiny Kolmogorov complexity: with a few short words and symbols we can perfectly describe any length sequence from pi, including infinite. A test that certifies a sequence from pi as random is a failed test. In fact, we can easily use induction on sequences from pi to prove that any test for randomness will always give an infinite number of false positives.


I don't know how you calculate the mean or pick the extremes from an infinite sequence, but in the finite case this is almost never true for sequences chosen from a random distribution, whether Gaussian or uniform or whatever. In fact, I'd say that if a sequence has this property, it's a good sign that the sequence is not random.
Hi,

Ok do some searches on the web.
If you don't know how to take limits then learn :)

We are talking random as in random between two limits. If there is equal probability for all samples then the average comes out to one half.
In theory flipping a coin is 50-50, so if we call heads 1 and tails 0 then the average is 1/2. I don't see how you can get around this, but feel free to try.
 

bogosort

Joined Sep 24, 2011
696
Ok do some searches on the web.
If you don't know how to take limits then learn :)
Sigh.

We are talking random as in random between two limits. If there is equal probability for all samples then the average comes out to one half.
Again, this isn't true. You were talking about certifying a sequence as random or not. Well, flip a fair coin 100 times, associating 1 to heads and 0 to tails. You now have a sequence generated by a random variable chosen from a uniform distribution. But if you calculate the mean value, there's a 92% chance that it will not be exactly 0.5. So, your test for randomness fails most of the time. The longer the sequence, the worse your test performs.

Worse still, your test for randomness will certify an infinite number of decidedly not random sequences, namely, every sequence of consecutive positive integers. I don't know about you, but {1, 2, 3, 4, 5, ..., 999, 1000} doesn't look random to me.
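If you want to check that 92% figure yourself, here is a minimal Monte Carlo sketch (Python assumed; the estimate will wobble a little from run to run):

```python
import random

# Estimate how often the mean of 100 fair coin flips is exactly 0.5,
# i.e. exactly 50 heads out of 100.
trials = 100_000
hits = sum(
    sum(random.randint(0, 1) for _ in range(100)) == 50
    for _ in range(trials)
)
print(hits / trials)  # roughly 0.08, so about 92% of sequences miss 0.5
```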
 

xox

Joined Sep 8, 2017
838
Another useful metric is to observe the degree to which a sample can be compressed. The larger the size of the data and the closer it is to raw binary (in other words, stripped of extraneous formatting and not in some textual output representation such as a list of decimal numbers), the better. I've used this method in the past and confirmed that it agrees with standard statistical tests quite well (Diehard and TestU01 specifically). Incidentally, algorithms related to arithmetic coding seem to perform the best.
 

bogosort

Joined Sep 24, 2011
696
Another useful metric is to observe the degree to which a sample can be compressed.
Indeed, and information compressibility is tied to Kolmogorov complexity. A string of a million 'A's has essentially no complexity, as it can be compressed into a much shorter string, such as 'A{1E6}'. A random string is incompressible, because the shortest description of it is the string itself!
 

wayneh

Joined Sep 9, 2010
17,498
Indeed, and information compressibility is tied to Kolmogorov complexity. A string of a million 'A's has essentially no complexity, as it can be compressed into a much shorter string, such as 'A{1E6}'. A random string is incompressible, because the shortest description of it is the string itself!
What about the coin flips, though? Every run (a streak of heads or tails) could be replaced by its length. So HHHTTHTTHHH can be represented as 32123.
 

xox

Joined Sep 8, 2017
838
Hi,

Ok do some searches on the web.
If you don't know how to take limits then learn :)

We are talking random as in random between two limits. If there is equal probability for all samples then the average comes out to one half.
In theory flipping a coin is 50-50, so if we call heads 1 and tails 0 then the average is 1/2. I don't see how you can get around this, but feel free to try.
The easiest way to understand it is to view the data in terms of combinatorics. Suppose you were to flip a coin 4 times. The 16 possible outcomes are:

HHHH (4 H's)
HHHT (3)
HHTH (3)
HHTT (2)
HTHH (3)
HTHT (2)
HTTH (2)
HTTT (1)
THHH (3)
THHT (2)
THTH (2)
THTT (1)
TTHH (2)
TTHT (1)
TTTH (1)
TTTT (0)

(I generated the above by simply converting the numbers 0 through 15 to binary and then writing an H for every 0 and a T for every 1.)

Count the number of heads across all 16 outcomes and you get:

0 : 1/16
1 : 4/16 = 1/4
2 : 6/16 = 3/8
3 : 4/16 = 1/4
4 : 1/16

So the probability that half of the tosses turn out to be heads is 3/8 = 0.375 = 37.5%.
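If you'd rather not write the table out by hand, a brute-force enumeration gives the same tally (a quick sketch in Python):

```python
from itertools import product
from collections import Counter

# Enumerate all 2^4 = 16 outcomes of four coin flips and tally the head counts.
tally = Counter(outcome.count('H') for outcome in product('HT', repeat=4))
for heads in sorted(tally):
    print(f"{heads} : {tally[heads]}/16")

print(tally[2] / 16)  # 0.375, i.e. a 37.5% chance of exactly two heads
```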
 

xox

Joined Sep 8, 2017
838
What about the coin flips, though? Every run (a streak of heads or tails) could be replaced by its length. So HHHTTHTTHHH can be represented as 32123.
The string represents 11 bits of data, while the number 32123 takes roughly 15 bits.
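A quick sketch (Python assumed) makes the comparison concrete, treating the run lengths 3-2-1-2-3 as the single number 32123, as in the example above:

```python
from itertools import groupby
from math import ceil, log2

flips = "HHHTTHTTHHH"

# Run-length encode: HHHTTHTTHHH -> [3, 2, 1, 2, 3]
runs = [len(list(group)) for _, group in groupby(flips)]

raw_bits = len(flips)                                # 1 bit per flip = 11 bits
rle_bits = ceil(log2(int("".join(map(str, runs)))))  # 32123 needs about 15 bits
print(runs, raw_bits, rle_bits)                      # [3, 2, 1, 2, 3] 11 15
```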
 

bogosort

Joined Sep 24, 2011
696
What about the coin flips, though? Every run (a streak of heads or tails) could be replaced by its length. So HHHTTHTTHHH can be represented as 32123.
Sure, there's compression opportunity in any repetition, but that's more a reflection of the 2-valued nature of coin flips -- where, with only two possible values, we'd expect significant repetition -- than of its information content. Out of curiosity, I created four files, each with a million ASCII characters. One was a random sequence of 'H' and 'T' characters; another was all 'H' characters; another was 500,000 repetitions of "HT"; and the last was a million simulated rolls of a die. I then used gzip to compress each.

The coin flip file had a 6x compression ratio, while the all-Hs file and the HT file both had a 986x compression ratio (the HT file was only 2 bytes bigger than the all-Hs file). The die roll file was reduced by a factor of only 2.69, reflecting the fact that there's less repetition in rolls of dice than in coin flips.
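For anyone who wants to repeat the experiment, here is a rough sketch in Python, using the zlib module in place of the gzip command line; the exact ratios will differ somewhat, but the ordering should hold:

```python
import random
import zlib

n = 1_000_000  # one million ASCII characters per data set

datasets = {
    "random coin flips": "".join(random.choice("HT") for _ in range(n)),
    "all heads":         "H" * n,
    "alternating HT":    "HT" * (n // 2),
    "random die rolls":  "".join(random.choice("123456") for _ in range(n)),
}

for name, text in datasets.items():
    packed = zlib.compress(text.encode("ascii"), 9)
    print(f"{name}: {len(text) / len(packed):.1f}x compression")
```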
 

MrAl

Joined Jun 17, 2014
11,461
Sigh.


Again, this isn't true. You were talking about certifying a sequence as random or not. Well, flip a fair coin 100 times, associating 1 to heads and 0 to tails. You now have a sequence generated by a random variable chosen from a uniform distribution. But if you calculate the mean value, there's a 92% chance that it will not be exactly 0.5. So, your test for randomness fails most of the time. The longer the sequence, the worse your test performs.

Worse still, your test for randomness will certify an infinite number of decidedly not random sequences, namely, every sequence of consecutive positive integers. I don't know about you, but {1, 2, 3, 4, 5, ..., 999, 1000} doesn't look random to me.
Hi again,

From your recent post, the only thing left I can surmise is that you never actually did any testing; you just talked about it.
The only thing I have left to say, then, is: do some testing using an RNG, sum the results, and compute the average.
What will you see? Can you tell me without doing it first?
Let the number of samples go higher and higher. What can you generalize about the results?
I'd really like to hear your theory on this before you do any testing, but from what you were saying so far you won't be able to come out with the right results, unless I misunderstood you somehow.
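A few lines of Python will run that experiment, for anyone who wants to try it (uniform samples on [0, 1), running mean printed as the sample count grows):

```python
import random

# Average more and more uniform [0, 1) samples and watch where the mean goes.
for n in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
    samples = [random.random() for _ in range(n)]
    print(n, sum(samples) / n)
# The running mean closes in on 0.5 as n grows, though for any finite n
# it is almost never exactly 0.5.
```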
 

xox

Joined Sep 8, 2017
838
Hi again,

From your recent post, the only thing left I can surmise is that you never actually did any testing; you just talked about it.
The only thing I have left to say, then, is: do some testing using an RNG, sum the results, and compute the average.
What will you see? Can you tell me without doing it first?
Let the number of samples go higher and higher. What can you generalize about the results?
I'd really like to hear your theory on this before you do any testing, but from what you were saying so far you won't be able to come out with the right results, unless I misunderstood you somehow.
Not sure what I was thinking either. You're right, the limit obviously has to approach 1/2 eventually.
 

WBahn

Joined Mar 31, 2012
30,045
Not sure what I was thinking either. You're right, the limit obviously has to approach 1/2 eventually.
The details are in the fine print.

The odds that it will be within an arbitrarily small range centered on 1/2 will approach unity.

The odds that it will BE 1/2 will approach zero.
 

xox

Joined Sep 8, 2017
838
The details are in the fine print.

The odds that it will be within an arbitrarily small range centered on 1/2 will approach unity.

The odds that it will BE 1/2 will approach zero.
Well naturally, "probability" is really a misnomer. A better name for it would have been "the laws of possibility". :)
 

bogosort

Joined Sep 24, 2011
696
From your recent post, the only thing left I can surmise is that you never actually did any testing; you just talked about it.
The only thing I have left to say, then, is: do some testing using an RNG, sum the results, and compute the average.
What will you see? Can you tell me without doing it first?
Let the number of samples go higher and higher. What can you generalize about the results?
I'd really like to hear your theory on this before you do any testing, but from what you were saying so far you won't be able to come out with the right results, unless I misunderstood you somehow.
In order for N tosses of a fair coin to have a mean of exactly 0.5, you must have N/2 heads. This rules out any sequence with an odd number of tosses. In the case of an even number of tosses, the probability of getting exactly N/2 heads is

nCr(N, N/2) / 2^N

As 2^N grows faster than nCr(N, N/2), the probability of any particular sequence having exactly N/2 heads diminishes as the length of the sequence grows. Indeed, the probability approaches zero as N → ∞.
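Plugging a few values of N into that formula makes the trend plain (sketch in Python; math.comb requires 3.8+):

```python
from math import comb

# P(exactly N/2 heads in N fair flips) = nCr(N, N/2) / 2^N
for N in (2, 10, 100, 1_000, 10_000):
    print(N, comb(N, N // 2) / 2**N)
# Roughly 0.5, 0.25, 0.08, 0.025, 0.008 -- heading toward zero as N grows.
```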

Recall that you were talking about a practical test for determining whether a given (finite) sequence is random. I've given you "my theory" as to why it is a terrible test for randomness; if you disagree, please explain how my analysis fails.
 

WBahn

Joined Mar 31, 2012
30,045
Sure, there's compression opportunity in any repetition, but that's more a reflection of the 2-valued nature of coin flips -- where, with only two possible values, we'd expect significant repetition -- than of its information content. Out of curiosity, I created four files, each with a million ASCII characters. One was a random sequence of 'H' and 'T' characters; another was all 'H' characters; another was 500,000 repetitions of "HT"; and the last was a million simulated rolls of a die. I then used gzip to compress each.

The coin flip file had a 6x compression ratio, while the all-Hs file and the HT file both had a 986x compression ratio (the HT file was only 2 bytes bigger than the all-Hs file). The die roll file was reduced by a factor of only 2.69, reflecting the fact that there's less repetition in rolls of dice than in coin flips.
This also says something about the gzip compression algorithm.

If you have a file consisting solely of an 8-bit representation for 'H' and an 8-bit representation for 'T', then this can trivially be compressed by a factor of 8 (or arbitrarily close thereto) by simply encoding each of the two possibilities in a single bit (combined with a tiny amount of header for the dictionary needed to recover the original file).
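As a minimal sketch of that trivial scheme (Python assumed; pack_flips is just an illustrative helper, not any real file format):

```python
import random

def pack_flips(flips: str) -> bytes:
    # Map H -> 1 and T -> 0, read the whole string as one binary number,
    # and serialize it: 1 bit per flip instead of the 8 bits ASCII uses.
    bits = int("".join("1" if ch == "H" else "0" for ch in flips), 2)
    # A real format would also record len(flips) in a tiny header so that
    # leading 'T's (leading zero bits) survive the round trip.
    return bits.to_bytes((len(flips) + 7) // 8, "big")

flips = "".join(random.choice("HT") for _ in range(1_000_000))
print(len(flips), "->", len(pack_flips(flips)))  # 1000000 -> 125000, an 8x reduction
```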

This is in no way a criticism of gzip; all compression algorithms are designed with a certain type of data in mind and generally do far from optimal on other types of data. In fact, there exist data sets for any given compression algorithm for which the compressed data actually takes up MORE space than the uncompressed data. In general, those data sets are very much NOT random, so lack of compression alone is not a reliable indicator of randomness -- it can only hint at the possibility.

So just because one compression algorithm does very poorly on a particular data set does not really say much about whether that data set is truly random; it could simply be a data set that is too poor a match for the kind of data that the algorithm was designed for. What we CAN say is that if we DO get any significant compression (and we could probably put a threshold on it), then we have high confidence that the data is NOT truly random since all compression algorithms exploit nonrandomness in the data to achieve compression. But if we don't get any significant compression, about all we can say is that we haven't ruled out the possibility that the data is truly random and, perhaps, we have some reason to suspect that it might be.

If we try lots of different compression algorithms and none of them show any significant compression, then our suspicion that it's truly random will grow. But we can't escape the fact that there are infinitely more "types" of data than there are compression algorithms, so it is highly likely that many demonstrably non-random data sets will be poor matches for every compression algorithm yet devised -- even algorithms specifically designed to compress random-like data sets (though those would certainly be the ones whose results would most bolster our opinion).
 