Why don't these approximations seem to converge?

Thread Starter

WBahn

Joined Mar 31, 2012
32,823
But why should I go to the trouble and hassle of finding and using some math package that can handle thousands of bits when I can just do something like

k = log(2) * D;

and get a number that is good to less than 1% even for D = 100 and gets me to 13 sig figs with D = 1e13?

As D gets larger, this approximation just gets better and better.

It's my understanding that even with state of the art bignum packages that evaluating elementary functions of 4000 bits is going to be a few hundred times slower than a 64-bit representation, even if no hardware support is available for the 64-bit number, which is usually not the case. So we are more likely talking about at least a factor of a thousand in practice.

Yes, at some point k can't be represented directly as a limited-precision floating point value, but that's fine. Long before we get there, we can scale D and track the exponent separately.

And remember -- the point of the thread was to explore why two approximations weren't agreeing. Taking the approach of "well, just use arbitrary-precision arithmetic and then don't do any approximations" completely fails to address the point.

Furthermore, the point of the formula in the first place is not to come up with a specific number k for a specific number D, but to come up with a formula that shows how k scales with D. Cranking numbers all day long with a bignum library isn't going to come close to being able to say, very succinctly,

k = ln(2)·D = O(D)
 

Tesla23

Joined May 10, 2009
560
Say I want to solve the following equation for k

\(
\( 1 \; - \; \frac{1}{D} \)^k \; = \; p
\)

for values of p near 0.5.

The exact solution is

\(
k \; = \; \frac{\ln{ \( p \)}}{\ln{ \(1 \; - \; \frac{1}{D} \)}}
\)

But if D is large, then calculating the ln() in the denominator is problematic.

So I can think of two approximations that should be valid when D is very large (as in something like 2^256).

The first is the approximation that

\(
\ln{ \( 1 \; + \; x \) } \; \approx \; x
\)

Applying this to the denominator of the solution, we get

\(
k \; = \; -D \ln{ \( p \)} \; = \; D \ln{ \( \frac{1}{p}\) }
\)

Another approximation is that

\(
\( 1 \; + \; x\)^k \; \approx \; 1 \; + \; kx
\)

Applying this to the original expression, we get

\(
k \; = \; D \( 1 - p \)
\)

If we set p = 0.5, then the first approximation yields

\(
k \; = \; D \ln{\( 2 \)} \; = \; 0.693 D
\)

While the second yields

\(
k \; = \; 0.5 D
\)

For several reasons, I have more confidence in the first approximation. But I would expect both approximations to converge as 1/D goes to zero (meaning that the ratio of the two approximations approaches unity), yet clearly they don't.

Any insights would be appreciated.
Some scribbling you might find interesting WBahn:

looking at the binomial expansion of
\(
\( 1 + x\)^k = 1 + kx + \frac{k(k-1)}{2!}x^2 + \frac{k(k-1)(k-2)}{3!}x^3 + ...
\)

collecting the kx terms:
\(
\( 1 + x\)^k = 1 + kx + \frac{k^2x^2}{2!} + \frac{k^3x^3}{3!} + \frac{k^4x^4}{4!} + ... + O(xp(kx))
\)

where p(y) is some polynomial.

so
\(
\( 1 + x\)^k \approx e^{kx} + O(xf(kx))
\)

f(y) is some function - the limit of p(y) as k→∞.

I think this approximation should be good for small x as long as kx is bounded (I haven't given the exact limits much thought. ). Note that for small kx it reduces to the 1+kx approximation. Using it in your problem gives you the same result as taking logs.
 

Thread Starter

WBahn

Joined Mar 31, 2012
32,823
Yep. I used that approximation in the next part of the lesson I was preparing where I needed to come up with an asymptotic expression for the Birthday Paradox.

The exact series for this is a product series. When p = the probability of no collisions out of a group of k independent samples from a pool of D possible values (with replacement), you get

\(
p(k) \; = \; \prod_{i=0}^{k-1} {\frac{D-i}{D}} \; = \; \frac{D!}{D^k\(D-k\)!}
\)

At first I tried using Stirling's formula on the factorial. It's a bit of a mess and I kept coming up off by a factor of sqrt(2). I still need to go back and visit it.

But the product expression can be approximated by

\(
p(k) \; = \; \prod_{i=0}^{k-1} {\frac{D-i}{D}}
\;
p(k) \; = \; \prod_{i=0}^{k-1} {\(1 \; - \; \frac{i}{D} \)}
\;
p(k) \; = \; \prod_{i=0}^{k-1} {e^{\frac{-ix}{D}}}
\)

taking the log of both sides turns this into an arithmetic series

\(
\ln{\(p(k)\)} \; = \; \sum_{i=0}^{k-1} {\frac{-ix}{D}}
\;
\ln{\(p(k)\)} \; = \; -\frac{1}{D} \sum_{i=0}^{k-1} {i}
\;
\ln{\(p(k)\)} \; = \; -\frac{1}{D} \frac{k\(k-1\)}{2}
\)

If k is sufficiently large (and it most certainly is), then we can ignore the -1 very handily

\(
\ln{\(p(k)\)} \; = \; -\frac{1}{D} \frac{k^2}{2}
\)

Solving for k, we've got

\(
k \; = \; \sqrt{2\ln{\frac{1}{p}}D}
\)

With a hash function, D = 2^N where N is the size of the hash

\(
k \; = \; \sqrt{2\ln{\frac{1}{p}}2^N}
\;
k \; = \; \sqrt{2\ln{\frac{1}{p}}}2^{\frac{N}{2}
\)

Showing that a Birthday Attack requires on the order of the square root of the number of digests or, equivalently, that the effective width of the hash is cut in half by such an attack.

This has profound implications.

If we have a 64-bit hash function, then if we could compute one million hashes each second it would take about 400,000 years before we could expect to have a 50/50 chance of finding a message that has a specific digest. But if we only care about finding two messages that have the same digest, we have a 50/50 chance of achieving that in less than an hour and a half.
 

MrAl

Joined Jun 17, 2014
13,702
But why should I go to the trouble and hassle of finding and using some math package that can handle thousands of bits when I can just do something like

k = log(2) * D;

and get a number that is good to less than 1% even for D = 100 and gets me to 13 sig figs with D = 1e13?

As D gets larger, this approximation just gets better and better.

It's my understanding that even with state of the art bignum packages that evaluating elementary functions of 4000 bits is going to be a few hundred times slower than a 64-bit representation, even if no hardware support is available for the 64-bit number, which is usually not the case. So we are more likely talking about at least a factor of a thousand in practice.

Yes, at some point k can't be represented directly as a limited-precision floating point value, but that's fine. Long before we get there, we can scale D and track the exponent separately.

And remember -- the point of the thread was to explore why two approximations weren't agreeing. Taking the approach of "well, just use arbitrary-precision arithmetic and then don't do any approximations" completely fails to address the point.

Furthermore, the point of the formula in the first place is not to come up with a specific number k for a specific number D, but to come up with a formula that shows how k scales with D. Cranking numbers all day long with a bignum library isn't going to come close to being able to say, very succinctly,

k = ln(2)·D = O(D)

Hello,

I already told you it's up to you what you want to use, it's always up to the individual because they are the only ones that know exactly what they are going to use it for. You could state a million reasons why you dont want to use it, i could state a million reasons why you dont want to use it too.

Also, i already addressed the processing time issue. There's no way it can be as fast as the innate processor speed because it has to do a lot more calculations per crunch. For addition this would be over two times longer for every two fold increase in precision, for multiplication more than 6 times longer for every two fold increase in precision, so it is exponentially slower for every two fold increase in precision. But even though that can add up to a lot, on a modern computer sometimes it just doesnt matter because the overall calculation is still fast.
With my old old computer that ran at 4MHz i had to calculate sines and cosines using a series, which of course took longer than today's number processors. I didnt throw it in the garbage because it took longer i used if for millions and millions of calculations.

Good luck with it.
 
Top