Hashing Function: adding letters of variable

Thread Starter

zulfi100

Joined Jun 7, 2012
656
Hi,

I am trying to understand Hashing function. I am reading a book which says"

how can design a function h that allows the compiler to immediately access the position associated with each variable? All the letters of the variables name can be added together and the sum can be used as an index. in this case, the table needs 3,782 cells ( for a variable K made out of 31 letters “z,” h(K) = 31 . 122 = 3,782).


I cant undertand what is meant by 122 here:.


Some body please guide me.


Zulfi.
 

WBahn

Joined Mar 31, 2012
29,979
The ASCII code for 'z' is 122.

This is an absolutely horrible example of a hash function -- particularly for the example chosen. I hope it is an introductory example that immediately follows with, "But wait -- this is a horrible example because...."

Consider that, using this function, the variables "return" and "turner" will hash to exactly the same position.
 

WBahn

Joined Mar 31, 2012
29,979
Also, the claim that the table needs 3782 cells is nonsensical. It's a hash table for crying out loud.
 

Thread Starter

zulfi100

Joined Jun 7, 2012
656
Hi,
Thanks for your response. I cant understand what's wrong with the value 3782. Is this index not possible considering the worst case?

Zulfi.
 

WBahn

Joined Mar 31, 2012
29,979
Hi,
Thanks for your response. I cant understand what's wrong with the value 3782. Is this index not possible considering the worst case?

Zulfi.
How many possible 31 character variable names are there? Let's just restrict them to consisting only of lowercase letters, as opposed to the usual mix of uppercase, lowercase, digits, and underscores subject only to the constraint that the first character can't be a digit.
 

Papabravo

Joined Feb 24, 2006
21,159
How many possible 31 character variable names are there? Let's just restrict them to consisting only of lowercase letters, as opposed to the usual mix of uppercase, lowercase, digits, and underscores subject only to the constraint that the first character can't be a digit.
Q: Of all those character strings of length 1 to 31 how many of them hash to the same index?
A: Billion and Billions
 

WBahn

Joined Mar 31, 2012
29,979
Q: Of all those character strings of length 1 to 31 how many of them hash to the same index?
A: Billion and Billions
I agree. And that's part of the point I am trying to get across.

Notice that the description implies that the hash value for a variable name gives you direct access to that variable. It does not even mention the potential for collisions or how to deal with them. But perhaps that will get addressed shortly.
 
Last edited:

WBahn

Joined Mar 31, 2012
29,979
Hi,
Thanks for your response. I cant understand what's wrong with the value 3782. Is this index not possible considering the worst case?

Zulfi.
I need to revise my earlier comment about it being non-sensical. Their whole approach is pretty flawed, but let's give them the benefit of the doubt and assume that they are just using this as a contrived example.

Whatever hashing algorithm you come up with, you need a table that has entries for every possible hash value.

For their hash algorithm, the largest possible hash value would result from a variable whose name is a sequence of 31 lower case 'z' characters, which would yield a hash value of 3782. So that table needs to be big enough to allow that value. I'm pretty sure that is what they are trying to convey.
 

Papabravo

Joined Feb 24, 2006
21,159
I need to revise my earlier comment about it being non-sensical. Their whole approach is pretty flawed, but let's give them the benefit of the doubt and assume that they are just using this as a contrived example.

Whatever hashing algorithm you come up with, you need a table that has entries for every possible hash value.

For their hash algorithm, the largest possible hash value would result from a variable whose name is a sequence of 31 lower case 'z' characters, which would yield a hash value of 3782. So that table needs to be big enough to allow that value. I'm pretty sure that is what they are trying to convey.
You might actually want to make the table length a prime number and take the sum of the characters modulo the prime table length. 3779 is the largest prime less than 3782.
 

WBahn

Joined Mar 31, 2012
29,979
You might actually want to make the table length a prime number and take the sum of the characters modulo the prime table length. 3779 is the largest prime less than 3782.
While there are some advantages for using prime-number length hash tables, that would not apply to this algorithm since it is not producing a huge number that must be reduced modulo the table size. The largest possible value that the hash function can produce is 3782, so you would be wrapping around a very small number of possible variable names. In fact, in order to get wrapped, the variable name would have to consist of 31 character, of which a minimum of 27 of them would have to be 'z' (and then only if the other four were 'y') plus, with 30 'z's the remaining character could be no lower than 'w'.

Even in the case where you are reducing a gargantuan number modulo the table size, the table sizes are almost always integer powers of two so that the reduction can take place via a simple bitmask. To compensate for the fact that only the low order bits determine the hash, an augmentation function (or whitening function) is used to mix the upper bits into the lower bits before applying the mask.
 

Papabravo

Joined Feb 24, 2006
21,159
Yes. Actually I was thinking of something a bit smaller so that in fact some of the values on near 3782 would be folded back onto the lower numbers. It still does not remove the fundamental flaw that symbols which are close to each other map to adjacent sells in the table. For any hash algorithm you can write a test program to see how it distributes nearly identical symbols to hash indexes so that you can avoid this behavior. This was especially valuable in assemblers because assembly language programmers tended to use labels that are symbolically similar eg SYM00 and SYM01, especially for assemblers that limited the symbol size to five or seven characters.
 

Thread Starter

zulfi100

Joined Jun 7, 2012
656
Hi,

Thanks for these replies & efforts:



For their hash algorithm, the largest possible hash value would result from a variable whose name is a sequence of 31 lower case 'z' characters, which would yield a hash value of 3782. So that table needs to be big enough to allow that value. I'm pretty sure that is what they are trying to convey.



I feel the above is answer to my question.

In the context of Division based hashing functions, the book says

“It is best if TSize is a prime number, otherwise h(K) = (K mod p) mod TSize for some prime p > TSize can be used. However, nonprime divisors may work equally well as prime divisors provided they do not have prime factors less than 20.”

<
Q: Of all those character strings of length 1 to 31 how many of them hash to the same index?
A: Billion and Billions
>
2^31 = more than 2 Billion. There can only be 3782 unique values so large combinations of values may hash to the same index in the table (may be Billions as you say).

Following is complex

“While there are some advantages for using prime-number length hash tables, that would not apply to this algorithm since it is not producing a huge number that must be reduced modulo the table size. The largest possible value that the hash function can produce is 3782, so you would be wrapping around a very small number of possible variable names. In fact, in order to get wrapped, the variable name would have to consist of 31 character, of which a minimum of 27 of them would have to be 'z' (and then only if the other four were 'y') plus, with 30 'z's the remaining character could be no lower than 'w'.“

I did not try to understand it.

Zulfi.
 
Last edited:

WBahn

Joined Mar 31, 2012
29,979
In the context of Division based hashing functions, the book says

“It is best if TSize is a prime number, otherwise h(K) = (K mod p) mod TSize for some prime p > TSize can be used. However, nonprime divisors may work equally well as prime divisors provided they do not have prime factors less than 20.”
The reason for wanting to use a prime modulus is to that the distribution of the mapping between the raw values (the sum of the letter values in this case) and that hash table entries don't form significant patterns. The two step process they suggest is merely one way to reduce these patterns, but it is going to leave you with a non-uniform distribution even if the number of raw values is a multiple of the table size. This isn't the end of the world.

The problem with this approach is that it is computationally very expensive. Reducing a value modulo a prime involves integer division while reducing it module an integer power of two involves a simple bit mask. Thus the normal way is to perform bit mixing (which can be done very efficiently) and then reduce modulo an integer power of two.

Following is complex

“While there are some advantages for using prime-number length hash tables, that would not apply to this algorithm since it is not producing a huge number that must be reduced modulo the table size. The largest possible value that the hash function can produce is 3782, so you would be wrapping around a very small number of possible variable names. In fact, in order to get wrapped, the variable name would have to consist of 31 character, of which a minimum of 27 of them would have to be 'z' (and then only if the other four were 'y') plus, with 30 'z's the remaining character could be no lower than 'w'.“

I did not try to understand it.
The suggestion was to reduce the sum of the characters modulo 3779 just so that a prime number could be used. But doing so would have no effect on any but a tiny fraction of the possible variable names -- namely those whose sum is at least 3779. Anything less, and the modulo reduction has absolutely zero impact. So what would have to be true of a variable name in order for the sum of its characters to be at least 3779? First, it would have to be a full 31 characters long. No variable name shorter than that could possible total to 3779. Then, even if it were 31 characters long, an absolute minimum of 27 of them would have to be 'z'. Any fewer and it would not be possible for the sum to be 3779. And then the few characters that weren't 'z' would have to be no smaller than 'w' (and that character could only be used if ALL of the other 30 characters were 'z').

The point being that reducing the sum modulo 3779 accomplished nothing except adding a computationally expensive operation to an already piss-poor hash function.
 
Top