algorithm for finding a match in a list of strings

WBahn

Joined Mar 31, 2012
32,877
I implemented a hash table for this purpose years ago and have used it scads of times. Mine handled collisions by inserting the new element at the next non-NULL location in the array (the index of which is derived from the hash function). If it happens a lot, use a larger array. Obviously this method requires a lot of memory. If you don't have memory, then I guess the binary tree. Alternatively, a binary search on a sorted list. Sort as part of the insertion or qsort after the fact. Lots of choices.
That's a technique that is known as "linear search" and it is perhaps the simplest and least sophisticated method dealing with collisions and, not surprisingly, yields poor performance in comparison. But for many, many applications it is still "good enough" and is a clear winner compared to other non-hash-table-based approaches. The problem with linear search is that it tends to become a collision magnet because you get clumps of entries that are occupied and each collision anywhere in the clump makes the clump grow, thus increasing the likelihood that future entries will collide with some entry in the clump. So you usually progressively end up with a very non-random distribution of entries in the table. When you search, if you hit a clump, you are now reduced to the performance of a linear search across the clump. One advantage of linear search is that you will be able to insert and later find any entry right up until the table is completely filled, but the performance starts taking a real hit once the table load starts going much above a pretty small fraction, say 10% or so.

The next step up is some kind of systematic search from the collision point, with quadratic being the simplest. That spreads things out and reduces the clumps, but it does nothing to shorten the list of entries that have to be searched that actually collide with the sought entry. It also has the possibility of getting into a trap in which, for that hash value, the table appears full even when it might be nearly empty. One way of dealing with that is double hashing.

A step up beyond that is to put a data structure at each entry in the table that contains all of the items that collided. That eliminates the interaction between different hash values and makes it so that you aren't as concerned about keeping a low density in your hash table. In fact, you can still get significant benefits even if the hash table is considerably smaller than the number of ites stored in it (a loading factor in excess of 100%). If you have a loading factor of, say, 50% then a significant fraction of the entries will contain a single item and few of the others will contain more than two or three (assuming a good hash function). So even a linear search of those entries is a very small performance hit, plus inserting the entries in that list in a sorted fashion is also very tame.
 

WBahn

Joined Mar 31, 2012
32,877
Since you only allocate storage for hash values that are used, it doesn't require significantly more memory than other alternatives.
What about the memory for the table itself? If my item hashes to a 16-bit index, then I need to have someplace in memory to go that is different for each one of those indices.

Now, I could do a sparse table by putting them into, say, 8-bit buckets so that I only need to start with 256 entries and each entry is a list with a map telling me where to find the data structure for the items in that bucket, but now I have to do some kind of a search on the indices in that bucket (which is a small number with a cap on it). So it's an expensive operation, but is still constant time.
 

WBahn

Joined Mar 31, 2012
32,877
In data structures I was taught the hash table should as a rule never be more than 10% occupied or performance degrades. All that space is a classic tradeoff that provides the benefit of speed. And I too played with optimizing hash functions and generally found there are lots of good ones, but no magic ones.
Depends on what you are doing and how you are doing it. Not all hash tables are created equal and they are so important for large data sets that a lot of people have spent a lot of effort to find ways to optimize them for different purposes.

And that "different purposes" is a key point. Even something as seemingly straight forward as the goal of optimizing a hash function (completely separate from accomplishing the goal, I'm talking about just agreeing on what the goal is) is not simple at all because one has to first clearly define the metric by which one decides whether candidate A is better or worse than candidate B, and that depends entirely on the application.
 

WBahn

Joined Mar 31, 2012
32,877
Thanks guys, thanks to the simple example provided by @xox, I can now understand how to do it with sorting and bsearch().

I am also interested in how it can done with a hash function. But I don't seem to understand how search work with hash function. This is how I think it will work:
  • hash (generate a unique code) for each string
  • sort the string according to the hashed value
  • hash the user input string
  • do a bsearch()
This doesn't seem to make sense to me.

Is there a link to a simple search algorithm with hash function mention above?
How about a really simple example to illustrate the point.

I have strings {dog, cat, butterfly, doggie, cattle}. Now I someone wants to access "cat" and I need to find it in memory. Where do I look. What you are currently doing is starting at one end of your list and walking along until you either find it or reach the end of the list. The amount of time that takes is O(N). If you sort the list you can do a binary search and the time required for that is O(log(N)), which is a HUGE improvement -- it's hard to comprehend just how huge. But we can do "better". Now imagine that I create table with 64 entries. I hash each entry in my list, which produces a 6-bit number which I interpret as an entry in the table

{dog->12, cat-> 47, butterfly -> 38, doggie -> 59, cattle -> 20}

I put each item at the corresponding location within the table.

Now when you want to find "cat" I take the hash and come up with 47 and can go straight to that entry. I now have an O(1) algorithm. No matter how many items are in the list, it takes me the same amount of time to find it. It's especially useful for determining that an item isn't in the list. If you ask for "chicken" and it hashes to 37, then I go look at entry 37 in the table and since it is empty, I know immediately that "chicken" isn't in the list, even if my list has a billion items in it. But what if I add "fish" to the list and it hashes to 59? That collides with "doggie" so I have to deal with that somehow.

There are many ways to do this, each with it's pros and cons (if that weren't the case, then there would be a single way to do it). In fact, that's true for hash tables themselves, the reason why we don't use hash tables for everything is because they have their downsides, too. Engineering is about compromises and choosing the right balance of compromises to get to a solution that is "good enough" for the problem at hand.
 

WBahn

Joined Mar 31, 2012
32,877
One modifiction I do with binary search is I don't do exactly 1/2. I use a power of two for the "center". if I had 500 elements, 1-256 is the first set and 257-500 is the second.
That's pretty common since it allows bit banging. We do the same thing when we apply it manually. Just last night I was trying to find which line in an assembly program was causing the simulator to choke (it's a real brain-dead simulator, so it only said that some line was malformed). The listing was over 31K entries. So I deleted about half of the entries and tried loading it and it loaded, so I knew the entries were in the other half. I didn't even try to split it exactly in half, I just moved the scroll bar to the midway point and selected a nearby easy to remember line number (16000) and deleted everything after that. When I got down to 500 I didn't go to 250, I went to 200. In well under five minutes I had identified the offending instruction and had my translator fixed.
 

WBahn

Joined Mar 31, 2012
32,877
This is one way I normally use it, I understand that I don't really need to do this for a small list. I thought if I could do it faster if the list is very long.
It's important to put some kind of scale of "very long". Almost any method has overhead that makes it WORSE for lists shorter than some amount. You only get a net gain once the problem size grows above some threshold and that threshold varies both with the method and, to some degree, with the specifics of the problem itself. The discussion is very different if we are talking about lists with hundred of items, versus tens of thousands of items, versus hundreds of millions of items.
 

Thread Starter

bug13

Joined Feb 13, 2012
2,002
How about a really simple example to illustrate the point.

I have strings {dog, cat, butterfly, doggie, cattle}. Now I someone wants to access "cat" and I need to find it in memory. Where do I look. What you are currently doing is starting at one end of your list and walking along until you either find it or reach the end of the list. The amount of time that takes is O(N). If you sort the list you can do a binary search and the time required for that is O(log(N)), which is a HUGE improvement -- it's hard to comprehend just how huge. But we can do "better". Now imagine that I create table with 64 entries. I hash each entry in my list, which produces a 6-bit number which I interpret as an entry in the table

{dog->12, cat-> 47, butterfly -> 38, doggie -> 59, cattle -> 20}

I put each item at the corresponding location within the table...

This is the explanation I am hoping for, just to make sure I understand it correctly, do you mean something like this?? If I am correct, it seems like there is a lot of memory waste to get the speed.

C:
/* simple function pointer, no arg and no return */
typedef void (*func_t)(void);

/* hash table */
func_t hash_table[64];

struct list {
    char    *cmd;   /* user command */
    func_t  cb;     /* thing to do  */
};

/* list to string command that need to be hashed */
const struct list LIST[] = {
    {.cmd = "dog",      .cb = func_dog      },
    {.cmd = "cat",      .cb = func_cat      },
    {.cmd = "cattle",   .cb = func_cattle   },
    {.cmd = "fox",      .cb = func_fox      },
};

/* a magical function that take a string and return a hash code */
uint8_t hash_func(const char *str);

int main(int agvc, char *argv[]){

    /* hash string cmd and put them into the table */
    for(unsigned i = 0; i < list_size; i++){
        hash_index = hash_func(LIST[i].cmd);
        hash_table[hash_index].cb = LIST[i].cb;
    }

    for(;;){
        /* get user input */
        uint8_t input_str[SOME_SIZE] = {0};
        get_user_input(input_str);

        /* call function that match the command */
        hash_table[hash_func(input_str)].cb();
    }

}
 
Last edited:

Thread Starter

bug13

Joined Feb 13, 2012
2,002
It's important to put some kind of scale of "very long". Almost any method has overhead that makes it WORSE for lists shorter than some amount. You only get a net gain once the problem size grows above some threshold and that threshold varies both with the method and, to some degree, with the specifics of the problem itself. The discussion is very different if we are talking about lists with hundred of items, versus tens of thousands of items, versus hundreds of millions of items.
I think a quick sort() and bsearch() will do the job nicely for me. I just also want to know how to do it with hash function, with the purpose of learning more.

"Very long" in my case is more like a couple hundreds. I would say I normally have 30nish string command, if a user enter 5 or 6 commands in the same line, and I need to do a search for each command, it will be like 150 to 180 strings to search.
 

WBahn

Joined Mar 31, 2012
32,877
This is the explanation I am hoping for, just to make sure I understand it correctly, do you mean something like this?? If I am correct, it seems like there is a lot of memory waste to get the speed.
With a basic hash table that's correct -- that's the price you pay for what is important. But there are lots of ways of reducing that impact while still keeping the effort for the important tasks constant time (or nearly so).

Your idea is pretty much on track. The thing to remember is that you can't blindly put the item at the index because you have to allow for the possibility of collisions (unless you are using a prefect hash, which is almost never possible). Again, there are many ways to do this, each with its own set of pros and cons.
 
Last edited:

Thread Starter

bug13

Joined Feb 13, 2012
2,002
With a basic hash table that's correct -- that's the price you pay for what is important. But there are lots of ways of reducing that impact while still keeping the effort for the important tasks constant time (or nearly so).

You idea is pretty much on track. The thing to remember is that you can't blindly put the item at the index because you have to allow for the possibility of collisions (unless you are using a prefect hash, which is almost never possible). Again, there are many ways to do this, each with its own set of pros and cons.
Good to know I am on the right track, thanks for pointing out the collisions problem. I think I will leave this for now, have already learn a lot. Will dig deeper when I actually need to use hash table.

Thanks team!
 

xox

Joined Sep 8, 2017
936
I definitely agree with the less space, but I don't see how the claim of order of magnitude faster comes about. Both are log(n) and the multiplying constant out front is usually going to be dominated by the cost of comparing two items. The cost of computing the next index or computing the next link to follow are going to be pretty comparable and generally small compared to the comparison operation.


EDIT: I think I may not be understanding what you were taking away from the post you were responding to. Rereading it, I'm not sure what I'm taking away from it. When a "hash array of linked lists" was mentioned, I took that to mean a classic hash table that used a linked list at each hash index to handle collisions. Provided collisions are minimal, that's an O(1) search problem. But if you are saying that it is O(N), I'm guessing you interpreted it much differently than I did.
Well the hash mechanism itself is of course O(1). If that hash indexes into a list however, the time complexity of searching through THAT is obviously going to be O(N). Ideally that should not be much of an issue, but in practice it actually can be. Switching to flat, sorted, binary-searched arrays instead, you get the best of both worlds. Worst case scenario, you end up with an O(log(N)) search. Otherwise you should expect something much closer to O(1).

Speaking of binary searches, you can optimize that even further by storing both the hash and the key in the array. Then your comparison algorithm can simply inspect and compare hashes BEFORE any string matching is done.
 

WBahn

Joined Mar 31, 2012
32,877
Well the hash mechanism itself is of course O(1). If that hash indexes into a list however, the time complexity of searching through THAT is obviously going to be O(N). Ideally that should not be much of an issue, but in practice it actually can be. Switching to flat, sorted, binary-searched arrays instead, you get the best of both worlds. Worst case scenario, you end up with an O(log(N)) search. Otherwise you should expect something much closer to O(1).

Speaking of binary searches, you can optimize that even further by storing both the hash and the key in the array. Then your comparison algorithm can simply inspect and compare hashes BEFORE any string matching is done.
I don't know how, with a flat sorted binary-searched array, you can expect something much closer to O(1). Half of the time I will require the maximum number of comparisons -- half of the time it will be "worst case". Half of the remaining cases will be just one less that worst case. So why would I expect something much closer to constant time performance?
 

xox

Joined Sep 8, 2017
936
I don't know how, with a flat sorted binary-searched array, you can expect something much closer to O(1). Half of the time I will require the maximum number of comparisons -- half of the time it will be "worst case". Half of the remaining cases will be just one less that worst case. So why would I expect something much closer to constant time performance?
What I mean is in such cases where you have picked all of your "hash table parameters" correctly, then you shouldn't have many collisions anyway, so the end result should be "much closer" to O(1). I know, it was a sloppy statement and I should have qualified that better. In any case, a binary search is almost always faster than a list search, it takes up less space, and basically cost nothing "extra". It's just a flat, sorted array. Instead of searching it sequentially, you do a binary search. It's basically free, so why not?
 
Top