algorithm for finding a match in a list of strings

dl324

Joined Mar 30, 2015
16,916
I implemented a hash table for this purpose years ago and have used it scads of times. Mine handled collisions by inserting the new element at the next non-NULL location in the array (the index of which is derived from the hash function).
Same here. Wrote a program for the company I worked at and it was used a countless number of times for over 20 years. I needed to keep track of cell names in Stream files (graphical databases) and used a 16K element hash array. If multiple names hashed to the same value, each array entry could be a linked list.
If it happens a lot, use a larger array.
Or use a better hash function. I studied hash performance in my program for years. In most cases, a very small percentage of names hashed to the same value.
Obviously this method requires a lot of memory
Since you only allocate storage for hash values that are used, it doesn't require significantly more memory than other alternatives.
 

402DF855

Joined Feb 9, 2013
271
In data structures I was taught the hash table should as a rule never be more than 10% occupied or performance degrades. All that space is a classic tradeoff that provides the benefit of speed. And I too played with optimizing hash functions and generally found there are lots of good ones, but no magic ones.
 

MrChips

Joined Oct 2, 2009
30,802
In data structures I was taught the hash table should as a rule never be more than 10% occupied or performance degrades. All that space is a classic tradeoff that provides the benefit of speed. And I too played with optimizing hash functions and generally found there are lots of good ones, but no magic ones.
That is saying 90% of the allocated space is not used.
I look at it differently. In order to store collisions I will allocate say, 20% additional space.
 

Thread Starter

bug13

Joined Feb 13, 2012
2,002
Thanks guys, thanks to the simple example provided by @xox, I can now understand how to do it with sorting and bsearch().

I am also interested in how it can done with a hash function. But I don't seem to understand how search work with hash function. This is how I think it will work:
  • hash (generate a unique code) for each string
  • sort the string according to the hashed value
  • hash the user input string
  • do a bsearch()
This doesn't seem to make sense to me.

Is there a link to a simple search algorithm with hash function mention above?
 

MrChips

Joined Oct 2, 2009
30,802
It would help if you told use exactly you are trying to do.
How many letters in the word?
Total number of words?
What is the application?
 

dl324

Joined Mar 30, 2015
16,916
But I don't seem to understand how search work with hash function. This is how I think it will work:
  • hash (generate a unique code) for each string
  • sort the string according to the hashed value
  • hash the user input string
  • do a bsearch()
This doesn't seem to make sense to me.
That isn't quite how it works.

You define an array to hold the hashed string values. If that element of the array has already been allocated (all of the elements start out with NULL pointers), you add a link to the first string that hashed to that value. If there were already multiple strings that hashed to that value, you add the latest one to the last element in the linked list; if it's unique.

You don't do a binary search. When you want to see if a string has already occurred, you hash the string. If that value has multiple linked strings, you walk the list until you find a match. If you don't find a match, you add it to the list of strings seen that hashed to that value.
 
One modifiction I do with binary search is I don't do exactly 1/2. I use a power of two for the "center". if I had 500 elements, 1-256 is the first set and 257-500 is the second.
 

Thread Starter

bug13

Joined Feb 13, 2012
2,002
It would help if you told use exactly you are trying to do.
How many letters in the word?
Total number of words?
What is the application?
This is one way I normally use it, I understand that I don't really need to do this for a small list. I thought if I could do it faster if the list is very long.
C:
enum type_t {
    STRING_ONLY,            /* string only command                                             */
    STRING_AND_NUMBER,        /* string follow by a number, will extract the number separately */
    etc...
}

struct list_t {
    type_t    type;
    char     *cmd;            /* user command                 */
    void     (*cb)    (void);    /* thing to do for the command     */
}

const struct list_t list[] ={
    {.cmd = "reboot",        .cb = func_1, .type = STRING_ONLY        },
    {.cmd = "cmd2",            .cb = func_2, .type = STRING_ONLY        },
    {.cmd = "cmd3",            .cb = func_3, .type = STRING_ONLY        },
    {.cmd = "cmd4",            .cb = func_4, .type = STRING_ONLY        },
    {.cmd = "any_length",     .cb = func_5, .type = STRING_ONLY        },
    {.cmd = "id",             .cb = func_5, .type = STRING_AND_NUMBER    },
    /* commands will be added as needed. And
     * commands will be any length, but usually
     * short and easy word
     */
};

void main(void){
   
    /* get user input, usually from some com.
     * eg RS232/485/UDP. Usually from interrupts
     * examples could be:
     * "relay_mode toggle"
     * "relay_mode momentary"
     * "relay_mode normally_close"
     * "relay_mode normally_open"
     * "set id:10"
     * ""
     */
    uint8_t input_data[SOME_SIZE];
    func_get_user_input(input_data);
   
    /* extract the one word from the input, separated by space or : */
    char word[SOME_SIZE];
    word = extract_one_word(input_data);
   
    /* find a match from the list and do things */
    for(unsigned i = 0; i < list_size; i++){
        if (str(word, list[i].cmd) == 0){
            list[i].cb();
        }
    }
}
 
Last edited:

MrChips

Joined Oct 2, 2009
30,802
Is this hypothetical or an actual application?

If you are sending commands from A to B, no one types in "relay_mode normally_close".
This is verbose and very inefficient. That is why we use mnemonics as "R1NO".

Hence you need to sit down and carefully define your command protocol.

Searching through a list of 25 short commands is very different from searching through a dictionary of 10,000 words.

We need to know the application.
 

Thread Starter

bug13

Joined Feb 13, 2012
2,002
Is this hypothetical or an actual application?

If you are sending commands from A to B, no one types in "relay_mode normally_close".
This is verbose and very inefficient. That is why we use mnemonics as "R1NO".

Hence you need to sit down and carefully define your command protocol.

Searching through a list of 25 short commands is very different from searching through a dictionary of 10,000 words.

We need to know the application.
In my machine to machine communicate, I simply use one or two byte code embedded in the payload of a packet.

But unfortunately, the above example are actually for human to type in. They want to be like this, most people on the field wont' be using our stuff a lot. "R1NO" doesn't make a lot of sense for them, as they don't use our stuff often enough. But if you tell them type "relay_mode normally_close" over the phone, is a lot easier to them to understand.

for the record, my full struct is actually like this, so when people type in [help], I can show show them a list of commands and description. But they still prefer "relay_mode normally_close" over "R1NC" (While I am thinking about this, maybe I should implement auto complete with TAB...)

C:
struct list_t {
    type_t        type;
    char         *cmd;                    /* user command                  */
    char        *description;            /* explain what the command     */
    void         (*cb)    (void);        /* thing to do for the command    */
}

const struct list_t list[] ={
    {.cmd = "reboot",    .cb = func_1, .descript = "reboot the device",    .type = STRING_ONLY        },
    {.cmd = "cmd2",        .cb = func_2, .descript = "does this..."        .type = STRING_ONLY        },
};
 
@bug13

One modifiction I do with binary search is I don't do exactly 1/2. I use a power of two for the "center". if I had 500 elements, 1-256 is the first set and 257-500 is the second.
You don't end up with fractions. If it's not in the 257-500 then
256/2 = 64; 128/2 = 32; 32/2 =16; 16/2 = 8; 8/2=4; 4/2=2; 2/2=1 (all integer divides or shifts)

otherwise let's say you have 501. 501/2 = 250.5 etc.
 

MrChips

Joined Oct 2, 2009
30,802
I follow.
I would like to make the following suggestions:

1) All commands should be case insensitive. If the user enters either lower case or upper case, your code should covert all to one (perhaps UPPPER CASE) before parsing.

2) Omit the underscore and any undesirable characters.

3) Spaces should be optional. Delete the space in the parse stage.

4) Abbreviate the command as much as possible.
Here are examples:
reboot
r1 open
r1 close
 

WBahn

Joined Mar 31, 2012
30,052
Or construct the list as a binary tree...
That's effectively just a method of sorting the list, but it's an option. However, you have to be careful HOW you construct the tree since any significant imbalance will rapidly trash the performance gains, possibly making it worse than a linear search through a list because of the tree navigation overhead. Also, searching even a perfectly balanced binary tree is usually slower than doing a binary search on a sorted list because of that same navigation overhead, but if the comparison cost is significantly higher than the navigation cost (which is usually is, but not always), then either is pretty comparable.
 

WBahn

Joined Mar 31, 2012
30,052
You can certainly build hash-tables with hash functions, but the key there is using the right "underlying data structure". Something like a binary tree for example. ;)
If you use a binary tree to store the hashed strings, then your search performance will be basically the same as using a binary tree to store the strings directly. Assuming you do a proper job of managing the balancing of the tree, it takes log(n) time to find a string, but with the hashed strings you have the additional cost of hashing the string you are searching for. So I don't see a significant advantage to using hashed strings from a search standpoint. But if you have a very expensive comparison function -- say a data structure with a lot of stuff crammed in it -- then comparing hashes can significantly lower the comparison cost because most of the time you only have to compare the hash and even just a 32-bit hash spreads things out so that there will be few collisions even on lists of millions of items IF the hash function is good enough.

To get the full benefit of fast searches using hash tables you need to be able to use the hash value to immediately go to the entry of interest (or "close" to it in search terms), which involves an array the size of the hash space.
 

WBahn

Joined Mar 31, 2012
30,052
That's a really lame suggestion. Search time for a list is O(N). An array with binary search is guaranteed to require less memory and perform orders of magnitude faster.
I definitely agree with the less space, but I don't see how the claim of order of magnitude faster comes about. Both are log(n) and the multiplying constant out front is usually going to be dominated by the cost of comparing two items. The cost of computing the next index or computing the next link to follow are going to be pretty comparable and generally small compared to the comparison operation.

EDIT: I think I may not be understanding what you were taking away from the post you were responding to. Rereading it, I'm not sure what I'm taking away from it. When a "hash array of linked lists" was mentioned, I took that to mean a classic hash table that used a linked list at each hash index to handle collisions. Provided collisions are minimal, that's an O(1) search problem. But if you are saying that it is O(N), I'm guessing you interpreted it much differently than I did.
 
Last edited:

MrChips

Joined Oct 2, 2009
30,802
If you are designing a command protocol, how many commands are there in total?
For a short list (say fewer than 100) it would be just as simple to do a straight search, i.e. a number of consecutive if statements.

I have implemented may communications command protocols that are efficient, expandable, addressable, and human readable.
All commands are text and are simple 2-letter commands.
The structure is as follows:

[start] [device ID] [command] [parameter] [checksum] [end]

where:
[start] = $
[device ID] = 0-9, A-Z
[command] = two letters, e.g. ST
[parameter] = numeric data, optional
[checksum] = two hex digits
[end] = CR

Linear searching is efficient because you are only comparing two 16-bit values.
 
Top