algorithm for finding a match in a list of strings

dl324 · Apr 19, 2020

402DF855 said:
I implemented a hash table for this purpose years ago and have used it scads of times. Mine handled collisions by inserting the new element at the next non-NULL location in the array (the index of which is derived from the hash function).

Same here. Wrote a program for the company I worked at and it was used a countless number of times for over 20 years. I needed to keep track of cell names in Stream files (graphical databases) and used a 16K element hash array. If multiple names hashed to the same value, each array entry could be a linked list.

402DF855 said:
If it happens a lot, use a larger array.

Or use a better hash function. I studied hash performance in my program for years. In most cases, a very small percentage of names hashed to the same value.

402DF855 said:
Obviously this method requires a lot of memory

Since you only allocate storage for hash values that are used, it doesn't require significantly more memory than other alternatives.

402DF855 · Apr 19, 2020

In data structures I was taught the hash table should as a rule never be more than 10% occupied or performance degrades. All that space is a classic tradeoff that provides the benefit of speed. And I too played with optimizing hash functions and generally found there are lots of good ones, but no magic ones.

KeepItSimpleStupid · Apr 19, 2020

Remember when sorting a list of strings, it's better to sort an array of pointers.

MrChips · Apr 19, 2020

402DF855 said:
In data structures I was taught the hash table should as a rule never be more than 10% occupied or performance degrades. All that space is a classic tradeoff that provides the benefit of speed. And I too played with optimizing hash functions and generally found there are lots of good ones, but no magic ones.

That is saying 90% of the allocated space is not used.
I look at it differently. In order to store collisions I will allocate say, 20% additional space.

xox · Apr 19, 2020

402DF855 said:
And I too played with optimizing hash functions and generally found there are lots of good ones, but no magic ones.

You might find Galois groups interesting. A general-purpose CRC hash function used in conjunction with specifically-sized arrays yields nearly 100% optimal collision avoidance.

402DF855 · Apr 19, 2020

MrChips said:
That is saying 90% of the allocated space is not used.

Technically it is used. It's insurance that there will be less collisions and therefore better performance.

bug13 · Apr 19, 2020

Thanks guys, thanks to the simple example provided by @xox, I can now understand how to do it with sorting and bsearch().

I am also interested in how it can done with a hash function. But I don't seem to understand how search work with hash function. This is how I think it will work:

hash (generate a unique code) for each string
sort the string according to the hashed value
hash the user input string
do a bsearch()

This doesn't seem to make sense to me.

Is there a link to a simple search algorithm with hash function mention above?

MrChips · Apr 19, 2020

It would help if you told use exactly you are trying to do.
How many letters in the word?
Total number of words?
What is the application?

dl324 · Apr 19, 2020

bug13 said:
But I don't seem to understand how search work with hash function. This is how I think it will work:

hash (generate a unique code) for each string

sort the string according to the hashed value

hash the user input string

do a bsearch()

This doesn't seem to make sense to me.

That isn't quite how it works.

You define an array to hold the hashed string values. If that element of the array has already been allocated (all of the elements start out with NULL pointers), you add a link to the first string that hashed to that value. If there were already multiple strings that hashed to that value, you add the latest one to the last element in the linked list; if it's unique.

You don't do a binary search. When you want to see if a string has already occurred, you hash the string. If that value has multiple linked strings, you walk the list until you find a match. If you don't find a match, you add it to the list of strings seen that hashed to that value.

KeepItSimpleStupid · Apr 19, 2020

One modifiction I do with binary search is I don't do exactly 1/2. I use a power of two for the "center". if I had 500 elements, 1-256 is the first set and 257-500 is the second.

bug13 · Apr 19, 2020

MrChips said:
It would help if you told use exactly you are trying to do.
How many letters in the word?
Total number of words?
What is the application?

This is one way I normally use it, I understand that I don't really need to do this for a small list. I thought if I could do it faster if the list is very long.

C:

enum type_t {
    STRING_ONLY,            /* string only command                                             */
    STRING_AND_NUMBER,        /* string follow by a number, will extract the number separately */
    etc...
}

struct list_t {
    type_t    type;
    char     *cmd;            /* user command                 */
    void     (*cb)    (void);    /* thing to do for the command     */
}

const struct list_t list[] ={
    {.cmd = "reboot",        .cb = func_1, .type = STRING_ONLY        },
    {.cmd = "cmd2",            .cb = func_2, .type = STRING_ONLY        },
    {.cmd = "cmd3",            .cb = func_3, .type = STRING_ONLY        },
    {.cmd = "cmd4",            .cb = func_4, .type = STRING_ONLY        },
    {.cmd = "any_length",     .cb = func_5, .type = STRING_ONLY        },
    {.cmd = "id",             .cb = func_5, .type = STRING_AND_NUMBER    },
    /* commands will be added as needed. And
     * commands will be any length, but usually
     * short and easy word
     */
};

void main(void){
   
    /* get user input, usually from some com.
     * eg RS232/485/UDP. Usually from interrupts
     * examples could be:
     * "relay_mode toggle"
     * "relay_mode momentary"
     * "relay_mode normally_close"
     * "relay_mode normally_open"
     * "set id:10"
     * ""
     */
    uint8_t input_data[SOME_SIZE];
    func_get_user_input(input_data);
   
    /* extract the one word from the input, separated by space or : */
    char word[SOME_SIZE];
    word = extract_one_word(input_data);
   
    /* find a match from the list and do things */
    for(unsigned i = 0; i < list_size; i++){
        if (str(word, list[i].cmd) == 0){
            list[i].cb();
        }
    }
}

bug13 · Apr 19, 2020

KeepItSimpleStupid said:
One modifiction I do with binary search is I don't do exactly 1/2. I use a power of two for the "center". if I had 500 elements, 1-256 is the first set and 257-500 is the second.

Why is that? Assuming it's faster to calculate with bit shift?

MrChips · Apr 19, 2020

Is this hypothetical or an actual application?

If you are sending commands from A to B, no one types in "relay_mode normally_close".
This is verbose and very inefficient. That is why we use mnemonics as "R1NO".

Hence you need to sit down and carefully define your command protocol.

Searching through a list of 25 short commands is very different from searching through a dictionary of 10,000 words.

We need to know the application.

bug13 · Apr 19, 2020

MrChips said:
Is this hypothetical or an actual application?

If you are sending commands from A to B, no one types in "relay_mode normally_close".
This is verbose and very inefficient. That is why we use mnemonics as "R1NO".

Hence you need to sit down and carefully define your command protocol.

Searching through a list of 25 short commands is very different from searching through a dictionary of 10,000 words.

We need to know the application.

In my machine to machine communicate, I simply use one or two byte code embedded in the payload of a packet.

But unfortunately, the above example are actually for human to type in. They want to be like this, most people on the field wont' be using our stuff a lot. "R1NO" doesn't make a lot of sense for them, as they don't use our stuff often enough. But if you tell them type "relay_mode normally_close" over the phone, is a lot easier to them to understand.

for the record, my full struct is actually like this, so when people type in [help], I can show show them a list of commands and description. But they still prefer "relay_mode normally_close" over "R1NC" (While I am thinking about this, maybe I should implement auto complete with TAB...)

C:

struct list_t {
    type_t        type;
    char         *cmd;                    /* user command                  */
    char        *description;            /* explain what the command     */
    void         (*cb)    (void);        /* thing to do for the command    */
}

const struct list_t list[] ={
    {.cmd = "reboot",    .cb = func_1, .descript = "reboot the device",    .type = STRING_ONLY        },
    {.cmd = "cmd2",        .cb = func_2, .descript = "does this..."        .type = STRING_ONLY        },
};

KeepItSimpleStupid · Apr 19, 2020

@bug13

KeepItSimpleStupid said:
One modifiction I do with binary search is I don't do exactly 1/2. I use a power of two for the "center". if I had 500 elements, 1-256 is the first set and 257-500 is the second.

You don't end up with fractions. If it's not in the 257-500 then
256/2 = 64; 128/2 = 32; 32/2 =16; 16/2 = 8; 8/2=4; 4/2=2; 2/2=1 (all integer divides or shifts)

otherwise let's say you have 501. 501/2 = 250.5 etc.

MrChips · Apr 19, 2020

I follow.
I would like to make the following suggestions:

1) All commands should be case insensitive. If the user enters either lower case or upper case, your code should covert all to one (perhaps UPPPER CASE) before parsing.

2) Omit the underscore and any undesirable characters.

3) Spaces should be optional. Delete the space in the parse stage.

4) Abbreviate the command as much as possible.
Here are examples:
reboot
r1 open
r1 close

WBahn · Apr 19, 2020

djsfantasi said:
Or construct the list as a binary tree...

That's effectively just a method of sorting the list, but it's an option. However, you have to be careful HOW you construct the tree since any significant imbalance will rapidly trash the performance gains, possibly making it worse than a linear search through a list because of the tree navigation overhead. Also, searching even a perfectly balanced binary tree is usually slower than doing a binary search on a sorted list because of that same navigation overhead, but if the comparison cost is significantly higher than the navigation cost (which is usually is, but not always), then either is pretty comparable.

WBahn · Apr 19, 2020

xox said:
You can certainly build hash-tables with hash functions, but the key there is using the right "underlying data structure". Something like a binary tree for example.

If you use a binary tree to store the hashed strings, then your search performance will be basically the same as using a binary tree to store the strings directly. Assuming you do a proper job of managing the balancing of the tree, it takes log(n) time to find a string, but with the hashed strings you have the additional cost of hashing the string you are searching for. So I don't see a significant advantage to using hashed strings from a search standpoint. But if you have a very expensive comparison function -- say a data structure with a lot of stuff crammed in it -- then comparing hashes can significantly lower the comparison cost because most of the time you only have to compare the hash and even just a 32-bit hash spreads things out so that there will be few collisions even on lists of millions of items IF the hash function is good enough.

To get the full benefit of fast searches using hash tables you need to be able to use the hash value to immediately go to the entry of interest (or "close" to it in search terms), which involves an array the size of the hash space.

WBahn · Apr 19, 2020

xox said:
That's a really lame suggestion. Search time for a list is O(N). An array with binary search is guaranteed to require less memory and perform orders of magnitude faster.

I definitely agree with the less space, but I don't see how the claim of order of magnitude faster comes about. Both are log(n) and the multiplying constant out front is usually going to be dominated by the cost of comparing two items. The cost of computing the next index or computing the next link to follow are going to be pretty comparable and generally small compared to the comparison operation.

EDIT: I think I may not be understanding what you were taking away from the post you were responding to. Rereading it, I'm not sure what I'm taking away from it. When a "hash array of linked lists" was mentioned, I took that to mean a classic hash table that used a linked list at each hash index to handle collisions. Provided collisions are minimal, that's an O(1) search problem. But if you are saying that it is O(N), I'm guessing you interpreted it much differently than I did.

MrChips · Apr 19, 2020

If you are designing a command protocol, how many commands are there in total?
For a short list (say fewer than 100) it would be just as simple to do a straight search, i.e. a number of consecutive if statements.

I have implemented may communications command protocols that are efficient, expandable, addressable, and human readable.
All commands are text and are simple 2-letter commands.
The structure is as follows:

[start] [device ID] [command] [parameter] [checksum] [end]

where:
[start] = $
[device ID] = 0-9, A-Z
[command] = two letters, e.g. ST
[parameter] = numeric data, optional
[checksum] = two hex digits
[end] = CR

Linear searching is efficient because you are only comparing two 16-bit values.

Thread starter	Similar threads	Forum	Replies	Date
	Looking for the name of an algorithm	Machine Learning, AI & Neural Networks	4	Feb 15, 2024
	Algorithm to sample and interpret scintillator pulses	General Science, Physics & Math	7	Sep 19, 2023
	Soft start using triac	PCB Layout , EDA & Simulations	2	Dec 5, 2022
	The Most Important Algorithm Of All Time	General Science, Physics & Math	11	Nov 4, 2022
S	Algorithm for finding glitches in waveforms	Programming & Languages	28	Apr 4, 2011

algorithm for finding a match in a list of strings

Join our Engineering Community! Sign-in with:

algorithm for finding a match in a list of strings

dl324

402DF855

KeepItSimpleStupid

MrChips

xox

402DF855

bug13

MrChips

dl324

KeepItSimpleStupid

bug13

bug13

MrChips

bug13

KeepItSimpleStupid

MrChips

WBahn

WBahn

WBahn

MrChips

You May Also Like

Microchip Ruggedizes Evacuated Miniature Crystal Oscillators

For the Netherlands, Photonics Initiatives Secure Global Leadership—Part 2

The Kilo Lamp: An Interactive Lamp Controller

Spinning Disks Sputter as AI Heats up Data