algorithm for finding a match in a list of strings

Thread Starter

bug13

Joined Feb 13, 2012
1,846
Hi team

I often find myself needing to match a string against the strings in a list, and what I usually do is simply loop through them all. That's OK for a small list, but what if the list is a lot bigger? Is there some sort of algorithm to speed things up?

This is how I usually do it.
C:
const char *list[] = {
    "cmd1",
    "cmd2",
    "cmd3",
    /* more commands */
};

const size_t list_size = sizeof(list) / sizeof(list[0]);

int main(void){

    while(1){

        /* get user input */
        char buffer[32] = {0};
        function_get_user_input(buffer);

        /* find a match in the list */
        for(size_t i = 0; i < list_size; i++){
            if (strcmp(list[i], buffer) == 0){
                /* do stuff */
            }
        }
    }
}
 

WBahn

Joined Mar 31, 2012
25,764
The answer... it depends.

Is this a list that you are going to be searching over and over, or is this a one-time thing?

Depending on the specifics, there are things that you can do that can speed things up significantly, or slow them down.
 

Thread Starter

bug13

Joined Feb 13, 2012
1,846
The answer... it depends.

Is this a list that you are going to be searching over and over, or is this a one-time thing?

Depending on the specifics, there are things that you can do that can speed things up significantly, or slow them down.
I will need to use it again and again. I didn't mention it in my OP, but the strings in the list are usually of different lengths.
 

WBahn

Joined Mar 31, 2012
25,764
I will need to use it again and again. I didn't mention it in my OP, but the strings in the list are usually of different lengths.
Then it will probably be worth your while to either sort the list and use a binary search or use a hash table.
 

xox

Joined Sep 8, 2017
440
I will need to use it again and again. I didn't mention it in my OP, but the strings in the list are usually of different lengths.
In many situations that's just going to be overkill though. You'd most likely need a LOT of strings before you'd see any appreciable impact on performance.

That said, the most common solution to that kind of problem is to first sort the list and then perform a binary search on the data. Using that approach, for example, a billion strings could be searched in less than 30 comparisons!

Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* comparator for qsort/bsearch: each element is a pointer to a string */
int sort_ascending(const void* left, const void* right)
{
    const char* lhs = *(const char* const*)left;
    const char* rhs = *(const char* const*)right;
    return strcmp(lhs, rhs);
}

int main(int argc, char** argv)
{
    const char* list[] =
    {
        "cmd4",
        "cmd2",
        "cmd3",
        "cmd1"
    };

    size_t width = sizeof(list[0]);
    size_t length = sizeof(list) / width;

    qsort(list, length, width, sort_ascending);

    puts("Commands:");
    for(size_t i = 0; i < length; ++i)
        puts(list[i]);

    puts("Search:");
    while(*(++argv))
    {
        const char* tag = *argv;
        /* bsearch returns a pointer to the matching array element, or NULL */
        const char** found = bsearch(&tag, list, length, width, sort_ascending);

        printf("%s: %sfound\n", tag, found ? "" : "not ");
    }
}
 

John P

Joined Oct 14, 2008
1,808
Think of looking for books by a particular author in the fiction section at a library. It's easy, because the books are arranged with the authors' names in alphabetic order. Just think what it would be like if the books were placed at random!
 

dl324

Joined Mar 30, 2015
10,758
Something like a binary tree for example.
With a decent hashing function, a hash array of linked lists could be sufficient.

The OP needs to think about string lengths. You wouldn't want a lot of them hashing to the same value.
 

MrChips

Joined Oct 2, 2009
21,143
You can certainly build hash-tables with hash functions, but the key there is using the right "underlying data structure". Something like a binary tree for example. ;)
A binary tree still requires a search.
A hash function goes straight to the searched word, assuming you are matching two words and not a string within a string.
This is done all the time when a compiler is looking for a label, symbol or op instruction.

I have written quite a few microprocessor assemblers in my days.
 

xox

Joined Sep 8, 2017
440
A binary tree still requires a search.
A hash function goes straight to the searched word, assuming you are matching two words and not a string within a string.
This is done all the time when a compiler is looking for a label, symbol or op instruction.

I have written quite a few microprocessor assemblers in my days.
Well, you apparently haven't implemented too many hash tables, because that's not really how it works. You choose a hash function with certain properties; it gives you a number, which you take modulo the table size, and that gets you an array index into another data structure. That indirection exists because of the possibility of collisions: you need something else (an array, a list, a binary tree, whatever) to account for keys that hash to the same slot. The reason I suggested an array coupled with a binary search is simply that it's way more efficient (not to mention less memory) than using an actual binary tree.
 

MrChips

Joined Oct 2, 2009
21,143
Well, apparently you haven't implemented too many hash tables, because that's not really how it works. You choose a hash function with certain properties; it gives you a number, which you take modulo the table size, and that gets you an array index into another data structure. That indirection exists because of the possibility of collisions: you need something else (an array, a list, a binary tree, whatever) to account for keys that hash to the same slot. The reason I suggested an array coupled with a binary search is simply that it's way more efficient (not to mention less memory) than using an actual binary tree.
Ok then. Have it your way. It worked for me and I have no complaints.
 

xox

Joined Sep 8, 2017
440
With a decent hashing function, a hash array of linked lists could be sufficient.

The OP needs to think about string lengths. You wouldn't want a lot of them hashing to the same value.
That's a really lame suggestion. Search time for a list is O(N). An array with binary search is guaranteed to require less memory and perform orders of magnitude faster.
 

MrChips

Joined Oct 2, 2009
21,143
Okay, so just how did you handle collisions then?
If you are planning on using a hash mechanism you have to anticipate collisions and use a mechanism for dealing with it.
Before jumping into any mechanism it helps to study the following:

1) the hash function
2) the uniqueness of each entry
3) the size of the population
4) the size of the hash table
5) the probability of collision.

We recently had a discussion of finding a number in a list of numbers. Since the numbers ranged from 1 to 100, the size of the table was 100 and the probability of collision was zero.
 

xox

Joined Sep 8, 2017
440
If you are planning on using a hash mechanism you have to anticipate collisions and use a mechanism for dealing with it.
Before jumping into any mechanism it helps to study the following:

1) the hash function
2) the uniqueness of each entry
3) the size of the population
4) the size of the hash table
5) the probability of collision.

We recently had a discussion of finding a number in a list of numbers. Since the numbers ranged from 1 to 100, the size of the table was 100 and the probability of collision was zero.
That's a very contrived example though. A real world hash table would never be used in such a way.
 

402DF855

Joined Feb 9, 2013
206
I implemented a hash table for this purpose years ago and have used it scads of times. Mine handled collisions by inserting the new element at the next empty (NULL) location in the array, starting from the index derived from the hash function. If collisions happen a lot, use a larger array. Obviously this method requires a lot of memory. If you don't have the memory, then I guess the binary tree. Alternatively, a binary search on a sorted list: sort as part of the insertion, or qsort after the fact. Lots of choices.
 