algorithm for finding a match in a list of strings

Thread Starter

bug13

Joined Feb 13, 2012
1,846
Hi team

I often find myself needing to match a string against the strings in a list, and what I usually do is simply loop through them all. That's OK for a small list, but what if the list is a lot bigger? Is there some sort of algorithm to speed things up?

This is how I usually do it.
C:
const char *list[] = {
    "cmd1",
    "cmd2",
    "cmd3",
    /* more commands */
};

const size_t list_size = sizeof(list) / sizeof(list[0]);

int main(void){

    while(1){

        /* get user input */
        char buffer[32] = {0};
        function_get_user_input(buffer);

        /* find a match in the list */
        for(size_t i = 0; i < list_size; i++){
            if (strcmp(list[i], buffer) == 0){
                /* do stuff */
            }
        }
    }
}
 

WBahn

Joined Mar 31, 2012
25,764
The answer... it depends.

Is this a list that you are going to be searching over and over, or is this a one-time thing?

Depending on the specifics, there are things that you can do that can speed things up significantly, or slow them down.
 

Thread Starter

bug13

Joined Feb 13, 2012
1,846
The answer... it depends.

Is this a list that you are going to be searching over and over, or is this a one-time thing?

Depending on the specifics, there are things that you can do that can speed things up significantly, or slow them down.
I will need to use it again and again. I didn't mention it in my OP, but the strings in the list are usually of different lengths.
 

WBahn

Joined Mar 31, 2012
25,764
I will need to use it again and again. I didn't mention it in my OP, but the strings in the list are usually of different lengths.
Then it will probably be worth your while to either sort the list and use a binary search or use a hash table.
 

xox

Joined Sep 8, 2017
440
I will need to use it again and again. I didn't mention it in my OP, but the strings in the list are usually of different lengths.
In many situations that's just going to be overkill though. You'd most likely need a LOT of strings before you'd see any appreciable impact on performance.

That said, the most common solution to that kind of problem is to first sort the list and then perform a binary search on the data. Using that approach, for example, a billion strings could be searched in less than 30 comparisons!

Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* comparator for qsort/bsearch: each element is a pointer to a string */
int sort_ascending(const void* left, const void* right)
{
    const char* lhs = *(const char* const*)left;
    const char* rhs = *(const char* const*)right;
    return strcmp(lhs, rhs);
}

int main(int argc, char** argv)
{
    const char* list[] =
    {
        "cmd4",
        "cmd2",
        "cmd3",
        "cmd1"
    };

    size_t width = sizeof(list[0]);
    size_t length = sizeof(list) / width;

    qsort(list, length, width, sort_ascending);

    puts("Commands:");
    for(size_t i = 0; i < length; ++i)
        puts(list[i]);

    puts("Search:");
    while(*(++argv))
    {
        const char* tag = *argv;
        /* bsearch returns a pointer to the matching array element, or NULL */
        const char** found = bsearch(&tag, list, length, width, sort_ascending);

        printf("%s: %sfound\n", tag, found ? "" : "not ");
    }
}
 

John P

Joined Oct 14, 2008
1,808
Think of looking for books by a particular author in the fiction section at a library. It's easy, because the books are arranged with the authors' names in alphabetic order. Just think what it would be like if the books were placed at random!
 

dl324

Joined Mar 30, 2015
10,758
Something like a binary tree for example.
With a decent hashing function, a hash array of linked lists could be sufficient.

The OP needs to think about string lengths. You wouldn't want a lot of them hashing to the same value.
 

MrChips

Joined Oct 2, 2009
21,143
You can certainly build hash-tables with hash functions, but the key there is using the right "underlying data structure". Something like a binary tree for example. ;)
A binary tree still requires a search.
A hash function goes straight to the searched word, assuming you are matching two words and not a string within a string.
This is done all the time when a compiler is looking for a label, symbol or op instruction.

I have written quite a few microprocessor assemblers in my days.
 

xox

Joined Sep 8, 2017
440
A binary tree still requires a search.
A hash function goes straight to the searched word, assuming you are matching two words and not a string within a string.
This is done all the time when a compiler is looking for a label, symbol or op instruction.

I have written quite a few microprocessor assemblers in my days.
Well, you apparently haven't implemented too many hash tables, because that's not really how it works. You choose a hash function with certain properties; it gives you a number, which you take modulo the table size, and that gets you an array index into another data structure. That indirection exists because of the possibility of collisions: you need something else (an array, a list, a binary tree, whatever) to account for keys that hash to the same slot. The reason I suggested an array coupled with a binary search is simply that it's way more efficient (not to mention less memory) than using an actual binary tree.
 

MrChips

Joined Oct 2, 2009
21,143
Well, apparently you haven't implemented too many hash tables, because that's not really how it works. You choose a hash function with certain properties; it gives you a number, which you take modulo the table size, and that gets you an array index into another data structure. That indirection exists because of the possibility of collisions: you need something else (an array, a list, a binary tree, whatever) to account for keys that hash to the same slot. The reason I suggested an array coupled with a binary search is simply that it's way more efficient (not to mention less memory) than using an actual binary tree.
Ok then. Have it your way. It worked for me and I have no complaints.
 

xox

Joined Sep 8, 2017
440
With a decent hashing function, a hash array of linked lists could be sufficient.

The OP needs to think about string lengths. You wouldn't want a lot of them hashing to the same value.
That's a really lame suggestion. Search time for a list is O(N). An array with binary search is guaranteed to require less memory and perform orders of magnitude faster.
 

MrChips

Joined Oct 2, 2009
21,143
Okay, so just how did you handle collisions then?
If you are planning on using a hash mechanism you have to anticipate collisions and use a mechanism for dealing with it.
Before jumping into any mechanism it helps to study the following:

1) the hash function
2) the uniqueness of each entry
3) the size of the population
4) the size of the hash table
5) the probability of collision.

We recently had a discussion of finding a number in a list of numbers. Since the numbers ranged from 1 to 100, the size of the table was 100 and the probability of collision was zero.
 

xox

Joined Sep 8, 2017
440
If you are planning on using a hash mechanism you have to anticipate collisions and use a mechanism for dealing with it.
Before jumping into any mechanism it helps to study the following:

1) the hash function
2) the uniqueness of each entry
3) the size of the population
4) the size of the hash table
5) the probability of collision.

We recently had a discussion of finding a number in a list of numbers. Since the numbers ranged from 1 to 100, the size of the table was 100 and the probability of collision was zero.
That's a very contrived example though. A real world hash table would never be used in such a way.
 

402DF855

Joined Feb 9, 2013
206
I implemented a hash table for this purpose years ago and have used it scads of times. Mine handled collisions by inserting the new element at the next empty (NULL) location in the array, starting from the index derived from the hash function. If collisions happen a lot, use a larger array. Obviously this method requires a lot of memory. If you don't have the memory, then I guess the binary tree. Alternatively, a binary search on a sorted list: sort as part of the insertion, or qsort after the fact. Lots of choices.
 