Tokenizing Keywords Pt 4

Entry posted by N9WXU · January 30, 2019

Quote

make a hash of

phrase
If you make a hash of a job or task, you do it very badly.
[informal]
The Government made a total hash of things.
Watson had made a thorough hash of it.
Synonyms: mess up, muddle, bungle, botch
hash

COBUILD Advanced English Dictionary. Copyright © HarperCollins Publishers

This week we will discuss a different strategy for tokenizing our words. This strategy will be to convert our word to a number in an unambiguous manner. Then we can simply see if our number is one we recognize. How hard can that be?

The process of converting something into an unambiguous number is hashing. Hash algorithms come in all shapes and sizes depending upon their application needs. One nice hash is the Pearson hash, which can make an 8-bit number from a string. For a reasonably comprehensive list of hash algorithms, check out this List of hash functions. Generally, hash functions are intended to convert some string into a number that is very unlikely to represent any other string. How unlikely depends upon the algorithm and the size of the number. For example, the Pearson hashes are quick and easy. They produce an 8-bit value using an XOR table. However, there are more than 256 words in the English language, let alone the rest of the world's languages, so the odds of a hash collision are quite high with a large set of words. Fortunately, it is relatively easy to manipulate the XOR table so that there are no collisions for a small word set.
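As a rough sketch of that last point, the code below computes an 8-bit Pearson hash and tries different tables until a small word set hashes without collisions. The table form (m*i + 1) mod 256 is an assumption made for this illustration, not the shuffled table normally used with Pearson hashing, and the names buildTable, pearson8, hasCollision and findCollisionFreeTable are invented here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t T[256];  /* Pearson lookup table: must be a permutation of 0..255 */

/* Assumed table form: any odd multiplier m yields a permutation modulo 256. */
static void buildTable(uint8_t m)
{
    for (unsigned i = 0; i < 256; i++) {
        T[i] = (uint8_t)(m * i + 1u);
    }
}

/* Pearson hash: fold each character into the running hash through the table. */
static uint8_t pearson8(const char *s)
{
    uint8_t h = 0;
    while (*s) {
        h = T[h ^ (uint8_t)*s++];
    }
    return h;
}

/* True if two of the n words share a hash under the current table. */
static bool hasCollision(const char *const words[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (pearson8(words[i]) == pearson8(words[j])) {
                return true;
            }
        }
    }
    return false;
}

/* Try odd multipliers until the word set is collision free.
   Returns the multiplier that worked, or 0 if none of them did. */
static uint8_t findCollisionFreeTable(const char *const words[], size_t n)
{
    for (unsigned m = 1; m < 256; m += 2) {
        buildTable((uint8_t)m);
        if (!hasCollision(words, n)) {
            return (uint8_t)m;
        }
    }
    return 0;
}
```

Calling findCollisionFreeTable with the keyword list leaves T set up so that every keyword has its own 8-bit hash. A real project would more likely shuffle a full 256-entry table, but the search loop would look much the same.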

Assuming we can apply some hash algorithm that will uniquely identify our words, it should be pretty easy to apply this technique to find our tokens. For example, in the first example we had the following table of keywords:

const char *wordList[] = {"GPGGA","GNGSA","GPGSV","GPBOD","GPDBT","GPDCN"};
enum wordTokens {NO_WORD = -1,GPGGA,GNGSA,GPGSV,GPBOD,GPDBT,GPDCN};

What if that changed to this:

const int hashList[] = {<hash_of_GPGGA>,<hash_of_GNGSA>,<hash_of_GPGSV>,<hash_of_GPBOD>,<hash_of_GPDBT>,<hash_of_GPDCN>};
enum wordTokens {NO_WORD = -1,GPGGA,GNGSA,GPGSV,GPBOD,GPDBT,GPDCN};

It is not hard to imagine hashing the incoming word into an integer and then scanning the hashList looking for a match. This would be 2.5x smaller in memory if we did not need to store the master wordList, and each comparison would be a 2-byte compare (or a 1-byte compare if we used a smaller hash). It would require at most 2N byte comparisons, where N is the number of words to check for. Of course, hashing the incoming word creates an up-front cost, but that cost could be buried inside the character receive function.
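Here is a minimal sketch of that idea, assuming a simple multiply-by-31 16-bit hash (my choice for the illustration, it happens to give six distinct values for this word list) and a hypothetical receive hook named onCharReceived. To avoid hard-coding hash values, the sketch computes hashList at startup from the words; in a real build the hashes could be generated offline so the strings never reach the target.

```c
#include <stdint.h>

enum wordTokens {NO_WORD = -1, GPGGA, GNGSA, GPGSV, GPBOD, GPDBT, GPDCN};
#define WORD_COUNT 6

static const char *const wordList[WORD_COUNT] =
    {"GPGGA", "GNGSA", "GPGSV", "GPBOD", "GPDBT", "GPDCN"};
static uint16_t hashList[WORD_COUNT];  /* filled once by hashInit() */

/* Fold one character into a running 16-bit hash (assumed hash function). */
static uint16_t hashUpdate(uint16_t h, char c)
{
    return (uint16_t)(h * 31u + (uint8_t)c);
}

/* Compute the keyword hashes once at startup. */
static void hashInit(void)
{
    for (int i = 0; i < WORD_COUNT; i++) {
        uint16_t h = 0;
        for (const char *p = wordList[i]; *p; p++) {
            h = hashUpdate(h, *p);
        }
        hashList[i] = h;
    }
}

static uint16_t rxHash;  /* hash of the word received so far */

/* Hypothetical receive hook: call once per incoming character of the word. */
static void onCharReceived(char c)
{
    rxHash = hashUpdate(rxHash, c);
}

/* Call at the end of the word: scan the hash list, then reset for the next word. */
static enum wordTokens finishWord(void)
{
    enum wordTokens token = NO_WORD;
    for (int i = 0; i < WORD_COUNT; i++) {
        if (hashList[i] == rxHash) {
            token = (enum wordTokens)i;
            break;
        }
    }
    rxHash = 0;
    return token;
}
```

By the time the last character arrives the hash is already finished, so the only work left is the short scan in finishWord.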

The hash methods are not perfect. With any one-way algorithm on unknown input, there is the possibility of a collision. A collision is where two different words have the same computed hash value. This could mean that two of the keywords have the same value, or it could mean that a randomly chosen word (or random characters) matches a valid input. In a system where the keyword list is static and known at compile time, it is possible to develop a "perfect hash". That is, a hash that guarantees every valid input produces a unique value. If your system is concerned about random "noise" being treated as valid data, there are at least two ways to solve this.

Keep a list of the original words and do a final byte-for-byte compare one time.
Add a checksum to the input and require the checksum to be valid in addition to the hash match. For NMEA strings, this is already available.
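For the second option, the NMEA checksum is the XOR of every character between the '$' and the '*', sent as two hex digits after the '*'. A minimal sketch of a validator might look like this; the function names nmeaChecksumOk and hexVal are made up for the example, and anything after the two hex digits (such as the line ending) is simply ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hexVal(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Validate "$<body>*HH", where HH is the XOR of every body character. */
static bool nmeaChecksumOk(const char *sentence)
{
    if (*sentence++ != '$') {
        return false;               /* must start with '$' */
    }
    uint8_t sum = 0;
    while (*sentence && *sentence != '*') {
        sum ^= (uint8_t)*sentence++;
    }
    if (*sentence++ != '*') {
        return false;               /* no checksum field present */
    }
    int hi = hexVal(sentence[0]);
    if (hi < 0) {
        return false;
    }
    int lo = hexVal(sentence[1]);
    return lo >= 0 && ((hi << 4) | lo) == sum;
}
```

A token would then only be accepted when both the hash lookup and nmeaChecksumOk agree that the sentence is good.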

Can we go faster still?

The integer compare search method works very well, but there are a few ways to go even faster.

Sort the hashes in the list and use a search algorithm like a binary search to find the match. This reduces the time from O(n) to O(log(n)). Much faster.
Use the hash as an index into an array of the tokens. This reduces the time from O(n) to O(1). Much Much faster but makes a potentially HUGE array (2^<hash bit length> entries).
Use the hash%<word count> to create a minimally sized table. This works but requires a minimal perfect hash. That is, a hash with the property of producing an N record table for N words. These algorithms are hard to find.
Use the hash as the token in the rest of the system. Why bother looking up a token if you already have a nice number? This is a good solution but assumes that you never need to use the tokens as indices of other arrays.
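The first option in that list can be sketched with the standard library's qsort and bsearch. This is only an illustration: the multiply-by-31 hash16 function and the hashEntry structure are assumptions of mine, and the table is built at startup rather than at compile time.

```c
#include <stdint.h>
#include <stdlib.h>

enum wordTokens {NO_WORD = -1, GPGGA, GNGSA, GPGSV, GPBOD, GPDBT, GPDCN};
#define WORD_COUNT 6

struct hashEntry { uint16_t hash; enum wordTokens token; };

static const char *const wordList[WORD_COUNT] =
    {"GPGGA", "GNGSA", "GPGSV", "GPBOD", "GPDBT", "GPDCN"};
static struct hashEntry table[WORD_COUNT];

/* Assumed 16-bit hash; any hash that is collision free for the keywords would do. */
static uint16_t hash16(const char *s)
{
    uint16_t h = 0;
    while (*s) {
        h = (uint16_t)(h * 31u + (uint8_t)*s++);
    }
    return h;
}

/* Order entries by hash value; shared by qsort and bsearch. */
static int compareEntries(const void *a, const void *b)
{
    uint16_t ha = ((const struct hashEntry *)a)->hash;
    uint16_t hb = ((const struct hashEntry *)b)->hash;
    return (ha > hb) - (ha < hb);
}

/* Hash every keyword, remember its token, then sort by hash. */
static void tableInit(void)
{
    for (int i = 0; i < WORD_COUNT; i++) {
        table[i].hash = hash16(wordList[i]);
        table[i].token = (enum wordTokens)i;
    }
    qsort(table, WORD_COUNT, sizeof table[0], compareEntries);
}

/* Binary search the sorted hashes: O(log n) comparisons instead of O(n). */
static enum wordTokens findToken(const char *word)
{
    struct hashEntry key = { hash16(word), NO_WORD };
    const struct hashEntry *hit =
        bsearch(&key, table, WORD_COUNT, sizeof table[0], compareEntries);
    return hit ? hit->token : NO_WORD;
}
```

Because sorting scrambles the order of the hashes, each entry has to carry its token along with it. With only six words the gain over a linear scan is small; the payoff comes with larger word sets.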

The idea that you can use the hash as an index into an array of records is the basis of a data structure called a Hash Table. Accessing a hash table is typically O(1) until a collision is detected, and then it goes to O(m) where m is the number of colliding items. Typically the system implements the hash table as a sparse array with additional arrays holding the matching hash items.
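To make that concrete, here is a small sketch of a hash table with one chain of items per bucket. Everything in it is assumed for the example: the bucket count, the fixed-size item pool, the hash16 function, and the names tableInsert and tableLookup.

```c
#include <stdint.h>
#include <string.h>

#define BUCKETS   8   /* size of the sparse array */
#define MAX_ITEMS 8   /* total items the pool can hold */

/* One stored record; next is the pool index of the next item in the same
   bucket, or -1 at the end of the chain. */
struct item { const char *key; int value; int next; };

static struct item pool[MAX_ITEMS];
static int poolUsed;
static int bucketHead[BUCKETS];

/* Assumed 16-bit hash function. */
static uint16_t hash16(const char *s)
{
    uint16_t h = 0;
    while (*s) {
        h = (uint16_t)(h * 31u + (uint8_t)*s++);
    }
    return h;
}

/* Empty every bucket. */
static void tableInit(void)
{
    poolUsed = 0;
    for (int b = 0; b < BUCKETS; b++) {
        bucketHead[b] = -1;
    }
}

/* Add a key at the front of its bucket's chain. Returns 0, or -1 if full. */
static int tableInsert(const char *key, int value)
{
    if (poolUsed >= MAX_ITEMS) {
        return -1;
    }
    unsigned b = hash16(key) % BUCKETS;
    pool[poolUsed] = (struct item){ key, value, bucketHead[b] };
    bucketHead[b] = poolUsed++;
    return 0;
}

/* O(1) jump to the bucket, then O(m) walk over the m colliding items.
   Returns the stored value, or -1 if the key is absent. */
static int tableLookup(const char *key)
{
    for (int i = bucketHead[hash16(key) % BUCKETS]; i >= 0; i = pool[i].next) {
        if (strcmp(pool[i].key, key) == 0) {
            return pool[i].value;
        }
    }
    return -1;
}
```

The strcmp in tableLookup is what keeps the table correct when several keys land in the same bucket.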

That is a lot of words. I think it is time for a few examples. First let us implement a basic hash search using our short word list.
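The author's version of this step is in the attached projects. As a stand-in, here is a rough sketch of what a Pearson hash search can look like. The lookup table is an assumption: it is the simple permutation (5*i + 1) mod 256, picked for this sketch because it gives six distinct hashes for this word list, and it is not the shuffled table usually used with Pearson hashing. The hash list is also computed at startup here rather than stored as constants.

```c
#include <stdint.h>

enum wordTokens {NO_WORD = -1, GPGGA, GNGSA, GPGSV, GPBOD, GPDBT, GPDCN};
#define WORD_COUNT 6

static const char *const wordList[WORD_COUNT] =
    {"GPGGA", "GNGSA", "GPGSV", "GPBOD", "GPDBT", "GPDCN"};

static uint8_t pearsonTable[256];     /* permutation of 0..255 */
static uint8_t hashList[WORD_COUNT];  /* 8-bit hash of each keyword */

/* 8-bit Pearson hash of a NUL-terminated string. */
static uint8_t pearson8(const char *s)
{
    uint8_t h = 0;
    while (*s) {
        h = pearsonTable[h ^ (uint8_t)*s++];
    }
    return h;
}

/* Build the assumed table, then hash every keyword once. */
static void hashInit(void)
{
    for (unsigned i = 0; i < 256; i++) {
        pearsonTable[i] = (uint8_t)(5u * i + 1u);
    }
    for (int i = 0; i < WORD_COUNT; i++) {
        hashList[i] = pearson8(wordList[i]);
    }
}

static enum wordTokens NMEA_findToken(const char *word)
{
    uint8_t h = pearson8(word);              /* fixed cost: hash the word */
    for (int i = 0; i < WORD_COUNT; i++) {   /* variable cost: linear search */
        if (hashList[i] == h) {
            return (enum wordTokens)i;
        }
    }
    return NO_WORD;
}
```

Call hashInit once at startup, then NMEA_findToken for each received word. There is no final string compare, so an unknown word that happens to share an 8-bit hash with a keyword would be accepted.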

That was pretty easy, but it is easy to see how finding a perfect hash function can get more and more difficult as we add words. Fortunately for us, there is a tool from the GNU project called Gperf. Gperf's job is to find a perfect hash algorithm for a list of words and produce the C code to process it. Sounds perfect, so here is an example of how that works. First we must prepare a word list. The word list below starts with the structure that will be used to store each word in the C program. This is followed by the list of words and indices that will be used to populate an array of that structure.

struct keyword_s {const char *name; int index;};
%%
GPGGA,0
GNGSA,1
GPGSV,2
GPBOD,3
GPDBT,4
GPDCN,5
GPRMC,6
GPBWC,7

The word list is converted into C code with the following command line:

gperf -t -I  --output-file=hashTable.c keywordlist.txt

This command will create a file called hashTable.c. Inside this file is one public function called in_word_set. Below you can see where I modified the NMEA_findToken function to use the in_word_set function supplied by gperf.

#include <stddef.h>  /* size_t */

enum wordTokens {NO_WORD = -1,GPGGA,GNGSA,GPGSV,GPBOD,GPDBT,GPDCN, GPRMC, GPBWC};
struct keyword_s {const char *name; int index;};

/* Generated by gperf in hashTable.c; returns NULL when the word is not a keyword. */
extern struct keyword_s *in_word_set (register const char *str, register size_t len);

enum wordTokens NMEA_findToken(char *word)
{
    enum wordTokens returnValue = NO_WORD;
    struct keyword_s *kw;
    kw = in_word_set(word,5); /* every keyword is exactly 5 characters long */
    if(kw) returnValue = kw->index;
    return returnValue;
}

Compile and run and you get the results in the table below. Note how the hash search spends a fixed amount of time computing the 8-bit Pearson hash of the keyword. Then it spends a small amount of time searching for the hash value in the list of hash keys. This search is a brute force linear search. A binary search would likely be faster with a large word set, but with only 8 words in the word set most of the time is spent computing the hash.

The GPERF code is very interesting. Notice how the function is fixed time if the word is present and is much faster when the word is absent. There are a number of options to GPERF to allow it to produce code with different strategies. If you look at the code produced by GPERF you will notice that there is a final test using strcmp (or optionally strncmp). This will eliminate the possibility of a collision. If we don't care about collisions, look how much faster this gets.

Word    STRNCMP   IF-ELSE   RAGEL -G2   Hash Search (Hash/Search)   Hash GPERF   Hash GPERF no compare
GNGSA   399       121       280         326 (167/159)               374          126
GPGSV   585       123       304         288 (167/121)               374          126
GLGSV   724       59        225         503 (167/336)               113          113
GPRMC   899       83        299         536 (167/369)               374          126
GPGGA   283       113       298         440 (167/273)               374          126

So far I would have to say that the GPERF solution is easily the best way to decipher the sentences from my GPS. I know the GPS will deliver good strings, so I would feel pretty comfortable stripping off the final compare. However, even with the string compare the GPERF solution is pretty good. It is only consistently beaten by the hand-crafted if-else, which will be a challenge to maintain. Perhaps we should consider writing a code generator for that method.

Hashing is a very interesting topic with a lot of ratholes behind it. Some of the fastest algorithms and search methods take advantage of hashes to some degree. Our passwords also use hashes so that passwords can be compared without knowing what the actual password is. I hope this has been as informative for you to read as it was for me to write.

As usual, take a look at the attached MPLAB projects to try out the different ideas.

Good Luck.

example4-PearsonSearch.zip

example4 gperf.zip

example4 gperf no compare.zip