Tokenizing Keywords Pt 5

Entry posted by N9WXU · February 6, 2019

574 views

Quote

Do or Do not, there is no Trie

- Yoda

Today we are going to explore the weird and wonderful trie (try). We are actually going to brute force a trie and we will write a basic code generator to synthesize a static trie so it our trusty NMEA token tester can stay in flash memory and fit in the PICmcu. I promise this is going to be the LEAST memory efficient trie you ever will see, but it is quick and it is easy to understand so let us press on.

First things first, what is a trie? Simply, the trie takes the if-else construct we did in pt 2 and turns it into a data structure. I love this concept because it means I can write the trie parser once and simply be feeding it different data sets, it can process any language. In fact, I try to make most of my programs data driven so the code can be well tested and the data can be focus of our attention.

Our basic trie will be implemented as a 2D array 26 x N. N will be adjusted to be as small as possible. The 26 comes from the 26 unique characters that can make up my NMEA string set. We will ignore lower case to make this dataset as small as possible.

Here is a picture showing what we plan to do. In the picture we are finding the NMEA string "GPGGA".

Each row of our 2D array represents each letter. If a letter is not involved at that layer of lookup, we will fill it with a -1. Each time we match, we learn the row where the next match is. By matching to the row, GPGGA and GNGSA both share the first G by pointing to the second row. On the second row, The P and the N will point further into the structures.

Here is the search code:

int trie_search(const char *s,int length)
{
    int t=0;
    int p;
    
    for(int i=0;i<length;i++)
    {
        p = s[i] - 'A';
        if((t = trie[t][p].index)==-1)
            return -1;
    }
    return trie[t][p].key_index - 1; // -1 hold over from a previous version.  Generator can post increment the key
}

This looks fantastic. Only one table lookup per letter so this should be blindly fast. Let us take a look at the data.

	STRNCMP	IF-ELSE	RAGEL -G2	Hash Search (Hash/Search)	Hash GPERF	Hash GPERF no compare	Trie Search
GNGSA	399	121	280	326 167/159	374	126	915
GPGSV	585	123	304	288 167/121	374	126	871
GLGSV	724	59	225	503 167/336	113	113	1395
GPRMC	899	83	299	536 167/369	374	126	984
GPGGA	283	113	298	440 167/273	374	126	821

Well Heck! that is certainly not what we expected to see. What is going on?

At the core, this algorithm is wonderful because it only touches each letter once. This is very similar to the IF-ELSe so I would expect that this algorithm would be very similar to the IF-ELSE. We need to dig a bit deeper. As I look at the code, the only thing that stands out is the actual table lookup. We have a 2D table of structs, with each struct being 2 bytes. I wonder if this function is using a multiply to compute the table address.

  9840                           l1579:	
  9841  031A                     ;example5.c: 65:         if((t = trie[t][p].index)==-1)
  9842                           	movf	(trie_search@p+1),w
  9843  031A  082C               	movwf	(??_trie_search+0)+0+1
  9844  031B  00A1               	movf	(trie_search@p),w
  9845  031C  082B               	movwf	(??_trie_search+0)+0
  9846  031D  00A0               	lslf	(??_trie_search+0)+0,f
  9847  031E  35A0               	rlf	(??_trie_search+0)+1,f
  9848  031F  0DA1               	lslf	(??_trie_search+0)+0,f
  9849  0320  35A0               	rlf	(??_trie_search+0)+1,f
  9850  0321  0DA1               	movf	(trie_search@t+1),w
  9851  0322  082E               	movwf	(___wmul@multiplier+1)
  9852  0323  00F1               	movf	(trie_search@t),w
  9853  0324  082D               	movwf	(___wmul@multiplier)
  9854  0325  00F0               	movlw	068h
  9855  0326  3068               	movwf	(___wmul@multiplicand)
  9856  0327  00F2               	movlw	0
  9857  0328  3000               	movwf	((___wmul@multiplicand))+1
  9858  0329  00F3               	fcall	___wmul

Here is the list file of line 65 and sure enough, there it is. Lots of code to setup the multiply and then a brutal software multiply. We were doomed. This simple brute force approach needs to be a bit more elegance in its data structures.

I will include the packaged project and the generator and you can duplicate the work. I still like this concept, I challenge the readers to run some experiments and post the fastest/smallest version they can devise.

Good Luck

example5.zip

trie_builder.c

trie_builder.h