Tokenizing Keywords Pt 5
QuoteDo or Do not, there is no Trie
- Yoda
Today we are going to explore the weird and wonderful trie (try). We are actually going to brute force a trie and we will write a basic code generator to synthesize a static trie so it our trusty NMEA token tester can stay in flash memory and fit in the PICmcu. I promise this is going to be the LEAST memory efficient trie you ever will see, but it is quick and it is easy to understand so let us press on.
First things first, what is a trie? Simply, the trie takes the if-else construct we did in pt 2 and turns it into a data structure. I love this concept because it means I can write the trie parser once and simply be feeding it different data sets, it can process any language. In fact, I try to make most of my programs data driven so the code can be well tested and the data can be focus of our attention.
Our basic trie will be implemented as a 2D array 26 x N. N will be adjusted to be as small as possible. The 26 comes from the 26 unique characters that can make up my NMEA string set. We will ignore lower case to make this dataset as small as possible.
Here is a picture showing what we plan to do. In the picture we are finding the NMEA string "GPGGA".
Each row of our 2D array represents each letter. If a letter is not involved at that layer of lookup, we will fill it with a -1. Each time we match, we learn the row where the next match is. By matching to the row, GPGGA and GNGSA both share the first G by pointing to the second row. On the second row, The P and the N will point further into the structures.
Here is the search code:
int trie_search(const char *s,int length)
{
int t=0;
int p;
for(int i=0;i<length;i++)
{
p = s[i] - 'A';
if((t = trie[t][p].index)==-1)
return -1;
}
return trie[t][p].key_index - 1; // -1 hold over from a previous version. Generator can post increment the key
}
This looks fantastic. Only one table lookup per letter so this should be blindly fast. Let us take a look at the data.
Well Heck! that is certainly not what we expected to see. What is going on?
At the core, this algorithm is wonderful because it only touches each letter once. This is very similar to the IF-ELSe so I would expect that this algorithm would be very similar to the IF-ELSE. We need to dig a bit deeper. As I look at the code, the only thing that stands out is the actual table lookup. We have a 2D table of structs, with each struct being 2 bytes. I wonder if this function is using a multiply to compute the table address.
9840 l1579:
9841 031A ;example5.c: 65: if((t = trie[t][p].index)==-1)
9842 movf (trie_search@p+1),w
9843 031A 082C movwf (??_trie_search+0)+0+1
9844 031B 00A1 movf (trie_search@p),w
9845 031C 082B movwf (??_trie_search+0)+0
9846 031D 00A0 lslf (??_trie_search+0)+0,f
9847 031E 35A0 rlf (??_trie_search+0)+1,f
9848 031F 0DA1 lslf (??_trie_search+0)+0,f
9849 0320 35A0 rlf (??_trie_search+0)+1,f
9850 0321 0DA1 movf (trie_search@t+1),w
9851 0322 082E movwf (___wmul@multiplier+1)
9852 0323 00F1 movf (trie_search@t),w
9853 0324 082D movwf (___wmul@multiplier)
9854 0325 00F0 movlw 068h
9855 0326 3068 movwf (___wmul@multiplicand)
9856 0327 00F2 movlw 0
9857 0328 3000 movwf ((___wmul@multiplicand))+1
9858 0329 00F3 fcall ___wmul
Here is the list file of line 65 and sure enough, there it is. Lots of code to setup the multiply and then a brutal software multiply. We were doomed. This simple brute force approach needs to be a bit more elegance in its data structures.
I will include the packaged project and the generator and you can duplicate the work. I still like this concept, I challenge the readers to run some experiments and post the fastest/smallest version they can devise.
Good Luck
0 Comments
Recommended Comments
There are no comments to display.