Tokenizing Keywords - Part 1





If you find that you're spending almost all your time on theory, start turning some attention to practical things; it will improve your theories. If you find that you're spending almost all your time on practice, start turning some attention to theoretical things; it will improve your practice. - Donald Knuth, quoted in: Arturo Gonzalez-Gutierrez (2007) Minimum-length Corridors: Complexity and Approximations. p. 99

This is the first of a five-part series in which I will explore some different ways to tokenize keywords.  This is a simple and common task that seems to crop up when you least expect it.  We have all probably done some variation of the brute force approach in this first posting, but the posts that follow should prove interesting.  Here is the sequence:

Part 1 : STRCMP Brute Force and the framework

Part 2 : IF-ELSE for speed

Part 3 : Automating the IF-ELSE for maintenance

Part 4 : Perfect Hash Maps

Part 5 : Tries

Feel free to skip around once they are all published if there is an interesting topic that you want to learn more about.  In this first posting I will be building a complete testable example so we can do science rather than expounding upon our opinions.

About The Test Data

I am currently working on a GPS project so I am being subjected to NMEA sentences.  (https://en.wikipedia.org/wiki/NMEA_0183).  These sentences are quite easy to follow and you generally get a burst of text data every second from your GPS module.  The challenge is to quickly & efficiently extract this information while keeping your embedded system performing all the tasks required and do this with the smallest possible program.

As far as stream data is concerned, NMEA sentences are remarkably well behaved.  Here is a sample message of the kind my new Bad Elf GPS Pro + sends (this one is the textbook GGA example):

$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47

The pattern is:

$<5 or 6 character keyword>,<comma separated data fields>*<2 hex digit checksum> followed by a carriage return and line feed.
The work of separating out the parts of a sentence is called parsing.  We will not be fully parsing the NMEA sentence but if we wanted to, there are many useful tools like BISON that can automate this job.  Each of the automated tools has a minimum price of entry in terms of code size and CPU cycles.  This brute force approach has the benefit of simplicity, maintainability and a low initial size/speed cost.  So let us press on and establish a performance baseline.

Our Test Program

The program we are about to construct works like this: find a keyword in the serial stream, identify the keyword, and then process the rest of the sentence.

Like all instructors I will concentrate on the necessary parts of the lesson and leave all the processing as an exercise for the student.  Once you have identified the sentence, it is easy to apply a suitable sscanf or strtok to process each sentence individually.  The actual part we are going to measure is Identify the Keyword, but Find a Keyword is a necessary first step, and a little bit of wrapper/measurement code is needed to complete the experiment.

Parsing out our Keywords

To get the identifier all I need to do is discard characters until I see a '$'.  Then collect characters until I see a ','.  That is easy code to write so here is a quick version:

char * NMEA_getWord(void)
{
    static char buffer[7];
    memset(buffer, 0, sizeof(buffer));
    do {
        serial_read(&buffer[0], 1);
    } while(buffer[0] != '$');
    for(int x = 0; x < sizeof(buffer)-1; x++) {
        serial_read(&buffer[x], 1);
        if(buffer[x] == ',') {
            buffer[x] = 0;
            break;
        }
    }
    return buffer;
}

I did not try to collect the entire NMEA sentence, validate the checksum, or handle line noise, so this is not a complete application.  It does fill a buffer with the NMEA keyword quickly and legibly.  One more note: the buffer is 7 bytes because my Bad Elf GPS sends PELFID as the FIRST sentence upon connection.  If I wanted to decode this sentence, the buffer needs another byte.
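For completeness, here is what the skipped checksum step looks like.  This sketch is not part of this article's code, and it assumes the whole sentence has already been captured as a string (which NMEA_getWord deliberately does not do).  An NMEA checksum is simply the XOR of every character between the '$' and the '*', compared against the two hex digits that follow:

```c
#include <stdlib.h>

/* Validate an NMEA checksum: XOR every character between '$' and '*',
   then compare against the two hex digits after the '*'.
   Returns 1 on a good checksum, 0 otherwise. */
int NMEA_checksumOK(const char *sentence)
{
    unsigned char sum = 0;
    const char *p = sentence;
    if(*p++ != '$') return 0;            /* sentences must start with '$' */
    while(*p && *p != '*')
        sum ^= (unsigned char)*p++;      /* XOR the payload characters */
    if(*p != '*') return 0;              /* no checksum field present */
    return (unsigned char)strtol(p + 1, NULL, 16) == sum;
}
```

For example, NMEA_checksumOK("$GPGGA*56") returns 1 because G^P^G^G^A works out to 0x56.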

NMEA_getWord will be used for all the testing in this series of articles, so it is worth understanding.

So here is a description of each line of code:

  1. the function will take no parameters and return a pointer to a NMEA sentence.  It will block until it has such a sentence so it assumes a continuous data stream.
  2. There is a static buffer of 7 characters.  A pointer to this buffer will be sent to the caller with every iteration.  The buffer is 7 characters to allow for the 6 character message from my GPS and a null terminator.
  3. Because the buffer is static it is only initialized once at startup, so every time this function is called we clear it back to 0's.
  4. The do/while loop reads single characters from the serial port (provided by the user) until a '$' character is received.  The getchar() pattern may have been a better abstraction for the serial port but I was originally planning to receive 5 bytes after the $ with a single call.  The 6 byte PELFID changed my mind.
  5. The for loop simply reads UP-TO sizeof(buffer) -1 bytes unless the received byte is a ','.  I prefer to use sizeof whenever possible to make the compiler do the math to determine how big arrays are.
  6. if a ',' is received, we convert it to a 0 and break.
  7. lastly we return the buffer pointer.

The next piece of housekeeping we should discuss is the main function.  This is a simple test application that will provide the framework for evaluating ways of detecting text strings and timing their impact.

int main(int argc, char **argv)
{
    serial_open();
    for(int x = 0; x < 24; x++)
    {
        char *w = NMEA_getWord();
        enum wordTokens t = NMEA_findToken(w);
        printf("word %s,", w);
        if(t >= 0)
            printf("token %d\n", t);
        else
            printf("no match\n");
    }
    serial_close();
    return 0;
}

Like before, I will give a quick line-by-line explanation

  1. This is the standard main declaration for ALL C programs.  For embedded systems we often ignore the parameters and return values but I like to do it for test programs simply because I generally prototype as command line programs on my Mac.
  2. serial_open is a simple abstraction that is easy to adapt to any system.  It will prepare the serial input from my GPS.  The version attached to this project will work on a Linux or Mac and open a specific /dev/cu.<device> that is hardcoded.  This is appropriate for such a test application where knowledgeable users can easily edit the code.  Obviously a robust version would pass the serial device in as a command line parameter and an embedded system would just activate the hardware.
  3. This program will decode 24 keywords.  That makes a simple limit to run in the simulator or on my Mac and is sufficient for these tests.
  4. Now we run the test.  First, get a keyword
  5. Second find the keyword and return the token
  6. Third display some output
  7. Repeat
  8. Finally we close the serial port.  Often I will skip this step but on a PC program, leaving the port open will cause an error on the second run of the program until the port times out and closes automatically.  By closing I save time for the user.  For an embedded system, the serial_close function is usually left empty.
  9. For the simulator I would add a while(1) line before the return 0.  This will freeze the simulation and allow us to study the results.  Otherwise, XC8 may simply restart our application from the beginning and fill up my log file with repeat runs.

This covers all the code except for the bit that we are interested in.  For these types of experiments I like to be as agile as possible.  This is about failing fast so we can discard bad ideas as quickly as possible.  We should never invest so much of our energy in a bad idea that we feel emotionally obligated to "make it work at all costs".  I have seen too many bad ideas last too long due to this human failing.

So let's briefly talk about NMEA_findToken.

The function NMEA_findToken takes a pointer to a possible keyword (as identified by NMEA_getWord) and returns an integer (enum really) that identifies it as a number to the rest of the code.  I will leave as an exercise to the student the work of doing something useful with a token id.  In the future we can replace this function with different techniques and so long as the interface stays constant, we can simply insert the code and measure the performance.

For today's test, we will simply apply a brute force strategy.  We have a list of words:

const char *wordList[] = {"GPGGA","GNGSA","GPGSV","GPBOD","GPDBT","GPDCN"};

and a matching enumeration, something like this (any member names will do so long as NO_WORD is negative and the rest follow the order of the word list):

enum wordTokens {NO_WORD = -1, GPGGA_TOKEN, GNGSA_TOKEN, GPGSV_TOKEN, GPBOD_TOKEN, GPDBT_TOKEN, GPDCN_TOKEN};
All we need to do is compare our test word to each word in the list until a match is found.  The C standard library provides a useful bounded string compare, strncmp, but the XC8 library I am using does not provide it, so we will use its sibling strcmp and press on.  We will just be very careful not to lose our terminating character.

enum wordTokens NMEA_findToken(char *word)
{
    enum wordTokens retValue = NO_WORD;
    for(int x = 0; x < sizeof(wordList)/sizeof(*wordList); x++)
    {
        if(strcmp(word, wordList[x]) == 0)
        {
            retValue = x;
            break;
        }
    }
    return retValue;
}

Here is the blow-by-blow:

  1. This function will take a pointer to a string and return a token.
  2. Initialize the return value to the no_word value
  3. Step through the word list
  4. Test each word looking for a match
  5. on a match, set the return value and leave the loop
  6. Return the value to the caller

I ran it on my Mac and it runs fine, identifying the strings from my GPS.  I also ran it in the PIC16F1939 simulator so I could measure the CPU cycles and code size for a baseline.

For the testing I am looking for 6 words which were carefully chosen to have some missing characters, overlapping characters, and matching words at the beginning and end of the list.


Before I show the measurements we should make this scientific and formulate a hypothesis.

My test sequence is GNGSA, GPGSV, GLGSV, GPRMC, GPGGA

The words in the list are in this order : GPGGA, GNGSA, GPGSV, GPBOD, GPDBT, GPDCN

So I expect the first word to fail at the P in GPGGA, then pass GNGSA.  This will be a total of 7 tests to execute.

The second word will fail GPGGA on the fourth test (G, P, and G all match before the S fails), fail GNGSA on the second test, and then pass for a total of 11 tests to execute.

GLGSV will require 12 tests to fail.  In every word it will fail the second test.

GPRMC will require 17 tests to fail: it survives two characters against every GP word before failing on the third, and fails GNGSA on the second test.

GPGGA will require 5 tests to pass.

The maximum number of tests required to pass or fail is the total number of characters in the entire list.  That assumes a list constructed so the words differ only in their last letter, forcing a candidate word to be compared against every character of every entry before a match or failure is known.  Our little test list cannot hit that bound, but a different list of this size could take 30 tests.

So I expect it to be O(n) where n is the total character count in the word list.


Here is the actual data for executing NMEA_findToken in cpu instruction cycles on a PIC16F1939.  The data was taken using the MPLAB X simulator.

Word    STRCMP Cycles   Notes
GNGSA   399             Second word in the word list
GPGSV   585             Third word in the word list
GLGSV   724             Not in word list
GPRMC   899             Not in word list
GPGGA   283             First word in the word list

The program did include printf and some buffers, so it ended up large at 2153 bytes of flash memory.


If you decided that this was the strategy for you, then you can make some improvements simply by carefully ordering the list so the most frequent words come first.  But looking things over, it would be nicer still if the search never had to re-test the same letters: all the words start with G, so if the first letter is not a G we can stop the entire process.  We will explore one way to accomplish this optimization next time.

For now, take a look at the attached files; I welcome any suggestions for improvements to this strategy.

Good Luck




