Tokenizing Keywords Pt 3

Entry posted by N9WXU · January 23, 2019

494 views

“Code generation, like drinking alcohol, is good in moderation.”

— Alex Lowe

This episode we are going to try something different. The brute force approach had the advantage of being simple and easy to maintain. The hand-crafted decision tree had the advantage of being fast. This week we will look at an option that will hopefully combine the simplicity of the string list and the speed of the decision tree. This week we will use a code generator to automatically create the tokenizing state machine. I will leave it to you to decide if we use generation in moderation.

Let me introduce RAGEL http://www.colm.net/open-source/ragel/. I discovered RAGEL a few years ago when I was looking for a quick and dirty way to build some string handling state machines. RAGEL will construct a complete state machine that will handle the parsing of any regular expression. It can do tokenizing and it can do parsing. Essentially, you define the rules for the tokens and the functions to call when each token is found. For instance, you can write a rule to handle any integer and when an integer is found it can call your doInteger() method. For our simple example of identifying 6 words, the RAGEL code will be a bit overkill but it will be MUCH faster than a brute force string search and in the same ball park as the hand crafted decision tree. Let us get started.

First let us get the housekeeping out of the way. This part of the code you have seen before. It is identical to the first two examples I have already provided. There are two differences. First, this only LOOKS like C code. In fact, it is a RAGEL file (I saved it with a .rl extension) and you will see the differences in a moment. When I use a code synthesizer, I like to place the needed command line at the top of the file in comments. While comments are a smell, this sort of comment is pretty important.

// compile into C with ragel
// ragel -C -L -G2 example3.rl -o example3.c
//

#include <string.h>
#include <stdio.h>
#include "serial_port.h"

char * NMEA_getWord(void)
{
    static char buffer[7];

    memset(buffer,0,sizeof(buffer));
    do
    {
        serial_read(buffer,1);
    } while(buffer[0] != '$');
    for(int x=0;x<sizeof(buffer)-1;x++)
    {
        serial_read(&buffer[x], 1);
        if(buffer[x]==',')
        {
            buffer[x] = 0;
            break;
        }
    }
    return buffer;
}

enum wordTokens {NO_WORD = -1,GPGGA,GNGSA,GPGSV,GPBOD,GPDBT,GPDCN, GPRMC, GPBWC};

RAGEL is pretty nice in that they choose some special symbols to identify the RAGEL bits so the generator simply passes all input straight to the output until it finds the RAGEL identifiers and then it gets to work. This architecture allows you to simply insert RAGEL code directly into your C (or other languages) and add the state machines in place.

The first identifiers we find are the declaration of a state machine (foo seemed traditional). You can define more than one machine so it is important to provide a hint to the generator about which one you want to define.

After the machine definition, I specified the location to place all the state machine data tables. There are multiple ways RAGEL can produce a state machine. If the machine requires data, it will go at the write data block.

%%  machine foo;
%%  write data;

enum wordTokens NMEA_findToken(char *word)
{
    const char *p = word;
    const char *pe = word + strlen(word);
    int cs;
    enum wordTokens returnValue = NO_WORD;

%%{
action gpgga { returnValue = GPGGA; fbreak; }
action gngsa { returnValue = GNGSA; fbreak; }
action gpgsv { returnValue = GPGSV; fbreak; }
action gpbod { returnValue = GPBOD; fbreak; }
action gpdbt { returnValue = GPDBT; fbreak; }
action gpdcn { returnValue = GPDCN; fbreak; }
action gpbwc { returnValue = GPBWC; fbreak; }
action gprmc { returnValue = GPRMC; fbreak; }

gpgga = ('GPGGA') @gpgga;
gngsa = ('GNGSA') @gngsa;
gpgsv = ('GPGSV') @gpgsv;
gpbod = ('GPBOD') @gpbod;
gpdbt = ('GPDBT') @gpdbt;
gpdcn = ('GPDCN') @gpdcn;
gpbwc = ('GPBWC') @gpbwc;
gprmc = ('GPRMC') @gprmc;

main := ( gpgga | gngsa | gpgsv | gpbod | gpdbt | gpdcn | gpbwc | gprmc )*;
write init;
write exec noend;
}%%

    return returnValue;
}

Next is the C function definition starting at line 4 above. I am keeping the original NMEA_findToken function as before. No sense in changing what is working. At the beginning of the function is some RAGEL housekeeping defining the range of text to process. In this case the variable p represents the beginning of the test while pe represents the end of the text. The variable cs is a housekeeping variable and the token is the return value so initialize it to NO_WORD. The next bit is some RAGEL code. The %%{ defines a block of ragel much like /* defines the start of a comment block. The first bit of ragel is defining all of the actions that will be triggered when the strings are identified. Honestly, these actions could be anything and I held back simply to keep the function identical to the original. It would be easy to fully define the NMEA data formats and fully decode each NMEA sentence. These simply identify the return token and break out of the function. If we had not already sliced up the tokens we would want to keep store our position in the input strings so we could return to the same spot. It is also possible to feed the state machine one character a time like in an interrupt service routine.

After the actions, line 21 defines the search rules and the action to execute when a rule is matched. These rules are simply regular expressions (HA! REGEX and SIMPLE in the same sentence). For this example, the expressions are simply the strings. But if your regular expressions were more complex, you could go crazy.

Finally, the machine is defined as matching any of the rules. The initialization and the actual execute code are placed and the RAGEL is complete.

Whew! Let us look at what happened when we compile it.

One of my favorite programming tools is graphviz. Specifically DOT. It turns out that RAGEL can produce a dot file documenting the produced state machine. Lets try it out.

bash> ragel -C -L -V example3.rl -o example3.dot
bash> dot example3.dot -T png -O

It would be nicer if all the numbers on the arrows were the characters rather than the ASCII codes but I suppose I am nitpicking. Now you see why I named my actions after the sentences. The return arrow clearly shows which action is being executed when the words are found. It also shows that the action triggers when the last letter is found rather than a trailing character. I suppose if you had the word gpgga2, then you would need to add some additional REGEX magic. The dotted arrow IN leading to state 17 refers to any other transition not listed. That indicates that any out-of-place letter simply goes back to 17 without triggering an ACTION. It is possible to define a “SYNTAX ERROR” action to cover this case but I did not care. For my needs, failing quietly is a good choice.

This all looks pretty good so far. What does the C look like?

/* #line 1 "example3.rl" */

// compile into C with ragel
// ragel -C -L -G2 example3.rl -o example3.c
//

#include < string.h > #include < stdio.h > #include "serial_port.h"

char * NMEA_getWord(void) {
  static char buffer[7];

  memset(buffer, 0, sizeof(buffer));
  do {
    serial_read(buffer, 1);
  } while (buffer[0] != '$');
  for (int x = 0; x < sizeof(buffer) - 1; x++) {
    serial_read( & buffer[x], 1);
    if (buffer[x] == ',') {
      buffer[x] = 0;
      break;
    }
  }
  return buffer;
}

enum wordTokens {
  NO_WORD = -1, GPGGA, GNGSA, GPGSV, GPBOD, GPDBT, GPDCN, GPRMC, GPBWC
};

/* #line 34 "example3.rl" */

/* #line 39 "example3.c" */
static
const int foo_start = 17;
static
const int foo_first_final = 17;
static
const int foo_error = 0;

static
const int foo_en_main = 17;

/* #line 35 "example3.rl" */

enum wordTokens NMEA_findToken(char * word) {
  const char * p = word;
  const char * pe = word + strlen(word);
  int cs;
  enum wordTokens returnValue = NO_WORD;

  /* #line 57 "example3.c" */
  {
    cs = foo_start;
  }

  /* #line 62 "example3.c" */
  {
    switch (cs) {
      tr5:
      /* #line 45 "example3.rl" */
      {
        returnValue = GNGSA; {
          p++;
          cs = 17;
          goto _out;
        }
      }
      goto st17;
      tr12:
      /* #line 47 "example3.rl" */
      {
        returnValue = GPBOD; {
          p++;
          cs = 17;
          goto _out;
        }
      }
      goto st17;
      tr13:
      /* #line 50 "example3.rl" */
      {
        returnValue = GPBWC; {
          p++;
          cs = 17;
          goto _out;
        }
      }
      goto st17;
      tr16:
      /* #line 48 "example3.rl" */
      {
        returnValue = GPDBT; {
          p++;
          cs = 17;
          goto _out;
        }
      }
      goto st17;
      tr17:
      /* #line 49 "example3.rl" */
      {
        returnValue = GPDCN; {
          p++;
          cs = 17;
          goto _out;
        }
      }
      goto st17;
      tr20:
      /* #line 44 "example3.rl" */
      {
        returnValue = GPGGA; {
          p++;
          cs = 17;
          goto _out;
        }
      }
      goto st17;
      tr21:
      /* #line 46 "example3.rl" */
      {
        returnValue = GPGSV; {
          p++;
          cs = 17;
          goto _out;
        }
      }
      goto st17;
      tr23:
      /* #line 51 "example3.rl" */
      {
        returnValue = GPRMC; {
          p++;
          cs = 17;
          goto _out;
        }
      }
      goto st17;
      st17: p += 1;
    case 17:
        /* #line 101 "example3.c" */
        if (( * p) == 71)
          goto st1;
      goto st0;
      st0: cs = 0;
      goto _out;
      st1: p += 1;
    case 1:
        switch (( * p)) {
      case 78:
        goto st2;
      case 80:
        goto st5;
      }
      goto st0;
      st2: p += 1;
    case 2:
        if (( * p) == 71)
        goto st3;
      goto st0;
      st3: p += 1;
    case 3:
        if (( * p) == 83)
        goto st4;
      goto st0;
      st4: p += 1;
    case 4:
        if (( * p) == 65)
        goto tr5;
      goto st0;
      st5: p += 1;
    case 5:
        switch (( * p)) {
      case 66:
        goto st6;
      case 68:
        goto st9;
      case 71:
        goto st12;
      case 82:
        goto st15;
      }
      goto st0;
      st6: p += 1;
    case 6:
        switch (( * p)) {
      case 79:
        goto st7;
      case 87:
        goto st8;
      }
      goto st0;
      st7: p += 1;
    case 7:
        if (( * p) == 68)
        goto tr12;
      goto st0;
      st8: p += 1;
    case 8:
        if (( * p) == 67)
        goto tr13;
      goto st0;
      st9: p += 1;
    case 9:
        switch (( * p)) {
      case 66:
        goto st10;
      case 67:
        goto st11;
      }
      goto st0;
      st10: p += 1;
    case 10:
        if (( * p) == 84)
        goto tr16;
      goto st0;
      st11: p += 1;
    case 11:
        if (( * p) == 78)
        goto tr17;
      goto st0;
      st12: p += 1;
    case 12:
        switch (( * p)) {
      case 71:
        goto st13;
      case 83:
        goto st14;
      }
      goto st0;
      st13: p += 1;
    case 13:
        if (( * p) == 65)
        goto tr20;
      goto st0;
      st14: p += 1;
    case 14:
        if (( * p) == 86)
        goto tr21;
      goto st0;
      st15: p += 1;
    case 15:
        if (( * p) == 77)
        goto st16;
      goto st0;
      st16: p += 1;
    case 16:
        if (( * p) == 67)
        goto tr23;
      goto st0;
    }

    _out: {}
  }

  /* #line 66 "example3.rl" */

  return returnValue;
}

int main(int argc, char ** argv) {
  if (serial_open() > 0) {
    for (int x = 0; x < 24; x++) {
      char * w = NMEA_getWord();
      enum wordTokens t = NMEA_findToken(w);
      printf("word %s,", w);
      if (t >= 0)
        printf("token %d\n", t);
      else
        printf("no match\n");
    }
  }
  serial_close();
  return 0;
}

And this is why we use a code generator. The code does not look too terrible. i.e., I could debug it if I thought there were some bugs and it does follow the state chart in a perfectly readable way. BUT, I hope you are not one of those programmers who finds GOTO against their religion. (Though Edsger Dijkstra did allow an exception for low level code when he wrote EWD215 https://www.cs.utexas.edu/users/EWD/transcriptions/EWD02xx/EWD215.html )
So how does this perform?

	STRNCMP	IF-ELSE	RAGEL -G2
GNGSA	399	121	280
GPGSV	585	123	304
GLGSV	724	59	225
GPRMC	899	83	299
GPGGA	283	113	298

And for the code size MPLAB XC8 in Free mode on the PIC16F1939 shows 2552 bytes of program and 1024 bytes of data. Don’t forget that printf is included. But this is comparable to the other examples because I am only changing the one function.

So our fancy code generator is usually faster than the brute force approach, definitely slower than the hand-crafted approach and is fairly easy to modify. I think I would use the string compare until I got a few more strings and then make the leap to RAGEL. Once I was committed to RAGEL, I think I would see how much of the string processing I could do with RAGEL just to speed the development cycles and be prepared for that One Last Feature from Marketing.

Next week we will look at another code generator and a completely different way to manage this task.

Good Luck.

example3.X.zip

0 Comments

Recommended Comments

There are no comments to display.

Add a comment...

× Pasted as rich text. Paste as plain text instead

Only 75 emoji are allowed.

× Your link has been automatically embedded. Display as a link instead

× Your previous content has been restored. Clear editor

× You cannot paste images directly. Upload or insert images from URL.

Insert image from URL

Sign In

What every embedded programmer should know about ...

Tokenizing Keywords Pt 3

0 Comments

Recommended Comments

Browse Forums

Blogs

Activity

About