Jump to content
 

Search the Community

Showing results for tags 'tokenizing'.

  • Search By Tags

    Type tags separated by commas.
  • Search By Author

Content Type


Forums

  • Top Level
    • Announcements
    • The Lounge
    • Questions and Answers
    • Forum Help
    • Project Gallery
    • Vendor Bug Reports

Blogs

  • What every embedded programmer should know about ...
  • Programming Lore
  • Toe-to-toe

Categories

  • Files

Find results in...

Find results that contain...


Date Created

  • Start

    End


Last Updated

  • Start

    End


Filter by number of...

Joined

  • Start

    End


Group


About Me

Found 1 result

  1. “Code generation, like drinking alcohol, is good in moderation.” — Alex Lowe This episode we are going to try something different. The brute force approach had the advantage of being simple and easy to maintain. The hand-crafted decision tree had the advantage of being fast. This week we will look at an option that will hopefully combine the simplicity of the string list and the speed of the decision tree. This week we will use a code generator to automatically create the tokenizing state machine. I will leave it to you to decide if we use generation in moderation. Let me introduce RAGEL http://www.colm.net/open-source/ragel/. I discovered RAGEL a few years ago when I was looking for a quick and dirty way to build some string handling state machines. RAGEL will construct a complete state machine that will handle the parsing of any regular expression. It can do tokenizing and it can do parsing. Essentially, you define the rules for the tokens and the functions to call when each token is found. For instance, you can write a rule to handle any integer and when an integer is found it can call your doInteger() method. For our simple example of identifying 6 words, the RAGEL code will be a bit overkill but it will be MUCH faster than a brute force string search and in the same ball park as the hand crafted decision tree. Let us get started. First let us get the housekeeping out of the way. This part of the code you have seen before. It is identical to the first two examples I have already provided. There are two differences. First, this only LOOKS like C code. In fact, it is a RAGEL file (I saved it with a .rl extension) and you will see the differences in a moment. When I use a code synthesizer, I like to place the needed command line at the top of the file in comments. While comments are a smell, this sort of comment is pretty important. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 // compile into C with ragel // ragel -C -L -G2 example3.rl -o example3.c // #include <string.h> #include <stdio.h> #include "serial_port.h" char * NMEA_getWord(void) { static char buffer[7]; memset(buffer,0,sizeof(buffer)); do { serial_read(buffer,1); } while(buffer[0] != '$'); for(int x=0;x<sizeof(buffer)-1;x++) { serial_read(&buffer[x], 1); if(buffer[x]==',') { buffer[x] = 0; break; } } return buffer; } enum wordTokens {NO_WORD = -1,GPGGA,GNGSA,GPGSV,GPBOD,GPDBT,GPDCN, GPRMC, GPBWC}; RAGEL is pretty nice in that they choose some special symbols to identify the RAGEL bits so the generator simply passes all input straight to the output until it finds the RAGEL identifiers and then it gets to work. This architecture allows you to simply insert RAGEL code directly into your C (or other languages) and add the state machines in place. The first identifiers we find are the declaration of a state machine (foo seemed traditional). You can define more than one machine so it is important to provide a hint to the generator about which one you want to define. After the machine definition, I specified the location to place all the state machine data tables. There are multiple ways RAGEL can produce a state machine. If the machine requires data, it will go at the write data block. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 %% machine foo; %% write data; enum wordTokens NMEA_findToken(char *word) { const char *p = word; const char *pe = word + strlen(word); int cs; enum wordTokens returnValue = NO_WORD; %%{ action gpgga { returnValue = GPGGA; fbreak; } action gngsa { returnValue = GNGSA; fbreak; } action gpgsv { returnValue = GPGSV; fbreak; } action gpbod { returnValue = GPBOD; fbreak; } action gpdbt { returnValue = GPDBT; fbreak; } action gpdcn { returnValue = GPDCN; fbreak; } action gpbwc { returnValue = GPBWC; fbreak; } action gprmc { returnValue = GPRMC; fbreak; } gpgga = ('GPGGA') @gpgga; gngsa = ('GNGSA') @gngsa; gpgsv = ('GPGSV') @gpgsv; gpbod = ('GPBOD') @gpbod; gpdbt = ('GPDBT') @gpdbt; gpdcn = ('GPDCN') @gpdcn; gpbwc = ('GPBWC') @gpbwc; gprmc = ('GPRMC') @gprmc; main := ( gpgga | gngsa | gpgsv | gpbod | gpdbt | gpdcn | gpbwc | gprmc )*; write init; write exec noend; }%% return returnValue; } Next is the C function definition starting at line 4 above. I am keeping the original NMEA_findToken function as before. No sense in changing what is working. At the beginning of the function is some RAGEL housekeeping defining the range of text to process. In this case the variable p represents the beginning of the test while pe represents the end of the text. The variable cs is a housekeeping variable and the token is the return value so initialize it to NO_WORD. The next bit is some RAGEL code. The %%{ defines a block of ragel much like /* defines the start of a comment block. The first bit of ragel is defining all of the actions that will be triggered when the strings are identified. Honestly, these actions could be anything and I held back simply to keep the function identical to the original. It would be easy to fully define the NMEA data formats and fully decode each NMEA sentence. These simply identify the return token and break out of the function. If we had not already sliced up the tokens we would want to keep store our position in the input strings so we could return to the same spot. It is also possible to feed the state machine one character a time like in an interrupt service routine. After the actions, line 21 defines the search rules and the action to execute when a rule is matched. These rules are simply regular expressions (HA! REGEX and SIMPLE in the same sentence). For this example, the expressions are simply the strings. But if your regular expressions were more complex, you could go crazy. Finally, the machine is defined as matching any of the rules. The initialization and the actual execute code are placed and the RAGEL is complete. Whew! Let us look at what happened when we compile it. One of my favorite programming tools is graphviz. Specifically DOT. It turns out that RAGEL can produce a dot file documenting the produced state machine. Lets try it out. bash> ragel -C -L -V example3.rl -o example3.dot bash> dot example3.dot -T png -O It would be nicer if all the numbers on the arrows were the characters rather than the ASCII codes but I suppose I am nitpicking. Now you see why I named my actions after the sentences. The return arrow clearly shows which action is being executed when the words are found. It also shows that the action triggers when the last letter is found rather than a trailing character. I suppose if you had the word gpgga2, then you would need to add some additional REGEX magic. The dotted arrow IN leading to state 17 refers to any other transition not listed. That indicates that any out-of-place letter simply goes back to 17 without triggering an ACTION. It is possible to define a “SYNTAX ERROR” action to cover this case but I did not care. For my needs, failing quietly is a good choice. This all looks pretty good so far. What does the C look like? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 /* #line 1 "example3.rl" */ // compile into C with ragel // ragel -C -L -G2 example3.rl -o example3.c // #include < string.h > #include < stdio.h > #include "serial_port.h" char * NMEA_getWord(void) { static char buffer[7]; memset(buffer, 0, sizeof(buffer)); do { serial_read(buffer, 1); } while (buffer[0] != '$'); for (int x = 0; x < sizeof(buffer) - 1; x++) { serial_read( & buffer[x], 1); if (buffer[x] == ',') { buffer[x] = 0; break; } } return buffer; } enum wordTokens { NO_WORD = -1, GPGGA, GNGSA, GPGSV, GPBOD, GPDBT, GPDCN, GPRMC, GPBWC }; /* #line 34 "example3.rl" */ /* #line 39 "example3.c" */ static const int foo_start = 17; static const int foo_first_final = 17; static const int foo_error = 0; static const int foo_en_main = 17; /* #line 35 "example3.rl" */ enum wordTokens NMEA_findToken(char * word) { const char * p = word; const char * pe = word + strlen(word); int cs; enum wordTokens returnValue = NO_WORD; /* #line 57 "example3.c" */ { cs = foo_start; } /* #line 62 "example3.c" */ { switch (cs) { tr5: /* #line 45 "example3.rl" */ { returnValue = GNGSA; { p++; cs = 17; goto _out; } } goto st17; tr12: /* #line 47 "example3.rl" */ { returnValue = GPBOD; { p++; cs = 17; goto _out; } } goto st17; tr13: /* #line 50 "example3.rl" */ { returnValue = GPBWC; { p++; cs = 17; goto _out; } } goto st17; tr16: /* #line 48 "example3.rl" */ { returnValue = GPDBT; { p++; cs = 17; goto _out; } } goto st17; tr17: /* #line 49 "example3.rl" */ { returnValue = GPDCN; { p++; cs = 17; goto _out; } } goto st17; tr20: /* #line 44 "example3.rl" */ { returnValue = GPGGA; { p++; cs = 17; goto _out; } } goto st17; tr21: /* #line 46 "example3.rl" */ { returnValue = GPGSV; { p++; cs = 17; goto _out; } } goto st17; tr23: /* #line 51 "example3.rl" */ { returnValue = GPRMC; { p++; cs = 17; goto _out; } } goto st17; st17: p += 1; case 17: /* #line 101 "example3.c" */ if (( * p) == 71) goto st1; goto st0; st0: cs = 0; goto _out; st1: p += 1; case 1: switch (( * p)) { case 78: goto st2; case 80: goto st5; } goto st0; st2: p += 1; case 2: if (( * p) == 71) goto st3; goto st0; st3: p += 1; case 3: if (( * p) == 83) goto st4; goto st0; st4: p += 1; case 4: if (( * p) == 65) goto tr5; goto st0; st5: p += 1; case 5: switch (( * p)) { case 66: goto st6; case 68: goto st9; case 71: goto st12; case 82: goto st15; } goto st0; st6: p += 1; case 6: switch (( * p)) { case 79: goto st7; case 87: goto st8; } goto st0; st7: p += 1; case 7: if (( * p) == 68) goto tr12; goto st0; st8: p += 1; case 8: if (( * p) == 67) goto tr13; goto st0; st9: p += 1; case 9: switch (( * p)) { case 66: goto st10; case 67: goto st11; } goto st0; st10: p += 1; case 10: if (( * p) == 84) goto tr16; goto st0; st11: p += 1; case 11: if (( * p) == 78) goto tr17; goto st0; st12: p += 1; case 12: switch (( * p)) { case 71: goto st13; case 83: goto st14; } goto st0; st13: p += 1; case 13: if (( * p) == 65) goto tr20; goto st0; st14: p += 1; case 14: if (( * p) == 86) goto tr21; goto st0; st15: p += 1; case 15: if (( * p) == 77) goto st16; goto st0; st16: p += 1; case 16: if (( * p) == 67) goto tr23; goto st0; } _out: {} } /* #line 66 "example3.rl" */ return returnValue; } int main(int argc, char ** argv) { if (serial_open() > 0) { for (int x = 0; x < 24; x++) { char * w = NMEA_getWord(); enum wordTokens t = NMEA_findToken(w); printf("word %s,", w); if (t >= 0) printf("token %d\n", t); else printf("no match\n"); } } serial_close(); return 0; } And this is why we use a code generator. The code does not look too terrible. i.e., I could debug it if I thought there were some bugs and it does follow the state chart in a perfectly readable way. BUT, I hope you are not one of those programmers who finds GOTO against their religion. (Though Edsger Dijkstra did allow an exception for low level code when he wrote EWD215 https://www.cs.utexas.edu/users/EWD/transcriptions/EWD02xx/EWD215.html ) So how does this perform? STRNCMP IF-ELSE RAGEL -G2 GNGSA 399 121 280 GPGSV 585 123 304 GLGSV 724 59 225 GPRMC 899 83 299 GPGGA 283 113 298 And for the code size MPLAB XC8 in Free mode on the PIC16F1939 shows 2552 bytes of program and 1024 bytes of data. Don’t forget that printf is included. But this is comparable to the other examples because I am only changing the one function. So our fancy code generator is usually faster than the brute force approach, definitely slower than the hand-crafted approach and is fairly easy to modify. I think I would use the string compare until I got a few more strings and then make the leap to RAGEL. Once I was committed to RAGEL, I think I would see how much of the string processing I could do with RAGEL just to speed the development cycles and be prepared for that One Last Feature from Marketing. Next week we will look at another code generator and a completely different way to manage this task. Good Luck. example3.X.zip
×
×
  • Create New...