Intro
Recently I was working on go-syslog
1 package making some changes for a
private project. This package parses incoming Syslog2 records and returns
structured data.
My first questions, when I opened a file https://github.com/influxdata/go-syslog/blob/develop/rfc3164/machine.go were: “What a heck is that?”, “Who is writing in Golang but in C-like style?” and “How can I read it and make any modifications here?”
Before making any modifications I have to get a bit understanding how it works.
Everything is completely clear until
line:97
where two things start: function machine.Parse
and “goto hell”. The function
is 8500 lines long and mostly consists of labels, goto
statements and “magic
numbers”. At that point I had no idea how to deal with it.
That is how it looks like
{
var _widec int16
if (m.p) == (m.pe) {
goto _testEof
}
switch m.cs {
case 1:
goto stCase1
case 0:
goto stCase0
case 2:
goto stCase2
case 3:
goto stCase3
//<skipped>
goto tr37
st104:
if (m.p)++; (m.p) == (m.pe) {
goto _testEof104
}
stCase104:
if (m.data)[(m.p)] == 32 {
goto tr39
}
if 33 <= (m.data)[(m.p)] && (m.data)[(m.p)] <= 126 {
goto st105
}
goto tr37
st105:
if (m.p)++; (m.p) == (m.pe) {
goto _testEof105
}
// and so on...
Well, someone wrote it, right? That means I can modify it I thought and I’ve started project exploring. I quickly found another file machine.go.rl which also started as a Golang file but soon on line 23 starts some weird (at that point of time) statement
%%{
machine rfc3164;
include common "common.rl";
# unsigned alphabet
alphtype uint8;
action mark {
m.pb = m.p
}
OK, it’s definitely not a Golang code. We have some directives rounded by %%{
and }%%
and lines started with %%
. I’ve seen such brackets for the first
time.
Let’s try to find any mentions in this project which is using this *.rl
file.
A-ha, there is a makefile
in this project, which refers to the file.
#bla-bla-bla
RAGEL := ragel -I common
#bla-bla-bla
rfc3164/machine.go: rfc3164/machine.go.rl common/common.rl
rfc3164/machine.go:
$(RAGEL) -Z -G2 -e -o $@ $<
@sed -i '/^\/\/line/d' $@
$(MAKE) file=$@ snake2camel
Who? Ragel?
Ragel
Actually, Ragel is a really cool tool and a special language. It allows you to quickly define a state machine which converts incoming data into a well-defined structure. It’s almost like regexp but clearer and simpler.
What kind of task is Ragel good for?
It’s a list from Ragel’s homepage3:
- Writing robust protocol implementations.
- Parsing data formats.
- Lexical analysis of programming languages.
- Validating user input.
I rarely implement any protocols or make lexical analysis, so let’s see how we can parse user input, for example, phone number.
Ragen in action
What we’ll do parse?
First of all we have to define a phone format itself. Say, it will be
+1 (NPA) NXX-XXXX
where:
+
and1
is an optional part, i.e. any phone can start with+1
,1
, or it can be without international code.NPA
is an area code within a range of200..999
.NXX-XXXX
is a subscriber number, range forN
is[2-9]
, forX
is[0-9]
.- All brackets, spaces and hyphens are optional.
More details about the format on Wikipedia.
With that said, allowable phones are:
+1 (555) 2334567
+1(555)2334567
+1 (555) 2334567
+1 (555) 233-4567
+1 (555) 233-45-67
+1 (555) 233 45-67
+1(555)233-4567
+15552334567
+1-555-233-4567
1 (555) 2334567
1(555)2334567
1 (555) 2334567
1 (555) 233-4567
1 (555) 233-45-67
1 (555) 233 45-67
1(555)233-4567
15552334567
(555) 233-4567
(555)233-4567
5552334567
555-233-4567
555 233 45 67
For numbers without international code, it will be set as 1
by default.
Phone structure
Next, we need a structure which will store our parsed data. The structure will be filled during the parse process.
type Phone struct{
IntCode string
AreaCode string
Number string
}
I wouldn’t argue how it is better to store phone numbers as int
or as a
string
, it’s no matter in our case.
Unit tests
Working with code generators more and more I can say that unit-tests are a significant part of the development. They validate generated code and produce the result we expect. In most of my cases it is faster to write such tests rather than trying to debug generated code manually. Also, small changes in the generator input can drastically change and output.
We have an input, we know what we want to get as an output, we are good to write tests. Due to our parser does two things, parses the phone itself, validates phone format and returns an error if format is incorrect. We need three groups of tests:
- positive tests, i.e. tests which check all correct values parsed properly
- tests which check correctness of area code and validates an error, and
- tests which check correctness of phone number and validates an error
For unit tests I use Ginkgo library. Table tests looks like this:
// machine_test.go
DescribeTable("Correct values",
func() {
phone := CurrentGinkgoTestDescription().TestText
p, err := m.Parse([]byte(phone))
Expect(err).NotTo(HaveOccurred())
Expect(*p).To(MatchAllFields(Fields{
"IntCode": Equal("1"),
"AreaCode": Equal("555"),
"Number": Equal("2334567"),
}))
},
Entry("+1 (555) 2334567"),
Entry("+1(555)2334567"),
Entry("+1 (555) 2334567"),
// skipped
)
DescribeTable("Area error",
func() {
phone := CurrentGinkgoTestDescription().TestText
p, err := m.Parse([]byte(phone))
Expect(p).To(BeNil())
Expect(err).To(MatchError("invalid area code, expected 200..999"))
},
Entry("+1 (155) 2334567"),
Entry("11992223344"),
Entry("134 2223344"),
)
DescribeTable("Invid phone format",
func() {
phone := CurrentGinkgoTestDescription().TestText
p, err := m.Parse([]byte(phone))
Expect(p).To(BeNil())
Expect(err).To(MatchError(fmt.Sprintf("invalid phone format: %s", phone)))
},
Entry("+1 (555) 1334567"),
Entry("+1(555)23!4567"),
Entry("+1(555+2334567"),
// skipped
)
this and other code samples are in this repo4.
Ok, what’s happening in this test? We have a test table description started with
DescribeTable
. First parameter is a table name, second one is a function,
which will be called for each Entry
passed as a last parameter.
Due to each Entry
must have a description as a first param, I use it as an
actual value, which inside the function can be retrieved with
CurrentGinkgoTestDescription().TestText
call. Next, I’d expect m.Parse()
returns no error for positive tests and expected errors for others. Also, for
positive tests it checks actual values in structure fields. All the inputs have
to be converted into the same output.
We’re not able to run these test, because we didn’t implement the parser itself and the test run fails with compile error.
Empty state machine aka parser boilerplate.
Our parser will be written in Ragel and Ragel requires a couple of things it
uses during the parse process. First, we have to define a machine
, then point
out how some variables can be found and define places inside Go code where
Ragel’s output will be written.
Here is an empty state machine,
package parser
%%{
machine phone;
main := any;
write data;
access m.;
variable p m.p;
variable pe m.pe;
variable eof m.eof;
variable data m.data;
}%%
// Machine contains variables used by Ragel auto generated code.
// See Ragel docs for details.
type Machine struct {
// Data to process.
data []byte
// Data pointer.
p int
// Data end pointer.
pe int
// Curent state.
cs int
// End of file pointer.
eof int
// Start of current date block.
pb int
// Current err
err error
}
// Phone represents parser result and will be filled by
// Raagel actions.
type Phone struct {
IntCode string
AreaCode string
Number string
}
// New initialized new Machine structure.
func New() *Machine {
return &Machine{}
}
// string returns current parsed variable. Variable m.pb
// should be updated by "mark" action, while m.p is a
// current parser position inside m.data.
func (m *Machine) string() string {
return string(m.data[m.pb:m.p])
}
// Parse takes a slice of bytes as an input an fills Phone structure in case
// input is a phone number in one of valid formats.
func (m *Machine) Parse(input []byte) (*Phone, error) {
m.data = input
m.pe = len(input)
m.p = 0
m.pb = 0
m.err = nil
m.eof = len(input)
phone := &Phone{}
%% write init;
%% write exec;
return phone, m.err
}
i.e. a machine which does nothing yet, but if we run Ragel:
% ragel -Z -G2 machine.rl -o machine.go
we will get a machine.go
file, which is ready-to-compile, but again, it does
nothing. But tests run and return test error.
Ran 30 of 30 Specs in 0.004 seconds
FAIL! -- 0 Passed | 30 Failed | 0 Pending | 0 Skipped
--- FAIL: TestRagel (0.00s)
In manchine.rl
we have a mix of Golang and Ragel code. Ragel code, after
Ragel compiler call will be replaced with Go code based on machine definition.
Ragel machine definition surrounded by %%{
and }%%
will be used by compiler
and the result will be written into lines started with write
command, where:
write data;
will be replaced with constants.write init;
- initialization machine code.write exec;
- machine execution code.
type Machine
Machine
is a main structure which is used by auto-generated parser code. It
contains all variables required by Ragel. About these variables see Ragel
manual5.
In the machine definition there are a number of instruction:
access m.;
defines that in auto-generated prefixm.
will be used for accessing machine data.m
also is a method receiverfunc (m *Machine) Parse()...
variable p m.p;
and such define how to access specific variables.
Implementation
Our implementation consists of set machines and actions. Machine is defined
as <name> = <expression>
, where an expression can be another machine. Such a
definition is used for building machines from smaller pieces and does not
generate any states. Expressions also can include actions.
sp = ' ';
sp
machine defines a space, because it’s used in more than one place it’s
nice to have such a machine. Later, for instance, we can add here whitespaces
and tabs.
hy = (sp* | '-');
hy
defines a machine with any number of spaces or a hyphen and is used as a
delimiter inside the phone later.
hbl = ( hy | '(' );
hbr = ( hy | ')' );
hbl
and hbr
define delimiters which surround area code.
int = '+'? '1' >mark %set_intcode;
int
defines an international code with values 1
or +1
. Question mark sign
?
means the expression before is an optional.
int
machine also is a first one in our implementation which uses actions:
mark
and set_intcode
. Each action, in turn, has a point of the call,
>
is an entering action and %
is a leaving action.
Actions themselves do something useful. set_intcode
, for example, writes
parsed international code values into appropriate fields of Machine
structure.
action set_intcode{
phone.IntCode = m.string()
}
if we look at m.string()
function:
func (m *Machine) string() string {
return string(m.data[m.pb:m.p])
}
we see m.string()
returns a slice from the m.data
array, within a range
from m.pb
and m.p
. We know (you’ve read comments for an empty machine,
right?) m.p
is a data pointer and updated by Ragel, i.e., m.p
is updated when
every new symbol is processed.
but m.pb
is not and that’s why we need >mark
action called before an
expression. It updates the m.pb
pointer with m.p
value which points to the
place inside the m.data
array where the expression begins. Then when we call
%set_intcode
at then end of the expression, call m.data[m.pb:m.p]
gives us
an actual value. (Note: the idea of extracting values this way was found in
go-syslog
1 project mentioned at the beginning). Later we will extract all
values this way.
nxx = ('2'..'9' . digit{2});
nxx
defines a range 200..999
which is used as an area code and a first part
of the phone number.
area = hbl? nxx >mark %set_area @err(err_area) hbr?;
area
defines an area code with separators: (NXX)
, -NXX-
or sp* NXX sp*
,
hbl/hbr
(delimiters) are optional, we extract value the same way as before
>mark
/%set_area
, but here we have an additional call @err(err_area)
, it’s
used for handling errors and err_area
action will be called when first
incorrect symbol found in the area part.
number = (nxx hy? digit{2} hy? digit{2}) >mark %set_number @err(err_phone);
number
defines a phone number started with 2
, with optional separators.
main := int? sp* area sp* number @set_defaults;
Final main
machine links all previously defined machines together. Here
instantiation operator :=
is used which generates a set of states representing
an expression.
set_defaults
action sets a default international code if one was not found.
With all machines and action specified let’s run Ragel compiler again:
% ragel -Z -G2 machine.rl -o machine.go
this time we see better picture
Running Suite: Parser Suite
===========================
Random Seed: 1611600245
Will run 31 of 31 specs
•••••••••••••••••••••••••••••••
Ran 31 of 31 Specs in 0.000 seconds
SUCCESS! -- 31 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS
Ginkgo ran 1 suite in 556.947243ms
Test Suite Passed
Conclusion
Ragel is definitely worth to consider, especially if you build parsers. It improves code long-term maintainability and readability by clean machine definitions, albeit it looks overcomplicated from the first glance.
Blazing fast syslog parser: https://github.com/influxdata/go-syslog ↩︎ ↩︎
Syslog on Wikipedia: https://en.wikipedia.org/wiki/Syslog ↩︎
Code samples: https://github.com/ekhabarov/blog-code-snippets/tree/master/ragel ↩︎
Ragel manual: https://www.colm.net/files/ragel/ragel-guide-6.10.pdf ↩︎