I need to read some foreign data formats into R data structures for manipulation. If the data were CSV, YAML or JSON, I’d just use a package that someone else had already written to read it in. In my case the data format doesn’t have its own package, so I need to write some parsing code from scratch.
To properly parse text into a structured format I’m going to need a lexer (or tokenizer) and a parser. The following succinct definitions are from StackOverflow:
- A tokenizer breaks a stream of text into tokens.
- A lexer is basically a tokenizer, but it usually attaches extra context to the tokens – this token is a number, that token is a string literal, this other token is an equality operator.
- A parser takes the stream of tokens from the lexer and turns it into a (hopefully) tidy data structure that can be manipulated programmatically. For parsing computer programs, the final data structure is most likely an abstract syntax tree, but for parsing data, the output can be whatever data structure makes sense.
In this post I’ll show some example usage of a new package called minilexer, which can be used to help write parsers for simple text formats.
In future posts I’ll show how to write parsers which actually do something interesting with this package.
Introducing the minilexer package
minilexer provides some tools for simple tokenising/lexing and parsing of text files. I will emphasise the mini in minilexer, as this is not a rigorous or formally complete lexer, but it suits 90% of my needs for turning data text formats into tokens.
For complicated parsing (especially of computer programs) you’ll probably want to use the more formally correct lexing/parsing provided by the rly package or the dparser package.
Installation
devtools::install_github('coolbutuseless/minilexer')
Package Overview
Currently the package provides one function and one R6 class:
- minilexer::lex(text, patterns) for splitting the text into tokens.
  - This function uses the user-defined regular expressions (patterns) to split text into a character vector of tokens.
  - The patterns argument is a named vector of character strings representing regular expressions for elements to match within the text.
- minilexer::TokenStream is a class to handle manipulation/interrogation of the stream of tokens to make it easier to write parsers.
Example: Use lex() to split a sentence into tokens
library(minilexer)

sentence_patterns <- c(
word = "\\w+",
whitespace = "\\s+",
fullstop = "\\.",
comma = "\\,"
)
sentence <- "Hello there, Rstats."
lex(sentence, sentence_patterns)
## word whitespace word comma whitespace word
## "Hello" " " "there" "," " " "Rstats"
## fullstop
## "."
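lex() returns a plain named character vector (the names are the token types, the values are the matched text), so the result can be manipulated with ordinary base R. As a quick illustration, and nothing minilexer-specific, here’s one way to drop the whitespace tokens:
tokens <- lex(sentence, sentence_patterns)

# Keep only the tokens whose type (i.e. the name) is not 'whitespace'
tokens[names(tokens) != 'whitespace']

# Should leave: "Hello", "there", ",", "Rstats", "."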
Example: Use lex() to split some simplified R code into tokens
R_patterns <- c(
number = "-?\\d*\\.?\\d+",
name = "\\w+",
equals = "==",
assign = "<-|=",
plus = "\\+",
lbracket = "\\(",
rbracket = "\\)",
newline = "\n",
whitespace = "\\s+"
)
R_code <- "x <- 3 + 4.2 + rnorm(1)"
R_tokens <- lex(R_code, R_patterns)
R_tokens
## name whitespace assign whitespace number whitespace
## "x" " " "<-" " " "3" " "
## plus whitespace number whitespace plus whitespace
## "+" " " "4.2" " " "+" " "
## name lbracket number rbracket
## "rnorm" "(" "1" ")"
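As before, the result is just a named character vector, so interrogating it needs nothing beyond base R. For example, we can tally the token types, or pull out the numeric literals and convert them to actual numbers:
# Tally how many tokens of each type were found
table(names(R_tokens))

# Extract the 'number' tokens and convert them to numeric values
as.numeric(R_tokens[names(R_tokens) == 'number'])

# Should give: 3.0 4.2 1.0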
Example: Use TokenStream to interrogate/manipulate the tokens
The TokenStream class is a way of manipulating a stream of tokens to make it easier to write parsers. It keeps track of which token we are currently looking at, and lets us make assertions about the current token’s value and type.
In the following examples, I’ll be using the R_tokens I extracted above.
# create the stream to handle the tokens
stream <- minilexer::TokenStream$new(R_tokens)
# What position are we at?
stream$position
## [1] 1
# Assert that the first token is a name and has the value 'x'
stream$expect_value('x')
stream$expect_type('name')
# Show what happens if the current token isn't what we expect
stream$expect_value('+')
## Error: Expected ["+"] at position 1 but found [name]: "x"
# Try to consume this token and move on to the next one; because
# the 'type' is incorrect, this will fail
stream$consume_token(type='number')
## Error: Expected ["number"] at position 1 but found [name]: "x"
# Unconditionally consume this token without regard to
# its value or type. This returns the value at the
# current position, and then increments the position
stream$consume_token()
## [1] "x"
# Stream position should have moved on to the second value
stream$position
## [1] 2
# Get the current value, but without advancing the position
stream$current_value()
## [1] " "
# Consume it, i.e. return the current value and increment the position
stream$consume_token(type='whitespace')
## [1] " "
# Stream position should have moved on to the third value
stream$position
## [1] 3
# Get the current value
stream$current_value()
## [1] "<-"
stream$current_type()
## [1] "assign"
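To give a feel for how these pieces fit together in an actual parser, here’s a minimal sketch that walks the R_tokens stream and collects the name of any function being called, i.e. a 'name' token immediately followed by an 'lbracket' token. It only uses the TokenStream methods demonstrated above, and it assumes the end of the stream can be detected by comparing stream$position against length(R_tokens); treat it as an illustration rather than the package’s recommended approach.
# Start with a fresh stream over the same tokens
stream <- minilexer::TokenStream$new(R_tokens)
called_functions <- c()

while (stream$position <= length(R_tokens)) {
  if (stream$current_type() == 'name') {
    # Remember the name, then check whether the next token opens a call
    candidate <- stream$consume_token(type = 'name')
    if (stream$position <= length(R_tokens) &&
        stream$current_type() == 'lbracket') {
      called_functions <- c(called_functions, candidate)
    }
  } else {
    # Not interested in this token: consume it and move on
    stream$consume_token()
  }
}

called_functions
# Should contain just "rnorm" for the R_code above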