In a prior post, I introduced the minilexer package, and showed some basic uses of the core functions.
In this post, as an example of using minilexer, I’ll parse scrabble games into an R data frame.
Note: If you’re interested in analysis of scrabble games, I recommend having a look at the scrabblr package by James Curley, and checking out his great analysis of 2000 scrabble games using the quacklr package.
Scrabble game format: gcg
The gcg file format is a human readable representation of a Scrabble game.
In its most basic form, it consists of:
- comments up the top (preceded by the # character)
- rows representing the words played by each player in turn
An example gcg file is shown below:
gcg_text <- '
#player1 Quackle Quackle Computer
#player2 David David Boys
#description Quackle Computer plays David Boys in Round 1 at the 2006 Human vs. Computer Showdown
#title 2006 Human vs. Computer Showdown Round 1
#incomplete
>Quackle: DEMJNOT 8d JETON +40 40
>David: ?EDYEIG h2 rEDYEING +64 64
>Quackle: BEDGMNP 7e BEDIM +26 66 BE, ET, DO
>David: HEALERS j1 HEALERS +75 139 BEDIMS
>Quackle: DFGINPS k3 DIF +29 95 AD, LI, EF
>David: COOAORS l1 COOS +28 167 ADO, LIS
>Quackle: EGNOPRS m3 SPONGER +92 187 ADOS, LISP
>David: AORWAVA 6c AVOW +37 204 OBE, WET
>Quackle: AEFMOVZ 8l MEZE +54 241
>David: AARTUNY d8 JAUNTY +32 236
>Quackle: ACFIOOV 1l COOF +27 268
>David: WALTIER 4c WAILED +20 256
>Quackle: AACEINV 3a VIA +22 290 AW
>David: IRUTRUT a3 VIRTU +9 265
>Quackle: AACEHLN 8a EH +42 332 VIRTUE
>David: QUBITUR 2b BRUIT +32 297 BI, RAW
>Quackle: AACILNR 9m RAN +16 348 ZA, EN
>David: PQUIEN? 13a QUEY +32 329
>Quackle: CALLIER c13 EL +2 350
>David: PINIR?N 1e PIN +11 340 PI, IT
>Quackle: ACEILOR 15a CALORIE +83 433 ELL
>David: TRAING? 14f TRAdING +67 407 TI, RE
>David: (DATSXK) +36 443
'
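As a quick illustration of the two kinds of line described above, here is a minimal base R sketch that separates comment lines from move rows and splits one move row on whitespace. The names sample_gcg, comment_lines, move_lines and move_fields are made up for this example, and the reading of the last two fields as move score and running total is inferred from the numbers in the sample game rather than taken from a specification.

```r
# A three-line sample in the same shape as the file above
sample_gcg <- c(
  "#player1 Quackle Quackle Computer",
  "#title 2006 Human vs. Computer Showdown Round 1",
  ">Quackle: DEMJNOT 8d JETON +40 40"
)

# Comment lines start with '#', move rows start with '>'
comment_lines <- sample_gcg[startsWith(sample_gcg, "#")]
move_lines    <- sample_gcg[startsWith(sample_gcg, ">")]

# Fields of a move row, in order:
# player, rack, location, play, move score, running total (assumed meaning)
move_fields <- strsplit(move_lines[1], "\\s+")[[1]]
print(move_fields)
```

Splitting on whitespace is enough for a single well-formed row, but it cannot tell a rack from a played word or cope with the end-of-game line, which is why the rest of this post uses a lexer instead.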
Use lex() to turn the text into tokens
- Start by defining the regular expression patterns for each element in the gcg file.
- Use minilexer::lex() to turn the gcg text into tokens.
- Throw away whitespace, newlines and comments, since I’m not interested in them.
gcg_patterns <- c(
  comment    = '(#.*?)\n', # Assume '#' only ever appears to denote a comment
  newline    = '\n',
  whitespace = '\\s+',
  player     = '>(.*?):', # start of each line with a `>`
  location   = '[a-o]\\d+|\\d+[a-o]|--|-', # [Number][Letter] or [Letter][Number]. Number first for horizontal words. -/-- for specials
  number     = minilexer::pattern_number,
  symbol     = '[-+\\w\\./\\?\\(\\)]+', # Anything else. Could be a rack of letters or a word
  comma      = ","
)
tokens <- minilexer::lex(gcg_text, gcg_patterns)
tokens <- tokens[!(names(tokens) %in% c('whitespace', 'newline', 'comment'))]
tokens[1:23]
## player symbol location symbol number number
## "Quackle" "DEMJNOT" "8d" "JETON" "+40" "40"
## player symbol location symbol number number
## "David" "?EDYEIG" "h2" "rEDYEING" "+64" "64"
## player symbol location symbol number number
## "Quackle" "BEDGMNP" "7e" "BEDIM" "+26" "66"
## symbol comma symbol comma symbol
## "BE" "," "ET" "," "DO"
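To give a feel for what a regex-based lexer does with a named vector of patterns, here is a small base R sketch. It is not minilexer's implementation: simple_lex() and toy_patterns are hypothetical names, the patterns are simplified, and the approach (one combined alternation, then labelling each match with the first pattern that fits it) is just one plausible way to do it.

```r
# Hypothetical, simplified lexer for illustration; not minilexer's code.
simple_lex <- function(text, patterns) {
  # One big alternation: at each position the earliest listed pattern wins
  combined <- paste0("(", patterns, ")", collapse = "|")
  values <- regmatches(text, gregexpr(combined, text, perl = TRUE))[[1]]

  # Label a token with the first pattern that matches it in full
  classify <- function(value) {
    for (type in names(patterns)) {
      if (grepl(paste0("^(?:", patterns[[type]], ")$"), value, perl = TRUE)) {
        return(type)
      }
    }
    NA_character_
  }
  types <- vapply(values, classify, character(1), USE.NAMES = FALSE)

  # Note: text matched by no pattern is silently skipped in this sketch
  stats::setNames(values, types)
}

toy_patterns <- c(
  whitespace = "\\s+",
  number     = "[-+]?\\d+",
  word       = "[A-Za-z?]+"
)

toy_tokens <- simple_lex("JETON +40 40", toy_patterns)
toy_tokens <- toy_tokens[names(toy_tokens) != "whitespace"]
print(toy_tokens)
```

As with gcg_patterns, the order of the patterns matters: a more specific pattern has to be listed before a catch-all one, otherwise the catch-all claims the text first.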
Use a TokenStream to help turn the tokens into data
Initialise a TokenStream object to help us manipulate/interrogate the list of tokens we have.
stream <- minilexer::TokenStream$new(tokens)
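If the idea of a token stream is new, the sketch below shows the general shape of one using a plain closure: a cursor over the named token vector, with a way to peek at the current token and a way to consume it while checking its type. This is a hypothetical stand-in and not the package's TokenStream class; make_stream() is invented for this example, and only the method names current_value(), current_type() and consume_token() are borrowed from the calls used later in this post.

```r
# Hypothetical stand-in for a token stream, built from a closure.
make_stream <- function(tokens) {
  position <- 1L

  # Both accessors return NA once the cursor has run past the last token
  current_value <- function() unname(tokens[position])
  current_type  <- function() names(tokens)[position]

  # Return the current value and advance; optionally insist on a token type
  consume_token <- function(type = NULL) {
    if (position > length(tokens)) stop("No tokens left to consume")
    if (!is.null(type) && !identical(current_type(), type)) {
      stop("Expected '", type, "' but found '", current_type(), "'")
    }
    value <- current_value()
    position <<- position + 1L
    value
  }

  list(
    current_value = current_value,
    current_type  = current_type,
    consume_token = consume_token
  )
}

toy_stream <- make_stream(c(player = "Quackle", symbol = "DEMJNOT", location = "8d"))
first  <- toy_stream$consume_token("player")
second <- toy_stream$consume_token("symbol")
print(c(first, second, toy_stream$current_type()))
```

Checking the expected type at the point of consumption means a malformed line fails loudly where it goes wrong, instead of producing a quietly misaligned data frame.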
Write a function to parse a single event line
Parsing a line which has a scrabble event on it consists of the following steps:
- Check we are at the start of a line with a player’s name.
- Get the next token, a symbol representing their current rack of scrabble tiles.
- Get the next token, which is a board location.
- Get the next token, a symbol holding the word they played.
- Don’t bother extracting the scores - I’m not interested in them.
- Return a single-row data.frame containing that information.
parse_event <- function() {
  player_name <- stream$consume_token('player')
  rack        <- stream$consume_token('symbol')
  if (grepl("^\\(", rack)) {
    # If 'rack' is surrounded by () it is the end of the game.
    return(NULL)
  }
  location <- stream$consume_token('location')
  play     <- stream$consume_token('symbol')
  # tibble() is used here in place of the deprecated data_frame()
  tibble::tibble(player=player_name, rack=rack, location=location, play=play)
}
Write a top-level function containing a parse loop to keep extracting events until we’re done
- For every line in the gcg file (after removing comments)
- Call parse_event()
- Stack all the events into a data.frame
parse_gcg <- function() {
  events <- NULL
  while (!is.na(stream$current_value())) {
    if (stream$current_type() == 'player') {
      event  <- parse_event()
      events <- rbind(events, event)
    } else {
      # Silently consume any unhandled tokens e.g. player scores
      stream$consume_token()
    }
  }
  events
}
events <- parse_gcg()
player | rack | location | play |
---|---|---|---|
Quackle | DEMJNOT | 8d | JETON |
David | ?EDYEIG | h2 | rEDYEING |
Quackle | BEDGMNP | 7e | BEDIM |
David | HEALERS | j1 | HEALERS |
Quackle | DFGINPS | k3 | DIF |
David | COOAORS | l1 | COOS |
Quackle | EGNOPRS | m3 | SPONGER |
David | AORWAVA | 6c | AVOW |
Quackle | AEFMOVZ | 8l | MEZE |
David | AARTUNY | d8 | JAUNTY |
Quackle | ACFIOOV | 1l | COOF |
David | WALTIER | 4c | WAILED |
Quackle | AACEINV | 3a | VIA |
David | IRUTRUT | a3 | VIRTU |
Quackle | AACEHLN | 8a | EH |
David | QUBITUR | 2b | BRUIT |
Quackle | AACILNR | 9m | RAN |
David | PQUIEN? | 13a | QUEY |
Quackle | CALLIER | c13 | EL |
David | PINIR?N | 1e | PIN |
Quackle | ACEILOR | 15a | CALORIE |
David | TRAING? | 14f | TRAdING |
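The parse loop above stacks events by calling rbind() on every pass, which is fine for a game of a couple of dozen moves. For much longer inputs, a common alternative is to collect the rows in a list and bind them once at the end. The sketch below shows that pattern on made-up rows; toy_rows and toy_events are names invented for this example, and the NULL entry stands in for the end-of-game line, for which no event is returned.

```r
# Rows as they might come back from successive calls to an event parser
toy_rows <- list(
  data.frame(player = "Quackle", play = "JETON",    stringsAsFactors = FALSE),
  NULL,  # no event, e.g. the end-of-game line
  data.frame(player = "David",   play = "rEDYEING", stringsAsFactors = FALSE)
)

# Bind everything in one call; NULL entries are dropped by rbind()
toy_events <- do.call(rbind, toy_rows)
print(toy_events)
```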
Manipulate the data
The whole point of parsing this data is to get it into a format that we can manipulate in R, so I’m going to tidy the data further and show the game outcome on a scrabble board.
- Interpret the location field as x and y coordinates
- Add all the words to a game board
- Display the board somehow
library(dplyr)     # provides mutate()
library(magrittr)  # provides the %<>% assignment pipe
#-----------------------------------------------------------------------------
# 1. Interpret the `location` field as x and y coordinates
#-----------------------------------------------------------------------------
events %<>% mutate(
  horizontal = grepl("^\\d", location),
  y          = as.integer(readr::parse_number(location)),
  xc         = stringr::str_extract(location, "[a-o]"),
  x          = match(xc, letters)  # column letter to index: a = 1, ..., o = 15
)
#-----------------------------------------------------------------------------
# 2. Add all the words to a game board (represented by a 15x15 character matrix)
#-----------------------------------------------------------------------------
board <- matrix('.', nrow=15, ncol=15)
add_word_to_board <- function(board, play, x, y, horizontal) {
  # Extend the word along x (horizontal) or along y (vertical)
  if (horizontal) {
    x <- seq_len(nchar(play)) + x - 1
  } else {
    y <- seq_len(nchar(play)) + y - 1
  }
  board[y, x] <- strsplit(play, '')[[1]]
  board
}
for (i in seq_len(nrow(events))) {
  event <- events[i,]
  board <- add_word_to_board(board, event$play, event$x, event$y, event$horizontal)
}
#-----------------------------------------------------------------------------
# 3. Display the board
#-----------------------------------------------------------------------------
cat(paste(apply(board, 1, paste, collapse=' '), collapse="\n"))
## . . . . P I N . . H . C O O F
## . B R U I T . r . E . O . . .
## V I A . . . . E . A D O S . .
## I . W A I L E D . L I S P . .
## R . . . . . . Y . E F . O . .
## T . A V O W . E . R . . N . .
## U . . . B E D I M S . . G . .
## E H . J E T O N . . . M E Z E
## . . . A . . . G . . . . R A N
## . . . U . . . . . . . . . . .
## . . . N . . . . . . . . . . .
## . . . T . . . . . . . . . . .
## Q U E Y . . . . . . . . . . .
## . . L . . T R A d I N G . . .
## C A L O R I E . . . . . . . .
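The location handling is the easiest part of this to get wrong, so here is a self-contained base R sketch of the same idea that can be run without any of the packages used above. decode_location() and place_word() are hypothetical helpers written for this example. They place two crossing words from the game, JETON at 8d (number first, so horizontal) and JAUNTY at d8 (letter first, so vertical), and read them back off the board.

```r
# Hypothetical helpers for illustration; base R only.
decode_location <- function(location) {
  data.frame(
    location   = location,
    horizontal = grepl("^[0-9]", location),                   # number first means horizontal
    row        = as.integer(gsub("[^0-9]", "", location)),    # keep only the digits
    col        = match(gsub("[0-9]", "", location), letters), # a = 1, ..., o = 15
    stringsAsFactors = FALSE
  )
}

place_word <- function(board, play, row, col, horizontal) {
  tiles   <- strsplit(play, "")[[1]]
  offsets <- seq_along(tiles) - 1
  if (horizontal) {
    board[row, col + offsets] <- tiles   # fill along the row
  } else {
    board[row + offsets, col] <- tiles   # fill down the column
  }
  board
}

decoded   <- decode_location(c("8d", "d8"))
toy_board <- matrix(".", nrow = 15, ncol = 15)
toy_board <- place_word(toy_board, "JETON",
                        decoded$row[1], decoded$col[1], decoded$horizontal[1])
toy_board <- place_word(toy_board, "JAUNTY",
                        decoded$row[2], decoded$col[2], decoded$horizontal[2])

# Read the two words back off the board
across <- paste(toy_board[8, 4:8], collapse = "")
down   <- paste(toy_board[8:13, 4], collapse = "")
print(c(across, down))
```

Both words start on the same square, so the J is shared and only ten squares end up filled. The same square appears in row 8, column 4 of the full board printed above.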
Conclusion
In this post, I used minilexer to turn a scrabble game file in gcg format into R data.
This parsing code is a very simple recursive descent parser where the parsing of the whole text (parse_gcg()) is broken into calls to another function to parse smaller parts of the text (parse_event()).
The parsing of scrabble games has been added as a vignette to the minilexer package, and I plan to write a few more simple parsers over the next few days.