This blog has relocated to https://coolbutuseless.github.io and associated packages are now hosted at https://github.com/coolbutuseless.

29 April 2018

mikefc

In a prior post, I introduced the minilexer package, and showed some basic uses of the core functions.

In this post, as an example of using minilexer, I’ll parse scrabble games into an R data frame.

Note: If you’re interested in analysis of scrabble games, I recommend having a look at the scrabblr package by James Curley, and checking out his great analysis of 2000 scrabble games using the quacklr package.

Scrabble game format: gcg

The gcg file format is a human readable representation of a Scrabble game.

In its most basic form, it consists of

  • comments up the top (preceded by the # character)
  • rows representing the words played by each player in turn

An example gcg file is shown below:

gcg_text <- '
#player1 Quackle Quackle Computer
#player2 David David Boys
#description Quackle Computer plays David Boys in Round 1 at the 2006 Human vs. Computer Showdown
#title 2006 Human vs. Computer Showdown Round 1
#incomplete
>Quackle: DEMJNOT  8d   JETON           +40   40
>David: ?EDYEIG   h2  rEDYEING        +64   64
>Quackle: BEDGMNP  7e   BEDIM           +26   66  BE, ET, DO
>David: HEALERS   j1  HEALERS         +75  139  BEDIMS
>Quackle: DFGINPS   k3  DIF             +29   95  AD, LI, EF
>David: COOAORS   l1  COOS            +28  167  ADO, LIS
>Quackle: EGNOPRS   m3  SPONGER         +92  187  ADOS, LISP
>David: AORWAVA  6c   AVOW            +37  204  OBE, WET
>Quackle: AEFMOVZ  8l   MEZE            +54  241
>David: AARTUNY   d8  JAUNTY          +32  236
>Quackle: ACFIOOV  1l   COOF            +27  268
>David: WALTIER  4c   WAILED          +20  256
>Quackle: AACEINV  3a   VIA             +22  290  AW
>David: IRUTRUT   a3  VIRTU            +9  265
>Quackle: AACEHLN  8a   EH              +42  332  VIRTUE
>David: QUBITUR  2b   BRUIT           +32  297  BI, RAW
>Quackle: AACILNR  9m   RAN             +16  348  ZA, EN
>David: PQUIEN? 13a   QUEY            +32  329
>Quackle: CALLIER   c13 EL               +2  350
>David: PINIR?N  1e   PIN             +11  340  PI, IT
>Quackle: ACEILOR 15a   CALORIE         +83  433  ELL
>David: TRAING? 14f   TRAdING         +67  407  TI, RE
>David:               (DATSXK)        +36  443
'
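
To make the two kinds of line concrete, here is a minimal base R sketch (not part of minilexer) that separates the `#` comment lines from the `>` play lines and splits one play into its whitespace-delimited fields. The names `gcg_lines`, `comments`, `moves` and `move_fields` are made up for this illustration, and the sample lines are copied from the game above.

```r
# A few lines copied from the example game, used only for illustration.
gcg_lines <- c(
  "#player1 Quackle Quackle Computer",
  "#incomplete",
  ">Quackle: DEMJNOT  8d   JETON           +40   40",
  ">David: ?EDYEIG   h2  rEDYEING        +64   64"
)

# Lines starting with '#' are comments; lines starting with '>' are plays.
comments <- gcg_lines[startsWith(gcg_lines, "#")]
moves    <- gcg_lines[startsWith(gcg_lines, ">")]

# Split each play on runs of whitespace to see its individual fields:
# player, rack, location, word played, score for the play, running total.
move_fields <- strsplit(moves, "\\s+")
move_fields[[1]]
```

A plain split like this is enough to see the shape of a line, but it treats every field as an anonymous string, which is why a lexer with named token types is used in the rest of this post.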

Use lex() to turn the text into tokens

  1. Start by defining the regular expression patterns for each element in the gcg file.
  2. Use minilexer::lex() to turn the gcg text into tokens
  3. Throw away whitespace, newlines and comments, since I’m not interested in them.
gcg_patterns <- c(
  comment       = '(#.*?)\n',                 # Assume '#' only ever appears to denote a comment 
  newline       = '\n',
  whitespace    = '\\s+',
  player        = '>(.*?):',                  # start of each line with a `>`
  location      = '[a-o]\\d+|\\d+[a-o]|--|-', # [Number][Letter] or [Letter][Number]. Number first for horizontal words. -/-- for specials
  number        = minilexer::pattern_number,
  symbol        = '[-+\\w\\./\\?\\(\\)]+',    # Anything else. Could be a rack of letters or a word
  comma         = ","
)

tokens <- minilexer::lex(gcg_text, gcg_patterns)
tokens <- tokens[!(names(tokens) %in% c('whitespace', 'newline', 'comment'))]
tokens[1:23]
##     player     symbol   location     symbol     number     number 
##  "Quackle"  "DEMJNOT"       "8d"    "JETON"      "+40"       "40" 
##     player     symbol   location     symbol     number     number 
##    "David"  "?EDYEIG"       "h2" "rEDYEING"      "+64"       "64" 
##     player     symbol   location     symbol     number     number 
##  "Quackle"  "BEDGMNP"       "7e"    "BEDIM"      "+26"       "66" 
##     symbol      comma     symbol      comma     symbol 
##       "BE"        ","       "ET"        ","       "DO"
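
The order of the patterns matters: `location` is listed before `number` and `symbol` so that a string like `8d` is tagged as a location rather than something more general. The sketch below shows the idea with a hand-rolled, first-match-wins lexer in base R. It is an assumption-laden illustration, not how minilexer is implemented; `simple_lex` and `demo_patterns` are hypothetical names, and the patterns are simplified versions of the ones above.

```r
# Hypothetical helper, NOT minilexer's implementation: try each named pattern
# at the start of the remaining text and keep the first one that matches.
simple_lex <- function(text, patterns) {
  tokens <- character(0)
  while (nchar(text) > 0) {
    matched <- FALSE
    for (type in names(patterns)) {
      m   <- regexpr(paste0("^(?:", patterns[[type]], ")"), text, perl = TRUE)
      len <- attr(m, "match.length")
      if (len > 0) {
        # Record the matched text under the pattern's name, then drop it
        tokens  <- c(tokens, setNames(substr(text, 1, len), type))
        text    <- substr(text, len + 1, nchar(text))
        matched <- TRUE
        break
      }
    }
    if (!matched) stop("No pattern matches at: ", substr(text, 1, 20))
  }
  tokens
}

# Simplified patterns; earlier entries take priority over later ones.
demo_patterns <- c(
  whitespace = "\\s+",
  location   = "[a-o]\\d+|\\d+[a-o]",
  number     = "[-+]?\\d+",
  symbol     = "[?\\w]+"
)

demo_tokens <- simple_lex("DEMJNOT 8d JETON +40 40", demo_patterns)
demo_tokens <- demo_tokens[names(demo_tokens) != "whitespace"]
demo_tokens
```

If `symbol` were moved ahead of `location` in `demo_patterns`, `8d` would be swallowed as a symbol, because `\w` also matches digits.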

Use a TokenStream to help turn the tokens into data

Initialise a TokenStream object to help us manipulate/interrogate the list of tokens we have.

stream <- minilexer::TokenStream$new(tokens)
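
For readers who have not used a token stream before, the following closure-based sketch mimics the three operations this post relies on: look at the current value, look at the current type, and consume a token (optionally checking its type). It is a hypothetical stand-in called `make_stream`, not minilexer's `TokenStream`. The behaviours shown (returning `NA` at the end of the stream, raising an error on a type mismatch) are assumptions made for this sketch.

```r
# Hypothetical stand-in for a token stream, written with a closure.
# It only mimics the operations used in this post.
make_stream <- function(tokens) {
  position <- 1L

  current_value <- function() {
    if (position > length(tokens)) NA_character_ else unname(tokens[position])
  }

  current_type <- function() {
    if (position > length(tokens)) NA_character_ else names(tokens)[position]
  }

  # Return the current token and advance. If `type` is given, insist that
  # the current token has that type (an assumption made for this sketch).
  consume_token <- function(type = NULL) {
    if (position > length(tokens)) stop("No tokens left to consume")
    if (!is.null(type) && !identical(current_type(), type)) {
      stop("Expected a '", type, "' token but found '", current_type(), "'")
    }
    value <- current_value()
    position <<- position + 1L
    value
  }

  list(current_value = current_value,
       current_type  = current_type,
       consume_token = consume_token)
}

demo_stream <- make_stream(c(player = "Quackle", symbol = "DEMJNOT", location = "8d"))
first  <- demo_stream$consume_token("player")   # "Quackle"
second <- demo_stream$consume_token("symbol")   # "DEMJNOT"
third  <- demo_stream$current_type()            # "location"
```

The useful property is that the stream remembers its own position, so parsing functions can simply ask for "the next token of type X" without passing an index around.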

Write a function to parse a single event line

Parsing a line which has a scrabble event on it consists of the following steps:

  1. Check we are at the start of a line with a player’s name.
  2. Get the next symbol, which represents their current rack of scrabble tiles. If it is wrapped in parentheses, the line marks the end of the game rather than a play, so return NULL.
  3. Get the next token, which is a board location.
  4. Get the next symbol, which is the word they played.
  5. Don’t bother extracting the scores - I’m not interested in them.
  6. Return a single-row data.frame containing that information.
parse_event <- function() {
  player_name <- stream$consume_token('player')
  rack        <- stream$consume_token('symbol')

  # If 'rack' is surrounded by () it is the end of the game, so there is
  # no location or word to extract.
  if (grepl("^\\(", rack)) {
    return(NULL)
  }

  location <- stream$consume_token('location')
  play     <- stream$consume_token('symbol')

  # data_frame() is deprecated; tibble::tibble() is its direct replacement
  tibble::tibble(player=player_name, rack=rack, location=location, play=play)
}

Write a top-level function containing a parse loop to keep extracting events until we’re done

  1. For every line in the gcg file (after removing comments)
  2. Call parse_event()
  3. Stack all the events into a data.frame
parse_gcg <- function() {
  events <- NULL
  
  while(!is.na(stream$current_value())) {
    if (stream$current_type() == 'player') {
      event  <- parse_event()
      events <- rbind(events, event)
    } else {
      # Silently consume any unhandled tokens e.g. player scores
      stream$consume_token()
    }
  }
  
  events 
}

events <- parse_gcg()

Table 1: Scrabble game represented as a data.frame

player   rack     location  play
Quackle  DEMJNOT  8d        JETON
David    ?EDYEIG  h2        rEDYEING
Quackle  BEDGMNP  7e        BEDIM
David    HEALERS  j1        HEALERS
Quackle  DFGINPS  k3        DIF
David    COOAORS  l1        COOS
Quackle  EGNOPRS  m3        SPONGER
David    AORWAVA  6c        AVOW
Quackle  AEFMOVZ  8l        MEZE
David    AARTUNY  d8        JAUNTY
Quackle  ACFIOOV  1l        COOF
David    WALTIER  4c        WAILED
Quackle  AACEINV  3a        VIA
David    IRUTRUT  a3        VIRTU
Quackle  AACEHLN  8a        EH
David    QUBITUR  2b        BRUIT
Quackle  AACILNR  9m        RAN
David    PQUIEN?  13a       QUEY
Quackle  CALLIER  c13       EL
David    PINIR?N  1e        PIN
Quackle  ACEILOR  15a       CALORIE
David    TRAING?  14f       TRAdING

Manipulate the data

The whole point of parsing this data is to get it into a format that we can manipulate in R, so I’m going to tidy the data further and show the game outcome on a scrabble board.

  1. Interpret the location field as x and y coordinates
  2. Add all the words to a game board
  3. Display the board somehow
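library(dplyr)     # mutate()
library(magrittr)  # %<>% (pipe and assign back)

#-----------------------------------------------------------------------------
# 1. Interpret the `location` field as x and y coordinates
#    - a leading digit (e.g. "8d") means the word is played horizontally
#    - the number is the row (y), the letter is the column (x)
#-----------------------------------------------------------------------------
events %<>% mutate(
  horizontal = grepl("^\\d", location),
  y          = as.integer(readr::parse_number(location)),
  xc         = stringr::str_extract(location, "[a-o]"),
  x          = match(xc, letters)   # 'a' -> 1, 'b' -> 2, ...
)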
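
As a quick sanity check of the location rules, here is the same decoding done in base R on four locations taken from the game above. This is only a sketch to show the arithmetic; the variable names are made up for the illustration.

```r
locations <- c("8d", "h2", "13a", "c13")

# A leading digit means the row number comes first, i.e. a horizontal play.
horizontal <- grepl("^\\d", locations)

# Row (y) is the number; column (x) is the letter's position in the alphabet.
y <- as.integer(gsub("[^0-9]", "", locations))
x <- match(gsub("[^a-o]", "", locations), letters)

decoded <- data.frame(location = locations, horizontal, x, y,
                      stringsAsFactors = FALSE)
decoded
```

So `8d` is a horizontal word starting in row 8, column 4, while `h2` is a vertical word starting in row 2, column 8.
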
#-----------------------------------------------------------------------------
# 2. Add all the words to a game board (represented by a 15x15 character matrix)
#-----------------------------------------------------------------------------
board <- matrix('.', nrow=15, ncol=15)

add_word_to_board <- function(board, play, x, y, horizontal) {
  if (horizontal) {
    x <- seq(nchar(play)) + x - 1
  } else {
    y <- seq(nchar(play)) + y - 1
  }

  board[y, x] <- strsplit(play, '')[[1]]
  board
}

for (i in seq_len(nrow(events))) {
  event <- events[i, ]
  board <- add_word_to_board(board, event$play, event$x, event$y, event$horizontal)
}
#-----------------------------------------------------------------------------
# 3. Display the board
#-----------------------------------------------------------------------------
cat(paste(apply(board, 1, paste, collapse=' '), collapse="\n"))
## . . . . P I N . . H . C O O F
## . B R U I T . r . E . O . . .
## V I A . . . . E . A D O S . .
## I . W A I L E D . L I S P . .
## R . . . . . . Y . E F . O . .
## T . A V O W . E . R . . N . .
## U . . . B E D I M S . . G . .
## E H . J E T O N . . . M E Z E
## . . . A . . . G . . . . R A N
## . . . U . . . . . . . . . . .
## . . . N . . . . . . . . . . .
## . . . T . . . . . . . . . . .
## Q U E Y . . . . . . . . . . .
## . . L . . T R A d I N G . . .
## C A L O R I E . . . . . . . .
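
The board-filling step is easier to follow on a tiny example. The sketch below places two crossing words on a 5x5 board using a hypothetical helper, `place_word`, that follows the same indexing idea as the code above: rows are `y`, columns are `x`, and a word extends along a row when horizontal or down a column otherwise.

```r
# Hypothetical helper for illustration: write the letters of `play` onto the
# board starting at column x, row y.
place_word <- function(board, play, x, y, horizontal) {
  tiles   <- strsplit(play, "")[[1]]
  offsets <- seq_along(tiles) - 1L
  if (horizontal) {
    board[y, x + offsets] <- tiles   # same row, consecutive columns
  } else {
    board[y + offsets, x] <- tiles   # same column, consecutive rows
  }
  board
}

small_board <- matrix(".", nrow = 5, ncol = 5)
small_board <- place_word(small_board, "CAT",  x = 1, y = 2, horizontal = TRUE)
small_board <- place_word(small_board, "RATS", x = 2, y = 1, horizontal = FALSE)

board_rows <- apply(small_board, 1, paste, collapse = " ")
cat(board_rows, sep = "\n")
## . R . . .
## C A T . .
## . T . . .
## . S . . .
## . . . . .
```

Because each play records the whole word, including letters already on the board, the shared `A` is simply written twice with the same value.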

Conclusion

In this post, I used minilexer to turn a scrabble game file in gcg format into R data.

This parsing code is a very simple recursive descent parser where the parsing of the whole text (parse_gcg()) is broken into calls to another function to parse smaller parts of the text (parse_event()).

The parsing of scrabble games has been added as a vignette to the minilexer package, and I plan to write a few more simple parsers over the next few days.