Parsing data part 3: parsing chess game files

March 28, 2018 mikefc

In this post, as an example of using minilexer, I’ll parse chess games into an R data frame.

In prior posts:

- I introduced the minilexer package, and showed some basic uses of the core functions.
- I then wrote a simple parser for the Scrabble game format.

Note: If you’re interested in doing more with chess games, I recommend having a look at the rchess package on CRAN.

Chess game format: pgn

The pgn file format is a human readable representation of a chess game.

In its most basic form, it consists of

- a sequence of tags (i.e. metadata such as the event, date and players), each surrounded by []
- a sequence of numbers and moves representing the moves taken by the players, i.e.
  - A number indicating which move this is within the game.
  - The moves for the white and black players, represented in Standard Algebraic Notation (SAN).
- Comments can be interspersed between/within the moves and are surrounded by “{}”

An example pgn file is shown below:

pgn_text <- '
[Event "F/S Return Match"]
[Site "Belgrade, Serbia JUG"]
[Date "1992.11.04"]
[Round "29"]
[White "Fischer, Robert J."]
[Black "Spassky, Boris V."]
[Result "1/2-1/2"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 {This opening is called the Ruy Lopez.}
4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 d6 8. c3 O-O 9. h3 Nb8 10. d4 Nbd7
11. c4 c6 12. cxb5 axb5 13. Nc3 Bb7 14. Bg5 b4 15. Nb1 h6 16. Bh4 c5 17. dxe5
Nxe4 18. Bxe7 Qxe7 19. exd6 Qf6 20. Nbd2 Nxd6 21. Nc4 Nxc4 22. Bxc4 Nb6
23. Ne5 Rae8 24. Bxf7+ Rxf7 25. Nxf7 Rxe1+ 26. Qxe1 Kxf7 27. Qe3 Qg5 28. Qxg5
hxg5 29. b3 Ke6 30. a3 Kd6 31. axb4 cxb4 32. Ra5 Nd5 33. f3 Bc8 34. Kf2 Bf5
35. Ra7 g6 36. Ra6+ Kc5 37. Ke1 Nf4 38. g3 Nxh3 39. Kd2 Kb5 40. Rd6 Kc5 41. Ra6
Nf2 42. g4 Bd3 43. Re6 1/2-1/2
'
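
The rest of this post throws the tags away, but their [Name "value"] shape is regular enough to pull apart with base R regular expressions. The following is only a rough sketch: `pgn_snippet`, `tag_pattern` and `tags` are names made up for this illustration, and it assumes tag values never contain an escaped double quote.

```r
# Illustrative input: two tag lines followed by the start of the move text
pgn_snippet <- '[Event "F/S Return Match"]
[Result "1/2-1/2"]

1. e4 e5'

# Each tag looks like [Name "value"]; capture the name and the quoted value
tag_pattern <- '\\[(\\w+)\\s+"([^"]*)"\\]'
tag_matches <- regmatches(pgn_snippet, gregexpr(tag_pattern, pgn_snippet, perl = TRUE))[[1]]

# Split each matched tag into its parts, giving a character vector named by tag
tags <- setNames(
  sub(tag_pattern, "\\2", tag_matches, perl = TRUE),
  sub(tag_pattern, "\\1", tag_matches, perl = TRUE)
)
tags
```

The result is a named character vector, so `tags[["Result"]]` would give the game result recorded in the header.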

Use `lex()` to turn the text into tokens

Start by defining the regular expression patterns for each element in the pgn file.
Use minilexer::lex() to turn the pgn text into tokens
Throw away whitespace, newlines and tags, since I’m not interested in them.

pgn_patterns <- c(
  comment       = '(;.*?)\n',     # Assume ; only appears to denote comment to end of line
  tag           = '\\[.*?\\]',    # parse tags as a whole token. going to ignore
  comment_open  = "\\{",          # Inline comment start
  comment_close = "\\}",          # Inline comment end
  move_number   = "\\d+\\.+",
  symbol        = '[-+\\w\\./]+',
  newline       = '\n',
  whitespace    = '\\s+'
)

tokens <- minilexer::lex(pgn_text, pgn_patterns)
tokens <- tokens[!(names(tokens) %in% c('whitespace', 'newline', 'tag'))]
tokens[1:23]

##   move_number        symbol        symbol   move_number        symbol 
##          "1."          "e4"          "e5"          "2."         "Nf3" 
##        symbol   move_number        symbol        symbol  comment_open 
##         "Nc6"          "3."         "Bb5"          "a6"           "{" 
##        symbol        symbol        symbol        symbol        symbol 
##        "This"     "opening"          "is"      "called"         "the" 
##        symbol        symbol comment_close   move_number        symbol 
##         "Ruy"      "Lopez."           "}"          "4."         "Ba4" 
##        symbol   move_number        symbol 
##         "Nf6"          "5."         "O-O"
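
To get a feel for what a lexer is doing here, the idea can be approximated in a few lines of base R. This is a sketch under my own assumptions and not how minilexer is implemented: `simple_lex` is a hypothetical helper, the pattern set is a cut-down version of the one above, and unlike a real lexer it silently skips any text that matches no pattern.

```r
# Hypothetical helper: a minimal base-R lexer (not minilexer's implementation)
simple_lex <- function(text, patterns) {
  # Join the patterns into one alternation; at each position the earliest-listed
  # pattern that matches wins, which is why the order of the patterns matters
  combined <- paste0("(?:", patterns, ")", collapse = "|")
  values <- regmatches(text, gregexpr(combined, text, perl = TRUE))[[1]]

  # Label each token with the first pattern that matches it in full
  types <- vapply(values, function(value) {
    for (type in names(patterns)) {
      if (grepl(paste0("\\A(?:", patterns[[type]], ")\\z"), value, perl = TRUE)) {
        return(type)
      }
    }
    NA_character_
  }, character(1), USE.NAMES = FALSE)

  setNames(values, types)
}

demo_patterns <- c(
  comment_open  = "\\{",
  comment_close = "\\}",
  move_number   = "\\d+\\.+",
  symbol        = "[-+\\w\\./]+",
  whitespace    = "\\s+"
)

demo_tokens <- simple_lex("1. e4 e5 {Open game} 2. Nf3", demo_patterns)
demo_tokens <- demo_tokens[names(demo_tokens) != "whitespace"]
demo_tokens
```

Because move_number is listed before symbol, "1." is labelled as a move number even though the symbol pattern would also match it.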

Use a `TokenStream` to help turn the tokens into data

Initialise a TokenStream object to help us manipulate/interrogate the list of tokens we have.

stream <- TokenStream$new(tokens)
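
If the stream object feels opaque, the three methods used in this post can be imitated with a small closure. This is a hypothetical stand-in written for illustration, not the TokenStream class itself: `make_stream` is an invented name, and returning NA at the end of the stream and raising an error on a type mismatch are assumptions about the behaviour, chosen to match how the methods are used below.

```r
# Hypothetical stand-in for a token stream; NOT minilexer's TokenStream class
make_stream <- function(tokens) {
  position <- 1L
  at_end <- function() position > length(tokens)
  list(
    # Type (name) of the current token, or NA at the end of the stream
    current_type = function() {
      if (at_end()) NA_character_ else names(tokens)[position]
    },
    # Value of the current token, or NA at the end of the stream
    current_value = function() {
      if (at_end()) NA_character_ else unname(tokens[position])
    },
    # Return the current value and advance; optionally insist on a token type
    consume_token = function(type = NULL) {
      if (at_end()) stop("Unexpected end of stream")
      if (!is.null(type) && !identical(names(tokens)[position], type)) {
        stop("Expected ", type, " but found ", names(tokens)[position])
      }
      value <- unname(tokens[position])
      position <<- position + 1L
      value
    }
  )
}

demo <- make_stream(c(move_number = "1.", symbol = "e4", symbol = "e5"))
first  <- demo$consume_token("move_number")  # type matches, so the stream advances
second <- demo$current_value()               # peek at the next token without advancing
```

The parsing functions that follow only ever look at the current token and then either consume it or stop, which is all this kind of object needs to support.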

Write a function to parse a comment

Check we are at the start of a comment i.e. a comment_open token
Keep consuming symbols until we reach the end of the comment i.e. a comment_close token
Collapse all the tokens that we collected into a single string, and return this as the comment.

parse_comment <- function() {
  stream$consume_token('comment_open')
  values <- c()
  # Collect tokens up to the closing brace. Also stop at the end of the stream
  # (where current_value() is NA) so an unterminated comment cannot run past it.
  while (!identical(stream$current_type(), 'comment_close') && !is.na(stream$current_value())) {
    values <- c(values, stream$consume_token())
  }
  stream$consume_token("comment_close")
  paste(values, collapse=" ")
}

Write a function to parse a set of 2 moves

A set of moves consists of:

A number indicating the move number within the game
Two strings representing move text - one for white and one for black

Parsing method is as follows:

Check that the token at the current stream position is a move number
Keep consuming symbols until we reach a non-symbol
Check that we got 2 moves.
Return a single row data.frame with the move number and the moves of white and black.

parse_move <- function() {
  move_number <- stream$consume_token('move_number')
  values <- c()
  while (identical(stream$current_type(), 'symbol')) {  # Also need to check for end of stream! TODO
    values <- c(values, stream$consume_token())
  }
  if (length(values) != 2) {
    # Identify the failing move by its move number token
    message <- glue::glue("Expecting 2 values only for moves, but got {length(values)} at move {move_number}")
    stop(message)
  }
  tibble::tibble(move=as.integer(readr::parse_number(move_number)), white=values[1], black=values[2])
}

Write a top-level function containing a parse loop to keep extracting events until we’re done

If the current token is a move_number then call parse_move()
If the current token is a comment then call parse_comment()
Repeat until done

parse_pgn <- function() {
  game <- NULL
  
  # current_value() is NA once every token has been consumed
  while (!is.na(stream$current_value())) {
    ct <- stream$current_type()
    if (identical(ct, 'move_number')) {
      move <- parse_move()
      game <- dplyr::bind_rows(game, move)
    } else if (identical(ct, 'comment_open')) {
      comment <- parse_comment()  # Comment text is parsed but not stored
    } else {
      message <- glue::glue("Parse error. Not understood: {stream$current_value()} (token type: {ct})")
      stop(message)
    }
  }
  
  game
}

game <- parse_pgn()

Table 1: Chess game represented as a data.frame
move	white	black
1	e4	e5
2	Nf3	Nc6
3	Bb5	a6
4	Ba4	Nf6
5	O-O	Be7
6	Re1	b5
7	Bb3	d6
8	c3	O-O
9	h3	Nb8
10	d4	Nbd7
11	c4	c6
12	cxb5	axb5
13	Nc3	Bb7
14	Bg5	b4
15	Nb1	h6
16	Bh4	c5
17	dxe5	Nxe4
18	Bxe7	Qxe7
19	exd6	Qf6
20	Nbd2	Nxd6
21	Nc4	Nxc4
22	Bxc4	Nb6
23	Ne5	Rae8
24	Bxf7+	Rxf7
25	Nxf7	Rxe1+
26	Qxe1	Kxf7
27	Qe3	Qg5
28	Qxg5	hxg5
29	b3	Ke6
30	a3	Kd6
31	axb4	cxb4
32	Ra5	Nd5
33	f3	Bc8
34	Kf2	Bf5
35	Ra7	g6
36	Ra6+	Kc5
37	Ke1	Nf4
38	g3	Nxh3
39	Kd2	Kb5
40	Rd6	Kc5
41	Ra6	Nf2
42	g4	Bd3
43	Re6	1/2-1/2
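
Note that the final row holds the game result (1/2-1/2) in the black column, because the lexer treats the result as just another symbol. One possible way to tidy this up afterwards is sketched below. `game_tail`, `result_pattern` and `result` are names invented for this example, and it only looks at the black column of the last row.

```r
# Illustrative data: the last two rows of the parsed game shown in the table
game_tail <- data.frame(
  move  = c(42L, 43L),
  white = c("g4", "Re6"),
  black = c("Bd3", "1/2-1/2"),
  stringsAsFactors = FALSE
)

# The four game termination markers used in pgn files
result_pattern <- "^(1-0|0-1|1/2-1/2|\\*)$"

# If the last "black move" is really the result, move it out of the table
result <- NA_character_
last <- nrow(game_tail)
if (grepl(result_pattern, game_tail$black[last])) {
  result <- game_tail$black[last]
  game_tail$black[last] <- NA_character_
}
```

After this, `result` holds the termination marker and the last row records only white's move.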

Conclusion

In this post, I used minilexer to turn a chess game file in pgn format into R data.

**Notes:**

- Comments in pgn files can be tricky because they can occur in the middle of a game move, i.e. after white’s move but before black’s. For this post I have assumed that doesn’t happen, and dealing with such comments is left as an exercise ;) [Hint: Call parse_comment() from within parse_move() rather than from parse_pgn()]
- Interpreting the move text into which piece is moving to which location is beyond the scope of this post!
- This parsing code is a very simple recursive descent parser: the parsing of the whole text (parse_pgn()) is broken into calls to other functions that parse smaller parts of the text (parse_comment() and parse_move()).
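
As a rough illustration of the hint about comments inside a move, here is a self-contained sketch in base R. It does not use minilexer: the token vector is typed in by hand, `pos`, `current_type()` and `consume()` are hypothetical stand-ins for the stream, and it assumes black's move follows the comment without a repeated move number such as "1...".

```r
# Illustrative tokens: a comment sits between white's and black's move in move 1
tokens <- c(
  move_number = "1.", symbol = "e4",
  comment_open = "{", symbol = "King", symbol = "pawn", comment_close = "}",
  symbol = "e5",
  move_number = "2.", symbol = "Nf3", symbol = "Nc6"
)
pos <- 1L  # index of the current token

current_type <- function() if (pos > length(tokens)) NA_character_ else names(tokens)[pos]
consume <- function() {
  value <- unname(tokens[pos])
  pos <<- pos + 1L
  value
}

# Consume a {...} comment and return its text
parse_comment <- function() {
  consume()  # the opening "{"
  words <- character(0)
  while (!is.na(current_type()) && current_type() != "comment_close") {
    words <- c(words, consume())
  }
  consume()  # the closing "}"
  paste(words, collapse = " ")
}

# Parse "number white black", skipping comments that appear between the parts
parse_move <- function() {
  number <- consume()
  moves <- character(0)
  while (current_type() %in% c("symbol", "comment_open")) {
    if (current_type() == "comment_open") {
      parse_comment()  # comment text is discarded in this sketch
    } else {
      moves <- c(moves, consume())
    }
  }
  data.frame(move = as.integer(sub("\\.+$", "", number)),
             white = moves[1], black = moves[2], stringsAsFactors = FALSE)
}

# Top-level loop: every remaining token sequence is expected to start a move
game <- NULL
while (!is.na(current_type())) {
  game <- rbind(game, parse_move())
}
game
```

Because parse_move() now handles comments itself, the top-level loop no longer needs a separate branch for them.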
