29 April 2018

# Parsing data part 4: Parsing the Easter Bunny with minilexer

In this post, as an example of using `minilexer`, I’ll parse the Stanford bunny 3D object into an R data structure and display it.

In a prior post, I introduced the `minilexer` package, and showed some basic uses of the core functions.

In subsequent posts, I used `minilexer` to write:

## Example parser: `obj` format for 3d objects

A simple text format for storing 3d objects is the Wavefront obj format. The file type is well documented on the internet, and an example octahedron with 6 vertices and 8 faces is shown below.

``````octahedron_obj <- '
# OBJ file created by ply_to_obj.c
#
g Object001

v  1  0  0
v  0  -1  0
v  -1  0  0
v  0  1  0
v  0  0  1
v  0  0  -1

f  2  1  5
f  3  2  5
f  4  3  5
f  1  4  5
f  1  2  6
f  2  3  6
f  3  4  6
f  4  1  6
'``````

The basic structure of a `.obj` file is:

• Comments start with `#` and continue to the end of the line
• There are symbols at the start of each line telling us what the data on the rest of the line represents, e.g.
  • `v` means this line defines a vertex and will be followed by 3 numbers representing the x, y, z coordinates.
  • `f` means this line defines a triangular face and the following 3 numbers indicate the 3 vertices which make up this face
  • `vn` means this line defines a vector for the direction of the normal at a vertex
• The format is more complicated than this, and I’m leaving out a lot of details, but this is enough to get the general idea.
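As a quick illustration of the line structure described above, here is a minimal base R sketch that strips comments and tallies lines by their leading symbol. The `obj_text` snippet is made-up data for the example, and this is only a sanity check on the format, not the parsing approach used in the rest of this post.

```r
# A tiny hand-written OBJ snippet (hypothetical data, not from this post)
obj_text <- "# a comment\ng Tri\nv 0 0 0\nv 1 0 0\nv 0 1 0\nf 1 2 3\n"

lines <- strsplit(obj_text, "\n", fixed = TRUE)[[1]]
lines <- trimws(sub("#.*$", "", lines))   # drop comments (from `#` to end of line)
lines <- lines[nzchar(lines)]             # drop lines that are now empty

# The first whitespace-delimited field is the symbol for that line
line_type   <- sub("\\s.*$", "", lines)
line_counts <- table(line_type)
print(line_counts)
```

Counting the symbols like this is a cheap way to check that a file contains only the line types a parser expects to handle.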

## Use `lex()` to turn the text into tokens

1. Start by defining the regular expression patterns for each element in the obj file.
2. Use `minilexer::lex()` to turn the obj text into tokens
3. Throw away whitespace, newlines and comments, since I’m not interested in them.
``````library(minilexer)

obj_patterns <- c(
  comment    = '(#.*?)\n',      # assume comments take up the whole line
  number     = pattern_number,  # This regex is defined in `minilexer` and matches most numeric values
  symbol     = '\\w+',
  newline    = '\n',
  whitespace = '\\s+'
)``````
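To show why the order of a named pattern vector matters (`number` is listed before `symbol`, since `\w+` would also match digits), here is a rough base R sketch of the general idea behind a pattern-based lexer. It is not how `minilexer::lex()` is implemented. The `patterns` vector, the simplified number regex and the `first_match()` helper are all assumptions made for this example.

```r
# Simplified stand-in patterns (assumed for this sketch; the number regex is
# much less thorough than a real one)
patterns <- c(
  comment    = "#[^\n]*",
  number     = "-?\\d+\\.?\\d*",
  symbol     = "\\w+",
  newline    = "\n",
  whitespace = "[ \t]+"
)

text <- "# demo\nv 1 -2.5 0\nf 1 2 3\n"

# Join the patterns into one alternation; earlier alternatives are tried first
combined <- paste0("(", patterns, ")", collapse = "|")
tokens   <- regmatches(text, gregexpr(combined, text, perl = TRUE))[[1]]

# Label each token with the name of the first pattern that matches it entirely
first_match <- function(token) {
  hits <- vapply(
    patterns,
    function(p) grepl(paste0("^(", p, ")$"), token, perl = TRUE),
    logical(1)
  )
  names(patterns)[which(hits)[1]]
}
names(tokens) <- vapply(tokens, first_match, character(1), USE.NAMES = FALSE)

# Discard the token types that carry no data
kept <- tokens[!(names(tokens) %in% c("whitespace", "newline", "comment"))]
print(kept)
```

The result is a named character vector, where the name is the token type and the value is the matched text.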

## Tokenising the `obj`

Split the `obj` text data into tokens, but then remove anything that we don’t need to create the actual data structure representing the 3d object.

``````tokens <- lex(octahedron_obj, obj_patterns)
tokens <- tokens[!(names(tokens) %in% c('whitespace', 'newline', 'comment'))]
tokens``````
``````##      symbol      symbol      symbol      number      number      number
##         "g" "Object001"         "v"         "1"         "0"         "0"
##      symbol      number      number      number      symbol      number
##         "v"         "0"        "-1"         "0"         "v"        "-1"
##      number      number      symbol      number      number      number
##         "0"         "0"         "v"         "0"         "1"         "0"
##      symbol      number      number      number      symbol      number
##         "v"         "0"         "0"         "1"         "v"         "0"
##      number      number      symbol      number      number      number
##         "0"        "-1"         "f"         "2"         "1"         "5"
##      symbol      number      number      number      symbol      number
##         "f"         "3"         "2"         "5"         "f"         "4"
##      number      number      symbol      number      number      number
##         "3"         "5"         "f"         "1"         "4"         "5"
##      symbol      number      number      number      symbol      number
##         "f"         "1"         "2"         "6"         "f"         "2"
##      number      number      symbol      number      number      number
##         "3"         "6"         "f"         "3"         "4"         "6"
##      symbol      number      number      number
##         "f"         "4"         "1"         "6"``````

## Use a `TokenStream` to help turn the tokens into data

Initialise a `TokenStream` object to help us manipulate/interrogate the list of tokens we have.

``stream <- TokenStream$new(tokens)``

## Write a function to parse the lines which start with `f`

The lines which start with `f` encode a single triangular face. The numbers which follow the `f` are the indices of the vertices which make up the face.

To parse the lines which start with `f`:

• make sure the current token is `f`
• keep consuming tokens as long as they are `numbers`
• when we run out of numbers, we consider this object parsed and return the data, in this case a numeric vector.
``````parse_f <- function() {

  # make sure the current token is `f`
  stream$consume_token('symbol', 'f')

  # keep consuming tokens as long as they are `numbers`
  values <- stream$consume_tokens_of_type('number', c(3, 4))

  # when we run out of numbers, we consider this object parsed and
  # return the data, in this case a numeric vector.
  as.numeric(values)
}``````
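For readers who want to see the "check, consume, advance" idea without the package, here is a minimal closure-based sketch of a token stream. It is not the `minilexer` `TokenStream` class. `make_stream()`, `consume()` and `consume_while()` are hypothetical names invented for this illustration.

```r
# Hypothetical helper: NOT minilexer's TokenStream, just the same general idea
make_stream <- function(tokens) {
  position <- 1L

  current_value <- function() {
    if (position > length(tokens)) NA_character_ else unname(tokens[position])
  }
  current_type <- function() {
    if (position > length(tokens)) NA_character_ else names(tokens)[position]
  }

  # Check the current token matches what we expect, return it, and advance
  consume <- function(type, value = NULL) {
    ok <- identical(current_type(), type) &&
      (is.null(value) || identical(current_value(), value))
    if (!ok) stop("Unexpected token at position ", position)
    out <- current_value()
    position <<- position + 1L
    out
  }

  # Keep consuming while the current token has the given type
  consume_while <- function(type) {
    out <- character(0)
    while (identical(current_type(), type)) out <- c(out, consume(type))
    out
  }

  list(current_value = current_value, consume = consume, consume_while = consume_while)
}

tokens <- c(symbol = "f", number = "2", number = "1", number = "5",
            symbol = "f", number = "3", number = "2", number = "5")

s <- make_stream(tokens)
s$consume("symbol", "f")
face <- as.numeric(s$consume_while("number"))
print(face)
```

After the first face has been read, the stream is positioned on the next `f`, ready for the next call.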

## Write functions to parse the lines which start with `g`, `v` and `vn`

``````#-----------------------------------------------------------------------------
# Parse the 'group name' specification
#-----------------------------------------------------------------------------
parse_g <- function() {
  stream$consume_token('symbol', 'g')
  stream$consume_token()
}

#-----------------------------------------------------------------------------
# Parse the coordinates for a vertex. This may be 3 or 4 values, but
# I'm just ignoring the 4th, and keeping the (x, y, z)
#-----------------------------------------------------------------------------
parse_v <- function() {
  start_position <- stream$position
  stream$consume_token('symbol', 'v')
  values <- stream$consume_tokens_of_type('number', c(3, 4))
  as.numeric(values[1:3])
}

#-----------------------------------------------------------------------------
# Parse the vector that represents the normal at a vertex.
# This may be 3 or 4 values, but I'm just ignoring the 4th, and keeping the (x, y, z)
#-----------------------------------------------------------------------------
parse_vn <- function() {
  start_position <- stream$position
  stream$consume_token('symbol', 'vn')
  values <- stream$consume_tokens_of_type('number', c(3, 4))
  as.numeric(values[1:3])
}``````

## Write a top-level function containing a parse loop to keep extracting objects until we’re done

• Check the current token
• Call the parser for that token
• Repeat
``````library(glue)  # for the error message below

parse_obj <- function() {
  obj <- list()  # This is where we'll hold the parsed data.

  # Keep going until the stream runs out of tokens
  while (!is.na(stream$current_value())) {
    cv <- stream$current_value()
    if (cv == 'g') {
      parse_g()
    } else if (cv == 'v') {
      v <- parse_v()
      obj$v <- rbind(obj$v, v)
    } else if (cv == 'f') {
      f <- parse_f()
      obj$f <- rbind(obj$f, f)
    } else if (cv == 'vn') {
      vn <- parse_vn()
      obj$vn <- rbind(obj$vn, vn)
    } else {
      message <- glue("Parse error at position {stream$position}. Not understood: {stream$current_value()}")
      stop(message)
    }
  }

  obj
}

obj <- parse_obj()``````

The 3d object now exists as a list of numeric matrices (one for vertices and one for faces)

``obj``
``````## $v
##   [,1] [,2] [,3]
## v    1    0    0
## v    0   -1    0
## v   -1    0    0
## v    0    1    0
## v    0    0    1
## v    0    0   -1
##
## $f
##   [,1] [,2] [,3]
## f    2    1    5
## f    3    2    5
## f    4    3    5
## f    1    4    5
## f    1    2    6
## f    2    3    6
## f    3    4    6
## f    4    1    6``````
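Growing a matrix with `rbind()` one row at a time copies the whole matrix on every iteration. That is fine for an octahedron, but it may become slow for files with many thousands of vertices. One possible alternative, sketched here with made-up rows rather than parsed data, is to collect the rows in a list and bind them once at the end.

```r
# Collect each parsed row in a list (the rows here are placeholder values)
rows <- list()
for (i in 1:4) {
  rows[[length(rows) + 1]] <- c(i, i^2, i^3)
}

# Bind all rows into a matrix in a single step
m <- do.call(rbind, rows)
print(m)
```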

## Post processing the data: Fortify/denormalise/tidy.

The text representation of the `obj` data is quite compact and avoids repetition but this isn’t quite in the right form for us to manipulate in R.

The following code turns this data into the `faces` data.frame, which is more useful: each face has an explicit ID, and the x, y and z co-ordinates of its 3 vertices (a, b, c) are listed one vertex per row, i.e. we’ve created a tidy data.frame!

``````library(dplyr)
library(tidyr)
library(magrittr)  # for `%<>%` and `set_names()`

#-----------------------------------------------------------------------------
# Fortify/denormalise/tidy the `f` and `v` data into `faces`
#-----------------------------------------------------------------------------
create_faces <- function(obj) {
  suppressWarnings({
    # One row per (face, vertex slot): which vertex sits at a, b or c
    faces <- data.frame(obj$f) %>%
      set_names(c('a', 'b', 'c')) %>%
      mutate(face_id = seq(n())) %>%
      gather(idx, vert_id, -face_id) %>%
      arrange(face_id, idx) %>%
      as_tibble()  # the original post used the older `as.tbl()`

    # One row per vertex, with an explicit ID to join on
    verts <- data.frame(obj$v) %>%
      set_names(c('x', 'y', 'z')) %>%
      mutate(vert_id = seq(n())) %>%
      as_tibble()
  })

  faces %<>% left_join(verts, by = 'vert_id')

  faces
}

faces <- create_faces(obj)``````
``faces %>% knitr::kable(caption='tidy faces data.structure')``
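Table 1: tidy faces data.structure

| face_id | idx | vert_id | x | y | z |
|---|---|---|---|---|---|
| 1 | a | 2 | 0 | -1 | 0 |
| 1 | b | 1 | 1 | 0 | 0 |
| 1 | c | 5 | 0 | 0 | 1 |
| 2 | a | 3 | -1 | 0 | 0 |
| 2 | b | 2 | 0 | -1 | 0 |
| 2 | c | 5 | 0 | 0 | 1 |
| 3 | a | 4 | 0 | 1 | 0 |
| 3 | b | 3 | -1 | 0 | 0 |
| 3 | c | 5 | 0 | 0 | 1 |
| 4 | a | 1 | 1 | 0 | 0 |
| 4 | b | 4 | 0 | 1 | 0 |
| 4 | c | 5 | 0 | 0 | 1 |
| 5 | a | 1 | 1 | 0 | 0 |
| 5 | b | 2 | 0 | -1 | 0 |
| 5 | c | 6 | 0 | 0 | -1 |
| 6 | a | 2 | 0 | -1 | 0 |
| 6 | b | 3 | -1 | 0 | 0 |
| 6 | c | 6 | 0 | 0 | -1 |
| 7 | a | 3 | -1 | 0 | 0 |
| 7 | b | 4 | 0 | 1 | 0 |
| 7 | c | 6 | 0 | 0 | -1 |
| 8 | a | 4 | 0 | 1 | 0 |
| 8 | b | 1 | 1 | 0 | 0 |
| 8 | c | 6 | 0 | 0 | -1 |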
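The same denormalisation can also be sketched in base R by using the face indices to look up rows of the vertex matrix directly. The matrices `v` and `f` below are a made-up two-triangle example (a unit square), and `faces_base` is a hypothetical name. This is an alternative illustration, not the code used to build the table above.

```r
# Hypothetical example: a unit square split into two triangles
v <- rbind(c(0, 0, 0), c(1, 0, 0), c(1, 1, 0), c(0, 1, 0))
f <- rbind(c(1, 2, 3), c(1, 3, 4))

# Flatten the face matrix face-by-face: 1 2 3 1 3 4
vert_id <- as.vector(t(f))

# Look up the coordinates of every referenced vertex
faces_base <- data.frame(
  face_id = rep(seq_len(nrow(f)), each = ncol(f)),
  idx     = rep(c("a", "b", "c"), times = nrow(f)),
  vert_id = vert_id,
  x       = v[vert_id, 1],
  y       = v[vert_id, 2],
  z       = v[vert_id, 3]
)
print(faces_base)
```

Each face contributes three consecutive rows, which is also the vertex ordering that a function drawing one triangle per group of three points expects.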

## Let’s view the object!

Use your mouse to rotate and zoom the object.

``````rgl::view3d(theta = 10, phi = 15)
rgl::triangles3d(faces$x, faces$y, faces$z, col = 'grey')``````

## Bunny!

The exact same code was used to parse a much more interesting object: `bunny.obj`.

Use your mouse to rotate and zoom the object.

``````obj_file <- '../../data/obj/bunny.obj'