Parsing data part 4: Parsing the Easter Bunny with minilexer

March 29, 2018 mikefc

In this post, as an example of using minilexer, I’ll parse the stanford bunny 3D object into an R data structure and display it.

In a prior post, I introduced the minilexer package, and showed some basic uses of the core functions.

In subsequent posts, I used minilexer to write:

Example parser: `obj` format for 3d objects

A simple text file to store 3d objects is the Wavefront obj format. The filetype is well documented on the internet (e.g. 1, 2, 3), and an example octahedron object is show below which has 6 vertices and 8 faces.

octahedron_obj <- '
# OBJ file created by ply_to_obj.c
#
g Object001

v  1  0  0
v  0  -1  0
v  -1  0  0
v  0  1  0
v  0  0  1
v  0  0  -1

f  2  1  5
f  3  2  5
f  4  3  5
f  1  4  5
f  1  2  6
f  2  3  6
f  3  4  6
f  4  1  6
'

The basic structure of a .obj file is:

Comments start with # and continue to the end of the line
There are symbols at the start of each line telling us what the data on the rest of the line represents, e.g.
- v means this line defines a vertex and will be followed by 3 numbers representing the x, y, z coordinates.
- f means this line defines a triangular face and the following 3 numbers indicate the 3 vertices which make up this face
- vn means this line defines a vector for the direction of the normal at a vertex
The format is more complicated than this, and I’m leaving out a lot of details, but this is enough to get the general idea.

Use `lex()` to turn the text into tokens

Start by defining the regular expression patterns for each element in the obj file.
Use minilexer::lex() to turn the obj text into tokens
Throw away whitespace, newlines and comments, since I’m not interested in them.

obj_patterns <- c(
  comment    = '(#.*?)\n',  # assume comments take up the whole line
  number     = pattern_number,  # This regex is defined in `minilex` and matches most numeric values
  symbol     = '\\w+',
  newline    = '\n',
  whitespace = '\\s+'
)

Tokenising the `obj`

Split the obj text data into tokens, but then remove anything that we don’t need to create the actual data structure representing the 3d object.

tokens <- lex(octahedron_obj, obj_patterns)
tokens <- tokens[!(names(tokens) %in% c('whitespace', 'newline', 'comment'))]
tokens

##      symbol      symbol      symbol      number      number      number 
##         "g" "Object001"         "v"         "1"         "0"         "0" 
##      symbol      number      number      number      symbol      number 
##         "v"         "0"        "-1"         "0"         "v"        "-1" 
##      number      number      symbol      number      number      number 
##         "0"         "0"         "v"         "0"         "1"         "0" 
##      symbol      number      number      number      symbol      number 
##         "v"         "0"         "0"         "1"         "v"         "0" 
##      number      number      symbol      number      number      number 
##         "0"        "-1"         "f"         "2"         "1"         "5" 
##      symbol      number      number      number      symbol      number 
##         "f"         "3"         "2"         "5"         "f"         "4" 
##      number      number      symbol      number      number      number 
##         "3"         "5"         "f"         "1"         "4"         "5" 
##      symbol      number      number      number      symbol      number 
##         "f"         "1"         "2"         "6"         "f"         "2" 
##      number      number      symbol      number      number      number 
##         "3"         "6"         "f"         "3"         "4"         "6" 
##      symbol      number      number      number 
##         "f"         "4"         "1"         "6"

Use a `TokenStream` to help turn the tokens into data

Initialise a TokenStream object to help us manipulate/interrogate the list of tokens we have.

stream <- TokenStream$new(tokens)

Write a function to parse the lines which start with `f`

The lines which start with f encode a single triangular face. The numbers which follow the f are the indicies of the vertices which make up the face.

To parse the lines which start with f:

make sure the current token is f
keep consuming tokens as long as they are numbers
when we run out of numbers, we consider this object parsed and return the data, in this case a numeric vector.

parse_f <- function() {
  
  # make sure the current token is `f`
  stream$consume_token('symbol', 'f')
  
  # keep consuming tokens as long as they are `numbers`
  values <- stream$consume_tokens_of_type('number', c(3, 4))
  
  # when we run out of numbers, we consider this object parsed and 
  # return the data, in this case a numeric vector.
  as.numeric(values)
}

Write function to parse the lines which start with `g`, `v` and `vn`

#-----------------------------------------------------------------------------
# Parse the 'group name' specification
#-----------------------------------------------------------------------------
parse_g <- function() {
  stream$consume_token('symbol', 'g')
  stream$consume_token()
}


#-----------------------------------------------------------------------------
# Parse the coordinates for a vertex. This may be 3 or 4 values, but 
# i'm just ignoring the 4th, and keeping the (x, y, z) 
#-----------------------------------------------------------------------------
parse_v <- function() {
  start_position <- stream$position
  stream$consume_token('symbol', 'v')
  values <- stream$consume_tokens_of_type('number', c(3, 4))
  as.numeric(values[1:3])
}


#-----------------------------------------------------------------------------
# Parse the vector for that represents the normal at a vertex. 
# This may be 3 or 4 values, but i'm just ignoring the 4th, and keeping the (x, y, z) 
#-----------------------------------------------------------------------------
parse_vn <- function() {
  start_position <- stream$position
  stream$consume_token('symbol', 'vn')
  values <- stream$consume_tokens_of_type('number', c(3, 4))
  as.numeric(values[1:3])
}

Write a top-level function containing a parse loop to keep extracting objects until we’re done

Check the current token
Call the parser for that token
Repeat

parse_obj <- function() {
  obj   <- list()  # This is where we'll hold the parsed data.
  
  while (!is.na(stream$current_value())) {
    cv <- stream$current_value()
    if (cv == 'g') {
      parse_g()
    } else if (cv == 'v') {
      v <- parse_v()
      obj$v <- rbind(obj$v, v)
    } else if (cv == 'f') {
      f <- parse_f()
      obj$f <- rbind(obj$f, f)
    } else if (cv == 'vn') {
      vn <- parse_vn()
      obj$vn <- rbind(obj$vn, vn)
    } else {
      message <- glue("Parse error at position {stream$position}. Not understood: {stream$current_value()}")
      stop(message)
    }
  }
  
  obj
}

obj <- parse_obj()

The 3d object now exists as a list of data.frames (one for vertices and one for faces)

obj

## $v
##   [,1] [,2] [,3]
## v    1    0    0
## v    0   -1    0
## v   -1    0    0
## v    0    1    0
## v    0    0    1
## v    0    0   -1
## 
## $f
##   [,1] [,2] [,3]
## f    2    1    5
## f    3    2    5
## f    4    3    5
## f    1    4    5
## f    1    2    6
## f    2    3    6
## f    3    4    6
## f    4    1    6

Post processing the data: Fortify/denormalise/tidy.

The text representation of the obj data is quite compact and avoids repetition but this isn’t quite in the right form for us to manipulate in R.

The following code turns this data into the faces data.frame which is slightly more useful as each face has an actual ID, and the x, y and z co-ordinates of its 3 vertices (a, b, c) are explicitly listed on each row i.e. we’ve created a tidy data.frame !

#-----------------------------------------------------------------------------
# Fortify/denormalise/tidy the `f` and `v` data into `faces`
#-----------------------------------------------------------------------------
create_faces <- function(obj) {
  suppressWarnings({
    faces <- data.frame(obj$f) %>%
      set_names(c('a', 'b', 'c')) %>%
      mutate(face_id = seq(n())) %>%
      gather(idx, vert_id, -face_id) %>%
      arrange(face_id, idx) %>%
      as.tbl()
    
    verts <- data.frame(obj$v) %>%
      set_names(c('x', 'y', 'z')) %>%
      mutate(vert_id = seq(n())) %>%
      as.tbl()
  })
  
  faces %<>% left_join(verts, by='vert_id')
  
  faces
}

faces <- create_faces(obj)

faces %>% knitr::kable(caption='tidy faces data.structure')

Table 1: tidy faces data.structure
face_id	idx	vert_id	x	y	z
1	a	2	0	-1	0
1	b	1	1	0	0
1	c	5	0	0	1
2	a	3	-1	0	0
2	b	2	0	-1	0
2	c	5	0	0	1
3	a	4	0	1	0
3	b	3	-1	0	0
3	c	5	0	0	1
4	a	1	1	0	0
4	b	4	0	1	0
4	c	5	0	0	1
5	a	1	1	0	0
5	b	2	0	-1	0
5	c	6	0	0	-1
6	a	2	0	-1	0
6	b	3	-1	0	0
6	c	6	0	0	-1
7	a	3	-1	0	0
7	b	4	0	1	0
7	c	6	0	0	-1
8	a	4	0	1	0
8	b	1	1	0	0
8	c	6	0	0	-1

Let’s view the object!

Use your mouse to rotate and zoom the object.

view3d(theta = 10, phi=15)
rgl::triangles3d(faces$x, faces$y, faces$z, col='grey')

You must enable Javascript to view this page properly.

Bunny!

The exact same code was used to parse a much more interesting object: bunny.obj.

Use your mouse to rotate and zoom the object.

obj_file <- '../../data/obj/bunny.obj'
obj_text <- readLines(obj_file) %>% paste(collapse="\n")
tokens   <- lex(obj_text, obj_patterns)
tokens   <- tokens[!(names(tokens) %in% c('whitespace', 'newline', 'comment'))]
stream   <- TokenStream$new(tokens)
obj      <- parse_obj()
faces    <- create_faces(obj)

You must enable Javascript to view this page properly.

Conclusion

In this post, I used minilexer to turn a 3d object in obj format into R data.

This parsing code is a very simple recursive descent parser where the parsing of the whole text (parse_obj()) is broken into calls to another function to parse smaller parts of the text (parse_g(), parse_v(), parse_vn(), parse_f()).

Possible extensions:

parse the materials file that sometimes accompanies an obj file
use the g tag to create sub-groups of faces within the obj file (this will be needed to assign per-group materials)
calculate vertex normals if none supplied - this will result in smoother objects.
Render the object. Ray tracer in rstats? Path tracer? Use Rcpp?
Modify the object
- simplification
- subdivision

coolbutuseless

About

Github

License

Twitter

Recent Posts

r64 package - a c64/6502 assembler

Generating Executable ASCII art

Changing the default arguments to a function

Solving the 8 queens problem

Finding a length n needle in a haystack

Parsing data part 4: Parsing the Easter Bunny with minilexer

Example parser: `obj` format for 3d objects

Use `lex()` to turn the text into tokens

Tokenising the `obj`

Use a `TokenStream` to help turn the tokens into data

Write a function to parse the lines which start with `f`

Write function to parse the lines which start with `g`, `v` and `vn`

Write a top-level function containing a parse loop to keep extracting objects until we’re done

Post processing the data: Fortify/denormalise/tidy.

Let’s view the object!

Bunny!

Conclusion

cool but useless

Recent Posts

r64 package - a c64/6502 assembler

Generating Executable ASCII art

Changing the default arguments to a function

Solving the 8 queens problem

Finding a length n needle in a haystack

Categories

About

face_id	idx	vert_id	x	y	z
1	a	2	0	-1	0
1	b	1	1	0	0
1	c	5	0	0	1
2	a	3	-1	0	0
2	b	2	0	-1	0
2	c	5	0	0	1
3	a	4	0	1	0
3	b	3	-1	0	0
3	c	5	0	0	1
4	a	1	1	0	0
4	b	4	0	1	0
4	c	5	0	0	1
5	a	1	1	0	0
5	b	2	0	-1	0
5	c	6	0	0	-1
6	a	2	0	-1	0
6	b	3	-1	0	0
6	c	6	0	0	-1
7	a	3	-1	0	0
7	b	4	0	1	0
7	c	6	0	0	-1
8	a	4	0	1	0
8	b	1	1	0	0
8	c	6	0	0	-1

face_id	idx	vert_id	x	y	z
1	a	2	0	-1	0
1	b	1	1	0	0
1	c	5	0	0	1
2	a	3	-1	0	0
2	b	2	0	-1	0
2	c	5	0	0	1
3	a	4	0	1	0
3	b	3	-1	0	0
3	c	5	0	0	1
4	a	1	1	0	0
4	b	4	0	1	0
4	c	5	0	0	1
5	a	1	1	0	0
5	b	2	0	-1	0
5	c	6	0	0	-1
6	a	2	0	-1	0
6	b	3	-1	0	0
6	c	6	0	0	-1
7	a	3	-1	0	0
7	b	4	0	1	0
7	c	6	0	0	-1
8	a	4	0	1	0
8	b	1	1	0	0
8	c	6	0	0	-1

About

Github

License

Twitter

Example parser: obj format for 3d objects

Use lex() to turn the text into tokens

Tokenising the obj

Use a TokenStream to help turn the tokens into data

Write a function to parse the lines which start with f

Write function to parse the lines which start with g, v and vn

Write a top-level function containing a parse loop to keep extracting objects until we’re done

Post processing the data: Fortify/denormalise/tidy.

Let’s view the object!

Bunny!

Conclusion

About

Example parser: `obj` format for 3d objects

Use `lex()` to turn the text into tokens

Tokenising the `obj`

Use a `TokenStream` to help turn the tokens into data

Write a function to parse the lines which start with `f`

Write function to parse the lines which start with `g`, `v` and `vn`

face_id	idx	vert_id	x	y	z
1	a	2	0	-1	0
1	b	1	1	0	0
1	c	5	0	0	1
2	a	3	-1	0	0
2	b	2	0	-1	0
2	c	5	0	0	1
3	a	4	0	1	0
3	b	3	-1	0	0
3	c	5	0	0	1
4	a	1	1	0	0
4	b	4	0	1	0
4	c	5	0	0	1
5	a	1	1	0	0
5	b	2	0	-1	0
5	c	6	0	0	-1
6	a	2	0	-1	0
6	b	3	-1	0	0
6	c	6	0	0	-1
7	a	3	-1	0	0
7	b	4	0	1	0
7	c	6	0	0	-1
8	a	4	0	1	0
8	b	1	1	0	0
8	c	6	0	0	-1