This blog has relocated to https://coolbutuseless.github.io and associated packages are now hosted at https://github.com/coolbutuseless.

29 April 2018

mikefc

Evaluating R code from potentially malicious sources - part 2

As mentioned in Part 1, I’m looking into the idea of running R code which may originate from potentially malicious sources, e.g. code from a web interface, a database, or even a tweet!

My idea after yesterday’s post was to evaluate the code one expression at a time and check the function signature of any calls to ensure nothing had been renamed.

  • For each expression:
  1. Identify the functions (and their signatures) in the supplied code
  2. Check them against a whitelist of functions/signatures that are allowed to run

Note

  • I know this is dangerous and quite probably a fool’s quest.
  • I want to run this in the current process and use the result i.e. no remote sandboxes or docker images.
  • If this is impossible, or there’s an actual way to do this, or a better way of thinking of the problem, I’m keen to hear it!

Ping me on twitter

Why this current idea won’t work.

This idea of trying to catch functions as they are called to ensure they are in the whitelist just won’t work. There are too many ways for a user to corrupt a whitelisted function name within an expression, and they would be way too hard to detect and prevent.

The reason for this is: expressions are tricky.

The following 2 snippets of code are each a single expression, meaning that if I was stepping through the code and only evaluating a single expression at a time, these would constitute a single step.

library(magrittr)  # provides %>%
parse(text = "a <- 1;") %>% length()
[1] 1
parse(text = 'if (isTRUE(mean <- system) || (mean("echo bad stuff"))) {print(a)}') %>% length()
[1] 1

If I check the second expression for function calls I get:

[1] "if"     "||"     "isTRUE" "<-"     "("      "mean"   "{"      "print" 

So the malicious user has successfully rebound the name mean to the system function and called it, all within a single expression, and every function name the check sees looks harmless.
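Here is a minimal sketch of that kind of check in base R. The helper name call_names and the permissive whitelist are made up for illustration; this is not the code that produced the output above, and it does not handle calls with empty arguments such as x[1, ].

```r
# Hypothetical helper: walk the parse tree and collect every name that
# appears in function (call) position.
call_names <- function(expr) {
  if (!is.call(expr)) {
    return(character(0))
  }
  head <- expr[[1]]
  # The thing being called is usually a symbol, but may itself be a call
  this <- if (is.symbol(head)) as.character(head) else call_names(head)
  args <- as.list(expr)[-1]
  c(this, unlist(lapply(args, call_names)))
}

# The kind of single expression discussed above
expr <- parse(
  text = 'if (isTRUE(mean <- system) || (mean("echo bad stuff"))) {print(a)}'
)[[1]]

found <- unique(call_names(expr))
found

# An illustrative whitelist (an assumption, not a recommendation)
whitelist <- c("if", "||", "isTRUE", "<-", "(", "mean", "{", "print")

# Names used as functions that are not in the whitelist
not_allowed <- setdiff(found, whitelist)
length(not_allowed) == 0
```

Every name the check finds is on the whitelist, so a purely name-based check waves this expression through even though mean no longer refers to the function it is named after.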

This is bad, and I can’t see a way to prevent this without re-writing R’s internal evaluation scheme and interrupting each and every function call in order to check it is what it says it is.

Environments to the rescue!

My idea so far was to have a standard R environment (with all base functions in it) and try and limit a user’s access so they can only call functions I have in a whitelist.

I now think that inverting my idea is the solution i.e. have an environment with only the whitelisted functions available and don’t restrict a user’s access in that environment.

How this will work:

  1. Create an empty environment
  2. Copy the whitelisted functions into it
    • Lots of control available here e.g. you could whitelist mean and mean.default but not include mean.difftime so that ordinary numeric means are possible, but not of difftime objects
  3. Do all evaluation within this environment
#-----------------------------------------------------------------------------
# 1. Create the environment with just whitelisted functions
#-----------------------------------------------------------------------------
envir     <- rlang::new_environment()

#-----------------------------------------------------------------------------
# 2. Copy the whitelisted functions into this environment
#-----------------------------------------------------------------------------
whitelist <- c('structure', 'c', 'list', 'mean', '+', '-', '*', '/', 'if', '||', 'isTRUE', '<-', 'mean.default')
whitelist %>% 
  purrr::walk(function(x) {envir[[x]] <- get(x)})

#-----------------------------------------------------------------------------
#' 3. Evaluate code within this environment
#-----------------------------------------------------------------------------
eval(parse(text="mean(c(1, 2, 3))"), envir = envir)
[1] 2
#-----------------------------------------------------------------------------
#' Trying to call a function not in the environment fails
#-----------------------------------------------------------------------------
testthat::expect_error(
  eval(parse(text="system('echo bad stuff')"), envir = envir),
  'could not find function "system"'
)

#-----------------------------------------------------------------------------
#' Trying to use a variable to point to a function not in the environment fails
#-----------------------------------------------------------------------------
testthat::expect_error(
  eval(parse(text="mean <- system; mean('bad stuff happens')"), envir = envir),
  "object 'system' not found"
)


#-----------------------------------------------------------------------------
#' Even the single-expression attack from earlier fails: the 'system' function
#' just isn't in the environment at all, so there's no way to call it!
#-----------------------------------------------------------------------------
testthat::expect_error(
  eval(parse(text = 'if (isTRUE(mean <- system) || (mean("echo bad stuff"))) {print(a)}'), envir = envir),
  "object 'system' not found"
)
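
The steps above could be wrapped into a single helper. This is only a sketch using base R: the name eval_whitelisted is hypothetical, and it uses new.env(parent = emptyenv()) in place of rlang::new_environment() on the assumption that the two behave the same for this purpose.

```r
# Hypothetical helper: evaluate `code` in a fresh environment that contains
# only the functions named in `whitelist`.
eval_whitelisted <- function(code, whitelist) {
  # parent = emptyenv() so that name lookups started from this environment
  # cannot fall through to base R
  envir <- new.env(parent = emptyenv())
  for (name in whitelist) {
    assign(name, get(name, envir = baseenv()), envir = envir)
  }
  eval(parse(text = code), envir = envir)
}

# A small illustrative whitelist (an assumption, not a recommendation)
whitelist <- c("c", "mean", "mean.default", "+", "-", "*", "/", "<-")

ok <- eval_whitelisted("mean(c(1, 2, 3))", whitelist)
ok

# Code that reaches for anything outside the whitelist raises an error;
# capture the message rather than stopping the script
msg <- tryCatch(
  eval_whitelisted("mean <- system; mean('echo bad stuff')", whitelist),
  error = function(e) conditionMessage(e)
)
msg
```

Building a new environment on every call means assignments made by one piece of untrusted code (such as a redefined mean) cannot leak into the next evaluation.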

Summary

It appears that using eval() with a carefully crafted environment consisting of only whitelisted functions should be my way forward in evaluating R code from possibly malicious sources.

The next question is: What functions are safe to add to the whitelist?