Recently, I needed to do some statistical tests on data from within a Drupal Module. Having found no PHP libraries suitable for the task, I decided to try and run code in R from within PHP. This post helped out a great deal.

As far as I can tell, the only way to do this is to write the R output to a temporary file. So to start off, I create a new temp file with:

  1. $path = tempnam(‘/tmp’, ‘tmp’);

This function, in case you’ve never seen it, creates a file with a random name in the directory specified by the first argument, with the extension specified by the second argument.

Next, I loaded up some R. Here’s what I did:

  1. $rcode = <<<EOD
  2. library(RMySQL)
  3.  
  4. con <- dbConnect(dbDriver(‘MySQL’), dbname = ‘sg_drupal’,
  5. username=‘xxxx’, password=‘xxxxxx’,host=‘localhost’)
  6.  
  7. data <- dbGetQuery(con,statement=‘$sql’)
  8.  
  9. x = data[,2][data[‘x’]==$a]
  10. y = data[,2][data[‘x’]==$b]
  11.  
  12. x = as.numeric(x)
  13. y = as.numeric(y)
  14.  
  15. ttest = t.test(x,y)
  16. library(rjson)
  17. ttest = toJSON(ttest)
  18.  
  19. cat(ttest, file=‘$path’)
  20.  
  21. EOD;

First, you connect to your database in R. This is a lot easier than trying to pass in a result set – you could use JSON to do this if you really needed to. But R runs MySQL queries just fine, and puts them into neatly accessible vectors. Don’t forget to load the RMySQL library before doing your queries. I believe this comes with the standard package of R.

One word of warning – be careful of using the $ sign in your R code string. There may be a way to escape it, but I couldn’t (quickly) find a way to use this character without PHP trying to parse it as a variable. Fortunately, R has substitute ways to access vectors of a dataset.

Here I’d like to point out that R has a unique functionality that makes it better than PHP for manipulating and traversing raw data. Without having to write a loop, R has functions that can create new vectors out of the matching values of existing vectors, like this:

  1. x = data[,2][data[‘x’]==$a]

This command means “assign the second vector of dataset ‘data’ where the ‘x’ vector equals $a, to variable x.”

After loading the rjson library, you can turn your output, in this case a T-test, into JSON notation. JSON is a special format that makes data readable by PHP and Javascript as an object or array. Finally,

  1. cat(ttest, file=‘$path’)

writes the JSON output to the temp file we created. But none of this has been executed yet. So let’s say:

  1. exec("echo \"$rcode\" | /usr/bin/R –vanilla");
  2.                
  3.                 $back = file_get_contents($path);
  4.                 $back = json_decode($back, true);

First, this runs the R code as a UNIX command on your server. Then, it reads the contents of the temp file and JSON decodes them. The second “true” argument of json_decode returns the data as an array rather than an object. This is important because R will often name its properties with spaces and other illegal characters for object properties in PHP. There is probably a way to escape or change these characters, but I couldn’t find it. The array will work just as well as the object though.

I thought this was pretty cool. Using this method, you could run statistics on anything in your database and display it on a web page. This seems like a clever way to display usage statistics to site admins, and actually give them an idea of whether the results are statistically significant.

3 Comments

    Can you tell us something about the performance of working in this manner with R? Doing an exec() starts an R process and loads the startup libraries. Is it fast?

  • Interested in knowing why you didn’t use the database as the location for handoff between R and Drupal versus the file system. I’m not familiar with Drupal and brand new to R so maybe it’s an obvious answer. Anyway my uneducated reaction would be that the file system approach would likely be less performant but it also would prohibit your R resultset data from being joined and recombined in interesting ways via SQL.

  • I would probably do things differently if I did this again today.

    @valugi You are right, this method was rather slow because it was constantly re-loading the whole R environment. I would look into using RApache instead (http://rapache.net/) because it provides an R interpreter to the web server. I’ve never used this.

    @ken I guess it depends what you want to do with the data that R spits out in the long run; in my case I just wanted to show it to the user once and I didn’t care about caching or anything. Using a database as the handoff is a good idea if you want to preserve the data but you will have to launch a separate query in PHP to get it. The fastest handoff can probably be achieved by constructing and exec()ing a shell script that runs R and returns data as a JSON string which then will be picked up by PHP as the return of exec().

Leave a Reply