Recently, I needed to do some statistical tests on data from within a Drupal Module. Having found no PHP libraries suitable for the task, I decided to try and run code in R from within PHP. This post helped out a great deal.
As far as I can tell, the only way to do this is to write the R output to a temporary file. So to start off, I create a new temp file with:
This function, in case you’ve never seen it, creates a file with a random name in the directory specified by the first argument, with the extension specified by the second argument.
Next, I loaded up some R. Here’s what I did:
-
$rcode = <<<EOD
-
library(RMySQL)
-
-
con <- dbConnect(dbDriver(‘MySQL’), dbname = ‘sg_drupal’,
-
username=‘xxxx’, password=‘xxxxxx’,host=‘localhost’)
-
-
data <- dbGetQuery(con,statement=‘$sql’)
-
-
x = data[,2][data[‘x’]==$a]
-
y = data[,2][data[‘x’]==$b]
-
-
x = as.numeric(x)
-
y = as.numeric(y)
-
-
ttest = t.test(x,y)
-
library(rjson)
-
ttest = toJSON(ttest)
-
-
-
EOD;
First, you connect to your database in R. This is a lot easier than trying to pass in a result set – you could use JSON to do this if you really needed to. But R runs MySQL queries just fine, and puts them into neatly accessible vectors. Don’t forget to load the RMySQL library before doing your queries. I believe this comes with the standard package of R.
One word of warning – be careful of using the $ sign in your R code string. There may be a way to escape it, but I couldn’t (quickly) find a way to use this character without PHP trying to parse it as a variable. Fortunately, R has substitute ways to access vectors of a dataset.
Here I’d like to point out that R has a unique functionality that makes it better than PHP for manipulating and traversing raw data. Without having to write a loop, R has functions that can create new vectors out of the matching values of existing vectors, like this:
-
x = data[,2][data[‘x’]==$a]
This command means “assign the second vector of dataset ‘data’ where the ‘x’ vector equals $a, to variable x.”
After loading the rjson library, you can turn your output, in this case a T-test, into JSON notation. JSON is a special format that makes data readable by PHP and Javascript as an object or array. Finally,
writes the JSON output to the temp file we created. But none of this has been executed yet. So let’s say:
-
-
$back = json_decode($back, true);
First, this runs the R code as a UNIX command on your server. Then, it reads the contents of the temp file and JSON decodes them. The second “true” argument of json_decode returns the data as an array rather than an object. This is important because R will often name its properties with spaces and other illegal characters for object properties in PHP. There is probably a way to escape or change these characters, but I couldn’t find it. The array will work just as well as the object though.
I thought this was pretty cool. Using this method, you could run statistics on anything in your database and display it on a web page. This seems like a clever way to display usage statistics to site admins, and actually give them an idea of whether the results are statistically significant.
Can you tell us something about the performance of working in this manner with R? Doing an exec() starts an R process and loads the startup libraries. Is it fast?
Interested in knowing why you didn’t use the database as the location for handoff between R and Drupal versus the file system. I’m not familiar with Drupal and brand new to R so maybe it’s an obvious answer. Anyway my uneducated reaction would be that the file system approach would likely be less performant but it also would prohibit your R resultset data from being joined and recombined in interesting ways via SQL.
I would probably do things differently if I did this again today.
@valugi You are right, this method was rather slow because it was constantly re-loading the whole R environment. I would look into using RApache instead (http://rapache.net/) because it provides an R interpreter to the web server. I’ve never used this.
@ken I guess it depends what you want to do with the data that R spits out in the long run; in my case I just wanted to show it to the user once and I didn’t care about caching or anything. Using a database as the handoff is a good idea if you want to preserve the data but you will have to launch a separate query in PHP to get it. The fastest handoff can probably be achieved by constructing and exec()ing a shell script that runs R and returns data as a JSON string which then will be picked up by PHP as the return of exec().