## Saturday, December 13, 2008

### R matrices in C functions

Using the .C() function in R, you can only pass vectors. Since R stores matrices column-wise as vectors anyhow, a matrix can be passed to your C function as a vector (along with the number of rows in the matrix) and then accessed in the familiar [row,col] manner using the following C functions (idea from here):

int Cmatrix(int *row, int *col, int *nrows){
    /* Converts row-col notation into base-zero vector notation,
       based on a column-wise conversion */
    int vector_loc;
    vector_loc = (*row) - 1 + ((*col) - 1) * (*nrows);
    return(vector_loc);
}

void Rmatrix(int *vector_loc, int *row, int *col, int *nrows){
    /* Converts base-zero vector notation into row-col notation */
    /* vector_loc is the vector address of the matrix (base zero) */
    /* row and col are pointers to the row and col variables that will be updated */
    *col = (*vector_loc) / (*nrows) + 1;  /* integer division truncates, so floor() is not needed */
    *row = (*vector_loc) - ((*col) - 1) * (*nrows) + 1;
}

## Monday, November 17, 2008

### Call C from R and R from C

Several years ago, while a research associate at the University of Chicago, I had the privilege of sitting in on a course taught by Peter Rossi: Bayesian Applications in Marketing and MicroEconometrics. This course -- one I recommend to anyone at U Chicago who is interested in statistics -- was an incredibly clear treatment of Bayesian statistics, but the aspect I appreciated most was Peter's careful demonstration of Bayesian theory and methods using R.

One feature of R that I had not made use of up until that point was the ability to call compiled C and Fortran functions from within R (this makes loop-heavy Metropolis-Hastings samplers much, much faster). It turns out that you can also include the R headers in C source code so that R functions (e.g., random number generators) can be easily accessed. The CRAN website has an excellent tutorial on how to develop R extensions (here), but I wanted to share an example Peter used in class because it is extremely brief, and for 95% of what I do, this is all I need.

As Peter writes, this is an incredibly inefficient way of simulating from the chi-square distribution, but it demonstrates the point. His more extensive writeup is located here.

Save the following as testfun.c:

#include <R.h>
#include <Rmath.h>
#include <math.h>
/* Function written by Peter Rossi from the University of Chicago GSB */
/* include standard C math library and R internal function declarations */

void mychisq(double *vec, double *chisq, int *nu)  /* void means return nothing */
{
    int i, iter;                  /* declare local vars; all statements end in ; */
    GetRNGstate();                /* set random number seed */
    for (i = 0; i < *nu; ++i)     /* loop over elements of vec; *nu "dereferences" the pointer */
    {                             /* vectors start at 0 location! */
        vec[i] = rnorm(0.0, 1.0); /* use R function to draw normals */
        Rprintf("%ith normal draw= %lf \n", (i+1), vec[i]);
                                  /* print out results for "debugging" */
    }
    *chisq = 0.0;
    iter = 0;
    while (iter < *nu)            /* "while" version of a loop */
    {
        if (iter == iter)         /* redundant if stmnt */
        { *chisq = *chisq + vec[iter]*vec[iter]; }
        iter = iter + 1;          /* note: can't use ** */
                                  /* if you want to be "cool" use iter += 1 */
    }
    PutRNGstate();                /* write back ran number seed */
}

To call this function in R, you first need to compile it. To do this you need all the standard compilers and libraries for your operating system. For Debian or Ubuntu, this should do it (if I missed a package, let me know in the comments):

$ sudo aptitude update
$ sudo aptitude install build-essential r-base-dev

Now, you should be able to compile the function:

$ R CMD SHLIB testfun.c

If all goes well, you should see the files testfun.o and testfun.so in the directory. To test the function we will source the following R script into R:

call_mychisq <- function(nu){
  ## This function is just a wrapper for .C
  vector = double(nu); chisq = 1
  .C("mychisq", as.double(vector), res = as.double(chisq), as.integer(nu))$res
}

##Load the compiled code (you may need to include
## the explicit file path if it is not local).
## NOTE: for Windows machines, you will want to load testfun.dll
dyn.load("testfun.so")

result <- call_mychisq(10)

##If you re-compile testfun.c, you must unload and reload it:
## dyn.unload("testfun.so"); dyn.load("testfun.so")

And get the following output:

> result <- call_mychisq(10)
1th normal draw= -1.031170
2th normal draw= -1.214103
3th normal draw= 0.002335
4th normal draw= 0.296146
5th normal draw= -0.908862
6th normal draw= -1.567820
7th normal draw= -0.079227
8th normal draw= -1.404414
9th normal draw= 0.616567
10th normal draw= -0.007855
> result
[1] 8.268028
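Since the C routine just sums nu squared standard normals, its output distribution can be sanity-checked in pure R (a quick sketch of my own, no compiled code involved):

```r
## Pure-R equivalent of the C routine: a chi-square(nu) draw is
## the sum of nu squared standard normals
nu <- 10
draws <- replicate(5000, sum(rnorm(nu)^2))
mean(draws)   ## should be close to nu, i.e. about 10
```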

## Wednesday, November 5, 2008

### Using subversion to manage code

I have finally come to terms with the fact that I need some kind of version control for the projects I am working on and the best bet these days is Subversion (svn). I have been using svn for some time now via a GUI client (Linux: kdesvn, Windows: tortoisesvn); however, it turns out that working with svn from the command line is pretty easy and far more versatile. For a complete treatment of this subject, check out the online documentation here and an excellent cheat sheet here. What follows is a quick primer on the very basics of setting up and managing your svn repositories on a local machine or server.

For ease of use I will describe the creation of a single repository for a single project. This means a little more overhead; however, it makes the repository more portable and flexible in the long run.

1) First we set up the repository structure in a temporary folder (either on the server or locally):
$ mkdir ~/tmp
$ mkdir ~/tmp/project1
$ mkdir ~/tmp/project1/trunk
$ mkdir ~/tmp/project1/branches
$ mkdir ~/tmp/project1/tags

2) Now, make a folder to hold your repositories and create an empty repository for your project:
$ mkdir ~/svnroot
$ svnadmin create ~/svnroot/project1

3) Import the folder structure into the empty repository. After this import, the folders in tmp can be removed -- they are only there to make the creation of the folder structure easier.
$ svn import ~/tmp/project1 file:///home/myusername/svnroot/project1 -m "Initial import"

4) Finally, make your current project folder a "working copy" of the repository. Check out the trunk (or head) of the repository to the folder where project1 currently resides (in this example, the existing project files are located at ~/working/project1).

If you created your repository on a local folder:
$ svn co file:///home/myusername/svnroot/project1/trunk ~/working/project1

Alternatively, if you created your repository on a remote server:
$ svn co svn+ssh://YOUR.SERVER/home/myusername/svnroot/project1/trunk ~/working/project1

Because the repository is empty at this stage, all the above commands do is create a .svn folder in the ~/working/project1 directory. The following command will show that there are folders and files in the project directory that are not currently part of the repository:
$ svn st
? somefolder
? someotherfolder
? somefile.txt

Now you need to add all of the files and folders in this directory to the repository. This is easily accomplished using a bit of awk code (modified from a post here):

$ svn status | grep '^?' | awk '{print $2}' | tr "\n" "\0" | xargs -0 svn add
$ svn st
A somefolder
A somefolder/file1.txt
A somefolder/file2.txt
A someotherfolder
A someotherfolder/file3.txt
A somefile.txt

Now you just need to commit these changes and your working directory is up to date:
$ svn commit -m "Adding original files to repository."

5) When you are ready to commit new changes to the repository, make sure that all new files/folders are added and all deleted files/folders are removed:
$ svn status | grep '^?' | awk '{print $2}' | tr "\n" "\0" | xargs -0 svn add
$ svn status | grep '^!' | awk '{print $2}' | tr "\n" "\0" | xargs -0 svn remove
$ svn commit -m "Some comment to remind you why you are committing changes..."
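The grep/awk filter in the commands above can be sanity-checked on canned svn status output (the file names below are made up for illustration):

```shell
# Simulated `svn status` output: one unversioned (?), one missing (!),
# and one modified (M) item
status='?       newfile.txt
!       deleted.txt
M       changed.txt'

# Unversioned paths -- these would be piped on to `svn add`
echo "$status" | grep '^?' | awk '{print $2}'
# newfile.txt

# Missing paths -- these would be piped on to `svn remove`
echo "$status" | grep '^!' | awk '{print $2}'
# deleted.txt
```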

6) Finally, if for some reason you want to remove a working directory from versioning (i.e., get rid of the .svn folders that are placed in every subfolder), use the following:
$ cd ~/working/project1
$ rm -rf $(find . -name .svn)

Update: If your svn repository changes to a new server name, use the following syntax to update your working directory:
$ cd ~/working/project1
$ svn info
[the current URL and other info are printed to the screen]
$ svn switch --relocate svn+ssh://OLD.URL/path/to/svnrepo svn+ssh://NEW.URL/path/to/svnrepo .
$ svn commit -m "new update"

Note: If you mis-type the old URL, "svn switch" will fail silently. So make sure to check that it has updated by using the "svn info" command.

## Saturday, September 20, 2008

### Installing Ubuntu Linux on an existing LVM partition

I wanted to install Ubuntu on a computer that already housed Fedora Linux. It seems that Fedora defaults to installing the root and swap filesystems on an LVM partition -- something that Ubuntu does not recognize out of the box. After poking about online, here is the process I followed (compiled from here, here, and here):

Warning: As always, any time you are messing about with drive partitions, you risk something tragic happening to your data. Backup critical bits before continuing.

First deal with reducing the existing logical volume:

# Boot Ubuntu from live-CD.

# Install lvm2 and load the device-mapper kernel module:
$ sudo apt-get install lvm2
$ sudo modprobe dm-mod

# Scan your system for LVM volumes and identify in the output the volume group name that has your Fedora volume (mine proved to be VolGroup00):
$ sudo vgscan

# Activate the volume:
$ sudo vgchange -ay VolGroup00

# Find the logical volume that has your Fedora root filesystem (mine proved to be LogVol00):
$ sudo lvs

# Determine how much space is used on the volume:
$ sudo vgdisplay

# Check the filesystem:
$ sudo e2fsck -f /dev/VolGroup00/LogVol00

# Shrink the filesystem to the target size -- it must be no larger than the
# size you give the logical volume in the next step (CRITICAL!):
$ sudo resize2fs -f /dev/VolGroup00/LogVol00 10G

# Reduce the size of the logical volume to match:
$ sudo lvreduce -L10G /dev/VolGroup00/LogVol00

Create new logical volumes for root and /home (in my case swap already existed):

# First check how much free space remains in the volume group:
$ sudo vgdisplay

# Create the UbuntuRoot and home logical volumes:
$ sudo lvcreate -n UbuntuRoot -L10G VolGroup00
$ sudo lvcreate -n home -L18.28G VolGroup00

# Now format the new logical volumes:
$ sudo mkfs -j /dev/VolGroup00/UbuntuRoot -L root
$ sudo mkfs -j /dev/VolGroup00/home -L home

Then run the installer, choosing the appropriate volumes. Note that I had already created a hard partition to serve as /boot -- this is the only way an LVM install will work.

Finally, make sure that the newly installed system has lvm2 included (otherwise it will not start). The following is adapted from here.

# Create a mountpoint for the newly installed system:
$ sudo mkdir /target

# Mount the newly installed system:
$ sudo mount /dev/VolGroup00/UbuntuRoot /target
$ sudo mount /dev/sda6 /target/boot

# Activate those directories and install lvm2
$ sudo chroot /target
$ apt-get update
$ apt-get install lvm2

Now you can restart into the new system!

## Tuesday, May 6, 2008

### Error capture

In a recent post to r-sig-ecology, Mike Colvin suggested the following to capture errors within a loop:

for (i in 1:1000){
  fit <- try(lm(y~x, dataset))
  results <- ifelse(class(fit)=="try-error", NA, fit$coefficients)
}
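As posted, the loop overwrites results on every pass, and ifelse() is vectorized (it returns only as many elements as its condition has), so only part of the last fit survives. A version that keeps every fit might look like this; the simulated dataset is my own, for illustration:

```r
## Simulated data, for illustration only
dataset <- data.frame(x = rnorm(50))
dataset$y <- 2 * dataset$x + rnorm(50)

results <- vector("list", 1000)
for (i in 1:1000){
  fit <- try(lm(y ~ x, dataset), silent = TRUE)
  ## inherits() is the recommended test for try() failures
  results[[i]] <- if (inherits(fit, "try-error")) NA else coef(fit)
}
```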

## Tuesday, March 18, 2008

### Plotting contours

Plenty of packages allow you to plot contours of a "z" value; however, I wanted to be able to plot a specific density contour of a sample from a bivariate distribution over a plot that was a function of the x and y parameters. The example only plots the density of x and y; however, if you set plot.dens=FALSE, the contours will be added to the active display device (i.e., over an image(x,y,function(x,y){...})).

This function also lets you specify the line widths and types for each of the contours.
###########################################
## drawcontour.R

## Written by J.D. Forester, 17 March 2008
###########################################

##This function draws an approximate density contour based on
##empirical, bivariate data.

##change testit to FALSE if sourcing the file
testit=TRUE

draw.contour <- function(a, alpha=0.95, plot.dens=FALSE, line.width=2, line.type=1,
                         limits=NULL, density.res=300, spline.smooth=-1, ...){
  ## a is a list or matrix of x and y coordinates (e.g., a=list("x"=rnorm(100),"y"=rnorm(100)))
  ##   if a is a list or dataframe, the components must be labeled "x" and "y"
  ##   if a is a matrix, the first column is assumed to be x, the second y
  ## alpha is the contour level desired
  ## if plot.dens==TRUE, then the joint density of x and y is plotted,
  ##   otherwise the contour is added to the current plot.
  ## density.res controls the resolution of the density plot

  ## A key assumption of this function is that very little probability mass lies
  ##   outside the limits of the x and y values in "a". This is likely reasonable
  ##   if the number of observations in a is large.

  require(MASS)
  require(ks)

  if(length(line.width)!=length(alpha)){
    line.width <- rep(line.width[1], length(alpha))
  }
  if(length(line.type)!=length(alpha)){
    line.type <- rep(line.type[1], length(alpha))
  }
  if(is.matrix(a)){
    a <- list("x"=a[,1], "y"=a[,2])
  }

  ## generate approximate density values
  if(is.null(limits)){
    limits <- c(range(a$x), range(a$y))
  }
  f1 <- kde2d(a$x, a$y, n=density.res, lims=limits)

  ## plot empirical density
  if(plot.dens) image(f1, ...)

  if(is.null(dev.list())){
    ## ensure that there is a window in which to draw the contour
    plot(a, type="n", xlim=limits[1:2], ylim=limits[3:4], ...)
  }

  ## estimate critical contour value
  ##   (assume that density outside of plot is very small)
  zdens <- rev(sort(f1$z))
  Czdens <- cumsum(zdens)
  Czdens <- Czdens/Czdens[length(zdens)]

  for(cont.level in 1:length(alpha)){
    ## this loop allows for multiple contour levels
    crit.val <- zdens[max(which(Czdens <= alpha[cont.level]))]

    ## determine coordinates of critical contour
    b.full <- contourLines(f1, levels=crit.val)
    for(c in 1:length(b.full)){
      ## this loop is used in case the density is multimodal or if the desired
      ##   contour extends outside the plotting region
      b <- list("x"=as.vector(unlist(b.full[[c]][2])),
                "y"=as.vector(unlist(b.full[[c]][3])))

      ## plot desired contour
      line.dat <- xspline(b, shape=spline.smooth, open=TRUE, draw=FALSE)
      lines(line.dat, lty=line.type[cont.level], lwd=line.width[cont.level])
    }
  }
}

##############################
##Test the function
##############################

##generate data
if(testit){
  n <- 10000
  a <- list("x"=rnorm(n, 400, 100), "y"=rweibull(n, 2, 100))
  draw.contour(a=a, alpha=c(0.95, 0.5, 0.05), line.width=c(2, 1, 2),
               line.type=c(1, 2, 1), plot.dens=TRUE, xlab="X", ylab="Y")
}
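The critical-value step buried in draw.contour can be seen in isolation on a toy grid (the four cell densities below are made-up values):

```r
## On a discrete grid, the alpha-level contour height is the density value
## at which the cells at or above it hold (about) alpha of the total mass
z <- c(4, 3, 2, 1)                    ## toy (unnormalized) cell densities
zdens <- rev(sort(z))
Czdens <- cumsum(zdens) / sum(zdens)  ## 0.4 0.7 0.9 1.0
crit.val <- zdens[max(which(Czdens <= 0.7))]
crit.val                              ## 3: the top two cells hold 70% of the mass
```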

## Monday, February 4, 2008

### Drop unused factor levels

When creating a subset of a dataframe, I often exclude rows based on the level of a factor. However, the "levels" of the factor remain intact. This is the intended behavior of R, but it can cause problems in some cases. I finally discovered how to clean up levels in this post to R-Help. Here is an example:
> a <- factor(letters)
> a
[1] a b c d e f g h i j k l m n o p q r s t u v w x y z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

## Now, even though b only includes five letters,
## all 26 are listed in the levels
> b <- a[1:5]
> b
[1] a b c d e
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

## This behavior can be changed using the following syntax:
> b <- a[1:5,drop = TRUE]
> b
[1] a b c d e
Levels: a b c d e

Another way to deal with this is to use the dropUnusedLevels() command in the Hmisc library. The only issue here is that the behavior is changed globally, which may have undesired consequences (see the post listed above).

****UPDATE****
As Jeff Hollister mentions in the comments, there is another way to do this:

a<-factor(letters)
b<-factor(a[1:5])

Yet another way, if you are working with data frames that by default convert characters into factors, was suggested on r-sig-ecology by Hadley Wickham:

options(stringsAsFactors = FALSE)
a <- data.frame("alpha"=letters)
b <- a[1:5, ]