Friday, August 28, 2009

Dropbox

I have been trying to convince my collaborators to use git so that we all work on the same page; however, I have accepted the fact that not everyone wants to get that involved. For these collaborations, I have started relying on a free web-based service called Dropbox. The free version gives you a 2GB account and allows you to securely share files with others without the hassle of setting up accounts on an ftp server. There is even a client for Linux (thank you!).

On Linux, after downloading and installing the program, you just have to start the daemon:
$ dropbox start -i
During the set-up process, a folder is created in your home directory called "Dropbox" and everything you put in there is automatically synced with all the other computers you have linked to the account. You can explicitly share folders with others, or place files in the "Public" subdirectory and then share the associated url.

I have started using this service as a virtual flash drive and as an alternative to emailing enormous files.

git cheat sheet

From what I have seen, the best description of what Git is and how it works operationally was written by Charles Duan. Given that Charles has done such a great job, I am just going to summarize a few key commands here (also based on this tutorial).

A) Create a local git repo out of your current project directory.

$ cd ~/myproj
$ git init
$ git add .
$ git commit -m "first commit"

B) Create an empty remote git repo and push current project repo there.

$ ssh my.remote.server

> mkdir gitroot
> mkdir gitroot/myproj
> cd gitroot/myproj
> git --bare init
> exit

$ cd ~/myproj
$ git remote add origin ssh://my.remote.server/home/username/gitroot/myproj
$ git push origin master

C) Check out a remote git repo and send updates back.
$ git clone ssh://myserver.com/home/username/gitroot/myproj

$ git push
D) Create and switch among branches
$ git branch [new-head-name] [reference-to-branching-point]
For example:
$ git branch testing HEAD
You switch among branches simply by checking them out from your own repo:
$ git checkout testing
To share a local branch with the remote repo, push it by name:
$ git push origin local-branch-name
E) Merge branches. Again, look here for a detailed description of how this works in practice.
$ git checkout master
$ git merge testing
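
To see steps A through E run end to end without a server, here is a minimal sketch that uses a bare repo on the local disk in place of my.remote.server. The directory names (demo, hub.git), the file myscript.R, and the demo identity are all made up for illustration, and the script asks git for the name of the default branch rather than assuming it is master.

```shell
#!/bin/sh
# Sketch: exercise steps A-E against a throwaway bare repo on the local disk.
# All paths and names (demo, hub.git, myscript.R, testing) are hypothetical.
demo=$(mktemp -d)
cd "$demo" || exit 1

# Identity for the demo commits only; normally this lives in your git config.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# A) Turn a project directory into a repo.
mkdir myproj && cd myproj || exit 1
git init -q
echo "x <- 1" > myscript.R
git add .
git commit -q -m "first commit"
main=$(git symbolic-ref --short HEAD)   # default branch name, whatever git chose

# B) Create an empty bare repo (standing in for the server) and push to it.
git init -q --bare "$demo/hub.git"
git remote add origin "$demo/hub.git"
git push -q origin "$main"

# D) Branch, switch, and commit on the branch.
git branch testing HEAD
git checkout -q testing
echo "x <- 2" > myscript.R
git commit -q -a -m "try a new value"

# E) Merge the branch back and publish the result.
git checkout -q "$main"
git merge -q testing
git push -q origin "$main"

git log --oneline
```

Swapping "$demo/hub.git" for an ssh:// URL like the one in step B should give the same workflow against a real server.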
F) Convert svn repos to git. There are other ways to do this if you want to preserve the tree of your svn repo; however, if all you want is to strip out the .svn folders:
$ cd ~/mysvnproj
$ svn export . ../myproj
The problem is that only files under version control are exported. If you just want to remove svn version control from your current directory (i.e., recursively strip out the .svn folders), then use:
$ find . -name ".svn" -type d -prune -exec rm -rf {} \;
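
If you would rather try the recursive delete on something disposable first, here is a small sketch that builds a scratch tree with fake .svn folders and strips them. The directory and file names are invented for the demo; the find options used (-prune, -exec ... {} +) are standard POSIX.

```shell
#!/bin/sh
# Sketch: strip .svn folders from a scratch tree (all names are made up).
work=$(mktemp -d)
mkdir -p "$work/proj/.svn" "$work/proj/sub dir/.svn"
echo "data" > "$work/proj/sub dir/notes.txt"
echo "meta" > "$work/proj/.svn/entries"

# -prune stops find from descending into a .svn folder it is about to delete,
# and "{} +" hands all matches to a single rm call.
find "$work/proj" -type d -name .svn -prune -exec rm -rf {} +

# Count what is left; the regular files should be untouched.
remaining=$(find "$work/proj" -name .svn | wc -l)
echo "remaining .svn folders: $remaining"
```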

Wednesday, August 19, 2009

SVN gets got by git

My Subversion usage finally stabilized, but I realized that my major problem was that commits were relatively expensive (i.e., in order for me to make any commit I had to open an ssh connection to my remote server and send in my updates). Generally this is no big deal, but it made me less likely to commit on a regular basis -- especially when at home or travelling -- so I was falling back into the bad habit of making new copies of code with increasingly obscure names (myscript.R -> myscript2.R -> myscript2-working.R, etc.). I guess I could develop my own personal file nomenclature system, but frankly, that is what version control software is for.

Earlier this year I stumbled upon an article by Cory Doctorow in which he describes how he uses a Python script (Flashbake) to automate the git version control system. This led me to a video of Linus Torvalds preaching the Gospel of Git at the Googleplex. The chief architect of the Linux kernel is known to be fairly, um... opinionated, so I was prepared to take his comments well salted; however, I ultimately found his arguments for git compelling and decided to give git a whirl.

Git is many things to many people, but my reasons to boot svn to the curb are that I want to:

1) Make local commits often.
2) Branch at will.
3) Switch among branches easily.
4) Merge branches easily.
5) Share repos among collaborators.

For my workflow, git does all of these things better than svn. While the ability to perform local commits was the killer feature for me, I have come to appreciate branching and merging (something I generally avoided with svn due to the often disastrous outcomes). So there it is. I like Git (and not just because it makes me feel cool!).

A friend of mine just mentioned that he likes bazaar version control. It looks very similar to git (and may have some advantages for certain situations -- especially if Windows users are loath to install cygwin); however, I am already committed (for at least a few months).

Monday, July 20, 2009

When subversion does not add up.

I have been using Subversion to manage all of my active projects -- it is a fantastic tool but I ran into a bit of a problem recently when I accidentally added several hundred log files as I was preparing for a commit (using the automated method I describe here -- so caveat emptor). Most suggestions I found on the internet were some variation of:
$ svn --recursive revert .
The problem with this solution is that all uncommitted changes to existing files (i.e., modified files that were already present in the repository) would be lost. Ouch! I discovered this solution; however, it did not work for me because I had already made the mistake of deleting the offending files. Finally, I modified the suggested command and all is well:
$ svn st | grep log > badfile.txt
$ sed 's/! /.\//' badfile.txt > newfile.txt
$ xargs svn revert < newfile.txt

Saturday, December 13, 2008

R matrices in C functions

Using the .C() function in R, you can only pass vectors. Since R stores matrices columnwise as vectors anyhow, they can be passed to your C function as vectors (along with the number of rows in the matrix) and then accessed in familiar [row,col] manner using the following C functions (idea from here). For example, in a matrix with 3 rows, element [2,2] sits at base-zero vector position (2-1) + (2-1)*3 = 4:
int Cmatrix(int *row, int *col, int *nrows){
    /* Converts row-col notation into base-zero vector notation, based on a column-wise conversion */
    int vector_loc;
    vector_loc = (*row) - 1 + ((*col) - 1) * (*nrows);
    return(vector_loc);
}

void Rmatrix(int *vector_loc, int *row, int *col, int *nrows){
    /* Converts vector notation into row-col notation */
    /* vector_loc is the vector address of the matrix (base zero) */
    /* row and col are pointers to the row and col variables that will be updated */
    /* integer division already rounds down here, so floor() is not needed */
    *col = (*vector_loc / *nrows) + 1;
    *row = *vector_loc - ((*col) - 1) * (*nrows) + 1;
}

Monday, November 17, 2008

Call C from R and R from C

Several years ago, while a research associate at the University of Chicago, I had the privilege of sitting in on a course taught by Peter Rossi: Bayesian Applications in Marketing and MicroEconometrics. This course -- one I recommend to anyone at U Chicago who is interested in statistics -- was an incredibly clear treatment of Bayesian statistics, but the aspect I appreciated most was Peter's careful demonstration of Bayesian theory and methods using R.

One feature of R that I had not made use of up until that point was the ability to call compiled C and Fortran functions from within R (this makes loop-heavy Metropolis-Hastings samplers much, much faster). It turns out that you can also include the R libraries in C source code so that R functions (e.g., random number generators) can be easily accessed. The R-Cran website has an excellent tutorial on how to develop R extensions (here), but I wanted to share an example Peter used in class because it is extremely brief, and for 95% of what I do, this is all I need.

As Peter writes, this is an incredibly inefficient way of simulating from the chisquare distribution, but it demonstrates the point. His more extensive writeup is located here.

Save the following as testfun.c:
#include <R.h>
#include <Rmath.h>
#include <math.h>
/* Function written by Peter Rossi from the University of Chicago GSB */
/*http://faculty.chicagogsb.edu/peter.rossi/teaching/37904/Adding%20Functions%20Written%20in%20C%20to%20R.pdf */

/* include standard C math library and R internal function declarations */
void mychisq(double *vec, double *chisq, int *nu)
/* void means return nothing */
{
    int i, iter; /* declare local vars */
    /* all statements end in ; */
    GetRNGstate(); /* set random number seed */
    for (i = 0; i < *nu; ++i)
    /* loop over elements of vec */
    /* *nu "dereferences" the pointer */
    { /* vectors start at 0 location! */
        vec[i] = rnorm(0.0, 1.0); /* use R function to draw normals */
        Rprintf("%ith normal draw= %lf \n", (i+1), vec[i]);
        /* print out results for "debugging" */
    }
    *chisq = 0.0;
    iter = 0;
    while (iter < *nu) /* "while" version of a loop */
    {
        if (iter == iter)
        {*chisq = *chisq + vec[iter]*vec[iter];}
        /* redundant if stmnt */
        iter = iter + 1; /* note: can't use ** */
        /* if you want to be "cool" use iter += 1 */
    }
    PutRNGstate(); /* write back ran number seed */
}



To call this function in R, you first need to compile it. To do this you need all the standard compilers and libraries for your operating system. For Debian or Ubuntu, this should do it (if I missed a package, let me know in the comments):

$ sudo aptitude update
$ sudo aptitude install build-essential r-base-dev

Now, you should be able to compile the function:

$ R CMD SHLIB testfun.c

If all goes well, you should see the files testfun.o and testfun.so in the directory. To test the function we will source the following R script into R:

call_mychisq<-function(nu){
##This function is just a wrapper for .C
vector=double(nu); chisq=1
.C("mychisq",as.double(vector),res=as.double(chisq),as.integer(nu))$res
}

##Load the compiled code (you may need to include
## the explicit file path if it is not local).
## NOTE: for Windows machines, you will want to load "testfun.dll"

dyn.load("testfun.so")
result<-call_mychisq(10)

##If you re-compile testfun.c, you must unload it
## and then re-load it:
## dyn.unload("testfun.so")
## dyn.load("testfun.so")

And get the following output:

> dyn.load("testfun.so")
> result<-call_mychisq(10)
1th normal draw= -1.031170
2th normal draw= -1.214103
3th normal draw= 0.002335
4th normal draw= 0.296146
5th normal draw= -0.908862
6th normal draw= -1.567820
7th normal draw= -0.079227
8th normal draw= -1.404414
9th normal draw= 0.616567
10th normal draw= -0.007855
> result
[1] 8.268028

Wednesday, November 5, 2008

Using subversion to manage code

I have finally come to terms with the fact that I need some kind of version control for the projects I am working on and the best bet these days is Subversion (svn). I have been using svn for some time now via a GUI client (Linux: kdesvn, Windows: tortoisesvn); however, it turns out that working with svn from the command line is pretty easy and far more versatile. For a complete treatment of this subject, check out the online documentation here and an excellent cheat sheet here. What follows is a quick primer on the very basics of setting up and managing your svn repositories on a local machine or server.

For ease of use I will describe the creation of a single repository for a single project. This means a little more overhead; however, it makes the repository more portable and flexible in the long run.

1) First we set up the repository structure in a temporary folder (either on the server or locally):
$ mkdir ~/tmp
$ mkdir ~/tmp/project1
$ mkdir ~/tmp/project1/trunk
$ mkdir ~/tmp/project1/branches
$ mkdir ~/tmp/project1/tags

2) Now, make a folder to hold your repositories and create an empty repository for your project.
$ mkdir ~/svnroot
$ svnadmin create ~/svnroot/project1

3) Import the folder structure into the empty repository. After this import, the folders in tmp can be removed -- they are only there to make the creation of the folder structure easier.
$ svn import ~/tmp/project1 file:///home/myusername/svnroot/project1 -m "Initial import"

4) Finally, make your current project folder a "working copy" of the repository. Check out the trunk (or head) of the repository to the folder where project1 currently resides (in this example, the existing project files are located at ~/working/project1).
If you created your repository on a local folder:
$ svn checkout file:///home/myusername/svnroot/project1/trunk ~/working/project1

Alternatively, if you created your repository on a remote server:
$ svn checkout svn+ssh://remote.server.name/home/myusername/svnroot/project1/trunk ~/working/project1

Because the repository is empty at this stage, all the above commands do is create a .svn folder in the ~/working/project1 directory. The following command will show that there are folders and files in the project directory that are not currently part of the repository:
$ svn st
? somefolder
? someotherfolder
? somefile.txt

Now you need to add all of the files and folders in this directory to the repository. This is easily accomplished using a bit of awk code (modified from a post here):

$ svn status | grep "^\?" | awk -F "      " '{print $2}' | tr "\n" "\0" | xargs -0 svn add
$ svn st
A somefolder
A somefolder/file1.txt
A somefolder/file2.txt
A someotherfolder
A someotherfolder/file3.txt
A somefile.txt

Now you just need to commit these changes and your working directory is up to date:
$ svn commit -m "Adding original files to repository."


5) When you are ready to commit new changes to the repository, make sure that all new files/folders are added and all deleted files/folders are removed:
$ svn status | grep "^\?" | awk -F "      " '{print $2}' | tr "\n" "\0" | xargs -0 svn add
$ svn status | grep "^\!" | awk -F " " '{print $2}' | tr "\n" "\0" | xargs -0 svn remove
$ svn commit -m "Some comment to remind you why you are committing changes..."
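
One fragile spot in the pipelines in step 5 is that the awk field separator has to match the exact padding that svn status prints. As a hedged alternative, here is a sketch that strips the status flag and whatever whitespace follows it, so file names with spaces survive. The helper name paths_with_flag and the sample listing are made up, and the sketch assumes only the first status column is filled in (which is what I see for "?" and "!" entries).

```shell
#!/bin/sh
# Sketch: pull paths out of `svn status` output without splitting on spaces.
# Assumes each line is a one-character status flag, padding, then the path.

# Print the paths carrying the status flag given as $1 ("?" or "!").
paths_with_flag() {
    sed -n "s/^[$1][[:space:]]*//p"
}

# Made-up sample standing in for real `svn status` output.
sample='?       new file.txt
?       somefolder
!       old.log
M       changed.R'

to_add=$(printf '%s\n' "$sample" | paths_with_flag '?')
to_remove=$(printf '%s\n' "$sample" | paths_with_flag '!')
printf 'add:\n%s\nremove:\n%s\n' "$to_add" "$to_remove"
```

In a real working copy the helper would slot into the same pipeline as before, along the lines of: svn status | paths_with_flag '?' | tr "\n" "\0" | xargs -0 svn add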


6) Finally, if for some reason you want to remove a working directory from versioning (i.e., get rid of the .svn folders that are placed in every folder and subfolder), use the following:
$ cd ~/working/project1
$ rm -rf $(find . -name .svn)


Update: If your svn repository changes to a new server name, use the following syntax to update your working directory:
$ cd ~/working/project1
$ svn info
[the current URL and other info are printed to the screen]
$ svn switch --relocate svn+ssh://OLD.URL/path/to/svnrepo svn+ssh://NEW.URL/path/to/svnrepo .
$ svn commit -m "new update"

Note: If you mis-type the old URL, "svn switch" will fail silently. So make sure to check that it has updated by using the "svn info" command.
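
Since the failure is silent, the check can be scripted. This is only a sketch: the helper name repo_url is invented, and the svn info excerpt is a made-up sample reusing the placeholder NEW.URL from above. It relies on svn info printing a line that starts with "URL: ".

```shell
#!/bin/sh
# Sketch: confirm that a working copy points at the new server.

# Print the repository URL found in `svn info` output read from stdin.
repo_url() {
    sed -n 's/^URL: //p'
}

# Made-up excerpt standing in for `svn info` output after a relocate.
info='Path: .
URL: svn+ssh://NEW.URL/path/to/svnrepo/trunk
Repository Root: svn+ssh://NEW.URL/path/to/svnrepo'

expected_prefix='svn+ssh://NEW.URL/'
url=$(printf '%s\n' "$info" | repo_url)

# Compare the start of the URL with the server we meant to switch to.
case "$url" in
    "$expected_prefix"*) status="relocated" ;;
    *)                   status="still pointing at the old server" ;;
esac
echo "$status: $url"
```

Inside a real working copy, the sample would be replaced by piping the live output: svn info | repo_url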