Thursday, October 22, 2009

ISO week

I am working with a model that produces estimates of snow water equivalent through time. Because I deal with large spatial extents, I decided to have the model produce weekly averages. The problem with this is knowing which file to access for a given date. The snow files are saved using an ISO-8061 week number (I had no idea how many date "standards" there were, but this is a common one). I thought that getting the week number out of a date format in R would be easy, but I was wrong. It turns out to be platform specific and regardless, as far as I could tell, there is no built-in way to calculate the ISO week. Thus the following function.

ISOweek<-function(date,format="%Y-%m-%d",return.val="weekofyear"){
##converts dates into "dayofyear" or "weekofyear", the latter providing the ISO-8601 week
##date should be a vector of class Date or a vector of formatted character strings
##format refers to the date form used if a vector of
## character strings is supplied

##convert date to POSIXt format
if(class(date)[1]%in%c("Date","character")){
date=as.POSIXlt(date,format=format)
}
if(class(date)[1]!="POSIXt"){
print("Date is of wrong format.")
break
}else if(class(date)[2]=="POSIXct"){
date=as.POSIXlt(date)
}


if(return.val=="dayofyear"){
##add 1 because POSIXt is base zero
return(date$yday+1)
}else if(return.val=="weekofyear"){
##Based on the ISO8601 weekdate system,
## Monday is the first day of the week
## W01 is the week with 4 Jan in it.
year=1900+date$year
jan4=strptime(paste(year,1,4,sep="-"),format="%Y-%m-%d")
wday=jan4$wday

wday[wday==0]=7 ##convert to base 1, where Monday == 1, Sunday==7

##calculate the date of the first week of the year
weekstart=jan4-(wday-1)*86400
weeknum=ceiling(as.numeric((difftime(date,weekstart,units="days")+0.1)/7))

#########################################################################
##calculate week for days of the year occuring in the next year's week 1.
#########################################################################
mday=date$mday
wday=date$wday
wday[wday==0]=7
year=ifelse(weeknum==53 & mday-wday>=28,year+1,year)
weeknum=ifelse(weeknum==53 & mday-wday>=28,1,weeknum)

################################################################
##calculate week for days of the year occuring prior to week 1.
################################################################

##first calculate the numbe of weeks in the previous year
year.shift=year-1
jan4.shift=strptime(paste(year.shift,1,4,sep="-"),format="%Y-%m-%d")
wday=jan4.shift$wday
wday[wday==0]=7 ##convert to base 1, where Monday == 1, Sunday==7
weekstart=jan4.shift-(wday-1)*86400
weeknum.shift=ceiling(as.numeric((difftime(date,weekstart)+0.1)/7))

##update year and week
year=ifelse(weeknum==0,year.shift,year)
weeknum=ifelse(weeknum==0,weeknum.shift,weeknum)

return(list("year"=year,"weeknum"=weeknum))
}else{
print("Unknown return.val")
break
}
}




Of course after I wrote this function, I found this thread in R-help. Gustaf Rydevik provided a function that is similar to mine (his function is also about twice as fast); however, on my computer it was giving incorrect years and week numbers for days in January that occur in the previous year's final week (e.g., 1 January 2010 should be week 53 of 2009). Reading through the rest of the thread indicates that the confusion involving week numbers is widespread. I guess the moral of the story is pick a standard and stick to it.

Please let me know in the comments if you encounter any problems running this function (or if you can suggest ways to make it more efficient).

Leap years

A quick function that when provided a numeric vector of years returns a boolean vector where TRUE == Leap year.

is.leapyear=function(year){
#http://en.wikipedia.org/wiki/Leap_year
return(((year %% 4 == 0) & (year %% 100 != 0)) | (year %% 400 == 0))
}

Thursday, September 3, 2009

Truncated Normal Distribution

Many distributions may be used to describe patterns that are non-negative; however, there are not as many choices when an upper bound is also needed (although the beta distribution is very flexible). For various reasons, truncated distributions are sometimes preferred, and the truncated normal is particularly popular. While R has a package that includes the standard functions for this distribution (see rtnorm, dtnorm, etc. in the msm pacakge), the true expectation and variance of the distribution may be of interest. It turns out that the first two moments of the truncated normal are not too hard to calculate (but worth writing functions for):

mean.tnorm<-function(mu,sd,lower,upper){
##return the expectation of a truncated normal distribution
lower.std=(lower-mu)/sd
upper.std=(upper-mu)/sd
mean=mu+sd*(dnorm(lower.std)-dnorm(upper.std))/
(pnorm(upper.std)
-pnorm(lower.std))
return(mean)
}

var.tnorm<-function(mu,sd,lower,upper){
##return the variance of a truncated normal distribution
lower.std=(lower-mu)/sd
upper.std=(upper-mu)/sd
variance=sd^2*(1+(lower.std*dnorm(lower.std)-upper.std*dnorm(upper.std))/
(pnorm(upper.std)-pnorm(lower.std))-((dnorm(lower.std)-dnorm(upper.std))/
(pnorm(upper.std)-pnorm(lower.std)))^2)
return(variance)
}

###Testing
> library(msm)
> a=rtnorm(1000000,-5,2,1,3)
> paste(mean(a),var(a))
[1] "1.52135857341077 0.197281057170982"
> paste(mean.tnorm(-5,2,1,3),var.tnorm(-5,2,1,3))
[1] "1.52090857118 0.197111175109889"

Friday, August 28, 2009

Dropbox

I have been trying to convince my collaborators to use git so that we all work on the same page; however, I have accepted the fact that not everyone wants to get that involved. For these collaborations, I have started relying on a free web-based service called Dropbox. The free version gives you a 2GB account and allows you to securely share files with others without the hassle of setting up accounts on an ftp server. There is even a client for Linux (thank you!).

On Linux, after downloading and installing the program, you just have to start the daemon:
$dropbox start -i
During the set-up process, a folder is created in your home directory called "Dropbox" and everything you put in there is automatically synced with all the other computers you have linked to the account. You can explicitly share folders with others, or place files in the "Public" subdirectory and then share the associated url.

I have started using this service as a virtual flash drive and as an alternative to emailing enormous files.

git cheat sheet

From what I have seen, the best description of what Git is and how it works operationally was written by Charles Duan. Given that Charles has done such a great job, I am just going to summarize a few key commands here (also based on this tutorial).

A) Create a local git repo out of your current project directory.

$ cd ~/myproj
$ git init
$ git add .
$ git commit -m "first commit"

B) Create an empty remote git repo and push current project repo there.

$ ssh my.remote.server

> mkdir gitroot
> mkdir gitroot/myproj
> cd gitroot/myproj
> git --bare init
> exit

$ cd ~/myproj
$ git remote add origin ssh://my.remote.server/home/username/gitroot/myproj
$ git push origin master

C) Check out a remote git repo and send updates back.
$ git clone ssh://myserver.com/home/username/gitroot/myproj

$ git push
D) Create and switch among branches
$ git branch [new-head-name] [reference-to-branching-point]
For example:
$ git branch testing HEAD
You switch among branches simply by checking them out from your own repo:
$ git checkout testing
$ git push origin local-branch-name
E) Merge branches. Again, look here for a detailed description of how this works in practice.
$ git checkout master
$ git merge testing
F) Convert svn repos to git. There are other ways to do this if you want to preserve the tree of your svn repo; however, if all you want is to strip out the .svn folders:
$cd ~/mysvnproj
$ svn export . ../myproj
The problem is that only files under version control are exported. If you just want to remove svn version control from your current directory (i.e., recursively strip out the .svn folders), then use:
$find . -name ".svn" -type d -exec rm -rf {} \;

Wednesday, August 19, 2009

SVN gets got by git

My Subversion usage finally stabilized, but I realized that my major problem was that commits were relatively expensive (i.e., in order for me to make any commit I had to open an ssh connection to my remote server and send in my updates). Generally this is no big deal, but it made me less likely to commit on a regular basis -- especially when at home or travelling -- so I was falling back into the bad habit of making new copies of code with increasingly obtuse names (myscript.R -> myscript2.R -> myscript2-working.R, etc.). I guess I could develop my own personal file nomenclature system, but frankly, that is what version control software is for.

Earlier this year I stumbled upon an article by Cory Doctorow in which he describes how he uses a Python script (Flashbake) to automate the git version control system. This led me to a video of Linus Torvalds preaching the Gospel of Git at the Googleplex. The chief architect of the Linux kernel is known to be fairly, um... opinionated, so I was prepared to take his comments well salted; however, I ultimately found his arguments for git compelling and decided to give git a whirl.

Git is many things to many people, but my reasons to boot svn to the curb are that I want to:

1) Make local commits often.
2) Branch at will.
3) Switch among branches easily.
4) Merge branches easily.
5) Share repos among collaborators.

For my workflow, git does all of these things better than svn. While the ability to perform local commits was the killer feature for me, I have come to appreciate branching and merging (something I generally avoided with svn due to the often disastrous outcomes). So there it is. I like Git (and not just because it makes me feel cool!).

A friend of mine just mentioned that he likes bazaar version control. It looks very similar to git, (and may have some advantages for certain situations -- especially if Windows users are loathe to install cygwin); however, I am already committed (for at least a few months).

Monday, July 20, 2009

When subversion does not add up.

I have been using Subversion to manage all of my active projects -- it is a fantastic tool but I ran into a bit of a problem recently when I accidentally added several hundred log files as I was preparing for a commit (using the automated method I describe here -- so caveat emptor). Most suggestions I found on the internet were some variation of:
$ svn --recursive revert .
The problem with this solution is that all uncommitted changes to existing files (i.e., modified files that were already present in the repository) would be lost. Ouch! I discovered this solution; however, it did not work for me because I had already made the mistake of deleting the offending files. Finally, I modified the suggested command and all is well:
$ svn st | grep log > badfile.txt
$ sed 's/! /.\//' badfile.txt > newfile.txt
$ less newfile.txt | xargs svn revert