14 November 2016

Playing with OpenCL

I spent last week reading up on modern C++ developments, including some great essays from Herb Sutter. I was particularly struck by his prescient series on Moore's Law, The Free Lunch Is Over and Welcome to the Jungle. The latter essay portrays all possible computer architectures on a 2D plane of CPU versus memory architecture. The axes are a bit tricky, but the general idea is that a platform at the "origin" is predictable and easy to program for, whereas things get trickier as you move up and/or right.

[Figure: Sutter's "Welcome to the Jungle" chart of processor type versus memory architecture]

This figure covers everything from cellphones and game consoles to supercomputers and communications satellites. It also got me wondering how hard a simple "hello world" OpenCL program would be to get running on my Intel-only laptop. Can I do this in a day, or perhaps just an evening?

Personally, I don't want to fuss with hardware right now - I just want to see how the GPU/C++ pieces fit together. Conveniently, my laptop contains a low-end embedded GPU (24 execution units) on its Broadwell chip. Running Debian, I was able to easily download the requisite packages and get started. I quickly discovered that consumer Intel GPUs, at present, do *not* support double-precision arithmetic, making this a less-than-ideal test-bed for scientific programming.

My test case was inspired by a performance issue that I ran into in my work. In a scientific simulation program that I use & help develop, Valgrind revealed that pow(double, double) was taking fully half of the total computational time. Poking around a bit, I see that pow() and log() really are quite complex to compute, particularly for doubles (since the total effort is a function of precision). With this in mind, I set up a simple example using both OpenCL and straight C++, and compared timings. Note - I strongly recommend using a sample size greater than one before drawing any conclusions with real-life consequences!
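As a rough, standalone illustration of that cost (this is not the simulation code, just the kind of micro-check I'd run; the loop count and exponent are arbitrary), a small C++11 program can time pow() at both precisions:

// pow-cost.cpp -- standalone sketch, not the simulation code
// build with something like: g++ -Wall -O2 -std=c++11 -o pow-cost pow-cost.cpp
#include <chrono>
#include <cmath>
#include <cstdio>

template <typename T>
double time_pow(int n) {
    volatile T acc = 0;  // volatile keeps the compiler from optimizing the loop away
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 1; i <= n; ++i) {
        acc = acc + std::pow(static_cast<T>(i), static_cast<T>(1.5));
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const int n = 10000000;
    std::printf("float  pow: %.3f s\n", time_pow<float>(n));
    std::printf("double pow: %.3f s\n", time_pow<double>(n));
    return 0;
}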

In this example, the vanilla C++ is clean and easy to read, but is ~20x slower than the OpenCL version. Worried about the possibility of "unintended optimizations", I tried using a different kernel function. I used float for both examples to keep the total computational work the same. The speed difference remained, but the new test returned different answers on the two platforms. To the best of my understanding, this highlights differences in precision-sensitive operations between the OpenCL implementation and the standard C++ library. This is a pretty tricky area - just know that it's something to keep an eye on if you require perfect concordance between platforms.
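To give a sense of the other side of the comparison, the OpenCL host code follows the usual pattern: pick a platform and device, build a kernel from source, copy data over, run, and copy back. A rough sketch using the legacy C++ bindings (cl.hpp) - with a made-up kernel, not the exact code from my repo - looks like this:

// opencl-sketch.cpp -- illustrative only; kernel, sizes, and error handling are mine
// build with something like: g++ -Wall -std=c++11 -o opencl-sketch opencl-sketch.cpp -lOpenCL
#include <CL/cl.hpp>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // grab the first platform and device found
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);
    cl::Device device = devices[0];
    std::printf("Using platform: %s\n", platforms[0].getInfo<CL_PLATFORM_NAME>().c_str());
    std::printf("Using device: %s\n", device.getInfo<CL_DEVICE_NAME>().c_str());

    // device-side code: an elementwise pow() on floats (no doubles on this GPU)
    std::string src =
        "__kernel void powk(__global float* x, const unsigned int n) {"
        "    unsigned int i = get_global_id(0);"
        "    if (i < n) { x[i] = pow(x[i], 1.5f); }"
        "}";

    cl::Context context({device});
    cl::Program::Sources sources;
    sources.push_back({src.c_str(), src.length()});
    cl::Program program(context, sources);
    if (program.build({device}) != CL_SUCCESS) {
        std::printf("kernel build error\n");
        return 1;
    }

    // host data
    const unsigned int n = 100000;
    std::vector<float> x(n);
    for (unsigned int i = 0; i < n; ++i) { x[i] = static_cast<float>(i); }

    // copy to the device, run the kernel over n work items, copy back
    cl::Buffer buf(context, CL_MEM_READ_WRITE, sizeof(float) * n);
    cl::CommandQueue queue(context, device);
    queue.enqueueWriteBuffer(buf, CL_TRUE, 0, sizeof(float) * n, x.data());
    cl::Kernel kernel(program, "powk");
    kernel.setArg(0, buf);
    kernel.setArg(1, n);
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n), cl::NullRange);
    queue.enqueueReadBuffer(buf, CL_TRUE, 0, sizeof(float) * n, x.data());
    queue.finish();

    std::printf("x[200] = %g\n", x[200]);
    return 0;
}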

EDIT: I also added an example using Boost.Compute today, which brings together the best of the C++ and OpenCL worlds. Boost.Compute has straightforward docs, and includes a nice closure macro that allows the direct incorporation of C++ code in kernel functions. The resulting code is *way* less verbose than vanilla OpenCL. The only downside is the extra dependency. That, and some *very* noisy compiler warnings.
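Here's the same flavor of sketch for Boost.Compute (again illustrative: the kernel body, names, and sizes are mine, and it assumes Boost.Compute plus a working OpenCL driver are installed):

// boost-compute-sketch.cpp -- illustrative only, not the code from my repo
// build with something like: g++ -Wall -std=c++11 -o bc-sketch boost-compute-sketch.cpp -lOpenCL
#include <cstdio>
#include <vector>
#include <boost/compute.hpp>
#include <boost/compute/closure.hpp>

namespace compute = boost::compute;

int main() {
    compute::device gpu = compute::system::default_device();
    compute::context ctx(gpu);
    compute::command_queue queue(ctx, gpu);
    std::printf("Using device: %s\n", gpu.name().c_str());

    const int n = 100000;
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) { host[i] = static_cast<float>(i); }

    // the closure macro turns a captured C++ variable plus a C-like body
    // into an OpenCL kernel function
    float exponent = 1.5f;
    BOOST_COMPUTE_CLOSURE(float, powk, (float x), (exponent),
    {
        return pow(x, exponent);
    });

    compute::vector<float> device_vec(n, ctx);
    compute::copy(host.begin(), host.end(), device_vec.begin(), queue);
    compute::transform(device_vec.begin(), device_vec.end(), device_vec.begin(), powk, queue);
    compute::copy(device_vec.begin(), device_vec.end(), host.begin(), queue);
    queue.finish();

    std::printf("host[200] = %g\n", host[200]);
    return 0;
}

The closure macro is what keeps the kernel body readable; everything else reads like ordinary STL-style calls.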

Here's the output of the full example, which can be found in my github test code repo (the Boost.Compute version is not shown; its timings are comparable to OpenCL):

$make; time ./opencl; time ./straight

g++ -Wall -std=c++11 -lOpenCL -o opencl opencl.cpp
g++ -Wall -std=c++11 -o straight straight.cpp
Using platform: Intel Gen OCL Driver
Using device: Intel(R) HD Graphics 5500 BroadWell U-Processor GT2

 result:
200 201 202 203 204 205 206 207 208 209
99990 99991 99992 99993 99994 99995 99996 99997 99998 99999
./opencl  0.20s user 0.09s system 97% cpu 0.295 total

 result:
200 201 202 203 204 205 206 207 208 209
99990 99991 99992 99993 99994 99995 99996 99997 99998 99999
./straight  6.03s user 0.00s system 99% cpu 6.032 total

Hopefully this example helps you get started experimenting with GPU computing. As Herb Sutter points out, we can expect more and greater hardware parallelism in the near future. Discrete GPUs are now commonly used in scientific computing, and Intel now sells a massively multicore add-on card, the Xeon Phi coprocessor. Finally, floating-point precision remains an interesting question to keep an eye on in this domain.

15 February 2016

Shiny on Webfaction: VPS installation without root

I've been using Webfaction (plug) as an inexpensive managed VPS. Part of me wants full root access, but I'm mostly happy to leave the administrative details to others. Webfaction seems to be a good example of a common VPS plan: user-only access in a rich development environment. Compilers, zsh, and even tmux are available from the shell, making this a very comfortable place to work overall.

Most of the time root doesn't matter, but sometimes it complicates new software installs. I've been looking forward to testing R's webapp package Shiny, but all of the docs assume root access (and some even state that it's required). I set off without knowing whether this would work, just to see how far I could get. What follows is a (hopefully) reproducible account of a user-land install of R & Shiny via ssh on a Webfaction slice. To the best of my knowledge, this requires only standard development tools, and so should(??) work elsewhere.

In the following, I use [tab] to indicate hitting the tab key for auto-completion. The VPS login username is [user]. [edit] means call your editor of choice (vim, emacs, or, god forbid, nano). This assumes you are using bash (which seems to be the default shell on most VPSes).

Prepare the build environment

## ssh to webhost
## make directories, set paths, etc
## source build dir
mkdir ~/src
## software install dir
mkdir ~/local
## personal content dir
CONTENTDIR=~/var
mkdir $CONTENTDIR
## some hosts have /tmp set noexec?
mkdir src/tmp
## Install software here
INSTPREFIX=$HOME/local

## set paths:  
##
echo 'export PATH=$PATH:~/local/bin:~/local/shiny-server/bin' >> ~/.bashrc
echo 'export TMPDIR=$HOME/src/tmp' >>~/.bashrc

## check that all is well
[edit] ~/.bashrc
## update env
. .bashrc
[Ref: temp dir and R packages]

Install R from source: fast and (mostly) easy

cd ~/src
wget http://cran.us.r-project.org/src/base/R-3/R-3.2.3.tar.gz
tar xzf R-3.2.3.tar.gz
cd R-[tab]
./configure --prefix=$INSTPREFIX
## missing header; add the Java include directory
CPPFLAGS=-I/usr/lib/jvm/java/include/ make
make install
cd ~

Prep R environment

## The following commands are in R
install.packages(c('shiny', 'rmarkdown'))
## From the shell:
## on a headless / no-X11 box, need cairo for png
echo "options(bitmapType='cairo')" >> ~/.Rprofile
## check that all is well
[edit] ~/.Rprofile
[Ref: R png without X11]

Install cmake (if needed)

## first install cmake - skip if it's already available
which cmake
## nothing?  continue
## NOTE - I'm using the source tarball here, not binaries
cd ~/src
wget https://cmake.org/files/v3.4/cmake-3.4.3.tar.gz
tar xzf cmake-[tab]
cd cmake-[tab]
./configure --prefix=$INSTPREFIX
gmake
make install

Install Shiny Server

## From shell
cd ~/src
git clone https://github.com/rstudio/shiny-server.git
cd shiny-server
cmake -DCMAKE_INSTALL_PREFIX=$INSTPREFIX .
make 
## "make install" Complains about no build dir
## I'm not sure what happens here, but this seems to work
PYTHON=`which python`
mkdir build
./bin/npm --python="$PYTHON" rebuild 
./bin/node ./ext/node/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js --python="$PYTHON" rebuild 
make install
[Ref: shiny build docs]

Configure Shiny Server

All of the Shiny Server docs assume the config file is located in /etc/, which I don't have access to. There's _zero_ documentation on how to point shiny-server at an alternate config file; neither shiny-server -h nor shiny-server --help provides any indication. Trial and error and reading the source code on github finally led me to shiny-server path-to-config-file. So, let's make a shiny site!
## Nest content in ~/var
mkdir $CONTENTDIR/shiny
cp -rp ~/src/shiny-server/samples $CONTENTDIR/shiny/apps
mkdir $CONTENTDIR/shiny/logs
## copy the packaged settings template to the content dir
cp ~/src/shiny-server/config/default.config $CONTENTDIR/shiny/server.conf
[edit] $CONTENTDIR/shiny/server.conf
##
## server.conf content follows:
run_as [user];
## leave location as-is
## substitute var with $CONTENTDIR if needed
    site_dir /home/[user]/var/shiny/apps;
    log_dir /home/[user]/var/shiny/logs;    
## save file
## back at shell, run shiny, put in background
shiny-server ~/var/shiny/server.conf &
[Ref: Shiny-server docs]

Testing

Shiny should give messages about Starting listener on 0.0.0.0:3838. First up, let's use ssh to forward remote port 3838 to a local port. This allows local testing without public deployment. As an aside, if you're not using ~/.ssh/config on your local machine to manage keys and hostname shortcuts, you should!
## on local machine:
ssh -nNT -L 9000:127.0.0.1:3838 [user]@webhost
Now, if all went well, you should be able to navigate to the welcome page in a browser on the local machine:
http://127.0.0.1:9000

Once shiny is working, don't forget to take a look at your logs:
ls -alh $CONTENTDIR/shiny/logs

I had trouble with the packaged rmd example app (which renders a .Rmd file). Reading the logs showed install issues with pandoc, and I had to fiddle with the symlinks manually:

ln -s $INSTPREFIX/shiny-server/ext/pandoc/static/pandoc $INSTPREFIX/shiny-server/ext/pandoc/
[Ref: port forwarding]

Wrap-up

For a full production environment, you would want a process monitor to keep shiny-server running, as well as a public-facing proxy server. See your webhost's documentation for process monitors. More details on shiny-server and apache are here (I haven't tried these proxy methods).

Finally, a more conventional approach using root access on a VPS (such as DigitalOcean) is available here.

Update - 17 Feb 2016: Deployment Logistics

After a day of kicking the tires, I'm happy to report Shiny-server is working well on Webfaction in production mode. Two points:

Making a webapp. In the Webfaction control panel, I added a custom application; in the following, [appname] stands for the value entered in the Name field. For App category I selected "Websockets", and then clicked "Save". Copy the port number. Edit the server.conf file from above, replacing the number in listen 3838; with the port number copied from Webfaction. Finally, create a website, add a name (it can be the same as [appname] from above), and a domain. It typically takes a few minutes for DNS changes to propagate.

The above steps create a directory named $HOME/webapps/[appname]. I placed the server.conf file there, created app and log directories, and then updated server.conf to reflect the new locations:

## Create the following directories
## add these paths to server.conf, 
## and don't forget the trailing ; 
mkdir $HOME/webapps/[appname]/logs
## shiny app files go here:
mkdir $HOME/webapps/[appname]/app

[Ref: Webfaction custom application]
[Ref: Webfaction Applications and Websites]

Running the server. Shiny-server uses a PID file, which makes keeping it alive a simple matter of a shell script plus a cron job. If shiny-server is already running, it will recognize the PID file and not start another process. I made the following script:

#!/bin/sh
## executable shell script named $HOME/bin/my.shiny.sh
## make sure to run: chmod +x $HOME/bin/my.shiny.sh
APPROOT=$HOME/webapps/[appname]                                                                               
PIDFN=$APPROOT/shiny-server.pid                   
## using full path                                                        
$HOME/local/shiny-server/bin/shiny-server $APPROOT/server.conf --pidfile=$PIDFN >> $APPROOT/logs/server.log 2>&1 &
Now run crontab -e and add an entry for the script (above):
## try once an hour, on the 10th minute of the hour
10 * * * * /home/[user]/bin/my.shiny.sh
Finally, keep an eye on memory usage. If you exceed memory limits, Webfaction automatically kills everything. And R's memory use grows with the number of connections (which persist, because they're websockets). Webfaction distributes a nice python script that shows per-process and total memory usage.

[Ref: shiny-server systemd script (shows commandline usage)]
[Ref: Webfaction cron]

I should point out that I like Webfaction (plug) well enough to pay them money. Their intro plan is $10/month for 1GB RAM and 100GB of SSD storage, with a 1-month free trial. I like that the Webfaction user base is big enough that lots of my questions are already answered, but small enough that staff actually answer new questions.

I've done my best to document exactly what I did, but I'm sure there are typos. Let me know if you encounter any issues!

14 November 2014

Numerical Simulations and Data passing: C++, Python, and Protocol Buffers

Problem statement & Requirements

I'm working with a complex C++ simulation that requires a large number of user-specified parameters. Both speed and readability are important. I'd like to define all possible parameters in one (and only one) place, and include sensible defaults that can be easily overridden. Finally, intelligent type-handling would be nice. For convenience, I decided to wrap the C++ simulation in python setup/glue code. Python is a logical choice here as the "available everywhere" glue language with nice standard libraries.

Available libraries

There aren't many data-passing options that work with both C++ and python. Libconfig, JSON, XML, and Google Protocol Buffers (PB) appear to be the only reasonable options. Here are my thoughts on the first three:
  • Libconfig: Nice clean library, good language support. The big downside is that data structures must be defined both in a data file and in code - e.g. data is "moved" from a file into C++ variables. I feel like libconfig is best for a small number of complex variables, like lists and vectors.
  • JSON: no clear standard C++ library, library docs so-so, speed complaints from some?
  • XML: Massive overkill.
That leaves PB, which has nice docs for both C++ and python. All the variables, along with their types and defaults, are defined in a .proto file. The protoc tool auto-generates python and C++ code from the .proto file. By adding a protoc rule to my Makefile, the C++ classes are regenerated at compile time. This makes for fast and readable C++ code - like using a named dict, but without the speed costs.
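To make that concrete: for a field declared in the .proto file along the lines of optional int32 number_of_days = 2 [default = 365]; (the tag number and default here are made up), protoc generates typed getters, setters, and has_*() methods. A hypothetical C++ usage sketch, assuming the generated header and libprotobuf are available, might look like:

// params-demo.cpp -- hypothetical usage of the protoc-generated class
// (the field name matches the snippets below; the values are illustrative)
#include <cstdio>
#include "proto/ProtoBufInput.pb.h"

int main() {
    ProtoBufInput::setupSim params;
    // an unset optional field falls back to its .proto default...
    std::printf("days (default): %d\n", static_cast<int>(params.number_of_days()));
    // ...and overrides go through generated, type-checked setters
    params.set_number_of_days(30);
    if (params.has_number_of_days()) {
        std::printf("days (set): %d\n", static_cast<int>(params.number_of_days()));
    }
    return 0;
}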

Solution / Workflow

I'm using python to read user-supplied values into a set of PB messages, and then serializing the messages to files. C++ then reads the messages from those files at runtime. A python script run by make synchronizes the file locations between python and C++. I also want to process commandline options for my python wrapper script. Happily, I can hand a PB message to optparse's parse_args() and have it set the message's attributes with setattr(). The last python step (aside from writing the message to disk) is reading "variable,value" pairs from a .csv file. If a variable has already been set by parse_args, I skip it: commandline values override .csv file values.

Summary

Overall, PB makes a very nice data coupler between an interpreted language like python and a compiled language like C++. Python excels at text processing and is easy to prototype in, while C++ is fast and beautiful. PB has a few side-benefits. On the C++ side, it provides some natural namespace encapsulation to manage variable explosion. Runtime inspection with gdb is easy enough. Finally, storing all the option values used to run each simulation in a standard-format file is handy - it allows tests to re-run the simulation with exactly the same inputs.

Python Snippets

## imports assumed by these (Python 2-era) snippets
import csv
import subprocess
import sys
from optparse import OptionParser
from google.protobuf.text_format import Merge
import ProtoBufInput_pb2   ## generated by protoc from the .proto file
import ProtoDataFiles      ## message file locations, written by make

def main():
    ## initialize protobuf, fill with ParseArgs
    setupSim = ProtoBufInput_pb2.setupSim()
    setupSim = ParseArgs(sys.argv[1:], setupSim)
    prepInput(setupSim)
    RunSim()

def ParseArgs(argv, setupSim):
    parser = OptionParser(usage="wrapper.py [options]\nNote: commandline args over-ride values in files.", version=setupSim.version)
    ## these must be valid protocol buffer fields 
    parser.add_option("-t", "--test", dest="testCLI",
        action='store_true', help="Run test suite")
    parser.add_option("-d", "--days", metavar='N',
        dest="number_of_days",
        type='int', help="Number of days to simulate")
    ## parse!
    (setupSim, args) = parser.parse_args(argv, values=setupSim)
    return(setupSim)

def prepInput(setupSim):
    ## options from ParseArgs
    inhandle = open(setupSim.file_options, 'r')
    outhandle = open(ProtoDataFiles.PbFile_setupSim, 'wb')
    reader = csv.reader(inhandle, delimiter=',')
    header = reader.next()
    if not (header == ['variable','value']):
        raise Exception('Incorrect header format') 

    for row in reader:
        ## skip comments, check for 2 fields per row
        if (row[0][0] == '#'):
            continue
        if not (len(row) == 2):
            raise Exception('Problem with value pair: %s' % row)
            
        ## pack the message using text representation
        msgText = '%s : %s' % (row[0], row[1])
        if setupSim.HasField(row[0]):
            print("Skipping config file, keeping commandline value: %s, %s" % (row[0], getattr(setupSim,row[0])))
            continue
        setupSim = Merge(msgText, setupSim)
    ## write out to file for C++ to read
    outhandle.write(setupSim.SerializeToString())
    outhandle.close()


def RunSim():
    subprocess.Popen("./sim").communicate()

if __name__ == "__main__":
    main()

C++ Code Snippets

//PbRead.h
#include <fstream>
#include <stdexcept>
#include "proto/ProtoBufInput.pb.h"

template <typename Type>
void PbRead(Type &msg, const char *filename){
    std::fstream infile(filename, std::ios::in | std::ios::binary);
    if (!infile) {
       throw std::runtime_error("Setup message file not found");
    } else if (!msg.ParseFromIstream(&infile)) {
       throw std::runtime_error("Parse error in message file");
    }
}


// sim.cpp
#include "PbRead.h"
#include "ProtoDataFiles.h"

// protocol buffers get passed around, are globals
ProtoBufInput::setupSim PbSetupSim;

int main(int argc,char **argv)
{
    GOOGLE_PROTOBUF_VERIFY_VERSION;
    PbSetupSim.set_init(true);
    // #define PbFile_setupSim "filename" in ProtoDataFiles.h, written by make
    PbRead(PbSetupSim, PbFile_setupSim);
    //...
    if (PbSetupSim.test_2()){
       //...
    }
}

03 July 2014

Efficient Ragged Arrays in R and Rcpp

When is R Slow, and Why?

Computational speed is a common complaint lodged against R. Some recent posts on r-bloggers.com have compared the speed of R with some other programming languages [1], and showed the favorable impact of the new compiler package on run-times [2]. I and others have written about using Rcpp to easily write C++ functions to speed-up bottlenecks in R [3,4]. With the new Rcpp attributes framework, writing fully vectorized C++ functions and incorporating them in R code is now very easy [5].

On a day-to-day basis, though, R's performance is largely a function of coding style. R allows novice users to write horribly inefficient code [6] that produces the correct answer (eventually). Yet by failing to use vectorization and pre-allocation of data structures, naive R code can be many orders of magnitude slower than it needs to be. R-help is littered with the tears of novices, and there's even a (fantastic) parody of Dante's Inferno outlining the common "Deadly Sins of R" [7].

Problem Statement: Appending to Ragged Arrays

I recently stumbled onto an interesting code optimization problem that I *didn't* have a quick solution for, and that I'm sure others have encountered. What is the "R way" to vectorize computations on a ragged array? One example of a ragged array is a list of vectors of varying lengths. Say you need to dynamically grow many vectors by varying lengths over the course of a stochastic simulation. Using a simple tool like lapply, the entire data structure will be allocated anew with every assignment. This problem is briefly touched on in the official Introduction to R documentation, which simply notes that "when the subclass sizes [e.g. vector sizes] are all the same the indexing may be done implicitly and much more efficiently". But what if your data *isn't* rectangular? How might one intelligently vectorize a ragged array to prevent (sloooow) memory allocations at every step?

The obvious answer is to pre-allocate a rectangular matrix (or array) that is larger than the maximum possible vector length, and store each vector as a row (or column?) in the matrix. Now we can use matrix assignment, and for each vector track the index of the start of free space. If we try to write past the end of the matrix, R emits the appropriate error. This method requires some book-keeping on our part. One nice addition would be an S4 class with slots for the data matrix and the vector of free-space indices, as well as methods to dynamically expand the data matrix and validate the object. As an aside, this solution is essentially the inverse of a sparse matrix. Sparse matrices use much less memory at the expense of slower access times [8]. Here, we're using more memory than is strictly needed to achieve much faster access times.

Are pre-allocation and book-keeping worth the trouble? object.size(matrix(1.5, nrow=1e3, ncol=1e3)) shows that a data structure of 1,000 vectors, each of length approximately 1,000, occupies about 8Mb of memory. Let's say I resize this structure 1,000 times. Now I'm looking at roughly 8 gigabytes of cumulative memory allocations. Perhaps you're getting a sense of what a terrible idea it is to *not* pre-allocate a frequently-resized ragged list?

Three Solutions and Some Tests

Using the above logic, I prototyped a solution as an R function, and then transcribed the result into a C++ function (boundary checks are important in C++). The result is three methods: a "naive" list-append method, an R method that uses matrix assignment, and a final C++ method that modifies the pre-allocated matrix in place. In C++/Rcpp, functions can use pass-by-reference semantics [9], which can have major speed advantages by allowing functions to modify their arguments in place. Full disclosure: pass-by-reference requires some caution on the user's part. It is very different from R's functional-programming semantics (pass-by-value, copy-on-modify), where side-effects are minimized and an explicit assignment is required to modify an object [10].
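To make the pass-by-reference point concrete before the benchmark code below, here is a minimal, hypothetical Rcpp function (not part of the benchmark) that fills an R numeric vector in place:

#include <Rcpp.h>
#include <algorithm>

// [[Rcpp::export]]
void fillInPlace(Rcpp::NumericVector x, double value) {
    // NumericVector is a thin proxy around R's own memory, so writing through
    // it changes the vector that was passed in -- no copy, no return value
    std::fill(x.begin(), x.end(), value);
}

After sourceCpp()-ing this file, calling fillInPlace(x, 0) in R modifies x without any assignment - with the usual caveat that R may hand Rcpp a coerced copy (for example, if x is an integer vector), in which case the original is left untouched.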

I added a unit test to ensure identical results between all three methods, and then used the fantastic rbenchmark package to time each solution. As expected, the naive method is laughably slow. By comparison, and perhaps counter-intuitively, the R and C++ pre-allocation methods are close in performance. Only with more iterations and larger data structures does the C++ method really start to pull ahead. And by that time, the naive R method takes *forever*.

Refactoring existing code to use the pre-allocated compound data structure (matrix plus indices) is a more challenging exercise that's "left to the reader", as mathematics textbooks oft say. lapply() is conceptually simple, and is often fast *enough*. Some work is required to transcribe code from this simpler style to use the "anti-sparse" matrix (and indices). There's a temptation to prototype a solution using lapply() and then "fix" things later. But if you're using ragged arrays and doing any heavy lifting (large data structures, many iterations), the timings show that pre-allocation is more than worth the effort.

Code

Note: you can find the full code here and here.

Setup: two helper functions are used to generate ragged arrays via random draws. First, draws from the negative binomial distribution determine the length of each new vector (with a minimum length of 1, gen.lengths()), and draws from the exponential distribution fill each vector with data (gen.dat()).
## helper functions
gen.lengths <- function(particles, ntrials=3, prob=0.5) {
    ## a vector of random draws
    pmax(1, rnbinom(particles, ntrials, prob))
}
gen.dat <- function(nelem, rate=1) {
    ## a list of vectors, vector i has length nelem[i]
    ## each vector is filled with random draws
    lapply(nelem, rexp, rate=rate)
}

The first two solutions: a naive append method (using mapply()), followed by pre-allocation and matrix assignment in R.
## naive method
appendL <- function(new.dat, new.lengths, dat, dat.lengths) {
    ## grow dat by appending to list element i
    ## memory will be reallocated at each call
    dat <- mapply( append, dat, new.dat )
    ## update lengths
    dat.lengths <- dat.lengths + new.lengths
    return(list(dat=dat, len=dat.lengths))
}

## dynamically append to preallocated matrix
## maintain a vector of the number of "filled" elements in each row
## emit error if overfilled
## R solution
appendR <- function(new.dat, new.lengths, dat, dat.lengths) {
    ## grow pre-allocated dat by inserting data in the correct place
    for (irow in 1:length(new.dat)) {
        ## insert one vector at a time
        ## col indices for where to insert new.dat
        cols.ii <- (dat.lengths[irow]+1):(dat.lengths[irow]+new.lengths[irow])
        dat[irow, cols.ii] = new.dat[[irow]]
    }
    ## update lengths
    dat.lengths <- dat.lengths + new.lengths
    return(list(dat=dat, len=dat.lengths))
}


Next, the solution as a C++ function. This goes in a separate file that I'll call helpers.cpp (compiled below).
#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
void appendRcpp(  List fillVecs, NumericVector newLengths, NumericMatrix retmat, NumericVector retmatLengths) {
    // "append" to retmat by filling it with the vectors in fillVecs
    // we loop through rows, filling retmat in with the vectors in the list,
    // then update retmatLengths to index the next free element in each row
    // newLengths isn't used; it's included for call compatibility with the R versions
    NumericVector fillTmp;
    int sizeOld, sizeAdd, sizeNew;
    // pull out dimensions of matrix to fill
    int nrow = retmat.nrow();
    int ncol = retmat.ncol();
    // check that dimensions match
    if ( nrow != retmatLengths.size() || nrow != fillVecs.size()) { 
        throw std::range_error("In appendC(): dimension mismatch");
    }
    for (int ii = 0; ii < nrow; ii++) {
        // grab the next vector to insert; compute old and new fill counts
        fillTmp = fillVecs[ii];
        sizeOld = retmatLengths[ii];
        sizeAdd = fillTmp.size();
        sizeNew = sizeOld + sizeAdd;
        // refuse to write past the end of the pre-allocated row
        if (sizeNew >= ncol) {
            throw std::range_error("In appendC(): exceeded max cols");
        }
        // iterator for row to fill
        NumericMatrix::Row retRow = retmat(ii, _);
        // fill row of return matrix, starting at first non-zero elem
        std::copy( fillTmp.begin(), fillTmp.end(), retRow.begin() + sizeOld);
        // update size of retmat
        retmatLengths[ii] = sizeNew;
    }
}


Putting the pieces together: a unit test ensures the results of all three methods are identical, and a function that runs each solution with identical data will be used for timing.
## unit test
test.correct.append <- function(nrep, particles=1e3, max.cols=1e3, do.c=F) {
    ## list of empty vectors, fill with append
    dat.list <- lapply(1:particles, function(x) numeric())
    ## preallocated matrix, fill rows from left to right
    dat.r <- dat.c <- matrix(numeric(), nrow=particles, ncol=max.cols)
    ## length of each element/row
    N.list <- N.r <- N.c <- rep(0, particles)
    ## repeat process, "appending" as we go
    for (ii in 1:nrep) {
        N.new <- gen.lengths(particles)
        dat.new <- gen.dat(N.new)
        ## in R, list of vectors
        tmp <- appendL(dat.new, N.new, dat.list, N.list)
        ## unpack, update
        dat.list <- tmp$dat
        N.list <- tmp$len
        ## in R, preallocate
        tmp <- appendR(dat.new, N.new, dat.r, N.r)
        ## unpack, update
        dat.r <- tmp$dat
        N.r <- tmp$len
        ## as above for C, modify dat.c and N.c in place
        appendRcpp(dat.new, N.new, dat.c, N.c)
    }
    ## pull pre-allocated data back into list
    dat.r.list <- apply(dat.r, 1, function(x) { x <- na.omit(x); attributes(x) <- NULL; x } )
    ## check that all methods give identical results
    identical(dat.r, dat.c) && identical(N.r, N.c) &&
    identical(dat.list, dat.r.list) && identical(N.list, N.r)
}

## timing function, test each method
test.time.append <- function(nrep, particles=1e2, max.cols=1e3, append.fun, do.update=T, seed=2) {
    ## object to modify
    N.test <- rep(0, particles)
    dat.test <- matrix(numeric(), nrow=particles, ncol=max.cols)
    ## speed is affected by size, 
    ## so ensure that each run the same elements
    set.seed(seed)
    for (irep in 1:nrep) {
        ## generate draws
        N.new <- gen.lengths(particles)
        dat.new <- gen.dat(N.new)
        ## bind in using given method
        tmp <- append.fun(dat.new, N.new, dat.test, N.test)
        if(do.update) {
            ## skip update for C
            dat.test <- tmp$dat
            N.test <- tmp$len
        }
    }
}


Finally, we run it all:
library(rbenchmark)
## Obviously, Rcpp requires a C++ compiler
library(Rcpp)
## compilation, linking, and loading of the C++ function into R is done behind the scenes
sourceCpp("helpers.cpp")

## run unit test, verify both functions return identical results
is.identical <- test.correct.append(1e1, max.cols=1e3)
print(is.identical)

## test timings of each solution
test.nreps <- 10
test.ncols <- 1e3
timings = benchmark(
    r=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendR),
    c=test.time.append(test.nreps, max.cols=test.ncols, do.update=F, append.fun=appendRcpp),
    list=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendL),
    replications=10
)

## Just compare the two faster methods with larger data structures.
test.nreps <- 1e2
test.ncols <- 1e4
timings.fast = benchmark(
    r=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendR),
    c=test.time.append(test.nreps, max.cols=test.ncols, do.update=F, append.fun=appendRcpp),
    replications=1e1
)


Benchmark results show that the list-append method is roughly 400 times slower than the improved R method, and over 900 times slower than the C++ method (timings). As we move to larger data structures (timings.fast), the advantage of modifying in place with C++, rather than having to explicitly assign the results back, quickly adds up.
> timings
  test replications elapsed relative user.self sys.self
2    c           10   0.057    1.000     0.056    0.000
3 list           10  52.792  926.175    52.674    0.036
1    r           10   0.128    2.246     0.123    0.003

> timings.fast
  test replications elapsed relative user.self sys.self
2    c           10   0.684    1.000     0.683    0.000
1    r           10  24.962   36.494    24.934    0.027

References


[1] How slow is R really?
[2] Speeding up R computations Pt II: compiling
[3] Efficient loops in R - the complexity versus speed trade-off
[4] Faster (recursive) function calls: Another quick Rcpp case study
[5] Rcpp attributes: A simple example 'making pi'
[6] StackOverflow: Speed up the loop operation in R
[7] The R Inferno
[8] Sparse matrix formats: pros and cons
[9] Advanced R: OO field guide: Reference Classes
[10] Advanced R: Functions: Return Values

30 May 2014

Tools for Online Teaching

Last semester (Fall 2014), I organized and taught an interdisciplinary, collaborative class titled Probability for Scientists. Getting 4 separate teachers on the same page was a challenge, but as scientists we're used to communicating over email, and CC'ing everyone worked well enough. Throw in a shared dropbox folder, a shared google calendar, and a weekly planning meeting, and instructor collaboration went pretty smoothly.

It was more challenging to organize the class so that we could easily provide students with up-to-date course information and supplemental material. We ended up using blogger, which has some key benefits and disadvantages. It was *really* easy to set up and add permissions for multiple individuals. This allowed any of us to put up images (for example, photos of the whiteboard at the end of class) and post instructions. One downside we heard from students was an apparent lack of organization. I attempted to organize the blog with intelligent post labels, along with the "Labels widget" (which shows all possible labels and the number of posts per label on the right hand sidebar). I also included the "Posts by Date" sidebar, so all the content could be accessed chronologically. I understand their comments, though I'm not convinced that a single, monolithic page of information is the right direction either.

One feature of blogger that proved helpful was its "Pages", static html files or links that appear (by default) at the top of the blog. These links are always present no matter where in the blog you navigate to. We added the syllabus and course calendar here. And here's where things start to get interesting.

All four instructors needed the ability to collaboratively edit and then publish public-facing material for students (like the syllabus), as well as edit private course material, like grades. Dropbox makes a natural choice for private material, whereas a collaborative source repository like GitHub makes a natural choice for public material. Personally, I'm more familiar/comfortable with Google Code, but I don't think the choice of service here is material. [Sidenote: If you have a pro GitHub account, the choice seems easy: use a private GitHub repo.]

Using a public git repo gave us a few things that I really liked. First off, our class schedule was a simple HTML table in the repo that we pointed at with a static link via the Pages widget, above. When I needed to update an assignment due date, I did a quick edit-and-commit on my laptop, pushed it to the archive, and the change magically appeared on the blog. If several instructors are simultaneously working on course material, this essentially provides an audit trail of who did what. This is kindergarten-level stuff for software developers. But these are tools that teachers could benefit from and aren't very familiar with. For example, getting colleagues to set up git represents a non-trivial challenge. Nonetheless, there are other benefits of learning git if you're a scientist (which I won't address here, though see here for some thoughts).

In our class, I also used R for data visualizations. I let the awesome knitr package build the appropriate .png files from my R code (along with pdfs for download, if needed). Again, it sounds simple, but adding the generated figures to the git archive allowed me to quickly link to them in blog posts, and then update them later if needed.

I would have preferred having the blog posts themselves under revision control (only "Pages" can point to an external source html without javascript, which I didn't have time for). But, for the simplicity of setup and the low (e.g. free) cost of use, I didn't find posts to be much of an issue. Blogger allows composition in pure html, without all the *junk*, which helps. But having to re-upload and link a figure every time I find an error? Definitely a pain.

For this class, we also set up an e4ward.com address that pointed to my email box. This allowed me to publicly post the class address without fear of being spammed forevermore, and allowed me to easily identify all class email. Having a single instructor responsible for email correspondence worked well enough. As an early assignment, students were asked to send a question to the class email address. This is a nice way to get to know folks, and incentivize them to go to "digital office hours", e.g. ask good questions over email. We did have some issues with e4ward.com towards the end of the class. This *sucks* - students panic when emails get lost, and it's hard to sort out where the problem is. Honestly, I don't know the answer here.

As a sidenote, I pushed the use of Wikipedia (WP) heavily in this class, and referred to it often myself. This is not possible in all fields, but WP articles in many of the hard sciences are now both technically detailed and accessible. Probability and statistics articles are some of the best examples here, since the "introductory concept" articles are used by a large number of individuals/fields, and aren't the subjects of debate (compare this with the WP pages of, for example, Tibet or Obama). I also discovered WP's "Outlines" pages for the first time. If you ever use statistics, I strongly recommend spending some time with WP's Outline of Statistics. It's epic.

One final tool we used quite a bit was LimeSurvey. As I was planning the class, I went looking for an inexpensive yet flexible survey platform. I was generally disappointed by the offerings. My requirements included raw data export and more than 10 participants; these features tend *not* to be free, and survey tools can get pricey. Enter LimeSurvey (henceforth LS). It's open source, well-documented, versatile, and simple to use. I was reluctant to invest *too* much effort in tools I wasn't sure we'd use, but I got LS running in less than 2 hours. To be fair, our lab has an on-campus, always-on server, and apache2 is already installed and configured. This would have been an annoying out-of-pocket expense had I needed to rent a VPS, though you can now get a lot of VPS for $5/mo. [sidenote: getting our campus network to play nice with custom DNS was a whole other issue...]

LS allowed me to easily construct data-entry surveys, so each student could enter, for example, their 25 flips of 3 coins, or their sequence of wins and losses playing the Monty Hall problem. Students can quickly enter this data from their browser at their convenience. At its best, visualization of the class data can give students a sense of ownership of, and purpose in, the in-class exercises. LS also allowed us to conduct initial background surveys, as well as anonymous, customized class "satisfaction" surveys mid-course to find out what was working and what wasn't. We ran into a few administrative issues with LS, but it's an overall powerful and stable data collection platform. Students seemed happy with it, and it provided us with valuable feedback.

What would I do differently? I failed to set up an email list early on. It would have been trivial using, e.g. Google Groups. At the time, it seemed redundant, both for the students and us. Didn't they already have the blog? In retrospect, it would have proved useful at several points to communicate "New Blog Post" or "Due Date Changed, check calendar". I've learned this semester to expect that spoken instructions are not necessarily heeded, and that receiving administrative details in writing from multiple sources (blog, calendar, mailing list) is a Good Idea ™.

Along the same vein, a minor but needed improvement is breaking assignments out into a separate table. I originally combined assignments with the course calendar in the interest of simplicity, giving students all the relevant information in one place. In the end, it was just confusing.

Overall, the class went very well. We have received positive feedback from the students, and we now have a detailed digital record of the course. PDFs of their final project posters are now in the archive, where they will live in perpetuity. Personally, I'm not ready to teach a MOOC yet, but I'm sold on digital tools as useful supplements to in-class material. They allowed me to spend less time doing more. These tools helped the class "organize itself", and reduced communication overhead between instructors. To older or less technically-inclined teachers, some of this might seem difficult or confusing. On one hand, you really only need *one* instructor to coordinate the tech - it's not very hard for instructors to use the tools once they're set up. On the other hand, some of the above pieces are very easy to implement, and support from a department administrator or technically-inclined teaching assistant (TA) might be available for more challenging pieces. An installation of LS shared across a department, for example, should be trivial for any department that has its own linux server (e.g. math, physics, chem).

In my mind, a key goal here is that technology not get in the way. Many of the commercial "online learning products" that I've seen adopted by universities make simple tasks complicated while lacking flexibility. In trying to do everything for everyone, they often end up doing nothing for anyone (or take a high level of skill or experience to use effectively). I far preferred using several discrete tools, each of which does a single job well (class email address, blog to communicate, git repo to hold files).

Are there any interesting tools worth trying out next time? I wonder how a class twitter-hashtag would work...

26 November 2013

The Art of the Album

When I'm working, I'll easily listen to 6 or more hours of music a day. Ten years ago, I listened to a *lot* of public radio, which broadened my ear and introduced me to genres ranging from classical jazz to Native American and New Mexican music. Internet radio gave me more choice of stations (I'm now a happy KEXP micro-donor). Finally, there was Rdio (or Spotify, take your pick). Residents of the U.S. got to hear about the wonders of these all-you-can-eat music-as-a-utility services years before they became available here; complex licensing agreements had to be signed with our musical overlords first. But the future has finally arrived, and for $10/month I now have an internet full of music (including offline access on my phone; $5/mo for wired-computer-only access).

The magic of a high-quality, easily searched and streamed music archive has transformed the way I listen to music. When I hear a song I like, it now takes me less than 30 seconds to find the album and begin playing on my office computer or phone.
There are a few drawbacks - not every album or artist is available (Joanna Newsom is a particularly galling example), and occasionally I find myself without a reliable cellphone or WiFi signal. But these are minor issues. Overall, the ultimate convenience, the *modern-ness* of it all still blows my mind. To me, this is better than a flying car (of course, I don't even own a normal car). And this convenience has, in the last few years, rekindled my love of the art of the album.

It seems to me that independent music in general has benefited from digital distribution, which allows artists to more easily break from the conventional constraints of genre. I see a lot of experimentation here, running all the way up to the Dirty Projectors' avant-garde classical composition style. Growing up in the 90's, I enjoyed Pearl Jam and Nirvana well enough, but much of the "alternative" music that I listened to at the time sounded (and still sounds) rather similar to my ears, e.g. grunge. The ones that sounded different really stood out, and I still cherish them for it (I'm looking at you, Pixies). Maybe I'm biased now by access to more music and better DJs, but I find the modern American music scene incredibly vibrant and diverse. Every month, I can look forward to new releases from favorite artists, as well as finding something or someone new to open my eyes and make my day.

What follows is an unordered list of albums that I've recently developed a strong relationship with. These albums cover a wide range of the acoustic/electronic spectrum. I enjoy repetitive, energetic music when I'm working or juggling or cleaning; I love the emotion and classic song-writing of "folk" and "country"; and I love the driving anthems of modern indie. Consequently, I like to think there's "something for everyone" here. And each of these is an *album*, a free-standing work of art worthy of repeated enjoyment in its uncensored, unedited entirety.

Obvious:

Macklemore & Ryan Lewis - The Heist (2012)

I'd like to find more music like this: the crossroads between pop and hip hop, independent music that gets radio play, catchy but meaningful. I can think of half a dozen song lyric lines that make great life slogans. This is a great album to blast in the car on a warm spring day.

alt-J - An Awesome Wave (2012)

I think of alt-J as the Neutral Milk Hotel of this decade: where the hell did they come from? It's such a beautiful, subtle album that came out of nowhere and bears repetition very well. I beg the gods for more in the future.

Daft Punk - Random Access Memories (2013)

I never really got into Daft Punk before this album, and I didn't even like it that much the first few plays. The songs tend towards longish, some of them are slow, and I found myself getting bored. Then I began getting lines stuck in my head, and began dipping back in. In the end, I find this an immensely satisfying sort of pop-EDM concept album: a soothing mix of repetitive riffs that aren't too fast or insistent, with a backdrop of pop anthem melodies. It strikes me as easy-listening Moby? This album is a little slow for me to "sit down" and listen to, but I find it excellent clean-the-house/driving music.

Phosphorescent - Muchacho (2013)

Rainy day + hot coffee. Sunset and a beer. Just got dumped, fired, graduated, engaged? This is such an extraordinarily luscious, eloquent album. It makes me remember that I have emotions. Lots of them.

Santigold - Master of My Make-Believe (2012)

I always perk up when I hear Santigold singles on the radio, but I was slow to listen to the albums. I like her self-titled 2008 album, but it never really got under my skin. By the second or third listen of Master, though, I wanted to know more about this artist. After digging around a bit, I feel like I have a better idea of where she's coming from, and where she's going. The comparison to M.I.A. is inevitable, while the album art for Master suggests something more like Outkast. Master has tons of energy and is packed with pop-friendly riffs. But it's complex, and strikes me as walking the "don't define me" tightrope (or slackline, if you will; you can push *back* on a slackline). I enjoy that it doesn't settle down into a niche and stay there.

Less Obvious:

Shovels and Rope - O' Be Joyful (2012)

In my mental map of Americana, I file this near Wilco and Drive By Truckers. Sometimes slow and sweet, sometimes fast and rambunctious, but always melodic, this album is full of luscious 2-part harmonies with a low-fi, intimate feel. I'm always sad when it ends; I always want more.

First Aid Kit - The Lion's Roar (2012)

Can I call this indie-Americana? Less of the overt Southern influence of Shovels and Rope, but still full of tight vocal harmonies of country/folk. Apparently they're sisters, and apparently they're young, but this album has a big sound, full of driving melancholy. Playing two or three of their albums back-to-back is particularly satisfying. They seem to be growing as they go, and I'm excited to hear what comes next.

John Grant - Queen of Denmark (2010)

A very good anthem album. I don't often listen or pay attention to lyrics, but Grant has a John Prine-ish storytelling quality, a dark sense of humor and playful irony. Musically, it tends towards simple, with a fast, light quality that reminds me of Paul Simon's Graceland. Thematically, though, it's a dark album. A far-off hint of redemption shines at the end of the tunnel, but just barely. Whistling in the dark.


Sharon Van Etten - Tramp (2012)

A powerful voice, and a powerful song-writer. This album is mature and intimate, and Van Etten's voice is strong and clear. Tight harmonies and vocal stylings that are luxurious without being excessive. The utterly enrapturing, liquid control of her voice reminds me a little of the Cowboy Junkies' Margo Timmins, with a bit of Joni Mitchell. In short, she's good.

Matthew Dear - Beams (2012)

My first introduction to Matthew Dear, this album is driving. Repetitive, almost grinding, the samples remind me of smoothed-out, slowed down industrial, or gears-and-grease voodoo. It reminds me of being in the belly of a very large machine. The tone palette is less pure than, say, Daft Punk, with lots of glitches and grinding noises. It's also harmonic, full of discordant melodies. And I *love* it. There are songs that I would love to hear on a dark dance floor in a small, crowded night-club. It's sexy as hell, with a floating touch of loss and nostalgia.

Jagwar Ma - Howlin (2013)

This is a somewhat confusing album. A mix of upbeat chorus-driven pop tunes and beat-and-sample driven pop-EDM, I find it a little schizophrenic at times. In the space of two songs, it goes from a drivingly upbeat guitar-and-vocals sound akin to Django Django's recent album, to something more akin to Caribou's hypnotic samples, with little in the way of transition. The situation reminds me a little of Hot Chip's recent album In Our Heads (which I still find deeply confusing). But Howlin is infectious throughout, with several singles that belong in the "party mix".


Dirty Projectors - Swing Lo Magellan (2012)

Beautiful melodies with a glorious sheen of tightly-controlled noise and discord, this approaches classical composition in broad-scale interest and ability to scare off pedestrians at a first listen. There's just enough rhythm and melody, though, to reel a music-lover in until one gains some familiarity with the subtleties. Then the album really starts to open up. To my ear, it's the opposite of a show-stopping dramatic pop album. It's playful and light, and strange, and curious and coy, going from simple to huge and back. It's complex and, sometimes subtly, very satisfying. This is real sit-down-and-listen music, kind of like going to see the symphony.

Junip - Self-Titled (2013)

It's not unlikely that you've already heard "Your Life Your Call". I'm sure it's in some movie or another, or will be soon. I get shivers every time I hear this song - like the soundtracks of the Breakfast Club and Trainspotting had a mutant child. Jose Gonzales has a number of solo albums (I'm quite fond of his 2005 Veneer - see below), though I never made the connection with Junip myself. His voice is as clear and emotional as ever, but the sound is bigger and more nuanced, a wonderful blend of semi-acoustic and smooth electronic sounds. This is an emotional album - not any *particular* emotion, but all of them, simultaneously, and a lot. Much like Muchacho, listening to it makes me feel decidedly and acutely human.


Yppah - Eighty One (2012)

Driving indie dream-pop, Yppah's sound is reminiscent of Heliosequence with drum machines. Something to get the shoe-gazers moving around!

Less new

These are albums that I've recently discovered, or that are especially noteworthy. I'm ready to wrap this post up, so they get just a brief mention, but they're all worth a good listen.

Caribou - Swim (2010)

Smooth, fast, steady electronic. A masterful album.

Jose Gonzales - Veneer (2005)

Contains a cover of The Knife's song "Heartbeats" that I adore. Close and intimate and lush.

Crystal Castles - Self-Titled (2008)

One of my current favorite albums. I think of it as glitch-rock. It's more synth-y than punk, but has a lot of similar aesthetic sensibilities: loud, abrasive, driving, and inspiring. I particularly like to cue up all 3 Crystal Castles albums and listen to them in one go. Loudly.

Gold Panda - Lucky Shiner (2010)

Very smooth, incredibly-produced electronic music. Deeply satisfying, good work music.

Franz Ferdinand - Self-Titled (2004)

Anthemic indie-pop. I'm familiar with most of these songs, and was amazed that they all came from a single album. Buddy Holly meets Lou Reed?

Jolie Holland - Escondida (2004)

Lead singer of The Be Good Tanyas, Jolie Holland's solo album is intimate and enwrapping.

Juana Molina - Tres Cosas (2004)

Quiet and playful yet insistent percussion is the constant backdrop against which Molina's voice plays. And is it ever playful. Her unassuming Spanish is hypnotizing. She has a new release out that I haven't digested yet, but here's another case where I happily queue up 2 or 3 albums in a row and let them blend effortlessly from one to the next.

23 August 2013

Faster, better, cheaper: what is the true value of a computer?

One thing I've had a lot of time to think about over the last 15 years is what exactly "faster and more powerful" means. After a decade of clockspeed wars, we've moved on to more cores, more RAM, longer battery life, less weight, backlit keyboards, etc. A new computer still costs about the same as it did 15 years ago... is it any better?

The more time I spend with the machines, the more I think about usability. A machine is only as powerful as the tasks it can accomplish. I have a $200 netbook (w/linux, of course) that excels at being light. It works great for travel, but not for Netflix. I couldn't write a paper on it, but I can check email, upload photos/files, etc.

In our computer cluster here, we upgrade as we hit limits. Ran out of RAM? Buy more... it's cheap (for a desktop, at least). Chronic, awful wrist pain? Get an ergonomic keyboard. I find a 2-screen setup a very cost-effective productivity boost, whereas the idea of paying 20% more for a 10% bump in speed strikes me as silly.

The whole state of affairs reminds me a little of mean and variance. We always hear about the expected values of things, the mean, but rarely is variance ever reported, and variance is often the most important part. Things like weather, lifespan, salary, time-to-completion? The variance may be more important than the mean. Speed/power tells something about the "maximum potential" of a machine, but not how much use one might get out of it.

I see two different directions --

First, unexpected developments. Examples include multiple cores and SSD. No one really expected these to change the aesthetics of computers. Nonetheless, the reduced latency of SSD is pleasantly surprising, as is the increased responsiveness under load of a multi-core machine. I don't hit enter and wait. My machine flows more at the speed of thought than it used to, even if I have a browser with 50+ tabs open, music playing, etc.

Second, human interface. My new phone is just a little too tall, which makes it just a little more difficult to use, since I can't reach across the screen with one hand. Again, I never expected this to matter, but it's a human-machine interface question rather than a pure machine capabilities question. Backlit keyboards, ergonomic keyboards, featherweight laptops, and long battery lives are all about the human-machine interface. Which is more important than speed, since the human is the whole point....

The aesthetics of interface is how Apple came to rule the world. Their hardware is beautiful and intuitively responsive to human touch. Hell, even their stores are clean, informative, intuitive, and full of fun toys. Personally, I can't stand their walled-garden approach to hardware, and I have the time and energy to coax my machines into greatness through a commandline interface (which remains one of the most powerful human-machine interfaces ever developed), but I *like* Apple hardware. They've dragged the PC industry kicking and screaming into the 21st century of "Humans matter more than machines".

Which they rather presciently highlighted in their 1984 Super Bowl commercial:
http://www.youtube.com/watch?v=2zfqw8nhUwA

Finally, a mind-blowing historical view on the subject, including a 1983 Bill Gates pimping Apple hardware, and Steve Jobs describing how machines should help people rather than the other way around:
http://www.youtube.com/watch?v=xItV5U-V2W4

Everyday revision control

This post has been a long time coming. Over the past year or so, I've gradually become familiar, even comfortable, with git. I've mainly used it for my own work, rather than as a collaborative tool. Most of the folks that I work with don't need to share code on a day-to-day basis, and there's a learning curve that few of my current colleagues seem interested in climbing at this point. This hasn't stopped me from *talking* to my colleagues about git as an important tool in reproducible research (henceforth referred to as RR).

I find the process of committing files and writing commit messages at the end of the day forces me to tidy up. It also allows me to more easily put a project on hold for weeks or months and to then return to it with a clear understanding of what I'd been working on, and what work remained. In short, I use my git commit messages very much like a lab notebook (a countervailing view on git and reproducible research is here, an interesting discussion of GNU make and RR here, and a nice post on RR from knitr author Yihui Xie here ).


Sidenote: I've hosted several projects at https://code.google.com/, and used their git archives, particularly for classes (I prefer the interface to github, though the two platforms are similar). I've also increasingly used Dropbox for collaborations, and I've struggled to integrate Dropbox and Git. Placing the same file under the control of two very different synchronization tools strikes me as a Bad Idea (TM), and Dropbox's handling of symlinks isn't very sophisticated. On the other hand, maintaining 2 different file-trees for one project is frustrating. I haven't found a good solution for this yet...

As far as tools go, most of the time I simply edit files with vim and commit from the commandline. In this sense, git has barely changed my workflow, other than demanding a bit of much-needed attention to organization on my part. Lately, I've started using GUI tools to easily visualize repositories, e.g. to simultaneously see a commit's date, message, files, and diff. Both gitk and giggle have similar functionality -- giggle is prettier, gitk is cleaner. Another interesting development is that Rstudio now includes git tools (as well as native LaTeX and knitr support in the Rstudio editor). This means that a default Rstudio install has all the tools necessary for a collaborator to quickly and easily check out a repository and start playing with it.
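
For the curious, the day-to-day routine is pretty minimal. Here's a sketch of a typical end-of-day commit (the file names and message are made up):

## see what changed since the last commit
git status
git diff
## stage the files and commit with a (hopefully descriptive) message
git add analysis.R notes.md
git commit -m "refactor likelihood function, document new priors"
## browse the full history graphically
gitk --all &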

09 August 2013

Adventures with Android

After months of dealing with an increasingly sluggish and downright buggy Verizon HTC Rhyme, I finally took the leap and got a used Galaxy Nexus. First off, I think it's beautiful. The Rhyme isn't exactly a high-end phone, so its small, unexciting screen isn't particularly surprising. By comparison, the circa-2011 Nexus is a work of art. My first impression of the AMOLED screen is great: dark blacks and luscious color saturation (though I have found it to be annoyingly shiny -- screens shouldn't be mirrors!). It even has a barometer!

One big motivation for a phone upgrade (aside from the cracked screen and aforementioned lag) was being stuck at Android 2.3.4 (Gingerbread). As a technologist, I don't consider myself an early adopter. I prefer to let others sort out the confusion of initial releases, and pick out the gems that emerge. But Gingerbread is well over 2 years old, and a lot has happened since then. The Galaxy Nexus (I have the Verizon CDMA model, codename Toro) is a skin-free, pure Android device. Which means that I am now in control of my phone's Android destiny!

How to go about this? I hit the web and cobbled together a cursory understanding of the Android/Google-phone developer ecosystem as it currently stands. First off, there's xda-developers, a very active community of devs and users. There's an organizational page for information on the Galaxy Nexus here that helped me get oriented. This post made installing adb and fastboot a snap on Ubuntu 12.04 (precise). There's also some udev magic from Google under the "Configuring USB Access" section here that I followed, perhaps blindly (though http://source.android.com is a good primary reference...).
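
For reference, that udev magic boils down to a one-line rule. Here's a minimal sketch of the idea -- 04e8 should be Samsung's USB vendor ID, but double-check against the Google docs, and adjust the group/permissions to taste:

## give normal users access to the phone over USB
echo 'SUBSYSTEM=="usb", ATTR{idVendor}=="04e8", MODE="0666", GROUP="plugdev"' | \
    sudo tee /etc/udev/rules.d/51-android.rules
sudo udevadm control --reload-rules
## replug the phone, then confirm that adb can see it
adb devices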

Next, I downloaded ClockworkMod for my device, rebooted my phone into the bootloader, and installed and booted into ClockworkMod:

## these commands are run on the computer connected to the phone (via USB cable)
## check that the phone is connected; this should return "device"
adb get-state
## reboot the phone into the bootloader
adb reboot-bootloader

## in the bootloader, the phone should show a funny picture of the android bot opened up... reminds me of Bender.
## the actual file (the last argument) will vary by device
fastboot flash recovery recovery-clockwork-touch-6.0.3.5-toro.img
## boot into ClockworkMod 
fastboot boot recovery-clockwork-touch-6.0.3.5-toro.img

This brought me to the touchscreen interface of ClockworkMod. First, I did a factory reset/clear cache as per others' instructions. Then I flashed the files listed here (with the exception of root) via sideloading. There's an option in ClockworkMod that says something like "install from sideload". Selecting this gives instructions -- basically, use adb to send the files, and then ClockworkMod takes care of the rest:

## run this on the computer after selecting "install from sideload" on the phone
## the ROM, "pure_toro-v3-eng-4.3_r1.zip", varies by device
adb sideload pure_toro-v3-eng-4.3_r1.zip
## repeat for all files that need to be flashed

I rebooted into a shiny new install of Jelly Bean (4.3). It's so much cleaner and more pleasant than my old phone. I was also pleasantly surprised to see that the Android Backup service auto-installed all my apps from the Rhyme.
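
As a quick sanity check, the new build can also be confirmed from the computer (with USB debugging re-enabled on the phone):

## query the running build over adb
adb shell getprop ro.build.version.release   ## should print 4.3
adb shell getprop ro.build.display.id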

In the process of researching this, I got a much better idea of what CyanogenMod does. I'm tempted to try it out now, but I reckon I'll wait for the 4.3 release, whenever that happens.


I also found http://www.htcdev.com/bootloader, which offers the prospect of unlocking and upgrading the HTC Rhyme, though I haven't found any ROMs that work for the CDMA version...

13 June 2013

Secure webserver on the cheap: free SSL certificates

Setting up an honest, fully-certified secure web server (i.e. one serving https) on the cheap can be tricky, mainly due to certificates. Certificates are only issued to folks who can prove they are who they say they are. This verification generally takes time and energy, and therefore money. But the great folks at https://www.startssl.com/ have an automated system that verifies identity and automatically issues the associated SSL certificates for free.

Validating an email address is easy enough, but validating a domain is trickier -- it requires a receiving mail server that startssl can send a verification code to. Inbound port 25 (SMTP) is blocked by my ISP, the University of New Mexico (and honestly, I'd rather not run an inbound mail server).

I manage my personal domain through http://freedns.afraid.org/. They provide full DNS management, as well as some great dynamic DNS tools. They're wonderful. But they don't provide any fine-grained email management, just MX records and the like.

The perfect companion to afraid.org is https://www.e4ward.com/. They have mail servers that will conditionally accept mail for specific addresses at a personal domain, and forward that mail to an existing email account. This lets me route specific addresses @mydomain.com, things like postmaster@mydomain.com, to my personal gmail account. E4ward is a real class act. They manually moderate/approve new accounts, so there's a bit of a time lag. To add a domain, they also require proof of control via a TXT record (done through afraid.org).
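
Once the records are in place at afraid.org, it's easy to confirm that the rest of the world sees them (mydomain.com is a placeholder, of course):

## mail for the domain should route to e4ward's servers
dig +short MX mydomain.com
## the TXT record used to prove control of the domain
dig +short TXT mydomain.com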

This whole setup allowed me to prove to startssl.com that I own my domain, without running a mail server or paying for anything other than the domain itself. The result is my own SSL certificates. I'm running a pylons webapp with apache2 and mod_wsgi. In combination with python's repoze.what, I get secure user authentication over https without any snakeoil.
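
And a quick way to check that apache is actually serving the new certificate (again, mydomain.com is just a placeholder):

## show the certificate chain presented by the server
openssl s_client -connect mydomain.com:443 -showcerts < /dev/null
## just the subject and expiration dates
openssl s_client -connect mydomain.com:443 < /dev/null 2>/dev/null | \
    openssl x509 -noout -subject -dates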

Hat-tip to this writeup, which introduced me to e4ward.com and their mail servers.

Finally, there are a number of online tools to query domains. dnsstuff.com was one of the better ones I found. It takes a while to load, but it gives a detailed report of domain configuration, along with suggestions -- a nice tool to verify that everything is working as expected.