
14 November 2014

Numerical Simulations and Data passing: C++, Python, and Protocol Buffers

Problem statement & Requirements

I'm working with a complex C++ simulation that requires a large number of user-specified parameters. Both speed and readability are important. I'd like to define all possible parameters in one (and only one) place, and include sensible defaults that can be easily overridden. Finally, intelligent type-handling would be nice. For convenience, I decided to wrap the C++ simulation in Python setup/glue code. Python is a logical choice here: it's the "available everywhere" glue language, and it has nice standard libraries.

Available libraries

There aren't many data-passing options that work with both C++ and Python. Libconfig, JSON, XML, and Google Protocol Buffers (PB) appear to be the only reasonable options. Here are my thoughts on the first three:
  • Libconfig: Nice clean library, good language support. The big downside is that data structures must be defined both in a data file and in code - i.e. data is "moved" from a file into C++ variables. I feel like libconfig is best for a small number of complex variables, like lists and vectors.
  • JSON: no clear standard C++ library, so-so library docs, and some complaints about speed.
  • XML: Massive overkill.
That leaves PB, which has nice docs for both C++ and Python. All the variables, along with their types and defaults, are defined in a .proto file. The protoc tool auto-generates Python and C++ code from the .proto file. By adding protoc to my Makefile, the C++ classes are regenerated at compile time. This makes for fast and readable C++ code - like using a named dict, but without the speed costs.
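To give a sense of what the generated code feels like on the Python side, here's a minimal sketch. The field names and default shown are illustrative, not the project's actual schema; the real fields live in the .proto file.

import ProtoBufInput_pb2

## protoc generates a message class for each message defined in the .proto file
msg = ProtoBufInput_pb2.setupSim()
print(msg.number_of_days)    ## an unset field returns the default declared in the .proto
msg.number_of_days = 365     ## typed assignment; assigning a string here raises TypeError
print(msg.IsInitialized())   ## True once all required fields (if any) are set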

Solution / Workflow

I'm using Python to read user-supplied values into a set of PB messages, and then serializing the messages to files. C++ then reads the messages from those files at runtime. A Python script run by make synchronizes the file locations between Python and C++ (see the sketch below). I also want to process commandline options for my Python wrapper script. Happily, I can hand a PB message to Python's parser.parse_args() and have it set PB message attributes with setattr(). The last Python step (aside from writing the message to disk) is reading "variable,value" pairs from a .csv file. If a variable has already been set by parse_args, I skip it: the commandline values override the .csv file values.
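The sync script itself isn't shown in the snippets below. A minimal sketch might look like the following; the message-file path is illustrative, and the real script is a bit more involved:

## write the shared message-file locations to both a Python module and a C++
## header, so the two sides always agree on where serialized messages live
files = {'PbFile_setupSim': 'data/setupSim.pb'}   ## illustrative path

with open('ProtoDataFiles.py', 'w') as py_out:
    for name, path in files.items():
        py_out.write('%s = "%s"\n' % (name, path))

with open('ProtoDataFiles.h', 'w') as h_out:
    for name, path in files.items():
        h_out.write('#define %s "%s"\n' % (name, path))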

Summary

Overall, PB makes a very nice data coupler between an interpreted language like Python and a compiled language like C++. Python excels at text processing and rapid prototyping, while C++ is fast and beautiful. PB has a few side-benefits. On the C++ side, it provides some natural namespace encapsulation to manage variable-explosion. Runtime inspection with gdb is easy enough. Finally, storing all the option values used to run each simulation in a standard-format file is handy - it allows tests to re-run the simulation with exactly the same inputs.
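For example, a test harness can reload the exact inputs used for a previous run with something like this (a sketch, assuming the message file written by prepInput() and the imports from the snippets below):

## reload the serialized setup message for a reproducible re-run
saved = ProtoBufInput_pb2.setupSim()
with open(ProtoDataFiles.PbFile_setupSim, 'rb') as infile:
    saved.ParseFromString(infile.read())
## ...then re-run ./sim against the same message file and compare outputs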

Python Snippets

import sys
import csv
import subprocess
from optparse import OptionParser
from google.protobuf.text_format import Merge

import ProtoBufInput_pb2
import ProtoDataFiles

def main():
    ## initialize protobuf, fill with ParseArgs
    setupSim = ProtoBufInput_pb2.setupSim()
    setupSim = ParseArgs(sys.argv[1:], setupSim)
    prepInput(setupSim)
    RunSim()

def ParseArgs(argv, setupSim):
    parser = OptionParser(usage="wrapper.py [options]\nNote: commandline args over-ride values in files.", version=setupSim.version)
    ## these must be valid protocol buffer fields 
    parser.add_option("-t", "--test", dest="testCLI",
        action='store_true', help="Run test suite")
    parser.add_option("-d", "--days", metavar='N',
        dest="number_of_days",
        type='int', help="Number of days to simulate")
    ## parse!
    (setupSim, args) = parser.parse_args(argv, values=setupSim)
    return(setupSim)

def prepInput(setupSim):
    ## options from ParseArgs
    inhandle = open(setupSim.file_options, 'r')
    outhandle = open(ProtoDataFiles.PbFile_setupSim, 'wb')
    reader = csv.reader(inhandle, delimiter=',')
    header = reader.next()
    if not (header == ['variable','value']):
        raise Exception('Incorrect header format') 

    for row in reader:
        ## skip comments, check for 2 fields per row
        if (row[0][0] == '#'):
            continue
        if not (len(row) == 2):
            raise Exception('Problem with value pair: %s' % row)
            
        ## pack the message using text representation
        msgText = '%s : %s' % (row[0], row[1])
        if setupSim.HasField(row[0]):
            print("Skipping config file, keeping commandline value: %s, %s" % (row[0], getattr(setupSim,row[0])))
            continue
        setupSim = Merge(msgText, setupSim)
    ## write out to file for C++ to read
    outhandle.write(setupSim.SerializeToString())
    outhandle.close()


def RunSim():
    subprocess.Popen("./sim").communicate()

if __name__ == "__main__":
    main()

C++ Code Snippets

//PbRead.h
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>
#include "proto/ProtoBufInput.pb.h"

template <typename Type>
void PbRead(Type &msg, const char *filename){
    std::fstream infile(filename, std::ios::in | std::ios::binary);
    if (!infile) {
       throw std::runtime_error("Setup message file not found");
    } else if (!msg.ParseFromIstream(&infile)) {
       throw std::runtime_error("Parse error in message file");
    }
}


// sim.cpp
#include "PbRead.h"
#include "ProtoDataFiles.h"

// protocol buffers get passed around, are globals
ProtoBufInput::setupSim PbSetupSim;

int main(int argc,char **argv)
{
    GOOGLE_PROTOBUF_VERIFY_VERSION;
    PbSetupSim.set_init(true);
    // #define PbFile_setupSim "filename" in ProtoDataFiles.h, written by make
    PbRead(PbSetupSim, PbFile_setupSim);
    //...
    if (PbSetupSim.test_2()){
       //...
    }
}

03 July 2014

Efficient Ragged Arrays in R and Rcpp

When is R Slow, and Why?

Computational speed is a common complaint lodged against R. Some recent posts on r-bloggers.com have compared the speed of R with some other programming languages [1], and showed the favorable impact of the new compiler package on run-times [2]. I and others have written about using Rcpp to easily write C++ functions to speed-up bottlenecks in R [3,4]. With the new Rcpp attributes framework, writing fully vectorized C++ functions and incorporating them in R code is now very easy [5].

On a day-to-day basis, though, R's performance is largely a function of coding style. R allows novice users to write horribly inefficient code [6] that produces the correct answer (eventually). Yet by failing to utilize vectorization and pre-allocation of data structures, naive R code can be many orders of magnitude slower than it needs to be. R-help is littered with the tears of novices, and there's even a (fantastic) parody of Dante's Inferno outlining the common "Deadly Sins of R" [7].

Problem Statement: Appending to Ragged Arrays

I recently stumbled onto an interesting code optimization problem that I *didn't* have a quick solution for, and that I'm sure others have encountered. What is the "R way" to vectorize computations on a ragged array? One example of a ragged array is a list of vectors of differing lengths. Say you need to dynamically grow many vectors by varying lengths over the course of a stochastic simulation. Using a simple tool like lapply, the entire data structure will be allocated anew with every assignment. This problem is briefly touched on in the official Introduction to R documentation, which simply notes that "when the subclass sizes [e.g. vector sizes] are all the same the indexing may be done implicitly and much more efficiently". But what if your data *isn't* rectangular? How might one intelligently vectorize a ragged array to prevent (sloooow) memory allocations at every step?

The obvious answer is to pre-allocate a rectangular matrix (or array) that is larger than the maximum possible vector length, and store each vector as a row (or column?) in the matrix. Now we can use matrix assignment, and for each vector track the index of the start of free space. If we try to write past the end of the matrix, R emits the appropriate error. This method requires some book-keeping on our part. One nice addition would be an S4 class with slots for the data matrix and the vector of free-space indices, as well as methods to dynamically expand the data matrix and validate the object. As an aside, this solution is essentially the inverse of a sparse matrix. Sparse matrices use much less memory at the expense of slower access times [8]. Here, we're using more memory than is strictly needed to achieve much faster access times.

Is pre-allocation and book-keeping worth the trouble? object.size(matrix(1.5, nrow=1e3, ncol=1e3)) shows that a data structure of 1,000 vectors, each of length approximately 1,000, occupies about 8 Mb of memory (1,000 × 1,000 doubles at 8 bytes each). Let's say I resize this structure 1,000 times. Now I'm looking at roughly 8 gigabytes of memory allocations and copies. Perhaps you're getting a sense of what a terrible idea it is to *not* pre-allocate a frequently-resized ragged list?

Three Solutions and Some Tests

Using the above logic, I prototyped a solution as an R function, and then transcribed the result into a C++ function (boundary checks are important in C++). The result is three methods: a "naive" list-append method, an R method that uses matrix assignment, and a final C++ method that modifies the pre-allocated matrix in-place. In C++/Rcpp, functions can use pass-by-reference semantics [9], which can have major speed advantages by allowing functions to modify their arguments in-place. Full disclosure: pass-by-reference semantics require some caution on the user's part. Pass-by-reference is very different from R's functional programming semantics (pass-by-value, copy-on-modify), where side-effects are minimized and an explicit assignment call is required to modify an object [10].

I added a unit test to ensure identical results between all three methods, and then used the fantastic rbenchmark package to time each solution. As expected, the naive method is laughably slow. By comparison, and perhaps counter-intuitively, the R and C++ pre-allocation methods are close in performance. Only with more iterations and larger data structures does the C++ method really start to pull ahead. And by that time, the naive R method takes *forever*.

Refactoring existing code to use the pre-allocated compound data structure (matrix plus indices) is a more challenging exercise that's "left to the reader", as mathematics textbooks oft say. lapply() is conceptually simple, and is often fast *enough*. Some work is required to transcribe code from this simpler style to use the "anti-sparse" matrix (and indices). There's a temptation to prototype a solution using lapply() and then "fix" things later. But if you're using ragged arrays and doing any heavy lifting (large data structures, many iterations), the timings show that pre-allocation is more than worth the effort.

Code

Note: you can find the full code here and here.

Setup: two helper functions are used to generate ragged arrays via random draws. First, draws from the negative binomial distribution determine the length of each new vector (with a minimum length of 1, gen.lengths()), and draws from the exponential distribution fill each vector with data (gen.dat()).
## helper functions
gen.lengths <- function(particles, ntrials=3, prob=0.5) {
    ## a vector of random draws
    pmax(1, rnbinom(particles, ntrials, prob))
}
gen.dat <- function(nelem, rate=1) {
    ## a list of vectors, vector i has length nelem[i]
    ## each vector is filled with random draws
    lapply(nelem, rexp, rate=rate)
}

First, two of the three solutions: the naive lapply()-based append, followed by pre-allocation in R.
## naive method
appendL <- function(new.dat, new.lengths, dat, dat.lengths) {
    ## grow dat by appending to list element i
    ## memory will be reallocated at each call
    dat <- mapply( append, dat, new.dat )
    ## update lengths
    dat.lengths <- dat.lengths + new.lengths
    return(list(dat=dat, len=dat.lengths))
}

## dynamically append to preallocated matrix
## maintain a vector of the number of "filled" elements in each row
## emit error if overfilled
## R solution
appendR <- function(new.dat, new.lengths, dat, dat.lengths) {
    ## grow pre-allocated dat by inserting data in the correct place
    for (irow in 1:length(new.dat)) {
        ## insert one vector at a time
        ## col indices for where to insert new.dat
        cols.ii <- (dat.lengths[irow]+1):(dat.lengths[irow]+new.lengths[irow])
        dat[irow, cols.ii] = new.dat[[irow]]
    }
    ## update lengths
    dat.lengths <- dat.lengths + new.lengths
    return(list(dat=dat, len=dat.lengths))
}


Next, the solution as a C++ function. This goes in a separate file that I'll call helpers.cpp (compiled below).
#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
void appendRcpp(  List fillVecs, NumericVector newLengths, NumericMatrix retmat, NumericVector retmatLengths) {
    // "append" fill oldmat w/  
    // we will loop through rows, filling retmat in with the vectors in list
    // then update retmat_size to index the next free
    // newLenths isn't used, added for compatibility
    NumericVector fillTmp;
    int sizeOld, sizeAdd, sizeNew;
    // pull out dimensions of matrix to fill
    int nrow = retmat.nrow();
    int ncol = retmat.ncol();
    // check that dimensions match
    if ( nrow != retmatLengths.size() || nrow != fillVecs.size()) { 
        throw std::range_error("In appendC(): dimension mismatch");
    }
    for (int ii = 0; ii < nrow; ii++) {
        // vector to insert into this row
        fillTmp = as<NumericVector>(fillVecs[ii]);
        // current fill level of this row, and its size after appending
        sizeOld = retmatLengths[ii];
        sizeAdd = fillTmp.size();
        sizeNew = sizeOld + sizeAdd;
        // bounds check: don't write past the end of the pre-allocated row
        if (sizeNew >= ncol) {
            throw std::range_error("In appendC(): exceeded max cols");
        }
        // iterator for row to fill
        NumericMatrix::Row retRow = retmat(ii, _);
        // fill row of return matrix, starting at first non-zero elem
        std::copy( fillTmp.begin(), fillTmp.end(), retRow.begin() + sizeOld);
        // update size of retmat
        retmatLengths[ii] = sizeNew;
    }
}


Putting the pieces together: a unit test ensures the results of all three methods are identical, and a function that runs each solution with identical data will be used for timing.
## unit test
test.correct.append <- function(nrep, particles=1e3, max.cols=1e3, do.c=F) {
    ## list of empty vectors, fill with append
    dat.list <- lapply(1:particles, function(x) numeric())
    ## preallocated matrix, fill rows from left to right
    dat.r <- dat.c <- matrix(numeric(), nrow=particles, ncol=max.cols)
    ## length of each element/row
    N.list <- N.r <- N.c <- rep(0, particles)
    ## repeat process, "appending" as we go
    for (ii in 1:nrep) {
        N.new <- gen.lengths(particles)
        dat.new <- gen.dat(N.new)
        ## in R, list of vectors
        tmp <- appendL(dat.new, N.new, dat.list, N.list)
        ## unpack, update
        dat.list <- tmp$dat
        N.list <- tmp$len
        ## in R, preallocate
        tmp <- appendR(dat.new, N.new, dat.r, N.r)
        ## unpack, update
        dat.r <- tmp$dat
        N.r <- tmp$len
        ## as above for C, modify dat.c and N.c in place
        appendRcpp(dat.new, N.new, dat.c, N.c)
    }
    ## pull pre-allocated data back into list
    dat.r.list <- apply(dat.r, 1, function(x) { x <- na.omit(x); attributes(x) <- NULL; x } )
    ## check that everything is identical across the three methods
    identical(dat.r, dat.c) && identical(N.r, N.c) &&
    identical(dat.list, dat.r.list) && identical(N.list, N.r)
}

## timing function, test each method
test.time.append <- function(nrep, particles=1e2, max.cols=1e3, append.fun, do.update=T, seed=2) {
    ## object to modify
    N.test <- rep(0, particles)
    dat.test <- matrix(numeric(), nrow=particles, ncol=max.cols)
    ## speed is affected by size, 
    ## so ensure that each run the same elements
    set.seed(seed)
    for (irep in 1:nrep) {
        ## generate draws
        N.new <- gen.lengths(particles)
        dat.new <- gen.dat(N.new)
        ## bind in using given method
        tmp <- append.fun(dat.new, N.new, dat.test, N.test)
        if(do.update) {
            ## skip update for C
            dat.test <- tmp$dat
            N.test <- tmp$len
        }
    }
}


Finally, we run it all:
library(rbenchmark)
## Obviously, Rcpp requires a C++ compiler
library(Rcpp)
## compilation, linking, and loading of the C++ function into R is done behind the scenes
sourceCpp("helpers.cpp")

## run unit test, verify both functions return identical results
is.identical <- test.correct.append(1e1, max.cols=1e3)
print(is.identical)

## test timings of each solution
test.nreps <- 10
test.ncols <- 1e3
timings = benchmark(
    r=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendR),
    c=test.time.append(test.nreps, max.cols=test.ncols, do.update=F, append.fun=appendRcpp),
    list=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendL),
    replications=10
)

## Just compare the two faster methods with larger data structures.
test.nreps <- 1e2
test.ncols <- 1e4
timings.fast = benchmark(
    r=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendR),
    c=test.time.append(test.nreps, max.cols=test.ncols, do.update=F, append.fun=appendRcpp),
    replications=1e1
)


Benchmark results show that the list-append method is roughly 400 times slower than the improved R method, and more than 900 times slower than the C++ method (timings). As we move to larger data structures (timings.fast), the advantage of modifying in-place with C++ rather than having to explicitly assign the results quickly adds up.
> timings
  test replications elapsed relative user.self sys.self
2    c           10   0.057    1.000     0.056    0.000
3 list           10  52.792  926.175    52.674    0.036
1    r           10   0.128    2.246     0.123    0.003

> timings.fast
  test replications elapsed relative user.self sys.self
2    c           10   0.684    1.000     0.683    0.000
1    r           10  24.962   36.494    24.934    0.027

References


[1] How slow is R really?
[2] Speeding up R computations Pt II: compiling
[3] Efficient loops in R - the complexity versus speed trade-off
[4] Faster (recursive) function calls: Another quick Rcpp case study
[5] Rcpp attributes: A simple example 'making pi'
[6] StackOverflow: Speed up the loop operation in R
[7] The R Inferno
[8] Sparse matrix formats: pros and cons
[9] Advanced R: OO field guide: Reference Classes
[10] Advanced R: Functions: Return Values

30 May 2014

Tools for Online Teaching

Last semester (Fall 2013), I organized and taught an interdisciplinary, collaborative class titled Probability for Scientists. Getting 4 separate teachers on the same page was a challenge, but as scientists we're used to communicating over email, and CC'ing everyone worked well enough. Throw in a shared dropbox folder, a shared google calendar, and a weekly planning meeting, and instructor collaboration went pretty smoothly.

It was more challenging to organize the class so that we could easily provide students with up-to-date course information and supplemental material. We ended up using blogger, which has some key benefits and disadvantages. It was *really* easy to set up and add permissions for multiple individuals. This allowed any of us to put up images (for example, photos of the whiteboard at the end of class) and post instructions. One downside we heard from students was an apparent lack of organization. I attempted to organize the blog with intelligent post labels, along with the "Labels widget" (which shows all possible labels and the number of posts per label on the right hand sidebar). I also included the "Posts by Date" sidebar, so all the content could be accessed chronologically. I understand their comments, though I'm not convinced that a single, monolithic page of information is the right direction either.

One feature of blogger that proved helpful was its "Pages", static html files or links that appear (by default) at the top of the blog. These links are always present no matter where in the blog you navigate to. We added the syllabus and course calendar here. And here's where things start to get interesting.

All four instructors needed the ability to collaboratively edit and then publish public-facing material for students (like the syllabus), as well as edit private course material, like grades. Dropbox makes a natural choice for private material, whereas a collaborative source repository like GitHub makes a natural choice for public material. Personally, I'm more familiar/comfortable with Google Code, but I don't think the choice of services here is material. [Sidenote: If you have a pro GitHub account, the choice seems easy: use a private GitHub repo.]

Using a public git repo gave us a few things that I really liked. First off, our class schedule was a simple HTML table in the repo that we pointed at with a static link via the Pages widget, above. When I needed to update an assignment due date, I did a quick edit-and-commit on my laptop, pushed it to the archive, and the change magically appeared on the blog. If several instructors are simultaneously working on course material, this essentially provides an audit trail of who did what. This is kindergarten-level stuff for software developers. But these are tools that most teachers aren't very familiar with, even though they could benefit from them. For example, getting colleagues to set up git represents a non-trivial challenge. Nonetheless, there are other benefits of learning git if you're a scientist (that I won't address here, though see here for some thoughts).

In our class, I also used R for data visualizations. I let the awesome knitr package build the appropriate .png files from my R code (along with pdfs for download, if needed). Again, it sounds simple, but adding the generated figures to the git archive allowed me to quickly link to them in blog posts, and then update them later if needed.

I would have preferred having the blog posts themselves under revision control, but only "Pages" can point to external source html; doing the same for posts would require javascript, which I didn't have time for. Still, given the simplicity of setup and the low (i.e. free) cost of use, I didn't find posts to be much of an issue. Blogger allows composition in pure html, without all the *junk*, which helps. But having to re-upload and link a figure every time I find an error? Definitely a pain.

For this class, we also set up an e4ward.com address that pointed to my email box. This allowed me to publicly post the class address without fear of being spammed forevermore, and allowed me to easily identify all class email. Having a single instructor responsible for email correspondence worked well enough. As an early assignment, students were asked to send a question to the class email address. This is a nice way to get to know folks, and incentivize them to go to "digital office hours", e.g. ask good questions over email. We did have some issues with e4ward.com towards the end of the class. This *sucks* - students panic when emails get lost, and it's hard to sort out where the problem is. Honestly, I don't know the answer here.

As a sidenote, I pushed the use of Wikipedia (WP) heavily in this class, and referred to it often myself. This is not possible in all fields, but WP articles in many of the hard sciences are now both technically detailed and accessible. Probability and statistics articles are some of the best examples here, since the "introductory concept" articles are used by a large number of individuals/fields, and aren't the subjects of debate (compare this with the WP pages of, for example, Tibet or Obama). I also discovered WP's "Outlines" pages for the first time. If you ever use statistics, I strongly recommend spending some time with WP's Outline of Statistics. It's epic.

One final tool we used quite a bit was LimeSurvey. As I was planning the class, I went looking for an inexpensive yet flexible survey platform. I was generally disappointed by the offerings. My requirements included raw data export and more than 10 participants; these features tend *not* to be free, and survey tools can get pricey. Enter LimeSurvey (henceforth LS). It's open source, well-documented, versatile, and simple to use. I was reluctant to invest *too* much effort in tools I wasn't sure we'd use, but I got LS running in less than 2 hours. To be fair, our lab has an on-campus, always-on server, and apache2 is already installed and configured. This would have been an annoying out-of-pocket expense had I needed to rent a VPS, though you can now get a lot of VPS for $5/mo. [sidenote: getting our campus network to play nice with custom DNS was a whole other issue...]

LS let me easily construct data-entry surveys, allowing each student to enter, for example, their 25 flips of 3 coins, or their sequence of wins and losses playing the Monty Hall problem. Students can quickly enter this data from their browser at their convenience. At its best, visualizing the class data can give students a sense of ownership of, and purpose in, the in-class exercises. LS also allowed us to conduct initial background surveys, as well as anonymous, customized class "satisfaction" surveys mid-course to find out what was working and what wasn't. We ran into a few administrative issues with LS, but overall it's a powerful and stable data collection platform. Students seemed happy with it, and it provided us with valuable feedback.

What would I do differently? I failed to set up an email list early on. It would have been trivial using, e.g. Google Groups. At the time, it seemed redundant, both for the students and us. Didn't they already have the blog? In retrospect, it would have proved useful at several points to communicate "New Blog Post" or "Due Date Changed, check calendar". I've learned this semester to expect that spoken instructions are not necessarily heeded, and that receiving administrative details in writing from multiple sources (blog, calendar, mailing list) is a Good Idea ™.

Along the same vein, a minor but needed improvement is breaking assignments out into a separate table. I originally combined assignments with the course calendar in the interest of simplicity, giving students all the relevant information in one place. In the end, it was just confusing.

Overall, the class went very well. We have received positive feedback from the students, and we now have a detailed digital record of the course. PDFs of their final project posters are now in the archive, where they will live in perpetuity. Personally, I'm not ready to teach a MOOC yet, but I'm sold on digital tools as useful supplements to in-class material. They let me do more with less time. These tools helped the class "organize itself", and reduced communication overhead between instructors. To older or less technically-inclined teachers, some of this might seem difficult or confusing. On one hand, you really only need *one* instructor to coordinate the tech - it's not very hard for instructors to use the tools once they're set up. On the other hand, some of the above pieces are very easy to implement, and support from a department administrator or technically-inclined teaching assistant (TA) might be available for the more challenging pieces. A shared LS installation, for example, should be trivial for any department that runs its own linux server (e.g. math, physics, chem).

In my mind, a key goal here is that technology not get in the way. Many of the commercial "online learning products" that I've seen adopted by universities make simple tasks complicated while lacking flexibility. In trying to do everything for everyone, they often end up doing nothing for anyone (or take a high level of skill or experience to use effectively). I far preferred using several discrete tools, each of which does a single job well (class email, blog to communicate, git repo to hold files).

Are there any interesting tools worth trying out next time? I wonder how a class twitter-hashtag would work...