
29 February 2012

Custom Amazon EC2 config for Rstudio

Introduction

This post is a work in progress building on the previous post. It's my attempt to simultaneously learn Amazon's AWS tools and set up R and Rstudio Server on a customized "cloud" instance. I look forward to testing some R jobs that have large memory requirements or are very parallelizable in the future.

To start, I followed the instructions here to get a vanilla Ubuntu/Bioconductor EC2 image up and running. This was a nice exercise -- props to the bioconductor folks for setting up this image.

Edit -- another look

After playing with this setup more and trying to shrink the root partition (as per these instructions), I realized there's a tremendous amount of cruft in this AMI. It's over 20GB, there's a weird (big) /downloads/ folder lying around, and just about every R package ever is in /usr/local/lib64/R. I cut it down to ~6GB by removing these two directories.


If I were to do this again, I would use the official Canonical images (more details here), and manually install what I need. Honestly, this is more my style -- it means fewer unknown hands have touched my root, and there's a clear audit chain of the image. Aside from installing a few extra packages (Rstudio server, for example), the rest of the instructions should be similar.

Modifications

I proceeded to lock it down -- add an admin user, prevent root login, etc.
Then I set up apache2 to serve Rstudio server over HTTPS/SSL.
In the AWS console, I edited Security Groups to add a custom Inbound rule for TCP port 443 (https). I then closed off every other port besides 22 (ssh).
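
The inbound rule can also be added from a terminal instead of the web console. The sketch below assumes the present-day AWS CLI (`aws`), which is not the tool used in this post, and a placeholder security group ID. The helper `run_or_print` is hypothetical: it only prints the command unless RUN=1 is set, so nothing is changed by accident.

```shell
#!/bin/sh
# Hypothetical helper: print the command by default; run it only when RUN=1.
run_or_print() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Placeholder security group ID (an assumption; substitute your own).
SG_ID="sg-0123456789abcdef0"

# Allow inbound HTTPS (TCP port 443) from any address.
open_https() {
    run_or_print aws ec2 authorize-security-group-ingress \
        --group-id "$SG_ID" --protocol tcp --port 443 --cidr 0.0.0.0/0
}

open_https
```

Review the printed command first, then rerun with RUN=1 (and configured AWS credentials) to apply it.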

Below is the commandline session that I used to do it, along with annotations and links to some files.
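## create a non-root admin user
adduser myusername
## add the user to the group that grants sudo rights
## (on a stock Ubuntu install this group is typically "admin" or "sudo")
adduser myusername sudoers
## add correct key to ~myusername/.ssh/authorized_keys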

vi /etc/ssh/sshd_config
## disable root login (set "PermitRootLogin no")
/etc/init.d/ssh restart
## now log in as myusername via another terminal to make sure it works, and then log out as root
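
The root-login change can also be scripted rather than made by hand in an editor. This is a minimal sketch, not the session actually used here: the function name `disable_root_login` is made up for illustration, and the demo runs against a scratch file instead of the live /etc/ssh/sshd_config.

```shell
#!/bin/sh
# Sketch: set "PermitRootLogin no" in an sshd_config-style file.
# The function name and the scratch file are illustrative assumptions.
disable_root_login() {
    conf="$1"
    if grep -Eq '^[#[:space:]]*PermitRootLogin' "$conf"; then
        # Rewrite any existing (possibly commented-out) directive in place,
        # keeping a backup copy next to the file as <file>.bak.
        sed -i.bak -E 's/^[#[:space:]]*PermitRootLogin.*/PermitRootLogin no/' "$conf"
    else
        # No directive present: append one.
        printf 'PermitRootLogin no\n' >> "$conf"
    fi
}

# Demo on a scratch copy rather than the real config file.
demo_conf="$(mktemp)"
printf 'Port 22\nPermitRootLogin yes\n' > "$demo_conf"
disable_root_login "$demo_conf"
cat "$demo_conf"
```

On a real server the function would be run with root privileges against /etc/ssh/sshd_config, and `sshd -t` can check the file for syntax errors before the ssh service is restarted.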


Next, I set up dynamic DNS using http://afraid.org (I've previously registered my own domain and added it to their system). I use a script file made specifically to work with AWS -- it's very self-explanatory.

## change hostname to match afraid.org entry
sudo vi /etc/hostname
sudo /etc/init.d/hostname restart

## Now it's time to make Rstudio server a little more secure
## from http://rstudio.org/docs/server/running_with_proxy
sudo apt-get install apache2 libapache2-mod-proxy-html libxml2-dev
sudo a2enmod proxy
sudo a2enmod proxy_http

## based on instructions from http://beeznest.wordpress.com/2008/04/25/how-to-configure-https-on-apache-2/
openssl req -new -x509 -days 365 -keyout crocus.key -out crocus.crt -nodes -subj \
'/O=UNM Biology Wearing Disease Ecology Lab/OU=Christian Gunning, chief technologist/CN=crocus.x14n.org'

## change permissions, move to default ubuntu locations
## (chown uses the owner:group form; the older owner.group form is deprecated)
sudo chmod 444 crocus.crt
sudo chown root:ssl-cert crocus.crt
sudo mv crocus.crt /usr/share/ca-certificates

sudo chmod 640 crocus.key
sudo chown root:ssl-cert crocus.key
sudo mv crocus.key /etc/ssl/private/
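
Before pointing Apache at these files, it can be worth confirming that the certificate and the key actually belong together. The sketch below compares their RSA moduli with openssl; the function name `cert_matches_key` is hypothetical, and the check assumes an RSA key, which is what `openssl req` produces by default.

```shell
#!/bin/sh
# Sketch: succeed only if an RSA private key and a certificate share a modulus.
# The function name is an illustrative assumption.
cert_matches_key() {
    cert="$1"
    key="$2"
    cert_mod="$(openssl x509 -noout -modulus -in "$cert")" || return 2
    key_mod="$(openssl rsa -noout -modulus -in "$key")" || return 2
    [ "$cert_mod" = "$key_mod" ]
}

# Example (run in the directory holding the freshly generated files):
#   cert_matches_key crocus.crt crocus.key && echo "certificate and key match"
```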

sudo a2enmod rewrite
sudo a2enmod ssl

sudo vi /etc/apache2/sites-enabled/000-default
sudo /etc/init.d/apache2 restart

You can see my full apache config file here:
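
That file is not reproduced in this post. As a rough sketch only (not the actual config), an HTTPS virtual host that proxies to Rstudio server might look like the one written out below. It assumes Rstudio server is listening on its default port 8787, reuses the certificate and key paths from the commands above, and writes to a scratch file so it can be reviewed before anything under /etc/apache2 is touched.

```shell
#!/bin/sh
# Sketch only: write an HTTPS reverse-proxy vhost to a scratch file for review.
# Assumptions: Rstudio server on its default port 8787; certificate and key
# in the locations used earlier in this post.
vhost_file="$(mktemp)"

cat > "$vhost_file" <<'EOF'
<VirtualHost *:443>
    ServerName myhostname.mydomain.org

    SSLEngine on
    SSLCertificateFile    /usr/share/ca-certificates/crocus.crt
    SSLCertificateKeyFile /etc/ssl/private/crocus.key

    # Forward every request to the local Rstudio server.
    ProxyPass        / http://localhost:8787/
    ProxyPassReverse / http://localhost:8787/
</VirtualHost>
EOF

cat "$vhost_file"
```

After merging something like this into the site file, `sudo apache2ctl configtest` reports syntax errors before the restart. Setting `www-address=127.0.0.1` in /etc/rstudio/rserver.conf should additionally keep Rstudio server from answering on the public interface directly.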

Conclusions

Now I access Rstudio on the EC2 instance with a browser via:
https://myhostname.mydomain.org

I found that connecting to the Rstudio server web interface gave noticeable lag. Most annoyingly, key-presses were missed, meaning that I kept hitting enter on incorrect commands. Connecting to the commandline via SSH worked much better.

Another annoyance was that Rstudio installs packages into something like ~/R/libraries, whereas the commandline R installs them into ~/R/x86_64-pc-linux-gnu-library/2.14. Is this a general feature of Rstudio? It's a little confusing that this isn't standardized.

Another quirk -- I did all of this on a Spot Price instance. After all of these modifications, I discovered that Spot instances can't be "stopped" (the equivalent of powering down), only terminated (which discards all of the changes). After some looking, I discovered that I could "Create an Image" (EBS AMI) from the running image. This worked well -- I was able to create a new instance from the new AMI that had all of the changes, terminate the original instance, and then stop the new instance.
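
The "Create an Image" step was done in the web console. For reference, a roughly equivalent call with the present-day AWS CLI (an assumption; it is not the tool used here) is sketched below. The instance ID and image name are placeholders, and the function only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Placeholders (assumptions): substitute your own instance ID and image name.
INSTANCE_ID="i-0123456789abcdef0"
IMAGE_NAME="rstudio-custom"

# Print the AWS CLI call by default; execute it only when RUN=1 is set.
create_ami() {
    set -- aws ec2 create-image --instance-id "$INSTANCE_ID" --name "$IMAGE_NAME"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

create_ami
```

By default `create-image` reboots the instance so the filesystem is captured in a consistent state; the `--no-reboot` flag skips that at the cost of that guarantee.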

All of this sounds awfully complicated. Overall, this is how I've felt about Amazon AWS in general and EC2 in particular for a while. The docs aren't great, the web-based GUI tools are sometimes slow to respond, and the concepts are *new*. But I'm glad I waded in and got my feet wet. I now understand how to power up my customized image on a micro instance for less than $0.10 an hour to configure and re-image it, and how to run that image on an 8 core instance with 50+GB RAM for less than a dollar an hour via Spot Pricing.

28 February 2012

Adventures in R Studio Server: Apache2, Https, Security, and Amazon EC2.

I just put a fresh install of Ubuntu Server (10.04.4 LTS) on one of our machines.  As I was doing some post-install config, I accidentally installed Rstudio Server.  And subsequently fell down an exciting little rabbit-hole of server configuration and "ooooh-lala!" playtime.

A friend sang the praises of Rstudio Server to me recently, and I filed it under "things to ignore for now".  Just another thing to learn, right?  Turns out, the Rstudio folks do *great* work and write good docs, so I hardly had to learn anything.  I just had to dust off my sysadmin skills and fire up some Google.

I'm a little concerned about running web services on public-facing machines.  Even more so, given that R provides fairly low-level access to operating system services.  Still, I was impressed to see system user authentication.

I followed the docs for running apache2 as a proxy server, and learned a little about apache in the process.  Since I made it this far, I figured I'd run it through https/ssl, add some memory limitations, etc.  I'm still not entirely convinced this is secure -- it seems that running it in a virtual machine or chroot jail would be ideal.

On the other hand, I ran across this post on running Rstudio Server inside Amazon EC2 instances.  Nighttime EC2 spot prices on "Quadruple Extra Large" instances (68.4 GB of memory, 8 virtual cores with 3.25 EC2 Compute Units each) fell below $1 an hour tonight, which is cheap enough to play with for an hour or two -- take it through some paces and see how well it does with a *very* *large* *job* or two.  Instances can now be stopped and saved to EBS (elastic block storage), and so only need to be configured once, which really simplifies matters. In fact, I'm wondering if Rstudio (well, R, really) is my "killer app" for EC2.

Overall, I was really impressed at how fast and easy this was to get up and running. Fun times ahead!