Putting the R in romantic

I've used R for a lot of tasks unrelated to statistics or data analysis. For example, it's usually a lot easier for me to write an intelligent batch file/folder renamer or copier as an R script than as a bash shell script.

Earlier today I made a collection of photos that I wanted to put on a digital picture frame to mail to my partner. I also made a set of messages that I wanted to show up randomly. What I needed to do was shuffle the set of 260+ images in such a way that no two images from a particular subset would show up consecutively.

To make referencing the images easier, let's call the overall set of $n$ images $Y = \{y_1, \ldots, y_n\}$, and let $X \subset Y$ be the images we do not want to have consecutive pairs of after the shuffling. Let $Y' = (y_{(1)}, \ldots, y_{(n)})$ be the shuffled sequence of images.

This was really easy to accomplish in R. I started with k <- 0; set.seed(k) and shuffled all the images (using sample.int()). Then I checked whether that very specific requirement was met.

If we did end up with a pair of consecutive images from $X$, I incremented $k$ by 1 and repeated the procedure until $\{y_{(i-1)}, y_{(i)}\} \not\subset X ~\forall~i = 2, \ldots, n$.
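
Here's a minimal sketch of that loop; the vectors images and special are hypothetical stand-ins for $Y$ and $X$:

images <- sprintf("photo%03d.jpg", 1:260)  # hypothetical Y
special <- images[1:40]                    # hypothetical X, a subset of Y
k <- 0
repeat {
    set.seed(k)
    shuffled <- images[sample.int(length(images))]
    in_x <- shuffled %in% special
    # TRUE wherever positions i-1 and i both come from X:
    consecutive <- in_x[-1] & in_x[-length(in_x)]
    if (!any(consecutive)) break
    k <- k + 1
}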

I think what makes R really nice for tasks like this is its vectorized functions and binary operators like which(), %in%, order(), duplicated(), sample(), sub(), and grepl(), as well as data.frames that you can expand with additional columns, such as indicators of whether row $m$ is related to row $m-1$.

Next time you have a repetitive, time-consuming task to do on the computer, I urge you to consider writing a script/program to do it for you, even if you know R but have never thought of it as a tool for file organization.

Cheers~

Mostly-free resources for learning statistics

In the past year or two I've had several friends approach me about learning statistics because their employer/organization was moving toward a more data-driven approach to decision making. (This brought me a lot of joy.) I firmly believe you don't actually need a fancy degree and tens of thousands of dollars in tuition debt to be able to engage with data, glean insights, and make inferences from it. And now, thanks to many wonderful statisticians on the Internet, there is a plethora of freely accessible resources that enable curious minds to learn the art and science of statistics.

First, I recommend installing R and RStudio, which is what you'll actually use to do the statistics. They're free and are what I use for almost all of my statistical analyses. Most of the links in this post involve learning by doing statistics in R.

Okay, now on to learning stats…

There's Data Analysis and Statistical Inference (plus its interactive companion course) by Mine Çetinkaya-Rundel (Duke University). She also co-wrote the OpenIntro Statistics book (available for free as a PDF).

Free, self-paced online courses from trustworthy institutions:

Not free online courses from trustworthy institutions:

Free miscellaneous resources:

Book recommendations:

Phew! Okay, that should be enough. Feel free to suggest more in the comments below.

Freelancing Hourly Rate Calculator (Shiny app)

The other day I got tired of coming up with essentially random hourly rate estimates for freelancing projects, because I had never actually sat down to figure out what the hell my hourly rate should be. I found a great blog post, "How to Calculate Hourly Freelance Rates for Web Design, Development Work", and made a spreadsheet with the appropriate formulas.

But then I wanted to combine the explanation of the blog post with the dynamic aspect of the spreadsheet. So I opened up R and wrote a Shiny app where you can specify all the different numbers and percentages and it’ll update the plots and details of how the final rate was calculated.
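
If you just want the gist of the arithmetic, it boils down to something like the sketch below. The variable names and numbers here are my own illustration, not the app's actual inputs or defaults:

# Hypothetical inputs -- adjust to your situation
desired_salary  <- 60000  # take-home target per year
annual_expenses <- 10000  # software, hardware, insurance, etc.
profit_margin   <- 0.10   # cushion on top of costs
weeks_worked    <- 48     # leaves room for vacation/sick time
hours_per_week  <- 40
billable_frac   <- 0.60   # fraction of work hours you can actually bill

billable_hours <- weeks_worked * hours_per_week * billable_frac
hourly_rate <- (desired_salary + annual_expenses) * (1 + profit_margin) / billable_hours
round(hourly_rate, 2)  # 66.84 with the numbers above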

If you want to figure out what you should be charging your clients, go to http://bearloga.shinyapps.io/freelancr/

Words, words, words

I needed a list of adverbs/adjectives that start with "do." First I tried Wolfram|Alpha, but it couldn't filter the list down to adjectives, and there's no way to build a query pipeline (at least with a free account). I ended up using the wordnet package in R:

require(magrittr)  # install.packages('magrittr')
require(wordnet)   # install.packages('wordnet')
getTermFilter('StartsWithFilter', 'do', TRUE) %>%  # terms starting with "do"
    getIndexTerms('ADVERB', 1e4, .) %>%            # look up matching adverbs
    sapply(getLemma) %>%                           # extract each term's lemma
    paste(collapse = ', ')

Output: doctrinally, doggedly, doggo, dogmatically, dolce, dolefully, doltishly, domestically, domineeringly, dorsally, dorsoventrally, dottily, double, double quick, double time, doubly, doubtfully, doubtless, doubtlessly, dourly, dowdily, down, down the stairs, downfield, downhill, downright, downriver, downstage, downstairs, downstream, downtown, downward, downwardly, downwards, downwind
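
To get the adjectives as well, the same pipeline should work with 'ADJECTIVE' in place of 'ADVERB':

getTermFilter('StartsWithFilter', 'do', TRUE) %>%
    getIndexTerms('ADJECTIVE', 1e4, .) %>%
    sapply(getLemma) %>%
    paste(collapse = ', ')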

P.S. If you're on OS X, you can use MacPorts to install WordNet with: sudo port install wordnet

Then select the port-installed dictionary in R with: setDict('/opt/local/share/WordNet-3.0/dict')

Guide to Shiny apps with Shiny Server on Amazon EC2

Preface: posting this for archive purposes only. This was the first of its kind and has been succeeded by better guides.

I am writing this guide because it did not exist when I decided to put my 2010 US Census Shiny App on Amazon's servers (demo here). Surely I can't be the only one who's never had any experience with EC2 (or SSH or vi, for that matter).

So here's a newbie's guide for newbies: deploying your rad Shiny app on Amazon Elastic Compute Cloud (EC2) from scratch. It took me sixteen 30 Rock episodes to figure this stuff out (counting the time it took to download the census data), but hopefully you'll have your app up and running in less time than...a BBC Sherlock episode.

What are Shiny and Shiny Server?

Shiny is an R package developed by the incredible folks at RStudio for making interactive web applications. Shiny Server is a server program that makes Shiny applications available over the web.

Amazon Elastic Compute Cloud (EC2)

Amazon EC2 is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers. If you have an Amazon account then you can start using Amazon Web Services (AWS) for free! AWS Free Tier includes 750 hours of Linux or Windows Micro Instances each month for one year.

Setting up Shiny web apps on Amazon EC2

Launching an EC2 Instance

The process for creating and launching a new EC2 instance is pretty straightforward. I'd recommend going with Ubuntu 11.10 and then thinking carefully about how much space you'll be using. For example, my app uses the 2010 US Census data, which comes in several R packages totaling 4.7 GB. You'll have to create a security key pair and download it. Remember where it gets downloaded to, as you'll need it to connect to the instance through SSH.

SSH

If you're on Linux or OS X, you can use Terminal and run all the commands from there. On Windows you'll have to download and install PuTTY. I'm writing this on a Mac, so my apologies if you run into a problem on Windows. May I recommend this?

cd Downloads
chmod 400 that_key_you_downloaded_earlier.pem  # ssh refuses keys that are world-readable
ssh -i that_key_you_downloaded_earlier.pem ubuntu@your-ec2-instance-address.amazonaws.com

Node.js

Shiny Server runs on Node.js, so we install that first:

sudo apt-get update
sudo apt-get install python-software-properties python g++ make
sudo add-apt-repository ppa:chris-lea/node.js
sudo apt-get update
sudo apt-get install nodejs npm

Installing R and packages

Before you start installing R and Shiny, you need to add a CRAN source to APT so that when you install R the latest version (2.15.2 at the time of writing) gets installed. If you skip this step then you'll end up installing 2.12 and nothing will work.

Usually you'd open the sources list in gedit or another text editor with a graphical interface. In this case we'll have to use vi to add our R source. I hadn't used vi until today and found this cheat sheet invaluable for learning it.

sudo vi /etc/apt/sources.list.d
# this opens the directory listing; there should be a .list file
# there from the Node.js step -- open that file for editing
# type in:
o
# you will then be able to type text on a new line
# type in:
deb http://lib.stat.cmu.edu/R/CRAN/bin/linux/ubuntu/ version/
# where version=oneiric or precise or whatever
# make sure to have a space there! Otherwise you'll get the Malformed Line error.
# You can use other CRAN repos; you're not limited to CMU.
# [ESC] to finish editing. To save changes and exit:
:x

Once you're done with that, it's time to install R. Just run the following code (thanks to Ananda Mahto from Stack Overflow):

gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install r-base

Then you'll install the Shiny package and Shiny Server itself:

sudo su - -c "R -e \"install.packages('shiny', repos='http://cran.rstudio.com/')\""
sudo npm install -g shiny-server
# Create a system account to run Shiny apps
sudo useradd -r shiny
# Create a root directory for your website
sudo mkdir -p /var/shiny-server/www
# Create a directory for application logs
sudo mkdir -p /var/shiny-server/log
# To be able to run shiny-server as a process later on:
wget https://raw.github.com/rstudio/shiny-server/master/config/upstart/shiny-server.conf
sudo cp shiny-server.conf /etc/init/shiny-server.conf

ui.R and server.R

We need to download ui.R and server.R (and any auxiliary files). This is done with wget:

sudo apt-get install wget
wget https://raw.github.com/.../master/myapp/server.R
wget https://raw.github.com/.../master/myapp/ui.R
mkdir myapp
mv ui.R myapp/ui.R
mv server.R myapp/server.R
sudo cp -R ~/myapp /var/shiny-server/www/
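
If you don't have an app of your own handy yet, here's a minimal toy pair of files you could use to test the server (a generic example, not my census app):

# ui.R
library(shiny)
shinyUI(pageWithSidebar(
    headerPanel("Hello, Shiny Server"),
    sidebarPanel(sliderInput("n", "Observations:", min = 10, max = 500, value = 100)),
    mainPanel(plotOutput("hist"))
))

# server.R
library(shiny)
shinyServer(function(input, output) {
    output$hist <- renderPlot(hist(rnorm(input$n), main = "Random normals"))
})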

Run sudo start shiny-server to start and sudo stop shiny-server to stop. Open your web browser and go to http://[hostname]:3838/myapp/

Optional: Elastic IPs and Dyn DNS

Problem: You own a domain name and DNS hosting from a service like Dyn.com and want people to reach your Shiny app by going to your domain. Solution: In the AWS Management Console, go to EC2 and then Elastic IPs. You can allocate a new address and then associate it with an instance. You can then use that IP when you create a new hostname in the Dyn control panel.

Cheers

You've got yourself a Shiny app running on Shiny Server on Amazon EC2. Remember to watch out for those free tier limits!

Google Forms Multi-response Separator

Previously: Using Google Drive to make Survey Forms and importing answers into R

I noticed that if you make a question that allows multiple responses, then each respondent's answer to that question is stored as a single comma-separated concatenation of their selections. Not very useful for data analysis.

Suppose two items on the survey ask the responder to select which Apple & Microsoft products they have used in the past 6 months. When you import the responses into R using that importer code, you might see responses 1. “iPad, iPhone” and 2. “iPod Touch, iPhone, iMac” in the Apple column, and 1. “Xbox, Surface” and 2. “Zune” in the Microsoft column.
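
Concretely, the imported data frame might look like this (a mock-up of those two responses, not real survey data):

survey.data <- data.frame(
    Apple = c("iPad, iPhone", "iPod Touch, iPhone, iMac"),
    Microsoft = c("Xbox, Surface", "Zune"),
    stringsAsFactors = FALSE
)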

So we run

x <- separate(survey.data, vars = c("Apple", "Microsoft"))

which outputs a list with two components, each of which is a data frame with an indicator variable for each possible response (using the first 5 characters of the response as the column name). If we combine these two components into one data frame using cbind() we might see:

  Apple.iPad Apple.iPod Apple.iPhone Apple.iMac Apple.MacBo Microsoft.Xbox Microsoft.Surfa Microsoft.Zune
1          1          0            1          0           0              1               1              0
2          0          1            1          1           0              0               0              1
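
That cbind() combination is a one-liner (using the x from above):

combined <- cbind(x$Apple, x$Microsoft)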

The code for the separator function is:

separate <- function(x, vars) {
    # x    : data frame of survey responses
    # vars : character vector of column names to separate
    temp <- list()
    for (i in seq_along(vars)) {
        # Split each cell on ", " to get the individual responses:
        temp[[i]] <- sapply(x[, vars[i]], function(y) {
            strsplit(as.character(y), ", ")
        })
    }
    # Unique responses, truncated to their first 5 characters:
    lvls <- lapply(temp, function(y) {
        unname(sapply(substr(unique(unlist(y)), 1, 5), function(z) {
            if (is.na(z)) 
                "NA" else z
        }))
    })
    n.lvls <- sapply(lvls, length)
    VARS <- list()
    for (i in seq_along(vars)) {
        OBS <- as.data.frame(matrix(0, nrow = length(temp[[i]]), ncol = n.lvls[i]))
        names(OBS) <- lvls[[i]]
        for (j in seq(along = temp[[i]])) {
            if (!(is.na(temp[[i]][[j]])[1])) {
                # Flag each response the respondent selected:
                for (k in seq(along = temp[[i]][[j]])) {
                    OBS[j, substr(temp[[i]][[j]][k], 1, 5)] <- 1
                }
            } else {
                OBS[j, ] <- rep(NA, n.lvls[[i]])
            }
        }
        names(OBS) <- paste(vars[[i]], lvls[[i]], sep = ".")
        # Drop the indicator column for missing answers, if one exists
        # (subsetting unconditionally would drop everything when which.na is empty):
        which.na <- which(lvls[[i]] == "NA")
        if (length(which.na) > 0) OBS <- OBS[, -which.na, drop = FALSE]
        VARS[[i]] <- OBS
    }
    names(VARS) <- vars
    return(VARS)
}

I admit that the approach I've taken above is inefficient but it works.

NOTE: If one of the possible responses contains commas, then multiple columns will be created for that response. So, for example, if the responder can check “Online (Amazon, eBay)” along with “In-store (Best Buy, Frys)”, then we will see the following columns: Onlin, eBay, In-st, Frys (where Onlin and eBay contain identical 1s and 0s, as do In-st and Frys). This is because the function uses “, ” to separate a response into multiple possible responses. This is unavoidable, so be careful.

Using Google Drive to make Survey Forms and importing answers into R

I remember when Google Docs first launched. I was still in high school and I immediately became a Google evangelist. I told everyone to start using this wonderful new cloud-based service. I don’t think the term ‘cloud-based’ even existed at the time, although it’s more likely that I was simply not aware of its existence. Since then the service has grown substantially. It includes a lot more features, a significantly better UI, and it even lets people design surveys!

It's even easy to make a survey with branching questions (see the example form below).

The form can be accessed here. Feel free to submit a response; this is the form we'll be using in this example. (Responses so far.)

The results of such surveys are stored as Spreadsheets on Google Drive (formerly Google Docs & Spreadsheets). But what if we want to access all those answers in R and perform some EDA or analysis? And what if we don’t want to go to the Google Drive page and download the results as a CSV manually? Let me show you how we can achieve all of this with some R code after a quick-and-easy initial setup.

I assume you've designed the form and may or may not have responses already. Open it up and you should see the spreadsheet. Click on File -> Publish to the web, then click Start Publishing and pick CSV under "Get a link to the published data".

Copy the link. You will use this as the filename.

filename <- "https://docs.google.com/spreadsheet/pub?key=0ApOyZxZwgCv6dC1uUUVVbl9ieEJSQjhMQWpGZUxuYUE&output=csv"

By default, R has some issues downloading files over HTTPS, so we need to use a package called RCurl.

# Checks if RCurl is installed. If not, installs it.
if (!("RCurl" %in% installed.packages()[, 1])) install.packages("RCurl")
require(RCurl)

Then we need to actually download the survey results. This is done through the following script:

txt <- tryCatch(getURL(filename), error = function(e) {
    getURL(filename, ssl.verifypeer = FALSE)
})
tc <- textConnection(txt)  # Opens a connection.
survey.results <- read.csv(tc, header = TRUE, stringsAsFactors = TRUE, na.strings = "")
close(tc)  # Closes the connection.
rm(txt, tc, filename)  # Cleans up the workspace.

You'll actually get really long column names, as well as a column of the dates and times of the submissions. Let's clean this up real quick:

# remember to modify data frame name to your needs
statisticians <- survey.results[, -1]
# remember to modify as appropriate
names(statisticians) <- c("software", "used.ggplot2", "role")

Here's what the data looks like:

  software used.ggplot2             role
1        R          Yes Graduate Student
2      SAS           NA     Professional
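
From here you can jump straight into exploring the responses, for example:

table(statisticians$software)  # counts of responses per software
summary(statisticians)         # quick overview of every column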

Thanks for reading! Enjoy :)