R Introduction

2 August 2014

What is R?

A Domain-Specific-Language (DSL) for statistics and data analysis
Based on the S Programming Language
An environment for Exploratory Data Analysis (EDA)
A quasi-functional language with IDE and REPL
A vectorized language with BLAS support
A collection of over 7,000+ libraries
A large and active community across industry and academia
Around 20 years old (Lineage dates from 1975 - almost 40 years ago)

Types

Primitives (numeric, integer, character, logical, factor)
Data Frames
Lists
Tables
Arrays
Environments
Others (functions, closures, promises..)

Simple Types

x <- 1
class(x)
## [1] "numeric"

y <- "Hello World"
class(y)
## [1] "character"

z <- TRUE
class(z)
## [1] "logical"

as.integer(z)
## [1] 1

Simple Types - Vectors

The basic type unit in R is a vector

x <- c(1,2,3)
x
## [1] 1 2 3
x <- 1:3
x[1]
## [1] 1
x[0]
## integer(0)
x[-1]
## [1] 2 3

Generating Vectors

R provides lots of convenience functions for data generation:

rep(0, 5)
## [1] 0 0 0 0 0
seq(1,10)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(1,2,.1)
##  [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
seq(1,2,length.out=6)
## [1] 1.0 1.2 1.4 1.6 1.8 2.0

Indexing

x <- c(1, 3, 4, 10, 15, 20, 50, 1, 6)
x > 10
## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
which(x > 10)
## [1] 5 6 7
x[x>10]
## [1] 15 20 50
x[!x>10]
## [1]  1  3  4 10  1  6
x[x<=10]
## [1]  1  3  4 10  1  6
x[x>10 & x<30]
## [1] 15 20

Functions

square <- function(x) x^2
square(2)
## [1] 4

pow <- function(x, p=2) x^p
pow(10)
## [1] 100
pow(10,3)
## [1] 1000
pow(p=3,10)
## [1] 1000

Functions can be passed as data:

g <- function(x, f) f(x)
g(10, square)
## [1] 100

h <- function(x,f,...) f(x,...)
h(10, pow, 3)
## [1] 1000

R is Vectorized

Example - multiplying two vectors:

mult <- function(x,y) {  
 z <- numeric(length(x))
 for (i in 1:length(x)) {
    z[i] <- x[i] * y[i]
 }
 z
}

mult(1:10,1:10)

##  [1]   1   4   9  16  25  36  49  64  81 100

R is Vectorized

Multiplying two vectors 'the R way':

1:10 * 1:10

##  [1]   1   4   9  16  25  36  49  64  81 100

NOTE: R recycles vectors of unequal length:

1:10 * 1:2

##  [1]  1  4  3  8  5 12  7 16  9 20

NOTE: Random Number Generation

R contains a huge number of built-in random number generators for various probability distributions

# Normal variates, mean=0, sd=1
rnorm(10)

##  [1]  1.058489  0.362267  0.680629 -0.066726  0.816936 -0.218883 -1.103620
##  [8] -0.224080 -1.633193 -0.578103

rnorm(10, mean=100, sd=5)

##  [1] 101.071  92.236 106.405 103.254  89.887  96.956  93.717 104.615
##  [9]  98.873  99.458

Many different distributions available (the r* functions)

Data Frames

Data frames are the fundamental structure used in data analysis
Similar to a database table in spirit (named columns, distinct types)

d <- data.frame(x=1:6, y="AUDUSD", z=c("one","two"))
d

##   x      y   z
## 1 1 AUDUSD one
## 2 2 AUDUSD two
## 3 3 AUDUSD one
## 4 4 AUDUSD two
## 5 5 AUDUSD one
## 6 6 AUDUSD two

Data Frames

Data frames can be indexed like a vector or matrix:

# First row
d[1,]
##   x      y   z
## 1 1 AUDUSD one

# First column
d[,1]
## [1] 1 2 3 4 5 6

# First and third cols, first two rows
d[1:2,c(1,3)]
##   x   z
## 1 1 one
## 2 2 two

Data Frames

Let's generate some dummy data:

generateData <- function(N) data.frame(time=Sys.time()+1:N, 
  sym="AUDUSD", 
  bid=rep(1.2345,N)+runif(min=-.0010,max=.0010,N),
  ask=rep(1.2356,N)+runif(min=-.0010,max=.0010,N),
  exch=sample(c("EBS","RTM","CNX"),N, replace=TRUE)) 

prices <- generateData(50)
head(prices, 5)

##                      time    sym    bid    ask exch
## 1 2014-12-27 18:55:12.727 AUDUSD 1.2352 1.2349  EBS
## 2 2014-12-27 18:55:13.727 AUDUSD 1.2345 1.2356  CNX
## 3 2014-12-27 18:55:14.727 AUDUSD 1.2340 1.2363  CNX
## 4 2014-12-27 18:55:15.727 AUDUSD 1.2347 1.2347  EBS
## 5 2014-12-27 18:55:16.727 AUDUSD 1.2355 1.2353  RTM

Data Frames

We can add/remove columns on the fly:

prices$spread <- prices$ask-prices$bid
prices$mid <- (prices$bid + prices$ask) * 0.5
head(prices)

##                      time    sym    bid    ask exch      spread    mid
## 1 2014-12-27 18:55:12.727 AUDUSD 1.2352 1.2349  EBS -3.1483e-04 1.2351
## 2 2014-12-27 18:55:13.727 AUDUSD 1.2345 1.2356  CNX  1.1894e-03 1.2350
## 3 2014-12-27 18:55:14.727 AUDUSD 1.2340 1.2363  CNX  2.3628e-03 1.2351
## 4 2014-12-27 18:55:15.727 AUDUSD 1.2347 1.2347  EBS  2.7443e-05 1.2347
## 5 2014-12-27 18:55:16.727 AUDUSD 1.2355 1.2353  RTM -1.3792e-04 1.2354
## 6 2014-12-27 18:55:17.727 AUDUSD 1.2349 1.2364  CNX  1.4767e-03 1.2356

Data Frames

Some basic operations on data frames:

names(prices)
## [1] "time"   "sym"    "bid"    "ask"    "exch"   "spread" "mid"

table(prices$exch)
## 
## CNX EBS RTM 
##  17  14  19

summary(prices$mid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.23    1.23    1.24    1.24    1.24    1.24

Data Frames

Operations can be applied across different dimensions of a data frame:

sapply(prices,class)
## $time
## [1] "POSIXct" "POSIXt" 
## 
## $sym
## [1] "factor"
## 
## $bid
## [1] "numeric"
## 
## $ask
## [1] "numeric"
## 
## $exch
## [1] "factor"
## 
## $spread
## [1] "numeric"
## 
## $mid
## [1] "numeric"

Aggregations and Rollups

Aggregation and rollups:

tapply(prices$spread, prices$exch, mean)

##        CNX        EBS        RTM 
## 0.00138489 0.00074092 0.00100872

Aggregations and Rollups

tapply(prices$mid, prices$exch, function(x) diff(log(x)))

## $CNX
##  [1]  7.0477e-05  4.0285e-04 -3.0331e-04 -2.0051e-04  1.8353e-04
##  [6] -1.5554e-04 -3.1436e-04  1.9462e-04  7.4922e-04 -6.0734e-04
## [11] -1.2643e-04  1.7685e-04  4.7554e-04 -8.8123e-05 -4.9113e-04
## [16] -8.7994e-05
## 
## $EBS
##  [1] -0.00030726  0.00101082 -0.00101368  0.00029379 -0.00049941
##  [6]  0.00037749 -0.00045959  0.00081848 -0.00082648  0.00051327
## [11]  0.00061863  0.00010343 -0.00053213
## 
## $RTM
##  [1] -6.1733e-04  1.0196e-03 -1.1565e-03  6.1929e-04 -1.3694e-04
##  [6] -4.6347e-04  9.5553e-04  6.9131e-05 -8.6649e-04  2.5265e-04
## [11] -1.2193e-04  2.1709e-04  3.4202e-04 -5.6068e-04  3.1838e-04
## [16] -2.7571e-04  6.1238e-04 -5.2919e-04

Aggregations and Rollups

Aggregating summary statistics by time:

aggregate(prices[,c("bid","ask")], 
  by=list(bucket=cut(prices$time, "10 sec")), mean)

##                    bucket    bid    ask
## 1 2014-12-27 18:55:12.727 1.2347 1.2357
## 2 2014-12-27 18:55:22.727 1.2343 1.2356
## 3 2014-12-27 18:55:32.727 1.2346 1.2356
## 4 2014-12-27 18:55:42.727 1.2346 1.2355
## 5 2014-12-27 18:55:52.727 1.2347 1.2359

Aggregations and Rollups

aggregate(prices[,c("bid","ask")], 
          by=list(bucket=cut(prices$time, "10 sec")), 
          function(x) c(min=min(x),max=max(x)))

##                    bucket bid.min bid.max ask.min ask.max
## 1 2014-12-27 18:55:12.727  1.2340  1.2355  1.2347  1.2365
## 2 2014-12-27 18:55:22.727  1.2336  1.2355  1.2347  1.2365
## 3 2014-12-27 18:55:32.727  1.2336  1.2353  1.2346  1.2365
## 4 2014-12-27 18:55:42.727  1.2335  1.2354  1.2347  1.2362
## 5 2014-12-27 18:55:52.727  1.2336  1.2354  1.2350  1.2365

Aggregations and Rollups

prices <- generateData(2000)
prices$spread <- prices$ask-prices$bid
boxplot(prices$spread ~ prices$exch, main="Spread Distribution By Exchange")

plot of chunk unnamed-chunk-22

Closures

R functions contain a reference to the enclosing environment:

add <- function(x) { 
 function(y) x + y
}

increment <- add(1)
increment(2)

## [1] 3

R's Lisp Influence

R is heavily influenced by Lisp/Scheme

cf. "Structure And Interpretation of Computer Programs"

R's Lisp Influence

`+`(2,3)

## [1] 5

(`*` (2, (`+`(2,3))))

## [1] 10

codetools::showTree(quote(2*(2+3)))

## (* 2 ("(" (+ 2 3)))

R's Lisp Influence

Also evident in the underlying C code:

SEXP attribute_hidden do_prmatrix(SEXP call, SEXP op, 
  SEXP args, SEXP rho)
{
    int quote;
    SEXP a, x, rowlab, collab, naprint;
    char *rowname = NULL, *colname = NULL;

    checkArity(op,args);
    PrintDefaults();
    a = args;
    x = CAR(a); a = CDR(a);
    rowlab = CAR(a); a = CDR(a);
    collab = CAR(a); a = CDR(a);

    quote = asInteger(CAR(a)); a = CDR(a);

Operators Are Functions

`(`

## .Primitive("(")

(1+2)

## [1] 3

`(` <- function(x) 42
(1+2)

## [1] 42

rm("(")

Example: Median Absolute Deviation

\(MAD(x) = median\left(\left|Y_i - \hat{Y}\right|\right)\)

mad

## function (x, center = median(x), constant = 1.4826, na.rm = FALSE, 
##     low = FALSE, high = FALSE) 
## {
##     if (na.rm) 
##         x <- x[!is.na(x)]
##     n <- length(x)
##     constant * if ((low || high) && n%%2 == 0) {
##         if (low && high) 
##             stop("'low' and 'high' cannot be both TRUE")
##         n2 <- n%/%2 + as.integer(high)
##         sort(abs(x - center), partial = n2)[n2]
##     }
##     else median(abs(x - center))
## }
## <bytecode: 0x103aade30>
## <environment: namespace:stats>

Example: Median Absolute Deviation

Shows: lazy evaluation, filering, logical casting, if/else return values, partial sorting



function (x, center = median(x), constant = 1.4826, na.rm = FALSE, 
    low = FALSE, high = FALSE) 
{
    if (na.rm) x <- x[!is.na(x)]
    n <- length(x)
    constant * if ((low || high) && n%%2 == 0) {
        if (low && high) 
            stop("'low' and 'high' cannot be both TRUE")
        n2 <- n%/%2 + as.integer(high)
        sort(abs(x - center), partial = n2)[n2]
    }
    else median(abs(x - center))
}

Example: Simulating Coin Tosses

What is the probability of exactly 3 heads in 10 coin tosses for a fair coin?

Using binomial identity: \(\binom{n}{k}p^{k}(1-p)^{(n-k)} = \binom{10}{3}\left(\frac{1}{2}\right)^{3}\left(\frac{1}{2}\right)^{7}\)

choose(10,3)*(.5)^3*(.5)^7

## [1] 0.11719

Using binomial distribution density function:

 dbinom(prob=0.5, size=10, x=3)

## [1] 0.11719

Using simulation (100,000 tosses):

sum(replicate(100000,sum(rbinom(prob=1/2, size=10, 1))==3))/100000

## [1] 0.11636

Example: Random Walk

Generate 1000 up-down movements based on a fair coin toss and plot:

x<-(cumsum(ifelse(rbinom(prob=0.5, size=1, 10000)==0,-1,1)))
plot(x, type='l', main='Random Walk')

plot of chunk unnamed-chunk-30

Example: Generating Random Data

randomWalk <-function(N)(cumsum(ifelse(rbinom(prob=0.5, size=1, N)==0,-1,1)))
AUDUSD <- 1.2345 + randomWalk(1000)*.0001
plot(AUDUSD, type='l')

plot of chunk unnamed-chunk-31

Example: OANDA FX Data

require(quantmod);require(TTR)
EURUSD <- getSymbols("EUR/USD", src="oanda", auto.assign=FALSE)
plot(EURUSD)
lines(EMA(EURUSD,10), col='red')
lines(EMA(EURUSD,30), col='blue')

Example: Connecting to kdb+

Rorys-MacBook-Pro:kdb rorywinston$ <b>./rlwrap q/m32/q -p 5000</b>
KDB+ 3.1 2014.07.01 Copyright (C) 1993-2014 Kx Systems
m32/ 8()core 16384MB rorywinston rorys-macbook-pro.local 127.0.0.1 NONEXPIRE  
Welcome to kdb+ 32bit edition
<b>q)\p</b>
5000i
<b>q) trades:([]time:100?.z.P;price:100?2.;
  side:100?`B`S;exch:100?`CNX`RTM`EBS;sym:100?`EURUSD`AUDUSD`GBPUSD)</b>
<b>q)10#trades</b>
time                          price     side exch sym   
--------------------------------------------------------
2010.08.13D12:33:29.975458112 0.6019404 B    CNX  EURUSD
2001.11.24D20:53:58.972661440 0.7075032 S    CNX  EURUSD
2002.12.12D03:12:04.442386736 1.500898  S    CNX  GBPUSD
2002.02.12D21:48:33.887104336 0.6170263 S    EBS  AUDUSD
2014.05.01D06:59:44.647138496 0.8821325 S    EBS  GBPUSD
2010.12.06D15:30:14.928601664 1.094677  S    RTM  AUDUSD
2009.04.19D23:12:33.919967488 1.187474  B    RTM  GBPUSD
2008.07.18D22:13:25.681742656 0.1768144 B    EBS  GBPUSD
2010.08.22D10:16:15.261483520 0.3576458 S    EBS  AUDUSD
2010.02.28D13:49:33.686598976 1.526465  S    RTM  EURUSD

Example: Connecting to kdb+

setwd("/Users/rorywinston/sandbox/kdb")
source("qserver.R")
open_connection("localhost", 5000)
trades <- k("select from trades")
head(trades)
                 time     price side exch    sym
1 2010-08-13 22:33:29 0.6019404    B  CNX EURUSD
2 2001-11-25 07:53:58 0.7075032    S  CNX EURUSD
3 2002-12-12 14:12:04 1.5008982    S  CNX GBPUSD
4 2002-02-13 08:48:33 0.6170263    S  EBS AUDUSD
5 2014-05-01 16:59:44 0.8821325    S  EBS GBPUSD
6 2010-12-07 02:30:14 1.0946771    S  RTM AUDUSD

kdb+ datatypes are converted to native R types

Example: Reading Log Data From File

# Read file into data frame
logfile <- read.csv("/tmp/application.log", sep=",", header=FALSE)
# Set column descriptors
colnames(logfile) <- c("time","message","severity")
# Convert to native date/time
logfile$time <- as.POSIXct(strptime
                  (logfile$time, "%Y-%m-%d %H:%M:%OS"), tz="GMT")

Example: Using Datasets

The famous 'Air passengers' dataset

plot(AirPassengers)

plot of chunk unnamed-chunk-35

Example: Using Datasets

The 'Anscombe Quartet' dataset

op <- par(mfrow=c(2,2),mar=rep(1,4))
with(anscombe,{plot(x1,y1,pch=20);plot(x2,y2,pch=20);
               plot(x3,y3,pch=20);plot(x4,y4,pch=20)})

plot of chunk unnamed-chunk-36

par(op)

Recommended Libraries

ggplot2 - Mini-DSL for data visualization
zoo/xts - Time series libraries
Matrix - Enhanced matrix library
plyr/reshape - Data reshaping/manipulation
data.table - Faster data.frame manipulation
e1071 - Machine learning/data mining functions
caret - Statistical learning/training functions
randomforest - Random forest library
Rcpp - Convenient C++ interface

Links

http://www.r-project.org

What is R?

Types

Simple Types

Simple Types - Vectors

Generating Vectors

Indexing

Functions

R is Vectorized

R is Vectorized

NOTE: Random Number Generation

Data Frames

Data Frames

Data Frames

Data Frames

Data Frames

Data Frames

Aggregations and Rollups

Aggregations and Rollups

Aggregations and Rollups

Aggregations and Rollups

Aggregations and Rollups

Closures

R's Lisp Influence

R's Lisp Influence

R's Lisp Influence

Operators Are Functions

Example: Median Absolute Deviation

Example: Median Absolute Deviation

Example: Simulating Coin Tosses

Example: Random Walk

Example: Generating Random Data

Example: OANDA FX Data

Example: Connecting to kdb+

Example: Connecting to kdb+

Example: Reading Log Data From File

Example: Using Datasets

Example: Using Datasets

Recommended Libraries

Other Topics (Not Covered)

Links