2 August 2014

What is R?

  • A Domain-Specific-Language (DSL) for statistics and data analysis
  • Based on the S Programming Language
  • An environment for Exploratory Data Analysis (EDA)
  • A quasi-functional language with IDE and REPL
  • A vectorized language with BLAS support
  • A collection of over 7,000+ libraries
  • A large and active community across industry and academia
  • Around 20 years old (Lineage dates from 1975 - almost 40 years ago)

Types

  • Primitives (numeric, integer, character, logical, factor)
  • Data Frames
  • Lists
  • Tables
  • Arrays
  • Environments
  • Others (functions, closures, promises..)

Simple Types

x <- 1
class(x)
## [1] "numeric"

y <- "Hello World"
class(y)
## [1] "character"

z <- TRUE
class(z)
## [1] "logical"

as.integer(z)
## [1] 1

Simple Types - Vectors

The basic type unit in R is a vector

x <- c(1,2,3)
x
## [1] 1 2 3
x <- 1:3
x[1]
## [1] 1
x[0]
## integer(0)
x[-1]
## [1] 2 3

Generating Vectors

R provides lots of convenience functions for data generation:

rep(0, 5)
## [1] 0 0 0 0 0
seq(1,10)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(1,2,.1)
##  [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
seq(1,2,length.out=6)
## [1] 1.0 1.2 1.4 1.6 1.8 2.0

Indexing

x <- c(1, 3, 4, 10, 15, 20, 50, 1, 6)
x > 10
## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
which(x > 10)
## [1] 5 6 7
x[x>10]
## [1] 15 20 50
x[!x>10]
## [1]  1  3  4 10  1  6
x[x<=10]
## [1]  1  3  4 10  1  6
x[x>10 & x<30]
## [1] 15 20

Functions

square <- function(x) x^2
square(2)
## [1] 4

pow <- function(x, p=2) x^p
pow(10)
## [1] 100
pow(10,3)
## [1] 1000
pow(p=3,10)
## [1] 1000

Functions can be passed as data:

g <- function(x, f) f(x)
g(10, square)
## [1] 100

h <- function(x,f,...) f(x,...)
h(10, pow, 3)
## [1] 1000

R is Vectorized

Example - multiplying two vectors:

mult <- function(x,y) {  
 z <- numeric(length(x))
 for (i in 1:length(x)) {
    z[i] <- x[i] * y[i]
 }
 z
}

mult(1:10,1:10)
##  [1]   1   4   9  16  25  36  49  64  81 100

R is Vectorized

Multiplying two vectors 'the R way':

1:10 * 1:10
##  [1]   1   4   9  16  25  36  49  64  81 100

NOTE: R recycles vectors of unequal length:

1:10 * 1:2
##  [1]  1  4  3  8  5 12  7 16  9 20

NOTE: Random Number Generation

R contains a huge number of built-in random number generators for various probability distributions

# Normal variates, mean=0, sd=1
rnorm(10)
##  [1]  1.058489  0.362267  0.680629 -0.066726  0.816936 -0.218883 -1.103620
##  [8] -0.224080 -1.633193 -0.578103
rnorm(10, mean=100, sd=5)
##  [1] 101.071  92.236 106.405 103.254  89.887  96.956  93.717 104.615
##  [9]  98.873  99.458

Many different distributions available (the r* functions)

Data Frames

  • Data frames are the fundamental structure used in data analysis
  • Similar to a database table in spirit (named columns, distinct types)
d <- data.frame(x=1:6, y="AUDUSD", z=c("one","two"))
d
##   x      y   z
## 1 1 AUDUSD one
## 2 2 AUDUSD two
## 3 3 AUDUSD one
## 4 4 AUDUSD two
## 5 5 AUDUSD one
## 6 6 AUDUSD two

Data Frames

Data frames can be indexed like a vector or matrix:

# First row
d[1,]
##   x      y   z
## 1 1 AUDUSD one

# First column
d[,1]
## [1] 1 2 3 4 5 6

# First and third cols, first two rows
d[1:2,c(1,3)]
##   x   z
## 1 1 one
## 2 2 two

Data Frames

Let's generate some dummy data:

generateData <- function(N) data.frame(time=Sys.time()+1:N, 
  sym="AUDUSD", 
  bid=rep(1.2345,N)+runif(min=-.0010,max=.0010,N),
  ask=rep(1.2356,N)+runif(min=-.0010,max=.0010,N),
  exch=sample(c("EBS","RTM","CNX"),N, replace=TRUE)) 

prices <- generateData(50)
head(prices, 5)
##                      time    sym    bid    ask exch
## 1 2014-12-27 18:55:12.727 AUDUSD 1.2352 1.2349  EBS
## 2 2014-12-27 18:55:13.727 AUDUSD 1.2345 1.2356  CNX
## 3 2014-12-27 18:55:14.727 AUDUSD 1.2340 1.2363  CNX
## 4 2014-12-27 18:55:15.727 AUDUSD 1.2347 1.2347  EBS
## 5 2014-12-27 18:55:16.727 AUDUSD 1.2355 1.2353  RTM

Data Frames

We can add/remove columns on the fly:

prices$spread <- prices$ask-prices$bid
prices$mid <- (prices$bid + prices$ask) * 0.5
head(prices)
##                      time    sym    bid    ask exch      spread    mid
## 1 2014-12-27 18:55:12.727 AUDUSD 1.2352 1.2349  EBS -3.1483e-04 1.2351
## 2 2014-12-27 18:55:13.727 AUDUSD 1.2345 1.2356  CNX  1.1894e-03 1.2350
## 3 2014-12-27 18:55:14.727 AUDUSD 1.2340 1.2363  CNX  2.3628e-03 1.2351
## 4 2014-12-27 18:55:15.727 AUDUSD 1.2347 1.2347  EBS  2.7443e-05 1.2347
## 5 2014-12-27 18:55:16.727 AUDUSD 1.2355 1.2353  RTM -1.3792e-04 1.2354
## 6 2014-12-27 18:55:17.727 AUDUSD 1.2349 1.2364  CNX  1.4767e-03 1.2356

Data Frames

Some basic operations on data frames:

names(prices)
## [1] "time"   "sym"    "bid"    "ask"    "exch"   "spread" "mid"

table(prices$exch)
## 
## CNX EBS RTM 
##  17  14  19

summary(prices$mid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.23    1.23    1.24    1.24    1.24    1.24

Data Frames

Operations can be applied across different dimensions of a data frame:

sapply(prices,class)
## $time
## [1] "POSIXct" "POSIXt" 
## 
## $sym
## [1] "factor"
## 
## $bid
## [1] "numeric"
## 
## $ask
## [1] "numeric"
## 
## $exch
## [1] "factor"
## 
## $spread
## [1] "numeric"
## 
## $mid
## [1] "numeric"

Aggregations and Rollups

Aggregation and rollups:

tapply(prices$spread, prices$exch, mean)
##        CNX        EBS        RTM 
## 0.00138489 0.00074092 0.00100872

Aggregations and Rollups

tapply(prices$mid, prices$exch, function(x) diff(log(x)))
## $CNX
##  [1]  7.0477e-05  4.0285e-04 -3.0331e-04 -2.0051e-04  1.8353e-04
##  [6] -1.5554e-04 -3.1436e-04  1.9462e-04  7.4922e-04 -6.0734e-04
## [11] -1.2643e-04  1.7685e-04  4.7554e-04 -8.8123e-05 -4.9113e-04
## [16] -8.7994e-05
## 
## $EBS
##  [1] -0.00030726  0.00101082 -0.00101368  0.00029379 -0.00049941
##  [6]  0.00037749 -0.00045959  0.00081848 -0.00082648  0.00051327
## [11]  0.00061863  0.00010343 -0.00053213
## 
## $RTM
##  [1] -6.1733e-04  1.0196e-03 -1.1565e-03  6.1929e-04 -1.3694e-04
##  [6] -4.6347e-04  9.5553e-04  6.9131e-05 -8.6649e-04  2.5265e-04
## [11] -1.2193e-04  2.1709e-04  3.4202e-04 -5.6068e-04  3.1838e-04
## [16] -2.7571e-04  6.1238e-04 -5.2919e-04

Aggregations and Rollups

Aggregating summary statistics by time:

aggregate(prices[,c("bid","ask")], 
  by=list(bucket=cut(prices$time, "10 sec")), mean)
##                    bucket    bid    ask
## 1 2014-12-27 18:55:12.727 1.2347 1.2357
## 2 2014-12-27 18:55:22.727 1.2343 1.2356
## 3 2014-12-27 18:55:32.727 1.2346 1.2356
## 4 2014-12-27 18:55:42.727 1.2346 1.2355
## 5 2014-12-27 18:55:52.727 1.2347 1.2359

Aggregations and Rollups

aggregate(prices[,c("bid","ask")], 
          by=list(bucket=cut(prices$time, "10 sec")), 
          function(x) c(min=min(x),max=max(x)))
##                    bucket bid.min bid.max ask.min ask.max
## 1 2014-12-27 18:55:12.727  1.2340  1.2355  1.2347  1.2365
## 2 2014-12-27 18:55:22.727  1.2336  1.2355  1.2347  1.2365
## 3 2014-12-27 18:55:32.727  1.2336  1.2353  1.2346  1.2365
## 4 2014-12-27 18:55:42.727  1.2335  1.2354  1.2347  1.2362
## 5 2014-12-27 18:55:52.727  1.2336  1.2354  1.2350  1.2365

Aggregations and Rollups

prices <- generateData(2000)
prices$spread <- prices$ask-prices$bid
boxplot(prices$spread ~ prices$exch, main="Spread Distribution By Exchange")

plot of chunk unnamed-chunk-22

Closures

R functions contain a reference to the enclosing environment:

add <- function(x) { 
 function(y) x + y
}

increment <- add(1)
increment(2)
## [1] 3

R's Lisp Influence

R is heavily influenced by Lisp/Scheme

cf. "Structure And Interpretation of Computer Programs"

R's Lisp Influence

`+`(2,3)
## [1] 5
(`*` (2, (`+`(2,3))))
## [1] 10
codetools::showTree(quote(2*(2+3)))
## (* 2 ("(" (+ 2 3)))

R's Lisp Influence

Also evident in the underlying C code:

SEXP attribute_hidden do_prmatrix(SEXP call, SEXP op, 
  SEXP args, SEXP rho)
{
    int quote;
    SEXP a, x, rowlab, collab, naprint;
    char *rowname = NULL, *colname = NULL;

    checkArity(op,args);
    PrintDefaults();
    a = args;
    x = CAR(a); a = CDR(a);
    rowlab = CAR(a); a = CDR(a);
    collab = CAR(a); a = CDR(a);

    quote = asInteger(CAR(a)); a = CDR(a);

Operators Are Functions

`(`
## .Primitive("(")
(1+2)
## [1] 3
`(` <- function(x) 42
(1+2)
## [1] 42
rm("(")

Example: Median Absolute Deviation

\(MAD(x) = median\left(\left|Y_i - \hat{Y}\right|\right)\)

mad
## function (x, center = median(x), constant = 1.4826, na.rm = FALSE, 
##     low = FALSE, high = FALSE) 
## {
##     if (na.rm) 
##         x <- x[!is.na(x)]
##     n <- length(x)
##     constant * if ((low || high) && n%%2 == 0) {
##         if (low && high) 
##             stop("'low' and 'high' cannot be both TRUE")
##         n2 <- n%/%2 + as.integer(high)
##         sort(abs(x - center), partial = n2)[n2]
##     }
##     else median(abs(x - center))
## }
## <bytecode: 0x103aade30>
## <environment: namespace:stats>

Example: Median Absolute Deviation

Shows: lazy evaluation, filering, logical casting, if/else return values, partial sorting
function (x, center = median(x), constant = 1.4826, na.rm = FALSE, 
    low = FALSE, high = FALSE) 
{
    if (na.rm) x <- x[!is.na(x)]
    n <- length(x)
    constant * if ((low || high) && n%%2 == 0) {
        if (low && high) 
            stop("'low' and 'high' cannot be both TRUE")
        n2 <- n%/%2 + as.integer(high)
        sort(abs(x - center), partial = n2)[n2]
    }
    else median(abs(x - center))
}

Example: Simulating Coin Tosses

What is the probability of exactly 3 heads in 10 coin tosses for a fair coin?

Using binomial identity: \(\binom{n}{k}p^{k}(1-p)^{(n-k)} = \binom{10}{3}\left(\frac{1}{2}\right)^{3}\left(\frac{1}{2}\right)^{7}\)

choose(10,3)*(.5)^3*(.5)^7
## [1] 0.11719

Using binomial distribution density function:

 dbinom(prob=0.5, size=10, x=3)
## [1] 0.11719

Using simulation (100,000 tosses):

sum(replicate(100000,sum(rbinom(prob=1/2, size=10, 1))==3))/100000
## [1] 0.11636

Example: Random Walk

Generate 1000 up-down movements based on a fair coin toss and plot:

x<-(cumsum(ifelse(rbinom(prob=0.5, size=1, 10000)==0,-1,1)))
plot(x, type='l', main='Random Walk')

plot of chunk unnamed-chunk-30

Example: Generating Random Data

randomWalk <-function(N)(cumsum(ifelse(rbinom(prob=0.5, size=1, N)==0,-1,1)))
AUDUSD <- 1.2345 + randomWalk(1000)*.0001
plot(AUDUSD, type='l')

plot of chunk unnamed-chunk-31

Example: OANDA FX Data

require(quantmod);require(TTR)
EURUSD <- getSymbols("EUR/USD", src="oanda", auto.assign=FALSE)
plot(EURUSD)
lines(EMA(EURUSD,10), col='red')
lines(EMA(EURUSD,30), col='blue')

Example: Connecting to kdb+

Rorys-MacBook-Pro:kdb rorywinston$ <b>./rlwrap q/m32/q -p 5000</b>
KDB+ 3.1 2014.07.01 Copyright (C) 1993-2014 Kx Systems
m32/ 8()core 16384MB rorywinston rorys-macbook-pro.local 127.0.0.1 NONEXPIRE  
Welcome to kdb+ 32bit edition
<b>q)\p</b>
5000i
<b>q) trades:([]time:100?.z.P;price:100?2.;
  side:100?`B`S;exch:100?`CNX`RTM`EBS;sym:100?`EURUSD`AUDUSD`GBPUSD)</b>
<b>q)10#trades</b>
time                          price     side exch sym   
--------------------------------------------------------
2010.08.13D12:33:29.975458112 0.6019404 B    CNX  EURUSD
2001.11.24D20:53:58.972661440 0.7075032 S    CNX  EURUSD
2002.12.12D03:12:04.442386736 1.500898  S    CNX  GBPUSD
2002.02.12D21:48:33.887104336 0.6170263 S    EBS  AUDUSD
2014.05.01D06:59:44.647138496 0.8821325 S    EBS  GBPUSD
2010.12.06D15:30:14.928601664 1.094677  S    RTM  AUDUSD
2009.04.19D23:12:33.919967488 1.187474  B    RTM  GBPUSD
2008.07.18D22:13:25.681742656 0.1768144 B    EBS  GBPUSD
2010.08.22D10:16:15.261483520 0.3576458 S    EBS  AUDUSD
2010.02.28D13:49:33.686598976 1.526465  S    RTM  EURUSD

Example: Connecting to kdb+

setwd("/Users/rorywinston/sandbox/kdb")
source("qserver.R")
open_connection("localhost", 5000)
trades <- k("select from trades")
head(trades)
                 time     price side exch    sym
1 2010-08-13 22:33:29 0.6019404    B  CNX EURUSD
2 2001-11-25 07:53:58 0.7075032    S  CNX EURUSD
3 2002-12-12 14:12:04 1.5008982    S  CNX GBPUSD
4 2002-02-13 08:48:33 0.6170263    S  EBS AUDUSD
5 2014-05-01 16:59:44 0.8821325    S  EBS GBPUSD
6 2010-12-07 02:30:14 1.0946771    S  RTM AUDUSD

kdb+ datatypes are converted to native R types

Example: Reading Log Data From File

# Read file into data frame
logfile <- read.csv("/tmp/application.log", sep=",", header=FALSE)
# Set column descriptors
colnames(logfile) <- c("time","message","severity")
# Convert to native date/time
logfile$time <- as.POSIXct(strptime
                  (logfile$time, "%Y-%m-%d %H:%M:%OS"), tz="GMT")

Example: Using Datasets

The famous 'Air passengers' dataset

plot(AirPassengers)

plot of chunk unnamed-chunk-35

Example: Using Datasets

The 'Anscombe Quartet' dataset

op <- par(mfrow=c(2,2),mar=rep(1,4))
with(anscombe,{plot(x1,y1,pch=20);plot(x2,y2,pch=20);
               plot(x3,y3,pch=20);plot(x4,y4,pch=20)})

plot of chunk unnamed-chunk-36

par(op)

Recommended Libraries

Other Topics (Not Covered)

  • S3/S4 Classes/Objects
  • Packages
  • Lazy Evaluation
  • Formula Interface
  • JIT Compilation/bytecode
  • Debugging
  • C/C++ Interfaces
  • Reproducible Research (Sweave/knitr/markdown)

Links