I put together a short intro presentation for some people explaining a little bit about R from an introductory point of view. Slides put together with R/markdown and ioslides.

Here is the raw markdown if you are interested:

[code lang=”r”]

—

title: "R Introduction"

author: "Rory Winston"

date: "2 August 2014"

output:

ioslides_presentation: default

beamer_presentation:

fig_height: 6

fig_width: 8

keep_tex: yes

logo: r_logo.png

self_contained: no

fontsize: 10pt

—

## What is R?

– A Domain-Specific-Language (DSL) for statistics and data analysis

– Based on the S Programming Language

– An environment for Exploratory Data Analysis (EDA)

– A quasi-functional language with IDE and REPL

– A vectorized language with BLAS support

– A collection of over 7,000+ libraries

– A large and active community across industry and academia

– Around 20 years old (Lineage dates from 1975 – almost 40 years ago)

“`{r,echo=FALSE,message=FALSE}

options("digits"=5)

options("digits.secs"=3)

“`

## Types

– Primitives (numeric, integer, character, logical, factor)

– Data Frames

– Lists

– Tables

– Arrays

– Environments

– Others (functions, closures, promises..)

## Simple Types

“`{r,collapse=TRUE}

x <- 1

class(x)

y <- "Hello World"

class(y)

z <- TRUE

class(z)

as.integer(z)

“`

## Simple Types – Vectors

The basic type unit in R is a vector

“`{r, collapse=TRUE}

x <- c(1,2,3)

x

x <- 1:3

x[1]

x[0]

x[-1]

“`

## Generating Vectors

R provides lots of convenience functions for data generation:

“`{r,collapse=TRUE}

rep(0, 5)

seq(1,10)

seq(1,2,.1)

seq(1,2,length.out=6)

“`

## Indexing

“`{r,collapse=TRUE}

x <- c(1, 3, 4, 10, 15, 20, 50, 1, 6)

x > 10

which(x > 10)

x[x>10]

x[!x>10]

x[x<=10]

x[x>10 & x<30]

“`

## Functions {.smaller}

“`{r, collapse=TRUE}

square <- function(x) x^2

square(2)

pow <- function(x, p=2) x^p

pow(10)

pow(10,3)

pow(p=3,10)

“`

Functions can be passed as data:

“`{r,collapse=TRUE}

g <- function(x, f) f(x)

g(10, square)

h <- function(x,f,…) f(x,…)

h(10, pow, 3)

“`

## R is Vectorized

Example – multiplying two vectors:

“`{r}

mult <- function(x,y) {

z <- numeric(length(x))

for (i in 1:length(x)) {

z[i] <- x[i] * y[i]

}

z

}

mult(1:10,1:10)

“`

## R is Vectorized

Multiplying two vectors ‘the R way’:

“`{r}

1:10 * 1:10

“`

NOTE: R recycles vectors of unequal length:

“`{r}

1:10 * 1:2

“`

## NOTE: Random Number Generation

R contains a huge number of built-in random number generators for various probability distributions

“`{r}

# Normal variates, mean=0, sd=1

rnorm(10)

rnorm(10, mean=100, sd=5)

“`

Many different distributions available (the <b>r*</b> functions)

## Data Frames

– Data frames are the fundamental structure used in data analysis

– Similar to a database table in spirit (named columns, distinct types)

“`{r}

d <- data.frame(x=1:6, y="AUDUSD", z=c("one","two"))

d

“`

## Data Frames

Data frames can be indexed like a vector or matrix:

“`{r,collapse=TRUE}

# First row

d[1,]

# First column

d[,1]

# First and third cols, first two rows

d[1:2,c(1,3)]

“`

## Data Frames {.smaller}

Let’s generate some dummy data:

“`{r}

generateData <- function(N) data.frame(time=Sys.time()+1:N,

sym="AUDUSD",

bid=rep(1.2345,N)+runif(min=-.0010,max=.0010,N),

ask=rep(1.2356,N)+runif(min=-.0010,max=.0010,N),

exch=sample(c("EBS","RTM","CNX"),N, replace=TRUE))

prices <- generateData(50)

head(prices, 5)

“`

## Data Frames

We can add/remove columns on the fly:

“`{r}

prices$spread <- prices$ask-prices$bid

prices$mid <- (prices$bid + prices$ask) * 0.5

head(prices)

“`

## Data Frames

Some basic operations on data frames:

“`{r,collapse=TRUE}

names(prices)

table(prices$exch)

summary(prices$mid)

“`

## Data Frames {.smaller}

Operations can be applied across different dimensions of a data frame:

“`{r,collapse=TRUE}

sapply(prices,class)

“`

## Aggregations and Rollups {.smaller}

Aggregation and rollups:

“`{r}

tapply(prices$spread, prices$exch, mean)

“`

## Aggregations and Rollups {.smaller}

“`{r}

tapply(prices$mid, prices$exch, function(x) diff(log(x)))

“`

## Aggregations and Rollups {.smaller}

Aggregating summary statistics by time:

“`{r}

aggregate(prices[,c("bid","ask")],

by=list(bucket=cut(prices$time, "10 sec")), mean)

“`

## Aggregations and Rollups {.smaller}

“`{r}

aggregate(prices[,c("bid","ask")],

by=list(bucket=cut(prices$time, "10 sec")),

function(x) c(min=min(x),max=max(x)))

“`

## Aggregations and Rollups {.smaller}

“`{r}

prices <- generateData(2000)

prices$spread <- prices$ask-prices$bid

boxplot(prices$spread ~ prices$exch, main="Spread Distribution By Exchange")

“`

## Closures

R functions contain a reference to the enclosing environment:

“`{r}

add <- function(x) {

function(y) x + y

}

increment <- add(1)

increment(2)

“`

## R’s Lisp Influence

R is heavily influenced by Lisp/Scheme

cf. "Structure And Interpretation of Computer Programs"

<center><img src="sicp_cover.jpg" width="200" height="350"/></center>

## R’s Lisp Influence

“`{r}

`+`(2,3)

(`*` (2, (`+`(2,3))))

codetools::showTree(quote(2*(2+3)))

“`

## R’s Lisp Influence {.smaller}

Also evident in the underlying C code:

“` c

SEXP attribute_hidden do_prmatrix(SEXP call, SEXP op,

SEXP args, SEXP rho)

{

int quote;

SEXP a, x, rowlab, collab, naprint;

char *rowname = NULL, *colname = NULL;

checkArity(op,args);

PrintDefaults();

a = args;

x = CAR(a); a = CDR(a);

rowlab = CAR(a); a = CDR(a);

collab = CAR(a); a = CDR(a);

quote = asInteger(CAR(a)); a = CDR(a);

“`

## Operators Are Functions

“`{r}

`(`

(1+2)

`(` <- function(x) 42

(1+2)

rm("(")

“`

## Example: Median Absolute Deviation {.smaller}

$MAD(x) = median\left(\left|Y_i – \hat{Y}\right|\right)$

“`{r}

mad

“`

## Example: Median Absolute Deviation

Shows: lazy evaluation, filering, logical casting, if/else return values, partial sorting

<code>

<pre>

function (x, center = <span style=’background-color:yellow’>median(x)</span>, constant = 1.4826, na.rm = FALSE,

low = FALSE, high = FALSE)

{

if (na.rm) x <- <span style=’background-color:yellow’>x[!is.na(x)]</span>

n <- length(x)

constant * <span style=’background-color:yellow’>if ((low || high) && n%%2 == 0) {

if (low && high)

stop("’low’ and ‘high’ cannot be both TRUE")

n2 <- n%/%2 + <span style=’background-color:orange’>as.integer(high)</span>

sort(abs(x – center), <span style=’background-color:orange’>partial = n2</span>)[n2]

}

else median(abs(x – center))</span>

}

</pre>

</code>

## Example: Simulating Coin Tosses {.smaller}

What is the probability of exactly 3 heads in 10 coin tosses for a fair coin?

*Using binomial identity:*

$\binom{n}{k}p^{k}(1-p)^{(n-k)} = \binom{10}{3}\left(\frac{1}{2}\right)^{3}\left(\frac{1}{2}\right)^{7}$

“`{r}

choose(10,3)*(.5)^3*(.5)^7

“`

*Using binomial distribution density function:*

“`{r}

dbinom(prob=0.5, size=10, x=3)

“`

*Using simulation (100,000 tosses):*

“`{r}

sum(replicate(100000,sum(rbinom(prob=1/2, size=10, 1))==3))/100000

“`

## Example: Random Walk {.smaller}

Generate 1000 up-down movements based on a fair coin toss and plot:

“`{r}

x<-(cumsum(ifelse(rbinom(prob=0.5, size=1, 10000)==0,-1,1)))

plot(x, type=’l’, main=’Random Walk’)

“`

## Example: Generating Random Data {.smaller}

“`{r}

randomWalk <-function(N)(cumsum(ifelse(rbinom(prob=0.5, size=1, N)==0,-1,1)))

AUDUSD <- 1.2345 + randomWalk(1000)*.0001

plot(AUDUSD, type=’l’)

“`

## Example: OANDA FX Data {.smaller}

“`{r,message=FALSE,eval=FALSE}

require(quantmod);require(TTR)

EURUSD <- getSymbols("EUR/USD", src="oanda", auto.assign=FALSE)

plot(EURUSD)

lines(EMA(EURUSD,10), col=’red’)

lines(EMA(EURUSD,30), col=’blue’)

“`

<center><img src="oanda_eurusd.png" height="400px" width="600px" /></center>

## Example: Connecting to kdb+ {.smaller}

“`

Rorys-MacBook-Pro:kdb rorywinston$ <b>./rlwrap q/m32/q -p 5000</b>

KDB+ 3.1 2014.07.01 Copyright (C) 1993-2014 Kx Systems

m32/ 8()core 16384MB rorywinston rorys-macbook-pro.local 127.0.0.1 NONEXPIRE

Welcome to kdb+ 32bit edition

<b>q)\p</b>

5000i

<b>q) trades:([]time:100?.z.P;price:100?2.;

side:100?`B`S;exch:100?`CNX`RTM`EBS;sym:100?`EURUSD`AUDUSD`GBPUSD)</b>

<b>q)10#trades</b>

time price side exch sym

——————————————————–

2010.08.13D12:33:29.975458112 0.6019404 B CNX EURUSD

2001.11.24D20:53:58.972661440 0.7075032 S CNX EURUSD

2002.12.12D03:12:04.442386736 1.500898 S CNX GBPUSD

2002.02.12D21:48:33.887104336 0.6170263 S EBS AUDUSD

2014.05.01D06:59:44.647138496 0.8821325 S EBS GBPUSD

2010.12.06D15:30:14.928601664 1.094677 S RTM AUDUSD

2009.04.19D23:12:33.919967488 1.187474 B RTM GBPUSD

2008.07.18D22:13:25.681742656 0.1768144 B EBS GBPUSD

2010.08.22D10:16:15.261483520 0.3576458 S EBS AUDUSD

2010.02.28D13:49:33.686598976 1.526465 S RTM EURUSD

“`

## Example: Connecting to kdb+

“`{r,eval=FALSE}

setwd("/Users/rorywinston/sandbox/kdb")

source("qserver.R")

open_connection("localhost", 5000)

trades <- k("select from trades")

head(trades)

time price side exch sym

1 2010-08-13 22:33:29 0.6019404 B CNX EURUSD

2 2001-11-25 07:53:58 0.7075032 S CNX EURUSD

3 2002-12-12 14:12:04 1.5008982 S CNX GBPUSD

4 2002-02-13 08:48:33 0.6170263 S EBS AUDUSD

5 2014-05-01 16:59:44 0.8821325 S EBS GBPUSD

6 2010-12-07 02:30:14 1.0946771 S RTM AUDUSD

“`

kdb+ datatypes are converted to native R types

## Example: Reading Log Data From File

“`{r, eval=FALSE}

# Read file into data frame

logfile <- read.csv("/tmp/application.log", sep=",", header=FALSE)

# Set column descriptors

colnames(logfile) <- c("time","message","severity")

# Convert to native date/time

logfile$time <- as.POSIXct(strptime

(logfile$time, "%Y-%m-%d %H:%M:%OS"), tz="GMT")

“`

## Example: Using Datasets

The famous ‘Air passengers’ dataset

“`{r}

plot(AirPassengers)

“`

## Example: Using Datasets {.smaller}

The ‘Anscombe Quartet’ dataset

“`{r,collapse=TRUE,fig.height=4,fig.width=5,fig.align=’center’}

op <- par(mfrow=c(2,2),mar=rep(1,4))

with(anscombe,{plot(x1,y1,pch=20);plot(x2,y2,pch=20);

plot(x3,y3,pch=20);plot(x4,y4,pch=20)})

par(op)

“`

## Recommended Libraries

– ggplot2 – Mini-DSL for data visualization

– zoo/xts – Time series libraries

– Matrix – Enhanced matrix library

– plyr/reshape – Data reshaping/manipulation

– data.table – Faster data.frame manipulation

– e1071 – Machine learning/data mining functions

– caret – Statistical learning/training functions

– randomforest – Random forest library

– Rcpp – Convenient C++ interface

## Other Topics (Not Covered)

– S3/S4 Classes/Objects

– Packages

– Lazy Evaluation

– Formula Interface

– JIT Compilation/bytecode

– Debugging

– C/C++ Interfaces

– Reproducible Research (Sweave/knitr/markdown)

## Links

http://www.r-project.org

[/code]