Category: Statistics

Density Estimation of High-Frequency Financial Data

Post author By Rory Winston
Post date June 12, 2012

Frequently we will want to estimate the empirical probability density function of real-world data and compare it to the theoretical density from one or more probability distributions. The following example shows the empirical and theoretical normal density for EUR/USD high-frequency tick data $X$ (which has been transformed using log-returns and normalized via $\frac{X_i-\mu_X}{\sigma_X}$). The theoretical normal density is plotted over the range $\left(\lfloor\mathrm{min}(X)\rfloor,\lceil\mathrm{max}(X)\rceil\right)$. The results are in the figure below. The discontinuities and asymmetry of the discrete tick data, as well as the sharp kurtosis and heavy tails (a corresponding interval of $\approx \left[-8,+7\right]$ standard deviations away from the mean) are apparent from the plot.

tick density — Empirical and Theoretical Tick Density

We also show the theoretical and empirical density for the EUR/USD exchange rate log returns over different timescales. We can see from these plots that the distribution of the log returns seems to be asymptotically converging to normality. This is a typical empirical property of financial data.

density_scaled — Density Estimate Across Varying Timescales

The following R source generates empirical and theoretical density plots across different timescales. The data is loaded from files that are sampled at different intervals. I cant supply the data unfortunately, but you should get the idea.

[source lang=”R”]
# Function that reads Reuters CSV tick data and converts Reuters dates
# Assumes format is Date,Tick
readRTD <- function(filename) {
tickData <- read.csv(file=filename, header=TRUE, col.names=c("Date","Tick"))
tickData$Date <- as.POSIXct(strptime(tickData$Date, format="%d/%m/%Y %H:%M:%S"))
tickData
}

# Boilerplate function for Reuters FX tick data transformation and density plot
plot.reutersFXDensity <- function() {
filenames <- c("data/eur_usd_tick_26_10_2007.csv",
"data/eur_usd_1min_26_10_2007.csv",
"data/eur_usd_5min_26_10_2007.csv",
"data/eur_usd_hourly_26_10_2007.csv",
"data/eur_usd_daily_26_10_2007.csv")
labels <- c("Tick", "1 Minute", "5 Minutes", "Hourly", "Daily")

par(mfrow=c(length(filenames), 2),mar=c(0,0,2,0), cex.main=2)
tickData <- c()
i <- 1
for (filename in filenames) {
tickData[[i]] <- readRTD(filename)
# Transform: `$Y = \nabla\log(X_i)$`
logtick <- diff(log(tickData[[i]]$Tick))
# Normalize: `$\frac{(Y-\mu_Y)}{\sigma_Y}$`
logtick <- (logtick-mean(logtick))/sd(logtick)
# Theoretical density range: `$\left[\lfloor\mathrm{min}(Y)\rfloor,\lceil\mathrm{max}(Y)\rceil\right]$`
x <- seq(floor(min(logtick)), ceiling(max(logtick)), .01)
plot(density(logtick), xlab="", ylab="", axes=FALSE, main=labels[i])
lines(x,dnorm(x), lty=2)
#legend("topleft", legend=c("Empirical","Theoretical"), lty=c(1,2))
plot(density(logtick), log="y", xlab="", ylab="", axes=FALSE, main="Log Scale")
lines(x,dnorm(x), lty=2)
i <- i + 1
}
par(op)
}
[/source]

Coding Finance R Statistics

Binomial Pricing Trees in R

Post author By Rory Winston
Post date June 11, 2012

Binomial Tree Simulation

The binomial model is a discrete grid generation method from $t=0$ to $T$. At each point in time ($t+\Delta t$) we can move up with probability $p$ and down with probability $(1-p)$. As the probability of an up and down movement remain constant throughout the generation process, we end up with a recombining binary tree, or binary lattice. Whereas a balanced binomial tree with height $h$ has $2^{h+1}-1$ nodes, a binomial lattice of height $h$ has $\sum_{i=1}^{h}i$ nodes.

The algorithm to generate a binomial lattice of $M$ steps (i.e. of height $M$) given a starting value $S_0$, an up movement $u$, and down movement $d$, is:

FOR i=1 to M FOR j=0 to i STATE S(j,i) = S(0)*u^j*d^(n-j) ENDFOR ENDFOR

We can write this function in R and generate a graph of the lattice. A simple lattice generation function is below:

[source lang=”R”]
# Generate a binomial lattice
# for a given up, down, start value and number of steps
genlattice <- function(X0=100, u=1.1, d=.75, N=5) {
X <- c()
X[1] <- X0
count <- 2

for (i in 1:N) {
for (j in 0:i) {
X[count] <- X0 * u^j * d^(i-j)
count <- count + 1
}
}
return(X)
}
[/source]

We can generate a sample lattice of 5 steps using symmetric up-and-down values:

[source lang=”R”]
> genlattice(N=5, u=1.1, d=.9)
[1] 100.000 90.000 110.000 81.000 99.000 121.000 72.900 89.100 108.900 133.100 65.610
[12] 80.190 98.010 119.790 146.410 59.049 72.171 88.209 107.811 131.769 161.051
[/source]

In this case, the output is a vector of alternate up and down state values.

We can nicely graph a binomial lattice given a tool like graphviz, and we can easily create an R function to generate a graph specification that we can feed into graphviz:

[source lang=”R”]
function(S, labels=FALSE) {
shape <- ifelse(labels == TRUE, "plaintext", "point")

cat("digraph G {", "\n", sep="")
cat("node[shape=",shape,", samehead, sametail];","\n", sep="")
cat("rankdir=LR;","\n")

cat("edge[arrowhead=none];","\n")

# Create a dot node for each element in the lattice
for (i in 1:length(S)) {
cat("node", i, "[label=\"", S[i], "\"];", "\n", sep="")
}

# The number of levels in a binomial lattice of length N
# is `$\frac{\sqrt{8N+1}-1}{2}$`
L <- ((sqrt(8*length(S)+1)-1)/2 – 1)

k<-1
for (i in 1:L) {
tabs <- rep("\t",i-1)
j <- i
while(j>0) {
cat("node",k,"->","node",(k+i),";\n",sep="")
cat("node",k,"->","node",(k+i+1),";\n",sep="")
k <- k + 1
j <- j – 1
}
}

cat("}", sep="")
}
[/source]

This will simply output a dot script to the screen. We can capture this script and save it to a file by invoking:

[source lang=”R”]
> x<-capture.output(dotlattice(genlattice(N=8, u=1.1, d=0.9)))
> cat(x, file="/tmp/lattice1.dot")
[/source]

We can then invoke dot from the command-line on the generated file:

[source lang=”bash”]
$ dot -Tpng -o lattice.png -v lattice1.dot
[/source]

The resulting graph looks like the following:

If we want to add labels to the lattice vertices, we can add the labels attribute:

[source lang=”R”]
> x<-capture.output(dotlattice(genlattice(N=8, u=1.1, d=0.9), labels=TRUE))
> cat(x, file="/tmp/lattice1.dot")
[/source]

Coding R Statistics

Quasi-Random Number Generation in R

Post author By Rory Winston
Post date June 6, 2012

Random number generation is a core topic in numerical computer science. There are many efficient
algorithms for generating random (strictly speaking, pseudo-random) variates
from different probability distributions. The figure below shows a
sampling of 1000 two-dimensional random variates from the standard Gaussian and
Cauchy distributions, respectively. The size of the extreme deviations of the
Cauchy distribution is apparent from the graph.

However, sometimes we need to produce numbers that are more evenly distributed (quasi-random numbers). For example, in a Monte Carlo integration exercise, we can get faster convergence with a lower error bound using so-called low-discrepancy random sequences, using the GSL library. In the figure below, we show two-dimensional normal and Sobol (a low-discrepancy generator) variates.

To generate the graph below, I used the GSL library for R, as shown below:

[source lang=”R”]
library(gsl)
q <- qrng_alloc(type="sobol", 2)
rs <- qrng_get(q,1000)
par(mfrow=c(3,1))
plot(rnorm(1000), rnorm(1000), pch=20, main="~N(0,1)",
ylab="", xlab="")
plot(rs, pch=20, main="Sobol",
ylab="", xlab="")
plot(rcauchy(1000), rcauchy(1000),pch=20,
main="~C(0,1)", ylab="",xlab="")
[/source]

The property of low-discrepancy generators is even more apparent if we view the random variates in a higher dimension, for example the figure below shows the variates as a 3-dimensional cube. Note how the clustering around the centre of the cube is much more pronounced for the Gaussian cube.

To plot the figure above, I used the GSL and Lattice libraries:

[source lang=”R”]
library(gsl)
library(lattice)
q <- qrng_alloc(type="sobol", 3)
npoints <- 200
rs <- qrng_get(q,npoints)
ltheme <- canonical.theme(color = FALSE)
ltheme$strip.background$col <- "transparent"
lattice.options(default.theme = ltheme)
trellis.par.set(layout.heights =
list(top.padding = -20,
main.key.padding = 1,
key.axis.padding = 0,
axis.xlab.padding = 0,
xlab.key.padding = 0,
key.sub.padding = 0,
bottom.padding = -20))

# Plot the normal variates in a 3-dim cube
p1 <- cloud(rnorm(npoints) ~ rnorm(npoints) + rnorm(npoints), xlab="x", ylab="y",
zlab="z", pch=20, main="~N(0,1)")
p2 <- cloud(rs[,1] ~ rs[,2] + rs[,3], xlab="x", ylab="y",
zlab="z", pch=20, main="Sobol")
print(p1, split=c(1,1,2,1), more=TRUE)
print(p2, split=c(2,1,2,1))
[/source]