## The Geometry Of Least Squares

Here’s a simple and intuitive way of looking at the geometry of a least squares regression:

Take the bottom left point in the triangle below as the origin O. For the linear model:

$$Y=X\beta + \epsilon$$

Both $$Y$$ and $$X\beta$$ are vectors, and the residual vector $$\epsilon$$ is the difference between them. The standard least squares technique uses the squared length of the residual, $$\epsilon^T\epsilon = (Y-X\beta)^T(Y-X\beta)$$, as the error measure to be minimised, and minimising it leads to the calculation of the $$\beta$$ coefficient vector.

Geometrically, the beta coefficients calculated by the least squares regression minimise the squared length of the error vector. The fitted vector $$X\beta$$ turns out to be the projection of $$Y$$ onto the column space of $$X$$ – i.e. the residual is the perpendicular vector that turns (O, $$Y$$, $$X\beta$$) into a right-angled triangle.

The projection of $$Y$$ onto $$X\beta$$ is done using the projection matrix P, which is defined as

$P = X\left(X^{T}X\right)^{-1}X^{T}$

So $$X\beta = \hat{Y} = PY$$.

Using the Pythagorean theorem:

$$Y^TY = \hat{Y}^T\hat{Y} + (Y-X\beta)^T(Y-X\beta)$$

In other words, the total sum of squares = sum of squares due to regression + residual sum of squares. This is a fundamental part of analysis of variance techniques.
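This decomposition is easy to verify numerically. Here is a quick sanity check in Python (a sketch for illustration; the posts themselves use R and q), with a single regressor and no intercept so that $$\beta$$ collapses to a scalar:

```python
# Least squares as projection: beta = (X'X)^{-1} X'Y, here with one regressor
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

dot = lambda a, b: sum(p * q for p, q in zip(a, b))

beta = dot(x, y) / dot(x, x)                  # scalar form of (X'X)^{-1} X'Y
y_hat = [beta * xi for xi in x]               # projection of Y onto x, i.e. PY
resid = [yi - yh for yi, yh in zip(y, y_hat)]

# The residual is orthogonal to the fitted vector...
print(abs(dot(resid, y_hat)) < 1e-9)          # True
# ...so Pythagoras holds: Y'Y = Yhat'Yhat + e'e
print(abs(dot(y, y) - dot(y_hat, y_hat) - dot(resid, resid)) < 1e-9)  # True
```

The orthogonality is exact by construction, which is precisely why the total sum of squares splits cleanly into the regression and residual sums of squares.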

## Plotting Tick Data with ggplot2

Here are some examples of using ggplot2 and kdb+ together to produce some simple graphs of data stored in kdb+. I am using the qserver extension for R (http://code.kx.com/wsvn/code/cookbook_code/r/) to connect to a running kdb+ instance from within R.

First, let's create a dummy data set: a set of evenly-spaced timestamps and a random walk price series:

[source lang="q"]
ONE_SEC:`long$1e9
tab:([]time:.z.P+ONE_SEC * (til 1000);price:sums ?[1000?1.<0.5;-1;1])
[/source]

Then import the data into R:

[source lang="r"]
> tab <- execute(h,'select from tab')
[/source]

Then plot a simple line graph – remember ggplot2 works natively with data frames:

[source lang="r"]
> library(ggplot2)
> ggplot(tab, aes(x=time, y=price)) + geom_line() + ggtitle("Stock Price Evolution")
[/source]

This will produce a line graph similar to the one below:

Next, we can do a simple bin count / histogram on the price series:

[source lang="r"]
ggplot(tab, aes(x=price)) + geom_histogram()
[/source]

Which will produce a graph like the following:

We can adjust the bin width to get a more granular graph using the binwidth parameter:

[source lang="r"]
> ggplot(tab, aes(x=price)) + geom_histogram(position="identity", binwidth=1)
[/source]

We can also make use of some aesthetic attributes, e.g. fill colour – we can shade the histogram by the number of observations in each bin:

[source lang="r"]
ggplot(tab, aes(x=price, fill=..count..)) + geom_histogram(position="identity", binwidth=1)
[/source]

Which results in:

Some other graphs: say I have a data frame with a bunch of currency tick data (bid/offer/mid prices), with the currencies interspersed. Here is a sample:

[source lang="r"]
> head(ccys)
     sym               timestamp    bid    ask     mid
1 AUDJPY 2013-01-15 11:00:16.127 94.485 94.496 94.4905
2 AUDJPY 2013-01-15 11:00:22.592 94.486 94.496 94.4910
3 AUDJPY 2013-01-15 11:00:30.117 94.498 94.505 94.5015
4 AUDJPY 2013-01-15 11:00:30.325 94.498 94.506 94.5020
5 AUDJPY 2013-01-15 11:00:37.118 94.499 94.507 94.5030
6 AUDJPY 2013-01-15 11:00:47.348 94.526 94.536 94.5310
[/source]

I want to add a column containing the log-returns, calculated separately for each currency:

[source lang="r"]
log.ret <- function(x) do.call("rbind",
  lapply(seq_along(x),
         function(i) cbind(x[[i]], lr=c(0, diff(log(x[[i]]$mid))))))
ccys <- log.ret(split(ccys, f=ccys$sym))
[/source]

Then I can plot a simple line chart of the log returns, using the lovely facets feature in ggplot to split out a separate panel per symbol:

[source lang="r"]
ggplot(ccys, aes(x=timestamp, y=lr)) + geom_line() + facet_grid(sym ~ .)
[/source]

Which produces the following:

Another nice graph: a visual summary of the tick frequency by time. This one uses a dummy column that represents a tick arrival. Note that in the following graph I have reduced the line width and set the alpha value (creating a large transparency effect), as otherwise the density of tick lines is too great. The overall effect is visually pleasing:

[source lang="r"]
ccys <- cbind(ccys, dummy=rep(1, nrow(ccys)))
ggplot(ccys, aes(x=timestamp, y=dummy, ymin=0, ymax=1)) +
  geom_linerange(alpha=1/2, size=.01, width=.01) +
  facet_grid(sym ~ .) +
  theme(axis.text.y=element_blank()) +
  xlab("time") + ylab("ticks") +
  ggtitle("Tick Density")
[/source]

Which results in:

Next time (time permitting): covariance matrices and bivariate density plots.

## Rmathlib and kdb+ part 3: Utility Functions

In the first two parts of this series, I looked at the basics of the interface I created between rmathlib and kdb+. In this post, I'll go through some of the convenience functions I wrote to emulate some basic R functionality.

NOTE: I wrote these functions as a learning exercise, to familiarise myself with q/kdb+ a bit more – they are very simplistic, and I am sure there are much better ways to do them (I know the very nice qml project does a lot of this stuff, plus interfaces to external BLAS libraries).

Some basics: load the library:

[source lang="q"]
q)\l rmath.q
[/source]

### seq – sequence function

The first convenience function is seq, which is like R's seq() or NumPy's arange() in that it takes a start and end point and generates a range of numbers:

[source lang="q"]
q)seq[1;100]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26...
[/source]

The function is just a curried wrapper around seqn, which also takes a step size:

[source lang="q"]
q)seqn[1;10;.5]
1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 9 9.5 10
[/source]

This is handy when working with some of the probability generation functions shown last time. For instance, the commands below calculate the normal distribution (cdf) values for evenly-spaced points from -3 to 3:

[source lang="q"]
q)each[{pnorm[x;0;1]}] seqn[-3;3;.5]
0.001349898 0.006209665 0.02275013 0.0668072...
[/source]

### table – Symbol Tabulation

R provides a handy function to show summary count tables of factor levels. In q, this can be used as follows:

[source lang="q"]
q)t:([] sym:1000?`A`B`C`D)
q)table[`sym;t]
B| 241
A| 244
C| 256
D| 259
[/source]

This tabulates the column sym from the table t. The summary is ordered by increasing frequency count. NOTE: this has really no advantage over the standard

[source lang="q"]
q)`x xasc select count i by sym from t
sym| x
---| ---
B  | 241
A  | 244
C  | 256
D  | 259
[/source]

except some slight brevity.

### range – Min/Max

The range function simply returns the boundaries of a set of values:

[source lang="q"]
q)x:rnorm[10000]
q)range[x]
-3.685814 4.211363
q)abs (-) . range[x]   / absolute range
7.897177
[/source]

### summary – Min/Max/Mean/Median/IQR

The summary function provides summary stats, a la R's summary() function:

[source lang="q"]
q)x:norm[10000;3;2]
q)summary x
min   | -4.755305
1q    | 1.59379
median| 2.972523
mean  | 2.966736
3q    | 4.336589
max   | 10.00284
[/source]

### quantile – Quantile Calculations

A very simple quantile calc:

[source lang="q"]
q)x:norm[10000;3;2]
q)quantile[x;.5]
2.973137
[/source]

### hist – Bin Count

Very crude bin count – specify the data and the number of bins:

[source lang="q"]
q)hist[x;10]
-4.919383| 14
-3.279589| 101
-1.639794| 601
0        | 1856
1.639794 | 3043
3.279589 | 2696
4.919383 | 1329
6.559177 | 319
8.198972 | 40
9.838766 | 1
[/source]

### diag – Identity Matrix Generation

[source lang="q"]
q)diag 10
1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 1
[/source]

### Scale Functions

Sometimes it's very useful to scale an input data set – e.g. when feeding multiple inputs into a statistical model, large differences between the relative scales of the inputs, combined with finite-precision computer arithmetic, can result in some inputs being dwarfed by others. The scale function adjusts the input as follows: $$X_{s}=(X-\mu)/\sigma$$. The example below scales two inputs with different ranges:

[source lang="q"]
q)x:norm[10;0;1]; y:norm[10;5;3]
q)x
1.920868 -1.594028 -0.02312519 1.079606 -0.5310111 0.2762119 0.1218428 0.9584264 -0.4244091 -0.7981221
q)y
10.69666 2.357529 8.93505 3.65696 5.218461 3.246216 5.971919 7.557135 1.412827 1.246241
q)range x
-1.594028 1.920868
q)range y
1.246241 10.69666
q)range scale x
-1.74507 1.878671
q)range scale y
-1.232262 1.845551
[/source]

There are other useful scaling measures, including the min/max scale: $$\frac{x-\mathrm{min}(x)}{\mathrm{max}(x)-\mathrm{min}(x)}$$. This is implemented using the minmax function:

[source lang="q"]
q)minmax x
1 0 0.4469273 0.760658 0.302432 0.5320897 0.4881712 0.726182 0.3327606 0.226438
q)range minmax x
0 1f
[/source]

There are other functions which are useful for scaling, e.g. the RMSD (root-mean-square deviation) $$\sqrt{\frac{\sum{x_i^2}}{N}}$$:

[source lang="q"]
q)x:rnorm 1000
q)rms x
1.021065
[/source]

### nchoosek – Combinations

The nchoosek function calculates the number of combinations of N items taken k at a time (i.e. $$\binom{N}{k}$$):

[source lang="q"]
q)nchoosek[100;5]
7.528752e+07
q)each[{nchoosek[10;x]}] seq[1;10]
10 45 120 210 252 210 120 45 10 1f
[/source]

The source file is available here: https://github.com/rwinston/kdb-rmathlib/blob/master/rmath_aux.q

## Rmathlib and kdb+, part 2 – Probability Distribution Functions

Following on from the last post on integrating some rmathlib functionality with kdb+, here is a sample walkthrough of how some of the functionality can be used, including some of the R-style wrappers I wrote to emulate some of the most commonly-used R commands in q.

### Loading the rmath library

Firstly, load the rmathlib library interface:

[source lang="q"]
q)\l rmath.q
[/source]

### Random Number Generation

R provides random number generation facilities for a number of distributions. This is provided using a single underlying uniform generator (R provides many different RNG implementations, but in the case of Rmathlib it uses a Marsaglia-multicarry type generator), and then different techniques are used to generate numbers distributed according to the selected distribution. The standard technique is inversion, where a uniformly distributed number in [0,1] is mapped to a different distribution using the inverse of the probability distribution function. This is explained very nicely in the book "Non-Uniform Random Variate Generation", which is available in full here: http://luc.devroye.org/rnbookindex.html

In order to make random variate generation consistent and reproducible across R and kdb+, we need to be able to seed the RNG. The default RNG in rmathlib takes two integer seeds. We can set this in an R session as follows:

[source lang="r"]
> .Random.seed[2:3] <- as.integer(c(123,456))
[/source]

and the corresponding q command is:

[source lang="q"]
q)sseed[123;456]
[/source]

Conversely, getting the current seed value can be done using:

[source lang="q"]
q)gseed[]
123 456i
[/source]

The underlying uniform generator can be accessed using runif:

[source lang="q"]
q)runif[100;3;4]
3.102089 3.854157 3.369014 3.164677 3.998812 3.092924 3.381564 3.991363 3.369..
[/source]

This produces 100 random variates uniformly distributed on [3,4]. Normal variates can then be generated:

[source lang="q"]
q)rnorm 10
-0.2934974 -0.334377 -0.4118473 -0.3461507 -0.9520977 0.9882516 1.633248 -0.5957762 -1.199814 0.04405314
[/source]

This produces identical results in R:

[source lang="r"]
> rnorm(10)
 [1] -0.2934974 -0.3343770 -0.4118473 -0.3461507 -0.9520977  0.9882516  1.6332482 -0.5957762 -1.1998144
[10]  0.0440531
[/source]

Normally-distributed variates with a distribution of $$N(\mu,\sigma)$$ can also be generated:

[source lang="q"]
q)dev norm[10000;3;1.5]
1.519263
q)avg norm[10000;3;1.5]
2.975766
[/source]

Or we can alternatively scale a standard normal $$X \sim N(0,1)$$ using $$Y = \sigma X + \mu$$:

[source lang="q"]
q)x:rnorm[1000]
q)`int$(avg x; dev x)
0 1i
q)y:(x*3)+5
q)`int$(avg y; dev y)
5 3i
[/source]

### Probability Distribution Functions

As well as random variate generation, rmathlib also provides other functions, e.g. the normal density function:

[source lang="q"]
q)dnorm[0;0;1]
0.3989423
[/source]

computes the normal density at 0 for a standard normal distribution (the second and third parameters are the mean and standard deviation of the distribution).

The normal distribution function is also provided:

[source lang="q"]
q)pnorm[0;0;1]
0.5
[/source]

computes the distribution function value at 0 for a standard normal (again with mean and standard deviation parameters).

Finally, the quantile function is the inverse of the distribution function – see the graph below, where the quantile value for .99 is mapped onto the distribution function value at that point, 2.32:

[source lang="q"]
q)qnorm[.99;0;1]
2.326348
[/source]

We can do a round trip via pnorm() and qnorm():

[source lang="q"]
q)`int$ qnorm[ pnorm[3;0;1]-pnorm[-3;0;1]; 0; 1]
3i
[/source]
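The inversion technique described above is easy to demonstrate outside of rmathlib too. As an illustrative sketch (in Python with only the standard library, rather than q or R), uniform variates are pushed through the normal inverse CDF, and the q round trip above is reproduced:

```python
import random
from statistics import NormalDist

random.seed(123)
std = NormalDist(0, 1)

# Inversion: map U(0,1) variates through the inverse CDF (quantile function)
u = [random.random() for _ in range(10000)]
z = [std.inv_cdf(ui) for ui in u]
print(abs(sum(z) / len(z)) < 0.05)   # sample mean is close to 0

# Round trip, as in the q example: qnorm(pnorm(3) - pnorm(-3)) ~ 2.78
p = std.cdf(3) - std.cdf(-3)         # ~ 0.9973
print(round(std.inv_cdf(p)))         # 3
```

Any distribution with a computable inverse CDF can be sampled this way from the same underlying uniform stream, which is why seeding the single uniform generator is enough to make all the variate generators reproducible.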

That's it for the distribution functions for now – rmathlib provides lots of different distributions (I have just linked in the normal and uniform functions so far). There are some other functions that I have created that I will cover in a future post.

All code is on github: https://github.com/rwinston/kdb-rmathlib

[Check out part 3 of this series]

## Density Estimation of High-Frequency Financial Data

Frequently we will want to estimate the empirical probability density function of real-world data and compare it to the theoretical density from one or more probability distributions. The following example shows the empirical and theoretical normal density for EUR/USD high-frequency tick data $$X$$ (which has been transformed using log-returns and normalized via $$\frac{X_i-\mu_X}{\sigma_X}$$). The theoretical normal density is plotted over the range $$\left(\lfloor\mathrm{min}(X)\rfloor,\lceil\mathrm{max}(X)\rceil\right)$$. The results are in the figure below. The discontinuities and asymmetry of the discrete tick data, as well as the sharp kurtosis and heavy tails (a corresponding interval of $$\approx \left[-8,+7\right]$$ standard deviations away from the mean) are apparent from the plot.

We also show the theoretical and empirical density for the EUR/USD exchange rate log returns over different timescales. We can see from these plots that the distribution of the log returns seems to be asymptotically converging to normality. This is a typical empirical property of financial data.
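The log-return transform and normalization described above can be sketched in a few lines. The following is a Python illustration on a synthetic price path (since the original tick data is not available):

```python
import math, random, statistics

def log_returns(prices):
    """r_i = log(p_{i+1}) - log(p_i)"""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]

def standardize(xs):
    """(x - mean) / stdev, so the series is comparable to a standard normal."""
    mu, sd = statistics.fmean(xs), statistics.pstdev(xs)
    return [(v - mu) / sd for v in xs]

# synthetic geometric random walk standing in for the EUR/USD ticks
random.seed(1)
prices = [1.3000]
for _ in range(5000):
    prices.append(prices[-1] * math.exp(random.gauss(0, 0.0005)))

r = standardize(log_returns(prices))
print(abs(statistics.fmean(r)) < 1e-9, abs(statistics.pstdev(r) - 1) < 1e-9)
```

After this transform the empirical density of `r` can be compared directly against the standard normal density, which is exactly the comparison made in the plots.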

The following R source generates empirical and theoretical density plots across the different timescales. The data is loaded from files that are sampled at different intervals. I can't supply the data, unfortunately, but you should get the idea.

[source lang="R"]
# Function that reads Reuters CSV tick data and converts Reuters dates
# Assumes format is Date,Tick
read.tickData <- function(filename) {  # function name assumed; header truncated in source
  tickData <- read.csv(filename)
  tickData$Date <- as.POSIXct(strptime(tickData$Date, format="%d/%m/%Y %H:%M:%S"))
  tickData
}

# Boilerplate function for Reuters FX tick data transformation and density plot
plot.reutersFXDensity <- function() {
  filenames <- c("data/eur_usd_tick_26_10_2007.csv",
                 "data/eur_usd_1min_26_10_2007.csv",
                 "data/eur_usd_5min_26_10_2007.csv",
                 "data/eur_usd_hourly_26_10_2007.csv",
                 "data/eur_usd_daily_26_10_2007.csv")
  labels <- c("Tick", "1 Minute", "5 Minutes", "Hourly", "Daily")

  op <- par(mfrow=c(length(filenames), 2), mar=c(0,0,2,0), cex.main=2)
  tickData <- list()
  i <- 1
  for (filename in filenames) {
    tickData[[i]] <- read.csv(filename)  # read the raw Date,Tick series
    # Transform: $Y = \nabla\log(X_i)$
    logtick <- diff(log(tickData[[i]]$Tick))
    # Normalize: $\frac{(Y-\mu_Y)}{\sigma_Y}$
    logtick <- (logtick - mean(logtick)) / sd(logtick)
    # Theoretical density range: $\left[\lfloor\mathrm{min}(Y)\rfloor,\lceil\mathrm{max}(Y)\rceil\right]$
    x <- seq(floor(min(logtick)), ceiling(max(logtick)), .01)
    plot(density(logtick), xlab="", ylab="", axes=FALSE, main=labels[i])
    lines(x, dnorm(x), lty=2)
    #legend("topleft", legend=c("Empirical","Theoretical"), lty=c(1,2))
    plot(density(logtick), log="y", xlab="", ylab="", axes=FALSE, main="Log Scale")
    lines(x, dnorm(x), lty=2)
    i <- i + 1
  }
  par(op)
}
[/source]

## Binomial Pricing Trees in R

### Binomial Tree Simulation

The binomial model is a discrete grid generation method from $$t=0$$ to $$T$$. At each point in time ($$t+\Delta t$$) we can move up with probability $$p$$ and down with probability $$(1-p)$$. As the probabilities of an up and a down movement remain constant throughout the generation process, we end up with a recombining binary tree, or binomial lattice. Whereas a balanced binary tree of height $$h$$ has $$2^{h+1}-1$$ nodes, a binomial lattice of height $$h$$ has only $$\sum_{i=1}^{h+1}i$$ nodes. The algorithm to generate a binomial lattice of $$M$$ steps (i.e. of height $$M$$), given a starting value $$S_0$$, an up movement $$u$$, and a down movement $$d$$, is:

    FOR i=1 to M
      FOR j=0 to i
        STATE S(j,i) = S(0) * u^j * d^(i-j)
      ENDFOR
    ENDFOR

We can write this function in R and generate a graph of the lattice. A simple lattice generation function is below:

[source lang="R"]
# Generate a binomial lattice
# for a given up, down, start value and number of steps
genlattice <- function(X0=100, u=1.1, d=.75, N=5) {
  X <- c()
  X[1] <- X0
  count <- 2
  for (i in 1:N) {
    for (j in 0:i) {
      X[count] <- X0 * u^j * d^(i-j)
      count <- count + 1
    }
  }
  return(X)
}
[/source]

We can generate a sample lattice of 5 steps using symmetric up-and-down values:

[source lang="R"]
> genlattice(N=5, u=1.1, d=.9)
 [1] 100.000  90.000 110.000  81.000  99.000 121.000  72.900  89.100 108.900 133.100  65.610
[12]  80.190  98.010 119.790 146.410  59.049  72.171  88.209 107.811 131.769 161.051
[/source]

In this case, the output is a vector of alternating up and down state values. We can nicely graph a binomial lattice given a tool like Graphviz, and we can easily create an R function to generate a graph specification that we can feed into Graphviz:

[source lang="R"]
dotlattice <- function(S, labels=FALSE) {
  shape <- ifelse(labels == TRUE, "plaintext", "point")

  cat("digraph G {", "\n", sep="")
  cat("node[shape=", shape, ", samehead, sametail];", "\n", sep="")
  cat("rankdir=LR;", "\n")
  cat("edge[arrowhead=none];", "\n")

  # Create a dot node for each element in the lattice
  for (i in 1:length(S)) {
    cat("node", i, "[label=\"", S[i], "\"];", "\n", sep="")
  }

  # The number of levels in a binomial lattice of length N
  # is $\frac{\sqrt{8N+1}-1}{2}$
  L <- ((sqrt(8 * length(S) + 1) - 1) / 2 - 1)

  k <- 1
  for (i in 1:L) {
    tabs <- rep("\t", i - 1)
    j <- i
    while (j > 0) {
      cat("node", k, "->", "node", (k + i), ";\n", sep="")
      cat("node", k, "->", "node", (k + i + 1), ";\n", sep="")
      k <- k + 1
      j <- j - 1
    }
  }
  cat("}", sep="")
}
[/source]

This will simply output a dot script to the screen. We can capture this script and save it to a file by invoking:

[source lang="R"]
> x <- capture.output(dotlattice(genlattice(N=8, u=1.1, d=0.9)))
> cat(x, file="/tmp/lattice1.dot")
[/source]

We can then invoke dot from the command line on the generated file:

[source lang="bash"]
$ dot -Tpng -o lattice.png -v lattice1.dot
[/source]

The resulting graph looks like the following:

If we want to add labels to the lattice vertices, we can set the labels argument:

[source lang="R"]
> x<-capture.output(dotlattice(genlattice(N=8, u=1.1, d=0.9), labels=TRUE))
> cat(x, file="/tmp/lattice1.dot")
[/source]
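The lattice construction itself is language-agnostic; a direct Python port of genlattice (my illustration, not from the original post) makes the node-count formula easy to check:

```python
def genlattice(X0=100.0, u=1.1, d=0.75, N=5):
    """Level i of the lattice holds the i+1 values X0 * u^j * d^(i-j), j=0..i."""
    X = [X0]
    for i in range(1, N + 1):
        for j in range(i + 1):
            X.append(X0 * u**j * d**(i - j))
    return X

lat = genlattice(N=5, u=1.1, d=0.9)
print(len(lat))           # (N+1)(N+2)/2 = 21 nodes for N=5
print(round(lat[-1], 3))  # all-up state: 100 * 1.1^5 = 161.051
```

The quadratic node count (versus the exponential growth of a non-recombining tree) is exactly why recombining lattices are the practical choice for pricing.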

## Quasi-Random Number Generation in R

Random number generation is a core topic in numerical computer science. There are many efficient
algorithms for generating random (strictly speaking, pseudo-random) variates
from different probability distributions. The figure below shows a
sampling of 1000 two-dimensional random variates from the standard Gaussian and
Cauchy distributions, respectively. The size of the extreme deviations of the
Cauchy distribution is apparent from the graph.

However, sometimes we need to produce numbers that are more evenly distributed: so-called quasi-random numbers. For example, in a Monte Carlo integration exercise, we can get faster convergence with a lower error bound by using low-discrepancy sequences. In the figure below, we show two-dimensional normal variates alongside Sobol (a low-discrepancy generator) variates.

To generate the graph below, I used the GSL library for R, as shown below:

[source lang="R"]
library(gsl)
q <- qrng_alloc(type="sobol", 2)
rs <- qrng_get(q,1000)
par(mfrow=c(3,1))
plot(rnorm(1000), rnorm(1000), pch=20, main="~N(0,1)",
ylab="", xlab="")
plot(rs, pch=20, main="Sobol",
ylab="", xlab="")
plot(rcauchy(1000), rcauchy(1000),pch=20,
main="~C(0,1)", ylab="",xlab="")
[/source]

The property of low-discrepancy generators is even more apparent if we view the random variates in a higher dimension, for example the figure below shows the variates as a 3-dimensional cube. Note how the clustering around the centre of the cube is much more pronounced for the Gaussian cube.

To plot the figure above, I used the GSL and Lattice libraries:

[source lang="R"]
library(gsl)
library(lattice)
q <- qrng_alloc(type="sobol", 3)
npoints <- 200
rs <- qrng_get(q,npoints)
ltheme <- canonical.theme(color = FALSE)
ltheme$strip.background$col <- "transparent"
lattice.options(default.theme = ltheme)
trellis.par.set(layout.heights = list(top.padding=0, bottom.padding=0)) # padding settings are an assumption; the original line was truncated

# Plot the normal variates in a 3-dim cube
p1 <- cloud(rnorm(npoints) ~ rnorm(npoints) + rnorm(npoints), xlab="x", ylab="y",
zlab="z", pch=20, main="~N(0,1)")
p2 <- cloud(rs[,1] ~ rs[,2] + rs[,3], xlab="x", ylab="y",
zlab="z", pch=20, main="Sobol")
print(p1, split=c(1,1,2,1), more=TRUE)
print(p2, split=c(2,1,2,1))
[/source]
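To see what "low discrepancy" means mechanically, here is a tiny pure-Python Halton sequence (a different low-discrepancy generator from the Sobol one used above; this sketch is my own, not part of the GSL binding):

```python
def van_der_corput(n, base=2):
    """Radical inverse of n in the given base: the 1-D building block of Halton."""
    v, denom = 0.0, 1.0
    while n:
        n, rem = divmod(n, base)
        denom *= base
        v += rem / denom
    return v

def halton(n_points, bases=(2, 3)):
    """Low-discrepancy points in the unit square (one coprime base per dimension)."""
    return [tuple(van_der_corput(i, b) for b in bases) for i in range(1, n_points + 1)]

pts = halton(1000)
# every coordinate lies in [0, 1); successive points fill the square evenly
print(all(0.0 <= c < 1.0 for p in pts for c in p))   # True
```

Because each new point deliberately lands in the largest remaining gap, the sequence avoids the clustering that independent uniform draws exhibit, which is the effect visible in the cube plots.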

## High-Frequency Statistical Arbitrage

Computational statistical arbitrage systems are now de rigueur, especially for high-frequency, liquid markets (such as FX).
Statistical arbitrage can be defined as an extension of riskless arbitrage,
and is quantified more precisely as an attempt to exploit small and consistent regularities in
asset price dynamics through use of a suitable framework for statistical modelling.

Statistical arbitrage has been defined formally (e.g. by Jarrow) as a zero initial cost, self-financing strategy with cumulative discounted value $$v(t)$$ such that:

- $$v(0) = 0$$,
- $$\lim_{t\to\infty} E^P[v(t)] > 0$$,
- $$\lim_{t\to\infty} P(v(t) < 0) = 0$$,
- $$\lim_{t\to\infty} \frac{Var^P[v(t)]}{t}=0 \mbox{ if } P(v(t)<0) > 0 \mbox{ , } \forall{t} < \infty$$

These conditions can be described as follows: (1) the position has a zero initial cost (it is a self-financing trading strategy), (2) the expected discounted profit is positive in the limit, (3) the probability of a loss converges to zero, and (4) a time-averaged variance measure converges to zero if the probability of a loss does not become zero in finite time. The fourth condition separates a standard arbitrage from a statistical arbitrage opportunity.

We can represent a statistical arbitrage condition as

$$\left| \phi(X_t - SA(X_t))\right| < \mbox{TransactionCost}$$

where $$\phi()$$ is the payoff (profit) function, $$X$$ is an arbitrary asset (or weighted basket of assets), and $$SA(X)$$ is a synthetic asset constructed to replicate the payoff of $$X$$. Some popular statistical arbitrage techniques are described below.

### Index Arbitrage

Index arbitrage is a strategy undertaken when the traded value of an index (for example, the index futures price) moves sufficiently far away from the weighted components of the index (see Hull for details). For example, for an equity index, the no-arbitrage condition could be expressed as:

$$\left| F_t - \sum_{i} w_i S_t^i e^{(r-q_i)(T-t)}\right| < \mbox{Cost}$$

where $$q_i$$ is the dividend rate for stock $$i$$, and $$F_t$$ is the index futures price at time $$t$$. The deviation between the futures price and the weighted index basket is called the basis. Index arbitrage was one of the earliest applications of program trading. An alternative form of index arbitrage was a system where sufficient deviations in the forecasted variance of the relationship (estimated by regression) between index pairs and the implied volatilities (estimated from index option prices) on the indices were classed as an arbitrage opportunity. There are many variations on this theme in operation based on the VIX market today.

### Pairs Trading

Statistical pairs trading is based on the notion of relative pricing: securities with similar characteristics should be priced roughly equally. Typically, a long-short position in two assets is created such that the portfolio is uncorrelated to market returns (i.e. it has a negligible beta). The basis in this case is the spread between the two assets. Depending on whether the trader expects the spread to contract or expand, the trade action is called shorting the spread or buying the spread. Such trades are also called convergence trades.

A popular and powerful statistical technique used in pairs trading is cointegration, which is the identification of a linear combination of multiple non-stationary data series to form a stationary (and hence predictable) series.
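A bare-bones sketch of the spread construction is below (hedged: this uses a simple OLS hedge ratio and z-score, not a full cointegration test such as Engle-Granger, and the data is a toy series of my own):

```python
import statistics

def hedge_ratio(y, x):
    """OLS slope of y on x, used to form the spread s = y - b*x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return cov / sum((xi - mx) ** 2 for xi in x)

def spread_zscore(y, x):
    """Standardized spread; entries/exits are triggered on z-score thresholds."""
    b = hedge_ratio(y, x)
    s = [yi - b * xi for xi, yi in zip(x, y)]
    mu, sd = statistics.fmean(s), statistics.pstdev(s)
    return [(v - mu) / sd for v in s]

# toy pair: y tracks roughly 2*x plus noise
x = [1.00, 1.10, 0.90, 1.20, 1.05, 0.95, 1.15, 1.00]
y = [2.02, 2.19, 1.81, 2.42, 2.08, 1.93, 2.31, 1.99]
z = spread_zscore(y, x)
# e.g. short the spread when z > 2, buy it when z < -2
print(abs(statistics.fmean(z)) < 1e-9)   # standardized: mean ~ 0
```

If the two series really are cointegrated, the spread is stationary and the z-score is mean-reverting, which is what makes the threshold rule a trade signal rather than noise.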

In recent years, computer algorithms have become the decision-making machines behind many trading strategies. The ability to deal with large numbers of inputs, utilise long variable histories, and quickly evaluate quantitative conditions to produce a trading signal has made algorithmic trading systems the natural evolutionary step in high-frequency financial applications. Originally, the main focus of algorithmic trading systems was on market-impact-neutral strategies (e.g. Volume-Weighted Average Price and Time-Weighted Average Price trading); however, their scope has widened considerably, and much of the work previously performed by manual systematic traders can now be done by "black box" algorithms.

Trading algorithms are no different from human traders in that they need an unambiguous measure of performance, i.e. risk versus return. The ubiquitous Sharpe ratio ($$\frac{\mu_r - \mu_f}{\sigma}$$) is a popular measure, although others are also used. A commonly used measure of trading performance is total return, which is defined as

$R_T \equiv \sum_{j=1}^{n}r_j$

over a number of transactions $$n$$, with a return per transaction $$r_j$$. The annualized total return is defined as $$R_A = R_T \frac{d_A}{d_T}$$, where $$d_A$$ is the number of trading days in a year, and $$d_T$$ is the number of days in the trading period covered by $$R_T$$. The maximum drawdown over a certain time period is defined as $$D_T \equiv \max(R_{t_a}-R_{t_b} \mid t_0 \leq t_a \leq t_b \leq t_E)$$, where $$T = t_E - t_0$$, and $$R_{t_a}$$ and $$R_{t_b}$$ are the total returns over the periods from $$t_0$$ to $$t_a$$ and from $$t_0$$ to $$t_b$$ respectively. A resulting indicator is the Stirling ratio, which is defined as

$SR = \frac{R_T}{D_T}$
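These definitions translate directly into code; the following is a short illustrative Python sketch of total return, maximum drawdown, and the Stirling ratio:

```python
from itertools import accumulate

def total_return(returns):
    """R_T = sum of per-transaction returns r_j."""
    return sum(returns)

def max_drawdown(returns):
    """D_T = largest peak-to-trough drop of the cumulative return series."""
    dd, peak = 0.0, float("-inf")
    for c in accumulate(returns):
        peak = max(peak, c)
        dd = max(dd, peak - c)
    return dd

def stirling_ratio(returns):
    return total_return(returns) / max_drawdown(returns)

r = [0.02, -0.01, 0.03, -0.04, 0.01, 0.02]
print(round(total_return(r), 2))    # 0.03
print(round(max_drawdown(r), 2))    # 0.04
print(round(stirling_ratio(r), 2))  # 0.75
```

Here the cumulative series peaks at 0.04 after the third trade and falls back to 0.00 after the fourth, giving the 0.04 drawdown used in the denominator.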

High-frequency tick data possesses certain characteristics which are not as apparent in aggregated data. Some of these characteristics include:

- Non-normal characteristic probability distributions: high-frequency data may have large kurtosis (heavy tails) and be asymmetrically skewed;
- Diurnal seasonality: an intraday seasonal pattern influenced by the restrictions on trading times in markets. For instance, trading activity may be busiest at the start and end of the trading day. This may not apply so much to foreign exchange, as the FX market is a decentralized 24-hour operation; however, we may see trend patterns in tick interarrival times around business end-of-day times in particular locations;
- Real-time high-frequency data may contain errors, missing or duplicated tick values, or other anomalies. Whilst historical data feeds will normally contain corrections to data anomalies, real-time data collection processes must be aware that adjustments may need to be made to the incoming data feeds.

## The Bias-Variance Tradeoff

Probably one of the nicest explanations of the bias-variance tradeoff is the one I found in the book Introduction to Information Retrieval (the full book is available online). The tradeoff can be explained mathematically, and also more intuitively. The mathematical explanation is as follows:

If we have a learning method that operates on a given set of input data (call it $x$) and a "real" underlying process that we are trying to approximate (call it $\alpha$), then the expected (squared) error is:

$$\begin{aligned} E[x-\alpha]^2 &= Ex^2 - 2Ex\alpha + \alpha^2\\ &= (Ex)^2 - 2Ex\alpha + \alpha^2 + Ex^2 - 2(Ex)^2 + (Ex)^2\\ &= [Ex-\alpha]^2 + Ex^2 - E2x(Ex) + E(Ex)^2\\ &= [Ex-\alpha]^2 + E[x-Ex]^2 \end{aligned}$$

Taking advantage of the linearity of expectation and adding a few extra cancelling terms, we end up with the representation:

$$\mathrm{Error} = \underbrace{[Ex-\alpha]^2}_{\mathrm{bias}} + \underbrace{E[x-Ex]^2}_{\mathrm{variance}}$$

That's the mathematical equivalence. However, a more descriptive approach is as follows:

Bias is the squared difference between the true underlying distribution and the prediction of the learning process, averaged over the input datasets. Consistently wrong predictions mean large bias; bias is small when the predictions are consistently right, i.e. when the average error across different training sets is roughly zero. Bias represents the domain knowledge built into the learning process: a linear assumption, for example, may be unsuitable for a nonlinear problem, which is why linear models generally have high bias on nonlinear problems.

Variance is the variation in prediction (its consistency): it is large if different training sets result in different learning models. Linear models will generally have lower variance. High variance generally results in overfitting: in effect, the learning model is learning from noise, and will not generalize well.

It's a useful analogy to think of most learning models as a box with two dials, bias and variance, where the setting of one affects the other. We can only try to find the "right" setting for the situation we are working with. Hence the bias-variance tradeoff.
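The decomposition above can also be checked by simulation. The following Python sketch (my illustration, not from the text) builds a deliberately biased, noisy estimator and confirms that its mean squared error splits exactly into squared bias plus variance:

```python
import random, statistics

random.seed(42)
alpha = 1.0                    # the "true" quantity being approximated

# a biased, noisy estimator: systematically off by +0.2, with noise of sd 0.5
estimates = [alpha + 0.2 + random.gauss(0, 0.5) for _ in range(100_000)]

mse = statistics.fmean([(e - alpha) ** 2 for e in estimates])
bias_sq = (statistics.fmean(estimates) - alpha) ** 2   # ~ 0.04
variance = statistics.pvariance(estimates)             # ~ 0.25

# MSE = bias^2 + variance (an algebraic identity for in-sample means)
print(abs(mse - (bias_sq + variance)) < 1e-6)   # True
```

Turning the "noise dial" down shrinks the variance term while leaving the bias term untouched, which is the two-dials picture in numbers.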