Plotting Tick Data with ggplot2

Here are some examples of using ggplot2 and kdb+ together to produce some simple graphs of data stored in kdb+. I am using the qserver extension for R (http://code.kx.com/wsvn/code/cookbook_code/r/) to connect to a running kdb+ instance from within R.

First, let's create a dummy data set: a set of evenly spaced timestamps and a random walk price series:

ONE_SEC:`long$1e9
tab:([]time:.z.P+ONE_SEC * (til 1000);price:sums ?[(1000?1.)<0.5;-1;1])

Then import the data into R (h is the handle of an open qserver connection):

[source lang="r"]
> tab <- execute(h, 'select from tab')
[/source]

Then plot a simple line graph. Remember that ggplot2 works natively with data frames:

[source lang="r"]
> library(ggplot2)
> ggplot(tab, aes(x=time, y=price)) + geom_line() + ggtitle("Stock Price Evolution")
[/source]

This produces a line graph of the price series.

Next, we can do a simple bin count / histogram on the price series:

[source lang="r"]
> ggplot(tab, aes(x=(price))) + geom_histogram()
[/source]

This produces a histogram of the price levels.

We can adjust the bin width to get a more granular graph using the binwidth parameter:

[source lang="r"]
> ggplot(tab, aes(x=(price))) + geom_histogram(position="identity", binwidth=1)
[/source]

We can also make use of some aesthetic attributes, e.g. fill color. Here we shade the histogram by the number of observations in each bin:

[source lang="r"]
> ggplot(tab, aes(x=(price), fill=..count..)) + geom_histogram(position="identity", binwidth=1)
[/source]

The result is the same histogram with each bar shaded by its count.

Some other graphs:

Say I have a data frame with a bunch of currency tick data (bid/offer/mid prices). The currencies are interspersed. Here is a sample:

[source lang="r"]
> head(ccys)
     sym               timestamp    bid    ask     mid
1 AUDJPY 2013-01-15 11:00:16.127 94.485 94.496 94.4905
2 AUDJPY 2013-01-15 11:00:22.592 94.486 94.496 94.4910
3 AUDJPY 2013-01-15 11:00:30.117 94.498 94.505 94.5015
4 AUDJPY 2013-01-15 11:00:30.325 94.498 94.506 94.5020
5 AUDJPY 2013-01-15 11:00:37.118 94.499 94.507 94.5030
6 AUDJPY 2013-01-15 11:00:47.348 94.526 94.536 94.5310
[/source]

I want to add a column containing the log-returns calculated separately for each currency:

[source lang="r"]
# x is a list of per-currency data frames; lr is 0 for the first row of each
log.ret <- function(x) do.call("rbind",
  lapply(seq_along(x), function(i)
    cbind(x[[i]], lr=c(0, diff(log(x[[i]]$mid))))))
[/source]
0 1i
q)y:(x*3)+5
q) `int$(avg y; dev y)
5 3i

Probability Distribution Functions

As well as random variate generation, rmathlib also provides other functions, e.g. the normal density function:

q)dnorm[0;0;1]
0.3989423

This computes the normal density at 0 for a standard normal distribution. The second and third parameters are the mean and standard deviation of the distribution.

The normal distribution function is also provided:

q)pnorm[0;0;1]
0.5

This computes the distribution value at 0 for a standard normal (again with mean and standard deviation parameters).

Finally, there is the quantile function, which is the inverse of the distribution function: the quantile value for .99 is the point at which the distribution function reaches .99, roughly 2.33:

q)qnorm[.99;0;1]
2.326348

We can do a round-trip via pnorm() and qnorm():

q)`int$ qnorm[ pnorm[3;0;1]-pnorm[-3;0;1]; 0; 1]
3i

That's it for the distribution functions for now. rmathlib provides lots of different distributions (I have just linked in the normal and uniform functions for now). There are some other functions that I have created that I will cover in a future post.

All code is on github: https://github.com/rwinston/kdb-rmathlib

[Check out part 3 of this series]

Integrating Rmathlib and kdb+

The R engine is usable in a variety of ways – one of the lesser-known features is that it provides a standalone math library that can be linked to from an external application. This library provides some nice functionality such as:

* Probability distribution functions (density/distribution/quantile functions);
* Random number generation for a large number of probability distributions

In order to make use of this functionality from q, I built a simple Rmathlib wrapper library. The C wrapper can be found here and is simply a set of functions that wrap the appropriate calls in Rmathlib. For example, a function to generate N randomly-generated Gaussian values using the underlying rnorm() function is:

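[source lang="c"]
/* Return a float vector of n Gaussian variates with mean mu and sd sigma. */
K rnn(K n, K mu, K sigma) {
    int i, count = n->i;                      /* number of variates requested */
    K ret = ktn(KF, count);                   /* allocate a float vector of that length */
    for (i = 0; i < count; ++i)
        kF(ret)[i] = rnorm(mu->f, sigma->f);  /* one Rmathlib draw per element */
    return ret;
}
[/source]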

The wrapper functions have to be imported and linked from a kdb+ session, which is done using special directives (the 2: verb). I decided to automate the process of generating these directives: the shell script below parses a set of function declarations in a delimited section of a C header file and produces the appropriate load statements:

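[source lang="bash"]
INFILE=rmath.h
DLL=\`:rmath

echo "dll:$DLL"

# Collect the declarations between the BEGIN DECL / END DECL markers,
# dropping the leading K return type.
DECLARATIONS=$(awk '/\/\/ BEGIN DECL/ {f=1;next} /\/\/ END DECL/ {f=0} f {sub(/K /,"",$0);print $0}' $INFILE)

# The loop word-splits, so each declaration must contain no further spaces.
for decl in $DECLARATIONS; do
    FNAME=${decl%%(*}
    ARGS=${decl##$FNAME}
    IFS=, read -r -a CMDARGS <<< "$ARGS"
    echo "${FNAME}:dll 2:(\`$FNAME;${#CMDARGS[*]})"
done

echo "\\l rmath_aux.q"
[/source]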

This generates a set of link commands such as the following:

[code]
dll:`:rmath
rn:dll 2:(`rn;2)
rnn:dll 2:(`rnn;3)
dn:dll 2:(`dn;3)
pn:dll 2:(`pn;3)
qn:dll 2:(`qn;3)
sseed:dll 2:(`sseed;2)
gseed:dll 2:(`gseed;1)
nchoosek:dll 2:(`nchoosek;2)
[/code]

It also generates a call to load a second q script, rmath_aux.q, which contains a bunch of q wrappers and helper functions (I will write a separate post about that later).
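To make the name and argument-count extraction concrete, here is the same idea sketched in C. This is not part of the project: parse_decl and link_stmt are hypothetical helpers, and the input is assumed to be a declaration with the return type already stripped, as the script's awk step leaves it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Split a declaration such as "rnn(K n, K mu, K sigma);" into its function
 * name and argument count. The name is copied into name (capacity cap);
 * the return value is the argument count, or -1 if the text is malformed.
 */
int parse_decl(const char *decl, char *name, size_t cap) {
    assert(decl != NULL && name != NULL && cap > 0);
    const char *open = strchr(decl, '(');
    if (open == NULL) return -1;
    const char *close = strchr(open, ')');
    if (close == NULL) return -1;

    size_t len = (size_t)(open - decl);
    if (len == 0 || len >= cap) return -1;
    memcpy(name, decl, len);
    name[len] = '\0';

    /* Count commas; blank parentheses mean zero arguments. */
    int nonblank = 0, commas = 0;
    for (const char *p = open + 1; p < close; ++p) {
        if (*p == ',') ++commas;
        else if (*p != ' ' && *p != '\t') nonblank = 1;
    }
    return nonblank ? commas + 1 : 0;
}

/* Write the matching q load statement into out; 0 on success, -1 on error. */
int link_stmt(const char *decl, char *out, size_t cap) {
    char name[64];
    int nargs = parse_decl(decl, name, sizeof name);
    if (nargs < 0) return -1;
    int n = snprintf(out, cap, "%s:dll 2:(`%s;%d)", name, name, nargs);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

Calling link_stmt on "rnn(K n, K mu, K sigma);" writes the same rnn line that appears in the generated output above.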

A makefile is included which generates the shared lib (once the appropriate paths to the R source files are set) and the q scripts. A sample q session looks like the following:

q) \l rmath.q
q) x:rnorm 1000 / generate 1000 normal variates
q) dnorm[0;0;1] / normal density at 0 for a mean 0 sd 1 distribution

The project is available on github: https://github.com/rwinston/kdb-rmathlib.

Note that loading rmath.q loads the rmath dll, which in turn loads the rmathlib dll, so the rmathlib dll should be available on the dynamic library load path.

[Check out Part 2 of this series]