Functional Selects/Updates in kdb+

Functional selects/updates are a relatively trick topic in kdb+ – mainly as the syntax takes a lot of getting used to. They are normally required when there are some dynamic elements in e.g. column selection or grouping criteria.

They are pretty well covered in Q For Mortals, but I wanted to add a couple of examples…combining functional select and update for example:

To start, load the sample tables in sp.q..we will use the table called ‘p’:

\l sp.q
p | name color weight city w
–| —————————-
p1| nut red 12 london 91
p2| bolt green 17 paris 91
p3| screw blue 17 rome 91
p4| screw red 14 london 91
p5| cam blue 12 paris 91
p6| cog red 19 london 91

Functional Selects

A simple select from p with some criteria:

q)select from p where city=`london
p | name color weight city w
–| —————————-
p1| nut red 12 london 91
p4| screw red 14 london 91
p6| cog red 19 london 91

Now lets look at the parse tree for this query:

q)parse "select from p where city=`london"

Now in order to convert this to a functional select, we need to turn this parse tree into an executable statement using the ? operator.

The basic form of the ? operator is

?[tablename;(select criteria);(grouping criteria);(columns)]

The parse tree above gives us each of the four elements in the right order – we just need to convert them to a valid functional syntax. For the example above this translates to:

q)?[p;enlist (=;`city;enlist `london);0b;()]
p | name color weight city w
–| —————————-
p1| nut red 12 london 91
p4| screw red 14 london 91
p6| cog red 19 london 91

So if we want to add e.g. column selection:

q)select name,color,weight from p where city=`london
name color weight
nut red 12
screw red 14
cog red 19

The parse tree looks like:

q)parse"select name,color,weight from p where city=`london"

In this case we have added a dictionary mapping selected columns to their output names:

q)?[p;enlist (=;`city;enlist `london);0b;(`name`color`weight)!(`name`color`weight)]
name color weight
nut red 12
screw red 14
cog red 19

Similarly, we can change the select criteria:

q)select name,color,weight from p where city in `london`paris
name color weight
nut red 12
bolt green 17
screw red 14
cam blue 12
cog red 19

Which produces the following parse tree:

q)parse"select name,color,weight from p where city in `london`paris"

This is only a small modification to the original functional select:

q)?[p;enlist (in;`city;enlist `london`paris);0b;(`name`color`weight)!(`name`color`weight)]
name color weight
nut red 12
bolt green 17
screw red 14
cam blue 12
cog red 19

Functional Updates

Functional updates follow an almost identical form, but use the ! operator, e..g

q)update w:sum weight by city from p
p | name color weight city w
–| —————————-
p1| nut red 12 london 45
p2| bolt green 17 paris 29
p3| screw blue 17 rome 17
p4| screw red 14 london 45
p5| cam blue 12 paris 29
p6| cog red 19 london 45

q)parse "update w:sum weight by city from p"

This parse tree maps to:

q)![p;();(enlist `city)!enlist `city;(enlist `w)!enlist (sum;`weight)]
p | name color weight city w
–| —————————-
p1| nut red 12 london 45
p2| bolt green 17 paris 29
p3| screw blue 17 rome 17
p4| screw red 14 london 45
p5| cam blue 12 paris 29
p6| cog red 19 london 45

Now if we want to update the output from a select, e.g. a simple grouped update:

q)update w:sum weight by color from select name,color,weight from p where city in `london`paris
name color weight w
nut red 12 45
bolt green 17 17
screw red 14 45
cam blue 12 12
cog red 19 45

The parse tree for this looks like:

q)parse"update w:sum weight by color from select name,color,weight from p where city in `london`paris"

This parse tree looks complex, but the main complexity comes from the nested functional select within.

We could explicitly write the entire function (nested select and update):

q)![?[p;enlist (in;`city;enlist `london`paris);0b;`name`color`weight!`name`color`weight];();(enlist `color)!enlist `color;(enlist `w)!enlist (sum;`weight)]
name color weight w
nut red 12 45
bolt green 17 17
screw red 14 45
cam blue 12 12
cog red 19 45

However it may be easier to read if we store the select in its own variable:

q)sel::?[p;enlist (in;`city;enlist `london`paris);0b;`name`color`weight!`name`color`weight]
name color weight
nut red 12
bolt green 17
screw red 14
cam blue 12
cog red 19

And then refer to the select thus:

q)![sel;();(enlist `color)!enlist `color;(enlist `w)!enlist (sum;`weight)]
name color weight w
nut red 12 45
bolt green 17 17
screw red 14 45
cam blue 12 12
cog red 19 45

History of k/kdb+ Presentation

Last year at the inaugural Singapore kx meetup, my colleague Ajay and I gave a short presentation on the history and background behind the K language and kdb+. I did get some feedback from people like Arthur Whitney and Rob Hodgkinson on the presentation, and even got some great old pics (courtesy of Rob) of Rob/Arthur/Ken Iverson from way back! I have attached the PDF – its an interesting story and an incredibly powerful language if you have not already tried it.

Presentation here: Introduction to kdb+

Introductory R Presentation

I put together a short intro presentation for some people explaining a little bit about R from an introductory point of view. Slides put together with R/markdown and ioslides.

Presentation here.

Here is the raw markdown if you are interested:

[code lang=”r”]

title: "R Introduction"
author: "Rory Winston"
date: "2 August 2014"
ioslides_presentation: default
fig_height: 6
fig_width: 8
keep_tex: yes
logo: r_logo.png
self_contained: no
fontsize: 10pt

## What is R?

– A Domain-Specific-Language (DSL) for statistics and data analysis
– Based on the S Programming Language
– An environment for Exploratory Data Analysis (EDA)
– A quasi-functional language with IDE and REPL
– A vectorized language with BLAS support
– A collection of over 7,000+ libraries
– A large and active community across industry and academia
– Around 20 years old (Lineage dates from 1975 – almost 40 years ago)



## Types

– Primitives (numeric, integer, character, logical, factor)
– Data Frames
– Lists
– Tables
– Arrays
– Environments
– Others (functions, closures, promises..)

## Simple Types
x <- 1

y <- "Hello World"

z <- TRUE


## Simple Types – Vectors

The basic type unit in R is a vector

“`{r, collapse=TRUE}
x <- c(1,2,3)
x <- 1:3

## Generating Vectors

R provides lots of convenience functions for data generation:

rep(0, 5)

## Indexing

x <- c(1, 3, 4, 10, 15, 20, 50, 1, 6)
x > 10
which(x > 10)
x[x>10 & x<30]

## Functions {.smaller}

“`{r, collapse=TRUE}
square <- function(x) x^2

pow <- function(x, p=2) x^p


Functions can be passed as data:

g <- function(x, f) f(x)
g(10, square)

h <- function(x,f,…) f(x,…)
h(10, pow, 3)

## R is Vectorized

Example – multiplying two vectors:

mult <- function(x,y) {
z <- numeric(length(x))
for (i in 1:length(x)) {
z[i] <- x[i] * y[i]


## R is Vectorized

Multiplying two vectors ‘the R way’:

1:10 * 1:10

NOTE: R recycles vectors of unequal length:
1:10 * 1:2

## NOTE: Random Number Generation

R contains a huge number of built-in random number generators for various probability distributions

# Normal variates, mean=0, sd=1
rnorm(10, mean=100, sd=5)

Many different distributions available (the <b>r*</b> functions)

## Data Frames

– Data frames are the fundamental structure used in data analysis
– Similar to a database table in spirit (named columns, distinct types)

d <- data.frame(x=1:6, y="AUDUSD", z=c("one","two"))

## Data Frames

Data frames can be indexed like a vector or matrix:

# First row

# First column

# First and third cols, first two rows

## Data Frames {.smaller}

Let’s generate some dummy data:

generateData <- function(N) data.frame(time=Sys.time()+1:N,
exch=sample(c("EBS","RTM","CNX"),N, replace=TRUE))

prices <- generateData(50)
head(prices, 5)

## Data Frames

We can add/remove columns on the fly:
prices$spread <- prices$ask-prices$bid
prices$mid <- (prices$bid + prices$ask) * 0.5

## Data Frames

Some basic operations on data frames:



## Data Frames {.smaller}

Operations can be applied across different dimensions of a data frame:


## Aggregations and Rollups {.smaller}

Aggregation and rollups:

tapply(prices$spread, prices$exch, mean)

## Aggregations and Rollups {.smaller}

tapply(prices$mid, prices$exch, function(x) diff(log(x)))

## Aggregations and Rollups {.smaller}

Aggregating summary statistics by time:

by=list(bucket=cut(prices$time, "10 sec")), mean)

## Aggregations and Rollups {.smaller}

by=list(bucket=cut(prices$time, "10 sec")),
function(x) c(min=min(x),max=max(x)))

## Aggregations and Rollups {.smaller}

prices <- generateData(2000)
prices$spread <- prices$ask-prices$bid
boxplot(prices$spread ~ prices$exch, main="Spread Distribution By Exchange")

## Closures

R functions contain a reference to the enclosing environment:

add <- function(x) {
function(y) x + y

increment <- add(1)

## R’s Lisp Influence

R is heavily influenced by Lisp/Scheme

cf. "Structure And Interpretation of Computer Programs"

<center><img src="sicp_cover.jpg" width="200" height="350"/></center>

## R’s Lisp Influence
(`*` (2, (`+`(2,3))))

## R’s Lisp Influence {.smaller}

Also evident in the underlying C code:

“` c
SEXP attribute_hidden do_prmatrix(SEXP call, SEXP op,
SEXP args, SEXP rho)
int quote;
SEXP a, x, rowlab, collab, naprint;
char *rowname = NULL, *colname = NULL;

a = args;
x = CAR(a); a = CDR(a);
rowlab = CAR(a); a = CDR(a);
collab = CAR(a); a = CDR(a);

quote = asInteger(CAR(a)); a = CDR(a);

## Operators Are Functions
`(` <- function(x) 42

## Example: Median Absolute Deviation {.smaller}
$MAD(x) = median\left(\left|Y_i – \hat{Y}\right|\right)$

## Example: Median Absolute Deviation

Shows: lazy evaluation, filering, logical casting, if/else return values, partial sorting
function (x, center = <span style=’background-color:yellow’>median(x)</span>, constant = 1.4826, na.rm = FALSE,
low = FALSE, high = FALSE)
if (na.rm) x <- <span style=’background-color:yellow’>x[!]</span>
n <- length(x)
constant * <span style=’background-color:yellow’>if ((low || high) && n%%2 == 0) {
if (low && high)
stop("’low’ and ‘high’ cannot be both TRUE")
n2 <- n%/%2 + <span style=’background-color:orange’>as.integer(high)</span>
sort(abs(x – center), <span style=’background-color:orange’>partial = n2</span>)[n2]
else median(abs(x – center))</span>

## Example: Simulating Coin Tosses {.smaller}

What is the probability of exactly 3 heads in 10 coin tosses for a fair coin?

*Using binomial identity:*
$\binom{n}{k}p^{k}(1-p)^{(n-k)} = \binom{10}{3}\left(\frac{1}{2}\right)^{3}\left(\frac{1}{2}\right)^{7}$


*Using binomial distribution density function:*
dbinom(prob=0.5, size=10, x=3)

*Using simulation (100,000 tosses):*
sum(replicate(100000,sum(rbinom(prob=1/2, size=10, 1))==3))/100000

## Example: Random Walk {.smaller}

Generate 1000 up-down movements based on a fair coin toss and plot:

x<-(cumsum(ifelse(rbinom(prob=0.5, size=1, 10000)==0,-1,1)))
plot(x, type=’l’, main=’Random Walk’)

## Example: Generating Random Data {.smaller}
randomWalk <-function(N)(cumsum(ifelse(rbinom(prob=0.5, size=1, N)==0,-1,1)))
AUDUSD <- 1.2345 + randomWalk(1000)*.0001
plot(AUDUSD, type=’l’)

## Example: OANDA FX Data {.smaller}
EURUSD <- getSymbols("EUR/USD", src="oanda", auto.assign=FALSE)
lines(EMA(EURUSD,10), col=’red’)
lines(EMA(EURUSD,30), col=’blue’)
<center><img src="oanda_eurusd.png" height="400px" width="600px" /></center>

## Example: Connecting to kdb+ {.smaller}

Rorys-MacBook-Pro:kdb rorywinston$ <b>./rlwrap q/m32/q -p 5000</b>
KDB+ 3.1 2014.07.01 Copyright (C) 1993-2014 Kx Systems
m32/ 8()core 16384MB rorywinston rorys-macbook-pro.local NONEXPIRE
Welcome to kdb+ 32bit edition
<b>q) trades:([]time:100?.z.P;price:100?2.;
time price side exch sym
2010.08.13D12:33:29.975458112 0.6019404 B CNX EURUSD
2001.11.24D20:53:58.972661440 0.7075032 S CNX EURUSD
2002.12.12D03:12:04.442386736 1.500898 S CNX GBPUSD
2002.02.12D21:48:33.887104336 0.6170263 S EBS AUDUSD
2014.05.01D06:59:44.647138496 0.8821325 S EBS GBPUSD
2010.12.06D15:30:14.928601664 1.094677 S RTM AUDUSD
2009.04.19D23:12:33.919967488 1.187474 B RTM GBPUSD
2008.07.18D22:13:25.681742656 0.1768144 B EBS GBPUSD
2010.08.22D10:16:15.261483520 0.3576458 S EBS AUDUSD
2010.02.28D13:49:33.686598976 1.526465 S RTM EURUSD

## Example: Connecting to kdb+

open_connection("localhost", 5000)
trades <- k("select from trades")
time price side exch sym
1 2010-08-13 22:33:29 0.6019404 B CNX EURUSD
2 2001-11-25 07:53:58 0.7075032 S CNX EURUSD
3 2002-12-12 14:12:04 1.5008982 S CNX GBPUSD
4 2002-02-13 08:48:33 0.6170263 S EBS AUDUSD
5 2014-05-01 16:59:44 0.8821325 S EBS GBPUSD
6 2010-12-07 02:30:14 1.0946771 S RTM AUDUSD

kdb+ datatypes are converted to native R types

## Example: Reading Log Data From File

“`{r, eval=FALSE}
# Read file into data frame
logfile <- read.csv("/tmp/application.log", sep=",", header=FALSE)
# Set column descriptors
colnames(logfile) <- c("time","message","severity")
# Convert to native date/time
logfile$time <- as.POSIXct(strptime
(logfile$time, "%Y-%m-%d %H:%M:%OS"), tz="GMT")


## Example: Using Datasets

The famous ‘Air passengers’ dataset

## Example: Using Datasets {.smaller}
The ‘Anscombe Quartet’ dataset

op <- par(mfrow=c(2,2),mar=rep(1,4))

## Recommended Libraries

– ggplot2 – Mini-DSL for data visualization
– zoo/xts – Time series libraries
– Matrix – Enhanced matrix library
– plyr/reshape – Data reshaping/manipulation
– data.table – Faster data.frame manipulation
– e1071 – Machine learning/data mining functions
– caret – Statistical learning/training functions
– randomforest – Random forest library
– Rcpp – Convenient C++ interface

## Other Topics (Not Covered)

– S3/S4 Classes/Objects
– Packages
– Lazy Evaluation
– Formula Interface
– JIT Compilation/bytecode
– Debugging
– C/C++ Interfaces
– Reproducible Research (Sweave/knitr/markdown)

## Links

The Geometry Of Least Squares

Here’s a simple and intuitive way of looking at the geometry of a least squares regression:

Take the bottom left point in the triangle below as the origin O. For the linear model:

$$ Y=X\beta + \epsilon $$

Both $Y$ and \(X\beta\) are vectors, and the residual vector \(\epsilon\) is the difference. The standard least squares error technique uses \(\epsilon^2\) or \((Y-X\beta)^T(Y-X\beta)\) as the error measure to be minimised, and this leads to the calculation of the \(\beta\) coefficient vector.

Least Squares Geometry
Least Squares Geometry

Geometrically, the beta coefficients calculated by the least squares regression minimise the squared length of the error vector. This turns out to be the projection of \(Y\) on to \(X\beta\) – i.e. the perpendicular vector that turns (O, \(Y\), \(X\beta\)) into a right-angled triangle.

The projection of \(Y\) onto \(X\beta\) is done using the projection matrix P, which is defined as

\[ P = X\left(X^{T}X\right)^{-1}X^{T} \]

So \( X\beta = \hat{Y} = PY \).

Using the Pythagorean theorem:

\( Y^TY = \hat{Y}^T\hat{Y} + (Y-X\beta)^T(Y-X\beta) \)

In other words, the total sum of squares = sum of squares due to regression + residual sum of squares. This is a fundamental part of analysis of variance techniques.

Plotting Tick Data with ggplot2

Here are some examples of using ggplot2 and kdb+ together to produce some simple graphs of data stored in kdb+. I am using the qserver extension for R ( to connect to a running kdb+ instance from within R.

First, lets create a dummy data set: a set of evenly-spaced timestamps and a random walk price series:

tab:([]time:.z.P+ONE_SEC * (til 1000);price:sums?[1000?1.<0.5;-1;1])

Then import the data into R:

[source lang="r"]>tab <- execute(h,'select from tab')[/source]

Then plot a simple line graph - remember ggplot2 works natively with data frames:

[source lang="r"]>library(ggplot2)
>ggplot(tab, aes(x=time, y=price)) + geom_line()
+ ggtitle("Stock Price Evolution")[/source]

This will produce a line graph similar to the one below:

Simple line chart

Next, we can do a simple bin count / histogram on the price series:

[source lang="r"]ggplot(tab, aes(x=(price))) + geom_histogram()[/source]

Which will produce a graph like the following:


We can adjust the bin width to get a more granular graph using the binwidth parameter:

[source lang="r"]> ggplot(tab, aes(x=(price)))
+ geom_histogram(position="identity", binwidth=1)[/source]


We can also make use of some aesthetic attributes, e.g. fill color - we can shade the histogram by the number of observations in each bin:

[source lang="r"]ggplot(tab, aes(x=(price), fill=..count..))
+ geom_histogram(position="identity", binwidth=1)[/source]

Which results in:


Some other graphs: Say I have a data frame with a bunch of currency tick data (bid/offer/mid prices). The currencies are interspersed. Here is a sample:

[source lang="r"]
> head(ccys)
sym timestamp bid ask mid
1 AUDJPY 2013-01-15 11:00:16.127 94.485 94.496 94.4905
2 AUDJPY 2013-01-15 11:00:22.592 94.486 94.496 94.4910
3 AUDJPY 2013-01-15 11:00:30.117 94.498 94.505 94.5015
4 AUDJPY 2013-01-15 11:00:30.325 94.498 94.506 94.5020
5 AUDJPY 2013-01-15 11:00:37.118 94.499 94.507 94.5030
6 AUDJPY 2013-01-15 11:00:47.348 94.526 94.536 94.5310

I want to add a column containing the log-returns calculated separately for each currency:

[source lang="r"]
log.ret <- function(x)"rbind",
cbind(x[[i]],lr=c(0, diff(log(x[[i]]$mid))))))
ccys <- log.ret(split(ccys, f=ccys$sym))

Then I can plot a simple line chart of the log returns using the lovely facets feature in ggplot to split out a separate panel per symbol:

[source lang="r"]
ggplot(ccys, aes(x=timestamp, y=lr))
+ geom_line()
+ facet_grid(sym ~ .)

Which produces the following:


Another nice graph - display a visual summary of the tick frequency by time. This one uses a dummy column that represents a tick arrival. Note in the following graph I have reduced the line width and set the alpha value to a tiny value (creating a large transparency effect) as otherwise the density of tick lines is too great. The overall effect is visually pleasing:

[source lang="r"]
ccys <- cbind(ccys, dummy=rep(1,nrow(ccys)))
ggplot(ccys, aes(x=timestamp,y=dummy, ymin=0, ymax=1))
+ geom_linerange(alpha=1/2,size=.01,width=.01)
+ facet_grid(sym ~ .)
+ theme(axis.text.y=element_blank())
+ xlab("ticks") + ylab("time")
+ ggtitle("Tick Density")

Which results in:


Next time (time permitting) - covariance matrices and bivariate density plots..

Rmathlib and kdb+ part 3: Utility Functions

In the first two parts of this series, I looked at the basics of the interface I created between rmathlib and kdb+. In this post, I’ll go through some of the convenience functions I wrote to emulate some basic R functionality.

NOTE: I wrote these functions as a learning exercise to familiarise myself with q/kdb+ a bit more – they are very simplistic and I am sure there are much better ways to do them (I know the very nice qml project does a lot of this stuff, plus interfaces to external BLAS libraries).

Some basics: load the library:

q)\l rmath.q

seq – sequence function

The first convenience function is seq, which is like R’s seq() or Numpy’s arange() in that it takes a start and end point, and generates a range of numbers.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26...

The function is just a curried wrapper around seqn, which also takes a step size:

1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 9 9.5 10

This is handy when working with some of the probability generation functions shown last time. For instance the commands below calculate the normal distribution (cdf) values for evenly spaced points from -3 to 3:

q)each[{pnorm[x;0;1]}] seqn[-3;3;.5]
0.001349898 0.006209665 0.02275013 0.0668072...

table – Symbol Tabulation

R provides a handy function to show summary count tables of factor levels. In q, this can be used as follows:

q)t:([] sym:1000?`A`B`C`D)
B| 241
A| 244
C| 256
D| 259

This tabulates the column sym from the table t. The summary is ordered by increasing frequency count.

NOTE: this has really no advantage over the standard

q)`x xasc select count i by sym from t
sym| x  
---| ---
B  | 241
A  | 244
C  | 256
D  | 259

except some slight brevity.

range – Min/Max

The range function simply returns the boundaries of a set of values:

-3.685814 4.211363
q)abs (-) . range[x]   / absolute range 

summary – Min/Max/Mean/Median/IQR

The summary function provides summary stats, a la R’s summary() function:

q)summary x
min   | -4.755305
1q    | 1.59379
median| 2.972523
mean  | 2.966736
3q    | 4.336589
max   | 10.00284

quantile – Quantile Calculations

A very simple quantile calc:


hist – Bin Count

Very crude bin count – specify the data and the number of bins:

-4.919383| 14
-3.279589| 101
-1.639794| 601
0        | 1856
1.639794 | 3043
3.279589 | 2696
4.919383 | 1329
6.559177 | 319
8.198972 | 40
9.838766 | 1

diag – Identity Matrix Generation

q)diag 10
1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 1

Scale functions

Sometimes its very useful to scale an input data set – e.g. when feeding multiple inputs into a statistical model, large differences between the relative scales of the inputs combined with finite-precision computer arithmetic can result in some inputs being dwarfed by others. The scale function just adjusts the input as follows: \(X_{s}=(X-\mu)/\sigma\).

The example below scales two inputs with different ranges:

q)x:norm[10;0;1]; y:norm[10;5;3]
1.920868 -1.594028 -0.02312519 1.079606 -0.5310111 0.2762119 0.1218428 0.9584264 -0.4244091 -0.7981221
10.69666 2.357529 8.93505 3.65696 5.218461 3.246216 5.971919 7.557135 1.412827 1.246241
q)range x
-1.594028 1.920868
q)range y
1.246241 10.69666
q)range scale x
-1.74507 1.878671
q)range scale y
-1.232262 1.845551

There are other useful scaling measures, including the min/max scale: \( \frac{x-min(x)}{max(x)-min(x)} \). This is implemented using the minmax function:

q)minmax x
1 0 0.4469273 0.760658 0.302432 0.5320897 0.4881712 0.726182 0.3327606 0.226438
range minmax x
0 1f

There are other functions which are useful for scaling, e.g. the RMSD (root-mean-square deviation): \( \sqrt{\frac{\sum{x_i^2}}{N}}\):

q)x:rnorm 1000
q)rms x

nchoosek – Combinations

The nchoosek function calculates the number of combinations of N items chosen k at a time (i.e. \({N}\choose{k}\):

q)each[{nchoosek[10;x]}] seq[1;10]
10 45 120 210 252 210 120 45 10 1f

The source file is available here:

Rmathlib and kdb+, part 2 – Probability Distribution Functions

Following on from the last post on integrating some rmathlib functionality with kdb+, here is a sample walkthrough of how some of the functionality can be used, including some of the R-style wrappers I wrote to emulate some of the most commonly-used R commands in q.

Loading the rmath library

Firstly, load the rmathlib library interface:

q)\l rmath.q

Random Number Generation

R provides random number generation facilities for a number of distributions. This is provided using a single underlying uniform generator (R provides many different RNG implementations, but in the case of Rmathlib it uses a Marsaglia-multicarry type generator) and then uses different techniques to generate numbers distributed according to the selected distribution. The standard technique is inversion, where a uniformly distributed number in [0,1] is mapped using the inverse of the probability distribution function to a different distribution. This is explained very nicely in the book “Non-Uniform Random Variate Generation”, which is availble in full here:

In order to make random variate generation consistent and reproducible across R and kdb+, we need to be able to seed the RNG. The default RNG in rmathlib takes two integer seeds. We can set this in an R session as follows:

[source lang=”R”]
> .Random.seed[2:3]<-as.integer(c(123,456))

and the corresponding q command is:


Conversely, getting the current seed value can be done using:

123 456i

The underlying uniform generator can be accessed using runif:

3.102089 3.854157 3.369014 3.164677 3.998812 3.092924 3.381564 3.991363 3.369..

produces 100 random variates uniformly distributed between [3,4].

Then for example, normal variates can be generated:

q)rnorm 10
-0.2934974 -0.334377 -0.4118473 -0.3461507 -0.9520977 0.9882516 1.633248 -0.5957762 -1.199814 0.04405314

This produces identical results in R:

[source lang=”r”]
> rnorm(10)
[1] -0.2934974 -0.3343770 -0.4118473 -0.3461507 -0.9520977 0.9882516 1.6332482 -0.5957762 -1.1998144
[10] 0.0440531

Normally-distributed variables with a distribution of \( N(\mu,\sigma) \) can also be generated:

q)dev norm[10000;3;1.5]
q)avg norm[10000;3;1.5]

Or we can alternatively scale a standard normal \( X ~ N(0,1) \) using \( Y = \sigma X + \mu \):

q) `int$ (avg x; dev x)
0 1i
q) `int$ (avg y; dev y)
5 3i

Probability Distribution Functions

As well as random variate generation, rmathlib also provides other functions, e.g. the normal density function:


computes the normal density at 0 for a standard normal distribution. The second and third parameters are the mean and standard deviation of the distribution.

The normal distribution function is also provided:


computes the distribution value at 0 for a standard normal (with mean and standard deviation parameters).

Finally, the quantile function (the inverse of the distribution function – see the graph below – the quantile value for .99 is mapped onto the distribution function value at that point: 2.32):



We can do a round-trip via pnorm() and qnorm():

q)`int $ qnorm[ pnorm[3;0;1]-pnorm[-3;0;1]; 0; 1]

Thats it for the distribution functions for now – rmathlib provides lots of different distributions (I have just linked in the normal and uniform functions for now. There are some other functions that I have created that I will cover in a future post.

All code is on github:

[Check out part 3 of this series]

Integrating Rmathlib and kdb+

The R engine is usable in a variety of ways – one of the lesser-known features is that it provides a standalone math library that can be linked to from an external application. This library provides some nice functionality such as:

* Probability distribution functions (density/distribution/quantile functions);
* Random number generation for a large number of probability distributions

In order to make use of this functionality from q, I built a simple Rmathlib wrapper library. The C wrapper can be found here and is simply a set of functions that wrap the appropriate calls in Rmathlib. For example, a function to generate N randomly-generated Gaussian values using the underlying rnorm() function is:

[source lang=”c”]
K rnn(K n, K mu, K sigma) {
int i,count = n->i;
K ret = ktn(KF, count);
for (i = 0; i < count; ++i)
kF(ret)[i] = rnorm(mu->f, sigma->f);
return ret;

These have to be imported and linked from a kdb+ session, which is done using special directives (the 2: verb). I decided to automate the process of generating these directives – the code shell script below parses a set of function declarations in a delimited section of a C header file and produces the appropriate load statements:

[source lang=”bash”]

echo "dll:$DLL"

DECLARATIONS=$(awk ‘/\/\/ BEGIN DECL/ {f=1;next} /\/\/ END DECL/ {f=0} f {sub(/K /,"",$0);print $0}’ $INFILE)
for decl in $DECLARATIONS; do
IFS=, read -r -a CMDARGS <<< "$ARGS"
echo "${FNAME}:dll 2:(\`$FNAME;${#CMDARGS[*]})"

echo "\\l rmath_aux.q"

This generates a set of link commands such as the following:

rn:dll 2:(`rn;2)
rnn:dll 2:(`rnn;3)
dn:dll 2:(`dn;3)
pn:dll 2:(`pn;3)
qn:dll 2:(`qn;3)
sseed:dll 2:(`sseed;2)
gseed:dll 2:(`gseed;1)
nchoosek:dll 2:(`nchoosek;2)

It also generates a call to load a second q script, rmath_aux.q, which contains a bunch of q wrappers and helper functions (I will write a separate post about that later).

A makefile is included which generates the shared lib (once the appropriate paths to the R source files is set) and q scripts. A sample q session looks like the following:

q) \l rmath.q
q) x:rnorm 1000 / generate 1000 normal variates
q) dnorm[0;0;1] / normal density at 0 for a mean 0 sd 1 distribution

The project is available on github:

Note that loading rmath.q loads the rmath dll, which in turn loads the rmathlib dll, so the rmathlib dll should be available on the dynamic library load path.

[Check out Part 2 of this series]

Exporting Data From R to KDB

Here is the beginnings of a simple routine to convert R data frames to Q format (in this case a dictionary). It uses the S3 dispatch mechanism to handle the conversion of different data types. Extremely basic (I havent even bothered to put in proper file output – just capturing the output of cat) but very quick to knock up.

The code is mainly a top-level function called to_dict:

[code lang=”R”]
to_dict <- function(x) {
cat(substitute(x),":", sep="")
nms <- names(x)
for (n in nms) { cat("`",n,sep="") }
r <- rep(c(";",")"),times=c(length(nms)-1,1))
for (i in 1:length(nms)) { cat(qformat(x[[nms[i]]]),r[i],sep="") }

Where qformat is a generic S3 function:

[code lang=”R”]
qformat <- function(x) {

qformat.default <- function(x) {

qformat.logical <- function(x) {

qformat.factor <- function(x) {
cat("",gsub("\\s","",format(x)), sep="`")

It can be used as follows (using the famous Anscombe quartet data):

[code lang=”R”]
> write(capture.output(to_dict(anscombe)),

Then within a Q session:

q)\l /tmp/anscombe.q
x1| 10 8 13 9 11 14 6 4 12 7 5
x2| 10 8 13 9 11 14 6 4 12 7 5
x3| 10 8 13 9 11 14 6 4 12 7 5
x4| 8 8 8 8 8 8 8 19 8 8 8
y1| 8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68
y2| 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 4.74
y3| 7.46 6.77 12.74 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73
y4| 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 6.89

Density Estimation of High-Frequency Financial Data

Frequently we will want to estimate the empirical probability density function of real-world data and compare it to the theoretical density from one or more probability distributions. The following example shows the empirical and theoretical normal density for EUR/USD high-frequency tick data \(X\) (which has been transformed using log-returns and normalized via \(\frac{X_i-\mu_X}{\sigma_X}\)). The theoretical normal density is plotted over the range \(\left(\lfloor\mathrm{min}(X)\rfloor,\lceil\mathrm{max}(X)\rceil\right)\). The results are in the figure below. The discontinuities and asymmetry of the discrete tick data, as well as the sharp kurtosis and heavy tails (a corresponding interval of \(\approx \left[-8,+7\right]\) standard deviations away from the mean) are apparent from the plot.

tick density
Empirical and Theoretical Tick Density

We also show the theoretical and empirical density for the EUR/USD exchange rate log returns over different timescales. We can see from these plots that the distribution of the log returns seems to be asymptotically converging to normality. This is a typical empirical property of financial data.

Density Estimate Across Varying Timescales

The following R source generates empirical and theoretical density plots across different timescales. The data is loaded from files that are sampled at different intervals. I cant supply the data unfortunately, but you should get the idea.

[source lang=”R”]
# Function that reads Reuters CSV tick data and converts Reuters dates
# Assumes format is Date,Tick
readRTD <- function(filename) {
tickData <- read.csv(file=filename, header=TRUE, col.names=c("Date","Tick"))
tickData$Date <- as.POSIXct(strptime(tickData$Date, format="%d/%m/%Y %H:%M:%S"))

# Boilerplate function for Reuters FX tick data transformation and density plot
plot.reutersFXDensity <- function() {
filenames <- c("data/eur_usd_tick_26_10_2007.csv",
labels <- c("Tick", "1 Minute", "5 Minutes", "Hourly", "Daily")

par(mfrow=c(length(filenames), 2),mar=c(0,0,2,0), cex.main=2)
tickData <- c()
i <- 1
for (filename in filenames) {
tickData[[i]] <- readRTD(filename)
# Transform: `$Y = \nabla\log(X_i)$`
logtick <- diff(log(tickData[[i]]$Tick))
# Normalize: `$\frac{(Y-\mu_Y)}{\sigma_Y}$`
logtick <- (logtick-mean(logtick))/sd(logtick)
# Theoretical density range: `$\left[\lfloor\mathrm{min}(Y)\rfloor,\lceil\mathrm{max}(Y)\rceil\right]$`
x <- seq(floor(min(logtick)), ceiling(max(logtick)), .01)
plot(density(logtick), xlab="", ylab="", axes=FALSE, main=labels[i])
lines(x,dnorm(x), lty=2)
#legend("topleft", legend=c("Empirical","Theoretical"), lty=c(1,2))
plot(density(logtick), log="y", xlab="", ylab="", axes=FALSE, main="Log Scale")
lines(x,dnorm(x), lty=2)
i <- i + 1