Rmathlib and kdb+ part 3: Utility Functions

In the first two parts of this series, I looked at the basics of the interface I created between rmathlib and kdb+. In this post, I’ll go through some of the convenience functions I wrote to emulate some basic R functionality.

NOTE: I wrote these functions as a learning exercise to familiarise myself with q/kdb+ a bit more – they are very simplistic and I am sure there are much better ways to do them (I know the very nice qml project does a lot of this stuff, plus interfaces to external BLAS libraries).

To begin, load the library:

q)\l rmath.q

seq – sequence function

The first convenience function is seq, which, like R’s seq() or NumPy’s arange(), takes a start and an end point and generates a range of numbers. Unlike arange(), it includes the end point.

q)seq[1;100]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26...

The function is just a curried wrapper around seqn, which also takes a step size:

q)seqn[1;10;.5]
1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 9 9.5 10
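
A range function of this kind fits in one line of q. The following is a minimal sketch, not necessarily how rmath_aux.q defines it; the .ex namespace and the argument names are hypothetical, chosen so that nothing in the loaded library is shadowed.

```q
/ hypothetical sketch: start a, end b (inclusive), step s
.ex.seqn:{[a;b;s] a+s*til 1+floor (b-a)%s}
/ fix the step at 1 by projecting over the third argument
.ex.seq:.ex.seqn[;;1]

.ex.seqn[1;10;.5]   / 1 1.5 2 ... 9.5 10
.ex.seq[1;5]        / 1 2 3 4 5
```

One caveat with this approach: for steps that have no exact binary representation, (b-a)%s can land just below a whole number, in which case floor drops the final point.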

This is handy when working with some of the probability functions shown last time. For instance, the command below evaluates the standard normal CDF at evenly spaced points from -3 to 3:

q)each[{pnorm[x;0;1]}] seqn[-3;3;.5]
0.001349898 0.006209665 0.02275013 0.0668072...

table – Symbol Tabulation

R’s table() function shows counts for each level of a factor. The q version, table, takes a column name and a table:

q)t:([] sym:1000?`A`B`C`D)
q)table[`sym;t]
B| 241
A| 244
C| 256
D| 259

This tabulates the column sym from the table t. The summary is ordered by increasing frequency count.

NOTE: this has really no advantage over the standard

q)`x xasc select count i by sym from t
sym| x  
---| ---
B  | 241
A  | 244
C  | 256
D  | 259

except some slight brevity.
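
For reference, a tabulation like this can be built from group, count and asc. This is a sketch under the assumption that the result is a dictionary from value to count (which is what the output above looks like); .ex.table is a hypothetical name, not the library's definition.

```q
/ hypothetical sketch: count the occurrences of each value in column c of table t
/ group maps each distinct value to its row indices; asc on a dictionary sorts by value
.ex.table:{[c;t] asc count each group t c}

t:([] sym:1000?`A`B`C`D)
.ex.table[`sym;t]     / dictionary of sym -> count, ascending by count
```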

range – Min/Max

The range function simply returns the boundaries of a set of values:

q)x:rnorm[10000]
q)range[x]
-3.685814 4.211363
q)abs (-) . range[x]   / absolute range 
7.897177
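
A function with this behaviour is a one-liner. The sketch below uses the hypothetical name .ex.range and a small hand-made input.

```q
/ hypothetical sketch: two-item list of the minimum and maximum
.ex.range:{(min x;max x)}

r:.ex.range 3 1 4 1 5 9 2 6   / 1 9
last[r]-first r               / 8, the width of the range
```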

summary – Min/Max/Mean/Median/IQR

The summary function provides summary stats, a la R’s summary() function:

q)x:norm[10000;3;2]
q)summary x
min   | -4.755305
1q    | 1.59379
median| 2.972523
mean  | 2.966736
3q    | 4.336589
max   | 10.00284

quantile – Quantile Calculations

A very simple quantile calc:

q)x:norm[10000;3;2]
q)quantile[x;.5]
2.973137
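
The post does not say which quantile rule is used, so the following is only a sketch of one common choice: linear interpolation between the two nearest order statistics, the same rule as R’s default (type 7). The name .ex.quantile and its locals are hypothetical.

```q
/ hypothetical sketch: p-th quantile of numeric list x, 0<=p<=1
.ex.quantile:{[x;p]
  s:asc x;                 / sorted copy of the data
  h:p*(count s)-1;         / fractional 0-based position of the quantile
  lo:floor h;              / index of the order statistic at or below h
  hi:lo+h>lo;              / next index up, unless h is a whole number
  s[lo]+(h-lo)*s[hi]-s[lo]}

.ex.quantile[1 2 3 4 5f;.5]    / 3f
.ex.quantile[1 2 3 4f;.25]     / 1.75
```

With this rule, p=0 and p=1 return the minimum and maximum, and p=.5 agrees with q’s built-in med.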

hist – Bin Count

Very crude bin count – specify the data and the number of bins:

q)hist[x;10]
-4.919383| 14
-3.279589| 101
-1.639794| 601
0        | 1856
1.639794 | 3043
3.279589 | 2696
4.919383 | 1329
6.559177 | 319
8.198972 | 40
9.838766 | 1
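
The keys in the output above are all multiples of a common width (note the 0 bin), which suggests bins aligned with xbar. The sketch below makes that assumption; .ex.hist is a hypothetical name, and the choice of width (range divided by the bin count) is mine, not necessarily the library's.

```q
/ hypothetical sketch: counts of x in equal-width bins, keyed by each bin's lower edge
.ex.hist:{[x;n]
  w:(max[x]-min x)%n;            / bin width from the data range and requested bin count
  c:count each group w xbar x;   / xbar rounds each value down to a multiple of w
  k:asc key c;                   / order the bins by lower edge
  k!c k}

.ex.hist[0 1 2 3 4 5 6 7 8 9 10f;5]
/ keys 0 2 4 6 8 10 with counts 2 2 2 2 2 1
```

Because the bins are aligned to multiples of the width rather than starting at the minimum, this version can return one more bin than requested, as the usage line shows.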

diag – Identity Matrix Generation

q)diag 10
1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 1
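
A well-known q idiom produces the same matrix by reshaping a repeating pattern. The sketch below uses the hypothetical name .ex.diag.

```q
/ hypothetical sketch: n x n identity matrix
/ the pattern 1,0,...,0 has length n+1, so reshaping it into n rows of n
/ shifts the 1 one column to the right on each row
.ex.diag:{[n] (2#n)#1,n#0}

.ex.diag 3
/ 1 0 0
/ 0 1 0
/ 0 0 1
```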

Scale functions

Sometimes it’s very useful to scale an input data set. For example, when feeding multiple inputs into a statistical model, large differences between the relative scales of the inputs, combined with finite-precision computer arithmetic, can result in some inputs being dwarfed by others. The scale function just adjusts the input as follows: \(X_{s}=(X-\mu)/\sigma\).

The example below scales two inputs with different ranges:

q)x:norm[10;0;1]; y:norm[10;5;3]
q)x
1.920868 -1.594028 -0.02312519 1.079606 -0.5310111 0.2762119 0.1218428 0.9584264 -0.4244091 -0.7981221
q)y
10.69666 2.357529 8.93505 3.65696 5.218461 3.246216 5.971919 7.557135 1.412827 1.246241
q)range x
-1.594028 1.920868
q)range y
1.246241 10.69666
q)range scale x
-1.74507 1.878671
q)range scale y
-1.232262 1.845551
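
The formula translates directly into q. In this sketch (hypothetical name .ex.scale), \(\sigma\) is q’s built-in dev, the population standard deviation; the scaled ranges printed above appear consistent with that choice rather than with the n-1 denominator that R’s scale() uses.

```q
/ hypothetical sketch: centre on the mean, divide by the population standard deviation
.ex.scale:{(x-avg x)%dev x}

x:2 4 4 4 5 5 7 9f      / mean 5, population standard deviation 2
.ex.scale x             / -1.5 -0.5 -0.5 -0.5 0 0 1 2
```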

There are other useful scaling measures, including the min/max scale: \( \frac{x-\min(x)}{\max(x)-\min(x)} \). This is implemented using the minmax function:

q)minmax x
1 0 0.4469273 0.760658 0.302432 0.5320897 0.4881712 0.726182 0.3327606 0.226438
q)range minmax x
0 1f
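
As a sketch, the min/max formula can be written as follows (.ex.minmax is a hypothetical name, not the library's definition):

```q
/ hypothetical sketch: map the minimum to 0 and the maximum to 1
.ex.minmax:{(x-min x)%max[x]-min x}

.ex.minmax 2 4 6 10f    / 0 0.25 0.5 1
```

If every input value is the same, the denominator is zero and this version returns nulls, so a real implementation would want to guard against that case.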

There are other functions which are useful for scaling, e.g. the RMS (root mean square): \( \sqrt{\frac{\sum{x_i^2}}{N}}\):

q)x:rnorm 1000
q)rms x
1.021065
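
The formula maps onto three q primitives. A minimal sketch, with the hypothetical name .ex.rms:

```q
/ hypothetical sketch: square each value, average, take the square root
.ex.rms:{sqrt avg x*x}

.ex.rms 3 4f    / sqrt 12.5, about 3.535534
```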

nchoosek – Combinations

The nchoosek function calculates the number of combinations of N items chosen k at a time (i.e. \(\binom{N}{k}\)):

q)nchoosek[100;5]
7.528752e+07
q)each[{nchoosek[10;x]}] seq[1;10]
10 45 120 210 252 210 120 45 10 1f
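
One way to compute this without large factorials is the product form \(\binom{N}{k}=\prod_{i=1}^{k}\frac{N-k+i}{i}\). The sketch below is an assumption about how such a function could be written, not the library's definition; .ex.nchoosek is a hypothetical name.

```q
/ hypothetical sketch: multiply the k ratios n%1, (n-1)%2, ..., (n-k+1)%k
.ex.nchoosek:{[n;k] prd (n-til k)%1+til k}

.ex.nchoosek[100;5]   / 7.528752e+07
.ex.nchoosek[10;3]    / displays as 120f
```

The result is a float, as in the output above, and individual ratios such as 8%3 are not exact, so the product can differ from the true integer in the last few bits.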

The source file is available here: https://github.com/rwinston/kdb-rmathlib/blob/master/rmath_aux.q.