Classic Bit Hackery In Java

July 10th, 2008

In the next release of Commons::Net, there is going to be a utility class for calculating various metrics around Classless Interdomain Routing (CIDR) rules and classless IP addresses - the sort of thing you can find in many online Perl-based address calculators. Anyways, the interesting thing about the address rules is that they are all basically forms of bit manipulation. A 32-bit IPv4 address can be decomposed into a network portion (n bits) and a host portion (32-n bits).

/* Create a binary netmask from the number of bits specification /x */
int cidrPart = rangeCheck(Integer.parseInt(matcher.group(5)), 0, NBITS-1);
for (int j = 0; j < cidrPart; ++j) {
netmask |= (1 << 31-j);
}

In the example code above, a 32-bit integer is initialized with a number of 1’s corresponding to the value of the cidrPart variable. for instance, if cidrPart = 8, the netmask integer would look like 11111111000000000000000000000000, with the most significant eight bits set. This allows us to take a 32-bit IP address and easily figure out various attributes. For instance, the line

network = (address & netmask);

strips out the host portion and just leaves us with the network portion of the address.

Similarly, if we know that a broadcast address is in the form nnnnnn11…..1, i.e. n host bits and 32-n 1s, then we can create the broadcast address using the netmask with the line

broadcast = network | ~(netmask);

i.e. invert the netmask and OR with the supplied address.

We can also iterate over valid address ranges in CIDR ranges easily. For instance, we can calculate the lowest and highest addresses easily:

private int low() { return network() + 1; }
private int high() { return broadcast() - 1; }

And thus iterate over each address in the range:

public String[] getAllAddresses() {
String[] addresses = new String[getAddressCount()];
for (int add = low(), j=0; add <= high(); ++add, ++j) {
addresses[j] = format(toArray(add));
}
return addresses;
}
}

But the clever part is the bit counting technique which is used to count the number of 1 bits in the netmask integer (the population). The obvious way is to count up or down from 0 to tex:2^{31}, shifting the target integer at each stage, and counting the set bits using AND. However, in the book Hacker’s Delight, there are tons of clever recipes for situations like this. The one that I ended up using is the following (almost identical to the C version, apart from the use of >>> for unsigned shifts):

/*
* Count the number of 1-bits in a 32-bit integer using a divide-and-conquer strategy
* see Hacker’s Delight section 5.1
*/
int pop(int x) {
x = x - ((x >>> 1) & 0×55555555);
x = (x & 0×33333333) + ((x >>> 2) & 0×33333333);
x = (x + (x >>> 4)) & 0×0F0F0F0F;
x = x + (x >>> 8);
x = x + (x >>> 16);
return x & 0×0000003F;
}

This extremely elegant piece of code is a divide-and-conquer procedure for counting 1 bits in a 32-bit integer. At each step, it counts the number of 1s in each adjacent tex:2^n-bit group, where n is increased from 0 to tex:\log_2{(n-1)}, where n is the number of bits in the integer (32 in this case). At each stage, the number of bits in each n-bit group are summed, until finally, we are left with the count of the number of 1 bits, which can be represented in 6 bits (32 = 0xb100000).

It’s easier to visualize what’s going on if you can see how the magic hex constants relate to the structure of a 32-bit integer. For instance, using bc:

/cygdrive/c/sandbox>bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty’.

ibase=16
obase=2
55555555
1010101010101010101010101010101
33333333
110011001100110011001100110011
0F0F0F0F
1111000011110000111100001111

If you want to see the java code involved, it is here: SubnetUtils.java. There is also a very nice explanation of the bit counting routine (in assembler) here.

JDBC Thin Clients and Oracle RAC

July 8th, 2008

If you are wondering why your jdbc thin client connections to an Oracle Real Application Cluster instance are failing to connect, one possible reason is that you may not be using the correct connection syntax. An Oracle RAC instance can be identified by running tnsping and looking at the host list and the FAILOVER setting:

/cygdrive/h>tnsping MYDB

TNS Ping Utility for 32-bit Windows: Version 10.2.0.1.0 - Production on 08-JUL-2008 17:02:15

Copyright (c) 1997, 2005, Oracle. All rights reserved.

Used parameter files:
c:\lib\oracle-10.2\network\admin\sqlnet.ora

Used LDAP adapter to resolve the alias
Attempting to contact (DESCRIPTION = (SDU = 8192) (TDU = 8192) (ADDRESS_LIST = (ADDRESS = PROTOCOL = TCP)(HOST = host1)(PORT = 1621))(ADDRESS = (PROTOCOL = TCP)(HOST = host2)(PORT = 1621)) (LOAD_BALANCE = on) (FAILOVER = on ) ) (CONNECT_DATA = (SERVICE_NAME = MYDB) (FAILOVER_MODE = (TYPE = session) (METHOD = basic) (RETRIES = 20) (DELAY = 5) ) ) ) OK (0 msec)

The standard JDBC thin URL syntax doesnt seem to work correctly here:

jdbc:oracle:thin:@host1:1621:MYDB

Seems to fail (at least for me), but the alternative unwieldy syntax:

jdbc:oracle:thin:@(DESCRIPTION=(LOAD_BALANCE=on)(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=host1)(PORT=1621))(ADDRESS=(PROTOCOL=TCP)(HOST=host2)(PORT=1621)))(CONNECT_DATA=(SERVICE_NAME=MYDB)))

Works fine.

The NTLM and Java Bible

July 4th, 2008

Not so much a bible really, but a series of posts showing how to perform NTLM authentication using various techniques in Java (and C, via the Neon library):

http://www.theresearchkitchen.com/blog/archives/25
http://www.theresearchkitchen.com/blog/archives/24
http://www.theresearchkitchen.com/blog/archives/23
http://www.theresearchkitchen.com/blog/archives/39
http://www.theresearchkitchen.com/blog/archives/38
http://www.theresearchkitchen.com/blog/archives/37

.vimrc

July 2nd, 2008

Putting my .vimrc here, to save me having to recreate my key shortcuts again:

set nocompatible
source $VIMRUNTIME/vimrc_example.vim
source $VIMRUNTIME/mswin.vim
behave mswin

set diffexpr=MyDiff()
function MyDiff()
  let opt = '-a --binary '
  if &diffopt =~ 'icase' | let opt = opt . '-i ' | endif
  if &diffopt =~ 'iwhite' | let opt = opt . '-b ' | endif
  let arg1 = v:fname_in
  if arg1 =~ ' ' | let arg1 = '"' . arg1 . '"' | endif
  let arg2 = v:fname_new
  if arg2 =~ ' ' | let arg2 = '"' . arg2 . '"' | endif
  let arg3 = v:fname_out
  if arg3 =~ ' ' | let arg3 = '"' . arg3 . '"' | endif
  let eq = ''
  if $VIMRUNTIME =~ ' '
    if &sh =~ '\<cmd'
      let cmd = '""' . $VIMRUNTIME . '\diff"'
      let eq = '"'
    else
      let cmd = substitute($VIMRUNTIME, ' ', '" ', '') . '\diff"'
    endif
  else
    let cmd = $VIMRUNTIME . '\diff'
  endif
  silent execute '!' . cmd . ' ' . opt . arg1 . ' ' . arg2 . ' > ' . arg3 . eq
endfunction

au GUIEnter * simalt ~x
set guifont=Lucida\ Console:h11
set backupdir=c:/lib/vim/backup
set directory=c:/lib/vim/backup
set expandtab
set tabstop=4
set shiftwidth=4
set smarttab

:nmap <C-N><C-N> :set invnumber<CR>
:nmap <C-T> :tabnew<CR>
:nmap <C-H> :set invhlsearch<CR>

colorscheme desert

Matrix Exponentiation in R

July 1st, 2008

A short while ago, I needed to do some matrix exponentiation in R (raising a matrix to a power). By default, the exponentiation operator ^ in R, when applied to a matrix, will just raise each element of the matrix to a power, rather than the matrix itself:

> A < - matrix(c(1:4), nrow=2, byrow=T)
> A
[,1] [,2]
[1,] 1 2
[2,] 3 4
> A ^ 2
[,1] [,2]
[1,] 1 4
[2,] 9 16
>
>

Whereas what we really want in this case is the equivalent of:

> A %*% A
[,1] [,2]
[1,] 7 10
[2,] 15 22
>

There are a few different ways of creating a matrix exponentiation operator in R: we could create an R function and create an exponentiation operator for matrices, similar to the %*% matrix multiplication operator that exists already, or we could write the function in C and link to it. It seems logical to create an %^% operator to match the current set of matrix operators. The choice of writing the exponentiation routine in R or C is actually less important than the exponentiation algorithm used. A fast exponentiation algorithm is the “square and multiply” method, an explanation of which can be found here.

I chose to write a basic exponentiation routine in C, by creating a new operator definition, and following these basic rules:

  • tex:A^{-1} returns the matrix inverse.
  • tex:A^0 returns the identity matrix.
  • tex:A^1 returns the original matrix.
  • tex:A^n returns A to the nth power.
  • Complex as well as real-valued matrices should be supported.

The code for this is shown below. Note that to calculate the matrix inverse, I don’t call directly into the underlying BLAS libraries, but actually call back into R’s solve() function to calculate the inverse. The changes were made to array.c in the R source tree under /src/main.

SEXP do_matexp(SEXP call, SEXP op, SEXP args, SEXP rho)
{
int nrows, ncols;
SEXP matrix, tmp, dims, dims2;
SEXP x, y, x_, x__;
int i,j,e,mode;

// necessary?
mode = isComplex(CAR(args)) ? CPLXSXP : REALSXP;

SETCAR(args, coerceVector(CAR(args), mode));
x = CAR(args);
y = CADR(args);

dims = getAttrib(x, R_DimSymbol);
nrows = INTEGER(dims)[0];
ncols = INTEGER(dims)[1];

if (nrows != ncols)
error(_("can only raise square matrix to power"));

if (!isNumeric(y))
error(_("exponent must be a scalar integer"));

e = asInteger(y);

if (e < -1)
error(_("exponent must be >= -1"));
else if (e == 1)
return x;
else if (e == -1) { /* return matrix inverse via solve() */
SEXP p1, p2, inv;
PROTECT(p1 = p2 = allocList(2));
SET_TYPEOF(p1, LANGSXP);
CAR(p2) = install("solve.default");
p2 = CDR(p2);
CAR(p2) = x;
inv = eval(p1, rho);

UNPROTECT(1);
return inv;
}

PROTECT(matrix = allocVector(mode, nrows * ncols));
PROTECT(tmp = allocVector(mode, nrows * ncols));
PROTECT(x_ = allocVector(mode, nrows * ncols));
PROTECT(x__ = allocVector(mode, nrows * ncols));

if (mode == REALSXP)
Memcpy(REAL(x_), REAL(x), (size_t)nrows*ncols);
else
Memcpy(COMPLEX(x_), COMPLEX(x), (size_t)nrows*ncols);

// Initialize matrix to identity matrix
// Set x[i * ncols + i] = 1
if (mode == REALSXP)
for (i = 0; i < ncols*nrows; i++)
REAL(matrix)[i] = ((i % (ncols+1) == 0) ? 1 : 0);
else
for (i = 0; i < ncols*nrows;i++) {
COMPLEX(matrix)[i].i = 0.0;
COMPLEX(matrix)[i].r = ((i % (ncols+1) == 0) ? 1.0 : 0.0);
}

if (e == 0) {
; // return identity matrix
}
else
while (e > 0) {
if (e & 1) {
if (mode == REALSXP)
matprod(REAL(matrix), nrows, ncols,
REAL(x_), nrows, ncols, REAL(tmp));
else
cmatprod(COMPLEX(matrix), nrows, ncols,
COMPLEX(x_), nrows, ncols, COMPLEX(tmp));

//copyMatrixData(tmp, matrix, nrows, ncols, mode);
if (mode == REALSXP)
Memcpy(REAL(matrix), REAL(tmp), (size_t)nrows*ncols);
else
Memcpy(COMPLEX(matrix), COMPLEX(tmp), (size_t)nrows*ncols);
e–;
}

if (mode == REALSXP)
matprod(REAL(x_), nrows, ncols,
REAL(x_), nrows, ncols, REAL(x__));
else
cmatprod(COMPLEX(x_), nrows, ncols,
COMPLEX(x_), nrows, ncols, COMPLEX(x__));

//copyMatrixData(x__, x_, nrows, ncols, mode);
if (mode == REALSXP)
Memcpy(REAL(x_), REAL(x__), (size_t)nrows*ncols);
else
Memcpy(COMPLEX(x_), COMPLEX(x__), (size_t)nrows*ncols);

e >>= 1;
}

PROTECT(dims2 = allocVector(INTSXP, 2));
INTEGER(dims2)[0] = nrows;
INTEGER(dims2)[1] = ncols;
setAttrib(matrix, R_DimSymbol, dims2);

UNPROTECT(5);
return matrix;
}

To actually hook this routine up to the %^% operator, I needed a further piece of glue in /src/main/names.c to associate the operator:

{"%^%", do_matexp, 3, 1, 2, {PP_BINARY, PREC_POWER, 0}}

Then to try it out (after recompiling R of course!):

> A < - matrix(c(1:4), nrow=2, byrow=TRUE)
> A %^% 2
[,1] [,2]
[1,] 7 10
[2,] 15 22
>

Project Euler Problem #2 (R)

June 17th, 2008

Here is a solution for Project Euler’s Problem #2 in R, which is stated as:

Each new term in the Fibonacci sequence is generated by adding the previous two terms. By starting with 1 and 2, the first 10 terms will be:

1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …

Find the sum of all the even-valued terms in the sequence which do not exceed four million.

Firstly ,we will define a function to calculate the Fibonacci numbers below N, where N in this case is 4*10^6:

fib < - function(n) {
x <- c(length=10)
x[1] <- 1; x[2] <- 2
for (i in 3:n) {
y <- x[i-1]+x[i-2];
if (y > n){ break }
else { x[i] < - y }
}
x
}

Next, we just calculate the sum of the even-valued problems, in a similar manner to the way we solved problem #1:

f < - fib(4E6)
sum(f[f %% 2 ==0])

Memory Usage Statistics via Bash

June 17th, 2008

I was just experimenting with a couple of useful commands (one is basically from the ps man page), and I want to record them here before I forget them.

Firstly,

ps -o pid,pmem,user --sort pmem --user [myuser]

Will show the currently running processes for [myuser], and for each process, the percentage of RSS (resident set size) of that process with respect to the total amount of physical memory on the machine. Very useful. If you want to see the entire process virtual size, add the vsize field to the field list above.

Secondly:

echo `cat /proc/meminfo | /bin/awk '/MemTotal/ {tot=$2} /MemFree/ {print $2/tot}'`

I’m sure there is a much nicer way to do this in awk, but I am an awk newbie. This returns the amount of free memory expressed as a percentage of total physical memory.

Awk doesn’t seem to have a round() function, so a version with more readable output would be:

echo `cat /proc/meminfo | /bin/awk '/MemTotal/ {tot=$2} /MemFree/ {print int($2/tot*10
0) "% memory free"}'`

Project Euler Problem #1 (R)

June 12th, 2008

Here is a solution for Project Euler’s problem #1 in R. The problem is expressed as:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

As usual with Project Euler questions, there is an obvious way, and a less obvious, but much more efficient way. In this case, the obvious way is:

x < - seq(1,999)
sum(x[x %% 3 ==0 | x %% 5 == 0])

Which very concisely returns the correct answer, 233168. However, if we use the following intuition:

tex:S_N = S_3 + S_5 - (S_{3,5})

i.e. the sum of all numbers divisible by 3 or 5 is the sum of all numbers divisible by 3, plus the sum of all numbers divisible by 5, minus the sum of all numbers divisible by 3 and 5 (as we have double counted them), then we get the correct answer.

Since tex:S_n  = \frac{n(a_1 + a_n)}{2}, where tex:n = \lfloor \frac{N}{n} \rfloor, and tex:a_n = a_1n, the last piece of the puzzle is what to use for tex:S_{3,5}. This is straightforward, we just use the lowest common multiple of 3 and 5, which in this case is 15. Hence, the R representation of this is:

sum(333*((3+333*3)/2),199*((5+199*5)/2)-66*((15+66*15)/2))

MathTran

June 11th, 2008

MathTran is a very popular (and useful) web service which performs a similar function to Google Charts - hit it with a valid TeX formula and it will return an image representation of the formula. Jonathan Fine has written a JavaScript wrapper which enables you to mark up img tags with alt values like “tex:[formula]” and these are automagically converted to images.

For example, the tag

<img alt="tex:\frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{x}{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}dx">

will be converted to:

tex:\frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{x}{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}dx

I have taken the MathTran source script and modified it slightly - MathTran allows you to specify a size parameter, between 1 and 9. This is not available via the default script, but I added the ability to have the img tag take a “size” parameter. This can be used like so:

<alt="tex:\sum_{0}^{\infty}x_i = \frac{1}{(1-x)}" size="2">

An example below:

tex:\sum_{i=0}^{\infty}x^i = \frac{1}{(1-x)}
tex:\sum_{i=0}^{\infty}x^i = \frac{1}{(1-x)}
tex:\sum_{i=0}^{\infty}x^i = \frac{1}{(1-x)}
tex:\sum_{i=0}^{\infty}x^i = \frac{1}{(1-x)}
tex:\sum_{i=0}^{\infty}x^i = \frac{1}{(1-x)}
tex:\sum_{i=0}^{\infty}x^i = \frac{1}{(1-x)}
tex:\sum_{i=0}^{\infty}x^i = \frac{1}{(1-x)}
tex:\sum_{i=0}^{\infty}x^i = \frac{1}{(1-x)}
tex:\sum_{i=0}^{\infty}x^i = \frac{1}{(1-x)}

UPDATE: I made a slight modification to add the raw TeX source to the title attribute of the img tag, so if you hover on a generated MathTran image, you should see the original formula.

If you want to get the (modified) MathTran script, it is here: http://www.theresearchkitchen.com/blog/wp-includes/js/mathtran.js

Recursive Parameter Substitution Routine

June 9th, 2008

I was recently looking at implementing a parameter substitution routine, along the lines of what Ant does for ${}-style properties. I initially picked up Spring’s PropertyPlaceholderConfigurer implementation, and created a modified version of the parseStringValue() routine, which is a recursive routine that unwraps delimited properties and resolves them. This did mostly what I want, but there were a few issues with it, mainly that it doesn’t handle truly recursive properties. For instance, if I want to do a Bash-style eval of a property ${FOO${BAR}}, where $BAR is mapped to "def", then I want the ultimate evaluation of this property to be ${FOOdef}.

This seems pretty simple to do in theory, as it would lend itself well to a recursive method, or an iterative method, possibly using a stack to store nested property names as we extract them from the string.

What I came up with was (if PREFIX is a property start delimiter, e.g. '{', and SUFFIX is a property end delimiter, e.g. '}'):

void recurse(StringBuilder str, int index) {
        if (str.indexOf(PREFIX, index) != -1)
            recurse(str, str.indexOf(PREFIX, index)+1);

            if (str.indexOf(PREFIX) == -1)
                return;

            String resolved = (str.substring(index, str.indexOf(SUFFIX, index)));
            str.replace(index-PREFIX.length(),
                    str.indexOf(SUFFIX, index)+SUFFIX.length(),
                    lookup(resolved));
    }

    private String lookup(String key) {
        System.out.println("looking up [" + key+  "] => [" + map.get(key) + "]");
        return map.get(key);
    }

Which does the job, although it could still be more elegant (not to mention efficient). The explicit check to str.indexOf(PREFIX) in order to exit the routine bugs me a little bit - I’m sure its possible to rearrange this to be more concise and skip the explicit check like that. Still this routine can now parse strings like TEXT{FOO}MORETEXT{BAR{GOO}} and resolve nested properties. I’m trrying to find a nicer recursive representation for this. maybe I should ask on #haskell.