Project Euler Problem #6 (R)

July 18th, 2008

Problem 6 in the Project Euler list asks us to:

Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum.

The literal translation of this is:

sumdiff <- function(n) {
    sum(c(1:n))^2-sum(c(1:n)^2)
}

However, this is another problem (like Problem #1) that has a closed-form solution, and so can run in constant time.

The closed form expression for the sum of the squares is

tex:\sum_{i=1}^{n}i^2 = \frac{1}{6}n(2n+1)(n+1)

The closed form expression for the square of the sum is

tex:\left(\sum_{i=1}^{n}i\right)^2 = \left(\frac{n(n+1)}{2}\right)^2

So if we subtract and simplify, we end up with the solution

tex:\frac{1}{12}n(3n^3+2n^2-3n-2).

The solution is shown below:

sumdiff2 <- function(n) {
    #(1/4)*(n^2)*(1+n)^2 - (1/6)*n*(1+n)*(1+2*n)
    (1/12)*(n)*(-2-3*n+2*n^2+3*n^3)
}

We can calculate the answer as shown:

> sumdiff2(100)
[1] 25164150

Needless to say, closed form solutions are preferable where possible.

Project Euler Problem #5 (R)

July 18th, 2008

Problem 5 on the Project Euler list is the following:

2520 is the smallest number that can be divided by each of the numbers from 1 to 10 without any remainder.

What is the smallest number that is evenly divisible by all of the numbers from 1 to 20?

This is asking us to compute the lowest common multiple (lcm) from 1 to 20. The lcm of two integers a and b is defined as tex:\frac{ab}{\mbox{gcd}(a,b)}

In order to compute the gcd, we can use a simple algorithm, like a Euclidian reduction:

# Iterative Euclidean algorithm
gcd <- function(a,b) {
    t <- 0
    while (b != 0) {
        t <- b
        b <- a %% b
        a <- t
    }
    a
}

Defining the lcm then becomes:

lcm <- function(a,b) {
    abs(a * (b/(gcd(a,b))))
}

Now we have defined pretty much everything we need to solve the problem. We can write an iterative lcm routine that calculates the least common denominator from 1 to x:

lcm_ <- function(xs) {
    l <- 1
    for (i in 1:(length(xs))) {
        l <- lcm(l,xs[i])
    }
    l
}

And we can calculate the answer:

> lcm_(c(1:20))
[1] 232792560

However, I also experimented with writing a reduce() function (similar to Haskell’s foldr operation). We can write reduce() as follows:

reduce <- function(op,x,xs) {
    y <- x
    for (i in 1:(length(xs)-1)) {
        y <- do.call(op, list(y, xs[i]))
    } 
    y
}

And then invoke:

>reduce(lcm, 1, c(1:20))
[1] 232792560

Note that R actually has inbuilt Map/Reduce operations. Using the native Reduce, this is:

> Reduce(f=lcm, x=c(1:20), right=1)
[1] 232792560

Project Euler Problem #4 (R)

July 18th, 2008

Problem 4 in on Project Euler’s list is the following:

Find the largest palindrome made from the product of two 3-digit numbers.

We can attack this as follows: firstly, we know that for an n-digit number, the minimum n-digit number is tex:10^{n-1} and the maximum is tex:9\sum_{i=0}^{n-1}10^{i}. So we need the cartesian product of all numbers in this range, which we can then test for the palindrome property, and finally filter for the maximum element.

Testing for the palindrome property requires that we check each digit in the number for equality on either side. We do this first by calculating the number of digits in the product (using tex:\lfloor\log_{10}(x)\rfloor) and then just comparing the digits by using the modulus and shifting. Note a couple of points in the R code below:

  • %/% is integer division;
  • Note the vectorization at work in statements like (p %/% 10^x): this will divide p by successive powers of 10, where the powers are defined in the x vector;
  • Note the use of the logical all() function to test whether all palindromic contenders are equal.
  • pal < - function(N) {
     palindromes <- c()
     min_no <- 1*10^(N-1)
     max_no <- sum(9*10^(seq(0,N-1)))
     n <- 1
     for (i in seq(max_no,min_no,-1)) {
      for (j in seq(i,min_no,-1)) {
       p <- i * j
       digits <- floor(log10(p))
       x <- seq(0, digits %/% 2)
        if (all( ((p %/% 10^x) %% 10)==((p %/% 10^(digits-x)) %% 10) )) {
         palindromes[n] <- p
         n <- n + 1
        }
      }
     }
     max(palindromes)
    }

Project Euler Problem #3 (R)

July 17th, 2008

The third problem in Project Euler is an interesting one:

The prime factors of 13195 are 5, 7, 13 and 29.

What is the largest prime factor of the number 600851475143 ?

In order to compute prime factors, we can use a sieving method. The simplest and oldest sieving algorithm is the sieve of Erastothenes, which we can implement in R as follows:

sieve <- function(n) {
    if (n <= 1) {
        return(numeric(0))
    }
    if (n == 2) {
        return(c(2))
    }

    a <- seq(2,n)
    b <- c()
    b[1] <- 2

    i <- 1
    while (length(a) > 0) {
        b[i] <- a[1]
        i <- i + 1
        a <- a[!(a %% a[1] == 0)]
    }
    b
}

So now we need to compute the largest prime factor of 600851475143. An upper bound for the factorisation is tex:p \leq \lfloor\sqrt{x}\rfloor, so we can halt our search at this point (alternatively, we may be able to begin our search here and work downwards). Let’s do it the obvious (and inefficient) way, starting at 1 and working up to tex:\lfloor\sqrt{x}\rfloor, building up a list of prime factors along the way:

pfactor <- function(x) {
    factors <- c()
    j <- 1
    primes <- sieve( floor(sqrt(x)) )
    for (prime in primes) {
        if (x %% prime == 0) {
            factors[j] <- prime
            j <- j + 1
    }
    }
    factors
}

Now to compute the maximum factor, we just take the maximum of this list:

max( pfactor(600851475143) )

Sweave, LaTeX, and Font Encodings

July 16th, 2008

I just found out how to save a couple of frustrating hours trying to track down why font encodings don’t seem to work in LaTeX when using Sweave. Somewhere along the line, it looks like ae.sty is getting directly or indirectly invoked by Sweave, which means that any font encoding directives placed before usepackage Sweave will be ignored. The solution is to declare the Sweave package import like so:

\usepackage[noae]{Sweave}

And fonts should work as specified.

Classic Bit Hackery In Java

July 10th, 2008

In the next release of Commons::Net, there is going to be a utility class for calculating various metrics around Classless Interdomain Routing (CIDR) rules and classless IP addresses - the sort of thing you can find in many online Perl-based address calculators. Anyways, the interesting thing about the address rules is that they are all basically forms of bit manipulation. A 32-bit IPv4 address can be decomposed into a network portion (n bits) and a host portion (32-n bits).

/* Create a binary netmask from the number of bits specification /x */
int cidrPart = rangeCheck(Integer.parseInt(matcher.group(5)), 0, NBITS-1);
for (int j = 0; j < cidrPart; ++j) {
netmask |= (1 << 31-j);
}

In the example code above, a 32-bit integer is initialized with a number of 1’s corresponding to the value of the cidrPart variable. for instance, if cidrPart = 8, the netmask integer would look like 11111111000000000000000000000000, with the most significant eight bits set. This allows us to take a 32-bit IP address and easily figure out various attributes. For instance, the line

network = (address & netmask);

strips out the host portion and just leaves us with the network portion of the address.

Similarly, if we know that a broadcast address is in the form nnnnnn11…..1, i.e. n host bits and 32-n 1s, then we can create the broadcast address using the netmask with the line

broadcast = network | ~(netmask);

i.e. invert the netmask and OR with the supplied address.

We can also iterate over valid address ranges in CIDR ranges easily. For instance, we can calculate the lowest and highest addresses easily:

private int low() { return network() + 1; }
private int high() { return broadcast() - 1; }

And thus iterate over each address in the range:

public String[] getAllAddresses() {
String[] addresses = new String[getAddressCount()];
for (int add = low(), j=0; add <= high(); ++add, ++j) {
addresses[j] = format(toArray(add));
}
return addresses;
}
}

But the clever part is the bit counting technique which is used to count the number of 1 bits in the netmask integer (the population). The obvious way is to count up or down from 0 to tex:2^{31}, shifting the target integer at each stage, and counting the set bits using AND. However, in the book Hacker’s Delight, there are tons of clever recipes for situations like this. The one that I ended up using is the following (almost identical to the C version, apart from the use of >>> for unsigned shifts):

/*
* Count the number of 1-bits in a 32-bit integer using a divide-and-conquer strategy
* see Hacker’s Delight section 5.1
*/
int pop(int x) {
x = x - ((x >>> 1) & 0×55555555);
x = (x & 0×33333333) + ((x >>> 2) & 0×33333333);
x = (x + (x >>> 4)) & 0×0F0F0F0F;
x = x + (x >>> 8);
x = x + (x >>> 16);
return x & 0×0000003F;
}

This extremely elegant piece of code is a divide-and-conquer procedure for counting 1 bits in a 32-bit integer. At each step, it counts the number of 1s in each adjacent tex:2^n-bit group, where n is increased from 0 to tex:\log_2{(n-1)}, where n is the number of bits in the integer (32 in this case). At each stage, the number of bits in each n-bit group are summed, until finally, we are left with the count of the number of 1 bits, which can be represented in 6 bits (32 = 0xb100000).

It’s easier to visualize what’s going on if you can see how the magic hex constants relate to the structure of a 32-bit integer. For instance, using bc:

/cygdrive/c/sandbox>bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty’.

ibase=16
obase=2
55555555
1010101010101010101010101010101
33333333
110011001100110011001100110011
0F0F0F0F
1111000011110000111100001111

If you want to see the java code involved, it is here: SubnetUtils.java. There is also a very nice explanation of the bit counting routine (in assembler) here.

JDBC Thin Clients and Oracle RAC

July 8th, 2008

If you are wondering why your jdbc thin client connections to an Oracle Real Application Cluster instance are failing to connect, one possible reason is that you may not be using the correct connection syntax. An Oracle RAC instance can be identified by running tnsping and looking at the host list and the FAILOVER setting:

/cygdrive/h>tnsping MYDB

TNS Ping Utility for 32-bit Windows: Version 10.2.0.1.0 - Production on 08-JUL-2008 17:02:15

Copyright (c) 1997, 2005, Oracle. All rights reserved.

Used parameter files:
c:\lib\oracle-10.2\network\admin\sqlnet.ora

Used LDAP adapter to resolve the alias
Attempting to contact (DESCRIPTION = (SDU = 8192) (TDU = 8192) (ADDRESS_LIST = (ADDRESS = PROTOCOL = TCP)(HOST = host1)(PORT = 1621))(ADDRESS = (PROTOCOL = TCP)(HOST = host2)(PORT = 1621)) (LOAD_BALANCE = on) (FAILOVER = on ) ) (CONNECT_DATA = (SERVICE_NAME = MYDB) (FAILOVER_MODE = (TYPE = session) (METHOD = basic) (RETRIES = 20) (DELAY = 5) ) ) ) OK (0 msec)

The standard JDBC thin URL syntax doesnt seem to work correctly here:

jdbc:oracle:thin:@host1:1621:MYDB

Seems to fail (at least for me), but the alternative unwieldy syntax:

jdbc:oracle:thin:@(DESCRIPTION=(LOAD_BALANCE=on)(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=host1)(PORT=1621))(ADDRESS=(PROTOCOL=TCP)(HOST=host2)(PORT=1621)))(CONNECT_DATA=(SERVICE_NAME=MYDB)))

Works fine.

The NTLM and Java Bible

July 4th, 2008

Not so much a bible really, but a series of posts showing how to perform NTLM authentication using various techniques in Java (and C, via the Neon library):

http://www.theresearchkitchen.com/blog/archives/25
http://www.theresearchkitchen.com/blog/archives/24
http://www.theresearchkitchen.com/blog/archives/23
http://www.theresearchkitchen.com/blog/archives/39
http://www.theresearchkitchen.com/blog/archives/38
http://www.theresearchkitchen.com/blog/archives/37

.vimrc

July 2nd, 2008

Putting my .vimrc here, to save me having to recreate my key shortcuts again:

set nocompatible
source $VIMRUNTIME/vimrc_example.vim
source $VIMRUNTIME/mswin.vim
behave mswin

set diffexpr=MyDiff()
function MyDiff()
  let opt = '-a --binary '
  if &diffopt =~ 'icase' | let opt = opt . '-i ' | endif
  if &diffopt =~ 'iwhite' | let opt = opt . '-b ' | endif
  let arg1 = v:fname_in
  if arg1 =~ ' ' | let arg1 = '"' . arg1 . '"' | endif
  let arg2 = v:fname_new
  if arg2 =~ ' ' | let arg2 = '"' . arg2 . '"' | endif
  let arg3 = v:fname_out
  if arg3 =~ ' ' | let arg3 = '"' . arg3 . '"' | endif
  let eq = ''
  if $VIMRUNTIME =~ ' '
    if &sh =~ '\<cmd'
      let cmd = '""' . $VIMRUNTIME . '\diff"'
      let eq = '"'
    else
      let cmd = substitute($VIMRUNTIME, ' ', '" ', '') . '\diff"'
    endif
  else
    let cmd = $VIMRUNTIME . '\diff'
  endif
  silent execute '!' . cmd . ' ' . opt . arg1 . ' ' . arg2 . ' > ' . arg3 . eq
endfunction

au GUIEnter * simalt ~x
set guifont=Lucida\ Console:h11
set backupdir=c:/lib/vim/backup
set directory=c:/lib/vim/backup
set expandtab
set tabstop=4
set shiftwidth=4
set smarttab

:nmap <C-N><C-N> :set invnumber<CR>
:nmap <C-T> :tabnew<CR>
:nmap <C-H> :set invhlsearch<CR>

colorscheme desert

Matrix Exponentiation in R

July 1st, 2008

A short while ago, I needed to do some matrix exponentiation in R (raising a matrix to a power). By default, the exponentiation operator ^ in R, when applied to a matrix, will just raise each element of the matrix to a power, rather than the matrix itself:

> A < - matrix(c(1:4), nrow=2, byrow=T)
> A
[,1] [,2]
[1,] 1 2
[2,] 3 4
> A ^ 2
[,1] [,2]
[1,] 1 4
[2,] 9 16
>
>

Whereas what we really want in this case is the equivalent of:

> A %*% A
[,1] [,2]
[1,] 7 10
[2,] 15 22
>

There are a few different ways of creating a matrix exponentiation operator in R: we could create an R function and create an exponentiation operator for matrices, similar to the %*% matrix multiplication operator that exists already, or we could write the function in C and link to it. It seems logical to create an %^% operator to match the current set of matrix operators. The choice of writing the exponentiation routine in R or C is actually less important than the exponentiation algorithm used. A fast exponentiation algorithm is the “square and multiply” method, an explanation of which can be found here.

I chose to write a basic exponentiation routine in C, by creating a new operator definition, and following these basic rules:

  • tex:A^{-1} returns the matrix inverse.
  • tex:A^0 returns the identity matrix.
  • tex:A^1 returns the original matrix.
  • tex:A^n returns A to the nth power.
  • Complex as well as real-valued matrices should be supported.

The code for this is shown below. Note that to calculate the matrix inverse, I don’t call directly into the underlying BLAS libraries, but actually call back into R’s solve() function to calculate the inverse. The changes were made to array.c in the R source tree under /src/main.

SEXP do_matexp(SEXP call, SEXP op, SEXP args, SEXP rho)
{
int nrows, ncols;
SEXP matrix, tmp, dims, dims2;
SEXP x, y, x_, x__;
int i,j,e,mode;

// necessary?
mode = isComplex(CAR(args)) ? CPLXSXP : REALSXP;

SETCAR(args, coerceVector(CAR(args), mode));
x = CAR(args);
y = CADR(args);

dims = getAttrib(x, R_DimSymbol);
nrows = INTEGER(dims)[0];
ncols = INTEGER(dims)[1];

if (nrows != ncols)
error(_("can only raise square matrix to power"));

if (!isNumeric(y))
error(_("exponent must be a scalar integer"));

e = asInteger(y);

if (e < -1)
error(_("exponent must be >= -1"));
else if (e == 1)
return x;
else if (e == -1) { /* return matrix inverse via solve() */
SEXP p1, p2, inv;
PROTECT(p1 = p2 = allocList(2));
SET_TYPEOF(p1, LANGSXP);
CAR(p2) = install("solve.default");
p2 = CDR(p2);
CAR(p2) = x;
inv = eval(p1, rho);

UNPROTECT(1);
return inv;
}

PROTECT(matrix = allocVector(mode, nrows * ncols));
PROTECT(tmp = allocVector(mode, nrows * ncols));
PROTECT(x_ = allocVector(mode, nrows * ncols));
PROTECT(x__ = allocVector(mode, nrows * ncols));

if (mode == REALSXP)
Memcpy(REAL(x_), REAL(x), (size_t)nrows*ncols);
else
Memcpy(COMPLEX(x_), COMPLEX(x), (size_t)nrows*ncols);

// Initialize matrix to identity matrix
// Set x[i * ncols + i] = 1
if (mode == REALSXP)
for (i = 0; i < ncols*nrows; i++)
REAL(matrix)[i] = ((i % (ncols+1) == 0) ? 1 : 0);
else
for (i = 0; i < ncols*nrows;i++) {
COMPLEX(matrix)[i].i = 0.0;
COMPLEX(matrix)[i].r = ((i % (ncols+1) == 0) ? 1.0 : 0.0);
}

if (e == 0) {
; // return identity matrix
}
else
while (e > 0) {
if (e & 1) {
if (mode == REALSXP)
matprod(REAL(matrix), nrows, ncols,
REAL(x_), nrows, ncols, REAL(tmp));
else
cmatprod(COMPLEX(matrix), nrows, ncols,
COMPLEX(x_), nrows, ncols, COMPLEX(tmp));

//copyMatrixData(tmp, matrix, nrows, ncols, mode);
if (mode == REALSXP)
Memcpy(REAL(matrix), REAL(tmp), (size_t)nrows*ncols);
else
Memcpy(COMPLEX(matrix), COMPLEX(tmp), (size_t)nrows*ncols);
e–;
}

if (mode == REALSXP)
matprod(REAL(x_), nrows, ncols,
REAL(x_), nrows, ncols, REAL(x__));
else
cmatprod(COMPLEX(x_), nrows, ncols,
COMPLEX(x_), nrows, ncols, COMPLEX(x__));

//copyMatrixData(x__, x_, nrows, ncols, mode);
if (mode == REALSXP)
Memcpy(REAL(x_), REAL(x__), (size_t)nrows*ncols);
else
Memcpy(COMPLEX(x_), COMPLEX(x__), (size_t)nrows*ncols);

e >>= 1;
}

PROTECT(dims2 = allocVector(INTSXP, 2));
INTEGER(dims2)[0] = nrows;
INTEGER(dims2)[1] = ncols;
setAttrib(matrix, R_DimSymbol, dims2);

UNPROTECT(5);
return matrix;
}

To actually hook this routine up to the %^% operator, I needed a further piece of glue in /src/main/names.c to associate the operator:

{"%^%", do_matexp, 3, 1, 2, {PP_BINARY, PREC_POWER, 0}}

Then to try it out (after recompiling R of course!):

> A < - matrix(c(1:4), nrow=2, byrow=TRUE)
> A %^% 2
[,1] [,2]
[1,] 7 10
[2,] 15 22
>