Building QuickFIX on Solaris

I just had to build the QuickFIX C++ client library on Solaris, and here are a few notes that might be useful when doing the same:

* Make sure you have /usr/ccs/bin in your PATH (so that ar can be found);
* If you are using gmake, you may need to export MAKE=gmake before you run configure.
* If building static libs as well as .so files, run configure --enable-shared=yes.

Once the libraries are built, you can compile and link using:

$ g++ -I /usr/local/include/quickfix/include/ main.cpp libquickfix.a -o main -lxml2

This will statically compile against quickfix and dynamically link against libxml2.

The Bias/Variance Tradeoff

Probably one of the nicest explanations of the bias/variance tradeoff is the one I found in the book Introduction to Information Retrieval (full book available online). The tradeoff can be explained mathematically, and also more intuitively. The mathematical explanation is as follows:

if we have a learning method that operates on a given set of input data (call it $x$) and a “real” underlying process that we are trying to approximate (call it $\alpha$), then the expected (squared) error is:

\[
\begin{aligned}
E[x-\alpha]^2 = Ex^2 – 2Ex\alpha + \alpha^2\\
= (Ex)^2 – 2Ex\alpha + \alpha^2 + Ex^2 – 2(Ex)^2 + (Ex)^2\\
= [Ex-\alpha]^2 + Ex^2 – E2x(Ex) + E(Ex)^2\\
= [Ex-\alpha]^2 + E[x-Ex]^2
\end{aligned}
\]

Taking advantage of the linearity of expectation and adding a few extra cancelling terms, we end up with the representation:

$$
Error = bias (E[x-\alpha]^2) + variance (E[x-Ex]^2)
$$

Thats the mathematical equivalence. However, a more descriptive approach is as follows:

Bias is the squared difference between the true underlying distribution and the prediction of the learning process, averaged over our input datasets. Consistently wrong prediction equal large bias. Bias is small when the predictions are consistently right, or the average error across different training sets is roughly zero. Linear models generally have a high bias for nonlinear problems. Bias can represent the domain knowledge that we have built into the learning process – a linear assumption may be unsuitable for a nonlinear problem, and thus result in high bias.

Variance is the variation in prediction (or the consistency) – it is large if different training sets result in different learning models. Linear models will generally have lower variance. High variance generally results in overfitting – in effect, the learning model is learning from noise, and will not generalize well.

Its a useful analogy to think of most learning models as a box with two dials – bias and variance, and the setting of one will affect the other. We can only try and find the “right” setting for the situation we are working with. Hence the bias-variance tradeoff.

Financial Amnesia

The FT has a nice article on financial amnesia, which talks about the desire of the CFA UK discipline to incorporate some financial history into their curriculum, ostensibly to implant enough of a sense of deja vu in budding financial managers when they encounter potential disaster scenarios that they may avoid repeating them.

I think this is a great idea – the only problem is that it probably wont have any sort of meaningful impact other than provide some fascinating course material for CFA students. The problem with an industry that makes its living from markets that essentially operate on two gears – fear and greed – is that despite all of the prior knowledge in the world, we will willingly impose a state of amnesia upon ourselves if we can convince ourselves that what is happening now is in some way different, or just different enough from whatever disasters happened before. The Internet bubble, tulip-mania, and the various property boom and busts that we have seen over the last decade share at their core a common set of characteristics, but they are different (or “new”) enough that we were able to live in a state of denial. The “fear and greed” mentality also means that even if you know you are operating in a bubble that will eventually burst, you carry on regardless, as you plan to be out of the game before it all goes bad (The Greater Fool theory).

Incidentally, if you want to read a beautifully written and entertaining account of financial bubbles by one of the greatest writers on this topic, you should read this book (“A Short History of Financial Euphoria” by J.K. Galbraith). It packs a huge amount of wit and insight into its relatively small page count.

NOTE: Galbraith also wrote the definitive account on the Great Crash of 1929, which rapidly became required reading around three years ago…it’s also excellent. Galbraith has a beautiful turn of phrase.

The Elements of Modern Javascript Style

In the spirit of the last post, I wanted to share my newfound interest in another language that I previously hated. Ive always treated JavaScript like an unloved cousin. I hated dealing with it, and my brief encounters with it were really unenjoyable. However, I’ve had to deal with it a bit more recently, and thankfully my earlier encounters are distant memories, and the more recent experience is much better. There are two reasons for this: one is the excellent JQuery framework, and the other is the disparate collection of elegant patterns and best practises that have made JavaScript into a reasonably elegant language to work with.

The single best reference I have found so far that elucidates on all of these is the excellent “Javascript Web Applications” from O’Reilly, which expands on many facets of modern JavaScript usage. I would definitely recommend checking this book out.

The Elements of Modern C++ Style

Herb Sutter has a fantastic short article over at his site on the new C++11 standard. The new extensions to the language are way overdue. One particular example that I really liked was the application of library algorithms and lambdas to create “extensions” to the language. For instance, the example below adds a “lock” construct (similar to the one in C#):

// C#
lock( mut_x ) {
    ... use x ...
}
// C++11 without lambdas: already nice, and more flexible (e.g., can use timeouts, other options)
{
    lock_guard<mutex> hold { mut_x };
    ... use x ...
}
// C++11 with lambdas, and a helper algorithm: C# syntax in C++
// Algorithm: template<typename T, typename F> void lock( T& t, F f ) { lock_guard<t> hold(t); f(); }
lock( mut_x, [&]{
    ... use x ...
});

Of course, all of this is dependent on compiler support. A couple of nice resources on the current state of C++11 support are:

http://www.aristeia.com/C++11/C++11FeatureAvailability.htm

and

https://wiki.apache.org/stdcxx/C%2B%2B0xCompilerSupport

Deducing the JDK Version of a .jar File

Here is a little script that uses Pyton to examine the contents of a jar file (or specifically the first .class file it comes across) and then reads the major version byte and maps it to a JDk version. May be useful if you have a bunch of jars compiled by different JDKs and want to figure out which is which.

#!/usr/bin/python
import zipfile
import sys
import re
class_pattern=re.compile("/?\w*.class$")
for arg in sys.argv[1:]:
    print '%s:' % arg,
    file=zipfile.ZipFile(arg, "r")
    for entry in file.filelist:
        if class_pattern.search(entry.filename):
            bytes = file.read(entry.filename)
            maj_version=ord(bytes[7])
            if maj_version==45:
                print "JDK 1.1"
            elif maj_version==46:
                print "JDK 1.2"
            elif maj_version==47:
                print "JDK 1.3"
            elif maj_version==48:
                print "JDK 1.4"
            elif maj_version==49:
                print "JDK 5.0"
            elif maj_version==50:
                print "JDK 6.0"
            break

Gdb Macros for R

When debugging R interactively, one hurdle to navigate is unwrapping SEXP objects to get at the inner data. Gdb has some useful macro functionality that allows you to wrap useful command sequences in reusable chunks. I recently put together the following macro that attempts to extract and print some useful info from a SEXP object.

It can be used as follows. For instance, given a SEXP called “e”:

(gdb) dumpsxp e
Type: LANGSXP (6)
Function:Type: SYMSXP (1)
"< -"
Args:Type: LISTSXP (2)
(SYMSXP,LISTSXP)

We can see that e is a LANGSXP, and the operator is “< -". Functions have different components - here we can see the function representation (the SYMSXP) and the function arguments (the LISTSXP).

Some knowledge of LANGSXP structure is useful here. For instance, if we know that for a LANGSXP that CAR(x) gives us the function and CDR(x) gives us the arguments, we can view the components individually.

To see the first component:

(gdb) dumpsxp CAR(e)
Type: SYMSXP (1)
"< -"

The arguments are given by the CDR of e. We can then crack open the list and view the function arguments, recursively looking through the pairlist until we get to the end:

(gdb) dumpsxp CDR(e)
Type: LISTSXP (2)
(SYMSXP,LISTSXP)
(gdb) dumpsxp CADR(e)
Type: SYMSXP (1)
"x"
(gdb) dumpsxp CADDR(e)
Type: LANGSXP (6)
Function:Type: SYMSXP (1)
"sin"
Args:Type: LISTSXP (2)
(REALSXP,NILSXP)
(gdb) dumpsxp CADDDR(e)
Type: NILSXP (0)

The NILSXP tells us that we’ve got to the end of the list.

The GDB macro is below. Put it in your .gdbinit to automatically load it when gdb starts up.

define dumpsxp
	if $arg0==0
		printf "uninitialized variable\n"
		return
	end
	set $sexptype=TYPEOF($arg0)
	# Typename
	printf "Type: %s (%d)\n", typename($arg0), $sexptype
	# SYMSXP
	if $sexptype==1
		# CHAR(PRINTNAME(x))
		print_char PRINTNAME($arg0)
	end
	# LISTSXP
	if $sexptype==2
		printf "(%s,%s)\n", typename(CAR($arg0)), typename(CDR($arg0))
	end
	# CLOSXP
	if $sexptype==3
		dumpsxp BODY($arg0)
	end
	# PROMSXP
	# Promises contain pointers to value, expr and env
	# tmp = eval(tmp, rho);
	if $sexptype==5
		printf "Promise under evaluation: %d\n", PRSEEN($arg0)
		printf "Expression: "
		dumpsxp ($arg0)->u.promsxp.expr
		# Expression: (CAR(chain))->u.promsxp.expr
	end
	# LANGSXP
	if $sexptype==6
		printf "Function:"
		dumpsxp CAR($arg0)
		printf "Args:"
		dumpsxp CDR($arg0)
	end
	# SPECIALSXP
	if $sexptype==7
		printf "Special function: %s\n", R_FunTab[($arg0)->u.primsxp.offset].name
	end
	# BUILTINSXP
	if $sexptype==8
		printf "Function: %s\n", R_FunTab[($arg0)->u.primsxp.offset].name
	end
	# CHARSXP
	if $sexptype==9
		printf "length=%d\n", ((VECTOR_SEXPREC)(*$arg0))->vecsxp.length
		#print_veclen $arg0
		print_char $arg0
	end
	# LGLSXP
	if $sexptype==10
		set $lgl=*LOGICAL($arg0)
		if $lgl > 0
			printf "TRUE\n"
		end
		if $lgl == 0
			printf "FALSE\n"
		end
	end
	# INTSXP
	if $sexptype==13
		printf "%d\n", *(INTEGER($arg0))
	end
	# REALSXP
	if $sexptype==14
		print_veclen $arg0
		print_double $arg0
	end
	# STRSXP
        if $sexptype==16
		print_veclen $arg0
		set $i=LENGTH($arg0)
		set $count=0
		while ($count < $i)
			printf "Element #%d:\n", $count
			dumpsxp STRING_ELT($arg0,$count)
			set $count = $count + 1
		end
	end
	# VECSXP
	if $sexptype==19
		print_veclen $arg0
	end
	# RAWSXP
	if $sexptype==24
		print_veclen $arg0
	end
end
define print_veclen
	printf "Vector length=%d\n", LENGTH($arg0)
end
define print_char
		# this may be a bit dodgy, as I am not using the aligned union
		printf "\"%s\"\n", (const char*)((VECTOR_SEXPREC *) ($arg0)+1)
end
define print_double
		printf "%g\n", (double*)((VECTOR_SEXPREC *) ($arg0)+1)
end

Custom Checkstyle Rule Example

Here is an example of a custom Checkstyle rule that catches the following situation: sometimes an Exception can be caught and an error message logged, but the underlying exception (and thus the stack trace) may not be logged at all. This is not a problem generally, unless the exception tends wrap an underlying cause (i.e. one or more nested exceptions). The rule is designed to catch instances like:

try {}
catch (Exception e) {
  logger.error("blah");
}

With the assumption that the following is better:

try {}
catch (Exception e) {
  logger.error("blah", e);
}

Here is the code for the custom check:

package au.com.national.efx.build;
import java.util.ArrayList;
import java.util.List;
import com.puppycrawl.tools.checkstyle.api.Check;
import com.puppycrawl.tools.checkstyle.api.DetailAST;
import com.puppycrawl.tools.checkstyle.api.TokenTypes;
/**
 * Check that attempts to catch instances of the following:
 * <code>
 * catch (Exception e) { logger.error("foo"); }
 * </code>
 *
 * with the assumption that the following is preferable:
 *
 * <code>
 * catch (Exception e) { logger.error("foo", e); }
 * </code>
 * @author rwinston
 *
 */
public class SwallowedExceptionInLoggerCheck extends Check {
	@Override
	public int[] getDefaultTokens() {
		return new int[] { TokenTypes.LITERAL_CATCH };
	}
	/**
	 * Get ident of exception
	 * Try to find it in logger error/warn parameter list
	 */
	@Override
	public void visitToken(DetailAST aAST) {
		super.visitToken(aAST);
		final DetailAST parameterDef = aAST.findFirstToken(TokenTypes.PARAMETER_DEF);
		final DetailAST ident = parameterDef.findFirstToken(TokenTypes.IDENT);
		final String exceptionIdent = ident.getText();
		final DetailAST slist = aAST.findFirstToken(TokenTypes.SLIST); // Find '{'
		// Find all method calls within catch block
		final List<DetailAST> variables = findChildASTsOfType(slist, TokenTypes.METHOD_CALL);
		try {
			for (DetailAST methodCall : variables) {
				DetailAST dot = methodCall.findFirstToken(TokenTypes.DOT);
				// I'm assuming the last child will be the method name called
				DetailAST lastIdent = dot.getLastChild();
				if (lastIdent.getText().equals("error")) {
					// Ok, now check that the ELIST contains an IDENT matching
					// the exception name
					DetailAST elist = methodCall.findFirstToken(TokenTypes.ELIST);
					boolean exceptionInParameterList = false;
					for (DetailAST identAST : findChildASTsOfType(elist, TokenTypes.IDENT)) {
						if (identAST.getText().equals(exceptionIdent))
							exceptionInParameterList = true;
					}
					if (!exceptionInParameterList) {
						log(methodCall, "error() method does not contain caught Exception as a parameter");
					}
				}
			}
		} catch (Exception e) { e.printStackTrace(); }
	}
	/**
	 * Recursively traverse an expression tree and return all
	 * ASTs matching a specific token type
	 * @param parent
	 * @param type
	 * @return
	 */
	private List<DetailAST> findChildASTsOfType(DetailAST parent, int type) {
		List<DetailAST> children = new ArrayList<DetailAST>();
		DetailAST child = parent.getFirstChild();
		while (child != null) {
			if (child.getType() == type)
				children.add(child);
			else {
				children.addAll(findChildASTsOfType(child, type));
			}
			child = child.getNextSibling();
		}
		return children;
	}
}

Project Euler Problem #21

This is a solution for problem 21 on the Project Euler website. It consists of finding the sum of all the amicable numbers under 10000. This was pretty easy to solve, but the solution could probably be improved quite a bit.

Solution #1 in R is as follows (it calculates the proper divisors of each number using prop.divs, and then adds up the sequence of amicable numbers in the main function).

prop.divs <- function(x) {
		if (x == 1) return (1)
			divs <- integer(30)
				j <- 1
				divs[j] <- 1
				j <- j + 1
				for (i in 2:(floor(x/2))) {
					if ((x %% i) == 0) {
						divs[j] <- i
							j <- j + 1
					}
				}
		sum(divs[1:(j-1)])
	}
problem.21 <- function(N) {
    s <- 0
    for (i in 2:N) {
	da <- prop.divs(i)
	if (da == i) next
	db <- prop.divs(da)
	if ( db==i ) {
		s <- s + da + db
  	}
    }
    s/2
}

The s/2 is needed as each factor is added twice during the calculation.

This gives the correct answer, but the implementation is a bit naive. I remembered coming across an article about prime factors and proper divisors on PlanetMath a while ago, and this seemed like potentially a more efficient way to calculate the factors involved. Specifically, the sum of proper divisors of a number n can be given by:

tex:{\prod_{i=1}^k\frac{p_i^{m_i+1}-1}{p_i - 1}-\prod_{i=1}^kp_i^{m_i}}.

The second attempt at this problem looked like the following:

prime.sieve <- function(n) {
 a <- seq.int(1,n)
 p <- 1
 M <- as.integer(sqrt(n))
 while ((p <- p + 1) <= M) {
  if (a[p] != 0)
    a[seq.int(p*p, n, p)] <- 0
 }
 a[a>1 & a>0]
}
sum.proper.divisors <- function(x) {
   primes <- prime.sieve( x )
   primes <- primes[ x %% primes == 0]
  geo.sum <- numeric(length(primes))
  i <- 1
  for (prime in primes) {
    n <- x
    curr <- 0
    while (n %% prime == 0) {
      curr <- curr + 1
      n <- n %/% prime
    }
    geo.sum[i] <- ( (prime^(curr+1) - 1)/(prime - 1) )
    i <- i + 1
  }
  prod(geo.sum)-x
}
problem.21_2 <- function(N) {
  s <- 0
  for (i in 2:N) {
		da <- sum.proper.divisors(i)
		if (da == i) next
		db <- sum.proper.divisors(da)
		if (db==i) s <- s + da +db
	}
	s/2
}

This also gives the correct answer, but with much reduced runtime overhead:


> system.time(problem.21(10000))
user system elapsed
103.943 0.511 106.978
> system.time(problem.21_2(10000))
user system elapsed
24.834 0.160 26.565