
Data Precision, Excel, and Commons::Math

A post on the Jakarta User mailing list piqued my interest this week. The poster had noticed that Commons::Math was giving significantly different results for some statistical measures than Excel. This, if true, would be a pretty serious situation. Excel’s calculation engine is proven and very mature (I know one of the guys who works on it, and he’s a genius), so any discrepancy would seem to point to Commons::Math.

Needless to say, the best way to verify this is with a simple “spike”, as the agile guys would say. So I fired up Excel and, using its random number generator, produced 20,000 normally distributed numbers with a standard deviation of 1,000 and a mean of 50,000. I used the Tools > Data Analysis add-in to do this, but you could also use the =NORMINV(RAND(), mean, standard_dev) function.
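
If you don’t have the Analysis ToolPak to hand, something like the following would generate a similar test file directly in Java using Random.nextGaussian(). This is just a sketch (the class name, output path, and parameters are placeholders, and it is not how the data for this test was actually produced):

package uk.co.researchkitchen.math;

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

public class GenerateTestData {

  public static void main(String[] args) throws IOException {
    Random random = new Random();
    PrintWriter out = new PrintWriter(new FileWriter("c:\\temp\\Book1.txt"));
    for (int i = 0; i < 20000; i++) {
      // nextGaussian() returns N(0, 1), so scale by sigma and shift by the mean
      double value = 50000.0 + 1000.0 * random.nextGaussian();
      out.println(value);
    }
    out.close();
  }
}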

When this was complete, I exported the data to a text file, then read the values back in and calculated some simple stats using Commons::Math. Here is the sample program, if you’re interested:


package uk.co.researchkitchen.math;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.commons.math.stat.descriptive.moment.Mean;
import org.apache.commons.math.stat.descriptive.moment.StandardDeviation;
import org.apache.commons.math.stat.descriptive.moment.Variance;
import org.apache.commons.math.stat.descriptive.rank.Median;

public class TestPrecision {
  
  static final int DATA_SIZE = 20000;
  
  double[] data = new double[DATA_SIZE];
  
  public static void main(String[] args) throws IOException {
    TestPrecision testPrecision = new TestPrecision();
    // Book1.txt is an exported Excel spreadsheet containing
    // 20,000 normally distributed numbers with a mean of ~50,000
    // and a stdev (sigma) of ~1,000
    testPrecision.calculateStats("c:\\temp\\Book1.txt");
  }
  
  public void calculateStats(String filename) throws IOException {
    BufferedReader br = new BufferedReader(new FileReader(filename));
    
    int count = 0;
    String line = null;
    while ((line = br.readLine()) != null) {
      double datum = Double.valueOf(line);
      System.out.println(datum);
      data[count++] = datum;
    }
    br.close();
    
    System.out.println("Read " + count + " items of data.");
    
    System.out.println("Standard deviation = " new StandardDeviation().evaluate(data));
    System.out.println("Median = " new Median().evaluate(data));
    System.out.println("Mean = " new Mean().evaluate(data));
    System.out.println("Variance = " new Variance().evaluate(data));
  }

}
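
One thing worth noting when comparing against Excel: the StandardDeviation and Variance classes used above are bias-corrected (the 1/(N-1) sample form) by default. Both take a boolean flag in their constructor if you want the 1/N population form instead, which is what Excel's STDEVP and VARP return. A minimal sketch, assuming the same data array and imports as the program above:

    // Default constructors give the bias-corrected (1/(N-1)) sample estimators
    double sampleStdDev       = new StandardDeviation().evaluate(data);      // cf. Excel STDEV
    double sampleVariance     = new Variance().evaluate(data);               // cf. Excel VAR

    // Passing false gives the 1/N population estimators
    double populationStdDev   = new StandardDeviation(false).evaluate(data); // cf. Excel STDEVP
    double populationVariance = new Variance(false).evaluate(data);          // cf. Excel VARP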

I then went back to Excel and calculated the same measures in there (calculating both the 1/(N-1) sample and 1/N population standard deviation and variance). Here are the tabulated results:

                     Commons Math          Excel
Standard Deviation   1005.8672054459015    1005.8672054332
Median               50011.934275          50011.9342750000
Mean                 50008.74172390576     50008.7417239057
Variance             1011768.8349915474    1011768.8349659700

As suspected, the two sets of results are almost identical bar some rounding differences, and the agreement is at least an order of magnitude closer than the figures quoted in the original post (I told Excel to limit the displayed precision to 10 digits, which is why some of its figures appear to have less precision). I don’t know what precision Excel uses internally for these calculations. It might be interesting to write a BigDecimal-based equivalent of the statistics package in Commons::Math to compare.
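
As a rough sketch of what that might look like, here is an arbitrary-precision mean and sample variance using BigDecimal. The class name, the 50-digit MathContext, and the choice of the 1/(N-1) form are all placeholder choices of mine, not something taken from Commons::Math:

package uk.co.researchkitchen.math;

import java.math.BigDecimal;
import java.math.MathContext;

public class BigDecimalStats {

  // Division must be bounded to some precision, since quotients like 1/3 never terminate
  private static final MathContext MC = new MathContext(50);

  public static BigDecimal mean(BigDecimal[] data) {
    BigDecimal sum = BigDecimal.ZERO;
    for (BigDecimal d : data) {
      sum = sum.add(d);
    }
    return sum.divide(new BigDecimal(data.length), MC);
  }

  public static BigDecimal sampleVariance(BigDecimal[] data) {
    BigDecimal mean = mean(data);
    BigDecimal sumSquares = BigDecimal.ZERO;
    for (BigDecimal d : data) {
      BigDecimal deviation = d.subtract(mean);
      sumSquares = sumSquares.add(deviation.multiply(deviation));
    }
    // Bias-corrected 1/(N-1) form, matching the Commons Math default
    return sumSquares.divide(new BigDecimal(data.length - 1), MC);
  }

}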

There may be other reasons why the numbers given in the original example don’t match, but unless I’m missing something obscure or specific to that particular use case, it looks like the issue is not with the internal implementation of [math] (which, incidentally, looks like a very neat little toolkit).