Archive for the ‘Coding’ Category

Project Euler Problem #11

Tuesday, November 25th, 2008

Problem 11 on Project Euler involves calculating the maximum product of adjacent numbers in any direction in a 20×20 matrix.

The solution below takes advantage of the symmetry of calculations to cut down on unnecessary loop operations:

problem11 < - function() {
    numbers <- scan("problem11.dat")
        m <- matrix(as.numeric(numbers), 20, byrow=TRUE)
        maxprd <- 0
        N <- 20; n <- 4
        prd1 <- 0; prd2 <- 0; prd3 <- 0
        dims <- dim(m)
        a <- (n-1)
        x <- c(0:a)
        for (i in 1:(dims[1])) {
            for (j in 1:(dims[2])) {
                prd1 <- ifelse(j <= N-a, prod(m[i,j+x]), 0) # row prod
                    prd2 <- ifelse(i <= N-a, prod(m[i+x,j]), 0) # column prod
# lower right diagonal
                    prd3 <- ifelse(i <= N-a && j <= N-a, prod(diag(m[i:(i+a),j:(j+a)])),0)
# lower left diagonal
                    prd4 <- ifelse(i <= N-a && j > a, prod(diag(m[i:(i+a),j:(j-a)])), 0)
                    maxprd < - max(prd1,prd2,prd3,prd4,maxprd)
            }
        }
    maxprd
}

Analysing SVN Commits Across Committer Groups

Friday, October 3rd, 2008

Here is an example of using the SVNKit API to crawl a SVN repository and pick up the commit sizes. It uses a very simple (and incorrect) heuristic for estimating the number of lines changed per commit - it just gets the absolute value of the difference of the numer of lines added and subtracted per commit.

The code below will produce a comma-separated values file containing the author, commit time, line change count estimate, and revision number.

Loading the resulting file into R allows us to apply some analysis. We can plot the total number of commits per comitter:

Commits By Author

Commits By Author

Or look at the total number of lines committed on each commit:


And look at some summary stats (again, per author):

$user1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 1.0 5.0 439.3 45.5 45100.0

$user2
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 3.0 26.0 294.9 105.5 62700.0

$user3
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 1.00 1.00 46.64 5.00 22300.00

$user4
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 5.5 51.0 225.5 166.0 1882.0

$user5
Min. 1st Qu. Median Mean 3rd Qu. Max.
39.0 108.0 267.0 231.4 298.0 445.0

$user6
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 2.0 7.0 181.3 41.0 21170.0

$user7
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 5.0 34.5 164.8 136.0 3066.0

You can see from the entries for the first couple of authors above that the mean is skewed by some very large commits - making the median a much more robust measure of average lines per commit.

package com.researchkitchen.svn;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.text.SimpleDateFormat;
import java.util.ArrayList;

import org.tmatesoft.svn.core.SVNException;
import org.tmatesoft.svn.core.SVNLogEntry;
import org.tmatesoft.svn.core.SVNURL;
import org.tmatesoft.svn.core.auth.ISVNAuthenticationManager;
import org.tmatesoft.svn.core.internal.io.svn.SVNRepositoryFactoryImpl;
import org.tmatesoft.svn.core.io.SVNRepository;
import org.tmatesoft.svn.core.io.SVNRepositoryFactory;
import org.tmatesoft.svn.core.wc.SVNClientManager;
import org.tmatesoft.svn.core.wc.SVNDiffClient;
import org.tmatesoft.svn.core.wc.SVNRevision;
import org.tmatesoft.svn.core.wc.SVNWCUtil;

public class SVNClient {

  @SuppressWarnings("unchecked")
   public static void main(String[] args) throws IOException {
     final String url = "svn://myserver/myproject/trunk";
     final String name = "rory";
     final String pass = "password";
     BufferedWriter writer = new BufferedWriter(new FileWriter(new File("svn-stats.dat")));
     SimpleDateFormat formatter = new SimpleDateFormat("dd/M/yyyy HH:mm:ss");

    try {
       SVNRepositoryFactoryImpl.setup();
       SVNURL svnUrl = SVNURL.parseURIDecoded(url);
       ISVNAuthenticationManager authManager = SVNWCUtil.createDefaultAuthenticationManager(name, pass);
       SVNRepository repo = SVNRepositoryFactory.create(svnUrl);
       repo.setAuthenticationManager(authManager);

      // Create a diff client
       SVNClientManager clientManager = SVNClientManager.newInstance();
       SVNDiffClient diffClient = clientManager.getDiffClient();

      writer.write("Revision,Author,Date,LinesChanged\n");

      // Get svn log for entire repo history
       long currentRev = repo.getLatestRevision();
       ArrayList<SVNLogEntry> entries = new ArrayList<SVNLogEntry>(repo.log(new String[] {""}, null, 1, currentRev, true, true));

      // Diff all subsequent revisions
       for (int i = 1; i < entries.size(); ++i) {
         int changedThisCommit = 0;
         SVNLogEntry current = entries.get(i);
         SVNLogEntry prev = entries.get(i-1);

        System.out.println("Revision " + current.getRevision()
             + " committed by " + current.getAuthor());

        ByteArrayOutputStream io = new ByteArrayOutputStream();
         System.out.println("Diff between " + current.getRevision() + "=>" + prev.getRevision() + ":");
         diffClient.doDiff(svnUrl, SVNRevision.HEAD, SVNRevision.create(prev.getRevision()),
             SVNRevision.create(current.getRevision()), true, false,io);

        // Very basic (and probably wrong) changed lines metric
         // see http://en.wikipedia.org/wiki/Diff#Unified_format
         BufferedReader br = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(io.toByteArray())));
         String line = null;
         while((line = br.readLine()) != null) {
           if (line.matches("^\\+([^\\+]).*"))
             changedThisCommit++;
           else if (line.matches("^\\-([^\\-]).*"))
             changedThisCommit–;
         }
         changedThisCommit = (changedThisCommit < 0 ? -changedThisCommit : changedThisCommit) + 1;
         System.out.println("Lines changed this commit:" + changedThisCommit);
         br.close();

        writer.write(current.getRevision() + "," + current.getAuthor() + ","
             + formatter.format(current.getDate()) + "," + changedThisCommit + "\n");      
       }
       writer.close();
     } catch (SVNException se) {
       se.printStackTrace();
     } catch (FileNotFoundException e) {
       e.printStackTrace();
     }
   }
}

The Gini Coefficient As A Measure of Software Project Risk

Friday, September 12th, 2008

Introduction

In economics, the Gini Coefficient (http://www.statistics.gov.uk/about/methodology_by_theme/gini/default.asp) is a standard quantitative measure of the relative inequality in the distribution of wealth. The name “Gini Coefficient” is a moniker for a large family of variations on the basic inequality measure, but the standard interpretation is that of the ratio of the area under the Lorenz curve (a function of the cumulative distribution) to that of the line of perfect equality. The coefficient is normalized such that it lies between 0 and 1, where 0 represents perfect equality (for example, everyone has the same amount of wealth), and 1 represents perfect inequality (one person holds all of the wealth). For instance, say we have a theoretical population where everyone has the same income, say $100. If we are using R, we can calculate the Gini coefficient (using the ineq library):


>library(ineq)
> x < - rep(100,10)
> x
[1] 100 100 100 100 100 100 100 100 100 100
> round(Gini(x))
[1] 0
> plot(Lc(x))

Lorenz Curve (perfect equality)

Lorenz Curve (perfect equality)

Now if we create a theoretical income distribution where one member (or class) earns $100 and everybody else earns nothing, we have perfect inequality:

> x < - c(100,rep(0,9))
> x
[1] 100 0 0 0 0 0 0 0 0 0
> round(Gini(x))
[1] 1
> plot(Lc(x))

Lorenz Curve (Inequality)

Lorenz Curve (Inequality)

But what has this got to do with software?

At the User! 2008 conference in Germany this year, John Fox gave a superb talk on the social construction of the R project. The talk was very interesting for a number of reasons, but mainly because he looked at the project’s evolution not from the standard technical viewpoint that we tend to unconsciously fall into, but rather from the viewpoint of that of a qualitative social scientist. The result was a fascinating exposition of the social dynamics behind an extremely popular and large open source project (The slides for John’s talk are available online from the conference site here).

One surprising fact that John revealed in his presentation was the relative dependence of the R project (which has a core team of around 20 members) on a single individual committer. To quantify this, John calculated the Gini coefficient for the R project, where the inequality metric was based on the number of commits per core team member (extracted from the R svn logs). The results were surprising - a Gini coefficient of over 0.8, which signifies a large degree of inequality in the number of commits per member. This may signify a large dependence on a single member of the project team, a situation referred to in management-speak as “key man risk”.

We can reproduce this analysis using R and a copy of the R svn commit logs (found here). In order to turn the output of svn log into something we can work with, we need to massage the data slightly. I just did this in Vim, using the two commands:

%v/^r\d\+\s\+|/d

To remove all lines that we are not interested in, followed by

%s/^r\(\d\+\)\s\+|\s\+\(\w\+\)\s\+|\s\+\(\d\+-\d\+-\d\+ \d\+:\d\+:\d\+\) .*/\1,\2,\3/

to turn the log output into a comma-delimited triplet of revision number, author, and date. To load it into R, I used read.table():

R_svn < - read.table("/tmp/R_svnlog_2008.dat", col.names=c("rev","author","date"),sep=",")

As R will turn repeated string data (such as the committer name) into a factor, it is trivial to extract the data we need to calculate the Gini coefficient using a contingency table:

> round(Gini(table(R_svn$author)), 2)
[1] 0.81

And then plot the Lorenz curve:

Lorenz Curve (R commit data)

Lorenz Curve (R commit data)

As you can see, the curve is skewed towards large inequality. The (possibly) worrying aspect of this metric is that it is increasing over time (as illustrated in John Fox’s presentation), signifying that dependence on a single team member is growing.

What’s The Problem With This Metric?

So is this actually meaningful - i.e. is there anything to be concerned about in this measure? Well, for starters, there are a few obvious drawbacks with this metric. One primary flaw is that the number of commits per developer is a weak measure if we take it to signify total contribution towards a project. Some developers prefer to commit smaller changes more often, whilst others prefer to commit larger, more complete changes less frequently. The number of commits is also completely independent from the quality of those commits. Another crucial factor is that committing code is just one part of the work that goes into a successful open source project. Other tasks include, but are not limited to: website maintenance, release management, mailing list participation and support, and general advocacy and awareness tasks. A blunt metric such as number of commits per developer will not take any of these into account.

Improving The Metric

Due to the nature of software development, no single quantitative measure will be able to accurately describe a complete measure of contribution. However, if we were to try and improve on the basic commit metric, we could conceivably propose something like the following:

tex:C = \beta_1 \lambda(N_C) + \beta_2 \lambda(N_L) + \beta_3 \lambda(N_I) + \beta_4 \lambda(N_M)

Where the tex:\beta components are just weights applied to each factor, and the factors are:

  • tex:N_C: Number of commits;
  • tex:N_L: Number of source lines changed;
  • tex:N_I: Number of issues or bugs closed/resolved/participated in;
  • tex:N_M: Number of messages posted to e.g. a user mailing list

The tex:\lambda function is just an exponential decay factor, to compensate for the survivor bias inherent in such statistics. Most of these numbers are reasonably easily obtainable for open source projects. Such a metric would probably be more informative than the simple “commits per developer” model, in that it would provide a slightly more balanced notion of “contribution”.

Some Other Examples

Just out of curiosity, I decided to calculate the coefficient for some other large software projects, including Apache httpd and Parrot (Perl 6). The results for Parrot are shown below:

> round(Gini(table(parrot_svn$author)),2)
[1] 0.74

Lorenz Curve for Perl 6

Lorenz Curve for Perl 6

Parrot exhibits a few of the features that seem to be common to many open source projects: the commit distribution is mainly peaked at the low end, with a smaller set of peaks at the high end, signifying that most project members make a relatively small number of commits, but an individual or small core group of individuals make a much larger set of commits. This makes perfect sense: project founders, so-called “benevolent dictators”, developers in the employ of companies who contribute to open source projects, or simply developers with more time and energy on their hands will contribute the lion’s share of the project effort. In a commercial enterprise, this may be more of a concern (due to the time and cost implications of sudden replacement) than in open source projects, where by definition, the openess of the codebase and contribution process offers a buffer against adverse consequences of key man risk.

TimedMap Implementation

Wednesday, August 20th, 2008

Here is the implementation for a timed concurrent map implementation that I wrote a while ago. It delegates to an underlying ConcurrentMap, and uses a single ReentrantLock to synchronize writers. It is designed to use in situations where objects eventually age themselves out of cache storage.

package com.researchkitchen.map;

import java.util.Map;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;/**
 * A wrapper around a {@link ConcurrentMap} that expires entries from the underlying map
 * after a specific period (the TTL value) has expired. 
 * @author winstor
 *
 * @param <K>
 * @param <V>
 */
public final class TimedMap<K,V> {
    private final ConcurrentMap<K, ExpiredObject<K,V>> map = new ConcurrentHashMap<K, ExpiredObject<K,V>>();
    private final Lock writeLock = new ReentrantLock();
    private static final long DEFAULT_TTL = 60000L;
    private final Timer timer = new Timer("TimedMap Timer", true);
    /**
     * A wrapper for a underlying object that associates a {@link TimerTask} instance
     * with the object.
     * @author winstor
     *
     * @param <K> The key type K
     * @param <V> The value type V
     */
    class ExpiredObject<K,V> {
        private final V value;
        private final ExpiryTask<K> task;
        private final long ttl;

        public ExpiredObject(K key, V value) {
            this(key, value, DEFAULT_TTL);
        }

        public ExpiredObject(K key, V value, long ttl) {
            this.value = value;
            this.task = new ExpiryTask<K>(key);
            this.ttl = ttl;
            timer.schedule(this.task, ttl);
        }

        public ExpiryTask<K> getTask()   { return task; }
        public V getValue()              { return value; }
        public long getTtl()             { return ttl; }
    }

    /**
     * A {@link TimerTask} implementation that removes its
     * associated entry (identified by a key K) from the
     * internal {@link Map}.
     * @author winstor
     *
     * @param <K> The object key
     */
    class ExpiryTask<K> extends TimerTask {
        private final K key;

        public ExpiryTask(K key) {
            this.key = key;
        }

        public K getKey() {
            return key;
        }

        @Override
        public void run() {
            System.out.println("Expiring element with key [" + key + "]");
            try {
                writeLock.lock();
                if (map.containsKey(key))
                    map.remove(getKey());
            }
            finally {
                writeLock.unlock();
            }
        }
    }

    /**
     * Insert an entry into the underlying map, specifying a time-to-live
     * for the element to be inserted.
     * <p/>
     * @param key The key of the element to be inserted.
     * @param value The item to be inserted.
     * @param expiry A time-to-live value (specified in milliseconds).
     */
    @SuppressWarnings("unchecked")
    public void put(K key, V value, long expiry) {
        try {
            writeLock.lock();

            final ExpiredObject<K, V> object =
                map.putIfAbsent(key, new ExpiredObject<K, V>(key, value, expiry));
            /* Was there an existing entry for this key? If so, cancel the existing timer */
            if (object != null)
                object.getTask().cancel();
        }
        finally {
            writeLock.unlock();
        }
    }

    /**
     * Insert an entry into the map with a default time-to-live value.
     * @param key The key of the element to be inserted.
     * @param value The item to be inserted.
     */
    public void put(K key, V value) {
        put(key, value, DEFAULT_TTL);
    }

    /**
     * Retrieve the entry identified by <code>key</code>, or <code>null</code> if
     * it does not exist.
     * @param key The key identifying the entry to retrieve
     * @return The entry corresponding to <code>key</code>, or <code>null</code>.
     */
    @SuppressWarnings("unchecked")
    public V get(K key) {
        return (V) (map.containsKey(key) ? map.get(key).getValue() : null);
    }

    /**
     * Clear the underlying map and cancel any outstanding expiry timers.
     */
    public void clear() {
        try {
            writeLock.lock();
            for (ExpiredObject<K, V> obj : map.values()) {
                obj.getTask().cancel();
            }
            map.clear();
            timer.purge();  // Force removal of all cancelled tasks
        }
        finally {
            writeLock.unlock();                                                                                                                                                                                                                                                                                                                                                
        }
    }

    public boolean containsKey(K key) {
        return map.containsKey(key);
    }

    public boolean isEmpty() {
        return map.isEmpty();
    }

    /**
     * Remov the element identified by <code>key</code> from the map, returning <code>null</code>
     * if it does not exist.
     * @param key The key identifying the element to remove
     * @return The removed element, or null if it was not present.
     */
    public V remove(K key) {
        final ExpiredObject<K,V> object;
        try {
            writeLock.lock();
            System.out.println("Removing element with key:" + key);
            object = map.remove(key);
            if (object != null)
                object.getTask().cancel();
        }
        finally {
            writeLock.unlock();
        }
        return (object == null ? null : object.getValue());
    }

    public int size() {
        return map.size();
    }
}

The Problem With #defines

Wednesday, July 23rd, 2008

The C/C++ #define mechanism is a very powerful, but very blunt templating approach. It’s a simple textual search-and-replace mechanism, which can be very useful, but not so subtle.

This was illustrated to me today when I ran into a problem with some code for an R extension that I am writing, that interfaces to Reuters market data feeds. The offending code is shown below:


RTRXFileDb* configDb = new RTRXFileDb(configPath);
if (configDb->error()) {
      Rf_error("Error initializing SFC: %s", configDb->errorText());
            return ScalarReal(-1);
}

This code creates an object instance, and checks the boolean member property error() property for problems. If there is a problem, it calls the R function error() to print the error to the R console.

When I tried to compile this code, however, I got the following error:

error C2039: ‘Rf_error’ : is not a member of ‘RTRXFileDb’

This initially confused me, but then I looked at the R header file error.h and found the definition of error():

#define R_ERROR_H_

#ifdef  __cplusplus
extern "C" {
#endif

void    Rf_error(const char *, …);
void    Rf_warning(const char *, …);
void    WrongArgCount(const char *);
void    UNIMPLEMENTED(const char *);
void    R_ShowMessage(const char *s);
    

#ifdef  __cplusplus
}
#endif

#ifndef R_NO_REMAP
#define error Rf_error
#define warning Rf_warning
#endif

#endif /* R_ERROR_H_ */

So error is mapped to Rf_error via a #define directive, meaning that whenever the preprocessor encountered the “error” token, it went ahead and replaced it with the R-mapped definition, regardless of its syntactic context. This is normally not a problem, apart from when we get name collisions, as we have here. There are a few potential ways to fix this, none of them ideal. Some of the options are:

  • #define R_NO_REMAP in the preprocessor, which will stop error being mapped to Rf_error, and just use Rf_error explicitly. The problem with this approach is that R_NO_REMAP will undefine a whole bunch of other R functions, making it tedious to replace every one.
  • Change the R header file: either change the #define block to map Rf_error to _error, or some alternative name, and just change the usages of that, or alternatively remove the error mapping completely.
  • Use #undef error and just use Rf_error explicitly (Pointed out on Reddit)

I’m going to go for one of the latter options, as they are the path of least resistance. This was actually a very small issue. But name collisions of this type can potentially be nasty, or at the very least tedious to fix.

Project Euler Problem #10 (R)

Saturday, July 19th, 2008

Problem 10 asks us to find the sum of the prime numbers below 2*106. With the sieve() routine written for one of the earlier problems, this is easy: it becomes

> sum(sieve(2E6))

It is slow, however. It makes you appreciate how computationally intensive it is to search for larger primes.

Project Euler Problem #9 (R)

Saturday, July 19th, 2008

Problem 9 involves finding the Pythagorean triple tex:a^2+b^2=c^2 where a+b+c=1000.

Pythagorean triples have lots of useful relationships that we can exploit. The ones we shall use to solve this problem are:

tex:a = k(m^2-n^2)

tex:b = k(2mn)

tex:c = k(m^2+n^2)

In this case, we take tex:k=1, and generate the triples as follows:

# Problem 9
# Pythagorean Triple
problem9 <- function() {

        sum <- 0
        a <- 0; b <- 0; c <- 0;
        m <- 0

        for (m in 1:1000) {
         for (n in m:1000) {
         a <- m^2-n^2
         b <- 2*m*n
         c <- n^2 + m^2

         if (all(a^2 + b^2 == c^2, sum(a,b,c)==1000)) {
          return(prod(a,b,c))
         }
        }
        }
}

Project Euler Problem #8 (R)

Saturday, July 19th, 2008

Problem #8 on the Project Euler problem list requires finding the greatest product of five consecutive digits in a large string of integers.

For this problem, I copied and pasted the large number into a file, and read it in using scan(). As there were line breaks in the copied and pasted numbers, I used paste() with the collapse argument to restore it to a single line. From there, it was a matter of taking 5-character substrings, converting them to numbers, and using prod() to calculate the product of each group of 5 numbers.

gprod <- function() {
 # Read number as string
 snum <- scan(file="number.dat", what=character(0),
  strip.white=TRUE, quiet=TRUE)
 # Concatenate lines into a single line
 snum <- paste(snum, collapse="", sep="")
 mp <- 0
 for (i in 1:(nchar(snum)-4)) {
  s <- substr(snum, i, i+4)
  m <- prod(as.numeric(strsplit(s,"")[[1]]))
  if (m > mp) {
   mp <- m
}
 }
 mp
}

Project Euler Problem #7 (R)

Friday, July 18th, 2008

Problem 7 on the Project Euler site is the following challenge:

By listing the first six prime numbers: 2, 3, 5, 7, 11, and 13, we can see that the 6th prime is 13.

What is the 10001st prime number?

The routine below returns a list of primes, each successive prime being calculated by testing none of the currently known primes divide the current integer counter value evenly. If not, the current integer value is added to the list of known primes.

nprime <- function(n) {
        if (n == 1) {
                return (2)
        }
        psofar <- c()
        psofar[1] <- 2
        j <- 2
        p <- 1
        i <- 2
        while (p < n)  {
                if ( all((i %% psofar) > 0) ) {
                        psofar[j] <- i
                        j <- j + 1
                        p <- p + 1
                }
                i <- i + 1
        }
        psofar
}

To return the 10001st prime, we invoke it like so:

>tail(psofar(10001),1)

Project Euler Problem #6 (R)

Friday, July 18th, 2008

Problem 6 in the Project Euler list asks us to:

Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum.

The literal translation of this is:

sumdiff <- function(n) {
    sum(c(1:n))^2-sum(c(1:n)^2)
}

However, this is another problem (like Problem #1) that has a closed-form solution, and so can run in constant time.

The closed form expression for the sum of the squares is

tex:\sum_{i=1}^{n}i^2 = \frac{1}{6}n(2n+1)(n+1)

The closed form expression for the square of the sum is

tex:\left(\sum_{i=1}^{n}i\right)^2 = \left(\frac{n(n+1)}{2}\right)^2

So if we subtract and simplify, we end up with the solution

tex:\frac{1}{12}n(3n^3+2n^2-3n-2).

The solution is shown below:

sumdiff2 <- function(n) {
    #(1/4)*(n^2)*(1+n)^2 - (1/6)*n*(1+n)*(1+2*n)
    (1/12)*(n)*(-2-3*n+2*n^2+3*n^3)
}

We can calculate the answer as shown:

> sumdiff2(100)
[1] 25164150

Needless to say, closed form solutions are preferable where possible.