Shell Scripting Fun

I have a large CSV file full of open corporate events, which is the result of running queries across multiple systems. Some of these events may be duplicated across the systems, and the only unique reference is a string code called the job number. I wanted to quickly get a feel for the number of duplicates in the file, so I fired up a Cygwin bash shell, and used the following command sequence:

$ cat OpenEvents.csv | sed -n '/^Job/!p' | cut -f1 -d, | sort | uniq | wc -l

Broken down into steps, this says:

  • sed -n '/^Job/!p' – Do not print any lines beginning with the string “Job”. This strips out the header line;
  • cut -f1 -d, – Keep only the first comma-delimited field (the job number);
  • The next portion just calls an alphanumeric sort on the input;
  • The next portion calls uniq to filter out duplicates (uniq only collapses adjacent repeated lines, so you need to sort the input first);
  • The last portion calls wc -l to print the number of lines returned by uniq.

This gives me the number of distinct job numbers in OpenEvents.csv. If I want to find out how many job numbers occur more than once, I could pass the -d flag to uniq, which prints one copy of each repeated line.
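To make the two counts concrete, here is a small sketch run against made-up data. The job numbers and the helper names sample_events, count_distinct_jobs and count_repeated_jobs are hypothetical and only stand in for the real file; the pipeline itself is the one described above.

```shell
# Hypothetical sample data standing in for OpenEvents.csv.
sample_events() {
    printf '%s\n' \
        'Job Number,Description' \
        'J100,Dividend' \
        'J200,Merger' \
        'J100,Dividend' \
        'J300,Split' \
        'J200,Merger'
}

# Count distinct job numbers: drop the header, keep field 1, sort, collapse repeats.
count_distinct_jobs() {
    sed -n '/^Job/!p' | cut -f1 -d, | sort | uniq | wc -l | tr -d ' '
}

# Count job numbers that occur more than once
# (uniq -d prints one line per repeated value).
count_repeated_jobs() {
    sed -n '/^Job/!p' | cut -f1 -d, | sort | uniq -d | wc -l | tr -d ' '
}

distinct=$(sample_events | count_distinct_jobs)
repeated=$(sample_events | count_repeated_jobs)
echo "distinct job numbers: $distinct"
echo "job numbers seen more than once: $repeated"
```

With this sample there are three distinct job numbers, two of which (J100 and J200) appear more than once. The tr -d ' ' step only strips the padding that some wc implementations add.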

Of course, one of the best all-round tools for text manipulation is awk. Here is a script that filters out duplicates and puts the results into another file.

awk '{
    # Print a line only the first time it is seen, then remember it
    if (!($0 in stored_lines)) {
        print
        stored_lines[$0] = 1
    }
}' OpenEvents.csv > FilteredEvents.csv

This uses awk associative arrays (hashmaps) to record each line as it is read, and only prints a line if it has not been encountered before.
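The same associative-array idea is often written as a one-line awk idiom. The sketch below shows it twice: once keyed on the whole line, and once keyed on the first comma-separated field, which is closer to what is needed when the job number is the only unique reference. The sample rows, the array name seen and the function names are all made up for illustration, and splitting on a bare comma assumes no quoted fields contain commas.

```shell
# Hypothetical sample rows; the last J100 row differs in its description.
sample_rows() {
    printf '%s\n' \
        'J100,Dividend' \
        'J200,Merger' \
        'J100,Dividend' \
        'J100,Dividend (system B)'
}

# Keep the first occurrence of each whole line, preserving input order.
# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++ is true.
dedupe_lines() {
    awk '!seen[$0]++'
}

# Keep the first occurrence of each job number (field 1, comma-separated).
dedupe_by_job() {
    awk -F, '!seen[$1]++'
}

by_line=$(sample_rows | dedupe_lines)
by_job=$(sample_rows | dedupe_by_job)
printf 'by line:\n%s\n' "$by_line"
printf 'by job number:\n%s\n' "$by_job"
```

Deduplicating by whole line keeps the "system B" row because its text differs, whereas deduplicating by job number keeps only the first J100 row.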

Bash Parameter Substitution

There is one neat trick that the Bash shell offers that I occasionally find very useful. However, usually when I go to use it, it’s been just long enough for me to forget how to do it. So here it is, so I know how to find it.

It’s called parameter substitution, and it works like this (paraphrased from the Advanced Bash Scripting Guide):

${var#Pattern}, ${var##Pattern} – Removes from $var the shortest/longest match of Pattern at the front end of $var.

${var%Pattern}, ${var%%Pattern} – Removes from $var the shortest/longest match of Pattern at the back end of $var.
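A quick sketch of all four forms applied to one value may help them stick. The path below is invented purely for illustration, as are the variable names.

```shell
# Hypothetical path used only for illustration.
path='/data/reports/events.2024.tar.gz'

name=${path##*/}    # longest match of "*/" removed from the front
rest=${path#*/}     # shortest match of "*/" removed from the front
stem=${name%%.*}    # longest match of ".*" removed from the back
no_ext=${name%.*}   # shortest match of ".*" removed from the back

echo "$name"    # events.2024.tar.gz
echo "$rest"    # data/reports/events.2024.tar.gz
echo "$stem"    # events
echo "$no_ext"  # events.2024.tar
```

The doubled symbol (## or %%) is the greedy one; the single symbol removes as little as possible.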

I normally use it when I want to strip the file extension from a set of files and replace it with another, as part of an overall process like so:

for file in *.DAT; do mv -- "$file" "${file%.DAT}.ABC"; done

This renames all .DAT files in the current directory so that they have a .ABC extension instead.
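The rename can be tried out safely in a scratch directory first. This is only a sketch: the function name rename_dat_to_abc and the file names are made up, and it assumes mktemp -d is available. It globs directly rather than parsing ls output, and quotes every expansion, so file names containing spaces survive.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
touch "$workdir/a.DAT" "$workdir/b c.DAT" "$workdir/notes.txt"

# Rename every *.DAT file in the given directory to *.ABC.
rename_dat_to_abc() {
    for file in "$1"/*.DAT; do
        # The glob stays literal when nothing matches, so skip that case.
        [ -e "$file" ] || continue
        mv -- "$file" "${file%.DAT}.ABC"
    done
}

rename_dat_to_abc "$workdir"
ls "$workdir"
```

Afterwards the directory holds a.ABC, "b c.ABC" and the untouched notes.txt; remove it with rm -rf "$workdir" when done.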

Perl Regular Expressions

I don’t get to use Perl very often these days, but when I do, it’s usually because I want to do some powerful data manipulation in a hurry. In the last couple of weeks, I’ve used it to extract and format a large amount of financial data for reporting purposes (it is the Practical Extraction and Report Language, after all), and currently I am using it to extract data from a series of text files containing technical trivia. The text files are in question-and-answer format, and the regex I have defined to extract the data and write it to an XML-based format looks like this (snipped):

while ($trivia =~ m/^Q\s\$([A-Z0-9]{3})\)\s(.*?)(?=^A)/smg ) {
my $q = "<question number=\"$1\">$2</question>\n";
push (@questions, $q);
}

while ($trivia =~ m/^A\s\$([A-Z0-9]{3})\)\s(.*?)(?=^Q|\Z)/smg ) {
my $answer = "<answer number=\"$1\">$2</answer>\n";
push (@answers, $answer);
}

It’s a two-pass approach using two separate patterns. The two regular expressions look very similar, and indeed they are. The main difference is the lookahead portion, which acts as an anchor that tells the parser when to stop: the question pattern stops at (?=^A), the start of the next answer, while the answer pattern stops at (?=^Q|\Z), the start of the next question or the end of the input.

None of this is very hard in Java either; our patterns can be ported straight over:

Pattern patt = Pattern.compile(
"^Q\\s\\$([A-Z0-9]{3})\\)\\s(.*?)(?=^A|\\Z)",
Pattern.DOTALL | Pattern.MULTILINE);

However, the real strength of the Perl approach is its regex engine. For instance, it would be nice to switch dynamically between the anchoring conditions, depending on what we had just captured in $1. If $1 was “Foo”, we could change the anchoring condition to “Bar”, giving us a dynamic regex that seems to understand the semantics of the data it is processing, as well as the syntax.

In Perl, this is possible using the (?{}) and (??{}) constructs. (?{}) executes a block of Perl code at that point in the match, while (??{}) executes the code and then uses its return value as a pattern to be matched at that point. You can even “feed” the code with captures from the current pattern, such as $1.

To illustrate, check out the following examples in Perl.

First, our test string that we will search upon:

my $string = "Hello World";

Let’s also create a couple of lookup tables:

my %sym = (
'Hello' => 'World',
'World' => 'Hello'
);

my %sym2 = (
'Hello' => '\w{5}'
);

And now let’s use them.

my $var = "Hello";

First, a simple lookup using a variable:

if ($string =~ m/$var/) {
print "Matched variable\n";
}

Now, let’s interpolate a hash lookup into the regex:

if ($string =~ m/$sym{'Hello'}/) {
print "Matched hash lookup\n";
}

Next, let’s embed some Perl code with (??{}) and pass it the captured group $1:

if ($string =~ m/(Hello) (??{$sym{$1}})/) {
print "Matched dynamic hash lookup using backreference\n";
}

And finally (the best bit) – using a backreference, the hash lookup resolves to a string of regex metacharacters (\w{5} in this case)! This is exactly what I need to make the lookahead dynamic.

if ($string =~ m/(Hello) (??{$sym2{$1}})/) {
print "Matched dynamic hash lookup, resolving to regex metacharacters, using backreference\n";
}

It’s things like this that make me reach for Perl time and time again.