Shell Scripting Fun

I have a large CSV file full of open corporate events, which is the result of running queries across multiple systems. Some of these events may be duplicated across the systems, and the only unique reference is a string code called the job number. I wanted to quickly get a feel for the number of duplicates in the file, so I fired up a Cygwin bash shell, and used the following command sequence:

$ cat OpenEvents.csv | sed -n '/^Job/!p' | cut -f1 -d, | sort | uniq | wc -l

Which, broken down into steps, says:

  • sed -n '/^Job/!p' – Do not print any lines beginning with the string “Job”. This strips out the header line;
  • cut -f1 -d, – Extract the first comma-delimited field, which is the job number;
  • The next portion just calls an alphanumeric sort on the input;
  • The next portion calls uniq to filter out duplicates (by default, uniq only removes adjacent duplicated lines, so you need to sort the input first); a shorter variant using sort -u is shown after this list;
  • The last portion calls wc -l to print the number of lines returned by uniq.
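
As mentioned above, sort -u combines the sort and the duplicate removal into a single step, so the same count can be obtained with one less command in the pipeline:

$ cat OpenEvents.csv | sed -n '/^Job/!p' | cut -f1 -d, | sort -u | wc -l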

This gives me the number of distinct job numbers in OpenEvents.csv. If I want to see how many job numbers are duplicated, I can pass the -d flag to uniq, which prints only the repeated entries (one copy of each).
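
For example, something along these lines gives both the number of job numbers that occur more than once and, via uniq -c, a quick league table of the worst offenders:

$ cat OpenEvents.csv | sed -n '/^Job/!p' | cut -f1 -d, | sort | uniq -d | wc -l
$ cat OpenEvents.csv | sed -n '/^Job/!p' | cut -f1 -d, | sort | uniq -c | sort -rn | head

uniq -d prints one copy of each repeated line, uniq -c prefixes every line with its occurrence count, and sort -rn | head pulls out the most frequently duplicated job numbers.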

Of course, one of the best all-round tools for text manipulation is awk. Here is a script that filters out duplicates and puts the results into another file.

awk '{
    if (!($0 in stored_lines))
        print                 # only print lines we have not seen before
    stored_lines[$0] = 1      # remember every line as it is read
}' OpenEvents.csv > FilteredEvents.csv

This uses awk's associative arrays (hashmaps) to store each line as it is read, and only prints a line if it has not been encountered before.
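
The same idea is usually written as the classic awk one-liner !seen[$0]++, and if duplicates should be judged on the job number alone rather than the whole line, the key can be switched to the first field. A sketch, assuming (as above) that the job number is the first comma-delimited column; the two commands are alternatives, not a sequence:

awk '!seen[$0]++' OpenEvents.csv > FilteredEvents.csv        # whole-line duplicates
awk -F, '!seen[$1]++' OpenEvents.csv > FilteredEvents.csv    # duplicates judged by job number only

The expression is true only the first time a given key appears, which triggers awk's default action of printing the line; the post-increment then marks the key as seen.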