Web Retrieval With Ruby

I recently needed to automatically retrieve and parse a table of BT fixed-line call tariff data. Normally I would use Perl for this sort of thing. On this occasion, however, I decided it would be a good opportunity to learn a bit of Ruby.


require 'net/http'
require 'html/tree'
require 'html/xmltree'
require 'http-access2'

client = HTTPAccess2::Client.new()
url = 'http://www.bt.com/...' # long URI omitted

parser = HTMLTree::Parser.new(false,false)
parser.feed(client.get_content(url))

tariffs = Array.new()

# Iterate through each <tr>
rows = parser.html.select { |ea| ea.tag == 'tr' }

# Extract and normalize the content
rows.each { |row|
  texts = row.select { |item| item.data? }. # just look at cdata
    collect { |data| data.strip }. # strip it
    select { |data| data.size > 0 } # and keep the non-blank fields
  texts = texts.join('|')

  # Only store the contents that contain actual call tariff data
  tariffs.push(texts) if (/^[^|]+\|((\d)+\.(\d)+\|){2}(\d)+\.(\d)+$/ =~ texts)
}

# Send to stdout so we can run $ ./client.rb > tariffs.dat
puts tariffs

This produces a pipe-delimited output of call tariffs by country. My initial impressions of Ruby (I’m way behind the curve here) are:

  • It’s very “Perl-like” in some ways – you can see a definite Perl influence in the language.
  • I love the iterator and closure syntax: collect(), map(), etc. It’s very clean and intuitive.
  • The idea of code blocks as first-class objects seems to be integral to Ruby: in the code above, the result of a select {} call is passed to collect {}, whose result is passed in turn to another select {} (all done within an each {} block). This is very reminiscent of the simple building-block approach of Unix shell pipelines.
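To try the chaining idea without the HTML parsing libraries, here is a minimal sketch that runs on plain arrays using only core Ruby. The sample rows, the header names, and the `tariff_line` variable are all invented for illustration and are not taken from the BT page. The regex is a slightly simplified version of the one in the script (no inner capture groups, and `\A`/`\z` anchors), and `match?` assumes Ruby 2.4 or later.

```ruby
# Hypothetical rows: each inner array stands in for the text nodes of one <tr>.
# All names and numbers here are made up.
rows = [
  ["  Freedonia ", "", "1.50", "1.25", " 1.00\n"],
  ["Country", "Daytime", "Evening", "Weekend"], # header row, no numbers
  ["", "   "]                                   # row with only blank cells
]

# A name field followed by three decimal numbers, pipe-delimited.
tariff_line = /\A[^|]+\|(\d+\.\d+\|){2}\d+\.\d+\z/

# Each step takes the result of the previous one, like a shell pipeline:
# normalise every row into one pipe-delimited string, then keep tariff lines.
tariffs = rows
  .map { |cells| cells.map(&:strip).reject(&:empty?).join('|') }
  .select { |line| tariff_line.match?(line) }

puts tariffs
```

Only the first sample row survives the filter. The header row and the blank row are dropped because they do not match the pattern, which is the same job the `if` modifier does in the original script.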

There seems to be a lot of hype around Ruby at the moment, mainly driven by Rails. However, the basic language itself is quite exciting in that it seems to be as useful and concise as Perl, whilst having some syntactic advantages that make it more readable and maintainable.