Monthly Archives: December 2010

Getting to Know the Ruby Standard Library – Timeout

This article has been republished on Monkey and Crow.

I asked for suggestions about what to cover next, and postmodern suggested the Timeout library among others. Timeout lets you run a block of code, and ensure it takes no longer than a specified amount of time. The most common use case is for operations that rely on a third party, for instance net/http uses it to make sure that your script does not wait forever while trying to connect to a server:

  def connect
    ...
    timeout(@open_timeout) { TCPSocket.open(conn_address(), conn_port()) }
    ...

You could also use Timeout to ensure that processing a file uploaded by a user does not take too long. For instance if you allow people to upload files to your server, you might want to reject any files that take more than 2 seconds to parse:

  require 'csv'
  require 'timeout'
  
  def read_csv(path)
    begin
      Timeout.timeout(2){ CSV.read(path) }
    rescue Timeout::Error => ex
      puts "File '#{path}' took too long to parse."
      return nil
    end
  end

Let’s take a look at how it works. Open up the Timeout library (you can use qw timeout if you have Qwandry installed) and peek at the timeout method. It is surprisingly short.

  def timeout(sec, klass = nil)   #:yield: +sec+
    return yield(sec) if sec == nil or sec.zero?
    ...

First of all, we can see that if sec is either 0 or nil it just executes the block you passed in, and then returns the result. Next let’s look at the part of Timeout that actually does the timing out:

    ...
    x = Thread.current
    y = Thread.start {
      sleep sec
      x.raise exception, "execution expired" if x.alive?
    }
    return yield(sec)
    ...

We quickly see the secret here is in ruby’s threads. If you’re not familiar with threading, it is more or less one way to make the computer do two things at once. First Timeout stashes the current thread in x. Next it starts up a new thread that will sleep for your timeout period. The sleeping thread is stored in y. While that thread is sleeping, Timeout calls the block passed into timeout. As soon as that block completes, the result is returned. So what about that sleeping thread? When it wakes up it will raise an exception in the original thread, which explains how timeout stops code from running forever, but there is one last piece to the puzzle.

  ...
  ensure
    if y and y.alive?
      y.kill 
      y.join # make sure y is dead.
    end
  end
  ...

At the end of timeout there is an ensure. If you haven’t come across this yet, it is an interesting feature in ruby. ensure will always be run when a method exits, even if there is an exception. In timeout the ensure kills thread y, the sleeping thread, which means that it won’t raise an exception if the block returns, or raises an exception of its own, before the thread wakes up.
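The pieces above can be put together in a small sketch. This is not the library's actual source, just a simplified re-creation of the idea; simple_timeout and SketchTimeoutError are hypothetical names made up for this example (the real library raises Timeout::Error).

```ruby
# Hypothetical error class for this sketch (the real library raises Timeout::Error).
class SketchTimeoutError < RuntimeError; end

# Simplified re-creation of the idea behind timeout, for illustration only.
def simple_timeout(sec)
  return yield(sec) if sec.nil? || sec.zero?

  caller_thread = Thread.current
  watchdog = Thread.start do
    sleep sec
    # Interrupt the thread that is running the block.
    caller_thread.raise SketchTimeoutError, "execution expired" if caller_thread.alive?
  end
  yield(sec)
ensure
  # Runs whether the block returned or raised: stop the watchdog so it cannot fire later.
  if watchdog && watchdog.alive?
    watchdog.kill
    watchdog.join
  end
end

# A block that finishes in time simply returns its value.
fast = simple_timeout(1) { :finished }

# A block that takes too long is interrupted by the watchdog thread.
slow = begin
  simple_timeout(0.1) { sleep 1; :finished }
rescue SketchTimeoutError
  :timed_out
end

puts fast.inspect
puts slow.inspect
```

Note that the early return for a nil or zero timeout still passes through the ensure, which is why the sketch checks that watchdog is set before touching it.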

It turns out that Timeout is a useful little library, and it contains some interesting examples of threading and ensure blocks. If there is any part of the standard library you are curious about or think is worthy of some more coverage, let me know!

Filed under ruby, stdlib

Qwandry 0.1.0 – Now Supporting More Languages

I just finished updating Qwandry so that it can support any number of other languages or packaging systems. Want to use perl, python, or node with Qwandry? No problem:


qw -r python numpy # opens python's numpy library
qw -r perl URI     # open perl's URI library
qw -r node express # open express if it is installed for node

Qwandry will probe these dynamic languages and detect their load paths. This is just the first step towards making code more accessible to people. I would love to hear what you think of it, and if you have any suggestions.

Go ahead and install it with ruby’s package manager:


  gem install qwandry

Warning

If you had customized Qwandry before, this release will break your custom init.rb file. Configuration commands looked like this:


  add 'projects', '~/toys'
  add 'projects', '~/samples'

Now they look slightly different:


  register 'projects' do
    add '~/toys'
    add '~/samples'
  end

Awesome

By wrapping the commands that actually add paths to Qwandry’s search path in a block, we can defer slow operations like probing. Furthermore, we now only need to build up the paths for what you are looking for. By deferring configuration until it is needed, we can add support for any language or package scheme we like without slowing Qwandry down.
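To illustrate why wrapping configuration in a block helps, here is a rough sketch of deferred registration. It is not Qwandry's actual code; LazyRegistry, paths_for, and the paths used are all hypothetical names for this example.

```ruby
# Hypothetical registry, not Qwandry's actual implementation.
class LazyRegistry
  def initialize
    @builders = {}  # name => block that knows how to find the paths
    @paths    = {}  # name => paths, filled in on first use
  end

  # Store the block without running it.
  def register(name, &block)
    @builders[name] = block
  end

  # Run the block for this name only, and only the first time it is needed.
  def paths_for(name)
    @paths[name] ||= begin
      collected = []
      builder = @builders[name]
      builder.call(collected) if builder
      collected
    end
  end
end

probed = []
registry = LazyRegistry.new

registry.register('projects') do |paths|
  probed << 'projects'
  paths << '~/toys' << '~/samples'
end

registry.register('python') do |paths|
  probed << 'python'          # stands in for a slow probe of python's load path
  paths << '/slow/to/probe'
end

puts registry.paths_for('projects').inspect
puts registry.paths_for('projects').inspect  # second call uses the stored result
puts probed.inspect                          # only 'projects' was ever probed
```

Registering a language costs nothing until someone asks for it, which is the property the new configuration format is after.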

So what would you like to see Qwandry support next?

Filed under development, ruby

Getting to Know the Ruby Standard Library – Pathname

This article has been republished on Monkey and Crow.

Pathname is a useful library that demonstrates a good refactoring: “Replace Data Value With Object”. In this case the data value is a String representing a path. Pathname wraps that String and provides a wide variety of methods for manipulating paths that would normally require you to call the File, FileStat, Dir, and IO modules. You might even be using it already without knowing as it shows up in Rails’ paths. First we will see a short example of Pathname in action, and then we will look at some of the patterns it employs.

Example of Pathname

  require 'pathname'
  path = Pathname.new('.') # current directory
  path += 'tests'          # ./tests
  path += 'functional'     # ./tests/functional
  path = path.parent       # ./tests
  path += 'config.yaml'     # ./tests/config.yaml
  path.read                # contents of ./tests/config.yaml
  path.open('w'){|io| io << "env: test"}
  path.read                # "env: test"
  path.parent.children.each{|p| puts p.inspect} # prints all the files/directories in ./tests

Pathname provides a nicer interface for interacting with the filesystem. Now let’s take a look at how it works. As usual, I suggest opening up the file for yourself and following along; if you have Qwandry installed you can type qw pathname.

Pathname

We will start with how a Pathname gets created:

  def initialize(path)
    path = path.__send__(TO_PATH) if path.respond_to? TO_PATH
    @path = path.dup
    ...

The main thing Pathname#initialize does is store a copy of the path argument, while optionally calling TO_PATH on it (we’ll come back to this in a moment). Since strings are mutable in ruby, dup is called on the path argument. This ensures that if you later call path.gsub!('-','_'), or any other method that mutates the string, Pathname’s copy will remain the same. This is a good practice whenever you are dealing with mutable data. Now let’s take a look at TO_PATH:

  if RUBY_VERSION < "1.9"
    TO_PATH = :to_str
  else
    # to_path is implemented so Pathname objects are usable with File.open, etc.
    TO_PATH = :to_path
  end

This code invokes special behavior based on the current RUBY_VERSION. Ruby 1.9 will set TO_PATH to :to_path, and call that in the initializer above if the object being passed in implements to_path. A quick look at the RDocs shows that File implements to_path, so we can pass files directly into Pathname. Now let’s take a look at how Pathname makes use of the rest of ruby’s file libraries.

  def read(*args) 
    IO.read(@path, *args) 
  end

The definition of Pathname#read is quite simple: it just takes the path you passed in and uses it to call IO. So where you might have done IO.read(path), with Pathname you can just do path.read. This pattern is repeated in Pathname for many of the common filesystem operations. For instance, take a look at mtime:

  def mtime() 
    File.mtime(@path) 
  end

We see the same pattern has been repeated, but this time it delegates to File. Since a Pathname may reference a file or a directory, some of the methods will delegate to either Dir or File:

  def unlink()
    begin
      Dir.unlink @path
    rescue Errno::ENOTDIR
      File.unlink @path
    end
  end

First it tries to delete the path as a directory, then as a file. Perhaps a simpler formulation would be directory? ? Dir.unlink @path : File.unlink @path, but the result is the same. This pattern encapsulates knowledge that the caller no longer needs to deal with.
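The patterns covered so far (the defensive copy, the to_path hook, and delegation to IO, File, and Dir) can be tried out in a short script. The Project class below is a hypothetical stand-in for any object that responds to to_path, and the file names are made up for the example.

```ruby
require 'pathname'
require 'tmpdir'

# Hypothetical class, used only to show the to_path hook.
class Project
  def initialize(root)
    @root = root
  end

  def to_path
    @root
  end
end

# 1. Defensive copy: mutating the original string does not change the Pathname.
raw = String.new("my-project")
copied = Pathname.new(raw)
raw.gsub!('-', '_')
puts raw          # my_project
puts copied.to_s  # my-project

# 2. Anything responding to to_path can be wrapped.
wrapped = Pathname.new(Project.new('/srv/app'))
puts wrapped.to_s # /srv/app

# 3. Delegation: read and unlink forward to the underlying file libraries.
Dir.mktmpdir do |dir|
  file = Pathname.new(dir) + 'config.yaml'
  file.open('w') { |io| io << "env: test" }
  puts file.read    # env: test
  file.unlink       # the caller does not need to know this is a file, not a directory
  puts file.exist?  # false
end
```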

Pathname also overrides operators where they make sense, which lets you concatenate paths. Let’s look at how Pathname does this.

  def +(other)
    other = Pathname.new(other) unless Pathname === other
    Pathname.new(plus(@path, other.to_s))
  end

The plus operator is just a method like any other method in ruby, so overriding it is pretty simple. First, the other path being added to this one is converted to a Pathname if it isn’t one already. After that, the paths are combined with plus(@path, other.to_s). This might look rather odd since we just converted other to a Pathname, but remember that Pathname treats anything responding to to_path specially.

Here are some examples of its behavior:

  p = Pathname.new('/usr/local/lib') #=> #<Pathname:/usr/local/lib> 
  p + '/usr/'                        #=> #<Pathname:/usr/> 
  p + 'usr/'                         #=> #<Pathname:/usr/local/lib/usr/>
  p + '../include'                   #=> #<Pathname:/usr/local/include>

Adding an absolute path to an existing path behaves differently from a relative path or a path referencing the parent directory. This obviously has some logic beyond our typical string operators. For the sake of brevity, we can skip the details of how plus is implemented, though if anyone is interested, we can dissect it later. I suggest skimming the rest of pathname.rb, look at how public and private methods are defined, and how they are used to simplify methods.

Overview

Pathname wraps up a lot of functionality that is scattered across multiple libraries by encapsulating that information. Hopefully you have seen how Pathname can be useful, and have also learned a few patterns that will make your code more usable.

Filed under ruby, stdlib

Getting to Know the Ruby Standard Library – Abbrev

This article has been republished on Monkey and Crow.

We’re going to take a look at another little piece of ruby’s standard library, this time it is Abbrev, a tiny library that generates abbreviations for a set of words. We will expand ever so slightly on the one-liner from my last post to show an example of Abbrev in action:

require 'abbrev'
commands = Dir[*ENV['PATH'].split(':').map{|p| p+"/*"}].select{|f| File.executable? f}.map{|f| File.basename f}.uniq.abbrev
commands['ls']   #=> 'ls'
commands['spli'] #=> 'split'
commands['spl']  #=> nil

This will match any command on your path, or any prefix that matches only one command. Combine this with Shellwords, and you have the pieces for an auto completing console. It could also be used for matching rake tasks, tests, or giving suggestions for mistyped methods.

How it Works

So that is what Abbrev does, but how does it work? If you open up the library (qw abbrev if you have Qwandry), you will see that it is pretty small, there’s just one method, and then a helper that extends array.

Starting from the beginning, we see that it takes an array of words and an optional pattern. It stores the abbreviations in table, and keeps track of the occurrences of each abbreviation in seen using the counting idiom I mentioned in Hash Tricks.

def abbrev(words, pattern = nil)
  table = {}
  seen = Hash.new(0)
  ...

The pattern can be a Regexp or a String:

  if pattern.is_a?(String)
    pattern = /^#{Regexp.quote(pattern)}/	# regard as a prefix
  end

If it’s a String, it is converted to a Regexp. Notice that Regexp.quote(pattern) is used so that any characters that have special meanings in regular expressions will get escaped. If this pattern is present, it is used to ignore any abbreviations that don’t match it. Next we see how the abbreviations are generated for each word:

  ...
  words.each do |word|
    next if (abbrev = word).empty?
    while (len = abbrev.rindex(/[\w\W]\z/)) > 0
      abbrev = word[0,len]

      next if pattern && pattern !~ abbrev

      case seen[abbrev] += 1
      when 1
        table[abbrev] = word
      when 2
        table.delete(abbrev)
      else
        break
      end
    end
  end
  ...

The first part of this sets abbrev to the current word, but skips the word if it is blank. The next part of the loop is a little more confusing: what does abbrev.rindex(/[\w\W]\z/) do? It gives you the index of the last character in the String; as far as I can tell in ruby 1.9 this is equivalent to String#length - 1. So the inner while loop is going to use abbrev = word[0,len] to chop off a character each time until only one character is left. The hash seen is incremented by 1 for this prefix. If this is the first time the abbreviation has been seen, then it is recorded as pointing to the word. If this is the second time the abbreviation has been seen, it is removed because it is not unique. If the abbreviation has been seen more than twice, then we know that all the shorter prefixes of this word have also been seen and removed, so the loop exits.

  words.each do |word|
    next if pattern && pattern !~ word

    table[word] = word
  end

  table
end

Finally Abbrev loops through the original words and inserts them. This means that if the array contained “look” and “lookout” they both get added as matches for themselves even though “look” is a prefix of “lookout”.
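To check our reading of the algorithm, here is a from-scratch sketch that produces the same kind of table in a slightly more direct way (count every prefix first, then keep the unique ones). simple_abbrev is a hypothetical helper written for this post, not part of the Abbrev library, and it leaves out the optional pattern.

```ruby
# Hypothetical helper: a simplified take on the idea behind Abbrev, without the pattern argument.
def simple_abbrev(words)
  # Count how many distinct words share each proper prefix.
  counts = Hash.new(0)
  words.uniq.each do |word|
    (1...word.length).each { |len| counts[word[0, len]] += 1 }
  end

  # Keep only the prefixes that point at exactly one word.
  table = {}
  words.each do |word|
    (1...word.length).each do |len|
      prefix = word[0, len]
      table[prefix] = word if counts[prefix] == 1
    end
  end

  # Every full word always matches itself, even if it is a prefix of another word.
  words.each { |word| table[word] = word }
  table
end

table = simple_abbrev(%w[look lookout ls])
puts table['look'].inspect   # "look"
puts table['looko'].inspect  # "lookout"
puts table['lo'].inspect     # nil, since it is shared by look and lookout
puts table['ls'].inspect     # "ls"
```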

So there you have it, ruby’s Abbrev library explained, go forth and shorten words.

Filed under ruby, stdlib

Finding Binaries with Ruby

Here’s a quick one liner in ruby that finds all of the binaries on your PATH:


  Dir[*ENV['PATH'].split(':').map{|p| p+"/*"}].select{|f| File.executable? f}.map{|f| File.basename f}.uniq

How does it work?

Working in the order of execution, we get the PATH from your environment variables. Next we split by : the path separator, and then add "/*" to it. This gives you an array of strings like /usr/bin/*. Passing those into Dir[ ] will get you all the files matching those paths. From there we select all files that are executable, use File.basename to get the filename from the path, and then call uniq to make sure we don’t have any duplicates.
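If the one liner is hard to follow, the same steps can be spread out into a method. This is only a sketch: executables_on_path is a hypothetical name, and it differs slightly from the one liner in that it uses File::PATH_SEPARATOR instead of a hard coded colon and skips directories that happen to be executable.

```ruby
# Hypothetical helper that spells out the one liner step by step.
def executables_on_path(path = ENV.fetch('PATH', ''))
  # Turn "/usr/bin:/bin" into ["/usr/bin/*", "/bin/*"].
  patterns = path.split(File::PATH_SEPARATOR)
                 .reject(&:empty?)
                 .map { |dir| File.join(dir, '*') }

  Dir[*patterns]
    .select { |file| File.file?(file) && File.executable?(file) } # keep executable files only
    .map { |file| File.basename(file) }                           # strip the directory
    .uniq                                                         # drop duplicates
end

puts executables_on_path.first(5)
```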

This might not be something you do every day, but it does show off a number of ruby’s File and Enumerable methods.

Filed under ruby

Hash Tricks

Ruby’s Hash can accept a block when you initialize it. The block is called any time you attempt to access a key that is not present. Hash’s initialization block expects the following format:

Hash.new{|hash, key| ... }

The hash references the hash itself, and the key parameter is the missing key. With this, you can initialize default values in the hash before they get accessed. Here are a few interesting things you can do with a hash’s initialization block.

By setting the value to an array, you can easily group items in a list:

groups = Hash.new{|h,k| h[k] = [] }
list   = ["cake", "bake", "cookie", "car", "apple"]

# Group by string length:
list.each{|v| groups[v.length] << v}
groups #=> {4=>["cake", "bake"], 6=>["cookie"], 3=>["car"], 5=>["apple"]}

Setting the value to 0 is a good way to count the occurrences of various items in a list:

counts = Hash.new{|h,k| h[k] = 0 }
list   = ["cake", "cake", "cookie", "car", "cookie"]

# Count the occurrences of each item:
list.each{|v| counts[v] += 1 }
counts #=> {"cake"=>2, "cookie"=>2, "car"=>1}

Or if you return hashes that return hashes, you can build a tree structure:

tree_block = lambda{|h,k| h[k] = Hash.new(&tree_block) }
opts = Hash.new(&tree_block)
opts['dev']['db']['host'] = "localhost:2828"
opts['dev']['db']['user'] = "me"
opts['dev']['db']['password'] = "secret"
opts['test']['db']['host'] = "localhost:2828"
opts['test']['db']['user'] = "test_user"
opts['test']['db']['password'] = "test_secret"
opts #=> {"dev"=>
           {"db"=>{"host"=>"localhost:2828", "user"=>"me", "password"=>"secret"}}, 
          "test"=>
            {"db"=>{"host"=>"localhost:2828", "user"=>"test_user", "password"=>"test_secret"}}
          }

A block can also be used to create a caching layer:

require 'net/http'
http = Hash.new{|h,k| h[k] = Net::HTTP.get_response(URI(k)).body }
http['http://www.google.com'] # makes a request
http['http://www.google.com'] # returns cached value

In ruby 1.9 hashes are ordered so you can make the cache a fixed length, and evict old values:

http = Hash.new{|h,k| 
  body = h[k] = Net::HTTP.get_response(URI(k)).body 
  h.delete(h.keys.first) if h.length > 3
  body # return the fetched body, not the result of the eviction
}
http['http://www.google.com']
http['http://www.yahoo.com']
http['http://www.bing.com']
http['http://www.reddit.com'] # this evicts http://www.google.com
http.keys #=> ["http://www.yahoo.com", "http://www.bing.com", "http://www.reddit.com"]

You can also use it to compute recursive functions:

factorial = Hash.new do |h,k| 
  if k > 1
    h[k] = h[k-1] * k
  else
    h[k] = 1
  end
end

This will cache each result, so if you have computed part of a number’s factorial, it won’t need to compute it again. For instance, factorial[4] will compute the values for 1, 2, and 3, and then if you call factorial[3] it will already have the result. This is a somewhat contrived use, but it’s interesting nonetheless.
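Here is a quick way to watch the cache fill up. The hash is the same idea as above written more compactly, and the printed keys show which values were computed along the way.

```ruby
# Same memoizing hash as above, condensed onto one line.
factorial = Hash.new { |h, k| h[k] = k > 1 ? h[k - 1] * k : 1 }

puts factorial[4]            # 24
puts factorial.keys.inspect  # [1, 2, 3, 4]
puts factorial[3]            # 6, read straight from the cache
puts factorial[6]            # 720, only 5 and 6 need computing
puts factorial.keys.inspect  # [1, 2, 3, 4, 5, 6]
```

Keep in mind that each missing key recurses one level deeper, so asking for a very large number on an empty cache can overflow the stack.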

As you can see, the default block for a Hash has a lot of interesting uses. Are there any that you find particularly useful?

Filed under ruby

Getting to Know the Ruby Standard Library – TSort

This article has been republished on Monkey and Crow.

TSort is an interesting part of the ruby standard library that performs topological sorting. Topological sorting is used in package management, analyzing source code, and evaluating prerequisites. RubyGems uses TSort to determine what order to install gems in when there are multiple dependencies. Let’s take a look at TSort in action.

Here is an example of a JobRunner which will take a bunch of tasks with prerequisites, and then tell you which order they should be performed in:

require 'tsort'

class JobRunner
  include TSort
  
  Job = Struct.new(:name, :dependencies)
  def initialize()
    # A plain Hash: looking up a missing job must not add keys while TSort iterates.
    @jobs = {}
  end

  alias_method :execute, :tsort
  
  def add(name, dependencies=[])
    @jobs[name] = dependencies
  end
  
  def tsort_each_node(&block)
    @jobs.each_key(&block)
  end
  
  def tsort_each_child(node, &block)
    # Jobs that were never added (such as 'buy eggs') have no prerequisites.
    @jobs.fetch(node, []).each(&block)
  end
end

if __FILE__ == $0
  runner = JobRunner.new
  runner.add('breakfast', ['serve'])
  runner.add('serve', ['cook'])
  runner.add('cook', ['buy eggs','buy bacon'])
  puts runner.execute
end

Running this file will show that you need to buy the eggs and bacon before you can cook and then serve breakfast.


  buy eggs
  buy bacon
  cook
  serve
  breakfast

Hardly an achievement of complex reasoning, but as the number of prerequisites grows, this becomes vastly more useful.

TSort

Let’s take a look at the source (qw tsort if you have Qwandry). The first thing to notice is that TSort is a module instead of a class. This means that instead of extending it, or calling it directly, we will include it in another class. Notice that in the JobRunner above we called include TSort at the very beginning. This means our class will now include TSort’s functionality, but in order for it to work we need to implement a few methods. From the TSort rdoc:

  TSort requires two methods to interpret an object as a graph,
  tsort_each_node and tsort_each_child.
  
  * tsort_each_node is used to iterate for all nodes over a graph.
  * tsort_each_child is used to iterate for child nodes of a given node.
  

Our implementation of tsort_each_node iterates over each job, while tsort_each_child will iterate over each prerequisite for the job. These methods allow TSort to provide two useful methods. The first is strongly_connected_components, which will return each node or set of nodes forming a circular dependency. The second is tsort, which we used above to return the nodes in a sorted order so that each of their prerequisites is satisfied. Let’s take a look at what each of these does:

  def tsort
    result = []
    tsort_each {|element| result << element}
    result
  end

This is simple enough: it just collects all the results from tsort_each and returns them. Following this thread we see that tsort_each then iterates over the results of each_strongly_connected_component, so let’s take a closer look at that, and perhaps we can figure out how TSort works.

  def each_strongly_connected_component # :yields: nodes
    id_map = {}
    stack = []
    tsort_each_node {|node|
      unless id_map.include? node
        each_strongly_connected_component_from(node, id_map, stack) {|c|
          yield c
    ...

From this snippet of code we can see that TSort is going to iterate over each of the nodes (jobs in our case). While doing this it will keep track of the position each node has in the stack with the id_map hash. Next let’s see what each_strongly_connected_component_from does with the node.

def each_strongly_connected_component_from(node, id_map={}, stack=[])
  minimum_id = node_id = id_map[node] = id_map.size
  stack_length = stack.length
  stack << node
  ...

We can see that id_map and stack are passed around to keep track of what has been seen already. Jumping to the end of the method, we see that when we finish, TSort is going to yield back an array of nodes:

  if node_id == minimum_id
    component = stack.slice!(stack_length .. -1)
    component.each {|n| id_map[n] = nil}
    yield component
  end

We know from our example above that this eventually returns all the nodes in their proper order, and tsort_each expects the arrays to have a length of 1 so we can assume that if everything goes correctly stack.slice!(stack_length .. -1) should return an array with a length of 1. After it returns this array, it clears the entries from the id_map. We can deduce that TSort sorts the nodes by pushing items onto the stack, and then returning the top of it when there are no remaining prerequisites for the node.

Now let’s look at the main part of this method:

  ...
  tsort_each_child(node) {|child|
    if id_map.include? child
      child_id = id_map[child]
      minimum_id = child_id if child_id && child_id < minimum_id
    else
      sub_minimum_id =
        each_strongly_connected_component_from(child, id_map, stack) {|c|
          yield c
        }
      minimum_id = sub_minimum_id if sub_minimum_id < minimum_id
    end
  }
  ...

For each child, we first check whether we have already evaluated the node. If we have, and its index in the stack is less than our current minimum_id, we treat it as the new minimum. Otherwise we recursively call each_strongly_connected_component_from for this node, which will push the child onto the stack and push its prerequisites onto the stack as well. Once there are no more nodes left to push on, the loop exits and the node will get popped off. If there is a cycle (nodes that depend on each other), all the nodes in the cycle will be returned. If this is starting to look familiar to you, this essentially amounts to a depth first search of all the nodes.
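To see the cycle handling described above in action, here is a small sketch. DependencyGraph is a hypothetical class made up for this example; it only implements the two methods TSort asks for.

```ruby
require 'tsort'

# Hypothetical wrapper around a Hash of node => prerequisites.
class DependencyGraph
  include TSort

  def initialize(edges)
    @edges = edges
  end

  def tsort_each_node(&block)
    @edges.each_key(&block)
  end

  def tsort_each_child(node, &block)
    # Nodes that only appear as prerequisites have no children of their own.
    @edges.fetch(node, []).each(&block)
  end
end

# Without cycles, every component holds a single node and tsort succeeds.
acyclic = DependencyGraph.new({ 'breakfast' => ['cook'], 'cook' => ['buy eggs'] })
puts acyclic.tsort.inspect

# 'a' and 'b' depend on each other, so they come back together as one component.
cyclic = DependencyGraph.new({ 'a' => ['b'], 'b' => ['a'], 'c' => ['a'] })
components = cyclic.strongly_connected_components
puts components.inspect

# tsort refuses to order a graph that contains a cycle.
cycle_error = begin
  cyclic.tsort
  nil
rescue TSort::Cyclic => e
  e.message
end
puts cycle_error
```

Checking strongly_connected_components for any component with more than one node is a handy way to report which jobs are tangled together before attempting the sort.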

Recap

Our exploration of TSort has revealed a useful library, and shown us one way to perform a depth first search on a data structure that may include cycles, which is pretty neat. Curious about other applications of TSort? The strongly_connected_components that TSort returns also happen to be useful for detecting tightly coupled methods and classes that depend on each other, which helps when refactoring or doing static analysis of source code. This is just one of the handy things we can do with ruby’s standard library.

Filed under ruby, stdlib

Getting to Know the Ruby Standard Library – MiniTest::Mock

This article has been republished on Monkey and Crow.

Recently we looked at MiniTest, this time around we’re going to dive into MiniTest::Mock, a tiny library that will let you test systems that would otherwise be very difficult to test. We will take a look at what MiniTest::Mock provides, and then how it works.

A MiniTest::Mock Example

If you’re not familiar with Mock objects in general, Wikipedia has a nice article on them. Let’s imagine that we want to write a script that deletes any email messages that are more than a week old:

  class MailPurge
    def initialize(imap)
      @imap = imap
    end
  
    def purge(date)
      # IMAP wants dates in the format: 8-Aug-2002
      formatted_date = date.strftime('%d-%b-%Y')
    
      @imap.authenticate('LOGIN', 'user', 'password')
      @imap.select('INBOX')

      message_ids = @imap.search(["BEFORE #{formatted_date}"])
      @imap.store(message_ids, "+FLAGS", [:Deleted])
    end
  end

We want to make sure that MailPurge only deletes the messages the imap server says are old enough. Testing this will be problematic for a number of reasons. Our script is going to be slow if it has to communicate with the server, and it has the permanent side effect of deleting your email. Luckily we can drop a mock object in to replace the imap server. We need to make a list of all the interactions our code has with the imap server so that we can fake that part of the server. We can see our script will call authenticate, select, search, and store, so our mock should expect each call, and have a reasonable response.

  def test_purging_mail
    date = Date.new(2010,1,1)
    formatted_date = '01-Jan-2010'
    ids = [4,5,6]
    
    mock = MiniTest::Mock.new
    
    # mock expects:
    #            method      return  arguments
    #-------------------------------------------------------------
    mock.expect(:authenticate,  nil, ['LOGIN', 'user', 'password'])
    mock.expect(:select,        nil, ['INBOX'])
    mock.expect(:search,        ids, [["BEFORE #{formatted_date}"]])
    mock.expect(:store,         nil, [ids, "+FLAGS", [:Deleted]])
    
    mp = MailPurge.new(mock)
    mp.purge(date)
    
    assert mock.verify
  end

We call MiniTest::Mock.new to create the mock object. Next we set up the mock’s expectations. Each expectation has a return value and an optional set of arguments it expects to receive. You can download this file and try it out (don’t worry it won’t actually delete your email). The MailPurge calls our fake imap server, and in fact does delete the message ids the server sends back in response to the @imap.search. Finally, we call verify which asserts that MailPurge made all the calls we expected.

How it Works

Let’s dive into the source; if you have Qwandry you can open it with qw minitest. Looking at mock.rb you will see that MiniTest::Mock is actually quite short. First let’s look at initialize.

def initialize
  @expected_calls = {}
  @actual_calls = Hash.new {|h,k| h[k] = [] }
end

We can see that Mock will keep track of which calls were expected, and which ones were actually called. There is a neat trick in here with the Hash.new {|h,k| h[k] = [] }. If a block is passed into Hash.new, it will get called any time there is a hash miss. In this case any time you fetch a key that isn’t in the hash yet, an array will be placed in that key’s spot. This comes in handy later.

Next let’s look at how expect works:

def expect(name, retval, args=[])
  n, r, a = name, retval, args # for the closure below
  @expected_calls[name] = { :retval => retval, :args => args }
  self.class.__send__(:define_method, name) { |*x|
    raise ArgumentError unless @expected_calls[n][:args].size == x.size
    @actual_calls[n] << { :retval => r, :args => x }
    retval
  }
  self
end

This looks dense, but if you take a moment, it’s straightforward. As we saw in the example above, expect takes the name of the method to expect, a value it should return, and the arguments it should see. Those parameters get recorded into the hash of @expected_calls. Next comes the tricky bit: MiniTest::Mock uses define_method on the mock’s class to create a new method that verifies the correct number of arguments were passed. The generated method also records that it’s been called in @actual_calls. Since @actual_calls was defined to return an array for a missing key, it can just append to whatever the hash returns. So expect dynamically builds up your mock object.

The final part of Mock makes sure that it did everything you expected:

def verify
  @expected_calls.each_key do |name|
    expected = @expected_calls[name]
    msg = "expected #{name}, #{expected.inspect}"
    raise MockExpectationError, msg unless
      @actual_calls.has_key? name and @actual_calls[name].include?(expected)
  end
  true
end

We can see here that verify will check each of the @expected_calls and make sure that it was actually called. If any of the expected methods aren’t called, it will raise an exception and your test will fail. Now you can build mock objects and make sure that your code is interacting the way you expect it to.
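To make the moving parts concrete, here is a stripped down sketch of the same idea. TinyMock and its ExpectationError are hypothetical names invented for this post, not MiniTest's classes, and unlike the source above the sketch defines each expected method on the individual mock with define_singleton_method.

```ruby
# Hypothetical miniature mock, loosely following the structure walked through above.
class TinyMock
  class ExpectationError < StandardError; end

  def initialize
    @expected = {}
    @actual = Hash.new { |h, k| h[k] = [] }  # a missing key gets an empty array
  end

  def expect(name, retval, args = [])
    @expected[name] = { retval: retval, args: args }
    # Build the expected method on the fly, for this mock only.
    define_singleton_method(name) do |*given|
      raise ArgumentError unless given.size == args.size
      @actual[name] << { retval: retval, args: given }
      retval
    end
    self
  end

  def verify
    @expected.each do |name, call|
      unless @actual.key?(name) && @actual[name].include?(call)
        raise ExpectationError, "expected #{name}, #{call.inspect}"
      end
    end
    true
  end
end

mock = TinyMock.new
mock.expect(:search, [4, 5, 6], [['BEFORE 01-Jan-2010']])
ids = mock.search(['BEFORE 01-Jan-2010'])
puts ids.inspect
puts mock.verify

# An expectation that is never met makes verify raise.
unmet = TinyMock.new.expect(:select, nil, ['INBOX'])
failure = begin
  unmet.verify
rescue TinyMock::ExpectationError => e
  e.message
end
puts failure
```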

You should be aware though that MiniTest::Mock does not have many of the features that much larger libraries such as mocha do. For instance it does not let you set up expectations on existing objects, and requires you to specify all the arguments which can be cumbersome.

So we have dived into another piece of ruby’s standard library and found some more useful functionality. Hopefully along the way you have learned some uses for mocking, and a neat trick with ruby’s Hash.

Filed under ruby, stdlib, testing

Getting Help Inside IRB

Here’s a quick tip: ruby’s ri utility will look up documentation about a method. For instance you can type ri String#split to see the documentation for String’s instance method split. If you have an irb session open you can tell irb to shell out using backticks like this:


  ruby-1.9.1-p378 > puts `ri String#split`

You can also make a little helper method like this:


  def ri(signature)
    puts `ri #{signature}`
  end

Filed under Uncategorized

Getting to Know the Ruby Standard Library – Shellwords

This article has been republished on Monkey and Crow.

Previously we answered a few questions about Minitest, and learned a little about exit hooks and introspection in ruby. Now let’s look at an often overlooked library, Shellwords. Shellwords lets you break up a string the same way the Bourne shell does. So again, we will try to answer a few questions:

  1. What does Shellwords do?
  2. How does Shellwords break up the input?

Overview

Before diving into the code, let’s look at some examples of what Shellwords does. Let’s open up irb and try a few things out.

require 'shellwords'
Shellwords.split "search for 'some word'"
#=> ["search", "for", "some word"] 
Shellwords.split "search for 'some \"word\"'"
#=> ["search", "for", "some \"word\""]

As you can see, Shellwords splits up the input while respecting quoting. It is fairly strict though:

Shellwords.split "georgia o'keefe"
#=> ArgumentError: Unmatched double quote: "georgia o'keefe"
Shellwords.split "artist is \"georgia o'keefe\""
#=> ["artist", "is", "georgia o'keefe"]

So how could you use it? It’s good for tokenizing tags if you want to allow them to be more than one word. It could also come in handy if you wanted to make a mini language for scripting things:

instructions = <<-END
  Activate Timmy
  Say "Hello world!"
  Wave
END

instructions.lines.each do |line|
  tokens = Shellwords.split(line)
  ...
end

It wouldn’t be the first little language made in ruby.

Shellwords

Now we have a rough idea of what it does, how does it work? To start with, take a look at the source (qw shellwords if you have Qwandry installed). We immediately come across:

module Shellwords
  ...
  def shellsplit(line)
    words = []
    field = ''
    line.scan(/\G\s*(?>([^\s\\\'\"]+)|'([^\']*)'|"((?:[^\"\\]|\\.)*)"|(\\.?)|(\S))(\s|\z)?/m) do
      |word, sq, dq, esc, garbage, sep|
      raise ArgumentError, "Unmatched double quote: #{line.inspect}" if garbage
      field << (word || sq || (dq || esc).gsub(/\\(?=.)/, ''))
      if sep
        words << field
        field = ''
      end
    end
    words
  end

Without jumping into the hefty regular expression, we can see this is using String#scan to repeatedly match a pattern against the input string. From there it’s building up field and then whenever there is a separator (sep) it adds that word to the list of tokenized words it will return at the end.

Now I suggest looking at that regular expression and quavering with fear. Tremors aside, we can break it down:

\G                          # Start of match attempt
\s*                         # Some optional whitespace
(?>                         # A non-capturing (atomic) group
  ([^\s\\\'\"]+)            # word:     Something without spaces, quotes, or escapes
  |'([^\']*)'               # sq:       Something in single quotes
  |"((?:[^\"\\]|\\.)*)"     # dq:       Something in double quotes
  |(\\.?)                   # esc:      An escaped character 
  |(\S)                     # garbage:  Anything that doesn't match the pattern
)                           #
(\s|\z)?                    # sep:      A space or the end of the line

We start with \G, which I had to look up. It anchors each match to the position where the previous match ended, but in my experiments I couldn’t find an input where it made a difference for this particular pattern in String#scan. If anyone can give a ruby example where this makes a difference, leave me a comment. The next chunk will gobble any leading whitespace. After that comes (?>...), an atomic group. It groups parts of a regexp together without capturing them (and also stops the engine from backtracking into the group), which is handy if you just want to keep some parts of a regexp together. For example (and it’s a poor example):

"plus 123".match(/(plus|minus)\s(\d+)/)
# => #<MatchData "plus 123" 1:"plus" 2:"123">
"plus 123".match(/(?>plus|minus)\s(\d+)/)
# => #<MatchData "plus 123" 1:"123">

So that first group in shellsplit is just to keep everything inside it together. The pipe between each of these groups says that the regular expression will just pick one of them. We can see that each of the capturing groups will be yielded to the block, while the outer non-capturing group is ignored. So each time the block is called, one of word, sq, dq, esc, or garbage will be filled in with a value. This is largely how the different quoting and escaping rules are implemented in Shellwords.split.
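As for \G, its effect is easier to see with a simpler pattern than the one in shellsplit. The sketch below uses a made up input string: without \G, String#scan skips over text it cannot match and carries on, while with \G every match must begin exactly where the previous one ended, so scanning stops at the first gap.

```ruby
text = "ab1cd"

# Without \G, scan skips the "1" and keeps finding letters after it.
unanchored = text.scan(/[a-z]/)
puts unanchored.inspect  # ["a", "b", "c", "d"]

# With \G, the match after "b" would have to start at "1", so scanning stops there.
anchored = text.scan(/\G[a-z]/)
puts anchored.inspect    # ["a", "b"]
```

In shellsplit the garbage alternative already matches any leftover character, which may be why removing \G appears to make no difference there.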

It’s interesting to notice the esc pattern, it’s going to gobble up anything following a ‘\’. This doesn’t seem quite right, an experiment shows how this plays out:

Shellwords.split('I will escape\ these \\bonds')
# => ["I", "will", "escape these", "bonds"]

Is this intentional? Running this little script from bash gives us a different result:

ruby -e 'puts ARGV.inspect' I will escape\ these \\bonds
["I", "will", "escape these", "\\bonds"]

So maybe we found a bug? To be honest I could not really say. Another interesting part is the garbage capture. It should only be matched by something that isn’t whitespace and doesn’t match any of the other options. If Shellwords encounters this, it will raise an exception, so this is how Shellwords ensures valid inputs. The last part will match a space or the end of the line, and tells Shellwords where to end this chunk of input.

That wasn’t so bad now was it? While you have shellwords.rb open, perhaps you should look around. As of ruby 1.9, there is a bonus method waiting for you.

Recap

So we set off to learn about another part of the ruby standard library. Along the way we learned how it works, and saw that we could break down a rather obscure regular expression. If you were appropriately curious you may have also found Shellwords.escape, and the String#shellsplit shortcut. So once again, reading the source is pretty neat.
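For the curious, here is a short tour of those bonus methods. The strings are made up for the example; the methods themselves (Shellwords.escape, String#shellsplit, and Array#shelljoin) come from shellwords.rb.

```ruby
require 'shellwords'

title = "georgia o'keefe"

# escape backslash-quotes anything the shell would treat specially.
escaped = Shellwords.escape(title)
puts escaped                            # georgia\ o\'keefe

# Splitting the escaped string gives back the original as a single token.
puts Shellwords.split(escaped).inspect  # ["georgia o'keefe"]

# String#shellsplit is a shortcut for Shellwords.split.
tokens = "search for 'some word'".shellsplit
puts tokens.inspect                     # ["search", "for", "some word"]

# Array#shelljoin escapes each element and joins them with spaces.
command = ['ls', '-l', 'my file.txt'].shelljoin
puts command                            # ls -l my\ file.txt
```

Escaping like this is worth reaching for any time you interpolate user supplied text into a shell command.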

Filed under ruby, stdlib