Getting to Know the Ruby Standard Library – Abbrev

This article has been republished on Monkey and Crow.

We’re going to take a look at another little piece of ruby’s standard library, this time it is Abbrev, a tiny library that generates abbreviations for a set of words. We will expand ever so slightly on the one-liner from my last post to show an example of Abbrev in action:

require 'abbrev'
commands = Dir[*ENV['PATH'].split(':').map{|p| p+"/*"}].select{|f| File.executable? f}.map{|f| File.basename f}.uniq.abbrev
commands['ls']   #=> 'ls'
commands['spli'] #=> 'split'
commands['spl']  #=> nil

This will match anything on your path, or any substring that will match only one longer string. Combine this with Shellwords, and you have the pieces for an auto completing console. It could also be used for matching rake tasks, tests, or giving suggestions for mistyped methods.

How it Works

So that is what Abbrev does, but how does it work? If you open up the library (qw abbrev if you have Qwandry), you will see that it is pretty small, there’s just one method, and then a helper that extends array.

Starting from the beginning, we see that it takes an array of words and an optional pattern. It stores the abbreviations in table, and tracks of occurrences of each abbreviation in seen using the counting idiom I mentioned in Hash Tricks.

def abbrev(words, pattern = nil)
  table = {}
  seen = Hash.new(0)
  ...

The pattern can be a RegularExpression or a String:

  if pattern.is_a?(String)
    pattern = /^#{Regexp.quote(pattern)}/	# regard as a prefix
  end

If it’s a String, it is converted to a RegularExpression. Notice that Regexp.quote(pattern) is used so that any characters that have special meanings as RegularExpressions will get escaped. If this pattern is present, it is used to ignore any abbreviations that don’t match it. Next we see how the abbreviations are generated for each word:

  ...
  words.each do |word|
    next if (abbrev = word).empty?
    while (len = abbrev.rindex(/[\w\W]\z/)) > 0
      abbrev = word[0,len]

      next if pattern && pattern !~ abbrev

    	case seen[abbrev] += 1
    	when 1
    	  table[abbrev] = word
    	when 2
    	  table.delete(abbrev)
    	else
    	  break
    	end
    end
  end
  ...

The first part of this sets the current word to abbrev, but skips the word if it is blank. The next part of the loop is a little more confusing, what does abbrev.rindex(/[\w\W]\z/) do? It gives you the index of the last character in the String, as far as I can tell in ruby 1.9 this is equivalent to String#length - 1. So the inner while loop is going to use abbrev = word[0,len] to chop off a character each time until the String is empty. The hash seen is incremented by 1 for this substring. If this is the first time the word has been seen, then the word is recorded. If this is the second time the word has been seen, the word is removed because it is not unique. If the word has been seen more than twice, then not only has this word been seen, but we know that all the substrings of this word have been seen and removed, so the loop exits.

  words.each do |word|
    next if pattern && pattern !~ word

    table[word] = word
  end

  table
end

Finally Abbrev loops through the original words and inserts them. This means that if the array contained “look” and “lookout” they both get added as matches for themselves even though “look” is a substring of “lookout”.

So there you have it, ruby’s Abbrev library explained, go forth and shorten words.

5 Comments

Filed under ruby, stdlib

5 responses to “Getting to Know the Ruby Standard Library – Abbrev

  1. Pingback: Getting to Know the Ruby Standard Library – Abbrev | Ruby Here Blog

  2. Leomayleomay

    The “when 2” and “else” branches in the case-statement can be merged into one as:

        
          else:
            table.delete(abbrev) if table.has_key?(abbrev)
            break
          end
        
      

    That’s my thought on this to make it easier to be understood.

    • Adam Sanderson

      The three case statements are actually pretty important.

      If the first word you come across is: “looks”, then the table will have the keys “l”, “lo”, “loo”, and “look”. If you come across “looks” next, then in the current implementation each of “l”, “lo”, and “loo” will have a count of 2, and get removed. Later if you have the work “loot”, it will know to break the loop as soon as “loo” is seen since all the substrings for “loo” have already been removed.

      So if you merge case 2 and the else, then “l” and “lo” would still be in the table.

  3. trans

    This seems very inefficient. If we have more then a few words the table will be quite big. I think a better solution would be to return a object that dynamically handles the lookup –basically encapsulating:

      words.find{ |word| /^#{match}/ =~ word }
    

    Am I missing something that #abbrev can do that a bit more robust rendition of the above cannot?

    • Adam Sanderson

      If you wanted to get the exact same results, you would have to use words.select, and only return the results if there is a single match.

      I could imagine that if you keep the hash around for a while, it might eventually be more efficient than doing repeated searches.

      If you’re curious, the only place I have found that uses of abbrev is in ‘rdoc/ri/display.rb’.