Getting to Know the Ruby Standard Library – Shellwords

This article has been republished on Monkey and Crow.

Previously we answered a few questions about Minitest, and learned a little about exit hooks and introspection in ruby. Now lets look at an often overlooked library, Shellwords. Shellwords lets you break up a string the same way the Bourne shell does. So again, we will try to answer a few questions:

  1. What does Shellwords do?
  2. How does Shellwords break up the input?

Overview

Before diving into the code, lets look at some examples of what Shellwords does. Lets open up irb and try a few things out.

require 'shellwords'
Shellwords.split "search for 'some word'"
#=> ["search", "for", "some word"] 
Shellwords.split "search for 'some \"word\"'"
#=> ["search", "for", "some \"word\""]

As you can see, Shellwords splits up the input while respecting quoting. It is fairly strict though:

Shellwords.split "georgia o'keefe"
#=> ArgumentError: Unmatched double quote: "georgia o'keefe"
Shellwords.split "artist is \"georgia o'keefe\""
["artist", "is", "georgia o'keefe"]

So how could you use it? It’s good for tokenizing tags if you want to allow them to be more than one word. It could also come in handy if you wanted to make a mini language for scripting things:

instructions <<- END
  Activate Timmy
  Say "Hello world!"
  Wave
end

instructions.lines.each do |line|
  tokens = Shellwords.split(line)
  ...
end

It wouldn’t be the first little language made in ruby.

Shellwords

Now we have a rough idea of what it does, how does it work? To start with, take a look at the source (qw shellwords if you have Qwandry installed). We immediately come across:

module Shellwords
  ...
  def shellsplit(line)
    words = []
    field = ''
    line.scan(/\G\s*(?>([^\s\\\'\"]+)|'([^\']*)'|"((?:[^\"\\]|\\.)*)"|(\\.?)|(\S))(\s|\z)?/m) do
      |word, sq, dq, esc, garbage, sep|
      raise ArgumentError, "Unmatched double quote: #{line.inspect}" if garbage
      field << (word || sq || (dq || esc).gsub(/\\(?=.)/, ''))
      if sep
        words << field
        field = ''
      end
    end
    words
  end

Without jumping into the hefty regular expression, we can see this is using String#scan to repeatedly match a pattern against the input string. From there it’s building up field and then whenever there is a separator (sep) it adds that word to the list of tokenized words it will return at the end.

Now I suggest looking at that regular expression and quavering with fear. Tremors aside, we can break it down:

\G                          # Start of match attempt
\s*                         # Some optional whitespace
(?>                         # A non matching group
  ([^\s\\\'\"]+)            # word:     Something without spaces, quotes, or escapes
  |'([^\']*)'               # sq:       Something in single quotes
  |"((?:[^\"\\]|\\.)*)"     # dq:       Something in double quotes
  |(\\.?)                   # esc:      An escaped character 
  |(\S)                     # garbage:  Anything that doesn't match the pattern
)                           #
(\s|\z)?                    # sep:      A space or the end of the line

We start with \G, which I had to look up. This apparently means continue from the last match, but in my experiments I couldn’t find an instance where it made a difference in String#scan. If anyone can give a ruby example where this makes a difference, leave me a comment. The next chunk will gobble any leading whitespace. After that comes a non matching group, this is handy if you just want to group some parts of a regexp together, for example (and it’s a poor example):

"plus 123".match(/(plus|minus)\s(\d+)/)
# => <MatchData "plus 123" 1:"plus" 2:"123">
"plus 123".match(/(?>plus|minus)\s(\d+)/)
# => <MatchData "plus 123" 1:"123">

So that first group in the shellsplit is just to keep everything inside it together. The pipe between each of these groups says that the regular expression will just pick one of them. We can see that each of these groups will be yielded to the block, while the non matching group is ignored. So each time the block is called word, sq, dq, esc, or garbage will be filled in with a value. This is largely how the different quoting and escaping rules are implemented in Shellwords.split.

It’s interesting to notice the esc pattern, it’s going to gobble up anything following a ‘\’. This doesn’t seem quite right, an experiment shows how this plays out:

Shellwords.split('I will escape\ these \\bonds')
# => ["I", "will", "escape these", "bonds"]

Is this intentional? Running this little script from bash gives us a different result:

ruby -e 'puts ARGV.inspect' I will escape\ these \\bonds
["I", "will", "escape these", "\\bonds"]

So maybe we found a bug? To be honest I could not really say. Another interesting part is the garbage capture. It should only be matched by something that isn’t whitespace and doesn’t match any of the other options. If Shellwords encounters this, it will raise an exception, so this is how Shellwords ensures valid inputs. The last part will match a space or the end of the line, and tells Shellwords where to end this chunk of input.

That wasn’t so bad now was it? While you have shellwords.rb open, perhaps you should look around. As of ruby 1.9, there is a bonus method waiting for you.

Recap

So we set off to learn about another part of the ruby standard library. Along the way we learned how it works, and saw that we could break down a rather obscure regular expression. If you were appropriately curious you may have also found Shellwords.escape, and the String#shellwords shortcut. So once again, reading the source is pretty neat.

5 Comments

Filed under ruby, stdlib

5 responses to “Getting to Know the Ruby Standard Library – Shellwords

  1. Great to have such a useful library in the standard-lib, no wonder its so small</irony>>

  2. Great ! I really appreciate this kind of article !

    Thanks a lot !

  3. Pingback: Getting to Know the Ruby Standard Library – Abbrev | End of Line

  4. Hey Adam, great writeup!

    I just wished to mention that you can do:

    'a b c'.shellsplit # => [ "a", "b", "c" ]
    

    Once you require the shellwords module in your source code.

  5. Just be careful – Shellwords.escape doesn’t care about multibyte chars. Sometime this can be an issue.