Archive for Ruby

Capture standard output in Ruby

Recently, I wrote a little Ruby script in which I wanted to capture the data that is sent to standard output. To be clear, this is not about capturing output from a subprocess. What I wanted was to call a method and redirect everything this method writes to standard output into a string buffer. It took me a while to figure out how this can be done, so I thought I had better write it down. Maybe it is helpful for someone else, too.

Ruby has two ways to access standard output. First, there is a global constant STDOUT that refers to an IO object. You can print data to standard output by sending for example a puts message to STDOUT.

STDOUT.puts "Hello, World!"
=> Hello, World!
(This code is rarely seen in Ruby programs. The Kernel module also provides methods like puts and print which delegate to standard output, so one can simply write puts "Hello, World!" instead of STDOUT.puts "Hello, World!".)

The second way to access standard output is through the global variable $stdout. Both STDOUT and $stdout refer to the same IO object, so you can substitute STDOUT with $stdout in the example above. The difference, however, is that STDOUT is a global constant and $stdout is a variable. This means that you cannot assign a new value to STDOUT (well, you can, but Ruby will issue a warning and you will burn in programmer’s hell forever if you ignore it). What you can do is assign another value to $stdout. We will see in a moment why this is important.
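A minimal sketch of that difference. The method name same_object_during_swap? is made up for illustration; it swaps $stdout for a StringIO and checks whether the variable and the constant still point at the same object.

```ruby
require 'stringio'

# Hypothetical helper: temporarily points $stdout at a StringIO and reports
# whether $stdout and STDOUT still refer to the same object during the swap.
def same_object_during_swap?
  original = $stdout
  $stdout = StringIO.new        # allowed: $stdout is an ordinary global variable
  $stdout.equal?(STDOUT)        # STDOUT still names the original IO object
ensure
  $stdout = original            # always put the previous stream back
end

puts "same object at startup:      #{$stdout.equal?(STDOUT)}"
puts "same object during the swap: #{same_object_during_swap?}"
```

In a plain script the first line should report true and the second false: reassigning the variable leaves the constant untouched.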

My first attempt was to use the reopen method that is provided by the IO class. This method takes an IO object or a path and optionally a mode string. In the first case, reopen re-associates the receiver with the given IO object. In the second case, reopen opens a new output stream on the path and then re-associates the receiver with this stream.

old_stdout = STDOUT.dup
STDOUT.reopen('/tmp/ruby-output')
puts "Hello, World!"
STDOUT.reopen(old_stdout)

In the example above, we use reopen on STDOUT to send all data that is written to standard output to the file /tmp/ruby-output (warning: don’t try this in irb, since after STDOUT.reopen(...) you won’t see any more output!). So far so good, but as I mentioned earlier, I wanted to have the output in a string. Of course, we could use File.read('/tmp/ruby-output') to read the contents of the file into a string, but that’s not exactly an elegant solution. What we really want is an alternative implementation of IO that collects all output into a string. Searching the Ruby documentation quickly revealed StringIO (loaded with require 'stringio'), an IO-compatible class that provides pseudo I/O on a String object. Fine, so let’s try it out:

old_stdout = STDOUT.dup
out = StringIO.new
STDOUT.reopen(out)
puts "Hello, World!"
STDOUT.reopen(old_stdout)
puts out.string

In contrast to the previous example, we don’t reopen STDOUT to a file but to a StringIO object. Sadly, this doesn’t work. The reopen statement fails with “can’t convert StringIO into String (TypeError)”. It looks like reopen does not accept the StringIO object. In fact, the implementation of IO#reopen checks if the first argument is an IO object. If not, the argument is converted to a string and a new IO object is opened on the path represented by that string. The error occurs because reopen does not recognize the StringIO object as a valid IO object and fails to convert it to a string. Typing StringIO.superclass into irb reveals Data as the superclass of StringIO! Since reopen checks whether the first argument is_a?(IO), it does not accept our StringIO object as an IO object.
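The failure can be reproduced in a few lines. This is only a sketch of the check described above; the exact wording of the error message depends on the Ruby version, so the example prints just the exception class.

```ruby
require 'stringio'

buffer = StringIO.new

# StringIO quacks like an IO, but it is not part of IO's class hierarchy.
puts buffer.is_a?(IO)             # false
puts buffer.respond_to?(:puts)    # true

# reopen therefore treats the argument as a path and fails to convert it.
begin
  STDOUT.reopen(buffer)
rescue TypeError => e
  puts "reopen failed: #{e.class}"
end
```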

As a side note, the IO implementation does not follow the concept of duck typing here (while it could do so easily). Instead of checking the base class of the first argument, it would be better to check whether the object behaves like an IO object. It doesn’t really matter whether its class is a subclass of IO. As long as it behaves like an IO, it should be ok (duck typing: “looks like a duck, walks like a duck, must be a duck!”). Checking the behavior at this point would mean checking that the object responds to the messages that could be sent to an IO object. In fact, reopen could simply omit the check and let errors happen as soon as messages are sent to the object that it doesn’t understand.
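A behavior-based check of the kind described here could look roughly like this. The helper io_like? is hypothetical and not part of Ruby, and the list of messages it asks for is an arbitrary choice for the sake of the example.

```ruby
require 'stringio'

# Hypothetical helper, not part of Ruby: treats anything that answers the
# basic output messages as "IO-like", regardless of its class.
def io_like?(object)
  [:write, :puts, :print].all? { |message| object.respond_to?(message) }
end

puts io_like?(STDOUT)            # true
puts io_like?(StringIO.new)      # true
puts io_like?("just a string")   # false
```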

So, back to our reopen problem. A solution would be to create a subclass of IO that simply delegates all messages to a StringIO object. Or we could implement our own StringIO class. Both solutions are feasible, but require some work and add complexity to the application. Luckily, we still have the global variable $stdout. Since it is a variable, we can assign it a new value.

old_stdout = $stdout
out = StringIO.new
$stdout = out
puts "Hello, World!"
$stdout = old_stdout
puts out.string

Simply assigning a StringIO object to $stdout does the trick. Since the implementation of Kernel#puts delegates to $stdout, the output is sent to our StringIO object. This solution has a few drawbacks, however. First, it doesn’t work if someone writes to standard output directly via STDOUT. Second, output from subprocesses is not captured. If this is a problem, you need to reopen STDOUT, either to a file or to a custom IO object. If not, the simple solution from above should work fine.
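For the cases where writes through STDOUT must be captured too, here is a sketch of the reopen-to-a-file variant, using a temporary file instead of a fixed path. The helper name with_stdout_reopened and the file name prefix are made up for this example.

```ruby
require 'tempfile'

# Hypothetical helper: redirects the STDOUT stream into a temporary file
# while the block runs and returns what was written to it.
def with_stdout_reopened
  saved = STDOUT.dup                      # keep a handle on the real stdout
  Tempfile.create('captured-stdout') do |file|
    begin
      STDOUT.reopen(file)
      yield
      STDOUT.flush                        # push buffered data into the file
    ensure
      STDOUT.reopen(saved)                # restore the original stream
      saved.close
    end
    file.rewind
    file.read
  end
end

text = with_stdout_reopened do
  STDOUT.puts "written via the constant"
  puts "written via Kernel#puts"
end
puts text
```

Because the redirection happens on the underlying file descriptor, subprocesses started inside the block inherit it as well. The price is a temporary file instead of a plain in-memory string.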

There is one more pitfall in this solution (it also applies to the reopen solution). If an exception occurs between $stdout = out and $stdout = old_stdout, $stdout won’t be set back to its original value. To avoid this problem, we have to surround our “business code” with a begin...ensure block.

old_stdout = $stdout
out = StringIO.new
$stdout = out
begin
   puts "Hello, World!"
ensure
   $stdout = old_stdout
end
puts out.string

Now, the ratio of “infrastructure code” to “business code” is not very good. Well, it wasn’t good before either :-). We have exactly two lines where the actual work is done and seven lines dealing with standard output redirection. Fortunately, Ruby gives us blocks, so we can easily encapsulate the infrastructure code.

def with_stdout_captured
   old_stdout = $stdout
   out = StringIO.new
   $stdout = out
   begin
      yield
   ensure
      $stdout = old_stdout
   end
   out.string
end

out = with_stdout_captured do
   puts "Hello, World!"
end
puts out

The method with_stdout_captured makes the code easier to read. Not only does it hide the details of redirecting standard output but it also clearly reveals the intention when someone else is reading the code.
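A quick way to convince yourself that the ensure clause does its job is to raise inside the block. The helper is repeated here so that the sketch is self-contained; the error message is made up.

```ruby
require 'stringio'

# Same helper as above, repeated so this sketch runs on its own.
def with_stdout_captured
  old_stdout = $stdout
  out = StringIO.new
  $stdout = out
  begin
    yield
  ensure
    $stdout = old_stdout
  end
  out.string
end

before = $stdout
begin
  with_stdout_captured do
    puts "this line is captured and then lost"
    raise "something went wrong"          # made-up failure for illustration
  end
rescue RuntimeError => e
  puts "rescued: #{e.message}"            # goes to the restored standard output
end
puts $stdout.equal?(before)               # true
```

Note that the captured text is discarded when the block raises, because the method never gets to return out.string.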

Ok, this was a rather lengthy post about something as simple as capturing standard output. However, at least I learned a lot about standard output in Ruby, and if you read this far, I hope you enjoyed it.


Mixin Module Methods

In Ruby, a module can be included in other modules or classes, which adds the features of the included module to the including module or class. This works fine if I want to mix instance methods into a class.

module Hello
   def say_hello
      puts "Hello, World! Here is #{self.to_s}."
   end
end
class Person
   include Hello
   def initialize(name)
      @name = name
   end
   def to_s
      "#{@name}, a Person."
   end
end
Person.new("Stefan").say_hello
=> Hello, World! Here is Stefan, a Person.

The include Hello statement in the Person class causes the Ruby interpreter to add all instance methods that are defined in Hello to Person. This works well for instance methods, but things get trickier if we want to mix in module or class methods. The following code looks straightforward, but unfortunately doesn’t work.

module Hello
   def self.say_hello
      puts "Hello, World! Here is #{self.to_s}."
   end
end
module Stefan
   include Hello
end
Stefan.say_hello
=> undefined method `say_hello' for Stefan:Module (NoMethodError)

Here we define a module method say_hello in the module Hello and then include Hello in Stefan, hoping that say_hello will be mixed into Stefan as a module method. But Ruby reports a NoMethodError when we try to send the say_hello message to Stefan. Reading the documentation reveals the problem: include works only for instance methods (and constants and module variables). Hence our module method say_hello isn’t included. Bad luck.
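The split between what include carries over and what it leaves behind can be checked directly. In this sketch the method names module_level and instance_level are invented for illustration.

```ruby
module Hello
  def self.module_level      # lives on Hello itself, not copied by include
    "module method"
  end

  def instance_level         # an instance method, travels with include
    "instance method"
  end
end

module Stefan
  include Hello
end

puts Hello.respond_to?(:module_level)                     # true
puts Stefan.respond_to?(:module_level)                    # false
puts Stefan.instance_methods.include?(:instance_level)    # true
```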

Note: There is a more elegant solution to the problem of mixing module/class methods into modules and classes in the update at the end of this post. However, if you are interested in how singleton classes work, the next paragraphs may also be worth reading.

Fortunately, we can work around this. Ruby has a feature called singleton classes (also known as metaclasses). Singleton classes are used to add methods to individual objects. Normally, the methods provided by an object are determined by its class. You can’t store methods directly in an object. But as we know, Ruby allows us to define a method that belongs only to a specific object.

stefan = Object.new
def stefan.say_hello
   puts "Hello, World! Here is Stefan."
end
stefan.say_hello
=> Hello, World! Here is Stefan.

In this example, we create a new object stefan and add a method say_hello to it. Now how does this work if we can’t store methods in an object? The solution is that the method is not added to the object stefan but to its singleton class. This class does not affect an object’s inheritance chain; stefan is still an Object. A singleton class intercepts the messages sent to an object before they go up the inheritance chain. If the singleton class can respond to a message, it will do so. If not, the message is passed up through the inheritance chain to see if another class can handle it.
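Ruby lets you look at this directly through Object#singleton_class and Object#singleton_methods (available in Ruby 1.9.2 and later). A small sketch:

```ruby
stefan = Object.new
def stefan.say_hello
  "Hello, World! Here is Stefan."
end

puts stefan.class                                              # Object
puts stefan.singleton_methods.inspect                          # [:say_hello]
puts stefan.singleton_class.instance_methods(false).inspect    # [:say_hello]

# Other instances of Object are unaffected.
puts Object.new.respond_to?(:say_hello)                        # false
```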

Back to our module mixin problem. In Ruby everything is an object. A module is really an object of class Module. And like every object, a module can also have a singleton class. If we define a module method, the method isn’t really added to the module object but to its singleton class. The same happens for class methods, by the way. In fact, we could write a module method as follows.

module Stefan
   class << self
      def say_hello
         puts "Hello, World! Here is Stefan."
      end
   end
end
Stefan.say_hello
=> Hello, World! Here is Stefan.

The class << self syntax opens a singleton class, in this case for the enclosing module Stefan. In the singleton class, we define the method say_hello. The same happens behind the scenes if you write def self.say_hello. If we now send the message ’say_hello’ to Stefan, first the module’s singleton class will get the opportunity to respond to the message. Since our singleton class has a method say_hello, it will handle the message by calling this method. If the singleton class doesn’t know how to respond to a message, Stefan’s class (Module) will get the chance to handle it. Be aware that Stefan is a module, and a module is an object of class Module!
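To see that both spellings end up in the same place, one can define one method each way and list what the singleton class holds. The method say_goodbye is added only for this sketch.

```ruby
module Stefan
  def self.say_hello          # "def self." form
    "hello"
  end

  class << self               # explicit singleton class form
    def say_goodbye
      "goodbye"
    end
  end
end

# Both methods live in Stefan's singleton class.
puts Stefan.singleton_class.instance_methods(false).sort.inspect   # [:say_goodbye, :say_hello]
puts Stefan.class                                                  # Module
```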

With this in mind, we now have a solution for mixing module methods into a module. We first have to define a module that provides the methods to be included. We then include this module in the other module’s singleton class, not in the module itself.

module Hello
   def say_hello
      puts "Hello, World! Here is #{self.to_s}."
   end
end
module Stefan
   class << self; include Hello; end
end
Stefan.say_hello
=> Hello, World! Here is Stefan.

Et voilà, say_hello is now a module method of Stefan. Note that say_hello is defined in Hello as an instance method, not as a module method, and that Hello is included in Stefan’s singleton class rather than in Stefan itself. This technique works for mixing class methods into classes as well.
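A short sketch that checks where Hello actually ended up, using Module#include? on both the module and its singleton class:

```ruby
module Hello
  def say_hello
    "Hello, World! Here is #{self}."
  end
end

module Stefan
  class << self; include Hello; end
end

puts Stefan.say_hello                                # Hello, World! Here is Stefan.
puts Stefan.singleton_class.include?(Hello)          # true
puts Stefan.include?(Hello)                          # false
puts Stefan.instance_methods.include?(:say_hello)    # false
```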

Update (May 19, 2007):

Gregory Brown commented about a much cleaner solution: simply use object.extend which adds all instance methods from the modules given as parameter to object. Since modules and classes are objects, we can send them the extend message.

module Hello
   def say_hello
      puts "Hello, World! Here is #{self.to_s}."
   end
end
module Stefan
   extend Hello
end
Stefan.say_hello
=> Hello, World! Here is Stefan.

This works the same but looks much cleaner than my solution. As Gregory pointed out, we could also use a combination of extend and include to mix in instance and module methods at the same time.

module Hello
   module ClassMethods
      def say_hello
         puts "Hello, World! Here is #{self.to_s}."
      end
   end
   def self.included(base)
      base.extend(ClassMethods)
   end
end
module Stefan
   include Hello
end
Stefan.say_hello
=> Hello, World! Here is Stefan.

In this example, the ‘included’ callback is used. This callback is invoked on a module whenever it is included in another module or class. The parameter ‘base’ holds the including module or class. Note that the say_hello method is placed in a separate module ClassMethods. The example uses the included callback to extend base with the methods from this module when base includes Hello. This way, the methods defined in ClassMethods become module methods of Stefan when it includes Hello.
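The same pattern applied to a class, as a sketch. Note that the hook has to be defined on the module itself (def self.included), otherwise Ruby will not call it. The Person class and the greet method are illustrative additions.

```ruby
module Hello
  module ClassMethods
    def say_hello
      "Hello from #{self}."
    end
  end

  # Hook: called on Hello with the including class or module as argument.
  def self.included(base)
    base.extend(ClassMethods)
  end

  # Ordinary instance method, mixed in by include as usual.
  def greet
    "Hi, I am #{@name}."
  end
end

class Person
  include Hello

  def initialize(name)
    @name = name
  end
end

puts Person.say_hello               # Hello from Person.
puts Person.new("Stefan").greet     # Hi, I am Stefan.
```

A single include Hello thus gives Person both a class method and an instance method.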

Gregory’s solution is obviously much more elegant than my approach since you don’t need to deal with singleton classes. In my defense, I wasn’t aware of the extend method and its function. However, at least I learned a lot about singleton classes while writing this post :-).


Delicious Hpricot

I keep a list of the blogs I frequently read in my del.icio.us bookmarks. Every blog I’m currently interested in is bookmarked under the tag “blogroll”. I prefer reading the blogs in a web browser over feed readers or aggregators. Therefore a bookmark points to the blog’s webpage and not to an Atom or RSS feed.

Yesterday I wanted to create a list with the authors of all blogs I’m currently reading. Unfortunately, there is no way to reliably extract this information from a webpage (oh dear, we are still light-years away from the semantic web). The only way to discover details about the author is to look at a blog’s news feed. Atom feeds usually have an author element with meta information about the author. RSS feeds don’t, but at least the email address of the creator can be extracted from most feeds. Luckily, every blog I’m aware of provides a feed in Atom or RSS format. Information about available feeds is usually encoded in the header of a blog’s webpage. For each feed, a link element with rel="alternate" exists that specifies the feed’s URL and type. So the solution to my problem was to extract the feed URLs from a webpage, load the feed and get the author from the feed header.

Now I could have used regular expressions to extract the alternate links. However, I recently came across Hpricot, “a fast, enjoyable HTML parser for ruby”. And while I find regexps quite useful, I’m not a particular fan of them, especially when they get complex. So I decided to give Hpricot a try. I wrote a little Ruby script to extract feed URLs from a webpage. At its center is the following code:

   # Needs require 'hpricot' and require 'open-uri' at the top of the script
   def discover_feeds(href)
      doc = Hpricot(open(href))
      feeds = []
      doc.search('//head/link[@rel=alternate]').each do |candidate|
         # The type attribute holds a MIME type such as application/atom+xml
         if candidate['type'] =~ /^([a-zA-Z]+\/)?([a-zA-Z]+)/
            format = $2.downcase
            feeds << {:address => candidate['href'], :format => format}
         end
      end
      feeds   # return the collected feeds rather than the result of each
   end

Thanks to Hpricot’s XPath-style queries, finding all alternate links in the header is almost a no-brainer. I then use a regular expression to extract the feed’s format from the link’s type attribute, which specifies a MIME type. The discover_feeds function returns an array of hashes, one for each feed. I can now easily select a feed:

   avail_feeds = discover_feeds(webpage)
   feed = avail_feeds.find {|f| f[:format] == 'atom'} || avail_feeds.first

The second line looks for an Atom feed first and falls back to the first available feed if no Atom feed has been discovered. Quite simple: it took me around 30 minutes to write the whole script (including the installation of Hpricot). Working out a regular expression would have taken much longer (at least for me), not to mention the nightmare of remembering what the expression was meant to do when looking at the code next month.
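The format extraction and the fallback selection can be tried out without Hpricot or a network connection. This is only a sketch: the helper names feed_format and preferred_feed, the simplified pattern and the example.org addresses are all made up for illustration.

```ruby
# Simplified pattern: an optional "type/" prefix followed by the format name.
MIME_FORMAT = %r{\A(?:[a-zA-Z]+/)?([a-zA-Z]+)}

# Hypothetical helper: "application/atom+xml" becomes "atom".
def feed_format(mime_type)
  match = MIME_FORMAT.match(mime_type.to_s)
  match && match[1].downcase
end

# Hypothetical helper: prefer an Atom feed, otherwise take the first one.
def preferred_feed(feeds)
  feeds.find { |f| f[:format] == 'atom' } || feeds.first
end

# Made-up sample data standing in for the result of the feed discovery.
feeds = [
  { :address => 'http://example.org/rss.xml',  :format => feed_format('application/rss+xml') },
  { :address => 'http://example.org/atom.xml', :format => feed_format('application/atom+xml') }
]

puts preferred_feed(feeds)[:address]           # http://example.org/atom.xml
puts preferred_feed(feeds.first(1))[:format]   # rss
```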
