So you have a bunch of objects - an object graph - the result of some operation or API call. The task: analyze the data and store the results of the analysis as metadata for the graph.
Too abstract? Think of how a compiler works: the parser spits out a parse tree (or Abstract Syntax Tree or AST) that represents the code as tree structure. Then: a lot of algorithms run over the tree in passes: gathering symbols in the symbol table, doing type inference, using these types to do type checking, etc.
But wait: the last two passes show the problem: where does the type inferencing code store its data for the type checker to use? It'd be most convenient to store the data where it's needed - i.e. if an expression node (in the AST) is found to return type Foo, then it'd be best to store that information right in the node.
To illustrate the solution, we'll look at some tools that work on ASTs - not a compiler, but tools with similar requirements. In Ruby, the ParseTree library returns Ruby source code as an AST. Example:
[:vcall, :obj, :say_hello, [:array, [:lit, 42], [:lvar, :foo]]]
This is a bit of Ruby code represented as ParseTree AST. Since this is an article and not a VM, it's a bit difficult to work with objects and object references organized as a graph - so we use ParseTree's s-expr representation of the tree. S-exprs are nested lists; each list represents a node in the tree, with the first element specifying the node type. In the sample, the node represents a call to a virtual method (vcall). The parameters are the receiver (the 'self' in the called method), the name of the method and the parameters.
Ideas for tools would be a type checker, static analysis tools or automatic refactorings. One of the requirements for tools of that kind to work is some kind of type inference, i.e. determining the types of variables or expressions. For instance, this AST represents a call to the to_s method:
[:vcall, :obj, :to_s]
What information can be gathered about this, and what could be done with this information? The type of the return value, for instance. This is hard to determine - but since it's to_s, a method that returns the string representation of an object - let's just say it returns "String". This is not necessarily true - it's a guess. For tools such as a code completion, it's fine enough - 100 % accuracy isn't possible for these matters. Although in some cases, more information could be gathered, and the analyzer could determine more accurate information.
Analyzers work on the tree and annotate the AST with the metadata they have determined. For keeping codebases modular, it's a good idea to separate the analyzers from the metadata consumers, i.e. the code that walks the AST and does something with the metadata. For instance, a Ruby editor could highlight a method that overrides another in its superclass.
The solution
Long story short: here's a solution for annotating ParseTree nodes:
node = [:vcall, :obj, :to_s]
def node.set_metadata(key, value)
@_metadata ||= Hash.new
@_metadata[key] = value
end
def node.metadata
@_metadata ||= {}
end
node.set_metadata(:type, :String)
What does that do?
The secret sauce: singleton classes
Every object in Ruby is the instance of a class. Unlike many other OOP languages, Ruby allows you to change an object's class. Don't confuse this with open classes: in Ruby, it's possible to modify a class - even at runtime. Singleton classes are similar - except that they only affect one object's class. While the difference might seem small - it has the big benefit of limiting the effects of the class modification to the affected object. Open classes, on the other hand, change the class definition for all the code and all the objects of these classes. This is useful, but can also mess with other people's code and if too many pieces of code do it on common classes, name clashes with existing methods can happen. In the case of ParseTree nodes - which are ordinary Ruby Arrays - this means that only the actually used Array objects are affected - not all the Arrays on the heap.
Another way to see this: with singleton classes, the changes to the class stay local to the code that creates and uses them - they never need to be visible outside. Open classes, on the other hand, are global changes: a class name is a global variable, i.e. "String" refers to the class object representing Strings. Just as more scoped variables (local variables, member variables) are preferable over global variables, so are singleton classes to open classes.
Of course: which solution to choose (singleton or open classes) depends on the particular situation. With open classes, the change to the class happens once - after it, all objects of that class have the added methods. With singleton classes, the change to the class (i.e. creating the singleton class and changing the object's class pointer to point to it) happens at every change.
The syntax - as seen above - is simple:
def object_variable.method_name()
# code
end
The variable pointing to the object is prefixed to the method name. A more flexible and modular way is to use Mixins to do it. This allows to collect methods that handle some aspect in a Module and mix them in in one go. E.g.
module Metadata
def set_metadata(key, val)
@_metadata[key] = val
end
def metadata
@_metadata ||= {}
end
end
x = [:vcall, :obj, :to_s]
x.extend Metadata
x.set_metadata(:type, :String)
Mixins allow to mix functions defined in a Module into a class - in this sample, it's the node object's singleton class. Again: only this object will have these methods now.
Another example - dead code removal
If the static analyzer code is too abstract, let's try another tool: a dead code remover. "Dead code" is simply code that doesn't really do anything and can be removed without changing the behavior of the program. Like this code:
for x in [1,2,3] do
end
With our annotating concept, we can look for this kind of code and annotate it as such:
node.metadata(:dead_code, :true)
But marking the code as dead is only the first step - now we need to remove it. This is where it's easy to see that it makes sense to split up analysis and action part. An IDE might just want to highlight some code as dead code, but it might also offer a simple way (a Quick Fix) that removes the code.
How can this code be removed? Sure - it'd be possible to dump the node from the AST and then use Ruby2Ruby to spit out Ruby source. But that's not a very nice solution: Ruby2Ruby turns ParseTree ASTs into Ruby source code, but it loses a lot of useful information, such as formatting (white space) and comments. Dropping these is not acceptable in an IDE or refactoring/cleanup tools.
Metadata to the rescue: the nodes can be annotated with source locations - ie. where this particular node is found in the original source code. And with that - it's easy: the cleanup code simply marches through the code, finds all nodes marked with :dead_code and removes them in the source file, using the character offsets of the source location metadata. (Of course - once some nodes were removed, the offsets of the rest of the nodes are off. This can be solved by simply tracking the number of deleted characters and subtracting them from the actual offsets. With that, it's not necessary to reparse and re-analyze the source).
Conclusion
The examples in this article were focussed on language tools - but the ideas are usable for all kinds of object collections. Wherever object graphs need to be annotated, possibly in multiple passes by independently written analyzers, it's convenient to store the annotations with the nodes. If the classes of the object graph are not under the developer's control and are extensively used elsewhere (e.g. the Array class in ParseTree), singleton classes are a good choice. For other situations, open classes might be a better solution.
As for implementing the specific tools mentioned here: ParseTree is available for Ruby 1.8.x, in Rubinius and JRuby (jparsetree). The JRuby version also adds source locations for the individual nodes, so it's even easier to modify source. Have fun coming up with ideas for tools and writing them - it's easy with Ruby.