Storing Code in Queryable Data Structures?
Is today’s mainstream use of flat files the optimal way to represent code? Rick Minerich’s post advocating a move away from this paradigm prompted several discussions in the blogosphere.
He argues that representing code in flat files does not allow code to be structured in the most appropriate way. Both the order of functions or classes in a file and the folder organization of a program depend on arbitrary choices made by programmers and reflect their ideas about structuring code and expressing its meaning. However, “no two programmers think identically alike”, and as soon as source code involves several contributors, the structure is likely to be modified and to lose coherence at each level (the code structure within files and the folder structure of the program) and between the two.
Solutions exist to reduce these risks, e.g. splitting code into as many files as possible or marking out regions of code, but Rick Minerich believes they offer only a partial response to the issues he raises because “they are anchored to flat files”.
Moreover, in some cases it may be useful to have “a different ordering/meaning […] for a particular task”, but reorganizing code stored in flat files for each separate task is hardly practical.
To address these issues, Rick advocates a different approach to code representation:
If you can treat the reflected code from a programming language like an abstract data structure, why can’t you just keep the source itself in a similarly abstracted data structure? Isn’t the structure of a program more similar to a graph, than a list?
If we kept our code in queryable data structures it would be easy to lay our environment in any way we chose. […] You could also, for instance, show a method and everything which references it. The possibilities for code visualization are limitless.
The real boon of moving on is the power and understanding we will gain from being able to visualize the structure of our programs in any way we choose.
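To make the "show a method and everything which references it" idea concrete, here is a minimal sketch in Python. It is not taken from Rick Minerich's post: the sample source and the names `build_reference_index` and `callers_of_parse` are invented for illustration, and the standard `ast` module stands in for whatever code store such an environment would really use. It only tracks direct calls by plain name.

```python
import ast
from collections import defaultdict

# Hypothetical sample program, kept as a string so the sketch is self-contained.
SOURCE = '''
def parse(text):
    return text.split()

def count_words(text):
    return len(parse(text))

def report(text):
    print(count_words(text), parse(text)[:1])
'''

def build_reference_index(source: str) -> dict[str, set[str]]:
    """Map each called name to the top-level functions that call it."""
    tree = ast.parse(source)
    referenced_by = defaultdict(set)
    for func in tree.body:
        if not isinstance(func, ast.FunctionDef):
            continue
        # Walk the function body and record every call to a plain name.
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                referenced_by[node.func.id].add(func.name)
    return dict(referenced_by)

index = build_reference_index(SOURCE)
# "Query": which functions reference parse()?
callers_of_parse = sorted(index.get("parse", set()))
```

Once the program is held as a structure rather than as text, the "view" of `parse` together with its callers is just a lookup in `index`; the same index could drive other layouts without touching the files.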
Rick’s post triggered many reactions. Steve Hawley shares his viewpoint and suggests using LINQ to support the query-based approach to code:
It strikes me that the process of figuring out which variables you're touching when you're compiling a line of code is really a database query. Scoping and the semantics of scoping are part of the query (as well as how the database has been built).
Further, the actual link of a completed compile (whether or not it's being done at build time or run time), is another query.
The process of compilation should really be the process of building up a database.
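The claim that name resolution is a database query can be illustrated with a small sketch. It uses SQL through Python's built-in `sqlite3` module rather than LINQ, and the `scopes` and `bindings` tables, their contents and the `resolve` helper are all assumptions made for this example, not anything Steve Hawley describes. Lexical scoping is expressed as a recursive query that walks from the current scope up through its parents and keeps the nearest binding.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE scopes (id INTEGER PRIMARY KEY, parent INTEGER, label TEXT);
CREATE TABLE bindings (scope_id INTEGER, name TEXT, declared_as TEXT);
INSERT INTO scopes VALUES
    (1, NULL, 'module'), (2, 1, 'function f'), (3, 2, 'inner block');
INSERT INTO bindings VALUES
    (1, 'x', 'global int'), (2, 'x', 'parameter'), (1, 'y', 'global str');
""")

# Walk the chain of enclosing scopes, then pick the innermost matching binding.
RESOLVE = """
WITH RECURSIVE chain(id, parent, depth) AS (
    SELECT id, parent, 0 FROM scopes WHERE id = :scope
    UNION ALL
    SELECT s.id, s.parent, chain.depth + 1
    FROM scopes s JOIN chain ON s.id = chain.parent
)
SELECT b.declared_as
FROM chain JOIN bindings b ON b.scope_id = chain.id
WHERE b.name = :name
ORDER BY chain.depth
LIMIT 1
"""

def resolve(name: str, scope: int):
    """Return the declaration visible for `name` from `scope`, or None."""
    row = db.execute(RESOLVE, {"name": name, "scope": scope}).fetchone()
    return row[0] if row else None
```

Here `resolve("x", 3)` finds the parameter of the enclosing function rather than the global, because the query orders candidates by distance from the current scope. The scoping rules live in the query and in how the tables were populated, which is the point of the quotation above.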
Several commentators, however, pointed out that such an approach to code already exists. Keith Braithwaite argues, for instance, that “the logical conclusion of what [Rick Minerich is] talking about is an image-based environment”, which exists in Smalltalk and Lisp. Along with Smalltalk, another commentator gives the example of the VisualAge suite, where “all code source is stored in a code database […], and you can query it any which way you want”.
However, Steve Hawley and other bloggers stress that the advantages of flat files should not be dismissed. Flat files allow efficient navigation through code since humans “are very aware of space and spatial layout of things and this translates naturally into flat files” so that people “develop a familiarity with the layout of a file and can navigate very efficiently to the right location within it via muscle memory”.
In the discussion on Reddit, one commentator argues that what Rick Minerich considers a flaw of flat files, i.e. their arbitrary structure, can be seen as a benefit, because this flexibility in defining the structure is used to express meaning:
Things like the number of spaces between operators can be used for nice stuff like laying out bits of consecutive lines that have parallel meaning so that they line up. Ordering of functions can be chosen so as to tell a narrative. People have grown quite creative in using the tools they have to write expressive code. If you're going to take this away, I expect to see a good reason to believe that it can be replaced by something equally effective.
Keith Braithwaite points out that moving away from flat-file representation would also mean saying goodbye to text editors and version control tools, and he believes programmers are not ready to pay this price. Another commentator, JSJ, points to an even larger set of tools that would not be usable with image-based formats without being written into the system, and stresses that the ability to “build a toolset and use it for ALL (or most) of your programming languages” is “a huge win for flat files”.
The issue of tooling was also raised by Rick Minerich himself, who argues that one reason flat files are still used is that all the tools have been built for code structured as flat files. Almost all compilers, for instance, require a complete program. He believes that “a language which is not tied to traditional compiling and linking would be ideal for research into keeping code in abstracted data structures” and suggests a first step toward supporting query-based code:
A good first step would be an IDE/Editor that can manage all of the code in a database and allow the programmer to dynamically construct queries to build views and otherwise manipulate the code. The environment could then generate flat files in order to be compatible with current compilers.
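A rough sketch of that first step, under heavy assumptions: code units live in a database, a query builds a view, and the environment generates a flat file for existing tools. The `code_units` table, its `tag` column and the helpers `view` and `export_flat_file` are hypothetical; an in-memory SQLite database stands in for the code store, and Python's `exec` stands in for "a current compiler".

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE code_units (name TEXT PRIMARY KEY, tag TEXT, source TEXT)")
db.executemany(
    "INSERT INTO code_units VALUES (?, ?, ?)",
    [
        ("area", "geometry", "def area(w, h):\n    return w * h\n"),
        ("greet", "text", "def greet(name):\n    return 'hello ' + name\n"),
        ("perimeter", "geometry", "def perimeter(w, h):\n    return 2 * (w + h)\n"),
    ],
)

def view(tag=None):
    """Return the sources of all units, or only those carrying `tag`."""
    if tag is None:
        rows = db.execute("SELECT source FROM code_units ORDER BY name")
    else:
        rows = db.execute(
            "SELECT source FROM code_units WHERE tag = ? ORDER BY name", (tag,)
        )
    return [row[0] for row in rows]

def export_flat_file(units):
    """Concatenate units into the text a file-based tool expects."""
    return "\n\n".join(units)

geometry_only = view("geometry")   # a task-specific view
flat = export_flat_file(view())    # the whole program as one flat file
namespace = {}
exec(flat, namespace)              # stand-in for handing the file to a compiler
```

The ordering inside `flat` is a property of the query, not of the stored code, so a different task could ask for a different layout while the compiler still receives an ordinary file.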
What about .QL?
Francisco Jose Peredo Noguez
Modern IDEs like Eclipse make it easy to access the AST of source or compilation units, and make queries of these units fast by indexing the code. With that in mind, a source file on your HD is a tree structure... a lazy tree structure; if you want to access it as a tree, a parser has to turn it from its flat representation into an in-memory tree representation. Storing an AST in a binary representation on the disk is simply an optimization, i.e. caching the output of a parser.
With this point of view, Eclipse's workspaces are actually image based - except that on the disk, you only see "flat" source files; the image part is done by a) indexing and b) caching the result of the indexing process on disk.
Actually, you can approach this from the other direction too, as a group at IBM is now doing: they make Smalltalk code inside an image available in source file form by treating the Smalltalk image as a kind of file server. Actually, before I botch the description, here's a podcast with one of the members of the team:
as I said
PS. suggested table structures:
Let's do it.
a source file on your HD is a tree structure... a lazy tree structure; if you want to access it as a tree, a parser has to turn it from its flat representation into an in-memory tree representation.
Hmmm, this doesn't count as a tree structure to me, any more than a bag of flour, a couple of eggs and some milk and baking powder count as breakfast. Sure, they can become breakfast, but they're in an intermediate form that requires some processing before they're in their final form.
That said, I think that Herr Schuster has a good point – why not start with creating toolsets that handle source code in a pre-parsed form, and just front them with a data source that knows how to convert flat files? If someone then wants to store the in-memory representation as a BLOB in a database somewhere, then it's just a different data source to retrieve it.
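One way to picture this suggestion is sketched below in Python. Everything in it is an assumption made for illustration: the `TreeSource` interface, the `FlatTextSource` and `BlobSource` classes, and the use of the standard `ast` and `pickle` modules in place of a real parser and database. The tool (`function_names`) only ever sees a parsed tree and does not care where it came from.

```python
import ast
import pickle
from typing import Protocol

class TreeSource(Protocol):
    """Anything that can hand a tool a pre-parsed unit."""
    def load(self, unit: str) -> ast.Module: ...

class FlatTextSource:
    """Fronts flat text (a dict stands in for files on disk) with a parser."""
    def __init__(self, files: dict[str, str]):
        self.files = files

    def load(self, unit: str) -> ast.Module:
        return ast.parse(self.files[unit])

class BlobSource:
    """Keeps already-parsed trees as binary blobs (a dict stands in for a database)."""
    def __init__(self):
        self.blobs: dict[str, bytes] = {}

    def store(self, unit: str, tree: ast.Module) -> None:
        self.blobs[unit] = pickle.dumps(tree)

    def load(self, unit: str) -> ast.Module:
        return pickle.loads(self.blobs[unit])

def function_names(source: TreeSource, unit: str) -> list[str]:
    """A tool written against trees only; it never touches source text."""
    return [n.name for n in ast.walk(source.load(unit))
            if isinstance(n, ast.FunctionDef)]

flat = FlatTextSource({"m": "def a(): pass\ndef b(): pass\n"})
blob = BlobSource()
blob.store("m", flat.load("m"))
```

Calling `function_names(flat, "m")` and `function_names(blob, "m")` gives the same answer, so swapping the storage format becomes a matter of swapping the data source.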
Re: Source Control
The real consideration I see is newness. Of course existing source control tools would have issues. Just as existing text editors would. But I don't really consider these problems. When considering the value of this idea I prefer to consider the ideal implementation of it - in which it is accepted practice - since I can then consider whether it is something worth thinking about.
And I think this could have promise.
Re: What about .QL?
Oege de Moor
[Disclosure: I'm the CEO of Semmle]
I was using it
That's possible with and within GeneXus (www.GeneXus.com)