Storing Code in Queryable Data Structures?

Is today’s mainstream use of flat files the best way to represent code? Rick Minerich’s post advocating a move away from this paradigm prompted several discussions in the blogosphere.

He argues that flat files do not allow code to be structured in the most appropriate way. Both the order of functions or classes within a file and the folder organization of a program depend on arbitrary choices by programmers and reflect their ideas about structuring code and expressing its meaning. However, “no two programmers think identically alike”, and as soon as source code involves several contributors, the structure risks being modified and thus losing coherence at each level (code structure within files and folder structure of the program) and between the two.

Even though solutions exist to reduce these risks (e.g. separating things out into as many files as possible, or marking out regions of code), Rick Minerich believes that they offer only a partial response to the issues he raises because “they are anchored to flat files”.

Moreover, in some cases it may be useful to have “a different ordering/meaning […] for a particular task”, but reorganizing code stored in flat files for each separate task is hardly practical.

To respond to these issues, Rick advocates for a different approach to code representation:

If you can treat the reflected code from a programming language like an abstract data structure, why can’t you just keep the source itself in a similarly abstracted data structure? Isn’t the structure of a program more similar to a graph, than a list?

[…]

If we kept our code in queryable data structures it would be easy to lay our environment in any way we chose. […] You could also, for instance, show a method and everything which references it. The possibilities for code visualization are limitless.

[…]

The real boon of moving on is the power and understanding we will gain from being able to visualize the structure of our programs in any way we choose.
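The query mentioned above, "show a method and everything which references it", can be approximated even on flat files once the code is parsed into a tree. The following is only a minimal sketch, not anything proposed in Rick's post: it uses Python's standard `ast` module, and the sample source, the `build_reference_index` helper, and the function names are all hypothetical. It tracks only direct calls by plain name, so method calls and aliased names are ignored.

```python
import ast
from collections import defaultdict

# Hypothetical sample program, kept as a string so the sketch is self-contained.
SOURCE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def main(path):
    return parse(load(path))

def report(path):
    print(len(parse(load(path))))
'''


def build_reference_index(source):
    """Map each called function name to the set of functions that call it."""
    tree = ast.parse(source)
    callers = defaultdict(set)
    for func in ast.walk(tree):
        if isinstance(func, ast.FunctionDef):
            # Look at every call made inside this function's body.
            for node in ast.walk(func):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    callers[node.func.id].add(func.name)
    return callers


index = build_reference_index(SOURCE)
print(sorted(index["parse"]))  # ['main', 'report']
```

Once the index exists, "who references `parse`?" becomes a dictionary lookup rather than a text search, which is the kind of view the post has in mind.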

Rick’s post triggered many reactions. Steve Hawley shares his viewpoint and suggests using LINQ to support a query-based approach to code:

It strikes me that the process of figuring out which variables you're touching when you're compiling a line of code is really a database query.  Scoping and the semantics of scoping are part of the query (as well as how the database has been built).

Further, the actual link of a completed compile (whether or not it's being done at build time or run time), is another query.

The process of compilation should really be the process of building up a database.
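To make the idea of name resolution as a database query concrete, here is a small sketch. It is an illustration under stated assumptions, not Steve Hawley's design and not LINQ: it uses Python's standard `sqlite3` module, and the `scopes` and `symbols` tables, their columns, and the `resolve` helper are hypothetical. Lexical scoping is modeled as a walk from the innermost scope up through its parents, expressed as a recursive query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scopes (id INTEGER PRIMARY KEY, parent_id INTEGER, label TEXT);
CREATE TABLE symbols (scope_id INTEGER, name TEXT, kind TEXT);

-- A module containing a function, which contains an inner block.
INSERT INTO scopes VALUES (1, NULL, 'module'), (2, 1, 'function f'), (3, 2, 'inner block');
INSERT INTO symbols VALUES
    (1, 'x', 'global'), (1, 'y', 'global'), (2, 'x', 'local'), (3, 'z', 'local');
""")


def resolve(name, scope_id):
    """Return (scope label, kind) of the nearest enclosing declaration, or None."""
    return conn.execute(
        """
        WITH RECURSIVE chain(id, parent_id, label, depth) AS (
            SELECT id, parent_id, label, 0 FROM scopes WHERE id = ?
            UNION ALL
            SELECT s.id, s.parent_id, s.label, chain.depth + 1
            FROM scopes s JOIN chain ON s.id = chain.parent_id
        )
        SELECT chain.label, symbols.kind
        FROM chain JOIN symbols ON symbols.scope_id = chain.id
        WHERE symbols.name = ?
        ORDER BY chain.depth      -- innermost scope wins (shadowing)
        LIMIT 1
        """,
        (scope_id, name),
    ).fetchone()


print(resolve("x", 3))  # ('function f', 'local')
print(resolve("y", 3))  # ('module', 'global')
print(resolve("w", 3))  # None
```

The scoping rules live in the query itself: changing `ORDER BY chain.depth` or the join condition would change which declaration a name binds to.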

Several commentators, however, pointed out that such an approach to code already exists. Keith Braithwaite argues, for instance, that “the logical conclusion of what [Rick Minerich is] talking about is an image-based environment”, which exists in Smalltalk and Lisp. Along with Smalltalk, another commentator gives the example of the VisualAge suite, where “all code source is stored in a code database […], and you can query it any which way you want”.

However, Steve Hawley and other bloggers stress that the advantages of flat files should not be dismissed. Flat files allow efficient navigation through code since humans “are very aware of space and spatial layout of things and this translates naturally into flat files”, so that people “develop a familiarity with the layout of a file and can navigate very efficiently to the right location within it via muscle memory”.

In the discussion on Reddit, one commentator argues that what Rick Minerich considers a flaw of flat files, i.e. their arbitrary structure, may be seen as a benefit, because this flexibility in defining the structure is used to express meaning:

Things like the number of spaces between operators can be used for nice stuff like laying out bits of consecutive lines that have parallel meaning so that they line up. Ordering of functions can be chosen so as to tell a narrative. People have grown quite creative in using the tools they have to write expressive code. If you're going to take this away, I expect to see a good reason to believe that it can be replaced by something equally effective.

Keith Braithwaite points out that moving away from flat-file representation would also mean saying goodbye to text editors and version control tools, and he believes programmers are not ready to pay this price. Another commentator, JSJ, mentions an even larger set of tools that would not be usable with image-based formats unless they were written into the system, and stresses that the ability to “build a toolset and use it for ALL (or most) of your programming languages” is “a huge win for flat files”.

The issue of tooling was also raised by Rick Minerich himself, who argues that one reason flat files are still used is that all the tools have been built for code structured as flat files. Almost all compilers, for instance, require a complete program. He believes that “a language which is not tied to traditional compiling and linking would be ideal for research into keeping code in abstracted data structures” and suggests a first step toward supporting query-based code:

A good first step would be an IDE/Editor that can manage all of the code in a database and allow the programmer to dynamically construct queries to build views and otherwise manipulate the code. The environment could then generate flat files in order to be compatible with current compilers.
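The first step described above (code kept in a database, views built by queries, flat files generated for existing compilers) can be sketched in a few lines. This is a toy illustration, assuming Python's standard `sqlite3` module; the `definitions` table, its columns, and the `view` and `render_module` helpers are hypothetical names chosen for the example, and one row per function is a deliberately crude granularity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE definitions (module TEXT, name TEXT, position INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO definitions VALUES (?, ?, ?, ?)",
    [
        ("shapes", "area", 2, "def area(w, h):\n    return w * h\n"),
        ("shapes", "perimeter", 1, "def perimeter(w, h):\n    return 2 * (w + h)\n"),
        ("report", "describe", 1, "def describe(w, h):\n    return f'{w}x{h}'\n"),
    ],
)


def view(pattern):
    """Ad hoc view: every definition whose body matches a LIKE pattern."""
    rows = conn.execute(
        "SELECT module, name FROM definitions WHERE body LIKE ? ORDER BY module, name",
        (pattern,),
    )
    return rows.fetchall()


def render_module(module):
    """Generate the flat-file text for one module, in its stored order."""
    rows = conn.execute(
        "SELECT body FROM definitions WHERE module = ? ORDER BY position",
        (module,),
    )
    return "\n".join(body for (body,) in rows)


# The generated text is ordinary source that an unmodified toolchain accepts.
flat = render_module("shapes")
namespace = {}
exec(compile(flat, "shapes.py", "exec"), namespace)
print(namespace["area"](2, 3))  # 6
```

Here the ordering of functions in the generated file is just another column, so a different ordering for a different task is a different `ORDER BY` rather than a manual reshuffle.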

Community comments

  • Source Control

    by Saul Vanger,

    I think source control would be the biggest problem. I've dealt with a non-flat-file programming language, and even though source control was supported with a merging plugin, there were numerous performance issues and bugs which made it very difficult to use. The files weren't human readable, so if there was a bug in the tool you were screwed. I witnessed a 30-minute bug fix turn into a week-long ordeal, partly because it was impossible (or it was believed to be impossible) to merge the change back in, and every time the change was redone someone had created a newer version.

  • What about .QL?

    by Francisco Jose Peredo Noguez,

    Do we really need queryable data structures for code? I remember reading about a project that allows developers to run queries over their Java code, .QL from Semmle (semmle.com/documentation/semmlecode/tutorials/o...), and we have made it work with our plain-text code.

  • Hmm...

    by Werner Schuster,

    I don't agree with the term 'flat files'... source code tools use AST representations of code - the T in AST stands for Tree. Trees ain't flat.
    Modern IDEs like Eclipse make it easy to access the AST of source or compilation units, and make queries of these units fast by indexing the code. With that in mind, a source file on your HD is a tree structure... a lazy tree structure; if you want to access it as a tree, a parser has to turn it from its flat representation into an in-memory tree representation. Storing an AST in a binary representation on the disk is simply an optimization... i.e. caching the output of a parser.

    With this point of view, Eclipse's workspaces are actually image based - except that on the disk, you only see "flat" source files; the image part is done by a) indexing and b) caching the result of the indexing process on disk.

    Actually, you can approach this from the other direction too, as a group at IBM is now doing: they make Smalltalk code inside an image available in source file form by treating the Smalltalk image as a kind of file server. Actually, before I botch the description, here's a podcast with one of the members of the team:
    www.cincomsmalltalk.com/blog/blogView?showComme...

  • as I said

    by ding jack,

    Like I said, (reading the plain-text code and) storing the code into DB was as easy as every programmer can DIY. Then we can query them freely.
    PS. suggested table structures:
    classes(id,name)
    imports(class_id,package_id);
    methodes(class_id,name,parameters,body)
    packages(parent_id,name);
    ...
    Let's do it.

  • Re: Hmm...

    by Tracy Nelson,

    "a source file on your HD is a tree structure... a lazy tree structure; if you want to access it as a tree, a parser has to turn it from its flat representation into an in-memory tree representation."

    Hmmm, this doesn't count as a tree structure to me, any more than a bag of flour, a couple of eggs and some milk and baking powder count as breakfast. Sure, they can become breakfast, but they're in an intermediate form that requires some processing before they're in their final form.

    That said, I think that Herr Schuster has a good point – why not start with creating toolsets that handle source code in a pre-parsed form, and just front them with a data source that knows how to convert flat files? If someone then wants to store the in-memory representation as a BLOB in a database somewhere, then it's just a different data source to retrieve it.

  • Re: Source Control

    by Jason Simone,

    I think it's an interesting idea, and I don't understand the potential problems deriving from source control. It seems to me that source control could be more robust and might even be able to make sense of interrelated changes, such as code refactoring.

    The real consideration I see is newness. Of course existing source control tools would have issues, just as existing text editors would. But I don't really consider these problems. When considering the value of this idea, I prefer to consider the ideal implementation of it, in which it is accepted practice, since I can then consider whether it is something worth thinking about.

    And I think this could have promise.

  • Re: What about .QL?

    by Oege de Moor,

    Thanks for bringing this up. Indeed SemmleCode allows you to query any aspect of your code, whether it is to gain insight, to audit code quality, to guide refactoring or to enforce policies. Everything (yes, even JavaDoc and XML config files) is searchable. Semmle has just released SemmleCode Professional Edition, at semmle.com.

    Enjoy!

    -Oege

    [Disclosure: I'm the CEO of Semmle]

  • I was using it

    by Gabriel Medina,

    I have been using it since 15 years ago.
    That's possible with and within GeneXus (www.GeneXus.com)


    Gabriel Medina
    gabrielmedina@gxagro.com.ar
    Rio Cuarto
    Argentina
    www.gxagro.com.ar
