Simon Thompson and Huiquing Li on Refactoring in Functional Languages Like Haskell or Erlang
Bio Simon Thompson is a professor of Logic and Computation at the University of Kent. He has written several books on functional programming, including “Erlang Programming” and “Haskell: The Craft of Functional Programming.” Huiqing Li got her PhD at Kent University in September 2006 and works as a post doc in the EU project ProTest to further develop the refactoring tool Wrangler.
The Erlang Factory is an event that focuses on Erlang - the computer language that was designed to support distributed, fault-tolerant, soft-realtime applications with requirements for high availability and high concurrency. The main part of the Factory is the conference - a two-day collection of focused subject tracks with an enormous opportunity to meet the best minds in Erlang and network with experts in all its uses and applications.
Simon Thompson: I am Simon Thompson from the University of Kent.
Huiqing Li: I am Huiqing Li from the University of Kent.
Simon Thompson: We have been building refactoring tools, first of all for Haskell, but recently we have been working on trying to build tools that help to make Erlang programmers more productive. So we have been here to listen to what other people are doing, but also to present some new work that we have been doing to help people write their own refactorings and use our tool to execute them.
Huiqing Li: Wrangler for refactoring Erlang programs and HaRe for refactoring in Haskell programs.
Simon Thompson: Some of what we do is quite similar to what people do in refactoring tools for a language like Java and if you look at the use of the refactoring tool we built, people use it to do renaming and things like functions, modules, variables and so on for extraction of a function and you see similar things that are in Java too like method extraction. So some of the things are common because what you are doing in writing programs is building abstractions from the bottom up, so you write your large piece of code and discover their abstractions within that which you then turn into a method or a function. So those things are similar.
I guess one of the places where it’s different is in the way that we can be sure that a refactoring is going to be correct it’s easier for a functional language because they either have no side effects in the case of Haskell, or controlled side effects, or in the case of Erlang more limited side effects. So in Erlang you don’t compute by changing the values of variables, it's a single assignment language. The only side effects are something like I/O or communication and those tend to be used in a much more limited way than computation using variables which pervades Java. So it’s more difficult to tell in Java if you do a refactoring whether it’s going to preserve correctness.
The tradition there has been you always have to test afterwards to be sure that you have done that. I suppose the final thing and this is getting into a bit of detail, when we discover that some code has side effects we can wrap it up in what is called a closure and pass that around. So we can do things more flexibly because the functional language has the constructs that allow us to wrap up side effecting code and pass it around in a safe way.
5. One of the things that exist in the object oriented refactoring is the big catalog of refactorings, so have you build up your own catalog of bad patterns that you want to refactor, so how did you find these? Did you analyze a lot of code or how does that work?
Huiqing Li: Yes, I think we do analyze a lot of code, also because each language is so different, each language has its own particular refactorings, like for Erlang we have process oriented refactorings or we could refactor a non-OTP application into an OTP application, so these are different. Also for Haskell we have different data type related refactorings.
Huiqing Li: Yes for Haskell we have example, like you can originally have a tuple but you want to turn it into a record so that is a very useful refactoring for Erlang.
Simon Thompson: And in Haskell you can write concrete data types and we have a refactoring that turns a concrete data type into an abstract data type. And the advantage of that, you may say why you want to do that, and the reason you might want to do that is that then allows you to decouple the client from the implementation, so you can then change the implementation without having to change the client code, whereas if you use explicit concrete data types, you use pattern matching and so if you change the data type you have to change every function that uses pattern matching over that that you’ve constructed.
So being able to move to this abstract view of the type gives you this nice decoupling. I think that is an example of how refactoring often works. In practice you don’t do it necessarily for the sake of the refactoring itself, but for the sake of flexibility, the changes it allows you to make afterwards. So there are things that you do in the course of the development, function extraction is an example of that. You are doing some development, you see: "Oh, right, I can actually reuse that bit of code, I extract it into a function and carry on". So these are refactorings you do, micro-refactorings you do in the course of development, whereas something that we found in building the tool was, and this was true both of HaRe and Wrangler that you start off implementing these simple refactorings, people say: "Oh, that is very nice, but give us ways of using them."
We built a clone detection facility for Wrangler, so you can detect similar code and interactively with the user eliminate that and people found that very useful. And that just uses the things that were there in the basic tool, but what we have done is give people information on top of that to use the tool more effectively. So particularly in test code we found it’s very helpful to turn straight-line test code into code that is more compact and more readable and we’ve had experience of doing that with engineers from Ericsson, so that’s been a very positive experience.
Simon Thompson: We’ve developed it as part of the European project called "ProTest" and we’ve got collaborators in that project who particularly have used it a company named Lambda Stream, Erlang Solutions engineers have used it and particularly a number of groups within Ericsson have been using it. And also we get comments and so on from people around the world who’ve used the tool. If you want to use it it’s part of Eclipse as well as part of Emacs, so there is an Eclipse plug-in for Erlang called ErlIDE and you can get Wrangler as a plug-in for that. And I think Eclipse users expect there’ll be a refactoring tool. They have seen it for Java and we worked quite hard. One of our colleagues George Orosz from Budapest did a project with us in Canterbury and then has helped subsequently on building that integration, which is not terribly easy because Eclipse is a huge beast with all these internal interfaces and so on.
Huiqing Li: I think Eclipse provides API which encapsulates the workflow of refactoring, it's kind of a plug-in that provides selection, predicate analysis, transformation. Wrangler also has this kind of workflow and I think it just defines it as extension points.
Huiqing Li: Yes, I think that's used by ErlIDE, so the integration makes use of the infrastructure and the IDE integration.
Simon Thompson: But it’s true that the refactoring is done in Erlang, so there is an Erlang node or a number of Erlang nodes.
Huiqing Li: The Erlang node and an Eclipse node communicate with each other.
Simon Thompson: That has not been tried. That is interesting.
Simon Thompson: That is true, yes. I am going to guess we have to talk to the ErlIDE people about that and I don’t quite know enough about the Erjang project to know how close it is to being able to run. It depends what libraries they use. In this case it may actually work because they are probably using a relatively constrained group of libraries from OTP I suspect, probably gen_servers, supervision and that’s about it. Yes, it is interesting, I hadn’t thought about it.
Huiqing Li: Wrangler uses the Abstract Syntax Tree (AST) as the internal representation. Basically what it does you get a collection of files, of what you want to refactor and Wrangler parses them into the AST and then adds some semantic information to the AST. Then condition analysis to make sure its transformation doesn’t change the behavior and then you do the transformation part. After transformation you have to produce the source code, so you have to kind of pretty print the AST, but we don’t use the a pretty printer because we have to keep the layout of the original program, so we use a modified version of the pretty printer. which takes the original location information into account so we try to produce a new program which is in the layout as much as possible similar to the original one.
Simon Thompson: Because if you give somebody their code back, refactored but looking completely different they just won’t use it. And it’s amazing people say: "Oh, that is terrible, because I always have a space after my comma" and this sort of list. You don’t put that on the people that are very ...
Huiqing Li: Some people are very strict about layout. "We had the comma here and you move the comma there", so it is very strict.
13. That is an important point for refactoring because otherwise it’s not acceptable. So you take the original source locations, so you just make some modifications and then you essentially pretty print it with the original formatting?
Huiqing Li: Yes.
Simon Thompson: If you don’t reformat a function then you can print that entirely exactly as it was, if you don’t refactor or change a function you print that entirely as it was before.
Huiqing Li: For function that are not changed we use the original token stream so it does connect the original program to the token stream to make sure the layout is 100% preserved.
Simon Thompson: But if you change the function a bit then what you want to do is try and make the new function as close to the old function as possible.
Simon Thompson: Yes. But of course you maybe can’t do that because if you pulled a few lines out from a function then how does that affect it? So it’s not trivial.
Huiqing Li: Sometimes we don't keep the original layout because I if we add something in the middle of the line you have to move some code to the right hand side and the following code may have to be moved as well, so not keeping the originally layout you have to make it perfect after you change the code.
Huiqing Li: Yes we do analysis.
Huiqing Li: We don’t use anything from compiler because I think the compiler compiles Erlang into an internal representation which is different from the syntax tree. Also I think Erlang compiler doesn’t use Syntax_Tools, because we use Syntax_Tools to generate the AST, but the Erlang compiler maybe just uses the parse tree and then does some more transformation.
Simon Thompson: And Syntax_Tools is a nice library that gives you more abstract interface to parse tree manipulation. But we do analyses which the compiler doesn’t do, as well. So for instance if you want to do renaming in Erlang it’s quite possible to have a module, a function and a named process, all having the same name. and those are all atoms. So we have to do an analysis on the basis of: "Is this atom here, does this atom here mean a module, the module foo or the function foo or the registered process foo, because if we are renaming a module we don’t want to change the function and the process. So we have to do this sort of analysis which isn’t done by the compiler. Similarly we have to do a side effect analysis for when we are doing function extraction. So I think those are things that a compiler just wouldn’t do.
Huiqing Li: Yes, another thing, the compiler doesn't keep comment information in the token stream or in the syntax tree, and so the location information is only line numbers, no column numbers, so that is not very useful. So we have to keep all the location information, line numbers, column numbers, also the comments in the syntax tree. So that is not needed by normal compilers.
Huiqing Li: Yes, it’s crucial for refactoring tools.
Simon Thompson: And I think one of the interesting things and a point it’s always worth making is that a refactoring is not just a program transformation. There is a precondition plus a program transformation. It’s often easier to do the transformation. So if you are changing the name of something, changing a name is almost trivial, it's changing a place or a small number of places. But doing the analysis to make sure that change in the name doesn’t change the behavior of the program is more complex. So we have to implement not just transformations, but analyses, but it’s there that we have to do a lot of the analysis work that we do to support those evaluations of those conditions.
Huiqing Li: Yes we have a collection of refactorings, also a collection of code inspection functions which just detect some code smells. Of course after you have detected these code smells, you can think: "Which refactoring do I need to do?" It’s like with clone detection we detect duplicate code so after you have reported the clones you can apply functions like function extraction to code. So I think most refactoring tools also come also with the code inspection tool.
Simon Thompson: And some of the things we report quite at small scale, but other things we report across the whole project, so we have a set of reports about the module structure. So, for instance, we can detect circular inclusions within modules. We found a very nice example: it was ibrowse, which is an open source library where we found there was a circular inclusion and the reason for it was this particular function that was in the top level client code, but was so useful that other things in the library were calling it. And they had this very nice comment: "internal export", so basically it’s saying it shouldn’t be exported in this thing's in the wrong place. And our report pointed to that and once you move it into the library then the structure of the module dependency graph is much simplified. That is a really nice example. And that is helpful for large scale projects.
The other thing that we’ve done and the reason that we are here at this conference is to talk about the moment up to now Wrangler was a black box it had a whole lot of refactorings built-in. But if you wanted to do something else then you had to either smile sweetly at us to get us to implement it or you have to use the low level Erlang libraries that we use. And what we’ve been doing recently is working on a template library that allows you to write transformations in something like Erlang concrete syntax, transformations and conditions, using Erlang macros that make it a lot easier for somebody to write their own refactorings.
Simon Thompson: Yes, exactly.
Simon Thompson: Yes. And what is nice about that we do want to say something about the integration.
Huiqing Li: The best thing about that is that you don’t have to care about layout preservation, for example, you have got the preview for free, also, you can do selective refactoring which allows you to say these places you can change or you can choose to change or not to change, so all these things you can get free. The user only needs to think about: "How do I transform programs?" All the integration with Emacs brings things like Preview or Undo for free.
Huiqing Li: Yes, exactly the same as refactorings provided by Wrangler.
Simon Thompson: Yes, that is right.
Simon Thompson: I can’t remember what I said.It’s a very nice high level method. Smile sweetly.
Huiqing Li: You don’t have to know anything about AST because all you write is just template in Erlang concrete syntax. The AST is hidden. So what you need to know is just a collection of API functions from Wrangler, also the behavior module which encapsulates the logic of the refactoring flow. So all you need to do is provide a few call-back functions to behavior, so you don't have to know AST details.
25. So ad-hoc refactoring is an Erlang behavior that plugs into the whole system. So with the template language, so say I found a bad pattern and write my sample and I put some holes in there, meta variables, to make it a pattern.
Huiqing Li: Yes.
Simon Thompson: But, there's quite a bit of power behind that you can control where it gets applied, you can scan through one file or a whole collection of files, you can check whether it applied at a particular point by seeing whether a condition holds. So it’s more subtle than just doing just a sort of pattern matching in an editor. You’ve got much more control and so we give people the power of writing things that are going to be correct. We don’t check so we are handing over, in Wrangler as it is, you’ve got to trust us, that we’ve written things in a way that keeps your code correct, whereas what we do here is we are giving you the tools, but we are saying it is up to you to make sure that what you are doing is not disrupting. Some of the questions yesterday were focused around could we provide more support, other ways that we could check and that is something that we should think about, could we build in some safeguards.
Huiqing Li: Yes, we could check and make sure that the transformation provided by you doesn’t violate certain constraints of the program. For example you shouldn’t introduce any unbound variables, so you should make sure the program compiles after the refactoring, this kind of collection of constraints, like if these constraints are not satisfied, we tell the user these transformations are going to break the program, are you sure you want to continue with the transformation.
Huiqing Li: Yes.
Simon Thompson: And also, I am slightly nervous about trying to give people the impression that we can do everything for them. And I think in the end it really is handing the tool over to the user, we will never be able to be sure what they have written is not disrupting the behaviour of the program in some subtle way. So I guess we are giving people a higher level way into what we are doing and we are not getting people to construct syntax trees for themselves, for instance, so that avoids the whole lots of problems.
27. And you also provide, as you mentioned, you give access to your semantic analysis so you can know something's a bound variable etc. So users can use this. I suppose you found all of these APIs for predicates yourselves.
Huiqing Li: The AST node is annotated with a lot of semantic information like free variables, bound variables, information about atoms, function definition or process information, lots of information is annotated to each AST node. So we provide a collection of functions which you can apply to meta variables, for example, because each meta variable maps to a syntax tree node, so you can just apply a function to that meta variable and get semantic information from it. But if the information is not enough, there are also some macros like collect, which allow you to collect some information.
Simon Thompson: An example shown yesterday was if you want to write a refactoring to remove an argument from a function that only makes sense if through the definition of the function the argument in that position is never used. So what you have to do is collect all the uses of that function within its definition and make sure that that matches.
Simon Thompson: Exactly. So we can look at the context. We can walk over the tree and collect information and that is it.
Huiqing Li: Yes, the whole project, you can specify whether to use a single file or a whole project.
Simon Thompson: I think it is declarative.
Huiqing Li: It is.
Simon Thompson: We are not using side effects. We are not changing things, we are not doing anything destructible and I think it’s what I would call declarative.
Huiqing Li: It is conditional transformation. So every rule has got a condition, so specify one you want to transform this conditional part, this condition part specifies some conditions in the meta variables, so it is declarative.
Simon Thompson: So we have rules that we have templates that match a bit of syntax, we have rules that do the transformation, we then have strategies that tell you how to traverse the syntax tree. So the rules don’t do any recursion. We have these strategies that say: "Apply this right the way through the tree or apply this through the tree until it succeeds", which you call stop top down and these are ideas of strategic programming. People have done research on this for a decade or more, the language called "Stratego", that separates the local transformation from the global way in which it traverses a tree or a structure. So that is again declarative. I think that's nice, that sort of separation I think is a good lesson.
Simon Thompson: Exactly. So it’s saying these are these stylized ways for doing transformations. You probably want to use one of these, because if you write your own recursion through a tree all sorts of things could go wrong. So I think that is a sensible thing to do. One thing that people have asked us is what can we say about refactoring functional programs in general. So we’ve worked building the HaRe system and the system for Erlang and in a funny way the answer is not really much, because functional languages are such a broad area, on one end you’ve got Haskell which is lazy, which is pure, you’ve got Erlang here which has eager evaluation, has side effects, Haskell has a strong type system, Erlang doesn’t.
The spectrum between those two is such that it’s quite difficult to say anything in general about what you can do for functional languages. In that sense it’s a silly question I think. I mean functional languages are such a broad church. I guess one thing there is there you can always do the encapsulation you might want to, because you’ve got higher order functions you can pass around pieces of behavior wrapped up inside a lambda and that you have an Erlang, you have that in Haskell and that can be helpful. And I think also this general point that we can be more sure about what the effect of the refactoring will be because we have limited side effects or controlled side effects or no side effects.
But there are real subtleties. If you look in Haskell there is what is called a monomorphism restriction. So you’d think if you want to generalize a function over an empty list that happens to be there and say I want to pass in instead of using the empty list there, I want to pass in the argument as a variable, you maybe can’t do that because the empty list might be being used in two different types and you can’t pass in parameters, at least in standard Haskell, which are forced to be polymorphic. In Erlang you’ve got all these issues about the use of atoms. You can dynamically form atoms from strings. So the number of differences there are between individual languages makes me very skeptical about ever building a generic refactoring tool.
Huiqing Li: I agree. For Erlang we especially with the process structure it’s very hard because syntactically there is no boundary for the process, Erlang is all processes and you send messages between them, but there is no map between the receive and send and which processes can share the same loop, so syntactically it’s very hard to say: "Ok, this code belongs to this process" For example if you want to add another tag to the message of this process this receive loop may be shared by another process, so that makes quite difficult to refactor.
Simon Thompson: I think we are sort of happy with a refactoring that takes an ordinary Erlang process to a named process, but in order to do that at the moment processes find out about each other by passing them a variable around with a process ID in that variable. So what you have to do in order to see where this newly named process is being addressed, you have to track through variable values, so that you know that the variable containing that PID is used here and, ok, we should change that to a named reference. So you can’t do everything. What you can do is say: "we are sure that these places should be replaced, but maybe you need to look at these other places because we are not sure if those should be replaced or not. So Erlang doesn’t make that easy. That is certainly true.
Simon Thompson: We've talked about it. Trying to discover the process structure of an application by profiling it, that is clearly something that we talked about, but we haven’t got any further with that, but it’s clearly something that we might think of doing. Of course it’s tricky because if you just run something a small number of times you might be over specific in the conclusion you come to, so it’s a tricky one, but yes that information could be helpful.
34. Last question maybe, now with this ad-hoc refactoring, I guess this would be helpful for... functional languages users build their own domain specific languages or embedded languages, because then they have to discover patterns and transform their own patterns. Have you seen anyone do something like that?
Simon Thompson: Possibly, I am not sure because I think each language might have its own idioms, each domain specific language, or perhaps idioms from its own domain, so possibly yes that is certainly possible. We had a very interesting suggestion in the talk yesterday, somebody saying: "Let’s have a repository of refactoring so that people can contribute their own, that is very nice and I think that is something that we are going to act on. so if you are interested in playing with this, it’s in the latest release of Wrangler and the URL will be up on the screen
[Editor's note: http://www.cs.kent.ac.uk/projects/forse/ and http://www.cs.kent.ac.uk/projects/forse/wrangler/wrangler-0.9/doc-0.9.3/ ]