
Keith Adams on PHP at Facebook, Efficient PHP with HHVM, Optional Typing with Hack

Interview with Keith Adams by Werner Schuster on Dec 06, 2013
40:57

Bio: Keith Adams is a founding member of Facebook's HipHop Virtual Machine (HHVM) team. HHVM is Facebook's just-in-time compiler for PHP. Keith has also contributed to Facebook's search engine. Before Facebook, he worked on VMware's virtual machine monitor. He's a founding member of the I-can't-believe-I'm-a-PHP-advocate club.


   

1. We are here at QCon San Francisco 2013 and I am sitting here with Keith Adams. So, who are you?

I am a computer programmer at Facebook; I have been there for about four and a half years. I have done a couple of things at Facebook. I worked on their search backend and recently I moved to their AI group. The thing that is probably most interesting to your viewers is that I worked on our PHP VM for a lot of the last three years. HHVM, the HipHop Virtual Machine, is our engine for PHP that runs facebook.com.

   

2. So it is a big surprise for many people that Facebook, of all companies, runs on PHP. So why do you use PHP?

Yes, OK. We actually get this question a fair amount. So PHP, if you have somehow been living under a rock and have never heard of PHP, is one of these dynamic scripting languages that you can, for the purposes of this discussion, pretend is JavaScript or Python or Lua or whatever. Any one of these sort of “every variable can take on every value, dynamic dispatch” languages. PHP carved a niche out for itself in the early 2000s doing server-side web stuff. So when Mark Zuckerberg was in his dorm room trying to get a web site up, PHP and Apache, the famous LAMP stack (Linux/Apache/MySQL/PHP), was the Ruby on Rails of its day. It was this very competitive, hard-to-beat stack, and if you wanted to do it on something other than PHP, you would have had to spend months and months setting up some kind of web framework for Python or Ruby or whatever other thing you wanted to use.

The reason you get asked this question a lot is that basically PHP has a really poor reputation and in fairness I do not think that is undeserved. The core quality of the language is a little bit low. It has definitely grown instead of being designed, and it has been used in ways that were never anticipated early on in its life. So there are definitely unforced errors in the design of PHP. I do not want to sound like a complete PHP apologist but I would say that the problems with PHP are a lot easier to see than the benefits. So PHP has these very easy to make fun of, very easy to lampoon misfeatures where you can write some little expression on the board and say “Ha-ha, this expression does not do what you expected, LOL PHP”. But it also has a few opinionated little decisions that make you more likely to be successful rather than less, and I think the three that matter the most start with its attitude about concurrency, which is basically that the unit of concurrency is the web request.

You can have concurrency in PHP, but you do it by curling to localhost, so that means it is shared-nothing, and that means you are copying things in and out, which kind of gets to its attitude about state. Global PHP variables do not survive across web requests. Web requests are a unit of fault isolation, so if I make some horrible mistake in my program and I serve you up a blank page, the next time I go to serve some other user I get to start over with a blank slate again. That actually reduces the cost of bugs, which is a really valuable thing. The other one is just the programmer workflow. PHP's workflow is very much hit “save”, “reload the page”. Even if you are using a language that has a REPL like Python, or Ruby or Lisp or something, it is very easy to end up getting to a state where you have a stateful server that you need to restart whenever you make non-trivial changes, and PHP has a bunch of decisions that kind of keep you in the center of that workflow where you just hit “Save” and reload the page. There are all these results from psychology (programming language design is actually applied psychology) about people’s short-term memories and the things that you can keep in your short-term memory, and if it takes you more than 5 seconds to accomplish some of these tasks, you are just not going to be able to do them.

So PHP has a much better chance of keeping you in that narrow window of figuring out whether what you tried worked or not. So it is a fairly convoluted claim and none of these things are incredibly intrinsic to PHP, you can replicate these in other languages and to some extent in other frameworks, although if you do not have the language on board it’s sort of a matter of time before somebody violates the framework. But I still think it is a lesson to learn from PHP that has actually helped us be more successful.
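The shared-nothing request model described above can be sketched in a few lines. This is a minimal Python illustration, not PHP's or HHVM's actual machinery: `handle_request`, `counter_page`, and `PERSISTENT_STORE` are hypothetical names, and the dictionary merely stands in for a database or memcache sitting behind an explicit call boundary.

```python
# Hypothetical sketch: every request starts from a fresh set of "globals",
# and anything that must persist goes through an explicit store.

PERSISTENT_STORE = {}  # stand-in for a database or memcache (assumption)


def handle_request(script, request):
    """Run `script` with brand-new request-local state, as PHP does per web request."""
    request_globals = {"hits": 0}  # rebuilt from scratch on every request
    return script(request_globals, PERSISTENT_STORE, request)


def counter_page(request_globals, store, request):
    request_globals["hits"] += 1                # request-local: discarded afterwards
    store["total"] = store.get("total", 0) + 1  # explicit persistence: survives
    return (request_globals["hits"], store["total"])


first = handle_request(counter_page, "/home")
second = handle_request(counter_page, "/home")
```

The request-local counter reads 1 on both requests, while the explicitly stored total keeps growing. A bug that corrupts `request_globals` cannot leak into the next request, which is the fault-isolation property being described.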

   

3. I am interested in your claim of the statelessness or this idea of statelessness. This works for web requests, but how do you write your business logic or logic that has to run for a longer time. How does that work?

Right. It is not stateless in a strong sense, obviously. You probably noticed that Facebook has “state”, that you go there and you seem to see things that people have changed about the state of Facebook, for instance. It is not that there is no form of persistent state, but it is really that that state is isolated from the core language, so that state is on the other side of some kind of function-call boundary. There is some clear place in the code where you call the database or where you stick something into memcache or where you write something to the file system or where you communicate with a Thrift service that does storage for you or what have you. But those things are sort of extra-linguistic. They are not sitting around in global variables that are just a core part of the PHP program that will be there the next time you start off your PHP program. And that ends up being kind of an interesting midpoint between stronger versions of statelessness and the practical, because it means you can basically analyze the places where it might be stateful.

   

4. So in a way that seems like other systems, like Node.js, that use processes as their unit of concurrency. Is that something you can say?

Yeah. I am not an expert in Node.js first of all, I have only kind of stared at it from across the room and said it looks interesting. Node.js’s notion of processes, my impression of them, is that it is partly inspired by actors and stuff, which is a perfectly sound way to go. As long as you are in the “shared nothing” boat, I think you are probably getting most of the benefits of what PHP offers. I do not have any real experience using Node.js but yeah, from 100,000 feet it seems like a consciously designed version of what PHP stumbled into.

Werner: In another way of thinking it is the Unix model, where every process lives on its own and you can fork off a new process.

Yes. And the organic way this happened in the original LAMP stack was that these really were processes: Apache was actually forking an instance of PHP, and that would die, and that is why you were stateless. But that turns out to be a sort of wonderful poor man’s garbage collection; it turns out that fork and exit is a good way of getting the system back into a known state after your process trashed its heap and all kinds of other things. We have actually kind of stepped away from that tradition in HHVM. HHVM is a web server and a language runtime and it is a threaded server, it is not using processes for isolation, and it is one of those perennial bikeshed discussions we get into every once in a while. I think processes are awesome but I am old, so I do.

Werner: So when you say it is a threaded server I guess it still emulates the stateless model by using a thread inside a process.

Yes. Each PHP thread runs on a thread-private heap, and the way that you share things across them is through a facility called APC, which stands for Alternative PHP Cache. It is basically across a function-call boundary: it is a key-value store that you can retrieve things from, but it is not a variable that is around forever. So it is still possible to statefully mess things up, just to be clear. It is just a little less easy than other languages make it.

   

5. So PHP is the main language at Facebook or what other languages do you use?

Yes, that is an interesting question. So PHP and C++ are the two most important languages by far. Most people who are writing code at Facebook are spending most of their day in PHP and C++. There is a significant minority that is spending most of their days in Java also. For instance everything we do in the Hadoop Ecosystem, that is all Java and we have a huge investment in Hadoop, too. Of course, it is a big software company so there are people who would rather die than use something that isn’t Java or would rather die than use something that is not C++ or would rather die than use something that isn’t OCaml or Haskell and so on. So there is this long tail phenomenon but, yes, in terms of line count it is basically two humps around C++ and PHP and then a lot of little mole hills, again with an exception for Java – Java is significant. But if you try to look at the volume of OCaml produced, it is really tiny compared to C++ and PHP.

So if you are writing a service at Facebook and you do not know anything in particular about the service and you do not have a strong opinion about languages, then the default is probably C++ just because we have the most expertise built up around it. We have a lot of good C++ programmers around and made a lot of good C++ libraries so it is sort of a nice set of shoulders to stand on if you are not particularly concerned about which language you are going to use. In terms of line count, there is a gigantic amount of PHP and a gigantic amount of C++. The trend over time has actually been for both of them to grow very quickly, but C++ has been growing a little more quickly so if I compare the distribution from when I joined to now, both have grown a lot, but there is more C++ than PHP as a proportion. So it is not as if we were migrating from one to the other or anything like that, they are both growing pretty quickly, just the dynamics of it is that C++ is growing more quickly.

   

6. You already mentioned that Java is used for Hadoop tasks. So where is PHP? Is that mostly front-end web stuff or are there some internal uses too?

Yes. PHP, the two places where it shines are the front-end web as the thing that is actually talking to you over port 80 when you load the site on your phone. It also excels at scripting, so when people write little one off things, you have a nice set of shoulders to stand on if you write them in PHP at Facebook because there is a lot of PHP code that understands how to operate at Facebook and where to find services and things like that.

   

7. So for prototyping essentially or things like that?

Or actual automation too, where it is really scripting. We do database tasks with PHP a lot of the time, things that need to be aware of the business logic. A lot of the business logic at Facebook is in PHP: the actual code that knows what a group is or what a message is or who user ID 3733 is. That code is mostly in PHP; even if there is code that does it somewhere else on the site, the code that is most battle tested and that has to work all the time, full stop, is going to be in PHP. So if you are doing things that, for instance, check privacy or interact really richly with the application objects of Facebook, you are probably writing PHP.

   

8. You said that business logic is written in PHP. If I want to use that, do I import it as a library or is that hidden behind a service boundary?

Yes. We tried it in a couple of different ways and I am not sure that any of them have died out completely. But the equilibrium has been that that code ends up in PHP, because if you try to put a service boundary around it, it ends up being rigid and you end up wanting to violate that, and you end up pushing and re-pushing and re-pushing different versions of that interface. It gets painful because the web tier is on a different schedule than the tier you actually want to talk to and what have you. So usually, if people find themselves wanting to interact with the logic of the site in some way that is hard to narrow down to a tiny little bottleneck, they will end up writing it in PHP. We have tried making little libraries of the PHP, for instance when we had an ahead-of-time compiler, HipHop, the system that preceded the HipHop VM. HipHop produced this big ELF binary of our PHP code and one of the cool things about that is that you had a C++ implementation of the site that you could link in if you wanted to.

There were a handful of people who tried to use that, who tried to go out and aggressively do that, but the problems of them pushing on a different schedule than the web code and needing to keep those in sync ended up making that really difficult. So as far as I know nobody is doing that right now. But in principle, HHVM has a bytecode format for these repositories which you can push separately from source code, and if you are really hell-bent on doing this, you could try to build your software as an extension to HHVM. So you could build the C++ functionality, say it is compute intensive, say you want to decode JPEGs or do encryption or something. You would probably just build that in a native language and then call it through function boundaries from PHP. Or, if it had some really rich and distributed communication pattern, you would wrap some Thrift API around it and make an entry point into it that you touch from PHP.

   

9. Does PHP have some sort of native extension mechanism or would you be using a HHVM specific mechanism?

There is no standard, by the way, for PHP. There is no equivalent of ECMAScript for the world of PHP. It’s tough to talk about PHP in a way that captures both the PHP that everybody else is using currently and HHVM. So I want to put a star there. There are O'Reilly books about how to write PHP extensions for the PHP.net PHP. There are also ways to write extensions for HHVM, but the API is different, so currently if you really want to write one thing and have it run in both places you would have to do some significant plumbing yourself. That plumbing is a little bit repetitive and we are hoping to get to the point where we can automate it sometime, but there are no big announcements there yet. For now we are achieving parity with the ecosystem of open source extensions for the PHP.net PHP one extension at a time, which is very painful. But if you are working for Facebook and you are specifically doing some compute-intensive thing that you want to expose to PHP, there is a little API for writing extensions for the language.

   

10. We have been touching on it for a while now. So HHVM is a project you were involved in?

Yes, that’s right. The story there is that about four years ago now me and several colleagues – Jason Evans, Andrew Paroski – got each other excited about this idea of using runtime feedback to generate better code for PHP. I mentioned HipHop briefly: HipHop was the first new PHP engine that Facebook produced and HipHop was an ahead of time compiler. It works the way the C++ compiler does: it chews on a source base, produces a big ELF binary out the back and that is exactly the same way you would work with your C++. The reason Java programmers tend to use JITs is that Java is a little more of a dynamic language, you load code in and out, some of the bindings are actually mutable and then as you go further and further in the spectrum of dynamicity out to languages like JavaScript and Ruby and Lua, the returns to JIT compilation get stronger and stronger and PHP is much closer to that in the spectrum than it is to the Java or let alone the C++ end of the spectrum.

Our feeling was that there are all these opportunities to optimize that we were leaving on the table. There is always information that you can’t see at compile time that will let you run this program faster and so we were going to build a JIT that shows the world that you can do that. I think at some level we were proven right, but through a more complicated and nuanced story than we probably told ourselves to begin with. In retrospect I am glad I did not know how tall the mountain we were trying to climb was when I started, because I probably would not have started.

Werner: It sounds like a familiar story from how V8 was created which is “how hard could it be?” and it is very hard.

Lars Bak probably had a lot more relevant experience going into V8 than I did going into this. I was coming from several years working at VMware, and VMware had a dynamic binary translator that it used for basically running unrestricted x86 in a safe sandbox on x86. That is a form of JIT compilation and there are some applicable lessons from it, and in retrospect, a lot of the idiosyncratic things about HHVM come from me viewing the world through those lenses at that time. We probably had more to learn than Lars Bak did with V8. There was no Strongtalk license that I could just import into the project that would come with all sorts of useful code and useful ideas that I had done three times before. But yes, it is always harder than you think and the language is bigger than you think. We drew a lot of inspiration from V8 early on in the project. V8 had already shipped by the time we were having these conversations and it powerfully influenced our optimism about how good a job you could do with dynamic languages these days. From across the room PHP looks a lot like JavaScript. It is really only when you start getting into the details that you realize that you cannot just port concepts from a JavaScript implementation to a PHP implementation and expect them to work.

Speaking of V8, long before we settled into doing HHVM, I started out by just trying to write a transpiler that would emit JavaScript for your PHP and then run it on V8. That is basically the benchmark cheating from my talk: I had a system that did not work, but I was still trying to run benchmarks with it, and we had all these wonderful results that were incredibly encouraging. As you start filling in the system and building more and more of the PHP runtime in JavaScript, which is what you end up doing if you want to use a JavaScript VM as a black box, you start to realize that this is actually a foolhardy approach, that the sort of assumptions JavaScript makes about how programs behave are not how PHP programs behave. But it was an interesting place to start.

   

11. From your point of view, what is the nature of PHP programs? Is it more imperative or do you create global functions that you use? It has an OOP system nowadays but that is a new addition; what is the nature of PHP in that regard?

PHP is essentially multi-paradigm these days, so it has first-class functions now. If you want to write functional PHP, you can. There aren’t a lot of big PHP codebases out there in practice that are really aggressively functional in style, but there is no reason why you could not write a Lispy PHP program if you wanted to these days. The house style at Facebook is actually pretty aggressively OO, so there are a lot of classes, most files have one class in them, and there is much more method dispatch than raw top-level function dispatch, at least in terms of the actual application code. The leaves of the call tree are often standard PHP functions and the standard PHP library is mostly procedural. So it depends on what you are looking at. If you were to cut a swath through a time-weighted history of open source PHP projects, PHP codebases are probably mostly procedural, probably because you did not have a working object system until PHP 5 and a lot of the PHP systems that are out there grew up before PHP 5.

So it is tough to generalize to the extent that obviously our top priority is that we need to run Facebook well, so we probably make decisions that look a little bit more like what a Java VM would make in terms of what we emphasize and what we go after. But we also care about running open-source well and the open-source world is a little more procedural so we try to keep that working well. For the most part, the procedural stuff is easier to run fast than the object oriented stuff, the object oriented stuff comes along with a lot of strange rules about, for instance, visibility and accessibility, so it brought along a lot of Java’s machinery, including public, private and protected, but unlike Java you cannot check any of that stuff at compile time in PHP so there really are runtime checks about whether I am allowed to call this method even though it exists and things like that which are not there for procedural style code. So most of the stuff we do for object oriented is just strictly a harder case than the procedural case.

   

12. PHP is a scripting language essentially and those tend to have a lot of the basic library written in native code. How is that in PHP and how does that impact making HHVM fast? Do you just farm out to these or how do you integrate them?

Yes. That is a fascinating story. PHP was born as a glue language and the story was that you would write a tiny little bit of code to render your HTML and do some database queries. But all the heavy lifting and everything that is compute intensive should be off in C, and if you cannot find some package out there that already does it, here is how you extend the language in C and so on. PHP organically grew that way for a long time. It acquired a large and somewhat inconsistent standard library where basically this group of people made some decisions about memcache, that group of people made decisions about MySQL, this other group of people made some decisions about PostgreSQL and so on. In terms of the practical consequences of this: does PHP still behave like a glue language? Well, in some ways yes, in some ways no. If you look at HHVM in flight on Facebook, the size of the PHP application we are running dwarfs the size of the C code, of the native code that we are running. There are tens of millions of lines of PHP; HHVM's codebase, including extensions, is on the order of hundreds of thousands. So most of the logic is in PHP.

If you sample CPU time, actually most of the CPU time is in the runtime. A lot of that runtime work is implementations of language features: implementing the semantics of arrays, allocating memory, setting up web requests and tearing them down. But a lot of it is actual extensions, a lot of it is the actual glue code style: “I am going to run the memcache client now”, and we have this really sophisticated memcache client that does a lot of pooling and connection sharing and clever stuff and tries to hide latency and so on and so forth. Now, from a CPU sample perspective, when we first got HHVM working on the site we spent about 20% of our time in the translation cache, in the output of the JIT, and about 80% of the time somewhere else. Our web tier at the time was, and usually still is, CPU bound, so this is really the thing you want to take a look at. That tells you that the optimal thing to do is to go optimize the runtime, and we spent a bunch of time optimizing the runtime. But the interesting thing was that we found that you could make changes to the JIT output that had much more impact than that model of the world which says “The JIT output is only 20%” could possibly explain.

For instance, early on in the life of HHVM, after we first got it working, one of our engineers, Jordan DeLong, made a change that was basically a pure parameter change. It is a wonderful example: we had made this decision that we were going to search this list from front to back, and he made it search from back to front, or vice versa. That only affected JIT output and had a 14% impact on performance. Before the change, you spent 20% of your time in the JIT and 80% of your time in the runtime, and after the change you spent 20% of your time in the JIT and 80% of your time in the runtime, and somehow this change that only affected JIT output was the tide that lifted all boats. We scratched our heads about that for a while.

We basically thought it was a bad experiment for the longest time and said there was something methodologically wrong here, you broke the JIT for instance, maybe you are not really running the program, because 14% is too big a win. But that was not true. It really was a tide that was lifting all boats and it turns out what was going on there was cache. The cache is a shared medium that causes spooky effects where non-local things can affect each other, so both the 20% over here and the 80% over there are viewing the universe through the same tiny little drinking straw of the L1 cache. So if you make a change to the JIT that also makes the thing that takes 80% run slower or faster, you can preserve that ratio and yet make the whole system slower or faster. Once we realized that was going on, that opened the door to a lot of interesting tuning opportunities. So that question, “Is it a glue language?”, depends on what perspective you are looking at it from.

From the point of view of raw CPU cycles it is behaving like a glue language, but the thing we are gluing together is actually really large and complicated, and the most powerful opportunities we have for affecting performance are still in making the JIT output better code, even though you spend most of your CPU time in the runtime. That ratio has gone up to about 30-70 as we made the JIT more runtime aware and brought more of the overheads of running the native stuff into JIT output. But the rule of thumb still applies and it seems like we have more influence on performance by making a better compiler than we do by tuning the runtime.

Werner: I think PHP is using reference counting for memory management.

This is true, yes.

   

13. What is the impact of that on compiling it and on the JIT? Can you make it fast?

Yes. It is interesting. Reference counting is the mechanism in PHP, as in Python, Perl and lots of other languages that evolved from a folk implementation, a hobbyist’s interpreter, into something that ended up spanning the world. PHP started out with reference counting because it is a really easy place to start. One of the interesting things with PHP is that it exposed that decision to programmers implicitly, in various ways. The two ways that are easiest to explain are these. First, classes have destructors and destructors run at deterministic times, so when your reference count goes to 0 is when your destructor runs, and people do use that to do RAII, C++ style things: delete a file or close a socket or what have you when this object goes away. One of the ones I have seen is that people built a benchmark timer with destructors, where the destructor is the thing that hits “stop” on the stopwatch. You time things by creating this timer object and when it goes out of scope the timer stops, and you will break those programs if you do not at least approximate that part of reference counting.
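The destructor-based timer idiom can be illustrated outside PHP. The following is a small Python sketch, assuming CPython, which also uses reference counting and runs `__del__` as soon as the last reference disappears. `ScopeTimer` and `timed_work` are hypothetical names introduced only for this illustration.

```python
import time


class ScopeTimer:
    """Records elapsed time when the last reference to the object goes away."""

    def __init__(self, label, results):
        self.label = label
        self.results = results
        self.start = time.perf_counter()

    def __del__(self):
        # Under CPython's reference counting this runs at the moment the
        # count drops from 1 to 0, not at some later collection pass.
        self.results[self.label] = time.perf_counter() - self.start


def timed_work(results):
    timer = ScopeTimer("work", results)  # the stopwatch starts here
    sum(range(10_000))
    # `timer` goes out of scope on return, so the stopwatch stops here.


results = {}
timed_work(results)
```

On CPython, `results` contains a `"work"` entry immediately after `timed_work` returns. On a runtime with a tracing collector the entry may appear much later or not at all, which is exactly how such programs break when the 1-to-0 transition is not precise.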

That is why 1 to 0 needs to be precise. It turns out the transition from one to many also needs to be precise, because PHP has value semantics for its arrays. So when you assign an array to some local variable, or when you pass it to a function, you are logically making a copy of that array, but PHP programs have the performance model that this is an O(1) copy, so you actually change the big-O behavior of PHP programs if you cheat on this. It is tempting to cheat, and the world does have a bunch of interesting heuristics for copy-on-write, but if you have somewhere where that heuristic reliably fails you have really changed the actual running time of a program, and that seems dangerous to us.
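To show why the one-to-many transition matters, here is a hedged Python sketch of copy-on-write with an exact owner count. It is not HHVM's array implementation; `CowArray` and its fields are hypothetical, and for brevity the owner count is not decremented when a copy is garbage-collected, which at worst causes an unnecessary detach.

```python
class CowArray:
    """Value-semantics array: copying is O(1); the real copy happens on first write."""

    def __init__(self, items=()):
        self._buf = list(items)
        self._owners = [1]  # one-element list acting as a shared owner count

    def copy(self):
        # O(1): share the buffer and bump the shared owner count.
        other = CowArray.__new__(CowArray)
        other._buf = self._buf
        other._owners = self._owners
        self._owners[0] += 1
        return other

    def __setitem__(self, index, value):
        if self._owners[0] > 1:
            # "Many" owners: detach with a real O(n) copy before writing.
            self._owners[0] -= 1
            self._buf = list(self._buf)
            self._owners = [1]
        self._buf[index] = value  # exactly one owner: write in place

    def __getitem__(self, index):
        return self._buf[index]

    def shares_storage_with(self, other):
        return self._buf is other._buf


a = CowArray([1, 2, 3])
b = a.copy()                         # logically a copy, physically shared
shared_before = a.shares_storage_with(b)
b[0] = 99                            # first write to b detaches it from a
shared_after = a.shares_storage_with(b)
```

After the write, `a` still reads `1` at index 0 and `b` reads `99`. If the owner count were only approximate, the write path would either copy when it did not need to, losing the O(1) expectation, or skip a needed copy and break value semantics.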

HHVM actually implements naïve reference counting. If you look in the literature there are ways to make reference counting fast, and there has been a little bit of a renaissance of this stuff; if you look at Steve Blackburn and his work on making reference counting fast, some of his students have been investigating this. Usually the model of the universe that those systems have is that the transition you care about is between 0 and 1, or between alive and dead. Caring about the difference between one and many is a little bit stranger and complicates things a little bit. So it is possible to do deferred reference counting, and it is possible to build an undoable deferred reference counting where you can basically de-optimize the reference counting you deferred, where you can recover all the precise reference counts later if you want to. It is just really hard to actually come out ahead versus naïve reference counting doing that.

That information needs to go somewhere, that somewhere is going to take data cache and it is a big engineering project. I think there is something there intuitively, it is just a big project to really get your head around. But yes, the last time I measured this, something like 30% of our JIT's output is reference counting related, so it is a big chunk of big nasty code, it is what we spend a lot of time doing and it is very tempting. It feels like low hanging fruit, but it is hard to do without breaking PHP. To some extent, the Python community I think has solved this politically right, so Python also had deterministic destructors and when Jython came along, they said “Hey guys, this is crazy. We are not going to implement reference counting on top of the Java collector”. There was a sort of discussion that was held, that more people felt like they wanted Jython to really be Python and so the official semantics of Python are now finalization semantics instead of destructor semantics. But PHP does not have a formal community, PHP just has implementations that run useful PHP programs and we are more concerned with being an implementation that runs useful PHP programs right now than we are trying to squeeze every last bit out. And also it still feels like we are riding a pretty steep performance curve so, why get into a sort of a fight that involves breaking working code when we are not out of ideas for things that run PHP as is?

Werner: PHP is single threaded. I think that should make reference counting easier or cheaper.

Yes. I am sorry. That is a wonderful point. This is so baked into my implicit model of the universe that I forgot to say this. Yes, these are not atomic instructions that we are doing reference counting with. The heap is private. There are a few optimizations we do that are invisible to the user where some read-only data is actually shared, but we don't have to reference count that normally. Yes, that is a great point. This is not the problem that the Java people have with reference counting, for instance.

Werner: … or Objective-C.

Exactly. These are not massive, 30-cycle lock-adds and lock-decs and whatever. These are the regular read-modify-write sort of x86 instructions doing this stuff.

   

14. We talked about PHP and that PHP is a dynamic language and of course for large codebases it would be nice to have some sort of typing or some static analysis tools. I think you started the project, or Facebook has started a project about this?

I want to triply underline that whenever I am talking about this PHP stuff at Facebook there are big groups of people who work on all of this stuff, but especially this vision of introducing static types incrementally into PHP is really not my baby. This is the intellectual output of my colleague Julien Verlaguet; he is one of those people who writes OCaml and he knows a lot about programming languages. About a year and a half ago he came to talk to some people on the HHVM team with this idea he had for doing static types for PHP. I had also heard this kind of pitch a couple of times, so I was a little tuned out initially, because the usual pattern for trying to introduce static types into one of these dynamic languages is that you have somebody who believes really strongly in static types, does not have a lot of sympathy for why PHP is the way it is, does not have a lot of understanding about why dynamicity might be a valuable thing for productive programmers, and just wants to ram Haskell's type system down your throat or something like that. That is what I expected to hear.

What I heard instead was very different. There is a system that we call Hack for HipHop right now that is a language that targets HHVM, it is a language that has a lot in common with PHP. If you were to do a Venn diagram, it is not quite a proper subset or a proper superset, but there is a gigantic amount of overlap. It takes away very little of PHP and adds very little to PHP, but it adds enough to make an almost completely sound type system tractable. One of the neat things about it is that it was actually designed in such a way that we wanted to – we did not have the option of throwing away all of our code and starting over with some new language - so one of the neat things about it is that they built a lot of awesome tools to actually make the migration from generic PHP to more soundly typed PHP fairly painless. We are not ready to share any hard numbers, but the vast majority of programmers at Facebook who work on the PHP codebase check in Hack every week.

The vast majority of files in our codebase are Hack files, so most of this enormous codebase really has adopted gradual typing. While there have been some rough spots here and there, I think the consensus is that it has actually been really helpful, so it did a lot more good than harm. Of course, PHP has this lovely workflow that I described before – hitting “Save” and reloading the web page – and one of the reasons that I would have expected this to be a disaster is that most static type systems behave like compilers: they are these big programs that you run over your source base and then they tell you something about your source base, and that is a terrible workflow. We did not want that to become part of every interaction with PHP. They did a lot of really great systems engineering in actually turning that job that compilers do into an online task that a daemon performs. They actually have this interesting daemon called HHServer that sits there and monitors your source base and keeps a very fine-grained model of the type dependencies in your code, and when you change something it does not just throw everything out and recompute the ground truth.

It actually diffs the file that you changed, figures out what parts of the type system might have changed and recomputes their soundness and then communicates to your editor to say: “Hey, here are the type errors you just introduced”. So it actually made the workflow better in a way because instead of save/reload the page, the workflow is now actually “Save”. “Save” and I got a pile of errors before I can blink my eye that says: “Here, you did not type this correctly because this expects an int and you passed a string”. Another interesting learning from that is that programmer adoption of that system was really incredibly dependent on good error messages. It is a type inferred system, it is not Java, we are not labeling all the types inside of a PHP program. We are basically only labeling the parameter types and function return types and members in classes. So all the locals are just inferred. A lot of people using type inferred systems had this bad experience of making a change and having a type error in some file you never heard of and some type you have never heard of.

It could not unify this thing you have never heard of with this other thing you have never heard of. So it mattered to design both the type system and the error messages so that they tell a story: there is a beginning, “there is a type error here”; there is a middle, which is “it expected a this and it wanted a that”; and occasionally there is an end, where you have to explain, if there was a generic involved, the path whereby it actually got derived. But those four steps really seemed to be enough to capture the whole story for everything. It is a cool system. It is one of those things that I could talk about all day and it would be hard to get across what's neat about it. You need a demo. Until you see this thing giving you type errors on this multi ten-million line codebase right away, it is hard to believe that you can interact with a type system that way. It could actually be a boon to productivity instead of this block you stumble over every time you want to do something.

Werner: I like the idea of basically cloud-based, static analysis code that runs incrementally. It is an IDE level system.

It sort of factored out that part of the IDE that has a really fine-grained model of your program and made it into a daemon that can talk over an interface to lots of different programs. It can talk to an IDE, if you are using an IDE; it can talk to emacs; it can talk to vim. We are never going to migrate everybody to IDEs, so telling the story “You get this if you switch to whatever your favorite IDE is” is kind of a weak story. We knew that there would be people – I am one of them – where you are just going to have to pry the editor from my cold dead hands, just because my investment in muscle memory is too high. It was an interesting decision that they made. The thing that the IDE has going for it, too, is that the IDE sees the modifications, and you do not hold Eclipse responsible if you do a git checkout: if you do a checkout that changes 10,000 files, you realize Eclipse is going to churn for a while and it will give you a progress bar somewhere. But if you do not have a big integrated experience that you control that way, you have to be breathtakingly fast. They actually come back online very quickly even after things like doing a git checkout that fast-forwards over 10,000 diffs and changes 20,000 files, which is pretty impressive.
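The incremental rechecking described in this answer can be sketched roughly as follows. This is a toy model under stated assumptions, not the actual daemon: the class `IncrementalChecker`, its data layout (each file declares functions with one parameter type and contains calls with one argument type), and the error message format are all hypothetical. The point it illustrates is that a save only triggers a recheck of the saved file plus the files that call a function whose signature changed.

```python
from collections import defaultdict


class IncrementalChecker:
    """Toy incremental checker (hypothetical): files declare functions
    (name -> parameter type) and contain calls (function, argument type)."""

    def __init__(self):
        self.decls = {}                  # file -> {function: param_type}
        self.calls = {}                  # file -> [(function, arg_type)]
        self.callers = defaultdict(set)  # function -> files that call it
        self.errors = {}                 # file -> list of error messages

    def _signature(self, func):
        # Look up the declared parameter type of a function, if any.
        for decls in self.decls.values():
            if func in decls:
                return decls[func]
        return None

    def _check_file(self, path):
        errs = []
        for func, arg_type in self.calls.get(path, []):
            expected = self._signature(func)
            if expected is not None and expected != arg_type:
                errs.append(f"{path}: {func} expects {expected}, got {arg_type}")
        self.errors[path] = errs

    def update(self, path, decls, calls):
        """Record a saved file and recheck only what could be affected.
        Returns the set of files that were rechecked."""
        old_decls = self.decls.get(path, {})
        # Functions whose signature was added, removed, or changed.
        changed = {f for f in old_decls.keys() | decls.keys()
                   if old_decls.get(f) != decls.get(f)}
        # Drop the stale reverse-dependency edges for this file.
        for func, _ in self.calls.get(path, []):
            self.callers[func].discard(path)
        self.decls[path] = decls
        self.calls[path] = calls
        for func, _ in calls:
            self.callers[func].add(path)
        # Recheck the saved file plus every caller of a changed function.
        to_check = {path}
        for func in changed:
            to_check |= self.callers[func]
        for p in to_check:
            self._check_file(p)
        return to_check


checker = IncrementalChecker()
checker.update("lib.php", {"render": "int"}, [])
checker.update("a.php", {}, [("render", "int")])
checker.update("b.php", {}, [("log", "string")])
# Changing render's parameter type rechecks lib.php and a.php, but not b.php.
rechecked = checker.update("lib.php", {"render": "string"}, [])
```

A real checker tracks far finer-grained dependencies than whole files and single parameters; the sketch only shows why keeping a reverse-dependency index lets a save avoid recomputing everything.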

   

15. You mentioned that Hack takes certain things away from the PHP language. Are those particularly naughty things like eval or something like that?

People are always fascinated with eval. It is interesting. Eval is disallowed in Hack, as it turns out. But eval is not necessarily the craziest thing in the whole world for dynamic languages to support well. Another interesting design decision with Hack, by the way, is that HHVM just takes the Hack sugar, the Hack syntax, and desugars it. It has absolutely no influence on code generation for the back end. The type system is not quite sound, so we can’t take the things it proves all the way to the bank. The things that a compiler can prove, it can prove all the way down to the ground truth, or at least it should be able to. We do not actually view this… People tend to see this as a performance optimization or a way to make PHP go faster. Never say never! Maybe we will get there, but at least the JavaScript people's experience with the ECMAScript type annotations suggests that it is really hard to come out ahead from exploiting partial type information in a mostly dynamically coded language. We are subtracting mostly some exotic stuff, so for instance you can do break $x as a form of dynamic control flow, where $x is some number and that is how many loops you will break out of. If $x is three you will break out to this loop, but if it is two you only break out to this loop.

Werner: Without a label, dynamically.

It is not a label name, it is a variable.

Werner: That sounds like fun.

It is a computed goto. Sometimes people use it as a computed goto. That is off the table. One of the reasons why that is off the table, in case it is not clear: PHP is not block structured. It means you need to do control flow analysis to type infer local variables. Having Turing-complete stuff in the middle of your control flow analysis isn’t going to work, necessarily. People have mostly thanked us for this. We haven’t had a lot of people hurl “You will take break $x from my cold dead hands” at us. There are a few other things that I am struggling to recall right now, but a few of them are basically syntactic irregularities. With PHP you cannot quite parse context-free, and we took away some of the things that would have made that impossible. I am sorry. I do not have this at my fingertips, but it is a very small subset of things that occur very rarely.
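To make the control flow problem concrete, here is a rough emulation of a break with a runtime-computed depth. Python has no such construct, so this sketch assumes an exception that carries the remaining depth; the names `BreakOut` and `run` are hypothetical. It only illustrates that the statement executed after the break depends on a value known at run time, which is what makes static control flow analysis awkward.

```python
class BreakOut(Exception):
    """Carries how many enclosing loops still have to be exited (hypothetical)."""

    def __init__(self, depth):
        super().__init__(depth)
        self.depth = depth


def run(depth):
    """Run two nested loops and break out of `depth` of them dynamically."""
    trace = []
    try:
        for i in range(2):                  # outer loop
            try:
                for j in range(2):          # inner loop
                    trace.append((i, j))
                    raise BreakOut(depth)   # stands in for `break $x;`
            except BreakOut as b:
                if b.depth > 1:
                    # More loops to exit: propagate with one level consumed.
                    raise BreakOut(b.depth - 1)
            # Reached only when the break stopped at the inner loop.
            trace.append(("after inner", i))
    except BreakOut:
        pass                                # the break exited the outer loop too
    return trace


one_level = run(1)
two_levels = run(2)
```

With a depth of 1 the code after the inner loop runs on every outer iteration; with a depth of 2 it never runs. An analysis that infers types of locals by following control flow cannot tell which of those paths is taken without knowing the value of the depth.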

   

16. Is Hack available publicly?

Yes, this is a great question. One of the bummer things about telling this story right now is that we have not open-sourced it yet and, honestly, I hope that talking about it partly acts as a commitment device to help get us off of our collective tuches and get it open-sourced. Our intention is that this should be open-sourced. There is really no point in hoarding a language like this. If anything, it would be wonderful if someday, years hence, people would already know Hack when Facebook hires them.

Werner: But I think HHVM is open-sourced.

Yes, HHVM is promiscuously open-source. We are under the same license that Zend uses, so you can mix and match code if you have got an extension you want to port, for instance. There should not be a problem with that. It is github.com/facebook/hiphop-PHP. We develop it out in the open, we have been lucky enough to get some contributions back from the community going over the last year. If you are curious, give it a shot. We also have binary distributions for popular Linux distros and MacOS is still coming along, but it is still work in progress.

Werner: That sounds interesting. I guess all our PHP listeners and audience would like to try it out if they haven’t already. HHVM.com

Yes, we are at HHVM.com.

Werner: So we will all check it out. Thank you, Keith

Cool. Thank you very much.

InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.