I'm a developer and an entrepreneur, probably best known for being one of the creators of the Seaside web framework, which is a Smalltalk web framework, and for being the co-founder of the company that built Dabble DB and is now working on a new product called Trendly. I'm Canadian, I live in Vancouver. Actually, I live on a small island off the coast of Vancouver - that's me. And I like Smalltalk.
2. So Dabble DB is written in Smalltalk?
Yes. The core of Dabble DB is written in Smalltalk. We do use a bunch of Ruby scripts and actually a little bit of Python for sort of infrastructure stuff, peripheral stuff, but the core is written in Smalltalk and runs on Seaside.
3. What's your new project? Is it a product yet? What is Trendly?
Trendly came out of something that we really needed internally for Dabble DB, which is that we use - like a lot of people do - Google Analytics to track our web traffic. For a web business, the traffic is really your lifeblood and you really need to understand it. The thing about Google Analytics is that it does a great job of collecting data and just showing the raw data, but for an analytics product it does very little analysis. What we are interested in is not necessarily that the traffic went down 10% today or went up 5% tomorrow, but whether there have really been any meaningful changes to the traffic patterns - and I mean that in a statistically significant sense. Has anything changed that we really need to pay attention to, rather than just noise?
I spent a while just reading about stats and researching this stuff to come up with a model that I thought would work for this kind of traffic. We were kicking this around with a bunch of people inside the company and also people from other companies, like Patrick Collison from Auctomatic or Bob Hardsorn [?], who's now at Facebook but was in a startup at that time. What we realized is that what we really wanted was a news feed, almost like the Facebook news feed: where it tells you what your friends are up to, we wanted to know what our traffic was up to, and to pick and choose the events that we reported on based on whether or not they were truly significant.
If your traffic just randomly goes up a bit, we don't want to tell you about that, but if you've been slashdotted and suddenly you've got huge referrals from everywhere, then that's something we want to report. Or subtler things, like you've changed your website and suddenly your average conversion rate has gone from 3% to 3.5% - if your average has truly changed, that's something that we want to report on.
We had this statistical model, we had this idea of a news feed, and we spent a few months just iterating on how that turns into something useful and usable and what kinds of visualizations we can do to support it. That's what my talk at FutureRuby is about, that sort of long iterative design process. What we ended up with I think is quite good, and we started previewing it with people and we hope to fully launch it as a product quite shortly.
4. Is it available right now to look at?
Anyone can go right now and connect their Google Analytics account and we will pull in their data and do at least a one-off historical report visualizing their traffic for the last 3 years, say, if they have that much data in it. What we are going to be doing going forward, and what we would be charging for, is recurring updates to that. The real value is in checking it every day, or every week, or having it email you when interesting things are happening, so that you can track what's happening with your data over time.
For now, the preview lets people get a feel for it and - for people who are willing to share the results with us - lets us understand how that kind of engine is doing on different profiles of sites, from very small sites that maybe are only getting 100 visitors a month, up to people who are getting 1,000,000 visitors a month. Does the analysis work across all the different site profiles?
5. How do you get my data from Google Analytics?
Google Analytics released an API - I think it became publicly available at maybe the beginning of May or the beginning of April - and they use a sort of combination of OpenID and OAuth to authenticate you. So they authenticate you - we never see your Google password - and then you tick a box saying Trendly is allowed to access these parts of your Google account. We ask for the email address so that we can get in touch with you, then we ask for your Google Analytics data, and we may, at some point, ask for your Google Calendar data, so you can use Google Calendar to mark significant events, like "This was the day that I released a new version of the website" - but we don't do that yet.
Then Google Analytics basically gives us an API token and we can use that to do queries against your Google Analytics data. In our case, we're really doing a huge number of queries, because we are trying to pull in as much data as we think we can process for you. Then we run a bunch of statistical analysis on it and, based on the results of that, we do a bunch of visualization - generating charts and sparklines and news reports and that kind of thing.
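Just to make that concrete - and this is only an illustrative sketch, not our actual code - a query against the Data Export feed of that era was essentially an authenticated HTTP GET. The feed URL and parameter names below are from memory and should be treated as assumptions, and the Authorization header is schematic (a real OAuth 1.0 header carries several more fields):

```ruby
require 'net/http'
require 'uri'

token      = ENV['GA_OAUTH_TOKEN']   # hypothetical: obtained during the OAuth authorization dance
profile_id = 'ga:1234567'            # hypothetical Analytics profile id

# Parameter names follow my recollection of the original Data Export feed - treat as illustrative.
uri = URI('https://www.google.com/analytics/feeds/data')
uri.query = URI.encode_www_form(
  'ids'        => profile_id,
  'dimensions' => 'ga:date,ga:source',
  'metrics'    => 'ga:visits',
  'start-date' => '2006-07-01',
  'end-date'   => '2009-07-01'
)

request = Net::HTTP::Get.new(uri)
request['Authorization'] = "OAuth #{token}"   # schematic; the real header has more fields

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
puts response.body   # an XML feed of daily visit counts, ready to run the analysis on
```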
Yes. Let me step back and tell you something about the model that we use. We use a stepwise linear model. The assumption is that for whatever metric we are looking at - let's say the metric is the number of hits a day you get from Twitter - that number is obviously going to change day by day, but we assume that the average is going to be the same for a relatively long period of time. For a month or a year, we're going to have the same average hits from Twitter, and then suddenly Ashton Kutcher is going to post about you and everyone is going to retweet it or whatever, and that will cause a shift in the average; and for a while, while this kind of meme is propagating out and Demi Moore is retweeting Ashton Kutcher tweeting about how cool InfoQ is, that average is going to be quite different.
Then maybe it will fall back again, or maybe you'll have broken out and everyone's always going to be tweeting about InfoQ from now on, so that average is going to stay at that high level. What we are trying to do statistically is simply to detect where those shifts happen and figure out what our confidence level is about whether or not a shift happened and about when a shift happened. Although the model assumes that the shift was abrupt - that there was a change from one step to the next - we may be able to tell that it happened on a particular day, or we may only be confident that it happened some time that week or some time in a given period.
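To give a flavour of the idea - and this is a deliberately naive toy, nothing like the model Trendly actually uses - you can score every candidate split point of a series by how strongly the means on either side differ relative to the noise, and call it a shift when the best score clears some threshold:

```ruby
# Naive step-detection sketch: find the split point where the difference in
# means, relative to the pooled spread, is largest. Threshold is arbitrary.
def mean(xs)
  xs.sum.to_f / xs.size
end

def stddev(xs)
  m = mean(xs)
  Math.sqrt(xs.sum { |x| (x - m)**2 } / xs.size)
end

# Returns [index, score] for the most likely shift, or nil if nothing clears the threshold.
def detect_shift(series, threshold: 3.0)
  best = (2..series.size - 2).map do |i|
    left, right = series[0...i], series[i..-1]
    noise = [stddev(left) + stddev(right), 1e-9].max
    [i, (mean(right) - mean(left)).abs / noise]
  end.max_by { |_, score| score }
  best && best[1] >= threshold ? best : nil
end

daily_hits = [3, 5, 4, 6, 4, 5, 48, 52, 47, 50, 49]   # made-up hits/day from Twitter
p detect_shift(daily_hits)   # => [6, ...] - the day the average jumped
```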
That model, I think, maps fairly well to web traffic data, in that it's usually pretty constant until some event happens - whether it's an internal event, like you changed your website, or an external event, like someone links to you, or some kind of complicated combination, like you search-engine-optimized your website and Google finally noticed that and updated you in their index and now you are getting more search traffic from Google, or whatever. This could obviously also be applied to other types of data - sales data for a company, probably load numbers on a server. There are lots of kinds of time series data that you can apply this kind of analysis to.
The nice thing for us about traffic data is that Google Analytics does have this API, so it's really easy for us to get you started. All you have to do is two clicks - yes, Trendly can use my Analytics data - and we can run the analysis. For other kinds of data, it would be more complicated for us to get a large volume of data from you. To start with, this is a nice easy market and it's a well-defined way for us to market the product, but if people do like what it's doing for their traffic data and think it would be nice to integrate other kinds of data into it, then that's something we'd like to talk about, and maybe we can publish an API that will let them do that or something like that.
It's a very visual talk, so it's hard to do in an interview. A lot of it is showing screenshots of what the product looked like as it was evolving, but one part of the talk that is less visual and that I can talk about is the implementation. We went through a lot of different versions of both the front end implementation and the back end implementation. On the front end, the central piece of Trendly is a chart. It's sort of a stream chart - I don't know if you've seen those Last.fm visualizations of people's music listening history, but it was inspired by that. I wrote the code to do that chart at least 6 times, I think.
What we finally ended up using was Ruby and ImageMagick, generating it on the server side, but I also did a version in Canvas, I did a version in SVG, I did a version in Flash, I did a version using Ruby and OpenGL - trying to hit the sweet spot of performance and appearance, because the problem with ImageMagick is that it is not hardware accelerated, so it's actually quite slow to generate the charts, and with OpenGL we could never get the antialiasing the way we wanted it to look. Java 2D actually looks like it would be a good way to do it, but I don't want to write more Java than I have to.
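For what it's worth, the server-side approach is conceptually simple. Here's a minimal sketch of drawing one filled "stream" band with RMagick; the drawing calls are standard RMagick, but the data and styling are invented for illustration and it's nothing like the real chart code:

```ruby
require 'rmagick'

# Daily hit counts for one traffic source (made up for illustration).
hits  = [12, 15, 14, 30, 42, 38, 40, 55, 50, 48]
width, height, step = 600, 120, 60

img = Magick::Image.new(width, height) { self.background_color = 'white' }
gc  = Magick::Draw.new
gc.fill('#4a90d9')
gc.stroke('none')

# Build a closed polygon: out along the data points, then back along the baseline.
top    = hits.each_with_index.flat_map { |h, i| [i * step, height - h] }
bottom = hits.each_index.to_a.reverse.flat_map { |i| [i * step, height] }
gc.polygon(*(top + bottom))

gc.draw(img)
img.write('stream.png')
```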
On the client side, Canvas and SVG were too slow. Flash was great, but it was limited to 4,000 pixels in any dimension of a Flash movie, and we wanted these long scrolling charts, like 25,000 pixels long, so it was going to mean splitting them up into multiple Flash movies and it was just a huge pain. Something similar happened with the backend analysis implementation. It started out being all written in Smalltalk, and what we found was that, for a variety of reasons, breaking off parts of it into other little helper scripts ended up being necessary. For example, my initial chart implementation was written in Squeak, using the Squeak graphics primitives, but again, the antialiasing wasn't as good as I wanted, and that ended up in Ruby and ImageMagick.
Some of the core number crunching code was just running too slowly in any dynamic language, so I ended up writing it in Java. It would have been better to write it in C in some sense, but Java was the nicest environment to develop in that still had acceptable performance. I'm sure now somebody is going to tell me I should have written it in OCaml. We ended up with this amazing hodgepodge - it's a pipeline. It's almost like a make process, where each of these little programs running in different languages takes an input file and produces an output file, which is also nice in terms of parallelizing and distributing the work across however many CPUs we have to work with.
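A toy version of that pipeline shape - with every stage command and file name invented for illustration - might look like this; because each stage is just a command that turns one file into the next, independent accounts parallelize trivially across cores:

```ruby
# Toy make-style pipeline. The stage commands and file names are made up;
# the point is the shape: one input file in, one output file out, per stage.
STAGES = [
  ['fetch',   'ruby fetch_analytics.rb %{in} %{out}'],      # talk to the Analytics API
  ['crunch',  'java -jar stepfit.jar %{in} %{out}'],        # heavy number crunching
  ['analyze', 'squeak analyze.image cluster %{in} %{out}'], # clustering / "thinking"
  ['render',  'ruby render_charts.rb %{in} %{out}']         # ImageMagick charts
]

def run_pipeline(account_id)
  input = "data/#{account_id}.raw"   # assumed to already exist
  STAGES.each do |name, template|
    output = "data/#{account_id}.#{name}"
    # Like make: skip the stage if its output is already newer than its input.
    unless File.exist?(output) && File.mtime(output) > File.mtime(input)
      system(template % { in: input, out: output }) or raise "#{name} failed"
    end
    input = output   # the next stage reads this stage's output
  end
end

# Independent accounts can run on separate CPUs.
accounts = %w[acct_001 acct_002 acct_003]
accounts.map { |a| Thread.new { run_pipeline(a) } }.each(&:join)
```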
But the way I describe it in the talk is that the thinking is done in Smalltalk - the complicated analysis stuff, the clustering; the computing is done in Java - just the hardcore number crunching; and the interaction is done in Ruby - any time we need to deal with the operating system or a C library, like ImageMagick. That just leaves the interaction with the user, and the interaction with the user of this product ended up being a huge amount of Javascript code.
It's really basically a big Javascript app that just pulls a little bit of JSON and some images from the server and then builds the UI from that. There is lots and lots of code running on the client to deal with the user interface of the visualization. It's a lot of fun, actually, because there are all these different parts and different languages that we are playing with. I think some people would run away screaming from that, but me, I really enjoy it.
8. What about using Objective-C instead of Java?
That would have been fine. I guess I'm more familiar with using Objective-C on Mac OS X, and in a server environment we tend to deploy on Linux servers. I know the GNUstep project does exist and GCC can compile Objective-C, but I'm not as familiar with that particular environment. Java is fast enough.
9. What do you use to host Trendly? What do you host it on? EC2 or on your own servers?
We have dedicated servers that are hosted. They're in a Montreal datacenter - iWeb is where they are - and we basically rent a big beefy server from them and host on that. The main thing for us is to not have to deal with buying and installing and replacing hardware, but rather than having a lot of little individual slices or EC2 instances to manage, for system administration it's easier for us to have one very large 8-core machine. For now, all of Trendly can run on that. As more people use it, we may need to add more into the mix.
The thing to say about Trendly from the deployment point of view, or from a server resource point of view, is that it's a batch job. This data only changes every day, so we can queue up overnight a whole bunch of analysis that we need to do and do it at our leisure. We don't really need the kind of dynamic ability to add and remove server capacity on an hour-by-hour, minute-by-minute basis, because we can choose when to run these jobs and make sure that we're making maximum use even of a single server.
Then, when people actually access it, effectively we are just serving static files - we've already done the analysis and saved it, and all of the dynamic stuff happens on the client side in Javascript. It's actually really nice from that point of view. It's such a relief not to have to worry about server load spiking because 100 people are using it at once. If 100 people sign up at once and we are trying to do their analysis, we just queue them up and get to them when we get to them.
10. On the Ruby part of Trendly, what do you use - MRI 1.8, 1.9, JRuby, others?
Yes, MRI 1.8, I guess. Whatever 'apt-get install ruby' gives, whatever the default is. I'm not trying to find a bleeding edge implementation. We did consider using JRuby. It's conceivable that by using JRuby and Java, and porting the Smalltalk stuff to Ruby, we could get this all down to a single runtime. If we ever decided we wanted a downloadable version of this that someone could run behind their firewall, it would obviously be easier if this were all packaged up into a single JVM and we could just send them a JAR file or something. But for now we just use MRI.
One of the things that has come up with Trendly is that, as I said, a huge amount of the interface is written in Javascript, and I've seen that, increasingly, with every project we do, more and more of it is happening on the client side, in Javascript, and less and less of the user interface part is happening on the server side. In this case, for example, we don't do any generation of HTML on the server side; that all happens on the client side in Javascript.
We have somewhat of a division in our company between backend engineers and frontend engineers. Luke Andrews does all of our design work, but also does the majority of the Javascript frontend development. As that shift happens, it's putting more and more of a burden on him, and we would love, as backend engineers, to be able to help him out, but we are used to working with different tools in different languages.
We certainly could. I mean, we are comfortable enough in Javascript, we could just write Javascript, but I guess we have certain ideas about what nice language syntax is, and Javascript doesn't necessarily meet those; certain ideas about what nice tools are, and Javascript doesn't necessarily have those; certain ideas about what nice ways of structuring programs are, and Javascript - maybe you can come up with a convention that can work, but it isn't necessarily best suited for that.
One thing that I've been experimenting with, really just as a kind of side project experiment for now, is whether or not I could compile Smalltalk - although you could just as easily do this with Ruby - to equivalent Javascript, and that would let us write our interfaces in Smalltalk rather than writing them in Javascript. I should be clear that I'm not aiming to perfectly recreate the Smalltalk language on top of the Javascript runtime, which certainly you could do. My goal is that there are all these very highly optimized Javascript VMs coming out, like V8, or the work Mozilla is doing on TraceMonkey, or SquirrelFish Extreme from Apple, and I want to take the best advantage of those.
What I want is not just to be generating Javascript that runs, but Javascript that is more or less idiomatic. Not necessarily that someone reading the Javascript would enjoy reading it, but that the VM isn't going to be surprised by the things that you are doing. One good example of this is that Smalltalk and Ruby both have this notion of an early return from a block returning from the enclosing method. In Ruby, say, if you have 'array.each do ... end' and then somewhere in that do block there is a return, that doesn't just return from that iteration of the block - it actually returns from the entire method.
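A small Ruby example of what that means in practice - not from Trendly, just to illustrate the semantics:

```ruby
# `return` inside the block exits find_first_admin itself,
# not just the current iteration of `each` - a non-local return.
def find_first_admin(users)
  users.each do |user|
    return user if user[:admin]
  end
  nil
end

users = [{ name: 'ada', admin: false }, { name: 'grace', admin: true }]
p find_first_admin(users)   # => {:name=>"grace", :admin=>true}
```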
The only way to do that in Javascript would be to throw an exception, and then you've got exception handlers everywhere, basically wherever you've got a block. I don't know, I haven't measured what the performance impact of that is, but I can certainly imagine that, even if it's not true now, Javascript implementations might in the future be very much slowed down by it. What I'm trying to do is come up with a subset of Smalltalk that maps nicely to Javascript. So far, that's been pretty successful. The fun thing so far is that, because I'm sort of envisioning eventually a Smalltalk-like development environment inside the browser, it needs to be self-hosting, and so I've been playing around with PEG grammars, which are a lot of fun.
Parsing expression grammars - this Smalltalk subset has a PEG library that lets it parse and compile itself. It's fully bootstrapped that way, and you can just load a bootstrap file into a browser and go from there. The name of this project, for now anyway, is Clamato, and I was trying to figure out the other day where that even came from. I think it was that I was talking to Chad Fowler and Chad said "Hey, I just made this neat discovery! There is this thing called Clamato and it's great! It's clam juice mixed into tomato juice!" Most people, I think, would have gone "Oh, that's totally disgusting!", but I'm Canadian and Clamato is a Canadian thing, a Canadian invention, so this was totally natural for me.
For me, the response was to say back, "Yes, and you should try putting that into a Bloody Mary. It's called a Caesar." I don't actually even like Caesars, but it somehow triggered a sense of national identity, and the next project that I did got randomly called Clamato because of that. It's on GitHub, you can find it and check it out. I really don't know where it's going, and I don't know if it's just what I'm interested in this week and it will go away, but I think with the amount of investment that's going into Javascript runtimes right now, we would be foolish not to take advantage of it. I'm curious to see whether V8, for example, becomes more of a compilation target for new dynamic language implementations or just new dynamic languages.
12. What's the current subset of Smalltalk in Clamato?
Right now, one difference between Clamato and Smalltalk is that I have just a file-based class and method definition syntax, rather than having a full environment. I guess it's like GNU Smalltalk in that respect, although I've tried to be more minimal, to have less punctuation in the syntax, but that's irrelevant. There is no metaclass hierarchy, in the sense that instance methods get inherited but class-side methods - what in the Java world you might think of as static methods, or in Ruby as singleton methods - don't get inherited! That's the same as in Javascript.
If you attach a method to a constructor, stuff that has that constructor's instance as its prototype won't get that method; but with instance methods, there is a normal inheritance hierarchy. That's one big, but I don't think terribly significant, difference. I mean, Smalltalkers just love the metaclass hierarchy, but I think for a lot of beginners it's just a source of confusion more than anything else, and so far I'm not hugely missing it. There are no cascades in the syntax, which is probably the thing most likely to change. I may add those in if it gets too frustrating.
For people that don't know Smalltalk, a cascade is a syntax for sending a succession of messages - calling a bunch of methods in a row - on the same object. Where it's most useful is when you are configuring a new instance: you've just created something and you want to call a bunch of methods on it to set a bunch of stuff up. It's nice if you don't have to have a temp variable that you keep referring to when you are doing that, but again, it's no big deal. The biggest change, probably - getting back to what I was saying before about early return semantics from blocks - is that there is no explicit return at all. Like Ruby, there is an implicit return of the last expression - or like Lisp, I guess - but there is no way at all to return early from a method.
That changes a little bit how you structure stuff. I'm used to guard clauses, where you check something and return early if it isn't what you want, or if there is a base case in a recursion or whatever; instead, you end up with the kind of nested if expressions - very expression-based stuff - that you might see more in something like Scheme. But I do really like that implicit expression return. When you've got a simple method, it's really nice to just have a single expression and have that be the result, rather than having to explicitly return. It's just an aesthetic thing, but the semantics of early returns from blocks were going to be too un-Javascripty for this project.
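In Ruby terms - just to illustrate the trade-off, not Clamato syntax, and with made-up method names - the difference in shape is roughly:

```ruby
# With early return: a guard clause up front, then the main case.
def shipping_cost_with_guard(order)
  return 0 if order.empty?
  order.sum { |item| item[:weight] } * 0.5
end

# Without early return: the whole method becomes one if/else expression,
# whose value is implicitly returned - more Scheme-like in shape.
def shipping_cost_expression(order)
  if order.empty?
    0
  else
    order.sum { |item| item[:weight] } * 0.5
  end
end

p shipping_cost_expression([{ weight: 3 }, { weight: 1 }])   # => 2.0
```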
The only other change that I can think of offhand is that, rather than declaring instance variables up front as in Smalltalk, and having that be a standard amount of memory that's allocated when you create an object, in Javascript you can always create a new property, so that's true for instance variables in this language as well. Instance variables work more like they do in Ruby, and I actually use the same syntax of having an @ sign in front of instance variables. Any time you use an @ sign in front of an identifier, it creates a new instance variable for you on the fly.
13. Which means you don't have to use self and the name of the getter, I suppose, as in Smalltalk?
In Smalltalk you can always reference the name of the instance variable directly if you want to, but they do have to be declared as part of the class definition, as you would have to in Java or Objective-C or anything else. In general, I like that, because you have very clear knowledge of what the memory profile of a class is going to be. If you declare the instance variables and it's going to allocate exactly that many slots, then you can predict precisely how much memory an instance of that class is going to take up. In Ruby, the implementation might use a hash table, which might use an arbitrary amount of memory; it might be allocating slots explicitly; or it might be some mix of the two. So you can get much less predictable and also much higher memory overhead.
Let me answer about the current state of Squeak first. Squeak has, as long as I've been working with it, had sort of a split personality. Squeak was originally developed as a platform for educational software, for Etoys, for the research that Alan Kay was doing, and so it has always had those roots and been in many ways geared towards that use, but people like me have realized that it is also a very useful commercial development platform. There has always been a little bit of tension - not between people in the community, because I think the community is quite cohesive and friendly and supportive, but tension between these different goals.
People who are using it for commercial development might be more interested in having a minimal kernel image that you can load specific packages into, and much more interested in the development tools, while people who are using it as a platform for educational software are more interested in having good media support, good animation support - stuff that doesn't affect me at all. The difficulty in the Squeak community has always been in managing these conflicting goals and yet still having a single thing that we can call Squeak. The best way to think about it is that there have been a number of new distributions of Squeak that have come up.
In the same way that Linux has distributions geared towards people using it for desktop use, distributions geared towards people using it for server use, and distributions geared towards people who want to compile their own stuff, in Squeak we are seeing distributions geared towards academic research, distributions geared towards 3D environments, distributions geared towards educational use and distributions geared towards commercial development.
Pharo is probably the most recent player and is getting, I think, a lot of attention. Pharo is really trying to be a commercial Squeak, in the sense that it's a Squeak that's intended to be used for that kind of development. Personally, one Squeak image is pretty much like another to me. I can load most of my code into any of them, they all have pretty much the same development tools plus or minus a few bells and whistles, and they all basically work. If there are a few megabytes of cruft that I'm not using - well, there are a lot of megabytes of system libraries on my Mac OS X machine that I don't use either, and I don't spend a lot of time obsessing about that.
I've also trained myself to be comfortable in the lowest common denominator Squeak. Whether it's got bitmap fonts or TrueType fonts, I don't care. If it doesn't have syntax coloring, it's not going to bother me. There is a lot of fuss about this, a lot of handwringing in the community about how we deal with it, and a lot of technical effort being spent on how to strip down to a minimal image. I think that's great, but it doesn't really affect me very much. What we use for Dabble is maybe a Squeak image from ca. 2005 or something, and that still works fine. For Trendly we are using a somewhat newer image, but I don't even know which, because it just doesn't matter that much.
So the state of Squeak is: on one hand, there is a lot of turmoil with all these different distributions; on the other hand, any of them is great, so it doesn't really matter. For Smalltalk as a whole, it's interesting to see how many commercial vendors are still going strong. Cincom is still doing well with their Smalltalk and just released WebVelocity, which is an almost entirely web-based development environment for it built on Seaside, which is neat to see. Dolphin is still a viable commercial Windows Smalltalk, and VisualAge Smalltalk, which originally came out of IBM, is now being maintained by someone else but is still putting out releases.
In the open source world, there is Squeak, there is GNU Smalltalk, which keeps pumping out releases, and GemStone, of course. You have to imagine that if there are all these different commercial vendors and open source projects that are continuing, and seem to be continuing quite healthily, then it is still being used a lot. Of course, the cliché of Smalltalk is that lots of companies who are using it don't want to admit it because they consider it a competitive advantage.
The belief has always been, anyway, that there was a lot more Smalltalk out there than anyone knew. Of course, I have no way of knowing whether this is true, but the one indication I have is simply that there is all of this continued commercial activity around Smalltalk, so, presumably, it's doing OK. My position has always been, to some degree, that the rise of Ruby, for example, is Smalltalk winning. Yes, there is lots of stuff that the Ruby community hasn't taken from Smalltalk that I wish they would, but they will, increasingly, over time - and Rubinius is a great example of this, of explicitly saying there was a lot of good stuff in Smalltalk-80 that we could bring to Ruby. I feel such nostalgia looking through the Rubinius source, because all the names are right - it's called MethodContext, it's called MethodDictionary - and I'm sure that will be true for any Smalltalker. If you talk to people like Alan Kay, who invented Smalltalk, they never expected Smalltalk-80 to last this long. They produced Smalltalk-72, Smalltalk-76 - every 4 years they expected to totally reinvent the language - and here we are, nearly 30 years later, and we are still using Smalltalk-80.
I'm really excited about Newspeak, for example, which is Gilad Bracha's new, effectively Smalltalk, dialect - it does a bunch of things differently, and he is also looking at targeting V8 as an execution platform. I feel like those ideas are great ideas and, one way or another, those ideas are doing well. Whether Smalltalk, as that particular syntax and particular environment, is popular at any one time is not that important to me.