Bio: Avi Bryant is the co-CEO of Dabble DB, a Vancouver startup focused on web-based data management and collaboration tools. He is the author of the Seaside web application framework and is active in the open source Squeak Smalltalk community. His latest project is http://trendly.com/
FutureRuby isn't a Ruby conference, but a conference for Rubyists. This is a call to order - a congress of the curious characters that drew us to this community in the first place. We have a singular opportunity to express a long-term vision, a future where Ruby drives creativity and prosperity without being dampened by partisan politics.
I'm a developer, an entrepreneur, probably best known for being one of the creators of the Seaside web framework, which is a Smalltalk web framework, for being the co-founder of the company that built Dabble DB and also a new product that we are working on called Trendly. I'm Canadian, I live in Vancouver. Actually, I live on a small island off the coast of Vancouver - that's me. And I like Smalltalk.
Yes. The core of Dabble DB is written in Smalltalk. We do use a bunch of Ruby scripts and actually a little bit of Python for infrastructure stuff, peripheral stuff, but the core is written in Smalltalk and runs on Seaside.
Trendly came out of something that we really needed internally for Dabble DB: like a lot of people, we use Google Analytics to track our web traffic. For a web business, traffic is really your lifeblood and you need to understand it. The thing about Google Analytics is that it does a great job of collecting data and showing you the raw data, but for an analytics product it does very little analysis. What we are interested in is not necessarily that the traffic went down 10% today or up 5% tomorrow, but whether there have really been any meaningful changes to the traffic patterns - and I mean that in a statistically significant sense. Has anything changed that we really need to pay attention to, rather than just noise?
I spent a while just reading about stats and researching this stuff to come up with a model that I thought would work for this kind of traffic. We were kicking this around with a bunch of people inside the company and also people from other companies, like Patrick Collison from Auctomatic or Bob Hardsorn [?], who's now at Facebook but was at a startup at that time. What we realized is that what we really wanted was a news feed - almost like the Facebook news feed, where it tells you what your friends are up to. We wanted to know what our traffic was up to, and to pick and choose the events that were reported based on whether or not they were truly significant.
If your traffic just randomly goes up a bit, we don't want to tell you about that, but if you've been slashdotted and suddenly you've got huge referrals from everywhere, then that's something we want to report. Or subtler things, like you've changed your website and suddenly your average conversion rate has gone from 3% to 3.5% - if your average has truly changed, that's something we want to report on. We had this statistical model, we had this idea of a news feed, and we spent a few months just iterating on how that turns into something useful and usable and what kinds of visualizations we can do to support it. That's what my talk at FutureRuby is about - that sort of long, iterative design process. What we ended up with I think is quite good; we started previewing it with people and we hope to fully launch it as a product quite shortly.
Anyone can go right now and connect their Google Analytics account, and we will pull in their data and do at least a one-off historical report visualizing their traffic for the last 3 years, say, if they have that much data. What we are going to be doing going forward, and what we will be charging for, is recurring updates to that. The real value is in checking it every day or every week, or having it email you when interesting things are happening, so that you can track what's happening with your data over time.
For now, the preview lets people get a feel for it and - for people who are willing to share the results with us - lets us understand how that kind of engine is doing on different profiles of sites, from very small sites that maybe are only getting 100 visitors a month, up to people who are getting 1,000,000 visitors a month. Does the analysis work across all the different site profiles?
Google Analytics released an API - I think it became publicly available maybe at the beginning of April or May - and they use a combination of OpenID and OAuth to authenticate you. They authenticate you, so we never see your Google password, and then you tick a box saying Trendly is allowed to access these parts of your Google account. We ask for the email address so that we can get in touch with you, and then we ask for your Google Analytics data. We may, at some point, ask for your Google Calendar data, so you can use Google Calendar to mark significant events, like "This was the day that I released a new version of the website", but we don't do that yet.
Then Google Analytics basically gives us an API token and we can use that to do queries against your Google Analytics data. In our case, we're doing a huge number of queries, because we are trying to pull in as much data as we think we can process for you. Then we run a bunch of statistical analysis on it and, based on the results of that, we do a bunch of visualization - generating charts and sparklines and news reports and that kind of thing.
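For a concrete sense of what those queries looked like: the 2009-era Google Analytics Data Export API was essentially an authenticated GET against a feed URL. The helper below is a hypothetical sketch; the endpoint and parameter names reflect that feed API, and the token from the OAuth step goes in the request's Authorization header.

```ruby
require 'uri'

# Hypothetical helper building the kind of feed query Trendly would issue.
# Parameter names ('ids', 'metrics', 'start-date', 'end-date') follow the
# 2009-era Google Analytics Data Export API.
def analytics_query(profile_id, metric, start_date, end_date)
  URI::HTTPS.build(
    host:  'www.google.com',
    path:  '/analytics/feeds/data',
    query: URI.encode_www_form(
      'ids'        => "ga:#{profile_id}",
      'metrics'    => "ga:#{metric}",
      'start-date' => start_date,
      'end-date'   => end_date
    )
  )
end

uri = analytics_query('12345', 'visits', '2009-06-01', '2009-06-30')
# Fetching this URI with the OAuth token in the Authorization header
# returns a feed of daily values for the requested metric.
```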
Yes. Let me step back and tell you something about the model that we use. We use a stepwise linear model. The assumption is that for whatever metric we are looking at - let's say the metric is the number of hits a day you get from Twitter - that number is obviously going to change day by day, but we assume that the average is going to be the same for a relatively long period of time. For a month or a year, you're going to have the same average hits from Twitter, and then suddenly Ashton Kutcher is going to post about you and everyone is going to retweet it or whatever, and that will cause a shift in the average. For a while, while this kind of meme is propagating out and Demi Moore is retweeting Ashton Kutcher tweeting about how cool InfoQ is, that average is going to be quite different.
Then maybe it will fall back again, or maybe you'll have broken out and everyone's always going to be tweeting about InfoQ from now on, so that average is going to stay at that high level. What we are trying to do statistically is simply to detect where those shifts happen and figure out what our confidence level is about whether a shift happened and about when it happened. The model assumes that the shift was abrupt - that there was a change from one step to the next - but we may be able to tell that it happened on a particular day, or we may only be confident that it happened some time this week or some time in this time period.
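The stepwise idea can be sketched in a few lines of Ruby. This is a minimal, hypothetical illustration - find the single split point that best explains a series as two flat steps - not Trendly's actual model, which also has to estimate confidence levels and handle multiple shifts:

```ruby
# Sum of squared residuals around the mean of one flat segment.
def sse(segment)
  return 0.0 if segment.empty?
  mean = segment.sum.to_f / segment.size
  segment.sum { |x| (x - mean)**2 }
end

# Find the split point that best explains the series as two flat
# steps, i.e. minimises the total within-segment variance.
def best_split(series)
  (1...series.size).min_by { |i| sse(series[0...i]) + sse(series[i..-1]) }
end

daily_hits = [12, 9, 11, 10, 12, 48, 52, 50, 49, 51]  # a sudden shift
best_split(daily_hits)  # => 5, the day the average jumped
```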
That model, I think, maps fairly well to web traffic data, in that traffic is usually pretty constant until some event happens - whether it's an internal event, like you changed your website, or an external event, like someone links to you, or some complicated combination, like you search-optimized your website and Google finally noticed and updated you in their index and now you are getting more search traffic. This could obviously also be applied to other types of data - sales data for a company, or load numbers on a server. There are lots of kinds of time series data that you can apply this kind of analysis to.
The nice thing for us about traffic data is that Google Analytics has this API, so it's really easy to get you started. All it takes is two clicks to say "yes, Trendly can use my analytics data" and we can run the analysis. For other kinds of data it would be more complicated for us to get a large volume of data from you. To start with, this is a nice easy market and a well-defined way for us to market the product, but if people do like what it's doing for their traffic data and think it would be nice to integrate other kinds of data into it, then that's something we'd like to talk about - maybe we can publish an API that will let them do that or something like that.
It's a very visual talk, so it's hard to do in an interview. A lot of it is showing screenshots of what the product looked like as it was evolving, but one part of the talk that is less visual, and that I can talk about here, is the implementation. We went through a lot of different versions of both the front-end and the back-end implementation. On the front end, the central piece of Trendly is a chart - sort of a stream chart. I don't know if you've seen those Last.fm visualizations of people's music listening history, but it was inspired by that. I wrote the code to do that chart at least 6 times, I think.
What we finally ended up using was Ruby and ImageMagick, generating it on the server side, but I also did a version in Canvas, a version in SVG, a version in Flash, and a version using Ruby and OpenGL. We were trying to hit the sweet spot of performance - the problem with ImageMagick is that it is not hardware accelerated, so it's actually quite slow to generate the charts - and appearance: with OpenGL we could never get the antialiasing the way we wanted it to look. Java 2D actually looks like it would be a good way to do it, but I don't want to write more Java than I have to.
On the client side, Canvas and SVG were too slow. Flash was great, but it was limited to 4,000 pixels in any dimension of a Flash movie, and we wanted these long scrolling, like 25,000 pixel long, charts; it would have meant splitting them up into multiple Flash movies and it was just a huge pain. Something similar happened with the backend analysis implementation. It started out all written in Smalltalk, and what we found was that, for a variety of reasons, breaking parts of it off into other little helper scripts ended up being necessary. For example, my initial chart implementation was written in Squeak, using the Squeak graphics primitives, but again, the antialiasing wasn't as good as I wanted, and that ended up in Ruby and ImageMagick.
Some of the core number crunching code was just running too slowly in any dynamic language, so I ended up writing it in Java. It would have been better to write it in C in some sense, but Java was the nicest environment to develop in that still had acceptable performance. I'm sure now somebody is going to tell me I should have written it in OCaml. We ended up with this amazing hodgepodge - it's a pipeline, almost like a make process, where each of these little programs running in different languages takes an input file and produces an output file. That's also nice in terms of parallelizing and distributing the work across however many CPUs we have to work with.
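The make-like pipeline described above can be sketched as file-based stages with a staleness check, so each stage only re-runs when its input is newer than its output; because every stage reads and writes independent files, stages with no shared files can run in separate processes. Names and stages here are hypothetical:

```ruby
require 'tmpdir'

# A minimal sketch of a make-style pipeline: each stage turns one input
# file into one output file, and is skipped when its output is already
# newer than its input (so a scheduler can re-run the whole chain cheaply).
Stage = Struct.new(:input, :output, :work) do
  def stale?
    !File.exist?(output) || File.mtime(output) < File.mtime(input)
  end

  def run
    work.call(input, output) if stale?
  end
end

Dir.mktmpdir do |dir|
  raw   = File.join(dir, 'hits.txt')
  total = File.join(dir, 'total.txt')
  File.write(raw, "3\n5\n4\n")

  # In the real system this might be a Java analysis jar or a Ruby
  # chart renderer; here it just sums the numbers in the input file.
  sum_stage = Stage.new(raw, total, lambda { |inp, out|
    File.write(out, File.readlines(inp).sum { |l| l.to_i })
  })
  sum_stage.run
  puts File.read(total)  # prints "12"
end
```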
That would have been fine. I guess I'm more familiar with using Objective-C on Mac OS X, and in a server environment we tend to deploy on Linux servers. I know the GNUstep project does exist and GCC can compile Objective-C, but I'm not as familiar with that particular environment. Java is fast enough.
We have dedicated servers that are hosted in a Montreal datacenter - iWeb is where they are - and we basically rent a big beefy server from them and host on that. The main thing for us is not to have to deal with buying and installing and replacing hardware. But rather than having a lot of little individual slices or EC2 instances to manage, for system administration it's easier for us to have one very large 8-core machine. For now, all of Trendly can run on that. As more people use it, we may need to add more to the mix.
The thing to say about Trendly from a deployment or server resource point of view is that it's a batch job. This data only changes every day, so overnight we can just queue up a whole bunch of analysis that we need to do and do it at our leisure. We don't really need the ability to dynamically add and remove server capacity on an hour-by-hour, minute-by-minute basis, because we can choose when to run these jobs and make sure that we're making maximum use of even a single server.
Yes, MRI 1.8, I guess - whatever 'apt-get install ruby' gives you, whatever the default is. I'm not trying to find a bleeding edge implementation. We did consider using JRuby. It's conceivable that by using JRuby and Java, and porting the Smalltalk stuff to Ruby, we could get this all down to a single runtime. If we ever decided we wanted a downloadable version of this that someone could run behind their firewall, it would obviously be easier if this were all packaged up into a single JVM and we could just send them a JAR file or something. But for now we just use MRI.
It uses parsing expression grammars - this Smalltalk subset has a PEG library that lets it parse and compile itself. It's fully bootstrapped that way: you can just load a bootstrap file into a browser and go from there. The name of this project, for now anyway, is Clamato, and I was trying to figure out the other day where that even came from. I think it was that I was talking to Chad Fowler and Chad said, "Hey, I just made this neat discovery! There is this thing called Clamato and it's great! It's clam juice mixed into tomato juice!" Most people, I think, would have gone "Oh, that's totally disgusting!", but I'm Canadian, and Clamato is a Canadian thing - a Canadian invention - so this is totally natural to me.
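For readers unfamiliar with PEGs: a parsing expression grammar composes small parsers with sequencing and ordered choice, where the first alternative that matches wins, so there is never any ambiguity. A toy sketch of the idea in Ruby - hypothetical names, not Clamato's actual parser:

```ruby
# Each parser is a lambda from (input, position) to a new position,
# or nil on failure.

def lit(s)  # match a literal string
  ->(input, pos) { input[pos, s.size] == s ? pos + s.size : nil }
end

def seq(*parsers)  # all parsers must match, in order
  ->(input, pos) {
    parsers.each { |p| pos = p.call(input, pos) or return nil }
    pos
  }
end

def choice(*parsers)  # ordered choice: first successful match wins
  ->(input, pos) {
    parsers.each { |p| (r = p.call(input, pos)) and return r }
    nil
  }
end

greeting = seq(choice(lit('hi'), lit('hello')), lit(' world'))
greeting.call('hello world', 0)  # => 11 (position after the whole match)
greeting.call('hey world', 0)    # => nil (no alternative matched)
```

A real PEG library adds repetition, predicates, and semantic actions on top of these few combinators, which is what makes a self-parsing bootstrap file feasible.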
If you attach a method to a constructor, objects that have that constructor's prototype won't get that method - only the instance methods. There is a normal inheritance hierarchy. That's one big, but I don't think terribly significant, difference. I mean, Smalltalkers just love the metaclass hierarchy, but I think for a lot of beginners it's just a source of confusion more than anything else. So far I'm not hugely missing it. There are no cascades in the syntax, which is probably the thing most likely to change. I may add those in if it gets too frustrating.
For people who don't know Smalltalk, a cascade is a syntax for sending a succession of messages - calling a bunch of methods in a row on the same object. Where it's most useful is when you are configuring a new instance: you just created something and you want to call a bunch of methods on it to set things up. It's nice if you don't have to have a temp variable that you keep referring to while you are doing that, but again, it's no big deal. The biggest change probably is - getting back to what I was saying before about early return semantics from blocks - that there is no explicit return at all. Like Ruby, there is an implicit return of the last expression - or like Lisp, I guess - but there is no way at all to return early from a method.
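In Ruby, `Object#tap` gives you most of the configure-a-new-instance benefit of a cascade without a named temporary; the Smalltalk line in the comment shows the cascade equivalent:

```ruby
require 'set'

# Smalltalk cascade:  Set new add: 3; add: 1; add: 2; yourself
# Ruby's tap yields the new object to a block, then returns the object,
# so the whole configuration reads as one expression.
numbers = Set.new.tap do |s|
  s << 3
  s << 1
  s << 2
end
numbers.include?(2)  # => true
```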
In Smalltalk you can always reference the name of the instance variable directly if you want to, but they do have to be declared as part of the class definition, as they would in Java or Objective-C or anything else. In general, I like that, because you have very clear knowledge of what the memory profile of a class is going to be. If you declare the instance variables, it's going to allocate exactly that many slots, so you can predict precisely how much memory an instance of that class is going to take up. In Ruby, the implementation might use a hash table, which might use an arbitrary amount of memory, or it might be allocating slots explicitly, or it might be some mix of the two. So you can get much less predictable and also much higher memory overheads.
Let me answer about the current state of Squeak first. Squeak has, as long as I've been working with it, had sort of a split personality. Squeak was originally developed as a platform for educational software - for Etoys, for the research that Alan Kay was doing - so it has always had those roots and is in many ways geared towards that use, but people like me have realized that it is also a very useful commercial development platform. There has always been a little bit of tension - not between people in the community, because I think the community is quite cohesive and friendly and supportive, but tension between these different goals.
People who are using it for commercial development might be more interested in having a minimal kernel image that you can load specific packages into, and much more interested in the development tools, while people who are using it as a platform for educational software are more interested in having good media support, good animation support - stuff that doesn't affect me at all. The difficulty in the Squeak community has always been in managing these conflicting goals while still having a single thing that we can call Squeak. The best way to think about it is that there have been a number of new distributions of Squeak that have come up.
In the same way that Linux has distributions geared towards desktop use, distributions geared towards server use, and distributions geared towards people who want to compile their own stuff, in Squeak we are seeing distributions geared towards academic research, distributions geared towards 3D environments, distributions geared towards educational use, and distributions geared towards commercial development.
Pharo is probably the most recent player and is getting, I think, a lot of attention. Pharo is really trying to be a commercial Squeak, in the sense that it's a Squeak that's intended to be used for that kind of development. Personally, one Squeak image is pretty much like another to me. I can load most of my code into any of them; they all have pretty much the same development tools, plus or minus a few bells and whistles, and they all basically work. If there are a few megabytes of cruft that I'm not using - well, there are a lot of megabytes of system libraries on my Mac OS X machine that I don't use either, and I don't spend a lot of time obsessing about that.
I've also trained myself to be comfortable in the lowest common denominator Squeak. Whether it's got bitmap fonts or TrueType fonts, I don't care. If it doesn't have syntax coloring, that's not going to bother me. There is a lot of fuss about this, a lot of handwringing in the community about how to deal with it, and a lot of technical effort being spent on how to strip down to a minimal image. I think that's great, but it doesn't really affect me very much. What we use for Dabble is maybe a Squeak image from ca. 2005 or something, and it still works fine. For Trendly we are using a somewhat newer image, but I don't even know which, because it just doesn't matter that much.
So the state of Squeak is: on one hand, there is a lot of turmoil with all these different distributions; on the other hand, any of them is great, so it doesn't really matter. As for Smalltalk as a whole, it's interesting to see how many commercial vendors are still going strong. Cincom is still doing well with their Smalltalk and just released WebVelocity, an almost entirely web-based development environment for it, built on Seaside, which is neat to see. Dolphin is still a viable commercial Windows Smalltalk, and VisualAge Smalltalk, which originally came out of IBM, is now being maintained by someone else but is still putting out releases.
In the open source world, there is Squeak, there is GNU Smalltalk, which keeps pumping out releases, and GemStone, of course. You have to imagine that if all these different commercial vendors and open source projects are continuing, and seem to be continuing quite healthily, then it is still being used a lot. Of course, the cliché of Smalltalk is that lots of companies who are using it don't want to admit it because they consider it a competitive advantage.
The belief has always been, anyway, that there was a lot more Smalltalk out there than anyone knew. Of course, I have no way of knowing whether this is true, but the one indication I have is simply that there is all of this continued commercial activity around Smalltalk, so presumably it's doing OK. My position has always been, to some degree, that the rise of Ruby, for example, is Smalltalk winning. Yes, there is lots of stuff that the Ruby community hasn't taken from Smalltalk that I wish they would, but they will increasingly, over time. Rubinius is a great example of this - of explicitly saying there was a lot of good stuff in Smalltalk-80 that we could bring to Ruby - and I feel such nostalgia looking through the Rubinius source, because all the names are right: it's called MethodContext, it's called MethodDictionary. I'm sure that will be true for any Smalltalker. If you talk to people like Alan Kay, who invented Smalltalk, they never expected Smalltalk-80 to last this long. They produced Smalltalk-72 and Smalltalk-76; every 4 years they expected to totally reinvent the language, and here we are, nearly 30 years later, still using Smalltalk-80.
I'm really excited about Newspeak, for example, which is Gilad Bracha's new language - effectively a Smalltalk dialect, but it does a bunch of things differently, and he is also looking at targeting V8 as an execution platform. I feel like those ideas are great ideas, and one way or another, those ideas are doing well. Whether Smalltalk, as that particular syntax and particular environment, is popular at any one time is not that important to me.