Bio: Alexandru Popescu is Chief Architect and co-founder of InfoQ.com. He is involved in many open source initiatives and bleeding-edge technologies, being co-founder of the TestNG Framework and a committer on the WebWork and Magnolia projects. Alexandru was formerly one of three committers on the AspectWerkz project before it merged with AspectJ. He blogs at http://themindstorms.blogspot.com/.
1. Hi, my name is Ryan Slobojan and I am here with Alexandru Popescu, who is the chief architect at InfoQ.com. Alexandru, why don't you tell us a little bit about the general site architecture at InfoQ? What does it look like? How is it built?
If you look at the InfoQ architecture you can see it from two different perspectives: from our readers' perspective, InfoQ looks like a normal website, but for our editors and the people behind it, it is a full-blown CMS. So basically what you see on InfoQ is built on top of a custom, homemade CMS that integrates the content with user accounts, tracking, advertising mechanisms, and things like that. The easy way to describe this application is that it's a web application. Both are; even the CMS can be seen as a web application with the normal layers: the view layer, the services layer and the persistence layer.
There were a couple of interesting choices at the time I started this project two and a half years ago. For example, the persistence is not based only on relational databases but also uses the JCR API for storing content. We also had to choose between a component-based web framework and an action-based framework, and we picked the action-based one. We considered that it fit the design of our solution better, even if we probably would have liked to have something portlet-based... at that moment the portlet specification was pretty weak, I would say; hopefully I will not be blamed a lot for that. You can imagine it very simply as a three-layer application, with the usual suspects in place: a bit of Spring, a bit of WebWork, a bit of Hibernate and the JCR API.
Yes, I think I can do that if I remember correctly. Let's say we have the browser here. There are two ways HTTP requests hit our application: normal requests from the browser, or AJAX requests (XMLHttpRequest). Then we go through the stack, which is WebWork or DWR. If a normal request comes in, it goes through the WebWork part. If it is an AJAX request, it goes through DWR. Either way it is then dispatched to the service layer, where pretty much everything is just Spring and some AOP with AspectJ to enhance our model. Then it goes down to the persistence layer, which as I said is split between Hibernate and the JCR.
So in the end we have two different storages. You might ask at this point why we picked two solutions for storing something that could have lived in the same storage. The problem was that when designing this application we weren't sure how the model would look and how our content would evolve over time, and dealing with such changes inside a relational schema is pretty difficult: it is complex to migrate data and maintain it between different versions, and things like that. The JCR API provides exactly this support for unstructured content, plus a couple more features like versioning and full-text indexing, and we are taking advantage of all of these features.
And for the authoring tool it's pretty much the same, without the sugar you are seeing on InfoQ.com. It's the same stack and almost the same API, because we designed it as the same application and we split the API at build time: the public application has only a read-only API, while the authoring tool has a read/write API, but behind both there is a single source base.
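The read-only versus read/write split described above could be sketched in Java roughly as follows. This is only an illustration, not InfoQ's actual code: the names `ContentReader`, `ContentWriter` and `InMemoryContentStore` are hypothetical, and an in-memory map stands in for the real storage. The idea is that the public site would be built against the narrow interface only, while the authoring tool also gets the wider one.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical read-only API: the only view the public application would be built against.
interface ContentReader {
    Optional<String> findTitle(String contentId);
}

// Hypothetical read/write API: additionally available to the authoring tool.
interface ContentWriter extends ContentReader {
    void saveTitle(String contentId, String title);
}

// One implementation behind both views (the "single source base").
// The map is an assumption standing in for the real persistence layer.
final class InMemoryContentStore implements ContentWriter {
    private final Map<String, String> titles = new ConcurrentHashMap<>();

    @Override
    public Optional<String> findTitle(String contentId) {
        return Optional.ofNullable(titles.get(contentId));
    }

    @Override
    public void saveTitle(String contentId, String title) {
        titles.put(contentId, title);
    }
}
```

Code that holds only a `ContentReader` reference cannot call `saveTitle` without a cast, which is roughly the guarantee a build-time split gives the public application.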
Initially we started with the project as we were. I mean we had a module that dealt with the DWR and the WebWork application. But eventually I figured out that it would be easier for me and for my developers to have a common way to access them and to decide whether we are serving requests through DWR or WebWork. So I created a module that integrates these two frameworks. By the way, I contributed that code back to DWR so everybody can use it, and it was pretty generic, so right now you can use it with Struts 2 and things like that. So, at the end of the day, we are able to decide very late, when we are writing code, how the HTTP request will be served: either through the normal request/response cycle or through AJAX.
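The idea of deciding late how a request is served can be illustrated with a small Java sketch. It does not use the real WebWork or DWR APIs and is not the integration module mentioned above; `HeadlineService`, `PageEndpoint`, `AjaxEndpoint` and `PageResult` are hypothetical names. The sketch only shows that when the logic lives in a shared service, the page-style and AJAX-style entry points become thin adapters over it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Shared service layer: it does not know how the request arrived.
final class HeadlineService {
    List<String> latestHeadlines() {
        return Arrays.asList("Headline A", "Headline B"); // placeholder data
    }
}

// Hypothetical outcome of a classic request/response cycle: a view name plus its model.
final class PageResult {
    final String view;
    final Map<String, Object> model;

    PageResult(String view, Map<String, Object> model) {
        this.view = view;
        this.model = model;
    }
}

// Thin adapter for full-page requests (the role an action-based framework would play).
final class PageEndpoint {
    private final HeadlineService service;

    PageEndpoint(HeadlineService service) {
        this.service = service;
    }

    PageResult handle() {
        Map<String, Object> model = new HashMap<>();
        model.put("headlines", service.latestHeadlines());
        return new PageResult("headlines", model);
    }
}

// Thin adapter for AJAX calls (the role a remoting library would play): data only, no view.
final class AjaxEndpoint {
    private final HeadlineService service;

    AjaxEndpoint(HeadlineService service) {
        this.service = service;
    }

    List<String> handle() {
        return service.latestHeadlines();
    }
}
```

Because both adapters delegate to the same service, switching a feature from a full page load to an AJAX call would mean choosing a different adapter, not rewriting the logic.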
I get this question quite a lot and probably it's the most challenging question I ever got. As you can imagine, after working two and a half years on the same project, you have many different ideas of how to change things and what to improve. At this moment I can probably say that I wouldn't change anything. I would probably try different approaches, to see if they would work, but so far the solution we picked at the beginning seems to work pretty well.
I would probably look into unifying the API used to access the storages, making a common base API on top of Hibernate and JCR so that developers don't have to think about where the data is really stored. That would probably be a change to the internal API rather than a very major change.
So the only number I am allowed to distribute outside the company is the unique visitors per month, the number that we currently make public in the left corner of our site. Right now I think we are at a quarter of a million unique visitors per month.
Let's take them one by one. So far I haven't noticed any problem at the Hibernate level. I mean we didn't even have to optimize our queries. We are using exactly what Hibernate generates for us, and the performance we are getting is very, very good. Secondly, we haven't partitioned the data yet, even though we are handling a huge amount of data behind the scenes, because of the performance we are getting at this moment. We are investigating, we are keeping an eye on the site itself, but so far we don't need that. Another interesting thing about our architecture is that the only possible bottleneck is the relational database we are using, because the other database, the one for content, is present on every public server serving the site, so we can scale linearly in terms of serving content. And if we run into performance problems related to this relational database, it will probably be pretty easy for us to create a cluster of MySQL databases.
Yes, we are running MySQL, with multiple read-only instances and a single instance for writes.
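A setup with one write instance and several read-only instances needs some rule for deciding which instance a given operation goes to. The following Java sketch shows one simple possibility, round-robin over the readers. It is an assumption for illustration, not a description of InfoQ's configuration; `ReadWriteRouter` and the instance names are hypothetical, and plain strings stand in for real connections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical router: writes go to the single write instance,
// reads are spread round-robin across the read-only instances.
final class ReadWriteRouter {
    private final String writer;
    private final List<String> readers;
    private final AtomicInteger next = new AtomicInteger();

    ReadWriteRouter(String writer, List<String> readers) {
        if (readers.isEmpty()) {
            throw new IllegalArgumentException("at least one read-only instance is required");
        }
        this.writer = writer;
        this.readers = new ArrayList<>(readers); // defensive copy
    }

    String forWrite() {
        return writer;
    }

    String forRead() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), readers.size());
        return readers.get(index);
    }
}
```

With readers `reader-1` and `reader-2`, successive `forRead()` calls alternate between the two, while `forWrite()` always returns the write instance.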
So far I haven't noticed. There is a slight delay, but it's not noticeable. So far we have partitioned the data logically, not physically, so that we don't necessarily hit the database on each request. So we are able to cache much of the information we actually need to serve a request. What goes to the database is mostly tracking information and things that help with the advertisements. Even if we see some delays in publishing this data to the cluster, it will not impact performance on the front end.
We are using local caching and we have a single point of caching. It's an object cache.
It's above Hibernate. It's a bit above, and in fact you may be right to say that we have two caches, because we are using the Hibernate cache, but we are aggregating the Hibernate objects into our own objects, because these are more complex, and we cache these custom objects logically, through our own API.
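An object cache that sits above the persistence layer and holds already-aggregated objects can be sketched in a few lines of Java. This is a minimal illustration under assumptions, not InfoQ's cache: `AggregateCache` is a hypothetical name, and the loader function stands in for whatever code assembles the custom object from the underlying storages.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache of aggregated objects, keyed by a logical identifier.
final class AggregateCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // assembles the object on a cache miss

    AggregateCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Returns the cached object, invoking the loader only on a miss.
    V get(K key) {
        return entries.computeIfAbsent(key, loader);
    }

    // Drops an entry, for example after the underlying content has changed.
    void invalidate(K key) {
        entries.remove(key);
    }
}
```

Repeated `get` calls for the same key reach the storages only once until `invalidate` is called, which is the effect described above of not hitting the database on each request.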
We initially started with a streaming-based solution, using a third-party provider for the streaming. Unfortunately, immediately after designing this solution and starting with it, we discovered that the provider of this service required both us and the client to have specific ports open to access the Flash streaming. This was a really big problem for our big clients, because a big company like IBM usually sits behind firewalls, and they will never open special ports for you just for watching videos on InfoQ, even if they are high quality. So we started to think, what would be the alternative?
At that moment we noticed that YouTube and the other video service providers were moving towards download-based video solutions. Also at that moment Amazon was launching the by now well-known services S3 and EC2, and we tried to think of a solution using these public services, which we hope to be really reliable. The new architecture is based on Amazon S3 and Amazon EC2. It's a very basic setup -- you just need a web server able to give you access to indexed videos, and some storage, and that's all. If you start thinking about it, you'll probably be able to create such a solution in a couple of days. It's that easy at this moment. Being sure that the Amazon services are up is very important for us, and the SLA they put in place for the S3 service helped us decide to go in this direction. Now we are waiting for the same service from EC2.
12. When you get video provided to you: is that something that is all done by InfoQ and all the video is in an encoding that's going to be your nice Flash encoding? Or do you sometimes need to use third party or internal or external transcoding mechanisms?
Basically the whole video processing is a workflow. It starts from the cameras and goes to video editing professionals who index the video and create the metadata, and we are working towards a more automatic way to manage this workflow. So basically, to answer your question, everything is done inside our company. Not everything is automatic yet; we will probably have something like that in a couple of months to make the editors' lives easier, but for now all these small steps are pretty much manual. It is an internal workflow, though.
13. You said you store the videos on the Amazon service. What you basically get is a bucket where you put some data in; it doesn't matter how big it is or what it is, you just put it in and they deliver it. Do you have a URL that you can give a client or a user to use in their browser? Where do you store the mapping from your internal keys to the URLs used by Amazon? How do you know where you stored your video?
We have the S3 storage, and we have the EC2 server. In order to be able to serve the videos we need to make them available from the S3 storage to this instance. So we have a setup here that synchronizes between the S3 bucket and the local storage, and then everything is accessed from here. Now, in order to resolve how you get to this resource, we have our content database that tells us the name of the resource to look for. So every piece of metadata is stored in our JCR storage; everything that is related to content lives here. And then the services were built the way... We provide an ID and then every piece in here knows how to retrieve it. Whether it's S3 or, before that, the VitalStream third-party support, it was the same: a resource lookup based on IDs.
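The ID-based lookup described above, where the content metadata gives a resource name and the storage behind it can change, could look roughly like this in Java. Everything here is an illustrative assumption: `VideoBackend`, `LocalMirrorBackend` and `VideoResolver` are hypothetical names, a map stands in for the content metadata, and the returned strings are placeholder locations rather than real URLs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical storage backend: turns a resource name into a location.
interface VideoBackend {
    String locate(String resourceName);
}

// Backend for files mirrored from a bucket onto local storage.
final class LocalMirrorBackend implements VideoBackend {
    private final String root;

    LocalMirrorBackend(String root) {
        this.root = root;
    }

    @Override
    public String locate(String resourceName) {
        return root + "/" + resourceName;
    }
}

// Resolves a content ID to a location; callers never see which backend is in use.
final class VideoResolver {
    private final Map<String, String> resourceNamesById; // stand-in for content metadata
    private final VideoBackend backend;

    VideoResolver(Map<String, String> resourceNamesById, VideoBackend backend) {
        this.resourceNamesById = new HashMap<>(resourceNamesById);
        this.backend = backend;
    }

    Optional<String> resolve(String contentId) {
        return Optional.ofNullable(resourceNamesById.get(contentId)).map(backend::locate);
    }
}
```

Swapping one backend for another (for instance, moving from a streaming provider to mirrored files) changes only the `VideoBackend` passed to the resolver; the IDs and the calling code stay the same.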
Sorry if it came across that way. What I wanted to say is that our model is richer than what we retrieve through Hibernate alone. So we are aggregating several different objects to create the object that will represent the page or something like that. So it's an aggregation process; it's not like moving from the model to a DTO or something like that.
15. Do you use the association mechanisms provided by Hibernate? So for example I create a user. A user can have a number of roles (you could configure Hibernate such that you can retrieve your user and all roles, for example), and Hibernate does that for you. And you say you aggregate at an upper level. Does this mean that you only get these single entities or collections of entities and resolve them in an upper layer, or...
Remember that I mentioned that we have different storages. I am fetching everything from every storage I have and I need to assemble it to represent a page. I am using all the Hibernate features like lazy or eager fetching, join fetching and everything.
Exactly. If you look at a page and try to describe it as a model, the page is composed of content, advertisement elements, pictures and things like that -- all of these are parts of our model. In order to represent the full page we need to aggregate all these small pieces, like the advertisement elements and the content. The way we aggregate them is very interesting, because we start from the content and, based on the metadata related to the content, we try to deduce what kind of advertisement fits in, and things like that. Basically we have a core model, the content with its metadata, and afterwards we decorate this core model with additional things.
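As a rough illustration of starting from the content and its metadata and then decorating it with matching advertisements, consider the Java sketch below. The matching rule (an ad fits when its topic appears among the content's topics) is an assumption made for the example, not InfoQ's actual logic, and `Content`, `Ad`, `Page` and `PageAssembler` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Core model: the content together with its metadata (here, just topics).
final class Content {
    final String title;
    final Set<String> topics;

    Content(String title, Set<String> topics) {
        this.title = title;
        this.topics = topics;
    }
}

final class Ad {
    final String name;
    final String topic;

    Ad(String name, String topic) {
        this.name = name;
        this.topic = topic;
    }
}

// The aggregated object that represents a full page.
final class Page {
    final Content content;
    final List<Ad> ads;

    Page(Content content, List<Ad> ads) {
        this.content = content;
        this.ads = ads;
    }
}

// Builds a page by decorating the core content with ads chosen from its metadata.
final class PageAssembler {
    private final List<Ad> inventory;

    PageAssembler(List<Ad> inventory) {
        this.inventory = new ArrayList<>(inventory);
    }

    Page assemble(Content content) {
        List<Ad> matching = new ArrayList<>();
        for (Ad ad : inventory) {
            // Assumed rule: an ad fits when its topic is one of the content's topics.
            if (content.topics.contains(ad.topic)) {
                matching.add(ad);
            }
        }
        return new Page(content, matching);
    }
}
```

The content object itself is never modified; the page wraps it and adds whatever the metadata suggests, which keeps the core model independent of advertising concerns.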
Considering that our company is very virtual, I mean we have people all around the world, working in different places and different time zones, we have set up a custom process that basically runs through a backlog where we prioritize everything that we need to implement for the next couple of years, and then we just pull out the next couple of iterations, where we discuss the details and things like that. So to answer your question, right now our backlog is, I think, seven pages long. There will be lots of new features going live, sooner or later. We have a couple of new ideas that are not yet on the backlog, but I would prefer to keep the surprise for that moment. I mean we are running with many competitors around, so we are trying to keep our secrets. But the last time, for example with the video re-implementation, we did a draft and we asked our users to look at it, comment on it and give us their feedback before pushing it live, so for major features from now on we will be using the same process. If you are registered on InfoQ then there is a chance that you may help us in the future to implement new things. So please do register.
As I told you at the beginning, the JCR API provides full-text indexing. So we have this part. But currently we are using the Google search mechanism, because we figured out that the performance is a bit better, and for the moment they are doing a very good job. In the future we are thinking of mixing these two solutions and providing an advanced search, with the ability to use a specific query language for searching the site, because, as you know, we are using labels for content and things like that, and we want to provide support for these kinds of searches.
Which JCR implementation are you using?
Thank you for this interesting interview, it was good to see behind the scenes. Still, I'd like to ask which JCR implementation you are using in production and how you integrated it with your DAO layer. Did you make a custom module that works with Spring, or did you choose a third-party/open-source one?