Generic versus User Specific Data Streams for Scalable Web Sites
One of the most important architectural decisions that must be made early in a scalable web site project is splitting the data flow into two streams: one that is user-specific and one that is generic.
The main rationale for partitioning data this way is that the constraints on each kind of data dictate different caching and state management approaches:
For example, most generic data flow is completely stateless, but user-specific actions are often stateful. If these two flows are clearly separated, then we can process generic actions with stateless services and servers. Stateless services are much easier to manage and scale than stateful services, since they can be easily replaced without any effect on the system operation. You can just stack more cheap servers and throw requests at them using round-robin or simple load balancing when you need more power. Stateful services cannot be scaled so easy — they might rely on resource locking and load balancers have to send all requests for a single session to the same server. If a stateful server dies, that has a visible effect on the system operation, so these services have to be much more resilient than stateless ones.
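The difference between the two balancing strategies can be sketched in a few lines of Python. This is only an illustration: the server names, the class names and the hash-based affinity scheme are assumptions made for the example, not something the original article prescribes.

```python
import hashlib
from itertools import cycle


class RoundRobinBalancer:
    """Stateless pool: any server can handle any request."""

    def __init__(self, servers):
        self._servers = cycle(servers)

    def pick(self):
        # No request data is needed to choose a server.
        return next(self._servers)


class StickyBalancer:
    """Stateful pool: all requests of one session go to the same server."""

    def __init__(self, servers):
        self._servers = list(servers)

    def pick(self, session_id):
        # Hash the session id so the choice is stable across requests.
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._servers)
        return self._servers[index]


# Hypothetical server names, used only for this sketch.
generic = RoundRobinBalancer(["generic-1", "generic-2", "generic-3"])
stateful = StickyBalancer(["user-1", "user-2"])

generic_picks = [generic.pick() for _ in range(4)]
sticky_picks = {stateful.pick("session-abc") for _ in range(4)}
```

With this simple modulo scheme, adding or removing a server in the sticky pool moves many sessions to a different server, which is one way to see why the stateful side is harder to scale and to replace.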
Gojko provides guidance on how to separate the data streams. For a two-tier application, this means creating a separate data source for each stream; on the one that serves static data, turn on caching and turn off transactions. He proposes a more elaborate setup for three-tier applications:
In three tier architectures, I like to split the middleware straight from the start into user-specific and generic servers. Web servers sit on top, get the generic data from the first middleware group and process transactions using the second middleware group. Generic data-flow servers can be clustered and scaled easily, and any load balancing system will work right out of the box. They can be restarted, taken out of the cluster or put back in without any effect on the rest of the system. Transparent caching can be applied to those servers easily. User-specific servers, on the other hand, are much more tricky in all those aspects and should absolutely never be transparently cached. This split is a preparation for further scaling and caching, since generic data servers can be split regionally, put under several layers of cache servers, divided vertically by product range or type. The functionality on user-specific servers is focused and isolated, so that we will have less to focus on when we need to partition that later.
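A minimal sketch of this three-tier split might look as follows. All class and method names here (`GenericService`, `CachingProxy`, `UserService`, `WebTier`) are hypothetical and chosen for the example; the in-process dictionary stands in for what would normally be a separate cache server.

```python
class GenericService:
    """Generic middleware: stateless, returns the same data for every user."""

    def __init__(self):
        self.calls = 0

    def catalogue(self, category):
        self.calls += 1
        return [f"{category}-item-{n}" for n in range(3)]


class CachingProxy:
    """Transparent cache: exposes the same interface as the wrapped service."""

    def __init__(self, backend):
        self._backend = backend
        self._cache = {}

    def catalogue(self, category):
        if category not in self._cache:
            self._cache[category] = self._backend.catalogue(category)
        # Return a copy so callers cannot modify the cached list.
        return list(self._cache[category])


class UserService:
    """User-specific middleware: stateful and never placed behind a cache."""

    def __init__(self):
        self._baskets = {}

    def add_to_basket(self, user_id, item):
        self._baskets.setdefault(user_id, []).append(item)
        return list(self._baskets[user_id])


class WebTier:
    """Web server: reads generic data from one group, transacts on the other."""

    def __init__(self, generic, user):
        self._generic = generic
        self._user = user

    def browse(self, category):
        return self._generic.catalogue(category)

    def buy(self, user_id, item):
        return self._user.add_to_basket(user_id, item)


backend = GenericService()
web = WebTier(CachingProxy(backend), UserService())

first_view = web.browse("books")
second_view = web.browse("books")  # answered by the cache, not the backend
basket = web.buy(7, first_view[0])
basket = web.buy(7, first_view[1])
```

Because the web tier only sees the `catalogue` interface, the cache can be added or removed without touching the web tier or the generic service, while the user-specific path always reaches the real service.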
Gojko suggests moving as much as possible onto the generic servers to take advantage of caching. He cites techniques such as loading user-specific content into a generic page using AJAX, and using cookies to store the user details that would typically be displayed at the top of each page. The generic data stream can then be served by high-performance HTTP servers, such as lighttpd.
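One way to picture these techniques is through the response headers each stream would send. The sketch below is an assumption-laden illustration: the function names, the `display_name` cookie and the five-minute lifetime are invented for the example, and only the standard `Cache-Control` header and Python's `http.cookies` module are taken as given.

```python
from http.cookies import SimpleCookie


def generic_page_headers(max_age_seconds=300):
    """Headers for a page that is identical for every visitor."""
    # "public" lets shared caches and proxies store the response.
    return {"Cache-Control": f"public, max-age={max_age_seconds}"}


def user_fragment_headers():
    """Headers for user-specific content loaded separately via AJAX."""
    # "no-store" tells caches not to keep the response at all.
    return {"Cache-Control": "private, no-store"}


def display_name_cookie(name):
    """Set-Cookie value holding the details shown at the top of each page."""
    cookie = SimpleCookie()
    cookie["display_name"] = name  # hypothetical cookie name
    cookie["display_name"]["path"] = "/"
    return cookie["display_name"].OutputString()


page_headers = generic_page_headers()
fragment_headers = user_fragment_headers()
cookie_value = display_name_cookie("Ana")
```

The page shell can then be cached and served to everyone, with a small script on the page reading the cookie to fill in the greeting. A cookie used this way is suitable for display only, since the client can change its value.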