Max Indelicato, a Software Development Director and former Chief Software Architect, has written a post on how to design a web application for scalability. He suggests choosing the right deployment and storage solutions, designing a scalable data schema, and using abstraction layers.
The Right Tool for the Job
Indelicato’s first piece of advice is to choose “the right tool for the job” by selecting among the following architectural solutions:
- Using a cloud deployment solution
- Using a scalable data storage solution such as MongoDB, CouchDB, Cassandra, or Redis
- Adding a caching layer like Memcached
This reporter considers that none of these solutions is mandatory from the start of an application, but it is wise to choose a scalable data storage solution from the beginning to avoid a switch later on. Deploying to the cloud brings some advantages, especially for startups that cannot accurately predict the usage of their application after launch: it allows the application to scale gracefully if the need arises. Many software architects have told the story of how their application had to grow and how introducing a caching layer solved a good part of the problem. But a caching layer need not be considered in the design phase; it can be added fairly easily later on.
A Scalable Data Storage
Indelicato continues by suggesting a data store that supports partitioning and replication and is elastic, such as MongoDB, Cassandra, Redis, Tokyo Cabinet, Project Voldemort, or, for a relational database, MySQL. Partitioning support is desirable because partitioning will become necessary at some point in the life of the application anyway. Replication is needed not for scalability but for “ensuring a high level of availability”. Elasticity makes it possible to add nodes quickly when peak traffic is encountered, and also when “maintenance is required on a node as a result of a hardware failure or upgrade, a large scale schema change, or any number of reasons that a node might require downtime.”
A Scalable Data Schema
Indelicato suggests creating a schema that lends itself to data sharding, giving as an example two simple entities, a User and a UserFeedEntry:
Collection (or Table, or Entries, etc) User {
    UserId : guid, unique, key
    Username : string
    PasswordHash : string
    LastModified : timestamp
    Created : timestamp
}

Collection (or Table, or Entries, etc) UserFeedEntry {
    UserFeedEntryId : guid, unique, key
    UserId : guid, unique, foreign key
    Body : string
    LastModified : timestamp
    Created : timestamp
}
He then suggests partitioning on UserId:
By partitioning on the UserId field, in both the User Collection and the UserFeedEntry Collection, we’ll be clumping the two related data chunks together on the same node. All UserFeedEntry entries with a UserId of xxx-xxx-xxx-xxx will be contained on the same shard as the User entry with a UserId value of xxx-xxx-xxx-xxx.
Why is this scalable? Because our requirement for this application is perfect for this distribution of data. As each visitor visits a User’s profile page, a request will be made to a single shard to retrieve a User to display that user’s details and then a second request will be made to that same shard to retrieve that user’s UserFeedEntries. Two requests, one for a single row and another for a number of rows all contained on the same shard. Assuming most user’s profile gets hit about the same amount throughout the day, we’ve designed a scalable schema that supports our web application’s requirements.
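To make the co-location argument concrete, here is a minimal sketch in Python. It is not taken from Indelicato's post: the `ShardedStore` class, the `shard_for` function, and the fixed shard count are hypothetical, and the in-memory dictionaries merely stand in for real storage nodes. It shows that routing both collections by a stable hash of UserId sends the two profile-page requests to the same shard.

```python
import hashlib
import uuid
from collections import defaultdict

NUM_SHARDS = 4  # assumption: a fixed shard count, chosen only for illustration


def shard_for(user_id: uuid.UUID, num_shards: int = NUM_SHARDS) -> int:
    """Map a UserId to a shard index using a stable hash of its bytes."""
    digest = hashlib.sha256(user_id.bytes).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Hypothetical in-memory stand-in for a partitioned data store."""

    def __init__(self, num_shards: int = NUM_SHARDS) -> None:
        self.num_shards = num_shards
        # One dict per shard, holding both collections side by side.
        self.shards = [
            {"User": {}, "UserFeedEntry": defaultdict(list)}
            for _ in range(num_shards)
        ]

    def _shard(self, user_id: uuid.UUID) -> dict:
        return self.shards[shard_for(user_id, self.num_shards)]

    def put_user(self, user_id: uuid.UUID, username: str) -> None:
        self._shard(user_id)["User"][user_id] = {
            "UserId": user_id,
            "Username": username,
        }

    def put_feed_entry(self, user_id: uuid.UUID, body: str) -> None:
        # Partitioned on UserId, not on UserFeedEntryId, so the entry
        # lands on the same shard as its owning User.
        self._shard(user_id)["UserFeedEntry"][user_id].append(
            {"UserFeedEntryId": uuid.uuid4(), "UserId": user_id, "Body": body}
        )

    def load_profile(self, user_id: uuid.UUID):
        """Two lookups, both served by a single shard."""
        shard = self._shard(user_id)
        user = shard["User"].get(user_id)
        entries = list(shard["UserFeedEntry"].get(user_id, []))
        return user, entries
```

Note that simple modulo hashing remaps most keys when the shard count changes, so an elastic deployment would likely need a different routing scheme, such as consistent hashing; the sketch only illustrates co-location.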
Using Abstraction Layers
Indelicato’s last suggestion is to use abstraction layers, among them Repository, Caching, and Service. For the Repository layer, he recommends the following:
- Don’t name methods in a manner particular to the data storage you’re abstracting. For example, if you’re abstracting relational storage, it’s common to see Select(), Insert(), Delete(), Update() functions defined for performing SQL queries and commands. Don’t do this. Instead, name your functions something less specific like Fetch(), Put(), Delete(), and Replace(). This will ensure you are more closely following the Repository Pattern’s intent and make life easier if you need to switch out the underlying storage.
- Use Interfaces (or abstract classes, etc) if possible. Pass these interfaces into higher layers of the application so that you’re never actually directly referencing a specific concrete implementation of the Repository. This is great for building for unit testability too because you can write alternate concrete implementations that are pre-filled with data for test cases.
- Wrap all of the storage specific code in a class (or module, etc) that the actual Repositories reference or inherit from. Only put the necessary specifics of an accessor function in each function (query text, etc).
- Always remember that not all Repositories need to abstract the same data storage solution. You can always have the Users stored in MySQL and the UserFeedEntries stored in MongoDB if you wish, and Repositories should be implemented in such a way that they support needing to do this down the road without much overhead. The previous three points indirectly help with this as well.
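The recommendations above can be sketched as follows. This is an illustrative Python example under stated assumptions, not code from the post: `UserRepository`, `InMemoryUserRepository`, and `ProfileService` are hypothetical names, and the storage-neutral method names follow the Fetch/Put/Delete/Replace suggestion.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Optional


class UserRepository(ABC):
    """Storage-neutral interface; nothing here mentions SQL or any store."""

    @abstractmethod
    def fetch(self, user_id: str) -> Optional[dict]: ...

    @abstractmethod
    def put(self, user: dict) -> None: ...

    @abstractmethod
    def replace(self, user: dict) -> None: ...

    @abstractmethod
    def delete(self, user_id: str) -> None: ...


class InMemoryUserRepository(UserRepository):
    """Alternate concrete implementation, pre-filled with data for tests."""

    def __init__(self, seed: Optional[Iterable[dict]] = None) -> None:
        self._rows: Dict[str, dict] = {u["UserId"]: dict(u) for u in (seed or [])}

    def fetch(self, user_id: str) -> Optional[dict]:
        row = self._rows.get(user_id)
        return dict(row) if row is not None else None

    def put(self, user: dict) -> None:
        if user["UserId"] in self._rows:
            raise KeyError(f"user {user['UserId']} already exists")
        self._rows[user["UserId"]] = dict(user)

    def replace(self, user: dict) -> None:
        if user["UserId"] not in self._rows:
            raise KeyError(f"user {user['UserId']} does not exist")
        self._rows[user["UserId"]] = dict(user)

    def delete(self, user_id: str) -> None:
        self._rows.pop(user_id, None)


class ProfileService:
    """Higher layer: depends only on the interface, never on a concrete store."""

    def __init__(self, users: UserRepository) -> None:
        self._users = users

    def display_name(self, user_id: str) -> str:
        user = self._users.fetch(user_id)
        return user["Username"] if user else "(unknown user)"
```

A MySQL-backed or MongoDB-backed class implementing the same interface could be passed to `ProfileService` without changing the service itself, which is the point of the last bullet.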
For the Caching layer, Indelicato says that he often starts with “simple page (or View, etc) level caching or Service Layer caching as these are two areas where it’s not uncommon to see state change infrequently.”
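One possible shape for such a Service Layer cache is sketched below in Python. The `TtlCache` class and its time-to-live policy are assumptions made for illustration; the post does not prescribe an expiry strategy, and in production the dictionary would typically be replaced by something like Memcached.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TtlCache:
    """Hypothetical time-to-live cache for results that change infrequently."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached value, calling loader only on a miss or expiry."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = loader()
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop an entry, for example after the underlying state changes."""
        self._entries.pop(key, None)
```

A service method could wrap its expensive call as `cache.get_or_load(user_id, lambda: render_profile(user_id))`, where `render_profile` is a placeholder for whatever the service computes.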
Indelicato considers that a Service layer needs to have enough abstraction so one can easily switch the internal implementation of the service with an out-of-process one when the need arises.
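As a rough illustration of that idea, the Python sketch below defines one service interface with an in-process and an out-of-process implementation. All names (`FeedService`, `InProcessFeedService`, `RemoteFeedService`, `render_feed`) and the JSON message shape are hypothetical, and the transport is an injected callable rather than a real network client.

```python
import json
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class FeedService(ABC):
    """The abstraction that callers depend on."""

    @abstractmethod
    def recent_entries(self, user_id: str) -> List[str]: ...


class InProcessFeedService(FeedService):
    """Runs inside the web application's own process."""

    def __init__(self, entries_by_user: Dict[str, List[str]]) -> None:
        self._entries = entries_by_user

    def recent_entries(self, user_id: str) -> List[str]:
        return list(self._entries.get(user_id, []))


class RemoteFeedService(FeedService):
    """Forwards the call to another process as a JSON message.

    The transport (a callable taking and returning a JSON string) is an
    assumption; it could be backed by HTTP, a message queue, and so on.
    """

    def __init__(self, transport: Callable[[str], str]) -> None:
        self._transport = transport

    def recent_entries(self, user_id: str) -> List[str]:
        request = json.dumps({"method": "recent_entries", "user_id": user_id})
        response = json.loads(self._transport(request))
        return list(response["entries"])


def render_feed(service: FeedService, user_id: str) -> str:
    """Caller code is identical for either implementation."""
    return "\n".join(service.recent_entries(user_id))
```

Because `render_feed` only sees the interface, moving the feed logic out of process amounts to constructing a different implementation at startup.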
Some consider that an application can be built without worrying about scalability issues, because those can be addressed when necessary. But if one is to consider scalability from the beginning, what other suggestions can be added?
Community comments
Going with relational is much easier in the beginning
by Philopator Ptolemy,
Going with relational DB is much easier in the beginning. Java has Hibernate, Ruby has ActiveRecord, etc.
These tools are indispensable in getting the project up very fast. For start-up companies this might be very important. Then, later, when (if) the success comes, it might be worthwhile to go the NoSQL route.
Re: Going with relational is much easier in the beginning
by Cameron Purdy,
It's not a question of the database, it's a question of architecture. If the application is not built on top of a partitionable domain model, you'll have to throw the whole thing out and re-do it to get it to scale.
Peace,
Cameron Purdy | Oracle Coherence
coherence.oracle.com/
Re: Going with relational is much easier in the beginning
by Florent Empis,
Funny that you mention this, as Coherence, although a very good product, requires an application to be partially rewritten to use it.
Your main competitor (Terracotta/EhCache, with whom I am by no means associated) does not have this requirement....
Re: Going with relational is much easier in the beginning
by Cameron Purdy,
Hi Florent -
It is hardly a coincidence that my understanding of high-scale architectures and the evolution of Coherence have proceeded together. The solutions in Coherence are based almost entirely on what we've learned from working with large-scale applications, as we turned one-off custom solutions that we helped customers implement into features of Coherence.
That's not correct. First, applications tend to require major overhauls and retrofitting to take advantage of Terracotta. For any significant application, Terracotta is often harder to implement than Coherence.
Second, Terracotta is not one of our main competitors; they're probably #5 or #4 on the list. I do think (with the acquisition of EhCache and work they're doing with that) that they intend to become a significant competitor, and there are definitely some bright engineers there, so I don't doubt that they are capable of doing great things.
Personally, I like that Terracotta took a much different approach to solving scalability problems with their product. There are a lot of "me too" Coherence-like products out there, but Terracotta cut a new path. I don't know how they're doing business-wise, but they seem to be doing alright in this market, which is an accomplishment in itself.
Peace,
Cameron Purdy | Oracle Coherence
coherence.oracle.com/
Re: Going with relational is much easier in the beginning
by Florent Empis,
I don't want to start a flame war and a game of "theirs is bigger than yours"; all products have their up and down-sides and Oracle has a clear history of solid successes.
My understanding of Coherence comes directly from an Oracle representative who agreed that, based on my particular use-case, rewriting costs would be very high with Coherence (in essence, rewriting a lot of DAO code) whereas with Terracotta/EhCache it would (almost) be plug and play. As you said, they took a different approach.
My apologies for the probably too harsh tone of my first comment.
Re: Going with relational is much easier in the beginning
by Florent Empis,
A follow-up question on that. You said they're #4 on the list; what would be #2 and #3?
Re: Going with relational is much easier in the beginning
by Cameron Purdy,
Hi Florent -
The companies that we encounter most often (varying over time, but in general) are: IBM, Gemstone (acquired by VMware) and Gigaspaces. Each seems to have its own set of strengths and weaknesses.
Peace,
Cameron Purdy | Oracle Coherence
coherence.oracle.com/
Re: Going with relational is much easier in the beginning
by Steve Harris,
Hi Florent,
I'm not much for the public posting but I wanted to thank you for the kind words.
Ehcache/Terracotta are experiencing tremendous growth as people are really getting our simple scale message. Being able to provide the ability for our millions of production nodes of Ehcache to scale-out with just 2 lines of config is such a powerful concept. Try it and see for yourself. We are super excited and continue to crank out features that enhance our vision.
Stay tuned...