Surviving Success

A few early, correct design choices can help a cloud project avoid both over-engineering and monumental failure, according to Mark Simms and Mark Souza in their Microsoft Build 2015 session, “Surviving Success: Architecting Web Sites and Services for Rapid Growth”.

They list four categories of design choices: scalability, manageability, availability, and feasibility.

Scalability is the biggest change when moving from a data center to the cloud. You must learn how to add resources efficiently and incrementally instead of overbuying fixed capital resources.

Manageability, achieved through telemetry and application lifecycle management (ALM), is the most critical. Without it, you do not know what your problems are or how to remedy them, you cannot see the effects of new changes, and you cannot roll back bad ones.

Availability means staying up despite the inevitable transient and enduring faults in the application and the underlying services.
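One common way to act on the distinction between transient and enduring faults is to retry the former with backoff and surface the latter immediately. The following is a minimal sketch of that idea, not something shown in the session; `TransientError` and `call_with_retry` are hypothetical names chosen for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying (e.g. a brief timeout)."""


def call_with_retry(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run operation, retrying transient faults with exponential backoff.

    Any other exception is treated as an enduring fault and propagates at once.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the fault to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so that the backoff schedule can be checked in tests without actually waiting.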

Feasibility demands that new features can be continually added on time and under budget without accumulating a great deal of technical debt.

Simms and Souza maintain that unless you devote time, people, and money to working out these non-functional design choices, you will have a production “learning moment”, and it will not be pleasant.

Throughout the talk they continually emphasize the need for telemetry to make educated decisions about your design choices. Examining it has to be part of your daily experience.

They recommend addressing the four design categories first by understanding the nature of your application and decomposing your workload with special attention to state and consistency. Is your system a burst mode application? Must it run consistently for a limited time (e.g. 9 am – 5 pm) without failure? On a mobile platform during a real-time event with concurrent updates, you may need 3-5 seconds of read consistency, while infrequent scheduled updates can take minutes. You need to continually test in production long enough to avoid outrageous surprises.

Given that understanding, trade money for precious engineering time. With limited resources, Platform as a Service (PaaS) is easier than managing virtual machines because many design choices do not have to be made or analyzed.

You must find your breaking point before your users do. Run a destructive load test on the smallest deployment to discover contention points and scale units. You can then decide whether you need performance improvements or whether you can buy time to add new features by adding cloud resources.
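As a rough illustration of what "finding the breaking point" can look like, the sketch below ramps load until the error rate crosses a threshold. It assumes a hypothetical hook, `measure_error_rate(load)`, that drives the smallest deployment at the given load and returns the observed error fraction; the doubling schedule and the 1% threshold are illustrative choices, not figures from the talk.

```python
def find_breaking_point(measure_error_rate, start=50, factor=2,
                        max_load=100_000, max_error_rate=0.01):
    """Increase load geometrically until the error rate exceeds the threshold.

    Returns (last_healthy_load, breaking_load). breaking_load is None if the
    deployment stayed healthy up to max_load.
    """
    load = start
    last_healthy = None
    while load <= max_load:
        if measure_error_rate(load) > max_error_rate:
            return last_healthy, load
        last_healthy = load
        load *= factor
    return last_healthy, None
```

The last healthy load gives a first estimate of what one scale unit can carry; a finer search between the two returned values would narrow it further.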

Then use data to identify bottlenecks and contention points, and with them the places where you need to add more resources. Can you identify features (such as caching) that you might need in the future?
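A simple way to let data point at bottlenecks is to rank components by a high-percentile latency. The sketch below assumes telemetry is available as `(component, latency_ms)` pairs, which is a hypothetical shape chosen for illustration, and uses a nearest-rank percentile.

```python
import math
from collections import defaultdict


def slowest_components(samples, percentile=0.95):
    """Rank components by nearest-rank percentile latency, slowest first.

    samples: iterable of (component, latency_ms) pairs.
    """
    by_component = defaultdict(list)
    for component, latency_ms in samples:
        by_component[component].append(latency_ms)

    ranked = []
    for component, latencies in by_component.items():
        latencies.sort()
        # Nearest-rank: the smallest sample with at least `percentile`
        # of all samples at or below it.
        rank = max(1, math.ceil(percentile * len(latencies)))
        ranked.append((component, latencies[rank - 1]))

    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

Components at the top of the list are the first candidates for more resources, or for a feature such as caching.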

The final question they examine is how to prepare to use multiple data centers. Front-end and middle-tier resources need low state and code replication. The database, which is highly stateful, needs eventually consistent data replication. You need to provide performance- and locality-based DNS routing. Prepare your system to handle multiples of each of the previous items. Get a prescriptive, not imperative, description of the system, which facilitates automatic publishing and named deployment. You can then distinguish new deployments in telemetry data.
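To make the last point concrete, here is a minimal sketch of stamping every telemetry record with a deployment name and region so that deployments can be told apart later. The field names, the JSON-lines format, and the example deployment name are assumptions for illustration, not details from the session.

```python
import json
import time


def make_emitter(deployment_name, region, sink=print):
    """Return an emit() function that tags every record with its deployment."""
    def emit(event, **fields):
        record = {
            "ts": time.time(),              # wall-clock timestamp of the event
            "deployment": deployment_name,  # named deployment, e.g. a release tag
            "region": region,               # data center the record came from
            "event": event,
            **fields,
        }
        sink(json.dumps(record))  # one JSON object per line
        return record
    return emit
```

For example, `emit = make_emitter("web-2015-05-01-a", "west-us")` followed by `emit("request", latency_ms=42)` writes one tagged record, which a query can then group or filter by `deployment`.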

Simms and Souza summarized their talk by emphasizing three points. Use the dynamic cloud environment to buy time to get the functionality right. Use telemetry data as a solid foundation for making choices and solving problems. Use data to help analyze stress situations.
