Key Takeaways
- Cloud-native data is stored and structured in ways that encourage flexibility.
- Getting comfortable with micro-databases requires new levels of automation and self-service.
- Like cloud-native apps, cloud-native data platforms should scale up and scale out.
- How you choose to use, integrate with, and analyze cloud-native data may be different than what you're used to.
You've likely heard about "cloud-native apps." That term refers to software built for change, scale, resilience, and manageability. Oftentimes it's equated with microservices and containers, but those aren't required. Whether running in public or private clouds, a cloud-native app takes advantage of the elasticity and automation offered by the host platform.
But what are the implications for the data capabilities those apps depend on? While cloud-native apps have blueprints like the twelve-factor criteria to steer design, your data services don't.
Below, we look at ten characteristics of cloud-native data and why each one helps you deliver better software. To be sure, you may neither want, nor need, to follow each guidepost. That's fine. But these should give you a sense of what matters most when collecting, storing, and retrieving data in modern systems.
#1 - Cloud-native data is ... stored in many ways.
Fifteen years ago, where did you store your data? In all likelihood, you had a local or network attached file system, and a relational database to work with. You put binary content on file storage, and transactional data in a normalized database. That was it. Today, your cloud-native data is shaped in all different ways and resides in a number of places.
The options seem endless. Cloud-native data might sit in an event log, relational database, document or key-value store, object store, network-attached storage, cache, or cold storage. The ideal choice depends on the situation. Storing media files where durability matters? Use object storage. Saving unused transactional data for a regulatory-required duration? Stash it in cold storage. Serving up product catalog information to a high-traffic web property? Look to a cache or read-oriented key-value store. Considerations for latency, read performance, durability and the like will help you narrow down your choices.
Be aware that in cloud-native systems, the unified log often becomes the system of record. Materialized views show you a representation of that data for a given purpose. This is a different way of thinking about data storage, and for many, it turns the idea of a database inside out! The unified log holds individual transactions from your various inputs. Those items may inflate into objects or records in your applications or cache. This may be a new way for you to store data, but it’s proven to be an excellent solution at scale.
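To make the log-as-system-of-record idea concrete, here is a minimal in-memory sketch. The `Event`, `UnifiedLog`, `balance_view`, and `activity_view` names are hypothetical, invented for illustration; a real system would use a durable, partitioned log rather than a Python list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One immutable entry in the log (hypothetical shape)."""
    account: str
    kind: str    # "deposit" or "withdrawal"
    amount: int  # whole currency units, to keep the arithmetic exact


class UnifiedLog:
    """Append-only list standing in for a durable, ordered log."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new entry

    def read(self, from_offset=0):
        return self._events[from_offset:]


def balance_view(log):
    """Materialized view: balance per account, rebuilt by replaying the log."""
    balances = defaultdict(int)
    for event in log.read():
        sign = 1 if event.kind == "deposit" else -1
        balances[event.account] += sign * event.amount
    return dict(balances)


def activity_view(log):
    """A second view over the same log: number of events per account."""
    counts = defaultdict(int)
    for event in log.read():
        counts[event.account] += 1
    return dict(counts)


log = UnifiedLog()
log.append(Event("alice", "deposit", 100))
log.append(Event("bob", "deposit", 50))
log.append(Event("alice", "withdrawal", 30))

print(balance_view(log))   # {'alice': 70, 'bob': 50}
print(activity_view(log))  # {'alice': 2, 'bob': 1}
```

Neither view is the source of truth. Either one can be thrown away and rebuilt from the log, which is what makes it practical to add a new view for a new purpose later.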
That said, you don't have to throw out your trusty relational database. Instead, reassess how you use it. For example, if you've been using your relational database for application session state, consider introducing something like Redis and get familiar with key-value stores.
At the same time, introduce modern relational databases like Google Cloud Spanner that are designed for geographic resilience and cloud-scale performance on demand. Or welcome in NoSQL databases optimized for fast lookups and high availability. Object storage is an easy service to get started with. Get away from (local) file system dependencies where possible and refactor applications to use externalized storage instead.
#2 - Cloud-native data is ... independent of fixed schemas.
Overwhelmingly, you see cloud-native apps and services working with data in JSON format. That said, you may also use protocol buffers or classic XML to structure your data. Regardless, cloud-native apps prioritize adaptability. That means making it easy to accommodate change. That's hard to do if you have to constantly alter the table structure or regenerate typed classes when the shape of your data changes.
To the point above, don't be afraid to diversify your storage options. If you require all data to conform to a fixed schema and slot neatly into SQL Server columns, you're unnecessarily constraining yourself. Consider reducing heavy use of ORMs and typed classes that make changes difficult.
If you like using structured data schemas, consider using a well-designed service facade in front of the database. To add new features or handle data changes, you can version the API.
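One way to picture a versioned facade is the sketch below. Everything in it is an assumption made for illustration: the `_STORE` dictionary stands in for the database, and `get_customer_v1` and `get_customer_v2` stand in for versioned API endpoints.

```python
# Stored records are schema-flexible dicts; newer records carry extra fields.
_STORE = {
    1: {"name": "Ada Lovelace", "email": "ada@example.com"},
    2: {"first_name": "Alan", "last_name": "Turing",
        "email": "alan@example.com", "phone": "555-0100"},
}


def _full_name(record):
    """Handle both the old single-field shape and the newer split-name shape."""
    if "name" in record:
        return record["name"]
    return f"{record['first_name']} {record['last_name']}"


def get_customer_v1(customer_id):
    """v1 contract: name and email only. Stays stable as storage evolves."""
    record = _STORE[customer_id]
    return {"name": _full_name(record), "email": record["email"]}


def get_customer_v2(customer_id):
    """v2 contract: adds an optional phone field without breaking v1 callers."""
    response = get_customer_v1(customer_id)
    response["phone"] = _STORE[customer_id].get("phone")
    return response


print(get_customer_v1(2))  # {'name': 'Alan Turing', 'email': 'alan@example.com'}
print(get_customer_v2(1))  # {'name': 'Ada Lovelace', 'email': 'ada@example.com', 'phone': None}
```

Callers depend on the contract, not on the stored shape, so the underlying records can change without forcing every consumer to change at the same time.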
#3 - Cloud-native data is ... duplicated.
What's one thing we're taught when learning software engineering? Don't repeat yourself. It's valuable guidance, but when looking at the data in your cloud-native apps, it may not rest in one place.
You'll find some excellent resilience in public cloud databases. Services like Amazon RDS make it easy to create asynchronously updated read-replicas. Microsoft Azure SQL Database supports geo-distribution. The data is mastered in one place, but duplicated elsewhere. When working with NoSQL-style databases, there's no "master" server that stores everything; data is replicated across a ring of machines. This provides resilience, but often sacrifices consistency to make it happen.
As you introduce more vigorous caching, you'll find yourself doing both reads and writes into the cache. You may write-behind to the system of record, but again, the same data is spread around. Caching itself is a form of data duplication and this copy gives you better application performance and resilience.
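The write-behind pattern can be sketched in a few lines. This is an illustrative, single-process model only: the `WriteBehindCache` class is hypothetical, a plain dictionary stands in for the system of record, and `flush` is called by hand where a real cache would flush on a timer or queue.

```python
class WriteBehindCache:
    """Serve reads and writes from memory; persist to the record store later."""

    def __init__(self, system_of_record):
        self._record = system_of_record  # dict standing in for the database
        self._cache = {}
        self._dirty = set()              # keys written but not yet flushed

    def get(self, key):
        # On a miss, pull the value from the system of record and keep a copy.
        if key not in self._cache and key in self._record:
            self._cache[key] = self._record[key]
        return self._cache.get(key)

    def put(self, key, value):
        # The write lands in the cache first; the database is updated later.
        self._cache[key] = value
        self._dirty.add(key)

    def flush(self):
        """Copy pending writes to the system of record; return how many."""
        flushed = len(self._dirty)
        for key in self._dirty:
            self._record[key] = self._cache[key]
        self._dirty.clear()
        return flushed


database = {"sku-1": 10}
cache = WriteBehindCache(database)

cache.put("sku-1", 7)
print(cache.get("sku-1"), database["sku-1"])  # 7 10  (the two copies disagree)
cache.flush()
print(cache.get("sku-1"), database["sku-1"])  # 7 7
```

Between `put` and `flush`, the same key holds two different values in two places. That window is the consistency cost you accept in exchange for faster writes.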
Data often flows from the edge to internal systems. That data may move untouched from devices, through cloud gateways, to apps, and to data stores. Or, that data may get saved, filtered or aggregated as it gets processed. That data may not disappear from those intermediate stages. It may be used later for comparison or calculation.
While "system of record" is very important, your cloud-native data might be duplicated many times over for processing, caching and multi-cloud storage purposes.
#4 - Cloud-native data is ... integrated via service interfaces.
Take a look at the documentation for cloud-first databases like Google Cloud Spanner or Amazon DynamoDB. Notice something? The APIs are web service (REST) based. No drivers, no fixed IP addresses. To be fair, those same cloud providers offer traditional relational databases (e.g. Google Cloud SQL, Amazon RDS) that use standard client applications, drivers, and host names to interrogate the database. But you see a trend towards accessing cloud data through service APIs, not low level access to known servers and raw schemas. When you extract data from Salesforce.com, you have no option to hit the underlying Oracle database. Rather, you go through a carefully designed service API that governs usage and shapes data.
There's a whole new crop of integration platforms that cater to cloud endpoints. Microsoft offers Logic Apps, Dell delivers Dell Boomi, and there are even more consumer-friendly tools like IFTTT. What all of these have in common is that they connect to a host of different cloud systems and integrate them via service interfaces. As you design your cloud-native data strategy, think about how to reveal data endpoints on your applications.
#5 - Cloud-native data is ... self-service oriented.
One could argue that the primary reason cloud computing took off a dozen years ago is because of self-service. Enterprise developers were no longer held hostage by arcane corporate rules for acquiring hardware. And startups didn't have to make massive capital investments to experiment with business ideas.
A cloud-native data platform supports on-demand provisioning and self-service configuration. That's non-negotiable. What's the point in having cloud-native apps that get continuously deployed and scale at a moment's notice if your corresponding data store can't keep up? Cloud-native data is stored in databases, caches, and file stores that can be provisioned with ease and scaled automatically or with simple API calls. Loading or extracting data is done via known APIs. We can't ignore the need for shared identity, access, and storage policies, but these should be part of automated provisioning or capable of being audited after the fact.
While the public cloud has set the bar for cloud-native data storage, it's not impossible to get these capabilities on site as well. If we assume that operating "cloud native" means software-run-by-software, then any on-premises data product must function as a platform.
#6 - Cloud-native data is ... isolated from other tenants.
For performance, agility, and security reasons, cloud-native data isn't stashed in a single shared instance.
We're all used to building massive database instances designed to store all the things. But shared capacity--regardless of how much--is dangerous. Noisy neighbors have a cascading effect on everyone else. All tenants are stuck with the same software upgrade window and disaster-recovery strategy. From a security perspective, it's straightforward to add users with permissions on specific database objects. But as the number of tenants grows, you end up with a web of access control rules that may lead to elevated permissions for those who don't need them.
Cloud-native data supports per-tenant database instances. Whether on a shared cloud platform or on-premises environment, these databases are allocated to services and applications, not entire enterprises. This means that teams have control over when and how their upgrades happen, the capacity vectors to scale, and who exactly has access. These micro-databases offer greater agility as each team can pick the database engine and deployment model that makes the most sense for their application or service.
#7 - Cloud-native data is ... at home on managed platforms.
Cloud-native: it’s software run by software. Platforms are the critical piece, especially for databases. It's the only way to effectively manage a growing set of database instances.
What does a managed database platform offer you? First, it's about installation and configuration. No longer are you lovingly installing Microsoft SQL Server by hand on a carefully constructed cluster. That's error prone and time consuming. Developers and app teams need to click a button or make an API call to get a properly configured database instance anywhere they want.
The second thing a managed platform offers you is "day 2" management. This means built-in monitoring, infrastructure scaling, patching, version upgrades, and failover recovery. Need a read-replica? It takes a few moments using Amazon RDS. Need high availability? Microsoft Azure Cosmos DB handles node (or region) failure without requiring changes to code. These aren't nice-to-have features. They represent the way that leading companies store and access cloud-native data.
#8 - Cloud-native data is ... not afraid of scale (out).
Besides thinking "flexibility" when you hear the word "cloud", you also probably think of the word "scale." The Internet is littered with stories of web companies and startups processing billions and trillions of data points. While you may not face that level of scale today, you should be planning for it. And just like cloud-native apps, your data capabilities should focus on scaling out, not up.
Devices throw off unprecedented amounts of data. The hardware and software in your data centers emit diagnostic information. Services in the cloud now eagerly discharge events when things happen. Business apps generate and consume all sorts of data. As you embrace a cloud-native data approach, you learn to expect increasing volumes of data.
You're faced with more data, coming in more quickly. Cloud-native data flies through real-time messaging or event-stream systems and gets stored by the petabyte. To make this a reality, you need to ensure that your messaging middleware is designed for bursts and always-on availability. Traditionally, this meant provisioning giant clusters up front. In a cloud-native world, your underlying platform should scale the messaging tier out as demand dictates.
Your databases (and data microservices) must be capable of absorbing constant updates in small and large batches. This may impact the way you design an RDBMS schema, or your choice of a schema-less option for intensive workloads. Instead of using a massive, single database instance, you’ll want to consider databases that also scale out to more instances. Scaling out your databases may come with transactional tradeoffs, but in exchange, you get better agility. That means smaller initial footprints, and readiness for inevitable instance (and site!) failures.
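As a rough illustration of scaling out rather than up, the sketch below spreads keys across several small stores using a hash of the key. The `shard_for` function and `ShardedStore` class are hypothetical, and dictionaries stand in for separate database instances; distributed databases typically handle this routing for you.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index with a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Several small stores instead of one large one."""

    def __init__(self, shard_count):
        self._shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        self._shards[shard_for(key, len(self._shards))][key] = value

    def get(self, key):
        return self._shards[shard_for(key, len(self._shards))].get(key)

    def sizes(self):
        """Number of keys held by each shard."""
        return [len(shard) for shard in self._shards]


store = ShardedStore(shard_count=4)
for order_id in range(1000):
    store.put(f"order-{order_id}", {"id": order_id})

print(store.get("order-42"))  # {'id': 42}
print(sum(store.sizes()))     # 1000
```

One tradeoff is visible even in this toy version: an operation that touches keys on two shards is no longer a single local transaction. Simple modulo placement also means that changing the shard count moves most keys, which is why production systems often use consistent hashing or range-based partitioning instead.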
#9 - Cloud-native data is ... often used and discarded.
Purging data can be a mental hurdle to get over. We like storing data "just in case." But while cloud-native data is friendly to scale (see above), it's also about temporary usage.
To be sure, plenty of your cloud-native data is persisted indefinitely. However, you'll notice that an increasing amount of data is processed (somewhere) and dropped. Maybe it's aggregated at the edge and a summary event is passed on to an on-premises system. Or it's used in a time-sensitive window to look for server performance anomalies and deleted afterwards. And it could be on-the-fly shopping recommendations generated by a machine-learning model and removed when the shopper leaves the site.
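The server-anomaly case above can be sketched with a fixed-size window that forgets old readings automatically. The `LatencyWindow` class, the window size, and the "more than twice the recent average" rule are all assumptions chosen for illustration, not a recommended detection method.

```python
from collections import deque
from statistics import mean


class LatencyWindow:
    """Keep only the most recent readings; everything older is discarded."""

    def __init__(self, size=5, threshold=2.0):
        self._readings = deque(maxlen=size)  # old readings fall off automatically
        self._threshold = threshold

    def observe(self, latency_ms):
        """Return True if the reading is far above the current window average."""
        is_anomaly = (
            len(self._readings) == self._readings.maxlen
            and latency_ms > self._threshold * mean(self._readings)
        )
        self._readings.append(latency_ms)
        return is_anomaly

    def retained(self):
        return len(self._readings)


window = LatencyWindow(size=3, threshold=2.0)
flags = [window.observe(ms) for ms in [100, 110, 90, 105, 400, 120]]

print(flags)              # [False, False, False, False, True, False]
print(window.retained())  # 3
```

Six readings came in, but only three are ever held. The data did its job inside the window and was dropped, with no storage medium to plan for.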
Don't feel the need to store everything. Recognize that you're now dealing with more transient data than ever, and figure out the necessary lifespan before planning its storage medium.
#10 - Cloud-native data is ... analyzed in real-time and batch.
Streaming data is all the rage, but research firm Gartner says that 85% of enterprises still favor batch-oriented techniques. That number will decrease over time, but cloud-native data needs both real-time and batch processing cycles.
Cloud-based streaming engines like AWS Kinesis or Azure Event Hubs make it simple to process an unbounded set of events. Customers use these engines to detect fraud, update pricing, or reveal performance issues for users. But these engines also pump data into data warehouses where more sophisticated analytics occur in batch. There's a place for both spot analysis and more thoughtful analysis with the same data.
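The two styles of analysis can run over the same events, as the sketch below suggests. The event shape, the `detect_fraud` and `daily_totals` functions, and the fixed amount limit are hypothetical; a real deployment would read from a streaming engine and a warehouse rather than a Python list.

```python
def detect_fraud(events, limit):
    """Real-time path: inspect events one at a time as they arrive."""
    for event in events:
        if event["amount"] > limit:
            yield event["id"]


def daily_totals(events):
    """Batch path: aggregate the full retained set in one pass."""
    totals = {}
    for event in events:
        totals[event["day"]] = totals.get(event["day"], 0) + event["amount"]
    return totals


events = [
    {"id": "t1", "day": "2024-01-01", "amount": 40},
    {"id": "t2", "day": "2024-01-01", "amount": 9000},
    {"id": "t3", "day": "2024-01-02", "amount": 15},
]

# The streaming consumer sees an iterator and never needs the whole set.
print(list(detect_fraud(iter(events), limit=1000)))  # ['t2']

# The batch job runs later over everything that was retained.
print(daily_totals(events))  # {'2024-01-01': 9040, '2024-01-02': 15}
```

The generator makes a decision per event without waiting for the rest, while the batch function needs the complete set before it can answer.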
Summary
It's early days for figuring out what cloud-native data looks like. How do we bring legacy data stores into cloud-native apps? What's the right approach for dealing with multi-cloud data needs? Where do lock-in and portability come in? There aren't answers to all these questions yet, but this article took a first look at how cloud-native concepts apply to the data world. In the months and years ahead, I suspect that we'll spend significantly more time exploring this area.
About the Author
Richard Seroter is a Senior Director of Product at Pivotal, with a master's degree in Engineering from the University of Colorado. He's also a 10-time Microsoft MVP, trainer for developer-centric training company Pluralsight, speaker, the lead InfoQ editor for cloud computing, and author of multiple books on application integration strategies. Richard maintains a regularly updated blog on topics of architecture and solution design and can be found on Twitter as @rseroter.