InfoQ Homepage Articles The Next Evolution of the Database Sharding Architecture

AI, ML & Data Engineering

The Next Evolution of the Database Sharding Architecture

This item in japanese

Bookmarks

Jan 12, 2022 12 min read

InfoQ Article Contest

Share your knowledge Win a ticket to a QCon event or an InfoQ Dev SummitFind out more

Key Takeaways

Data sharding is the breakdown of data spread across multiple computers, either as horizontal or vertical partitioning. On the other hand, data partitioning is when the database is broken down into different subsets but held within a single database instance.
Database Plus is a concept for creating a distributed database system for more than sharding, positioned above DBMS.
Apache ShardingSphere is a distributed database middleware created to solve data sharding issues. It also supports data encryption, shadow database, distributed authentication, and distributed governance.
ShardingSphere’s architecture includes four layers: Foundation, Storage, Function and Solution layers.
ShardingSphere also supports DistSQL (distributed SQL), a SQL dialect to operate and manage all functions of the ShardingSphere.

With mobile phones and the internet becoming what some would call a daily necessity, it's not uncommon for websites and business services to receive billions of visits on a weekly basis - and that’s not all.

Sales days like Black Friday in North America or Double Eleven (aka Singles Day) in Asia are excellent examples of traditional retail enterprises adapting to the digital world. These enterprises must now contend with new needs and challenges, to successfully achieve their business goals.

They all have to answer the same question: we need to drive our digital sales on this Black Friday, but when we succeed and this incredible traffic reaches the database cluster, will our databases be able to handle it?

Different business cases have multiple options when it comes to database solutions. Such options range from NoSQL products (e.g., MongoDB, Cassandra, Amazon DynamoDB, etc.), to NewSQL products (like Amazon Aurora or CockroachDB, which are popular these days).

Besides these great solutions, some industries will also consider transparent sharding on top of existing database clusters.

According to the database trend ranking DB-Engines, although many new database offerings are hitting the market, the traditional relational database still retains a considerable share of the pie.

Considering the new challenges databases are facing, is there an efficient and cost-effective way to leverage these types of databases and enhance them through some new practical ideas? Database transparent sharding is one of the best answers to this question.

Database popularity ranking on DB-Engines

One of the best techniques for this is to split the data into separate rows and columns. This splitting of large database tables into multiple small tables are known as shards. The original table is divided into either vertical shards or horizontal shards. Terminologies used to label these tables can be subjective to ‘VS1’ for vertical shards and ‘HS1’ for flat shards. The number represents the first table or the first schema. Then 2 and 3, and so on. These subsets of data are referred to as the table's original schema.

So what is the difference between sharding and partitioning? Both sharding and partitioning include breaking large data sets into smaller ones. But a key difference is that sharding implies that the breakdown of data is spread across multiple computers, either as horizontal or vertical partitioning. On the other hand, partitioning is when the database is broken down into different subsets but held within a single database, sometimes referred to as the database instance.

Since for sharding data is divided into numerous pieces stored across different machines, this approach offers the following advantages:

It makes a standalone DBMS become a distributed system.
Expands the storage capacity of the database system.
Traffic is shared evenly across different shards.
High availability.

However, sharding architecture is not perfect and has some drawbacks:

Complex route/query topology.
Complicated management of distributed database clusters.
Query overhead.
Incomplete support for native SQLs. Adopting a sharding architecture on a pre-existing distributed database system, might cause SQL compatibility issues. Previously functioning SQLs might not run successfully in the newly created sharding database.

Sharding: One to multiple shards

Like for most things in technology, let alone in life, there is no silver bullet. You should perform a thorough analysis to have a complete picture of your needs and scenarios, and only then move on to choose the best possible solution.

Generally, the advantages of sharding architecture prevail, and many excellent products that play an essential role in the database industry are based on this architecture. Citus or Vitess have their respective definitions, but they're based on database-sharding architecture in nature.

Citus manages a coordinator (proxy) cluster to distribute the PostgreSQL cluster, while Vitess shards MySQL alike. Both of them concentrate on providing a low-cost and efficient distributed solution of traditional but prevailing relational databases. Actually, sharding architecture is fundamental for most NoSQL and NewSQL products as well, but that would be another topic focusing on sharding with NoSQL and NewSQL. This article focuses on sharding with relational databases since there are some innovations brought to the classic sharding technique.

The occurrence of sharding is the result of the distributed needs of databases. These days an increasing number of new issues involve databases, such as privacy protection, SQL audit, tenant, distributed authentication, etc.

These represent real-world new demands for databases. How to deal with these issues is an inevitable question for all database products, no matter the type of database. Could these issues be addressed with a database sharding solution? It looks like sharding needs to evolve to meet these challenges, which is our topic i.e., what's the next evolution for database sharding architecture.

My answer is Database Plus, the guiding concept for creating a distributed database system for more than sharding, positioned above DBMS.

It was conceived with the aim to build a standardized layer and ecosystem above existing and fragmented databases, as well as provide a unified and standardized database usage specification. This provides for upper-level applications, and the challenges faced by businesses due to underlying databases’ fragmentation get minimized as much as possible. The result is an environment where the applications only need to speak with a standardized service, instead of different ones for each database.

This idea was initiated by the PMC (Project Management Committee) of Apache ShardingSphere, and it took about a year to release 5.0.0 GA and implement this concept in its architecture.

In the 3.x and 4.x release stages, we defined Apache ShardingSphere as a distributed database middleware (sharding architecture) to only solve the sharding issue. However, new challenges for databases and the community pushed this project to evolve and include more features such as data encryption, shadow database, distributed authentication, distributed governance, etc. All of these changes go beyond the traditional sharding range, as sharding is just one part of Database Plus.

The evolution of ShardingSphere's Database plus architecture

Apache ShardingSphere's example supports my thesis that a simple and classic sharding architecture can do more than sharding. The kernel mechanism directs all traffic through a proxy or driver, and then if it could parse SQLs and know the location of every database, the following jobs will be easy to perform:

Understand users’ expectations for data.
Hijack traffic and modify it.
Reroute this modified query to a certain database.
Merge or change the metadata and data of a result.
Monitor the status of computing nodes (Proxies) and storage nodes (Databases).
Collect the changes or routine information of this distributed system.
Give personalized recommendations using Machine Learning (ML) techniques.

So what do these jobs mean to the end users? Based on these kernel jobs, the products of Apache ShardingSphere are qualified to ease users’ database pain points.

Originally sharding, data encryption, shadow database, distributed authentication, distributed governance, etc. were all based on the necessary steps above. The architecture proposed by the Database Plus concept of Apache ShardingSphere brings these enhancing features with flexibility in mind.

All functions are just plugins that can be added or removed at any given time in this distributed system. Some people may just want to shard a database, while others may prefer to do data encryption. Users’ needs never stop evolving and are diverse, and for this reason Database Plus can be fully customizable and continuously receive new plugins (features) allowing it to meet user’s demands one by one specifically and flexibly.

Architecture

ShardingSphere’s architecture includes the following four layers as shown in Figure 1 below.

ShardingSphere's Four Layer Architecture

Foundation Layer: Provides a variety of access terminals such as driver or proxy to flexibly meet the needs of users in different scenarios.

Storage Layer: All the functions are supported in these databases with the possibility to include more.

Function Layer: Provides a variety of functional plug-ins that meet users' needs, allowing a high degree of flexibility in plug-in choice and combination.

Solution Layer: End users are provided with industry-oriented (e.g. financial, e-commerce and entertainment industries) and specific scenario-oriented standard product solutions (e.g. distributed database solution, encrypted database solution or database gateway).

Multi-Access Terminal Mixed Mode Production Available

ShardingSphere JDBC and ShardingSphere Proxy have been polished and tested for five years and are now available in production. Many community users provided relevant production cases, and production feasibility has been verified.

With the shared core functions among different ShardingSphere clients, users can also choose hybrid deployment to achieve balance between query performance and management convenience (shown in Figure 2 below).

ShardingSphere JDBC and Proxy Hybrid Development

Standardized Cluster Management with DistSQL

The Apache ShardingSphere community proposed a SQL dialect, i.e., DistSQL (distributed SQL), to operate and manage all functions of ShardingSphere.

SQL is the standard and conventional interaction method with databases. However there are many new features in this distributed database system which required us to think of a SQL dialect to configure and use these new functions.

DistSQL allows users to use SQL-like commands to create, modify or delete a distributed database and table, or to encrypt or decrypt data. All of the above mentioned functions could be performed with distributed SQL. Some DistSQL snippets are presented below.

DistSQL in Action

Distributed Governance Capability

Distributed database system governance capability is necessary to alleviate the suffering of distributed cluster management. In the ShardingSphere ecosystem where computing and storage are separated, features are greatly enhanced in the new version including:

the distributed governance of databases (i.e., storage nodes) and Proxy/JDBC (i.e. computing nodes),
online user metadata DDL change,
on/off running storage nodes and computing nodes,
circuit breaker and disabling, and
high availability.

Additionally, the distributed lock new feature is scheduled to be released soon.

ShardingSphere’s Distributed Governance

Pre-Deployment Notice

Although many advantages have been listed above, there are some constraints or limitations that are worth mentioning. You should give careful consideration to the following items before adopting ShardingSphere::

Some complex native SQLs, especially SQLs joining different shards, cannot work well or will take longer than SQL executed in DBMS. A business oriented sharding strategy is recommended to prevent this from happening, to avoid such type of complex SQLs.
Proxy mode produces network overhead. For this reason, I recommend hybrid adapter deployment.
Distributed transactions will significantly lower transactions per second (TPS) or queries per second (QPS).
It's challenging to get a consistent point-in-time snapshot across shards.

Practical Examples

This section will introduce two practical examples to demonstrate how create a distributed database and create an encrypted table with DistSQL - the SQL dialect connecting all the elements of the ShardingSphere ecosystem.

Distributed Database Solution

This part will guide you through an example of how to leverage DistSQL to create a distributed database. Users and applications visit Proxy to achieve a logic table (distributed table) which has been sharded among different servers. There is no need to take care of these shards, instead make your applications operate and manage this logical table.

Prerequisites:

Deploy MySQL instance and create two MySQL databases
Deploy a ShardingSphere Proxy

Process:

mysql -h127.0.0.1 -uroot -P3307 -proot

ADD RESOURCE ds_0( HOST=127.0.0.1, PORT=3306, DB=demo_ds_0, USER=root, PASSWORD=root );

ADD RESOURCE ds_1 ( HOST=127.0.0.1, PORT=3306, DB=demo_ds_1, USER=root, PASSWORD=root );

Create sharding rule by distSQL

CREATE SHARDING TABLE RULE t_order( RESOURCES(ds_0,ds_1), SHARDING_COLUMN=order_id, TYPE(NAME=hash_mod,PROPERTIES("sharding-count"=4)), GENERATED_KEY(COLUMN=order_id,TYPE(NAME=snowflake,PROPERTIES("worker-id"=123))) );

Create sharding table by the previous sharding rule

CREATE TABLE `t_order` ( `order_id` int NOT NULL, `user_id` int NOT NULL, `status` varchar(45) DEFAULT NULL, PRIMARY KEY (`order_id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Show resources, sharding databases and sharding tables

sql SHOW SCHEMA RESOURCES;

SHOW DATABASES;

SHOW TABLES;

Show sharding tables

SHOW TABLES;

Below are the tables in MySQL:

And below are the tables in ShardingSphere Proxy:

Drop sharding table

DROP TABLE t_order;

Data Encryption Example

This example shows you how to create an encrypted table with DistSQL. The data encryption feature is ShardingSphere Proxy, which helps encrypt and decrypt data. Applications do not need any coding refactoring, and just send the plaintext to Proxy, where plaintext is encrypted and resends the ciphertext to the databases. Additionally, users can configure which column in which table should be encrypted by which encryption algorithm.

Prerequisites:

Deploy MySQL instance and create two MySQL databases.
Deploy a ShardingSphere Proxy.

Process:

mysql -h127.0.0.1 -uroot -P3307 -proot

Add Resources by distSQL.

ADD RESOURCE ds_0 ( HOST=127.0.0.1, PORT=3306, DB=ds_0, USER=root, PASSWORD=root );

Create encrypt rule

CREATE ENCRYPT RULE t_encrypt ( COLUMNS( (NAME=user_id,PLAIN=user_plain,CIPHER=user_cipher,TYPE(NAME=AES,PROPERTIES('aes-key-value'='123456abc')))));

SHOW ENCRYPT TABLE RULE t_encrypt;

Create encrypt table

CREATE TABLE `t_encrypt` ( `order_id` int NOT NULL, `user_plain` varchar(45) DEFAULT NULL, `user_cipher` varchar(45) DEFAULT NULL, PRIMARY KEY (`order_id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Below is the result in MySQL:

Insert a data into this table

INSERT INTO `t_encrypt` VALUES(1,"abc");

What exactly happened in MySQL:

Alter encrypt rule

ALTER ENCRYPT RULE t_encrypt ( COLUMNS( (NAME=user_id,PLAIN=user_plain,CIPHER=user_cipher,TYPE(NAME=MD5)) ));

SHOW ENCRYPT RULES;

Drop encrypt rule

DROP ENCRYPT RULE t_encrypt;

A database distributed system including sharding, encryption and other additional features, positioned above DBMS is a practical and efficient way to meet users’ continuously changing needs, at a low cost. Such a solution would eliminate worries about instability, and heavy workloads caused by the adoption of a totally new distributed database.

As a ShardingSphere PMC (Project Management Committee) member myself I may appear biased, but it is also true that I chose to contribute to this open source project because of its great innovation potential to solve real-world database related problems and production scenarios.

In my professional career I have been part of companies managing and leveraging enormous amounts of data, in one of the societies with the highest internet penetration in the world. I was well aware of the challenges that data spikes were creating, and the gap between production needs and readily available database solutions.

I am not saying that Database Plus is the best and only way to solve the new challenges in cloud era, but I’d recommend it as a possible and innovative solution.

One last word from me goes to sharding. Sharding is one of the many ways to solve the new challenges created by the internet evolution. Some specialists might say that sharding database architecture is out of fashion, but that couldn’t be farther from the truth.

It might not sound as fancy, or have all the bells and whistles of other solutions, but it is certainly effective and practical.

Recently it has received significant new innovative contributions that have advanced sharding beyond what was imaginable not so long ago, and maybe that's why it has been growing in popularity among blockchain companies looking to achieve scalability.

About the Author

Trista Pan

Show moreShow less

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?

The Next Evolution of the Database Sharding Architecture

InfoQ Article Contest

Key Takeaways

Related Sponsored Content

Architecture

Multi-Access Terminal Mixed Mode Production Available

Standardized Cluster Management with DistSQL

Distributed Governance Capability

Pre-Deployment Notice

Practical Examples

Distributed Database Solution

Data Encryption Example

About the Author

Trista Pan

Rate this Article

This content is in the AI, ML & Data Engineering topic

Related Topics:

Related Editorial

Popular across InfoQ

The InfoQ Newsletter