Distributing Complex Services in Cross-Geolocational IDCs
Tencent is one of the biggest Internet companies in China, mainly covering instant messaging and gaming. Tencent's flagship product, QQ instant messaging, has over 800 million monthly active users (according to a Tencent report in 2013 Q3), and Qzone (qzone.qq.com) is a social networking platform for QQ users with over 600 million monthly active users.
This interview was first published on InfoQ China as a video interview.
A brief note on terminology
IDC refers to Internet Data Centre. In China this refers to a specific type of data center operated by telecom carriers, because Internet infrastructure is a government-controlled industry. Even Tencent and Alibaba don't build their own data centers; government regulations require them to rent space in the carriers' IDCs.
InfoQ: We have heard a lot about "SET" in Qzone recently. Can you explain what the term means?
Sun: A simple explanation: a SET can be seen as a container, a box containing stuff. A SET in Qzone is a box with a certain set of Qzone services, and the box can be taken for distributed deployment.
The reason we came up with SET is a long story. Qzone started off in 2005 with 100-200 machines, and by 2006-2007 the service had scaled to over 1,000 machines. Our data centers didn't expand as fast as the service, so eventually a single Qzone service - from the logical layer to the data layer - might be scattered across various IDCs in Shenzhen. As a consequence, our service became very dependent on inter-IDC private networking, and whenever the network was unstable - say a fibre got cut by an excavator (which is quite common in China) - our service would become unavailable.
To make things worse, we are a social platform with vast numbers of connected users, and requests fan out. For example, pulling your friend-info list scatters into hundreds of downstream requests. This made us even more dependent on our private networking. In 2006-2007, whenever the network was unstable - whether from a broken switch or a damaged fibre - the entire Qzone service could become inaccessible. There were many complaints about not being able to visit Qzone around that time.
So the problem was clear, and we decided to optimize our deployment, which is basically what SET is all about. We wanted to reduce our dependency on private networking: a single SET should be able to provide all the core services of Qzone, and each SET should be relatively independent of the others.
This is the background and objective for SET. There were two key points in designing a SET:
1. How large a SET should be. It was impossible for a SET to contain all of Qzone because Qzone was too big - we had basic pages, blogs, photos, service platforms, etc. We had to draw a line. We eventually decided that a SET should contain the landing page: it has the most visits of all the pages, and it is the entry point for all the other Qzone services. So we defined the SET to contain the user's Qzone status, relationships, access privileges, and the underlying related data.
In 2007 there was a big change to Qzone. The old Qzone was guest-oriented and focused on page decorating; the new Qzone was friend-status oriented and focused on relationships. So we added friend feeds into our SET as well, and the line was drawn.
2. How many users a SET should serve. Serving more users from a single SET means we don't have to scale often, but each scaling step is big. After some calculations, we decided a SET should serve as many users as a single switch could: at that time, a switch carried around 400 servers, serving about 5 million online users. So this became the standard SET size.
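The sizing arithmetic above can be made explicit. A minimal back-of-the-envelope sketch, using only the numbers quoted in the interview (400 servers per switch, 5 million online users per SET):

```python
# SET sizing as described: one top-of-rack switch carries ~400 servers,
# which together serve ~5 million concurrent online users.
SERVERS_PER_SWITCH = 400
ONLINE_USERS_PER_SET = 5_000_000

# Implied load per server within a standard SET.
users_per_server = ONLINE_USERS_PER_SET / SERVERS_PER_SWITCH
print(users_per_server)  # 12500.0 concurrent online users per server
```

This works out to roughly 12,500 concurrent users per server, which gives a feel for why a SET was a convenient unit of capacity planning.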
InfoQ: I also heard you are implementing an un-SET process?
Sun: Yes, we are currently removing some SETs because there are too many of them. Since we started doing SET in 2008, we saw the benefits - the cost of scaling and of operations decreased, and the quality of service increased - so we used SET extensively, and the number of SETs quickly rose to over 30.
The problem with having over 30 SETs was that they were scattered across IDCs in three different geographic regions, and configuration management became difficult. Each piece of data must have a copy in every single SET; which copy a user accessed depended on the router he hit, so the biggest challenge was router configuration management. Our solution was to make the SETs invisible to the front-end: the front-end only needs to know one common name for the service, and the back-end routes the request to the nearest SET. We call this process un-SET in the sense that the front-end no longer needs to care about which SET it visits.
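The un-SET idea can be sketched in a few lines. This is a hypothetical illustration, not Qzone's implementation: the SET names, regions, and distance table below are all made up; the point is only that the front-end passes a logical service name and a routing layer picks the nearest SET holding a replica.

```python
# Hypothetical "un-SET" name-based routing: the front-end knows only a
# logical service name; the routing layer picks the nearest SET replica.
SET_LOCATIONS = {
    "set-sz-1": "shenzhen",   # southern China
    "set-sh-1": "shanghai",   # eastern China
    "set-tj-1": "tianjin",    # northern China
}

# Every SET holds a copy of the data, so any replica can serve a read.
SERVICE_REPLICAS = {
    "qzone.feeds": ["set-sz-1", "set-sh-1", "set-tj-1"],
}

# Illustrative "distance" between a user's region and each IDC region.
REGION_DISTANCE = {
    ("south", "shenzhen"): 1, ("south", "shanghai"): 3, ("south", "tianjin"): 5,
    ("east", "shenzhen"): 3, ("east", "shanghai"): 1, ("east", "tianjin"): 3,
    ("north", "shenzhen"): 5, ("north", "shanghai"): 3, ("north", "tianjin"): 1,
}

def route(service: str, user_region: str) -> str:
    """Return the nearest SET for a logical service name."""
    replicas = SERVICE_REPLICAS[service]
    return min(replicas,
               key=lambda s: REGION_DISTANCE[(user_region, SET_LOCATIONS[s])])

print(route("qzone.feeds", "north"))  # set-tj-1
```

The front-end code never mentions a SET name, so SETs can be added or merged behind the routing layer without touching callers.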
InfoQ: Can you explain how the SETs run in IDCs? We heard you are now doing single write multiple read. How do you sync data?
Sun: The way we deploy SETs is what we studied the most. Within every SET, failover is our biggest concern. There are many ways an IDC can fail - a rack loses power, a switch goes down - so we use double switches, two copies of data with each copy on separate racks, and so on. Looking from a wider perspective, Qzone depends on many QQ services, and QQ also calls certain Qzone services. Most of these services were deployed in Shenzhen, so we also needed failover at the IDC level. We started by deploying SETs in multiple IDCs in Shenzhen; then, to better serve users in northern and eastern China, we deployed SETs in IDCs in those areas, with the Shenzhen IDCs mainly serving users in southern China. So now it is multiple SETs in multiple IDCs.
All distributed systems with data in different areas face challenges in data consistency, availability and partition tolerance, and according to the CAP theorem we cannot satisfy all three. We do single write, multiple read: we have pre-defined master write nodes, and we put all the write nodes in one SET, which we call the data source SET. Whenever new data is written there, a sync is triggered in our distribution system (we call it the sync center), which replays the write operation in every other SET across China. So we are not syncing data but duplicating user operations, which makes the process faster.
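A minimal sketch of the single-write, multiple-read pattern described above. This is an assumption-laden toy, not Tencent's sync center: the class names and the simple key-value "operations" are invented for illustration. The key idea it shows is that the *operation* is replayed at every replica, rather than shipping the resulting data.

```python
# Toy model: all writes land on one data-source SET, then the sync
# center replays the same user operation in every other SET.
class SetReplica:
    def __init__(self, name):
        self.name = name
        self.store = {}

    def apply(self, op):
        # Replay a user operation locally (here just simple key writes).
        action, key, value = op
        if action == "put":
            self.store[key] = value

class SyncCenter:
    def __init__(self, source, replicas):
        self.source = source      # the data-source SET (all writes land here)
        self.replicas = replicas  # every other SET in other IDCs

    def write(self, op):
        self.source.apply(op)     # single write...
        for r in self.replicas:   # ...then duplicate the operation everywhere
            r.apply(op)

shenzhen = SetReplica("shenzhen")
shanghai = SetReplica("shanghai")
sync = SyncCenter(shenzhen, [shanghai])
sync.write(("put", "user:42:status", "online"))
print(shanghai.store["user:42:status"])  # online
```

Replaying operations keeps the payload small (an operation is usually far smaller than the data it produces), which is presumably why the interview calls the approach faster than syncing data.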
We did many things in the sync center to deal with lag and unstable networks. One thing was implementing smart rules for using the private network versus the external (public) network: within Tencent we have considerable private-network bandwidth, but it is still limited and carries too many services, so we use a lot of external bandwidth as well. The basic rule is to put large, unimportant data flows on the external network and small, important data flows on the private network; whenever one breaks, all traffic switches to the other. So rich media data such as friend feeds normally travels on the external network, whilst basic data such as adding a friend travels on the private network. For the less important data, users usually don't expect an instant response, so the external network is acceptable most of the time.
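The routing rule just described can be sketched as a small classifier. The 64 KB size threshold below is an assumption made purely for illustration; the interview gives only the qualitative rule (small and important goes private, large and unimportant goes external, and either class fails over to the other network).

```python
# Hypothetical traffic-classification rule: small, important flows use
# the private network; large, less important flows use the external
# (public) network; if one network is down, everything fails over.
SMALL_THRESHOLD = 64 * 1024  # bytes -- assumed cutoff, not from the interview

def pick_network(size_bytes, important, private_up=True, external_up=True):
    preferred = "private" if (important and size_bytes < SMALL_THRESHOLD) else "external"
    if preferred == "private" and not private_up:
        return "external"  # failover
    if preferred == "external" and not external_up:
        return "private"   # failover
    return preferred

print(pick_network(2_000_000, important=False))             # external (e.g. rich media feeds)
print(pick_network(200, important=True))                    # private (e.g. add-friend op)
print(pick_network(200, important=True, private_up=False))  # external (failover)
```

In practice such a rule would also consider latency budgets and current congestion, but the two-axis size/importance split captures the policy stated above.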
We are constantly working to make responses faster. For example, when a user has just performed an operation, we direct that user's reads to the data source SET so that he perceives consistency. We keep each transmission within milliseconds so that users don't feel the delay. When there is congestion in the network, we re-send the user's request to avoid data loss. We have quota control on bandwidth, quality-of-service at the SOA level, and support from NetScreen to ensure transmission quality. So overall it's all under control.
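The "read from the data source SET after a write" trick is a form of read-your-writes consistency. A hypothetical sketch, with an assumed five-second window (the interview does not give the actual window):

```python
# Hypothetical read-your-writes routing: after a user writes, send that
# user's reads to the data source SET for a short window, so the write
# is visible immediately even before replication to other SETs finishes.
import time

RECENT_WRITE_WINDOW = 5.0  # seconds -- assumed value for illustration
last_write_at = {}         # user id -> timestamp of that user's last write

def record_write(user):
    last_write_at[user] = time.monotonic()

def choose_read_set(user, nearest_set, source_set):
    wrote = last_write_at.get(user)
    if wrote is not None and time.monotonic() - wrote < RECENT_WRITE_WINDOW:
        return source_set  # recent writer reads from the data source SET
    return nearest_set     # everyone else reads from the nearest replica

record_write("u42")
print(choose_read_set("u42", "set-sh-1", "set-sz-1"))  # set-sz-1 (data source)
print(choose_read_set("u99", "set-sh-1", "set-sz-1"))  # set-sh-1 (nearest)
```

Only the writing user pays the extra latency of reaching the source SET; all other users keep reading from their nearest replica.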
InfoQ: At the infrastructure level, are you using Tencent QCloud or your own cloud?
Sun: Just now my colleague mentioned our use of the cloud. Within Tencent there are actually two cloud systems: one serves external users, such as third-party apps and website hosting - we call it the external cloud. For our own services, because the software is so complicated and the scale is so big, we are not entirely on Tencent QCloud. We use the same storage system as QCloud, and at the logical level we use some QCloud components, but we manage our services differently. The external cloud emphasizes compute and virtualization, while our services serve vast numbers of users, so we have many customized features, and many external-cloud features we simply remove. I would say Qzone is 90% on an internal cloud (storage + compute); the rest still runs on legacy systems. We are still moving it onto our internal cloud - the cloud really helps increase operational efficiency and quality.
InfoQ: There have been several structural changes for Qzone. You are responsible for designing SOA for Qzone, can you share with us how this has changed over the years?
Sun: Back in 2009, we had some technical discussions with our e-commerce department (Tencent Paipai) about implementing SOA. At that time, the best SOA implementation was at Amazon; in China it was Taobao. We discussed what Paipai could learn from them - how they turned transaction management and all their flows, at such a large scale, into services. For Qzone we never actually said "we want to do SOA": we built Qzone module by module, feature by feature, all loosely coupled over the network, but we wouldn't have called it SOA.
During that discussion, we suddenly realized that our service was actually more complicated than e-commerce: we had more modules, and the cost of operations and management was quite high. We thought we should really improve the quality and manageability of the platform - Qzone was already serving external apps as a platform by then - so we started the SOA optimization project for Qzone.
There were 3 main considerations when we started the project:
1. What core services do we offer to external developers?
2. What do these services have in common?
3. What problems are we currently facing?
We studied them and came out with the following conclusions:
1. Unified gateway
The core services we offer to external developers include friend relationship data, access permissions, apps and feeds. External developers don't need to know how these are deployed, how they fail over, or how call-frequency control works, so all we needed was a unified Qzone gateway fronting all the external services.
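A toy sketch of what such a gateway hides from callers. Everything here - the backend names, the per-app call limit of 3, the response strings - is invented for illustration; only the idea (one entry point doing routing plus call-frequency control) comes from the text.

```python
# Hypothetical unified gateway: external developers hit one endpoint;
# the gateway handles routing and call-frequency control behind it.
CALL_LIMIT = 3  # calls allowed per app (illustrative number)
calls = {}      # app id -> call count

BACKENDS = {"friends": "friend-service", "feeds": "feed-service"}

def gateway(app_id, service):
    calls[app_id] = calls.get(app_id, 0) + 1
    if calls[app_id] > CALL_LIMIT:
        return "429 rate limited"          # frequency control
    backend = BACKENDS.get(service)
    if backend is None:
        return "404 unknown service"
    return f"routed to {backend}"          # deployment details stay hidden

print(gateway("app1", "friends"))  # routed to friend-service
```

The developer-facing surface stays stable even if the services behind the gateway are redeployed, moved between SETs, or failed over.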
2. Standardized protocols
At the time we had many protocols, which was inconvenient. In SOA, standardizing protocols is very important, so we unified on the WP protocol, which came from our wireless services and works similarly to Google's protobuf - a bit-oriented protocol. The good thing about a bit-oriented protocol is that fields can easily be added or removed without breaking outside consumers.
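The add/remove-fields property can be illustrated with a minimal tag-length-value encoding in the spirit of protobuf. This is not the WP protocol itself (its format is not described in the interview); the tags and field values below are made up. The point is that a reader skips tags it doesn't know, so senders can add new fields without breaking old readers.

```python
# Minimal tag-length-value (TLV) sketch: each field is (tag, length,
# bytes). Unknown tags are skipped on decode, so fields can be added
# or removed without breaking existing readers.
def encode(fields):  # fields: {tag: bytes}
    out = bytearray()
    for tag, value in fields.items():
        out += bytes([tag, len(value)]) + value
    return bytes(out)

def decode(data, known_tags):
    i, result = 0, {}
    while i < len(data):
        tag, length = data[i], data[i + 1]
        value = data[i + 2 : i + 2 + length]
        if tag in known_tags:  # unknown tags are silently skipped
            result[tag] = value
        i += 2 + length
    return result

msg = encode({1: b"u42", 2: b"online", 9: b"new-field"})
# An old reader that only knows tags 1 and 2 still decodes cleanly:
print(decode(msg, known_tags={1, 2}))  # {1: b'u42', 2: b'online'}
```

Real protobuf additionally encodes a wire type with each tag and uses varints for lengths, but the forward-compatibility mechanism is the same skip-unknown-fields idea.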
3. Classification of modules
We had hundreds of modules at that time, which were too difficult to manage and maintain. So we came up with abstractions to contain similar modules. For example, the comment system and the relationship-chain services can both be abstracted by the direction of their relations (top-down or inverse). With these abstractions, together with the basic distributed services and storage services, we ended up with fewer than a hundred modules. This made life much easier.
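The comment/relationship-chain example can be sketched as one generic "directed relation" module backing both features. This is a hypothetical illustration of the abstraction, not Qzone's actual module; the class and field names are invented.

```python
# One generic module for directed relations, reusable by both the
# comment system and the relationship chain: each relation is stored
# top-down (forward) and inverse, matching the two directions mentioned.
from collections import defaultdict

class DirectedRelation:
    def __init__(self):
        self.forward = defaultdict(list)  # top-down, e.g. post -> its comments
        self.inverse = defaultdict(list)  # inverse, e.g. comment -> its post

    def add(self, src, dst):
        self.forward[src].append(dst)
        self.inverse[dst].append(src)

friends = DirectedRelation()   # relationship chain reuses the module
comments = DirectedRelation()  # so does the comment system
friends.add("u1", "u2")
comments.add("post9", "c1")
print(friends.forward["u1"], comments.inverse["c1"])  # ['u2'] ['post9']
```

Collapsing many feature-specific modules onto one abstraction like this is how hundreds of modules could shrink to fewer than a hundred.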
I would say our SOA isn't quite the same SOA as Paipai's or Amazon's, but a unique way of building service abstractions. When we add a new feature, we don't build it from scratch; we assemble existing components - network frameworks, storage services, etc. - like Lego bricks. The great thing about this is that we build new features much faster, and our overall efficiency has increased.
InfoQ: Final question is, what are you planning to do for future optimizations?
Sun: We have been thinking about it recently, and we have identified some areas we could improve.
The first follows directly from the SOA work I just described: we have turned some basic data into services, and now we want to make feeds a service too, since many other apps now use feeds. The second is targeted recommendations, where users see different feeds according to their preferences and interests. The last is still basic quality of service and cost reduction: despite our optimizations, some users still experience slow access, so we need to keep working on it. We need more monitoring so we can pinpoint problems before users complain about them. These are the three main things we are planning.
About the Interviewee
Micro Sun joined Tencent after graduating in 2006 and has been working on the Qzone platform ever since. He took part in scaling Qzone from millions of users to hundreds of millions, during which the architecture of Qzone changed several times for performance optimization. He holds a master's degree from Xi'an Jiaotong University.
Stephanie Davis (nee Stewart) Dec 21, 2014