Graph API in a Large Scale Environment

With more than 80 million users, information about 2.6 billion individuals and 6.1 billion records, MyHeritage is a rapidly growing destination used by families around the world to discover, preserve and share their histories. Recently there has been a marked escalation in content contributed by users and other sources, which creates increasing demand for our services. These services are accessed both internally and externally by our partners via the FamilyGraph API, a RESTful API. Millions of API calls are made every day by dozens of partners all over the world, presenting us with a huge challenge in terms of performance, scalability and security.

API as a standard

MyHeritage was founded in 2003, and by 2011 it was serving millions of users via multiple channels including web, mobile, and our desktop client. Even though these channels were a great way to access our services, there was increasing demand from our partners for an interface to embed MyHeritage in their products. Providing an open and secure API has become standard practice for every software company looking to expand its partnerships with all kinds of developers, from freelancers to large software companies. We were eager to accept the challenge and started developing the FamilyGraph API, which shares the concepts of Facebook’s Graph API.

Our data model

As a company that deals with family history we have rich data such as family trees and profiles, albums and photos, and historical records, all of which are interconnected. This can be represented as a graph in which each data element is a vertex and the connection between two elements is an edge. For example, an individual can be a vertex in the graph, and the relationship between two individuals is an edge. The individual also belongs to a tree, which in turn is part of a site. So the individual, tree and site elements are all vertices, and the connections between them can be represented as edges.

We needed an architecture that would allow us to expose this data structure through a simple API in such a way that it is not only easy to retrieve information about objects (the vertices) but also to navigate between them (along the edges). A graph data structure is the most suitable for this purpose.
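
To make the graph idea concrete, here is a minimal sketch of individuals, a tree and a site as vertices connected by edges. The element names and connection types are illustrative only and are not the actual FamilyGraph model.

<?php
// Illustrative only: a tiny in-memory graph, not the actual FamilyGraph data model.
// Vertices are keyed by id; edges are (from, connection, to) triples.
$vertices = array(
    'individual-1' => array('type' => 'individual', 'name' => 'John'),
    'individual-2' => array('type' => 'individual', 'name' => 'Jane'),
    'tree-1'       => array('type' => 'tree',       'name' => 'Family tree'),
    'site-1'       => array('type' => 'site',       'name' => 'Family site'),
);
$edges = array(
    array('individual-1', 'spouse', 'individual-2'),
    array('individual-1', 'tree',   'tree-1'),
    array('individual-2', 'tree',   'tree-1'),
    array('tree-1',       'site',   'site-1'),
);

// Navigating the graph means following edges of a given type from a vertex.
function connections(array $edges, $from, $type) {
    $targets = array();
    foreach ($edges as $edge) {
        if ($edge[0] === $from && $edge[1] === $type) {
            $targets[] = $edge[2];
        }
    }
    return $targets;
}

print_r(connections($edges, 'individual-1', 'tree')); // Array ( [0] => tree-1 )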

REST, OAuth2 and JSON — A winning combination

REST stands for “Representational State Transfer”. It is not a language or protocol, but rather an architectural style that defines a set of constraints for creating scalable web services. REST is a simple, lightweight architecture compared to its main competitor, SOAP, and it is therefore widely used in the industry, particularly by leading software companies such as Facebook, Google and Twitter.

REST doesn’t prescribe how authentication for the API should be done. You can use basic HTTP authentication, OAuth2 or even your own custom authentication method; however, most APIs today support OAuth2. In addition, REST doesn’t define the format of the transferred data. It can be JSON, XML or even plain HTML. A few years ago XML was very popular, and most REST APIs used it to represent the transferred data. Although XML has many benefits, it is more complicated than JSON, and companies started offering JSON as an alternative.

We chose REST as the architecture, OAuth2 for authentication and JSON as the resource format. The combination of the three is the most popular API architecture today; it allows us to scale up fast in a secure environment, and it provides easy cross-language, cross-platform integration for developers around the world.
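
As a rough illustration of how the three fit together, a client sends an OAuth2 bearer token with a plain HTTP request and gets a JSON resource back. The host, object id and field names below are made up for the example and are not the actual FamilyGraph endpoint.

<?php
// Illustrative only: call a REST resource with an OAuth2 bearer token and decode the JSON response.
$accessToken = 'ACCESS_TOKEN_OBTAINED_VIA_OAUTH2';
$url = 'https://familygraph.example.com/individual-123?fields=id,name';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Authorization: Bearer ' . $accessToken));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);   // e.g. {"id":"individual-123","name":"John"}
curl_close($ch);

$individual = json_decode($response, true);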

The challenges of rapid scaling

When we released the Graph API in 2011 we had just a few clients using it, the API provided access to basic family data only, and there were no write APIs at all. We didn’t have to deal with performance issues, and the FamilyGraph web service ran on the front-end web servers.

Over the years MyHeritage has emerged as one of the leading players in the family history industry. This has brought us millions of new users, plus billions of new individuals and historical records that we had to provide through our API. In addition, we added more APIs and more clients registered for our service.

This tremendous growth raised many issues we had to solve:

  • The load on the web servers increased dramatically
  • We needed to tighten our security mechanisms
  • We spent a lot of time on adding new APIs
  • New partners needed good API documentation to help them get started
  • We needed better tools for monitoring the services

These issues required us to take immediate action.

Putting the FamilyGraph service on a different cluster

Besides the obvious step of adding more servers to our farm, we needed to ensure that the load on the API would not hurt users of our website. We therefore decided to create a new dedicated cluster just for the FamilyGraph web service and redirect all API traffic to it. In addition, we created a separate database cluster where we keep all of the FamilyGraph-related data and statistics, in order to minimize the effect of heavy API traffic on web users.

Controlling the rate

Today we have an average load of more than 5 million calls to the FamilyGraph API per day. OAuth scopes ensure that clients can access only certain APIs, but they don’t protect us from brute-force attacks, application bugs or even a normal high load during peak hours. This is where rate limiting comes to the rescue.

We developed a rate limit mechanism similar to Twitter’s rate limits in their REST API. Through this method, we limit the rate at which a client can access our API within a given time frame. With every successful call we return “rate limit” headers which tell the client the window size, how many calls are left in the current window and what the current limit is. The client can calculate when it should send another request based on this information. If the client exceeds the allowed rate, we return a 429 HTTP code (rate limit exceeded) and the client is instructed to wait at least one window before trying to hit us again.
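
A minimal server-side sketch of such a check is shown below. It assumes a fixed window, APCu as the counter store (in production a store shared across the cluster would be needed) and illustrative limits and header names; it is not our exact implementation.

<?php
// Illustrative fixed-window rate limit; limit, window and header names are assumptions.
function check_rate_limit($clientId, $limit = 600, $windowSeconds = 600) {
    $window = (int) floor(time() / $windowSeconds);
    $key    = "rate:{$clientId}:{$window}";

    if (apcu_fetch($key) === false) {
        apcu_store($key, 0, $windowSeconds);   // start a new window
    }
    $count     = apcu_inc($key);
    $remaining = max(0, $limit - $count);

    header("X-Rate-Limit-Limit: {$limit}");
    header("X-Rate-Limit-Remaining: {$remaining}");
    header("X-Rate-Limit-Window: {$windowSeconds}");

    if ($count > $limit) {
        http_response_code(429);               // rate limit exceeded; wait for the next window
        exit;
    }
}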

This method ensures that clients will not exceed a limit we set according to our current capacity. In addition, the rate is dynamic, so we may increase or decrease it according to the load on our servers. This allows us to give balanced service to all clients and to target five nines (99.999%) availability.

Auto-generated code and documentation

The FamilyGraph API is growing rapidly. We’re adding or updating graph objects on a daily basis. We strive to make all changes backward compatible. We cannot afford to waste time on trivial tasks such as coding documentation pages or generating the skeleton of the models.

For this purpose we created a special object metadata structure in which we declare the following:

  • Functionality - Supported fields and connections
  • Security - The required privilege for accessing an object
  • Documentation - Object description, fields, connections and examples

An example metadata description is shown in Figure 1.

array(
    'overview' =>
    "A media album as represented in the Family Graph API.\n\n" .
    "[Media items][mediaitem] in a [family site][site] can be organized in albums. An album contains a list of media " .
    "items ([photos][photo], [videos][video], [documents][document] and [audios][audio]). A media item can be in one album, multiple " .
    "albums or even in no albums at all.",
    'fields' => array(
        'id' => array(array('primitive', 'id'), "The album's id"),
        'name' => array(array('primitive', 'string'), "The album's name", array('read' => 'basic', 'write' => 'basic_write')),
        'description' => array(array('primitive', 'string'), "The album's description", array('read' => 'basic', 'write' => 'basic_write')),
        'link' => array(array('primitive', 'url'), "Link to album on MyHeritage.com"),
        'submitter' => array(array('object', 'user'), "The user who submitted (created) the album in the family site"),
        'created_time' => array(array('primitive', 'time'), "The date and time when the album was submitted to the family site"),
        'updated_time' => array(array('primitive', 'time'), "The date and time when the album was last updated (including adding or removing media items from the album)"),
        'media_count' => array(array('primitive', 'int'), "Number of media items in the album"),
        'is_public' => array(array('primitive', 'boolean'), "Indicates whether the album is public (its media items can be searched and viewed by other users)", array('read' => 'basic', 'write' => 'basic_write')),
        'site' => array(array('object', 'site'), "The site of the album"),
        'cover_photo.url,thumbnails' => array(array('object', 'photo'), "The album's cover photo"),
    ),
    'fields_exported_by_default' => array(
        'id',
        'name',
    ),
    'connections' => array(
        'keywords' => array(array('array', array('primitive', 'string')), "List of keywords associated with the media item"),
        'media' => array(array('array', array('object', 'mediaitem')), "List of media items in the album", array('read' => 'basic', 'write' => 'upload_media')),
    ),
    'delete_scope' => 'delete_album',
    'class_name' => 'Album',
    'examples' => array(array(array('localhost' => 'album-149444662-2900003', 'production' => 'album-144530322-1900003'))),
)

Figure 1: Example metadata for the album API.

Figure 2: Documentation as generated from the API metadata.

After defining the metadata, the developer can concentrate on implementing the logic of the object and the FamilyGraph infrastructure handles the rest. API documentation is automatically generated in real time from the code metadata when users navigate to the documentation website; the documentation includes generated example JSON responses, and the infrastructure applies the declared security validations before an object is accessed.
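
As a rough sketch of the idea (not our actual generator), a documentation page can be rendered simply by walking a metadata array like the one in Figure 1:

<?php
// Illustrative documentation renderer driven by the object metadata.
function render_docs($objectName, array $meta) {
    echo '<h1>' . htmlspecialchars($objectName) . "</h1>\n";
    echo '<p>' . htmlspecialchars($meta['overview']) . "</p>\n";

    echo "<h2>Fields</h2>\n<ul>\n";
    foreach ($meta['fields'] as $field => $def) {
        // $def[0] is the type descriptor, $def[1] the description (see Figure 1).
        echo '<li><b>' . htmlspecialchars($field) . '</b> (' . htmlspecialchars($def[0][1]) . '): '
           . htmlspecialchars($def[1]) . "</li>\n";
    }
    echo "</ul>\n";

    echo "<h2>Connections</h2>\n<ul>\n";
    foreach ($meta['connections'] as $connection => $def) {
        echo '<li><b>' . htmlspecialchars($connection) . '</b>: ' . htmlspecialchars($def[1]) . "</li>\n";
    }
    echo "</ul>\n";
}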

Adding more caching layers

One of the most popular APIs we provide is the “matchingrequest” API. It allows partners to find record matches for individuals, such as birth, marriage and census records, via our advanced Record Matching and Smart Matching™ technologies. Partners can then enrich their products with information about their users who have these matches.

Unlike matches on our website, which are pre-calculated in an offline process that doesn’t affect the user, this API calculates the matches in real time. This can be a relatively heavy process, and we needed every optimization we could get in order to shorten the response time.

For this purpose we defined a caching policy in three levels:

  • Proxy servers caching
  • Caching by request hash
  • Caching by request signature

Proxy Servers Caching

Proxy server caching works by identifying recurring calls to the same resource by its URI. If the resource is found in the cache, the proxy checks that the cached copy is still valid using the caching headers (e.g. “Expires”). If it is, the proxy returns the cached result and thus spares the origin server from having to recalculate the request.

In order to allow proxy servers to successfully cache the result, we did two things:

  • Require the MD5 hash of the JSON body to be part of the URI of the matching request API. This uniquely identifies the request.
  • Return “Expires” and “Cache-Control” headers that are used by proxy servers to determine the length of time for which to keep the cached item.
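
A sketch of both steps is shown below; the payload, path and TTL are illustrative, not our exact values.

<?php
// Illustrative only: hash the request body into the URI and mark the response cacheable.
$individualDetails = array('name' => 'John', 'birth_year' => 1900);   // example payload
$body = json_encode($individualDetails);
$uri  = '/matchingrequest/' . md5($body);   // the MD5 hash uniquely identifies the request body

// On the server, caching headers tell proxies how long the result may be reused (example TTL).
$ttl = 3600;
header('Cache-Control: public, max-age=' . $ttl);
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $ttl) . ' GMT');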

Caching by Request Hash

Caching by proxy servers is a good start, but because such servers may not actually be present along the way, it’s not enough. We therefore created our own caching mechanism, which maps matching requests to their results. The key of this map is a unique ID: an MD5 hash of the individual details sent with the matching request. As long as the individual’s properties don’t change, the hash doesn’t change, and we can return the result from the cache instead of processing the request again.
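
A minimal sketch of this map follows, using memcached as a stand-in cache; the key format, TTL and the run_matching_query() helper are illustrative assumptions.

<?php
// Illustrative request-hash cache for matching results.
$cache = new Memcached();
$cache->addServer('localhost', 11211);

$requestId = md5(json_encode($individualDetails));   // changes whenever any property of the individual changes
$cacheKey  = 'matching:' . $requestId;

$matches = $cache->get($cacheKey);
if ($matches === false) {
    $matches = run_matching_query($individualDetails);   // hypothetical helper: the expensive real-time calculation
    $cache->set($cacheKey, $matches, 24 * 3600);          // example TTL
}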

Caching by Request Signature

The matching request cache worked well and dramatically reduced the load on the servers. However, we found that we could do even better.

Every matching request contains many details about the individual, some of which don’t affect the matching process at all. So two matching requests with different hashes could end up producing the same matching query and therefore the same response.

We thus added a new mechanism that calculates a hash from only the properties relevant to matching, and saves it as the “significant signature” of the request. As a result, two requests with the same signature produce only one query.
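
A sketch of the idea is shown below; the list of relevant properties is made up for the example and is not our real one.

<?php
// Illustrative significant-signature calculation: hash only the properties that
// influence matching, so requests differing only in irrelevant fields share a cache entry.
function significant_signature(array $individualDetails) {
    $relevant = array('first_name', 'last_name', 'gender', 'birth_date', 'birth_place', 'events');
    $subset = array();
    foreach ($relevant as $property) {
        if (isset($individualDetails[$property])) {
            $subset[$property] = $individualDetails[$property];
        }
    }
    ksort($subset);   // stable ordering so the same content always yields the same hash
    return md5(json_encode($subset));
}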

For example, an individual’s images are part of the request but do not affect the matching result. If only the images change between two subsequent calls, we detect that the signature is the same and serve the result for the second call from the cache.

We found that we can save ~15% of the requests and serve them from the cache. The result was quicker responses to clients, reduced load on the servers and less storage needed for cache items.

Figure 3: Example Request ID and Significant Signature.

Figure 3 shows an example request that illustrates hashing only the relevant properties of the request. The MD5 hash of the original request is “887bbdd92ca8a9a4f32626650e5a8422” and its signature is “abda21f64c46c9a741783495c1f95f44”.

The client then updated the image of the individual and sent the request again. The request ID of the updated individual details is “794fff604de9cb9e82148bdde2c2c744” (different) but the signature remains the same: “abda21f64c46c9a741783495c1f95f44”. We can skip the calculation of this request and send the cached result from the previous call.

Figure 4: Cache Request-Response Flow Logic.

Monitoring through charts and sensors

Statistics and Charts

We have a unified statistics system for all the services on our website, and we use it to monitor the status of the service as well as to analyse trends in API usage.

Some of the statistics we collect include:

  • Successful/failed API calls
  • API processing time
  • Unauthorized access attempts
  • Rate limit reached

Sensors

We use Paessler’s PRTG server to set automatic sensors on many parts of the system. This allows us to be notified immediately by email or SMS when certain events occur.

For example, we are notified when the processing time of APIs is too slow. In response, we may add more servers to the cluster, take down unnecessary services or check for errors in the site.

Summary

The FamilyGraph API is one of the key components in our growth. It allows partners to quickly integrate our services into their products. It also provides us with a simple, clean and unified interface for building rich and powerful applications without having to rewrite the same functionality on different platforms.

However, our rapid growth also presents many challenges in areas such as security, performance and maintenance. The toolset we built helps us in these areas: the rate limit mechanism protects us from DoS attacks, the monitoring and statistics tools let us plan ahead and increase capacity as needed, and the automatic code generator allows us to quickly add functionality as new requirements arise.

About the Author

Maor Cohen is a senior member of the MyHeritage backend team with expertise in infrastructure. He has been with MyHeritage since 2012. Maor has over 10 years of experience in technology companies as a developer and team leader in server-side and desktop client technologies.
