
Graph Learning at the Scale of Modern Data Warehouses


Summary

Subramanya Dulloor outlines an approach to addressing the challenges of running graph learning at warehouse scale and shows how to build an efficient and scalable end-to-end system for graph learning in data warehouses.

Bio

Subramanya Dulloor is a computer scientist and engineer with extensive experience in distributed systems and systems for ML. As a founding engineer at Kumo.ai, he is focused on making predictive analytics on data warehouses as easy as plug-and-play. Before joining Kumo, Dulloor worked on distributed systems and database internals at various startups.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Dulloor: My name is Dulloor. I'm a founding engineer at Kumo.ai. At Kumo, we are working on bringing the value of AI driven predictive analytics to relational data, which is the most common form of data in enterprises today. Specifically, we apply graph learning to data in warehouses to deliver high quality predictions for a wide variety of business use cases. This talk is about the challenges of productionizing graph learning at scale, and the systems that we built at Kumo to address those challenges.

For this talk, I'll first provide a brief overview of graph neural networks or GNNs, and explain their advantages over traditional machine learning. Then we'll have a quick primer on graph representation learning using PyG, a popular open source GNN library. After that, I will delve into how the Kumo GNN platform with PyG at its core simplifies productionizing GNNs at a very large scale for enterprise grade applications.

GNNs - Best Choice for ML on Graphs

First, some background. We have all seen how deep learning has transformed the field of machine learning. It has revolutionized the way we approach complex tasks, such as computer vision and natural language processing. In computer vision, deep learning has replaced traditional handcrafted feature engineering with representation learning, enabling the learning of optimal embeddings for each pixel and its neighbors. This has brought a significant improvement in accuracy and efficiency for computer vision tasks. Similarly, in natural language processing, the state-of-the-art performance achieved with deep learning is truly astounding. With deep learning, it's now possible to train models that can understand and generate natural language with incredible accuracy and fluency. The ability to transfer-learn from data-rich tasks to data-poor tasks has further extended the capabilities of deep learning, allowing high model performance to be achieved with fewer labels. Overall, the impact of deep learning on machine learning has been transformative, and it has unlocked new levels of accuracy, efficiency, and capability across a wide range of applications.

While deep learning has seen wide adoption for visual data and natural language, enterprise data typically consists of interconnected tabular data with many attribute-rich entities. This data can be visualized as a graph in which tables are connected by primary and foreign keys. Graph neural networks, or GNNs, bring the deep learning revolution to this kind of graph data in enterprises.

Graph neural networks or GNNs are a type of neural network designed to operate on graph structured data. They extend traditional neural network architectures to model and reason about relationships between entities, such as nodes and edges in a graph. The basic idea behind GNNs is to learn representations for each node in the graph by aggregating information from its neighbors. This is done through a series of message passing steps, where each node receives messages from its neighbors, processes them, and then sends out its own message to its neighbors. At each step, the GNN computes a hidden state for each node based on its current state and the messages it has received. This hidden state is then updated using a nonlinear activation function and passed to the next layer of the GNN. This process is repeated until a final representation is obtained for each node, which can then be used for downstream tasks such as node classification, link prediction, and graph level prediction. GNNs provide a very powerful tool for modeling and reasoning about graph structured data.
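To make the message passing idea concrete, here is a minimal sketch of a single message-passing layer written against PyG's MessagePassing base class. The layer and dimension names are illustrative, not taken from the talk: each node averages its neighbors' states and combines them with its own state through learned weights and a nonlinearity.

```python
# Minimal sketch of one message-passing layer in PyG (illustrative, not Kumo's model):
# each node aggregates its neighbors' current states ("messages") and combines them
# with its own state through a learned, nonlinear update.
import torch
import torch.nn.functional as F
from torch_geometric.nn import MessagePassing


class SimpleConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super().__init__(aggr="mean")            # aggregate neighbor messages by mean
        self.lin_self = torch.nn.Linear(in_channels, out_channels)
        self.lin_neigh = torch.nn.Linear(in_channels, out_channels)

    def forward(self, x, edge_index):
        # x: [num_nodes, in_channels], edge_index: [2, num_edges] in COO format
        neigh = self.propagate(edge_index, x=x)  # run the message + aggregate steps
        return F.relu(self.lin_self(x) + self.lin_neigh(neigh))

    def message(self, x_j):
        # x_j: features of the source node of each edge (the "message" it sends)
        return x_j
```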

Now let's see how graph-based ML compares to traditional tabular ML when it comes to graph data. Traditional tabular ML often requires one-off efforts to formulate the problem, perform the necessary feature engineering, select an ML algorithm, construct training data, and finally train the model using one of many frameworks. As a result of so many one-off steps, traditional tabular ML is often error prone. In a sense, it amounts to throwing everything at the wall and seeing what sticks. Adding to that, the need to deal with so many frameworks and their peculiarities makes the problem even worse. Graph-based ML, on the other hand, is a much more principled approach to learning on graph data. Graph learning offers better performance at scale, and also generalizes to a wide variety of tasks. Problem formulation is much easier too, because a use case has to be translated into only one of a handful of graph ML tasks. Once a graph ML task is defined, GNNs automatically learn how to aggregate and combine information to learn complex relational patterns at scale. GNNs are also good at reasoning across multiple hops, which is very hard to pre-compute and capture as input features. That's one reason why traditional tabular ML tends to lose signal and produce poorer model quality, particularly as use cases become more sophisticated. Finally, GNNs' learned representations are more effective and generalizable than manually engineered features, and they are well suited for use in downstream tasks. With traditional ML, problem formulation and mapping the problem to an ML task must be done pretty much from scratch for each use case; there is very little generalization across multiple tasks.

No wonder, given all these advantages, that GNN adoption has really picked up in the industry. GNNs have repeatedly demonstrated their effectiveness in real-world scenarios, with machine learning teams at top companies successfully leveraging them to improve performance across a range of tasks. These tasks include recommendation and personalization systems, fraud and abuse detection systems, forecasting dynamic systems, modeling complex networks, and many more use cases. GNNs have certainly become a potent tool for data scientists and engineers who are dealing with graph data.

Graph Representation Learning with PyG

Let's do a quick overview of how graph representation learning is done today in the research and open source community. For this part, we'll focus on PyTorch Geometric or PyG. PyG is a very popular open source library for deep learning on graphs that is built on top of PyTorch. It is a leading open source framework for graph learning, with wide adoption in the research community and industry. PyG's thriving community consists of practitioners and researchers contributing to the best new GNN architectures and design all the time. As we will see later, PyG is also at the core of the Kumo GNN platform. PyG provides a set of tools for implementing various graph-based neural networks, as well as utilities for data loading and preprocessing. It enables users to efficiently create and train graph neural networks for a wide range of applications from node classification and link prediction to graph classification. PyG also provides several curated datasets for deep learning on graphs. These datasets are specifically designed for benchmarking and evaluating the performance of graph-based neural networks. PyG also provides a variety of examples and tutorials for users to get started with graph deep learning.

PyG's programming model is designed to be flexible and modular, allowing users to easily define and experiment with different types of GNNs for various graph-based machine learning tasks. The first step is creating and instantiating graph datasets and graph transformations. For this step, PyG provides a variety of built-in graph datasets, such as citation networks, social networks, and bioinformatics graphs. You can also create your own custom dataset by extending the PyG dataset class. Graph transformations allow you to perform preprocessing steps on your graphs, such as adding self-loops, normalizing node features, and so on. After creating the dataset, the next step is defining how to obtain mini-batches from it. In PyG, mini-batches are created using the data loader class, which takes in a graph dataset and a batch size. PyG provides several different sampling methods for mini-batching out of the box, such as random sampling and the neighbor sampling used in GraphSAGE. You can also define your own sampling methods by creating a custom sampler class. The third step is designing your own custom GNN via predefined building blocks or using predefined GNN models. PyG provides a variety of predefined GNN layers such as graph convolutional networks, graph attention networks, and GraphSAGE. You can also create your own custom GNN layers by extending the underlying PyG message passing class. GNNs in PyG are designed as modules, which can be stacked together to create a multi-layer GNN. The final step in doing GNNs with PyG is implementing your own training and inference routines. PyG provides an API for training and evaluating GNNs, which looks very similar to the PyTorch API. Training a GNN involves defining a loss function and an optimizer and calling the train method on your GNN module. Inference involves calling the eval method on your GNN module, followed by making predictions on new graphs.
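The four steps above map to just a few lines of PyG code. The following is a minimal sketch, not from the talk, that uses a built-in citation dataset, PyG's NeighborLoader for mini-batching, the predefined GraphSAGE model, and a standard PyTorch-style training loop; the hyperparameters are illustrative.

```python
# Minimal PyG sketch of the four steps: a curated dataset, neighbor-sampled
# mini-batches, a predefined GNN model, and a standard training loop.
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader
from torch_geometric.nn import GraphSAGE

dataset = Planetoid(root="data/Cora", name="Cora")   # built-in citation graph
data = dataset[0]

loader = NeighborLoader(
    data,
    num_neighbors=[10, 10],          # sample up to 10 neighbors per node, 2 hops
    batch_size=128,
    input_nodes=data.train_mask,     # seed nodes for each mini-batch
)

model = GraphSAGE(
    in_channels=dataset.num_features,
    hidden_channels=64,
    num_layers=2,
    out_channels=dataset.num_classes,
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

model.train()
for batch in loader:
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index)
    # The loss is computed only on the seed nodes at the front of the batch.
    loss = F.cross_entropy(out[:batch.batch_size], batch.y[:batch.batch_size])
    loss.backward()
    optimizer.step()
```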

While PyG provides a great basis for graph learning, productionizing GNNs requires many additional capabilities that are difficult to build and scale. First, let's consider graph creation. PyG expects a graph in either COO or CSR format to model complex heterogeneous graphs, and both nodes and edges can hold any set of curated features. Datasets that come with PyG are ready to use; however, for non-curated datasets, such as those we see in enterprises all the time, it is up to the users to create and provide the graph that PyG expects. Graph creation and management are not trivial, particularly at scale. The second issue is problem formulation. While PyG supports any graph-related machine learning task, the problem formulation itself is up to the user. It is the user's responsibility to define a business problem as one of the supported graph learning task types in PyG. Even after the problem formulation, curation of training labels for a given task is also the user's responsibility. When doing this, one has to make sure that temporal consistency is maintained during label generation, neighbor sampling, and all the way through the pipeline to avoid leakage from the future. The same goes for predictions: avoiding data leakage while serving GNN predictions is a non-trivial task. Another limitation is that, while PyG supports full customization, from model architecture to the training pipeline, the best model architecture itself is both data and task dependent, and users must take into account several factors: which GNN is best suited for a given task, how many neighbors and how many hops to sample, how to ensure that the model generalizes over time, and how to deal with class imbalance and overfitting. To add to all this complexity, when new data arrives or the structure of the graph changes, the graph must be updated, the model retrained, and the model outputs versioned. As you can see, productionizing GNNs, particularly at scale, is challenging, because there are many complex and error-prone steps required to go from a business problem to a production model. It requires significant GNN and systems expertise to productionize GNNs at scale. The Kumo GNN platform addresses this gap.

The Kumo GNN Platform

We will now dive deeper into the Kumo programming model and the platform. The Kumo platform makes it easier to express business problems as graph-based machine learning tasks. It is also easy to securely connect to large amounts of data in warehouses and start running queries. Once a connection to a data source is established and a meta-graph is specified, the Kumo platform automatically materializes the graph at the record level. The business problem itself can be specified declaratively using the predictive query syntax. Internally, predictive queries are compiled into a query execution plan, which includes the training plan for the corresponding ML task. As data in the data source changes, the materialized graph and features are incrementally updated, and Kumo automatically optimizes the graph structure for specific tasks, for example by adding meta-paths between records in the same table to reduce the number of hops required to learn effective representations. When it comes to model training, Kumo provides out-of-the-box few-shot AutoML capabilities that are tailored to GNNs. Finally, Kumo also provides enterprise-grade features such as explainability, in addition to MLOps capabilities.

The Kumo programming model for training and deploying GNNs is pretty simple, and it can be broken down into five steps. The first step is to create Secure Connectors to one or more data sources. The next step is to create one or more business graphs by connecting tables from the data sources. Following that, we formulate the business problem with predictive queries on the business graphs. The predictive queries are then trained, typically with an AutoML trainer whose search space has been configured by the predictive query compiler. Finally, inference can be run multiple times on the trained model. All of the steps described here can also be done via the UI; the Python SDK and UI share the same REST service backend. Let's take a look at each of these steps in more detail, using the popular H&M Kaggle dataset as the running example. It has three tables (users, sales, and products) that are linked to each other as shown here. The first step is to create a connector for each data source. In the Kumo platform, connectors abstract physical data sources to provide a uniform interface to read metadata and data. Kumo supports connectors to S3 and a number of cloud warehouses such as Snowflake and Redshift. For datasets in S3, Kumo supports both CSV and Parquet file formats, a number of common partitioning schemes, and directory layouts. The Kumo platform uses the Secure Connectors to ingest data and cache it for downstream processing. Working with a wide variety of data sources can be challenging due to data cleanliness issues, particularly typing-related issues. That's something the Kumo platform takes care of automatically. After creating connectors, the next step is to register a meta-graph by adding tables and specifying the linkages between them. In well-designed schemas, these linkages are typically primary-key/foreign-key relationships, and the meta-graph represents the graph at the schema level. The actual graph at the record level is typically large and difficult to construct and maintain; here it is materialized automatically in the backend in a scalable manner.
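As an illustration of the first two steps, here is a hypothetical sketch in the style of a Kumo Python SDK for the H&M example. The module, class, and method names are assumptions made for illustration, not the actual Kumo API.

```python
# Hypothetical sketch of the first two steps (connector and graph) using a
# Kumo-style Python SDK. Names are illustrative assumptions, not the real API.
import kumo  # hypothetical SDK module

# Step 1: create a secure connector to the data source (S3 in this example).
connector = kumo.S3Connector(root_dir="s3://demo-bucket/hm-dataset/")

# Step 2: register tables and link them into a meta-graph via key relationships.
users = kumo.Table.from_source(connector, name="users", primary_key="user_id")
sales = kumo.Table.from_source(connector, name="sales", time_column="t_dat")
products = kumo.Table.from_source(connector, name="products", primary_key="product_id")

graph = kumo.Graph(tables=[users, sales, products])
graph.link(src_table="sales", fkey="user_id", dst_table="users")
graph.link(src_table="sales", fkey="product_id", dst_table="products")
```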

After building a graph, users can define their business use cases declaratively with the predictive query syntax. Predictive queries make it easy to express business problems without worrying about how they map to graph-based ML tasks. Once a predictive query is defined, Kumo automatically infers the task type, generates the training labels for the task, handles the train and evaluation splits, and determines the best training strategy based on an understanding of both the data and the task. Throughout the training process, Kumo also automatically handles time correctness to prevent data leakage. Users may create any number of predictive queries on a given graph. Kumo then provides multiple options for creating your model. As mentioned before, for a given predictive query, based on the understanding of the data and the task, Kumo can automatically generate the best training strategy and search space for few-shot AutoML. This auto-selected config includes AutoML process options, such as the data split strategy, upsampling and downsampling of target classes, and the training budget. It also includes GNN design space options, such as the GNN model architecture, sampling strategy, optimization parameters, and so on. In addition, the type of encoding to use for features is also part of the search space. The encoding options are selected automatically based on data understanding and statistics, thereby avoiding manual feature engineering, which is error prone and often ad hoc. Advanced users also have the option of customizing the AutoML search space and restricting or expanding the set of training experiments to run. AutoML requires that the Kumo platform be able to run a large number of concurrent experiments for each predictive query.
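Continuing the hypothetical sketch above, a business question such as "will a user purchase something in the next 30 days?" could be declared as a predictive query and trained with AutoML. The query syntax and method names here are illustrative of the style described in the talk, not the exact Kumo grammar.

```python
# Hypothetical continuation of the earlier SDK sketch: declare a business problem
# as a predictive query and let few-shot AutoML configure the training.
pquery = kumo.PredictiveQuery(
    graph=graph,  # the Graph object built in the previous sketch
    query=(
        "PREDICT COUNT(sales.*, 0, 30, days) > 0 "
        "FOR EACH users.user_id"   # i.e., will this user purchase in the next 30 days?
    ),
)
model = pquery.train()             # AutoML picks the search space and training strategy
predictions = model.predict(output="s3://demo-bucket/predictions/")
```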

The final step is to run inference on the best trained model. This inference could be run potentially many times a day. The results from running the inference, which is the model output, can be pushed to S3 or directly to a warehouse such as Snowflake. Kumo GNNs can produce both predictions and embeddings. Predictions are usually integrated directly into business applications, while embeddings are typically used in a variety of downstream applications to improve their performance. To summarize, Kumo GNNs provide a flexible and powerful way of doing ML on enterprise graph data. The Kumo programming model itself is simple enough that even non-experts in ML and GNNs can bootstrap quickly and deliver business value. However, enabling this nice user experience, particularly at scale, is challenging.

Productionizing GNNs at Scale

Let's get into how we productionize GNNs at scale at Kumo. Before getting into the Kumo platform architecture, let's first take a look at what a typical workflow for training a single predictive query looks like. When a predictive query arrives, the platform first determines the data dependencies. Based on that, data is ingested from the data sources and cached. Next, we compute stats and run some custom logic to better understand the data. Simultaneously, the platform also materializes features and edges into a format that is better suited to support training, particularly mini-batch fetching. These steps can be shared between predictive queries with the same set of dependencies. The workflows that follow are more query specific. First, the predictive query compiler takes the predictive query as input, infers the type of ML task, and generates a query execution plan based on the task and data characteristics. The workflows after that take the query execution plan as input, generate the target table, and train models using the specified AutoML config. It is important to note that, to search for the best model, AutoML itself spawns a number of training jobs and additional workflows. The key takeaway from the previous slide is that there is a lot happening in the backend, even to train a single predictive query. When you add to that the Kumo platform's objective of running model training and inference for many predictive queries simultaneously, the problem becomes much more challenging. To address this challenge, the Kumo platform must be able to scale seamlessly and on demand to a very large number of data- and compute-intensive workflows.

Here, I'll first discuss some of the key principles that we followed in designing the Kumo platform architecture. These principles can be seen as common-sense guidelines that have been proven effective in large scale systems for a long time. Firstly, since the Kumo platform must be scalable enough to handle a large number of fine-grained concurrent workflows, a microservice based architecture is a straightforward choice. For operational simplicity, we want to use standard cluster management tools like K8s. Another common design choice in large scale systems is a separation of compute and storage. In our context, workflows must be well defined, taking inputs from S3, and producing outputs to S3 to maintain the separation of compute and storage. This simple principle enables the system to scale compute independently from storage, allowing for efficient resource usage and flexibility. The third principle is that the microservice architecture also makes it easy to choose AWS instances based on specific workload requirements. Data processing and ML workflows in Kumo have very different characteristics, requiring a mix of GPU instances, memory optimized instances, and IO optimized instances often with very large SSDs. A tailored approach to choosing the instance types is required for optimized performance and cost efficiency. Finally, the Kumo platform deals with data in the order of tens of terabytes. For continuous operations, starting from scratch after every data update can become prohibitively expensive, and slow. Incremental processing of newly arriving data and incremental materialization of graphs and features are necessary, not only for efficient resource usage, but also to reduce the overall turnaround time for training and predictions. The Kumo platform's design is guided by these four simple scaling principles.

The overall architecture can be broken down into four major components. The control plane is the always-on component that hosts the REST service, the metadata manager, and the workflow orchestrator. The control plane also houses the predictive query planner, or compiler, which compiles a user-provided predictive query into an execution plan driven by the workflow orchestrator. The data engine is the component responsible for interfacing with the data sources. It is responsible for ingesting and caching data, inferring column semantics, and computing table, column, and edge stats for data understanding. The data engine also materializes edges and features from raw data into artifacts that are used by the graph engine for neighbor sampling and feature serving, respectively. The Compute Engine is the ML workhorse of the system, with PyG at its core. The Compute Engine is where model training jobs are run for all experiments and for all predictive queries. Inference jobs that produce predictions or embeddings are also run in the Compute Engine. The Compute Engine works on mini-batches of data during the training process. Each example in a mini-batch represents a subgraph that contains a node and a sample of its neighborhood. It is the only component that requires GPU instances. Finally, the graph engine exists as a separate component to provide the Compute Engine with two services: graph neighbor sampling and feature serving. The Compute Engine requires these services to construct mini-batches for training. The graph engine is an independently scaling shared service that's used by all trainer instances in the Compute Engine.

Next, we'll get into the details of each of these components and the specific challenges we had to solve for them. First, let's take a look at the data engine. As mentioned before, the data engine is responsible for all raw data processing and transformations in the system. The data engine uses Secure Connectors to ingest data from the data sources and cache it internally in a canonical format. During this process, the data engine also tries to infer the semantic meaning of columns, and it computes extensive table-, column-, and edge-level stats for better data understanding. This information is used later by the predictive query planner in feature encoding decisions and in determining the AutoML search space. For efficient execution of predictive queries, some components need data in columnar format, while others need it in row-oriented format. For instance, the data engine needs data in columnar format for statistics computation, semantic inference, and edge materialization, whereas the feature store in the graph engine requires data in row-wise format for fast feature fetching and mini-batch construction. The data engine efficiently produces all these artifacts and is also responsible for versioning and lifecycle management of the data artifacts in the cache.

Let's first take a look at one of the two main functions of the data engine, which is feature materialization. Materializing features converts raw data features in tables to a row-oriented format that is suited for random feature serving. Materializing features is not only data intensive, but the artifacts produced are also very large. One of the primary goals here is to avoid any additional processing when these artifacts are loaded into the feature store; we want the loading to be as lightweight as possible. Additionally, we want to minimize the time taken to load the materialized features. Since we implement the feature store as a RocksDB key-value store, the data engine achieves both of these objectives by materializing features directly to RocksDB's internal SST file format. These SST files can later be loaded directly by the feature store. During feature materialization, nodes are assigned unique indices, and these same indices are used when materializing the list of edges between each pair of connected tables. That's how we can keep the edges themselves fixed size and more storage efficient. Finally, feature materialization can be easily parallelized both across tables and across partitions within a table. The second big function of the data engine is graph, or edge, materialization. Let's take a look at that now. To materialize the graph, we need to generate the list of edges for each pair of connected tables in the graph. This step is where automatic graph creation occurs, so that users don't have to bring their own graph. The edges are produced by joining tables on the keys that link them. The materialized edges are output in COO format to enable more data parallelism while generating these edges. Additionally, as new data arrives, the materialized edges are updated incrementally, which is faster and more efficient.
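The join-based edge materialization can be pictured with a small PySpark sketch. This is an illustration under assumed table and column names, not the actual Kumo pipeline; in particular, a production system would assign node indices more scalably and incrementally.

```python
# Illustrative PySpark sketch of edge materialization: assign each node a dense
# integer index, then join on the linking key to emit fixed-size (src, dst) pairs
# in COO format. Table and column names are assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("edge-materialization").getOrCreate()

users = spark.read.parquet("s3://demo-bucket/cache/users/")
sales = spark.read.parquet("s3://demo-bucket/cache/sales/")

# Assign dense node indices. A global window is fine for a sketch, but not at
# tens of terabytes; a real pipeline would assign and persist indices once
# during feature materialization and reuse the mapping here.
users_idx = users.withColumn("user_idx", F.row_number().over(Window.orderBy("user_id")) - 1)
sales_idx = sales.withColumn("sale_idx", F.row_number().over(Window.orderBy("sale_id")) - 1)

# Join on the foreign key to produce the sales -> users edge list in COO format.
edges = (
    sales_idx.join(users_idx, on="user_id", how="inner")
             .select(F.col("sale_idx").alias("src"), F.col("user_idx").alias("dst"))
)
edges.write.mode("overwrite").parquet("s3://demo-bucket/cache/edges/sales_users/")
```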

As we can see, between data caching, stats computation, and materialization of features and edges, there's a lot happening in the data engine, particularly when data is in the order of tens of terabytes. Fortunately, all of this data processing can be distributed and parallelized. However, it is important to ensure that all data processing in the Kumo data engine is completely out of core, meaning that at no time do we need all of the input to be present in memory. For this purpose, the Kumo data engine uses Spark for data processing. Specifically, we use PySpark with EMR on EKS, which enables the data engine to autoscale based on its compute needs. EMR on EKS also integrates cleanly with the Kumo control plane, which is Kubernetes based. The data engine also uses Apache Livy to manage Spark jobs. Livy provides a simple REST interface to submit jobs as snippets of code, and a clean, simple interface to retrieve results synchronously or asynchronously. Livy simplifies Spark context management and provides excellent support for concurrent, long-running Spark contexts that can be reused across multiple jobs and clients. Livy also manages and simplifies the sharing of cached RDDs across jobs and clients, which turned out to be very helpful for efficient job execution. Finally, the data engine implements a lightweight Livy driver, which launches Spark jobs on demand in EMR on EKS. The job launches from the Livy driver are themselves triggered by workflow activities, which are scheduled by the orchestrator.
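For reference, the Livy interaction pattern looks roughly like the following sketch, which creates a long-running PySpark session and submits a code snippet over Livy's REST API. The endpoint URL and the snippet are assumptions; Kumo's actual driver wraps this pattern inside workflow activities.

```python
# Illustrative sketch of submitting a PySpark snippet through Livy's REST API.
import time
import requests

LIVY = "http://livy.internal:8998"   # hypothetical Livy endpoint

# Create a long-running PySpark session and wait for the Spark context to start.
session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()
sid = session["id"]
while requests.get(f"{LIVY}/sessions/{sid}").json()["state"] != "idle":
    time.sleep(2)

# Submit a code snippet as a statement and poll for its result asynchronously.
code = "df = spark.read.parquet('s3://demo-bucket/cache/sales/'); print(df.count())"
stmt = requests.post(f"{LIVY}/sessions/{sid}/statements", json={"code": code}).json()
while True:
    result = requests.get(f"{LIVY}/sessions/{sid}/statements/{stmt['id']}").json()
    if result["state"] == "available":
        print(result["output"])       # job output, retrieved once the statement completes
        break
    time.sleep(2)
```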

Now that we are done with the data engine, let's move on to the scaling challenges with the Compute Engine. The Compute Engine is the ML workhorse of the Kumo GNN platform. It is where AutoML-driven model training happens, and also where model outputs from inference are produced. Take the example of a training pipeline for a single AutoML training job. The GPU-based trainer instance, with PyG at its core, must continuously fetch mini-batches of training examples from the graph engine and train on them, all while feature serving and mini-batch production keep up with the training throughput. Construction of the mini-batches itself is a two-step process. The first step is to sample the neighborhood for a set of requested nodes and construct a subgraph for each of those nodes. The next step is to retrieve features for the nodes in these subgraphs and construct a mini-batch of examples for training. This process is repeated for each step of model training, and Kumo supports several sampling strategies. While the specific sampling method may vary across trainer jobs, the sampler always ensures temporal consistency if event timestamps are present. An AutoML search algorithm decides which configurations the trainers will execute based on the data, the task, and past performance metrics. The goal of this search algorithm is to ensure that the optimal set of parameters is learned for any given task. At any given time, we may have multiple trainer jobs waiting to be executed for a predictive query, and many predictive queries in flight. For a reasonable turnaround time, the Compute Engine must be able to launch and execute many trainers in parallel. That is the main scaling requirement for the Compute Engine: being able to scale up the number of trainer jobs on demand.

To scale to a large number of trainers in parallel, we rely on two key ideas. The first is the separation of the graph engine from the Compute Engine, so that we are able to scale these two components independently. By sharing the feature and graph stores across multiple training jobs, we are able to run a large number of training jobs in parallel with low resource requirements on the GPU nodes themselves. These trainers communicate independently with the feature and graph stores through a shared mini-batch fetcher, which not only fetches but also caches the mini-batches requested by trainers. In practice, we have found this mini-batch fetcher to be quite helpful. The second key idea is autoscaling trainers with intelligent selection of the GPU instance type. The Kumo engine implements a lightweight driver that manages a K8s cluster of trainer instances. When requested by the AutoML searcher, this driver is able to launch a trainer on demand, selecting the instance type that is suited for the specific training job config. The resulting architecture is highly scalable, resource efficient, and cost efficient. It is also extremely flexible and makes it easy to integrate new machine learning approaches. To enable this sharing of the feature store across multiple trainer jobs, we keep only raw features in the feature store. The feature encoding itself happens on the fly in the Compute Engine. The specific types of encoding, which are part of the AutoML search space, depend on both the data and the task. Advanced users always have the option of overriding feature encoders, just as they can override other options in the generated AutoML config. Kumo supports a large number of encoding types out of the box and continues to add more as needed.
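A shared mini-batch fetcher of the kind described above might look like the following sketch. The class, client interfaces, and method names are hypothetical; the point is only that sampling, feature fetching, and caching sit behind one shared component that many trainer jobs can reuse.

```python
# Illustrative sketch of a shared mini-batch fetcher (names and interfaces are
# assumptions): trainers request mini-batches by key, and the fetcher samples
# neighbors from the graph engine, fetches raw features, and caches the result
# so that concurrent trainer jobs on the same query can reuse it.
from collections import OrderedDict

class MiniBatchFetcher:
    def __init__(self, graph_client, feature_client, capacity=1024):
        self.graph_client = graph_client      # RPC client for neighbor sampling
        self.feature_client = feature_client  # RPC client for raw feature fetching
        self.cache = OrderedDict()            # simple LRU cache of mini-batches
        self.capacity = capacity

    def get_batch(self, query_id, seed_key, seed_nodes):
        key = (query_id, seed_key)
        if key in self.cache:
            self.cache.move_to_end(key)       # mark as recently used
            return self.cache[key]
        subgraphs = self.graph_client.sample_neighbors(seed_nodes)  # hypothetical call
        features = self.feature_client.fetch(subgraphs.node_ids)    # hypothetical call
        batch = (subgraphs, features)
        self.cache[key] = batch
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict the least recently used batch
        return batch
```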

Finally, let's take a look at the graph engine and how we scale the feature and graph stores. The Kumo feature store is a horizontally scalable, persistent key-value store that is optimized for random read throughput. It is implemented as a service for feature fetching over RPC. We have implemented three key optimizations to improve the efficiency of feature serving and to speed up the conversion of raw features in storage to the Tensors expected by the caller. The first optimization reduces communication overhead. We use protobuf/gRPC to communicate between the data server and the compute client, and we store the individual node features as protobufs. Typically, as in TensorFlow, example features are defined as a list of individual feature messages that contain the name of the feature and its associated value. However, this representation is not particularly storage friendly, due to the duplication of names and the lack of memory alignment. We circumvent this issue with a simple optimization: we create a separate feature config that determines the order of columns in the protobuf containing the feature values. Additionally, we group the columns in this feature config by data type to enable features of the same type to be stored in a compacted array, which further reduces the message size.
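The storage-side idea behind this first optimization can be illustrated with a small sketch: a shared feature config fixes the column order, values are grouped by type into compact arrays, and per-row messages then carry no field names at all. This is a simplified Python illustration, not the actual Kumo protobuf schema.

```python
# Illustrative sketch: a shared config fixes the column order and groups columns
# by type, so each stored row is just a compact, name-free byte blob.
from dataclasses import dataclass
import struct

@dataclass(frozen=True)
class FeatureConfig:
    float_cols: tuple   # e.g. ("age", "avg_order_value")
    int_cols: tuple     # e.g. ("num_orders",)

def encode_row(config: FeatureConfig, row: dict) -> bytes:
    # Pack all float features, then all int features, in the config's fixed order.
    floats = [float(row[c]) for c in config.float_cols]
    ints = [int(row[c]) for c in config.int_cols]
    return struct.pack(f"<{len(floats)}f{len(ints)}q", *floats, *ints)

def decode_row(config: FeatureConfig, blob: bytes) -> dict:
    values = struct.unpack(f"<{len(config.float_cols)}f{len(config.int_cols)}q", blob)
    names = list(config.float_cols) + list(config.int_cols)
    return dict(zip(names, values))

config = FeatureConfig(float_cols=("age", "avg_order_value"), int_cols=("num_orders",))
blob = encode_row(config, {"age": 31.0, "avg_order_value": 42.5, "num_orders": 7})
assert decode_row(config, blob)["num_orders"] == 7
```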

The second optimization is related to how we convert from the row-wise to the column-wise feature representation on the client side. The client receives features stored as protobufs in the row-wise representation, as extracted from the feature store. These features need to be converted to a column-wise feature matrix, so that we can easily perform the feature transformations that are applied to columns. The image on the bottom left depicts this process. While the column-wise feature matrix is constructed in a client that is written in C++, it is later used in a number of places in Python code, and in one or more processes. We chose Arrow as our column-wise data format to benefit from its zero-copy design. The most challenging part we had to handle was dealing with NA values, for which we designed lightweight masking, along with some careful design decisions that helped us get to the zero-copy design. The third optimization improves feature access performance through better data locality, so that we maximize the number of features fetched within each fetch operation. We implemented a simple but very effective idea: reorder the nodes in the feature store based on their neighbors. For example, as you can see in the picture at the bottom right, we place the triangle nodes accessed by the same circle neighbor as close together as possible. Then we place the square nodes accessed by the same triangle neighbors as close together as possible. This optimization works very well for a lot of real-world applications, and in our particular scenario, where we have to traverse the graph and get the features for neighbors. With these optimizations in place, we were able to achieve a 3x speedup in end-to-end feature fetching, which includes the time it takes to convert features to a Tensor. As a result, we are able to keep GPUs fully utilized during model training and run many more trainer instances in parallel.
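A simplified Python illustration of the client-side conversion, using pyarrow rather than the C++ client described in the talk: rows are transposed into typed column buffers, NA values are tracked with a validity mask, and numeric columns without nulls can be viewed from NumPy without copying.

```python
# Illustrative sketch of row-wise to column-wise conversion with Arrow.
import numpy as np
import pyarrow as pa

rows = [                                    # row-wise features as fetched over RPC
    {"age": 31.0, "num_orders": 7},
    {"age": None, "num_orders": 3},         # missing value (NA)
]

def to_column(name, dtype):
    values = [r[name] for r in rows]
    mask = np.array([v is None for v in values])             # True marks an NA slot
    filled = np.array([0 if v is None else v for v in values], dtype=dtype)
    # Only attach a validity mask when there actually are nulls.
    return pa.array(filled, mask=mask if mask.any() else None)

table = pa.table({
    "age": to_column("age", np.float32),
    "num_orders": to_column("num_orders", np.int64),
})

# Numeric columns without nulls can be viewed as NumPy arrays with zero copies.
orders = table.column("num_orders").chunk(0).to_numpy(zero_copy_only=True)
print(orders)   # -> [7 3]
```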

Let's move on to the Kumo graph store, which is implemented as an in-memory graph store optimized for sampling. It is also implemented as a separate service to allow for independent scaling from the feature store and the Compute Engine. Sampling in GNNs produces a uniquely sampled subgraph for each seed node. To ensure maximum flexibility during training, the graph store must be optimized for fast random access to the outgoing neighbors of a given input node. Additionally, to scale to large graphs with tens of billions of edges, the graph store must minimize the memory footprint by using compressed graph formats. The graph engine achieves both of these objectives by leveraging core PyG sampling algorithms that are optimized for heterogeneous graphs in CSR format. When timestamps are provided, the edges in the CSR are secondarily sorted by timestamp to speed up the temporal sampling process. Furthermore, since enterprise graphs are typically sparse, the CSR representation itself is able to achieve very high compression ratios.
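The CSR layout with temporally sorted edges can be sketched in a few lines of NumPy. This is an illustration of the idea, not the PyG sampler itself: because each node's neighbors are stored contiguously and sorted by timestamp, a temporal sample only needs a binary search plus a choice among the eligible neighbors.

```python
# Illustrative sketch of a CSR graph store with temporally sorted edges.
import numpy as np

# CSR layout: indptr[i]:indptr[i+1] is node i's slice of indices/timestamps.
indptr = np.array([0, 3, 5, 6])                    # 3 nodes
indices = np.array([1, 2, 2, 0, 2, 0])             # neighbor node ids
edge_time = np.array([5, 8, 12, 3, 9, 4])          # sorted within each node's slice

def sample_neighbors(node, seed_time, k, rng=np.random.default_rng()):
    start, end = indptr[node], indptr[node + 1]
    times = edge_time[start:end]
    # Binary search works because each node's edges are sorted by timestamp.
    cutoff = start + np.searchsorted(times, seed_time, side="right")
    candidates = indices[start:cutoff]             # only edges at or before seed_time
    if len(candidates) <= k:
        return candidates
    return rng.choice(candidates, size=k, replace=False)

print(sample_neighbors(node=0, seed_time=9, k=2))  # samples from neighbors {1, 2}
```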

Now let's move on to the control plane and some of the specific challenges we had to solve there. The Kumo platform, as we have seen, is designed to run many predictive queries on potentially many graphs simultaneously. Training a predictive query requires running a large number of workflows to securely ingest data into caches, compute stats, and materialize features and edges for the graph engine. Then, additional workflows are needed to automatically generate the training and validation tables, train the models, and run inference to produce model outputs. To support all of this, the Kumo control plane is built around a central orchestrator with dynamic task queues that can scale to execute thousands of stateful workflows and activities concurrently. After carefully evaluating a number of orchestrator options, we chose Temporal for workflow orchestration. In addition, the control plane includes a metadata manager and enterprise-grade security and access controls. The metadata itself is stored in a highly available transactional database. The control plane also handles all aspects of graph management, including scheduling incremental updates to the graph and graph versioning. Furthermore, the control plane provides basic MLOps functionality, including model versioning and tools to monitor data and model quality.
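As a rough illustration of this orchestration style, a predictive-query pipeline could be expressed as a Temporal workflow that drives a sequence of activities, as in the sketch below. The workflow structure, activity names, and timeouts are assumptions made for illustration; Kumo's actual workflow definitions are not public.

```python
# Illustrative sketch of a predictive-query pipeline as a Temporal workflow
# using the temporalio Python SDK (activity names and timeouts are assumptions).
from datetime import timedelta
from temporalio import activity, workflow

@activity.defn
async def ingest_and_cache(table: str) -> str:
    ...  # pull data through the secure connector and cache it
    return f"cached:{table}"

@activity.defn
async def materialize_graph(graph_id: str) -> str:
    ...  # materialize features and edges for the graph engine
    return f"materialized:{graph_id}"

@activity.defn
async def train_automl(query_id: str) -> str:
    ...  # run the AutoML search; spawns further training jobs
    return f"model:{query_id}"

@workflow.defn
class TrainPredictiveQuery:
    @workflow.run
    async def run(self, query_id: str, graph_id: str, tables: list[str]) -> str:
        for table in tables:
            await workflow.execute_activity(
                ingest_and_cache, table, start_to_close_timeout=timedelta(hours=1))
        await workflow.execute_activity(
            materialize_graph, graph_id, start_to_close_timeout=timedelta(hours=2))
        return await workflow.execute_activity(
            train_automl, query_id, start_to_close_timeout=timedelta(hours=6))
```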

Summary

GNNs bring the deep learning revolution to the graph data found in enterprises. Graph-based ML can significantly simplify predictive analytics on relational data by replacing ad hoc feature engineering with a much more principled approach that automatically learns from the connections between entities in a graph. However, deploying GNNs can be very challenging, particularly at the scale that many enterprises require. The Kumo platform is designed from the ground up with an architecture that scales GNNs to very large graphs. The platform is also able to train many models simultaneously and produce model outputs from many queries on the same graph at the same time. While building out the capabilities of the Kumo platform requires deep expertise in both GNNs and high-performance distributed systems, that complexity is hidden away from users behind a simple, intuitive API. As an example of how far the Kumo platform can scale, in one specific customer deployment we score 45 trillion combinations of user-item pairs and generate 6.3 billion top-ranking link predictions in a matter of a couple of hours, starting from scratch. By designing for flexibility, the Kumo GNN platform enables users to quickly go from a business use case to a deployable GNN at scale, and therefore to derive business value much faster.

 


 

Recorded at:

Feb 16, 2024
