Fabricator: End-to-End Declarative Feature Engineering Platform

Summary

Kunal Shah discusses how their ML platform designed Fabricator by integrating various open source and enterprise solutions to deliver a declarative end-to-end feature engineering framework.

Bio

Kunal Shah is an ML Platform Engineering Manager at DoorDash focusing on building a feature engineering platform. Over the last year he has launched declarative frameworks for both batch and real-time feature development, accelerating the development lifecycle by over 2x. Previously, he worked on ML platforms and data engineering frameworks at Airbnb and YouTube.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Shah: How many of you have directly or indirectly faced the struggle of improving the scale and velocity of feature engineering with the growing number of ML applications in your organization? An estimated 75% of model development time goes into feature engineering, which makes feature engineering velocity one of the most vital offerings of a good ML platform. My name is Kunal. I've been working on DoorDash's ML platform for the last two years. I want to deep dive into Fabricator, DoorDash's feature platform, which has helped us scale feature volumes almost 10x and improve feature iteration times from days to hours. In this talk, I hope to cover the design, the architecture, and some of the key learnings in our journey.

Outline

For the agenda, I want to start with a brief history of feature engineering at DoorDash, then convert some of the learnings and pain points of our past into a vision for an ideal feature platform. From there, I'll give an overview of Fabricator, which brings this ideal platform to life. Lastly, I'll go into some architecture deep dives that cover an end-to-end use case, and follow up with the results and learnings from this architecture.

Feature Engineering at DoorDash

Feature engineering at DoorDash is pretty vital to most of our data scientists' lives. To give a brief peek at our volumes, about a year ago, we had about 400 unique features, serving our models in production, mapping to about 100 billion daily feature uploads. Then, each of these were powered by about 60 jobs across the spectrum. How did the legacy systems look in terms of the data science experience that powered the scale? We had two phases that the data scientists cared about, which is feature development and feature serving. We did have an efficient online feature serving ecosystem on the ML platform side with a scalable, efficient feature store that was based on Redis. The online feature serving piece was owned by the ML platform. Then we also had a robust ETL framework supported by the data engineering team that helped us interface with the data warehouse and have a sound ETL flow in place. However, all of these were manual steps for our data scientists in terms of setting up an end-to-end feature pipeline. They would have to interface directly with our workflow orchestration as well as the data warehouse to set up their feature generation. Then, also work with data warehouse again, through SQL statements to prepare these features for model training. Then go through the feedback loop, build new features, iterate, test your models, so on, repeat till you have a set of production-ready features. Once that's done, they then approach the ML platform to get these features uploaded on a daily basis to the online feature store so that they can move on to the serving piece of their architecture.

As this narrative indicates, this was quite a painful process for data scientists on many levels. First and foremost, these fragmented systems definitely hampered their iteration velocity. They had to interface with many loosely coupled systems. They had to deeply understand how orchestration works. What are the nuances of the warehouse? How do the feature stores operate? How can they get the data into the feature store for specific feature types? How do they access all of this inside their notebooks, and then set up the whole loop again, next time they want to create a new type of feature? One more piece of this was, the semantic meaning of feature gets lost when you jump through all these hoops. You start off with a table, you end up with a feature store, a feature gets lost somewhere in the middle. Also, the infrastructure evolution itself got really slow. For the ML platform engineers, each time you wanted to make a specific change to how a feature behaves in production, you'd have to go all the way to the top to understand how the systems and the integrations work from the feature generation aspects. The same thing applies to the other systems. Any evolution breaks the delicate balance of how the flow works today. Integrations and mutations take way too long. We had a growing amount of inefficiency over time, which made our pipelines really slow as our scale grew.

Then, lastly, we didn't particularly have a control plane for our features. A data scientist builds a new feature, and that particular data scientist might know the end-to-end workings of the whole workflow, but it's very difficult for them to share this with other parts of their ecosystem. How do they convey this feature to other data scientists or other verticals? How do they keep track of features that they've built in the past? What logic went into them? Even apart from the logic itself, how do they monitor their features? Is the feature performing well in production? Is the pipeline down? Are there any systemic inefficiencies in any of the systems we discussed, the warehouse, the feature store, and so on? The control plane was pretty nonexistent, and that made it really hard for data scientists to monitor the health of their feature ecosystem.

Reimagining an Ideal Feature Platform

That got us thinking, and for a while we considered incremental improvements to the system. We soon realized that we were hitting a wall with how fast incremental improvements could bring about a change. We set about reimagining what an ideal platform should do that our platform wasn't doing, and then tried to go from there. To get this process kickstarted, we reached out to a large number of data scientists within our own organization. We also looked at multiple other industry solutions in the feature engineering and feature platform space, to understand where our gaps were and how we could address them. These were the five major points that we came up with as must-haves in an ideal feature platform. The first one was to have a single entrypoint. The entire feature lifecycle from dev to production should operate with a single interface, without a data scientist needing to go to multiple input points. Secondly, this entrypoint should also provide them a means to define their feature semantically. The system or the platform should natively understand what a feature is, how it operates, how it's generated, and what its constituents are. How should we serve this in production? How should we monitor this in production? All of these aspects should live within that same semantic representation.

Thirdly, it's very important for us to be able to operate and generate these features through simplified abstractions. As we deep dived into the actual data engineering aspects of features, we realized that understanding Spark, or Snowflake, or any other warehouse and compute solutions, is hard for the data scientists to keep up with. If possible, they should operate with simple, high-level APIs that can be scaled by platform engineers who understand the systems a lot better than the data scientists do. All of these components combined were basically a means to an end, serving a higher goal: higher iteration velocity. That is an absolute must-have on a feature platform. We need to make the process of converting ideas into code as simple as possible. Features can change really quickly, and it shouldn't take a week to deploy new features or new changes to features.

Then, lastly, once all of these generation aspects are in place, and you are able to iterate quickly, we wanted to automate all the downstream steps once a feature is created. How can the feature get uploaded to the feature store once it's ready? How do we monitor this feature's health in production? How do we orchestrate this to run on a daily basis? Is there something else that we can observe on the feature, like drift, or regressions, or performance degradations, that we can automate? All of these should be free pieces of infrastructure that we can provide as an integration to the single entrypoint we discussed above. Converting all the words into a picture, this was the high-level architecture of an ideal platform that we came up with, where the users interface mainly with a single feature registry. Then the serving, the development, and the management aspects of this feature lifecycle should come for free. As we go into the vision of Fabricator, we'll dive into how we made this possible.

Fabricator: Overview

Fabricator was the ecosystem that we built to bring the ideal platform to life. The vision that we started off with was that data scientists should be able to define their end-to-end feature pipelines in a declarative way. Then, once that is in place, the remaining operational lifecycle should be fully automated. As long as we can deliver this, we would have addressed all the pain points we discussed above, as well as realized the five must-haves of the feature platform we defined earlier. We broke it down into three major components to deliver those aspects. The first one was the central declarative registry, which would focus on providing that single entrypoint that allows users to define their semantic representations in a highly simple API form. That would be followed by a unified execution environment, which should provide a place where the definitions that they set up would be able to run using simple, high-level APIs that interface with all the lower-level details like Spark, Snowflake, the other parts of our infrastructure like the feature store, and so on, without the user having to write more code. This would give them the high iteration velocity they were looking for. Additionally, this execution environment would also focus on providing a playground where they could build or iterate on their features the same way they would run in production, without any change to their code. Our goal here was also to provide a sandbox environment that mimics production in every possible way. Then, lastly, we would focus on the infrastructure automation pieces that take the generated features from the first two parts of the framework and convert them into an end-to-end ecosystem. Because it's not enough just to create features, we want to do a lot more with them once they are ready. That part all goes into the infrastructure automation pillar of Fabricator.

Once again, putting a picture to those components. We restructured or redrew the ideal platform into these components where the feature registry still is in place, which is the central entrypoint for the data scientists. The unified execution environment provides the place where the code actually runs. This is either in development or in production, if you want notebook access, or you want production access. Both of these will operate in the same sandbox or API environment that the execution environment provides. Then once the features are ready, the infrastructure automation takes over the remaining parts of the lifecycle.

Architecture Deep Dives

For the rest of this talk, what I wanted to do was deep dive into a production use case and look at it from the lens of that feature going all the way through the pipeline, and see how we achieved this through the different frameworks we adopted. For this case, what I wanted to focus more on was a source use case which tries to build engagement features per consumer, then persist those to a feature store for online serving via Redis.

Jumping into the first part, which is, how do we get started? How does a data scientist start thinking about building these engagement features on the registry? A few salient considerations for the feature registry were that we wanted a simple YAML definition for the data scientists to provide their feature semantics. YAML was the language we chose because it is very descriptive. There is nothing it cannot cover. However, it's more approachable for the data scientist as well. It gives us room to be really broad in our definition space, yet really easy to adopt for multiple practices. However, YAML is also like the Wild West. We wanted to put some structure to the YAML spec, and so we created protobuf-backed schemas for each of those YAML objects. What this enabled us to do was provide validation scripts, and provide robustness and backward compatibility for our YAML definitions, to ensure that we could keep this framework extensible for any new infrastructure or any new production practices our data scientists wanted over time. Thirdly, we also have a central service that is backed by a database that stores all of these protobuf YAML objects, so that every other service within production infrastructure can have access to these definitions, and take actions on them as and when we want them to. This helps us disseminate all of our user definitions across the entire ecosystem. Then, lastly, these YAML definitions are checked into GitHub, which is continuously deployed for every change. Each time a user changes a specific feature, continuous deployment, which takes a few minutes, takes the entire feature change live. This enables the really high iteration velocity, which was one major focal point for all of our efforts.
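The schema-backed validation idea can be sketched in a few lines. This is only an illustration under stated assumptions: the real schemas are protobuf messages that are not shown in the talk, so a Python dataclass stands in for one here, and `FeatureDef`, `validate`, the field names, and the `owner` field are all hypothetical. The dictionary passed to `validate` represents what a parsed YAML definition might look like.

```python
from dataclasses import MISSING, dataclass, fields


# Hypothetical stand-in for a protobuf-backed schema of one registry object.
@dataclass(frozen=True)
class FeatureDef:
    name: str
    entity: str
    source: str
    # Optional field with a default: definitions written before this field
    # existed remain valid, which is the backward-compatibility property.
    owner: str = ""


def validate(schema, raw: dict):
    """Turn a parsed YAML mapping into a schema instance, or raise ValueError."""
    known = {f.name: f for f in fields(schema)}
    unknown = sorted(set(raw) - set(known))
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    missing = sorted(
        name
        for name, f in known.items()
        if f.default is MISSING and f.default_factory is MISSING and name not in raw
    )
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    wrong_type = sorted(key for key, value in raw.items() if not isinstance(value, str))
    if wrong_type:
        raise ValueError(f"fields must be strings: {wrong_type}")
    return schema(**raw)


# An older-style definition without `owner` still validates.
feature = validate(
    FeatureDef,
    {"name": "consumer_clicks_7d", "entity": "consumer", "source": "consumer_engagement_features"},
)
```

Rejecting unknown keys catches typos at check-in time, while defaults on newly added fields let the schema grow without breaking definitions that are already in the registry.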

What are these feature semantics that I've been talking about? What does it mean to define a feature lifecycle? In our experience, an end-to-end pipeline effectively comes down to three major YAML definitions. The first piece of the YAML definition is a source. We need to define the generative lifecycle of a feature. What does it take to get a feature from an upstream source to its offline storage? Again, within the space of a source definition, we broke it down into three components. The first is the compute_spec. How am I going to create this feature? Am I using Spark? Am I using some other framework? All the information that is required to set up feature creation goes into the compute_spec definition. This particular example here highlights how we use a spark_spec that can take in a file and some resource constraints on how you want your Spark code to run, and that will set up your feature Spark pipeline. However, what if you didn't want to use Spark alone? What if you wanted to apply Snowflake SQL, or Spark SQL? You could specify a different computation type, the corresponding SQL type, and then add your SQL statements. We can execute those, and that can become part of that computational spec.

More radically, if you want to do something with near-real time features, we have an internal framework called Riviera that lets users operate on kafka_sources directly, using Flink SQL to apply their feature transformations. This could also be part of your compute_spec. This covers the generation part of your source. Once you've set up your generation, you'd also need to tell us where you want to store your features. Again, this is part of an evolving ecosystem. Long ago, we were using Snowflake as a primary warehouse. As we move towards more native types of storage, think about embeddings, array-type storage, and so on, we realize that Delta Lake-based storage is really helpful as a paradigm as well. We have a duality of storage types. As a data scientist, you can choose the type of your storage and which tables you want to store your features in. That is enough to tell us where to store your features. Then, lastly, once you define your compute and storage, you also need to orchestrate this on a regular cadence. We have something known as a trigger_spec that lets you do many things. You can choose to trigger your pipelines when the upstream data is ready. Alternatively, you could choose to trigger your pipelines at a specific time every day, or at a specific time every week. The trigger_spec gives data scientists the flexibility to choose when the pipelines run.
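To make the three parts of a source concrete, here is a rough sketch of what such a definition might look like once the YAML has been parsed into a Python dictionary. Only `compute_spec`, `spark_spec`, `storage_spec`, `trigger_spec`, and the source name `consumer_engagement_features` come from the talk; every other key, every enum value, the file path, and the helper `describe_source` are assumptions made for illustration.

```python
# Hypothetical enum values; the real registry's type names are not shown in the talk.
COMPUTE_TYPES = {"SPARK", "SPARK_SQL", "SNOWFLAKE_SQL", "FLINK_SQL"}
STORAGE_TYPES = {"SNOWFLAKE", "DELTA_LAKE"}
TRIGGER_TYPES = {"UPSTREAM_READY", "DAILY", "WEEKLY"}

# What a parsed source definition might look like (all nested keys are assumed).
engagement_source = {
    "name": "consumer_engagement_features",
    "compute_spec": {
        "type": "SPARK",
        "spark_spec": {"file": "jobs/consumer_engagement.py", "executor_memory_gb": 8},
    },
    "storage_spec": {"type": "DELTA_LAKE", "table_name": "consumer_engagement"},
    "trigger_spec": {"type": "DAILY", "hour_utc": 8},
}


def describe_source(source: dict) -> str:
    """Check that all three specs are present and well-typed, then summarize them."""
    expected = {
        "compute_spec": COMPUTE_TYPES,
        "storage_spec": STORAGE_TYPES,
        "trigger_spec": TRIGGER_TYPES,
    }
    for section, allowed in expected.items():
        if section not in source:
            raise KeyError(f"source is missing {section}")
        if source[section]["type"] not in allowed:
            raise ValueError(f"unsupported {section} type: {source[section]['type']}")
    storage = source["storage_spec"]
    return (
        f"{source['name']}: {source['compute_spec']['type']} -> "
        f"{storage['type']}:{storage['table_name']} ({source['trigger_spec']['type']})"
    )


summary = describe_source(engagement_source)
```

Switching the compute engine or the storage layer in this shape is a one-line change to the `type` value, which is the kind of infrastructure flexibility a declarative definition is meant to give.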

Once your features reach the offline storage, the next part is the serving ecosystem. Where should your features be materialized into an online store, so that they can be served to production models? That brings us to the concept of a sink. Sinks are independent of the actual feature itself. A sink is an online storage layer where your features should end up. This is a capability that the ML platform itself maintains, so that we can define all the supported sinks in one place. The data scientists can just access these definitions. A sink definition typically looks like this. Let's say the search team has their own Redis cluster, and so we define that as the search-redis cluster with the type REDIS. Then we give a redis_spec that gives us the cluster node to which we can start uploading features. This is enough to set a sink up forever. Any search features in the future, not just engagement features, can land on the sink, and that's all the information we need.

Then, lastly, once you have a source and sink setup ready, you can go on to define the features. A source, as we described earlier, is typically a table that has a large number of features as columns that map to a specific logical aspect. Let's say engagement features focus on measuring clicks and views. This table will have multiple features stored in it. You can define features. Each feature has a unique name and an entity, which tells you which key space this feature is based on. In this particular case, let's say it is consumer, because these are consumer-specific features. Then you tell us which source table has the features that we can look up. You can have many features in one table or just one. It's a logical grouping, and it's entirely up to the data scientist. Then, lastly, you can tell us where you want to serve your features. In this particular case, we've chosen search-redis. However, this also promotes the concept of shareability. Let's say the search team and the ads team were both looking for the same feature. Then we could materialize this particular feature to both of their Redis clusters, and they could share that feature on their own serving stacks. That summarizes the feature YAML setup.
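The relationship between features and sinks, including the shareability point, can be sketched as follows. This is a toy model, not the registry's actual format: `search-redis`, the `REDIS` type, `redis_spec`, `materialize_spec`, and the `consumer` entity are named in the talk, while the `ads-redis` sink, the feature names, the host names, the nested key names, and the `upload_plan` helper are assumptions.

```python
from collections import defaultdict

# Sinks are defined once by the platform team (host names are placeholders).
sinks = {
    "search-redis": {"type": "REDIS", "redis_spec": {"cluster_node": "search-redis.example.internal:6379"}},
    "ads-redis": {"type": "REDIS", "redis_spec": {"cluster_node": "ads-redis.example.internal:6379"}},
}

# Features reference a source table and one or more sinks by name.
features = [
    {
        "name": "consumer_clicks_7d",
        "entity": "consumer",
        "source": "consumer_engagement_features",
        "materialize_spec": {"sinks": ["search-redis", "ads-redis"]},  # shared by two teams
    },
    {
        "name": "consumer_views_7d",
        "entity": "consumer",
        "source": "consumer_engagement_features",
        "materialize_spec": {"sinks": ["search-redis"]},
    },
]


def upload_plan(features: list, sinks: dict) -> dict:
    """Group feature names by the sink they should be materialized to."""
    plan = defaultdict(list)
    for feature in features:
        for sink_name in feature["materialize_spec"]["sinks"]:
            if sink_name not in sinks:
                raise KeyError(f"{feature['name']} references unknown sink {sink_name}")
            plan[sink_name].append(feature["name"])
    return dict(plan)


plan = upload_plan(features, sinks)
```

Because a feature only names its sinks, sharing it with another team is a matter of adding one more sink name rather than building a second pipeline.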

However, why did we choose this design? What did we achieve by going down this route instead of just coding all of this into Python-based wrappers? The first benefit is that evolution was really easy. Because we focus on protobuf-based backends, we were able to make our definitions really robust. We were able to make them extensible over time. When we first started Fabricator, we didn't have this wide support for a large number of storage and compute options. We focused on a small slice of the puzzle and then expanded it, and we were able to do that because the protobuf spec allows us to keep adding new features and capabilities to the framework without burdening the system too much. Secondly, we were able to support a large amount of infrastructure flexibility. The YAML spec doesn't change much for the user when we switch over to a new type of storage, like Snowflake versus Delta Lake, or to a new type of compute, like Snowflake SQL versus Spark, so our users don't experience a significant shift in their onboarding. They go through the same entrypoint, they simply set up their YAML and change something, and it just works magically. They only need to choose the right set of knobs, without too much effort.

Then, lastly, these definitions are globally available. Every particular downstream that we have can look up the registry and take actions based on changes to the registry, without the users actually having to trigger anything. Once continuous deployment finishes, the registry has the new and updated changes. Then those changes can further trigger downstream actions, like feature serving, feature observability, and so on, which we'll describe in the infrastructure automation section, for free. This helped us improve our velocity by quite a bit. One of the salient features of the registry design was that the feature definition time, which was the biggest time sink for our data scientists, went down from days, because they had to navigate so many frameworks, down to minutes, because understanding YAML is seconds of work. That was really helpful for our data scientists.

However, the definition space is only a part of the puzzle. A definition is only as helpful as our ability to get it to run. We were actually tasked with a bigger challenge of finding an execution environment that makes it really easy for these definitions to execute. How does Spark compute get stored onto a Snowflake table? Or better yet, how do we take a Flink transformation that was written in SQL, and get it all the way into a Delta Lake without users adding additional plumbing? The answer lies entirely in our execution environment. It includes a library suite that bridges the gap between a simple YAML definition and the actual infrastructure. This suite focuses on enabling something we call contextual executions, which take the registry definitions and translate them into native Python definitions, then initialize them into jobs, and then run them end-to-end. This contextual execution is the prime focus of the execution environment. Then, lastly, apart from just being able to run this code, the library suite and the environment itself package a lot of knobs and black box optimizations that make users' code execute really quickly. If we learn a better way to run specific types of embeddings, we can bundle this into the library, apply it to all possible embedding jobs, and improve the compute of all of them in one go. This has happened to us in the past, where we were running really slow embedding jobs for a few hours, and we were able to bring the running time down to minutes, because we were able to apply these black box optimizations to all the jobs.

What do contextual executions really mean? What the library focuses on is to provide Pythonic wrappers around simple YAML definitions that allow our framework to execute them efficiently. We do it using two major concepts. The first is a context. A FeatureContext, for example, is a context that wraps the entire YAML definition you saw earlier. It takes a source, the set of features that are connected to that source, and the set of entities connected to those features, and uses all of that information, including the storage_spec, compute_spec, and so on, and packages it into a single Python class. This can be initialized directly from the registry. Once you do that, we have a single context that knows what needs to be run for a user's code. The second piece is to actually have a runner: something that can take a context to completion and have the data end up in the Delta Lake, or the Snowflake table, and so on, as required. We define a set of upload classes. For example, a SparkFeatureUpload focuses on taking a FeatureContext, running it in Spark, and getting it to the Delta Lake. Correspondingly, we have a SnowflakeFeatureUpload that focuses on running Snowflake SQL on your given FeatureContext and taking it to the end storage. These are all internal libraries. What does the user code look like? The user code is actually just going to be a simple set of three lines: a FeatureContext initialized from my source name, which we defined earlier as consumer_engagement_features, then a SparkFeatureUpload initialized using this context, and then you simply run the job. These three simple lines of code are all you'll need to upload your job metrics, get alerts when your job completes, monitor this job for failures, get failure conditions, and finally have the data end up in your targeted storage, if everything else succeeded.
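The context-plus-runner pattern described above can be sketched with toy classes. The real FeatureContext and SparkFeatureUpload are internal DoorDash libraries whose interfaces are not shown in the talk, so everything below is an assumption that only mimics the described shape: the in-memory `REGISTRY`, the method names `from_source_name` and `run`, the `compute` callback, and the `events` list that stands in for job metrics and alerts. No Spark is involved.

```python
from dataclasses import dataclass

# Toy in-memory registry standing in for the central registry service.
REGISTRY = {
    "consumer_engagement_features": {
        "compute_spec": {"type": "SPARK"},
        "storage_spec": {"type": "DELTA_LAKE", "table_name": "consumer_engagement"},
        "features": ["consumer_clicks_7d", "consumer_views_7d"],
    }
}


@dataclass(frozen=True)
class FeatureContext:
    """Everything a runner needs to know, resolved from one source name."""
    source_name: str
    compute_type: str
    storage_type: str
    table_name: str
    features: tuple

    @classmethod
    def from_source_name(cls, source_name: str) -> "FeatureContext":
        definition = REGISTRY[source_name]
        return cls(
            source_name=source_name,
            compute_type=definition["compute_spec"]["type"],
            storage_type=definition["storage_spec"]["type"],
            table_name=definition["storage_spec"]["table_name"],
            features=tuple(definition["features"]),
        )


class SparkFeatureUpload:
    """Toy runner: takes a context to completion and records what happened."""

    def __init__(self, context: FeatureContext, compute=None):
        if context.compute_type != "SPARK":
            raise ValueError(f"cannot run {context.compute_type} compute in this runner")
        self.context = context
        self.compute = compute or (lambda ctx: [])  # placeholder for the real job
        self.events = []  # stands in for metrics, alerts, and failure reporting

    def run(self):
        self.events.append("started")
        try:
            rows = self.compute(self.context)
        except Exception:
            self.events.append("failed")
            raise
        target = f"{self.context.storage_type}:{self.context.table_name}"
        self.events.append(f"wrote {len(rows)} rows to {target}")
        self.events.append("succeeded")
        return rows


# The three lines of user code the talk describes, in this toy form.
context = FeatureContext.from_source_name("consumer_engagement_features")
job = SparkFeatureUpload(context)
job.run()
```

Keeping the definition lookup in the context and the side effects in the runner is what lets the same three lines behave identically in a notebook and in production.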

What's even more interesting, and this goes into the benefits of this design, is that these three lines of code that you see here don't even need to be written by the users if they haven't extended these classes. In this particular case, FeatureContext and SparkFeatureUpload are library constructs. If you were a power user, you'd probably look to extend these. If you're not, if you're looking for straightforward, mainstream supported use cases, you could simply run your jobs with no code. That's one of our first benefits. Most of our jobs within that layer of Fabricator are no code. If you're using simple YAMLs, if you don't have customizations for your jobs, we can run your jobs directly without any additional code. Those three lines of code can simply be autofilled. Additionally, this provides a means for our users to do high fidelity testing.

Notebook clusters and production clusters are pretty much the same. You could set up your definitions and then run these three lines of code in a playground, and have it do the exact same thing that the production jobs would do, and see what the features look like, and see what using them in your models would look like. You could have an entire iteration loop in your playground before you're ready to commit these features to production. Then, lastly, like I mentioned earlier, the black box optimizations make executions extremely efficient. In the embeddings example I gave earlier, we were able to achieve more than a 10x speedup on our users' code for embeddings, simply because we were able to use UDF vectorization within Spark a little differently than they had set it up. Once we learned how to optimize one particular long-running job, we were able to package that into the library and apply it to all embedding jobs. That helped us save hours of running time on many jobs on a daily basis. This is one example of how efficient executions can be powered through this execution environment.

This, as I said earlier, gets us through the generation lifecycle. My data lives in the offline storage now. As a data scientist, that's not enough for me. I need more things to finally use it in production models. The infrastructure automation piece provides a lot of utilities that enable multiple downstream integrations for free. In this section, we'll take a quick peek at all the integrations we enable. The first one, of course, is orchestration. Now that my feature pipelines are ready, can you please run them for me on a regular cadence? The good news is that we already have all the information in the YAML spec. You've told us how to compute your features, and you've told us how to trigger your features. We convert those into automatically constructed DAGs with the right dependency setup and the right hooks in place. We actually have a flexible choice of orchestrators. Internally, we support both Airflow and Dagster. For Fabricator itself, we lean towards Dagster as a choice. What we additionally do for our users is add something known as a date partitioning indexer, which creates a time slice of each day for your YAML.

Like I mentioned earlier, contextual executions enable us to provide a context for each day of the run. What this does is provide scalable and parallelized backfills. If you see in the corner here, there's a launch backfill button. Let's say you developed a pipeline today, and you're like, I need 30 days of data to actually see if my pipeline is going to be useful for my models in production. You could simply go to launch backfill and choose the last 30 days of time slices, or date partitions, that you wanted to backfill, and within a few minutes, you'd be able to run all your jobs in parallel to backfill those dates, and the results would end up in the data lake or storage layer as well. Then you could start testing with those backfilled dates. This was actually a step change for data scientists, for whom backfill was previously a reasonably involved task. If they wanted to backfill, they had to tweak their production ETL code and see how they could get it to run. This was a really helpful change that we got from Dagster-based date partitions.
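The idea behind date-partitioned, parallel backfills can be illustrated without any orchestrator. The sketch below does not use Dagster or any Fabricator code: `date_partitions`, `run_partition`, and `backfill` are hypothetical helpers, a thread pool stands in for the orchestrator's parallelism, and the per-day job is a placeholder that only returns a status string.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def date_partitions(end: date, days: int) -> list:
    """Return `days` daily partition keys ending at `end`, oldest first."""
    return [(end - timedelta(days=offset)).isoformat() for offset in range(days - 1, -1, -1)]


def run_partition(partition: str) -> str:
    # Placeholder for one day's feature job; each partition is independent,
    # which is what makes the backfill safe to parallelize.
    return f"done:{partition}"


def backfill(end: date, days: int, max_workers: int = 8) -> list:
    """Run every daily partition in the window, several at a time."""
    partitions = date_partitions(end, days)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even though jobs run concurrently.
        return list(pool.map(run_partition, partitions))


# Backfill a 30-day window ending on an arbitrary example date.
results = backfill(date(2023, 1, 30), days=30)
```

Because each day is its own unit of work with its own context, a 30-day backfill is just 30 independent runs of the same job rather than a special code path.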

Secondly, the online serving piece. As I mentioned earlier, we had a very well scaled Redis feature store in production, and an upload service that focused on getting user data into the Redis store. With this framework in place, we can trigger the upload service each time a new partition of your data is available. Whenever you run a new daily run of your generation job, the new data can trigger the upload service and tell it to get the data into the Redis store. This enables really high data freshness without the users having to set anything else up. They already told us what they wanted to do by providing the materialize_spec and the sink definitions earlier. Thirdly, feature discovery gets really easy. Internally at DoorDash we use Amundsen as our data catalog, a means to index all our data objects. What we were able to do was connect the metadata extractors to the feature registry, and use that to extract information on all our different features and their upstreams, and effectively connect them to the entire company-wide data lineage graph, all the way from upstream golden tables to the downstream models that use these features. This was really helpful for our data scientists to visualize how the feature flow happens.

Then, lastly, the feature observability piece that I mentioned. Now that your features are in production, running happily, getting served to models, how are they performing? How do you measure that? We use Chronosphere as a dashboard to collect metrics on feature performance in production. We can tell you how many of your online predictions ended up in defaults using this feature, or how many times you had a miss on your cache for your features. What was the standard deviation like? What was your feature's distribution like over the last couple of weeks? You can use these and set up alerts in case you wanted specific thresholds for your feature's distribution.
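As a rough illustration of the kinds of metrics mentioned here (default rate, cache miss rate, and distribution statistics with alert thresholds), the following sketch computes them in plain Python. It does not use Chronosphere or any real metrics API; `feature_health`, `breached`, the metric names, the sample numbers, and the threshold values are all assumptions.

```python
from statistics import mean, pstdev


def feature_health(values: list, lookups: int, cache_misses: int, defaulted: int) -> dict:
    """Summarize one feature's serving behavior over some window."""
    if lookups <= 0:
        raise ValueError("lookups must be positive")
    return {
        "default_rate": defaulted / lookups,        # predictions that fell back to a default
        "cache_miss_rate": cache_misses / lookups,  # lookups not found in the online store
        "mean": mean(values),
        "stddev": pstdev(values),                   # population standard deviation
    }


def breached(health: dict, thresholds: dict) -> list:
    """Return the names of metrics that exceed their alert threshold."""
    return sorted(name for name, limit in thresholds.items() if health[name] > limit)


# Made-up sample of served feature values and counters for one window.
health = feature_health(
    values=[2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0],
    lookups=100,
    cache_misses=5,
    defaulted=10,
)
alerts = breached(health, {"default_rate": 0.05, "cache_miss_rate": 0.10})
```

In this made-up window the default rate is above its threshold while the cache miss rate is not, so only the former would raise an alert.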

Results

Given everything that we've built so far, where did we end up? A year down the line, what results do we have, and what did we learn from those? Today, Fabricator supports over 2000 unique features that produce about a trillion values on a daily basis into our offline storage layer. This maps to about 500 different jobs across the board. We were able to achieve significant scaling simply through users being able to churn through their experiments a lot quicker and be able to build new product ideas faster.

Learnings

What did we learn along the way? The first piece of the puzzle was that building products, not systems, was really helpful to drive adoption for the feature platform. Previously, our users thought of features as an interaction between a lot of different systems that they had to maneuver. That added an extra piece of friction to the development process, because they were interfacing with systems. However, once they came to Fabricator, they were interfacing with the idea of a feature. That helped them think with a product mindset. That helped them move the needle a lot faster. Secondly, on the side of Fabricator, we made it really easy to do the right thing, all through the design, as we've described earlier. We focused a lot on simplifying the most prolific patterns. It's the 80/20 rule. Eighty percent of our use cases follow a similar pattern all across the board. If the users can do that really efficiently, really easily, then the longtail gets a lot easier to approach. We left room for customization for the longtail but made it really easy to do the right thing. However, there is a flip side to this coin, and that was the reliability of our framework. Originally, the framework was fairly reliable, in the sense that we were able to rely on the reliability of the internal systems that power Fabricator. However, growth takes over soon, and growth comes at the cost of robustness. We scaled our usage of those utilities 10x. Are they still really reliable at 500 pipelines? Do we need to add more pieces? Do we need to monitor the health of those systems? What about Chronosphere? Can it consume metrics at the scale at which we are growing our features? What about the feature store, are we cost efficient? These questions come up afterwards, if you have runaway adoption. We realized that reliability is a cornerstone that gets left behind when you grow too fast.

The Journey to Measurable Improvements for Fabricator

What did it take us to demonstrate the measurable improvements for this effort? What was that journey like?

It's taken us a while to get to this point. We started at a critical juncture where some pieces of our feature infrastructure were quite painful for the data scientists. This typically applied to problem spaces like embeddings-based features, because those are computationally intensive. The warehouse plus ETL structure that we used in the past was making users run these pipelines for about 10 to 12 hours every day, so each change took a day to propagate. The systems we had weren't effective for this. We started bottom-up: we first tried to solve that problem more efficiently and to identify what it would take to make all problems around embeddings easier. As we built that idea up, we arrived at what an ideal platform would look like. We had a working POC along the way, where we brought 10-hour pipelines down to about an hour or 50 minutes using Delta Lake plus Spark. Then we asked: once we do that, can we also make it faster for users to ship changes to their pipelines, without waiting a day? As we added each of those pieces, we were able to build some metrics around development and execution speed, just for embeddings. Once you can build a framework to solve that problem efficiently, you can add increments to it, because you now have measurable metrics for them. It took three to six months to fully reach a stable state with infrastructure automation. That's how we started making our case for this framework.

Development, and the Path to Adoption of the Fabricator Framework

I think we did have an inflection point. In terms of the path to adoption, we began with few adoptions, but powerful ones. What that meant is we made substantial changes for a very small subset of use cases. Once we got an entire team, or an entire set of use cases, onto the framework, and that team was convinced to try it for more use cases, that led to incremental features being developed. We never really reached a case where the ideal framework was built right off the bat. You build something that is great, but there are some things that are missing. Because users are encouraged to use it, they give us feedback. That's the pushback and feedback loop that we get from our data scientists. I think it took us about six months to fully polish the framework and hit the growth milestones we wanted to reach. After that, it became more about reliability and incremental features.

Deprecating, and Changing Features

Typically, we don't allow our users to deprecate a feature themselves. That is something we still keep the platform in the loop for. I know that's not the ideal automation state you want to be at. The reason is that a feature may be shared across multiple users. If someone chooses to deprecate a feature, you need some more robustness in place. Are there any production models, other than this user's models, that are using this feature? If we deprecate it, will we cause degradations? Some deprecations involve database deletions, which we try to be very cautious around. If you're deprecating a feature, you're also deleting its metadata and lifecycle. We do have deprecation processes, but they are manual. We wait for a feature's usage to stay at zero for about 30 days in production before we go ahead and clean up its existence in all our offline tables, online tables, metadata, and so on. There is a process, but it's not a trivial one. We handle these as bulk cleanups.
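
The manual check described above (no production reads for about 30 days, and no other team's models still depending on the feature) could be sketched roughly as follows. This is an illustration only, not Fabricator's implementation: the `FeatureUsage` record, the function names, and the idea of a flat usage log are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class FeatureUsage:
    """Hypothetical record: one production model read one feature on one day."""
    feature: str
    model: str
    day: date


def blocking_models(feature, usage_log, today, quiet_days=30):
    """Return the models that read `feature` within the last `quiet_days` days."""
    cutoff = today - timedelta(days=quiet_days)
    return {u.model for u in usage_log if u.feature == feature and u.day > cutoff}


def is_safe_to_clean_up(feature, usage_log, today, quiet_days=30):
    """A feature is a cleanup candidate only if no model read it recently."""
    return not blocking_models(feature, usage_log, today, quiet_days)


# Example with made-up feature and model names.
today = date(2023, 10, 10)
usage_log = [
    FeatureUsage("store_embedding_v1", "ranking_model", date(2023, 10, 1)),
    FeatureUsage("legacy_eta_feature", "eta_model", date(2023, 8, 1)),
]
for name in ("store_embedding_v1", "legacy_eta_feature"):
    print(name, is_safe_to_clean_up(name, usage_log, today))
```

Returning the set of blocking models, rather than only a boolean, makes it easier to see which shared consumers would be affected before any offline tables, online tables, or metadata are deleted.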

Questions and Answers

Luu: Looking back at the experience since the beginning, if you were to do anything differently based on the learnings, does anything jump out at you?

Shah: Yes, actually. There are one or two pieces whose design I would think about a little more carefully. One was our API for offline serving. We built a lot of robustness around creating features, but consuming these features and assembling them into datasets was something that we took on as an incremental step once we had a lot of adoption going into creating the features. There were some insufficiencies around how we treat the concept of time that we've been addressing more recently. Having a clear concept of what time means for features and these tables right off the bat in our original design would have made the system a lot easier for users to develop on now. Today, with 2000 features already on the pipelines, it's hard to pivot folks to a more consistent sense of time. That's one thing I would do differently if I could redesign this.
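
The talk does not spell out which notion of time was missing. One common way time enters dataset assembly is the point-in-time lookup: for each training label, use the latest feature value known at or before the label's timestamp, so that no future information leaks into the dataset. A minimal sketch of that idea, with a hypothetical `value_as_of` helper and made-up data, not Fabricator's offline serving API:

```python
from bisect import bisect_right


def value_as_of(history, as_of):
    """Return the latest value whose timestamp is <= `as_of`, or None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None


# Hypothetical feature history: (hour of computation, value).
history = [(1, 0.20), (5, 0.35), (9, 0.50)]

# Label events at hours 0, 4 and 9 each see only values known by then.
for label_time in (0, 4, 9):
    print(label_time, value_as_of(history, label_time))
```

If every feature table carries an explicit, consistently defined timestamp, a lookup like this can be applied uniformly when assembling datasets.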

Luu: I think all machine learning platforms need a similar kind of framework. In terms of what's out there, there are different companies and open source projects. Any thoughts or perspective from you on that?

Shah: We did draw a lot of inspiration from open source and enterprise solutions in building this. I think the biggest one for us was Feast, which has a declarative definition for a feature, although in code, that eventually gets served and so on. One key reason we pivoted away from many frameworks was that we had a highly scalable serving solution established in-house before we ventured into the creation space. The second was the ability to own the creation space as well. Many frameworks focus either on the metadata and serving pieces of a feature, or on the creation space for the feature. We wanted to build something holistic, covering creation, generation, and the control plane. We used them as inspiration to build this. In the end, some of the ideas are borrowed from successful frameworks out there, and we amalgamated them to build something better.


Recorded at:

Oct 10, 2023

Kunal Shah
