Declarative Machine Learning: a Flexible, Modular and Scalable Approach for Building Production ML Models



Shreya Rajpal discusses declarative ML systems, and how they address key issues that help shorten the time taken to bring ML models to production.


Shreya Rajpal is a Sr. ML Engineer and Domain Lead for ML Infrastructure at Predibase. Her work involves building scalable solutions for ML training and inference that improve the stability, robustness and effectiveness of ML model training. Previously, she'd worked on using state of the art ML models to solve problems in autonomous systems.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Rajpal: I'm going to be talking about declarative machine learning systems, a new paradigm for flexible, modular, and scalable machine learning development. My name is Shreya. I'm a Senior Machine Learning Engineer and the Domain Lead for Machine Learning Infrastructure at Predibase. Previously, I spent some time at Apple Special Projects, working on applied machine learning problems.

Enterprise ML Today

Let's start by taking a look at the state of the world for machine learning today. Machine learning development in big enterprises falls into one of these two styles. The first option is using low-level APIs to build custom solutions for your machine learning problem. This typically involves using frameworks such as TensorFlow, PyTorch, scikit-learn, to write code that implements your machine learning algorithm, preprocesses your data, and serves your model. The other alternative is using AutoML vendors. Typically, AutoML platforms refer to no-code platforms that help automate a lot of the workflows related to machine learning. I'm going to argue that these styles are very inefficient, and that there's a better middle ground called declarative machine learning. Declarative machine learning solves a lot of the problems that exist in these two styles. To understand declarative machine learning, we first need to understand what's missing from these two styles.

ML Frameworks Today - Low-Level APIs, and AutoML

First, let's take a look at low-level APIs. This slide shows the process of building a machine learning solution for your problem with low-level APIs. On a very high level, you can see that the parts of this process are wrangling and preprocessing raw data so that it can be used for machine learning, writing, training, and testing the model, and scaling it up and deploying it. As we can see in the diagram, this is a very time-consuming process, each step of the process itself is quite time consuming. It's also a very fragmented process, so each step requires different tools.

For example, to build your model, you would typically use something like PyTorch, or TensorFlow, and you would use something totally different to scale it up, for example, something like Ray. With this fragmentation, you also have different skill sets for different steps. For example, machine learning research skills or data science skills are needed to improve on model quality, for example, if you're improving your model accuracy. On the other hand, machine learning engineering skills are needed to serve the model efficiently. Finally, the boundaries between these steps are pretty artificial, which means that you often have to go back to the drawing board multiple times.

An example of what might happen as you're working on this is that when you start training the model, you realize that you actually have a very imbalanced dataset, in which case, you have to go back to the preprocessing step and preprocess it so that you maybe oversample or correct that imbalance. This style is something that I've seen a lot during my previous experience as a machine learning engineer. Generally, in industry, this is a pretty common way of building machine learning models. All of this to say that low-level APIs are slow and expensive.

On the other hand, AutoML tools simplify this whole process greatly. You can already see from the slide here that all of the previous steps of preprocessing data and model training are covered by AutoML tools. This is great for simple machine learning problems, where you want to train your model in one shot and be done with it. However, that's not typically how machine learning development works in practice.

In a typical AutoML tool, there's no scope for a practitioner to inject their own expertise, or to iterate on a model. In practice, what ends up happening with AutoML tools is something like this. You get a quick one-shot model, but maybe you want to iterate and get higher accuracy, or be able to predict with higher confidence. For now, for the purposes of this example, let's say that the way to achieve that higher accuracy is by adding regularization to your model, or by just training for longer.

As a user, there's no way for you to go back and change the result of AutoML, so that it uses the decision that you want to make about how to get a better model. Typically, what happens with AutoML tools is that you see a very high churn rate, because users don't feel like they have control of the process of building their own models.

I have a quote on this page. These are the words of an AutoML user, describing their experience with an AutoML tool. The quote was taken from the Whither AutoML paper, "Whither AutoML? Understanding the Role of Automation in Machine Learning Workflows." The paper is from Doris Xin and others from Berkeley. I highly recommend checking out the paper, if you haven't come across it yet. It basically describes the practical experience of using AutoML in your day-to-day work.

Generally, I really like this quote because it captures the essence of the issues with AutoML, which is that it really helps you get off the ground quickly, but it doesn't necessarily help you get to production quickly, primarily because there's no offramp from the solution that AutoML comes up with. If a user wants to use a different algorithm or a different architecture, that's just not possible.

Organizations Take an Inefficient Approach to ML

Let's zoom out a little bit. In the previous slides, we basically went over the process of building one machine learning solution for one project. If we zoom out, we can see what happens on an organization level where there isn't one single ML project. There are often many projects, and they have different use cases, different complexities, and different stakeholders. For example, project 1 here may be an intent classification task that uses TensorFlow, whereas project 2, which is fraud prediction, needs a stack that is better implemented with PyTorch, and so on for project 3 and project 4.

With low-level APIs in an organization, there's low reusability across projects and high technical debt. Often, there's very large teams that are needed to support model scaling and deployment. Very frankly, organizations can't hire machine learning experts fast enough. The problem becomes even more interesting when you see how AutoML works in an organization. Once again, AutoML may work well for a project and then may help you get off the ground quickly for that project.

Since it's a one-size-fits-all solution, if you have other projects that are more complex, AutoML may not end up working out. What you'd see is like for ML project 1 here, while you are able to use an AutoML vendor and get to production very quickly, you still need to use low-level APIs for ML project 2, and ML project 3. You often see AutoML tools being used simultaneously with low-level APIs within an organization, and this just adds to the tooling mess. From an organizational point of view, the problem is clear.

You want to share infrastructure across projects. Your users should feel empowered to change their models as needed, and use their own expertise. Generally, for machine learning projects, you should be able to get off the ground quickly and speed up the time to value for ML.

Declarative Machine Learning

Enter declarative machine learning systems. They're the middle ground that solves a lot of the problems that I mentioned in the previous slides. They offer a higher level of abstraction that provides flexibility, automation, and ease of use. Because of their simpler APIs, they open the door for many non-expert users to use the power of machine learning for their tasks. They were pioneered over the last few years by Ludwig and Overton, specifically in the deep learning age. What is a declarative machine learning system? In a declarative ML system, a user defines what they want to predict or classify or recommend, and the system figures out the how of doing it.

If that sounds familiar, then it is because they follow a very common pattern called separation of interests. Separation of interests is a common principle in building software systems. The general idea is that an advanced user is responsible for improving implementation of algorithms, infrastructure, while keeping the same interface. This advanced user typically works on a lower level of programming. Then, on the other side of that interface is a less expert user that uses the interface to achieve some task. This user doesn't necessarily care about the low-level implementation, and this user also typically works on a higher level of abstraction.

Declarative machine learning systems also follow this general principle. We've seen this principle before, multiple times. For example, in database management systems, a database admin manages the low-level implementation to optimize the indices, infrastructure, query planning parameters. Then on the other side of the interface, working on a higher level of abstraction is a data analyst who declares their needs in SQL without necessarily needing to know about the low-level infrastructure.

You see this in compilers as well, wherein a compiler developer writes low-level code to optimize the parser and the heuristics, and writes assembly code, whereas a programmer only uses a high-level programming language and doesn't necessarily care about assembly.

We believe that this pattern is coming to machine learning as well, and that the next generation of machine learning tools will have something like a data science researcher or machine learning expert who writes the low-level implementations of optimizing models and optimizing architectures, making sure that hyperparameter optimization and inference work well. Then a machine learning practitioner doesn't necessarily need to care about all of those low-level implementation details, and only needs to declare the schema of their task and the data that they want to work on, without necessarily writing any of the ML code themselves.

A simpler interface generally leads to higher adoption. Declarative ML provides that simpler interface for machine learning. Historically, we've seen that adoption of a tool is inversely proportional to its complexity. We've seen this in a bunch of different areas. For example, way more people write C or C++ code rather than people who write assembly. SQL has a much bigger user base than COBOL. Similarly for machine learning, the right interface should also be easy to use, but should be able to do everything that low-level APIs can do.

Adoption of Declarative Machine Learning

What does this adoption of declarative machine learning in industry look like? We've already seen a lot of big companies hop on this trend of declarative machine learning. For example, Overton is a system that was created at Apple. In terms of its adoption internally, it has been used to execute billions of queries to date. It's currently available only internally at Apple. Another such system is Looper, which was created at Meta. In terms of its adoption, just for the month of April 2022, it was used to host 700 AI models and run 4 million predictions per second. This tool is also only available internally at Meta. Then the final one is Ludwig, which was created at Uber AI.

It currently has 8500 GitHub stars and over 1000 forks. In terms of its availability, this is an open source project. I'm one of the contributors to the Ludwig open source project. Predibase, the company where I work, was founded by the creator of Ludwig. With Ludwig, our goal is to make the innovation of declarative machine learning easily accessible outside of just big tech companies. For the rest of this presentation, I'll be using Ludwig APIs to demonstrate different functionalities of declarative machine learning.

Model Development with Declarative ML

In the next section of my talk, I'm going to do a deep dive on how to build machine learning models using a declarative framework. For the deep dive, we're going to be building a text classification system. Our goal is to predict the class of a news article given its title and content. This class can be anything like business, science and technology, or sports. Here's a little snippet of the dataset that we're going to be using. The dataset is called AGNews. There are three columns in the dataset. The first column is title. This is a text column and it's basically the title of the news article. The second column is description. This is also a text column and this is basically the content of the news article.

The third column is class, which is a categorical column. This is basically the target that we're going to be predicting. I've highlighted that here to reflect that. For working with Ludwig, we need to start by creating a YAML configuration that describes our dataset. I've added a little snippet of this YAML configuration here. We basically use it to define what our input features are like and what our output features are like. You can see here that my input features are title and description, and my output feature is class.

We also add information about the type of data for each column. Since my title and description are text features, I add that little type annotation here. Then my output feature is a categorical feature, and so I add the type category. For the deep dive, I'm going to assume that model development only requires these four steps, data preprocessing, model building, model evaluation, and model serving. There are of course other parts of the process, for example, like scaling up your model.
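The slide with the config is not reproduced in this transcript, so here is a minimal sketch of what it might look like. It is written as a Python dictionary rather than YAML so it can be run with the standard library alone; the `input_features` / `output_features` layout and the `text` and `category` type names are assumed to match the Ludwig version in use, and `feature_types` is a hypothetical helper added only for illustration.

```python
import json

# Assumed shape of the config described in the talk: two text inputs and one
# categorical output. The same structure could be written as YAML.
config = {
    "input_features": [
        {"name": "title", "type": "text"},
        {"name": "description", "type": "text"},
    ],
    "output_features": [
        {"name": "class", "type": "category"},
    ],
}


def feature_types(cfg):
    """Hypothetical helper: map every declared column name to its type."""
    features = cfg["input_features"] + cfg["output_features"]
    return {feature["name"]: feature["type"] for feature in features}


print(json.dumps(config, indent=2))
print(feature_types(config))
```

Nothing in this sketch says how the text is encoded or how the model is trained; those choices are left to the system unless the config overrides them.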

Model Development - Data Preprocessing

First, starting with data preprocessing. The first step in data preprocessing is to connect your data wherever it may live. Within Ludwig, you can use both tabular data as well as multimodal data such as images, text. To preprocess the data, we can edit the config that we created earlier and specify what type of preprocessing we want to perform. Since our feature types are text and categorical, we need to think about what are the possible preprocessing steps we can do.

For text features, we may want to tokenize the text strings so that it can be used by a machine learning model. For our categorical columns, since this is our output feature type, if there are any missing rows in the data, we basically want to ignore those rows totally, and do that preprocessing on our output feature. Here's how we can do that in this config.

We basically edit the config to add this preprocessing information for each column. You can see for text, I've added two lines about tokenizing, and the type of tokenizer I want to use, which is here space and punctuation. Then for the output feature column, I've added the preprocessing for handling missing values, which is basically the drop rows. That's it actually. There's no need for writing big Spark jobs that apply data preprocessing to your dataset. The same config with the same interface will scale up preprocessing to very large clusters and dataset sizes.
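As a rough sketch of the edited config, assuming the preprocessing keys are named `tokenizer` and `missing_value_strategy` and take the values `space_punct` and `drop_row` (key names have changed between Ludwig releases, so check the version in use). The two helper functions are illustrative stand-ins for what those settings mean, not Ludwig's implementation.

```python
import re

# Assumed key names and values; verify against the Ludwig version in use.
config = {
    "input_features": [
        {"name": "title", "type": "text",
         "preprocessing": {"tokenizer": "space_punct"}},
        {"name": "description", "type": "text",
         "preprocessing": {"tokenizer": "space_punct"}},
    ],
    "output_features": [
        {"name": "class", "type": "category",
         "preprocessing": {"missing_value_strategy": "drop_row"}},
    ],
}


def space_punct_tokenize(text):
    """Stand-in tokenizer: split into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def drop_rows_missing(rows, column):
    """Stand-in for dropping rows: keep only rows where `column` has a value."""
    return [row for row in rows if row.get(column) not in (None, "")]


# Made-up rows, for illustration only.
rows = [
    {"title": "Oil prices rise, again.", "class": "business"},
    {"title": "Untitled wire story", "class": None},
]
kept = drop_rows_missing(rows, "class")
print(len(kept), space_punct_tokenize(kept[0]["title"]))
```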

Model Development - Model Building

Now let's move on to model building. Since our input data is text, we're going to use an LSTM to encode the input data. An LSTM is a very simple architecture for encoding text strings. I'm going to give a quick code walkthrough about doing this without using your declarative machine learning framework first. For this part of the code, we're going to use PyTorch. PyTorch is awesome, because it provides a lot of building blocks just out of the box.

Here, what I've highlighted is the LSTM block that comes natively with PyTorch. However, we still need to write some code to show exactly how this LSTM block should train on our input data. Here, what I have is a class that does exactly that, where you pass in your input data and it basically loops through it. You still have to write the actual training loop and not just the class. The training loop is where you use the model you'd written in the previous snippet, to train on each batch of your dataset. You don't need to zoom into the code here, but here's a little code snippet to do that.

There's still more stuff to do. You have to write your own utilities to save checkpoints and to load checkpoints, to save metrics and load metrics. Before you can do any of this, you have to find a way to load the data that you preprocessed and feed it into your training loop. If you remember the PyTorch LSTM building block that we started out with, which is like a really good starting point, even with that, there's a lot of code that we still need to write. This is for one torch model on a single machine and we're not even using GPUs yet. Now let's clean our slate and look at what writing this LSTM model looks like with Ludwig.

Let's go back to our old config that we use for preprocessing, and we can add two lines of code to specify what model type we want to use. In our case, the LSTM. Compared to before, it's two additional lines for each feature column that just say, use an LSTM for encoding this input feature. Then to actually start training this model, you can use this CLI command that triggers your training job. That's it actually. All of the preprocessing that we'd specified earlier, that will also be executed by the same command.
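A sketch of the encoder lines and the training command might look like the following. It assumes the encoder is declared as a nested block with `type: rnn` and `cell_type: lstm`; older Ludwig releases used flat keys on the feature instead, which would match the "two lines" mentioned above. The file names `config.yaml` and `agnews.csv` are hypothetical.

```python
import shlex

# Assumption: an LSTM is requested as the "rnn" encoder with an "lstm" cell.
lstm_encoder = {"type": "rnn", "cell_type": "lstm"}

config = {
    "input_features": [
        {"name": "title", "type": "text", "encoder": dict(lstm_encoder)},
        {"name": "description", "type": "text", "encoder": dict(lstm_encoder)},
    ],
    "output_features": [{"name": "class", "type": "category"}],
}

# Hypothetical file names: the config above saved as YAML, plus the dataset.
train_command = [
    "ludwig", "train",
    "--config", "config.yaml",
    "--dataset", "agnews.csv",
]
print(shlex.join(train_command))
```

The script only prints the command rather than running it, so it works without Ludwig installed.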

Model Development - Model Evaluation

After you're done with all of your model training, you may want to analyze what the model performance was like. Here's a small snippet to do that in Python. It's not too bad. It's not that many lines of code. What does that look like in a declarative world? In declarative machine learning, and in Ludwig, you basically get this out of the box with one command. Up until now, we've managed to replicate the entire functionality of the large PyTorch code with one small config.
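The Python evaluation snippet from the slide is not in this transcript. As a stand-in, here is a minimal sketch of the kind of hand-written evaluation code it refers to, using plain Python rather than any particular library. The function names and the sample labels are made up for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted class matches the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its examples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up labels, for illustration only.
y_true = ["sports", "business", "sports", "sci_tech"]
y_pred = ["sports", "sports", "sports", "sci_tech"]

print(accuracy(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
```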

In fact, we've managed to do a lot more than what that code did. We are able to support scalable preprocessing that we'd done before we even started model training. Any parameter that you want to change in your PyTorch model, you can also do that via Ludwig. Generally, declarative machine learning systems give you that flexibility and control, the same that you'd get with low-level APIs without any of the pain of having to do the low-level coding yourself. Even with the level of flexibility that we've shown in this small example, we've already exceeded what is offered to you via AutoML tools, and there's still more to come.

Model Development - Iterating on Model

All of our model development up until now has been very linear. Let's say that after analyzing my model, I realize that my performance is just not good enough, and my model is underfitting on my data. Typically, what I would do in this case is use a bigger, more powerful model. For example, for my next iteration on the dataset, I might want to use something like BERT, which is a much more sophisticated transformer based neural network instead of using the LSTM that I'd used before.

All I need to do in my config is to swap out a few lines. This is the config that we'd started with, and this is the new config that I'm going to be training with. The only change I needed to make was swap out the type of encoder. Instead of requesting an LSTM, I requested a BERT model. A well-designed declarative ML system should take care of all of the rest of the plumbing. It'll use the correct model architecture for BERT. It will make sure that your data is prepared so that it can be consumed by the new encoder. Make sure that the output of the new encoder can be used downstream without any issues, and all of the other things will be taken care of.
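To make the swap concrete, here is a small sketch that derives the new config from the old one by replacing only the encoder on each text feature. The nested `encoder` block and the `bert` type name are assumptions about the Ludwig version in use, and `with_text_encoder` is a hypothetical helper, not part of Ludwig.

```python
import copy

lstm_config = {
    "input_features": [
        {"name": "title", "type": "text",
         "encoder": {"type": "rnn", "cell_type": "lstm"}},
        {"name": "description", "type": "text",
         "encoder": {"type": "rnn", "cell_type": "lstm"}},
    ],
    "output_features": [{"name": "class", "type": "category"}],
}


def with_text_encoder(cfg, encoder):
    """Return a copy of cfg in which every text input uses `encoder`."""
    new_cfg = copy.deepcopy(cfg)
    for feature in new_cfg["input_features"]:
        if feature["type"] == "text":
            feature["encoder"] = dict(encoder)
    return new_cfg


bert_config = with_text_encoder(lstm_config, {"type": "bert"})
print([f["encoder"] for f in bert_config["input_features"]])
```

Everything outside the encoder block, including the output feature, is carried over unchanged, which is the point being made here.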

You can change this config to be as detailed as you'd like it to be. For example, if you don't want to use the standard BERT implementation, and you want to use more layers, you can do that by just adding a parameter. I've highlighted that here where num_hidden_layers can be increased to 16. This is a lot of functionality that you can do. You might want to do even more still, for example, you may need to train for longer now that you have a bigger model.

You can specify the number of epochs that you want to train for by just adding this configuration that I've added here. Now that you are also training for longer, you might want to add regularization to make sure that your model doesn't overfit. You can also specify that in the config very easily.
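Putting these three changes together, a sketch of the more detailed config could look like this. Only `num_hidden_layers: 16` comes from the talk; the `trainer` key names (`epochs`, `regularization_type`, `regularization_lambda`) are assumed to match the Ludwig version in use, and the epoch count and regularization strength are made-up values.

```python
import json

# num_hidden_layers = 16 is from the talk; other values are placeholders.
bert_encoder = {"type": "bert", "num_hidden_layers": 16}

config = {
    "input_features": [
        {"name": "title", "type": "text", "encoder": dict(bert_encoder)},
        {"name": "description", "type": "text", "encoder": dict(bert_encoder)},
    ],
    "output_features": [{"name": "class", "type": "category"}],
    "trainer": {
        "epochs": 20,                   # train for longer (made-up value)
        "regularization_type": "l2",    # assumed key name
        "regularization_lambda": 0.01,  # assumed key name, made-up value
    },
}

print(json.dumps(config["trainer"], indent=2))
```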

Model Development - Model Serving

Finally, when you're finished training the model, you can serve it easily with a single command. I've added this example of a command here, ludwig serve with a path to your model. This will spawn a REST API that you can use to get predictions. An example of an endpoint that is offered by this REST API is the predict endpoint. Calling the predict endpoint will run the model on one data point.

There's an example in the slides of how to do this, where you specify each field. In Predibase, there are actually multiple formats that a model can be exported to, including TorchScript. TorchScript then allows your model to be deployed in a very high-performance environment.
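As a sketch of calling the predict endpoint, the snippet below builds (but does not send) a request with one form field per input feature. It assumes the server was started with something like `ludwig serve --model_path <path>` and is listening on `localhost:8000`, and that the endpoint accepts form-encoded fields; all three are assumptions to verify against the Ludwig version in use. `build_predict_request` and the example article are made up.

```python
from urllib import parse, request

# Assumed address of a running `ludwig serve` process.
BASE_URL = "http://localhost:8000"


def build_predict_request(fields, base_url=BASE_URL):
    """Build a POST request to /predict with one form field per input feature."""
    body = parse.urlencode(fields).encode("utf-8")
    return request.Request(f"{base_url}/predict", data=body, method="POST")


req = build_predict_request({
    "title": "Oil prices rise again",
    "description": "Crude futures climbed for a third day.",
})
print(req.get_method(), req.full_url)

# With a server running, the request could be sent with:
#   response_body = request.urlopen(req).read()
```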

Ludwig Open Source Declarative Deep Learning

Let me summarize the key points that we've learned from the model building deep dive. First, declarative machine learning is very easy to get started with. We did a lot of the preprocessing earlier. Even if we'd skipped all of that preprocessing specification that we'd done at the start, you'd still be able to get off the ground. All you have to do is list out the input and output features of your model, and you're good to go.

Second, it's very easy to have fine-grained control, similar to what you'd be able to do with low-level APIs. This makes it very easy to iterate during model development. It doesn't take any control away from the user. Finally, there's a lot of advanced functionality that you get just straight out of the box. For example, you can do hyperparameter optimization super easily by just adding a few lines in your configuration. You can scale up your training to very large clusters. You can train on GPUs, so on and so forth.
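For a sense of what "a few lines" of hyperparameter optimization config might be, here is a sketch. The `hyperopt` key names (`goal`, `metric`, `output_feature`, `parameters`, `executor`) and the `loguniform` / `choice` search spaces are assumed to match the Ludwig version in use, and the ranges are made up. `sample_trial` is a plain-Python stand-in showing what one trial drawn from that space looks like; it is not how Ludwig or Ray Tune samples.

```python
import math
import random

# Assumed hyperopt section; ranges and sample count are placeholders.
hyperopt = {
    "goal": "maximize",
    "metric": "accuracy",
    "output_feature": "class",
    "parameters": {
        "trainer.learning_rate": {
            "space": "loguniform", "lower": 1e-5, "upper": 1e-2,
        },
        "trainer.batch_size": {
            "space": "choice", "categories": [32, 64, 128],
        },
    },
    "executor": {"type": "ray", "num_samples": 8},
}


def sample_trial(parameters, rng):
    """Stand-in sampler: draw one value for every searched parameter."""
    trial = {}
    for name, spec in parameters.items():
        if spec["space"] == "loguniform":
            low, high = math.log(spec["lower"]), math.log(spec["upper"])
            trial[name] = math.exp(rng.uniform(low, high))
        elif spec["space"] == "choice":
            trial[name] = rng.choice(spec["categories"])
    return trial


trial = sample_trial(hyperopt["parameters"], random.Random(0))
print(trial)
```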

Different Interfaces for Different Stakeholders

Using the config, there are many different ways to build an interface for declarative machine learning. What I showed you throughout the deep dive was the CLI. You can also use a Python SDK, where you call the same train, evaluate, and serve commands on your data but from a Python script. You can use a GUI. Then, you can also use something like a query language. I've included an example of that here.

The query language really opens up machine learning development to a whole new type of data analyst users. At Predibase, we've built out what we call PQL, or the Predictive Query Language, in which you can create models and run predictive queries in a very familiar SQL-like language.

Model Development (Summary)

In summary, how does declarative machine learning help your data science team? For starters, it makes machine learning development modular and reusable. Instead of spending all of their time writing code, our data science teams can easily plug and play with different architectures by just changing the config. It's like swapping out Lego blocks. All of their model development becomes reusable, and they're able to reuse components from one ML project to another.

Second, there's no model-itis. Model-itis is when you're always trying to keep up with the latest and greatest model. Right now, in machine learning, there are like 20 new machine learning papers published every week, and all of them promise state-of-the-art performance. All of them are pretty great. As a practitioner, building and maintaining custom models at that frequency in production becomes very hard.

Declarative machine learning makes a model a commodity, so that instead of figuring out how you should be building a model, you can just make a decision about whether you want to use it or not. Then a separate entity, like Ludwig in this case, provides standardized implementations that you as a data scientist can use.

Declarative machine learning unblocks version control for machine learning. Generally, why is version control for ML hard in the first place? It is because ML metric performance is not just related to code changes, but changes in how you configure model training and the parameters of that configuration. Bugs aren't introduced just by code. They could also be introduced by our config.

Typically, like both of these things, the code and the config are tracked separately. You'd have like Git, for example, for code, and a model registry for configs, or parameters. With declarative machine learning, both code and parameter changes are captured in one place by the config. You can easily do a diff on the configuration and figure out why performance changed.
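The "diff on the configuration" idea can be sketched with the standard library alone: serialize two configs in a stable order and compare them line by line. The two configs below are made-up examples, and `config_diff` is a hypothetical helper; in practice the configs would be YAML files under version control and a plain `git diff` would serve the same purpose.

```python
import difflib
import json


def config_diff(old, new):
    """Return only the added and removed lines between two configs."""
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    diff = difflib.unified_diff(old_lines, new_lines, "old", "new", lineterm="")
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Made-up configs: the second swaps the encoder and trains for longer.
old_config = {
    "input_features": [{"name": "title", "type": "text",
                        "encoder": {"type": "rnn", "cell_type": "lstm"}}],
    "output_features": [{"name": "class", "type": "category"}],
    "trainer": {"epochs": 5},
}
new_config = {
    "input_features": [{"name": "title", "type": "text",
                        "encoder": {"type": "bert"}}],
    "output_features": [{"name": "class", "type": "category"}],
    "trainer": {"epochs": 20},
}

for line in config_diff(old_config, new_config):
    print(line)
```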

Overall, this leads to higher reproducibility, since the same config on a dataset will result in the same model performance. Because you aren't implementing the machine learning code yourself, you have a high-quality standard implementation to work off of, and it makes debugging machine learning easier. What debugging typically looks like is that if you see poor performance, you know that it's not because of any code bugs or any machine learning bugs, and so the only possible fix is going to lie in your configuration. If you want to improve your performance, you can update your config to get better results.

Finally, this overall results in the separation of interests between different stakeholders. With a declarative interface, you're able to separate out ownership of model metrics so that each stakeholder is able to focus on how they can deliver value fastest through their skill set.

Scaling with Declarative ML (Ludwig)

In this next section, I'm going to take a quick peek under the hood at Ludwig's scalable backend infrastructure. Ludwig offers Ray as a backend engine. Here's a quick overview of how Ludwig uses Ray internally. Multiple parts of the preprocessing, training, and evaluation process are on the Ray ecosystem. We use Dask-on-Ray for preprocessing, Horovod on Ray for training, and Dask-on-Ray for model evaluation.

In practice, this means that you can independently scale up or scale down the compute required for preprocessing, training. A concrete example of this is that you can have a large CPU-only cluster for preprocessing that you scale down after preprocessing is completed. Then a large GPU cluster for training jobs only.

Then, once again, you can do all of this very easily by just changing the config. On the left, I have an example of our old config that we use for the training and the preprocessing and everything. On the right, I have a new config that now distributes training on a GPU with four workers. This is all done by these few lines that I've added, that I've highlighted here.
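The highlighted lines are not in this transcript, so the following is a sketch of what they might be: a `backend` section that selects Ray and asks for four GPU workers for training. The key names (`type`, `trainer`, `use_gpu`, `num_workers`) are assumed to match the Ludwig version in use, and `with_ray_backend` is a hypothetical helper.

```python
import copy

base_config = {
    "input_features": [
        {"name": "title", "type": "text"},
        {"name": "description", "type": "text"},
    ],
    "output_features": [{"name": "class", "type": "category"}],
}


def with_ray_backend(cfg, num_workers, use_gpu):
    """Return a copy of cfg that requests distributed training on Ray."""
    new_cfg = copy.deepcopy(cfg)
    new_cfg["backend"] = {
        "type": "ray",  # assumed key names throughout this section
        "trainer": {"use_gpu": use_gpu, "num_workers": num_workers},
    }
    return new_cfg


distributed_config = with_ray_backend(base_config, num_workers=4, use_gpu=True)
print(distributed_config["backend"])
```

The feature definitions are untouched; only the added `backend` section differs between the single-machine and distributed versions.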

Ludwig also provides Ray Tune as an abstraction for large scale Hyperopt jobs. Each trial can be independently scaled up using the same Ray ecosystem of Dask-on-Ray for preprocessing, Horovod on Ray for training, and Dask-on-Ray for evaluation. What this means is that you can do distributed training per trial, which is pretty cool. Finally, we also use Ray datasets for preprocessing and loading our data.

This is a nice example of how Ludwig solves a lot of the tricky optimization problems to get the best model performance on your task. While Ray is very powerful and flexible, often there are a lot of challenges as you're trying to get the optimal performance on your task. In Ludwig, we try to hide all of that from you. When we introduced Ray Datasets, without any changes to the user-facing interface, we swapped out the previous data loader for Ray Datasets, and now it acts as a bridge between Dask-on-Ray and Horovod.

This is also a great example of how the declarative interface hides complexity from the end user. We're able to greatly speed up data loading while keeping the same interface for the user.

Declarative ML In Your Organization

How can declarative ML help you inside your organization? Declarative ML is useful in different ways to different stakeholders. In this slide, I'll go through some of the common types of stakeholders and how they can benefit from this paradigm. First, the domain expert. This is someone who is not an ML practitioner, but is an expert on the specific topic or dataset. For example, this may be someone who is in charge of collecting data and labeling a dataset, but may not necessarily have ML expertise.

For the domain expert, declarative ML allows them to build and deploy machine learning models without being blocked by the data science team. Next is the data analyst. This is someone who typically queries the data to draw insights and analyze trends about the data. For them, declarative machine learning, especially with a query language interface, allows them to run a predictive query without being blocked by a data science team. They can now run SQL-like queries to train models and run predictions, like predict housing prices given data, for example. Then there's the data scientist, who is an expert machine learning practitioner.

Declarative machine learning supercharges data science users in their everyday job. Again, they have reusable components. They don't have to spend all of their time writing code, so they're able to just pull these components from across their projects. They have a faster time to production overall. They get access to state of the art machine learning without sacrificing any flexibility or explainability. For an ML engineer, declarative machine learning enables them to scale up, serve, and deploy models very easily.

Finally, for a business leader who wants to use machine learning in their organization, declarative machine learning provides a much faster time to value for machine learning projects.

Wrap-up, and Resources

The future of machine learning development is not going to be manual and it is not going to be fully automated, but instead the future of ML development will be human in the loop. To learn more about declarative machine learning and Ludwig, check out our website, go to our GitHub, and join our Slack.

Ludwig In Comparison to Other High-Level Libraries

How does this compare with other high-level libraries such as fastai, which are similarly easy to use, and what are the differences?

My impression and my understanding of fastai, is I would put it in a similar category as PyTorch, in terms of the flexibility and the utility that it provides. You are, for example, able to use some building blocks out of the box. You can use specific encoders like ResNet. At that level of API, you aren't able to customize it as much as you want. On the other hand, it also seems like you are able to develop your own fully custom encoders.

Similar to PyTorch, you would create an encoder, where you would define your own forward pass, and you would define how you do backprop. You can do similar functionality with fastai. I think it falls into that consistent pattern that we've seen in other libraries, where it's either super high level and doesn't afford you a lot of flexibility, or, if you want flexibility, you have to go really deep down and write your own code for, "This is what I want my model to do. This is how my training job is going to go."

I think the thing that's really missing from this landscape and that Ludwig provides is this middle ground where you still have full functionality. Compared to something like fastai, you really are able to drill down into how you define a model, and customize and specify anything you want about it, without having to actually write any of your own code.

Your own code could have bugs. As we all know, machine learning bugs can be very insidious and very hard to debug. You can still get accuracy that looks fairly good, and you're like, maybe that's all the signal there is in my data, but you won't be able to get the optimal performance. That is one point of difference that I would highlight.

Then the second one is related to the UI essentially. The thing that I really like about Ludwig is the configuration-based system. People have different preferences with respect to using Python, and you can use the same API with a Python configuration instead of a YAML. I really like the idea of using YAML generally, because I think in a production system, you are able to track changes, or just track the lineage of models, the same way as you would for code.

ML has both config changes and code changes, and having the config makes it really easy to understand what the differences between models are. I think with the example of BERT, you can easily do a diff between a pre-BERT config and a post-BERT config and then just quickly figure out what the issue is. If we were to do this with code, you'd have a class for BERT living in the same place as a class for an LSTM.

Then you'd use a different configuration for BERT that configures that class, and that makes it overall much harder to track what you attribute performance differences to. The config is something that I would consider a strong plus for Ludwig.
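To make the config-diff point concrete, here is a minimal sketch using Python's standard `difflib`. The two YAML snippets are illustrative assumptions that loosely follow a Ludwig-style `input_features` / `output_features` layout; the feature names and encoder values are hypothetical and were not shown in the talk.

```python
import difflib

# Hypothetical config before the change: the text feature uses a recurrent encoder.
PRE_BERT = """\
input_features:
  - name: review_text
    type: text
    encoder: rnn
output_features:
  - name: sentiment
    type: category
"""

# Hypothetical config after the change: only the encoder line differs.
POST_BERT = """\
input_features:
  - name: review_text
    type: text
    encoder: bert
output_features:
  - name: sentiment
    type: category
"""

def config_diff(before, after):
    """Return only the added and removed lines of a unified diff between two configs."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="pre_bert.yaml", tofile="post_bert.yaml", lineterm="",
    )
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]

changed = config_diff(PRE_BERT, POST_BERT)
print("\n".join(changed))
```

The whole model change shows up as one removed line and one added line, which is the kind of lineage that is hard to read off when the same change is spread across classes and the code that configures them.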

Automated Test Cases in ML

This is mostly a question around testing machine learning. The same principles that you would apply for general CI/CD for ML are also applicable for Ludwig. CI/CD for ML is generally not as straightforward as it is for software. Some of the tests that Ludwig internally has are basically benchmark performance tests. You have one dataset, and you set your random seed so that the data is loaded the same way and all of the random weights in the model are initialized the same way.

You try to account for as much randomness as possible. Then you train a model with the same configuration on that data, and hope to see the same performance, for example the same loss, time after time as you train. That is one test that Ludwig internally has. Another set of tests is that, as you train the model, you make sure that the weights of the model update as you expect.

It basically comes down to ML bugs being very hard to debug, so you make sure that your model is at least training and the weights are changing. The issue with all of these is, your model training might still work, but there would be some small differences in training performance each time. You need to have thresholds for what is acceptable performance, because you aren't going to hit the exact same numbers between different training runs.
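The three checks described here (a seeded, repeatable run; weights that actually update; a threshold instead of an exact number) can be sketched in plain Python. This is a toy illustration under stated assumptions, not Ludwig's test suite: the tiny linear model, the function name `train_linear_model`, and the threshold value are all hypothetical.

```python
import random

def train_linear_model(seed, steps=500, lr=0.1):
    """Fit y = w*x + b with SGD on synthetic data, drawing all randomness from `seed`."""
    rng = random.Random(seed)  # one seeded generator controls data, init, and sampling order
    # Synthetic dataset: y = 3x + 1 plus a little noise.
    data = []
    for _ in range(64):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.05)))
    w, b = rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)  # random weight initialization
    initial_weights = (w, b)
    for _ in range(steps):
        x, y = rng.choice(data)
        error = (w * x + b) - y
        w -= lr * error * x
        b -= lr * error
    loss = sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)
    return initial_weights, (w, b), loss

first = train_linear_model(seed=42)
second = train_linear_model(seed=42)
other = train_linear_model(seed=7)

# 1. Same seed and same settings: the run should be repeatable.
assert first == second
# 2. Training actually happened: the weights moved away from initialization.
assert first[0] != first[1]
# 3. Across different seeds, compare against a threshold rather than an exact number.
LOSS_THRESHOLD = 0.05  # hypothetical acceptable loss for this toy problem
assert first[2] < LOSS_THRESHOLD and other[2] < LOSS_THRESHOLD
```

In a real framework, exact repeatability is harder to guarantee (GPU kernels, data loader workers), which is why the threshold check tends to be the one that survives in practice.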

Another example that I want to share here is that I work at a company called Predibase, which is building a managed platform on top of Ludwig. We use Ludwig as our core engine to power some of our production systems. Within Predibase, we have tests around Ludwig for performance and robustness, like training regression tests, where we set the same config values. We know what numbers we expect, and we know what is an acceptable threshold.

We also have numbers in terms of how much time it takes to train a certain model, and some threshold around that. Then we add tests to make sure that we have roughly the same performance each time.

Luu: I imagine, if you were to use fastai to develop your model in code, you can put that in a function, and then you run that. Then you can change your test cases, maybe after the model is trained, for example. In a declarative world where you only have a YAML file, where would your test cases live?

Rajpal: Very tactically, for example, in the Ludwig repository, which is open source on GitHub, we basically have four benchmark tests. We define the YAML, and then we train the model from scratch each time. It's a test that we don't run on every commit or merge. It runs at a fixed cadence. It could be a nightly test or a weekly test or something, to make sure that there aren't performance regressions. For each test, we have the model configuration in a YAML that is specified in a folder. Then we have expected performance numbers that we compare against to check that things behave as expected.
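One way to picture this setup is a small table of expectations, one per benchmark config, checked after each scheduled training run. The sketch below is an assumption about the shape of such a check, not the code in the Ludwig repository; the class name, the file path, and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkExpectation:
    config_path: str           # hypothetical location of the model's YAML config
    expected_accuracy: float   # number recorded from a known-good run
    accuracy_tolerance: float  # how far below that number is still accepted
    max_train_seconds: float   # upper bound on training time

def check_benchmark(expectation, accuracy, train_seconds):
    """Compare one training run against its expectation; return a list of failure messages."""
    failures = []
    floor = expectation.expected_accuracy - expectation.accuracy_tolerance
    if accuracy < floor:
        failures.append(f"accuracy {accuracy:.3f} is below the accepted floor {floor:.3f}")
    if train_seconds > expectation.max_train_seconds:
        failures.append(
            f"training took {train_seconds:.0f}s, over the "
            f"{expectation.max_train_seconds:.0f}s budget"
        )
    return failures

# Hypothetical entry. In practice, `accuracy` and `train_seconds` would come from
# training the model described by `config_path` from scratch on a nightly or weekly run.
expectation = BenchmarkExpectation("benchmarks/sentiment.yaml", 0.90, 0.02, 600.0)
healthy = check_benchmark(expectation, accuracy=0.89, train_seconds=540.0)
regressed = check_benchmark(expectation, accuracy=0.85, train_seconds=700.0)
print(healthy)
print(regressed)
```

Keeping the expectation next to the config path means the test case lives alongside the YAML rather than inside model code, which is the answer to where tests go in a declarative setup.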

Model Evaluation in Ludwig

I think that this has two parts. The first is, how would you do model evaluation within Ludwig? The second is, since you have to do model evaluation and you have to understand exactly what the model is doing, are domain experts truly independent?

Essentially, for model evaluation, we basically have a very large set of metrics that we compute at each training run. For each epoch or for each batch, there's a bunch of metrics that we compute over time, and then we surface them very transparently to the end user. At the end of training, you can also optionally create any evaluation graphs that you would need to create. In the slide deck, I'd given an example of a confusion matrix.

All of the plots that you need to figure out if your model performance is actually any good come out of the box with Ludwig. We also provide third-party integrations with Weights & Biases, Comet, and AimStack, any of the experiment management platforms that machine learning practitioners typically use to evaluate their models and figure out if their performance is satisfactory.
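For readers who have not worked with one, a confusion matrix is just a count of (true label, predicted label) pairs. The following is a minimal standalone sketch of that computation with made-up labels; it is not how Ludwig computes or plots it.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, both in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(true, pred)] for pred in labels] for true in labels]

# Made-up evaluation data for a two-class problem.
labels = ["car", "truck"]
y_true = ["car", "car", "truck", "truck", "truck"]
y_pred = ["car", "truck", "truck", "car", "truck"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The off-diagonal cells are the errors, so a row with large off-diagonal counts points at the class the model struggles with.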

Domain Experts' Independence

Then the second one is that, since domain experts would still need that deep understanding to figure out what to do next and how to iterate on their model, what is the marginal value that Ludwig provides? I am a machine learning practitioner, and I build ML models, and so how I think about making decisions is that I analyze performance with any of the tools that I just listed, and I would maybe do slice-based analysis.

I would see what the worst performing slices of my data are. I would figure out next steps based on that. As an example, my background was in self-driving. At the startup where I worked, we saw that self-driving car performance for trucks would be particularly bad. I figured out, ok, for trucks maybe I need to oversample the examples that have trucks in them, and change my data that way. As a data scientist, I mostly think on the level of decisions, and the bottleneck that I would find in my workflow is that I make these high-level decisions, and then it's pretty boilerplate to actually go down and implement them.
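The slice-based workflow described here (find the worst slice, then oversample it) can be sketched as follows. This is a toy illustration with made-up data; the function names `accuracy_by_slice` and `oversample` and the oversampling factor are hypothetical, and real pipelines would operate on dataframes or dataset objects instead of tuples.

```python
import random
from collections import defaultdict

def accuracy_by_slice(records):
    """records: iterable of (slice_name, is_correct) pairs, one per evaluated example."""
    totals, correct = defaultdict(int), defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}

def oversample(examples, slice_of, target_slice, factor, seed=0):
    """Return a shuffled copy of `examples` where `target_slice` appears `factor` times as often."""
    rng = random.Random(seed)
    targets = [ex for ex in examples if slice_of(ex) == target_slice]
    extra = rng.choices(targets, k=len(targets) * (factor - 1))  # sample duplicates with replacement
    resampled = list(examples) + extra
    rng.shuffle(resampled)
    return resampled

# Made-up evaluation results: cars are mostly right, trucks are not.
results = [("car", True)] * 9 + [("car", False)] + [("truck", True)] * 2 + [("truck", False)] * 2
per_slice = accuracy_by_slice(results)
worst_slice = min(per_slice, key=per_slice.get)

# Made-up training set of (slice, example_id) pairs.
train_set = [("car", i) for i in range(10)] + [("truck", i) for i in range(4)]
resampled = oversample(train_set, slice_of=lambda ex: ex[0], target_slice=worst_slice, factor=3)
truck_count = sum(1 for ex in resampled if ex[0] == "truck")
print(worst_slice, per_slice, truck_count)
```

The decision itself is one line ("oversample trucks by 3x"); everything else is the boilerplate that a declarative system aims to take off the practitioner's plate.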

There's only a specific set of patterns of remedial actions that you take. Then, as a data scientist, you have to go and basically spend all of your day writing code. Whereas the actual special insight that you have is at a very high level, like, ok, maybe I want to oversample my data, or maybe I want to use a bigger model, or use more regularization. With Ludwig, data scientists still have that domain expertise, and are still able to figure out, this is maybe why my model is performing poorly, and this is what I should do next.

They're able to just make that change quickly in a config, instead of having to maybe spend the next half a day implementing code that may be super error prone, debugging the code, and adding tests. You get well tested code out of the box for all of the remedial actions or patterns that you would typically take.



Recorded at:

Nov 07, 2023