
Facilitating the Spread of Knowledge and Innovation in Professional Software Development


Ludwig: A Code-Free Deep Learning Toolbox


Summary

Piero Molino introduces Ludwig, a deep learning toolbox that allows users to train models and use them for prediction without the need to write code.

Bio

Piero Molino is a Senior Research Scientist at Uber AI with a focus on machine learning for language and dialogue. He worked at Yahoo Labs in Barcelona on learning to rank and at IBM Watson in New York on NLP with deep learning, and then joined Geometric Intelligence, where he worked on grounded language understanding.

About the conference

QCon.ai is a practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.

Transcript

Molino: I'm Piero from Uber AI, and I'm going to tell you about a project called Ludwig, which is an open-source toolbox for deep learning that doesn't require writing code. It is based on TensorFlow, and it's really good for experimentation; you will see why during the presentation. The fact that no code is required doesn't mean you cannot code: you can also use it as an API, and you can extend it through code.

These are the main features, the main design principles we baked into it when we were developing it. The first is that we tried to make it general, to the extent that it can be applied to many different use cases. In particular, the data type abstraction, which I will explain in detail, is the main new idea that allows it to be really general. It's also really flexible, so both expert and novice users can find value in it: a novice can use it really easily, and experts can fine-tune every little detail of their models. It's extensible, so if there's something Ludwig doesn't do, it's really easy to add an additional model or an additional type of data.

From the beginning, we also decided to bake in some visualization facilities, because we want to have some understanding of what's going on with our models. Not really within the models, because all the models we're going to see are deep learning models, and understanding their inner workings can be tricky; but at least at the level of predictions and measures, we want to understand the performance of the models and their predictions. It's really easy to use, and it's open source under the Apache 2 license, so go on and use it. Feel free to use it, and if you want, feel free to contribute back.

Simple Example

Let's start with a simple example that gives an overall idea of how it works, and then we will go deeper into the details. Let's assume we have a dataset with two columns: one text column and one column that contains a class; this could be any generic text classification dataset, really. In order to train a model with Ludwig, all you have to do is run the ludwig experiment command with --data_csv pointing to the CSV file containing the data I showed you before, and a model definition. We're going to see a lot of these model definitions, but as an initial description, you specify what your inputs are, what your outputs are, and what their types are. Once you do that, Ludwig takes the dataset and splits it into training, validation, and test sets; you can specify how to do that, or if you already have splits, you can use them.
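As a rough sketch, that command looks like this (the column names text and class are assumptions for this generic dataset):

    ludwig experiment \
      --data_csv text_classification.csv \
      --model_definition "{input_features: [{name: text, type: text}],
                           output_features: [{name: class, type: category}]}"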

It trains on the training set and validates on the validation set in order to figure out when to stop training: if, for instance, the accuracy on the validation set doesn't improve for a certain number of epochs, training is interrupted. Then at the end it predicts on the test set. This is the output at the end of the first epoch: there are some training steps and then evaluation on training, validation, and test, so you have all your measures and your loss.

Then after a few epochs, in this case at the fourth epoch, peak validation accuracy is reached; by the 14th epoch, after 10 epochs in which the validation performance did not improve, early stopping kicks in. We're going to use the model from the fourth epoch, because it was the best on the validation set, to compute our predictions. Then you get some measures of the quality of your predictions; these are at the general level, and you also have measures that are specific to each single class.

How Does It Work?

This is what Ludwig does in a nutshell, but let's look at what is going on inside, how it works. There are a few phases, like in every machine learning model-building process. In the training phase, you provide your raw data, preprocessing is performed, and some mappings are saved in JSON format. These mappings are needed because the same mappings have to be used at prediction time. For instance, if you have some categories that are strings, these mappings will contain a mapping from string to integer, because the model is going to use those integers for training. The preprocessed data is also saved on disk; you can specify not to save it if you want. The reason to save it is that this preprocessing step can be expensive, so if you want to train several models on the same data, you want to preprocess one time and then reuse the cached preprocessed data.

The model training actually saves two different objects: one is the weights of the model, the other is the hyperparameters of the model. Then at prediction time, you provide new data and map it into preprocessed data using the same mappings that you obtained during training. The data is provided to the model, which is loaded from the weights and hyperparameters that were saved before. The model will predict some values, some Tensors really, and those are mapped back into data space through the mappings again, in this case in the opposite direction. If the model predicts Class 7, for instance, it will be replaced by whatever Class 7 maps to, like the name of the class.

The experiment, run by the ludwig experiment command I showed you before, performs first training and then prediction, and saves everything that is output by both steps. In particular, at the end you have the outputs the model is predicting, in this case labels, and also some training statistics and some test statistics. All these things are useful because there is also a visualization component that will produce some graphs for you based on these outputs.

Under the Hood

Let's look under the hood at what is actually going on. What makes it work is mainly three different things. One is the data type abstraction, and I will tell you how that works. Another is the model definition, which is a YAML file; the one I showed you before was just one long string, but you can also provide a YAML file that is parsed in exactly the same way, which is convenient if it gets a little longer and you don't want one super long string. The third, if you know a little bit about Python, is that behind the scenes it uses **kwargs, which is a way to map a dictionary onto the set of arguments of a function. This is done in a smart way that lets you add a new way of specifying your model and have it mapped directly from the YAML file without having to do anything, really.
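A minimal sketch of that pattern (not Ludwig's actual code): the parsed YAML subtree for an encoder is just a dictionary, and ** expansion maps its keys onto the encoder's arguments.

    # Minimal sketch of the **kwargs pattern described; names are illustrative.
    encoder_parameters = {'embedding_size': 256, 'num_filters': 256}  # parsed from YAML

    class ParallelCNN:
        def __init__(self, embedding_size=256, num_filters=256, activation='relu'):
            # every argument is a hyperparameter with a default value
            self.embedding_size = embedding_size
            self.num_filters = num_filters
            self.activation = activation

    # dict keys become keyword arguments, so a new YAML key maps to a new
    # encoder argument with no extra plumbing
    encoder = ParallelCNN(**encoder_parameters)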

Going back to our example, this is the model definition part; let me go into a little more detail here. You have two pieces in particular: input features and output features. Both of them are lists, so as you can imagine, you can have several input features and several output features. Each element of these lists contains a name and a type; the name is exactly the name of the column in your CSV file that contains that feature. There are several different types, as we're going to see, and you need to specify the type because different types are treated differently, both in terms of preprocessing and in terms of the parts of the model that deal with them.

That model definition is resolved against a set of defaults and becomes this much bigger thing, which is the real model definition provided to Ludwig, and which is separated into five different sections: the input features and the output features, which we have already seen, plus the combiner, the training, and the preprocessing sections. Input features, combiner, and output features define the model itself, while the training section defines a bunch of hyperparameters for training, for instance the batch size, the number of epochs, or the learning rate. The preprocessing section defines parameters for preprocessing; for instance, if you are preprocessing text, you may want to define the maximum length of the text, or how many of the most frequent words to keep, mapping everything else into an unknown token, and things like that.
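A hedged sketch of what a resolved definition might look like; the default values and the preprocessing key names shown here are assumptions, so check the documentation for the exact ones:

    input_features:
      - name: text
        type: text
        encoder: parallel_cnn
        level: word
    combiner:
      type: concat
    output_features:
      - name: class
        type: category
    training:
      batch_size: 128
      epochs: 100
      learning_rate: 0.001
      early_stop: 5
    preprocessing:
      split_probabilities: [0.7, 0.1, 0.2]
      text:
        word_most_common: 20000
        word_sequence_length_limit: 256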

Let's first look at the modeling part; this is the overall architecture. Every input feature that you specify has a type associated with it, as you've seen, and is mapped into a representation by an encoder. You can have several input features, and when you do, you need a part of the model that combines the information coming from all of them. That's the combiner: it takes in the information from all the input features, combines it, and provides it to one or more output features. In many use cases there's only one output feature, if you have only one target, but if you want to do multitask learning, you can also do that automatically with Ludwig by specifying several output features.

The nice thing about this architecture is that it maps into several different use cases. If you have one text input feature and one categorical output feature, you have a text classifier; if you have an image input feature and a text output feature, you have an image captioning system. If you have categorical, numerical, and binary features as inputs and one numerical feature as output, you have a regression model, and so on. You can think of combining input data types and output data types to obtain many different machine learning models and applications.

Additionally, each input type can have different models that encode the information into a latent representation, and each output type can have different decoders that decode the information from the inner representation back into data space. Some input types are really simple, so they don't need many different encoders; others are more complicated, so there are many options available. For instance, text features, which are probably the most developed part of Ludwig at the moment, give you five options; the transformer is coming in the next version of Ludwig, which I'm currently working on: it's already working, I just haven't released it yet. You can decide to encode your text with a stack of CNNs, with parallel CNNs, with a combination of the two, or with an RNN, for which you can specify the cell type as a hyperparameter, whether it's an LSTM or a GRU or anything else; there's also a combination of a CNN and an RNN and, soon, the transformer for mapping your inputs into a vector representation.

This is how you define it in the model definition. If you want to use a parallel CNN, you have to specify these four parameters: the name, as we've already seen, which is the name of the column in the CSV file that contains the text; the type, in this case text, obviously; the encoder, parallel_cnn in this case (if you want to change the encoder, you just change that string and you have a new model); and the level, meaning whether you want to work at the word level or at the character level. All the other parameters are hyperparameters, and what you're seeing are the defaults, which are used if you don't specify anything. You can specify each parameter in detail: for instance, the activation function that is going to be used, the embedding size, or, if you want to use pre-trained embeddings, the file that contains them, how many convolutional layers there are, what the filter size is, and so on. You have all these options at your disposal.
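For instance, a sketch of such a feature definition; the hyperparameter names mirror the ones mentioned in the talk, but treat them as assumptions:

    input_features:
      - name: text
        type: text
        encoder: parallel_cnn
        level: word
        embedding_size: 256
        num_filters: 256
        activation: relu
        pretrained_embeddings: /path/to/embeddings.vec  # optional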

The smart use of **kwargs happens here, because all these parameters are mapped one-to-one onto the arguments of the init function of the object that implements the encoder. This is just to show you that if you want to use a stacked CNN, you just have to change the encoder string, and that's it; obviously, the stacked CNN has potentially different hyperparameters from the parallel CNN. There are options like these for the stacked CNN, for the stacked parallel CNN, and for the RNN; depending on which encoder you use, there is a different set of parameters you can define. For the RNN, you can specify which cell type to use, whether an RNN, an LSTM, or a GRU, the state size, whether it's bidirectional, whether to use dropout, all these things. You have full control over the model, if you want it.
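For example, swapping in the RNN encoder with the hyperparameters just mentioned might look like this (a sketch; the parameter names follow the talk, not verified defaults):

    input_features:
      - name: text
        type: text
        encoder: rnn
        cell_type: lstm
        state_size: 256
        bidirectional: true
        dropout: true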

This is something that I added in version 0.1.1 of Ludwig: you can also specify language-specific preprocessing for each of the languages that are supported at the moment. This can be useful if you are dealing with languages other than English; English tokenization is easy, but for other languages it can be a little trickier. The nice thing about this architecture is that it's really good for code reuse, because I have exactly the same encoders and I can use them not only for text, but also for generic discrete sequences and for time series.

Image features are a little less developed at the moment; there are only two encoders. More will be coming, but for now these two are the most used encoders in practice in most of the real-world applications we have seen so far: a stack of CNNs, which is similar to VGG if you are familiar with that model, and a stack of residual layers, like a ResNet encoder.

Category features are really simple: you can just decide whether a category is encoded as a one-hot encoding or as a dense embedding. Numerical and binary features are also encoded in a simple way: numerical features have one neuron attached to them, plus a normalization that is useful for scaling (you don't have to use it if you don't want to), while binary features are just used as they are, concatenated with the other things.

Finally, there are set and bag features. These are encoded so that every element in the set has its own embedding; an aggregation function then combines the embeddings of all the elements in the set, and those are passed through some fully connected layers. The same happens for the bag, but in this case the aggregation is weighted by the frequency, or a normalized version of the frequency, of each element in the bag.

Now we have all these input features, each of which has at least one way to be encoded from raw data into a vector representation. What are we going to do with all these vector representations? That's the combiner piece, which combines the information from all the inputs. The most basic one is the concat combiner: you take all the vector representations, concatenate them, and pass them through a number of fully connected layers that you can specify. There are also other combiners, such as one that combines information along the time dimension, if you have a time series, text, or a sequence; and it's really easy to write your own combiner if you have a specific way you want to combine the information.
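A sketch of configuring the concat combiner with a couple of fully connected layers; the parameter names are assumptions:

    combiner:
      type: concat
      num_fc_layers: 2
      fc_size: 256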

The output of the combiner is provided to the different output decoders. The most developed one is the category decoder, which is used for multi-class classification. Here you can set a bunch of parameters: there can be some fully connected layers specific to each decoder, you can specify which loss you want to use, how much regularization you want, whether you want to smooth out the probabilities coming from your model, and so on. You have a whole bunch of parameters you can set in order to obtain a classification decoder that works really well.
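As a sketch, such an output feature might be configured like this; the loss parameter names are assumptions based on what the talk describes:

    output_features:
      - name: class
        type: category
        num_fc_layers: 1
        fc_size: 64
        loss:
          type: softmax_cross_entropy
          confidence_penalty: 0.1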

Numerical feature decoding gives you what is basically a regression model. You can specify the loss type; at the moment there is mean squared error and mean absolute error, and I think you can also specify R2 directly as a loss. The simplest one is the binary decoder: you have a binary classification problem and you use binary cross-entropy as the loss. You can also give additional parameters, like a penalty for overconfidence, or you can set the threshold to a specific value other than the default 0.5, which can be useful if you are in a really unbalanced data regime. So you have all the options.

The most complex one is the sequence decoder, which also doubles as a text decoder. Here you have two different types of decoders. The generator, which is the default, takes the output from the combiner, generates the first element of the output sequence, feeds that output back in as input for the next decision, and keeps going until an end-of-sequence token is reached. This is pretty standard if you think about a sequence-to-sequence model; it's the "to sequence" part, the second half of the model.

The other decoder you have at your disposal is the tagger. It assumes that you have a sequence as input, and it performs a classification for each single element of the input sequence. This is useful if you have text and you want to classify something for each single word in it: you want to say whether that word belongs to an entity or not, or you want to tag the part of speech of that word. All these things can be done with the tagger.

There's also a set feature decoder, which basically implements a multi-label classifier: potentially each element of the set is a label, and you define a probability for each label independently. A really neat feature that I added is the idea of having dependencies among the output features. In many cases where you are doing multitask learning, the different tasks, which here map into different output features, depend on each other. One model that I worked on was a customer support model where we had to predict which class a received ticket belonged to, and which was the best template, among a set of templates, to use to answer that ticket. Obviously, if you know the class of the ticket, you can select much better what type of answer to give.

What you can do in Ludwig for that is specify that, in this case, Output Feature 2 has its own loss, but Output Feature 3 depends on Output Feature 2. The output of Output Feature 2 is then used as input for Output Feature 3 in order to provide its classification, or whatever other kind of loss and predictions you want to have with different data types. Separately, this is just a small selection of the training parameters; there are many more. You can decide all the parameters of the optimizer, but also the batch size, the number of epochs, after how many epochs without improvement to apply early stopping, whether there's weight decay, and so on. You have a bunch of possibilities.
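Going back to dependencies for a moment, a rough sketch of how the ticket example's outputs might be declared; the feature names here are hypothetical:

    output_features:
      - name: ticket_class
        type: category
      - name: answer_template
        type: category
        dependencies: [ticket_class]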

Then you have the preprocessing parameters. The first few are generic, not tied to any specific feature type: for example, if you want Ludwig to perform the data split, you can define how much of the data goes into each split through a parameter here. All the other ones are specific to a particular feature type. They are global in the sense that if you have several text features, those parameters are used for all of them; but within the definition of each single input feature and output feature, you can also specify feature-specific parameters. If you have two text features, say the title of a news article and the body of the article, you may want different parameters: for the title you can set the maximum length to 20 words, while for the body you can set it to 500 words.
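A sketch of that per-feature override; the exact parameter name is an assumption:

    input_features:
      - name: title
        type: text
        preprocessing:
          word_sequence_length_limit: 20
      - name: body
        type: text
        preprocessing:
          word_sequence_length_limit: 500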

Example Models

Let me show you some example models that you can build with Ludwig, just to give you a flavor of how easy it is. This is the text classification model that we have seen before; the command on the right is all you have to write to train a model that does that. If you want an image classification model instead, it is basically exactly the same thing: the only differences are that the input name is now image_path instead of text, and the type is image instead of text, and now you have an image classifier.
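A sketch of the image version of the command; the column names are assumptions:

    ludwig experiment \
      --data_csv image_classification.csv \
      --model_definition "{input_features: [{name: image_path, type: image}],
                           output_features: [{name: class, type: category}]}"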

If you want a sequence-to-sequence model that can be used as a simple chitchat dialogue model, all you have to do is define an input feature of type sequence or text (it doesn't really matter; the difference is in how tokens are split, but the models are exactly the same) and then an output feature which is again a sequence. In this case, the decoder is the generator, because we don't know how long the output sequence is going to be; it can be longer or shorter than the input sequence, so you want to use a generator. You can also specify whether you want to use attention, which is an additional way to make models better at predicting sequences.
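A sketch of such a definition; the feature names are hypothetical, and whether the attention option takes exactly this form is an assumption:

    input_features:
      - name: utterance
        type: sequence
    output_features:
      - name: response
        type: sequence
        decoder: generator
        attention: bahdanau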

This is a restricted version of a model that we actually have in production at Uber for calculating the expected time of delivery on Uber Eats. You have the restaurant, encoded as a category; you have the order, encoded as a set of items; and you have the hour of the day and the minutes, encoded as numerical features. What you are predicting is a numerical feature, the expected time of delivery: how many minutes the delivery of this specific order, from this restaurant, at this moment in time, is going to take.
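A sketch of that regression model with hypothetical column names:

    input_features:
      - {name: restaurant, type: category}
      - {name: order, type: set}
      - {name: hour, type: numerical}
      - {name: minute, type: numerical}
    output_features:
      - {name: delivery_time, type: numerical}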

This is an example with the tagger, as I was saying before. You have these sentences and these tags: P means person, O means "I don't care", C means CD, and D means date. Here you have an alignment between all the words in the input and all the tags in the output. In this case, you want the output feature to be of type sequence, but with the tagger as the decoder instead of the generator we were using for the sequence-to-sequence example.
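A sketch of the tagging setup, again with hypothetical feature names:

    input_features:
      - name: utterance
        type: sequence
    output_features:
      - name: tag
        type: sequence
        decoder: tagger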

Ludwig also comes with some additional features. First of all, it's really easy to install: you just install it and it works right away. The programmatic API gives you a lot of flexibility. In order to use it, you just import the model class from the package and initialize a Ludwig model with a model definition, where the model definition is just a dictionary containing the exact same information as the YAML file. Then you train that model object on some DataFrame data that you have. After you've trained it, or loaded a pre-trained model, you can use it to predict on some other data and get the predictions out. You can basically train a model with two lines of code and use the model to predict with another two lines of code.
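A minimal sketch of that programmatic API; the import path, file names, and column names are assumptions, so check the Ludwig documentation:

    # train a Ludwig model from Python and use it for prediction
    import pandas as pd
    from ludwig.api import LudwigModel

    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')

    model_definition = {
        'input_features': [{'name': 'text', 'type': 'text'}],
        'output_features': [{'name': 'class', 'type': 'category'}],
    }

    model = LudwigModel(model_definition)
    train_stats = model.train(data_df=train_df)   # two lines to train...
    predictions = model.predict(data_df=test_df)  # ...and two to predict
    model.close()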

It's integrated with Horovod, an open-source library that Uber released for training deep learning models in a distributed way on multiple GPUs and on multiple machines with multiple GPUs. All you have to do in Ludwig to use Horovod is add the --use_horovod flag, and it just works. It is also integrated with some internal tools. We have two projects that we will probably open source too. One is PyML: you can imagine it as a Docker image with some Python code inside; it provides you with a data frame and you have to return a data frame. You can plug a Ludwig model inside and use it for prediction, and once you put it there, it's deployed at scale, in a replicated way, with throttlers and everything already in place for you. The other is Quicksilver AutoTune, which is a way to perform hyperparameter optimization.

I think it's really good for experimentation in particular, because you can plug in new encoders and decoders really easily; they just have to conform to a really simple interface, which is basically Tensor in, Tensor out. Whatever you do with those Tensors inside the model is your responsibility. You have an experimental setting that is fixed, and you change just one thing, the specific model you're using, so you can compare many different models really easily. You just have to change one string in a YAML file and you have a different model.

This is the way we do hyperparameter search. It's not released as open source yet because it's tied to our infrastructure, so releasing it wouldn't really make much sense, but I'm showing it to give you a glimpse of how you can do it. It's really simple: on the left side you have a model definition, the same kind we've seen so far, which is the base definition that all the models you're going to train will share. There are two parameters you care about, in this case the training dropout and the embedding size of the input feature called flow_node. You define some ranges, and if you do grid or random search, samples are drawn from those ranges and trained in parallel. The results, in terms of validation accuracy and validation measures, are collected in one place, so you know how much a specific change in one of the parameters impacted the final result. You can also do this with Bayesian optimization; we have something internal at Uber that does that already.
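The internal tool is not open source, so here is only a hedged sketch of the random-search idea using the programmatic API; every name and range below is an illustrative assumption:

    # random search over the two hyperparameters mentioned in the talk
    import copy
    import random
    from ludwig.api import LudwigModel

    base_definition = {
        'input_features': [{'name': 'flow_node', 'type': 'category'}],
        'output_features': [{'name': 'class', 'type': 'category'}],
        'training': {},
    }

    results = []
    for _ in range(10):  # number of sampled configurations
        definition = copy.deepcopy(base_definition)
        # sample the training dropout and the flow_node embedding size
        definition['training']['dropout_rate'] = random.uniform(0.0, 0.5)
        definition['input_features'][0]['embedding_size'] = random.choice([64, 128, 256])
        model = LudwigModel(definition)
        train_stats = model.train(data_csv='data.csv')
        results.append((definition, train_stats))  # collect validation measures
        model.close()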

Visualizations

The nice thing is that after you get all these results from Ludwig with really simple commands, what the model spills out is a CSV file with the predictions, some NumPy files, and two JSON files, one for training statistics and one for test statistics. With just one command you can obtain, I don't remember the exact number, around 20 different visualizations of your models' performance. The one in the top left is a comparison between two models on three different measures; the one in the top middle is a calibration plot; the one in the top right plots coverage against accuracy as the threshold on the confidence of the model's probability changes.

The bottom left one is the same thing but projected in 2D. The bottom middle one is a comparison between two models in terms of predictions: green means both models got the prediction correct; yellow means the predictions of the two models differed, one correct and one wrong; red means both models were wrong, with two shades of red distinguishing the cases where both models wrongly predicted the same thing and where they wrongly predicted different things. The bottom right one plots F1 score against frequency in a setting with more than 1,000 classes; it shows that we perform better on the more frequent classes.

What are the next steps for Ludwig? First of all, I want to add new feature types: video, audio, point clouds, speech, and lists of lists (if you are interested, I can tell you later why I need that). I want to add additional encoders; in particular, I'm working on ELMo, BERT, and GPT for text features, and on some other encoders for images. At the moment there's no decoder for time series or for images; I want to fix that and add more decoders for those features. And the same dependency structure that I have between outputs, I want to add for inputs too.

The weakest point of Ludwig at the moment is that you have to provide a CSV file. If your data is on a Hive cluster, hooking things up can be a little tricky. We want to use Petastorm, an open-source library released by Uber that abstracts away how the data is obtained: you can get data from files on S3, from your Hive tables, or from anywhere else you want. Within Ludwig the data is then used as data frames, so the way you get the data becomes transparent to the way you manage it. This will make it much more usable for enterprise use cases.

You may want to check the documentation; I spent a lot of time writing it, so I think it's pretty comprehensive. You can also check the repository, and there's a blog post that explains most of the higher-level concepts. There's also going to be a paper about it that I'm going to release soon. I invite you to take a look, and to contribute if you want, because I think it can be useful as a community effort.

Questions and Answers

Moderator: Super impressive. What are the gotchas? It seems way too good to be true. What's going to bite me?

Molino: One possible limitation is that the data types I have are not all the data types that are possible. One thing that you cannot do simply at the moment is an object detection system, because object detection systems output a list of bounding boxes, and a list of bounding boxes is not one of the data types I have. So you basically cannot do it right now, but the counterpart is that it would be really easy to add. I haven't done it yet because the team behind this is not huge: it's me and two other engineers at Uber who are helping me with 10% of their time. It's a matter of when it's going to be there, rather than if.

Participant 1: I work in a bank and we have a lot of problems when it comes to explainability. Do you have plans of adding LIME or something like that as an option into this package?

Molino: I don't think there would be a need for that, to be honest, because whenever you train a model in Ludwig, it is saved as a TensorFlow model, so you could apply LIME directly to the TensorFlow model without even having to go through Ludwig. But yes, in general, supporting it natively could be a reasonable extension.

Participant 2: If I want to build custom combiners or estimators, is the programmatic API the only way, or is it just a new YAML file that I point to?

Molino: In order to add a combiner or an encoder, there's an interface in code that you have to respect, and it's really lightweight. You have to create a class whose init function takes all the hyperparameters you care about. Then there's a call function that takes a Tensor as its main argument; there are a bunch of other arguments, but they're not really important. What you are supposed to provide as the output of that call function is again a Tensor, and the shapes of those Tensors have to follow a specific form.

For instance, if you have a sequence encoder, the shape of the input Tensor is batch size times length, and it's an [00:37:43], and the output is supposed to be batch size times length times hidden dimension. Whatever you do between that input and output is up to you. Once you do that, you can use the name of the encoder you built directly from the YAML file. Well, actually, there's one little thing you have to do first: add the new class to a dictionary that maps the name of the class to the class. After you do that, you can call it directly from the YAML file, so it's really easy to extend.
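A hedged sketch of the encoder interface just described; the exact method signatures and the registration dictionary are assumptions, not Ludwig's actual code:

    import tensorflow as tf

    class MySequenceEncoder:
        def __init__(self, vocabulary_size=10000, state_size=256, **kwargs):
            # all the hyperparameters you care about arrive here from the YAML
            self.embed = tf.keras.layers.Embedding(vocabulary_size, state_size)

        def __call__(self, input_sequence, **kwargs):
            # input: [batch_size, length] integer Tensor
            # output must be [batch_size, length, hidden_size]
            return self.embed(input_sequence)

    # register the new encoder under a name so the YAML can reference it;
    # the registry name below is hypothetical, per the dictionary the talk mentions
    # sequence_encoder_registry['my_encoder'] = MySequenceEncoder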

Participant 3: How much cleaning of the data is required outside of Ludwig, for example for missing data?

Molino: Ludwig does some cleaning for you. In the preprocessing, there are a bunch of functions that are used for mapping the raw data into Tensors. You can specify whether you want to fill missing values, and there are several strategies you can use for that. There are a bunch of other things you can do directly from Ludwig too, but a good strategy would be to provide data that is already pretty clean; that would be ideal.
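A sketch of what choosing a fill strategy might look like; the strategy and parameter names are assumptions based on the answer above:

    preprocessing:
      category:
        missing_value_strategy: fill_with_mode
      numerical:
        missing_value_strategy: fill_with_mean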

Participant 4: I'm interested in the sequence tagging part. I noticed that your label is per word, like "ignore", when you are specifying the training data. That seems a bit hacky; is there a way to just specify a label at a given position in the string?

Molino: I don't find it hacky, to be honest.

Participant 4: If I want to experiment with a character-based or a word-based model, those labels have to constantly change to reflect the model I'm trying to use.

Molino: But you will still have to specify, in your own way, a label for each single token.

Participant 4: Yes, but the way we do it right now is we just specify a start position and an end position, like an index-based label.

Molino: Yes. If you look at the example that I showed you, with PPP or CCC, something like that, you could specify P and 0-3. That's not the type of supervision you can provide to Ludwig at the moment, but mapping from that supervision into that list is extremely easy anyway, so I don't see it as a huge problem.

Participant 5: Once you have a trained model, is it ready to be uploaded to Google Cloud ML and served?

Molino: There is one caveat there. The model is saved as a TensorFlow model, so you could potentially take it, upload it to Google Cloud, and serve it. The problem is that those models expect data provided as Tensors in a specific form: if the output was a class, the model expects the integer that class is mapped to, 3 or 4 or whatever number it is. The preprocessing and postprocessing are done in Ludwig, in Python code, not within the model itself. For that reason there's a tricky part: you may want to keep doing the preprocessing and postprocessing within Ludwig, and then, when the model is actually called, that's the moment when you hit your deployed Google Cloud model.

 

Recorded at:

Jun 05, 2019
