CMU Develops Algorithm for Guaranteeing AI Model Generalization

Researchers at Carnegie Mellon University's (CMU) Approximately Correct Machine Intelligence (ACMI) Lab have published a paper on Randomly Assign, Train and Track (RATT), an algorithm that uses noisy training data to provide an upper bound on the true error risk of a deep-learning model. Using RATT, model developers can determine how well a model will generalize to new input data.

In the paper, which has been submitted to the upcoming International Conference on Machine Learning (ICML), the researchers present mathematical proofs of RATT's guarantees and run experiments on several benchmark datasets for natural language processing (NLP) and computer vision (CV) models. The paper shows that when a trained model achieves a high error on randomly labeled (noisy) data but a low error on clean data, the model is guaranteed to have a low error rate on new input data, and an upper bound on that error can be calculated from the training errors. According to the authors,

This work provides practitioners with an option for certifying the generalization of deep nets even when unseen labeled data is unavailable and provides theoretical insights into the relationship between random label noise and generalization.

Generalization is the ability of a learned model to produce correct output for unseen input data; that is, data that was not used during training. The generalization ability of large deep-learning models is not well-understood, especially for models with more parameters than training data samples. For example, it can be shown that these models can achieve low training errors even on random input data, indicating that they essentially memorize the training data; yet when trained with real datasets, they can indeed generalize to unseen data.

A model's ability to generalize is measured by its average error, or risk, calculated on the entire input population. While it can be difficult, if not impossible, to determine a model's true risk, there are techniques for calculating its theoretical upper bound. However, in many cases these techniques produce a vacuous upper bound, predicting that the model will do no worse than getting every answer wrong. In practice, most model developers hold out a portion of the training data and evaluate the trained model on this test set to get an estimate of its ability to generalize.
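The held-out estimate described above can be sketched in a few lines. This is a toy illustration using synthetic one-dimensional data and a deliberately crude threshold classifier; none of the names or data here come from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: x is 1-D, and the true label is 1 when x > 0.
x = rng.normal(size=1000)
y = (x > 0).astype(int)

# Hold out the last 20% as a test set.
train_x, test_x = x[:800], x[800:]
train_y, test_y = y[:800], y[800:]

# "Train" a crude model: predict 1 whenever the input exceeds the
# smallest positive training example.
threshold = train_x[train_y == 1].min()

# The error on the held-out split estimates the model's true risk,
# i.e. its average error on unseen data.
test_pred = (test_x > threshold).astype(int)
test_error = float(np.mean(test_pred != test_y))
print(f"estimated risk: {test_error:.3f}")
```

The held-out samples stand in for the unseen input population, which is why their error serves as an estimate of the true risk.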

The CMU team noted that recent research has shown that deep learning models exhibit an early learning phenomenon when trained on a combination of clean and noisy data: the model first fits to the clean data, then later memorizes the noisy data. The researchers then proved that if a model is trained on such a combination of clean and noisy data, and the average training error on clean data is low but on noisy data is high (around 50%), then the model's risk will have a non-vacuous upper bound that is a function of the two training error averages; this bound will be slightly larger than the average error on the clean data, but still relatively low.
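The shape of such a bound can be illustrated with a simplified leading-order formula. The expression below is an assumption: it is one plausible form consistent with the behavior the researchers describe (the bound collapses toward the clean training error as the noisy-data error approaches 50%), not the paper's precise statement, which includes additional lower-order concentration terms:

```python
def ratt_style_bound(clean_error, noisy_error):
    """Leading-order sketch of a RATT-style bound on population error.

    clean_error: average training error on the clean subset
    noisy_error: average training error on the randomly labeled subset

    Simplified for illustration; the paper's actual bound adds
    lower-order concentration terms.
    """
    return clean_error + max(0.0, 1.0 - 2.0 * noisy_error)

# Near-chance error (50%) on the noisy subset keeps the bound close
# to the clean training error; memorizing the noise (a lower noisy
# error) loosens the bound.
print(ratt_style_bound(0.02, 0.50))  # 0.02
print(ratt_style_bound(0.02, 0.40))  # roughly 0.22
```

In other words, the gap between chance-level and actual error on the noisy subset measures how much the model has memorized, and that memorization is exactly what inflates the bound.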

To validate their proof empirically, the researchers trained several deep-learning models on common benchmark datasets. The MNIST and CIFAR-10 image datasets were used to train a multilayer perceptron (MLP) and a ResNet18 CV model, and the IMDb sentiment-analysis dataset was used to train a Long Short-Term Memory (LSTM) network and to fine-tune a BERT model. Before training, the team set aside a small fraction of each dataset and randomly assigned new labels to those samples to create noisy data. The models were then trained on the combined clean and noisy data, with the error on each subset tracked separately (hence the name Randomly Assign, Train and Track). The team compared the accuracy bounds predicted by their proof with the actual accuracy obtained from traditional test-set evaluation of a model trained only on clean data. The predicted bound tracked test performance closely; for example, the predicted accuracy of the ResNet18 model on MNIST data was 96.8%, compared with an actual accuracy of 98.8%.
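The data-preparation step described above — randomly relabeling a small held-in fraction and then tracking the error on each subset separately — might be sketched as follows. The function names and parameters are illustrative, not taken from the paper's code:

```python
import numpy as np

def randomly_assign(labels, fraction, num_classes, rng):
    """Randomly assign new labels to a small fraction of the dataset.

    Returns the corrupted label array and a boolean mask marking
    which samples received random labels.
    """
    labels = labels.copy()
    n_noisy = int(fraction * len(labels))
    noisy_idx = rng.choice(len(labels), size=n_noisy, replace=False)
    labels[noisy_idx] = rng.integers(0, num_classes, size=n_noisy)
    mask = np.zeros(len(labels), dtype=bool)
    mask[noisy_idx] = True
    return labels, mask

def track_errors(predictions, labels, noisy_mask):
    """Track training error separately on the clean and noisy subsets."""
    clean_err = float(np.mean(predictions[~noisy_mask] != labels[~noisy_mask]))
    noisy_err = float(np.mean(predictions[noisy_mask] != labels[noisy_mask]))
    return clean_err, noisy_err

rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=1000)            # original labels, 10 classes
y_noisy, mask = randomly_assign(y, 0.1, 10, rng)
print(mask.sum())  # 100 samples were randomly relabeled
```

During training, `track_errors` would be called on the model's predictions for the training set; a noisy-subset error near chance alongside a low clean-subset error is the condition under which the bound applies.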

In a discussion on Twitter, one user asked about methods for preventing the models from memorizing the noisy data. ACMI lab leader and co-author Zachary Lipton replied,

Even for those models that eventually fully memorize, they first reach *near* full accuracy at a point where this mechanism assures generalization.

Generalization is currently an active research area in the corporate world as well as academia. Microsoft recently published a paper at the International Conference on Learning Representations (ICLR), showing how to use distillation techniques to convert a model into an equivalent, less complex one, whose generalization bound calculation is more tractable. In 2019, Google published a paper studying the relationship between neural network properties and generalization, open-sourcing the models they used in their research.
