MIT CSAIL TextFooler Framework Tricks Leading NLP Systems

A team of researchers at the MIT Computer Science & Artificial Intelligence Lab (CSAIL) recently released a framework called TextFooler that successfully tricked state-of-the-art NLP models (such as BERT) into making incorrect predictions. Before modification, these models exhibited accuracy above 90% on tasks such as text classification and textual entailment. After TextFooler’s modifications, which changed less than 10% of the input data, accuracy fell below 20%.

Extensive research has gone into understanding how adversarial attacks are handled by ML models that interpret speech and images. Less attention has been given to text, even though many applications related to internet safety rely on the robustness of language models. Engineers and researchers can incorporate TextFooler into their workflows to test the boundaries of hate-speech flagging, fake-news detection, spam filters, and other important NLP-powered applications.

TextFooler identifies the most important words in the input data and replaces those words with grammatically correct synonyms until the model changes its prediction. CSAIL evaluated TextFooler on three state-of-the-art deep-learning models over five popular text classification tasks and two textual entailment tasks. The team proposed a four-way automatic and three-way human evaluation of language adversarial attacks to evaluate effectiveness, efficiency, and utility-preserving properties of the system.

CSAIL describes the core algorithm in depth in the research paper it released in January. First, the algorithm discovers the important words in the input data with a selection mechanism, which assigns each word an importance score by measuring how much the model's prediction changes when that word is deleted. All words with a high importance score are then processed through a word-replacement mechanism. This mechanism uses word embeddings to identify top synonyms of the same part of speech, generates new text with those replacements, and keeps only the texts that stay above a certain semantic similarity threshold. Finally, if any generated text alters the prediction of the target model, the candidate with the highest semantic similarity score is selected for the attack.
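
The Python sketch below illustrates this greedy loop under simplified assumptions; it is not CSAIL's implementation. The classifier, synonym lookup, and similarity scorer passed in are hypothetical stand-ins for the trained target model, the word embeddings, and the sentence-level semantic similarity check described in the paper.

# A minimal, runnable sketch of the greedy attack loop described above.
# The helpers passed in (predict, synonyms, similarity) are hypothetical
# stand-ins, not TextFooler's actual components.
from typing import Callable, List


def word_importance(words: List[str], predict: Callable[[str], float]) -> List[float]:
    """Score each word by how much deleting it shifts the model's prediction."""
    base = predict(" ".join(words))
    return [
        abs(base - predict(" ".join(words[:i] + words[i + 1:])))
        for i in range(len(words))
    ]


def attack(
    text: str,
    predict: Callable[[str], float],          # probability of the originally predicted label
    synonyms: Callable[[str], List[str]],     # candidate replacements for a word
    similarity: Callable[[str, str], float],  # semantic similarity of two sentences
    sim_threshold: float = 0.7,
) -> str:
    """Replace high-importance words with close synonyms until the label flips."""
    words = text.split()
    scores = word_importance(words, predict)

    # Attack the most important words first.
    for i in sorted(range(len(words)), key=lambda j: -scores[j]):
        best, best_sim = None, -1.0
        for candidate in synonyms(words[i]):
            trial = words[:i] + [candidate] + words[i + 1:]
            sim = similarity(text, " ".join(trial))
            if sim < sim_threshold:
                continue  # discard replacements that drift too far in meaning
            # Among candidates that flip the prediction, keep the most similar one.
            if predict(" ".join(trial)) < 0.5 and sim > best_sim:
                best, best_sim = trial, sim
        if best is not None:
            return " ".join(best)  # adversarial example found
        # (The full algorithm also keeps the replacement that weakens the
        # prediction the most and moves on to the next important word.)
    return text  # attack failed; text unchanged


# Toy demonstration: a "model" that flags a review as negative whenever
# it contains the word "contrived".
toy_predict = lambda s: 1.0 if "contrived" in s.split() else 0.0
toy_synonyms = lambda w: {"contrived": ["engineered"], "totally": ["fully"]}.get(w, [])
toy_similarity = lambda a, b: 0.9  # pretend every candidate stays on-topic

print(attack("the characters are totally contrived", toy_predict, toy_synonyms, toy_similarity))
# -> "the characters are totally engineered"

In the toy run, deleting "contrived" is the only change that moves the toy model's score, so that word gets the highest importance and is swapped first, mirroring the selection-then-replacement order described above.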

The example below shows how TextFooler modified the input data to flip a model's classification of a movie review from negative to positive.

Original:  The characters, cast in impossibly contrived situations, are totally estranged from reality.

Attack: The characters, cast in impossibly engineered circumstances, are fully estranged from reality.

"The system can be used or extended to attack any classification-based NLP models to test their robustness," said lead researcher Di Jin. "On the other hand, the generated adversaries can be used to improve the robustness and generalization of deep-learning models via adversarial training, which is a critical direction of this work."

Sameer Singh, an assistant professor of computer science at UC Irvine, focuses heavily on adversarial attacks for NLP. Singh acknowledges that the research has successfully attacked best-in-class NLP models, but also notes that any attack with an architecture like TextFooler’s, which has to repeatedly probe the target model, might be detected by security programs. [Source]

The code, pre-trained target models, and test samples are now available on Di Jin’s GitHub as part of the TextFooler project.
