
Hands-on Feature Engineering for Natural Language Processing



Susan Li shares various NLP feature engineering techniques, from Bag-of-Words to TF-IDF to word embeddings, covering feature engineering for both ML models and emerging DL approaches. She covers the details, including contextual & linguistic feature extraction, vectorization, n-grams, topic modeling, and named entity resolution, which are based on concepts from mathematics, information retrieval, and NLP.


Susan Li is a Sr. data scientist at Kognitiv, where she specializes in machine learning and NLP. She is passionate about helping organizations realize the potential of big data and advanced analytics, and helping individuals enhance skills in data literacy. She frequently writes and speaks about predictive analytics, machine learning and NLP for technical and general audiences.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Li: Before I start, I would like to ask the audience a couple of questions. How many of you have some experience with NLP? It can be that you participated in a project at work, or you self-studied at home; everything’s included. Almost one-third of the group.

How many of you agree with me that natural language processing is very difficult? All of you. One of the major difficulties is that we do not consciously understand language ourselves. Look at this conversation. What does it mean? Another difficulty is that language itself is ambiguous. Do you still remember these shocking news headlines? They are real news headlines from the past.

When we think of a linguistic concept, such as words or sentences, they seem to be simple, clear, and well-formed ideas, but in reality, there are so many borderline cases that can be quite difficult to figure out. For example, one single word can have many different meanings. In order to understand what a sentence means, we have to understand the meaning of the words in that sentence, this is not a simple task. However, as humans, we can learn this effortlessly. For example, when we read a website, when we read those newly made-up words, verbs used as nouns, and nouns used as verbs, or sarcasm, we get it immediately without too much trouble.

However, how computers process language is completely different from how humans process language. Once we go away from whatever training corpus the computers were trained on, they're likely to become hopelessly confused. But all of these difficulties do not stop us from building NLP applications. This is my hand graph of NLP applications, I'm sure you can add many more.

This brings up the topic of the talk today, "Feature Engineering for Natural Language Processing." My name is Susan, and I'm a senior data scientist at Kognitiv. We are a travel tech solution provider in Toronto. Our clients are travel companies, hotel groups, resort groups, and vacation clubs all over the world, and we help them drive direct online revenue and online bookings from their own websites. Many aspects of my work are NLP-related. For example, hotel reviews, hotel descriptions, and hotel room-type matching. That's why I'm here today to share what we have learned.

What are the Features for NLP?

The topic is about feature engineering, so our focus is on features. What are the features? When we try to predict New York City taxi fare, we know that our features are distance between pickup and drop off locations, time of the day - because different times of the day have different fares - and the day of the week - is it a holiday or not? We can add more features such as weather condition-related features. Is it snowing or not? Is it raining or not? Because when it's snowing, we will expect taxi fares to rise. But what are the features for NLP? How do computers perceive text? The field of NLP aims to convert human language into a form of representation that is easy for computers to manipulate.

In other words, when we deal with an NLP problem, our input is text. We have to convert the input text into something that our algorithms can understand. In general, we can categorize the features into two big categories. One category is meta-features, such as word counts, stop word counts, punctuation counts, the length in characters, the language of the text, and many more. Another big category is text-based features. I'm sure many of you are familiar with tokenization, vectorization, stemming, part-of-speech tagging, and named entity extraction. If we visualize them, they look like this.
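The meta-features listed above can be sketched in a few lines. This is only an illustration, not code from the talk: the `meta_features` helper and the tiny `STOP_WORDS` set are assumptions made for the example, and a real project would use a fuller stop-word list.

```python
import string

# A tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "to", "in", "of", "and", "how", "do", "i"}

def meta_features(text):
    """Return simple count-based meta-features for one piece of text."""
    words = text.split()
    # Strip surrounding punctuation and lowercase before the stop-word check.
    cleaned = [w.strip(string.punctuation).lower() for w in words]
    return {
        "word_count": len(words),
        "stop_word_count": sum(1 for w in cleaned if w in STOP_WORDS),
        "punctuation_count": sum(1 for ch in text if ch in string.punctuation),
        "char_length": len(text),
    }

features = meta_features("How do I plot a dataframe bar graph in Python?")
print(features)
```

Each value becomes one numeric column next to whatever text-based features are built from the same post.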

My list is not exhaustive. In the meantime, we do not use all the features in every case scenario. The choice of features is a completely empirical process, mainly based on trial and error. The feature engineering process is very task-dependent.

I'm borrowing quotes from leaders about feature engineering. Pedro Domingos, who's a professor at the University of Washington, said, "At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." Xavier Conort, a number-one-ranked Kaggler, said, "The algorithms we use are very standard for Kagglers. We spend most of our efforts in feature engineering. We were also very careful to discard features likely to expose us to the risk of over-fitting our models." Andrew Ng said, "Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering."

Feature Engineering for NLP

Feature engineering is hard work, and it is an art and a skill. It requires creativity, it requires domain knowledge, and it requires us to brainstorm with our colleagues and co-workers. As with anything else, experience comes with practice. Once we have a new project or a new dataset, we always go back to revisit, to double check what has worked before. Naturally, we need to explain what is not feature engineering. Of course, data collection is not feature engineering, and data pre-processing, such as removing stop words, removing non-alphanumeric characters, removing tags, and lower casing, is not feature engineering; that is data pre-processing, or what we call text cleaning. Creating the target labels when labeling the data is not feature engineering. Scaling or normalization is not feature engineering. PCA is not feature engineering, and the last step in machine learning, which we call hyperparameter tuning and optimization, is not feature engineering.

This is a very high-level overview of NLP workflow, and we're focused on the middle part. We are going to showcase several examples of feature engineering methods. They are TF-IDF - I'm sure many of you are familiar with, this is a first class - and Word2vec, FastText, Topic Modeling, and simple and complicated features. The last is what the future looks like on NLP feature engineering. The goal of today is to showcase as many feature engineering methods as possible using almost one dataset, so we are not going to fit the model, we're not going to produce the results.

Problem One: Predict the Label of a Stack Overflow Question

I wanted to build an NLP model that would best resonate with developers. Stack Overflow is a great source for me to look for data. The training data we're going to use exists in Google BigQuery. It is also publicly available and can be downloaded directly from this cloud storage URL. The data contains questions posted by the users of the Stack Overflow website, as well as corresponding tags. We know what tags are; HTML is a tag, SQL is a tag, and Python is another tag, indicating what each question is about.

We are all Stack Overflow users, we know that one question normally has more than one tag. However, in this dataset, one question has only one tag, and this is a classic example of a multi-class text classification problem. When we formulate this machine learning problem, we can see something like this. Our machine learning problem is best formulated as a multi-class, single-label text classification problem, in which we assign a given Stack Overflow question to one of a set of target labels. A single data point looks like this, this is a question that is tagged as Python or labeled as Python, and this is an overview of the dataset.

The posts are the questions posted by Stack Overflow users. The tags are the labels, so the post column is the input text and we are going to do feature engineering on this input text, starting from tokenization and TF-IDF. There are several ways we can tokenize a post or a question, such as unigrams. I'm sure some of you have heard about bigrams, trigrams, and character n-grams. These are the examples of how each n-gram works on a simple sentence. We let the machine split the sentence like this in order for it to understand the meaning. After tokenization, these are the correlations we would expect to see. For example, the top three most correlated unigrams and bigrams for Python, and the top three most correlated unigrams and bigrams for SQL.
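To make the n-gram tokenization concrete, here is a minimal sketch on one short sentence. The helper names `word_ngrams` and `char_ngrams` are made up for this illustration, and whitespace splitting is a simplifying assumption; real tokenizers also handle punctuation.

```python
def word_ngrams(text, n):
    """Split text on whitespace and return every run of n consecutive words."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """Return every run of n consecutive characters."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

sentence = "how to plot a bar graph"
unigrams = word_ngrams(sentence, 1)   # single words
bigrams = word_ngrams(sentence, 2)    # pairs of adjacent words
trigrams = word_ngrams(sentence, 3)   # triples of adjacent words
char_trigrams = char_ngrams("plot", 3)

print(unigrams)
print(bigrams)
print(trigrams)
print(char_trigrams)
```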

TF-IDF is a relatively old weighting scheme, but it is simple and effective, so it is still the number one most popular starting point for many other more recent algorithms. I'm using a small example to quickly explain how TF-IDF is calculated. Let's say we have five documents in total in our corpus, or five posts in total, or I'll say five questions in total. It's the same thing. We want to know which document is more related to Python. In the term frequency column, we count the number of times the word Python has appeared in that document. Python appeared in the first document once, in the second document once, in the third document twice, and Python did not appear at all in the fourth and fifth documents. In the inverse document frequency column, we compute the total number of documents, divided by the number of documents that contain Python. We have five documents in total, three of them contain Python, so that is five divided by three.

TF-IDF is the product of these two. After TF-IDF computation, we can see that the third document has the highest TF-IDF score for Python, so the third document has the highest probability of being labeled as Python. The fourth and fifth documents have the lowest probability of being labeled as Python. After this series of computations, our original dataset will be converted to something like this. Each row is a document, or is a post, or is a question, and each column is a token. A TF-IDF score is computed for each token in each document. Coding for TF-IDF is very simple; we just use Scikit-learn's TfidfVectorizer class, and we do data pre-processing and feature engineering inside of this class.
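The five-document example can be worked through directly. This sketch follows the simplified definition used in the talk (inverse document frequency as total documents divided by documents containing the term); library implementations such as Scikit-learn's typically use a logarithm and smoothing, so their numbers will differ. The variable names are illustrative.

```python
# Number of times "python" appears in each of the five documents.
term_counts = [1, 1, 2, 0, 0]

n_docs = len(term_counts)
docs_with_term = sum(1 for count in term_counts if count > 0)

# Simplified inverse document frequency from the talk: 5 / 3.
idf = n_docs / docs_with_term

# TF-IDF is the product of term frequency and inverse document frequency.
tfidf = [tf * idf for tf in term_counts]

# 1-based index of the document most related to "python".
best_doc = max(range(n_docs), key=lambda i: tfidf[i]) + 1

print(tfidf)
print(best_doc)
```

The third document gets the highest score and the last two get zero, matching the ranking described above.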

The result we get is a huge sparse matrix, which is then fed into a classical shallow classification algorithm, such as linear Support Vector Machine or Logistic Regression. There's a problem, though. TF-IDF enables us to directly use word vectors to perform this downstream classification. However, this direct approach suffers from the sparsity of the word vectors, because in order to form a meaningful representation of words, we typically have to keep hundreds or thousands of dimensions of word vectors, with most of them having a value of zero. With such a high-dimensional feature space and a huge dataset to process, the training gets very challenging. Therefore, we need a better representation of the words to solve this problem. That is Word2vec.

Word2vec is a little bit complicated, but it is not deep learning. However, it turns input text into a numerical form that a deep neural network can process as input. The basic idea is that Word2vec should be able to preserve most of the relevant information about the text while having a relatively low dimensionality, for better machine learning performance than our previous TF-IDF. Without going into math, I'm trying to give some simple intuitions on what's going on here. Word2vec has two distinct models, one is called CBOW, Continuous Bag of Words, and another is called skip-gram. This is how CBOW works. Let's say we have a phrase, "How to plot a dataframe bar graph." When we use CBOW, the neighboring words, "how to plot a" and "bar graph," are used to predict "dataframe"; that is, predicting the current word given the neighboring words.

On the other hand, this is how skip-gram works. We use the same phrase, "How to plot a dataframe bar graph." When we use skip-gram, the word "dataframe" is used to predict "how to plot a" and "bar graph." That is predicting the neighboring words given the current word. The architectures of these two models look like this, they are opposite. Generally, the results of these two models are similar. The goal of Word2vec is to learn high quality word vectors from a huge amount of data with billions of words, and with millions of words in the vocabulary. In the resulting word representations, similar words tend to be close to each other, and words can have multiple degrees of similarity.
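The difference between the two models is easiest to see in the training examples each one builds from the same phrase. The sketch below only generates those examples and does not train anything; the function names and the window size of 2 are assumptions made for the illustration.

```python
def cbow_examples(tokens, window):
    """CBOW-style examples: (neighboring words, current word to predict)."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, center))
    return examples

def skipgram_pairs(tokens, window):
    """Skip-gram-style pairs: (current word, one neighboring word to predict)."""
    pairs = []
    for context, center in cbow_examples(tokens, window):
        pairs.extend((center, word) for word in context)
    return pairs

tokens = "how to plot a dataframe bar graph".split()
cbow = cbow_examples(tokens, window=2)
skip = skipgram_pairs(tokens, window=2)

print(cbow[4])   # neighbors of "dataframe" and the word itself
print([pair for pair in skip if pair[0] == "dataframe"])
```

With a window of 2, CBOW sees the neighbors of "dataframe" as one input and predicts the word, while skip-gram turns the same information into four separate (word, neighbor) pairs.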

Now we are going to train the Word2vec embedding on our Stack Overflow question dataset. This is what I did - each word is mapped to a vector with 100 dimensions. After that, I use a t-SNE plot to depict what word embeddings it has learned. We can see that the keywords related to SQL, such as inner join, select, MySQL, query, tables, rows, columns, are close to each other. The keywords related to positions, such as right, left, top, are grouped together. The keywords related to accounts, such as connection, password, username, login, email address, are clustered together. With a dataset this large, it's a little bit difficult to make easy-to-read visualizations, so we use our Word2vec model to look up the most similar words for any given word.

We use our Word2vec model we just created to look for the most similar word to Python, and the most similar word to SQL. Word2vec techniques have many applications in our lives. Maybe you do not know it; you do not even notice. For example, Airbnb developed something called listing embedding, to improve similar listing recommendations as well as real-time personalizations in search ranking. They wrote a blog post to give the details on how they did it. Uber is inspired by Word2vec and GloVe. Uber has designed its own query2vec model, and they use their eaters' search behavior data. Eventually, they build a query understanding engine. They also wrote a blog post about it.

About three years ago, Facebook released FastText, which is an extension of Word2vec. The advantage of FastText is that it can generate better word embeddings for rare words, and it can also construct a word vector for a word even when this word does not exist in our training corpus. As an example, I asked Word2vec, "Give me the most similar words to Word2vec." It complained, but when I asked FastText the same question, "Give me the most similar word to Word2vec," it gave me results which make a lot of sense. The most similar word to Word2vec is word, I like that. The code for Word2vec and FastText is pretty much the same if you are using Gensim.

Problem Two: Topic Modeling (Unsupervised)

Unfortunately, most of the time we do not have labeled data. Labeling data is a very expensive task, especially in the field of NLP, because labeling mostly requires skilled linguists. We train supervised learning on labeled data and unsupervised learning on unlabeled data. Now, I'm going to talk about an unsupervised learning task that is suitable for NLP, called topic modeling.

We are still using the previous Stack Overflow question dataset. This time I removed the tags, that is, I removed the labels. We are going to use a topic modeling method called LDA to extract the topics of the posts, or topics of the questions. Without going into math, let's look at the code directly, and there are a couple of ways you can do this. One is using Gensim's LDA model, and another is what I'm using now, which is Scikit-learn's Latent Dirichlet Allocation class. There's only one thing we need to pay attention to, which is that we need to manually set the number of topics. This requires some understanding of the data, and also some trial and error, and we give our best guess. Our number of topics is set to 20, because I'm cheating: I know that in our original dataset, we have a total of 20 labels, but in reality, in the real world, we will not know. We would have to do multiple trials, and give our best guess on each try. Every time we try and get it wrong, we learn something.

Let's visualize how my 20 topics look. They look very good because there's no overlapping. This means the 20 topics can separate themselves very well, and we can look up the words in each topic, the top word in each topic used. We can see that some topics are defined very well. For example, topic zero, if we look at the words in topic zero, we can figure out that it more or less has something to do with SQL. If you look at the words in topic one, we can guess that it has something to do with Python. But not every topic can be defined very well. The reason we say that topic modeling is part of feature engineering is because it is rarely the end of data analysis, because topic modeling is good for data exploration. When you have a completed dataset, and you have no idea what structure you can expect from the dataset, you use a topic modeling to give you a high-level idea of what structure you can find from the dataset.

Another scenario in which to use topic modeling is when we do not have the time and resources to construct supervised learning, which is based on labeled data, because we don't have labeled data. We use LDA to give us an idea of what the next steps are, and what kind of labels we can expect if we want to label data in the next steps. Another scenario is that topic modeling can be used as an extra feature to improve your model's accuracy. I'll give you an example of how we used topic modeling in one of our most recent projects. In that project, we needed to do some clustering of hotels in, let's say, New York, according to hotel descriptions. For example, the Marriott Marquis, New York Times Square. We want to find the top 10 most similar hotels to this hotel, according to descriptions. The description is the only data we have, we don’t have anything else. Let's say out of 1,000, or say 100 hotels in New York, we want to find the top 10 most similar hotels to this hotel. It's easy, if you have done NLP, you know how to do that. You just compute cosine similarity on top of the word vectors. It sounds easy, and we did exactly that, but we found a problem.

The problem is that actually most of hotels’ descriptions would say something like, "Our check-in time is 3:00 p.m., check-out time is 11:00 a.m.," and they will say something like, "We accept all major credit cards, Visa card, American Express, MasterCard," and they will say something like, "The minimum check-in age is 18 years old," and it will always say something like, "No smoking," plus many payment policies or privacy policies. This is a problem because if two hotels both accept Visa cards, it does not mean that they're similar hotels. If two hotels have the same check-in time or check-out time, it does not mean they're similar, so we needed to find a way to filter out those sentences in the descriptions, or to get rid of those sentences in the descriptions.

We don't want to manually do it, so we tried topic modeling. When we applied topic modeling to a few hundred hotel descriptions, we found that we could extract one topic whose keywords are the ones we do not want. For example, the keywords that topic has are check-in, check-out, payment, Visa card, MasterCard, credit card. So we know that this is the topic that we do not want to appear in our descriptions. We did a series of data manipulations, and we found all the sentences that have that topic as the dominant topic. You can imagine what those sentences are. Those sentences are exactly, as I have said earlier, “accepts all major credit cards” or “check-in” and “check-out time”. We were able to filter out or get rid of all those sentences with that topic as the dominant topic. After this cleaning, we got rid of the useless sentences, then we could do our clustering, and the cosine similarity calculation became more meaningful, because we were able to do this calculation on top of the clean data to produce more meaningful results. Therefore, topic modeling is part of feature engineering, and not the end of data analysis.
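The filtering step can be sketched as follows. The sentences and topic weights here are invented for illustration; in practice the weights would come from a fitted topic model, and which topic index corresponds to policies and payment is something you find by inspecting the topic keywords.

```python
# Hypothetical per-sentence topic weights, as a topic model might assign them.
# Topic 0 ~ location and amenities, topic 1 ~ policies and payment (both invented).
sentences = [
    ("Steps from Times Square with skyline views.", [0.90, 0.10]),
    ("Check-in time is 3:00 p.m., check-out time is 11:00 a.m.", [0.20, 0.80]),
    ("We accept all major credit cards.", [0.10, 0.90]),
    ("Rooftop bar and a heated indoor pool.", [0.85, 0.15]),
]
UNWANTED_TOPIC = 1

def dominant_topic(weights):
    """Index of the topic with the largest weight for one sentence."""
    return max(range(len(weights)), key=lambda t: weights[t])

# Keep only sentences whose dominant topic is not the unwanted one.
kept = [text for text, weights in sentences
        if dominant_topic(weights) != UNWANTED_TOPIC]
cleaned_description = " ".join(kept)

print(cleaned_description)
```

Similarity between hotels would then be computed on `cleaned_description` rather than on the raw text.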

Problem Three: Auto Detect Duplicate Stack Overflow Questions

We go on to the next problem. Let's say you are tasked to build a model to detect duplicate Stack Overflow questions, that is, questions that have the same intent and should be answered once and once only. This is a duplicate question on the Stack Overflow website, and this is how this question is treated on the website. It's just marked as a duplicate. In order to demonstrate, I manually created a toy dataset with only 10 rows. In this dataset, the features are question one and question two. The label is duplicate: one means it's a duplicate, zero means it's not a duplicate. I manually labeled this toy dataset, and I depended on my own judgment to determine whether the questions are duplicates or not. Our problem becomes a binary classification problem in which we classify whether question pairs are duplicates or not.

There are many ways we can do feature engineering on this problem. Of course, we can go back to our older TF-IDF, and we can use Word2vec, FastText, or we can use our domain knowledge, creativity, and brainstorming to create many features like this. We can create meta-features, such as the word count of each question in the pair, the character length of each question in the pair, and the number of common words between the two questions in the pair. The more common words there are, the more similar those two questions are. We can also create some fuzzy string matching-related features. By the way, fuzzy string matching tools use Levenshtein distance to calculate the difference between two strings. In Python, we use a library to do that, the library called FuzzyWuzzy, and it's amazing. Check it out.
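A rough sketch of these pair-level features is shown below. The plain dynamic-programming `levenshtein` function is a stand-in written for this illustration, not the FuzzyWuzzy API (which reports similarity ratios rather than raw distances), and `pair_features` is a hypothetical helper name.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

def pair_features(q1, q2):
    """Meta-features and an edit-distance feature for one question pair."""
    w1, w2 = q1.lower().split(), q2.lower().split()
    return {
        "q1_word_count": len(w1),
        "q2_word_count": len(w2),
        "q1_char_length": len(q1),
        "q2_char_length": len(q2),
        "common_words": len(set(w1) & set(w2)),
        "edit_distance": levenshtein(q1.lower(), q2.lower()),
    }

features = pair_features("how to sort a list in python",
                         "how to sort a python list")
print(features)
```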

After we obtain the Word2vec vectors, we can compute many more distance-related features on top of these Word2vec vectors. For example, we can compute Word Mover's Distance. By the way, Word Mover's Distance is a method that allows us to assess the distance between two documents in a meaningful way, even when these documents have no words in common. Here, the link is a tutorial on how to compute Word Mover's Distance with the Gensim library. After that, we can compute more distances on top of the word vectors, such as the cosine distance between the vectors of these two questions, the Manhattan distance between the vectors of those two questions, and many more. All those distance computations can be done with the SciPy library.
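As an illustration of the last two distances, here is a minimal standard-library sketch. The three short vectors are made-up stand-ins for question vectors (for example, an average of the word vectors in each question, which is one common choice and an assumption here); `scipy.spatial.distance` offers equivalent functions.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Toy 3-dimensional "question vectors" (real embeddings have ~100 dimensions).
q1_vec = [1.0, 0.0, 2.0]
q2_vec = [2.0, 0.0, 4.0]   # same direction as q1_vec
q3_vec = [0.0, 3.0, 0.0]   # orthogonal to q1_vec

print(cosine_distance(q1_vec, q2_vec), manhattan_distance(q1_vec, q2_vec))
print(cosine_distance(q1_vec, q3_vec), manhattan_distance(q1_vec, q3_vec))
```

Cosine distance ignores vector length, so the first pair comes out as near-identical even though its Manhattan distance is not zero; using both gives the classifier complementary signals.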

The Future: Automated Feature Engineering

What does the future look like in NLP feature engineering? Deep learning claims end-to-end automated feature engineering. In the future, with AutoML, some of our work will be more and more automated, but not removed. The problem with deep learning is, we all know that a deep learning model is kind of a black box. What comes after the AutoML model? For example, how do we debug or explain the end-to-end AutoML models we just created? Not only explain to our colleagues, or our managers, but explain to ourselves?

LIME and SHAP are two Python libraries for machine learning model explainability. For example, I train a deep neural network on our Stack Overflow dataset, and I use LIME to explain the model to myself. This is a post, a question, where the true label is C, and our model predicts with a probability of 97% that it’s C. We can see that the word printf makes the highest positive contribution to the label C. If we remove printf from the document, from our question post, then we would expect the model to still predict the label C, but with a probability of about 97% minus 28%, that is, roughly 69%.

Also, SHAP explains the neural network model I created. SHAP works like this: it shows us the highest-magnitude words in our model, broken down by label. Here, PHP is the biggest signal word in our model, and makes the biggest contribution to the label PHP, of course. But in the meantime, PHP is likely to be a negative signal word for other labels, because it is unlikely to see the word PHP used in a Python question. These are two Python libraries. Their purpose is to explain the deep neural network black box model for us.

To sum up, good features are the backbone of any machine learning model, and good feature creation often needs domain knowledge, creativity, and lots of time. Here are some resources on feature engineering, and papers on GloVe and Word2vec.

Questions & Answers

Participant 1: Can you talk a little bit about the challenges of using NLP in a real-world application? If you have to build a new product feature that leverages NLP, what are you thinking about in terms of challenges and complexity?

Li: Challenges in NLP - there are a lot of them. I'll give you an example, I used to work for an accounting software company. We were going to automate a feature in our software, it's auto-categorization. For example, you're a user of this accounting software and you connect the accounting software to your bank account. For each transaction coming into the software, for example transactions coming from McDonald's, we want to be able to automatically categorize them into meals and entertainment. Transactions coming from Staples, you want to be automatically categorized into office supply. It's auto-categorization. I think most accounting software companies are doing this.

The problem actually is not the algorithm’s problem at all. When I just look at this task, I found the problem was incorrectly labeled data, because the data was labeled by users. As a reasonable user, you know that you have a transaction coming from a restaurant; you put your meal under entertainment. But some users will think the transaction comes from this McDonald's at the gas station. “I'm going to put it on fuel charges,” this is very subjective. A transaction coming from Staples, some people will say, "We will go to the office supply," and some people will say, "No, office expense."

The biggest challenge here is that when I first looked at this task, we had 250 labels. Many labels were not only incorrectly assigned, but also very ambiguous. For example, office expense and office supply. We're not going to keep both; we just consolidate them because they mean the same thing. Also, a lot of labels were "other" labels: other office supplies, other computer expenses. Why would they need the "other" ones? We consolidate. The biggest challenge is that after your text classification, if you find your result is bad, like 40% accuracy, it's not your algorithm’s problem; you've got to look at the data. If two labels are very ambiguous, there's no way you can separate them out. A human cannot separate them out; in this Stack Overflow example, I cannot separate out MySQL and SQL. They can be the same.

Another is incorrectly labeled data. You have to hire somebody, like an intern, or do it yourself, to correct those labels. A lot of times I spent entire afternoons correcting labels or just labeling data. I make mistakes too, and sometimes after three hours I feel dizzy. The next day I look at it and think, “Did I do this?” If I label the data together with my colleagues, we have different ideas. So the biggest challenge is not an algorithm problem, because if your data is clean, it's perfect data, we most likely [inaudible 00:42:26] post can solve all the problem. You've got to check the data; most likely, it's the data's problem.

Participant 2: I'd like to know if you have any comments on the most recent breakthroughs in NLP techniques, like BERT or the paper just published two weeks ago? We have XLNet, all these deep learning things, compared to all this restructuring, anything you have just mentioned in your talk.

Li: I have just started getting my hands dirty with BERT, to figure out the difference between Word2vec and BERT, or the difference between GloVe and BERT. Let's put it this way, BERT is one of the most recent papers coming out of Google. Word2vec and GloVe are word-based models. This means the inputs are words and the outputs are word embeddings, and they do not take the order of the words into account. BERT is a more context-dependent model, so it will give word embeddings that depend on where the word is in the sentence. I'll give you a borrowed example, a sentence is, "He went to the jail cell with his cell phone.” When you use BERT, the two occurrences of the word "cell" will get different word embeddings. If we use Word2vec, the two occurrences get the same word embedding. This is the main difference between Word2vec and BERT.

Every time a new model comes out, we will try it first, but when you just start applying NLP at your company, and you have not done it before, you basically start from count vectorization or TF-IDF vectorization. This is how I learned. For lots of text problems, if you apply TF-IDF or CountVectorizer plus XGBoost, your problem is solved. With clean data, your problem is solved. If your problem is not solved, the basic idea is still that when you just start doing NLP, you start from the simplest model, which is the most understandable and the most explainable one; and when you go to a higher level, you look at more recent models, like conversational AI. You may want to develop a conversational AI for your company, but always start with the most stupid model, as I learned to.

Participant 3: I was wondering how do you combat a lack of data, or what models deal best with not having so much data to train on? How do you combat that in the real world?

Li: Combat a shortage of data, as in a shortage of labeled data?

Participant: No. Shortage of data in general.

Li: Then you can start collecting data. We do have this kind of problem, where we do not have data. Then, because we do not have data, there are some machine learning models or some products we cannot build. I would just tell the manager, "We need data for that." Just collect data. When we say collect data, we do not mean rows; we mean columns. We want fat data, not thin data. Columns mean more features. When we say we don't have data, it’s most likely that we don't have enough features in the data. Most companies probably have lots of rows already.

Moderator: That is a common challenge for some of the machine learning tasks, and some people would use a crowdsourcing way of collecting more data, so you have to be a little bit creative. But yes, it's very challenging. If there's not enough data, it's hard to build a machine learning model.




Recorded at:

Aug 16, 2019