
Data Preprocessing vs. Data Wrangling in Machine Learning Projects

Key Takeaways

  • Data preparation takes 60 to 80 percent of the whole analytical pipeline in a typical machine learning / deep learning project.
  • Various programming languages, frameworks and tools are available for data cleansing and feature engineering; they overlap and come with trade-offs.
  • Data wrangling is an important add-on to data preprocessing; it’s best used within a visual analytics tool to avoid breaking the analysis flow.
  • Visual analytics tools and open source data science components like R, Python, KNIME or RapidMiner are complementary.
  • Keeping the number of components small speeds up a data science project. Therefore, leveraging a streaming ingestion framework or streaming analytics product for the data preparation steps as well can be a great alternative.

Machine learning and deep learning projects are gaining more and more importance in most enterprises. The complete process includes data preparation, building an analytic model and deploying it to production. This forms an insights-action loop that continuously improves the analytic models. Forrester calls the complete process, and the platform behind it, the Insights Platform.

A key task when you want to build an appropriate analytic model using machine learning or deep learning techniques is the integration and preparation of data sets from various sources such as files, databases, big data storage, sensors or social networks. This step can take up to 80 percent of the whole analytics project.

This article compares alternative techniques to prepare data, including extract-transform-load (ETL) batch processing, streaming ingestion and data wrangling. The options and their trade-offs are discussed using different advanced analytics technologies and open source frameworks such as R, Apache Spark, KNIME or RapidMiner. The article also discusses how data preparation relates to visual analytics, and best practices for how different user roles such as the data scientist or business analyst should work together to build analytic models.

Data Preparation = Data Cleansing + Feature Engineering

Data Preparation is the heart of data science. It includes data cleansing and feature engineering. Domain knowledge is also very important to achieve good results. Data preparation cannot be fully automated; at least not in the beginning. Often this takes 60 to 80 percent of the whole analytical pipeline. However, it’s a mandatory task to get the best accuracy from machine learning algorithms on your datasets.

Data Cleansing puts data into the right shape and quality for analysis. It includes many different functions, for example the following (a minimal R sketch of a few of them follows the list):

  • Basics (select, filter, removal of duplicates, …)
  • Sampling (balanced, stratified, ...)
  • Data Partitioning (create training + validation + test data set, ...)
  • Transformations (normalisation, standardisation, scaling, pivoting, ...)
  • Binning (count-based, handling of missing values as its own group, …)
  • Data Replacement (cutting, splitting, merging, ...)
  • Weighting and Selection (attribute weighting, automatic optimization, ...)
  • Attribute Generation (ID generation, ...)
  • Imputation (replacement of missing observations by using statistical algorithms)
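
The sketch below illustrates a few of these functions in base R. The data frame 'passengers' and its columns Age and Fare are hypothetical names used only for illustration, not a dataset from this article:

### Minimal sketch of a few data cleansing functions in base R
### (assumes a hypothetical data frame 'passengers' with numeric columns Age and Fare)

# Basics: remove exact duplicate rows
passengers <- unique(passengers)

# Data partitioning: 70 / 30 split into a training and a test set
set.seed(42)
train.idx <- sample(nrow(passengers), size = floor(0.7 * nrow(passengers)))
train <- passengers[train.idx, ]
test  <- passengers[-train.idx, ]

# Imputation: replace missing ages with the median age of the training set
train$Age[is.na(train$Age)] <- median(train$Age, na.rm = TRUE)

# Transformation: standardise the fare column (zero mean, unit variance)
train$Fare.scaled <- as.numeric(scale(train$Fare))

# Binning: discretise age into groups
train$Age.bin <- cut(train$Age, breaks = c(0, 12, 18, 40, 65, Inf))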

Feature Engineering selects the right attributes to analyze. You use domain knowledge of the data to select or create attributes that make machine learning algorithms work. The feature engineering process includes:

  • Brainstorming or testing of features
  • Feature selection
  • Validation of how the features work with your model
  • Improvement of features if needed
  • Return to brainstorming / creation of more features until the work is done

Note that feature engineering is already part of the modelling step to build an analytic model, but it also leverages data preparation features (such as extracting parts of a string).
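
To make this loop concrete, here is a minimal, hypothetical R sketch of a single feature-engineering iteration on a passenger dataset such as the Kaggle Titanic data used later in this article: a candidate feature (the title embedded in the passenger name) is extracted from a string and quickly checked against the label before being added to a model. The data frame name 'data.combined' and the regular expression are illustrative assumptions:

### Hypothetical sketch of one feature-engineering iteration
### (assumes a data frame 'data.combined' with the columns Name and Survived)

library(stringr)

# Brainstorm a candidate feature: the title embedded in the passenger name
data.combined$Title <- str_extract(data.combined$Name, "Mr\\.|Mrs\\.|Miss\\.|Master\\.")
data.combined$Title[is.na(data.combined$Title)] <- "Other"
data.combined$Title <- as.factor(data.combined$Title)

# Validate how the candidate feature relates to the label
table(data.combined$Title, data.combined$Survived)

# If the signal is weak, go back to brainstorming and try another candidate feature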

Both data cleansing and feature engineering are part of data preparation and fundamental to the application of machine learning and deep learning. Both are also difficult and time-consuming.

Data preparation occurs in different phases of an analytics project:

  • Data Preprocessing: Preparation of data directly after accessing it from a data source. Typically realized by a developer or data scientist for initial transformations, aggregations and data cleansing. This step is done before the interactive analysis of data begins. It is executed once.
  • Data Wrangling: Preparation of data during the interactive data analysis and model building. Typically done by a data scientist or business analyst to change views on a dataset and for feature engineering. This step iteratively changes the shape of a dataset until it works well for finding insights or building a good analytic model.

The Need for Data Preprocessing and Data Wrangling

Let’s take a look at the typical analytical pipeline when you build an analytic model:

  1. Data Access
  2. Data Preprocessing
  3. Exploratory Data Analysis (EDA)
  4. Model Building
  5. Model Validation
  6. Model Execution
  7. Deployment

Step 2 focuses on data preprocessing before you build an analytic model, while data wrangling is used in steps 3 and 4 to adjust data sets interactively while analyzing data and building a model. Note that these three steps (2, 3 and 4) can include both data cleansing and feature engineering.

The following screenshot shows the Google Search trends for the terms “Data Preparation”, “Data Preprocessing” and “Data Wrangling”. You can see that data wrangling is becoming more and more relevant these days:

Figure 1. Google Search Trends for “Data Preparation”, “Data Preprocessing” and “Data Wrangling”

‘Inline data wrangling’ is a special form of ‘data wrangling’ where you leverage the visual analytic tool not just for visualization and building the model, but also for direct interactive wrangling. This has huge advantages as the following graphic depicts:

Figure 2. Decoupled Data Preprocessing vs. Inline Data Wrangling

The steps in the analytical pipeline, including data preprocessing and data wrangling, are typically done by different types of users. The following are the user roles who participate in an analytics project:

  • Business Analyst: Expert in the business / industry with specific domain knowledge
  • Data Scientist: Expert in Mathematics, Statistics and Programming (data science / scripting); can write low-level code or leverage higher level tooling
  • Citizen Data Scientist: Similar to the data scientist, but more high level; needs to use higher level tooling instead of coding; depending on the ease of the tooling this can even be done by the business analyst
  • Developer: Expert in software development process (enterprise applications)

These users have to work together closely to be successful in a data science project (see also ‘How to Avoid the Anti-Pattern in Analytics: Three Keys for Machine Learning’ to understand the different roles better).

While this article focuses on data preparation, it’s important to understand its relation to visual analytics: a picture says more than a thousand words (or data rows), and humans can interpret a visualization far more easily than a large, unstructured dataset. For more details, see why you should use visual analytics to make better decisions. The leading visual analytics tools include Qlik, Tableau and TIBCO Spotfire.

How is visual analytics related to data wrangling? “When analysts are in the middle of discovery, stopping everything and going back to another tool is jarring. It breaks their flow. They have to come back and pick up later. Productivity plummets and creative energy crashes,” says the chief analyst of RITO Research.

Kaggle’s Titanic Dataset

The following sections show different alternatives for data preparation. The famous Titanic dataset from Kaggle will be used to show some practical examples. It is split into train and test data, and is used to build an analytic model that predicts which passengers are likely to survive or die:

Figure 3. Metadata of Kaggle’s Titanic Dataset

Figure 4. Example Lines of Kaggle’s Titanic Dataset

The initial dataset is not ready for building analytic models. It contains duplicates, missing values and single cells that combine several different pieces of information. Therefore, the initial dataset needs to be prepared well to produce the best results when applying a machine learning algorithm. Here are some examples of data cleansing and feature engineering (a short R sketch for some of them follows the list):

  • Creation of a new column via feature extraction: Extract the title prefix (e.g. Mr, Mrs, Miss, Master) from each passenger’s full name to infer the passenger’s sex
  • Creation of a new column via aggregation to see how many people were in each passenger’s travelling group: “family size = 1 + SibSp + Parch”
  • Creation of a new column via extraction of the first character of the 'Cabin' column to sort and analyse cabins
  • Removal of duplicates in the dataset, e.g. if a passenger appears in both the train and test data
  • Filling of empty cells via imputation to be able to process rows with missing data, e.g. age: replace ‘not available’ with the average age of all passengers or discretize it into bins; cabin: replace empty values with 'U' for unknown; or apply an advanced imputation method such as multiple imputation by chained equations (MICE)
  • Application of data science functions such as scale, normalize, PCA or Box-Cox to bring all data into a “similar shape” for a reasonable analysis
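
The sketch below covers a few of these steps; it assumes the combined Kaggle data is available in a data frame 'data.combined' with the column names from the metadata above, and the derived column names are illustrative:

### Hypothetical sketch of some of the cleansing and feature engineering steps above
### (assumes a data frame 'data.combined' with the Kaggle columns SibSp, Parch, Cabin, Age and Fare)

# Family size via aggregation: 1 + SibSp + Parch
data.combined$FamilySize <- 1 + data.combined$SibSp + data.combined$Parch

# First character of the cabin; empty cabins become 'U' for unknown
cabin.first <- substr(data.combined$Cabin, 1, 1)
cabin.first[is.na(cabin.first) | cabin.first == ""] <- "U"
data.combined$CabinDeck <- as.factor(cabin.first)

# Simple imputation: replace missing ages with the average age of all passengers
data.combined$Age[is.na(data.combined$Age)] <- mean(data.combined$Age, na.rm = TRUE)

# Bring numeric data into a similar shape, e.g. standardise the fare
data.combined$Fare.scaled <- as.numeric(scale(data.combined$Fare))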

The following section shows various programming languages, frameworks and tools for data preparation. Note that one size does not fit all problems, and there is considerable overlap between the different options. Thus, you can solve many problems with different alternatives, depending on user role and use case.

Data Preprocessing - by Data Scientist

Some programming languages are built explicitly for data science projects or have strong support for them, specifically R and Python. They include various implementations of machine learning algorithms, preprocessing functions such as filter or extract, and data science functions such as scale, normalize or shuffle. The data scientist needs to do relatively low-level coding for exploratory data analysis and preparation. In contrast to classical programming with Java or C#, you don’t need to write many lines of code with R or Python for data preprocessing; it’s more about understanding the data and having the experience to choose the right statistical concepts and algorithms for data preprocessing and building the analytic model.

These programming languages are built for the data scientist to prepare data and build analytic models, but not for enterprise deployment, i.e. applying the analytic model to new data at high scale and with high reliability. Therefore, commercial enterprise runtimes are available, too. Typically, they support the same source code, so that you don’t need to rewrite anything for enterprise deployment. For R, you can leverage the open source Microsoft R Open (formerly Revolution R) or TIBCO Enterprise Runtime for R. The latter has the advantage that it isn’t restricted by the GPL open source license, so that you can use R models in any embedded or external environment without licensing issues.

The following shows an extract from a great R tutorial where the Titanic dataset is preprocessed and analyzed with the basic R language:

### Data preprocessing with basic R language:

# str_split comes from the stringr package;
# data.combined is assumed to hold the combined train and test data
library(stringr)

# Survival is "yes/no"
# => Type change: No numeric value and corresponding data processing / analysis
data.combined$Survived <- as.factor(data.combined$Survived)

# Parse out last name and title from the full name
data.combined[1:25, "Name"]
name.splits <- str_split(data.combined$Name, ",")
name.splits[1]
last.names <- sapply(name.splits, "[", 1)
last.names[1:10]

# Feature engineering: Creating a family size feature
# (Siblings / Spouses + Parents / Children + 1)
temp.SibSp <- c(train$SibSp, test$SibSp)
temp.Parch <- c(train$Parch, test$Parch)
data.combined$FamilySize <- as.factor(temp.SibSp + temp.Parch + 1)

Beyond the basic support for preprocessing in these programming languages, many additional data science packages are available. For example, many data scientists leverage the very powerful caret package in R to ease data preparation and reduce lines of code. This package streamlines the model preparation and training process for complex regression and classification problems. It offers a generic interface in front of hundreds of existing R model implementations (with diverse APIs under the hood). The following snippet uses caret’s generic API to do some preprocessing on the Titanic dataset:

### Data preprocessing with the R caret package:

library(caret)

# Leverage caret's preProcess function to normalize data
preproc.data.combined <- data.combined[, c("ticket.party.size", "avg.fare")]
preProc <- preProcess(preproc.data.combined, method = c("center", "scale"))

# -> Instead of absolute values, you see relative values (i.e. their relation to each other):
postproc.data.combined <- predict(preProc, preproc.data.combined)

Another R package for data preprocessing is the dplyr package. It is not as powerful as the caret package and focuses “just” on manipulating, cleaning and summarizing data. Dplyr aims to provide a function for each basic verb of data manipulation (a short sketch follows the list):

  • filter() (and slice())
  • arrange()
  • select() (and rename())
  • distinct()
  • mutate() (and transmute())
  • summarise()
  • sample_n() (and sample_frac())
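
As a hedged illustration, the following sketch applies some of these verbs to the Titanic training data; the data frame name 'train' and the derived column names are illustrative assumptions:

### Hypothetical dplyr sketch on the Titanic training data
### (assumes a data frame 'train' with the Kaggle columns Age, SibSp, Parch, Pclass, Sex, Fare and Survived)

library(dplyr)

train %>%
  filter(!is.na(Age)) %>%                      # keep only rows with a known age
  mutate(FamilySize = 1 + SibSp + Parch) %>%   # feature: size of the travelling group
  group_by(Pclass, Sex) %>%                    # summarise per class and sex
  summarise(passengers = n(),
            survival.rate = mean(Survived),
            avg.fare = mean(Fare)) %>%
  arrange(desc(survival.rate))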

With this small set of verbs, dplyr is also very easy to learn and appropriate for many data manipulation tasks. The same is true for the data.table package, to name another one. As you can see, you have many options for preprocessing your datasets in R.

Preprocessing of Big Data Sets – by Data Scientist or Developer

Programming languages like R or Python can be used to prepare small datasets. However, they are not built for processing really big datasets; these days, we often have to analyze gigabytes, terabytes or even petabytes of data. Big data frameworks like Apache Hadoop or Apache Spark are built for elastic scalability and for preprocessing data where it is located, i.e. in the distributed cluster.

These big data frameworks also focus on "low level" coding and are much more complex to set up than R or Python environments. Commercial offerings such as Hortonworks, Cloudera, MapR or Databricks help with this. Typically, the data scientist works together with a developer to realize big data projects. The latter takes care of cluster setup, deployment and monitoring, while the data scientist leverages an R or Python API to write the code for data preprocessing and building the analytic model.

The source code often looks very similar to code written just with R or Python, but the data preprocessing is executed in parallel across the whole cluster. The following example shows data preprocessing and feature engineering of the Titanic dataset with Apache Spark, using Spark’s Scala API:

// Data preprocessing with Scala and the Apache Spark API:
// (df, trainDF and testDF are assumed to be DataFrames loaded from the Titanic CSV files)

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{avg, col, udf}

// Feature engineering: Creating a family size feature
// (Siblings / Spouses + Parents / Children + 1)
val familySize: ((Int, Int) => Int) = (sibSp: Int, parCh: Int) => sibSp + parCh + 1
val familySizeUDF = udf(familySize)
val dfWithFamilySize = df.withColumn("FamilySize", familySizeUDF(col("SibSp"), col("Parch")))

// Fill empty values for the age column with the average age of train and test data
val avgAge = trainDF.select("Age").union(testDF.select("Age"))
  .agg(avg("Age"))
  .collect() match {
    case Array(Row(avg: Double)) => avg
    case _ => 0.0
}

You could do the same with Spark’s Java or Python API, of course.

Data Preprocessing - by Citizen Data Scientist

Often, you want to be agile and produce quick results. This typically includes a lot of trial-and-error when preparing and analyzing datasets. Various data science tools exist with a focus on ease-of-use and fast time-to-market. These tools offer:

  • Development environment and runtime / execution server
  • Visual “coding” with drag & drop and code generation
  • Integration of data science frameworks like R, Python or even more powerful big data frameworks like Apache Hadoop, Apache Spark or H2O.ai under the hood

These tools can be used by the data scientist to speed up data preprocessing and model building. In addition, they can also be used by citizen data scientists who have less experience in data science projects, because the tool helps with data preprocessing and with applying implementations of machine learning algorithms. Some tools even include recommendations to support the user in preprocessing, displaying and analyzing datasets. The tools get more and more intelligent, with artificial intelligence under the hood.

The following shows examples of how to preprocess the Titanic dataset with two open source data science tools, KNIME and RapidMiner:

Preprocessing of the Titanic Dataset with KNIME

Preprocessing of the Titanic Dataset with RapidMiner

Instead of writing source code in R or Scala as seen before, you use the visual IDE to configure preprocessing. This makes data preparation and analysis easier for most people, and also allows easier maintenance and hand-over to others.

Data Wrangling - by Business Analyst or Citizen Data Scientist

Data wrangling (also sometimes called data munging) is a simple, intuitive way of preparing data with a graphical tool. The focus of these tools is on ease-of-use and agile data preparation. Therefore, it is not necessarily done by a developer or data scientist; the tools are built for everybody, including business analysts and citizen data scientists. Examples of data wrangling tools are DataWrangler and Trifacta Wrangler.

Trifacta for Data Wrangling

Note that these tools are not as powerful as data preprocessing frameworks and are therefore often used for the last mile of data preparation. They do not replace other integration options such as ETL (Extract-Transform-Load) tools or data preprocessing with R, Python, KNIME, RapidMiner or the like.

As discussed in the introduction, data wrangling in its own tool can have some disadvantages, as it is decoupled from the actual data analysis. Data wrangling within visual analytics tooling allows inline data wrangling during exploratory analysis of the data. It can all be done by a single user with a single tool. For example, see TIBCO Spotfire, which combines visual analytics with inline data wrangling (and other data science features to build analytic models):

Inline Data Wrangling in the Visual Analytics Tool TIBCO Spotfire

Data wrangling tools and visual analytics tools with inline data wrangling can be used by every user role (business analyst, citizen data scientist or developer) to speed up data preparation and data analysis.

So far, the article has focused on data preparation for building machine learning models. You can use programming languages like R or Python, data science tools like KNIME or RapidMiner, data wrangling with DataWrangler or Trifacta, or inline data wrangling via TIBCO Spotfire. Often, you need a step before all of this: you need access to all your data in a more or less consolidated data source (such as a relational database, data warehouse or big data cluster). Therefore, the following two sections briefly explain ETL and streaming analytics tools for data ingestion, which typically also include parts of data preparation, especially data aggregation and data cleansing.

ETL (Extract-Transform-Load) and DQ (Data Quality) - by Developer

ETL tools are built for the developer to integrate various data sources, including many legacy and proprietary interfaces with very complex data structures (such as Mainframe or EDIFACT interfaces). This also includes data cleansing (often called ‘Data Quality’ tooling in this context), with a focus on ease-of-use and enterprise deployments through visual coding (similar to data science tools like KNIME or RapidMiner, but with a focus on ETL and data quality). They also support big data frameworks like Apache Hadoop and Apache Spark. In addition, they offer out-of-the-box support for quality improvements such as address validation. ETL and DQ are usually realized as long-running batch processes, which can have negative implications if you need to act on real-time data to build your analytic model.

Some examples of ETL and DQ tooling are the open source tools Pentaho and Talend, or the proprietary vendor Informatica. The market is shifting towards simpler web user interfaces so that other personas can also perform some basic tasks.

Data Ingestion and Streaming Analytics - by Developer

Data ingestion and streaming analytics tooling can be used to ingest and preprocess data either in batch or in real-time streams. The following picture shows a typical streaming flow, including data ingestion, preprocessing, analytics, processing and output:

Steps of a Streaming Analytics Flow

Various frameworks and tools are available for this, and all of them support big data frameworks like Hadoop or Spark in one way or another.

For more details, refer to a comparison of streaming analytics frameworks, products and cloud services.

The huge advantage of using these tools, including ETL, is that you can leverage the same tool or framework for data preprocessing (of historical data) and for real-time processing (of new data, to apply analytic models while the data is in motion). This might therefore be a great option to keep the tooling portfolio small and benefit from one tool for ETL / ingestion and real-time processing. The following shows an example of using TIBCO StreamBase for data preprocessing of the Titanic dataset:

Streaming Preprocessing of the Titanic Dataset

As with data ingestion and ETL tooling, the market for streaming analytics is shifting towards simpler web user interfaces so that other personas can also perform some basic tasks. This will not replace existing tools for more advanced use cases, but it adds an option to deploy some rules, correlations or analytic models more easily and directly by a business analyst or data scientist, without the help of a developer.

Data Preparation is Key for Success in Machine Learning Projects

Building an analytic model with machine learning or deep learning techniques is not easy. Data preparation takes 60 to 80 percent of the whole analytical pipeline. Various programming languages, frameworks and tools are available for data cleansing and feature engineering, with overlaps and trade-offs between them.

Data wrangling is an important add-on to data preprocessing. It is best used within a visual analytics tool to avoid breaking the analysis flow. Visual analytics tools and open source data science components like R, Python, KNIME or RapidMiner are complementary.

However, avoiding numerous components speeds up a data science project. Therefore, leveraging a streaming ingestion framework or streaming analytics product also for data preparation steps might be a great alternative. You write preprocessing steps once, and use them in batch processes for historical data to build an analytic model and for real time processing to apply the built analytic model to new events.

Learn more about data preparation for data science projects by taking a look at these slides and video recording, which walk you through the content of this article in more detail.

About the Author

Kai Wähner is Technology Evangelist and Community Director for TIBCO Software - a leading provider of integration and analytics middleware. His main areas of expertise are Big Data, Advanced Analytics, Machine Learning, Integration, SOA, Microservices, BPM, Cloud, the Internet of Things and programming languages such as Java EE, Groovy or Golang. He regularly writes about new technologies in articles and conference talks and on his blog.
