Conquering the Challenges of Data Preparation for Predictive Maintenance


Key Takeaways

  • Machine learning (ML) plays a significant role in the industrial IoT (IIoT) area for data management and predictive analytics.
  • Predictive maintenance (PdM) applications aim to apply ML on IIoT datasets in order to reduce occupational hazards, machine downtime, and other costs.
  • Learn about the data preparation challenges faced by industrial practitioners of ML and the solutions for data ingest and feature engineering related to PdM.
  • Data ingestion processes can be easier to develop and manage using a dataflow management tool such as StreamSets or Apache Nifi.
  • Long Short-Term Memory (LSTM) algorithms are commonly used to forecast rare events in time series data. Since failures are generally very rare in PdM applications, LSTM algorithms are suitable for modeling.


Machine learning (ML) has made it possible for technologists to do amazing things with data. Its arrival coincides with the evolution of networked manufacturing systems driven by IoT, also known as industrial IoT (IIoT), which has led to an exponential growth in the data available for statistical modeling.

Predictive maintenance (PdM) applications aim to apply ML on IIoT datasets in order to reduce occupational hazards, machine downtime, and other costs by detecting when machines exhibit characteristics associated with past failures. In doing so, PdM provides factory operators with information that can help them perform preventive or corrective actions, such as:

  • run machines at lower speed or lower pressure to postpone total failure,
  • have spare equipment brought on site, and
  • schedule maintenance at convenient times.

Implementing PdM involves a process that starts with data preparation and ends with applied ML. Practitioners know well that the bulk of the effort required for effective ML deals with data preparation, and yet, those challenges continue to be underrepresented in ML literature, where authors tend to demonstrate concepts with contrived datasets.

In this article, I hope to address some of the most difficult data preparation challenges faced by industrial practitioners of ML by discussing solutions for data ingest and feature engineering that relate to PdM.

Data Ingest

The first step required for PdM involves data acquisition. Industrial instrumentation is most often associated with measurements of physical quantities such as vibration, infrared heat, electrical current, metal particles in grease, etc. This data typically originates from sensors attached to programmable logic controllers within an industrial control network. Data gateways that bridge control and corporate networks facilitate access to that data via developer-friendly protocols, such as REST and MQTT. It’s also important to consider out-of-band data sources, such as operator logs or weather data, because they can also contain signals that correlate to failure events. Figure 1 below illustrates the interconnections between these types of data assets.

Figure 1. Interconnections between IIoT Data Assets 

Data Pipeline Design

Data ingestion is accomplished by processes that continuously collect and store data. These processes can be implemented as custom applications but are generally much easier to develop and manage using a dataflow management tool, such as StreamSets or Apache Nifi. These tools provide a number of advantages for creating and managing data pipelines, such as:

  1. Simplifying pipeline development and deployment. Integrated development environments (IDEs) such as those provided by StreamSets and Nifi help to minimize the code required for creating pipelines. They also integrate useful utilities for dataflow monitoring and debugging. In addition, these tools support DevOps processes with capabilities such as flow versioning and continuous delivery.
  1. Preventing failures due to scale, schema drift, and topology migrations. Dataflow management tools can act as an important agent for change within a distributed system. They provide reasonable ways of scaling to handle increased load. For example, StreamSets leverages Kubernetes for elastic scalability and Nifi is expected to do the same in the near future. Data sources can also introduce breaking changes when evolving baseline schemas. StreamSets and Nifi enable you to handle schema drift with data validation stages that redirect or reformat messages on the fly. Infrastructure topologies can also change over the life of an application. StreamSets and Nifi enable you to define topology-agnostic dataflows that can run across edge, on-prem, cloud, and hybrid-cloud infrastructure without sacrificing the resiliency or privacy of data.

To give you an idea of what dataflow management tools do, I’ve prepared a simple StreamSets project that you can run on a laptop with Docker. This project demonstrates a pipeline that streams time-series data recorded from an industrial heating, ventilation, and air conditioning (HVAC) system into OpenTSDB for visualization in Grafana.

  1. Create a docker network to bridge containers:
    docker network create mynetwork
  1. Start StreamSets, OpenTSDB, and Grafana:
    docker run -it -p 18630:18630 -d --name sdc --network mynetwork \
    docker run -dp 4242:4242 --name hbase --network mynetwork \
    docker run -d -p 3000:3000 --name grafana --network mynetwork \
  1. Open Grafana at http://localhost:3000 and login with admin / admin
  1. Add http://hbase:4242 as an OpenTSDB datasource to Grafana. If you don’t know how to add a data source, refer to Grafana docs. Your datasource definition should look like the screenshot shown in Figure 2 below.

Figure 2. OpenTSDB datasource definition in Grafana

  1. Download this Grafana dashboard file.
  1. Import that file into Grafana. If you don’t know how to import a dashboard, see Grafana docs. Your import dialog should look like the screenshot in Figure 3 below.

Figure 3. Dashboard import dialog in Grafana

  1. Download, unzip, and copy this HVAC data to the StreamSets container:
    gunzip mqtt.json.gz
    docker cp mqtt.json sdc:/tmp/mqtt.json
  1. Open StreamSets at http://localhost:18630 and login with admin / admin
  1. Download and import the following pipeline into StreamSets. If you don’t know how to import a pipeline, refer to StreamSets docs.
  1. You will see a warning about a missing library in the “Parse MQTT JSON” stage. Click that stage and follow the instructions to install the Jython library.

Figure 4. The warning shown here indicates that a missing library needs to be installed.

  1. Run the StreamSets pipeline. After a few minutes, the StreamSets dashboard should look like the screenshot in Figure 5 below.

Figure 5. StreamSets dataflow dashboard

  1. After letting the pipeline run for several minutes, the Grafana dashboard should look like the screenshot in Figure 6 below.

Figure 6. Grafana dashboard with data pipeline execution

Hopefully, by setting up this pipeline and exploring StreamSets you’ll get the gist of what dataflow management tools can do.

Pipeline Destinations – File, Table, or Stream?

Data pipelines have a beginning and an end (aka, a source and a sink). As mentioned earlier, MQTT and REST APIs are commonly used to read data, but where pipelines terminate varies widely, depending on the use case. For example, if you aim to simply archive data for regulatory compliance, you might terminate pipelines to a file because files are easy to create and compress in order to minimize storage cost. If your goal is to develop dashboards and alerts in a monitoring tool like Grafana for key metrics in an assembly line, then you might send pipelines to a time-series database like OpenTSDB. In the case of PdM, there exist other requirements that come into play when determining how to persist data. Let’s consider the relative advantages of files, tables, and streams to determine the best way to design data pipelines for PdM:

Files: Files can be used to efficiently read and write data. They can also be easily compressed and moved around if they’re not too large. However, problems arise when files become large (think gigabytes) or they become so numerous (think thousands) that they become difficult to manage. Aside from being difficult to move around, searching and updating data within large files can be extremely slow because their contents are unindexed. Also, although files provide maximum flexibility to save data in any format, they lack any built-in functions for schema validation. So, if you neglect to validate and discard corrupt records before they’re saved, then you’ll be forced to address the difficult task of data cleansing later on.

Streams: Streams, such as those provided by Apache Kafka, were designed to distribute data to any number of consumers through a publish/subscribe interface. This is useful when running multiple data processors (such as ML inference jobs), so they don’t all have to connect to raw data sources, which would be redundant and would not scale. Like files, streams can ingest data very quickly. Unlike files, streams provide the ability to validate incoming data against schemas (such as by using case classes with Apache Spark, which I’ll show later). The disadvantage of terminating pipelines on streams is that streams are immutable. Once data is in a stream, it cannot be modified. The only way to update data in a stream is to copy it into a new stream. Immutability in training data is undesirable for PdM because it prevents features, such as Remaining Useful Life (RUL), from being updated retroactively after important events, like machine faults, occur.

Database Tables: Schema validation? Yes. Updateable? Yes. Indexable? Yes! Table indexes are especially useful if the database provides secondary indexes because they can speed up requests that query more than one variable. I mentioned the advantages of pub/sub interfaces for streams; can databases also offer those? The answer is, again, yes, assuming the database provides change-data-capture (CDC) streams. The one disadvantage to databases is that data cannot be written as fast as it can with files or streams. However, there are numerous ways to speed up writes. One way is to terminate a pipeline on a stream. Streams can serve two purposes in this case. One, they can buffer high speed bursts, and two, they can distribute high speed pipelines to multiple consumers, which can scale out to collectively achieve the necessary write throughput to a database. This is especially effective when streams and DB run on the same underlying data platform, as is the case with MapR. It’s also worth noting that MapR-DB provides secondary indexes and CDC streams.
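Whichever destination is chosen, the point made above about validating and discarding corrupt records before they are persisted can be illustrated with a short sketch. This is a minimal, single-process Python example rather than a StreamSets or Spark stage; the schema, field names, and sample records are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical schema: field name -> accepted Python types.
EXPECTED_FIELDS = {
    "timestamp": (int, float),
    "device_id": (str,),
    "x": (int, float),
}

def is_valid(record):
    """True if the record is a JSON object with every expected field, correctly typed."""
    if not isinstance(record, dict):
        return False
    for name, types in EXPECTED_FIELDS.items():
        value = record.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            return False
    return True

def split_records(lines):
    """Separate parseable, schema-conforming records from rejects."""
    accepted, rejected = [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        if is_valid(record):
            accepted.append(record)
        else:
            rejected.append(line)
    return accepted, rejected

# Hypothetical input: one good record, one with a wrong type, one truncated.
raw = [
    '{"timestamp": 1, "device_id": "hvac-1", "x": 20.5}',
    '{"timestamp": "oops", "device_id": "hvac-1", "x": 20.5}',
    '{"timestamp": 3, "device_id": "hvac-1"',
]
good, bad = split_records(raw)
```

Keeping the rejects, rather than silently dropping them, makes it possible to redirect them to a separate location for later inspection.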

MapR as a Data Platform for PdM

IIoT requires a data platform that scales in terms of speed and capacity. In addition, model development requires that ML engineers be able to iterate on concepts quickly. MapR accomplishes this by converging streams, DB, and file storage together on a high performance data platform that scales linearly and provides features that empower data scientists to explore data, develop models, and operationalize those models without friction. 

Feature Engineering

The potential of ML lies in its ability to find generalizable patterns in data. While traditional statistics often use data reduction techniques to consolidate data samples, ML thrives on datasets thick with fidelity (think rows) and dimensionality (think columns). To give you an appreciation for the amount of data that PdM inference models need to digest, consider this:

  • Manufacturing processes can be instrumented by devices that measure hundreds of metrics at speeds that sometimes exceed thousands of samples per second (e.g., vibration sensors).
  • Failures are typically infrequent (e.g., once per month).
  • ML can only predict events that can be generalized from training data. If events are rare, data collection must be that much longer. A good rule of thumb is to train models with datasets that span several hundred events.

So, given the nature of IIoT data to be thick with fidelity and dimensionality, and given that PdM depends on seeing hundreds of examples of infrequent failures, the data platform used to store training data must scale not only in terms of ingest speed and expandable storage, but also in terms of the operations used to find and derive relevant features in training data. This process, known as feature engineering, is critical to the success of ML because it’s the point at which domain-specific knowledge comes into play. 
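To put rough numbers on that, here is a back-of-the-envelope calculation in Python. The specific values are illustrative assumptions picked from the low end of the ranges mentioned above, not measurements from a real plant.

```python
# Illustrative assumptions only, taken from the orders of magnitude discussed above.
metrics_per_device = 100               # "hundreds of metrics"
samples_per_second = 1_000             # "thousands of samples per second"
seconds_per_month = 30 * 24 * 60 * 60  # assuming a 30-day month
failures_needed = 300                  # "several hundred events"
months_between_failures = 1            # "once per month"

# Observation time needed to see enough failures, summed over all devices.
device_months = failures_needed * months_between_failures

# Total individual sensor values recorded over that observation time.
total_values = (
    metrics_per_device * samples_per_second * seconds_per_month * device_months
)

print(f"{device_months} device-months, {total_values:.3e} sensor values")
```

Under these assumptions, a single machine would need to be observed for 300 months, which is why training data usually has to be pooled from many machines, and the resulting table holds on the order of 10^13 values.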

Feature Creation

Feature engineering frequently involves adding new columns to training data to make either manual or automated analysis easier. These features can help people explore data with analytical tools and are often critical in order for ML algorithms to detect desirable patterns. Feature creation can happen either by augmenting raw data during ingest or by updating training data retroactively. To see how this works, let’s look at an example.

A simple IIoT dataset for a manufacturing process is represented below:

Figure 7. IIoT Sensor Dataset

Timestamps, device ID, and the three metrics named x, y, and z represent performance measurements from the control network. When we expand the table to include operator logs and other out-of-band data sources, it looks like this:

Figure 8. IIoT dataset with operator and weather details

What must have happened in order to add the operator and weather columns into this table? Let’s talk about this for a second. 

Data sources in control and corporate networks are typically unsynchronized. So, the timestamps in operator and weather logs will be different from the timestamps in IIoT data. Unifying those timestamps preserves a level of density that’s advantageous for asking questions, like “show me all the IoT values for machinery operated by Joe.” This kind of data integration is well suited for a nightly batch job, since it could potentially take a long time to update a day’s worth of IIoT records. It’s also more convenient to pay the penalty of data integration one time (e.g., nightly) than to repeat that task every time someone wants to access IoT and log fields together over a time-based query. So, as you look at the table shown above, recognize that behind the scenes a Spark job, or some other data integration task, must have joined operator/weather logs with IIoT logs and unified them by timestamp. 

There are many ways to implement this task, but when those logs live in different data silos, combining them into a unified feature table will be slow. This is why it’s important for data pipelines to sink data to a platform where data integration can run with minimal data movement.
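One way to picture the timestamp unification described above is an "as-of" join, in which each IIoT record picks up the most recent log entry recorded at or before its own timestamp. The following Python sketch illustrates the idea on a single machine; it is not the Spark job itself, and the field names (`ts`, `operator`) and sample values are hypothetical.

```python
from bisect import bisect_right

def asof_join(sensor_rows, log_rows, field):
    """Attach to each sensor row the latest log value at or before its timestamp."""
    logs = sorted(log_rows, key=lambda r: r["ts"])
    log_times = [r["ts"] for r in logs]
    joined = []
    for row in sensor_rows:
        # Index of the last log entry whose timestamp is <= the sensor timestamp.
        i = bisect_right(log_times, row["ts"]) - 1
        value = logs[i][field] if i >= 0 else None
        joined.append({**row, field: value})
    return joined

# Hypothetical data: sensor samples and operator shift changes on different clocks.
sensor = [{"ts": 100, "x": 0.1}, {"ts": 105, "x": 0.2}, {"ts": 112, "x": 0.3}]
operators = [{"ts": 103, "operator": "Joe"}, {"ts": 110, "operator": "Ann"}]

unified = asof_join(sensor, operators, "operator")
```

Once the operator column is materialized this way, a question like "show me all the IoT values for machinery operated by Joe" becomes a simple filter on one table instead of a join at query time.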

Feature Extraction

Feature extraction involves combining variables to produce fields that are more useful. For example, it can be useful to split a date and time field into its component parts, so that you can easily subset training data by hours of the day, days of the week, phases of the moon (who knows, right?), and so on. This type of feature extraction is easy to implement inside a batch job or a streaming job because those jobs can be written in languages such as Java, Python, and Scala, which include libraries designed to make date/time manipulation easy. Implementing a SQL function to determine whether a date/time value falls on a weekend is much more difficult. Adding a _weekend attribute while augmenting a feature table inside streaming or batch jobs could make manual analysis much easier and help ML algorithms generalize patterns over the work week.
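As a minimal sketch of this kind of date/time extraction, the Python function below derives hour, day-of-week, and weekend fields from an epoch timestamp. It assumes the timestamp is in seconds and should be interpreted in UTC; the `_hour` and `_day_of_week` field names are hypothetical, and a real pipeline would likely use the plant's local time zone.

```python
from datetime import datetime, timezone

def add_time_features(record):
    """Return a copy of the record with derived date/time fields added."""
    ts = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    enriched = dict(record)
    enriched["_hour"] = ts.hour
    enriched["_day_of_week"] = ts.weekday()   # Monday = 0 ... Sunday = 6
    enriched["_weekend"] = ts.weekday() >= 5  # Saturday or Sunday
    return enriched

# 1514678400 is 2017-12-31 00:00:00 UTC, a Sunday.
sunday = add_time_features({"timestamp": 1514678400, "x": 1.0})
# 1514764800 is 2018-01-01 00:00:00 UTC, a Monday.
monday = add_time_features({"timestamp": 1514764800, "x": 1.0})
```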

Lagging Features

PdM relies on supervised machine learning, which involves building a model to predict labels based on how those labels are mapped to features in training data. The two most common labels used for PdM are:

  1. The possibility of failure in the next n steps (e.g., “About To Fail”)
  2. The time (or machine cycles) left before the next failure (e.g., “Remaining Useful Life”)

The first label can be predicted using a binary classification model that outputs the probability of failure within a prescribed time window (e.g., “I’m 90% sure a failure will occur within the next 50 hours”). The second label can be predicted using a regression model that outputs a value for Remaining Useful Life (RUL). These variables are lagging, meaning their values cannot be assigned until a failure event occurs.

Figure 9. IIoT dataset before the failure event occurs

When a failure occurs, values for these lagging variables can be calculated and applied retroactively to the feature table.

Figure 10. Dataset with remaining useful life
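The retroactive labeling step can be sketched in a few lines of Python. This is a simplified in-memory illustration under assumed conventions, not the demo application's implementation: rows are dictionaries with a hypothetical `ts` field, timestamps and the warning window share the same unit, and unlabeled rows simply lack the `RemainingUsefulLife` and `AboutToFail` fields.

```python
def label_lagging_features(rows, failure_ts, window):
    """Fill in lagging labels for unlabeled rows that precede a failure.

    Rows recorded after the failure, or already labeled by an earlier
    failure, are left untouched.
    """
    for row in rows:
        if row.get("RemainingUsefulLife") is None and row["ts"] <= failure_ts:
            row["RemainingUsefulLife"] = failure_ts - row["ts"]
            row["AboutToFail"] = row["RemainingUsefulLife"] <= window
    return rows

# Hypothetical feature table with samples at ts = 0, 20, 40, 60, 80.
table = [{"ts": t, "x": 0.5} for t in range(0, 100, 20)]

# A failure is reported at ts = 70; rows within 30 units of it are "about to fail".
label_lagging_features(table, failure_ts=70, window=30)
```

The row at `ts = 80` stays unlabeled because it belongs to the period leading up to the next, not yet observed, failure.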

If failure events are rare, but IIoT data is frequent, then retroactively labeling lagging features can lead to a massively large table update. In the next section, I’ll talk about how two technologies, Apache Spark and MapR-DB, can work together to solve this challenge.

Scalable Feature Engineering with MapR-DB + Spark

Feature tables for PdM can easily exceed the capacity of what can be stored and processed on a single computer. Distributing this data across a cluster of machines can increase capacity, but you don’t want to do that if you ultimately still need to move the data back onto a single machine for analytics and model training. In order to avoid data movement and single points of failure, both storage and compute need to be distributed. Apache Spark and MapR-DB provide a convenient solution for this task.

MapR-DB is a distributed NoSQL database that provides the scale and schema flexibility needed to build large feature tables. Apache Spark provides distributed compute functionality that enables feature tables to be analyzed beyond the confines of what can fit in memory on a single machine. The MapR-DB connector for Spark eliminates the need to copy whole feature tables into Spark processes. Instead, MapR-DB locally executes filters and sorts submitted from Spark SQL and returns only the resulting data back to Spark.

Figure 11. MapR-DB and Apache Spark Integration

PdM Feature Engineering Example

I’ve built a notional PdM application for an industrial HVAC system that shows several examples of feature engineering using MapR-DB and Spark. The code and documentation for this demo can be found at this Github project. Excerpts have been included below to illustrate the feature engineering concepts previously discussed.

The data pipeline for this application consists of an MQTT client, which publishes HVAC data to a stream using the Kafka API. The ingest stream buffers those records while a Spark process consumes and persists them with derived features to a table in MapR-DB. The pipeline looks like this:

Figure 12. IIoT data pipeline

The following Scala code shows how stream records are read with Spark:

The above code creates a Dataset object containing raw MQTT records. That Dataset can be enriched with derived features, like this:

(Note, underscores are used at the beginning of field names to denote derived features.)

This enriched dataset is then saved to MapR-DB using the OJAI API, like this:

So far, the feature table in MapR-DB contains HVAC sensor values and a few derived features, such as _weekend, but the values for lagging variables AboutToFail and RemainingUsefulLife are still unassigned. The Spark job for receiving failure notifications on a stream and updating lagging variables looks like this:

When this Spark job receives a failure notification, it calculates values for lagging variables, then retroactively applies them to the feature table in MapR-DB. The output for this process in the demo application looks like the screenshot in Figure 13 below.

Figure 13. IIoT demo application process output

PdM Algorithm Example

In the previous sections, I’ve described techniques for recording data about the conditions leading up to failure events, so that PdM models can be trained with supervised machine learning. In this section, I’m going to talk about what to do with that data and how to actually train a model that predicts failures.

Long Short-Term Memory (LSTM) algorithms are commonly used to forecast rare events in time series data. Since failures are generally very rare in PdM applications, LSTM algorithms are suitable for modeling. There are LSTM examples readily available for most popular machine learning frameworks. I chose Keras to implement an LSTM algorithm for the “About To Fail” metric discussed above. If you would like to see the details, read the Jupyter notebook I posted on GitHub.
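Keras LSTM layers expect three-dimensional input shaped as (samples, time steps, features), so a flat feature table has to be sliced into windows of consecutive rows before training. The Python sketch below shows one common way to do that, labeling each window with the label of its final row. It is a generic illustration with hypothetical names and toy data, and the notebook may prepare its sequences differently.

```python
def make_windows(features, labels, timesteps):
    """Slice per-cycle rows into overlapping windows of consecutive rows.

    Each window holds `timesteps` feature rows and takes the label of
    its final row.
    """
    windows, targets = [], []
    for end in range(timesteps, len(features) + 1):
        windows.append(features[end - timesteps:end])
        targets.append(labels[end - 1])
    return windows, targets

# Toy data: 6 machine cycles with 2 features each, and an "About To Fail" label.
rows = [[float(i), float(i) * 2] for i in range(6)]
about_to_fail = [0, 0, 0, 1, 1, 1]

X, y = make_windows(rows, about_to_fail, timesteps=3)
```

Here `X` holds 4 windows of 3 time steps with 2 features each; converting it to an array gives the (samples, time steps, features) shape an LSTM layer consumes.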

My LSTM implementation uses a feature table, like what was described above, to train a model that predicts whether a machine is likely to fail within 50 cycles.  A good result looks like this:

Figure 14. LSTM implementation results

That’s a good result because it predicts failure before the failure actually occurs, and it does so with high confidence (>90%). Here’s an example of where the model makes a prediction that is not useful, because it predicts failure only after the failure has already occurred:

Figure 15. LSTM implementation result - second example

It’s kind of fun to play around with different training datasets to better understand the sensitivities of LSTM. In my Jupyter notebook, I explain how to synthesize a dataset and train the model, so you can experiment with LSTM on your laptop. I intentionally used only a few features in that exercise in order to make the steps for data preparation and LSTM implementation easier to understand. I’ll leave it to you to find those details in the notebook I posted on GitHub, rather than repeat them here.


Machine learning has the potential to make predictive maintenance strategies far more effective than the conventional methods used in years past. However, predictive maintenance poses significant data engineering challenges, due to the high bandwidth of industrial IoT data sources, the rarity of mechanical failures in real life, and the necessity of high-resolution data for training models. The effectiveness of any venture to develop and deploy machine learning for predictive maintenance applications will also depend on an underlying data platform that can handle the unique demands of not only data storage but also unencumbered data access, as data scientists iterate on concepts for feature engineering and model development. 

About the Author

Ian Downard is a data engineer and developer evangelist at MapR. He enjoys learning and sharing knowledge about the tools and processes which enable DataOps teams to put machine learning into production. Ian coordinates the Java User Group in Portland, Oregon and writes about Big Data here and here.
