COVID-19 and Mining Social Media - Enabling Machine Learning Workloads with Big Data


Key Takeaways

  • Data pipeline architecture includes five layers: 1) ingest data, 2) collect, analyze and process data, 3) enrich the data, 4) train and evaluate machine learning models, and 5) deploy to production.
  • Use data pre-processing methods and machine learning models for categorizing, clustering and filtering, then serve the models and provide access to them.
  • Predictive analytics uses machine learning techniques such as data mining and predictive modeling to analyze current and historical facts to make predictions about the future or unknown events.
  • Apache Spark is an open source solution with built-in APIs for batch processing, stream processing, ML training and using machine learning models at scale.
  • Learn how to use Spark MLlib with PySpark, using MLflow on the Azure Databricks platform.

COVID-19 changed our lives. We still don't grasp the implications, but we know the world will never be the same again. We are entering a new world, an unknown one. Our decision-making process got flipped on its head, and everyone is slowing down their operations and making more focused adjustments.

This “new normal” has brought people's online discourse to an all-time high in the United States, as more people use the internet to voice their thoughts, questions and concerns. People today use social media channels like Reddit, Facebook, Twitter, Instagram and more.

Social networks roughly fall into six groups as follows:

  1. Microblogging platforms such as Twitter and Reddit
  2. Blogging platforms such as Medium and
  3. Instant messaging apps (WhatsApp and Telegram)
  4. Social networking platforms like Facebook and LinkedIn
  5. Software collaboration platforms such as GitHub
  6. Photo/video sharing platforms such as Instagram and YouTube

Many researchers are working in this field, analyzing the data and suggesting new statistics-based algorithms to answer the most pressing questions.

In this article, I’ll discuss how to enable machine learning workloads with big data in a production environment to query and analyze COVID-19 tweets and understand social sentiment towards COVID-19.

Motivation and Inspiration:

The article Towards detecting influenza epidemics by analyzing Twitter messages by Aron Culotta, published at SOMA '10 as part of the KDD conference in 2010, discussed how a rapid response to health epidemics is crucial for saving lives. The article processed 500,000 messages spanning 10 weeks. That was in 2010; today we are discussing a scale of tens of millions of tweets per day. The scale is high, and Twitter even created a special stream API for querying and analyzing tweets.

Another research work is Surveillance Sans Frontières: Internet-Based Emerging Infectious Disease Intelligence and the HealthMap Project, published in the PLOS Medicine research journal by John S. Brownstein and team in 2008. The system they proposed is still relevant today with the COVID-19 pandemic.

Here is a high level diagram of their system goals from a research perspective:

Figure 1. High level diagram from John S. Brownstein and team system design and research published in 2008.

In their approach, back in 2008, they referred to acquiring the data, categorizing it, clustering it and filtering it by five categories:

  • Breaking News
  • Warning
  • Old News
  • Context and
  • Not disease related

For our case, we want to create a system to analyze Twitter data, but it's not limited to Twitter alone. Once we design the architecture, we can reuse it to create data pipelines for more social media streams.

Let’s think about what an ideal software architecture to power this approach would be.

It needs to support new data streams from new data sources (acquiring) and analyze them at scale (social media data today is exploding). It should use data pre-processing methods and machine learning models for categorizing, clustering and filtering, and later serve the models and provide access to them.

We will simplify it and focus on Twitter as an example of an input stream. In our scenario we are interested not only in being able to analyze the past, but also in predicting the future. This approach is called Predictive Analytics.

Predictive Analytics uses machine learning techniques such as data mining and predictive modeling to analyze current and historical facts to make predictions about the future or unknown events. As an ML-based predictive analytics use case, we will predict whether a tweet will be retweeted based on hashtags, sentiment and location.

Let’s break down the architecture into the following layers and look at how our data behaves in each layer under a Data Lake approach:

  • Ingest the data
  • Collect, analyze and process the data
  • Enrich the data - in our case we add sentiment analysis based on tweet text
  • Train and evaluate machine learning models
  • Deploy to production
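One way to picture the layers above is as a chain of functions, each consuming the output of the previous one. The following is a minimal plain-Python sketch of the first three layers only, not the project's code: the function names, the record fields and the toy positive-word list are all assumptions made for illustration. In the real system these steps run on Kafka and Apache Spark, and the training and deployment layers are left out here.

```python
from typing import Callable, Dict, List

Tweet = Dict[str, object]
Layer = Callable[[List[Tweet]], List[Tweet]]


def ingest(raw: List[Tweet]) -> List[Tweet]:
    # Layer 1 (stand-in for the API pull): keep only records that carry text.
    return [t for t in raw if t.get("text")]


def process(tweets: List[Tweet]) -> List[Tweet]:
    # Layer 2: normalize the text field.
    return [{**t, "text": str(t["text"]).strip().lower()} for t in tweets]


def enrich(tweets: List[Tweet]) -> List[Tweet]:
    # Layer 3: add a toy sentiment flag based on a hypothetical word list.
    positive = {"good", "great", "hope"}
    return [
        {**t, "positive": any(w in positive for w in str(t["text"]).split())}
        for t in tweets
    ]


def run_layers(data: List[Tweet], layers: List[Layer]) -> List[Tweet]:
    # Feed the output of each layer into the next one.
    for layer in layers:
        data = layer(data)
    return data


sample = [
    {"text": "  Great news on the vaccine "},
    {"text": ""},
    {"text": "Cases are rising"},
]
result = run_layers(sample, [ingest, process, enrich])
```

Keeping each layer behind the same signature is what makes it possible to add a new social media stream later without touching the downstream layers.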

Check out the GitHub project for more details on the application discussed in this article. The GitHub repo contains code, notebooks and YAML files for deployment.

Software layers for our ML based system

Data Ingestion

This layer queries Twitter through the Twitter developer API to pull data and ingest it into our system. A common solution for building this is using a Kafka client. For learning more about working with Kafka, click here. This layer is also in charge of filtering out data that we are not allowed to store because of privacy rules, governmental requirements and so on. In our Data Lake, this data is mirrored and saved to storage in the “RAW” directory for future needs.
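The filtering and mirroring steps described here could look roughly like the sketch below. It is a plain-Python illustration under stated assumptions, not the project's Kafka-based code: the restricted field names, the `tweets.jsonl` file name and the helper names are hypothetical, and a local directory stands in for the Data Lake storage.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable

# Hypothetical list of fields we are not allowed to store.
RESTRICTED_FIELDS = {"user_email", "user_phone"}


def strip_restricted(record: Dict[str, object]) -> Dict[str, object]:
    # Drop every field that appears on the restricted list.
    return {k: v for k, v in record.items() if k not in RESTRICTED_FIELDS}


def mirror_to_raw(records: Iterable[Dict[str, object]], lake_root: Path) -> Path:
    # Append the filtered records as JSON lines under <lake_root>/RAW.
    raw_dir = lake_root / "RAW"
    raw_dir.mkdir(parents=True, exist_ok=True)
    out_file = raw_dir / "tweets.jsonl"
    with out_file.open("a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(strip_restricted(record)) + "\n")
    return out_file


# Usage with a temporary directory standing in for the Data Lake root.
lake_root = Path(tempfile.mkdtemp())
raw_file = mirror_to_raw(
    [{"id": 1, "text": "stay safe", "user_email": "someone@example.com"}],
    lake_root,
)
stored = [json.loads(line) for line in raw_file.read_text(encoding="utf-8").splitlines()]
```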

Collect, Analyze and Process

This layer has multiple responsibilities:

  • Consuming the data from our data ingest part, processing it, and saving it to storage.
  • Serving the data in real time or near real time to a machine learning model for calculating real time predictions, clustering and categorizing.
  • Cleaning and preparing data for training new machine learning models by splitting the data into datasets and merging historical data with new data.

To achieve this, we can use open source solutions such as Apache Spark, which has built-in APIs for batch processing, stream processing, ML training and using machine learning models at scale.

One of our options for working with Apache Spark on the public cloud is Databricks, a managed and optimized solution for running Apache Spark.

For offline data, future analytics and machine learning model building, we save the data under the ‘CURATED’ or ‘PROCESSED’ directory.
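Preparing that offline data for training, as listed among the responsibilities above, comes down to merging historical records with new ones and splitting the result into datasets. The sketch below shows the idea in plain Python under stated assumptions: each record has an `id` field, newer records win on conflict, and the 80/20 split and seed are arbitrary illustrative choices. On Spark the same steps would be expressed with DataFrame operations instead.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, object]


def merge_by_id(historical: List[Record], new: List[Record]) -> List[Record]:
    # Newer records replace historical ones that share the same id.
    merged = {r["id"]: r for r in historical}
    merged.update({r["id"]: r for r in new})
    return list(merged.values())


def split_dataset(
    records: List[Record], train_fraction: float = 0.8, seed: int = 42
) -> Tuple[List[Record], List[Record]]:
    # Shuffle a copy with a fixed seed so the split is reproducible.
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


historical = [{"id": i, "source": "historical"} for i in range(8)]
new = [{"id": i, "source": "new"} for i in range(6, 12)]
merged = merge_by_id(historical, new)
train, test = split_dataset(merged)
```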

Sometimes, we will merge the data with other data sources and enrich the existing data. That data will often be saved under the ‘REFINED’ directory. Here is the code sample. In the sample, we enriched the data using Text Analytics and Sentiment Analysis from Azure Cognitive Services. This enabled us to add a new column with a positive sentiment score based on the text column in the given data, which represents the tweet the user wrote.
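The enrichment step amounts to deriving a new column from the text column. Here is a minimal sketch of that shape in plain Python. The scorer is a hypothetical word-list stub that merely stands in for a call to a sentiment service; it is not the Azure Cognitive Services API, and the `text` column name and the word list are assumptions.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def toy_positive_score(text: str) -> float:
    # Hypothetical stand-in for a sentiment service:
    # the share of words that appear in a tiny positive word list.
    positive_words = {"good", "great", "safe", "recovered"}
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in positive_words for w in words) / len(words)


def add_sentiment(
    rows: List[Row],
    scorer: Callable[[str], float] = toy_positive_score,
    text_col: str = "text",
) -> List[Row]:
    # Return new rows with an extra positive_sentiment column.
    return [{**row, "positive_sentiment": scorer(str(row[text_col]))} for row in rows]


refined = add_sentiment(
    [{"text": "Great news, recovered today"}, {"text": "Lockdown extended"}]
)
```

Passing the scorer in as a parameter keeps the enrichment code independent of whichever sentiment provider is actually used.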

Train and Evaluate Machine Learning Models

This is the main layer that our data science and ML teams interact with. After we have collected, analyzed and processed the data and made it available for ML, our ML team can access it for data exploration and ML experiments.

Our data scientists can work with multiple tools for building their experiments. They can leverage Apache Spark ML for building machine learning models in a distributed manner. Or, if the data fits in memory and they don't need a distributed computing framework such as Apache Spark, they can build machine learning (ML) models with less compute power using the scikit-learn library. Scikit-learn is a free software machine learning library for the Python programming language. It features various ML algorithms and tools for predictive data analysis and more.

Let’s take a look at a Spark MLlib with PySpark example of how to run an ML experiment, using MLflow on Azure Databricks to track the experiment metrics. MLflow with Databricks allows us to create out-of-the-box experiments leveraging the Databricks UI.

In this scenario, we try to predict whether a tweet will be retweeted based on positive and negative sentiment, hashtags used, user followers count, user friends count and user favourites.

For that, we used Spark MLlib with Pipeline for orchestrating the ML flow. We provide the Pipeline class with multiple stages, then run the fit method to execute the stages and build the model, and the transform method to run the model on new data and get predictions.

In this code snippet, you can see how we use the pipeline for training a DecisionTreeClassifier, build the machine learning model itself and register it with MLflow for future access.

import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier, DecisionTreeClassificationModel
from pyspark.ml.feature import StringIndexer, VectorAssembler, Normalizer
from pyspark.mllib.linalg import Vectors

#start mlflow model for tracking the experiment
with mlflow.start_run():
   #create experiment name
   # Read input columns location and hashtags - and annotate them as categorical values
   indexLocation = StringIndexer(inputCol="user_location",outputCol="indexedLocations").setHandleInvalid("keep")
   indexHashtages = StringIndexer(inputCol="hashtags",outputCol="indexedhashtags").setHandleInvalid("keep")

   # assemble multiple columns into one feature column, we often use this with many Spark mllib out of the box machine learning models
   assembler = VectorAssembler(inputCols=[ "indexedLocations","indexedhashtags","user_followers", "user_friends","user_favourites","positive_sentiment","negative_sentiment","user_verified"], outputCol="features")

   # StringIndexer: Read input column "is_retweet" (true/false) and annotate them as categorical values.
   indexLabel = StringIndexer(inputCol="is_retweet", outputCol="indexedRetweetLabel").setHandleInvalid("keep")

   # most often with decision tree we would want to normalize the data to make it fit into the bins:
   normalizer = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)

   # DecisionTreeClassifier: Learn to predict column "indexedRetweetLabel" using the "normFeatures" column.
   dtc = DecisionTreeClassifier(labelCol="indexedRetweetLabel",featuresCol="normFeatures")

   # Chain indexer + dtc together into a single ML Pipeline.
   pipeline = Pipeline(stages=[indexLocation,indexHashtages,assembler,indexLabel,normalizer, dtc])

   # train the model
   model = pipeline.fit(tweets)
   # use the model - only for demo, for other cases we will use multiple datasets for testing and training
   prediction = model.transform(tweets)
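   # path to persist the model to
   model_path = "/mnt/root/COVID19_TWEETS/ML-Models/V1"
   # register the model with MLflow as well
   # (the original call was lost; the artifact path "model" is an assumption)
   mlflow.spark.log_model(model, "model")
   # persist the model to disk
   model.write().overwrite().save(model_path)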

Code 1. This code snippet demonstrates how to build a decision tree classifier with Spark MLlib, persist it to disk and register the experiment with MLflow.

This is a relatively simple example of how data scientists can work with Spark MLlib but, as mentioned before, there are many other libraries that the data science and machine learning team can leverage for building machine learning models.

For evaluating the model and calculating its error rate, we leverage the Spark MLlib MulticlassClassificationEvaluator. This is how:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
   labelCol="indexedRetweetLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(prediction)
print("Test Error = %g " % (1.0 - accuracy))

Code 2. This code snippet demonstrates how to leverage the Spark MLlib evaluator class for multiclass classification.

In our case, the error rate was ~0.09, which means 9% of our predictions were wrong. That is a pretty high error rate and means the model requires more work from data science and ML experts. We will not delve into this issue further since this was just an example of how we can connect everything and get the architecture layers to work nicely together.
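To make the relation between accuracy and the reported test error concrete, here is a small worked example in plain Python. It reproduces the same arithmetic as the evaluator (error = 1 - accuracy) on made-up (prediction, label) pairs; the data and the `accuracy` helper are illustrative assumptions, not output from the model above.

```python
from typing import List, Tuple


def accuracy(pairs: List[Tuple[float, float]]) -> float:
    """Share of (prediction, label) pairs where the prediction matches the label."""
    if not pairs:
        raise ValueError("need at least one (prediction, label) pair")
    correct = sum(1 for predicted, label in pairs if predicted == label)
    return correct / len(pairs)


# Made-up data: 100 tweets, 91 predicted correctly and 9 predicted wrongly.
pairs = (
    [(1.0, 1.0)] * 40   # retweeted, predicted retweeted
    + [(0.0, 0.0)] * 51  # not retweeted, predicted not retweeted
    + [(1.0, 0.0)] * 5   # predicted retweeted, but was not
    + [(0.0, 1.0)] * 4   # predicted not retweeted, but was
)
test_error = 1.0 - accuracy(pairs)
print("Test Error = %g" % test_error)
```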

ML Serving

Once our machine learning model passes all data science tests, evaluators and quality assurance, it is ready for the next stage: testing on the staging environment. If it’s successful there, it proceeds to production. The staging environment is a simplified replica of our production environment and is used as a gatekeeper for additional software testing before deploying to production.

To deploy our machine learning model to production, we need to define how we will serve it. The machine learning model output itself is a file, which can take many formats such as a pkl file, onnx file, pmml file, zip file and more. These files are packaged with code snippets that can be loaded and served. The most common way of serving them is using a REST API or leveraging the framework they were built with. For our example, we used Spark MLlib. When saving the model, it created two directories: metadata and stages.

Metadata holds the class information and what we used. This is the metadata from our use case:


Figure 2. The metadata file created by persisting the Spark MLlib model pipeline to disk. It contains all the stages and information needed to recreate the ML model itself.

Stages holds directories with the different stages in the pipeline for creating the model.

This is what the directory looks like:

Figure 3. Files created by persisting the Spark ML model pipeline to disk. Each stage received its own directory.

Each of them holds data and metadata directories of its own.

Data is in binary format and contains the information the model pipeline needs to rebuild itself.

Metadata here holds information like the class, output column and other tuning parameters.
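As a rough illustration of that layout, the sketch below walks the stages directory of a persisted pipeline and reads the class recorded in each stage's metadata. It assumes each metadata directory contains a JSON file whose name starts with `part-` and that the JSON has a `class` key; the helper name, the stage directory names and the fake directory built for the demo are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def read_stage_classes(model_dir: Path) -> List[str]:
    # Collect the "class" entry from every stage's metadata file, in stage order.
    classes = []
    for stage_dir in sorted((model_dir / "stages").iterdir()):
        for part in sorted((stage_dir / "metadata").glob("part-*")):
            first_line = part.read_text(encoding="utf-8").splitlines()[0]
            classes.append(json.loads(first_line)["class"])
    return classes


# Build a fake persisted pipeline on disk so the sketch runs without Spark.
demo_dir = Path(tempfile.mkdtemp())
fake_stages = [
    ("0_StringIndexer_demo", "org.apache.spark.ml.feature.StringIndexerModel"),
    ("1_DecisionTreeClassifier_demo",
     "org.apache.spark.ml.classification.DecisionTreeClassificationModel"),
]
for stage_name, class_name in fake_stages:
    meta_dir = demo_dir / "stages" / stage_name / "metadata"
    meta_dir.mkdir(parents=True)
    (meta_dir / "part-00000").write_text(
        json.dumps({"class": class_name}) + "\n", encoding="utf-8"
    )

stage_classes = read_stage_classes(demo_dir)
```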

Below is a code snippet showing how we can serve our machine learning model. The init function starts the component and the model by reading it from the file, and the run function takes input data and returns results.

import os, shutil, json

import pyspark
from azureml.core.model import Model
from pyspark.ml import PipelineModel

def init():
  global model
  global spark
 #Creating an spark session
  spark = pyspark.sql.SparkSession.builder.appName("tweets_decision_tree").getOrCreate()
  model_name = "tweets_decision_tree"
  #Loading the model
  #My model is stored in Azure Machine Learning Services. If not your case, replace accordingly
  model_path = Model.get_model_path(model_name)
  model_unpacked = "./" + model_name
  #Unpacking archive
  shutil.unpack_archive(model_path, model_unpacked)

  #Creating the PipelineModel object from path
  model = PipelineModel.load(model_unpacked)

# when our server gets a new request for classification/prediction/scoring, it calls the run function with the raw_data
def run(raw_data):
  # validate that the model is up
  if isinstance(model, Exception):
      # Loading routine failed to load the model
      return json.dumps({"trainedModel": str(model)})
  try:
      # Parse the request body (assumed here to be a JSON list of records)
      input_data = json.loads(raw_data)
      # Converting raw data into Dataframe (Spark)
      input_df = spark.createDataFrame(input_data)
      # Score using spark mllib decision tree pipeline - compute prediction
      predictions = model.transform(input_df).select("prediction").collect()
      # you can return any data type as long as it is JSON-serializable
      result = [row["prediction"] for row in predictions]
      return result
  except Exception as e:
      print('[ERR] Exception happened: ' + str(e))
      result = 'Input ' + str(raw_data) + '. Exception was: ' + str(e)
      return result

Code 3. This code snippet demonstrates how to load and use the model we built in the production environment. For more examples on porting SparkML with Azure Machine Learning please check this GitHub repository.

This code shows how we initiate and run our chosen machine learning model. We will add another layer of REST APIs and dockerize it so it will be ready to deploy anywhere.

For dockerizing and deploying, we use Azure Machine Learning (AML), which helps us manage our machine learning models. Here is a step by step tutorial on how to leverage AML to dockerize and deploy models to the Kubernetes environment. For a full open source solution that is cloud agnostic, please check this tutorial by Facundo Santiago.

This article doesn’t cover everything. One component that is crucial for working with machine learning in production is being able to monitor the model and decide when to replace it.

We achieve that by implementing observability and monitoring for the overall system. Our dockerized app that serves the machine learning model needs to make sure to write the information we need to track into the logs and monitoring systems.
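What the serving app writes to the logs could look like the sketch below: one structured JSON line per prediction, which is easy for a log pipeline to parse later. The field names, the logger name and the helper functions are assumptions made for illustration; the article does not prescribe a log format.

```python
import json
import logging
from typing import Dict

logger = logging.getLogger("model_serving")


def build_prediction_log(
    model_version: str,
    features: Dict[str, object],
    prediction: float,
    latency_ms: float,
) -> str:
    # One JSON document per prediction; only feature names are logged, not values.
    return json.dumps(
        {
            "event": "prediction",
            "model_version": model_version,
            "feature_names": sorted(features),
            "prediction": prediction,
            "latency_ms": round(latency_ms, 2),
        },
        sort_keys=True,
    )


def log_prediction(
    model_version: str,
    features: Dict[str, object],
    prediction: float,
    latency_ms: float,
) -> str:
    # Emit the line through the standard logging module and return it for inspection.
    line = build_prediction_log(model_version, features, prediction, latency_ms)
    logger.info(line)
    return line


log_line = log_prediction(
    "V1", {"user_followers": 10, "positive_sentiment": 0.5}, 1.0, 12.5
)
```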

Observability and Monitoring

Our alert system should be adjusted to collect and monitor our machine learning models in production. We do this by collecting distributed logs from various machines, tracking them, and using a simple API for querying them and building our alerts. We use the ELK stack (Elasticsearch, Logstash and Kibana) for this requirement. Elasticsearch is a search engine for running full text analytics, Logstash helps us create data pipelines for routing the data to Elasticsearch, and Kibana provides the data visualization dashboard on top of Elasticsearch.

Figure 4. This diagram is from the website. provides a managed solution for cloud observability.

In the diagram below you can see a high level view of the machine learning model life cycle. The main drivers for triggering a new machine learning training process are often based on the monitoring and observability layers.

Three main triggers are:

  • Data driven: we detect new variability of data in our systems.
  • Schedule driven: we want to release an updated machine learning model every x days.
  • Metrics driven: we detected many errors and need an updated machine learning model.

Figure 5. High level overview of the machine learning cycle
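The three triggers listed above can be expressed as a small decision function that the monitoring layer evaluates periodically. This is a sketch under stated assumptions: the `ModelStatus` fields, the drift score and all thresholds are hypothetical values chosen for illustration, not figures from the project.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class ModelStatus:
    last_trained: datetime
    drift_score: float  # hypothetical measure of how far new data moved from training data
    error_rate: float   # share of wrong predictions observed in production


def retraining_triggers(
    status: ModelStatus,
    now: datetime,
    max_age: timedelta = timedelta(days=30),
    drift_threshold: float = 0.2,
    error_threshold: float = 0.1,
) -> List[str]:
    # Return the names of all triggers that currently call for retraining.
    reasons = []
    if status.drift_score > drift_threshold:
        reasons.append("data")
    if now - status.last_trained >= max_age:
        reasons.append("schedule")
    if status.error_rate > error_threshold:
        reasons.append("metrics")
    return reasons


status = ModelStatus(
    last_trained=datetime(2020, 6, 1), drift_score=0.05, error_rate=0.15
)
reasons = retraining_triggers(status, now=datetime(2020, 7, 15))
```

An empty list means the current model stays in production; any non-empty result would kick off a new training run.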


In this article we looked at how you can architect your system with big data and machine learning to enable predictive analytics in your organization. We created a simple machine learning model for deciding whether a tweet with a COVID-19 keyword will be retweeted or not, based on sentiment analysis and other parameters. Machine learning is playing a critical part in fighting the COVID-19 pandemic and in better understanding how to remediate the pandemic using software.

If you are interested in learning more, please check out the full project on GitHub. This is a live project that constantly evolves. It contains the code, steps, and notebooks for you to get started. The GitHub project doesn't contain the data itself. If you want to test out your system with real data and follow our steps, check out the Twitter open dataset from Kaggle public datasets - covid19 tweets.

About the Author

Adi Polak is a Sr. Software Engineer and Developer Advocate in the Azure Engineering organization at Microsoft. Her work focuses on distributed systems, real-time processing, big data analysis, and machine learning. In her advocacy work, she brings her vast industry research & engineering experience to bear in helping teams design, architect, and build cost-effective software and infrastructure solutions that emphasize scalability, team expertise, and cost efficiency. Adi holds a master’s degree in computer science and information systems.
