
Advanced Data Visualizations in Jupyter Notebooks


Summary

Chakri Cherukuri discusses how to build advanced data visualization applications and interactive plots in Jupyter notebooks, including use cases with time series analysis.

Bio

Chakri Cherukuri is a senior researcher in the Quantitative Financial Research group at Bloomberg LP. His research interests include quantitative portfolio management, algorithmic trading strategies and applied machine learning. Previously, he built analytical tools for Goldman Sachs and Lehman Brothers. He is a core contributor to bqplot, a 2D plotting library for the Jupyter notebook.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Cherukuri: I'm Chakri Cherukuri. I work in the quant research group at Bloomberg. In today's talk, we'll see how we can build advanced data visualizations in Jupyter Notebooks. I will start with a brief overview of Jupyter Notebooks and then we'll look at an overview of interactive widgets in Jupyter Notebooks. Then I'll show you some examples of advanced applications, dashboards, and visualizations in different areas like machine learning and finance.

Jupyter Notebooks

Jupyter Notebooks are an excellent way to promote reproducible research. By sharing documentation, code, and interactive plots, we can provide complete transparency into our research and models.

Notebooks are also easy to share: they are JSON files, so they're lightweight and portable. Newer technologies like JupyterLab give us a fully developed integrated environment, similar to RStudio or Spyder.

Let's look at some of the features we have in Jupyter Notebooks. You can do extensive documentation. For example, we have complete support for all the HTML tags and CSS styling. Notebooks also provide complete LaTeX support, so we can write equations to explain our models better. We can also use Markdown to improve our documentation. Jupyter supports various kernel backends like Julia, Python, and R. Now you know why it's called Jupyter. There is extensive support for plotting in Jupyter Notebooks. We can use Matplotlib and Seaborn for static plots, which are rendered as static images in the output cells of the notebook.

Since notebooks run in the browser, we can also use JavaScript-enabled plotting libraries. One of them is bqplot, which was developed in our group at Bloomberg and is open source on GitHub. There are also libraries like Plotly and Bokeh, which provide some interactive capabilities.

Finally, we have interactive widget libraries. These are not so well known, and they're the main focus of this talk. Using these interactive widgets, we can build rich applications and visualizations right in the notebook. The two libraries I'll be using in today's examples are ipywidgets, which provides UI controls like text boxes, sliders, buttons, etc., and bqplot, an interactive 2-D plotting library developed at Bloomberg. Bqplot is built on the same framework as ipywidgets, so the two can be seamlessly integrated to build rich interactive applications. The nice thing about these libraries is that the attributes of widgets can be easily linked using Python callbacks.

Now, I'll show you some live examples in Jupyter Notebooks. Ipywidgets has extensive documentation on Read the Docs: you have the complete widget list with all the code and documentation, and you can also look at the documentation to understand how to lay out these widgets using the various layout classes.

One more important thing is widget events - as I mentioned before, you need to understand how to link these widgets using callbacks and events. The code is right here. There's also a lot of documentation on building your own custom widgets by writing the Python class and the JavaScript modules.

You can go through this website to get the complete understanding of how these interactive widget libraries work. Bqplot, as I was mentioning before, is open sourced on GitHub.

Let's now look at a simple example; I'll explain how these interactive widget libraries work at a high level. Every interactive widget has two components. There is the Python object, which is the interface the user sees and uses, and there is the visual representation, which is mostly implemented in JavaScript. For example, here we have an integer slider to which you can give a description and a value. This is the visual representation here, which is implemented in JavaScript. The nice thing about this widget is that the JavaScript is totally hidden from the user, but that's what provides all the interactivity. The user just deals with the Python attributes, and when the attributes are changed, events are sent.

By capturing those events and registering callbacks, you can create the interactivity. Let me show you an example: I'm creating an integer slider, this is the slider object, and I can render it directly and you see the visual representation here. It's also important to understand the bi-directional communication between the Python backend and the JavaScript frontend. For example, when I move the slider, the Python attribute value gets updated, and when I change the value in the Python backend, the frontend automatically updates. That's what I mean by bi-directional communication between the frontend and the backend, and it's totally seamless. One of the nicest features of these interactive widgets is that you just deal with Python code; there is absolutely no JavaScript you have to know. It's all under the hood and hidden from the user.
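A minimal sketch of that pattern using the public ipywidgets API (the callback body is just an illustration):

```python
import ipywidgets as widgets

slider = widgets.IntSlider(description='value', min=0, max=100, value=42)

def on_value_change(change):
    # Fires whenever the 'value' trait changes, whether the user drags the
    # slider in the browser or Python assigns slider.value = ... directly.
    print('new value:', change['new'])

slider.observe(on_value_change, names='value')
slider  # displaying the Python object renders the JavaScript frontend
```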

K-Means Clustering

With that background, let's look at some examples. I'll start with theoretical machine learning examples. The first example is K-means clustering, and along the way I want to highlight some nice aspects of the Jupyter Notebook. For example, here I'm writing documentation which explains how the algorithm works.

Let me give you some brief background on K-means clustering. It's an unsupervised learning algorithm used for grouping or clustering a cloud of points in a high-dimensional space. We first randomly pick K points, where K is the number of clusters we want to create, and then we alternate between two steps. The first is the assignment step, where we take each observation and assign it to the cluster with the closest centroid. For example, we take sample one and ask which of the K centroids this point is closest to; if it's closest to the first centroid, we assign it to cluster one. Similarly, we assign clusters based on the proximity of the points to the centroids. Once we assign the clusters, we update the centroids by taking the average of all the points in each cluster. Once we have the new centroids, we repeat the cycle: we reassign the clusters and recompute the centroids, and we keep doing this iteratively until we converge, when the assignments no longer change. This is called Lloyd's algorithm. It's a heuristic algorithm, because the optimization problem is NP-hard and this is a greedy approach, so there's no guarantee that you'll reach the global minimum.
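The two alternating steps map directly onto array operations. Here is a minimal NumPy sketch of Lloyd's algorithm, assuming every cluster keeps at least one point (the names are mine, not from the talk):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k randomly chosen observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # assignments stopped changing
            break
        centroids = new_centroids
    return centroids, labels
```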

Let's now look at a nice visualization which helps us understand this algorithm. The whole algorithm is implemented in NumPy: I'm assigning the clusters and updating the centroids right here. Let me run this example. We have three sliders: the first represents the number of points, K is the number of clusters, and the third controls the cluster separation, so by changing it we can make the clusters very close to each other or very far apart.

Let me start the animation. We can see the algorithm updating in real time, and the centroids correctly identify the clusters. I'll change the number of points to 300 and choose 5 clusters. As you can see, the algorithm iteratively finds the correct clusters and updates the centroids. I can retry the algorithm by randomly assigning different starting clusters. Here we see a problem: the algorithm got stuck in a local minimum because it's treating these two as the same cluster. As I mentioned before, this algorithm is heuristic, and it got stuck in a local minimum. If we retry the algorithm, it fixes the problem and we get five distinct clusters.

These kinds of interactive plots really help us understand machine learning and mathematical models. As I mentioned before, this is a nice example of literate programming: we provide the documentation of the algorithm, we write all the code here, and then we have a nice interactive plot which helps us understand the model. By tweaking different parameter settings, we can fully understand the impact of those parameters on the model.

Gradient Descent

The next example is gradient descent, the workhorse of the optimization problems we see in deep learning. When we do backpropagation, the weights are updated by doing gradient descent.

Let me give you some intuition for how gradient descent works. Let's say we have a multivariate function f, where x is a multivariate input, and the function is differentiable at a point a. The key idea is that the value of f decreases when we go in the negative direction of the gradient, which is the first-order derivative of the function with respect to all the inputs. When we go in the negative direction, we are going downhill, and we will eventually find a minimum of the function. The gradient descent algorithm works as follows: we first choose a starting point x0.

Then we generate the sequence x0, x1, x2, ... iteratively using the update step x_{n+1} = x_n - η ∇f(x_n), where η is the learning rate. Since we are going in the negative direction of the gradient, we converge to a minimum of the function. The whole algorithm is implemented in NumPy, and now we can look at a nice demonstration. Here I have a non-convex function, which means it has multiple local minima. The goal is to reach the global minimum here, but depending on where you start and on the learning rate, you can end up anywhere. I'm starting at 2.4, which is somewhere here. As you can see, the iterates go downhill and reach a minimum, but it's not the global minimum; it's a local minimum.

Let's see what happens when we increase the learning rate. As you can see, the algorithm is not converging; it's diverging. Let's now try a different starting point. We quickly see that the algorithm converges to the global minimum, so it's very important to choose a good starting point. Now let's see what happens when I increase the learning rate from that same starting point. It ended up at a different local minimum even though we started at the right place. What's happening is that, because of the very high learning rate, the next step, instead of landing here, went all the way over there, where the gradient is very large, so it took another big step and ended up in a different local minimum. These kinds of interactive plots really help us understand optimization algorithms and models, and these notebooks can be used as training materials when you want to teach data science and machine learning algorithms to students.
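For reference, the update rule above fits in a few lines of NumPy. This is a minimal sketch; the example function and parameters are my own, not the ones in the demo:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.01, n_steps=200, tol=1e-8):
    """Iterate x_{n+1} = x_n - eta * grad_f(x_n) and record the path."""
    x, path = x0, [x0]
    for _ in range(n_steps):
        step = eta * grad_f(x)
        x = x - step
        path.append(x)
        if abs(step) < tol:  # stop once the updates become negligible
            break
    return np.array(path)

# Non-convex example: f(x) = x**4 - 3*x**2 + x has two local minima,
# so the endpoint depends on the starting point and the learning rate.
grad = lambda x: 4 * x**3 - 6 * x + 1
path = gradient_descent(grad, x0=2.4)
print(path[-1])  # converges to the local (not global) minimum near x ~ 1.13
```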

Gaussian Processes

The final example is Gaussian processes. Here's some brief intuition about these models. We start with a multivariate Gaussian distribution: we have a multivariate vector x, and the Gaussian distribution is given by two parameters, μ and Σ, where μ is the mean vector and Σ is the k-by-k covariance matrix. We can think of the covariance matrix as saying that points which are close to each other have a higher covariance.

A Gaussian process is a regression model; what we are trying to learn is a Gaussian distribution over a set of functions. We start with the prior assumption that the function values are normally distributed with mean 0 and covariance K. Then, as we see training samples, we do a Bayesian update and compute the posterior distribution.

Since this is a Gaussian distribution, we can compute the mean and covariance in closed form, and we can keep adjusting them. I implemented the whole algorithm in NumPy. Let me explain how it works. Say we have a set of test samples for which we don't know the value of y; we are trying to compute f(x). The model starts with the assumption that y is all zeros - that's the mean of the prior distribution. As we keep adding training samples, we recompute the distribution and the regression line changes. Since it's a distribution, I'm drawing five samples from it.

We see that the samples are all over the place, and we can look at the standard deviation bands they fall within. Now let me add a training sample here: this is the x and this is the y, so we know that for this x, that's the value of y. The distribution automatically moves to take that point into account. If you look at the uncertainty, you see that around this point there is none, because we know for sure that for that value of x, this is the value of y. As we keep adding inputs, the uncertainty is resolved and the model learns the regression function better and better.

For example, between this point and this point we have a lot of uncertainty, so the confidence intervals are very wide. If we keep adding points here, the uncertainty goes away and we learn the function much better. That's the intuition behind Gaussian processes: this is the regression line, and as we see more and more points, the model learns the regression function better and better. These kinds of visualizations are very helpful for understanding complex models. Gaussian processes are actually used in Bayesian optimization; I won't go into details, but the goal there is to find the next point to evaluate by maximizing an acquisition function. Bayesian optimization has a lot of applications in hyperparameter tuning and deep learning, which is why Gaussian processes are very important.
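The closed-form posterior update he describes can be sketched in a few lines of NumPy. This is an illustration under my own assumptions (an RBF kernel and essentially noise-free observations), not the code from the talk:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel: nearby inputs covary strongly."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, jitter=1e-8):
    """Posterior mean and covariance of a zero-mean GP, in closed form."""
    K = rbf_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    K_inv = np.linalg.inv(K)
    mu = K_s.T @ K_inv @ y_train        # posterior mean (the regression line)
    cov = K_ss - K_s.T @ K_inv @ K_s    # posterior covariance (the uncertainty)
    return mu, cov

x_tr, y_tr = np.array([-1.0, 0.5]), np.array([0.3, -0.2])
x_te = np.linspace(-3.0, 3.0, 50)
mu, cov = gp_posterior(x_tr, y_tr, x_te)
# Draw five sample functions from the posterior, as in the demo.
samples = np.random.multivariate_normal(mu, cov + 1e-8 * np.eye(len(x_te)), 5)
```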

Financial Use Cases

Let's now move on to financial use cases. In finance, time series analysis is very important. In bqplot, we created a rich selector framework where you can select a subset of time periods and do statistical analysis on it. For example, here we have the time series of the S&P 500 index, a very important equity index in the U.S. The series runs from '95 all the way to 2019, and here we have the histogram of returns. In finance, when we build models, the standard assumption is that returns are normally distributed, but empirically we can clearly see that's not the case: there are heavy tails on the left, with very abnormal negative returns, which we can easily see from here.

One aspect of financial time series analysis, or any time series analysis, is that we want to perform statistical analysis on a subset of time periods: we want to see how performance looks during the crisis, or during the dot-com boom, and so on. We have selectors in bqplot; let me activate the selector by clicking on the plot. As you can see, the interval selector responds to mouse moves: when I move the mouse pointer up, I expand the interval, and when I move it down, I contract the interval. I'm doing statistical analysis by registering callbacks. Whenever I move the interval, the start and end points are updated, and by slicing into the data frame I can easily do any statistical computation. In this case, I'm computing the total return from this period to this period.
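A minimal sketch of that pattern with bqplot's FastIntervalSelector; the price series and file name here are hypothetical stand-ins for the S&P 500 data in the demo:

```python
import pandas as pd
import bqplot.pyplot as plt
from bqplot.interacts import FastIntervalSelector

# Hypothetical frame of daily index levels, indexed by date.
prices = pd.read_csv('spx.csv', index_col=0, parse_dates=True)['close']

fig = plt.figure(title='S&P 500')
line = plt.plot(prices.index.values, prices.values)
selector = FastIntervalSelector(scale=line.scales['x'], marks=[line])
fig.interaction = selector

def on_interval_change(change):
    start, end = change['new']            # the interval's endpoints
    window = prices.loc[start:end]        # slice the frame on the selection
    total_return = window.iloc[-1] / window.iloc[0] - 1
    fig.title = f'S&P 500: total return {total_return:.1%}'

selector.observe(on_interval_change, names='selected')
fig
```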

Accordingly, I'm also showing a trend line, and notice that its color changes: if the trend is negative it becomes red, and when the trend is positive it becomes green. The thickness of the line also changes based on the trend indicator - the stronger the trend, the thicker the line. Let's look at an interesting time period. This is the financial crisis, where you lost 40% if you had invested in the index. What's remarkable is the last nine years: if you did nothing but buy the index, you would have almost quadrupled your investment - almost 4x the return.

These selector frameworks are very helpful: you can do advanced time series analysis, slice into the data frame, pick the periods where you want to do the analysis, and look at the results in real time. Here I'm only showing the performance of the S&P 500 index, but the S&P 500 actually consists of 500 stocks grouped into different sectors. When you're making equity investments, you'd like to look at the performance of all these stocks. That's the next example.

What we're trying to do here is understand the performance of all the S&P 500 constituents. There are 500 stocks in this index, grouped into sectors like information technology, utilities, healthcare, and so on. The crux of this visualization is the tilemap, which serves multiple purposes. It can be used as a heat map - we can encode each cell with a numerical value and its color changes accordingly - it supports tooltips, and it also works as a grouping widget. I'm grouping stocks here by sector: healthcare, industrials, IT, etc. You can also click and select specific cells to look at how those stocks performed over the last eight years. I can color code by different numerical values; currently I'm using market cap, and we see that the usual suspects like Amazon, Microsoft, Google, and Apple have very high market caps.
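The tilemap is bqplot's MarketMap widget. A minimal sketch of the pattern he describes, with made-up tickers and values standing in for the real S&P 500 data:

```python
from bqplot import ColorScale
from bqplot.market_map import MarketMap

tickers = ['AAPL', 'MSFT', 'AMZN', 'XOM', 'JNJ', 'NEE']
sectors = ['IT', 'IT', 'Cons. Disc.', 'Energy', 'Health Care', 'Utilities']
mkt_caps = [900, 870, 800, 350, 380, 90]   # made-up values, in $bn

tile_map = MarketMap(names=tickers,
                     groups=sectors,       # tiles are grouped by sector
                     color=mkt_caps,       # heat-map encoding per cell
                     scales={'color': ColorScale(scheme='Greens')})

def on_tile_select(change):
    # change['new'] is the list of clicked ticker names; a real dashboard
    # would redraw the equity curves and metrics table for those names.
    print('selected:', change['new'])

tile_map.observe(on_tile_select, names='selected')
tile_map
```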

We can also color code by dividend yield and easily identify sectors which pay good dividends. Healthcare and communication services don't pay much in dividends, but if you look at energy and utilities, almost every company in those sectors pays dividends; we can see that at a glance. You can also color code by volatility, which is the risk in the stock - how much the stock moves on a daily basis. The higher the volatility, the higher the risk, which is why those stocks are coded in red. These stocks are extremely volatile; if you click one, you can see that it's going all over the place. You can also color code by month-to-date return. In the month of May, for example, the energy sector has not been doing well, while real estate has been doing well, and you can easily see that from this visualization.

Let me click on specific stocks. For example, I'll pick tech stocks and see how they performed over the last eight years. The equity curve is nothing but the value of $100 invested in the stock: if I invest $100 in this stock, over time you see the value went up a lot. In this table, I'm showing different performance metrics: the volatility, the Sharpe ratio (which is a risk-adjusted return), etc. I can also compare stocks in the same sector or across sectors, and I can sort all the stocks by descending Sharpe ratio, so the stocks with the highest Sharpe ratios are the best performers. For example, MA - I think that's Mastercard - has been a really good investment, as you can see. You can also pick specific time periods to see the performance of these stocks.
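For reference, those metrics are one-liners in pandas. A sketch with a hypothetical daily-returns series (the √252 factor annualizes daily statistics):

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns for one stock.
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0005, 0.01, size=2000))

equity_curve = 100 * (1 + returns).cumprod()            # value of $100 invested
volatility = returns.std() * np.sqrt(252)               # annualized volatility
sharpe = returns.mean() / returns.std() * np.sqrt(252)  # risk-adjusted return
```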

This whole visualization is built with just bqplot and ipywidgets. As you can see, it's very easy to assemble these widgets like Lego building blocks into a rich visualization. You can also search by name, and the matching stock automatically gets picked. That's all I have for this visualization; it provides a very good picture of the overall U.S. equity market in one place.

Wealth of Nations

The last use case is based on a dataset called Wealth of Nations. Here, I'm going to show you how to build the kind of visualizations that tools like Tableau provide. Let me explain the dataset. We have 200 countries and 3 data points per country, each of them a time series from 1800 through 2008: income, life expectancy, and population. We also have the countries grouped by region. This is a very canonical dataset; in tools like Tableau, we can visually select these attributes and create a bubble chart, and in bqplot we can do the same thing.

Here, I created a bubble chart. The x-axis is income, the y-axis is life expectancy, and the size of the bubble is the population. When you hover over a bubble, you can see the evolution of that country over the years. I'm showing the bubbles for one particular year, 1800 - all the values are for that year. Here I have a slider; if I move it, you can see the values of the bubbles for different time periods - this is the year 2008. Ipywidgets has a nice playback widget which I can click, and it will animate the whole bubble chart through time. Let me do that.

We can see the interesting evolution of these countries as we go forward year by year. All the playback button is doing is linking to the value of the year slider; when the year slider's value updates, I update the bubble chart by linking them with callbacks. You see some interesting things here. Obviously, China, India, and the other Asian countries have very high populations. You can see all the African countries here; unfortunately, they have lower income and lower life expectancy. We also see small countries - Qatar, Luxembourg, Macau, Norway, and Singapore - with very high life expectancy and income per capita. The bubbles are color coded by region.
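The play-button wiring is two widgets and a link. A minimal sketch; the chart-update body is a placeholder for the bubble chart in the demo:

```python
import ipywidgets as widgets

play = widgets.Play(min=1800, max=2008, interval=100)        # animation driver
year_slider = widgets.IntSlider(min=1800, max=2008, description='Year')
widgets.jslink((play, 'value'), (year_slider, 'value'))      # keep them in sync

def on_year_change(change):
    year = change['new']
    # Placeholder: push that year's slice into the scatter mark, e.g.
    # scatter.x, scatter.y = income.loc[year], life_expectancy.loc[year]
    pass

year_slider.observe(on_year_change, names='value')
widgets.HBox([play, year_slider])
```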

This is a very standard example - I think there is a TED Talk based on it. We simply recreated that visualization in bqplot, and it's actually easy to build these kinds of visualizations. The code is all Python, so you don't have to learn any JavaScript.

Choropleth

A choropleth is a heat map rendered on a geographical map, so we can look at the numerical values associated with countries directly on a world map. Here we have the world map, and you can hover to identify each country. Let's pick one attribute, income, and look at how it changes over time. Instead of looking at individual names, we can see the changes directly on the world map. One more nice thing about bqplot is that you can select countries and register callbacks on the selected attribute. For example, here I want to look at the time series of China's income from 1800 to 2008; we clearly see that after the 1980s it just shot up. We can look at different countries and compare them - we can look at Brazil.

This aspect of bqplot is what separates it from other libraries: I can click on any chart, select specific components of it, and then, by registering callbacks, update a time series or any other plot. That's how we link different plots together. Now let's look at life expectancy. This visualization also shows us the missing countries - for example, there's no data for Egypt or Russia in this dataset. We see that life expectancies are unfortunately very low in Africa, but high in Europe, the Americas, and Asia - here are China, Australia, Norway, Indonesia, and Brazil. We can also see the availability of data points in this dataset; there is not much data available here.
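A sketch of that select-and-link pattern on bqplot's map mark; the country ids and income values below are hypothetical placeholders:

```python
import bqplot.pyplot as plt
from bqplot import ColorScale

# Hypothetical {country id: income} mapping for the heat-map encoding.
income_by_id = {356: 2750, 156: 6810, 840: 41000}

fig = plt.figure(title='Income')
world = plt.geo(map_data='WorldMap',
                color=income_by_id,
                scales={'color': ColorScale(scheme='Greens')})
world.interactions = {'click': 'select'}

def on_country_select(change):
    # change['new'] holds the ids of the clicked countries; a real dashboard
    # would redraw the linked time-series plot for those countries.
    print('selected country ids:', change['new'])

world.observe(on_country_select, names='selected')
fig
```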

Finally, let's look at population. Here we have the population heat map, and we clearly see that India, China, and the Americas have the highest populations. You can see that the U.S. population increased gradually, while the population of India shot up exponentially after the 1940s. The same is true of China, though the growth was gradual compared to India.

To summarize, in bqplot we can easily integrate and link different plots seamlessly, and it's all Python - there is absolutely no JavaScript; it's hidden from the user. Bqplot is built using D3.js, which is a very popular JavaScript plotting library. That's what differentiates bqplot from visualization tools like Tableau and other packages. I agree that in Tableau you can do everything without writing a single line of code, just by dragging and dropping, but in bqplot we can do much more, all in the Jupyter Notebook, especially when you have to explain your models or your research. Having these kinds of interactive plots is extremely helpful for conveying your message to stakeholders.

Twitter Sentiment Analysis

Here's one last example; let me briefly explain what it's about. This is Twitter sentiment analysis. It's a machine learning problem: we are given tweets about companies, and we want to label them as negative, neutral, or positive based on the sentiment expressed in the tweet. For example, if someone says, "This stock is a strong buy," that's a positive sentiment, so the label will be positive. Similarly, if they say, "The stock is a sell. It's not doing well," the tweet gets a negative label.

These sentiment labels are important because investors can build trading strategies based on the sentiment expressed in news and tweets. For example, they'd like to buy stocks which have positive sentiment and simultaneously sell, or sell short, the stocks which have negative sentiment. Bloomberg provides the sentiment score as an enterprise data feed to hedge funds and other clients.

The problem with building machine learning models on text data is that we first need to create the labeled dataset. We need training data, so someone, unfortunately, has to do this manually. There are companies - CrowdFlower, Amazon's Mechanical Turk, and many others - which manually tag all these tweets based on the sentiment expressed in each one.

Bloomberg outsources this manual process to consultants, and unfortunately they don't always have the financial expertise to tag these tweets, so they do make mistakes. We get around 100,000 tweets as a training dataset. We then split it into a training set and a test set, train the models on the training set, and use the test set to evaluate the performance of our models. I built this dashboard to explain the performance of the machine learning classification model on the test dataset.

On the test dataset, we already know the labels, so by comparing the predicted labels with the actual labels we can evaluate the performance of the model. The confusion matrix is a nice way of understanding the misclassifications: it buckets the results into cells based on the actual and the predicted labels. By definition, all the diagonal entries are correct predictions, because there the actual and predicted labels match.
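As a quick illustration of that bucketing, here is a toy confusion matrix computed with scikit-learn (the labels below are made up, not the Bloomberg data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ['negative', 'neutral', 'positive']
y_true = np.array(['positive', 'neutral', 'negative', 'positive', 'neutral'])
y_pred = np.array(['positive', 'negative', 'negative', 'neutral', 'neutral'])

# Rows are actual labels, columns are predicted labels;
# the diagonal holds the correct predictions.
print(confusion_matrix(y_true, y_pred, labels=labels))
```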

All the off-diagonal entries are misclassifications, because the predicted and actual labels do not match. Let me click on this cell here: the model is saying all these tweets are positive, but they were all tagged as negative by the human readers. Let's look at an example: "Weight Watchers cease, profits slimmed." There is a subtle polarity-inverting word, "slimmed," which totally changes the meaning of the tweet, and unfortunately the model is not able to catch that. This pie chart shows the three predicted probabilities, and since it's a logistic regression, it's an interpretable model: we can see that the model predicts positive because it gives a lot of importance to the token "profit," while the word "slim" gets zero importance. It's a clear indication that we need to do some feature engineering to account for these polarity-inverting words.

Let me explain what this is. We can visualize the model's predicted probabilities - a tuple of three numbers - as a point inside an equilateral triangle, where the three vertices represent the three labels. If points are very close to the positive vertex, the model is assigning them a very high positive probability. Each point here represents a tweet. Let's see how we can use this triangle to catch some data issues. Here, I'm selecting a cell where the model's predictions are all positive, but the tweets were all tagged as neutral by the human readers. If you think about it, that's where some subjectivity is involved: what makes a tweet positive versus neutral, or negative versus neutral? The humans can get confused here.
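The triangle is just a barycentric embedding of the probability vectors. A small sketch of the mapping (the vertex placement is my choice):

```python
import numpy as np

# Vertices of an equilateral triangle: negative, neutral, positive.
V = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

def to_triangle(probs):
    """Map rows of class probabilities (summing to 1) to 2-D points."""
    return probs @ V

probs = np.array([[0.1, 0.2, 0.7],    # mostly positive: lands near the top vertex
                  [0.8, 0.1, 0.1]])   # mostly negative: lands near (0, 0)
print(to_triangle(probs))
```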

Let's see how the triangle visualization can help us catch these data issues. Looking at the triangle, the points close to the positive vertex are interesting, because the model is assigning a very high probability that these tweets are positive. Let me use the lasso to select some of these tweets; they get filtered here, so let me read them to you: "Microsoft sales beat Street hopes, cloud profits up." "Wall Street profit beats estimates." There's a price target raised to 49; there's a stock upgrade. As you can see, all these tweets are actually positive, and the model seems to be doing the right thing by labeling them positive. These definitely look like data issues to me.

The way we found these data issues is by zeroing in on the points where the model assigns a very high positive probability. These kinds of interactive visualizations and dashboards are extremely helpful in explaining complex models: you can fully understand how the model is making its predictions, and we also saw how to identify data issues using this visualization.

Questions and Answers

Moderator: This is amazing - these are very beautiful and very interactive visualizations. I was wondering about their scalability. Do you have any experience of the limits, in terms of data size, that you can safely plot and still be reasonably interactive with this kind of library?

Cherukuri: Bqplot is implemented using D3.js, which is based on SVG. SVG creates a DOM node for each element - for example, one DOM node per scatter point. If you want to plot a million points in a scatter plot, it's not going to work well, because it would create a million DOM nodes in the browser, and we are looking at canvas-based solutions for those kinds of problems. Most of the time it works really well and the interactivity is seamless; only in the extreme cases where you want to visualize millions of points, which are very rare, will it break down.

Moderator: We can always down-sample at some point if we need to.

Cherukuri: Sure.

 


 

Recorded at:

Sep 19, 2019
