
Scaling Deep Learning to Petaflops and beyond!


Summary

Prabhat explores 2D and 3D convolutional architectures for solving pattern classification, regression and segmentation problems in high-energy physics, cosmology and climate science.

Bio

Prabhat leads the Data and Analytics Services team at NERSC; his group is responsible for supporting over 7000 scientific users on NERSC’s HPC systems. Prabhat is the Director of the Big Data Center collaboration between NERSC, Intel, Cray, UC Berkeley, UC Davis, NYU, UBC, Oxford and Liverpool.

About the conference

QCon.ai is a practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.

Transcript

Prabhat: I'm Prabhat, I lead the Data and Analytics Services group at the NERSC supercomputing center, which is at Lawrence Berkeley National Lab. The speaker right before me was from UC Berkeley; we are right up the hill from there. Sid invited me to talk about scaling deep learning, and we changed the title to "Scaling to Petaflops and beyond."

What I wanted to do for the next 40 minutes or so is to motivate why we need to scale deep learning. I don't want to talk about it in the abstract; I want to present two case studies: the most scalable implementation of deep learning on a CPU system, where we achieved the 15-petaflop mark, and then the most scalable implementation of deep learning on a GPU system, where we achieved an exaflop level of computing. Those two case studies are fairly intensive; it took us a lot of effort to get them right and we learned a lot of lessons the hard way. I want to conclude with some of the open challenges which are relevant.

First, before we get into the big scaling studies, I did want to motivate or at least highlight where I come from, I come from a science background. The NERSC supercomputing center is a mission facility for supporting simulation workloads across the Department of Energy. There are many scientific areas that we work on, spanning cosmology simulations of the whole universe, climate science simulations, all the way down to subatomic physics. We really cover the entire gamut. I think one of the more interesting things about NERSC is that we support many users, about 7,000 users working on 800 different projects.

Historically supercomputing centers like NERSC have all been about simulation workloads, that's the thing you want to make sure that your HPC system does well. But increasingly what we see is that data analysis, and in particular machine learning and deep learning, are becoming more and more important. From here on out, we're trying to do our best to make sure that the same system, the same HPC machine can support both simulation and data analysis workloads.

We have a tradition of naming our HPC systems after scientists. Cori is our big machine, it's number 10 on the Top500 list of supercomputers, and right now this is a CPU-only system. It has Intel Haswell and Intel Knights Landing nodes, about 12,000 or so nodes in total. I think one of the more interesting things about Cori, apart from the compute, is I/O. When you try to do deep learning, a lot of what you're going to be doing is reading data as fast as you can. HPC systems historically have really not had on-node disk; they do have on-node memory, but they do not have on-node disk. There's a limit to how far you can push file systems like Lustre, GPFS, and so on.

One of the interesting things about Cori is this 1.5-petabyte DataWarp product. It's a big pool of SSDs that every node can read and write data from, at a fairly high rate, about 1.5 terabytes per second. It turns out that for deep learning, that sort of solution is absolutely key if you want to be doing deep learning at scale and at high rates.

NERSC Big Data Stack

Now, just commenting on the data stack a little bit: in order to support simulation tools, we need to make sure that Fortran works and C works, and that's pretty much it. The data space is more varied; there are many more technologies and really a lot more services that you have to provide. I just wanted to at least call out that when you think about supporting data, you need to make sure that your users can move data from one supercomputing center to another. Once data lands at a supercomputer, it should be accessible for the rest of the community. There are a number of technologies, like Jupyter, which become key for enabling shared analysis.

Workflows are key: if you want to be repeating the same analysis again and again for a big dataset, then you want to automate that somehow, you want to automate failure recovery, and so on. Data management is really important: if your project or your instrument, be it the Large Hadron Collider, a telescope, or a microscope, is going to produce hundreds of terabytes of data over its lifetime, then you want to make sure that you're not storing data in text CSV files; that's not going to cut it at some point. There are modern libraries, HDF5, NetCDF, ROOT, that have been developed by the scientific community, that you may choose to use. Certainly we do have some database technology that we support as well.

Visualization is important; humans understand images fairly well, and movies, so creating images from complex scientific data is key. There are some homegrown tools at the DOE which we've been developing for a while. Obviously when we talk about big data, the buzz is all around analytics; this year, QCon is focused on AI, which is a subset of analytics. I'll quickly note that nobody really in the data analysis world is coding in C or Fortran. This is a big shift from 30 years ago, when languages like Python were not taken seriously on HPC machines; now they are absolutely key, absolutely integral.

Python, Spark, the earlier speaker talked about the AMPLab, so that's a big project that came out of the AMPLab. If you're a statistician, chances are you know about R, and [inaudible 00:05:39] languages like Julia are certainly very promising. As far as general-purpose programming and general-purpose analytics, those four technologies are certainly an option, but this talk is about deep learning. When we think about deep learning at NERSC, these four technologies are the ones that we point users to, and I'm going to bring that up now.

NERSC Deep Learning Stack

If you're a scientist and this is the first time that you want to work on a deep learning project, then Keras is really the framework of choice; that's what we point people to. In Keras, in maybe 6 to 10 lines of code, you're going to code up a network that can work. If you are inclined to really hack your network, plug in custom layers and so on, and you prefer coding in Python, then TensorFlow is certainly an option.
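
To give a concrete feel for what that looks like, here is a minimal Keras sketch of the kind of small convolutional classifier a scientist might start from; the shapes, layer sizes, and random data are purely illustrative, not taken from any of our projects.

    # Minimal Keras sketch: a small convolutional binary classifier.
    # Shapes, hyper-parameters, and data are illustrative only.
    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(64, 64, 16)),           # e.g. a 16-channel climate patch
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(1, activation="sigmoid"),     # binary label: event / no event
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # Train on a toy random dataset just to show the API.
    x = np.random.rand(256, 64, 64, 16).astype("float32")
    y = np.random.randint(0, 2, size=(256, 1))
    model.fit(x, y, batch_size=32, epochs=2)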

Caffe, I think, we've kept there as well. At least two years ago our sentiment was that Caffe was very high performing, and maybe TensorFlow and Python could never keep up, but I think that's been proved wrong in the last year or so. PyTorch is also a modern option which one can consider. These four technologies, at least at NERSC, are the ones that we deploy and point our users to.

There is a very long tail here; every other week, every other month, there is a new deep learning framework that comes out, and for someone who's maintaining a stack in production, it is a challenge to track all the latest technologies and make sure that they are working well on the hardware that you support, that they are scaling, and so on. We are quite confident that these four will actually hold up and perform well.

Mostly when you talk about deep learning software, that's really all that you hear about: you hear about the framework, and then you may hear about the hardware. You may have CPUs or GPUs; those are the two typical flavors of hardware that you hear about. Certainly, there's a lot of activity here, tens of startups in the Bay Area right now developing accelerators for both faster training and faster inference of deep learning, so there's a lot of buzz around hardware at this point in time.

We've pushed very hard on CPU, that's the machine that we have on the floor, but it is clear that GPUs are very competitive for deep learning as well. We try our best to make sure that both CPUs and GPUs can perform well for deep learning. As someone from the outside, that's really all you get to hear about, deep learning frameworks and what hardware you're running on, but one of the things that you don't typically appreciate is that there is stuff in between that you do have to think about. Folks like myself have to make sure that when we deploy Keras or TensorFlow or PyTorch, they can actually run well on hardware. If I'm running on a CPU, and if it's an Intel CPU, then the MKL-DNN library is key in terms of mapping convolution operations and so on effectively onto the Intel chipset. Similarly, if you have a GPU hardware backend, then libraries like cuDNN are absolutely critical. cuDNN certainly had a two or three-year head start over Intel, but slowly I think Intel has caught up in that space.

If what you're doing is deep learning on a single node, a few GPUs or CPUs, then that is probably sufficient for you. But more and more, as the next two case studies will show, if your datasets are really large, and if running on a single node would take you maybe a week to converge, then you do need to run on multiple nodes, and that's why multi-node libraries become important.

At a very low level, things like gRPC, the Google Remote Procedure Call library, and MPI, which has been around for a very long time in the HPC world, are available to you. There are people at Uber who've developed the Horovod framework, and there are also folks at Cray who developed a plugin which makes it easy for you to interface communication primitives to higher-level deep learning frameworks like TensorFlow and PyTorch. That's the entire stack; those are the key technologies you need to know or keep in mind. As a user, of course, all you're going to see is Keras and TensorFlow and PyTorch and Caffe, but there's a bunch of stuff under the covers that someone has to worry about for you.
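
To make the division of labor concrete, here is a rough sketch of how a data-parallel Keras job might be wired up with Horovod over MPI; the toy model and data are placeholders, and the exact launch and tuning details will differ from system to system.

    # Sketch of data-parallel training with Horovod over MPI (assumed setup).
    # Launch with something like: mpirun -np 4 python train.py
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one Horovod rank per MPI process

    # Pin each rank to a single local GPU, if GPUs are present.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Toy model and data, purely for illustration.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    x = np.random.rand(1024, 32).astype("float32")
    y = np.random.randint(0, 2, size=(1024, 1))

    # Scale the learning rate by the number of workers (a common heuristic) and
    # wrap the optimizer so gradients are averaged with an all-reduce each step.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(optimizer=opt, loss="binary_crossentropy")

    # Broadcast the rank-0 weights so every worker starts from the same network.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(x, y, batch_size=64, epochs=2, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)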

Deep Learning for Science

I want to transition to concrete use cases. For us, deep learning is not a hypothetical pipe dream that maybe three years in the future is going to be useful; it's working for us now. I think I asked myself this question around five years ago: can deep learning work for science? It's working for computer vision, it's working for speech recognition, it's working for control systems. How about the world of science? Over the last five years we've shown that deep learning can be fairly effective in modeling, say, galaxy shapes from telescopes; you can produce virtual universes using GAN architectures, and it seems to work fairly well for that.

For high energy physics experiments such as the Large Hadron Collider and some other projects around the world, you can solve binary classification problems much better than the state-of-the-art approaches that have been hand-tuned for decades. For clustering, it seems to be able to separate out different kinds of patterns in sensor data. In biology you can decode ECoG data and produce speech, and there's a revolutionary device called the Oxford Nanopore, which produces a noisy time series; using an LSTM architecture, you can produce a fairly accurate genome sequence. You can check out a blog post that goes into the details of some of these case studies if you'd like.

When you talk to dozens of scientists, which thankfully I do every day, and you ask them, "Hey, would you be interested in deep learning or machine learning or advanced statistics? What are the problems that you really care about? What are your use cases?", this is what you get. This table captures, at NERSC, our leading use cases across astronomy and cosmology and particle physics and plasma physics and so on. Those are the columns, and the rows are the specific tasks that they want to solve; maybe they have a pattern classification task or a regression task. Those two are typically solved in the supervised context, so they typically have labels associated with them. Then you have unsupervised problems, like clustering or dimensionality reduction, where you do not have any labels, and then there are more advanced applications around surrogate models and design of experiments and so on.

Over the past five years the conclusion we've reached is the following. If your scientific domain has labeled data, then chances are that variants of convolution nets, or graph nets, or recurrent nets will likely work. I think we have seen that across the board, the key is having training data.

For unsupervised problems, the results I would say are not as conclusive right now. We've certainly employed auto-encoders for solving clustering tasks, and in some cases, doing dimensionality reduction, but I won't say that they are succeeding right off the bat, I think there's more work to do there.

Surrogate models are a very interesting application space for us. Despite the big machines that we have all over the world, there are some problems for which you just cannot enumerate all possibilities. You cannot simulate all possible universes, or all kinds of future climate regimes, and so on. It may be useful to create a low-dimensional surrogate that can produce data relatively inexpensively, as opposed to solving an expensive 3D system of equations.

There are some very intriguing results in this space right now, but we'll see how this shapes up going forward. There is speculation that reinforcement learning techniques can be used to control experiments, control telescopes, control systems, and make their operations more efficient; that's under investigation. Feature learning, if what you care about is learning relevant features for a task, is something where I think we're seeing across the board that deep learning works fairly well. Anomaly detection, I'm not as sure about right now.

Given that we've been doing this for five years, we recently conducted a survey of machine learning users at NERSC, asking them, "How are you trying to explore deep learning? What are your pain points at this point in time?" We estimate that we have about 200 users, out of the 7,000, who are regularly exercising the deep learning side, and about 160 folks responded to the survey. I think the figure doesn't quite come across, but we asked them, "How long does your model take to converge at this point in time?" A lot of people are in the hours regime right now, but there are certainly some users who are in the days-to-weeks range as well.

Then we asked them, "What is your training dataset size? How large are your training datasets at this point in time?" Certainly there are a lot of people in the tens-of-gigabytes range right now, but there are folks who already have hundreds of gigabytes to tens of terabytes of data at this point in time. It's clear that with large dataset sizes and long training times, there's no way that people are going to be running training jobs on a single node; that is just not happening.

Why Scale Deep Learning?

Really, why do we need to scale deep learning at all, at least in the scientific context? I suspect that for your industrial use cases this may be true as well. Right now our runtimes for 100-gigabyte to terabyte-sized datasets span days, and in some cases weeks; it depends on how complex your architecture is. Everyone is moving towards more complex architectures: they typically start off with a vanilla AlexNet or one of those standard convolutional architectures, and then they want to throw more complexity at it, try hybrid convolutional-LSTM architectures, try space-time convolutions, and so on.

For all of these, say I have a new science problem and I came up with some architecture; there is no reason to believe that the architecture that you started off with is the optimal one for that problem. If you're going to be writing a paper in Nature or Science saying, "Hey, I obtained 95% accuracy with this architecture that I somehow managed to converge," there is no reason to believe that that 95% accuracy is the best possible result for that problem. It is clear that you do need hyper-parameter optimization; we did a good job last week of motivating why you need to explore a big range of architectures.

This step is key, but if each choice of architecture has a week-long runtime, you're going to be limited in how many architectures you are able to explore. This is really why, when you talk about deep learning for science, it is key that we're able to scale deep learning on big HPC machines. Thankfully, we've been supporting high-performing simulation codes for a very long time. This problem turns out to be really well suited for HPC machines, and the case studies will highlight that.

Deep Learning @ 15 PF

I'm going to go through two case studies. The first one is a paper that was published at Supercomputing; the theme for this session is "Papers in Production." The arXiv link for the paper is right here, and this is a collaboration between NERSC and Intel and Stanford. At that point in time, about two years ago, which feels like an eternity in deep learning, the conventional wisdom was that CPUs could not do well on deep learning workloads. I think we tried our best to make sure that we could support an extremely challenging scientific task well on the KNL system that we have.

First, I just want to motivate why we worked on this problem at all. For both this case study and the next one, the driving science problem is one of climate change. What we want to do is to understand how extreme weather is going to change in the future. If you're from California, you may have noted that this winter has been unusually rainy; there has been a lot of rain. You may recall that three years ago there was a big drought, so for a period of three years we had no rain whatsoever.

The weather obviously changes from year to year, but when you think about the long run, say 100 years from now, what will the typical weather in California look like? Is it going to be dry, is it going to be rainy? It is possible now to run high-resolution simulations, at 10-kilometer or 25-kilometer resolution, where we can plug in a range of different climate change scenarios. We can plug in CO2 concentrations and so on, and we can run the simulations out for hundreds of years.

Once we've done that, once we've run the simulation out for 100 years, think of it as a YouTube movie that's going to be playing for 100 years; what you need to do then is to look at that movie and automatically find where the extreme weather patterns might be. In this talk, when we talk about extreme weather patterns, we're referring to phenomena such as atmospheric rivers. Most of the time when we get rain in California, it's actually one of these systems that transports water vapor from the tropics, and the water vapor makes landfall in California and is released when it hits the Sierra Nevada.

We also care about hurricanes; on the East Coast we've had a number of significant events, where hurricanes like Katrina and Sandy made landfall to devastating effect, as well as extratropical cyclones and weather fronts. What we'd like to do is to look at this 100-year-long YouTube movie and automatically find these extreme weather phenomena. Once you've done that, then you can start making quantitative statements about the distribution of these events in the present day, and how that distribution might change 100 years from now.

In order to extract those patterns, this boils down to a computer vision problem. This is a cartoon that compares what people do with images of cats and dogs on YouTube to what you need to do in climate science. If you have an RGB image from YouTube, you might be interested in the problem of classification: is there a cat in this image or not? Yes or no. You might be interested in the problem of localization: "Give me a tight bounding box around the cat." You may be interested in the problem of detection, an image that has many objects: "Give me variable-size bounding boxes around the different kinds of pets." Then segmentation is, "I'm not happy with tight bounding boxes; what I really want is a pixel-level segmentation mask."

In the climate science context, this is what we need to do, except our images are not RGB images. We actually have 16 channels, corresponding to temperature and pressure and so on. Our spatial domains tend to be fairly large; we're not dealing with 32 by 32 pixels, instead we have million-pixel images. We're interested in: given this spatiotemporal patch, is there a hurricane in this image or not? Give me a tight bounding box around the hurricane. Given a global image with many kinds of weather patterns, give me variable-size bounding boxes around these patterns, and then finally segmentation masks as well. For the first case study, we are going to solve the detection problem, and for the second case study, we're going to solve the segmentation problem.

For the detection problem, what we did was to come up with a semi-supervised convolutional architecture; this is not intended to be a tutorial on deep learning, so I'll just quickly walk you through it. We have an encoder piece that is given a million-pixel image. We come up with a compressed representation, and then we ask the network to predict, in a supervised fashion, bounding boxes and class labels for patterns for which we have training data. But there are many patterns for which we do not have training data, we do not have labels, so we also ask the network to decode, off the compressed representation, the bottleneck layer, an image which matches the dimensionality of the input data.

The network is going to simultaneously minimize reconstruction error, going from left to right, and prediction error, going from the left to the middle to the bottom. This architecture was developed in collaboration with folks at MILA.
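
To illustrate the idea only, here is a heavily simplified toy sketch of an encoder with a supervised bounding-box/class head and a reconstruction decoder; it is not the architecture from the paper, and every shape, layer size, and loss weight below is made up.

    # Toy sketch of a semi-supervised encoder/decoder with a supervised head.
    # NOT the network from the paper; shapes and losses are illustrative only.
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    inp = layers.Input(shape=(128, 128, 16))          # multi-channel climate patch

    # Encoder: compress the input into a bottleneck representation.
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inp)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    bottleneck = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)

    # Supervised head: class score and a bounding box, trained on labeled patches.
    h = layers.GlobalAveragePooling2D()(bottleneck)
    cls = layers.Dense(1, activation="sigmoid", name="cls")(h)    # event / no event
    box = layers.Dense(4, name="box")(h)                          # (x, y, w, h)

    # Decoder head: reconstruct the input so unlabeled data still provides signal.
    d = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(bottleneck)
    d = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(d)
    recon = layers.Conv2DTranspose(16, 3, strides=2, padding="same", name="recon")(d)

    model = Model(inp, [cls, box, recon])
    model.compile(
        optimizer="adam",
        loss={"cls": "binary_crossentropy", "box": "mse", "recon": "mse"},
        loss_weights={"cls": 1.0, "box": 1.0, "recon": 0.1},  # arbitrary weighting
    )
    model.summary()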

We can get that to work; we did succeed in converging this. I did want to comment on the deep learning software stack that we used for this problem. Two years ago, the conventional wisdom was that if you wanted to get high performance, you would have to go to something like Caffe, natively in C and C++. Intel was feeling very good about their machine learning scaling library at that point in time. We worked on MKL-DNN to improve it for the CPUs; that's the path that we took through the software stack.

We worked really hard with Intel on making sure that support for convolution layers and deconvolution layers was well done in MKL-DNN. Sid and I were talking about hardware and software startups looking at developing new hardware, and naturally the first thing you want to do is to do well on the standard benchmarks. AlexNet has a certain image size, three channels, square images, and so on. The moment you come in with a realistic application which has non-square aspect ratios and many more channels, the code paths aren't necessarily optimal.

We worked with Intel to make sure that everything was done optimally, and in the end, by the time we did the optimization, out of the six teraflops maximum that you can obtain on a single KNL chip, we were able to obtain two teraflops, which is a fairly respectable fraction of peak for a complex workload.

On the multi-node side, there were some choices to be made. The typical strategy for scaling deep learning is something called data parallelism or batch parallelism. You take your big dataset and slice it up into pieces; different nodes are going to be operating on different portions of the data. When they start up, they all initialize, they each look at a different portion of the dataset, and after they process their portion, the weight updates are going to look a little bit different. It's important that after the networks have done backpropagation, so updated their weights, they talk to each other and make sure that they see the same network before they process the next chunk of data.

This coordination, all the different nodes talking to each other to make sure that the weights look the same, can be done through something called an all-reduce operation. In the HPC world, that's the lingo that we have: all nodes talk to each other and reduce the parameter updates to the same values. That is, I would say, a fairly simple communication pattern that needs to be invoked for deep learning workloads.
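
To illustrate the pattern itself, here is a minimal all-reduce sketch using mpi4py; real libraries such as Horovod, NCCL, or MLSL do this for you with fused, topology-aware reductions, so this is only a conceptual example.

    # Minimal illustration of the all-reduce pattern with mpi4py (assumed installed).
    # Launch with something like: mpirun -np 4 python allreduce_demo.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    size = comm.Get_size()

    # Pretend these are the local gradients (weight updates) computed by this
    # rank on its own shard of the batch.
    local_grads = np.random.rand(1000).astype(np.float32)

    # Sum the gradients across all ranks, then divide to get the average.
    avg_grads = np.empty_like(local_grads)
    comm.Allreduce(local_grads, avg_grads, op=MPI.SUM)
    avg_grads /= size

    # Every rank now applies the same averaged update, so all copies of the
    # network stay identical before the next chunk of data is processed.
    print(comm.Get_rank(), avg_grads[:3])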

That communication pattern can be done in a few flavors; you can have synchronous or asynchronous updates. In the synchronous fashion, all nodes decide that it's time to send their parameter updates, they all block, they do the exchange, and then they move on with their processing. Or it can be done asynchronously: you have a node set aside which is tasked with collecting updates from everyone and sending them along; Google in fact has explored this idea.

There are pros and cons of each. For synchronous updates, you are assured that statistically you're doing the right thing, but a single straggler, a single node that somehow has degraded performance, can slow down the whole operation. With the asynchronous approach, you don't suffer from the effect of stragglers; if one of the nodes is slow, so what? The other nodes just keep moving forward. But statistically, convergence is much harder to obtain in asynchronous communication schemes.

For this project, two years ago, we went with a hybrid strategy. We split up the 9,000 or so nodes into tight, topology-aware compute groups that were on the same rack; within a group, nodes would share parameter updates in a synchronous fashion, and then the groups would communicate with each other asynchronously.

Once you plug all of that in, a high-performance, Caffe-based, MKL-DNN-based convolutional network, plus a hybrid scheme for running on multiple nodes across multiple groups, then you can scale things fairly well. This is a weak scaling configuration; in the HPC world, what that means is that as you throw more nodes at the problem, you start throwing more data at the problem, or vice versa: as you have more and more data, you have more and more nodes available to you. What you would hope is that things continue to scale well. The synchronous scheme is in blue, the hybrid scheme with different numbers of compute groups is in green and orange, and the ideal scaling line is in black.

Generally, things scale well, and overall, in terms of performance metrics, when you add up the FLOPS, the floating-point operations per second, we get fairly high levels of performance: about 15 petaflops on 9,600 Knights Landing nodes. I believe it is still the largest Caffe run on a CPU-based HPC system, and the science results are quite promising; it is possible for a single network to find multiple kinds of weather patterns, tropical cyclones, atmospheric rivers. There are certainly some mistakes that the network makes, which come from not handling scale effectively, but that's something we are certainly looking at.

Most importantly to you, you may not care as much about running on 9,600 nodes, but what is important is that all of the work that we're doing here, along with Intel and Cray and others, is finding its way back into the production software. All of the optimizations for non-square image sizes and multi-channel data in MKL-DNN, and some more advanced stuff in Intel Caffe and MLSL, have found their way into the open source releases; that's been a very good outcome of this effort.

Deep Learning @ 1EF

I want to move on to the second case study. The bounding box results looked OK, but we wanted to move on to segmentation, extracting pixel-level segmentation masks. This was a project that we did last year with NVIDIA and UC Berkeley and Oak Ridge National Lab. One thing that's worth noting is that this particular work was awarded something called the Gordon Bell Prize, which is one of the biggest honors in the field of HPC. You may recall that two weeks ago the Turing Award was given to the godfathers of AI; once you go down one level from the Turing Award, the awards are more specialized: there's an award in theory, there's an award in systems. In HPC, this is one of the top honors, and we were really glad to win this award last year.

This is the team, a lot of very smart people from NVIDIA and UC Berkeley and elsewhere that I had the pleasure of working with. Recall that this is the problem that we're going to go after: it's a pixel-level segmentation task, except now we're operating on million-pixel images, and we have 16 channels per pixel.

This is the segmentation architecture that we ended up using for this problem. It's a variant of the fairly standard DeepLabV3 network; the input is a million-pixel image with 16 channels, and the output is a million-pixel image with three channels. On a pixel-by-pixel basis the network is going to say, "I think this pixel is background, or it is a tropical cyclone, or it is an atmospheric river." Those are the three classes of interest.

This is a fairly deep network; it has about 66 layers, and if you just add up the number of parameters, it's about 43 million. When you process a high-resolution image through this network, we estimate that it takes about 14 teraflops' worth of computation to process a single sample, and we have about 200,000 samples that we need to process. That is the reason why you need a big machine to run this on.
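
As a rough back-of-the-envelope check on why this needs a big machine, using the numbers just quoted and the per-GPU rate reported later in the talk:

    # Back-of-the-envelope estimate using the numbers quoted in the talk.
    flops_per_sample = 14e12        # ~14 teraflops of compute per million-pixel sample
    num_samples      = 200_000      # samples to process in one pass over the data

    total_flops = flops_per_sample * num_samples
    print(f"Work per pass: {total_flops:.2e} FLOPs")   # ~2.8e18, i.e. ~2.8 exaflops of work

    # At an assumed sustained ~40 TFLOP/s per GPU (the FP16 rate reported later
    # in the talk), a single GPU would need on the order of a day per pass:
    sustained_per_gpu = 40e12
    print(f"Single GPU: ~{total_flops / sustained_per_gpu / 3600:.0f} hours per pass")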

When computer vision folks have tried to take on segmentation problems, you'll typically see that the images they're dealing with are maybe a few hundred by a few hundred pixels. No one is really operating at the level of a million pixels. When you try to do that, when you try not to compromise on the science and really operate on the native, high-resolution data, then you do need fairly demanding architectures and fairly demanding compute.

On the software side, we were certainly feeling more confident about TensorFlow last year. There have been a lot of enhancements in TensorFlow, not just on the productivity side of things but also on the performance side. The bindings of TensorFlow to cuDNN are fairly solid at this point in time. On the multi-node side, we went with Horovod and used MPI for communicating between nodes, and NCCL for communicating between GPUs on a single node.

I forgot to mention that we ran this particular architecture on the Summit machine, which is the number one system in the world on the Top500 list. Summit has about 4,000 nodes, and every node has 6 Volta GPUs on it, so overall about 24,000 GPUs. When you want to be doing the parameter updates on a single node, there is the NVLink fabric that connects all 6 GPUs to each other. NCCL is NVIDIA's library for invoking that communication, and that's what we use. Finally, of course, we are using Volta GPUs in this case.

Similar to the exercise with Intel, we worked quite hard with NVIDIA on making sure that we were getting good performance. One of the unique things about the Volta GPU is that it supports native 16-bit computation on the Tensor Cores, and if you can map your computation onto the Tensor Cores, then in theory you can hit up to 120 teraflops on a Volta GPU.

There's an open question in science on whether you can get away with 16-bit computation; data is natively in 32-bit or sometimes even 64-bit precision. Before this project I really wasn't sure whether we could drop down the bit precision and still maintain high accuracy, but in this case, we verified that for the segmentation task, dropping all of your computation to 16-bit does not degrade the accuracy.
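
For reference, in present-day TensorFlow the analogous switch is the Keras mixed-precision policy; this is a minimal sketch of that API, not necessarily the exact mechanism used in our run, which predates it.

    # Sketch of FP16 compute with FP32 master weights via the Keras
    # mixed-precision policy; shapes are illustrative only.
    import tensorflow as tf
    from tensorflow.keras import layers, mixed_precision

    mixed_precision.set_global_policy("mixed_float16")   # compute in FP16 on Tensor Cores

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(128, 128, 16)),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(3, 1, padding="same"),
        # Keep the final activation in float32 for numerical stability of the loss.
        layers.Activation("softmax", dtype="float32"),
    ])

    # Loss scaling guards against FP16 gradient underflow.
    opt = mixed_precision.LossScaleOptimizer(tf.keras.optimizers.SGD(0.01))
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")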

Once we made sure that that was the case, that there was no accuracy degradation while running in 16-bit, we were able to obtain about 40 teraflops on a single GPU, which is about a third of the max capability of a Volta GPU. That's quite a remarkable number; recall that the KNL peak was six teraflops and we got two teraflops. In this case, by dropping to 16-bit, we are able to get to 40 teraflops. When you hear the buzz in the media around GPUs for deep learning, this is the difference: when you drop down to 16-bit computation, the theoretical max that a Volta GPU can give you is significantly higher.

For this particular case, once we locked down the single-node performance, we came up with a similar data-parallel strategy for doing our updates in a synchronous fashion. As I mentioned, there is a hierarchical scheme: all 24,000 GPUs are not talking to each other directly; if they all started talking to each other, this implementation would just not scale. The 6 GPUs on a single node talk to each other, then 1 GPU per node owns the token, and those 4,000 GPUs talk to each other; that's how we take the topology of how the machine is laid out into account before we attempt the scaling. Once you get that right, then you can certainly scale really well.

On the Y-axis here are petaflops per second, and in a weak scaling configuration, this thing scales up quite well. By the numbers, on 27,000-plus Volta GPUs, you can hit 1.13 exaflops, both peak and sustained. The HPC community has talked about exascale computing for about a decade, about how it's important for science and how it's coming. There's a big race right now between the U.S., China, and Europe on deploying the first exascale machine. If you can drop down to 16-bit computation, you can get a billion billion floating-point operations per second right now. This is the reason why I believe we were awarded the Gordon Bell Prize last year: because we've waited a decade for this.

What is maybe even more rewarding is that this is the first exascale deep learning application, and we didn't code this up in assembly. The developer was still working in TensorFlow, they expressed their network in TensorFlow, and Horovod is fairly high level, you can set up your communication patterns there. You really don't have to compromise on performance in the name of productivity. There's been this tension in the HPC world, in that you have to give up on productivity to get good performance, while the data world is all about productivity but maybe not as much about performance; this project shows that you can have both.

The science results that we get are also fairly compelling: on a global scale, a million pixels, you can get high-quality segmentation masks. The blue dots and the red dots are the network's predictions, and the black contour lines are the ground truth; it matches up fairly well. A lot of the things that we did in TensorFlow and Horovod eventually made it back into open source.

Open Challenges

After having done this, so after having looked at deep learning use cases for five years and done these big hero runs on machines, we've learned three things the hard way, and I wanted to conclude on that note. One, data management: a lot of the buzz is around compute, faster accelerators, making sure that the software isn't in the way. Unfortunately, what's getting ignored is that at scale, you are going to be I/O bound, and if you do not have the right storage substrate, and if your I/O middleware is not tuned properly, that quickly becomes an issue.

In both cases, when we scaled Caffe out on Cori and then TensorFlow out on Summit, the Lustre file system and the GPFS file system fell apart quickly on those systems. Really the saving grace was the burst buffer SSD pool and the node-local NVMe. We had to pre-stage our data on both of these flavors of hardware, and at that point the I/O rates were such that the GPUs and the CPUs could actually ingest the data. That is really important; there's an interview that you can check out on The Next Platform if you care about those issues.
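
The pre-staging itself can be as simple as copying the training files onto the fast local tier before the job starts; a minimal sketch, with hypothetical paths, might look like this:

    # Sketch of pre-staging training files from the parallel file system to
    # node-local NVMe before training starts. Paths are hypothetical examples.
    import os
    import shutil

    SRC = "/global/project/climate/train"                 # hypothetical parallel-FS location
    DST = os.environ.get("LOCAL_SCRATCH", "/tmp/train")   # hypothetical node-local NVMe

    os.makedirs(DST, exist_ok=True)
    for name in os.listdir(SRC):
        if name.endswith(".h5"):                          # e.g. HDF5 sample files
            shutil.copy(os.path.join(SRC, name), os.path.join(DST, name))

    # The training job then reads exclusively from DST, so per-step I/O hits
    # local flash instead of hammering Lustre/GPFS from thousands of nodes.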

Hyper-parameter tuning: if you are in my world, where you have access to this big machine, 1,000 nodes, 10,000 nodes, and you get a window of 4 hours to do this big hero run, there is really no reason to believe that the learning rate curriculum that you've come up with, the optimizer, how you decay the learning rate, is the optimal choice; there's no guarantee of that. Really, we need to be able to go from smaller-scale runs, quick runs on 100 nodes, 500 nodes, 1,000 nodes, and have those inform what happens at larger scale. Right now, unfortunately, the heuristics are just not in place; that's another key thing.

Convergence: I'm not going to have time to cover that today, but we can certainly get things to converge. It's not that all these flops are for nothing; we are getting things to converge, and they do converge in a reliable fashion. But there's certainly an issue in that we are subject to suboptimal choices of hyper-parameters and to the large batch size. In some cases, our batch sizes are in the tens of thousands, and it is not clear what sort of learning rate curriculum you should set up for such large batch sizes. There are some ideas out in the community on dynamic batch scaling, but getting things to work at this scale is really an open challenge.
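
One commonly cited heuristic from the large-batch training literature is to scale the learning rate linearly with the batch size and ramp it up over a warmup period; this is a sketch of that rule with illustrative constants, not a recipe we claim solves the problem.

    # Sketch of the linear-scaling-plus-warmup heuristic from the large-batch
    # training literature; all constants here are illustrative.
    def learning_rate(step, base_lr=0.1, base_batch=256, batch=16384,
                      warmup_steps=1000, decay=0.999):
        """Linear scaling: lr grows with batch size; warm up to avoid early divergence."""
        target_lr = base_lr * (batch / base_batch)            # linear scaling rule
        if step < warmup_steps:
            return target_lr * (step + 1) / warmup_steps      # linear warmup from ~0
        return target_lr * (decay ** (step - warmup_steps))   # simple exponential decay

    # Example: inspect the schedule at a few points.
    for s in (0, 500, 1000, 5000):
        print(s, round(learning_rate(s), 4))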

Conclusion

I just want to conclude on time. Mostly what I wanted to do today was to convey that deep learning is having a lot of impact in the sciences; it's not just the commercial use cases that you hear about. I think we have seen a lot of success across the board. Unfortunately, scientific datasets tend to be fairly large and complex; we don't want to compromise the integrity of our analysis, we want to operate on natively high-resolution spatial [inaudible 00:39:59], and applying complex deep learning architectures to complex scientific workloads requires that you scale your deep learning pipelines out.

Thankfully for us, HPC systems are known to be a good match: we already have fast compute, we have fast memory, and our I/O subsystems can in some ways keep up. This is a good match, and really the communication patterns are not that complicated; we've done more complex things in the past. That's the reason why we've been fairly successful in scaling out deep learning on large CPU and GPU systems in the last two years. The most encouraging thing for me is that we are doing all of this in the context of productive frameworks. You are touching only Python, and you're really not dropping down to assembly or anything low level.

As we've done this, we've discovered a few open challenges along the way: data management is absolutely key, hyper-parameter tuning is an open problem, and time to solution requires not only good computational efficiency but also good statistical efficiency, and that is definitely an open challenge.

 


 

Recorded at:

Jun 17, 2019
