
Automating Software Development with Deep Learning



Emil Wallner discusses the state of the art in software development automation, its current weaknesses, and areas that are ready for production.


Emil Wallner is a Machine Learning engineer, currently exploring code and design synthesis, and reinforcement learning. In 2018, he made a popular open source project, Screenshot-to-code, that translates design mock-ups into HTML/CSS. His blog is translated into a dozen languages and reaches over a million developers each year.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Wallner: My name is Emil Wallner, I'm Swedish, and I'm currently studying computer science in Paris. You might have come across some of my open-source projects. I made the one that you mentioned earlier, Screenshot-to-code, which was the most popular project on GitHub for almost a month. That's where I translate design mock-ups into HTML and CSS. Another project I've done is coloring black and white photos with neural networks, and [inaudible 00:00:30] made a short film about that project.

Software Development as Data

Today, we're going to talk about software automation. The first step you need to take to start understanding this problem is to look at software development as data. If you think about the tasks that we're given, say, design mock-ups, program descriptions, or meeting a client and trying to understand their problems, we tend to see them as human problems that only we can relate to and understand. But more and more, we can start to treat these problems as data problems. To understand how this is possible, I'm going to give you a short overview of the context.

We've had the traditional software that we're all used to. We've had deep learning, which is emerging, and then people are talking about blended models, combining symbolic AI systems with gradient-based systems, but those are still in early development.

The narrative of automating software development starts in the mid '80s, when Bill Gates gave a really lovely interview in which he talked about software becoming more and more high level. He was referring to Assembly, C, and programming languages becoming higher and higher level. He thought that by the early '90s we'd start having programs that would automate software development. What people realized in the early '90s is that you can make these expert systems complex, but they have certain limits. By the early '90s, the consensus was that we can't use traditional software to solve really complex tasks.

That's where the new paradigm comes in: gradient-based approaches. We used to create all the logic in our programs, but now, because we're using gradients and deep learning, we can start tackling more complex tasks. A great example of this is self-driving cars. Just a decade ago, this would have been impossible, but now, instead of creating the logic, we train the models that then create the heuristics.

These are two slides from Andrej Karpathy; he's the person who coined the software 1.0 and software 2.0 distinction. He runs the AI department at Tesla. When he came in, there was only a small amount of 2.0 code, and most of it was traditional code. The longer he was there, the more of the car's stack he replaced with 2.0 software. His point here (he's poking a bit of fun at the community) is: "Gradient descent can write code better than you. I'm sorry."

Can We Automate Software Development 1.0 with 2.0 Software?

In a lot of ways, software engineers are becoming data scientists: we used to create the logic, but more and more, we're creating the data sets, the pipelines, and the workflows that create the logic. This leads up to the core question I want to talk about today: can we automate software development 1.0 with 2.0 software?

I'm not going to give much context to this debate, but you'll see on Twitter, on YouTube, at conferences, people are still debating: can we use deep learning and gradients to start tackling software automation? Every year, people have different opinions about this, and I think it's very important to start forming your own opinion. If this is going to happen, and it's going to impact our futures, you want to know that your skills are still relevant. If you have a business that depends on software development, you also want to understand the complexity and dynamics of this shift.

Some people think that the deep learning 2.0 software layer is not good enough, so we need to develop a further layer; here I'm referring to that as 2.5. That's, as I mentioned earlier, combining symbolic AI with gradient-based approaches. A good way to get a sense of what this technology looks like is the AlphaGo Zero model by DeepMind, which uses Monte Carlo tree search with neural networks. The neural networks provide the pattern recognition, the high-level understanding, and the intuition, while the Monte Carlo tree search makes the search better. This is still in early development, and we're going to see a lot of advancements in this area.

On the other side, you have Richard Sutton; he's a thought leader in this field, and he's been working in this area for the past couple of decades. He says that "researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is leveraging of computation." If you look at a lot of the architectures we have in deep learning, the most powerful ones are removing all the heuristics and putting more and more of the responsibility on the compute and the data sets.

To give the whole picture, we have the 1.0 software stack, and it has a certain complexity limit. We have the 2.0 stack, and then a 2.5 that people are still discussing: do we need this or not? This is the core thing that I'm going to center on today. Right now, what we can do with automating software development is assisting software development. We can't replace software developers and engineers, but we can make them a lot better, faster, more efficient, and more resilient. We're in the early days of automating closed systems, web development, and systems that are not integrated, so a lot of systems have the potential to be automated today. That's what I'm going to start talking about soon.

Then there are the red areas: automating integrated systems, working with more complex solutions. We're not quite there yet. As for the time frame, again, we don't know how long this is going to take. It might take 5 or 10 years, it might take 50 to 100 years, or it might not happen. In this talk, I want to give you the context to understand this timeline better and a lookout to understand how soon this is going to happen.

Screenshot-to-Code Project

Now, we're going to talk about the Screenshot-to-code project. I wrote an article with all the technical details, so I'll take a more high-level approach here. It has had close to a million readers now and has been translated into a lot of different languages. I think the reason for that is that we're just about to realize that we can start automating software development, and that's what brings so much excitement. The task I'm going to talk about now is: how do you take an image, the raw pixels, and use deep learning to translate that into code so that you can render a web page from it?

If you look at it as a pure problem, it's about creating one function that takes the raw pixels and translates them into the correct syntax. With the 1.0 software mindset, you realize that creating this function is very hard; it would take years and be very expensive. But with the deep learning 2.0 mindset, we create the function and then use data to create the heuristics and the logic.

If you look at it at a glance, it might seem so complex that we can't approach it. But if you start looking at the literature and the research in AI, you realize that we can understand what's in an image, we can generate semantically correct text, and we can correlate the objects in images with text.

We know this because we have convolutional neural networks that are really good at understanding what's going on in a picture, recurrent neural networks that understand semantics and syntax, and image captioning models. Image captioning models put captions on a picture: given a picture of a boy chasing a cat, the model has to predict that caption. This shows that we can combine these two systems.

I'm going to walk you through a simple "Hello World!" example of starting to automate software development. We're going to take just one image, and we're going to represent it as its pixels. Then we're going to create the vocabulary. These are the tokens that we're going to work with, and they determine the size of the vocabulary. The way it's trained is that the model gets the image of the website and always gets the previous markup. If we go back here, what happens is that it gets the picture and the Start tag, and after the Start tag, it has to predict the HTML tag. Then it gets the Start and HTML tags and has to predict the Center tag. It keeps doing this over and over.
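To make the training setup concrete, here is a minimal sketch of how those (image, previous markup, next token) training pairs can be built for a "Hello World!" page. The token names and vocabulary are illustrative, not the exact ones from the talk:

```python
# Minimal sketch: for each prefix of the markup, the model is trained to
# predict the next token, given the image plus the previous tokens.
# Token names below are illustrative placeholders.

VOCAB = ["<start>", "<html>", "<center>", "Hello World!", "</center>", "</html>", "<end>"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}  # vocabulary size = len(VOCAB)

def make_training_pairs(markup_tokens):
    """For each prefix of the markup, the target is the next token."""
    tokens = ["<start>"] + markup_tokens + ["<end>"]
    pairs = []
    for i in range(1, len(tokens)):
        pairs.append((tokens[:i], tokens[i]))  # (previous markup, next token)
    return pairs

pairs = make_training_pairs(["<html>", "<center>", "Hello World!", "</center>", "</html>"])
for prefix, target in pairs:
    print(prefix, "->", target)
```

The first pair is just the start marker with the opening HTML tag as the target, matching the step-by-step description above.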

What you'll see in the beginning is that it just predicts the Center tag all the time, but the more you train it, the more it starts understanding the relationships between these pieces. When you're training it, you always give it the correct input and the correct output. When you're running it in production, it gets the prediction that it made: you feed in the prediction it made, and that's how it creates the next prediction.
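The feed-back-the-prediction loop can be sketched as a greedy decoding function. `predict_next` stands in for the trained network; here it is a toy lookup table so the loop is runnable:

```python
# Sketch of inference: unlike training (where the correct previous tokens
# are always given), generation feeds the model's own predictions back in.
# NEXT is a toy stand-in for a trained network's behavior.

NEXT = {
    "<start>": "<html>",
    "<html>": "<center>",
    "<center>": "Hello World!",
    "Hello World!": "</center>",
    "</center>": "</html>",
    "</html>": "<end>",
}

def predict_next(image, previous_tokens):
    # A real model conditions on the image features and the full token
    # history; this toy version only looks at the last token.
    return NEXT[previous_tokens[-1]]

def generate(image, max_len=20):
    tokens = ["<start>"]
    while tokens[-1] != "<end>" and len(tokens) < max_len:
        tokens.append(predict_next(image, tokens))  # feed prediction back in
    return tokens[1:-1]  # strip the start/end markers

print(generate(image=None))
```

The `max_len` guard matters in practice: an imperfect model can fail to emit the end token and loop forever without it.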

The core architecture comes from the Pix2code paper by Tony Beltramelli in 2017. There are a couple of other papers worth mentioning: there's image-to-LaTeX, Im2Latex, made by the NLP group at Harvard, which has a slightly different approach; there's something called Sketching Interfaces by Airbnb and Sketch2Code by Microsoft. I found the Pix2code paper the cleanest; it's easy to understand, and the approach is a pure end-to-end approach. The other approaches have a lot of object recognition and moving pieces, but this one is a simpler architecture.

If you expand it to look at the moving pieces, we have the convolutional neural network that processes the image and, on the right side, the LSTM that takes in the markup. What's interesting here is that, when I started out, this was just magic to me. You take the picture, you take the syntax, and what you have are these big vectors of numbers. You can just concatenate them, put them together, and the network starts understanding. Then you take these concatenated features and process them in another LSTM. LSTM stands for long short-term memory, and it is one of the most common recurrent neural networks. At the end, you have a dense layer that makes the prediction.
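A shape-level sketch of that data flow, with stub "networks" that only track vector sizes (the 1024/512 feature sizes and the vocabulary size of 20 are illustrative assumptions, not the paper's exact numbers):

```python
# Shape-level sketch of the Pix2code-style architecture: a CNN encodes the
# image into one feature vector, an LSTM encodes the markup tokens into
# another, the two are concatenated, and a decoder predicts a probability
# over the next token. The "networks" here are stubs that only track shapes.

def cnn_encode(image):
    return [0.0] * 1024            # image -> e.g. 1024 visual features

def lstm_encode(tokens):
    return [0.0] * 512             # token sequence -> e.g. 512 language features

def decode(features, vocab_size):
    # Stand-in for the second LSTM plus the dense softmax layer:
    # returns a uniform distribution over the vocabulary.
    return [1.0 / vocab_size] * vocab_size

image_features = cnn_encode(image=None)
text_features = lstm_encode(tokens=["<start>", "<html>"])
combined = image_features + text_features   # concatenation: 1024 + 512 = 1536
scores = decode(combined, vocab_size=20)

print(len(combined), len(scores))
```

The point of the sketch is the concatenation step: both modalities end up as plain vectors of numbers, so joining them is literally list concatenation before the decoder.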

I could talk more about different architectures. When I started, I put a lot of effort into choosing these components, but the more I've worked with these networks, the more I see it as a hyperparameter problem. You look at the latest implementations; for the LSTM, you could have a GRU, a transformer, anything that deals with sequential tasks. Then you just do a hyperparameter sweep. Instead of manually choosing the components, you let the compute do that for you. We're moving away from engineering things and moving towards taking a data-science approach to a lot of things.
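Treating the component choice as a hyperparameter sweep can be sketched like this. `train_and_evaluate` is a stub standing in for a full training run; the candidate values are illustrative:

```python
# Sketch of an architecture/hyperparameter sweep: enumerate candidate
# sequence models and settings, score each, keep the best. In reality each
# call to train_and_evaluate is an expensive training run measured on a
# validation set; here it is a seeded random stub.

import itertools
import random

random.seed(0)

SEQUENCE_MODELS = ["lstm", "gru", "transformer"]
LEARNING_RATES = [1e-3, 1e-4]
HIDDEN_SIZES = [256, 512]

def train_and_evaluate(config):
    # Stand-in for training the model in `config` and returning
    # its validation accuracy.
    return random.random()

best_config, best_score = None, -1.0
for model, lr, hidden in itertools.product(SEQUENCE_MODELS, LEARNING_RATES, HIDDEN_SIZES):
    config = {"model": model, "lr": lr, "hidden": hidden}
    score = train_and_evaluate(config)
    if score > best_score:
        best_config, best_score = config, score

print(best_config)
```

The grid here has only 12 configurations; real sweeps typically use random or Bayesian search because grids grow combinatorially.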

Now we have the "Hello World!" version, so how do we make this more complex? The next step was: how can you create pages that look like this? I had five different pages with roughly different flavors of this. Now, instead of just having 10 or 20 tokens, we have roughly 20,000 tokens, and there are a lot more moving pieces. If you start training the network, you realize that it takes a lot of epochs to get to a good level. What we're doing here is overtraining the network to make sure that it has the capacity to solve this problem. Here, you can also spot an obvious mistake: at the 450 here, you can't see anything. One of the problems with generating programs is that everything has to be perfect. When you're generating text over paragraphs, you can have errors here and there, but when you're working with programming, everything has to be perfect.

The next step was to understand how good software 2.0 is at creating good semantics. There's a lot of research on using LSTMs and other networks to just generate text. This is some generated text; it was trained on a C codebase, I think the Linux kernel. You can see that the format looks roughly correct: the functions, the paragraphs, a couple of comments in there. But when you start paying attention to the details, you see that it doesn't understand the idea of variables, functions, classes, and so on. This is an open research problem, and we don't know whether just adding more data solves it. With recent developments like GPT-2 by OpenAI, which generated coherent documents of text, things lean towards maybe being able to handle variables and functions with more data alone, but we don't know that yet. It's going in that direction. We're starting to be able to process more programs with more complexity, and so far, we don't really know what the limit is.
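That failure mode, output that is locally plausible but semantically empty, is easy to reproduce even with a much simpler model than an LSTM. The toy below is a character-level Markov chain, not a neural network, but it illustrates the same point: it learns which character tends to follow which, so its output looks superficially like C while having no concept of variables or matching braces:

```python
# Toy character-level generator (a Markov chain, standing in for the LSTM
# experiments described above). It records which character follows which
# in a tiny C snippet and samples new text from those transitions.

import random

random.seed(1)

corpus = "int add(int a, int b) { return a + b; } /* sum */"

transitions = {}
for prev, nxt in zip(corpus, corpus[1:]):
    transitions.setdefault(prev, []).append(nxt)

def sample(start="i", length=40):
    out = [start]
    for _ in range(length - 1):
        choices = transitions.get(out[-1])
        if not choices:  # last character of the corpus has no successor
            break
        out.append(random.choice(choices))
    return "".join(out)

print(sample())
```

Every character the model emits is valid C *locally*, yet nothing enforces that an opened brace is ever closed, which is exactly why "everything has to be perfect" makes program generation harder than prose generation.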

I had this data set of a terabyte of websites from GeoCities. I don't know if you know GeoCities; it was the first web-hosting platform, and there are a lot of websites from the '90s on it. The good thing about this data set is that the HTML code is simpler compared to the JavaScript-heavy websites that we see today. But if you start looking into all the code and trying to simplify the vocabulary, you realize that this is a very big problem. I think this is still a very interesting research area if you have more computing power, and also more time to clean and structure the data.

What I went with instead is a DSL, a domain-specific language, that was introduced in the Pix2code paper. What you end up with are these tokens that represent parts of the website. This is based on the Twitter Bootstrap theme. What's interesting here is that you can start visualizing the neural activations in the network to see what kind of knowledge it has. I think doing these kinds of exercises, visualizing the activations, gives you a better understanding of what it can and can't do. You can see that it understands the idea of paragraphs and rows. It understands the divs, and it has a basic understanding of what it means to create a website.
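The DSL idea can be sketched as a tiny two-stage pipeline: the network predicts compact tokens, and a small deterministic compiler expands them into Bootstrap-flavored HTML. The token names and HTML snippets below are illustrative, not the paper's exact DSL:

```python
# Toy sketch of the Pix2code DSL approach: the model only has to predict a
# small vocabulary of layout tokens; a hand-written compiler expands each
# token into HTML. Token names and templates are illustrative assumptions.

DSL_TO_HTML = {
    "header":    '<div class="header"></div>',
    "row":       '<div class="row"></div>',
    "btn-green": '<button class="btn btn-success">OK</button>',
    "btn-red":   '<button class="btn btn-danger">Cancel</button>',
}

def compile_dsl(tokens):
    """Expand a flat sequence of DSL tokens into HTML, one line per token."""
    return "\n".join(DSL_TO_HTML[token] for token in tokens)

print(compile_dsl(["header", "row", "btn-green"]))
```

Shrinking the vocabulary from raw HTML to a handful of DSL tokens is what makes the learning problem tractable: the network predicts layout structure, and correctness of the markup syntax is guaranteed by the compiler.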

The website looks something like this. That's the image, and this is what the model produced. Most of it looks correct, but one button is the wrong color. This model can predict at roughly 97% accuracy using a BLEU score. It's really useful to have a generator: we can only do a few components now, but with a generator, you can add more and more components and increase the expressiveness step by step.
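To show what a BLEU-style score measures here, this is a stripped-down sketch: modified unigram precision between the predicted tokens and the reference tokens. Full BLEU also uses n-grams up to 4 and a brevity penalty; this toy version only shows the core idea:

```python
# Minimal sketch of BLEU-style scoring for generated markup: modified
# unigram precision. Each predicted token is credited at most as many
# times as it appears in the reference, so repetition isn't rewarded.

from collections import Counter

def unigram_precision(predicted, reference):
    pred_counts = Counter(predicted)
    ref_counts = Counter(reference)
    matches = sum(min(count, ref_counts[tok]) for tok, count in pred_counts.items())
    return matches / max(len(predicted), 1)

reference = ["<div>", "btn-green", "btn-red", "</div>"]
predicted = ["<div>", "btn-green", "btn-green", "</div>"]  # one wrong button

print(unigram_precision(predicted, reference))  # 3 of 4 predicted tokens credited
```

Note how this mirrors the failure described above: a page that is correct except for one wrong-colored button still scores high, which is why a high BLEU score doesn't mean the rendered page is pixel-perfect.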

Tony, who wrote the original paper, has continued to work on this and turned it into a product called Uizard. If you want to understand the latest in this area, I would highly recommend going to the web page and trying it out. Instead of using a design mockup, it takes a sketch and generates the corresponding syntax for app development. I think what it does is pretty impressive. This will make creating user interfaces a lot faster, moving forward.

Software Development as Data

Now I'm going to look at what this means for the broader field of software development. We're going back to the software-development-as-data mindset. If you look at software development 1.0, because the complexity grows steadily, it's easy to predict what's going to happen in the future. What we're seeing with the new computing paradigm is that it has different progress curves. You start by making small progress, and then there's a very rapid increase when you increase the computation and the data.

This has happened in a lot of different areas: board games, multiplayer games, text understanding, translation, self-driving cars, medical image analysis. All these fields started out very weak compared to the 1.0 software systems and then had a very rapid increase. This is important because it's going to happen in areas of software development as well. When these things start to happen, they're going to enable us to automate more software.

Now we can start working with assisted software development. We're getting closer to automating closed systems, but we're still far away from integrated systems. The way to think about this is: in pop culture, you will often hear people talking about complexity, empathy, creativity, critical thinking. That's taking the human point of view on this problem. If you hear this narrative in board meetings, or when you're discussing with people, it can be harmful, because you neglect the capacity of the computer. If you want to start working with these types of problems, I think what's important is that you take the computer's view: you have empathy with the computer and try to understand how the computer sees the world.

There are three things that I've come to find useful and that I think about. The first one is undefined tasks. As I was talking about earlier, turning an image into code is defined by understanding an image, creating semantically correct syntax, and combining the two. This is the same approach you want to take when tackling new problems. Instead of just seeing it as a human problem or a computer problem, you really want to define the tasks that are involved. The second is novel manifolds. This is a mathematical term that's kind of complex, but it gives us a way to start understanding how the computer sees the world. Vision tasks, say, have a certain type of manifold, but if you compare that to voice or another domain, they can have other types of manifolds. What you realize is that in certain areas, say vision, if you solve one problem, other problems in vision are going to be easy. It's important to understand how novel these manifolds are and how close they are to others. I'm going to dig a little bit deeper into that soon.

The third one is scalability. We've seen people creating and understanding new manifolds, but then the key step is to understand how well these scale. A recent invention was something called a Neural Turing Machine, which tries to create a neural computer by combining the logic of a 1.0 computer with the differentiable systems that you find in 2.0 software. This has been shown to learn simple loops and logic, but we haven't been able to scale it up. You always want to take these different perspectives when you're thinking about different tasks.

Just to get a physical sense of what a manifold is: if you look at a piece of paper, it's a 2D manifold in a 3D space. If you crumple it together, that's really what a manifold is. What a deep neural network does is slowly but steadily unfold this manifold. Once we unfold it, it's a lot easier to make, say, a classification, or to work with the data. A lot of different data types behave in different ways. Images and sound have smoother edges, but computer programming or reasoning can have sharper edges. That's why it becomes harder to use gradient-based approaches. This is playing out in thousands and thousands of dimensions, and that's why it's really hard for us to get a good understanding of what's going on in this space. I think the best way to understand how these models behave is to work with them on a daily basis, because you start getting an intuition for it.

Going back to this oversimplified view of the software stack and the problems, I look at it more like this. You have the 1.0 stack in the left corner, and then the 2.0 stack and 2.5 stack going towards the north-east corner, towards more scalability and more novel manifolds. Again, what you can do right now is augmented IDEs, making software engineers a lot faster in a lot of different ways; I'm going to cover this soon. A harder area is something like a social media app; it has more integrations and more difficulties. The hardest parts are things such as a bank API that integrates with ATMs and other systems.

Right now, if you want to look at what deep learning can do to assist software development, I would look at some of these areas: refactoring, autocomplete, code review, user testing, graphical user interfaces, prototyping, semantic code search, security issues, and monitoring. A lot of these might not be ready to use on a daily basis or to integrate into your company, but by using these types of tools, you will also understand how fast this shift is going to happen. What I would recommend is to revisit these tools every six months or so to understand how good they are. If you see that these types of tools are improving at a very rapid pace, you can also assume that the timeline I was indicating earlier is going to get shorter and shorter.

If you plot that on the map, it's in the existing area of what we can do. Then look at the next step, the areas we'd want to tackle next: dynamic pages, linking pages, using variables, security rules, integrating databases. These are things where the more tasks you have to integrate, the more complex it gets. If you plot that roughly on the map, you'll see that some areas are good, but most of the areas are outside of what we know, and we still need to do a lot of research. The last one is the bank API, where you have API rules, transactions, ATMs, integrations, and so on. That makes it very hard to automate. This is what it looks like if you plot it roughly on the map.

To understand the narrative: right now we can deal with functions, paragraphs, and static graphical user interfaces. This is what we can use today to understand code. If you're doing refactoring, for example, you can understand it at a function level, but it's very hard to do refactoring at more of a program scale. With the recent advancements in natural language processing, we're moving closer and closer to the orange area, where we can create a program or a document that's semantically correct. Most of this area is still in development.

The reason why I have "conversations with context" there is that a lot of software development is not necessarily getting a program description or a graphical user interface; it's actually integrating and working with businesses and people to really understand their problems, and to solve those types of problems, you need a person who interacts. Then there are the really hard areas to automate: systems of programs that interact with each other, where there are several documents they need to understand, advanced graphical user interfaces, and expert-level dialog. Imagine you want to create, say, a bank API; you need to really understand the complexity of all the tasks involved, and you need someone to orchestrate all of it. That is just way beyond the current techniques we have today.

To sum up what I've been talking about: we have these three areas, and the time axis depends on how fast development is going to happen. On the Y-axis, we have novel manifolds and scalability, a way to look at how hard these tasks are. That's it. My name is Emil Wallner. You can find me on GitHub and on Twitter, and my email is the one on the screen.




Recorded at:

Sep 11, 2019