
Improving Developer Productivity with Visual Studio IntelliSense



Allison Buchholtz-Au and Shengyu Fu discuss how PM, engineering, and data science came together to build Visual Studio IntelliCode, which delivers context-aware code completion suggestions.


Allison Buchholtz-Au joined the Visual Studio PM team after graduating in 2015 from Harvard University with a degree in computer science. Shengyu Fu leads an applied science team in Microsoft Cloud & AI division. Their mission is to infuse ML & AI into Microsoft developer platforms and tools.

About the conference: A practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.


Buchholtz-Au: I'm Allison Buchholtz-Au, a program manager on the Visual Studio IntelliCode team. We are part of the Visual Studio platform team, although we're a little special because we work outside the confines of the regular IDE ship schedule; I think of us as an incubation team. For the past year, Shengyu and I have worked on IntelliCode, and we're going to take you a little bit behind the machine learning model and how we built it up. At the end, we're also going to talk about best practices and things we've learned as we've tried to get this team off the ground and IntelliCode into the hands of our users.

Visual Studio IntelliCode

We'll talk a little bit about Visual Studio IntelliCode, and I'll go into a demo just to show you what this is. IntelliCode is really our machine learning offering. Our team focuses on how we take machine learning and AI and everything associated with them, and infuse your developer workflow with them at every single point. This can range to things like: how do we help you focus in your PRs? If you've got 20 minutes, can we give you information that helps you hone in on a particularly problematic area or a high-risk change? I'm sure you've all been swarmed with PRs; how do you focus that time?

The thing that we offer right now is what we call AI-assisted IntelliSense, I have it defined here. This really uses your current code context and patterns based on thousands of open-source repositories. Shengyu [Fu] will talk about how we use those repositories in order to give you better recommendations. We predict the most likely/relevant contextual recommendations as well as argument completions. If you are within a method call, and you're trying to figure out what argument you need, we also can provide the most likely or relevant suggestion there.

Because a lot of you haven't seen this, I wanted to just switch over and actually show you a little bit of it in action because I can sit here and tell you what it does, but I think it really means more to show you what it does. I've got my demo code here, what's important to notice is that we have this string path and if I'm trying to augment it in some way, when I activate IntelliSense, you'll notice that we have these starred recommendations at the top. Right now, it's saying, "Hey, based on this context you're just doing some formatting on this string. You most likely want Replace," which is indeed what I want. I'm just trying to replicate what I have commented out up here.

If I move into a different context, which is an "if" statement, and activate IntelliSense again, you'll notice that it changed what was at the top here. Now it says, "Within the context of an 'if' statement, you most likely want Length," which I'm sure we can all resonate with. Generally, when I'm doing an "if" statement with a string, I'm probably checking to make sure the Length is not zero, but in this case, I actually want EndsWith, which is the second item. If I go through this one more time and then go into my statement here (I've got to remember my parentheses, and I'm in yet another different context; if I could only type this right! The perils of typing on stage), it changes yet again. Now I can do Substring.

Another thing I just wanted to show is that this also works for Python, which is pretty cool. If I just recreate this thing above, it basically does all my stuff for me, we often joke that eventually, our coding will just be CTRL+SPACE, Activate IntelliSense, press Enter, and it will all do it for you. I hope this demonstrates pretty easily that this is better than just your classic IntelliSense. It's really about taking that context and giving you smarter suggestions.

Why ML? Why This Problem?

Let's talk about why we chose this problem and why we think it's a good problem for machine learning. The number one thing here is the customer need. For years and years, people have loved IntelliSense; they're like, "It's great." But when we ask what people want, we hear all the time, "There's got to be a better way to sort this than just alphabetical. If I have a class with 100 methods in it, and the one I want is all the way down at the bottom, I don't want to have to do another keystroke, I don't want to have to scroll through the list. Give me something better." There was a customer need, and that's something PMs on this team focus a lot on: making sure that when we build something, customers are going to want it. You could have a great model, you can have a really cool solution, but if it doesn't solve a problem, no one's going to use it.

There was also a large amount of open-source data. As I'm sure you've heard many times in the past two days, having data is really important for machine learning; if you don't have the data, you can't really do anything. We also wanted to start with an experience that users hit a lot, because we wanted to see how useful our solution was. If this were a one-off process the user did once a week, our impact wouldn't be nearly as big, and it'd be really hard to get accurate feedback on it. So we started with member completion, which is triggered all the time as you're typing.

Solution Principles

We went in and asked, "Ok, how do we solve this problem?" and came up with these principles for evaluating new ideas on our team. The first one is to start with a small, concrete problem; it's a lot easier to tune and get feedback if you have a small problem space. If you have a huge problem statement and many variables, how do you know the impact you're having? Start small; you can always scale up as you find success. The second principle is to create systems that allow for iteration. I think we've heard in every single talk we've attended that allowing your system to change and iterate is really important as you build these models and integrate them into your workflow.

Avoid being intrusive as you experiment. One thing we did was, instead of completely overriding the IntelliSense list, we augmented it. We provided suggestions at the very top that we thought would be useful, but we didn't get rid of all the other suggestions, because if we got it wrong, the user would have no fallback. You want to think about how you can augment an experience rather than completely replace it, because you're probably not going to be right all the time and you need a graceful failsafe.

That leads to the last one, consider augmenting. These last two really go hand-in-hand because if you totally override an experience, and your user is confused, then they're going to revert whatever change you made and you're not going to get the feedback you want on your process or your solution. With that, I'm going to hand it over to Shengyu to talk a little bit about how data science actually built this model. Then we'll go into some of the learnings we have on how you bring a program management team, or product management team, a data science team, and your engineering platform together, so that you can think about how you would build that system yourself.

The Data Science Journey

Fu: Thanks, Allison. I'm going to talk about the data science journey as we tackled this problem. There are three main points I want to make about what I think we did well in this project. First, we started with exploring and understanding the data, doing a lot of statistical analysis and building data intuition before we even jumped into building a machine learning model. We also defined a very clear set of metrics, consistent between offline and online, so we could measure both as we iterated on the model.

Second, even at the prototype stage, we were already taking the engineering constraints into consideration. Because we need to run the model on the client device, we need to control the model's memory footprint, usually under about 50 MB. We also need to be very sensitive to the runtime of the model; we want to keep the recommendation time under 20 milliseconds for each call, so we don't interfere with the user's editing experience. The third thing we relied on heavily when evaluating models was offline evaluation, because it's cheap and allows very fast iteration; we can do it right there in model training and evaluation, before we actually release the model for dogfooding or into production.

What data did we use? We started with open-source data from GitHub. We crawled pretty much all the languages we support from GitHub, applying quality filters based on stars and other GitHub metadata. For C#, as an example, we crawled more than 2,000 C# repos, containing more than 5,000 solutions and 200,000 C# documents. From this raw data, we spent a lot of time building a parser based on the compiler. Basically, we walked through the code, built a syntax tree and a semantic model, then extracted the context for each member invocation, because those are the things we need to predict.

In this "Hello world" example, we really care about this Console.WriteLine; we need to know where it happened in this particular code context: whether it's in a conditional statement or a bracket, whether it's in a loop, which class and method contain it, and whether the invoked method has certain properties. All of those become the context for our recommendation.
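To make the extraction step concrete, here is a hypothetical Python sketch of the same idea using the standard library `ast` module: walk a syntax tree and record, for each member invocation, its receiver, method name, and the kind of enclosing statement. The real pipeline used per-language, compiler-based parsers (such as Roslyn for C#); the sample source and all names below are illustrative only.

```python
import ast

# Hypothetical sketch of invocation-context extraction. The sample source
# deliberately places member calls inside different enclosing statements.
SOURCE = """
import os

def check(path):
    if path.endswith(".txt"):
        return os.path.getsize(path)
    return path.replace("a", "b")
"""

def extract_invocation_contexts(source):
    """Return (receiver, method, enclosing_context) for each member call."""
    tree = ast.parse(source)
    # Record parent pointers so we can walk up to the enclosing statement.
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            child._parent = node
    contexts = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            receiver = ast.unparse(node.func.value)
            method = node.func.attr
            parent = getattr(node, "_parent", None)
            while parent is not None and not isinstance(
                parent, (ast.If, ast.While, ast.For, ast.Return, ast.FunctionDef)
            ):
                parent = getattr(parent, "_parent", None)
            enclosing = type(parent).__name__ if parent else "Module"
            contexts.append((receiver, method, enclosing))
    return contexts

for ctx in extract_invocation_contexts(SOURCE):
    print(ctx)
```

The tuples this yields, such as `("path", "endswith", "If")`, are exactly the kind of (invocation, context) training rows the talk describes feeding into the learner.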

Once we generated this raw training data from the source code, we were able to do a lot of exploration, trying to answer questions like: How does C# get used? How do the APIs get used across different classes? Are there patterns inside each class? How does a method get used in a particular document? Which pieces of information will be useful for our recommendation? What are the reasonable code contexts we should look at? Here we draw a similarity between NLP and code analysis: code has its own special structure, and code far away can affect the code in a particular context. There's both similarity and apparent difference between natural language and programming language.

We also wanted to understand whether a single set of model parameters would work for all the classes, or whether we'd need to train a separate model for individual classes. And we constantly looked at whether we had enough data for different types of models.

The first thing we looked at was the distribution of all the classes used in the training data. You can see that String is, not surprisingly, among the most popular classes, and the top hundred classes already cover about 28% of all the invocations in the training data. The top 1,000 classes cover 50%, so it's a very long-tail distribution; with 6,000 classes, we cover 70% of invocations. That gave us a very good idea that by focusing on the head, the most popular classes, we could get very good coverage, provided we could get good precision on those popular classes.

Here's another example from our data exploration. For any recommendation problem, there's always a cold-start problem: there's no code context when you write the first line of code, so how do you make a recommendation? That's going to affect our precision heavily. Based on class popularity, we wanted to see how the cold-start problem affects our classes. Luckily, based on this graph, the most popular classes have less of a cold-start problem. That gave us some confidence about how we could handle this in the modeling.

For each type of model, we always draw a learning curve to understand whether the data we use has enough volume to allow the model to learn as much as possible. Before we even started modeling, we defined a set of metrics. The metrics are pretty obvious: right now we make at most five recommendations, so we define the metrics in terms of precision. We want to know the top-one precision: when we make a recommendation, what is the chance that you pick our top recommendation, or one of the top three, or one of the top five? So we have those three precision metrics.

Then we have coverage, which means: when a request for a class comes to our recommendation engine, how often are we able to make a recommendation at all? Some classes we don't cover, so we cannot make any recommendation from the training data. Coverage tells us how often we can actually help the developer improve productivity. Then we have a third metric, the average reciprocal rank, which unifies all the precision metrics by taking the rank into consideration. It's an overall precision metric that uses the inverse of the rank at which the user picks.
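The three metric families just described (precision@k, coverage, and average reciprocal rank) can be sketched in a few lines. This is a hedged illustration, not the team's actual telemetry code; the events and method names are invented:

```python
# Each event is a (recommendations, actual_pick) pair; an empty
# recommendation list means the engine could not recommend anything.

def evaluate(events, k_values=(1, 3, 5)):
    covered = [(recs, pick) for recs, pick in events if recs]
    coverage = len(covered) / len(events)
    precision = {
        k: sum(pick in recs[:k] for recs, pick in covered) / len(covered)
        for k in k_values
    }
    # Reciprocal rank: inverse of the position where the user's pick appeared.
    mrr = sum(
        1 / (recs.index(pick) + 1) if pick in recs else 0.0
        for recs, pick in covered
    ) / len(covered)
    return precision, coverage, mrr

events = [
    (["Replace", "Trim", "Length"], "Replace"),    # hit at rank 1
    (["Length", "EndsWith", "Trim"], "EndsWith"),  # hit at rank 2
    ([], "Foo"),                                   # no recommendation made
    (["Substring", "Split"], "Join"),              # miss
]
precision, coverage, mrr = evaluate(events)
print(precision, coverage, mrr)
```

Because the same event shape can be produced offline (by replaying code) or online (from editor telemetry), one evaluation function like this yields the consistent offline/online metrics the talk emphasizes.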

Once we had this set of metrics, we started the modeling process with a very simple model: pure frequency. A lot of competitor products use this frequency model; they look at the popularity of each method in the training data, and when you invoke a particular class, they rank its methods by popularity. It's a very simple model with relatively low precision, but it serves as a baseline. When we have no code context, that's pretty much the model we use.
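The frequency baseline is simple enough to sketch in full. The (class, method) counts below are made up; the point is only the ranking logic:

```python
from collections import Counter

# Minimal sketch of the frequency baseline: rank a class's methods purely
# by how often they appear in the training corpus, ignoring all context.
training_invocations = [
    ("String", "Replace"), ("String", "Replace"), ("String", "Length"),
    ("String", "Substring"), ("String", "Length"), ("String", "Length"),
]

def train_frequency_model(invocations):
    model = {}
    for cls, method in invocations:
        model.setdefault(cls, Counter())[method] += 1
    return model

def recommend(model, cls, top_k=5):
    return [m for m, _ in model.get(cls, Counter()).most_common(top_k)]

model = train_frequency_model(training_invocations)
print(recommend(model, "String"))  # ['Length', 'Replace', 'Substring']
```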

Then we did some literature search. A lot of the literature talks about clustering models, which look at the co-occurrence of methods inside a particular code context. We spent a lot of time iterating on that, because the literature says this model performs pretty well, but in reality, when we tried it, the precision was not as good as we wanted. It's very difficult to tune the clusters because each class has a different number of methods, so the cluster sizes differ; you have to tune each class, which is tedious and difficult.

In the end, we gave up on it because we had a tough time iterating and improving the model's precision. Drawing intuition from natural language processing, we came up with a simpler model, more of a statistical language model. I can't talk about the details because this is the model running in production, but the intuition is: based on the code context we have, we build a Bayesian-type model that predicts the probability of each method being used. This model has the smallest model size and much better precision, and it's deployed in production.
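The production model's details are not public, so the following is only a hypothetical count-based sketch of the stated intuition: estimate P(method | class, code context) from corpus counts, falling back to the context-free frequency distribution when a context is unseen. All corpus rows and context labels are invented:

```python
from collections import Counter, defaultdict

# Synthetic (class, context, method) training rows for illustration.
corpus = [
    ("String", "If",     "Length"),
    ("String", "If",     "Length"),
    ("String", "If",     "EndsWith"),
    ("String", "Assign", "Replace"),
    ("String", "Assign", "Replace"),
    ("String", "Assign", "Substring"),
]

contextual = defaultdict(Counter)  # counts conditioned on (class, context)
overall = defaultdict(Counter)     # context-free fallback counts
for cls, ctx, method in corpus:
    contextual[(cls, ctx)][method] += 1
    overall[cls][method] += 1

def recommend(cls, ctx, top_k=3):
    """Return (method, estimated probability) ranked by P(method | context)."""
    counts = contextual.get((cls, ctx)) or overall.get(cls) or Counter()
    total = sum(counts.values())
    return [(m, c / total) for m, c in counts.most_common(top_k)]

print(recommend("String", "If"))      # Length most probable inside an `if`
print(recommend("String", "Assign"))  # Replace most probable in an assignment
print(recommend("String", "Loop"))    # unseen context -> frequency fallback
```

This mirrors the demo's behavior earlier in the talk, where the same String receiver got different top suggestions inside an `if` statement versus an assignment.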

At the same time, we looked at the state-of-the-art approach using a deep learning model. We built a two-layer LSTM model that takes the previous tokens in the code, treats it as natural language, and tries to predict the next token. This model has the best precision and coverage because it can cover a lot more cases than the statistical language model. The problem, [inaudible 00:18:48], is basically that the model size is much bigger: it's five times bigger than our current statistical language model, and the execution time is almost 10 times longer.

Imagine you have a laptop and you need to run this deep learning model, and code completion takes 200 milliseconds; it will interfere with the coding experience. Right now we are spending a lot of time iterating on this, compressing and quantizing the model, trying to improve it before we can even try it in dogfood [inaudible 00:19:22]. That's our modeling journey, from a very simple model to more complex ones, trying to strike a balance between the model precision metrics and the runtime metrics.


This is our architecture [inaudible 00:19:43]. In the box on the left, we have the Model Training Pipeline, which crawls the public source repositories in a distributed fashion. Then we have a component called the Metrics Builder, which is hand-crafted per language, because different languages use different parsers and compilers. We extract the same training data from those documents with a uniform schema and feed it into our learner, which does the model training and the offline evaluation. Based on that, we pick the best model in terms of model type and parameters.

Once we have a good model, we release it first for dogfooding, but it goes through the same infrastructure. There are mainly two types of editors. We serve the model through a model-downloading service, which IDEs like Visual Studio and VS Code can pick up; the IDEs contain the model-execution piece, load the model into memory, and serve requests. We also have an online web app, a light online version of the code editor, which is served through a web service. Each of those code editors has its own telemetry instrumentation and feeds events back to our learner. Right now, we are not using the actual user data to train the model, but we use those online metrics to tune our offline training parameters.

Offline Model Evaluation

This is a simple illustration of the offline model evaluation. We take the training data, split it 80-20, train the model on 80% of the dataset with different model types and parameters, then evaluate on the 20% test set. We replay the code as if you were writing it sequentially, line by line; for each invocation, our model makes a prediction, and we match the prediction against the label, treating the code as actually written as the true label. Then we can measure the metrics I described before: the precision metrics, the coverage metric, and the average reciprocal rank.
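The replay evaluation above can be sketched end to end with a synthetic corpus. This is a hedged illustration under the simplifying assumption that the model is the count-based contextual model sketched earlier; the split, replay, and labeling follow the procedure described:

```python
import random
from collections import Counter, defaultdict

# Synthetic (class, context, method) invocations standing in for the
# rows extracted from crawled source code.
corpus = ([("String", "If", "Length")] * 8 + [("String", "If", "EndsWith")] * 2 +
          [("String", "Assign", "Replace")] * 6 + [("String", "Assign", "Trim")] * 4)

random.seed(0)
random.shuffle(corpus)
split = int(len(corpus) * 0.8)           # 80/20 train/test split
train, test = corpus[:split], corpus[split:]

model = defaultdict(Counter)
for cls, ctx, method in train:
    model[(cls, ctx)][method] += 1

def precision_at_1(model, test):
    """Replay held-out invocations; the written method is the true label."""
    hits = made = 0
    for cls, ctx, label in test:
        counts = model.get((cls, ctx))
        if not counts:
            continue  # no recommendation possible; this hurts coverage instead
        made += 1
        hits += counts.most_common(1)[0][0] == label
    return hits / made if made else 0.0

print(f"precision@1 on held-out replay: {precision_at_1(model, test):.2f}")
```

Swapping in a different model type while keeping this harness fixed is what lets the team compare frequency, clustering, statistical, and LSTM models on identical metrics.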

This is an illustration of the precision metrics across the different types of models. The LSTM model has the best precision; second is the statistical language model, which is in production, followed by the clustering and frequency models. It's a relative scale; I can't show the absolute numbers yet, but you get a sense of the improvement as we move between model types.

Online Evaluation

In terms of online, when the developer writes a line of code in the editor and hits 'dot' to make a new invocation, for example when "Directory." happens, the online execution engine takes the code before the dot and constructs the feature vectors. The feature vector is sent to the model execution, which returns a list of recommended methods. In this case, our model is so confident it's going to be "Exists" that it returned only one recommendation. When the user actually picks a recommendation, we log a "commit event." With the recommendation events and the commit events, we can calculate all the metrics we described for offline, so we have the exact same set of metrics online and offline.

We do see a very positive correlation between offline and online. Right now we don't actually have A/B testing infrastructure yet; it's much harder to do A/B testing because this is not a web-based application, you have to download and install, but we are working on that. When we serve the model, we want to be able to A/B test. For now, the offline evaluation gives us a very good tool, and we can iterate really fast: when offline shows a statistically significant improvement, we see an improvement online as well.

In terms of online evaluation, Allison [Buchholtz-Au] is going to talk later about the online survey we do. What we really care about is the user experience: whether the user actually considers this as improving their productivity or not. The quantitative feedback is good, but it doesn't really tell us the true north. Allison [Buchholtz-Au] is going to show the survey results we gather quarterly from the developer community. Even if the precision is only 10%, does it really help? Even if the precision is 100%, does it really give a better experience for the user? That is the real evaluation, and it comes from the developers.

Current Status and Future Work

In terms of the current status of the modeling, we have already released IntelliCode member completion in Visual Studio for C#, C++, and XAML. XAML is a different type of language, more like XML, so we use different modeling techniques there, though with a lot of similarity. We have also released Python, TypeScript, JavaScript, and Java support in VS Code, which is a very popular editor in the open-source community right now.

We recently previewed our Method Argument Recommendation: when a method has arguments, we want to recommend which local variable or instance variable is going to be used for each argument. To improve precision and coverage, we want to allow users to train on their own code base. That way, we can combine the learning from their own code base with the learning from the public source repositories to give a much better user experience.

In terms of future work, I talked about how we need to enable live A/B testing for different models, so we can get a more accurate comparison between them. We are also constantly tuning the parameters of the current production model based on user feedback. We're thinking of ways to incorporate direct feedback from user commits as well: if the user didn't pick our recommendation and picked something else, we could incorporate that into the learning process. We need to be very careful there, because we don't want to violate GDPR compliance; we are still working through those challenges. And we are actively tuning the deep learning model to reduce its size while preserving its precision improvement.

In the long term, we are looking at line-level and snippet-level completion; we are literally looking at how far the machine can go in actually doing the coding. That part definitely has a lot more challenges: the search space is much bigger, and users probably have a much higher standard. If you show the wrong snippet, it's going to be a more painful experience, so you have to be very careful here. We will probably reduce the coverage and try to improve the precision there. With that, I'm going to hand it back to Allison [Buchholtz-Au] to talk about the collaboration between the different teams that got this to production.

How Do DS/Engineering/PM Work Together? What Lessons Did We Learn?

Buchholtz-Au: This was definitely a group effort; IntelliCode did not come about because of a singular team. We worked hand-in-hand with data science, our engineering team on Visual Studio, and our PM team, which stands for program management; you can think of us as the customer voice. I'm going to walk you through how our team worked together. It was a learning process for all of us, and I hope it's useful as you decide to build up teams like this.

One thing that I think really helped us was that every team was specialized for rapid iteration and meant that all of us could focus on the things we did best and then bring them all together in order to form IntelliCode. Our data science team really focused on our metric definition, how they were going to build the models, how they were going to build their own systems to allow for model iteration and for that to happen really fast. They also focused on keeping the engineering constraints in mind, they didn't worry about the nitty gritty of, "How are we going to implement this in our product?" They just kept those general constraints around latency and around time in mind.

Our PMs focused on testing the customer experience. We went out into the field, we ran numerous user studies, interviews, and we really tried to hone in on whether this was a problem worth solving and was the way we were solving it effective for our users? We really brought in that customer experience piece.

We also focused a lot on the qualitative feedback once we shipped. Engineering was all about: How do we integrate this into Visual Studio? How do we scale out our dogfooding platform? What pipelines and architecture do we need in place to serve machine learning models generally? They didn't care whether it was a frequency model or the semantic model; it just mattered that there was a model. Having everyone specialized allowed us all to work in parallel and move really fast to get the product out the door.

For your metrics: Shengyu [Fu] talked a lot about metrics, but these are the four key principles we think about as a team when we define ours. Ensure each metric is measurable: whether qualitative or quantitative, figure out what the measure is going to be and make sure it's consistent across anything you scale out to. Define metrics ahead of your prototypes, because the telemetry you're logging often needs to be built into whatever system you create; if you have it ahead of time, you can incorporate it into your stack. Track them consistently: we have dashboards running all the time so that, at a moment's notice, we can offer up a metric to whoever asks, which is especially great when your CVP says, "Tell me, how is the thing I'm funding doing?" And make sure you're integrating feedback; make sure you have a qualitative way of tracking as well.

Shengyu [Fu] mentioned that they have the hard numbers, but the PMs also do an online survey. We are consistently coming out to conferences like this one. We have an insiders list that we ping about once every four months. We have partner teams across all of the languages we support who integrate questions asking our users, "Are you using IntelliCode, and do you feel like it's made you more productive?" We have this great number: 73% of users across all of our supported languages self-report increased productivity, which is pretty great.

These are just a few of the great Twitter quotes we pull; I'm on Twitter pretty much every single day, and if you're interested, it's a great way to contact me. Things like no longer scrolling down the list because it figured out exactly what I needed are what we love to hear, because it means people are actually gaining value from it. When you have reports like this out in the field, it helps prove that you've built a good solution, and hopefully word spreads to everyone else.

One of the big challenges between PM, engineering, and research that we had to fix was finding a common mindset. These are the two main mindsets we have: the research mindset and the engineering mindset. They're diametrically opposed in some ways. Research is super open-ended: there's a potential solution, you go for it, it doesn't work, you leave it on the table; you're moving very fast. Visual Studio is 22 years old; we were not used to moving that fast, and we only recently started a quarterly update cycle. We are deadline-based; we care about things like Build and Connect, our big debut events. Research doesn't usually deal with big debuts like that.

We had long-lead development, and we weren't used to leaving things on the table. For us it was very rare to say, "Yes, we're going to do this," and then two months later say, "Just kidding, we're not going to do this." It's not as agile, so finding that common mindset was super key. Our engineering team had to really change the way we thought about each of these processes and be okay with investing for a few months and then leaving something behind if a model or experience didn't work out properly.

Super important: even though the teams are specialized, you have to drive alignment and ownership. We have weekly syncs at all levels of our teams, whether that's data science and engineering, or PM and research. We have many syncs so that everyone is on the same page and can provide guidance on whatever is going on. On ownership: when you expand your service, as we did to six languages, you often run into the problem of, "Ok, there's a bug. Whose bug is it?" That can lead to a lot of conflicts and arguments. As you scale out, it's super important to draw the boundary: the IntelliCode platform team owns everything up to this point, and a language team owns everything after. If you define that early, [inaudible 00:34:57] you'll have a much easier time.


Finally, some of our challenges. GDPR is a huge one; we're dealing with customer data. If we are building models, you might have a class named "super secret thing" that does x, so we have to be very careful about how we're using data and how we incorporate any feedback we get. And finally, there's the balance between the experience and the tech; this is where having someone who is the customer voice is incredibly useful.

You could have a great model that does exactly what you want it to, but if you have an experience that makes the user feel dumb, they're not going to want to use it. No one wants to be told, "Oh, you're wrong. You're being really stupid here." That's not a great experience. It's amazing how much even just a phrase or the way you bring up a model recommendation can drastically affect the usability and the effectiveness of your solution.

Those are our tips for building your own systems like this. Hopefully, you guys learned something new.

Questions and Answers

Participant 1: Thank you, great talk. I was wondering if you could comment on whether you benchmark differences between users of IntelliCode working on open-source projects versus proprietary projects? Do you foresee a need to retrain your model on proprietary code bases?

Buchholtz-Au: Yes, that's how one of our latest features came about. If you're working on proprietary software, you're going to have classes and types that are not found in open source, so we can't give you IntelliCode recommendations for them. What we built is what we call custom models: an extra feature within the extension where you can have the extension create a custom model from your own proprietary solution. You have the ability to generate those models based on your own code.

We didn't have to benchmark out in the wild; we heard from users about exactly this problem: "I'm working in private software. You can't give me any sort of recommendation; what do I do?" So we built the ability for you to create your own models based on our pipeline, but that model is private to you and whoever you choose to share it with. We don't have any access to it; we build it and serve it right back to you.

Fu: That feature is right now in preview for C# in Visual Studio. I want to add that we know there's training bias when we train on open-source code and users apply it to their own code, because we do see the precision drop from the offline to the online metrics, so there's a gap. That's why we introduced custom model training: you can train on your own code, because your own code, especially at larger companies with their own coding style and their own set of libraries, will give better precision and coverage if we train on it.

Buchholtz-Au: It's more aligned with how we develop.

Participant 2: Does it mean training the same model on different data, or does the user need to provide a whole different model?

Fu: What happens is, in Visual Studio, you can trigger a custom model training right there. It takes your solution, we extract the features and send them to the server, it trains the custom model, and then we do a merge between the custom model and the base model trained on open source. For some classes, we think the open-source usage, being the most popular, will have the best usage patterns; for other classes, because you have more usage of them, we think your own data is better. We have algorithms to do the model merging, and then you download the merged model back to the client to serve.
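The merging algorithm is not public, so here is only a hypothetical sketch of the idea described: per class, prefer the custom model's statistics when the team's own code shows enough usage to trust them, otherwise keep the open-source base model. All models, method names, and the threshold are invented:

```python
from collections import Counter

# Made-up base (open-source) and custom (in-house) per-class method counts.
base_model = {
    "String": Counter({"Replace": 500, "Length": 400}),
    "Logger": Counter({"Info": 50, "Error": 30}),
}
custom_model = {
    "Logger": Counter({"Error": 40, "Debug": 35, "Info": 5}),  # house style differs
    "String": Counter({"Trim": 3}),                            # too little evidence
}

def merge(base, custom, min_usage=20):
    """Per class, trust the custom stats only above a usage threshold."""
    merged = dict(base)
    for cls, counts in custom.items():
        if sum(counts.values()) >= min_usage:
            merged[cls] = counts  # enough in-house usage: take the custom stats
    return merged

merged = merge(base_model, custom_model)
print([m for m, _ in merged["Logger"].most_common(2)])  # ['Error', 'Debug']
print([m for m, _ in merged["String"].most_common(2)])  # ['Replace', 'Length']
```

The design choice this illustrates matches the answer above: popular open-source classes keep the broadly learned patterns, while classes your team uses heavily get your own usage patterns.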

Participant 3: I wanted to ask a little more about the performance side. You were talking about the user experience and trying to keep it below a 20-millisecond response time. Now, out there, different people have a variety of machines, some really slow, some really fast, and some of the people with fast machines bog them down with God knows how many processes. Can you explain a little more about your performance testing, and how you go about selecting the model that you use, and plan to use, in the future?

Buchholtz-Au: Our editor team has a set of tests that have to be run on everything we put out there, on every build of Visual Studio. There are automatic latency tests we have to pass in order to be in the product. The editor team has a whole suite of typing tests; we have to come in under a certain threshold, and the Visual Studio team defined that threshold based on data about the machines out there.

Fu: We have logging inside the recommendation engine; for each recommendation, we record how many milliseconds we used. We tally it up from the online telemetry, so we can tell the median, the average, and the distribution of the runtime of our recommendations.
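That tallying can be sketched with a few lines of standard-library Python; the latency samples below are invented purely for illustration:

```python
import statistics

# Made-up per-recommendation latency telemetry, in milliseconds.
latencies_ms = [4, 5, 5, 6, 7, 8, 9, 11, 14, 19, 45]

median = statistics.median(latencies_ms)
mean = statistics.fmean(latencies_ms)
# Simple nearest-rank 95th percentile to show the tail of the distribution.
p95 = sorted(latencies_ms)[int(0.95 * (len(latencies_ms) - 1))]

print(f"median={median}ms mean={mean:.1f}ms p95={p95}ms")
```

Reporting the tail (not just the mean) matters here, because the roughly 20-millisecond-per-call budget mentioned earlier is about never blocking a keystroke, and a single slow outlier is what the user notices.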

Buchholtz-Au: Yes. Then it's just a matter of how we keep it as low as possible.




Recorded at:

Jul 04, 2019