Microsoft Shares Lessons Learned on Building AI Copilots

Researchers at Microsoft and GitHub Inc. have conducted an in-depth study into the challenges, opportunities, and needs associated with building AI-powered product copilots. The research involved interviews with 26 professional software engineers from various companies responsible for developing these advanced tools.

The race to embed advanced AI capabilities into products is on, with virtually every technology company looking to add these features to its software; however, many problems remain. Orchestrating multiple data sources and prompts increases the risk of failures, and testing LLMs is difficult because their output varies from run to run. Developers also struggle to keep up with best practices in this rapidly evolving field, often resorting to social media or academic papers for guidance. Safety, privacy, and compliance are major concerns that require careful management to avoid potential damage or breaches.

"A one-stop shop for integrating AI into projects remains a challenge. Developers are seeking a place to get started quickly, transition from a playground to an MVP, connect their various data sources to the prompts, and then move the AI components into their existing codebase efficiently," Austin Henley wrote. "A prompt linter could provide quick feedback. Developers also asked for a library or 'toolbox' of prompt snippets for common tasks. Additionally, tracing the effect of prompt changes would be of huge value," Henley continued.
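To make the prompt-linter idea concrete, here is a minimal sketch of what such a tool might check. It is not from the study: the function name `lint_prompt`, the three rules, and the `MAX_CHARS` budget are all hypothetical choices made for illustration.

```python
import re
from typing import List

# Assumed size budget for illustration only; real limits depend on the model.
MAX_CHARS = 4000

# Matches template slots such as {document} that were never filled in.
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")


def lint_prompt(prompt: str) -> List[str]:
    """Return a list of warnings for a prompt (hypothetical rules)."""
    if not prompt.strip():
        return ["prompt is empty"]

    warnings: List[str] = []
    if PLACEHOLDER.search(prompt):
        warnings.append("unfilled template placeholder")
    if len(prompt) > MAX_CHARS:
        warnings.append(f"prompt exceeds {MAX_CHARS} characters")
    if prompt != prompt.strip():
        warnings.append("leading or trailing whitespace")
    return warnings


if __name__ == "__main__":
    for message in lint_prompt("Summarize {document} in three bullet points. "):
        print("warning:", message)
```

A tool along these lines could run on every prompt change, in the same way a code linter runs before a commit.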

One significant challenge identified was prompt engineering, the process of writing prompts that steer a model's output. "Because these large language models are often very, very fragile in terms of responses, there’s a lot of behavior control and steering that you do through prompting," said one participant (P7). This unpredictability makes prompt engineering more art than science: developers rely on trial and error, which can be time-consuming.

Another issue raised was testing and benchmarking. With generative models such as large language models (LLMs), writing assertions becomes difficult when each response might differ from the last one; it is like every test case is a flaky test. One participant explained: "That’s why we run each test 10 times" (P1), while another added: "Experimenting is the most time-consuming if you don’t have the right tools" (P12).
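The "run each test 10 times" approach can be sketched as a pass-rate check: repeat a nondeterministic assertion and accept the test if enough runs succeed. The example below is an illustration, not the participants' actual tooling. The `fake_model` function is a hypothetical stand-in for a real LLM call, and the 80% threshold is an assumed value.

```python
import random
from typing import Callable


def pass_rate(check: Callable[[], bool], runs: int = 10) -> float:
    """Run a nondeterministic check `runs` times; return the fraction that passed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check())
    return passed / runs


def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM: usually answers correctly, sometimes not."""
    return "Paris" if rng.random() < 0.9 else "I am not sure."


def make_check(rng: random.Random) -> Callable[[], bool]:
    """Build a check that queries the stand-in model and inspects its answer."""
    def check() -> bool:
        answer = fake_model("What is the capital of France?", rng)
        return "paris" in answer.lower()
    return check


if __name__ == "__main__":
    rate = pass_rate(make_check(random.Random(0)), runs=10)
    threshold = 0.8  # assumed acceptance threshold
    print(f"pass rate: {rate:.0%}, accepted: {rate >= threshold}")
```

Asserting on a pass rate rather than a single response trades runtime and model-call cost for stability, which is consistent with participants' remarks that experimentation is time-consuming.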

Furthermore, participants expressed concerns about safety, privacy, and compliance issues associated with integrating AI into products. For instance: "Do we want this affecting real people? This runs in nuclear power plants," voiced P11, highlighting potential risks in using such technologies without proper safeguards. Finally, developers struggle to keep up to date or even to know where they should focus their efforts on learning new skills or tools. "This is brand new to us. We are learning as we go. There is no specific path to do the right way!" (P1)

The study comes as Microsoft recently launched design changes and upgrades to its own Copilot. For example, all English-speaking Copilot users in the U.S., U.K., Australia, India and New Zealand can now edit images in the flow of a chat. These changes arrive as several Microsoft Copilot Pro subscribers report performance issues.

I’ve tested the experience in the latest Windows 11 Canary builds, and it works well for text, summarizing anything you copy. While the icon also animates when you copy images, this feature isn’t quite ready to test yet. - Tom Warren

Developers interested in learning more about the study can read a summary or the full article.

About the Author

Andrew Hoblitzell
