Hello and welcome to Eye on AI…In this edition: Meta is going big on data centers…the EU publishes its code of practice for general-purpose AI and OpenAI says it will abide by it…the U.K. AI Security Institute calls into question AI “scheming” research.

The big news at the end of last week was that OpenAI’s plans to acquire Windsurf, a startup that was making AI software for coding, for $3 billion fell apart. (My Fortune colleague Allie Garfinkle broke that bit of news.) Instead, Google announced that it was hiring Windsurf’s CEO Varun Mohan, cofounder Douglas Chen, and a clutch of other Windsurf staffers, while also licensing Windsurf’s tech. The deal is structured similarly to several other Big Tech not-quite-acquisitions of AI startups, including Meta’s recent deal with Scale AI, Google’s deal with Character.ai last year, Microsoft’s deal with Inflection, and Amazon’s with Adept. Bloomberg reported that Google is paying about $2.4 billion for Windsurf’s talent and tech, while another AI startup, Cognition, swooped in to buy what was left of Windsurf for an undisclosed sum. Windsurf may have gotten less than OpenAI was offering, but OpenAI’s purchase reportedly fell apart after OpenAI and Microsoft couldn’t agree on whether Microsoft would have access to Windsurf’s tech.
The increasingly fraught relationship between OpenAI and Microsoft is worth a whole separate story. So too is the structure of these non-acquisition acquihires, which really do seem to blunt any legal challenges, whether from regulators or from the startups’ venture backers. But today, I want to talk about coding assistants. While a lot of people debate the return on investment from generative AI, the one thing seemingly everyone can agree on is that coding is the clear killer use case for genAI. Right? I mean, that’s why Windsurf was such a hot property and why Anysphere, the startup behind the popular AI coding assistant Cursor, was recently valued at close to $10 billion. And GitHub Copilot is of course the star of Microsoft’s suite of AI tools, with a majority of customers saying they get value out of the product. Well, a trio of papers published this past week complicates this picture.
Experiment calls gains from AI coding assistants into question

METR, a nonprofit that benchmarks AI models, conducted a randomized controlled trial involving 16 developers earlier this year to see if using the code editor Cursor Pro, integrated with Anthropic’s Claude 3.5 and 3.7 Sonnet models, actually improved their productivity. METR surveyed the developers before the trial to see if they thought it would make them more efficient, and by how much. On average, they estimated that using AI would allow them to complete the assigned coding tasks 24% faster. The researchers then randomly assigned 246 coding tasks to be completed either with AI assistance or without it. Afterwards, the developers were surveyed again on what impact they thought the use of Cursor had actually had on the average time to complete the tasks. They estimated that it had made them, on average, 20% faster. (So maybe not quite as efficient as they had forecast, but still pretty good.) But here’s the rub: METR found that when assisted by AI, the coders actually took 19% longer to finish tasks.
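To make the arithmetic behind that perception gap concrete, here is a minimal sketch, in Python, of how the headline effect in a trial like this gets estimated: compare completion times across the randomized AI and no-AI arms. The timing numbers are invented placeholders, not METR’s code or data, chosen so the measured change lands near the study’s 19% figure.

```python
# Minimal sketch (not METR's actual analysis) of estimating the effect of
# AI assistance from a task-randomized trial. All numbers are made up.

from statistics import mean

# Hypothetical completion times in minutes for tasks randomized to each arm.
no_ai_times = [42, 55, 38, 61, 47, 50]
ai_times = [51, 63, 44, 74, 58, 60]

# Measured effect: percent change in mean completion time when AI is allowed.
measured_change = (mean(ai_times) / mean(no_ai_times) - 1) * 100

# Contrast with the self-reported estimates from the study: a forecast 24%
# speedup before the trial and a perceived 20% speedup afterwards.
print(f"forecast: -24%, perceived: -20%, measured: {measured_change:+.0f}%")
```

The point of randomizing the tasks, rather than the people, is that each arm sees a comparable mix of work, so a difference in means can be attributed to the AI assistance rather than to easier or harder tasks.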
What’s going on here? Well, one issue was that the developers, who were all highly experienced, found that Cursor could not reliably generate code as good as theirs. In fact, they accepted fewer than 44% of the AI-generated suggestions. And even when they did accept them, three-quarters of the developers felt the need to read over every line of AI-generated code to check it for accuracy, and more than half of the coders made major changes to the Cursor-written code to clean it up. This all took time: on average, 9% of the developers’ time was spent reviewing and cleaning up AI-generated output. Many of the tasks in the METR experiment involved large code bases, sometimes consisting of over 100,000 lines of code, and the developers found that Cursor sometimes made strange changes in other parts of the code base that they had to catch and fix.
Is it just vibes all the way down?

But why did the developers think the AI was making them faster when in fact it was slowing them down? And why, when the researchers followed up with the developers after the experiment ended, did they discover that 69% of the coders were continuing to use Cursor?

Some of it seems to be that despite the time it took to edit the Cursor-generated code, the AI assistance did ease the cognitive burden for many of the coders. It was mentally easier to fix the AI-generated code than to puzzle out the right solution from scratch. So is the perceived ROI from “vibe coding” itself just vibes? Perhaps. That would square with what the Wall Street Journal noted about a different area of genAI use: lawyers using genAI copilots. The newspaper reported that a number of law firms found that, given how long it took to fact-check AI-generated legal research, they were not sure lawyers were actually saving any time using the tools. But when they surveyed lawyers, especially junior ones, the lawyers reported high satisfaction with the AI copilots and said the tools made their jobs more enjoyable.
But a couple of other studies from last week suggest that maybe it all depends on exactly how you use AI coding assistance. A team from Harvard Business School and Microsoft looked at two years of observations of software developers using GitHub Copilot (which is a Microsoft product) and found that those using the tool spent more time on coding and less time on project-management tasks, in part because GitHub Copilot allowed them to work independently instead of having to work in large teams. It also allowed the coders to spend more time exploring possible solutions to coding problems and less time actually implementing those solutions. This too might explain why coders enjoy using these AI tools: they let developers spend more time on the parts of the job they find intellectually interesting, even if the tools don’t necessarily deliver overall time savings.
Maybe the problem is coders just aren’t using enough AI?

Finally, let’s look at the third study, from researchers at the Chinese AI startup Modelbest, the Chinese universities BUPT and Tsinghua, and the University of Sydney. They found that while individual AI software development tools often struggled to reliably complete complicated tasks, the results improved markedly when multiple large language models were each prompted to take on a specific role in the software development process and to pose clarifying questions to one another in order to minimize hallucinations. They called this architecture “ChatDev.”
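To give a flavor of that pattern, here is a minimal sketch of a role-based agent exchange. It is not ChatDev’s actual code: the role prompts are illustrative, and ask_llm is a hypothetical helper you would wire to whatever chat-completion API you use.

```python
# Minimal sketch of a ChatDev-style exchange: LLM "agents" with distinct
# system prompts hand work back and forth, with a clarifying-question round
# meant to surface ambiguity before any code gets written.

def ask_llm(system_prompt: str, message: str) -> str:
    # Hypothetical helper: wire this to your LLM provider of choice.
    raise NotImplementedError

ROLES = {
    "product_lead": "You turn feature requests into precise specifications.",
    "programmer": (
        "You write Python code that satisfies a specification. Ask "
        "clarifying questions before coding if anything is ambiguous."
    ),
    "reviewer": "You review code against its spec and list concrete defects.",
}

def develop(feature_request: str) -> str:
    # 1. The product lead drafts a specification from the raw request.
    spec = ask_llm(ROLES["product_lead"], feature_request)

    # 2. The programmer poses clarifying questions; the product lead answers.
    #    This Q&A round is the hallucination-reduction step the paper describes.
    questions = ask_llm(
        ROLES["programmer"], f"Spec:\n{spec}\n\nList your clarifying questions."
    )
    answers = ask_llm(
        ROLES["product_lead"], f"Spec:\n{spec}\n\nAnswer:\n{questions}"
    )

    # 3. The programmer writes code against the clarified spec.
    code = ask_llm(
        ROLES["programmer"],
        f"Spec:\n{spec}\n\nClarifications:\n{answers}\n\nWrite the code.",
    )

    # 4. A reviewer critiques and the programmer revises once.
    #    (ChatDev itself iterates through more phases, including testing.)
    critique = ask_llm(ROLES["reviewer"], f"Spec:\n{spec}\n\nCode:\n{code}")
    return ask_llm(
        ROLES["programmer"], f"Code:\n{code}\n\nFix these issues:\n{critique}"
    )
```

Note that even this toy version makes six model calls for a single feature, and every added role or review phase multiplies that count, which is where the compute cost mentioned below comes from.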
So maybe there’s a case to be made that the problem with AI coding assistants is how we are using them, not anything wrong with the tech itself? Of course, building teams of AI agents to work in the way ChatDev suggests also uses up a lot more computing power, which gets expensive. So maybe we’re still facing that question: is the ROI here a mirage?
With that, here’s more AI news.
Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn

Before we get to the news, the U.S. paperback edition of my book, Mastering AI: A Survival Guide to Our Superpowered Future, is out from Simon & Schuster. Consider picking up a copy for your bookshelf.
Also, do you want to know more about how to use AI to transform your business? Interested in what AI will mean for the fate of companies, and countries? Then join me at the Ritz-Carlton, Millenia in Singapore on July 22 and 23 for Fortune Brainstorm AI Singapore. This year’s theme is The Age of Intelligence. We will be joined by leading executives from DBS Bank, Walmart, OpenAI, Arm, Qualcomm, Standard Chartered, Temasek, and our founding partner Accenture, plus many others, along with key government ministers from Singapore and the region, top academics, investors, and analysts. We will dive deep into the latest on AI agents, examine the data center build-out in Asia, explore how to create AI systems that produce business value, and talk about how to ensure AI is deployed responsibly and safely. You can apply to attend here, and, because you are loyal Eye on AI readers, I’m able to offer complimentary tickets to the event. Just use the discount code BAI100JeremyK when you check out.
Note: The essay above was written and edited by Fortune staff. The news items below were selected by the newsletter author, created using AI, and then edited and fact-checked.