Why AI-generated code isn’t good enough (and how it will get better)
Monday, March 17, 2025, 10:00, by InfoWorld
Large language models (LLMs) seemed to arrive in a flash. Monumental productivity gains were promised. Coding assistants flourished. Millions of multi-line code blocks were generated with a key press and merged. It worked like magic. But at the back of everyone’s mind was a nagging thought—can I actually trust this code?
It feels laughable to question the merits of AI in software development in 2025, as it’s already inextricably entrenched. Microsoft reports that 150 million developers use GitHub Copilot. Stack Overflow’s 2024 survey found 61.8% of developers use AI within their development process. Google claims over a quarter of its new code is AI-generated. In short, “AI-generated code is already the norm,” says Chris Anley, chief scientist at NCC Group. But is it really up to the task?

The problems with AI-generated code

“Let’s be real: LLMs are not software engineers,” says Steve Wilson, chief product officer at Exabeam and author of O’Reilly’s Playbook for Large Language Model Security. “LLMs are like interns with goldfish memory. They’re great for quick tasks but terrible at keeping track of the big picture.”

As reliance on AI increases, that “big picture” is being sidelined. Ironically, by certain accounts, the total developer workload is increasing: the majority of developers spend more time debugging AI-generated code and resolving security vulnerabilities, found The 2025 State of Software Delivery report.

“AI output is usually pretty good, but it’s still not quite reliable enough,” says Bhavani Vangala, co-founder and vice president of engineering at Onymos. “It needs to be a lot more accurate and consistent. Developers still always need to review, debug, and adjust it.”

To improve AI-generated code, we must address key concerns: distrust, code quality issues, context limitations, hallucinations, and security risks. AI shows incredible promise, but human oversight remains critical.

Bloat and context limits

AI code completion tools tend to generate new code from scratch rather than reuse or refactor existing code, leading to technical debt. Worse, they tend to duplicate code, missing opportunities for code reuse and increasing the volume of code that must be maintained. “Code bloat and maintainability issues arise when verbose or inefficient code adds to technical debt,” notes Sreekanth Gopi, prompt engineer and senior principal consultant at Morgan Stanley.

GitClear’s 2025 AI Copilot Code Quality report analyzed 211 million lines of code changes and found that in 2024, the frequency of duplicated code blocks increased eightfold. “Since AI-authored code began its surge in mid-2022, there has been more evidence every year that code duplication keeps growing,” says Bill Harding, CEO of Amplenote and GitClear. In addition to piling on unnecessary technical debt, cloned code blocks are linked to more defects, anywhere from 15% to 50% more, research suggests.

These issues stem from AI’s limited context. “AI is better the more context it has, but there is a limit on how much information can be supplied to an AI model,” says Rod Cope, chief technical officer at Perforce Software. GitHub reports Copilot Chat has a 64k-128k token context window, equating to about 30 to 100 small files or five to 20 large ones. While context windows are growing, they’re still insufficient to grasp full software architectures or suggest proper refactoring.
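To make those numbers concrete, here is a minimal sketch, not from the article, that walks a repository and estimates how much of it would fit into a 128k-token window using the rough rule of thumb of about four characters per token. The repository path, the Python-only file filter, and the characters-per-token ratio are all illustrative assumptions.

```python
# Rough sketch: estimate how much of a repo fits in a 128k-token context window.
# Assumptions (not from the article): ~4 characters per token, Python files only,
# and a repository path passed on the command line.
import sys
from pathlib import Path

CONTEXT_TOKENS = 128_000      # upper end of the window cited for Copilot Chat
CHARS_PER_TOKEN = 4           # crude heuristic; varies by tokenizer and language

def estimate_tokens(path: Path) -> int:
    """Approximate a source file's token count from its character count."""
    text = path.read_text(errors="ignore")
    return len(text) // CHARS_PER_TOKEN

def main(repo_root: str) -> None:
    files = sorted(Path(repo_root).rglob("*.py"))
    total = 0
    fitting = 0
    for f in files:
        tokens = estimate_tokens(f)
        total += tokens
        if total <= CONTEXT_TOKENS:
            fitting += 1
    print(f"{len(files)} Python files, ~{total:,} estimated tokens in total")
    print(f"Roughly the first {fitting} files would fit in a {CONTEXT_TOKENS:,}-token window")

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")
```

On most non-trivial codebases the estimate lands well beyond the window, which is why an assistant only ever sees a slice of the architecture at a time.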
No ‘big picture’ thinking

While AI excels at pattern recognition, it doesn’t see the “why” behind the code. This limits its ability to make trade-offs around business logic, user experience, or long-term maintainability. “AI lacks the full context and problem-solving abilities that senior engineers bring to the table,” says Nick Durkin, chief technical officer at Harness. Coding is inherently a creative and people-centric activity.

“AI cannot build new things that previously did not exist,” says Tobie Morgan Hitchcock, chief executive officer and co-founder of SurrealDB. “Developers use creativity and knowledge of human preference to build solutions that are specifically designed for the end user.”

As a result, AI tools often “waste more time than they save” on tasks like generating entire programs, or wherever broader context is required, says NCC’s Anley. “The quality of the code generated drops significantly when they’re asked to write longer-form routines.”

Hallucinations and security risks

Hallucinations remain a concern. “AI doesn’t just make mistakes—it makes them confidently,” says Exabeam’s Wilson. “It will invent open-source packages that don’t exist, introduce subtle security vulnerabilities, and do it all with a straight face.”

These errors often stem from a poor data corpus. As Durkin explains, AI trained on synthetic data risks creating an echo chamber, leading to model collapse. Cory Hymel, vice president of research and innovation at Crowdbotics, likewise points to a lack of high-quality training data as the biggest hurdle. For instance, OpenAI Codex, the popular model that GitHub Copilot uses, was trained on publicly available code containing errors that affect quality.

Security vulnerabilities are another issue. “AI-generated code may contain exploitable flaws,” says Morgan Stanley’s Gopi. And while AI is good at fixing bugs, it struggles to find them. A research paper from OpenAI found that AI agents “fail to root cause, resulting in partial or flawed solutions.” The paper notes:

“Agents pinpoint the source of an issue remarkably quickly, using keyword searches across the whole repository to quickly locate the relevant file and functions—often far faster than a human would. However, they often exhibit a limited understanding of how the issue spans multiple components or files, and fail to address the root cause, leading to solutions that are incorrect or insufficiently comprehensive.”

Other industry reports also find rising defect rates tied to AI. For instance, Apiiro research found that exposures of personally identifiable information (PII) and payment data in code repositories have surged three-fold since mid-2023, attributing the rise to the adoption of AI-assisted development.

Legal gray areas could also stunt the use of AI code and introduce compliance issues: some AI tools claim ownership of the code they output, while others retain IP for model retraining purposes. “Many companies are concerned about protecting proprietary data and ensuring it is not inadvertently used to train external models,” says Adam Kentosh, field chief technical officer of North America at Digital.ai.

Distrust and adoption barriers

“It all comes down to trust—do people trust what AI generates for building new applications?” asks Dan Fernandez, vice president of product management at Salesforce. Google’s 2024 DORA report found that, on average, developers only “somewhat” trust AI-generated code.

“The biggest barrier to adoption is trust in AI’s accuracy,” says Durkin. Unlike a human developer, AI has no intrinsic conscience or accountability, he says, making compliance and reliability checks more crucial for AI outputs.

AI’s opacity makes it difficult to trust in critical applications. “Trust is a big issue when it comes to any AI-provided code, but for legacy code in particular, which is where most software investment happens,” says Jeff Gabriel, executive vice president of engineering at Contentful.

“The biggest hurdle is likely internal opposition to AI at many companies,” says Joseph Thacker, solo founder of rez0corp and a bug bounty hunter, noting that high-level staff often bar sanctioned AI use.
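One small way teams build that trust is to verify an assistant’s suggestions mechanically before acting on them. The sketch below is an illustration, not something the article prescribes: it checks whether AI-suggested Python dependencies actually exist on PyPI, a cheap guard against the hallucinated package names Wilson describes. The package list is hypothetical.

```python
# Minimal sketch: verify that AI-suggested Python dependencies actually exist on
# PyPI before installing them, a cheap guard against hallucinated package names.
# The suggested list below is illustrative; swap in whatever an assistant proposes.
import json
import urllib.error
import urllib.request

def package_exists(name: str) -> bool:
    """Return True if the package has an entry in the PyPI JSON index."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            json.load(resp)          # parse to confirm a well-formed response
            return True
    except urllib.error.HTTPError as err:
        if err.code == 404:          # unknown name: possibly hallucinated
            return False
        raise

suggested = ["requests", "totally-made-up-ai-helper"]  # hypothetical AI suggestions
for name in suggested:
    status = "found on PyPI" if package_exists(name) else "NOT FOUND, review before use"
    print(f"{name}: {status}")
```

A name that resolves is not automatically safe, of course; typosquatted packages exist, so a check like this removes one failure mode rather than establishing trust on its own.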
How AI-generated code will improve

Although AI-generated code faces obstacles, solutions are emerging, many of them revisiting fundamental coding best practices. “The challenges are multi-faceted, but we’re already seeing these challenges addressed,” says Shuyin Zhao, vice president of product for GitHub Copilot.

Validating AI outputs

Just as with human-generated code, rigorous testing must be applied to AI-generated code. “Developers should still carefully review, refine, and optimize AI-generated code to ensure it meets the highest standards for security, performance, and maintainability,” says Kevin Cochrane, chief marketing officer at Vultr.

Automated testing of AI outputs will be key. Perforce’s Cope recommends taking a page from the devops playbook with automated testing, static code analysis, and masking sensitive data for training AI models. “Many of these tools are already engineered to support AI or, if not, will do so very soon.”

“Increased code throughput from AI puts pressure on downstream processes and systems, necessitating robust automation in QA testing to ensure continued reliability,” adds Digital.ai’s Kentosh.

AI can also play a role in policing itself: double-checking code quality, using predictive models to identify potential risks, and conducting security scans. “More widespread use of responsible AI (RAI) filters to screen for harmful content, security vulnerabilities, and notify users of public code matching are all important,” says GitHub’s Zhao.

Progressive rollouts can also help avoid drawbacks by gauging the effect of individual code changes. “Techniques like canary deployments, feature flagging, or feature management allow teams to validate code with limited exposure,” says Durkin.

Better training data

It all comes down to the training data because, as the saying goes, “garbage in, garbage out.” As such, Zhao believes we need “more sanitization and use of high-quality code samples as training data.” Avoiding model collapse requires feeding AI models additive data rather than regurgitated outputs.

Feeding LLMs project-specific context, like custom libraries, style guides, software bills of materials, or security knowledge, can also improve accuracy. “Ensuring AI models are trained on trusted data and fine-tuned for specific applications will help improve the accuracy of AI-generated code and minimize hallucinations in outputs,” says Salesforce’s Fernandez.

Certain IDE-based solutions and technologies are emerging to grant developers more real-time context, too. Onymos’s Vangala expects retrieval-augmented generation (RAG) to help models reference version-specific software libraries or code repositories.

Finely tuned models

Instead of relying on massive general models, companies are shifting toward smaller, specialized models for specific coding tasks. “The largest model isn’t necessary for every use case in the developer life cycle,” says Fernandez. “We’re exploring a federated architecture of smaller models, where low-powered LLMs handle many tasks for developers.”

Improved training and finely tuned models will likely result in a higher degree of accuracy, but the best results may operate behind corporate firewalls. “2025 will see the rise of fine-tuned models trained on companies’ existing code that run ‘behind the wall’ significantly outperforming publicly available models,” says Crowdbotics’s Hymel.
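As an illustration of the retrieval-augmented approach Vangala describes, here is a toy sketch under stated assumptions: a ./docs folder of Markdown style guides and library notes, naive keyword-overlap scoring instead of embeddings, and a hypothetical task string. It assembles project-specific context into the prompt before the model ever sees the request.

```python
# Toy sketch of retrieval-augmented prompting: pull the most relevant in-house
# snippets into the prompt so the model sees project-specific context.
# Keyword-overlap scoring stands in for embeddings and a vector store; the
# ./docs folder and the example task are illustrative assumptions.
from pathlib import Path

def score(query: str, text: str) -> int:
    """Count how many query words appear in the candidate document."""
    words = {w.lower() for w in query.split()}
    return sum(1 for w in words if w in text.lower())

def build_prompt(task: str, doc_dir: str, top_k: int = 3) -> str:
    docs = [(p, p.read_text(errors="ignore")) for p in Path(doc_dir).glob("*.md")]
    ranked = sorted(docs, key=lambda d: score(task, d[1]), reverse=True)[:top_k]
    context = "\n\n".join(f"# {p.name}\n{text[:2000]}" for p, text in ranked)
    return (
        "You are helping on an internal codebase. Follow the style guide and "
        "use only the libraries referenced below.\n\n"
        f"Project context:\n{context}\n\nTask: {task}\n"
    )

if __name__ == "__main__":
    # Hypothetical usage: ./docs holds style guides and version-specific library notes.
    print(build_prompt("add retry logic to the payments client", "./docs"))
```

A production setup would swap the keyword scoring for embeddings and a vector store, but the shape is the same: retrieve project context first, then prompt.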
Enhanced prompt engineering

Another aspect is improved prompt engineering. “We’ll also need to work on how we prompt, which includes the additional context and potential fine-tuning for system-specific scenarios,” says Contentful’s Gabriel.

“Prompt engineering is going to be a necessary part of a software engineer’s job,” says Vangala. To get there, the onus is on developers to upskill. “We need to teach our developers how to write better prompts to get the kind of AI output we want.”

New AI-enabled solutions will also help. “The biggest impact will come from better models and better coding applications which provide more context,” says rez0corp’s Thacker, pointing to solutions like Cursor and the recent upgrades to GitHub Copilot.

New agentic AI tools

AI agents will be a continued focal point for improving software engineering overall, bringing self-checking capabilities. “New reasoning models can now iterate and verify their own work, reducing hallucinations,” says Exabeam’s Wilson. For instance, GitHub has added Copilot Autofix, which can detect vulnerabilities and provide fix suggestions in real time, and a build-and-repair agent to Copilot Workspace. “Perhaps the biggest, most exciting thing we’ll continue to see is the use of agents to improve code quality,” says GitHub’s Zhao.

“I expect that AI-generated code will be normalized over the next year,” says Fernandez, pointing to the ongoing rise of AI-powered agents for software developers that extend beyond code generation to testing, documentation, and code reviews.

“Developers should also investigate the myriad of tools available to find those that work and consider how to fill the gaps with those that don’t,” says Gabriel. This will require both individual and organizational investment, he adds.

Looking to the future, many anticipate open source leading further AI democratization. “I expect we’ll see a lot more open-source models emerge to address specific use cases,” says David DeSanto, chief product officer at GitLab.

Governance around AI usage

Enhancing developers’ confidence in AI-generated code will also rely on setting guardrails for responsible usage. “With the appropriate guardrails in place to ensure responsible and trusted AI outputs, businesses and developers will become more comfortable starting with AI-generated code,” says Salesforce’s Fernandez.

To get there, leadership must establish clear directions. “Ultimately, it’s about setting clear boundaries for those with access to AI-generated code and putting it through stricter processes to build developer confidence,” says Durkin.

“Ensuring transparency in model training data helps mitigate ethical and intellectual property risks,” says Morgan Stanley’s Gopi. Transparency is crucial from an IP standpoint, too. “Having no hold on AI output is critical for advancing AI code generation as a whole,” says GitLab’s DeSanto, who references GitLab Duo’s transparency commitment regarding its underlying models and usage of data.

For security-conscious organizations, on-premises AI may be the answer to avoiding data privacy issues. Running self-hosted models in air-gapped, offline deployments allows AI to be used in regulated environments while maintaining data security, says DeSanto.
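What such “stricter processes” might look like in practice: below is a minimal sketch of a pre-merge gate that runs the test suite, a static analysis pass, and a crude secrets scan over a change, failing the build if any check does. The tool choices (pytest, ruff) and the regex patterns are illustrative assumptions, not recommendations from the article.

```python
# Sketch of a pre-merge gate for AI-generated changes: run tests, a static
# analysis pass, and a crude secrets scan, and block the merge if any step
# fails. Tool choices and patterns here are illustrative assumptions.
import re
import subprocess
import sys
from pathlib import Path

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                        # AWS access key id shape
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),  # embedded private keys
]

def run(cmd: list[str]) -> bool:
    """Run a command and report whether it exited cleanly."""
    result = subprocess.run(cmd)
    return result.returncode == 0

def scan_for_secrets(root: str) -> bool:
    """Return True if no obvious secret-shaped strings are found."""
    clean = True
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for pattern in SECRET_PATTERNS:
            if pattern.search(text):
                print(f"possible secret in {path}")
                clean = False
    return clean

if __name__ == "__main__":
    checks = [
        run(["pytest", "-q"]),          # automated tests
        run(["ruff", "check", "."]),    # static analysis / lint
        scan_for_secrets("."),          # secret/PII spot check
    ]
    sys.exit(0 if all(checks) else 1)
```

Wired into CI, a gate like this makes human review the last check on AI-generated code rather than the only one.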
Striking a balance between human and AI

All experts interviewed for this piece believe AI will assist developers rather than replace them wholesale. In fact, most view keeping developers in the loop as imperative for retaining code quality. “For now, human oversight remains essential when using AI-generated code,” says Digital.ai’s Kentosh.

“Building applications will mostly remain in the hands of the creative professionals using AI to supplement their work,” says SurrealDB’s Hitchcock. “Human oversight is absolutely necessary and required in the use of AI coding assistants, and I don’t see that changing,” adds Zhao.

Why? Partially, the ethical challenges. “Complete automation remains unattainable, as human oversight is critical for addressing complex architectures and ensuring ethical standards,” says Gopi.

That said, AI reasoning is expected to improve. According to Wilson, the next phase is AI “becoming a legitimate engineering assistant that doesn’t just write code, but understands it.” Others are even more bullish. “I think that the most valuable AI-driven systems will be those that can be handed over to AI coding entirely,” says Contentful’s Gabriel, although he acknowledges this is not yet a consistent reality.

For now, future outlooks still place AI and humans working side by side. “Developers will become more supervisors rather than writing every line of code,” says Perforce’s Cope. The end goal is striking the right balance: capturing AI’s productivity gains without sliding into over-reliance. “If developers rely too heavily on AI without a solid understanding of the underlying code, we risk losing creativity and technical depth, which are crucial for innovation,” says Kentosh.

Wild ride ahead

Amazon recently claimed its AI rewrote a Java application, saving $260 million. Others are under pressure to prove similar results. “Most companies have made an investment in some type of AI-assisted development service or copilot at this point and will need to see a return on their investment,” says Kentosh.

Meanwhile, AI adoption continues to accelerate. “Most every developer I know is using AI in some capacity,” adds Thacker. “For many of them, AI is writing the majority of the code they produce each day.”

Yet, while AI eliminates repetitive tasks effectively, it still requires human intervention to carry the work the last mile. “The majority of code bases are boilerplate and repeatable,” says Crowdbotics’s Hymel. “We’ll see AI being used to lay 51%+ of the ‘groundwork’ of an application that is then taken over by humans to complete.”

The bottom line? “AI-generated code isn’t great—yet,” says Wilson. “But if you’re ignoring it, you’re already behind. The next 12 months are going to be a wild ride.”
https://www.infoworld.com/article/3844363/why-ai-generated-code-isnt-good-enough-and-how-it-will-get...