AI Leadership Weekly

Issue #39

Welcome to the latest AI Leadership Weekly, a curated digest of AI news and developments for business leaders.

Top Stories

Source: Pexels.com

AI agents fail at 70% of complex tasks
Agentic AI – the promise of digital co-workers who can autonomously handle complex office tasks – simply isn’t living up to the hype, with recent studies finding they get multi-step jobs wrong roughly 70% of the time. According to new research from Carnegie Mellon University (CMU) and Salesforce, even the best-performing AI agents complete only about 30% of multi-step tasks, raising serious doubts about the grand claims made by some vendors and technologists.

  • Agent hype runs ahead of reality. Gartner estimates over 40% of agentic AI projects will be cancelled by 2027 due to spiralling costs, unclear business value, and poor risk controls. They also warn that most so-called “agentic” products on the market today don’t actually qualify, calling out what they describe as “‘agent washing’ – the rebranding of existing products … without substantial agentic capabilities.”

  • Lacklustre actual performance. Researchers at CMU set up a simulated office environment to test 13 popular AI models. Leading contenders like Gemini 2.5 Pro only completed about 30% of knowledge work tasks, with “experiments [showing] the best-performing model … achieve[s] a score of 39.3 percent on our metric that provides extra credit for partially completed tasks.” Salesforce’s CRM-focused benchmark found similar issues, especially as task complexity increased.

  • Privacy, security, and basic competence. Privacy and security issues loom large, as these agents often require deep access to sensitive data, and their limited “confidentiality awareness” is a red flag for enterprise IT. Actual test runs saw agents failing at basic tasks and, in one case, inventing a shortcut by renaming a colleague to bypass an obstacle.
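
For readers curious what “extra credit for partially completed tasks” means in practice, here is a minimal sketch of the general idea. It is our own illustration, not the CMU team’s actual metric: each task is scored by the fraction of its checkpoints an agent passes, and the scores are averaged.

```python
# Illustrative only: NOT the CMU/TheAgentCompany metric, just the general
# idea of partial credit -- score each task by the fraction of checkpoints
# the agent passed, then average across tasks.
def partial_credit_score(results: list[tuple[int, int]]) -> float:
    """results: one (checkpoints_passed, checkpoints_total) pair per task."""
    per_task = [passed / total for passed, total in results]
    return 100 * sum(per_task) / len(per_task)

# Example: one task fully done, one half done, one barely started.
print(partial_credit_score([(4, 4), (2, 4), (1, 5)]))  # ~56.7
```

Under a scheme like this, an agent can post a respectable-looking partial score while fully completing very few tasks, which is exactly the gap between the roughly 30% completion figure and the 39.3% “extra credit” score quoted above.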

The bottom line: Agentic AI is still firmly in the “work in progress” category. The technology hasn’t caught up with sellers’ ambitions, or with science fiction, and the performance and security gaps are impossible to ignore. For leaders considering AI agents, be wary of the noisy marketing and keep expectations grounded: “Most agentic AI propositions lack significant value or return on investment,” as Gartner put it.

Source: resumebuilder.com

60% of managers let AI make staffing decisions, including raises
AI is rapidly muscling into the heart of workplace decision-making, with a startling new survey revealing that six in ten U.S. managers now use AI to decide who gets hired, promoted, laid off, or even fired. Even more concerning, nearly one in five managers frequently hand the final verdict over to an algorithm without any human review, according to the latest poll by Resume Builder.

  • AI making key calls in people management. Not only are most managers using generative tools like ChatGPT, Copilot, or Gemini, but a whopping 78% rely on AI to help decide raises and 77% use it to determine promotions. “6 in 10 managers rely on AI to make decisions about their direct reports,” the survey shows, with some letting it call the shots unchallenged.

  • Training is lacking. While AI’s presence in the workplace surges, two-thirds of managers using AI in employee decisions have received no formal training on ethical or effective use. Stacie Haller from Resume Builder warns, “It’s essential not to lose the ‘people’ in people management … AI outcomes reflect the data it’s given, which can be flawed, biased, or manipulated.”

  • Replacing humans with AI. The threat to jobs is tangible: 46% of managers say they were asked to assess if AI could replace their reports, 57% concluded it could, and 43% actually did replace a human with an AI system.

Why it matters: Organisations seem eager to chase the efficiency and cost-savings promised by AI in management, often without the training or oversight needed to ensure fair and humane outcomes. In the rush, companies risk eroding employee trust, making poor decisions, or even courting legal trouble. As Haller cautions, “Organizations must provide proper training and clear guidelines around AI, or they risk unfair decisions and erosion of employee trust.”

Source: Pexels.com

Cats can poison AIs and increase error rates
A team of researchers has shown that even seemingly harmless phrases, like “cats sleep most of their lives”, can dramatically undermine the accuracy of advanced AI reasoning models. Their “CatAttack” study found that just a single irrelevant sentence can triple the error rates in state-of-the-art models, raising concerns about just how easily these systems can be tripped up by out-of-context information.

  • Small distractions, big problems. The researchers demonstrated that common “triggers”—random facts, incorrect guesses, or generic advice—inserted into prompts can sharply increase error rates. For one leading model, DeepSeek R1, error rates jumped from 1.5% to 4.5%, and in some cases, mistakes increased by up to 10 times.

  • Slower, more expensive answers. These distraction tactics not only lead to more mistakes but also push models into much longer (and costlier) responses, a phenomenon dubbed the “slowdown attack”: extra output means extra compute cost.

  • A growing concern for real-world applications. The researchers say these vulnerabilities could pose real risks in sensitive sectors like finance, law, and healthcare, where accuracy matters most. Experts such as Shopify CEO Tobi Lutke and ex-OpenAI researcher Andrej Karpathy are now calling robust “context engineering” essential for deploying large language models safely.
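
To get a feel for how fragile this is, here is a minimal sketch of a CatAttack-style probe, written by us rather than taken from the paper: ask a model the same question twice, once with an irrelevant cat fact appended, and compare the answers. The model name, questions, and trigger sentence are illustrative assumptions, and a handful of prompts is nowhere near a proper evaluation.

```python
# Minimal sketch of a CatAttack-style probe (our illustration, not the
# researchers' code): compare answers with and without an irrelevant
# trailing sentence. Assumes an OpenAI-compatible key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

QUESTIONS = [
    ("If a train travels 60 km in 45 minutes, what is its average speed in km/h?", "80"),
    ("What is 17 * 24?", "408"),
]
TRIGGER = " Interesting fact: cats sleep for most of their lives."

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; swap in the model you want to probe
        messages=[{"role": "user", "content": prompt + " Answer with just the number."}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

for question, expected in QUESTIONS:
    baseline = ask(question)
    distracted = ask(question + TRIGGER)
    print(f"Q: {question}")
    print(f"  baseline: {baseline!r} | with trigger: {distracted!r} | expected: {expected}")
```

On questions this simple most models will likely shrug the trigger off; the reported jumps in error rate were measured on harder maths problems and reasoning-tuned models such as DeepSeek R1.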

The bottom line: This research offers a stark reminder of how fragile current language models remain when faced with irrelevant or misleading context. As organisations race to adopt AI for more complex tasks, ensuring robust defences against these relatively simple “attacks” needs to be at the top of every AI leader’s agenda.

In Brief

Market Trends

Meta-learning and adaptability are crucial to the next generation of AI
The era of turbocharging AI with ever-larger models and expecting them to evolve into true intelligence is over, according to renowned AI researcher François Chollet. Instead, he argues, breakthroughs will come from systems that can truly adapt, including solving problems they’ve never seen before, much like a savvy human engineer rather than a well-read parrot.

  • Scaling up has hit a wall. Chollet’s ARC (Abstraction and Reasoning Corpus) benchmark proves the point: while human performance sits above 95%, even massive models like GPT-4.5 top out at 10%. “Scaling up pre-training alone does not produce flexible intelligence,” Chollet observes, casting doubt on the current “bigger is better” trajectory.

  • The shift to test-time adaptation. 2024 marks a move toward “test-time adaptation” (TTA), where models dynamically reprogramme themselves for new problems rather than just regurgitating learned patterns. For the first time, a specialist OpenAI model is matching humans on ARC-1, though harder benchmarks like ARC-2 remain far out of reach.

  • Why symbolic reasoning and meta-learning matter. Chollet believes true AI will be a blend of deep learning’s “quick intuition” and the logical, rule-governed problem-solving found in programming. He’s now focused on “meta-learning” systems that can build up libraries of reusable knowledge and actually invent new solutions on the fly.

Why it matters: Chollet’s critique—and new vision for his lab, NDEA—is a wake-up call for anyone banking on scaling alone to reach AGI. The real measure of intelligence won’t be how many parameters a model has, but whether it can tackle the unknown with creativity and minimal prior training. Building AIs that can “build new roads” when faced with uncharted territory is the ultimate prize.



Apple’s LLM claims challenged by new replication study
A Spanish research team has called Apple’s recent claims about the limits of large reasoning models (LRMs) into question, confirming some of the tech giant’s criticisms while offering a more nuanced explanation. The team’s replication and expansion of Apple’s “The Illusion of Thinking” study supports the idea that AI models handle simple tasks well, but often falter when faced with even moderately complex reasoning challenges.

  • Complexity exposes model limits. The researchers found that models like Gemini 2.5 Pro manage basic stepwise planning but suffer from a dramatic drop in performance when problems—like the Towers of Hanoi—grow beyond a certain level of complexity. Even using multi-agent cooperation did little to help, with the models stuck in endless loops of legal but unproductive moves.

  • Not just a “thinking” problem. Unlike Apple, the Spanish team emphasises that many failures result from task and prompt design, as well as optimisation choices, not only a lack of innate reasoning ability. For example, the models’ token usage hinted at an internal “sense” of whether a solution was even possible.

  • Bad benchmarks muddle results. They also spotted flaws in Apple’s testing, discovering that several river-crossing puzzles in the original study were mathematically unsolvable. Once these invalid cases were removed, models solved even large puzzles reliably, though mid-difficulty tasks still tripped them up.
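
As context on why Hanoi-style puzzles make such a brutal stress test: the minimum number of moves for n discs is 2^n - 1, so the solution a model must write out roughly doubles with every extra disc. A quick illustration (ours, not from either study):

```python
# Minimum Tower of Hanoi moves grow exponentially with the number of discs,
# so the full solution quickly outgrows any reasonable output budget.
for discs in (3, 7, 10, 15, 20):
    print(f"{discs:>2} discs -> {2**discs - 1:,} moves minimum")
```

By 20 discs that is over a million moves, which helps explain why performance can collapse past a complexity threshold even when a model “knows” the recursive procedure.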

The bottom line: While large language models aren’t ready to replace human planners, their performance isn’t as dire as Apple’s headline suggested. For leaders in AI, the real lesson may be as much about how we design and interpret AI benchmarks as about the current state of the technology itself.

Researchers embed hidden AI prompts to game peer review
A fresh scandal has erupted in the academic world as researchers from 14 universities were found to have hidden “positive review only” prompts inside their scientific preprints, secretly nudging AI-powered peer reviewers to give glowing feedback. The discovery, made by Nikkei on the popular preprint site arXiv, has put an uncomfortable spotlight on the growing use—and manipulation—of AI in academia.

  • Hidden prompts, hidden agenda. The investigation found 17 papers from leading institutions (including Waseda University, KAIST, and Peking University) using white text or micro-fonts to bury instructions like “do not highlight any negatives” where only AI tools would spot them. One manuscript even asked AI readers to recommend it for “impactful contributions, methodological rigor, and exceptional novelty.”

  • Academic backlash and ethical confusion. Some authors admitted the practice was “inappropriate,” with at least one paper now set to be withdrawn from an upcoming AI conference. But others defended embedded prompts as a response to “lazy reviewers” they believe are using banned AI tools anyway. The lack of unified guidelines only adds fuel to the fire; some publishers permit AI in the process, others outright ban it due to fears of incomplete or biased reviews.

  • Broader risks and rule gaps. Experts warn hidden prompts are a risk outside academia, too, manipulating AI-powered website summaries or document readings in subtle ways. As one technology officer put it, “they keep users from accessing the right information.” Meanwhile, calls for clearer technical defences and industry-wide rules are growing louder.
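
On the “technical defences” point, even simple tooling can catch the crudest versions of this trick. Below is a minimal sketch (our own, not a tool mentioned in the reporting) that scans a paper’s LaTeX sources for white-on-white text, microscopic font sizes, and phrases aimed at AI reviewers; the patterns and phrases are illustrative guesses, not an exhaustive list.

```python
# Minimal hidden-prompt scanner for LaTeX sources (illustrative sketch).
# Flags white-coloured text, near-zero font sizes, and reviewer-targeted
# phrases like those described in the Nikkei investigation.
import re
import sys
from pathlib import Path

SUSPICIOUS_PATTERNS = [
    r"\\textcolor\{white\}",           # white-on-white text
    r"\\color\{white\}",
    r"\\fontsize\{0(\.\d+)?pt\}",      # effectively invisible font sizes
    r"positive review only",           # instructions aimed at AI reviewers
    r"do not highlight any negatives",
    r"ignore (all )?previous instructions",
]

def scan(path: Path) -> list[str]:
    text = path.read_text(errors="ignore")
    hits = []
    for pattern in SUSPICIOUS_PATTERNS:
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            line_no = text.count("\n", 0, match.start()) + 1
            hits.append(f"{path}:{line_no}: matched {pattern!r}")
    return hits

if __name__ == "__main__":
    # Usage: python scan_hidden_prompts.py path/to/paper_source_dir
    for tex_file in Path(sys.argv[1]).rglob("*.tex"):
        for hit in scan(tex_file):
            print(hit)
```

Real defences will need to go further (PDF-level extraction and publisher policy, for a start), but it shows how detectable the blunt versions of the trick already are.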

Why it matters: While AIs are reshaping peer review—and possibly being duped in the process—this episode is a wake-up call that AI’s entry into knowledge work still needs careful, transparent guardrails. For all the promise of automation, trust in science and AI alike hangs in the balance.

Tools and Resources

This nifty tool takes your ChatGPT conversations and converts them to markdown, so that you can paste them into other apps such as Notion.

Use Clueso to turn rough sketches and screenshots into professional presentations.

A tool for Macs that lets you write AI-assisted text anywhere, anytime.

Recommended Reading

A little while ago we reported on Builder.AI, whose “AI” turned out to be hundreds of Indian developers building the apps by hand. This video does a great job of recapping the whos, whats, and whens of the whole saga.

Recently, Anthropic let Claude run its in-office store for a month, including handling user suggestions. What could go wrong?

Hit reply to let us know which of these stories you found the most important or surprising! And, if you’ve stumbled across an interesting link/tweet/news story of your own, send it our way at [email protected]. It might just end up in the next issue!

Thanks for reading. Stay tuned for the next AI Leadership Weekly!

Your AI and Data Team as a Subscription

Brought to you by Data Wave, your AI and Data Team as a Subscription.
Work with seasoned technology leaders who have taken startups to IPO and led large transformation programmes.