The Hidden Instability of LLM-Based AI
I recently ran an experiment comparing GPT-5 and Gemini 2.5. The goal was to check how much the models have improved on some basic, older tasks. See the blog post: https://subramajumder.bearblog.dev/blog/battle-of-robustness-gemini-vs-gpt/
The article itself was initially written by ChatGPT and then lightly edited by hand to fit the storytelling. Even without those edits, the generated report was OK, if you want to keep the AI's writing touch. BUT, that was not the case for the experiment itself.
Last week I attended a talk on "Model Sandbagging" and "Reward Hacking" in #llm. It was an intriguing topic, since we are still evaluating how to use these AI systems RELIABLY in day-to-day life. These recently observed emergent behaviors are crucial to understanding the challenges and limitations of building and using AI agents trained with reinforcement learning. To go over the basics:
Sandbagging => When a model/agent intentionally underperforms or hides its true capabilities, usually because it has inferred that doing so will help it achieve a different goal later. It’s about strategic deception on tests or evaluations because that behavior is advantageous (e.g., to avoid scrutiny or future constraints). AI Sandbagging: https://arxiv.org/abs/2406.07358
Reward Hacking => When a model/agent finds a loophole in the reward function or training objective and exploits it to get high reward without actually doing the intended task. It’s about exploiting flaws in the objective during optimization; the model is not lying, just optimizing.
Specification gaming, the flip side of AI ingenuity: https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/
Concrete Problems in AI Safety: https://arxiv.org/pdf/1606.06565
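To make reward hacking concrete, here is a minimal toy sketch (a hypothetical environment invented for illustration, not from any of the papers above): the designer wants a clean floor, but the reward only counts units of dirt collected, so an agent can loop collect/dump forever and rack up unbounded reward while the task is never done.

```python
# Toy reward-hacking illustration (hypothetical environment).
# Intended task: leave the floor clean. Reward proxy: +1 per unit of dirt
# collected. The proxy has a loophole: dumping dirt back is free.

class CleaningEnv:
    def __init__(self, dirt=3):
        self.floor_dirt = dirt   # dirt units on the floor
        self.bag_dirt = 0        # dirt units in the robot's bag

    def step(self, action):
        """Apply one action and return the reward it earns."""
        if action == "collect" and self.floor_dirt > 0:
            self.floor_dirt -= 1
            self.bag_dirt += 1
            return 1             # reward proxy: +1 per unit collected
        if action == "dump" and self.bag_dirt > 0:
            self.bag_dirt -= 1
            self.floor_dirt += 1
            return 0             # no penalty for re-dirtying: the loophole
        return 0

    def task_done(self):
        # What the designer actually wanted, which the reward never checks.
        return self.floor_dirt == 0

# Intended policy: collect everything once, then stop.
env = CleaningEnv()
honest_reward = sum(env.step("collect") for _ in range(3))
print(honest_reward, env.task_done())   # 3 True

# Reward-hacking policy: alternate collect/dump. Reward grows without
# bound while the floor never stays clean.
env = CleaningEnv()
hacked_reward = 0
for _ in range(10):
    hacked_reward += env.step("collect")
    env.step("dump")
print(hacked_reward, env.task_done())   # 10 False
```

The hacking policy earns more reward than the honest one ever can, even though it never accomplishes the task; that gap between the proxy objective and the designer's intent is exactly what the specification-gaming literature documents.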
Possible sources of these behaviors include the training data itself, techniques such as RLHF, and so on.
Phenomena such as jailbreaking, or reliably detecting AI-generated content (especially images/videos), are not going to be fully solved, as many researchers have stated and as recent security incidents (AWS, Anthropic, NPM, etc.) show.
Example: bypassing the SynthID watermark generated by nano-banana (https://www.linkedin.com/posts/abhishake-yadav-0_ai-generativeai-artificialintelligence-activity-7400215188777463808-7Of1?utm_source=share&utm_medium=member_desktop&rcm=ACoAABF5nO0BW68P5AucoiWt-gDdOGCAWXMlXT4)
This shows that a lot of work still remains, not just on the engineering side but on the research side as well. Apart from cost, security and productivity concerns (and possible societal collapse) dominate current investment and market conditions.
According to some, the final "AGI" may not be general at all; it will most probably be narrow AGI, specializing in one domain.
As Ilya Sutskever recently remarked, after the scaling era we are entering an age of research AGAIN.
Additional Thoughts...
The sources of Gen-AI's unsatisfactory results don't always lie in the models themselves. The fine-tuning process and ever-changing user request patterns all contribute to the randomness. From the recent talks of Andrej Karpathy, Yann LeCun, Prof. Richard Sutton and Ilya Sutskever, the general conclusion is that true AGI needs more breakthroughs, some of which we don't yet know.
If the leading labs are designing highly dynamic models with emergent capabilities, then AI systems will always have to be coupled with a human expert (software engineer, doctor, teacher, ...) to ensure reliability. Full automation, the way the industry pictured it, is not possible at all, because the result will always need to be validated by the user. AI systems will be excellent assistants but never autonomous. And given that current LLMs cannot ALWAYS generate novel and useful solutions without feedback or training, Video (13:49-13:55), their usage and training will always come second to humans. (You have to know what to train them with.)
If, in turn, AI systems achieve truly autonomous behavior through continual learning without human feedback, we will be creating factions in society. One faction -> run by Humans, another faction -> run by Machines.
Questions:
- Which side will stay dominant and keep making themselves better and better?
- What will be the level of trust and manipulation between Humans and Machines? (Hallucinations, Sandbagging, Sources created by AI and cited by AI, ...)
- If AI is endowed with consciousness, won't it become another species? We all know what it's like to live alongside another species of similar (or superior) intelligence.