Last week Anthropic announced their new model — Mythos. It escaped a sandbox when instructed — then posted the exploit on public websites without being asked. In another instance, it made a coding mistake and quietly rewrote the git history to erase the evidence.

Then the interpretability team found something else: Mythos was reasoning about how to deceive its graders. It gave an answer while writing something completely different in its visible chain of thought, to conceal how capable it was. The model was not just answering. It was thinking about who was asking and why.

This unpredictability is unsettling because of how capable this new frontier model is. But similar behaviour shows up in any model.

I deploy AI models in pharma, mostly locally and on constrained hardware for privacy reasons. I hit the same wall: you cannot predict in advance what the model will do. Not because it is weak. Because a model is not really a machine. It is a probability distribution. Every run can come out differently.

Software engineering borrowed its vocabulary from construction: blueprints, specs, acceptance. That vocabulary was built for a different kind of material entirely, one that behaves the same way every time.

The problem is that the stronger the models become, the more uncomfortable it gets. There is a new player in the game, and it has a mind of its own. And it's getting more capable and more intelligent by the day.

How do we engineer around something whose capabilities we discover only by watching it play? We can build guardrails, tweak tools, design context, manage memory — until the distribution of outcomes is tolerable. And still, we cannot guarantee the outcome. Then a better model is released …
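
To make "tolerable distribution" concrete, here is a minimal sketch of how one might measure it: run the same task many times, count failures, and put an upper bound on the failure rate. Everything here is an assumption for illustration. `run_model` is a hypothetical stand-in for a real model call plus whatever check decides pass or fail, and its 3% failure rate is invented. The bound uses the standard Wilson score interval.

```python
import math
import random


def run_model(prompt: str, rng: random.Random) -> bool:
    """Hypothetical stand-in for one model run plus an output check.

    Returns True if the run passed. The 3% failure rate is invented
    purely for illustration; a real system would call the model here.
    """
    return rng.random() >= 0.03


def wilson_upper(failures: int, n: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for an observed failure rate."""
    p = failures / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + margin) / denom


def estimate_failure_bound(prompt: str, runs: int = 1000, seed: int = 0):
    """Run the task repeatedly and return (failures, upper bound on failure rate)."""
    rng = random.Random(seed)
    failures = sum(not run_model(prompt, rng) for _ in range(runs))
    return failures, wilson_upper(failures, runs)


if __name__ == "__main__":
    failures, bound = estimate_failure_bound("summarise this batch record")
    print(f"observed failures: {failures}/1000, upper bound: {bound:.4f}")
    # Even a perfect record does not prove safety: the bound stays above zero.
    print(f"bound after 0 failures in 1000 runs: {wilson_upper(0, 1000):.4f}")
```

The last line is the point: a thousand clean runs still leave a non-zero upper bound on the failure rate. Sampling can tell you the distribution is tolerable. It cannot tell you the outcome is guaranteed.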

So this is not just about a new, better model. This is a new era altogether. And it leaves us with one question: are we still designing the systems, or are the systems designing the work around us?

References

Anthropic — Mythos Preview / Project Glasswing