AI Writing Tools Have an Alignment Problem

Joshua R. Gutierrez · May 8, 2026 · 6 min read

AI alignment, specification gaming, optimization targets, AI safety, continual learning, Goodhart's law

Most production AI failures right now look like the system did exactly what it was asked to do.

I spent a session this week rebuilding one of our own AI writing tools, and the post-mortem connected back to a research literature I keep coming back to: specification gaming, reward hacking, Goodhart's law. The model behaves correctly with respect to its formal objective. The user gets something semantically wrong.

This pattern has been studied for years in AI safety. Krakovna and colleagues have a long catalog of specification gaming examples in reinforcement learning. The literature on outer and inner alignment tracks the same thing from different angles. The unifying observation: what isn't explicitly preserved by the optimization target doesn't survive the optimization process.

Now this is showing up in production AI writing tools. That is worth noting, because the literature was largely about reinforcement learning. Production writing tools are mostly LLM-driven. The mechanism is different. The failure mode is the same.

The mechanism

Consider an AI adapter that takes a long-form post and produces a short version for a specific platform. The system prompt says something like:

> Rewrite this for Twitter.

That instruction grants the model implicit permission to summarize, restructure, merge claims, drop specifics, abstract upward, and replace evidence with implication. None of these are explicitly forbidden. None are explicitly required to be preserved. The model picks an output that satisfies the formal objective ("a Twitter-shaped version of this post") and is free to drop anything not encoded in that objective.

If the user wanted the specific numbers preserved, the URL preserved, the entity references preserved, the model had no obligation to keep them. The user wanted preservation. The system was optimizing for transformation. Both can be true at the same time.
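A minimal sketch of that gap, in Python. The 280-character limit, the sample strings, and both function names are assumptions made up for illustration; they do not come from any real tool. The point is only that a check on the formal objective can pass while every number the user cared about is gone.

```python
import re

TWEET_LIMIT = 280  # assumed platform limit, for illustration only


def satisfies_objective(output: str) -> bool:
    """The only thing 'Rewrite this for Twitter' formally encodes: the shape."""
    return 0 < len(output) <= TWEET_LIMIT


def numbers_in(text: str) -> set[str]:
    """Numeric tokens such as '37%' or '1,200' (a deliberately crude pattern)."""
    return set(re.findall(r"\d+(?:[.,]\d+)*%?", text))


source = "Our onboarding redesign cut churn by 37% across 1,200 accounts in 90 days."
output = "We redesigned onboarding and churn dropped dramatically. Here's what we learned."

objective_met = satisfies_objective(output)       # the model did its job
dropped = numbers_in(source) - numbers_in(output)  # the user lost these
```

Here `objective_met` is true and `dropped` contains all three figures from the source. Nothing in the objective distinguishes this output from one that kept them.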

This is not a prompt engineering problem. It is a problem of misaligned objectives.

Connection to continual learning

Our research at Axion Deep Labs studies what neural networks forget when they are trained on new tasks. The PERSIST line of work asks: can we predict which models will retain knowledge of an old task while learning a new one? The dominant answer in the continual learning literature is "selectively constrain the parameters that mattered for the old task." Elastic Weight Consolidation, synaptic intelligence, and related methods all encode that intuition formally.
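For reference, Elastic Weight Consolidation is usually written as a quadratic penalty added to the new task's loss (Kirkpatrick et al., 2017). The notation below is the standard one from that paper, not notation taken from the PERSIST work: task A is the old task, task B the new one, \theta^{*}_{A} the parameters learned on A, F_i the diagonal Fisher information for parameter i, and \lambda the strength of the constraint.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{B}(\theta) \;+\; \sum_{i} \frac{\lambda}{2}\, F_{i}\, \bigl(\theta_{i} - \theta^{*}_{A,i}\bigr)^{2}
```

Parameters with large F_i mattered for the old task, so moving them is expensive; parameters with small F_i are free to change. The protection lives in the objective itself.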

The AI writing tool problem is structurally similar. The "old task" is the user's original intent. The "new task" is the platform-specific output. The model has no formal constraint to preserve the structure of the original. So it does not.

The fix in continual learning research is the same fix the AI writing problem needs: the protected information must be encoded in the optimization process, not in the prose of a prompt.

Why prompts cannot carry this load

A prompt saying "please preserve the numbers" is a request, not a constraint. LLMs are not deterministic semantic compressors. They abstract naturally. They satisfice. They optimize for plausibility.

If preservation is required, it has to be enforced after the fact. Extract the protected facts from the input deterministically. Run the model. Validate that the protected facts survived. Reject when they did not. Fall back to a deterministic transformation that preserves them by construction.
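The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation of any particular product: "protected facts" is narrowed to URLs and numbers, `model` is a hypothetical stand-in for whatever LLM call the tool makes, and every function name and the 280-character default are invented for the example.

```python
import re
from typing import Callable

URL_RE = re.compile(r"https?://[^\s)>\]]+")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*%?")


def extract_protected(text: str) -> list[str]:
    """Deterministically pull the facts that must survive: URLs and numbers."""
    urls = [u.rstrip(".,;:!?") for u in URL_RE.findall(text)]
    # Remove URLs first so digits inside a URL are not counted as numbers.
    numbers = NUM_RE.findall(URL_RE.sub(" ", text))
    return list(dict.fromkeys(urls + numbers))  # de-duplicate, keep order


def missing_facts(facts: list[str], output: str) -> list[str]:
    """Post-condition: which protected facts did not survive the rewrite?"""
    survived = set(extract_protected(output))
    return [f for f in facts if f not in survived]


def deterministic_fallback(source: str, facts: list[str], limit: int) -> str:
    """Truncate the source, then append every protected fact verbatim."""
    if len(source) <= limit:
        return source  # short enough already, so nothing can be lost
    tail = " ".join(facts)
    budget = limit - len(tail) - 1
    if budget < 0:
        raise ValueError("protected facts alone exceed the length limit")
    head = source[:budget].rsplit(" ", 1)[0]  # drop a possibly cut-off word
    return f"{head} {tail}".strip()


def rewrite_with_guardrails(
    source: str, model: Callable[[str], str], limit: int = 280
) -> str:
    facts = extract_protected(source)   # 1. extract deterministically
    candidate = model(source)           # 2. run the model
    # 3. validate, 4. reject on failure, 5. fall back
    if not missing_facts(facts, candidate) and len(candidate) <= limit:
        return candidate
    return deterministic_fallback(source, facts, limit)
```

A model output that keeps the facts passes straight through; one that drops them is rejected and replaced by the fallback, which is dull but cannot lose a protected fact. A real system would likely protect more than URLs and numbers (named entities, quotes, dates), and might retry the model before falling back.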

That architecture replaces an unverifiable prompt with a verifiable post-condition. It is not novel. The continual learning literature has been doing the equivalent for years.

What is actually new

What is new is that this is a production problem in writing tools used by founders, marketers, and small businesses. The alignment literature pointed at it. The deployment context caught up.

The next few years of AI writing tools will be defined by which products correctly identify what their users want preserved and encode that in the optimization process. Not in a prompt. In the validator.

The products that get this right will look less like "AI rewrites your post" and more like constrained editorial systems with deterministic guardrails and an LLM embedded inside. The products that do not will continue to look correct and feel wrong.

Which products will do that is the open question worth thinking about. The literature has been there for a while. The deployment context is the new thing.
