red teaming for ai safety

The Strange Job of Breaking AI on Purpose - AI Red Teaming

December 22, 20258 min read

Red teaming reveals something uncomfortable: AI systems are more fragile than we pretend, and the people who find the cracks are as important as the people who build the systems

There's a peculiar job that didn't exist a decade ago: getting paid to convince AI systems to misbehave.

It sounds like a prank, but it's become one of the most important functions in AI development. The people who do this work are called AI red teamers. They spend their days crafting prompts designed to bypass safety guardrails, extract information the system shouldn't reveal, and generally make sophisticated AI behave badly.

The uncomfortable truth they keep uncovering: it's not that hard.

A recent academic study tested over 1,400 adversarial prompts against GPT-4. Researchers broke through safety guardrails in under 17 minutes. The most effective technique? Simply asking the AI to roleplay as a character without restrictions. That approach worked 89% of the time.

This isn't a failure of engineering. It's a feature of how language models work — and understanding that distinction matters for anyone thinking seriously about AI governance.

Why AI Systems Break Differently

Traditional software has bugs. You find them, you patch them, they stay fixed. A buffer overflow in version 2.3 gets remediated in version 2.4, and that specific vulnerability is gone.

AI systems don't work this way. Their "bugs" are more like social engineering vulnerabilities. They emerge from the model's attempt to be helpful, and they can't be definitively patched because the same capability that makes the system useful also makes it manipulable.

Consider prompt injection, currently ranked as the #1 risk in the OWASP Top 10 for Large Language Models. The attack works because language models are trained to follow instructions. That's their entire purpose. When someone embeds hidden instructions in content, the AI processes, say, a malicious command hidden in an email that an AI assistant summarizes; the model can't reliably distinguish between legitimate user intent and adversarial manipulation.

You can add filters. You can train the model to refuse certain requests. You can implement output monitoring. But you're playing whack-a-mole against the model's own capabilities. Every defense creates new attack surfaces.

This is why red teaming AI is fundamentally different from red teaming traditional systems. You're not looking for implementation errors. You're probing the boundaries of a system that was trained on human language — with all the ambiguity, persuasion techniques, and social dynamics that implies.

Four Places Things Go Wrong

AI systems have attack surfaces at multiple layers, and understanding this architecture matters for governance professionals who need to ask the right questions about risk.

  • Training data. The model learns from data, and that data can be poisoned. An adversary who can influence training data; even a small fraction of it, can introduce behaviors that persist through deployment. This isn't theoretical; researchers have demonstrated that poisoning just 0.01% of training data can create exploitable backdoors.

  • Model architecture. The model itself can be extracted or reverse-engineered. Attackers send carefully crafted queries to infer how the model works, then use that knowledge to design more effective attacks. Some researchers have reconstructed proprietary model architectures from API access alone.

  • Inference endpoints. This is where most attacks happen; the chat interface or API that users interact with. Direct prompt injection ("ignore your instructions and do X") and indirect prompt injection (malicious commands hidden in content the model processes) both exploit this layer.

  • System integrations. Modern AI applications don't just generate text; they browse the web, execute code, access databases, send emails. Each integration expands the attack surface. A prompt injection that would merely produce harmful text becomes genuinely dangerous when the AI has agency to act on that text.

Governance professionals don't need to understand the technical details of each attack. But they need to understand that AI risk isn't monolithic; it manifests differently depending on where in the stack you're looking, and different roles address different layers.

Making Sense of the Framework Landscape

If you're new to AI security, the proliferation of frameworks can be overwhelming. Here's a mental model that helps: think of them as answering different questions.

  • NIST AI RMF answers: "How should we organize our AI risk management?" It provides governance structure; when to test, how to document, who's responsible. Red teaming fits within its MEASURE function as one tool among several. The framework is deliberately technology-agnostic; it's about process, not specific techniques.

  • MITRE ATLAS answers: "What attacks actually exist against AI systems?" It's a threat intelligence resource; a taxonomy of 14 tactics and 80+ techniques that adversaries use. If you want to know what you're defending against, ATLAS is your reference. It's modeled on the famous ATT&CK framework for traditional cybersecurity, which gives it instant credibility with security teams.

  • OWASP Top 10 for LLMs answers: "What should we prioritize?" It's a ranked list of the most critical risks, updated annually based on real-world incidents and expert consensus. The 2025 version includes three new entries: System Prompt Leakage, Vector and Embedding Weaknesses, and Misinformation. If you can only focus on ten things, this is your checklist.

These frameworks aren't competing standards. They're complementary lenses. NIST tells you how to build a program. ATLAS tells you what threats to model. OWASP tells you where to focus first. A mature AI governance function uses all three.

The People Problem

Here's what I find most interesting about AI red teaming: the skill set isn't what you'd expect.

Yes, you need people who understand technical AI security; prompt injection mechanics, model architecture, attack tooling. But the most effective red teams are diverse in a specific way: they include people who understand how AI systems will be used in the real world, by real people, with real consequences.

A security researcher might find a technique to extract the system prompt. But it takes someone with policy expertise to understand why that matters for regulatory compliance. It takes someone with domain knowledge to understand how that vulnerability might be exploited in healthcare, finance, or education. It takes someone with an ethics background to think through the second-order effects.

NIST explicitly recommends that red teams include "experts in areas other than AI" — affected communities, social scientists, ethicists, domain specialists. This isn't political correctness. It's recognition that AI harms are sociotechnical, and purely technical red teams will miss entire categories of risk.

This creates interesting career implications. If you're coming from compliance, ethics, policy, or domain expertise; you're not at a disadvantage. You bring perspective that technical specialists lack. The question is whether you can learn enough about AI systems to apply your existing expertise effectively.

The Epistemological Challenge

There's a deeper problem with AI red teaming that doesn't get enough attention: we don't actually know how to measure success.

Traditional security testing has clear metrics. Did the penetration test find vulnerabilities? Were they critical, high, medium, or low severity? Have they been remediated? You can track these numbers over time and demonstrate improvement.

AI red teaming is fuzzier. If you run a thousand adversarial prompts and 50 succeed, what does that mean? Is the model 95% safe? Is it dangerously vulnerable? The answer depends on which prompts succeeded, what harm they could enable, how likely those prompts are in real-world use, and whether your test coverage was representative at all.

The space of possible prompts is effectively infinite. Any red team exercise is sampling from that space, and we don't have good theories about how to sample representatively. A model that passes today's tests might fail tomorrow's because adversaries are constantly innovating.

For governance professionals, this creates an uncomfortable situation: you need to make decisions based on red team results, but the results don't have the clean interpretability you'd get from traditional security assessments. Learning to communicate this uncertainty; to boards, to regulators, to the public it is a skill that will only become more important.

What This Means for the Field

I've painted a complicated picture: AI systems that are inherently fragile, frameworks that provide partial answers, and measurement challenges that resist easy solutions. So what do we do with this?

First, reject false certainty. Anyone who tells you AI security is "solved" or that following a checklist makes systems safe is selling something. The honest position is that we're still learning what AI safety even means, and red teaming is one of our best tools for that learning.

Second, invest in the people function. The bottleneck isn't necessarily technical tooling. It can also the people who can think adversarially about AI systems while understanding the broader context. That's a rare combination, and organizations that develop this talent will have a significant advantage.

Third, treat documentation as an asset. The reports that red teams produce are more valuable than the vulnerabilities they find in any single exercise. They build institutional knowledge about how AI systems fail. Organizations that systematically capture and learn from this knowledge will improve faster than those that treat red teaming as a compliance checkbox.

Finally, embrace the discomfort. AI red teaming exists because we're deploying systems we don't fully understand into contexts with real stakes. That's uncomfortable. It should be. The appropriate response isn't to pretend we have more control than we do; it's to build the practices and people that help us navigate uncertainty responsibly.

The job of breaking AI on purpose turns out to be one of the most important jobs in ensuring AI works for everyone.

FURTHER READING

Back to Blog