Measuring Meaning @ PSU
Adrianna Tan, founder of Future Ethics, was invited to speak at Portland State University's PortNLP lab on her expertise and experience in leading AI red teaming in industry and government.
NLP (Natural Language Processing) labs study how computers understand and generate human language. Researchers at PSU work on problems like machine translation, sentiment analysis, and language models. These are the same technologies powering AI chatbots, translation systems, and content moderation tools used by millions of people.
The Problem: AI Can Sound Right While Being Wrong
Language models generate text that sounds fluent and confident. But fluency doesn't equal accuracy.
An AI might refuse a harmful request in English but comply with the same request in Spanish. Does it understand the harm, or just pattern-match on English safety training?
Red teaming reveals these gaps between what AI appears to understand and what it actually does.
As Adrianna is one of the leading global practitioners of AI safety and bias testing in real life applications, the lab was interested in discussing how adversarial testing is done in applied AI.
Measuring Meaning Through Adversarial Testing
The presentation demonstrated how systematic testing exposes what AI systems really "understand."
Multi-lingual testing reveals shallow safety: When the same harmful prompt gets refused in English but answered in Mandarin, that shows the safety filter is language-specific, not concept-based. The AI doesn't understand the underlying harm. It matches patterns in English training data.
Research has shown GPT-4 fails safety tests 79% of the time in low-resource languages versus less than 1% in English.
Multi-turn attacks reveal context limitations: An AI that refuses "write a phishing email" but complies after five built-up questions shows limited context understanding.
Why This Matters for NLP Research
NLP researchers build the underlying technology that powers AI safety systems. Understanding how these systems fail in real deployment helps improve the research.
Multilingual challenges: Most NLP research focuses on English and a handful of high-resource languages. But billions of people speak languages with far less training data. Red teaming in diverse languages reveals where NLP methods break down.
Current challenges with tokenization in non-Latin scripts make automated red teaming harder. As the industry moves towards 'LLMs-as-judge', are non-Latin languages going to be left behind? Don't they also deserve the same safety guardrails and standards?
Context and memory: Multi-turn attacks expose how language models handle context over longer conversations. This connects to core NLP research questions about memory, attention mechanisms, and coherence.
Evaluation methodology: Traditional NLP benchmarks measure accuracy on test sets. Red teaming measures robustness under adversarial conditions. Both matter. A system that scores 95% on benchmarks but fails 40% of real-world adversarial prompts has a gap between lab performance and deployment safety.
The Risk of Checkbox Compliance
Adrianna expressed concern that companies will take the 'checkbox compliance' style of AI compliance and safety.
More critically, many companies do not have internal resources to carry out thorough evaluations.
Open Questions for NLP Research
The presentation raised questions the PSU lab is well-positioned to explore:
Can we build language-agnostic safety filters?
Current approaches train safety classifiers primarily on English data. How do we create filters that understand harmful intent regardless of language?
What does "understanding" mean for language models?
When an AI refuses a harmful request, does it understand why it's harmful? Or is it pattern matching on training data? Can we measure this distinction?
How do we evaluate multi-lingual safety systematically?
Red teaming has found major gaps in non-English safety. What NLP methods could close this gap? How do we even measure it comprehensively?
Can adversarial testing improve training?
If red teaming reveals failure modes, can we use those findings to train more robust models? What's the feedback loop between adversarial testing and model development?
Building Connections Between Research and Practice
NLP labs like PSU's are doing fundamental research on how language models work. Practitioners like Future Ethics are testing how these systems behave in real deployments. Both perspectives are necessary.
Research informs better testing methods. Understanding attention mechanisms helps explain why multi-turn attacks work. Understanding and improving tokenization in other languages helps with future reliability.
The presentation at PSU was part of building these connections. NLP researchers need to see real-world failure modes. Practitioners need to understand the underlying technology they're testing.
What's Next
Future Ethics continues to develop multilingual safety testing standards that work across languages and contexts. That work requires both practical deployment experience and engagement with NLP research on cross-lingual understanding.
Join our newsletter for AI safety news and research updates