Anthropic introduces Natural Language Autoencoders, an unsupervised method for explaining LLM activations and improving AI model interpretability, auditing, and safety analysis.
Natural Language Autoencoders (NLAs) generate unsupervised natural language explanations of LLM activations, helping researchers interpret model internals, detect safety-relevant behaviors, and improv...





