Anthropic introduces Natural Language Autoencoders, an unsupervised method for explaining LLM activations and improving AI model interpretability, auditing, and safety analysis.