Anthropic introduces Natural Language Autoencoders, an unsupervised method for explaining LLM activations and improving AI model interpretability, auditing, and safety analysis.