Natural Language Autoencoders (NLAs) generate unsupervised natural language explanations of LLM activations, helping researchers interpret model internals, detect safety-relevant behaviors, and improv...