  1. The Self-Hating Attention Head: A Deep Dive in

    Jul 4, 2025 · TL;DR: gpt2-small's head L1H5 directs attention to semantically similar tokens and actively suppresses self-attention. The head computes attention purely based on token identity, independent …
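
    A minimal sketch of how this kind of claim can be checked, assuming the TransformerLens library (gpt2-small loads under the name "gpt2"; the prompt is made up for illustration):

```python
import torch
from transformer_lens import HookedTransformer

# gpt2-small is loaded under the name "gpt2" in TransformerLens.
model = HookedTransformer.from_pretrained("gpt2")

# A prompt with repeated tokens, so the head has semantically identical
# candidates to attend to (illustrative only).
tokens = model.to_tokens("The cat sat on the mat because the cat was tired.")
_, cache = model.run_with_cache(tokens)

# Attention pattern for layer 1: [batch, head, query_pos, key_pos].
pattern = cache["pattern", 1][0, 5]  # head L1H5

# Self-attention sits on the diagonal; the post's claim predicts these
# weights should be suppressed relative to other destinations.
print("mean self-attention weight:", pattern.diagonal().mean().item())
```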

  2. We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

    Mar 5, 2024 · Using our SAEs, we inspect the roles of every attention head in GPT-2 small, discovering a wide range of previously unidentified behaviors. We manually examined every one of the 144 …

  3. Attention SAEs Scale to GPT-2 Small — AI Alignment Forum

    Feb 3, 2024 · We feel pretty convinced that attention SAEs extract interpretable features, and allow for interesting exploration of what attention layers have learned. Now we focus on leveraging our SAEs …

  4. An Extremely Opinionated Annotated List of My Favourite Mechanistic ...

    One of my favourite phenomena is when someone puts out an exciting paper that gets a lot of attention yet has some subtle flaws, and follow-up work identifies and clarifies these.

  5. Attention Output SAEs Improve Circuit Analysis - Alignment Forum

    Jun 21, 2024 · Rob designed and built the tool to discover attention feature circuits on arbitrary prompts with recursive direct feature attribution (DFA), and performed automated circuit scans for examples of attention-to-attention …
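
    The post's recursive tooling is not reproduced here, but a single (non-recursive) DFA step can be sketched under stated assumptions: an attention-output SAE feature's pre-activation is linear in the layer's attention output, which is itself a sum of per-source-position terms, so each source token's contribution can be read off exactly. Function and argument names are hypothetical; LayerNorm and biases are ignored:

```python
import torch

def dfa_per_source_position(attn, v, W_O, w_enc_feature):
    # attn: [head, q_pos, k_pos] attention weights
    # v:    [head, k_pos, d_head] value vectors
    # W_O:  [head, d_head, d_model] per-head output projection
    # w_enc_feature: [d_model] encoder row of one SAE feature
    #
    # The attention output at query q decomposes as
    #   z_q = sum_{h,k} attn[h,q,k] * (v[h,k] @ W_O[h]),
    # so the feature pre-activation is linear in each (head, source
    # position) term and can be attributed exactly.
    out_dirs = torch.einsum("hkd,hde->hke", v, W_O)  # [head, k, d_model]
    proj = out_dirs @ w_enc_feature                  # [head, k]
    contrib = attn * proj.unsqueeze(1)               # [head, q, k]
    return contrib  # each source token's contribution, per head and query
```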

  6. Sparse Autoencoders Work on Attention Layer Outputs - Alignment Forum

    We replicate Anthropic's MLP Sparse Autoencoder (SAE) paper on attention outputs and it works well: the SAEs learn sparse, interpretable features, which gives us insight into what attention layers learn.
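
    A minimal sketch of the kind of SAE being described, in PyTorch; the tied pre-encoder bias and the L1 coefficient are illustrative assumptions, not the post's exact training setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete ReLU autoencoder with an L1 sparsity penalty,
    trained on attention-layer outputs instead of MLP outputs."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        # Subtract the decoder bias before encoding (an assumption,
        # following Anthropic's setup), then reconstruct.
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus L1 on feature activations.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(-1).mean()
```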

  7. Polysemantic Attention Head in a 4-Layer Transformer - AI Alignment Forum

    Nov 9, 2023 · This post provides evidence that attention heads play a complex role within a model’s computation, and that simplifying an attention head to a single behaviour can be misleading.

  8. Thought Anchors: Which LLM Reasoning Steps Matter? - AI Alignment Forum

    Jul 2, 2025 · We identify "receiver heads": attention heads that tend to pinpoint and narrow attention to a small set of sentences. These heads reveal sentences receiving disproportionate attention from all …
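
    One hedged way to operationalize "attention received per sentence", assuming a per-head attention pattern and a token-to-sentence mapping (this scoring is a simplification for illustration, not the paper's exact metric):

```python
import torch

def sentence_attention_received(pattern, sent_ids):
    # pattern:  [q_pos, k_pos] attention weights for one head
    # sent_ids: [pos] sentence index of each token
    # Average total attention each sentence's tokens receive across
    # query positions; heads that concentrate this mass on a few
    # sentences are "receiver head" candidates.
    n_sents = int(sent_ids.max()) + 1
    received = torch.zeros(n_sents)
    for s in range(n_sents):
        cols = pattern[:, sent_ids == s]   # attention into sentence s
        received[s] = cols.sum(dim=1).mean()
    return received
```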

  9. Attention Output SAEs - AI Alignment Forum

    Jun 21, 2024 · We perform a qualitative study of the features computed by attention layers, and find multiple feature families: for example, long-range context, short-range context, and induction features.
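
    The induction-feature family relates to the classic induction pattern: attending to the token that followed a previous occurrence of the current token. The standard repeated-random-tokens test for that pattern can be sketched as follows, assuming TransformerLens:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Repeated random tokens: on the second half, an induction head attends
# from each position back to the token after the previous occurrence of
# the current token, i.e. key_pos = query_pos - (seq_len - 1).
seq_len = 50
rand = torch.randint(1000, 10000, (1, seq_len))
bos = torch.tensor([[model.tokenizer.bos_token_id]])
tokens = torch.cat([bos, rand, rand], dim=1)
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]  # [head, q_pos, k_pos]
    stripe = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    print(layer, stripe.mean(dim=-1))     # per-head induction score
```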

  10. The positional embedding matrix and previous-token heads: how do …

    Aug 9, 2023 · Looking at the attention patterns of L4H11 more carefully, we can see right away that there is a qualitative difference between how it implements previous-token attention and how the …
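
    A sketch of the kind of check that motivates this, assuming the head in question is gpt2-small's L4H11 and ignoring LayerNorm: score positions against each other through the head's QK circuit using only the positional embedding matrix, then look at the i → i-1 stripe.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer, head = 4, 11  # L4H11, assuming the gpt2-small head discussed

# Query/key every position against every other through the head's QK
# circuit using only positional embeddings (token content ignored).
W_pos = model.W_pos                      # [n_ctx, d_model]
q = W_pos @ model.W_Q[layer, head]       # [n_ctx, d_head]
k = W_pos @ model.W_K[layer, head]       # [n_ctx, d_head]
scores = (q @ k.T) / model.cfg.d_head ** 0.5

# For previous-token attention, scores[i, i-1] should dominate its row.
prev_stripe = scores.diagonal(offset=-1)
print("mean i -> i-1 score:", prev_stripe.mean().item())
```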