Tracing the thoughts of a large language model \ Anthropic

It turns out that, in Claude, refusal to answer is the default behavior: we find a circuit that is “on” by default and that causes the model to state that it has insufficient information to answer any given question. However, when the model is asked about something it knows well—say, the basketball player Michael Jordan—a competing feature representing “known entities” activates and inhibits this default circuit (see also this recent paper for related findings). This allows Claude to answer the question whe

Source: Tracing the thoughts of a large language model \ Anthropic

Tracing the thoughts of a large language model \ Anthropic

Related