Why Recall Rate Matters for Large Models

I’ve read a number of system prompts, and most are verbose rather than concise. Some mainly teach the model how to do things.

Additionally, I noticed that Roo Code has a switch to re-send the system prompt with every request, presumably to reinforce role settings and instruction following. The trade-off is higher token consumption.

This might be because important instructions need to be repeated to increase their effective weight during computation, raising the probability that the model complies and produces the desired result. Unfortunately, such results are still only probabilistically correct.
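A minimal sketch of what such a "repeat the system prompt" switch might do (the message shape and function names here are my assumptions, not Roo Code's actual implementation): instead of sending the system prompt once, it is re-injected before every user turn, multiplying its token cost by the number of turns.

```python
# Hypothetical sketch: re-inject the system prompt before each user turn
# to reinforce role settings. Names and message format are assumptions.

def build_messages(system_prompt, turns, repeat_system=False):
    """Build a chat message list from (user, assistant) turn pairs.

    With repeat_system=True, the system prompt is re-sent before every
    user turn instead of appearing once at the start.
    """
    messages = []
    if not repeat_system:
        messages.append({"role": "system", "content": system_prompt})
    for user, assistant in turns:
        if repeat_system:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": user})
        if assistant is not None:
            messages.append({"role": "assistant", "content": assistant})
    return messages

def rough_token_count(messages):
    # Crude whitespace "tokenizer", only to illustrate the overhead.
    return sum(len(m["content"].split()) for m in messages)

turns = [("Refactor the parser.", "Done."), ("Now add tests.", None)]
once = build_messages("You are a careful reviewer.", turns)
every = build_messages("You are a careful reviewer.", turns, repeat_system=True)
# Repeating the prompt pays its token cost once per turn rather than once total.
```

The point is only the cost model: reinforcement by repetition scales linearly with conversation length, which is exactly the token consumption the switch trades away.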

Those who have used Claude models and GPT-5 High for a long time might have noticed that although GPT-5 High is very slow, its accuracy is extremely high.

Could this be related to GPT-5’s recall rate reaching 100%?

When using AGENTS.md to direct GPT-5, I found that very concise, refined language is enough to command the Codex CLI. With Claude Code, I often have to write CLAUDE.md very “wordily,” and even then Claude sometimes ignores explicitly stated precautions. Improving compliance doesn’t necessarily require repeating a requirement: varying the vocabulary (“must,” “important,” etc.), or using parentheses and Markdown bold formatting (**), can also help.
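As a purely hypothetical illustration (the file name follows Claude Code convention; the rules themselves are invented), the same requirement can be escalated in CLAUDE.md with vocabulary and formatting rather than repetition:

```markdown
<!-- CLAUDE.md (hypothetical example rules) -->
- Run the test suite before committing.
- You **must** run the test suite before committing.
- IMPORTANT: run the test suite before committing (do not skip this).
```

Each line states the same rule once; only the emphasis markers change.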

In other words, Claude models place higher demands on prompts: subtle vocabulary changes can affect the model’s performance. GPT-5 is far less demanding; as long as a concise prompt contains no logical contradictions, the Codex CLI performs well, and if there are contradictions, GPT-5 will point them out.

I’m becoming increasingly dissatisfied with collaborative development using Claude models. It’s not that it performs tasks poorly, but after being burned a few times, I can’t trust it. When Claude goes off the rails, it often changes a lot of code, and it is also very aggressive when asked to modify CLAUDE.md. As the saying goes, “many words invite errors.” How can we guarantee that a very long system prompt contains no contradictions? The review workload is too heavy, and so is the mental burden.

By comparison, GPT-5 High seems to possess genuine logic, which may be related to its high recall rate.