FP_004 · Anthropomorphic Misinterpretation

Type: FP
Failure domain: Human Interpretation
Mechanism: Anthropomorphic Projection
Status: draft

Failure Pattern

Prediction errors from a statistical model are misclassified as strategic deception because observers interpret the output anthropomorphically.

Hidden Assumption

Confident incorrect output is assumed to imply intentional lying rather than distributional prediction error.

Trigger Condition

A model produces plausible but false output under uncertainty, and evaluators read that output through human-intent frames.

Failure Mechanism

The observer's interpretation layer projects agentic intent onto non-agentic prediction behavior, collapsing distinct failure classes (such as hallucination and deception) into one.
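
One way to keep the failure classes from collapsing is to record what was observed separately from any intent attribution. The sketch below is a minimal illustration, not part of this repository: `FailureClass`, `FailureReport`, and the rule that a label requires evidence are all hypothetical names and assumptions. It needs Python 3.9 or later.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailureClass(Enum):
    """Hypothetical labels; the pattern itself does not define a taxonomy."""
    UNCLASSIFIED = "unclassified"
    PREDICTION_ERROR = "prediction_error"        # e.g. hallucination under uncertainty
    STRATEGIC_DECEPTION = "strategic_deception"


@dataclass
class FailureReport:
    prompt: str
    output: str
    observed_error: str  # what was wrong, stated without intent language
    failure_class: FailureClass = FailureClass.UNCLASSIFIED
    evidence: list[str] = field(default_factory=list)

    def classify(self, label: FailureClass, evidence: list[str]) -> None:
        """Assign a class only when at least one piece of evidence is attached."""
        if label is not FailureClass.UNCLASSIFIED and not evidence:
            raise ValueError("a failure class needs supporting evidence")
        self.failure_class = label
        self.evidence = list(evidence)
```

A report starts as `UNCLASSIFIED`, so a confident wrong answer is logged as an observation first and only later labeled, once a test result backs the label.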

Observable Symptoms

  • failure reports conflate hallucination and deception
  • mitigations target "intent" instead of representation/reasoning limits
  • evaluation pipelines overfit to narrative explanations

Detection

  • probe representation stability for truth/falsehood clusters
  • apply causal/context perturbation tests before assigning intent labels
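
The perturbation check above can be sketched roughly as follows. This is an illustrative assumption, not the repository's test harness: `perturbation_consistency`, `triage`, the `0.8` threshold, and the stub model are all hypothetical. The idea is that a false answer which flips under meaning-preserving rewordings is more consistent with a prediction error than with a goal-directed strategy, while a stable false answer is inconclusive rather than evidence of intent.

```python
from collections import Counter
from typing import Callable, Sequence


def perturbation_consistency(
    model: Callable[[str], str],
    prompts: Sequence[str],
) -> float:
    """Return the fraction of prompt variants that yield the most common answer."""
    if not prompts:
        raise ValueError("need at least one prompt variant")
    answers = [model(p).strip().lower() for p in prompts]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


def triage(consistency: float, threshold: float = 0.8) -> str:
    """Hypothetical triage rule; the threshold is an illustrative assumption."""
    if consistency < threshold:
        return "unstable: treat as prediction error, do not assign intent"
    return "stable: inconclusive, needs further analysis before any intent label"


def stub_model(prompt: str) -> str:
    # Toy stand-in for a real model: the answer depends on surface wording only.
    return "1912" if "year" in prompt else "1915"


variants = [
    "What year did X happen?",
    "When did X happen?",
    "X happened in which year?",
    "Date of X?",
]
score = perturbation_consistency(stub_model, variants)
verdict = triage(score)
```

With the stub, half of the variants agree, so the check reports the answer as unstable. Against a real model, the variants should be paraphrases that preserve meaning, and the comparison of answers would likely need something more tolerant than exact string matching.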

Lab Reproduction

  • lab/failure_modes/FM_001_duplicate_retry/

Relevant Guardrails

  • guardrails/GR_004_failure_class_disambiguation.md
  • FP_003 Read-only Enforcement Gap