Security systems fail. Networks partition. Services time out. The question is not whether your authorization layer will experience failures - it is what it does when they happen.
DataCrawl is fail-closed. If the authorization service is unavailable, unreachable, or returns an unexpected response, the decision is `deny`. The agent does not proceed. The action does not happen.
This is the correct default for a security product. It is also surprisingly hard to get right.
## The two failure modes
Most systems have two failure modes they need to handle: **hard failures** (exceptions, timeouts, network errors) and **soft failures** (unexpected responses, schema drift, partial results).
Hard failures are easy to reason about. Wrap the call in a try-catch, return `deny` in the catch block. Done.
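As a minimal sketch of that wrapper, assuming the service call is passed in as a zero-argument function (`fetch_or_deny` and `call_service` are illustrative names, not part of any real SDK):

```python
from typing import Callable


def fetch_or_deny(call_service: Callable[[], dict]) -> dict:
    """Run an authorization call; any exception becomes an explicit deny."""
    try:
        return call_service()
    except Exception:
        # Timeouts, connection resets, and parse errors all end up here.
        return {"decision": "deny"}
```

The success path still returns whatever the service sent, so on its own this only covers the hard-failure half of the problem.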
Soft failures are more subtle. What if the authorization service returns a 200 with a body that does not match the expected schema? What if it returns `{ "decision": null }`? What if it returns an empty object?
The wrong answer is to treat these as `allow`. The right answer is to treat any response that does not contain an explicit `allow` as a `deny`.
```ts
// Unsafe - treats unexpected responses as allow
if (response.decision === 'deny') {
  return { decision: 'deny' }
}
return { decision: 'allow' } // <- dangerous default
```

```ts
// Safe - requires explicit allow
if (response.decision === 'allow') {
  return { decision: 'allow' }
}
return { decision: 'deny' } // <- everything else is deny
```
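This distinction matters more than it looks. In practice, the second pattern catches null decisions, undefined fields, empty objects, and any future response shape that does not match the current schema. Bodies that cannot be parsed at all (an empty body, truncated JSON) typically never reach this check: the parser throws, and the request is denied through the hard-failure path.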
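The same rule can be sketched in Python, starting from the raw response body. This is an illustration under assumptions: `interpret_response` is a hypothetical helper, and only `allow` and `deny` are modeled, so a `needs_review` decision would need its own explicit branch.

```python
import json


def interpret_response(raw_body: str) -> str:
    """Return "allow" only for a well-formed, explicit allow; otherwise "deny"."""
    try:
        body = json.loads(raw_body)
    except (json.JSONDecodeError, TypeError):
        # Empty body, truncated JSON, or a non-string input.
        return "deny"
    # Anything that is not a JSON object with decision == "allow" is a deny.
    if isinstance(body, dict) and body.get("decision") == "allow":
        return "allow"
    return "deny"
```

Note that the check is an exact string match, so `"ALLOW"`, `true`, or a nested object all fall through to `deny`.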
## The circuit breaker
A naively fail-closed system has a problem: if the authorization service goes down, every agent action is denied for the duration of the outage. For most authorization calls this is acceptable. For some - a payment processing agent, a time-sensitive notification - it is not.
The first part of our answer is a circuit breaker. After a configurable number of consecutive failures (default: 5), the circuit opens. Subsequent calls return `deny` immediately without attempting to reach the authorization service. A probe request attempts recovery after 30 seconds.
The circuit breaker does not change the fail-closed behaviour. It changes the latency of the failure. Instead of waiting 5 seconds for a timeout before returning `deny`, the agent gets a `deny` in under a millisecond.
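A minimal single-threaded sketch of such a breaker, using the defaults mentioned above (5 consecutive failures, 30-second recovery window). The class and parameter names are assumptions for illustration, and a production version would need locking and per-agent state:

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, authorize: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_seconds:
                return "deny"  # open: fail fast without touching the service
            # Recovery window elapsed: let this one call through as a probe.
        try:
            decision = authorize()
        except Exception:
            self.consecutive_failures += 1
            was_open = self.opened_at is not None
            if was_open or self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
            return "deny"
        # Success closes the circuit and resets the failure count.
        self.consecutive_failures = 0
        self.opened_at = None
        return decision if decision == "allow" else "deny"
```

Every path that does not end in an explicit `allow` from the service returns `deny`, so the breaker only changes how quickly the denial arrives.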
For operations where downtime is genuinely unacceptable, we provide a per-agent `circuit_bypass` flag. This is a manual override intended for ops use only. Setting it to `true` means the agent proceeds when the circuit is open. It is never set programmatically. It is visible in the dashboard and logged to the audit trail.
## Idempotency under failure
The second hard problem is retry behaviour. If an agent sends an authorization request that times out, it retries. If the first request actually succeeded at the server but the response never arrived, the retry creates a second authorization record for the same intended action.
For most actions this is a data problem - two audit records instead of one. For some actions - sending an email, creating a CRM contact - it could mean the action happens twice.
The solution is idempotency keys. Every authorization request includes a client-generated key. The server stores the result keyed on `agentId:idempotencyKey` for 24 hours. If the same key arrives twice, the second request returns the cached decision without re-evaluating policy.
```python
result = dc.authorize(
    tool="gmail.send_email",
    payload=payload,
    idempotency_key=f"send-{message_id}",  # deterministic from content
)
```
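On the server side, the cache described above might look roughly like this. It is an in-memory sketch with hypothetical names (`IdempotencyStore`, `evaluate`); a real deployment would presumably use a shared store with an atomic set-if-absent so that two concurrent retries cannot both evaluate policy.

```python
import time
from typing import Callable, Dict, Tuple


class IdempotencyStore:
    def __init__(self, ttl_seconds: float = 24 * 60 * 60,
                 clock: Callable[[], float] = time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, decision)

    def authorize(self, agent_id: str, idempotency_key: str,
                  evaluate: Callable[[], str]) -> str:
        key = f"{agent_id}:{idempotency_key}"
        now = self.clock()
        cached = self._entries.get(key)
        if cached is not None and cached[0] > now:
            return cached[1]  # replay the stored decision, skip policy evaluation
        decision = evaluate()
        self._entries[key] = (now + self.ttl_seconds, decision)
        return decision
```

Scoping the key by agent means two agents that happen to generate the same idempotency key never see each other's decisions.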
The idempotency key should be deterministic from the content of the request, not random. A random key defeats the purpose - if the first request times out, the retry generates a new key and the problem recurs.
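When no natural identifier is available, one way to derive a deterministic key is to hash a canonical serialization of the request. This is a sketch, assuming the payload is JSON-serializable; `derive_idempotency_key` is a hypothetical helper, not an SDK function.

```python
import hashlib
import json


def derive_idempotency_key(tool: str, payload: dict) -> str:
    """Hash the tool name and payload so a retry yields the same key."""
    # sort_keys and fixed separators make the serialization independent of
    # dict ordering and whitespace.
    canonical = json.dumps(
        {"tool": tool, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The trade-off is that two intentionally identical actions also collapse to one key, which is why a stable per-action identifier such as `message_id` is the better input when one exists.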
## The audit trail
A fail-closed system is only as trustworthy as its audit trail. Every decision - allow, deny, needs_review - is written to an append-only table before the response is returned. The row includes a SHA-256 hash of its own fields. A nightly job verifies that no hash has changed.
This is not a sophisticated tamper-detection system. It is a simple one. The hash is unkeyed, so a sophisticated attacker with write access to the table could modify a row and recompute a valid hash. But the goal is not to stop a sophisticated attacker - it is to make accidental modification detectable, to satisfy audit requirements, and to give enterprise customers something concrete to point to when their security team asks how the audit log works.
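The row hash and the nightly check can be sketched as follows. The column name `row_hash` and the canonical JSON serialization are assumptions made for the example, not a description of the actual schema.

```python
import hashlib
import json


def compute_row_hash(fields: dict) -> str:
    """SHA-256 over a canonical serialization of the row's fields."""
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_audit_row(fields: dict) -> dict:
    """Attach the hash at write time, before the response is returned."""
    return {**fields, "row_hash": compute_row_hash(fields)}


def verify_audit_row(row: dict) -> bool:
    """Recompute the hash from the stored fields and compare."""
    fields = {k: v for k, v in row.items() if k != "row_hash"}
    return row.get("row_hash") == compute_row_hash(fields)
```

A verification job would run `verify_audit_row` over every stored row and flag any that return `False`.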
---
The unsexy truth about security infrastructure is that most of the interesting decisions are about failure modes. What happens when things go wrong is more important than what happens when they go right.