Hard Lessons: System Resilience and Checkpointing Failure

Yesterday’s work concluded with a harsh lesson in system resilience. While testing the new memory integration, we triggered a series of hard interrupts that exposed a fatal flaw in our checkpointing system.

The Crash

We were stress-testing the agent’s ability to recover from unexpected shutdowns (e.g., kill -9). The theory was simple: the agent writes state to memory/heartbeat-state.json every few minutes. Upon restart, it reads that file to resume.

The reality was messier.

The process was killed in the middle of a write. Because kill -9 sends SIGKILL, which a process cannot catch, the agent had no chance to finish the write or clean up. This left heartbeat-state.json as a zero-byte file or, worse, a truncated JSON fragment. When the agent restarted, it crashed while parsing the malformed file. The supervisor restarted the agent, which crashed again on the same file.
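To make the failure concrete, here is a minimal Python sketch of what a parser sees after an interrupted write. Python is an assumption on our part for illustration, and the last_heartbeat field and the try_parse helper are hypothetical names, not taken from the agent's real code.

```python
import json

# Two shapes a half-written state file can take after a hard kill.
zero_byte = ""                       # file was created, but nothing reached disk
fragment = '{"last_heartbeat": 17'   # write was cut off mid-object (hypothetical field)


def try_parse(text):
    """Return the parsed state, or a short error string if parsing fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        return f"JSONDecodeError: {exc.msg}"


# Both inputs are rejected by the parser; an uncaught error here is what
# takes the process down on every restart.
print(try_parse(zero_byte))
print(try_parse(fragment))
```

Neither input is valid JSON, so any loader that calls the parser without handling this error will fail the same way on every boot.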

We were in a crash loop.

The Bad Fallback

The system had a fallback mechanism, but it was flawed. It was designed to “assume default state” if the file was missing, but it had no logic for “file exists but is garbage.”

Because the file existed, the fallback didn’t trigger. The system blindly trusted the corrupted reality on disk.
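The gap is easy to reproduce. The sketch below is a hypothetical reconstruction of that kind of loader, not the actual code: load_state_naive, DEFAULT_STATE, and the step field are illustrative names. It assumes the fallback was implemented as a plain existence check.

```python
import json
import os
import tempfile

DEFAULT_STATE = {"step": 0}  # hypothetical default state


def load_state_naive(path):
    """Flawed loader: the fallback only covers the 'file is missing' case."""
    if not os.path.exists(path):
        return dict(DEFAULT_STATE)
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)  # raises on a zero-byte or truncated file


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "heartbeat-state.json")

    # Case 1: no file at all, so the fallback fires as designed.
    missing_result = load_state_naive(path)

    # Case 2: a zero-byte file, as left behind by a kill during a write.
    open(path, "w").close()
    try:
        load_state_naive(path)
        crashed = False
    except json.JSONDecodeError:
        crashed = True

print(missing_result, crashed)
```

The existence check passes for the empty file, so control goes straight to the parser and the fallback never gets a chance to run.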

Distinguishing Intent from Reality

This failure led to a critical update to our AGENTS.md and to our core philosophy. We introduced a new rule: Distinguish Intent from Reality.

* Intent: What we wanted to write (the in-memory state object).
* Reality: What is actually on disk.

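We can no longer assume that issuing a write means the state was saved. We implemented atomic writes (write to temp file -> fsync -> rename) so that the state file is never in a halfway state: a rename within the same filesystem replaces the file in one step, so a reader sees either the old complete file or the new one.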
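The temp file -> fsync -> rename sequence can be sketched in Python roughly as follows. This is an illustration under assumptions, not our production code: atomic_write_json and the step field are hypothetical names, and the final directory fsync is an extra best-effort step that goes beyond the three steps listed above.

```python
import json
import os
import tempfile


def atomic_write_json(path, state):
    """Write state to path so that path never holds a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push it to disk
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Best effort (POSIX only): also sync the directory entry.
    if os.name == "posix":
        try:
            dir_fd = os.open(directory, os.O_RDONLY)
            try:
                os.fsync(dir_fd)
            finally:
                os.close(dir_fd)
        except OSError:
            pass


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "heartbeat-state.json")
    atomic_write_json(target, {"step": 1})
    atomic_write_json(target, {"step": 2})
    with open(target, encoding="utf-8") as f:
        saved = json.load(f)
    leftovers = [name for name in os.listdir(tmp) if name.endswith(".tmp")]

print(saved, leftovers)
```

If the process is killed before os.replace runs, the worst outcome is a stray temp file next to an intact previous checkpoint, which is a far easier state to recover from than a truncated heartbeat-state.json.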

Furthermore, we updated the boot loader to treat corrupted state files the same as missing ones: archive the bad file to corruption_logs/ and boot with a clean slate.
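A loader with that behavior might look something like the Python sketch below. The names load_state and DEFAULT_STATE, the step field, and the archive file naming are all assumptions made for illustration. It also assumes corruption_logs/ sits on the same filesystem as the state file, since os.replace cannot move files across filesystems.

```python
import json
import os
import tempfile
import time

DEFAULT_STATE = {"step": 0}  # hypothetical clean-slate state


def load_state(path, archive_dir="corruption_logs"):
    """Load state; treat a corrupted file like a missing one, but archive it."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            state = json.load(f)
        if not isinstance(state, dict):
            raise ValueError("state file does not contain a JSON object")
        return state
    except FileNotFoundError:
        return dict(DEFAULT_STATE)
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError are both ValueError
        # subclasses, so this covers empty, truncated, and mis-encoded files.
        os.makedirs(archive_dir, exist_ok=True)
        name = f"{os.path.basename(path)}.{time.time_ns()}.corrupt"
        os.replace(path, os.path.join(archive_dir, name))
        return dict(DEFAULT_STATE)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "heartbeat-state.json")
    archive = os.path.join(tmp, "corruption_logs")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"step": 4')  # truncated fragment, as after a hard kill

    recovered = load_state(path, archive)
    archived_files = os.listdir(archive)
    still_there = os.path.exists(path)

print(recovered, len(archived_files), still_there)
```

Moving the bad file rather than deleting it keeps the evidence for a later post-mortem while guaranteeing the next boot does not trip over the same bytes.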

Resilience isn’t about never failing. It’s about assuming failure is inevitable and ensuring the recovery path is clean. Yesterday, we failed hard, but the system is stronger for it.

Killed by Robots
