Daemon Health and Safety

Slop operates autonomously on real repositories. Seven safety subsystems prevent it from merging to broken branches, exhausting machine memory, or silently stalling mid-implementation. Each one runs on every 30-second daemon cycle, or on startup, and each addresses a distinct failure mode.

Why safety subsystems matter

A human-in-the-loop workflow tolerates ambiguity: the human sees a broken CI badge and pauses before merging. Slop has no human in the loop by design. Without hard guardrails, the daemon can merge onto a broken main, spawn 40 agent subprocesses that exhaust RAM, or spin indefinitely in a status that a GitHub rate-limit glitch left stuck.

The seven subsystems described here are the guardrails. They are not optional features -- they are load-bearing parts of the autonomous merge pipeline. The safety hierarchy is intentional: the most severe condition (missing branch protection) kills the entire cycle; the least severe (a worker sitting idle for too long) fires a desktop notification and lets the operator decide.

Daemon cycle, every ~30s:

  1. Guard Check -- fails? return (entire cycle blocked)
  2. CI Health Refresh -- red? skip claiming + autonomous merge
  3. Tick -- claim new issues (skipped if CI red)
  4. Lifecycle Poll -- stuck detection + merge gate
  5. Resource Poll -- RSS/CPU sampling, breach detection
  6. Spawn Pending -- resolvers, CI fixers, reviewers, etc.
  7. Database Backup -- daily snapshot, no-op if today exists
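
The ordering above can be sketched as one function with an early return. This is an illustration only, not the daemon's actual code: the `CycleSteps` interface and every method on it are hypothetical stand-ins for the real controllers, and the sketch assumes the lifecycle poll still runs on red CI with its merge gate closed.

```typescript
// Hypothetical step hooks; the real controller names are not shown in this document.
interface CycleSteps {
  guardsSatisfied(): boolean;
  refreshCiHealth(): "green" | "red";
  tick(): void; // claims new issues
  lifecyclePoll(mergeGateOpen: boolean): void; // stuck detection + merge gate
  resourcePoll(): void;
  spawnPending(): void;
  dailyBackup(): void;
}

// Runs one cycle in the documented order and returns the names of the steps that ran.
function runCycle(steps: CycleSteps): string[] {
  const ran: string[] = ["guard"];
  if (!steps.guardsSatisfied()) return ran; // guard failure blocks the entire cycle

  const ciRed = steps.refreshCiHealth() === "red";
  ran.push("ci-health");

  if (!ciRed) {
    steps.tick(); // claiming is skipped while CI is red
    ran.push("tick");
  }

  steps.lifecyclePoll(!ciRed); // in-flight workers continue; merge gate closed on red
  ran.push("lifecycle");
  steps.resourcePoll();
  ran.push("resource");
  steps.spawnPending();
  ran.push("spawn");
  steps.dailyBackup();
  ran.push("backup");
  return ran;
}
```

Returning the list of executed steps is only there to make the control flow easy to inspect; a real cycle would not need it.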

All Seven Subsystems

1. Guard Check
src/server/daemon/guard-check.ts
Critical · Blocks entire cycle

The guard check is the first thing the daemon evaluates on every cycle. It validates that the watched repo's base branch has GitHub branch protection rules in place that make autonomous merging safe. If the guards are not satisfied, the cycle returns immediately -- no CI health check, no tick, no merges.

What is checked

The controller calls getBranchProtection(baseBranch) against the GitHub API and verifies two properties:

  • Strict mode enabled (strict: true). This requires PRs to be up-to-date with the base branch before they can be merged. Without it, a PR that was green against a stale base can auto-merge even after conflicting commits landed on main.
  • At least one required CI context. The contexts array must be non-empty. Without required checks, GitHub's auto-merge fires immediately, before CI has a chance to run.

The result is persisted to the guardStatus config key so the daemon can read it inline on each cycle without an additional API call:

{
  "enabled": false,
  "missing": ["strict", "contexts"],
  "repo": "owner/name",
  "checkedAt": "2026-06-13T10:00:00.000Z"
}
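
A minimal sketch of how the two checks could produce the status object shown above. The `BranchProtection` shape and the `evaluateGuards` name are assumptions for illustration, as is treating a missing protection rule as failing both checks.

```typescript
// Assumed shape: only the two fields the guard check is described as inspecting.
interface BranchProtection {
  strict: boolean;
  contexts: string[];
}

// Mirrors the guardStatus JSON shown above.
interface GuardStatus {
  enabled: boolean;
  missing: Array<"strict" | "contexts">;
  repo: string;
  checkedAt: string;
}

// Hypothetical helper: derives guardStatus from a branch-protection response.
function evaluateGuards(
  protection: BranchProtection | null,
  repo: string,
  now: Date = new Date(),
): GuardStatus {
  const missing: Array<"strict" | "contexts"> = [];
  if (!protection || !protection.strict) missing.push("strict");
  if (!protection || protection.contexts.length === 0) missing.push("contexts");
  return {
    enabled: missing.length === 0,
    missing,
    repo,
    checkedAt: now.toISOString(),
  };
}
```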

  • Blocks: claiming new issues
  • Blocks: lifecycle poll
  • Blocks: autonomous merging
  • Blocks: manual merge via UI

Recovery: Fix branch protection on GitHub directly. Enable Require branches to be up to date before merging (strict mode) and add at least one required status check. The daemon re-evaluates guard status on the next cycle -- no restart required.

Tip: The UI's Config page includes an Enable Guards button that calls client.setBranchProtection() to set up the required rules automatically, using the CI contexts discovered from the repo's existing check runs.

2. CI Health Monitoring
src/server/daemon/ci-health-controller.ts
High · Blocks claiming + merging

After the guard check passes, the CI health controller polls the base branch's check runs via getBranchCiStatus(baseBranch). The raw result is "green", "red", or "pending". The controller applies an anti-flake debounce before acting.

Anti-flake debounce

A single red sample does not freeze the daemon. The controller tracks a redStreak counter that increments on each consecutive red cycle and resets to zero on any green sample. The state only transitions to "red" when redStreak >= 2 (the RED_STREAK_THRESHOLD constant). A single flaky test, a transient runner hiccup, or a one-cycle anomaly all recover automatically on the next green cycle.

Raw result = "red", streak still below 2 after incrementing (a single red cycle)

Status stays at previous value. No freeze. No notification. Streak increments.

Streak >= 2 (consecutive red cycles)

Status becomes "red". Claiming paused. Merge gate closed.

Raw result = "green"

Streak resets to 0. Status becomes "green" immediately, regardless of prior streak.

Raw result = "pending"

Streak and status unchanged. Daemon stays in current state to avoid thrashing during long CI runs.
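
The four cases above amount to a small state transition function. The sketch below is one way to write it; `applyCiSample` and the type names are hypothetical, and only RED_STREAK_THRESHOLD and the redStreak field come from the text.

```typescript
type RawCi = "green" | "red" | "pending";
type DebouncedCi = "green" | "red";

interface CiHealthState {
  status: DebouncedCi;
  redStreak: number;
}

const RED_STREAK_THRESHOLD = 2;

// Hypothetical helper: folds one raw sample into the debounced state.
function applyCiSample(prev: CiHealthState, raw: RawCi): CiHealthState {
  if (raw === "pending") return prev; // streak and status unchanged
  if (raw === "green") return { status: "green", redStreak: 0 }; // immediate recovery
  const redStreak = prev.redStreak + 1;
  // Status flips to red only once the streak reaches the threshold.
  const status = redStreak >= RED_STREAK_THRESHOLD ? "red" : prev.status;
  return { status, redStreak };
}
```

Keeping the function pure (state in, state out) makes the debounce easy to test separately from the GitHub polling.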

State persistence and events

The debounced state is written to config key baseCiStatus as JSON after every cycle refresh. On state transitions, a repo.health_changed SSE event fires so the UI can surface a banner without polling.

{
  "status": "red",
  "redStreak": 2,
  "repo": "owner/name",
  "checkedAt": "2026-06-13T10:00:00.000Z"
}

  • Blocks: claiming new issues
  • Blocks: waiting_merge advancement
  • Allows: in-flight workers to continue
  • Allows: implement + verify to complete

Recovery: Fix whatever is breaking the base branch (revert a bad commit, re-run flaky CI). On the first green sample the streak resets to 0, the state transitions back to "green", and the daemon resumes claiming and merging automatically.

3. Worker Resource Monitoring
src/server/daemon/worker-resource-poll.ts
Medium · Non-blocking (fires diagnostic)

Each daemon cycle, the resource poll samples RSS memory and CPU usage for every non-terminal worker's process tree. The results feed a per-worker rolling ring, update peak RSS, and trigger a breach response when a worker's memory crosses the configured threshold.

Process tree sampling

The poll samples the entire process tree rooted at each worker's agentPid, not just the direct process. This catches memory consumed by forked subprocesses -- git, npm, test runners, and any other children the agent spawns. All workers are sampled in a single sampleProcessForest() call so N concurrent workers cost exactly 2 system calls per cycle, not 2*(N+1).
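
To illustrate why one snapshot is enough for any number of workers, the sketch below sums RSS over the subtree rooted at a given PID from a flat process table. It is not the implementation of sampleProcessForest(); `ProcRow` and `treeRssBytes` are hypothetical names, and the snapshot is assumed to have been taken once per cycle.

```typescript
// One row per OS process, as a single process-table snapshot might provide.
interface ProcRow {
  pid: number;
  ppid: number;
  rssBytes: number;
}

// Sums RSS for rootPid and all of its descendants found in the snapshot.
function treeRssBytes(rows: ProcRow[], rootPid: number): number {
  const rssByPid = new Map<number, number>();
  const childrenOf = new Map<number, number[]>();
  for (const row of rows) {
    rssByPid.set(row.pid, row.rssBytes);
    const siblings = childrenOf.get(row.ppid) ?? [];
    siblings.push(row.pid);
    childrenOf.set(row.ppid, siblings);
  }

  let total = 0;
  const seen = new Set<number>(); // guards against cycles in a racy snapshot
  const stack: number[] = [rootPid];
  while (stack.length > 0) {
    const pid = stack.pop() as number;
    if (seen.has(pid)) continue;
    seen.add(pid);
    total += rssByPid.get(pid) ?? 0;
    for (const child of childrenOf.get(pid) ?? []) stack.push(child);
  }
  return total;
}
```

The same `rows` array can be reused for every worker's agentPid, so the per-cycle cost does not grow with the number of workers.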

Rolling rings

The poll maintains a 60-sample in-memory ring per worker (RING_LIMIT = 60). Rings survive across cycles (the daemon is a process singleton) and are dropped automatically when a worker leaves the non-terminal set, preventing unbounded growth. Each sample records { at, totalRssBytes, totalCpuPct }.
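
A minimal sketch of such a bounded per-worker ring, assuming a plain array with oldest-first eviction. The `SampleRings` class and its method names are hypothetical; RING_LIMIT and the sample fields come from the text.

```typescript
interface ResourceSample {
  at: number; // epoch milliseconds (assumed unit)
  totalRssBytes: number;
  totalCpuPct: number;
}

const RING_LIMIT = 60;

class SampleRings {
  private rings = new Map<string, ResourceSample[]>();

  // Appends a sample and evicts the oldest once the ring is full.
  record(workerId: string, sample: ResourceSample): void {
    const ring = this.rings.get(workerId) ?? [];
    ring.push(sample);
    if (ring.length > RING_LIMIT) ring.shift();
    this.rings.set(workerId, ring);
  }

  // Drops rings for workers that have left the non-terminal set.
  retainOnly(activeWorkerIds: Set<string>): void {
    const stale: string[] = [];
    this.rings.forEach((_ring, id) => {
      if (!activeWorkerIds.has(id)) stale.push(id);
    });
    for (const id of stale) this.rings.delete(id);
  }

  get(workerId: string): readonly ResourceSample[] {
    return this.rings.get(workerId) ?? [];
  }
}
```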

Breach detection

The default threshold is 2,147,483,648 bytes (2 GiB), overridable via the memoryThresholdBytes config key. The threshold and the breach action are re-read from config on every cycle, so edits take effect without a restart.

A breach is latched: only the first crossing fires. The latch is the DB column Worker.memoryBreachedAt. Once set, subsequent cycles skip the breach logic for that worker regardless of RSS level.
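
The latch can be sketched as follows. This is a simplified in-memory model: `WorkerMemoryRow` and `checkBreach` are hypothetical names, the row stands in for the DB record, and treating "crosses" as strictly greater than the threshold is an assumption.

```typescript
// Stand-in for the DB row; only the latch column is modeled.
interface WorkerMemoryRow {
  id: string;
  memoryBreachedAt: Date | null;
}

const DEFAULT_MEMORY_THRESHOLD_BYTES = 2 * 1024 * 1024 * 1024; // 2,147,483,648

// Returns true only on the first crossing; later cycles see the latch and skip.
function checkBreach(
  worker: WorkerMemoryRow,
  totalRssBytes: number,
  thresholdBytes: number = DEFAULT_MEMORY_THRESHOLD_BYTES,
  now: Date = new Date(),
): boolean {
  if (worker.memoryBreachedAt !== null) return false; // already latched
  if (totalRssBytes <= thresholdBytes) return false; // assumption: strict crossing
  worker.memoryBreachedAt = now; // stamp the latch
  return true;
}
```

Because the latch lives on the worker record rather than in memory, a daemon restart would not cause the same worker to be reported twice.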

On breach | What happens
Always | Worker.memoryBreachedAt stamped in DB
Always | warn-level event appended to worker event log (visible on board)
Always | Diagnostic issue synthesized and filed on the watched repo (fire-and-forget; uses last 60 RSS samples + per-process breakdown)
memoryBreachAction = "stop" | Worker also stopped via stopWorker() after the report is triggered

Recovery: Review the diagnostic issue filed on the watched repo. It includes an RSS timeline and per-process breakdown. Adjust memoryThresholdBytes upward if the agent legitimately needs more RAM, or cancel the worker and investigate why memory usage is growing.

4. Stuck Worker Detection
src/server/daemon/lifecycle-poll.ts
Low · Advisory only

The lifecycle poll checks, on every cycle, whether any non-terminal worker has been in the same status longer than the configured threshold for that status. When a worker is stuck, a notification fires. The worker is not killed or restarted automatically -- the operator decides.

Monitored statuses and thresholds

Each stuck status has a dedicated config key. If the key is not set, a hardcoded fallback applies. Setting the key to a positive integer (milliseconds) overrides the fallback.

Status | Config key | Default | Typical cause of stuck
claimed | stuckClaimedMs | 5 min | Runner spawn failed silently
waiting_ci | stuckWaitingCiMs | 1 hour | CI job queued indefinitely
merging | stuckMergingMs | 30 min | Auto-merge blocked by a dismissed review or branch rule
resolving_conflict | stuckResolvingConflictMs | 30 min | Agent hung on a complex conflict
fixing_ci | stuckFixingCiMs | 30 min | Agent looping on an unfixable CI failure
verifying | stuckVerifyingMs | 30 min | Verify gate hung on a large test suite
waiting_review | stuckWaitingReviewMs | 1 hour | Auto-review disabled; no manual review submitted
in_review | stuckInReviewMs | 30 min | PR review agent hung
waiting_address | stuckWaitingAddressMs | 1 hour | Auto-address disabled; no action taken
in_address | stuckInAddressMs | 30 min | Address-comments agent hung
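
The lookup described above (config override, else hardcoded fallback) might look like the sketch below. The statuses, keys, and defaults are copied from the table; `STUCK_RULES`, `stuckThresholdMs`, and `isStuck` are hypothetical names, and the config is modeled as a plain object.

```typescript
const MINUTE_MS = 60000;

// Status -> config key and fallback, copied from the table above.
const STUCK_RULES: Record<string, { configKey: string; fallbackMs: number }> = {
  claimed: { configKey: "stuckClaimedMs", fallbackMs: 5 * MINUTE_MS },
  waiting_ci: { configKey: "stuckWaitingCiMs", fallbackMs: 60 * MINUTE_MS },
  merging: { configKey: "stuckMergingMs", fallbackMs: 30 * MINUTE_MS },
  resolving_conflict: { configKey: "stuckResolvingConflictMs", fallbackMs: 30 * MINUTE_MS },
  fixing_ci: { configKey: "stuckFixingCiMs", fallbackMs: 30 * MINUTE_MS },
  verifying: { configKey: "stuckVerifyingMs", fallbackMs: 30 * MINUTE_MS },
  waiting_review: { configKey: "stuckWaitingReviewMs", fallbackMs: 60 * MINUTE_MS },
  in_review: { configKey: "stuckInReviewMs", fallbackMs: 30 * MINUTE_MS },
  waiting_address: { configKey: "stuckWaitingAddressMs", fallbackMs: 60 * MINUTE_MS },
  in_address: { configKey: "stuckInAddressMs", fallbackMs: 30 * MINUTE_MS },
};

// Returns the effective threshold, or undefined for statuses that are not monitored.
function stuckThresholdMs(status: string, config: Record<string, unknown>): number | undefined {
  const rule = STUCK_RULES[status];
  if (!rule) return undefined;
  const override = config[rule.configKey];
  const valid = typeof override === "number" && Number.isInteger(override) && override > 0;
  return valid ? (override as number) : rule.fallbackMs;
}

// A worker is stuck when it has been in the same status longer than the threshold.
function isStuck(
  status: string,
  statusSinceMs: number,
  nowMs: number,
  config: Record<string, unknown> = {},
): boolean {
  const threshold = stuckThresholdMs(status, config);
  return threshold !== undefined && nowMs - statusSinceMs > threshold;
}
```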

Re-notify throttle

Once a worker has been notified as stuck, the stuckRenotifyMs config key (default 24 hours) controls how long must elapse before it is notified again. This prevents notification spam for long-running legitimate work. The last-notified timestamp is persisted to Worker.notifiedAt.

// Example: notify stuck workers every 2 hours instead of every 24
// Set via Config page or direct DB write:
stuckRenotifyMs = 7200000  // 2 hours in ms

Recovery: Stuck detection is advisory. Cancel the worker if it is genuinely hung, or raise the threshold if the status is expected to take longer than the default. No automatic kill occurs.

5. Stuck Worker Notifications
src/server/notifications/notify.ts
Supporting · Delivery layer

The notification system is the delivery layer for stuck and failure alerts. It wraps terminal-notifier, the macOS command-line notification tool, behind a throttled, DB-latched interface.

Enabling notifications

Set config key notificationTarget to "terminal". Any other value -- including unset -- silently disables all notifications. terminal-notifier must be installed and on PATH. If the binary is not found on the first attempt, notifications are disabled for the rest of the daemon session so the error does not repeat on every cycle.

What a notification includes

  • Title: "Slop worker stuck: <status> > <N>m" or "Slop worker failed: #<N> <title>"
  • Message: Issue number and title, plus a contextual hint per status (e.g., auto-merge blocked?, conflict resolution hung?, verification gate hung?)
  • Open URL: The PR URL on GitHub; clicking the notification opens it in the browser

Atomic slot claiming

Before firing, the notifier calls claimNotifySlot(workerId, current, cutoff) -- an atomic DB write that only succeeds when no notification has been sent for that worker within the stuckRenotifyMs window. This prevents duplicate notifications if two poll steps in the same cycle both detect the same stuck worker.
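
The claim is essentially a compare-and-set on the last-notified timestamp. The in-memory sketch below models that behaviour; it assumes `current` is the present time and `cutoff` is `current - stuckRenotifyMs`, which the text does not state explicitly. In a database this would typically be a single conditional UPDATE whose affected-row count decides success.

```typescript
// In-memory stand-in for the Worker.notifiedAt column.
class NotifySlots {
  private notifiedAt = new Map<string, number>();

  // Succeeds only if the worker was never notified, or was last notified before the cutoff.
  claimNotifySlot(workerId: string, current: number, cutoff: number): boolean {
    const last = this.notifiedAt.get(workerId);
    if (last !== undefined && last >= cutoff) return false; // still inside the window
    this.notifiedAt.set(workerId, current);
    return true;
  }
}
```

With this shape, two detectors racing on the same worker in the same cycle cannot both succeed: the first claim moves the timestamp inside the window, so the second sees it and returns false.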

macOS only. The notification backend is terminal-notifier, which is a macOS-only tool. On other platforms, notifications silently no-op. The notification system is designed for local operator use, not server deployments.

6. Orphan Process Reaping
src/server/daemon/ops/boot-ops.ts
Startup · Runs on every daemon boot

When the daemon restarts -- from a crash, a make dev hot-reload, or a manual stop -- some agent subprocesses may still be running in the OS. The daemon's in-memory runner registry and AbortController handles are gone, so those processes are unreachable from the new instance. Left alive, they consume CPU and memory, hold file locks in the worktree, and prevent a clean re-dispatch of the same worker.

How reaping works

On startup, before the first poll cycle, the boot sequence iterates all non-terminal workers that have an agentPid persisted in the DB. For each worker whose agentPid is not in the new runner registry:

  1. reapAgentTree(agentPid) sends SIGTERM to the process tree rooted at that PID.
  2. The DB column Worker.agentPid is cleared to null so neither a re-spawn nor a fresh-worktree path collides with the now-dead process.

Reaping applies to workers in: implementing, verifying, resolving_conflict, fixing_ci, in_review, and in_address.
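
The selection and cleanup steps can be sketched as below. The signal delivery itself is injected as a callback so the sketch stays side-effect free; `PersistedWorker` and `reapOrphans` are hypothetical names, and the callback stands in for reapAgentTree, whose real signature is not shown here.

```typescript
const REAPABLE_STATUSES = new Set<string>([
  "implementing",
  "verifying",
  "resolving_conflict",
  "fixing_ci",
  "in_review",
  "in_address",
]);

// Stand-in for the persisted worker row; only the fields used here are modeled.
interface PersistedWorker {
  id: string;
  status: string;
  agentPid: number | null;
}

// Reaps every orphaned agent tree and clears its PID; returns the PIDs that were reaped.
function reapOrphans(
  workers: PersistedWorker[],
  registeredPids: Set<number>,
  reapTree: (pid: number) => void, // stand-in for reapAgentTree (sends SIGTERM)
): number[] {
  const reaped: number[] = [];
  for (const worker of workers) {
    const pid = worker.agentPid;
    if (pid === null) continue; // nothing persisted, nothing to reap
    if (!REAPABLE_STATUSES.has(worker.status)) continue;
    if (registeredPids.has(pid)) continue; // owned by the new daemon instance
    reapTree(pid);
    worker.agentPid = null; // avoid collisions on re-spawn
    reaped.push(pid);
  }
  return reaped;
}
```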

Session resume after reaping

After reaping, implementing workers are eligible for session resume. The daemon checks whether all three conditions are met:

  • Worker.sessionId is persisted (the agent wrote a session ID before the crash)
  • The worktree still exists on disk at Worker.worktreePath
  • Worker.resumeAttempts < maxResumeAttempts (default 3)

If all three are true, the daemon re-dispatches the agent with resumeSessionId set. The agent harness picks up the surviving conversation thread rather than re-implementing from scratch, which saves tokens and preserves progress. If any condition fails, the worker transitions to failed and the board surfaces the Retry button.
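
The three-way eligibility check is a simple predicate. In the sketch below the filesystem check is passed in as a function so the example runs without touching disk; `ResumeCandidate` and `canResume` are hypothetical names, and the field names mirror the Worker columns mentioned above.

```typescript
interface ResumeCandidate {
  sessionId: string | null;
  worktreePath: string;
  resumeAttempts: number;
}

const DEFAULT_MAX_RESUME_ATTEMPTS = 3;

// True when all three documented conditions hold; otherwise the worker would go to failed.
function canResume(
  worker: ResumeCandidate,
  worktreeExists: (path: string) => boolean,
  maxResumeAttempts: number = DEFAULT_MAX_RESUME_ATTEMPTS,
): boolean {
  if (!worker.sessionId) return false; // no session was persisted before the crash
  if (!worktreeExists(worker.worktreePath)) return false; // worktree is gone
  return worker.resumeAttempts < maxResumeAttempts;
}
```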

Stranded reporting workers: Workers in reporting (a transient non-terminal status that requires the in-memory registry) are recovered by recoverStrandedReportingWorkers(), which restores them to their pre-report status (persisted in Worker.statusBeforeReport, or "failed" if missing) so they are never permanently stranded on restart.

7. Database Backup
src/server/daemon/database-durability.ts
Background · Non-blocking

The daemon takes a daily SQLite snapshot during the normal poll cycle. No separate timer, cron job, or daemon pause is required. This is pure data durability -- it does not block or affect any other subsystem.

Backup cadence

Each cycle, runDailyBackupPoll() checks whether the backup directory already contains a backup file stamped with today's UTC date (YYYY-MM-DD). If one exists, the call is a no-op. If not, a backup is created. This means:

  • At most one backup per calendar day, regardless of cycle frequency
  • Mid-day daemon restarts do not produce duplicate backups
  • A pre-migration backup taken earlier the same day satisfies the day quota
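
A sketch of the once-per-day check, operating on a list of filenames rather than the real directory. It assumes the date stamp follows a `slop-backup-` prefix, matching the filename format shown later in this section; `hasBackupForToday` and `utcDateStamp` are hypothetical names.

```typescript
const BACKUP_PREFIX = "slop-backup-"; // assumption, taken from the example filename

// YYYY-MM-DD in UTC, e.g. "2026-06-13".
function utcDateStamp(now: Date): string {
  return now.toISOString().slice(0, 10);
}

// True when any existing backup file carries today's UTC date.
function hasBackupForToday(fileNames: string[], now: Date): boolean {
  const todayPrefix = BACKUP_PREFIX + utcDateStamp(now);
  return fileNames.some((fileName) => fileName.startsWith(todayPrefix));
}
```

Keying on the UTC date rather than a "last backup" timestamp is what makes restarts and pre-migration backups count toward the same daily quota.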

How the backup is taken

Backups use better-sqlite3's native .backup() method, which implements SQLite's Online Backup API. The database is never paused, locked, or copied raw, and writes continue during the backup.

// Backup filename format:
slop-backup-2026-06-13T10-00-00-000Z.db

// Default backup directory:
~/.slop/backups/

// Retention: 7 most recent files kept, older ones pruned
const BACKUP_RETENTION = 7;

Retention pruning

After each backup, pruneBackups() sorts all backup files lexicographically and deletes all but the 7 most recent. Because every filename includes a full ISO timestamp, lexicographic order is also chronological order.
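
The pruning rule can be expressed as a pure function over filenames. This sketch only selects which files to delete and leaves the deletion to the caller; `selectBackupsToPrune` is a hypothetical name, not the signature of pruneBackups().

```typescript
const BACKUP_RETENTION = 7;

// Returns the filenames to delete: everything except the `keep` newest.
function selectBackupsToPrune(fileNames: string[], keep: number = BACKUP_RETENTION): string[] {
  // Default sort() compares strings by code unit, which is chronological
  // for names that share a prefix and embed an ISO timestamp.
  const sorted = [...fileNames].sort();
  return sorted.slice(0, Math.max(0, sorted.length - keep));
}
```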

Manual restore

There is no restore UI. To restore from a backup:

  1. Stop the daemon.
  2. Copy the desired backup file over the runtime database path (~/.slop/slop.db by default).
  3. Restart the daemon. The boot sequence runs an integrity check via PRAGMA integrity_check before proceeding.

Integrity check at boot: Before acquiring the daemon lock, bootDaemon() runs checkIntegrity() which opens a read-only better-sqlite3 connection and runs PRAGMA integrity_check. If the result is anything other than a single "ok" row, the daemon refuses to start with a DaemonBootError. This catches a corrupt DB -- including one from a bad restore -- before the daemon takes the lock.

Safety Hierarchy

The subsystems are ranked by severity. Guards and CI health enforce correctness at the pipeline level. Resource and stuck monitoring are advisory and preserve operator control. Backup and reaping are operational hygiene that run in the background without affecting throughput.

# | Subsystem | Severity | What it blocks | Recovery action | Runs
1 | Guard Check | Critical | Entire daemon cycle -- no claiming, no lifecycle poll, no merges | Fix branch protection in GitHub (enable strict mode + add required CI contexts) | Every cycle
2 | CI Health Monitoring | High | New issue claiming; advancement of waiting_merge workers | Fix the base branch (revert bad commit or wait for CI to recover). Resumes on the first green sample. | Every cycle
3 | Worker Resource Monitoring | Medium | Nothing blocked directly. Files a diagnostic issue and optionally stops the worker. | Review the filed issue, adjust memoryThresholdBytes, cancel or let the worker continue | Every cycle
4 | Stuck Worker Detection | Low | Nothing blocked. Fires a desktop notification only. | Cancel and retry the worker, or raise the stuck threshold for that status | Every cycle
5 | Notifications | Support | N/A -- delivery layer for alerts from subsystems 3 and 4 | Install terminal-notifier; set notificationTarget=terminal | Every cycle (via lifecycle poll)
6 | Orphan Process Reaping | Startup | N/A -- prevents zombie processes from a prior daemon instance | Automatic on restart. Manual only if SIGTERM is blocked by the OS. | Daemon startup only
7 | Database Backup | Background | Nothing blocked. Daily snapshot, non-blocking. | Manual file copy to restore. No UI. Daemon refuses to start if the DB is corrupt. | Once per calendar day

The design principle: conditions that could cause irreversible damage (merging to broken main, merging without required CI) are hard-blocked at the cycle level. Conditions that require operator judgment (high memory, long-running workers) are surfaced as notifications without killing the work in progress. Data safety (backups, integrity checks) runs in the background without affecting throughput.