Daemon Health and Safety
Slop operates autonomously on real repositories. Seven safety subsystems prevent it from merging to broken branches, exhausting machine memory, or silently stalling mid-implementation. Each one runs on every 30-second daemon cycle, or on startup, and each addresses a distinct failure mode.
Why safety subsystems matter
A human-in-the-loop workflow tolerates ambiguity: the human sees a broken CI badge and pauses before merging. Slop has no human in the loop by design. Without hard guardrails, the daemon can merge onto a broken main, spawn 40 agent subprocesses that exhaust RAM, or spin indefinitely in a status that a GitHub rate-limit glitch left stuck.
The seven subsystems described here are the guardrails. They are not optional features -- they are load-bearing parts of the autonomous merge pipeline. The safety hierarchy is intentional: the most severe condition (missing branch protection) kills the entire cycle; the least severe (a worker sitting idle for too long) fires a desktop notification and lets the operator decide.
All Seven Subsystems
Guard Check
The guard check is the first thing the daemon evaluates on every cycle. It validates that the watched repo's base branch has GitHub branch protection rules in place that make autonomous merging safe. If the guards are not satisfied, the cycle returns immediately -- no CI health check, no tick, no merges.
What is checked
The controller calls getBranchProtection(baseBranch) against the GitHub API and verifies two properties:
- Strict mode enabled (strict: true). This requires PRs to be up to date with the base branch before they can be merged. Without it, a PR that was green against a stale base can auto-merge even after conflicting commits landed on main.
- At least one required CI context. The contexts array must be non-empty. Without required checks, GitHub's auto-merge fires immediately, before CI has a chance to run.
The result is persisted to the guardStatus config key so the daemon can read it inline on each cycle without an additional API call:
{
"enabled": false,
"missing": ["strict", "contexts"],
"repo": "owner/name",
"checkedAt": "2026-06-13T10:00:00.000Z"
}
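The two checks and the persisted record can be sketched as a pure function. This is a minimal illustration, not the controller's actual code: the BranchProtection shape, the evaluateGuards name, and the treatment of a null payload (no protection rules at all) are assumptions.

```typescript
// Hypothetical shape of what getBranchProtection(baseBranch) returns.
interface BranchProtection {
  strict: boolean;
  contexts: string[];
}

// Mirrors the guardStatus JSON shown above.
interface GuardStatus {
  enabled: boolean;
  missing: string[];
  repo: string;
  checkedAt: string;
}

// Assumption: a null payload means the branch has no protection rules,
// so both properties are reported as missing.
function evaluateGuards(
  protection: BranchProtection | null,
  repo: string,
  now: Date,
): GuardStatus {
  const missing: string[] = [];
  if (!protection || !protection.strict) missing.push("strict");
  if (!protection || protection.contexts.length === 0) missing.push("contexts");
  return {
    enabled: missing.length === 0,
    missing,
    repo,
    checkedAt: now.toISOString(),
  };
}
```

With an unprotected branch this yields the same record as the example above: enabled is false and missing lists both "strict" and "contexts".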
The required rules can be set up automatically with client.setBranchProtection(), using the CI contexts discovered from the repo's existing check runs.
CI Health Monitoring
After the guard check passes, the CI health controller polls the base branch's check runs via getBranchCiStatus(baseBranch). The raw result is "green", "red", or "pending". The controller applies an anti-flake debounce before acting.
Anti-flake debounce
A single red sample does not freeze the daemon. The controller tracks a redStreak counter that increments on each consecutive red cycle and resets to zero on any green sample. The state only transitions to "red" when redStreak >= 2 (the RED_STREAK_THRESHOLD constant). A single flaky test, a transient runner hiccup, or a one-cycle anomaly all recover automatically on the next green cycle.
| Sample | Effect |
|---|---|
| First red (redStreak = 1) | Status stays at previous value. No freeze. No notification. Streak increments. |
| Second consecutive red (redStreak >= 2) | Status becomes "red". Claiming paused. Merge gate closed. |
| Green | Streak resets to 0. Status becomes "green" immediately, regardless of prior streak. |
| Pending | Streak and status unchanged. Daemon stays in current state to avoid thrashing during long CI runs. |
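The debounce rules above amount to a small state transition function. The following is a sketch under stated assumptions: the type names and applyCiSample are hypothetical, and the debounced status is modelled as only "green" or "red".

```typescript
type CiSample = "green" | "red" | "pending";
type DebouncedCiStatus = "green" | "red";

interface CiHealthState {
  status: DebouncedCiStatus;
  redStreak: number;
}

const RED_STREAK_THRESHOLD = 2;

// Returns the next debounced state for one raw sample.
function applyCiSample(prev: CiHealthState, sample: CiSample): CiHealthState {
  // Pending: streak and status are left untouched.
  if (sample === "pending") return prev;
  // Green: recover immediately, regardless of the prior streak.
  if (sample === "green") return { status: "green", redStreak: 0 };
  // Red: count the sample, but only flip once the threshold is reached.
  const redStreak = prev.redStreak + 1;
  const status: DebouncedCiStatus =
    redStreak >= RED_STREAK_THRESHOLD ? "red" : prev.status;
  return { status, redStreak };
}
```

Feeding the sequence red, red, pending, green into a fresh green state keeps the status green after the first sample, flips it to red on the second, holds it through the pending sample, and recovers on the green one.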
State persistence and events
The debounced state is written to config key baseCiStatus as JSON after every cycle refresh. On state transitions, a repo.health_changed SSE event fires so the UI can surface a banner without polling.
{
"status": "red",
"redStreak": 2,
"repo": "owner/name",
"checkedAt": "2026-06-13T10:00:00.000Z"
}
"green" and the daemon resumes claiming and merging automatically.
Worker Resource Monitoring
Each daemon cycle, the resource poll samples RSS memory and CPU usage for every non-terminal worker's process tree. The results feed a per-worker rolling ring, update peak RSS, and trigger a breach response when a worker's memory crosses the configured threshold.
Process tree sampling
The poll samples the entire process tree rooted at each worker's agentPid, not just the direct process. This catches memory consumed by forked subprocesses -- git, npm, test runners, and any other children the agent spawns. All workers are sampled in a single sampleProcessForest() call so N concurrent workers cost exactly 2 system calls per cycle, not 2*(N+1).
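To illustrate why one snapshot can serve every worker, here is a sketch that sums RSS over the tree rooted at a given PID from a flat process table. The ProcRow shape and sumTreeRss are hypothetical; the real sampleProcessForest() may collect and structure its data differently.

```typescript
// One row per OS process, as a single process-table snapshot might provide.
interface ProcRow {
  pid: number;
  ppid: number;
  rssBytes: number;
}

// Sums RSS for rootPid and all of its descendants.
function sumTreeRss(table: ProcRow[], rootPid: number): number {
  const childrenOf = new Map<number, ProcRow[]>();
  let root: ProcRow | undefined;
  for (const row of table) {
    if (row.pid === rootPid) root = row;
    const siblings = childrenOf.get(row.ppid);
    if (siblings) siblings.push(row);
    else childrenOf.set(row.ppid, [row]);
  }
  if (!root) return 0; // process already exited

  let total = 0;
  const seen = new Set<number>(); // guards against malformed parent links
  const stack: ProcRow[] = [root];
  while (stack.length > 0) {
    const row = stack.pop() as ProcRow;
    if (seen.has(row.pid)) continue;
    seen.add(row.pid);
    total += row.rssBytes;
    for (const child of childrenOf.get(row.pid) || []) stack.push(child);
  }
  return total;
}
```

The same table can be reused for every worker's agentPid in a cycle, so the cost of taking the snapshot does not grow with the number of workers.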
Rolling rings
The poll maintains a 60-sample in-memory ring per worker (RING_LIMIT = 60). Rings survive across cycles (the daemon is a process singleton) and are dropped automatically when a worker leaves the non-terminal set, preventing unbounded growth. Each sample records { at, totalRssBytes, totalCpuPct }.
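A minimal sketch of the ring behaviour described above. The recordSample and dropStaleRings names are hypothetical, and `at` is assumed to be an epoch-millisecond timestamp.

```typescript
interface ResourceSample {
  at: number; // assumption: epoch milliseconds
  totalRssBytes: number;
  totalCpuPct: number;
}

const RING_LIMIT = 60;

// Module-level state survives across cycles because the daemon is a singleton.
const rings = new Map<string, ResourceSample[]>();

// Appends a sample and evicts the oldest one once the ring is full.
function recordSample(workerId: string, sample: ResourceSample): void {
  const ring = rings.get(workerId) || [];
  ring.push(sample);
  if (ring.length > RING_LIMIT) ring.shift();
  rings.set(workerId, ring);
}

// Drops rings for workers that have left the non-terminal set.
function dropStaleRings(activeWorkerIds: Set<string>): void {
  rings.forEach((_ring, workerId) => {
    if (!activeWorkerIds.has(workerId)) rings.delete(workerId);
  });
}
```

Calling dropStaleRings once per cycle with the current non-terminal worker IDs keeps the map bounded by the number of live workers.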
Breach detection
The default threshold is 2,147,483,648 bytes (2 GiB), overridable via the memoryThresholdBytes config key. The threshold and the breach action are re-read from config on every cycle, so edits take effect without a restart.
A breach is latched: only the first crossing fires. The latch is the DB column Worker.memoryBreachedAt. Once set, subsequent cycles skip the breach logic for that worker regardless of RSS level.
| On breach | What happens |
|---|---|
| Always | Worker.memoryBreachedAt stamped in DB |
| Always | warn-level event appended to worker event log (visible on board) |
| Always | Diagnostic issue synthesized and filed on the watched repo (fire-and-forget; uses last 60 RSS samples + per-process breakdown) |
| memoryBreachAction = "stop" | Worker also stopped via stopWorker() after the report is triggered |
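The latch and the breach table can be summarised as a function that plans the response for one worker in one cycle. This is an illustrative sketch: the BreachInput shape, the step names, and planBreachResponse are hypothetical, and treating a breach as RSS strictly above the threshold is an assumption.

```typescript
interface BreachInput {
  totalRssBytes: number;
  memoryBreachedAt: string | null; // the DB latch
  thresholdBytes: number;
  breachAction: string; // value of memoryBreachAction
}

type BreachStep = "stamp" | "warn" | "report" | "stop";

function planBreachResponse(input: BreachInput): BreachStep[] {
  // Latched: a worker that already breached is skipped, whatever its RSS.
  if (input.memoryBreachedAt !== null) return [];
  // Assumption: only RSS strictly above the threshold counts as a breach.
  if (input.totalRssBytes <= input.thresholdBytes) return [];
  const steps: BreachStep[] = ["stamp", "warn", "report"];
  // The stop comes last, after the diagnostic report has been triggered.
  if (input.breachAction === "stop") steps.push("stop");
  return steps;
}
```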
After a breach, raise memoryThresholdBytes if the agent legitimately needs more RAM, or cancel the worker and investigate why memory usage is growing.
Stuck Worker Detection
The lifecycle poll checks, on every cycle, whether any non-terminal worker has been in the same status longer than the configured threshold for that status. When a worker is stuck, a notification fires. The worker is not killed or restarted automatically -- the operator decides.
Monitored statuses and thresholds
Each stuck status has a dedicated config key. If the key is not set, a hardcoded fallback applies. Setting the key to a positive integer (milliseconds) overrides the fallback.
| Status | Config key | Default | Typical cause of stuck |
|---|---|---|---|
| claimed | stuckClaimedMs | 5 min | Runner spawn failed silently |
| waiting_ci | stuckWaitingCiMs | 1 hour | CI job queued indefinitely |
| merging | stuckMergingMs | 30 min | Auto-merge blocked by a dismissed review or branch rule |
| resolving_conflict | stuckResolvingConflictMs | 30 min | Agent hung on a complex conflict |
| fixing_ci | stuckFixingCiMs | 30 min | Agent looping on an unfixable CI failure |
| verifying | stuckVerifyingMs | 30 min | Verify gate hung on a large test suite |
| waiting_review | stuckWaitingReviewMs | 1 hour | Auto-review disabled; no manual review submitted |
| in_review | stuckInReviewMs | 30 min | PR review agent hung |
| waiting_address | stuckWaitingAddressMs | 1 hour | Auto-address disabled; no action taken |
| in_address | stuckInAddressMs | 30 min | Address-comments agent hung |
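The threshold lookup can be sketched as follows. The fallback values and config key names come from the table above; the function names, the config object shape, and the rule for deriving a key from a status are illustrative assumptions.

```typescript
const MIN_MS = 60 * 1000;

// Hardcoded fallbacks, copied from the table above.
const STUCK_FALLBACK_MS: Record<string, number> = {
  claimed: 5 * MIN_MS,
  waiting_ci: 60 * MIN_MS,
  merging: 30 * MIN_MS,
  resolving_conflict: 30 * MIN_MS,
  fixing_ci: 30 * MIN_MS,
  verifying: 30 * MIN_MS,
  waiting_review: 60 * MIN_MS,
  in_review: 30 * MIN_MS,
  waiting_address: 60 * MIN_MS,
  in_address: 30 * MIN_MS,
};

// "waiting_ci" -> "stuckWaitingCiMs"
function stuckConfigKey(status: string): string {
  const camel = status
    .split("_")
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join("");
  return "stuck" + camel + "Ms";
}

// Returns undefined for statuses that are not monitored.
function stuckThresholdMs(
  status: string,
  config: Record<string, unknown>,
): number | undefined {
  const override = config[stuckConfigKey(status)];
  if (typeof override === "number" && Number.isInteger(override) && override > 0) {
    return override;
  }
  return STUCK_FALLBACK_MS[status];
}

function isStuck(
  status: string,
  statusSinceMs: number,
  nowMs: number,
  config: Record<string, unknown>,
): boolean {
  const threshold = stuckThresholdMs(status, config);
  return threshold !== undefined && nowMs - statusSinceMs > threshold;
}
```

A worker that has been in merging for 31 minutes is reported as stuck under the defaults, while an override such as { stuckMergingMs: 3600000 } would keep it quiet for a full hour.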
Re-notify throttle
Once a worker has been notified as stuck, the stuckRenotifyMs config key (default 24 hours) controls how long must elapse before it is notified again. This prevents notification spam for long-running legitimate work. The last-notified timestamp is persisted to Worker.notifiedAt.
// Example: notify stuck workers every 2 hours instead of every 24
// Set via Config page or direct DB write:
stuckRenotifyMs = 7200000 // 2 hours in ms
Notifications
The notification system is the delivery layer for stuck and failure alerts. It wraps terminal-notifier, the macOS command-line notification tool, behind a throttled, DB-latched interface.
Enabling notifications
Set config key notificationTarget to "terminal". Any other value -- including unset -- silently disables all notifications. terminal-notifier must be installed and on PATH. If the binary is not found on the first attempt, notifications are disabled for the rest of the daemon session so the error does not repeat on every cycle.
What a notification includes
- Title: "Slop worker stuck: <status> > <N>m" or "Slop worker failed: #<N> <title>"
- Message: Issue number and title, plus a contextual hint per status (e.g., auto-merge blocked?, conflict resolution hung?, verification gate hung?)
- Open URL: The PR URL on GitHub; clicking the notification opens it in the browser
Atomic slot claiming
Before firing, the notifier calls claimNotifySlot(workerId, current, cutoff) -- an atomic DB write that only succeeds when no notification has been sent for that worker within the stuckRenotifyMs window. This prevents duplicate notifications if two poll steps in the same cycle both detect the same stuck worker.
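The claim semantics can be illustrated with an in-memory stand-in. This is not the real implementation, which is an atomic DB write; the map, the function name claimNotifySlotSketch, and the reading of `current` as the present timestamp and `cutoff` as `current - stuckRenotifyMs` are assumptions.

```typescript
// In-memory stand-in for the Worker.notifiedAt column (epoch milliseconds).
const notifiedAtByWorker = new Map<string, number>();

// Succeeds only when no notification was recorded after `cutoff`.
// In a database this would be one conditional UPDATE, roughly:
//   UPDATE Worker SET notifiedAt = :current
//   WHERE id = :workerId AND (notifiedAt IS NULL OR notifiedAt <= :cutoff)
// (table and column names other than notifiedAt are assumed), with the
// affected-row count deciding the result.
function claimNotifySlotSketch(
  workerId: string,
  current: number,
  cutoff: number,
): boolean {
  const last = notifiedAtByWorker.get(workerId);
  if (last !== undefined && last > cutoff) return false;
  notifiedAtByWorker.set(workerId, current);
  return true;
}
```

Because the check and the write happen in one step, the second of two poll steps that detect the same stuck worker in a cycle sees the fresh timestamp and backs off.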
Notifications rely on terminal-notifier, which is a macOS-only tool. On other platforms, notifications silently no-op. The notification system is designed for local operator use, not server deployments.
Orphan Process Reaping
When the daemon restarts -- from a crash, a make dev hot-reload, or a manual stop -- some agent subprocesses may still be running in the OS. The daemon's in-memory runner registry and AbortController handles are gone, so those processes are unreachable from the new instance. Left alive, they consume CPU and memory, hold file locks in the worktree, and prevent a clean re-dispatch of the same worker.
How reaping works
On startup, before the first poll cycle, the boot sequence iterates all non-terminal workers that have an agentPid persisted in the DB. For each worker whose agentPid is not in the new runner registry:
- reapAgentTree(agentPid) sends SIGTERM to the process tree rooted at that PID.
- The DB column Worker.agentPid is cleared to null so neither a re-spawn nor a fresh-worktree path collides with the now-dead process.
Reaping applies to workers in: implementing, verifying, resolving_conflict, fixing_ci, in_review, and in_address.
Session resume after reaping
After reaping, implementing workers are eligible for session resume. The daemon checks whether all three conditions are met:
- Worker.sessionId is persisted (the agent wrote a session ID before the crash)
- The worktree still exists on disk at Worker.worktreePath
- Worker.resumeAttempts < maxResumeAttempts (default 3)
If all three are true, the daemon re-dispatches the agent with resumeSessionId set. The agent harness picks up the surviving conversation thread rather than re-implementing from scratch, preserving tokens and progress. If any condition fails, the worker transitions to failed and the board surfaces the Retry button.
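The three-condition check reduces to a small decision function. A sketch only: the ResumeCandidate shape and nextStepAfterReap are hypothetical, and the filesystem check is injected as a predicate so the example stays self-contained.

```typescript
interface ResumeCandidate {
  sessionId: string | null;
  worktreePath: string | null;
  resumeAttempts: number;
}

// Decides what happens to a reaped implementing worker.
function nextStepAfterReap(
  worker: ResumeCandidate,
  maxResumeAttempts: number,
  worktreeExists: (path: string) => boolean, // stand-in for a disk check
): "resume" | "failed" {
  const eligible =
    worker.sessionId !== null &&
    worker.worktreePath !== null &&
    worktreeExists(worker.worktreePath) &&
    worker.resumeAttempts < maxResumeAttempts;
  return eligible ? "resume" : "failed";
}
```

With the default of 3, a worker that has already been resumed three times fails the last condition and falls through to failed, which is what bounds repeated crash-and-resume loops.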
Workers stranded in reporting (a transient non-terminal status that requires the in-memory registry) are recovered by recoverStrandedReportingWorkers(), which restores them to their pre-report status (persisted in Worker.statusBeforeReport, or "failed" if missing) so they are never permanently stranded on restart.
Database Backup
The daemon takes a daily SQLite snapshot during the normal poll cycle. No separate timer, cron job, or daemon pause is required. This is pure data durability -- it does not block or affect any other subsystem.
Backup cadence
Each cycle, runDailyBackupPoll() checks whether the backup directory already contains a file whose name starts with today's UTC date (YYYY-MM-DD). If one exists, the call is a no-op. If not, a backup is created. This means:
- At most one backup per calendar day, regardless of cycle frequency
- Mid-day daemon restarts do not produce duplicate backups
- A pre-migration backup taken earlier the same day satisfies the day quota
How the backup is taken
Backups use better-sqlite3's native .backup() method, which implements SQLite's Online Backup API. The database is never paused or copied as a raw file, and writes continue during the backup.
// Backup filename format: slop-backup-2026-06-13T10-00-00-000Z.db
// Default backup directory: ~/.slop/backups/
// Retention: 7 most recent files kept, older ones pruned
const BACKUP_RETENTION = 7;
Retention pruning
After each backup, pruneBackups() sorts all backup files lexicographically and deletes all but the 7 most recent. Because filenames include a full ISO timestamp, lexicographic order is chronological order.
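The cadence check and the pruning rule can both be expressed over a plain list of filenames. This sketch makes two assumptions: the function names are hypothetical, and the date comparison accepts either a bare YYYY-MM-DD prefix or the date that follows the slop-backup- prefix, since the exact matching rule used by runDailyBackupPoll() is not shown here.

```typescript
const BACKUP_PREFIX = "slop-backup-";

// "2026-06-13T10:00:00.000Z" -> "2026-06-13" (UTC calendar day)
function utcDayStamp(now: Date): string {
  return now.toISOString().slice(0, 10);
}

// True when a backup for today's UTC date already exists.
function hasBackupForDay(fileNames: string[], now: Date): boolean {
  const stamp = utcDayStamp(now);
  return fileNames.some(
    (fileName) =>
      fileName.startsWith(stamp) || fileName.startsWith(BACKUP_PREFIX + stamp),
  );
}

// Returns the files to delete, keeping the `keep` most recent.
// Array.prototype.sort without a comparator orders strings by code unit,
// which is chronological for these timestamped names.
function backupsToPrune(fileNames: string[], keep: number): string[] {
  const sorted = fileNames.slice().sort();
  return sorted.slice(0, Math.max(0, sorted.length - keep));
}
```

Given nine daily files and a retention of 7, backupsToPrune returns the two oldest names and leaves the input array untouched.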
Manual restore
There is no restore UI. To restore from a backup:
- Stop the daemon.
- Copy the desired backup file over the runtime database path (~/.slop/slop.db by default).
- Restart the daemon. The boot sequence runs an integrity check via PRAGMA integrity_check before proceeding.
bootDaemon() runs checkIntegrity(), which opens a read-only better-sqlite3 connection and runs PRAGMA integrity_check. If the result is anything other than a single "ok" row, the daemon refuses to start with a DaemonBootError. This catches a corrupt DB -- including one from a bad restore -- before the daemon takes the lock.
Safety Hierarchy
The subsystems are ranked by severity. Guards and CI health enforce correctness at the pipeline level. Resource and stuck monitoring are advisory and preserve operator control. Backup and reaping are operational hygiene that run in the background without affecting throughput.
| # | Subsystem | Severity | What it blocks | Recovery action | Runs |
|---|---|---|---|---|---|
| 1 | Guard Check | Critical | Entire daemon cycle -- no claiming, no lifecycle poll, no merges | Fix branch protection in GitHub (enable strict mode + add required CI contexts) | Every cycle |
| 2 | CI Health Monitoring | High | New issue claiming; advancement of waiting_merge workers | Fix the base branch (revert bad commit or wait for CI to recover). A single green cycle restores the green state. | Every cycle |
| 3 | Worker Resource Monitoring | Medium | Nothing blocked directly. Files a diagnostic issue and optionally stops the worker. | Review the filed issue, adjust memoryThresholdBytes, cancel or let the worker continue | Every cycle |
| 4 | Stuck Worker Detection | Low | Nothing blocked. Fires a desktop notification only. | Cancel and retry the worker, or raise the stuck threshold for that status | Every cycle |
| 5 | Notifications | Support | N/A -- delivery layer for alerts from subsystems 3 and 4 | Install terminal-notifier; set notificationTarget=terminal | Every cycle (via lifecycle poll) |
| 6 | Orphan Process Reaping | Startup | N/A -- prevents zombie processes from a prior daemon instance | Automatic on restart. Manual only if SIGTERM is blocked by the OS. | Daemon startup only |
| 7 | Database Backup | Background | Nothing blocked. Daily snapshot, non-blocking. | Manual file copy to restore. No UI. Daemon refuses to start if the DB is corrupt. | Once per calendar day |
Conditions that compromise correctness (merging onto a broken main, merging without required CI) are hard-blocked at the cycle level. Conditions that require operator judgment (high memory, long-running workers) are surfaced as notifications without killing the work in progress. Data safety (backups, integrity checks) runs in the background without affecting throughput.