Skip to content

fix(cloud-agent-next): recover leaked sandbox wrappers safely#3555

Open
eshurakov wants to merge 6 commits into
mainfrom
polite-suggestion
Open

fix(cloud-agent-next): recover leaked sandbox wrappers safely#3555
eshurakov wants to merge 6 commits into
mainfrom
polite-suggestion

Conversation

@eshurakov
Copy link
Copy Markdown
Contributor

@eshurakov eshurakov commented May 28, 2026

Summary

  • Prevent shared Cloudflare sandbox disk exhaustion from leaked wrapper processes by durably leasing physical wrappers, fencing launch/reuse/deletion on verified cleanup, and retrying teardown with persisted backoff.
  • Introduce the Cloudflare-only AgentSandbox seam as the provider boundary for physical discovery, stop, keep-alive, and capacity operations; runtime, supervisor, deletion, interrupt, terminal, and router flows now converge on that lifecycle rather than performing ad-hoc sandbox teardown.
  • Make cold workspace admission and rolling deployment handling fail-safe: low capacity triggers conservative stale cleanup plus recheck, the new capacity-admission recovery gate is restricted to explicit unusable-filesystem evidence, and legacy wrappers are identified through observable environment markers. Pre-existing recovery destruction for preparation infrastructure failures remains unchanged.

Verification

  • Automated smoke test
  • Manual cloud agents test in a local browser

eshurakov added 3 commits May 27, 2026 22:00
Track physical wrapper ownership and require verified cleanup before reuse or deletion, while rejecting workspace admission when capacity cannot be safely verified.
Preserve stop backoff during retention cleanup and identify legacy bundles through observable environment markers so rolling deploys do not strand or misidentify wrappers.
@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented May 28, 2026

Code Review Summary

Status: 1 Issue Remaining (carried forward, intentional by design) | Recommendation: Merge

Executive Summary

The new commit wraps the delivery_accepted lease write in a try/catch so that a transient storage failure during acceptance does not abort supervision — the maintenance alarm's reconcilePhysicalCleanup correctly recovers by re-applying delivery_accepted when it finds active wrapper work at the startupDeadlineAt deadline. The change is correct and covered by a new test.

Overview

Severity Count
WARNING 1 (carried forward, acknowledged)
Issue Details (click to expand)

RESOLVED

| File | Line | Issue |
|---|---|
| src/agent-sandbox/cloudflare/cloudflare-agent-sandbox.ts | 235 | Warm workspace silently omits onProgress callback before bootstrap — Fixed in commit 078978f90 |

WARNING (carried forward — author acknowledged as intentional)

| File | Line | Issue |
|---|---|
| src/persistence/CloudAgentSession.ts | 1340 | !metadata + non-none lease blocks deletion without driving cleanup |


WARNING — finalizeSessionDeletion: no-metadata + pending lease reschedules alarm but cannot advance cleanup

At src/persistence/CloudAgentSession.ts:1340:

if (!metadata) {
  if ((await getWrapperLease(this.ctx.storage)).state !== 'none') {
    await this.scheduleAlarmAtOrBefore(Date.now() + 1_000);
    return false;
  }

When the metadata key is missing but a wrapper_lease row persists in a non-none state, the alarm is rescheduled every second but runMaintenance is never called so the lease cannot transition toward none. Explicit deleteSession calls continue to throw 'Session deletion pending physical wrapper cleanup' (apparent 500) until the alarm path independently clears the lease.

The PR author has acknowledged this as intentional fail-closed behavior: without metadata the DO cannot reconstruct the provider sandbox to verify physical wrapper absence, so retaining the non-none lease is the safest course. The WARNING is carried forward for visibility but does not block merge.

Changed Files Reviewed (this round — commit e2e56e6)
  • src/session/agent-runtime.tsdelivery_accepted lease write now swallowed on error; maintenance reconciles via reconcilePhysicalCleanup when startupDeadlineAt elapses with active wrapper work. Correct.
  • src/session/agent-runtime.test.ts — new test verifies accepted delivery + supervision state survives a failed lease write, and lease retains startupDeadlineAt for maintenance reconciliation. Correct.

Unchanged (carried forward from prior rounds):

  • src/persistence/CloudAgentSession.ts — 1 issue (acknowledged)
  • src/agent-sandbox/protocol.ts
  • src/agent-sandbox/factory.ts
  • src/execution/orchestrator.ts
  • src/kilo/wrapper-client.ts
  • src/kilo/wrapper-manager.ts
  • src/sandbox-recovery.ts
  • src/session/wrapper-runtime-state.ts
  • src/session/wrapper-supervisor.ts
  • src/workspace.ts
  • src/workspace-errors.ts
  • src/workspace.test.ts
  • src/session/wrapper-supervisor.test.ts
  • src/router/handlers/session-management.ts
  • wrapper/src/main.ts
  • wrapper/src/server.ts
  • test/integration/session/deletion-lifecycle.test.ts

Fix these issues in Kilo Cloud


Reviewed by claude-sonnet-4.6 · 301,043 tokens

Review guidance: REVIEW.md from base branch main

@eshurakov
Copy link
Copy Markdown
Contributor Author

Addressed the actionable warm-bootstrap progress finding in 078978f90: ordinary bootstrap now reports the kilo_server startup step on both warm and cold workspaces, with regression coverage for warm delivery.

I did not change the missing-metadata deletion branch. The public delete route short-circuits when session metadata is absent, and the Durable Object cannot reconstruct the provider sandbox or verify physical wrapper absence without that metadata. Retaining a non-none lease in that corrupted/orphaned state is intentionally fail-closed rather than erasing the only durable cleanup signal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants