fix(cloud-agent-next): recover leaked sandbox wrappers safely by eshurakov · Pull Request #3555 · Kilo-Org/cloud

eshurakov · 2026-05-28T11:28:55Z

Summary

Prevent shared Cloudflare sandbox disk exhaustion from leaked wrapper processes by durably leasing physical wrappers, fencing launch/reuse/deletion on verified cleanup, and retrying teardown with persisted backoff.
Introduce the Cloudflare-only AgentSandbox seam as the provider boundary for physical discovery, stop, keep-alive, and capacity operations; runtime, supervisor, deletion, interrupt, terminal, and router flows now converge on that lifecycle rather than performing ad-hoc sandbox teardown.
Make cold workspace admission and rolling deployment handling fail-safe: low capacity triggers conservative stale cleanup plus recheck, the new capacity-admission recovery gate is restricted to explicit unusable-filesystem evidence, and legacy wrappers are identified through observable environment markers. Pre-existing recovery destruction for preparation infrastructure failures remains unchanged.

Verification

Automated smoke test
Manual cloud agents test in a local browser

Track physical wrapper ownership and require verified cleanup before reuse or deletion, while rejecting workspace admission when capacity cannot be safely verified.

Preserve stop backoff during retention cleanup and identify legacy bundles through observable environment markers so rolling deploys do not strand or misidentify wrappers.

kilo-code-bot · 2026-05-28T11:34:56Z

Code Review Summary

Status: 1 Issue Remaining (carried forward, intentional by design) | Recommendation: Merge

Executive Summary

The new commit wraps the delivery_accepted lease write in a try/catch so that a transient storage failure during acceptance does not abort supervision — the maintenance alarm's reconcilePhysicalCleanup correctly recovers by re-applying delivery_accepted when it finds active wrapper work at the startupDeadlineAt deadline. The change is correct and covered by a new test.

Overview

Severity	Count
WARNING	1 (carried forward, acknowledged)

Issue Details (click to expand)

RESOLVED

| File | Line | Issue |
|---|---|
| src/agent-sandbox/cloudflare/cloudflare-agent-sandbox.ts | 235 | ~~Warm workspace silently omits onProgress callback before bootstrap~~ — Fixed in commit 078978f90 |

WARNING (carried forward — author acknowledged as intentional)

| File | Line | Issue |
|---|---|
| src/persistence/CloudAgentSession.ts | 1340 | !metadata + non-none lease blocks deletion without driving cleanup |

WARNING — finalizeSessionDeletion: no-metadata + pending lease reschedules alarm but cannot advance cleanup

At src/persistence/CloudAgentSession.ts:1340:

if (!metadata) {
  if ((await getWrapperLease(this.ctx.storage)).state !== 'none') {
    await this.scheduleAlarmAtOrBefore(Date.now() + 1_000);
    return false;
  }

When the metadata key is missing but a wrapper_lease row persists in a non-none state, the alarm is rescheduled every second but runMaintenance is never called so the lease cannot transition toward none. Explicit deleteSession calls continue to throw 'Session deletion pending physical wrapper cleanup' (apparent 500) until the alarm path independently clears the lease.

The PR author has acknowledged this as intentional fail-closed behavior: without metadata the DO cannot reconstruct the provider sandbox to verify physical wrapper absence, so retaining the non-none lease is the safest course. The WARNING is carried forward for visibility but does not block merge.

Changed Files Reviewed (this round — commit e2e56e6)

src/session/agent-runtime.ts — delivery_accepted lease write now swallowed on error; maintenance reconciles via reconcilePhysicalCleanup when startupDeadlineAt elapses with active wrapper work. Correct.
src/session/agent-runtime.test.ts — new test verifies accepted delivery + supervision state survives a failed lease write, and lease retains startupDeadlineAt for maintenance reconciliation. Correct.

Unchanged (carried forward from prior rounds):

src/persistence/CloudAgentSession.ts — 1 issue (acknowledged)
src/agent-sandbox/protocol.ts
src/agent-sandbox/factory.ts
src/execution/orchestrator.ts
src/kilo/wrapper-client.ts
src/kilo/wrapper-manager.ts
src/sandbox-recovery.ts
src/session/wrapper-runtime-state.ts
src/session/wrapper-supervisor.ts
src/workspace.ts
src/workspace-errors.ts
src/workspace.test.ts
src/session/wrapper-supervisor.test.ts
src/router/handlers/session-management.ts
wrapper/src/main.ts
wrapper/src/server.ts
test/integration/session/deletion-lifecycle.test.ts

Fix these issues in Kilo Cloud

_{Reviewed by claude-sonnet-4.6 · 301,043 tokens}

_{Review guidance: REVIEW.md from base branch main}

eshurakov · 2026-05-28T12:56:33Z

Addressed the actionable warm-bootstrap progress finding in 078978f90: ordinary bootstrap now reports the kilo_server startup step on both warm and cold workspaces, with regression coverage for warm delivery.

I did not change the missing-metadata deletion branch. The public delete route short-circuits when session metadata is absent, and the Durable Object cannot reconstruct the provider sandbox or verify physical wrapper absence without that metadata. Retaining a non-none lease in that corrupted/orphaned state is intentionally fail-closed rather than erasing the only durable cleanup signal.

eshurakov added 3 commits May 27, 2026 22:00

refactor(cloud-agent-next): centralize Cloudflare sandbox access

9299e0c

fix(cloud-agent-next): fence physical wrapper lifecycle

547d9d0

Track physical wrapper ownership and require verified cleanup before reuse or deletion, while rejecting workspace admission when capacity cannot be safely verified.

fix(cloud-agent-next): make leased cleanup rollout-safe

c34bdd5

Preserve stop backoff during retention cleanup and identify legacy bundles through observable environment markers so rolling deploys do not strand or misidentify wrappers.

fix(cloud-agent-next): report warm bootstrap progress

078978f

fix(cloud-agent-next): narrow destructive capacity recovery

f71392d

chrarnoldus approved these changes May 29, 2026

View reviewed changes

fix(cloud-agent-next): preserve accepted wrapper supervision

e2e56e6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cloud-agent-next): recover leaked sandbox wrappers safely#3555

fix(cloud-agent-next): recover leaked sandbox wrappers safely#3555
eshurakov wants to merge 6 commits into
mainfrom
polite-suggestion

eshurakov commented May 28, 2026 •

edited

Loading

Uh oh!

kilo-code-bot Bot commented May 28, 2026 •

edited

Loading

RESOLVED

WARNING (carried forward — author acknowledged as intentional)

Uh oh!

eshurakov commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

eshurakov commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Verification

Uh oh!

kilo-code-bot Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Summary

Executive Summary

Overview

RESOLVED

WARNING (carried forward — author acknowledged as intentional)

Uh oh!

eshurakov commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

eshurakov commented May 28, 2026 •

edited

Loading

kilo-code-bot Bot commented May 28, 2026 •

edited

Loading