
ROX-32848: Ack-based retry for VM #20105

Closed
vikin91 wants to merge 2 commits into piotr/ROX-32316-vm-relay-ack-flow from piotr/ROX-32848-vm-relay-payload-cache

vikin91 commented Apr 20, 2026

Description

Add a bounded, TTL-aware payload cache to the VM relay so that UMH-driven retransmissions can resend the last known report without waiting for the VM agent to push a new one.

What

  • reportPayloadCache — LRU cache (by updatedAt) with configurable max slots and TTL. Backed by container/list + map for O(1) insert/lookup/evict. Uses its own mutex, independent of the metadata cache.
  • Relay integration:
    • Upsert on every incoming report (deduplicates identical payloads via EqualVT).
    • Remove on ACK (confirmed delivery — no reason to keep the payload).
    • Get on UMH retry command → resend cached payload or log a miss.
    • SweepExpired on a periodic ticker (payloadCacheTTL / 2 interval).
  • Prometheus metrics: cache_slots_used, cache_slots_capacity, cache_residency_seconds, cache_lifetime_seconds, cache_lookups_total{hit|miss}.
  • Env vars: ROX_VM_INDEX_REPORT_RELAY_CACHE_SLOTS (default 100), ROX_VM_INDEX_REPORT_RELAY_CACHE_TTL (default 4h).
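
For readers without the diff at hand, the cache described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the type name payloadCache, the uint32 key standing in for the VSOCK ID, the method signatures, and the defaultTTL constant are all assumptions. The EqualVT deduplication, the bounded sweep budget, and the Prometheus metrics are left out. In this sketch a TTL of 0 means entries expire at the next sweep.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
	"time"
)

const defaultTTL = 4 * time.Hour // mirrors the 4h default TTL quoted above

// entry is one cached payload. The uint32 key stands in for the VSOCK ID (assumed type).
type entry struct {
	key       uint32
	payload   []byte
	updatedAt time.Time
}

// payloadCache is a hypothetical stand-in for reportPayloadCache.
type payloadCache struct {
	mu       sync.Mutex
	capacity int
	ttl      time.Duration
	order    *list.List               // front = most recently updated
	items    map[uint32]*list.Element // key -> node in order
}

func newPayloadCache(capacity int, ttl time.Duration) *payloadCache {
	return &payloadCache{capacity: capacity, ttl: ttl, order: list.New(), items: make(map[uint32]*list.Element)}
}

// Upsert stores or refreshes a payload, evicting the least recently updated entry when full.
func (c *payloadCache) Upsert(key uint32, payload []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.capacity <= 0 {
		return // capacity 0 disables the cache
	}
	if el, ok := c.items[key]; ok {
		c.removeElement(el) // re-inserted below as the newest entry
	} else if c.order.Len() >= c.capacity {
		c.removeElement(c.order.Back())
	}
	c.items[key] = c.order.PushFront(&entry{key: key, payload: payload, updatedAt: time.Now()})
}

// Get returns the cached payload without promoting recency or checking the TTL.
func (c *payloadCache) Get(key uint32) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		return el.Value.(*entry).payload, true
	}
	return nil, false
}

// Remove drops an entry, for example after an ACK, and reports whether it existed.
func (c *payloadCache) Remove(key uint32) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if ok {
		c.removeElement(el)
	}
	return ok
}

// SweepExpired evicts entries whose age has reached the TTL and returns how many were removed.
func (c *payloadCache) SweepExpired() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	now, removed := time.Now(), 0
	for el := c.order.Back(); el != nil; el = c.order.Back() {
		if now.Sub(el.Value.(*entry).updatedAt) < c.ttl {
			break // entries toward the front are newer, so the rest are still fresh
		}
		c.removeElement(el)
		removed++
	}
	return removed
}

// removeElement unlinks a node from both structures; the caller must hold c.mu.
func (c *payloadCache) removeElement(el *list.Element) {
	delete(c.items, el.Value.(*entry).key)
	c.order.Remove(el)
}

func main() {
	c := newPayloadCache(1, defaultTTL)
	c.Upsert(1, []byte("a"))
	c.Upsert(2, []byte("b")) // cache is full, so key 1 is evicted
	_, ok := c.Get(1)
	fmt.Println("key 1 cached:", ok, "swept:", c.SweepExpired())
}
```

Because the list stays ordered by updatedAt, both eviction and the sweep only ever look at the back of the list. Get deliberately skips the TTL check, so an expired payload can still be resent until a sweep removes it.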

Why

Without the cache, a UMH retry for a report whose initial send failed (or was not ACKed in time) has no payload to resend. The relay would have to wait for the next VM agent push, which may take minutes. Caching the last payload per VSOCK ID closes this gap.
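
The retry path this enables can be sketched as below. This is a hedged illustration only: payloadStore, mapStore, handleRetry, handleACK, and errCacheMiss are hypothetical names, the uint32 key is an assumed type for the VSOCK ID, and the real relay also logs the miss and updates metrics, which is omitted here.

```go
package main

import (
	"errors"
	"fmt"
)

// payloadStore is the hypothetical subset of the cache that the retry path needs.
type payloadStore interface {
	Get(vsockID uint32) ([]byte, bool)
	Remove(vsockID uint32) bool
}

// mapStore is a toy in-memory store used only to keep this sketch self-contained.
type mapStore map[uint32][]byte

func (m mapStore) Get(id uint32) ([]byte, bool) {
	p, ok := m[id]
	return p, ok
}

func (m mapStore) Remove(id uint32) bool {
	_, ok := m[id]
	delete(m, id)
	return ok
}

var errCacheMiss = errors.New("no cached payload for VSOCK ID")

// handleRetry resends the last known payload for id, or reports a cache miss
// so the caller can log it and wait for the next VM agent push.
func handleRetry(store payloadStore, id uint32, send func([]byte) error) error {
	payload, ok := store.Get(id)
	if !ok {
		return errCacheMiss
	}
	return send(payload)
}

// handleACK drops the payload once delivery is confirmed.
func handleACK(store payloadStore, id uint32) {
	store.Remove(id)
}

func main() {
	store := mapStore{42: []byte("report")}
	send := func(p []byte) error {
		fmt.Printf("resending %d bytes\n", len(p))
		return nil
	}
	fmt.Println("retry before ACK:", handleRetry(store, 42, send))
	handleACK(store, 42)
	fmt.Println("retry after ACK:", handleRetry(store, 42, send))
}
```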

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Ran the full compliance test suite and the relay package tests locally:

go test ./compliance/... ./compliance/virtualmachines/relay/... -count=1 -timeout 120s

All 25 packages pass. The new report_payload_cache_test.go (507 lines) covers LRU eviction order, TTL expiry, the bounded sweep budget, the capacity=0 disabled mode, Get not promoting recency, Remove duration reporting, and the sweep-then-upsert interaction. A further 395 new or modified lines in relay_test.go cover cache-hit resend, the cache-miss metric, cache-disabled mode, ACK removing the payload entry, and an expired payload being resent until a sweep evicts it.

AI disclosure

The initial implementation and tests were generated with AI assistance. All code was reviewed, corrected, and validated by the author.

vikin91 added 2 commits April 20, 2026 16:56
Introduce reportPayloadCache with LRU eviction and TTL, wired into the
relay for caching VM index report payloads. On UMH retry, the relay
resends the cached payload instead of waiting for a new VM report.

This is the second half of the split from the ack-flow branch.

Made-with: Cursor

vikin91 commented Apr 20, 2026

This change is part of the following stack:

Change managed by git-spice.


openshift-ci bot commented Apr 20, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

vikin91 deleted the branch piotr/ROX-32316-vm-relay-ack-flow April 20, 2026 15:12
vikin91 closed this Apr 20, 2026

vikin91 commented Apr 20, 2026

Closed automatically by renaming branch. New PR: #20107
