
ROX-32848: Ack-based retry for VM #20107

Draft
vikin91 wants to merge 2 commits into piotr/ROX-33555-vm-relay-ack-flow from piotr/ROX-32848-vm-relay-payload-cache

Conversation

Contributor

@vikin91 vikin91 commented Apr 20, 2026

Description

Add a bounded, TTL-aware payload cache to the VM relay so that UMH-driven retransmissions can resend the last known report without waiting for the VM agent to push a new one.

What

  • reportPayloadCache — LRU cache (by updatedAt) with configurable max slots and TTL. Backed by container/list + map for O(1) insert/lookup/evict. Uses its own mutex, independent of the metadata cache.
  • Relay integration:
    • Upsert on every incoming report (deduplicates identical payloads via EqualVT).
    • Remove on ACK (confirmed delivery — no reason to keep the payload).
    • Get on UMH retry command → resend cached payload or log a miss.
    • SweepExpired on a periodic ticker (payloadCacheTTL / 2 interval).
  • Prometheus metrics: cache_slots_used, cache_slots_capacity, cache_residency_seconds, cache_lifetime_seconds, cache_lookups_total{hit|miss}.
  • Env vars: ROX_VM_INDEX_REPORT_RELAY_CACHE_SLOTS (default 100), ROX_VM_INDEX_REPORT_RELAY_CACHE_TTL (default 4…).

Why

Without the cache, a UMH retry for a report whose initial send failed (or was not ACKed in time) has no payload to resend. The relay would have to wait for the next VM agent push, which may take minutes. Caching the last payload per VSOCK ID closes this gap.

User-facing documentation

Testing and quality

  • the change is production-ready: it is GA, or the functionality is gated behind a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Ran the full compliance test suite and the relay package tests locally:

go test ./compliance/... ./compliance/virtualmachines/relay/... -count=1 -timeout 120s

All 25 packages pass (including 507 new lines of report_payload_cache_test.go covering LRU eviction order, TTL expiry, bounded sweep budget, capacity=0 disabled mode, Get not promoting recency, Remove duration reporting, and sweep-then-upsert interaction; plus 395 new/modified lines in relay_test.go covering cache-hit resend, cache-miss metric, cache-disabled mode, ACK removes payload entry, and expired-payload-resends-until-sweep-evicts).

AI disclosure

The initial implementation and tests were generated with AI assistance. All code was reviewed, corrected, and validated by the author.

Contributor Author

vikin91 commented Apr 20, 2026

This change is part of the following stack:

Change managed by git-spice.

@openshift-ci

openshift-ci bot commented Apr 20, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 4 issues, and left some high level feedback:

  • The TTL semantics for reportPayloadCache when ttl <= 0 are surprising: entries are treated as immediately expired in evictExpiredFromFrontNoLock and SweepExpired, while the relay disables the sweep ticker for non-positive TTL; consider explicitly documenting or normalizing this (e.g., treating ttl <= 0 as 'no TTL' in the cache layer) so behavior is consistent with the relay configuration.
  • The Get(resourceID string, _ time.Time) method ignores its time argument and does not enforce TTL, which is slightly misleading for callers like handleRetryCommand; consider removing the time.Time parameter or adding a comment on the method signature explaining why it’s intentionally unused to avoid confusion.
## Individual Comments

### Comment 1
<location path="compliance/virtualmachines/relay/report_payload_cache.go" line_range="122-123" />
<code_context>
+		return nil
+	}
+	out := make([]payloadEviction, 0, min(budget, len(c.byID)))
+	for range budget {
+		front := c.lruList.Front()
+		if front == nil {
+			break
</code_context>
<issue_to_address>
**issue (bug_risk):** The `for range budget` loop will not compile; `range` cannot be used directly on an `int`.

Use a counted loop instead, e.g.:

```go
for i := 0; i < budget; i++ {
    front := c.lruList.Front()
    if front == nil {
        break
    }
    // ...
}
```
</issue_to_address>

### Comment 2
<location path="compliance/virtualmachines/relay/report_payload_cache_test.go" line_range="151" />
<code_context>
+	ttl := 10 * time.Minute
+	c := newReportPayloadCache(64, ttl)
+	base := time.Unix(5000, 0)
+	for i := range staleEntries {
+		c.Upsert(fmt.Sprintf("k-%02d", i), relaytest.NewTestVMReport(fmt.Sprintf("%02d", i)), base.Add(time.Duration(i)*time.Second))
+	}
</code_context>
<issue_to_address>
**issue (bug_risk):** The range loop over `staleEntries` will not compile and should be a counted `for` loop.

`staleEntries` is an `int`, so it must be used in a counted loop, for example:

```go
for i := 0; i < staleEntries; i++ {
    c.Upsert(
        fmt.Sprintf("k-%02d", i),
        relaytest.NewTestVMReport(fmt.Sprintf("%02d", i)),
        base.Add(time.Duration(i) * time.Second),
    )
}
```
</issue_to_address>

### Comment 3
<location path="compliance/virtualmachines/relay/relay_test.go" line_range="698-699" />
<code_context>
+	time.Sleep(50 * time.Millisecond)
+	missBefore := testutil.ToFloat64(metrics.IndexReportCacheLookupsTotal.WithLabelValues("miss"))
+
+	umh.retryCh <- "not-cached"
+	time.Sleep(200 * time.Millisecond)
+
+	missDelta := testutil.ToFloat64(metrics.IndexReportCacheLookupsTotal.WithLabelValues("miss")) - missBefore
</code_context>
<issue_to_address>
**suggestion (testing):** Using fixed sleeps to wait for async behavior can make the test flaky; prefer `Eventually`-style polling.

In `TestRelay_RetryCommand_CacheMiss_IncrementsMissAndDoesNotSend`, the fixed `time.Sleep` calls after sending on `retryCh` can be fragile, especially on slower CI. Since you’re already using `s.Eventually` in this file, consider polling until the `IndexReportCacheLookupsTotal{result="miss"}` metric reflects the expected increment and/or the sender’s `sentMessages` length stabilizes, with a reasonable timeout, instead of relying on hardcoded delays.
</issue_to_address>

### Comment 4
<location path="compliance/virtualmachines/relay/report_payload_cache.go" line_range="26" />
<code_context>
+// Capacity: when maxSlots is non-positive, new keys are never cached (Upsert for an unknown resourceID
+// is a no-op). Existing entries for that resource ID are unchanged by this constructor parameter alone;
+// use maxSlots > 0 to store new keys.
+type reportPayloadCache struct {
+	maxSlots int
+	ttl      time.Duration
</code_context>
<issue_to_address>
**issue (complexity):** Consider replacing the list-based LRU structure with a single map and scan-based eviction to simplify the cache implementation while preserving behavior.

You can keep the same behavior (TTL + capacity eviction by oldest `updatedAt`) with a simpler structure by dropping the `container/list` LRU and `*list.Element` plumbing, and doing linear scans over a plain `map` on eviction/sweep. For expected cache sizes, this will likely be easier to reason about and maintain.

### 1. Replace `list.List` + `byID` with a single `map`

Instead of:

```go
type reportPayloadCache struct {
    maxSlots int
    ttl      time.Duration

    mu      sync.Mutex
    byID    map[string]*list.Element // resourceID -> LRU list element
    lruList *list.List               // front = LRU
}

type reportPayloadEntry struct {
    resourceID     string
    report         *v1.VMReport
    updatedAt      time.Time
    firstUpdatedAt time.Time
}
```

you can just do:

```go
type reportPayloadCache struct {
    maxSlots int
    ttl      time.Duration

    mu   sync.Mutex
    byID map[string]*reportPayloadEntry // resourceID -> entry
}

type reportPayloadEntry struct {
    report         *v1.VMReport
    updatedAt      time.Time
    firstUpdatedAt time.Time
}
```

`newReportPayloadCache` then becomes:

```go
func newReportPayloadCache(maxSlots int, ttl time.Duration) *reportPayloadCache {
    return &reportPayloadCache{
        maxSlots: maxSlots,
        ttl:      ttl,
        byID:     make(map[string]*reportPayloadEntry),
    }
}
```

### 2. Implement capacity eviction via a simple scan

You can preserve the “evict oldest `updatedAt`” policy without LRU ordering:

```go
func (c *reportPayloadCache) evictOldestNoLock(now time.Time) (payloadEviction, bool) {
    var (
        oldestKey string
        oldest    *reportPayloadEntry
    )
    for k, ent := range c.byID {
        if oldest == nil || ent.updatedAt.Before(oldest.updatedAt) {
            oldestKey, oldest = k, ent
        }
    }
    if oldest == nil {
        return payloadEviction{}, false
    }
    ev := evictionForEntry(oldestKey, oldest, now)
    delete(c.byID, oldestKey)
    return ev, true
}
```

Then `Upsert` uses this instead of `evictLRUNoLock`:

```go
if len(c.byID) >= c.maxSlots {
    if ev, ok := c.evictOldestNoLock(now); ok {
        evictions = append(evictions, ev)
    }
}
```

### 3. Simplify TTL sweeps to plain map iteration

Instead of front-only sweeps tied to list order, perform the same TTL logic over the map. The “opportunistic, budgeted” behavior can be kept by counting how many expired entries you’ve removed:

```go
func (c *reportPayloadCache) evictExpiredNoLock(now time.Time, budget int) []payloadEviction {
    if budget <= 0 {
        return nil
    }

    out := make([]payloadEviction, 0, min(budget, len(c.byID)))
    for key, ent := range c.byID {
        if now.Before(ent.updatedAt.Add(c.ttl)) {
            continue
        }
        out = append(out, evictionForEntry(key, ent, now))
        delete(c.byID, key)
        if len(out) == budget {
            break
        }
    }
    return out
}
```

`Upsert` and `SweepExpired` then both call this, preserving current semantics but without `list`:

```go
evictions := c.evictExpiredNoLock(now, upsertExpiredEvictionPerInsert)

func (c *reportPayloadCache) SweepExpired(now time.Time) []payloadEviction {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.evictExpiredNoLock(now, len(c.byID))
}
```

### 4. Inline eviction metadata or keep helper but make it key-based

You can avoid some mental overhead by making `evictionForEntry` key-based and struct-agnostic:

```go
func evictionForEntry(key string, ent *reportPayloadEntry, now time.Time) payloadEviction {
    return payloadEviction{
        resourceID: key,
        residency:  now.Sub(ent.updatedAt),
        lifetime:   now.Sub(ent.firstUpdatedAt),
    }
}
```

or just inline this computation at eviction call sites if you prefer fewer helpers.

These changes remove the `list.List` dependency, element bookkeeping, and several `...NoLock` helpers, while preserving all current behaviors (bounded TTL sweep on `Upsert`, full sweep on `SweepExpired`, and “oldest-updated” capacity eviction).
</issue_to_address>



github-actions bot commented Apr 21, 2026

🚀 Build Images Ready

Images are ready for commit 300795b. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.11.x-712-g300795b541


codecov bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 89.87342% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.74%. Comparing base (e66ff31) to head (300795b).

Files with missing lines Patch % Lines
compliance/virtualmachines/relay/relay.go 78.43% 8 Missing and 3 partials ⚠️
...ance/virtualmachines/relay/report_payload_cache.go 97.14% 2 Missing and 1 partial ⚠️
compliance/compliance.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@                          Coverage Diff                          @@
##           piotr/ROX-33555-vm-relay-ack-flow   #20107      +/-   ##
=====================================================================
- Coverage                              49.77%   49.74%   -0.03%     
=====================================================================
  Files                                   2767     2767              
  Lines                                 209666   209629      -37     
=====================================================================
- Hits                                  104356   104289      -67     
- Misses                                 97608    97642      +34     
+ Partials                                7702     7698       -4     
Flag Coverage Δ
go-unit-tests 49.74% <89.87%> (-0.03%) ⬇️

vikin91 added 2 commits April 21, 2026 11:25
Introduce reportPayloadCache with LRU eviction and TTL, wired into the
relay for caching VM index report payloads. On UMH retry, the relay
resends the cached payload instead of waiting for a new VM report.

This is the second half of the split from the ack-flow branch.

Made-with: Cursor
@vikin91 vikin91 force-pushed the piotr/ROX-32848-vm-relay-payload-cache branch from 8a17553 to 300795b Compare April 21, 2026 09:25