
ROX-32848: Ack-based retry for VM #20107

Draft
vikin91 wants to merge 2 commits into piotr/ROX-33555-vm-relay-ack-flow from piotr/ROX-32848-vm-relay-payload-cache

Conversation

Contributor

@vikin91 vikin91 commented Apr 20, 2026

Description

Add a bounded, TTL-aware payload cache to the VM relay so that UMH-driven retransmissions can resend the last known report without waiting for the VM agent to push a new one.

What

  • reportPayloadCache — LRU cache (by updatedAt) with configurable max slots and TTL. Backed by container/list + map for O(1) insert/lookup/evict. Uses its own mutex, independent of the metadata cache.
  • Relay integration:
    • Upsert on every incoming report (deduplicates identical payloads via EqualVT).
    • Remove on ACK (confirmed delivery — no reason to keep the payload).
    • Get on UMH retry command → resend cached payload or log a miss.
    • SweepExpired on a periodic ticker (payloadCacheTTL / 2 interval).
  • Prometheus metrics: cache_slots_used, cache_slots_capacity, cache_residency_seconds, cache_lifetime_seconds, cache_lookups_total{hit|miss}.
  • Env vars: ROX_VM_INDEX_REPORT_RELAY_CACHE_SLOTS (default 100), ROX_VM_INDEX_REPORT_RELAY_CACHE_TTL (default 4…).

Why

Without the cache, a UMH retry for a report whose initial send failed (or was not ACKed in time) has no payload to resend. The relay would have to wait for the next VM agent push, which may take minutes. Caching the last payload per VSOCK ID closes this gap.

User-facing documentation

Testing and quality

  • the change is production-ready: it is GA, or the functionality is gated behind a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Ran the full compliance test suite and the relay package tests locally:

go test ./compliance/... ./compliance/virtualmachines/relay/... -count=1 -timeout 120s

All 25 packages pass (including 507 new lines of report_payload_cache_test.go covering LRU eviction order, TTL expiry, bounded sweep budget, capacity=0 disabled mode, Get not promoting recency, Remove duration reporting, and sweep-then-upsert interaction; plus 395 new/modified lines in relay_test.go covering cache-hit resend, cache-miss metric, cache-disabled mode, ACK removes payload entry, and expired-payload-resends-until-sweep-evicts).

AI disclosure

The initial implementation and tests were generated with AI assistance. All code was reviewed, corrected, and validated by the author.

Contributor Author

vikin91 commented Apr 20, 2026

This change is part of the following stack:

Change managed by git-spice.

@openshift-ci

openshift-ci bot commented Apr 20, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 4 issues, and left some high level feedback:

  • The TTL semantics for reportPayloadCache when ttl <= 0 are surprising: entries are treated as immediately expired in evictExpiredFromFrontNoLock and SweepExpired, while the relay disables the sweep ticker for non-positive TTL; consider explicitly documenting or normalizing this (e.g., treating ttl <= 0 as 'no TTL' in the cache layer) so behavior is consistent with the relay configuration.
  • The Get(resourceID string, _ time.Time) method ignores its time argument and does not enforce TTL, which is slightly misleading for callers like handleRetryCommand; consider removing the time.Time parameter or adding a comment on the method signature explaining why it’s intentionally unused to avoid confusion.
## Individual Comments

### Comment 1
<location path="compliance/virtualmachines/relay/report_payload_cache.go" line_range="122-123" />
<code_context>
+		return nil
+	}
+	out := make([]payloadEviction, 0, min(budget, len(c.byID)))
+	for range budget {
+		front := c.lruList.Front()
+		if front == nil {
+			break
</code_context>
<issue_to_address>
**issue (bug_risk):** The `for range budget` loop will not compile; `range` cannot be used directly on an `int`.

Use a counted loop instead, e.g.:

```go
for i := 0; i < budget; i++ {
    front := c.lruList.Front()
    if front == nil {
        break
    }
    // ...
}
```
</issue_to_address>

### Comment 2
<location path="compliance/virtualmachines/relay/report_payload_cache_test.go" line_range="151" />
<code_context>
+	ttl := 10 * time.Minute
+	c := newReportPayloadCache(64, ttl)
+	base := time.Unix(5000, 0)
+	for i := range staleEntries {
+		c.Upsert(fmt.Sprintf("k-%02d", i), relaytest.NewTestVMReport(fmt.Sprintf("%02d", i)), base.Add(time.Duration(i)*time.Second))
+	}
</code_context>
<issue_to_address>
**issue (bug_risk):** The range loop over `staleEntries` will not compile and should be a counted `for` loop.

`staleEntries` is an `int`, so it must be used in a counted loop, for example:

```go
for i := 0; i < staleEntries; i++ {
    c.Upsert(
        fmt.Sprintf("k-%02d", i),
        relaytest.NewTestVMReport(fmt.Sprintf("%02d", i)),
        base.Add(time.Duration(i) * time.Second),
    )
}
```
</issue_to_address>

### Comment 3
<location path="compliance/virtualmachines/relay/relay_test.go" line_range="698-699" />
<code_context>
+	time.Sleep(50 * time.Millisecond)
+	missBefore := testutil.ToFloat64(metrics.IndexReportCacheLookupsTotal.WithLabelValues("miss"))
+
+	umh.retryCh <- "not-cached"
+	time.Sleep(200 * time.Millisecond)
+
+	missDelta := testutil.ToFloat64(metrics.IndexReportCacheLookupsTotal.WithLabelValues("miss")) - missBefore
</code_context>
<issue_to_address>
**suggestion (testing):** Using fixed sleeps to wait for async behavior can make the test flaky; prefer `Eventually`-style polling.

In `TestRelay_RetryCommand_CacheMiss_IncrementsMissAndDoesNotSend`, the fixed `time.Sleep` calls after sending on `retryCh` can be fragile, especially on slower CI. Since you’re already using `s.Eventually` in this file, consider polling until the `IndexReportCacheLookupsTotal{result="miss"}` metric reflects the expected increment and/or the sender’s `sentMessages` length stabilizes, with a reasonable timeout, instead of relying on hardcoded delays.
</issue_to_address>

### Comment 4
<location path="compliance/virtualmachines/relay/report_payload_cache.go" line_range="26" />
<code_context>
+// Capacity: when maxSlots is non-positive, new keys are never cached (Upsert for an unknown resourceID
+// is a no-op). Existing entries for that resource ID are unchanged by this constructor parameter alone;
+// use maxSlots > 0 to store new keys.
+type reportPayloadCache struct {
+	maxSlots int
+	ttl      time.Duration
</code_context>
<issue_to_address>
**issue (complexity):** Consider replacing the list-based LRU structure with a single map and scan-based eviction to simplify the cache implementation while preserving behavior.

You can keep the same behavior (TTL + capacity eviction by oldest `updatedAt`) with a simpler structure by dropping the `container/list` LRU and `*list.Element` plumbing, and doing linear scans over a plain `map` on eviction/sweep. For expected cache sizes, this will likely be easier to reason about and maintain.

### 1. Replace `list.List` + `byID` with a single `map`

Instead of:

```go
type reportPayloadCache struct {
    maxSlots int
    ttl      time.Duration

    mu      sync.Mutex
    byID    map[string]*list.Element // resourceID -> LRU list element
    lruList *list.List               // front = LRU
}

type reportPayloadEntry struct {
    resourceID     string
    report         *v1.VMReport
    updatedAt      time.Time
    firstUpdatedAt time.Time
}
```

you can just do:

```go
type reportPayloadCache struct {
    maxSlots int
    ttl      time.Duration

    mu   sync.Mutex
    byID map[string]*reportPayloadEntry // resourceID -> entry
}

type reportPayloadEntry struct {
    report         *v1.VMReport
    updatedAt      time.Time
    firstUpdatedAt time.Time
}
```

`newReportPayloadCache` then becomes:

```go
func newReportPayloadCache(maxSlots int, ttl time.Duration) *reportPayloadCache {
    return &reportPayloadCache{
        maxSlots: maxSlots,
        ttl:      ttl,
        byID:     make(map[string]*reportPayloadEntry),
    }
}
```

### 2. Implement capacity eviction via a simple scan

You can preserve the “evict oldest `updatedAt`” policy without LRU ordering:

```go
func (c *reportPayloadCache) evictOldestNoLock(now time.Time) (payloadEviction, bool) {
    var (
        oldestKey string
        oldest    *reportPayloadEntry
    )
    for k, ent := range c.byID {
        if oldest == nil || ent.updatedAt.Before(oldest.updatedAt) {
            oldestKey, oldest = k, ent
        }
    }
    if oldest == nil {
        return payloadEviction{}, false
    }
    ev := evictionForEntry(oldestKey, oldest, now)
    delete(c.byID, oldestKey)
    return ev, true
}
```

Then `Upsert` uses this instead of `evictLRUNoLock`:

```go
if len(c.byID) >= c.maxSlots {
    if ev, ok := c.evictOldestNoLock(now); ok {
        evictions = append(evictions, ev)
    }
}
```

### 3. Simplify TTL sweeps to plain map iteration

Instead of front-only sweeps tied to list order, perform the same TTL logic over the map. The “opportunistic, budgeted” behavior can be kept by counting how many expired entries you’ve removed:

```go
func (c *reportPayloadCache) evictExpiredNoLock(now time.Time, budget int) []payloadEviction {
    if budget <= 0 {
        return nil
    }

    out := make([]payloadEviction, 0, min(budget, len(c.byID)))
    for key, ent := range c.byID {
        if now.Before(ent.updatedAt.Add(c.ttl)) {
            continue
        }
        out = append(out, evictionForEntry(key, ent, now))
        delete(c.byID, key)
        if len(out) == budget {
            break
        }
    }
    return out
}
```

`Upsert` and `SweepExpired` then both call this, preserving current semantics but without `list`:

```go
evictions := c.evictExpiredNoLock(now, upsertExpiredEvictionPerInsert)

func (c *reportPayloadCache) SweepExpired(now time.Time) []payloadEviction {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.evictExpiredNoLock(now, len(c.byID))
}
```

### 4. Inline eviction metadata or keep helper but make it key-based

You can avoid some mental overhead by making `evictionForEntry` key-based and struct-agnostic:

```go
func evictionForEntry(key string, ent *reportPayloadEntry, now time.Time) payloadEviction {
    return payloadEviction{
        resourceID: key,
        residency:  now.Sub(ent.updatedAt),
        lifetime:   now.Sub(ent.firstUpdatedAt),
    }
}
```

or just inline this computation at eviction call sites if you prefer fewer helpers.

These changes remove the `list.List` dependency, element bookkeeping, and several `...NoLock` helpers, while preserving all current behaviors (bounded TTL sweep on `Upsert`, full sweep on `SweepExpired`, and “oldest-updated” capacity eviction).
</issue_to_address>



github-actions bot commented Apr 21, 2026

🚀 Build Images Ready

Images are ready for commit 300795b. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.11.x-712-g300795b541


codecov bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 89.87342% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.74%. Comparing base (e66ff31) to head (300795b).

Files with missing lines Patch % Lines
compliance/virtualmachines/relay/relay.go 78.43% 8 Missing and 3 partials ⚠️
...ance/virtualmachines/relay/report_payload_cache.go 97.14% 2 Missing and 1 partial ⚠️
compliance/compliance.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@                          Coverage Diff                          @@
##           piotr/ROX-33555-vm-relay-ack-flow   #20107      +/-   ##
=====================================================================
- Coverage                              49.77%   49.74%   -0.03%     
=====================================================================
  Files                                   2767     2767              
  Lines                                 209666   209629      -37     
=====================================================================
- Hits                                  104356   104289      -67     
- Misses                                 97608    97642      +34     
+ Partials                                7702     7698       -4     
Flag Coverage Δ
go-unit-tests 49.74% <89.87%> (-0.03%) ⬇️

vikin91 added 2 commits April 21, 2026 11:25
Introduce reportPayloadCache with LRU eviction and TTL, wired into the
relay for caching VM index report payloads. On UMH retry, the relay
resends the cached payload instead of waiting for a new VM report.

This is the second half of the split from the ack-flow branch.

Made-with: Cursor
@vikin91 vikin91 force-pushed the piotr/ROX-32848-vm-relay-payload-cache branch from 8a17553 to 300795b Compare April 21, 2026 09:25