ROX-32848: Ack-based retry for VM#20107

vikin91 wants to merge 2 commits into piotr/ROX-33555-vm-relay-ack-flow from
Conversation
This change is part of the following stack. Change managed by git-spice.

Skipping CI for Draft Pull Request.
Hey - I've found 4 issues and left some high-level feedback:
- The TTL semantics for `reportPayloadCache` when `ttl <= 0` are surprising: entries are treated as immediately expired in `evictExpiredFromFrontNoLock` and `SweepExpired`, while the relay disables the sweep ticker for non-positive TTL; consider explicitly documenting or normalizing this (e.g., treating `ttl <= 0` as 'no TTL' in the cache layer) so behavior is consistent with the relay configuration.
- The `Get(resourceID string, _ time.Time)` method ignores its time argument and does not enforce TTL, which is slightly misleading for callers like `handleRetryCommand`; consider removing the `time.Time` parameter or adding a comment on the method signature explaining why it’s intentionally unused to avoid confusion.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The TTL semantics for `reportPayloadCache` when `ttl <= 0` are surprising: entries are treated as immediately expired in `evictExpiredFromFrontNoLock` and `SweepExpired`, while the relay disables the sweep ticker for non-positive TTL; consider explicitly documenting or normalizing this (e.g., treating `ttl <= 0` as 'no TTL' in the cache layer) so behavior is consistent with the relay configuration.
- The `Get(resourceID string, _ time.Time)` method ignores its time argument and does not enforce TTL, which is slightly misleading for callers like `handleRetryCommand`; consider removing the `time.Time` parameter or adding a comment on the method signature explaining why it’s intentionally unused to avoid confusion.
## Individual Comments
### Comment 1
<location path="compliance/virtualmachines/relay/report_payload_cache.go" line_range="122-123" />
<code_context>
+ return nil
+ }
+ out := make([]payloadEviction, 0, min(budget, len(c.byID)))
+ for range budget {
+ front := c.lruList.Front()
+ if front == nil {
+ break
</code_context>
<issue_to_address>
**issue (bug_risk):** The `for range budget` loop will not compile; `range` cannot be used directly on an `int`.
Use a counted loop instead, e.g.:
```go
for i := 0; i < budget; i++ {
front := c.lruList.Front()
if front == nil {
break
}
// ...
}
```
</issue_to_address>
### Comment 2
<location path="compliance/virtualmachines/relay/report_payload_cache_test.go" line_range="151" />
<code_context>
+ ttl := 10 * time.Minute
+ c := newReportPayloadCache(64, ttl)
+ base := time.Unix(5000, 0)
+ for i := range staleEntries {
+ c.Upsert(fmt.Sprintf("k-%02d", i), relaytest.NewTestVMReport(fmt.Sprintf("%02d", i)), base.Add(time.Duration(i)*time.Second))
+ }
</code_context>
<issue_to_address>
**issue (bug_risk):** The range loop over `staleEntries` will not compile and should be a counted `for` loop.
`staleEntries` is an `int`, so it must be used in a counted loop, for example:
```go
for i := 0; i < staleEntries; i++ {
c.Upsert(
fmt.Sprintf("k-%02d", i),
relaytest.NewTestVMReport(fmt.Sprintf("%02d", i)),
base.Add(time.Duration(i) * time.Second),
)
}
```
</issue_to_address>
### Comment 3
<location path="compliance/virtualmachines/relay/relay_test.go" line_range="698-699" />
<code_context>
+ time.Sleep(50 * time.Millisecond)
+ missBefore := testutil.ToFloat64(metrics.IndexReportCacheLookupsTotal.WithLabelValues("miss"))
+
+ umh.retryCh <- "not-cached"
+ time.Sleep(200 * time.Millisecond)
+
+ missDelta := testutil.ToFloat64(metrics.IndexReportCacheLookupsTotal.WithLabelValues("miss")) - missBefore
</code_context>
<issue_to_address>
**suggestion (testing):** Using fixed sleeps to wait for async behavior can make the test flaky; prefer `Eventually`-style polling.
In `TestRelay_RetryCommand_CacheMiss_IncrementsMissAndDoesNotSend`, the fixed `time.Sleep` calls after sending on `retryCh` can be fragile, especially on slower CI. Since you’re already using `s.Eventually` in this file, consider polling until the `IndexReportCacheLookupsTotal{result="miss"}` metric reflects the expected increment and/or the sender’s `sentMessages` length stabilizes, with a reasonable timeout, instead of relying on hardcoded delays.
</issue_to_address>
### Comment 4
<location path="compliance/virtualmachines/relay/report_payload_cache.go" line_range="26" />
<code_context>
+// Capacity: when maxSlots is non-positive, new keys are never cached (Upsert for an unknown resourceID
+// is a no-op). Existing entries for that resource ID are unchanged by this constructor parameter alone;
+// use maxSlots > 0 to store new keys.
+type reportPayloadCache struct {
+ maxSlots int
+ ttl time.Duration
</code_context>
<issue_to_address>
**issue (complexity):** Consider replacing the list-based LRU structure with a single map and scan-based eviction to simplify the cache implementation while preserving behavior.
You can keep the same behavior (TTL + capacity eviction by oldest `updatedAt`) with a simpler structure by dropping the `container/list` LRU and `*list.Element` plumbing, and doing linear scans over a plain `map` on eviction/sweep. For expected cache sizes, this will likely be easier to reason about and maintain.
### 1. Replace `list.List` + `byID` with a single `map`
Instead of:
```go
type reportPayloadCache struct {
maxSlots int
ttl time.Duration
mu sync.Mutex
byID map[string]*list.Element // resourceID -> LRU list element
lruList *list.List // front = LRU
}
type reportPayloadEntry struct {
resourceID string
report *v1.VMReport
updatedAt time.Time
firstUpdatedAt time.Time
}
```
you can just do:
```go
type reportPayloadCache struct {
maxSlots int
ttl time.Duration
mu sync.Mutex
byID map[string]*reportPayloadEntry // resourceID -> entry
}
type reportPayloadEntry struct {
report *v1.VMReport
updatedAt time.Time
firstUpdatedAt time.Time
}
```
`newReportPayloadCache` then becomes:
```go
func newReportPayloadCache(maxSlots int, ttl time.Duration) *reportPayloadCache {
return &reportPayloadCache{
maxSlots: maxSlots,
ttl: ttl,
byID: make(map[string]*reportPayloadEntry),
}
}
```
### 2. Implement capacity eviction via a simple scan
You can preserve the “evict oldest `updatedAt`” policy without LRU ordering:
```go
func (c *reportPayloadCache) evictOldestNoLock(now time.Time) (payloadEviction, bool) {
var (
oldestKey string
oldest *reportPayloadEntry
)
for k, ent := range c.byID {
if oldest == nil || ent.updatedAt.Before(oldest.updatedAt) {
oldestKey, oldest = k, ent
}
}
if oldest == nil {
return payloadEviction{}, false
}
ev := evictionForEntry(oldestKey, oldest, now)
delete(c.byID, oldestKey)
return ev, true
}
```
Then `Upsert` uses this instead of `evictLRUNoLock`:
```go
if len(c.byID) >= c.maxSlots {
if ev, ok := c.evictOldestNoLock(now); ok {
evictions = append(evictions, ev)
}
}
```
### 3. Simplify TTL sweeps to plain map iteration
Instead of front-only sweeps tied to list order, perform the same TTL logic over the map. The “opportunistic, budgeted” behavior can be kept by counting how many expired entries you’ve removed:
```go
func (c *reportPayloadCache) evictExpiredNoLock(now time.Time, budget int) []payloadEviction {
if budget <= 0 {
return nil
}
out := make([]payloadEviction, 0, min(budget, len(c.byID)))
for key, ent := range c.byID {
if now.Before(ent.updatedAt.Add(c.ttl)) {
continue
}
out = append(out, evictionForEntry(key, ent, now))
delete(c.byID, key)
if len(out) == budget {
break
}
}
return out
}
```
`Upsert` and `SweepExpired` then both call this, preserving current semantics but without `list`:
```go
evictions := c.evictExpiredNoLock(now, upsertExpiredEvictionPerInsert)
func (c *reportPayloadCache) SweepExpired(now time.Time) []payloadEviction {
c.mu.Lock()
defer c.mu.Unlock()
return c.evictExpiredNoLock(now, len(c.byID))
}
```
### 4. Inline eviction metadata or keep helper but make it key-based
You can avoid some mental overhead by making `evictionForEntry` key-based and struct-agnostic:
```go
func evictionForEntry(key string, ent *reportPayloadEntry, now time.Time) payloadEviction {
return payloadEviction{
resourceID: key,
residency: now.Sub(ent.updatedAt),
lifetime: now.Sub(ent.firstUpdatedAt),
}
}
```
or just inline this computation at eviction call sites if you prefer fewer helpers.
These changes remove the `list.List` dependency, element bookkeeping, and several `...NoLock` helpers, while preserving all current behaviors (bounded TTL sweep on `Upsert`, full sweep on `SweepExpired`, and “oldest-updated” capacity eviction).
</issue_to_address>
Force-pushed f19a2e6 to 8a17553.
🚀 Build Images Ready

Images are ready for commit 300795b. To use with deploy scripts: export MAIN_IMAGE_TAG=4.11.x-712-g300795b541
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## piotr/ROX-33555-vm-relay-ack-flow #20107 +/- ##
=====================================================================
- Coverage 49.77% 49.74% -0.03%
=====================================================================
Files 2767 2767
Lines 209666 209629 -37
=====================================================================
- Hits 104356 104289 -67
- Misses 97608 97642 +34
+ Partials 7702 7698 -4
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Introduce reportPayloadCache with LRU eviction and TTL, wired into the relay for caching VM index report payloads. On UMH retry, the relay resends the cached payload instead of waiting for a new VM report. This is the second half of the split from the ack-flow branch. Made-with: Cursor
Force-pushed 8a17553 to 300795b.
Description
Add a bounded, TTL-aware payload cache to the VM relay so that UMH-driven retransmissions can resend the last known report without waiting for the VM agent to push a new one.
What
- `reportPayloadCache` — LRU cache (by `updatedAt`) with configurable max slots and TTL. Backed by `container/list` + map for O(1) insert/lookup/evict. Uses its own mutex, independent of the metadata cache.
- (`EqualVT`).
- Sweep ticker (`payloadCacheTTL / 2` interval).
- Metrics: `cache_slots_used`, `cache_slots_capacity`, `cache_residency_seconds`, `cache_lifetime_seconds`, `cache_lookups_total{hit|miss}`.
- Env vars: `ROX_VM_INDEX_REPORT_RELAY_CACHE_SLOTS` (default 100), `ROX_VM_INDEX_REPORT_RELAY_CACHE_TTL` (default 4

Why

Without the cache, a UMH retry for a report whose initial send failed (or was not ACKed in time) has no payload to resend. The relay would have to wait for the next VM agent push, which may take minutes. Caching the last payload per VSOCK ID closes this gap.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Ran the full compliance test suite and the relay package tests locally:
go test ./compliance/... ./compliance/virtualmachines/relay/... -count=1 -timeout 120s

All 25 packages pass (including 507 new lines of `report_payload_cache_test.go` covering LRU eviction order, TTL expiry, bounded sweep budget, capacity=0 disabled mode, `Get` not promoting recency, `Remove` duration reporting, and sweep-then-upsert interaction; plus 395 new/modified lines in `relay_test.go` covering cache-hit resend, cache-miss metric, cache-disabled mode, ACK removes payload entry, and expired-payload-resends-until-sweep-evicts).
The initial implementation and tests were generated with AI assistance. All code was reviewed, corrected, and validated by the author.