ROX-33555: Forward Central VM ACK/NACK to compliance#19322
Conversation
|
This change is part of the following stack: Change managed by git-spice. |
|
Skipping CI for Draft Pull Request. |
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- Consider whether
toComplianceshould be buffered, sinceforwardToComplianceis called inline fromProcessMessageand an unbuffered channel here could back up Central’s ACK processing if the compliance multiplexer is slow or temporarily blocked. - In
forwardToCompliance, you setHostnameandBroadcastbased solely onresourceID; double-check that using the VM resource ID asHostnameand broadcasting only whenresourceID == ""matches the expectations of the compliance side consumers (if not, it may be worth adding a small helper or comment clarifying the mapping).
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider whether `toCompliance` should be buffered, since `forwardToCompliance` is called inline from `ProcessMessage` and an unbuffered channel here could back up Central’s ACK processing if the compliance multiplexer is slow or temporarily blocked.
- In `forwardToCompliance`, you set `Hostname` and `Broadcast` based solely on `resourceID`; double-check that using the VM resource ID as `Hostname` and broadcasting only when `resourceID == ""` matches the expectations of the compliance side consumers (if not, it may be worth adding a small helper or comment clarifying the mapping).
## Individual Comments
### Comment 1
<location path="sensor/common/virtualmachine/index/handler_impl.go" line_range="167" />
<code_context>
return errStartMoreThanOnce
}
h.indexReports = make(chan *v1.IndexReport, env.VirtualMachinesIndexReportsBufferSize.IntegerSetting())
+ h.toCompliance = make(chan common.MessageToComplianceWithAddress)
h.toCentral = h.run(h.indexReports)
return nil
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Consider buffering the compliance channel to avoid blocking ACK processing when Compliance is slow or temporarily unavailable.
Because `toCompliance` is unbuffered, `forwardToCompliance` will block until the consumer reads, and `ProcessMessage` is on the hot ACK/NACK path. If Compliance is slow or backpressured, this can stall ACK handling. A small buffer (as used for other internal channels) would mitigate head-of-line blocking while still preserving ordering.
Suggested implementation:
```golang
if h.toCentral != nil || h.indexReports != nil || h.toCompliance != nil {
return errStartMoreThanOnce
}
h.indexReports = make(chan *v1.IndexReport, env.VirtualMachinesIndexReportsBufferSize.IntegerSetting())
h.toCompliance = make(chan common.MessageToComplianceWithAddress, env.VirtualMachinesToComplianceBufferSize.IntegerSetting())
```
1. Define `VirtualMachinesToComplianceBufferSize` in the `env` package (or equivalent config) following the same pattern used for `VirtualMachinesIndexReportsBufferSize` (type, default, description, etc.).
2. If there is an existing common buffer size setting for internal VM handler channels (e.g., a shared constant/env var), you may want to reuse that instead of introducing a new one, for consistency.
3. Ensure any unit tests that construct `handlerImpl` or validate channel behavior are updated to account for the buffered `toCompliance` channel (e.g., tests that might have assumed it was unbuffered).
</issue_to_address>
### Comment 2
<location path="sensor/common/virtualmachine/index/handler_impl.go" line_range="240-242" />
<code_context>
+ return
+ }
+
+ select {
+ case <-h.stopper.Flow().StopRequested():
+ case h.toCompliance <- common.MessageToComplianceWithAddress{
+ Msg: &sensor.MsgToCompliance{
+ Msg: &sensor.MsgToCompliance_ComplianceAck{
</code_context>
<issue_to_address>
**suggestion:** Consider handling or at least logging when messages are dropped due to sensor shutdown.
Here, when `h.stopper.Flow().StopRequested()` fires, the select exits and the ACK/NACK is silently dropped. If these messages matter for Compliance accounting, this makes shutdown issues hard to trace. Please add at least a debug log or a counter for messages dropped due to shutdown to make this behavior observable.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
Images are ready for the commit at c47fcd6. To use with deploy scripts, first |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #19322 +/- ##
==========================================
+ Coverage 49.63% 49.65% +0.01%
==========================================
Files 2765 2765
Lines 208770 208821 +51
==========================================
+ Hits 103633 103688 +55
+ Misses 97479 97477 -2
+ Partials 7658 7656 -2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
0ad5d24 to
c795476
Compare
99d3e6c to
fa6c362
Compare
c795476 to
cbee5a2
Compare
fa6c362 to
54bb9f9
Compare
cbee5a2 to
283b771
Compare
54bb9f9 to
c47fcd6
Compare
283b771 to
a9270b7
Compare
c47fcd6 to
08f7beb
Compare
08f7beb to
80bc69e
Compare
🚀 Build Images ReadyImages are ready for commit 503713e. To use with deploy scripts: export MAIN_IMAGE_TAG=4.11.x-703-g503713eb10 |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
1 similar comment
✅ Actions performedReview triggered.
|
|
@sourcery-ai review |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sensor/common/virtualmachine/index/handler_impl.go`:
- Around line 219-260: forwardToCompliance synchronously blocks on sending to
h.toCompliance which can hang ProcessMessage() when compliance backpressure
occurs; make the send non-blocking or bounded by context/stopper to avoid
stalling Central. Modify handlerImpl.forwardToCompliance to attempt a
non-blocking send (use a select with the existing stop case, the send case, and
a default case that drops and logs a warning) or perform the send in a
short-lived goroutine with a timeout tied to stopper.Flow().StopRequested();
update references to h.toCompliance and forwardToCompliance accordingly so VM
ACK/NACKs are dropped with a log instead of blocking central processing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 4e9963b9-918c-4633-845c-4b348a016e54
📒 Files selected for processing (4)
sensor/common/virtualmachine/index/handler.gosensor/common/virtualmachine/index/handler_impl.gosensor/common/virtualmachine/index/handler_impl_test.gosensor/kubernetes/sensor/sensor.go
|
@coderabbitai review again |
|
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
sensor/common/virtualmachine/index/handler_impl.go (1)
132-145:⚠️ Potential issue | 🟡 MinorDon’t record unknown
SensorACKactions as ACK metrics.
forwardToCompliance()correctly drops unsupported actions, butProcessMessage()still incrementsIndexReportAcksReceivedfor them viaaction.String(). That can create bogus metric series and silently skew ACK/NACK monitoring. Only increment the metric after the action is recognized asACKorNACK.Possible fix
-func (h *handlerImpl) ProcessMessage(ctx context.Context, msg *central.MsgToSensor) error { +func (h *handlerImpl) ProcessMessage(ctx context.Context, msg *central.MsgToSensor) error { sensorAck := msg.GetSensorAck() if sensorAck == nil || sensorAck.GetMessageType() != central.SensorACK_VM_INDEX_REPORT { return nil } vmID := sensorAck.GetResourceId() action := sensorAck.GetAction() reason := sensorAck.GetReason() - h.forwardToCompliance(ctx, vmID, action, reason) - - metrics.IndexReportAcksReceived.WithLabelValues(action.String()).Inc() + if h.forwardToCompliance(ctx, vmID, action, reason) { + metrics.IndexReportAcksReceived.WithLabelValues(action.String()).Inc() + } return nil } func (h *handlerImpl) forwardToCompliance( ctx context.Context, resourceID string, action central.SensorACK_Action, reason string, -) { +) bool { if h.toCompliance == nil { log.Debug("Compliance channel not initialized; skipping forwarding VM ACK/NACK") - return + return false } var complianceAction sensor.MsgToCompliance_ComplianceACK_Action switch action { case central.SensorACK_ACK: complianceAction = sensor.MsgToCompliance_ComplianceACK_ACK case central.SensorACK_NACK: complianceAction = sensor.MsgToCompliance_ComplianceACK_NACK default: log.Warnf("Unknown SensorACK action for VM index report: %v", action) - return + return false } + + // existing send logic... + return true }Also applies to: 230-239
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sensor/common/virtualmachine/index/handler_impl.go` around lines 132 - 145, ProcessMessage currently increments metrics.IndexReportAcksReceived for any SensorACK action even if forwardToCompliance drops unsupported actions; change it so the metric is only incremented when the action is one of the supported ACK types (e.g., central.SensorACK_ACK or central.SensorACK_NACK) after validating action, and do not call .WithLabelValues(action.String()).Inc() for unknown/unsupported actions—use the same action check/logic used by forwardToCompliance to decide when to increment the metric.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@sensor/common/virtualmachine/index/handler_impl.go`:
- Around line 132-145: ProcessMessage currently increments
metrics.IndexReportAcksReceived for any SensorACK action even if
forwardToCompliance drops unsupported actions; change it so the metric is only
incremented when the action is one of the supported ACK types (e.g.,
central.SensorACK_ACK or central.SensorACK_NACK) after validating action, and do
not call .WithLabelValues(action.String()).Inc() for unknown/unsupported
actions—use the same action check/logic used by forwardToCompliance to decide
when to increment the metric.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 725e503e-efe6-4d33-9520-51f1f1bacb16
📒 Files selected for processing (5)
sensor/common/virtualmachine/index/handler.gosensor/common/virtualmachine/index/handler_impl.gosensor/common/virtualmachine/index/handler_impl_test.gosensor/common/virtualmachine/index/mocks/handler.gosensor/kubernetes/sensor/sensor.go
@coderabbitai this is a design choice, not a bug. Counting all received ACKs (including unknown) is arguably the right thing - you want to know if Central is sending unexpected action types. |
|
Tip For best results, initiate chat on the files or code changes.
|
There was a problem hiding this comment.
Hey - I've found 1 issue
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="sensor/common/virtualmachine/index/handler_impl.go" line_range="143" />
<code_context>
- // TODO(ROX-xxxxx): Implement retry logic or notifying VM relay.
- // Currently, the VM relay has its own retry mechanism, but it's not aware of Central's rate limiting.
- }
+ metrics.IndexReportAcksReceived.WithLabelValues(action.String()).Inc()
return nil
</code_context>
<issue_to_address>
**suggestion:** Metrics are incremented even for unknown ACK actions, which may pollute metric labels
In `forwardToCompliance`, the `default` branch treats any non-ACK/NACK as unknown and returns early, but the metric is still incremented with `action.String()`. This will introduce label values for unsupported actions that Compliance never processes. Consider only incrementing for known actions or tracking unknown actions in a separate metric/label to avoid polluting the primary series.
Suggested implementation:
```golang
```
`.
Here are the changes:
<file_operations>
<file_operation operation="edit" file_path="sensor/common/virtualmachine/index/handler_impl.go">
<<<<<<< SEARCH
metrics.IndexReportAcksReceived.WithLabelValues(action.String()).Inc()
=======
>>>>>>> REPLACE
</file_operation>
</file_operations>
<additional_changes>
Inside `forwardToCompliance` (in `handler_impl.go`), adjust the switch that handles the ACK action (the one using `action` and with a `default` branch that returns early on unknown actions):
1. In the `case` branch that handles the ACK action, add:
```go
metrics.IndexReportAcksReceived.WithLabelValues(action.String()).Inc()
```
close to where the ACK is successfully processed (before returning / sending to Compliance).
2. In the `case` branch that handles the NACK action, also add:
```go
metrics.IndexReportAcksReceived.WithLabelValues(action.String()).Inc()
```
again, only after the NACK has been validated as a supported action and is being processed.
3. Do **not** add any metric increment in the `default` branch; keep it returning early for unknown actions. This ensures that `IndexReportAcksReceived` is only incremented for supported actions and avoids label pollution from unknown values.
4. If you want visibility into unsupported actions, consider adding a separate counter, e.g.:
```go
metrics.IndexReportAcksUnknownActions.Inc()
```
inside the `default` branch, using a metric with no `action` label (or with a bounded "reason" label) so it doesn't create unbounded series.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
/test gke-qa-e2e-tests |
|
Caution Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted. Error details |
|
@coderabbitai check again |
|
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sensor/common/virtualmachine/index/handler_impl.go`:
- Around line 269-277: The current single select can enqueue to h.toCompliance
even after stopper.Flow().StopRequested() is closed; split into two selects:
first select only on ctx.Done() and h.stopper.Flow().StopRequested() and if
either is ready return/skip sending, then a second non-blocking select that
attempts to send msg to h.toCompliance (case h.toCompliance <- msg) with the
existing default warn for a full queue; ensure you reference the same variables
(ctx, resourceID, msg, h.toCompliance, h.stopper.Flow().StopRequested()) and
preserve the original logging messages when returning or when the send is
dropped.
- Around line 139-145: The metric currently uses action.String() which can emit
unknown numeric labels; instead derive a normalized label variable (e.g., label
:= "ACK" if sensorAck.GetAction() equals the ACK enum constant, else "NACK") and
pass that label to metrics.IndexReportAcksReceived.WithLabelValues(label).Inc();
keep the call to forwardToCompliance(ctx, vmID, action, reason) unchanged and
only replace the WithLabelValues argument with the normalized "ACK"/"NACK"
string.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 0d11378e-b9aa-4b67-8cd6-5141cc770dc6
📒 Files selected for processing (5)
sensor/common/virtualmachine/index/handler.gosensor/common/virtualmachine/index/handler_impl.gosensor/common/virtualmachine/index/handler_impl_test.gosensor/common/virtualmachine/index/mocks/handler.gosensor/kubernetes/sensor/sensor.go
|
/test ocp-4-12-nongroovy-e2e-tests |
Sensor's VM index handler now translates Central's SensorACK into ComplianceACK and forwards it to compliance through a dedicated channel, closing the ACK loop so the relay's UMH can track confirmation state. Declares SensorACKSupport capability and registers the handler as a ComplianceComponent on the multiplexer. AI-assisted: code was extracted from another feature branch by AI, with test fix for updated Capabilities. Reviewed and verified by the author.
84e235a to
83bca5c
Compare
Description
Adds forwarding Central's
SensorACKresponses back to compliance through Sensor.Before: Sensor received ACK/NACK from Central for VM index reports but only logged them. The
compliance relay had no visibility into whether Central accepted or rejected a report, so it could
not adjust retry behavior accordingly.
After: Sensor translates
central.SensorACKintosensor.MsgToCompliance_ComplianceACKwithmessage type
VM_INDEX_REPORTand sends it to compliance via a dedicatedtoCompliancechannel.Compliance's
handleVMIndexACK(from another PR) will forward it to the relay's UMH, which updatesper-VM retry state.
AI-assisted: code was extracted from the feature branch by AI, with test fix
for updated Capabilities return value. Reviewed and verified by the author.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Cluster: OCP 4.21.8 (6 nodes),
ROX_VIRTUAL_MACHINES=trueImage:
quay.io/stackrox-io/main:4.11.x-611-gb0e1c76f28VMs:
vm-rhel9onworker-c,vm-rhel10onworker-bI deployed ACS, 2 VMs with agents and triggered 2 scans manually on each. After that, I looked at the metrics and logs.
ACK forwarding metric
This metric is incremented after
forwardToCompliance()runs (handler_impl.go:142-145), confirming all 4 ACKs from Central entered the compliance forwarding path.No ACKs dropped
Zero instances of any
forwardToComplianceerror path in Sensor logs:The forwarding
selecthas four branches - context cancelled, shutdown, happy-path send, and queue-full - each with a log line except the happy path. Absence of all error logs confirms all 4 ACKs were successfully sent to thetoCompliancechannel.Compliance connections established for VM host nodes
Both VM host nodes (
worker-b=vm-rhel10,worker-c=vm-rhel9) have active compliance connections, so the compliance multiplexer has valid routes to deliver the forwarded ACKs.Central accepted all VM index reports (no rate limiting)