Monitoring tutorial

Build custom monitoring dashboards and alerts for your OpenCue render farm


This tutorial walks you through setting up monitoring for your OpenCue render farm, creating custom Grafana dashboards, and configuring alerts.

Prerequisites

This tutorial assumes the full monitoring stack is already running. From the OpenCue repository root, start it with docker compose -f sandbox/docker-compose.monitoring-full.yml up -d (the same compose file used in the Cleanup section below), and make sure Cuebot is publishing metrics and events.

Monitoring stack components

Component           Purpose                         URL                    Port
------------------  ------------------------------  ---------------------  ----
Grafana             Dashboards and visualization    http://localhost:3000  3000
Prometheus          Metrics collection              http://localhost:9090  9090
Kafka UI            Event stream browser            http://localhost:8090  8090
Kibana              Elasticsearch visualization     http://localhost:5601  5601
Elasticsearch       Historical data storage         http://localhost:9200  9200
Kafka               Event streaming                 localhost:9092         9092
monitoring-indexer  Kafka to Elasticsearch indexer  -                      -
Zookeeper           Kafka coordination              localhost:2181         2181
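
Before continuing, you can verify that the HTTP-based services are reachable. A quick sketch using only the Python standard library and the default ports from the table above (the health endpoints are each service's standard ones, not OpenCue-specific):

#!/usr/bin/env python3
"""Quick reachability check for the monitoring stack's HTTP services."""

import urllib.error
import urllib.request

# Service name -> standard health endpoint, using the default ports above
SERVICES = {
    "Grafana": "http://localhost:3000/api/health",
    "Prometheus": "http://localhost:9090/-/healthy",
    "Kibana": "http://localhost:5601/api/status",
    "Elasticsearch": "http://localhost:9200/_cluster/health",
}

for name, url in SERVICES.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: OK (HTTP {resp.status})")
    except (urllib.error.URLError, OSError) as exc:
        print(f"{name}: UNREACHABLE ({exc})")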

[Screenshot: OpenCue Monitoring Grafana Dashboard]

[Screenshot: Prometheus Metrics Interface]

[Screenshot: UI for Apache Kafka]

[Screenshot: Kibana Dev Tools]

[Screenshot: Elasticsearch]

Tutorial goals

By the end of this tutorial, you will:

  1. Create a custom Grafana dashboard for job monitoring
  2. Build a Prometheus alert for failed frames
  3. Set up a Kafka consumer to process events
  4. Query historical data in Elasticsearch

Part 1: Creating a custom Grafana dashboard

Step 1: Access Grafana

  1. Open Grafana at http://localhost:3000
  2. Log in with username admin and password admin
  3. Click Dashboards in the left menu

Step 2: Create a new dashboard

  1. Click New > New Dashboard
  2. Click Add visualization
  3. Select Prometheus as the data source

Step 3: Add a frame completion panel

Create a time series panel showing frame completions:

  1. In the Query tab, enter:
    sum(increase(cue_frames_completed_total[5m])) by (state)
    
  2. Configure the panel:
    • Title: “Frames Completed by State (5m)”
    • Legend: {{state}}
    • Unit: short
  3. Click Apply
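
To sanity-check the query above outside Grafana, you can run it against the Prometheus HTTP API directly. A minimal sketch, assuming Prometheus at its default localhost:9090:

#!/usr/bin/env python3
"""Run the panel's PromQL query against the Prometheus HTTP API."""

import json
import urllib.parse
import urllib.request

QUERY = 'sum(increase(cue_frames_completed_total[5m])) by (state)'

url = "http://localhost:9090/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url, timeout=10) as resp:
    data = json.loads(resp.read())

# An instant query returns one sample per label combination (here, per state)
for result in data["data"]["result"]:
    state = result["metric"].get("state", "<none>")
    value = result["value"][1]  # value pairs are [timestamp, value]
    print(f"{state}: {value}")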

Step 4: Add a job queue panel

Add a gauge showing pending work:

  1. Click Add > Visualization
  2. Select Prometheus as the data source
  3. Enter the query:
    cue_dispatch_waiting_total
    
  4. Change visualization to Gauge
  5. Configure:
    • Title: “Dispatch Queue Size”
    • Thresholds: 0 (green), 100 (yellow), 500 (red)
  6. Click Apply

Step 5: Add a host report panel

Create a panel showing host activity:

  1. Click Add > Visualization
  2. Enter the query:
    sum(increase(cue_host_reports_received_total[5m])) by (facility)
    
  3. Configure:
    • Title: “Host Reports by Facility”
    • Visualization: Time series
  4. Click Apply

Step 6: Save the dashboard

  1. Click the save icon (or Ctrl+S)
  2. Name: “My OpenCue Dashboard”
  3. Click Save

Part 2: Creating Prometheus alerts

Step 1: Create an alert rule

  1. In Grafana, go to Alerting > Alert rules
  2. Click New alert rule

Step 2: Configure the alert condition

  1. Name: “High Frame Failure Rate”
  2. In the Query section, enter:
    rate(cue_frames_completed_total{state="DEAD"}[5m])
    
  3. Set condition:
    • Threshold: IS ABOVE 0.1
    • For: 5m

Step 3: Add alert details

  1. Add summary:
    Frame failure rate is {{ $value }} per second
    
  2. Add description:
    The render farm is experiencing elevated frame failures.
    Check host health and job configurations.
    
  3. Click Save and exit

Step 4: Create a notification contact point

  1. Go to Alerting > Contact points
  2. Click Add contact point
  3. Configure for your notification method (email, Slack, etc.)
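
If you just want to see what a notification payload looks like, you can point a webhook contact point at a tiny local receiver. A minimal sketch using only the Python standard library; the port 5001 and the /alerts path are arbitrary choices, not OpenCue or Grafana defaults:

#!/usr/bin/env python3
"""Minimal webhook receiver for Grafana alert notifications."""

import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        if length:
            payload = json.loads(self.rfile.read(length))
            # Grafana webhook payloads carry a list of alerts with labels and annotations
            for alert in payload.get("alerts", []):
                print(alert.get("status"), alert.get("labels"), alert.get("annotations"))
        self.send_response(200)
        self.end_headers()


# Register http://localhost:5001/alerts as a webhook contact point in Grafana
HTTPServer(("", 5001), AlertHandler).serve_forever()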

Part 3: Building a Kafka event consumer

Step 1: Create a Python consumer

Create a file monitor_events.py:

#!/usr/bin/env python3
"""
Simple Kafka consumer for OpenCue monitoring events.
"""

from kafka import KafkaConsumer
import json
from datetime import datetime

# Connect to Kafka
# Note: The cuebot producer uses lz4 compression, so the lz4 library must be installed
consumer = KafkaConsumer(
    'opencue.frame.events',
    'opencue.job.events',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='earliest',
    group_id='tutorial-consumer'
)

print("Listening for OpenCue events...")
print("-" * 60)

for message in consumer:
    event = message.value

    # Events have a 'header' field containing event metadata
    header = event.get('header', {})
    event_type = header.get('event_type', 'UNKNOWN')
    timestamp = header.get('timestamp', '')

    # Convert timestamp from milliseconds to readable format
    if timestamp:
        try:
            dt = datetime.fromtimestamp(int(timestamp) / 1000)
            timestamp = dt.strftime('%Y-%m-%d %H:%M:%S')
        except (ValueError, OSError):
            pass

    # Format output based on event type
    if event_type.startswith('FRAME_'):
        job_name = event.get('job_name', 'N/A')
        frame_name = event.get('frame_name', 'N/A')
        state = event.get('state', 'N/A')
        print(f"[{timestamp}] {event_type}")
        print(f"  Job: {job_name}")
        print(f"  Frame: {frame_name}")
        print(f"  State: {state}")
        if event_type == 'FRAME_COMPLETED':
            runtime = event.get('run_time', 0)
            print(f"  Runtime: {runtime}s")
        elif event_type == 'FRAME_FAILED':
            exit_status = event.get('exit_status', -1)
            print(f"  Exit Status: {exit_status}")
        print()

    elif event_type.startswith('JOB_'):
        job_name = event.get('job_name', 'N/A')
        show_name = event.get('show', 'N/A')
        print(f"[{timestamp}] {event_type}")
        print(f"  Job: {job_name}")
        print(f"  Show: {show_name}")
        print()

Step 2: Install dependencies

pip install kafka-python lz4

Step 3: Run the consumer

python monitor_events.py

Step 4: Generate events

In another terminal, submit a test job. You can use either cuecmd or PyOutline:

Option A: Using cuecmd

# Create a command file
echo "echo Hello from monitoring test" > /tmp/test_commands.txt

# Submit the job
cuecmd /tmp/test_commands.txt --show testing --job-name monitoring_test

Option B: Using PyOutline

python -c "
import outline
from outline.modules.shell import Shell

ol = outline.Outline('monitoring_test_$RANDOM', shot='testshot', show='testing')
layer = Shell('test_layer', command=['/bin/echo', 'Hello from monitoring test'], range='1-1')
ol.add_layer(layer)
outline.cuerun.launch(ol, use_pycuerun=False)
"

Watch the consumer output as events flow through Kafka.

Part 4: Querying Elasticsearch

[Screenshot: Kibana Dashboard]

Step 1: Access Kibana

  1. Open Kibana at http://localhost:5601
  2. Navigate to Management > Stack Management > Index Patterns

Step 2: Create an index pattern

  1. Click Create index pattern
  2. Enter pattern: opencue-*
  3. Select header.timestamp as the time field (format: epoch_millis)
  4. Click Create index pattern
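
If the pattern matches no indices, the indexer may not have written anything yet. You can check from the command line with a short script against the Elasticsearch REST API (assumes the default localhost:9200 endpoint with no authentication):

#!/usr/bin/env python3
"""List the opencue-* indices created by the monitoring-indexer."""

import json
import urllib.request

url = "http://localhost:9200/_cat/indices/opencue-*?format=json"
with urllib.request.urlopen(url, timeout=10) as resp:
    for index in json.loads(resp.read()):
        print(index["index"], index["docs.count"], "docs")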

Step 3: Explore events

  1. Navigate to Discover
  2. Select the opencue-* index pattern
  3. Set the time range to include your test events

Step 4: Run KQL queries

[Screenshot: Kibana Dev Tools]

Try these example queries:

# Find all failed frames
header.event_type: "FRAME_FAILED"

# Find events for a specific job
job_name: "test*"

# Find frames that took longer than 1 hour
header.event_type: "FRAME_COMPLETED" AND run_time > 3600

# Find host down events
header.event_type: "HOST_DOWN"

Step 5: Create a visualization

  1. Navigate to Visualize Library
  2. Click Create visualization
  3. Select Lens
  4. Drag the header.event_type field to the visualization
  5. Create a pie chart of event types

Part 5: Building a failure tracking dashboard

Let’s create a comprehensive failure tracking dashboard.

Step 1: Create failure rate panel

In Grafana, create a new panel:

sum(rate(cue_frames_completed_total{state="DEAD"}[1h])) by (show)
/ sum(rate(cue_frames_completed_total[1h])) by (show)
* 100

Configure:

  • Title: “Frame Failure Rate by Show (%)”
  • Unit: percent (0-100)
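
The same expression can drive a script outside Grafana, for example as a cron-based check. A sketch assuming Prometheus at localhost:9090; the 5% threshold is an arbitrary example:

#!/usr/bin/env python3
"""Warn about any show whose 1h frame failure rate exceeds a threshold."""

import json
import urllib.parse
import urllib.request

EXPR = (
    'sum(rate(cue_frames_completed_total{state="DEAD"}[1h])) by (show)'
    ' / sum(rate(cue_frames_completed_total[1h])) by (show) * 100'
)
THRESHOLD_PCT = 5.0  # arbitrary example threshold

url = "http://localhost:9090/api/v1/query?" + urllib.parse.urlencode({"query": EXPR})
with urllib.request.urlopen(url, timeout=10) as resp:
    results = json.loads(resp.read())["data"]["result"]

for r in results:
    show = r["metric"].get("show", "<none>")
    pct = float(r["value"][1])
    flag = "WARN" if pct > THRESHOLD_PCT else "ok"
    print(f"{flag}  {show}: {pct:.1f}% failed")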

Step 2: Create failed frame tracking panel

sum(increase(cue_frames_completed_total{state="DEAD"}[24h])) by (show)

Configure:

  • Title: “Failed Frames (24h)”
  • Visualization: Bar gauge

Step 3: Create host health panel

sum(up{job="cuebot"})

Configure:

  • Title: “Cuebot Health”
  • Visualization: Stat
  • Color mode: Background
  • Thresholds: 0 (red), 1 (green)

Step 4: Organize the dashboard

  1. Arrange panels in a logical layout
  2. Add row headers: “Farm Health”, “Job Metrics”, “Failures”
  3. Set dashboard refresh rate to 30s
  4. Save the dashboard

Challenge exercises

Exercise 1: Memory usage alert

Create an alert that fires when the 95th percentile of frame memory usage exceeds 16 GiB (17179869184 bytes):

histogram_quantile(0.95, sum(rate(cue_frame_memory_bytes_bucket[5m])) by (le))
> 17179869184

Exercise 2: Capacity planning query

Build a Grafana panel showing peak usage times:

max_over_time(cue_dispatch_threads_total[1d])

Exercise 3: Custom Kafka processor

Extend the Python consumer to:

  • Track frame failure rates per show
  • Send Slack notifications for high failure rates
  • Write metrics to a time-series database
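
As a starting point for the first bullet, here is one way to track per-show failure counts in the consumer loop. This is a sketch: it assumes frame events carry a show field, which you should verify against your actual payloads (the Part 3 consumer only relies on job_name, frame_name, and state):

#!/usr/bin/env python3
"""Sketch: per-show frame failure tracking built on the Part 3 consumer."""

import json
from collections import Counter

from kafka import KafkaConsumer

failed = Counter()
completed = Counter()

consumer = KafkaConsumer(
    'opencue.frame.events',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    group_id='failure-tracker',
)

for message in consumer:
    event = message.value
    event_type = event.get('header', {}).get('event_type', '')
    # Assumption: frame events include a 'show' field; verify in your payloads
    show = event.get('show', 'unknown')

    if event_type == 'FRAME_FAILED':
        failed[show] += 1
    elif event_type == 'FRAME_COMPLETED':
        completed[show] += 1
    else:
        continue

    total = failed[show] + completed[show]
    if total % 100 == 0:  # report every 100 finished frames per show
        rate = failed[show] / total * 100
        print(f"{show}: {failed[show]}/{total} failed ({rate:.1f}%)")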

Cleanup

To stop the monitoring stack:

docker compose -f sandbox/docker-compose.monitoring-full.yml down

To preserve your Grafana dashboards, export them first:

  1. Open the dashboard
  2. Click the share icon
  3. Select Export > Save to file
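
Exports can also be scripted with Grafana's HTTP API. A sketch using the default admin/admin credentials from Part 1 (adjust if you changed them):

#!/usr/bin/env python3
"""Export every dashboard to a JSON file via the Grafana HTTP API."""

import base64
import json
import urllib.request

BASE = "http://localhost:3000"
AUTH = base64.b64encode(b"admin:admin").decode()  # default sandbox credentials


def get(path):
    req = urllib.request.Request(BASE + path, headers={"Authorization": "Basic " + AUTH})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())


# /api/search lists dashboards; /api/dashboards/uid/<uid> returns the full model
for item in get("/api/search?type=dash-db"):
    dashboard = get("/api/dashboards/uid/" + item["uid"])["dashboard"]
    filename = item["uid"] + ".json"
    with open(filename, "w") as f:
        json.dump(dashboard, f, indent=2)
    print("exported", filename)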
