Description
Is there an existing issue already for this bug?
- I have searched for an existing issue, and could not find anything. I believe this is a new bug.
I have read the troubleshooting guide
- I have read the troubleshooting guide and I think this is a new bug.
I am running a supported version of CloudNativePG
- I am running a supported version of CloudNativePG and I think this is a new bug.
Version
1.28 (latest patch)
What version of Kubernetes are you using?
other (unsupported)
What is your Kubernetes environment?
Self-managed: RKE
How did you install the operator?
Helm
What happened?
Hi there,
I have deployed multiple Postgres clusters using CNPG and the barman cloud plugin having three replicas. When a new image is rolled out (through an updated ClusterImageCatalog) two replicas are updated successfully whereas the update stalls on the last replica pod leaving the cluster with two healthy pods and one broken pod.
#9107 may be somewhat related but refers to an operator update, while I am experiencing the issue with a postgres image update.
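For context, the catalog driving the rollout looks roughly like this (a minimal sketch; apart from the catalog name, major version, and image, which appear elsewhere in this report, the exact contents are assumed):

apiVersion: postgresql.cnpg.io/v1
kind: ClusterImageCatalog
metadata:
  name: postgresql-standard-trixie
spec:
  images:
    - major: 18
      image: ghcr.io/cloudnative-pg/postgresql:18.1-202601190807-standard-trixie@sha256:7a72106d396a9c6d06ca25be191700c86031e0a06ccdeaf1a31da14f198cbbc4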
$ k -n authelia get pod
NAME                  READY   STATUS      RESTARTS   AGE
[...]
postgres-authelia-1   2/2     Running     0          3h32m
postgres-authelia-2   1/2     Completed   0          43h
postgres-authelia-3   2/2     Running     0          3h33m
$ k cnpg -n authelia status postgres-authelia
Cluster Summary
Name:                    authelia/postgres-authelia
System ID:               7579768307852668959
PostgreSQL Image:        ghcr.io/cloudnative-pg/postgresql:18.1-202601190807-standard-trixie@sha256:7a72106d396a9c6d06ca25be191700c86031e0a06ccdeaf1a31da14f198cbbc4
Primary instance:        postgres-authelia-1
Primary promotion time:  2026-01-19 09:46:34 +0000 UTC (3h44m2s)
Status:                  Waiting for the instances to become active  Some instances are not yet active. Please wait.
Instances:               3
Ready instances:         2
Size:                    652M
Current Write LSN:       4/B5000000 (Timeline: 5 - WAL File: 0000000500000004000000B5)

Continuous Backup status (Barman Cloud Plugin)
ObjectStore / Server name:      s3-store/postgres-authelia
First Point of Recoverability:  2026-01-18 03:37:02 CET
Last Successful Backup:         2026-01-19 03:37:02 CET
Last Failed Backup:             -
Working WAL archiving:          OK
WALs waiting to be archived:    0
Last Archived WAL:              0000000500000004000000B4 @ 2026-01-19T13:16:41.88937Z
Last Failed WAL:                00000005.history @ 2026-01-19T09:46:33.577683Z
Streaming Replication status
Replication Slots Enabled
Name                 Sent LSN    Write LSN   Flush LSN   Replay LSN  Write Lag  Flush Lag  Replay Lag  State      Sync State  Sync Priority  Replication Slot
----                 --------    ---------   ---------   ----------  ---------  ---------  ----------  -----      ----------  -------------  ----------------
postgres-authelia-3  4/B5000000  4/B5000000  4/B5000000  4/B5000000  00:00:00   00:00:00   00:00:00    streaming  async       0              active
Instances status
Name                 Current LSN  Replication role  Status              QoS         Manager Version  Node
----                 -----------  ----------------  ------              ---         ---------------  ----
postgres-authelia-1  4/B5000000   Primary           OK                  BestEffort  1.28.0           k8s-fsn-2.<domain>
postgres-authelia-3  4/B5000000   Standby (async)   OK                  BestEffort  1.28.0           k8s-fsn-3.<domain>
postgres-authelia-2  -            -                 ServiceUnavailable  BestEffort  -                k8s-fsn-1.<domain>
Plugins status
Name                            Version  Status  Reported Operator Capabilities
----                            -------  ------  ------------------------------
barman-cloud.cloudnative-pg.io  0.10.0   N/A     Reconciler Hooks, Lifecycle Service
Error(s) extracting status
-----------------------------------
failed to get status by proxying to the pod, you might lack permissions to get pods/proxy: the server is currently unable to handle the request (get pods https:postgres-authelia-2:8000)
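The pods/proxy hint in that message does not seem to apply here: "the server is currently unable to handle the request" suggests the proxy target is unreachable rather than forbidden. A quick RBAC check (my own sketch, assuming kubectl's --subresource flag is available) can rule out the permission angle:

$ k auth can-i get pods --subresource=proxy -n authelia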
For postgres-authelia-2, the postgres container has terminated while the plugin-barman-cloud sidecar is still running. The sidecar's logs are unremarkable:
{"level":"info","ts":"2026-01-19T13:10:47.379437499Z","msg":"Skipping retention policy enforcement, not the current primary","logging_pod":"postgres-authelia-2","currentPrimary":"postgres-authelia-1","podName":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T13:15:47.391226096Z","msg":"Skipping retention policy enforcement, not the current primary","logging_pod":"postgres-authelia-2","currentPrimary":"postgres-authelia-1","podName":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T13:20:47.445611235Z","msg":"Skipping retention policy enforcement, not the current primary","logging_pod":"postgres-authelia-2","currentPrimary":"postgres-authelia-1","podName":"postgres-authelia-2"}
The CNPG operator is repeating this log message over and over:
{"level":"info","ts":"2026-01-19T13:13:04.126380647Z","msg":"Cannot extract Pod status","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres-authelia","namespace":"authelia"},"namespace":"authelia","name":"postgres-authelia","reconcileID":"da68fdc2-0ef6-4973-b035-caa6817fde60","podName":"postgres-authelia-2","error":"Get \"https://10.42.2.223:8000/pg/status\": dial tcp 10.42.2.223:8000: connect: connection refused"}
I guess the lack of progress is caused by the sidecar still running, which leads the operator to assume it can still retrieve the pod status via HTTP; the request then fails because nothing is listening on port 8000 anymore.
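This can be reproduced by hand (a sketch, assuming curl is available locally and the status endpoint is served over HTTPS on port 8000, as in the operator log above):

$ k -n authelia port-forward pod/postgres-authelia-2 8000:8000 &
$ curl -sk https://localhost:8000/pg/status

The connection should be refused, matching the operator error.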
Thanks,
Thilo
RKE2 version: v1.35.0+rke2r1
Cluster resource
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-authelia
spec:
  instances: 3
  storage:
    size: 1Gi
    storageClass: local-path
  monitoring:
    enablePodMonitor: true
  managed:
    roles:
      - name: authelia
        login: true
        superuser: false
        inRoles:
          - pg_read_all_data
          - pg_write_all_data
        passwordSecret:
          name: authelia-postgres-credentials
  inheritedMetadata:
    annotations:
      k8up.io/backupcommand: sh -c 'PGDATABASE="$POSTGRES_DB" PGUSER="$POSTGRES_USER" PGPASSWORD="$POSTGRES_PASSWORD" pg_dump --clean --create -d authelia'
      k8up.io/file-extension: .sql
  enableSuperuserAccess: true
  primaryUpdateMethod: switchover
  imageCatalogRef:
    apiGroup: postgresql.cnpg.io
    kind: ClusterImageCatalog
    name: postgresql-standard-trixie
    major: 18
  plugins:
    - name: barman-cloud.cloudnative-pg.io
      isWALArchiver: true
      parameters:
        barmanObjectName: s3-store

Relevant log output
postgres container:
{"level":"info","ts":"2026-01-19T09:46:26.480431675Z","msg":"Exited log pipe","logger":"instance-manager","fileName":"/controller/log/postgres.csv","logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.480440525Z","msg":"Exited log pipe","logger":"instance-manager","fileName":"/controller/log/postgres.json","logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.480440465Z","msg":"All workers finished","logger":"instance-manager","logging_pod":"postgres-authelia-2","controller":"instance-database","controllerGroup":"postgresql.cnpg.io","controllerKind":"Database"}
{"level":"info","ts":"2026-01-19T09:46:26.480447365Z","msg":"All workers finished","logger":"instance-manager","logging_pod":"postgres-authelia-2","controller":"instance-subscription","controllerGroup":"postgresql.cnpg.io","controllerKind":"Subscription"}
{"level":"info","ts":"2026-01-19T09:46:26.480462705Z","logger":"Replicator","msg":"Terminated slot Replicator loop","logger":"instance-manager","logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.480465195Z","msg":"Old primary shutdown complete","logger":"instance-manager","logging_pod":"postgres-authelia-2","controller":"instance-cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres-authelia","namespace":"authelia"},"namespace":"authelia","name":"postgres-authelia","reconcileID":"45f961af-65a9-4c33-ba9e-4287f8eafce2","phase":"Waiting for the instances to become active","currentTimestamp":"2026-01-19T09:46:26.480430Z","targetPrimaryTimestamp":"2026-01-19T09:46:23.711911Z","currentPrimaryTimestamp":"2026-01-16T10:33:04.167086Z","msPassedSinceTargetPrimaryTimestamp":2768,"msPassedSinceCurrentPrimaryTimestamp":256402313,"msDifferenceBetweenCurrentAndTargetPrimary":-256399544,"logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.480472815Z","msg":"All workers finished","logger":"instance-manager","logging_pod":"postgres-authelia-2","controller":"instance-publication","controllerGroup":"postgresql.cnpg.io","controllerKind":"Publication"}
{"level":"info","ts":"2026-01-19T09:46:26.480466275Z","msg":"All workers finished","logger":"instance-manager","logging_pod":"postgres-authelia-2","controller":"instance-external-server","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2026-01-19T09:46:26.480577216Z","msg":"Webserver exited","logger":"instance-manager","logging_pod":"postgres-authelia-2","address":"localhost:8010"}
{"level":"info","ts":"2026-01-19T09:46:26.480597986Z","msg":"Webserver exited","logger":"instance-manager","logging_pod":"postgres-authelia-2","address":":8000"}
{"level":"info","ts":"2026-01-19T09:46:26.480605496Z","msg":"Webserver exited","logger":"instance-manager","logging_pod":"postgres-authelia-2","address":":9187"}
{"level":"info","ts":"2026-01-19T09:46:26.482429749Z","msg":"DB not available, will retry","logger":"instance-manager","logging_pod":"postgres-authelia-2","controller":"instance-cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres-authelia","namespace":"authelia"},"namespace":"authelia","name":"postgres-authelia","reconcileID":"45f961af-65a9-4c33-ba9e-4287f8eafce2","logging_pod":"postgres-authelia-2","err":"failed to connect to `user=postgres database=postgres`: /controller/run/.s.PGSQL.5432 (/controller/run): dial error: dial unix /controller/run/.s.PGSQL.5432: connect: no such file or directory"}
{"level":"info","ts":"2026-01-19T09:46:26.482490139Z","msg":"All workers finished","logger":"instance-manager","logging_pod":"postgres-authelia-2","controller":"instance-cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2026-01-19T09:46:26.482501089Z","msg":"Stopping and waiting for caches","logger":"instance-manager","logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.48257093Z","msg":"Stopping and waiting for webhooks","logger":"instance-manager","logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.48258045Z","msg":"Stopping and waiting for HTTP servers","logger":"instance-manager","logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.48258532Z","msg":"Wait completed, proceeding to shutdown the manager","logger":"instance-manager","logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.48259516Z","msg":"Checking for free disk space for WALs after PostgreSQL finished","logger":"instance-manager","logging_pod":"postgres-authelia-2"}Code of Conduct
- I agree to follow this project's Code of Conduct