[Bug]: Postgres image update stalls when plugin-barman-cloud is installed #9770

Description

@ginkel

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I have read the troubleshooting guide and I think this is a new bug.

Contact Details

tg@tgbyte.de

Version

1.28 (latest patch)

What version of Kubernetes are you using?

other (unsupported)

What is your Kubernetes environment?

Self-managed: RKE

How did you install the operator?

Helm

What happened?

Hi there,

I have deployed multiple Postgres clusters with three instances each using CNPG and the barman cloud plugin. When a new image is rolled out (through an updated ClusterImageCatalog), two replicas are updated successfully, whereas the update stalls on the last replica pod, leaving the cluster with two healthy pods and one broken pod.

#9107 may be somewhat related, but it refers to an operator update, while I am experiencing the issue with a Postgres image update.
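
For context, the rollout is triggered simply by bumping the entry for major 18 in the referenced catalog; a trimmed-down sketch of such a ClusterImageCatalog (the image tag here is a placeholder, not the exact digest in use) looks roughly like this:

apiVersion: postgresql.cnpg.io/v1
kind: ClusterImageCatalog
metadata:
  name: postgresql-standard-trixie
spec:
  images:
  # updating this entry is what starts the rolling update of the instances
  - major: 18
    image: ghcr.io/cloudnative-pg/postgresql:18.1-standard-trixie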

$ k -n authelia get pod
NAME                                                READY   STATUS      RESTARTS   AGE
[...]
postgres-authelia-1                                 2/2     Running     0          3h32m
postgres-authelia-2                                 1/2     Completed   0          43h
postgres-authelia-3                                 2/2     Running     0          3h33m
$  k cnpg -n authelia status postgres-authelia
Cluster Summary
Name                     authelia/postgres-authelia
System ID:               7579768307852668959
PostgreSQL Image:        ghcr.io/cloudnative-pg/postgresql:18.1-202601190807-standard-trixie@sha256:7a72106d396a9c6d06ca25be191700c86031e0a06ccdeaf1a31da14f198cbbc4
Primary instance:        postgres-authelia-1
Primary promotion time:  2026-01-19 09:46:34 +0000 UTC (3h44m2s)
Status:                  Waiting for the instances to become active Some instances are not yet active. Please wait.
Instances:               3
Ready instances:         2
Size:                    652M
Current Write LSN:       4/B5000000 (Timeline: 5 - WAL File: 0000000500000004000000B5)

Continuous Backup status (Barman Cloud Plugin)
ObjectStore / Server name:      s3-store/postgres-authelia
First Point of Recoverability:  2026-01-18 03:37:02 CET
Last Successful Backup:         2026-01-19 03:37:02 CET
Last Failed Backup:             -
Working WAL archiving:          OK
WALs waiting to be archived:    0
Last Archived WAL:              0000000500000004000000B4   @   2026-01-19T13:16:41.88937Z
Last Failed WAL:                00000005.history           @   2026-01-19T09:46:33.577683Z

Streaming Replication status
Replication Slots Enabled
Name                 Sent LSN    Write LSN   Flush LSN   Replay LSN  Write Lag  Flush Lag  Replay Lag  State      Sync State  Sync Priority  Replication Slot
----                 --------    ---------   ---------   ----------  ---------  ---------  ----------  -----      ----------  -------------  ----------------
postgres-authelia-3  4/B5000000  4/B5000000  4/B5000000  4/B5000000  00:00:00   00:00:00   00:00:00    streaming  async       0              active

Instances status
Name                 Current LSN  Replication role  Status              QoS         Manager Version  Node
----                 -----------  ----------------  ------              ---         ---------------  ----
postgres-authelia-1  4/B5000000   Primary           OK                  BestEffort  1.28.0           k8s-fsn-2.<domain>
postgres-authelia-3  4/B5000000   Standby (async)   OK                  BestEffort  1.28.0           k8s-fsn-3.<domain>
postgres-authelia-2  -            -                 ServiceUnavailable  BestEffort  -                k8s-fsn-1.<domain>

Plugins status
Name                            Version  Status  Reported Operator Capabilities
----                            -------  ------  ------------------------------
barman-cloud.cloudnative-pg.io  0.10.0   N/A     Reconciler Hooks, Lifecycle Service


Error(s) extracting status
-----------------------------------
failed to get status by proxying to the pod, you might lack permissions to get pods/proxy: the server is currently unable to handle the request (get pods https:postgres-authelia-2:8000)

For postgres-authelia-2, the postgres container has terminated while the plugin-barman-cloud sidecar is still running. The sidecar's logs are unremarkable:

{"level":"info","ts":"2026-01-19T13:10:47.379437499Z","msg":"Skipping retention policy enforcement, not the current primary","logging_pod":"postgres-authelia-2","currentPrimary":"postgres-authelia-1","podName":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T13:15:47.391226096Z","msg":"Skipping retention policy enforcement, not the current primary","logging_pod":"postgres-authelia-2","currentPrimary":"postgres-authelia-1","podName":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T13:20:47.445611235Z","msg":"Skipping retention policy enforcement, not the current primary","logging_pod":"postgres-authelia-2","currentPrimary":"postgres-authelia-1","podName":"postgres-authelia-2"}

The CNPG operator is repeating this log message over and over:

{"level":"info","ts":"2026-01-19T13:13:04.126380647Z","msg":"Cannot extract Pod status","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres-authelia","namespace":"authelia"},"namespace":"authelia","name":"postgres-authelia","reconcileID":"da68fdc2-0ef6-4973-b035-caa6817fde60","podName":"postgres-authelia-2","error":"Get \"https://10.42.2.223:8000/pg/status\": dial tcp 10.42.2.223:8000: connect: connection refused"}

I suspect the lack of progress is caused by the sidecar still running: because the pod has not fully terminated, the operator assumes it can still retrieve the pod status via HTTP, but the request fails since the instance manager's webserver on port 8000 shut down together with the postgres container (see "Webserver exited" for address :8000 in the log output below).
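
The failing probe can be reproduced by hand via the same pods/proxy status endpoint that the kubectl-cnpg plugin uses (this assumes get permission on pods/proxy):

# refused once the instance manager webserver on :8000 has exited
$ kubectl get --raw \
    "/api/v1/namespaces/authelia/pods/https:postgres-authelia-2:8000/proxy/pg/status"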

Thanks,
Thilo

RKE2 version: v1.35.0+rke2r1

Cluster resource

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-authelia
spec:
  instances: 3
  storage:
    size: 1Gi
    storageClass: local-path
  monitoring:
    enablePodMonitor: true
  managed:
    roles:
    - name: authelia
      login: true
      superuser: false
      inRoles:
        - pg_read_all_data
        - pg_write_all_data
      passwordSecret:
        name: authelia-postgres-credentials
  inheritedMetadata:
    annotations:
      k8up.io/backupcommand: sh -c 'PGDATABASE="$POSTGRES_DB" PGUSER="$POSTGRES_USER" PGPASSWORD="$POSTGRES_PASSWORD" pg_dump --clean --create -d authelia'
      k8up.io/file-extension: .sql
  enableSuperuserAccess: true
  primaryUpdateMethod: switchover
  imageCatalogRef:
    apiGroup: postgresql.cnpg.io
    kind: ClusterImageCatalog
    name: postgresql-standard-trixie
    major: 18
  plugins:
  - name: barman-cloud.cloudnative-pg.io
    isWALArchiver: true
    parameters:
      barmanObjectName: s3-store
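
For completeness, s3-store refers to a plugin-barman-cloud ObjectStore resource; the real one is omitted here, but a minimal sketch with placeholder bucket and credential names looks roughly like this:

apiVersion: barmancloud.cnpg.io/v1
kind: ObjectStore
metadata:
  name: s3-store
spec:
  configuration:
    # placeholder bucket; the actual destination is not part of this report
    destinationPath: s3://BUCKET/
    s3Credentials:
      accessKeyId:
        name: s3-credentials
        key: ACCESS_KEY_ID
      secretAccessKey:
        name: s3-credentials
        key: ACCESS_SECRET_KEY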

Relevant log output

postgres container:

{"level":"info","ts":"2026-01-19T09:46:26.480431675Z","msg":"Exited log pipe","logger":"instance-manager","fileName":"/controller/log/postgres.csv","logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.480440525Z","msg":"Exited log pipe","logger":"instance-manager","fileName":"/controller/log/postgres.json","logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.480440465Z","msg":"All workers finished","logger":"instance-manager","logging_pod":"postgres-authelia-2","controller":"instance-database","controllerGroup":"postgresql.cnpg.io","controllerKind":"Database"}
{"level":"info","ts":"2026-01-19T09:46:26.480447365Z","msg":"All workers finished","logger":"instance-manager","logging_pod":"postgres-authelia-2","controller":"instance-subscription","controllerGroup":"postgresql.cnpg.io","controllerKind":"Subscription"}
{"level":"info","ts":"2026-01-19T09:46:26.480462705Z","logger":"Replicator","msg":"Terminated slot Replicator loop","logger":"instance-manager","logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.480465195Z","msg":"Old primary shutdown complete","logger":"instance-manager","logging_pod":"postgres-authelia-2","controller":"instance-cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres-authelia","namespace":"authelia"},"namespace":"authelia","name":"postgres-authelia","reconcileID":"45f961af-65a9-4c33-ba9e-4287f8eafce2","phase":"Waiting for the instances to become active","currentTimestamp":"2026-01-19T09:46:26.480430Z","targetPrimaryTimestamp":"2026-01-19T09:46:23.711911Z","currentPrimaryTimestamp":"2026-01-16T10:33:04.167086Z","msPassedSinceTargetPrimaryTimestamp":2768,"msPassedSinceCurrentPrimaryTimestamp":256402313,"msDifferenceBetweenCurrentAndTargetPrimary":-256399544,"logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.480472815Z","msg":"All workers finished","logger":"instance-manager","logging_pod":"postgres-authelia-2","controller":"instance-publication","controllerGroup":"postgresql.cnpg.io","controllerKind":"Publication"}
{"level":"info","ts":"2026-01-19T09:46:26.480466275Z","msg":"All workers finished","logger":"instance-manager","logging_pod":"postgres-authelia-2","controller":"instance-external-server","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2026-01-19T09:46:26.480577216Z","msg":"Webserver exited","logger":"instance-manager","logging_pod":"postgres-authelia-2","address":"localhost:8010"}
{"level":"info","ts":"2026-01-19T09:46:26.480597986Z","msg":"Webserver exited","logger":"instance-manager","logging_pod":"postgres-authelia-2","address":":8000"}
{"level":"info","ts":"2026-01-19T09:46:26.480605496Z","msg":"Webserver exited","logger":"instance-manager","logging_pod":"postgres-authelia-2","address":":9187"}
{"level":"info","ts":"2026-01-19T09:46:26.482429749Z","msg":"DB not available, will retry","logger":"instance-manager","logging_pod":"postgres-authelia-2","controller":"instance-cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres-authelia","namespace":"authelia"},"namespace":"authelia","name":"postgres-authelia","reconcileID":"45f961af-65a9-4c33-ba9e-4287f8eafce2","logging_pod":"postgres-authelia-2","err":"failed to connect to `user=postgres database=postgres`: /controller/run/.s.PGSQL.5432 (/controller/run): dial error: dial unix /controller/run/.s.PGSQL.5432: connect: no such file or directory"}
{"level":"info","ts":"2026-01-19T09:46:26.482490139Z","msg":"All workers finished","logger":"instance-manager","logging_pod":"postgres-authelia-2","controller":"instance-cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2026-01-19T09:46:26.482501089Z","msg":"Stopping and waiting for caches","logger":"instance-manager","logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.48257093Z","msg":"Stopping and waiting for webhooks","logger":"instance-manager","logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.48258045Z","msg":"Stopping and waiting for HTTP servers","logger":"instance-manager","logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.48258532Z","msg":"Wait completed, proceeding to shutdown the manager","logger":"instance-manager","logging_pod":"postgres-authelia-2"}
{"level":"info","ts":"2026-01-19T09:46:26.48259516Z","msg":"Checking for free disk space for WALs after PostgreSQL finished","logger":"instance-manager","logging_pod":"postgres-authelia-2"}

Code of Conduct

  • I agree to follow this project's Code of Conduct
