Build flags overhaul + memgraph-debuginfo packaging#4102

Draft
Ignition wants to merge 39 commits into master from 2026_05_01_split_dwarf

@Ignition Ignition commented May 1, 2026

ThinLTO had been silently disabled for ~12 months due to a CMake variable casing typo. While fixing it we also enabled split-DWARF integrated with lld's ThinLTO codegen, compressed debug sections, file-prefix-mapped paths, and a prebuilt gdb index. The RelWithDebInfo binary shrank by about 22% with the debug info now living in a separate sidecar bundle of comparable size. Final-link wall time is unchanged.

The debug bundle ships as a separate memgraph-debuginfo sibling package (both DEB and RPM) with version pinned to the matching main package. RPM required disabling rpmbuild's automatic debug extraction, which would otherwise mangle our pre-bundled debug file into an empty stub.

@Ignition Ignition requested a review from mattkjames7 May 1, 2026 20:57
@Ignition Ignition self-assigned this May 1, 2026
@Ignition Ignition added CI -build=community -test=core Run community build and core tests on push CI -build=coverage -test=core Run coverage build and core tests on push CI -build=debug -test=core Run debug build and core tests on push CI -build=debug -test=integration Run debug build and integration tests on push CI -build=release -test=core Run release build and core tests on push CI -build=release -test=e2e Run release build and e2e tests on push CI -build=coverage -test=clang_tidy labels May 1, 2026
Ignition commented May 1, 2026

Tracking

  • [Link to Epic/Issue]

Standard development

CI Testing Labels

  • Select the appropriate CI test labels (CI -build=build-name -test=test-suite)

Documentation checklist

  • Add the documentation label
  • Add the bug / feature label
  • Add the milestone for which this feature is intended
    • If not known, set for a later milestone
  • Write a release note, including added/changed clauses
    • What has changed? What does it mean for a user? What should a user do with it? [#{{PR_number}}]({{link to the PR}})
  • [ Documentation PR link memgraph/documentation#XXXX ]
    • Is back linked to this development PR

Ignition added 14 commits May 1, 2026 22:03
* -gsplit-dwarf + lld --gdb-index for RelWithDebInfo/Debug; debug info no
  longer routed through the linker, gdb startup stays fast via the
  prebuilt .gdb_index. (Inert under ThinLTO RelWithDebInfo, harmless.)
* -gz compresses .debug_* sections (zlib; toolchain clang lacks zstd).
* -frecord-command-line embeds the compile invocation in DW_AT_producer
  for build-flag forensics on shipped binaries.
* -ffile-prefix-map normalizes source/build paths so debug info is
  portable across worktrees and ccache hits transfer between checkouts.
* Fix CMAKE_INTERPROCEDURAL_OPTIMIZATION_<config>: per-CMake convention
  the suffix must be upper-case (RELEASE / RELWITHDEBINFO). The mixed-
  case spelling was silently ignored, so IPO has been a no-op since it
  was uncommented in 81bee93 (2024-04-30) — ThinLTO is now actually
  applied.
* Add llvm-dwp post-build for Debug to bundle per-TU .dwo files into
  memgraph.dwp; ThinLTO configs produce no .dwo files so they're
  excluded.
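
The casing bug described above is easy to reintroduce because CMake gives no warning for an unknown variable. A minimal sketch of a lint for it follows; the helper name `find_miscased_ipo_vars` is hypothetical and not part of this PR, and the `Release` spelling in the usage note is only an illustration of a mixed-case suffix.

```shell
#!/usr/bin/env bash
# Hypothetical lint helper (not part of this PR): print every
# CMAKE_INTERPROCEDURAL_OPTIMIZATION_<config> occurrence whose config
# suffix contains a lower-case letter, which CMake silently ignores.
# Exits non-zero when nothing is flagged (plain grep semantics).
find_miscased_ipo_vars() {
  grep -noE 'CMAKE_INTERPROCEDURAL_OPTIMIZATION_[A-Za-z]*[a-z][A-Za-z]*' "$@"
}
```

Run against a single CMakeLists.txt, a line such as `set(CMAKE_INTERPROCEDURAL_OPTIMIZATION_Release ON)` would be reported with its line number, while the upper-case `RELEASE` / `RELWITHDEBINFO` spellings pass.
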
Pass --plugin-opt=dwo_dir=<build>/dwo to lld so its ThinLTO backend emits
per-module .dwo files during link-time codegen. The frontend -gsplit-dwarf
flag alone produced no .dwo files in LTO configs because per-TU .o files
contain bitcode, not native code with debug info.

Effect on RelWithDebInfo memgraph:
- binary: 430 MB -> 337 MB (debug info no longer baked in)
- memgraph.dwp: 310 MB sidecar bundling 293 .dwo files
- gdb auto-locates the .dwp by binary path, so debugging still works
  with one extra file shipped alongside the executable.
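
As a quick check of the "about 22%" figure in the PR description, the shrink can be recomputed from the two sizes quoted above. The function name `shrink_pct` is made up for this sketch.

```shell
# Sizes in MB, taken from the commit message above.
before=430
after=337

# Percentage by which the binary shrank, to one decimal place.
shrink_pct() {
  awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (b - a) * 100 / b }'
}

shrink_pct "$before" "$after"   # prints 21.6
```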

Also extend the llvm-dwp post-build step to RelWithDebInfo and Release.
Foundation for shipping memgraph.dwp as a separate -debuginfo package
(Stage 1 of multi-stage work; component packaging itself is gated off).

* Move dwp POST_BUILD before strip. Previously strip ran first for
  Release, which deleted the skeleton CUs that llvm-dwp uses to
  discover .dwo inputs -- the dwp output for Release was unreliable.
* New install rule for memgraph.dwp under COMPONENT debuginfo. The
  default monolithic cpack run sets CPACK_COMPONENTS_ALL=memgraph so
  this component is excluded from existing memgraph_*.deb / .rpm
  artifacts (no behavior change yet).
Enable per-component DEB packaging so cpack -G DEB now produces:
* memgraph_<ver>_<arch>.deb -- main package (binary, libstdc++, etc.)
* memgraph-debuginfo_<ver>_<arch>.deb -- the .dwp file only

memgraph-debuginfo Depends: memgraph (= <exact-version>) so the two
packages stay in lockstep. The debuginfo package lands the .dwp at
/usr/lib/memgraph/memgraph.dwp where gdb auto-finds it next to the
binary; no separate debuginfod or wrapper needed for local debugging.

Renamed CPACK_DEBIAN_PACKAGE_CONTROL_EXTRA ->
CPACK_DEBIAN_MEMGRAPH_PACKAGE_CONTROL_EXTRA so the maintainer scripts
(preinst/postinst/prerm/postrm that create the memgraph user, install
systemd unit, etc.) only run for the main package, not for debuginfo.
RPM side of the debuginfo split. Mirrors the DEB structure:
* CPACK_RPM_COMPONENT_INSTALL=ON enables per-component packaging.
* The custom spec file (memgraph.spec.in) is now scoped to the main
  component via CPACK_RPM_MEMGRAPH_USER_BINARY_SPECFILE; %prep surgery
  (systemd unit move, perms) only ever applied to the main package.
* The debuginfo component falls back to CPack's auto-generated spec --
  fine for a single-file payload.
* memgraph-debuginfo Requires: memgraph = <version>-1 keeps the two
  packages in lockstep.

Also tighten globs in release/package/mgbuild.sh that previously matched
"memgraph*" and would now sweep up both packages:
* rpmlint runs only on the main rpm (memgraph-[0-9]*.rpm) so the auto-
  generated debuginfo rpm doesn't trip distro-specific lint rules.
* dpkg -c contents check uses memgraph_*.deb (with underscore) to pick
  only the main DEB.
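
The effect of the tightened globs can be reproduced with empty placeholder files. This is only a sketch: the version strings and the helper `list_matches` are invented for illustration.

```shell
#!/usr/bin/env bash
# Placeholder artifacts; the version strings are made up for illustration.
demo=$(mktemp -d)
touch "$demo/memgraph_3.0.0-1_amd64.deb" \
      "$demo/memgraph-debuginfo_3.0.0-1_amd64.deb" \
      "$demo/memgraph-3.0.0-1.x86_64.rpm" \
      "$demo/memgraph-debuginfo-3.0.0-1.x86_64.rpm"

# Print the basenames inside $demo that match the given glob, one per line.
list_matches() {
  local f
  for f in "$demo"/$1; do
    [ -e "$f" ] && basename "$f"
  done
  return 0
}

list_matches 'memgraph*.deb'        # old glob: matches both DEBs
list_matches 'memgraph_*.deb'       # underscore: main DEB only
list_matches 'memgraph-[0-9]*.rpm'  # digit after the dash: main RPM only
```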

RPM packaging cannot be validated locally (no rpmbuild on dev host);
Stage 4 CI run is the source of truth.
End-to-end verification in containers:
* DEB: ubuntu:24.04, install main + memgraph-debuginfo, gdb finds .dwp
  via auto-locate, source-level debugging works.
* RPM: fedora:40, install main + memgraph-debuginfo, same.
* Negative tests: hiding the .dwp restores the "Could not find DWO CU"
  warning; uninstalling the debuginfo package removes the .dwp without
  touching the main package.

Fixes uncovered during testing:

1. CMAKE_INSTALL_DEFAULT_COMPONENT_NAME was set on src/CMakeLists.txt
   line ~117, which is *after* every add_subdirectory() call. Subdir
   install rules captured the previous default ("Unspecified") and
   their files never made it into the memgraph component's staging
   tree. Moved to the top of src/CMakeLists.txt.

2. Install rules outside src/ (systemd unit in release/, licenses and
   mgconsole in top-level CMakeLists.txt) didn't inherit the default
   either; tagged each with COMPONENT memgraph explicitly.

3. CPack auto-suffixes component names into RPM package names
   (-> memgraph-memgraph). Set CPACK_RPM_MEMGRAPH_PACKAGE_NAME=memgraph
   so the debuginfo's "Requires: memgraph = <ver>" matches.

4. Rpmbuild's automatic debug extraction (find-debuginfo.sh, brp-strip)
   would mangle our pre-built .dwp into a 200-byte ELF stub. Disabled
   via CPACK_RPM_SPEC_MORE_DEFINE: this is correct -- ThinLTO + split-
   dwarf produces only skeleton CUs in the binary, so rpmbuild's
   extraction would have nothing useful to extract anyway.
Distro gdb (15.x in ubuntu:24.04) segfaults on our DWARF 5 + ThinLTO +
split-DWARF binaries -- verified with "info address main" against the
installed memgraph package: gdb 15 hits an internal "fatal error" and
prints a "please report it" message; gdb 16.2 returns the symbol cleanly.
This means run_with_gdb.sh, the in-container crash-capture wrapper, was
itself crashing on top of any real memgraph crash.

Ship a stripped toolchain gdb (~12 MB binary + 1.2 MB share/gdb data
files) inside the v6/v7 relwithdebinfo images. run_with_gdb.sh prefers
it over distro gdb when present, falling back if absent.
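
The "prefer the bundled gdb, fall back to distro gdb" selection could look roughly like the sketch below. The function name `pick_gdb` and the default path are assumptions; the actual logic and install location in run_with_gdb.sh are not shown in this PR text.

```shell
# Hypothetical sketch of the selection; the default path is an assumption,
# not the location run_with_gdb.sh actually uses.
pick_gdb() {
  local bundled="${1:-/opt/gdb-bundle/bin/gdb}"
  if [ -x "$bundled" ]; then
    printf '%s\n' "$bundled"
  else
    printf '%s\n' gdb   # fall back to whatever gdb is on PATH
  fi
}
```

A wrapper would then invoke `"$(pick_gdb)"` instead of a hard-coded `gdb`.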

The bundle is built inside the mgbuild container (where the toolchain
lives) and copied out via the same pattern as heaptrack:
* tools/ci/build-gdb-bundle.sh -- stages /opt/toolchain-vN/bin/gdb +
  share/gdb into /tmp/gdb-bundle inside the build container.
* mgbuild.sh build-gdb-bundle / copy-gdb-bundle -- exposes that to
  the host so package_docker can COPY it into the image.
* package_docker errors early if release/docker/gdb-bundle/ is absent
  for a relwithdebinfo build; this surfaces missing CI wiring as a
  build error instead of a silent gdb-15 crash later.

run_with_gdb.sh also gains a "set substitute-path ./ /home/mg/memgraph/"
when the source tree is COPY'd into the image -- with the new
-ffile-prefix-map, debug info points at "./src/memgraph.cpp" and
this substitution lets gdb actually open the file from the source we
already ship in relwithdebinfo images.

Verified locally: ubuntu:24.04 + apt-installed runtime libs + the
bundled gdb resolves source via the .dwp from memgraph-debuginfo.deb;
"info address main" succeeds; control test confirms distro gdb 15
crashes on the same input.

v5_deb_relwithdebinfo.dockerfile (debian:12) is intentionally not
covered: the toolchain gdb 16.2 was built against ubuntu:24.04 ABI
(libpython3.12, libreadline.so.8) and won't run on debian:12 without
matching libs. v5 retains distro gdb 13 for now.
Component packaging emits two artifacts; the workflow assumed one.
Without these fixes, the rename step picks the alphabetically-first
match (the debuginfo package) and tags it as the main package, the S3
URL output points at whichever file ls -t happens to return, and the
debuginfo bundle leaks onto the public download bucket.

* Rename step disambiguates with explicit globs (memgraph_*.deb vs
  memgraph-debuginfo_*.deb; memgraph-[0-9]*.rpm vs memgraph-debuginfo-*.rpm)
  and renames both files in lockstep on master.
* S3 URL output emits only the main package URL; debuginfo lives next
  to it and shares the URL stem.
* aws s3 sync excludes memgraph-debuginfo* -- debuginfo is intentionally
  NOT pushed to download.memgraph.com (~300 MB per build is too much
  for the public mirror; recoverable from GitHub artifacts when needed).
* upload-artifact path glob "memgraph*" already catches both -- no
  change needed there.

Adds workflow steps to build and copy the toolchain gdb bundle into
release/docker/ during relwithdebinfo docker builds; the dockerfile
COPYs it from there and run_with_gdb.sh prefers it over distro gdb.
The "Enterprise DEB package" artifact in diff_release / diff_malloc /
reusable_release_tests previously globbed memgraph*.deb. With component
packaging enabled, that now matches both memgraph_*.deb and
memgraph-debuginfo_*.deb -- bloats the artifact (~75 MB extra per run)
and any downstream step that picks the first match by name would get
the debuginfo package instead of the main one.

These workflows are testing the main package, not the debug bundle.
Tighten the glob to memgraph_*.deb (note underscore) to keep artifact
size and selection behavior unchanged from before component packaging.
Two related bugs meant the relwithdebinfo image had a bundled
gdb 16 with no debug info to read:

1. mgbuild.sh package_docker picked the package by mtime
   (ls -t memgraph* | head -1). With component packaging, cpack writes
   the debuginfo deb after the main deb, so the docker image was
   actually installing memgraph-debuginfo (a 310 MB .dwp) and *not*
   the binary -- the entrypoint /usr/lib/memgraph/memgraph wouldn't
   even exist.

2. Even with the right main package, the dockerfile only installed
   one .deb. The debuginfo sibling went unused, so gdb in the image
   would still warn "Could not find DWO CU".

* mgbuild.sh package_docker picks the main package via
  `grep -v debuginfo` and discovers the matching debuginfo sibling
  next to it, passing it through as --debuginfo-package-path.
* Release builds are unchanged: production image stays slim, no .dwp.
* package_docker plumbs --debuginfo-package-path into a
  DEBUGINFO_BINARY_NAME build arg and stages the second deb in the
  build context.
* v6/v7 relwithdebinfo dockerfiles dpkg -i the debuginfo deb after
  the main, gated on DEBUGINFO_BINARY_NAME being set, via a
  read-only bind mount of the build context (avoids an extra COPY
  layer for a 76 MB file we'd just rm afterwards).
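
A minimal sketch of the selection described above: pick the main package with `grep -v debuginfo`, then look for the sibling next to it. Both helper names are hypothetical, and the sibling naming rule (swap the `memgraph_` prefix for `memgraph-debuginfo_`) is inferred from the DEB file names mentioned in this PR.

```shell
# Hypothetical helpers; not the code in mgbuild.sh.

# Print the main .deb in a directory, skipping the debuginfo sibling.
# Note: grep -v would also drop paths whose directory contains "debuginfo".
pick_main_deb() {
  ls -1 "$1"/memgraph*.deb 2>/dev/null | grep -v debuginfo | head -n 1
}

# Print the matching debuginfo .deb next to a main package, if one exists.
find_debuginfo_sibling() {
  local main="$1" dir base candidate
  dir=$(dirname "$main")
  base=$(basename "$main")
  candidate="$dir/memgraph-debuginfo_${base#memgraph_}"
  [ -f "$candidate" ] && printf '%s\n' "$candidate"
  return 0
}
```

Unlike `ls -t | head -1`, this does not depend on the order in which cpack happened to write the two files.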

Verified locally: a minimal ubuntu:24.04 image mirroring the
relwithdebinfo flow installs both packages, gdb finds the .dwp at
/usr/lib/memgraph/memgraph.dwp, and `list main` resolves source via
the bundled gdb 16.
@Ignition Ignition force-pushed the 2026_05_01_split_dwarf branch from fb1fbbd to eb71c9a Compare May 1, 2026 21:04
@mattkjames7

Is this PR doing some of the work done here? #4079

Ignition added 2 commits May 1, 2026 22:53
Frontend split-DWARF (Debug builds, no LTO) records the .dwo file path
in each compilation unit's skeleton CU as DW_AT_GNU_dwo_name. With
-ffile-prefix-map=${CMAKE_BINARY_DIR}=./build, that path got rewritten
from /abs/build/src/CMakeFiles/.../foo.cpp.dwo to ./build/src/.../foo.cpp.dwo.

llvm-dwp -e <binary> reads those paths to locate the .dwo inputs. The
relative form resolves against llvm-dwp's CWD (build/src), which is
wrong, so it fails:

  error: './build/src/CMakeFiles/memgraph.dir/memgraph.cpp.dwo': No such
  file or directory

This took out Debug, Coverage, and Community CI jobs.

The binary-dir prefix-map was speculative -- generated/build artifacts
in __FILE__ don't matter much in practice. Source-dir prefix-map is
the genuinely useful one for cross-worktree ccache hits and stays.

ThinLTO configs (RelWithDebInfo, Release) were unaffected by this bug:
their skeleton CUs are written by lld at link-time codegen, which
doesn't honor compile-time -ffile-prefix-map and emits absolute paths.
Ignition commented May 1, 2026

> Is this PR doing some of the work done here? #4079

Yes, an independent experiment. :)

@Ignition Ignition marked this pull request as draft May 1, 2026 22:21
Ignition added 2 commits May 1, 2026 23:31
…nk OOM cap

Three additions to the existing split-DWARF + .dwp packaging so it plays
naturally with the standard Linux debug-info ecosystem:

* .gnu_debuglink section pointing at memgraph.dwp. gdb's debuglink
  resolver then also tries /usr/lib/debug/<install-path>/<basename>.dwp
  and ./.debug/<basename>.dwp -- distro-conventional fallback paths.
* Install-time build-id-keyed symlink at
  /usr/lib/debug/.build-id/<aa>/<rest>.dwp pointing at the real .dwp
  in /usr/lib/memgraph/. gdb 16+ resolves split-DWARF debug info via
  this path; debuginfod proxies serving from the same layout work too.
  Logic factored into cmake/install-build-id-symlink.cmake to avoid the
  install(CODE) escaping mess.
* tools/ci/upload-debug-symbols.sh: uploads .dwp files to a
  build-id-keyed S3 bucket at <prefix>/<aa>/<rest>.dwp -- the
  debuginfod URL scheme. Adapted from PR #4079's script (which uploads
  .debug files for the alternate split-debug approach there); a future
  debuginfod proxy in front of the bucket needs zero re-indexing.
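
The `<aa>/<rest>.dwp` layout used by both the symlink and the S3 key splits the hex build-id after its first two characters. A small sketch, with a hypothetical helper name and a shortened, made-up build-id:

```shell
# Map a hex build-id to the <aa>/<rest>.dwp layout described above.
build_id_relpath() {
  local id="$1"
  local rest="${id#??}"          # everything after the first two hex digits
  local head="${id%"$rest"}"     # the first two hex digits
  printf '%s/%s.dwp\n' "$head" "$rest"
}

build_id_relpath abcdef0123   # prints ab/cdef0123.dwp
```

The build-id itself can be read from the binary's notes, for example with `readelf -n`, which prints a `Build ID:` line for binaries that carry one.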

Also tame ThinLTO link memory: lld's parallel ThinLTO codegen at link
time uses ~3-5 GB per module on heavy boost/template TUs, so the
Community / Core tests CI job OOM'd (exit 137) on RelWithDebInfo
without a cap. mgbuild.sh now auto-applies --link-threads via
compute-build-threads.sh (already in master, 4 GB/thread budget) when
build_type is RelWithDebInfo or Release and no explicit value was
passed. Caller-supplied --link-threads still wins.
The previous setup used two add_custom_command calls chained into the
memgraph link's POST_BUILD edge. When that ninja edge fails, ninja's
combined-edge output shows just the COMMENTs and a "FAILED [code=1]" --
no clue which of the steps failed or why.

Move the work into tools/ci/dwp-and-debuglink.sh which logs:
* resolved llvm-dwp / objcopy / binary / dwo_dir paths
* dwo file count under dwo_dir before invoking llvm-dwp
* the exact llvm-dwp + objcopy commands as they run
* dwp output size after llvm-dwp
* explicit error messages with non-zero exit codes pinned to a step

Also print the resolved LLVM_DWP / CMAKE_OBJCOPY / MG_DWO_DIR at configure
time so CI logs answer "did the tool even get found?" before build starts.
@mattkjames7

> > Is this PR doing some of the work done here? #4079
>
> Yes, an independent experiment. :)

Let's sync about this on Tuesday if you're about. #4079 is in a mostly working state in that, if I build with it, it correctly pulls debugging symbols from the API and allows me to generate stack traces from them - I mostly need to add the infra + update workflows. Hopefully this doesn't conflict too much 😂

Ignition added 19 commits May 2, 2026 00:18
CI Community / Core tests run on the previous commit pinpointed the
failure to llvm-dwp itself crashing with SIGBUS (exit 135). Bus error
on llvm-dwp is almost always an mmap fault on an empty .dwo input --
lld's ThinLTO codegen sometimes writes 0-byte .dwo files for modules
with no debug info, and llvm-dwp mmaps then dereferences them.

Two things:

1. The .dwp is a debug aid; killing the whole link because llvm-dwp
   crashed is the wrong trade-off. The script now catches non-zero
   exits, logs llvm-dwp's stderr in full, removes any partial .dwp,
   and exits 0 so the link target succeeds. memgraph-debuginfo is
   OPTIONAL in the install rules already, so a missing .dwp just
   means an empty debuginfo package for that build (with a clear
   warning in CI logs).

2. Add diagnostics for the suspected root cause: count empty .dwo
   files, list the first few, and report disk-free in the build dir
   before running llvm-dwp. The next CI failure will say either
   "yes, N empty .dwo files" (confirming the theory) or rule it out.
lld's ThinLTO link defaults to --thinlto-jobs=$(nproc), which on a
24-thread CI runner spawns 24 parallel codegen jobs each using 1-3 GB
-- a single link can peak at 24-72 GB. With multiple links in flight
under ninja, that OOMs the worker (Community / Core tests SIGBUS in
llvm-dwp downstream of this).

Compute the cap memory-aware at configure time via
tools/ci/compute-build-threads.sh, sized for ~3 GB per codegen job.
Applied to RelWithDebInfo and Release (the configs with ThinLTO active);
lld silently ignores the flag in non-LTO links so it's safe everywhere.

The existing --link-threads job-pool cap (4 GB/slot) remains and is
unchanged; it limits concurrent links, while --thinlto-jobs limits
internal codegen parallelism within each link.
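
The memory-aware cap can be sketched as min(CPU threads, RAM / per-job budget). This is an assumption about the shape of the calculation; the real tools/ci/compute-build-threads.sh is not shown here and may differ, and the function name is hypothetical.

```shell
# Hypothetical cap: min(cpu_threads, mem_gb / gb_per_job), never below 1.
# Arguments: total RAM in GB, CPU thread count, GB budget per codegen job.
thinlto_jobs_cap() {
  local mem_gb="$1" cpus="$2" gb_per_job="${3:-3}"
  local by_mem=$(( mem_gb / gb_per_job ))
  local jobs=$(( by_mem < cpus ? by_mem : cpus ))
  [ "$jobs" -lt 1 ] && jobs=1
  printf '%s\n' "$jobs"
}

thinlto_jobs_cap 32 24 3    # prints 10: memory is the limit
thinlto_jobs_cap 128 24 3   # prints 24: CPU count is the limit
```

With the 24-thread runner from the commit message, an uncapped link at 3 GB per job would need 24 x 3 = 72 GB, which matches the upper end of the 24-72 GB range quoted above.
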
Reverts the soft-fail behavior. A missing .dwp means
memgraph-debuginfo packages ship empty, which is worse than the build
failing loudly: a quiet broken release is hard to notice and harder
to debug after the fact.

Diagnostic logging stays (tool paths, dwo file count, empty .dwo
detection, disk free, captured llvm-dwp stderr); on failure we now
propagate the exit code so the build fails at the link step with
enough context to fix the underlying issue (memory cap, etc.) rather
than papering over it.
When llvm-dwp errors with "not recognized as a valid object file",
walk the dwo_dir, count ELF / non-ELF / zero-byte files, and call
out flat numeric-named files (lld ThinLTO partition outputs from
--plugin-opt=dwo_dir). Magic bytes are reported so a compression
mismatch (1f8b/28b52ffd) is distinguishable from a malformed stub.
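
A rough sketch of the magic-byte classification, assuming a hypothetical helper `classify_dwo`; the actual diagnostic code in the script is not reproduced here. It distinguishes ELF (7f454c46), gzip (1f8b), and zstd (28b52ffd) from empty or unrecognized files.

```shell
# Classify a file by its leading magic bytes (hypothetical helper).
classify_dwo() {
  local f="$1" magic
  if [ ! -s "$f" ]; then
    echo "zero-byte"
    return 0
  fi
  # First four bytes as a lower-case hex string, e.g. 7f454c46.
  magic=$(head -c 4 "$f" | od -An -tx1 | tr -d ' \n')
  case "$magic" in
    7f454c46) echo "elf" ;;
    1f8b*)    echo "gzip" ;;
    28b52ffd) echo "zstd" ;;
    *)        echo "unknown:$magic" ;;
  esac
}
```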

Lets us diagnose dwp failures from CI logs alone, without machine access.
…Release)

CI Release / End to end tests run on d13faf6 failed with the new
diagnostic message:

  dwo_dir contains 293 .dwo files (10 empty)
    empty: dwo/124.dwo, dwo/159.dwo, dwo/228.dwo, ...
  ERROR: memgraph.dwp missing/empty after llvm-dwp

Release has no -g (CXX_FLAGS_RELEASE = "-O2 -DNDEBUG"), so ThinLTO
codegen has nothing to put into .dwo files -- most come out empty,
llvm-dwp produces an essentially empty .dwp, our "is .dwp non-empty"
check fires (exit 4). Release also runs strip -s on the binary afterwards,
so any .dwp would be orphaned anyway.

The original CMake had a config-mismatch:
* -gsplit-dwarf (compile-time) -> RelWithDebInfo + Debug
* --plugin-opt=dwo_dir (link-time) -> RelWithDebInfo + Release   <- wrong
* dwp post-build -> all three configs                            <- wrong

Fix:
* dwo_dir + dwp post-build are RelWithDebInfo-only (the only config
  that produces useful split-DWARF output through ThinLTO codegen).
* Debug retains its frontend split-dwarf path + dwp.
* --thinlto-jobs cap stays on RelWithDebInfo + Release -- it's an
  OOM safety knob orthogonal to debug info, applies to any LTO link.
Debug CI run on 5e3d23f failed at the dwp post-build with:
  error: './build/src/CMakeFiles/memgraph.dir/memgraph.cpp.dwo':
         No such file or directory

The build dir lives inside the source dir (memgraph/build/), so
-ffile-prefix-map=${CMAKE_SOURCE_DIR}=. was *also* rewriting build
paths -- skeleton CUs ended up with relative DW_AT_GNU_dwo_name like
"./build/src/.../foo.cpp.dwo", and llvm-dwp -e resolved those
against its CWD (build/src/) instead of finding the absolute .dwo.

Fix by appending an identity map for CMAKE_BINARY_DIR after the
source-dir rewrite. clang's -ffile-prefix-map applies rules in order
with last-match wins (verified empirically), so:

  -ffile-prefix-map=${CMAKE_SOURCE_DIR}=.       # rewrites everything under source
  -ffile-prefix-map=${CMAKE_BINARY_DIR}=...     # un-rewrites build subtree (last wins)

Source files (e.g. src/foo.cpp) get the relative-path benefit;
build paths (the .dwo location) stay absolute so llvm-dwp can find
them. Verified with a minimal reproducer: DW_AT_comp_dir for build
paths stays "/abs/build" instead of "./build".
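
The "last matching rule wins" behaviour described above can be modelled with a toy function. This is only a model of the semantics as stated in the commit message, not clang's implementation; the function name and the `/w/mg` paths are invented for illustration.

```shell
# Toy model of last-match-wins prefix mapping.
# Usage: apply_prefix_map PATH OLD=NEW [OLD=NEW ...]
apply_prefix_map() {
  local path="$1" rule old new result
  shift
  result="$path"
  for rule in "$@"; do
    old="${rule%%=*}"
    new="${rule#*=}"
    case "$path" in
      # A later matching rule overwrites the result of an earlier one.
      "$old"*) result="${new}${path#"$old"}" ;;
    esac
  done
  printf '%s\n' "$result"
}

# Source-dir rule only: the build path is rewritten too (the bug).
apply_prefix_map /w/mg/build/src/foo.cpp.dwo /w/mg=.
# prints ./build/src/foo.cpp.dwo

# Identity rule for the build dir appended: the build path stays absolute.
apply_prefix_map /w/mg/build/src/foo.cpp.dwo /w/mg=. /w/mg/build=/w/mg/build
# prints /w/mg/build/src/foo.cpp.dwo
```
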
Previous CI runs that died mid-extraction (OOM kills, SIGBUS during
llvm-dwp, etc.) left the Conan 2 cache index pointing at package
folders that don't exist on disk. The next `conan install` then aborts
with the assertion seen on this PR's run 25240345214:

  rocksdb/8.1.1-memgraph#...: Already installed!
  AssertionError: Pkg '...' folder must exist:
    /home/mg/.conan2/p/rocks86005852e4b26/p

Run `conan cache check-integrity "*"` before each `conan install` and
remove any references it flags. Cheap on a healthy cache (~1 line per
recipe revision); recovers automatically when prior runs left state
broken. Removed packages get re-fetched on the upcoming install, which
is the desired behavior here.
When llvm-dwp fails on a 0-byte .dwo, gather everything that helps
identify the cause without needing to ssh to the runner:

* Per empty .dwo: full stat (size, mtime, mtime delta vs binary).
* Reference non-empty .dwo + binary mtimes to spot stale leftovers
  (large negative delta = file written long before this link =
  almost certainly a leftover from an interrupted previous run;
  ~zero delta = lld emitted it as part of this link).
* The skeleton CU stanza in the binary's .debug_info that references
  the .dwo, dumped via llvm-dwarfdump.
* Function name at DW_AT_low_pc, resolved via the binary's symtab
  using llvm-symbolizer --functions=linkage (works when the .dwo's
  source-side DWARF is missing).
* The link command from build.ninja, truncated to 600 chars, so the
  exact lld invocation is recoverable from CI logs alone.

This is enough context to either file a tight LLVM upstream bug
(here is the lld invocation, here is the empty file's mtime, here
is the function it should describe) or to confirm the file is a
leftover and sweep it.
Stubbing 0-byte .dwo files with a 208-byte minimal valid ELF lets the
build succeed but masks the underlying lld behaviour. Until we
understand whether the empty files come from this run's lld codegen
or from an interrupted previous run, prefer surfacing the failure
loudly over papering over it.

Diagnostics (skeleton CU dump, low_pc -> symbol, mtime delta vs
binary, link command) all stay so the next CI failure has enough
context to either upstream the bug or sweep it as stale state.
When the regular run exits with a signal (rc >= 128, e.g. SIGBUS=135
which is what we've seen in CI), retry under gdb in batch mode and
print 'thread apply all bt full' to stderr. The retry's output is for
diagnostics only -- the original failure code propagates.
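
For reference, shells report a process killed by a signal as 128 plus the signal number, which is how 135 maps to SIGBUS and 137 to SIGKILL on Linux. A small sketch, with a hypothetical helper name:

```shell
# Print the signal number encoded in an exit code, or "none".
signal_from_rc() {
  local rc="$1"
  if [ "$rc" -gt 128 ]; then
    printf '%s\n' $(( rc - 128 ))
  else
    printf 'none\n'
  fi
}

signal_from_rc 135   # prints 7 (SIGBUS on Linux)
signal_from_rc 137   # prints 9 (SIGKILL)
signal_from_rc 1     # prints none
```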

Also raise core dump rlimit and report any core file landed at the
usual locations (./core, dwp.core, /tmp/cores/core), so a debugger
session can attach later if needed.

Local llvm-dwp 20.1.7 exits 1 cleanly on a 0-byte .dwo input rather
than SIGBUS, so this path doesn't fire locally; CI is where this
will earn its keep.
The empty .dwo files in CI were not from lld's normal codegen for empty
modules -- they were from concurrent link tasks overwriting each other.

CI run 25250482438 produced this evidence via the new diagnostics:

  binary mtime:        11:17:09
  EMPTY 119.dwo mtime: 11:19:11  (binary +122s)
  EMPTY 84.dwo  mtime: 11:19:46  (binary +157s)
  EMPTY 193.dwo mtime: 11:20:25  (binary +196s)
  ...
  EMPTY 95.dwo  mtime: 11:22:19  (binary +310s)

The memgraph binary was finalized at 11:17:09 and our POST_BUILD
started shortly after; the empty .dwo files have mtimes 2-5 minutes
*later*. Something was actively truncating them while we ran.
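
The "binary +122s" style deltas come from comparing modification times. A minimal sketch using GNU `stat` and `touch`; the helper name is hypothetical, the calendar date is arbitrary, and only the times of day mirror the CI log above.

```shell
# Print how many seconds FILE was modified after REFERENCE (GNU stat).
mtime_delta() {
  echo $(( $(stat -c %Y "$1") - $(stat -c %Y "$2") ))
}

demo=$(mktemp -d)
# The date is arbitrary; the times of day are taken from the log above.
touch -d '2026-05-02 11:17:09' "$demo/memgraph"
touch -d '2026-05-02 11:19:11' "$demo/119.dwo"

mtime_delta "$demo/119.dwo" "$demo/memgraph"   # prints 122
```

A positive delta means the .dwo was written after the binary was finalized, which is the signature of another link touching the same directory.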

That something is our own build: --plugin-opt=dwo_dir was an
add_link_options(), which makes it global. Every executable link in
the project (memgraph + every test binary) wrote into the same
dwo_dir. lld names .dwo files by task ID starting at 1 per link, so
test binary links concurrent with memgraph's POST_BUILD truncated
memgraph's 1.dwo, 2.dwo, ... via OF_None on open, then wrote their
own (smaller) set. We caught files mid-truncate, hence the SIGBUS in
earlier runs and the failed dwp now.

Move the option to a per-target target_link_options on the memgraph
executable. Verified locally: memgraph's link command still has
--plugin-opt=dwo_dir, the test binary links no longer have it.
Removes the cross-target sharing of dwo_dir that caused the corruption.
…>_dwo)

When --plugin-opt=dwo_dir is omitted, lld writes .dwo files to
<binary>_dwo/ next to the binary. Adopt that same name (memgraph_dwo)
even though we pass the option explicitly, so the layout is consistent
between with-flag and without-flag cases (also matches what test
binaries produce as <test>_dwo/ in the same build tree).

Functionally identical -- just a more discoverable path for anyone
poking around build/ wondering which target a dwo dir belongs to.
Audit pass on the new CMake plumbing:

* --thinlto-jobs is now applied via $<$<OR:$<CONFIG:RelWithDebInfo>,
  $<CONFIG:Release>>:LINKER:...> instead of being inside an
  if(CMAKE_BUILD_TYPE) block. Multi-config generators (VS / Xcode /
  Ninja Multi-Config) handle this correctly now.
* --plugin-opt=dwo_dir on the memgraph target is also gated by
  $<$<CONFIG:RelWithDebInfo>:...> and uses $<TARGET_FILE_DIR:memgraph>
  for the path. dwo dir name follows lld's <binary>_dwo convention.
* The .dwp install rule uses install(FILES $<TARGET_FILE:memgraph>.dwp ...)
  -- multi-config-correct (each config's .dwp lands at the right path).
* Drop file(MAKE_DIRECTORY MG_DWO_DIR) -- lld auto-creates the dir.

Kept as configure-time if(CMAKE_BUILD_TYPE) because making them
genex-driven gets ugly:

* The dwp POST_BUILD COMMAND -- making it conditionally a no-op via
  $<IF:...> for non-Debug/RelWithDebInfo configs would mean a runtime
  config check inside the script and visually noisy CMake. Documented
  with a comment that multi-config users would need to revisit.
* find_program(LLVM_DWP) -- genuinely configure-time.

BYPRODUCTS still uses ${CMAKE_BINARY_DIR}/memgraph.dwp (literal):
$<TARGET_FILE:memgraph>.dwp doesn't evaluate cleanly in BYPRODUCTS for
the add_custom_command(TARGET ...) form ("No target memgraph" at gen
time), even though install(FILES) accepts the same expression.

Verified locally: memgraph link command has both --thinlto-jobs and
--plugin-opt=dwo_dir; test binary links have only --thinlto-jobs (no
dwo_dir collision). dwp pipeline produces a 324 MB memgraph.dwp.
CI run 25251507231 (Release / Core tests) failed at:

  Copying memgraph package from mgbuild_v7_ubuntu-24.04 to host ...
  ##[error]Process completed with exit code 2.

The new copy --package block ran:

  pkg_files=$(docker exec ... bash -c "ls $dir/memgraph*.deb $dir/memgraph*.rpm 2>/dev/null")

On an Ubuntu builder the dir has only .deb files; ls reports no match
for the .rpm glob and exits non-zero. The script's `set -e` propagates
that and the build aborts before we even reach the empty-check.

Use `shopt -s nullglob` so the unmatched glob expands to nothing rather
than failing, and add `|| true` so a still-empty result reaches our
explicit check (which prints a clearer error).
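
A sketch of the nullglob approach, assuming bash and a hypothetical helper `list_packages`; the real copy --package block may be structured differently. With nullglob set, an unmatched pattern expands to nothing instead of being passed to `ls` as a literal that does not exist.

```shell
#!/usr/bin/env bash
# Collect package files without failing when one of the globs has no match.
list_packages() {
  local dir="$1" files
  shopt -s nullglob
  files=("$dir"/memgraph*.deb "$dir"/memgraph*.rpm)
  shopt -u nullglob
  [ "${#files[@]}" -gt 0 ] && printf '%s\n' "${files[@]}"
  return 0
}
```

An empty result then reaches the caller's explicit check, which can print a clear error instead of the build aborting on the exit status of `ls`.
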
Coverage / Core tests run on memgraph__unit__query_streams hit:

  /home/mg/.conan2/.../librdkafka/.../src/rdkafka_request.c:2568:49:
    runtime error: member access within null pointer of type
    'rd_kafka_metadata_internal_t'
  SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior

Same pattern as the other librdkafka entries in this file: third-party
code, well-defined in practice but tripping ubsan. Add an ignore rule
matching the source path so the unit tests that link against
librdkafka stop failing under ubsan.

TODO: file upstream issue at confluentinc/librdkafka.
The per-listener session-timeout thread polled with sleep_for(1s).
Shutdown() flipped alive_ but couldn't wake the sleep, so
AwaitShutdown()'s join() had to wait up to 1s per listener. With
multiple listeners (Bolt, RPC, Websocket) this added up to many
seconds of shutdown lag, occasionally exceeding the e2e runner's
15s assertion window (observed in CI: replication "Constraints"
workload, instance still alive 15.029s after SIGTERM, stuck in
futex_do_wait).

Replace the bare sleep with a condition_variable::wait_for(1s,
predicate). Shutdown() now takes the mutex briefly (to avoid a
lost-wakeup race against the predicate evaluation) and notifies.
The 1s polling cadence for inactivity checks is preserved.
The shutdown sequence emitted only sparse trace logs, with several
multi-second blocking calls (Bolt + websocket await, worker pool
await, module unload, python finalize, plus the synchronous
coordinator_state destructor) covered by no log line. When CI
shutdown hung past the e2e runner's 15s window, the test artifact
log just stopped at "Shutting down websocket server" with no way
to attribute the remaining time to a specific call.

Add a trace before each Shutdown / AwaitShutdown / destructor call
in the main shutdown path. No behavior change.
CI logs show 'Memgraph main loop exited' fires 80ms after SIGTERM but
the process remains alive for 15s (the e2e test runner timeout).
The only code between that trace and process exit is Py_Finalize().
Add bracketing traces to confirm this is the blocker.
…stances

The installed /etc/memgraph/memgraph.conf (from config/flags.yaml) overrides
the binary's default and turns telemetry on, so any e2e workload that didn't
explicitly pass --telemetry-enabled=false ran with telemetry enabled. On
shutdown Telemetry::~Telemetry synchronously POSTs with a 120s timeout,
which can exceed the e2e shutdown budget when the telemetry endpoint is
unreachable or slow.

Inject the flag in interactive_mg_runner so every workload gets telemetry
off regardless of its workloads.yaml.
sonarqubecloud Bot commented May 4, 2026

2 participants