Build flags overhaul + memgraph-debuginfo packaging #4102
Draft
Ignition wants to merge 39 commits into
* -gsplit-dwarf + lld --gdb-index for RelWithDebInfo/Debug; debug info no longer routed through the linker, gdb startup stays fast via the prebuilt .gdb_index. (Inert under ThinLTO RelWithDebInfo, harmless.)
* -gz compresses .debug_* sections (zlib; toolchain clang lacks zstd).
* -frecord-command-line embeds the compile invocation in DW_AT_producer for build-flag forensics on shipped binaries.
* -ffile-prefix-map normalizes source/build paths so debug info is portable across worktrees and ccache hits transfer between checkouts.
* Fix CMAKE_INTERPROCEDURAL_OPTIMIZATION_<config>: per CMake convention the suffix must be upper-case (RELEASE / RELWITHDEBINFO). The mixed-case spelling was silently ignored, so IPO has been a no-op since it was uncommented in 81bee93 (2024-04-30); ThinLTO is now actually applied.
* Add llvm-dwp post-build for Debug to bundle per-TU .dwo files into memgraph.dwp; ThinLTO configs produce no .dwo files so they're excluded.
Pass --plugin-opt=dwo_dir=<build>/dwo to lld so its ThinLTO backend emits per-module .dwo files during link-time codegen. The frontend -gsplit-dwarf flag alone produced no .dwo files in LTO configs because per-TU .o files contain bitcode, not native code with debug info.
Effect on RelWithDebInfo memgraph:
- binary: 430 MB -> 337 MB (debug info no longer baked in)
- memgraph.dwp: 310 MB sidecar bundling 293 .dwo files
- gdb auto-locates the .dwp by binary path, so debugging still works with one extra file shipped alongside the executable.
Also extend the llvm-dwp post-build step to RelWithDebInfo and Release.
Foundation for shipping memgraph.dwp as a separate -debuginfo package (Stage 1 of multi-stage work; component packaging itself is gated off).
* Move dwp POST_BUILD before strip. Previously strip ran first for Release, which deleted the skeleton CUs that llvm-dwp uses to discover .dwo inputs -- the dwp output for Release was unreliable.
* New install rule for memgraph.dwp under COMPONENT debuginfo. The default monolithic cpack run sets CPACK_COMPONENTS_ALL=memgraph so this component is excluded from existing memgraph_*.deb / .rpm artifacts (no behavior change yet).
Enable per-component DEB packaging so cpack -G DEB now produces:
* memgraph_<ver>_<arch>.deb -- main package (binary, libstdc++, etc.)
* memgraph-debuginfo_<ver>_<arch>.deb -- the .dwp file only
memgraph-debuginfo Depends: memgraph (= <exact-version>) so the two packages stay in lockstep. The debuginfo package lands the .dwp at /usr/lib/memgraph/memgraph.dwp where gdb auto-finds it next to the binary; no separate debuginfod or wrapper needed for local debugging.
Renamed CPACK_DEBIAN_PACKAGE_CONTROL_EXTRA -> CPACK_DEBIAN_MEMGRAPH_PACKAGE_CONTROL_EXTRA so the maintainer scripts (preinst/postinst/prerm/postrm that create the memgraph user, install systemd unit, etc.) only run for the main package, not for debuginfo.
RPM side of the debuginfo split. Mirrors the DEB structure:
* CPACK_RPM_COMPONENT_INSTALL=ON enables per-component packaging.
* The custom spec file (memgraph.spec.in) is now scoped to the main component via CPACK_RPM_MEMGRAPH_USER_BINARY_SPECFILE; %prep surgery (systemd unit move, perms) only ever applied to the main package.
* The debuginfo component falls back to CPack's auto-generated spec -- fine for a single-file payload.
* memgraph-debuginfo Requires: memgraph = <version>-1 keeps the two packages in lockstep.
Also tighten globs in release/package/mgbuild.sh that previously matched "memgraph*" and would now sweep up both packages:
* rpmlint runs only on the main rpm (memgraph-[0-9]*.rpm) so the auto-generated debuginfo rpm doesn't trip distro-specific lint rules.
* dpkg -c contents check uses memgraph_*.deb (with underscore) to pick only the main DEB.
RPM packaging cannot be validated locally (no rpmbuild on dev host); Stage 4 CI run is the source of truth.
End-to-end verification in containers:
* DEB: ubuntu:24.04, install main + memgraph-debuginfo, gdb finds .dwp
via auto-locate, source-level debugging works.
* RPM: fedora:40, install main + memgraph-debuginfo, same.
* Negative tests: hiding the .dwp restores the "Could not find DWO CU"
warning; uninstalling the debuginfo package removes the .dwp without
touching the main package.
Fixes uncovered during testing:
1. CMAKE_INSTALL_DEFAULT_COMPONENT_NAME was set on src/CMakeLists.txt
line ~117, which is *after* every add_subdirectory() call. Subdir
install rules captured the previous default ("Unspecified") and
their files never made it into the memgraph component's staging
tree. Moved to the top of src/CMakeLists.txt.
2. Install rules outside src/ (systemd unit in release/, licenses and
mgconsole in top-level CMakeLists.txt) didn't inherit the default
either; tagged each with COMPONENT memgraph explicitly.
3. CPack auto-suffixes component names into RPM package names
(-> memgraph-memgraph). Set CPACK_RPM_MEMGRAPH_PACKAGE_NAME=memgraph
so the debuginfo's "Requires: memgraph = <ver>" matches.
4. Rpmbuild's automatic debug extraction (find-debuginfo.sh, brp-strip)
would mangle our pre-built .dwp into a 200-byte ELF stub. Disabled
via CPACK_RPM_SPEC_MORE_DEFINE: this is correct -- ThinLTO + split-
dwarf produces only skeleton CUs in the binary, so rpmbuild's
extraction would have nothing useful to extract anyway.
Distro gdb (15.x in ubuntu:24.04) segfaults on our DWARF 5 + ThinLTO + split-DWARF binaries -- verified with "info address main" against the installed memgraph package: gdb 15 hits an internal "fatal error" and prints a "please report it" message; gdb 16.2 returns the symbol cleanly. This means run_with_gdb.sh, the in-container crash-capture wrapper, was itself crashing on top of any real memgraph crash.
Ship a stripped toolchain gdb (~12 MB binary + 1.2 MB share/gdb data files) inside the v6/v7 relwithdebinfo images. run_with_gdb.sh prefers it over distro gdb when present, falling back if absent. The bundle is built inside the mgbuild container (where the toolchain lives) and copied out via the same pattern as heaptrack:
* tools/ci/build-gdb-bundle.sh -- stages /opt/toolchain-vN/bin/gdb + share/gdb into /tmp/gdb-bundle inside the build container.
* mgbuild.sh build-gdb-bundle / copy-gdb-bundle -- exposes that to the host so package_docker can COPY it into the image.
* package_docker errors early if release/docker/gdb-bundle/ is absent for a relwithdebinfo build; this surfaces missing CI wiring as a build error instead of a silent gdb-15 crash later.
run_with_gdb.sh also gains a "set substitute-path ./ /home/mg/memgraph/" when the source tree is COPY'd into the image -- with the new -ffile-prefix-map, debug info points at "./src/memgraph.cpp" and this substitution lets gdb actually open the file from the source we already ship in relwithdebinfo images.
Verified locally: ubuntu:24.04 + apt-installed runtime libs + the bundled gdb resolves source via the .dwp from memgraph-debuginfo.deb; "info address main" succeeds; control test confirms distro gdb 15 crashes on the same input.
v5_deb_relwithdebinfo.dockerfile (debian:12) is intentionally not covered: the toolchain gdb 16.2 was built against ubuntu:24.04 ABI (libpython3.12, libreadline.so.8) and won't run on debian:12 without matching libs. v5 retains distro gdb 13 for now.
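The "prefer the bundled gdb, fall back to distro gdb" selection in run_with_gdb.sh could look roughly like the sketch below. This is an illustration only, not the script from this PR: the function name pick_gdb and the idea of passing the bundled path as an argument are assumptions, and the real install path of the bundle inside the image is not stated here.

```shell
# Sketch: choose which gdb to run.
# $1 is a hypothetical path to the bundled toolchain gdb; the actual
# location used by run_with_gdb.sh is an assumption left to the caller.
pick_gdb() {
  local bundled="$1"
  if [ -n "$bundled" ] && [ -x "$bundled" ]; then
    # Bundled gdb exists and is executable: prefer it.
    printf '%s\n' "$bundled"
  elif command -v gdb >/dev/null 2>&1; then
    # Fall back to whatever gdb the distro put on PATH.
    command -v gdb
  else
    # No gdb at all; let the caller decide how to report it.
    return 1
  fi
}
```

A caller would then run something like `gdb_bin=$(pick_gdb "$some_bundle_path")` and invoke "$gdb_bin" instead of a hard-coded gdb.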
Component packaging emits two artifacts; the workflow assumed one. Without these fixes, the rename step picks the alphabetically-first match (the debuginfo package) and tags it as the main package, the S3 URL output points at whichever file ls -t happens to return, and the debuginfo bundle would otherwise leak onto the public download bucket.
* Rename step disambiguates with explicit globs (memgraph_*.deb vs memgraph-debuginfo_*.deb; memgraph-[0-9]*.rpm vs memgraph-debuginfo-*.rpm) and renames both files in lockstep on master.
* S3 URL output emits only the main package URL; debuginfo lives next to it and shares the URL stem.
* aws s3 sync excludes memgraph-debuginfo* -- debuginfo is intentionally NOT pushed to download.memgraph.com (~300 MB per build is too much for the public mirror; recoverable from GitHub artifacts when needed).
* upload-artifact path glob "memgraph*" already catches both -- no change needed there.
Adds workflow steps to build and copy the toolchain gdb bundle into release/docker/ during relwithdebinfo docker builds; the dockerfile COPYs it from there and run_with_gdb.sh prefers it over distro gdb.
The "Enterprise DEB package" artifact in diff_release / diff_malloc / reusable_release_tests previously globbed memgraph*.deb. With component packaging enabled, that now matches both memgraph_*.deb and memgraph-debuginfo_*.deb -- bloats the artifact (~75 MB extra per run) and any downstream step that picks the first match by name would get the debuginfo package instead of the main one.
These workflows are testing the main package, not the debug bundle. Tighten the glob to memgraph_*.deb (note underscore) to keep artifact size and selection behavior unchanged from before component packaging.
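To see why the underscore matters, here is a small sketch of the tightened glob. The helper name list_main_debs is hypothetical; only the memgraph_*.deb pattern comes from the text above.

```shell
# Sketch: list only main DEB packages in a directory.
# memgraph_*.deb matches memgraph_<ver>_<arch>.deb but not
# memgraph-debuginfo_<ver>_<arch>.deb, because the character after
# "memgraph" must be an underscore.
list_main_debs() {
  local dir="$1" f
  for f in "$dir"/memgraph_*.deb; do
    # An unmatched glob stays literal, so skip paths that do not exist.
    if [ -e "$f" ]; then
      basename "$f"
    fi
  done
}
```

With both packages present in the directory, only the main one is printed; the looser memgraph*.deb pattern would have returned both.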
Two related bugs that meant the relwithdebinfo image had a bundled gdb 16 with no debug info to read:
1. mgbuild.sh package_docker picked the package by mtime (ls -t memgraph* | head -1). With component packaging, cpack writes the debuginfo deb after the main deb, so the docker image was actually installing memgraph-debuginfo (a 310 MB .dwp) and *not* the binary -- the entrypoint /usr/lib/memgraph/memgraph wouldn't even exist.
2. Even with the right main package, the dockerfile only installed one .deb. The debuginfo sibling went unused, so gdb in the image would still warn "Could not find DWO CU".
* mgbuild.sh package_docker picks the main package via `grep -v debuginfo` and discovers the matching debuginfo sibling next to it, passing it through as --debuginfo-package-path.
* Release builds are unchanged: production image stays slim, no .dwp.
* package_docker plumbs --debuginfo-package-path into a DEBUGINFO_BINARY_NAME build arg and stages the second deb in the build context.
* v6/v7 relwithdebinfo dockerfiles dpkg -i the debuginfo deb after the main, gated on DEBUGINFO_BINARY_NAME being set, via a read-only bind mount of the build context (avoids an extra COPY layer for a 76 MB file we'd just rm afterwards).
Verified locally: a minimal ubuntu:24.04 image mirroring the relwithdebinfo flow installs both packages, gdb finds the .dwp at /usr/lib/memgraph/memgraph.dwp, and `list main` resolves source via the bundled gdb 16.
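A minimal sketch of the package selection described above, assuming the DEB naming from the component packaging commit (memgraph_<ver>_<arch>.deb and memgraph-debuginfo_<ver>_<arch>.deb). The function names are hypothetical, and how mgbuild.sh actually derives the sibling path is not shown in this PR text, so the name rewrite below is an assumption.

```shell
# Sketch: pick the newest main package, ignoring debuginfo.
pick_main_deb() {
  local dir="$1"
  # Newest first, then drop anything with "debuginfo" in the name, so a
  # later-written debuginfo deb can no longer win on mtime.
  ls -t "$dir"/memgraph*.deb 2>/dev/null | grep -v debuginfo | head -n 1
}

# Sketch: derive the debuginfo sibling path from the main package path.
# Assumes both files share the <ver>_<arch>.deb suffix.
debuginfo_sibling() {
  local main="$1" dir base
  dir=$(dirname "$main")
  base=$(basename "$main")
  printf '%s/memgraph-debuginfo_%s\n' "$dir" "${base#memgraph_}"
}
```

The caller would still need to check that the sibling file exists before passing it on, since Release builds are described as shipping no .dwp.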
fb1fbbd to eb71c9a
Is this PR doing some of the work done here? #4079
Frontend split-DWARF (Debug builds, no LTO) records the .dwo file path
in each compilation unit's skeleton CU as DW_AT_GNU_dwo_name. With
-ffile-prefix-map=${CMAKE_BINARY_DIR}=./build, that path got rewritten
from /abs/build/src/CMakeFiles/.../foo.cpp.dwo to ./build/src/.../foo.cpp.dwo.
llvm-dwp -e <binary> reads those paths to locate the .dwo inputs. The
relative form resolves against llvm-dwp's CWD (build/src), which is
wrong, so it fails:
error: './build/src/CMakeFiles/memgraph.dir/memgraph.cpp.dwo': No such
file or directory
This took out Debug, Coverage, and Community CI jobs.
The binary-dir prefix-map was speculative -- generated/build artifacts
in __FILE__ don't matter much in practice. Source-dir prefix-map is
the genuinely useful one for cross-worktree ccache hits and stays.
ThinLTO configs (RelWithDebInfo, Release) were unaffected by this bug:
their skeleton CUs are written by lld at link-time codegen, which
doesn't honor compile-time -ffile-prefix-map and emits absolute paths.
Yes, an independent experiment. :)
…nk OOM cap
Three additions to the existing split-DWARF + .dwp packaging so it plays naturally with the standard Linux debug-info ecosystem:
* .gnu_debuglink section pointing at memgraph.dwp. gdb's debuglink resolver then also tries /usr/lib/debug/<install-path>/<basename>.dwp and ./.debug/<basename>.dwp -- distro-conventional fallback paths.
* Install-time build-id-keyed symlink at /usr/lib/debug/.build-id/<aa>/<rest>.dwp pointing at the real .dwp in /usr/lib/memgraph/. gdb 16+ resolves split-DWARF debug info via this path; debuginfod proxies serving from the same layout work too. Logic factored into cmake/install-build-id-symlink.cmake to avoid the install(CODE) escaping mess.
* tools/ci/upload-debug-symbols.sh: uploads .dwp files to a build-id-keyed S3 bucket at <prefix>/<aa>/<rest>.dwp -- the debuginfod URL scheme. Adapted from PR #4079's script (which uploads .debug files for the alternate split-debug approach there); a future debuginfod proxy in front of the bucket needs zero re-indexing.
Also tame ThinLTO link memory: lld's parallel ThinLTO codegen at link time uses ~3-5 GB per module on heavy boost/template TUs, so the Community / Core tests CI job OOM'd (exit 137) on RelWithDebInfo without a cap. mgbuild.sh now auto-applies --link-threads via compute-build-threads.sh (already in master, 4 GB/thread budget) when build_type is RelWithDebInfo or Release and no explicit value was passed. Caller-supplied --link-threads still wins.
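The <aa>/<rest>.dwp layout above splits a build ID into its first two hex digits and the remainder. A sketch of that split follows; build_id_relpath is a hypothetical helper name, and how the build ID is read from the binary is left out on purpose.

```shell
# Sketch: turn a hex build ID into the <aa>/<rest>.dwp relative path
# used under /usr/lib/debug/.build-id/ and under the S3 prefix.
build_id_relpath() {
  local id="$1" aa rest
  rest="${id#??}"        # everything after the first two hex digits
  aa="${id%"$rest"}"     # the first two hex digits
  printf '%s/%s.dwp\n' "$aa" "$rest"
}
```

Because the symlink and the upload use the same relative path, one helper like this could serve both the install-time symlink and the upload script.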
The previous setup used two add_custom_command calls chained into the memgraph link's POST_BUILD edge. When that ninja edge fails, ninja's combined-edge output shows just the COMMENTs and a "FAILED [code=1]" -- no clue which of the steps failed or why.
Move the work into tools/ci/dwp-and-debuglink.sh which logs:
* resolved llvm-dwp / objcopy / binary / dwo_dir paths
* dwo file count under dwo_dir before invoking llvm-dwp
* the exact llvm-dwp + objcopy commands as they run
* dwp output size after llvm-dwp
* explicit error messages with non-zero exit codes pinned to a step
Also print the resolved LLVM_DWP / CMAKE_OBJCOPY / MG_DWO_DIR at configure time so CI logs answer "did the tool even get found?" before build starts.
Let's sync about this on Tuesday if you're about. #4079 is in a mostly working state in that, if I build with it, it correctly pulls debugging symbols from the API and allows me to generate stack traces from them - I mostly need to add the infra + update workflows. Hopefully this doesn't conflict too much 😂
CI Community / Core tests run on the previous commit pinpointed the failure to llvm-dwp itself crashing with SIGBUS (exit 135). Bus error on llvm-dwp is almost always an mmap fault on an empty .dwo input -- lld's ThinLTO codegen sometimes writes 0-byte .dwo files for modules with no debug info, and llvm-dwp mmaps then dereferences them.
Two things:
1. The .dwp is a debug aid; killing the whole link because llvm-dwp crashed is the wrong trade-off. The script now catches non-zero exits, logs llvm-dwp's stderr in full, removes any partial .dwp, and exits 0 so the link target succeeds. memgraph-debuginfo is OPTIONAL in the install rules already, so a missing .dwp just means an empty debuginfo package for that build (with a clear warning in CI logs).
2. Add diagnostics for the suspected root cause: count empty .dwo files, list the first few, and report disk-free in the build dir before running llvm-dwp. The next CI failure will say either "yes, N empty .dwo files" (confirming the theory) or rule it out.
lld's ThinLTO link defaults to --thinlto-jobs=$(nproc), which on a 24-thread CI runner spawns 24 parallel codegen jobs each using 1-3 GB -- a single link can peak at 24-72 GB. With multiple links in flight under ninja, that OOMs the worker (Community / Core tests SIGBUS in llvm-dwp downstream of this).
Compute the cap memory-aware at configure time via tools/ci/compute-build-threads.sh, sized for ~3 GB per codegen job. Applied to RelWithDebInfo and Release (the configs with ThinLTO active); lld silently ignores the flag in non-LTO links so it's safe everywhere.
The existing --link-threads job-pool cap (4 GB/slot) remains and is unchanged; it limits concurrent links, while --thinlto-jobs limits internal codegen parallelism within each link.
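The memory-aware cap boils down to "available memory divided by the per-job budget, clamped to the CPU count and to at least one job". The sketch below shows that arithmetic only. It is not the contents of tools/ci/compute-build-threads.sh, which this PR does not show; the function name and argument order are assumptions.

```shell
# Sketch: compute a --thinlto-jobs value from memory and CPU count.
# $1 = available memory in GB, $2 = CPU count, $3 = GB budget per job
# (defaults to 3, matching the ~3 GB per codegen job quoted above).
compute_thinlto_jobs() {
  local mem_gb="$1" cpus="$2" gb_per_job="${3:-3}" jobs
  jobs=$((mem_gb / gb_per_job))
  # Never exceed the number of CPUs.
  if [ "$jobs" -gt "$cpus" ]; then jobs="$cpus"; fi
  # Always allow at least one job so the link can make progress.
  if [ "$jobs" -lt 1 ]; then jobs=1; fi
  printf '%s\n' "$jobs"
}
```

For example, under this model a 24-thread runner with 32 GB of memory would get 10 codegen jobs instead of 24.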
Reverts the soft-fail behavior. A missing .dwp means memgraph-debuginfo packages ship empty, which is worse than the build failing loudly: a quiet broken release is hard to notice and harder to debug after the fact. Diagnostic logging stays (tool paths, dwo file count, empty .dwo detection, disk free, captured llvm-dwp stderr); on failure we now propagate the exit code so the build fails at the link step with enough context to fix the underlying issue (memory cap, etc.) rather than papering over it.
When llvm-dwp errors with "not recognized as a valid object file", walk the dwo_dir, count ELF / non-ELF / zero-byte files, and call out flat numeric-named files (lld ThinLTO partition outputs from --plugin-opt=dwo_dir). Magic bytes are reported so a compression mismatch (1f8b/28b52ffd) is distinguishable from a malformed stub. Lets us diagnose dwp failures from CI logs alone, without machine access.
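A sketch of the magic-byte classification described above, using only od and tr. The function name classify_dwo is hypothetical. The byte values themselves are standard: 7f454c46 is the ELF magic, 1f8b starts a gzip stream, and 28b52ffd starts a zstd frame.

```shell
# Sketch: classify a .dwo file by size and leading magic bytes.
classify_dwo() {
  local file="$1" magic
  if [ ! -s "$file" ]; then
    echo "empty"
    return 0
  fi
  # First four bytes as lowercase hex with separators stripped.
  magic=$(od -An -tx1 -N4 "$file" | tr -d ' \t\n')
  case "$magic" in
    7f454c46) echo "elf" ;;
    1f8b*)    echo "gzip" ;;
    28b52ffd) echo "zstd" ;;
    *)        echo "unknown:$magic" ;;
  esac
}
```

Printing the raw hex for the unknown case keeps the CI log useful even when the file is none of the expected kinds.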
…Release)
CI Release / End to end tests run on d13faf6 failed with the new diagnostic message:
dwo_dir contains 293 .dwo files (10 empty)
empty: dwo/124.dwo, dwo/159.dwo, dwo/228.dwo, ...
ERROR: memgraph.dwp missing/empty after llvm-dwp
Release has no -g (CXX_FLAGS_RELEASE = "-O2 -DNDEBUG"), so ThinLTO codegen has nothing to put into .dwo files -- most come out empty, llvm-dwp produces an essentially empty .dwp, our "is .dwp non-empty" check fires (exit 4). Plus Release strip-s the binary afterwards, so any .dwp would be orphaned anyway.
The original CMake had a config-mismatch:
* -gsplit-dwarf (compile-time) -> RelWithDebInfo + Debug
* --plugin-opt=dwo_dir (link-time) -> RelWithDebInfo + Release <- wrong
* dwp post-build -> all three configs <- wrong
Fix:
* dwo_dir + dwp post-build are RelWithDebInfo-only (the only config that produces useful split-DWARF output through ThinLTO codegen).
* Debug retains its frontend split-dwarf path + dwp.
* --thinlto-jobs cap stays on RelWithDebInfo + Release -- it's an OOM safety knob orthogonal to debug info, applies to any LTO link.
Debug CI run on 5e3d23f failed at the dwp post-build with:
error: './build/src/CMakeFiles/memgraph.dir/memgraph.cpp.dwo': No such file or directory
The build dir lives inside the source dir (memgraph/build/), so -ffile-prefix-map=${CMAKE_SOURCE_DIR}=. was *also* rewriting build paths -- skeleton CUs ended up with relative DW_AT_GNU_dwo_name like "./build/src/.../foo.cpp.dwo", and llvm-dwp -e resolved those against its CWD (build/src/) instead of finding the absolute .dwo.
Fix by appending an identity map for CMAKE_BINARY_DIR after the source-dir rewrite. clang's -ffile-prefix-map applies rules in order with last-match wins (verified empirically), so:
-ffile-prefix-map=${CMAKE_SOURCE_DIR}=. # rewrites everything under source
-ffile-prefix-map=${CMAKE_BINARY_DIR}=... # un-rewrites build subtree (last wins)
Source files (e.g. src/foo.cpp) get the relative-path benefit; build paths (the .dwo location) stay absolute so llvm-dwp can find them. Verified with a minimal reproducer: DW_AT_comp_dir for build paths stays "/abs/build" instead of "./build".
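The "last match wins" behaviour described above can be illustrated with a toy model. This is not clang's implementation, only a shell function that applies OLD=NEW rules in order and keeps the result of the last rule whose OLD is a prefix of the path. The function name and the /abs/... paths are made up for the example.

```shell
# Toy model of ordered prefix-map rules where the last matching rule wins.
# $1 = path, remaining arguments = rules of the form OLD=NEW.
apply_prefix_map() {
  local path="$1" result rule old new
  shift
  result="$path"
  for rule in "$@"; do
    old="${rule%%=*}"
    new="${rule#*=}"
    case "$path" in
      # Each rule is tested against the original path; a later match
      # overwrites the result of an earlier one.
      "$old"*) result="${new}${path#"$old"}" ;;
    esac
  done
  printf '%s\n' "$result"
}
```

Under this model, with the rules "/abs/src=." followed by "/abs/src/build=/abs/src/build", the path /abs/src/src/foo.cpp becomes ./src/foo.cpp while /abs/src/build/src/foo.cpp.dwo is left absolute, which is the effect the identity map is meant to have.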
Previous CI runs that died mid-extraction (OOM kills, SIGBUS during
llvm-dwp, etc.) left the Conan 2 cache index pointing at package
folders that don't exist on disk. The next `conan install` then aborts
with the assertion seen on this PR's run 25240345214:
rocksdb/8.1.1-memgraph#...: Already installed!
AssertionError: Pkg '...' folder must exist:
/home/mg/.conan2/p/rocks86005852e4b26/p
Run `conan cache check-integrity "*"` before each `conan install` and
remove any references it flags. Cheap on a healthy cache (~1 line per
recipe revision); recovers automatically when prior runs left state
broken. Removed packages get re-fetched on the upcoming install, which
is the desired behavior here.
When llvm-dwp fails on a 0-byte .dwo, gather everything that helps identify the cause without needing to ssh to the runner:
* Per empty .dwo: full stat (size, mtime, mtime delta vs binary).
* Reference non-empty .dwo + binary mtimes to spot stale leftovers (large negative delta = file written long before this link = almost certainly a leftover from an interrupted previous run; ~zero delta = lld emitted it as part of this link).
* The skeleton CU stanza in the binary's .debug_info that references the .dwo, dumped via llvm-dwarfdump.
* Function name at DW_AT_low_pc, resolved via the binary's symtab using llvm-symbolizer --functions=linkage (works when the .dwo's source-side DWARF is missing).
* The link command from build.ninja, truncated to 600 chars, so the exact lld invocation is recoverable from CI logs alone.
This is enough context to either file a tight LLVM upstream bug (here is the lld invocation, here is the empty file's mtime, here is the function it should describe) or to confirm the file is a leftover and sweep it.
Stubbing 0-byte .dwo files with a 208-byte minimal valid ELF lets the build succeed but masks the underlying lld behaviour. Until we understand whether the empty files come from this run's lld codegen or from an interrupted previous run, prefer surfacing the failure loudly over papering over it. Diagnostics (skeleton CU dump, low_pc -> symbol, mtime delta vs binary, link command) all stay so the next CI failure has enough context to either upstream the bug or sweep it as stale state.
When the regular run exits with a signal (rc >= 128, e.g. SIGBUS=135 which is what we've seen in CI), retry under gdb in batch mode and print 'thread apply all bt full' to stderr. The retry's output is for diagnostics only -- the original failure code propagates. Also raise core dump rlimit and report any core file landed at the usual locations (./core, dwp.core, /tmp/cores/core), so a debugger session can attach later if needed. Local llvm-dwp 20.1.7 exits 1 cleanly on a 0-byte .dwo input rather than SIGBUS, so this path doesn't fire locally; CI is where this will earn its keep.
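The "rc >= 128 means killed by a signal" check relies on the shell convention of reporting 128 plus the signal number. A sketch of a wrapper built on that convention follows. The function names are hypothetical and the gdb retry itself is only indicated by a message, not implemented.

```shell
# Sketch: map a shell exit status to a signal number, if any.
signal_from_rc() {
  local rc="$1"
  if [ "$rc" -ge 128 ]; then
    printf '%s\n' "$((rc - 128))"
  else
    return 1
  fi
}

# Sketch: run a command, note signal deaths, and propagate the
# original exit status unchanged.
run_with_diagnostics() {
  local rc=0 sig
  "$@" || rc=$?
  if sig=$(signal_from_rc "$rc"); then
    # A real script would retry under gdb in batch mode here.
    echo "command died with signal $sig (rc=$rc)" >&2
  fi
  return "$rc"
}
```

Exit 135 maps to signal 7, which is SIGBUS on Linux x86 and ARM, matching the failure seen in CI.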
The empty .dwo files in CI were not from lld's normal codegen for empty modules -- they were from concurrent link tasks overwriting each other. CI run 25250482438 produced this evidence via the new diagnostics:
binary mtime: 11:17:09
EMPTY 119.dwo mtime: 11:19:11 (binary +122s)
EMPTY 84.dwo mtime: 11:19:46 (binary +157s)
EMPTY 193.dwo mtime: 11:20:25 (binary +196s)
...
EMPTY 95.dwo mtime: 11:22:19 (binary +310s)
The memgraph binary was finalized at 11:17:09 and our POST_BUILD started shortly after; the empty .dwo files have mtimes 2-5 minutes *later*. Something was actively truncating them while we ran.
That something is our own build: --plugin-opt=dwo_dir was an add_link_options(), which makes it global. Every executable link in the project (memgraph + every test binary) wrote into the same dwo_dir. lld names .dwo files by task ID starting at 1 per link, so test binary links concurrent with memgraph's POST_BUILD truncated memgraph's 1.dwo, 2.dwo, ... via OF_None on open, then wrote their own (smaller) set. We caught files mid-truncate, hence the SIGBUS in earlier runs and the failed dwp now.
Move the option to a per-target target_link_options on the memgraph executable. Verified locally: memgraph's link command still has --plugin-opt=dwo_dir, the test binary links no longer have it. Removes the cross-target sharing of dwo_dir that caused the corruption.
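The tell-tale sign in the evidence above is an empty .dwo whose mtime is later than the binary's. A sketch of a check for exactly that condition follows; the function name is hypothetical and this is not the diagnostic code from tools/ci/dwp-and-debuglink.sh.

```shell
# Sketch: print empty .dwo files that were modified after the binary.
# $1 = path to the linked binary, $2 = directory holding the .dwo files.
empty_dwo_newer_than() {
  local binary="$1" dwo_dir="$2" f
  for f in "$dwo_dir"/*.dwo; do
    # -s is true for non-empty files; -nt compares modification times.
    if [ -e "$f" ] && [ ! -s "$f" ] && [ "$f" -nt "$binary" ]; then
      basename "$f"
    fi
  done
}
```

Empty files older than the binary are skipped here on purpose: those look like stale leftovers rather than something being truncated during the current run.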
…>_dwo)
When --plugin-opt=dwo_dir is omitted, lld writes .dwo files to <binary>_dwo/ next to the binary. Adopt that same name (memgraph_dwo) even though we pass the option explicitly, so the layout is consistent between with-flag and without-flag cases (also matches what test binaries produce as <test>_dwo/ in the same build tree).
Functionally identical -- just a more discoverable path for anyone poking around build/ wondering which target a dwo dir belongs to.
Audit pass on the new CMake plumbing:
* --thinlto-jobs is now applied via $<$<OR:$<CONFIG:RelWithDebInfo>,
$<CONFIG:Release>>:LINKER:...> instead of being inside an
if(CMAKE_BUILD_TYPE) block. Multi-config generators (VS / Xcode /
Ninja Multi-Config) handle this correctly now.
* --plugin-opt=dwo_dir on the memgraph target is also gated by
$<$<CONFIG:RelWithDebInfo>:...> and uses $<TARGET_FILE_DIR:memgraph>
for the path. dwp dir name follows lld's <binary>_dwo convention.
* The .dwp install rule uses install(FILES $<TARGET_FILE:memgraph>.dwp ...)
-- multi-config-correct (each config's .dwp lands at the right path).
* Drop file(MAKE_DIRECTORY MG_DWO_DIR) -- lld auto-creates the dir.
Kept as configure-time if(CMAKE_BUILD_TYPE) because making them
genex-driven gets ugly:
* The dwp POST_BUILD COMMAND -- making it conditionally a no-op via
$<IF:...> for non-Debug/RelWithDebInfo configs would mean a runtime
config check inside the script and visually noisy CMake. Documented
with a comment that multi-config users would need to revisit.
* find_program(LLVM_DWP) -- genuinely configure-time.
BYPRODUCTS still uses ${CMAKE_BINARY_DIR}/memgraph.dwp (literal):
$<TARGET_FILE:memgraph>.dwp doesn't evaluate cleanly in BYPRODUCTS for
the add_custom_command(TARGET ...) form ("No target memgraph" at gen
time), even though install(FILES) accepts the same expression.
Verified locally: memgraph link command has both --thinlto-jobs and
--plugin-opt=dwo_dir; test binary links have only --thinlto-jobs (no
dwo_dir collision). dwp pipeline produces a 324 MB memgraph.dwp.
CI run 25251507231 (Release / Core tests) failed at:
Copying memgraph package from mgbuild_v7_ubuntu-24.04 to host ...
##[error]Process completed with exit code 2.
The new copy --package block ran:
pkg_files=$(docker exec ... bash -c "ls $dir/memgraph*.deb $dir/memgraph*.rpm 2>/dev/null")
On an Ubuntu builder the dir has only .deb files; ls reports no match for the .rpm glob and exits non-zero. The script's `set -e` propagates that and the build aborts before we even reach the empty-check.
Use `shopt -s nullglob` so the unmatched glob expands to nothing rather than failing, and add `|| true` so a still-empty result reaches our explicit check (which prints a clearer error).
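A sketch of the nullglob behaviour relied on above, run through bash -c in the same way the copy step does. The function name list_packages is hypothetical, and the sketch loops over the globs instead of calling ls, which is a deviation from the command quoted in the commit message: ls with no arguments would list the current directory if both globs expanded to nothing.

```shell
# Sketch: list memgraph packages without failing on an unmatched glob.
# Uses bash explicitly because shopt is a bash builtin.
list_packages() {
  local dir="$1"
  bash -c '
    shopt -s nullglob
    # With nullglob set, a pattern with no matches expands to nothing,
    # so a directory with only .deb files yields only .deb paths.
    for f in "$1"/memgraph*.deb "$1"/memgraph*.rpm; do
      printf "%s\n" "$f"
    done
  ' _ "$dir"
}
```

An empty result exits 0 here, so the caller's own "no package found" check is what reports the error.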
Coverage / Core tests run on memgraph__unit__query_streams hit:
/home/mg/.conan2/.../librdkafka/.../src/rdkafka_request.c:2568:49:
runtime error: member access within null pointer of type
'rd_kafka_metadata_internal_t'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
Same pattern as the other librdkafka entries in this file: third-party
code, well-defined in practice but tripping ubsan. Add an ignore rule
matching the source path so the unit tests that link against
librdkafka stop failing under ubsan.
TODO: file upstream issue at confluentinc/librdkafka.
The per-listener session-timeout thread polled with sleep_for(1s). Shutdown() flipped alive_ but couldn't wake the sleep, so AwaitShutdown()'s join() had to wait up to 1s per listener. With multiple listeners (Bolt, RPC, Websocket) this added up to many seconds of shutdown lag, occasionally exceeding the e2e runner's 15s assertion window (observed in CI: replication "Constraints" workload, instance still alive 15.029s after SIGTERM, stuck in futex_do_wait). Replace the bare sleep with a condition_variable::wait_for(1s, predicate). Shutdown() now takes the mutex briefly (to avoid a lost-wakeup race against the predicate evaluation) and notifies. The 1s polling cadence for inactivity checks is preserved.
The shutdown sequence emitted only sparse trace logs, with several multi-second blocking calls (Bolt + websocket await, worker pool await, module unload, python finalize, plus the synchronous coordinator_state destructor) covered by no log line. When CI shutdown hung past the e2e runner's 15s window, the test artifact log just stopped at "Shutting down websocket server" with no way to attribute the remaining time to a specific call. Add a trace before each Shutdown / AwaitShutdown / destructor call in the main shutdown path. No behavior change.
CI logs show 'Memgraph main loop exited' fires 80ms after SIGTERM but the process remains alive for 15s (the e2e test runner timeout). The only code between that trace and process exit is Py_Finalize(). Add bracketing traces to confirm this is the blocker.
…stances
The installed /etc/memgraph/memgraph.conf (from config/flags.yaml) overrides the binary's default and turns telemetry on, so any e2e workload that didn't explicitly pass --telemetry-enabled=false ran with telemetry enabled. On shutdown Telemetry::~Telemetry synchronously POSTs with a 120s timeout, which can exceed the e2e shutdown budget when the telemetry endpoint is unreachable or slow.
Inject the flag in interactive_mg_runner so every workload gets telemetry off regardless of its workloads.yaml.
ThinLTO had been silently disabled for ~12 months due to a CMake variable casing typo. While fixing it we also enabled split-DWARF integrated with lld's ThinLTO codegen, compressed debug sections, file-prefix-mapped paths, and a prebuilt gdb index. The RelWithDebInfo binary shrank by about 22% with the debug info now living in a separate sidecar bundle of comparable size. Final-link wall time is unchanged.
The debug bundle ships as a separate memgraph-debuginfo sibling package (both DEB and RPM) with version pinned to the matching main package. RPM required disabling rpmbuild's automatic debug extraction, which would otherwise mangle our pre-bundled debug file into an empty stub.