I think I found a segfault in the tests. I asked my debugging agent to investigate it with gdb, and the explanation it produced sounds reasonable to me, although I'm not familiar enough with the code base to fully verify the claims. Black-box testing of the fix worked.
for i in $(seq 1 50); do ./test_smoke --gtest_filter=smoke/smoke.touch/bare ; done
...
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from smoke/smoke
[ RUN ] smoke/smoke.touch/bare
CSegmentation fault (core dumped)
The problem
smoke/smoke.touch/bare crashes non-deterministically inside libxml2 (xmlDocDumpFormatMemoryEnc) when called from worker threads in cra_repomd_flush_worker at src/createrepo-cache/repo_cache.c:1559.
Stack trace from the crash:
Thread 23 "pool" received signal SIGSEGV:
#0 libxml2.so.2 (inside xmlDocDumpFormatMemoryEnc)
#1 xmlDocDumpFormatMemoryEnc
#2 cr_xml_dump_repomd (libcreaterepo_c)
#3 cra_xml_write_repomd
#4 cra_repomd_flush_worker
According to libxml2's documentation, xmlInitParser() must be called from the main thread before any other threads are created.
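The ordering requirement can be illustrated with a minimal sketch that uses only POSIX threads. Here `lib_init()` and `lib_ready` are hypothetical stand-ins for `xmlInitParser()` and libxml2's global state, not real libxml2 code. The point is only that initialization finishes on the main thread before the first worker is created.

```c
#include <pthread.h>
#include <stdbool.h>

#define DEMO_READERS 4

/* Hypothetical stand-in for a library's global state (not libxml2 internals). */
static bool lib_ready = false;

/* Hypothetical stand-in for xmlInitParser(): runs before any worker exists. */
static void lib_init(void)
{
    lib_ready = true;
}

/* Workers only read the flag. pthread_create() synchronizes memory, so a
 * write made before the thread was created is visible inside the thread. */
static void *reader_worker(void *arg)
{
    bool *saw_ready = arg;
    *saw_ready = lib_ready;
    return NULL;
}

/* Returns how many workers observed the initialized state. */
int run_main_thread_init_demo(void)
{
    pthread_t threads[DEMO_READERS];
    bool saw_ready[DEMO_READERS] = { false };
    int started = 0;
    int ok = 0;

    lib_init(); /* main thread, before the first pthread_create() */

    for (int i = 0; i < DEMO_READERS; i++) {
        if (pthread_create(&threads[i], NULL, reader_worker, &saw_ready[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        if (saw_ready[i])
            ok++;
    }
    return ok;
}
```

Calling `run_main_thread_init_demo()` from `main()` should report that every worker that started saw the initialized state, because no worker can run before `lib_init()` has returned.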
Why only bare reproduces it:
empty / populated fixtures already have a repomd.xml on disk, so cra_repo_cache_load() → cr_xml_parse_repomd() runs on the main thread during realize and indirectly initializes libxml2 globals before any workers spawn.
bare has no repodata, so cra_repo_cache_realize (repo_cache.c:909) short-circuits and never touches libxml2 on the main thread.
touch aarch64 x86_64 then dirties 5 repos (SRPMS + arch + debug × 2). cra_cache_flush spawns g_get_num_processors() workers that all race into libxml2's first-time init simultaneously → SIGSEGV.
Why it's non-deterministic: the race only fires when two workers hit the lazy init window at the same time. Single-core scheduling, CPU caches, and a cold cache for libxml2's encoding tables all affect the outcome.
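For comparison, one generic way to close a lazy-init window like this is a one-time guard such as `pthread_once`. The sketch below is an illustration with hypothetical names (`global_init`, `init_calls`). It is not libxml2's internals and not the fix proposed in this issue. Several workers request initialization at once, and the init body still runs exactly once.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_WORKERS 8

/* Hypothetical counter: how many times the init body actually ran. */
static atomic_int init_calls = 0;
static pthread_once_t init_once = PTHREAD_ONCE_INIT;

/* Hypothetical stand-in for a library's first-use initialization. */
static void global_init(void)
{
    atomic_fetch_add(&init_calls, 1);
}

/* Every worker asks for initialization on first use. pthread_once() lets one
 * caller run global_init() and makes the others wait until it has finished. */
static void *once_worker(void *arg)
{
    (void)arg;
    pthread_once(&init_once, global_init);
    return NULL;
}

/* Returns the number of times global_init() ran, or -1 if a thread
 * could not be created. */
int run_guarded_init_demo(void)
{
    pthread_t threads[DEMO_WORKERS];
    int started = 0;

    for (int i = 0; i < DEMO_WORKERS; i++) {
        if (pthread_create(&threads[i], NULL, once_worker, NULL) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    return started == DEMO_WORKERS ? atomic_load(&init_calls) : -1;
}
```

A guard like this only helps when the caller owns the lazy-init path. For a third-party library, initializing on the main thread before any worker starts is the simpler option.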
Why the Python test test_touch didn't catch it: the Python extension is loaded into an interpreter that has already pulled libxml2 state through other startup paths, papering over the race.
A possible fix
Added a call to cr_xml_dump_init() (createrepo_c's thin wrapper around xmlInitParser(), specifically provided for this purpose) at the top of cra_cache_new() in src/createrepo-cache/repo_cache.c:422. It runs on the main thread before any flush worker pool is created, is idempotent, and requires no new dependency since createrepo_c is already linked.
diff --git a/src/createrepo-cache/repo_cache.c b/src/createrepo-cache/repo_cache.c
index 8c279ba..5d2bacb 100644
--- a/src/createrepo-cache/repo_cache.c
+++ b/src/createrepo-cache/repo_cache.c
@@ -424,6 +424,11 @@ cra_cache_new(const char * path)
     cra_Cache * cache;
     gpgme_error_t rc;
 
+    // Initialize libxml2 on the main thread. Without this, concurrent first-use
+    // from flush worker threads races on libxml2's lazy global init and crashes
+    // inside xmlDocDumpFormatMemoryEnc. The initialization is idempotent.
+    cr_xml_dump_init();
+
     cache = g_new0(cra_Cache, 1);
     if (!cache) {
         return NULL;
Originally posted by @j-rivero in #36 (comment)