Segmentation fault due to missing initialization #41

@cottsay

I think I found a segfault in the tests. I asked my debugging agent to investigate the gdb output, and the explanation it produced sounds reasonable to me, although I'm not familiar enough with the code base to fully verify the claims. In black-box testing, the fix worked.

for i in $(seq 1 50); do ./test_smoke --gtest_filter=smoke/smoke.touch/bare ; done
...
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from smoke/smoke
[ RUN      ] smoke/smoke.touch/bare
CSegmentation fault (core dumped)

The problem

smoke/smoke.touch/bare crashes non-deterministically inside libxml2 (xmlDocDumpFormatMemoryEnc) when called from worker threads in cra_repomd_flush_worker at src/createrepo-cache/repo_cache.c:1559.

Stack trace from the crash:

Thread 23 "pool" received signal SIGSEGV:
#0  libxml2.so.2 (inside xmlDocDumpFormatMemoryEnc)
#1  xmlDocDumpFormatMemoryEnc
#2  cr_xml_dump_repomd  (libcreaterepo_c)
#3  cra_xml_write_repomd
#4  cra_repomd_flush_worker

According to libxml2's documentation, xmlInitParser() must be called from the main thread before any other threads are created.

Why only bare reproduces it:

  • empty / populated fixtures already have a repomd.xml on disk, so cra_repo_cache_load() → cr_xml_parse_repomd() runs on the main thread during realize and indirectly initializes libxml2 globals before any workers spawn.
  • bare has no repodata, so cra_repo_cache_realize (repo_cache.c:909) short-circuits and never touches libxml2 on the main thread.
  • touch aarch64 x86_64 then dirties 5 repos (SRPMS + (arch + debug) × 2). cra_cache_flush spawns g_get_num_processors() workers that all race into libxml2's first-time init simultaneously → SIGSEGV.

Why it's non-deterministic: the race only fires when two workers hit the lazy init window at the same time. Single-core scheduling, CPU caches, or cold cache for libxml2's encoding tables all affect the outcome.
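
To make the lazy init window concrete, here is a minimal sketch in plain C with pthreads. It is only a model, not libxml2's actual internals: `model_init`, `worker`, `run_workers`, and `NUM_WORKERS` are hypothetical names, and the "setup" is just a counter so that the sketch stays free of undefined behavior. In the real library, overlapping setup touches shared globals, which is presumably where the crash comes from.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_WORKERS 8

static atomic_int initialized;  /* 0 until the one-time setup has finished */
static atomic_int setup_runs;   /* how many times the setup body executed */

/* Check-then-act init (hypothetical model): harmless to repeat from one
 * thread, but racy when several threads make the first call together. */
static void model_init(void)
{
    if (atomic_load(&initialized) == 0) {
        /* Window: another thread can pass the same check before the
         * store below becomes visible, so setup may run more than once. */
        atomic_fetch_add(&setup_runs, 1);
        atomic_store(&initialized, 1);
    }
}

/* Each worker triggers the lazy init on its first use of the "library". */
static void *worker(void *arg)
{
    (void)arg;
    model_init();
    return NULL;
}

/* Runs the workers and returns how many times setup executed.
 * With init_on_main set, setup runs once on the calling thread and the
 * workers only ever see the initialized state. */
static int run_workers(int init_on_main)
{
    pthread_t threads[NUM_WORKERS];
    int started = 0;

    atomic_store(&initialized, 0);
    atomic_store(&setup_runs, 0);

    if (init_on_main)
        model_init();  /* the fix: initialize before any worker exists */

    for (int i = 0; i < NUM_WORKERS; i++) {
        if (pthread_create(&threads[started], NULL, worker, NULL) == 0)
            started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    return atomic_load(&setup_runs);
}
```

In this model, `run_workers(1)` always reports exactly one setup run, while `run_workers(0)` can report anything from 1 to `NUM_WORKERS` depending on scheduling, which mirrors why the crash only shows up on some runs.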

Why the Python test test_touch didn't catch it: the Python extension is loaded into an interpreter that has already pulled libxml2 state through other startup paths, papering over the race.

A possible fix

Added a call to cr_xml_dump_init() (createrepo_c's thin wrapper around xmlInitParser(), specifically provided for this purpose) at the top of cra_cache_new() in src/createrepo-cache/repo_cache.c:422. It runs on the main thread before any flush worker pool is created, is idempotent, and requires no new dependency since createrepo_c is already linked.

diff --git a/src/createrepo-cache/repo_cache.c b/src/createrepo-cache/repo_cache.c
index 8c279ba..5d2bacb 100644
--- a/src/createrepo-cache/repo_cache.c
+++ b/src/createrepo-cache/repo_cache.c
@@ -424,6 +424,11 @@ cra_cache_new(const char * path)
   cra_Cache * cache;
   gpgme_error_t rc;
 
+  // Initialize libxml2 on the main thread. Without this, concurrent first-use
+  // from flush worker threads races on libxml2's lazy global init and crashes
+  // inside xmlDocDumpFormatMemoryEnc. The initialization is idempotent.
+  cr_xml_dump_init();
+
   cache = g_new0(cra_Cache, 1);
   if (!cache) {
     return NULL;

Originally posted by @j-rivero in #36 (comment)
