Segmentation fault due to missing initialization

I think I found a segfault in the tests. I asked my debugging agent to investigate it under gdb, and the explanation it produced sounds reasonable to me, although I'm not familiar enough with the code base to fully verify the claims. Black-box testing of the fix worked.

```
for i in $(seq 1 50); do ./test_smoke --gtest_filter=smoke/smoke.touch/bare ; done
...
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from smoke/smoke
[ RUN      ] smoke/smoke.touch/bare
CSegmentation fault (core dumped)
```

## The problem

`smoke/smoke.touch/bare` crashes non-deterministically inside libxml2 (`xmlDocDumpFormatMemoryEnc`) when called from worker threads in `cra_repomd_flush_worker` at `src/createrepo-cache/repo_cache.c:1559`.

Stack trace from the crash:

```
Thread 23 "pool" received signal SIGSEGV:
#0  libxml2.so.2 (inside xmlDocDumpFormatMemoryEnc)
#1  xmlDocDumpFormatMemoryEnc
#2  cr_xml_dump_repomd  (libcreaterepo_c)
#3  cra_xml_write_repomd
#4  cra_repomd_flush_worker
```

[libxml2's documentation](https://dev.w3.org/XInclude-Test-Suite/libxml2-2.4.24/libxml2-2.4.24/doc/html/libxml-parser.html#XMLINITPARSER) states that **`xmlInitParser()` must be called from the main thread before any other threads are created**.

Why only `bare` reproduces it:

- `empty` / `populated` fixtures already have a `repomd.xml` on disk, so `cra_repo_cache_load()` → `cr_xml_parse_repomd()` runs on the main thread during realize and indirectly initializes libxml2 globals before any workers spawn.
- `bare` has no repodata, so `cra_repo_cache_realize` (`repo_cache.c:909`) short-circuits and never touches libxml2 on the main thread.
- `touch aarch64 x86_64` then dirties 5 repos (SRPMS + arch + debug × 2). `cra_cache_flush` spawns `g_get_num_processors()` workers that all race into libxml2's first-time init simultaneously → SIGSEGV.

Why it's non-deterministic: the race only fires when two workers hit the lazy init window at the same time. Single-core scheduling, CPU caches, or cold cache for libxml2's encoding tables all affect the outcome.
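To make the race window concrete, here is a minimal sketch of the pattern using plain pthreads. It does not use libxml2 at all: `lib_ready`, `init_runs`, `lazy_init` and `run_pool` are hypothetical stand-ins invented for illustration, and the check-then-act in `lazy_init` only approximates what an unguarded lazy global init looks like.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_WORKERS 64

/* Hypothetical stand-ins for a library's lazily initialized global state. */
static atomic_int lib_ready;   /* "globals are initialized" flag */
static atomic_int init_runs;   /* how many times the init body executed */

/* Lazy first-use init with an unguarded check-then-act window:
 * two threads can both see lib_ready == 0 and both run the body. */
static void lazy_init(void) {
    if (!atomic_load(&lib_ready)) {
        atomic_fetch_add(&init_runs, 1);  /* real code would build tables here */
        atomic_store(&lib_ready, 1);
    }
}

static void *worker(void *arg) {
    (void)arg;
    lazy_init();  /* first use from a worker thread */
    return NULL;
}

/* Spawns n_workers threads that all call lazy_init().
 * If preinit is non-zero, the calling thread initializes first.
 * Returns how many times the init body ran inside worker threads. */
static int run_pool(int n_workers, int preinit) {
    pthread_t threads[MAX_WORKERS];
    int started = 0;

    assert(n_workers > 0);
    if (n_workers > MAX_WORKERS) {
        n_workers = MAX_WORKERS;
    }

    atomic_store(&lib_ready, 0);
    atomic_store(&init_runs, 0);

    if (preinit) {
        lazy_init();  /* main-thread init closes the window */
    }
    int baseline = atomic_load(&init_runs);

    for (int i = 0; i < n_workers; i++) {
        if (pthread_create(&threads[i], NULL, worker, NULL) != 0) {
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
    }
    return atomic_load(&init_runs) - baseline;
}
```

With `preinit` set, no worker ever enters the init body, so `run_pool(8, 1)` returns 0. Without it, at least one worker runs the body, and occasionally more than one does, which is the kind of overlap that would corrupt real global state.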

Why the Python test `test_touch` didn't catch it: the Python extension is loaded into an interpreter that has already pulled libxml2 state through other startup paths, papering over the race.

## A possible fix

Added a call to `cr_xml_dump_init()` (createrepo_c's thin wrapper around `xmlInitParser()`, specifically provided for this purpose) at the top of `cra_cache_new()` in `src/createrepo-cache/repo_cache.c:422`. It runs on the main thread before any flush worker pool is created, is idempotent, and requires no new dependency since createrepo_c is already linked.

```diff
diff --git a/src/createrepo-cache/repo_cache.c b/src/createrepo-cache/repo_cache.c
index 8c279ba..5d2bacb 100644
--- a/src/createrepo-cache/repo_cache.c
+++ b/src/createrepo-cache/repo_cache.c
@@ -424,6 +424,11 @@ cra_cache_new(const char * path)
   cra_Cache * cache;
   gpgme_error_t rc;
 
+  // Initialize libxml2 on the main thread. Without this, concurrent first-use
+  // from flush worker threads races on libxml2's lazy global init and crashes
+  // inside xmlDocDumpFormatMemoryEnc. The initialization is idempotent.
+  cr_xml_dump_init();
+
   cache = g_new0(cra_Cache, 1);
   if (!cache) {
     return NULL;
```
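The ordering this fix relies on can be sketched in isolation as follows. This is an illustration under assumptions, not createrepo-agent code: `demo_init`, `demo_cache_new` and `demo_cache_flush` are hypothetical names standing in for `cr_xml_dump_init()`, `cra_cache_new()` and the flush path, and the "library state" is just a flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a library's global init state. */
static atomic_int demo_init_done;
static atomic_int demo_init_body_runs;

/* Idempotent init: repeated calls after the first are no-ops. */
static void demo_init(void) {
    if (atomic_load(&demo_init_done)) {
        return;
    }
    atomic_fetch_add(&demo_init_body_runs, 1);
    atomic_store(&demo_init_done, 1);
}

typedef struct {
    int n_workers;
} demo_cache;

/* Constructor runs on the main thread and initializes the library
 * before any worker pool can exist. */
static demo_cache *demo_cache_new(int n_workers) {
    demo_init();
    demo_cache *cache = calloc(1, sizeof *cache);
    if (!cache) {
        return NULL;
    }
    cache->n_workers = n_workers;
    return cache;
}

/* Each worker records whether it observed uninitialized state. */
static void *demo_flush_worker(void *arg) {
    atomic_int *saw_uninitialized = arg;
    if (!atomic_load(&demo_init_done)) {
        atomic_fetch_add(saw_uninitialized, 1);
    }
    return NULL;
}

/* Returns the number of workers that saw uninitialized state,
 * or -1 if the thread array could not be allocated. */
static int demo_cache_flush(const demo_cache *cache) {
    assert(cache != NULL && cache->n_workers > 0);

    atomic_int saw_uninitialized = 0;
    pthread_t *threads = calloc((size_t)cache->n_workers, sizeof *threads);
    if (!threads) {
        return -1;
    }

    int started = 0;
    for (int i = 0; i < cache->n_workers; i++) {
        if (pthread_create(&threads[i], NULL, demo_flush_worker,
                           &saw_uninitialized) != 0) {
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
    }
    free(threads);
    return atomic_load(&saw_uninitialized);
}
```

Creating two caches calls `demo_init()` twice but runs its body once, and every flush worker finds the state already initialized. That is the property the one-line change above depends on: the constructor always runs before any worker is spawned.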

_Originally posted by @j-rivero in https://github.com/osrf/createrepo-agent/issues/36#issuecomment-4207147257_
            
