
Snapshot import: 'unexpected end of file' on clinical-full + faers (small) — data loaded anyway, no rollback #199

@sandeepkunkunuru

Found during the OSS pre-hero regression sweep on commit 7e38ee5 (main).

Symptom

Two snapshots well under the 2 GB body limit (issue #197) return an error from `/api/snapshot/import`:

```
$ curl -X POST http://127.0.0.1:8080/api/snapshot/import -F "file=@clinical-full.sgsnap" # 681 MB
{"error":"unexpected end of file"}

$ curl -X POST http://127.0.0.1:8080/api/snapshot/import -F "file=@faers.sgsnap" # 110 MB
{"error":"unexpected end of file"}
```

But a follow-up `MATCH (n) RETURN count(n)` shows the import partially succeeded:

| Snapshot | Reported error | `count(n)` after |
| --- | --- | --- |
| `clinical-full.sgsnap` (681 MB) | `unexpected end of file` | 7,774,446 |
| `faers.sgsnap` (110 MB) | `unexpected end of file` | 2,665,596 |

So the importer:

  1. Streams the snapshot, applies nodes/edges,
  2. Hits an EOF before reading whatever trailing structure it expects,
  3. Returns an error to the client,
  4. Leaves the partial state in the graph — no rollback.

Other snapshots from the same S3 bucket (`s3://samyama-data/snapshots/`) of comparable size import cleanly: `faers-full` (702 MB → 10.4M nodes, OK), `clinical-trials` (711 MB → 7.78M nodes, OK), `omop-115k` (1.4 GB → 51.9M nodes, OK).

Hypotheses

  1. `clinical-full` and `faers` (small variant) are produced by an older exporter that emits a different trailer/footer than the current parser expects → backwards-compatibility break in the snapshot format.
  2. The two files were truncated during S3 upload — but `aws s3 cp` validates checksums, so this is unlikely.
  3. The parser reads ahead past the last record (e.g. expects an explicit terminator) and treats EOF-without-terminator as an error even though all records are valid.
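To make hypothesis 3 concrete, here is a minimal sketch of a record reader that separates "stream ended exactly on a record boundary" from "stream ended in the middle of a record". The on-disk layout (a 4-byte little-endian length prefix followed by the payload) and the names `Next`, `read_full`, and `next_record` are assumptions for illustration only; the real `.sgsnap` format is not described in this issue.

```rust
use std::io::{self, Read};

/// Outcome of trying to read one record (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Next {
    Record(Vec<u8>),
    /// The stream ended exactly on a record boundary.
    CleanEof,
}

/// Fill `buf` as far as the stream allows and return how many bytes were read.
/// Unlike `read_exact`, a short read is reported to the caller, not turned into an error.
fn read_full<R: Read>(r: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match r.read(&mut buf[filled..]) {
            Ok(0) => break, // end of stream
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}

/// Read one length-prefixed record (assumed layout: u32 LE length, then payload).
pub fn next_record<R: Read>(r: &mut R) -> io::Result<Next> {
    let mut len_buf = [0u8; 4];
    match read_full(r, &mut len_buf)? {
        0 => return Ok(Next::CleanEof), // nothing left: a valid end of stream
        4 => {}
        _ => return Err(io::ErrorKind::UnexpectedEof.into()), // cut inside the prefix
    }
    // A real parser should also cap `len` before allocating.
    let len = u32::from_le_bytes(len_buf) as usize;
    let mut payload = vec![0u8; len];
    if read_full(r, &mut payload)? != len {
        return Err(io::ErrorKind::UnexpectedEof.into()); // cut inside the payload
    }
    Ok(Next::Record(payload))
}
```

With a reader shaped like this, a file whose last record is complete but which lacks an explicit terminator ends with `CleanEof`, while a genuinely truncated file still surfaces `UnexpectedEof`. Whether a missing terminator should be tolerated is a format decision for the maintainers.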

Asks

  1. Add a sanity check on the snapshot header: have the exporter write a format version and writer build into the file header, and have the importer refuse an unknown version before applying any records, rather than leaving partial state.
  2. Make import atomic — on any error, drop the in-memory partial graph rather than leaving it visible to subsequent queries.
  3. If hypothesis 1 is right, add a regression test loading both files and either succeeding or failing cleanly (no partial state).
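Asks 1 and 2 could be combined roughly as follows. This is a sketch under stated assumptions, not the project's importer: the header layout (4-byte magic `SGSN`, then a little-endian `u32` version), the supported version list, the toy `Graph` type, and the body format (a flat run of 8-byte node ids) are all hypothetical. The point is the shape: validate the header first, decode into a staging buffer, and touch the live graph only after the whole snapshot has parsed.

```rust
/// Hypothetical header: 4-byte magic followed by a little-endian u32 format version.
const MAGIC: &[u8; 4] = b"SGSN";
const SUPPORTED_VERSIONS: &[u32] = &[1, 2];

#[derive(Debug, PartialEq)]
pub enum ImportError {
    BadMagic,
    UnknownVersion(u32),
    Truncated,
}

/// Toy stand-in for the in-memory graph.
#[derive(Debug, Default, PartialEq)]
pub struct Graph {
    pub nodes: Vec<u64>,
}

/// Validate the header and return (version, remaining body bytes).
pub fn check_header(bytes: &[u8]) -> Result<(u32, &[u8]), ImportError> {
    if bytes.len() < 8 {
        return Err(ImportError::Truncated);
    }
    if !bytes.starts_with(MAGIC) {
        return Err(ImportError::BadMagic);
    }
    let version = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    if !SUPPORTED_VERSIONS.contains(&version) {
        return Err(ImportError::UnknownVersion(version));
    }
    Ok((version, &bytes[8..]))
}

/// Decode into a staging buffer; commit to `live` only if every record parsed.
pub fn import_atomic(live: &mut Graph, snapshot: &[u8]) -> Result<usize, ImportError> {
    let (_version, body) = check_header(snapshot)?;

    let mut staged: Vec<u64> = Vec::with_capacity(body.len() / 8);
    let mut chunks = body.chunks_exact(8);
    for chunk in chunks.by_ref() {
        let mut id = [0u8; 8];
        id.copy_from_slice(chunk);
        staged.push(u64::from_le_bytes(id));
    }
    if !chunks.remainder().is_empty() {
        // Early return drops `staged`; `live` has not been modified.
        return Err(ImportError::Truncated);
    }

    let added = staged.len();
    live.nodes.append(&mut staged); // single commit point
    Ok(added)
}
```

A regression test in the spirit of ask 3 would then assert one of two outcomes for each file: `Ok(_)` with the expected count, or `Err(_)` with the live graph unchanged. Staging costs extra memory for the duration of the import, which may matter for the larger snapshots listed above.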

Repro

```
aws s3 cp s3://samyama-data/snapshots/clinical-full.sgsnap /tmp/
aws s3 cp s3://samyama-data/snapshots/faers.sgsnap /tmp/
./target/release/samyama --port 6379 &
curl -X POST http://127.0.0.1:8080/api/snapshot/import -F "file=@/tmp/clinical-full.sgsnap"
curl -X POST http://127.0.0.1:8080/api/query -H 'Content-Type: application/json' -d '{"query":"MATCH (n) RETURN count(n)"}'
```

Reported as part of pre-hero OSS sweep (2026-04-25).

Labels: bug
