Buffer overflow in k-mer prefilter with highly conserved sequences

`mmseqs easy-search` segfaults at the prefilter step when searching an antibody query against a multi-million-sequence antibody database (e.g. 10M clonotype sequences). Any dataset in which conserved k-mers match a large fraction of the targets will trigger it.

The problem is caused by an underestimate of the output size in the buffer overflow check in `CacheFriendlyOperations::findDuplicates`. The check uses `std::min(elementCount, currBinSize/2)`, which assumes that at most half of the bin entries are duplicates. This assumption breaks when query k-mers are shared across a large fraction of the target database, as happens with antibody variable region sequences, where conserved framework k-mers match ~70% of targets on consistent diagonals.
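To make the failure mode concrete, here is a minimal toy model of the bound. It is not the MMseqs2 implementation: `Hit`, `countDuplicates`, `assumedBound` and `makeConservedBin` are hypothetical names, and "duplicate" is assumed here to mean any bin entry that repeats a (target id, diagonal) pair already seen in the same bin. Only the expression `std::min(elementCount, currBinSize/2)` is taken from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for one bin entry: target id plus match diagonal.
struct Hit {
    unsigned id;
    unsigned short diagonal;
};

// Toy model (assumption): an entry is a duplicate if its (id, diagonal)
// pair was already seen earlier in the same bin.
std::size_t countDuplicates(const std::vector<Hit>& bin) {
    std::set<std::pair<unsigned, unsigned short>> seen;
    std::size_t duplicates = 0;
    for (const Hit& h : bin) {
        // insert() reports false in .second when the pair already exists
        if (!seen.insert({h.id, h.diagonal}).second) {
            ++duplicates;
        }
    }
    return duplicates;
}

// The output-size estimate quoted in this issue.
std::size_t assumedBound(std::size_t elementCount, std::size_t currBinSize) {
    return std::min(elementCount, currBinSize / 2);
}

// Build a bin where every target is hit kmersPerTarget times on the same
// diagonal, mimicking conserved framework k-mers (illustrative data only).
std::vector<Hit> makeConservedBin(std::size_t targets, std::size_t kmersPerTarget) {
    std::vector<Hit> bin;
    bin.reserve(targets * kmersPerTarget);
    for (std::size_t t = 0; t < targets; ++t) {
        for (std::size_t k = 0; k < kmersPerTarget; ++k) {
            bin.push_back(Hit{static_cast<unsigned>(t), 0});
        }
    }
    return bin;
}
```

Under this model, a bin of 4 targets with 3 hits each on one diagonal has 12 entries and 8 duplicates, while `assumedBound(12, 12)` allows only 6. As soon as targets are hit three or more times on the same diagonal, the half-of-bin estimate is too small.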

Here is a suggested fix for the issue: https://github.com/soedinglab/MMseqs2/pull/1091

Buffer overflow in k-mer prefilter with highly conserved sequences #1092

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Buffer overflow in k-mer prefilter with highly conserved sequences #1092

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions