`mmseqs easy-search` segfaults at the prefilter step when running an antibody query against a multi-million-sequence antibody database (e.g. 10M clonotype sequences). Any dataset where conserved k-mers match a large fraction of targets will trigger it.
The problem is caused by an underestimate of the output size in the buffer overflow check in `CacheFriendlyOperations::findDuplicates`. The check uses `std::min(elementCount, currBinSize/2)`, which assumes that at most half of the entries in a bin are duplicates. This assumption breaks when query k-mers are shared across a large fraction of the target database. That is the case for antibody variable region sequences, where conserved framework k-mers match ~70% of targets on consistent diagonals.
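To illustrate why the half-of-bin assumption can fail, here is a minimal sketch of a simplified model. It is not the MMseqs2 code: the `Hit` struct and the `countDuplicates`, `cappedEstimate`, and `makeBin` helpers are hypothetical names introduced only for this example, and the duplicate rule (same target id seen again on the same diagonal) is an assumption about the general idea rather than the exact implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for a prefilter hit (not the MMseqs2 type).
struct Hit {
    unsigned int id;        // target sequence id
    unsigned char diagonal; // diagonal of the k-mer match
};

// Count entries that repeat the last seen diagonal of the same target id.
std::size_t countDuplicates(const std::vector<Hit>& bin) {
    std::unordered_map<unsigned int, unsigned char> lastDiagonal;
    std::size_t duplicates = 0;
    for (const Hit& h : bin) {
        auto it = lastDiagonal.find(h.id);
        if (it != lastDiagonal.end() && it->second == h.diagonal) {
            ++duplicates;                    // same target, same diagonal
        } else {
            lastDiagonal[h.id] = h.diagonal; // first hit or new diagonal
        }
    }
    // The first entry of a non-empty bin can never be a duplicate.
    assert(bin.empty() || duplicates < bin.size());
    return duplicates;
}

// The capped estimate described in the report: at most half of the bin.
std::size_t cappedEstimate(std::size_t elementCount, std::size_t currBinSize) {
    return std::min(elementCount, currBinSize / 2);
}

// Build a bin in which every target is hit hitsPerTarget times on one diagonal,
// mimicking conserved framework k-mers shared across the database.
std::vector<Hit> makeBin(unsigned int targets, unsigned int hitsPerTarget) {
    std::vector<Hit> bin;
    bin.reserve(static_cast<std::size_t>(targets) * hitsPerTarget);
    for (unsigned int k = 0; k < hitsPerTarget; ++k) {
        for (unsigned int id = 0; id < targets; ++id) {
            bin.push_back(Hit{id, 7});
        }
    }
    return bin;
}
```

In this model, `countDuplicates(makeBin(1000, 4))` finds 3000 duplicates among 4000 entries, while `cappedEstimate(3000, 4000)` returns 2000. Any bounds check based on the capped value would then reserve room for fewer entries than are actually written.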
Here is a suggested fix for the issue: #1091