Hi, I was wondering if you have any suggestions on how to make MMseqs2 work efficiently for clustering short peptides?
I went through quite a few debugging steps, and here is a summary of what I found. I would appreciate your comments:
mmseqs easy-cluster \
peptides.fasta \
mmseqs/cluster_final \
tmp_final \
--min-seq-id 0.70 \
-c 0.80 \
--cov-mode 5 \
--cluster-mode 2 \
-k 5 \
--kmer-per-seq 80 \
-s 7.5 \
--alignment-mode 3 \
--min-ungapped-score 0 \
-e 100 \
--max-seqs 1000 \
--mask 0 \
--mask-lower-case 0 \
--spaced-kmer-mode 0 \
--seq-id-mode 1 \
--cluster-reassign \
--single-step-clustering \
--threads 100 \
--split-memory-limit 100G
In general, with these settings, I got a reasonable number of clusters and well-distributed cluster sizes. The main flags that made it work were the following:
1- --mask 0: from my understanding, masking is destructive for short peptides. With k=5, a 9-mer has only 5 k-mers, so losing any of them can leave it with no matches.
2- --single-step-clustering: as far as I understand, cascaded mode uses a reduced alphabet inside linclust, which collapses distinct peptides to identical k-mers.
3- --spaced-kmer-mode 0: I am not sure this is a good choice, but my reasoning is that spaced k-mers span 6–7 positions. An 8-mer has only 2–3 starting positions for such k-mers → very few k-mers generated. However, spaced k-mers should be fine for longer peptides (10–20 AA)?
4- --seq-id-mode 1: I am also not sure about this one. My reasoning is that for pairs like 8 vs 15 AA, identity/alignment_length is systematically too low even when both peptides share a motif. So identity/shorter_length seems to be the right biological question.
5- --cluster-mode 2: This is actually the most essential one. Without it, I got one mega-cluster and a lot of singletons or small clusters.
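To make the k-mer arithmetic behind points 1 and 3 concrete, here is a minimal sketch. It assumes only that a pattern spanning `span` residues can start at `len - span + 1` positions in a peptide of length `len`. The helper name `kmer_windows` is hypothetical and not part of MMseqs2, and the spans of 6 and 7 are the values assumed in point 3, not taken from the MMseqs2 source.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an MMseqs2 command): number of start positions
# for a k-mer pattern spanning `span` residues in a peptide of length `len`,
# i.e. max(0, len - span + 1).
kmer_windows() {
    local len=$1 span=$2
    local n=$(( len - span + 1 ))
    if (( n < 0 )); then
        n=0
    fi
    echo "$n"
}

# Compare contiguous 5-mers (span 5) with patterns assumed to span 6 or 7.
for len in 8 9 10 15 20; do
    echo "len=$len span5=$(kmer_windows "$len" 5) span6=$(kmer_windows "$len" 6) span7=$(kmer_windows "$len" 7)"
done
```

Under this formula a 9-mer has 5 contiguous 5-mers, while an 8-mer gets 3 windows for a span of 6 and 2 for a span of 7. Every masked or skipped window therefore removes a large share of the possible seeds for the shortest peptides.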
I should mention that this setting is a bit too slow. I would really appreciate it if you could suggest any other reasonable settings for such peptide databases, or comment on my selected flags. We have around 12M peptides that we want to cluster.