• Nelson Enrique Vera-Parra
  • Danilo Alfonso López-Sarmiento
  • Cristian Alejandro Rojas-Quintero




Heterogeneous computing, K-mers processing, OpenCL, Parallel programming, Search and distribution of super k-mers


The k-mers processing techniques based on partitioning of the data set on the disk using minimizer-type seeds have led to a significant reduction in memory requirements; however, it has added processes (search and distribution of super k-mers) that can be intensive given the large volume of data. This paper presents a massive parallel processing model in order to enable the efficient use of heterogeneous computation to accelerate the search of super k-mers based on seeds (minimizers or signatures). The model includes three main contributions: a new data structure called CISK for representing the super k-mers, their minimizers and two massive parallelization patterns in an indexed and compact way: one for obtaining the canonical m-mers of a set of reads and another for  searching for super k-mers based on minimizers. The model was implemented through two OpenCL kernels. The evaluation of the kernels shows favorable results in terms of execution times and memory requirements to use the model for constructing heterogeneous solutions with simultaneous execution (workload distribution), which perform co-processing using the current search methods of super k -mers on the CPU and the methods presented herein on GPU. The model implementation code is available in the repository: https://github.com/BioinfUD/K-mersCL.


H. Li, A. Ramachandran, and D. Chen, “GPU acceleration of advanced k-mer counting for computational genomics,” Proceedings of the 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2018, pp. 183-186.

T. Richert, “Management of distributed dynamic data with algorithmic skeletons,” Parallel Computing, 2000, pp. 375-382. https://www.worldscientific.com/doi/abs/10.1142/9781848160170_0044

F. Wrede and S. Ernsting, “Simultaneous CPU–GPU execution of data parallel algorithmic skeletons,” International Journal of Parallel Programming, vol. 46, no. 1, pp. 42–61, Apr. 2017.

M. Roberts, W. Hayes, B. R. Hunt, S. M. Mount, and J. A. Yorke, “Reducing storage requirements for biological sequence comparison,” Bioinformatics, vol. 20, no. 18, pp. 3363–3369, 2004.

G. Marçais, D. Pellow, D. Bork, Y. Orenstein, R. Shamir, and C. Kingsford, “Improving the performance of minimizers and winnowing schemes,” Bioinformatics, vol. 33, no. 14, pp. i110–i117, Dec. 2017.

S. Altschul, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, Jan. 1997.

N. Pérez, M. Gutierrez, and N. Vera, “Computational performance assessment of k-mer counting algorithms,” Journal of Computational Biology, vol. 23, no. 4, pp. 248–255, 2016.

M. Kokot, M. Długosz, and S. Deorowicz, “KMC 3: counting and manipulating k-mer statistics,” Bioinformatics, vol. 33, no. 17, pp. 2759–2761, Apr. 2017.

Y. Li, et al. MSPKmerCounter: a fast and memory efficient approach for k-mer counting. arXiv preprint arXiv:1505.06550, 2015.

Y. Li, P. Kamousi, F. Han, S. Yang, X. Yan, and S. Suri, “Memory efficient minimum substring partitioning,” Proceedings of the VLDB Endowment, vol. 6, no. 3, pp. 169–180, Jan. 2013.

R. Chikhi, A. Limasset, and P. Medvedev, “Compacting de Bruijn graphs from sequencing data quickly and in low memory,” Bioinformatics, vol. 32, no. 12, pp. i201–i208, 2016.

C. Marchet, M. Kerbiriou, and A. Limasset, “Indexing De Bruijn graphs with minimizers,” Nov. 2019. https://www.biorxiv.org/content/10.1101/546309v2

J. Luo, J. Wang, W. Li, Z. Zhang, F. X. Wu, M. Li, & Y. Pan, “EPGA2: memory-efficient de novo assembler,” Bioinformatics, vol. 31, no. 24, pp. 3988-3990, 2015.

S. Deorowicz, “FQSqueezer: k-mer-based compression of sequencing data,” Scientific Reports, vol. 10, 578, 2020. https://doi.org/10.1038/s41598-020-57452-6

S. Deorowicz, A. Debudaj-Grabysz, and S. Grabowski, “Disk-based k-mer counting on a PC,” BMC Bioinformatics, vol. 14, no. 1, 2013.

M. Erbert, S. Rechner, and M. Müller-Hannemann, “Gerbil: a fast and memory-efficient k-mer counter with GPU-support,” Algorithms for Molecular Biology, vol. 12, no. 1, 2017.

Y. Ono, K. Asai, and M. Hamada, “PBSIM: PacBio reads simulator—toward accurate genome assembly,” Bioinformatics, vol. 29, no. 1, pp. 119–121, Apr. 2012.

A. Rhoads and K. F. Au, “PacBio Sequencing and Its Applications,” Genomics, Proteomics & Bioinformatics, vol. 13, no. 5, pp. 278–289, 2015.

V. Alessandrini, “Atomic types and operations,” Shared Memory Application Programming, pp. 167–190, 2016.

J. E. Stone, D. Gohara, and G. Shi, “OpenCL: A parallel programming standard for heterogeneous computing systems,” Computing in Science & Engineering, vol. 12, no. 3, pp. 66–73, 2010.

N. Vera, C. Rojas and J. Pérez, OpenCL Práctico, first ed., Editorial UD, Bogotá, 2019, 314 p.

A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih, “PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation,” Parallel Computing, vol. 38, no. 3, pp. 157–174, 2012.

A. Klöckner, N. Pinto, B. Catanzaro, Y. Lee, P. Ivanov, and A. Fasih, “GPU scripting and code generation with PyCUDA,” GPU Computing Gems Jade Edition, pp. 373–385, 2012.

G. Marçais, C. Kingsford, “A fast, lock-free approach for efficient parallel counting of occurrences of k-mers,” Bioinformatics, vol. 27, no 6, pp. 764-770, 2011.

S. Deorowicz, A. Debudaj-Grabysz, S. Grabowsky, “Disk-based k-mer counting on a PC,” BMC Bioinformatics, vol. 14, no. 1, p. 160, 2013.




How to Cite

Vera-Parra, N. E., López-Sarmiento, D. A., & Rojas-Quintero, C. A. (2020). HETEROGENEOUS COMPUTING TO ACCELERATE THE SEARCH OF SUPER K-MERS BASED ON MINIMIZERS. International Journal of Computing, 19(4), 525-532. https://doi.org/10.47839/ijc.19.4.1985