10.3389/fmicb.2019.01560.s001 Wang Xi Yan Gao Zhangyu Cheng Chaoyun Chen Maozhen Han Pengshuo Yang Guangzhou Xiong Kang Ning Data_Sheet_1_Using QC-Blind for Quality Control and Contamination Screening of Bacteria DNA Sequencing Data Without Reference Genome.zip Frontiers 2019 quality control contamination screening metagenome next generation sequencing (NGS) novel pipeline 2019-07-09 04:50:22 Dataset https://frontiersin.figshare.com/articles/dataset/Data_Sheet_1_Using_QC-Blind_for_Quality_Control_and_Contamination_Screening_of_Bacteria_DNA_Sequencing_Data_Without_Reference_Genome_zip/8831153 <p>Quality control for next generation sequencing (NGS) has become increasingly important as omics studies rely ever more heavily on sequencing data. Tools have been developed for filtering possible contaminants from species with known reference genomes. Unfortunately, these tools require reference genomes for all the species involved, including the contaminants. This rules out many real-life samples for which the complete genome of the target species is unavailable and which are contaminated with unknown microbial species. In this work we propose QC-Blind, a novel quality control pipeline that removes contaminants without using any reference genome. The pipeline requires only information about a few marker genes of the target species. It consists of four steps: unsupervised read assembly, contig binning, read clustering, and marker gene assignment. When evaluated on in silico, ab initio and in vivo datasets, QC-Blind proved effective in removing unknown contaminants with high specificity and accuracy, while preserving most of the genomic information of the target bacterial species. QC-Blind could therefore serve well in situations where only limited information is available for both the target and the contaminant species.</p>
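<p>The final step described above, marker gene assignment, can be illustrated with a minimal sketch. It is not the QC-Blind implementation: the function names, the input layout (contig-to-bin and read-to-contig mappings, a list of marker hits), the <code>min_markers</code> threshold, and the marker names "16S" and "rpoB" are all assumptions made for illustration. The idea shown is only that bins carrying target marker genes are kept, and reads from the remaining bins are treated as contaminants.</p>

```python
from collections import defaultdict


def select_target_bins(contig_to_bin, marker_hits, min_markers=1):
    """Return the IDs of bins holding at least `min_markers` distinct target markers.

    contig_to_bin: dict mapping contig ID -> bin ID (hypothetical binning output)
    marker_hits:   iterable of (contig ID, marker gene name) pairs (hypothetical)
    """
    markers_per_bin = defaultdict(set)
    for contig, marker in marker_hits:
        bin_id = contig_to_bin.get(contig)
        if bin_id is not None:  # ignore hits on contigs that were never binned
            markers_per_bin[bin_id].add(marker)
    return {b for b, found in markers_per_bin.items() if len(found) >= min_markers}


def filter_reads(read_to_contig, contig_to_bin, target_bins):
    """Keep only reads whose contig belongs to one of the target bins."""
    return [
        read
        for read, contig in read_to_contig.items()
        if contig_to_bin.get(contig) in target_bins
    ]


# Toy input: bin 0 carries two target markers, bin 1 carries none.
contig_to_bin = {"c1": 0, "c2": 0, "c3": 1}
marker_hits = [("c1", "16S"), ("c2", "rpoB")]
read_to_contig = {"r1": "c1", "r2": "c3", "r3": "c2"}

target_bins = select_target_bins(contig_to_bin, marker_hits)
kept_reads = filter_reads(read_to_contig, contig_to_bin, target_bins)
```

<p>In this toy input, read <code>r2</code> is discarded because its contig sits in a bin with no target marker gene. Raising <code>min_markers</code> would make the selection stricter, at the cost of possibly dropping target bins that contain few markers.</p>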