PhD Defense by Haowen Zhang


Title: Efficient Methods for Read Mapping

Date: Thursday, June 30th, 2022

Time: 3pm - 5pm ET

Location: https://gatech.zoom.us/j/97605744790 

Haowen Zhang
School of Computational Science and Engineering
College of Computing
Georgia Institute of Technology



Committee:

Dr. Srinivas Aluru (Advisor, School of Computational Science and Engineering, Georgia Institute of Technology)
Dr. Ümit V. Çatalyürek (School of Computational Science and Engineering, Georgia Institute of Technology)
Dr. Kostas Konstantinidis (School of Civil and Environmental Engineering, Georgia Institute of Technology)
Dr. Xiuwei Zhang (School of Computational Science and Engineering, Georgia Institute of Technology)
Dr. Heng Li (Department of Biomedical Informatics, Harvard Medical School)

Abstract:





DNA sequencing is a mainstay of biological and medical research. Modern sequencing machines can read millions of DNA fragments, sampling the underlying genomes at high throughput. Mapping the resulting reads to a reference genome is typically the first step in sequencing data analysis. The problem has many variants: depending on the sequencing technology, reads can be short or long and have a low or high error rate, and the reference can be a single genome or a graph representation of multiple genomes. It is therefore crucial to develop efficient computational methods for each of these problem classes. Moreover, continually declining sequencing costs and increasing throughput pose challenges to previously developed methods and tools that cannot handle the growing volume of sequencing data.


This dissertation seeks to advance the state of the art in the established field of read mapping by proposing more efficient and scalable read mapping methods and by tackling emerging problem areas. Specifically, we design ultra-fast methods to map two types of reads: short reads for high-throughput chromatin profiling and raw nanopore reads for real-time targeted sequencing. Tailored to the characteristics of these read types, our methods scale to larger sequencing data sets or map more reads correctly than state-of-the-art mapping software. Furthermore, we propose two algorithms for aligning sequences to graphs, which is the foundation of mapping reads to graph-based reference genomes. One algorithm improves the time complexity of existing sequence-to-graph alignment algorithms for linear or affine gap penalties. The other provides good empirical performance under the edit distance metric. Finally, we mathematically formulate the problem of validating paired-end read constraints when mapping sequences to graphs, and propose an exact algorithm that is also fast enough for practical use.


  • Workflow Status: Published
  • Created By: Tatianna Richardson
  • Created: 06/15/2022
  • Modified By: Tatianna Richardson
  • Modified: 06/15/2022