Articles via Databases
Articles via Journals
Online Catalog
E-books
Research & Information Literacy
Interlibrary loan
Theses & Dissertations
Collections
Policies
Services
About / Contact Us
Administration
Littman Architecture Library
This site will be removed in January 2019, please change your bookmarks.
This page will redirect to https://digitalcommons.njit.edu/dissertations/1690/ in 5 seconds

The New Jersey Institute of Technology's
Electronic Theses & Dissertations Project

Title: Model-based deep autoencoders for characterizing discrete data with application to genomic data analysis
Author: Tian, Tian
View Online: njit-etd2019-086
(x-76 pages ~ 5.1 MB pdf)
Department: Department of Computer Science
Degree: Doctor of Philosophy
Program: Computer Science
Document Type: Dissertation
Advisory Committee: Wei, Zhi (Committee chair)
Wang, Guiling (Committee member)
Roshan, Usman W. (Committee member)
Guo, Wenge (Committee member)
Min, Renqiang (Committee member)
Zhao, Zhigen (Committee member)
Date: 2019-05
Keywords: Bioinformatics
Computational model
Deep learning
Genome informatics
Machine learning
Single cell RNA-seq
Availability: Unrestricted
Abstract:

Deep learning techniques have achieved tremendous successes in a wide range of real applications in recent years. For dimension reduction, deep neural networks (DNNs) provide a natural choice to parameterize a non-linear transforming function that maps the original high dimensional data to a lower dimensional latent space. Autoencoder is a kind of DNNs used to learn efficient feature representation in an unsupervised manner. Deep autoencoder has been widely explored and applied to analysis of continuous data, while it is understudied for characterizing discrete data. This dissertation focuses on developing model-based deep autoencoders for modeling discrete data. A motivating example of discrete data is the count data matrix generated by single-cell RNA sequencing (scRNA-seq) technology which is widely used in biological and medical fields. scRNA-seq promises to provide higher resolution of cellular differences than bulk RNA sequencing and has helped researchers to better understand complex biological questions. The recent advances in sequencing technology have enabled a dramatic increase in the throughput to thousands of cells for scRNA-seq. However, analysis of scRNA-seq data remains a statistical and computational challenge. A major problem is the pervasive dropout events obscuring the discrete matrix with prevailing 'false' zero count observations, which is caused by the shallow sequencing depth per cell. To make downstream analysis more effective, imputation, which recovers the missing values, is often conducted as the first step in preprocessing scRNA-seq data. Several imputation methods have been proposed. Of note is a deep autoencoder model, which proposes to explicitly characterize the count distribution, over-dispersion, and sparsity of scRNA-seq data using a zero-inflated negative binomial (ZINB) model. This dissertation introduces a model-based deep learning clustering model — scDeepCluster for clustering analysis of scRNA-seq data. The scDeepCluster is a deep autoencoder which simultaneously learns feature representation and clustering via explicit modeling of scRNA-seq data generation using the ZINB model. Based on testing extensive simulated datasets and real datasets from different representative single-cell sequencing platforms, scDeepCluster outperformed several state-of-the-art methods under various clustering performance metrics and exhibited improved scalability, with running time increasing linearly with the sample size. Although this model-based deep autoencoder approach has demonstrated superior performance, it is over-permissive in defining ZINB model space, which can lead to an unidentifiable model and make results unstable. Next, this dissertation proposes to impose a regularization that takes dropout events into account. The regularization uses a differentiable categorical distribution - Gumbel-Softmax to explicitly model the dropout events, and minimizes the Maximum Mean Discrepancy (MMD) between the reconstructed randomly masked matrix and the raw count matrix. Imputation analyses showed that the proposed regularized model-based autoencoder significantly outperformed the vanilla model-based deep autoencoder.


If you have any questions please contact the ETD Team, libetd@njit.edu.

 
ETD Information
Digital Commons @ NJIT
Theses and DIssertations
ETD Policies & Procedures
ETD FAQ's
ETD home

Request a Scan
NDLTD

NJIT's ETD project was given an ACRL/NJ Technology Innovation Honorable Mention Award in spring 2003