Hi! I am currently a postdoctoral fellow in the Electrical Engineering and Computer Science (EECS) Department at UC Berkeley, where I am very fortunate to be mentored by Professor Michael Jordan and Professor Martin Wainwright. Before coming to Berkeley, I completed my PhD in the Department of Statistics at the University of Michigan, Ann Arbor, where I was very fortunate to be advised by Professor Long Nguyen and Professor Ya'acov Ritov.
I am currently interested in several directions at the interplay between computation and theory in machine learning and statistics. Some of these directions can be summarized as follows:
For a recent version of my CV, please email me at minhnhat@berkeley.edu.
Educational background:
Current research interests:
Theses and Reports:
Publications:
Summary: This paper studies identifiability and convergence behaviors for parameters of multiple types, including matrix-variate ones, that arise in finite mixtures, and the effects of fitting models with extra mixing components. We consider several notions of strong identifiability in a matrix-variate setting and use them to establish sharp inequalities relating the distance between mixture densities to the Wasserstein distances between the corresponding mixing measures. Characterizations of identifiability are given for a broad range of mixture models commonly employed in practice, including location-covariance mixtures and location-covariance-shape mixtures, for mixtures of symmetric densities as well as some asymmetric ones. Minimax lower bounds and rates of convergence for the maximum likelihood estimates are established for such classes, and are also confirmed by simulation studies.
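On the real line, the Wasserstein distance between two discrete mixing measures has a simple closed form via the CDF area formula. The sketch below is only a hypothetical illustration of that formula (it is not code from the paper); the function name `wasserstein_1d` is my own.

```python
import numpy as np

def wasserstein_1d(atoms1, weights1, atoms2, weights2):
    """First-order Wasserstein distance between two discrete mixing
    measures on the real line, via the CDF area formula
    W1(G1, G2) = integral of |F1(x) - F2(x)| dx."""
    a1, w1 = np.asarray(atoms1, float), np.asarray(weights1, float)
    a2, w2 = np.asarray(atoms2, float), np.asarray(weights2, float)
    # Pooled, sorted support of both measures.
    xs = np.sort(np.concatenate([a1, a2]))
    # CDFs of both measures evaluated at the pooled support points.
    F1 = np.array([w1[a1 <= x].sum() for x in xs])
    F2 = np.array([w2[a2 <= x].sum() for x in xs])
    # Integrate |F1 - F2| over the intervals between support points.
    return float(np.sum(np.abs(F1[:-1] - F2[:-1]) * np.diff(xs)))

# Example: moving a unit point mass from 0 to 1 costs exactly 1.
print(wasserstein_1d([0.0], [1.0], [1.0], [1.0]))  # 1.0
```

The same quantity for multivariate atoms requires solving a small linear program, but the one-dimensional case already conveys how the distance compares mixing measures atom by atom.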
Summary: We establish minimax lower bounds and maximum likelihood convergence rates of parameter estimation for mean-covariance multivariate Gaussian mixtures, shape-rate Gamma mixtures, and some variants of finite mixture models, including the setting where the number of mixing components is bounded but unknown. These models belong to what we call "weakly identifiable" classes, which exhibit specific interactions among mixing parameters driven by the algebraic structures of the class of kernel densities and their partial derivatives. Accordingly, both the minimax bounds and the maximum likelihood parameter estimation rates in these models are shown to be typically much slower than the usual $n^{-1/2}$ or $n^{-1/4}$ rates of convergence.
Summary: Singularities of a statistical model are the elements of the model's parameter space which make the corresponding Fisher information matrices degenerate. These are the points for which standard estimation techniques such as the maximum likelihood estimator do not admit the root-$n$ parametric rate of convergence. We propose a general framework for the identification of singularity structures of the parameter space of finite mixtures, and study the impact of the singularity levels on minimax lower bounds and rates of convergence for the maximum likelihood estimator over a compact parameter space. Our investigation makes explicit the deep links between model singularities, parameter estimation rates and minimax bounds, and the algebraic geometry of the parameter space for mixtures of continuous distributions. The theory is applied to establish concrete convergence rates of parameter estimation for finite mixtures of skew-normal distributions. This rich and increasingly popular model class is shown to exhibit a remarkably complex range of asymptotic behaviors that have not hitherto been reported in the literature.
Summary: We propose a novel approach to the problem of multilevel clustering, which aims to simultaneously partition data in each group and discover grouping patterns among groups in a large hierarchically structured corpus of data. Our method involves a joint optimization formulation over several spaces of discrete probability measures, which are endowed with the Wasserstein metric. We propose a number of variants of this problem, which admit fast optimization algorithms, by exploiting the connection to the problem of finding Wasserstein barycenters. We also establish consistency properties enjoyed by our estimates of both local and global clusters. Finally, we present experimental results with both synthetic and real data to demonstrate the flexibility and scalability of the proposed approach.
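The Wasserstein barycenter that underlies this formulation is easiest to see in one dimension, where the $W_2$ barycenter of a set of measures is obtained by averaging their quantile functions. The snippet below is a minimal sketch of that fact on simulated groups, not the multilevel clustering algorithm itself; all names and parameters here are illustrative assumptions.

```python
import numpy as np

# Three "groups" of data with different locations; their W2 barycenter
# in one dimension is the measure whose quantile function is the
# average of the groups' quantile functions.
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, size=1000) for m in (-2.0, 0.0, 2.0)]

# Evaluate each group's empirical quantile function on a common grid.
qs = np.linspace(0.005, 0.995, 200)
quantiles = np.stack([np.quantile(g, qs) for g in groups])

# Quantile averaging gives the barycenter's quantile function.
barycenter_quantiles = quantiles.mean(axis=0)
```

For the symmetric configuration above the barycenter is centered near zero; in the paper's setting, barycenters like this serve as the "global cluster" representatives among groups of discrete measures.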
Summary: In finite mixture models, apart from the underlying mixing measure, the true kernel density function of each subpopulation in the data is, in many scenarios, unknown. Perhaps the most popular approach is to choose kernel functions that we empirically believe our data are generated from and use these kernels to fit our models. Nevertheless, whenever the chosen kernel and the true kernel differ, statistical inference of the mixing measure under this setting is highly unstable. To overcome this challenge, we propose flexible and efficient robust estimators of the mixing measure in these models, inspired by the minimum Hellinger distance estimator, model selection criteria, and the superefficiency phenomenon. We demonstrate that our estimators consistently recover the true number of components and achieve the optimal convergence rates of parameter estimation under both well- and mis-specified kernel settings for any fixed bandwidth. These desirable asymptotic properties are illustrated via careful simulation studies with both synthetic and real data.
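The minimum Hellinger distance idea can be sketched in its simplest form: fit a parametric density to a nonparametric (kernel) density estimate by minimizing the Hellinger distance between them. This toy version below fits only a location parameter by grid search, assuming a fixed Gaussian kernel bandwidth; it is an illustration of the general idea, not the estimators proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=1.0, size=2000)

grid = np.linspace(-4.0, 6.0, 400)
dx = grid[1] - grid[0]

# Fixed-bandwidth Gaussian kernel density estimate of the data.
h = 0.3
kde = np.exp(-0.5 * ((grid[:, None] - data[None, :]) / h) ** 2).sum(axis=1)
kde /= len(data) * h * np.sqrt(2.0 * np.pi)

def normal_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def hellinger_sq(p, q):
    # Squared Hellinger distance between two densities on the grid.
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx

# Minimum-Hellinger estimate of the location parameter by grid search.
mus = np.linspace(-2.0, 4.0, 121)
mu_hat = mus[np.argmin([hellinger_sq(kde, normal_pdf(grid, m)) for m in mus])]
```

The robustness appeal is that the Hellinger criterion downweights regions where the fitted kernel and the data's density disagree, so moderate kernel misspecification perturbs the location estimate far less than it would a likelihood-based fit.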
Summary: Recent years have witnessed substantial progress in understanding the behavior of EM for mixture models that are correctly specified. Given that model misspecification is common in practice, it is important to understand EM in this more general setting. We provide non-asymptotic guarantees for population and sample-based EM for parameter estimation under a few specific univariate settings of misspecified Gaussian mixture models. Due to misspecification, the EM iterates no longer converge to the true model and instead converge to the projection of the true model onto the set of models being searched over. We provide two classes of theoretical guarantees: first, we characterize the bias introduced by the misspecification; and second, we prove that population EM converges at a geometric rate to the model projection under a suitable initialization condition. This geometric convergence rate for population EM implies a statistical complexity of order $1/\sqrt{n}$ when running EM with $n$ samples. We validate our theoretical findings via several numerical examples.
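For intuition about the EM iterates studied here, the well-specified symmetric case is the simplest: for the mixture $0.5\,N(\theta, 1) + 0.5\,N(-\theta, 1)$, the sample EM update reduces to the one-line map $\theta \mapsto \frac{1}{n}\sum_i x_i \tanh(\theta x_i)$. The sketch below runs this update on simulated data; it is a minimal well-specified illustration, not the misspecified analysis of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 2.0
n = 50_000

# Draw from the symmetric mixture 0.5*N(theta*, 1) + 0.5*N(-theta*, 1).
signs = rng.choice([-1.0, 1.0], size=n)
x = signs * theta_star + rng.normal(size=n)

# Sample EM for the location parameter; each E-step/M-step pair
# collapses to theta <- mean(x * tanh(theta * x)).
theta = 0.5  # initialization away from zero
for _ in range(50):
    theta = float(np.mean(x * np.tanh(theta * x)))
```

Started away from zero, these iterates converge geometrically toward the MLE near $\theta^\star = 2$; under misspecification the same recursion instead tracks the projection of the true model, which is the regime the paper analyzes.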
Summary: Generative models that can capture latent variations in data are difficult to design in complex domains such as natural images, where there are large numbers of nuisance variables. Given the success of Convolutional Neural Networks (CNNs) on inference tasks in such domains, we aim to design generative models whose inference corresponds to the CNNs. One such class is the Deep Rendering Model (DRM). The DRM generates images via multiple levels of abstraction, from coarse to fine scale, and introduces a small set of latent variables at each level. However, a number of simplifying assumptions are made in the DRM to derive the CNNs as the bottom-up inference algorithm, assumptions that do not correspond to realistic variation in natural images. For instance, the latent variables at different scales of the DRM are assumed to be independent. We propose an extension to the DRM, termed the Latent-Dependent Deep Rendering Model (LD-DRM), that enforces dependencies among the latent variables via a parametrized joint prior distribution. This joint prior yields a new form of regularization for training the CNNs---the Rendering Path Normalization (RPN). Under the LD-DRM, we obtain consistent estimators for unsupervised/semi-supervised learning tasks and derive generalization bounds. Our bound suggests that the RPN regularization helps improve generalization, an observation that is corroborated in practice. In our experiments, the LD-DRM either beats or matches the state of the art on popular benchmarks, including SVHN, CIFAR10, and CIFAR100, on both semi-supervised and supervised learning tasks.
Summary: Principal stratification is a widely used framework for addressing post-randomization complications. After using principal stratification to define causal effects of interest, researchers are increasingly turning to finite mixture models to estimate these quantities. Unfortunately, standard estimators of the mixture parameters, such as the MLE, are known to exhibit pathological behavior. We study this behavior in a simple but fundamental example: a two-component Gaussian mixture model in which only the component means are unknown, focusing on the setting in which the components are only weakly separated. We demonstrate via extensive simulations and theoretical arguments that the MLE behaves like a threshold estimator, in the sense that the MLE can give strong evidence that the means are equal when the truth is otherwise. We also explore the behavior of the MLE when it is non-zero, showing that it is difficult to estimate both the sign and the magnitude of the means in this case. We provide diagnostics for all of these pathologies and apply these ideas to re-analyze two randomized evaluations of job training programs, JOBS II and Job Corps. Our results suggest that the corresponding maximum likelihood estimates should be interpreted with caution in these cases.
Summary: Gaussian mixtures of experts have been a popular modeling tool in various domains arising in the biological, physical, and natural sciences. In this paper, we propose a rigorous framework for understanding parameter estimation under these models based on the algebraic structure of the expert functions.
Summary: Cross-entropy loss has been widely used to train convolutional neural networks (CNNs). The use of this loss is adapted from multi-class logistic regression, which learns the decision boundaries given the features learned by the CNNs. Lately, it has been shown that the cross-entropy loss and the architecture of the CNNs result simultaneously from the inference that learns the latent variables in the Latent-Dependent Deep Rendering Model (LD-DRM). In particular, the negative of the cross-entropy loss is proven to be a lower bound on the conditional log-likelihood of the LD-DRM. In this paper, we propose a better alternative to the cross-entropy loss by deriving a tighter lower bound for the conditional log-likelihood of the LD-DRM. We show that our new neural networks achieve an improvement of more than 2% over the baseline CNNs/LD-DRM for semi-supervised classification tasks on CIFAR10 and CIFAR100. Our results match or are comparable to the state of the art on CIFAR10. We achieve state-of-the-art results on CIFAR100 when using only 10K labeled examples, with an improvement of almost 1% over the previous state of the art.
Summary: Cross-entropy loss has been widely used to train convolutional neural networks (CNNs). However, it is widely known to be susceptible to outliers in the training data. In this paper, we propose a novel cross-entropy loss based on a newly proposed deep generative model with Student's t distributions. We demonstrate via careful simulation studies that this new loss improves several supervised and semi-supervised tasks when there are numerous outliers in the data.
Conferences, Seminars, and Workshops Presentations:
Selected Honors and Awards:
Current collaborators: Long Nguyen, Ya'acov Ritov, Michael Jordan, Martin Wainwright, Richard Baraniuk, Anima Anandkumar, Natesh Pillai, Avi Feller, Peng Ding, Ankit Patel, Will Fithian, Hung Bui, Dinh Phung, Mikhail Yurochkin, Viet Huynh, Aritra Guha, Chiao-Yu Yang, Tan Nguyen, Lihua Lei, Raaz Dwivedi, Koulik Khamaru, Joseph Borja, Wenlong Mou, Yuting Wei.