Hi! I am currently a postdoctoral fellow in the Electrical Engineering and Computer Science (EECS) Department at the University of California, Berkeley, where I am very fortunate to be mentored by Professor Michael Jordan and Professor Martin Wainwright. Before coming to Berkeley, I completed my Ph.D. in the Department of Statistics at the University of Michigan, Ann Arbor, where I was very fortunate to be advised by Professor Long Nguyen and Professor Ya'acov Ritov.
At the moment, I am interested in several directions at the interplay between computation and theory in machine learning and statistics. Some of these directions can be summarized as follows:
For a recent version of my CV, please email me at minhnhat@berkeley.edu.
Educational background:
Current research interests:
Theses and Reports:
Publications:
Summary : This paper studies identifiability and convergence behaviors for parameters of multiple types, including matrix-variate ones, that arise in finite mixtures, and the effects of model fitting with extra mixing components. We consider several notions of strong identifiability in a matrix-variate setting, and use them to establish sharp inequalities relating the distance of mixture densities to the Wasserstein distances of the corresponding mixing measures. Characterization of identifiability is given for a broad range of mixture models commonly employed in practice, including location-covariance mixtures and location-covariance-shape mixtures, for mixtures of symmetric densities, as well as some asymmetric ones. Minimax lower bounds and rates of convergence for the maximum likelihood estimates are established for such classes, which are also confirmed by simulation studies.
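To give a concrete feel for the quantity at the heart of this paper, here is a minimal sketch (my own illustration, not code from the paper) of the first-order Wasserstein distance between two discrete mixing measures on the real line: atoms play the role of component locations and weights play the role of mixing proportions.

```python
# Illustrative only: W1 distance between two discrete mixing measures,
# computed with scipy. The specific atoms and weights are made up.
from scipy.stats import wasserstein_distance

# "True" mixing measure: 0.5*delta_0 + 0.5*delta_4
atoms_true, weights_true = [0.0, 4.0], [0.5, 0.5]
# Over-fitted estimate: an extra, nearly redundant atom near 0
atoms_fit, weights_fit = [0.0, 0.1, 4.0], [0.3, 0.2, 0.5]

w1 = wasserstein_distance(atoms_true, atoms_fit,
                          u_weights=weights_true, v_weights=weights_fit)
print(w1)  # small, since the extra atom sits close to an existing one
```

The point of the sketch: a fitted measure with a redundant extra atom can be very close to the truth in Wasserstein distance, which is why this metric is natural for over-fitted mixtures.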
Summary : We establish minimax lower bounds and maximum likelihood convergence rates of parameter estimation for mean-covariance multivariate Gaussian mixtures, shape-rate Gamma mixtures, and some variants of finite mixture models, including the setting where the number of mixing components is bounded but unknown. These models belong to what we call "weakly identifiable" classes, which exhibit specific interactions among mixing parameters driven by the algebraic structures of the class of kernel densities and their partial derivatives. Accordingly, both the minimax bounds and the maximum likelihood parameter estimation rates in these models are shown to be typically much slower than the usual $n^{-1/2}$ or $n^{-1/4}$ rates of convergence.
Summary : Singularities of a statistical model are the elements of the model's parameter space which make the corresponding Fisher information matrices degenerate. These are the points for which standard estimation techniques such as the maximum likelihood estimator do not admit the root-n parametric rate of convergence. We propose a general framework for the identification of singularity structures of the parameter space of finite mixtures, and study the impact of the singularity levels on minimax lower bounds and rates of convergence for the maximum likelihood estimator over a compact parameter space. Our investigation makes explicit the deep links between model singularities, parameter estimation rates and minimax bounds, and the algebraic geometry of the parameter space for mixtures of continuous distributions. The theory is applied to establish concrete convergence rates of parameter estimation for finite mixtures of skew-normal distributions. This rich and increasingly popular model class is shown to exhibit a remarkably complex range of asymptotic behaviors that have not been hitherto reported in the literature.
Summary : We propose a novel approach to the problem of multilevel clustering, which aims to simultaneously partition data in each group and discover grouping patterns among groups in a large hierarchically structured corpus of data. Our method involves a joint optimization formulation over several spaces of discrete probability measures, which are endowed with the Wasserstein distance metric. We propose a number of variants of this problem, which admit fast optimization algorithms, by exploiting the connection to the problem of finding Wasserstein barycenters. We also establish consistency properties enjoyed by our estimates of both local and global clusters. Finally, we present experimental results with both synthetic and real data to demonstrate the flexibility and scalability of the proposed approach.
Summary : In finite mixture models, apart from the underlying mixing measure, the true kernel density function of each subpopulation in the data is, in many scenarios, unknown. Perhaps the most popular approach is to choose some kernel functions that we empirically believe our data are generated from and use these kernels to fit our models. Nevertheless, as long as the chosen kernel and the true kernel are different, statistical inference of the mixing measure under this setting will be highly unstable. To overcome this challenge, we propose flexible and efficient robust estimators of the mixing measure in these models, which are inspired by the idea of the minimum Hellinger distance estimator, model selection criteria, and the superefficiency phenomenon. We demonstrate that our estimators consistently recover the true number of components and achieve the optimal convergence rates of parameter estimation under both the well- and mis-specified kernel settings for any fixed bandwidth. These desirable asymptotic properties are illustrated via careful simulation studies with both synthetic and real data.
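As a small hedged sketch of the distance underlying the minimum Hellinger distance idea (my own illustration under my own simplifications, not the estimators from the paper), the Hellinger distance between two univariate densities can be approximated on a grid:

```python
# Illustrative only: Hellinger distance between two densities on a grid.
# Function names and grid choices are my own.
import numpy as np

def hellinger(f_vals, g_vals, dx):
    """Hellinger distance from density values on an equispaced grid.
    Uses H^2(f, g) = 0.5 * integral (sqrt(f) - sqrt(g))^2, so H is in [0, 1]."""
    h2 = 0.5 * np.sum((np.sqrt(f_vals) - np.sqrt(g_vals)) ** 2) * dx
    return np.sqrt(h2)

x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
gauss = lambda m, s: np.exp(-(x - m) ** 2 / (2 * s * s)) / (s * np.sqrt(2 * np.pi))

# A two-component mixture vs. a mis-specified single Gaussian fit
# that matches the mixture's mean (0) and variance (5)
mixture = 0.5 * gauss(-2.0, 1.0) + 0.5 * gauss(2.0, 1.0)
single = gauss(0.0, np.sqrt(5.0))

print(hellinger(mixture, mixture, dx))  # a density has distance 0 to itself
print(hellinger(mixture, single, dx))   # strictly positive under misspecification
```

Minimizing such a distance over candidate mixture densities is the basic mechanism behind minimum-Hellinger-distance estimation.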
Summary : Recent years have witnessed substantial progress in understanding the behavior of EM for mixture models that are correctly specified. Given that model misspecification is common in practice, it is important to understand EM in this more general setting. We provide non-asymptotic guarantees for population and sample-based EM for parameter estimation under a few specific univariate settings of misspecified Gaussian mixture models. Due to misspecification, the EM iterates no longer converge to the true model and instead converge to the projection of the true model onto the set of models being searched over. We provide two classes of theoretical guarantees: first, we characterize the bias introduced due to the misspecification; and second, we prove that population EM converges at a geometric rate to the model projection under a suitable initialization condition. This geometric convergence rate for population EM implies a statistical complexity of order $1/\sqrt{n}$ when running EM with $n$ samples. We validate our theoretical findings in different cases via several numerical examples.
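For readers unfamiliar with sample-based EM in this kind of setting, here is a minimal sketch (my own simplified illustration, not the paper's setup) for a balanced two-component Gaussian mixture $0.5\,N(\theta, 1) + 0.5\,N(-\theta, 1)$ with a single unknown location $\theta$:

```python
# Illustrative only: sample EM for 0.5*N(theta,1) + 0.5*N(-theta,1).
import math, random

def em_step(theta, data):
    # E-step: posterior responsibility of the +theta component; with equal
    # weights and unit variances it simplifies to a logistic in 2*theta*x.
    # M-step: the weighted average (2w - 1) * x recovers the next theta.
    total = 0.0
    for x in data:
        w = 1.0 / (1.0 + math.exp(-2.0 * theta * x))
        total += (2.0 * w - 1.0) * x
    return total / len(data)

random.seed(0)
theta_true = 2.0
data = [random.gauss(theta_true if random.random() < 0.5 else -theta_true, 1.0)
        for _ in range(5000)]

theta = 0.5  # initialization away from the symmetric fixed point at 0
for _ in range(50):
    theta = em_step(theta, data)
print(theta)  # close to +/- theta_true, up to sampling error
```

In the correctly specified, well-separated regime shown here EM homes in quickly; the paper's point is to quantify what happens to both the limit point and the rate when the fitted model class does not contain the truth.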
Summary : Unsupervised and semi-supervised learning are challenging in complex domains such as natural images due to the lack of good generative models. Given the success of Convolutional Neural Networks (CNNs) for prediction in images, we design a new class of probabilistic generative models, namely the Neural Rendering Models (NRMs), whose inference corresponds to any given CNN architecture. NRM uses the given CNN to design the prior distribution in the probabilistic model. We show that this leads to efficient semi-supervised learning, which uses less labeled data while maintaining good prediction performance. NRM generates images from coarse to finer scales. It introduces a small set of latent variables at each level, and enforces dependencies among all the latent variables via a conjugate prior distribution. This conjugate prior yields a new regularizer for training CNNs: the Rendering Path Normalization (RPN). We demonstrate that this regularizer improves generalization, both in theory and in practice. Furthermore, likelihood estimation in NRM yields training losses for CNNs, and inspired by this, we design a loss termed the Max-Min cross entropy which outperforms the traditional cross-entropy loss for object classification. NRM with RPN and Max-Min cross entropy either beats or matches the state of the art on benchmarks including SVHN, CIFAR10, and CIFAR100 for semi-supervised and supervised learning tasks.
Summary : A line of recent work has characterized the behavior of the EM algorithm in favorable settings in which the population likelihood is locally strongly concave around its maximizing argument. Examples include suitably separated Gaussian mixture models and mixtures of linear regressions. We consider instead over-fitted settings in which the likelihood need not be strongly concave, or, equivalently, when the Fisher information matrix might be singular. In such settings, it is known that a global maximum of the MLE based on $n$ samples can have a non-standard $n^{-1/4}$ rate of convergence. How does the EM algorithm behave in such settings? Focusing on the simple setting of a two-component mixture fit to a multivariate Gaussian distribution, we study the behavior of the EM algorithm both when the mixture weights are different (unbalanced case), and when they are equal (balanced case). Our analysis reveals a sharp distinction between these cases: in the former, the EM algorithm converges geometrically to a point at Euclidean distance $O((d/n)^{1/2})$ from the true parameter, whereas in the latter case, the convergence rate is exponentially slower, and the fixed point has a much lower $O((d/n)^{1/4})$ accuracy. The slower convergence in the balanced over-fitted case arises from the singularity of the Fisher information matrix. Analysis of this singular case requires the introduction of some novel analysis techniques; in particular, we make use of a careful form of localization in the associated empirical process, and develop a recursive argument to progressively sharpen the statistical rate.
Summary : We propose a novel probabilistic approach to multilevel clustering problems based on composite transportation distance, which is a variant of transportation distance where the underlying metric is Kullback-Leibler divergence. Our method involves solving a joint optimization problem over spaces of probability measures to simultaneously discover grouping structures within groups and among groups. By exploiting the connection of our method to the problem of finding composite transportation barycenters, we develop fast and efficient optimization algorithms even for potentially large-scale multilevel datasets. Finally, we present experimental results with both synthetic and real data to demonstrate the efficiency and scalability of the proposed approach.
Summary : Gaussian mixtures of experts have been a popular modeling tool in various domains arising from biological, physical, and natural sciences. In this paper, we propose a rigorous framework for understanding parameter estimation under these models based on the algebraic structure of expert functions.
Summary : Principal stratification is a widely used framework for addressing post-randomization complications. After using principal stratification to define causal effects of interest, researchers are increasingly turning to finite mixture models to estimate these quantities. Unfortunately, standard estimators of the mixture parameters, like the MLE, are known to exhibit pathological behavior. We study this behavior in a simple but fundamental example, a two-component Gaussian mixture model in which only the component means are unknown, and focus on the setting in which the components are only weakly separated. We demonstrate via extensive simulations and theoretical arguments that the MLE behaves like a threshold estimator, in the sense that the MLE can give strong evidence that the means are equal when the truth is otherwise. We also explore the behavior of the MLE when it is non-zero, showing that it is difficult to estimate both the sign and magnitude of the means in this case. We provide diagnostics for all of these pathologies and apply these ideas to re-analyze two randomized evaluations of job training programs, JOBS II and Job Corps. Our results suggest that the corresponding maximum likelihood estimates should be interpreted with caution in these cases.
Summary : Cross-entropy loss has been widely used to train convolutional neural networks (CNNs). The usage of this loss is adapted from multi-class logistic regression, which learns the decision boundaries given the features learned by the CNNs. Lately, it has been shown that the cross-entropy loss and the architecture of the CNNs result simultaneously from the inference that learns the latent variables in the Latent-Dependent Deep Rendering Model (LD-DRM). In particular, the negative of the cross-entropy loss is proven to be a lower bound of the conditional log-likelihood of the LD-DRM. In this paper, we propose a better alternative to the cross-entropy loss by deriving a better lower bound for the conditional log-likelihood of the LD-DRM. We show that our new neural networks achieve an improvement of more than 2% over the baseline CNNs/LD-DRM for semi-supervised classification tasks on CIFAR10 and CIFAR100. Our results match or are comparable to the state-of-the-art on CIFAR10. We achieve state-of-the-art results on CIFAR100 when using only 10K labeled data, with an improvement of almost 1% over the previous state-of-the-art.
Summary : Cross-entropy loss has been widely used to train convolutional neural networks (CNNs). However, it is widely known to be susceptible to outliers in the training data. In this paper, we propose a novel cross-entropy loss based on a newly proposed deep generative model with Student's t distribution. We demonstrate via careful simulation studies that this new loss improves several supervised and semi-supervised tasks when there are numerous outliers in the data.
Conferences, Seminars, and Workshops Presentations:
Selected Honors and Awards:
Current collaborators: Long Nguyen, Ya'acov Ritov, Michael Jordan, Martin Wainwright, Bin Yu, Richard Baraniuk, Anima Anandkumar, Natesh Pillai, Avi Feller, Peng Ding, Ankit Patel, Will Fithian, Hung Bui, Dinh Phung, Mikhail Yurochkin, Viet Huynh, Aritra Guha, Chiao-Yu Yang, Tan Nguyen, Lihua Lei, Raaz Dwivedi, Koulik Khamaru, Joseph Borja, Wenlong Mou, Tianyi Lin, Banghua Zhu, Yuting Wei.