Ongoing projects

  • Protein fold classification in a continuous structure space
  •    Protein structure classification hierarchically clusters domain structures based on structure and/or sequence similarity and plays an important role in the study of protein structure-function relationships and protein evolution. Among the many hierarchical classifications that have been developed, SCOP and CATH are widely viewed as the gold standard. Their common hierarchy levels include class, fold, and superfamily. Domain classification at the fold level is of special interest because it is the lowest level of classification that does not depend on protein sequence similarity. However, current fold classifications such as those in SCOP and CATH are controversial because they implicitly assume that folds are discrete islands in the structure space, whereas increasing evidence points to significant similarities among folds and supports a continuous fold space. I developed a method that classifies a domain into the existing folds of CATH while taking the continuity of the structure space into account. Depending on the structural similarity score used, the new classification differs from the current classification for 4 to 12% of domains. These estimates reflect the direct impact of structure continuity on fold classification. Analyzing the inconsistencies pinpoints important factors that influence fold classification.

       In submission

  • Are translated pseudogenes functional?
  •    Pseudogenes are relics of formerly functional genes and are commonly identified by the presence of mutations that disrupt an otherwise open reading frame (ORF). Nevertheless, ~10% of the commonly identified pseudogenes are transcribed into RNAs, some of which can affect the expression of other genes, although it remains controversial whether these transcribed pseudogenes have physiological functions and are under purifying selection. Interestingly, the recently published human proteomes include peptides encoded by 125 pseudogenes. These coding pseudogenes may be functional and subject to purifying selection. Alternatively, their translation may be accidental and non-functional. To distinguish between these two hypotheses, I aligned orthologous pseudogenes of human and rhesus monkey and estimated the nonsynonymous/synonymous rate ratio (dN/dS) for the regions predicted to be translated in these pseudogenes. I found that the median dN/dS of the translated pseudogenes predicted by the proteomic data is not significantly different from that of other pseudogenes. Nevertheless, the pseudogenes with evidence for both transcription and translation have dN/dS values significantly lower than 1, indicating the action of purifying selection. The detected purifying selection is not due to the purging of mutations that lead to the production of toxic peptides. Moreover, the translated pseudogenes under significant purifying selection are not lineage-specific to primates. Taken together, the results suggest that translated pseudogenes have potential physiological functions. These functions may be explained by the fact that, even after pseudogenization, the translated pseudogenes retain relatively complete functional domains. However, unlike their parental genes, their functions are tissue-specific because their expression is tissue-specific.

       Manuscript in preparation
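
       The nonsynonymous/synonymous rate ratio used in the pseudogene project can be illustrated with a small sketch. The project description does not say which estimator was used, so the code below is an assumption: a simplified Nei-Gojobori-style count with a Jukes-Cantor correction, applied to two codon-aligned, in-frame sequences. All function names are hypothetical, and codons containing gaps, ambiguous bases, or stop codons are simply skipped.

```python
from itertools import permutations, product
from math import log

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon):
    """Number of synonymous sites in one sense codon (between 0 and 3)."""
    s = 0.0
    for i in range(3):
        for b in BASES:
            if b != codon[i]:
                mutant = codon[:i] + b + codon[i + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    s += 1 / 3
    return s


def count_diffs(c1, c2):
    """Average (synonymous, nonsynonymous) differences over all mutation
    paths from c1 to c2, ignoring paths that pass through a stop codon."""
    diff = [i for i in range(3) if c1[i] != c2[i]]
    totals = []
    for order in permutations(diff):
        cur, sd, nd, ok = c1, 0, 0, True
        for i in order:
            nxt = cur[:i] + c2[i] + cur[i + 1:]
            if CODON_TABLE[nxt] == "*":
                ok = False
                break
            if CODON_TABLE[nxt] == CODON_TABLE[cur]:
                sd += 1
            else:
                nd += 1
            cur = nxt
        if ok:
            totals.append((sd, nd))
    if not totals:
        return 0.0, 0.0
    return (sum(t[0] for t in totals) / len(totals),
            sum(t[1] for t in totals) / len(totals))


def jukes_cantor(p):
    """Jukes-Cantor distance; None when p is too large to correct."""
    return -0.75 * log(1 - 4 * p / 3) if p < 0.75 else None


def dn_ds(seq1, seq2):
    """Return (dN, dS, dN/dS) for two codon-aligned sequences.
    Entries are None when they cannot be estimated."""
    S = N = sd = nd = 0.0
    for k in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[k:k + 3].upper(), seq2[k:k + 3].upper()
        # Skip gaps, ambiguous bases, and stop codons.
        if CODON_TABLE.get(c1, "*") == "*" or CODON_TABLE.get(c2, "*") == "*":
            continue
        s = (syn_sites(c1) + syn_sites(c2)) / 2
        S += s
        N += 3 - s
        d = count_diffs(c1, c2)
        sd += d[0]
        nd += d[1]
    if S == 0 or N == 0:
        return None, None, None
    dN, dS = jukes_cantor(nd / N), jukes_cantor(sd / S)
    ratio = dN / dS if dN is not None and dS else None
    return dN, dS, ratio
```

       A ratio well below 1 across many pseudogenes is what one would expect under purifying selection. Testing whether it is significantly below 1 requires a statistical test that this sketch does not include.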

  • Development of a model-based clustering method
  •    In structural bioinformatics, a common task is clustering protein structures according to their structural similarities. Currently, this is handled by methods that do not assume a statistical model, such as hierarchical clustering. I implemented a Gaussian mixture model (GMM) with an unknown number of components to cluster structures. The model takes a structure as a cluster center and assumes that its dissimilarities to other structures follow a Gaussian distribution with zero mean. In this setting, each Gaussian component corresponds to a structure cluster. The variable number of components is modeled by a birth-death process. All parameters of the model are estimated using a Markov chain Monte Carlo (MCMC) method.

       A beta version of the software is being tested
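
       As a rough illustration of the mixture model described for the clustering project, the sketch below covers only one piece: the zero-mean Gaussian density of a dissimilarity to a cluster center, and the resulting assignment of each structure to its most probable component. It assumes the centers, mixture weights, and standard deviations are already given. The birth-death process and the MCMC estimation are deliberately left out, and all names are hypothetical rather than taken from the actual software.

```python
import math


def log_component_density(d, sigma):
    """Log density of a dissimilarity d under a Gaussian N(0, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - d ** 2 / (2 * sigma ** 2)


def responsibilities(dissim_to_centers, weights, sigmas):
    """Posterior probability that one structure belongs to each component,
    given its dissimilarity to every cluster center."""
    logs = [math.log(w) + log_component_density(d, s)
            for d, w, s in zip(dissim_to_centers, weights, sigmas)]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


def assign(dissim, centers, weights, sigmas):
    """Assign each structure to its most probable component.

    dissim[i][j] is the dissimilarity between structures i and j;
    centers lists the index of the center structure of each component.
    """
    labels = []
    for row in dissim:
        r = responsibilities([row[c] for c in centers], weights, sigmas)
        labels.append(max(range(len(centers)), key=lambda k: r[k]))
    return labels
```

       In a full sampler, a step like this would alternate with updates of the centers, weights, and variances, and with birth and death moves that change the number of components.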

    Past projects

  • Publication page

  • Future interests

  • Estimating the prevalence of epistasis within protein domain structures
  • Clustering protein models using a Gaussian mixture model (GMM) to select near-native structures
  • De novo domain structure classification using GMM and the evolution of structural novelty
  • Exploring functions of genome spatial organization
  • Impact of effective population size on protein properties such as flexibility and stability