Publications:

[ Contextual Text Mining ] [ Language Modeling in IR ] [ Frequent Patterns ]
[ Search and Tagging ] [ Scientific Literature Mining ] [ Opinion Summarization ]

[ Sort by year ] [ Sort by topic ]

Contextual Text Mining

  • (KDD 05: ) Discovering Evolutionary Theme Patterns from Text - An Exploration of Temporal Text Mining. Qiaozhu Mei, ChengXiang Zhai. [pdf] [slides] [data] [BibTex]

    This paper deals with a novel problem - discovering evolutionary theme patterns from text, including a theme evolution graph as well as theme life cycles. The context information here is time. A "theme" in this paper is equivalent to a "topic" in the topic modeling literature. PLSA is used to extract themes; KL divergence is used to infer theme transitions; and an HMM is used to segment text with the extracted themes. The results summarize topic evolution and topic trends in news and scientific literature.
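    The theme-transition step can be sketched as a KL-divergence computation between two theme word distributions; a small divergence suggests one theme plausibly evolves into the other. The distributions below are made up for illustration, and the epsilon floor for unseen words is a simplification, not the paper's exact estimator:

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) over the union vocabulary of two word distributions.
    Words missing from either distribution get a tiny floor probability."""
    vocab = set(p) | set(q)
    return sum(
        p.get(w, eps) * math.log(p.get(w, eps) / q.get(w, eps))
        for w in vocab
    )

# Two toy theme word distributions from consecutive time windows
theme_week1 = {"storm": 0.5, "warning": 0.3, "coast": 0.2}
theme_week2 = {"storm": 0.4, "damage": 0.4, "relief": 0.2}

# Lower divergence = more plausible theme transition
score = kl_divergence(theme_week1, theme_week2)
```

    In practice one would compute this score between every pair of themes from adjacent time windows and keep the low-divergence pairs as edges of the evolution graph.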

  • (WWW 06: ) A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. Qiaozhu Mei, Chao Liu, Hang Su, ChengXiang Zhai. [pdf] [slides] [BibTex]

    This paper is a novel exploration of topic models in blogs (likely the very first). The goal is to extract spatiotemporal theme patterns, which summarize how opinions change over time and across locations in blogs. The context here is time and geographic location. The basic idea is to add time and location variables into PLSA, which allows a document to sample topics according to either the time or the location. The results summarize the subtopics of social events (e.g., Hurricane Katrina) and their change over time and geography.

  • (KDD 06: ) A Mixture Model for Contextual Text Mining. Qiaozhu Mei, ChengXiang Zhai. [pdf] [slides] [BibTex]

    This paper generalizes the problem of contextual text mining by defining it as 1) extracting topics from text and 2) modeling the content variation (i.e., view) and strength variation (i.e., coverage) of topics across different contexts. Any context variable that defines an explicit partition of the document collection can be handled in such a model. We present examples of context information such as time, location, authorship, and events. Many contextual text mining problems and models, such as the one in WWW 06, become special cases of this model.

  • (WWW 07: ) Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. Qiaozhu Mei, Xu Ling, Mattew Wondra, Hang Su, ChengXiang Zhai. [pdf] [slides] [BibTex]

    This is a novel exploration of modeling topics and sentiments in a unified probabilistic model; no previous model could capture topics and sentiments simultaneously. With such a model, one can generate a table-like summary of facets and opinions, and monitor the dynamics of sentiments. This is also an exploration of sentiment as an implicit context, which turns out to be quite different from explicit contexts.

    Two particularly interesting findings in this paper: 1) PLSA can be supervised with prior p(w|topic) distributions, by changing the MLE in the M step into a MAP estimation. In this way a user can give guidance to the topic models, for example by plugging in a general sentiment model. 2) The training of the sentiment models is itself an exploration of domain-adaptive learning, or transfer learning: the more (diverse) domains in the training data, the better the learned sentiment models fit sentiments in a new domain. We used a training dataset of sentiment-labeled sentences from ten different domains, collected from OPINMIND, a blog search engine that no longer exists.
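    Finding 1) amounts to a conjugate-prior M-step: instead of normalizing the expected word counts alone (MLE), one adds pseudo-counts drawn from a prior word distribution. The function and toy numbers below are an illustrative sketch, not the paper's code:

```python
def map_m_step(expected_counts, prior, mu):
    """MAP re-estimate of p(w | topic): mix the expected word counts from
    the E step with mu pseudo-counts from a prior word distribution
    (e.g., a general sentiment model). mu = 0 recovers plain MLE."""
    vocab = set(expected_counts) | set(prior)
    total = sum(expected_counts.values()) + mu
    return {
        w: (expected_counts.get(w, 0.0) + mu * prior.get(w, 0.0)) / total
        for w in vocab
    }

# Toy expected counts for one topic, and a toy positive-sentiment prior
counts = {"good": 8.0, "excellent": 2.0}
sentiment_prior = {"good": 0.6, "great": 0.4}
p_w_topic = map_m_step(counts, sentiment_prior, mu=5.0)
```

    The strength mu controls how strongly the user's prior pulls the topic toward the sentiment model; words from the prior (here "great") receive non-zero probability even if the E step never assigned them to the topic.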

  • (KDD 07: ) Automatic Labeling of Multinomial Topic Models. Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai. Runner-up Best Student Paper Award [pdf] [slides] [BibTex]

    A long-standing problem common to topic models, or any unigram language models, is that it is very difficult to assign each topic a label that captures its latent meaning. This paper solves the problem by automatically generating phrase labels for topic models. Such labels can be tuned for a specific domain or user.

  • (WWW 08: ) Topic Modeling with Network Regularization. Qiaozhu Mei, Deng Cai, Duo Zhang, ChengXiang Zhai. [pdf] [slides] [BibTex]

    This is the first attempt in the literature to combine the power of topic modeling and graph-based regularization. Topic models make strong assumptions about the underlying model of the text and try to maximize the global data likelihood. A graph-based regularizer (e.g., the graph harmonic function in Zhu et al. 2003), by contrast, tries to "smooth" the label distributions locally on the graph (eventually propagating them to the whole graph). In other words, a topic model closely follows the assumptions, or guidance, of the user, while a graph-based regularizer makes no assumption about the model and listens solely to the data. Combining the two leads to mutual enhancement.

    Indeed, the graph-based regularizer makes the topics smoother over the network structure (thus more reasonable, and better at explaining the data), while the topic model helps the graph-based method escape local maxima and bridges local structures on the graph with latent links derived from topics in the text. The result is an effective model for combining text with social networks, with which one can extract topical communities as well as topic maps. The contexts here are the nodes of a social network.

  • (WSDM 08: ) Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff? Qiaozhu Mei, Kenneth Church. [pdf] [slides] [video] [BibTex]

    This paper is partially related to contextual text mining. Its main focus is to introduce entropy analysis to 18 months of Live Search logs (which appears to be one of the largest, if not the largest, search log datasets in the literature). By computing joint entropies over the search variables (e.g., IP, Query, Url) one at a time, two at a time, and three at a time, many fundamental questions can be answered: How big is the web? How hard is search? How hard is query suggestion? With personalization? With backoff?

    Personalization is useful: it can potentially cut the difficulty of search in half. But what if there is no data about a user? Backing off to groups of users still helps. As a proof of concept, a personalization model with backoff is proposed, illustrated with an IP address as a user and the first k bytes of the IP address as a group of users. Results show that backing off to the first two or three bytes of the IP works better than either complete personalization or no personalization at all. More generally, one could back off to market segments and demographics (other contexts), such as day-of-week or hour-of-day.
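    The backoff idea can be illustrated by comparing the entropy of queries with their conditional entropy given an IP prefix: conditioning on any context can only reduce (or preserve) entropy, i.e., make search easier. The toy log below is made up:

```python
import math
from collections import Counter

def entropy(counts):
    """H(X) in bits, from a Counter of outcomes."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(pairs):
    """H(Y | X) in bits, from a list of (x, y) observation pairs."""
    joint = Counter(pairs)
    marginal = Counter(x for x, _ in pairs)
    total = sum(joint.values())
    return -sum((n / total) * math.log2(n / marginal[x])
                for (x, y), n in joint.items())

# Toy search log: back off from the full IP to its first two bytes
log = [("12.34.3.4", "weather"), ("12.34.3.4", "weather"),
       ("12.34.9.9", "weather"), ("56.78.7.8", "news")]
h_query = entropy(Counter(q for _, q in log))
prefix_pairs = [(".".join(ip.split(".")[:2]), q) for ip, q in log]
h_query_given_prefix = conditional_entropy(prefix_pairs)
```

    The gap between the two numbers is a measure of how much the backed-off context (here, the two-byte IP prefix) helps predict the query.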

  • (CIKM 08: ) Modeling Hidden Topics on Document Manifold. Deng Cai, Qiaozhu Mei, Jiawei Han, ChengXiang Zhai. [pdf] [slides] [BibTex]

    The model in this paper is similar to the WWW 08 one, but no explicit network structure exists in the data. Instead, a document manifold is computed directly from the text, and the topic model is built on top of it. Empirical clustering experiments show that this model outperforms topic models (PLSA, LDA), graph-partitioning and matrix-factorization methods (e.g., Normalized Cut, NMF-NCW, Average Association), and traditional clustering methods like k-means.

  • (KDD 08: ) Mining Multi-Faceted Overviews of Arbitrary Topics in a Text Collection. Xu Ling, Qiaozhu Mei, ChengXiang Zhai, Bruce Schatz. [pdf] [slides] [BibTex]

    This is an application of contextual text mining models to generating multi-faceted overviews from text. The key issue in this paper is how to make the process interactive - exploiting guidance from the user and generating the overview the way the user favors.

Language Modeling in Information Retrieval

  • (SIGIR 08: ) A General Optimization Framework for Smoothing Language Models on Graph Structures. Qiaozhu Mei, Duo Zhang, ChengXiang Zhai. [pdf] [slides] [BibTex]

    This paper revisits a fundamental problem in language-modeling-based information retrieval - smoothing. It provides a theoretical optimization framework for language model smoothing, which had previously been handled with various heuristics. The basic idea is to map the document language models and query language models (distributions over words) onto a graph structure, which could be either a document similarity graph or a word similarity graph. The language models are then presented as surfaces over the graph, and language model smoothing becomes equivalent to smoothing these surfaces.

    This framework has nice connections to random walks and absorption probabilities, and is very flexible to instantiate. Several instantiations using document and word graphs substantially outperform state-of-the-art smoothing methods.
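    One simple instantiation on a document graph can be sketched as iteratively interpolating each document's word distribution with the average distribution of its graph neighbors, so the "surface" over the graph becomes smoother. This is a sketch under simplified assumptions (uniform edge weights, a fixed interpolation coefficient), not the paper's exact objective:

```python
def smooth_on_graph(models, neighbors, lam=0.5, iterations=10):
    """Iteratively smooth per-document word distributions over a
    similarity graph: each pass replaces a document's model with an
    interpolation of itself and its neighbors' average model."""
    for _ in range(iterations):
        updated = {}
        for d, model in models.items():
            nbrs = neighbors.get(d, [])
            if not nbrs:
                updated[d] = dict(model)
                continue
            vocab = set(model)
            for n in nbrs:
                vocab |= set(models[n])
            avg = {w: sum(models[n].get(w, 0.0) for n in nbrs) / len(nbrs)
                   for w in vocab}
            updated[d] = {w: (1 - lam) * model.get(w, 0.0) + lam * avg[w]
                          for w in vocab}
        models = updated
    return models

# Two toy document models connected in a similarity graph
docs = {"d1": {"retrieval": 0.7, "model": 0.3},
        "d2": {"retrieval": 0.2, "smoothing": 0.8}}
graph = {"d1": ["d2"], "d2": ["d1"]}
smoothed = smooth_on_graph(docs, graph)
```

    After smoothing, d1 assigns some probability to "smoothing" even though the word never occurred in it - the same effect the HLT/NAACL 06 document-expansion method achieves with a single iteration.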

  • (SIGIR 07: ) A Study of Poisson Query Generation Model for Information Retrieval. Qiaozhu Mei, Hui Fang, ChengXiang Zhai. [pdf] [slides] [BibTex]

    Although widely used in the literature, a document/query language model does not have to be a multinomial distribution over words. We show in this paper that a Poisson process is an alternative type of language model, with which we can obtain similar, and sometimes better, performance than multinomial language models. In the context of retrieval, Bayesian smoothing of the Poisson document model with a Gamma prior is proven to be equivalent to Dirichlet smoothing for the multinomial. Despite the similarity, Poisson has two fundamental advantages: 1) the parameters do not need to sum to 1, which offers extra freedom for retrieval; and 2) it provides theoretical support for per-term smoothing, so that different words can be smoothed differently.
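    The equivalence can be made concrete: the Gamma-prior Bayesian estimate for the Poisson model takes the same form as the familiar Dirichlet-smoothed multinomial estimate. A sketch with made-up counts (mu is the usual Dirichlet smoothing parameter):

```python
def dirichlet_smoothed_prob(count_w_in_doc, doc_length, p_w_collection, mu=2000.0):
    """Dirichlet-prior smoothed p(w | d) for the multinomial model; the
    Gamma-prior estimate of the per-word Poisson rate reduces to the same
    expression, without the constraint that the rates sum to one."""
    return (count_w_in_doc + mu * p_w_collection) / (doc_length + mu)

# Made-up statistics: "retrieval" occurs 5 times in a 100-word document,
# and has probability 0.001 under the collection model
p_seen = dirichlet_smoothed_prob(5, 100, 0.001)
p_unseen = dirichlet_smoothed_prob(0, 100, 0.001)
```

    Per-term smoothing, advantage 2), would simply let mu vary by word rather than being a single global constant.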

  • (ACL 08: ) Generating Impact-Based Summaries for Scientific Literature. Qiaozhu Mei, ChengXiang Zhai. [pdf] [slides] [BibTex]

    This paper proposes a novel problem: generating an impact-based summary for a scientific paper. Such a summary differs from a common summary (e.g., an abstract) in that it reflects the impact of the paper on the literature after its publication. It could be used in a literature management system as a complement to the numerical impact factor.

    The basic idea is to cast the problem as sentence retrieval, where the query is an impact language model estimated from both the original document and the citation contexts (the sentences surrounding the citation label in other papers that cite this one). Specific features such as the authority of the citing paper and the proximity (to the citation label) of the citation context are considered.

  • (HLT/NAACL 06: ) Language Model Information Retrieval with Document Expansion. Tao Tao, Xuanhui Wang, Qiaozhu Mei, ChengXiang Zhai. [pdf] [BibTex]

    This paper proposes a new smoothing method for language modeling in information retrieval. Rather than smoothing with a static cluster structure of the collection, each document is expanded with its nearest neighbors. The expanded document provides a more accurate estimate of the document model and thus improves retrieval accuracy. This model is similar to one of the instantiations of the SIGIR 08 model using a document graph, with a single smoothing iteration.

  • (CIKM 05: ) Accurate Language Model Estimation with Document Expansion. Tao Tao, Xuanhui Wang, Qiaozhu Mei, ChengXiang Zhai. [BibTex]

    A shorter (and earlier) version of the HLT/NAACL paper.

Mining Frequent Patterns

  • (KDD 06: ) Generating Semantic Annotations for Frequent Patterns with Context Analysis. Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai. Runner-up Best Student Paper Award [pdf] [slides] [BibTex]

    A major problem of frequent pattern mining is that too many patterns are generated, while little work addresses whether they make sense, what they mean, and how to use them. These questions are hard to answer with a traditional frequent pattern mining system, where only shallow syntactic information (e.g., support) is presented. Besides shrinking the pattern set, another (unexplored) direction is to show the semantics of the extracted patterns to the user; a dictionary-like semantic annotation is desirable for this purpose. This paper shows how to generate context indicators, representative records, and synonym patterns for a given pattern. The basic idea is to apply information retrieval techniques and model the context of a frequent pattern with a vector space model; all components of the semantic annotation can then be generated through context analysis.
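    The context-analysis idea can be sketched with a tiny vector space model: represent each pattern by the items co-occurring with it, then score candidate synonym patterns by cosine similarity of their context vectors. The transactions and patterns below are toy data, and real annotation would weight the context features rather than use raw counts:

```python
import math
from collections import Counter

def context_vector(pattern, transactions):
    """Context of a pattern = the other items co-occurring with it in the
    transactions that contain the whole pattern."""
    ctx = Counter()
    for t in transactions:
        if pattern <= t:          # transaction contains the pattern
            ctx.update(t - pattern)
    return ctx

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v.get(w, 0) for w, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

transactions = [
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter", "jam"}),
    frozenset({"margarine", "bread", "milk"}),
]
# Patterns with similar contexts are candidate synonym patterns
sim = cosine(context_vector(frozenset({"butter"}), transactions),
             context_vector(frozenset({"margarine"}), transactions))
```

    A representative record for a pattern can be chosen the same way: the transaction whose item vector is most similar to the pattern's context vector.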

  • (TKDD 07: ) Semantic Annotation of Frequent Patterns. Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai. [pdf] [BibTex]

    A longer version of the KDD 06 paper, with new discussion of how to generate personalized pattern annotations. In such a model, the annotations of a frequent pattern are tailored to the user's domain knowledge.

  • (KDD 06: ) Discovering Interesting Patterns Through User's Interactive Feedback. Dong Xin, Xuehua Shen, Qiaozhu Mei, Jiawei Han. [pdf] [slides] [BibTex]

    This paper studies the problem of discovering interesting patterns through a user's interactive feedback. We assume a set of candidate patterns (i.e., frequent patterns) has already been mined; the goal is to help a particular user effectively discover the patterns that match his or her specific interest. The interestingness measure is automatically learned from the user's interactive feedback. This is another attempt to bridge information retrieval (relevance feedback) and database/data mining.

Search and Tagging

  • (CIKM 08: ) Query Suggestion Using Hitting Time. Qiaozhu Mei, Dengyong Zhou, Kenneth Church. [pdf] [slides] [BibTex]

    This work introduces a mathematical concept - hitting time - into web search. The main idea is to rank the other vertices of a graph given some vertices as input. Hitting time has several advantages over other random-walk-based ranking processes (e.g., personalized PageRank, absorption probability): 1) it considers the average proximity over many paths; 2) it considers the actual number of steps between objects; and 3) it boosts vertices with small degree (especially desirable for long-tail queries). The effectiveness of hitting time is shown in the scenario of query suggestion, using a query-click bipartite graph. We also explored other graphs (co-author graph, author-keyword graph, keyword co-occurrence graph) in the literature domain.
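    The ranking can be sketched by iterating the defining recurrence for hitting time - h(t) = 0 for the input vertices, h(v) = 1 + sum_u p(v, u) h(u) otherwise - on a small unweighted graph. The graph below is illustrative; the paper works on a much larger query-click bipartite graph:

```python
def hitting_times(adj, targets, iterations=200):
    """Expected number of random-walk steps from each vertex until first
    hitting any vertex in `targets`, by fixed-point iteration of
    h(v) = 1 + sum over neighbors u of p(v, u) * h(u), with h(t) = 0
    for target vertices. Assumes a uniform walk over adjacency lists."""
    h = {v: 0.0 for v in adj}
    for _ in range(iterations):
        h = {
            v: 0.0 if v in targets
            else 1.0 + sum(h[u] for u in nbrs) / len(nbrs)
            for v, nbrs in adj.items()
        }
    return h

# Toy path graph a - b - c; rank b and c by proximity to the input vertex a
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
h = hitting_times(adj, targets={"a"})
```

    Vertices are then ranked by ascending hitting time (smaller = closer to the input), so b would be suggested before c here.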

  • (WSDM 08: ) Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff? Qiaozhu Mei, Kenneth Church. [pdf] [slides] [video] [BibTex]

    This paper is partially related to contextual text mining. Its main focus is to introduce entropy analysis to 18 months of Live Search logs (which appears to be one of the largest, if not the largest, search log datasets in the literature). By computing joint entropies over the search variables (e.g., IP, Query, Url) one at a time, two at a time, and three at a time, many fundamental questions can be answered: How big is the web? How hard is search? How hard is query suggestion? With personalization? With backoff?

    Personalization is useful: it can potentially cut the difficulty of search in half. But what if there is no data about a user? Backing off to groups of users still helps. As a proof of concept, a personalization model with backoff is proposed, illustrated with an IP address as a user and the first k bytes of the IP address as a group of users. Results show that backing off to the first two or three bytes of the IP works better than either complete personalization or no personalization at all. More generally, one could back off to market segments and demographics (other contexts), such as day-of-week or hour-of-day.

  • (ADMA 08: ) Automatic Web Tagging and Person Tagging Using Language Models. Qiaozhu Mei, Yi Zhang. [pdf] [slides] [BibTex]

    This paper deals with the problem of automatically tagging web pages and web users using social bookmarking data. The basic idea is to generate labels for a distribution of tags estimated from Del.icio.us data. This is an application of the KDD 07 model to social bookmarking.

  • (UIUC 2919: ) Search and Tagging: Two Sides of the Same Coin? Qiaozhu Mei, Jing Jiang, Hang Su, ChengXiang Zhai. [pdf] [BibTex]

    This paper presents the duality hypothesis of search and tagging, two important behaviors of web users. The hypothesis states that if a user views a document D in the search results for query Q, the user would tend to assign document D a tag identical to or similar to Q; similarly, if a user tags a document D with a tag T, the user would tend to view document D if it is in the search results obtained using T as a query. We formalize this hypothesis with a unified probabilistic model for search and tagging, and show that empirical results of several tasks on search log and tag data sets, including ad hoc search, query suggestion, and query trend analysis, all support this duality hypothesis.

    This conclusion is important, since the availability of search logs is limited due to privacy concerns. This study (earlier than other work in the literature) opens up a highly promising direction: using tag data to approximate or supplement search log data for studying user behavior and improving search engine accuracy.

Scientific Literature Mining

  • (ACL 08: ) Generating Impact-Based Summaries for Scientific Literature. Qiaozhu Mei, ChengXiang Zhai. [pdf] [slides] [BibTex]

    This paper proposes a novel problem: generating an impact-based summary for a scientific paper. Such a summary differs from a common summary (e.g., an abstract) in that it reflects the impact of the paper on the literature after its publication. It could be used in a literature management system as a complement to the numerical impact factor.

    The basic idea is to cast the problem as sentence retrieval, where the query is an impact language model estimated from both the original document and the citation contexts (the sentences surrounding the citation label in other papers that cite this one). Specific features such as the authority of the citing paper and the proximity (to the citation label) of the citation context are considered.

  • (IPM 07: ) Generating Semi-Structured Gene Summaries from Biomedical Literature: A study of semi-structured summarization. Xu Ling, Jing Jiang, Xin He, Qiaozhu Mei, ChengXiang Zhai, Bruce Schatz. [Link] [BibTex]

    This is a longer version of the PSB paper, with a new exploration of contextual topic models. Comparisons are made across different models, such as the vector space model, language models, and topic models.

  • (PSB 06: ) Automatically Generating Gene Summaries from Biomedical Literature. Xu Ling, Jing Jiang, Xin He, Qiaozhu Mei, ChengXiang Zhai, Bruce Schatz. [pdf] [BibTex]

    This paper applies text information management techniques to bioinformatics, generating structured summaries for genes from the biology literature. Such a summary covers several aspects of a gene, such as its sequence information, mutant phenotypes, and molecular interactions with other genes. The automatically generated summaries save biologists from searching disparate biomedical literature to locate relevant articles, and from spending considerable effort reading the retrieved articles to find the most relevant knowledge about a gene. The summaries are not only directly useful to biologists but also serve as entry points that help them quickly digest the retrieved literature.

  • (KDD 05: ) Discovering Evolutionary Theme Patterns from Text - An Exploration of Temporal Text Mining. Qiaozhu Mei, ChengXiang Zhai. [pdf] [slides] [data] [BibTex]

    This paper deals with a novel problem - discovering evolutionary theme patterns from text, including a theme evolution graph as well as theme life cycles. The context information here is time. A "theme" in this paper is equivalent to a "topic" in the topic modeling literature. PLSA is used to extract themes; KL divergence is used to infer theme transitions; and an HMM is used to segment text with the extracted themes. The results summarize topic evolution and topic trends in news and scientific literature.

  • (WWW 08: ) Topic Modeling with Network Regularization. Qiaozhu Mei, Deng Cai, Duo Zhang, ChengXiang Zhai. [pdf] [slides] [BibTex]

    An effective model to combine text with social networks, with which one can extract topical communities as well as topic maps from scientific bibliography and literature data.

  • (CIKM 08: ) Query Suggestion Using Hitting Time. Qiaozhu Mei, Dengyong Zhou, Kenneth Church. [pdf] [slides] [BibTex]

    Hitting time has been applied to graphs extracted from the scientific literature (co-author graph, author-keyword graph, keyword co-occurrence graph). With such graphs, one can make keyword suggestions for a keyword, author suggestions for an author, and keyword suggestions for an author as the query.

Opinion Summarization

  • (WWW 06: ) A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. Qiaozhu Mei, Chao Liu, Hang Su, ChengXiang Zhai. [pdf] [slides] [BibTex]

    This paper is a novel exploration of topic models in blogs (likely the very first). The goal is to extract spatiotemporal theme patterns, which summarize how opinions change over time and across locations in blogs. The context here is time and geographic location. The basic idea is to add time and location variables into PLSA, which allows a document to sample topics according to either the time or the location. The results summarize the subtopics of social events (e.g., Hurricane Katrina) and their change over time and geography.

  • (WWW 07: ) Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. Qiaozhu Mei, Xu Ling, Mattew Wondra, Hang Su, ChengXiang Zhai. [pdf] [slides] [BibTex]

    This is a novel exploration of modeling topics and sentiments in a unified probabilistic model; no previous model could capture topics and sentiments simultaneously. With such a model, one can generate a table-like summary of facets and opinions, and monitor the dynamics of sentiments. This is also an exploration of sentiment as an implicit context, which turns out to be quite different from explicit contexts.

    Two particularly interesting findings in this paper: 1) PLSA can be supervised with prior p(w|topic) distributions, by changing the MLE in the M step into a MAP estimation. In this way a user can give guidance to the topic models, for example by plugging in a general sentiment model. 2) The training of the sentiment models is itself an exploration of domain-adaptive learning, or transfer learning: the more (diverse) domains in the training data, the better the learned sentiment models fit sentiments in a new domain. We used a training dataset of sentiment-labeled sentences from ten different domains, collected from OPINMIND, a blog search engine that no longer exists.

  • (KDD 08: )Mining Multi-Faceted Overviews of Arbitrary Topics in a Text Collection. Xu Ling, Qiaozhu Mei, ChengXiang Zhai, Bruce Schatz. [pdf] [slides] [BibTex]

    This is an application of contextual text mining models to generating multi-faceted overviews from text. The key issue in this paper is how to make the process interactive - exploiting guidance from the user and generating the overview the way the user favors.

See here for all others.

Back to Homepage