Research Interests

I do research in natural language processing, linguistics, and machine learning. Specifically, I am interested in NLP methods for minority and resource-poor languages, though I'm happy to work on just about any interesting NLP problem. If you're looking for job application material, it can be found here.

Much of the research from the papers below and from my (not yet finished) thesis is implemented in the Minority Language Server, a system I've built that crawls the Web looking for instances of minority language text and uses that text to keep improving its NLP for minority languages. Everything it learns is available for download so that other researchers can make use of it as well.

Papers

Practical Natural Language Processing for Minority Languages [PDF]

Ben King

Ph.D. Thesis

Experiments in Sentence Language Identification with Groups of Similar Languages [PDF] [abstract]

Ben King, Dragomir Radev, and Steven Abney

VarDial workshop 2014.

Language identification is a simple problem that becomes much more difficult when its usual assumptions are broken. In this paper we consider the task of classifying short segments of text in closely-related languages for the Discriminating Similar Languages shared task, which is broken into six subtasks, (A) Bosnian, Croatian, and Serbian, (B) Indonesian and Malay, (C) Czech and Slovak, (D) Brazilian and European Portuguese, (E) Argentinian and Peninsular Spanish, and (F) American and British English. We consider a number of different methods to boost classification performance, such as feature selection and data filtering, but we ultimately find that a simple naive Bayes classifier using character and word n-gram features is a strong baseline that is difficult to improve on, achieving an average accuracy of 0.8746 across the six tasks.

Heterogeneous Networks and Their Applications: Scientometrics, Name Disambiguation, and Topic Modeling [PDF] [abstract]

Ben King, Rahul Jha, and Dragomir Radev

TACL (2013).

We present heterogeneous networks as a way to unify lexical networks with relational data. We build a unified ACL Anthology network, tying together the citation, author collaboration, and term co-occurrence networks with affiliation and venue relations. This representation proves to be convenient and allows problems such as name disambiguation, topic modeling, and the measurement of scientific impact to be easily solved using only this network and off-the-shelf graph algorithms.

Identifying Opinion Subgroups in Arabic Online Discussions [PDF] [abstract]

Amjad Abu-Jbara, Ben King, Mona Diab, and Dragomir Radev

ACL 2013.

In this paper, we use Arabic natural language processing techniques to analyze Arabic debates. The goal is to identify how the participants in a discussion split into subgroups with contrasting opinions. The members of each subgroup share the same opinion with respect to the discussion topic and an opposing opinion to the members of other subgroups. We use opinion mining techniques to identify opinion expressions and determine their polarities and their targets. We use opinion predictions to represent the discussion in one of two formal representations: a signed attitude network or a space of attitude vectors. We identify opinion subgroups by partitioning the signed network representation or by clustering the vector space representation. We evaluate the system using a data set of labeled discussions and show that it achieves good results.

Random Walk Factoid Annotation for Collective Discourse [PDF] [abstract]

Ben King, Rahul Jha, and Dragomir Radev

ACL 2013.

In this paper, we study the problem of automatically annotating the factoids present in collective discourse. Factoids are information units that are shared between instances of collective discourse and may have many different ways of being realized in words. Our approach divides this problem into two steps, using a graph-based approach for each step: (1) factoid discovery, finding groups of words that correspond to the same factoid, and (2) factoid assignment, using these groups of words to mark collective discourse units that contain the respective factoids. We study this on two novel data sets: the New Yorker caption contest data set, and the crossword clues data set.

Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods [bib] [PDF] [slides] [data] [code] [video] [abstract]

Ben King, Steven Abney

NAACL 2013.

In this paper we consider the problem of labeling the languages of words in mixed-language documents. This problem is approached in a weakly supervised fashion, as a sequence labeling problem with monolingual text samples for training data. Among the approaches evaluated, a conditional random field model trained with generalized expectation criteria was the most accurate and performed consistently as the amount of training data was varied.

Bilingual Terminology Translation in Scientific Literature using Multilingual Structural Clues [PDF] [abstract]

Ben King

Preliminary Examination Report

This is the paper I presented for my preliminary examination. Using a collection of papers from the Spanish journal SEPLN, I extracted sections of the papers (titles, abstracts, full text, bibliographies, etc.), many of which, it turns out, are published in English even when other parts of the paper are in Spanish. These multilingual resources contain a lot of clues that can be used to help choose correct translations of terminology.

Cengage Learning at TREC 2011 Medical Track [bib] [PDF] [abstract]

Ben King, Lijun Wang, Ivan Provalov, Jerry Zhou

TREC 2011.

This paper details Cengage Learning’s submissions for this year’s TREC medical track. The techniques we used fall roughly into two categories: information extraction and query expansion. From both the queries and the medical reports, we extracted limiting attributes, such as age, race, and gender, and labeled terms appearing in the Unified Medical Language System (UMLS). We also used three different techniques of query expansion: UMLS related terms, terms from a network built from UMLS, and terms from our medical reference encyclopedias. We submitted four different runs varying only in their methods of query expansion.

Cengage Learning at TREC 2010 Session Track [bib] [PDF] [abstract]

Ben King, Ivan Provalov

TREC 2010.

This paper details Cengage Learning’s TREC 2010 Session track submission and our efforts to improve retrieval performance over a user’s session. We use a number of different techniques to achieve this goal, including query term weighting, query expansion, and re-ranking. In this paper we detail these techniques and the results of our submission. Using our query term weighting technique combined with our corpus term collocation query expansion, we were able to achieve 0.2375 for the nsDCG@10.RL13 metric.

Resources

Mixed Language Document Corpus v1.0 [gzipped tarball] [description]

This is the corpus of mixed-language documents used in "Labeling the Languages of Words in Mixed-Language Documents Using Weakly Supervised Methods". It contains more than 250,000 words in 31 different languages. Each document has all of its words annotated according to language.

Text Generation Toolkit v1.0 [gzipped tarball]

Ben King

Ph.D. candidate in Computer Science at the University of Michigan.
