  Vahed Qazvinian

Datasets

  • The ACL Anthology Network (AAN) [Download]
    • The AAN corpus includes three networks constructed from the ACL Anthology data: paper citation, author citation, and author collaboration. It also includes the abstracts, full texts, and citation sentences of the ACL Anthology papers.

  • Diversity in Collective Discourse
    • 25 sets of citations and 25 sets of news headlines.
    • Each dataset has a "*.txt" file that has 1 summary per line, and a "*.ann" file that has lines of the following format: < factoid id > < tab > < nugget >
    • To detect which nuggets/facts a citation contains, one should perform basic string matching.
    • For an extensive analysis, see (Qazvinian and Radev 2011).

  • Single Paper Summarization (Release 2010)
    • Citations to 25 highly cited papers from 5 different domains: Text Summarization, Question Answering, Machine Translation, Textual Entailment, and Dependency Parsing.
    • Each dataset has a "*.txt" file that has 1 citation per line, and a "*.ann" file that has lines of the following format: < fact id > < tab > < nugget >
    • To detect which nuggets/facts a citation contains, one should perform basic string matching.
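
The "*.txt" / "*.ann" layout shared by the two datasets above can be read and matched with a few lines of Python. This is only a minimal sketch of basic string matching, not the author's own tooling: the function names (parse_ann, facts_in), the toy data, the case-insensitive comparison, and the possibility that one fact id appears on several lines are all assumptions made for illustration.

```python
from collections import defaultdict


def parse_ann(lines):
    """Map each fact id to its list of nugget strings.

    Assumes one "<fact id> TAB <nugget>" pair per line; a fact id may repeat.
    """
    nuggets = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        fact_id, sep, nugget = line.partition("\t")
        if not sep or not nugget.strip():
            continue  # skip blank lines and lines without a tab or nugget
        nuggets[fact_id.strip()].append(nugget.strip())
    return dict(nuggets)


def facts_in(text, nuggets):
    """Return the set of fact ids with at least one nugget occurring in text."""
    lowered = text.lower()  # assumption: matching ignores case
    return {
        fact_id
        for fact_id, phrases in nuggets.items()
        if any(phrase.lower() in lowered for phrase in phrases)
    }


# Toy in-memory data standing in for a "*.ann" and a "*.txt" file.
ann_lines = [
    "f1\tdependency parser\n",
    "f2\tmaximum spanning tree\n",
    "f1\tdependency parsing\n",
]
txt_lines = [
    "They cast dependency parsing as a maximum spanning tree search.\n",
    "We use their parser as a baseline.\n",
]

nuggets = parse_ann(ann_lines)
matches = [facts_in(line, nuggets) for line in txt_lines]
for line, found in zip(txt_lines, matches):
    print(sorted(found), line.strip())
```

With real data, the two lists would come from reading the "*.ann" and "*.txt" files line by line, for example with open(path, encoding="utf-8").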

  • Survey Generation (as explained in Mohammad et al. 2009)
    • 10 QA papers, 16 DP papers.
    • Annotated citations, abstracts, and full papers.
    • A number of human-written surveys (250 words long) for each topic.
    • More details in Mohammad et al. 2009.
    • Use "detect_nuggets.pl" to evaluate a given summary using the nugget-based pyramid score.
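
For readers unfamiliar with pyramid scoring, the sketch below shows the general idea in Python. It is not a port of "detect_nuggets.pl", whose exact behavior is not documented here. It assumes the common formulation in which a fact's weight is the number of reference texts that contain it, and a summary's score is the weight it covers divided by the best weight achievable with the same number of facts. The function name pyramid_score and the toy fact sets are hypothetical.

```python
from collections import Counter


def pyramid_score(summary_facts, reference_fact_sets):
    """Covered fact weight divided by the best weight achievable
    with the same number of facts (0.0 when nothing can be scored)."""
    # Weight of a fact = number of reference texts that contain it.
    weights = Counter(f for ref in reference_fact_sets for f in set(ref))
    facts = set(summary_facts)
    observed = sum(weights[f] for f in facts)
    # Ideal summary: the same number of facts, taken from the heaviest ones.
    ideal = sum(sorted(weights.values(), reverse=True)[:len(facts)])
    return observed / ideal if ideal else 0.0


# Toy fact sets; in practice these would come from nugget detection.
references = [{"f1", "f2"}, {"f1", "f3"}, {"f1"}]
print(pyramid_score({"f1", "f2"}, references))
print(pyramid_score({"f2", "f3"}, references))
```

Here f1 has weight 3 and f2 and f3 have weight 1 each, so a two-fact summary can cover at most a weight of 4.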