## TREC 2014 Web Track Guidelines

Kevyn Collins-Thompson, University of Michigan
Paul N. Bennett, Microsoft Research
Fernando Diaz, Microsoft Research
Craig Macdonald, University of Glasgow
Ellen Voorhees (NIST Contact)

Last update: June 26, 2014 by KevynCT.

Welcome to the TREC 2014 Web Track. Our goal is to explore and evaluate Web retrieval technologies that are both effective and reliable. As with last year's Web track, we will use the 870-million page ClueWeb12 Dataset. The Web track will continue the ad-hoc retrieval tasks from 2009-2013.

We assume you arrived at this page because you're participating in this year's TREC conference. If not, you should start at the TREC main page.

If you're new to the TREC Web Track, you may want to start by reading the track overview papers from TREC 2009, TREC 2010, TREC 2011, TREC 2012, and TREC 2013. See last year's Web track info at the TREC website.
Last year's resources are available here via github.

If you're planning to participate in the track, you should be on the track mailing list. If you're not on the list, send a mail message to listproc (at) nist (dot) gov such that the body consists of the line "subscribe trec-web FirstName LastName".

### Timetable

The current schedule is:

### Overview

Web Tracks at TREC have explored specific aspects of Web retrieval, including named page finding, topic distillation, and traditional adhoc retrieval. The traditional adhoc task will be retained for TREC 2014. Previous tracks starting in 2009 also included a diversity task whose goal was to return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the result list. Aspects of the diversity task, such as including queries with multiple subtopics, will be preserved for 2014. However, the diversity task itself was replaced last year with a new risk-sensitive task, which will continue in 2014, that takes a different evaluation viewpoint but shares many of the same aims. The overall goal of the risk-sensitive task is to explore algorithms and evaluation methods for systems that try to jointly maximize an average effectiveness measure across queries, while minimizing effectiveness losses with respect to a baseline. Retrieval diversity among subtopics can be seen as one strategy for achieving this goal: retrieving as much highly relevant material as possible while avoiding the effectiveness losses associated with focusing too heavily on results for only a minority of users.

The adhoc and risk tasks share topics, which will be developed with the assistance of information extracted from the logs of commercial Web search engines. Topic creation and judging will attempt to reflect important characteristics of authentic Web search queries. Like last year, there will be a mixture of both broad and specific query intents reflected in the topics. The broad topics will retain the multiple-subtopic paradigm used in last year's Web track, while the specific topics will reflect a single, more focused intent/subtopic. See below for example topics.

### Document Collection

For 2014 we are continuing to use the ClueWeb12 dataset as our document collection. The full collection comprises about 870 million web pages, collected between February 10, 2012 and May 10, 2012. TREC 2014 will use version 1.1 of ClueWeb12 (which fixes a duplicate document problem in v1.0). Further information regarding the collection can be found on the associated web site. Since it can take several weeks to obtain the dataset, we urge you to start this process as soon as you can. The collection will be shipped to you on two 3.0 TB hard disks at an expected cost of US$430 plus shipping charges.

As with the previous ClueWeb09 collection, if you are unable to work with the full ClueWeb12 dataset, we will accept runs over the smaller ClueWeb12 "Category B" dataset (called ClueWeb12-B13), but we strongly encourage you to use the full dataset if you can. The ClueWeb12-B13 dataset is a subset of about 50 million English-language pages. It can also be ordered through the ClueWeb12 web site, and will be shipped to you on a single 500 GB hard disk at an expected cost of US$180 plus shipping charges.

Note that the Lemur Project also provides several online services to simplify use of the ClueWeb12 dataset, such as batch or interactive search of ClueWeb12 using the Indri search engine. Some of these services require a user name and password. If your organization has a license to use the ClueWeb12 dataset, you can obtain a user name and password. Details are available on the ClueWeb12 website.

#### Extra Resources

The following resources have been made available to augment the base ClueWeb12 collection.

Mark Smucker is providing spam scores for ClueWeb12.
Djoerd Hiemstra is providing anchor text for ClueWeb12.

At least one adhoc run from each group will be judged by NIST assessors, with priority given based on the ordering you give at submission (i.e. the first run will be given top priority). Each document will be judged on a six-point scale, as follows:

• Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.
• Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
• HRel: The content of this page provides substantial information on the topic.
• Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
• Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
• Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk.

All topics are expressed in English. Non-English documents will be judged non-relevant, even if the assessor understands the language of the document and the document would be relevant in that language. If the location of the user matters, the assessor will assume that the user is located in Gaithersburg, Maryland.

The primary effectiveness measure will be intent-aware expected reciprocal rank (ERR-IA) which is a variant of ERR as defined by Chapelle et al. (CIKM 2009). For single-facet queries, ERR-IA simply becomes ERR. In addition to ERR and ERR-IA, we will compute and report a range of standard measures, including MAP, precision@10 and NDCG@10.
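As a rough illustration of the base measure, the following Python sketch computes ERR over the top k results of a single ranked list, following the cascade formulation of Chapelle et al. The function name `err_at_k`, the integer grade scale 0..`max_grade`, and the treatment of negative grades as 0 are assumptions made for this example; they are not taken from the official evaluation tools, whose output should be treated as authoritative. ERR-IA additionally combines per-subtopic ERR values, which is not shown here.

```python
from typing import Sequence


def err_at_k(grades: Sequence[int], max_grade: int = 4, k: int = 20) -> float:
    """Expected reciprocal rank over the top-k results of one ranked list.

    grades[i] is the integer relevance grade of the document at rank i + 1.
    The 0..max_grade integer scale is an assumption made for this sketch.
    """
    err = 0.0
    p_continue = 1.0  # probability the user has not stopped before this rank
    for rank, grade in enumerate(grades[:k], start=1):
        # Probability that the document at this rank satisfies the user;
        # negative grades are treated as 0 (assumption).
        p_satisfied = (2 ** max(grade, 0) - 1) / (2 ** max_grade)
        err += p_continue * p_satisfied / rank
        p_continue *= 1.0 - p_satisfied
    return err


if __name__ == "__main__":
    # A hypothetical ranking: a top-grade page at rank 2, weaker pages elsewhere.
    print(err_at_k([1, 4, 0, 2]))
```

Because each rank's contribution is discounted by the probability that an earlier result already satisfied the user, highly relevant documents placed early dominate the score.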

You may submit up to three runs for the adhoc task; at least one will be judged. NIST may judge additional runs per group depending upon available assessor time. During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will choose the run to assess arbitrarily.

The format for submissions is given in a separate section below. Each topic must have at least one document retrieved for it. While many of the evaluation measures used in this track consider only the top 10-20 documents, some methods for estimating MAP sample at deeper levels, and we request that you return the top 10,000 to aid in this process. You may return fewer than 10,000 documents for a topic. However, you cannot hurt your score, and could conceivably improve it, by returning 10,000 documents per topic. All the evaluation measures used in the track count empty ranks as not relevant (Non).

The risk-sensitive retrieval task uses the same topics as the ad-hoc task, but is based on new evaluation measures that are relative to a provided baseline run. The goal of the risk-sensitive task is to provide a ranked list of pages that both maximizes the return of relevant documents, and minimizes retrieval losses with respect to a baseline run. By retrieval loss, we mean the outcome where a system provides a ranking that has lower retrieval effectiveness than the baseline retrieval effectiveness for that query. Thus, the 'risk' involved in the risk-sensitive task refers to the undesirable outcome of doing worse than the baseline for any particular query.

Typically, the baseline will perform well on some queries but not others. After running a number of queries, your system's results will have a distribution of wins and losses relative to the given baseline. The risk-sensitive evaluation measures we use are derived from the properties of this win/loss distribution. The evaluation methods in the risk-sensitive task will reward systems that can simultaneously achieve (a) high average effectiveness per query; (b) minimal losses with respect to a baseline; and (c) wins affecting a higher proportion of overall queries than other systems.

We believe the risk-sensitive task continues to be of broad interest to the IR community since techniques from a wide variety of research areas could be applicable, including robust query expansion and pseudo-relevance feedback; fusion and diversity-oriented ranking; using query performance prediction to select which baseline results to keep or modify; learning-to-rank models that optimize both effectiveness and robustness; and others. New for the 2014 Web Track, participants will be able to submit absolute and relative query performance predictions for each topic as part of the risk-sensitive retrieval task.

Part of the goal of the TREC 2014 Web track is to understand the nature of risk-reward tradeoffs achievable by a system that can adapt to different baselines, so for 2014 we are supplying two baseline runs from different IR systems (instead of the single baseline used in 2013). We are providing sample scripts that take as input (a) your risk-sensitive run, (b) a baseline run, and (c) a risk parameter, and output the risk-sensitive retrieval metrics that your system should optimize for, as described next.

#### Evaluation measures

All data and tools described below will be available on github, in the public repository. This year's tools and baseline data are available at: https://github.com/trec-web/trec-web-2014

As with the adhoc task, we will use Intent-Aware Expected Reciprocal Rank (ERR-IA) as the basic measure of retrieval effectiveness. The per-query retrieval delta for a given baseline is defined as the absolute (that is, signed and unnormalized) difference in effectiveness between your contributed run and the given baseline run, for a given query. A positive delta means a win for your system on that query, and a negative delta means a loss. We will also report other flavors of the risk-related measure based on NDCG and other standard effectiveness measures.

Over single runs, one primary risk measure we will report is the probability of failure per topic, where failure is simply defined as any negative retrieval delta with respect to the given baseline. We will also report measures based on more detailed properties of the distribution of results for a given system, such as how the mass of the win-loss distribution is spread across all queries, and not merely the overall probability of failure. One such measure will be the expected shortfall of a system's results at a given failure level: we will focus on the average retrieval loss over the worst 25% of failures, but we will also report across a range of percentile levels. (For runs with no failures, expected shortfall is zero.)

For single runs, the following will be the main risk-sensitive evaluation measure. Let Δ(q) = R_A(q) - R_BASE(q) be the absolute win or loss for query q with system retrieval effectiveness R_A(q) relative to a given baseline's effectiveness R_BASE(q) for the same query. We categorize the outcome for each query q in the set Q of all N queries according to the sign of Δ(q), giving three categories:

Hurt Queries (Q-) have Δ(q)<0
Unchanged Queries (Q0) have Δ(q)=0
Improved Queries (Q+) have Δ(q)>0
The risk-sensitive utility measure U_RISK(Q) of a system over the set of queries Q is defined as:

U_{RISK}(Q) = \frac{1}{N} \left[ \sum_{q \in Q_{+}} \Delta(q) - (\alpha + 1) \sum_{q \in Q_{-}} |\Delta(q)| \right] \qquad (1)

where α is a risk-aversion parameter. In words, this rewards systems that maximize average effectiveness, but also penalizes losses relative to the baseline results for the same query, weighting losses α+1 times as heavily as successes. When the risk aversion parameter α is large, a system will become more conservative and put more emphasis on avoiding large losses relative to the baseline. When α is small, a system will tend to ignore the baseline. The adhoc task objective, maximizing only average effectiveness across queries, corresponds to the special case α = 0.
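The utility measure can be sketched in a few lines of Python. This is an illustrative reading of the definition rather than the official scoring code: the function name `u_risk`, the dictionary inputs keyed by topic, and the choice to score a topic missing from the run as 0 are all assumptions made for this example. Losses are handled by magnitude, so each hurt query lowers the utility by α+1 times the size of its loss.

```python
from typing import Dict


def u_risk(run: Dict[str, float], baseline: Dict[str, float], alpha: float) -> float:
    """Risk-sensitive utility of a run against a baseline, over the baseline's topics."""
    if not baseline:
        raise ValueError("baseline must contain at least one topic")
    total = 0.0
    for topic, base_score in baseline.items():
        # A topic missing from the run is scored as 0 (assumption).
        delta = run.get(topic, 0.0) - base_score
        if delta > 0:
            total += delta  # win: counted once
        elif delta < 0:
            total -= (alpha + 1.0) * abs(delta)  # loss: weighted alpha + 1 times
    return total / len(baseline)


if __name__ == "__main__":
    # Hypothetical per-topic effectiveness values (e.g. ERR-IA@20).
    run_scores = {"1": 0.50, "2": 0.20, "3": 0.30}
    base_scores = {"1": 0.30, "2": 0.40, "3": 0.30}
    for a in (0, 1, 5):
        print(a, u_risk(run_scores, base_scores, alpha=a))
```

With α = 0 the value reduces to the mean per-topic difference from the baseline; raising α leaves the wins untouched and scales up only the penalty for the hurt topic.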

We will also report a ratio-based version of Eq. 1, U_{RISKRATIO}, which defines

U_{RISKRATIO}(q) = \frac{R_A (q)}{R_{BASE}(q)}.

This version gives more weight to more difficult queries, in inverse proportion to baseline effectiveness. Since part of the goal of the risk-sensitive task is to explore evaluation measures that are sensitive to failures, alternate statistics will also be computed that summarize different properties of the win-loss distribution. For example, the ratio of geometric mean to arithmetic mean of wins and losses is one widely-used dispersion measure related to the previous use of geometric mean in the Robust track.
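Two of the win-loss statistics described earlier in this section, the probability of failure and the expected shortfall over the worst failures, can be sketched as follows. The function names and the rounding rule (the worst 25% of failures is taken as the ceiling of 0.25 times the number of failing topics, with at least one topic) are assumptions for this example; the official tools may define the percentile cut differently.

```python
import math
from typing import List


def failure_probability(deltas: List[float]) -> float:
    """Fraction of topics on which the run scored below the baseline."""
    return sum(1 for d in deltas if d < 0) / len(deltas)


def expected_shortfall(deltas: List[float], fraction: float = 0.25) -> float:
    """Mean loss magnitude over the worst `fraction` of failing topics."""
    # Keep only failures, as positive magnitudes, largest loss first.
    losses = sorted((-d for d in deltas if d < 0), reverse=True)
    if not losses:
        return 0.0  # a run with no failures has zero expected shortfall
    # Rounding up to a whole number of topics is an assumption of this sketch.
    worst = losses[: max(1, math.ceil(fraction * len(losses)))]
    return sum(worst) / len(worst)


if __name__ == "__main__":
    # Hypothetical per-topic deltas against a baseline.
    deltas = [0.2, -0.1, -0.4, 0.0, -0.2, -0.3]
    print(failure_probability(deltas))
    print(expected_shortfall(deltas))
```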

Finally, we will evaluate a system's overall risk-sensitive performance by reporting a combined measure U_RISK(Q*), where Q* is the combined pool of query results from both the Indri and Terrier baselines (i.e., Q* will have 100 results: 50 topics x 2 baselines).

##### Evaluation tools
We provide updated versions of standard TREC evaluation tools that compute risk-sensitive versions of retrieval effectiveness measures, based on a baseline run. These can be found in the trec-web-2014 github repository in the src/eval directory. There are two evaluation programs: ndeval, a C program which can be compiled to an executable, and gdeval, which is written in Perl. The difference from older versions of those tools is a new baseline parameter which, if supplied, causes the tool to compute the risk-sensitive evaluation measure using the (also new) alpha parameter you provide.

To use ndeval, first build the executable using 'make' with the provided Makefile. ndeval requires a qrels.txt file, which contains the relevance judgements available from NIST. For this year, you can use the qrels file from TREC 2012 (resp. TREC 2013), and the TREC 2012 (resp. 2013) baselines provided by us for training. To use measures that are backwards-compatible with pre-2013 Web tracks, simply omit the baseline. For example:

    $ ./ndeval -c -traditional qrels.txt trec-format-run-to-evaluate.txt > normal-nd-evaluation.txt

For risk-sensitive measures, add a -baseline file and a -riskAlpha setting. Remember that the final risk weight is 1 + riskAlpha, i.e. riskAlpha = 0 corresponds to having no increased weight penalty for errors relative to the baseline and simply reports differences from the baseline. To evaluate with an increased weight on errors relative to the baseline, you could run for example:

    $ ./ndeval -c -traditional -baseline trec-format-baseline-run.txt -riskAlpha 1 qrels.txt trec-format-test-run.txt > risk-sensitive-nd-evaluation.txt

Usage of gdeval is similar, with new -baseline and -riskAlpha parameters. For backwards-compatible evaluation with pre-2013 Web tracks:

    $ ./gdeval.pl -c qrels.txt trec-format-test-run.txt > normal-gd-evaluation.txt

To do a risk-sensitive evaluation with an increased weight on errors relative to the baseline, you could then do for example:

    $ ./gdeval.pl -c -riskAlpha 1 -baseline trec-format-baseline-run.txt qrels.txt trec-format-test-run.txt > risk-sensitive-gd-evaluation.txt
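If you want to sweep several values of riskAlpha, it may be convenient to drive ndeval from a script. The sketch below only assembles the command lines shown above, using the same flags (-c, -traditional, -baseline, -riskAlpha); the helper names `ndeval_command` and `run_ndeval` and the output file naming are hypothetical, and the executable is assumed to have been built in the current directory.

```python
import subprocess
from typing import List, Optional


def ndeval_command(qrels: str, run: str, baseline: Optional[str] = None,
                   risk_alpha: Optional[float] = None) -> List[str]:
    """Build an ndeval argument list using only the flags shown in the examples."""
    cmd = ["./ndeval", "-c", "-traditional"]
    if baseline is not None:
        cmd += ["-baseline", baseline]
        if risk_alpha is not None:
            cmd += ["-riskAlpha", str(risk_alpha)]
    cmd += [qrels, run]
    return cmd


def run_ndeval(cmd: List[str], output_path: str) -> None:
    """Run ndeval and write its standard output to output_path."""
    with open(output_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    # Print the commands for a few risk settings instead of executing them.
    for alpha in (0, 1, 5):
        print(" ".join(ndeval_command("qrels.txt", "trec-format-test-run.txt",
                                      baseline="trec-format-baseline-run.txt",
                                      risk_alpha=alpha)))
```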


#### Runs

All data and tools described below are available on github, in the public repository: https://github.com/trec-web/trec-web-2014

For the risk-sensitive task, we provide two baseline runs, comprising the top 10000 results for two particular choices of easily reproducible baseline system. The risk-sensitive retrieval performance of submitted systems will be measured against both of these baselines. This year, the two baselines will be provided using the (a) Indri and (b) Terrier retrieval engines, with specific default settings as provided by the respective services.

##### Training data
For training purposes, systems can work with previous years' adhoc topics using ClueWeb09 (or ClueWeb12 in the case of the 2013 Web Track). To aid this process we provide:
a) Baseline runs over TREC Web Track 2012 and 2013 topics and collections, using both Indri and Terrier systems.
b) An updated version of standard TREC evaluation tools that compute risk-sensitive versions of retrieval effectiveness measures, based on a baseline run.

Our ClueWeb09 training baselines use TREC 2012 Web track topics, with retrieval using the same Indri and Terrier settings as for the ClueWeb12 baselines. For Indri, this baseline was computed using the Indri search engine, using its default query expansion based on pseudo-relevance feedback, with results filtered using the Waterloo spam filter. For Terrier, the baseline files will be computed in a similar way.

The github file with the TREC 2012 baseline runs can be found off the trec-web-2014 root at:

/data/runs/baselines/2012/rm/results-cata-filtered.txt
(ClueWeb09 full index, indri default relevance model results, spam-filtered)

For comparison we also provide other flavors of training baseline for CatB, and without spam filtering, in the same directory.
We have also included simple query likelihood runs that do not use query expansion in

/runs/baselines/2012/ql

The 2013 training baseline runs can be found in the github repository in the file:

data/runs/baselines/2013/rm/results-cata-filtered.txt
(ClueWeb12 full index, indri default relevance model results, spam-filtered)

The 2013 Indri training baseline was computed using the 2013 Web track topics and ClueWeb12 collection, using exactly the same retrieval method as the 2012 training baseline: namely, the Indri search engine with default query expansion based on pseudo-relevance feedback, with results filtered using the Waterloo spam filter.

As we did for the 2012 training baselines above, we have provided alternative variants of the 2013 training baselines, in case you want to compare runs or explore using different sources of evidence. The naming convention is the same as that used for the 2012 training files above. However, these variants will *not* be used for evaluation: the results-cata-filtered.txt run above is the only official test baseline.

##### Run submission

To evaluate the quality of risk/reward tradeoffs a system can achieve for different baselines, we require participants to provide three risk-sensitive runs. Your three runs must correspond to optimizing retrieval for the 50 test topics relative to three different baselines:

IMPORTANT: For final evaluation, all runs should be optimized to use the risk-sensitive measure in Eq. 1 with α = 5. The underlying retrieval approach for these submissions should be the same one that produced your top-ranked submission for the ad-hoc task. That is, your top-ranked ad-hoc run should ideally correspond to using α=0. However, depending on your approach, the risk-sensitive runs might use different thresholds, parameter settings, or trained models. We strongly discourage submitting identical runs for all baselines: part of the goal of this track is to reward systems that can adapt to different baselines. However, we will not disqualify your entry if you do this. Risk-sensitive runs do not need to be a re-ranking of the baseline run and can initiate new retrieval.
##### Query performance prediction

New for 2014, participants may submit, for each of the three risk-sensitive runs above, an optional query performance prediction (QPP) file. The QPP submission file should have the following tab-delimited columns that contain values for the following fields, one topic per line for each of the 50 topics:

| Field | Description |
| --- | --- |
| Topic_ID | TREC topic number |
| Baseline_QPP_Score | Participant-defined prediction score for the absolute effectiveness of the results for the baseline used for the risk-sensitive run. |
| RiskRun_QPP_Score | Participant-defined prediction score for the absolute effectiveness of the results for the risk-sensitive run. |
| Relative_QPP_Score | Participant-defined relative gain or loss prediction score: the difference in effectiveness between the risk-sensitive run and the baseline run. |

You can submit values either (1) for absolute effectiveness prediction, in the two absolute score columns (Baseline_QPP_Score and RiskRun_QPP_Score); (2) for relative prediction, in the relative score column; or (3) for both absolute and relative QPP, by filling in all three columns. (That's the reason we included the Relative_QPP_Score column.) Participants can define their own prediction scores: the one requirement is that higher/lower absolute prediction scores should correspond to higher/lower ERR@20 for each topic, and for relative prediction scores, positive/negative relative scores should correspond to wins/losses respectively, for the risk-sensitive run against the baseline run by ERR@20. The accuracy of your QPP files for the absolute (resp. relative) predictions will be measured by the rank correlation of your QPP values with the ERR-IA@20 values (resp. differences in ERR-IA@20 values) across topics.
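To check predictions against training data before submitting, you could compute a rank correlation yourself. The text above does not say which rank correlation coefficient will be used, so the sketch below uses Kendall's tau (the tau-a variant, with tied pairs counted as neither concordant nor discordant) purely as an assumption. The helper `qpp_line` shows one way to format a row with all three prediction columns filled in; both function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Kendall's tau-a between two equally long per-topic score lists."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two lists of equal length with at least two topics")
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        sign = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs


def qpp_line(topic_id: int, baseline_score: float, risk_run_score: float,
             relative_score: float) -> str:
    """One tab-delimited QPP row with all three prediction columns filled in."""
    return f"{topic_id}\t{baseline_score}\t{risk_run_score}\t{relative_score}"


if __name__ == "__main__":
    # Hypothetical predictions and measured effectiveness for four topics.
    predictions = [0.9, 0.1, 0.5, 0.3]
    measured = [0.42, 0.05, 0.11, 0.20]
    print(kendall_tau(predictions, measured))
```

Since only the ordering of topics matters for a rank correlation, prediction scores do not need to be calibrated to the scale of the effectiveness measure.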

### Topic structure

As in TREC 2013, the TREC 2014 Web track will include a significant proportion of focused topics designed to represent more specific, less frequent, possibly more difficult queries. To retain the Web flavor of queries in this Track, we retain the notion from last year that some topics may be multi-faceted, i.e. broader in intent and thus structured as a representative set of subtopics, each related to a different potential aspect of user need. Examples are provided below. For topics with multiple subtopics, documents will be judged with respect to the subtopics. For each subtopic, NIST assessors will make a scaled six-point judgment as to whether or not the document satisfies the information need associated with the subtopic. For those topics with multiple subtopics, the set of subtopics is intended to be representative, not exhaustive. We expect each multi-intent topic to contain 4-10 subtopics.

Topics will be fully defined by NIST in advance of topic release, but only the query field will be initially released. Detailed topics will be released only after runs have been submitted. Subtopics will be based on information extracted from the logs of a commercial search engine. Topics having multiple subtopics will have subtopics roughly balanced in terms of popularity. Strange and unusual interpretations and aspects will be avoided as much as possible.

In all other respects, the risk-sensitive task is identical to the adhoc task. The same 50 topics will be used. The submission format is the same. The top 10,000 documents should be submitted.

The topic structure will be similar to that used for the TREC 2009 topics. The topics below provide examples.

Single-facet topic examples:

<topic number="1" type="faceted">
<query>feta cheese preservatives</query>
<description>Find information on which substances are used to extend the shelf life of feta cheese.
</description>
<subtopic number="1" type="inf">
Find information on which substances are used to extend the shelf life of feta cheese.
</subtopic>
</topic>

<topic number="2" type="faceted">
<query>georgia state university admissions yield</query>
<description>Find information on what percentage of students decide to attend Georgia State University after being admitted, as well as recent trends and discussion on this statistic.
</description>
<subtopic number="1" type="inf">
Find information on what percentage of students decide to attend Georgia State University after being admitted, as well as recent trends and discussion on this statistic.
</subtopic>
</topic>

Multi-facet topic example:
<topic number="16" type="faceted">
<query>arizona game and fish</query>
<description>I'm looking for information about fishing and hunting
in Arizona.
</description>

<subtopic number="1" type="nav">
Take me to the Arizona Game and Fish Department homepage.
</subtopic>
<subtopic number="2" type="inf">
What are the regulations for hunting and fishing in Arizona?
</subtopic>
<subtopic number="3" type="nav">
I'm looking for the Arizona Fishing Report site.
</subtopic>
<subtopic number="4" type="inf">
I'd like to find guides and outfitters for hunting trips in Arizona.
</subtopic>
</topic>


Initial topic release will include only the query field. As shown in these examples, those topics having a more focused single intent have a single subtopic. Topics with multiple subtopics reflect underspecified queries, with different aspects covered by the subtopics. We assume that a user interested in one aspect may still be interested in others. Each subtopic is categorized as being either navigational ("nav") or informational ("inf"). A navigational subtopic usually has only a small number of relevant pages (often one). For these subtopics, we assume the user is seeking a page with a specific URL, such as an organization's homepage. On the other hand, an informational query may have a large number of relevant pages. For these subtopics, we assume the user is seeking information without regard to its source, provided that the source is reliable. For the adhoc task, relevance is judged on the basis of the description field.
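Once the detailed topics are released, a topic in the structure shown above can be read with Python's standard library. The sketch below parses a single `<topic>` element from a string; the function name `parse_topic` and the returned dictionary layout are choices made for this example, and the sketch makes no assumption about how topics are wrapped in the released file.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the multi-facet example topic, for illustration only.
TOPIC_XML = """
<topic number="16" type="faceted">
<query>arizona game and fish</query>
<description>I'm looking for information about fishing and hunting
in Arizona.
</description>
<subtopic number="1" type="nav">
Take me to the Arizona Game and Fish Department homepage.
</subtopic>
<subtopic number="2" type="inf">
What are the regulations for hunting and fishing in Arizona?
</subtopic>
</topic>
"""


def parse_topic(xml_text: str) -> dict:
    """Parse one <topic> element into a plain dictionary."""
    topic = ET.fromstring(xml_text.strip())
    return {
        "number": int(topic.get("number")),
        "query": topic.findtext("query", default="").strip(),
        # Collapse line breaks and repeated spaces inside the free text.
        "description": " ".join(topic.findtext("description", default="").split()),
        "subtopics": [
            {
                "number": int(sub.get("number")),
                "type": sub.get("type"),  # "nav" or "inf"
                "text": " ".join((sub.text or "").split()),
            }
            for sub in topic.findall("subtopic")
        ],
    }


if __name__ == "__main__":
    parsed = parse_topic(TOPIC_XML)
    print(parsed["query"], [s["type"] for s in parsed["subtopics"]])
```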

All adhoc and risk-sensitive task runs must be compressed (gzip or bzip2). For both tasks, a submission consists of a single ASCII text file in the format used for most TREC submissions, which we repeat here for convenience. White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly six columns per line with at least one space between the columns.

    5 Q0 clueweb12-enwp02-06-01125 1 32.38 example2013
    5 Q0 clueweb12-en0011-25-31331 2 29.73 example2013
    5 Q0 clueweb12-en0006-97-08104 3 21.93 example2013
    5 Q0 clueweb12-en0009-82-23589 4 21.34 example2013
    5 Q0 clueweb12-en0001-51-20258 5 21.06 example2013
    5 Q0 clueweb12-en0002-99-12860 6 13.00 example2013
    5 Q0 clueweb12-en0003-08-08637 7 12.87 example2013
    5 Q0 clueweb12-en0004-79-18096 8 11.13 example2013
    5 Q0 clueweb12-en0008-90-04729 9 10.72 example2013

etc. where:

• the first column is the topic number.
• the second column is currently unused and should always be "Q0".
• the third column is the official document identifier of the retrieved document. For documents in the ClueWeb12 collection this identifier is the value found in the "WARC-TREC-ID" field of the document's WARC header.
• the fourth column is the rank at which the document is retrieved, and the fifth column shows the score (integer or floating point) that generated the ranking. This score must be in descending (non-increasing) order. The evaluation program ranks documents from these scores, not from your ranks. If you want the precise ranking you submit to be evaluated, the scores must reflect that ranking.
• the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used. That is, each run should have a different tag that identifies the group and the method that produced the run. Please change the tag from year to year, since often we compare across years (for graphs and such) and having the same name show up for both years is confusing. Also, run tags must contain 12 or fewer letters and numbers, with no punctuation, to facilitate labeling graphs with the tags.
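Before submitting, it can be worth checking a run file against the rules listed above. The following Python sketch is one possible pre-submission check, not an official validator: the function name `check_run_lines` and the wording of its messages are hypothetical, and it assumes each topic's lines appear in rank order so that scores can be compared line to line.

```python
import re
from typing import Dict, Iterable, List

# Run tags: 12 or fewer letters and numbers, no punctuation.
RUN_TAG = re.compile(r"^[A-Za-z0-9]{1,12}$")


def check_run_lines(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found in a run file; an empty list means none."""
    problems: List[str] = []
    last_score: Dict[str, float] = {}  # topic -> score on that topic's previous line
    for n, line in enumerate(lines, start=1):
        cols = line.split()
        if len(cols) != 6:
            problems.append(f"line {n}: expected 6 columns, found {len(cols)}")
            continue
        topic, q0, _docid, rank, score, tag = cols
        if q0 != "Q0":
            problems.append(f"line {n}: second column must be Q0")
        if not rank.isdigit():
            problems.append(f"line {n}: rank is not an integer")
        if not RUN_TAG.match(tag):
            problems.append(f"line {n}: run tag must be 1-12 letters or digits")
        try:
            value = float(score)
        except ValueError:
            problems.append(f"line {n}: score is not a number")
            continue
        # Scores must be non-increasing within a topic.
        if topic in last_score and value > last_score[topic]:
            problems.append(f"line {n}: score increases within topic {topic}")
        last_score[topic] = value
    return problems


if __name__ == "__main__":
    sample = [
        "5 Q0 clueweb12-enwp02-06-01125 1 32.38 example2013",
        "5 Q0 clueweb12-en0011-25-31331 2 29.73 example2013",
    ]
    print(check_run_lines(sample))
```

The check on scores matters most in practice, because the evaluation program ranks documents by score rather than by the rank column.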

### References

[1] Lidan Wang, Paul N. Bennett, Kevyn Collins-Thompson. Robust Ranking Models via Risk-Sensitive Optimization. In Proceedings of SIGIR 2012.
[2] B. Taner Dinçer, Iadh Ounis and Craig Macdonald. Tackling Biased Baselines in the Risk-Sensitive Evaluation of Retrieval Systems. In Proceedings of ECIR 2014.