Last update: June 26, 2014 by KevynCT.
Welcome to the TREC 2014 Web Track. Our goal is to explore and evaluate Web retrieval technologies that are both effective and reliable. As with last year's Web track, we will use the 870-million page ClueWeb12 Dataset. The Web track will continue the ad-hoc retrieval tasks from 2009-2013.
We assume you arrived at this page because you're participating in this year's TREC conference. If not, you should start at the TREC main page.
If you're new to the TREC Web Track, you may want to start by reading the track overview papers from previous years. See last year's Web track info at the TREC website.
Last year's resources are available here via github.
If you're planning to participate in the track, you should be on the track mailing list. If you're not on the list, send a mail message to listproc (at) nist (dot) gov such that the body consists of the line "subscribe trec-web FirstName LastName".
The adhoc and risk tasks share topics, which will be developed with the assistance of information extracted from the logs of commercial Web search engines. Topic creation and judging will attempt to reflect important characteristics of authentic Web search queries. Like last year, there will be a mixture of both broad and specific query intents reflected in the topics. The broad topics will retain the multiple-subtopic paradigm used in last year's Web track, while the specific topics will reflect a single, more focused intent/subtopic. See below for example topics.
Mark Smucker is providing spam scores for ClueWeb12.
Djoerd Hiemstra is providing anchor text for ClueWeb12.
At least one adhoc run from each group will be judged by NIST assessors, with priority given based on the ordering you give at submission (i.e. the first run will be given top priority). Each document will be judged on a six-point scale, as follows:
• Nav: the page is the navigational target the query names (e.g. a site's home page).
• Key: the page is dedicated to the topic and worthy of being a top result.
• HRel: the page provides substantial information on the topic.
• Rel: the page provides some, possibly minimal, information on the topic.
• Non: the page provides no useful information on the topic.
• Junk: the page appears useless for any reasonable purpose (e.g. spam or junk).
All topics are expressed in English. Non-English documents will be judged non-relevant, even if the assessor understands the language of the document and the document would be relevant in that language. If the location of the user matters, the assessor will assume that the user is located in Gaithersburg, Maryland.
The primary effectiveness measure will be intent-aware expected reciprocal rank (ERR-IA) which is a variant of ERR as defined by Chapelle et al. (CIKM 2009). For single-facet queries, ERR-IA simply becomes ERR. In addition to ERR and ERR-IA, we will compute and report a range of standard measures, including MAP, precision@10 and NDCG@10.
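As a rough illustration of these measures, the sketch below computes ERR from graded judgments and combines per-intent ERR scores into ERR-IA, following the cascade model of Chapelle et al. The gain mapping (2^g − 1)/2^g_max, the default g_max, and the function names are assumptions for illustration, not the official evaluation code.

```python
def err(gains, g_max=4):
    """Expected Reciprocal Rank for one ranked list of graded judgments."""
    p_continue = 1.0   # probability the user reaches this rank
    score = 0.0
    for rank, g in enumerate(gains, start=1):
        r = (2 ** g - 1) / 2 ** g_max   # probability this doc satisfies the user
        score += p_continue * r / rank
        p_continue *= 1 - r             # user continues only if unsatisfied
    return score

def err_ia(per_intent_gains, intent_probs, g_max=4):
    """Intent-aware ERR: intent-probability-weighted ERR over subtopics."""
    return sum(p * err(gains, g_max)
               for p, gains in zip(intent_probs, per_intent_gains))
```

For a query with a single intent of probability 1, `err_ia` reduces to `err`, matching the statement above that ERR-IA simply becomes ERR for single-facet queries.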
You may submit up to three runs for the adhoc task; at least one will be judged. NIST may judge additional runs per group depending upon available assessor time. During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will choose the run to assess arbitrarily.
The format for submissions is given in a separate section below. Each topic must have at least one document retrieved for it. While many of the evaluation measures used in this track consider only the top 10-20 documents, some methods for estimating MAP sample at deeper levels, and we request that you return the top 10,000 to aid in this process. You may return fewer than 10,000 documents for a topic. However, you cannot hurt your score, and could conceivably improve it, by returning 10,000 documents per topic. All the evaluation measures used in the track count empty ranks as not relevant (Non).
Typically, the baseline will perform well on some queries but not others. After running a number of queries, your system's results will have a distribution of wins and losses relative to the given baseline. The risk-sensitive evaluation measures we use are derived from the properties of this win/loss distribution. The evaluation methods in the risk-sensitive task will reward systems that simultaneously achieve (a) high average effectiveness per query; (b) minimal losses with respect to a baseline; and (c) wins on a higher proportion of overall queries than other systems.
We believe the risk-sensitive task continues to be of broad interest to the IR community, since techniques from a wide variety of research areas could be applicable, including robust query expansion and pseudo-relevance feedback; fusion and diversity-oriented ranking; query performance prediction to select which baseline results to keep or modify; learning-to-rank models that optimize both effectiveness and robustness; and others. New for the 2014 Web Track, participants will be able to submit absolute and relative query performance predictions for each topic as part of the risk-sensitive retrieval task.

Part of the goal of the TREC 2014 Web track is to understand the nature of risk-reward tradeoffs achievable by a system that can adapt to different baselines, so for 2014 we are supplying two baseline runs from different IR systems (instead of the single baseline used in 2013). We are providing sample scripts that take as input (a) your risk-sensitive run, (b) a baseline run, and (c) a risk parameter, and output the risk-sensitive retrieval metrics that your system should optimize, as described next.
As with the adhoc task, we will use Intent-Aware Expected Reciprocal Rank (ERR-IA) as the basic measure of retrieval effectiveness. The per-query retrieval delta for a given baseline is defined as the absolute (not relative) difference in effectiveness between your contributed run and the given baseline run for that query. A positive delta means a win for your system on that query; a negative delta means a loss. We will also report other flavors of the risk-related measure based on NDCG and other standard effectiveness measures.

Over single runs, one primary risk measure we will report is the probability of failure per topic, where failure is defined as any negative retrieval delta with respect to the given baseline. We will also report measures based on more detailed properties of the distribution of results for a given system, such as how the mass of the win/loss distribution is spread across all queries, and not merely the overall probability of failure. One such measure is the expected shortfall of a system's results at a given failure level: we will focus on the average retrieval loss over the worst 25% of failures, but we will also report across a range of percentile levels. (For runs with no failures, expected shortfall is zero.)

For single runs, the following will be the main risk-sensitive evaluation measure. Let Δ(q) = R_A(q) − R_BASE(q) be the absolute win or loss for query q, where R_A(q) is the system's retrieval effectiveness and R_BASE(q) is the given baseline's effectiveness for the same query. We categorize the outcome for each query q in the set Q of all N queries according to the sign of Δ(q), giving three categories: wins (Q_Win, where Δ(q) > 0), losses (Q_Loss, where Δ(q) < 0), and ties (Δ(q) = 0). The combined measure is then

    U_RISK(Q) = (1/N) · [ Σ_{q ∈ Q_Win} Δ(q) + (α+1) · Σ_{q ∈ Q_Loss} Δ(q) ]    (Eq. 1)
where α is a risk-aversion parameter. In words, this rewards systems that maximize average effectiveness, but also penalizes losses relative to the baseline results for the same query, weighting losses α+1 times as heavily as successes. When the risk aversion parameter α is large, a system will become more conservative and put more emphasis on avoiding large losses relative to the baseline. When α is small, a system will tend to ignore the baseline. The adhoc task objective, maximizing only average effectiveness across queries, corresponds to the special case α = 0.
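The win/loss measures above can be sketched as follows, assuming `deltas` holds Δ(q) = R_A(q) − R_BASE(q) for each query. The function names and the use of a ceiling when selecting the worst failures are illustrative assumptions, not taken from the official evaluation scripts.

```python
import math

def u_risk(deltas, alpha=1.0):
    """Eq. 1: mean per-query delta, weighting losses (alpha + 1) times."""
    wins = sum(d for d in deltas if d > 0)
    losses = sum(d for d in deltas if d < 0)     # negative values
    return (wins + (alpha + 1) * losses) / len(deltas)

def failure_probability(deltas):
    """Fraction of queries that lose to the baseline (negative delta)."""
    return sum(1 for d in deltas if d < 0) / len(deltas)

def expected_shortfall(deltas, worst_fraction=0.25):
    """Average magnitude of loss over the worst `worst_fraction` of failures."""
    losses = sorted(d for d in deltas if d < 0)  # most negative first
    if not losses:
        return 0.0                               # no failures -> zero shortfall
    k = max(1, math.ceil(worst_fraction * len(losses)))
    return -sum(losses[:k]) / k
```

With `alpha = 0`, `u_risk` reduces to the plain mean delta, matching the statement that the adhoc objective corresponds to the special case α = 0.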
We will also report a ratio-based version, U_RISKRATIO, of Eq. 1, in which each per-query win or loss is measured relative to the baseline's effectiveness rather than as an absolute difference.
Finally, we will evaluate a system's overall risk-sensitive performance by reporting a combined measure U_RISK(Q*), where Q* is the combined pool of query results from both the Indri and Terrier baselines. (i.e. Q* will have 100 results: 50 topics x 2 baselines).
$ ./ndeval -c -traditional qrels.txt trec-format-run-to-evaluate.txt > normal-nd-evaluation.txt
$ ./ndeval -c -traditional -baseline trec-format-baseline-run.txt -riskAlpha 1 qrels.txt trec-format-test-run.txt > risk-sensitive-nd-evaluation.txt
$ ./gdeval.pl -c qrels.txt trec-format-test-run.txt > normal-gd-evaluation.txt
$ ./gdeval.pl -c -riskAlpha 1 -baseline trec-format-baseline-run.txt qrels.txt trec-format-test-run.txt > risk-sensitive-gd-evaluation.txt
For the risk-sensitive task, we provide two baseline runs, comprising the top 10,000 results for two particular choices of easily reproducible baseline system. The risk-sensitive retrieval performance of submitted systems will be measured against both of these baselines. This year, the two baselines will be provided using the (a) Indri and (b) Terrier retrieval engines, with the specific default settings provided by the respective systems.
Our ClueWeb09 training baselines use TREC 2012 Web track topics, retrieved with the same Indri and Terrier settings as the ClueWeb12 baselines. The Indri baseline was computed using Indri's default query expansion based on pseudo-relevance feedback, with results filtered using the Waterloo spam filter. The Terrier baseline files will be computed in a similar way.
The github file with the TREC 2012 baseline runs can be found off the trec-web-2014 root at:
/data/runs/baselines/2012/rm/results-cata-filtered.txt (ClueWeb09 full index, Indri default relevance model results, spam-filtered)
For comparison we also provide other flavors of training baseline for CatB, and without spam filtering, in the same directory.
We have also included simple query likelihood runs that do not use query expansion.
The 2013 training baseline runs can be found in the github repository in the file:
data/runs/baselines/2013/rm/results-cata-filtered.txt (ClueWeb12 full index, Indri default relevance model results, spam-filtered)
The 2013 Indri training baseline was computed using the 2013 Web track topics and ClueWeb12 collection, using exactly the same retrieval method as the 2012 training baseline: namely, the Indri search engine with default query expansion based on pseudo-relevance feedback, with results filtered using the Waterloo spam filter.
As we did for the 2012 training baselines above, we have provided alternative variants of the 2013 training baselines, in case you want to compare runs or explore using different sources of evidence. The naming convention is the same as that used for the 2012 training files above. However, these variants will *not* be used for evaluation: the results-cata-filtered.txt run above is the only official test baseline.
To evaluate the quality of risk/reward tradeoffs a system can achieve for different baselines, we require participants to provide three risk-sensitive runs. Your three runs must correspond to optimizing retrieval for the 50 test topics relative to three different baselines:
|Topic_ID|TREC topic number|
|Baseline_QPP_Score|Participant-defined prediction score for the absolute effectiveness of the results for the baseline used for the risk-sensitive run.|
|RiskRun_QPP_Score|Participant-defined prediction score for the absolute effectiveness of the results for the risk-sensitive run.|
|Relative_QPP_Score|Participant-defined relative gain or loss prediction score: the difference in effectiveness between the risk-sensitive run and the baseline run.|
Topics will be fully defined by NIST in advance of topic release, but only the query field will be initially released. Detailed topics will be released only after runs have been submitted. Subtopics will be based on information extracted from the logs of a commercial search engine. Topics having multiple subtopics will have subtopics roughly balanced in terms of popularity. Strange and unusual interpretations and aspects will be avoided as much as possible.
In all other respects, the risk-sensitive task is identical to the adhoc task. The same 50 topics will be used. The submission format is the same. The top 10,000 documents should be submitted.
The topic structure will be similar to that used for the TREC 2009 topics. The topics below provide examples.
Single-facet topic examples:
<topic number="1" type="faceted"> <query>feta cheese preservatives</query> <description>Find information on which substances are used to extend the shelf life of feta cheese. </description> <subtopic number="1" type="inf"> Find information on which substances are used to extend the shelf life of feta cheese. </subtopic> </topic> <topic number="2" type="faceted"> <query> georgia state university admissions yield <query> <description> Find information on what percentage of students decide to attend Georgia State University after being admitted, as well as recent trends and discussion on this statistic. </description> <subtopic number="1" type="inf"> Find information on what percentage of students decide to attend Georgia State University after being admitted, as well as recent trends and discussion on this statistic. </subtopic> </topic>Multi-facet topic example:
<topic number="16" type="faceted"> <query>arizona game and fish</query> <description>I'm looking for information about fishing and hunting in Arizona. </description> <subtopic number="1" type="nav"> Take me to the Arizona Game and Fish Department homepage. </subtopic> <subtopic number="2" type="inf"> What are the regulations for hunting and fishing in Arizona? </subtopic> <subtopic number="3" type="nav"> I'm looking for the Arizona Fishing Report site. </subtopic> <subtopic number="4" type="inf"> I'd like to find guides and outfitters for hunting trips in Arizona. </subtopic> </topic>
Initial topic release will include only the query field. As shown in these examples, those topics having a more focused single intent have a single subtopic. Topics with multiple subtopics reflect underspecified queries, with different aspects covered by the subtopics. We assume that a user interested in one aspect may still be interested in others. Each subtopic is categorized as being either navigational ("nav") or informational ("inf"). A navigational subtopic usually has only a small number of relevant pages (often one). For these subtopics, we assume the user is seeking a page with a specific URL, such as an organization's homepage. On the other hand, an informational query may have a large number of relevant pages. For these subtopics, we assume the user is seeking information without regard to its source, provided that the source is reliable. For the adhoc task, relevance is judged on the basis of the description field.
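Topics in this format can be read with the standard library's XML parser. The following is a minimal sketch, with field names taken from the example topics above and error handling omitted for brevity:

```python
import xml.etree.ElementTree as ET

def parse_topic(xml_text):
    """Parse one <topic> element into a plain dict."""
    topic = ET.fromstring(xml_text)
    return {
        "number": int(topic.get("number")),
        "query": topic.findtext("query").strip(),
        "description": topic.findtext("description").strip(),
        "subtopics": [
            {"number": int(s.get("number")),
             "type": s.get("type"),           # "nav" or "inf"
             "text": s.text.strip()}
            for s in topic.findall("subtopic")
        ],
    }
```

Since the initial release contains only the query field, a robust reader should treat the description and subtopics as optional until the detailed topics are released.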
5 Q0 clueweb12-enwp02-06-01125 1 32.38 example2013
5 Q0 clueweb12-en0011-25-31331 2 29.73 example2013
5 Q0 clueweb12-en0006-97-08104 3 21.93 example2013
5 Q0 clueweb12-en0009-82-23589 4 21.34 example2013
5 Q0 clueweb12-en0001-51-20258 5 21.06 example2013
5 Q0 clueweb12-en0002-99-12860 6 13.00 example2013
5 Q0 clueweb12-en0003-08-08637 7 12.87 example2013
5 Q0 clueweb12-en0004-79-18096 8 11.13 example2013
5 Q0 clueweb12-en0008-90-04729 9 10.72 example2013
etc.

where:
• the first column is the topic number.
• the second column is currently unused and should always be "Q0".
• the third column is the official document identifier of the retrieved document. For documents in the ClueWeb12 collection this identifier is the value found in the "WARC-TREC-ID" field of the document's WARC header.
• the fourth column is the rank at which the document is retrieved, and the fifth column shows the score (integer or floating point) that generated the ranking. Scores must be in descending (non-increasing) order. The evaluation program ranks documents by these scores, not by your ranks; if you want the precise ranking you submit to be evaluated, the scores must reflect that ranking.
• the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used. That is, each run should have a different tag that identifies the group and the method that produced the run. Please change the tag from year to year, since we often compare across years (for graphs and such), and having the same name show up for both years is confusing. Also, run tags must contain 12 or fewer letters and numbers, with no punctuation, to facilitate labeling graphs with the tags.
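The constraints in the bullets above can be sanity-checked before submission. The following is an illustrative helper written under those assumptions (six whitespace-separated columns, non-increasing scores per topic, an alphanumeric run tag of at most 12 characters, at most 10,000 documents per topic); it is not an official NIST tool.

```python
import re

def check_run_lines(lines, max_docs_per_topic=10000):
    """Validate submission lines; returns a per-topic document count."""
    last_score = {}   # topic -> previous score, to check score ordering
    counts = {}       # topic -> number of documents seen so far
    for i, line in enumerate(lines, start=1):
        topic, q0, docid, rank, score, tag = line.split()
        assert q0 == "Q0", f"line {i}: second column must be Q0"
        assert re.fullmatch(r"[A-Za-z0-9]{1,12}", tag), \
            f"line {i}: run tag must be 1-12 letters/digits, no punctuation"
        s = float(score)
        if topic in last_score:
            assert s <= last_score[topic], \
                f"line {i}: scores must be non-increasing within a topic"
        last_score[topic] = s
        counts[topic] = counts.get(topic, 0) + 1
        assert counts[topic] <= max_docs_per_topic, \
            f"line {i}: more than {max_docs_per_topic} documents for topic {topic}"
    return counts
```

Running such a check on each run file before upload catches the most common formatting problems (wrong column count, increasing scores, overlong tags) early.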