Tasks Track 2016

The Tasks track aims to devise techniques for evaluating a system's understanding of the underlying task users are trying to complete when issuing search queries, and for evaluating the usefulness of the retrieved documents towards completing this task. As part of this, we intend to provide the participants with a set of queries together with the corresponding Freebase IDs of the entities in the queries.

Goals

Task Understanding
The goal here is to test whether systems can understand the task users are trying to complete and identify the possible subtasks necessary for the completion of the overall task. Participants will be provided with a document corpus and a set of queries; for each query they will submit a ranked list of key phrases that represent the possible subtasks the user needs to complete given this query.

Task Completion
The goal here is to test whether systems can retrieve documents useful to the completion of the task. Participants will be asked to submit a ranked list of documents, taking into account the subtasks identified in Task Understanding. The ranked lists provided by the participants will be evaluated in terms of both usefulness and relevance.

Ad-hoc Task
This task is a continuation of the traditional TREC Web Track ad-hoc task and it will be identical to the previous years' Web Track ad-hoc tasks.

Ranking Stability
It is well understood that different query formulations for the same task/subtask can lead to radically different system performance. The goal here is to test whether understanding the underlying subtasks can lead to ranking stability when subtasks are expressed with a variety of user queries. Participants will be provided with an extra set of crowdsourced queries for the underlying task. Each query will correspond to some subtask necessary to complete the overall task (not known to the participants). Participants will then be asked:
1. To run their vanilla search algorithm over the crowdsourced queries, and submit a ranked list of documents for each one of the queries provided to them.
2. To identify the subtask that corresponds to each query (e.g. by clustering the crowdsourced queries), take this information into account to stabilize the performance of their algorithm, and then submit a ranked list of documents for each one of the queries provided to them.
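
As an illustration of the clustering idea mentioned in step 2, the following is a minimal sketch only. It assumes that queries for the same subtask share most of their words, and it groups them greedily by Jaccard overlap of their tokens. The function names, the threshold value and the example queries are hypothetical and are not part of the track guidelines; a real system would likely use a stronger similarity measure.

```python
def jaccard(a, b):
    """Jaccard overlap between the lowercased word sets of two queries."""
    a, b = set(a.lower().split()), set(b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_queries(queries, threshold=0.55):
    """Greedy single-pass clustering (illustrative assumption, not the track's method).

    Each query joins the first cluster whose first member is similar enough,
    otherwise it starts a new cluster.
    """
    clusters = []
    for query in queries:
        for cluster in clusters:
            if jaccard(query, cluster[0]) >= threshold:
                cluster.append(query)
                break
        else:
            clusters.append([query])
    return clusters


# Hypothetical crowdsourced queries for the task "hotels in London".
example_queries = [
    "cheap hotels in London",
    "cheap London hotels",
    "reviews of hotels in London",
    "London hotels reviews",
]
example_clusters = cluster_queries(example_queries)
```

Each resulting cluster can then be treated as one candidate subtask, so that queries in the same cluster are served with a shared ranking strategy.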

Detailed Guidelines

Participants will be provided with a set of 50 queries, together with the Freebase ID for each entity in these queries. The track consists of three tasks and the same queries will be used for each of these tasks. Details of each task, as well as the metrics used to evaluate it, are given below.

Task Understanding

For each query, the participants are expected to submit a ranked list of up to 100 key phrases that represent the set of all tasks a user who submitted the query may be looking for. For example, for the query “hotels in London”, some relevant key phrases can be: “cheap hotels in London”, “reviews of hotels in London”, “hotels in London city centre”, etc. The goal of this task is to return a ranked list of key phrases that provide a complete coverage of tasks for each query, while avoiding redundancy.
Evaluating the coverage and relevance of the tasks submitted by the participants requires that a set of “gold standard” tasks covering all possible tasks be identified in advance. These gold standard tasks will be constructed by the organizers, but will not be provided to the participants until the evaluation results are out. In order to guarantee the coverage of tasks and be fair to all participants, tasks will be developed based on information extracted from the logs of a commercial search engine, as well as by pooling the key phrases submitted by the participants. An example set of tasks for the query “hotels in London” may be:

hotels in London [price]
hotels in London [location]
hotels [reviews] in London
[other accommodation] in London
hotels [in locations around] London

In this case the key phrase “cheap hotels in London city centre” will be relevant to both “hotels in London [price]” and “hotels in London [location]”.

Evaluation: Given the gold standard tasks, each key phrase submitted by the participants will be judged with respect to each of the gold standard tasks by using a three-level judging scheme:

Highly relevant: The key phrase completely describes the task and could be used as a query submitted to the search engine to complete the task.
Relevant: The key phrase describes the task to some extent but not fully. It can be used as a query to achieve the task, but better queries exist.
Non-relevant: The key phrase is not relevant to the task and cannot be used to complete it.

Given these judgments, the quality of each ranked list will then be evaluated using diversity metrics such as ERR-IA and alpha-NDCG [3].
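
To make the diversity evaluation more concrete, here is a minimal sketch of alpha-NDCG for a ranked list of key phrases. It rests on several assumptions that are not stated in the guidelines: the three-level judgments are collapsed to binary (a key phrase either covers a gold standard task or it does not), alpha is set to 0.5, and the ideal ranking is approximated greedily. The function names and the example judgments are hypothetical; the official evaluation tools may differ in these details.

```python
import math


def alpha_dcg(ranking, judgments, alpha=0.5, k=10):
    """alpha-DCG@k. `judgments` maps a key phrase to the set of gold tasks it covers."""
    seen = {}  # gold task -> number of higher-ranked key phrases already covering it
    score = 0.0
    for rank, phrase in enumerate(ranking[:k], start=1):
        gain = 0.0
        for task in judgments.get(phrase, set()):
            # Each repeated coverage of the same task is discounted by (1 - alpha).
            gain += (1 - alpha) ** seen.get(task, 0)
            seen[task] = seen.get(task, 0) + 1
        score += gain / math.log2(rank + 1)
    return score


def ideal_ranking(judgments, alpha=0.5, k=10):
    """Greedy approximation of the ideal ranking: pick the largest marginal gain each step."""
    remaining = sorted(judgments)  # sorted for deterministic tie-breaking
    seen, ideal = {}, []
    while remaining and len(ideal) < k:
        best = max(
            remaining,
            key=lambda p: sum((1 - alpha) ** seen.get(t, 0) for t in judgments[p]),
        )
        ideal.append(best)
        remaining.remove(best)
        for task in judgments[best]:
            seen[task] = seen.get(task, 0) + 1
    return ideal


def alpha_ndcg(ranking, judgments, alpha=0.5, k=10):
    """alpha-DCG of the run divided by alpha-DCG of the greedy ideal ranking."""
    ideal = alpha_dcg(ideal_ranking(judgments, alpha, k), judgments, alpha, k)
    return alpha_dcg(ranking, judgments, alpha, k) / ideal if ideal > 0 else 0.0


# Hypothetical binary judgments for the query "hotels in London".
example_judgments = {
    "cheap hotels in London city centre": {"price", "location"},
    "cheap hotels in London": {"price"},
    "hotels in London city centre": {"location"},
    "reviews of hotels in London": {"reviews"},
}
```

Under this sketch, a list that repeats the same gold standard task near the top scores lower than one that covers a new task at each rank, which matches the stated goal of complete coverage without redundancy.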

Task Completion

For each query, the participants are expected to submit a ranked list of up to 1000 documents that could be relevant to any task a user may be trying to achieve given a query.

Evaluation: Each document submitted by the participants will then be assessed with respect to each possible “gold standard” task on two dimensions: its usefulness for completing the task, judged on a three-level scale, and its relevance to the task, judged on a four-level scale:

Usefulness:
- Key: The document is essential towards the completion of the task. The document is enough on its own to complete the task.
- Useful: The document is useful towards the completion of the task. However, more documents need to be investigated in order to complete the task.
- Not Useful: The document is not useful towards the completion of the task.
Relevance:
- HRel: The page contains a significant amount of information about the task.
- Rel: The content of this page provides some information on the task, which may be minimal.
- Non: The content of this page does not provide useful information about the task.
- Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk.

Ad-hoc Task

Participants are expected to submit a ranked list of up to 1000 documents for each query. To evaluate the quality of the submitted runs, the judgments obtained for the Task Completion task will be used, ignoring the usefulness category and focusing on relevance. ERR [2] and NDCG [1] will be used as the primary metrics for evaluation.
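
For reference, the two ad-hoc metrics can be sketched as follows. This is an illustration under stated assumptions rather than the official scoring code: the mapping from relevance labels to numeric grades (HRel = 2, Rel = 1, Non = 0, Junk = 0), the cutoff k = 20 and the exponential gain 2^g - 1 are choices made here for the example, and all names are hypothetical.

```python
import math

# Assumed mapping from the relevance labels to numeric grades (not from the guidelines).
GRADES = {"HRel": 2, "Rel": 1, "Non": 0, "Junk": 0}


def err(grades, max_grade=2, k=20):
    """Expected reciprocal rank for a list of grades in ranked order."""
    score, p_continue = 0.0, 1.0
    for rank, grade in enumerate(grades[:k], start=1):
        p_satisfied = (2 ** grade - 1) / 2 ** max_grade
        score += p_continue * p_satisfied / rank
        p_continue *= 1 - p_satisfied  # the user moves on only if not satisfied
    return score


def dcg(grades, k=20):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum(
        (2 ** grade - 1) / math.log2(rank + 1)
        for rank, grade in enumerate(grades[:k], start=1)
    )


def ndcg(grades, judged_grades, k=20):
    """DCG of the run divided by the DCG of the best ordering of all judged grades."""
    ideal = dcg(sorted(judged_grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0


# A hypothetical run whose top three documents were judged HRel, Rel and Non.
run_grades = [GRADES[label] for label in ["HRel", "Rel", "Non"]]
print(err(run_grades))               # 0.78125
print(ndcg(run_grades, run_grades))  # 1.0
```

In the ERR computation, a highly relevant document at rank 1 satisfies the modelled user with probability 3/4, so later ranks contribute little; this is why ERR rewards placing the best document first more strongly than NDCG does.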

Ranking Stability

The Ranking Stability task will not be evaluated this year due to limited resources.

References

[1] Kevyn Collins-Thompson, Paul N. Bennett, Fernando Diaz, Charlie Clarke, Ellen M. Voorhees: TREC 2013 Web Track Overview. TREC 2013
[2] Olivier Chapelle, Donald Metzler, Ya Zhang, Pierre Grinspan: Expected reciprocal rank for graded relevance. CIKM 2009: 621-630
[3] Charles L. A. Clarke, Nick Craswell, Ellen M. Voorhees: Overview of the TREC 2012 Web Track. TREC 2012