• Subsampled 75 of the 100 queries
• Asked 25 human volunteers to judge the queries
• Each query was judged by 5 evaluators
  - Captured the average rating across evaluators on each facet, on a 5-point Likert scale
  - (1 = sensitive, 5 = insensitive)
  - (1 = specific, 5 = ambiguous)
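The setup above can be sketched in code. This is only an illustration under stated assumptions: the slide does not say how queries were distributed among volunteers, so the round-robin scheme, the function names, the facet labels `sensitivity` and `specificity`, and the sample ratings are all hypothetical. With 75 queries, 5 judgments each, and 25 volunteers, a balanced split works out to 75 × 5 / 25 = 15 queries per volunteer.

```python
from collections import Counter
from statistics import mean

N_QUERIES = 75        # subsampled queries (from the slide)
N_EVALUATORS = 25     # human volunteers (from the slide)
JUDGES_PER_QUERY = 5  # evaluators per query (from the slide)


def assign_evaluators(n_queries, n_evaluators, judges_per_query):
    """Hypothetical round-robin assignment of evaluators to queries.

    Query q is given a block of `judges_per_query` consecutive evaluator
    ids, wrapping around modulo `n_evaluators`. This is an assumption for
    illustration, not the procedure used in the study.
    """
    assignment = {}
    for q in range(n_queries):
        start = (q * judges_per_query) % n_evaluators
        assignment[q] = [(start + j) % n_evaluators
                         for j in range(judges_per_query)]
    return assignment


def mean_facet_ratings(ratings_by_facet):
    """Average the 1-5 Likert ratings collected for each facet of one query."""
    return {facet: mean(scores) for facet, scores in ratings_by_facet.items()}


assignment = assign_evaluators(N_QUERIES, N_EVALUATORS, JUDGES_PER_QUERY)

# Workload per volunteer under this assumed scheme.
load = Counter(e for judges in assignment.values() for e in judges)

# Made-up ratings from the 5 evaluators of a single query.
example_ratings = {
    "sensitivity": [4, 5, 4, 3, 4],  # 1 = sensitive, 5 = insensitive
    "specificity": [2, 1, 2, 2, 3],  # 1 = specific, 5 = ambiguous
}
averages = mean_facet_ratings(example_ratings)
```

A round-robin split keeps the workload even, but it always groups the same five volunteers together; a randomized balanced design would avoid that at the cost of a little more code.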
• Goals:
• A. Is our classification replicable / understandable?
• B. Is the coarse granularity of the facet values OK?