Some of the most popular NEXT experiments involve collecting comparative judgments or pairwise comparisons from people to obtain rankings. For instance, many experiments involve showing participants two items at a time, selected from a large list of items (e.g., products, images, sentences), and asking them which of the two is preferable. This sort of pairwise comparison has many virtues, as well as some limitations. This blog post explains the pros and cons of asking users to compare two items as opposed to asking users to rate individual items.

Advantages of Pairwise Comparisons

  • Ratings are harder to provide than pairwise comparisons. For example, asking “How blue is this paint on a scale of 1-5?” is more difficult to answer than “Which of these two paint colors is more blue?”
  • Calibration of utility functions across users. Suppose Alice and Bob were asked to rate all the restaurants within walking distance on a scale of 1 to 5. The problem is that even if they both enjoyed a restaurant the same amount, Alice’s “4” may correspond to Bob’s “5”. And Alice may never give any score below 3, while Bob gives scores across the whole scale. In other words, the scoring functions Alice and Bob are using are not calibrated, which makes it difficult to decide how to combine their scores to obtain an overall ranking. This sort of calibration issue does not arise with pairwise comparisons.
  • Calibration of utility functions over time. Suppose Alice is shown a long sequence of different shades of light blue (e.g., baby blue, aqua, etc.) and, after each shade is shown, is asked to rate it on a scale of 1 to 5. Now suppose that after repeating this process many times, the colors suddenly start becoming a much deeper blue. To Alice, these new shades of blue are “more blue” than before, but she had already calibrated herself to the lighter shades of blue and used the whole 1-5 scale. Now that Alice sees these new deeper shades of blue, she wants to go back and change her answers. This problem arises because Alice is not calibrated to the entire range of “blue”. On the other hand, if any two shades of blue were shown together and she was asked which shade was more blue, she would have no reason to want to change her answers later once the darker shades of blue started appearing.
  • Pairwise comparisons have infinite precision. Suppose you asked people to score common everyday objects according to the object size, with the idea that “bigger is better”. For example, houses are bigger than cars, so houses would get a higher score. Assume that for all pairs of objects, everyone agrees which of the two is larger, just like the previous example. If we collect pairwise comparisons from one or more people, there would be no ambiguity in the overall ranking of objects from largest to smallest (we would rank them in accordance with the outcomes of the pairwise comparisons). In contrast, if we ask people to score individual objects on a scale of, say, 1-100, then the ordered list generated according to the scores will often not agree with the correct ranking. This happens because people may forget that they gave a score of 10 to a basketball by the time they are asked to score another larger/smaller object and mistakenly give the new object a smaller/larger score (Stewart, Brown, and Chater 2005).
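
The calibration problem across users can be illustrated with a small Python sketch. The restaurant names and scores below are made-up data, and implied_comparisons is a hypothetical helper, not part of NEXT. Alice and Bob order the restaurants identically, but Alice only uses scores 3-5 while Bob uses the full scale.

```python
from itertools import combinations

# Hypothetical ratings (assumed data): same order of preference,
# but Alice never goes below 3 while Bob uses the whole 1-5 scale.
alice = {"diner": 3, "bistro": 4, "sushi": 5}
bob = {"diner": 1, "bistro": 3, "sushi": 5}

def implied_comparisons(ratings):
    """Return the set of (winner, loser) pairs implied by one person's ratings."""
    pairs = set()
    for a, b in combinations(ratings, 2):
        if ratings[a] > ratings[b]:
            pairs.add((a, b))
        elif ratings[b] > ratings[a]:
            pairs.add((b, a))
    return pairs

# Mixing raw scores across people is misleading: Alice's score for the
# diner equals Bob's score for the bistro, yet both prefer the bistro.
cross_user_tie = alice["diner"] == bob["bistro"]

# The comparisons each person would make are identical.
same_comparisons = implied_comparisons(alice) == implied_comparisons(bob)

print(cross_user_tie, same_comparisons)
```

In this toy example the raw scores disagree on almost every item, while the within-person comparisons agree on every pair, which is why comparisons are easier to combine across people.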

Advantages of Ratings

  • Absolute versus relative judgment. Ratings for two items tell us not only which item is preferred, but also the degree to which it is preferred, so ratings can be more informative than a pairwise comparison.
  • No intransitivity issues. Comparisons can be confusing, especially when multiple people disagree. For example, suppose you are trying to decide whether to see movie A or movie B. Your friend Alice prefers movie A to movie B, but another friend, Bob, prefers B to A. There is no way to decide which is better based on the comparisons alone. However, if Alice gives A 5 stars and B 4 stars, and Bob gives A 1 star and B 5 stars, then it is reasonable to choose B, since both friends gave it a fairly high rating.
  • Clear definition of best. It is easy to pick the best item based on ratings (assuming the ratings are well calibrated). Naturally, the best, or one of the best, would be an item with the highest average rating. With pairwise comparisons instead of ratings, there may not be a clear-cut winner. For example, suppose everyone prefers A to B, B to C, and A to C. Then A is the clear winner since it is preferred over all others by everyone. However, it could also be the case that not everyone agrees: more people prefer A to B and more prefer B to C, but more people prefer C to A. This could happen if different people use different criteria in making their judgments. In such cases, it’s not clear how to select the best item. Most approaches used in practice look at the margins of the majority decisions in some fashion, but there isn’t a clear-cut way to make a decision.
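
The cyclic case in the last bullet can be reproduced with three hypothetical voters. The sketch below assumes each voter's comparisons come from a full personal ranking, and it uses the standard notion of a "Condorcet winner" (an item that beats every other item by majority), a term this post does not itself use. The function names and data are illustrative only.

```python
from itertools import permutations

def majority_margins(rankings):
    """margins[(x, y)] = (# voters ranking x above y) - (# ranking y above x)."""
    items = rankings[0]
    margins = {}
    for x, y in permutations(items, 2):
        wins = sum(r.index(x) < r.index(y) for r in rankings)
        margins[(x, y)] = wins - (len(rankings) - wins)
    return margins

def condorcet_winner(rankings):
    """Return the item that beats every other item by majority, or None."""
    items = rankings[0]
    margins = majority_margins(rankings)
    for x in items:
        if all(margins[(x, y)] > 0 for y in items if y != x):
            return x
    return None

# Hypothetical voters (assumed data), each ranking from best to worst.
unanimous = [["A", "B", "C"]] * 3
cyclic = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]

print(condorcet_winner(unanimous))  # A
print(condorcet_winner(cyclic))     # None
```

In the cyclic case a majority prefers A to B, B to C, and C to A, so no item beats all the others, and any choice of "best" has to rely on some extra rule about the margins.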

Advice for Experimenters

If calibration issues are a concern, then we recommend using pairwise comparisons rather than ratings. A practical concern with pairwise comparisons is that participants sometimes wish for more options. For example, rather than being forced to select one of two undesirable items, they would like to indicate that both are bad. Unfortunately, allowing for this sort of option would lead to difficulties in calibrating responses, just like in a rating-based system. A simple way to motivate participants would be to explain that the goal is to determine a ranking of items and quantify the certainty of that ranking. Providing a comparison even for apparent ties, where neither item is particularly desirable, still helps to determine the overall ranking of items. If it is difficult or impossible to decide for any reason (e.g., both very good, both very bad), then we advise the participant to just pick one quickly and move on. If the overall crowd of participants feels similarly indifferent towards a pair of items, this coin-flip behavior is easily accounted for in determining the overall ranking. Indeed, by investigating the frequency with which an alternative of a pair is chosen, we can quantify the degree of preference or indifference for that alternative.
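
As a rough illustration of quantifying preference from choice frequencies, the sketch below estimates how often item A is chosen over item B and attaches a Wilson score interval. This is one common choice of interval and not necessarily what NEXT computes. The tallies are made-up data and preference_estimate is a hypothetical helper.

```python
import math

def preference_estimate(wins_a, total, z=1.96):
    """Fraction of times A was chosen over B, with a Wilson score interval.

    z=1.96 corresponds to roughly 95% confidence.
    """
    p = wins_a / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return p, (center - half, center + half)

# Hypothetical tallies (assumed data) from 100 participants per pair.
indifferent = preference_estimate(52, 100)  # near coin-flip behavior
clear = preference_estimate(85, 100)        # strong preference for A

for name, (p, (lo, hi)) in [("indifferent", indifferent), ("clear", clear)]:
    print(f"{name}: chose A {p:.2f} of the time, interval [{lo:.2f}, {hi:.2f}]")
```

An interval that contains 0.5 is consistent with the crowd being indifferent between the two items, while an interval well above 0.5 indicates a real preference for A.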

If the need or desire for more options, like “neither”, overrides the advantages of pure pairwise comparisons, then we recommend adopting a rating-based system in which a participant is asked to rate individual items on a scale of 1-5 or 1-10, accepting the potential calibration problems and hoping for the best.