Semantic similarity or semantic relatedness are features of natural language that contribute to the challenge machines face when analyzing text. Although semantic relatedness is still a complex challenge only few ground truth data set exist. We argue that the available corpora used to evaluate the performance of natural language tools do not capture all elements of the phenomenon. We present a set of simple interventions that illustrate 1) framing effects influence similarity perception, 2) the distribution of similarity across multiple users is important and 3) semantic relatedness is asymmetric.
A number of metrics in the literature attempt to model and evaluate semantic similarity in natural languages. Semantic similarity has applications in areas such as semantic search, text mining, etc. The concept of semantic similarity has long been considered as a more specific concept than the concept of semantic relatedness. Semantic relatedness, as it includes the concepts of antonymy and meronymy, is more generic than semantic similarity.
Different approaches have been attempted to measure semantic relatedness and similarity. Some methods use structured taxonomies such as WordNet alternative approaches define relatedness between words using search engines (e.g., based on Google counts) or Wikipedia. All of these methods are evaluated based on the correlation with human ratings. Yet only few benchmark data sets exist. One of the most widely used being the WS-353 data-set . As the corpus is very small and the sample size per pair is low it is arguable if all relevant phenomena are in fact present in the provided data set.
In this study, we aim to understand how human raters perceive word-based semantic relatedness. We argue that asking simple word-based semantic similarity is beyond the scope of existing test sets. Our hypotheses in this paper are as follows:
(H1) The framing effect influences similarity rating by human assessors.
(H2) The distribution of similarity rating does not follow a normal distribution.
(H3) Semantic relatedness is not symmetric. The relatedness between words (e.g., tiger and cat) yields different similarity ratings in a different word order.
To verify our hypotheses, we collected similarity ratings on word pairs from the WS-353 data-set. We randomly selected 102 word pairs from the WS-353 data-set. We collected similarity ratings on the 102 word pairs through Amazon Mechanical Turk (MTurk). We collected 5 dataset for these 102 pairs. Each collection used a different task design and was separated into two batches of 51 words each. Each batch received ratings from 50 unique contributors so that each pair of word received 50 ratings in each condition.
The way the questions were asked to the crowd workers are shown in the following figure. For each question, 4 conditions were differently framed. The first two of these are “How is X similar to Y?” (sim) and “How is Y similar to X?” (inverted-sim). We further repeated them asking for the difference between both words (dissim and inverted-dissim, respectively). Since the scale is reversed in dissim and inverted-dissim, the dissimilarity ratings were converted into similarity ratings for comparison.
We compared the distributions of similarity ratings in the original WS-353 dataset and our dataset in order to confirm the framing effect. The mean values of 50 ratings were calculated for each pair in our dataset to compare with original similarity ratings in the WS-353 dataset. We filtered exactly the same 102 word pairs from the WS-353 to ensure the consistency between two settings. The distributions are found to be significantly different (p < 0.001, paired t-test).
Our preliminary results show that similarity ratings for some word pairs in the WS-353 dataset do not follow a normal distribution. Some of the distributions reveal that there are different perceptions of similarity, which gets highlighted by multiple peaks. A possible explanation is that the lower peak can be attributed to individuals that are aware of the factual differences between a “sun” or “star” and an actual planet orbiting a “star” while the others are not aware of it.
We compared the difference between the similarity ratings of sim (dssim) and that of inverted-sim (inverted-dissim) to verify third hypothesis. Scatter plot representations of similarity ratings in different word orders for the similarity question and the dissimilarity question reflect that the semantic relatedness in different orders do not take same mean values, indicating the semantic relatedness is asymmetric. The asymmetric relationship consistently appears in the different types of questions (i.e., similarity and dissimilarity.) The results show a remarkable difference between the similarity of “baby” to “mother” and the similarity of “mother” to “baby”. This indicates that the asymmetric relationship between mother and baby was reflected in the subjective similarity rating.
To measure the inter-rater reliability, we have computed the value of Krippendorff’s alpha for both the original dataset and for the one we obtained through the current analysis. Krippendorff’s alpha is a statistical measure that basically provides a highlight of the agreement achieved when encoding a set of units of analysis in terms of the values of a variable.
References L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1), 2002.
For more, see our full paper, Possible Confounds in Word-based Semantic Similarity Test Data, accepted in CSCW 2017.
Department of Information Technology
Indian Institute of Engineering Science and Technology,
MIT Media Lab, Recruit Institute of Technology
Md Mustafizur Rahman
Information Retrieval & Crowdsourcing Lab
University of Texas at Austin
ICSI, UC Berkeley