Summary
The paper by Chan et al. is an interesting read on finding analogies, with scientific papers as the main domain considered. The annotation scheme divides an abstract into four categories: background, purpose, mechanism, and findings. The paper’s goal is to make it easier for researchers to find related work in their field. The authors conducted three studies to test their approach and its feasibility. The first study consisted of domain experts annotating abstracts in their research area. The second study focused mainly on how the model could tackle the real-world problem where a researcher needs to find relevant papers in their area to act as inspiration, related work, or even baselines for their experiments. The third study was very different from the first two, in which an experienced researcher annotated the data or used the system to solve their research problem: it instead used crowdworkers to annotate abstracts. The platforms utilized by the authors were Upwork and Amazon Mechanical Turk.
Reflection
The mixed-initiative model developed by the authors is an excellent step in the right direction for finding analogies in scientific papers. There are traditional approaches in natural language processing that help find similarities in textual data. The website [1] gives an excellent insight into the steps involved in finding similarities in texts. However, when it comes to scientific data, just using these steps is not enough. Most of the models involved are trained on generic web and news data (like CNN or DailyMail), so much of the scientific jargon is “out of vocabulary” (OOV) for them. I therefore appreciate that the authors used human annotations along with traditional information retrieval methods (like TF-IDF) to tackle the problem at hand.
Additionally, when computing similarity, they took combinations of categories into account, like Purpose+Mechanism. This is definitely useful when finding similarities in text data. I also liked the fact that, for the studies, they considered ordinary crowdworkers in addition to people with domain knowledge. I was intrigued to find that 75% of the time, the annotations of crowdworkers matched those of the researchers. Hence the conclusion that “crowd annotations still improve analogy-mining” is valuable. Moreover, recruiting a large number of researchers from one domain just to annotate data is difficult, and sometimes there are very few people working in a given research area. Rather than having to find researchers who are available to annotate such data, it is good that we can use existing methods available.
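To make the Purpose+Mechanism idea concrete, here is a minimal sketch of how one could rank papers by TF-IDF cosine similarity over only the purpose and mechanism annotations. This is my own illustration and not the authors’ implementation: the function names, the dictionary layout of the annotations, the whitespace tokenizer, the smoothed IDF formula, and the toy abstracts are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; a real system would do more).
    return text.lower().split()


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency; one common variant among several.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def purpose_mechanism_text(annotation):
    # Keep only the purpose and mechanism spans; ignore background and findings.
    return annotation["purpose"] + " " + annotation["mechanism"]


def rank_by_purpose_mechanism(query, corpus):
    # Return (paper_id, score) pairs sorted from most to least similar.
    docs = [purpose_mechanism_text(query)]
    docs += [purpose_mechanism_text(p) for p in corpus.values()]
    vecs = tfidf_vectors(docs)
    scores = [(pid, cosine(vecs[0], v)) for pid, v in zip(corpus, vecs[1:])]
    return sorted(scores, key=lambda pair: -pair[1])


# Hypothetical annotated abstracts, invented for illustration only.
query = {"purpose": "find related papers for researchers",
         "mechanism": "match annotated abstracts with tf-idf"}
corpus = {
    "A": {"purpose": "help researchers find related papers",
          "mechanism": "compare annotated abstracts using tf-idf"},
    "B": {"purpose": "reduce battery drain on phones",
          "mechanism": "schedule background tasks in batches"},
}
ranked = rank_by_purpose_mechanism(query, corpus)
print(ranked)
```

In this toy setup, paper A shares most of its purpose and mechanism words with the query, while paper B shares none, so A is ranked first. Swapping the crowdworker annotations for expert ones would only change the contents of the `purpose` and `mechanism` fields, which is where annotation quality would enter such a pipeline.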
Lastly, I liked that the paper identified its limitations very well and clearly laid out the scope for future work.
Questions
- Would you agree that the level of expertise of the human annotators would not affect the results for your course project? If yes, could you please clarify?
(For my class project, I think I would agree with the paper’s findings. I work on reference string parsing, and I don’t think we need experts just to label the tokens.)
- Could we have more complex categories or sub-categories rather than just the four identified?
- How would this extend to longer pieces of text, like chapters of book-length documents?
References
Hi Bipasha. While the paper claims that the system quality is not critically dependent on annotator expertise and that the system can be scaled using crowd workers as annotators, I am not entirely convinced.
The results showed that the annotations of Upwork workers matched expert annotations 78% of the time and those of MTurk workers matched 59% of the time. They also showed that agreement varied considerably: a few papers had 96% agreement while a few had only 4%. I am a little skeptical regarding these numbers, and I feel that expert annotations might not be entirely dispensable. Using crowd workers might help the system scale, but it might have a negative impact on quality.
This, of course, is domain-specific. As you mentioned, for reference string parsing, which involves token labeling, I agree with you that the level of expertise of annotators is not an important factor. When it comes to annotating research papers in certain complex domains, I feel that identifying the purpose, mechanism, and findings of the paper might require technical knowledge to fully understand the paper and make an informed decision.