04/29/2020 – Palakh Mignonne Jude – VisiBlends: A Flexible Workflow for Visual Blends

SUMMARY

In this paper, the authors propose a flexible workflow to enable the creation of visual blends. They focus specifically on the ‘Hybrid’ visual metaphor, wherein objects are ‘fused together’, and propose a blending design pattern called ‘Single Shape Mapping’, which blends two objects with similar shapes by mapping all of one object into a part of the other. The workflow is decomposed into microtasks and consists of six steps – brainstorming, finding images, annotating images, detecting images that blend together (done by the system), automatically synthesizing the blends, and evaluating the blends (done by the user) – and users watched a 15-minute training session before starting the task. The authors evaluated the task decomposition with three studies: decentralized collaboration, group collaboration on blends for messages, and novice users with and without VisiBlends. They found that VisiBlends was helpful because it enabled people to meet all of the constraints associated with visual blending.
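
Since the system-side matching step is the most algorithmic part of the workflow, here is a minimal Python sketch of how shape-based matching might work, assuming each annotation simply records an object's basic shape, aspect ratio, and whether it fills the image; the class, field names, and threshold are my own placeholders, not the authors' implementation.

from dataclasses import dataclass

@dataclass
class Annotation:
    # Hypothetical image annotation: object name, basic shape, aspect ratio, coverage.
    # (Coverage would determine whether the whole object or only a part gets replaced.)
    name: str
    shape: str            # e.g. 'sphere', 'cylinder', 'box'
    aspect_ratio: float   # width / height of the object's bounding region
    covers_whole_image: bool

def blendable(a: Annotation, b: Annotation, ratio_tol: float = 0.25) -> bool:
    # Rough 'Single Shape Mapping' check: same basic shape and similar proportions.
    return a.shape == b.shape and abs(a.aspect_ratio - b.aspect_ratio) <= ratio_tol

# Toy example: an orange and a globe, both annotated as roughly 1:1 spheres.
orange = Annotation("orange", "sphere", 1.0, covers_whole_image=True)
globe = Annotation("globe", "sphere", 1.05, covers_whole_image=False)
print(blendable(orange, globe))  # True -> a candidate pair for automatic synthesis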

REFLECTION

I enjoyed reading this paper and found the motivation for this study compelling. It was interesting to see how creativity, which is predominantly a human affordance, was presented in a mixed-initiative setting. I liked the examples chosen throughout the paper; they helped me better understand the challenges associated with blending images (orange + healthy, with an apple as the health symbol) as well as appreciate the blends that were generated well (orange + healthy, with a health symbol that was not food).

The study on ‘decentralized collaboration’ reminded me of the paper ‘The Knowledge Accelerator: Big Picture Thinking in Small Pieces’, which was discussed last week. I liked the study on the ability of novice users to create these visual blends with and without VisiBlends. I also agree that a flexible, iterative workflow like the one used in the paper is very useful, as it helps users identify issues with the initial results and then improve upon them.

I liked that the authors discuss, in the ‘Discussion’ section, how creative design problems also have patterns. I think it would be very interesting to see a similar study conducted in the domain of story writing. Text poses a multitude of challenges, and a decomposed workflow like this one for writing would be very interesting, especially given that authors may have widely varied writing styles.

(Interestingly, “Wash your hands. It’s the smart move” was one of the messages used as part of this study.)

QUESTIONS

  1. What modifications can be made to the interface design of the VisiBlends website to better aid users in creating these visual blends? What are the main drawbacks of the existing interface?
  2. The authors propose the decomposed workflow for ‘Hybrid’ visual metaphors. Is it possible to create such a workflow for ‘Simile’ or ‘Contextual Metaphor’ visual blends? What kinds of changes would be required to enable this?
  3. The authors conduct a study to evaluate the usage of this system by novices. What results would have been obtained if the users of the system were people with a background in marketing but who were not graphic designers? Would their approach to the problem have been different?


04/29/2020 – Palakh Mignonne Jude – Accelerating Innovation Through Analogy Mining

SUMMARY

In this paper, the authors attempt to facilitate the process of finding analogies, with a view to boosting creative innovation, by exploring the value added by incorporating weak structural representations. They leverage the vast body of online information available (for this study, product descriptions from Quirky.com). They generate microtasks for crowdworkers designed to label the ‘purpose’ and ‘mechanism’ parts of a product description. The authors use GloVe word vectors to represent purpose and mechanism words and a BiRNN to learn the purpose and mechanism representations. In order to collect analogies, they use AMT crowd workers on 8,000 product descriptions. In the evaluation stage, the authors assess the usefulness of their algorithm by having participants redesign a product: 38 AMT workers were recruited to redesign a cell phone charger case, and 5 graduate students were recruited to evaluate the ideas generated by the workers. Based on a predefined criterion for ‘good’ ideas, 208 of the 749 total ideas were rated good by 2 judges, and 154 of the 749 by 3 judges. In both cases, the analogy approach proposed by the authors outperformed the TF-IDF baseline and the random model.
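
Since the purpose/mechanism representation is the technical core of the approach, here is a minimal Python sketch of the general architecture the summary describes: a bidirectional RNN over pre-trained embeddings that scores each token as purpose- or mechanism-related. The hidden size, the choice of a GRU, and the toy data are my own placeholders rather than the authors' exact model.

import torch
import torch.nn as nn

class PurposeMechanismTagger(nn.Module):
    # Sketch of a BiRNN that scores each token as purpose- or mechanism-related.
    # Assumes `pretrained` is a (vocab_size x 300) tensor of GloVe vectors; the
    # 2-unit output per token stands in for purpose/mechanism weights.
    def __init__(self, pretrained: torch.Tensor, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.birnn = nn.GRU(pretrained.size(1), hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)  # [purpose score, mechanism score]

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        states, _ = self.birnn(self.embed(token_ids))
        return torch.sigmoid(self.head(states))  # shape: (batch, seq_len, 2)

# Toy usage with random stand-in "GloVe" vectors and a fake tokenized description.
glove = torch.randn(1000, 300)
model = PurposeMechanismTagger(glove)
weights = model(torch.randint(0, 1000, (1, 12)))
# Weighted averages of the embeddings would then give purpose/mechanism vectors.
print(weights.shape)  # torch.Size([1, 12, 2])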

REFLECTION

I found the motivation of this study to be very good, especially the ‘bacteria-slot machine’ analogy example highlighted in the introduction of the paper. I agree that, given the vast amount of data available, a system that accelerates the process of finding analogies could very well aid quicker innovation and discovery.

I liked that the authors chose to present their approach using product descriptions. I also liked the use of ‘purpose’ and ‘mechanism’ annotations and feel that, given the more general domain of this study, the quality of annotations from the crowdworkers would be better than in the case of the paper on ‘SOLVENT: A Mixed Initiative System for Finding Analogies between Research Papers’.

I also liked that the authors presented results for the TF-IDF baseline, as it indicated the kind of near-domain matches such an approach would surface. I felt it was good that the authors included a feasibility criterion when judging ideas, i.e., whether an idea could be implemented using existing technologies.

Additionally, while I found that the study and methods proposed by this paper were good, I did not like the organization of the paper.

QUESTIONS

  1. How would you rate the design of the interface used to collect ‘purpose’ and ‘mechanism’ annotations? What changes might you propose to make this better?
  2. The authors do not mention the details or prior experience of the AMT workers. How much would workers’ prior experience influence their ability to generate ideas using this approach? Could this approach also aid more experienced people working on product innovation?
  3. The authors of SOLVENT leverage the mixed-initiative system proposed in this paper to find analogies between research papers. In which domains would this approach largely fail, even if modifications were made?


04/22/2020 – Palakh Mignonne Jude – Opportunities for Automating Email Processing: A Need-Finding Study

SUMMARY

In this paper, the authors conduct a mixed-methods investigation to identify users’ expectations for automated email handling as well as the information and computation required to support it. They divided their study into three probes – ‘Wishful Thinking’, ‘Existing Automation Software’, and ‘Field Deployment of Simple Inbox Scripting’. The first probe was conducted in two stages: a formative design workshop in which the researchers enlisted 13 computer science students well-versed in programming to create rules, and a survey of 77 participants from a private university, 48% of whom did not have technical backgrounds. The authors identified a need for automated systems to have richer data models, use internal and external context, manage attention, and alter the presentation of the inbox. In the second probe, the authors mined GitHub repositories to identify needs that programmers had already implemented. Additional needs they identified included processing, organizing, and archiving content; altering the default presentation of email clients; and email analytics and productivity tools. As part of the third probe, the authors deployed their ‘YouPS’ system, which lets users write email-processing rules in Python, with 12 email users (all of whom could code in Python). Common themes across the rules written included the creation of email modes, leveraging interaction history, and non-use of existing email client features. The authors found that users did indeed desire more automation in their email management, especially in terms of richer data models, internal and time-varying external context, and automated content processing.
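
To give a feel for what a Python email rule can look like, here is a minimal sketch using only the standard library (imaplib and email). It illustrates the general idea of a mode-dependent rule rather than the actual YouPS API, and the server, folder name, and ‘focus mode’ flag are placeholders.

import imaplib
import email
from email.header import decode_header

FOCUS_MODE = True  # placeholder for a user-defined "email mode"

def apply_rules(host: str, user: str, password: str) -> None:
    # Illustrative rule: while in focus mode, move newsletter mail out of the inbox.
    imap = imaplib.IMAP4_SSL(host)
    imap.login(user, password)
    imap.select("INBOX")

    _, data = imap.search(None, "UNSEEN")
    for num in data[0].split():
        _, msg_data = imap.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1])
        subject, enc = decode_header(msg.get("Subject", ""))[0]
        if isinstance(subject, bytes):
            subject = subject.decode(enc or "utf-8", errors="replace")

        # A simple content-based rule of the kind the probes surfaced.
        if FOCUS_MODE and "newsletter" in subject.lower():
            imap.copy(num, "Later")            # assumes a "Later" folder exists
            imap.store(num, "+FLAGS", "\\Deleted")

    imap.expunge()
    imap.logout()

# Example (placeholder credentials):
# apply_rules("imap.example.com", "user@example.com", "app-password")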

REFLECTION

I liked the overall motivation of the study and especially resonated with the need for automated content processing, as I would definitely benefit from having mail attachments downloaded and stored appropriately. The subjects who mentioned wanting a reaction to signal that a message was viewed reminded me of Slack’s interface, which allows you to ‘Add reaction’. I also believe that a tagging feature would help ensure that key respondents are alerted to the tasks they must perform (especially in the case of longer emails).

I liked the setup of Probe 3 and found it to be an interesting study. However, I wonder about the adoptability of such a system, and, as the authors mention in their future work, I would be very interested to know how non-programmers would make use of these rules via a drag-and-drop GUI.

The authors found that the subjects (10 out of 12) preferred to write rules in Python rather than use the mail client’s interface. This reminded me of prior discussions in class for the paper ‘Agency plus automation: Designing artificial intelligence into interactive systems’ wherein we discussed how humans prefer to be in control of the system and the level of automation that users desire (in a broader context).

QUESTIONS

  1. The studies included participants whose average age was under 30, most of whom were affiliated with a university. Would the needs of business professionals vary in any way from the ones identified in this study?
  2. Would business organizations be welcoming of a platform such as the YouPS system? Would this raise any security concerns considering that the system is able to access the data stored in the emails?
  3. How would you rate the design of the YouPS interface? Do you see yourself using such a system to develop rules for your email?
  4. Are there any needs, in addition to the ones mentioned in this paper, that you feel should be added?
  5. The authors state that even though two of the three probes focused on programmers, the needs identified were similar between programmers and non-programmers. Do you agree with this justification? Was there any bias that could have crept in as part of this experimental setup?


04/22/2020 – Palakh Mignonne Jude – SOLVENT: A Mixed Initiative System for Finding Analogies between Research Papers

SUMMARY

The authors attempt to help researchers find analogies in other domains in order to aid interdisciplinary research. They propose a modified annotation scheme that extends the work described by Hope et al. [1] and contains four elements – Background, Purpose, Mechanism, and Findings. The authors conduct three studies: the first sources annotations from domain-expert researchers, the second uses SOLVENT to find analogies with real-world value, and the third scales up SOLVENT through crowdsourcing. In each study, semantic vector representations were created from the annotations. In the first study, the dataset focused on papers from the CSCW conference and was annotated by members of the research team. In the second study, the researchers worked with an interdisciplinary team spanning bioengineering and mechanical engineering to identify whether SOLVENT can surface analogies not easily found through keyword or citation-tree searches. In the third study, the authors used crowd workers from Upwork and AMT to perform the annotations. The authors found that these crowd annotations had substantial agreement with researcher annotations, although the workers struggled with the purpose and mechanism annotations. Overall, the authors found that SOLVENT helped researchers find analogies more effectively.

REFLECTION

I liked the motivation for this paper, especially Study 3, which used crowdworkers for the annotations, and I was glad to learn that the authors found substantial agreement between crowdworker and researcher annotations. This was an especially welcome finding, as the corpus I work with also contains scientific text, and scaling its annotation has been a concern in the past.

As part of the second study, the authors mention that they trained a word2vec model on the 3,000 papers in the dataset curated from the three domains under consideration. This made me wonder about the generalizability of their approach. Would it be possible to generate scientific word vectors that span multiple domains? I think it would be interesting to see how such a system would measure up against the existing one. In addition to this, word2vec is known to have issues with out-of-vocabulary words, so I wondered whether the authors made any provisions to deal with them.
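
To make the out-of-vocabulary concern concrete, here is a minimal Python sketch, assuming a pre-trained model loaded as gensim KeyedVectors, of turning an annotation into a single vector by averaging word vectors; tokens missing from the vocabulary are simply skipped, which is exactly where a narrowly trained model loses information. This is an illustration of the issue, not the authors' exact pipeline.

import numpy as np
from gensim.models import KeyedVectors

def annotation_vector(text: str, kv: KeyedVectors) -> np.ndarray:
    # Average the vectors of in-vocabulary tokens; OOV tokens are skipped.
    tokens = text.lower().split()
    in_vocab = [kv[t] for t in tokens if t in kv]
    if not in_vocab:  # every token was OOV
        return np.zeros(kv.vector_size)
    return np.mean(in_vocab, axis=0)

# Usage (assumes a locally available model in word2vec text format):
# kv = KeyedVectors.load_word2vec_format("scientific_vectors.txt")
# purpose_vec = annotation_vector("identify analogies across research domains", kv)
# Two annotations could then be compared via cosine similarity of their vectors.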

QUESTIONS

  1. In addition to the domains mentioned by the authors in the discussion section, what other domains can SOLVENT be applied to and how useful do you think it would be in those domains?
  2. The authors used majority vote as the quality control mechanism for Study 3. What more sophisticated measures could be used instead of majority vote? Would any of the methods proposed in the paper ‘CrowdScape: Interactively Visualizing User Behavior and Output’ be applicable in this setting?
  3. How well would SOLVENT extend to the abstracts of Electronic Theses and Dissertations, which contain a mix of STEM as well as non-STEM research? Would any modifications be required to the annotation scheme presented in this paper?

REFERENCES

  1. Tom Hope, Joel Chan, Aniket Kittur, and Dafna Shahaf. 2017. Accelerating Innovation Through Analogy Mining. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 235–243.


04/15/2020 – Palakh Mignonne Jude – Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking

SUMMARY

The authors of this paper design and evaluate a mixed-initiative fact-checking approach that blends prior human knowledge with the efficiency of automated ML systems. They found that users tend to over-trust the model, which can degrade human accuracy. They conducted three randomized experiments: the first compares users who perform the task with or without viewing ML predictions, the second compares a static interface with an interactive one (which enables users to fix model predictions), and the third compares a gamified task design to a non-gamified one. The authors designed an interface that displays the claim, the predicted correctness, and relevant articles. For the first experiment, the authors considered responses from 113 participants, with 58 assigned to Control and 55 to System. For the second experiment, they considered responses from 109 participants, with 51 assigned to Control and 58 to Slider. For the third experiment, they considered responses from 106 participants and found no significant differences between the two groups.

REFLECTION

I liked the idea of a mixed-initiative approach to fact checking that builds on the affordances of both humans and AI. I thought it was good that the authors designed the experiments such that the confidence scores (and therefore the fallibility) of the system were openly shown to the users. I also felt that the interface design was concise and appropriate without being overly complex. I liked the design of the gamified approach and was surprised to learn that the game design did not impact participant performance.

I agree that, for this case in particular, participant demographics may affect the results, especially since the news articles considered were mainly related to American news. I wonder how much of a difference in the results would be observed in a follow-up study that considers different demographics. I also agree that caution must be exercised with such mixed-initiative systems, as imperfect data sets would have a considerable impact on model predictions, and humans should not blindly trust the AI predictions. It would definitely be interesting to see the results obtained when users check their own claims and interact with other users’ predictions.

QUESTIONS

  1. The authors explain that the incorrect statement on Tiger Woods was due to the model having learnt the bi-gram ‘Tiger Woods’ incorrectly – something that a more sophisticated classifier may have avoided. How much of an impact would such a classifier have made on the results obtained overall? Have other complementary studies been conducted?
  2. The authors found that a smaller percentage of users used the sliders than expected. They state that, while the sliders were intended to be intuitive, they may involve a learning curve that caused fewer users to adopt them. Would a tutorial that let users familiarize themselves with the sliders have helped in this case?
  3. Were the experiments conducted in this study adequate? Are there any other experiments that the authors should have conducted in addition to the ones mentioned?


04/15/2020 – Palakh Mignonne Jude – What’s at Stake: Characterizing Risk Perceptions of Emerging Technologies

SUMMARY

The authors of this paper adapt a survey instrument from existing risk perception literature to analyze the perception of risk surrounding newer, emerging data-driven technologies. The authors surveyed 175 participants (26 experts and 149 non-experts), categorizing as an ‘expert’ anyone working in a technical role or earning a degree in a computing field. Inspired by the original 1980s paper ‘Facts and Fears: Understanding Perceived Risk’, the authors consider 18 risks (15 new risks and 3 from the original paper). The 15 new risks include ‘biased algorithms for filtering job candidates’, ‘filter bubbles’, and ‘job loss from automation’. The authors also consider 6 psychological factors in the study. The non-experts (as well as a few participants who were later classified as ‘experts’) were recruited using MTurk. The authors borrowed quantitative measures from the original paper and added two new open-response questions: describing the worst-case scenario for the top three risks (as ranked by the participant) and naming any additional serious risks to society. The authors also propose a risk-sensitive design approach based on the results of their survey.

REFLECTION

I found this study to be very interesting and liked that the authors adapted the survey from existing risk perception literature. The motivation of the paper reminded me of a New York Times article titled ‘Twelve Million Phones, One Dataset, Zero Privacy’ and of the long-term implications of such data collection and its impact on user privacy.

I found it interesting that the survey results indicated that both experts and non-experts rated nearly all risks related to emerging technologies as characteristically involuntary. It was also interesting to learn that, despite consent processes built into software and web services, the corresponding risks were not perceived as voluntary. I thought it was good that the authors included the open-response question on what users perceived as the worst-case scenario for the top three riskiest technologies, and I liked that they provided some explanation for their survey results.

The authors mention that technologists should allow more discussion around data practices and be willing to hold off on rolling out new features that raise more concern than excitement. However, this made me wonder whether technology companies would be willing to do so. It would probably add overhead, and the results may not be perceived by the company to be worth the time and effort that such evaluations entail.

QUESTIONS

  1. In addition to the 15 new risks added by the authors for the survey, are there any more risks that should have been included? Are there any that needed to be removed or modified from the list? Are there any new psychological factors that should have been added?
  2. As indicated by the authors, there are gaps in the understanding of the general public. The authors suggest that educating the public would reduce this gap more easily than making the technology less risky. What is the best way to educate the public in such scenarios? What design principles should be kept in mind for the same?
  3. Have any follow-up studies been conducted to identify ‘where’ the acceptable marginal perceived risk line should be drawn on the ‘Risk Perception Curve’ introduced in the paper?  


04/08/2020 – Palakh Mignonne Jude – CrowdScape: Interactively Visualizing User Behavior and Output

SUMMARY

Multiple challenges exist in ensuring quality control over crowdwork, and they are not always easily resolved by simple methods such as gold standards or worker agreement. The authors of this paper therefore propose a new technique for quality control in crowdsourcing for more complex tasks. By utilizing features from worker behavioral traces as well as worker outputs, they help researchers better understand the crowd. As part of this research, the authors contribute novel visualizations of user behavior, new techniques for exploring crowdworker products, tools to group and classify workers, and mixed-initiative machine learning models that build on a user’s intuition about the crowd. They created CrowdScape, built on top of MTurk, which captures data from the MTurk API as well as from a Task Fingerprinting system in order to obtain worker behavioral traces. The authors discuss various case studies – such as translation, picking a favorite color, writing about a favorite place, and tagging a video – and describe the benefits of CrowdScape in each case.
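
As a rough illustration of the core idea of combining behavioral-trace features with output features, here is a minimal Python sketch that builds a per-worker feature table and clusters it; the feature names, toy values, and use of k-means are my own stand-ins, not CrowdScape's implementation.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-worker records: behavioral traces plus a simple output feature.
workers = pd.DataFrame([
    {"worker": "w1", "time_on_task_s": 310, "key_presses": 420, "scroll_events": 55, "output_len": 180},
    {"worker": "w2", "time_on_task_s": 35,  "key_presses": 12,  "scroll_events": 2,  "output_len": 175},
    {"worker": "w3", "time_on_task_s": 290, "key_presses": 390, "scroll_events": 48, "output_len": 165},
    {"worker": "w4", "time_on_task_s": 40,  "key_presses": 9,   "scroll_events": 1,  "output_len": 160},
])

features = workers.drop(columns="worker")
scaled = StandardScaler().fit_transform(features)

# Two clusters as a crude proxy for "engaged" vs. "copy-paste" behavior
# (note w2/w4: long outputs but almost no typing or scrolling recorded).
workers["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(workers[["worker", "cluster"]])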

REFLECTION

I found CrowdScape to be a very good system, especially considering the difficulty of ensuring quality control among crowdworkers for more complex tasks. For example, in the case of a summarization task, particularly for larger documents, there is no single gold standard that can be used, and it would be rare for multiple workers’ answers to match closely enough for majority vote to serve as a quality control strategy. Thus, for applications such as this, I think it is very good that the authors proposed a methodology that combines both behavioral traces and worker output, and I agree that it provides more insight than using either alone. I found the example of a requester seeking summaries of YouTube physics tutorials to be an appropriate one.

I also liked the visualization design that the authors proposed. They aimed to combine multiple views and made the interface easy for requesters to use. I especially found the 1-D and 2-D scatter-plot matrices, which show the distribution of features over the group of workers and enable dynamic exploration, to be well thought out.

I found the case study on translation to be especially well thought out, given that the authors deliberately included a sentence that did not parse well in computer-generated translations. I feel that such a strategy can be used in many translation-related tasks to more easily discard submissions from lazy workers. I also liked the case study on ‘Writing about a Favorite Place’, as it showed how CrowdScape performs in a situation where no two workers would provide the same response and traditional quality control techniques would not be applicable.

QUESTIONS

  1. The CrowdScape system was built on top of Mechanical Turk. How well does it extend to other crowdsourcing platforms? Is there any difference in the performance?
  2. The authors mention that workers who may possibly work on their task in a separate text editor and paste the text in the end would have little trace information. Considering that this is a drawback of the system, what is the best way to overcome this limitation?
  3. The authors use the case study on ‘Translation’ to demonstrate the power of CrowdScape to identify outliers. Could an anomaly detection model be trained to identify such outliers and aid researchers further?


04/08/2020 – Palakh Mignonne Jude – Agency plus automation: Designing artificial intelligence into interactive systems

SUMMARY

The authors of this paper aim to demonstrate the capabilities of various interactive systems that build on the complementary strengths of humans and AI while promoting human control and skillful action. The interactive systems the authors have developed span three areas: data wrangling, exploratory analysis, and natural language translation. In the Data Wrangling project, the authors enable users to create data-transformation scripts within a direct-manipulation interface augmented by predictive models. In the area of exploratory analysis, the authors developed ‘Voyager’, an interactive system that helps analysts engage in open-ended exploration as well as targeted question answering by blending manual and automated chart specification. The predictive translation memory (PTM) project aimed to blend machines’ capacity to automate rote tasks with the nuanced translation guidance that humans can provide. Through these projects, the authors found that various trade-offs exist in the design of such systems.

REFLECTION

The authors mention that users ‘may come to overly rely on computational suggestions’, and this statement reminded me of the paper ‘Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning’, in which the authors discovered that the data scientists who took part in the study over-trusted the interpretability tools.

I thought that the use of visualizations in the Data Wrangling project was a good idea, since humans often work well with visual representations, and this can speed up the task at hand. As part of previous coursework, my professor conducted a small in-class experiment in which he had us identify a red dot among multiple blue dots and then find a piece of text in a table. As expected, we identified the red dot much more quickly, attesting to the fact that visual aids often help humans work faster. The interface of the ‘Voyager’ system reminded me of the ‘Tableau’ data visualization software. In the case of the predictive translation memory (PTM) project, I found it interesting that the authors mention the trade-off between customers wanting translators who produce more consistent results and human translators who experienced a ‘short-circuiting’ of thought when using the PTM tool.

QUESTIONS

  1. Given that there are multiple trade-offs that need to be considered while formulating the design of such systems, what is the best way to reduce this design space? What simple tests can be performed to evaluate the feasibility of each of the systems designed?
  2. As mentioned in the case of the PTM project, customers hiring a team of translators prefer more consistent results which can be aided by MT-powered systems. However, one worker found that the MT ‘distracts from my own original translation’. Specifically in the case of natural language translation, which of the two do you find to be more important, the creativity/original translation of the worker or consistent outputs?
  3. In each of the three systems discussed, the automated methods suggest actions, while the human user is the ultimate decision maker. Are there any biases that the humans might project while making these decisions? How much would these biases affect the overall performance of the system?


03/25/2020 – Palakh Mignonne Jude – Evaluating Visual Conversational Agents via Cooperative Human-AI Games

SUMMARY

In this paper, the authors design a cooperative game called GuessWhich (inspired by the 20-Questions game) to measure the performance of human-AI teams in the context of visual conversational agents. The AI system, ALICE, is based on the ABOT developed by Das et al. in a prior study measuring the performance of AI-AI teams. Two variants of ALICE are considered: ALICE-SL (trained in a supervised manner on the Visual Dialog dataset) and ALICE-RL (pre-trained with supervised learning and fine-tuned using reinforcement learning). The game is designed such that the human is the ‘questioner’ and the AI (ALICE) is the ‘answerer’. Both are given a caption that describes an image; ALICE is shown this image, and the human can ask multiple questions (9 rounds of dialog) to better understand it. After these rounds, the human must select the correct image from a set of distractor images that are semantically similar to the target image. The authors found that, contrary to expectation, improvements in AI-AI performance do not translate into improvements in human-AI performance.
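
To make the retrieval side of this setup concrete, here is a minimal Python sketch of computing the rank of the target image within a pool given a predicted image embedding, assuming embeddings are compared by Euclidean distance. It mirrors the general idea behind a mean-rank style metric for the AI-AI setting rather than reproducing the authors' evaluation code.

import numpy as np

def target_rank(predicted: np.ndarray, pool: np.ndarray, target_idx: int) -> int:
    # Rank (1 = best) of the target image when the pool is sorted by
    # Euclidean distance to the predicted embedding.
    dists = np.linalg.norm(pool - predicted, axis=1)
    order = np.argsort(dists)
    return int(np.where(order == target_idx)[0][0]) + 1

# Toy usage: a pool of 20 random image embeddings, one of which is the target.
rng = np.random.default_rng(0)
pool = rng.normal(size=(20, 64))
target_idx = 7
predicted = pool[target_idx] + rng.normal(scale=0.1, size=64)  # noisy guess
print(target_rank(predicted, pool, target_idx))  # a small rank means a good guess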

REFLECTION

I like the gamification approach that the authors adopted for this study and believe that the design of the game works well in the context of visual conversational agents. The authors mention how they aimed to ensure that the game was ‘challenging and engaging’. This reminded me of the class discussion of the paper ‘Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Research’ and how researchers often put in extra effort to make tasks for crowd workers more engaging and meaningful. I also liked the approach used to identify ‘distractor’ images and felt that it helped make the game challenging for the crowd workers.

I thought it was interesting to learn that the AI-ALICE teams outperformed the human-ALICE teams. I wonder whether this is influenced by the fact that ALICE could get some answers wrong, and how that might affect the mental model formed by the human. I also thought it was good that the authors took knowledge leak into account and ensured that the human workers could only play a fixed number (10) of games.

I also liked that the authors gave performance-based incentives to the workers that were tied to the success of the human-AI team. I thought that it was good that the authors published the code of their design as well as provided a link to an example game.

QUESTIONS

  1. As part of the study conducted in this paper, the authors design an interactive image-guessing game. Can similar games be designed to evaluate human-AI team performance in other applications? What other applications could be included?
  2. Have any follow-up studies been performed to evaluate the QBOT in a ‘QBOT-human team’? In this scenario, would QBOT-RL outperform QBOT-SL?
  3. The authors found that some of the human workers adopted a single word querying strategy with ALICE. Is there any specific reason that could have caused the humans to do so? Would they have questioned a human in a similar fashion? Would their style of querying have changed if they were unaware if the other party was a human or an AI system?


03/25/2020 – Palakh Mignonne Jude – “Like Having a Really Bad PA”: The Gulf between User Expectation and Experience of Conversational Agents

SUMMARY

The authors of this paper aim to understand the interactional factors that affect conversational agents (CAs) such as Apple’s Siri, Google Now, Amazon’s Alexa, and Microsoft’s Cortana. They conducted interviews with 14 participants (continuing to recruit until theoretical saturation was reached) and identified users’ motivations, their types of use, the effort involved in learning to use a CA, how users evaluate CAs, and the issues that affect user engagement. Through their study, the authors found that user expectations were not met by the CAs. They also found that the primary use case for these CAs was performing tasks ‘hands free’, and that users were more likely to trust the CA with tasks that needed less precision (such as setting an alarm or asking about the weather) than with tasks that required more precision (such as drafting an email). For tasks needing more precision, users tended to rely on visual confirmation to ensure that the CA had not made any mistakes. The authors suggest that it would help if CAs enabled users to learn about the system’s capabilities more effectively and if the goals of the system were more clearly defined.

REFLECTION

I found this paper to be very interesting; however, given that it was written in 2016, I wonder whether a follow-up study has been performed to evaluate CAs and their improvement over the past few years. I liked the paper’s description of ‘structural patterns’ and how, as humans, we often use non-verbal cues to ascertain the mood of another person, which would be challenging to achieve with current conversational agents. I also found it interesting to learn that people found excess politeness repulsive when they knew they were interacting with a machine, while they expected politeness in interactions with humans. I agree that these CAs must be designed in such a way that naïve, uninformed users can use them with ease in everyday situations.

I thought it was interesting that the authors mention the CAs’ inability to carry contextual understanding across interactions, especially for follow-up questions that might be asked. If CAs are to conduct conversations in a more human-like manner, I believe this is an important factor to consider. As someone who is not an avid user of CAs, I am unaware of the current progress that has been made toward improving this aspect of CAs.

As indicated by the paper, I remember having used ‘play’ as a point of entry when I first started using my Google Home, via the ‘Pikachu Talk’ feature. I also found it interesting to learn how, in this case as well, humans form mental models regarding the capabilities of CA systems.

QUESTIONS

  1. How have conversational agents evolved over the past few years since this paper was published?
  2. Which CA among Cortana, Siri, Google Now, and Alexa has made the most progress and has the best performance? Are any of these systems capable of maintaining context when communicating with users? Which of these conversations seem most human-like?
  3. Considering that this study was conducted with users that mainly used Siri, has a follow-up comparative study been performed that evaluates the performance of each of the available CAs and illustrates the strengths and weaknesses of each of these systems?
