04/15/20 – Lulwah AlKulaib-BelieveItOrNot

April 15, 2020 Lulwah AlKulaib 2 Comments

Summary

Fact checking is important to be done in a timely manner, especially nowadays when it’s used on live TV shows. While existing work presents many automated fact-checking systems, the human-in-the-loop is neglected.This paper presents the design and evaluation of a mixed initiative approach to fact-checking. The authors combine human knowledge and experience with the efﬁciency and scalability of automated information retrieval and machine learning. The authors present a user study in which participants used the proposed system to help their own assessment of claims. The presented results suggest that individuals tend to trust the system especially that participant accuracy assessing claims improved when exposed to correct model predictions.Yet, the participants’ trust is overestimated when the model was wrong. The exposure to the system’s predictions often reduced human accuracy. Participants that were given the option to interact with these incorrect predictions were often able improve their own performance. This suggests that in order to have better models, they have to be transparent especially when it comes to human-computer interaction as AI models might fail and humans could be the key factor in correcting them.

Reflection

I enjoyed reading this paper. It was very informative on the importance of transparent models in AI and machine learning. Also how transparent models could make the performance better when we include the human in the loop.

In their limitations, the authors discuss important points in relying on crowdworkers. They explain how MTurk participants should not all be given the same weight when looking at their responses since different participant demographics or incentives may inﬂuence ﬁndings. For example, non-US MTurk workers may not be representative of American news consumers or familiar with the latest and that could affect their responses. The authors also acknowledge that MTurk workers are paid by the task and that could cause some of them to respond by agreeing with the model’s response when in reality that is not the case, just so they could acquire the HIT and get paid. They found a minority of these responses and it made me think of ways to mitigate it. Like the papers from last week studying the behavior of an MTurk worker while completing the task might be an indicator if the worker actually agrees with the model or it is just to get paid.

The authors mention the negative impact that could potentially stem from their work and that could be as we saw in their experiment the model did a mistake but the humans over trusted it. The dependability on AI and technology makes users give them credit more than they should and such errors could impact the users perception of the truth.Addressing these limitations should be an essential requirement for further work.

Discussion

Where would you use a system like this most?
How would you suggest to mitigate errors produced by the system?
As humans, we trust AI and technology more than we should, how would you redesign the experiment to ensure that the crowdworkers actually check the presented claims?

04/15/20 – Fanglan Chen – Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking

April 15, 2020 Fanglan Chen 2 Comments

Summary

Nguyen et al.’s paper “Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking” explores the usage of automatic fact-checking, the task of assessing the veracity of claims, as an assistive technology to augment human decision making. Many previous papers propose automated fact-checking systems, but few of them consider the possibility to have humans as part of a Human-AI partnership to complete the same task. By involving humans in fact-checking, the authors study how people would understand, interact with, and establish trust with an AI fact-checking system. The authors introduce their design and evaluation of a mixed-initiative approach to fact-checking, leveraging people’s judgment and experience with the efficiency and scalability of machine learning and automated information retrieval. Their user study shows that crowd workers involved in the task tend to trust the proposed system with improved participant accuracy with the access to claims when exposed to correct model predictions. But sometimes the trust is so strong that getting exposure to the model’s incorrect predictions reduces their accuracy in the task.

Reflection

Overall, I think this paper conducted an interesting study on how the proposed system actually influences human’s access to the factuality of claims in the fact-checking task. However, the model transparency studied in this research is different from what I expected. When talking about model transparency, I am expecting an explanation of how the training data is collected, what variables are used to train the model, and how the model works in a stepwise process. In this paper, the approach to increase the transparency of the proposed system is to release the source articles based on what the model provides a true or false judgment on the given claim. The next step is letting the crowd workers in the system group go through each source of news articles and see if that makes sense and whether they agree or disagree on the system’s judgment. In this task, I feel a more important transparency problem here is how the model gets the retrieved articles and how it ranks them in a presented way. Some noises in the training data may bring some bias in the model, but there is little we can tell merely based on checking the retrieved results. That makes me think that there might be different levels of transparency, at some level, we can check the input and output at each step, and at another level, we may get exposure to what attributes the model actually uses to make the prediction.

The authors conducted three experiments with a participant survey on how users would understand, interact with, and establish trust with a fact-checking system and how the proposed system actually influences users’ access to the factuality of claims. The experiments are conducted via a comparative study between a control group and a system group to show that the proposed system actually works. Firstly, I would like to know if the randomly recruited workers in the two groups have some differences among demographics that may potentially have an impact on the final results. Is there a better way to conduct such experiments? Secondly, the performance difference between the two groups in regard to human error is so small and there is no additional proof that the performance difference is statistically significant. Thirdly, the paper reports the experimental results on five claims, even with a claim that has incorrectly supportive articles (claim 3), which seems not to be representative. The task is kind of misleading. Would it be better with quality control of the claims in the task design?

Discussion

I think the following questions are worthy of further discussion.

Do you think with the article source presented by the system that the users develop more trust about the system?
What are the reasons behind that some claims with the retrieval results of the proposed system downgrade the human performance in the fact-checking task?
Do you think there is any flaw in the experimentation design? Can you think of a way to improve it?
Do you think we need personalized results in this kind of task where the ground truth is provided? Why or why not?

04/15/2020 – Palakh Mignonne Jude – Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking

April 15, 2020 Palakh Mignonne Jude 1 Comment

SUMMARY

The authors of this paper design and evaluate a mixed-initiative fact-checking approach that blends prior human knowledge with the efficiency of automated ML systems. The authors found that users tend to over-trust the model which could degrade human accuracy. They conducted three randomized experiments – the first, compares user who perform the task with or without viewing ML predictions, the second, compares a static interface with an interactive one (that enables users to fix model predictions), and the third, compares a gamifies task design to a non-gamified one. The authors designed an interface that displays the claim, the predicted correctness, and relevant articles. For the first experiment, the authors considered responses from 113 participants with 58 assigned to Control and 55 to System. For the second experiment, the authors considered responses from 109 participants with 51 assigned to Control and 58 to Slider. For the third experiment, the authors considered responses from 106 participants, and found no significant differences between the two groups.

REFLECTION

I liked the idea of a mixed-initiative approach to fact checking that builds on the affordances of both humans and AI. I found that it was good that the authors designed the experiments such that the confidence scores (and therefore the fallibility) of the system was openly shown to the users. I also felt that the interface design was concise and appropriate without being overly complex. I also liked the design of the gamified approach and was surprised to learn that the game design did not impact participant performance.

I agree that for this case in particular, participant demographics may affect the results. Especially since the news articles considered were mainly related to American news. I wonder how much if a difference in the results would be observed in a follow-up study that considers different demographics as compared to this study. I also agree that caution must be exercised with such mixed-initiative systems as imperfect data sets would have a considerable impact on model predictions and that the humans should not blindly trust the AI predictions). It would definitely be interesting to see the results obtained when users check their own claims and interact with other user’s predictions.

QUESTIONS

The authors explain that the incorrect statement on Tiger Woods was due to the model having learnt the bi-gram ‘Tiger Woods’ incorrectly – something that a more sophisticated classifier may have avoided. How much of an impact would such a classifier have made on the results obtained overall? Have other complementary studies been conducted?
The authors found that a smaller percentage of users used the sliders than expected. They state that while the sliders were intended to be intuitive, they may require a learning curve causing lesser users to adopt it. Would the use of a tutorial that enabled users to familiarize themselves have helped in this case?
Were the experiments conducted in this study adequate? Are there any other experiments that the authors should have conducted in addition to the ones mentioned?

04/15/20 – Lee Lisle – Believe it or Not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking

April 15, 2020 Lorance R Lisle 1 Comment

Summary

Ngyuen et al’s paper discusses the rise of misinformation and the need to combat it via tools that can verify claims while also maintaining users’ trust of the tool. They designed an algorithm that finds sources that are similar to a given claim to determine whether or not the claim is accurate. They also weight the sources based on esteem. They then ran 3 studies (with over 100 participants in each) where users could interact with the tool and change settings (such as source weighting) in order to evaluate their design. The first study found that the participants trusted the system too much – when it was wrong, they tended to be inaccurate, and when it was right, they were more typically correct. The second study allowed participants to change the inputs and inject their own expertise into the scenario. This study found that the sliders did not significantly impact performance. The third study focused on gamification of the interface, and found no significant difference.

Personal Reflection

I enjoyed this paper from a 50,000 foot perspective, as they tested many different interaction types and found what could be considered negative results. I think papers that show that all work is not necessarily good have a certain amount of extra relevance – they certainly show that there’s more at work than just novelty.

I especially appreciated the study on the effectiveness of gamification. Often, the prevailing theory is that gamification increases user engagement and increases the tools’ effectiveness. While the paper is not conclusive that gamification cannot do this, it certainly lends credence to the thought that gamification is not a cure-all.

However, I took some slight issue with their AI design. Particularly, the AI determined that the phrase “Tiger Woods” indicated a supportive position. While their stance was that AIs are flawed (true), I felt that this error was quite a bit more than we can expect from normal AIs, especially ones that are being tweaked to avoid these scenarios. I would have liked to see experiment 2 and 3 improved with a better AI, as it does not seem like they cross-compared studies anyway.

Questions

Does the interface design including a slider to adjust source reputations and user agreement on the fly seem like a good idea? Why or why not?
What do you think about the attention check and its apparent failure to accurately check? Should they have removed the participants with incorrect answers to this check?
Should the study have included a pre-test to determine how the participants’ world view may have affected the likelihood of them agreeing with certain claims? I.E., should they have checked to see if the participants were impartial, or tended to agree with a certain world view? Why or why not?
What benefit do you think the third study brought to the paper? Was gamification proved to be ineffectual, or is it a design tool that sometimes doesn’t work?

04/15/2020 – Vikram Mohanty – Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking

April 15, 2020 Vikram Mohanty 4 Comments

Authors: An T. Nguyen, Aditya Kharosekar , Saumyaa Krishnan, Siddhesh Krishnan, Elizabeth Tate, Byron C. Wallace, and Matthew Lease

Summary

This paper proposes a mixed-initiative approach to fact-checking, combining human and machine intelligence. The system automatically finds and retrieves relevant articles from a variety of sources. It then infers the degree to which each article supports or refutes the claim, as well as the reputation of each source. Finally, the system aggregates this body of evidence to predict the veracity of the claim. Users can adjust the source reputation and stance of each retrieved article to reflect their own beliefs and/or correct any errors according to them. This will, in turn, update the AI model. The paper evaluates this approach through a user study on Mechanical Turk.

Reflection

This paper, in my opinion, succeeds as a nice implementation of all the design ideas we have been discussing in the class for mixed-initiative systems. It factors in user input, combined with an AI model output, and shows users a layer of transparency in terms of how the AI makes the decision. However, fact-checking, as a topic, is complex enough not to warrant a solution in the form of a simplistic single-user prototype. So, I view this paper as opening up doors for building future mixed-initiative systems that can rely on similar design principles, but also factor in the complexities of fact-checking (which may require multiple opinions, user-user collaboration, etc).

Therefore, for me, this paper contributes an interesting concept in the form of a mixed-initiative prototype, but beyond that, I think the paper falls short of making it clear who the intended users are (end-users or journalists) or the intended scenario it is designed for. The evaluation with Turkers seemed to indicate that anyone can use it, which opens up the possibility of creating individual echo-chambers very easily and essentially, making the current news consumption landscape worse.

The results also showed the possibility of AI biasing users when it’s wrong, and therefore, a future design would have to factor in that. One of the users felt overwhelmed as there was a lot going on with the interface, and therefore, a future system needs to address the issue of information overdose.

The authors, however, did a great job discussing these points in detail about the potential misuse and some of the limitations. Going forward, I would love to see this work forming the basis for a more complex socio-technical system, that allows for nuanced inputs from multiple users, interaction with a fact-checking AI model that can improve over time, and a longitudinal evaluation with journalists and end-users on actual dynamic data. The paper, despite the flaws arising due to the topic, succeeds in demonstrating human-AI interaction design principles.

Questions

What are some of the positive takeaways from the paper?
Did you feel that fact-checking, as a topic, was addressed in a very simple manner, and deserves more complex approaches?
How would you build a future system on top of this approach?
Can a similar idea be extended for social media posts (instead of news articles)? How would this work (or not work)?

04/15/2020 – Yuhang Liu – Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking

April 15, 2020 yuhang Liu 2 Comments

Summary:

This article discusses a system about fact detection. First of all, the article proposes that fact detection is a very important, challenging and time-sensitive. Usually, in this type of system, the human influence on the system is ignored, but the human influence is very important in this type of system. Therefore, this article establishes a hybrid startup system for fact checking. Enable users to interact with ML predictions to complete challenging fact checks. The author designed the interface through which the user can know the source of the prediction. In some applications, when the users know the prediction results but is not satisfied, the author also allows the user to use his own beliefs or inferences to cover these predictions. Through this system, the authors have come to a conclusion that when the model’s results are correct, these predictions will have a very positive impact on people. However, people should not overly trust the model’s predictions, when users think the predictions is wrong, the prediction result can be improved through interactive methods. And this also reflects the importance of a transparent, interactive system for fact detection from the side.

Reflection:

When I saw the title of this article, I thought that this article maybe have a same topic with my project, using crowdsourced workers to distinguish fake news, but when I read it to a certain extent, I found that this is not the case. But I think it affirmed my thinking in some aspects. First, fact detection is a very challenging project, especially when real-time is needed, so it is very necessary to rely on human power, and due to lack of Marked data, so if you want to directly complete the task through machine learning, in some cases, the prediction results will point in a completely opposite direction. For example, in my project, rumors and refuting rumors are both may be considered as a rumor, so we need crowd workers to distinguish it.

Secondly, for the project itself mentioned in the article, I think its method is a very good direction. First of all, human judgment is particularly important in this kind of system. This is also the main idea of many human-computer interaction systems to improve accuracy through humans. I think this method in the article is a good start. In a transparent system, let people decide whether to cover the forecast results. Not only do they not force people to participate in the system, but also let people make predictions There are very important weights.

But at the same time, I think the system also has some of the limitations described in the article. For example, the purpose of crowdsourcing workers and its own concerns may affect the results of the final system, so I think the article proposes a good direction, but we need to be more Careful research.

Problem:

Do you think users usually can find the prediction is incorrect and cover it when a system is wrong?
What role does the transparency of the system play in the interaction?
How to prevent users trust the prediction too far in other human and computer interaction systems?

04/15/2020 – Bipasha Banerjee – Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking

April 15, 2020 bipashab 1 Comment

Summary

The paper emphasizes the importance of a mixed-initiative model for fact-checking. It points out the advantages of humans and machines working closely together to verify the veracity of the facts. The paper’s main aim from the mixed-initiative approach was to make the system, especially the user interface, more transparent. The UI presents a claim to the user along with a list of articles related to the statement. The paper also mentions all the prediction models that have been used to create the UI experience. Finally, the authors conducted three experiments using crowd workers who had to predict the correctness of claims presented to them. In the first experiment, the users were shown the results page without the prediction of the truthfulness of the claim. Users were subsequently divided into two subgroups, where one group was given slightly more information. In the second experiment, the crowdworkers were presented with interactive UI. They, too, were further divided into two subgroups, with one group having the power to change the initial predictions. The third experiment was a gamified version of the previous experiment. The authors concluded that human-ai collaboration could be useful, although the experiment brought into light some contradictory findings.

Reflection

I agree with the author’s approach that the transparency of a system leads to the confidence of the user using a particular system. My favorite thing about the paper is that the authors describe the systems very well. They do a very good job of describing the AI models as well as the UI design and give a good explanation to their decisions. I also enjoyed reading about the experiments that they conducted with the crowdworkers. I had a slight doubt about how the project handled latency, especially when the related articles were presented to the workers in real-time.

I also liked how the experiments were conducted in sub-groups, with a group having information not presented to the other. This shows that a lot of use cases were thought of when the experimentation took place. I agree with most of the limitations that the authors wrote. I particularly agree that if the veracity of predictions is shown to the users, there is a high chance of that influencing people. We, as humans, have a tendency to believe machines and its prediction blindly.

I would also want to see the work being performed on another dataset. Additionally, if the crowdworkers have knowledge about the domain in the discussion, how does that affect the performance? It is definite that having knowledge would improve detecting the claim of a statement. Nonetheless, this might help in determining to what extent. A potential use case could be researchers reading claims from research papers in their domain and assessing their correctness.

Questions

How would you implement such systems in your course project?
Can you think of other applications of such systems?
Is there any latency associated when the user is produced with the associated articles?
How would the veracity claim system extend to other domains (not news based)? How would it perform on other datasets?
Would experience (in one domain) crowdworkers perform better? The answer is likely yes, but how much? And how can this help improve targeted systems (research paper acceptance, etc.)?

04/15/2020 – Subil Abraham – Nguyen et al., “Believe it or not”

April 15, 2020 Subil Abraham Leave a comment

In today’s era of fake news where new information is constantly spawning everywhere, the great importance of fact checking cannot be understated. The public has a right to remain informed and be able to obtain true information from accurate, reputable sources. But all too often, people are inundated with too much information and the cognitive load of fact checking everything would be too much. Automated fact checking has made strides but previous work has focused primarily on model accuracy and not on the people who need to use them. This paper is the first to study an interface for humans to use a fact checking tool. The tool is pretrained on the Emergent dataset of annotated articles and sources and uses two models, one that predicts article stance on a claim and the other that calculates the accuracy of the claim based on the reputation of the sources. The application works by taking a claim and retrieving articles that talk about the claim. It uses the article stance model to classify if the articles are for or against the given claim, and then predicts the claim’s accuracy based on the collective reputation of its sources. It conveys that its models are not accurate and provides confidence levels for its accuracy claims. It also provides sliders for the human verifiers to adjust the predicted stance of the articles and also to adjust the source reputation according to their beliefs or new information. The authors run three experiments to test the efficacy of the tool for human fact checkers. They find that the users tend to trust the system, which can be problematic when the system is inaccurate.

I find it interesting that for the first experiment, the System group’s error rate somewhat follows the stance classifiers error rate. The crowd workers are probably not going through the process of independently verifying the stance of the articles and simply trust the predicted stance they are shown. Potentially this could be mitigated by adding incentives (like extra reward) to have them actually read the articles in full. But on the flip side, we can see that their accuracy (supposedly) becomes better when they are given the sliders to modify the stances and reputation. Maybe that interactivity was the clue they needed to understand that the predicted values aren’t set in stone and could potentially be inaccurate. Though I find it strange that the Slider group in the second experiment did not adjust the sliders if they were questioning the sources. What I find even stranger though is the fact that the authors decided to keep the claim that allowing the users to use the sliders made them more accurate. This claim is what most readers would take away unless they were carefully reading the experiments and the riders. And I don’t like that they kept the second experiment results despite them not showing any useful signal. Ultimately, I don’t buy into their push that this tool is something that is useful for the general user as it stands now. And I also don’t really see how this tool could serve as a technological mediator for people with opposing views, at least not the way they described it. I find that this could serve as a useful automation tool for expert fact checkers as part of their work but not for the ordinary user, which is what they model by using crowdworkers. I like the ideas that the paper is going for, of having automated fact checking that helps for the ordinary user and I’m glad they acknowledge the drawbacks. But I think there are too many drawbacks that prevent me from fully buying into the claims of this paper. It’s poetic that I have my doubts about the claims of a paper describing a system that asks you to question claims.

Do you think this tool would actually be useful in the hands of an ordinary user? Or would it serve better in the hands of an expert fact checker?
What would you like to see added to the interface, in addition to what they already have?
This is a larger question, but is there value in having the transparency of the machine learning models in the way they have done (by having sliders that we can manipulate to see the final value change)? How much detail is too much? What about for more complex models where you can’t have that instantaneous feedback (like style transfer) how do you provide explainability there?
Do you find the experiments rigorous enough and conclusions significant enough to back up the claims they are making?

04/15/20 – Jooyoung Whang – Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking

April 15, 2020 Jooyoung Whang 1 Comment

In this paper, the authors state that the current fully automatic fact-checking systems are not good enough for three reasons: model transparency, taking world facts into consideration, and model uncertainty communication. So, the authors went on and built a system including humans in the loop. Their proposed system uses two classifiers that each predict the reliability of a supporting document of a claim and the veracity of the document. Using these weighted classifications, the confidence of the system’s prediction about a claim is shown to the user. The users can further manipulate the system by modifying the weights of the system. The authors conducted a user study of their system with Mturk workers. The authors found that their approach was effective, but also noted that too much information or misleading predictions can lead to big user errors.

First off, it was hilarious that the authors cited Wikipedia to introduce Information Literacy in a paper about evaluating information. I personally took it as a subtle joke left by the authors. However, it also led me to a question about the system. If I did not miss it, the authors did not explain where the relevant sources or articles came from that supported a claim. I was a little concerned if some of the articles used in the study were not reliable sources.

Also, the users conducted the user study using their own defined set of claims. While I understand this was needed for efficient study, I wanted to know how the system would work in the wild. If a user searched a claim that he or she knows is true, would the system agree with high confidence? If not, would the user have been able to correct the system using their interface? It seemed that some portion of the users were confused, especially with the error correction part of the system. I think these would have been valuable to know and would seriously need to be addressed if the system were to become a commercial product.

These are the questions that I had while reading the paper:

1. How much user intervention do you think is enough for these kinds of systems? I personally think if the users are given too much power over the system, users will apply their bias to the correction and get false positives.

2. What would be a good way for the system to only retrieve ‘reliable’ sources to reference? Stating that a claim is true based on a Wikipedia article would obviously not be so assuring. Also, academic papers cannot address all claims, especially if they are social claims. What would be a good threshold? How could this be detected?

3. Given the current system, would you believe the results that the system gives? Do you think the system addresses the three requirements that the authors introduced which all fact-checking systems should possess? I personally think that system transparency is still lacking. The system shows a lot about what kind of sources it used and how much weight it’s putting into them, but it does not really explain how it made the decision.

04/15/20 – Ziyao Wang – Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking

April 15, 2020 Ziyao Wang 1 Comment

The authors focused on fact-checking, which is the task of assessing the veracity of claims. They proposed a mixed-initiative approach to fact-checking. In this system, they combined human knowledge and experience with AI’s efficiency and scalability in information retrieval. They argue that if we want to use fact-checking models practically, the models should be transparent, supporting to integrating user knowledge and have quantification and communication of model uncertainty. Following these principles, they developed their mixed-initiative system and did experiments with participants from MTurk. They found that the system can help humans when they are giving correct predictions and could be harmful when they are giving wrong predictions. And the interaction between participants and the system is not as effective as expected. Finally, they found that making tasks to be games does not help in users’ performance. In conclusion, they found that users are intended to trust models, and may be affected by the models to make the wrong choice. For this reason, transparent models are important in mixed-initiative systems.

Reflection:

I have tried to use the system mentioned in the paper. It is quite interesting. However, for the first time, I used it, I am confused about what should I do to use it. Though the interface is similar to Google.com and I am quite sure I should type something into the text box, there are limited instructions about what should I type, how the system will work and what should I do after searching my typed claim. Also, after I searched for a claim, the result page is still confusing. I know the developers want to show me some findings of the claim and provide me with the prediction result of the system. However, I am still confused about what should I do, and some given searching results are not related to my typed claim.

After several times of use, I got familiar with the system and it does help in my judgement of whether a claim is correct or not. I agree with the authors that some feedbacks about not being able to interact with the system properly comes from the users’ unfamiliar of the system. But apart from this, the authors should provide more instructions to the users so that they can get familiar with the system quickly. I think this is related to the transparency of the system and may raise users’ trust.

Another issue I found during use is that there are no words like the results can only be used as a reference, you should make a judgement using your own mind, or similar explanations. I think this may be a reason that the error rate of users’ results increased significantly when the system made wrong predictions. Participants may change their own minds when they saw that the prediction result is different from their own results because they think know little about the system and may think that system would be more likely to get the correct answer. If the system is more transparent to the users, the users may be able to provide more correct answers to the claims.

Questions:

How to let the participants make correct judgements when the system provides wrong predictions?

What kinds of instructions should be added so that participants can get familiar with the system more quickly?

Can this system be used in areas other than fact-checking?