03/04/20 – Fanglan Chen – Real-time Captioning by Groups of Non-experts

Summary

Lasecki et al.’s paper “Real-time Captioning by Groups of Non-experts” explores a new approach that relies on a group of non-expert captionists to provide speech captions of good quality, and presents an end-to-end system called LEGION:SCRIBE, which allows collective instantaneous captioning of live lectures on demand. In the speech captioning task, professional stenographers can achieve high accuracy; however, their manual efforts are very expensive and must be arranged in advance. For effective captioning, the researchers introduce the idea of having a group of non-experts caption audio and merging their inputs to achieve more accurate captions. The proposed SCRIBE system has two components: an interface for real-time captioning designed to collect partial captions from each crowd worker, and a real-time input combiner that merges the collective captions into a single output stream in real time. Their experiments show that the proposed solution is feasible and that non-experts can provide captioning with good quality and content coverage at short per-word latency. The proposed model can potentially be extended to allow dynamic groups to exceed the capacity of individuals in various human performance tasks.
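
The paper’s actual combiner uses a more sophisticated online alignment algorithm than anything shown here, but a toy sketch helps make the second component concrete: partial, noisy caption streams from several workers are aligned (here, crudely, by time bins) and merged into a single output stream. All names and numbers below are illustrative, not from the paper.

```python
from collections import Counter, defaultdict

def merge_partial_captions(partial_streams, bin_ms=500):
    """Naive illustration of merging noisy partial captions.

    partial_streams: list of lists of (timestamp_ms, word) pairs,
    one list per crowd worker. The real SCRIBE combiner uses a more
    sophisticated online alignment; this sketch simply bins words by
    time and keeps the most common word in each bin.
    """
    bins = defaultdict(list)
    for stream in partial_streams:
        for timestamp, word in stream:
            bins[timestamp // bin_ms].append(word.lower())

    merged = []
    for bin_id in sorted(bins):
        word, _count = Counter(bins[bin_id]).most_common(1)[0]
        merged.append(word)
    return " ".join(merged)

# Example: three workers each catch only part of the audio.
workers = [
    [(0, "real"), (600, "captioning"), (1200, "by")],
    [(50, "real"), (550, "captioning"), (1800, "groups")],
    [(580, "captioning"), (1250, "by"), (1850, "groups")],
]
print(merge_partial_captions(workers))  # "real captioning by groups"
```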

Reflection

This paper conducts an interesting study of how to achieve better performance on a single task via the collaborative efforts of a group of individuals. I think this idea aligns with ensemble modeling in machine learning. The idea presented in the paper is to generate multiple partial outputs (provided by team members and crowd workers) and then use an algorithm to automatically merge all of the noisy partial inputs into a single output. Similarly, ensemble modeling is a machine learning method in which multiple diverse models are developed to predict an outcome, either by using different algorithms or by using different training datasets. The ensemble model then aggregates the output of each base model and generates the final output. The motivation for relying on a group of non-expert captionists to achieve performance beyond the capacity of any single non-expert corresponds to the idea of using ensemble models to reduce generalization error and obtain more reliable results. As long as the base models are diverse and independent, performance increases when the ensemble approach is used. Crowdsourcing likewise seeks the collaborative efforts of crowds in obtaining the final results. In both approaches, even though the model has multiple human/machine inputs as its sources, it acts and performs as a single model. I would be curious to see how ensemble models perform on the same task compared with the crowdsourcing approach proposed in the paper.
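
To make the analogy concrete, here is a minimal ensemble-voting sketch in scikit-learn: three diverse base classifiers are trained on the same task and their predictions are merged by majority vote, much as SCRIBE merges partial captions from multiple workers. The dataset is synthetic and the choice of base models is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in task; any classification dataset works the same way.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",  # each base model casts one vote per example
)
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", ensemble.score(X_test, y_test))
```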

In addition, while the proposed framework may work well for general audio captioning, I wonder how it would perform on domain-specific lectures. As we know, lectures in many domains, such as medical science, chemistry, and psychology, are expected to contain terminology that might be difficult to capture by an individual without professional background in the field. There could be cases in which none of the crowd workers can type those terms correctly, which may result in incorrect captions. I think the paper could be strengthened with a discussion of the situations in which the proposed method works best. To continue that point, another possibility is to leverage the strengths of pre-trained speech recognition models together with crowd workers to build a human-AI team that achieves the desired performance.

Discussion

I think the following questions are worthy of further discussion.

  • Would it be helpful if the recruiting process for crowd workers took their backgrounds into consideration, especially for domain-specific lectures?
  • Although ASR may not be reliable on its own, is it useful to leverage it as a contributor alongside the input of crowd workers?
  • Is there any other potential for adding a machine-in-the-loop component to the proposed framework?
  • What do you think about the proposed approach compared with the ensemble modeling that merges the outputs of multiple speech recognition algorithms to get the final results?

03/04/20 – Fanglan Chen – Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems

Summary

Hara et al.’s paper “Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems” explores a crowdsourcing approach to locating and assessing sidewalk accessibility issues by labeling Google Street View (GSV) imagery. Traditional approaches to sidewalk assessment rely on street audits, which are very labor intensive and expensive, or on reports called in by citizens. The researchers propose their interactive user interface as an alternative for dealing with this issue proactively. Specifically, they investigate the viability of labeling sidewalk issues among two groups of diligent and motivated labelers (Study 1) and then explore the potential of relying on crowd workers to perform this labeling task, evaluating performance at different levels of labeling accuracy (Study 2). By investigating the viability of labeling across two groups (three members of the research team and three wheelchair users), the results of Study 1 are used to provide ground-truth labels for evaluating crowd worker performance and to gain a baseline understanding of what labeling this dataset looks like. Study 2 explores the potential of using crowd workers to perform the labeling task; their performance is evaluated at both the image and pixel levels of labeling accuracy. The findings suggest that it is feasible to use crowdsourcing for the labeling and verification tasks, which leads to final results of better quality.

Reflection

Overall, this paper proposes an interesting approach to sidewalk assessment. What I think about most is how feasibly it can be used to deal with real-world issues. In the scenario studied by the researchers, sidewalks in poor condition pose severe problems and relate to a larger accessibility issue of urban space. The proposed crowdsourcing approach is novel. However, if we take a close look at the data source, we may question to what extent it can facilitate assessment in real time. It seems impossible to update the Google Street View (GSV) imagery on a daily basis, so the images are historical rather than reflective of the current conditions of road sidewalks.

I think image quality may be another big problem for this approach. First, the resolution of the GSV imagery is comparatively low, and some images are captured under poor lighting conditions, which makes it challenging for crowd workers to make correct judgments. One possibility is to use existing machine learning models to enhance image quality by increasing resolution or adjusting brightness. That could be a natural place to introduce the assistance of machine learning algorithms to achieve better results in the task, as sketched below.
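
As a rough illustration of that preprocessing idea, the sketch below brightens, boosts the contrast of, and upscales a street-view image before it would be shown to crowd workers. It uses simple Pillow operations as a stand-in for the learned enhancement or super-resolution models mentioned above; the file names are hypothetical.

```python
from PIL import Image, ImageEnhance

def enhance_for_labeling(in_path, out_path, brightness=1.4, contrast=1.2):
    """Lightweight enhancement of a street-view crop before crowd labeling."""
    img = Image.open(in_path)
    img = ImageEnhance.Brightness(img).enhance(brightness)  # lift dark scenes
    img = ImageEnhance.Contrast(img).enhance(contrast)      # make curb edges stand out
    img = img.resize((img.width * 2, img.height * 2))        # naive upscaling
    img.save(out_path)

# Hypothetical file names for illustration only.
enhance_for_labeling("gsv_panorama.jpg", "gsv_panorama_enhanced.jpg")
```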

In addition, the focal point of the camera is another issue that may reduce the scalability of the project. The GSV imagery is not collected solely for sidewalk accessibility assessment, so it usually contains a lot of noise (e.g., occluding objects). It would be interesting to conduct a study of what percentage of the GSV imagery is of good quality with respect to the sidewalk assessment task.

Discussion

I think the following questions are worthy of further discussion.

  • Are there any other important accessibility issues that exist but were not considered in the study?
  • What improvements can you think of that the authors could make to their analysis?
  • What other potential human performance tasks could be explored by incorporating street view images?
  • How effectively do you think this approach can deal with urgent real-world problems?

02/26/20 – Fanglan Chen – Will You Accept an Imperfect AI? Exploring Designs For Adjusting End-user Expectations of AI Systems

Summary

Kocielnik et al.’s paper “Will You Accept an Imperfect AI?” explores approaches for shaping end-users’ expectations before they first work with an AI system and studies how appropriate expectations impact users’ acceptance of the system. Prior studies have shown that end-user expectations of AI-powered technologies are influenced by various factors, such as external information, knowledge and understanding, and firsthand experience. The researchers indicate that expectations vary among users and that users’ perception and acceptance of AI systems may be negatively impacted when their expectations are set too high. To fill the gap in understanding how end-user expectations can be directly and explicitly shaped, the researchers use a Scheduling Assistant, an AI system for automatically detecting meeting requests in email, to study the impact of several expectation-shaping methods. Specifically, they explore two system versions with the same classifier accuracy, where each version is designed to focus on mitigating a different type of error (false positives or false negatives). Based on their study, error types strongly relate to users’ subjective perceptions of accuracy and acceptance. Expectation adjustment techniques are proposed to make users fully aware of AI imperfections and to enhance their acceptance of AI systems.
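
The false-positive/false-negative trade-off at a roughly fixed accuracy can be illustrated with a small sketch: shifting the decision threshold of the same scoring model yields one version that errs mostly by over-triggering (false positives) and another that errs mostly by missing requests (false negatives), at a similar overall accuracy. The scores and thresholds below are made up, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)  # 1 = email contains a meeting request
scores = np.clip(labels * 0.35 + rng.normal(0.4, 0.25, size=1000), 0.0, 1.0)

def confusion(threshold):
    preds = (scores >= threshold).astype(int)
    false_pos = int(((preds == 1) & (labels == 0)).sum())
    false_neg = int(((preds == 0) & (labels == 1)).sum())
    accuracy = float((preds == labels).mean())
    return accuracy, false_pos, false_neg

# A low threshold over-triggers (more false positives); a high threshold
# misses requests (more false negatives), at a similar overall accuracy.
for threshold in (0.45, 0.65):
    acc, fp, fn = confusion(threshold)
    print(f"threshold={threshold}: accuracy={acc:.2f}, FP={fp}, FN={fn}")
```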

Reflection

We need to be aware that AI-based technologies cannot be perfect, just as nobody is perfect. Hence, there is no point in setting a goal that requires AI systems to make no mistakes. Realistically defining what success and failure look like when working with AI-powered technologies is of great importance in adopting AI to improve on the imperfections of today’s solutions. That calls for an accurate positioning of where AI sits in the bigger picture. I feel the paper mainly focuses on how to set appropriate expectations but lacks a discussion of the different scenarios associated with users’ expectations of AI. For example, users’ expectations of the same AI system vary greatly across decision-making frameworks: in a human-centric decision-making process, the expectation of the AI component is comparatively low, as the AI’s role is more like a counselor who is allowed to make some mistakes; in a machine-centric system, all the decisions are made by algorithms, which leaves users with low tolerance for errors. Simply put, some AIs will require more attention than others because the impact of errors or the cost of failures will be higher. Expectations of AI systems vary not only among different users but also across usage scenarios.

To generate positive user experiences, AI needs to exceed expectations. One simple way to achieve this is not to over-promise the performance of AI in the beginning. That relates to the researchers’ intention in designing the Accuracy Indicator component of the Scheduling Assistant. In the study, they set the accuracy to 50%, which is actually very low for AI-based applications. I am interested in whether the evaluation results would change with AI systems of higher performance (e.g., 70% or 90% accuracy). I think it would be worthwhile to conduct a survey about users’ general expectations of AI-based systems.

Interpretability of AI is another key component that shapes user experiences. If people cannot understand how AI works or how it comes up with its solutions, and in turn do not trust it, they would probably not choose to use it. As people accumulate more positive experiences, they build trust with AI. In this way, easy-to-interpret models seem to be more promising to deliver success compared with complex black-box models. 

To sum up, by being fully aware of AI’s potential but also its limitations, and developing strategies to set appropriate expectations, users can create positive AI experiences and build trust in an algorithmic approach in decision making processes.

Discussion

I think the following questions are worthy of further discussion.

  • What is your expectation of AI systems in general? 
  • How would users’ expectations of the same AI system vary across different usage scenarios?
  • What are the negative impacts brought by inflated expectations? Please give some examples.
  • How can we determine which type of errors is more severe in an AI system?

02/26/20 – Fanglan Chen – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summary

Dodge et al.’s paper “Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment” presents an empirical study of how people make fairness judgments of machine learning systems and how different styles of explanation impact their judgments. Fairness issues in ML systems have attracted growing research interest in recent years. Mitigating unfairness in ML systems is challenging and requires close cooperation among developers, users, and the general public. The researchers state that how explanations are constructed has an impact on users’ confidence in the systems. To further examine the potential impacts on people’s fairness judgments of ML systems, they conduct empirical experiments involving crowd workers on four types of programmatically generated explanations (influence, demographic-based, sensitivity, and case-based). Their key findings include: 1) some explanations are considered inherently more fair, while others can erode users’ trust in the algorithm’s fairness; 2) different fairness issues (model-wide fairness and case-specific fairness) are detected more effectively through different explanation styles; and 3) individual differences (prior positions and judgment criteria regarding algorithmic fairness) lead to differences in how users react to different styles of explanation.

Reflection

This paper sheds light on a very important fact: bias in ML systems can be detected and mitigated. There is growing attention to fairness issues in AI-powered technologies within the machine learning research community. Since ML algorithms are widely used to speed up decision making in a variety of domains, they are expected to produce neutral results beyond merely achieving good performance. There is no denying that algorithms rely on data: “garbage in, garbage out.” Hence, it is incumbent upon developers to feed unbiased data to these systems in the first place. In many real-world cases, race is not actually used as an input; however, it correlates with other factors that make predictions biased. Such cases are not as easy to detect as those presented in the paper, but they still require effort to correct. A question here would be: in order to counteract this implicit bias, should race be considered and used to calibrate the relative importance of other factors?

Besides the bias introduced by the input data, other factors need to be taken into consideration when dealing with fairness issues in ML systems. Firstly, machine bias can never be neglected. Bias in the context of high-stakes tasks (e.g., future criminal prediction) is very important because a false positive decision could have a destructive impact on a person’s life. This is why, when an AI system deals with human subjects (in this case, human lives), the system must be highly precise and accurate, and ideally provide reasonable explanations. Making a person’s life in society harder, or badly impacting it, because of a flawed computer model is never acceptable. Secondly, proprietary models are another concern. It should be kept in mind that many high-stakes tasks, such as future criminal prediction, are matters of public concern and should be handled transparently and fairly. That does not mean that the ML systems used for those tasks need to be completely public and open. However, I believe there should be a regulatory board of experts who can verify and validate the ML systems. More specifically, the experts can verify and validate the risk factors used in a system so that the factors can be widely accepted, and they can verify and validate the algorithmic techniques used in a system so that the system incorporates less bias.

Discussion

I think the following questions are worthy of further discussion.

  • Besides model unfairness and case-specific disparate impact, are there any other fairness issues?
  • What are the benefits and drawbacks of global and local explanations in supporting fairness judgment of AI systems?
  • Are there any other styles or elements of explanation you can think of that may impact fairness judgments?
  • If an AI system is not any better than untrained users at predicting recidivism in a fair and accurate way, why do we need the system?

02/19/20 – Fanglan Chen – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

Summary

Bansal et al.’s paper “Updates in Human-AI Teams” explores an interesting problem: the influence of updates to an AI system on overall team performance. Nowadays, AI systems are deployed to support human decision making in high-stakes domains, including criminal justice and healthcare. When humans and AI systems work as a team, humans make decisions with reference to the AI’s inferences. A successful partnership requires that the human develop an understanding of the AI system’s performance, especially its error boundary. Updates with higher-performing algorithms can increase the AI’s predictive accuracy. However, they may require humans to regain interactive experience and rebuild their confidence in the AI system, and this adjustment process may actually hurt team performance. The authors introduce the concept of compatibility between an AI update and prior user experience and present methods for studying the role of compatibility in human-AI teams. Extensive experiments on three high-stakes classification tasks (recidivism, credit risk, and mortality) demonstrate that updates made without regard to compatibility can decrease team performance after updating. To improve the compatibility of an update, the authors propose a re-training objective that penalizes new failures, i.e., mistakes on cases the previous model handled correctly. Their compatible updates achieve a good balance in the performance/compatibility trade-off across tasks.
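
A minimal sketch of the re-training idea, not the authors’ exact objective: the updated model’s loss is augmented with an extra penalty on examples that the previous model already handled correctly, so that new errors, the ones most damaging to user trust, are discouraged. The function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def compatible_loss(logits_new, labels, old_correct_mask, lambda_c=1.0):
    """Cross-entropy plus a penalty on examples the old model got right.

    logits_new:       [batch, n_classes] outputs of the updated model h2
    labels:           [batch] ground-truth class indices
    old_correct_mask: [batch] bool, True where the previous model h1 was correct
    lambda_c:         weight trading raw accuracy against compatibility
    """
    per_example = F.cross_entropy(logits_new, labels, reduction="none")
    # Extra weight only where a new mistake would contradict prior user experience.
    new_error_penalty = (per_example * old_correct_mask.float()).mean()
    return per_example.mean() + lambda_c * new_error_penalty

# Tiny synthetic example: 4 examples, 2 classes.
logits = torch.tensor([[2.0, 0.1], [0.2, 1.5], [1.0, 1.1], [0.3, 2.2]])
labels = torch.tensor([0, 1, 0, 1])
old_correct = torch.tensor([True, True, False, True])
print(compatible_loss(logits, labels, old_correct, lambda_c=0.5))
```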

Reflection

I think making AI and humans into a team to take full advantage of collaboration is a pretty neat idea. Humans are born with the ability to adapt in the face of an uncertain and adverse world and with the capacity for logical reasoning. Machines cannot perform well in those areas, but they can compute efficiently and free people for higher-level tasks. Understanding how machines can efficiently enhance what humans do best, and how humans can augment the scope of what machines can do, is the key to rethinking and redesigning current decision-making systems.

What I find interesting about the research problem discussed in this paper is that the authors focus on unifying the decisions made by humans and machines, rather than merely on task performance, when recommending updates. In machine learning with no human involved, the goal is usually to achieve ever better performance as measured by metrics such as accuracy, precision, and recall. Compatible updates can be seen as machine learning algorithms with similar decision boundaries but better performance, which seems an even more difficult goal to accomplish. To get there, humans need to perform crucial roles. Firstly, humans must train machines to achieve good performance on certain tasks. Next, humans need to understand and be able to explain the outcomes of those tasks, especially where AI systems fail, which requires an interpretability component in the system. As AI systems increasingly draw conclusions through opaque processes (the so-called black-box problem), there is a large demand for human experts who can explain model behavior to non-expert users. Last but not least, humans need to sustain the responsible use of AI systems by, for example, performing the kind of updating for better decision making discussed in the paper. That would require a large body of human experts who continually work to ensure that AI systems function properly, safely, and responsibly.

The above discussion is one side of the coin, focusing on how humans can extend what machines can achieve. The other side is comparatively less discussed in the current literature: beyond extending humans’ physical capabilities, how humans can learn from interacting with AI systems and enhance their individual abilities is an interesting question to explore. I would imagine that, in an advanced human-AI team, humans and AI systems communicate in a more interactive way that allows each to learn collaboratively from its own mistakes and from the rationale behind the other’s correct decisions. That leads to another question: if AI systems can rival or exceed humans in high-stakes decision making such as recidivism prediction and underwriting, how risky is it to hand these tasks over to machines? How can we decide when to let humans take control?

Discussion

I think the following questions are worthy of further discussion.

  • What can humans do that machines cannot and vice versa?
  • What is the goal of decision making and what factors are stopping humans or machines from making good decisions? 
  • In the human-AI teams discussed in the paper, how can humans benefit from interacting with the AI systems?
  • The partnership introduced by the authors is more like a human-assisting-machine approach. Can you provide some examples of machine-assisting-human approaches?

02/19/20 – Fanglan Chen – In Search of the Dream Team: Temporally Constrained Multi-Armed Bandits for Identifying Effective Team Structures

Summary

Zhou et al.’s paper “In Search of the Dream Team” introduces DreamTeam, a system that identifies effective team structures for each group of individuals by suggesting different structures and evaluating the fit of each. How well a team works relates to its team structure, including roles, norms, and interaction patterns. Prior organizational behavior research doubts the existence of universally perfect structures; the rationale is simple: teams are highly diverse, so no single structure can serve every team well. DreamTeam explores values along five dimensions of team structure: hierarchy, interaction patterns, norms of engagement, decision-making norms, and feedback norms. The system leverages feedback, drawing on metrics such as team performance or satisfaction, to iteratively identify the team structures that best suit each team. The authors also design multi-armed bandits with temporal constraints, an algorithm that determines the timing of exploration/exploitation trade-offs across multiple dimensions to avoid overwhelming teams with too many changes. In the experiments, DreamTeam is integrated with the chat platform Slack and achieves better performance and more diverse team structures compared with baseline methods.
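
To make the bandit idea concrete, here is a minimal epsilon-greedy sketch for a single team-structure dimension, with a crude temporal constraint that limits how often the structure can change. It illustrates the general mechanism only, not the paper’s algorithm; the arms, rewards, and parameters are made up.

```python
import random

class ConstrainedBandit:
    """Epsilon-greedy bandit that avoids switching structures too often."""

    def __init__(self, arms, epsilon=0.2, min_rounds_between_changes=3):
        self.arms = arms
        self.epsilon = epsilon
        self.min_gap = min_rounds_between_changes
        self.counts = {arm: 0 for arm in arms}
        self.values = {arm: 0.0 for arm in arms}   # running mean reward per arm
        self.current = random.choice(arms)
        self.rounds_since_change = 0

    def select(self):
        self.rounds_since_change += 1
        if self.rounds_since_change < self.min_gap:
            return self.current                     # too soon to change again
        if random.random() < self.epsilon:
            choice = random.choice(self.arms)       # explore
        else:
            choice = max(self.arms, key=lambda a: self.values[a])  # exploit
        if choice != self.current:
            self.current = choice
            self.rounds_since_change = 0
        return self.current

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Example: reward could be a team-satisfaction score after each work period.
bandit = ConstrainedBandit(["flat", "leader-led"])
for _ in range(20):
    arm = bandit.select()
    reward = random.uniform(0.4, 0.9) if arm == "leader-led" else random.uniform(0.2, 0.7)
    bandit.update(arm, reward)
print("learned values:", bandit.values)
```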

Reflection

The authors design a system to facilitate the organization of virtual teams. Beyond the several limitations mentioned in the paper, I feel the proposed DreamTeam system is based on a comparatively narrow view of what makes a dream team, and it seems difficult to generalize the framework to a variety of domains or platforms.

In the first place, I do not agree that there is a universal approach to designing or evaluating a so-called dream team. The components that make a dream team vary across domains. For example, in sports, I would say personality and talent play important roles in forming a dream team. Actually, it goes beyond “forming”: talented individuals not only bring technical expertise to the team, but also contribute passion, a strong work ethic, and a drive for peak performance in the pursuit of excellence. To extend that point, working with people who have similar personalities, values, and pursuits brings a certain chemistry to teamwork, which potentially enables challenging problem solving and strategic planning. None of this is captured by the proposed dimensions, and it is nearly impossible to evaluate quantitatively.

Also, I think it is important for every team member to understand their role, such as why they need to tackle the tasks and how that ties to a larger purpose beyond their own needs. This provides a clear purpose and direction for where a group of people needs to move forward as a team. I do not think the authors emphasize how such understanding influences team members’ level of commitment. In addition, this kind of unified purpose avoids duplication of member efforts and prevents efforts from being pulled in multiple directions.

Last but not least, in my opinion, maximizing rewards is not the ideal basis for determining the best team structures. Human society treasures the process as well as the results. Teamwork can be considered successful as long as the whole team is motivated and working toward the goal. If too much emphasis is put on results, the joy will be drained out of the job for the team. As long as progressive steps are made toward achieving the goal within a reasonable time frame, the team will become better. Building an ambitious, driven, and passionate team is just the start; we also need to ensure that team members are sustained and nurtured so that they can deliver on the targets.

Discussion

I think the following questions are worthy of further discussion.

  • If you are the CEO of a company or a university president, would you consider using the proposed DreamTeam system to help organize your own team? Why or why not?
  • Do you think the five bandits capture all dimensions to make a dream team?  
  • Do you think the proposed DreamTeam system can be generalized to various domains? Are there any domains you think the system would not contribute towards an efficient team structure?
  • Is there anything you can think of to improve the proposed DreamTeam system?

02/05/20 – Fanglan Chen – Guidelines for Human-AI Interaction

The variability of current AI designs, together with failures of automated inference that range from disruptive or confusing to more serious, calls for creating more effective and intuitive user experiences with AI. The paper “Guidelines for Human-AI Interaction” enriches the ongoing conversation on heuristics and guidelines for human-centered design of AI systems. In this paper, Amershi et al. identified more than 160 potential recommendations for human-AI interaction from respected sources ranging from scholarly research papers to blog posts and internal documents. Through a four-phase framework, the research team systematically distilled and validated the guideline candidates into a unified set of 18 guidelines. This work empowers the community by providing a resource for designers working with AI and facilitates future research into the refinement and development of principles for human-AI interaction.

The proposed 18 guidelines are grouped into four sections that prescribe how an AI system should behave upon initial interaction, as the user interacts with the system, when the system is wrong, and over time. As far as I can see, the major research question is how to keep automated inferences under some degree of control when they are operating under uncertainty. We can imagine that it would be extremely dangerous in scenarios where humans are unable to intervene when the AI makes incorrect decisions. Take autonomous vehicles, for example: the AI may behave abnormally in real-world situations it has not faced during training. How to integrate efficient dismissal or correction is an important question to consider in the initial design of such an autonomous system.

Also, we need to be aware that while the guidelines for human-AI interaction are developed to support design decisions, they are not intended to be used as a simple checklist. One of their important purposes is to support and stimulate conversations between user experience and engineering practitioners that lead to better AI design. Another takeaway from this paper is that there will always be situations where AI designers must consider trade-offs among guidelines and weigh the importance of one or more over others. Beyond the four-phase framework presented in the paper, I think there are at least two points worthy of discussion. Firstly, the four-phase framework is more of a narrowing-down process, and no open-ended questions are raised in the feedback cycle. The functioning and goals of apps in different categories may vary, and emerging capabilities and use cases may suggest a need for additional guidelines. As AI design advances, we may need more innovative ideas about future AI design instead of being constrained by the existing guidelines. Secondly, it seems all the evaluators who participated in the user study work in the domain of HCI, and a number of them have years of experience in the field. I am wondering whether the opinions of end users without HCI experience should be considered as well and how wider involvement would impact the final results. I think the following questions are worthy of further discussion.

  • Which of the 18 proposed design guidelines are comparatively difficult to employ in AI designs? Why?
  • Besides the proposed guidelines, are there any design guidelines worthy of attention but not discussed in the paper?
  • Some of the guidelines seem to be of greater importance than others in user experience of specific domains. Do you think the guidelines need to be tailored to the specific categories of applications?
  • In the user study, do you think it would be important to include end users who actually use the app but have no background in HCI?

02/05/20 – Fanglan Chen – Principles of Mixed-Initiative User Interfaces

Horvitz’s paper “Principles of Mixed-Initiative User Interfaces” highlighted several principles that allow AI engineers to enhance human-computer interaction through a carefully designed coupling of automated services with direct manipulation by users. The author demonstrated a middle ground in the human-computer interaction debate between full automation of user needs (via intelligent agents) and the primacy of user control and decision making (via graphical user interfaces). By showing how to turn the proposed principles into potential improvements of an application, the LookOut system for scheduling and meeting management, the paper explored the possibility of designing innovative user interfaces and new human-computer interaction modalities by considering, from the ground up, designs that combine the power of direct manipulation with potentially valuable automated reasoning.

I think this discussion can be framed by noting the interesting duality between artificial intelligence (AI) and human-computer interaction (HCI). In AI, the goal is to mimic the way humans learn and think in order to create computer systems that can perform intelligent actions beyond naive tasks. In HCI, the goal is to design computer interfaces that leverage human strengths and aid users in the execution of intelligent actions. The basic idea of mixed-initiative interaction is to let agents work most effectively through a collaborative process. On the agent’s side, the major challenge is to deal with uncertainty about users’ interests and intentions and thus to know how to proceed in coordinating with users on a variety of tasks. It is indispensable to bring humans into the interaction through a mode that is convenient for them. To achieve this, intelligent agents must be designed to focus on various subproblems, fill in details, identify problem areas, and collaborate with different users to find the best personalized solutions. Without mixed initiative, AI designs are likely to fall into either purely human-controlled or purely system-controlled approaches. We also need to be aware that the mixed-initiative, or co-creative, framework may come at a high cost. System-controlled frameworks are prevalent nowadays because they save companies effort and money. When we try to balance operational expenses against improved customer service, it is important to ask how we can decide which framework to choose and at what stages we need to get humans involved.
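
One way to see how an agent can decide when to act, ask, or stay quiet under uncertainty about the user’s intentions is through expected utility, which is central to Horvitz’s principles. The sketch below compares three options for a LookOut-style scenario; the utility numbers are invented for illustration and are not taken from the paper.

```python
# Expected-utility sketch for a mixed-initiative agent: given its estimated
# probability that the user wants the service (e.g., scheduling a meeting
# from an email), pick the option with the highest expected utility.
UTILITIES = {
    # (action, user_actually_wants_service): made-up utility
    ("act", True): 1.0,     # correct automation saves the user effort
    ("act", False): -1.0,   # unwanted automation is disruptive
    ("ask", True): 0.6,     # dialog helps, but costs a little attention
    ("ask", False): -0.2,   # a needless question is a mild annoyance
    ("skip", True): -0.4,   # missed opportunity to help
    ("skip", False): 0.0,   # staying quiet when nothing was wanted
}

def choose_action(p_wants_service):
    def expected_utility(action):
        return (p_wants_service * UTILITIES[(action, True)]
                + (1 - p_wants_service) * UTILITIES[(action, False)])
    return max(("act", "ask", "skip"), key=expected_utility)

for p in (0.1, 0.5, 0.9):
    print(p, "->", choose_action(p))  # low confidence -> skip, mid -> ask, high -> act
```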

In the mixed-initiative framework, a user and an AI agent work together to produce the final product. Let us look at an example of mixed-initiative research led by the University of Rochester. Through years of work on mixed-initiative planning systems, one of their projects has been to develop systems that enhance human performance in managing plans, such as transportation network planning. There is no denying that intelligent planning systems and humans solve problems very differently: automated agents require complete specifications of the goals and situation before knowing where to start, while human experts incrementally learn about the scenario and modify the goals while developing the plan. Faced with this dilemma, the research team decided to design a collaborative planning system that takes advantage of both the user and the machine to build the plan. The idea is that users bring intuition, concrete goals and trade-offs between goals, and advanced problem-solving strategies, while the agents bring an ability to manage details, allocate resources, and perform quantitative analysis of proposed actions. In this way, the capability of humans and AI agents to create the desired output is extended. I think the following questions are worthy of further discussion.

  • What is the boundary between human control, system control, and mixed-initiative frameworks? 
  • How can we decide which framework to choose and at what stages we need to get humans involved to make the systems better?
  • How can we provide personalized user experiences given the countless uncertain decisions involved?
  • Do all kinds of tasks require mixed initiative? What kinds of projects would benefit most from the mixed-initiative framework?

01/29/20 – Fanglan Chen – Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms

Beyond evaluating the extent of AMT’s influence on research, Vakharia’s paper “Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms” assesses the landscape of existing crowd work platforms based on prior work, comparing operations and workflows across the platforms. In addition, it opens a discussion about the extent to which the characteristics of other platforms may have positive and negative impacts on research. Building on an analysis of prior white papers, the paper identifies issues such as inadequate quality control, inadequate management tools, missing support for fraud prevention, and a lack of automated tools. Further, the paper defines twelve criteria for comprehensively evaluating the platforms. Finally, the researchers present comparative results based on their proposed criteria.

While reading this paper, I kept thinking that the private-crowds problem is also an ethics problem, because protecting sensitive or confidential data matters to multiple parties. More broadly, the ethics problems of crowd work platforms can be seen from three perspectives: data, humans, and interfaces. From the data perspective, we need requesters to provide reasonable data in which biases, such as those related to gender and geo-location, have been mitigated. From the human perspective, we need the platforms to assign tasks to workers randomly. From the interface perspective, requesters need to provide interfaces that are symmetric across all categories of data, and the data should be independent and identically distributed (IID).

Though I have not used a platform like AMT, I have performed data labeling tasks before, and automated tools are really efficient and important for this kind of work. I was once part of a team labeling cars for an object tracking project. I used a tool to automatically generate object detection results; the results were highly biased, but even so, they still assisted us a great deal in labeling the data.

The paper provides a number of criteria for characterizing various platforms. For the criterion “Demographics & Worker Identities,” a further question also needs analysis: whether it is ethical to release the demographics of workers and requesters at all, and what the potential hazards of making this personal information available would be. The two criteria “Incentive Mechanisms” and “Qualifications & Reputation” seem to conflict with each other, since workers who work faster on tasks may produce lower-quality work. Finally, the paper does not provide quantitative metrics for measuring the performance of different crowd work platforms, so it is still difficult for users and requesters to judge which crowdsourcing platforms are good for them. The following questions are worthy of further discussion.

  • What are the main reasons for crowd workers or requesters to pick one particular platform over others as their major platform?
  • Knowing the characteristics of these platforms, is it possible to design a platform that boasts all of their merits?
  • The paper provides many criteria for comparing crowd work platforms; is it possible to develop standard quantitative metrics to evaluate their characteristics and performance?
  • Is it necessary or is it ethical for the requesters to know the basic information of the workers, and vice versa?

01/29/20 – Fanglan Chen – The Future of Crowd Work

This week’s readings on crowdsourcing continue last week’s discussion of ghost work. Envisioning what future crowd work will look like, Kittur et al.’s paper “The Future of Crowd Work” discusses the benefits and drawbacks of crowd work and addresses the major challenges current crowdsourcing faces. The researchers call for a new framework that can potentially enable more complex, collaborative, and sustainable crowd work. The framework lays out major research challenges in 1) crowd work processes, including designing workflows, assigning tasks, supporting hierarchical structure, enabling real-time response, supporting synchronous collaboration, and controlling quality; 2) crowd computation, including crowds guiding AIs, AIs guiding crowds, and platforms; and 3) crowd workers, including job design, reputation, and motivation.

I feel this paper opens more questions than it answers. The vision for the future of crowd work is promising; however, given only the high-level ideas provided by the researchers, how to achieve that goal remains unclear. I think two key questions are worth discussing. Firstly, is complex crowd work really needed at the current stage of AI development, and what type of complex and collaborative crowd work is in demand, and to what extent? This question reminds me of a recent talk given by Yoshua Bengio, one of the “godfathers of AI,” at NeurIPS 2019. Entitled “From System 1 Deep Learning to System 2 Deep Learning,” his talk addressed some problems of current AI development (System 1 deep learning), including but not limited to 1) requiring large volumes of training data to complete naive tasks and 2) generalizing poorly across different datasets. It seems the current development of AI is still in System 1, and there is a long way to go to reach System 2, which requires a higher level of cognition, out-of-distribution generalization, and transfer ability. I think this can partially explain why a large portion of crowd work tasks are labeling or pattern recognition. For simple tasks like these, there seems to be no need to decompose the work. Currently, it is difficult to foresee how fast AI will develop and how complex the required crowdsourcing tasks will become. In my opinion, a quantitative study of what portion of current tasks are considered complex, along with an analysis of the trend, would be useful for a better understanding of crowd work at the current stage.

Secondly, complex, collaborative, and sustainable crowd work depends highly on the platforms, and how to modify existing crowd work platforms to support the future of crowd work remains unclear. The organization and coordination of crowd workers across varying task types and complexity still receives little consideration in the design and operation of existing platforms, even large ones such as AMT, ClickWorker, and CloudFactory. Based on the observations above, the following questions are worthy of further discussion.

  • When do we need more complex, collaborative, and sustainable crowd work?
  • How can existing crowd work platforms support the future of crowd work?
  • What organizational and coordination structures can facilitate the crowd work across varying task types and complexity?
  • How can existing platforms boost effective communication and collaboration on crowd work?
  • How can existing platforms support effective decomposition and recombination of tasks, or provide interfaces and tools for efficient workflows for complex work?
