03/04/2020 – Palakh Mignonne Jude – Combining Crowdsourcing and Google Street View To Identify Street-Level Accessibility Problems

SUMMARY

The authors of this paper investigate the feasibility of recruiting MTurk workers to label and assess sidewalk accessibility problems as viewed through Google Street View. They conducted two studies: the first with 6 people (3 members of their research team and 3 wheelchair users), and the second investigating the performance of turkers. The authors created an interactive labeling interface as well as a validation interface (to help users accept/reject previous labels). They proposed different levels of annotation correctness along two spectra – a localization spectrum, which covers image-level and pixel-level granularity, and a specificity spectrum, which covers the amount of information evaluated for each label. They defined image-level correctness in terms of accuracy, precision, recall, and f-measure. To compute inter-rater agreement at the image level, they utilized Fleiss' kappa. To evaluate the more challenging pixel-level agreement, they verified that pixel-level overlap was greater between labelers on the same image than across different images. The authors used the labels produced in Study 1 as the ground truth dataset for evaluating turker performance. They also proposed two quality control approaches – filtering turkers based on a performance threshold and filtering labels based on crowdsourced validations.
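Since the summary mentions image-level precision/recall/f-measure and Fleiss' kappa, a minimal sketch of how these image-level metrics can be computed may help; this is my own illustration (the label names and numbers are made up), not the authors' code:

```python
from collections import Counter

def image_level_metrics(pred, truth):
    """pred/truth: binary flags per image (1 = accessibility problem labeled present)."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum((not p) and t for p, t in zip(pred, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def fleiss_kappa(ratings, categories):
    """ratings: one Counter per image mapping category -> number of raters who chose it.
    Assumes every image was rated by the same number of raters."""
    N = len(ratings)                        # number of images
    n = sum(ratings[0].values())            # raters per image
    p_j = {c: sum(r.get(c, 0) for r in ratings) / (N * n) for c in categories}
    P_i = [(sum(v * v for v in r.values()) - n) / (n * (n - 1)) for r in ratings]
    P_bar = sum(P_i) / N                    # mean observed agreement
    P_e = sum(p * p for p in p_j.values())  # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 3 raters, 4 GSV images, two hypothetical label categories
ratings = [Counter({"curb_ramp_missing": 3}),
           Counter({"curb_ramp_missing": 2, "no_problem": 1}),
           Counter({"no_problem": 3}),
           Counter({"curb_ramp_missing": 1, "no_problem": 2})]
print(fleiss_kappa(ratings, ["curb_ramp_missing", "no_problem"]))   # ~0.33
print(image_level_metrics([1, 1, 0, 0], [1, 0, 0, 1]))              # (0.5, 0.5, 0.5)
```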

REFLECTION

I really liked the motivation of this paper, especially given the large number of people who have physical disabilities. I am very interested to know how something like this would extend to other countries such as India, where many places have poor walking surfaces and no support for wheelchairs, so it would greatly aid people with physical disabilities there. I think that having such a system in place in India would definitely help disabled people be better informed about places that they can visit.

I also liked the quality control mechanisms of filtering turkers and filtering labels, since these appear to be good ways to improve the overall quality of the labels obtained. I thought it was interesting that the performance of the system improved with turker count but that the gains diminished in magnitude as the group size grew. I thought that the design of the labeling and verification interfaces was good and that it made it easy for users to perform their tasks.

QUESTIONS

  1. As indicated in the limitations section, this work ‘ignored practical aspects such as locating the GSV camera in geographical space and selecting an optimal viewpoint’. Has any follow-up study been performed that takes into account these physical aspects? How complex would it be to conduct such a study?
  2. The authors mention that image quality can be poor in some cases due to a variety of factors. How much of an impact would this cause to the task at hand? Which labels would have been most affected if the image quality was very poor?
  3. The validation of labels was performed by crowd workers via the verification interface. Would there have been any change in the results obtained if experts had been used for the validation of labels instead of crowd workers (since they may have been able to identify more errors in the labels as compared to normal crowd workers)?


03/04/2020 – Palakh Mignonne Jude – Pull the Plug? Predicting If Computers or Humans Should Segment Images

SUMMARY

The authors of this paper aim to build a prediction system capable of determining whether the segmentation of images should be done by humans or by computers, keeping in mind that there is a fixed budget of human annotation effort. They focus on the task of foreground object segmentation. To showcase the generalizability of their technique, they utilized image datasets from varied domains: the Biomedical Image Library (BU-BIL) with 271 grayscale microscopy images, Weizmann with 100 grayscale everyday object images, and Interactive Image Segmentation with 151 RGB everyday object images. They developed a resource allocation framework, 'PTP', that predicts whether to 'Pull The Plug' on machines or humans. They conducted studies on both coarse segmentation and fine-grained segmentation. The 'machine' algorithms were selected from among the algorithms currently used for foreground segmentation, such as Otsu thresholding, the Hough transform, etc. The prediction model was a multiple linear regression. The 522 images from the 3 datasets mentioned earlier were given to crowd workers on AMT for coarse segmentation. The authors found that their proposed system was able to eliminate 30–60 minutes of human annotation time.
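The core allocation idea can be sketched roughly as follows; this is my own simplification under assumed details (made-up features and quality scores), not the authors' released code: score each machine-produced segmentation with a regression model trained on image features, then spend the fixed human budget on the images predicted to be segmented worst.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: per-image features extracted from a machine
# segmentation (e.g. boundary smoothness, region contrast) and the measured
# overlap of that segmentation with ground truth.
rng = np.random.default_rng(0)
train_features = rng.random((200, 5))
train_quality = train_features @ np.array([0.3, 0.2, 0.1, 0.25, 0.15])  # stand-in scores

quality_model = LinearRegression().fit(train_features, train_quality)

def allocate_budget(features_per_image, human_budget):
    """Predict segmentation quality for each new image and send the
    lowest-scoring ones (those the machine is predicted to handle worst)
    to human annotators, up to the fixed budget."""
    predicted = quality_model.predict(features_per_image)
    ranked = np.argsort(predicted)             # worst predicted quality first
    to_humans = set(ranked[:human_budget].tolist())
    return [("human" if i in to_humans else "machine") for i in range(len(predicted))]

new_features = rng.random((10, 5))
print(allocate_budget(new_features, human_budget=3))
```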

REFLECTION

I liked the idea of the proposed system, which capitalizes on the strengths of both humans and machines and aims to identify when the skill of one or the other is better suited to the task at hand. It reminded me of reCAPTCHA (as highlighted by the paper 'An Affordance-Based Framework for Human Computation and Human-Computer Collaboration'), which also utilized multiple affordances (both human and machine) in order to achieve a common goal.

I found it interesting to learn that this system was able to eliminate 30–60 minutes of human annotation time. I believe that if such a system were used effectively, it would enable developers to build systems faster and ensure that human effort is not wasted. I thought it was good that the authors attempted to incorporate variety when selecting their datasets; however, I believe it would have been interesting if the authors had combined these datasets with a few more that contained more complex images (ones with many objects that could plausibly be the foreground). I also liked that the authors published their code as an open-source repository for future extensions of their work.

QUESTIONS

  1. As part of this study, the authors focus on foreground segmentation. Would the proposed system extend well to other object segmentation tasks, or would the quality of the segmentation and the performance of the system be hampered in any way?
  2. While the authors attempted to indicate the generalizability of their system by utilizing different datasets, the Weizmann and BU-BIL datasets consisted of grayscale images with relatively clear foreground objects. If the images were to contain multiple objects, would the amount of time saved by this system be as high? Is there any relation between the difficulty of the annotation task and the success of this system?
  3. Have there been any new systems (since this paper was published) that build on the methodology proposed by the authors? What modifications/improvements, if any, could be made to the proposed system?


02/26/2020 – Palakh Mignonne Jude – Interpreting Interpretability: Understanding Data Scientists’ Use Of Interpretability Tools For Machine Learning

SUMMARY

In this paper, the authors study two interpretability tools – the InterpretML implementation of GAMs and the SHAP Python package. They conducted a contextual inquiry and a survey of data scientists in order to analyze how well these tools help uncover common issues that arise when evaluating ML models. The results of these studies indicate that data scientists tend to over-trust such tools. The authors first conducted pilot interviews with 6 participants to identify common issues faced by data scientists. The contextual inquiry included 11 participants, who explored a dataset and an ML model hands-on via a Jupyter notebook, whereas the survey comprised 197 participants and was conducted through Qualtrics. For the survey, participants were given a description of the dataset and a tutorial on the interpretability tool they were to use. The authors found that both the visualizations provided by the interpretability tools and the fact that these tools are popular and publicly available led the data scientists to over-trust them.
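To make the setup concrete, here is a rough sketch (my own, using a stand-in sklearn dataset rather than the one used in the study, and assuming the interpret and shap packages are installed) of how a data scientist might produce the two kinds of output the participants were shown – the InterpretML GAM (EBM) global explanation view and a SHAP summary plot:

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# InterpretML's GAM variant (EBM): its global explanation dashboard is the kind
# of visualization participants in the contextual inquiry inspected.
ebm = ExplainableBoostingClassifier().fit(X, y)
show(ebm.explain_global())

# SHAP on a boosted-tree model: the summary plot is another of the
# interpretability visualizations the paper's participants worked with.
gbm = GradientBoostingClassifier().fit(X, y)
explainer = shap.TreeExplainer(gbm)
shap.summary_plot(explainer.shap_values(X), X)
```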

REFLECTION

I think it is good that the authors performed a study to observe how data scientists use interpretability tools. I was surprised to learn that a large number of these data scientists over-trusted the tools and that visualizations impacted their ability to judge the tools as well. However, considering that the authors state that 'participants relied too heavily on the interpretability tools because they had not encountered such visualizations before', I wonder if the authors should have created a separate pool of data scientists with more experience with such tools and visualizations and presented a separate set of results for that group. I also found it interesting to learn that some participants used the tools to rationalize suspicious observations.

As indicated by the limitations section of this paper, I think a follow-up study that includes a richer dataset as well as interpretability techniques for deep learning would be very interesting to learn about and I wonder how data scientists would use such tools versus the ones studied in this paper.

QUESTIONS

  1. Considering the complexity of ML systems and the time it takes researchers to truly understand how to interpret ML, both the contextual inquiry and the survey were conducted with people who had as little as 2 months of experience with ML. Would a study with experts in the field of ML (all with over 4 years of experience) have yielded different results? Perhaps these data scientists would have been better able to identify issues and would not have over-trusted the interpretability tools?
  2. Would a more extensive study comprising a number of different (commonly used as well as not-so-commonly used) interpretability tools have changed the results? If the tools were not so easily available, would that truly impact the amount of trust users place in them?
  3. Does a correlation exist between the amount of experience a data scientist has and the amount of trust they place in a given interpretability tool? Would replacing visualizations with other representations of model interpretations impact the amount of trust the human has in the tool?


02/26/2020 – Palakh Mignonne Jude – Explaining Models: An Empirical Study Of How Explanations Impact Fairness Judgment

SUMMARY

The authors of this paper study the effect that explanations of ML systems have on fairness judgments. This work attempts to include multiple aspects and heterogeneous standards in making fairness judgments that go beyond the evaluation of features. To perform this task, they utilize four programmatically generated explanations and conduct a study involving over 160 MTurk workers. They consider the impact of different explanation styles – global (influence and demographic-based) and local (sensitivity and case-based) explanations – as well as different fairness issues (model unfairness and case-specific disparate impact) and individual difference factors such as cognitive style and prior position. The authors utilized the publicly available COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) dataset for predicting risk of recidivism, which is known to have racial bias. They developed a program to generate different explanation versions for a given data point and conducted an online survey-style study wherein participants judged the fairness of a prediction on a 1 to 7 Likert scale and had to justify the rating they gave.
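The explanation-generation step can be sketched roughly as follows; this is a simplified stand-in of mine (synthetic data and made-up feature names, not the authors' generator), showing a global 'influence'-style explanation and a local 'sensitivity'-style explanation over a logistic regression model like the one used in the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = ["age", "priors_count", "charge_degree", "race", "sex"]  # simplified stand-ins
rng = np.random.default_rng(1)
X = rng.random((500, len(feature_names)))
y = (X[:, 1] + 0.3 * rng.standard_normal(500) > 0.5).astype(int)  # synthetic labels

clf = LogisticRegression().fit(X, y)

def influence_explanation():
    """Global 'influence' style: rank features by the magnitude of their weights."""
    order = np.argsort(-np.abs(clf.coef_[0]))
    return [(feature_names[i], round(float(clf.coef_[0][i]), 3)) for i in order]

def sensitivity_explanation(x, delta=0.2):
    """Local 'sensitivity' style: how much the predicted risk for this one
    data point moves when each feature is perturbed."""
    base = clf.predict_proba(x.reshape(1, -1))[0, 1]
    changes = {}
    for i, name in enumerate(feature_names):
        perturbed = x.copy()
        perturbed[i] += delta
        changes[name] = round(float(clf.predict_proba(perturbed.reshape(1, -1))[0, 1] - base), 3)
    return changes

print(influence_explanation())
print(sensitivity_explanation(X[0]))
```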

REFLECTION

I agree that ML systems are often seen as 'black boxes' and that this truly does make gauging fairness issues difficult. I believe that the study conducted was very useful in throwing light on the need for more well-defined fairness judgment methodologies that involve humans as well. I feel that the different explanation styles taken into account in this paper – influence, demographic-based, sensitivity, and case-based – were good choices and helped cover various aspects that could contribute to understanding the fairness of a prediction. I found it interesting to learn that the local explanations were more effective in exposing case-specific fairness issues, helping participants see discrepancies between disparately impacted and non-impacted cases, whereas the global explanations were more effective for judging model-wide fairness.

I also found it interesting to learn that different regions of the feature space may have varied levels of fairness and different fairness issues. Having never considered the fairness aspect of my datasets and the impact this could have on the models I build, I realized that it would indeed be important to have more fine-grained sampling methods and explanation designs in order to judge the fairness of ML systems.

QUESTIONS

  1. Of the participants involved in this study, 78.8% were self-identified Caucasian MTurk workers. Considering that the COMPAS dataset used in this study is known to have racial bias, would changing the percentage of African American workers involved have altered the results? The study focused on workers living in the US; perhaps the judgments of people of multiple races living across the world would also have been interesting to study.
  2. The authors utilize a logistic regression classifier, which is known to be relatively interpretable. How would a study of this kind extend to more complex deep learning systems? Could the programs used to generate explanations be used directly? Has any similar study been performed with these kinds of more complex systems?
  3. As part of the limitations of this study, the authors mention that ‘the study was performed with crowd workers, rather than judges who would be the actual users of this type of tool’. How much would the results vary if this study was conducted with judges? Has any follow-up study been conducted?


02/19/2020 – Palakh Mignonne Jude – The Work of Sustaining Order in Wikipedia: The Banning of a Vandal

SUMMARY

In this paper, the authors focus on the efforts (both human and non-human) taken in order to moderate content on the English-language Wikipedia. The authors use trace ethnography to show how these 'non-human' technologies have transformed the way editing and moderation are performed on Wikipedia. These tools not only increase the speed and efficiency of the moderators, but also aid them in identifying changes that might otherwise have gone unnoticed – for example, the 'diff' feature for identifying edits made by a user enables the 'vandal fighters' to easily view malicious changes that may have been made to Wikipedia pages. The authors mention editing tools such as Huggle and Twinkle, as well as a bot called ClueBot, which can examine edits and revert them based on a set of criteria such as obscenity, patent nonsense, and mass removal of content by a user. This synergy between tools and humans has helped monitor changes to Wikipedia in near real-time and has lowered the level of expertise required of reviewers, as an average volunteer with little to no knowledge of a domain is capable of performing these moderation tasks with the help of the aforementioned tools.
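The criteria-based reverting can be illustrated with a toy sketch; this is my own crude approximation of the kind of rules the paper attributes to ClueBot (the word list and thresholds are placeholders, not ClueBot's actual logic), along with the word-level diff view that vandal fighters rely on:

```python
import difflib

OBSCENE_TERMS = {"badword1", "badword2"}   # placeholder word list, not ClueBot's

def should_revert(old_text, new_text):
    """Toy approximation of the revert criteria described in the paper:
    flag obscene additions and mass removal of content; otherwise leave the
    edit for human vandal fighters to review."""
    old_words, new_words = old_text.split(), new_text.split()
    added = [w.lower().strip(".,!?") for w in new_words if w not in old_words]
    if any(w in OBSCENE_TERMS for w in added):
        return "revert: obscene addition"
    if len(new_words) < 0.2 * len(old_words):        # over 80% of the text removed
        return "revert: mass removal of content"
    return None

old = ("Wikipedia is a free online encyclopedia created and edited by "
       "volunteers around the world and hosted by the Wikimedia Foundation.")
new = "Wikipedia is badword1."

# The word-level 'diff' view used by assisted editing tools:
print("\n".join(difflib.unified_diff(old.split(), new.split(), lineterm="")))
print(should_revert(old, new))   # "revert: obscene addition"
```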

REFLECTION

I think it is interesting that the authors focus on the social effects that the various bots and assisted editing tools have on activity in Wikipedia. I especially liked the analogy drawn from the work of Ed Hutchins, in which a ship's navigator comes to know the vessel's trajectory through the work of a dozen crew members; the authors note that this is similar to how a vandal is blocked on Wikipedia through a complex network of interactions between software systems and human reviewers.

I thought it was interesting that the share of edits made by bots increased from 2–4% in 2006 to about 16.33% in just about 4 years, and this made me wonder what the current percentage of edits made by bots would be. The paper also mentions that the detection algorithms often discriminate against anonymous and newly registered users, which is why I found it interesting to learn that users were allowed to reconfigure their queues so that anonymous edits were not treated as more suspicious. The paper mentions ClueBot, which is capable of automatically reverting edits that contain obscene content; this made me wonder whether efforts have been made to develop bots that could automatically revert edits containing hate speech and highly bigoted views.

QUESTIONS

  1. As indicated in the paper ‘Updates in Human-AI teams’, humans tend to form mental models when it comes to trusting machine recommendations. Considering that the editing tools in this paper are responsible for queuing the edits made as well as accurately keeping track of the number of warnings given to a user, do changes in the rules used by these tools affect human-machine team performance?
  2. Would restricting edits on Wikipedia to users with non-anonymous login credentials (non-anonymous at least to the moderators, if not to the general public – similar to Piazza, where the professor can always view the true identity of the person posting a question) help lower the number of cases of vandalism?
  3. The study performed by this paper is now about 10 years old. What are the latest tools that are used by Wikipedia reviewers? How do they differ from the ones mentioned in this paper? Are more sophisticated detection methods employed by these newer tools? And which is the most popularly used assisted editing tool?


02/19/2020 – Palakh Mignonne Jude – Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

SUMMARY

In this paper, the authors discuss the impact that updates made to an AI model can have on overall human-machine team performance. They describe the mental model that a human develops over the course of interacting with an AI system and how it is disrupted when the AI system is updated. They introduce the notion of 'compatible' AI updates and propose a new training objective that penalizes new errors (errors introduced in the new model that were not present in the original model). The authors introduce terms such as 'locally compatible updates', 'compatibility score', and 'globally compatible updates'. They performed experiments with high-stakes domains such as recidivism prediction, in-hospital mortality prediction, and credit risk assessment. They also developed CAJA, a web-based game platform for studying human-AI teams that is designed so that no human player is a task expert. CAJA enables designers to vary different parameters including the number of human-visible features, AI accuracy, the reward function, etc.
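The compatibility idea lends itself to a small worked example; the following is my own simplified rendering (not the authors' exact objective) of a compatibility score and of a log loss that adds an extra penalty on 'new errors' – mistakes the updated model makes on examples the old model got right:

```python
import numpy as np

def compatibility_score(y_true, old_pred, new_pred):
    """Fraction of examples the old model got right that the new model
    also gets right (one way to read the paper's notion of compatibility)."""
    old_correct = old_pred == y_true
    both_correct = old_correct & (new_pred == y_true)
    return both_correct.sum() / max(old_correct.sum(), 1)

def penalized_log_loss(y_true, new_prob, old_pred, lam=1.0, eps=1e-12):
    """Standard log loss plus an extra penalty term applied only where the
    previous model was correct, so new errors on those examples cost more."""
    p = np.clip(np.where(y_true == 1, new_prob, 1 - new_prob), eps, 1 - eps)
    base = -np.log(p)
    dissonance = np.where(old_pred == y_true, base, 0.0)   # only where old model was right
    return float(np.mean(base + lam * dissonance))

y_true   = np.array([1, 0, 1, 1, 0])
old_pred = np.array([1, 0, 0, 1, 0])      # old model: 4/5 correct
new_pred = np.array([1, 1, 1, 1, 0])      # new model introduces one new error
print(compatibility_score(y_true, old_pred, new_pred))    # 0.75

new_prob = np.array([0.9, 0.6, 0.8, 0.95, 0.1])
print(penalized_log_loss(y_true, new_prob, old_pred))
```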

REFLECTION

I think this paper was very interesting, as I had never considered the impact that updates to an AI system can have on team performance. The idea of a mental model, as introduced by the authors, was novel to me, as I had never thought about the human aspect of using AI systems that make recommendations. This paper reminded me of the multiple affordances mentioned in the paper 'An Affordance-Based Framework for Human Computation and Human-Computer Collaboration', wherein humans and machines pursue a common goal by leveraging the strengths of both.

I thought it was good that they defined the notion of compatibility to include the human's mental model, and I agree that developers retraining AI models tend to focus on improving the accuracy of the model while ignoring the details of human-AI teaming.

I was also happy to read that the workers used as part of the study performed in this paper were paid on average $20/hour as per the ethical guidelines for requesters.

QUESTIONS

  1. The paper mentions the use of Logistic Regression and multi-layer perceptron. Would a more detailed study on the types of classifiers that are used in these systems help?
  2. Would ML models with better interpretability of their decisions have given better initial results and prevented the dip in team performance? In such cases, would providing a simple 'change log' (as is done for other software applications) have helped prevent this dip in team performance, or would it have still been confusing to the humans interacting with the system?
  3. How were the workers selected for the studies performed on the CAJA platform? Were there any specific criteria used to select them? Would the qualifications of the workers have affected the results in any way?


02/05/2020 – Palakh Mignonne Jude – Principles of Mixed-initiative User Interfaces

SUMMARY

This paper, published in 1999, reviews principles that can be used when coupling automated services with direct manipulation. Multiple principles for mixed-initiative UI are listed, such as developing significant value-added automation; inferring ideal actions in light of costs, benefits, and uncertainties; and continuing to learn by observing. The author focuses on the LookOut project – an automated scheduling service for Microsoft Outlook – which was an attempt to aid users in automatically adding appointments to their calendar based on the message currently being viewed. He then discusses the decision-making capabilities of this system under uncertainty – LookOut was designed to parse the header, subject, and body of a message and employ a probabilistic classification system in order to identify the intent of the user. The LookOut system also offered multiple interaction modalities, which included direct manipulation, basic automated assistance, and a social-agent modality. The author also discusses inferring beliefs about user goals and mapping these beliefs to actions, and he emphasizes the importance of timing automated services so that they are not invoked before the user is ready for them.
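The decision-making under uncertainty can be illustrated with a small expected-utility sketch; the action names and utility numbers below are illustrative assumptions of mine, not Horvitz's calibrated values:

```python
def choose_action(p_goal, utilities):
    """Pick the action with the highest expected utility given the inferred
    probability that the user actually wants the scheduling service."""
    expected = {
        action: p_goal * u_goal + (1 - p_goal) * u_no_goal
        for action, (u_goal, u_no_goal) in utilities.items()
    }
    return max(expected, key=expected.get), expected

# (utility if the user wanted the service, utility if they did not)
UTILITIES = {
    "autoschedule":   (1.0, -0.8),   # great if wanted, a costly interruption if not
    "ask_via_dialog": (0.6, -0.2),   # middle ground: a clarifying question
    "do_nothing":     (0.0,  0.0),
}

for p in (0.2, 0.5, 0.9):
    action, _ = choose_action(p, UTILITIES)
    print(p, "->", action)
```

With these numbers, a low inferred probability leads to doing nothing, an intermediate probability triggers a clarifying dialog, and a high probability leads to automatic action – the three-regime behavior the paper describes.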

REFLECTION

I found it very interesting to read about these principles of mixed-initiative UI considering that they were published in 1999 – which, incidentally, was when I first learnt to use a computer! I found that the principles were fairly wide-ranging considering the year of publication. However, principles such as 'considering uncertainty about a user's goals' and 'employing dialog to resolve key uncertainties' could perhaps have been addressed further by performing behavior modeling. I was happy to learn that the LookOut system had multiple interaction modalities that could be configured by the user, and I was surprised to learn that the system employed automated speech recognition capable of understanding human speech. It did, however, make me wonder how this system performed with respect to different accents; even though the words under consideration were basic ones such as 'yes', 'yeah', and 'sure', I wondered about the performance of the system. I also thought it was nice that the system was able to identify when a user seemed disinterested and waited before seeking a response. I also felt that implementing a continued training mechanism, with users able to dictate a training schedule, was a good design strategy. However, if the user were to dictate a training schedule, I wonder whether their behavior would differ from how they would act without knowing that their data was being monitored at that point in time (consent would be needed, but perhaps randomly observing user behavior would ensure that the user is not made too conscious of their actions).

QUESTIONS

  1. Not having explored the AI systems of the 90s, I am unaware of how these systems worked. The paper mentions that the LookOut system was designed to continue learning from users; how was this feedback loop implemented? Was the model re-trained periodically?
  2. Since the data used to train a model, and any bias present in it, is very important, how were the messages used in this study obtained? The paper mentions that the version of LookOut considered in the paper was trained using 500 relevant and 500 irrelevant messages – how was this data obtained and labeled?
  3. With respect to monitoring the length of time between the review of a message and the manual invocation of the scheduling service, the author studied the relationship between the size of a message and the time users dwell on it. What were the demographics of the people who took part in this study? Would there be a difference in the time taken between native and non-native English speakers?


02/05/2020 – Palakh Mignonne Jude – Guidelines for Human-AI Interaction

SUMMARY

In this paper, the authors propose 18 design guidelines for human-AI interaction, with the aim that these guidelines will serve as a resource for practitioners. The authors codified over 150 AI-related design recommendations and, through multiple phases of refinement, distilled this list into 18 generally applicable principles. As part of the first phase, the authors reviewed AI products, public articles, and relevant scholarly papers. They obtained a total of 168 potential guidelines, which were clustered into 35 concepts. This was followed by a filtering process that reduced the set to 20 guidelines. In phase 2, the authors conducted a modified heuristic evaluation, attempting to identify both applications and violations of the proposed guidelines across 13 AI-infused products/features. This phase helped merge, split, and rephrase different guidelines and reduced the total number to 18. In the third phase, the authors conducted a user study with 49 HCI practitioners to understand whether the guidelines were applicable across multiple products and to obtain feedback about their clarity. The authors ensured that the participants had experience in HCI and were familiar with discount usability testing methods. Modifications were then made to the guidelines based on the user study feedback about their clarity and relevance. In the fourth phase, the authors conducted an expert evaluation of the revisions; these experts were people with work experience in UX/HCI who were well-versed in discount usability methods. With their help, the authors assessed whether the 18 guidelines were easy to understand, and after this phase they published the final set of 18 guidelines.

REFLECTION

After reading the 1999 paper 'Principles of Mixed-Initiative User Interfaces', I found the study performed in this paper to be much more extensive as well as more relatable, since the AI-infused systems considered were systems I had some knowledge of, unlike the LookOut system, which I have never used. I felt that the authors performed a thorough comparison and included the important phases needed to formulate a strong set of guidelines. I found it interesting that this study was performed by researchers from Microsoft 20 years after the original 1999 paper (also from Microsoft). I believe that the authors provided a detailed analysis of each of the guidelines, and it was good that they included identifying applications of the guidelines as part of the user study.

I felt that some of the violations reported by participants were very well thought out; for example, one violation reported for a navigation product noted that an explanation was provided but inadequate – a 'best route' was suggested, but no criteria were given for why that route was the best. I feel that such notes from the participants were definitely useful in helping the authors distill good and generalizable guidelines.

QUESTIONS

  1. In your experience, which of the 18 guidelines did you find to be most important? Was there any guideline that appeared ambiguous to you? For those with limited experience in the field of HCI, were there any guidelines that seemed unclear or difficult to understand?
  2. The authors mention that they do not explicitly include broad principles such as ‘build trust’, but instead made use of indirect methods by focusing on specific and observable guidelines that are likely to contribute to building trust. Is there a more direct evaluation that can be performed in order to measure building trust?
  3. The authors mention that it is essential that designers evaluate the influences of AI technologies on people and society. What methods can be implemented in order to ensure that this evaluation is performed? What are the long-term impacts of not having designers perform this evaluation?
  4. For the user study (as part of phase 3), 49 HCI practitioners were contacted. How was this done and what environment was used for the study?


01/22/2020 | PALAKH MIGNONNE JUDE | GHOST WORK

SUMMARY

‘Ghost work’ is a new world of employment that encompasses the behind-the-scenes work done by unnamed online workers who help power mobile apps, websites, and AI systems. These workers are the ‘humans in the loop’, contributing their creativity, innovation, and rational judgment in cases where machines are unable to make decisions accurately. The importance of these ‘gigs’ has increased exponentially over the past few years – especially for providing better training data for machine learning systems (such as the ImageNet challenge, which involved annotating millions of images), cleaning up social media pages (ensuring that there is little to no abusive content), and providing accurate descriptions for product and restaurant reviews. The work covers various online platforms like Amazon’s Mechanical Turk, Microsoft’s UHRS, LeadGenius, and Amara – each catering to slightly different tasks, ranging from micro-tasks (tasks that can be done quickly but require many people) to macro-tasks (larger projects such as copyediting a newsletter or linking video to captions). The work also illustrates the lives of four such ghost workers and their experiences on these platforms, giving some insight into the ‘human’ behind the seemingly interchangeable worker represented by a pre-defined worker ID.

REFLECTION

As a researcher working with machine learning, I think I often forget the ‘human side’ of machine learning – especially with respect to the generation of labels for supervised tasks. I found the descriptions of the lives of Joan, Kala, Zaffar, and Karen to be particularly interesting and insightful, as they helped me better understand the motivations of people engaging in crowd work and better appreciate the effort needed to obtain a good ‘gold standard’. I also found it interesting to learn that on-demand workers produce higher-quality work than full-time employees. While the reading posits that a potential reason for this is the competitive nature of on-demand jobs, I wonder if incentive-based bonuses for full-time employees could have an impact on the quality of their work.

The reading reminded me of another paper, ‘Dirty Jobs: The Role of Freelance Labor in Web Service Abuse’, which focuses on freelance work done via Freelancer.com and discusses the abuse of various web services, including CAPTCHA solving, online social network linking, etc. This reading talks about vetting processes for workers, but it made me wonder about the type of vetting done for ‘requesters’ and the moderation of the types of tasks posted on such ghost work platforms.

QUESTIONS

  • What is the main motivation for ghost workers with a Master’s degree to work on these platforms? Is it generally due to an ailing family member? Additionally, Amara has a larger percentage of women; is there any reason behind this?
  • In the case of a platform such as Amara, where workers perform tasks in teams, how do they handle cases of harassment (if any were to occur)? Are there any policies in place to deal with such situations?
  • Is there any form of moderation of the types of tasks that are posted on these sites? Are the ghost workers allowed to flag tasks that might contribute to web service abuse (for example, multiple account creation by using ghost workers to solve CAPTCHAs in real-time)?


01/28/2020 | Palakh Mignonne Jude | The Future of Crowd Work

SUMMARY

This paper aims to define the future of crowd work in an attempt to ensure that future crowd workers can share the same benefits currently enjoyed by full-time employees. The authors define a framework that keeps in mind various factors such as workflow, assignment of tasks, and real-time response to tasks. The future the paper envisions includes worker considerations, such as providing timely feedback and job motivation, as well as requester considerations, such as quality assurance and control and task decomposition. The research foci mentioned in the paper broadly consider the future of work processes, the integration of crowd work and computation, and supporting the crowd workers of the future in terms of job design, reputation and credentials, and motivation and rewards. With respect to the future of crowd computation, the paper suggests hybrid human-computer systems that capitalize on the best of both human and machine intelligence; the authors mention two such strategies – crowds guiding AIs and AIs guiding crowds. As a set of future steps that can be undertaken to ensure a better environment for crowd workers, the authors describe three design goals – creating career ladders, improving task design through better communication, and facilitating learning.

REFLECTION

I found it interesting to learn about the framework proposed by the authors to ensure a better working environment for crowd workers in the future. I liked the structure of the paper, wherein the authors give a brief description of each research focus, followed by some prior work and then some potential research that could be performed in each area.

I particularly liked the set of steps that the authors proposed, such as the creation of a career ladder. I believe that the creation of such a ladder will help workers stay motivated, as they will have the ability to work towards a larger goal, and promotions can be a good incentive to foster a better and more efficient working environment. I also found it interesting to learn how often the design of the tasks causes ambiguity, which makes it difficult for the crowd workers to perform their tasks well. I think that having some of the better-performing workers sample-test these designs (as indicated in the paper) is a good idea, as it would allow requesters to get feedback on their task designs; many requesters may not realize that their tasks are not as easy to understand as they believe.

QUESTIONS

  1. While talking about crowd-specific factors, the authors mention how crowd workers can leave tasks incomplete with fewer repercussions compared to workers in traditional organizations. Perhaps a common reputation system that maintains an employment history (associated with some common ID), along with recommendation letters and work histories, might help keep track of all the platforms a crowd worker has been associated with as well as their performance?
  2. Since the crowd workers interviewed were from Amazon Mechanical Turk alone, wouldn't the responses collected as part of this study be biased? The opinions these workers give would be specific to AMT, and they might differ from the opinions of workers on other platforms.
  3. Do any of these platforms perform a thorough vetting of requesters? Have any measures been taken to develop a better system for ensuring that the tasks posted by requesters are not harmful/abusive in nature (CAPTCHA solving, reputation manipulation, etc.)?
