03/04/2020 – Myles Frantz – Real-time captioning by groups of non-experts

Summation

Machine learning is at the forefront of many technologies, yet it is still highly inaccurate; YouTube's auto-generated captions for recorded videos, for example, show how immature the technology remains. To go beyond this, the team at Rochester created a hybrid approach, Scribe, that combines the efforts of multiple crowd workers to produce captions more accurately and with lower latency. The methodology can be used either to verify the output of existing machine learning algorithms or to generate captions on its own. Throughout the experiment, disagreements between workers' transcriptions are resolved by majority vote. In terms of precision and Word Error Rate, Scribe is comparable to other captioning systems, though at a lower cost.
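
To make the combination step concrete, here is a minimal sketch of merging several workers' partial transcripts and breaking ties with a majority vote. This is my own simplification for illustration, not Scribe's actual alignment algorithm, and the example sentences are hypothetical.

```python
# Minimal sketch, my own simplification rather than Scribe's alignment
# algorithm: merge several workers' partial transcripts word-by-word and
# resolve disagreements with a majority vote.
from collections import Counter
from itertools import zip_longest

def merge_captions(transcripts):
    """Pick the most common word at each position; '' means the worker
    missed that word."""
    merged = []
    for words_at_position in zip_longest(*transcripts, fillvalue=""):
        votes = Counter(w for w in words_at_position if w)
        if votes:
            merged.append(votes.most_common(1)[0][0])  # majority vote
    return merged

workers = [
    "the quick brown fox jumps".split(),
    "the quick crown fox jumps".split(),
    ["the", "", "brown", "fox", "jumps"],
]
print(" ".join(merge_captions(workers)))  # "the quick brown fox jumps"
```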

Response

I can see how combining the two aspects, crowd workers and the initial baseline, could create an accurate process for generating captions. Using crowd workers to assess and verify the baseline captions ensures the quality of the output and offers the potential to improve the underlying machine learning algorithm. As the system grows, more workers can be given jobs and the captioning system can keep improving, increasing both the number of jobs available and the quality of the core algorithm itself.

Questions

  • I am not experienced in this specific field, and setting aside the 2012 publication date, the combination of crowd workers verifying auto-generated captions does not seem especially novel. Although their survey of the state of the art did not include crowd workers in any capacity, that may simply have been outside their scope. In your opinion, does this research still stand up against more recent auto-captioning papers, or is it a product of its time?
  • A potential problem within the crowdworking community is that their technique relies on a majority vote to confirm which words accurately represent a phrase. Even with statistics ensuring that the Mechanical Turk workers have sufficient experience and can be relied upon, this area may be vulnerable to malicious actors outnumbering the non-malicious ones. Given that the phrases are interpreted and written explicitly, do you think a scenario similar to the Mountain Dew naming campaign (Dub The Dew – https://www.huffpost.com/entry/4chan-mountain-dew_n_1773076), in which a group of malicious actors flooded the poll with bogus names, could happen to this type of system?
  • In this system, the raw audio of a speech or event is fed directly to the Mechanical Turk workers using the Scribe program. Depending on the environment where the speech was given or the quality of the microphone, a majority of workers may not be able to hear the correct words, potentially regardless of the volume. Would there be a future in combining this kind of technology with machine learning algorithms that isolate and remove white noise or background conversations around the main speakers?

Read More

03/04/2020 – Myles Frantz – Combining crowdsourcing and Google Street View to identify street-level accessibility problems

Summation

Taxes in the US are a divisive topic, and unfortunately public infrastructure such as road maintenance takes the impact. Because of the limited resources allocated, states typically fund only what is necessary to fix existing problems, often leaving accessibility improvements out. This team from the University of Maryland prototyped a way to identify missing accessibility features via crowdsourcing. They developed a system built on Google Street View imagery in which users could label accessibility problems along the road, and subsequent users could confirm or deny the labels of previous users. They ran the experiment on 229 images manually collected from Google Street View, first with 3 wheelchair users and then with 185 Mechanical Turk workers, achieving an accuracy of at least 78% against the ground truth. Filtering out the lower-ranking turkers raised the accuracy by about 6%, at the cost of excluding 52% of the turkers.
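
As a rough sketch of the idea rather than the paper's exact method, the snippet below shows how binary accessibility labels might be aggregated by majority vote and how filtering out low-accuracy workers against ground truth could be computed. The data, worker IDs, and the 0.75 cutoff are all hypothetical.

```python
# Rough sketch with hypothetical data and thresholds, not the paper's exact
# method: aggregate binary "accessibility problem" labels by majority vote,
# then drop low-accuracy workers and re-aggregate.
from collections import defaultdict

# (image_id, worker_id, label) where label 1 = "accessibility problem present"
labels = [("img1", "w1", 1), ("img1", "w2", 1), ("img1", "w3", 0),
          ("img2", "w1", 0), ("img2", "w2", 1), ("img2", "w3", 0)]
ground_truth = {"img1": 1, "img2": 0}

def majority(labels, allowed_workers=None):
    votes = defaultdict(list)
    for img, worker, label in labels:
        if allowed_workers is None or worker in allowed_workers:
            votes[img].append(label)
    return {img: int(sum(v) >= len(v) / 2) for img, v in votes.items()}

def worker_accuracy(labels, truth):
    per_worker = defaultdict(list)
    for img, worker, label in labels:
        per_worker[worker].append(label == truth[img])
    return {w: sum(ok) / len(ok) for w, ok in per_worker.items()}

acc = worker_accuracy(labels, ground_truth)
good_workers = {w for w, a in acc.items() if a >= 0.75}  # hypothetical cutoff
print(majority(labels))                 # aggregated over all workers
print(majority(labels, good_workers))   # aggregated over filtered workers
```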

Response

I can appreciate this approach since, as the paper also states, a manual effort to identify accessibility problems would cost a great deal of money and time, and both are typically sticking points in government contracts. Though governments may not be ready to hand this kind of work to crowd workers, the reported accuracy makes a stronger argument for it. The study also showed that trading away raw worker numbers for better-performing workers was ultimately fruitful, which may align well with the kind of budget a government could provide.

Questions

  • Manually creating ground truth with experts would likely be unsustainable, since that requirement drives the cost up. Given that I don't believe you can require workers to have a particular disability on Amazon Mechanical Turk, if this kind of work were based solely on Mechanical Turk (or other crowdsourcing tools), would the ground truth ultimately suffer from the lack of relevant experience and expertise?
  • This may be an outsider's perspective; however, there seems to be a definite split of ideas within the paper: creating a system for crowd workers to identify problems, and creating a system to optimize which crowd workers work on the project. Do you think both ideas were weighted equally throughout the paper, or was the Google Street View system mainly a vehicle for the worker-optimization techniques?
  • Since the images (taken solely from Google Street View) are captured only periodically due to the resource constraints of the Street View cars, they are likely to be outdated and affected by recent construction. When there is a long delay between Street View captures, structures and buildings may change without being updated in the system. Do you think the streets might change enough that the turkers' work becomes obsolete?

Read More

02/26/20 – Myles Frantz – Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment

Summation

Even though machine learning took the title of "best technology buzzword" away from cyber security a few years ago, there are two fundamental problems with it: understanding how much each feature contributes to an answer, and ensuring the model itself is fair. The first problem limits progress on the second and has spawned its own field of research, explainable artificial intelligence. Because of this, it is difficult to ensure models treat sensitive data fairly and do not learn biases. To help people judge whether a model is fair, this team concentrated on automatically generating four different types of explanations for a model's rationale: input-influence-based, demographic-based, sensitivity-based, and case-based. By showing these explanations of the same model to a group of crowd workers, they were able to determine quantitatively that there is no single perfect explanation method; the explanation must instead be tailored to the situation.
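
To illustrate how two of these explanation styles differ, here is a minimal sketch contrasting an input-influence explanation with a case-based one on a toy model. This is my own illustration, not the authors' implementation; the feature names, data, and model are hypothetical.

```python
# Minimal sketch (not the authors' implementation) contrasting two of the
# four explanation styles on a toy logistic-regression model.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[25, 0], [45, 1], [35, 1], [52, 0], [23, 1], [40, 0]])  # [age, prior_record]
y = np.array([0, 1, 1, 1, 0, 1])
model = LogisticRegression().fit(X, y)

def input_influence_explanation(x):
    """Roughly how much each feature pushes the decision (weight * value)."""
    contributions = model.coef_[0] * x
    return dict(zip(["age", "prior_record"], contributions.round(3)))

def case_based_explanation(x):
    """The most similar training case and its outcome."""
    idx = np.argmin(np.linalg.norm(X - x, axis=1))
    return {"nearest_case": X[idx].tolist(), "outcome": int(y[idx])}

query = np.array([30, 1])
print(input_influence_explanation(query))
print(case_based_explanation(query))
```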

Response

I agree with the need for explainable machine learning, though I'm unsure about the impact of this team's work. By reusing previous work for the four explanation types together with their own preprocessor, they seem to have answered one question only by raising another. This may be due to my lack of experience reading psychology papers, but their rationale for the explanation styles and fairness judgments seems commonplace. Two of the three conclusions wrapping up the quantitative study seemed appropriate: case-based explanations appeared less fair, while local explanations were more effective. The remaining conclusion, that people bring prior biases about machine learning, seems redundant.

I can appreciate the lengths they went to in measuring the results against the Mechanical Turk workers. Since this appears to be an incremental paper (see the portion about their preprocessor), it may lead to further papers built on the data they gathered.

Questions

  • I wonder if the impact of the survey of Mechanical Turk workers was limited by using only the four explanation types studied. The paper's conclusion indicated there is no good all-around explanation and that each type was useful in one scenario or another. Given that, would a different set of explanations lead to a good overall explanation?
  • A known and acknowledged limitation of this paper is the use of Mechanical Turk workers instead of actual judges. This may be better for representing a jury; however, it is hard to measure the full impact without including a judge. It would be costly and time-consuming, but it would better represent the full scenario.
  • Given only four types of explanation, would there be room for a combined or hybrid explanation? Though this paper mostly focuses on generating the explanations, there should be room to combine the factors into a good general-purpose explanation, despite the paper limiting itself early on to the four explanations drawn from the Binns et al. survey.

Read More

02/26/20 – Myles Frantz – Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems

Summation

Though machine learning techniques have advanced greatly within the past few years, human perception of the technology may severely limit its adoption and usage. To study how to better set and communicate expectations, this team created a Scheduling Assistant and observed how users tuned it and reacted to different expectations. Users adjusted the AI's behavior via a slider (from "Fewer detections" on the left to "More detections" on the right) that directly altered the trade-off between false positives and false negatives. This direct feedback loop gave the users (Mechanical Turk workers) more confidence and a better understanding of how the AI works. Given the variety of users, however, an AI focused on high precision was not the best general setting for the scheduling assistant.
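
Here is a minimal sketch of how such a slider could work under the hood. This is an assumed mapping, not the paper's implementation; the threshold range, function names, and example probabilities are hypothetical.

```python
# Minimal sketch (assumed, not the paper's implementation) of how a
# "Fewer detections <-> More detections" slider could map to a decision
# threshold, trading false negatives for false positives.
def threshold_from_slider(slider: float) -> float:
    """slider in [0, 1]: 0 = fewer detections (high precision),
    1 = more detections (high recall)."""
    return 0.9 - 0.8 * slider  # maps to a threshold between 0.9 and 0.1

def detect_meeting(request_probability: float, slider: float) -> bool:
    """Flag an email as a meeting request if the model's probability
    clears the slider-controlled threshold."""
    return request_probability >= threshold_from_slider(slider)

# Example: a borderline email (probability 0.55) is ignored in the
# high-precision setting but flagged in the high-recall setting.
print(detect_meeting(0.55, slider=0.1))  # False
print(detect_meeting(0.55, slider=0.9))  # True
```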

Response

I like this kind of direct collaboration between the user and the machine learning system. It tailors the algorithm explicitly to the tendencies and mannerisms of each user, allowing easier customization and thus a higher likelihood of use. This is supported by the team's research hypothesis: "H1.1 An AI system focused on High Precision … will result in higher perceptions of accuracy …". In this study each user (a Mechanical Turk worker) only used a subset of Enron emails to confirm or deny the meeting suggestions. Speculating further, if this type of system were deployed at scale across many users, being able to tune the AI would greatly encourage its use.

I also strongly agree with the slider bar as an easy way for individuals to tune the system. In this format the user does not need great technological skill to use it, and it is relatively fast. Having it easily reachable within the same system also builds a better connection between the user and the AI.

Questions

  • I would like to see a larger (and/or beta) study done with a major email provider. Each provider likely has its own homegrown machine learning model; however, giving users the ability to tune that AI to their own preferences and tendencies would be a great improvement. The main issue would be scalability and providing enough infrastructure to make this work for all users.
  • On the accessibility and usability side, I would like to see a study comparing different types of interaction controls (sliders, buttons, Likert-scale settings, etc.). There is likely already research on the effectiveness of each type of control, but in the context of AI settings it is imperative to have the best one. Ideally this would become an adopted standard: an easy-to-use control accessible to everyone.
  • Following on from the first question, I would like to see this kind of study run on an internal mail system, potentially at a university. Though this was studied with 150 Mechanical Turk workers and 400 internally recruited workers, it was based on a sub-sample of the Enron email dataset. Running it as a live beta test in a widely and actively used email system, with live emails, would be the true test I would like to see.

Read More

02/19/2020 – The Work of Sustaining Order in Wikipedia – Myles Frantz

On a website as extensive as Wikipedia, there is bound to be an abundance of actors, both good and bad. With the scale and broad ruleset of such a popular site, it would be nigh impossible for human moderators to handle the workload and examine each page in depth. To alleviate this, programs that use machine learning were created to track users' edits across the site in a single place. Once the information is gathered, a user acting maliciously can easily be caught by the system and their edits auto-reverted based on the machine learning predictions. Such was the case for the user in the case study, who attempted to slander a famous musician but was caught quickly and with ease.
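
As a purely illustrative sketch of the triage idea (not how any specific Wikipedia bot is actually implemented), an anti-vandalism tool might auto-revert only the most confident detections and queue borderline edits for human patrollers. The thresholds, fields, and example edits below are hypothetical.

```python
# Minimal, hypothetical sketch of how an anti-vandalism bot might triage
# edits: high-scoring edits are reverted automatically, borderline ones are
# queued for human patrollers. Thresholds and fields are illustrative only.
from dataclasses import dataclass

@dataclass
class Edit:
    page: str
    user: str
    vandalism_score: float  # output of some trained classifier, 0..1

AUTO_REVERT = 0.95   # very confident: revert without human review
NEEDS_REVIEW = 0.60  # uncertain: send to a patroller's queue

def triage(edit: Edit) -> str:
    if edit.vandalism_score >= AUTO_REVERT:
        return "auto-revert"
    if edit.vandalism_score >= NEEDS_REVIEW:
        return "queue-for-patroller"
    return "accept"

print(triage(Edit("Famous_Musician", "anon_user", 0.98)))  # auto-revert
print(triage(Edit("Some_Article", "new_user", 0.70)))      # queue-for-patroller
```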

I absolutely agree with the amount of moderation going on around Wikipedia. Given the site's scope, there is a vast number of pages that must all be secured and protected to the same level. It is unrealistic to expect a non-profit website to hire enough manual workers to accomplish this task (in contrast to YouTube or Facebook). Also, the amount of context that must be followed to track a malicious user down manually would be completely exhausting. The security side of malware tracking has a comparable tool ecosystem: decompilers, binary tracers, and even a dedicated operating system distribution (Security Onion) that ships "out of the box" with programs ready to monitor the full environment for malware.

I disagree with one of the major issues raised, that the bots create and execute their own moral agenda. Their behavior is entirely learned and based on various factors (such as the rules, the training data, and correction values). Though they have the power to automatically revert and edit someone else's page, they do so at the discretion of the person who created the rules. There will likely be some issues, but that is part of the overall learning process. False positives can also be appealed if the author chooses to follow through, so the decision is not fully final.

  • I would expect that with such a tool suite, there would be a tool acting as a combination, a "Visual Studio Code"-like interface for all these tools. Having them all at the ready is useful; however, since time is of the essence, a tool wrapping all the common functions would be very convenient.
  • I would like to know how many reviews from moderators are completely biased. A moderator workforce should ideally be unbiased, but realistically that is unlikely to fully happen.
  • I would also like to see the percentage of false positives, even in a system this robust. New moderators are likely to flag or unflag something incorrectly if they are unfamiliar with the rules.

Read More

02/19/2020 – Human-Machine Collaboration for Content Regulation – Myles Frantz

Since the dawn of the internet, it has surpassed many expectations and become prolific throughout everyday life. Though initially there was a lack of standards in website design and forum moderation, things have stabilized through various systematic approaches. A popular forum site, Reddit, uses a human-led human-AI collaboration to help automatically and manually moderate its ever-growing comments and threads. Starting from the top 100 subreddits (at the time of writing), the team surveyed moderators from 5 varied and highly active subreddits. Thanks to the easy-to-use API provided by Reddit, one of the most used moderation tools was a third-party bot, Automod, later incorporated into Reddit itself. It is one of the more popular and common tools used by Reddit moderators, and since it is very extensible, there is no common standard across subreddits: moderators within the 5 subreddits use the bot in similar but distinct ways. It is also not the only bot used for moderation; other bots can interact with and streamline it in a similar fashion. However, because of the complexity of the bots (whether technological hurdles or a lack of interest in learning the tool), some subreddits let only a few people manage them, sometimes with damning results. When issues happen, instead of reacting to individual users' complaints, the paper argues for more transparency from the bot.
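
To give a flavor of what rule-based moderation looks like, here is an illustrative sketch of a tiny keyword-rule moderator in the same spirit. This is not AutoModerator's actual rule syntax or API; the rule names, patterns, and actions are hypothetical.

```python
# Illustrative sketch only, not AutoModerator's actual rule syntax: a tiny
# keyword-rule moderator where each subreddit supplies its own rules and the
# bot applies them to incoming comments.
import re
from typing import Dict, List

RULES: List[Dict] = [
    {"name": "banned-words", "pattern": r"\b(badword1|badword2)\b", "action": "remove"},
    {"name": "link-spam", "pattern": r"(https?://\S+.*){3,}", "action": "report"},
]

def moderate(comment: str) -> List[Dict]:
    """Return the actions triggered by a comment, for a human mod to audit."""
    triggered = []
    for rule in RULES:
        if re.search(rule["pattern"], comment, flags=re.IGNORECASE):
            triggered.append({"rule": rule["name"], "action": rule["action"]})
    return triggered

print(moderate("check http://a.com http://b.com http://c.com deals!"))
# [{'rule': 'link-spam', 'action': 'report'}]
```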

I agree with the original author of Automod, who started off making the bot purely to automate several repetitive steps. Carrying that forward to Reddit's current scale, I believe it would be impossible for human moderators alone to keep up with the "trolls".

I do disagree, though, with how knowledge of the Automod rules is spread out. Decentralizing that knowledge would make the system more robust, especially since the moderators are volunteers. It is natural for people to avoid what they don't understand, out of fear of it in general or of the repercussions that may follow, but I don't think putting all of the work on one moderator is the right answer either.

  • One of my questions concerns one of the proposed outcomes for Reddit: granting more visibility into Automod's actions. Given Reddit's scale, extending this kind of functionality automatically could incur a significant storage overhead. Reddit already stores vast amounts of data, and potentially doubling the storage required (if every comment reviewed by Automod were logged) could be a downfall of this approach.
  • Instead of surveying the top 60%, I wonder whether surveying lower-ranked subreddits (via RedditMetrics) with fewer moderators would show the same pattern of Automod use. I would imagine they would be forced to use the Automod tool in more depth and breadth due to the lack of available resources, but this is pure speculation.
  • A final question: to what extent is there duplication of bots across subreddits? If the percentage is large, it may lead to a vastly different experience from one subreddit to another, as seemingly happens now, potentially causing confusion among new or returning users.

Read More

02/05/20 – Myles Frantz – Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Research

In this paper, a solo Microsoft researcher compiled a seemingly comprehensive (almost exhaustive) survey of the ways crowdsourcing can be used to enhance and improve various aspects of machine learning. The study is not limited to the most common crowdsourcing platform; a multitude of other platforms are included as well, including but not limited to CrowdFlower, ClickWorker, and Prolific Academic. By reading and summarizing around 200 papers, the author grouped the key uses into 4 categories: data generation (the accuracy and quality of the data being generated), evaluating and debugging models (the accuracy of the predictions), hybrid intelligence systems (collaboration between humans and AI), and behavioral studies to inform machine learning research (realistic human interactions with and responses to AI systems). Each category has several examples underneath it describing various aspects along with their benefits and disadvantages. These sub-categories include topics such as speech recognition, gauging human behavior (and general attitudes) toward specific types of ads, and communication among crowd workers. With these factors laid out, the author insists that platforms and requesters maintain a good relationship with their crowd workers, design tasks well, and test tasks thoroughly.

I agree with the vastness and comprehensiveness of this survey. Its points seem to cover most of the research area, and it does not seem this work could easily be condensed into a more compact form.

I wholeheartedly agree with one of the closing points: keeping the consumers of the platform (requesters and crowd workers) in a good working relationship. Many platforms overcorrect or undercorrect their issues, upsetting their target audience and clientele, thereby creating negative press and temporarily dipping their stock. A leading example of this is YouTube and children's content, where ads were being illegally targeted at children. YouTube in turn overcorrected and still ended up with negative press, since the changes hurt several of its creators.

Though it is not a fault of the survey, I disagree with the applications of hybrid forecasting ("producing forecasts about geopolitical events") and understanding reactions to ads. These seem to be an unfortunate but inevitable outcome of companies and potentially governments attempting to predict and get ahead of incidents. Advertisements are comparatively less harmful, but in general the practice of perfectly balancing user targeting and crafting the perfect environment for viewing an advertisement seems malicious rather than for the betterment of humanity.

  • While practically impossible, I would like to see what industry has created in the area of hybrid forecasting. Not knowing how far this kind of technology has spread leaves the imagination running toward a few Black Mirror episodes.
  • From the author, I would like to see which platforms support each of the sub-categories of features. This could be done on the reader's side, though it might be a study in and of itself.
  • My final question would be a request for a subjective comparison of the "morality" of each platform. This could be done by comparing the quality of worker discussion or how strong the gamification is on each platform.

Read More

02/05/20 – Myles Frantz – Guidelines for Human-AI Interaction

In this paper, the Microsoft authors created and survey-tested a set of guidelines (or best practices) for designing AI-human interactions. Throughout their study, they reviewed 150 AI design recommendations, ran their initial set of guidelines through a strict heuristic evaluation, and finally conducted multiple rounds of a user study with 49 HCI practitioners (each with at least 1 year of self-reported experience). The resulting 18 guidelines fall into the categories "Initially" (at the start of the interaction), "During interaction", "When wrong" (when the AI system errs), and "Over time". They include, among others, "Make clear what the system can do", "Support efficient invocation", and "Remember recent interactions". In the user study, the guidelines were rated for how relevant they would be to specific product areas (such as navigation and social networks). Across these ratings, at least 50% of respondents thought the guidelines were clear, while approximately 75% of respondents found them at least neutral (acceptable to understand). Finally, a set of HCI experts was asked to check that further revisions of the guidelines were accurate and better reflected the area.

I agree with and really appreciate the insight from testing the relevance of each guideline against each sector of industry. Not only does this help avoid misapplying guidelines to unintended areas, it also effectively creates a guideline for the guidelines. It will help ensure that people implementing this set of guidelines have a better idea of where they are best used.

I also like the thorough testing that went into the vetting process for these guidelines. In last week's readings, the surveys were mostly or solely based on surveys of papers and were subjective to the authors. Having multiple rounds of testing with people who have, on average, substantial experience in the field lends strong support to the guidelines.

  • One of my questions for the authors would be about a post-mortem of the results and their impact on industry. Citation counts aside, it would be interesting to see how many platforms integrate these guidelines into their systems and to what extent.
  • Following up on the previous question, I would like to see another paper (possibly a survey) exploring the different implementation approaches used across platforms. A comparison between platforms would help showcase and exemplify each guideline.
  • I would also like to see each of these guidelines evaluated with a sample of expert psychologists to determine their effects in the long run. In light of what the other paper (Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Research) describes as algorithm aversion ("a phenomenon in which people fail to trust an algorithm once they have seen the algorithm make a mistake"), I would like to see whether these guidelines create an environment so immersive that human subjects either reject it outright or accept it completely.

Read More

01/29/20 – Myles Frantz – Human Computation: A Survey and Taxonomy of a Growing Field

With the recent emergence of human computation, there have been many advancements that have pushed the field further into industry. This growth has been so sporadic that much of the terminology and nomenclature has not been well defined within the scientific community. Though all of these ideas fall under the umbrella term Human Computation, the term itself is not strictly defined, as it has been applied to loosely related papers and ideas. This work defines human computation as coupling problems that may eventually be migrated to computers with the requirement that "human participation is directed by the computational system". The study then defines related terms that are equally loosely used. These terms, gathered under the collective idea of human computation, include common technological terms such as crowdsourcing, social computing, data mining, and collective intelligence. Following these definitions, various crowdsourcing platforms are compared within an inclusive classification system. In this system, aspects of the platforms are categorized by labels drawn from common usage in industry and the literature. The labels that apply (to some extent) to each of the crowdsourcing platforms are: motivation, human skill, aggregation, quality control, process order, and task-request cardinality. Underneath each of these top-level categories are sub-categories that further define each platform; for example, the Motivation dimension has the sub-labels pay, altruism (people's inherent will to do good), enjoyment, reputation (e.g., working with a big company), and implicit work (work that happens as a by-product of using the system). By tying this vocabulary down to clearer definitions, the authors hope to better understand each platform and to better realize how to make each system good for the humans involved.
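
To make the classification scheme concrete, here is a small sketch of how one platform's labels along the six dimensions could be encoded as a data structure. This is my own illustration; the platform name and the specific label values are hypothetical, not taken from the paper's tables.

```python
# My own illustration, not from the paper: encoding one hypothetical
# platform's classification along the survey's six dimensions.
platform_labels = {
    "ExamplePlatform": {
        "motivation": ["pay"],
        "human_skill": ["visual perception"],
        "aggregation": ["collection"],
        "quality_control": ["redundancy", "reputation system"],
        "process_order": ["requester -> worker -> computer"],
        "task_request_cardinality": ["many-to-many"],
    }
}

# A lookup like this makes side-by-side comparisons across platforms simple.
for name, labels in platform_labels.items():
    print(name, "->", labels["motivation"], labels["quality_control"])
```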

I disagree with how the labels are created in this system. With any classification there will be "gray areas" for some of the platforms placed under a label. In addition, this may stifle new creative ideas, since these become the broad buckets people use to standardize their thinking. It is comparable to a standardized test that misses general learning by strictly enforcing a single path.

While I tentatively agree with the upper-level labeling system itself, I believe the secondary labels should be left more open-ended. Again, this is because forcing new discoveries into commonly accepted buckets risks limiting or overshadowing them instead of letting a more distinctive approach stand on its own.

  • I would like to see how many of the crowdsourcing examples are cross-listed across the dimensions. From their current system, it seems the examples listed are relatively easy to classify, while other, unlisted examples might only fit into categories that would be dropped from the table.
  • Since this is intended as a common classification system, I would like to know whether a user survey (among people actively using the technology) has been done to see if these labels accurately represent the research area.
  • My final question about this system is how actively it has been used in industry, perhaps in the advertising or the core design of new platforms.

Read More

01/29/20 – Myles Frantz – An Affordance-Based Framework for Human Computation and Human-Computer Collaboration

Within visual analytics research, there has been much work on problems that require a close, interactive relationship between humans and machines. As is standard practice, each paper usually either creates a new standard to improve the field at a fundamental level or improves upon a previously created one. Many projects excel in their particular areas of expertise; this paper, however, endeavors to create a framework that makes the various features of those projects comparable, in order to further the research. Previous frameworks each built models as best they could, using features such as the maturity of the crowdsourcing platform, the model presentation, or the integration types. While these are acknowledged as advancing the field, they are limited to their own subsections and thereby "cornered" relative to the framework presented in this paper. The idea of the relationship between humans and computers was first described and discussed in the early 1950s, but it was solidified in the late 1970s by J.J. Gibson's notion that "an organism and its environment complement each other". These affordances are used as core concepts of the human-machine relationship, since in visual analytics that relationship is at the very core. The multitude of papers covered by this survey yields human affordances (human "features" required by machines) such as visual perception, visuospatial thinking, creativity, and domain knowledge, and machine affordances (machine attributes used or "exploited" for research purposes) such as large-scale data manipulation, efficient data movement, and bias-free analysis. There can also be hybrid relationships that draw on both human and machine affordances.

Compared to the other reading for the week, I like the framework created here to relate crowdsourcing tools and humans. Not only does it cover more human aspects (suggesting a better future relationship), it also describes the current co-dependency with a relatively greater emphasis on human-centric interactions.

I also agree that this framework seems to be a good representation of the standard applications of visual analytics. While the merging of human and machine affordances is acknowledged, the human affordances alone seem sufficient for the framework. The machine affordances also seem adequate, though that may be due to the current direction of research in the area.

  • As with the other reading of the week, I would like to see a user study (of researchers or industry practitioners in the area) to see how this comparison lines up with practical usage.
  • In the future, could there be a more granular weighting of which affordances are used by which platform? This is a more practical application, but it could serve as better guidance for companies (or researchers) choosing the platform that best fits their target audience.
  • Comparing the affordances (or qualities) of projects at a high level may not be entirely fair to each of them in the eyes of potential consumers. Though such metrics could be gamed (inflating the numbers through malicious means) and exaggerated, measures like impact score and depth could help compare the projects.

Read More