Reflection #3 – [1/23] – [Deepika Kishore Mulchandani]

[1]. Cheng, J., Danescu-Niculescu-Mizil, C. and Leskovec, J. 2015. Antisocial Behavior in Online Discussion Communities. Proceedings of the Ninth International AAAI Conference on Web and Social Media (ICWSM ’15).

Summary:

In this paper, Justin Cheng et al. study antisocial behavior in three online discussion communities: CNN, a general news site, Breitbart.com, a political news site, and IGN.com, a computer gaming site. The authors first characterize antisocial behavior by comparing the activity of users banned from the community (FBUs) with that of users who were never banned (NBUs). They then perform a longitudinal analysis, i.e., they study the behavior of users over their active tenure in the community. They also consider the readability of posts and the proportion of a user’s posts that were deleted as features for training their model. After developing the model, they predict which users will be banned in the future; with their model, they need to observe only 5 to 10 of a user’s posts to accurately predict that the user will be banned. They present two hypotheses and address the following research questions: Do users become antisocial later on? Does a community’s reaction affect their behavior? Can antisocial users be identified early?
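To make the prediction setup concrete, the sketch below shows the kind of early-warning classifier such a study could train: a handful of per-user features are computed over a user’s first 5 to 10 posts and used to predict a later ban. This is a minimal sketch under my own assumptions; the feature names, data format, and choice of a random forest are illustrative placeholders, not the authors’ exact pipeline.

```python
# Rough illustration (not the paper's exact pipeline): separate FBUs from NBUs
# using simple per-user features computed over a user's first few posts.
# The post dictionary keys ('text', 'was_deleted', 'downvotes') are hypothetical
# stand-ins for the paper's post, moderator, and community-response features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def early_post_features(posts):
    """Summarize a user's first few posts into a small feature vector."""
    return np.array([
        np.mean([len(p["text"].split()) for p in posts]),  # avg. post length (crude readability proxy)
        np.mean([p["was_deleted"] for p in posts]),        # fraction of posts deleted by moderators
        np.mean([p["downvotes"] for p in posts]),          # avg. community downvotes per post
        len(posts),                                        # posting activity in the window
    ])


def train_ban_predictor(users, labels, window=10):
    """users: list of per-user post lists; labels: 1 = later banned (FBU), 0 = NBU."""
    X = np.vstack([early_post_features(posts[:window]) for posts in users])
    y = np.array(labels)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    print("CV AUC:", cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
    return clf.fit(X, y)
```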

Reflection:

Antisocial behavior is a worry whether it occurs online or in person. That said, this research is an indication of the progress being made toward alleviating the ill effects of such behavior. The authors describe four groups of features that help in recognizing antisocial users in a community. Of these, the most salient in this study is the moderator features. Moderators delete posts and ban users in a community, following a particular set of guidelines for deciding which posts they consider antisocial. This raises a few questions: do moderators delete posts based only on the language of the post, or do factors like the number of downvotes and whether the post was reported also affect the decision? The point of this question is to figure out which signals they weigh more heavily. It also opens up further questions, such as whether moderator demographics (e.g., age) play a role in how offensive they find a post to be. The authors mention that posts written by FBUs contained more swear words; moderators who are more tolerant of swear words may not delete the posts of potential FBUs.

I admire the effort the authors put into studying the entire history of each user to identify patterns in user behavior over time. I also like the other features they used: the activity features (such as time spent in a thread) are not that intuitive and yet end up playing a significant role. The authors made an important observation that a model trained on one community performs relatively well on the other communities too. They also report that FBUs survived over 42 days on CNN, 82 days on Breitbart, and 103 days on IGN. This could reflect the category of the online discussion community. One could expect a community that hosts only political news to be more tolerant of antisocial behavior, by virtue of the fact that opposition is inherent in the news itself; many posts on such a community could attract downvotes and replies to comments, which are both significant features of the model and factors that influence a moderator’s decision. This raises the question, ‘Does the category of the online discussion community affect the ban of an antisocial user?’ I also agree with the authors that it is difficult to track users who instigate arguments while maintaining NBU-like behavior. This could be a crucial research question to look into.


Reflection #2 – [1/23] – [Deepika Kishore Mulchandani]

[1]. Mitra, T. and Gilbert, E. 2014. The Language that Gets People to Give. Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’14).

Summary:

In this paper, the authors aim to answer the research question, ‘What factors drive people to fund projects on crowdfunding sites?’ To this end, Tanushree Mitra et al. studied a corpus of 45K projects from Kickstarter, a popular crowdfunding site. They applied filtering techniques to eliminate bias and ultimately analyzed 9M phrases and 59 control variables to identify their predictive power in determining whether a project gets funded by the “crowd”. The error rate of their cross-validated penalized logistic regression model is only 2.4%. The authors found that the chances of funding increase if the pitch offers incentives and rewards (reciprocity), describes opportunities that are rare or limited in supply (scarcity), uses wording indicating that the project has already been pledged to by others (social proof), comes from a project creator that people like (liking), uses positive and confident language (LIWC and sentiment), and is endorsed by experts (authority). Through this research they have also released a dataset of phrases and control variables that can be put to further use by crowdfunding sites and other researchers.
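As a rough illustration of the modeling approach described above, here is a minimal sketch of a cross-validated, L1-penalized logistic regression over phrase features plus control variables. The column names, n-gram range, filtering thresholds, and preprocessing choices are my own assumptions, not the paper’s exact setup.

```python
# Illustrative sketch only: lasso-style logistic regression over phrase counts
# plus standardized control variables, chosen by cross-validation. Column names
# and thresholds are made up; this is in the spirit of the paper's model, not a
# reproduction of it.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler


def build_features(pitches, controls):
    """pitches: list of pitch texts; controls: numeric control variables per
    project (e.g. goal amount, duration, video present) -- hypothetical columns."""
    vec = CountVectorizer(ngram_range=(1, 3), min_df=5, binary=True)
    X_phrases = vec.fit_transform(pitches)
    X_controls = csr_matrix(StandardScaler().fit_transform(controls))
    return hstack([X_phrases, X_controls]).tocsr(), vec


def fit_funding_model(pitches, controls, funded):
    """funded: 1 if the project reached its goal, 0 otherwise."""
    X, vec = build_features(pitches, np.asarray(controls, dtype=float))
    # The L1 penalty shrinks most phrase coefficients to zero, leaving a sparse
    # set of predictive phrases, analogous to the paper's penalized regression.
    model = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=5, Cs=10)
    model.fit(X, np.asarray(funded))
    return model, vec
```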

Reflection:

‘The language that gets people to give’ is an engaging research paper. I admire the effort the authors put into analyzing a corpus of 45K Kickstarter projects. The flowchart of the steps taken to extract the variables used in the model was helpful for understanding how the phrases and control variables that were finally analyzed were obtained. The fact that the control variables are not specific to the Kickstarter platform makes this research more useful for all crowdfunding platforms. I like the Word Tree visualizations the authors provided. The role that persuasion phrases and concepts like reciprocity, scarcity, authority, and sentiment play in getting a project funded was fascinating to read about. Features like ‘video present’, ‘number of comments’, and ‘Facebook connected’ emphasize the social aspects of this analysis. A few of the top 100 phrases listed in the paper surprised me; however, I could definitely spot the patterns the authors identified. It is indeed impressive that a quantitative analysis using machine learning techniques can validate concepts such as reciprocity, liking, and scarcity. I was struck by the ‘good karma’ phrase. This phrase, and its mention with respect to reciprocity, made me realize that it would be exciting to study crowdfunding projects to answer the questions, ‘Do religious and spiritual beliefs impact the decision a person makes in funding a project? Do these beliefs hold more importance than incentive rewards in the reciprocity phenomenon?’ On observing the tables listing the control variables with non-zero coefficients, I found that many of the variables in the not-funded table were related to the ‘music’ and ‘film’ categories. This raised the questions, ‘Do some beliefs (e.g., that projects in these categories may not be successful) influence the decision to fund a project? Do these beliefs weigh more than factors like reciprocity, authority, and liking?’ I appreciate the ideas for future work that the authors have provided. I believe that implementing a feature that offers recommendations to the project creator while the pitch is being typed, using the phrase and control variable dataset that the authors have released, could be extremely interesting.



Reflection #1 – [1/18] – [Deepika Kishore Mulchandani]

[1]. Boyd, D. and Crawford, K. 2012. Critical Questions for Big Data. Information, Communication & Society, 15:5, 662-679. DOI: 10.1080/1369118X.2012.678878

Summary:

In this paper, the authors describe big data as a ‘capacity to search, aggregate, and cross-reference large datasets’. They then describe the importance of engaging critically with the emerging era of big data, as it will influence the future. They also discuss the dangers this big data phase poses to privacy, along with other concerning factors. They then discuss in detail the assumptions and biases of this phenomenon through six points. The first is that big data has changed the definition of knowledge. Another is that having a large amount of data does not necessarily mean that the data is good. The authors also explain, with examples, the ethics of the research being done and the lack of regulating techniques and policies, to emphasize their importance. Finally, they discuss how access to this data is limited to a few organizations and the divide this creates.

[2]. Lazer, D. and Radford, J. 2017. Data ex Machina: Introduction to Big Data. Annual Review of Sociology, 43:1, 19-39.

Summary:

In this paper, the authors review the value of big data research and the work that needs to be done for big data in sociology. They first define big data and then discuss the following three big data sources:

  1. Digital Life: Online behavior from platforms like Facebook, Twitter, Google, etc.
  2. Digital Traces: records of actions, such as call detail records, which capture that an action occurred rather than the action itself.
  3. Digitalized Life: Google Books, or phones that identify proximity using Bluetooth.

The authors believe that the availability of these forms of data, along with the tools and techniques required to access it, gives sociologists the opportunity to answer various age-old and new questions. To this end, the authors describe the opportunities available to sociologists in the form of massive behavioral data, data obtained through nowcasting, data obtained through natural and field experiments, and data available on social systems. The authors then discuss the pros and cons of sampling the available big data. They also mention the vulnerabilities that exist, such as the sheer volume of data, the generalizability of data, the platform dependence of data, the failure of the ideal-user assumption, and ethical issues in big data research. In conclusion, the authors mention a few future trends, knowledge of which will help sociologists succeed in big data research.

Reflections:

In [1], the authors ask various questions around the theme ‘Will big data and the research that surrounds it help society?’ I like the definition of big data as a ‘socio-technical’ phenomenon. I also like the thought provoked by the use of the term ‘mythology’ in the formal definition of big data. The big data paradigm and its rise to fame do somewhat revolve around the belief that the sheer volume of data provides new, true, and accurate insights. This gives rise to the question, ‘Do we sometimes try to find or justify false trends just because we have big data?’ I like the example with which they illustrate the platform dependence of social data: a person’s social network on Facebook may not be the same as on Twitter, by virtue of the fact that the data is different. This could be for a lot of reasons, the most basic being that some users may not be present on both sites. This gives rise to another question: ‘What about the population that is not on any social site?’ That chunk of the population is not being considered in any of these studies. Also, the very fact that ease of access to data is sometimes valued over the quality of the data raises concerns. I also like that the authors address the quantitative nature of big data research and the importance of context. I appreciate the section in which they discuss how this big data is held by only a few organizations and the ‘Big Data Rich’ and ‘Big Data Poor’ divide this creates. This has to be considered to facilitate successful big data research.

In [2], I appreciate the definition of big data that the authors provide. Big data is indeed a mix of computer science tools and social science questions. The authors mention that sociologists need to learn how to leverage the tools and techniques provided by computer scientists to make breakthroughs in their research. This makes for an excellent collaboration, in which computer scientists leverage the questions and research expertise of social scientists, and social scientists leverage the tools and techniques developed for gaining insights from big data. I like the way the authors describe big data archives as depicting actual behavior “in principle”. Although there are instances that show positive results when studying behavior using such data, the question that arises is ‘How genuine is this online behavior?’ Many factors play a role in these studies, and the biases present in the data have to be considered. If data from social networks is being used, one of the most basic examples of bias is the ideal-user assumption highlighted in the paper. Moreover, the veracity of the data has to be considered as well. Another important bias mentioned in the paper arises from incorrect sampling of the data. I realize that a sample of the big data can provide valuable insights; however, this raises the question, ‘What methods can be applied to sample the data without bias?’ I appreciate the effort the authors have invested in providing many case-study examples to emphasize the points they make in the review. This provokes thought about the vulnerabilities and the work that has to be done to make big data research as ethical and methodical as possible.
