Reflection #2 – [1/23] – [Jamal A. Khan]

Well, let me start by saying that this was a fun read, and the very first takeaway is that I now know how to “take away money” from people!

Anyhow, moving on to a more serious note: since the prime motive of the paper is to analyze the language used to pitch products/ideas, and since videos (or their content) are a good indicator of Funded or not Funded, what effect does the implicit racial bias of the crowd have? More concretely:

  • What effect does the race of both the person pitching and the crowd have?
  • Do people tend to fund people of the same racial group more?

Another aspect I would like to investigate is the crowd itself: what do its statistics look like per funded project, and how do they vary across projects? Can we find some trend there?

The paper more or less gives evidence for the intuitive insights, both from the literature and from common sense. For example, people/contributors don’t stand to make a profit or reap monetary benefit from the project, but given some form of “reciprocation” there is added incentive for them to contribute beyond simply liking the project. Sometimes this takes the form of something tangible like a free t-shirt, and at other times it’s merely a mention in the credits, but the important point is that people are getting something in return for their funding. Another prominent insight is “scarcity”, i.e. the desire to have something that is unique and limited to only a few people. Tapping into that emotion of exclusivity and adding in personalization is a good way to secure some funding.

However, not all is well! As some others also noticed, there are spurious phrases in Tables 3 and 4 that seem as if they should belong to the other category, e.g.:

  • “trash” was in funded with $\beta$ = 2.75
  • “reusable” was in not funded with $\beta$ = -2.53

There were also some phrases that made no sense in either category, e.g. “girl and” was in funded with $\beta$ = 2.0? I suspect this highlights a flaw in, or a poor choice of, the classifier. What would be a better classifier? Something like word embeddings, where the embeddings can be ranked?
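
What I have in mind is something like the rough sketch below: represent each phrase by the average of pretrained word vectors and rank phrases by their projection onto the direction a linear classifier learns, so that semantically similar phrases get similar scores instead of being treated as independent indicators. This is my own sketch, not the paper’s method; the embedding lookup (e.g. GloVe loaded into a plain dict) and all names here are hypothetical.

```python
# Sketch only: rank phrases along a learned "funded" direction in embedding space.
# `embeddings` is a hypothetical dict mapping word -> NumPy vector (e.g. GloVe),
# `phrases` is a list of strings and `labels` marks funded (1) / not funded (0).
import numpy as np
from sklearn.linear_model import LogisticRegression

def phrase_vector(phrase, embeddings, dim=100):
    vecs = [embeddings[w] for w in phrase.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def rank_phrases(phrases, labels, embeddings, dim=100):
    X = np.vstack([phrase_vector(p, embeddings, dim) for p in phrases])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    # Score = projection of each phrase onto the learned coefficient vector.
    scores = X @ clf.coef_.ravel()
    return sorted(zip(phrases, scores), key=lambda t: -t[1])
```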

Moving on to the model summaries provided:

It’s quite evident that the phrases provide quite a big boost in capturing the distribution of the dataset, so this makes me wonder how a phrases-only model would perform. My guess is that its performance would be closer to the phrases + controls model than to the controls-only model. Going off on a tangent: suppose we don’t use logistic regression and opt for something a bit more advanced, e.g. sequence models or LSTMs, to predict the outcome; would that model turn out better than the phrases + controls model? Also, will this model stand the test of time? That is, as language and marketing trends evolve, will it hold true, say, 6-10 years from now? Since the paper is from 2014 and the data from 2012-2014, does the model hold true right now?
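
As a rough sketch of how that ablation could be run (a scikit-learn stand-in, not the paper’s exact pipeline), assuming precomputed feature matrices `X_phrases` and `X_controls` (hypothetical names) and a funded/not-funded label vector `y`:

```python
# Sketch only: cross-validated accuracy for controls-only, phrases-only and combined models.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def compare_feature_sets(X_phrases, X_controls, y):
    variants = {
        "controls only": X_controls,
        "phrases only": X_phrases,
        "phrases + controls": np.hstack([X_phrases, X_controls]),
    }
    for name, X in variants.items():
        acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
        print(f"{name:>20}: mean CV accuracy = {acc:.3f}")
```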

Another thing the authors mentioned that caught my attention is the use of social media platforms, which raised quite a few questions:

  • How does linking to Facebook affect funding? Does it build more trust among backers because it provides a vague sense of legitimacy?
  • Does the choice of social media platform matter, i.e. Facebook vs. Instagram?
  • Does the language of the posts have similar semantics, or is it more clickbait-ish?
  • What effect does the frequency of posts have?
  • Does the messaging service of Facebook pages help convince wary people to contribute?

This might make for a good term project.

I would also like to raise a few technical questions regarding the techniques used in the paper:

  • Why penalized logistic regression? Why not more modern deep learning techniques, or even other statistical models, e.g. Naïve Bayes or multi-kernel SVMs?
  • What is penalized in penalized logistic regression? Does it refer to a regularizer added to the RSS or to the likelihood? (My best guess is sketched right after this list.)
  • I understand that Lasso results in automatic feature selection, but a comparison with other shrinkage/regularization techniques is missing. Hence, the choice of regularization method seems more forced than justified.
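
My best guess, which the paper does not spell out, is that the penalty is attached to the (negative) log-likelihood of the logistic model; an RSS term only appears in the linear-regression version of the Lasso:

$$\hat{\beta} = \arg\min_{\beta}\; -\sum_{i=1}^{n}\Big[\, y_i \log \sigma(x_i^\top \beta) + (1-y_i)\log\big(1-\sigma(x_i^\top \beta)\big) \Big] + \lambda \lVert \beta \rVert_1$$

where $\sigma$ is the logistic function and $\lambda$ controls the amount of shrinkage. Swapping the $\ell_1$ term for $\lambda\lVert\beta\rVert_2^2$ would give the ridge analogue, which shrinks coefficients but never zeroes them out, so presumably the Lasso was preferred precisely because it yields a short, readable list of phrases.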

Finally, and certainly most importantly, I’m glad that this paper recognizes that:

“Cats are the Overlords of the vast Internets, loved and praised by all and now boosters of Kick Starter Fundings”

Reflection #1 – [01/18] – [Jamal A. Khan]

Both of the assigned papers:

  • “Critical Questions for Big Data” and
  • “Data ex Machina”,

present a social-science perspective on Big Data and how it affects, or can and will affect, the social sciences. The former, i.e. “Critical Questions for Big Data”, focuses more on the issues that big data might raise, which the authors have put into six buckets, while the latter offers a more general overview of what big data is, its sources, and the resulting opportunities.

From both Boyd & Crawford’s and Lazer & Radford’s descriptions, I took away that big data is less about size or volume and more about being insightful. It is so named because, given the humongous scale of the data, we now have enough computational power and good enough tools to draw meaningful insights that are closer to the ground truth than ever before.

I find Lazer & Radford’s mention of passive instrumentation quite appealing. Compared to techniques like self-reporting and surveys, which may include the subjects’ inherent biases or voluntary lies, big data offers the opportunity to passively collect data and generate insights that are closer to the ground truth of observed social phenomena. Borrowing words from the authors themselves: “First, it acts as an alternative method of estimating social phenomena, illuminating problems in traditional methods. Second, it can act as a measure of something that is otherwise unmeasured or whose measure may be disputed. Finally, nowcasting demonstrates big data’s potential to generate measures of more phenomena, with better geographic and temporal granularity, much more cost-effectively than traditional methods”.

For the rest of the reflection I’ll focus on the issues raised by big data, as they not only seem more involved, interesting, and challenging to tackle, but also raise more questions.

An interesting point that Boyd and Crawford make is that big data changes how we look at information and may very well change the definition of knowledge. They then go on to suggest that we should not be dismissive of older disciplines such as philosophy, because they might offer insights that the numbers alone miss. However, they fail to give a convincing argument for why this should be so. If we are observing certain patterns of interaction or certain trends in populace behavior, is it really all that necessary to rely on philosophy to explain the phenomena? Or can the reason also be mined from the data itself?

Another issue that caught my attention in both articles is that one can get lost in numbers and start to see patterns where none exist; solely relying on the data, thinking it to be self-explanatory, is therefore naive. The techniques and tools employed may very well make the interpretation subjective rather than objective, which may reinforce certain viewpoints even though they have no real basis. An example of this is the surprising correlation between the S&P 500 stock index and butter production in Bangladesh. So this begs the question: is there a certain degree of personal skill or artistry involved in data analysis? I personally haven’t come across such examples but would love to know more if others in the class have.
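
As a toy illustration of my own (not from either article), two independent random walks routinely show a large sample correlation even though they are, by construction, unrelated:

```python
# Sketch only: "patterns where none exist" — correlation between two unrelated random walks.
import numpy as np

rng = np.random.default_rng(0)
walk_a = rng.normal(size=500).cumsum()    # stand-in for, say, a stock index
walk_b = rng.normal(size=500).cumsum()    # stand-in for, say, butter production
print(np.corrcoef(walk_a, walk_b)[0, 1])  # frequently far from zero across seeds
```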

I like provocation #3 made by Boyd and Crawford, “Bigger data are not always better data”. It is all too common and easy to be convinced that a large enough dataset captures the true distribution of the data, and being someone who works with machine learning quite often, I too am guilty of it. In actuality, the distribution might not be representative at all! If the methodology is flawed, suffers from unintended biases, and/or the data comes from a source that is inherently limited, then making claims about behaviors and trends can be misleading at best. The Twitter “firehose” vs. “gardenhose” and “user” vs. “people” vs. “activity” examples given by Boyd and Crawford are a perfect demonstration of this. The consequences can be far-reaching, especially when learning algorithms are fed such biased/limited data: the decisions made by these systems will in turn reinforce the biases or provide flawed and limited insights, which is most certainly something we want to avoid! So the question becomes: how do we systematically find these biases and limitations in order to rectify them? Can all biases even be eliminated? This also leads to an even more basic question: how does one even go about checking whether the data is representative in the first place?
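
One narrow but concrete starting point I can think of for the representativeness question is to compare a categorical marginal of the dataset against a reference (e.g. census) distribution along a dimension we do know, using a chi-squared goodness-of-fit test; the counts below are made up purely for illustration:

```python
# Sketch only: does one marginal of the sample match a known reference distribution?
import numpy as np
from scipy.stats import chisquare

sample_counts = np.array([620, 250, 130])       # e.g. observed counts per age bucket
reference_props = np.array([0.45, 0.35, 0.20])  # same buckets in the target population
expected = reference_props * sample_counts.sum()

stat, p_value = chisquare(f_obs=sample_counts, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.3g}")  # a tiny p-value flags a non-representative marginal
```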

Finally, I disagree (to a certain extent) with Boyd and Crawford’s discussion on ethics and the notion of big-data rich and poor. I would certainly like to discuss these two in class and know what other people think.
