- “Predicting Sales from the Language of Product Descriptions”
Since I’ll be presenting one of the papers for tomorrows(4/10) class, I’ll be writing a reflection for the other paper only.
I thoroughly enjoyed reading this paper for two reasons:
- The idea of using a neural network for feature selection.
- coverage of different aspects of results.
To start off I’ll go against my first reason of liking the paper because i think the comment needs to be made. Neural networks (NN) by design are more or less feature generators, not feature selectors. Hence, using one for selection of features seems pretty counter intuitive e.g. When Yan Lecun Convolutions Neural Networks have been so wildly successful because they’re able to automatically learn filters detect features like vertical edge, shapes or contours without being told what to extract. Thinking along these lines the paper seems pretty ill-motivated because one should be able to use the same gradient reversal technique and attention mechanism in a sequence or convolutional model to abstract out the implicit effects of confounds like brand loyalty and pricing strategies. So why didn’t they do it?
The answer, or atleast one of the reasons that is very straight forward, is interpretability. While there’s a good chance that the NN will generate features that are better than any of the handmade ones, they won’t make much sense to us. This is why is like the authors idea of leveraging the NNs power to select features instead of having it engineering them.
Coming onto the model, i believe that the authors could’ve done a better job at explaining the architecture. It’s a convention to state the input and output shapes that the layers take in and spit out, a very good example of which is one of the most popular architectures, inception V3. It took me a bit to figure out the dimensionality of the attention layer.
Also the authors do little to comment on the extensiblity of the study to different type of products and languages. So how applicable is the same/similar model to let’s say English which has very different grammatical structure? Also since the topic is feature selection, can a similar technique be used to rank other features i.e. something not textual? as a transactional thought, I think the application is limited to text.
While it’s all well and good that the authors want to remove the effects of confounds, the only thing that the paper has illustrated is the models ability to select good tokens. I think the authors themselves have underestimated the model. Models having LSTM layers followed by attention layers to generate summary encodings are able to perform language translation (which is a very difficult learning task for machines), hence by intuition i would say that the model would’ve been able to detect what sort of writing style attracts most customers. So my questions is that when the whole idea is to see mine features of reviews to help better sell a product, why was language style completely ignored?
Just as food for thought for people who might be into deep learning (read with a pinch of salt though) . I think the model is an overkill and the k-nearby-words method of training skipgram embeddings (the method used for Word2Vec generation) would’ve been able to do the same and perhaps more efficiently. The only thing that would need to be modified would be the loss function, where instead of trying to find vector representation that capture similarity of words only we would introduce the notion of log(sales). This way the model would capture both words that are similar in meaning and sale power. Random ideas like the one i’ve proposed need to be tested though, so you can probably ignore it.
Finally, sections like Neural Network layer reviews add nothing and break the flow. Perhaps these have been included to increase the length because the actual work done could be concisely delivered in 6 pages. I agree with John’s comment that this seems more like a workshop paper (a good one though).
One last thing that i forgot to put into the reflection (and am too lazy to restructure now) is that this line of work isn’t actually novel either. Interested readers should check out the following paper from Google. Be warned though it’s a tough read but if you’re good with feed forward NN math, you should be fine.