LEAP: LEAf node Predictions in the wild

Published in 2nd ASCS Applied Science Workshop, 2021

Citation: Abhishek Divekar, Vinayak Puranik, Zhenyu Shi, Jinmiao Fu, and Nikhil Rasiwasia. "LEAP: LEAf node Predictions in the wild". 2nd ASCS Applied Science Workshop, 2021 (internal)

The data available in Amazon's catalog is rich and diverse; however, it is also highly irregular and often challenging to employ directly for business or Machine Learning applications. Frequent issues include low fill-rate of catalog attributes, noise in attributes, dataset shift between training and real-world distributions, and potential abuse in externally-sourced fields such as Generic Keywords and Browse Node. In this paper, we work backward from the goal of building high-precision classifiers to predict “Leaf Nodes” of Amazon's Browse taxonomy, to address the issue of purposeful or accidental mis-noding in the face of the aforementioned challenges. Our findings indicate that weakly-supervised datasets collected using intuitive filters, based on Glance Views (GVs) and Total Orders, are effective in eliminating potential noise in the training data (2-4% improvement in accuracy). Further, evaluating a curated set of algorithms illustrates problems inherent in weak supervision that affect both linear models and pre-trained Transformer architectures. To address these problems, we explore multi-modal ensembling and show that ensembles combining multiple information sources outperform models trained on a single modality (additional 2-5% improvement in accuracy). Finally, we describe our success deploying these models on the IN marketplace to automatically correct Leaf Nodes for high-GV and 0-GV products, which has led to >3.5X improvement in audit efficiency and 5.5MM Leaf Node corrections overall.

Citation: Abhishek Divekar, Vinayak Puranik, Zhenyu Shi, Jinmiao Fu, and Nikhil Rasiwasia. “LEAP: LEAf node Predictions in the wild”. 2nd ASCS Applied Science Workshop, 2021 (Oral)