CPP Multimodal AutoML Corpus and Benchmark
Published in 1st Workshop on MultiModal Learning and Fusion, Amazon Machine Learning Conference, 2021
Citation: Andrew Borthwick, Abhishek Divekar, Nick Erickson, Fayaz Ahmed Farooque, Oleg Kim, Nikhil Rasiwasia, and Ethan Xu. "The CPP Multimodal AutoML Corpus and Benchmark". 1st AMLC Workshop on MultiModal Learning and Fusion at the 9th conference of Amazon Machine Learning (AMLC 2021) (internal)
A collection of 40 binary classification datasets was acquired from an integrated AutoML, active learning, and human labeling system for Amazon products known as “CPP AutoML”. All datasets share an identical schema of 39 attributes, including numeric, categorical, text, image, and date attributes. Each dataset represents a real business problem that is being solved by the CPP AutoML platform. In this paper, we discuss the construction and structure of this corpus. We also discuss the challenges of evaluating AutoML algorithms against the “hands off the wheel” business requirements of CPP AutoML. Finally, we present short descriptions and benchmark metrics over this corpus for a collection of algorithms: the CPP AutoML production baseline, two common machine learning baselines, two publicly available Amazon AutoML solutions (AutoGluon and SageMaker AutoPilot), and two novel AutoML solutions. The novel AutoML solutions exhibit particularly strong performance relative to the production baseline. One of these solutions exercises new functionality added to AutoGluon that stacks and ensembles a ResNet image model, fine-tuned on raw image features, with the tabular models already trained in standard AutoGluon. The other solution, EPS-Ensemble, combines standard gradient-boosted trees and logistic regression models with Transformer networks pre-trained on the Amazon catalog using different self-supervised objectives.
Citation: Andrew Borthwick, Abhishek Divekar, Nick Erickson, Fayaz Ahmed Farooque, Oleg Kim, Nikhil Rasiwasia, and Ethan Xu. “The CPP Multimodal AutoML Corpus and Benchmark”. 1st AMLC Workshop on MultiModal Learning and Fusion at the 9th conference of Amazon Machine Learning (AMLC 2021) (Oral)