Concepts: Masterful AutoML Platform

Masterful is an AutoML platform for the data and training stages of deep learning. It improves your model’s accuracy, without requiring more labels, through robust implementations of semi-supervised learning, contrastive learning, synthetic data, drop-in architectural optimizations, and metalearning techniques. These techniques are accessible individually through the Advanced API, but to further reduce the effort of manually tuning and experimenting with them, Masterful provides the autofit algorithm. Autofit metalearns the optimal combination of techniques so that ML developers do not need to run a manual grid search over hyperparameters.

The platform supports most types of classification, detection, and segmentation models, and includes data translation layers for many standard ground truth data formats. It also supports anchor box formats from several open source detection models, including the Google Object Detection API.

The entire platform is optimized for speed, with every operation pushed to the GPU. The platform is currently available for Tensorflow, with PyTorch support coming soon.

The rest of this document describes the broad categories of techniques employed by the Masterful platform.

Semi-Supervised Learning (SSL)

SSL means training a model from both labeled and unlabeled data. Masterful supports various approaches to SSL. These include Noisy Student Training, Contrastive Learning, and using unlabeled data for Domain Adaptation.

Noisy Student Training

Noisy Student Training extracts information from unlabeled data so that your model can learn from it and improve its accuracy. The technique draws on the design of Knowledge Distillation. Depending on the amount and quality of data, it can help or hurt a model’s accuracy, so it is important to empirically test whether it works on a specific combination of model architecture and training data. If unlabeled data is available, autofit evaluates whether Noisy Student Training improves the model’s accuracy.
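
As a rough illustration of the idea (a sketch, not the Masterful API), the code below pseudo-labels a batched unlabeled dataset with a trained teacher and then trains a noised student on the combined stream. The function and dataset names are placeholders, and the labeled dataset is assumed to yield one-hot float labels so that it can be concatenated with the teacher’s soft predictions:

    import tensorflow as tf

    def noisy_student_round(teacher, student, labeled_ds, unlabeled_ds, epochs=5):
        """One round of the noisy-student idea (illustrative sketch only).

        labeled_ds yields batched (images, one_hot_labels); unlabeled_ds yields
        batched images of the same shape and dtype.
        """
        # 1. The trained teacher assigns soft pseudo-labels to the unlabeled images.
        pseudo_ds = unlabeled_ds.map(
            lambda images: (images, teacher(images, training=False)),
            num_parallel_calls=tf.data.AUTOTUNE)

        # 2. Mix real labels and pseudo-labels into one training stream.
        combined = labeled_ds.concatenate(pseudo_ds).prefetch(tf.data.AUTOTUNE)

        # 3. Train the student with noise enabled (augmentation, dropout).
        #    Soft teacher labels work directly with categorical crossentropy.
        student.compile(optimizer="adam",
                        loss=tf.keras.losses.CategoricalCrossentropy(),
                        metrics=["accuracy"])
        student.fit(combined, epochs=epochs)
        return student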

Contrastive Learning

Contrastive learning is able to learn representations of data, without requiring labels, by creating contrastive, pretextual tasks. Pretextual means that solving the pretextual training task is not the real goal; it is merely a pretext for the real goal, which is to help a model learn representations of the data. A simple example of a pretextual task is predicting the amount of rotation applied to an image. Contrastive learning offers a far more powerful pretextual task: from a dataset of images without labels, two derivative images are created for each image, and an ML model is trained to recognize which derivatives came from the same original and which did not. Analysis of the results of contrastive learning suggests that the representations are in fact a form of clustering for unstructured data. Masterful uses contrastive learning to learn powerful feature extractors from unlabeled data, which are then fine-tuned with labels, ultimately resulting in a more accurate model trained with fewer labels.
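
A minimal sketch of a SimCLR-style contrastive setup (see Chen et al. under Further Reading); the augmentations and the temperature value are arbitrary examples, not Masterful’s implementation:

    import tensorflow as tf

    def two_views(images):
        """Create two randomly augmented 'views' of the same batch of RGB images."""
        def view(x):
            x = tf.image.random_flip_left_right(x)
            x = tf.image.random_brightness(x, max_delta=0.4)
            x = tf.image.random_saturation(x, lower=0.6, upper=1.4)
            return x
        return view(images), view(images)

    def nt_xent_loss(z1, z2, temperature=0.1):
        """SimCLR-style loss: each embedding must identify its paired view
        among all the other embeddings in the batch."""
        batch = tf.shape(z1)[0]
        z = tf.math.l2_normalize(tf.concat([z1, z2], axis=0), axis=1)   # [2B, D]
        logits = tf.matmul(z, z, transpose_b=True) / temperature        # [2B, 2B]
        logits = logits - tf.eye(2 * batch) * 1e9   # mask self-similarity
        targets = tf.concat([tf.range(batch) + batch, tf.range(batch)], axis=0)
        return tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
            targets, logits, from_logits=True))

In practice, the embeddings z1 and z2 come from an encoder followed by a small projection head; after pre-training, the projection head is discarded and the encoder is fine-tuned with the available labels.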

Domain Adaptation

Domain adaptation is an approach to training an existing model to perform well on data from a similar but new domain. For example, a new domain could contain images of the same objects but from a new data provider, or captured in the evening instead of during the day, or landscapes from a different geographic area. Masterful can use unlabeled data to help adapt the existing model to achieve high accuracy in the new domain with a minimal amount of labeled data from the new domain.
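
A minimal sketch of the labeled-data side of this, assuming the existing model is a Keras classifier and a small labeled tf.data.Dataset from the new domain is available; the unlabeled new-domain data can additionally be folded in through the pseudo-labeling or contrastive approaches described above. This is illustrative only, not the Masterful API:

    import tensorflow as tf

    def adapt_to_new_domain(existing_model, new_domain_labeled_ds, epochs=10):
        """Reuse the features learned on the original domain; fine-tune only the
        last layers on the small labeled set from the new domain (sketch only)."""
        # Freeze most of the network so the small new-domain dataset cannot
        # wash out the representations learned on the original domain.
        for layer in existing_model.layers[:-2]:
            layer.trainable = False
        existing_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                               loss="categorical_crossentropy",
                               metrics=["accuracy"])
        existing_model.fit(new_domain_labeled_ds, epochs=epochs)
        return existing_model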

Generative Techniques

Masterful builds generative models from original labeled and unlabeled training data using techniques such as generative adversarial networks (GANs) and neural style transfer (NST). These models are then run to generate novel training data with additional variation that expands the original training distribution. The ratio of synthetic data to original data in the final training set is critical, so Autofit determines the optimal blending ratio.
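
The blending itself is straightforward once the synthetic data exists. The sketch below mixes two tf.data pipelines at a fixed ratio; the ratio value is only a placeholder for what autofit would determine:

    import tensorflow as tf

    def blend_datasets(original_ds, synthetic_ds, synthetic_ratio=0.25):
        """Draw a fixed fraction of training examples from the generated data.
        The 0.25 here is a placeholder; in Masterful, autofit metalearns this
        blending ratio."""
        # TF >= 2.7; older versions expose tf.data.experimental.sample_from_datasets.
        return tf.data.Dataset.sample_from_datasets(
            [original_ds, synthetic_ds],
            weights=[1.0 - synthetic_ratio, synthetic_ratio])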

Conventional Augmentations

Augmenting data is a best practice to reduce overfitting without requiring more labels. Conventional augmentations encompass two major forms of adjusting pixel values: moving pixels around according to geometric rules such as zooming or rotating (spatial augmentations), or slightly changing pixel values such as brightness or saturation (color jitter). But conventional augmentations suffer from several issues, discussed in the subsections below.
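
For reference, a minimal example of one augmentation of each kind in plain TensorFlow; the magnitudes are arbitrary illustrations, which is precisely the hyperparameter problem described next:

    import tensorflow as tf

    def conventional_augment(image):
        """One spatial augmentation and one color jitter on a single RGB image.
        Magnitudes are arbitrary examples, not tuned values."""
        # Spatial: random horizontal mirror plus a mild random zoom,
        # implemented as crop-and-resize.
        image = tf.image.random_flip_left_right(image)
        shape = tf.shape(image)
        crop = tf.image.random_crop(
            image, [shape[0] * 9 // 10, shape[1] * 9 // 10, shape[2]])
        image = tf.image.resize(crop, [shape[0], shape[1]])
        # Color jitter: small random brightness and saturation changes.
        image = tf.image.random_brightness(image, max_delta=0.1)
        image = tf.image.random_saturation(image, lower=0.9, upper=1.1)
        return image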

Setting Hyperparameters

The hyperparameters controlling the magnitude of each augmentation are usually set using a heuristic, such as mirroring data 50% of the time. While such a heuristic is appropriate for ImageNet, mirroring MNIST digits or street signs has been shown to hurt model accuracy, for the obvious reason that a mirrored digit or sign is no longer a valid example of its class.

The two state-of-the-art metalearning approaches to eliminating these heuristics are not useful in practice. AutoAugment’s search trains on the order of 15,000 candidate policies, so if a model takes 1 hour to converge, the search would require roughly 625 days. RandAugment delivers nearly acceptable performance with a search space on the order of 100, but only because many heuristics appropriate for ImageNet (such as mirroring) are baked in.

To resolve the problem of heuristics and unusable metalearning algorithms, Masterful draws on concepts from AutoAugment, Frechet Inception Distance, and adversarial learning. The result is a two-pass metalearning algorithm that can analyze two orders of magnitude more search space in very little wall-clock time, because the first-pass analysis only requires inference to cluster transformations. The second-pass search, which requires full training runs, is then reduced to analyzing a single-digit number of clusters. Autofit explores this search space automatically.
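
The exact algorithm is Masterful’s own, but the flavor of the inference-only first pass can be sketched: each candidate transformation is scored by how far it moves the training data in a pretrained feature space, using a Fréchet-style distance, and transformations with similar effects are grouped so that the second pass only needs full training runs on a few cluster representatives. The helper names and the choice of feature network below are illustrative assumptions, not Masterful’s implementation:

    import numpy as np
    import tensorflow as tf
    from scipy import linalg

    # A pretrained feature extractor; images are assumed to already be resized
    # to 299x299 and preprocessed for InceptionV3.
    feature_model = tf.keras.applications.InceptionV3(
        include_top=False, pooling="avg", input_shape=(299, 299, 3))

    def feature_stats(images):
        """Mean and covariance of deep features for a sample of images."""
        feats = feature_model.predict(images, verbose=0)
        return feats.mean(axis=0), np.cov(feats, rowvar=False)

    def frechet_distance(mu1, cov1, mu2, cov2):
        """Fréchet distance between Gaussians fitted to feature activations."""
        covmean = linalg.sqrtm(cov1 @ cov2).real
        return float(np.sum((mu1 - mu2) ** 2)
                     + np.trace(cov1 + cov2 - 2.0 * covmean))

    def score_transform(transform_fn, images):
        """First pass: score a candidate transform using inference only."""
        mu0, cov0 = feature_stats(images)
        mu1, cov1 = feature_stats(transform_fn(images))
        return frechet_distance(mu0, cov0, mu1, cov1)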

CPU Bottleneck

Augmentation implementations are typically based on Keras Image Preprocessing, cv2, or Pillow, meaning the operations run on the CPU, creating a bottleneck.

To greatly improve the speed of the conventional augmentation pipeline during training, operations are pushed to the GPU. Internally, this requires a ground-up implementation of every image augmentation in pure Tensorflow. Pillow and cv2 are not used. Many core design problems of those libraries are also resolved, such as eliminating non-convex combinations of magnitudes. Then, a unique scaffold model approach is applied, whereby the base model is wrapped by custom Keras layers that handle augmentations on the GPU.
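
In the spirit of that scaffold approach (illustrative only, not Masterful’s internal implementation), standard Keras preprocessing layers can be used to wrap a base model so that augmentation executes on the GPU as part of the training graph:

    import tensorflow as tf

    def wrap_with_gpu_augmentation(base_model, input_shape):
        """Wrap a model with augmentation layers that run on the GPU during
        training and are automatically disabled at inference time."""
        inputs = tf.keras.Input(shape=input_shape)
        x = tf.keras.layers.RandomFlip("horizontal")(inputs)
        x = tf.keras.layers.RandomRotation(0.05)(x)
        x = tf.keras.layers.RandomContrast(0.1)(x)
        outputs = base_model(x)
        return tf.keras.Model(inputs, outputs)

The wrapped model is compiled and fit exactly like the original, so the augmentation cost moves off the CPU input pipeline and onto the accelerator.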

Spatial Augmentations and Bounding Boxes

The correct way to modify a bounding box under a spatial transformation is unknowable without a segmentation mask. At the extremes, an object that perfectly fills its box as an axis-aligned rectangle forces the box to expand under a rotation, while an object shaped like an axis-aligned cross allows the box to shrink.

Autofit tries multiple priors to discover the optimal estimate of the shapes inside a bounding box.
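
The two extremes are easy to verify numerically. The snippet below (plain NumPy, for illustration) rotates both shapes by 45 degrees and measures the axis-aligned box needed afterwards: the rectangle’s box grows by roughly 41%, while the cross’s box shrinks by roughly 29%:

    import numpy as np

    def rotated_aabb(points, degrees):
        """Width and height of the axis-aligned box around points after rotation."""
        theta = np.deg2rad(degrees)
        rot = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
        rotated = points @ rot.T
        return rotated.max(axis=0) - rotated.min(axis=0)

    # Both shapes start inside the same 2x2 bounding box.
    rectangle = np.array([[-1, -1], [1, -1], [1, 1], [-1, 1]], dtype=float)  # fills the box
    cross = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]], dtype=float)        # arm tips only

    print(rotated_aabb(rectangle, 45))  # ~[2.83, 2.83]: the box must grow
    print(rotated_aabb(cross, 45))      # ~[1.41, 1.41]: the box can shrink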

Other Techniques

Masterful also provides several other techniques.

Soft-Label Augmentations

A new generation of techniques includes both the image and the probability distribution of the classification labels. Masterful includes these techniques as part of the base set of augmentations to search over in autofit.
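
mixup (Zhang et al., under Further Reading) is a representative example: pairs of images are blended, and their one-hot labels are blended by the same coefficient, so the target becomes a probability distribution rather than a single class. A minimal sketch, assuming batched float images and one-hot labels:

    import tensorflow as tf

    def mixup(images, one_hot_labels, alpha=0.2):
        """Blend random pairs of examples and their label distributions."""
        batch = tf.shape(images)[0]
        # Sample Beta(alpha, alpha) mixing coefficients via two Gamma draws.
        g1 = tf.random.gamma([batch], alpha)
        g2 = tf.random.gamma([batch], alpha)
        lam = g1 / (g1 + g2)
        lam_img = tf.reshape(lam, [-1, 1, 1, 1])
        lam_lbl = tf.reshape(lam, [-1, 1])
        # Pair every example with a shuffled partner and interpolate both.
        idx = tf.random.shuffle(tf.range(batch))
        mixed_images = lam_img * images + (1.0 - lam_img) * tf.gather(images, idx)
        mixed_labels = lam_lbl * one_hot_labels + (1.0 - lam_lbl) * tf.gather(one_hot_labels, idx)
        return mixed_images, mixed_labels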

Maximum A Posteriori Calibration

Ensembling a perfectly calibrated prior distribution can help regularize a model’s confidence. This is an extension of Label Smoothing Regularization to non-classification and imbalanced scenarios, drawing on a recent understanding that ensembling is related to Maximum A Posteriori (MAP) estimation. This technique is empirically explored in autofit.
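
Classic Label Smoothing Regularization is the starting point: each one-hot target is mixed with a uniform prior, which in Keras is a single loss argument. The extension described above swaps the uniform prior for a calibrated one; the helper and prior below are only placeholders for illustration:

    import tensorflow as tf

    # Standard label smoothing against an implicit uniform prior.
    loss = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)

    # Smoothing against an arbitrary (e.g. calibrated or class-frequency) prior.
    def smoothed_targets(one_hot_labels, class_prior, epsilon=0.1):
        """Blend hard labels with a prior distribution, weight epsilon on the prior."""
        return (1.0 - epsilon) * one_hot_labels + epsilon * class_prior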

Ensembling

A simple way to improve the accuracy of a model is to train N of them and take the average prediction. This approach is used both in practice, where many Kaggle competition winners use ensembles, and in high quality research such as the evaluation of Inception V3. It is generally superior to constructing a single complex model by repeating a base architecture N times and training it once. It is hypothesized that training N base models separately creates a regularization effect. In production systems, ensembling is limited by hardware constraints. This technique is accessed through the Advanced API but is not searched over in autofit.
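
Averaging the predictions of N independently trained models is a one-liner at inference time; the hardware cost comes from having to run all N forward passes:

    import tensorflow as tf

    def ensemble_predict(models, images):
        """Average the softmax outputs of N independently trained models."""
        predictions = [model(images, training=False) for model in models]
        return tf.reduce_mean(tf.stack(predictions, axis=0), axis=0)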

Knowledge Distillation

Training a base model directly is generally inferior to training a more powerful model and then distilling its knowledge into the base model’s architecture. This technique is called Knowledge Distillation. Surprisingly, the distilled model retains most of the improvements of the larger model. This technique is accessed through the Advanced API but is not searched over in autofit.
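
The core of the technique is the distillation loss of Hinton et al. (under Further Reading): the student is trained to match the teacher’s temperature-softened outputs in addition to the ground-truth labels. A sketch, assuming both models output logits:

    import tensorflow as tf

    def distillation_loss(teacher_logits, student_logits, one_hot_labels,
                          temperature=4.0, alpha=0.9):
        """Weighted sum of a soft (teacher-matching) and a hard (label) loss."""
        soft_targets = tf.nn.softmax(teacher_logits / temperature)
        soft_loss = tf.keras.losses.categorical_crossentropy(
            soft_targets, student_logits / temperature, from_logits=True)
        hard_loss = tf.keras.losses.categorical_crossentropy(
            one_hot_labels, student_logits, from_logits=True)
        # The soft term is scaled by T^2 so its gradient magnitude stays comparable.
        return (alpha * temperature ** 2 * tf.reduce_mean(soft_loss)
                + (1.0 - alpha) * tf.reduce_mean(hard_loss))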

Self Distillation

Traditionally, a model is initialized with random weights and then trained to optimal weights using a variant of Stochastic Gradient Descent. In the self distillation approach, a model is first trained with this traditional approach and then run in inference mode to generate predictions. These predictions are then used as ground truth labels to train the model a second time. This approach is related to Knowledge Distillation, but it does not modify the model architecture and it seeks to improve accuracy, rather than simply accepting a minor loss of accuracy. Self distillation improves model accuracy, suggesting that one of the drivers of Knowledge Distillation is not the larger teacher but the distillation process itself. This concept is referred to as Dark Knowledge. This technique is automatically explored in autofit.
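
A minimal sketch of one generation of self distillation, in the born-again style (Furlanello et al., under Further Reading); make_model is assumed to be a function that builds a fresh copy of the same architecture, and the datasets are assumed to be batched:

    import tensorflow as tf

    def self_distill(make_model, labeled_ds, images_only_ds, epochs=10):
        """Train once on real labels, then retrain the same architecture from
        scratch on the first model's own predictions (sketch only)."""
        # Generation 0: conventional supervised training.
        first = make_model()
        first.compile(optimizer="adam", loss="categorical_crossentropy")
        first.fit(labeled_ds, epochs=epochs)

        # Relabel the training images with the trained model's soft predictions.
        relabeled = images_only_ds.map(
            lambda images: (images, first(images, training=False)),
            num_parallel_calls=tf.data.AUTOTUNE)

        # Generation 1: a fresh model of the same architecture, trained on the
        # soft labels alone.
        second = make_model()
        second.compile(optimizer="adam", loss="categorical_crossentropy")
        second.fit(relabeled, epochs=epochs)
        return second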

Further Reading

Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. “Self-training with noisy student improves imagenet classification.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687-10698. 2020.

Suman Ravuri, and Oriol Vinyals. “Seeing is not necessarily believing: Limitations of biggans for data augmentation.” (2019).

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. “A simple framework for contrastive learning of visual representations.” In International conference on machine learning, pp. 1597-1607. PMLR, 2020.

Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. “Autoaugment: Learning augmentation policies from data.” arXiv preprint arXiv:1805.09501 (2018).

Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. “Born again neural networks.” In International Conference on Machine Learning, pp. 1607-1616. PMLR, 2018.

Lars Kai Hansen and Peter Salamon. “Neural network ensembles.” IEEE transactions on pattern analysis and machine intelligence 12, no. 10 (1990): 993-1001.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).

Zhilu Zhang and Mert R. Sabuncu. “Self-distillation as instance-specific label smoothing.” arXiv preprint arXiv:2006.05065 (2020).

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. “Gans trained by a two time-scale update rule converge to a local nash equilibrium.” Advances in neural information processing systems 30 (2017).

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. “mixup: Beyond empirical risk minimization.” arXiv preprint arXiv:1710.09412 (2017).

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. “Cutmix: Regularization strategy to train strong classifiers with localizable features.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023-6032. 2019.