Concepts: Architecture and Data Parameters¶

Masterful introduces the DataParams and ArchitectureParams classes to describe data and models, respectively. These parameters allow the Masterful AutoML Platform to understand the design choices behind a model and its data. This document describes the general concepts behind those two classes. See the API docs for specific details.

Intro¶

In deep learning, there is not yet a standardization of the design choices that go into delivering a trained model.

For example, there are several common ranges for image data. Widely used image classification models may expect any of the following:

float between zero and one, inclusive, i.e. [0.0, 1.0]

uint8 between zero and 255, inclusive, i.e. [0,255]

float between negative one and positive one, inclusive, i.e. [-1.0, 1.0]

Masterful ultimately requires some way to understand these design choices to do its job of automatically training a model to peak accuracy.

Precedent¶

Throughout the TensorFlow API, specs are used to hold metadata about an object. For example, tf.TensorSpec is essentially a named tuple that contains dtype and shape information, tf.DeviceSpec holds information about devices, and tf.data.Dataset.element_spec holds the nested structure of a dataset's elements.

Masterful adopts this precedent to describe models and data, using a nomenclature (parameters) that is more familiar in machine learning (e.g. hyperparameters). Parameters can either describe objects statically (like the input shape of a model) or control different aspects of training (like the batch size). Thus Masterful parameters combine traditional hyperparameters, which can be changed, with static parameters, which are purely descriptive.
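
The distinction between descriptive and tunable parameters can be illustrated with a small sketch. The class below is hypothetical and written only for this illustration: it is not Masterful's ArchitectureParams or DataParams, and its field names are assumptions. See the API docs for the real classes.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ExampleModelDescription:
    """Hypothetical record, loosely in the spirit of a TensorFlow spec."""

    # Static, descriptive parameters: facts about the model that training
    # must respect but cannot change.
    input_shape: Tuple[int, int, int]  # (height, width, channels)
    input_range: Tuple[float, float]   # expected range of pixel values
    predictions_are_logits: bool

    # Tunable parameter: a traditional hyperparameter.
    batch_size: int = 64


description = ExampleModelDescription(
    input_shape=(32, 32, 3),
    input_range=(0.0, 1.0),
    predictions_are_logits=True,
)

# Changing a hyperparameter yields a new record; the static facts carry over.
tuned = replace(description, batch_size=128)
```

Freezing the dataclass mirrors the read-only nature of a spec: a different hyperparameter value produces a new description rather than mutating the old one.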

Individual Specs¶

DataParams and ArchitectureParams specify the following:

Computer Vision Task¶

The basic computer vision tasks Masterful currently supports are:

Classification: These formats are typically NC, where N is the number of elements in the minibatch and C is the number of classes. For example, classifying CIFAR-10 with a minibatch of 64 would result in a prediction and groundtruth tensor of shape [64,10]. Sparse formats like N and N1 are not recommended.
Binary Classification: These formats are typically N or N1, where N is the number of elements in the minibatch and 1 is an extra, squeezable dimension. Binary classification is technically a special type of classification, but it has many unique attributes. Most importantly for Masterful, it is possible to perform soft-label techniques with this sparse representation. It is rare to create a dense representation because many metrics, such as recall and precision, rely on the positive case having a special meaning.
Detection: These formats are typically NM4+C, where N is the number of elements in the minibatch, M is the number of objects detectable, and 4+C is a concatenation of four values that define a bounding box and the additional classification information. See below for details on detection labels.
Localization: These formats are typically NM4+X, where N is the number of elements in the minibatch, M is the number of objects detectable, and 4 are the four values that define a bounding box, and X is other data. This task is very similar to detection, except localization does not attempt to also classify the object of interest.
Semantic Segmentation: This format is typically NHWC, where N is the number of elements in the minibatch, H is the height of the image in pixels, W is the width of the image in pixels, and C is the number of classes. In this task, each pixel of an image is assigned to one of the classes of objects.
Instance Segmentation: This format is typically NMHWC, where N is the number of elements in the minibatch, M is the number of instances of objects detectable, H is the height of the image in pixels, W is the width of the image in pixels, and C is the number of classes. In this task, each pixel of an image is assigned to a specific instance of a class of objects. The key difference between this and semantic segmentation is that instance segmentation identifies each instance of an object. For example, if an image contains two cats, mask 0 would represent the first cat, and mask 1 would represent the second cat.
Keypoint Detection: This format is typically NM2, where N is the number of elements in the minibatch, M is the number of instances of keypoints detectable, and the final dimension has size 2 and represents the y and x coordinates of the keypoint.
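
As a rough illustration of the shapes listed above, the sketch below allocates placeholder label arrays with NumPy. The specific sizes (a minibatch of 64, 10 classes, up to 5 objects, 32x32 images) are assumptions chosen for the example. Localization is omitted because its extra X dimension is not specified here.

```python
import numpy as np

N = 64          # elements in the minibatch
C = 10          # number of classes (e.g. CIFAR-10)
M = 5           # objects, instances, or keypoints per image (illustrative)
H, W = 32, 32   # image height and width in pixels

label_shapes = {
    "classification": (N, C),
    "binary_classification": (N, 1),
    "detection": (N, M, 4 + C),
    "semantic_segmentation": (N, H, W, C),
    "instance_segmentation": (N, M, H, W, C),
    "keypoint_detection": (N, M, 2),
}

# Placeholder arrays with the shapes above.
labels = {
    task: np.zeros(shape, dtype=np.float32)
    for task, shape in label_shapes.items()
}

# A detection label concatenates four box values with the class information,
# so slicing the last axis recovers the two parts.
detection_boxes = labels["detection"][..., :4]
detection_classes = labels["detection"][..., 4:]
```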

Image Range and Type¶

Image input typically falls into one of these ranges, each with an associated type:

[0.0, 1.0]: floating point and inclusive of both zero and one.
[-1.0, 1.0]: floating point and inclusive of both negative one and one.
[0, 255]: unsigned integer and inclusive of zero and 255.
ImageNet: tf.keras.applications.imagenet_utils.preprocess_input defines several ImageNet-based scales. These are important preprocessing steps when using backbone models pre-trained on ImageNet as part of transfer learning.
- ‘caffe’: Channels are reordered to BGR and each channel is zero centered against the ImageNet mean for that channel, without scaling, resulting in these absolute ranges:
- - B: [-103.939, 151.061]
- - G: [-116.779, 138.221]
- - R: [-123.680, 131.320]
- ‘torch’: Images are converted to the [0,1] range, each channel is zero centered against the ImageNet mean for that channel, and normalized by the standard deviation, resulting in these absolute ranges:
- - R: [-2.117904 , 2.2489083]
- - G: [-2.0357141, 2.4285715]
- - B: [-1.8044444, 2.64 ]
CIFAR10: Models pretrained on CIFAR10 may also scale against the mean and standard deviation of that dataset.
- ‘torch’: Images are converted to the [0,1] range, each channel is zero centered against the CIFAR10 mean for that channel, and normalized by the standard deviation, resulting in these absolute ranges:
- - R: [-1.98921296, 2.05884147]
- - G: [-1.98023793, 2.12678847]
- - B: [-1.7070018 , 2.11580587]
CIFAR100: Models pretrained on CIFAR100 may also scale against the mean and standard deviation of that dataset.
- ‘torch’: Images are converted to the [0,1] range, each channel is zero centered against the CIFAR100 mean for that channel, and normalized by the standard deviation, resulting in these absolute ranges:
- - R: [-1.89570093, 1.84261682]
- - G: [-1.89746589, 2.00116959]
- - B: [-1.596523 , 2.02535313]
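
The absolute ranges above follow directly from the per-channel statistics. The sketch below reproduces the ImageNet numbers with NumPy, using the mean and standard deviation constants from Keras' imagenet_utils. The helper names are made up for this example.

```python
import numpy as np

# 'torch' mode statistics, RGB order, for images already scaled to [0, 1].
IMAGENET_MEAN_RGB = np.array([0.485, 0.456, 0.406])
IMAGENET_STD_RGB = np.array([0.229, 0.224, 0.225])

# 'caffe' mode means, BGR order, for images in [0, 255].
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68])


def torch_mode_range(mean, std):
    """Smallest and largest values after (x - mean) / std, for x in [0, 1]."""
    return (0.0 - mean) / std, (1.0 - mean) / std


def caffe_mode_range(mean_bgr):
    """Smallest and largest values after x - mean, for x in [0, 255]."""
    return 0.0 - mean_bgr, 255.0 - mean_bgr


def uint8_to_unit_range(image):
    """[0, 255] unsigned integers -> [0.0, 1.0] floats."""
    return image.astype(np.float32) / 255.0


def uint8_to_signed_unit_range(image):
    """[0, 255] unsigned integers -> [-1.0, 1.0] floats."""
    return image.astype(np.float32) / 127.5 - 1.0


torch_low, torch_high = torch_mode_range(IMAGENET_MEAN_RGB, IMAGENET_STD_RGB)
caffe_low, caffe_high = caffe_mode_range(IMAGENET_MEAN_BGR)
```

The CIFAR10 and CIFAR100 ranges can be derived the same way by substituting each dataset's per-channel mean and standard deviation.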

Sparse Groundtruth Labels versus Dense Groundtruth Labels¶

Dense labels are the familiar one-hot format. For example, to encode the output of a classifier for {cats, dogs, trucks}, this vector would represent a truck: [0, 0, 1].

To further compress data, groundtruth is sometimes encoded in sparse format, in which a single integer class index is used. Following our previous example, truck would be encoded as 2. To reduce storage and speed up disk reads, TensorFlow Datasets are typically written to disk in sparse format.

Conversion between the two formats is generally trivially performed with tf.argmax and tf.one_hot.
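
A minimal sketch of the round trip, written with NumPy so it runs without TensorFlow. Indexing an identity matrix plays the role of tf.one_hot, and np.argmax plays the role of tf.argmax. The class list is the {cats, dogs, trucks} example from above.

```python
import numpy as np

class_names = ["cat", "dog", "truck"]

# Sparse labels: one integer class index per example.
sparse_labels = np.array([2, 0, 1])  # truck, cat, dog

# Sparse -> dense: pick rows of an identity matrix (analogue of tf.one_hot).
dense_labels = np.eye(len(class_names), dtype=np.float32)[sparse_labels]

# Dense -> sparse: index of the largest entry (analogue of tf.argmax).
recovered = np.argmax(dense_labels, axis=-1)
```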

Masterful applies soft-label approaches like Label Smoothing Regularization and Knowledge Distillation, so Masterful requires dense labels except in the case of binary classification. If sparse labels are used for a task other than binary classification, Masterful will disable any techniques that require dense labels and raise a warning. Note that this definition of sparse and dense corresponds to the usage in tf.keras.losses.SparseCategoricalCrossentropy, but does not correspond to the usage in the tf.sparse package.
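
To see why soft-label techniques need dense labels, consider label smoothing. The sketch below uses the common formulation that mixes the one-hot vector with a uniform distribution. It is an illustration only, not Masterful's implementation, and the function name and epsilon value are assumptions.

```python
import numpy as np


def smooth_labels(dense_labels, epsilon=0.1):
    """Mix one-hot labels with a uniform distribution over the classes."""
    num_classes = dense_labels.shape[-1]
    return dense_labels * (1.0 - epsilon) + epsilon / num_classes


truck = np.array([[0.0, 0.0, 1.0]])  # dense label for {cats, dogs, trucks}
smoothed = smooth_labels(truck)      # roughly [[0.033, 0.033, 0.933]]
```

The smoothed target is a full distribution over the classes, which a single integer class index cannot represent.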

Predictions: Logit versus Probability¶

Computer vision models generally use a sigmoid or softmax output to generate a probability distribution. In practice, for numerical stability, loss functions operate directly on the inputs to this activation. These inputs are called logits. For example, in Keras, even if your model specifies a softmax or sigmoid activation, the built-in loss function will ignore this value, discover the underlying logits, and calculate loss against the logits. The choice of logits versus probabilities must be understood by Masterful to correctly handle soft-label techniques.
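
The numerical stability argument can be demonstrated with a small NumPy sketch. It compares cross entropy computed from probabilities against cross entropy computed directly from logits using the log-sum-exp trick. The function names are illustrative, and this is not how Keras or Masterful implement their losses.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def cross_entropy_from_probs(dense_labels, probs):
    """Naive cross entropy: takes the log of probabilities that may be 0."""
    return -(dense_labels * np.log(probs)).sum(axis=-1)


def cross_entropy_from_logits(dense_labels, logits):
    """Stable cross entropy: log-softmax via the log-sum-exp trick."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(dense_labels * log_probs).sum(axis=-1)


labels = np.array([[0.0, 0.0, 1.0]])
moderate_logits = np.array([[1.0, 2.0, 3.0]])
extreme_logits = np.array([[1000.0, 0.0, -1000.0]])

# With moderate logits, both routes agree.
stable = cross_entropy_from_logits(labels, moderate_logits)
with np.errstate(divide="ignore", invalid="ignore"):
    naive = cross_entropy_from_probs(labels, softmax(moderate_logits))

    # With extreme logits, the probability of the true class underflows to
    # zero, so the naive route is no longer finite.
    naive_extreme = cross_entropy_from_probs(labels, softmax(extreme_logits))
stable_extreme = cross_entropy_from_logits(labels, extreme_logits)
```

In the extreme case the logit-based loss remains a finite number, while the probability-based loss does not, which is why a framework needs to know which of the two a model emits.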

For regression problems, including the bounding box (but not classification) predictions from detection and localization, the output is neither a probability distribution nor a logit, but rather a scalar value. The loss applied is generally based on mean squared error, the L1 norm, the L2 norm, or a combination such as Huber loss / smooth-L1.
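
As an example of such a combined loss, here is a minimal NumPy sketch of the elementwise Huber loss, which is quadratic for small errors and linear for large ones. With delta set to 1.0 it coincides with the usual smooth-L1 definition. The function name and the sample boxes are assumptions for illustration.

```python
import numpy as np


def huber(y_true, y_pred, delta=1.0):
    """Elementwise Huber loss."""
    error = np.asarray(y_true, dtype=np.float64) - np.asarray(y_pred, dtype=np.float64)
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2                  # used where |error| <= delta
    linear = delta * (abs_error - 0.5 * delta)    # used where |error| > delta
    return np.where(abs_error <= delta, quadratic, linear)


# Bounding box regression: one groundtruth box and one prediction.
true_box = np.array([[0.20, 0.25, 0.60, 0.75]])
pred_box = np.array([[0.25, 0.25, 0.50, 0.75]])
box_loss = huber(true_box, pred_box).mean()
```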

Details on Bounding Box Labels for Detection and Localization¶

In most schemes, the upper left is the origin, i.e. the location specified by the tuple (0,0).

The ranges for bounding boxes for detection and localization are generally either:

Normalized to [0.0, 1.0], where 0.0 and 1.0 represent the edges of the image.
Absolute pixel indexes.

The four coordinates of a bounding box are generally structured in one of the following ways:

top, left, bottom, right, aka yxyx.
left, top, right, bottom, aka xyxy. Follows the precedent from Cartesian geometry, and is often the native format of labeling companies.
center x, center y, width, height. How YOLO describes bounding boxes.
left, top, width, height. How COCO describes bounding boxes.

Common packages and models support these detection formats:

yxyx-01 is natively supported by TensorFlow in tf.image.draw_bounding_boxes and tf.image.crop_and_resize. This is also the native format Masterful supports.
xyxy-pixels is the native format for Pascal VOC.
COCO uses left, top, width, height along with absolute pixel indexes.
YOLO uses center x, center y, width, height with coordinates normalized to [0.0, 1.0].
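
Converting between these formats is mostly a matter of reordering and rescaling. The sketch below converts COCO, YOLO, and Pascal VOC style boxes to yxyx-01 with NumPy. The function names are made up for this example, and pixel coordinates are treated as continuous edge positions, which is a simplifying assumption.

```python
import numpy as np


def coco_to_yxyx01(boxes, image_height, image_width):
    """[left, top, width, height] in pixels -> [top, left, bottom, right] in [0, 1]."""
    left, top, width, height = np.moveaxis(np.asarray(boxes, dtype=np.float64), -1, 0)
    return np.stack(
        [
            top / image_height,
            left / image_width,
            (top + height) / image_height,
            (left + width) / image_width,
        ],
        axis=-1,
    )


def yolo_to_yxyx01(boxes):
    """[center x, center y, width, height] in [0, 1] -> [top, left, bottom, right]."""
    cx, cy, width, height = np.moveaxis(np.asarray(boxes, dtype=np.float64), -1, 0)
    return np.stack(
        [cy - height / 2, cx - width / 2, cy + height / 2, cx + width / 2],
        axis=-1,
    )


def voc_to_yxyx01(boxes, image_height, image_width):
    """[left, top, right, bottom] in pixels -> [top, left, bottom, right] in [0, 1]."""
    left, top, right, bottom = np.moveaxis(np.asarray(boxes, dtype=np.float64), -1, 0)
    return np.stack(
        [
            top / image_height,
            left / image_width,
            bottom / image_height,
            right / image_width,
        ],
        axis=-1,
    )


# The same box in a 100 (high) x 200 (wide) image, in three formats.
from_coco = coco_to_yxyx01([50, 20, 100, 40], image_height=100, image_width=200)
from_yolo = yolo_to_yxyx01([0.5, 0.4, 0.5, 0.4])
from_voc = voc_to_yxyx01([50, 20, 150, 60], image_height=100, image_width=200)
```

All three calls describe the same box, so they produce the same yxyx-01 result.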

Detection typically requires:

A regression output for the bounding box locations, typically with a mean squared error, L1, L2, or Huber / smooth-L1 loss.
A classification output, modeled with the traditional crossentropy loss.
Additional information such as anchor box tracking.

To accommodate multiple outputs and losses, common approaches include:

Output the regression, classification, and additional outputs as three or more different outputs of the model. For example, a model might have an NM4 output for boxes, an NMC output for classes, and several additional outputs of shape NMX, NMY, NMZ, etc., for additional info. An advantage of this approach is that it is possible to implement without custom losses.
Output all three of the above outputs as a single, concatenated tensor of shape (N, M, 4+C+X+Y+Z+…). A simple custom loss function accepts this prediction, splits the tensor as necessary, calculates the regression and classification losses separately, and returns their weighted sum.
Output the concatenated tensor of shape (N, M, 4+C+X) twice. The first output goes to a loss that slices out just the box data and runs a regression loss. The second output goes to a loss that slices out just the classification data and runs a classification loss.
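
The single concatenated tensor approach can be sketched as follows in NumPy. This is a simplified illustration under several assumptions: the tensor has shape (N, M, 4+C) with no additional outputs, the class part of the prediction holds logits, every one of the M slots contains an object (no masking or anchor matching), and the regression loss is plain mean squared error. The function names and weights are hypothetical.

```python
import numpy as np

NUM_BOX_VALUES = 4


def split_detection_tensor(tensor):
    """Split (N, M, 4+C) into boxes (N, M, 4) and class data (N, M, C)."""
    return tensor[..., :NUM_BOX_VALUES], tensor[..., NUM_BOX_VALUES:]


def detection_loss(y_true, y_pred, box_weight=1.0, class_weight=1.0):
    true_boxes, true_classes = split_detection_tensor(y_true)
    pred_boxes, pred_logits = split_detection_tensor(y_pred)

    # Regression loss on the box coordinates (mean squared error).
    box_loss = np.mean((true_boxes - pred_boxes) ** 2)

    # Classification loss: softmax cross entropy computed from logits.
    shifted = pred_logits - pred_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    class_loss = np.mean(-(true_classes * log_probs).sum(axis=-1))

    # Weighted sum of the two losses.
    return box_weight * box_loss + class_weight * class_loss


# Example: N=2 images, M=3 objects, C=5 classes.
rng = np.random.default_rng(0)
boxes = rng.uniform(0.0, 1.0, size=(2, 3, 4))
classes = np.eye(5)[rng.integers(0, 5, size=(2, 3))]
y_true = np.concatenate([boxes, classes], axis=-1)

# A prediction with perfect boxes and uninformative (all-zero) class logits.
y_pred = np.concatenate([boxes, np.zeros((2, 3, 5))], axis=-1)
loss = detection_loss(y_true, y_pred)
```

With perfect boxes and uniform class predictions, only the classification term contributes, so the loss equals the log of the number of classes.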