# Data Input Types

**Title**: Masterful supports many input data types

**Author**: [sam](mailto:sam@masterfulai.com)

**Date created**: 2022/04/27

**Last modified**: 2022/04/1

**Description**: Data input types in Masterful.

## Introduction

Masterful supports several input data types for your training data. In this guide, you will learn about these types and how to use each of them with Masterful, so that you can integrate existing training data into the Masterful platform.

## Tensorflow Dataset

Masterful is built to support [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) objects natively. The easiest way to create a dataset is to use the [Dataset.from_tensor_slices](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) API, as explained [here](https://www.tensorflow.org/guide/data#consuming_numpy_arrays). Masterful can consume the resulting Dataset object directly.

```python
import tensorflow as tf

# Create 32 random 3-channel images of 16x16 pixels.
images = tf.random.uniform(shape=(32, 16, 16, 3))
unlabeled_dataset = tf.data.Dataset.from_tensor_slices((images,))

# Create 32 random integer labels in the range [0, 10).
labels = tf.random.uniform(shape=(32,), minval=0, maxval=10, dtype=tf.int32)
labeled_dataset = tf.data.Dataset.from_tensor_slices((images, labels))
```

To train with a Dataset, the dataset must *not* be batched (Masterful will batch the data for you).

```python
# This unlabeled dataset can be used with Masterful.
# Note the trailing comma: each example is a one-item tuple.
dataset = tf.data.Dataset.from_tensor_slices((tf.random.uniform((32, 16, 16, 3)),))

# This dataset cannot be used with Masterful.
dataset = dataset.batch(16)  # DO NOT batch the dataset
```

Each item returned by the dataset must also be a tuple of exactly one or two items. For unlabeled data, the single item is the image data. For labeled data, the tuple contains the features (images) followed by the labels.

```python
images = tf.random.uniform(shape=(32, 16, 16, 3))
labels = tf.random.uniform(shape=(32,), minval=0, maxval=10, dtype=tf.int32)

# This dataset is properly formatted for Masterful.
labeled_dataset = tf.data.Dataset.from_tensor_slices((images, labels))

# This dataset will not work with Masterful: it has too many
# items per example.
extra_data = tf.ones_like(labels)
incorrect_dataset = tf.data.Dataset.from_tensor_slices((images, labels, extra_data))
```

Tensorflow also provides a large collection of built-in datasets in the [Tensorflow Datasets](https://www.tensorflow.org/datasets) catalog.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Masterful can use TFDS datasets as well.
dataset = tfds.load('mnist', split='train', as_supervised=True)
```

Most Tensorflow Datasets in the catalog return a dictionary of items, so you must extract the features and labels from the dictionary before passing the dataset to Masterful.

```python
dataset = tfds.load('mnist', split='train')

# Extract the images and labels from the feature dictionary.
dataset = dataset.map(lambda x: (x['image'], x['label']))
```

Most datasets in the catalog can perform this extraction for you automatically if you pass the `as_supervised` argument.

```python
# `as_supervised` will automatically extract the images
# and labels.
dataset = tfds.load('mnist', split='train', as_supervised=True)
```

## Numpy (and Tensor) Arrays

Masterful can also consume Numpy and Tensor arrays directly. This works well if your dataset is small and fits entirely into memory.

```python
# x_train and y_train are numpy arrays
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
```

To use the above arrays with Masterful, pass the tuple of images and labels to any Masterful API that requires a "dataset".

```python
training_dataset_params = masterful.data.learn_data_params(
    dataset=(x_train, y_train),
    task=masterful.enums.Task.CLASSIFICATION,
    image_range=masterful.enums.ImageRange.ZERO_255,
    num_classes=10,
    sparse_labels=True,
)

training_report = masterful.training.train(
    ...
    training_dataset=(x_train, y_train),
    training_dataset_params=training_dataset_params,
    ...
)
```

## Keras Sequence

Masterful also supports [Keras Sequence](https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence) objects. A Keras Sequence is a generator-like structure that is safer for multiprocessing than a plain Python generator. It guarantees that the network trains on each sample only once per epoch, which is not the case with generators. Note that `tf.data.Dataset` objects remain the preferred way to feed training data in Tensorflow.

```python
import numpy as np
import tensorflow as tf

input_shape = (16, 16, 3)


class DataGenerator(tf.keras.utils.Sequence):
    """Keras sequence that returns batches of dummy data."""

    def __init__(self):
        # Returns only a single batch of data.
        self._length = 1

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        """Returns batches of dummy data."""
        images = np.zeros((32,) + input_shape)
        labels = np.ones((32,))
        return (images, labels)


keras_sequence = DataGenerator()

# Even though Sequences generate *batches* of data,
# Masterful knows how to handle it correctly.
training_dataset_params = masterful.data.learn_data_params(
    dataset=keras_sequence,
    task=masterful.enums.Task.CLASSIFICATION,
    image_range=masterful.enums.ImageRange.ZERO_255,
    num_classes=10,
    sparse_labels=True,
)

training_report = masterful.training.train(
    ...
    training_dataset=keras_sequence,
    training_dataset_params=training_dataset_params,
    ...
)
```

## Python Generators

Masterful also supports Python generators. To use a generator, you must give Masterful both the generator function and the output signature of the examples it yields. These are passed to Masterful together as a tuple.

```python
import tensorflow as tf

input_shape = (16, 16, 3)


def generator():
    # A generator function is a zero-arg function that
    # yields images and labels.
    i = 0
    max_items = 32
    while i < max_items:
        # The label dtype matches the int32 spec declared below.
        yield tf.zeros(input_shape), tf.ones((), dtype=tf.int32)
        i += 1


# The output signature is a tuple of tensor specs.
output_signature = (
    tf.TensorSpec(input_shape, dtype=tf.float32),
    tf.TensorSpec((), dtype=tf.int32),
)

# Pass the generator and the output signature to Masterful
# as a tuple.
training_dataset_params = masterful.data.learn_data_params(
    dataset=(generator, output_signature),
    task=masterful.enums.Task.CLASSIFICATION,
    image_range=masterful.enums.ImageRange.ZERO_255,
    num_classes=10,
    sparse_labels=True,
)

training_report = masterful.training.train(
    ...
    training_dataset=(generator, output_signature),
    training_dataset_params=training_dataset_params,
    ...
)
```
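
Before handing a generator function to Masterful, it can help to check the structure of the examples it yields. The sketch below is plain Python and is not part of the Masterful API: `check_generator_examples` and `toy_generator` are hypothetical names used only for illustration. It assumes the rule stated earlier for datasets, that each example is a tuple of one or two items, also applies to generator output.

```python
from itertools import islice


def check_generator_examples(generator_fn, num_examples=4):
    """Hypothetical helper (not a Masterful API).

    Calls the zero-argument generator function, inspects the first
    few examples, and returns the tuple length of each one.
    """
    if not callable(generator_fn):
        raise TypeError("expected a zero-argument generator function")
    lengths = []
    for example in islice(generator_fn(), num_examples):
        if not isinstance(example, tuple):
            raise ValueError(
                f"example must be a tuple, got {type(example).__name__}")
        if len(example) not in (1, 2):
            raise ValueError(
                f"example must have 1 or 2 items, got {len(example)}")
        lengths.append(len(example))
    return lengths


def toy_generator():
    # Stand-in for the generator above, using plain Python values
    # instead of tensors so the check runs without TensorFlow.
    for i in range(32):
        yield [0.0] * 4, i % 10


print(check_generator_examples(toy_generator))  # [2, 2, 2, 2]
```

Because the helper calls `generator_fn()` itself, it does not consume an iterator that is later reused for training; each call to the generator function produces a fresh iterator.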