{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Simple SSL Recipe\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/masterfulai/masterful-docs/blob/main/notebooks/guide_simple_ssl.ipynb)        \n", "[![Download](images/download.png)](https://masterful-public.s3.us-west-1.amazonaws.com/933013963/latest/guide_simple_ssl.ipynb)[Download this Notebook](https://masterful-public.s3.us-west-1.amazonaws.com/933013963/latest/guide_simple_ssl.ipynb)\n", "\n", "In this recipe, you'll perform simple data manipulation and use Masterful utilities to improve your model's accuracy with SSL techniques. \n", "\n", "SSL, or Semi-Supervised Learning, means allowing your model to learn from both labeled and unlabeled data. Normally, SSL techniques require custom training loops and multiple losses. This recipe lets you quickly implement SSL and get potentially significant accuracy improvements without either. \n", "\n", "Consider using this recipe if:\n", "\n", "* You want to quickly try an SSL technique.\n", "* You want to keep your own training loop. \n", "* You have a very well-tuned regularization policy. \n", "* Your task is classification (binary, single-label, or multilabel) or semantic segmentation. \n", "\n", "Power users may want to skip this recipe and go straight to the full Masterful Platform if:\n", "\n", "* You want to maximize accuracy.\n", "* Your regularization policy is not optimally tuned. \n", "* Your task is object detection or instance segmentation. \n", "* You are using [Label Smoothing Regularization](https://proceedings.neurips.cc/paper/2019/file/f1748d6b0fd9d439f71450117eba2725-Paper.pdf). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First, set up a standard supervised training pipeline. \n", "\n", "This will not do any SSL yet. It should resemble most supervised training pipelines you've developed. \n", "\n", "Implement functions to:\n", "\n", "* Get your dataset (`get_labeled_datasets()`)\n", "* Create your model architecture (`get_model()`)\n", "* Augment your data (`augment_image()`)\n", "* Train your model (`train_model()`)
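\n", "\n", "To simulate a scarce-labels regime, `get_labeled_datasets()` below keeps only 1% of CIFAR10's training set (500 labeled images), which is why the baseline accuracy will look low. 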
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/100\n", "2/2 [==============================] - 3s 904ms/step - loss: 3.2753 - acc: 0.0780 - val_loss: 2.3118 - val_acc: 0.1010\n", "Epoch 2/100\n", "2/2 [==============================] - 1s 762ms/step - loss: 3.1935 - acc: 0.0860 - val_loss: 2.3098 - val_acc: 0.1016\n", "Epoch 3/100\n", "\n", "...\n", "\n", "Epoch 51/100\n", "2/2 [==============================] - 1s 765ms/step - loss: 1.6426 - acc: 0.4180 - val_loss: 2.3094 - val_acc: 0.1128\n", "Epoch 52/100\n", "2/2 [==============================] - 1s 770ms/step - loss: 1.6359 - acc: 0.4020 - val_loss: 2.3108 - val_acc: 0.1136\n", "Epoch 53/100\n", "2/2 [==============================] - ETA: 0s - loss: 1.6156 - acc: 0.4300Restoring model weights from the end of the best epoch: 28.\n", "2/2 [==============================] - 1s 770ms/step - loss: 1.6156 - acc: 0.4300 - val_loss: 2.3122 - val_acc: 0.1142\n", "Epoch 53: early stopping\n" ] } ], "source": [ "import tensorflow as tf\n", "import tensorflow_addons as tfa\n", "import tensorflow_datasets as tfds\n", "\n", "\n", "def get_labeled_datasets(train_percentage=1):\n", " \"\"\"Simple function to get cifar10 as a `tf.data.Dataset`\"\"\"\n", " (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()\n", "\n", " # Take the first training_percentage of the training data.\n", " train_cardinality = train_percentage * 50000 // 100\n", " x_train = x_train[0:train_cardinality]\n", " y_train = y_train[0:train_cardinality]\n", "\n", " # Normalize data into the range [0,1]\n", " x_train = x_train.astype(\"float32\") / 255.0\n", " x_test = x_test.astype(\"float32\") / 255.0\n", "\n", " # Convert labels to one-hot.\n", " y_train = tf.keras.utils.to_categorical(y_train, 10)\n", " y_test = tf.keras.utils.to_categorical(y_test, 10)\n", "\n", " # Split test into a val and test dataset.\n", " x_val = x_test[:5000]\n", " y_val = y_test[:5000]\n", "\n", " x_test = x_test[5000:]\n", " y_test = y_test[5000:]\n", "\n", " # Convert the data to tf.data.Dataset.\n", " train = tf.data.Dataset.from_tensor_slices((x_train, y_train))\n", " val = tf.data.Dataset.from_tensor_slices((x_val, y_val))\n", " test = tf.data.Dataset.from_tensor_slices((x_test, y_test))\n", "\n", " # Shuffle just the training dataset.\n", " train = train.shuffle(1000)\n", "\n", " # Batch the data. The batch size is a crucial hyperparameter to\n", " # take advantage of your GPU hardware. See the guide to the\n", " # optimization metalearner to find out how to learn an optimal batch size.\n", " train = train.batch(256)\n", " val = val.batch(256)\n", " test = test.batch(256)\n", "\n", " train = train.prefetch(tf.data.AUTOTUNE)\n", "\n", " return train, val, test\n", "\n", "def get_model():\n", " \"\"\"Returns a minimal convnet. 
\"\"\"\n", " inp = tf.keras.Input((32, 32, 3))\n", " x = inp\n", " x = tf.keras.layers.Conv2D(16, 3, activation='relu')(x)\n", " x = tf.keras.layers.BatchNormalization()(x)\n", " x = tf.keras.layers.MaxPooling2D()(x)\n", "\n", " x = tf.keras.layers.Conv2D(32, 3, activation='relu')(x)\n", " x = tf.keras.layers.BatchNormalization()(x)\n", " x = tf.keras.layers.MaxPooling2D()(x)\n", "\n", " x = tf.keras.layers.Conv2D(64, 3, activation='relu')(x)\n", " x = tf.keras.layers.BatchNormalization()(x)\n", " x = tf.keras.layers.MaxPooling2D()(x)\n", "\n", " x = tf.keras.layers.GlobalAveragePooling2D()(x)\n", " x = tf.keras.layers.Dense(10, activation='softmax')(x)\n", " return tf.keras.Model(inp, x)\n", "\n", "def augment_image(image):\n", " \"\"\"A simple augmentation pipeline.\"\"\"\n", " image = tf.image.random_brightness(image, 0.1)\n", " image = tf.image.random_hue(image, 0.1)\n", " image = tf.image.resize(image, size=[32,32])\n", " image = tf.image.random_flip_left_right(image)\n", "\n", " return image\n", "\n", "def train_model(model, augmented_train, validation_data, epochs=100):\n", " \"\"\"A simple training loop. \"\"\"\n", "\n", " early_stopping = tf.keras.callbacks.EarlyStopping(patience=25,\n", " verbose=2,\n", " restore_best_weights=True)\n", "\n", " # The learning rate used by the optimizer (in this case LAMB)\n", " # is a crucial hyperparameter to take advantage of your GPU hardware.\n", " # See the guide to the optimization metalearner to find out how to\n", " # learn an optimal learning rate.\n", " model.compile(\n", " optimizer=tfa.optimizers.LAMB(learning_rate=0.001),\n", " loss='categorical_crossentropy',\n", " metrics=['acc'],\n", " )\n", "\n", " model.fit(augmented_train,\n", " validation_data=validation_data,\n", " epochs=epochs,\n", " callbacks=[early_stopping])\n", "\n", "# Now use the functions you just defined to train a model start to finish.\n", "train, val, test = get_labeled_datasets()\n", "augmented_train = train.map(lambda image, label: (augment_image(image), label))\n", "model = get_model()\n", "train_model(model, augmented_train, val)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20/20 [==============================] - 1s 32ms/step - loss: 2.2884 - acc: 0.1200\n" ] } ], "source": [ "baseline_eval_metrics = model.evaluate(test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Now you'll improve the accuracy of your model using SSL techniques. \n", "\n", "First, set up your unlabeled data as a batched `tf.data.Dataset`. Typically, each element of a batched Dataset is a tuple of tensors: `(images, labels)`. Since unlabeled data doesn't have a label, just make each element of your batched dataset a tensor: `images`. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Normally, the unlabeled dataset comes from images that are not yet labeled. \n", "# To simulate that with CIFAR10, you will use 5% of the training data, but\n", "# remove the labels. Be sure to use the end of the training data, not the \n", "# beginning, to ensure that the labeled and unlabeled sets are disjoint. 
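\n", "# With the defaults in this guide, the labeled set is the first 500 images\n", "# (1% of 50,000) and the unlabeled set is the last 2,500 (5%), so the two\n", "# sets cannot overlap. 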
\n", "def get_unlabeled_data(train_percentage=5):\n", " \"\"\"A simple function to get unlabeled CIFAR10 data.\"\"\"\n", " (x_train, _), (_, _) = tf.keras.datasets.cifar10.load_data()\n", "\n", " # Take train_percentage percent of the training data.\n", " # Take it from the end of the numpy array, not the beginning, to prevent\n", " # overlap with the labeled data. \n", " train_cardinality = train_percentage * 50000 // 100\n", " x_train = x_train[-train_cardinality:]\n", "\n", " # Perform the same processing as the `get_labeled_datasets()` function.\n", " x_train = x_train.astype(\"float32\") / 255.0\n", " train = tf.data.Dataset.from_tensor_slices(x_train)\n", " train = train.shuffle(1000)\n", "\n", " # Batch the data. The batch size is a crucial hyperparameter to\n", " # take advantage of your GPU hardware. See the guide to the\n", " # optimization metalearner to find out how to learn an optimal batch size. \n", " train = train.batch(256)\n", "\n", " return train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now call the Masterful SSL utility, which analyzes your data and stores the analysis to disk. The utility ensures consistent batch sizes and the right ratio of labeled to unlabeled data, takes care of the complexities of running and training a model, and optionally lets you weight the labeled and unlabeled data. \n", "\n", "The function call will take some time to iterate through each example from both datasets, analyze them, and save the interim results to disk. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded Masterful version 0.4.0. This software is distributed free of\n", "charge for personal projects and evaluation purposes.\n", "See http://www.masterfulai.com/personal-and-evaluation-agreement for details.\n", "Sign up in the next 39 days at https://www.masterfulai.com/get-it-now\n", "to continue using Masterful.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "3000it [00:02, 1413.45it/s]\n" ] } ], "source": [ "import masterful\n", "\n", "masterful = masterful.register()\n", "\n", "unlabeled = get_unlabeled_data()\n", "\n", "masterful.ssl.analyze_data_then_save_to(model, \n", " train, \n", " unlabeled, \n", " path='/tmp/ssl',\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now load the record from disk into a `tf.data.Dataset`, apply your augmentation function, and train. The record includes both labeled and unlabeled data, so each epoch will take longer to run. 
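\n", "\n", "Before training, you can optionally sanity-check what `masterful.ssl.load_from` returns. A minimal sketch:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check: the loaded dataset yields (image, label) tuples,\n", "# just like the labeled dataset, so the same augmentation `map` and\n", "# training loop work unchanged.\n", "for image, label in masterful.ssl.load_from(path='/tmp/ssl').batch(256).take(1):\n", " print(image.shape, label.shape)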
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/100\n", "12/12 [==============================] - 4s 233ms/step - loss: 3.1790 - acc: 0.1177 - val_loss: 2.2992 - val_acc: 0.1396\n", "Epoch 2/100\n", "12/12 [==============================] - 2s 169ms/step - loss: 2.9010 - acc: 0.1137 - val_loss: 2.3001 - val_acc: 0.0976\n", "Epoch 3/100\n", "12/12 [==============================] - 2s 168ms/step - loss: 2.7171 - acc: 0.1373 - val_loss: 2.3025 - val_acc: 0.0990\n", "\n", "...\n", "\n", "Epoch 98/100\n", "12/12 [==============================] - 2s 175ms/step - loss: 2.1727 - acc: 0.3537 - val_loss: 2.0497 - val_acc: 0.3080\n", "Epoch 99/100\n", "12/12 [==============================] - 2s 178ms/step - loss: 2.1730 - acc: 0.3437 - val_loss: 2.0553 - val_acc: 0.3084\n", "Epoch 100/100\n", "12/12 [==============================] - 2s 177ms/step - loss: 2.1683 - acc: 0.3507 - val_loss: 2.0504 - val_acc: 0.3050\n" ] } ], "source": [ "new_model = get_model()\n", "ssl_training_data = masterful.ssl.load_from(path='/tmp/ssl').batch(256)\n", "\n", "augmented_ssl_training_data = ssl_training_data.map(\n", " lambda image, label: (augment_image(image), label))\n", "\n", "train_model(new_model, augmented_ssl_training_data, val)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate your newly training model against the old one. You should see an improvement in accuracy now that you are applying SSL techniques to learn from unlabeled data. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20/20 [==============================] - 1s 35ms/step - loss: 2.0740 - acc: 0.2952\n", "run , test loss, test accuracy\n", "baseline, 2.288, 0.12\n", "ssl , 2.074, 0.2952\n" ] } ], "source": [ "ssl_eval_metrics = new_model.evaluate(test)\n", "\n", "print(f'run , test loss, test accuracy')\n", "print(f'baseline, {baseline_eval_metrics[0]:.4}, {baseline_eval_metrics[1]:.4}')\n", "print(f'ssl , {ssl_eval_metrics[0]:.4}, {ssl_eval_metrics[1]:.5}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced Tuning\n", "\n", "To improve the results, two hyperparameters to tune are the intensity of augmentations, and the weighting of the unlabeled data. \n", "\n", "The intensity of augmentations is generally empirically discovered by a search algorithm, such as guessing and checking or grid search. If your augmentations are suboptimally tuned, consider using the full Masterful API to manage SSL end to end.\n", "\n", "The weighting of unlabeled data is also generally empirically discovered by a search algorithm. As a rule of thumb, a 1:5 ratio of labeled to unlabeled data often works well. If you have more unlabeled data than that, you'll want to downweight each unlabeled example (and vice versa). \n", "\n", "Examples are below. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can quickly increase your augmentation intensity by augmenting twice.\n", "new_model = get_model()\n", "ssl_training_data = masterful.ssl.load_from(path='/tmp/ssl')\n", "ssl_training_data = ssl_training_data.map(lambda image, label: (augment(augment(image)), label))\n", "train_model(new_model, augment_image(augment_image(ssl_training_data)), val)\n", "\n", "# If your unlabeled training data is 4x or less the cardinality of your labeled training data,\n", "# you can increase the weight of the unlabeled training data. \n", "new_model = get_model()\n", "ssl_training_data = masterful.ssl.load_from(path='/tmp/ssl', unlabeled_weight=2.0)\n", "ssl_training_data = ssl_training_data.map(lambda image, label, weight: (augment(image)), label, weight))\n", "train_model(new_model, augment_image(ssl_training_data), val)\n", "\n", "# Alternatively, if your unlabeled training data is 6x or more \n", "# the cardinality of your labeled training data, you can decrease the \n", "# weight of the unlabeled training data. \n", "new_model = get_model()\n", "ssl_training_data = ssl_training_data.map(lambda image, label, weight: (augment(image)), label, weight))\n", "train_model(new_model, augment_image(ssl_training_data), val)" ] } ], "metadata": { "interpreter": { "hash": "e11de040a44de2599d5826916dec5532a989d7fc6a7daf05571191351ea2bbfc" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" } }, "nbformat": 4, "nbformat_minor": 2 }