In this tutorial, we will answer some common questions about autoencoders, and we will cover code examples of the following models:
- a simple autoencoder based on a fully-connected layer
- a sparse autoencoder
- a deep fully-connected autoencoder
- a deep convolutional autoencoder
- an image denoising model
- a sequence-to-sequence autoencoder
- a variational autoencoder
What are autoencoders?
"Autoencoding" is a data compression algorithm where the compression and decompression functions are 1) data-specific, 2) lossy, and 3) learned automatically from examples rather than engineered by a human. Additionally, in almost all contexts where the term "autoencoder" is used, the compression and decompression functions are implemented with neural networks.
1) Autoencoders are data-specific, which means that they will only be able to compress data similar to what they have been trained on. This is different from, say, the MPEG-2 Audio Layer III (MP3) compression algorithm, which only holds assumptions about "sound" in general, but not about specific types of sounds. An autoencoder trained on pictures of faces would do a rather poor job of compressing pictures of trees, because the features it would learn would be face-specific.
2) Autoencoders are lossy, which means that the decompressed outputs will be degraded compared to the original inputs (similar to MP3 or JPEG compression). This differs from lossless arithmetic compression.
3) Autoencoders are learned automatically from data examples, which is a useful property: it means that it is easy to train specialized instances of the algorithm that will perform well on a specific type of input. It doesn't require any new engineering, just appropriate training data.
To build an autoencoder, you need three things: an encoding function, a decoding function, and a distance function between the amount of information loss between the compressed representation of your data and the decompressed representation (i.e. a "loss" function). The encoder and decoder will be chosen to be parametric functions (typically neural networks), and to be differentiable with respect to the distance function, so the parameters of the encoding/decoding functions can be optimize to minimize the reconstruction loss, using Stochastic Gradient Descent. It's simple! And you don't even need to understand any of these words to start using autoencoders in practice.
Are they good at data compression?
Usually, not really. In picture compression for instance, it is pretty difficult to train an autoencoder that does a better job than a basic algorithm like JPEG, and typically the only way it can be achieved is by restricting yourself to a very specific type of picture (e.g. one for which JPEG does not do a good job). The fact that autoencoders are data-specific makes them generally impractical for real-world data compression problems: you can only use them on data that is similar to what they were trained on, and making them more general thus requires lots of training data. But future advances might change this, who knows.
What are autoencoders good for?
They are rarely used in practical applications. In 2012 they briefly found an application in greedy layer-wise pretraining for deep convolutional neural networks , but this quickly fell out of fashion as we started realizing that better random weight initialization schemes were sufficient for training deep networks from scratch. In 2014, batch normalization  started allowing for even deeper networks, and from late 2015 we could train arbitrarily deep networks from scratch using residual learning .
Today two interesting practical applications of autoencoders are data denoising (which we feature later in this post), and dimensionality reduction for data visualization. With appropriate dimensionality and sparsity constraints, autoencoders can learn data projections that are more interesting than PCA or other basic techniques.
For 2D visualization specifically, t-SNE (pronounced "tee-snee") is probably the best algorithm around, but it typically requires relatively low-dimensional data. So a good strategy for visualizing similarity relationships in high-dimensional data is to start by using an autoencoder to compress your data into a low-dimensional space (e.g. 32 dimensional), then use t-SNE for mapping the compressed data to a 2D plane. Note that a nice parametric implementation of t-SNE in Keras was developed by Kyle McDonald and is available on Github. Otherwise scikit-learn also has a simple and practical implementation.
So what's the big deal with autoencoders?
Their main claim to fame comes from being featured in many introductory machine learning classes available online. As a result, a lot of newcomers to the field absolutely love autoencoders and can't get enough of them. This is the reason why this tutorial exists!
Otherwise, one reason why they have attracted so much research and attention is because they have long been thought to be a potential avenue for solving the problem of unsupervised learning, i.e. the learning of useful representations without the need for labels. Then again, autoencoders are not a true unsupervised learning technique (which would imply a different learning process altogether), they are a self-supervised technique, a specific instance of supervised learning where the targets are generated from the input data. In order to get self-supervised models to learn interesting features, you have to come up with an interesting synthetic target and loss function, and that's where problems arise: merely learning to reconstruct your input in minute detail might not be the right choice here. At this point there is significant evidence that focusing on the reconstruction of a picture at the pixel level, for instance, is not conductive to learning interesting, abstract features of the kind that label-supervized learning induces (where targets are fairly abstract concepts "invented" by humans such as "dog", "car"...). In fact, one may argue that the best features in this regard are those that are the worst at exact input reconstruction while achieving high performance on the main task that you are interested in (classification, localization, etc).
In self-supervized learning applied to vision, a potentially fruitful alternative to autoencoder-style input reconstruction is the use of toy tasks such as jigsaw puzzle solving, or detail-context matching (being able to match high-resolution but small patches of pictures with low-resolution versions of the pictures they are extracted from). The following paper investigates jigsaw puzzle solving and makes for a very interesting read: Noroozi and Favaro (2016) Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. Such tasks are providing the model with built-in assumptions about the input data which are missing in traditional autoencoders, such as "visual macro-structure matters more than pixel-level details".
Let's build the simplest possible autoencoder
We'll start simple, with a single fully-connected neural layer as encoder and as decoder:
from keras.layers import Input, Dense from keras.models import Model # this is the size of our encoded representations encoding_dim = 32 # 32 floats -> compression of factor 24.5, assuming the input is 784 floats # this is our input placeholder input_img = Input(shape=(784,)) # "encoded" is the encoded representation of the input encoded = Dense(encoding_dim, activation='relu')(input_img) # "decoded" is the lossy reconstruction of the input decoded = Dense(784, activation='sigmoid')(encoded) # this model maps an input to its reconstruction autoencoder = Model(input=input_img, output=decoded)
Let's also create a separate encoder model:
# this model maps an input to its encoded representation encoder = Model(input=input_img, output=encoded)
As well as the decoder model:
# create a placeholder for an encoded (32-dimensional) input encoded_input = Input(shape=(encoding_dim,)) # retrieve the last layer of the autoencoder model decoder_layer = autoencoder.layers[-1] # create the decoder model decoder = Model(input=encoded_input, output=decoder_layer(encoded_input))
Now let's train our autoencoder to reconstruct MNIST digits.
First, we'll configure our model to use a per-pixel binary crossentropy loss, and the Adadelta optimizer:
Let's prepare our input data. We're using MNIST digits, and we're discarding the labels (since we're only interested in encoding/decoding the input images).
from keras.datasets import mnist import numpy as np (x_train, _), (x_test, _) = mnist.load_data()
We will normalize all values between 0 and 1 and we will flatten the 28x28 images into vectors of size 784.
x_train = x_train.astype('float32') / 255. x_test = x_test.astype('float32') / 255. x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:]))) x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:]))) print x_train.shape print x_test.shape
Now let's train our autoencoder for 50 epochs:
autoencoder.fit(x_train, x_train, nb_epoch=50, batch_size=256, shuffle=True, validation_data=(x_test, x_test))
After 50 epochs, the autoencoder seems to reach a stable train/test loss value of about
0.11. We can try to visualize the reconstructed inputs and the encoded representations. We will use Matplotlib.
# encode and decode some digits # note that we take them from the *test* set encoded_imgs = encoder.predict(x_test) decoded_imgs = decoder.predict(encoded_imgs)
# use Matplotlib (don't ask) import matplotlib.pyplot as plt n = 10 # how many digits we will display plt.figure(figsize=(20, 4)) for i in range(n): # display original ax = plt.subplot(2, n, i + 1) plt.imshow(x_test[i].reshape(28, 28)) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) # display reconstruction ax = plt.subplot(2, n, i + 1 + n) plt.imshow(decoded_imgs[i].reshape(28, 28)) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) plt.show()
Here's what we get. The top row is the original digits, and the bottom row is the reconstructed digits. We are losing quite a bit of detail with this basic approach.
Adding a sparsity constraint on the encoded representations
In the previous example, the representations were only constrained by the size of the hidden layer (32). In such a situation, what typically happens is that the hidden layer is learning an approximation of PCA (principal component analysis). But another way to constrain the representations to be compact is to add a sparsity contraint on the activity of the hidden representations, so fewer units would "fire" at a given time. In Keras, this can be done by adding an
activity_regularizer to our
from keras import regularizers encoding_dim = 32 input_img = Input(shape=(784,)) # add a Dense layer with a L1 activity regularizer encoded = Dense(encoding_dim, activation='relu', activity_regularizer=regularizers.activity_l1(10e-5))(input_img) decoded = Dense(784, activation='sigmoid')(encoded) autoencoder = Model(input=input_img, output=decoded)
Let's train this model for 100 epochs (with the added regularization the model is less likely to overfit and can be trained longer). The models ends with a train loss of
0.11 and test loss of
0.10. The difference between the two is mostly due to the regularization term being added to the loss during training (worth about 0.01).
Here's a visualization of our new results:
They look pretty similar to the previous model, the only significant difference being the sparsity of the encoded representations.
encoded_imgs.mean() yields a value
3.33 (over our 10,000 test images), whereas with the previous model the same quantity was
7.30. So our new model yields encoded representations that are twice sparser.
We do not have to limit ourselves to a single layer as encoder or decoder, we could instead use a stack of layers, such as:
input_img = Input(shape=(784,)) encoded = Dense(128, activation='relu')(input_img) encoded = Dense(64, activation='relu')(encoded) encoded = Dense(32, activation='relu')(encoded) decoded = Dense(64, activation='relu')(encoded) decoded = Dense(128, activation='relu')(decoded) decoded = Dense(784, activation='sigmoid')(decoded)
Let's try this:
autoencoder = Model(input=input_img, output=decoded) autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy') autoencoder.fit(x_train, x_train, nb_epoch=100, batch_size=256, shuffle=True, validation_data=(x_test, x_test))
After 100 epochs, it reaches a train and test loss of ~0.097, a bit better than our previous models. Our reconstructed digits look a bit better too:
Since our inputs are images, it makes sense to use convolutional neural networks (convnets) as encoders and decoders. In practical settings, autoencoders applied to images are always convolutional autoencoders --they simply perform much better.
Let's implement one. The encoder will consist in a stack of
MaxPooling2D layers (max pooling being used for spatial down-sampling), while the decoder will consist in a stack of
from keras.layers import Input, Dense, Convolution2D, MaxPooling2D, UpSampling2D from keras.models import Model input_img = Input(shape=(1, 28, 28)) x = Convolution2D(16, 3, 3, activation='relu', border_mode='same')(input_img) x = MaxPooling2D((2, 2), border_mode='same')(x) x = Convolution2D(8, 3, 3, activation='relu', border_mode='same')(x) x = MaxPooling2D((2, 2), border_mode='same')(x) x = Convolution2D(8, 3, 3, activation='relu', border_mode='same')(x) encoded = MaxPooling2D((2, 2), border_mode='same')(x) # at this point the representation is (8, 4, 4) i.e. 128-dimensional x = Convolution2D(8, 3, 3, activation='relu', border_mode='same')(encoded) x = UpSampling2D((2, 2))(x) x = Convolution2D(8, 3, 3, activation='relu', border_mode='same')(x) x = UpSampling2D((2, 2))(x) x = Convolution2D(16, 3, 3, activation='relu')(x) x = UpSampling2D((2, 2))(x) decoded = Convolution2D(1, 3, 3, activation='sigmoid', border_mode='same')(x) autoencoder = Model(input_img, decoded) autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
To train it, we will use the original MNIST digits with shape
(samples, 3, 28, 28), and we will just normalize pixel values between 0 and 1.
from keras.datasets import mnist import numpy as np (x_train, _), (x_test, _) = mnist.load_data() x_train = x_train.astype('float32') / 255. x_test = x_test.astype('float32') / 255. x_train = np.reshape(x_train, (len(x_train), 1, 28, 28)) x_test = np.reshape(x_test, (len(x_test), 1, 28, 28))
Let's train this model for 50 epochs. For the sake of demonstrating how to visualize the results of a model during training, we will be using the TensorFlow backend and the TensorBoard callback.
First, let's open up a terminal and start a TensorBoard server that will read logs stored at
Then let's train our model. In the
callbacks list we pass an instance of the
TensorBoard callback. After every epoch, this callback will write logs to
/tmp/autoencoder, which can be read by our TensorBoard server.
from keras.callbacks import TensorBoard autoencoder.fit(x_train, x_train, nb_epoch=50, batch_size=128, shuffle=True, validation_data=(x_test, x_test), callbacks=[TensorBoard(log_dir='/tmp/autoencoder')])
This allows us to monitor training in the TensorBoard web interface (by navighating to
The model converges to a loss of 0.094, significantly better than our previous models (this is in large part due to the higher entropic capacity of the encoded representation, 128 dimensions vs. 32 previously). Let's take a look at the reconstructed digits:
decoded_imgs = autoencoder.predict(x_test) n = 10 plt.figure(figsize=(20, 4)) for i in range(n): # display original ax = plt.subplot(2, n, i) plt.imshow(x_test[i].reshape(28, 28)) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) # display reconstruction ax = plt.subplot(2, n, i + n) plt.imshow(decoded_imgs[i].reshape(28, 28)) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) plt.show()
We can also have a look at the 128-dimensional encoded representations. These representations are 8x4x4, so we reshape them to 4x32 in order to be able to display them as grayscale images.
n = 10 plt.figure(figsize=(20, 8)) for i in range(n): ax = plt.subplot(1, n, i) plt.imshow(encoded_imgs[i].reshape(4, 4 * 8).T) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) plt.show()
Application to image denoising
Let's put our convolutional autoencoder to work on an image denoising problem. It's simple: we will train the autoencoder to map noisy digits images to clean digits images.
Here's how we will generate synthetic noisy digits: we just apply a gaussian noise matrix and clip the images between 0 and 1.
from keras.datasets import mnist import numpy as np (x_train, _), (x_test, _) = mnist.load_data() x_train = x_train.astype('float32') / 255. x_test = x_test.astype('float32') / 255. x_train = np.reshape(x_train, (len(x_train), 1, 28, 28)) x_test = np.reshape(x_test, (len(x_test), 1, 28, 28)) noise_factor = 0.5 x_train_noisy = x_train + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_train.shape) x_test_noisy = x_test + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_test.shape) x_train_noisy = np.clip(x_train_noisy, 0., 1.) x_test_noisy = np.clip(x_test_noisy, 0., 1.)
Here's what the noisy digits look like:
n = 10 plt.figure(figsize=(20, 2)) for i in range(n): ax = plt.subplot(1, n, i) plt.imshow(x_test_noisy[i].reshape(28, 28)) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) plt.show()
If you squint you can still recognize them, but barely. Can our autoencoder learn to recover the original digits? Let's find out.
Compared to the previous convolutional autoencoder, in order to improve the quality of the reconstructed, we'll use a slightly different model with more filters per layer:
input_img = Input(shape=(1, 28, 28)) x = Convolution2D(32, 3, 3, activation='relu', border_mode='same')(input_img) x = MaxPooling2D((2, 2), border_mode='same')(x) x = Convolution2D(32, 3, 3, activation='relu', border_mode='same')(x) encoded = MaxPooling2D((2, 2), border_mode='same')(x) # at this point the representation is (32, 7, 7) x = Convolution2D(32, 3, 3, activation='relu', border_mode='same')(encoded) x = UpSampling2D((2, 2))(x) x = Convolution2D(32, 3, 3, activation='relu', border_mode='same')(x) x = UpSampling2D((2, 2))(x) decoded = Convolution2D(1, 3, 3, activation='sigmoid', border_mode='same')(x) autoencoder = Model(input_img, decoded) autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
Let's train it for 100 epochs:
autoencoder.fit(x_train_noisy, x_train, nb_epoch=100, batch_size=128, shuffle=True, validation_data=(x_test_noisy, x_test), callbacks=[TensorBoard(log_dir='/tmp/tb', histogram_freq=0, write_graph=False)])
Now let's take a look at the results. Top, the noisy digits fed to the network, and bottom, the digits are reconstructed by the network.
It seems to work pretty well. If you scale this process to a bigger convnet, you can start building document denoising or audio denoising models. Kaggle has an interesting dataset to get you started.
If you inputs are sequences, rather than vectors or 2D images, then you may want to use as encoder and decoder a type of model that can capture temporal structure, such as a LSTM. To build a LSTM-based autoencoder, first use a LSTM encoder to turn your input sequences into a single vector that contains information about the entire sequence, then repeat this vector
n times (where
n is the number of timesteps in the output sequence), and run a LSTM decoder to turn this constant sequence into the target sequence.
We won't be demonstrating that one on any specific dataset. We will just put a code example here for future reference for the reader!
from keras.layers import Input, LSTM, RepeatVector from keras.models import Model inputs = Input(shape=(timesteps, input_dim)) encoded = LSTM(latent_dim)(inputs) decoded = RepeatVector(timesteps)(encoded) decoded = LSTM(input_dim, return_sequences=True)(decoded) sequence_autoencoder = Model(inputs, decoded) encoder = Model(inputs, encoded)
Variational autoencoder (VAE)
Variational autoencoders are a slightly more modern and interesting take on autoencoding.
What is a variational autoencoder, you ask? It's a type of autoencoder with added constraints on the encoded representations being learned. More precisely, it is an autoencoder that learns a latent variable model for its input data. So instead of letting your neural network learn an arbitrary function, you are learning the parameters of a probability distribution modeling your data. If you sample points from this distribution, you can generate new input data samples: a VAE is a "generative model".
How does a variational autoencoder work?
First, an encoder network turns the input samples
x into two parameters in a latent space, which we will note
z_log_sigma. Then, we randomly sample similar points
z from the latent normal distribution that is assumed to generate the data, via
z = z_mean + exp(z_log_sigma) * epsilon, where
epsilon is a random normal tensor. Finally, a decoder network maps these latent space points back to the original input data.
The parameters of the model are trained via two loss functions: a reconstruction loss forcing the decoded samples to match the initial inputs (just like in our previous autoencoders), and the KL divergence between the learned latent distribution and the prior distribution, acting as a regularization term. You could actually get rid of this latter term entirely, although it does help in learning well-formed latent spaces and reducing overfitting to the training data.
Because a VAE is a more complex example, we have made the code available on Github as a standalone script. Here we will review step by step how the model is created.
First, here's our encoder network, mapping inputs to our latent distribution parameters:
x = Input(batch_shape=(batch_size, original_dim)) h = Dense(intermediate_dim, activation='relu')(x) z_mean = Dense(latent_dim)(h) z_log_sigma = Dense(latent_dim)(h)
We can use these parameters to sample new similar points from the latent space:
def sampling(args): z_mean, z_log_sigma = args epsilon = K.random_normal(shape=(batch_size, latent_dim), mean=0., std=epsilon_std) return z_mean + K.exp(z_log_sigma) * epsilon # note that "output_shape" isn't necessary with the TensorFlow backend # so you could write `Lambda(sampling)([z_mean, z_log_sigma])` z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_sigma])
Finally, we can map these sampled latent points back to reconstructed inputs:
decoder_h = Dense(intermediate_dim, activation='relu') decoder_mean = Dense(original_dim, activation='sigmoid') h_decoded = decoder_h(z) x_decoded_mean = decoder_mean(h_decoded)
What we've done so far allows us to instantiate 3 models:
- an end-to-end autoencoder mapping inputs to reconstructions
- an encoder mapping inputs to the latent space
- a generator that can take points on the latent space and will output the corresponding reconstructed samples.
# end-to-end autoencoder vae = Model(x, x_decoded_mean) # encoder, from inputs to latent space encoder = Model(x, z_mean) # generator, from latent space to reconstructed inputs decoder_input = Input(shape=(latent_dim,)) _h_decoded = decoder_h(decoder_input) _x_decoded_mean = decoder_mean(_h_decoded) generator = Model(decoder_input, _x_decoded_mean)
We train the model using the end-to-end model, with a custom loss function: the sum of a reconstruction term, and the KL divergence regularization term.
def vae_loss(x, x_decoded_mean): xent_loss = objectives.binary_crossentropy(x, x_decoded_mean) kl_loss = - 0.5 * K.mean(1 + z_log_sigma - K.square(z_mean) - K.exp(z_log_sigma), axis=-1) return xent_loss + kl_loss vae.compile(optimizer='rmsprop', loss=vae_loss)
We train our VAE on MNIST digits:
(x_train, y_train), (x_test, y_test) = mnist.load_data() x_train = x_train.astype('float32') / 255. x_test = x_test.astype('float32') / 255. x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:]))) x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:]))) vae.fit(x_train, x_train, shuffle=True, nb_epoch=nb_epoch, batch_size=batch_size, validation_data=(x_test, x_test))
Because our latent space is two-dimensional, there are a few cool visualizations that can be done at this point. One is to look at the neighborhoods of different classes on the latent 2D plane:
x_test_encoded = encoder.predict(x_test, batch_size=batch_size) plt.figure(figsize=(6, 6)) plt.scatter(x_test_encoded[:, 0], x_test_encoded[:, 1], c=y_test) plt.colorbar() plt.show()
Each of these colored clusters is a type of digit. Close clusters are digits that are structurally similar (i.e. digits that share information in the latent space).
Because the VAE is a generative model, we can also use it to generate new digits! Here we will scan the latent plane, sampling latent points at regular intervals, and generating the corresponding digit for each of these points. This gives us a visualization of the latent manifold that "generates" the MNIST digits.
# display a 2D manifold of the digits n = 15 # figure with 15x15 digits digit_size = 28 figure = np.zeros((digit_size * n, digit_size * n)) # we will sample n points within [-15, 15] standard deviations grid_x = np.linspace(-15, 15, n) grid_y = np.linspace(-15, 15, n) for i, yi in enumerate(grid_x): for j, xi in enumerate(grid_y): z_sample = np.array([[xi, yi]]) * epsilon_std x_decoded = generator.predict(z_sample) digit = x_decoded.reshape(digit_size, digit_size) figure[i * digit_size: (i + 1) * digit_size, j * digit_size: (j + 1) * digit_size] = digit plt.figure(figsize=(10, 10)) plt.imshow(figure) plt.show()
That's it folks! If you have suggestions for more topics to be covered in this post (or in future posts), you can contact me on Twitter at @fchollet.