Do we really need encoders for generative models?

First published: 2025-06-27

In modern generative AI, encoders are commonly used during training to help models capture the context of the input data, yet they are often discarded at inference time. This raises an interesting question: if we train a model with only a decoder, can it still generate meaningful outputs?

To simplify the problem, consider an autoencoder task, where we optimize the reconstruction of the input data. Given $\{x_i\}_{i=1}^N$, the goal is to minimize the reconstruction loss

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^N \| x_i - f(g(x_i)) \|^2$$

Here $f$ is the decoder and $g$ is the encoder, and the mean squared error (MSE) serves as the supervision signal.
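As a minimal sketch of this loss, the snippet below evaluates it with NumPy on toy data. The linear `encoder` and `decoder`, the array shapes, and the helper name `reconstruction_loss` are assumptions made for illustration; the post does not specify the actual architecture.

```python
import numpy as np

def reconstruction_loss(x, encoder, decoder):
    """Mean over samples of the squared L2 distance ||x_i - f(g(x_i))||^2."""
    x_hat = decoder(encoder(x))                # f(g(x)), shape (N, D)
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

rng = np.random.default_rng(0)
x = rng.random((8, 12))                        # 8 toy "images", 12 pixels in [0, 1)

# Hypothetical linear encoder/decoder: project onto 4 latent dims and back.
W = rng.standard_normal((12, 4))
encoder = lambda v: v @ W                      # g: R^12 -> R^4
decoder = lambda z: z @ np.linalg.pinv(W)      # f: R^4 -> R^12

loss = reconstruction_loss(x, encoder, decoder)
print(f"toy reconstruction loss: {loss:.4f}")
```

The loss sums squared errors over pixels and averages over samples, matching the formula above. Averaging over pixels as well, as many MSE implementations do, only rescales it by a constant.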

Ablations#

case 1: Encoder-Decoder Pair + Original Images#

After sufficient training, the encoder-decoder pair learns to compress the input data into a latent space and then reconstruct it in the original space.

Reconstruction results, trained on CIFAR10 for 30 epochs

case 2: Encoder-Decoder Pair + Zero-Valued Images#

In the second case, we replace the training inputs with zero-valued images, while the loss is still computed against the original images. At test time we compare two scenarios: one feeds the original images to the trained model, and the other feeds zero-valued images.

Reconstruction results, trained on CIFAR10 for 30 epochs, original images input

Reconstruction results, trained on CIFAR10 for 30 epochs, zero-valued images input

The two results look very similar: most pixels sit around $0.5$, so the outputs carry no meaningful information about the individual images. The model has effectively learned to produce a constant output, which is not useful.
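A small sketch of why zero-valued inputs lead to a constant output: every sample is mapped to the same latent code, so the decoder cannot tell samples apart. The one-layer `g` and `f` with random weights below are hypothetical stand-ins, not the networks used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 12, 4                                   # pixel count and latent size (toy values)

# Hypothetical one-layer encoder and decoder with biases and random weights.
W_g, b_g = rng.standard_normal((D, H)), rng.standard_normal(H)
W_f, b_f = rng.standard_normal((H, D)), rng.standard_normal(D)

def g(x):
    return np.tanh(x @ W_g + b_g)              # encoder

def f(z):
    return z @ W_f + b_f                       # decoder

zeros = np.zeros((5, D))                       # five zero-valued "images"
out = f(g(zeros))

# Every row is the same vector: the output cannot depend on the sample.
all_rows_identical = bool(np.allclose(out, out[0]))
print(all_rows_identical)
```

This holds for any weights, so training on zero inputs can only change which constant vector is produced, not make it sample-specific.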

case 3: Decoder Only + Original Images#

In this case, we remove the encoder and use only the decoder to generate outputs. The decoder takes one-hot class labels and Gaussian random noise as inputs, and the reconstruction loss is still the MSE between the generated outputs and the original images.
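One possible way to assemble such a decoder input is sketched below. How the label and the noise are combined is not stated in the post, so the concatenation, the `noise_dim` value, and the helper name `make_decoder_input` are all assumptions.

```python
import numpy as np

def make_decoder_input(labels, num_classes, noise_dim, rng):
    """Concatenate a one-hot label with a fresh Gaussian noise vector per sample."""
    one_hot = np.eye(num_classes)[labels]                      # (N, num_classes)
    noise = rng.standard_normal((len(labels), noise_dim))      # (N, noise_dim)
    return np.concatenate([one_hot, noise], axis=1)

rng = np.random.default_rng(0)
labels = np.array([3, 0, 9, 3])                # CIFAR10 has 10 classes
z = make_decoder_input(labels, num_classes=10, noise_dim=16, rng=rng)
print(z.shape)
```

Note that nothing in this input depends on the pixels of the target image; only the class label ties it to the data.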

Reconstruction results, trained on CIFAR10 for 30 epochs, decoder-only

The generated images still fail to reconstruct the given images, with most pixels valued around $0.5$, as in the previous case. The decoder alone is therefore not sufficient to generate meaningful outputs, since it lacks the context provided by the encoder. Even so, the synthesized images look more diverse and closer to the originals than those in Case 2.
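One plausible reading of the extra diversity, offered here as an assumption rather than a result from the post: when the decoder input carries a class label plus noise that is independent of the image, the best it can do under MSE is output roughly one image per class (the class mean) instead of one image for the whole dataset. The toy data and names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, per_class, D = 3, 50, 12

# Toy data: each class is scattered around its own centre in [0, 1]^D.
centres = rng.random((num_classes, D))
labels = np.repeat(np.arange(num_classes), per_class)
x = centres[labels] + 0.05 * rng.standard_normal((len(labels), D))

def mse(x, x_hat):
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

global_mean = x.mean(axis=0)                                   # one output for everything
class_means = np.stack([x[labels == c].mean(axis=0) for c in range(num_classes)])

loss_global = mse(x, global_mean)                              # label ignored
loss_per_class = mse(x, class_means[labels])                   # label used
print(f"global mean: {loss_global:.4f}, class means: {loss_per_class:.4f}")
```

The per-class loss can never exceed the global-mean loss, which is consistent with outputs that vary across classes yet still look like blurry averages.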

Analysis#

The above phenomenon can be explained by the following analysis. Expanding the reconstruction loss gives

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^N \| x_i - \hat x \|^2 = \hat x^\top \hat x - \frac{2}{N} \sum_{i=1}^N x_i^\top \hat x + \frac{1}{N} \sum_{i=1}^N x_i^\top x_i$$

where $\hat x$ denotes the reconstructed output $f(g(x_i))$, treated here as a single vector shared by all samples. From this quadratic form, we can see that without an informative encoder, $\hat x$ does not depend on $x_i$ and behaves like a free parameter. Setting the gradient $2 \hat x - \frac{2}{N} \sum_{i=1}^N x_i$ to zero shows that the optimum is the average of the input data, $\hat x = \frac{1}{N} \sum_{i=1}^N x_i$. This explains why the generated outputs are mostly constant values around $0.5$ in the previous cases.
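The claim about the optimum can be checked numerically. The sketch below runs plain gradient descent on a free vector `x_hat` over toy data; the uniform pixels, step size, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((1000, 12))                     # toy pixels, uniform in [0, 1)

def constant_loss(x, x_hat):
    """Loss when the same vector x_hat is produced for every sample."""
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

# Gradient descent on the free parameter x_hat, starting from zero.
x_hat = np.zeros(12)
for _ in range(200):
    grad = 2 * x_hat - 2 * x.mean(axis=0)      # derivative of the quadratic form
    x_hat -= 0.1 * grad

mean = x.mean(axis=0)
print(np.allclose(x_hat, mean))
print(constant_loss(x, mean) <= constant_loss(x, mean + 0.05))
```

For pixels spread over $[0, 1]$ the mean lands near $0.5$ in every coordinate, in line with the gray outputs described above.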

However, once an encoder that sees the real input is used, $\hat x_i = f(g(x_i))$ depends on $x_i$, so it is no longer a single free parameter. Each term of the loss can then be reduced individually, and the average of the input data is no longer the optimum.
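To illustrate the contrast, the sketch below compares the best constant output with an input-dependent reconstruction. It uses PCA as a stand-in for a trained linear encoder-decoder pair, which is an assumption for illustration and not the model from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((500, 12))                      # toy data in [0, 1)

def mse(x, x_hat):
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

mean = x.mean(axis=0)

# Linear autoencoder via PCA: keep the top 4 principal directions.
_, _, Vt = np.linalg.svd(x - mean, full_matrices=False)
V = Vt[:4].T                                   # (12, 4)

encode = lambda v: (v - mean) @ V              # g: depends on the actual input
decode = lambda z: z @ V.T + mean              # f: maps the code back to pixels

loss_constant = mse(x, mean)                   # best output that ignores x_i
loss_autoencoder = mse(x, decode(encode(x)))   # output that depends on x_i
print(f"constant: {loss_constant:.4f}, autoencoder: {loss_autoencoder:.4f}")
```

Even a 4-dimensional code that depends on $x_i$ lowers the loss below the constant-mean baseline, because each sample receives its own reconstruction.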

This indicates that the encoder plays a crucial role in providing context and structure to the generated outputs, allowing the model to learn more meaningful representations of the input data.