In modern generative AI, encoders are commonly used during training to help models understand the context of input data. However, these encoders are often removed during inference. This raises an interesting question, if we train models using only decoders, can they still generate meaningful outputs?
To simplify the problem, consider an Autoencoder task, where we aim to optimize the reconstruction of input data. Given , the goal is to minimize the reconstruction loss
we use MSE here as the supervision signal, where is the decoder and is the encoder.
Ablations
case 1: Encoder-Decoder Pair + Original Images
After proper fine-tuning, the encoder-decoder pair learns to compress the input data into a latent space and then reconstruct it back to the original space.

case 2: Encoder-Decoder Pair + Zero-Valued Images
Now comes another case, we employ the zero-valued images as inputs. We compare two scenarios here, one with the original images as the test inputs, while the other uses zero-valued images as the test inputs as well.


Both of the results look quite similar, with the majority of the pixels valued around , indicating a lack of meaningful information in the generated images. This suggests that the model has learned to generate a constant output, which is not very useful.
case 3: Decoder Only + Original Images
In this case, we remove the encoder and only use the decoder to generate outputs. We use one-hot class labels and Gaussian random noises as inputs to the decoder. And the reconstruction loss is still defined as the MSE between the generated outputs and the original images.

It could be observed that the generated images are still unable to reconstruct the given images, with most of the pixels valued around same as the previous case. This indicates that the decoder alone is not sufficient to generate meaningful outputs, as it lacks the context provided by the encoder, but the synthesized images looks more diverse and more similar to the original ones than the previous cases.
Analysis
The above phononemenon can be explained by the following theoretical analysis. Expand the reconstruction loss function,
where is the reconstructed output. From this quadratic form, we could observed that without the encoder, is more like a random variable, i.e., a free parameter. The optimum for is the average of the input data . This explains why the generated outputs are mostly constant values around in the previous cases.
However, when considered using encoder, the optimization problem becomes more constrained, which means is not a free parameter anymore, and thus the average of the input data is not the optimum.
This indicates that the encoder plays a crucial role in providing context and structure to the generated outputs, allowing the model to learn more meaningful representations of the input data.

