GAN
We aim to generate realistic images from text descriptions using a GAN architecture. The network we have designed generates images for two datasets: MSCOCO and CUB.
Text to Image Generation Using Generative Adversarial Networks (GANs)
Objectives:
- To generate realistic images from text descriptions.
- To use the skip-thought vector encoding for sentences.
- To construct a Deep Convolutional GAN and train it on the MSCOCO and CUB datasets.
Related Theoretical Concepts
A. Skip-Thought Vectors
The skip-thought model [1] is an unsupervised encoder-decoder model for encoding large chunks of text irrespective of the application domain. The approach is novel in that it shifts away from compositional-semantics-based methods while maintaining comparable quality.
The input to this model during training is a tuple of sentences $(s_{i-1}, s_i, s_{i+1})$. The encoder generates a state vector $h_i^t$ corresponding to the words $w_i^1, \ldots, w_i^t$ of the sentence $s_i$ at time step $t$. One of the two decoders predicts the word $w_{i+1}^t$ in the next sentence $s_{i+1}$, and the other decoder predicts the word $w_{i-1}^t$ in the previous sentence $s_{i-1}$ from the current state vector $h_i$. The objective function is the sum of log probabilities of the forward and backward sentences given the encoder representation:

$$\sum_t \log P(w_{i+1}^t \mid w_{i+1}^{<t}, h_i) + \sum_t \log P(w_{i-1}^t \mid w_{i-1}^{<t}, h_i)$$
In simpler terms, maximizing the sum of log probabilities would lead to generation of a faithful state vector which encodes information required to make predictions about the nearby sentences.
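As a toy numeric illustration of this objective (the tensors below are random, hypothetical stand-ins for decoder outputs), the sum of log probabilities can be computed from the two decoders' softmax outputs:

```python
import numpy as np

def log_prob_sum(logits, targets):
    """Sum of log P(target word | context) over time steps,
    given decoder logits of shape (T, vocab_size)."""
    # log-softmax over the vocabulary axis
    log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    return float(log_probs[np.arange(len(targets)), targets].sum())

rng = np.random.default_rng(0)
vocab, T = 10, 4
fwd_logits = rng.normal(size=(T, vocab))    # decoder for the next sentence
bwd_logits = rng.normal(size=(T, vocab))    # decoder for the previous sentence
fwd_words = rng.integers(0, vocab, size=T)  # hypothetical target word ids
bwd_words = rng.integers(0, vocab, size=T)

# Training maximizes this quantity (equivalently, minimizes its negative)
objective = log_prob_sum(fwd_logits, fwd_words) + log_prob_sum(bwd_logits, bwd_words)
print(objective)
```

Each term is a log probability and hence non-positive, so the objective is always at most zero; training pushes it toward zero by making the true nearby words more likely.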
Another important feature proposed in [2] is vocabulary expansion. The vocabulary of this RNN-based model, represented as $V_{rnn}$, is relatively small compared to other representations like word2vec, represented by $V_{w2v}$. A linear transformation $W$ can be constructed such that $v' = Wv$, where $v \in V_{w2v}$ and $v' \in V_{rnn}$, by minimizing the L2 loss over the words shared by both vocabularies to obtain $W$.
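A minimal sketch of this vocabulary expansion, assuming matched embeddings are available for the words shared by both vocabularies (the data below are random toy stand-ins, and the dimensions are hypothetical), is an ordinary least-squares fit of W:

```python
import numpy as np

rng = np.random.default_rng(0)
n_shared, d_w2v, d_rnn = 500, 300, 620

# Embeddings of words that appear in BOTH vocabularies (toy stand-ins)
X_w2v = rng.normal(size=(n_shared, d_w2v))  # word2vec vectors
true_W = rng.normal(size=(d_w2v, d_rnn))
X_rnn = X_w2v @ true_W                      # pretend RNN embeddings are linear in w2v

# Minimize ||X_w2v W - X_rnn||^2 (unregularized L2 regression)
W, *_ = np.linalg.lstsq(X_w2v, X_rnn, rcond=None)

# Any word2vec-only word can now be mapped into the RNN embedding space
v_new = rng.normal(size=d_w2v)
v_mapped = v_new @ W
print(v_mapped.shape)
```

Once W is fit, every word in the large word2vec vocabulary gains a usable embedding in the skip-thought encoder's space, even if it was never seen during RNN training.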
B. Generative Adversarial Networks

Generative Adversarial Networks (GANs) [2] consist of a generator $G$, parameterized by $\theta_g$, which is responsible for generating examples from a noise distribution $p_z$ that resemble the data distribution, and a discriminator $D$, parameterized by $\theta_d$, which is responsible for distinguishing examples arising from the data distribution $p_{data}$ from those generated by $G$.
The requirements for this framework to generate examples indistinguishable from the data distribution are:
- The discriminator D should maximise the probability of correctly classifying the source of examples.
- The generator G should maximise the probability that D incorrectly classifies the generated examples.
This is a minimax game between G and D:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
Training GANs is difficult; early successes were demonstrated using MLP-based generators and discriminators. It is observed that the discriminator can be safely trained while the generator is halted, but training the generator while the discriminator is halted leads to mode collapse [3], where the generator produces almost identical images that can still deceive the discriminator. Hence, the training of the discriminator and the generator has to go hand in hand.
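The alternating schedule can be illustrated with a minimal 1-D sketch (a hypothetical affine generator and logistic discriminator with hand-derived gradients, not the architecture used in this project):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Toy setup: real data is a 1-D Gaussian centred at 4,
# generator is an affine map of noise, discriminator is logistic regression.
w, c = 0.1, 0.0      # discriminator parameters
a, b = 1.0, 0.0      # generator parameters: G(z) = a*z + b
lr, batch = 0.05, 64

for _ in range(2000):
    x = rng.normal(4.0, 0.5, batch)          # real samples
    z = rng.normal(0.0, 1.0, batch)          # noise
    g = a * z + b                            # fake samples

    # --- discriminator step: ascend log D(x) + log(1 - D(G(z))) ---
    dx, dg = sigmoid(w * x + c), sigmoid(w * g + c)
    w += lr * (np.mean((1 - dx) * x) - np.mean(dg * g))
    c += lr * (np.mean(1 - dx) - np.mean(dg))

    # --- generator step: ascend log D(G(z)) (non-saturating loss) ---
    dg = sigmoid(w * g + c)
    a += lr * np.mean((1 - dg) * w * z)
    b += lr * np.mean((1 - dg) * w)

print(round(b, 2))  # generator offset, which should drift toward the data mean
```

Because both players update every iteration, the generator's output distribution chases the real one; freezing the discriminator entirely would remove the gradient signal that keeps the fakes diverse.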
C. Generative Adversarial Text to Image Synthesis

Caption to image generation has been addressed in [4]. The underlying idea is to augment the generator and discriminator in a GAN with suitable text encoding of the description. Conceptually, this is similar to conditioning the operation of the generator and discriminators on the text descriptions. The original work describes the implementation using Deep Convolutional Neural Networks hence the name DCGAN. The generator is a deconvolution network which generates an image from the text based on noise distribution. The discriminator is a convolutional network which outputs the probability of the input image belonging to the original data distribution given the text encoding.
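Conceptually, the conditioning on the discriminator side can be sketched as concatenating a spatially replicated text embedding with the image feature map before the final score (dimensions and weights below are hypothetical illustrations, not the project's exact values):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Hypothetical final-stage discriminator: image features from the conv stack,
# concatenated with a spatially replicated text embedding, then scored.
img_feat = rng.normal(size=(4, 4, 512))            # image pathway output
text_emb = rng.normal(size=128)                    # compressed text encoding
text_tile = np.broadcast_to(text_emb, (4, 4, 128)) # replicate over spatial grid

joint = np.concatenate([img_feat, text_tile], axis=-1)  # 4x4x640

# A 1x1 "convolution" plus global average, weights random for illustration
w = rng.normal(size=(640,)) * 0.01
score = sigmoid(joint.reshape(-1, 640) @ w).mean()      # P(real AND matching)
print(0.0 < score < 1.0)
```

The key point is that the score depends jointly on image features and the text encoding, so the discriminator can reject an image that is realistic but does not match its description.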
It is the loss function of the network that we changed to obtain varying results. As already mentioned, the discriminator is a convolutional neural network that was trained on two aspects:
- It should be able to differentiate between a generated image and an original image for the same textual image description.
- The discriminator should be able to differentiate between a real image and fake text.
The naive GAN only requires the discriminator to differentiate between a realistic image with fake text and an unrealistic image with original text. As pointed out in [4], this might lead to training complications. Therefore we modify the loss function of the discriminator to include one more source of loss, penalising the discriminator when it is fed realistic images with fake text. This technique, called Matching Aware GAN, therefore includes three parts in the loss function:
- Realistic images with original text
- Unrealistic images with original text
- Realistic images with fake text
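The three loss terms can be sketched as a single discriminator objective (a minimal sketch; the function and variable names are hypothetical, and the D outputs are assumed to already be probabilities in (0, 1)):

```python
import numpy as np

def matching_aware_d_loss(d_real_right, d_fake_right, d_real_wrong, eps=1e-8):
    """Binary cross-entropy over the three (image, text) pairings:
    real image + matching text    -> label 1
    fake image + matching text    -> label 0
    real image + mismatched text  -> label 0
    """
    loss_real = -np.log(d_real_right + eps)       # want D -> 1
    loss_fake = -np.log(1 - d_fake_right + eps)   # want D -> 0
    loss_wrong = -np.log(1 - d_real_wrong + eps)  # want D -> 0
    return float(np.mean(loss_real + loss_fake + loss_wrong))

# Hypothetical discriminator outputs for a batch of 4
loss = matching_aware_d_loss(
    d_real_right=np.array([0.9, 0.8, 0.95, 0.7]),
    d_fake_right=np.array([0.2, 0.1, 0.3, 0.25]),
    d_real_wrong=np.array([0.15, 0.2, 0.1, 0.3]),
)
print(loss > 0)
```

The third term is the matching-aware addition: without it, the discriminator is never penalised for accepting a realistic image paired with the wrong description.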
Implementation
We now describe the DCGAN architecture that we used for training, followed by the further modifications we made.
Before diving into the architecture, it is worth noting that inaccuracies in the text encoding could drastically affect the results. Hence, text descriptions were encoded using a pre-trained skip-thought model, as spending resources on training a text encoder from scratch is secondary to the scope of this project. From an implementation point of view, this decision is justifiable because the model's versatility due to vocabulary expansion can address any scalability issues arising at a later stage.

A. Generator Architecture
After noise is appended to the encoded sentence, we use a deconvolutional neural network (referred to as convolutional nets with fractional strides in [6]). We use 5 layers for the generator network, described as follows:
1. Generate a 128-dimensional conditioning latent variable using two fully connected layers that map the 4800-dimensional encoded sentence vector to mean and sigma predictions.
2. Sample epsilon from a normal distribution and generate the conditioning variable.
3. Append the conditioning variable to the 100-dimensional noise latent variable.
4. Map the appended vector into a 4x4x1024 tensor using a fully connected layer and reshape it.
5. A deconvolutional layer with filter dimension 3x3 and stride length 2, giving an output of 8x8x512. Leaky ReLU activation is used with a slope of 0.3, as suggested in [7]. Padding used is ‘SAME’. The output is batch normalized.
6. A deconvolutional layer with filter dimension 3x3 and stride length 2, giving an output of 16x16x256. Leaky ReLU activation with slope 0.3 is used. Padding used is ‘SAME’. The output is batch normalized.
7. A deconvolutional layer with filter dimension 3x3 and stride length 2, giving an output of 32x32x128. Leaky ReLU activation with slope 0.3 is used. Padding used is ‘SAME’. The output is batch normalized.
8. A deconvolutional layer with filter dimension 3x3 and stride length 2, giving an output of 64x64x3. Sigmoid activation is used in this layer. Padding used is ‘SAME’.
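The conditioning and projection steps above can be sketched in a few lines (the weights are random, hypothetical stand-ins for learned parameters; sigma is predicted in log-space here, one common choice for keeping it positive):

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(x, w, b):
    """Fully connected layer."""
    return x @ w + b

sent = rng.normal(size=4800)  # skip-thought sentence encoding
w_mu, b_mu = rng.normal(size=(4800, 128)) * 0.01, np.zeros(128)
w_sig, b_sig = rng.normal(size=(4800, 128)) * 0.01, np.zeros(128)

# Conditioning variable via the reparameterization trick
mu = fc(sent, w_mu, b_mu)
log_sigma = fc(sent, w_sig, b_sig)
eps = rng.normal(size=128)
cond = mu + np.exp(log_sigma) * eps          # 128-d conditioning variable

# Append the 100-d noise latent variable
noise = rng.normal(size=100)
latent = np.concatenate([cond, noise])       # 228-d appended vector

# Map to a 4x4x1024 tensor and reshape
w_proj = rng.normal(size=(228, 4 * 4 * 1024)) * 0.01
tensor = fc(latent, w_proj, np.zeros(4 * 4 * 1024)).reshape(4, 4, 1024)

# The stride-2 'SAME' deconvolutions then double the spatial size each layer:
# 4x4x1024 -> 8x8x512 -> 16x16x256 -> 32x32x128 -> 64x64x3
print(tensor.shape)
```

Sampling epsilon rather than feeding mu directly makes the conditioning stochastic, so a single caption can map to many plausible images.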
B. Discriminator Architecture
1. Map the sentence vector into a 4x4x128 tensor using a fully connected layer and reshape it.
2. A convolutional layer with filter dimension 3x3 over the input image. Stride length of 2 is used, giving an output of 32x32x64. Leaky ReLU activation is used with a slope of 0.3. Padding used is ‘SAME’. Dropout with keep probability 0.7 was used.
3. A convolutional layer with filter dimension 3x3 over the