
GAN

We aim to generate realistic images from text descriptions using GAN architecture. The network that we have designed is used for image generation for two datasets: MSCOCO and CUBS.


Text to image generation Using Generative Adversarial Networks (GANs)

Objectives:

  1. To generate realistic images from text descriptions.

  2. To use the skip thought vector encoding for sentences.

  3. To construct Deep Convolutional GAN and train on MSCOCO and CUB datasets.

Related Theoretical concepts

A. Skip-Thought Vectors

Skip-thought model [1] is an unsupervised encoder-decoder model for encoding large chunks of text irrespective of the application domain. The approach is novel in its shift away from compositional-semantics-based methods while maintaining the same quality. The input to this model during training is a tuple of sentences $(s_{i-1}, s_i, s_{i+1})$. The encoder generates a state vector $h_i^t$ corresponding to the words $w_i^1, \dots, w_i^t$ of the sentence $s_i$ at time step $t$. One of the two decoders predicts the words $w_{i+1}^t$ of the next sentence $s_{i+1}$, and the other decoder predicts the words $w_{i-1}^t$ of the previous sentence $s_{i-1}$, from the final encoder state vector $h_i$. The objective function is the sum of the log probabilities of the forward and backward sentences given the encoder representation:

$$\sum_t \log P\left(w_{i+1}^t \mid w_{i+1}^{<t}, h_i\right)$$

$$\sum_t \log P\left(w_{i-1}^t \mid w_{i-1}^{<t}, h_i\right)$$

In simpler terms, maximizing this sum of log probabilities forces the encoder to produce a faithful state vector, one that encodes the information required to make predictions about the nearby sentences.

Another important feature proposed in [1] is vocabulary expansion. The vocabulary of this RNN-based model, $V_{\text{rnn}}$, is relatively small compared to that of other representations like word2vec, $V_{\text{w2v}}$. A linear transformation $W$ can be constructed such that $v' = W v$, with $v \in V_{\text{w2v}}$ and $v' \in V_{\text{rnn}}$, by minimizing the L2 loss over the words shared by both vocabularies to obtain $W$.
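A minimal sketch of this vocabulary expansion via a closed-form least-squares fit (the matrix names and the 300/620 dimensions here are illustrative assumptions, not taken from the repository):

```python
import numpy as np

# Hypothetical embedding matrices for words shared by both vocabularies:
# X: word2vec embeddings (n_shared x 300), Y: skip-thought RNN embeddings
# (n_shared x 620). Real code would load these from trained models.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 300))
Y = rng.normal(size=(5000, 620))

# Solve min_W ||X W - Y||^2 in closed form (ordinary least squares).
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Any word2vec vector can now be mapped into the RNN embedding space,
# expanding the effective vocabulary of the skip-thought encoder.
v_w2v = rng.normal(size=(300,))
v_rnn = v_w2v @ W
print(v_rnn.shape)  # (620,)
```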

B. Generative Adversarial Networks


A Generative Adversarial Network (GAN) [2] consists of a generator G, parameterized by $\theta_g$, responsible for generating examples from a noise distribution $p_z(z)$ that resemble the data distribution, and a discriminator D, parameterized by $\theta_d$, responsible for distinguishing examples drawn from the data distribution $p_{\text{data}}(x)$ from those generated by G.

The requirements for this framework to generate examples indistinguishable from the data distribution are:

  1. The discriminator D should maximise the probability of correctly classifying the source of its input examples.

  2. The generator G should maximise the probability that D incorrectly classifies the generated examples.

This is a minimax game between G and D as:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

Training GANs is difficult; early success was demonstrated using MLP-based generators and discriminators. It is observed that the discriminator can be trained safely while the generator is held fixed, but training the generator while the discriminator is held fixed leads to mode collapse [3], where the generator produces almost identical images that deceive the discriminator. Hence, the training of the discriminator and the generator has to go hand in hand.
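The hand-in-hand schedule can be sketched as an alternating-update loop (a framework-agnostic sketch; `train_discriminator_step` and `train_generator_step` are placeholder names standing in for the real gradient updates, not functions from this repository):

```python
# Alternating GAN training schedule (schematic sketch).
def train_gan(num_steps, d_steps_per_g_step=1):
    history = []
    for _ in range(num_steps):
        # Update D with G held fixed: maximise log D(x) + log(1 - D(G(z))).
        for _ in range(d_steps_per_g_step):
            d_loss = train_discriminator_step()
        # Immediately update G with D held fixed: maximise log D(G(z)).
        # Training G alone for many consecutive steps risks mode collapse,
        # so the two updates are always interleaved.
        g_loss = train_generator_step()
        history.append((d_loss, g_loss))
    return history

# Toy stand-ins so the sketch runs; real implementations would compute
# gradients of the minimax objective here.
def train_discriminator_step():
    return 0.0

def train_generator_step():
    return 0.0

print(len(train_gan(10)))  # 10
```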

C. Generative Adversarial Text to Image Synthesis


Caption-to-image generation has been addressed in [4]. The underlying idea is to augment both the generator and the discriminator in a GAN with a suitable text encoding of the description. Conceptually, this amounts to conditioning the operation of the generator and the discriminator on the text description. The original work implements this with deep convolutional neural networks, hence the name DCGAN. The generator is a deconvolutional network that generates an image from the text encoding and a noise sample. The discriminator is a convolutional network that outputs the probability that the input image belongs to the original data distribution given the text encoding.

It is the loss function of the network that we changed to obtain varying results. As already mentioned, the discriminator is a convolutional neural network, and it was trained on two aspects:

  1. It should be able to differentiate between a generated image and an original image for the same text description.

  2. It should be able to recognise a real image paired with mismatched (fake) text as a negative example.

A naive GAN discriminator only learns to separate realistic images with matching text from unrealistic images with matching text, so it receives no explicit signal about whether an image actually matches its description. As pointed out in [4], this can complicate training. We therefore modify the discriminator loss to include one more source of loss, penalising the discriminator when it is fed realistic images with fake (mismatched) text. This technique, called Matching-Aware GAN, therefore includes three parts in the loss function:

  1. Realistic images with original text

  2. Unrealistic images with original text

  3. Realistic images with fake text
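The three-term discriminator objective can be sketched numerically (a minimal sketch: the 0.5 weighting on the two negative terms follows the matching-aware formulation in [4], and the probability values are made-up discriminator outputs, not numbers from this repository):

```python
import numpy as np

def bce(p, target):
    # Binary cross-entropy for a scalar probability p against a 0/1 target.
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p))

def matching_aware_d_loss(d_real_matched, d_fake_matched, d_real_mismatched):
    """Three-term discriminator loss of the matching-aware GAN.

    d_real_matched    : D(real image, matching text)      -> label 1
    d_fake_matched    : D(generated image, matching text) -> label 0
    d_real_mismatched : D(real image, mismatched text)    -> label 0
    """
    return (bce(d_real_matched, 1.0)
            + 0.5 * (bce(d_fake_matched, 0.0)
                     + bce(d_real_mismatched, 0.0)))

# Example with hypothetical discriminator outputs:
loss = matching_aware_d_loss(0.9, 0.2, 0.3)
print(round(float(loss), 3))  # 0.395
```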

Implementation

We now describe the DCGAN architecture that we have used for our training. Further modifications that have been made are mentioned thereafter.

Before diving into the architecture, it is worth noting that inaccuracies in the text encoding could drastically affect the results. Hence, text descriptions were encoded using a pre-trained skip-thought model, as spending resources on training a text encoder from scratch is secondary to the scope of this project. From an implementation point of view, this decision is justified because the model's vocabulary expansion makes it versatile enough to address any scalability issues arising at a later stage.


A. Generator Architecture

After noise is appended to the encoded sentence, we use a deconvolutional neural network (referred to as convolutional nets with fractional strides in [6]). We use 5 layers for the generator network, constructed as follows:

1.  Generate a 128-dimensional conditioning latent variable using two
    fully connected layers that map the 4800-dimensional encoded sentence
    vector to mean and sigma predictions.

2.  Sample epsilon from a normal distribution and generate the
    conditioning variable.

3.  Append the conditioning variable to the noise latent variable, which
    is 100-dimensional.

4.  Map the appended vector into a 4x4x1024 tensor using a fully
    connected layer and reshape it.

5.  A deconvolutional layer with filter dimension 3x3 and stride length
    2, giving an output of 8x8x512. Leaky ReLU activation with slope
    0.3 is used, as suggested in [7]. Padding used is ‘SAME’. The output
    is batch normalized.

6.  A deconvolutional layer with filter dimension 3x3 and stride length
    2, giving an output of 16x16x256. Leaky ReLU activation with slope
    0.3 is used. Padding used is ‘SAME’. The output is batch normalized.

7.  A deconvolutional layer with filter dimension 3x3 and stride length
    2, giving an output of 32x32x128. Leaky ReLU activation with slope
    0.3 is used. Padding used is ‘SAME’. The output is batch normalized.

8.  A deconvolutional layer with filter dimension 3x3 and stride length
    2, giving an output of 64x64x3. Sigmoid activation is used in this
    layer. Padding used is ‘SAME’.
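The spatial-size arithmetic of this deconvolution stack can be checked with a short sketch (this is only the shape bookkeeping, not the repository's code; with ‘SAME’ padding, a stride-2 transposed convolution doubles each spatial dimension regardless of the 3x3 filter size):

```python
# Shape walk-through of the generator's deconvolution stack.
# With 'SAME' padding, a fractionally-strided (transposed) convolution
# with stride s maps a spatial size n to n * s, independent of filter size.
def deconv_same_out(size, stride=2):
    return size * stride

shape = (4, 4, 1024)             # after the fully connected layer + reshape
channels = [512, 256, 128, 3]    # per-layer output depths from the text
for c in channels:
    shape = (deconv_same_out(shape[0]), deconv_same_out(shape[1]), c)
    print(shape)
# (8, 8, 512), (16, 16, 256), (32, 32, 128), (64, 64, 3)
```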

B. Discriminator Architecture

1.  Map the sentence vector into a 4x4x128 tensor using a fully
     connected layer and reshape it.

2.  A convolutional layer with filter dimension 3x3 over the
     input image. Stride length of 2 is used giving an output
     of 32x32x64. Leaky ReLU activation is used with the slope
     as 0.3. Padding used is ‘SAME’. Dropout with keep probability
     0.7 was used.

3.  A convolutional layer with filter dimension 3x3 over the
     
