# Objective Reinforced Generative Adversarial Network (Part I)

Finding the lead molecule is one of the most challenging steps in a drug discovery pipeline. Thousands of molecules are screened and tested, which makes the process time-consuming, yet it is critical to everything that follows. This blog discusses a deep generative model that tries to overcome these challenges.

**Objective-Reinforced Generative Adversarial Network (ORGAN)** is a modified version of a basic Generative Adversarial Network (GAN).
Before we dig deeper into the theory and implementation of ORGAN, let me brief you on the basics of GANs.
A simple GAN is composed of two neural networks, the Generator and the Discriminator.

**Generator(G):**

The main aim of the Generator is to generate fake samples which resemble the true data/distribution so closely, that the discriminator cannot differentiate between the true data and the fake ones. In other words, it tries to fool the discriminator.

**Discriminator(D):**

As the name suggests, it discriminates the input data and classifies whether it is from the true data sample/distribution or is a fake sample generated by the Generator(G). The Discriminator(D) is initially trained on true labeled data samples.

These two networks work against each other, each trying to outperform the other. Their joint objective is to generate data points that are similar to those in the training data.
Given an initial training distribution $p_{data}$, the generator G samples x from a distribution $p_{synth}$, generated from random noise z, while the discriminator D looks at samples from either $p_{synth}$ or $p_{data}$ and tries to classify their identity (y) as either real ($x \sim p_{data}$) or fake ($x \sim p_{synth}$).

The model follows a min-max game. We minimize the Generator term $\log(1 - D(G(z)))$ so that the generator fools the discriminator by producing samples very close to the original distribution, while we maximize the Discriminator term $\log D(x)$ so that it classifies fake and real data points more accurately.

- For a single data point we have: $\min_G \max_D \left[ \log D(x) + \log (1 - D(G(z))) \right]$, with $x \sim p_{data}$ and $z \sim p_{synth}$.
- For the complete distributions we have: $\min_G \max_D \; \mathrm{E}_{x \sim p_{data}} [ \log D(x) ] + \mathrm{E}_{z \sim p_{synth}} [ \log (1 - D(G(z))) ]$, where $\mathrm{E}$ is the expectation.
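
To make the objective concrete, here is a minimal sketch that estimates the value function from a batch of discriminator outputs by replacing each expectation with a sample mean. The function name `gan_value` and the numbers are illustrative assumptions, not part of ORGAN.

```python
import math

def gan_value(d_real, d_fake):
    """Sample estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator outputs, chosen only for illustration.
confident = gan_value(d_real=[0.9, 0.8, 0.95], d_fake=[0.1, 0.2, 0.05])
fooled = gan_value(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])

print(f"discriminator winning: {confident:.3f}")
print(f"discriminator fooled:  {fooled:.3f}")  # equals -2 * ln(2)
```

The discriminator pushes this value up by being confident and correct. The generator pulls it down, and when the discriminator can only guess (every output is 0.5), the value is $-2\ln 2$.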

### Training a GAN

Training a GAN is still an active topic of research. Various problems limit the power of GANs, and instability during training is a major roadblock. When you start to train a GAN, you will often find that the discriminator becomes more powerful than its generator counterpart, so the generator fails to train effectively. This in turn results in a large generator loss during training. Conversely, if the discriminator is too lenient, it will accept practically any generated sample as real, which makes the whole setup useless.

The training has two phases.

**Discriminator Training**

We train the Discriminator on the labeled training set for a number of epochs. It must be trained so that it classifies the training data correctly as real (1), and we tune this by varying the number of epochs.
While the Discriminator is training, the Generator is in freeze mode (freezing means setting training to false, so the network only does a forward pass and no back-propagation is applied).
Afterward, we generate fake data and train the Discriminator until it tells real from fake reliably.
At each step we calculate the loss, compute the gradients, and update the network parameters.

**Generator Training**

To train the Generator, we use the predictions of the Discriminator as the training objective.
Mirroring the previous phase, the Discriminator is now the one in freeze mode.
Again we calculate the loss, compute the gradients, and update the network parameters.
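
The alternation between the two phases can be sketched with a deliberately tiny, one-dimensional GAN. Everything here is an assumption made for illustration: the class `ToyGAN`, a generator that only shifts its noise by `b`, a logistic discriminator with parameters `w` and `c`, and hand-derived gradients. A real implementation would use a deep learning framework and neural networks, but the freeze logic is the same: each phase updates only its own parameters.

```python
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

class ToyGAN:
    """1-D toy model: G(z) = z + b, D(x) = sigmoid(w * x + c)."""

    def __init__(self):
        self.b = 0.0               # generator parameter
        self.w, self.c = 0.1, 0.0  # discriminator parameters

    def generate(self, z):
        return z + self.b

    def discriminate(self, x):
        return sigmoid(self.w * x + self.c)

    def discriminator_step(self, real, noise, lr=0.05):
        """Update only (w, c); the generator parameter b stays frozen."""
        fake = [self.generate(z) for z in noise]
        gw = gc = 0.0
        for x in real:   # gradient of -log D(x)
            d = self.discriminate(x)
            gw += -(1.0 - d) * x
            gc += -(1.0 - d)
        for x in fake:   # gradient of -log(1 - D(G(z)))
            d = self.discriminate(x)
            gw += d * x
            gc += d
        n = len(real) + len(fake)
        self.w -= lr * gw / n
        self.c -= lr * gc / n

    def generator_step(self, noise, lr=0.05):
        """Update only b; the discriminator parameters (w, c) stay frozen."""
        gb = 0.0
        for z in noise:  # gradient of log(1 - D(G(z))) with respect to b
            gb += -self.discriminate(self.generate(z)) * self.w
        self.b -= lr * gb / len(noise)

rng = random.Random(0)
gan = ToyGAN()
for step in range(200):
    real = [rng.gauss(3.0, 0.5) for _ in range(16)]   # "true" data centred at 3
    noise = [rng.gauss(0.0, 0.5) for _ in range(16)]
    gan.discriminator_step(real, noise)  # phase 1: generator frozen
    gan.generator_step(noise)            # phase 2: discriminator frozen

print(f"generator shift b after training: {gan.b:.3f}")
```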

This was a brief introduction to GANs. Now, moving on to ORGAN, let's find out how it differs from a plain GAN.
In ORGAN, the main difference is the application of **Reinforcement Learning (RL)** to train the generator so that it generates output with desired properties.
Because sampling discrete characters is not differentiable, the discriminator's gradient cannot flow back into the generator directly. ORGAN bypasses this generator differentiation problem by treating the generation of discrete sequences as a stochastic policy in an RL setup. In other words, we update the generator parameters with a policy gradient.

### Reinforcement Learning

We treat the Generator as an agent in an RL environment. Each state is denoted **s**, and **a** is the action that the agent chooses from the action space **A** available in state **s**. The action space **A** comprises all possible characters to select for the next character $x_{t+1}$. State $s_t$ is an already generated partial sequence of characters $X_{1:t}$. **Q(s,a)** is the action-value function that represents the expected reward at state **s** of taking action **a** and following our current policy to complete the rest of the sequence. When we are in state **s** we estimate the **Q** value for every possible action, then we choose the action with the highest **Q** value. Let $R(X_{1:T})$ be the reward function defined for full-length sequences. Now, if we have an incomplete sequence $X_{1:t}$ in state **s**, the generator $G_{\theta}$ (read: G parametrized by $\theta$) must produce an action **a** with the next token $x_{t+1}$. The agent's policy is given by $G(y_t | Y_{1:t-1})$ and our aim is to maximize the expected long-term reward $J(\theta)$:

$J(\theta)=\mathrm{E} [R(X_{1:T}) | s_0, \theta]=\sum_{x_1 \in X} G_{\theta}(x_1 | s_0) \cdot Q(s_0,x_1)$,

where $s_0$ is the initial state.
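
As a small worked example of the sum above, the sketch below evaluates the expected reward for the very first token, starting from the start state (written `s0` in the comments). The three-character action space, the probabilities, and the Q values are made-up numbers, and `expected_reward` is a hypothetical helper name.

```python
def expected_reward(policy_probs, q_values):
    """J(theta) = sum over first tokens x1 of G_theta(x1 | s0) * Q(s0, x1)."""
    assert abs(sum(policy_probs.values()) - 1.0) < 1e-9, "policy must sum to 1"
    return sum(p * q_values[token] for token, p in policy_probs.items())

# Hypothetical three-character action space with made-up numbers.
policy = {"C": 0.6, "N": 0.3, "O": 0.1}  # G_theta(x1 | s0)
q = {"C": 0.8, "N": 0.5, "O": 0.2}       # Q(s0, x1)

j = expected_reward(policy, q)
greedy_action = max(q, key=q.get)        # action with the highest Q value
print(f"J(theta) = {j:.2f}, highest-Q action = {greedy_action}")
```

Here the sum is 0.6 × 0.8 + 0.3 × 0.5 + 0.1 × 0.2 = 0.65. Training moves probability mass toward high-Q characters, which raises this number.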

The reward for generated molecules is calculated by reward metrics for specific properties. Some examples include LogP, Synthetic Accessibility, Natural Product-Likeness, Chemical Beauty (Quantitative Estimate of Drug-Likeness, QED), Tanimoto Similarity, and Nearest Neighbour Similarity.

**Reinforcement Metric:**
Molecular metrics are implemented using the RDKit cheminformatics package. Metrics include Synthetic Accessibility, Natural Product-Likeness, Drug-Likeness, LogP, and Nearest Neighbour Similarity. These are applied to calculate the reward for each generated molecule. The reinforcement metric provides a quality score between 0 and 1 that expresses the desirability of a specific molecule, with 1 being highly desirable and 0 being highly undesirable.
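
To illustrate a reward that naturally falls between 0 and 1, here is a minimal sketch of Tanimoto similarity, one of the metrics listed above. It assumes each molecule is already represented as a set of "on" fingerprint bits. The bit sets are hand-written for illustration and are not RDKit output. In practice the fingerprints would come from a cheminformatics library.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity |A ∩ B| / |A ∪ B| between two sets of 'on' bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)

# Hypothetical fingerprints (indices of bits that are set).
reference = {1, 4, 7, 9, 12}
generated_close = {1, 4, 7, 9, 15}  # shares 4 of 6 distinct bits
generated_far = {2, 3, 15}          # shares no bits with the reference

reward_close = tanimoto(reference, generated_close)
reward_far = tanimoto(reference, generated_far)
print(f"close: {reward_close:.3f}, far: {reward_far:.3f}")
```

A generated molecule that resembles the reference gets a reward near 1, and one that shares nothing with it gets 0.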

The main objective of the reinforcement metric is to maximize the reward by optimizing the generator to generate molecules similar to the initial distribution of data. The generated molecules are then analyzed by the discriminator and the reward metric. The result is used to optimize and train the generator to fool the discriminator.

We have completed the first half of the training. The steps above are called pretraining. Now we will train both the generator and the discriminator again, this time with a policy gradient. Since the generator has already been pretrained, it can generate molecules of its own, starting from the initial character “<bos>”. For each character generated, the loss is calculated and the model is updated. For the generator, this is the policy gradient loss. The generator is then optimized and all of its parameters are updated.
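
Character-by-character generation from “<bos>” can be sketched as repeated sampling from the policy. The lookup table `POLICY` below is a stand-in for a trained generator and is entirely made up. It conditions only on the previous token, whereas a real generator conditions on the whole partial sequence. The end token “<eos>” is also an assumption of this sketch.

```python
import random

# Hypothetical next-character probabilities, keyed by the previous token.
POLICY = {
    "<bos>": {"C": 0.7, "N": 0.2, "O": 0.1},
    "C": {"C": 0.5, "O": 0.2, "N": 0.1, "<eos>": 0.2},
    "N": {"C": 0.6, "<eos>": 0.4},
    "O": {"C": 0.5, "<eos>": 0.5},
}

def sample_sequence(rng, max_len=20):
    """Sample one string, starting at <bos> and stopping at <eos> or max_len."""
    tokens = ["<bos>"]
    while len(tokens) <= max_len:
        options = POLICY[tokens[-1]]
        nxt = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return "".join(tokens[1:])  # drop the <bos> marker

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_sequence(rng) for _ in range(5)]
print(samples)
```

Each finished string would then be scored by the discriminator and the reward metric before the generator is updated.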

### Policy Gradient technique

Let’s start with an arbitrary policy and take some actions. If the rewards that follow are better than expected, we increase the probability of those actions. If the rewards are worse, we decrease the probability of taking those actions.
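
This rule can be checked numerically on a softmax policy over three actions. The sketch below is a generic REINFORCE-style update with a baseline standing in for the "expected" reward. The function names, learning rate, and reward values are illustrative assumptions and are not taken from the ORGAN code.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_update(logits, action, reward, baseline, lr=0.5):
    """One gradient-ascent step on (reward - baseline) * log pi(action)."""
    probs = softmax(logits)
    advantage = reward - baseline
    # d log pi(action) / d logit_k = 1[k == action] - pi(k)
    return [
        logit + lr * advantage * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]

logits = [0.0, 0.0, 0.0]  # arbitrary starting policy: uniform over 3 actions
before = softmax(logits)[1]
better = softmax(policy_gradient_update(logits, action=1, reward=0.9, baseline=0.5))[1]
worse = softmax(policy_gradient_update(logits, action=1, reward=0.1, baseline=0.5))[1]

print(f"before: {before:.3f}, after good reward: {better:.3f}, after bad reward: {worse:.3f}")
```

A reward above the baseline raises the probability of the chosen action, and a reward below it lowers that probability.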

**Policy Function**
The policy function computes the LogSoftmax of the output sequence, given the rewards, the targets, and the sequence length. Its output is negated because an optimizer minimizes a loss, whereas we want to maximize the expected reward.
The policy gradient loss function looks like:

$L = -Q(s, a) \log (G(y_t | Y_{1:t-1}))$,

where $Q(s, a)$ is the expected reward for an action **a** in state **s** and $G(y_t | Y_{1:t-1})$ is the policy.

To get a more detailed outlook on policy gradient, refer to this blog.
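
A minimal sketch of this loss for one generated sequence is shown below. It assumes the generator emits one vector of logits per time step, that `targets` holds the index of the character actually chosen at each step, and that `rewards` holds the corresponding Q values. Averaging over the sequence length is a choice made for this sketch, and all names are hypothetical.

```python
import math

def log_softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def policy_gradient_loss(step_logits, targets, rewards):
    """Mean over time steps of -Q(s, a) * log G(y_t | Y_{1:t-1})."""
    total = 0.0
    for logits, target, reward in zip(step_logits, targets, rewards):
        total += -reward * log_softmax(logits)[target]
    return total / len(targets)

# Two time steps over a hypothetical 3-character vocabulary, uniform logits.
step_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]      # indices of the characters that were sampled
rewards = [1.0, 0.5]  # made-up Q values for those choices

loss = policy_gradient_loss(step_logits, targets, rewards)
print(f"policy gradient loss: {loss:.4f}")  # 0.75 * ln(3) for these inputs
```

Minimizing this quantity raises the log-probability of highly rewarded characters the most, while a character with zero reward contributes nothing to the loss.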

In the next blog, we will implement the theory discussed above to generate molecules with desired properties.