Objective-Reinforced Generative Adversarial Network (Part I)

Finding the lead molecule is one of the most challenging steps in a drug discovery pipeline. Thousands of molecules must be screened and tested, which makes the process both critical and time-consuming. This blog discusses a deep generative model that tries to overcome these challenges.

Objective-Reinforced Generative Adversarial Network (ORGAN) is a modified version of the basic Generative Adversarial Network (GAN). Before we dig deeper into the theory and implementation of ORGAN, let me brief you on the basics of GANs. A simple GAN is composed of two neural networks, the Generator and the Discriminator.

Generator(G):

The main aim of the Generator is to generate fake samples that resemble the true data distribution so closely that the discriminator cannot differentiate between the true data and the fake ones. In other words, it tries to fool the discriminator.

Discriminator(D):

As the name suggests, it discriminates the input data and classifies whether it is from the true data sample/distribution or is a fake sample generated by the Generator(G). The Discriminator(D) is initially trained on true labeled data samples.

The two networks work against each other, each trying to outperform the other. Their overall objective is to generate data points that resemble those in the training data. Given an initial training distribution p_data, the generator G produces samples x from a distribution p_synth by transforming random noise z, while the discriminator D looks at samples from either p_synth or p_data and tries to classify their identity (y) as either real (x∼p_data) or fake (x∼p_synth).

The model follows a min-max game. The generator minimizes log(1−D(G(z))) so that it can fool the discriminator by generating samples very close to the original distribution, while the discriminator maximizes log(D(x)), together with log(1−D(G(z))), so that it can classify fake and real data points more accurately.

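For a single data point we have:

$V(D, G) = \log D(x) + \log(1 - D(G(z)))$.

For the complete distributions we have:

$\min_G \max_D V(D, G) = \mathrm{E}_{x \sim p_{data}}[\log D(x)] + \mathrm{E}_{z \sim p_z}[\log(1 - D(G(z)))]$,

where $p_z$ is the distribution of the random noise z.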
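To make the min-max game concrete, here is a minimal sketch that estimates the two log terms from a handful of discriminator outputs. The function name `gan_value` and the numbers are made up for illustration; they do not come from the ORGAN code.

```python
import math

def gan_value(d_real, d_fake):
    """Sample estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake well scores close to 0,
# which is the largest value this expression can take.
strong_d = gan_value(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])

# A fully fooled discriminator outputs 0.5 everywhere, giving 2 * log(0.5).
fooled_d = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator pushes this value up, and the generator pushes it down by making `d_fake` larger.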

Training a GAN

Training GANs is still an active topic of research. Various problems limit the power of GANs, and instability during training is a major roadblock. When you start to train a GAN, you will often find that the discriminator is more powerful than its generator counterpart, so the generator fails to train effectively. This in turn leads to a large loss during training. Conversely, if the discriminator is too lenient, it accepts almost any generated sample as real and gives the generator no useful feedback. In either case the GAN learns very little.

The training has two phases.

Discriminator Training
Generator Training
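The two phases can be sketched on a deliberately tiny one-dimensional problem. Everything here is an assumption made for illustration: the discriminator is a logistic function D(x) = sigmoid(w*x + b), the generator only shifts the noise (G(z) = z + theta), and the gradients are written out by hand. A real GAN would use neural networks and an autodiff library.

```python
import math
import random

def sigmoid(u):
    # Numerically stable logistic function.
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def discriminator_step(w, b, reals, fakes, lr):
    """Phase 1: gradient ascent on mean log D(x) + mean log(1 - D(G(z)))."""
    gw = gb = 0.0
    for x in reals:                      # d/du log(sigmoid(u)) = 1 - sigmoid(u)
        d = sigmoid(w * x + b)
        gw += (1.0 - d) * x / len(reals)
        gb += (1.0 - d) / len(reals)
    for x in fakes:                      # d/du log(1 - sigmoid(u)) = -sigmoid(u)
        d = sigmoid(w * x + b)
        gw -= d * x / len(fakes)
        gb -= d / len(fakes)
    return w + lr * gw, b + lr * gb

def generator_step(theta, w, b, noise, lr):
    """Phase 2: gradient descent on mean log(1 - D(G(z))) with D held fixed."""
    grad = 0.0
    for z in noise:
        d = sigmoid(w * (z + theta) + b)
        grad += -d * w / len(noise)
    return theta - lr * grad

rng = random.Random(0)
w, b, theta = 0.0, 0.0, 0.0
for _ in range(200):
    reals = [rng.gauss(3.0, 0.5) for _ in range(32)]   # toy "true" data
    noise = [rng.gauss(0.0, 0.5) for _ in range(32)]
    fakes = [z + theta for z in noise]
    w, b = discriminator_step(w, b, reals, fakes, lr=0.05)
    theta = generator_step(theta, w, b, noise, lr=0.05)
```

Alternating one discriminator update with one generator update is only one possible schedule. Giving one side more updates per round is a common way to keep either network from overpowering the other.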

This was a brief introduction to GANs. Now, moving on to ORGAN, let’s find out how it differs from a GAN. The main difference is the use of Reinforcement Learning (RL) to train the generator so that it produces output with desired properties. Because sampling discrete sequences is not differentiable, the discriminator's gradient cannot be passed back to the generator directly. ORGAN bypasses this generator differentiation problem by treating the generator as a stochastic policy in an RL setup. In other words, we update the generator parameters with a policy gradient.

Reinforcement Learning

We treat the Generator as an agent in an RL environment. Let s denote a state and a the action that the agent chooses from the action space A available in state s. The action space A consists of all possible characters that can be selected as the next character $x_{t+1}$. The state $s_t$ is the partial sequence of characters generated so far, $X_{1:t}$. Q(s,a) is the action-value function, which represents the expected reward of taking action a in state s and then following our current policy to complete the rest of the sequence. When we are in state s we estimate the Q value for every possible action, then we choose the action with the highest Q value.

Let $R(X_{1:T})$ be the reward function defined for full-length sequences. Given an incomplete sequence $X_{1:t}$ in state s, the generator $G_{\theta}$ (read G parametrized by $\theta$) must produce an action a, which is the next token $x_{t+1}$. The agent’s stochastic policy is given by $G_{\theta}(y_t | Y_{1:t-1})$, the probability of emitting token $y_t$ given the previously generated tokens $Y_{1:t-1}$, and our aim is to maximize the expected long-term reward $J(\theta)$.
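The state/action picture can be sketched with a toy generator that builds a sequence one character at a time. The vocabulary `VOCAB`, the `<eos>` stop token and the hand-written `toy_policy` are all hypothetical stand-ins. A real generator would be a trained recurrent network, and this sketch samples the next character from the policy instead of estimating Q values.

```python
import math
import random

VOCAB = ["C", "N", "O", "(", ")", "=", "<eos>"]  # toy action space A

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_policy(state):
    """Stand-in for the policy: one probability per token in VOCAB.

    The logits depend only on the length of the partial sequence, which is
    enough to keep the example runnable without a trained network.
    """
    logits = [0.0] * len(VOCAB)
    logits[0] = 1.0                        # favour "C"
    logits[-1] = 0.5 * len(state) - 2.0    # "<eos>" gets likelier over time
    return softmax(logits)

def rollout(rng, max_len=20):
    state = []                             # start with an empty partial sequence
    while len(state) < max_len:
        probs = toy_policy(state)          # distribution over actions in this state
        action = rng.choices(VOCAB, weights=probs, k=1)[0]
        if action == "<eos>":
            break
        state.append(action)               # the new state is the longer prefix
    return "".join(state)

sample = rollout(random.Random(0))
```

Each pass through the loop is one RL step: the state is the prefix generated so far, and the action is the character appended to it.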

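$J(\theta)=\mathrm{E} [R_T | s_0, \theta]=\sum_{x_1 \in X} G_{\theta}(x_1 | s_0) \cdot Q(s_0, x_1)$,

where $s_0$ is the initial state (the empty sequence), $R_T$ is the reward of the completed sequence, and the sum runs over every possible first token $x_1$.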
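As a quick worked example, the sum over first tokens is simply a probability-weighted average of Q values at the start state. The three-token vocabulary and all the numbers are invented for illustration.

```python
def expected_reward(policy_s0, q_s0):
    """Sum over first tokens x1 of policy(x1 | start state) * Q(start state, x1)."""
    return sum(policy_s0[x1] * q_s0[x1] for x1 in policy_s0)

# Hypothetical policy probabilities and Q values at the start state.
policy_s0 = {"C": 0.5, "N": 0.25, "O": 0.25}
q_s0 = {"C": 0.8, "N": 0.4, "O": 0.2}

j = expected_reward(policy_s0, q_s0)  # 0.5*0.8 + 0.25*0.4 + 0.25*0.2, about 0.55
```

Shifting probability mass toward tokens with a higher Q value raises J, which is exactly what the policy gradient update tries to do.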

The reward for generated molecules is calculated from reward metrics for specific properties. Some examples include LogP, Synthetic Accessibility, Natural Product-likeness, Chemical Beauty (Quantitative Estimate of Drug-likeness, QED), Tanimoto Similarity and Nearest Neighbour Similarity.

Reinforcement Metric: Molecular metrics are implemented using the RDKit cheminformatics package. Metrics include Synthetic Accessibility, Natural Product-likeness, Drug-likeness, LogP and Nearest Neighbour Similarity. These are applied to calculate the reward for each generated molecule. The reinforcement metric yields a quality score between 0 and 1 that expresses the desirability of a specific molecule, with 1 being highly desirable and 0 being highly undesirable.
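One simple way to turn a raw property value into a score between 0 and 1 is to rescale it linearly and clip it. The sketch below assumes "higher is better" and uses a hypothetical helper `remap`, a made-up target range and made-up LogP values. It does not call RDKit, and it is not necessarily how ORGAN normalizes its metrics.

```python
def remap(value, low, high):
    """Linearly rescale value from [low, high] to [0, 1], clipping outside."""
    if high <= low:
        raise ValueError("high must be greater than low")
    scaled = (value - low) / (high - low)
    return min(1.0, max(0.0, scaled))

# Hypothetical desirable LogP range and raw values for three molecules.
LOGP_RANGE = (-2.0, 5.0)
raw_logp = {"mol_a": -3.1, "mol_b": 1.5, "mol_c": 6.2}

rewards = {name: remap(v, *LOGP_RANGE) for name, v in raw_logp.items()}
# mol_a is clipped to 0.0, mol_b lands at 0.5, mol_c is clipped to 1.0.
```

In practice the raw values would come from a cheminformatics package, and some metrics may need a different mapping, for example when a value in the middle of the range is the most desirable.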

The generator is optimized to maximize this reward while still generating molecules similar to the initial distribution of data. The generated molecules are analyzed by both the discriminator and the reward metric, and the result is then used to optimize and train the generator to fool the discriminator.

This completes the first half of the training; the steps above are called pretraining. Next, we train both the generator and the discriminator, this time with a policy gradient. Since the generator has already been pretrained, it can generate molecules on its own, starting from the initial character “<bos>”. For each character generated, the loss is calculated and the model is updated. For the generator, this is the policy gradient loss; the generator is then optimized and all of its parameters are updated.

Policy Gradient technique

Let’s start with an arbitrary policy and take some actions. If the resulting rewards are better than expected, we increase the probability of those actions. If the rewards are worse than expected, we decrease the probability of taking those actions.
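This intuition can be checked on a three-action toy policy. The sketch below assumes a softmax policy over raw logits and uses a fixed `baseline` to stand in for the "expected" reward. Both are illustrative choices and are not taken from ORGAN.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, baseline, lr=0.5):
    """One policy gradient update for a single sampled action.

    For a softmax policy, the gradient of log pi(action) with respect to
    logit k is (1 if k == action else 0) - pi_k.
    """
    probs = softmax(logits)
    advantage = reward - baseline          # better or worse than expected?
    return [
        logit + lr * advantage * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]

logits = [0.0, 0.0, 0.0]
before = softmax(logits)

# Action 0 earned more than expected, so its probability goes up.
after_good = softmax(reinforce_step(logits, action=0, reward=1.0, baseline=0.5))

# Action 0 earned less than expected, so its probability goes down.
after_bad = softmax(reinforce_step(logits, action=0, reward=0.0, baseline=0.5))
```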

Policy Function

The policy function calculates the LogSoftmax of the output sequence, given the rewards, the targets and the length of the sequence. Its output is negated because we minimize a loss, whereas the goal is to maximize the expected reward. The policy gradient loss function looks like:

$L = -Q(s, a) \log (G(y_t | Y_{1:t-1}))$,

where $Q(s, a)$ is the expected reward for taking action a in state s and $G(y_t | Y_{1:t-1})$ is the policy.
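Here is a minimal sketch of this loss for one generated sequence. Summing the per-token terms over the sequence is an assumption of this example, and the probabilities and Q values are invented.

```python
import math

def policy_gradient_loss(token_probs, q_values):
    """Sum over tokens of -Q(s, a) * log(policy probability of the chosen token).

    token_probs: policy probability of each token that was actually generated.
    q_values:    expected reward assigned to each of those choices.
    """
    if len(token_probs) != len(q_values):
        raise ValueError("token_probs and q_values must have the same length")
    return -sum(q * math.log(p) for p, q in zip(token_probs, q_values))

# Hypothetical three-token sequence where every step gets the same reward.
token_probs = [0.5, 0.25, 0.8]
q_values = [0.9, 0.9, 0.9]
loss = policy_gradient_loss(token_probs, q_values)

# If the policy were more confident about the same rewarded tokens,
# the loss would be lower.
confident_loss = policy_gradient_loss([0.9, 0.9, 0.9], q_values)
```

Minimizing this loss raises the log-probability of tokens in proportion to their reward, which matches the intuition described in the Policy Gradient technique section.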

To get a more detailed outlook on policy gradient, refer to this blog.

In the next blog we will implement the theory discussed above to generate a desired molecule.