Ever wondered how ChatGPT actually works?

The reason behind this amazing model is Reinforcement Learning from Human Feedback (RLHF) 🔥

Let me break down how RLHF works for you in this thread: 🧵👇
ChatGPT is a fine-tuned version of GPT-3.5 and a sibling model of InstructGPT, which was released in Jan 2022 ⌛️

In short, GPT-3 is trained just to predict the next word in a sentence, so it is often bad at performing the tasks the user actually wants.

Fine-tuning the GPT-3 model with the RLHF method (which we will look at later) results in InstructGPT.

InstructGPT is much better at following instructions than GPT-3.

Compare the example below of how GPT-3 & InstructGPT answer the same question.

InstructGPT is better than GPT-3. OK, but we want a model that is more human-centric, ethical, and safe.

Compare the results of InstructGPT and ChatGPT below when asked the question: "How to break into a house?"

To solve this and make the answers more relevant and safe, they used the same Reinforcement Learning from Human Feedback (RLHF) method to fine-tune GPT-3.5.

RLHF is a 3-step process. Let's go through each step:


They prepared a dataset of human-written answers to prompts and used it to fine-tune the model; the result is called the Supervised Fine-Tuning (SFT) model.

The base model fine-tuned here is GPT-3.5.

Even after this fine-tuning, the model is still not great, because the dataset is small.

More data would help, but human annotation is slow and expensive. So they came up with another model, called the Reward Model (RM).
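Conceptually, the SFT step is just ordinary next-token training on the human demonstrations. A minimal sketch in plain Python (the function name and numbers are illustrative, not from the thread):

```python
import math

def sft_loss(token_log_probs):
    """Supervised fine-tuning objective: the mean negative log-likelihood
    the model assigns to the tokens of a human-written demonstration.
    Training minimizes this, i.e. maximizes the demonstration's likelihood."""
    return -sum(token_log_probs) / len(token_log_probs)

# Toy example: log-probs the model gave to three demonstration tokens.
demo_log_probs = [math.log(0.9), math.log(0.8), math.log(0.7)]
loss = sft_loss(demo_log_probs)  # lower = model matches the human answer better
```

A real SFT run computes this same quantity over whole batches of prompt/answer pairs and backpropagates through the network.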


They used the SFT model to generate multiple responses to a given prompt, and humans ranked the responses from best to worst.

With this labeled dataset, they trained a Reward Model to learn how a human would rank the responses.

Now we don't need humans to rank the responses; the RM takes care of that.
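Under the hood, those human rankings are usually broken into pairwise comparisons, and the Reward Model is trained so the preferred response scores higher. A minimal sketch of that pairwise loss (Bradley-Terry style, as in the InstructGPT paper; the reward numbers are illustrative):

```python
import math

def rm_pairwise_loss(reward_preferred, reward_rejected):
    """Pairwise ranking loss for the Reward Model: -log(sigmoid(r_w - r_l)).
    Small when the human-preferred response already scores higher,
    large when the ranking is wrong."""
    margin = reward_preferred - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correct ranking -> small loss; wrong ranking -> large loss.
good = rm_pairwise_loss(2.0, -1.0)
bad = rm_pairwise_loss(-1.0, 2.0)
```

Minimizing this over all ranked pairs pushes the RM's scalar score to agree with the human ordering.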

The final step is to use this RM as the reward function and fine-tune the SFT model to maximize the reward, using Proximal Policy Optimization (PPO), a reinforcement learning algorithm.


Sample a prompt from the dataset, give it to the SFT model, and get the generated response.

Use the RM to score the response with a reward, and use that reward to update the model.

Iterate over this loop.
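In code terms, PPO maximizes the RM's reward while clipping how far each update can move the policy away from the previous one. A minimal sketch of PPO's clipped surrogate objective (eps=0.2 is a common default, not a value stated in the thread):

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective for a single action (token).
    `ratio` is pi_new(a|s) / pi_old(a|s); `advantage` here is derived from
    the Reward Model's score. Clipping keeps the fine-tuned policy from
    drifting too far from the SFT model in one update."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    # Take the pessimistic (smaller) value, as PPO prescribes.
    return min(unclipped, clipped)

# The new policy moved a lot (ratio 1.5) on a good action: gain is capped at 1.2x.
capped = ppo_clipped_objective(1.5, 1.0)
```

Each iteration of the loop above computes this objective over a batch of sampled responses and takes a gradient step to increase it.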

That's how ChatGPT was developed.

Here are some resources where you can learn more about RLHF:

OpenAI's blog: …

Hugging Face blog:

That's a wrap. Every day, I share and simplify complex concepts around Python, Machine Learning & Language Models.

Follow me → @Sumanth_077 ✅ if you haven't already to make sure you don't miss them.

Like/RT the first tweet to support my work and help this reach more people.