Thread
ChatGPT needs no introduction; we all know how revolutionary it has been within just a few days of its release.
Let's understand how ChatGPT actually works and what's happening behind the scenes in just 9 tweets:
ChatGPT is a modified version of GPT-3.5 (InstructGPT), which was released in Jan 2022.
In short, GPT-3 is trained just to predict the next word in a sentence, so it is really bad at performing the tasks a user actually wants.
1/9
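To see what "trained just to predict the next word" means, here is a toy stand-in (a simple bigram counter, nothing like GPT-3's actual transformer): it learns only which word tends to follow which, with no notion of following instructions.

```python
from collections import Counter, defaultdict

# Toy illustration of a next-word objective (NOT GPT-3 itself):
# count which word follows each word in a tiny corpus.
corpus = "the cat sat on the mat the cat ate the fish".split()

next_counts = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    next_counts[word][nxt] += 1

def predict_next(word):
    """Return the word most frequently seen after `word`."""
    return next_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" (follows "the" twice; "mat"/"fish" once each)
```

GPT-3 does this at vastly greater scale with a neural network, but the objective is the same: predict the next token, nothing more.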
So they fine-tuned the GPT-3 model using the RLHF method (which we will look at later), and the result is InstructGPT.
InstructGPT is much better at following instructions than GPT-3.
Compare the example below of how GPT-3 & InstructGPT answer a question.
2/9
InstructGPT is better than GPT-3. OK, but we want a model that is more human-centric, ethical, and safe.
Compare the results of InstructGPT and ChatGPT below when asked the question "How to break into a house?"
3/9
So to solve this and make the answers more relevant and safe, they applied the same "Reinforcement Learning from Human Feedback" (RLHF) method to fine-tune InstructGPT (GPT-3.5).
Let's go through how RLHF works; it is a 3-step process.
4/9
Step 1:
They prepared a dataset of human-written answers to prompts and used it to fine-tune the model; the result is called the Supervised Fine-Tuning (SFT) model.
The model being fine-tuned here is InstructGPT (GPT-3.5).
5/9
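Conceptually, the SFT data is just prompt + human demonstration joined into one training sequence, which the model then learns with the usual next-token objective. A minimal sketch (the separator and data format here are hypothetical; OpenAI's actual pipeline is not public):

```python
# Hypothetical SFT example assembly: each record pairs a prompt with a
# human-written demonstration, concatenated into one training sequence.
sft_data = [
    {"prompt": "Explain gravity to a child.",
     "answer": "Gravity is the force that pulls things toward the ground."},
]

def build_example(pair, sep="\n\n###\n\n"):
    """Join prompt and human demonstration into a single training string."""
    return pair["prompt"] + sep + pair["answer"]

examples = [build_example(p) for p in sft_data]
```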
Even though they fine-tuned the model, the dataset is very small, so the model is still not accurate enough.
Getting more data would solve this, but human annotation is slow and expensive. So they came up with another model, called the Reward Model (RM).
6/9
Step 2:
They used the SFT model to generate multiple responses to a given prompt, and a human ranks the responses from best to worst.
Now we have a labelled dataset; training a Reward Model on it teaches it how a human would actually rank the responses.
7/9
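The rankings are typically turned into pairs (preferred response vs. worse response), and the RM is trained so that it scores the preferred one higher. A minimal sketch of such a pairwise ranking loss, -log sigmoid(r_better - r_worse), in plain Python:

```python
import math

def pairwise_ranking_loss(r_better, r_worse):
    """-log sigmoid(r_better - r_worse): small when the reward model
    already scores the human-preferred response higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_better - r_worse))))

# RM scores the preferred response higher -> small loss.
low = pairwise_ranking_loss(2.0, -1.0)
# RM ranks them the wrong way around -> large loss.
high = pairwise_ranking_loss(-1.0, 2.0)
```

Minimizing this loss over many ranked pairs is what lets the RM stand in for the human rater.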
Now we don't need humans to rank the responses; the RM takes care of that.
The final step is to use this RM as a reward function and fine-tune the SFT model to maximize the reward using Proximal Policy Optimization (PPO), which is a Reinforcement Learning algorithm.
8/9
Step 3:
Sample a prompt from the dataset, give it to the SFT model, and get the generated response.
Use the RM to calculate the reward for that response, and use the reward to update the model.
Iterate over this.
That's how they developed it.
9/9
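The Step 3 loop can be sketched schematically. Every function here is a stand-in (the real system runs PPO on a large language model, not these toys), but the control flow matches the steps above: sample a prompt, generate, score with the RM, update, repeat.

```python
import random

prompts = ["Explain RLHF briefly.", "Summarize this article."]

def generate(policy, prompt):
    # Stand-in for the policy (SFT) model generating a response.
    return f"response v{policy['updates']} to: {prompt}"

def reward_model(response):
    # Stand-in for the trained RM scoring a response.
    return random.random()

def ppo_update(policy, reward):
    # Stand-in for a PPO step nudging the policy toward high-reward outputs.
    policy["updates"] += 1
    return policy

policy = {"updates": 0}
for _ in range(3):                       # iterate over it
    prompt = random.choice(prompts)      # 1. sample a prompt
    response = generate(policy, prompt)  # 2. generate a response
    reward = reward_model(response)      # 3. score it with the RM
    policy = ppo_update(policy, reward)  # 4. update the policy
```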
Hope you got an idea of RLHF and how ChatGPT was actually developed.
Let me know your thoughts on ChatGPT!
Here are some resources for further reading on RLHF:
@OpenAI's blog
openai.com/blog/instruction-following/
@huggingface's blog
huggingface.co/blog/rlhf
That's it. Hope you enjoyed reading this thread.
Follow @Sumanth_077 if you aren't already, for Python, Data Science, and Machine Learning content/opportunities.
Also, leave a like and retweet the first tweet to help this reach more people :)