How to Build Your Own ChatGPT or Any LLM
Step 1: Pre-training
Pre-training is the initial stage in the development of an LLM (Large Language Model). In this phase, a large text corpus is assembled and a base model is trained on it.
One of the first steps is data extraction, where large volumes of text are collected from various sources, often by scraping web pages.
The Internet is a valuable source because of the amount and variety of text it offers, although raw pages usually need cleaning and deduplication before they are useful for training.
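As a rough illustration of the extraction step, the sketch below pulls visible text out of an HTML page and drops exact duplicates, using only the Python standard library. The names TextExtractor, extract_text, and deduplicate, as well as the sample page, are made up for this example; a real pipeline would add fetching, language filtering, and far more thorough cleaning.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def deduplicate(documents):
    """Drop exact duplicates while preserving the original order."""
    seen = set()
    unique = []
    for doc in documents:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique


# Hypothetical page used only for illustration.
sample_page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><p>LLMs learn from text.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(sample_page))
```

Exact-match deduplication is the simplest possible choice here; near-duplicate detection is a separate and harder problem.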
Next, a model is trained on this data using machine learning algorithms. The model acts as a kind of "word calculator": given the text so far, it predicts which words are likely to come next, and in learning to do so it picks up much of the structure of human natural language.
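The "word calculator" idea can be shown with a toy model that simply counts which word follows which. This is only a sketch: real LLMs use neural networks over tokens rather than word counts, and the corpus and function names below are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the corpus."""
    follow_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            follow_counts[current_word][next_word] += 1
    return follow_counts


def next_word_probabilities(model, word):
    """Turn the raw counts for `word` into a probability distribution."""
    counts = model.get(word.lower())
    if not counts:
        return {}  # the word never appeared with a successor
    total = sum(counts.values())
    return {candidate: count / total for candidate, count in counts.items()}


# Toy corpus, made up for illustration.
toy_corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigram_model(toy_corpus)
print(next_word_probabilities(model, "cat"))
```

The principle carries over to a full LLM: the training signal is still "predict what comes next", only with a much richer model and far more context than a single preceding word.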
In summary, pre-training is the first step in creating an LLM. In this stage, the necessary data is collected and a model is trained on it to predict, and thereby model, human natural language.

Step 2: Supervised Fine-Tuning (SFT)
Supervised fine-tuning is a crucial stage in the development of an LLM. In this phase, the pre-trained model is trained further on curated examples to enhance its capabilities. The main steps include:
Generation of a smaller dataset:
A specific and smaller dataset is created to be used for improving the model. This dataset may focus on specific questions and answers or detailed information from a particular area.
Selection of target data:
Data that the model is expected to handle accurately is selected. This involves finding examples that are relevant to, and representative of, the type of input the model should be able to process.
Generalization of the data:
The created dataset focuses on providing examples that help the model generalize and understand different scenarios. The goal is to train the model to adapt and respond appropriately to various types of input.
Weight of the dataset:
During training, greater weight is given to the dataset generated in this stage. This means that the model will place more importance and attention on this dataset during the training process.
Establishment of acceptable response criteria:
The language model learns which responses are considered acceptable for the different types of input presented to it. Criteria and guidelines are established to determine the quality and accuracy of the responses, so that the model can improve its response capability.
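To make the dataset steps above more concrete, here is a minimal sketch of how one question-and-answer pair might be turned into a training example. It assumes a common convention that the text does not state: the loss is computed only on the response tokens, so the model learns what an acceptable answer looks like rather than learning to reproduce the prompt. Whitespace splitting stands in for a real tokenizer, and all names are hypothetical.

```python
def build_sft_example(prompt, response):
    """Concatenate prompt and response and mark which tokens are trained on.

    Whitespace splitting is a stand-in for a real tokenizer (an assumption).
    """
    prompt_tokens = prompt.split()
    response_tokens = response.split() + ["<eos>"]
    tokens = prompt_tokens + response_tokens
    # 0 = ignored by the loss (prompt), 1 = trained on (response).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return {"tokens": tokens, "loss_mask": loss_mask}


def masked_average_loss(per_token_losses, loss_mask):
    """Average the per-token losses over the response tokens only."""
    selected = [loss for loss, keep in zip(per_token_losses, loss_mask) if keep]
    return sum(selected) / len(selected) if selected else 0.0


example = build_sft_example(
    prompt="Question: What is the capital of France? Answer:",
    response="Paris",
)
print(example["tokens"])
print(example["loss_mask"])
```

In a real training loop the per-token losses would come from the model's predictions; here they would simply be passed in as a list of numbers.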
Supervised fine-tuning is essential for enhancing and refining the natural language model. This stage allows the model to adjust and better adapt to the various situations and types of information it may encounter.
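One possible reading of "greater weight" for the fine-tuning dataset is that training batches are sampled so that curated examples appear more often than other data. The sketch below illustrates that reading with weighted sampling; the source names, the 0.8/0.2 split, and the function name are assumptions made for the example, not values taken from any real training recipe.

```python
import random


def sample_mixed_batch(sources, weights, batch_size, rng):
    """Draw a batch in which each example's source is chosen by weight."""
    names = list(sources)
    source_weights = [weights[name] for name in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=source_weights, k=1)[0]
        batch.append((name, rng.choice(sources[name])))
    return batch


# Hypothetical data and weights, chosen only for illustration.
sources = {
    "curated": ["curated example A", "curated example B"],
    "general": ["general text 1", "general text 2", "general text 3"],
}
weights = {"curated": 0.8, "general": 0.2}

batch = sample_mixed_batch(sources, weights, batch_size=1000, rng=random.Random(0))
curated_share = sum(1 for name, _ in batch if name == "curated") / len(batch)
print(f"share of curated examples: {curated_share:.2f}")
```

Another equally plausible reading is that the model is simply trained on the curated dataset alone after pre-training, in which case no mixing is needed.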

Issues: Hallucinations
One of the challenges that arises during the development of an LLM is hallucination: the language model responds with information that is untrue or fabricated, often stated with full confidence.
In a widely reported 2023 case, a lawyer who relied on ChatGPT for legal research submitted a court filing that cited precedents that did not exist. Incidents like this raise concerns about the reliability and integrity of the responses generated by an LLM.
While the reasons behind these hallucinations are still being investigated, one theory suggests that supervised fine-tuning could be one of the main causes. During this stage, the model may be trained to give answers containing information it did not acquire during pre-training, which can teach it to produce plausible-sounding responses even when it lacks the underlying knowledge.
As a result, it is important to address and mitigate this hallucination problem during the development and training stage of the LLM. This involves careful selection of datasets and rigorous validation of the responses generated by the model to ensure their accuracy and truthfulness.
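As one small example of what validating responses can look like, the sketch below extracts case citations of the form "X v. Y" from a model response and flags any that are missing from a trusted index. Everything here is hypothetical: the case names are placeholders, the regular expression only covers this one simple citation shape, and a real check would query an authoritative legal database.

```python
import re

# Matches simple citations such as "Alpha v. Beta" (illustrative pattern only).
CITATION_PATTERN = re.compile(r"\b[A-Z][A-Za-z]+ v\. [A-Z][A-Za-z]+\b")


def find_unverified_citations(response_text, trusted_index):
    """Return cited cases that do not appear in the trusted index."""
    cited = CITATION_PATTERN.findall(response_text)
    return [case for case in cited if case not in trusted_index]


# Placeholder data: neither case name refers to a real decision.
trusted_index = {"Alpha v. Beta"}
model_response = "See Alpha v. Beta and Gamma v. Delta for support."

print(find_unverified_citations(model_response, trusted_index))
```

A check like this catches only fabricated references, not incorrect claims about real ones, so it complements rather than replaces human review.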

Step 3: Reinforcement Learning from Human Feedback (RLHF)
Reinforcement learning from human feedback is a key stage in the process of improving an LLM. In this phase, reinforcement learning techniques are used to enhance the quality of the responses generated by the natural language model. The main aspects of this stage are highlighted below:
Trained reward model:
A reward model, trained on human judgments of model responses, is used to evaluate the quality of the responses generated by the LLM. It acts as an automated stand-in for human evaluators, measuring how good or bad a response is without requiring a person to rate every output.
Determination of response quality:
The reward model evaluates each response provided by the LLM and assigns it a score. The criteria behind that score come from the guidelines human raters follow when judging how appropriate and accurate a response is in relation to the query or question posed.
PPO Algorithm:
To train the language model based on human feedback, the Proximal Policy Optimization (PPO) algorithm is used. This algorithm maximizes the rewards obtained during training while limiting how far each update can move the model from its previous behavior, allowing for stable, progressive improvement of the responses generated by the model.
Importance of balance:
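During this stage, it is essential to find an appropriate balance between human feedback and the reward model. The reward model is only an approximation of human judgment, so optimizing against it too aggressively can yield responses that score well but that people would not actually prefer. Both elements are key to achieving effective learning and continuous improvement in the quality of the responses generated by the LLM.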
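The text does not say how the reward model is trained. One widely used approach, assumed here, is to show human raters two responses to the same prompt, record which one they prefer, and train the reward model so that the preferred response receives the higher score. The sketch below computes the corresponding pairwise loss for a single comparison; the function name and the example scores are made up.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the human-preferred response scores higher
    and large when the rejected response scores higher.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for one human comparison.
print(pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.5))
print(pairwise_preference_loss(reward_chosen=0.5, reward_rejected=2.0))
```

Minimizing this loss over many comparisons pushes the reward model to rank responses the way the human raters did.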
Reinforcement learning from human feedback is an essential stage for perfecting the natural language model. Through the application of reinforcement learning techniques and the use of a reward model, the quality and accuracy of the LLM's responses are improved as it receives feedback from humans.
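The core of PPO can be sketched in a few lines. The function below computes the clipped objective for a single action, assuming the log-probabilities under the old and new model and an advantage estimate are already available; the function and parameter names and the default clip value of 0.2 are illustrative choices. A full RLHF setup adds much more around this, such as a value function and batching.

```python
import math


def ppo_clipped_objective(new_logprob, old_logprob, advantage, clip_epsilon=0.2):
    """Clipped PPO surrogate objective for one action (to be maximized)."""
    # How much more (or less) likely the new policy makes this action.
    ratio = math.exp(new_logprob - old_logprob)
    # Keep the ratio within [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    # Taking the minimum removes any incentive to move the ratio
    # beyond the clipping range in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical case: the new policy makes a well-rewarded action 1.5x more likely.
print(ppo_clipped_objective(new_logprob=math.log(1.5), old_logprob=0.0, advantage=1.0))
```

The clipping is what makes the optimization "proximal": once the new policy has moved far enough from the old one on a given action, further movement in that direction earns no extra credit.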