
How to build your own ChatGPT or any LLM
Step 1: Pre-training
Pre-training is the first stage in developing a Large Language Model (LLM). In this phase, a large text corpus is collected and a base model is trained on it.
One of the first steps is data extraction: large volumes of text are collected from many sources, chiefly the Internet, often by scraping web pages for their textual content.
The Internet is a valuable source of data because of the range and variety of information it provides.
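To make the scraping step more concrete, here is a minimal sketch of pulling the visible text out of a web page using only Python's standard library. The sample page and the names TextExtractor and extract_text are made up for illustration; a real pipeline would also download pages, remove duplicates, and filter for quality.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from an HTML page, skipping script and style blocks."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside <script> or <style>.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page content; a real pipeline would download pages first.
sample_html = """
<html><head><title>Example</title><style>p { color: red; }</style></head>
<body><h1>Hello</h1><p>Large language models learn from text.</p>
<script>console.log("ignored");</script></body></html>
"""

print(extract_text(sample_html))
```

The script and style contents are dropped because they are code, not language the model should learn from.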
Next, a machine learning model is trained on this data. The model learns to process human natural language, acting as a kind of "word calculator": given the words so far, it predicts which words are likely to come next in a sentence.
In summary, pre-training is the first step in creating an LLM: the necessary data is collected, a model is created, and the model is trained to understand and predict human natural language.
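The "word calculator" idea can be illustrated with a toy model that simply counts which word follows which. This is only a sketch under strong simplifying assumptions: real LLMs use neural networks over tokens rather than word counts, and the corpus and function names below are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the corpus."""
    follow_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            follow_counts[current_word][next_word] += 1
    return follow_counts


def next_word_probabilities(model, word):
    """Turn the raw counts for `word` into a probability distribution."""
    counts = model.get(word.lower())
    if not counts:
        return {}  # the word was never seen with a successor
    total = sum(counts.values())
    return {candidate: n / total for candidate, n in counts.items()}


# Tiny made-up corpus, purely for illustration.
toy_corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

model = train_bigram_model(toy_corpus)
probabilities = next_word_probabilities(model, "the")

# The most likely word after "the" in this corpus.
print(max(probabilities, key=probabilities.get))  # prints "cat"
```

Pre-training does the same kind of thing at a vastly larger scale: the model is adjusted until its predicted distribution over the next token matches what actually appears in the training text.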

Step 2: Supervised Fine-Tuning (SFT)
Supervised fine-tuning is a crucial stage in the development of an LLM. In this phase, a series of steps is taken to improve the capabilities of the pre-trained model. These steps include:
Generation of a smaller dataset:
A specific and smaller dataset is created to be used for improving the model. This dataset can focus on specific questions and answers or detailed information from a particular area.
Search for expected data:
Data that the model is expected to handle accurately is selected. This involves searching for examples and exercises that are relevant and representative of the type of information the model should be able to process.
Generalization of data:
The created dataset focuses on providing examples that help the model generalize and understand different scenarios. The aim is to train the model to adapt and respond appropriately to various types of input.
Weight of the dataset:
During fine-tuning, the dataset generated in this stage is given greater weight than the pre-training data, so its examples have a stronger influence on how the model ends up responding.
Establishment of criteria for acceptable responses:
The language model learns which responses are considered acceptable for different types of input it receives. Criteria and guidelines are established to determine the quality and accuracy of the responses, so that the model can improve its responsiveness.
Supervised fine-tuning is essential for improving and refining the natural language model. This stage allows the model to adjust and better adapt to various situations and types of information it may encounter.
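As a rough illustration of how a question-and-answer pair from the fine-tuning dataset can be turned into a training example, the sketch below builds token ids and labels for one pair. It assumes a common convention that the article does not spell out: the loss is computed only on the response, and prompt positions are marked with an ignore value (-100 here). The whitespace tokenizer and all names are hypothetical stand-ins.

```python
IGNORE_INDEX = -100  # assumed convention: positions with this label are excluded from the loss


def tokenize(text, vocab):
    """Toy whitespace tokenizer: gives each new word the next free integer id."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_sft_example(prompt, response, vocab):
    """Concatenate prompt and response; only response tokens keep real labels."""
    prompt_ids = tokenize(prompt, vocab)
    response_ids = tokenize(response, vocab)
    input_ids = prompt_ids + response_ids
    # Mask the prompt so the model is trained on the acceptable answer,
    # not on reproducing the question.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}


vocab = {}
example = build_sft_example(
    "Question: What is the capital of France? Answer:",
    "The capital of France is Paris.",
    vocab,
)

print(example["input_ids"])
print(example["labels"])
```

The shift by one position that next-token training needs is left out here to keep the sketch short.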

Issue: Hallucinations
One of the challenges that arises when developing an LLM is the phenomenon of hallucinations. A hallucination occurs when the language model responds with information that is false or invented.
Recently, a lawyer who relied too heavily on ChatGPT was given legal precedents and legal bases that did not exist. This problem raises concerns about the reliability and integrity of the responses generated by an LLM.
While the reasons behind these hallucinations are still being investigated, one theory suggests that supervised fine-tuning could be one of the main causes. During this stage, the model may be trained to give answers containing information it did not acquire in pre-training, which could lead to incorrect or invented responses.
As a result, it is important to address and mitigate this issue of hallucinations in the development and training stage of the LLM. This involves careful selection of datasets and rigorous validation of the responses generated by the model to ensure their accuracy and truthfulness.
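One very simple form of response validation is to check every citation in a model's answer against a trusted index before accepting it. The sketch below is only an illustration: the case names, the citation pattern, and the KNOWN_CASES set are all invented placeholders, and a real check would query an actual legal database.

```python
import re

# Hypothetical trusted index; in practice this would be a real database lookup.
KNOWN_CASES = {"Alpha v. Beta (2001)", "Gamma v. Delta (2015)"}

# Assumed citation format for this example: "Name v. Name (YEAR)".
CITATION_PATTERN = re.compile(r"[A-Z][a-z]+ v\. [A-Z][a-z]+ \(\d{4}\)")


def find_unverified_citations(response, known_cases):
    """Return every citation in the response that is missing from the trusted index."""
    cited = CITATION_PATTERN.findall(response)
    return [citation for citation in cited if citation not in known_cases]


response = "See Alpha v. Beta (2001) and Omega v. Sigma (2019) for precedent."
print(find_unverified_citations(response, KNOWN_CASES))
# prints ['Omega v. Sigma (2019)']
```

A check like this cannot prove an answer is correct, but it can flag invented references for human review.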

Step 3: Reinforcement Learning from Human Feedback (RLHF)
Reinforcement learning from human feedback is a key stage in the process of improving an LLM. In this phase, reinforcement learning techniques are used to enhance the quality of the responses generated by the natural language model. The main aspects of this stage are highlighted below:
Trained reward model:
A reward model trained on human judgments is used to evaluate the quality of the responses generated by the LLM. This reward model acts as a consistent, automated evaluator that scores how good or bad a response generated by the language model is.
Determination of response quality:
The reward model is responsible for evaluating the quality of the responses provided by the LLM. To do this, specific criteria and metrics are established to determine how appropriate and accurate the response is in relation to the query or question asked.
Proximal Policy Optimization (PPO) algorithm:
To train the language model based on human feedback, the Proximal Policy Optimization (PPO) algorithm is used. This algorithm maximizes the rewards obtained during training while limiting how far each update can move the model from its previous behavior, allowing for progressive improvement in the responses generated by the model.
Importance of balance:
During this stage, it is essential to strike an appropriate balance between human feedback and the reward model. Both elements are key to achieving effective learning and continuous improvement in the quality of the responses generated by the LLM.
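The article does not say how the reward model learns from humans. One common approach, assumed here, is to show people two responses to the same query, record which one they prefer, and train the reward model so that the preferred response scores higher. The loss for a single comparison can be sketched as follows; the function name and the scores are illustrative.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)) for one human comparison.

    The loss is small when the human-preferred response already scores higher,
    and large when the reward model ranks the pair the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up reward scores for a preferred and a rejected response.
print(pairwise_preference_loss(2.0, 0.0))  # ranking agrees with the human: low loss
print(pairwise_preference_loss(0.0, 2.0))  # ranking disagrees: high loss
```

Minimizing this loss over many human comparisons pushes the reward model's scores toward the human rankings.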
Reinforcement learning from human feedback is an essential stage to refine the natural language model. Through the application of reinforcement learning techniques and the use of a reward model, the quality and accuracy of the LLM's responses are improved as it receives feedback from humans.
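The core of PPO can be sketched with its clipped objective for a single action. This is a minimal illustration, not a full training loop: the log-probabilities and the advantage are passed in directly, whereas in RLHF the advantage would be derived from the reward model's score, and the clip_epsilon value of 0.2 is a commonly used default assumed here.

```python
import math


def clipped_surrogate(logprob_new, logprob_old, advantage, clip_epsilon=0.2):
    """PPO clipped objective for one action (higher is better).

    ratio compares how likely the updated model is to produce the action
    versus the model that generated it. Clipping removes the incentive to
    move the ratio outside [1 - epsilon, 1 + epsilon] in a single update.
    """
    ratio = math.exp(logprob_new - logprob_old)
    clipped_ratio = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# A good response (positive advantage) that the new model already favors
# 1.5x more: the objective is capped at 1.2 * advantage.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))

# A bad response (negative advantage) that the new model favors more:
# no cap applies, so the full penalty is kept.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0))
```

Taking the minimum of the clipped and unclipped terms is what keeps each update small when it would otherwise overshoot.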