Large Language Models (LLMs), such as ChatGPT, Gemini, Claude, etc., have been around for a while now, and I believe most of us have already used at least one of them. As of this writing, ChatGPT already runs on the fourth generation of the GPT-based model, named GPT-4. But do you know what GPT actually is, and what the underlying neural network architecture looks like? In this article we are going to talk about GPT models, specifically GPT-1, GPT-2 and GPT-3. I will also demonstrate how to code them from scratch with PyTorch so that you can get a better understanding of the structure of these models.
A Brief History of GPT
Before we get into GPT, we first need to understand the original Transformer architecture. Generally speaking, a Transformer consists of two main components: the Encoder and the Decoder. The former is responsible for understanding the input sequence, while the latter generates another sequence based on that input. For example, in a question answering task, the decoder produces the answer to the input sequence, whereas in a machine translation task it generates the translation of the input.
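To make the encoder-decoder split concrete, here is a minimal sketch using PyTorch's built-in `nn.Transformer` module. The hyperparameters and dummy tensors below are illustrative assumptions, not values from any specific paper or from the models we will build later.

```python
import torch
import torch.nn as nn

# A minimal sketch of the original encoder-decoder Transformer.
# All hyperparameters here are illustrative, not taken from the GPT papers.
model = nn.Transformer(
    d_model=512,           # embedding dimension
    nhead=8,               # number of attention heads
    num_encoder_layers=6,  # encoder stack: understands the input sequence
    num_decoder_layers=6,  # decoder stack: generates the output sequence
    batch_first=True,
)

# Dummy, already-embedded sequences: batch of 1, source length 10, target length 7.
src = torch.randn(1, 10, 512)  # e.g., the sentence to be translated
tgt = torch.randn(1, 7, 512)   # e.g., the translation generated so far

out = model(src, tgt)
print(out.shape)  # torch.Size([1, 7, 512]) — one output vector per target position
```

Keep this two-part picture in mind: as we will see, GPT keeps only the decoder side of this architecture.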