nanoGPT
by
karpathy

Description: The simplest, fastest repository for training/finetuning medium-sized GPTs.

View karpathy/nanoGPT on GitHub ↗

Summary Information

Updated 2 hours ago
Added to GitGenius on October 25th, 2025
Created on December 28th, 2022
Open Issues/Pull Requests: 333 (+0)
Number of forks: 9,090
Total Stargazers: 53,715 (+5)
Total Subscribers: 481 (+0)
Detailed Description

The `karpathy/nanogpt` repository presents a remarkably concise and educational implementation of a Generative Pre-trained Transformer (GPT) model, designed to demystify the complex architecture of large language models. Authored by Andrej Karpathy, the project's core philosophy is to build a functional, albeit small-scale, GPT from scratch, using only fundamental Python and PyTorch concepts. It strips away the distributed training, massive datasets, and intricate engineering typically associated with LLMs, focusing instead on the bare essentials of the Transformer architecture and the autoregressive training process. This minimalist approach makes `nanogpt` an invaluable resource for anyone seeking to understand the foundational mechanics of models like OpenAI's GPT series without getting lost in overwhelming complexity.

At its heart, `nanogpt` implements the decoder-only Transformer architecture, which is characteristic of GPT models. This involves several key components: token embeddings, which convert input tokens into dense vector representations; positional embeddings, which encode the order of tokens in a sequence; and a stack of Transformer blocks. Each Transformer block consists of a multi-head self-attention mechanism, followed by a feed-forward neural network. The self-attention layer allows the model to weigh the importance of different tokens in the input sequence when processing each token, while multi-head attention enables the model to attend to different aspects of the input simultaneously. Layer normalization and residual connections are also crucial elements, ensuring stable training and effective gradient flow through the deep network.
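The pieces described above can be sketched in a few lines of PyTorch. Note that nanoGPT's actual `model.py` implements causal self-attention by hand; this sketch substitutes PyTorch's built-in `nn.MultiheadAttention` for brevity, and the dimensions (`n_embd=64`, `n_head=4`) are illustrative defaults, not the repository's configuration:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm Transformer block: self-attention then an MLP,
    each wrapped in a residual connection."""
    def __init__(self, n_embd=64, n_head=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(              # position-wise feed-forward network
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        # causal mask: True entries are *blocked*, so each position
        # may attend only to itself and earlier positions
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a                        # residual around attention
        x = x + self.mlp(self.ln2(x))    # residual around the MLP
        return x
```

A full GPT stacks several such blocks on top of the summed token and positional embeddings, followed by a final layer norm and a linear head projecting back to vocabulary logits.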

The training process in `nanogpt` is equally simplified for clarity. It typically involves preparing a text dataset, such as Shakespeare's works, and tokenizing it into numerical representations (often character-level or simple Byte Pair Encoding). The model is then trained to predict the next token in a sequence, given the preceding tokens. This autoregressive objective is optimized using the cross-entropy loss function, and gradients are computed and applied via an AdamW optimizer. The repository demonstrates how to set up a basic training loop, handle mini-batching, and manage the fixed context window (block size) that defines how many preceding tokens the model considers. While not designed for state-of-the-art performance, this setup effectively illustrates the core learning paradigm of modern LLMs.
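The training recipe described above can be condensed into a toy loop. This is a sketch, not nanoGPT's `train.py` (which adds checkpointing, learning-rate decay, gradient accumulation, and distributed training); the random integer tensor stands in for an encoded text corpus, and the embedding-plus-linear "model" is a placeholder for the real Transformer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 65, 32, 4
data = torch.randint(0, vocab_size, (1000,))  # stand-in for tokenized text

def get_batch():
    # sample random windows: x is the context, y is x shifted one token left,
    # so the model learns to predict the next token at every position
    ix = torch.randint(0, len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(20):
    x, y = get_batch()
    logits = model(x)  # (batch, block_size, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The essential shape of the loop matches the real thing: sample a mini-batch of fixed-length windows, compute next-token cross-entropy over every position, and step AdamW.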

Once trained, `nanogpt` can be used for text generation. The inference process involves feeding an initial prompt (a sequence of tokens) to the model. The model then predicts the probability distribution over the next possible tokens. A token is sampled from this distribution (often using techniques like temperature sampling or top-k sampling to control randomness), appended to the input sequence, and the process is repeated. This iterative, token-by-token generation allows the model to produce coherent and contextually relevant text, showcasing its learned language patterns. The simplicity of the generation script further highlights how the complex task of text creation emerges from a relatively straightforward predictive mechanism.
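That sampling loop can be written compactly. The following is a minimal sketch in the spirit of nanoGPT's `generate()` method, assuming a `model` that maps a `(batch, time)` tensor of token ids to `(batch, time, vocab)` logits; parameter names are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size, temperature=1.0, top_k=None):
    """Autoregressively extend the token sequence idx, one token at a time."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]        # crop to the context window
        logits = model(idx_cond)[:, -1, :]     # logits for the last position only
        logits = logits / temperature          # <1 sharpens, >1 flattens
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('inf')  # zero out non-top-k
        probs = F.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_tok], dim=1)  # append and repeat
    return idx
```

Each iteration runs a full forward pass on the (cropped) context, samples one token from the resulting distribution, and appends it, which is why naive generation cost grows with sequence length.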

In summary, `karpathy/nanogpt` serves as an exceptional educational tool, distilling the essence of GPT into a manageable and understandable codebase. It empowers learners to grasp the fundamental principles of self-attention, Transformer blocks, autoregressive training, and text generation without requiring extensive prior knowledge of advanced machine learning frameworks or distributed systems. By demonstrating that a powerful concept like GPT can be implemented in "literally just a few hundred lines of code," it demystifies a field often perceived as impenetrable, making advanced AI concepts accessible and encouraging hands-on experimentation and deeper understanding. It's a testament to the power of simplicity in education.
