When we talk about large language models “learning,” we risk creating a misleading impression. The word “learning” suggests something similar to human learning, complete with understanding, reasoning, and insight. However, that’s not what happens inside these systems.

LLMs don’t learn the way you learned to code or solve problems. Instead, they repeat mathematical procedures billions of times, adjusting countless internal parameters until they become very good at mimicking patterns in text.

This distinction matters more than you might think, because it changes the way LLMs generate their answers. Understanding how LLMs actually work helps you know when to trust their outputs and when to be skeptical. It reveals why they can write convincing essays about topics they don’t fully understand, and why they sometimes fail in surprising ways.

In this article, we’ll explore three core concepts at the heart of how LLMs work: loss functions (how we measure failure), gradient descent (how we make improvements), and next-token prediction (what LLMs actually do).

The Foundation: Loss Functions

Before an LLM can learn anything, we need a way to measure how badly it’s performing. This measurement is called a loss function. Think of it as a scoring system that produces a single number representing how wrong the model is: the higher the number, the worse the performance. The goal of training is to make this number as small as possible.
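To make this concrete, here is a minimal sketch of scoring one next-token prediction. The token names and probabilities are invented for illustration; the scoring rule shown (negative log-probability) is the per-token form of the cross-entropy loss discussed later in the article.

```python
import math

# Toy "model output": a probability assigned to each candidate next token.
predicted = {"mat": 0.7, "dog": 0.2, "car": 0.1}
correct_token = "mat"

# The loss is a single number: small when the model puts high
# probability on the correct token, large when it doesn't.
loss = -math.log(predicted[correct_token])
print(round(loss, 3))  # 0.357
```

Had the model assigned only 0.1 to “mat”, the same formula would yield a loss of about 2.303, so training pushes the model toward assigning higher probability to correct tokens.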
However, you can’t just pick any measurement and expect it to work. A good loss function must satisfy three critical requirements:
Why does smoothness matter? The training algorithm needs to figure out which direction to adjust the model’s parameters. If the loss function jumps around wildly, the algorithm can’t determine whether it’s moving in the right direction. Interestingly, accuracy (counting correct predictions) isn’t smooth, because there are no partial predictions: you either got 47 or 48 predictions right, never 47.3. This is why LLMs actually optimize cross-entropy loss instead, which is smooth and behaves better mathematically, even though accuracy is what we ultimately care about.

The crucial point to understand here is that LLMs are scored on matching patterns in their training data, not on being truthful or correct. If false information appears frequently in the training data, the model is rewarded for reproducing it. This fundamental design choice explains why LLMs can confidently state things that are completely wrong.
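A tiny numerical sketch (the probabilities are invented) shows the contrast: accuracy behaves like a step function, while cross-entropy changes with every small shift in the model’s confidence, giving the training algorithm a direction to move in.

```python
import math

def accuracy_loss(prob_correct: float) -> float:
    """Accuracy-style score: right or wrong, nothing in between.
    Small improvements in the model leave this value unchanged,
    so it offers no direction for adjusting parameters."""
    return 0.0 if prob_correct > 0.5 else 1.0

def cross_entropy(prob_correct: float) -> float:
    """Smooth: every small increase in the probability assigned
    to the correct token lowers the loss a little."""
    return -math.log(prob_correct)

# Accuracy stays flat at 1.0, then jumps to 0.0 once the probability
# crosses 0.5; cross-entropy falls steadily the whole way.
for p in (0.40, 0.45, 0.55, 0.60):
    print(p, accuracy_loss(p), round(cross_entropy(p), 3))
```

Moving the probability of the correct token from 0.40 to 0.45 changes nothing under accuracy, but cross-entropy rewards it immediately, which is exactly what gradient-based training needs.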