Ground Truth.
AI, checked against the source.

What Is Gradient Descent?

Gradient descent is the optimization method that trains almost every modern neural network. It works by repeatedly nudging the model's parameters a tiny step in the direction that most reduces its error, then measuring the error again, and repeating until the error stops falling. Every large language model you have used, from the one writing this to the ones on your phone, was trained by some version of this single, stubborn loop.

Here is the picture to hold in your head. Imagine the model's error as a vast, hilly landscape. Every possible setting of the model's parameters is a location on that landscape, and the height at each point is how wrong the model is with those settings. Training a model means finding a low valley, a setting where the error is small. You are blindfolded and dropped somewhere on this terrain. What do you do? You feel the ground under your feet to sense which way is downhill, and you take a step that way. Then you feel again, and step again. That is gradient descent.

The "gradient" is the mathematical name for the slope of the error with respect to every parameter at once. Strictly speaking, it points uphill, in the direction where the error rises fastest, so the downhill direction is its exact opposite: the negative gradient. For a network with billions of parameters, the landscape has billions of dimensions, which is impossible to picture but works exactly the same way: the negative gradient points in the single direction, across all those dimensions together, that reduces the error fastest for a small enough step. Computing the gradient efficiently across every layer is the job of backpropagation; gradient descent is what then takes the step.
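
To make the loop concrete, here is a minimal sketch in Python. The two-parameter error surface, the starting point, and the learning rate are all made up for illustration; a real network has far more parameters and gets its gradient from backpropagation rather than a hand-written formula.

```python
# Minimal gradient descent on a hypothetical two-parameter error surface.
# The surface, starting point, and learning rate are illustrative assumptions.

def error(x, y):
    """A bowl-shaped error surface whose lowest point is at (3, -1)."""
    return (x - 3) ** 2 + 2 * (y + 1) ** 2

def gradient(x, y):
    """Partial derivatives of error with respect to x and y (the uphill direction)."""
    return 2 * (x - 3), 4 * (y + 1)

def gradient_descent(x, y, learning_rate=0.1, steps=200):
    for _ in range(steps):
        gx, gy = gradient(x, y)
        # Step against the gradient, because the gradient itself points uphill.
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y

x_final, y_final = gradient_descent(x=0.0, y=0.0)
print(x_final, y_final, error(x_final, y_final))
```

Starting from (0, 0), the walker ends up very close to the bottom of the bowl at (3, -1), where the error is essentially zero.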

The size of that step is the learning rate, and it is the setting that most often makes or breaks training. Steps too large and you leap over the valley and bounce around, or diverge entirely; steps too small and training crawls, taking far longer and sometimes getting stuck. Much of the craft of training a model is scheduling the learning rate: starting larger to cover ground quickly, then shrinking it to settle precisely into a low point.
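
A small experiment shows the effect. This sketch runs the same descent on a toy one-parameter error, w squared, with three different learning rates; the function and the specific rates are assumptions chosen only to make the three behaviors visible.

```python
# Effect of the learning rate on a hypothetical one-parameter error: error(w) = w**2.
# Its gradient is 2*w and its minimum is at w = 0.

def run(learning_rate, steps=50, w=1.0):
    for _ in range(steps):
        w -= learning_rate * 2 * w  # step against the gradient
    return w

too_small = run(0.001)   # crawls: still close to the starting point after 50 steps
reasonable = run(0.1)    # settles near the minimum at 0
too_large = run(1.1)     # overshoots further on every step and diverges

print(too_small, reasonable, too_large)
```

With the smallest rate the parameter has barely moved, with the middle rate it is close to zero, and with the largest it has blown up to a value in the thousands.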

In practice, no one computes the gradient over the entire training set at every step; that would be ruinously slow when the data is measured in trillions of tokens. Instead we use stochastic gradient descent (SGD): each step estimates the downhill direction from a small random batch of examples. The estimate is noisy, but the noise is cheap and, surprisingly, often helpful: it lets the walker jiggle out of shallow dips that a perfectly smooth descent might get trapped in. As Sebastian Ruder put it in his widely cited overview of gradient descent methods, these algorithms are "often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by."
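
Here is a rough sketch of the batching idea, fitting a straight line to made-up data. The dataset, the model (w times x plus b), the batch size, and the learning rate are all assumptions for illustration; the point is only that each step looks at 16 random examples instead of all 1,000.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical toy data: y = 2x + 1 plus a little noise.
data = []
for _ in range(1000):
    x = random.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + 1.0 + random.gauss(0.0, 0.1)))

def batch_gradient(w, b, batch):
    """Gradient of mean squared error over one batch, for the model w*x + b."""
    grad_w = grad_b = 0.0
    for x, y in batch:
        residual = (w * x + b) - y
        grad_w += 2 * residual * x
        grad_b += 2 * residual
    return grad_w / len(batch), grad_b / len(batch)

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 16

for step in range(2000):
    batch = random.sample(data, batch_size)  # small random batch, not the full set
    grad_w, grad_b = batch_gradient(w, b, batch)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)
```

The fitted values land near the true slope of 2 and intercept of 1, even though no single step ever saw the whole dataset. They keep jittering slightly from step to step, which is the batch noise at work.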

Over the years, researchers layered improvements onto plain SGD. Momentum lets the walker build up speed in a consistent downhill direction, like a ball rolling rather than a hiker stepping, which smooths out the noise and speeds convergence. Adaptive methods give each parameter its own effective learning rate, taking bigger steps for parameters whose gradients have been small or infrequent and smaller steps for parameters whose gradients have been large. The Adam optimizer combines momentum with per-parameter adaptation, and it has become the default choice for training large models precisely because it works reasonably well without much hand-tuning.
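
The two update rules can be sketched in a few lines each. This is a simplified illustration, not a production optimizer: the error surface is a made-up bowl that is 25 times steeper in one direction than the other, and the learning rates are chosen for this toy problem. The beta1, beta2, and eps values are the defaults suggested in the Adam paper.

```python
import math

def gradient(params):
    """Gradient of the hypothetical error 0.5*x**2 + 12.5*y**2 (shallow in x, steep in y)."""
    x, y = params
    return [x, 25.0 * y]

def sgd_momentum(params, steps=300, learning_rate=0.02, momentum=0.9):
    velocity = [0.0] * len(params)
    for _ in range(steps):
        grads = gradient(params)
        for i, g in enumerate(grads):
            # Velocity is a decaying running sum of past gradients.
            velocity[i] = momentum * velocity[i] + g
            params[i] -= learning_rate * velocity[i]
    return params

def adam(params, steps=500, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = [0.0] * len(params)  # running average of gradients (the momentum part)
    v = [0.0] * len(params)  # running average of squared gradients (the per-parameter scale)
    for t in range(1, steps + 1):
        grads = gradient(params)
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction for starting at zero
            v_hat = v[i] / (1 - beta2 ** t)
            # Dividing by sqrt(v_hat) gives each parameter its own effective step size.
            params[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return params

momentum_result = sgd_momentum([5.0, 5.0])
adam_result = adam([5.0, 5.0])
print(momentum_result, adam_result)
```

Both runs end up close to the minimum at (0, 0). Notice that Adam's division by the running gradient scale cancels out the 25-fold difference in steepness, so in this toy case both coordinates move at the same pace.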

Why does this matter? Because gradient descent is the reason neural networks can learn at all. A model starts as billions of random numbers that produce gibberish. There is no way to hand-set those numbers; the only path to a useful model is to let the error tell you, over and over, which direction to move. Every capability a model has, its grammar, its facts, its reasoning, was carved into those parameters by this loop nudging them downhill on a mountain of examples. Understanding it also demystifies a lot of training talk: scaling laws are statements about how far down the valley you can get with more data and compute, fine-tuning is just a few more gradient-descent steps on new data, and a "failed training run" often means the learning rate was set wrong and the walker fell off a cliff.

The honest caveat: gradient descent finds a low point, not necessarily the lowest. On these enormous landscapes there is no guarantee of reaching the global minimum, and in theory the walker could get stuck. In practice, though, the valleys of large neural networks turn out to be forgiving enough that good-enough minima are everywhere, and this simple downhill walk, repeated at massive scale, is enough to produce the models reshaping the field.
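
A one-dimensional toy makes the caveat visible. The error curve below is a made-up example with two valleys, a deeper one on the left and a shallower one on the right; which one the walker reaches depends entirely on where it starts.

```python
# Gradient descent on a hypothetical error curve with two valleys.

def error(x):
    return x ** 4 - 3 * x ** 2 + x

def slope(x):
    """Derivative of error with respect to x."""
    return 4 * x ** 3 - 6 * x + 1

def descend(x, learning_rate=0.01, steps=500):
    for _ in range(steps):
        x -= learning_rate * slope(x)
    return x

from_left = descend(-2.0)   # rolls into the deeper valley near x = -1.30
from_right = descend(2.0)   # rolls into the shallower valley near x = 1.13

print(from_left, error(from_left))
print(from_right, error(from_right))
```

Both runs stop at a point where the slope is zero, but the run that started on the right settles for a noticeably higher error than the one that started on the left, and nothing in the algorithm tells it a better valley exists.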

Key papers
Ruder, An overview of gradient descent optimization algorithms (2016)
Kingma & Ba, Adam: A Method for Stochastic Optimization (2014)

Key questions

What does gradient descent actually do?

It adjusts a model's numerical parameters step by step, each time moving them a little in the direction that most reduces the model's error on the training data. Repeat that a few million times and the model learns.

Why is it called 'descent'?

Because you can picture the model's error as a landscape of hills and valleys, and the algorithm always walks downhill toward lower error. The 'gradient' is the mathematical slope telling it which way is down.

What is the difference between gradient descent and backpropagation?

Backpropagation is how the model computes the slope (the gradient) for every parameter; gradient descent is what uses that slope to actually update the parameters. Backprop measures, gradient descent moves.

Cite this

APA

Ground Truth. (2026, July 2). What Is Gradient Descent? Ground Truth. https://groundtruth.day/learn/gradient-descent.html

BibTeX

@misc{groundtruth:gradient-descent,
  title  = {What Is Gradient Descent?},
  author = {{Ground Truth}},
  year   = {2026},
  month  = {jul},
  url    = {https://groundtruth.day/learn/gradient-descent.html}
}

Topics: fundamentals · training · optimization · gradient-descent