Gradients & Differentiation¶
Gradients are how neural networks learn: they tell the optimizer in which direction, and by how much, to adjust each parameter. Nearly every training algorithm in modern ML depends on them.
The Big Picture¶
Training a neural network comes down to this loop:
1. Make a prediction
2. Measure the error (loss)
3. Compute gradients: how much does each parameter affect the error?
4. Adjust parameters in the direction that reduces error
5. Repeat
Step 3 is what this page is about.
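The loop above can be sketched in plain Python, without Neurogebra, for a one-parameter model y = w·x. The toy data, the `train` function name, and the learning rate are illustrative assumptions, and the gradient is written out by hand rather than computed symbolically.

```python
def train(xs, ys, w=0.0, lr=0.05, steps=100):
    """Fit y = w * x by gradient descent on mean squared error (illustrative sketch)."""
    n = len(xs)
    for _ in range(steps):
        # 1. Make a prediction for every input
        preds = [w * x for x in xs]
        # 2. Measure the error (mean squared error); computed here only to show the step
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        # 3. Compute the gradient d(loss)/dw by hand
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        # 4. Adjust the parameter in the direction that reduces the error
        w -= lr * grad
        # 5. Repeat (next loop iteration)
    return w

# Toy data generated from y = 2x, so the fitted w should approach 2
w_fit = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(f"fitted w = {w_fit:.4f}")
```

Everything except step 3 is bookkeeping; the rest of this page is about obtaining that gradient.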
What is a Gradient?¶
A gradient (derivative) answers the question: if I change this input a tiny bit, how much does the output change?
For example, for \(f(x) = x^2\) the derivative is \(f'(x) = 2x\). At \(x = 3\): \(f'(3) = 6\), meaning the output changes about 6 times as fast as the input at that point.
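A quick numeric check makes the definition concrete. The sketch below assumes the example function is f(x) = x² (which is consistent with f'(3) = 6) and nudges the input by a small, arbitrarily chosen step `h`.

```python
def f(x):
    return x ** 2  # assumed example function; its derivative is 2*x

x = 3.0
h = 1e-6  # a "tiny bit" of change in the input

# Ratio of output change to input change
slope = (f(x + h) - f(x)) / h
print(f"change in output / change in input = {slope:.4f}")
```

The ratio comes out very close to 6, and it gets closer as `h` shrinks (until floating-point rounding starts to dominate).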
Symbolic Gradients in Neurogebra¶
Neurogebra computes gradients symbolically — you get the actual derivative formula, not just a number:
from neurogebra import Expression
# Define a function
f = Expression("quadratic", "x**2 + 3*x + 2")
# Compute its derivative
f_prime = f.gradient("x")
# See the formula
print(f"f(x) = {f.symbolic_expr}") # x**2 + 3*x + 2
print(f"f'(x) = {f_prime.symbolic_expr}") # 2*x + 3
# Evaluate at specific points
print(f"f'(0) = {f_prime.eval(x=0)}") # 3
print(f"f'(1) = {f_prime.eval(x=1)}") # 5
print(f"f'(-1) = {f_prime.eval(x=-1)}") # 1
Gradients of Activation Functions¶
from neurogebra import MathForge
forge = MathForge()
# ReLU gradient
relu = forge.get("relu")
relu_grad = relu.gradient("x")
print(f"ReLU'(x) = {relu_grad.symbolic_expr}")
# Derivative is 1 for x > 0, 0 for x < 0
# Sigmoid gradient
sigmoid = forge.get("sigmoid")
sig_grad = sigmoid.gradient("x")
print(f"Sigmoid'(x) = {sig_grad.symbolic_expr}")
# σ'(x) = σ(x) · (1 - σ(x))
# Tanh gradient
tanh = forge.get("tanh")
tanh_grad = tanh.gradient("x")
print(f"Tanh'(x) = {tanh_grad.symbolic_expr}")
# tanh'(x) = 1 - tanh²(x)
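The closed forms in the comments above can be checked independently of Neurogebra. This sketch uses only Python's `math` module; the helper names are hypothetical, and the value assigned to ReLU's derivative at exactly x = 0 (here 0) is a convention chosen for the example, not something the library is known to do.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)             # σ'(x) = σ(x) · (1 - σ(x))

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2   # tanh'(x) = 1 - tanh²(x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0     # 1 for x > 0, 0 for x < 0 (0 at x = 0 by convention)

for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  sigmoid'={sigmoid_grad(x):.4f}  "
          f"tanh'={tanh_grad(x):.4f}  relu'={relu_grad(x):.1f}")
```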
Gradients of Loss Functions¶
The gradient of the loss with respect to the prediction is what training uses directly:
mse = forge.get("mse")
print(f"MSE = {mse.symbolic_expr}")
# (y_pred - y_true)**2
mse_grad = mse.gradient("y_pred")
print(f"dMSE/d(y_pred) = {mse_grad.symbolic_expr}")
# 2*(y_pred - y_true)
# Interpretation:
# If prediction > target → gradient is positive → decrease prediction
# If prediction < target → gradient is negative → increase prediction
print(f" pred=7, true=5: grad = {mse_grad.eval(y_pred=7, y_true=5)}") # 4
print(f" pred=3, true=5: grad = {mse_grad.eval(y_pred=3, y_true=5)}") # -4
print(f" pred=5, true=5: grad = {mse_grad.eval(y_pred=5, y_true=5)}") # 0 (perfect!)
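In training, the quantity being adjusted is usually a parameter rather than the prediction itself, so the loss gradient is combined with the chain rule. As a hedged sketch, assume a hypothetical one-weight model `y_pred = w * x`; then dMSE/dw = 2·(y_pred − y_true)·x. The model and the numbers are illustrative and not part of Neurogebra.

```python
def mse_grad_wrt_pred(y_pred, y_true):
    return 2 * (y_pred - y_true)       # same formula as the symbolic result above

def mse_grad_wrt_weight(w, x, y_true):
    y_pred = w * x                     # hypothetical model: y_pred = w * x
    dloss_dpred = mse_grad_wrt_pred(y_pred, y_true)
    dpred_dw = x                       # derivative of w * x with respect to w
    return dloss_dpred * dpred_dw      # chain rule

w, x, y_true = 3.5, 2.0, 5.0           # prediction is 7, target is 5
grad_w = mse_grad_wrt_weight(w, x, y_true)
print(grad_w)                          # 2 * (7 - 5) * 2 = 8.0

w_new = w - 0.1 * grad_w               # one update step with learning rate 0.1
print(w_new * x)                       # new prediction, closer to the target than 7
```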
Higher-Order Derivatives¶
You can differentiate multiple times:
f = Expression("cubic", "x**3")
f1 = f.gradient("x") # First derivative
f2 = f1.gradient("x") # Second derivative
f3 = f2.gradient("x") # Third derivative
print(f"f(x) = {f.symbolic_expr}") # x³
print(f"f'(x) = {f1.symbolic_expr}") # 3x²
print(f"f''(x) = {f2.symbolic_expr}") # 6x
print(f"f'''(x) = {f3.symbolic_expr}") # 6
Partial Derivatives¶
For functions with multiple variables:
# f(x, y) = x²y + xy²
f = Expression("multi", "x**2*y + x*y**2")
# Partial derivative with respect to x
df_dx = f.gradient("x")
print(f"∂f/∂x = {df_dx.symbolic_expr}") # 2*x*y + y²
# Partial derivative with respect to y
df_dy = f.gradient("y")
print(f"∂f/∂y = {df_dy.symbolic_expr}") # x² + 2*x*y
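Collecting the partial derivatives into a vector gives the gradient of a multivariable function. The sketch below hard-codes the two formulas from the comments above in plain Python (no Neurogebra calls) and checks that a small step against the gradient lowers f; the evaluation point and step size are arbitrary choices.

```python
def f(x, y):
    return x**2 * y + x * y**2

def grad_f(x, y):
    df_dx = 2 * x * y + y**2   # ∂f/∂x
    df_dy = x**2 + 2 * x * y   # ∂f/∂y
    return df_dx, df_dy

x, y = 1.0, 2.0
gx, gy = grad_f(x, y)
print(f"gradient at (1, 2) = ({gx}, {gy})")   # (8.0, 5.0)

# A small step against the gradient should reduce f
step = 0.01
x_new, y_new = x - step * gx, y - step * gy
print(f"f before = {f(x, y):.4f}, f after = {f(x_new, y_new):.4f}")
```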
Gradient Descent — Using Gradients to Learn¶
Here's a manual gradient descent implementation to see how gradients drive learning:
import numpy as np
from neurogebra import Expression
# Goal: Find x that minimizes f(x) = (x - 3)²
# Answer should be x = 3
f = Expression("parabola", "(x - 3)**2")
f_grad = f.gradient("x") # f'(x) = 2*(x-3)
# Gradient descent
x = 10.0 # Start far from minimum
lr = 0.1 # Learning rate
print(f"{'Step':>4} | {'x':>8} | {'f(x)':>8} | {'gradient':>8}")
print("-" * 45)
for step in range(15):
    fx = f.eval(x=x)
    gx = f_grad.eval(x=x)
    print(f"{step:>4} | {x:>8.4f} | {fx:>8.4f} | {gx:>8.4f}")
    # Update: move opposite to gradient
    x = x - lr * gx
print(f"\nFinal x = {x:.4f}") # Should be close to 3.0
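For this particular parabola, each update multiplies the distance to the minimum by (1 − 2·lr), so with lr = 0.1 the error shrinks by a factor of 0.8 per step. The plain-Python sketch below (a hypothetical `descend` helper, with the gradient written by hand) compares a few learning rates to show how that factor governs convergence.

```python
def descend(lr, x=10.0, steps=15):
    """Run gradient descent on f(x) = (x - 3)**2 and return the final x."""
    for _ in range(steps):
        grad = 2 * (x - 3)   # f'(x), written out by hand
        x = x - lr * grad
    return x

for lr in (0.01, 0.1, 0.5, 1.1):
    print(f"lr={lr:<4} -> final x = {descend(lr):.4f}")

# lr=0.01 is still far from 3 after 15 steps,
# lr=0.1 ends a little above 3,
# lr=0.5 lands on the minimum in one step,
# lr=1.1 overshoots further each step and diverges.
```

A learning rate that is too small wastes steps, while one above 1.0 makes the factor's magnitude exceed 1 for this function, so the iterates move away from the minimum.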
The Vanishing Gradient Problem¶
Some activations have gradients that become very small, making learning slow or impossible:
from neurogebra import MathForge
forge = MathForge()
# Sigmoid gradient becomes tiny for large |x|
sig_grad = forge.get("sigmoid").gradient("x")
for x_val in [-10, -5, 0, 5, 10]:
    g = sig_grad.eval(x=x_val)
    print(f" sigmoid'({x_val:>3}) = {g:.6f}")
# Output:
# sigmoid'(-10) = 0.000045 ← Almost zero! Learning stops.
# sigmoid'( -5) = 0.006648
# sigmoid'(  0) = 0.250000 ← OK
# sigmoid'(  5) = 0.006648
# sigmoid'( 10) = 0.000045 ← Almost zero!
print("\nReLU doesn't have this problem for positive inputs:")
relu_grad = forge.get("relu").gradient("x")
for x_val in [-10, -5, 0, 5, 10]:
    g = relu_grad.eval(x=x_val)
    print(f" relu'({x_val:>3}) = {g}")
Why this matters
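The vanishing gradient problem is a major reason ReLU replaced Sigmoid as the default activation for hidden layers. The sigmoid derivative is at most 0.25, and backpropagation multiplies one such factor per layer, so in deep networks the gradient reaching the early layers can shrink toward zero and those layers barely learn. ReLU's derivative is exactly 1 for positive inputs, so it does not shrink the gradient there.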
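How quickly this compounds with depth can be estimated in a few lines. The following back-of-the-envelope sketch ignores the weights entirely (a simplifying assumption) and just multiplies the best-case sigmoid derivative across a hypothetical number of layers.

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

best_case = sigmoid_grad(0.0)   # 0.25, the maximum of the sigmoid derivative

for depth in (1, 5, 10, 20):
    # Upper bound on the activation-derivative factor after `depth` sigmoid layers
    factor = best_case ** depth
    print(f"{depth:>2} layers: factor <= {factor:.2e}")
```

Even in the best case the factor drops below one in a million by about ten layers; with inputs away from zero it shrinks faster still.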
Numerical Gradients (Finite Differences)¶
Sometimes you want to verify symbolic gradients using numerical approximation:
def numerical_gradient(expr, var_name, point, epsilon=1e-5):
    """Compute gradient numerically using central finite differences."""
    kwargs_plus = {var_name: point + epsilon}
    kwargs_minus = {var_name: point - epsilon}
    return (expr.eval(**kwargs_plus) - expr.eval(**kwargs_minus)) / (2 * epsilon)
f = Expression("test", "x**3 + 2*x + 1")
f_grad = f.gradient("x")
x_point = 2.0
symbolic = f_grad.eval(x=x_point)
numerical = numerical_gradient(f, "x", x_point)
print(f"Symbolic gradient: {symbolic:.6f}") # 14.000000
print(f"Numerical gradient: {numerical:.6f}") # 14.000000
print(f"Difference: {abs(symbolic - numerical):.10f}") # ~0
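The helper above uses a central difference. For comparison, this plain-Python sketch (using an ordinary function instead of an `Expression`, with hypothetical helper names) contrasts it with the simpler forward difference on the same polynomial; the central form is typically more accurate for the same `eps`.

```python
def f(x):
    return x**3 + 2*x + 1

def forward_diff(func, x, eps=1e-5):
    # One-sided estimate: error shrinks proportionally to eps
    return (func(x + eps) - func(x)) / eps

def central_diff(func, x, eps=1e-5):
    # Two-sided estimate: error shrinks proportionally to eps**2
    return (func(x + eps) - func(x - eps)) / (2 * eps)

exact = 3 * 2.0**2 + 2   # f'(2) = 14
err_forward = abs(forward_diff(f, 2.0) - exact)
err_central = abs(central_diff(f, 2.0) - exact)
print(f"forward error: {err_forward:.2e}")
print(f"central error: {err_central:.2e}")
```

Making `eps` much smaller does not help indefinitely, because floating-point rounding in the subtraction eventually outweighs the truncation error.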
Summary¶
| Concept | What It Is | Why It Matters |
|---|---|---|
| Gradient | Rate of change | Tells model how to adjust |
| Positive gradient | Loss increases as the parameter increases | Decrease the parameter |
| Negative gradient | Loss decreases as the parameter increases | Increase the parameter |
| Zero gradient | At a minimum, maximum, or saddle point | Training may have converged |
| Vanishing gradient | Gradient too small | Model stops learning |
| Exploding gradient | Gradient too large | Training becomes unstable |
Next: Expression Composition →