Optimization

Optimizers control how your model updates its weights during training. The choice of optimizer can be the difference between a model that converges in a handful of epochs and one that never learns.

What is Optimization?

Training = finding the weights that minimize the loss function.

Current weights →  Compute loss  →  Compute gradients  →  Update weights
      ↑                                                          |
      └──────────────────────────────────────────────────────────┘

The OPTIMIZER decides how to use the gradients to update weights.
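
The loop in the diagram can be sketched in plain NumPy. This is a minimal illustration and not Neurogebra's internals: the names `mse_and_grads`, `plain_step`, and `train_loop` are hypothetical, and the model is assumed to be the line `w*x + b` trained with mean squared error.

```python
import numpy as np

def mse_and_grads(w, b, X, y):
    """Return the MSE loss of w*x + b and its gradients w.r.t. w and b."""
    err = (w * X + b) - y
    loss = np.mean(err ** 2)
    grad_w = 2.0 * np.mean(err * X)
    grad_b = 2.0 * np.mean(err)
    return loss, grad_w, grad_b

def plain_step(param, grad, lr):
    """Simplest possible update rule: step against the gradient."""
    return param - lr * grad

def train_loop(X, y, update_fn, lr=0.01, epochs=200):
    w, b = 0.0, 0.0                                # current weights
    losses = []
    for _ in range(epochs):
        loss, gw, gb = mse_and_grads(w, b, X, y)   # compute loss and gradients
        w = update_fn(w, gw, lr)                   # the optimizer updates the weights
        b = update_fn(b, gb, lr)
        losses.append(loss)
    return w, b, losses

X = np.linspace(0, 10, 50)
y = 3 * X + 2            # noise-free line, so the result is easy to check
w, b, losses = train_loop(X, y, plain_step, lr=0.01, epochs=2000)
print(f"w = {w:.3f}, b = {b:.3f}, final loss = {losses[-1]:.6f}")
```

Swapping `plain_step` for a different update function changes the optimizer without touching the rest of the loop, which is the role the optimizer plays in the diagram.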

SGD (Stochastic Gradient Descent)

The simplest optimizer: move in the direction opposite to the gradient.

\[w_{new} = w_{old} - \eta \cdot \nabla L\]

Where \(\eta\) is the learning rate.
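
As a worked example (the loss and the numbers are chosen purely for illustration), take \(L(w) = (w - 3)^2\), so that \(\nabla L = 2(w - 3)\). Starting from \(w_{old} = 0\) with \(\eta = 0.1\):

\[\nabla L = 2(0 - 3) = -6, \qquad w_{new} = 0 - 0.1 \cdot (-6) = 0.6\]

The gradient is negative, so the update increases \(w\) and moves it toward the minimum at \(w = 3\).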

from neurogebra import MathForge, Expression
from neurogebra.core.trainer import Trainer
import numpy as np

forge = MathForge()

model = Expression("linear", "w*x + b",
    params={"w": 0.0, "b": 0.0}, trainable_params=["w", "b"])
loss = forge.get("mse")

# SGD optimizer
trainer = Trainer(model, loss, optimizer="sgd", lr=0.01)

X = np.linspace(0, 10, 50)
y = 3 * X + 2 + np.random.normal(0, 0.5, 50)
history = trainer.fit(X, y, epochs=100)

Pros and Cons

✅ Pros	❌ Cons
Simple to understand	Can be slow to converge
Low memory usage	Sensitive to learning rate
Good for convex problems	Can get stuck in poor local minima or saddle points on non-convex losses

Adam (Adaptive Moment Estimation)

One of the most widely used optimizers. It adapts the step size for each parameter individually.

Adam combines:

Momentum: Remembers past gradients (like a ball rolling downhill)
Adaptive learning rate: Different learning rates for different parameters
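
These two ingredients can be sketched in NumPy. The block follows the standard Adam update rule with its commonly used defaults (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`). Whether Neurogebra's `optimizer="adam"` uses the same defaults is an assumption, and the `adam_step` helper and the toy loss are hypothetical.

```python
import numpy as np

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new w (state is updated in place)."""
    state["t"] += 1
    # Momentum: exponential moving average of past gradients
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    # Adaptive scale: exponential moving average of squared gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias correction, because m and v both start at zero
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # Dividing by sqrt(v_hat) gives each parameter its own effective step size
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Toy loss: (w0 - 2)^2 + 100 * (w1 + 1)^2, with 100x more curvature along w1
w = np.array([0.0, 0.0])
state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}
first_step = None
for _ in range(2000):
    grad = np.array([2 * (w[0] - 2), 200 * (w[1] + 1)])
    w_next = adam_step(w, grad, state, lr=0.01)
    if first_step is None:
        first_step = np.abs(w_next - w)   # size of the very first update
    w = w_next

print("first step sizes:", first_step)
print("final weights:", w)
```

On the first update both parameters move by roughly `lr`, even though their gradients differ by a factor of 50 in magnitude. That is the per-parameter scaling described above.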

# Adam optimizer (recommended for most cases)
trainer = Trainer(model, loss, optimizer="adam", lr=0.001)
history = trainer.fit(X, y, epochs=100)

Pros and Cons

✅ Pros	❌ Cons
Fast convergence	Slightly more memory
Works well out-of-the-box	May not generalize as well as SGD
Handles sparse gradients	Can overshoot the minimum
Good default for most problems

SGD vs Adam: When to Use What

Scenario	Recommended	Why
First attempt	Adam	Works well with defaults
Simple regression	SGD	Sufficient for convex problems
Deep neural networks	Adam	Handles complex loss landscapes
Need best generalization	SGD (with tuning)	Often finds flatter minima
Quick prototyping	Adam	Less hyperparameter tuning

Learning Rate

Often the most important hyperparameter to tune. It controls the size of each update step:

\[w_{new} = w_{old} - \underbrace{\eta}_{\text{learning rate}} \cdot \nabla L\]

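from neurogebra import MathForge, Expression
from neurogebra.core.trainer import Trainer

forge = MathForge()
loss = forge.get("mse")

def make_model():
    # Fresh model per trainer, so every run starts from the same weights
    return Expression("linear", "w*x + b",
        params={"w": 0.0, "b": 0.0}, trainable_params=["w", "b"])

# Too small: slow convergence
trainer_slow = Trainer(make_model(), loss, optimizer="adam", lr=0.0001)

# Just right: converges nicely
trainer_good = Trainer(make_model(), loss, optimizer="adam", lr=0.001)

# Too large: overshoots, loss bounces
trainer_fast = Trainer(make_model(), loss, optimizer="adam", lr=0.1)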

Learning Rate Effects

Learning Rate Too Small:
Loss: ████████████████████████░░ (barely decreasing after 1000 epochs)

Learning Rate Just Right:
Loss: ████████████░░░░░░░░░░░░░░ (steadily decreasing, converges at ~200 epochs)

Learning Rate Too Large:
Loss: ████████████████████████████ (bouncing around, never converges)
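
The three regimes can be reproduced on a toy problem. This sketch uses plain gradient descent on \(L(w) = w^2\) rather than Neurogebra's `Trainer`. The `run_gd` helper and the specific learning rates are hypothetical and chosen only to make each regime visible.

```python
def run_gd(lr, steps=50, w0=1.0):
    """Run gradient descent on L(w) = w^2 and return the loss after each step."""
    w = w0
    losses = []
    for _ in range(steps):
        grad = 2 * w          # dL/dw
        w = w - lr * grad     # same update rule as the formula above
        losses.append(w ** 2)
    return losses

too_small = run_gd(lr=0.001)   # w is multiplied by 0.998 per step
just_right = run_gd(lr=0.1)    # w is multiplied by 0.8 per step
too_large = run_gd(lr=1.1)     # w is multiplied by -1.2 per step, so it diverges

for name, losses in [("too small", too_small),
                     ("just right", just_right),
                     ("too large", too_large)]:
    print(f"{name:>10}: final loss = {losses[-1]:.3e}")
```

For this particular loss, each step multiplies `w` by `1 - 2 * lr`, so any learning rate above 1.0 overshoots by more than it corrects and the loss grows instead of shrinking.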

Recommended Starting Points

Optimizer	Starting LR	Range to Try
SGD	0.01	0.001 - 0.1
Adam	0.001	0.0001 - 0.01

Manual Gradient Descent

To see what an optimizer does under the hood, you can implement gradient descent by hand:

from neurogebra.core.autograd import Value

# Simple optimization: find x that minimizes x^2 - 4x + 4  (answer: x=2)
x = Value(0.0)  # Start at x=0
lr = 0.1

for epoch in range(50):
    # Forward pass
    loss = x**2 - 4*x + 4

    # Log before updating, so x and loss belong to the same step
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: x = {x.data:.4f}, loss = {loss.data:.4f}")

    # Backward pass: computes d(loss)/dx into x.grad
    loss.backward()

    # Update: step against the gradient
    x.data -= lr * x.grad

    # Zero gradients so the next backward pass starts clean
    x.grad = 0.0

print(f"\nOptimal x = {x.data:.4f}")  # Should be ≈ 2.0
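
The convergence of this loop can also be worked out by hand. With \(L(x) = x^2 - 4x + 4 = (x - 2)^2\), the gradient is \(\nabla L = 2x - 4\), and with \(\eta = 0.1\) each update is:

\[x_{new} = x_{old} - 0.1 \cdot (2x_{old} - 4) = 0.8 \cdot x_{old} + 0.4\]

Subtracting 2 from both sides gives \(x_{new} - 2 = 0.8 \cdot (x_{old} - 2)\), so the distance to the minimum shrinks by 20% per step. Starting from \(x = 0\), after 50 updates \(x = 2 - 2 \cdot 0.8^{50} \approx 1.99997\).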

Batch Size and Optimization

How much data you use per update also matters:

# Full batch: use all data per update
trainer = Trainer(model, loss, optimizer="adam", lr=0.001)
history = trainer.fit(X, y, epochs=100, batch_size=len(X))

# Mini-batch: use 32 samples per update (most common)
history = trainer.fit(X, y, epochs=100, batch_size=32)

# Stochastic: use 1 sample per update
history = trainer.fit(X, y, epochs=100, batch_size=1)

Batch Size	Speed	Stability	Memory
Full batch	One slow update per epoch	Stable	High
Mini-batch (32)	Good balance	Moderate noise	Moderate
Single (1)	Fast updates, but many per epoch	Very noisy	Low

Recommendation: Start with batch_size=32.
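
What a batch size means in practice can be sketched as follows. The `iterate_minibatches` helper is hypothetical and only illustrates the idea. How `Trainer.fit` actually shuffles and splits the data is not covered here.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that together cover the data once."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
y = 3 * X + 2

for batch_size in (len(X), 32, 1):
    # One weight update would be performed per yielded batch
    n_updates = sum(1 for _ in iterate_minibatches(X, y, batch_size, rng))
    print(f"batch_size={batch_size:>2}: {n_updates} weight updates per epoch")
```

With 50 samples this gives 1, 2, and 50 updates per epoch respectively. For `batch_size=32`, the second batch holds the remaining 18 samples.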

Optimization Tips

Start with Adam, lr=0.001: a reasonable default for most problems
Watch the loss curve: it should decrease fairly smoothly
If the loss plateaus: try reducing the learning rate by 10x
If the loss bounces: reduce the learning rate
If the loss becomes NaN: the learning rate is likely far too high
Train longer if needed: some problems need more epochs
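
Several of these checks can be automated. The sketch below is a hypothetical helper and not part of Neurogebra: it assumes you have a plain list of per-epoch loss values, and it applies the tips above as simple rules. The `patience` and `min_improvement` thresholds, and dividing by 10 in every case, are arbitrary illustrative choices.

```python
import math

def suggest_lr(losses, lr, patience=10, min_improvement=1e-4):
    """Return (new_lr, reason) based on a list of per-epoch losses."""
    if any(math.isnan(v) or math.isinf(v) for v in losses):
        return lr / 10, "loss is NaN/inf: learning rate is likely far too high"
    if len(losses) <= patience:
        return lr, "not enough history yet"
    recent = losses[-patience:]
    before = losses[-patience - 1]
    # Bouncing: the loss went up in at least half of the recent epochs
    window = [before] + recent
    ups = sum(1 for a, b in zip(window, window[1:]) if b > a)
    if ups >= patience / 2:
        return lr / 10, "loss is bouncing: reduce the learning rate"
    # Plateau: almost no improvement over the last `patience` epochs
    if before - min(recent) < min_improvement:
        return lr / 10, "loss has plateaued: try a 10x smaller learning rate"
    return lr, "loss is still decreasing: keep going"

print(suggest_lr([1.0 / (i + 1) for i in range(30)], lr=0.001))
print(suggest_lr([0.5] * 30, lr=0.001))
print(suggest_lr([1.0, float("nan")], lr=0.1))
```

A helper like this is only a starting point. Looking at the plotted loss curve remains the more reliable check.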
