Math Behind ML¶
Don't panic — the math behind ML is simpler than you think. Neurogebra was built specifically to make this math visible and understandable.
The Three Pillars of ML Math¶
1. Linear Algebra → How data flows through the model
2. Calculus → How the model learns (gradients)
3. Probability → How we measure uncertainty
Pillar 1: Linear Algebra¶
Vectors¶
A vector is just a list of numbers. In ML, your data is vectors.
import numpy as np
# A data point with 3 features
# [height, weight, age]
person = np.array([170, 65, 25])
# Weights of a neuron
weights = np.array([0.5, -0.3, 0.1])
The Dot Product¶
The most important operation in neural networks:
# This is what EVERY neuron computes
output = np.dot(weights, person)
# = 0.5*170 + (-0.3)*65 + 0.1*25
# = 85 - 19.5 + 2.5
# = 68.0
Matrices¶
A matrix is a 2D array. A neural network layer is a matrix multiplication:
# Layer with 3 inputs, 2 outputs
W = np.array([[0.5, -0.3, 0.1],
[0.2, 0.4, -0.1]]) # Shape: (2, 3)
b = np.array([0.1, -0.2]) # Shape: (2,)
x = np.array([170, 65, 25]) # Shape: (3,)
output = W @ x + b # @ is matrix multiplication
print(output) # Shape: (2,) — two output neurons
Neurogebra has linear algebra expressions:
from neurogebra import MathForge
forge = MathForge()
expressions = forge.list_all(category="linalg")
print(expressions)
Pillar 2: Calculus (Derivatives & Gradients)¶
What is a Derivative?¶
A derivative tells you how much the output changes when you change the input a tiny bit.
At \(x = 3\): the derivative is \(2 \times 3 = 6\). This means: if you increase \(x\) by a tiny amount, \(f(x)\) increases by about 6 times that amount.
from neurogebra import Expression
f = Expression("quadratic", "x**2")
f_prime = f.gradient("x")
print(f"f(3) = {f.eval(x=3)}") # 9
print(f"f'(3) = {f_prime.eval(x=3)}") # 6
Why Derivatives Matter in ML¶
The loss function measures error. The derivative of the loss with respect to each weight tells us how to adjust that weight:
from neurogebra import MathForge
forge = MathForge()
# MSE loss: L = (prediction - target)²
mse = forge.get("mse")
print(mse.symbolic_expr) # (y_pred - y_true)**2
# Gradient: dL/d(prediction) = 2*(prediction - target)
grad = mse.gradient("y_pred")
print(grad.symbolic_expr) # 2*(y_pred - y_true)
# If prediction=5, target=3:
# gradient = 2*(5-3) = 4 (positive → decrease prediction)
print(grad.eval(y_pred=5, y_true=3)) # 4.0
The Chain Rule¶
For composite functions, derivatives multiply:
This is exactly what backpropagation does — it applies the chain rule through every layer of a neural network.
from neurogebra.core.autograd import Value
# Chain of operations
x = Value(2.0)
h = x ** 2 # h = x² = 4
y = h + 3 # y = h + 3 = 7
z = y * 2 # z = y * 2 = 14
# Backprop automatically applies chain rule
z.backward()
print(f"dz/dx = {x.grad}") # dz/dh * dh/dx = 2 * 2x = 2*4 = 8
Gradient Descent¶
The core learning algorithm: move in the opposite direction of the gradient:
Where \(\alpha\) is the learning rate.
# Manual gradient descent
w = 5.0 # Start with a guess
lr = 0.1 # Learning rate
target = 0.0 # Want to minimize w²
for step in range(20):
loss = w ** 2 # Loss function
gradient = 2 * w # dL/dw = 2w
w = w - lr * gradient # Update step
if step % 5 == 0:
print(f"Step {step}: w = {w:.4f}, loss = {loss:.4f}")
# w gets closer and closer to 0 (the minimum of w²)
Pillar 3: Probability & Statistics¶
Mean (Average)¶
Variance and Standard Deviation¶
How spread out the data is:
The Normal (Gaussian) Distribution¶
Most data in ML follows a bell curve:
# Generate normally distributed data
samples = np.random.normal(mean=0, scale=1, size=1000)
print(f"Mean: {samples.mean():.2f}") # ≈ 0
print(f"Std: {samples.std():.2f}") # ≈ 1
Softmax — Probabilities from Numbers¶
Converts raw numbers into probabilities that sum to 1:
def softmax(x):
exp_x = np.exp(x - np.max(x)) # Subtract max for numerical stability
return exp_x / exp_x.sum()
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs) # [0.659, 0.242, 0.099]
print(probs.sum()) # 1.0 — they're probabilities!
Putting It All Together: A Neuron¶
A single neuron combines all three pillars:
- Linear Algebra: \(z = \mathbf{w} \cdot \mathbf{x} + b\) (dot product)
- Calculus: Gradient descent learns \(\mathbf{w}\) and \(b\)
- Probability: Activation (sigmoid) outputs a probability
from neurogebra.core.autograd import Value
# Inputs
x1, x2 = Value(1.0), Value(2.0)
# Weights & bias (the parameters to learn)
w1, w2, b = Value(0.5), Value(-0.3), Value(0.1)
# Step 1: Linear combination (Linear Algebra)
z = w1 * x1 + w2 * x2 + b # 0.5*1 + (-0.3)*2 + 0.1 = 0.0
# Step 2: Activation (Probability - squash to 0-1)
output = z.sigmoid() # σ(0.0) = 0.5
# Step 3: Compute gradients (Calculus)
output.backward()
print(f"Output: {output.data:.4f}") # 0.5000
print(f"dout/dw1 = {w1.grad:.4f}") # How to adjust w1
print(f"dout/dw2 = {w2.grad:.4f}") # How to adjust w2
print(f"dout/db = {b.grad:.4f}") # How to adjust bias
Neurogebra Makes Math Visible¶
Instead of hiding math inside C++ kernels like PyTorch:
from neurogebra import MathForge
forge = MathForge()
# See the formula
sigmoid = forge.get("sigmoid")
print(sigmoid.symbolic_expr) # 1/(1 + exp(-x))
# See the derivative
print(sigmoid.gradient("x").symbolic_expr)
# exp(-x)/(1 + exp(-x))² — which equals σ(x)·(1-σ(x))
# Explain it
print(sigmoid.explain())
Key Insight
You don't need to memorize derivatives. Neurogebra computes them symbolically. But understanding what they mean is crucial for debugging and improving your models.
Next: MathForge — The Core →