Multi-Layer Perceptrons: From Mathematics to Implementation

A complete guide to understanding neural networks from first principles

Deep Learning
Neural Networks
Machine Learning
Tutorial
Author

Rishabh Mondal

Published

February 18, 2025

Welcome to the World of Neural Networks!

This tutorial will take you on a journey from understanding the basic concepts of neural networks to implementing a fully functional Multi-Layer Perceptron (MLP) from scratch. We’ll use pure mathematics first, then build intuition with hand calculations, and finally implement everything in Python with NumPy.

What makes this tutorial special?

  • Progressive difficulty: Easy → Moderate → Difficult
  • Matrix-first approach: Understand the mathematics before writing code
  • Hand calculations: Work through examples step-by-step
  • From scratch implementation: No ML libraries, just NumPy
  • Real dataset training: See your network learn!

Prerequisites: Basic Python, linear algebra (matrices), and calculus (derivatives)


Part 1: The Big Picture (Easy)

1.1 What is a Neural Network?

A neural network is a computational model inspired by the human brain. Just as our brains have billions of neurons connected together, artificial neural networks have computational units (nodes) connected in layers.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, FancyArrowPatch
import warnings
warnings.filterwarnings('ignore')

def draw_simple_neural_network():
    """Draw a simple 3-layer neural network"""
    fig, ax = plt.subplots(figsize=(12, 8))
    ax.set_xlim(-0.5, 3.5)
    ax.set_ylim(-0.5, 4.5)
    ax.axis('off')

    # Title
    ax.text(1.5, 4.2, 'A Simple Neural Network', ha='center', fontsize=16, weight='bold')

    # Layer positions
    layers = {
        'input': {'x': 0, 'neurons': 3, 'color': '#3498db', 'label': 'Input Layer'},
        'hidden': {'x': 1.5, 'neurons': 4, 'color': '#e74c3c', 'label': 'Hidden Layer'},
        'output': {'x': 3, 'neurons': 2, 'color': '#27ae60', 'label': 'Output Layer'}
    }

    neuron_positions = {}

    for layer_name, props in layers.items():
        x = props['x']
        n = props['neurons']
        positions = []

        for i in range(n):
            y = 2 - (n-1)/2 * 0.8 + i * 0.8
            positions.append((x, y))
            circle = Circle((x, y), 0.15, color=props['color'], ec='black', linewidth=2, zorder=10)
            ax.add_patch(circle)

            # Add neuron labels
            if layer_name == 'input':
                ax.text(x - 0.3, y, f'x{i+1}', ha='right', va='center', fontsize=12)
            elif layer_name == 'output':
                ax.text(x + 0.3, y, f'y{i+1}', ha='left', va='center', fontsize=12)

        neuron_positions[layer_name] = positions
        ax.text(x, -0.3, props['label'], ha='center', fontsize=11, weight='bold')

    # Draw connections
    for start_layer, end_layer in [('input', 'hidden'), ('hidden', 'output')]:
        for start_pos in neuron_positions[start_layer]:
            for end_pos in neuron_positions[end_layer]:
                ax.annotate('', xy=end_pos, xytext=start_pos,
                           arrowprops=dict(arrowstyle='->', color='gray',
                                          alpha=0.5, lw=1.5))

    # Add weight labels
    ax.text(0.75, 3.5, 'Weights W[1]', ha='center', fontsize=10, style='italic')
    ax.text(2.25, 3.5, 'Weights W[2]', ha='center', fontsize=10, style='italic')

    plt.tight_layout()
    plt.show()

draw_simple_neural_network()

Key Components:

  1. Input Layer: Receives the raw data (features)
  2. Hidden Layer(s): Learns complex patterns through transformations
  3. Output Layer: Produces the final prediction
  4. Weights: Numbers that determine the strength of connections
  5. Biases: Additional parameters that shift the activation

1.2 The Single Neuron: The Building Block

Before understanding a network, let’s understand a single neuron:

def draw_single_neuron():
    """Draw and explain a single neuron"""
    fig, ax = plt.subplots(figsize=(14, 6))
    ax.set_xlim(-1, 11)
    ax.set_ylim(-1, 5)
    ax.axis('off')

    # Title
    ax.text(5, 4.5, 'Anatomy of a Single Neuron', ha='center', fontsize=16, weight='bold')

    # Input nodes
    inputs = [(0, 3), (0, 2), (0, 1)]
    for i, (x, y) in enumerate(inputs):
        circle = Circle((x, y), 0.2, color='#3498db', ec='black', linewidth=2)
        ax.add_patch(circle)
        ax.text(-0.5, y, f'x{i+1}', ha='right', va='center', fontsize=14)

    # Neuron body
    neuron = Circle((4, 2), 0.5, color='#e74c3c', ec='black', linewidth=3)
    ax.add_patch(neuron)
    ax.text(4, 2, 'SUM', ha='center', va='center', fontsize=10, weight='bold', color='white')

    # Activation function box
    ax.add_patch(plt.Rectangle((6, 1.5), 1.5, 1, fill=True,
                                color='#f39c12', ec='black', linewidth=2))
    ax.text(6.75, 2, 'f(x)', ha='center', va='center', fontsize=12, weight='bold')

    # Output
    output = Circle((9.5, 2), 0.25, color='#27ae60', ec='black', linewidth=2)
    ax.add_patch(output)
    ax.text(10.2, 2, 'a', ha='left', va='center', fontsize=14)

    # Arrows with weights
    for i, (x, y) in enumerate(inputs):
        ax.annotate('', xy=(3.5, 2), xytext=(0.2, y),
                   arrowprops=dict(arrowstyle='->', color='gray', lw=2))
        ax.text(1.5, y + 0.15, f'w{i+1}', fontsize=11, color='purple')

    # Bias
    ax.annotate('', xy=(4, 1.5), xytext=(4, 0.5),
               arrowprops=dict(arrowstyle='->', color='blue', lw=2))
    ax.text(4.3, 0.8, 'b', fontsize=12, color='blue')
    ax.text(4, 0.3, 'bias', fontsize=10, ha='center', color='blue')

    # Arrows between components
    ax.annotate('', xy=(6, 2), xytext=(4.5, 2),
               arrowprops=dict(arrowstyle='->', color='black', lw=2))
    ax.annotate('', xy=(9.25, 2), xytext=(7.5, 2),
               arrowprops=dict(arrowstyle='->', color='black', lw=2))

    # Labels
    ax.text(5.25, 2.4, 'z', fontsize=12, ha='center')
    ax.text(8.4, 2.4, 'a', fontsize=12, ha='center')

    # Equations
    ax.text(5.5, -0.3, 'z = w1*x1 + w2*x2 + w3*x3 + b = w^T @ x + b',
            fontsize=12, ha='center')
    ax.text(5.5, -0.8, 'a = activation(z)',
            fontsize=12, ha='center')

    plt.tight_layout()
    plt.show()

draw_single_neuron()

The neuron performs two operations:

  1. Linear Combination (Weighted Sum): \[z = w_1 x_1 + w_2 x_2 + w_3 x_3 + b = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T \mathbf{x} + b\]

  2. Activation (Non-linear Transformation): \[a = \sigma(z)\]
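
To make these two operations concrete, here is a minimal NumPy sketch of a single neuron; the input, weight, and bias values are made up purely for illustration:

import numpy as np

# Made-up values for a 3-input neuron (illustration only)
x = np.array([1.0, 2.0, 0.5])     # inputs
w = np.array([0.4, -0.2, 0.1])    # weights
b = 0.3                           # bias

z = w @ x + b                     # weighted sum: w^T x + b
a = 1 / (1 + np.exp(-z))          # sigmoid activation

print(z, a)                       # z = 0.35, a ≈ 0.587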

Why Do We Need Activation Functions?

Without activation functions, a neural network is just a series of linear transformations: \[y = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + W_2 b_1 + b_2 = W' x + b'\]

This collapses to a single linear transformation! Activation functions introduce non-linearity, allowing the network to learn complex patterns.
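
Here is a quick numerical check of that collapse, using two randomly generated linear layers; the single layer with \(W' = W_2 W_1\) and \(b' = W_2 b_1 + b_2\) produces exactly the same output:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))

# Two linear layers with no activation in between
y_two_layers = W2 @ (W1 @ x + b1) + b2

# One equivalent linear layer
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
y_one_layer = W_prime @ x + b_prime

print(np.allclose(y_two_layers, y_one_layer))  # True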

1.3 Common Activation Functions

def plot_activation_functions():
    """Plot common activation functions and their derivatives"""
    fig, axes = plt.subplots(2, 3, figsize=(14, 8))
    x = np.linspace(-5, 5, 200)

    # Sigmoid
    sigmoid = 1 / (1 + np.exp(-x))
    sigmoid_derivative = sigmoid * (1 - sigmoid)

    axes[0, 0].plot(x, sigmoid, 'b-', linewidth=2.5, label='$\\sigma(z)$')
    axes[0, 0].set_title('Sigmoid: $\\sigma(z) = \\frac{1}{1 + e^{-z}}$', fontsize=11)
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].axhline(y=0, color='k', linewidth=0.5)
    axes[0, 0].axvline(x=0, color='k', linewidth=0.5)
    axes[0, 0].set_ylim(-0.1, 1.1)

    axes[1, 0].plot(x, sigmoid_derivative, 'r-', linewidth=2.5, label="$\\sigma'(z)$")
    axes[1, 0].set_title("Derivative: $\\sigma'(z) = \\sigma(z)(1 - \\sigma(z))$", fontsize=11)
    axes[1, 0].grid(True, alpha=0.3)
    axes[1, 0].axhline(y=0, color='k', linewidth=0.5)
    axes[1, 0].axvline(x=0, color='k', linewidth=0.5)

    # Tanh
    tanh = np.tanh(x)
    tanh_derivative = 1 - tanh**2

    axes[0, 1].plot(x, tanh, 'b-', linewidth=2.5)
    axes[0, 1].set_title('Tanh: $\\tanh(z) = \\frac{e^z - e^{-z}}{e^z + e^{-z}}$', fontsize=11)
    axes[0, 1].grid(True, alpha=0.3)
    axes[0, 1].axhline(y=0, color='k', linewidth=0.5)
    axes[0, 1].axvline(x=0, color='k', linewidth=0.5)
    axes[0, 1].set_ylim(-1.2, 1.2)

    axes[1, 1].plot(x, tanh_derivative, 'r-', linewidth=2.5)
    axes[1, 1].set_title("Derivative: $\\tanh'(z) = 1 - \\tanh^2(z)$", fontsize=11)
    axes[1, 1].grid(True, alpha=0.3)
    axes[1, 1].axhline(y=0, color='k', linewidth=0.5)
    axes[1, 1].axvline(x=0, color='k', linewidth=0.5)

    # ReLU
    relu = np.maximum(0, x)
    relu_derivative = (x > 0).astype(float)

    axes[0, 2].plot(x, relu, 'b-', linewidth=2.5)
    axes[0, 2].set_title('ReLU: $\\mathrm{ReLU}(z) = \\max(0, z)$', fontsize=11)
    axes[0, 2].grid(True, alpha=0.3)
    axes[0, 2].axhline(y=0, color='k', linewidth=0.5)
    axes[0, 2].axvline(x=0, color='k', linewidth=0.5)

    axes[1, 2].plot(x, relu_derivative, 'r-', linewidth=2.5)
    axes[1, 2].set_title("Derivative: $\\mathrm{ReLU}'(z) = 1_{z > 0}$", fontsize=11)
    axes[1, 2].grid(True, alpha=0.3)
    axes[1, 2].axhline(y=0, color='k', linewidth=0.5)
    axes[1, 2].axvline(x=0, color='k', linewidth=0.5)
    axes[1, 2].set_ylim(-0.1, 1.5)

    plt.tight_layout()
    plt.show()

    # Summary table
    print("\n" + "=" * 70)
    print("ACTIVATION FUNCTIONS SUMMARY")
    print("=" * 70)
    print(f"{'Function':<12} {'Output Range':<15} {'Use Case':<25} {'Pros/Cons'}")
    print("-" * 70)
    print(f"{'Sigmoid':<12} {'(0, 1)':<15} {'Binary classification':<25} {'Vanishing gradient'}")
    print(f"{'Tanh':<12} {'(-1, 1)':<15} {'Hidden layers':<25} {'Zero-centered'}")
    print(f"{'ReLU':<12} {'[0, ∞)':<15} {'Hidden layers (default)':<25} {'Fast, can die'}")
    print("=" * 70)

plot_activation_functions()


======================================================================
ACTIVATION FUNCTIONS SUMMARY
======================================================================
Function     Output Range    Use Case                  Pros/Cons
----------------------------------------------------------------------
Sigmoid      (0, 1)          Binary classification     Vanishing gradient
Tanh         (-1, 1)         Hidden layers             Zero-centered
ReLU         [0, ∞)          Hidden layers (default)   Fast, can die
======================================================================

Part 2: The Mathematics of Neural Networks (Moderate)

Now let’s dive into the mathematical foundations. We’ll express everything in matrix notation for efficiency and clarity.

2.1 Notation Convention

Before we proceed, let’s establish our notation:

  • \(n^{[l]}\): number of neurons in layer \(l\) (scalar)
  • \(\mathbf{x}\) or \(\mathbf{a}^{[0]}\): input vector, shape \((n^{[0]}, 1)\)
  • \(\mathbf{W}^{[l]}\): weight matrix for layer \(l\), shape \((n^{[l]}, n^{[l-1]})\)
  • \(\mathbf{b}^{[l]}\): bias vector for layer \(l\), shape \((n^{[l]}, 1)\)
  • \(\mathbf{z}^{[l]}\): pre-activation (linear output), shape \((n^{[l]}, 1)\)
  • \(\mathbf{a}^{[l]}\): post-activation output, shape \((n^{[l]}, 1)\)
  • \(L\): total number of layers (scalar)

Matrix Dimensions Convention

The weight matrix \(\mathbf{W}^{[l]}\) has shape \((n^{[l]}, n^{[l-1]})\):

  • Rows = neurons in the current layer
  • Columns = neurons in the previous layer

This allows us to compute: \(\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\)

2.2 Forward Propagation: Layer by Layer

For each layer \(l\) from 1 to \(L\):

\[\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\] \[\mathbf{a}^{[l]} = \sigma^{[l]}(\mathbf{z}^{[l]})\]

where \(\sigma^{[l]}\) is the activation function for layer \(l\).
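
Here is a minimal NumPy sketch of this layer-by-layer recursion for the 3 → 4 → 2 network illustrated below; the weights are random and only the shapes matter:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(42)
layer_sizes = [3, 4, 2]                                            # n[0], n[1], n[2]
params = [(rng.normal(size=(layer_sizes[l], layer_sizes[l-1])),    # W[l]: (n[l], n[l-1])
           np.zeros((layer_sizes[l], 1)))                          # b[l]: (n[l], 1)
          for l in range(1, len(layer_sizes))]

a = rng.normal(size=(3, 1))                  # a[0] = x, shape (n[0], 1)
for l, (W, b) in enumerate(params, start=1):
    z = W @ a + b                            # z[l] = W[l] a[l-1] + b[l]
    a = sigmoid(z)                           # a[l] = σ(z[l])
    print(f"layer {l}: z {z.shape}, a {a.shape}")   # (4, 1), then (2, 1)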

def visualize_forward_propagation():
    """Visualize the forward propagation process with matrices"""
    fig, ax = plt.subplots(figsize=(16, 10))
    ax.set_xlim(0, 16)
    ax.set_ylim(0, 10)
    ax.axis('off')

    # Title
    ax.text(8, 9.5, 'Forward Propagation: Matrix Operations',
            ha='center', fontsize=16, weight='bold')

    # Layer 0 (Input)
    ax.add_patch(plt.Rectangle((0.5, 5), 2, 3, fill=True,
                               color='#3498db', alpha=0.3, ec='blue', linewidth=2))
    ax.text(1.5, 8.3, 'Input', ha='center', fontsize=12, weight='bold', color='blue')
    ax.text(1.5, 7.5, 'a[0] = x', ha='center', fontsize=11)
    ax.text(1.5, 6.5, '[x1, x2, x3]^T',ha='center', fontsize=11)
    ax.text(1.5, 5.3, 'Shape: (3, 1)', ha='center', fontsize=9, style='italic')

    # Arrow with operation
    ax.annotate('', xy=(4.5, 6.5), xytext=(2.5, 6.5),
               arrowprops=dict(arrowstyle='->', color='black', lw=2))

    # Layer 1 computation box
    ax.add_patch(plt.Rectangle((4.5, 4), 4, 5, fill=True,
                               color='#f39c12', alpha=0.2, ec='orange', linewidth=2))
    ax.text(6.5, 8.8, 'Layer 1 Computation', ha='center', fontsize=11, weight='bold', color='darkorange')

    # z computation
    ax.text(6.5, 8.2, 'z[1] = W[1] @ a[0] + b[1]', ha='center', fontsize=11)
    ax.text(6.5, 7.4, 'W[1]: (4x3) matrix', ha='center', fontsize=10)
    ax.text(6.5, 6.9, 'a[0]: (3x1) vector', ha='center', fontsize=10)
    ax.text(6.5, 6.4, 'b[1]: (4x1) bias', ha='center', fontsize=10)
    ax.text(6.5, 5.8, 'W[1]: (4,3) @ a[0]: (3,1) + b[1]: (4,1)',
            ha='center', fontsize=9, style='italic')
    ax.text(6.5, 5.3, '= z[1]: (4, 1)', ha='center', fontsize=10, weight='bold')

    # Activation
    ax.text(6.5, 4.6, 'a[1] = sigmoid(z[1])', ha='center', fontsize=11, color='red')
    ax.text(6.5, 4.2, 'Shape: (4, 1)', ha='center', fontsize=9, style='italic')

    # Arrow
    ax.annotate('', xy=(10.5, 6.5), xytext=(8.5, 6.5),
               arrowprops=dict(arrowstyle='->', color='black', lw=2))

    # Layer 2 computation box
    ax.add_patch(plt.Rectangle((10.5, 4), 4, 5, fill=True,
                               color='#27ae60', alpha=0.2, ec='green', linewidth=2))
    ax.text(12.5, 8.8, 'Layer 2 (Output)', ha='center', fontsize=11, weight='bold', color='darkgreen')

    ax.text(12.5, 8.2, 'z[2] = W[2] @ a[1] + b[2]', ha='center', fontsize=11)
    ax.text(12.5, 7.4, 'W[2]: (2x4) matrix', ha='center', fontsize=10)
    ax.text(12.5, 6.9, 'a[1]: (4x1) vector', ha='center', fontsize=10)
    ax.text(12.5, 6.4, 'b[2]: (2x1) bias', ha='center', fontsize=10)
    ax.text(12.5, 5.8, 'W[2]: (2,4) @ a[1]: (4,1) + b[2]: (2,1)',
            ha='center', fontsize=9, style='italic')
    ax.text(12.5, 5.3, '= z[2]: (2, 1)', ha='center', fontsize=10, weight='bold')

    ax.text(12.5, 4.6, 'a[2] = sigmoid(z[2]) = y_hat', ha='center', fontsize=11, color='red')
    ax.text(12.5, 4.2, 'Final Output: (2, 1)', ha='center', fontsize=9, style='italic')

    # Summary at bottom
    ax.text(8, 2.5, 'FORWARD PROPAGATION SUMMARY', ha='center', fontsize=13, weight='bold')
    ax.text(8, 1.8, 'Network: 3 -> 4 -> 2 (input -> hidden -> output)', ha='center', fontsize=11)
    ax.text(8, 1.2, 'Total Parameters: W[1] (12) + b[1] (4) + W[2] (8) + b[2] (2) = 26',
            ha='center', fontsize=10)

    plt.tight_layout()
    plt.show()

visualize_forward_propagation()

2.3 Hand Calculation Example

Let’s work through a concrete example with actual numbers:

print("=" * 70)
print("HAND CALCULATION: FORWARD PROPAGATION")
print("=" * 70)

# Network: 2 inputs → 2 hidden → 1 output
# Let's use simple numbers

# Input
x = np.array([[1.0],
              [2.0]])
print(f"\n📥 INPUT")
print(f"x = {x.flatten()}")

# Layer 1 weights and biases
W1 = np.array([[0.1, 0.2],
               [0.3, 0.4]])
b1 = np.array([[0.1],
               [0.2]])

print(f"\n📊 LAYER 1 PARAMETERS")
print(f"W1 =\n{W1}")
print(f"b1 = {b1.flatten()}")

# Layer 1 forward
z1 = W1 @ x + b1
print(f"\n⚡ LAYER 1 COMPUTATION")
print(f"z1 = W1 @ x + b1")
print(f"z1 = {W1} @ {x.flatten()} + {b1.flatten()}")
print(f"z1 = [{0.1*1 + 0.2*2:.2f}, {0.3*1 + 0.4*2:.2f}] + [0.1, 0.2]")
print(f"z1 = [{0.1 + 0.4:.2f}, {0.3 + 0.8:.2f}] + [0.1, 0.2]")
print(f"z1 = {z1.flatten()}")

# Sigmoid activation
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

a1 = sigmoid(z1)
print(f"\n🔄 ACTIVATION (Sigmoid)")
print(f"a1 = sigmoid(z1) = 1 / (1 + e^(-z1))")
print(f"a1[0] = 1 / (1 + e^(-{z1[0,0]:.2f})) = {a1[0,0]:.4f}")
print(f"a1[1] = 1 / (1 + e^(-{z1[1,0]:.2f})) = {a1[1,0]:.4f}")
print(f"a1 = {a1.flatten()}")

# Layer 2 weights and biases
W2 = np.array([[0.5, 0.6]])
b2 = np.array([[0.3]])

print(f"\n📊 LAYER 2 PARAMETERS")
print(f"W2 = {W2}")
print(f"b2 = {b2.flatten()}")

# Layer 2 forward
z2 = W2 @ a1 + b2
print(f"\n⚡ LAYER 2 COMPUTATION")
print(f"z2 = W2 @ a1 + b2")
print(f"z2 = {0.5*a1[0,0] + 0.6*a1[1,0]:.4f} + 0.3")
print(f"z2 = {z2[0,0]:.4f}")

# Final output
a2 = sigmoid(z2)
print(f"\n🎯 FINAL OUTPUT")
print(f"y_hat = sigmoid(z2) = {a2[0,0]:.4f}")

print("\n" + "=" * 70)
print("FORWARD PROPAGATION COMPLETE!")
print(f"Input: {x.flatten()} → Output: {a2[0,0]:.4f}")
print("=" * 70)
======================================================================
HAND CALCULATION: FORWARD PROPAGATION
======================================================================

📥 INPUT
x = [1. 2.]

📊 LAYER 1 PARAMETERS
W1 =
[[0.1 0.2]
 [0.3 0.4]]
b1 = [0.1 0.2]

⚡ LAYER 1 COMPUTATION
z1 = W1 @ x + b1
z1 = [[0.1 0.2]
 [0.3 0.4]] @ [1. 2.] + [0.1 0.2]
z1 = [0.1*1 + 0.2*2, 0.3*1 + 0.4*2] + [0.1, 0.2]
z1 = [0.50, 1.10] + [0.1, 0.2]
z1 = [0.6 1.3]

🔄 ACTIVATION (Sigmoid)
a1 = sigmoid(z1) = 1 / (1 + e^(-z1))
a1[0] = 1 / (1 + e^(-0.60)) = 0.6457
a1[1] = 1 / (1 + e^(-1.30)) = 0.7858
a1 = [0.64565631 0.78583498]

📊 LAYER 2 PARAMETERS
W2 = [[0.5 0.6]]
b2 = [0.3]

⚡ LAYER 2 COMPUTATION
z2 = W2 @ a1 + b2
z2 = 0.7943 + 0.3
z2 = 1.0943

🎯 FINAL OUTPUT
y_hat = sigmoid(z2) = 0.7492

======================================================================
FORWARD PROPAGATION COMPLETE!
Input: [1. 2.] → Output: 0.7492
======================================================================

Given:

  • Input: \(\mathbf{x} = [0.5, 0.8]^T\)
  • \(\mathbf{W}^{[1]} = \begin{bmatrix} 0.2 & 0.4 \\ 0.6 & 0.1 \end{bmatrix}\), \(\mathbf{b}^{[1]} = \begin{bmatrix} 0 \\ 0.1 \end{bmatrix}\)
  • \(\mathbf{W}^{[2]} = \begin{bmatrix} 0.3 & 0.7 \end{bmatrix}\), \(\mathbf{b}^{[2]} = \begin{bmatrix} 0.2 \end{bmatrix}\)
  • Use tanh activation for the hidden layer and sigmoid for the output

Calculate:

  1. \(\mathbf{z}^{[1]}\) and \(\mathbf{a}^{[1]}\)
  2. \(\mathbf{z}^{[2]}\) and \(\mathbf{a}^{[2]}\) (final output)

Solution: Try it yourself first, then check below!

x = np.array([[0.5], [0.8]])
W1 = np.array([[0.2, 0.4], [0.6, 0.1]])
b1 = np.array([[0], [0.1]])
W2 = np.array([[0.3, 0.7]])
b2 = np.array([[0.2]])

z1 = W1 @ x + b1  # = [[0.42], [0.48]]
a1 = np.tanh(z1)  # = [[0.3969], [0.4462]]
z2 = W2 @ a1 + b2  # = [[0.6314]]
a2 = sigmoid(z2)  # = [[0.6528]]

Part 3: Backpropagation - Learning from Mistakes (Moderate-Difficult)

The magic of neural networks lies in their ability to learn. This happens through backpropagation—computing gradients of the loss function with respect to each parameter.

3.1 The Loss Function

For binary classification, we use Binary Cross-Entropy Loss:

\[\mathcal{L}(\hat{y}, y) = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]\]

For a batch of \(m\) examples:

\[J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})\]
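
As a quick sanity check, the batch loss can be computed directly from this formula for a few made-up predictions (the clipping of \(\hat{y}\) away from 0 and 1 used later in the implementation is omitted here for brevity):

import numpy as np

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])

losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(losses)         # per-example losses: [0.105, 0.223, 0.511]
print(losses.mean())  # J ≈ 0.280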

def plot_loss_landscape():
    """Visualize the loss function"""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Binary Cross-Entropy
    y_pred = np.linspace(0.01, 0.99, 100)

    # When y = 1
    loss_y1 = -np.log(y_pred)
    # When y = 0
    loss_y0 = -np.log(1 - y_pred)

    axes[0].plot(y_pred, loss_y1, 'b-', linewidth=2.5, label='True label y=1')
    axes[0].plot(y_pred, loss_y0, 'r-', linewidth=2.5, label='True label y=0')
    axes[0].set_xlabel('Predicted probability $\\hat{y}$', fontsize=12)
    axes[0].set_ylabel('Loss', fontsize=12)
    axes[0].set_title('Binary Cross-Entropy Loss', fontsize=13, weight='bold')
    axes[0].legend(fontsize=11)
    axes[0].grid(True, alpha=0.3)
    axes[0].set_xlim(0, 1)
    axes[0].set_ylim(0, 5)

    # 3D loss landscape (conceptual): swap the second 2D axes for a 3D one
    axes[1].remove()
    ax2 = fig.add_subplot(1, 2, 2, projection='3d')
    w1 = np.linspace(-2, 2, 50)
    w2 = np.linspace(-2, 2, 50)
    W1, W2 = np.meshgrid(w1, w2)

    # Create a bowl-shaped loss landscape
    Loss = W1**2 + W2**2 + 0.5*np.sin(3*W1)*np.sin(3*W2)

    ax2.plot_surface(W1, W2, Loss, cmap='viridis', alpha=0.8)
    ax2.set_xlabel('Weight 1')
    ax2.set_ylabel('Weight 2')
    ax2.set_zlabel('Loss')
    ax2.set_title('Loss Landscape (Conceptual)', fontsize=12, weight='bold')

    plt.tight_layout()
    plt.show()

plot_loss_landscape()

3.2 The Chain Rule: Heart of Backpropagation

Backpropagation uses the chain rule to compute gradients layer by layer, from output back to input.

The key insight: To update a weight, we need to know how much a small change in that weight affects the final loss.

\[\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}\]

def draw_backprop_chain():
    """Visualize the chain rule in backpropagation"""
    fig, ax = plt.subplots(figsize=(16, 8))
    ax.set_xlim(0, 16)
    ax.set_ylim(0, 8)
    ax.axis('off')

    # Title
    ax.text(8, 7.5, 'Backpropagation: Following the Chain Rule',
            ha='center', fontsize=16, weight='bold')

    # Forward pass (top)
    ax.text(8, 6.8, 'FORWARD PASS', ha='center', fontsize=12, color='blue', weight='bold')

    boxes = [
        (1, 5.5, 'x', 'Input'),
        (4, 5.5, 'z[1]', 'Linear'),
        (7, 5.5, 'a[1]', 'Activation'),
        (10, 5.5, 'z[2]', 'Linear'),
        (13, 5.5, 'y_hat', 'Output'),
    ]

    for x, y, label, desc in boxes:
        ax.add_patch(plt.Rectangle((x-0.5, y), 1.5, 0.8, fill=True,
                                   color='lightblue', ec='blue', linewidth=2))
        ax.text(x + 0.25, y + 0.4, label, ha='center', va='center', fontsize=12)
        ax.text(x + 0.25, y + 1.0, desc, ha='center', fontsize=9, color='gray')

    # Forward arrows
    for i in range(len(boxes) - 1):
        ax.annotate('', xy=(boxes[i+1][0]-0.5, boxes[i][1]+0.4),
                   xytext=(boxes[i][0]+1, boxes[i][1]+0.4),
                   arrowprops=dict(arrowstyle='->', color='blue', lw=2))

    # Loss
    ax.add_patch(plt.Rectangle((14.5, 5.5), 1.2, 0.8, fill=True,
                               color='#ffcccc', ec='red', linewidth=2))
    ax.text(15.1, 5.9, 'L', ha='center', va='center', fontsize=14, weight='bold')
    ax.annotate('', xy=(14.5, 5.9), xytext=(14, 5.9),
               arrowprops=dict(arrowstyle='->', color='blue', lw=2))

    # Backward pass (bottom)
    ax.text(8, 4.2, 'BACKWARD PASS (Gradients)', ha='center', fontsize=12, color='red', weight='bold')

    grad_boxes = [
        (1, 2.5, 'dL/dW[1]'),
        (4, 2.5, 'dL/dz[1]'),
        (7, 2.5, 'dL/da[1]'),
        (10, 2.5, 'dL/dz[2]'),
        (13, 2.5, 'dL/dy_hat'),
    ]

    for x, y, label in grad_boxes:
        ax.add_patch(plt.Rectangle((x-0.5, y), 1.5, 0.8, fill=True,
                                   color='#ffe6e6', ec='red', linewidth=2))
        ax.text(x + 0.25, y + 0.4, label, ha='center', va='center', fontsize=9)

    # Backward arrows
    for i in range(len(grad_boxes) - 1, 0, -1):
        ax.annotate('', xy=(grad_boxes[i-1][0]+1, grad_boxes[i][1]+0.4),
                   xytext=(grad_boxes[i][0]-0.5, grad_boxes[i][1]+0.4),
                   arrowprops=dict(arrowstyle='->', color='red', lw=2))

    # Chain rule equation (simplified)
    ax.text(8, 1.3, 'Chain Rule: dL/dW[1] = dL/dy_hat * dy_hat/dz[2] * dz[2]/da[1] * da[1]/dz[1] * dz[1]/dW[1]',
            ha='center', fontsize=11)

    ax.text(8, 0.6, 'Each gradient depends on gradients that come after it (flowing backward)',
            ha='center', fontsize=11, style='italic', color='darkred')

    plt.tight_layout()
    plt.show()

draw_backprop_chain()

3.3 Deriving the Gradients

Let’s derive the gradients for a 2-layer network with sigmoid activations:

Output Layer Gradients

For the output layer with sigmoid activation and binary cross-entropy loss:

\[\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[2]}} = \mathbf{a}^{[2]} - \mathbf{y} = \hat{\mathbf{y}} - \mathbf{y}\]

This remarkably simple result comes from the beautiful cancellation when using sigmoid with cross-entropy!
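
For the curious, the cancellation takes only two lines. Since \(\hat{y} = \sigma(z^{[2]})\):

\[\frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}, \qquad \frac{\partial \hat{y}}{\partial z^{[2]}} = \hat{y}(1-\hat{y})\]

\[\frac{\partial \mathcal{L}}{\partial z^{[2]}} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}(1-\hat{y}) = -y(1-\hat{y}) + (1-y)\hat{y} = \hat{y} - y\]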

\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[2]}} \cdot (\mathbf{a}^{[1]})^T\]

\[\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[2]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[2]}}\]

Hidden Layer Gradients

For the hidden layer:

\[\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[1]}} = (\mathbf{W}^{[2]})^T \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[2]}}\]

\[\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[1]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[1]}} \odot \sigma'(\mathbf{z}^{[1]})\]

where \(\odot\) denotes element-wise multiplication and \(\sigma'(z) = \sigma(z)(1-\sigma(z))\).

\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[1]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[1]}} \cdot (\mathbf{a}^{[0]})^T\]

\[\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[1]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[1]}}\]

print("=" * 70)
print("BACKPROPAGATION FORMULAS SUMMARY")
print("=" * 70)
print("\n📍 OUTPUT LAYER (Layer L)")
print("-" * 50)
print("δ^[L] = dL/dz^[L] = a^[L] - y  (for sigmoid + cross-entropy)")
print("dW^[L] = δ^[L] · (a^[L-1])^T")
print("db^[L] = δ^[L]")

print("\n📍 HIDDEN LAYER (Layer l)")
print("-" * 50)
print("δ^[l] = (W^[l+1])^T · δ^[l+1] ⊙ σ'(z^[l])")
print("dW^[l] = δ^[l] · (a^[l-1])^T")
print("db^[l] = δ^[l]")

print("\n📍 GRADIENT DESCENT UPDATE")
print("-" * 50)
print("W^[l] := W^[l] - α · dW^[l]")
print("b^[l] := b^[l] - α · db^[l]")
print("\nwhere α is the learning rate")
print("=" * 70)
======================================================================
BACKPROPAGATION FORMULAS SUMMARY
======================================================================

📍 OUTPUT LAYER (Layer L)
--------------------------------------------------
δ^[L] = dL/dz^[L] = a^[L] - y  (for sigmoid + cross-entropy)
dW^[L] = δ^[L] · (a^[L-1])^T
db^[L] = δ^[L]

📍 HIDDEN LAYER (Layer l)
--------------------------------------------------
δ^[l] = (W^[l+1])^T · δ^[l+1] ⊙ σ'(z^[l])
dW^[l] = δ^[l] · (a^[l-1])^T
db^[l] = δ^[l]

📍 GRADIENT DESCENT UPDATE
--------------------------------------------------
W^[l] := W^[l] - α · dW^[l]
b^[l] := b^[l] - α · db^[l]

where α is the learning rate
======================================================================

3.4 Hand Calculation: Backpropagation

Let’s continue from our forward pass example and compute the gradients:

print("=" * 70)
print("HAND CALCULATION: BACKPROPAGATION")
print("=" * 70)

# Using values from forward pass
# x = [1, 2], W1, b1, W2, b2 as before
# z1 = [0.6, 1.3], a1 ≈ [0.6457, 0.7858]
# z2 ≈ [1.0943], a2 ≈ [0.7492]

# True label
y = np.array([[1.0]])  # True class is 1

print(f"\n📊 VALUES FROM FORWARD PASS")
print(f"x = {x.flatten()}")
print(f"z1 = {z1.flatten()}")
print(f"a1 = {a1.flatten()}")
print(f"z2 = {z2.flatten()}")
print(f"a2 (ŷ) = {a2.flatten()}")
print(f"y (true) = {y.flatten()}")

# Compute loss
loss = -(y * np.log(a2) + (1-y) * np.log(1-a2))
print(f"\n📉 LOSS")
print(f"L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]")
print(f"L = -[1·log({a2[0,0]:.4f}) + 0·log({1-a2[0,0]:.4f})]")
print(f"L = -log({a2[0,0]:.4f}) = {loss[0,0]:.4f}")

# Output layer gradient
dz2 = a2 - y
print(f"\n⬅️ OUTPUT LAYER GRADIENTS")
print(f"δ^[2] = dL/dz2 = a2 - y = {a2[0,0]:.4f} - {y[0,0]:.1f} = {dz2[0,0]:.4f}")

dW2 = dz2 @ a1.T
db2 = dz2
print(f"dW2 = δ^[2] · (a1)^T = {dz2[0,0]:.4f} × [{a1[0,0]:.4f}, {a1[1,0]:.4f}]")
print(f"dW2 = [{dW2[0,0]:.4f}, {dW2[0,1]:.4f}]")
print(f"db2 = δ^[2] = {db2[0,0]:.4f}")

# Hidden layer gradient
def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

da1 = W2.T @ dz2
print(f"\n⬅️ HIDDEN LAYER GRADIENTS")
print(f"da1 = (W2)^T · δ^[2] = [{W2[0,0]:.1f}, {W2[0,1]:.1f}]^T × {dz2[0,0]:.4f}")
print(f"da1 = [{da1[0,0]:.4f}, {da1[1,0]:.4f}]")

sig_prime = sigmoid_derivative(z1)
print(f"\nσ'(z1) = σ(z1) ⊙ (1 - σ(z1))")
print(f"σ'(z1) = [{a1[0,0]:.4f}×{1-a1[0,0]:.4f}, {a1[1,0]:.4f}×{1-a1[1,0]:.4f}]")
print(f"σ'(z1) = [{sig_prime[0,0]:.4f}, {sig_prime[1,0]:.4f}]")

dz1 = da1 * sig_prime
print(f"\nδ^[1] = da1 ⊙ σ'(z1)")
print(f"δ^[1] = [{da1[0,0]:.4f}×{sig_prime[0,0]:.4f}, {da1[1,0]:.4f}×{sig_prime[1,0]:.4f}]")
print(f"δ^[1] = [{dz1[0,0]:.4f}, {dz1[1,0]:.4f}]")

dW1 = dz1 @ x.T
db1 = dz1
print(f"\ndW1 = δ^[1] · (x)^T")
print(f"dW1 =\n{dW1}")
print(f"db1 = {db1.flatten()}")

# Parameter update
learning_rate = 0.1
print(f"\n🔄 PARAMETER UPDATE (α = {learning_rate})")
print(f"W2_new = W2 - α·dW2 = {W2} - {learning_rate}×{dW2}")
W2_new = W2 - learning_rate * dW2
print(f"W2_new = {W2_new}")

W1_new = W1 - learning_rate * dW1
print(f"\nW1_new = W1 - α·dW1")
print(f"W1_new =\n{W1_new}")

print("\n" + "=" * 70)
print("BACKPROPAGATION COMPLETE!")
print("=" * 70)
======================================================================
HAND CALCULATION: BACKPROPAGATION
======================================================================

📊 VALUES FROM FORWARD PASS
x = [1. 2.]
z1 = [0.6 1.3]
a1 = [0.64565631 0.78583498]
z2 = [1.09432914]
a2 (ŷ) = [0.74919605]
y (true) = [1.]

📉 LOSS
L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]
L = -[1·log(0.7492) + 0·log(0.2508)]
L = -log(0.7492) = 0.2888

⬅️ OUTPUT LAYER GRADIENTS
δ^[2] = dL/dz2 = a2 - y = 0.7492 - 1.0 = -0.2508
dW2 = δ^[2] · (a1)^T = -0.2508 × [0.6457, 0.7858]
dW2 = [-0.1619, -0.1971]
db2 = δ^[2] = -0.2508

⬅️ HIDDEN LAYER GRADIENTS
da1 = (W2)^T · δ^[2] = [0.5, 0.6]^T × -0.2508
da1 = [-0.1254, -0.1505]

σ'(z1) = σ(z1) ⊙ (1 - σ(z1))
σ'(z1) = [0.6457×0.3543, 0.7858×0.2142]
σ'(z1) = [0.2288, 0.1683]

δ^[1] = da1 ⊙ σ'(z1)
δ^[1] = [-0.1254×0.2288, -0.1505×0.1683]
δ^[1] = [-0.0287, -0.0253]

dW1 = δ^[1] · (x)^T
dW1 =
[[-0.02869    -0.05737999]
 [-0.02532594 -0.05065187]]
db1 = [-0.02869    -0.02532594]

🔄 PARAMETER UPDATE (α = 0.1)
W2_new = W2 - α·dW2 = [[0.5 0.6]] - 0.1×[[-0.16193315 -0.19709052]]
W2_new = [[0.51619332 0.61970905]]

W1_new = W1 - α·dW1
W1_new =
[[0.102869   0.205738  ]
 [0.30253259 0.40506519]]

======================================================================
BACKPROPAGATION COMPLETE!
======================================================================
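
As a sanity check on the hand-derived gradients, a central-difference approximation (a minimal sketch reusing exactly the numbers above) should reproduce dW2 to several decimal places:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def network_loss(W1, b1, W2, b2, x, y):
    """Forward pass + binary cross-entropy for the 2 → 2 → 1 network above."""
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return (-(y * np.log(a2) + (1 - y) * np.log(1 - a2))).item()

# Same numbers as the hand calculation
x = np.array([[1.0], [2.0]])
y = np.array([[1.0]])
W1 = np.array([[0.1, 0.2], [0.3, 0.4]]); b1 = np.array([[0.1], [0.2]])
W2 = np.array([[0.5, 0.6]]);             b2 = np.array([[0.3]])

eps = 1e-6
dW2_numeric = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        W2_plus, W2_minus = W2.copy(), W2.copy()
        W2_plus[i, j] += eps
        W2_minus[i, j] -= eps
        # Central difference: (L(w + eps) - L(w - eps)) / (2 * eps)
        dW2_numeric[i, j] = (network_loss(W1, b1, W2_plus, b2, x, y)
                             - network_loss(W1, b1, W2_minus, b2, x, y)) / (2 * eps)

print(dW2_numeric)   # ≈ [[-0.1619 -0.1971]], matching dW2 above

The same loop over the entries of W1, b1, and b2 can be used to verify every gradient in the hand calculation.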

Part 4: Implementing an MLP from Scratch (Difficult)

Now let’s put everything together into a complete, working implementation.

4.1 The MLP Class

class MLP:
    """
    Multi-Layer Perceptron implemented from scratch.

    This implementation supports:
    - Arbitrary number of layers
    - Different activation functions (sigmoid, tanh, relu)
    - Binary cross-entropy loss
    - Mini-batch gradient descent
    """

    def __init__(self, layer_sizes, activations=None, learning_rate=0.01, seed=42):
        """
        Initialize the MLP.

        Parameters:
        -----------
        layer_sizes : list
            Number of neurons in each layer [input, hidden1, ..., output]
        activations : list or None
            Activation functions for each layer (except input)
            Options: 'sigmoid', 'tanh', 'relu'
        learning_rate : float
            Learning rate for gradient descent
        seed : int
            Random seed for reproducibility
        """
        np.random.seed(seed)

        self.layer_sizes = layer_sizes
        self.L = len(layer_sizes) - 1  # Number of layers (excluding input)
        self.lr = learning_rate

        # Default activations: ReLU for hidden, sigmoid for output
        if activations is None:
            self.activations = ['relu'] * (self.L - 1) + ['sigmoid']
        else:
            self.activations = activations

        # Initialize parameters
        self.params = {}
        self._initialize_parameters()

        # Cache for forward/backward pass
        self.cache = {}

    def _initialize_parameters(self):
        """Initialize weights using He initialization and biases to zero."""
        for l in range(1, self.L + 1):
            # He initialization for weights
            self.params[f'W{l}'] = np.random.randn(
                self.layer_sizes[l],
                self.layer_sizes[l-1]
            ) * np.sqrt(2 / self.layer_sizes[l-1])

            # Zero initialization for biases
            self.params[f'b{l}'] = np.zeros((self.layer_sizes[l], 1))

        print("=" * 60)
        print("MLP INITIALIZED")
        print("=" * 60)
        print(f"Architecture: {' → '.join(map(str, self.layer_sizes))}")
        print(f"Activations: {self.activations}")
        total_params = sum(
            self.params[f'W{l}'].size + self.params[f'b{l}'].size
            for l in range(1, self.L + 1)
        )
        print(f"Total parameters: {total_params}")
        print("=" * 60)

    # ==================== ACTIVATION FUNCTIONS ====================

    def _sigmoid(self, z):
        """Sigmoid activation: σ(z) = 1 / (1 + e^(-z))"""
        # Clip to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))

    def _sigmoid_derivative(self, a):
        """Derivative of sigmoid: σ'(z) = σ(z)(1 - σ(z)) = a(1-a)"""
        return a * (1 - a)

    def _tanh(self, z):
        """Tanh activation"""
        return np.tanh(z)

    def _tanh_derivative(self, a):
        """Derivative of tanh: tanh'(z) = 1 - tanh²(z) = 1 - a²"""
        return 1 - a**2

    def _relu(self, z):
        """ReLU activation: max(0, z)"""
        return np.maximum(0, z)

    def _relu_derivative(self, z):
        """Derivative of ReLU: 1 if z > 0, else 0"""
        return (z > 0).astype(float)

    def _activate(self, z, activation):
        """Apply activation function."""
        if activation == 'sigmoid':
            return self._sigmoid(z)
        elif activation == 'tanh':
            return self._tanh(z)
        elif activation == 'relu':
            return self._relu(z)
        else:
            raise ValueError(f"Unknown activation: {activation}")

    def _activate_derivative(self, z, a, activation):
        """Compute derivative of activation function."""
        if activation == 'sigmoid':
            return self._sigmoid_derivative(a)
        elif activation == 'tanh':
            return self._tanh_derivative(a)
        elif activation == 'relu':
            return self._relu_derivative(z)
        else:
            raise ValueError(f"Unknown activation: {activation}")

    # ==================== FORWARD PROPAGATION ====================

    def forward(self, X):
        """
        Forward propagation through the network.

        Parameters:
        -----------
        X : ndarray of shape (n_features, m_samples)
            Input data

        Returns:
        --------
        A_L : ndarray
            Output of the network
        """
        self.cache['A0'] = X
        A = X

        for l in range(1, self.L + 1):
            W = self.params[f'W{l}']
            b = self.params[f'b{l}']

            # Linear transformation
            Z = W @ A + b
            self.cache[f'Z{l}'] = Z

            # Activation
            A = self._activate(Z, self.activations[l-1])
            self.cache[f'A{l}'] = A

        return A

    # ==================== LOSS FUNCTION ====================

    def compute_loss(self, Y_pred, Y_true):
        """
        Compute binary cross-entropy loss.

        L = -1/m * Σ[y*log(ŷ) + (1-y)*log(1-ŷ)]
        """
        m = Y_true.shape[1]

        # Clip predictions to prevent log(0)
        Y_pred = np.clip(Y_pred, 1e-15, 1 - 1e-15)

        loss = -np.mean(
            Y_true * np.log(Y_pred) + (1 - Y_true) * np.log(1 - Y_pred)
        )

        return loss

    # ==================== BACKWARD PROPAGATION ====================

    def backward(self, Y):
        """
        Backward propagation to compute gradients.

        Parameters:
        -----------
        Y : ndarray of shape (n_outputs, m_samples)
            True labels
        """
        m = Y.shape[1]
        self.grads = {}

        # Output layer gradient (sigmoid + cross-entropy)
        A_L = self.cache[f'A{self.L}']
        dZ = A_L - Y  # This works specifically for sigmoid + cross-entropy

        for l in range(self.L, 0, -1):
            A_prev = self.cache[f'A{l-1}']

            # Gradients for weights and biases
            self.grads[f'dW{l}'] = (1/m) * (dZ @ A_prev.T)
            self.grads[f'db{l}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)

            if l > 1:  # Propagate to previous layer
                W = self.params[f'W{l}']
                Z_prev = self.cache[f'Z{l-1}']
                A_prev_layer = self.cache[f'A{l-1}']

                # dA for previous layer
                dA = W.T @ dZ

                # dZ for previous layer
                dZ = dA * self._activate_derivative(
                    Z_prev, A_prev_layer, self.activations[l-2]
                )

    # ==================== PARAMETER UPDATE ====================

    def update_parameters(self):
        """Update parameters using gradient descent."""
        for l in range(1, self.L + 1):
            self.params[f'W{l}'] -= self.lr * self.grads[f'dW{l}']
            self.params[f'b{l}'] -= self.lr * self.grads[f'db{l}']

    # ==================== TRAINING ====================

    def train(self, X, Y, epochs=1000, print_every=100, verbose=True):
        """
        Train the neural network.

        Parameters:
        -----------
        X : ndarray of shape (n_features, m_samples)
            Training data
        Y : ndarray of shape (n_outputs, m_samples)
            Labels
        epochs : int
            Number of training iterations
        print_every : int
            Print loss every N epochs
        verbose : bool
            Whether to print progress

        Returns:
        --------
        losses : list
            Loss at each epoch
        """
        losses = []

        for epoch in range(epochs):
            # Forward propagation
            Y_pred = self.forward(X)

            # Compute loss
            loss = self.compute_loss(Y_pred, Y)
            losses.append(loss)

            # Backward propagation
            self.backward(Y)

            # Update parameters
            self.update_parameters()

            # Print progress
            if verbose and (epoch % print_every == 0 or epoch == epochs - 1):
                accuracy = self.evaluate(X, Y)
                print(f"Epoch {epoch:5d} | Loss: {loss:.6f} | Accuracy: {accuracy:.2%}")

        return losses

    # ==================== PREDICTION & EVALUATION ====================

    def predict(self, X, threshold=0.5):
        """Make predictions."""
        Y_pred = self.forward(X)
        return (Y_pred >= threshold).astype(int)

    def evaluate(self, X, Y):
        """Calculate accuracy."""
        predictions = self.predict(X)
        return np.mean(predictions == Y)

print("\n✅ MLP Class defined successfully!")

✅ MLP Class defined successfully!

4.2 Testing with a Simple Example

Let’s verify our implementation with a simple example:

print("\n" + "=" * 60)
print("TESTING MLP WITH SIMPLE EXAMPLE")
print("=" * 60)

# Create a simple dataset: XOR problem
# XOR cannot be solved by a single perceptron - needs hidden layer!
X_simple = np.array([
    [0, 0, 1, 1],  # x1
    [0, 1, 0, 1]   # x2
])
Y_simple = np.array([
    [0, 1, 1, 0]   # XOR output
])

print(f"\n📊 XOR Dataset:")
print(f"Input X:\n{X_simple}")
print(f"Output Y: {Y_simple}")

# Create and train MLP
mlp_simple = MLP(
    layer_sizes=[2, 4, 1],  # 2 inputs, 4 hidden neurons, 1 output
    activations=['tanh', 'sigmoid'],
    learning_rate=0.5,
    seed=42
)

print(f"\n🏋️ Training...")
losses = mlp_simple.train(X_simple, Y_simple, epochs=5000, print_every=1000)

# Final predictions
print(f"\n🎯 Final Predictions:")
predictions = mlp_simple.forward(X_simple)
print(f"Input    | True | Predicted | Rounded")
print("-" * 45)
for i in range(X_simple.shape[1]):
    print(f"[{X_simple[0,i]}, {X_simple[1,i]}]   |  {Y_simple[0,i]}   |   {predictions[0,i]:.4f}   |    {int(predictions[0,i] > 0.5)}")

============================================================
TESTING MLP WITH SIMPLE EXAMPLE
============================================================

📊 XOR Dataset:
Input X:
[[0 0 1 1]
 [0 1 0 1]]
Output Y: [[0 1 1 0]]
============================================================
MLP INITIALIZED
============================================================
Architecture: 2 → 4 → 1
Activations: ['tanh', 'sigmoid']
Total parameters: 17
============================================================

🏋️ Training...
Epoch     0 | Loss: 0.694109 | Accuracy: 50.00%
Epoch  1000 | Loss: 0.006024 | Accuracy: 100.00%
Epoch  2000 | Loss: 0.002652 | Accuracy: 100.00%
Epoch  3000 | Loss: 0.001695 | Accuracy: 100.00%
Epoch  4000 | Loss: 0.001244 | Accuracy: 100.00%
Epoch  4999 | Loss: 0.000983 | Accuracy: 100.00%

🎯 Final Predictions:
Input    | True | Predicted | Rounded
---------------------------------------------
[0, 0]   |  0   |   0.0001   |    0
[0, 1]   |  1   |   0.9989   |    1
[1, 0]   |  1   |   0.9989   |    1
[1, 1]   |  0   |   0.0016   |    0
# Plot learning curve
plt.figure(figsize=(10, 5))
plt.plot(losses, 'b-', linewidth=1.5, alpha=0.7)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('XOR Learning Curve', fontsize=14, weight='bold')
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.tight_layout()
plt.show()


Part 5: Training on a Real Dataset (Difficult)

Now let’s test our MLP on a more realistic dataset!

5.1 Creating the Dataset

We’ll create a synthetic 2D classification dataset (implementing everything from scratch!):

def make_moons_manual(n_samples=500, noise=0.2, seed=42):
    """
    Create a moons dataset from scratch (no sklearn needed!)
    """
    np.random.seed(seed)
    n_samples_per_moon = n_samples // 2

    # First moon (upper)
    theta1 = np.linspace(0, np.pi, n_samples_per_moon)
    x1 = np.cos(theta1)
    y1 = np.sin(theta1)

    # Second moon (lower, shifted)
    theta2 = np.linspace(0, np.pi, n_samples_per_moon)
    x2 = 1 - np.cos(theta2)
    y2 = 1 - np.sin(theta2) - 0.5

    # Combine
    X = np.vstack([
        np.column_stack([x1, y1]),
        np.column_stack([x2, y2])
    ])

    # Add noise
    X += np.random.normal(0, noise, X.shape)

    # Labels
    y = np.hstack([np.zeros(n_samples_per_moon), np.ones(n_samples_per_moon)]).astype(int)

    # Shuffle
    indices = np.random.permutation(n_samples)
    return X[indices], y[indices]

def train_test_split_manual(X, y, test_size=0.2, seed=42):
    """Split data into train and test sets"""
    np.random.seed(seed)
    n_samples = len(y)
    n_test = int(n_samples * test_size)

    indices = np.random.permutation(n_samples)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def standardize(X_train, X_test):
    """Standardize features (zero mean, unit variance)"""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    X_train_scaled = (X_train - mean) / std
    X_test_scaled = (X_test - mean) / std
    return X_train_scaled, X_test_scaled, mean, std

# Create a moons dataset
X, y = make_moons_manual(n_samples=500, noise=0.2, seed=42)

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split_manual(X, y, test_size=0.2, seed=42)

# Standardize features
X_train_scaled, X_test_scaled, data_mean, data_std = standardize(X_train, X_test)

# Transpose for our MLP (features × samples)
X_train_T = X_train_scaled.T
X_test_T = X_test_scaled.T
y_train_T = y_train.reshape(1, -1)
y_test_T = y_test.reshape(1, -1)

print("=" * 60)
print("MOONS DATASET (Created from scratch!)")
print("=" * 60)
print(f"Training samples: {X_train_T.shape[1]}")
print(f"Test samples: {X_test_T.shape[1]}")
print(f"Features: {X_train_T.shape[0]}")
print(f"Class distribution (train): {np.bincount(y_train)}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training data
scatter = axes[0].scatter(X_train[:, 0], X_train[:, 1], c=y_train,
                          cmap='RdYlBu', edgecolors='black', alpha=0.7, s=50)
axes[0].set_xlabel('Feature 1', fontsize=11)
axes[0].set_ylabel('Feature 2', fontsize=11)
axes[0].set_title('Training Data', fontsize=13, weight='bold')
axes[0].grid(True, alpha=0.3)

# Test data
axes[1].scatter(X_test[:, 0], X_test[:, 1], c=y_test,
               cmap='RdYlBu', edgecolors='black', alpha=0.7, s=50)
axes[1].set_xlabel('Feature 1', fontsize=11)
axes[1].set_ylabel('Feature 2', fontsize=11)
axes[1].set_title('Test Data', fontsize=13, weight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
============================================================
MOONS DATASET (Created from scratch!)
============================================================
Training samples: 400
Test samples: 100
Features: 2
Class distribution (train): [201 199]

5.2 Training the MLP

# Create MLP with architecture suited for this problem
mlp = MLP(
    layer_sizes=[2, 16, 8, 1],  # 2 inputs → 16 → 8 → 1 output
    activations=['relu', 'relu', 'sigmoid'],
    learning_rate=0.1,
    seed=42
)

print(f"\n🏋️ Training on Moons Dataset...")
print("-" * 60)

losses = mlp.train(
    X_train_T, y_train_T,
    epochs=2000,
    print_every=200,
    verbose=True
)

# Evaluate on test set
test_accuracy = mlp.evaluate(X_test_T, y_test_T)
train_accuracy = mlp.evaluate(X_train_T, y_train_T)

print("\n" + "=" * 60)
print("FINAL RESULTS")
print("=" * 60)
print(f"Training Accuracy: {train_accuracy:.2%}")
print(f"Test Accuracy: {test_accuracy:.2%}")
print("=" * 60)
============================================================
MLP INITIALIZED
============================================================
Architecture: 2 → 16 → 8 → 1
Activations: ['relu', 'relu', 'sigmoid']
Total parameters: 193
============================================================

🏋️ Training on Moons Dataset...
------------------------------------------------------------
Epoch     0 | Loss: 1.035364 | Accuracy: 49.75%
Epoch   200 | Loss: 0.237045 | Accuracy: 90.25%
Epoch   400 | Loss: 0.181440 | Accuracy: 91.75%
Epoch   600 | Loss: 0.131691 | Accuracy: 95.50%
Epoch   800 | Loss: 0.097807 | Accuracy: 97.00%
Epoch  1000 | Loss: 0.077548 | Accuracy: 98.25%
Epoch  1200 | Loss: 0.066954 | Accuracy: 98.50%
Epoch  1400 | Loss: 0.060750 | Accuracy: 98.50%
Epoch  1600 | Loss: 0.056776 | Accuracy: 98.50%
Epoch  1800 | Loss: 0.053972 | Accuracy: 98.50%
Epoch  1999 | Loss: 0.051922 | Accuracy: 98.50%

============================================================
FINAL RESULTS
============================================================
Training Accuracy: 98.50%
Test Accuracy: 96.00%
============================================================

5.3 Visualizing the Decision Boundary

def plot_decision_boundary(model, X, y, mean, std, title="Decision Boundary"):
    """Plot the decision boundary of a trained model."""
    # Create a mesh grid
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))

    # Scale the mesh points using our manual standardization
    mesh_points = np.c_[xx.ravel(), yy.ravel()]
    mesh_scaled = (mesh_points - mean) / std

    # Predict on mesh
    Z = model.forward(mesh_scaled.T)
    Z = Z.reshape(xx.shape)

    # Plot
    fig, ax = plt.subplots(figsize=(10, 8))

    # Decision boundary
    contour = ax.contourf(xx, yy, Z, levels=np.linspace(0, 1, 11),
                          cmap='RdYlBu', alpha=0.8)
    plt.colorbar(contour, ax=ax, label='Prediction Probability')

    # Decision threshold
    ax.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2, linestyles='--')

    # Data points
    scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu',
                        edgecolors='black', linewidth=1, s=60, alpha=0.9)

    ax.set_xlabel('Feature 1', fontsize=12)
    ax.set_ylabel('Feature 2', fontsize=12)
    ax.set_title(title, fontsize=14, weight='bold')
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

# Plot decision boundary on test set
plot_decision_boundary(mlp, X_test, y_test, data_mean, data_std,
                       f"MLP Decision Boundary (Test Accuracy: {test_accuracy:.2%})")

5.4 Visualizing Learning Progress

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
axes[0].plot(losses, 'b-', linewidth=1.5, alpha=0.8)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Training Loss Over Time', fontsize=13, weight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].set_yscale('log')

# Loss curve (zoomed - last 500 epochs)
axes[1].plot(range(len(losses)-500, len(losses)), losses[-500:],
             'b-', linewidth=1.5, alpha=0.8)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Loss', fontsize=12)
axes[1].set_title('Training Loss (Last 500 Epochs)', fontsize=13, weight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


Part 6: Experiments and Insights

6.1 Effect of Network Architecture

Let’s experiment with different architectures:

def experiment_architectures():
    """Compare different network architectures."""
    architectures = [
        ([2, 4, 1], ['relu', 'sigmoid'], "Shallow (2→4→1)"),
        ([2, 8, 1], ['relu', 'sigmoid'], "Medium (2→8→1)"),
        ([2, 16, 1], ['relu', 'sigmoid'], "Wide (2→16→1)"),
        ([2, 8, 8, 1], ['relu', 'relu', 'sigmoid'], "Deep (2→8→8→1)"),
        ([2, 16, 8, 1], ['relu', 'relu', 'sigmoid'], "Wide-Deep (2→16→8→1)"),
    ]

    results = []

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()

    for idx, (layers, activations, name) in enumerate(architectures):
        print(f"\n🔬 Testing: {name}")

        # Create and train
        model = MLP(layer_sizes=layers, activations=activations,
                   learning_rate=0.1, seed=42)
        losses = model.train(X_train_T, y_train_T, epochs=1500,
                            print_every=1500, verbose=False)

        # Evaluate
        train_acc = model.evaluate(X_train_T, y_train_T)
        test_acc = model.evaluate(X_test_T, y_test_T)

        results.append({
            'architecture': name,
            'train_acc': train_acc,
            'test_acc': test_acc,
            'final_loss': losses[-1]
        })

        print(f"   Train: {train_acc:.2%} | Test: {test_acc:.2%}")

        # Plot loss curve
        axes[idx].plot(losses, linewidth=1.5)
        axes[idx].set_title(f'{name}\nTest Acc: {test_acc:.2%}', fontsize=10)
        axes[idx].set_xlabel('Epoch', fontsize=9)
        axes[idx].set_ylabel('Loss', fontsize=9)
        axes[idx].grid(True, alpha=0.3)
        axes[idx].set_yscale('log')

    # Hide unused subplot
    axes[-1].axis('off')

    plt.tight_layout()
    plt.show()

    # Summary table
    print("\n" + "=" * 70)
    print("ARCHITECTURE COMPARISON SUMMARY")
    print("=" * 70)
    print(f"{'Architecture':<25} {'Train Acc':>12} {'Test Acc':>12} {'Final Loss':>12}")
    print("-" * 70)
    for r in results:
        print(f"{r['architecture']:<25} {r['train_acc']:>11.2%} {r['test_acc']:>11.2%} {r['final_loss']:>12.6f}")
    print("=" * 70)

experiment_architectures()

🔬 Testing: Shallow (2→4→1)
============================================================
MLP INITIALIZED
============================================================
Architecture: 2 → 4 → 1
Activations: ['relu', 'sigmoid']
Total parameters: 17
============================================================
   Train: 95.50% | Test: 93.00%

🔬 Testing: Medium (2→8→1)
============================================================
MLP INITIALIZED
============================================================
Architecture: 2 → 8 → 1
Activations: ['relu', 'sigmoid']
Total parameters: 33
============================================================
   Train: 93.25% | Test: 90.00%

🔬 Testing: Wide (2→16→1)
============================================================
MLP INITIALIZED
============================================================
Architecture: 2 → 16 → 1
Activations: ['relu', 'sigmoid']
Total parameters: 65
============================================================
   Train: 96.50% | Test: 94.00%

🔬 Testing: Deep (2→8→8→1)
============================================================
MLP INITIALIZED
============================================================
Architecture: 2 → 8 → 8 → 1
Activations: ['relu', 'relu', 'sigmoid']
Total parameters: 105
============================================================
   Train: 98.00% | Test: 97.00%

🔬 Testing: Wide-Deep (2→16→8→1)
============================================================
MLP INITIALIZED
============================================================
Architecture: 2 → 16 → 8 → 1
Activations: ['relu', 'relu', 'sigmoid']
Total parameters: 193
============================================================
   Train: 98.50% | Test: 96.00%


======================================================================
ARCHITECTURE COMPARISON SUMMARY
======================================================================
Architecture                 Train Acc     Test Acc   Final Loss
----------------------------------------------------------------------
Shallow (2→4→1)                95.50%      93.00%     0.123278
Medium (2→8→1)                 93.25%      90.00%     0.165954
Wide (2→16→1)                  96.50%      94.00%     0.102576
Deep (2→8→8→1)                 98.00%      97.00%     0.057835
Wide-Deep (2→16→8→1)           98.50%      96.00%     0.058604
======================================================================

6.2 Effect of Learning Rate

def experiment_learning_rates():
    """Compare different learning rates."""
    learning_rates = [0.001, 0.01, 0.1, 0.5, 1.0]

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    results = []

    for lr in learning_rates:
        model = MLP(layer_sizes=[2, 16, 8, 1],
                   activations=['relu', 'relu', 'sigmoid'],
                   learning_rate=lr, seed=42)
        losses = model.train(X_train_T, y_train_T, epochs=1000,
                            print_every=1000, verbose=False)

        test_acc = model.evaluate(X_test_T, y_test_T)
        results.append({'lr': lr, 'losses': losses, 'test_acc': test_acc})

        axes[0].plot(losses, label=f'lr={lr}', linewidth=1.5)

    axes[0].set_xlabel('Epoch', fontsize=12)
    axes[0].set_ylabel('Loss', fontsize=12)
    axes[0].set_title('Loss vs Learning Rate', fontsize=13, weight='bold')
    axes[0].legend(fontsize=10)
    axes[0].grid(True, alpha=0.3)
    axes[0].set_yscale('log')

    # Bar plot of final accuracies
    lrs = [str(r['lr']) for r in results]
    accs = [r['test_acc'] for r in results]
    colors = ['green' if acc > 0.8 else 'orange' if acc > 0.6 else 'red' for acc in accs]

    bars = axes[1].bar(lrs, accs, color=colors, alpha=0.7, edgecolor='black')
    axes[1].set_xlabel('Learning Rate', fontsize=12)
    axes[1].set_ylabel('Test Accuracy', fontsize=12)
    axes[1].set_title('Test Accuracy vs Learning Rate', fontsize=13, weight='bold')
    axes[1].set_ylim(0, 1)
    axes[1].grid(True, alpha=0.3, axis='y')

    # Add value labels
    for bar, acc in zip(bars, accs):
        axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                    f'{acc:.1%}', ha='center', fontsize=10, weight='bold')

    plt.tight_layout()
    plt.show()

    print("\n" + "=" * 50)
    print("LEARNING RATE COMPARISON")
    print("=" * 50)
    for r in results:
        status = "✅" if r['test_acc'] > 0.8 else "⚠️" if r['test_acc'] > 0.6 else "❌"
        print(f"lr = {r['lr']:5} → Test Accuracy: {r['test_acc']:.2%} {status}")
    print("=" * 50)

experiment_learning_rates()
============================================================
MLP INITIALIZED
============================================================
Architecture: 2 → 16 → 8 → 1
Activations: ['relu', 'relu', 'sigmoid']
Total parameters: 193
============================================================
============================================================
MLP INITIALIZED
============================================================
Architecture: 2 → 16 → 8 → 1
Activations: ['relu', 'relu', 'sigmoid']
Total parameters: 193
============================================================
============================================================
MLP INITIALIZED
============================================================
Architecture: 2 → 16 → 8 → 1
Activations: ['relu', 'relu', 'sigmoid']
Total parameters: 193
============================================================
============================================================
MLP INITIALIZED
============================================================
Architecture: 2 → 16 → 8 → 1
Activations: ['relu', 'relu', 'sigmoid']
Total parameters: 193
============================================================
============================================================
MLP INITIALIZED
============================================================
Architecture: 2 → 16 → 8 → 1
Activations: ['relu', 'relu', 'sigmoid']
Total parameters: 193
============================================================


==================================================
LEARNING RATE COMPARISON
==================================================
lr = 0.001 → Test Accuracy: 80.00% ⚠️
lr =  0.01 → Test Accuracy: 83.00% ✅
lr =   0.1 → Test Accuracy: 96.00% ✅
lr =   0.5 → Test Accuracy: 96.00% ✅
lr =   1.0 → Test Accuracy: 96.00% ✅
==================================================

Part 7: Summary and Key Takeaways

7.1 What We Learned

Key Takeaways
  1. Neural Network Basics:
    • Neurons compute weighted sums + bias, then apply activation functions
    • Multiple layers allow learning complex, non-linear patterns
    • Matrix notation makes computation efficient
  2. Forward Propagation:
    • \(\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\)
    • \(\mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})\)
    • Information flows from input to output
  3. Backpropagation:
    • Uses chain rule to compute gradients
    • Gradients flow backward from loss to parameters
    • \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}} \cdot (\mathbf{a}_{prev})^T\)
  4. Training:
    • Gradient descent updates: \(\theta := \theta - \alpha \frac{\partial \mathcal{L}}{\partial \theta}\)
    • Learning rate is crucial: too small = slow, too large = diverge
    • Architecture affects both capacity and trainability

7.2 Quick Reference

print("=" * 70)
print("NEURAL NETWORK QUICK REFERENCE")
print("=" * 70)

print("\n📐 FORWARD PROPAGATION")
print("-" * 40)
print("For layer l = 1 to L:")
print("  z^[l] = W^[l] · a^[l-1] + b^[l]")
print("  a^[l] = σ(z^[l])")

print("\n⬅️ BACKWARD PROPAGATION")
print("-" * 40)
print("Output layer (sigmoid + cross-entropy):")
print("  δ^[L] = a^[L] - y")
print("")
print("For layer l = L to 1:")
print("  dW^[l] = (1/m) · δ^[l] · (a^[l-1])^T")
print("  db^[l] = (1/m) · Σ δ^[l]")
print("  δ^[l-1] = (W^[l])^T · δ^[l] ⊙ σ'(z^[l-1])")

print("\n🔄 PARAMETER UPDATE")
print("-" * 40)
print("W^[l] := W^[l] - α · dW^[l]")
print("b^[l] := b^[l] - α · db^[l]")

print("\n📊 ACTIVATION FUNCTIONS")
print("-" * 40)
print(f"{'Function':<12} {'Formula':<30} {'Derivative'}")
print("-" * 60)
print(f"{'Sigmoid':<12} {'1/(1+e^(-z))':<30} {'σ(z)(1-σ(z))'}")
print(f"{'Tanh':<12} {'(e^z-e^(-z))/(e^z+e^(-z))':<30} {'1-tanh²(z)'}")
print(f"{'ReLU':<12} {'max(0, z)':<30} {'1 if z>0, else 0'}")

print("\n" + "=" * 70)
======================================================================
NEURAL NETWORK QUICK REFERENCE
======================================================================

📐 FORWARD PROPAGATION
----------------------------------------
For layer l = 1 to L:
  z^[l] = W^[l] · a^[l-1] + b^[l]
  a^[l] = σ(z^[l])

⬅️ BACKWARD PROPAGATION
----------------------------------------
Output layer (sigmoid + cross-entropy):
  δ^[L] = a^[L] - y

For layer l = L to 1:
  dW^[l] = (1/m) · δ^[l] · (a^[l-1])^T
  db^[l] = (1/m) · Σ δ^[l]
  δ^[l-1] = (W^[l])^T · δ^[l] ⊙ σ'(z^[l-1])

🔄 PARAMETER UPDATE
----------------------------------------
W^[l] := W^[l] - α · dW^[l]
b^[l] := b^[l] - α · db^[l]

📊 ACTIVATION FUNCTIONS
----------------------------------------
Function     Formula                        Derivative
------------------------------------------------------------
Sigmoid      1/(1+e^(-z))                   σ(z)(1-σ(z))
Tanh         (e^z-e^(-z))/(e^z+e^(-z))      1-tanh²(z)
ReLU         max(0, z)                      1 if z>0, else 0

======================================================================

7.3 What’s Next?

Congratulations! You’ve Built a Neural Network from Scratch! 🎉

What you’ve accomplished:

  • ✅ Understood the mathematics of neural networks
  • ✅ Implemented forward propagation with matrices
  • ✅ Derived and implemented backpropagation
  • ✅ Built a complete MLP class from scratch
  • ✅ Trained on real datasets and visualized results

Next steps to explore:

  1. Regularization: L2 regularization, dropout
  2. Optimization: Adam, RMSprop, momentum
  3. Batch Normalization: Stabilize training
  4. Convolutional Networks: For image data
  5. Recurrent Networks: For sequential data

Practice Problems:


Problem 1: Multi-class Classification Extend the MLP to handle multi-class classification using softmax activation and cross-entropy loss.

Problem 2: Regularization Implement L2 regularization. Add a term \(\frac{\lambda}{2m} \sum \|\mathbf{W}\|^2\) to the loss.

Problem 3: Momentum Implement gradient descent with momentum: \[v := \beta v + (1-\beta) \nabla_\theta J\] \[\theta := \theta - \alpha v\]

Problem 4: Learning Rate Decay Implement learning rate decay: \(\alpha_t = \frac{\alpha_0}{1 + \mathrm{decay\_rate} \times \mathrm{epoch}}\)

Problem 5: Different Dataset Create a circles dataset (two concentric circles) from scratch and train the MLP on it. Compare results with the moons dataset.


Happy Learning! May your gradients always flow smoothly! 🚀

Questions or feedback? Feel free to reach out or open an issue on GitHub!