2.3. Gradients and Their Role in Learning: The Gradient Descent Method

 

Partial Derivatives

For a scalar function \(f(x, y, z)\) of multiple variables, the derivative with respect to a single variable is called a partial derivative. It is denoted as: \[ \frac{\partial f}{\partial x}, \quad \frac{\partial f}{\partial y}, \quad \frac{\partial f}{\partial z}. \] The partial derivative with respect to \(x\) is defined as: \[ \frac{\partial f}{\partial x} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x, y, z) - f(x, y, z)}{\Delta x}, \] where \(y\) and \(z\) are held constant. Partial derivatives measure the rate of change of \(f(x, y, z)\) with respect to one variable, while treating the other variables as constants. They are fundamental in multivariable calculus and have applications in fields such as physics, engineering, and medical imaging. 
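The limit definition above can be approximated numerically by choosing a small but finite \(\Delta x\). The sketch below is only an illustration: the helper name `partial_derivative`, the step size `h`, and the test function \(f(x, y, z) = x^2 y + z\) are assumptions made for this example, not part of the text above.

```python
def partial_derivative(f, point, index, h=1e-6):
    """Forward-difference estimate of the partial derivative of f at `point`
    with respect to the coordinate at position `index`.

    All other coordinates are held constant, as in the limit definition.
    """
    shifted = list(point)
    shifted[index] += h  # perturb only the chosen variable
    return (f(*shifted) - f(*point)) / h


def f(x, y, z):
    # Hypothetical test function: df/dx = 2xy, df/dy = x^2, df/dz = 1
    return x**2 * y + z


point = (1.0, 2.0, 3.0)
df_dx = partial_derivative(f, point, index=0)  # close to 2 * 1 * 2 = 4
df_dy = partial_derivative(f, point, index=1)  # close to 1^2 = 1
df_dz = partial_derivative(f, point, index=2)  # close to 1
print(df_dx, df_dy, df_dz)
```

The estimates differ from the exact values by a small error that shrinks with `h`, until floating-point rounding starts to dominate.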

  • Example. The relationship between blood pressure, cardiac output, and vascular resistance can be modeled with an analogue of Ohm's law for the circulatory system: \[ P = Q \cdot R, \] where \(P\) is the blood pressure, \(Q\) is the cardiac output (volume of blood pumped per minute), and \(R\) is the vascular resistance. Because \(P\) depends on both \(Q\) and \(R\), its sensitivity to each quantity is described by a partial derivative. If vascular resistance \(R\) depends on time \(t\), and cardiac output \(Q\) depends on heart rate \(HR\), which itself depends on time \(t\), then the chain rule gives the total rate of change of blood pressure: \[ \frac{dP}{dt} = \frac{\partial P}{\partial Q} \cdot \underbrace{\frac{dQ}{dHR} \cdot \frac{dHR}{dt}}_{\frac{dQ}{dt}}+ \frac{\partial P}{\partial R} \cdot \frac{dR}{dt}, \] where:
    • \(\frac{\partial P}{\partial Q}\) represents the sensitivity of blood pressure to changes in cardiac output,
    • \(\frac{dQ}{dHR}\) represents the effect of heart rate on cardiac output,
    • \(\frac{dHR}{dt}\) represents the rate of change of heart rate with time,
    • \(\frac{\partial P}{\partial R}\) represents the sensitivity of blood pressure to changes in vascular resistance,
    • \(\frac{dR}{dt}\) represents the rate of change of vascular resistance with time.
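Since \(P = Q \cdot R\) in this simplified model, the two partial derivatives can be written out explicitly, which makes the total derivative easier to read: \[ \frac{\partial P}{\partial Q} = R, \qquad \frac{\partial P}{\partial R} = Q, \qquad \text{so} \qquad \frac{dP}{dt} = R \cdot \frac{dQ}{dHR} \cdot \frac{dHR}{dt} + Q \cdot \frac{dR}{dt}. \] In words, a change in cardiac output affects pressure in proportion to the current resistance, and a change in resistance affects pressure in proportion to the current cardiac output.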

Gradient

The gradient of a scalar function \(f(x, y, z)\) is a vector that points in the direction of the steepest rate of increase of the function. It is defined as: \[ \nabla f = \frac{\partial f}{\partial x} \hat{i} + \frac{\partial f}{\partial y} \hat{j} + \frac{\partial f}{\partial z} \hat{k} = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z} \right), \] where \(\frac{\partial f}{\partial x}\), \(\frac{\partial f}{\partial y}\), and \(\frac{\partial f}{\partial z}\) are the partial derivatives of \(f(x, y, z)\), and \(\hat{i}\), \(\hat{j}\), and \(\hat{k}\) are the unit vectors in the \(x\)-, \(y\)-, and \(z\)-directions, respectively. For a scalar function \(f(\mathbf{x})\) with \(\mathbf{x} = (x_1, x_2, \dots, x_n)\) in \(n\)-dimensional space, the gradient is given as: \[ \nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right), \] where \(\frac{\partial f}{\partial x_i}\) represents the partial derivative of \(f(\mathbf{x})\) with respect to the \(i\)-th component of \(\mathbf{x}\). The gradient is a vector field that provides both the direction and magnitude of the steepest ascent of the function \(f\). The magnitude of the gradient, \(\|\nabla f\|\), quantifies the steepness of the increase and is calculated as: \[ \|\nabla f\| = \sqrt{\sum_{i=1}^n \left( \frac{\partial f}{\partial x_i} \right)^2}. \] The gradient is an essential tool in fields such as physics, optimization, and medical imaging, where it is used to analyze changes in scalar fields like temperature, pressure, or image intensity. 
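As a rough numerical illustration of the definitions above, the gradient can be estimated one component at a time and its magnitude computed from the components. This is a sketch only: the helper names `gradient` and `norm`, the central-difference scheme, and the test function \(f(\mathbf{x}) = x_1^2 + 3x_2^2 + x_3\) are assumptions chosen for the example.

```python
import math


def gradient(f, x, h=1e-6):
    """Central-difference estimate of the gradient of f at the point x.

    f takes a list of n numbers and returns a scalar.
    """
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def norm(v):
    """Euclidean norm: the square root of the sum of squared components."""
    return math.sqrt(sum(c * c for c in v))


def f(x):
    # Hypothetical test function with exact gradient (2*x1, 6*x2, 1)
    return x[0] ** 2 + 3 * x[1] ** 2 + x[2]


g = gradient(f, [1.0, 1.0, 0.0])  # close to (2, 6, 1)
steepness = norm(g)               # close to sqrt(2^2 + 6^2 + 1^2) = sqrt(41)
print(g, steepness)
```

At this point the function increases fastest along a direction dominated by the second coordinate, because that partial derivative is the largest.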

  • Linear Regression with Gradient Descent. Let \(\boldsymbol{\Theta} = (\theta_0, \theta_1) \in \mathbb{R}^2\), and define the linear function: \[ f(x; \boldsymbol{\Theta}) = \theta_1 x + \theta_0 \in \mathbb{R}. \] The goal of linear regression is to find the parameters \(\boldsymbol{\Theta} = (\theta_0, \theta_1)\) that minimize the mean squared loss function: \[ \mathcal{L}(\boldsymbol{\Theta}) = \frac{1}{N} \sum_{k=1}^N \left( \theta_1 x_k + \theta_0 - y_k \right)^2, \] where \(x_k\) and \(y_k\) are the input and target values, respectively, and \(N\) is the total number of samples. The optimization objective is: \[ \boldsymbol{\Theta}^{*} = \underset{\boldsymbol{\Theta}}{\arg\min} \, \mathcal{L}(\boldsymbol{\Theta}). \] Here, the term "argmin" stands for "argument of the minimum": it returns the value of \(\boldsymbol{\Theta}\) that minimizes \(\mathcal{L}(\boldsymbol{\Theta})\), denoted \(\boldsymbol{\Theta}^{*}\).

    • Gradient of the loss function. To minimize the loss function, we compute the gradient \(\nabla_{\boldsymbol{\Theta}} \mathcal{L}(\boldsymbol{\Theta})\), with components ordered as \((\partial \mathcal{L} / \partial \theta_0, \, \partial \mathcal{L} / \partial \theta_1)\): \[ \nabla_{\boldsymbol{\Theta}} \mathcal{L} (\boldsymbol{\Theta}) = \frac{1}{N} \sum_{k=1}^N 2\left( \theta_1 x_k + \theta_0 - y_k \right) \begin{bmatrix} 1 \\ x_k \end{bmatrix}. \] 
    • Gradient descent update rule. Using gradient descent, the parameters are updated iteratively as: \[ \boldsymbol{\Theta}^{(n+1)} = \boldsymbol{\Theta}^{(n)} - \alpha \nabla_{\boldsymbol{\Theta}} \mathcal{L}(\boldsymbol{\Theta}^{(n)}), \] where: 
      • \(\boldsymbol{\Theta}^{(n)}\) represents the parameters at the \(n\)-th iteration,
      • \(\alpha\) is the learning rate (a small positive scalar), 
      • \(\nabla_{\boldsymbol{\Theta}} \mathcal{L}(\boldsymbol{\Theta}^{(n)})\) is the gradient of the loss function with respect to the parameters, evaluated at the current iterate \(\boldsymbol{\Theta}^{(n)}\).
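The gradient formula and update rule for linear regression can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the function names, the toy data generated from \(y = 2x + 1\), the learning rate `alpha=0.05`, and the fixed number of steps are all choices made for this example rather than recommendations.

```python
def loss_gradient(theta0, theta1, xs, ys):
    """Gradient of the mean squared loss with respect to (theta0, theta1)."""
    n = len(xs)
    residuals = [theta1 * x + theta0 - y for x, y in zip(xs, ys)]
    g0 = sum(2 * r for r in residuals) / n                    # dL/dtheta0
    g1 = sum(2 * r * x for r, x in zip(residuals, xs)) / n    # dL/dtheta1
    return g0, g1


def fit_line(xs, ys, alpha=0.05, steps=5000):
    """Run a fixed number of gradient descent steps starting from (0, 0)."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        g0, g1 = loss_gradient(theta0, theta1, xs, ys)
        theta0 -= alpha * g0
        theta1 -= alpha * g1
    return theta0, theta1


# Toy data (an assumption for this sketch): noise-free samples of y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta0, theta1 = fit_line(xs, ys)
print(theta0, theta1)  # approaches the intercept 1 and the slope 2
```

If the learning rate is chosen too large for the data, the iterates diverge instead of converging, so in practice \(\alpha\) usually needs some tuning.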
  • Polynomial Regression with Gradient Descent. Let \(\boldsymbol{\Theta} = (\theta_0, \theta_1, \dots, \theta_L) \in \mathbb{R}^{L+1}\) be the parameters of the model. The polynomial regression function of degree \(L\) is defined as: \[ f(x; \boldsymbol{\Theta}) = \theta_L x^L + \dots + \theta_1 x + \theta_0. \] 

    •  The goal of polynomial regression is to find the parameters \(\boldsymbol{\Theta}\) that minimize the mean squared loss function: \[ \mathcal{L}(\boldsymbol{\Theta}) = \frac{1}{N} \sum_{k=1}^N \left( f(x_k; \boldsymbol{\Theta}) - y_k \right)^2, \] where: 
      • \(N\) is the number of data points,
      • \(x_k\) and \(y_k\) represent the input and target values for the \(k\)-th data point. 
    • Gradient of the loss function. To minimize the loss function using gradient descent, we compute the gradient \(\nabla \mathcal{L}(\boldsymbol{\Theta})\) with respect to \(\boldsymbol{\Theta}\): \[ \nabla \mathcal{L}(\boldsymbol{\Theta}) = \frac{1}{N} \sum_{k=1}^N 2 \left( f(x_k; \boldsymbol{\Theta}) - y_k \right) \begin{bmatrix} 1 \\ x_k \\ \vdots \\ x_k^L \end{bmatrix}. \] 
    • Gradient descent update rule. The parameters are updated iteratively using the gradient descent rule: \[ \boldsymbol{\Theta}^{(n+1)} = \boldsymbol{\Theta}^{(n)} - \alpha \nabla_{\boldsymbol{\Theta}} \mathcal{L}(\boldsymbol{\Theta}^{(n)}), \] where: 
      • \(\boldsymbol{\Theta}^{(n)}\) is the parameter vector at iteration \(n\),
      • \(\alpha\) is the learning rate (a small positive scalar),
      • \(\nabla_{\boldsymbol{\Theta}} \mathcal{L}(\boldsymbol{\Theta}^{(n)})\) is the gradient of the loss function, evaluated at the current parameters.
    • Optimization objective. The optimization objective is to solve: \[ \boldsymbol{\Theta}^{*} = \underset{\boldsymbol{\Theta}}{\arg\min} \, \mathcal{L}(\boldsymbol{\Theta}), \] where \(\boldsymbol{\Theta}^{*}\) denotes the minimizing parameter vector.
    •  Overfitting. Although higher-degree polynomials can fit the training data very well, they are prone to overfitting. Overfitting occurs when the model captures noise in the training data, leading to poor generalization on unseen test data.
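The polynomial case differs from the linear one only in the feature vector \((1, x_k, \dots, x_k^L)\) that multiplies each residual. The sketch below illustrates this in plain Python; the function names, the toy data sampled from \(y = x^2 - x + 0.5\), the learning rate, and the step count are assumptions made for the example.

```python
def predict(theta, x):
    """Evaluate theta[0] + theta[1]*x + ... + theta[L]*x**L."""
    return sum(t * x**j for j, t in enumerate(theta))


def loss_gradient(theta, xs, ys):
    """Gradient of the mean squared loss with respect to every theta[j]."""
    n = len(xs)
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        residual = predict(theta, x) - y
        for j in range(len(theta)):
            grad[j] += 2 * residual * x**j / n  # j-th feature is x**j
    return grad


def fit_polynomial(xs, ys, degree, alpha=0.1, steps=5000):
    """Fixed-step gradient descent starting from the zero vector."""
    theta = [0.0] * (degree + 1)
    for _ in range(steps):
        grad = loss_gradient(theta, xs, ys)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Toy data (an assumption for this sketch): noise-free samples of
# y = x^2 - x + 0.5 on a small grid in [-1, 1]
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [x**2 - x + 0.5 for x in xs]

theta = fit_polynomial(xs, ys, degree=2)
print(theta)  # approaches [0.5, -1.0, 1.0]
```

Keeping the inputs in a small range such as \([-1, 1]\) is a deliberate choice here: high powers of large inputs make the gradient components differ by orders of magnitude, which forces a much smaller learning rate.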
  • Gradient Descent for a Two-Layer Neural Network. We are given the parameters \(\boldsymbol{\Theta} = (\theta_{10}, \theta_{11}, \theta_{20}, \theta_{21}) \in \mathbb{R}^4\) and the two-layer function: \[ f_1(x) = \sigma(\theta_{11} x + \theta_{10}), \quad f_2(u) = \sigma(\theta_{21} u + \theta_{20}), \quad f(x; \boldsymbol{\Theta}) = f_2 \circ f_1(x) = \sigma(\theta_{21} f_1(x) + \theta_{20}), \] where \(\sigma(\cdot)\) is the activation function. Here, we use the ReLU activation function, defined as: \[ \sigma(z) = \max(0, z). \] ReLU is not differentiable at \(z = 0\), so we adopt the common convention \(\sigma'(0) = 0\) and take its derivative to be: \[ \sigma'(z) = \begin{cases} 1, & \text{if } z > 0, \\ 0, & \text{if } z \leq 0. \end{cases} \] The loss function to minimize is the mean squared error: \[ \mathcal{L}(\boldsymbol{\Theta}) = \frac{1}{N} \sum_{k=1}^N \left( f(x_k; \boldsymbol{\Theta}) - y_k \right)^2, \] where \(\{ (x_k, y_k) : k = 1, \dots, N \}\) represents the training data.
    • Gradient descent update rule. To minimize \(\mathcal{L}(\boldsymbol{\Theta})\), the parameters are updated iteratively in the direction opposite to the gradient: \[ \boldsymbol{\Theta}^{(n+1)} = \boldsymbol{\Theta}^{(n)} - \alpha \nabla_{\boldsymbol{\Theta}} \mathcal{L}(\boldsymbol{\Theta}^{(n)}), \] where \(\alpha > 0\) is the learning rate. This requires the partial derivative of the loss function with respect to each parameter in \(\boldsymbol{\Theta}\).
    • Gradient of the loss function using the chain rule. The partial derivatives of \(\mathcal{L}(\boldsymbol{\Theta})\) with respect to the parameters are as follows:
      •  Partial derivative with respect to \(\theta_{21}\): \[ \frac{\partial \mathcal{L}}{\partial \theta_{21}} = \frac{2}{N} \sum_{k=1}^N \left( f(x_k; \boldsymbol{\Theta}) - y_k \right) \sigma'(\theta_{21} f_1(x_k) + \theta_{20}) f_1(x_k). \] 
      •  Partial derivative with respect to \(\theta_{20}\): \[ \frac{\partial \mathcal{L}}{\partial \theta_{20}} = \frac{2}{N} \sum_{k=1}^N \left( f(x_k; \boldsymbol{\Theta}) - y_k \right) \sigma'(\theta_{21} f_1(x_k) + \theta_{20}). \] 
      •  Partial derivative with respect to \(\theta_{11}\) (first layer weight): \[ \frac{\partial \mathcal{L}}{\partial \theta_{11}} = \frac{2}{N} \sum_{k=1}^N \left( f(x_k; \boldsymbol{\Theta}) - y_k \right) \sigma'(\theta_{21} f_1(x_k) + \theta_{20}) \theta_{21} \sigma'(\theta_{11} x_k + \theta_{10}) x_k. \] 
      • Partial derivative with respect to \(\theta_{10}\) (first layer bias): \[ \frac{\partial \mathcal{L}}{\partial \theta_{10}} = \frac{2}{N} \sum_{k=1}^N \left( f(x_k; \boldsymbol{\Theta}) - y_k \right) \sigma'(\theta_{21} f_1(x_k) + \theta_{20}) \theta_{21} \sigma'(\theta_{11} x_k + \theta_{10}). \]  
    • Gradient descent iterative updates. The parameters \(\boldsymbol{\Theta} = (\theta_{10}, \theta_{11}, \theta_{20}, \theta_{21})\) are updated iteratively as: \[ \theta_{ij}^{(n+1)} = \theta_{ij}^{(n)} - \alpha \frac{\partial \mathcal{L}}{\partial \theta_{ij}}\left(\boldsymbol{\Theta}^{(n)}\right), \] where \(\theta_{ij}\) represents any parameter in the vector \(\boldsymbol{\Theta}\) and each partial derivative is evaluated at the current parameters \(\boldsymbol{\Theta}^{(n)}\).
    • Importance of gradients in neural network learning. Gradients play a central role in neural network learning by guiding the optimization process. The gradient of the loss function with respect to each parameter provides information about the direction and rate of change of the loss. Specifically:
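      • Gradients indicate how to adjust the parameters to reduce the loss. The gradient points in the direction of steepest increase of the loss, so its negative, \(-\nabla_{\boldsymbol{\Theta}} \mathcal{L}\), points in the direction of steepest decrease. 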
      • By iteratively updating parameters using the gradient descent rule, the neural network learns to approximate the target function or minimize prediction errors. 
      • The chain rule allows gradients to propagate backward through layers, enabling the adjustment of weights in multi-layer networks (backpropagation). 
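The four chain-rule derivatives for the two-layer ReLU network can be checked numerically against finite differences, which is a common sanity test for hand-derived gradients. The sketch below is an illustration under stated assumptions: the function names, the starting parameters, the three training points, and the learning rate are all hypothetical values picked for the example.

```python
def relu(z):
    return max(0.0, z)


def relu_prime(z):
    # Convention used in the text: the derivative is 0 for z <= 0
    return 1.0 if z > 0 else 0.0


def forward(theta, x):
    """Return the pre-activations and outputs of both layers."""
    t10, t11, t20, t21 = theta
    z1 = t11 * x + t10
    f1 = relu(z1)
    z2 = t21 * f1 + t20
    return z1, f1, z2, relu(z2)


def loss(theta, xs, ys):
    return sum((forward(theta, x)[3] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def loss_gradient(theta, xs, ys):
    """Chain-rule gradient, ordered as (theta10, theta11, theta20, theta21)."""
    t21 = theta[3]
    n = len(xs)
    g10 = g11 = g20 = g21 = 0.0
    for x, y in zip(xs, ys):
        z1, f1, z2, out = forward(theta, x)
        outer = 2 * (out - y) * relu_prime(z2) / n  # factor shared by all four
        g21 += outer * f1
        g20 += outer
        inner = outer * t21 * relu_prime(z1)        # propagated to layer one
        g11 += inner * x
        g10 += inner
    return [g10, g11, g20, g21]


def numeric_gradient(theta, xs, ys, h=1e-6):
    """Central-difference estimate, used only to check the formulas."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        grad.append((loss(up, xs, ys) - loss(down, xs, ys)) / (2 * h))
    return grad


# Hypothetical parameters and data, chosen so that no pre-activation is at 0
theta = [0.1, 0.5, 0.2, 0.8]
xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]

analytic = loss_gradient(theta, xs, ys)
numeric = numeric_gradient(theta, xs, ys)

# One gradient descent step with a small learning rate
alpha = 0.01
loss_before = loss(theta, xs, ys)
theta_next = [t - alpha * g for t, g in zip(theta, analytic)]
loss_after = loss(theta_next, xs, ys)
print(analytic, numeric, loss_before, loss_after)
```

The shared factor `outer` is computed once and reused for the first-layer derivatives, which is the same reuse of intermediate results that backpropagation exploits in deeper networks. The finite-difference check is only reliable away from the kink of ReLU at zero, which is why the example parameters keep every pre-activation positive.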
