d2l-2: A Simple Implementation of Linear Regression

Published 2024-07-30, updated 2024-08-02

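In a neural network, training essentially means optimizing the parameters so that the error between the model's predictions and the true values is as small as possible. Defining a loss function makes the relationship between the parameters and that error explicit, and differentiating the loss tells us how to adjust the parameters to reduce it.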

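Suppose the loss function is \(L(\theta)\). Our goal is to find a \(\theta\) that minimizes \(L(\theta)\). For a given \(\theta\), we evaluate \(L(\theta)\) to get the current loss, then differentiate to get the gradient \(\nabla L(\theta)\) at that point. Gradient descent then updates the parameters with \(\theta \leftarrow \theta - \alpha \nabla L(\theta)\), where the learning rate \(\alpha\) is a hyperparameter that controls the step size of each update.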
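
As a minimal sketch of the update rule above, the following applies \(\theta \leftarrow \theta - \alpha \nabla L(\theta)\) to a toy one-dimensional loss. The loss \((\theta - 3)^2\), the name `toy_loss`, the starting point, and the learning rate are assumptions chosen only for illustration.

```python
import torch

# Hypothetical toy loss, minimized at theta = 3 (an assumption for illustration).
def toy_loss(theta):
    return (theta - 3.0) ** 2

theta = torch.tensor(0.0, requires_grad=True)  # arbitrary starting point
alpha = 0.1                                    # learning rate

for step in range(100):
    l = toy_loss(theta)
    l.backward()                     # stores dL/dtheta in theta.grad
    with torch.no_grad():
        theta -= alpha * theta.grad  # theta <- theta - alpha * gradient
    theta.grad.zero_()               # clear the gradient before the next step

print(float(theta))
```

For this loss each step shrinks the distance to the minimum by a constant factor of \(1 - 2\alpha\), so the iterate ends up very close to 3.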

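This procedure is gradient descent, a widely used optimization method. Under suitable conditions (a smooth loss and a small enough learning rate) it can be shown to approach a stationary point, where the gradient is zero. For a convex loss such as the squared loss of linear regression, that point is the global minimum; for general neural networks it may be only a local minimum or a saddle point.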

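In practice, the final loss is usually the average of the per-example losses, \(L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i)\), where \(n\) is the number of examples, \(x_i\) is the feature vector of the \(i\)-th example, and \(y_i\) is its label. This raises another problem: when \(n\) is large, computing the gradient over every example for each update is too expensive. We therefore usually use minibatch stochastic gradient descent, in which each update computes the gradient on a small batch of randomly selected examples rather than on the whole dataset.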
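
To make the trade-off concrete, the sketch below compares the full-batch gradient with gradients computed on minibatches of 10 examples. The one-parameter model, the synthetic data, and the helper `mean_loss` are assumptions made for illustration and are not part of the original text.

```python
import torch

torch.manual_seed(0)

# Synthetic data (assumed for illustration): y = 2 * x + small noise
n = 1000
x = torch.randn(n)
y = 2.0 * x + 0.01 * torch.randn(n)

theta = torch.tensor(0.0, requires_grad=True)

def mean_loss(idx):
    """Average squared loss over the examples selected by idx."""
    return ((theta * x[idx] - y[idx]) ** 2 / 2).mean()

# Full-batch gradient: one pass over all n examples.
full_grad, = torch.autograd.grad(mean_loss(torch.arange(n)), theta)

# Minibatch gradients: each one touches only batch_size examples.
batch_size = 10
perm = torch.randperm(n)
batch_grads = []
for i in range(0, n, batch_size):
    idx = perm[i:i + batch_size]
    g, = torch.autograd.grad(mean_loss(idx), theta)
    batch_grads.append(g)
batch_grads = torch.stack(batch_grads)

# A single minibatch gradient is noisy, but the average over all
# equal-sized minibatches matches the full-batch gradient.
print(float(full_grad), float(batch_grads[0]), float(batch_grads.mean()))
```

Each minibatch gradient costs a fraction of the full computation and points in roughly the same direction, which is why many cheap, noisy updates are usually preferred over a few exact ones.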

Below, we generate a synthetic dataset and then fit it with minibatch stochastic gradient descent.

import torch
import numpy as np
import random
import matplotlib.pyplot as plt

# Generate the dataset: labels = features @ true_w + true_b + noise
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
features = torch.tensor(np.random.normal(0, 1, (num_examples, num_inputs)), dtype=torch.float)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
# Add Gaussian noise (standard deviation 0.01) to the labels
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)

# Read the data in shuffled minibatches
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # visit the examples in random order
    for i in range(0, num_examples, batch_size):
        # The last batch may be smaller than batch_size
        j = torch.tensor(indices[i: min(i + batch_size, num_examples)])
        yield features.index_select(0, j), labels.index_select(0, j)

batch_size = 10
for X, y in data_iter(batch_size, features, labels):
    print(X, y)
    break
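
For comparison, PyTorch's built-in `TensorDataset` and `DataLoader` can produce the same kind of shuffled minibatches. This is only a sketch of an alternative, not what the hand-written `data_iter` above does internally, and the random stand-in tensors are assumptions that merely mimic the shapes used in the text.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-in data with the same shapes as in the text (assumed for illustration).
features = torch.randn(1000, 2)
labels = torch.randn(1000)

dataset = TensorDataset(features, labels)
# shuffle=True reshuffles the examples at the start of every epoch.
loader = DataLoader(dataset, batch_size=10, shuffle=True)

X, y = next(iter(loader))
print(X.shape, y.shape)  # torch.Size([10, 2]) torch.Size([10])
```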

Next, we define the model, the loss function, and the optimization algorithm.

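# Define the model; here we write it by hand instead of using nn.Module
def linreg(X, w, b):
    """Linear regression model."""
    return torch.matmul(X, w) + b

# Define the loss function
def loss(y_hat, y):
    """Squared loss, computed per example."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2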

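# Define the optimization algorithm
def sgd(params, lr, batch_size):
    """Minibatch stochastic gradient descent."""
    with torch.no_grad():
        for param in params:
            # The loss is summed over the batch, so divide by batch_size to average
            param -= lr * param.grad / batch_size
            param.grad.zero_()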

# Initialize the parameters
w = torch.tensor(np.random.normal(0, 0.01, (num_inputs, 1)), dtype=torch.float, requires_grad=True)
b = torch.zeros(1, dtype=torch.float, requires_grad=True)

# Train the model
lr = 0.03
num_epochs = 3
net = linreg

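for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y).sum()  # summed loss over the minibatch
        l.backward()                     # gradients of the summed loss w.r.t. w and b
        sgd([w, b], lr, batch_size)      # update the parameters and reset the gradients
    with torch.no_grad():
        # Mean loss over the whole training set after each epoch
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')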
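
The training loop sums the per-example losses and `sgd` then divides the gradient by `batch_size`. The sketch below checks that this gives the same gradient as differentiating the mean loss directly. The random batch and the names `squared_loss`, `w1`, and `w2` are assumptions introduced for illustration.

```python
import torch

torch.manual_seed(0)
batch_size = 10
X = torch.randn(batch_size, 2)  # one hypothetical minibatch of features
y = torch.randn(batch_size)     # and its labels

def squared_loss(y_hat, y):
    """Per-example squared loss, as defined in the text."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

# Variant 1: sum the losses, then divide the gradient by batch_size.
w1 = torch.zeros(2, 1, requires_grad=True)
squared_loss(X @ w1, y).sum().backward()
grad_from_sum = w1.grad / batch_size

# Variant 2: average the losses and use the gradient as is.
w2 = torch.zeros(2, 1, requires_grad=True)
squared_loss(X @ w2, y).mean().backward()
grad_from_mean = w2.grad

# Compare the two gradients up to floating-point error.
print(torch.allclose(grad_from_sum, grad_from_mean, atol=1e-6))
```

Dividing inside the optimizer keeps the effective step size independent of the batch size, so the same learning rate can be reused when `batch_size` changes.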