LLM Principles: Implementing Self-Attention with Trainable Weights - A Wrapper Class That Computes All Context Vectors - by Ma Yumin

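# Implementation 1

This version wraps self-attention in a class. The code inside the class is nearly identical to the code in [LLM Principles: Implementing Self-Attention with Trainable Weights (Using the Second Input Element as the Worked Example)](https://www.malaoshi.top/show_1GW2cIAFk564.html "LLM Principles: Implementing Self-Attention with Trainable Weights (Using the Second Input Element as the Worked Example)").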

### The wrapper class

`SelfAttention_v1` must inherit from `nn.Module`, which provides the functionality needed to create and manage model layers.

```
import torch
import torch.nn as nn

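class SelfAttention_v1(nn.Module):
    """
    Must inherit from nn.Module, which provides the functionality
    needed to create and manage model layers.
    """
    def __init__(self, d_in, d_out):
        """
        Initialize the trainable weight matrices.
        :param d_in: input embedding dimension
        :param d_out: output embedding dimension
        """
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        """
        Forward pass.
        :param x: input embeddings, shape (num_tokens, d_in)
        :return: context vectors, shape (num_tokens, d_out)
        """
        # Project the inputs into key, query, and value spaces
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        # Attention scores: every query against every key
        attn_scores = queries @ keys.T
        # Scale by sqrt(d_out), then normalize each row with softmax
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1] ** 0.5, dim=-1
        )
        # Each context vector is a weighted sum of the value vectors
        context_vec = attn_weights @ values
        return context_vec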
```
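To make concrete what inheriting from `nn.Module` provides, the following is a minimal sketch. The class name `TinyAttention` and the random stand-in input are illustrative assumptions, not part of the original article. It shows that tensors wrapped in `nn.Parameter` are registered automatically and that calling the module dispatches to `forward`.

```python
import torch
import torch.nn as nn


class TinyAttention(nn.Module):
    """Hypothetical condensed copy of SelfAttention_v1, used only for this demo."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        queries, keys, values = x @ self.W_query, x @ self.W_key, x @ self.W_value
        weights = torch.softmax(queries @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
        return weights @ values


torch.manual_seed(123)
layer = TinyAttention(d_in=3, d_out=2)

# nn.Module collects every nn.Parameter attribute, so an optimizer can find them
# through layer.parameters() without any extra bookkeeping.
param_shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
print(param_shapes)

# Calling the module invokes forward(): 6 tokens of size 3 become 6 vectors of size 2.
out = layer(torch.rand(6, 3))
print(tuple(out.shape), out.requires_grad)
```

Because the output depends on registered parameters, it carries gradient information, which is why the printed results in this article end with a `grad_fn`.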

### Using the class

```
# Input embedding vectors
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)

# Input embedding dimension, d_in = 3
d_in = inputs.shape[1]
# Output embedding dimension, d_out = 2
d_out = 2

torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print("Context vectors:", sa_v1(inputs))
```

Output:

```
Context vectors: tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)
```

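The second row, `[0.3061, 0.8210]`, matches `context_vec_2` from [LLM Principles: Implementing Self-Attention with Trainable Weights (Using the Second Input Element as the Worked Example)](https://www.malaoshi.top/show_1GW2cIAFk564.html "LLM Principles: Implementing Self-Attention with Trainable Weights (Using the Second Input Element as the Worked Example)").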
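The match between the batched result and the single-element result can be checked directly. The sketch below is a hedged illustration that uses random stand-in inputs and weights (the names `x`, `W_q`, `W_k`, `W_v` are assumptions, not the article's tensors). It computes all context vectors at once, then recomputes only the one for the second token and compares the two.

```python
import torch

torch.manual_seed(123)
x = torch.rand(6, 3)  # stand-in inputs: 6 tokens, d_in = 3
W_q, W_k, W_v = (torch.rand(3, 2) for _ in range(3))  # stand-in weights, d_out = 2

# All context vectors at once, following the same steps as the class's forward().
keys, values = x @ W_k, x @ W_v
all_weights = torch.softmax((x @ W_q) @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
all_context = all_weights @ values

# Context vector for the second token only (index 1).
query_2 = x[1] @ W_q
weights_2 = torch.softmax(query_2 @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
context_2 = weights_2 @ values

# Row 1 of the batched result should agree with the single-token computation.
rows_agree = torch.allclose(all_context[1], context_2)
print(rows_agree)
```

The batched version is the same arithmetic applied to every row of the query matrix, so each row of the result is the context vector for one token.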

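# Implementation 2 (improved)

The class can be refined further by using PyTorch's `nn.Linear` layers. With the bias term disabled, an `nn.Linear` layer simply performs a matrix multiplication.
Compared with creating the weights by hand via `nn.Parameter(torch.rand(...))`, a key advantage of `nn.Linear` is that it comes with a **better-tuned weight initialization scheme**, which helps make model training more stable and effective.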
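Both claims can be illustrated with a short, hedged sketch. The variable names are illustrative, and the bound of `1/sqrt(d_in)` reflects the default `nn.Linear` initialization in current PyTorch releases as I understand it, so treat that part as an assumption rather than a guarantee.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
d_in, d_out = 3, 2
linear = nn.Linear(d_in, d_out, bias=False)
x = torch.rand(6, d_in)

# nn.Linear stores its weight as (d_out, d_in), so the layer computes x @ weight.T.
same_as_matmul = torch.allclose(linear(x), x @ linear.weight.T)
print(tuple(linear.weight.shape), same_as_matmul)

# torch.rand draws from [0, 1), so every manually initialized weight is non-negative.
manual = torch.rand(d_in, d_out)
manual_non_negative = manual.min().item() >= 0

# Assumption: the default nn.Linear initialization is uniform and centered on zero,
# with values inside +/- 1/sqrt(d_in).
bound = 1 / d_in ** 0.5
largest_linear_weight = linear.weight.abs().max().item()
print(manual_non_negative, largest_linear_weight, bound)
```

Weights centered on zero and scaled by the input dimension keep the projected values in a moderate range, whereas all-positive weights from `torch.rand` tend to push every projection in the same direction.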

### The wrapper class

```
import torch
import torch.nn as nn

class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        # With bias=False, each layer is a plain trainable matrix multiplication
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        context_vec = attn_weights @ values
        return context_vec
```

### Using the class

```
# Reuses the `inputs` tensor defined in Implementation 1
# Input embedding dimension, d_in = 3
d_in = inputs.shape[1]
# Output embedding dimension, d_out = 2
d_out = 2

torch.manual_seed(123)
sa_v2 = SelfAttention_v2(d_in, d_out)
print("Context vectors:", sa_v2(inputs))
```

Output:

```
Context vectors: tensor([[-0.5337, -0.1051],
        [-0.5323, -0.1080],
        [-0.5323, -0.1079],
        [-0.5297, -0.1076],
        [-0.5311, -0.1066],
        [-0.5299, -0.1081]], grad_fn=<MmBackward0>)
```

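**Note:** `SelfAttention_v1` and `SelfAttention_v2` produce different outputs because they start from different initial weight matrices: `nn.Linear` uses a more sophisticated weight initialization scheme than `torch.rand`.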
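One way to confirm that initialization is the only difference is to copy the weights from one variant into the other and compare outputs. The sketch below is a hedged illustration: `ParamAttention` and `LinearAttention` are hypothetical condensed stand-ins for `SelfAttention_v1` and `SelfAttention_v2`, and the input is random rather than the article's tensor.

```python
import torch
import torch.nn as nn


def attention(queries, keys, values):
    """Scaled dot-product attention shared by both stand-in classes."""
    weights = torch.softmax(queries @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
    return weights @ values


class ParamAttention(nn.Module):  # hypothetical stand-in for SelfAttention_v1
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        return attention(x @ self.W_query, x @ self.W_key, x @ self.W_value)


class LinearAttention(nn.Module):  # hypothetical stand-in for SelfAttention_v2
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        return attention(self.W_query(x), self.W_key(x), self.W_value(x))


torch.manual_seed(123)
v1, v2 = ParamAttention(3, 2), LinearAttention(3, 2)
x = torch.rand(6, 3)

# Copy v2's weights into v1; transpose because nn.Linear stores (d_out, d_in).
with torch.no_grad():
    for name in ("W_query", "W_key", "W_value"):
        getattr(v1, name).copy_(getattr(v2, name).weight.T)

outputs_match = torch.allclose(v1(x), v2(x), atol=1e-6)
print(outputs_match)
```

Once the weights are shared, the two classes perform the same computation, so any remaining difference would come only from floating-point rounding.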

Original source: http://malaoshi.top/show_1GW2cIKgykhk.html