LLM Principles: Pretraining Objective - Causal Language Modeling (CLM) - Data Sampling with a Sliding Window - 马育民老师

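Prerequisite: [LLM Principles: Pretraining Objective - Causal Language Modeling (CLM)](https://www.malaoshi.top/show_1GW2YBs0zklJ.html "LLM Principles: Pretraining Objective - Causal Language Modeling (CLM)")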

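# Introduction

In causal language modeling (CLM), **sliding-window** sampling splits **a long text into fixed-length, overlapping samples**, so that the model can learn sequence dependencies better while making full use of the data.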

# The sliding-window sampling procedure

[![](https://www.malaoshi.top/upload/0/0/1GW2YC2Bgdao.png)](https://www.malaoshi.top/upload/0/0/1GW2YC2Bgdao.png)

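In CLM the model input has a **maximum length** (for example, 512 tokens for GPT-1 and 1024 for GPT-2). Sliding-window sampling works as follows:

1. Choose a **window length (window_size)**, usually equal to the model's maximum input length;
2. Choose a **step size (step_size)**: how far the window moves each time. If it is smaller than window_size, consecutive samples overlap;
3. Starting at the beginning of the long text, slide the window by the step size to cut out fixed-length subsequences;
4. Each subsequence is one CLM training sample: the target is the input shifted right by one token, so at every position the model predicts the next token from the tokens before it.

This matches the causal nature of CLM ("predict what follows from what came before"), gives nearly every token in the long text a chance to be a prediction target, and keeps context continuous across the overlapping regions.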
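The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used later in the article: the function name `sliding_window_pairs` and the toy list of integers standing in for token IDs are assumptions made for the example.

```python
def sliding_window_pairs(token_ids, window_size, step_size):
    """Return (input, target) pairs; each target is its input shifted right by one token."""
    pairs = []
    # Stop early enough that the shifted target window still fits in the sequence.
    for start in range(0, len(token_ids) - window_size, step_size):
        inputs = token_ids[start:start + window_size]
        targets = token_ids[start + 1:start + window_size + 1]
        pairs.append((inputs, targets))
    return pairs


# Toy "token IDs" 0..9 stand in for a tokenized long text (hypothetical data).
toy_ids = list(range(10))
pairs = sliding_window_pairs(toy_ids, window_size=4, step_size=2)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

With `step_size=2` and `window_size=4`, each window shares two tokens with the previous one, which is the overlap described in step 2.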

# Implementation 1

[![](https://www.malaoshi.top/upload/0/0/1GW2YTwGUYpS.png)](https://www.malaoshi.top/upload/0/0/1GW2YTwGUYpS.png)

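To keep things easy to follow, this first version applies the sliding window to the **raw words** rather than to token IDs and, as in the figure above, extracts a **single** `input-target pair`:

1. Read the text
2. Split it into words
3. Define the window size and the stride
4. Extract one `input-target pair`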

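```python
import re

# Read the text
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Split into words and punctuation, dropping empty and whitespace-only items
raw_sample = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
raw_sample = [item.strip() for item in raw_sample if item.strip()]

# Context size, i.e. the window size: read 4 words at a time
context_size = 4
# Stride: how far the window moves next time (not needed for a single pair)
stride = 1
# Input words
inputs = raw_sample[:context_size]
# Target words: the input shifted right by exactly one position, whatever the stride
targets = raw_sample[1:context_size + 1]
print("Input tokens:", inputs)
print("Target tokens:", targets)
```

Output:

```
Input tokens: ['I', 'HAD', 'always', 'thought']
Target tokens: ['HAD', 'always', 'thought', 'Jack']
```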

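# Implementation 2

This version improves on `Implementation 1` by extracting **all** `input-target pairs`. The earlier steps are unchanged; **only step 4 is different**: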

```python
import re

# Read the text
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Split into words and punctuation, dropping empty and whitespace-only items
raw_sample = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
raw_sample = [item.strip() for item in raw_sample if item.strip()]

# Context size, i.e. the window size: read 4 words at a time
context_size = 4
# Stride: move the window by 1 word each time
stride = 1
# List of input chunks
inputs = []
# List of target chunks
targets = []

# Stop at len - context_size so the shifted target chunk always has 4 items
for i in range(0, len(raw_sample) - context_size, stride):
    input_chunk = raw_sample[i:i + context_size]
    target_chunk = raw_sample[i + 1:i + context_size + 1]
    inputs.append(input_chunk)
    targets.append(target_chunk)

print("First 5 input chunks:")
for item in inputs[:5]:
    print(item)

print()
print("First 5 target chunks:")
for item in targets[:5]:
    print(item)
```

Output:

```
First 5 input chunks:
['I', 'HAD', 'always', 'thought']
['HAD', 'always', 'thought', 'Jack']
['always', 'thought', 'Jack', 'Gisburn']
['thought', 'Jack', 'Gisburn', 'rather']
['Jack', 'Gisburn', 'rather', 'a']

First 5 target chunks:
['HAD', 'always', 'thought', 'Jack']
['always', 'thought', 'Jack', 'Gisburn']
['thought', 'Jack', 'Gisburn', 'rather']
['Jack', 'Gisburn', 'rather', 'a']
['Gisburn', 'rather', 'a', 'cheap']
```

### What the stride means

For details, see:
https://www.malaoshi.top/show_1GW2YXBPPimC.html
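As a quick illustration of how the stride controls overlap, the sketch below slides a 4-word window over the first nine words of the text with three different strides. The helper name `input_windows` and the choice of strides 2 and 4 are assumptions for this example; the linked page remains the full explanation.

```python
def input_windows(tokens, context_size, stride):
    """Input windows only; the last start still leaves room for one target token."""
    return [tokens[i:i + context_size]
            for i in range(0, len(tokens) - context_size, stride)]


# First nine words of the text, as shown in the output above
words = ["I", "HAD", "always", "thought", "Jack", "Gisburn", "rather", "a", "cheap"]
context_size = 4

for stride in (1, 2, 4):
    windows = input_windows(words, context_size, stride)
    # Consecutive windows share context_size - stride items (never fewer than 0)
    overlap = max(0, context_size - stride)
    print(f"stride={stride}: {len(windows)} windows, {overlap} shared words")
    for window in windows:
        print("   ", window)
```

A stride of 1 produces the most samples with the most overlap; a stride equal to the window size produces windows that do not overlap at all.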

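# Implementation 3

This version uses the sliding window to extract `input-target pairs` from **token IDs** instead of raw words:

1. Read the text
2. Tokenize it with a BPE tokenizer
3. Define the window size and the stride
4. Extract all `input-target pairs`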

```python
from importlib.metadata import version
import tiktoken

print("tiktoken version:", version("tiktoken"))
print("Supported encodings:", tiktoken.list_encoding_names())

# Pick the encoding that matches the model; different models use different encodings
tokenizer = tiktoken.get_encoding("gpt2")

# Read the text
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Encoding text to token IDs ---------------------")
token_ids = tokenizer.encode(raw_text)
print("Total tokens:", len(token_ids))

# Optional: drop the first 50 tokens for the demo (left disabled here)
# token_ids = token_ids[50:]

# Context size, i.e. the window size: read 4 tokens at a time
context_size = 4
# Stride: move the window by 1 token each time
stride = 1
# List of input chunks
input_ids = []
# List of target chunks
target_ids = []

# Stop at len - context_size so the shifted target chunk always has 4 tokens
for i in range(0, len(token_ids) - context_size, stride):
    input_chunk = token_ids[i:i + context_size]  # input tokens
    target_chunk = token_ids[i + 1:i + context_size + 1]  # target tokens
    input_ids.append(input_chunk)
    target_ids.append(target_chunk)

print()
print("First 5 input chunks:")
for item in input_ids[:5]:
    print(item)

print()
print("First 5 target chunks:")
for item in target_ids[:5]:
    print(item)
```

Output:

```
tiktoken version: 0.12.0
Supported encodings: ['gpt2', 'r50k_base', 'p50k_base', 'p50k_edit', 'cl100k_base', 'o200k_base', 'o200k_harmony']
Encoding text to token IDs ---------------------
Total tokens: 5145

First 5 input chunks:
[40, 367, 2885, 1464]
[367, 2885, 1464, 1807]
[2885, 1464, 1807, 3619]
[1464, 1807, 3619, 402]
[1807, 3619, 402, 271]

First 5 target chunks:
[367, 2885, 1464, 1807]
[2885, 1464, 1807, 3619]
[1464, 1807, 3619, 402]
[1807, 3619, 402, 271]
[3619, 402, 271, 10899]
```
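It can help to know in advance how many pairs the loop will produce. The sketch below counts the iterations of the `range(0, len - context_size, stride)` loop used above without running it. The helper name `num_pairs` is an assumption for this example; 5145 is the token count from the output above, and the strides other than 1 are hypothetical.

```python
import math


def num_pairs(n_tokens, context_size, stride):
    """Number of iterations of range(0, n_tokens - context_size, stride)."""
    usable = n_tokens - context_size  # number of valid window start positions
    return max(0, math.ceil(usable / stride))


# 5145 tokens and a window of 4, as in the run above; strides 2 and 4 are hypothetical
for stride in (1, 2, 4):
    print(f"stride={stride}: {num_pairs(5145, 4, stride)} pairs")

# Cross-check the formula against the loop bounds themselves
assert num_pairs(5145, 4, 1) == len(range(0, 5145 - 4, 1))
```

A larger stride shrinks the dataset roughly in proportion, which is the trade-off between overlap and sample count.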

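### Converting tokens back to text

Append the following to the code above:

```python
# Decode the input chunks back to text
print()
print("First 5 input chunks (decoded):")
for item in input_ids[:5]:
    text = tokenizer.decode(item)
    print(text)

# Decode the target chunks back to text
print()
print("First 5 target chunks (decoded):")
for item in target_ids[:5]:
    text = tokenizer.decode(item)
    print(text)
```

Output:

```
First 5 input chunks (decoded):
I HAD always
 HAD always thought
AD always thought Jack
 always thought Jack G
 thought Jack Gis

First 5 target chunks (decoded):
 HAD always thought
AD always thought Jack
 always thought Jack G
 thought Jack Gis
 Jack Gisburn
```

The first input chunk shows only three words even though it holds four tokens. The reason is visible in the third line, which starts with `AD`: the BPE tokenizer does not treat `HAD` as a single token but encodes it as two, ` H` and `AD`. `Gisburn` is split in the same way.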
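To make the subword split concrete, the sketch below decodes one chunk token by token. The `pieces` table is an assumption for this example: it is read off from the token IDs and decoded chunks shown above, not queried from tiktoken, and `toy_decode` is a hypothetical stand-in for `tokenizer.decode`.

```python
# Token pieces inferred from the IDs and decoded output above (not a real vocabulary lookup)
pieces = {
    40: "I", 367: " H", 2885: "AD", 1464: " always",
    1807: " thought", 3619: " Jack", 402: " G", 271: "is", 10899: "burn",
}


def toy_decode(ids):
    """Concatenate the text piece of each token ID, as a decoder would."""
    return "".join(pieces[i] for i in ids)


first_input = [40, 367, 2885, 1464]
# Four tokens, but " H" and "AD" join into the single word "HAD"
print([pieces[i] for i in first_input])
print(repr(toy_decode(first_input)))
```

Printing the individual pieces shows why a window of four tokens does not necessarily cover four words.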

Original source: http://malaoshi.top/show_1GW2YV1NBWPm.html