2024-01-19 20:37:39

Chinese Dialogue 0.2B Mini Model: ChatLM-Chinese-0.2B

👋🏼 Introduction

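Today's large language models tend to have very large parameter counts. Consumer-grade computers are slow even at inference alone, let alone at training a model from scratch. This project aims to organize the full training pipeline for a generative language model, including data cleaning, tokenizer training, model pretraining, SFT instruction fine-tuning, and RLHF optimization.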

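ChatLM-mini-Chinese is a small Chinese dialogue model with only 0.2B parameters (about 210M when shared weights are counted). It can be pretrained on a machine with as little as 4GB of GPU memory (batch_size=1, fp16 or bf16), and loading it in float16 for inference needs at least 512MB of GPU memory.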
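As a rough sanity check on these figures, the weight footprint can be estimated from the parameter count alone. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes about 210M parameters at 2 bytes each for float16, and it ignores activations, generation caches, and framework overhead, which presumably make up the difference to the quoted 512MB. The helper name `weight_memory_mib` is made up for illustration.

```python
def weight_memory_mib(num_params: int, bytes_per_param: int) -> float:
    """Return the memory needed to hold the weights alone, in MiB."""
    return num_params * bytes_per_param / (1024 ** 2)


NUM_PARAMS = 210_000_000  # ~210M, counting shared weights (figure from the text)

fp16_mib = weight_memory_mib(NUM_PARAMS, 2)  # float16 / bfloat16: 2 bytes per parameter
fp32_mib = weight_memory_mib(NUM_PARAMS, 4)  # float32: 4 bytes per parameter

print(f"float16 weights: {fp16_mib:.1f} MiB")
print(f"float32 weights: {fp32_mib:.1f} MiB")
```

Under these assumptions the float16 weights come to roughly 400 MiB, which fits under the stated 512MB minimum.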

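All dataset sources for pretraining, SFT instruction fine-tuning, and DPO preference optimization are public.
Built on the Hugging Face NLP stack, including transformers, accelerate, trl, and peft.
A self-implemented trainer supports pretraining and SFT fine-tuning on a single machine with one or multiple GPUs. Training can be stopped at any point and resumed from any point.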
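Pretraining: unified as end-to-end Text-to-Text pretraining, rather than masked-token prediction pretraining.
- All data cleaning (such as normalization and MinHash-based document deduplication), dataset construction, and dataset loading optimization steps are open source;
- Multi-process word-frequency counting for the tokenizer, with tokenizer training supported through both sentencepiece and huggingface tokenizers;
- Pretraining can be checkpointed at any point and resumed from that checkpoint;
- Large (GB-scale) datasets are loaded as a stream with buffer-based shuffling, without using memory or disk as a cache, which effectively reduces memory and disk usage. With batch_size=1 and max_len=320, pretraining runs on a machine with as little as 16GB of RAM and 4GB of GPU memory;
- Training logs are recorded.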
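The streaming load with buffer-based shuffling mentioned in the list above can be illustrated with a small generator. This is a minimal sketch of the general technique, not the project's loader: `buffered_shuffle`, `buffer_size`, and `seed` are hypothetical names, and the input here is an in-memory generator standing in for a file that would be read line by line.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    stream: Iterable[T], buffer_size: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Yield items from `stream` in approximately shuffled order.

    At most `buffer_size` items are held in memory at any time.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be >= 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill phase: nothing is emitted yet
            continue
        # Buffer is full: emit a random resident and reuse its slot for the new item.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: flush the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


# A generator stands in for a large file read one sample at a time.
samples = (f"sample-{i}" for i in range(10))
shuffled = list(buffered_shuffle(samples, buffer_size=4, seed=0))
print(shuffled)
```

A larger buffer gives a better approximation of a full shuffle at the cost of more memory; a buffer of size 1 leaves the order unchanged.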
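SFT fine-tuning: the SFT dataset and its data processing pipeline are open source.
- The self-implemented trainer supports prompt-based instruction fine-tuning and resuming from any checkpoint;
- Sequence-to-sequence fine-tuning with the Huggingface trainer is supported;
- The traditional approach of fine-tuning with a low learning rate while training only the decoder layers is supported.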
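Preference optimization: full-parameter preference optimization using DPO.
- Preference optimization with peft lora is supported;
- Model merging is supported: the Lora adapter can be merged into the original model.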
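For readers unfamiliar with DPO, the objective for a single preference pair can be written in a few lines. This is a sketch of the standard DPO loss in plain Python, independent of trl and of this project's training code; the function names and the value `beta=0.1` are illustrative assumptions. The inputs are the summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair."""
    # How much more (or less) likely each response is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: the margin is 0 and the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy raises the chosen response and lowers the rejected one: the loss drops.
improved = dpo_loss(-4.0, -9.0, -5.0, -7.0)
print(baseline, improved)
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference does, which is the behavior preference optimization is after.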
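Downstream task fine-tuning is supported: finetune_examples provides a fine-tuning example for a triple information extraction task, and the fine-tuned model retains its dialogue ability.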
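The document deduplication step named in the feature list above (written as mini_hash in the source, assumed here to mean MinHash) can be sketched with the standard library alone. This is an illustration of the technique, not the project's cleaning code: the function names, the 3-character shingles, the 64 hash functions, and the 0.8 threshold are all assumptions, and the pairwise comparison is quadratic, whereas a real pipeline over millions of documents would typically add locality-sensitive hashing.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Character n-grams with whitespace removed (no word segmenter needed)."""
    text = "".join(text.split())
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(text: str, num_perm: int = 64, n: int = 3) -> List[int]:
    """One minimum hash value per salted hash function."""
    grams = shingles(text, n)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep the first document of every group of near-duplicates."""
    kept: List[str] = []
    kept_sigs: List[List[int]] = []
    for doc in docs:
        sig = minhash_signature(doc)
        if any(estimated_jaccard(sig, s) >= threshold for s in kept_sigs):
            continue  # too similar to a document that was already kept
        kept.append(doc)
        kept_sigs.append(sig)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",  # differs only in whitespace
    "completely unrelated sentence about language models",
]
unique_docs = deduplicate(docs)
print(unique_docs)
```

Character-level shingles are used so that the same sketch would apply to Chinese text, where words are not separated by spaces.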

🩺 Results

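The pretraining dataset contains only a little over 9 million samples and the model has only 0.2B parameters, so it cannot cover every topic. It will sometimes give irrelevant answers or behave like a filler-text generator.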

[Image: dialogue demo]

🛠️ Installation and Deployment

First, clone the project from GitHub:

git clone --depth 1 https://github.com/charent/ChatLM-mini-Chinese.git

Next, install the dependencies:

pip install -r ./requirements.txt

Finally, download the trained model and its configuration files:

git clone --depth 1 https://huggingface.co/charent/ChatLM-mini-Chinese

mv ChatLM-mini-Chinese model_save

Make sure the model_save directory contains the following files. All of them can be found in the Hugging Face Hub repository ChatLM-Chinese-0.2B:

ChatLM-mini-Chinese
├─model_save
|  ├─config.json
|  ├─configuration_chat_model.py
|  ├─generation_config.json
|  ├─model.safetensors
|  ├─modeling_chat_model.py
|  ├─special_tokens_map.json
|  ├─tokenizer.json
|  └─tokenizer_config.json
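A quick way to confirm the layout above is a short check script. This is a convenience sketch, not part of the project: `missing_model_files` is a hypothetical helper, and it assumes the script is run from the project root so that `model_save` resolves as a relative path.

```python
from pathlib import Path
from typing import List

# File names taken from the directory listing above.
REQUIRED_FILES = [
    "config.json",
    "configuration_chat_model.py",
    "generation_config.json",
    "model.safetensors",
    "modeling_chat_model.py",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
]


def missing_model_files(model_dir: str = "model_save") -> List[str]:
    """Return the expected files that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


missing = missing_model_files()
if missing:
    print("Missing files in model_save:", ", ".join(missing))
else:
    print("All expected files are present.")
```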

Run the following in the console:

python cli_demo.py

Alternatively, use the API:

python api_demo.py

[Image: dialogue demo]

Custom test code:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = 'charent/ChatLM-mini-Chinese'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, trust_remote_code=True).to(device)

txt = '如何评价Apple这家公司？'

encode_ids = tokenizer([txt])
input_ids, attention_mask = torch.LongTensor(encode_ids['input_ids']), torch.LongTensor(encode_ids['attention_mask'])

outs = model.my_generate(
    input_ids=input_ids.to(device),
    attention_mask=attention_mask.to(device),
    max_seq_len=256,
    search_type='beam',
)

outs_txt = tokenizer.batch_decode(outs.cpu().numpy(), skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(outs_txt[0])

# Output: Apple是一家专注于设计和用户体验的公司，其产品在设计上注重简约、流畅和功能性，而在用户体验方面则...
