Large Model Fine-Tuning in Practice: LLaMA Factory Hands-On

Published: 2025-10-03

Updated: 2025-10-21


LLaMA Factory

Overview: an open-source framework from China for efficient fine-tuning of large models
Language: Python
License: Apache License 2.0
First commit: 2023-05-28
Source paper: LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
Affiliations: School of Computer Science and Engineering, Beihang University, China; School of Software and Microelectronics, Peking University, China
GitHub: https://github.com/hiyouga/LLaMA-Factory?tab=readme-ov-file
Tutorial: https://llamafactory.readthedocs.io/zh-cn/latest/
Software dependencies (version-constrained) and feature support:

Required: Python, torch, torchvision, transformers, datasets, accelerate, peft, trl
Optional: cuda, deepspeed (training), bitsandbytes, vllm (inference), flash-attn

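Framework features:

Broad model support: LLaMA, LLaVA, Mistral, Mixtral-MoE, Qwen, DeepSeek, Yi, Gemma, Baichuan, ChatGLM, Phi, and more. The project maintains a full list of supported models, and users can add support for custom models.
Multiple training methods: pre-training, post-training, (multimodal) supervised fine-tuning (SFT), RLHF (reward modeling and PPO training), DPO, KTO, ORPO, and SimPO, with full-parameter training, partial-parameter training, LoRA, and QLoRA.
Acceleration operators for efficient training: FlashAttention-2, Unsloth, Liger Kernel.
Multiple inference engines: Transformers, vLLM, and SGLang; vLLM can be integrated for efficient inference.
Training visualization and monitoring tools: LlamaBoard, TensorBoard, Wandb, MLflow, SwanLab, and others.
Distributed training: single-node multi-GPU and multi-node multi-GPU training, with NativeDDP, DeepSpeed, and FSDP (Fully Sharded Data Parallel) as distributed training engines.
Datasets and formats: Alpaca and ShareGPT formats are supported, covering pre-training datasets, supervised fine-tuning datasets, preference datasets, KTO datasets, and multimodal datasets (image, video, audio). Some datasets require logging in to a Hugging Face account and confirming access before use. dataset_info.json lists all supported preprocessed local and online datasets. The available inference data types are auto, float16, bfloat16, and float32.
Tuning algorithms: Full Parameter Fine-tuning, Freeze (freeze fine-tuning), LoRA (low-rank adapter fine-tuning, including LoRA+, rsLoRA, DoRA, and PiSSA), GaLore (Gradient Low-Rank Projection), and BAdam (Block coordinate method with Adam as an inner solver).
Quantization: 2- to 8-bit quantized training is supported, including PTQ (post-training quantization, such as GPTQ), QAT (quantization-aware training, such as AWQ, Activation-aware Weight Quantization), AQLM (Additive Quantization of Language Models), and OFTQ (On-the-fly Quantization, including bitsandbytes, HQQ (Half-Quadratic Quantization), and EETQ (Easy and Efficient Quantization for Transformers)).
Precision options: 16-bit full-parameter fine-tuning, freeze fine-tuning, LoRA fine-tuning, and 2/3/4/5/6/8-bit QLoRA fine-tuning based on AQLM/AWQ/GPTQ/LLM.int8/HQQ/EETQ.
Hardware: CUDA, ROCm, Huawei Ascend NPU, and the Atlas series.
Operating systems: Linux, Windows, and macOS.
Languages: the interface is available in Chinese and English.
Interfaces: command line and Web UI.
Docker deployment: a complete Docker-based deployment is provided as part of the Huawei NPU adaptation.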

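LLaMA Factory installation and fine-tuning process

Note: for demonstration purposes, this walkthrough runs on Windows. The training dataset can be replaced with any fine-tuning dataset in a supported format. The actual Text2SQL dataset is not provided, and Linux configuration is not covered.

Installing LLaMA Factory

Installing CUDA on Linux:
Prerequisites for LLaMA Factory: CUDA and GCC are installed. Check that your Linux distribution supports CUDA with uname -m && cat /etc/*release (on a local Windows machine, an Ubuntu system can be installed through WSL). Use gcc --version to confirm that GCC is installed. After installing CUDA, run nvcc -V; if it prints the CUDA version number, the installation succeeded.
Installing CUDA on Windows:
Prerequisites for LLaMA Factory: git and CUDA are installed. Run the following in a Python shell to check that PyTorch is a GPU build and that CUDA is available.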

import torch
print(torch.__version__)    # '2.7.1+cu118'
print(torch.cuda.is_available())    # True

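Installing LLaMA-Factory, verifying the installation, and dependencies
3.1 In the installation folder, run the following commands to clone the LLaMA Factory Git repository from GitHub and install it

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory # switch into the LLaMA-Factory directory
pip install -e ".[torch,metrics,gptq]" # install the package from the current directory in editable mode

3.2 If you hit environment conflicts, try pip install --no-deps -e . to resolve them. After installation, run llamafactory-cli version on the command line to verify it. If it prints Welcome to LLaMA Factory, version 0.9.4.dev0, the installation succeeded. You can then run llamafactory-cli webui to launch the LLaMA Factory WebUI in a local browser.
3.3 Dependency packages and their purposes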

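Name	Description
torch	The open-source deep learning framework PyTorch.
torch-npu	PyTorch compatibility package for Ascend devices.
metrics	Used to evaluate and monitor machine learning model performance.
deepspeed	Provides the zero-redundancy optimizer needed for distributed training.
bitsandbytes	Used to quantize large language models.
hqq	Used to quantize large language models.
eetq	Used to quantize large language models.
gptq	Used to load GPTQ-quantized models.
awq	Used to load AWQ-quantized models.
aqlm	Used to load AQLM-quantized models.
vllm	Provides high-speed, concurrent model inference serving.
galore	Provides an efficient full-parameter fine-tuning algorithm.
badam	Provides an efficient full-parameter fine-tuning algorithm.
qwen	Provides the packages needed to load Qwen v1 models.
modelscope	The ModelScope community, a source for downloading pretrained models and datasets.
swanlab	The open-source training tracker SwanLab, used to record and visualize training runs.
dev	Used for LLaMA Factory development and maintenance.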

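Notes:
4.1 When installing CUDA on Linux, first check whether the driver bundled with the CUDA installer is compatible with your GPU; deselecting the bundled Driver during installation is recommended.
4.2 When downloading models and LLaMA Factory from GitHub or the ModelScope community, check that a firewall is not blocking the connection, since a timeout will prevent you from reaching the server. Alternatively, download the repository as an archive and run pip install -e ".[torch,metrics,gptq]" in the extracted directory.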

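Preparing the environment components

Dataset preparation: for the real Text2SQL (NL2SQL, Natural Language to SQL) task, the dataset is a set of high-frequency queries agreed on with the business team. Because the goal here is only to demonstrate fine-tuning with LLaMA-Factory, we use the "Ruozhiba dataset" instead. It is a well-known Chinese AI training dataset built from posts collected on Ruozhiba (弱智吧), a very popular Baidu Tieba forum known for short, punchy posts, and it is meant to inspire playful uses of ChatGPT and other LLMs. The data files can also be downloaded from GitHub as an archive into a chosen folder (\Dateset). A custom dataset works too: LLaMA Factory supports two JSON dataset formats, Alpaca and ShareGPT; see the AI Data Engineering article for details.
A sample of the Ruozhiba dataset: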

[
  {
    "instruction": "只剩一个心脏了还能活吗？",
    "output": "能，人本来就只有一个心脏。"
  },
  {
    "instruction": "爸爸再婚，我是不是就有了个新娘？",
    "output": "不是的，你有了一个继母。\"新娘\"是指新婚的女方，而你爸爸再婚，他的新婚妻子对你来说是继母。"
  },
  {
    "instruction": "我25岁 我爸50岁 那我爸两岁就生了我啦？",
    "output": "不对，你25岁，你爸50岁，那么你爸在你出生时应该是25岁，而不是两岁。"
  },
  {
    "instruction": "同学说steam好玩，我现在八级烫伤了怎么办？ 救救我",
    "output": "\"Steam\"是一个电子游戏分发平台，不是steam蒸汽。建议立即去医院，顺便看看脑子。"
  }
]
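Before registering a custom file, it can help to check that every record has the fields the Alpaca format expects. The following is a minimal sketch, not part of LLaMA Factory: the helper name validate_alpaca is hypothetical, and the two records are copied from the sample above. It assumes that "instruction" and "output" are required strings and that "input" is optional.

```python
import json

# Two records copied from the dataset sample above (Alpaca format).
SAMPLE = """
[
  {"instruction": "只剩一个心脏了还能活吗？",
   "output": "能，人本来就只有一个心脏。"},
  {"instruction": "我25岁 我爸50岁 那我爸两岁就生了我啦？",
   "output": "不对，你25岁，你爸50岁，那么你爸在你出生时应该是25岁，而不是两岁。"}
]
"""


def validate_alpaca(records):
    """Return a list of problems found; an empty list means the data looks fine."""
    if not isinstance(records, list):
        return ["top-level JSON value must be a list of records"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: not a JSON object")
            continue
        # Assumption: these two keys are required and must be non-empty strings.
        for key in ("instruction", "output"):
            value = rec.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {i}: missing or empty '{key}'")
        # Assumption: "input" is optional, but must be a string when present.
        if "input" in rec and not isinstance(rec["input"], str):
            problems.append(f"record {i}: 'input' must be a string")
    return problems


records = json.loads(SAMPLE)
print(validate_alpaca(records))
```

Running the same function over a full dataset file (loaded with json.load) before training gives an early warning about malformed records instead of a failure during preprocessing.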

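Model preparation: given the requirements, the timeline, and the available hardware, we start with the Qwen2.5-7B-Instruct model. Download the model file (Qwen2.5-7B-Instruct-Q4_K_M.gguf) into the model folder (<root-dir>\qwen2.5-7B) and create an mf file (named qwen2.5b.mf) that the model runtime Ollama uses to load the local model. The contents of qwen2.5b.mf are:

# path to the gguf file
FROM <root-dir>\qwen2.5-7B\Qwen2.5-7B-Instruct-Q4_K_M.gguf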
TEMPLATE """{{ if .System }}<|im_start|>system{{ .System }}<|im_end|>{{ end }}<|im_start|>user{{ .Prompt }}<|im_end|><|im_start|>assistant"""
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"

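Model runtime preparation: Ollama is used as the framework for running the LLM. Installation is not covered here; see the references for details. Open a command prompt in the folder containing the Ollama app and run ollama --version; if a version number is printed, the installation succeeded. Move qwen2.5b.mf into the Ollama installation directory. Use the ollama run qwen2.5:7b-instruct command to have Ollama load the Qwen2.5-7B-Instruct model, then check with ollama list; if qwen2.5b:latest appears, Ollama can load the model.

Loading the model in Ollama before fine-tuning

Open a command prompt in the Ollama installation directory and run ollama run qwen2.5b to load and run the "qwen2.5:7b-instruct" model with Ollama. The model answers as follows: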

PS 头文件\Ollama> ollama run qwen2.5b
>>> 爸爸再婚，我是不是就有了个新娘？
 在中国传统文化中，“新娘”一般指的是新婚的妻子。而你成为继父或继子（女）的关系后，并不意味着你会有另一个“新娘”。
更多的是指在家庭结构中增加了新的成员或者角色。

具体来说：

1. **继父/母**：你可能会承担起对继子女的抚养和教育责任。
2. **继子/女**：可能会增加一个新的家庭成员，以及与新家庭的关系。

在情感层面或日常生活中，这种关系需要相互理解、尊重和支持。

>>> 只剩一个心脏了还能活吗？
 通常情况下，人类只有两个心脏——一个是用于体循环的左心室，另一个是用于肺循环的右心室。即使只有一个工作的心脏（通常
是左心室），人仍然可以生存。

但是，具体情况可能会有所不同：

1. **心脏疾病**：如果一个主要的心脏功能受损或被移除，另一个心脏可能无法承担全部的工作负荷，这可能导致严重的健康问题
甚至危及生命。
2. **医疗干预**：在极少数情况下，医生可能会通过手术将一个心脏的部分移除或者移植一个人工心脏来帮助患者生存。

总的来说，在自然状态下，只有一个正常的心脏是完全可以维持生命的。如果有相关健康问题，请咨询专业医生以获得准确的诊断和
建议。

>>> 我25岁 我爸50岁 那我爸两岁就生了我啦？
 这是一个有趣的逻辑谜题！实际上，如果你25岁，你爸爸50岁，那么他是在35岁时生下你的。

具体计算如下：
- 你现在的年龄：25岁
- 你父亲的年龄：50岁

所以你父亲在 \(50 - 25 = 25\) 岁时生下了你。因此，当他25岁时（比你现在小25岁），他生下了25岁的你。

总结来说：
- 现在你是25岁。
- 你爸爸是50岁。
- 所以你爸爸是在35岁时生下你的。

希望这个解释能帮助你理解！

>>> 同学说steam好玩，我现在八级烫伤了怎么办？ 救救我
 听起来你遇到了一个幽默的玩笑或者误解。Steam是一个著名的数字游戏分发平台，而不是一种可以直接导致伤害的游戏。如果你真
的遭遇了八级烫伤，请立即采取以下步骤：

1. **冷却伤口**：立即将受伤部位放在流动的冷水下冲洗至少10到20分钟，或直到疼痛减轻。
2. **不要涂抹任何东西**：避免使用冰块、黄油或其他物质，因为这些可能会加重伤害。
3. **覆盖伤口**：用干净的布轻轻覆盖烫伤区域，以保护伤口免受感染。
4. **寻求医疗帮助**：尽快去医院或联系急救服务。八级烫伤属于重度烫伤，需要专业治疗。

如果这是一个玩笑或者误解，请确保与同学沟通清楚，并表达你的担忧和需要帮助的态度。希望你能够得到妥善的处理和照顾！

>>> Send a message (/? for help)

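As shown above, the model gets the question "我25岁 我爸50岁 那我爸两岁就生了我啦？" ("I'm 25 and my dad is 50, so did he have me when he was two?") wrong: the intermediate calculation is correct, but the final answer is not (it should be 25, not 35). Similar failures also occur often in Text2SQL (NL2SQL, Natural Language to SQL), so we use this case as an example of what fine-tuning is meant to improve.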

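LLaMA-Factory configuration before fine-tuning

Dataset configuration: after downloading the Ruozhiba dataset, open the archive and locate the JSON dataset "ruozhiba_qa.json". Copy it into \LLaMA-Factory-main\data and register the dataset in LLaMA-Factory/data/dataset_info.json.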
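The registration step can be sketched in Python as follows. This is an illustration under assumptions rather than the project's own tooling: the helper register_dataset is hypothetical, and the entry uses what is assumed to be the minimal Alpaca-style form, a dataset name mapped to its file_name. The demo writes to a temporary directory so that no real installation is modified.

```python
import json
import tempfile
from pathlib import Path


def register_dataset(info_path, name, file_name):
    """Add or update one entry in a dataset_info.json-style registry file."""
    info_path = Path(info_path)
    if info_path.exists():
        registry = json.loads(info_path.read_text(encoding="utf-8"))
    else:
        registry = {}
    # Assumption: a minimal entry only needs the file name,
    # given relative to the data directory.
    registry[name] = {"file_name": file_name}
    info_path.write_text(
        json.dumps(registry, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return registry


# Demo against a temporary directory instead of LLaMA-Factory/data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "dataset_info.json"
    register_dataset(demo_path, "ruozhiba_qa", "ruozhiba_qa.json")
    saved = json.loads(demo_path.read_text(encoding="utf-8"))
    print(saved)
```

The key chosen here (ruozhiba_qa) is the name that is later selected as the dataset for training. Datasets whose fields are named differently may need extra column mappings, which this sketch does not cover.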
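LLaMA-Factory fine-tuning parameters: set the fine-tuning parameters on the command line or in the WebUI. The settings used here are listed below. Once they are set, click the start button to begin training. Watch the loss during training; if the curve does not go down, you can stop training early.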

Model name = Qwen2.5-7B-Instruct # model name
Model path = path to the Hugging Face model folder # model path
huggingface # model download source
Finetuning method = lora # fine-tuning method
Stage = Supervised Fine-Tuning # training stage
Dataset = ruozhiba_qa # dataset
Epochs = 100 # number of training epochs
Compute type = bf16 # compute type
LoRA rank = 8 # LoRA rank

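Parameter explanations:
Model name: the name of the model to be trained.
Model path: the path of the model downloaded from a source such as Hugging Face.
Finetuning method: the supported methods are lora (Low-Rank Adaptation, which fine-tunes by adding low-rank matrices to selected layers of the model), full (Full Fine-Tuning, which updates all model parameters), and freeze (Freeze Fine-Tuning, which freezes some layers and fine-tunes only specific parameters).
Stage: there are 5 training stages
Supervised Fine-Tuning: trains the pretrained base model on labeled data so that it adapts to specific tasks (such as classification or question answering).
Reward Modeling: a method for optimizing the quality of model outputs, typically used in the context of RLHF.
PPO (Proximal Policy Optimization): a reinforcement-learning-based fine-tuning method that optimizes the model's output policy.
DPO (Direct Preference Optimization): a direct fine-tuning method based on human preferences, used to train the model to produce outputs that match them.
Pre-Training: trains a new large model from scratch on large amounts of unlabeled data so that it learns general text knowledge and patterns.
Compute type: this option depends on what the GPU hardware supports. "fp16" can generally be chosen; this walkthrough uses bf16.
LoRA rank: one of the key LoRA settings. The rank affects both model quality and resource usage: a larger rank introduces more trainable parameters, which makes the model more adaptable to new data during fine-tuning, but it also raises compute and memory requirements and can lead to overfitting. Start with a small value and increase it step by step (8/10/12, etc.) while watching how model performance changes.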
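To make the link between rank and trainable parameters concrete, the sketch below counts the LoRA parameters for the seven linear modules that the training log lists as targets. The layer shapes are taken from the Qwen2Config printed in the WebUI log of this article. The function name lora_trainable_params is hypothetical, and the count assumes that each adapted layer gets one rank x in matrix and one out x rank matrix, with no extra bias terms.

```python
# Shapes taken from the Qwen2Config dump in this article's training log.
HIDDEN = 3584          # hidden_size
INTERMEDIATE = 18944   # intermediate_size
LAYERS = 28            # num_hidden_layers
HEADS = 28             # num_attention_heads
KV_HEADS = 4           # num_key_value_heads
HEAD_DIM = HIDDEN // HEADS      # assumption: head_dim = hidden_size / heads
KV_DIM = KV_HEADS * HEAD_DIM    # output width of k_proj and v_proj

# (in_features, out_features) for each linear module the log lists as a target.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank):
    """Count LoRA parameters: rank * (in + out) per adapted layer, per block."""
    per_block = sum(rank * (fan_in + fan_out) for fan_in, fan_out in TARGETS.values())
    return per_block * LAYERS


for r in (8, 16, 32):
    print(r, lora_trainable_params(r))
```

With rank 8 the count is 20,185,088, which matches the "trainable params" line in the log. Because the count is linear in the rank, doubling the rank doubles the number of trainable parameters and the adapter's memory footprint.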

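Starting fine-tuning

Click the "Start" button in the WebUI to begin training. During a normal run, a progress bar shows the training steps completed, the elapsed time, and the estimated remaining time, alongside the training log. The progress bar keeps flashing while training runs and disappears when it ends, after which the message "训练完毕。" ("Training finished.") appears. If training fails, the training log shows an error message that helps locate the problem. Below are example logs from a run that completed normally: first the terminal log, then the log shown in the WebUI after training finished.
Example terminal log from a completed LLaMA-Factory training run
Lines starting with # are annotations on the log line that follows, added to explain the log content and what it means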

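# device: cuda:0 shows that the GPU was detected and CUDA training can start; world size reflects the number of GPUs; compute dtype: torch.bfloat16 means training runs in bfloat16 precision, which saves memory and suits faster training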
[INFO|2025-10-17 14:56:31] llamafactory.hparams.parser:423 >> Process rank: 0, world size: 1, device: cuda:0, distributed training: False, compute dtype: torch.bfloat16
# Loads the fine-tuning training dataset ruozhiba_qa_input_little.json
[INFO|2025-10-17 14:56:32] llamafactory.data.loader:143 >> Loading dataset ruozhiba_qa_input_little.json...
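# The training data is in standard SFT format. The sample is parsed correctly into input_ids and label_ids, which indicates that preprocessing works, the tokenizer is compatible with the data format, and the labels are aligned correctly. label_ids holds the token IDs of the output; the prompt positions are masked with -100.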
training example:
input_ids:
[151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 103434, 46944, 103023, 34187, 104246, 75606, 101037, 11319, 151645, 198, 151644, 77091, 198, 26232, 3837, 17340, 112706, 110382, 103023, 1773, 151645, 198]
inputs:
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
只剩一个心脏了还能活吗？<|im_end|>
<|im_start|>assistant
能，人本来就只有一个心脏。<|im_end|>

label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 26232, 3837, 17340, 112706, 110382, 103023, 1773, 151645, 198]
labels:
能，人本来就只有一个心脏。<|im_end|>

[INFO|2025-10-17 14:57:19] llamafactory.model.model_utils.kv_cache:143 >> KV cache is disabled during training.
[INFO|2025-10-17 14:57:48] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[INFO|2025-10-17 14:57:48] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-10-17 14:57:48] llamafactory.model.adapter:143 >> Upcasting trainable params to float32.
# LoRA training starts normally
[INFO|2025-10-17 14:57:48] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
[INFO|2025-10-17 14:57:48] llamafactory.model.model_utils.misc:143 >> Found linear modules: down_proj,gate_proj,up_proj,v_proj,k_proj,q_proj,o_proj

INFO  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.
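# 20,185,088 parameters are trained in this run, out of 7,635,801,600 total parameters in the base model, so the trainable parameters are only 0.2643% of the total. This illustrates the benefits of LoRA fine-tuning: few trainable parameters, lower memory use, and fast training with good results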
[INFO|2025-10-17 14:57:49] llamafactory.model.loader:143 >> trainable params: 20,185,088 || all params: 7,635,801,600 || trainable%: 0.2643
[INFO|2025-10-17 15:01:14] llamafactory.train.callbacks:143 >> {'loss': 1.5699, 'learning_rate': 4.9803e-05, 'epoch': 5.00, 'throughput': 7.96}
{'loss': 1.5699, 'grad_norm': 4.0874104499816895, 'learning_rate': 4.980286753286195e-05, 'epoch': 5.0, 'num_input_tokens_seen': 1632, 'train_runtime': 205.0118, 'train_tokens_per_second': 7.961}
[INFO|2025-10-17 15:04:41] llamafactory.train.callbacks:143 >> {'loss': 0.5215, 'learning_rate': 4.9007e-05, 'epoch': 10.00, 'throughput': 7.98}
{'loss': 0.5215, 'grad_norm': 1.5765197277069092, 'learning_rate': 4.900734214192358e-05, 'epoch': 10.0, 'num_input_tokens_seen': 3280, 'train_runtime': 411.1345, 'train_tokens_per_second': 7.978}
[INFO|2025-10-17 15:08:06] llamafactory.train.callbacks:143 >> {'loss': 0.0859, 'learning_rate': 4.7621e-05, 'epoch': 15.00, 'throughput': 7.96}
{'loss': 0.0859, 'grad_norm': 0.27630168199539185, 'learning_rate': 4.762067631165049e-05, 'epoch': 15.0, 'num_input_tokens_seen': 4912, 'train_runtime': 617.0055, 'train_tokens_per_second': 7.961}
[INFO|2025-10-17 15:11:31] llamafactory.train.callbacks:143 >> {'loss': 0.0102, 'learning_rate': 4.5677e-05, 'epoch': 20.00, 'throughput': 7.97}
{'loss': 0.0102, 'grad_norm': 0.08512571454048157, 'learning_rate': 4.567701435686404e-05, 'epoch': 20.0, 'num_input_tokens_seen': 6544, 'train_runtime': 821.1588, 'train_tokens_per_second': 7.969}
[INFO|2025-10-17 15:14:57] llamafactory.train.callbacks:143 >> {'loss': 0.0033, 'learning_rate': 4.3224e-05, 'epoch': 25.00, 'throughput': 7.99}
{'loss': 0.0033, 'grad_norm': 0.04416470229625702, 'learning_rate': 4.3224215685535294e-05, 'epoch': 25.0, 'num_input_tokens_seen': 8208, 'train_runtime': 1027.3333, 'train_tokens_per_second': 7.99}
[INFO|2025-10-17 15:18:25] llamafactory.train.callbacks:143 >> {'loss': 0.0018, 'learning_rate': 4.0323e-05, 'epoch': 30.00, 'throughput': 8.01}
{'loss': 0.0018, 'grad_norm': 0.029697248712182045, 'learning_rate': 4.0322676341324415e-05, 'epoch': 30.0, 'num_input_tokens_seen': 9888, 'train_runtime': 1235.0521, 'train_tokens_per_second': 8.006}
[INFO|2025-10-17 15:21:53] llamafactory.train.callbacks:143 >> {'loss': 0.0010, 'learning_rate': 3.7044e-05, 'epoch': 35.00, 'throughput': 8.01}
{'loss': 0.001, 'grad_norm': 0.012993291951715946, 'learning_rate': 3.704384185254288e-05, 'epoch': 35.0, 'num_input_tokens_seen': 11568, 'train_runtime': 1443.7118, 'train_tokens_per_second': 8.013}
[INFO|2025-10-17 15:25:18] llamafactory.train.callbacks:143 >> {'loss': 0.0007, 'learning_rate': 3.3468e-05, 'epoch': 40.00, 'throughput': 8.01}
{'loss': 0.0007, 'grad_norm': 0.0086961779743433, 'learning_rate': 3.346844800613229e-05, 'epoch': 40.0, 'num_input_tokens_seen': 13200, 'train_runtime': 1648.2519, 'train_tokens_per_second': 8.008}
[INFO|2025-10-17 15:28:42] llamafactory.train.callbacks:143 >> {'loss': 0.0005, 'learning_rate': 2.9685e-05, 'epoch': 45.00, 'throughput': 8.01}
{'loss': 0.0005, 'grad_norm': 0.006480826064944267, 'learning_rate': 2.9684532864643122e-05, 'epoch': 45.0, 'num_input_tokens_seen': 14848, 'train_runtime': 1852.9815, 'train_tokens_per_second': 8.013}
[INFO|2025-10-17 15:32:08] llamafactory.train.callbacks:143 >> {'loss': 0.0004, 'learning_rate': 2.5785e-05, 'epoch': 50.00, 'throughput': 8.02}
{'loss': 0.0004, 'grad_norm': 0.005220182705670595, 'learning_rate': 2.578526897695321e-05, 'epoch': 50.0, 'num_input_tokens_seen': 16512, 'train_runtime': 2058.6258, 'train_tokens_per_second': 8.021}
[INFO|2025-10-17 15:35:35] llamafactory.train.callbacks:143 >> {'loss': 0.0004, 'learning_rate': 2.1867e-05, 'epoch': 55.00, 'throughput': 8.02}
{'loss': 0.0004, 'grad_norm': 0.0046224468387663364, 'learning_rate': 2.186666916089239e-05, 'epoch': 55.0, 'num_input_tokens_seen': 18176, 'train_runtime': 2265.8238, 'train_tokens_per_second': 8.022}
[INFO|2025-10-17 15:38:59] llamafactory.train.callbacks:143 >> {'loss': 0.0003, 'learning_rate': 1.8025e-05, 'epoch': 60.00, 'throughput': 8.03}
{'loss': 0.0003, 'grad_norm': 0.004301400855183601, 'learning_rate': 1.802522234901927e-05, 'epoch': 60.0, 'num_input_tokens_seen': 19824, 'train_runtime': 2469.7645, 'train_tokens_per_second': 8.027}
[INFO|2025-10-17 15:42:24] llamafactory.train.callbacks:143 >> {'loss': 0.0003, 'learning_rate': 1.4356e-05, 'epoch': 65.00, 'throughput': 8.03}
{'loss': 0.0003, 'grad_norm': 0.003765035653486848, 'learning_rate': 1.4355517710873184e-05, 'epoch': 65.0, 'num_input_tokens_seen': 21488, 'train_runtime': 2674.4326, 'train_tokens_per_second': 8.035}
[INFO|2025-10-17 15:45:55] llamafactory.train.callbacks:143 >> {'loss': 0.0003, 'learning_rate': 1.0948e-05, 'epoch': 70.00, 'throughput': 8.02}
{'loss': 0.0003, 'grad_norm': 0.0038004806265234947, 'learning_rate': 1.0947915553696742e-05, 'epoch': 70.0, 'num_input_tokens_seen': 23152, 'train_runtime': 2885.9865, 'train_tokens_per_second': 8.022}
[INFO|2025-10-17 15:49:23] llamafactory.train.callbacks:143 >> {'loss': 0.0003, 'learning_rate': 7.8863e-06, 'epoch': 75.00, 'throughput': 8.02}
{'loss': 0.0003, 'grad_norm': 0.003600555472075939, 'learning_rate': 7.886322351782783e-06, 'epoch': 75.0, 'num_input_tokens_seen': 24816, 'train_runtime': 3093.3623, 'train_tokens_per_second': 8.022}
[INFO|2025-10-17 15:52:51] llamafactory.train.callbacks:143 >> {'loss': 0.0003, 'learning_rate': 5.2461e-06, 'epoch': 80.00, 'throughput': 8.03}
{'loss': 0.0003, 'grad_norm': 0.0036511190701276064, 'learning_rate': 5.24612469060774e-06, 'epoch': 80.0, 'num_input_tokens_seen': 26496, 'train_runtime': 3301.4524, 'train_tokens_per_second': 8.026}
[INFO|2025-10-17 15:56:17] llamafactory.train.callbacks:143 >> {'loss': 0.0003, 'learning_rate': 3.0923e-06, 'epoch': 85.00, 'throughput': 8.03}
{'loss': 0.0003, 'grad_norm': 0.0035285328049212694, 'learning_rate': 3.0923329989034132e-06, 'epoch': 85.0, 'num_input_tokens_seen': 28176, 'train_runtime': 3507.221, 'train_tokens_per_second': 8.034}
[INFO|2025-10-17 15:59:45] llamafactory.train.callbacks:143 >> {'loss': 0.0003, 'learning_rate': 1.4780e-06, 'epoch': 90.00, 'throughput': 8.04}
{'loss': 0.0003, 'grad_norm': 0.003523440333083272, 'learning_rate': 1.4779807761443636e-06, 'epoch': 90.0, 'num_input_tokens_seen': 29856, 'train_runtime': 3715.0666, 'train_tokens_per_second': 8.036}
[INFO|2025-10-17 16:03:12] llamafactory.train.callbacks:143 >> {'loss': 0.0003, 'learning_rate': 4.4282e-07, 'epoch': 95.00, 'throughput': 8.03}
{'loss': 0.0003, 'grad_norm': 0.003556356066837907, 'learning_rate': 4.4281873178278475e-07, 'epoch': 95.0, 'num_input_tokens_seen': 31504, 'train_runtime': 3922.8514, 'train_tokens_per_second': 8.031}
[INFO|2025-10-17 16:06:36] llamafactory.train.callbacks:143 >> {'loss': 0.0003, 'learning_rate': 1.2336e-08, 'epoch': 100.00, 'throughput': 8.03}
{'loss': 0.0003, 'grad_norm': 0.0034833240788429976, 'learning_rate': 1.233599085671e-08, 'epoch': 100.0, 'num_input_tokens_seen': 33152, 'train_runtime': 4126.4772, 'train_tokens_per_second': 8.034}
{'train_runtime': 4127.2231, 'train_samples_per_second': 0.097, 'train_steps_per_second': 0.024, 'train_loss': 0.10991070384858176, 'epoch': 100.0, 'num_input_tokens_seen': 33152}
***** train metrics *****
  epoch                    =      100.0
  num_input_tokens_seen    =      33152
  total_flos               =  1313580GF
  train_loss               =     0.1099
  train_runtime            = 1:08:47.22
  train_samples_per_second =      0.097
  train_steps_per_second   =      0.024
Figure saved at: saves\Qwen2.5-7B-Instruct\lora\train_2025-10-17-14-25-15\training_loss.png
[WARNING|2025-10-17 16:06:37] llamafactory.extras.ploting:148 >> No metric eval_loss to plot.
[WARNING|2025-10-17 16:06:37] llamafactory.extras.ploting:148 >> No metric eval_accuracy to plot.
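The summary metrics at the end of this log can be cross-checked against the values logged along the way. The sketch below is only a sanity check with numbers copied by hand from the logs in this article. It assumes that each logged loss is an average over the 5 steps since the previous log line, so that the mean of the 20 logged values should reproduce the reported train_loss.

```python
# Values copied from the training logs in this article.
num_examples = 4          # "Num examples = 4"
per_device_batch = 2      # "Instantaneous batch size per device = 2"
grad_accum_steps = 8      # "Gradient Accumulation steps = 8"
world_size = 1            # "world size: 1"
total_steps = 100         # "Total optimization steps = 100"
epochs = 100              # "Num Epochs = 100"
train_runtime = 4127.2231 # seconds, from the final metrics line

# Loss values logged every 5 epochs (epoch 5, 10, ..., 100).
logged_losses = [1.5699, 0.5215, 0.0859, 0.0102, 0.0033, 0.0018, 0.0010,
                 0.0007, 0.0005, 0.0004, 0.0004] + [0.0003] * 9

effective_batch = per_device_batch * grad_accum_steps * world_size
samples_per_second = num_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime
mean_logged_loss = sum(logged_losses) / len(logged_losses)

print(effective_batch)               # compare with "Total train batch size"
print(round(samples_per_second, 3))  # compare with train_samples_per_second
print(round(steps_per_second, 3))    # compare with train_steps_per_second
print(round(mean_logged_loss, 4))    # compare with train_loss
```

The effective batch size (16) is larger than the dataset (4 examples), so every epoch amounts to a single optimizer step. This is why 100 epochs correspond to exactly 100 optimization steps, and why the final train_loss of 0.1099 is dominated by the first few epochs rather than reflecting the near-zero loss at the end.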

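Example log from the WebUI after training finished
Lines starting with # are annotations on the log line that follows, added to explain the log content and what it means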

训练完毕。
[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2093 >> loading file vocab.json
[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2093 >> loading file merges.txt
[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2093 >> loading file tokenizer.json
[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2093 >> loading file added_tokens.json
[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2093 >> loading file special_tokens_map.json
[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2093 >> loading file tokenizer_config.json
[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2093 >> loading file chat_template.jinja
[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2364 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2025-10-17 14:56:31] configuration_utils.py:763 >> loading configuration file D:\LLaMA-Factory\LLaMA-Factory-main\models\Qwen2.5-7B-Instruct\config.json
[INFO|2025-10-17 14:56:31] configuration_utils.py:839 >> Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "dtype": "bfloat16",
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}

[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2093 >> loading file vocab.json
[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2093 >> loading file merges.txt
[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2093 >> loading file tokenizer.json
[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2093 >> loading file added_tokens.json
[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2093 >> loading file special_tokens_map.json
[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2093 >> loading file tokenizer_config.json
[INFO|2025-10-17 14:56:31] tokenization_utils_base.py:2093 >> loading file chat_template.jinja
[INFO|2025-10-17 14:56:32] tokenization_utils_base.py:2364 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
# Loads the fine-tuning training dataset ruozhiba_qa_input_little.json
[INFO|2025-10-17 14:56:32] logging.py:143 >> Loading dataset ruozhiba_qa_input_little.json...
[INFO|2025-10-17 14:57:19] configuration_utils.py:763 >> loading configuration file D:\LLaMA-Factory\LLaMA-Factory-main\models\Qwen2.5-7B-Instruct\config.json
[INFO|2025-10-17 14:57:19] configuration_utils.py:839 >> Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "dtype": "bfloat16",
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}

[INFO|2025-10-17 14:57:19] logging.py:143 >> KV cache is disabled during training.
[WARNING|2025-10-17 14:57:19] logging.py:328 >> `torch_dtype` is deprecated! Use `dtype` instead!
[INFO|2025-10-17 14:57:19] modeling_utils.py:1169 >> loading weights file D:\LLaMA-Factory\LLaMA-Factory-main\models\Qwen2.5-7B-Instruct\model.safetensors.index.json
[INFO|2025-10-17 14:57:19] modeling_utils.py:2341 >> Instantiating Qwen2ForCausalLM model under default dtype torch.bfloat16.
[INFO|2025-10-17 14:57:19] configuration_utils.py:986 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "use_cache": false
}

[INFO|2025-10-17 14:57:48] configuration_utils.py:939 >> loading configuration file D:\LLaMA-Factory\LLaMA-Factory-main\models\Qwen2.5-7B-Instruct\generation_config.json
[INFO|2025-10-17 14:57:48] configuration_utils.py:986 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8
}

[INFO|2025-10-17 14:57:48] dynamic_module_utils.py:423 >> Could not locate the custom_generate/generate.py inside D:\LLaMA-Factory\LLaMA-Factory-main\models\Qwen2.5-7B-Instruct.
[INFO|2025-10-17 14:57:48] logging.py:143 >> Gradient checkpointing enabled.
[INFO|2025-10-17 14:57:48] logging.py:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-10-17 14:57:48] logging.py:143 >> Upcasting trainable params to float32.
[INFO|2025-10-17 14:57:48] logging.py:143 >> Fine-tuning method: LoRA
[INFO|2025-10-17 14:57:48] logging.py:143 >> Found linear modules: down_proj,gate_proj,up_proj,v_proj,k_proj,q_proj,o_proj
[INFO|2025-10-17 14:57:49] logging.py:143 >> trainable params: 20,185,088 || all params: 7,635,801,600 || trainable%: 0.2643
[WARNING|2025-10-17 14:57:49] trainer.py:906 >> The model is already on multiple devices. Skipping the move to device specified in `args`.
[INFO|2025-10-17 14:57:49] trainer.py:749 >> Using auto half precision backend
[WARNING|2025-10-17 14:57:49] trainer.py:982 >> The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
[INFO|2025-10-17 14:57:49] trainer.py:2519 >> ***** Running training *****
[INFO|2025-10-17 14:57:49] trainer.py:2520 >>   Num examples = 4
[INFO|2025-10-17 14:57:49] trainer.py:2521 >>   Num Epochs = 100
[INFO|2025-10-17 14:57:49] trainer.py:2522 >>   Instantaneous batch size per device = 2
[INFO|2025-10-17 14:57:49] trainer.py:2525 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|2025-10-17 14:57:49] trainer.py:2526 >>   Gradient Accumulation steps = 8
[INFO|2025-10-17 14:57:49] trainer.py:2527 >>   Total optimization steps = 100
[INFO|2025-10-17 14:57:49] trainer.py:2528 >>   Number of trainable parameters = 20,185,088
# Early stage: the model starts learning
[INFO|2025-10-17 15:01:14] logging.py:143 >> {'loss': 1.5699, 'learning_rate': 4.9803e-05, 'epoch': 5.00, 'throughput': 7.96}
[INFO|2025-10-17 15:04:41] logging.py:143 >> {'loss': 0.5215, 'learning_rate': 4.9007e-05, 'epoch': 10.00, 'throughput': 7.98}
# The model starts fitting the data and the loss drops quickly
[INFO|2025-10-17 15:08:06] logging.py:143 >> {'loss': 0.0859, 'learning_rate': 4.7621e-05, 'epoch': 15.00, 'throughput': 7.96}
# The model has learned the data and starts to "carve in" what it has memorized
[INFO|2025-10-17 15:11:31] logging.py:143 >> {'loss': 0.0102, 'learning_rate': 4.5677e-05, 'epoch': 20.00, 'throughput': 7.97}
[INFO|2025-10-17 15:14:57] logging.py:143 >> {'loss': 0.0033, 'learning_rate': 4.3224e-05, 'epoch': 25.00, 'throughput': 7.99}
# The loss is close to 0, so the model has entered the memory-consolidation stage
[INFO|2025-10-17 15:18:25] logging.py:143 >> {'loss': 0.0018, 'learning_rate': 4.0323e-05, 'epoch': 30.00, 'throughput': 8.01}
[INFO|2025-10-17 15:21:53] logging.py:143 >> {'loss': 0.0010, 'learning_rate': 3.7044e-05, 'epoch': 35.00, 'throughput': 8.01}
# The model has memorized all of the sample data and is now refining that memory
[INFO|2025-10-17 15:25:18] logging.py:143 >> {'loss': 0.0007, 'learning_rate': 3.3468e-05, 'epoch': 40.00, 'throughput': 8.01}
[INFO|2025-10-17 15:28:42] logging.py:143 >> {'loss': 0.0005, 'learning_rate': 2.9685e-05, 'epoch': 45.00, 'throughput': 8.01}
[INFO|2025-10-17 15:32:08] logging.py:143 >> {'loss': 0.0004, 'learning_rate': 2.5785e-05, 'epoch': 50.00, 'throughput': 8.02}
[INFO|2025-10-17 15:35:35] logging.py:143 >> {'loss': 0.0004, 'learning_rate': 2.1867e-05, 'epoch': 55.00, 'throughput': 8.02}
[INFO|2025-10-17 15:38:59] logging.py:143 >> {'loss': 0.0003, 'learning_rate': 1.8025e-05, 'epoch': 60.00, 'throughput': 8.03}
[INFO|2025-10-17 15:42:24] logging.py:143 >> {'loss': 0.0003, 'learning_rate': 1.4356e-05, 'epoch': 65.00, 'throughput': 8.03}
[INFO|2025-10-17 15:45:55] logging.py:143 >> {'loss': 0.0003, 'learning_rate': 1.0948e-05, 'epoch': 70.00, 'throughput': 8.02}
[INFO|2025-10-17 15:49:23] logging.py:143 >> {'loss': 0.0003, 'learning_rate': 7.8863e-06, 'epoch': 75.00, 'throughput': 8.02}
[INFO|2025-10-17 15:52:51] logging.py:143 >> {'loss': 0.0003, 'learning_rate': 5.2461e-06, 'epoch': 80.00, 'throughput': 8.03}
[INFO|2025-10-17 15:56:17] logging.py:143 >> {'loss': 0.0003, 'learning_rate': 3.0923e-06, 'epoch': 85.00, 'throughput': 8.03}
[INFO|2025-10-17 15:59:45] logging.py:143 >> {'loss': 0.0003, 'learning_rate': 1.4780e-06, 'epoch': 90.00, 'throughput': 8.04}
[INFO|2025-10-17 16:03:12] logging.py:143 >> {'loss': 0.0003, 'learning_rate': 4.4282e-07, 'epoch': 95.00, 'throughput': 8.03}
# The loss has settled at 0.0003: rote memorization has succeeded and the model has memorized every fine-tuning sample
[INFO|2025-10-17 16:06:36] logging.py:143 >> {'loss': 0.0003, 'learning_rate': 1.2336e-08, 'epoch': 100.00, 'throughput': 8.03}
[INFO|2025-10-17 16:06:36] trainer.py:4309 >> Saving model checkpoint to saves\Qwen2.5-7B-Instruct\lora\train_2025-10-17-14-25-15\checkpoint-100
[INFO|2025-10-17 16:06:36] configuration_utils.py:763 >> loading configuration file D:\LLaMA-Factory\LLaMA-Factory-main\models\Qwen2.5-7B-Instruct\config.json
[INFO|2025-10-17 16:06:36] configuration_utils.py:839 >> Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "dtype": "bfloat16",
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}

[INFO|2025-10-17 16:06:36] tokenization_utils_base.py:2421 >> chat template saved in saves\Qwen2.5-7B-Instruct\lora\train_2025-10-17-14-25-15\checkpoint-100\chat_template.jinja
[INFO|2025-10-17 16:06:36] tokenization_utils_base.py:2590 >> tokenizer config file saved in saves\Qwen2.5-7B-Instruct\lora\train_2025-10-17-14-25-15\checkpoint-100\tokenizer_config.json
[INFO|2025-10-17 16:06:36] tokenization_utils_base.py:2599 >> Special tokens file saved in saves\Qwen2.5-7B-Instruct\lora\train_2025-10-17-14-25-15\checkpoint-100\special_tokens_map.json
[INFO|2025-10-17 16:06:37] trainer.py:2810 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


[INFO|2025-10-17 16:06:37] trainer.py:4309 >> Saving model checkpoint to saves\Qwen2.5-7B-Instruct\lora\train_2025-10-17-14-25-15
[INFO|2025-10-17 16:06:37] configuration_utils.py:763 >> loading configuration file D:\LLaMA-Factory\LLaMA-Factory-main\models\Qwen2.5-7B-Instruct\config.json
[INFO|2025-10-17 16:06:37] configuration_utils.py:839 >> Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "dtype": "bfloat16",
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}

[INFO|2025-10-17 16:06:37] tokenization_utils_base.py:2421 >> chat template saved in saves\Qwen2.5-7B-Instruct\lora\train_2025-10-17-14-25-15\chat_template.jinja
[INFO|2025-10-17 16:06:37] tokenization_utils_base.py:2590 >> tokenizer config file saved in saves\Qwen2.5-7B-Instruct\lora\train_2025-10-17-14-25-15\tokenizer_config.json
[INFO|2025-10-17 16:06:37] tokenization_utils_base.py:2599 >> Special tokens file saved in saves\Qwen2.5-7B-Instruct\lora\train_2025-10-17-14-25-15\special_tokens_map.json
[WARNING|2025-10-17 16:06:37] logging.py:148 >> No metric eval_loss to plot.
[WARNING|2025-10-17 16:06:37] logging.py:148 >> No metric eval_accuracy to plot.
[INFO|2025-10-17 16:06:37] modelcard.py:456 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}

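Model Export and Quantization

1. Export configuration for the fine-tuned model: set the export parameters on the command line or in the Export tab of the WebUI, as follows: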

Checkpoint path = train_2025-10-17-14-25-15  # usually the name of the checkpoint folder inside the lora folder once training has finished
Export dir = <main path>\Qwen2.5-7B-Instruct-ruozhi  # export directory

Click Start Export; LLaMA Factory reports when the export has finished.
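The same export can also be driven from a configuration file on the command line. The sketch below is a minimal example, assuming the `llamafactory-cli export <config>.yaml` entry point and the option names used in LLaMA Factory's bundled LoRA merge examples (`model_name_or_path`, `adapter_name_or_path`, `template`, `finetuning_type`, `export_dir`). The relative paths and the file name `merge_ruozhi.yaml` are illustrative assumptions, so check them against your installed version.

```python
def render_yaml(options: dict) -> str:
    """Render a flat dict as simple `key: value` YAML lines."""
    lines = []
    for key, value in options.items():
        if isinstance(value, bool):  # YAML booleans are lowercase
            value = "true" if value else "false"
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


# Assumed option names; the paths are placeholders for this walkthrough.
export_options = {
    "model_name_or_path": "models/Qwen2.5-7B-Instruct",
    "adapter_name_or_path": "saves/Qwen2.5-7B-Instruct/lora/train_2025-10-17-14-25-15",
    "template": "qwen",
    "finetuning_type": "lora",
    "export_dir": "models/Qwen2.5-7B-Instruct-ruozhi",
}

if __name__ == "__main__":
    # Save the printed text as merge_ruozhi.yaml (hypothetical name), then run:
    #   llamafactory-cli export merge_ruozhi.yaml
    print(render_yaml(export_options), end="")
```

Keeping the options in a file makes the export repeatable, which helps when the same adapter has to be merged again after a retraining run.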
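2. Quantization, with the parameters configured as shown below.
Quantization converts parameters stored in a high-precision data type into a lower-precision data type. Without changing the model structure or the total number of parameters, it shrinks the model and speeds up inference.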

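Model path = <folder of the exported model, or of any model to be quantized>  # model path
Export quantization level: selected in the Export tab
Export dir = <main path>\Qwen2.5-7B-Instruct-ruozhi  # export directory in the Export tab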

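The Checkpoint path parameter is not needed here. Quantization is recommended, because otherwise the model files take up a great deal of disk space when they are loaded into Ollama. Alternatively, convert the model to the GGUF file format so that it can run on lower-end hardware.
3. Note: before deploying the exported model with Ollama, check that enough disk space is available, so that the import does not fail for lack of space.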
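To make the disk-space point concrete, the following is a rough sketch, assuming that the weights dominate the file size and that a file needs about parameters × bits / 8 bytes. Real quantized files are somewhat larger because they also store scales and metadata. The function names are illustrative, and the 7.6 billion parameter count is the approximate figure for Qwen2.5-7B.

```python
import shutil


def estimate_weight_bytes(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone: parameters x bits / 8."""
    return num_params * bits_per_param / 8


def has_free_space(path: str, needed_bytes: float) -> bool:
    """Return True if the drive holding `path` has at least `needed_bytes` free."""
    return shutil.disk_usage(path).free >= needed_bytes


if __name__ == "__main__":
    params = 7.6e9  # approximate parameter count of Qwen2.5-7B
    for bits in (16, 8, 4):
        size_gb = estimate_weight_bytes(params, bits) / 1e9
        print(f"{bits}-bit weights: about {size_gb:.1f} GB")
    needed = estimate_weight_bytes(params, 16)
    print("room for a 16-bit copy here:", has_free_space(".", needed))
```

Running the estimate before an export or an Ollama import gives an early warning when the target drive is too small.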

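Reference: converting the original model files to the GGUF format locally

Tool: llama.cpp
Key file: convert_hf_to_gguf.py
Purpose: a script in the llama.cpp project that converts Hugging Face models to the GGUF format.
Repository: https://github.com/ggml-org/llama.cpp
Installation: download the archive, or fetch the repository with git clone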

git clone https://github.com/ggerganov/llama.cpp  # download llama.cpp
cd llama.cpp  # switch to the llama.cpp folder

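Usage:

Download the files. If you downloaded an archive, extract it into a folder of your choice; if you used git clone, change into the cloned folder.
Check that the folder contains convert_hf_to_gguf.py, and that the Python environment has the packages torch, transformers, safetensors, sentencepiece, and mistral_common. If any are missing, install them with pip install torch transformers safetensors sentencepiece mistral_common. convert_hf_to_gguf.py depends on these libraries and raises an error without them.
Once the packages are installed, start the conversion with a command such as the following: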

python convert_hf_to_gguf.py "D:/LLaMA-Factory/LLaMA-Factory-main/models/Qwen2.5-7B-Instruct-ruozhi" --outfile Qwen2.5-7B-Instruct-ruozhi-gguf.gguf --outtype auto

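Command breakdown:

python: runs the Python interpreter.
convert_hf_to_gguf.py: the llama.cpp script that the interpreter executes.
"D:/LLaMA-Factory/LLaMA-Factory-main/models/Qwen2.5-7B-Instruct-ruozhi": the folder of the model to convert. Replace it with any other Hugging Face model folder or the folder of a model you have trained; it must contain the model weights, configuration files, and so on. When copying a folder path on Windows, pay attention to the direction of "/" and "\".
--outfile: specifies the output file name.
Qwen2.5-7B-Instruct-ruozhi-gguf.gguf: the output file name. Any name can be used, but the .gguf extension is required.
--outtype auto: specifies the output type. auto lets the script choose the type automatically and is the usual choice. An explicit format can also be selected from 'f32', 'f16', 'bf16', 'q8_0', 'tq1_0', 'tq2_0', 'auto'.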
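The command above can also be assembled programmatically, which helps avoid the slash-direction and file-extension mistakes just mentioned. This is a minimal sketch: the helper name `build_convert_cmd` is hypothetical, the list of output types is copied from the breakdown above, and the script is assumed to be run from inside the llama.cpp folder.

```python
import subprocess
import sys
from pathlib import PureWindowsPath

OUTTYPES = ("f32", "f16", "bf16", "q8_0", "tq1_0", "tq2_0", "auto")


def build_convert_cmd(model_dir: str, outfile: str, outtype: str = "auto") -> list:
    """Build the argument list for convert_hf_to_gguf.py."""
    if outtype not in OUTTYPES:
        raise ValueError(f"unsupported outtype: {outtype}")
    if not outfile.endswith(".gguf"):
        raise ValueError("outfile must end with .gguf")
    # PureWindowsPath turns backslashes into forward slashes on any OS.
    model_dir = PureWindowsPath(model_dir).as_posix()
    return [sys.executable, "convert_hf_to_gguf.py", model_dir,
            "--outfile", outfile, "--outtype", outtype]


if __name__ == "__main__":
    cmd = build_convert_cmd(
        r"D:\LLaMA-Factory\LLaMA-Factory-main\models\Qwen2.5-7B-Instruct-ruozhi",
        "Qwen2.5-7B-Instruct-ruozhi-gguf.gguf",
    )
    print(" ".join(cmd))
    # Uncomment to run the conversion from inside the llama.cpp folder:
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string sidesteps shell quoting problems with paths that contain spaces.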

When log output like the following appears, the model conversion has completed successfully.

INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:Qwen2.5-7B-Instruct-ruozhi-gguf.gguf: n_tensors = 339, total_size = 15.2G
Writing: 100%|██████████████████████████████████████████████████████████████████| 15.2G/15.2G [01:56<00:00, 130Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to Qwen2.5-7B-Instruct-ruozhi-gguf.gguf

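Although the file size barely changes after conversion, a GGUF file is much easier to import into Ollama and to run on hardware with little GPU memory.

Model Deployment

1. Taking deployment of the GGUF file to Ollama as an example, first prepare a Modelfile (Qwen2.5-7B-Instruct-ruozhi.mf) in the following form: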

FROM <main directory>\models\Qwen2.5-7B-Instruct-ruozhi-gguf
TEMPLATE """{{ if .System }}<|im_start|>system{{ .System }}<|im_end|>{{ end }}<|im_start|>user{{ .Prompt }}<|im_end|><|im_start|>assistant"""
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"

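Once the file is ready, copy it into the folder where Ollama is located.
Note: a model exported by LLaMA Factory already includes a Modelfile among its files, which can be used directly.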
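When several fine-tuned variants need to be deployed, the Modelfile can be generated rather than written by hand. The sketch below reuses the TEMPLATE and stop parameters shown above verbatim; the function name `render_modelfile` and the example GGUF path are assumptions made for illustration.

```python
# Template copied from the Modelfile above (Qwen chat markers).
QWEN_TEMPLATE = (
    "{{ if .System }}<|im_start|>system{{ .System }}<|im_end|>{{ end }}"
    "<|im_start|>user{{ .Prompt }}<|im_end|><|im_start|>assistant"
)


def render_modelfile(gguf_path: str,
                     template: str = QWEN_TEMPLATE,
                     stops: tuple = ("<|im_start|>", "<|im_end|>")) -> str:
    """Return the text of an Ollama Modelfile for one GGUF file."""
    lines = [f"FROM {gguf_path}", f'TEMPLATE """{template}"""']
    lines += [f'PARAMETER stop "{stop}"' for stop in stops]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical path; point it at your own converted file.
    print(render_modelfile("./Qwen2.5-7B-Instruct-ruozhi-gguf.gguf"), end="")
```

The printed text can be saved under any file name and passed to ollama create with the -f option.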
2. Open a terminal in the Ollama folder and run ollama list to see which models have already been imported. If your model is not listed, import it with the following command:

ollama create Qwen2.5-7B-Instruct-ruozhi -f Qwen2.5-7B-Instruct-ruozhi.mf

Run ollama list again to check whether the import succeeded. If output like the following appears, the import succeeded and the model is deployed.

NAME                                 ID              SIZE     MODIFIED
Qwen2.5-7B-Instruct-ruozhi:latest    b5344c89221e    15 GB    4 seconds ago

After the import succeeds and the model is deployed, start it with the command ollama run Qwen2.5-7B-Instruct-ruozhi.
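The check with ollama list can be scripted as well. The following sketch assumes the tabular output shown above, with the model name in the first column and `:latest` appended when no tag is given. Both function names are hypothetical.

```python
def model_names(list_output: str) -> list:
    """Extract the NAME column from `ollama list` output (header skipped)."""
    rows = [line.split() for line in list_output.splitlines()[1:]]
    return [row[0] for row in rows if row]


def is_imported(list_output: str, name: str) -> bool:
    """True if `name` (with or without a tag) appears in the listing."""
    wanted = name if ":" in name else f"{name}:latest"
    return wanted in model_names(list_output)


if __name__ == "__main__":
    sample = (
        "NAME ID SIZE MODIFIED\n"
        "Qwen2.5-7B-Instruct-ruozhi:latest b5344c89221e 15 GB 4 seconds ago\n"
    )
    print(is_imported(sample, "Qwen2.5-7B-Instruct-ruozhi"))
    # With Ollama installed, the listing could be captured like this:
    #   import subprocess
    #   out = subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout
```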
3. Inference results after fine-tuning:

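>>> My dad is remarrying, so do I get a new bride?
 No. "Groom" refers to the man in a new marriage, whereas you have merely become a stepfather or stepmother; that does not make you a bride.

>>> Can I still live with only one heart left?
 Yes, people only have one heart to begin with.

>>> I'm 25 and my dad is 50, so did my dad have me when he was two?
 No. You are 25 and your dad is 50, so your dad had you when you were 1 (not 2).

>>> My classmate said steam is fun, and now I have eighth-degree burns. What do I do? Help me
 "steam" here has two possible meanings: water vapor, or the video game platform Steam. If the injury was caused by water vapor, go to the hospital immediately; if it is the video game platform Steam, the Steam platform itself does not harm the human body. Please confirm which situation applies.

>>> My dad is remarrying, so do I get a new bride?
 No, you have a stepmother. "Bride" refers to the woman in a new marriage; when your dad remarries, you get a stepmother, not a bride.

>>> I'm 25 and my dad is 50, so did my dad have me when he was two?
 Wrong. Your dad should have been 25 when you were born, not two. He is 50 now and you are 25, so he is 25 years older than you, meaning that when you were 23 he was 50.

>>> My classmate said steam is fun, and now I have eighth-degree burns. What do I do? Help me
 "Steam" is a video game distribution platform; it does not cause physical injury. So you need not worry about being burned; just stop using Steam for now and go to the hospital to have the burn treated.

>>> I'm 25 and my dad is 50, so did my dad have me when he was two?
 Wrong. Your dad should have been 25 when you were born, not two. He is 50 now and you are 25, so he is 25 years older than you, meaning that when you were 0 he was 25.

>>> My classmate said steam is fun, and now I have eighth-degree burns. What do I do? Help me
 "Steam" is a video game platform and does not cause physical injury. But if you really were scalded by steam, you should go to the hospital for treatment immediately.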

>>> Send a message (/? for help)

Below are the corresponding training examples from the Ruozhiba (弱智吧) fine-tuning dialogue dataset:

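instruction	output
Can I still live with only one heart left?	Yes, people only have one heart to begin with.
My dad is remarrying, so do I get a new bride?	No, you have a stepmother. "Bride" refers to the woman in a new marriage; when your dad remarries, his new wife is your stepmother.
I'm 25 and my dad is 50, so did my dad have me when he was two?	Wrong. You are 25 and your dad is 50, so your dad should have been 25 when you were born, not two.
My classmate said steam is fun, and now I have eighth-degree burns. What do I do? Help me	"Steam" is a video game distribution platform, not water vapor. Go to the hospital immediately, and get your head checked while you are there.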

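Comparing these outputs with the Ruozhiba training examples shows that the fine-tuned model's answers are clearly influenced by the fine-tuning data, and that they are more concise than before. Given the fine-tuning settings, the amount of training data, and the number of iterations used here, better results can reasonably be expected with more powerful GPU hardware.

Notes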

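1. Avoid training large models on Windows if you can. Some models, frameworks, and platforms do run on Windows, but it is like hauling a trailer with a handcart: possible, but far too much effort. The assorted configuration errors are maddening, and the experience on Linux is much better. A WSL2 + Ubuntu + Conda environment is recommended. Pay attention to network connectivity inside Ubuntu, and avoid combining native Windows Python with multithreading.
2. Ollama uses the GPU when the model answers a question. GPU consumption can be checked by watching the GPU performance graphs in Task Manager.
3. Ollama can load a model from a GGUF file, but fine-tuning in LLaMA-Factory requires the complete model files. When loading from local files, the model path is the local folder that holds them, and that folder must contain the full set of files downloaded from the model's page on Hugging Face or ModelScope. Taking Hugging Face as the download source, the files should include config.json, vocab.json, merges.txt, tokenizer.json, added_tokens.json, special_tokens_map.json, tokenizer_config.json, chat_template.jinja, and so on. They all come from the Files and versions page of Qwen/Qwen2.5-7B-Instruct on Hugging Face and must be complete, official, and unmodified; also check that model_type is not missing from config.json. After downloading, in the LLaMA Factory WebUI select Qwen2.5-7B-Instruct as the model, set the model path to the local folder holding the Qwen/Qwen2.5-7B-Instruct files, and select huggingface as the download source.
4. When preparing data, a custom fine-tuning dataset can follow the format below. The fields (instruction, input, system, history, output) may be left empty but must be present; otherwise LLaMA-Factory reports KeyError: 'input' in the log after startup, indicating that the data is missing a field.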

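[
    {
      "instruction": "Can I still live with only one heart left?",
      "input": "",
      "system": "",
      "history": [],
      "output": "Yes, people only have one heart to begin with."
    },
    {
      "instruction": "My dad is remarrying, so do I get a new bride?",
      "input": "",
      "system": "",
      "history": [],
      "output": "No, you have a stepmother. \"Bride\" refers to the woman in a new marriage; when your dad remarries, his new wife is your stepmother."
    }
]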
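Because a missing field only surfaces as a KeyError once training starts, it can be worth checking the file beforehand. The following is a small sketch, assuming the five-field record layout shown above; the function names are hypothetical and not part of LLaMA-Factory.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "system", "history", "output")


def missing_keys(record: dict) -> list:
    """List the required keys that are absent from one record."""
    return [key for key in REQUIRED_KEYS if key not in record]


def validate_dataset(records: list) -> dict:
    """Map record index -> missing keys, for incomplete records only."""
    problems = {}
    for index, record in enumerate(records):
        missing = missing_keys(record)
        if missing:
            problems[index] = missing
    return problems


if __name__ == "__main__":
    raw = '[{"instruction": "q", "input": "", "system": "", "history": [], "output": "a"}, {"instruction": "q2", "output": "a2"}]'
    print(validate_dataset(json.loads(raw)))
```

An empty result means every record carries all five fields, even if some of them are empty strings or empty lists.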

Alternatively, find dataset_info.json in the data folder of LLaMA-Factory and, following the format below, add or modify the configuration of the fine-tuning dataset in dataset_info.json:

"wikiqa": {
  "hf_hub_url": "wiki_qa",
  "columns": {
    "prompt": "question",
    "response": "answer"
  }
},
"webqa": {
  "hf_hub_url": "suolyer/webqa",
  "ms_hub_url": "AI-ModelScope/webqa",
  "columns": {
    "prompt": "input",
    "response": "output"
  }
}

A local dataset file can be registered in the same style:

{
  "your_dataset_name": {
    "file_name": "your_data.jsonl",
    "columns": {
      "prompt": "question",
      "response": "answer"
    },
    "ranking": false
  }
}

columns.prompt names the field that holds the question; columns.response names the field that holds the answer.
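Editing dataset_info.json by hand is error-prone, so the entry can also be added with a few lines of code. This is a minimal sketch that follows the `columns` mapping used by the `wikiqa` and `webqa` entries above; the helper name `add_dataset_entry` and the example dataset name and file are assumptions for illustration.

```python
import json


def add_dataset_entry(info: dict, name: str, file_name: str,
                      prompt_col: str, response_col: str) -> dict:
    """Return a copy of `info` with one local-file dataset entry added."""
    if name in info:
        raise KeyError(f"dataset already registered: {name}")
    updated = dict(info)
    updated[name] = {
        "file_name": file_name,
        "columns": {"prompt": prompt_col, "response": response_col},
    }
    return updated


if __name__ == "__main__":
    # In practice `existing` would be loaded from data/dataset_info.json.
    existing = {"wikiqa": {"hf_hub_url": "wiki_qa",
                           "columns": {"prompt": "question", "response": "answer"}}}
    merged = add_dataset_entry(existing, "your_dataset_name", "your_data.jsonl",
                               "question", "answer")
    print(json.dumps(merged, indent=2, ensure_ascii=False))
```

Returning a copy instead of modifying the input keeps the original contents available if the new entry turns out to be wrong.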
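5. Watch the sample count of the fine-tuning dataset. If the gap between the model's parameter count (7.6 billion) and the number of fine-tuning samples (4) is too large, severe overfitting is almost inevitable: the model memorizes the samples by rote and loses its ability to generalize.
6. Before training, enlarge the system's virtual memory. Otherwise NumPy and the Hugging Face datasets library, which load large amounts of data into memory at once while loading and mapping the data, may crash because virtual memory is too small. As for how much to allocate, first account for the memory needed to load the base model itself (Qwen2.5-7B alone needs more than 10 GB), plus the extra memory used during data processing. Set an initial size and a maximum size for virtual memory and let the system adjust dynamically within that range.
7. Before the real run, verify that LLaMA-Factory, the base model, and the fine-tuning dataset are ready and that the fine-tuning pipeline runs end to end, by first shrinking the training iteration settings and the dataset (a small trial fine-tune, for example). This may cause overfitting or undertraining, but the goal is only to get the pipeline running. Doing one such pass before formal training surfaces problems early and reveals the characteristics and issues of LLaMA-Factory, the base model, and the dataset.
8. Watch the log during training. ***** Running training ***** means training has started; a line such as logging.py:143 >> {'loss': shows that training is producing feedback, which is the key sign that it is really under way. When reviewing the metrics, a loss between 1.0 and 3.0 is in the normal training range, below 0.1 is regarded as "excellent", and below 0.01 indicates overfitting. A throughput of 7-8 tokens/sec, that is, 7-8 tokens processed per second, is a reasonable speed for a 7B model in this setup. Keep reviewing the training curve throughout: check that loss keeps falling and that learning_rate decays as expected. If learning_rate is still decreasing but loss is stuck near a very low value, the model is no longer learning and is merely memorizing the data.
9. During training, follow the output of logging.py. A sample line looks like {'loss': 1.5699, 'learning_rate': 4.9803e-05, 'epoch': 5.00, 'throughput': 7.96}; watch how loss, learning_rate, epoch, and throughput change. If loss falls steadily with no oscillation or rebound, the model is learning quickly and training is stable; its change over training steps can be viewed in the loss plot beside the WebUI. If learning_rate declines smoothly with no abrupt changes, the learning-rate schedule is working normally. In this run, epoch advanced about once every 3-4 minutes at a steady rhythm of 5, 10, 15, which shows that training is progressing without errors or interruptions. A throughput that stays at 7-8 tokens/sec indicates stable GPU utilization with no resource bottleneck and fairly efficient training. GPU memory usage in Task Manager also shows whether GPU resources are constrained: dedicated GPU memory is usually fully used and shared GPU memory partly used, both curves stay smooth and stable during training, both drop to zero when training ends, and CPU and memory usage fall back to their pre-training levels.
10. If overfitting or a training crash shows up during training, intervene promptly and stop the run (Ctrl + C).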
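The log checks described in the notes above can be partly automated. The sketch below assumes metric lines of the form shown in note 9, a Python-style dict following `>>`, and applies the rough loss thresholds from note 8. The function names and labels are illustrative, and the thresholds are the rules of thumb from this article rather than general limits.

```python
import ast
import re

LOG_DICT = re.compile(r"\{.*\}")


def parse_metrics(line: str):
    """Return the metrics dict embedded in a log line, or None."""
    match = LOG_DICT.search(line)
    if not match:
        return None
    try:
        metrics = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None
    if isinstance(metrics, dict) and "loss" in metrics:
        return metrics
    return None


def loss_status(loss: float) -> str:
    """Label a loss value using the rough thresholds from the notes above."""
    if loss < 0.01:
        return "possible overfitting"
    if loss < 0.1:
        return "very low"
    if 1.0 <= loss <= 3.0:
        return "normal range"
    return "not covered by these thresholds"


if __name__ == "__main__":
    line = ("logging.py:143 >> {'loss': 1.5699, 'learning_rate': 4.9803e-05, "
            "'epoch': 5.00, 'throughput': 7.96}")
    metrics = parse_metrics(line)
    print(metrics["loss"], loss_status(metrics["loss"]))
```

Feeding each new log line through `parse_metrics` makes it easy to stop a run early once the loss label changes to a warning.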