一、环境准备

本机显卡：NVIDIA RTX 5070

显存：12G（无法流畅运行 FP16 (16GB) 模型，需进行 Int4 量化）

内存：32G

操作系统：Windows11

基座模型：Qwen3-8B
训练语料：NekoQA-10K
训练框架：LLaMA-Factory
模型转换：llama.cpp (编译版)
推理引擎：Ollama (Windows版)
前端界面：Open WebUI

二、环境部署

Ollama & Open WebUI

在官网下载安装Ollama

默认安装路径：C:\Users\[用户名]\AppData\Local\Programs\Ollama
默认模型存储路径：C:\Users\[用户名]\.ollama\models

将模型路径放置自定义目录：E:\project\ollama_storage

如果 C 盘已有下载好的模型，需将 C:\Users\[用户名]\.ollama\models 下的 blobs 和 manifests 两个文件夹剪切到新路径中。

启动Ollama

ollama serve

# 检查是否正确读取 模型
ollama list

Open WebUI 部署

使用 Docker Desktop 进行部署

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

三、模型微调训练

下载QWen3-8b模型文件

创建 llmModel 文件夹，在文件夹下创建 download.py 文件

from modelscope import snapshot_download
# 下载 Qwen2-7B-Instruct (根据你实际想用的模型修改)
model_dir = snapshot_download('Qwen/Qwen3-8B', cache_dir='./models')
print(f"模型下载到了: {model_dir}")

检查模型完整性

import transformers
import torch

model_id = "/mnt/e/project/llmModel/models/Qwen/Qwen3-8B"

# === 修改点 1: 引入 BitsAndBytesConfig ===
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
# ========================================

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "quantization_config": quantization_config, # 使用 4bit 配置
        # "torch_dtype": torch.bfloat16,  <-- 这行删掉，解决那个 warning
    },
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Qwen 的结束符处理
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.pad_token_id
]
terminators = [t for t in terminators if t is not None]

print("开始生成...") # 加个提示，确认在运行

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    pad_token_id=pipeline.tokenizer.eos_token_id
)

# === 修改点 2: 打印完整内容以便调试 ===
print("-" * 20)
print("完整输出:")
print(outputs[0]["generated_text"]) 
print("-" * 20)

LLama_Factory安装

拉取项目

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory

创建conda环境&安装项目依赖

conda create -n llama_factory python=3.11 -y
conda activate llama_factory
pip install -e .[torch,metrics]
pip install bitsandbytes
pip install --no-deps "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

设置数据集

在 data/dataset_info.json 添加

 "neko_qa": {
    "hf_hub_url": "liumindmind/NekoQA-10K",
    "columns": {
      "prompt": "instruction",
      "response": "output"
    }
  },

启动并训练

llamafactory-cli webui

训练Loss曲线

模型测试

创建 test.py 加载 Base 模型和 LoRA 权重进行对话测试

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# --- 配置路径 ---
base_model_path = "/mnt/e/project/llmModel/models/Qwen/Qwen3-8B" # 你的实际路径
lora_path = "./saves/Qwen3-8B-Base/lora/neko_finetune_v1"
# ----------------

print("1. 正在加载 Tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)

print("2. 正在加载基座模型 (使用 4-bit 量化以节省显存)...")
# 使用 4bit 量化配置，显存占用会从 16G 降到 6G 左右
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

try:
    model = AutoModelForCausalLM.from_pretrained(
        base_model_path,
        device_map="auto",
        quantization_config=quantization_config, # 启用量化
        trust_remote_code=True
    )
except ImportError:
    print("❌ 错误: 缺少 bitsandbytes 库。请先运行: pip install bitsandbytes")
    exit()

print(f"3. 正在挂载 LoRA 权重: {lora_path}")
model = PeftModel.from_pretrained(model, lora_path)

print("\n✅ 模型加载完成！")
print("-" * 30)

while True:
    query = input("\n用户: ")
    if query.strip().lower() == "exit":
        break
    
    if not query.strip():
        continue

    print("   (正在思考中...请稍候)") # 添加提示

    # 1. 构造 Prompt
    messages = [
        {"role": "system", "content": "你是一只可爱的猫娘。"}, # 确保有 System Prompt
        {"role": "user", "content": query}
    ]
    
    try:
        # 2. 编码
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        model_inputs = tokenizer([text], return_tensors="pt").to("cuda") # 强制指定 cuda

        # 3. 生成
        with torch.no_grad():
            generated_ids = model.generate(
                **model_inputs,
                max_new_tokens=512,
                temperature=0.7,
                top_p=0.9
            )

        # 4. 解码
        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]
        response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

        print(f"Neko: {response}")

    except Exception as e:
        print(f"❌ 发生错误: {e}")

LoRA 权重合并

创建一个名为 merge_model.py 的文件，填入以下代码：

合并权重

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import os

# --- 配置路径 ---
base_model_path = "/mnt/e/project/llmModel/models/Qwen/Qwen3-8B" # 你的基座路径
lora_path = "./saves/Qwen3-8B-Base/lora/neko_finetune_v1"
output_dir = "./merged_neko_model"
# ----------------

print(f"1. Loading Base Model: {base_model_path}")
# 关键修改：强制使用 CPU 加载，彻底避开显存限制
# 注意：这需要你有足够的系统内存 (RAM > 16GB)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16, 
    device_map="cpu",  # <--- 改为 "cpu" (之前是 "auto")
    low_cpu_mem_usage=True,
    trust_remote_code=True
)

print(f"2. Loading LoRA Adapter: {lora_path}")
# 既然基座在 CPU，LoRA 也会自动加载到 CPU
model = PeftModel.from_pretrained(base_model, lora_path)

print("3. Merging Weights (Running on CPU)...")
model = model.merge_and_unload() 

print(f"4. Saving to {output_dir}...")
model.save_pretrained(output_dir)

# 记得也要保存 Tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
tokenizer.save_pretrained(output_dir)

print("✅ 合并完成！")

转换为 GGUF 格式

下载llama.cpp

cd ..
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

安装编译工具

# 安装编译工具
apt-get update && apt-get install cmake build-essential

# 1. 生成构建配置 (创建一个 build 文件夹)
cmake -B build

# 2. 开始编译 (使用 -j 让多核 CPU 一起跑，速度更快)
cmake --build build --config Release -j 8

格式转换 (HF -> GGUF FP16)

# 注意：/mnt/e/project/LLaMA-Factory/merged_neko_model 是你刚才合并输出的文件夹路径
python convert_hf_to_gguf.py /mnt/e/project/LLaMA-Factory/merged_neko_model --outfile neko-fp16.gguf

量化 (Quantize)

# 注意前面的 build/bin/
./build/bin/llama-quantize neko-fp16.gguf neko-q4.gguf Q4_K_M

四、模型导出到Ollama

编写 Modelfile 文件

E:\project\ollama_storage\
├── blobs\       <-- 这里面是一堆乱码名字的大文件
├── manifests\   <-- 这里面是模型清单
└── neko 
    ├── Modelfile
    ├── neko-fp16.gguf
    └── neko-q4.gguf

FROM ./neko-q4.gguf

# 1. 模板 (保持标准 ChatML)
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""

# --- 2. 关键修改：字数限制 ---
# ⚠️ 注意：思考过程(Think)也是算字数的！
# 128太短了，模型可能还在思考就被掐断了，导致没有正文。
# 改为 300-400 是比较安全的区间 (既能容纳简短思考，又能物理阻止长篇刷屏)
PARAMETER num_predict 350

# --- 3. 核心修改：刹车系统 ---
# ❌ 绝对不要加 stop "</think>" (否则会吞掉回答)
# ❌ 绝对不要加 stop "<think>" (否则会吞掉思考)

# ✅ 必须加的标准结束符
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|endoftext|>"

# ✅ 必须加的防自言自语符 (穷举法)
PARAMETER stop "User:"
PARAMETER stop "user:"
PARAMETER stop "用户:"
PARAMETER stop "Master:"
PARAMETER stop "Human:"
# 防止换行后偷偷扮演
PARAMETER stop "\nUser"
PARAMETER stop "\n用户"

# --- 4. 惩罚与创造性 ---
# 稍微严格一点，防止废话
PARAMETER repeat_penalty 1.2
PARAMETER repeat_last_n 64
PARAMETER temperature 0.6
PARAMETER top_p 0.9

# --- 5. 系统提示 (明确思考格式) ---
# 告诉模型：先想，再说话。WebUI 会自动识别这个格式。
SYSTEM """你是一只猫娘Neko。
【格式指令】
1. 你的回答必须包含思考过程，格式为：<think>思考内容</think> 你的回答。
2. 思考结束后，**必须**输出 </think>，然后开始扮演Neko说话。
3. 回复必须简短可爱，严禁模仿User说话。"""

启动模型

cd E:\project\ollama_storage\neko

# 创建模型
ollama create neko -f Modelfile

# 运行
ollama run neko

菜单

微调Qwen-8B（自定义猫娘Neko大模型）