0. Preface
Edited on 2024-04-10. If you run into problems, compare against the official changelogs as of this date to make troubleshooting easier.
Hugging Face website: https://huggingface.co/
Llama 2 repository: https://github.com/meta-llama/llama
Llama-2-7b-hf model: https://huggingface.co/meta-llama/Llama-2-7b-hf
1. Environment preparation
NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0
Two V100S GPUs with 32 GB of memory each
In testing, running the Llama-2-7b-hf model requires at least two cards; with two cards the minimum memory usage is about 17 GB on each card.
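Before launching, it can help to confirm what the cards actually report. A minimal sketch (assuming PyTorch is available, e.g. inside the image built below) that lists each visible GPU and its total memory:
import torch

# Print every visible GPU with its total memory; Llama-2-7b-hf on two cards
# used roughly 17 GB per card in the test above.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB total")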
A Singularity image is used here (think of it as a Docker image); the image definition file is as follows:
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:23.11-py3
%post
apt-get update
apt-get upgrade -y
apt-get install -y git
chmod -R 777 /root
pip install accelerate appdirs loralib bitsandbytes black "black[jupyter]" datasets fire peft "transformers>=4.34.1" sentencepiece py7zr scipy optimum matplotlib gradio
cd /opt
git clone https://github.com/meta-llama/llama-recipes.git
cd llama-recipes
git reset --hard '37c8f722116493e69ea99420b3d73287905a46d0' # there is no release tag, so this specific commit is used
chmod -R 777 /opt/llama-recipes
2. Download the Hugging Face model
Method 1: use the huggingface_hub tool
huggingface_hub can also be used to download datasets.
Install the dependency:
pip install huggingface_hub # requires Python >= 3.8
Run the following in a Python interactive session.
This method is recommended: run it in an interactive terminal, not as a script. A script only shows the overall download progress, not the per-file progress.
>>> import os
>>> os.environ["http_proxy"] = "http://127.0.0.1:7890"
>>> os.environ["https_proxy"] = "http://127.0.0.1:7890"  # On Python 3.8 the download fails when going through a proxy without this setting (fixed in Python 3.9, where it is not needed)
>>> from huggingface_hub import snapshot_download
>>> snapshot_download(repo_id="meta-llama/Llama-2-7b-hf", resume_download=True, local_dir_use_symlinks=False, local_dir=r"D:\llm-llama2", token="*********")
# On Windows it is best to set local_dir_use_symlinks to False, otherwise both the cache directory and local_dir keep a full copy
# Some models require logging in with a token to download, others do not
Download from the command line:
# Download the model via the CLI
huggingface-cli download --resume-download meta-llama/Llama-2-7b-hf --local-dir ./llm-llama2
Method 2: use the hfd tool for fast Hugging Face model downloads (recommended)
Advantages: no proxy required, and downloads are fast.
Mirror site: https://hf-mirror.com/
Steps:
Install the huggingface_hub tool:
pip install -U huggingface_hub
Download hfd:
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
Set the mirror endpoint, which makes downloads from within China much faster (a huggingface_hub variant is sketched after these steps):
# Linux
export HF_ENDPOINT=https://hf-mirror.com
# Windows (PowerShell)
$env:HF_ENDPOINT = "https://hf-mirror.com"
Download the model:
./hfd.sh <model_name> --tool aria2c -x 16
# -x sets the number of aria2c download threads (default: 4)
# aria2c must be installed separately, e.g. yum install aria2
If aria2 is not available, use --tool wget instead:
./hfd.sh <model_name> --tool wget --local-dir <output_dir>
To download a model that requires a token:
./hfd.sh meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME --hf_token hf_*** --tool aria2c -x 16
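The mirror endpoint is not tied to hfd.sh; it also works with the huggingface_hub API from method 1. A minimal sketch, assuming HF_ENDPOINT is set before huggingface_hub is imported (the library reads it at import time):
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # must be set before importing huggingface_hub

from huggingface_hub import snapshot_download
# Same call as in method 1, now routed through the mirror; gated models still need a token
snapshot_download(repo_id="meta-llama/Llama-2-7b-hf",
                  resume_download=True,
                  local_dir="./llm-llama2",
                  token="*********")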
Method 3: download with ModelScope (recommended)
Docs: https://www.modelscope.cn/docs/models/download
Install modelscope:
pip install modelscope
Download an entire model repo to a specified directory:
modelscope download --model deepseek-ai/Deepseek-R1-Distill-Llama-70B --local_dir ./huggingface_models/Deepseek-R1-Distill-Llama-70B
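ModelScope also exposes a Python API. A minimal sketch, assuming the snapshot_download helper exported by the modelscope package (check the docs linked above for the exact options of your installed version); the model id is the same one used in the CLI example:
from modelscope import snapshot_download

# Download the whole repo into cache_dir and return the local path it was saved to
model_dir = snapshot_download("deepseek-ai/Deepseek-R1-Distill-Llama-70B",
                              cache_dir="./huggingface_models")
print(model_dir)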
Method 4: download with transformers
Download a pretrained model from Hugging Face:
from transformers import AutoModel, AutoTokenizer
model_name = "meta-llama/Llama-2-7b-hf" # Replace with the desired model's name
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
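Note that from_pretrained only writes into the Hugging Face cache; to get an explicit directory that can later be passed to --model_name, the weights can be re-saved. A small follow-up sketch (for text generation AutoModelForCausalLM is the usual class; the save directory is just an example):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
save_dir = "./llm-llama2"  # example local directory

# Download (or reuse the cache) and write a self-contained copy to save_dir
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)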
3. Download the fine-tuning dataset
Load and preprocess your dataset, and make sure it is tokenized with the tokenizer that matches the pretrained model.
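As a concrete example of that preprocessing step, a minimal sketch that tokenizes a dataset with the matching Llama 2 tokenizer via datasets.map (the "text" column and max_length are assumptions; GuanacoDataset actually stores instruction-style fields, so build a single string per record first if needed):
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
dataset = load_dataset("JosephusCheung/GuanacoDataset", token="*********")

def tokenize(batch):
    # Assumes a "text" column; adapt this to the actual fields of your dataset
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)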
Downloading the dataset
Download the dataset from Hugging Face.
Method 1: download with huggingface_hub (similar to git clone)
# First download the source dataset:
>>> from huggingface_hub import snapshot_download
>>> snapshot_download(repo_id="JosephusCheung/GuanacoDataset", repo_type="dataset", resume_download=True, local_dir=r"D:\datasets\test", token="*********")
>>> from datasets import load_dataset
>>> dataset = load_dataset(r"D:\datasets\GuanacoDataset")
Method 2: download with datasets
Install the datasets package:
pip install datasets
# To work with audio datasets, install the Audio feature:
pip install "datasets[audio]"
# To work with image datasets, install the Image feature:
pip install "datasets[vision]"
Load and save to disk:
>>> from datasets import load_dataset, load_from_disk
>>> from transformers import DataCollatorWithPadding
>>> dataset = load_dataset("JosephusCheung/GuanacoDataset", token="***********")
>>> dataset.save_to_disk(r"D:\datasets\GuanacoDataset")  # Write to disk: saves to the given directory, consolidating the dataset into a new format
>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
Loading the dataset
Load online from Hugging Face:
>>> from datasets import load_dataset
>>> dataset = load_dataset("JosephusCheung/GuanacoDataset", token="***********")
Load from local disk:
>>> from datasets import load_from_disk
>>> dataset = load_from_disk(r"D:\datasets\GuanacoDataset")  # Read the local dataset
4. Fine-tune the model with llama-recipes
Single node, multiple GPUs
Training time: with batch size 1, 1 epoch took 10 hours 32 minutes on two V100S GPUs. On plain V100s expect at least double that time.
The Slurm script used is as follows:
#!/bin/bash
#SBATCH --job-name='LLM_2gpu_finetuning_batch_1'
#SBATCH --chdir=/share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2
#SBATCH --partition=c4
#SBATCH --nodes=1
#SBATCH --time=4-00:00
#SBATCH --mincpus=32
#SBATCH --gres=gpu:2
export SLURM_OVERLAP=yes
module try-load singularity
echo job start time is `date`
# Prepare temp directory
cd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2
rm -rf tmp_run
rm -rf tmp_output
mkdir tmp_run
mkdir tmp_output
singularity exec --nv \
-B /share/home/test \
--pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2 \
/share/home/lico/container/llama2_cu121.image \
cp -r /opt/llama-recipes ./tmp_run/
singularity exec --nv \
-B /share/home/test \
--pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2 \
/share/home/lico/container/llama2_cu121.image \
cp /share/home/test/llama2/GuanacoDataset/guanaco_non_chat-utf8.json ./tmp_run/llama-recipes/src/llama_recipes/datasets/alpaca_data.json
# Finetuning
if [ 2 == 1 ]
then
# single GPU
set -x
singularity exec --nv \
-B /share/home/test \
--pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes \
--env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes/src" \
/share/home/lico/container/llama2_cu121.image \
python recipes/finetuning/finetuning.py --use_peft --peft_method lora --use_fp16 --model_name /share/home/test/llama2/Llama-2-7b-hf --output_dir /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_output --dataset alpaca_dataset --batch_size_training 1 --num_epochs 1
set +x
else
# multi GPU
set -x
singularity exec --nv \
-B /share/home/test \
--pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes \
--env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes/src" \
/share/home/lico/container/llama2_cu121.image \
torchrun --nnodes 1 --nproc_per_node 2 recipes/finetuning/finetuning.py --enable_fsdp --use_peft --peft_method lora --use_fp16 --model_name /share/home/test/llama2/Llama-2-7b-hf --output_dir /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_output --dataset alpaca_dataset --batch_size_training 1 --num_epochs 1
set +x
fi
# Merge Model
cd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410/output
rm -rf ./*
singularity exec --nv \
-B /share/home/test \
--pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes \
--env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes/src" \
/share/home/lico/container/llama2_cu121.image \
python recipes/inference/model_servers/hf_text_generation_inference/merge_lora_weights.py --base_model /share/home/test/llama2/Llama-2-7b-hf --peft_model /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_output --output_dir /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410/output
set +x
echo job end time is `date`
Multiple nodes, multiple GPUs
Since V100s are used, EFA needs to be disabled. Also take care with the network interface setting NCCL_SOCKET_IFNAME.
The Slurm script is as follows:
#!/bin/bash
#SBATCH --job-name='LLM_multi_finetuning_2gpu_ib0'
#SBATCH --chdir=/share/home/test/jobtemplate/LLM_finetuning/multi/20240412
#SBATCH --partition=compute_gpu
#SBATCH --nodes=2
#SBATCH --time=0-6:00
#SBATCH --mincpus=16
#SBATCH --gres=gpu:1
export SLURM_OVERLAP=yes
module try-load singularity
echo job start time is `date`
# Prepare temp directory
cd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412
rm -rf tmp_run
rm -rf tmp_output
mkdir tmp_run
mkdir tmp_output
singularity exec --nv \
-B /share/home/test \
--pwd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412 \
/share/home/lico/container/llama2_cu121.image \
cp -r /opt/llama-recipes ./tmp_run/
singularity exec --nv \
-B /share/home/test \
--pwd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412 \
/share/home/lico/container/llama2_cu121.image \
cp /share/home/test/llama2/GuanacoDataset/guanaco_non_chat-utf8.json ./tmp_run/llama-recipes/src/llama_recipes/datasets/alpaca_data.json
# Finetuning
NODE_PORT=25611
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# Enable for A100
# export FI_PROVIDER="efa"
# export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
# export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH
echo Node IP: $head_node_ip
export LOGLEVEL=INFO
# debugging flags (optional)
export NCCL_DEBUG=WARN
export NCCL_DEBUG_SUBSYS=WARN
export PYTHONFAULTHANDLER=1
export CUDA_LAUNCH_BLOCKING=0
# on your cluster you might need these:
# set the network interface
export NCCL_SOCKET_IFNAME="ib0"
export FI_EFA_USE_DEVICE_RDMA=0 # Disable EFA
set -x
srun singularity exec --nv \
-B /share/home/test \
--pwd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_run/llama-recipes \
--env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_run/llama-recipes/src" \
/share/home/lico/container/llama2_cu121.image \
torchrun --nnodes 2 --nproc_per_node 1 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:$NODE_PORT recipes/finetuning/finetuning.py --enable_fsdp --use_peft --peft_method lora --use_fp16 --model_name /share/home/test/llama2/Llama-2-7b-hf-bak --output_dir /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_output --dataset alpaca_dataset --batch_size_training 2 --num_epochs 1
set +x
# Merge Model
cd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/output
rm -rf ./*
singularity exec --nv \
-B /share/home/test \
--pwd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_run/llama-recipes \
--env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_run/llama-recipes/src" \
/share/home/lico/container/llama2_cu121.image \
python recipes/inference/model_servers/hf_text_generation_inference/merge_lora_weights.py --base_model /share/home/test/llama2/Llama-2-7b-hf-bak --peft_model /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_output --output_dir /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/output
set +x
echo job end time is `date`
5. Testing
Here we use oobabooga/text-generation-webui, a Gradio web UI for large language models.
Environment preparation; a Singularity image is again used here:
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:23.11-py3
%post
apt-get update
apt-get install -y git
mkdir -p /opt/llama2_inference
cd /opt/llama2_inference
# wget https://github.com/oobabooga/text-generation-webui/archive/refs/tags/snapshot-2024-04-14.zip
# unzip snapshot-2024-04-14.zip && mv text-generation-webui-snapshot-2024-04-14 text-generation-webui
# Either download a specific release, or use git and reset to a commit; both work
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
git reset --hard '26d822f64f2a029306b250b69dc58468662a4fc6' # version 1.8 had not been released yet, so the main branch is used
# Because the PyTorch base image is used, installing per the project docs is not recommended (do not use requirements.txt or start_linux.sh); just start the program directly (python server.py ...) and install whatever dependencies it reports as missing.
pip install pyyaml rich accelerate==0.27.* gradio==4.26.* markdown transformers==4.39.* numba==0.57.* datasets peft==0.8.* sentencepiece
Issues encountered: packages were installed on demand (install whatever is reported missing) rather than following the project's install docs.
The sentencepiece package was only added after an error appeared while loading the fine-tuned model; the error was:
ValueError: Cannot instantiate this tokenizer from a slow version, If it's based on sentencepiece, make sure you have sentencepiece installed.
Run:
singularity shell --nv -B /home/hpcadmin /home/hpcadmin/llama2_inference.image
Singularity> cp -r /opt/llama2_inference/text-generation-webui /home/hpcadmin/test
Singularity> cd /home/hpcadmin/test/text-generation-webui
Singularity> ln -s /home/hpcadmin/llm-llama2/Llama-2-7b-chat-hf/ ./models/
Singularity> python server.py --listen --listen-host 0.0.0.0 --listen-port 7890
Usage: open http://10.240.214.67:7890 in a browser.
Problems encountered
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB. GPU 1 has a total capacty of 31.74 GiB of which 1.41 GiB is free………
Cause: GPU memory is too small; reduce batch_size or add more GPUs. In my runs on three 32 GB V100s the maximum batch_size was 6.
RuntimeError: cutlassF: no kernel found to launch!
Cause: the V100 does not support training with the --pure_bf16 flag; bf16 requires an A100 or newer GPU.
A few other version-related issues were not recorded. The environment described in this article is known to run; pay attention to the version of the llama-recipes library (the last commit used is dated 2024-04-04: 362cda0fa6813e2e672c87ca05a516ec2003df6b).