
Fine-tuning LLaMA 2 on a Slurm Cluster (Single-Node Multi-GPU and Multi-Node Multi-GPU)


0. Preface

Last edited on 2024-04-10. If you run into problems, check the official change logs against this date to make troubleshooting easier.

Hugging Face: https://huggingface.co/

LLaMA 2 repository: https://github.com/meta-llama/llama

Llama-2-7b-hf model: https://huggingface.co/meta-llama/Llama-2-7b-hf

1. Environment Setup

NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0

2x V100S GPUs with 32 GB of memory each

Based on testing, running the Llama-2-7b-hf model needs at least two cards; with two cards, each card uses a minimum of about 17 GB of memory.
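To check the per-card memory footprint yourself, you can poll nvidia-smi on the GPU node while the job is running; a minimal example (the 5-second interval is arbitrary):

# print per-GPU memory usage every 5 seconds
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5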

A Singularity image is used here (think of it as a Docker image). The image definition file is as follows:

Bootstrap: docker
From: nvcr.io/nvidia/pytorch:23.11-py3

%post
    apt-get update
    apt-get upgrade -y
    apt-get install -y git
    chmod -R 777 /root
    pip install accelerate appdirs loralib bitsandbytes black 'black[jupyter]' datasets fire peft 'transformers>=4.34.1' sentencepiece py7zr scipy optimum matplotlib gradio
    cd /opt
    git clone https://github.com/meta-llama/llama-recipes.git
    cd llama-recipes
    git reset --hard '37c8f722116493e69ea99420b3d73287905a46d0'  # no release tag yet, so pin to this commit
    chmod -R 777 /opt/llama-recipes
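To turn the definition file above into the image referenced by the Slurm scripts below, build it with singularity build (the .def file name is only an example; building usually needs root or --fakeroot):

# save the definition above as llama2_cu121.def, then build the image
sudo singularity build llama2_cu121.image llama2_cu121.def
# or, without root, on hosts where fakeroot is configured:
# singularity build --fakeroot llama2_cu121.image llama2_cu121.def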

2. Download the Hugging Face Model

Method 1: Use the huggingface_hub tool

huggingface_hub can also be used to download datasets.

  • Install the dependency:

    pip install huggingface_hub   # require python >= 3.8
  • Run in a Python interactive session

    This method is recommended: run it in an interactive terminal, not from a script. A script only shows the overall download progress, not the per-file progress.

    >>> import os
    >>> os.environ["http_proxy"] = "http://127.0.0.1:7890"
    >>> os.environ["https_proxy"] = "http://127.0.0.1:7890" # 因为这里使用py3.8下载,使用代理会出错所以需要此配置(py3.9修复不需要该操作)
    >>> from huggingface_hub import snapshot_download
    >>> snapshot_download(repo_id="meta-llama/Llama-2-7b-hf", resume_download=True, local_dir_use_symlinks=False, local_dir=r"D:\llm-llama2", token="*********")
    # on Windows, set local_dir_use_symlinks to False, otherwise a full copy is kept in both the cache directory and local_dir
    # some models require logging in to download; others do not
  • Download from the command line (for gated models, see the note right after this list):

    # download the model via the CLI
    huggingface-cli download --resume-download meta-llama/Llama-2-7b-hf --local-dir ./llm-llama2
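Note: for gated models such as Llama-2, huggingface-cli also needs your access token; either log in once or pass the token directly (the token value below is a placeholder):

    # one-time login, stores the token locally
    huggingface-cli login
    # or pass the token on the command line
    huggingface-cli download --resume-download meta-llama/Llama-2-7b-hf --local-dir ./llm-llama2 --token hf_***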

Method 2: Use the hfd tool to download Hugging Face model weights quickly (recommended)

Advantages: no proxy required, and downloads are fast.

Mirror site: https://hf-mirror.com/

Steps:

  1. Install the huggingface_hub tool

    pip install -U huggingface_hub
  2. Download hfd:

    wget https://hf-mirror.com/hfd/hfd.sh
    chmod a+x hfd.sh
  3. Use the mirror endpoint; this makes downloads from mainland China much faster:

    # linux
    export HF_ENDPOINT=https://hf-mirror.com
    # windows
    $env:HF_ENDPOINT = "https://hf-mirror.com"
  4. Download the model:

    ./hfd.sh <model_name> --tool aria2c -x 16
    
    # -x sets the number of aria2c download connections (default: 4)
    # aria2c must be installed separately: yum install aria2

    If aria2 is not available, use --tool wget:

    ./hfd.sh <model_name> --tool wget --local-dir <save_dir>

    For models that require a token:

    ./hfd.sh meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME --hf_token hf_***  --tool aria2c -x 16

Method 3: Download with ModelScope (recommended)

Docs: https://www.modelscope.cn/docs/models/download

  1. Install modelscope

    pip install modelscope
  2. Download the whole model repo to a specified directory

    modelscope download --model deepseek-ai/Deepseek-R1-Distill-Llama-70B --local_dir ./huggingface_models/Deepseek-R1-Distill-Llama-70B

Method 4: Download with transformers

Download the pretrained model from Hugging Face:

from transformers import AutoModel, AutoTokenizer
model_name = "meta-llama/Llama-2-7b-hf" # Replace with the desired model's name
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

3. Download the Training Dataset

Load and preprocess your dataset, and make sure it is tokenized with the tokenizer that matches the pretrained model.

Downloading the dataset

Download a dataset from Hugging Face

  • Method 1: Download with huggingface_hub (similar to git clone)

    # first download the source dataset:
    >>> from huggingface_hub import snapshot_download
    >>> snapshot_download(repo_id="JosephusCheung/GuanacoDataset", repo_type="dataset",resume_download=True, local_dir=r"D:\datasets\test", token="*********")
    >>> from datasets import load_dataset
    >>> dataset = load_dataset(r"D:\datasets\GuanacoDataset")
  • Method 2: Download with the datasets library

    Install the datasets package:

    pip install datasets
    
    # To work with audio datasets, install the Audio feature:
    pip install datasets[audio]
    
    # To work with image datasets, install the Image feature:
    pip install datasets[vision]

    Load the dataset and save it to disk:

    >>> from datasets import load_dataset, load_from_disk
    >>> from transformers import AutoTokenizer, DataCollatorWithPadding
    >>> dataset = load_dataset("JosephusCheung/GuanacoDataset", token="***********")
    >>> dataset.save_to_disk(r"D:\datasets\GuanacoDataset")        # write to disk under the given directory; the dataset is consolidated into a new on-disk format
    >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")   # the collator needs a tokenizer; the Llama-2 tokenizer is used here
    >>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Loading the dataset

  • Load online from Hugging Face

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("JosephusCheung/GuanacoDataset", token="***********")
  • Load from local disk

    >>> from datasets import load_from_disk
    >>> dataset = load_from_disk(r"D:\datasets\GuanacoDataset")  # read the local dataset

4. Fine-tune the Model with llama-recipes

Single node, multiple GPUs

Training time: with batch_size 1, one epoch took 10 hours 32 minutes on two V100S GPUs. On plain V100s, expect at least double that.

The Slurm script used is as follows:

#!/bin/bash
#SBATCH --job-name='LLM_2gpu_finetuning_batch_1'
#SBATCH --chdir=/share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2
#SBATCH --partition=c4
#SBATCH --nodes=1
#SBATCH --time=4-00:00
#SBATCH --mincpus=32
#SBATCH --gres=gpu:2

export SLURM_OVERLAP=yes


module try-load singularity
echo job start time is `date`

# Prepare temp directory
cd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2
rm -rf tmp_run
rm -rf tmp_output
mkdir tmp_run
mkdir tmp_output
singularity exec --nv \
    -B /share/home/test \
    --pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2 \
    /share/home/lico/container/llama2_cu121.image \
    cp -r /opt/llama-recipes ./tmp_run/

singularity exec --nv \
    -B /share/home/test \
    --pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2 \
    /share/home/lico/container/llama2_cu121.image \
    cp /share/home/test/llama2/GuanacoDataset/guanaco_non_chat-utf8.json ./tmp_run/llama-recipes/src/llama_recipes/datasets/alpaca_data.json

# Finetuning
if [ 2 == 1 ]
then
    # single GPU
    set -x
    singularity exec --nv \
        -B /share/home/test \
        --pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes \
        --env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes/src" \
        /share/home/lico/container/llama2_cu121.image \
        python recipes/finetuning/finetuning.py --use_peft --peft_method lora --use_fp16 --model_name /share/home/test/llama2/Llama-2-7b-hf --output_dir /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_output --dataset alpaca_dataset --batch_size_training 1 --num_epochs 1
    set +x
else
    # multi GPU
    set -x
    singularity exec --nv \
        -B /share/home/test \
        --pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes \
        --env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes/src" \
        /share/home/lico/container/llama2_cu121.image \
        torchrun --nnodes 1 --nproc_per_node 2 recipes/finetuning/finetuning.py --enable_fsdp --use_peft --peft_method lora --use_fp16 --model_name /share/home/test/llama2/Llama-2-7b-hf --output_dir /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_output --dataset alpaca_dataset --batch_size_training 1 --num_epochs 1
    set +x
fi

# Merge Model
cd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410/output
rm -rf ./*
singularity exec --nv \
    -B /share/home/test \
    --pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes \
    --env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes/src" \
    /share/home/lico/container/llama2_cu121.image \
    python recipes/inference/model_servers/hf_text_generation_inference/merge_lora_weights.py --base_model /share/home/test/llama2/Llama-2-7b-hf --peft_model /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_output --output_dir /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410/output
set +x

echo job end time is `date`
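As a sketch of how this script is used (the file name is an assumption), submit it with sbatch and follow the Slurm output file in the job's working directory to watch progress:

# submit the job (script file name is an example)
sbatch llama2_finetune_2gpu.sh
# check the queue and follow the job output
squeue -u $USER
tail -f slurm-<jobid>.out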

Multiple nodes, multiple GPUs

Because V100s are used, EFA must be disabled. Also pay attention to the network interface configured via NCCL_SOCKET_IFNAME.
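If you are not sure which interface name to use for NCCL_SOCKET_IFNAME, listing the interfaces on a compute node is a quick way to find it (ib0 below is just this cluster's InfiniBand device):

# list network interfaces and their addresses on a compute node
ip -o -4 addr show
# then export the matching name before launching, e.g.
export NCCL_SOCKET_IFNAME="ib0"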

The Slurm script is as follows:

#!/bin/bash
#SBATCH --job-name='LLM_multi_finetuning_2gpu_ib0'
#SBATCH --chdir=/share/home/test/jobtemplate/LLM_finetuning/multi/20240412
#SBATCH --partition=compute_gpu
#SBATCH --nodes=2
#SBATCH --time=0-6:00
#SBATCH --mincpus=16
#SBATCH --gres=gpu:1

export SLURM_OVERLAP=yes

module try-load singularity
echo job start time is `date`

# Prepare temp directory
cd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412
rm -rf tmp_run
rm -rf tmp_output
mkdir tmp_run
mkdir tmp_output
singularity exec --nv \
    -B /share/home/test \
    --pwd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412 \
    /share/home/lico/container/llama2_cu121.image \
    cp -r /opt/llama-recipes ./tmp_run/

singularity exec --nv \
    -B /share/home/test \
    --pwd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412 \
    /share/home/lico/container/llama2_cu121.image \
    cp /share/home/test/llama2/GuanacoDataset/guanaco_non_chat-utf8.json ./tmp_run/llama-recipes/src/llama_recipes/datasets/alpaca_data.json

# Finetuning
NODE_PORT=25611
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Enable for A100
# export FI_PROVIDER="efa"
# export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
# export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH

echo Node IP: $head_node_ip
export LOGLEVEL=INFO
# debugging flags (optional)
export NCCL_DEBUG=WARN
export NCCL_DEBUG_SUBSYS=WARN
export PYTHONFAULTHANDLER=1
export CUDA_LAUNCH_BLOCKING=0

# on your cluster you might need these:
# set the network interface
export NCCL_SOCKET_IFNAME="ib0"
export FI_EFA_USE_DEVICE_RDMA=0  # Disable EFA

set -x
srun singularity exec --nv \
    -B /share/home/test \
    --pwd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_run/llama-recipes \
    --env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_run/llama-recipes/src" \
    /share/home/lico/container/llama2_cu121.image \
    torchrun --nnodes 2 --nproc_per_node 1 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:$NODE_PORT recipes/finetuning/finetuning.py  --enable_fsdp --use_peft --peft_method lora --use_fp16 --model_name /share/home/test/llama2/Llama-2-7b-hf-bak --output_dir /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_output --dataset alpaca_dataset --batch_size_training 2 --num_epochs 1
set +x

# Merge Model
cd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/output
rm -rf ./*
singularity exec --nv \
    -B /share/home/test \
    --pwd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_run/llama-recipes \
    --env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_run/llama-recipes/src" \
    /share/home/lico/container/llama2_cu121.image \
    python recipes/inference/model_servers/hf_text_generation_inference/merge_lora_weights.py --base_model /share/home/test/llama2/Llama-2-7b-hf-bak --peft_model /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_output --output_dir /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/output
set +x

echo job end time is `date`

5. Testing

Here we use oobabooga/text-generation-webui, a Gradio web UI for large language models.

Environment setup, again using a Singularity image:

Bootstrap: docker
From: nvcr.io/nvidia/pytorch:23.11-py3

%post
    apt-get update
    apt-get install -y git
    mkdir -p /opt/llama2_inference
    cd /opt/llama2_inference
    # wget https://github.com/oobabooga/text-generation-webui/archive/refs/tags/snapshot-2024-04-14.zip
    # unzip snapshot-2024-04-14.zip && mv text-generation-webui-snapshot-2024-04-14 text-generation-webui
    # either download a pinned release archive (above), or clone with git and reset to a specific commit
    git clone https://github.com/oobabooga/text-generation-webui.git
    cd text-generation-webui
    git reset --hard '26d822f64f2a029306b250b69dc58468662a4fc6' # version 1.8 has not been released yet, so the main branch is used
    # since the PyTorch base image is used, do not install by following the docs (do not use requirements.txt or start_linux.sh); just start the program (python server.py ...) and install whatever dependencies it reports as missing.
    pip install pyyaml rich accelerate==0.27.* gradio==4.26.* markdown transformers==4.39.* numba==0.57.* datasets peft==0.8.* sentencepiece
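As with the fine-tuning image, build this definition into the image file used in the commands below (file names are examples):

# build the inference image from the definition above
sudo singularity build llama2_inference.image llama2_inference.def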

Note on the pip packages: they were installed on demand (install whatever turns out to be missing) rather than by following the official installation docs.

The sentencepiece package was added after an error appeared while loading the fine-tuned model; the error was:

ValueError: Cannot instantiate this tokenizer from a slow version. If it's based on sentencepiece, make sure you have sentencepiece installed.

Run:

singularity shell --nv -B /home/hpcadmin /home/hpcadmin/llama2_inference.image
Singularity> cp -r /opt/llama2_inference/text-generation-webui /home/hpcadmin/test
Singularity> cd /home/hpcadmin/test/text-generation-webui
Singularity> ln -s /home/hpcadmin/llm-llama2/Llama-2-7b-chat-hf/ ./models/
Singularity> python server.py --listen --listen-host 0.0.0.0 --listen-port 7890

Usage: open the address (http://10.240.214.67:7890) in a browser.
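If the compute node's address is not directly reachable from your browser, one common workaround is an SSH local port forward through a node you can reach (the user and host names below are placeholders):

# forward local port 7890 to the web UI on the compute node
ssh -L 7890:10.240.214.67:7890 user@login-node
# then open http://localhost:7890 locally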

Problems encountered

  1. torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB. GPU 1 has a total capacty of 31.74 GiB of which 1.41 GiB is free...

    Cause: not enough GPU memory; lower batch_size or add more GPUs.
         In my runs, 3x V100 32G topped out at a batch_size of 6.
  2. RuntimeError: cutlassF: no kernel found to launch!

    Cause: V100 does not support training with --pure_bf16; bf16 requires A100 or newer GPUs.
  3. Some other version-related issues were not recorded. The environment described in this article is known to run; pay attention to the llama-recipes snapshot you use (the last commit used here is dated 2024-04-04: 362cda0fa6813e2e672c87ca05a516ec2003df6b).


Author: 无夜
Copyright: unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit 无夜 as the source when reposting.