
Fine-tuning LLaMA 2 on a Slurm Cluster (Single-Node Multi-GPU and Multi-Node Multi-GPU)


0. Preface

Last edited on 2024-04-10. If you run into problems, check the official change logs against this date to make troubleshooting easier.

Hugging Face: https://huggingface.co/

LLaMA 2 repository: https://github.com/meta-llama/llama

Llama-2-7b-hf model: https://huggingface.co/meta-llama/Llama-2-7b-hf

1. Environment Setup

NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0

2x V100S GPUs with 32 GB of memory each

Based on testing, running the Llama-2-7b-hf model needs at least two cards; with two cards, each card uses a minimum of about 17 GB of memory.
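To check the per-card memory footprint yourself, you can poll nvidia-smi on the GPU node while the job is running; a minimal example (the 5-second interval is arbitrary):

# print per-GPU memory usage every 5 seconds
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5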

A Singularity image is used here (think of it as a Docker image). The image definition file is as follows:

Bootstrap: docker
From: nvcr.io/nvidia/pytorch:23.11-py3

%post
    apt-get update
    apt-get upgrade -y
    apt-get install -y git
    chmod -R 777 /root
    pip install accelerate appdirs loralib bitsandbytes black 'black[jupyter]' datasets fire peft 'transformers>=4.34.1' sentencepiece py7zr scipy optimum matplotlib gradio
    cd /opt
    git clone https://github.com/meta-llama/llama-recipes.git
    cd llama-recipes
    git reset --hard '37c8f722116493e69ea99420b3d73287905a46d0'  # no release tag yet, so pin to this commit
    chmod -R 777 /opt/llama-recipes
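To turn the definition file above into the image referenced by the Slurm scripts below, build it with singularity build (the .def file name is only an example; building usually needs root or --fakeroot):

# save the definition above as llama2_cu121.def, then build the image
sudo singularity build llama2_cu121.image llama2_cu121.def
# or, without root, on hosts where fakeroot is configured:
# singularity build --fakeroot llama2_cu121.image llama2_cu121.def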

2. Download the Hugging Face Model

Method 1: Use the huggingface_hub tool

huggingface_hub can also be used to download datasets.

  • Install the dependency:

    pip install huggingface_hub   # require python >= 3.8
  • Run in a Python interactive session

    This method is recommended: run it in an interactive terminal, not from a script. A script only shows the overall download progress, not the per-file progress.

    >>> import os
    >>> os.environ["http_proxy"] = "http://127.0.0.1:7890"
    >>> os.environ["https_proxy"] = "http://127.0.0.1:7890" # 因为这里使用py3.8下载,使用代理会出错所以需要此配置(py3.9修复不需要该操作)
    >>> from huggingface_hub import snapshot_download
    >>> snapshot_download(repo_id="meta-llama/Llama-2-7b-hf", resume_download=True, local_dir_use_symlinks=False, local_dir=r"D:\llm-llama2", token="*********")
    # on Windows, set local_dir_use_symlinks to False, otherwise a full copy is kept in both the cache directory and local_dir
    # some models require logging in to download; others do not
  • Download from the command line (for gated models, see the note right after this list):

    # download the model via the CLI
    huggingface-cli download --resume-download meta-llama/Llama-2-7b-hf --local-dir ./llm-llama2
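Note: for gated models such as Llama-2, huggingface-cli also needs your access token; either log in once or pass the token directly (the token value below is a placeholder):

    # one-time login, stores the token locally
    huggingface-cli login
    # or pass the token on the command line
    huggingface-cli download --resume-download meta-llama/Llama-2-7b-hf --local-dir ./llm-llama2 --token hf_***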

Method 2: Use the hfd tool to download Hugging Face model weights quickly (recommended)

Advantages: no proxy required, and downloads are fast.

Mirror site: https://hf-mirror.com/

Steps:

  1. Install the huggingface_hub tool

    pip install -U huggingface_hub
  2. Download hfd:

    wget https://hf-mirror.com/hfd/hfd.sh
    chmod a+x hfd.sh
  3. Use the mirror endpoint; this makes downloads from mainland China much faster:

    # linux
    export HF_ENDPOINT=https://hf-mirror.com
    # windows
    $env:HF_ENDPOINT = "https://hf-mirror.com"
  4. Download the model:

    ./hfd.sh <model_name> --tool aria2c -x 16
    
    # -x sets the number of aria2c download connections (default: 4)
    # aria2c must be installed separately: yum install aria2

    If aria2 is not available, use --tool wget:

    ./hfd.sh <model_name> --tool wget --local-dir <save_dir>

    For models that require a token:

    ./hfd.sh meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME --hf_token hf_***  --tool aria2c -x 16

Method 3: Download with ModelScope (recommended)

Docs: https://www.modelscope.cn/docs/models/download

  1. Install modelscope

    pip install modelscope
  2. Download the whole model repo to a specified directory

    modelscope download --model deepseek-ai/Deepseek-R1-Distill-Llama-70B --local_dir ./huggingface_models/Deepseek-R1-Distill-Llama-70B

Method 4: Download with transformers

Download the pretrained model from Hugging Face:

from transformers import AutoModel, AutoTokenizer
model_name = "meta-llama/Llama-2-7b-hf" # Replace with the desired model's name
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

3. Download the Training Dataset

Load and preprocess your dataset, and make sure it is tokenized with the tokenizer that matches the pretrained model.

Downloading the dataset

Download a dataset from Hugging Face

  • Method 1: Download with huggingface_hub (similar to git clone)

    # first download the source dataset:
    >>> from huggingface_hub import snapshot_download
    >>> snapshot_download(repo_id="JosephusCheung/GuanacoDataset", repo_type="dataset",resume_download=True, local_dir=r"D:\datasets\test", token="*********")
    >>> from datasets import load_dataset
    >>> dataset = load_dataset(r"D:\datasets\GuanacoDataset")
  • Method 2: Download with the datasets library

    Install the datasets package:

    pip install datasets
    
    # To work with audio datasets, install the Audio feature:
    pip install datasets[audio]
    
    # To work with image datasets, install the Image feature:
    pip install datasets[vision]

    Load the dataset and save it to disk:

    >>> from datasets import load_dataset, load_from_disk
    >>> from transformers import AutoTokenizer, DataCollatorWithPadding
    >>> dataset = load_dataset("JosephusCheung/GuanacoDataset", token="***********")
    >>> dataset.save_to_disk(r"D:\datasets\GuanacoDataset")        # write to disk under the given directory; the dataset is consolidated into a new on-disk format
    >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")   # the collator needs a tokenizer; the Llama-2 tokenizer is used here
    >>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Loading the dataset

  • Load online from Hugging Face

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("JosephusCheung/GuanacoDataset", token="***********")
  • Load from local disk

    >>> from datasets import load_from_disk
    >>> dataset = load_from_disk(r"D:\datasets\GuanacoDataset")  # read the local dataset

4. Fine-tune the Model with llama-recipes

Single node, multiple GPUs

Training time: with batch_size 1, one epoch took 10 hours 32 minutes on two V100S GPUs. On plain V100s, expect at least double that.

The Slurm script used is as follows:

#!/bin/bash
#SBATCH --job-name='LLM_2gpu_finetuning_batch_1'
#SBATCH --chdir=/share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2
#SBATCH --partition=c4
#SBATCH --nodes=1
#SBATCH --time=4-00:00
#SBATCH --mincpus=32
#SBATCH --gres=gpu:2

export SLURM_OVERLAP=yes


module try-load singularity
echo job start time is `date`

# Prepare temp directory
cd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2
rm -rf tmp_run
rm -rf tmp_output
mkdir tmp_run
mkdir tmp_output
singularity exec --nv \
    -B /share/home/test \
    --pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2 \
    /share/home/lico/container/llama2_cu121.image \
    cp -r /opt/llama-recipes ./tmp_run/

singularity exec --nv \
    -B /share/home/test \
    --pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2 \
    /share/home/lico/container/llama2_cu121.image \
    cp /share/home/test/llama2/GuanacoDataset/guanaco_non_chat-utf8.json ./tmp_run/llama-recipes/src/llama_recipes/datasets/alpaca_data.json

# Finetuning
if [ 2 == 1 ]
then
    # single GPU
    set -x
    singularity exec --nv \
        -B /share/home/test \
        --pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes \
        --env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes/src" \
        /share/home/lico/container/llama2_cu121.image \
        python recipes/finetuning/finetuning.py --use_peft --peft_method lora --use_fp16 --model_name /share/home/test/llama2/Llama-2-7b-hf --output_dir /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_output --dataset alpaca_dataset --batch_size_training 1 --num_epochs 1
    set +x
else
    # multi GPU
    set -x
    singularity exec --nv \
        -B /share/home/test \
        --pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes \
        --env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes/src" \
        /share/home/lico/container/llama2_cu121.image \
        torchrun --nnodes 1 --nproc_per_node 2 recipes/finetuning/finetuning.py --enable_fsdp --use_peft --peft_method lora --use_fp16 --model_name /share/home/test/llama2/Llama-2-7b-hf --output_dir /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_output --dataset alpaca_dataset --batch_size_training 1 --num_epochs 1
    set +x
fi

# Merge Model
cd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410/output
rm -rf ./*
singularity exec --nv \
    -B /share/home/test \
    --pwd /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes \
    --env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_run/llama-recipes/src" \
    /share/home/lico/container/llama2_cu121.image \
    python recipes/inference/model_servers/hf_text_generation_inference/merge_lora_weights.py --base_model /share/home/test/llama2/Llama-2-7b-hf --peft_model /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410_v2/tmp_output --output_dir /share/home/test/jobtemplate/LLM_finetuning/2gpu/20240410/output
set +x

echo job end time is `date`
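As a sketch of how this script is used (the file name is an assumption), submit it with sbatch and follow the Slurm output file in the job's working directory to watch progress:

# submit the job (script file name is an example)
sbatch llama2_finetune_2gpu.sh
# check the queue and follow the job output
squeue -u $USER
tail -f slurm-<jobid>.out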

Multiple nodes, multiple GPUs

Because V100s are used, EFA must be disabled. Also pay attention to the network interface configured via NCCL_SOCKET_IFNAME.
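If you are not sure which interface name to use for NCCL_SOCKET_IFNAME, listing the interfaces on a compute node is a quick way to find it (ib0 below is just this cluster's InfiniBand device):

# list network interfaces and their addresses on a compute node
ip -o -4 addr show
# then export the matching name before launching, e.g.
export NCCL_SOCKET_IFNAME="ib0"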

The Slurm script is as follows:

#!/bin/bash
#SBATCH --job-name='LLM_multi_finetuning_2gpu_ib0'
#SBATCH --chdir=/share/home/test/jobtemplate/LLM_finetuning/multi/20240412
#SBATCH --partition=compute_gpu
#SBATCH --nodes=2
#SBATCH --time=0-6:00
#SBATCH --mincpus=16
#SBATCH --gres=gpu:1

export SLURM_OVERLAP=yes

module try-load singularity
echo job start time is `date`

# Prepare temp directory
cd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412
rm -rf tmp_run
rm -rf tmp_output
mkdir tmp_run
mkdir tmp_output
singularity exec --nv \
    -B /share/home/test \
    --pwd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412 \
    /share/home/lico/container/llama2_cu121.image \
    cp -r /opt/llama-recipes ./tmp_run/

singularity exec --nv \
    -B /share/home/test \
    --pwd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412 \
    /share/home/lico/container/llama2_cu121.image \
    cp /share/home/test/llama2/GuanacoDataset/guanaco_non_chat-utf8.json ./tmp_run/llama-recipes/src/llama_recipes/datasets/alpaca_data.json

# Finetuning
NODE_PORT=25611
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Enable for A100
# export FI_PROVIDER="efa"
# export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
# export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH

echo Node IP: $head_node_ip
export LOGLEVEL=INFO
# debugging flags (optional)
export NCCL_DEBUG=WARN
export NCCL_DEBUG_SUBSYS=WARN
export PYTHONFAULTHANDLER=1
export CUDA_LAUNCH_BLOCKING=0

# on your cluster you might need these:
# set the network interface
export NCCL_SOCKET_IFNAME="ib0"
export FI_EFA_USE_DEVICE_RDMA=0  # Disable EFA

set -x
srun singularity exec --nv \
    -B /share/home/test \
    --pwd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_run/llama-recipes \
    --env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_run/llama-recipes/src" \
    /share/home/lico/container/llama2_cu121.image \
    torchrun --nnodes 2 --nproc_per_node 1 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:$NODE_PORT recipes/finetuning/finetuning.py  --enable_fsdp --use_peft --peft_method lora --use_fp16 --model_name /share/home/test/llama2/Llama-2-7b-hf-bak --output_dir /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_output --dataset alpaca_dataset --batch_size_training 2 --num_epochs 1
set +x

# Merge Model
cd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/output
rm -rf ./*
singularity exec --nv \
    -B /share/home/test \
    --pwd /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_run/llama-recipes \
    --env PYTHONPATH="\$PYTHONPATH:/share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_run/llama-recipes/src" \
    /share/home/lico/container/llama2_cu121.image \
    python recipes/inference/model_servers/hf_text_generation_inference/merge_lora_weights.py --base_model /share/home/test/llama2/Llama-2-7b-hf-bak --peft_model /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/tmp_output --output_dir /share/home/test/jobtemplate/LLM_finetuning/multi/20240412/output
set +x

echo job end time is `date`

5. Testing

Here we use oobabooga/text-generation-webui, a Gradio web UI for large language models.

Environment setup, again using a Singularity image:

Bootstrap: docker
From: nvcr.io/nvidia/pytorch:23.11-py3

%post
    apt-get update
    apt-get install -y git
    mkdir -p /opt/llama2_inference
    cd /opt/llama2_inference
    # wget https://github.com/oobabooga/text-generation-webui/archive/refs/tags/snapshot-2024-04-14.zip
    # unzip snapshot-2024-04-14.zip && mv text-generation-webui-snapshot-2024-04-14 text-generation-webui
    # either download a pinned release archive (above), or clone with git and reset to a specific commit
    git clone https://github.com/oobabooga/text-generation-webui.git
    cd text-generation-webui
    git reset --hard '26d822f64f2a029306b250b69dc58468662a4fc6' # version 1.8 has not been released yet, so the main branch is used
    # since the PyTorch base image is used, do not install by following the docs (do not use requirements.txt or start_linux.sh); just start the program (python server.py ...) and install whatever dependencies it reports as missing.
    pip install pyyaml rich accelerate==0.27.* gradio==4.26.* markdown transformers==4.39.* numba==0.57.* datasets peft==0.8.* sentencepiece
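As with the fine-tuning image, build this definition into the image file used in the commands below (file names are examples):

# build the inference image from the definition above
sudo singularity build llama2_inference.image llama2_inference.def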

Note on the pip packages: they were installed on demand (install whatever turns out to be missing) rather than by following the official installation docs.

The sentencepiece package was added after an error appeared while loading the fine-tuned model; the error was:

ValueError: Cannot instantiate this tokenizer from a slow version. If it's based on sentencepiece, make sure you have sentencepiece installed.

Run:

singularity shell --nv -B /home/hpcadmin /home/hpcadmin/llama2_inference.image
Singularity> cp -r /opt/llama2_inference/text-generation-webui /home/hpcadmin/test
Singularity> cd /home/hpcadmin/test/text-generation-webui
Singularity> ln -s /home/hpcadmin/llm-llama2/Llama-2-7b-chat-hf/ ./models/
Singularity> python server.py --listen --listen-host 0.0.0.0 --listen-port 7890

Usage: open the address (http://10.240.214.67:7890) in a browser.
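If the compute node's address is not directly reachable from your browser, one common workaround is an SSH local port forward through a node you can reach (the user and host names below are placeholders):

# forward local port 7890 to the web UI on the compute node
ssh -L 7890:10.240.214.67:7890 user@login-node
# then open http://localhost:7890 locally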

Problems encountered

  1. torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB. GPU 1 has a total capacty of 31.74 GiB of which 1.41 GiB is free...

    Cause: not enough GPU memory; lower batch_size or add more GPUs.
         In my runs, 3x V100 32G topped out at a batch_size of 6.
  2. RuntimeError: cutlassF: no kernel found to launch!

    Cause: V100 does not support training with --pure_bf16; bf16 requires A100 or newer GPUs.
  3. Some other version-related issues were not recorded. The environment described in this article is known to run; pay attention to the llama-recipes snapshot you use (the last commit used here is dated 2024-04-04: 362cda0fa6813e2e672c87ca05a516ec2003df6b).


Author: 无夜
Copyright: unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit 无夜 as the source when reposting.