Train LLAMA-2 on a Small GPU
Training or fully fine-tuning an LLM is hard without serious GPU resources, but parameter-efficient methods such as LoRA and QLoRA make it feasible on a single consumer GPU.
4-bit quantization via QLoRA enables efficient fine-tuning of large LLMs on consumer hardware while retaining most of the original model quality, which makes these models far more accessible for real-world applications.
QLoRA quantizes a pre-trained language model to 4 bits and freezes its parameters.
During fine-tuning, gradients are backpropagated through the frozen 4-bit quantized model into only the Low-Rank Adapter (LoRA) layers.
In practice, 4-bit quantization causes only a minimal loss in model performance.
Quantization
In this post we use 4-bit quantization via bitsandbytes:
pip install bitsandbytes
...
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=False,
)
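To see why 4-bit loading matters, here is a rough back-of-the-envelope estimate of weight memory, assuming a 7B-parameter model (illustrative only; it ignores activations, optimizer state, and quantization constants):

```python
# Rough weight-memory estimate for a 7B-parameter model (illustrative numbers).
params = 7_000_000_000

bytes_fp16 = params * 2    # 16-bit weights: 2 bytes per parameter
bytes_nf4 = params * 0.5   # 4-bit NF4 weights: 0.5 bytes per parameter

print(f"fp16: {bytes_fp16 / 1e9:.1f} GB")  # fp16: 14.0 GB
print(f"nf4:  {bytes_nf4 / 1e9:.1f} GB")   # nf4:  3.5 GB
```

Dropping from 16-bit to 4-bit weights is what lets a 7B model fit comfortably in the memory of a single consumer GPU.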
Training parameters
pip install peft
...
peft_params = LoraConfig(
lora_alpha=16,
lora_dropout=0.1,
r=64,
bias="none",
task_type="CAUSAL_LM",
)
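With rank r=64, each adapted weight matrix W of shape (d, k) gets two small trainable matrices, A of shape (r, k) and B of shape (d, r), so the adapter adds r·(d + k) parameters while W stays frozen. A quick sketch (the 4096 × 4096 shape is a hypothetical attention projection, roughly LLAMA-2-7B-sized):

```python
# Trainable parameters added by one LoRA adapter of rank r on a (d, k) weight.
def lora_params(d: int, k: int, r: int) -> int:
    return r * (d + k)  # A is (r, k), B is (d, r); W itself stays frozen

# Hypothetical 4096x4096 projection with r=64, as in the config above:
print(lora_params(4096, 4096, 64))  # 524288 trainable vs ~16.8M frozen params
```

This is why LoRA fine-tuning is so cheap: only a few percent of the parameters ever receive gradients.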
training_params = TrainingArguments(
output_dir="./results",
num_train_epochs=1,
per_device_train_batch_size=4,
gradient_accumulation_steps=1,
optim="paged_adamw_32bit",
save_steps=25,
logging_steps=25,
learning_rate=2e-4,
weight_decay=0.001,
fp16=False,
bf16=False,
max_grad_norm=0.3,
max_steps=-1,
warmup_ratio=0.03,
group_by_length=True,
lr_scheduler_type="constant"
)
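The effective batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs, and with max_steps=-1 the run length is driven by epochs. A small sketch of how the step count falls out of the values above (the dataset size of 1,000 examples is made up for illustration):

```python
import math

per_device_batch = 4   # per_device_train_batch_size above
grad_accum = 1         # gradient_accumulation_steps above
num_gpus = 1           # single small GPU

effective_batch = per_device_batch * grad_accum * num_gpus

dataset_size = 1_000   # hypothetical; substitute your dataset's length
steps_per_epoch = math.ceil(dataset_size / effective_batch)
print(effective_batch, steps_per_epoch)  # 4 250
```

If you hit out-of-memory errors, lowering per_device_train_batch_size and raising gradient_accumulation_steps keeps the effective batch size unchanged while reducing peak memory.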
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
peft_config=peft_params,
dataset_text_field="text",
max_seq_length=None,
tokenizer=tokenizer,
args=training_params,
packing=False,
)
Prepare Docker
Below is how to configure a Docker image that runs on an RTX A3000.
Base Image
FROM nvidia/cuda:12.1.1-base-ubuntu20.04
Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
Install system dependencies
RUN apt-get update && apt-get install -y \
    git \
    python3-pip \
    python3-dev \
    python3-opencv \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*
Install PyTorch and torchvision
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Install any python packages you need
COPY requirements.txt requirements.txt
RUN python3 -m pip install notebook accelerate peft bitsandbytes transformers trl tensorboard
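For docker compose to expose the GPU, the service needs a GPU reservation. A minimal docker-compose.yml sketch (the service name, port, and layout are assumptions; adapt them to the repository's actual compose file):

```yaml
# Hypothetical compose file; the service name and port are placeholders.
services:
  trainer:
    build: .
    ports:
      - "8888:8888"   # Jupyter
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

Without the `devices` reservation (or an equivalent `--gpus` flag), the container will start but nvidia-smi and CUDA will not see the GPU.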
Full code
The full code is available at:
https://github.com/venergiac/training_LLAMA-2_QLORA
Build the Docker image:
docker compose build
Start the container:
docker compose run
Open Jupyter and verify GPU access with nvidia-smi.