Warning: Turing cards are not supported on Ubuntu 24 due to compatibility issues with flash-attn 1.x. flash-attn 2.x plans to support Turing cards in the future, but for now it requires Ampere and up.
pip3 install flash-attn==1.0.9 <-- Latest 1.x release
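If you are not sure whether a card is Turing, checking its CUDA compute capability settles it (Turing reports 7.5, Ampere 8.0 and up). A minimal check, assuming PyTorch is already installed:

python3 -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
# Turing cards such as the RTX 8000 report (7, 5); flash-attn 2.x expects (8, 0) or higher.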
For Ubuntu 24, Nvidia may require libtinfo.so.5 to install CUDA. This symbolic link command might resolve the issue:
sudo ln -s /lib/x86_64-linux-gnu/libtinfo.so.6 /lib/x86_64-linux-gnu/libtinfo.so.5

Standard install from Nvidia:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda-repo-ubuntu2204-12-1-local_12.1.1-530.30.02-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-1-local_12.1.1-530.30.02-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

Or source it from Ubuntu; install these if needed for BitsAndBytes to work correctly:
sudo apt install nvidia-cuda-toolkit
sudo apt install nvidia-driver-535
sudo apt install nvidia-utils-535

Install Miniconda3:
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh

After installing, close and reopen your terminal application or refresh it by running the following commands:
source ~/miniconda3/bin/activate
conda init --all
conda create -n axolotl python=3.10
conda activate axolotl
conda install -y -c "nvidia/label/cuda-12.1.1" cuda

The current version of Axolotl may require a specific version of bitsandbytes / torch:
pip3 install bitsandbytes==0.45.3 torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

sudo apt install git
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl

# Note: FlashAttention 1.x is for Turing GPUs.
pip3 install flash-attn==1.0.9
pip3 install -U packaging==23.2 setuptools==75.8.0 wheel ninja
pip3 install --no-build-isolation axolotl[flash-attn,deepspeed]
or
pip3 install packaging
pip3 install -e '.[flash-attn,deepspeed]'

For a custom install of bitsandbytes:
git clone https://github.com/timdettmers/bitsandbytes.git
cd bitsandbytes
# CUDA_VERSIONS in {110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121}
# make argument in {cuda110, cuda11x, cuda12x}
# if you do not know what CUDA you have, try looking at the output of: python -m bitsandbytes
CUDA_VERSION=121 make cuda12x
python setup.py install

accelerate launch -m axolotl.cli.train instruct-lora-8b.yml

Validation commands:
python -m bitsandbytes
nvcc --version
nvidia-smi
python -c "import torch; print(f'PyTorch version: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}'); print(f'cuDNN version: {torch.backends.cudnn.version()}'); print(f'Device count: {torch.cuda.device_count()}')"
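If the flash-attn or bitsandbytes builds fail to locate the CUDA 12.1 toolkit installed through conda, pointing the build environment at it sometimes helps. A minimal sketch, assuming the axolotl conda environment from above is active and that the conda cuda package placed nvcc inside it:

export CUDA_HOME="$CONDA_PREFIX"
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
nvcc --version   # should report release 12.1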

Install for Turing cards on Ubuntu 22
Update through Ubuntu's Control Panel updater (~630 MB), not apt update / upgrade. Otherwise the Nvidia drivers may not install correctly, and updating through apt update/upgrade may break the OS boot process.

sudo apt install nvidia-driver-535 #This also installs nvidia-utils-535
sudo apt install nvidia-cuda-toolkit #This installs CUDA 11 on Ubuntu 22, CUDA 12 on 24

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh

Reboot

source ~/miniconda3/bin/activate
conda init --all
conda create -n axolotl python=3.10
conda activate axolotl
conda install -y -c "nvidia/label/cuda-12.1.1" cuda

pip3 install bitsandbytes==0.45.3 #These install as dependencies: torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

sudo apt install git
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl

pip3 install flash-attn==1.0.9
pip3 install -U packaging==23.2 setuptools==75.8.0 wheel ninja
pip3 install --no-build-isolation axolotl[deepspeed]

cd examples
cp -r llama-3 ~
cd ~/llama-3
accelerate launch -m axolotl.cli.train instruct-lora-8b.yml

The FlashAttention 2 API may be usable with FlashAttention 1 on Turing cards via this patch: https://github.com/rationalism/flash-attn-triton
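Before kicking off a long run it can be worth a quick smoke test that the build actually loads, and optionally pre-tokenizing the dataset. The preprocess call is an assumption based on the axolotl.cli.preprocess entry point in recent Axolotl releases:

python3 -c "import flash_attn, torch; print('flash-attn import OK, CUDA available:', torch.cuda.is_available())"
python3 -m axolotl.cli.preprocess instruct-lora-8b.yml   # assumed entry point; pre-tokenizes the dataset into last_run_prepared/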

Configuration documentation for .yaml files:
https://axolotl-ai-cloud.github.io/axolotl/docs/config.html
https://modal.com/docs/examples/llm-finetuning
HTTP configuration app for .yaml files:
Note: While training I had to modify a few configs, as the 4096 sequence_len was much too large to train the model in a reasonable time. This does have an impact on the context length and its responses, so this may just be too large a job for dual RTX 8000 cards. I reduced the sequence length to 2048 and was able to raise the micro batch size from 2 to 6, which cut training time down greatly. I also updated a few other params, including the use of fp16 (Turing cards support it).
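Keep in mind that micro_batch_size multiplies with gradient_accumulation_steps and the number of GPUs to give the effective batch size per optimizer step. A quick sanity check, assuming both RTX 8000 cards are in use and the values from the config below:

python3 -c "micro, accum, gpus = 6, 4, 2; print('effective batch size:', micro * accum * gpus)"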
System RAM usage is about 27 GB, and VRAM usage ranges from 40 to 90 GB during training.
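To see where a particular run lands in that range, GPU memory can be polled while training is underway:

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5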

~/llama-3$ ls
fft-8b-liger-fsdp.yaml           qlora-1b.yml
fft-8b.yaml                      qlora-fsdp-405b.yaml
instruct-dpo-lora-8b.yml         qlora-fsdp-70b.yaml
instruct-lora-8b.yml             qlora.yml
last_run_prepared/               README.md
lora-1b-deduplicate-dpo.yml      zero1.json
lora-1b-deduplicate-sft.yml      zero1_torch_compile.json
lora-1b-kernels.yml              zero2.json
lora-1b-ray.yml                  zero3_bf16_cpuoffload_all.json
lora-1b.yml                      zero3_bf16_cpuoffload_params.json
lora-8b.yml                      zero3_bf16.json
outputs/                         zero3.json
qlora-1b-kto.yaml

(base) user@user-Standard-PC-i440FX-PIIX-1996:~/llama-3$ cat instruct-lora-8b.yml
base_model: NousResearch/Meta-Llama-3-8B-Instruct
# optionally might have model_type or tokenizer_type
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

load_in_8bit: true
load_in_4bit: false
strict: false

plugins:
  # - "axolotl.integrations.kd.KDPlugin"
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true

#kd_trainer: True
#kd_ce_alpha: 0.1
#kd_alpha: 0.9
#kd_temperature: 1.0

#torch_compile: true

chat_template: llama3
datasets:
  - path: fozziethebeat/alpaca_messages_2k_test
    type: chat_template
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      user:
        - user
      assistant:
        - assistant

dataset_prepared_path:
val_set_size: 0.05
output_dir: ./outputs/lora-out

sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true
eval_sample_packing: false

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 6
num_epochs: 4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: false
s2_attention:

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed: ./zero1_torch_compile.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>
sdp_attention: true
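Once training finishes, the LoRA adapter is written to ./outputs/lora-out (per output_dir above). A minimal sketch for testing and merging it back into the base model, assuming the axolotl.cli.inference and axolotl.cli.merge_lora entry points from recent Axolotl releases:

accelerate launch -m axolotl.cli.inference instruct-lora-8b.yml --lora_model_dir="./outputs/lora-out"
python3 -m axolotl.cli.merge_lora instruct-lora-8b.yml --lora_model_dir="./outputs/lora-out"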