Deep-Learning-Workstation-Setup

OS installation

The system is set up using Linux (Ubuntu 20.04 LTS), as this works best with (🐋 Docker) containers for Machine Learning (ML)/Deep Learning (DL) model development on NVIDIA GPUs.


Installing Ubuntu

In this example Ubuntu 20.04 LTS will be installed.

Steps accomplished:

Upgrading Ubuntu Version

An upgrade is only possible from the current version to the next release, e.g. from 20.04 LTS to 22.04 LTS.

Following the instructions from https://www.digitalocean.com/community/tutorials/how-to-upgrade-to-ubuntu-22-04-jammy-jellyfish.

sudo apt-get update
sudo apt-get upgrade
sudo reboot now

sudo apt-get dist-upgrade
sudo reboot now

sudo do-release-upgrade
# reboot will be done automatically within the upgrade process

Afterwards check the current version using lsb_release -a.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.5 LTS
Release:        22.04
Codename:       jammy 

Additionally, check whether the NVIDIA driver is still installed and working properly using nvidia-smi.

$ nvidia-smi
Mon Dec  9 11:15:48 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA TITAN RTX               Off |   00000000:65:00.0 Off |                  N/A |
| 41%   30C    P8              4W /  280W |      16MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Static network configuration for remote access

To access the workstation from a remote PC, I configured a static IP address; a minimal netplan sketch is shown below.
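A minimal sketch of a static configuration using netplan on Ubuntu. The file name, interface name (enp5s0), addresses, gateway and DNS server are placeholders; adapt them to your network. On a desktop installation the renderer may be NetworkManager instead of networkd.

# example netplan configuration for a static IP (all values are placeholders)
sudo tee /etc/netplan/01-static.yaml > /dev/null <<'EOF'
network:
  version: 2
  renderer: networkd
  ethernets:
    enp5s0:
      dhcp4: false
      addresses: [192.168.1.50/24]
      routes:
        - to: default
          via: 192.168.1.1
      nameservers:
        addresses: [192.168.1.1]
EOF
# apply the configuration
sudo netplan apply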


Remote access from Windows laptop

I want to remotely access the workstation via SSH from my Windows system (ThinkPad Yoga 380 laptop with Windows 10 installed). I will use VS Code to connect to the remote system via SSH, as well as to attach VS Code to Docker containers running on the remote system.

Steps accomplished on the Ubuntu Server Workstation

Using SSH for login

Configure an SSH configuration file (e.g. ~/.ssh/config)

# Read more about SSH config files: https://linux.die.net/man/5/ssh_config
Host <name_alias>
    HostName <ip address>
    User <username>
    PreferredAuthentications publickey
    # sample for Linux
    IdentityFile ~/.ssh/id_rsa
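
With this entry in place, the workstation can be reached via the alias; VS Code's Remote-SSH extension also picks up hosts defined in this file.

# connect using the alias defined in the SSH configuration file
ssh <name_alias>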

File transfer via SSH (Windows ➡️ Linux)

Files downloaded on the Windows system can be transferred to the Linux system from the command line using pscp.

pscp <src filepath on windows> <linux user>@<linux pc-name>:/home/<linux user>/<destination directory>/

ℹ️ Alternatively, files and folders can be transferred via drag & drop in VS Code from your host (macOS, Linux, Windows) to the currently opened folder on the Linux server.


Installation of the NVIDIA GPU driver

Disabling Nouveau

Before installing the NVIDIA driver, the Nouveau driver must first be disabled. See the instructions used for disabling the Nouveau driver; a sketch of the procedure is shown below.
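
In outline, the usual approach is to blacklist the module and rebuild the initramfs (a sketch; the file name blacklist-nouveau.conf is arbitrary):

# blacklist the Nouveau driver
sudo tee /etc/modprobe.d/blacklist-nouveau.conf > /dev/null <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
# rebuild the initramfs and reboot so the change takes effect
sudo update-initramfs -u
sudo reboot now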

Installation using the PPA repository

Uninstalling the previously installed version.

Steps accomplished for installing the new driver version
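
In short, the installation via the graphics-drivers PPA looks roughly as follows (a sketch; the driver version is a placeholder, pick one that is compatible with your GPU):

# add the graphics-drivers PPA and refresh the package index
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
# install the desired driver version (placeholder)
sudo apt-get install nvidia-driver-<version>
# reboot so the new driver gets loaded
sudo reboot now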

After the system has been rebooted you should get a similar output as below when running nvidia-smi from the command line.

$ nvidia-smi
Sat Mar  4 12:56:05 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN RTX    On   | 00000000:65:00.0  On |                  N/A |
| 41%   40C    P0    56W / 280W |   1416MiB / 24576MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Updating the NVIDIA driver installed via the PPA repository

As already mentioned above, check which major driver version is compatible with your NVIDIA GPU (see Retrieve compatible version).

# retrieve installed NVIDIA driver version using
dpkg -l | grep nvidia
# or
whereis nvidia
# or
modinfo nvidia | grep ^version

# replace the retrieved version below
current_version=<version>
# define version that should get installed
new_version=<new_version>
# check availability of driver version
apt-cache search nvidia-driver | grep ${new_version}
# uninstall/remove all packages related to the current version
sudo apt-get purge "*nvidia*${current_version}"
sudo apt-get autoremove
sudo apt-get clean
# install the new version
sudo apt-get install nvidia-driver-${new_version}
# reboot
sudo reboot now

# after reboot check
nvidia-smi

Installing Docker CE

Follow the instructions from Get Docker CE for Ubuntu.

I decided to install the Debian package manually (Docker v18.06.0 for bionic). I got an error message that a dependency was missing and had to install libltdl7 first.
Edit: Recently, I updated Docker to version 23.0.1, build a5ee5b1.

Additionally, I added my user to the docker user group as described here:
Manage Docker as a non-root user
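
In short, this boils down to adding the user to the docker group and logging out and back in (a sketch based on the linked documentation):

# create the docker group (it may already exist) and add the current user to it
sudo groupadd docker
sudo usermod -aG docker $USER
# log out and back in (or reboot) for the group membership to take effect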

After rebooting, try to run the hello-world Docker container as shown below. If it runs, everything is working.
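
The test container can be run directly from the command line:

# run Docker's test image; it prints a confirmation message if the setup works
docker run hello-world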

Ensure that Docker runs as a system-wide service and will start on boot.

service --status-all
systemctl is-enabled docker
systemctl enable docker

Installing the NVIDIA Container Toolkit

The previous nvidia-docker2 is now deprecated. Therefore, I have to uninstall the previously installed version.

sudo apt-get purge -y nvidia-docker

Afterwards follow the instructions from nvidia-docker on GitHub.

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -

curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Afterwards, try running an nvidia/cuda container and executing the command nvidia-smi inside it, e.g. with docker run --rm --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi. The --gpus all flag makes all installed GPUs available within the container.

$ docker run --rm --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi
Sat Mar  4 14:18:15 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN RTX    On   | 00000000:65:00.0  On |                  N/A |
| 41%   37C    P0    56W / 280W |   1537MiB / 24576MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

ℹ️ If you want only specific GPUs to be available in the container, you can specify them either by their index or by their GPU UUID. The indices as well as the UUIDs can be retrieved using nvidia-smi --list-gpus; an example of selecting a single GPU is sketched below.

$ nvidia-smi --list-gpus
GPU 0: NVIDIA TITAN RTX (UUID: GPU-<UUID removed here>)
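
For example, restricting the container to a single GPU could look like this (a sketch; the UUID is a placeholder):

# make only GPU 0 available in the container
docker run --rm --gpus device=0 nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi
# alternatively, select the GPU by its UUID
docker run --rm --gpus device=GPU-<UUID> nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi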

Running Docker in rootless mode

After the installation of Docker, the service runs system-wide. All users on the system are able to access all images and containers available or running on the system. This can be an issue when sharing a workstation between multiple users.

It is possible to install Docker as a per-user service instead. Each user will then only see their own images and containers.

ℹ️ I changed my setup to Docker rootless mode. For information on how this was accomplished see DockerRootless.md; a rough outline is sketched below.
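
At a high level, the rootless setup follows the official Docker documentation and looks roughly like this (a sketch; see DockerRootless.md for the actual steps taken):

# uidmap is required for rootless mode
sudo apt-get install -y uidmap
# run the setup tool as the regular (non-root) user
dockerd-rootless-setuptool.sh install
# start the per-user Docker daemon automatically, even without an active login session
systemctl --user enable docker
sudo loginctl enable-linger $USER
# point the Docker CLI at the per-user socket (e.g. add this to ~/.bashrc)
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock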


Installing the docker compose plugin

Using docker-compose.yml files for building Docker images and running containers makes life much easier. Docker Compose is mainly intended for multi-container applications, but I find it also very useful for running single containers; see the Docker Compose overview.

To use docker-compose.yml files, you have to install the Docker Compose plugin. Follow the instructions from Installation of the Compose plugin for Ubuntu as shown below.

sudo apt-get update
sudo apt-get install docker-compose-plugin

After the installation, check docker compose version. You should get an output similar to the one below.

$ docker compose version
Docker Compose version v2.16.0

Please check out how to use Docker Compose by following the examples mentioned in the main README.md - Docker Compose examples with GPU support section.
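
For reference, a minimal docker-compose.yml with GPU support could look like the sketch below (the service name and image are just examples; the actual examples live in the main README.md):

# sketch of a minimal compose file that reserves all GPUs for the service
cat > docker-compose.yml <<'EOF'
services:
  gpu-test:
    image: nvidia/cuda:12.0.1-base-ubuntu22.04
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
EOF
# run it; the container should print the nvidia-smi output and exit
docker compose up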