# GPUs

This document is being ported from here.

Neuropoly has several GPUs available for training deep learning models.

We have spent money and time on this infrastructure to push science forward, so please take advantage of it!

## Connecting

As with the other machines, connect over `ssh` using your polygrames account.

## Hardware

You can inspect the available GPUs on each machine, and their current state, with `nvidia-smi`:

```console
u918374@rosenberg:~$ nvidia-smi
Fri Jun  4 01:26:14 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 00000000:04:00.0 Off |                    0 |
| N/A   25C    P0    33W / 300W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   22C    P0    33W / 300W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  On   | 00000000:09:00.0 Off |                    0 |
| N/A   24C    P0    31W / 300W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   23C    P0    31W / 300W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   31C    P0    51W / 300W |  10253MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   22C    P0    41W / 300W |  10253MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   32C    P0    51W / 300W |  10245MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   38C    P0    52W / 300W |  13684MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    4     32263      C   ...L.CA/u12345/venv-ivadomed/bin/python3.6 10243MiB |
|    5     32264      C   ...L.CA/u12345/venv-ivadomed/bin/python3.6 10243MiB |
|    6     33062      C   ...L.CA/u12345/venv-ivadomed/bin/python3.6 10235MiB |
|    7     33063      C   ...L.CA/u12345/venv-ivadomed/bin/python3.6 10235MiB |
|    7     35147      C   ...L.CA/u12345/venv-ivadomed/bin/python3.6  3439MiB |
+-----------------------------------------------------------------------------+
```

## Software

Both tensorflow and torch are supported on all of these machines. You can …

To get your software onto these servers, download it with `git clone`.

### GPU-Agnostic code

For the benefit of being able to test code out locally, without the GPU servers, it's helpful to write device-agnostic code: code that falls back to running on the (slower) CPU if GPUs are not available.

For tensorflow, this …

For pytorch, this looks like this:

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

...

# to make tensors
X = torch.empty((8, 42), device=device)

# to make neural networks
model = Network(...).to(device=device)
```
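As a usage example, here is a minimal, self-contained sketch of the same pattern in a single training step; a toy `torch.nn.Linear` model stands in for `Network(...)`, and the shapes and hyperparameters are made up for illustration. The same script runs unchanged whether or not a GPU is visible:

```python
import torch

# pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = torch.nn.Linear(42, 2).to(device)          # toy stand-in for Network(...)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

X = torch.randn(8, 42, device=device)              # batch of inputs, created on `device`
y = torch.randint(0, 2, (8,), device=device)       # batch of labels

optimizer.zero_grad()
loss = criterion(model(X), y)                      # runs on the GPU if one is visible, else the CPU
loss.backward()
optimizer.step()
print(f"loss={loss.item():.3f} on {device}")
```

In practice, batches coming from a `DataLoader` live on the CPU, so remember to move them with `X.to(device)` before the forward pass.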
## Data

As with the other stations, you should prefer getting data in via git-annex, but you can also get it via duke (which is available to you at `~/duke/temp`) or any other method (scp, curl, wget, etc.).

## Storage

### Long term, slow access (with backup)

For projects and permanent storage: `~/duke`

**Warning:** Please do not use space on duke while training your models. If you need more local space, post a request on slack #computers.

### Mid-term, rapid access (no backup)

This corresponds to your home `~/`. This is where you keep your software (conda envs, virtualenvs, etc.).

### Short-term, very rapid access (no backup)

This is where you run your experiments (e.g. deep learning training). On rosenberg, go to `~/data_nvme_$USER` or `~/data_extrassd_$USER`. On bireli and romane, go to your home `~/`.

To keep track of your disk space, you can run `df`:

```console
u108545@rosenberg:~$ # to see how much space is available on the spare disk
u108545@rosenberg:~$ df -h data_extrassd_u108545
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       440G   50G  368G  12% /mnt/extrassd
```
```console
u108545@rosenberg:~$ # to measure how much space a tool takes
u108545@rosenberg:~$ du -hs data_extrassd_u108545/miniconda3/
18G     data_extrassd_u108545/miniconda3/
```

## Good Training Habits

Instead of loading the whole dataset into memory, load samples lazily as you need them (see the sketch below). Also:

- Store data as float32 rather than float64
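A minimal sketch of both habits, assuming a hypothetical dataset laid out as one `.npy` file per sample (the folder name and file layout are made up for illustration): each sample is only read from disk when the `DataLoader` asks for it, and is cast to `float32` on the way in.

```python
import glob

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class LazyNpyDataset(Dataset):
    """Reads one sample from disk per __getitem__ instead of preloading everything."""

    def __init__(self, folder):
        # hypothetical layout: one .npy file per sample
        self.paths = sorted(glob.glob(f"{folder}/*.npy"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        sample = np.load(self.paths[i])        # read lazily, only when this index is requested
        sample = sample.astype(np.float32)     # float32 halves the memory of float64
        return torch.from_numpy(sample)


# hypothetical path; on rosenberg, keep training data on the fast local disks
loader = DataLoader(LazyNpyDataset("data_nvme_u918374/my_dataset"),
                    batch_size=8, shuffle=True, num_workers=4)
```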


## Bookings

Please allocate your GPUs cooperatively on the computer resource calendar.

**Warning:** IMPORTANT: If you don't have write permission on this calendar, please contact alexandrufoias@gmail.com.

Use this format: `u918374@rosenberg:gpu[3]`. Note that the GPUs are numbered from 0, as you can see in `nvidia-smi`.

To train, run your scripts like this:

```console
u918374@rosenberg:~$ CUDA_VISIBLE_DEVICES="3" ./train.sh
```

You can book multiple GPUs just with commas: `u918374@rosenberg:gpu[2,3,5]`

and use them with

```console
u918374@rosenberg:~$ CUDA_VISIBLE_DEVICES="2,3,5" ./train.sh
```
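Note that inside your process the booked GPUs are renumbered starting from 0: with `CUDA_VISIBLE_DEVICES="2,3,5"`, frameworks see three devices, `cuda:0`, `cuda:1` and `cuda:2`. A small pytorch sketch to check this from the Python side (the script name is just a placeholder):

```python
import torch

# run under: CUDA_VISIBLE_DEVICES="2,3,5" python your_script.py
# only the booked GPUs are visible to this process, renumbered from 0
print(torch.cuda.device_count())        # -> 3
print(torch.cuda.get_device_name(0))    # physical GPU 2, seen here as cuda:0

x = torch.zeros(4, device="cuda:0")     # allocates on the first booked GPU
print(x.device)
```

So avoid hard-coding device indices like `cuda:3` in your scripts; pick GPUs with `CUDA_VISIBLE_DEVICES` and use `cuda:0`, `cuda:1`, … (or the device-agnostic pattern above) inside the code.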
## Monitoring

You can monitor what the system is doing with

```console
htop   # CPU processes
```

and

```console
nvtop  # GPU processes
```

You can see how hot it is running with

```console
u918374@rosenberg:~$ sensors
coretemp-isa-0001
Package id 1:  +30.0°C  (high = +80.0°C, crit = +90.0°C)
Core 0:        +25.0°C  (high = +80.0°C, crit = +90.0°C)
Core 1:        +25.0°C  (high = +80.0°C, crit = +90.0°C)
```


You can see how hot its disks are running by finding out the name of the disk in /dev and running

```console
u918374@rosenberg:~$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       439G  412G  4.9G  99% /
u108545@rosenberg:~$ hddtemp /dev/sda2
/dev/sda2: PNY CS1311 480GB SSD: 30°C
```


You can also see all this information plotted over time for each machine at

### Monitoring GPUs

As above, you can see the GPUs' utilization, allocated memory, temperature and fan speed on the command line with

```console
nvidia-smi
```


or

```console
nvtop
```


You can see the same information over time at

Monitoring these metrics during training will help you choose more efficient batch sizes and find other optimizations.
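If you train with pytorch, here is a minimal sketch of reporting the same memory numbers from inside the training loop, so you can correlate them with your batch size (the helper name and where you call it are up to you):

```python
import torch

def log_gpu_memory(tag=""):
    """Print how much GPU memory this process has allocated (current and peak)."""
    if not torch.cuda.is_available():
        return
    current = torch.cuda.memory_allocated() / 1024**2       # MiB in use right now
    peak = torch.cuda.max_memory_allocated() / 1024**2      # MiB peak since the process started
    print(f"{tag}: {current:.0f} MiB allocated (peak {peak:.0f} MiB)")

# e.g. call once after the first batch of each epoch, while also watching nvidia-smi/nvtop
log_gpu_memory("after first batch")
```

If the peak stays far below the 16280 MiB of a P100, you can usually afford a larger batch size; if it is close to the limit, reduce it before you hit an out-of-memory error.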