Yet another GPU server configuration guide

Or a simple checklist to set-up your own server for deep learning

Posted by snakers41 on August 9, 2017


If you are not a professional developer, modern deep learning research can look a bit chaotic. Just be careful, read a lot and be prepared for challenges!


0. Where to start?

So, you are probably inspired by Andrew Ng's new MOOC on Coursera or by the fast.ai MOOCs (a list of Data Science related MOOCs curated by me), and you want to learn deep learning and start applying it as soon as possible. In that case, to be really productive and competitive, you basically have two options:

  • Rent a server, or use credit hours you may have earned (sometimes they are provided as an incentive to try a platform);
  • Build your own server;
  • (There is also dedicated hardware by Nvidia for deploying deep learning models; just be aware that it exists);


This article will be about the basics you should know to get started with your own server. Why build it yourself instead of just renting? Chances are high that:

  • You are already gaming and have a PC with a GPU / multiple GPUs;
  • Renting servers is a business, and it has its margin;
  • Bigger services may lag in providing reasonable GPU-accelerated instances;
  • The market always changes, but some services offer outdated GPUs, and you can often find a better price by buying previous-generation GPUs on the second-hand market;
  • Some GPU-accelerated instances are costly and aimed at professional deep learning developers, so for experiments you may want to start small with your own server;


Anyway, here is a list of services that provide GPU-accelerated instances on demand, in case you want to compare them:


If you still want to build your own server, then read the following article by Tim Dettmers first. To quote him, the TL;DR of consumer-grade GPU buying in 2017 is the following. Also note that nowadays DEEP LEARNING IS IN 95% OF CASES DONE ON NVIDIA GPUS.

TL;DR advice

  • Best GPU overall (by a small margin): Titan Xp
  • Cost efficient but expensive: GTX 1080 Ti, GTX 1070, GTX 1080
  • Cost efficient and cheap:  GTX 1060 (6GB)
  • I work with data sets > 250GB: GTX Titan X (Maxwell), NVIDIA Titan X Pascal, or NVIDIA Titan Xp
  • I have little money: GTX 1060 (6GB)
  • I have almost no money: GTX 1050 Ti (4GB)
  • I do Kaggle: GTX 1060 (6GB) for any “normal” competition, or GTX 1080 Ti for “deep learning competitions”
  • I am a competitive computer vision researcher: NVIDIA Titan Xp; do not upgrade from existing Titan X (Pascal or Maxwell)
  • I am a researcher: GTX 1080 Ti. In some cases, like natural language processing, a GTX 1070 or GTX 1080 might also be a solid choice — check the memory requirements of your current models
  • I want to build a GPU cluster: This is really complicated, you can get some ideas here
  • I started deep learning and I am serious about it: Start with a GTX 1060 (6GB). Depending on what area you choose next (startup, Kaggle, research, applied deep learning), sell your GTX 1060 and buy something more appropriate
  • I want to try deep learning, but I am not serious about it: GTX 1050 Ti (4 or 2GB)

1. So I have a GPU / decided to buy one, what's next?

I suppose that at this moment you either have a GPU-accelerated PC or are going to build one. I will not provide building instructions; there are ample guides on YouTube (also, just read your motherboard manual!). A few simple things to keep in mind when building a PC:

  • 16-32GB of RAM is usually enough, but for some deep learning Kaggle competitions having 64-128GB of RAM just makes your life much easier (not a beginner's concern);
  • You should always install a 64-bit system (preferably Ubuntu - it has the best online documentation). Why? Because of the RAM limitations of 32-bit systems;
  • As Tim pointed out, 90% of everything you need is a lot of RAM and a GPU;
  • PCI-Express cards (such as GPUs) are usually compatible with all motherboards; just make sure your motherboard supports a reasonably recent standard and has enough slots;
  • Always check which socket your CPU has and make sure your motherboard supports it;
  • If you want your PC to be small (i.e. use a non-standard motherboard size), double-check the motherboard dimensions against your case dimensions. Also, smaller motherboards may have different sockets - check this too;
  • My blog post on general trade-offs may be of use (if you speak Russian);
  • It's advisable to have a separate physical hard drive for the Linux system if you want to dual-boot (of course you can have two systems on one drive, but it may require some fiddling);


So, you are serious about continuing? Then you should first read the following threads / posts / topics:

  • A couple of umbrella posts (1, 2, 3) from my Telegram channel with the best links;
  • The fast.ai forum thread about building your own PC. Best posts: 1 and 2;
  • The 2014-2015 GPU configuration guide from fast.ai - it has a couple of broken links and assumes a Theano back-end (I will update this a little later in the article);


2. So I have a PC. Are you on Linux / Ubuntu?

I seriously advise you to get started with Ubuntu, preferably Ubuntu 16.04 LTS. Why?

Because it's just easier to install 90% of Data Science related software there. To get started, here are a few hints:

  • Download an Ubuntu image from the official resource. It's best to choose the latest 64-bit LTS (long-term support) version; in your case that is probably Ubuntu 16.04 LTS desktop. The server edition is almost the same, but it offers to install a lot of server-related packages (PHP, PostgreSQL, MySQL, etc.) out of the box;
  • Create a boot USB stick, boot from it and follow the instructions. Best software for USB stick creation (or use dd, as sketched after this list);
  • Always remember that https://askubuntu.com and the official manual are your best friends at this stage;
  • Also, do not try to read the manual from cover to cover; it's built to be searched via Google queries like 'how to backup Ubuntu';
  • Also, if you break your system, you can use this well-known live USB to boot and rescue your files;
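
If you already have a Linux machine at hand, a minimal sketch of creating the stick with dd (the ISO file name and /dev/sdX are placeholders; double-check the device name with lsblk first, since dd silently overwrites whatever it points at):

# write the ISO to the USB stick; /dev/sdX is a placeholder for your stick
sudo dd if=ubuntu-16.04-desktop-amd64.iso of=/dev/sdX bs=4M status=progress
sync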


I assume that it will take time, effort and courage if you have never worked with Linux. The rest of the article will assume that you have Ubuntu 16.04 LTS.

Also, if you are new to Ubuntu, you may want to follow these guidelines (assuming that the server is a remote one, not your desktop); a minimal sketch in shell commands follows the list:

  • update packages;
  • install a simple firewall (ufw);
  • access your server via ssh-key only;
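
A minimal sketch of these three steps (user and server-ip are placeholders; the ssh-keygen and ssh-copy-id commands run on your local machine):

# on the server: update packages
sudo apt-get update && sudo apt-get -y upgrade
# on the server: install and enable a simple firewall, allowing ssh through
sudo apt-get install ufw
sudo ufw allow ssh
sudo ufw enable
# on your local machine: generate a key pair and copy the public key over
ssh-keygen -t rsa -b 4096
ssh-copy-id user@server-ip
# finally, set 'PasswordAuthentication no' in /etc/ssh/sshd_config
# on the server and restart ssh:
sudo service ssh restart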


If you are using Windows, chances are high that you will be using PuTTY and puttygen to generate your keys. Be careful, because the default settings of puttygen do not correspond to the format that Ubuntu needs. In my case, these 2 steps were required to use puttygen:

  • Go to Key -> SSH-2 RSA key to choose the format;
  • I had to copy-paste the RSA key manually instead of using the 'Save private key' option;


3. A brief side note on backups

If you do not do backups, you will start making them after your first critical failure. So you'd better start making them right after you install the system.

I suggest the simplest approaches:

  • Have the system on one partition of your SSD and store everything else on a second partition;
  • Or have an HDD (or an mdadm RAID array) as a separate physical volume for storing backups;
  • In any case you will have to learn about fstab (an example follows the list);
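
For illustration, a typical /etc/fstab line mounting a backup partition at boot might look like this (the UUID is a placeholder; find the real one with sudo blkid):

# <file system>            <mount point>  <type>  <options>  <dump>  <pass>
UUID=your-partition-uuid   /home/backups  ext4    defaults   0       2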


These links will give you some basic understanding of backups:


This bash script will get you started if you choose to use tar + crontab (make sure you understand everything before trying it!). Note that it is written to run from a second machine: it creates the archive on the remote server over ssh and then pulls it over:

#START
# add the date to the backup file name
TIME=`date +%b-%d-%y`
# backup file name format
FILENAME=some-system-backup-$TIME.tar.gz
# folder to back up (system root)
SRCDIR=/
# destination of the backup file on the remote machine
DESDIR=/home/backups
# folders to exclude (at least the backup destination itself)
EXCLUDE='--exclude=/home/backups --exclude=/another'
# do not include files on a different filesystem
ONEFSYSPARAM='--one-file-system'
# test command validity first:
# echo -e tar -cvpzf $DESDIR/$FILENAME $EXCLUDE $ONEFSYSPARAM $SRCDIR
# create the archive remotely, pull it here, log the result and clean up
ssh sshuser@remote "tar -cpzf $DESDIR/$FILENAME $EXCLUDE $ONEFSYSPARAM $SRCDIR"
scp sshuser@remote:$DESDIR/$FILENAME /place/backup/here/
ssh sshuser@remote "echo 'WHOLE_SYSTEM_BACKUP is successful: $(date)' >> /home/bash-scripts/cron_log.log"
ssh sshuser@remote "rm $DESDIR/$FILENAME"
#END


Also, you need to learn about crontab; refer to the following manuals (an example entry follows the list):

  1. DO article;
  2. Guidelines 1 2 3;
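
As an illustration, a crontab entry (edited via crontab -e) that runs a script like the one above every week could look like this; the script path is an assumption:

# m h dom mon dow: run the backup every Sunday at 03:00 and log the output
0 3 * * 0 /home/bash-scripts/backup.sh >> /home/bash-scripts/cron_log.log 2>&1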


4. Can we finally proceed to GPU / deep learning related stuff?

4.1 GPU drivers

Yeah, finally we can. Now you have a system, you know about the importance of backing your data up, and you have a GPU. Chances are that you will have to start by installing GPU drivers from Nvidia. This is the approach that worked for me:

  1. Go to this page, find your GPU and find the latest (or better, the previous) version of the Nvidia drivers compatible with it;
  2. Read these Stack Overflow posts carefully: 1 and 2. They will also help in case you install incorrect drivers and cannot boot properly;
  3. Use nvidia-smi to confirm that everything is working;


For me personally, this approach worked (I had to change the driver version to the latest one compatible with my GPU):

# remove any previously installed Nvidia drivers
sudo apt-get purge nvidia-*
# add the graphics drivers PPA and install the right version for your GPU
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-375


4.2 CUDA and cuDNN

For CUDA, the fast.ai snippet worked fine:

# download and install CUDA (the package pulls in GPU drivers as well)
wget "http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.44-1_amd64.deb" -O "cuda-repo-ubuntu1604_8.0.44-1_amd64.deb"
sudo dpkg -i cuda-repo-ubuntu1604_8.0.44-1_amd64.deb
sudo apt-get update
sudo apt-get -y install cuda
sudo modprobe nvidia
nvidia-smi


If you have problems with your particular version, then start reading the docs from this section.

For cuDNN (the fast.ai files seem to have been taken down as of August 2017), you will have to register here to get it (there are probably mirrors somewhere). You will be offered a list of versions: download cuDNN v5.1 for CUDA 8.0. I tried using the latest ones, but they seem to conflict with the tensorflow back-end that I used.


It may be a pain in the ass because you cannot wget it directly, but assuming that you downloaded it into some folder manually and renamed the file to cudnn.tgz:

cd YOUR_FOLDER
tar -zxf cudnn.tgz
cd cuda
# copy the cuDNN libraries and headers into your CUDA installation
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/* /usr/local/cuda/include/
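
If tensorflow still cannot find the libraries afterwards, making them readable and refreshing the linker cache sometimes helps (an extra step from the tensorflow install guides of that era, not part of the original snippet):

sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
sudo ldconfig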

If you have problems, refer here and here.

4.3 Tmux, glances and nvidia-smi

Tmux is a handy console tool that enables you to create persistent sessions and detach from the console.

In a nutshell, you will understand why it is useful after looking at this screenshot:



Also consider reading the tmux cheat sheet.

Also tmux works really well together with glances and nvidia-smi:

  • Just install tmux and glances, and make sure nvidia-smi works (i.e. the drivers are properly installed);
  • Type tmux; it will create a new tmux session (tmux ls shows a list of all sessions);
  • Press ctrl+b, then % (vertical split), then ctrl+b, then " (horizontal split);
  • Use ctrl+b, then o, to cycle between the tmux panes;
  • Glances and nvidia-smi will fit nicely into two of your panes;
  • Use tmux detach to detach your console from the session, or exit to terminate the session;


# Console session management
sudo apt-get install tmux
# System monitoring
sudo apt install glances
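
Note that nvidia-smi prints a snapshot and exits; to keep it updating in a tmux pane, one option is to wrap it in watch:

# refresh the GPU stats every second
watch -n 1 nvidia-smi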

4.4 Keras and tensorflow

Keras is a high-level deep learning library that allows you to focus on the actual architecture and data preprocessing instead of low-level tensor calculations. It is usually installed together with a standard list of python scientific libraries. You may do this using anaconda, but I usually do it via pip3 to know exactly which packages I have on my system:

# Basic python libraries
sudo pip3 install numpy
sudo pip3 install matplotlib
sudo pip3 install keras
sudo pip3 install scikit-learn
sudo pip3 install pandas
sudo pip3 install scikit-image
sudo pip3 install opencv-python # a nice shortcut for having open-cv on your system

# The tensorflow back-end for keras
# https://www.tensorflow.org/install/install_linux
# https://keras.io/backend/
sudo pip3 install tensorflow-gpu
sudo pip3 install kaggle-cli


Also make sure to read the tensorflow installation docs in case something goes wrong, as well as the keras back-end docs. The tensorflow docs also feature a simple hello-world example to check that the installation was successful. By default, tf uses gpu-0.
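
For convenience, here is that hello-world check as a one-off shell command (TF 1.x-style API; log_device_placement makes tf print which devices it maps operations to, so you can see the GPU being used):

python3 - <<'EOF'
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
# the session log should mention your GPU, e.g. device:GPU:0
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(hello))
EOF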

Then you will have to configure keras like this:

# Configure Keras
mkdir -p ~/.keras
echo '{
    "image_data_format": "channels_last",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}' > ~/.keras/keras.json
cat ~/.keras/keras.json

Note that the current versions of keras (~2.0) and tensorflow work well with cuDNN 5.1 for CUDA 8, which I recommended installing above.
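
A quick way to check that keras actually picked up the tensorflow back-end:

# should print: tensorflow
python3 -c "import keras; print(keras.backend.backend())"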


4.5 Jupyter notebook

Jupyter is an interactive environment in which you can run your code and see the results immediately. To give you a hint of how powerful it is, I will just link you to the power of hydrogen notebook.

Basically a list of advantages of using jupyter:

  • Interactivity;
  • It stores a log of your experiments;
  • Easy to share;
  • Easy to document;
  • Easy to read other people's results;
  • Accessible for non-developers;
  • Can be shared as PDF, ipynb, HTML;
  • Has a lot of extensions that vastly expand your capabilities;
  • ipython magics (like ! or % or %%) enable you to time scripts or run bash scripts in the notebook;


I usually install it together with the unofficial extensions, the most important being collapsible headers.

# Jupyter notebook and extensions
# http://jupyter.readthedocs.io/en/latest/install.html
sudo pip3 install jupyter
# https://github.com/ipython-contrib/jupyter_contrib_nbextensions
sudo pip3 install jupyter_contrib_nbextensions
sudo pip3 install jupyter_nbextensions_configurator
jupyter contrib nbextension install --system
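
After installing, individual extensions can be toggled from the Nbextensions tab in the Jupyter UI or from the command line; for collapsible headers the invocation should look roughly like this (the exact extension name is best confirmed in the Nbextensions tab):

jupyter nbextension enable collapsible_headings/main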


A notebook with enabled collapsible headers will look like this:



This is the first revision of this guide, please help me to improve it by posting comments below. Many thanks!