Ubuntu16.04上gtx1080的cuda安装

July 17 2016

目前tensorflow是一个非常流行的深度学习计算框架,常规硬件及系统的安装方法官方的doc已经说的很清楚了,但是 因为系统是ubuntu16.04,显卡是GTX1080,所以不可避免的要折腾起来。在上一篇已经在16.04上安装好了驱动。接下来其实 重点安装的是CUDA和cuDNN.

首先说为什么要安装CUDA和cuDNN,关于采用GPU计算比CPU有速度有多少提升的benchmark找找就有,这次重点是怎么让tensorflow充分用的 上GTX1080能力。具体的就是如何把支持GTX1080的CUDA和cuDNN装起来,然后让tensorflow认识我们新装的CUDA和cuDNN。

首先总体说下安装步骤:

准备工作

在正式开始前,需要做几个准备工作,主要是大概先看下文档

文档看过之后接下来就是实际动手的过程:

1 注册NVIDIA developer的帐号,分别下载CUDA和cuDNN

2 确认GCC版本,安装依赖库

确认本机gcc版本,16.04默认的是gcc 5,这里安装需要的最高是gcc 4.9。接下来就安装配置gcc 4.9.

2.1 安装gcc 4.9,并修改系统默认为4.9

sudo apt-get install gcc-4.9 gcc-4.9 g++-4.9 g++-4.9
gcc --version
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.9 10
sudo update-alternatives --install /usr/bin/cc cc /usr/bin/gcc 30
sudo update-alternatives --set cc /usr/bin/gcc
sudo update-alternatives --install /usr/bin/c++ c++ /usr/bin/g++ 30
sudo update-alternatives --set c++ /usr/bin/g++
gcc --version

2.2 一个小依赖

sudo apt-get install freegl

3 安装CUDA

需要注意的是这个地方有个选择安装低版本驱动的地方,选n 大致的安装流程如下:

3.1 安装CUDA

chmod  +x /cuda_8.0.27_linux.run
./cuda_8.0.27_linux.run

....

Do you accept the previously read EULA?
accept/decline/quit: accept

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 361.62?
(y)es/(n)o/(q)uit: n

Install the CUDA 8.0 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
 [ default is /usr/local/cuda-8.0 ]: 

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y

Install the CUDA 8.0 Samples?
(y)es/(n)o/(q)uit: y
    
Enter CUDA Samples Location
 [ default is /home/h ]: /home/h/Documents/cuda_samples
   
 ....

3.2 写入环境变量

vim ~/.bashrc
#添加下面变量
export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

3.3 安装好后简单验证

4 安装cuDNN

h@h:~/Downloads$ tar xvzf cudnn-8.0-linux-x64-v5.0-ga.tgz 
cuda/include/cudnn.h
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.5
cuda/lib64/libcudnn.so.5.0.5
cuda/lib64/libcudnn_static.a

h@h:~/Downloads$ sudo cp -R cuda/lib64 /usr/local/cuda/lib64
h@h:~/Downloads$ sudo mkdir -p /usr/local/cuda/include
h@h:~/Downloads/cuda$ sudo cp include/cudnn.h /usr/local/cuda/include/
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

5 clone, configure tensorflow

5.1 clone源码

$ git clone https://github.com/tensorflow/tensorflow

5.2 configure配置
整个配置流程应该跟下面的基本一样的

h@h:~/Downloads/tensorflow$ cd ./tensorflow/
h@h:~/Downloads/tensorflow$ ./configure
Please specify the location of python. [Default is /usr/bin/python]: 
***Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] N***
No Google Cloud Platform support will be enabled for TensorFlow
***Do you wish to build TensorFlow with GPU support? [y/N] y***
GPU support will be enabled for TensorFlow
Please specify which gcc nvcc should use as the host compiler. [Default is /usr/bin/gcc]: 
**Please specify the location where CUDA  toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: /usr/local/cuda-8.0 **
  
**Please specify the Cudnn version you want to use. [Leave empty to use system default]: 5.0.5**
**Please specify the location where cuDNN 5.0.5 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-8.0]: /usr/local/cuda**
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
**Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]: 6.1**
Setting up Cuda include
Setting up Cuda lib64
Setting up Cuda bin
Setting up Cuda nvvm
Setting up CUPTI include
Setting up CUPTI lib64
Configuration finished

6 编译安装

6.1 编译工具Bazel安装配置
先看一眼文档 然后就执行下面的流程:

#安装java 1.8
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

#安装好后车参考下
java -version

#添加源
echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
curl https://storage.googleapis.com/bazel-apt/doc/apt-key.pub.gpg | sudo apt-key add -
    
#下载
sudo apt-get update && sudo apt-get install bazel

#升级
sudo apt-get upgrade bazel

6.2 编译tensorflow的pip版本并安装

$ bazel build -c opt //tensorflow/tools/pip_package:build_pip_package

# To build with GPU support:
$ bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

# The name of the .whl file will depend on your platform.
#注意编译完成后生成的文件名字和官方doc里面的是不一定一致的

$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.*-linux_x86_64.whl

i6700k 32g编译时间:

7 最后测试

前面都整完了,现在该测试了,注意前面有两个动态链接库的位置,cuDNN在/usr/local/cuda/lib64, 而cuda在/usr/local/cuda-8.0/lib64,所以这个时候的bashrc应该这么写:

export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

写完后,

source ~/.bashrc
cd tensorflow/tensorflow/models/image/mnist
python convolutional.py

成功的话会出现流畅的跑动:

h@h:~/Downloads/tensorflow/tensorflow/models/image/mnist$ python convolutional.py
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so.5.0.5 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.8475
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.41GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
Initialized!
Step 0 (epoch 0.00), 8.4 ms
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
Validation error: 84.6%

......

Minibatch error: 0.0%
Validation error: 0.7%
Step 8500 (epoch 9.89), 4.7 ms
Minibatch loss: 1.601, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 0.9%
Test error: 0.8%

总算成功了!!太他妈不容易了!

参考: