Yuens' blog
本文有点类似工作日志,我也是对Linux Shell一点都不懂,这里只是一点点积累自己写的算法测试脚本。每次做一点更新和改动,学一点新东西,还在不断完善补充中。[toc]
如果是X86平台,可以直接去下载pre-build版本通过apt的方式安装,参考:
若是在arm等平台(gcc -v命令看target的值),可以下载源码包自行编译,区别主要是在于编译多少个线程的版本。比放clone下来代码或者下载好源码包后。
export OMP_NUM_THREADS=2之后,再编译执行make。
编译完成后即可安装,make成功会有提示,比方安装的命令可以是:make install(不带prefix似乎会默认安装到系统路径,这里我记不清了)。但通常自己会安装到自己的目录下,执行:
make PREFIX=~/software/OpenBLAS-0.20.2 install
如果用的caffe测试,因为执行caffe根目录下的./build/tools/caffe进行分类,这个caffe的可执行文件会动态地将blas连接上。可以再在自己的环境变量~/.bashrc
里添加:
export OPENBLAS_HOME=/home/YOURNAME/software/OpenBLAS-0.2.20 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$OPENBLAS_HOME/lib
再对./build/tools/caffe执行ldd命令,看是否链接上这个blas库:
ldd ~/code/caffe/build/tools/caffe | grep blas
如果显示如下,就说明链接成功:
$ ldd ./build/tools/caffe | grep blas libopenblas.so.0 => /home/YOURNAME/software/OpenBLAS-0.2.20/lib/libopenblas.so.0 (0xb5a98000)
此时若需要测试双核性能,举个例子需要先制定线程数,然后再制定跑caffe所用到的哪两个核心(当然前提是已经设置需要测试的cpu的主频最大):
sudo cpufreq-set -c 0,1 -g performance export OMP_NUM_THREADS=2 nohup taskset -c 0,1 ./build/tools/caffe time --model=models/det_task/BITInfraredDetection.deploy.prototxt > det_openblas_omp2_benchmark.log &
之后,使用openblas的双核的caffe性能测试结果将会写到这个det_openblas_omp2_benchmark.log文件里。需要注意我这里给的脚本的路径是在caffe下时候的命令。
我是使用MXNet来做测试,需要注意的是MXNet有三种不同的Engine,在选择inference时候,NaiveEngine速度是最快,也就是说inferece前,需要:
export OMP_NUM_THREADS=1 export MXNET_ENGINE_TYPE=NaiveEngine
在tegraX1上测试某算法在不同线程下的性能。具体如下
第一次写的很简单,都是for循环。测试多线程取值在1..4上不同的forward时间,以下是程序。
#!/bin/bash echo "OMP_NUM_THREADS=4" export OMP_NUM_THREADS=4 for i in {1..10} do echo "welcome $i times" python inception-v3_inference.py done echo "=======================" echo "OMP_NUM_THREADS=3" export OMP_NUM_THREADS=3 for i in {1..10} do echo "welcome $i times" python inception-v3_inference.py done echo "=======================" echo "OMP_NUM_THREADS=2" export OMP_NUM_THREADS=2 for i in {1..10} do echo "welcome $i times" python inception-v3_inference.py done echo "=======================" echo "OMP_NUM_THREADS=1" export OMP_NUM_THREADS=1 for i in {1..10} do echo "welcome $i times" python inception-v3_inference.py done
参考:Linux —— Shell编程之变量赋值和引用 - boshuzhang的专栏 - 博客频道 - CSDN.NET http://blog.csdn.net/boshuzhang/article/details/52208998
下面这种写法是将上面的四次重复的for改成了两层for,这样看起来就简洁多了。另外需要注意的是:在等号的左右不能有空格,否则就会报错,比方下面在export那句话的等号左右为了美观加上空格,报错如下:export: `=’: not a valid identifier
#!/bin/bash for thread_num in {1..4} do echo "====== set OMP_NUM_THREADS = $thread_num ======" # no space near the equal symbol export OMP_NUM_THREADS=$thread_num for idx in {1..10} do echo "[$idx]" python run_inference.py done done
如果脚本在服务器上运行时间很长,可以放在后台执行,这样ssh断开也不会影响,同时将打印的日志结果写入到文件中。
此外这个脚本还有一点正则匹配包含指定内容的行,if语句的简单用法。另外,echo的参数中, -e表示开启转义,\c表示不换行。
如下面这个testPerf.sh脚本,在本地创建testPerf.log脚本,使用命令:(./testPerf.sh > ./testPerf.log &),如果是追加方式,则将>改为>>,即:(./testPerf.sh >> ./testPerf.log &)。&这个字符是通过sub-shell让进程在后台运行HUP(hangup),参考:让进程在后台可靠运行的几种方法:http://www.ibm.com/developerworks/cn/linux/l-cn-nohup/
#!/bin/bash for mode in {1..2} do if [ $mode = 1 ]; then echo "====== GPU MODE ======" max_idx=100 else echo "====== CPU MODE ======" max_idx=10 fi # no space near the equal symbol # there're two big braces around variable for ((idx = 1; idx <= $max_idx; idx++)) do # '-e' is parameter of echo, support escape characters in command # '\c' is an escape character, means cancel newline in this sentence echo -e "[$idx]\c" if [ $mode = 1 ]; then # GPU mode ./darknet detect cfg/yolo.cfg yolo.weights data/dog.jpg | grep "Predicted" else # CPU mode ./darknet -nogpu detect cfg/yolo.cfg yolo.weights data/dog.jpg | grep "Predicted" fi done done
加入日志时间打印,并将echo统一换成printf。使用printf替换echo的主要原因如下(摘抄自Shell printf 命令 | 菜鸟教程):
#!/bin/bash for mode in {1..2} do if [ $mode = 1 ]; then printf "====== GPU MODE ======\n" max_idx=100 else printf "====== CPU MODE ======\n" max_idx=10 fi # no space near the equal symbol # there're two spaces around variable for ((idx = 1; idx <= $max_idx; idx++)) do current_time=$(date +%Y-%m-%d\ %H:%M:%S) printf "$current_time [%3d]" ${idx} if [ $mode = 1 ]; then # GPU mode #./darknet detect cfg/yolo.cfg yolo.weights data/dog.jpg | grep "Predicted" #./darknet detector test cfg/voc.data cfg/tiny-yolo-voc.cfg tiny-yolo-voc.weights data/dog.jpg | grep "Predicted" #./darknet yolo test cfg/yolov1/yolo.cfg yolov1.weights data/dog.jpg | grep "Predicted" ./darknet yolo test cfg/yolo9000.cfg yolo9000.weights data/dog.jpg | grep "Predicted" else # CPU mode #./darknet -nogpu detect cfg/yolo.cfg yolo.weights data/dog.jpg | grep "Predicted" #./darknet -nogpu detector test cfg/voc.data cfg/tiny-yolo-voc.cfg tiny-yolo-voc.weights data/dog.jpg | grep "Predicted" #./darknet -nogpu yolo test cfg/yolov1/yolo.cfg yolov1.weights data/dog.jpg | grep "Predicted" ./darknet -nogpu yolo test cfg/yolo9000.cfg yolo9000.weights data/dog.jpg | grep "Predicted" fi done done
查看CPU频率:
#!/bin/bash echo "Running CPU index:" cat /sys/devices/system/cpu/online echo "Current CPU clock frequencies:" cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq cat /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq
设定tegra X1 CPU频率:
#!/bin/bash echo 0 > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable echo 1 > /sys/devices/system/cpu/cpu0/online echo 1 > /sys/devices/system/cpu/cpu1/online echo 1 > /sys/devices/system/cpu/cpu2/online echo 1 > /sys/devices/system/cpu/cpu3/online echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
设置前,可以先查看当前频率和可用频率,分别是如下命令(如若发现cpufreq-info命令没有安装,则执行:sudo apt-get install cpufrequtils):
# 查看当前频率 cpufreq-info | grep 'current CPU' # 查看可用频率 cpufreq-info | grep 'available'
设定A53×4(这四个小核)最大频率(也可参考下面这行改成设定大核最大频率,我没试):
sudo cpufreq-set -c 0 -g performance
设置完成后,可以再次通过如下cpufre-info的命令查看当前频率,以检查是否已经达到最大的可用频率。
设定Firefly RK3399(A72×2,A53×4)最大频率(基于tegra设定最大频率修改的):
#!/bin/bash #echo 0 > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable echo 1 > /sys/devices/system/cpu/cpu0/online echo 1 > /sys/devices/system/cpu/cpu1/online echo 1 > /sys/devices/system/cpu/cpu2/online echo 1 > /sys/devices/system/cpu/cpu3/online echo 1 > /sys/devices/system/cpu/cpu4/online echo 1 > /sys/devices/system/cpu/cpu5/online echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
查看cortex-A72频率:
#!/bin/bash printf "Running CPU index:" cat /sys/devices/system/cpu/online printf "A53 cpu-index:" cat /sys/devices/system/cpu/cpu0/cpufreq/related_cpus printf "A72 cpu-index:" cat /sys/devices/system/cpu/cpu4/cpufreq/related_cpus printf "\n" max_cpu_idx=5 printf "Current CPU clock frequencies:\n" for ((cpu_idx = 0; cpu_idx <= $max_cpu_idx; cpu_idx++)) do cur_freq_path=$(printf '/sys/devices/system/cpu/cpu%s/cpufreq/scaling_cur_freq' $cpu_idx) cur_freq=$(cat $cur_freq_path) printf "cpu[%d]" $cpu_idx printf $cur_freq printf '\n' done
参考:Shell printf 命令 | 菜鸟教程 http://www.runoob.com/linux/linux-shell-printf.html
shell读取文件中的内容,并将其存入到变量中 - sidely的专栏 - 博客频道 - CSDN.NET http://blog.csdn.net/sidely/article/details/40426999
本来是测Cortex-A72(单核和双核)在inception-v3模型上的性能,但是firefly-RK3399上的这两个A72太不稳定,也不清楚是板子自身的问题还是本身A72就是不稳定,反正是一跑脚本,机器就卡死挂掉了。一摸CPU,超级热。
无奈脚本跑不了,下面是脚本testPerf.sh的内容(指定A72的单核和双核,分别测10次),运行的时候就在shell里写(./testPerf.sh >> testPerf.log &):
#!/bin/bash # set task on cpu cortex-A72 pid_num=$$ printf $pid_num taskset -pc 4,5 $pid_num for cpu_num in {1..2} do printf "using cpu num:%s\n" $cpu_num printf "OMP_NUM_THREADS=%s\n" $cpu_num export OMP_NUM_THREADS=$cpu_num for exec_idx in {1..10} do if [ $cpu_num = 1 ]; then taskset -c 4 python inception-v3_inference-once.py | grep 'take' else taskset -c 4,5 python inception-v3_inference-once.py | grep 'take' fi done done
测试RK3399上的Cortex-A53(对应CPU索引为0-3,这4个核)多线程的infer性能(每个跑10次):
#!/bin/bash # set task on cortex-A53 for cpu_num in {1..4} do printf "set OMP_NUM_THREADS=%s\n" $cpu_num export OMP_NUM_THREADS=$cpu_num for exec_idx in {1..10} do if [ $cpu_num = 1 ] then taskset -c 0 python run_inference.py elif [ $cpu_num = 2 ] then taskset -c 0,1 python run_inference.py elif [ $cpu_num = 3 ] then taskset -c 0,1,2 python run_inference.py elif [ $cpu_num = 4 ] then taskset -c 0,1,2,3 python run_inference.py else printf "abnormal\n" fi done done
测试RK3399上的Cortex-A53(对应CPU索引为4和5,这2个核)多线程的infer性能:
#!/bin/bash # set task on cortex-a72 printf "set task on cortex-a72\n" for cpu_num in {1..2} do printf "set OMP_NUM_THREADS=%s\n" $cpu_num export OMP_NUM_THREADS=$cpu_num for exec_idx in {1..10} do if [ $cpu_num = 1 ] then taskset -c 4 python run_inference.py elif [ $cpu_num = 2 ] then taskset -c 4,5 python run_inference.py else printf "abnormal\n" fi done done
锁频模式(但是被吐槽这个不是锁频,有待商议),可以指定GPU的时钟速度,使用如下命令开启:
ysh329@ubuntu:~/code$ sudo nvidia-smi -pm 1 [sudo] password for yuanshuai: Persistence mode is already Enabled for GPU 00000002:01:00.0. Persistence mode is already Enabled for GPU 00000003:01:00.0. Persistence mode is already Enabled for GPU 0000000A:01:00.0. Persistence mode is already Enabled for GPU 0000000B:01:00.0. All done.
高性能模式(这个频率不确定,有的chip频率就会变):
nvidia-smi -q -d PERFORMANCE
设定memory和graphics的主频(应该就是锁死了,我测了矩阵乘法的同时,实时监控Clocks发现跑程序时候的主频就是设定的固定不变的),设定如下:
$ #nvidia-smi --applications-clocks=<memory,graphics> $ sudo nvidia-smi --applications-clocks=715,1480 Applications clocks set to "(MEM 715, SM 1480)" for GPU 00000002:01:00.0 Applications clocks set to "(MEM 715, SM 1480)" for GPU 00000003:01:00.0 Applications clocks set to "(MEM 715, SM 1480)" for GPU 0000000A:01:00.0 Applications clocks set to "(MEM 715, SM 1480)" for GPU 0000000B:01:00.0 All done.
如果你设置主频的值不在支持范围之内.会提示你用命令nvidia-smi -q -d SUPPORTED_CLOCKS来查看支持的频率值。
yuanshuai@ubuntu:~$ sudo nvidia-smi --applications-clocks=715,111 Specified clock combination "(MEM 715, SM 111)" is not supported for GPU 00000002:01:00.0. Run 'nvidia-smi -q -d SUPPORTED_CLOCKS' to see list of supported clock combinations Treating as warning and moving on. Specified clock combination "(MEM 715, SM 111)" is not supported for GPU 00000003:01:00.0. Run 'nvidia-smi -q -d SUPPORTED_CLOCKS' to see list of supported clock combinations Treating as warning and moving on. Specified clock combination "(MEM 715, SM 111)" is not supported for GPU 0000000A:01:00.0. Run 'nvidia-smi -q -d SUPPORTED_CLOCKS' to see list of supported clock combinations Treating as warning and moving on. Specified clock combination "(MEM 715, SM 111)" is not supported for GPU 0000000B:01:00.0. Run 'nvidia-smi -q -d SUPPORTED_CLOCKS' to see list of supported clock combinations Treating as warning and moving on. All done.
监控GPU实时频率(下面是4块GPU中的第一块因为下标-i参数是0,-q代表query查询,-i代表index第几块gpu默认不带该参数列出所有gpu,实时主频是看Clocks这个section):
watch -n .1 nvidia-smi -q -i 0 --display=CLOCK ==============NVSMI LOG============== Timestamp : Thu Aug 31 20:25:27 2017 Driver Version : 384.59 Attached GPUs : 4 GPU 00000002:01:00.0 Clocks Graphics : 405 MHz SM : 405 MHz Memory : 715 MHz Video : 835 MHz Applications Clocks Graphics : 1480 MHz Memory : 715 MHz Default Applications Clocks Graphics : 1328 MHz Memory : 715 MHz Max Clocks Graphics : 1480 MHz SM : 1480 MHz [0/455] Memory : 715 MHz Video : 1480 MHz Max Customer Boost Clocks Graphics : N/A SM Clock Samples Duration : 947568.53 sec Number of Samples : 100 Max : 1480 MHz Min : 405 MHz Avg : 410 MHz Memory Clock Samples Duration : 947572.97 sec Number of Samples : 100 Max : 715 MHz Min : 715 MHz Avg : 715 MHz Clock Policy Auto Boost : N/A Auto Boost Default : N/A
观察是否是最大主频可以把/usr/local/cuda/sample目录整个拷贝出到自己目录(否则make时候会有helper_functions.h:No such file的错误,因这个h文件在该sample/common下),然后进入里面的矩阵运算samples/0_Simple/matrixMul的例子,make一下,然后跑matrixMul,用带watch的命令实时看看主频是否达到设定的主频。
参考(这个写的很全面):nvidia-smi: Control Your GPUs | Microway https://www.microway.com/hpc-tech-tips/nvidia-smi_control-your-gpus/