HPC Quick Start Guide - Mio Phi Nodes

Running Phi/Mic examples:

We have examples that illustrate the primary modes of operation of the MIC/Phi nodes. To use the Phi/MIC enabled nodes you must use Intel MPI ("impi"), and you must specify "phi" as the run queue (partition) in your batch script. For example, the preamble of a run script might look something like:

#!/bin/sh
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH -n 2
#SBATCH --export=ALL
#SBATCH -p phi
#SBATCH --overcommit
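Built out into a complete minimal job script, the preamble above might be used as follows. The srun line is only a placeholder for a real run command; the do_* scripts in the examples show realistic ones:

```shell
#!/bin/sh
#SBATCH --time=1          # wall time in minutes
#SBATCH --nodes=1
#SBATCH -n 2              # number of MPI tasks
#SBATCH --export=ALL
#SBATCH -p phi            # request the Phi queue (partition)
#SBATCH --overcommit

# Placeholder run command; replace with your mpirun/offload invocation.
srun hostname
```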

The examples cover the following modes of operation:

  1. Co-processor Offload Infrastructure (COI)
  2. MPI
  3. OpenMP (threads)
  4. Hybrid MPI/OpenMP
  5. Offload of MKL calls
  6. Directives based offload

Examples:

tgz file with all examples: http://inside.mines.edu/mio/mio001/phi.tgz
zip file with all examples: http://inside.mines.edu/mio/mio001/phi.zip
Directory listing of examples: http://inside.mines.edu/mio/mio001/phi

The files are also available on mio001. Copy the file /opt/utility/quickstart/phi.tgz to your home directory and run the command:

tar -xzf phi.tgz

This creates the directory phi containing:

[tkaiser@phi001 phi]$ ls -R
.:
coi  directive  index.html  micrun  mpi_openmp  offload

./coi:
hello_world  index.html

./coi/hello_world:
Makefile  do_coi  hello_world_sink.cpp  hello_world_source.cpp  index.html

./directive:
do_off  dooff.c  index.html  makefile

./mpi_openmp:
StomOmpf_00d.f  do_hybrid  do_mpi  do_openmp  helloc.c  hellof.f  hybrid.f90  index.html  makefile  runthread  st.in

./offload:
book  index.html  orsl_for_ao_and_cao

./offload/book:
auto.c  dosubscript  index.html  makefile  output  subscript

./offload/orsl_for_ao_and_cao:
4096.log  8192.log  do_off  index.html  makefile  run_s  run_t  t.c  t.simple.c

Examples:

Directory ~/phi/coi/hello_world

 
  1. Co-processor Offload Infrastructure (COI)

Contains a Co-processor Offload Infrastructure (COI) "coprocessor/CUDA like" example. It has a CPU part, hello_world_source.cpp, and a Phi/MIC part, hello_world_sink.cpp. When the source program is run on the CPU it launches the sink program on the cards.

To run:

make
sbatch do_coi

Typical output:

[tkaiser@mio001 hello_world]$ make
mkdir -p debug
g++ -L/opt/intel/mic/coi/host-linux-debug/lib -Wl,-rpath=/opt/intel/mic/coi/host-linux-debug/lib -I /opt/intel/mic/coi/include -lcoi_host -Wl,--enable-new-dtags -g -O0 -D_DEBUG -o debug/hello_world_source_host hello_world_source.cpp
mkdir -p debug
/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++ -L/opt/intel/mic/coi/device-linux-debug/lib -I /opt/intel/mic/coi/include -lcoi_device -rdynamic -Wl,--enable-new-dtags -g -O0 -D_DEBUG -o debug/hello_world_sink_mic hello_world_sink.cpp
mkdir -p release
g++ -L/opt/intel/mic/coi/host-linux-release/lib -Wl,-rpath=/opt/intel/mic/coi/host-linux-release/lib -I /opt/intel/mic/coi/include -lcoi_host -Wl,--enable-new-dtags -DNDEBUG -O3 -o release/hello_world_source_host hello_world_source.cpp
mkdir -p release
/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++ -L/opt/intel/mic/coi/device-linux-release/lib -I /opt/intel/mic/coi/include -lcoi_device -rdynamic -Wl,--enable-new-dtags -DNDEBUG -O3 -o release/hello_world_sink_mic hello_world_sink.cpp

[tkaiser@mio001 hello_world]$ sbatch do_coi
Submitted batch job 187
[tkaiser@mio001 hello_world]$ ls *out
slurm-187.out
[tkaiser@mio001 hello_world]$ cat *out
phi001
Hello from the sink!
4 engines available
Got engine handle
Sink process created, press enter to destroy it.
Sink process returned 0
Sink exit reason SHUTDOWN OK
[tkaiser@mio001 hello_world]$ 

Directory ~/phi/mpi_openmp

 
  1. MPI
  2. OpenMP (threads)
  3. Hybrid MPI/OpenMP

Contains MPI and OpenMP examples. The MPI example runs "hello world" on both the CPU and the Phi/MIC processors at the same time. The script do_mpi runs the MPI example and do_openmp runs an OpenMP version of the "Stommel" code.

The program hybrid.f90 is a hybrid MPI/OpenMP program. Each thread prints its thread and MPI id. It also shows how to create a collection of node-specific MPI communicators based on the name of the node on which a task is running. Each node has its own "node_com", so each thread also prints its MPI rank in the node-specific communicator.

To run:

make
sbatch do_mpi
sbatch do_openmp
sbatch do_hybrid

Typical Output:

[tkaiser@mio001 phi]$ cd ~/phi/mpi_openmp/
[tkaiser@mio001 mpi_openmp]$ ls
do_hybrid  do_mpi  do_openmp  helloc.c  hellof.f  hybrid.f90  makefile  runthread  st.in  StomOmpf_00d.f
[tkaiser@mio001 mpi_openmp]$ make
ifort -free -mmic -openmp -O3  StomOmpf_00d.f -o StomOmpf_00d.mic
rm *mod
ifort -free  -openmp -O3  StomOmpf_00d.f -o StomOmpf_00d.x86
rm *mod
mpiicc -mmic helloc.c -o helloc.mic
mpiicc  helloc.c -o helloc.x86
mpiifort -mmic hellof.f -o hellof.mic
mpiifort  hellof.f -o hellof.x86
mpiifort  -mmic  -openmp hybrid.f90 -o hybrid.mic
rm *mod
mpiifort    -openmp hybrid.f90 -o hybrid.x86
rm *mod
[tkaiser@mio001 mpi_openmp]$ 

[tkaiser@mio001 mpi_openmp]$ sbatch do_mpi
Submitted batch job 188

[tkaiser@mio001 mpi_openmp]$ ls *188*
188.script  hosts.188  slurm-188.out

[tkaiser@mio001 mpi_openmp]$ cat slurm-188.out
phi001
Hello from phi001 1 20
Hello from phi001 0 20
Hello from phi001 3 20
Hello from phi001 2 20
Hello from phi001-mic2 12 20
Hello from phi001-mic1 8 20
Hello from phi001-mic2 13 20
Hello from phi001-mic0 4 20
Hello from phi001-mic1 9 20
Hello from phi001-mic2 14 20
Hello from phi001-mic3 16 20
Hello from phi001-mic0 5 20
Hello from phi001-mic1 10 20
Hello from phi001-mic2 15 20
Hello from phi001-mic3 17 20
Hello from phi001-mic0 6 20
Hello from phi001-mic1 11 20
Hello from phi001-mic3 18 20
Hello from phi001-mic0 7 20
Hello from phi001-mic3 19 20

Tue Nov 26 10:14:53 MST 2013

[tkaiser@mio001 mpi_openmp]$ sbatch do_openmp 
Submitted batch job 189

[tkaiser@mio001 mpi_openmp]$ ls *189*
189.script  hosts.189  slurm-189.out

[tkaiser@mio001 mpi_openmp]$ head slurm-189.out
phi001
 threads=          50
   750      168917584.6    
  1500      144578230.3    
  2250      123773356.0    
  3000      105180096.6    
  3750      88327143.96    
  4500      73054749.29    
  5250      59389704.70    
  6000      47430832.06    

[tkaiser@mio001 mpi_openmp]$ tail slurm-189.out
 69750      0.000000000    
 70500      0.000000000    
 71250      0.000000000    
 72000      0.000000000    
 72750      0.000000000    
 73500      0.000000000    
 74250      0.000000000    
 75000      0.000000000    
 run time =   6.19109999999637                1          50
Tue Nov 26 10:16:21 MST 2013
[tkaiser@mio001 mpi_openmp]$

[tkaiser@mio001 mpi_openmp]$ sbatch do_hybrid
Submitted batch job 234
[tkaiser@mio001 mpi_openmp]$ ls -lt
total 4284
-rw-rw-r-- 1 tkaiser tkaiser   6480 Dec  2 10:26 slurm-234.out
-rwx------ 1 tkaiser tkaiser    580 Dec  2 10:26 tmpmz19Pk
-rw-rw-r-- 1 tkaiser tkaiser    697 Dec  2 10:26 234.script
-rw-rw-r-- 1 tkaiser tkaiser      7 Dec  2 10:26 hosts.234
...

[tkaiser@mio001 mpi_openmp]$ for m in mic0 mic1 mic2 mic3 ; do echo $m output ;cat slurm-234.out | grep -a $m | head -2 ; echo "..." ; echo "..." ;cat slurm-234.out | grep -a $m | tail -2 ; done
mic0 output
0000   08     phi001-mic0 0000    0000
0000   02     phi001-mic0 0000    0000
...
...
0003   06     phi001-mic0 0000    0003
0003   07     phi001-mic0 0000    0003
mic1 output
0004   00     phi001-mic1 0004    0000
0004   04     phi001-mic1 0004    0000
...
...
0006   07     phi001-mic1 0004    0002
0006   01     phi001-mic1 0004    0002
mic2 output
0008   00     phi001-mic2 0008    0000
0008   09     phi001-mic2 0008    0000
...
...
0011   01     phi001-mic2 0008    0003
0009   09     phi001-mic2 0008    0001
mic3 output
0015   00     phi001-mic3 0012    0003
0015   05     phi001-mic3 0012    0003
...
...
0014   07     phi001-mic3 0012    0002
0014   08     phi001-mic3 0012    0002
[tkaiser@mio001 mpi_openmp]$ 

Directory ~/phi/offload/book

 
  1. Offload of MKL calls

See: Parallel Programming and Optimization with Intel® Xeon Phi
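Automatic Offload (AO) of MKL calls needs no source changes; it is switched on at run time with environment variables. A hedged sketch of the kind of settings subscript might use (MKL_MIC_ENABLE and OFFLOAD_REPORT are standard Intel MKL/compiler controls; whether subscript sets exactly these is an assumption):

```shell
export MKL_MIC_ENABLE=1    # enable MKL Automatic Offload to the cards
export OFFLOAD_REPORT=2    # print the [MKL] [MIC ..] timing/data report
./offit                    # the binary built by make in this directory
```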

To run:

make
sbatch dosubscript

Typical Output:

[tkaiser@mio001 book]$ ls
auto.c  dosubscript  makefile  output  subscript
[tkaiser@mio001 book]$ make
icc -mkl -DSIZE=8192 auto.c -o offit
[tkaiser@mio001 book]$ ls -l
total 192
-rw-rw-r-- 1 tkaiser tkaiser   1580 Jul 25 10:35 auto.c
-rwxr-xr-x 1 tkaiser tkaiser    352 Nov 26 10:58 dosubscript
-rw-rw-r-- 1 tkaiser tkaiser     89 Jul 25 10:36 makefile
-rwxrwxr-x 1 tkaiser tkaiser 166526 Nov 26 11:01 offit
-rw-rw-r-- 1 tkaiser tkaiser   5876 Jul 25 10:38 output
-rwx------ 1 tkaiser tkaiser    237 Nov 26 10:50 subscript

[tkaiser@mio001 book]$ sbatch dosubscript
Submitted batch job 203
[tkaiser@mio001 book]$ squeue
             JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
[tkaiser@mio001 book]$ ls -lt
total 216
-rw-rw-r-- 1 tkaiser tkaiser   5876 Nov 26 11:02 out.203
-rw-rw-r-- 1 tkaiser tkaiser    352 Nov 26 11:02 203.script
-rw-rw-r-- 1 tkaiser tkaiser      7 Nov 26 11:02 hosts.203
-rw-rw-r-- 1 tkaiser tkaiser   6828 Nov 26 11:02 slurm-203.out
-rwxrwxr-x 1 tkaiser tkaiser 166526 Nov 26 11:01 offit
-rwxr-xr-x 1 tkaiser tkaiser    352 Nov 26 10:58 dosubscript
-rwx------ 1 tkaiser tkaiser    237 Nov 26 10:50 subscript
-rw-rw-r-- 1 tkaiser tkaiser   5876 Jul 25 10:38 output
-rw-rw-r-- 1 tkaiser tkaiser     89 Jul 25 10:36 makefile
-rw-rw-r-- 1 tkaiser tkaiser   1580 Jul 25 10:35 auto.c
[tkaiser@mio001 book]$ head out.203
 Intializing matrix data 

size=     8192,  GFlops=  403.568
 Intializing matrix data 

[MKL] [MIC --] [AO Function]	SGEMM
[MKL] [MIC --] [AO SGEMM Workdivision]	0.10 0.23 0.23 0.23 0.23
[MKL] [MIC 00] [AO SGEMM CPU Time]	4.588332 seconds
[MKL] [MIC 00] [AO SGEMM MIC Time]	0.691480 seconds
[MKL] [MIC 00] [AO SGEMM CPU->MIC Data]	335544320 bytes
[tkaiser@mio001 book]$ tail out.203
[MKL] [MIC 01] [AO SGEMM MIC->CPU Data]	67108864 bytes
[MKL] [MIC 02] [AO SGEMM CPU Time]	0.876555 seconds
[MKL] [MIC 02] [AO SGEMM MIC Time]	0.261351 seconds
[MKL] [MIC 02] [AO SGEMM CPU->MIC Data]	335544320 bytes
[MKL] [MIC 02] [AO SGEMM MIC->CPU Data]	67108864 bytes
[MKL] [MIC 03] [AO SGEMM CPU Time]	0.876555 seconds
[MKL] [MIC 03] [AO SGEMM MIC Time]	0.260408 seconds
[MKL] [MIC 03] [AO SGEMM CPU->MIC Data]	335544320 bytes
[MKL] [MIC 03] [AO SGEMM MIC->CPU Data]	67108864 bytes
size=     8192,  GFlops= 1235.405
[tkaiser@mio001 book]$ 

Directory ~/phi/offload/orsl_for_ao_and_cao

  1. Offload of MKL calls

See: http://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-compiler-assisted-offload-and-automatic-offload-example

Typical Output:


[tkaiser@mio001 orsl_for_ao_and_cao]$ ls
4096.log  8192.log  do_off  makefile  run_s  run_t  t.c  t.simple.c
[tkaiser@mio001 orsl_for_ao_and_cao]$ make
icc -O0 -std=c99 -Wall -g -mkl -openmp t.simple.c -o t.sim
icc -O0 -std=c99 -Wall -g -mkl -openmp t.c -o t.out

[tkaiser@mio001 orsl_for_ao_and_cao]$ sbatch do_off
Submitted batch job 205

[tkaiser@mio001 orsl_for_ao_and_cao]$ squeue
             JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
               205       phi   do_off  tkaiser   R       0:07      1 phi001

[tkaiser@mio001 orsl_for_ao_and_cao]$ squeue
             JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
               205       phi   do_off  tkaiser   R       1:03      1 phi001

[tkaiser@mio001 orsl_for_ao_and_cao]$ squeue
             JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
               205       phi   do_off  tkaiser   R       1:37      1 phi001

[tkaiser@mio001 orsl_for_ao_and_cao]$ squeue
             JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)

[tkaiser@mio001 orsl_for_ao_and_cao]$ ls -l *205*
-rw-rw-r-- 1 tkaiser tkaiser 2537 Nov 26 11:12 0_0_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2536 Nov 26 11:12 0_1_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2537 Nov 26 11:12 0_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2541 Nov 26 11:12 1_0_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2240 Nov 26 11:13 1_1_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2542 Nov 26 11:11 1_4096.205
-rw-rw-r-- 1 tkaiser tkaiser  645 Nov 26 11:11 205.script
-rw-rw-r-- 1 tkaiser tkaiser    7 Nov 26 11:11 hosts.205
-rw-rw-r-- 1 tkaiser tkaiser   36 Nov 26 11:13 slurm-205.out

[tkaiser@mio001 orsl_for_ao_and_cao]$ head 1_1_4096.205
Coprocessor access: concurrent
Manual synchronization: on
N: 4096
Offload 4096x4096 DGEMM:   475.49 GFlops
[MKL] [MIC --] [AO Function]	DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]	0.00 1.00
[MKL] [MIC 00] [AO DGEMM CPU Time]	7.864075 seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]	0.708530 seconds
[MKL] [MIC 00] [AO DGEMM CPU->MIC Data]	263979008 bytes
[MKL] [MIC 00] [AO DGEMM MIC->CPU Data]	129761280 bytes

[tkaiser@mio001 orsl_for_ao_and_cao]$ tail 1_1_4096.205
[Offload] [MIC 0] [MIC Time]        0.206782 (seconds)
[Offload] [MIC 0] [MIC->CPU Data]   138412032 (bytes)

[Offload] [MIC 0] [File]            t.c
[Offload] [MIC 0] [Line]            35
[Offload] [MIC 0] [CPU Time]        0.288746 (seconds)
[Offload] [MIC 0] [CPU->MIC Data]   415236128 (bytes)
[Offload] [MIC 0] [MIC Time]        0.206486 (seconds)
[Offload] [MIC 0] [MIC->CPU Data]   138412032 (bytes)

[tkaiser@mio001 orsl_for_ao_and_cao]$ head 0_0_4096.205
Coprocessor access: serial
Manual synchronization: off
N: 4096
[MKL] [MIC --] [AO Function]	DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]	0.00 1.00
[MKL] [MIC 00] [AO DGEMM CPU Time]	7.670934 seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]	0.716403 seconds
[MKL] [MIC 00] [AO DGEMM CPU->MIC Data]	263979008 bytes
[MKL] [MIC 00] [AO DGEMM MIC->CPU Data]	129761280 bytes
[MKL] [MIC --] [AO Function]	DGEMM

[tkaiser@mio001 orsl_for_ao_and_cao]$ tail 0_0_4096.205
[Offload] [MIC 0] [MIC Time]        0.206475 (seconds)
[Offload] [MIC 0] [MIC->CPU Data]   138412032 (bytes)

[Offload] [MIC 0] [File]            t.c
[Offload] [MIC 0] [Line]            35
[Offload] [MIC 0] [CPU Time]        0.288567 (seconds)
[Offload] [MIC 0] [CPU->MIC Data]   415236128 (bytes)
[Offload] [MIC 0] [MIC Time]        0.205986 (seconds)
[Offload] [MIC 0] [MIC->CPU Data]   138412032 (bytes)

[tkaiser@mio001 orsl_for_ao_and_cao]$ 
 
Directory ~/phi/directive

  1. Directives based offload

This shows how you can compile and offload your own functions to the cards using directives. There are a few things of note.

  1. Functions and data for the cards need the __attribute__((target(mic))) specification.
  2. We do not need the -mmic compile line option.
  3. If a card is not available then the function will be run on the CPU.

This example initializes an array on the CPU, modifies a portion of it in a function on the card, and then prints part of the array from the CPU. It also reports the number of threads available on both the CPU and the card.

To run:

make
sbatch do_off

[tkaiser@mio001 directive]$ make
icc dooff.c -o dooff
[tkaiser@mio001 directive]$ ls
dooff  do_off  dooff.c  index.html  makefile

[tkaiser@mio001 directive]$ sbatch do_off
Submitted batch job 253

Typical Output:

[tkaiser@mio001 directive]$ cat slurm-253.out 
phi001
Hello world! I have 240 logical cores.
Hello world! I have 12 logical cores.
enter k:0
1
2
3
4
1234
1234
1234
1234
1234
[tkaiser@mio001 directive]$ 
