Layout of Mio001.mines.edu


This is a block diagram of the initial Mio001 layout.


Environment Setup

We primarily use modules to set up the environment on mio001, as we do on BlueM, AuN, and Mc2. The command module avail shows the current set of modules.

[joeuser@mio001 utility]$ module avail

--------- /usr/share/Modules/modulefiles ---------
dot         module-info null        utility
module-cvs  modules     use.own

--------- /opt/modulefiles ---------
PrgEnv/intel/13.0.1                  ansys/fluent/15.0
PrgEnv/intel/default                 impi/gcc/4.1.1
PrgEnv/libs/cuda/5.5                 impi/intel/4.1.1
PrgEnv/libs/fftw/gcc/3.3.3           openmpi/gcc/1.6.5
PrgEnv/libs/fftw/intel/3.3.3         openmpi/gcc/default
PrgEnv/libs/opencl/1.2               openmpi/intel/1.6.5
PrgEnv/pgi/14.3                      openmpi/intel/1.6.5_test
PrgEnv/pgi/default                   openmpi/intel/default
PrgEnv/python/Enthought/2.7.2_v7.1-2
[joeuser@mio001 utility]$ 

See also: mod.html


Initial Environment Setup

You need to set up your environment to use the Intel compilers and either the OpenMPI or Intel version of MPI. (Both OpenMPI and Intel MPI use the Intel compilers as backends.) See the procedures below. Most people use the OpenMPI version of MPI, but if you are running on the Phi/MIC nodes you must use Intel MPI.

Setup to use the Intel compilers and OpenMPI for MPI

You can get a basic setup for running jobs using Intel compilers and the OpenMPI version of MPI by adding the following lines to your .bash_profile file:

#load the Intel compilers and the OpenMPI version of MPI
module load PrgEnv/intel/default >& /dev/null
module load utility >& /dev/null
module load openmpi/intel/default >& /dev/null
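
After your next login (or after sourcing .bash_profile), you can confirm the setup; a minimal sketch, assuming the modules put the compilers and the OpenMPI wrapper on your PATH:

module list          # should list the PrgEnv, utility, and MPI modules
which ifort mpicc    # should resolve to the Intel compiler and the OpenMPI wrapper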

Setup to use the Intel compilers and the Intel version of MPI

If you want to use the Intel version of MPI then you must load the Intel impi module by adding the following lines to your .bash_profile file:

module load PrgEnv/intel/default >& /dev/null
module load utility >& /dev/null
#load the Intel MPI compiler impi
module load impi/intel/4.1.1 >& /dev/null

Here we have replaced the last line with the load of impi instead of openmpi.
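
With impi loaded, MPI codes are built with the Intel MPI compiler wrappers. The makefile output later on this page uses mpiicc and mpiifort; for example:

mpiicc  helloc.c -o helloc.x86         # C, host (Xeon) executable
mpiifort hellof.f -o hellof.x86        # Fortran, host executable
mpiicc  -mmic helloc.c -o helloc.mic   # add -mmic to build a native Phi/MIC executable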

Setup to use the Portland Group and CUDA compilers

If you want to use the GPU nodes, add the following to your .bash_profile file:

module load utility >& /dev/null
module load PrgEnv/pgi/default >& /dev/null
module load PrgEnv/libs/cuda/5.5 >& /dev/null

GPU Nodes:
  gpu001 - 2 Tesla T10 Processor cards
  gpu002 - 2 Tesla T10 Processor cards
  gpu003 - 3 Tesla M2070 Processor cards

We have seen some Portland Group CUDA Fortran programs fail on the older Tesla T10 cards.
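
A quick check-and-build sketch for the GPU nodes, assuming nvcc is provided by the cuda module, and using hypothetical source files saxpy.cu and saxpy.cuf of your own:

nvidia-smi                              # list the Tesla cards on the node
nvcc saxpy.cu -o saxpy                  # build a CUDA C program
pgfortran -Mcuda saxpy.cuf -o saxpy_f   # build a PGI CUDA Fortran program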


Scheduler

The scheduler on mio001 has partitions. If you do not care which nodes you run on, you do not need to specify a partition. If you would like to run on your group's nodes, or on the Phi or GPU nodes, you need to specify a partition. As discussed below, the command for submitting a batch job is sbatch script, where script is your script. To run in the phi partition, and thus on the phi nodes, the syntax is sbatch -p phi script.
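
For example (cwp is just one of the group partitions shown in the sinfo listing below):

sbatch script            # run on any free nodes in the default compute partition
sbatch -p gpu script     # run on the GPU nodes
sbatch -p cwp script     # run on the cwp group's nodes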

As of Thu Apr 10 15:51:39 MDT 2014 the following partitions are defined.

[joeuser@mio001 ~]$ sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES   NODELIST
compute*     up 6-00:00:00     45   compute[000-005,026-030,054-055,062-067,
                                            084-085,090,100-101,103-121,124-125]
phi          up 6-00:00:00      2   phi[001-002]
gpu          up 6-00:00:00      3   gpu[001-003]
cwp          up 6-00:00:00      2   compute[084-085]
tkaiser      up 6-00:00:00      2   compute[004-005]
jbrune       up 6-00:00:00      4   compute[000-003,100-101]
cciobanu     up 6-00:00:00      2   compute[054,090]
anewman      up 6-00:00:00      1   compute055
psava        up 6-00:00:00     19   compute[103-121]
pconstan     up 6-00:00:00      2   compute[124-125]
lcarr        up 6-00:00:00     11   compute[026-030,062-067]
[joeuser@mio001 ~]$ 

The batch scheduler on mio001 is Slurm, not PBS/Torque. The functionality is similar but the commands are different.

Basic Slurm commands

sbatch - Submit a batch script to Slurm.
squeue - View information about jobs in the Slurm scheduling queue.
sinfo - View information about Slurm nodes and partitions.
scancel - Signal or cancel jobs or job steps that are under the control of Slurm.
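
Typical usage looks like the following sketch, where myscript and jobid are placeholders:

sbatch myscript      # submit a batch script; prints "Submitted batch job <jobid>"
squeue -u $USER      # list only your own jobs
sinfo -a             # list all partitions, including the group partitions
scancel jobid        # cancel a queued or running job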

Rosetta Stone

rosetta.pdf - Shows the mapping between common PBS, Slurm, and LoadLeveler commands.

Slurm related commands unique to Mio

There are Slurm-related commands that are unique to Mio.

slurmnodes - Information about the available nodes
slurmjobs - Information about the queue and running jobs
expands - A utility for getting a full list of the nodes used for a job

"help" for these commands is shown below.

[joeuser@mio001 ~]$ /opt/utility/slurmnodes -help
/opt/utility/slurmnodes
     Options:
        Without arguments show full information for all nodes
       -fATTRIBUTE
           Show only the given attribute, Witout ATTRIBUTE just list the nodes
        list of nodes show
           Show information for the given nodes
        -h
           Show this help

[joeuser@mio001 ~]$ /opt/utility/slurmjobs -h
/opt/utility/slurmjobs
     Options:
        Without arguments show full information for all jobs
       -fATTRIBUTE
           Show only the given attribute, Witout ATTRIBUTE just list the jobs and users
       -uUSERNAME or USERNAME
           Show only jobs for USERNAME
        list of jobs to show
           Show information for the given jobs
        -h
           Show this help


[joeuser@mio001 ~]$ /opt/utility/expands -help
    Options:
        Without arguments
           If SLURM_NODELIST is defined
           use it to find the node list as
           described below.
      
           Note: SLURM_NODELIST is defined
           by slurm when running a parallel
           job so this command in realy only
           inside a batch script or when
           running interactive parallel jobs
      
        -h
           Show this help
      
     Usage:
        Takes an optional single command line argument,
        the environmental variable SLURM_NODELIST
        defined within a slurm job.
      
        SLURM_NODELIST is a compressed list of nodes
        assigned to a job.  This command returns an expanded
        list similar to what is defined in the PBS_NODES_FILE
        under pbs.
      
     Example:
[joeuser@mio001 utility]$ printenv SLURM_NODELIST
compute[004-005]
[joeuser@mio001 utility]$  ./expands  $SLURM_NODELIST
compute004
compute004
compute004
compute004
compute005
compute005
compute005
compute005
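
A sketch of one possible use of expands inside a batch script: turning SLURM_NODELIST into a PBS-style host file (SLURM_JOB_ID is the standard Slurm job id variable):

/opt/utility/expands $SLURM_NODELIST > hosts.$SLURM_JOB_ID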


Quick start guide for Running MPI programs:

First set up your environment for MPI as discussed above. Then you are ready to build and run MPI applications. The procedure for building and running standard MPI programs is documented here. That page contains example programs, makefiles, and Slurm script examples.
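
As a minimal sketch, assuming the OpenMPI setup above and a source file hello.c of your own:

mpicc hello.c -o hello    # OpenMPI compiler wrapper (use mpiicc with impi)
sbatch runscript          # runscript is a Slurm script whose launch line is: srun ./hello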


Important slurm scripting considerations

  1. The sentinel for options in scripts is #SBATCH.
  2. Exporting your environment to your parallel jobs.

The option #SBATCH --export=ALL will export your environment to your running parallel jobs. #SBATCH --export=ALL replaces the -V option in PBS.

  3. Parallel run commands mpiexec and mpirun are replaced with srun.

The normal command used within a script to launch an MPI program is srun. srun is required for Intel "impi" MPI jobs; mpiexec and mpirun will not work with impi. If you are using OpenMPI you can still use mpiexec.
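
Putting these together, a minimal MPI batch script might look like the following sketch (myprogram is a placeholder for your own executable):

#!/bin/bash
#SBATCH --job-name=mpi_test    # the #SBATCH sentinel marks scheduler options
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:10:00
#SBATCH --export=ALL           # export your environment (replaces -V in PBS)

# srun launches the MPI tasks; required for impi, and works for OpenMPI as well
srun ./myprogram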


Running Phi/Mic examples:

We have examples that illustrate the primary modes of operation of the Phi/MIC nodes. If you would like to use the Phi/MIC-enabled nodes you must use Intel "impi"; that is, you must load the impi module. See: Setup to use the Intel compilers and the Intel version of MPI, above. Also, you must specify "phi" as the partition in your batch script. For example, the preamble for a run script might look something like:

#!/bin/sh
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH -n 2
#SBATCH --export=ALL
#SBATCH -p phi
#SBATCH --overcommit

You may be interested in the OpenMP reference cards.

The examples illustrate:

  1. Co-processor Offload Infrastructure (COI)
  2. MPI
  3. OpenMP (threads)
  4. Hybrid MPI/OpenMP
  5. Offload of MKL calls
  6. Directives based offload

Examples:

tgz file with all examples: http://inside.mines.edu/mio/mio001/phi.tgz
zip file with all examples: http://inside.mines.edu/mio/mio001/phi.zip
Directory listing of examples: http://inside.mines.edu/mio/mio001/phi

The files are also available on mio001. Copy the file /opt/utility/quickstart/phi.tgz to your home directory and run the command:

tar -xzf phi.tgz

This creates the directory phi containing:

[tkaiser@phi001 phi]$ ls -R
.:
coi  directive  index.html  micrun  mpi_openmp  offload

./coi:
hello_world  index.html

./coi/hello_world:
Makefile  do_coi  hello_world_sink.cpp  hello_world_source.cpp  index.html

./directive:
do_off  dooff.c  index.html  makefile

./mpi_openmp:
StomOmpf_00d.f  do_hybrid  do_mpi  do_openmp  helloc.c  hellof.f  hybrid.f90  index.html  makefile  runthread  st.in

./offload:
book  index.html  orsl_for_ao_and_cao

./offload/book:
auto.c  dosubscript  index.html  makefile  output  subscript

./offload/orsl_for_ao_and_cao:
4096.log  8192.log  do_off  index.html  makefile  run_s  run_t  t.c  t.simple.c

Examples:

Directory ~/phi/coi/hello_world

  1. Co-processor Offload Infrastructure (COI)

Contains a Co-processor Offload Infrastructure (COI) "coprocessor/CUDA-like" example. It has a CPU part, hello_world_source.cpp, and a Phi/MIC part, hello_world_sink.cpp. When hello_world_source.cpp is run it launches hello_world_sink.cpp on the cards.

To run:

make
sbatch runfile

Typical output:

[tkaiser@mio001 hello_world]$ make
mkdir -p debug
g++ -L/opt/intel/mic/coi/host-linux-debug/lib -Wl,-rpath=/opt/intel/mic/coi/host-linux-debug/lib -I /opt/intel/mic/coi/include -lcoi_host -Wl,--enable-new-dtags -g -O0 -D_DEBUG -o debug/hello_world_source_host hello_world_source.cpp
mkdir -p debug
/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++ -L/opt/intel/mic/coi/device-linux-debug/lib -I /opt/intel/mic/coi/include -lcoi_device -rdynamic -Wl,--enable-new-dtags -g -O0 -D_DEBUG -o debug/hello_world_sink_mic hello_world_sink.cpp
mkdir -p release
g++ -L/opt/intel/mic/coi/host-linux-release/lib -Wl,-rpath=/opt/intel/mic/coi/host-linux-release/lib -I /opt/intel/mic/coi/include -lcoi_host -Wl,--enable-new-dtags -DNDEBUG -O3 -o release/hello_world_source_host hello_world_source.cpp
mkdir -p release
/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++ -L/opt/intel/mic/coi/device-linux-release/lib -I /opt/intel/mic/coi/include -lcoi_device -rdynamic -Wl,--enable-new-dtags -DNDEBUG -O3 -o release/hello_world_sink_mic hello_world_sink.cpp

[tkaiser@mio001 hello_world]$ sbatch runfile
Submitted batch job 187
[tkaiser@mio001 hello_world]$ ls *out
slurm-187.out
[tkaiser@mio001 hello_world]$ cat *out
phi001
Hello from the sink!
4 engines available
Got engine handle
Sink process created, press enter to destroy it.
Sink process returned 0
Sink exit reason SHUTDOWN OK
[tkaiser@mio001 hello_world]$ 

Directory ~/phi/mpi_openmp

  1. MPI
  2. OpenMP (threads)
  3. Hybrid MPI/OpenMP

Contains MPI and OpenMP examples. This MPI example runs "hello world" on both the CPU and PHI/MIC processors at the same time. The script do_mpi runs the mpi example and do_openmp runs an OpenMP version of the "Stommel" code.

The program hybrid.f90 is a hybrid MPI/OpenMP program. Each thread prints its thread id and MPI id. The program also shows how to create a collection of node-specific MPI communicators based on the name of the node on which a task is running. Each node has its own "node_com", so each thread also prints its MPI rank in the node-specific communicator.

To run:

make
sbatch do_mpi
sbatch do_openmp
sbatch do_hybrid

Typical Output:

[tkaiser@mio001 phi]$ cd ~/phi/mpi_openmp/
[tkaiser@mio001 mpi_openmp]$ ls
do_hybrid  do_mpi  do_openmp  helloc.c  hellof.f  hybrid.f90  makefile  runthread  st.in  StomOmpf_00d.f
[tkaiser@mio001 mpi_openmp]$ make
ifort -free -mmic -openmp -O3  StomOmpf_00d.f -o StomOmpf_00d.mic
rm *mod
ifort -free  -openmp -O3  StomOmpf_00d.f -o StomOmpf_00d.x86
rm *mod
mpiicc -mmic helloc.c -o helloc.mic
mpiicc  helloc.c -o helloc.x86
mpiifort -mmic hellof.f -o hellof.mic
mpiifort  hellof.f -o hellof.x86
mpiifort  -mmic  -openmp hybrid.f90 -o hybrid.mic
rm *mod
mpiifort    -openmp hybrid.f90 -o hybrid.x86
rm *mod
[tkaiser@mio001 mpi_openmp]$ 

[tkaiser@mio001 mpi_openmp]$ sbatch do_mpi
Submitted batch job 188

[tkaiser@mio001 mpi_openmp]$ ls *188*
188.script  hosts.188  slurm-188.out

[tkaiser@mio001 mpi_openmp]$ cat slurm-188.out
phi001
Hello from phi001 1 20
Hello from phi001 0 20
Hello from phi001 3 20
Hello from phi001 2 20
Hello from phi001-mic2 12 20
Hello from phi001-mic1 8 20
Hello from phi001-mic2 13 20
Hello from phi001-mic0 4 20
Hello from phi001-mic1 9 20
Hello from phi001-mic2 14 20
Hello from phi001-mic3 16 20
Hello from phi001-mic0 5 20
Hello from phi001-mic1 10 20
Hello from phi001-mic2 15 20
Hello from phi001-mic3 17 20
Hello from phi001-mic0 6 20
Hello from phi001-mic1 11 20
Hello from phi001-mic3 18 20
Hello from phi001-mic0 7 20
Hello from phi001-mic3 19 20

Tue Nov 26 10:14:53 MST 2013

[tkaiser@mio001 mpi_openmp]$ sbatch do_openmp 
Submitted batch job 189

[tkaiser@mio001 mpi_openmp]$ ls *189*
189.script  hosts.189  slurm-189.out

[tkaiser@mio001 mpi_openmp]$ head slurm-189.out
phi001
 threads=          50
   750      168917584.6    
  1500      144578230.3    
  2250      123773356.0    
  3000      105180096.6    
  3750      88327143.96    
  4500      73054749.29    
  5250      59389704.70    
  6000      47430832.06    

[tkaiser@mio001 mpi_openmp]$ tail slurm-189.out
 69750      0.000000000    
 70500      0.000000000    
 71250      0.000000000    
 72000      0.000000000    
 72750      0.000000000    
 73500      0.000000000    
 74250      0.000000000    
 75000      0.000000000    
 run time =   6.19109999999637                1          50
Tue Nov 26 10:16:21 MST 2013
[tkaiser@mio001 mpi_openmp]$

[tkaiser@mio001 mpi_openmp]$ sbatch do_hybrid
Submitted batch job 234
[tkaiser@mio001 mpi_openmp]$ ls -lt
total 4284
-rw-rw-r-- 1 tkaiser tkaiser   6480 Dec  2 10:26 slurm-234.out
-rwx------ 1 tkaiser tkaiser    580 Dec  2 10:26 tmpmz19Pk
-rw-rw-r-- 1 tkaiser tkaiser    697 Dec  2 10:26 234.script
-rw-rw-r-- 1 tkaiser tkaiser      7 Dec  2 10:26 hosts.234
...

[tkaiser@mio001 mpi_openmp]$ for m in mic0 mic1 mic2 mic3 ; do echo $m output ;cat slurm-234.out | grep -a $m | head -2 ; echo "..." ; echo "..." ;cat slurm-234.out | grep -a $m | tail -2 ; done
mic0 output
0000   08     phi001-mic0 0000    0000
0000   02     phi001-mic0 0000    0000
...
...
0003   06     phi001-mic0 0000    0003
0003   07     phi001-mic0 0000    0003
mic1 output
0004   00     phi001-mic1 0004    0000
0004   04     phi001-mic1 0004    0000
...
...
0006   07     phi001-mic1 0004    0002
0006   01     phi001-mic1 0004    0002
mic2 output
0008   00     phi001-mic2 0008    0000
0008   09     phi001-mic2 0008    0000
...
...
0011   01     phi001-mic2 0008    0003
0009   09     phi001-mic2 0008    0001
mic3 output
0015   00     phi001-mic3 0012    0003
0015   05     phi001-mic3 0012    0003
...
...
0014   07     phi001-mic3 0012    0002
0014   08     phi001-mic3 0012    0002
[tkaiser@mio001 mpi_openmp]$ 

Directory ~/phi/offload/book

  1. Offload of MKL calls

See: Parallel Programming and Optimization with Intel® Xeon Phi

To run:

make
sbatch dosubscript

Typical Output:

[tkaiser@mio001 book]$ ls
auto.c  dosubscript  makefile  output  subscript
[tkaiser@mio001 book]$ make
icc -mkl -DSIZE=8192 auto.c -o offit
[tkaiser@mio001 book]$ ls -l
total 192
-rw-rw-r-- 1 tkaiser tkaiser   1580 Jul 25 10:35 auto.c
-rwxr-xr-x 1 tkaiser tkaiser    352 Nov 26 10:58 dosubscript
-rw-rw-r-- 1 tkaiser tkaiser     89 Jul 25 10:36 makefile
-rwxrwxr-x 1 tkaiser tkaiser 166526 Nov 26 11:01 offit
-rw-rw-r-- 1 tkaiser tkaiser   5876 Jul 25 10:38 output
-rwx------ 1 tkaiser tkaiser    237 Nov 26 10:50 subscript

[tkaiser@mio001 book]$ sbatch runfile
Submitted batch job 203
[tkaiser@mio001 book]$ squeue
             JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
[tkaiser@mio001 book]$ ls -lt
total 216
-rw-rw-r-- 1 tkaiser tkaiser   5876 Nov 26 11:02 out.203
-rw-rw-r-- 1 tkaiser tkaiser    352 Nov 26 11:02 203.script
-rw-rw-r-- 1 tkaiser tkaiser      7 Nov 26 11:02 hosts.203
-rw-rw-r-- 1 tkaiser tkaiser   6828 Nov 26 11:02 slurm-203.out
-rwxrwxr-x 1 tkaiser tkaiser 166526 Nov 26 11:01 offit
-rwxr-xr-x 1 tkaiser tkaiser    352 Nov 26 10:58 dosubscript
-rwx------ 1 tkaiser tkaiser    237 Nov 26 10:50 subscript
-rw-rw-r-- 1 tkaiser tkaiser   5876 Jul 25 10:38 output
-rw-rw-r-- 1 tkaiser tkaiser     89 Jul 25 10:36 makefile
-rw-rw-r-- 1 tkaiser tkaiser   1580 Jul 25 10:35 auto.c
[tkaiser@mio001 book]$ head out.203
 Intializing matrix data 

size=     8192,  GFlops=  403.568
 Intializing matrix data 

[MKL] [MIC --] [AO Function]	SGEMM
[MKL] [MIC --] [AO SGEMM Workdivision]	0.10 0.23 0.23 0.23 0.23
[MKL] [MIC 00] [AO SGEMM CPU Time]	4.588332 seconds
[MKL] [MIC 00] [AO SGEMM MIC Time]	0.691480 seconds
[MKL] [MIC 00] [AO SGEMM CPU->MIC Data]	335544320 bytes
[tkaiser@mio001 book]$ tail out.203
[MKL] [MIC 01] [AO SGEMM MIC->CPU Data]	67108864 bytes
[MKL] [MIC 02] [AO SGEMM CPU Time]	0.876555 seconds
[MKL] [MIC 02] [AO SGEMM MIC Time]	0.261351 seconds
[MKL] [MIC 02] [AO SGEMM CPU->MIC Data]	335544320 bytes
[MKL] [MIC 02] [AO SGEMM MIC->CPU Data]	67108864 bytes
[MKL] [MIC 03] [AO SGEMM CPU Time]	0.876555 seconds
[MKL] [MIC 03] [AO SGEMM MIC Time]	0.260408 seconds
[MKL] [MIC 03] [AO SGEMM CPU->MIC Data]	335544320 bytes
[MKL] [MIC 03] [AO SGEMM MIC->CPU Data]	67108864 bytes
size=     8192,  GFlops= 1235.405
[tkaiser@mio001 book]$ 

Directory ~/phi/offload/orsl_for_ao_and_cao

  1. Offload of MKL calls
See: http://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-compiler-assisted-offload-and-automatic-offload-example

Typical Output:


[tkaiser@mio001 orsl_for_ao_and_cao]$ ls
4096.log  8192.log  do_off  makefile  run_s  run_t  t.c  t.simple.c
[tkaiser@mio001 orsl_for_ao_and_cao]$ make
icc -O0 -std=c99 -Wall -g -mkl -openmp t.simple.c -o t.sim
icc -O0 -std=c99 -Wall -g -mkl -openmp t.c -o t.out

[tkaiser@mio001 orsl_for_ao_and_cao]$ sbatch runfile
Submitted batch job 205

[tkaiser@mio001 orsl_for_ao_and_cao]$ squeue
             JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
               205       phi   do_off  tkaiser   R       0:07      1 phi001

[tkaiser@mio001 orsl_for_ao_and_cao]$ squeue
             JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
               205       phi   do_off  tkaiser   R       1:03      1 phi001

[tkaiser@mio001 orsl_for_ao_and_cao]$ squeue
             JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
               205       phi   do_off  tkaiser   R       1:37      1 phi001

[tkaiser@mio001 orsl_for_ao_and_cao]$ squeue
             JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)

[tkaiser@mio001 orsl_for_ao_and_cao]$ ls -l *205*
-rw-rw-r-- 1 tkaiser tkaiser 2537 Nov 26 11:12 0_0_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2536 Nov 26 11:12 0_1_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2537 Nov 26 11:12 0_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2541 Nov 26 11:12 1_0_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2240 Nov 26 11:13 1_1_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2542 Nov 26 11:11 1_4096.205
-rw-rw-r-- 1 tkaiser tkaiser  645 Nov 26 11:11 205.script
-rw-rw-r-- 1 tkaiser tkaiser    7 Nov 26 11:11 hosts.205
-rw-rw-r-- 1 tkaiser tkaiser   36 Nov 26 11:13 slurm-205.out

[tkaiser@mio001 orsl_for_ao_and_cao]$ head 1_1_4096.205
Coprocessor access: concurrent
Manual synchronization: on
N: 4096
Offload 4096x4096 DGEMM:   475.49 GFlops
[MKL] [MIC --] [AO Function]	DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]	0.00 1.00
[MKL] [MIC 00] [AO DGEMM CPU Time]	7.864075 seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]	0.708530 seconds
[MKL] [MIC 00] [AO DGEMM CPU->MIC Data]	263979008 bytes
[MKL] [MIC 00] [AO DGEMM MIC->CPU Data]	129761280 bytes

[tkaiser@mio001 orsl_for_ao_and_cao]$ tail 1_1_4096.205
[Offload] [MIC 0] [MIC Time]        0.206782 (seconds)
[Offload] [MIC 0] [MIC->CPU Data]   138412032 (bytes)

[Offload] [MIC 0] [File]            t.c
[Offload] [MIC 0] [Line]            35
[Offload] [MIC 0] [CPU Time]        0.288746 (seconds)
[Offload] [MIC 0] [CPU->MIC Data]   415236128 (bytes)
[Offload] [MIC 0] [MIC Time]        0.206486 (seconds)
[Offload] [MIC 0] [MIC->CPU Data]   138412032 (bytes)

[tkaiser@mio001 orsl_for_ao_and_cao]$ head 0_0_4096.205
Coprocessor access: serial
Manual synchronization: off
N: 4096
[MKL] [MIC --] [AO Function]	DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]	0.00 1.00
[MKL] [MIC 00] [AO DGEMM CPU Time]	7.670934 seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]	0.716403 seconds
[MKL] [MIC 00] [AO DGEMM CPU->MIC Data]	263979008 bytes
[MKL] [MIC 00] [AO DGEMM MIC->CPU Data]	129761280 bytes
[MKL] [MIC --] [AO Function]	DGEMM

[tkaiser@mio001 orsl_for_ao_and_cao]$ tail 0_0_4096.205
[Offload] [MIC 0] [MIC Time]        0.206475 (seconds)
[Offload] [MIC 0] [MIC->CPU Data]   138412032 (bytes)

[Offload] [MIC 0] [File]            t.c
[Offload] [MIC 0] [Line]            35
[Offload] [MIC 0] [CPU Time]        0.288567 (seconds)
[Offload] [MIC 0] [CPU->MIC Data]   415236128 (bytes)
[Offload] [MIC 0] [MIC Time]        0.205986 (seconds)
[Offload] [MIC 0] [MIC->CPU Data]   138412032 (bytes)

[tkaiser@mio001 orsl_for_ao_and_cao]$ 
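
The [Offload] [MIC 0] lines in the output above come from the Intel compiler's offload reporting. As a sketch, and assuming the standard OFFLOAD_REPORT environment variable controls that output here, you can request a similar report for your own offload runs:

export OFFLOAD_REPORT=2   # 1, 2, or 3 for increasing detail
./t.out                   # t.out is the binary built by this example's makefile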

Directory ~/phi/directive

  1. Directives based offload

This shows how you can compile and offload your own functions to the cards using directives. There are a few things of note.

  1. Functions and data for the cards need the __attribute__((target(mic))) specification.
  2. We do not need the -mmic compile line option.
  3. If a card is not available then the function will be run on the CPU.

This example initializes an array on the CPU, modifies a portion of it in a function run on the card, and then prints part of the array on the CPU. It also reports the number of threads available for use on both the CPU and the card.

To run:

make
sbatch dosubscript

[tkaiser@mio001 directive]$ make
icc dooff.c -o dooff
[tkaiser@mio001 directive]$ ls
runfile  do_off  dooff.c  index.html  makefile

[tkaiser@mio001 directive]$ sbatch runfile
Submitted batch job 253

Typical Output:

[tkaiser@mio001 directive]$ cat slurm-253.out 
phi001
Hello world! I have 240 logical cores.
Hello world! I have 12 logical cores.
enter k:0
1
2
3
4
1234
1234
1234
1234
1234
[tkaiser@mio001 directive]$