Mio GPU Node Usage

We have put together a simple procedure to download, build, and run a number of GPU examples. What you need to do is download the files, run make, and then run the programs on the GPU-enabled node on Mio.

You will download a tar file that contains source code, a makefile, an environment settings file (pgi), and a batch script (cudarun).

You can use the files "cudarun" and "makefile" as templates for your own programs.

What you need to type (copy/paste) is shown on its own line after each step.

(1) On "mio.mines.edu", create a new empty directory.

mkdir guide

(2) Go to the directory.

cd guide

(3) Download the files.

wget -o wget.out http://inside.mines.edu/mio/source/cuda.tgz

(3b optional) You might want to do an "ls" to make sure the file "cuda.tgz" actually downloaded.
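
Because the -o option sends wget's messages to the file wget.out instead of the screen, a failed download is easy to miss. The following is a minimal sketch of a stricter check than "ls". It is not part of the downloaded files, and the function name check_tarball is made up for illustration.

```shell
# check_tarball FILE: succeed only if FILE exists, is non-empty, and
# can be listed by tar as a gzip-compressed archive.
check_tarball() {
    if [ ! -s "$1" ]; then
        echo "missing or empty: $1"
        return 1
    fi
    if ! tar -tzf "$1" > /dev/null 2>&1; then
        echo "not a valid gzipped tar file: $1"
        return 1
    fi
    echo "ok: $1"
}
```

In the guide directory you could run "check_tarball cuda.tgz || tail wget.out" to see the end of the wget log when the check fails.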

(4) Untar the files

tar -xzf cuda.tgz

(4b optional) You might want to do an "ls" to make sure you have the files cudarun, makefile, more.tgz, NVIDIA.tgz, pgi, and rcs09.tgz
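
If you would rather not compare the "ls" output by eye, the check can be scripted. This is only a sketch; check_files is a hypothetical helper name, and the file list is the one given in step 4b.

```shell
# check_files NAME...: print each named file that is missing from the
# current directory; return non-zero if any are missing.
check_files() {
    missing=0
    for name in "$@"; do
        if [ ! -e "$name" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

For this guide the call would be "check_files cudarun makefile more.tgz NVIDIA.tgz pgi rcs09.tgz && echo all files present".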

(5) Setting up your environment

The file pgi contains the $PATH and $MANPATH settings that reference the latest version of the Portland Group compilers. These compilers are required to build one of the examples.

To build the examples, you need to source the file pgi.

source pgi
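
After sourcing the file you may want to confirm that the compilers are actually found on your $PATH. The sketch below assumes nothing about the contents of pgi; check_tools is a hypothetical helper, and the tool names you pass to it (for example pgcc or pgf90, the usual Portland Group command names) are assumptions about what the makefile uses.

```shell
# check_tools NAME...: show where each command is found on PATH;
# return non-zero if any command is not found.
check_tools() {
    rc=0
    for tool in "$@"; do
        if where=$(command -v "$tool"); then
            echo "$tool -> $where"
        else
            echo "$tool not found on PATH"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, "check_tools pgcc pgf90" should print a path for each compiler if the settings took effect.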

(5b optional) Setting up your environment permanently

If you plan on using this new version of the Portland Group compilers in the future, you should append the contents of this file to your .bashrc file. One way to do this is the following, which first saves a backup copy of your .bashrc:

cp ~/.bashrc ~/.bashrc-old-pg
cat pgi >> ~/.bashrc

Note the double >>, which appends to .bashrc. A single > would overwrite the file.
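
Running the cat command twice would add the settings twice. The following sketch shows one way to make the backup-and-append step safe to repeat, by writing a marker comment before the appended lines. The function name append_once and the marker text are made up for illustration.

```shell
# append_once SRC DEST: back up DEST to DEST-old-pg, then append SRC
# to DEST, unless a marker line shows SRC was already appended.
append_once() {
    src="$1"
    dest="$2"
    marker="# added from $src"
    if [ -f "$dest" ] && grep -qxF "$marker" "$dest"; then
        echo "already present in $dest"
        return 0
    fi
    if [ -f "$dest" ]; then
        cp "$dest" "$dest-old-pg"
    fi
    { echo "$marker"; cat "$src"; } >> "$dest"
    echo "appended $src to $dest"
}
```

Used as "append_once pgi ~/.bashrc", this leaves the same backup file name as the commands above.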

(6) Make the executables

make

Submitting the job to the GPU node

(7) Submit your job using qsub.

Note the line:

#PBS -l nodes=1:ppn=8:cuda

is required in the script cudarun for the job to run on the GPU-enabled node.

qsub cudarun
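
If you use cudarun as a template for your own jobs, a stripped-down batch script might look like the sketch below. This is not the contents of cudarun. Only the nodes=1:ppn=8:cuda line comes from this guide; the walltime, the job name, and the program name my_program are placeholders.

```shell
#!/bin/bash
# Hypothetical minimal batch script; the real cudarun may differ.
#PBS -l nodes=1:ppn=8:cuda
#PBS -l walltime=00:10:00
#PBS -N gpu_example

# PBS sets PBS_O_WORKDIR to the directory qsub was run from.
# Fall back to the current directory when run outside the scheduler.
cd "${PBS_O_WORKDIR:-$PWD}"

echo "running on: $(uname -n)"

# my_program is a placeholder for an executable built by make.
if [ -x ./my_program ]; then
    ./my_program > my_program.out
else
    echo "my_program not found; build it with make first"
fi
```

To the shell the #PBS lines are ordinary comments, so the script can also be run directly for a quick syntax check before it is submitted with qsub.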

After a few moments your job will run and write its output to a file named "cuda_out.*". You will have something like the following:

[tkaiser@ra ~/guide]$ls -lt
total 732
-rw------- 1 tkaiser tkaiser   14620 Mar 18 15:47 cuda_out.8443.mio.mines.edu
-rw------- 1 tkaiser tkaiser      37 Mar 18 15:47 cuda_err.8443.mio.mines.edu
-rw-rw-r-- 1 tkaiser tkaiser   11880 Mar 18 15:47 out.dat
-rw-rw-r-- 1 tkaiser tkaiser    6085 Mar 18 15:47 deviceQuery.txt
-rw-rw-r-- 1 tkaiser tkaiser     164 Mar 18 15:47 SdkMasterLog.csv
...
...
[tkaiser@ra ~/guide]$ 

with the file cuda_out.* containing something like:

[tkaiser@ra ~/guide]$cat cuda_out.*
running on: cuda0

 gives a report of the devices that were found
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

There are 4 devices supporting CUDA

Device 0: "Tesla T10 Processor"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.20
  CUDA Capability Major/Minor version number:    1.3
  Total amount of global memory:                 4294770688 bytes
  Multiprocessors x Cores/MP = Cores:            30 (MP) x 8 (Cores/MP) = 240 (Cores)
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
...
...
PASSED

Press Enter to Quit...
-----------------------------------------------------------

 hello world in cuda.  prints a list of threads
     0        0   0        0   0   0
     1        0   0        1   0   0
     2        0   0        0   1   0
...
...
 tests to see if we can run mpi
processor 0  sent 5678
processor 1  got 5678

 test to see if a cula linear algebra routine can be called
  matrix generated   0.1620000004768372  of size  799
 beam=  94.2636414  94.7636414  95.2636337
 799  did solve   1.3629999998956919     using :cgesv     
 matrix generation time   0.1620000004768372
 matrix solve time    1.3629999998956919
 total computaion time    1.6750000007450581

 test to see if Portland Group compiler works
                     name=Tesla T10 Processor
           totalGlobalMem=               4294770688
        sharedMemPerBlock=                    16384
             regsPerBlock=        16384
                 warpSize=           32
                 memPitch=               2147483647
...
...
    17.00000    
    18.00000    
    19.00000    
    20.00000    
[tkaiser@mio cuda]$ 
[tkaiser@ra ~/guide]$ 
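
The output file is long, so a quick summary can be handy. The sketch below finds the newest cuda_out.* file in the current directory, counts the lines containing PASSED, and lists any non-empty cuda_err.* files. The function name job_summary is made up for illustration. A non-empty error file does not by itself mean the job failed; the sample listing above shows a 37-byte cuda_err file next to a working run.

```shell
# job_summary: summarize the newest cuda_out.* file in the current
# directory; return non-zero if no output file exists yet.
job_summary() {
    out=$(ls -t cuda_out.* 2>/dev/null | head -n 1) || true
    if [ -z "$out" ]; then
        echo "no cuda_out.* file yet; the job may still be queued or running"
        return 1
    fi
    echo "output file: $out"
    echo "PASSED lines: $(grep -c 'PASSED' "$out")"
    for err in cuda_err.*; do
        if [ -s "$err" ]; then
            echo "non-empty error file: $err"
        fi
    done
    return 0
}
```

Run "job_summary" in the guide directory after the job finishes.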




For More Information See: