The gds-tools package provides binaries for data verification, GDS configuration verification, and a GPU-based synthetic IO benchmarking tool. gds-tools are installed at /usr/local/cuda-x.y/gds/tools.

1. gdsio - synthetic IO benchmarking tool
gdsio is a synthetic IO benchmark that uses the cuFile APIs.

Here is a sample usage:

gdsio version: 1.2

Usage [using cmd line options]:
$ ./gdsio
    -f <file name>
    -D <directory name>
    -d <gpu index>
    -n <numa node>
    -w <threads per file>
    -s <file size>
    -o <start offset>
    -i <io size>
    -p
    -b
    -V
    -x <xfer_type>
    -I <(read) 0 | (write) 1 | (randread) 2 | (randwrite) 3>
    -T <duration in seconds>
    -k <random seed (number, e.g. 3456) to be used with random read/write>
    -U
    -R
    -F
    -B

Usage [using config file]: (refer to the rw-sample.gdsio file provided as a sample)
$ ./gdsio rw-sample.gdsio

xfer_type:
0 - Storage -> GPU (GDS)
1 - Storage -> CPU
2 - Storage -> CPU -> GPU
3 - Storage -> CPU -> GPU_ASYNC
4 - Storage -> PAGE_CACHE -> CPU -> GPU
5 - Storage -> GPU_ASYNC
6 - Storage -> GPU (GDS) in batch mode

Note: a read test (-I 0) with the verify option (-V) should be used with files written (-I 1) with the -V option.
      A random read test (-I 2) with the verify option (-V) should be used with files written (-I 3) with the -V option,
      using the same random seed (-k), the same number of threads, offset, and data size.
      A write test (-I 1/3) with the verify option (-V) performs writes followed by reads.
      In batch mode, IO sizes must be 4K aligned; otherwise an error is returned.

gdsio config file options:
==========================
A gdsio config file (refer to rw-sample.gdsio as an example) can be used to issue multiple parallel jobs. The config file has two sections: a global section and per-job sections.

Note: the gdsio config file has two per-job options that are not currently available on the command line:
1) per-job start offset ("start_offset") - specifies the start offset for a particular job. If not defined, the global start_offset is used.
2) per-job size ("size") - defines the size for a job in the config file. If not defined, the global size is used.

e.g.
[job1]
filename=/mnt/test/testfile
start_offset=1M
size=2M

This starts IO of size 2M at offset 1M for job1. A two-job sketch follows the command-line option list below.

gdsio command line options:
===========================
[ job options ]
-f <file>       - the file path to use (e.g. /mnt/gdsio.txt)
-D <directory>  - the directory to use (e.g. /mnt/gdsio_dir). This option requires files created in the directory
                  using -I 1 -w <threads>. The files follow the pattern gdsio.0, gdsio.1, ..., gdsio.<n>.
                  Note: -D and -f cannot be used at the same time.
-V              - verify the contents of the file based on a specific IO pattern. To verify the data, the file's IO
                  pattern must first be generated using the -V and -I 1 -w <threads> options.
-d <gpu index>  - device number of the GPU (0 - 15). Each file is matched one to one with the device that accompanies it.
-w <threads>    - number of threads per file
-n <numa node>  - NUMA node

[ global options ]
-s <size>       - size of the file (e.g. -s 1G, -s 10M, -s 3.5g). For reads, if -s is not specified, the file size is used by default.
-i <io size>    - IO size to use when reading or writing (choose somewhere from 1024K to 8192K)
-I <io type>    - 0 - seq read, 1 - seq write, 2 - randread, 3 - randwrite
-x <xfer_type>  - transfer type, to test the different ways of transferring data from storage.
                  Use -x 0 to test GPUDirect Storage; use -x 2 to test pread on the CPU path followed by cudaMemcpy to the GPU.
-o <offset>     - starting file offset in each thread to read from, e.g. for aligned file reads specify -o 4K or -o 1M.
-p              - enable p2p for all CUDA_VISIBLE_DEVICES used for dynamic routing; this may improve performance if IO has to traverse the QPI/UPI path.
-T <seconds>    - duration of the test in seconds
-U              - use unaligned (4K) random offsets
-k <seed>       - random seed for use with randread/randwrite (-I 2/3)
-R              - fill the buffer with random data
-F              - refill the buffer with random data at every write
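Below is the minimal two-job config sketch referenced above. Only the per-job keys documented here (filename, start_offset, size) are used; the file paths are illustrative, and the global section (transfer type, IO size, and so on) is deliberately omitted - copy it from the bundled rw-sample.gdsio, which documents the full set of supported keys.

# two-job sketch (per-job keys only; take the global section from rw-sample.gdsio)
[job1]
filename=/mnt/test/testfile1
start_offset=0
size=2M

[job2]
filename=/mnt/test/testfile2
start_offset=1M
size=2M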
gdsio examples:
===============
The following write (-I 1) benchmark does 4K (-i) IO to create a file of size 1 GiB (-s):

# 4KiB GDS WRITE test on GPU 0 with 2 worker threads on a single file for a 1GiB dataset
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 2 -s 1G -i 4K -x 0 -I 1
IoType: WRITE XferType: GPUD Threads: 2 DataSetSize: 1073442816/1073741824 IOSize: 4(KiB),Throughput: 0.167347 GiB/sec, Avg_Latency: 45.588810 usecs ops: 071 total_latency 5973939.000000

# 4KiB GDS READ test on GPU 0 with 2 worker threads on a single file for a 1GiB dataset
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 2 -s 1G -i 4K -x 0 -I 0
IoType: READ XferType: GPUD Threads: 2 DataSetSize: 1073475584/1073741824 IOSize: 4(KiB),Throughput: 0.079856 GiB/sec, Avg_Latency: 95.536943 usecs ops: 079 total_latency 12519361.000000

For performance testing, users can also launch multiple IOs on different files (under different mount points), as shown below. This example is from a 16-GPU DGX-2 system:

# GPUDirect Storage performance test for READS with 1MiB IO size on a 512G dataset using 8 workers
$ WORKERS=8; IO_TYPE=0; XFER_TYPE=0; IO_SIZE=1M; DATASET_SIZE=512G
$ ./gdsio -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
    -f /mnt/dir1/test -d 0  -n 0 -w $WORKERS \
    -f /mnt/dir2/test -d 3  -n 0 -w $WORKERS \
    -f /mnt/dir3/test -d 4  -n 0 -w $WORKERS \
    -f /mnt/dir4/test -d 7  -n 0 -w $WORKERS \
    -f /mnt/dir5/test -d 8  -n 1 -w $WORKERS \
    -f /mnt/dir6/test -d 11 -n 1 -w $WORKERS \
    -f /mnt/dir7/test -d 12 -n 1 -w $WORKERS \
    -f /mnt/dir8/test -d 15 -n 1 -w $WORKERS

# Compare with the traditional storage-to-GPU path for READS with 1MiB IO size on a 512G dataset using 8 workers
$ WORKERS=8; IO_TYPE=0; XFER_TYPE=2; IO_SIZE=1M; DATASET_SIZE=512G
$ ./gdsio -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
    -f /mnt/dir1/test -d 0  -n 0 -w $WORKERS \
    -f /mnt/dir2/test -d 3  -n 0 -w $WORKERS \
    -f /mnt/dir3/test -d 4  -n 0 -w $WORKERS \
    -f /mnt/dir4/test -d 7  -n 0 -w $WORKERS \
    -f /mnt/dir5/test -d 8  -n 1 -w $WORKERS \
    -f /mnt/dir6/test -d 11 -n 1 -w $WORKERS \
    -f /mnt/dir7/test -d 12 -n 1 -w $WORKERS \
    -f /mnt/dir8/test -d 15 -n 1 -w $WORKERS
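The verify rules noted earlier (random reads with -I 2 -V must target files written with -I 3 -V, using the same seed, thread count, offset, and data size) can be exercised on a single file. A minimal sketch, assuming /mnt/test is on a GDS-enabled mount; the path, sizes, and seed are illustrative:

# Random-write with verification enabled, then random-read back and verify;
# -w, -s, -i and the random seed (-k) are kept identical across the two runs
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 4 -s 1G -i 1M -x 0 -I 3 -V -k 3456
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 4 -s 1G -i 1M -x 0 -I 2 -V -k 3456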
Users can also use the directory option (-D) with gdsio. This is a file-per-thread mode; files must first be created with an IO-type write (-I 1) pass before they can be read.
Note: the directory (-D) option must not be used simultaneously with file mode (-f).

$ WORKERS=8; IO_TYPE=1; XFER_TYPE=0; IO_SIZE=1M; DATASET_SIZE=512G
$ ./gdsio -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
    -D /mnt/dir1/ -d 0  -n 0 -w $WORKERS \
    -D /mnt/dir2/ -d 5  -n 0 -w $WORKERS \
    -D /mnt/dir3/ -d 9  -n 0 -w $WORKERS \
    -D /mnt/dir4/ -d 13 -n 0 -w $WORKERS

# Verification of data
$ WORKERS=8; IO_TYPE=1; XFER_TYPE=1; IO_SIZE=1M; DATASET_SIZE=512G
$ ./gdsio -V -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
    -D /mnt/dir1/ -d 0  -n 0 -w $WORKERS \
    -D /mnt/dir2/ -d 5  -n 0 -w $WORKERS \
    -D /mnt/dir3/ -d 9  -n 0 -w $WORKERS \
    -D /mnt/dir4/ -d 13 -n 0 -w $WORKERS

# Use a variable block size and choose the IO pattern
Sequential read:
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 32K:1024K:1K -x 0 -I 0
Sequential write:
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 32K:1024K:1K -x 0 -I 1
Random read:
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 32K:1024K:1K -x 0 -I 2
Random write:
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 32K:1024K:1K -x 0 -I 3

# gdsio examples for batch mode
Sequential read in batch mode with a batch size of 4 on a single file:
$ ./gdsio -x 6 -f /mnt/foo -d 0 -w 4 -s 128K -i 4k -I 0
Sequential write in batch mode with a batch size of 4 on a single file:
$ ./gdsio -x 6 -f /mnt/foo -d 0 -w 4 -s 128K -i 4k -I 1
Sequential write in batch mode with a batch size of 4 on a single file, with verification:
$ ./gdsio -x 6 -f /mnt/foo -d 0 -w 4 -s 128K -i 4k -I 1 -V

# For user-space RDMA tests
Run the server:
$ ./rdma_dci_server.sh (update the IP addresses to those configured on the system)
Run the client:
read : ./gdsio -P sockfs://IPV4:PORT -d 0 -n 0 -w 4 -P sockfs://IPV4:PORT -d 1 -n 0 -w 4 -s 1G -i 1M -x 0 -I 1
write: ./gdsio -P sockfs://IPV4:PORT -d 0 -n 0 -w 4 -P sockfs://IPV4:PORT -d 1 -n 0 -w 4 -s 1G -i 1M -x 0 -I 0

# Use the refill buffer option. This fills the IO buffer with random data at every write.
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 1024K -x 0 -I 1 -F -k 3456

2. gdsio_verify
This is a data verification tool that checks for data integrity using the cuFile APIs.

$ ./gdsio_verify -h
  --gpu(d)
  --file(f)
  --gpu_offset(t)
  --gpu_devptr_offset(b)
  --gpubufalignment(g)
  --fileoffset(o)
  --iosize(s)
  --chunksize(c)
  --nr(n)
  --sync(m)
  --skipregister(S)
  --verbose(V)
  --fsync(p)
  --batch(B)
  --version(v)

NOTE: for batch mode (-B), -b, -g, -t, -o and -c must be 4K aligned, and -S is not supported.
iosize (-s) is the IO size of each batch entry, e.g. with 4 batches and a 256MB iosize, the total amount of I/O would be 1GB.

Examples (make sure the test file is not empty):

# verify reading 1G of data using GPUDirect Storage
$ ./gdsio_verify -d 0 -f /mnt/test -o 0 -s 1G -n 1 -m 1
gpu index :0,file :/mnt/test, RING buffer size :0, gpu buffer alignment :0, gpu buffer offset :0, file offset :0, io_requested :1073741824, sync :1, nr ios :1, address = 7fa27e000000
This test reads 1G from /mnt/test to GPU 0 using cuFileRead, writes it back to /mnt/ using cuFileWrite, and verifies that the source and target data match.

# verify reading 256KB of data using GPUDirect Storage batch mode with a batch size of 4
$ ./gdsio_verify -B 4 -f /mnt/foo -s 64K -d 0 -c 4K -o 0
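gdsio_verify requires an existing, non-empty test file. A minimal sketch of preparing one with a gdsio write pass and then running the first verify example above against it; the path and sizes are illustrative:

# Create a 1G test file with a GDS write pass, then verify it with gdsio_verify
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 1 -s 1G -i 1M -x 0 -I 1
$ ./gdsio_verify -d 0 -f /mnt/test -o 0 -s 1G -n 1 -m 1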
3. gdscheck
This tool performs basic platform, driver, and filesystem-specific checks to test for GPUDirect Storage support.

$ ./gdscheck.py -h
usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]

GPUDirectStorage platform checker

optional arguments:
  -h, --help  show this help message and exit
  -p          gds platform check
  -f FILE     gds file check
  -v          gds version checks
  -V          gds fs checks

example (for version information):
$ ./gdscheck.py -v
GDS release version (beta): 0.9.0.14
nvidia_fs version: 2.3
libcufile version: 2.3

(for only the platform check)
$ ./gdscheck.py -p
GDS release version (beta): 0.95.0.49
nvidia_fs version: 2.6
libcufile version: 2.3
cuFile CONFIGURATION:
NVMe           : Supported
NVMeOF         : Supported
SCSI           : Unsupported
SCALEFLUX CSD  : Supported
NVMesh         : Supported
LUSTRE         : Supported
GPFS           : Unsupported
NFS            : Supported
WEKAFS         : Supported
USERSPACE RDMA : Supported
 --MOFED peer direct  : enabled
 --rdma library       : Loaded (libcufile_rdma.so)
 --rdma devices       : Configured
 --rdma_device_status : Up: 1 Down: 0
properties.use_compat_mode : 1
properties.use_poll_mode : 0
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 1
properties.rdma_dynamic_routing_order : GPU_MEM_NVLINKS GPU_MEM SYS_MEM P2P
fs.generic.posix_unaligned_writes : 0
fs.lustre.posix_gds_min_kb: 0
fs.weka.rdma_write_support: 0
profile.nvtx : 0
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : 0
GPU INFO:
GPU index 0 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 1 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 2 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 3 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 4 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 5 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 6 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 7 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 8 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 9 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 10 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 11 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 12 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 13 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 14 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 15 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
IOMMU : disabled
Platform verification succeeded

(for only the file check)
$ ./gdscheck.py -f /mnt/test
GDS register success
generating 4k read latency matrix :
GPU 34:00:00 : 250.56(us) read_verification: pass
GPU 36:00:00 : 250.00(us) read_verification: pass
GPU 39:00:00 : 250.05(us) read_verification: pass
GPU 3b:00:00 : 243.88(us) read_verification: pass

(for checking client filesystem version support)
$ /usr/local/gds/tools/gdscheck.py -v -V
GDS release version (beta): 0.95.0
nvidia_fs version: 2.6
libcufile version: 2.3
FILESYSTEM VERSION CHECK:
LUSTRE: current version: 2.6.99 (Unsupported)
        min version supported: 2.12.3_ddn28
WEKAFS: GDS RDMA read: supported
        GDS RDMA write: supported
        current version: 3.8.0.9-dg
        min version supported: 3.8.0
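Since the gdscheck.py flags are independent options, they can typically be combined into one invocation to run every check in a single pass. A sketch; the file path is illustrative, and combining the flags this way is an assumption based on the option list above rather than a documented invocation:

# platform, version, filesystem, and per-file checks in one run
$ ./gdscheck.py -p -v -V -f /mnt/test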
4. gdscp
This tool copies a file from one location to another using the cuFile APIs; it mimics "cp" behaviour. Make sure the test file is not empty.

$ ./gdscp /mnt/test /mnt/test_copy 0 -v
gpu md5:90672a90fba312a386b25b8861e8bd9
cpu md5:90672a90fba312a386b25b8861e8bd9
md5sum Match!!

In the above example, data is copied from /mnt/test to /mnt/test_copy; the data is routed through GPU memory using the cuFile APIs.

5. gds_stats
This tool reads the user-space statistics exported by libcufile, per process.

$ ./gds_stats -p <pid> -l <level>

-l is the statistics level and can be 1, 2, or 3. Ensure that cuFile statistics are enabled by setting the JSON configuration key profile.cufile_stats to a valid level before trying to read the statistics. A usage sketch appears at the end of this section.

6. gdsio_static
Functionally and in usage it is the same as gdsio, but it is linked against the static cuFile libraries. Refer to the gdsio examples above.

7. gds_log_collection.py
This tool collects logs from the system that are relevant for debugging. It collects logs such as OS and kernel info, nvidia-fs stats, dmesg logs, syslogs, System.map files, and per-process logs like cufile.json, cufile.log, gds_stats, process stack, etc.

Usage:
./gds_log_collection.py [options]
options:
-h - help
-f file_path1,file_path2,.. (note: there should be no spaces around the ',')

e.g.
sudo ./gds_log_collection.py                          - collects all the relevant logs
sudo ./gds_log_collection.py -f file_path1,file_path2 - collects all the relevant logs as well as the user-specified files. These could be crash files or any other relevant files.
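Here is the gds_stats usage sketch referenced above: statistics can be sampled for a gdsio job while it is still running. This assumes profile.cufile_stats is set to 3 in the cufile.json in use (typically /etc/cufile.json); the file path and sizes are illustrative:

# Start a long-running GDS read in the background, then sample its level-3 cuFile stats by PID
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 4 -s 10G -i 1M -x 0 -I 0 &
$ ./gds_stats -p $! -l 3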