The gds-tools package provides binaries for data verification, GDS configuration verification, and a GPU-based synthetic IO benchmarking tool. gds-tools are installed at /usr/local/cuda-x.y/gds/tools.

1. gdsio - synthetic IO benchmarking tool
gdsio is a synthetic IO benchmark that uses the cuFile APIs.

Here is a sample usage:

gdsio version: 1.2

Usage [using cmd line options]:
$ ./gdsio
    -f <file name>
    -D <directory name>
    -d <gpu index>
    -n <numa node>
    -w <threads per file>
    -s <file size>
    -o <start offset>
    -i <io size>
    -p
    -b
    -V
    -x <xfer_type>
    -I <(read) 0 | (write) 1 | (randread) 2 | (randwrite) 3>
    -T <duration in seconds>
    -k <random seed (number, e.g. 3456) to be used with random read/write>
    -U
    -R
    -F
    -B

Usage [using config file]: (refer to the rw-sample.gdsio file provided as a sample)
$ ./gdsio rw-sample.gdsio

xfer_type:
0 - Storage -> GPU (GDS)
1 - Storage -> CPU
2 - Storage -> CPU -> GPU
3 - Storage -> CPU -> GPU_ASYNC
4 - Storage -> PAGE_CACHE -> CPU -> GPU
5 - Storage -> GPU_ASYNC
6 - Storage -> GPU (GDS) in batch mode

Note: a read test (-I 0) with the verify option (-V) should be used with files written (-I 1) with the -V option.
      A random read test (-I 2) with the verify option (-V) should be used with files written (-I 3) with the -V option,
      using the same random seed (-k), the same number of threads, offset, and data size.
      A write test (-I 1/3) with the verify option (-V) performs writes followed by reads.
      In batch mode, IO sizes must be 4K aligned; otherwise an error is returned.

gdsio config file options:
==========================
A gdsio config file (refer to rw-sample.gdsio as an example) can be used to issue multiple parallel jobs. The config file has two sections: a global section and per-job sections.

Note: the gdsio config file has two per-job options that are not currently available on the command line:
1) per-job start offset ("start_offset") - specifies the start offset for a particular job. If not defined, the global start_offset is used.
2) per-job size ("size") - defines the size for a job in the config file. If not defined, the global size is used.

e.g.
[job1]
filename=/mnt/test/testfile
start_offset=1M
size=2M

This starts IO of size 2M at offset 1M for job1. A two-job sketch follows the command-line option list below.

gdsio command line options:
===========================
[ job options ]
-f <file>       - the file path to use (e.g. /mnt/gdsio.txt)
-D <directory>  - the directory to use (e.g. /mnt/gdsio_dir). This option requires files created in the directory
                  using -I 1 -w <threads>. The files follow the pattern gdsio.0, gdsio.1, ..., gdsio.<n>.
                  Note: -D and -f cannot be used at the same time.
-V              - verify the contents of the file based on a specific IO pattern. To verify the data, the file's IO
                  pattern must first be generated using the -V and -I 1 -w <threads> options.
-d <gpu index>  - device number of the GPU (0 - 15). Each file is matched one to one with the device that accompanies it.
-w <threads>    - number of threads per file
-n <numa node>  - NUMA node

[ global options ]
-s <size>       - size of the file (e.g. -s 1G, -s 10M, -s 3.5g). For reads, if -s is not specified, the file size is used by default.
-i <io size>    - IO size to use when reading or writing (choose somewhere from 1024K to 8192K)
-I <io type>    - 0 - seq read, 1 - seq write, 2 - randread, 3 - randwrite
-x <xfer_type>  - transfer type, to test the different ways of transferring data from storage.
                  Use -x 0 to test GPUDirect Storage; use -x 2 to test pread on the CPU path followed by cudaMemcpy to the GPU.
-o <offset>     - starting file offset in each thread to read from, e.g. for aligned file reads specify -o 4K or -o 1M.
-p              - enable p2p for all CUDA_VISIBLE_DEVICES used for dynamic routing; this may improve performance if IO has to traverse the QPI/UPI path.
-T <seconds>    - duration of the test in seconds
-U              - use unaligned (4K) random offsets
-k <seed>       - random seed for use with randread/randwrite (-I 2/3)
-R              - fill the buffer with random data
-F              - refill the buffer with random data at every write
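Below is the minimal two-job config sketch referenced above. Only the per-job keys documented here (filename, start_offset, size) are used; the file paths are illustrative, and the global section (transfer type, IO size, and so on) is deliberately omitted - copy it from the bundled rw-sample.gdsio, which documents the full set of supported keys.

# two-job sketch (per-job keys only; take the global section from rw-sample.gdsio)
[job1]
filename=/mnt/test/testfile1
start_offset=0
size=2M

[job2]
filename=/mnt/test/testfile2
start_offset=1M
size=2M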
gdsio examples:
===============
The following write (-I 1) benchmark does 4K (-i) IO to create a file of size 1 GiB (-s):

# 4KiB GDS WRITE test on GPU 0 with 2 worker threads on a single file for a 1GiB dataset
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 2 -s 1G -i 4K -x 0 -I 1
IoType: WRITE XferType: GPUD Threads: 2 DataSetSize: 1073442816/1073741824 IOSize: 4(KiB),Throughput: 0.167347 GiB/sec, Avg_Latency: 45.588810 usecs ops: 071 total_latency 5973939.000000

# 4KiB GDS READ test on GPU 0 with 2 worker threads on a single file for a 1GiB dataset
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 2 -s 1G -i 4K -x 0 -I 0
IoType: READ XferType: GPUD Threads: 2 DataSetSize: 1073475584/1073741824 IOSize: 4(KiB),Throughput: 0.079856 GiB/sec, Avg_Latency: 95.536943 usecs ops: 079 total_latency 12519361.000000

For performance testing, users can also launch multiple IOs on different files (under different mount points), as shown below. This example is from a 16-GPU DGX-2 system:

# GPUDirect Storage performance test for READS with 1MiB IO size on a 512G dataset using 8 workers
$ WORKERS=8; IO_TYPE=0; XFER_TYPE=0; IO_SIZE=1M; DATASET_SIZE=512G
$ ./gdsio -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
    -f /mnt/dir1/test -d 0  -n 0 -w $WORKERS \
    -f /mnt/dir2/test -d 3  -n 0 -w $WORKERS \
    -f /mnt/dir3/test -d 4  -n 0 -w $WORKERS \
    -f /mnt/dir4/test -d 7  -n 0 -w $WORKERS \
    -f /mnt/dir5/test -d 8  -n 1 -w $WORKERS \
    -f /mnt/dir6/test -d 11 -n 1 -w $WORKERS \
    -f /mnt/dir7/test -d 12 -n 1 -w $WORKERS \
    -f /mnt/dir8/test -d 15 -n 1 -w $WORKERS

# Compare with the traditional storage-to-GPU path for READS with 1MiB IO size on a 512G dataset using 8 workers
$ WORKERS=8; IO_TYPE=0; XFER_TYPE=2; IO_SIZE=1M; DATASET_SIZE=512G
$ ./gdsio -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
    -f /mnt/dir1/test -d 0  -n 0 -w $WORKERS \
    -f /mnt/dir2/test -d 3  -n 0 -w $WORKERS \
    -f /mnt/dir3/test -d 4  -n 0 -w $WORKERS \
    -f /mnt/dir4/test -d 7  -n 0 -w $WORKERS \
    -f /mnt/dir5/test -d 8  -n 1 -w $WORKERS \
    -f /mnt/dir6/test -d 11 -n 1 -w $WORKERS \
    -f /mnt/dir7/test -d 12 -n 1 -w $WORKERS \
    -f /mnt/dir8/test -d 15 -n 1 -w $WORKERS
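The verify rules noted earlier (random reads with -I 2 -V must target files written with -I 3 -V, using the same seed, thread count, offset, and data size) can be exercised on a single file. A minimal sketch, assuming /mnt/test is on a GDS-enabled mount; the path, sizes, and seed are illustrative:

# Random-write with verification enabled, then random-read back and verify;
# -w, -s, -i and the random seed (-k) are kept identical across the two runs
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 4 -s 1G -i 1M -x 0 -I 3 -V -k 3456
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 4 -s 1G -i 1M -x 0 -I 2 -V -k 3456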
Users can also use the directory option (-D) with gdsio. This is a file-per-thread mode; files must first be created with an IO-type write (-I 1) pass before they can be read.
Note: the directory (-D) option must not be used simultaneously with file mode (-f).

$ WORKERS=8; IO_TYPE=1; XFER_TYPE=0; IO_SIZE=1M; DATASET_SIZE=512G
$ ./gdsio -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
    -D /mnt/dir1/ -d 0  -n 0 -w $WORKERS \
    -D /mnt/dir2/ -d 5  -n 0 -w $WORKERS \
    -D /mnt/dir3/ -d 9  -n 0 -w $WORKERS \
    -D /mnt/dir4/ -d 13 -n 0 -w $WORKERS

# Verification of data
$ WORKERS=8; IO_TYPE=1; XFER_TYPE=1; IO_SIZE=1M; DATASET_SIZE=512G
$ ./gdsio -V -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
    -D /mnt/dir1/ -d 0  -n 0 -w $WORKERS \
    -D /mnt/dir2/ -d 5  -n 0 -w $WORKERS \
    -D /mnt/dir3/ -d 9  -n 0 -w $WORKERS \
    -D /mnt/dir4/ -d 13 -n 0 -w $WORKERS

# Use a variable block size and choose the IO pattern
Sequential read:
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 32K:1024K:1K -x 0 -I 0
Sequential write:
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 32K:1024K:1K -x 0 -I 1
Random read:
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 32K:1024K:1K -x 0 -I 2
Random write:
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 32K:1024K:1K -x 0 -I 3

# gdsio examples for batch mode
Sequential read in batch mode with a batch size of 4 on a single file:
$ ./gdsio -x 6 -f /mnt/foo -d 0 -w 4 -s 128K -i 4k -I 0
Sequential write in batch mode with a batch size of 4 on a single file:
$ ./gdsio -x 6 -f /mnt/foo -d 0 -w 4 -s 128K -i 4k -I 1
Sequential write in batch mode with a batch size of 4 on a single file, with verification:
$ ./gdsio -x 6 -f /mnt/foo -d 0 -w 4 -s 128K -i 4k -I 1 -V

# For user-space RDMA tests
Run the server:
$ ./rdma_dci_server.sh (update the IP addresses to those configured on the system)
Run the client:
read : ./gdsio -P sockfs://IPV4:PORT -d 0 -n 0 -w 4 -P sockfs://IPV4:PORT -d 1 -n 0 -w 4 -s 1G -i 1M -x 0 -I 1
write: ./gdsio -P sockfs://IPV4:PORT -d 0 -n 0 -w 4 -P sockfs://IPV4:PORT -d 1 -n 0 -w 4 -s 1G -i 1M -x 0 -I 0

# Use the refill buffer option. This fills the IO buffer with random data at every write.
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 1024K -x 0 -I 1 -F -k 3456

2. gdsio_verify
This is a data verification tool that checks for data integrity using the cuFile APIs.

$ ./gdsio_verify -h
  --gpu(d)
  --file(f)
  --gpu_offset(t)
  --gpu_devptr_offset(b)
  --gpubufalignment(g)
  --fileoffset(o)
  --iosize(s)
  --chunksize(c)
  --nr(n)
  --sync(m)
  --skipregister(S)
  --verbose(V)
  --fsync(p)
  --batch(B)
  --version(v)

NOTE: for batch mode (-B), -b, -g, -t, -o and -c must be 4K aligned, and -S is not supported.
iosize (-s) is the IO size of each batch entry, e.g. with 4 batches and a 256MB iosize, the total amount of I/O would be 1GB.

Examples (make sure the test file is not empty):

# verify reading 1G of data using GPUDirect Storage
$ ./gdsio_verify -d 0 -f /mnt/test -o 0 -s 1G -n 1 -m 1
gpu index :0,file :/mnt/test, RING buffer size :0, gpu buffer alignment :0, gpu buffer offset :0, file offset :0, io_requested :1073741824, sync :1, nr ios :1, address = 7fa27e000000
This test reads 1G from /mnt/test to GPU 0 using cuFileRead, writes it back to /mnt/ using cuFileWrite, and verifies that the source and target data match.

# verify reading 256KB of data using GPUDirect Storage batch mode with a batch size of 4
$ ./gdsio_verify -B 4 -f /mnt/foo -s 64K -d 0 -c 4K -o 0
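gdsio_verify requires an existing, non-empty test file. A minimal sketch of preparing one with a gdsio write pass and then running the first verify example above against it; the path and sizes are illustrative:

# Create a 1G test file with a GDS write pass, then verify it with gdsio_verify
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 1 -s 1G -i 1M -x 0 -I 1
$ ./gdsio_verify -d 0 -f /mnt/test -o 0 -s 1G -n 1 -m 1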
3. gdscheck
This tool performs basic platform, driver, and filesystem-specific checks to test for GPUDirect Storage support.

$ ./gdscheck.py -h
usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]

GPUDirectStorage platform checker

optional arguments:
  -h, --help  show this help message and exit
  -p          gds platform check
  -f FILE     gds file check
  -v          gds version checks
  -V          gds fs checks

example (for version information):
$ ./gdscheck.py -v
GDS release version (beta): 0.9.0.14
nvidia_fs version: 2.3
libcufile version: 2.3

(for only the platform check)
$ ./gdscheck.py -p
GDS release version (beta): 0.95.0.49
nvidia_fs version: 2.6
libcufile version: 2.3
cuFile CONFIGURATION:
NVMe           : Supported
NVMeOF         : Supported
SCSI           : Unsupported
SCALEFLUX CSD  : Supported
NVMesh         : Supported
LUSTRE         : Supported
GPFS           : Unsupported
NFS            : Supported
WEKAFS         : Supported
USERSPACE RDMA : Supported
 --MOFED peer direct  : enabled
 --rdma library       : Loaded (libcufile_rdma.so)
 --rdma devices       : Configured
 --rdma_device_status : Up: 1 Down: 0
properties.use_compat_mode : 1
properties.use_poll_mode : 0
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 1
properties.rdma_dynamic_routing_order : GPU_MEM_NVLINKS GPU_MEM SYS_MEM P2P
fs.generic.posix_unaligned_writes : 0
fs.lustre.posix_gds_min_kb: 0
fs.weka.rdma_write_support: 0
profile.nvtx : 0
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : 0
GPU INFO:
GPU index 0 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 1 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 2 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 3 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 4 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 5 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 6 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 7 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 8 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 9 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 10 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 11 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 12 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 13 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 14 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 15 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
IOMMU : disabled
Platform verification succeeded

(for only the file check)
$ ./gdscheck.py -f /mnt/test
GDS register success
generating 4k read latency matrix :
GPU 34:00:00 : 250.56(us) read_verification: pass
GPU 36:00:00 : 250.00(us) read_verification: pass
GPU 39:00:00 : 250.05(us) read_verification: pass
GPU 3b:00:00 : 243.88(us) read_verification: pass

(for checking client filesystem version support)
$ /usr/local/gds/tools/gdscheck.py -v -V
GDS release version (beta): 0.95.0
nvidia_fs version: 2.6
libcufile version: 2.3
FILESYSTEM VERSION CHECK:
LUSTRE: current version: 2.6.99 (Unsupported)
        min version supported: 2.12.3_ddn28
WEKAFS: GDS RDMA read: supported
        GDS RDMA write: supported
        current version: 3.8.0.9-dg
        min version supported: 3.8.0
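Since the gdscheck.py flags are independent options, they can typically be combined into one invocation to run every check in a single pass. A sketch; the file path is illustrative, and combining the flags this way is an assumption based on the option list above rather than a documented invocation:

# platform, version, filesystem, and per-file checks in one run
$ ./gdscheck.py -p -v -V -f /mnt/test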
4. gdscp
This tool copies a file from one location to another using the cuFile APIs; it mimics "cp" behaviour. Make sure the test file is not empty.

$ ./gdscp /mnt/test /mnt/test_copy 0 -v
gpu md5:90672a90fba312a386b25b8861e8bd9
cpu md5:90672a90fba312a386b25b8861e8bd9
md5sum Match!!

In the above example, data is copied from /mnt/test to /mnt/test_copy; the data is routed through GPU memory using the cuFile APIs.

5. gds_stats
This tool reads the user-space statistics exported by libcufile, per process.

$ ./gds_stats -p <pid> -l <level>

-l is the statistics level and can be 1, 2, or 3. Ensure that cuFile statistics are enabled by setting the JSON configuration key profile.cufile_stats to a valid level before trying to read the statistics. A usage sketch appears at the end of this section.

6. gdsio_static
Functionally and in usage it is the same as gdsio, but it is linked against the static cuFile libraries. Refer to the gdsio examples above.

7. gds_log_collection.py
This tool collects logs from the system that are relevant for debugging. It collects logs such as OS and kernel info, nvidia-fs stats, dmesg logs, syslogs, System.map files, and per-process logs like cufile.json, cufile.log, gds_stats, process stack, etc.

Usage:
./gds_log_collection.py [options]
options:
-h - help
-f file_path1,file_path2,.. (note: there should be no spaces around the ',')

e.g.
sudo ./gds_log_collection.py                          - collects all the relevant logs
sudo ./gds_log_collection.py -f file_path1,file_path2 - collects all the relevant logs as well as the user-specified files. These could be crash files or any other relevant files.
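Here is the gds_stats usage sketch referenced above: statistics can be sampled for a gdsio job while it is still running. This assumes profile.cufile_stats is set to 3 in the cufile.json in use (typically /etc/cufile.json); the file path and sizes are illustrative:

# Start a long-running GDS read in the background, then sample its level-3 cuFile stats by PID
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 4 -s 10G -i 1M -x 0 -I 0 &
$ ./gds_stats -p $! -l 3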