Getting Started

Installation

Note: To run on Intel GPUs (experimental), unzip the sycl-release directory and follow the instructions in the README.md file in that directory.

To build the code, the following dependencies are required:

CUDA (with cuFFT, cuBLAS, and (optionally) cuTENSOR 2.x) and a CUDA enabled GPU OR ROCm/HIP (with rocFFT, rocBLAS, and RCCL) and an AMD GPU
NCCL for running on NVIDIA GPUs
HDF5 (parallel version is required)

First, clone the repository:

git clone https://github.com/s769/FFTMatvec.git
git checkout mp

Then, build the code (CUDA):

cmake -B build -DNCCL_ROOT=/path/to/nccl -DCMAKE_BUILD_TYPE=Release [-DCUTENSOR_ROOT=/path/to/cutensor] [-DCUDA_ARCH=XX]
cmake --build build --parallel

or (HIP)

cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_WITH_HIP=ON [-DAMDGPU_TARGETS=gfxABC]
cmake --build build --parallel

Note: the -DCUTENSOR_ROOT option is only needed if the cuTENSOR 2.x library is not in the usual CUDA library path. Some systems may have the cuTENSOR 1.x library in the CUDA library path, which is not compatible with this code. In that case, the cuTENSOR 2.x library must be installed, and the path to the cuTENSOR 2.x library must be provided to the build command.

Tests will build by default. If you don't want to build the tests, you can disable them by adding -DENABLE_TESTING=OFF to the cmake command. To run the tests, use ctest in the build directory. Tests require a minimum of 2 GPUs to run by default. To change this, set -DNUM_PROCS=X to the desired number of GPUs.

To build the documentation, the following dependencies are required:

Then, build the documentation by running mkdocs build or mkdocs serve in the docs directory. The built documentation will be in the site directory.

Usage

The main executable is fft_matvec. It takes the following arguments:

-pr (int): Number of processor rows (default: 1)
-pc (int): Number of processor columns (default: 1)
-g (bool): Use global sizes (default: false)
-Nm (int): Number of global block columns (default: 10, ignored if -g is false)
-Nd (int): Number of global block rows (default: 5, ignored if -g is false)
-Nt (int): Block size (default: 7)
-nm (int): Number of local block columns (default: 3, ignored if -g is true)
-nd (int): Number of local block rows (default: 2, ignored if -g is true)
-v (bool): Print input/output vectors (default: false)
-N (int): Number of matvecs to use for timing (default: 100)
-raw (bool): Print raw timing data instead of table (default: false)
-t (bool): Check matvec results (default: false)
-prec (string): Precision Code: 5 characters, each D or S (case insensitive) representing the precision of the corresponding matrix/vector component (D=double, S=single). Components are: broadcast/pad, fft, sbgemv, ifft, unpad/reduce. Default is DDDDD.
-s (string): Directory to save output files to (default "" - don't save output)
-rand (bool): Use deterministic double precision values for the input vectors/matrices that cannot be represented as single precision floats without error. Result checking will not work with this option enabled (default: false)
-h (bool): Print help message

Note: pr x pc must be equal to the number of processors used to run the code. If no values are provided for -pr and -pc, the code will run with pr = 1 and pc = num_mpi_procs.

For boolean arguments, just pass the flag to enable it without a value. For example:

mpiexec -np 4 ./build/fft_matvec -pr 2 -pc 2 -g -Nm 20 -Nd 10 -Nt 7 -nm 4 -nd 3 -v -N 100 -prec dssdd -rand -s .

will run the code with 4 processors, a 2x2 processor grid, global sizes, 20 global block columns, 10 global block rows, a block size of 7, 4 local block columns, 3 local block rows, print input/output vectors, and use 100 matvecs for timing. The FFT and SBGEMV will be computed in single precision; all other components in double precision. Deterministic double precision values that cannot be represented as single precision floats without error will be used to initialize the matrix/vector, and output vectors will be saved in the current directory.

Example Program

The following is the main program for the FFTMatvec code (used for testing):

main.cpp

/**
 * @file main.cpp
 * @brief This file contains the main function for the matvec-test program.
 *
 * The program performs matrix-vector multiplication using MPI parallelization.
 * It takes command line arguments to configure the matrix and vector dimensions,
 * the number of processor rows and columns, and other options.
 *
 * The main function initializes MPI, parses command line arguments, creates a communication object,
 * initializes matrices and vectors, performs matrix-vector multiplication, and prints the results.
 *
 * @param argc The number of command line arguments.
 * @param argv An array of command line argument strings.
 * @return 0 on success, 1 on error.
 */
#include "Comm.hpp"
#include "Matrix.hpp"
#include "Vector.hpp"
#include "cmdparser.hpp"
#include "shared.hpp"
#include "tester.hpp"
#include "utils.hpp"

// enable/disable timing (see shared.hpp)
#if TIME_MPI
#define WARMUP 10
bool warmup;
#else
#define WARMUP 0
#endif

/**
 * @brief Configures the parser.
 * @param parser The parser.
 */

void configureParser(cli::Parser &parser)
{
    parser.set_optional<int>(
        "pr", "proc_rows", 1, "Number of processor rows (proc_rows x proc_cols = num_mpi_ranks)");
    parser.set_optional<int>("pc", "proc_cols", 1,
                             "Number of processor columns (proc_rows x proc_cols = num_mpi_ranks)");
    parser.set_optional<bool>("g", "glob_sizes", false, "Use global indices");
    parser.set_optional<int>("Nm", "glob_cols", 10, "Number of global columns");
    parser.set_optional<int>("Nd", "glob_rows", 5, "Number of global rows");
    parser.set_optional<int>("Nt", "block_size", 7, "Number of time points");
    parser.set_optional<int>("nm", "num_cols", 3, "Number of local columns");
    parser.set_optional<int>("nd", "num_rows", 2, "Number of local rows");
    parser.set_optional<bool>("v", "verbose", false, "Print vectors");
    parser.set_optional<int>("N", "reps", 100, "Number of repetitions (for timing purposes)");
    parser.set_optional<bool>("raw", "print_raw", false, "Print raw times (instead of table)");
    parser.set_optional<bool>("t", "test", false, "Run tests");
    parser.set_optional<std::string>("prec", "precision", "DDDDD", "Precision Code: 5 characters, each D or S (case insensitive) representing the precision of the corresponding matrix/vector component (D=double, S=single). Components are: broadcast/pad, fft, sbgemv, ifft, unpad/reduce. Default is DDDDD.");
    parser.set_optional<std::string>("s", "save_dir", "", "Directory to save output files to");
    parser.set_optional<bool>("rand", "random", false, "Use deterministic double precision values for the input vectors/matrices that cannot be represented as single precision floats without error.");
}

MatvecPrecisionConfig parse_precision_string(const std::string &arg)
{
    // 1. Check for correct length
    if (arg.length() != 5)
    {
        throw std::invalid_argument("Error: Precision string must be exactly 5 characters long. e.g., 'DDSDD'");
        MPICHECK(MPI_Abort(MPI_COMM_WORLD, 1));
    }

    MatvecPrecisionConfig config;
    std::string mapping_names[] = {
        "broadcast_and_pad", "fft", "sbgemv", "ifft", "unpad_and_reduce"};

    // 2. Iterate through the string and set the configuration
    for (size_t i = 0; i < arg.length(); ++i)
    {
        char c = std::toupper(arg[i]);
        Precision p;

        if (c == 'S')
        {
            p = Precision::SINGLE;
        }
        else if (c == 'D')
        {
            p = Precision::DOUBLE;
        }
        else
        {
            // Found an invalid character
            throw std::invalid_argument("Error: Invalid character '" + std::string(1, arg[i]) +
                                        "' at position " + std::to_string(i) + ". Use only 'S' or 'D'.");
        }

        // 3. Assign precision to the correct member based on position
        switch (i)
        {
        case 0:
            config.broadcast_and_pad = p;
            break;
        case 1:
            config.fft = p;
            break;
        case 2:
            config.sbgemv = p;
            break;
        case 3:
            config.ifft = p;
            break;
        case 4:
            config.unpad_and_reduce = p;
            break;
        }
    }

    return config;
}

/********/
/* MAIN */
/********/
int main(int argc, char **argv)
{
    int world_rank = 0, world_size, provided;
    // Initialize the MPI environment (OpenMP is used so we need to use MPI_Init_thread)
    MPICHECK(MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided));

    if (provided < MPI_THREAD_FUNNELED)
    {
        if (world_rank == 0)
            fprintf(stderr, "The provided MPI level is not sufficient\n");
        MPICHECK(MPI_Abort(MPI_COMM_WORLD, 1));
        return 1;
    }

    {
        // Parse command line arguments
        cli::Parser parser(argc, argv);
        configureParser(parser);
        parser.run_and_exit_if_error();

        MPICHECK(MPI_Comm_rank(MPI_COMM_WORLD, &world_rank));
        MPICHECK(MPI_Comm_size(MPI_COMM_WORLD, &world_size));

        auto proc_rows = parser.get<int>("pr");
        auto proc_cols = parser.get<int>("pc");
        // Check that the number of processor rows and columns is valid
        if (world_size != proc_rows * proc_cols)
        {
            if (proc_rows == 1 && proc_cols == 1)
            {
                proc_rows = 1;
                proc_cols = world_size;
                if (world_rank == 0)
                    fprintf(stderr,
                            "Warning: Proc Rows x Proc Cols must equal the total number of MPI "
                            "ranks. Got %d x %d = %d, expected %d. Using %d x %d = %d instead\n",
                            proc_rows, proc_cols, proc_rows * proc_cols, world_size, proc_rows,
                            proc_cols, proc_rows * proc_cols);
            }
            else if (world_rank == 0)
            {
                fprintf(stderr,
                        "Proc Rows x Proc Cols must equal the total number of MPI ranks. Got %d x %d = "
                        "%d, expected %d\n",
                        proc_rows, proc_cols, proc_rows * proc_cols, world_size);
                MPICHECK(MPI_Abort(MPI_COMM_WORLD, 1));
                return 1;
            }
        }

        auto glob_sizes = parser.get<bool>("g");

        auto block_size = parser.get<int>("Nt");

        // global or local sizes depending on the flag (other ones are automatically set when
        // creating matrices and vectors)
        auto num_cols = (glob_sizes) ? parser.get<int>("Nm") : parser.get<int>("nm");
        auto num_rows = (glob_sizes) ? parser.get<int>("Nd") : parser.get<int>("nd");

        if (world_rank == 0)
        {
            printf("Proc Rows: %d, Proc Cols: %d\n", proc_rows, proc_cols);
            if (glob_sizes)
                printf("Global Rows: %d, Global Cols: %d\n", num_rows, num_cols);
            else
                printf("Local Rows: %d, Local Cols: %d\n", num_rows, num_cols);
            printf("Block Size: %d\n", block_size);
        }

#if TIME_MPI
        MPICHECK(MPI_Barrier(MPI_COMM_WORLD));
        t_list[ProfilerTimesFull::COMM_INIT].start();
#endif
        // Create a communicator object
        Comm comm(MPI_COMM_WORLD, proc_rows, proc_cols);
#if TIME_MPI
        MPICHECK(MPI_Barrier(MPI_COMM_WORLD));
        t_list[ProfilerTimesFull::COMM_INIT].stop();
#endif

        if (world_rank == 0)
            printf("Created Comm\n");

#if TIME_MPI
        MPICHECK(MPI_Barrier(MPI_COMM_WORLD));
        t_list[ProfilerTimesFull::SETUP].start();
#endif
        // Create matrices and vectors
        MatvecPrecisionConfig p_config = parse_precision_string(parser.get<std::string>("prec"));

        Matrix F(comm, num_cols, num_rows, block_size, glob_sizes, false, p_config);

        if (world_rank == 0)
            printf("Created Matrices\n");
        bool random = parser.get<bool>("rand");
        int seed = 12345;

        if (random)
        {
            F.init_mat_doubles();
        }
        else
        {
            F.init_mat_ones();
        }

        if (world_rank == 0)
            printf("Initialized Matrices\n");

        Vector in_F(comm, num_cols, block_size, "col", glob_sizes),
            in_FS(comm, num_rows, block_size, "row", glob_sizes);
        Vector out_F(in_FS, false), out_FS(in_F, false);

        if (world_rank == 0)
            printf("Created Vectors\n");
        // Initialize vectors with ones for testing
        if (random)
        {
            in_F.init_vec_doubles();
            in_FS.init_vec_doubles();
        }
        else
        {
            in_F.init_vec_ones();
            in_FS.init_vec_ones();
        }
        out_F.init_vec();
        out_FS.init_vec();

#if TIME_MPI
        MPICHECK(MPI_Barrier(MPI_COMM_WORLD));
        t_list[ProfilerTimesFull::SETUP].stop();
#endif

        bool print = parser.get<bool>("v");

        if (print)
        {
            in_F.print("in_F");
            in_FS.print("in_FS");
        }

        if (world_rank == 0)
            printf("Initialized Vectors\n");

        auto reps = parser.get<int>("N");

#if !TIME_MPI
        reps = 1;
#endif

#if TIME_MPI
        MPICHECK(MPI_Barrier(MPI_COMM_WORLD));
        t_list[ProfilerTimesFull::FULL].start();
#endif
        // Perform matrix-vector multiplication
        for (int i = 0; i < reps + WARMUP; i++)
        {
            F.matvec(in_F, out_F);
            F.transpose_matvec(in_FS, out_FS);
        }

#if TIME_MPI
        MPICHECK(MPI_Barrier(MPI_COMM_WORLD));
        t_list[ProfilerTimesFull::FULL].stop();
#endif

        auto test = parser.get<bool>("t");
        // Run tests
        if (test)
        {
            if (random)
            {
                if (world_rank == 0)
                    printf("Skipping tests for random matrices\n");
            }
            else
            {
                Tester::check_ones_matvec(comm, F, out_F, false, false);
                Tester::check_ones_matvec(comm, F, out_FS, true, false);
            }
        }

        if (world_rank == 0)
            printf("Finished Matvecs\n");
        if (print)
        {
            out_F.print("out_F");
            out_FS.print("out_FS");
        }

        std::string save_dir = parser.get<std::string>("s");
        if (save_dir != "")
        {
            out_F.save(save_dir + "/out_F.h5");
            out_FS.save(save_dir + "/out_FS.h5");
        }

        // Print timing results
        auto print_raw = parser.get<bool>("raw");
        if (world_rank == 0)
            printf("Timing Results Showing Mean, Min, and Max Times Over %d Processor(s) (%d "
                   "Matvecs):\n\n",
                   world_size, reps);

        Utils::print_times(reps, !print_raw);
    }
    MPICHECK(MPI_Finalize());

    return 0;
}