BeyondMoore

Pioneering the Future of Computing

BeyondMoore addresses the timely research challenge of solving the software side of the post-Moore crisis, as Moore's Law reaches its limits in chip manufacturing. This transition requires a shift towards extreme heterogeneity in computing systems. Current programming solutions are host-centric, leading to scalability issues and limited parallelism. BeyondMoore proposes an autonomous execution model in which accelerators operate independently, facilitated by a task graph programming abstraction. To execute this task graph efficiently, BeyondMoore develops a software framework that performs static and dynamic optimizations and issues accelerator-initiated data transfers, along with supporting tools such as a compiler and a profiler. Below you can find details of the projects comprising BeyondMoore's software ecosystem.

Team

PI: Assoc. Prof. Didem Unat (dunat@ku.edu.tr)

PhD Student: Ilyas Turimbetov (iturimbetov18@ku.edu.tr)
Research Focus: Task graphs, load balancing.

PhD Student: Javid Baydamirli (jbaydamirli21@ku.edu.tr)
Research Focus: Compilers, parallel programming models.

PhD Student: Doğan Sağbili (dsagbili17@ku.edu.tr)
Research Focus: Multi-device communication mechanisms.

PhD Student: Mohammad Kefah Taha Issa (missa18@ku.edu.tr)
Research Focus: Peer-to-peer GPU tracing and profiling.

Master's Student: Hanaa Zaqout (hzaqout25@ku.edu.tr)
Research Focus: Race detection and developer tools.

Master's Student: Sinan Ekmekçibaşı (sekmekcibasi23@ku.edu.tr)
Research Focus: Multi-GPU communication models.

Master's Student: Emre Düzakın (eduzakin18@ku.edu.tr)
Research Focus: LLM-based multi-agent systems.


Alumni

Alumni: Ismayil Ismayilov
Research Focus: Taming heterogeneity, programming models.

Alumni: Muhammed Abdullah Soytürk
Research Focus: Scalable deep learning.

Alumni: Dr. Muhammad Aditya Sasongko
Research Focus: Performance models, profiling tools.


BeyondMoore Software Ecosystem

Compiler, Runtime and Execution Models

Profiling Tools

  • Snoopie: A Multi-GPU Communication Profiler and Visualiser
  • PES AMD vs Intel: A Precise Event Sampling Benchmark Suite
  • aCG: CPU- and GPU-initiated Communication Strategies for CG Methods

This project introduces a fully autonomous execution model for multi-GPU applications, eliminating CPU involvement beyond initial kernel launch. In conventional setups, the CPU orchestrates execution, causing overhead. We propose delegating this control flow entirely to devices, leveraging techniques like persistent kernels and device-initiated communication. Our CPU-free model significantly reduces communication overhead. Demonstrations on 2D/3D Jacobi stencil and Conjugate Gradient solvers show up to a 58.8% improvement in communication latency and a 1.63x speedup for CG on 8 NVIDIA A100 GPUs compared to CPU-controlled baselines.

More details and git repository of the project.
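
As a rough illustration of this execution style, the sketch below keeps the iteration loop of a 1D Jacobi-like update inside a single persistent, cooperatively launched CUDA kernel, so the host launches once and never re-enters the loop. The kernel, stencil, and sizes are illustrative and not taken from the project code.

    // Persistent-kernel sketch: the time loop runs on the device, so the CPU
    // is not involved after the single cooperative launch.
    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __device__ void jacobi_step(const double* in, double* out, int n) {
        // Assumes the grid has at least n threads; illustrative 1D stencil.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1)
            out[i] = 0.5 * (in[i - 1] + in[i + 1]);
    }

    __global__ void persistent_solver(double* a, double* b, int n, int iters) {
        cg::grid_group grid = cg::this_grid();
        for (int t = 0; t < iters; ++t) {      // loop stays on the GPU
            jacobi_step(a, b, n);
            grid.sync();                       // device-wide barrier, no CPU round trip
            double* tmp = a; a = b; b = tmp;   // thread-local buffer swap
        }
    }

    // Host side: one cooperative launch replaces the per-iteration
    // launch/synchronize pattern of CPU-controlled execution, e.g.
    //   void* args[] = { &d_a, &d_b, &n, &iters };
    //   cudaLaunchCooperativeKernel((void*)persistent_solver, grid_dim, block_dim, args);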

With data movement posing a significant bottleneck in computing, profiling tools are essential for scaling multi-GPU applications efficiently. However, existing tools focus primarily on single GPU compute operations and lack support for monitoring GPU-GPU transfers and communication library calls. Addressing these gaps, we present Snoopie, an instrumentation-based multi-GPU communication profiling tool. Snoopie accurately tracks peer-to-peer transfers and GPU-centric communication library calls, attributing data movement to specific source code lines and objects. It offers various visualization modes, from system-wide overviews to detailed instructions and addresses, enhancing programmer productivity.

More details and git repository of the project.
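
For context, the snippet below shows the kind of implicit peer-to-peer traffic such a tool attributes back to source lines: a kernel on GPU 0 directly dereferences memory allocated on GPU 1 after peer access is enabled. The variable names are illustrative, and the snippet is independent of Snoopie's own interface.

    #include <cstdio>

    __global__ void read_remote(const float* remote, float* local, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            local[i] = remote[i];   // load served over NVLink/PCIe from the peer GPU
    }

    int main() {
        int n = 1 << 20;
        float *on_gpu1, *on_gpu0;
        cudaSetDevice(1); cudaMalloc(&on_gpu1, n * sizeof(float));
        cudaSetDevice(0); cudaMalloc(&on_gpu0, n * sizeof(float));
        cudaDeviceEnablePeerAccess(1, 0);        // let GPU 0 map GPU 1's memory
        read_remote<<<(n + 255) / 256, 256>>>(on_gpu1, on_gpu0, n);
        cudaDeviceSynchronize();
        printf("GPU 0 read %d floats resident on GPU 1\n", n);
        return 0;
    }

Attributing the load on the marked line to an object that physically resides on another device is the kind of information a source-level communication profiler needs to surface.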

To address resource underutilization in multi-GPU systems, particularly in irregular applications, we propose a GPU-sided resource allocation method. This method dynamically adjusts the number of GPUs in use based on workload changes, utilizing GPU-to-CPU callbacks to request additional devices during kernel execution. We implemented and tested multiple callback methods, measuring their overheads on Nvidia and AMD platforms. Demonstrating the approach in an irregular application like Breadth-First Search (BFS), we achieved a 15.7% reduction in time to solution on average, with callback overheads as low as 6.50 microseconds on AMD and 4.83 microseconds on Nvidia. Additionally, the model can reduce total device usage by up to 35%, improving energy efficiency.

More details and git repository of the project.
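
One simple way to realize such a GPU-to-CPU callback, sketched below, is to have the kernel raise a flag in host-pinned memory when its workload (for example, a BFS frontier) grows past a threshold, while a host thread polls the flag and reacts. The names and the polling scheme are illustrative assumptions, not the project's actual mechanism.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void worker(volatile int* request_flag, int frontier_size, int threshold) {
        if (threadIdx.x == 0 && blockIdx.x == 0 && frontier_size > threshold) {
            *request_flag = 1;        // ask the host for another GPU
            __threadfence_system();   // make the write visible to the CPU
        }
        // ... keep processing the current frontier ...
    }

    int main() {
        // Mapped, pinned allocation: with unified virtual addressing the same
        // pointer is usable from both host and device.
        volatile int* flag;
        cudaHostAlloc((void**)&flag, sizeof(int), cudaHostAllocMapped);
        *flag = 0;

        worker<<<64, 128>>>(flag, /*frontier_size=*/1 << 20, /*threshold=*/1 << 16);

        // Host-side "callback": poll while the kernel runs and react to the request.
        while (*flag == 0 && cudaStreamQuery(0) == cudaErrorNotReady) { }
        if (*flag)
            printf("device requested more resources; host can bring another GPU online\n");
        cudaDeviceSynchronize();
        return 0;
    }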

Modern HPC and AI systems increasingly rely on multi-GPU clusters, where communication libraries such as MPI, NCCL/RCCL, and NVSHMEM enable data movement across GPUs. While these libraries are widely used in frameworks and solver packages, their distinct APIs, synchronization models, and integration mechanisms introduce programming complexity and limit portability. Performance also varies across workloads and system architectures, making it difficult to achieve consistent efficiency. These issues present a significant obstacle to writing portable, high-performance code for large-scale GPU systems. We present Uniconn, a unified, portable high-level C++ communication library that supports both point-to-point and collective operations across GPU clusters. Uniconn enables seamless switching between backends and APIs (host or device) with minimal or no changes to application code. We describe its design and core constructs, and evaluate its performance using network benchmarks, a Jacobi solver, and a Conjugate Gradient solver. Across three supercomputers, we compare Uniconn's overhead against CUDA/ROCm-aware MPI, NCCL/RCCL, and NVSHMEM on up to 64 GPUs. In most cases, Uniconn incurs negligible overhead, typically under 1% for the Jacobi solver and under 2% for the Conjugate Gradient solver.

More details and git repository of the project.
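
To make the portability gap concrete, the sketch below writes the same neighbour exchange against three backends; the divergent function signatures and synchronization models are exactly what a unified layer hides behind one interface. The identifiers are illustrative, and Uniconn's actual API is documented in the project repository.

    #include <cstddef>

    #if defined(USE_MPI)
      #include <mpi.h>
      void exchange(double* d_send, double* d_recv, size_t n, int right, int left) {
          // CUDA-aware MPI accepts device pointers; progress is host-driven.
          MPI_Sendrecv(d_send, n, MPI_DOUBLE, right, 0,
                       d_recv, n, MPI_DOUBLE, left, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }
    #elif defined(USE_NCCL)
      #include <nccl.h>
      void exchange(double* d_send, double* d_recv, size_t n, int right, int left,
                    ncclComm_t comm, cudaStream_t stream) {
          // NCCL point-to-point calls are stream-ordered and must be grouped.
          ncclGroupStart();
          ncclSend(d_send, n, ncclDouble, right, comm, stream);
          ncclRecv(d_recv, n, ncclDouble, left, comm, stream);
          ncclGroupEnd();
      }
    #elif defined(USE_NVSHMEM)
      #include <nvshmem.h>
      void exchange(double* sym_send, double* sym_recv, size_t n, int right_pe) {
          // NVSHMEM is one-sided: put directly into the peer's symmetric buffer.
          nvshmem_double_put(sym_recv, sym_send, n, right_pe);
          nvshmem_barrier_all();
      }
    #endif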

This work revisits Conjugate Gradient (CG) parallelization for large-scale multi-GPU systems, addressing challenges from low computational intensity and communication overhead. We develop scalable CG and pipelined CG solvers for NVIDIA and AMD GPUs, employing GPU-aware MPI, NCCL/RCCL, and NVSHMEM for both CPU- and GPU-initiated communication. A monolithic GPU-offloaded variant further enables fully device-driven execution, removing CPU involvement. Optimizations across all designs reduce data transfers and synchronization costs. Evaluations on SuiteSparse matrices and a real finite element application show 8–14% gains over state-of-the-art on single GPUs and 5–15% improvements in strong scaling tests on over 1,000 GPUs. While CPU-driven variants currently benefit from stronger library support, results highlight the promising scalability of GPU-initiated execution for future large-scale systems.

More details and git repository of the project.
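
As a minimal sketch of where communication enters a distributed CG iteration, the snippet below shows the halo exchange before the sparse matrix-vector product and the global dot-product reduction, expressed with CUDA-aware MPI on device buffers. Buffer names and the single-reduction structure are illustrative simplifications, not the solvers' actual code.

    #include <cstddef>
    #include <mpi.h>

    void cg_iteration_comm(double* d_p_halo, double* d_p_ghost, size_t halo_n,
                           int right, int left,
                           double* d_partial_dot, double* d_global_dot) {
        // 1) Exchange boundary entries of the search direction p before y = A*p.
        MPI_Sendrecv(d_p_halo, halo_n, MPI_DOUBLE, right, 0,
                     d_p_ghost, halo_n, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // 2) Reduce the local dot product (p, A*p) across all ranks; with a
        //    CUDA-aware MPI the device pointers are passed directly.
        MPI_Allreduce(d_partial_dot, d_global_dot, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        // Pipelined CG variants overlap this reduction with local kernels to hide
        // its latency, which is also what the GPU-initiated designs target.
    }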

We're actively crafting a compiler to empower developers to write high-level Python code that compiles into efficient CPU-free device code. This compiler integrates GPU-initiated communication libraries, NVSHMEM for NVIDIA and ROC_SHMEM for AMD, enabling GPU communication directly within Python code. With automatic generation of GPU-initiated communication calls and persistent kernels, we aim to streamline development workflows.

More details and git repository of the project.
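
The target pattern for the generated code is device-initiated communication of the kind sketched below, where the kernel pushes its boundary values directly into the neighbouring GPU's symmetric buffer with NVSHMEM. The array layout and names are illustrative, not the compiler's actual output.

    #include <nvshmem.h>

    // u is a symmetric (nvshmem_malloc'd) array with ghost cells u[0] and u[n+1].
    __global__ void halo_push(double* u, int n, int mype, int npes) {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            if (mype > 0)         // my left edge -> left neighbour's right ghost cell
                nvshmem_double_p(&u[n + 1], u[1], mype - 1);
            if (mype < npes - 1)  // my right edge -> right neighbour's left ghost cell
                nvshmem_double_p(&u[0], u[n], mype + 1);
            nvshmem_quiet();      // complete the puts before any dependent work
        }
    }

Because the transfer is issued from inside the kernel, such calls can be fused with the stencil update and the whole time loop kept CPU-free.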

We've designed and implemented a lightweight runtime system tailored for CPU-free task graph execution on multi-device systems. The runtime minimizes CPU involvement by handling only task graph initialization on the host; all subsequent operations execute on the GPU side. It schedules graph nodes online, monitors GPU resource usage, manages memory allocation and data transfers, and tracks task dependencies as execution progresses. Given a computational graph originally written for a single GPU, it scales execution to multiple GPUs without requiring code modifications. More details about the project will be available soon; the related paper is under review.
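
A generic way to keep dependency tracking on the device, illustrated in the sketch below, is to give each task an atomic count of unmet dependencies and have the worker that finishes a task decrement its successors' counters, enqueueing any that become runnable. This is a textbook dependency-counting scheme under assumed data structures, not the runtime's actual implementation.

    // Device-side dependency tracking sketch for a task graph.
    struct DeviceTask {
        int  num_successors;
        int* successor_ids;       // indices into the global task array
    };

    struct TaskGraph {
        DeviceTask* tasks;
        int*        pending_deps; // remaining unmet dependencies per task
        int*        ready_queue;  // indices of runnable tasks
        int*        ready_tail;   // next free slot in ready_queue
    };

    __device__ void on_task_complete(TaskGraph g, int task_id) {
        const DeviceTask& t = g.tasks[task_id];
        for (int s = 0; s < t.num_successors; ++s) {
            int succ = t.successor_ids[s];
            // The last predecessor to finish makes the successor runnable.
            if (atomicSub(&g.pending_deps[succ], 1) == 1) {
                int slot = atomicAdd(g.ready_tail, 1);
                g.ready_queue[slot] = succ;   // a persistent scheduler kernel pops from here
            }
        }
    }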

Precise event sampling, a profiling feature in commodity processors, accurately pinpoints instructions triggering hardware events. While widely utilized, support from vendors varies, impacting accuracy, stability, overhead, and functionality. Our study benchmarks Intel PEBS and AMD IBS, revealing PEBS's finer-grained accuracy and IBS's richer information but lower stability. PEBS incurs lower time overhead, while IBS suffers from accuracy issues. OS signal delivery adds significant time overhead. Both PEBS and IBS exhibit sampling bias. Our findings hold in a full-fledged profiling tool on modern Intel and AMD machines. This comparison offers valuable insights for hardware designers and profiling tool developers.

All the artifacts and benchmarks can be found here.
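
For reference, precise sampling is typically requested through the Linux perf_event_open interface, roughly as in the sketch below; the event, period, and flags are illustrative, and the benchmark suite itself exercises the vendor mechanisms far more thoroughly.

    #include <cstring>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    // Open a cycle sampler with precise instruction attribution (PEBS on Intel,
    // IBS-backed on AMD kernels that support it). Error handling omitted.
    int open_precise_cycle_sampler(pid_t pid) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size           = sizeof(attr);
        attr.type           = PERF_TYPE_HARDWARE;
        attr.config         = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period  = 100000;                     // one sample per 100k cycles
        attr.sample_type    = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR;
        attr.precise_ip     = 2;                          // request precise sampling
        attr.disabled       = 1;
        attr.exclude_kernel = 1;

        // Samples are then read from the mmap'ed ring buffer associated with the
        // returned fd, or delivered via a signal on ring-buffer wakeups (the OS
        // signal path is the source of the extra time overhead noted above).
        return (int)syscall(__NR_perf_event_open, &attr, pid, /*cpu=*/-1,
                            /*group_fd=*/-1, /*flags=*/0);
    }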


Publications

Mohamed Wahib, Muhammed Abdullah Soyturk, Didem Unat (2025) Balanced and Elastic End-to-end Training of Dynamic LLMs. ACM publication is pending. preprint pdf
Didem Unat, Anshu Dubey, Emmanuel Jeannot, John Shalf (2025) The Persistent Challenge of Data Locality in the Post-Exascale Era. In Computing in Science & Engineering. preprint pdf
James D. Trotter, Sinan Ekmekçibaşı, Doğan Sağbili, Johannes Langguth, Xing Cai, Didem Unat (2025) CPU- and GPU-initiated Communication Strategies for Conjugate Gradient Methods on Large GPU Clusters. In SC ’25: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. preprint pdf
Doğan Sağbili, Sinan Ekmekçibaşı, Khaled Z. Ibrahim, Tan Nguyen, Didem Unat (2025) UNICONN: A Uniform High-Level Communication Library for Portable Multi-GPU Programming (presentation). In Cluster ’25: Proceedings of the IEEE International Conference on Cluster Computing (IEEE Cluster 2025). preprint pdf
Ilyas Turimbetov, Mohamed Wahib, Didem Unat (2025) A Device-Side Execution Model for Multi-GPU Task Graphs (presentation). In ICS ’25: Proceedings of the 39th ACM International Conference on Supercomputing. preprint pdf
Fatih Taşyaran, Osman Yasal, José A. Morgado, Aleksandar Ilic, Didem Unat, Kamer Kaya (2024) P-MoVE: Performance Monitoring and Visualization with Encoded Knowledge (presentation). In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage, and Analysis. preprint pdf
Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov (2024) The Landscape of GPU-Centric Communication. Under review. preprint pdf
Tugba Torun, Ameer Taweel, Didem Unat (2024) A Sparse Tensor Generator with Efficient Feature Extraction. Accepted for publication; online release pending. preprint pdf
Javid Baydamirli, Tal Ben-Nun, Didem Unat (2024) Autonomous Execution for Multi-GPU Systems: Compiler Support (presentation). In the 2024 International Workshop on Performance, Portability, and Productivity in HPC (P3HPC), held at SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. preprint pdf
Mohammad Kefah Taha Issa, Muhammad Aditya Sasongko, Ilyas Turimbetov, Javid Baydamirli, Doğan Sağbili, Didem Unat (2024) Snoopie: A Multi-GPU Communication Profiler and Visualizer. In ICS ’24: Proceedings of the 38th ACM International Conference on Supercomputing. preprint pdf
Ilyas Turimbetov, Muhammad Aditya Sasongko, Didem Unat (2024) GPU-Initiated Resource Allocation for Irregular Workloads. In ExHET ’24: International Workshop on Extreme Heterogeneity Solutions. preprint pdf
Ismayil Ismayilov, Javid Baydamirli, Doğan Sağbili, Mohamed Wahib, Didem Unat (2023) Multi-GPU Communication Schemes for Iterative Solvers: When CPUs are Not in Charge. In ICS ’23: Proceedings of the 37th ACM International Conference on Supercomputing, 192–202. preprint pdf
Muhammad Aditya Sasongko, Milind Chabbi, Paul H. J. Kelly, Didem Unat (2023) Precise Event Sampling on AMD vs Intel: Quantitative and Qualitative Comparison. IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 5, pp. 1594–1608, May 2023, doi: 10.1109/TPDS.2023.3257105. preprint pdf