Software developed at Georgia Tech as part of System Software

The project explores the challenges involved in construction and maintenance of logical machine configurations, comprised of sets of both CPU cores and GPUs that are assembled when and as required by applications, rather than simply mirroring underlying hardware. The "Shadowfax" runtime system manages software-defined abstractions as "slices" of a cluster through dynamic mapping of software executables to hardware entities.

During the course of the project we have developed several software components used by the current runtime system, Shadowfax, which implements the concept of GPGPU assemblies. We have pursued two research tracks: the first targets non-virtualized environments for HPC settings such as Keeneland, and the second targets virtualized environments such as enterprise setups.

  1. Deployment path (non-virtualized solution)
    • Shadowfax II - scalable implementation of GPGPU assemblies
    • ClusterWatch - successor of ClusterSpy and Scout; an extensible, lightweight, distributed, EVPath-based infrastructure for monitoring system resources. ClusterWatch uses EVPath Data Flow Graphs to build robust monitoring topologies.
    • ClusterSpy - an EVPath-based library that provides communication infrastructure and default "spies" that monitor the CPU via /proc and the GPU via Ocelot-based Lynx instrumentation
    • Scout - an extensible (via scouts), distributed, EVPath-based infrastructure for monitoring various resources (cpu - /proc, memory - /proc, network - /proc, gpu - nvidia-smi); collected monitoring data can be stored by "readers"; the currently implemented sqlite3 readers provide persistent storage of cpu, mem, net, and gpu monitoring data; Scout is a refactored and extended version of the dSimons code
    • Kidron Utils RCE - a CUDA call interposer module that enables interposing CUDA calls and executing them in a remote fashion
    • dSimons - a distributed monitoring system for monitoring resources such as cpus, gpus and network; a predecessor of Scout
    • Two-level scheduling infrastructure - Level 1 assigns tasks at the node level to satisfy load-balance requirements in a multi-node environment; Level 2 targets GPU/CPU thread co-scheduling (both remote and local)
  2. Exploratory path (virtualization-based solution)
    • Shadowfax - dynamically composed GPGPU assemblies in a virtualized environment
    • Pegasus - coordinated scheduling in virtualized accelerator-based platforms
    • GViM - GPU accelerated virtual machines, predecessor of Pegasus and virtualized Shadowfax

Detailed Information

We maintain two web sites with regard to the Keeneland project

Shadowfax II (non-virtualized solution)

Shadowfax II is a distributed runtime developed for heterogeneous GPGPU-based high-performance clusters. The primary Shadowfax component is a distributed collection of stateful, persistent daemon processes. Collectively, the daemon processes maintain monitoring information and service requests for assemblies from applications. The Shadowfax interposing library implements the CUDA API to enable transparent access to and use of local or remote assembly GPUs. Upon the first CUDA function invocation, the library communicates with the local daemon to request an assembly. Once the assembly is received, the interposing library immediately maps it onto the cluster: network links are instantiated for any remote vGPUs provided, and the CPUs required to host the remote paths are reserved on the destination nodes. The application is then allowed to continue on the virtual platform provided by the Shadowfax runtime system.
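
The request-on-first-call behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (`VGpu`, `Assembly`, `FakeDaemon`, `Interposer`), the node names, and the shape of an assembly are hypothetical and are not taken from the Shadowfax sources. The real interposing library implements the CUDA API and talks to a real daemon.

```python
from dataclasses import dataclass


@dataclass
class VGpu:
    node: str    # host that physically owns the GPU (hypothetical field)
    device: int  # device index on that host (hypothetical field)


@dataclass
class Assembly:
    vgpus: list


class FakeDaemon:
    """Stand-in for the local daemon; always returns the same assembly."""

    def __init__(self):
        self.requests = 0

    def request_assembly(self):
        self.requests += 1
        return Assembly([VGpu("localhost", 0), VGpu("node17", 1)])


class Interposer:
    """Requests an assembly lazily, on the first intercepted call."""

    def __init__(self, daemon, local_node="localhost"):
        self.daemon = daemon
        self.local_node = local_node
        self.assembly = None
        self.remote_links = []  # (node, device) pairs that need a network link

    def _ensure_assembly(self):
        if self.assembly is None:
            self.assembly = self.daemon.request_assembly()
            # Only vGPUs hosted on other nodes need a network link.
            self.remote_links = [
                (g.node, g.device)
                for g in self.assembly.vgpus
                if g.node != self.local_node
            ]

    def get_device_count(self):
        # Every intercepted entry point would start with this check.
        self._ensure_assembly()
        return len(self.assembly.vgpus)
```

In this sketch the application sees two devices, although only one of them is local, and the daemon is contacted exactly once no matter how many calls follow.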


ClusterWatch

ClusterWatch is the successor of Scout and ClusterSpy. It is an EVPath-based distributed monitoring infrastructure that allows for scalable monitoring through various topologies built on top of EVPath Data Flow Graphs (DFGs). Currently, ClusterWatch supports system monitoring at the cluster level. It can be downloaded from ClusterWatch. The user is 'anon' and the password is 'anon'.


Scout

The Scout system is an EVPath-based distributed monitoring infrastructure. Scout is built around collectors, which run on nodes and gather monitoring data such as memory, CPU, and GPU utilization, and an aggregator, which aggregates the data sent by the collectors. The aggregator, called a "trooper," writes the monitoring data obtained from collectors to a memory-mapped file. The memory-mapped file stores a limited, configurable number of data records, e.g., five entries per node. To preserve a history of the collected data, the records must be copied to persistent storage. For this purpose, we have implemented readers that periodically scan the memory-mapped files and store new data in a SQLite3 database. We have implemented four types of collectors, called "scouts": cpu, memory, network, and nvidia_smi. The first three scouts read the process information pseudo-file system (/proc); the last uses the NVIDIA System Management Interface (nvidia-smi), a command-line utility intended to aid in the management and monitoring of NVIDIA GPU devices, specifically Tesla and Fermi-based Quadro devices. Scout also provides a simple API to initialize and run the system.
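
The interplay between the bounded record window and the persistent reader can be illustrated with a small Python sketch. It rests on several assumptions: a `deque` with a fixed length stands in for the memory-mapped file, the sequence number used to detect new records is an invented mechanism, and the `Trooper` and `SqliteReader` classes and the table layout are hypothetical, not Scout's actual API or schema.

```python
import sqlite3
from collections import defaultdict, deque

MAX_RECORDS_PER_NODE = 5  # assumed default; the source says the limit is configurable


class Trooper:
    """Keeps only the most recent records per node (stand-in for the mmap file)."""

    def __init__(self, max_records=MAX_RECORDS_PER_NODE):
        self.window = defaultdict(lambda: deque(maxlen=max_records))
        self.seq = 0  # hypothetical global sequence number

    def receive(self, node, metric, value):
        self.seq += 1
        self.window[node].append((self.seq, metric, value))


class SqliteReader:
    """Periodically copies records it has not seen yet into SQLite."""

    def __init__(self, conn):
        self.conn = conn
        self.last_seq = 0
        conn.execute(
            "CREATE TABLE IF NOT EXISTS samples "
            "(seq INTEGER PRIMARY KEY, node TEXT, metric TEXT, value REAL)"
        )

    def scan(self, trooper):
        new = [
            (seq, node, metric, value)
            for node, records in trooper.window.items()
            for seq, metric, value in records
            if seq > self.last_seq
        ]
        self.conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", new)
        self.conn.commit()
        if new:
            self.last_seq = max(row[0] for row in new)
        return len(new)
```

A reader that scans less often than the window turns over would lose records, which is why the scan period and the window size need to be chosen together.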

Scouts preprocess some data before sending them to the trooper; e.g., for the MEM utilization metric, the raw MB values are converted to a percentage.
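
As a rough illustration of this kind of preprocessing, the Python sketch below turns the text of /proc/meminfo into a utilization percentage. The formula (total minus free, divided by total) and the function name are assumptions; the source does not say which fields or formula the memory scout actually uses.

```python
def mem_utilization_percent(meminfo_text):
    """Return used memory as a percentage of total memory.

    Assumes the /proc/meminfo layout "Name:   value kB" and defines
    "used" as MemTotal - MemFree (an assumption, see the lead-in).
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # first token is the number
    total = fields["MemTotal"]
    free = fields["MemFree"]
    return 100.0 * (total - free) / total
```

On a Linux node the function could be fed with `open("/proc/meminfo").read()`; sending the percentage instead of raw values keeps the records comparable across nodes with different memory sizes.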

Kidron Utils RCE

The software allows CUDA calls to be executed remotely. It consists of two main logical components: the interposer library and the backend.

The interposer library interposes a subset of CUDA calls. The interposed CUDA calls are sent over the network (currently via TCP/IP) to the remote node that has attached graphics accelerators (GPUs). On the remote node, the running backend listens on a predefined port for incoming requests, deserializes them, and executes them on the local GPU. After completion, the backend returns the results (if any) to the interposer library, which returns control to the original CUDA application.

In this release, the backend is a process running on a remote node. The backend process spawns a single thread that listens for incoming requests from clients (i.e., the interposer library).
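
The request/response cycle between the interposer library and the backend can be sketched as follows. Everything specific here is an assumption made for illustration: the length-prefixed JSON wire format, the `vector_add` handler that stands in for GPU work, and the use of `socket.socketpair()` in place of a TCP connection to a predefined port. The actual Kidron Utils RCE protocol is not described in this text.

```python
import json
import socket
import struct
import threading


def send_msg(sock, obj):
    """Serialize obj as JSON and send it with a 4-byte length prefix."""
    data = json.dumps(obj).encode()
    sock.sendall(struct.pack("!I", len(data)) + data)


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes the connection."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode())


# Hypothetical stand-in for work the backend would run on its local GPU.
HANDLERS = {
    "vector_add": lambda a, b: [x + y for x, y in zip(a, b)],
}


def backend_loop(sock):
    """Single backend thread: deserialize, execute, return the result."""
    try:
        while True:
            request = recv_msg(sock)
            result = HANDLERS[request["call"]](*request["args"])
            send_msg(sock, {"result": result})
    except ConnectionError:
        pass  # client went away; stop serving
    finally:
        sock.close()


def remote_call(sock, call, *args):
    """Interposer side: forward the call and block until the result arrives."""
    send_msg(sock, {"call": call, "args": list(args)})
    return recv_msg(sock)["result"]
```

A minimal way to try it is to create a connected pair with `client, server = socket.socketpair()`, start `backend_loop(server)` in a `threading.Thread`, and invoke `remote_call(client, "vector_add", [1, 2], [3, 4])`. Because the client blocks until the reply arrives, the application observes the same synchronous behaviour as a local call.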

The latest release:

Kidron Utils dSimons

Kidron Utils dSimons (the Kidron simple distributed monitoring system) monitors the usage of resources such as CPU, GPU, and network.

The architecture of dSimons is a simple master-worker design. A local monitor, the lmonitor, runs on the actual resource. The lmonitor monitors the resources specified in the configuration file and sends the data to a global monitor, the gmonitor. The host where the gmonitor runs is recorded in the configuration file by the gmonitor itself, which modifies the file.

The gmonitor presents the information to the user. You can run as many lmonitors as you wish, but there should be only one gmonitor, and it should be started first.

The communication is performed over EVPath. More information about EVPath is available on the EVPath main web page.

The latest release is available at:

Shadowfax - Dynamically Composed GPGPU Assemblies (virtualized solution)

GPGPUs have proven to be advantageous for increasing application scalability both in the HPC and enterprise domains. This has resulted in an increase in the array of programming languages and range of physical compute capabilities of current hardware. Yet applications' scalability and portability remain limited with respect to both their degree of customization and the physical limitations of compute nodes to contain any number and composition of devices. This research defines the notion of a GPGPU assembly for CUDA applications resident in Xen virtual machines on high-performance clusters, presenting to applications a set of GPGPUs as locally-available devices to best match their needs, easing programmability and portability. We characterize workloads to best match them with available GPGPUs. Techniques such as API interposition, function marshalling and batching, as well as dynamic binary instrumentation (future work) enable global scheduling policies, admission control and dynamic retargeting of execution streams. 

Pegasus - Coordinated Scheduling in Virtualized Accelerator-based Platforms

Heterogeneous multi-cores -- platforms comprised of both general purpose and accelerator cores -- are becoming increasingly common. While applications wish to freely utilize all cores present on such platforms, operating systems continue to view accelerators as specialized devices. The Pegasus system, developed on top of GViM, uses an alternative approach that offers a uniform usage model for all cores on heterogeneous chip multiprocessors. Operating at the hypervisor level, its novel scheduling methods fairly and efficiently share accelerators across multiple virtual machines, thereby making accelerators into first class schedulable entities of choice for many-core applications. Using the NVIDIA GPU coupled with IA-based general purpose host cores, a Xen-based implementation of Pegasus demonstrates improved performance for applications by better managing combined platform resources. With moderate virtualization penalties, performance improvements range from 18% to 140% over base GPU driver scheduling when accelerators are shared.

GViM - GPU Accelerated Virtual Machines

The use of virtualization to abstract underlying hardware can aid in sharing such resources and in efficiently managing their use by high performance applications. Unfortunately, virtualization also prevents efficient access to accelerators, such as Graphics Processing Units (GPUs), that have become critical components in the design and architecture of HPC systems. Supporting General Purpose computing on GPUs (GPGPU) with accelerators from different vendors presents significant challenges due to proprietary programming models, heterogeneity, and the need to share accelerator resources between different Virtual Machines (VMs).

To address this problem, this paper presents GViM, a system designed for virtualizing and managing the resources of a general purpose system accelerated by graphics processors. Using the NVIDIA GPU as an example, we discuss how such accelerators can be virtualized without additional hardware support and describe the basic extensions needed for resource management. Our evaluation with a Xen-based implementation of GViM demonstrates efficiency and flexibility in system usage, coupled with only small performance penalties for the virtualized vs. non-virtualized solutions.