Vivek Kale∗, Jayanta Mukherjee†, Indranil Gupta‡, William Gropp§
Department of Computer Science, University of Illinois at Urbana-Champaign
201 North Goodwin Avenue, Urbana, IL 61801-2302, USA
Email: ∗ firstname.lastname@example.org, † email@example.com, ‡ firstname.lastname@example.org, § email@example.com
Abstract—The small performance variation within each node of a cloud computing infrastructure (i.e., a cloud) can be a fundamental impediment to the scalability of a high-performance application. This performance variation (referred to as jitter) particularly impacts the overall performance of scientific workloads running on a cloud. Studies show that the primary sources of performance variation are disk I/O and the underlying communication network. In this paper, we explore opportunities to improve the performance of high-performance applications running on emerging cloud platforms. Our contributions are (1) the quantification and assessment of the performance variation of data-intensive scientific workloads on a small set of homogeneous nodes running Hadoop and (2) the development of an improved Hadoop scheduler that can improve the performance (and potentially the scalability) of these applications by leveraging the intrinsic performance variation of the system. Using our enhanced scheduler for data-intensive scientific workloads, we obtain a performance gain of more than 21% over the default Hadoop scheduler.
I. INTRODUCTION
Certain high-performance applications, such as weather prediction or algorithmic trading, require the analysis and aggregation of large amounts of data that is geo-spatially distributed across the world, in a very short amount of time (i.e., on demand).
A traditional supercomputer may be neither a practical nor an economical solution because it is not suited to handling data that is distributed across the world. For such application domains, the ease and low cost of getting access to a cloud have been shown to be an advantage over high-performance clusters. The strength of cloud computing infrastructures has been the reliability and fault tolerance they provide to an application at a very large scale. Google's MapReduce programming model and Yahoo's subsequent implementation of Hadoop have allowed one to harness the power of such cloud infrastructures.
However, for certain applications (particularly data-intensive scientific workloads), the small performance variations in a distributed system are a fundamental impediment to application scalability at large scale. It has been observed that the primary sources of performance variation are disk I/O and interactions with the memory hierarchy, communication network delay, network congestion, and bandwidth-related issues. Since the vision in cloud computing is to use clusters of commodity machines in order to scale up quickly, we believe the problem of performance variation on a cloud is particularly relevant if clouds are to be used for such applications.
Admittedly, a distributed system's jitter due to performance variation of the hardware can be reduced by using enhanced networking hardware or better data storage devices such as solid-state drives. However, such high-quality devices are typically very expensive and not widely available as commodity parts.
The expense of such devices goes against the cloud computing philosophy of using cheap commodity machines. In this case, we believe the classic end-to-end argument is valid. The problem seems to lie at a higher layer than hardware, and so our approach is to directly modify the scheduling algorithm so that it provides performance predictability while still maintaining good load balance.
Thus, to mitigate the overhead caused by performance variation on a Hadoop cluster, we first categorize the sources of jitter in a…