Summary: This webpage describes my research and work experience from the start of graduate school to the present. It does so by (1) providing an overview of the themes of my work, (2) describing the work and research I did prior to my dissertation, (3) describing the research problem of my dissertation and the techniques used to solve it, (4) explaining the research and projects I am currently working on, (5) listing publications that I have authored or co-authored that explain the problems, present the techniques for solving them, and report the results of those techniques, and (6) listing software I have developed to solve these problems.
Here is a link to a presentation summarizing this work.
Here is a link to a description of my research experience.
1. Overview of Work:
My research and work experience focuses on improving the performance of computer programs run on supercomputers to solve problems in science and engineering. I develop performance optimizations and strategies to improve the scalability of applications implemented with a hybrid shared/distributed memory model and run on supercomputers whose nodes contain CPUs or GPUs. Specifically, my research involves performance optimizations, including auto-tuning and loop scheduling, for application code intended to run on supercomputers having nodes with General-Purpose Graphics Processing Units (GPUs) and chip multiprocessors (CMPs).
2. Work Prior to Dissertation
Prior to my dissertation work, I worked on understanding and applying an extension of the classic Design Patterns to the design of high-performance computations, in the context of the NAS Parallel Benchmarks, a set of benchmarks intended to assess the computational power of a supercomputer (see publications 13 and 15). My co-author Edgar Solomonik and I wrote a paper establishing a parallel design pattern for parallel sorting algorithms (see publication 14). In addition, I was involved in work on the shared-memory extensions to the Message Passing Interface standard, also known as the MPI+MPI programming model (see publication 11).
3. Work of Dissertation
Domain decomposition and performance optimization techniques for numerical linear algebra algorithms such as LU, QR, Jacobi, and Gauss-Seidel have been critical for solving problems in a variety of scientific and engineering domains, e.g., solving a system of equations subject to a set of constraints. Many of these computations are iterative in nature, requiring a regular computation followed by a reduction and synchronization step. The emergence of clusters of multi-cores has brought the promise of improved performance, but has introduced new complexities as well. One complexity is obtaining optimal performance from a multi-core node without a large amount of programming effort; this matters for applications run on clusters whose nodes have 16 or more cores.

Yet there is a larger issue, beyond attaining optimal performance on a multi-core, that multi-core architectures raise for large-scale machines: performance predictability. Unpredictable performance on one node, caused by noise most often attributed to system services, is amplified as we scale to a larger number of nodes. Particularly for applications that involve many iterative refinement steps, the small performance variations, i.e., noise, seen on one processor can be amplified on a large-scale cluster and can hinder the scalability of an application to hundreds of thousands of processors. Petrini et al. demonstrated the importance of this issue in 2003 on the ASCI Q cluster. Tsafrir et al. provided a theoretical analysis of the problem. Hoefler et al. demonstrated the effect in 2010 through simulations of up to one million MPI processes. Follow-up studies indicate a tension between the operating system and network software services, both of which try to improve performance, and the application being run on the machine.
Can we modify the application to attain performance predictability on each node? Answering this question can enable the scalability of applications run on current-generation clusters of SMPs with, say, 32 cores per node. Implementations of matrix computations use traditional static domain decomposition, e.g., block-cyclic domain decomposition. A purely static decomposition may not be enough when we scale to large-scale machines; we may need to use dynamic scheduling, particularly because of operating system services over which the application programmer has little to no control. Yet dynamic scheduling has its own overheads, such as cache misses and dequeue costs. But if we carefully tune the amount of dynamic scheduling we do, in proportion to the characteristic system noise of a machine, we can achieve more predictable performance within a node (see published paper 2 on this website). If the nodes all use the same operating system, we can apply the same amount of load balancing across all nodes; if the operating systems differ, we must consider each node individually and tune the load balancing for each node.
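To make the idea concrete, below is a minimal sketch, in C with OpenMP, of this mixed static/dynamic scheduling: a tunable fraction of the loop iterations is scheduled statically for low overhead and locality, and the remainder is scheduled dynamically to absorb noise-induced imbalance. The array, loop body, chunk size, and the particular value of the static fraction are illustrative assumptions, not code from the publications.

```c
#include <omp.h>
#include <stdio.h>

#define N 100000

static double a[N];

int main(void) {
    /* Fraction of iterations scheduled statically; in the thesis work this is
     * tuned per machine to its characteristic noise (the value here is illustrative). */
    double static_fraction = 0.8;
    int n_static = (int)(static_fraction * N);

    #pragma omp parallel
    {
        /* Statically scheduled portion: low overhead, good locality. */
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n_static; i++)
            a[i] = 0.5 * i;

        /* Dynamically scheduled remainder: absorbs noise-induced imbalance. */
        #pragma omp for schedule(dynamic, 64)
        for (int i = n_static; i < N; i++)
            a[i] = 0.5 * i;
    }
    printf("a[N-1] = %.1f\n", a[N - 1]);
    return 0;
}
```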
That said, one research question I would like to answer is: with emerging clusters of NUMA multi-cores and clusters of SMPs, do we need to reconsider the way we parallelize and implement regular algorithms (such as Jacobi, LU, QR, Cholesky, and Gauss-Seidel) in order to scale them to 10,000+ nodes? Below are the specific aspects of this problem that I am studying:
a. Algorithms:
Do we need to enhance representative algorithms such as LU factorization, QR factorization, and Jacobi (stencil) computations with an element of intra-node dynamic scheduling and load balancing for improved performance predictability? Can these load balancing strategies work hand-in-hand with methods for avoiding inter-node communication, such as CALU or CAQR? Finally, can we prove formally that these improved methods perform optimally and are scalable?
b. Performance Tuning for Scalability:
i. Implementation of lightweight intra-node load balancing: By using dynamic scheduling, we can not only mitigate the impact of operating system jitter on a particular platform, but also attain promising scalability trends for our computation. See the publication here: http://www.springerlink.com/content/er65550663380p08/. Note that the solution in this paper is orthogonal to the Charm++ load balancing scheme and is an enhancement of the PLASMA and Cilk dynamic scheduling schemes.
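Extending the sketch above, here is one illustrative way, again in C with OpenMP, that the static fraction might be calibrated at runtime from measured per-thread imbalance (a proxy for noise). The thresholds, work function, and adjustment rule are my own assumptions, not the exact tuning procedure from the publication.

```c
#include <omp.h>
#include <stdio.h>

#define N 100000
#define STEPS 50

static double a[N];

int main(void) {
    double fs = 0.9;   /* static fraction, refined over the time steps */

    for (int step = 0; step < STEPS; step++) {
        int n_static = (int)(fs * N);
        double tmin = 1e30, tmax = 0.0;

        #pragma omp parallel reduction(min:tmin) reduction(max:tmax)
        {
            double t0 = omp_get_wtime();
            #pragma omp for schedule(static) nowait
            for (int i = 0; i < n_static; i++)
                a[i] += 1.0;
            double t = omp_get_wtime() - t0;  /* this thread's static-portion time */
            tmin = t;
            tmax = t;

            /* Dynamically scheduled remainder absorbs the measured imbalance. */
            #pragma omp for schedule(dynamic, 64)
            for (int i = n_static; i < N; i++)
                a[i] += 1.0;
        }

        /* If the threads' static portions diverge noticeably, shift more work to
         * the dynamic part next step; otherwise slowly grow the static part back. */
        if (tmax > 1.10 * tmin && fs > 0.5) fs -= 0.02;
        else if (fs < 0.95)                 fs += 0.01;
    }
    printf("final static fraction: %.2f, a[0] = %.1f\n", fs, a[0]);
    return 0;
}
```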
ii. Tuning communication-intensive codes on multi-cores:
To scale, it may be important to consider adaptive collective communication algorithms that mitigate performance variation caused by the interconnect network’s system services, as well as network contention. Online auto-tuning of collectives, as opposed to offline auto-tuning, can address performance variation in collective communication. The GASNet runtime system has support for this “online” auto-tuning of collective communication. Enhancing its online auto-tuning strategy may be one possible way to attain further performance predictability for applications that, for example, perform a reduction operation at the end of each iterative refinement step.
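The idea of online auto-tuning can be illustrated independently of GASNet. Below is a minimal sketch in C with MPI that times two functionally equivalent reduction strategies during the first iterations of a run and then keeps using whichever has been faster; the variant names, probe count, and selection rule are illustrative assumptions, not GASNet's actual mechanism.

```c
#include <mpi.h>
#include <stdio.h>

/* Strategy 0: reduce to rank 0, then broadcast the result. */
static void reduce_then_bcast(double *val, MPI_Comm comm) {
    double result = 0.0;
    MPI_Reduce(val, &result, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    MPI_Bcast(&result, 1, MPI_DOUBLE, 0, comm);
    *val = result;
}

/* Strategy 1: the library's allreduce. */
static void allreduce(double *val, MPI_Comm comm) {
    double result = 0.0;
    MPI_Allreduce(val, &result, 1, MPI_DOUBLE, MPI_SUM, comm);
    *val = result;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double best_time[2] = {1e30, 1e30};
    int chosen = 0;

    for (int iter = 0; iter < 100; iter++) {
        double x = rank + iter;                         /* stand-in for the iteration's residual */
        int variant = (iter < 10) ? iter % 2 : chosen;  /* probe first, then exploit */

        double t0 = MPI_Wtime();
        if (variant == 0) reduce_then_bcast(&x, MPI_COMM_WORLD);
        else              allreduce(&x, MPI_COMM_WORLD);
        double t = MPI_Wtime() - t0;

        /* Use the slowest rank's time as the cost of this collective call,
         * so all ranks make the same choice. */
        double tmax;
        MPI_Allreduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        if (tmax < best_time[variant]) best_time[variant] = tmax;
        chosen = (best_time[0] <= best_time[1]) ? 0 : 1;
    }
    if (rank == 0) printf("selected reduction strategy: %d\n", chosen);

    MPI_Finalize();
    return 0;
}
```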
c. MPI+PGAS (specifically, MPI+UPC with GASNet):
PGAS models are suitable for computation within a multi-core node, and ongoing work continues to justify their importance for emerging high-performance clusters. By using MPI+UPC, specifically with the GASNet runtime, one can reduce communication overheads through low-latency communication within a many-core node while still using standard MPI communication across nodes. Furthermore, the lightweight load balancing strategy described in (b)(i) above could be extended to work across nodes by using UPC threads rather than pthreads.
d. Performance Modeling:
PRAM and LogP performance models were developed for modeling computation and communication on parallel machines. With a better understanding of, and experimentation with, the performance predictability of multi-core architectures, do we need to consider more parameters in our performance models (particularly for matrix factorizations)? Specifically, can incorporating an element of system noise in our model help us model the scalability of applications?
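As a minimal sketch of what such a noise term might look like (an illustrative assumption, not a model from my publications), suppose each of P processes takes time w + δ_p per iteration, where w is the useful work and δ_p is the per-process noise. A bulk-synchronous iteration that ends in a reduction then completes in roughly the time of the slowest process:

    T_iter ≈ w + E[ max over p = 1..P of δ_p ] + T_reduce(P)

If the δ_p are, say, independent and exponentially distributed with mean λ, the expected maximum grows like λ·ln(P), so even small per-node noise can come to dominate w at hundreds of thousands of processes. A noise parameter of this kind is one way system noise could be folded into a LogP-style model.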
e. Design Patterns for Parallel Programming (Parallel Patterns): The methodology of documenting software solutions through design patterns has proven to be a powerful technique for uni-processor computation, and it seems particularly important now for multi-cores. A key component of the original design patterns is the identification of forces in software design. Good parallel programs and parallel solutions can be understood through systematic identification of how forces and tradeoffs are balanced in a specific parallel code; this is related to what is also known as hardware-software co-design. I remain involved in discussions on parallel patterns and have an entry on parallel sorting at the following link: http://parlab.eecs.berkeley.edu/wiki/patterns. Note that this entry is an adapted journal version of publication 10.
During my postdoctoral position and my current position, my primary focus has been on developing new technology through simulations run on supercomputers and on improving the performance of production-level application programs used to understand scientific phenomena. In my postdoctoral position, I worked on improving the performance of an application program intended to better understand engineering systems that use plasma combustion, e.g., for power generation and fuel processing. In my current position, I have worked to improve the performance of an application that creates a three-dimensional image reconstruction of a computer chip, using ptychography and tomography solvers, in order to better understand the chip's behavior. I have worked closely with researchers knowledgeable in combustion and in image reconstruction to improve the performance of the combustion simulation and the image reconstruction application, respectively. I have used traditional performance tuning strategies such as auto-tuning and combined them with low-overhead loop scheduling techniques to enable the efficient execution of these application programs on supercomputers.
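As a minimal illustration of combining auto-tuning with loop scheduling (in C with OpenMP; the candidate chunk sizes, kernel, and timing-based search are my own illustrative assumptions, not code from these applications), one can search over dynamic-scheduling chunk sizes and keep the best-performing one:

```c
#include <omp.h>
#include <stdio.h>
#include <math.h>

#define N 1000000

static double a[N];

/* One sweep of an illustrative kernel using the given dynamic chunk size. */
static double sweep(int chunk) {
    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic, chunk)
    for (int i = 0; i < N; i++)
        a[i] = sin((double)i) * 0.5;
    return omp_get_wtime() - t0;
}

int main(void) {
    int candidates[] = {16, 64, 256, 1024, 4096};
    int ncand = sizeof candidates / sizeof candidates[0];
    int best = candidates[0];
    double best_t = 1e30;

    /* Auto-tuning loop: time each candidate chunk size and keep the fastest. */
    for (int c = 0; c < ncand; c++) {
        double t = sweep(candidates[c]);
        printf("chunk %4d: %.4f s\n", candidates[c], t);
        if (t < best_t) { best_t = t; best = candidates[c]; }
    }
    printf("selected chunk size: %d\n", best);

    /* Production runs would then use the selected chunk size. */
    sweep(best);
    return 0;
}
```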
During the past year, I've been involved in developing and supporting features arising from my thesis research within the OpenMP standard. The work has developed from (a) a short paper I published at OpenMPCon 2017 (see publication 2) and (b) my continued involvement in the committee for the OpenMP standard. I'm working to add a feature to the specification, targeted at an upcoming release of OpenMP, that supports defining multi-core loop scheduling strategies at an abstract level. Specifically, the feature allows users of an application program implemented with an OpenMP library to define a loop scheduling strategy for the application to use, instead of choosing only from the system-defined strategies. The feature is motivated by the desire to make the loop scheduling strategies in my thesis work (and novel strategies by others in the future) widely usable. Discussions at a recent biannual meeting of the committee have spurred development of the feature in Intel’s OpenMP compiler and runtime, led by a senior software engineer at Intel who is also a member of the committee for the OpenMP standard. My work on additions to the standard is important from a research point of view because the features can be used to experiment more rapidly with loop scheduling strategies in programming libraries that are widely available on the supercomputers that support science and engineering simulations. More information about the benefits of the extension and its current status is at http://bit.ly/udsomp.
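To illustrate what a user-defined loop scheduling strategy amounts to, below is a sketch in C with OpenMP of a custom dynamic schedule hand-rolled with an atomic counter inside a parallel region; the proposed feature aims to let strategies like this be expressed once and selected through the schedule clause rather than inlined by hand. The code is my own illustration, not the proposed interface.

```c
#include <omp.h>
#include <stdio.h>

#define N 100000

int main(void) {
    static double a[N];
    int next = 0;        /* shared iteration counter for the custom schedule */
    int chunk = 128;     /* a custom policy could vary this, e.g., shrink it over time */

    #pragma omp parallel
    {
        for (;;) {
            int start;
            /* Atomically grab the next block of iterations. */
            #pragma omp atomic capture
            { start = next; next += chunk; }
            if (start >= N) break;
            int end = (start + chunk < N) ? start + chunk : N;
            for (int i = start; i < end; i++)
                a[i] = 2.0 * i;
        }
    }
    printf("a[N-1] = %.1f\n", a[N - 1]);
    return 0;
}
```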
Separately from the work on user-defined schedules for OpenMP, I'm starting to help computer science researchers at Brookhaven National Laboratory (BNL) and Lawrence Livermore National Laboratory (LLNL) add a feature to the OpenMP standard that, when enabled, hints to the OpenMP runtime system that dynamically scheduled work in an OpenMP parallel region should be executed on the core of a supercomputer’s node on which it was executed in the previous invocation of that region, provided the core isn’t busy with other work; otherwise, the work defaults to being randomly scheduled onto another core.
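A rough sketch of this affinity-hinted scheduling, written in C with OpenMP purely for illustration (the arrays, chunking, and claim mechanism are my own assumptions, not the interface being proposed): each chunk of iterations is first offered to the thread that ran it in the previous iteration, and any chunks left unclaimed are then picked up by whichever threads are free.

```c
#include <omp.h>
#include <string.h>
#include <stdio.h>

#define NCHUNKS 1024
#define CHUNK   64

static double a[NCHUNKS * CHUNK];
static int owner[NCHUNKS];       /* thread that ran each chunk in the previous iteration */
static int new_owner[NCHUNKS];   /* thread that runs each chunk in this iteration */
static int claimed[NCHUNKS];     /* claim flags, reset every iteration */

/* Atomically claim chunk c; returns 1 if this call got the claim. */
static int try_claim(int c) {
    int prev;
    #pragma omp atomic capture
    { prev = claimed[c]; claimed[c] += 1; }
    return prev == 0;
}

static void run_chunk(int c, int tid) {
    for (int i = c * CHUNK; i < (c + 1) * CHUNK; i++)
        a[i] += 1.0;
    new_owner[c] = tid;          /* remember affinity for the next iteration */
}

int main(void) {
    /* Initial assignment: spread chunks round-robin over the available threads. */
    int nthreads = omp_get_max_threads();
    for (int c = 0; c < NCHUNKS; c++)
        owner[c] = c % nthreads;

    for (int iter = 0; iter < 10; iter++) {
        memset(claimed, 0, sizeof claimed);
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            /* Pass 1: prefer chunks this thread ran last time (warm caches). */
            for (int c = 0; c < NCHUNKS; c++)
                if (owner[c] == tid && try_claim(c))
                    run_chunk(c, tid);
            /* Pass 2: pick up any chunk still unclaimed (its preferred thread was busy). */
            for (int c = 0; c < NCHUNKS; c++)
                if (try_claim(c))
                    run_chunk(c, tid);
        }
        memcpy(owner, new_owner, sizeof owner);
    }
    printf("a[0] = %.1f\n", a[0]);
    return 0;
}
```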
8. Vivek Kale, Simplice Donfack, Laura Grigori, William D. Gropp. Improving the Performance of Scientific Applications by Balancing the Tradeoff between Load Balancing and Locality. SC 2014. November 2014. New Orleans, USA. [pdf]
9. Vivek Kale, Amanda Peters Randles, William D. Gropp. Locality-Optimized Mixed Static/Dynamic Scheduling for Load Balancing on SMPs. EuroMPI/ASIA 2014. September 2014. Kyoto, Japan. [pdf]
10. Simplice Donfack, Laura Grigori, Vivek Kale, William D. Gropp. Hybrid Static/Dynamic Scheduling for Already Optimized Dense Matrix Factorization. IPDPS 2012. May 2012. Shanghai, China. [pdf]
11. Vivek Kale, Abhinav Bhatele, William D. Gropp. Weighted Locality-Sensitive Scheduling for Noise Mitigation on Multi-core Clusters. HiPC 2011. December 2011. Bangalore, India. [pdf]
12. Vivek Kale and William D. Gropp. Load Balancing Regular Meshes on SMPs with MPI. EuroMPI 2010. September 2010. Stuttgart, Germany. [pdf] (Selected as a Best Paper).
13. Vivek Kale and Simplice Donfack. Parallel Design Patterns for Lightweight Dynamic Scheduling Strategies. ParaPLoP 2011. May 2011. Carefree, USA. [pdf]
14. Vivek Kale, Jayanta Mukherjee, Indranil Gupta. HadoopJitter: The Ghost in the Cloud and How To Tame It. University of Illinois Technical Report. October 2011. Urbana, USA. [pdf]
15. Vivek Kale. The Correlation between Parallel Patterns and the NAS Parallel Benchmarks. IMWCSE 2010. May 2010. Johannesburg, South Africa. [pdf]
16. Vivek Kale and Edgar Solomonik. Parallel Sorting Pattern. ParaPLoP 2010. March 2010. Carefree, USA. [pdf]
17. Vivek Kale. Improving the NAS Parallel Benchmarks: A Parallel Patterns Approach. ParaPLoP 2010. March 2010. Carefree, USA. [pdf]
18. Vivek Kale. Enabling Simulation of Renewable Energy Solutions through Supercomputers. Technical Report TR0210-373 at the University of Illinois at Urbana-Champaign. June 2009. Urbana, USA. [pdf]