Thursday, 2021-08-12

Keynote Speakers

Kishor Trivedi

Keynote Title: Software Aging and Software Rejuvenation

Abstract: The study of software failures has become more important since it was recognized that computer system outages are due more to software faults than to hardware faults. The phenomenon of "software aging", in which the state of the software system degrades with time, has been reported in widely used software and also in high-availability and safety-critical systems. The primary causes of this degradation are the exhaustion of operating system resources, data corruption, and numerical error accumulation. This may eventually lead to performance degradation of the software system, a crash/hang failure, or both. To counteract this phenomenon, a proactive approach to fault management, called "software rejuvenation", has been proposed. This essentially involves gracefully terminating an application or a system and restarting it in a clean internal state. This process removes the accumulated errors and frees up operating system resources. The method therefore avoids or postpones unplanned and potentially expensive system outages due to software aging. In this talk, we discuss methods of evaluating the effectiveness of proactive fault management in operational software systems and determining optimal times to perform rejuvenation.

---------------------------------------------------------------------------------------------------------------

Evgenia Smirni

Keynote Title: Practical Reliability Analysis of GPGPUs in the Wild: from Systems to Applications

Abstract: General Purpose Graphics Processing Units (GPGPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors (faults), often caused by high-energy particle strikes, that can significantly affect application output quality. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative to better understand the reliability of such systems. In this talk, I will present a study of the system conditions that trigger GPU soft errors, using six months of trace data collected from a large-scale, operational HPC system from Oak Ridge National Lab. Workload characteristics, certain GPU cards, temperature, and power consumption could be indicative of GPU faults, but it is non-trivial to exploit them for error prediction. Motivated by these observations and challenges, I will show how machine-learning-based error prediction models can capture the hidden interactions among system and workload properties. These findings raise the question: how can one better understand the resilience of applications if faults are bound to happen? To this end, I will illustrate the challenges of comprehensive fault injection in GPGPU applications and outline a novel fault injection solution that captures the error resilience profile of GPGPU applications.

---------------------------------------------------------------------------------------------------------------

NSEIL, he was responsible for the International Business Development and Exchange collaborations, strategic and futuristic technology, and new initiatives of the exchange. Prior to this, as CEO of NSE Infotech Services Ltd., the captive IT services unit of NSE, he was responsible for the IT strategy, planning, software development, and IT services of the exchange. Before joining the NSE group in 2004, he was the CEO / CTO with the IL&FS group from 1990. In this role he was responsible for strategy planning, design, and implementation of IT solutions across different group companies. During this tenure, he was part of the initial core team involved in setting up the National Stock Exchange, from the conceptualization stage through the initial implementation. He continued to be a part of the exchange in an advisory capacity on different committees. He has also worked with Bhabha Atomic Research Centre (BARC) in the Reactor Analysis and Simulation Systems group and with NELCO Ltd. Dr N. Muralidaran holds a Ph.D. from Birla Institute of Technology, Mesra, for his work on "Architecture Model for Scaling Mission Critical Real Time Applications for High Performance", under the management stream. He also holds a PG Diploma in Computer Management and an MBA in Finance.

Keynote Title

Abstract: The discussions will cover the key factors in scaling mission-critical systems for high performance. We will delve into vision and foresight in architecting and re-architecting for scale, the identification of innovation possibilities, and the ability to identify and mitigate risk.