Sunday, 2021-05-16

Keynote Speakers

Kishor Trivedi holds the Fitzgerald Hudson Chair in the Department of Electrical and Computer Engineering at Duke University, Durham, NC. He has a 1968 B.Tech. (EE) from IIT Mumbai and MS’72/PhD’74 (CS) from the University of Illinois at Urbana-Champaign. He has been on the Duke faculty since 1975. He is the author of a well-known text, Probability and Statistics with Reliability, Queuing and Computer Science Applications, originally published in 1982 by Prentice-Hall and also published as an inexpensive Indian edition by PHI. A thoroughly revised second edition of this book has been published by John Wiley and is also available as an inexpensive Asian edition. The book was recently translated into Chinese. He has also published two other books: Performance and Reliability Analysis of Computer Systems, published by Kluwer Academic Publishers, and Queueing Networks and Markov Chains, published by John Wiley. His fourth book, Reliability and Availability Engineering, was published by Cambridge University Press in 2017. He is a Life Fellow of the Institute of Electrical and Electronics Engineers and a Golden Core Member of the IEEE Computer Society. He has published over 600 articles and has supervised 48 Ph.D. dissertations. His h-index is 98. He is the recipient of the IEEE Computer Society’s Technical Achievement Award for his research on Software Aging and Rejuvenation. His research interests are in reliability, availability, performance and survivability of computer and communication systems and in software dependability.

Keynote Title: Software Aging and Software Rejuvenation

Abstract: The study of software failures has become more important since it was recognized that computer system outages are due more to software faults than to hardware faults. The phenomenon of "software aging", in which the state of the software system degrades with time, has been reported in widely used software and also in high-availability and safety-critical systems. The primary causes of this degradation are the exhaustion of operating system resources, data corruption and numerical error accumulation. This may eventually lead to performance degradation of the software system, a crash/hang failure, or both. To counteract this phenomenon, a proactive approach to fault management, called "software rejuvenation", has been proposed. This essentially involves gracefully terminating an application or a system and restarting it in a clean internal state. This process removes the accumulated errors and frees up operating system resources. The method therefore avoids or postpones unplanned and potentially expensive system outages due to software aging. In this talk, we discuss methods of evaluating the effectiveness of proactive fault management in operational software systems and determining optimal times to perform rejuvenation.



Evgenia Smirni received the Diploma degree in Computer Science and Informatics from the University of Patras, Greece, in 1987 and the Ph.D. degree in Computer Science from Vanderbilt University in 1995. She is the Sidney P. Chockley Professor of Computer Science at the College of William and Mary, Williamsburg, VA, USA. Her research interests include queuing networks, stochastic modeling, Markov chains, resource allocation policies, storage systems, data centers and cloud computing, workload characterization, models for performance prediction, and reliability of distributed systems and applications. She has served as the Program co-Chair of QEST’05, ACM Sigmetrics/Performance’06, HotMetrics’10, ICPE’17, DSN’17, SRDS’19, and HPDC’19. She also served as the General co-Chair of QEST’10 and NSMC’10. She is an ACM Distinguished Scientist.

Keynote Title: Practical Reliability Analysis of GPGPUs in the Wild: from Systems to Applications

Abstract: General Purpose Graphics Processing Units (GPGPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors (faults), often caused by high-energy particle strikes, that can significantly affect application output quality. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative to better understand the reliability of such systems. In this talk, I will present a study of the system conditions that trigger GPU soft errors, using six months of trace data collected from a large-scale, operational HPC system at Oak Ridge National Lab. Workload characteristics, certain GPU cards, temperature and power consumption could be indicative of GPU faults, but it is non-trivial to exploit them for error prediction. Motivated by these observations and challenges, I will show how machine-learning-based error prediction models can capture the hidden interactions among system and workload properties. The above findings raise the question: how can one better understand the resilience of applications if faults are bound to happen? To this end, I will illustrate the challenges of comprehensive fault injection in GPGPU applications and outline a novel fault injection solution that captures the error resilience profile of GPGPU applications.


Dr N. Muralidaran has over 30 years of experience as an Information Technology professional in the Financial Services sector. He has rich experience in building organizations. He has always been result-oriented and a great team builder. His positive outlook has helped him visualize success in every opportunity that has come his way. In his last assignment as the Chief of Special Projects – NSEIL, he was responsible for International Business Development and Exchange collaborations, strategic and futuristic technology, and new initiatives of the exchange. Prior to this, as CEO of NSE Infotech Services Ltd., the captive IT services unit of NSE, he was responsible for the IT strategy, planning, software development and IT services of the exchange. Before joining the NSE group in 2004, he had been the CEO / CTO with the IL&FS group since 1990. In this role he was responsible for strategy planning, design and implementation of IT solutions across different group companies. During this tenure, he was part of the initial core team involved in setting up the National Stock Exchange, from the conceptualization stage through the initial implementation. He continued to be a part of the exchange in an advisory capacity on different committees. He has also worked with Bhabha Atomic Research Centre (BARC) in the Reactor Analysis and Simulation Systems group and with NELCO Ltd in the past. Dr N. Muralidaran holds a Ph.D. from Birla Institute of Technology, Mesra, for his work on "Architecture Model for Scaling Mission Critical Real Time Applications for High Performance" under the management stream. He also holds a PG Diploma in Computer Management and an MBA in Finance.

Keynote Title: Scaling for High Performance

Abstract: The discussion will cover the key factors in scaling mission-critical systems for high performance. We will delve into vision and foresight in architecting and re-architecting for scale, the identification of innovation possibilities, and the ability to identify and mitigate risk.