Access Restriction Open

Source CiteSeerX
Content type Text
File Format PDF
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Subject Keyword Dense Matrix Factorization ♦ Algorithm-based Fault Tolerance ♦ Fault Tolerance ♦ Checksum Storage Leftover ♦ Subject Descriptor System Software Software ♦ Linear Equation ♦ High-performance Computing ♦ Theoretical Evaluation ♦ Wide Range ♦ Fault-tolerant Algorithm ♦ Theoretical Analysis ♦ High Degree ♦ Generic Solution ♦ Fast Decline ♦ Mean Time ♦ Ever-growing Scale ♦ Reliable Component ♦ Left Factor ♦ Scalable Checkpointing Algorithm ♦ Extreme Condition ♦ Cray XT5 ♦ Scientific Application ♦ Hybrid Solution ♦ Minor Modification ♦ Factorizations Algorithm Survive Fail-stop Failure ♦ QR Factorization ♦ Confirm Negligible Overhead ♦ Computing Unit ♦ Square Problem ♦ Single Failure ♦ Right Factor ♦ Problem Size ♦ Experimental Result ♦ Right Factor Protection ♦ New Hybrid Approach
Abstract Dense matrix factorizations, like LU, Cholesky and QR, are widely used in scientific applications that require solving systems of linear equations, eigenvalue problems, and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all of the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead decreases sharply as the number of computing units and the problem size scale up. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, both with and without errors. Categories and Subject Descriptors System Software [Software approaches for fault tolerance and resilience]: Software for high-performance computing
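The checksum idea behind the right factor protection can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a small undistributed matrix, a single checksum column holding the row sums, elimination without pivoting, and the loss of one data column only (the paper also considers losing data and checksum together). The function names are hypothetical. Because each elimination step is a linear row operation applied to the checksum column as well, the checksum stays consistent with the updated data and a lost column can be rebuilt from it.

```python
from fractions import Fraction

def with_checksum(a):
    """Append a checksum column holding the sum of each row."""
    return [row + [sum(row)] for row in a]

def eliminate_step(m, k):
    """One right-looking elimination step (no pivoting, an assumption of
    this sketch), applied to every column including the checksum."""
    for i in range(k + 1, len(m)):
        factor = m[i][k] / m[k][k]
        for j in range(k, len(m[i])):
            m[i][j] -= factor * m[k][j]

def recover_column(m, lost):
    """Rebuild column `lost` as checksum minus the surviving data columns."""
    n = len(m[0]) - 1  # index of the checksum column
    for row in m:
        row[lost] = row[n] - sum(row[j] for j in range(n) if j != lost)

# Exact rational arithmetic keeps the comparison below free of rounding.
a = [[Fraction(x) for x in row] for row in [[4, 3, 2], [2, 1, 3], [6, 5, 1]]]
m = with_checksum(a)
eliminate_step(m, 0)

expected = [row[:] for row in m]  # state before the simulated failure
for row in m:
    row[2] = None                 # simulate losing data column 2
recover_column(m, 2)
```

After `recover_column`, `m` matches `expected`. The sketch overwrites the eliminated entries with zeros; in an in-place factorization those positions would hold the left factor, which the row-sum checksum no longer covers. That gap is what the abstract addresses with checkpointing.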
Educational Role Student ♦ Teacher
Age Range above 22 year
Educational Use Research
Education Level UG and PG ♦ Career/Technical Study
Learning Resource Type Article