Abstract:
Continuously shrinking transistor sizes and the aggressive low-power operating modes employed by modern architectures tend to increase transient error rates. A reliability metric is required in order to evaluate approaches that address soft errors. This thesis explores soft error vulnerability analysis of parallel applications running on multicore architectures. We propose and evaluate a novel metric, Thread Vulnerability Factor, to quantify thread vulnerability and to qualify the relative vulnerability of parallel applications to soft errors. We present the analytical definition of the metric and develop a framework that calculates its value from gathered application data. To demonstrate the validity of the metric, we conduct fault injection experiments on multithreaded applications. This thesis also presents a performance-vulnerability analysis of a variety of parallel applications and discusses the effects of design choices on system performance and reliability. By considering the tradeoff between these two concerns, we observe that the design choice becomes clear for some applications that exhibit different vulnerability values with almost equal performance. Additionally, we propose and evaluate reliability-aware core partitioning schemes for multicore architectures. A simulation study with various workloads, each consisting of multiple multithreaded applications, is performed to evaluate the proposed partitioning schemes. We also present a thread-level vulnerability assessment tool that takes user preferences into account, and we propose a novel critical thread identification algorithm to determine the critical thread and the critical thread region in a multithreaded application. We use this algorithm to select the thread for redundant execution in a partial fault tolerance system and demonstrate the efficiency of the method by providing vulnerability values for executions with different redundancy levels.