Windows server software works seamlessly with most hardware vendors that offer fault-tolerant systems. Discuss fault tolerance approaches that systems managers use to assure continuity of operations. 200 words minimum, plus references.

Fault tolerance is a design property that allows a system to keep operating even when one of its parts fails. Systems managers use several fault tolerance approaches to assure continuity of operations, and these approaches can be divided into three categories.

  1. Hardware techniques: Faults are handled in the hardware units themselves. Memory errors are handled with error detection and correction methods, and network errors with Automatic Path Migration (APM), where a secondary path is kept alongside the primary path. Processor errors are handled at this level as well.
  2. System software techniques: Programs written by experts to deal with architectural complexity, such as the operating system and MPI, can be used to handle faults within the system.
  3. Application-based techniques: Scientists develop domain-specific classes, libraries, and languages, which programmers can then use to handle faults.
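
To make the error detection and correction idea from the hardware category concrete, here is a minimal sketch of a Hamming(7,4) code in Python. It is only an illustration: real ECC memory does this in hardware, usually with wider codes, and the function names below are made up for this example. Four data bits are stored with three parity bits, and any single flipped bit can be located and repaired.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Codeword layout (positions 1..7): p1, p2, d1, p3, d2, d3, d4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position).

    error_position is 0 when no error was detected, otherwise the
    1-based position of the single bit that was corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3  # syndrome read as a binary number
    if error_position:
        c[error_position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], error_position


# Usage: store a word, simulate a single-bit memory fault, then recover it.
stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulated fault at position 5
recovered, position = hamming74_decode(stored)
```

This code corrects one flipped bit per codeword. If two bits flip, the decoder will "correct" the wrong bit, which is why practical memory systems add an extra parity bit to at least detect double errors.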

Several methods are used to achieve fault tolerance, such as:

  1. Storage method: Important data is stored in a backup. There are two types of storage device: electromechanical devices such as hard disk drives, and solid-state drives (SSD). Because of the higher cost of storing this persistent data, a Redundant Array of Independent Disks (RAID) is used, which in most configurations keeps redundant information across several disks so that the failure of one drive does not lose the data.
  2. Checkpointing method: Global checkpoints are placed at intervals during execution. When a fault occurs, the application resumes from the most recent checkpoint instead of restarting from the beginning.
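
The checkpointing method can be sketched in a few lines of Python. This is a simplified, single-process illustration and not the coordinated global checkpointing used on parallel machines. The file name, the `every` interval, and the `fail_at` parameter (which only simulates a fault) are assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the old
    checkpoint, so a crash during the write cannot corrupt the last good copy."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, every=10, fail_at=None):
    """Sum the step numbers 0..n_steps-1, checkpointing every `every` steps."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated fault")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If `run` is interrupted at step 25 with a checkpoint interval of 10, the next call to `run` with the same path picks up at step 20 rather than step 0. The interval is a trade-off: frequent checkpoints cost more I/O, while infrequent ones mean more work is repeated after a fault.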


References:

  1. Cray XT5 Compute Blade. 2011. http://wwwjp.cray.com/downloads/CrayXT5Blade.pdf (accessed Nov 2011).
  2. Agarwal, R. Garg. 2004. Adaptive incremental checkpointing for massively parallel systems. In 18th Annual International Conference on Supercomputing, New York, 2004.


