Fault Tolerance
I. Introduction
Since the 1970s, information management has no longer been seen as a simple task that a small crew can perform on a modest budget (Laudon & Laudon 2006). Under such circumstances, one of the basic purposes of creating software is to keep pace with developments in data management. The field of data management has advanced considerably within the last decade, and people have more technological needs than ever before.
Furthermore, people today are looking not only for the easiest way to do their work (which is by using technology), but also for the best and most innovative way to take advantage of existing facilities to enhance the quality of that work.
Along with the advancement and benefits generated by technology, there are also risks to manage. Keeping large amounts of information on a hard drive (paperless work) may save a lot of paper and storage room; nevertheless, it also creates larger security threats.
These risks emerge because hard drives and other forms of digital storage are prone to damage, as electronic equipment is vulnerable to handling and to magnetic fields. The situation suggests the need to manage information on reliable computer and network systems.
Companies prefer to accept and manage these risks because managers understand that data corruption and data loss can also occur in manual data storage systems. Companies usually respond by designing a fault tolerance system, that is, a system that allows the corporate system to respond gracefully to unexpected failures. There are many types of fault tolerance systems. In this paper, I discuss some well-recognized fault tolerance systems and evaluate the strategies behind them. The systems discussed are transaction journalizing and rollbacks, database shadowing, RAID technologies, network redundancies, backup CPU, and symmetric multiprocessing.
II. Transaction Journalizing and Rollbacks
When errors occur in managing data, transaction journals that have been kept daily for months or years could be lost. However, recovery from errors can be performed through several methods. In general, there are two kinds of data recovery systems: roll-forward and roll-back. In simple terms, roll-forward recovery ‘pauses’ a failing system in its current state and corrects it. While the system is halted, repairs can be made, after which the system moves forward.
Roll-back, on the other hand, is a mechanism that reverts the system to some past condition in order to obtain a correct version of the system, and then moves forward from that condition. A well-known roll-back method uses checkpoints. Roll-back is often considered better than roll-forward because it offers a faster data recovery process. Nevertheless, the roll-back mechanism requires that the operations between the checkpoint and the detected erroneous state be idempotent, meaning that repeating them gives the same result as performing them once. Most companies use both roll-forward and roll-back recovery within their fault tolerance systems, applying each in different parts of the system (Denning 1976).
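As an illustration of checkpoint-based roll-back, the following is a minimal sketch in Python. The class name JournaledStore and its methods are hypothetical and chosen only for this example. It assumes a small in-memory key-value store whose writes are idempotent, so that journaled operations can safely be replayed after a roll-back.

```python
import copy


class JournaledStore:
    """Toy key-value store with a transaction journal and checkpoint roll-back.

    Hypothetical example class; not taken from any real database product.
    """

    def __init__(self):
        self.data = {}
        self.journal = []        # operations applied since the last checkpoint
        self._checkpoint = {}    # snapshot of a state known to be correct

    def checkpoint(self):
        """Save the current state and start a fresh journal."""
        self._checkpoint = copy.deepcopy(self.data)
        self.journal = []

    def set(self, key, value):
        """Idempotent write: repeating it leaves the store in the same state."""
        self.journal.append((key, value))
        self.data[key] = value

    def rollback(self):
        """Revert to the last checkpoint and return the discarded journal."""
        discarded = self.journal
        self.data = copy.deepcopy(self._checkpoint)
        self.journal = []
        return discarded

    def replay(self, operations):
        """Re-apply journaled operations after a roll-back."""
        for key, value in operations:
            self.set(key, value)


store = JournaledStore()
store.set("balance", 100)
store.checkpoint()                # state known to be correct
store.set("balance", 150)         # valid transaction
store.set("balance", -1)          # erroneous transaction, detected later
discarded = store.rollback()      # back to the checkpoint
store.replay(discarded[:-1])      # move forward again without the bad entry
```

Because each write simply sets a value, replaying the journal twice would give the same result as replaying it once, which is the idempotence requirement described above.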
III. Database Shadowing
There are three basic methods of building a fault tolerance system:
· Replication
Replication means providing identical or near-identical copies of the system or of parts of the system. All of these copies work in parallel to finish their tasks. The method is also known as shadowing or mirroring.
· Redundancy
This means deploying several of the same system to perform the same tasks or in other words, creating multiple identical instances of the system. In case of failure, management can switch to the remaining instances.
· Diversity
Diversity, on the other hand, is a method of managing failures by providing several different implementations of the same system. Management can switch to another implementation in case of failure. These different implementations are used like replicated systems to deal with errors.
(Linden, 1976)
Database shadowing belongs to the replication method of fault tolerance. Most fault tolerance mechanisms mirror the actual mechanisms performed in the system. This means that there are duplicates of the actual system, so that when the primary system fails, the backup system can take over. Database shadowing is a fault tolerance system built on this principle. Here, however, the mirroring covers all parts of the system. In a sense, the entire database system is supported by a backup system (Linden, 1976).
The system is often used to maintain a large database that cannot afford to be shut down for maintenance. Its advantage is that management does not have to suffer delays caused by system failure. Its obvious disadvantage is the expense of providing a mirror of the entire database system, which could cost millions.
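The take-over behaviour described above can be sketched as follows. This is a simplified illustration only: the class ShadowedDatabase and its methods are hypothetical, both copies are plain in-memory dictionaries, and a real shadowed database would also have to handle resynchronization of a repaired primary, which is not shown.

```python
class ShadowedDatabase:
    """Toy primary/shadow pair: every write is mirrored, reads fail over.

    Hypothetical example class used only to illustrate the principle.
    """

    def __init__(self):
        self.primary = {}
        self.shadow = {}
        self.primary_up = True

    def write(self, key, value):
        # Mirror every write so the shadow always matches the primary.
        if self.primary_up:
            self.primary[key] = value
        self.shadow[key] = value

    def read(self, key):
        # Serve from the primary while it is healthy, otherwise from the shadow.
        source = self.primary if self.primary_up else self.shadow
        return source[key]

    def fail_primary(self):
        """Simulate a crash of the primary system."""
        self.primary_up = False
        self.primary = {}


db = ShadowedDatabase()
db.write("customer:1", "Alice")
db.fail_primary()                  # the primary goes down
name = db.read("customer:1")       # still answered, now by the shadow
```

The sketch also makes the cost visible: every record is stored twice, which is why mirroring an entire database system is expensive.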
IV. RAID Technologies
Redundant Array of Independent Disks (RAID) is also a well-known fault tolerance system. It is an example of a fault-tolerant storage system that belongs to the redundancy method. In a RAID solution, part of the physical storage capacity stores redundant data on the hard disks. This information can be parity information or a separate copy of the data. If one of the disks, or a sector of a disk, fails, the stored redundant information can be used to regenerate the required information (‘The 3ware’, 1999).
Using the RAID solution, management can increase the fault tolerance of the organization. Management can use either disk mirroring or disk striping with parity to create redundant data on the hard disks. Nevertheless, although disk mirroring creates duplicate volumes that take over if a single disk fails, it does not prevent damaged files from being copied to the other mirrors. Therefore, management cannot use disk mirroring as a substitute for backing up important data on the servers.
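How parity information can regenerate lost data is easiest to see with the XOR operation, which is the usual basis of single-parity schemes. The sketch below is an illustration under simplifying assumptions: the function names are hypothetical, each "disk" is a short byte string of equal length, and only one disk is lost at a time.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute the parity block that is stored alongside the data blocks."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks, parity):
    """Regenerate the single missing block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


disks = [b"ACCT", b"1000", b"USD "]     # three data disks (illustrative content)
parity = make_parity(disks)             # stored on a fourth disk

# Suppose the second disk fails: XOR of the survivors and the parity restores it.
rebuilt = rebuild_missing([disks[0], disks[2]], parity)
```

XOR works here because combining a value with itself cancels out, so XOR-ing the parity with all surviving blocks leaves exactly the missing block.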
There are various types of RAID solutions, including RAID-0, RAID-1, RAID-0+1 and RAID-5. Each is suited to different system requirements. In choosing among them, managers generally weigh two main considerations: cost and reliability. In terms of implementation cost, RAID-0+1 is the most expensive because double the amount of disk space must be provided. Nevertheless, this method is described as the most reliable because two or more mirrored disks must fail before data is lost. RAID-1 is the least expensive of the redundant configurations to set up because it has the smallest minimum requirement for disk drives (The 3ware, 1999).
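The cost side of this comparison can be made concrete with the standard usable-capacity arithmetic for each level. The function below is a rough sketch, not a sizing tool: its name is hypothetical, it assumes all disks are the same size, and it models RAID-1 as a set of disks that all hold the same copy.

```python
def usable_capacity(level, disks, disk_size_gb):
    """Return usable capacity in GB for an array of identical disks.

    Hypothetical helper; the formulas are the textbook ones for each level.
    """
    if level == "RAID-0":
        minimum, usable = 2, disks * disk_size_gb          # striping, no redundancy
    elif level == "RAID-1":
        minimum, usable = 2, disk_size_gb                  # every disk is a mirror
    elif level == "RAID-0+1":
        if disks % 2:
            raise ValueError("RAID-0+1 needs an even number of disks")
        minimum, usable = 4, disks * disk_size_gb // 2     # striped set, mirrored
    elif level == "RAID-5":
        minimum, usable = 3, (disks - 1) * disk_size_gb    # one disk's worth of parity
    else:
        raise ValueError(f"unknown RAID level: {level}")
    if disks < minimum:
        raise ValueError(f"{level} needs at least {minimum} disks")
    return usable


# Four 500 GB disks give 2000 GB raw; the usable share depends on the level.
for level in ("RAID-0", "RAID-0+1", "RAID-5"):
    print(level, usable_capacity(level, 4, 500), "GB usable")
```

With four 500 GB disks, RAID-0+1 leaves half of the raw capacity usable, while RAID-5 gives up only one disk's worth of space to parity.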
V. Network Redundancies
By definition, redundancy means the duplication of electronic processing, and thus of equipment, in order to protect the flow or transfer of information from failure (Answers Corporation, 2007).
One technical concern in data transfer is packet loss, in which part of the transmitted information never arrives. The missing part may be important: losing a single “0” from “$1,000,000”, for example, makes a large difference at the receiving end (Figure 1).
Figure 1 Packet Loss in Voice Network
Source: Telogy Networks. (2000). VoIP Implementation Challenges. San Jose
Packet loss may result from several causes, including the following:
Network Congestion/Performance
Network Architecture
Improper Jitter Buffer Size
Software Not Designed for Peak Load
(Telogy Networks, 2000)
Network redundancy is a solution for preserving data in transit. One such system is the Storage Area Network (SAN) solution, which uses Fibre Channel switching technology to improve fault tolerance capabilities. The software creates data flow redundancy by providing multiple pathways to stored data. The technology also provides connectivity between multi-vendor systems. One disadvantage is that the system requires hardware that supports high I/O bandwidth; without such hardware, the risk of data corruption and loss increases (Denning, 1976).
According to Nicoud (2003), to provide network redundancy over an IP (Internet Protocol)-based connection, a company must have two different IP addresses that deliver data via different routes. Figure 2 shows how a company can set up a redundant, no-downtime network in which, when a fault occurs in the satellite link, the system automatically delivers information over the fiber-optic link instead.
Figure 2 Network Redundancy
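The automatic switch from one route to another can be sketched in a few lines. This is an illustration only: the function send_with_failover and the two link functions are hypothetical stand-ins, and a failed route is assumed to signal its fault by raising ConnectionError rather than by any real network protocol.

```python
def send_with_failover(message, routes):
    """Try each (name, send_function) route in order.

    Returns the name of the first route that delivers the message and raises
    ConnectionError only if every route fails.
    """
    errors = []
    for name, send in routes:
        try:
            send(message)
            return name
        except ConnectionError as exc:
            errors.append(f"{name}: {exc}")
    raise ConnectionError("all routes failed: " + "; ".join(errors))


delivered = []


def satellite_link(message):
    # Hypothetical primary route, simulated here as being down.
    raise ConnectionError("satellite link down")


def fiber_link(message):
    # Hypothetical secondary route that records what it delivers.
    delivered.append(message)


used = send_with_failover(
    "invoice 1042",
    [("satellite", satellite_link), ("fiber", fiber_link)],
)
```

In this run the satellite route fails, so the message is delivered over the fiber route without the sender having to intervene.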
VI. Backup CPU
CPU stands for central processing unit, or microprocessor. Because the CPU acts as the brain of a computer system and plays an important role in computation, there is a need to provide backup CPUs. With a backup in place, a fault in one CPU is covered by another CPU, which keeps the system running.
VII. Symmetric Multiprocessing
Symmetric multiprocessing (SMP) is a multiprocessor computer architecture in which two or more identical processors are connected to a shared main memory. The system allows any processor to work on any task and moves tasks between processors to balance the workload. Its advantage over other systems is that, when multiple programs are running at the same time, an SMP system performs better than a uniprocessor system. This is because different programs can run on different CPUs at the same time (Linden, 1976).
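The workload balancing described above can be simulated with a simple greedy rule: give each task to whichever processor currently has the least work. The sketch below is a simulation, not an operating system scheduler. The function name balance_tasks, the task names and the cost figures are all hypothetical, and every processor is assumed to be identical, as in an SMP system.

```python
import heapq


def balance_tasks(task_costs, processor_count):
    """Assign each task to the currently least-loaded processor.

    task_costs maps a task name to its running time in arbitrary units.
    Returns (assignment, loads), both keyed by processor index.
    """
    heap = [(0, cpu) for cpu in range(processor_count)]   # (load, processor)
    heapq.heapify(heap)
    assignment = {cpu: [] for cpu in range(processor_count)}

    # Placing the longest tasks first tends to give a more even split.
    for task, cost in sorted(task_costs.items(), key=lambda item: -item[1]):
        load, cpu = heapq.heappop(heap)
        assignment[cpu].append(task)
        heapq.heappush(heap, (load + cost, cpu))

    loads = {cpu: sum(task_costs[t] for t in tasks)
             for cpu, tasks in assignment.items()}
    return assignment, loads


programs = {"payroll": 4, "backup": 3, "report": 2, "mail": 1}
assignment, loads = balance_tasks(programs, processor_count=2)
finish_time_smp = max(loads.values())        # both processors run in parallel
finish_time_single = sum(programs.values())  # one processor runs everything
```

In this example the two processors each receive five units of work, so the whole set finishes in half the time a single processor would need.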
VIII. Conclusion
Information is a valuable component of a computer system because it conveys critical data. To ensure that information is stored and delivered securely, there should be a fault tolerance system. This paper has discussed several such systems: transaction journalizing and rollbacks, database shadowing, RAID technologies, network redundancies, backup CPU, and symmetric multiprocessing.
References:
Answers Corporation. (2007). Redundancy. Retrieved June 26, 2007 from http://www.answers.com/topic/redundancy?cat=biz-fin
Dennis, A. (2002). Networking in the Internet age. John Wiley & Sons, New York.
Denning, P. J. (1976). Fault tolerant operating systems. ACM Computing Surveys (CSUR) 8 (4): 359–389.
Laudon, K. C. & Laudon, J. P. (2006). Management Information Systems: Managing the Digital Firm. 9th Edition, Prentice-Hall.
Linden, Theodore A. (1976). Operating System Structures to Support Security and Reliable Software. ACM Computing Surveys (CSUR) 8 (4): 409–445. ISSN 0360-0300.
Nicoud, Sophie. (2003). Network Redundancy. Retrieved June 26, 2007 from http://www.urec.cnrs.fr/IMG/pdf/articles.02.network-redundancy-v-u-1.pdf
Telogy Networks. (2000). VoIP Implementation Challenges. San Jose.
The 3ware DiskSwitch Architecture. (1999). Technical White Paper, 3ware Inc. Retrieved June 21, 2007 from http://www.3ware.com