The essence of a contingency plan in any institution is to sustain critical operations and systems when an extraordinary event disrupts normal performance and causes below-par execution.
Organizations that rely on IT systems have both static and dynamic information that assists in routine operations. They have to ensure that both information sources are reliable. Organizations have to be aware of criminals who break into computers for a variety of reasons.
They can steal valuable information and even launch attacks against other systems using the compromised parts of the business. The stereotype of computer intruders is misleading and can undermine contingency plans meant to foil attacks and keep a business running.
Today, many attackers operate as organized groups with specific orders and missions to accomplish. They can launch single or coordinated attacks on unsuspecting businesses to send a message, steal data, or cause harm.
This paper seeks to provide a useful guideline for carrying out IT resource contingency planning. Therefore, it discusses planning steps, possible recovery options, and recommended testing requirements for an organization to maintain the functioning and integrity of its operations (Casey, 2011).
Background Information on Contingency Planning Involving IT Resources
One critical component of planning is to identify the enemy correctly. Rather than remain fixated on the image of a teenager or other delinquent, it is better to think of attackers as capable intruders with sufficient will and capacity to cause harm, irrespective of their personality or physical profile.
A second factor to consider is the constant emergence of new technologies for breaking into business systems; intruders keep themselves abreast of the latest industry developments (Casey, 2011).
Therefore, defending a company with an outdated system in such circumstances may be as good as leaving doors wide open for anyone interested in snooping or stealing and disrupting operations to do as they wish.
The third component of planning is to define access. In many systems, only people who use or rely on the IT system have access to it. Other individuals inside and outside the company have to gain temporary, special permissions from the personnel in charge.
Returning to this operating protocol as early as possible after an intrusion is very important because it allows the rest of the organization's essential functions to fall into place and facilitates continued execution of the fundamental processes that lead to the realization of the organization's goals.
Many computer programs take advantage of vulnerabilities and automatically gain unauthorized access to computers or parts of systems. However, breaking into a system that is not ordinarily vulnerable is hard.
The task requires the coordination of activities and tools. The coordination is only possible when the personnel involved in the attack remain focused and follow particular attack designs that account for nearly all the safety nets of a given system.
Consequently, attackers will likely need and obtain insider knowledge about a system before attacking it. This is the main reason insider attacks rank high among reported sources of IT attacks on companies (Casey, 2011).
Another consideration in planning for contingency is the presence of human mistakes in the design and the operation phases of the system. Errors can lead to outages similar to those encountered during attacks.
Given that systems are a combination of individual tasks, simple human errors like omissions or typos can cause catastrophic damage to systems that lack the capacity to detect and self-heal from errors (Zhu & Chiueh, 2003).
The first step in planning is to come up with a mechanism for repairing the broken parts of the system. In many IT resources, systems rely on files that store information dynamically or statically and are interpreted to generate new information that facilitates other functions in the organization.
Therefore, the biggest part of any system will be the files, which have to remain trustworthy for standard functions to proceed unhindered. In this case, a file-repairing system working alongside the primary system is ideal.
Its work is to identify files that are becoming corrupted before they interfere with other files in the system. Early identification keeps the damage to the rest of the system low.
Identification allows professionals working on the system to stop and wait for repairs, or to switch their working paths to a different, uncompromised resource. Meanwhile, the repair has to happen. The proposal here is for the part of the system that identifies mistakes to also make initial repairs.
After that, other parts of the primary system can then self-initiate specific repairs on their part. Going with this overall design philosophy will mean that there will be two key processes happening simultaneously in case of a mistake or an intrusion into a particular company resource.
Firstly, a repair process for the primary system will be happening.
Secondly, each component of the system will repair its own parts and send feedback to the main system repair. Any part of the system that is not damaged, or that recovers from damage, will be free to resume normal operations and will signal the components that depend on it accordingly.
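The detect-and-repair loop described above can be sketched in code. The following is a minimal illustration only; the names (the manifest of known-good digests, the backup directory) are hypothetical conventions assumed for the example, not prescribed by this paper.

```python
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def detect_and_repair(live_dir: Path, backup_dir: Path, manifest: dict) -> list:
    """Compare each file against its known-good digest; restore
    corrupted or missing files from backup (the 'initial repair')
    and report what was repaired."""
    repaired = []
    for name, good_digest in manifest.items():
        live = live_dir / name
        if not live.exists() or sha256(live) != good_digest:
            shutil.copy2(backup_dir / name, live)  # initial repair
            repaired.append(name)
    return repaired
```

In this sketch, early identification amounts to running the check frequently, so a corrupted file is replaced before dependent components read it.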
However, with the above design plan, there is a risk of creating a cyclic victim situation where the attack on the system creates an internal virus or a smart process that tricks the system into self-repair. Instead, the system will be falsely repairing itself with the wrong information, thereby altering its normal functioning.
Therefore, to prevent such occurrences and remedy them when they occur, the plan will include the development of a parallel system information repository that does not allow the primary system to alter it, but provides the primary system with information about files and processes to ascertain their integrity.
In line with recommendations by Zhu and Chiueh (2003), the system will maintain all raw data to keep updates permanent. It will maintain information about dependencies during any update so that only the attacked parts of the system will roll back to their last working state.
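The parallel repository can be pictured as an append-only snapshot store: the primary system may add new states and query them, but never rewrite history, so any component can roll back to its last state that passes an integrity check. This is an assumed illustration of the arrangement, not an implementation from Zhu and Chiueh (2003).

```python
import copy

class SnapshotRepository:
    """Append-only store of component states; history is never altered."""

    def __init__(self):
        self._history = {}  # component name -> list of (state, dependencies)

    def record(self, component: str, state: dict, deps: tuple = ()):
        """Append a new state with its dependencies; earlier entries
        are never modified, keeping updates permanent."""
        self._history.setdefault(component, []).append(
            (copy.deepcopy(state), tuple(deps)))

    def last_good(self, component: str, is_intact) -> dict:
        """Walk history backwards and return the most recent state
        that passes the integrity predicate (the last working state)."""
        for state, _deps in reversed(self._history.get(component, [])):
            if is_intact(state):
                return copy.deepcopy(state)
        raise LookupError(f"no intact snapshot for {component}")
```

Because only the attacked component consults `last_good`, unaffected parts of the system keep their current state, matching the selective rollback described above.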
This arrangement makes it possible for the system to shut itself off from further spread of an attack. With an onion-like structure, penetration to the core will be difficult, and human mistakes occurring at the core will take time to propagate outward and affect normal operations.
In fact, each layer will operate independently in terms of its attack vulnerability and repair functionality, interacting with other layers as unique systems. This design will extend beyond IT to support features such as the power supply.
Destruction of the entire system will require systematic shutdowns or mistakes, which are nearly impossible to execute (Hordeski, 2005).
Recovery will be in two parts. The first part will be a low-priority recovery, where the system is destabilized but causes little disruption to the organization. The second part will be a moderate-to-high priority recovery for a system interruption that extends to critical operations.
In the first case, recovery will rely on ordinary file backup within the system that the personnel in the organization can access.
For example, files will be stored on removable discs that will be used to recover corrupted files when the system is attacked or when users make mistakes with the files they are manipulating or accessing (Rothstein, 2007).
This recovery option and process is straightforward. It will be included in the normal work procedures for all personnel and will be part of any user guidelines for various IT resource components.
Users can refer to the guidelines and know what to do without causing substantial delay or disruption of other users and parts of the system.
The recovery option here will also be automated for particular tasks as described in the earlier discussion where a parallel system will offer unedited instructions to aid recovery.
For example, a single computer can read network instructions and fetch backup files to self-restore to normal functioning or reset itself so that the user can feed in new raw data for processing with the right instructions (Swanson, Bowen, Phillips, Gallup, & Lynes, 2010).
In a medium- to high-risk event, recovery will rely on off-site facilities and equipment linked with high-speed information technologies, such as wired and wireless local area networks or the Internet cloud.
Communication between different facilities and the central system will be secure and will have to pass through various security handshake protocols that maintain the system integrity. In addition, there will be a mirror site for the entire system to provide a simulated replica of the main system.
The site can come online for use at any time when the critical functioning of the central system is compromised. The mirror system will replicate all on-site storage and raw data disc operations to provide hot-plug functionality.
This implies that when there is an intrusion or a breakdown of the main system, the intruder will be locked out, or the propagation of mistaken commands will be halted together with the entire system.
Users will be shifted to the mirror system in an instant switchover, continuing their jobs as usual. After that, they will receive notification of the ongoing compromise of their primary system and will have to shut down separate operations safely or go slow while awaiting full recovery.
The idea here is to have a system-wide operation working similar to cloud operations and local power backups that keep users connected to the same status of their work after a disruption of the work process occurs. In other words, users do not start again.
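The instant switchover can be sketched as a small router in front of both sites: session state is replicated, so when the primary is flagged as compromised, requests flow to the mirror and users resume where they left off. The class and its interface are assumptions made for illustration, not part of the paper's design.

```python
class FailoverRouter:
    """Directs user requests to the primary site unless it is flagged
    as compromised; session state is replicated to both sites, so
    users never restart their work."""

    def __init__(self, primary, mirror):
        self.primary, self.mirror = primary, mirror
        self.compromised = False
        self.sessions = {}  # user -> replicated work state

    def save(self, user, state):
        """Record the user's current work state (replicated to both sites)."""
        self.sessions[user] = state

    def handle(self, user):
        """Return the site that should serve this user, together with
        the user's preserved work state."""
        site = self.mirror if self.compromised else self.primary
        return site, self.sessions.get(user)
```

Flipping the `compromised` flag is the whole switchover; no reboot and no loss of the user's position, matching the cloud-like behavior described above.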
Moreover, the system will not need to reboot to maintain user and process access to its functions. The fundamental factors in the recovery will be the geographical area, accessibility, security, environment, and cost.
The company will physically place the remote facility at a distance from the main IT resource so that the recovery site remains safe from physical causes of disruption, such as weather, accidents, and on-site intrusions.
Accessibility concerns the time and resources it takes to retrieve remote data or sustain operations. In case of low-risk disruptions, users will have to manually collect and attach backup discs to particular parts of the system, such as a server or their client computer systems.
For moderate and high risk events, accessibility will be affected by the speed of the network used to connect to mirror sites and the time it takes to divert processes and information traffic from user computer terminals to the mirror site.
In this case, computer terminals refer to any device that a user may utilize to access the main system and carry out particular functions (Franke, Johnson, & König, 2014). This interpretation covers employees and customers or third-party persons and systems that rely on the main system for optimal functioning.
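The accessibility consideration above can be illustrated with a rough failover-time estimate: the time to re-synchronize outstanding data over the network plus the time to redirect user terminals. The formula and the numbers below are illustrative assumptions, not figures from the paper.

```python
def failover_time_s(resync_bytes: float, link_bps: float,
                    switch_overhead_s: float) -> float:
    """Estimated time to divert traffic to the mirror site: data
    re-synchronization time plus the terminal-redirection overhead."""
    return resync_bytes * 8 / link_bps + switch_overhead_s

# E.g., 2 GB of outstanding writes over a 1 Gbit/s link, with 30 s to
# redirect terminals:
# failover_time_s(2e9, 1e9, 30) -> 46.0 seconds
```

Such an estimate shows why network speed dominates accessibility for moderate and high-risk events, while low-risk recovery is bounded instead by the manual handling of backup discs.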
Other factors that will be considered in the plan are the security implications of transferring information from remote sources. Remote, in this case, includes short-distance transfers from a portable disk or computer device to the main system. It can also imply full mirroring over connections to cloud services and remote backup facilities.
The business will implement security protocols that ensure only authorized personnel initiate transfers. It will also monitor transfers, keeping records for use in an additional recovery process in case there are problems with the transfer.
For example, in the case of a physical transfer of data, the support staff conducting the surveillance will raise a flag when the personnel in charge of transferring files behave suspiciously, which will be enough to deter other users and participants in the recovery process from facilitating additional mistakes or intrusions.
The same policy will be implemented in electronic processes.
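One common way to realize the "authorized personnel only" rule for electronic transfers is to have each request signed with a per-person secret key and verified before the transfer proceeds. The sketch below uses HMAC for this; the key store and names are hypothetical, and real key handling would live outside the primary system.

```python
import hashlib
import hmac

# Hypothetical key store; in practice, keys would be managed off-system.
AUTHORIZED_KEYS = {"ops-lead": b"secret-key-kept-off-system"}

def sign_transfer(person: str, payload: bytes) -> str:
    """Produce a signature binding this person to this transfer payload."""
    return hmac.new(AUTHORIZED_KEYS[person], payload, hashlib.sha256).hexdigest()

def verify_transfer(person: str, payload: bytes, signature: str) -> bool:
    """Allow the transfer only for a known person with a valid signature."""
    key = AUTHORIZED_KEYS.get(person)
    if key is None:
        return False
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

A failed verification is exactly the kind of event the surveillance records described above would flag for the recovery process.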
Recommended Testing Requirements
The business will determine the critical nature of the fundamental business activities defined in its disaster recovery plan.
The definition will depend on the current results of the business impact analysis. At the same time, reports on risk analysis of threats, vulnerabilities, and safeguards of the present system will also help when reviewing the identity and properties of the main business activities.
The plan will require the simplification of the testing parameters so that management can make informed decisions based on easy-to-interpret empirical results of analyses and actual testing reports. Testing must be as comprehensive as possible.
Moreover, it must not cost more than the prohibitive threshold set by the business. Therefore, part of the plan will be coming up with acceptable costs for testing and solutions that the company is ready to implement in the overall disaster recovery plan and still perform its functions well.
During testing, service interruptions to the system should be minimal or absent. The evaluation of test results should provide quality input to the main disaster recovery program. Testing should also provide assurance of the system's recovery capability.
Proposed 24-Month Cycle Business Contingency Testing Plan
The contingency testing plan will be a continuous process that updates the company's overall disaster recovery plan. Although it will start from the disaster recovery aspects demonstrated in the recommended testing requirements, it will be difficult to spot its start and end once it is underway.
The 24-month cycle will include a checklist. Within the checklist step, there will be revising and retesting based on the revisions. The checklist will precede a walk-through of the system, which will include a review and retesting options.
The exact duration of the checklist and walk-through stages will fluctuate. However, they will be spaced three to six months apart to provide sufficient time for actual revisions and retesting.
The next step after a walk-through will be a simulation of the system that mimics particular disaster events, including all possible mistakes that employees can make. Simulation scenarios will be drafted during the checklist and walk-through stages, and any updates will happen at those stages when the 24-month cycle repeats.
Thus, it will be important for the business to conduct checklist and walk-through stages comprehensively. They will involve checking all aspects of the overall disaster recovery plan, which will involve all personnel of the company at different stages and on issues that relate to their job positions and system access capacities.
At these stages of the testing plan, all personnel and non-personnel users will be assuming system-component identities. That is why there is a strong emphasis on the micro cycle of testing and revising particular aspects of the first two stages.
This will also be used as a measure to keep the overall costs of the testing plan low, as the business can avoid costly mistakes when it is through with the initial stages.
The simulation will only happen for medium to high-risk events. It will only cover aspects of the system disruption that are economically viable.
For example, it will not be possible to shut down the system and simulate a physical recovery of data; therefore, the business will have to rely on estimates based on operations information. Simulation can make use of virtual machines running on the main system, but in a controlled environment (VMware, 2011).
In addition, there will be parallel testing during the simulation. The simulation can take two months and come immediately after the final walk-through stage event. Parallel testing checks whether the remote and primary sites are working concurrently as they should and both have accurate information.
The parallel testing aims to identify any loopholes in the recovery process before a complete takedown of the system happens in reality or during the testing plan (Rothstein, 2007).
The final stage of testing will be a full interruption lasting eight months. The full interruption will take time to initiate, carry out, and shut down, given the vastness of business operations and the architecture of the recovery facilities and processes.
Its comprehensive nature means that there will be many events occurring, either in different, isolated times or in sequence to mimic particular intrusion or error situations that would occur in reality.
However, the full interruption will not be a drill, but an actual test on the system that cripples the functioning of some of its components at a time. For example, it might involve a one-week go-slow or strike by staff from a critical department.
Of course, it will not be an actual strike, but this will be the stated reason for the interruption in the testing plan. The resulting outage will be handed to the recovery plan to evaluate how it responds (Rothstein, 2007).
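The stages above can be laid out as one illustrative 24-month allocation. The paper fixes only the two-month simulation, the eight-month full interruption, and the three-to-six-month spacing of the early stages; the remaining month splits below are assumptions chosen to make the cycle sum to 24 months.

```python
# Illustrative allocation of the 24-month testing cycle (month splits
# other than the simulation and full-interruption durations are assumed).
SCHEDULE = [
    ("checklist, revisions, and retesting",   6),  # months 1-6
    ("walk-through, review, and retesting",   6),  # months 7-12
    ("simulation with parallel testing",      2),  # months 13-14
    ("full interruption",                     8),  # months 15-22
    ("audit review and plan update",          2),  # months 23-24
]

assert sum(months for _, months in SCHEDULE) == 24
```

Because the cycle is continuous, month 24 feeds directly back into the next cycle's checklist stage.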
The company will allocate workers different duties in the testing cycle. A project manager will coordinate the operations of the testing team and various company departments, including management, finance, legal, public relations, security, and customer care.
The checklist and walk-throughs will involve line supervisors and their staff. The checklist will also include third-party auditors and disaster recovery services from consultancy companies. Auditors will provide reports for each stage of testing and use the residual time allocated in the plan at various stages.
Auditing will happen concurrently with the other testing activities where it is permissible. Auditing recommendations and internal findings of the contingency testing team will influence decisions and structuring of the overall disaster recovery plan to make it always current with the present business continuity threats (Casey, 2011).
This paper has discussed an actual contingency testing plan and the factors that affect its development and implementation.
It also provided a background exploration of the current forms and effects of employee mistakes and system errors that can cause disruption, together with external factors, such as intrusions and physical accidents.
In the end, a business implementing the proposed 24-month cycle should be able to succeed in abating disaster effects.
Casey, E. (2011). Investigating computer intrusions. In E. Casey, Digital evidence and computer crime: Forensic science computers and the internet (pp. 369-419). London, UK: Academic Press.
Franke, U., Johnson, P., & König, J. (2014). An architecture framework for enterprise IT service availability analysis. Software & Systems Modelling, 13(4), 1417-1445.
Hordeski, M. F. (2005). Emergency and backup power sources: Preparing for blackouts and brownouts. Lilburn: The Fairmont Press, Inc.
Rothstein, P. J. (Ed.). (2007). Disaster recovery testing: Exercising your contingency plan. Brookfield, CT: Rothstein Associates, Inc.
Swanson, M., Bowen, P., Phillips, A. W., Gallup, D., & Lynes, D. (2010). Contingency planning guide for federal information systems. Washington, D.C.: National Institute of Standards and Technology.
VMware. (2011). VMware data recovery administration guide: Data recovery 2.0. Palo Alto: VMware, Inc.
Zhu, N., & Chiueh, T.-C. (2003). Design, implementation, and evaluation of repairable file service. In Proceedings of the 2003 International Conference on Dependable Systems and Networks (pp. 217-226). IEEE.