In the world of risk management, maintenance of mission-critical equipment drives priorities and budgets. It is the ultimate test of proactive maintenance and smart decision making. Managing assets that “cannot be allowed to fail” is more than an emotionally charged mandate that forces managers into a continual state of alert. It is the harsh reality for technicians tasked with ensuring continuous performance or service. The stakes are high. Fortunately, technology can help mitigate the risks.
The scope and scale of critical assets and equipment vary greatly, from electrical grids and security systems to backup generators at hospitals, refrigeration in the food and beverage industry, and traffic control for the airline industry. National defense systems and public-sector communications systems, such as 911 call centers or alerts for fire departments, are other examples of high-tech equipment that cannot be allowed to fail. Whether the goal is protecting the health of consumers, the safety of the workforce, or national security, mission-critical assets require special attention to detail and proactive status monitoring. Prevention is the goal. Early detection of warning signs makes intervention possible.
…
Comments
Another Approach to Risk Management
Hello Kevin:
Great article. When I was at Hewlett-Packard, we designed risk management into our Non-Stop Unix product line (formerly the Tandem products), which offered "five-nines" of uptime, counting both planned and unplanned outages. One of our products was used in 911 systems, where an outage went beyond mission-critical; such systems were termed "Life Critical." In designing the risk management for that system, we turned our usual approach inside out and said, essentially, "If failure can always happen, even in a hardened and resilient system, how do we mitigate THAT risk?" We came up with a "fail fast, recover fast" model. Our systems could detect a non-conforming subsystem, "kill" it, and recover, even rebooting the entire Unix kernel, in 60 seconds or less, while a second, identical parallel node continued handling all transactions and maintaining uptime. It was beautiful to watch in operation.
So, keep in mind a comprehensive view of what it means to have continuous uptime. There are often other approaches that can yield equivalent or better results than what you might have expected at first.
Howard Hudson
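For readers unfamiliar with the "five-nines" figure mentioned in the comment above, the arithmetic behind it can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, a 365-day year is assumed, and the 60-second recovery time is borrowed from the comment purely as an example input.

```python
# Illustrative sketch only: names are hypothetical and a 365-day year is assumed.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year, in minutes, for an availability of `nines` nines.

    Five nines means 99.999% availability, i.e. an unavailability of 10**-5.
    """
    unavailability = 10.0 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


def recoveries_within_budget(nines: int, recovery_seconds: float) -> int:
    """Number of complete outages of `recovery_seconds` that fit in the yearly budget."""
    budget_seconds = downtime_budget_minutes(nines) * 60
    return int(budget_seconds // recovery_seconds)


if __name__ == "__main__":
    print(f"Five-nines budget: {downtime_budget_minutes(5):.2f} minutes per year")
    print(f"Unmasked 60-second recoveries that fit: {recoveries_within_budget(5, 60)}")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year, so only a handful of 60-second recoveries would fit if each one were visible to users. That is why the second, parallel node described in the comment matters: while it keeps handling transactions, a recovery on the failed node does not draw on the downtime budget at all.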