PACS crash teaches administrator the hard way

May 19, 2008

Michael Toland was seven months into his first job as a PACS administrator, working for a 250-bed community hospital with an annual exam volume of 70,000, when the PACS crashed.

One of the system's two storage array controllers, devices that route data to disk storage, had begun generating error codes and was nearing failure. It wasn't a crisis because the second storage array controller was working fine, Toland said during a presentation Saturday at SIIM. A replacement was scheduled.

But instead of performing the replacement on the scheduled date, the vendor showed up early, at a time when Toland was away. The IT team let the vendor in, and the storage array controller was replaced. Instead of copying system files from the existing controller to the new one, however, the vendor copied files from the new controller onto the existing one. All PACS data disappeared.

The closest full backup was from the previous week. The storage array had to be rebuilt from scratch.

The system was completely down for eight hours, Toland said. Restoring the PACS data from tape took 27 hours, data had to be re-sent from the RIS and modalities, and study reconciliation took 24 hours.

Toland, who is now PACS administration team manager for the University of Maryland Medical System, relayed his experience against a backdrop of suggestions and tips about how to avoid similar problems in the future and how to respond when they occur.

System failures are inevitable because hardware fails. It is important to react quickly and identify the problem, Toland said.

Controlling downtime and its effects starts with communication: notifying users of unplanned outages, and keeping business staff informed so they understand the business impact. Escalation policies should be designed in advance to trigger troubleshooting and engage the resources needed to address problems, he said.

For preventing downtime, the number one strategy is redundant architecture. Other strategies include using distributed systems, so that a single component failure doesn't bring down the whole system, and monitoring systems so that problems are discovered quickly.

Operationally, preventing downtime and minimizing its impact requires policies and procedures for change management and escalation. Documentation eases repairs and becomes more necessary as systems grow more complex and easier to break.

Good relations with the vendor are helpful, but they need to be backed up with service agreements regarding downtime and accountability, Toland said.