Probe into Network Hardware Issues that Caused System Outage at Polyclinics on 27 August 2022
Ministry of Health
Summary
This question concerns the network hardware outages affecting public healthcare institutions on 27 August and 5 September 2022, raised by Dr Tan Wu Meng and Dr Wan Rizal regarding the root causes, operational impact, and future safeguards. Senior Minister of State for Health Dr Janil Puthucheary explained that the 27 August outage was traced to firmware bugs in firewall nodes, while the root cause of the 5 September node failures was still under investigation. The outages led to increased wait times and rescheduled appointments while urgent care remained unaffected. He clarified that investigations so far showed no indication of a security compromise and that staff maintained operations using manual documentation and business continuity plans practised through regular drills. He stated that the affected devices have since been patched and that systems are benchmarked against best-in-class examples around the world. To prevent future disruptions, the Ministry of Health is increasing network capacity, reviewing system architecture, and investing in further capabilities to strengthen the resilience of the public healthcare IT infrastructure.
Transcript
30 Dr Tan Wu Meng asked the Minister for Health in light of network hardware issues affecting the IT systems of public healthcare institutions on 27 August 2022 (a) how many institutions, apps and IT-dependent services are affected respectively; (b) how many patients are existing inpatients, patients admitted that day and outpatients scheduled to be seen that day respectively; (c) how many of such patients have experienced diversion or delayed care provision; and (d) how are the affected patients and healthcare workers supported.
31 Dr Tan Wu Meng asked the Minister for Health in respect of network hardware issues affecting the IT systems of public healthcare institutions on 27 August 2022 (a) whether a cyber attack has been ruled out; (b) what are the root causes of the outage; (c) whether existing redundancy measures are sufficient to maintain provision of healthcare services; (d) what lessons have been learned; and (e) what is being done to improve resilience against such incidents.
32 Dr Tan Wu Meng asked the Minister for Health (a) what are the budget, headcount and deliverables of the Integrated Health Information Systems (IHiS); (b) how do these benchmark against best-in-class international healthcare institutions and top technology-sector firms; and (c) what is being done to strengthen the capabilities of IHiS systems, processes and staff.
33 Dr Wan Rizal asked the Minister for Health with regard to the network hardware issues that caused a system outage at some polyclinics on 27 August 2022, what are the safeguards in place or that will be implemented to prevent disruptions in the future.
The Senior Minister of State for Health (Dr Janil Puthucheary) (for the Minister for Health): Mr Speaker, may I please address Question Nos 30 through to 33 together?
Mr Speaker: Yes, please.
Dr Janil Puthucheary: Sir, Members have asked about the cause and impact of the IT systems outage at public healthcare institutions on 27 August 2022. There was a related outage that occurred in the morning of 5 September 2022, which I will also address.
From 7.00 am on 27 August 2022, the public healthcare monitoring systems detected IT network connectivity failures. The faults were rectified and the systems were restored by 10.45 am on the same day. In total, 26 IT applications were affected, including the electronic medical records, appointment, pharmacy and laboratory systems. Seventeen public healthcare institutions, including the acute hospitals, community hospitals, specialist outpatient clinics and all the polyclinics, were also affected.
On 5 September, at 10.00 am, another fault occurred in the IT infrastructure. Some functionality was restored from 1.00 pm on the same day and full functionality was restored by 6.00 pm the next day. This outage affected eight public healthcare institutions and two out of three polyclinic groups. Due to the nature of this outage, the time to recovery of the system was longer. Hence, operations and services were switched to their back-up infrastructure.
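The recovery times quoted for the two incidents can be compared with a short calculation. This is a minimal sketch that only does arithmetic on the clock times stated above. Since the answer gives restoration as "by" a certain time, the results are upper bounds rather than measured figures, and the names `elapsed`, `AUG_27_OUTAGE`, `SEP_5_PARTIAL` and `SEP_5_FULL` are illustrative, not terms used by MOH or IHiS.

```python
from datetime import datetime, timedelta


def elapsed(start: datetime, end: datetime) -> timedelta:
    """Return the time between fault detection and restoration."""
    if end < start:
        raise ValueError("restoration time precedes the start of the outage")
    return end - start


# 27 August 2022: failures detected from 7.00 am, systems restored by 10.45 am.
AUG_27_OUTAGE = elapsed(datetime(2022, 8, 27, 7, 0), datetime(2022, 8, 27, 10, 45))

# 5 September 2022: fault at 10.00 am, some functionality restored from 1.00 pm.
SEP_5_PARTIAL = elapsed(datetime(2022, 9, 5, 10, 0), datetime(2022, 9, 5, 13, 0))

# Full functionality restored by 6.00 pm the next day (6 September).
SEP_5_FULL = elapsed(datetime(2022, 9, 5, 10, 0), datetime(2022, 9, 6, 18, 0))


def as_hours(duration: timedelta) -> float:
    """Express a duration in hours."""
    return duration.total_seconds() / 3600


if __name__ == "__main__":
    print(f"27 Aug outage:            up to {as_hours(AUG_27_OUTAGE):.2f} h")
    print(f"5 Sep partial restore:    about {as_hours(SEP_5_PARTIAL):.2f} h")
    print(f"5 Sep full functionality: up to {as_hours(SEP_5_FULL):.2f} h")
```

On these figures, the second incident took roughly 32 hours to reach full functionality against under four hours for the first, which is consistent with the decision to switch to back-up infrastructure.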
Both incidents caused a significant impact on operations. On both 27 August and 5 September, our affected public healthcare institutions activated their downtime procedures and business continuity plans to keep operations running using alternative systems and, in some cases, manual documentation. These business continuity plans are exercised regularly and staff were able to switch processes to sustain operations during the outage. But they had to work doubly hard to keep healthcare operations running smoothly.
Patients experienced longer wait times ranging up to one hour at the affected institutions. Some had their outpatient appointments rescheduled. There were delays in dispensing medications to patients.
Fortunately, there was no compromise to urgent care services across the institutions during the IT disruptions. Nobody was turned away from the emergency departments or denied urgent care.
I would like to express my thanks to all the teams, the nurses, the admin staff and the clinicians for keeping the services running for our public healthcare system and for keeping our patients safe. We are investigating the incidents with the manufacturer of the IT hardware to rectify any weaknesses in the system. This is what we have found out so far.
The main cause of the outages was the failure of hardware devices in our data centres. Public healthcare IT infrastructure is housed in more than one data centre for resilience and redundancy. At each data centre, there are a few firewall zones. Each firewall zone consists of multiple nodes. These are hardware devices which operate in tandem, so that if one fails, the load of data traffic is managed by the other nodes in the cluster and service operation is uninterrupted. This system had generally been working well until the recent outages.
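The redundancy described here, where the surviving nodes in a cluster absorb the traffic of a failed node, can be illustrated with a small model. This is a hypothetical sketch only: the class `FirewallCluster`, the node count of four and the load figure of 100 units are assumptions made for illustration, since the actual cluster sizes, traffic volumes and vendor failover mechanism were not stated.

```python
from dataclasses import dataclass, field


@dataclass
class FirewallCluster:
    """Toy model of a firewall zone whose nodes share traffic evenly."""

    node_count: int            # assumed number of nodes in the cluster
    total_load: float          # assumed total traffic, in arbitrary units
    failed: set = field(default_factory=set)

    def fail(self, node_id: int) -> None:
        """Mark one node as failed; the rest keep carrying traffic."""
        if not 0 <= node_id < self.node_count:
            raise ValueError(f"unknown node {node_id}")
        self.failed.add(node_id)

    def healthy_nodes(self) -> list:
        """Return the ids of nodes still in service."""
        return [n for n in range(self.node_count) if n not in self.failed]

    def load_per_node(self) -> float:
        """Split the total load evenly across the healthy nodes."""
        healthy = self.healthy_nodes()
        if not healthy:
            raise RuntimeError("no healthy nodes left: the zone is down")
        return self.total_load / len(healthy)


if __name__ == "__main__":
    cluster = FirewallCluster(node_count=4, total_load=100.0)
    print(cluster.load_per_node())   # all four nodes share the load
    cluster.fail(0)                  # first node fails; service continues
    print(cluster.load_per_node())
    cluster.fail(1)                  # second node fails; service continues
    print(cluster.load_per_node())
```

In this model each failure leaves the service running but raises the load on every remaining node, which is why spare capacity in the cluster matters for resilience.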
On 25 August, two days before the incident that we are describing here, a node failed, the system resilience features kicked in and services were maintained. The same thing happened on 26 August, when another node failed and the systems and services continued functioning.
On 27 August, when the engineers tried to restore the two failed nodes, under the supervision of the manufacturer and following the manufacturer's procedures, which had been successfully used in the past, the operation failed. It is this failed operation that caused the cluster of firewall nodes to malfunction and subsequently caused the outage. The engineers worked to reset the systems to the prior state without the function of the two affected nodes and service was progressively restored.
The failure of the nodes was caused by bugs in the firmware of the devices. They have since been identified by the manufacturer, Cisco, and the devices have been patched.
The outage on 5 September was caused by the simultaneous failure of two further nodes, again from the same manufacturer and of the same model. The way in which this failure occurred on 5 September was noted to be different from the previous incident. It was assessed that it would take longer to restore operations and, hence, the decision was made to switch operations to the back-up systems. The root cause of why these two nodes failed is still under investigation.
There was a suggestion in one of the questions from Members that the failures may be due to the lack of manpower at IHiS. IHiS has a headcount of 3,500 personnel. They have a lot to do and will always welcome more manpower, but a lack of manpower was not the cause of these failures.
Our cybersecurity specialists are monitoring our network and systems for threats in our public healthcare network at all times. When the network problems occurred, IHiS initiated an investigation and also alerted the Cyber Security Agency. Based on the investigations thus far, there are no indications of security compromise to the affected systems.
All the firewall hardware involved in the incidents is from the same established device manufacturer. Fixes for some of the issues have been made available and have been deployed. For the others, we continue to work with the manufacturer.
In the meantime, we have increased capacity in the network for more operational buffer to increase resilience.
I thank the Members for their questions. We will continue to review our system design and architecture, and invest in capabilities and readiness to reduce disruptions. Disruptions like this can occur again in the future and we continue to have at the ready back-up systems, downtime procedures and manual processes.
Once again, Mr Speaker, allow me to express my thanks to all the personnel of our public healthcare system who kept our patients safe and the services running, and to the members of the public who were involved and affected during these days, for their patience and understanding as we tried to cope and mitigate the circumstances as best we could.
Mr Speaker: Dr Tan Wu Meng.
Dr Tan Wu Meng (Jurong): I thank the Senior Minister of State for his answer. Mr Speaker, I filed all three of my oral Questions on this topic and I seek Speaker's indulgence for supplementary questions.
Let me first declare that I am a healthcare worker at a public healthcare institution. I had Clementi residents contacting me – some were affected and saw how hard it was on the doctors and nurses, who nevertheless rose to the occasion and made efforts to help despite the outage. Some healthcare workers living in Clementi also contacted me and were concerned about the reliability of the system.
I have the following supplementary questions for MOH.
First, how do we benchmark our systems' reliability and usability? Do we take reference from best-in-class examples elsewhere? Secondly, the Senior Minister of State mentioned that there were issues with a hardware item from a particular manufacturer. Do we know whether this hardware item has given similar issues elsewhere in the world? If not, what is being done to look for other, as yet undiscovered, potential vulnerabilities?
Dr Janil Puthucheary: Sir, I thank Dr Tan Wu Meng for the questions. His first question was about the benchmarking of reliability and usability. We do indeed take reference from other parts of the technology industry and healthcare technology industry. There are service level agreements about the uptime availability, as well as the user interface usability for the products that IHiS manages and these are indeed benchmarked against best-in-class around the world.
The second question was about the hardware: whether there have been similar issues elsewhere and what we are doing to look for other vulnerabilities in similar hardware. The bugs that we have detected as part of these incidents have not previously been described in other parts of the world. The hardware that was involved is used in other parts of Government systems, but in relatively limited numbers and configured differently. This is partly because the Government's requirements for data flow, inter-operability and system architecture are different from the requirements of the healthcare space. And so, the specific configuration used for this type of hardware in healthcare is not present in the rest of the Government.
Nevertheless, the various agencies that are involved in public sector technology – Cyber Security Agency, GovTech, IHiS and others – are working together, sharing information and looking to scan across the various infrastructure that we have to check on these possibilities.
Mr Speaker: Ms He Ting Ru.
Ms He Ting Ru (Sengkang): I thank the Senior Minister of State. I would like to ask what support was actually given to frontline staff who were trying to cope with the outage, as I imagine that the situation was quite challenging when this was happening during both instances.
Also, is there training given to these frontline staff about what to do and what the business continuity plans (BCPs) were and also do these plans need to be updated after what happened in the last two instances?
Dr Janil Puthucheary: Sir, I thank Ms He for her questions. The support for frontline staff from MOH and IHiS is largely around communication and providing clear information about what has happened, what steps are expected to be taken to restore functionality and how much time will be required. This is because the next set of decisions, on which BCPs and processes to implement, is made at an operational level, depending on the needs of each team. So, from IHiS and MOH, a lot of the support is around communication.
Within the healthcare ecosystem and the clusters, clearly, they mobilised staff, they mobilised the senior staff and the junior staff, everybody had their hands on deck to cope and deal with the outages as well as the extra processes required for the outages. And that includes overtime for some staff, having to work extra hours, and then, consequently, overtime claims as a result of that. And so, the entire team mobilises together to support each other and we provide as much support as we can to help them through this difficult time.
Is training provided? Yes, business continuity plans and disaster recovery plans are drilled regularly in the healthcare units and teams. They are part of standard training for all the healthcare workers in our public healthcare sector and the training is updated on a regular basis. This is not something that is standardised in every team and in every unit across the healthcare ecosystem because these plans are peculiar to the operations and flows within each clinical team. And so, you have standards, expectations and guidance that are set broadly. But then each team will develop the granularity of their business continuity plans and these are then drilled as well.