Investigations into Interruptions and Service Lapses in Singpass and Corppass Services
Prime Minister's Office
Summary
This question concerns Dr Tan Wu Meng’s inquiry into the Singpass and Corppass service disruptions on 8 and 9 February 2018 and the resulting investigations. Senior Minister of State Dr Janil Puthucheary attributed the slowdown to a previously undetected software bug in a vendor-provided authentication server, which manifested after a system enhancement in January 2018. He highlighted two lessons: improving software checks and diagnostics so that performance problems are detected earlier, and extending system resiliency beyond hardware redundancy. In reply to supplementary questions, which included whether the vendor contract provides for liquidated damages, he stated that the Government is reviewing its contracts with commercial providers and will review and update its testing procedures, since the combination of a software interaction and unusually high load had not been tested against. He also addressed delays in two-factor authentication SMS delivery, noting that lag may arise from load on telecommunications providers as well as on Singpass, and that the planned Singpass Mobile approach is expected to help.
Transcript
8 Dr Tan Wu Meng asked the Prime Minister (a) what is the outcome of investigations into the interruptions and service degradations in Singpass and Corppass services on 8 and 9 February 2018; (b) what redundancy mechanisms and systems are in place to maintain uninterrupted Singpass and Corppass service provision; and (c) what lessons have been learned from the recent interruptions.
The Senior Minister of State for Communications and Information and Education (Dr Janil Puthucheary) (for the Prime Minister): Mr Speaker, Sir, on 8 February from 10.30 am to 5.15 pm and, again, on 9 February from 10.30 am to 3.15 pm, users had difficulty logging in to Singpass and Corppass. The services were intermittently slow or not responsive during these periods.
The slowdown was caused by a software bug in the Authentication Server provided by Gemalto, a commercial vendor. This software bug was previously undetected and had only manifested after an enhancement to the Singpass and Corppass system in January 2018. We have verified that the enhancement complied with all technical specifications and was properly tested. However, the interaction between the enhancement and the software bug caused some records to persist in the system, instead of being automatically removed 30 days after they expired, which was the root cause of the slowdown.
Gemalto has acknowledged the software bug in their product and has been helpful and cooperative in the recovery process. We drew two key lessons from this experience.
First, while the bug itself was elusive, the symptoms – the slowdown in system performance – could have been detected earlier. Our early detection and warning capabilities can be improved and will be improved. We intend to do so by enhancing the software checks and diagnostics so that, in such cases, the engineers can act before the system condition worsens to a state that would affect users.
Second, while the system had the hardware backup to deal with hardware and infrastructure failure, such redundancy did not address the unknown internal software bugs of this nature. We will review the system design to improve all-round resiliency, beyond just hardware resiliency.
As a broader point, this episode shows that for our critical systems that rely on products from commercial providers, there is a need to work more closely with these providers and to better understand and ensure that these products operate as they are intended to. This will allow us to improve system design and put in place the probes and sensors to improve early warning. We will take these lessons and apply them to the development and maintenance of other Government systems.
Dr Tan Wu Meng (Jurong): I thank the Senior Minister of State for his detailed update. I have three supplementary questions.
Firstly, does the contract with the commercial provider provide for liquidated damages or other such responses in the event of a failure to provide adequate service and, if so, are these measures in the current contract adequate?
Secondly, is there any stress testing of the system to help detect potential points of failure before they manifest and affect users?
Thirdly, would the Senior Minister of State be able to provide some assurance about the on-going analysis that will take place moving forward because I have received feedback from my residents in Clementi that even after the system was restored, they still found that the two-factor authentication text messages could take an inordinate amount of time to be delivered to their phones?
Dr Janil Puthucheary: I thank Dr Tan for the questions. We are reviewing the contracts with our commercial providers, both with respect to the incident as well as with what we will be engaging in going forward.
Secondly, as far as stress testing is concerned, yes, indeed, the testing process does include putting the system under some degree of stress and load. In this instance, there were several factors that occurred simultaneously which resulted in a combination of an interaction between two pieces of software and an unusually high load all at the same time. So, this particular instance was not something that was envisioned or tested against. But in general, the system is tested against conditions of stress and load. But, again, we will review and update our procedures, going forward, having learnt from this incident.
On the analysis about the two-factor authentication, we will study the issue with respect to the text messages. We were not aware that there is a particular problem at the moment. It may be that the delay varies over time as the load varies over time. So, at particularly busy periods, there may be some lag. This busyness cannot always be associated with the Singpass system itself. There may be other instances of the telecom co-providers being under load, which may also result in some lag.
In the long term, our plans for the two-factor authentication on National Digital Identity would, hopefully, get around this with the Singpass Mobile approach that we have briefly described.