Service Disruption Incident Review - Updated 27th Jan 2017 (Resolved)

Hi all,

We made an update to the system this week that resulted in three unexpected episodes of service disruption, the longest lasting about 30 minutes.  Disruptions are rare, but we treat any of them as a serious event and investigate thoroughly to understand what happened and what we can learn to minimise the risk of it happening again.  I have included the details of what happened below, as we think we owe it to our users to be open when we make a mistake or experience a technical problem.

We have investigated the events of this week and this is what happened:

1. In the early hours of Tues 24th Jan we released an update that changed the way the system handled the login process.  We release many minor changes like this to ensure Power Diary is always using the latest security and functionality available in current browsers.  As normal, we tested the update prior to release and all appeared OK.  However, the change contained a function that was resource heavy, and this only became evident later, during business hours, as the user load increased.  This resulted in Power Diary slowing down to the point that some users could not log in.  We resolved this by removing the code causing the problem and republishing the update, which returned the system to normal functioning in the short term.  Later in the evening we released another update to provide what should have been a permanent solution.

2. The following day, Wed 25th Jan, we were monitoring the systems closely and again observed the same issue, which we thought had been resolved.  To minimise disruption we immediately rolled back the latest change to restore system functionality as quickly as possible.  However, some users experienced service disruption whilst we did this.

3. By Wednesday evening we had prepared a new update which we were confident would (and ultimately did) address the issues permanently.  Unfortunately, and entirely due to human error on our part, we mis-scheduled the release of this update, so it went out whilst there were still a number of users on the system.  We realised this quickly, rolled back, and rescheduled the update for later in the evening.  However, this still caused some disruption to a small number of users.

All systems have been operating as normal since then.  I want to take this opportunity to apologise for the disruption caused to users.  We know that our users rely on Power Diary every day to run their businesses, and even small disruptions can have a major impact.  We also want to let you know that whilst an event like this is very disruptive, your data was not affected in any way.

So what have we learnt from this?

1. We need to pay extra attention to any processes that can affect server load.  Even under maximum load, Power Diary typically uses only 5-10% of available processing capacity and can be scaled up nearly instantly as demand requires.  Despite that headroom, an inefficient process inadvertently added to the system can degrade functionality very quickly.

2. To minimise human error we will be double-checking the timing of non-urgent updates to ensure they have been set appropriately.  Whilst that sounds super-basic, it is easy to make a mistake like this at the end of a long couple of days where the focus has been on identifying and fixing a bug.  It is a bit like the research cited here that suggests we are at a higher risk of having a car accident close to home, partially because we start to relax and feel we are almost there.
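
For the technically curious, the timing double-check in point 2 could be backed by a simple automated guard.  The following is only a minimal sketch in Python, not a description of our actual release tooling: the quiet window (23:00 to 05:00), the function names and the use of plain local times are all assumptions made for illustration.

```python
from datetime import datetime, time

# Hypothetical quiet window for non-urgent releases (assumed values,
# expressed in the local time of the majority of users).
QUIET_START = time(23, 0)
QUIET_END = time(5, 0)


def in_quiet_window(scheduled: datetime) -> bool:
    """Return True if the scheduled time falls inside the quiet window.

    The window wraps past midnight, so a time qualifies if it is at or
    after QUIET_START, or strictly before QUIET_END.
    """
    t = scheduled.time()
    return t >= QUIET_START or t < QUIET_END


def check_release_time(scheduled: datetime, urgent: bool = False) -> None:
    """Raise ValueError if a non-urgent release is scheduled outside the window.

    Urgent releases (for example a rollback) skip the check.
    """
    if urgent:
        return
    if not in_quiet_window(scheduled):
        raise ValueError(
            f"Non-urgent release scheduled for {scheduled:%H:%M}, "
            f"outside the quiet window {QUIET_START:%H:%M}-{QUIET_END:%H:%M}"
        )
```

With a guard like this, a non-urgent release scheduled for 19:30 would be rejected before it could go out, while one scheduled for 01:00 would pass.  A person would still confirm the time; the check simply adds a second line of defence at the end of a long day.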

I hope the above helps explain a little of what was behind the recent issues and how we'll prevent them in the future.  If you have any questions about this, please let us know.

Thank you for your understanding, patience and ongoing support.

Kind regards,

 - Damien


Power Diary