Another update on connectivity issues
On Saturday we had serious problems with connectivity on all the servers. We have reviewed our monitoring logs, and it now looks like the problems started around 03:25, about 90 minutes earlier than we originally thought. It looks like there were multiple failures in procedures and systems that made this outage last about 12 hours. All of them are fixable, and it is unlikely that such an outage will happen again, at least in the same way and on this magnitude.
We should get reports later today on what exactly happened.
[Later..]
We've gotten a report, and the gist of it is that a point in the network failed, and the (quite elaborate) backup systems for this point also failed.
The people responsible for this point will be relieved of duty, many procedures will be reviewed and updated, a more robust network system will be deployed, and emergency watch and preparations will be doubled.
[Permalink] [By morphex] [Hosting (Atom feed)] [2010 13 Dec 08:28 GMT+2]