After some of you experienced, or at least saw, what happened yesterday with several of Google's services, company officials apologized and released a statement explaining what went wrong and caused the outage.
Ben Treynor, Google's VP of Engineering, posted a statement pointing to a 'bug' in the system as the 'culprit' behind the whole outage. He says that at 10:55 a.m. PST, an internal system that generates configurations for other key systems produced an incorrect configuration, which was then sent out to those systems. Around 11:02 a.m. PST, the massive outage started and users reported they couldn't access Google services.
The incorrect settings essentially told the systems to ignore server requests from users, which in turn generated the error messages. About 12 minutes later, at 11:14 a.m. PST, the same system that generated the faulty instructions corrected itself and started sending out proper configurations. By 11:30 a.m. PST, all systems were back online and engineers began taking precautionary steps, from removing the source of the failure to implementing new safeguards to prevent a recurrence.
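The failure mode Treynor describes, a bad configuration being generated and then propagated to other systems, is commonly mitigated by validating a config before pushing it downstream. Here is a minimal, hypothetical sketch of that idea; the function names, settings, and checks are illustrative assumptions, not Google's actual pipeline:

```python
# Hypothetical sketch (NOT Google's actual system): validate a generated
# configuration before propagating it, so a bad config is rejected
# instead of being sent to every downstream system.

def generate_config(accept_requests: bool) -> dict:
    """Simulate the internal system that produces settings for other systems."""
    return {"accept_user_requests": accept_requests, "max_connections": 1000}

def validate_config(config: dict) -> bool:
    """Reject any configuration that would tell servers to ignore user requests."""
    return (config.get("accept_user_requests") is True
            and config.get("max_connections", 0) > 0)

def push_config(config: dict, systems: list) -> list:
    """Propagate the configuration only if it passes validation."""
    if not validate_config(config):
        raise ValueError("refusing to push invalid configuration")
    return [(name, config) for name in systems]

# A bad config (one that ignores user requests) is caught before
# it reaches any system.
bad = generate_config(accept_requests=False)
try:
    push_config(bad, ["frontend", "mail"])
except ValueError:
    print("bad config rejected")
```

In this toy model, a validation gate between the generator and the consumers would have stopped the faulty settings at 10:55 a.m. instead of letting them take down services seven minutes later.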
Some questions still remain, though. If the bug hadn't 'magically' repaired itself, how long would yesterday's outage have lasted? And why didn't Google's Site Reliability Team find the error faster? We are all human, of course, but isn't Google supposed to be one of the top companies with the best team of engineers? It appears not.
Thank you, Google Blog, for providing us with this information.