This book is a solid description of Site Reliability Engineering at Google. It is full of good ideas. However, most would be difficult to implement in many organisations without revolutionary change in the culture.
Need for Revolutionary Cultural Change
The revolutionary cultural change needed is to treat operational work as our first job. Operational work is not something that is done on the side.
The change that organisations need to make is to recognise that operational work is a vital component of a product. A product is more than features shovelled out the door: it is about the experience of using that product. This is where operational work is critical: we find ways to make the product stable and reliable.
Good Ideas From SRE
The good ideas I got from this book are:
Continual incident management training
Continual improvement in alerting
Continual automation
Incident Management Training
In all the service organisations I have worked at, incident management training has been limited to a few professionals in Service Delivery/Operations. All operational personnel should have regular incident management training to keep their skills current.
Having only a few people trained means that there is confusion about roles and expectations in a real incident. Usually there is just one person trying to juggle being incident commander, customer liaison, incident recorder, etc. In the end, they become less effective in these critical roles.
Google ensures that all SRE personnel are able to do those roles, and holds regular drills to practice them. These drills are based upon post-mortems of production issues.
Alerting
Google’s policy is that pages should only be sent if a human has to do something. Google aims for a maximum of two (2) pages per 12-hour shift. All other alerts should either have an automated response or just be logged for future reference.
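As a rough illustration of that paging rule, here is a minimal sketch in Python. It is not Google's implementation: the `Alert` fields, the `route` function and the `over_page_budget` helper are all hypothetical names made up for this example. Only the rule itself (page when a human must act, otherwise automate or log) and the figure of two pages per shift come from the book.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PAGE = "page"          # interrupt a human
    AUTOMATE = "automate"  # trigger an automated response
    LOG = "log"            # record for future reference only


@dataclass
class Alert:
    name: str
    needs_human: bool        # hypothetical flag: a person has to do something
    has_auto_response: bool  # hypothetical flag: an automated fix exists


# Target quoted above: at most two pages per 12-hour shift.
MAX_PAGES_PER_SHIFT = 2


def route(alert: Alert) -> Action:
    """Decide what to do with an alert, following the rule described above."""
    if alert.needs_human:
        return Action.PAGE
    if alert.has_auto_response:
        return Action.AUTOMATE
    return Action.LOG


def over_page_budget(shift_alerts: list[Alert]) -> bool:
    """Return True if one shift's alerts would page more often than the target."""
    pages = sum(1 for alert in shift_alerts if route(alert) is Action.PAGE)
    return pages > MAX_PAGES_PER_SHIFT
```

A shift that regularly trips `over_page_budget` is a sign that some alerts need an automated response, or should be downgraded to logging, rather than that the on-call person should try harder.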
In many organisations, alert management is seen as unwelcome toil. I have been to sites where there are thousands of critical database alerts that no one was investigating. (One site had over 6,000, and another had about 2,000.) In both cases, management was wondering why the systems were so unstable.
Alerts need to tell the SRE about a potential problem before a customer notices. Too often, operational personnel are only reacting to customer complaints.
To help people look at alerts, the alerts should be tuned for relevance (not all threshold violations will impact service delivery), and frequency (alert storms should be curtailed or throttled).
Automation
Automation is key to a successful SRE team. The more work that can be done by computers, the better. The book does have a salutary lesson about an automated task wiping all data in a data centre. And with automation comes the issue of deskilling of SRE personnel.
Changes to SRE automation should be treated as production changes. The same care and attention that is taken for customer-facing applications should be applied to critical automation scripts. This is where software development experience and knowledge become vital for SRE personnel.
Deskilling can be counteracted through live drills for incident management training. However, this means systems should be set aside for such a purpose.