RESILIENCE ENGINEERING


Andrew Hatch


Site Reliability Engineering

Main takeaways

Basics for resilience engineering

Origins and foundational fields of study

How does it affect us? Why should we care?

How we can engineer resilience in our teams and software

whoami

20+ years

Resilience is not about reducing negatives (incidents, errors, violations). It's about identifying and then enhancing the positive capabilities of people and organisations that allow them to adapt and learn safely and effectively under pressure.

Resilience in the natural world

Biological Resilience

Gum Trees

"Evolution selects for resilience, adaptability and evolvability" - Alicia Juarrero

When innate resilience is not enough

Gum Trees

Engineering Natural Resilience

Humerus fracture
Egypt 1539–1075 BC

Growth of Modern Complexity

Man-Made Disasters
Control Theory
Latent Failure Theory
Normal Accident Theory

Cognitive Sciences
Cybernetics
Sociology
Psychology
Human Factors
Complex Systems Science and Philosophy
Safety Science
Socio-Technical Systems

Complex Systems Basics

Philosophy and sciences can agree:

They have many components that interact locally, not globally

Small local changes can have unintended global effects

They are embedded in their environments: they adapt, grow and are sensitive to change

They require constant energy: entropy is constant and equilibrium is impossible

Hierarchy imposes constraints, added layers become more abstract

They have a history, which is crucial to their growth
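
The first two points above can be illustrated with a toy model. This is a minimal sketch, not something from the talk: the ring topology, the `cascade` function and the load and capacity numbers are all assumptions chosen for illustration. Each node only ever interacts with its two immediate neighbours, yet a small local change can take down the whole ring when the system is already running close to its limits.

```python
def cascade(loads, capacity, bump_index, bump):
    """Add `bump` to one node on a ring and return the set of nodes that fail.

    Hypothetical toy model: a node fails when its load exceeds `capacity`.
    Its load is then split equally between its two immediate neighbours,
    which is a purely local interaction. Load sent to a neighbour that has
    already failed is simply dropped.
    """
    n = len(loads)
    loads = list(loads)          # work on a copy, leave the input untouched
    loads[bump_index] += bump
    failed = set()
    pending = [bump_index]       # nodes whose load changed and need checking
    while pending:
        i = pending.pop()
        if i in failed or loads[i] <= capacity:
            continue
        failed.add(i)
        share = loads[i] / 2
        loads[i] = 0.0
        for j in ((i - 1) % n, (i + 1) % n):
            if j not in failed:
                loads[j] += share
                pending.append(j)
    return failed


# A ring running close to capacity: a small local change fails every node.
busy = [0.9] * 10
print(len(cascade(busy, capacity=1.0, bump_index=0, bump=0.2)))

# The same ring with slack: a much larger local change stays contained.
relaxed = [0.3] * 10
print(len(cascade(relaxed, capacity=1.0, bump_index=0, bump=0.8)))
```

The outcome depends less on the size of the change than on the state the system was already in, which is one way to read the point that a complex system's history matters.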

Resilience Engineering appears

It is a Field and a Community

It's not a tool or a product.

It is multi-disciplinary and crosses multiple industries. Its origins date back several decades, but it has become more of a "thing" in the past 15 years. There is a lot of academic material and it is highly opinionated, which is great because it provokes good discussion.

resiliencepapers.club

A resilient system is able effectively to adjust its functioning prior to, during, or following changes and disturbances, so that it can continue to perform as required after a disruption or a major mishap, and in the presence of continuous stresses.

* Sustained Adaptive Capacity

* Graceful Extensibility

* Continuous Adaptability

It is both a field and a community, with origins in multiple disciplines

It is what your organisation does. Not what it has.

Ask yourself

How well do your people and systems adapt...

To failure?

To unplanned work?

To new architectures, platforms and technology?

Production environment!

What we do is complex!

And if we lose the ability to adapt

Complex systems will behave in unexpected ways

...you can't always code the ability to dynamically adapt

Resiliency for engineers

Learn from incidents as much as possible
They are part of normal complex system behavior. Use them.

You can't wait for resilience to evolve naturally.
It must become an on-going practice

Create conditions and environments where teams can sustain adaptive capacity, wherever the work is done

Understand the interactions between people and technology.
Don't isolate them as separate challenges

Resiliency for organizations

Constraints can also enable innovation. Think of them as probabilities for change, not restrictions

Prefer safe-to-fail over fail-safe. Diversity of thought will increase robustness. Strict controls can lead to brittle and static systems

Avoid restrictive control structures. Keep feedback loops open and innovation enabled

Don't blindly persist with operational models that become commoditized, e.g. cloud computing

Resilience as a VERB, not a noun

Thank you!

https://res-eng.hatchman76.com

@hatchman76