The Psychology of Chaos Engineering

A presentation at Agile + DevOps East 2019 in November 2019 in Orlando, FL, USA by Matt Stratton

Slide 1

The Psychology of Chaos Engineering Matty Stratton, PagerDuty @mattstratton

Slide 2

Slide 3

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. https://principlesofchaos.org/

Slide 4

“By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice.” - Netflix Technology Blog, 2011 https://bit.ly/netflix-chaos

Slide 5

What chaos engineering is NOT

Slide 6

Slide 7

It’s not about breaking things

Slide 8

Slide 9

Experimenting in production is preferred

Slide 10

You can’t do this without good measurement

Slide 11

Minimize your blast radius
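
One way to keep the blast radius small is to cap how many hosts a single experiment is allowed to touch. A minimal Python sketch, assuming a hypothetical fleet of host names and a hypothetical `pick_targets` helper (neither comes from the talk or from any of the tools it mentions):

```python
import random


def pick_targets(hosts, fraction=0.05, max_targets=3, seed=None):
    """Choose a small random subset of hosts for one experiment.

    fraction and max_targets are illustrative defaults, not recommendations.
    """
    if not hosts:
        return []
    rng = random.Random(seed)
    # Take a fraction of the fleet, but always at least one host...
    count = max(1, int(len(hosts) * fraction))
    # ...and never more than the hard cap or the fleet size.
    count = min(count, max_targets, len(hosts))
    return rng.sample(hosts, count)


# Hypothetical fleet of 40 web hosts.
fleet = [f"web-{i:02d}" for i in range(40)]
targets = pick_targets(fleet, fraction=0.05, max_targets=3, seed=7)
print(targets)
```

With 40 hosts and a 5% fraction this selects two hosts; the hard cap keeps a large fleet from turning a small percentage into a large experiment.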

Slide 12

Something’s broken… …it’s your fault

Slide 13

Some helpful tools
• Netflix Simian Army - https://github.com/Netflix/SimianArmy
• Gremlin - https://www.gremlin.com/
• ChaosToolkit - https://chaostoolkit.org/

Slide 14

But what about the people?

Slide 15

How does it make you feel to know Netflix practices chaos engineering?

Slide 16

What about your bank?

Slide 17

Slide 18

Management can get… …nervous

Slide 19

Consider your words

Slide 20

It’s about the philosophy

Slide 21

Slide 22

Safety first

Slide 23

• Know your conditions - when will you shut down the experiment?
• This isn’t about causing stress on your people - be transparent
• There are humans at the other end of those numbers
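
"Know your conditions" can be made concrete by writing the shutdown limits down before the experiment starts. A rough Python sketch, assuming hypothetical metric names and thresholds (the `AbortConditions` class, the `abort_reasons` function, and the numbers are illustrative and not taken from the talk):

```python
from dataclasses import dataclass


@dataclass
class AbortConditions:
    """Hypothetical limits agreed on before the experiment starts."""
    max_error_rate: float       # fraction of failed requests, e.g. 0.02 = 2%
    max_p99_latency_ms: float   # slowest acceptable 99th-percentile latency
    max_duration_s: float       # hard time limit for the whole experiment


def abort_reasons(limits, error_rate, p99_latency_ms, elapsed_s):
    """Return the limits that were exceeded; an empty list means keep going."""
    reasons = []
    if error_rate > limits.max_error_rate:
        reasons.append("error rate above limit")
    if p99_latency_ms > limits.max_p99_latency_ms:
        reasons.append("p99 latency above limit")
    if elapsed_s >= limits.max_duration_s:
        reasons.append("time limit reached")
    return reasons


limits = AbortConditions(max_error_rate=0.02, max_p99_latency_ms=800, max_duration_s=600)

# Healthy readings: nothing to report, the experiment continues.
print(abort_reasons(limits, error_rate=0.01, p99_latency_ms=350, elapsed_s=120))  # []
# Error rate over the limit: shut the experiment down.
print(abort_reasons(limits, error_rate=0.05, p99_latency_ms=350, elapsed_s=120))  # ['error rate above limit']
```

Returning the reasons rather than a bare true/false keeps the shutdown transparent: the people watching the experiment can see which limit ended it.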

Slide 24

Further Reading
• Chaos Engineering Traps - Nora Jones bit.ly/2Pr53ZH
• ChaosCat: Automating Failure Injection at PagerDuty bit.ly/2UCbdXN
• ChaoSlingr: Introducing Security into Chaos Testing bit.ly/2GDZN1V

Slide 25

https://speaking.mattstratton.com

Slide 26

Pagey Says…

Slide 27

Session Evaluations in the App