HockeyStick Show
Why you will fail without Chaos Engineering, with Kolton Andrus - HS#25

Find problems before they find you

Introduction

Welcome to episode 25 of the HockeyStick podcast, where we delve into breakthroughs in tech, business, and performance.

In today's episode, Miko Pawlikowski sits down with Kolton Andrus, a well-known figure in the SRE and chaos engineering space. As the founder of Gremlin and a seasoned engineer, Kolton shares his insights into the evolution of chaos engineering, the challenges it faces, and his thoughts on the future of the industry.

The Journey of Chaos Engineering

Kolton Andrus begins by discussing the foundational ideas of chaos engineering. "It's about taming the chaos," he explains. The primary goal is to find the edges of a system and handle them gracefully so that it stays reliable. Kolton emphasizes that organizations should invest in reliability because poor reliability is often a multimillion-dollar problem.

Shifting Roles at Gremlin

Kolton moved from CEO to CTO of Gremlin. "It's been a journey," he reflects, noting that he felt his talents were best used in a technical role. The shift let him work on product development and tackle the problems within chaos engineering more thoroughly.

The Importance of Chaos Engineering

Chaos engineering is an emotional topic for many SREs, Miko Pawlikowski among them. The practice means intentionally injecting failures to test system resilience. Kolton highlights that the engineering part is crucial, "because whenever you tell someone I do chaos engineering, they think you're the joker… And that's the mistake."
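
To make "intentionally injecting failures" concrete, here is a minimal sketch in Python. It is not Gremlin's product or API. Every name in it (fetch_price, with_fault_injection, run_experiment) is hypothetical and chosen for illustration. The idea is to wrap a dependency so that a controlled fraction of calls fail, then check that the caller degrades to a fallback instead of crashing.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    # Hypothetical dependency standing in for a remote service call.
    return {"widget": 9.99}[item]


def price_with_fallback(fetch, item, default=0.0):
    # The behavior under test: fall back to a default when the dependency fails.
    try:
        return fetch(item)
    except InjectedFault:
        return default


def run_experiment(failure_rate, calls=1000, seed=42):
    """Run a seeded, repeatable experiment and count how often the fallback was used."""
    rng = random.Random(seed)
    flaky_fetch = with_fault_injection(fetch_price, failure_rate, rng)
    results = [price_with_fallback(flaky_fetch, "widget") for _ in range(calls)]
    return {"calls": calls, "fallbacks": results.count(0.0)}
```

The fixed seed and explicit failure rate are the "engineering" half of the name: the experiment is bounded and repeatable rather than random mischief.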

The Branding Dilemma

While the concept and technique of chaos engineering are sound, its branding remains a challenge. The term "chaos" doesn't sit well with corporate executives. Kolton shares that although they leaned into the fun branding with Gremlin, it sometimes backfired. Executives want maturity and reliability, not something perceived as "immature."

Marketing and Acceptance

Marketing has always played a significant role in the adoption of chaos engineering. Many organizations found the name off-putting. Kolton notes that "reliability engineering" or "resilience engineering" might be better terms. The focus is on explaining the benefits and necessity of these practices to stakeholders.

Gamification in Engineering

One of the challenges in chaos engineering is getting organizations to adopt it systematically. Kolton mentions creating a rubric and scoring system for services, helping teams see their progress. "If you want people to do the right thing, you need to make it easy," he asserts.
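
A rubric-based score of this kind could look roughly like the sketch below. The checks and weights are invented for illustration and are not taken from the episode or from Gremlin. The sketch assumes each service has a set of pass/fail resilience checks and that the score is the weighted share of checks passed.

```python
# Hypothetical rubric: resilience check name -> weight (weights are illustrative).
RUBRIC = {
    "survives_dependency_outage": 40,
    "survives_host_loss": 30,
    "recovers_from_latency_spike": 20,
    "alerts_fire_on_failure": 10,
}


def score_service(results):
    """Return a 0-100 score from a dict of check name -> passed (bool).

    Checks missing from results count as not passed.
    """
    earned = sum(weight for name, weight in RUBRIC.items() if results.get(name, False))
    total = sum(RUBRIC.values())
    return round(100 * earned / total)


def rank_services(all_results):
    """Sort services from highest to lowest score so teams can compare progress."""
    scores = {name: score_service(results) for name, results in all_results.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

A single number per service is what makes progress visible and comparable, which is the "make it easy" point of the rubric.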

The Evolving Landscape

Kolton acknowledges that the gaming industry, despite its need for reliable systems, often lags in adopting such practices. He points out that people generally resist change, especially when it seems complex or unnecessary.

Lessons Learned and Future Prospects

Over the eight years of Gremlin's journey, Kolton has faced numerous ups and downs. From being told they had product-market fit to being told they did not during the pandemic, it has been a learning experience. "It's super hard when it's your baby," Kolton admits, but the key is to keep iterating and improving.

Intelligent Health Checks

Gremlin's latest features focus on intelligent health checks, enabling even those without robust monitoring systems to understand their system's health. "How do we take the expertise that me and a lot of the engineers on my team have learned…and embed it into the product?" Kolton asks.

AI in Reliability

The conversation also touches on the role of AI in reliability engineering. Kolton is skeptical about the current AI capabilities. He believes AI can assist in guidance and analysis but cannot replace the need for deterministic solutions in complex distributed systems.

Kolton's Philosophy

Kolton's closing thoughts are reflective and grounded. He advocates incremental improvement: "do a little better every day." This philosophy, he believes, applies not only to engineering but also to personal development.

Conclusion

Kolton Andrus's journey through chaos engineering and reliability offers valuable insights for anyone in the tech industry. His experiences underscore the importance of resilience, not just in systems but also in navigating the challenges of innovation and acceptance. Tune in to the full episode for an in-depth discussion on the future of chaos engineering and much more.

00:00 Introduction to Chaos Engineering

01:07 About Kolton Andrus

01:25 The Journey of Gremlin

02:01 The Evolution of Chaos Engineering

04:55 Challenges and Misconceptions

11:20 Real-World Examples and Impact

17:08 The Future of Chaos Engineering

21:04 The Expert vs. The Easy Button

21:19 Aligning Incentives for Reliability

22:51 Scoring and Gamification in Reliability

25:25 Industry Adoption and Challenges

28:02 The Human Element in Reliability Engineering

30:44 Reflections on Gremlin's Journey

35:04 Future Directions and AI in Reliability

41:44 Final Thoughts and Philosophy
