Introduction
Welcome to episode 25 of the HockeyStick podcast, where we delve into breakthroughs in tech, business, and performance.
In today's episode, Miko Pawlikowski sits down with Kolton Andrus, a well-known figure in the SRE and chaos engineering space. As the founder of Gremlin and a seasoned engineer, Kolton shares his insights into the evolution of chaos engineering, the challenges it faces, and his thoughts on the future of the industry.
The Journey of Chaos Engineering
Kolton Andrus begins by discussing the foundational ideas of chaos engineering. "It's about taming the chaos," he explains. The primary goal is to find the edges of a system and handle them gracefully so that the system stays reliable. Kolton emphasizes that organizations should invest in reliability because unreliability is often a multimillion-dollar problem.
Shifting Roles at Gremlin
Kolton moved from being the CEO to the CTO of Gremlin. "It's been a journey," he reflects, noting that he felt his talents were best served in a technical role. This shift allowed him to work on product development and address the problems within chaos engineering more thoroughly.
The Importance of Chaos Engineering
Chaos engineering is an emotional topic for many SREs, including Miko Pawlikowski. The practice involves intentionally injecting failures to test how resilient a system is. Kolton highlights that the engineering part is crucial, "because whenever you tell someone I do chaos engineering, they think you're the joker… And that's the mistake."
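For readers new to the idea, here is a minimal sketch of what "intentionally injecting failures" can look like in code. It is an illustration only, not Gremlin's product or API: the names `with_fault_injection`, `InjectedFailure`, `fetch_price`, and `price_with_fallback` are all hypothetical, and the failure is simulated in-process rather than on real infrastructure.

```python
import random


class InjectedFailure(Exception):
    """Raised by the wrapper to simulate a failing dependency (hypothetical)."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        # rng.random() returns a float in [0.0, 1.0)
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    """Stand-in for a call to a downstream service (assumed for the example)."""
    return {"widget": 9.99}[item]


def price_with_fallback(fetch, item):
    """The behavior under test: degrade gracefully instead of crashing."""
    try:
        return fetch(item)
    except InjectedFailure:
        return None  # caller treats None as "price unavailable"


if __name__ == "__main__":
    flaky_fetch = with_fault_injection(fetch_price, 0.5, random.Random())
    for _ in range(5):
        print(price_with_fallback(flaky_fetch, "widget"))
```

The point of the experiment is the fallback path: with failures injected in a controlled way, you can check that the caller degrades gracefully instead of finding out during a real outage.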
The Branding Dilemma
While the concept and technique of chaos engineering are sound, its branding remains a challenge. The term "chaos" doesn't sit well with corporate executives. Kolton shares that although they leaned into the fun branding with Gremlin, it sometimes backfired. Executives want maturity and reliability, not something perceived as "immature."
Marketing and Acceptance
Marketing has always played a significant role in the adoption of chaos engineering. Many organizations found the name off-putting. Kolton notes that "reliability engineering" or "resilience engineering" might be better terms. The focus is on explaining the benefits and necessity of these practices to stakeholders.
Gamification in Engineering
One of the challenges in chaos engineering is getting organizations to adopt it systematically. Kolton mentions creating a rubric and scoring system for services, helping teams see their progress. "If you want people to do the right thing, you need to make it easy," he asserts.
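The episode does not spell out the rubric itself, so the following is only a sketch of how a weighted scoring system for a service could work. The check names, weights, and the 0 to 100 scale are assumptions made for illustration, not Gremlin's actual scoring.

```python
from dataclasses import dataclass


@dataclass
class ReliabilityCheck:
    name: str      # what was tested (hypothetical examples below)
    weight: int    # how much this check counts toward the score
    passed: bool   # result of the most recent test


def reliability_score(checks):
    """Return the weighted share of passed checks on a 0-100 scale."""
    total = sum(c.weight for c in checks)
    if total == 0:
        return 0.0  # no checks defined yet, so nothing has been earned
    earned = sum(c.weight for c in checks if c.passed)
    return round(100 * earned / total, 1)


checks = [
    ReliabilityCheck("survives loss of one host", 3, True),
    ReliabilityCheck("tolerates slow dependency", 2, True),
    ReliabilityCheck("recovers from zone outage", 5, False),
]

if __name__ == "__main__":
    print(f"score: {reliability_score(checks)}")
    for c in checks:
        if not c.passed:
            print(f"next step: {c.name}")
```

A single number per service, plus a short list of failed checks, gives teams a visible next step, which is in the spirit of making the right thing easy.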
The Evolving Landscape
Kolton acknowledges that the gaming industry, despite its need for reliable systems, often lags in adopting such practices. He points out that people generally resist change, especially when it seems complex or unnecessary.
Lessons Learned and Future Prospects
Over the eight years of Gremlin's journey, Kolton has faced numerous ups and downs. From being told they had product-market fit to being told they did not during the pandemic, it has been a learning experience. "It's super hard when it's your baby," Kolton admits, but the key is to keep iterating and improving.
Intelligent Health Checks
Gremlin's latest features focus on intelligent health checks, enabling even those without robust monitoring systems to understand their system's health. "How do we take the expertise that me and a lot of the engineers on my team have learned…and embed it into the product?" Kolton asks.
AI in Reliability
The conversation also touches on the role of AI in reliability engineering. Kolton is skeptical about current AI capabilities. He believes AI can assist with guidance and analysis but cannot replace the deterministic solutions that complex distributed systems require.
Kolton's Philosophy
Kolton's closing thoughts are reflective and grounded. He advocates for incremental improvement: "do a little better every day." This philosophy, he believes, applies not only to engineering but also to personal development.
Conclusion
Kolton Andrus's journey through chaos engineering and reliability offers valuable insights for anyone in the tech industry. His experiences underscore the importance of resilience, not just in systems but also in navigating the challenges of innovation and acceptance. Tune in to the full episode for an in-depth discussion on the future of chaos engineering and much more.
00:00 Introduction to Chaos Engineering
01:07 About Kolton Andrus
01:25 The Journey of Gremlin
02:01 The Evolution of Chaos Engineering
04:55 Challenges and Misconceptions
11:20 Real-World Examples and Impact
17:08 The Future of Chaos Engineering
21:04 The Expert vs. The Easy Button
21:19 Aligning Incentives for Reliability
22:51 Scoring and Gamification in Reliability
25:25 Industry Adoption and Challenges
28:02 The Human Element in Reliability Engineering
30:44 Reflections on Gremlin's Journey
35:04 Future Directions and AI in Reliability
41:44 Final Thoughts and Philosophy
Why you will fail without Chaos Engineering, with Kolton Andrus - HS#25