The Gremlin In The Machine: How To Achieve Chaos Engineering Like Netflix & Amazon

Audio Transcription:

Intro

In the digital reality, evolution over revolution prevails. The QA Lead approaches and techniques that worked yesterday will fail you tomorrow. So free your mind. The automation cyborg has been sent back in time. TED speaker Jonathon Wright’s mission is to help you save the future from bad software.

Jonathon Wright

This podcast is brought to you by Eggplant. Eggplant helps businesses to test, monitor, and analyze their end-to-end customer experience and continuously improve their business outcomes.

Jonathon Wright

With that in mind, straight on to the show. I’ve got a very special guest for you today. Kolton pretty much spearheaded the Chaos Monkey movement for Netflix and Amazon, and he’s now the CEO of Gremlin. Go check it out. It’s a great episode. Welcome to the show, Kolton. Tell us a little bit more about what you’ve been doing before, at Amazon and Netflix.

Kolton Andrus

Yeah, I mean, we were doing this at Amazon three years before anyone had ever heard of Chaos Monkey. And it’s just the way Amazon is. You don’t talk publicly about a lot of these things. And so it wasn’t yet called chaos engineering. It was just good testing. It was just part of building good systems. At Amazon, every engineer is responsible for writing performant, efficient, high-quality code, and it’s everybody’s problem if your code fails. And so they had a very good culture of ownership around that. But, you know, as with any organization, giving them the tools, making it easy for them to do the right thing, has a big impact. And then, you know, I guess to continue this story, when I joined Netflix, they had just talked about Chaos Monkey. They had released some of the Simian Army. And I didn’t join the chaos team. I joined the Edge Platform team, and our team owned the proxy and the API gateway. If we went down, Netflix went down. And what we needed was better tooling to help prevent that. So one of the things I did three to six months after I joined Netflix was I went and pitched a new style of tool, which was application-based. So imagine, you know, being able to put a cut point around a function and introduce a failure or a delay or throw an exception, and the ability to filter that down to a specific customer or region or device or percentage of customers. That precision allowed us to do more testing, but to do it more precisely, which meant there was less risk and there were fewer side effects, which allowed us to do more of it. After we built that, we got teams using it. We had all the mid-tier services onboard performing this testing, and the end result after the first year of building it and getting teams to use it was that our availability went from three nines to four nines, and the number of times teams were getting paged was reduced by twenty-five percent.
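The application-level idea Kolton describes, a cut point around a function that can add a delay or throw an exception for only a chosen region, device, customer, or percentage of requests, can be sketched roughly as follows. This is an illustrative sketch only, not the Netflix or Gremlin tooling: `Fault`, `injection_point`, `ACTIVE_FAULTS`, and `fetch_recommendations` are hypothetical names, and the request context is assumed to be a plain dictionary.

```python
import random
import time
from dataclasses import dataclass
from functools import wraps
from typing import Callable, Optional


@dataclass
class Fault:
    """Hypothetical fault definition: what to inject and who it applies to."""
    delay_seconds: float = 0.0
    exception: Optional[Exception] = None
    region: Optional[str] = None        # None means "any region"
    device: Optional[str] = None        # None means "any device"
    customer_id: Optional[str] = None   # None means "any customer"
    percentage: float = 100.0           # share of matching requests to affect

    def matches(self, ctx: dict, rng: random.Random) -> bool:
        # Every filter that is set must match the request context.
        for key in ("region", "device", "customer_id"):
            wanted = getattr(self, key)
            if wanted is not None and ctx.get(key) != wanted:
                return False
        # Then sample so only a percentage of matching requests is hit.
        return rng.random() * 100.0 < self.percentage


def injection_point(fault_lookup: Callable[[str], Optional[Fault]],
                    rng: Optional[random.Random] = None):
    """Wrap a function in a cut point that may delay or fail the call."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(ctx: dict, *args, **kwargs):
            fault = fault_lookup(func.__name__)
            if fault is not None and fault.matches(ctx, rng):
                if fault.delay_seconds:
                    time.sleep(fault.delay_seconds)
                if fault.exception is not None:
                    raise fault.exception
            return func(ctx, *args, **kwargs)
        return wrapper
    return decorator


# Hypothetical registry of active faults, keyed by function name.
ACTIVE_FAULTS = {}


@injection_point(ACTIVE_FAULTS.get)
def fetch_recommendations(ctx):
    return ["title-1", "title-2"]


# Fail this call only for requests coming from one region.
ACTIVE_FAULTS["fetch_recommendations"] = Fault(
    exception=RuntimeError("injected failure"), region="eu-west")
```

With this fault active, a call with `{"region": "us-east"}` behaves normally while a call with `{"region": "eu-west"}` raises the injected error, which is the kind of narrow targeting that keeps the risk and side effects of an experiment small.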

Jonathon Wright

That’s a really interesting kind of cultural shift as well because I guess, you know, beyond DevOps, with the kind of ops-staff approach to operational excellence, how does that change? Do you find that operational teams are as involved in defining, you know, what the end state should look like?

Kolton Andrus

Well, it kind of depends. This is one that’s probably interesting to debate here with you, because your audience may have a different opinion than I do. I’m not a big fan of separate QA teams or separate operations teams. I think those are an anti-pattern. If the team that writes the code is also the team that tests the code and is responsible for the quality, and they’re the ones that get woken up if things go wrong, then there’s a real feedback loop, an incentive to do a good job. And the anti-pattern I’ve seen time and time again is: if I throw it over the fence to QA, it’s their problem. They didn’t catch anything? You know, not my job. And the same thing happens in operations. If you throw it over the fence to ops, it’s like, it broke? Well, too bad, you know, those ops guys will take care of it. And that’s really just the wrong attitude. But, you know, oftentimes the software that we wrote plays a big role in that. And there’s a lot of things we can do to prevent it, and it becomes a question of, you know, is high-quality code important to you and your organization, or shipping features as fast as you can? The last few years it’s definitely been just ship everything and let it hit the wall and see what sticks.

Jonathon Wright

Yeah. Stuff that I’ve seen. You know, I kind of come from a performance engineering background originally. And, you know, a good friend of mine, Todd DeCapua, wrote a book for O’Reilly called Effective Performance Engineering. And he’s kind of dedicated his life to it. He was with a company called Shunra, which did network virtualization, and then he just finished at JP Morgan and is now an evangelist at Splunk. And it was always something that felt really hard to understand: who owns that in the lifecycle? Because, yes, there are these new roles like site reliability engineering, and I’m hearing kind of NoOps coming through. Who really is involved in a team which is doing chaos engineering?

Kolton Andrus

Yeah, I mean, NoOps is a pipe dream like serverless is a pipe dream, right? Yes, there are advantages to those approaches. But to think that your software doesn’t need to handle a network failure or delay, or that you don’t need to be responsible for the user experience, is wishful thinking, in my opinion. Who should own it? It’s interesting, because I watched this back-and-forth debate at Netflix as well. They had a cloud operations reliability team, the CORE team, and it was their job to handle incidents, to handle the postmortems, to work with teams. And so for teams that weren’t very good at this, it was always a question of, well, what if we embed an engineer on that team? Can we make them better? And so I watched them, and a couple of other companies, experiment with that a few times, and it goes one of two ways. Either that person is truly embedded, which means they have some project work, they’re beholden to some of the same deadlines, they feel like they’re on the team. And then I think you start to see the right behavior: they can teach the team and help them do better and pick up some of those responsibilities. Or you see the other approach, which is more of a consultative approach, which is like, hey, I’m going to show up every week or two and you tell me what you’re doing and I’ll tell you what you’re doing wrong and how to go fix it. And the truth is, you know, it’s change from the outside versus change from within. And while, you know, if you have a team that doesn’t have a lot of ego at play, they may listen well and learn and operate in that environment, I often see that that approach just causes some friction, some back and forth that doesn’t need to be there. So, again, there’s who I think should own it, and there’s, you know, what’s the right thing for the organization as it grows and scales.
We had this debate on Slack last week, because it was like, you know, is the SRE team the right approach? Well, it’s seductively easy to think, I’ll just make this one person’s problem, one team’s problem, and they’ll go fix it. But I think that does a disservice to what quality software means. You know, no, you can’t just have one team come in after the fact and make sure that everything is up to high quality. That’s like, cool, we already built the building, and now we’re going to bring in some people afterwards to make sure the foundation is sound and that it passes all the tests. And it’s like, well, in some cases, it is way too late. You’ve already made decisions that have put you down this path. And so, you know, is that team going to have to come in and bolt things on the outside to try to make it better, or are they going to be able to actually fix it?

Jonathon Wright

Yeah, it’s a really interesting world. You know, one of my pet peeves with DevOps was this lack of operational involvement early on in the lifecycle. You know, at ideation, in the definition of deployable or whatever they wanted to put in place, asking, well, actually, will this scale? You know, I saw a really interesting keynote where she was saying, well, why do they keep on choosing these serverless patterns? You know, now they’re talking about nanoservices. Is it always the right pattern to go straight after? Are you thinking about the end state? And one of the interesting things I came across quite recently: last week I was listening to a Gartner analyst who was saying there’ll be no I.T. projects going forward. What he was trying to say is that there’s going to be direct involvement from the business, taking ownership of driving the value stream and coming into the projects that get delivered, rather than I.T. owning that capability and being the enabler for the business. And that kind of felt to me like the switch which Google had done to product engineering, bringing them together so that there is not a gap between product and engineering, but taking it a little bit further as well. What are the KPIs that are important to the business? You know, what do they want as far as quality? Do they understand what quality they’re going after? Do they want to ship with some fidelity? Is it more kind of lean startup, you know, rapid prototyping, fast feedback? Or are they actually looking to build something that, you know, is going to be a core capability?

Kolton Andrus

Well, and the tradeoffs between those two as well. You know, hey, we really need to just do a quick hack and see if this works, get a little bit of understanding, and do some experimentation. The level of quality or resilience built into that might be different than, oh, this is gonna be a core pillar of our platform or a core piece of our service that every customer is going to rely on. And that’s one we’ve been talking a little bit about recently as well. What’s the right tradeoff? What’s the right balance of reliability, and what’s the right investment you should make? If you’re a startup of 20 people, you know, should we have one person whose job is just reliability? That may not make sense at that scale. But if you have 100 or 200, it probably makes sense that, you know, somebody is passionate about it and helping the team, but as part of the team. Whereas, yeah, once we’re at five hundred or a thousand people, or a much larger company, then really we need everyone to invest in it. And so it gets to that world where it’s like Amazon, where performance and reliability are everyone’s problem. And if you can’t meet that, you know, we’ll find some people to help you, or, you know, we have resources in the organization. But again, I guess that goes back to the point that it could be seductive to say, hey, I’m just going to allow one team to solve this problem for all of Amazon. And the truth is, that had better be a pretty damn big team that has a lot of time to go talk to everybody, because you’re not going to have an impact when you’re trying to move a ship that large with so few people.

Jonathon Wright

And if you’re looking at it as, I guess, a site reliability engineer, you’re talking about technologies that are maybe forward-facing, too, from a CX perspective. You know, I just finished a chapter for a book on digital experience analytics and this idea of cross experiences, so you’re talking about an experience which goes across multiple products, for instance. So you might use Stripe. But actually, there may be all different APMs, and you kind of want to be able to see, well, what is Stripe doing? What is Worldpay doing? What is PayPal doing? Are they pushing out metrics onto Kafka or something to give you that insight? Because we can’t own that part; that’s not our side. Those are actually upstream and downstream systems, and I want to see all of those. And I guess where your product sits, from an experiment point of view, is, you know, there’s a certain reach which you can get to if you’re trying to, you know, get to systemic failure or something to see how systems come back online. But, you know, with third-party gateways like PayPal and the rest, how do you deal with that? Do you look at service virtualization or, you know, is it more than that? How do you reenact those failures?

Kolton Andrus

Yeah, I think that’s really the bridge between what I would call traditional testing and modern testing. In traditional testing, you know, most of the world is in our control, and between good unit tests and good, you know, internal tests, we can validate a lot of things. But every system, or I would argue the majority of systems today, are distributed systems. They rely heavily on, they do not work without, online access to these third-party dependencies. Whether it’s Stripe or S3 or, you know, an authentication gateway, whatever the piece is, it has to be there. And so what happens when those pieces fail? Well, I think that’s where the gap is today. We do a good job testing our own code. We do an OK job of kind of our integration and smoke tests. And then we get into these live environments with third-party dependencies, and we hope and we pray. And the truth is, when one of those fails, that shows you the weakness in your system. And that’s why, you know, it’s been a couple of years, but when S3 goes down in a region, you see a whole part of the Internet have an impact, because people didn’t understand: oh wait, I store all of my build artifacts there, so I can’t push new code. Or I store all of my web site’s CDN caching in S3 and I can no longer serve cached content. Or the load that comes to my web site, because I can no longer serve that cached content, I can’t keep up with. And so at Gremlin, one of the things that we focus on is, what are the side effects of those systems? So if there’s one thing I would tell every engineer to go test, it’s what happens when one of your dependencies fails, black or white, and then what happens when it slows down. And that will teach you a lot about the kind of thresholds that you need, the timeouts, the fallbacks, the graceful degradation to have in place.
Because if you can take a critical failure and turn it into a non-critical failure, and someone doesn’t have to get woken up in the middle of the night, and the customers have a reasonable experience until you’re able to fix it, that’s a lot better than just a hard failure. An example of that from the Netflix days: we worked very closely with the service that recommended movies for people to watch. And there’s a couple of different customer use cases at Netflix. One is, I know what I want to watch, just let me go find it. And there’s one where I’m trying to figure out what I want to watch and I want to go explore. If you’re in the former use case and we can’t recommend things to you, who cares? Let them just watch the movie they want to watch. And, you know, OK, so in the UI maybe we need to fill in some content there, but we can use a cached list of the top 100 titles instead of the customized recommended ones. And that’s OK for that use case. Now, for that other use case, you know, maybe it’s less OK, and maybe we need to work something out. But that allows us to have a discussion with the business about what’s the right user experience and what do we want the right behavior to be. And in my opinion, if that discussion isn’t coming up as part of the quality of our software or how we’re handling failure, or we’re taking kind of a naive approach, we’re going to be surprised, because we haven’t actually thought about what our customers are seeing and what our users are going to tweet about when it goes wrong.
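To make the timeout-and-fallback pattern from the recommendations example concrete, here is a minimal sketch. It assumes a thread pool is an acceptable way to bound the wait, and every name in it (`TOP_TITLES`, `with_fallback`, `home_page_rows`, and the three stand-in services) is hypothetical rather than anything from Netflix.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a cached list of generally popular titles (assumption).
TOP_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]

_pool = ThreadPoolExecutor(max_workers=4)


def with_fallback(call, fallback, timeout_seconds):
    """Return (result, degraded). Use the fallback on timeout or error."""
    future = _pool.submit(call)
    try:
        return future.result(timeout=timeout_seconds), False
    except Exception:
        # Covers both a slow dependency (timeout) and a failed one (error).
        future.cancel()
        return fallback, True


def personalized_recommendations(user_id):
    return [f"pick-for-{user_id}-1", f"pick-for-{user_id}-2"]


def slow_recommendations(user_id):
    time.sleep(0.3)  # simulates the dependency slowing down
    return personalized_recommendations(user_id)


def failing_recommendations(user_id):
    raise ConnectionError("recommendation service unavailable")


def home_page_rows(user_id, service, timeout_seconds=0.05):
    titles, degraded = with_fallback(
        lambda: service(user_id), TOP_TITLES, timeout_seconds)
    return {"titles": titles, "degraded": degraded}
```

Calling `home_page_rows` with the slow or failing service returns the cached list with `degraded` set to true, so the page still renders and the failure becomes non-critical. The `degraded` flag is one possible hook for the business discussion about which use cases can tolerate the generic list.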

Jonathon Wright

I think it’s fascinating because, you know, we’ve talked about things like failing forward and self-healing. And, you know, that never really became a reality for most organizations. They just weren’t at that maturity level. And they maybe implemented feature flagging or something fairly basic. But, you know, with your kind of tool, dealing with experiments, could you explain how it gives people a green light on what healthy looks like? You know, what the potential risks are associated with releasing a particular build, for instance.

Kolton Andrus

Well, it’s all about understanding those risks and those unknown unknowns. You know, you want to find out and be aware of what these tradeoffs are so you can make an intelligent decision, because if you just ignore them, again, you’ll be surprised. So giving people that opportunity to go in and see it behave in the real world allows them to have a much better discussion about it, about what could happen and what those tradeoffs are.

Jonathon Wright

And, you know, as architecture changes... I’ve dealt with quite a few startups at the moment doing kind of AI as a platform, for instance. And, you know, they don’t seem to deal with this idea of failure, of resilience, very well. You know, they maybe stick a simple API at the front of Kafka, and Kafka’s going to something like a graph DB, like Neo4j. And then you hit the API too hard and you start getting a return code which is unhandled. You know, in the old days with an ESB, it used to be kind of military-grade. Everything could go down and the message would still be there. Whereas if you can’t actually get onto Kafka, it doesn’t matter if the producers or consumers are up. These kinds of streaming services seem to have a certain level of failure built in, and not that level of resilience which we’re kind of used to with, you know, TIBCO or something like that.

Kolton Andrus

Yeah, I mean, that’s why we use those messaging services, right? In many ways, I remember my exposure to them 15, 20 years ago. That was my first real distributed systems program. You know, what does “only once” or “at least once” mean? And what are the tricks? You know, how are we able to pass this information around so that if that piece fails, the overall system keeps operating? And you’re right. If we obscure that, if we just throw it behind an API and we don’t carry that resilience principle forward, then the API may negate a lot of the value around that.
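As a rough illustration of what "at least once" means in practice: the broker may redeliver a message when an acknowledgement is lost, so the consumer has to tolerate duplicates. The sketch below simulates this in memory; it does not use Kafka or any real messaging API, and `IdempotentConsumer`, `deliver_at_least_once`, and the message shape are assumptions made for the example.

```python
class IdempotentConsumer:
    """Applies each message at most once by remembering processed ids."""

    def __init__(self):
        self.seen_ids = set()
        self.balance = 0

    def handle(self, message):
        if message["id"] in self.seen_ids:
            return False  # duplicate delivery, safely ignored
        self.balance += message["amount"]
        self.seen_ids.add(message["id"])
        return True


def deliver_at_least_once(messages, consumer, ack_lost_for):
    """Simulated broker: redeliver any message whose ack was 'lost'."""
    deliveries = 0
    for message in messages:
        consumer.handle(message)
        deliveries += 1
        if message["id"] in ack_lost_for:
            # The broker never saw the ack, so it sends the message again.
            consumer.handle(message)
            deliveries += 1
    return deliveries


messages = [{"id": "m1", "amount": 10}, {"id": "m2", "amount": 5}]
consumer = IdempotentConsumer()
total_deliveries = deliver_at_least_once(messages, consumer, ack_lost_for={"m1"})
```

Here the first message is delivered twice, yet the balance only reflects it once. An API placed in front of the queue would need to carry the same idea forward, for example by accepting a caller-supplied message id, or the resilience of the queue is hidden behind a layer that cannot be retried safely.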

Jonathon Wright

And to me, you know, you talk about releasing faster. I think this is one of the challenges. You know, we’ve got into this habit of releasing quickly and early. And, you know, we’ve never really been able to trace it all the way back to potential brand damage, for instance, from something going down, like Disney+ when it launched in the US and straight away it went down. You know, do you find that when you’re talking to customers about brand damage and how important chaos engineering is, they recognize the value?

Kolton Andrus

I think a lot of engineers are disconnected from the business value. And one of the things that I try to coach anyone interested in this space to do is tie their work back to dollars and cents for the business. You know, it may be a brand impact, which is measurable, if a little more ephemeral. But, yeah, if your site fails often, you will have the perception of being a lower-quality brand. If people go to use your web site and it fails, in this day and age, they’ll just go to your competitors, and you’ll see that. And I saw that at Amazon or Netflix: after failures, there’s a drop in usage. There’s an abandonment by customers. So I think that’s definitely an important piece. But there’s also, you know, hey, while we’re down, we can’t generate revenue, or we’re unable to process transactions. I’ve seen a lot of focus in this space in the financial industry, because if you lose those transactions, some of those things cannot happen later to the same effect. You know, if you’re a trading application and you’re down, and the market has a huge improvement or detriment and your customers are unable to take advantage of that, they’re going to be pretty mad at you and at your brand and your platform. And that’s not going to help in the long run. And then there’s what I call the tip of the iceberg, because we look at the brand impact or the loss of revenue due to an outage. But what also just happened was, you know, dozens of our engineers have been tied up dealing with this outage for however long it takes, and everything else has been put on the back burner. And then once it’s triaged, we’re not done.
There’s a whole amount of work around understanding what happened, putting together the timeline, making sure that we understand as many of the contributing factors, both people-wise and technology-wise, as we can, having a meeting about it, discussing what we’re gonna do to fix it, and ultimately, hopefully, going and fixing the things that would have led or contributed to making that outage worse. But honestly, by that point in the process, you know, engineers have invested dozens of hours, and they may just, you know, have to move on to get back to feature work, to get back to their, quote, unquote, day jobs. And so does the system actually get improved, or do we just kind of patch it and move on? And I think that’s where you see the difference with organizations that prioritize reliability as a first-class citizen. It’s not that they can do everything, and you can’t build military-grade software when it’s unneeded. But they take that time and investment to make sure it works well, that they learn from past failures, and that, you know, it’s part of that feedback loop. They’re taking that into the code that they’re writing. It’s not that we learned, you know, some operational things that don’t matter. We’ve learned how to be better engineers, because we understand now that the way that we designed that system has tradeoffs that we’re unhappy with.

Jonathon Wright

Yeah, I can’t agree more. You know, when I did my TED talk, I talked about cognitive learning with this kind of view of shift right, which is something I’ve been talking about for probably the last eight years. And, you know, the idea was learning from the real world. And I’ve spent a lot of time doing this, at the moment for Brexit, for the UK government. It was around taking real digital experience analytics on the right-hand side, taking all that multi-transactional data and doing something with it. So understanding the models and understanding what’s going on and the changes in behavior, but being able to do that in, you know, hours and not days, and being able to refresh that, to see how things change, how behaviors change with the end users. And, you know, especially when you’re releasing a set piece of functionality, people don’t seem to have that visibility. We’ve got tools. You know, obviously, people are talking about AIOps as kind of applying some of those basic concepts on the right-hand side. But how much does that actually inform the left-hand side of the development cycle? You know, how many of those are tickets that are then prioritized as work that isn’t a new feature? Is it actually, you know, like you said, not just sticking a bandage over it, but realizing maybe you’ve got a serious architectural or, you know, scalability issue?

Kolton Andrus

Well, that’s where, you know, I think the principles around DevOps really apply. How do we take what we’re learning and factor it into the process earlier, so that the time to detect it and the time to fix it are much lower? And yeah, maybe one of the things got through to production and it caused some customer-facing pain. But if we can take that and make sure we’re testing for it in our pipelines, where we’re going to a staging environment and doing some, but maybe not the most comprehensive, tests, so that we have an opportunity to catch some of these things, or, you know, we’re doing a canary deployment, or we’re doing these isolated scheduled experiments on live systems, then we’re catching it as soon as we’ve pushed it or as soon as we can in the process, because then our engineers can go have a discussion and correct it and make sure that we’re doing the right thing earlier. I completely agree with you there.

Jonathon Wright

And as far as things like launching, I know Amazon did quite a lot with kind of checking new features out and understanding the behavior a little bit earlier, before even thinking about some kind of canary rollout. Do you see that kind of controlled experiment in production happening?

Kolton Andrus

Yeah, yeah. I mean, look, feature flags are a smart way to roll out things that you’re experimenting on and collect data from them. And that to me was the key part. Amazon had a very sophisticated way of measuring these A/B tests to understand, you know, all the potential impacts for customers. But those are temporary things. Once that feature works, you want it merged into the code base, so you don’t have multiple code paths, so you don’t have multiple potential bugs or issues floating around. And so while you might be able to use that feature flag to test it there, ultimately it’s going to go away. But, you know, what is the artifact that enforces the learning and ensures that things are behaving correctly? And to me, that’s where you have these experiments that you’re running on a regular basis that are highlighting what those impacts or changes might be. And I just want to touch on, you mentioned live systems. As a practitioner, as somebody that’s been on call for many years, I’m always a little frustrated when I hear the theoretical approach: well, we’re going to talk about all the things that could happen, and then we’ll be prepared. And the truth is, there are hundreds of thousands of lines of code in our system that we don’t own, that we didn’t write, whether it’s at the operating system level, these dependent systems, or the frameworks that we’re using. And so there’s a bit of hubris in thinking that we understand all of the side effects and all of the ways the system is going to behave. And again, it’s trying to think of our code in isolation, like we’re running it in our dev environment and our IDE with our unit tests, versus the actual live system. And those thread pools, those timeouts, the routing logic, the fallback logic, the ways in which a real production system operates: there’s a whole class of failures, a whole class of things that you will never find if you’re unwilling to test in a real environment.
And so that’s fine. Like, you know, chaos engineering at its heart is all about risk mitigation. And so if we can take a small risk and prevent a much larger outcome, great. You know, anything we can find in dev or in staging or early on is going to be a win. But there’s a whole class of things we’ll never find there. And so do we want to be surprised? It’s a little bit of the ostrich putting its head in the sand as well, where some teams say, you know, hey, we’re not very good at this, and we know things are going to break, so just don’t even tell us. Don’t even tell us how bad it is. We don’t want to shine a light under the stairs and see the spiders scatter. We’d like to just assume that the spiders aren’t there. But the truth is, in the middle of the night, the spiders are going to come creeping out. And would you rather know about them during the day or find out about them in the middle of the night?
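The "small risk to prevent a much larger outcome" idea, run as a recurring experiment, can be sketched as a loop that checks a steady-state metric, injects a fault, watches the metric, and always rolls back. This is only a sketch under assumptions: `run_experiment` and `FakeService` are hypothetical, the error-rate threshold is arbitrary, and a real system would read the metric from its monitoring stack rather than from a fake object.

```python
def run_experiment(inject, rollback, read_error_rate, max_error_rate, checks=5):
    """Run one small experiment; abort and roll back if steady state breaks."""
    if read_error_rate() > max_error_rate:
        return "skipped: system unhealthy before start"
    inject()
    try:
        for _ in range(checks):
            if read_error_rate() > max_error_rate:
                return "aborted: hypothesis disproved"
        return "passed"
    finally:
        # The fault is always removed, whether the experiment passed or not.
        rollback()


class FakeService:
    """Stand-in system whose error rate depends on whether a fault is active."""

    def __init__(self, error_rate_under_fault):
        self.fault_active = False
        self.error_rate_under_fault = error_rate_under_fault

    def inject(self):
        self.fault_active = True

    def rollback(self):
        self.fault_active = False

    def error_rate(self):
        return self.error_rate_under_fault if self.fault_active else 0.001


resilient = FakeService(error_rate_under_fault=0.002)
fragile = FakeService(error_rate_under_fault=0.2)

resilient_result = run_experiment(
    resilient.inject, resilient.rollback, resilient.error_rate, max_error_rate=0.01)
fragile_result = run_experiment(
    fragile.inject, fragile.rollback, fragile.error_rate, max_error_rate=0.01)
```

The resilient service passes and the fragile one aborts on the first check, and in both cases the fault is removed afterwards. Scheduling a run like this during the day is one way to find the spiders before they come out at night.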

Jonathon Wright

I think that’s a really interesting point. You know, a good friend of mine who’s coming on the podcast this week, Huw Price, has set up a number of data engineering companies. And, you know, we were chatting and he was talking about how maybe only 15 percent of the code is actually going to be exercised in production on a normal day. And, you know, if you think about the old ETL, the extract, transform, load kind of days, do you go through every single possibility? And we’re finding that at the moment, in the sense that I’m doing a project with MIT at the moment, COVID Safe Pass. And part of that is we wanted to be able to prove, in the wild, that things are going to work. Now, obviously, this is a massive challenge, in the sense that the stuff I’m working on today was around the rollout that we’re doing next week in Boston. And for part of it I needed a whole stack of routes going through Boston, so I’m synthetically generating five billion historical transactions of people going around the city, interacting, sometimes coinciding for a small amount of time, based on proximity through Bluetooth but also GPX. And it’s really challenging, because that data is not real; we’ve not got real people walking around using our app at the moment. So, therefore, we’re having to synthetically generate and model all these different permutations of transportation. You know, the metro, people on bikes versus people on buses and public transport. And it’s a huge amount of data, and I don’t know how accurate it is. Do you see people creating realistic data or not? You know, the temptation was always, you know, take production and obfuscate the data, or generate all permutations, all valid datasets, based on exhaustively exercising the code. Do you see that happening with organizations at the moment?

Kolton Andrus

I mean, most of what I’ve observed has been organizations that have their blessed set of data that they can replay. But as you mentioned, I think there’s a lot of issues with that. There’s kind of an 80-20 where, if that’s the best you have, great, you know, it’s better than nothing. But the problem will be, one, to try to predict all the combinations of things that can go wrong. Well, the combinatorial explosion of how you’re combining all those variables gets out of hand very quickly. I’ve helped to manage and groom those kinds of load and performance tests in the past, and they become unwieldy. But the second piece is, you know, they likely don’t capture the nuance of real behavior. And so one of the things I really enjoyed at Netflix was the idea that while we might have some of those tests that we’re running in staging before we hit production, ultimately we’re gonna favor canarying on a subset of real traffic to get a taste for that. Now, again, your circumstance here, you know, you’re doing it in an unprecedented time where there’s just a massive change in social behavior. So I won’t get on my soapbox here, because in your world, it sounds like, look, that’s what you’ve got to do. But in general, I’m much more in favor of real traffic. It’s the parallel I’ve drawn: we used to do performance testing with tools like Gomez or the like, where we would sit in PoPs and we would load pages and we would look at the Gantt charts and be like, OK, this tells us kind of what the user experience looks like for this part of the country or based on this kind of Internet connection. But then we moved to real user monitoring, because once we were able to just process the data, you know, at scale, seeing the real user behavior mattered a lot.
And when I did performance at Amazon, there were absolutely new use cases that were introduced that we caught, and new bad behavior or new side effects that we saw by looking at the real data, that we never would have caught with the synthetic data. So, 80-20: synthetic data is a good starting point, but my bias, just as with live systems, is to use live data. 
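
To put a rough number on the combinatorial explosion Kolton mentions, here is a small Python sketch. The test dimensions and their values are hypothetical, chosen only for illustration; nothing here comes from Amazon's, Netflix's, or Gremlin's tooling.

```python
from math import prod

# Hypothetical test dimensions; none of these values come from the conversation.
dimensions = {
    "region": ["us-east", "us-west", "eu", "apac"],
    "device": ["ios", "android", "web", "tv"],
    "network": ["wifi", "4g", "3g", "offline"],
    "dependency_state": ["healthy", "slow", "failing"],
    "account_type": ["free", "paid", "trial"],
}


def combination_count(dims):
    """Number of test cases if every value is combined with every other value."""
    return prod(len(values) for values in dims.values())


total = combination_count(dimensions)
print(total)  # 4 * 4 * 4 * 3 * 3 = 576

# Adding a single five-valued variable multiplies the total by five.
dimensions["app_version"] = ["v1", "v2", "v3", "v4", "v5"]
print(combination_count(dimensions))  # 2880
```

Even with six small variables, an exhaustive replay set is already in the thousands, which is one way to read the argument for sampling real traffic instead.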

Jonathon Wright

And this is another interesting challenge I find with some real user monitoring scenarios: again, they don’t really represent real devices. My example here is COVID Safe Pass: you put a device on a particular carrier and, like you said, you’ve got jitter, and you could use OpenSignal or something to understand people’s experience of packet loss and all that kind of stuff. But when people set up real user monitoring, it’s a very simple scenario. Maybe, like you said, it’s DOM rendering, some kind of headless Selenium or something to give you an idea through a browser proxy. But it’s not a physical device. Using emulators is very different from using real devices in the field. You’ve got the challenge of looking at your watch, which is connecting to your phone, which is connecting to real data over Bluetooth or whatever it may be. It feels like there’s a missed opportunity on the ops side to have more realistic user monitoring with not just the happy path and some negative cases, but more sophisticated scenarios that play back real journeys. Do you think that’s something you’re seeing, or could envision happening? 

Kolton Andrus

Well, perhaps my understanding of real user monitoring might be off. But what we moved to at Amazon was most definitely only recording actual behavior. It was JavaScript running in the user’s browser, or it was data collected from their phones or from their devices, so it wouldn’t be emulated. It would be the actual interactions. And the key there is what that uncovered, even from an ops side. Yes, you have to instrument it well. For the team that did that analysis and built that collection, there was a lot of work to make sure we were collecting and measuring it correctly. But from an ops perspective, we could come in on a morning and see a shift in some of the population around their p90 for latency. We could drill into it and see that it was region related, or related to an Internet provider outage in that area. It provided a lot of questions that guided us to more interesting answers than we would have had without it. 
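
As a rough illustration of the kind of drill-down described here, the sketch below groups latency samples by region, computes each region's 90th-percentile latency, and flags regions whose value shifted against a baseline. The sample data, the function names, and the 1.2x threshold are all assumptions made for the example, not a description of Amazon's actual pipeline.

```python
from collections import defaultdict
from statistics import quantiles


def p90_by_region(samples):
    """Group (region, latency_ms) samples and return each region's 90th percentile."""
    grouped = defaultdict(list)
    for region, latency_ms in samples:
        grouped[region].append(latency_ms)
    # quantiles(..., n=10) returns nine cut points; the last one is the p90.
    # It needs at least two samples per region.
    return {
        region: quantiles(values, n=10, method="inclusive")[-1]
        for region, values in grouped.items()
    }


def shifted_regions(baseline, current, ratio=1.2):
    """Regions whose p90 exceeds the baseline p90 by more than the given ratio."""
    return sorted(
        region
        for region, p90 in current.items()
        if region in baseline and p90 > baseline[region] * ratio
    )


# Hypothetical samples: "west" picks up one very slow request today.
normal = list(range(100, 200, 10))
yesterday = [("east", ms) for ms in normal] + [("west", ms) for ms in normal]
today = (
    [("east", ms) for ms in normal]
    + [("west", ms) for ms in normal[:-1]]
    + [("west", 900)]
)

print(shifted_regions(p90_by_region(yesterday), p90_by_region(today)))  # ['west']
```

A flagged region is only a starting question (is it the region, a provider, a release?), which matches the point that the data guided the team toward better questions.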

Jonathon Wright

And I guess what you’re talking about touches on this concept that has been talked about recently, the digital twin. Let’s take a Tesla as an example. You would expect that there is, I’m not sure what to call it, a secondary system running at the same time that can pick up if the computer vision isn’t able to react in the time it needs. As we’ve seen in the past, they do over-the-wire firmware updates with new functionality, which can be quite prototype. But there could be a secondary system keeping an eye on everything, as you would have in a jet or a plane. So if that functionality can’t make the decision, or, like you said, a misclassification of an object leads it to decide it’s going to flip the plane upside down or something, the secondary system kicks in, because the other one is the more experimental build. Do you foresee that we could get that sophisticated with the edge devices we’ve got, mobile phones with the power to run an experimental build alongside something else and capture all that data to feed back? If you opt in for new functionality or new features, could that be something we start seeing, this digital twin approach to rolling stuff out? 

Kolton Andrus

Yeah, I mean, I think obviously we have that capability to have beta versions that some people could use, which allow us to collect more information about those use cases. The Tesla or the airline analogy is an interesting one. I had a friend who joined the Navy and worked on avionics electronics. This was when I was much younger, but he described to me the redundancy in these systems, the number of separate individual systems, and the fact that sometimes these systems have to come to a consensus independently in order to make important decisions. I think that shows the level of depth and sophistication that’s lacking from a lot of our consumer-facing hardware and electronics. Whether it’s not as needed or it’s not as mature, we’re not taking that depth of approach to have truly redundant systems that can evaluate those decisions. But I think, as with any engineering, the question is: what’s the right tradeoff? Is it needed? That really comes down to the cost of a failure. What is the impact on a person or the population when things go wrong? If you can’t buy something on a website, that’s unfortunate, but it’s an inconvenience. If you’re in the financial world and someone can’t pay their mortgage or is unable to access their money at a time they need it, that might be a critical type of failure. And if we’re talking about things like self-driving cars and airlines and drones that are doing really important things that could have an impact on people’s safety, that’s a whole other mode. So I think it’s a question of: what is the cost of being wrong? What’s the right level of investment? Then you can work backward to what we need to do to make sure we’re mitigating those risks as best we can. 
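
One generic way to picture independent systems reaching a consensus is simple majority voting across redundant units. The sketch below is only an illustration of that idea under that assumption; it is not a description of how any real avionics or automotive system works, and the function name and decision labels are made up for the example.

```python
from collections import Counter


def majority_vote(decisions):
    """Return the decision backed by a strict majority of units, else None."""
    if not decisions:
        return None
    value, count = Counter(decisions).most_common(1)[0]
    # A strict majority is required; a tie or a three-way split yields no decision.
    return value if count > len(decisions) / 2 else None


# Three hypothetical redundant units each report a decision independently.
print(majority_vote(["climb", "climb", "descend"]))  # climb
print(majority_vote(["climb", "descend", "hold"]))   # None
```

Returning `None` when there is no majority leaves the fallback to the caller, which is where the cost-of-failure tradeoff discussed here would be decided.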

Jonathon Wright

I think you touched on this earlier as well, and I know we’re at time: the cognitive bias in how these systems are designed in the first place to interact with the end user. Something that’s bothered me for a very long time is that the same user interface is given to every person, even though there are multiple personas who interact with technology in different ways. That mass customization of functionality never really seemed to come through. We didn’t give people the capability to personalize things and have the system learn from their experiences and interactions. It’s the same interface that everyone’s seeing, the same menus, the same dropdowns. It feels like there’s a lot of work to be done. And this is where chaos engineering is really about playing out all those possible scenarios beforehand, to at least mitigate the risk. Yes, if we can’t buy a product, that’s fine. But to me, there are so many different states. One of the things I just can’t get my head around at the moment is that if your phone loses connection, you’ve got something like React and, unless it caches it, it’s not prompting you to say, well, actually, I can operate in an offline mode with things cached, no problem, you can still do your shopping. Instead it comes up with the “I can’t connect to the Internet” message that we all get. It feels like that is maybe down to some of the frameworks people use and their approaches.
As those evolve, the complexity is also going to evolve with all the possible permutations of how end users can interact, and that’s before introducing things like your virtual personal assistants, your Alexas and network endpoints, and the misclassification of what you say. There are so many different variations that I don’t think people test. On your example with the happy path, I had a developer mention the concept of not-happy paths, and I thought to myself, well, what does that actually mean? Are you trying to say a sad path is something where you’re expecting a bad experience? And if so, how are you actually building that up? Because your tool seems like it’s ready to take those bad experiences and turn them into good experiences for software design. 

Kolton Andrus

Yeah, yeah. The pessimistic view is that the complexity of our systems is only going to increase in the short term. One of my favorite jokes is: everyone wanted to be like Amazon and Netflix. Congratulations, you have their problems. You have these very distributed systems that are highly available, that customers rely on and expect to always work. The optimistic side is that there’s an 80-20 here: there’s a handful of things that we can go out and test that will give us a lot of comfort and unearth a lot of the easy-to-find things that we can readily go improve in our systems. So that’s how I would take it: this is an opportunity for us to spend an hour a week, or an hour a month, and dig in and find these edge cases, so we’re not surprised down the road, and ultimately save ourselves a lot of time, a lot of customer pain, a lot of brand reputation. Then, down the road, when we’ve gotten good at it, yes, to be excellent, to be perfect, is going to require a lot more time and investment. But the fact of the matter is that, where most teams at most companies are today, there’s a lot of low-hanging fruit. If they are willing to invest a small amount of time, they’re going to see a huge benefit to their efficiency, to their understanding of their systems, and to the happiness of their engineers when they’re not being woken up. I’m the grumpiest person when I don’t get enough sleep at night. The days when I had to hop out of bed at 2:00 in the morning and hop on a call for an hour in the middle of the night ruined my next day or two while I tried to fix my sleep schedule, think clearly, and get work done. 
And so I guess my optimistic side is that there’s a lot of improvement we can make to go along with that pain we’re feeling, which will help us manage it in a way that doesn’t take over our lives but instead just makes us better engineers and helps us get our jobs done. 

Jonathon Wright

Yeah, that’s fantastic. It’s been great to talk to you, and I love what you say there. One of the things I keep thinking about around social analytics is that you’ve got all this information from people actually giving you free, real-time feedback on Twitter and elsewhere, where you could run some sentiment analysis, pick those things up, bring them back, and say, oh, this particular user had this particular problem happening here. Can we get to the point where we can do that kind of pinpoint failure analysis all the way back, a kind of diagnosis engine that can say, well, actually, your experience happened because your connection to your local ISP dropped, or it was another layer, or it was your account, or it was something else? It feels like we’re nearly there. There’s a lot of information that people are telling us out in the real world, which we can start feeding into what we’re building and into those scenarios, to say, well, let’s make sure we check that, let’s check what happens under similar circumstances. Then at least we know we’re building up resilience under those kinds of scenarios. Do you feel that’s where we’re at? Or do you feel that connection with what’s happening in real life is still lost, and everyone is waiting until something really goes wrong before stepping in, when maybe some preventative steps could avoid some of the issues we’re seeing in the media? 

Kolton Andrus

Well, I think the way you described it, we use the same terminology within our product. There’s a set of scenarios that are fairly generic, that everyone should probably be prepared for: things like region evacuation, failure of a dependency, or making sure auto-scaling groups will work correctly. But what we’ve learned is that there’s another level down of scenarios: things that are related to e-commerce companies or finance companies, things that are related to Kafka, or things that are related to message queues. So what we’re doing is taking these learnings from all of our customers and building them into a catalog of scenarios, so that a new team can come in and ask, which scenarios apply to me? Ultimately, this is where we can tell them which scenarios apply to them and how to prioritize them, so that they can say, right now I’m running Kafka, I’m going to go run these five scenarios. And in the end, I will either know that I’m doing Kafka well and I’ve covered some of the common pitfalls, or I’ll find out that I’ve got some holes and some places to go fix and correct. I think that’s one of the ways we can make it easier for our customers to understand where they should focus their efforts and how they should improve their systems. So let’s take the learnings that everyone has and put them together so that we can all benefit from that knowledge. 
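
The catalog idea can be sketched as a small data structure: generic scenarios that apply to everyone, plus scenarios tagged with the technology or industry they relate to. Everything below (the `Scenario` class, the tag names, and the specific Kafka and e-commerce entries) is a hypothetical illustration and not Gremlin's actual product or API; only the three generic scenario names come from the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    # An empty set means "generic": the scenario applies to every team.
    applies_to: frozenset = frozenset()


# Hypothetical catalog; the tagged entries are invented for illustration.
CATALOG = [
    Scenario("region evacuation"),
    Scenario("dependency failure"),
    Scenario("auto-scaling group check"),
    Scenario("broker loss", frozenset({"kafka"})),
    Scenario("consumer lag under delay", frozenset({"kafka", "message-queue"})),
    Scenario("checkout dependency timeout", frozenset({"ecommerce"})),
]


def scenarios_for(stack):
    """Names of generic scenarios plus those tagged with anything in the team's stack."""
    stack = frozenset(stack)
    return [s.name for s in CATALOG if not s.applies_to or s.applies_to & stack]


print(scenarios_for({"kafka"}))
```

A team that declares `{"kafka"}` gets the three generic scenarios plus the two Kafka-tagged ones, which mirrors the "right now I’m running Kafka, I’m going to go run these five scenarios" example.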

Jonathon Wright

That’s awesome. I’m so impressed. I love that idea of people being able to share those patterns, blueprints of things that have gone wrong. It’s like what exists for the security industry: these are the highest-risk ones based on your particular setup. You show what the architecture is, and it will then run those scenarios against those various technologies to give you a bit more reassurance that the system is actually resilient and is going to be able to do the things you expect it to do. And to give you an idea of how your volumetrics are going to grow down the line, it’s going to be fantastic. It’s been so great talking to you; we’ll have to get you back on. Can you tell us a little bit more about how people can get in touch with you guys? What’s the best way to reach out or even get a trial set up for your product? 

Kolton Andrus

Yeah, Gremlin.com. Easy to find. We have a free version of our product, so if you want to go install it and cause some controlled chaos today, you are able to do so without talking to us or any interaction. Of course, our team would be happy to talk with teams out there that are interested. Because it’s a new space, part of what we bring to the table is not just a good product and good tooling, but the advice of practitioners who have been operating systems for the last decade or two. That means sitting down with teams, helping them understand where they should focus their efforts so they can have an efficient approach, and really helping our customers find value. We want to help them find the problems and fix them before they become pain for them. When they win, that’s when we win. So that’s our approach in general. 

Jonathon Wright

Thanks so much. And we’ll make sure we put some links in the show notes to all the stuff that we talked about today. 

Kolton Andrus

This is great, Jonathon, I greatly appreciate the opportunity, it’s been a wonderful conversation.
