Testing Microservices: A Sane Approach Pre-Production & In Production

119

At LaunchDarkly we host a monthly MeetUp for anyone interested in testing in production. (If you’re interested in joining us, sign up here.) In January we met at the Heavybit Clubhouse in San Francisco, and had 3 guest speakers talk about how they test in production. Cindy Sridharan (aka @copyconstruct) talked about how she approaches testing microservices—including how to think about testing, what to consider when setting up tests, and other best practices.

Her talk was inspired by an article she wrote in December 2017 on testing microservices. She wrote it in response to an article that showcased a team’s incredibly complex testing infrastructure. The piece described how a team built something that helped their developers run a lot of tests very quickly. And while it helped tremendously, Cindy wondered why it had to be so complex.

“The whole point of microservices is to enable teams to develop, deploy and scale independently. Yet when it comes to testing, we insist on testing *everything* together by spinning up *identical* environments, contradicting the mainspring of why we do microservices.”

Check out her talk below.

TRANSCRIPT

So, my name is Cindy and as Andrea said, you probably best know me from Twitter or from my blog. So, couple days ago, probably was last, the last week of 2017. I wrote this blog post called Testing Microservices the Sane Way. It was a 40 minute blog post and I am still astounded how many people read it. I mean, because I personally do not read anything that’s longer than 10 minutes.

I mean, I can write stuff that’s 40 minutes or 60 minutes long, but I can’t read. I mean my attention span is like 10 minutes. Right, so kudos to everyone who’s read this. So, like I said my name is Cindy. You probably know me on Twitter and pretty much everywhere on the internet as copy construct, and I’m here to talk about microservices.

The main reason why I wrote this blog post was because I read another blog post about someone who described the testing infrastructure. It was, the blog post of a company, it’s fairly largish, but you know we’re not talking about Google, or you know like Facebook or anything. But pretty largish for a San Francisco company, and you know, I sort of read the whole blog post, and they were talking about like you know, they had this pretty cool infrastructure there.

You know, they had like Mesos that was like scheduling Jenkins and EC2 and you know it had like … it was doing bin packing. It was running a million tests, it was like … what else did it do? Let’s see, it was like streaming all the logs to Kafka and then sending all to S3. It was like this incredibly complex system that people had built and basically the whole blog post was about how it just greatly helped the developers to run millions of tests in a very, sort of fast way. Right, which is probably always seen as a positive thing.

When I finished reading that my sort of reaction was like, oh my God, why is this so complex? Why is it that this industry thinks of testing as this thing that needs to, you know, just necessarily be done this way. Where it’s just like, you know, you have a system, which is probably even more complex than an actual production system. That you spin up just to run a bunch of tests. Right? So, once I read that I was like, oh my God. I was just completely surprised, I was just pretty unsettled. I was like why are we doing this and because this is not just one company. Right? Pretty much everyone or everyone attempts to do this or you know get somewhere close to this and bigger companies than Google probably have something like that order of magnitude, even more complex than this.

At least from my personal experience, trying to build these kind of systems, at least the way I see it. It’s not really the right way to really be testing microservices. So, it really started with this conversation that I was having on Twitter. I asked someone, “Okay, so what is the best way in which you test microservices?” Rather, I think the initial conversation was how do you test service mesh architectures because that is a strong area of interest of mine. So, when I was asking people that I got a lot of answers, but I got this one reply from someone called David who mentioned that you know, I believe he works at Amazon. That’s what his Twitter bio says, that you know, increasingly convinced that integration testing a skill in the microservices environment is unattainable.

This also completely matches with my own experience as well as what I strongly believe. The whole point of microservices is to enable individual teams to build things separately, to ship things separately. Right? To deploy things separately so that you have these self contained system with abundant context that really encapsulate a certain amount of business eventuality. Yet, when it comes to testing these things, we test it all together. Right? The default is that we need these incredibly complex systems just to integration test all these before we can deploy something.

So, before we even get to testing. Another thing that I’ve seen people do, and this is also something that I’ve seen happen at one of the previous companies that I’ve worked at. Not just testing microservices, but even when it comes to developing microservices, what a lot of people do is, they try to replicate the entire topology on developer laptops. Right? So, a few, four or five years ago I tried … The company that I was working at, we tried to do this with Vagrant. The whole idea was like, with a single Vagrant up you should be able to boot the entire cloud up on a laptop. These days pretty much everyone’s doing Docker compose.

So, the thing is, in my opinion, this is really the wrong mindset. Not only is it not scalable, and not only is it completely … It’s absolutely nothing like a cloud environment. Right? It’s also really the wrong mindset to be thinking about it, when I had this blog post reviewed by someone called Fred Heber. One of the things had he pointed out was that really trying to boot a cloud on your laptop, is almost like you’re supporting the worst possible cloud provider ever. That’s like you know, a single laptop. Right? Cause that’s kind of trying to replicate. When you try to boot all of like say, a dozen microservices in a laptop. That sort of becomes the worst possible cloud ever.

Even your actual cloud is going to be far more effective than that. So, why are we even doing this? It’s because people, even when we build microservices, we are really building distributor monolithic applications. Right? I mean, this is kind of what we’re doing. The whole idea that we need … If we can not decouple these different sort of boundaries when we’re developing and when we’re testing. Then we can not expect that when we’re running these things together that the sort of decoupling that we’re trying to sort of achieve will actually work because from the ground up we’re doing it wrong.

So, how do we do it right? Right? If I’m saying, okay this is all wrong so what is the right way to do it? Well, couple caveats, my background is startups. Really always stayed startups, we’re not even talking about like the unicorns. We’re talking about companies that are small and scrappy and you know, always almost resource constrained. Where we have always had to make smart, smart bets and smart choices and where we’ve always had to do a lot with a little. What that means is that we’ve never almost always had the resources to just go out and do what some of these companies who blog about some of these things do. Right? What we’ve always had to do is think about the opportunity cost, the maintenance cost, as well as what the trade off, and the payoffs are before we make any sort of decision. This also applies to testing.

So, you know, I was just drawing this out when I was writing this blog post about like just the taxonomy of testing. Right? How do we really think of testing? For a lot of people the way … Especially a lot of software engineers think of testing. Is that you write a bunch of automated tests and pre production testing is like the only thing that they really sort of focus on and even then you don’t focus on all of this. It’s primarily just writing automated tests, writing tests in the form of code. Right? One of the reasons I even sort of like, created this illustration is because we really have to see testing as a spectrum. Not as this thing that you do when you write code, as this thing that you run every time that you create a branch. The thing that you sort of like need this thing to go green before you merge it deploy, but it needs to be seen as a spectrum.

That really starts when they start writing code, and it doesn’t really end until you’ve sort of decommissioned that code from production. Right? It’s just something that’s … it’s a continuum, and it’s not this sort of thing that you just do. Once we start thinking of testing in that manner, and especially with this other sort of paradigm that I’m seeing in San Francisco at least. This big San Francisco bubble that I’m very cozily ensconced in, which is that software engineers are now … Back in the day you probably had testers. Right? You had QA engineers and like software engineers in tests. I’m guessing bigger companies still have this, where the developer writes the code, and the engineer’s in tests actually write the tests. Then you know, if it was a service you handed it over to someone in operations to actually run it.

This is not the sort of like, I’ve never done this at any company that I’ve worked at. I’ve always been a person writing the tests. You know, deploying the code, monitoring and also sort of rolling back and fire fighting. I agree that this is unusual and this probably doesn’t apply to everyone, but that is the position from which a lot of my opinions are sort of like formed. Right? So, when I think of it that way, I personally find this model to be incredibly powerful because for the first time ever it allows us to think about testing in a very holistic manner. It delves us to think about testing across the breath of it and not just this little type of silos.

So, one of the reasons I even created this was because I sort of like wanted … I primarily see all of these sort of like mechanisms, all of these techniques as a form of testing. Right? You know, this is again the distinction isn’t quite as binary. Like, you could probably be doing some sort of A, B testing your test environment, you’ll probably do you know, profiling in production. So, this is a very fluid sort of border here, but this is how I see testing. I don’t think anyone here, at least anyone that I know of, does all forms of testing. Really the goal is to really pick the subset of testing techniques that we need given our applications requirements.

For really safety critical and health critical systems for systems that deal with money, probably the vast majority of testing is going to be done pre production. I personally only have experience running services, running backend services and running services whose deployment I control. Which means that if something goes bad, I can immediately sort of roll back. Right? That is super powerful and that is not something that a lot of people actually enjoy, because if you, I guess if you’re shipping a desktop application. You ship it to people and if you ship it with a bug then you know, unless they all sort of, upgrade, or you know, sort of change their thing it’s not going to change. Right?

If you’re just deploying a service, and you’re controlling the deployment it can ship something, and if it’s not great, then you roll it back. Right? So. When you think of testing services, I kind of think, at least from my perspective. I tend to give … I tend to put testing in production a lot higher in the sort of like priority list. Then I necessarily do pre production testing. So, the other sort of thing that I’ve been thinking about was this thread by Sarah where she actually said that … She was responding to someone else, was saying that you know, there’s probably a lot of code that you don’t necessarily have to test that you can just deploy. Right?

This was pretty controversial when it sort of came out because a whole bunch of people were like, don’t ever do this. Right? The reason why I kind of felt that there was this reaction to what she was saying, was because a lot of software engineers, what we’re taught at our mother’s knee, is that testing is a good thing. We have to be testing and if you don’t test, that if you don’t write … What I mean test, like a very narrow definition of test, then you’re doing it wrong. Right? Which means that, when people find it really hard to even accept that automated pre production testing can actually sometimes be insufficient. To the point where it can be sometimes entirely useless.

I think that really was the reason why there was this huge uproar to what Sarah was saying, which was that people are stuck in a different mindset. For us to really sort of embrace this more holistic form of testing, it really becomes important to start thinking about it differently. What I mean thinking about it differently, is that I think reliability should be … it’s not an SRE concern. I personally see it as a software engineering concern. I don’t personally think that the code that I write is going to be made any more reliable by hiring a SRE team. You know? I think that this whole top down approach where you have certain people writing the code and certain people sort of make it reliable. I don’t think that works and I really think reliability is a feature of software. Right?

Like testing, it’s like something that you build into the code that you write, so that then you can understand what it’s doing in a lot better way. Understanding what it’s doing is also a form of testing, because we’re just seeing … do our assumptions hold good? You know, did everything that we sort of like base our system on, does that all still hold water? Right? So, this is again something that I very strongly believe in. That to be able to sort of push certain amounts of regression testing to post production monitoring requires not just a change in a mindset, but and a certain appetite for risk obviously, but it really requires software engineers to code for failure.

I think that is the thing that I’m not seeing enough and to be honest I don’t do it enough. This is like, I’m not saying I’m doing everything perfectly. I probably don’t do this well enough and that’s probably something that I would like to see. Not just me, but also in general industry, just do better. It involves a whole different mindset. It really involves thinking from the ground up how we build software. Again, this is probably true for all kinds of software, this is especially true for microservices. Right? Because what … crazy service into architecture is that we’re building, what they really can note is this incredibly complex system that’s always going to exist in some form of unhealthiness. Right?

It’s almost like saying unhealthy is the new healthy, because we you know, have all of these crazy mechanisms like retries and circuit breaking and multi-tiered caching and eventual consistency and all these relaxed guarantees just to make sure that our service is going to be sort of, healthy. So, we’re by default in a mode where we are thinking you know, in a way where unhealthy is really the norm, but when we test, we test as if healthy is the norm. Right? Which, is really kind of bizarre when you think about it. So, again something that Fred Heber said, which was that, when you have a large scale system. If your tests require 100% health then you know you got a problem because you’re not really testing the system. You’re not even testing what you’re building for, but you’re really testing something totally different. Which, is like the success case, where what you’re actually doing is architecting for failure.

That’s completely bizarre to me. Right? So, code accordingly. This was another tweet that I saw and I think I retweeted this. Failure is going to happen, it may not be something you did, it’s probably something club provider did. The reason is not deployment, there is no internet. Code accordingly. So I see a lot of these. Whenever I see tweets like this my initial instinct is like hell yeah, this is so true. My second instincts are like, fuck no, this is not true. What is this person even talking about? This is like so hand wavy, and this is like so like yeah code accordingly. Like what am I supposed to do? Like, I used to ask these questions like a couple years ago. Like what the hell am I supposed to do? Like give me answers, don’t just give me these little like, taught literary sound bites. That then gets fucking tweeted a hundred times. Like, what am I supposed to be doing? Like, code accordingly.

I mean, I’m ranting here, but I mean, this is what I personally feel. So, you know, the more I think about it. There’s certain things that I feel software engineers can really do, really try to understand when they’re actually coding in order to code accordingly. So what does that mean? Well, in my opinion, it comes down to three things. It really comes down to coding for debugging, being able to instrument your services well enough so that you can really … so that when you’re actually debugging something you’re not really starting with the conjecture and you’re not really starting with this hypothesis, but you can ask questions of your system. That you can ask questions and then you can sort of refine your hypothesis based on this question.

Keep asking questions, keep asking questions until you come to something, which is probably actually really true to the root cause. Rather than sort of like looking at the wrong things or like just you know, trying to sort of like fit this failure into this mental model that you already have built. Based on like, because you build the software so you have this mental model of how it should work and something is failing and then you try to retrofit that to this mental model that you have, which in my opinion, is the completely wrong way to do it. So, number one instrument your services better. Number two, understand the operational semantics of the application. How is it deployed? You know, how is a process started on a new host? Like, because if you think of a lot of outages, they’re actually like deployment outages. It’s basically bad coding getting deployed for some reason.

Again, a lot of these codes get deployed after going through like this huge like intricate sort of like testing pipelines. Then why is it still the case that deployments are like the number one cause of outages or like bad deploys are the number one cause of like services going down? Right? It’s again because when you’re coding and also when you’re testing, what you’re really validating is your flawed mental model. Not really the reality. Right? I mean, does anyone disagree with that? No? Okay, cool, because that’s what I strongly believe. Is that you still have these outages just by doing everything correctly and that’s because you’re really, what you’re really testing is not the reality, but what you’re testing is just your bias. Right?

It’s just this big form of confirmation bias. That’s what you’re really dealing with. So, you know, just being able to understand the operation semantics. I personally feel allows us to think about how services fail in a better way, which means again, so what process manager are you using? Like, what’s the restart mechanism? How many times is it going to try to restart? What’s your deployment mechanism? Is it like some sort of like blue, green deploy or are you doing some sort of like red, black deploy? Are you doing those rolling deploys? Like what is the release engineering sort of like practices that are being used? How does that lead to failure?

There’s one more in my opinion, and that really comes down to the operational semantics of the dependencies views. Right? Veeble on top of crazy amount of dependencies. That’s in my opinion, at least what I’ve mostly been doing is gluing together a bunch of dependencies. Basically it’s just some libraries on top of this and then you have like huge, hugely sort of like tower of abstraction on top of the Veeble, a little tiny abstraction. The best way to deal with abstractions is, at least in my opinion, even the leakiest ones. Is to sort like have some amount of trust that it’s gonna say what it says on the thing. Right?

I mean, it’s not ideal, but if you’re just gonna like start doubting every single abstraction layer that we’re dealing with right down to TCP or your art packets or something, then we’re probably never going to get anything shipped. So the best way we deal with abstractions, is sort of like having a certain level of trust in the abstraction, but failure still happens. In my opinion that happens because we don’t fully understand the boundaries between the abstractions well enough. Spending a little more time there I think, I personally feel has sort of like helped me understand the services and the systems that I build better.

So, what do I mean by that? Well, a good example would be, if you’re using a service like consult for service discovery. Right? It has certain sort of like default consistency modes in which, it sort of like queries for other services. The default case is strongly consistent, it’s probably not what you might need for when you are trying to do service discovery because that is more a fundamentally consistent problem. Right? Which, means that the developer using a consult library or if someone provides consult as a service then you understand the right fault defaults. Different people who are using this library need to understand these defaults and be able to sort of like reason about whether that is the right default for their needs.

So, just spending some time and understanding a lot of these operational semantics of both the dependencies as well as the application and programming for the debuggability. Is in my opinion, what I think denotes coding accordingly or adding coding for failure. Another point Sarah made was that you know, the writing and running of tests itself is like not a goal in itself. Like, tests themselves pretty much are, I wouldn’t say useless, but like in and off themselves they’re not like particularly useful. The reason is, you know again, I stated this previously. It’s just a big form of confirmation bias, all the test I write because we’re sort of like predicting beforehand what are all the failures and success modes. Tests are only good as their ability to predict. Right? The reason why systems fail is because we’re not able to predict that failure or we didn’t try to understand that well enough.

So what this means in my opinion is that just having more tests doesn’t make your service any better. It just means that it can think of more cases, but there’s probably still a case that’s gonna take your service down and if you don’t understand that then, you know. Having like a million tests doesn’t help you in any way. Right? So, and when we talk about tests, the other thing that really annoys me is this test pyramid. So, this was probably proposed back in like, I don’t know the early 2000’s along with the whole agile movement and everything. It was probably really small then so I don’t really understand or like, know the history of this. Every single testing blog post or like every single like testing conversation people have like this thing gets wheeled out. Every time I look at this I’m like okay, yeah if you have like a or have a like the Rails app or like a single Jangle app, if you have like a CRUD app basically, yeah this kind of makes sense. Right?

When it comes to really dealing with this kind of service, like is this the best we can do when it comes to testing? Seriously? Like, why do we still talk about this thing, right? This is like, in my opinion, super old and this is completely the wrong way to think about testing this kind of a service or into the architecture. This is completely hypothetical by the way, but it’s not very far removed from some kinds of some of the sort of topologies that I’ve personally worked with. So in my opinion, I’m totally wearing my thought leader hat now, this is what testing should look like.

I see it more as an old style, the pyramid is more like a funnel, I don’t know why I made it a funnel. It could have probably been this big rectangle, but for some reason this is a funnel. In my opinion, to really test these kind of crazy architectures, we need to think of testing as a spectrum and it just doesn’t end when you write some code. It just doesn’t end when you deploy something, it’s this large wide spectrum, where it sort of continues until you really decommission your applications. Right? Again, this is isn’t like sort of comprehensive by any way, but this is sort of like my mental model for testing this kind of system.

You have your unit test, you have your benchmarking, you have property based testing, you have your contract testing. All this is things that it can do in development. Right? You can do all of this without necessarily spinning up this whole crazy topology locally, you can just like run one service and still to have all these kinds of tests. Testing pre production and personally I believe that massive integration tests are an anti pattern. I mean, because think of it, what we’re really trying to do when people try to like build this whole dev environment then an identical test environment, rather try to keep your test environment as close as possible to production. It’s not really the real thing, because it becomes this huge maintenance burden and especially if we’re smaller companies, we just can’t do it. It’s just like, it’s just not attainable, it’s like probably it ends up requiring a team in its own right to just maintain. Right?

That’s kind of hard and the fact is, doing all of that doesn’t give you any more confidence in your system. So, if you’re going to be integration testing in microservice architecture then you should be able to test that against production. This has also the added benefit in that it helps you really think about your boundaries. It makes you certain that if you should be able to test a potentially buggy sort of like service in production with the rest of them, then it really forces you to think of what the boundaries should be and how the blast radius or how widespread the blast radius might be.

The goal is that it should be really small in that even if you push out this really shitty sort of like patch. Or like you know, you sort of like push out this big buggy creature. It really shouldn’t take down the rest of your service. Right? If you actually think about it that’s really what the whole microservice, the goal of microservices architectures are. You have like these stand alone services that do this one thing well, they don’t really affect or really greatly affect like the other services. So, this is like a very positive architectural pressure in my mind. Being able to integration test, your single service with your production environment.

Then there’s config tests and this is again something that I’ve learned from you know, just my operations colleagues, that configuration changes can sometimes be more dangerous than a code change. Right? Especially when it comes to things like Nginx configs or even Haproxy configs or even some sort of like given that we live in the era of service meshes like Envoy or Linkerdy. The thing about service meshes it’s like it becomes this huge sea of failure across like your whole entire fleet, because it sits in between every damn service that you have. If you’re not gonna be testing the config changes before pushing out a change to Envoy, it’s not just gonna take down one service. It has the potential to take down your entire architecture.

That’s pretty scary to me, and I’m surprised that more people don’t talk about like you know, how to build resilience service mesh. That’s something that I’m hoping will happen in the next year or two as like this whole paradigm gets more traction. They’re shadowing which is… data… production traffic and actually replay it against tests. Right? I’m also arguing that let’s not have this whole crazy test cluster in place, which means that integration testing and shadowing sort of like combine together. When you’re integration with production don’t sort of generate your own lord or don’t try to sort of like create like your own request response cycle. Try to use something from production, that’s just happened like maybe five seconds ago, cause that’s as real as it can get.

Then there’s testing in production, obviously. There’s A, B testing, feature flagging, you know, canaring, distributor tracing, monitoring, I think monitoring is a form of testing that’s again why I think software engineers should be doing it. You know, it’s not an operational concern and just understanding what it’s software is doing in production. Because you can build something, but it is completely constrained by all the biases that you have and you mental models and basically how your team thinks about it. A lot of reasons, if you think about it a lot of failed projects are just like big architectural flaws.

It’s primarily because people didn’t validate soon enough in production and it’s probably because people let their biases carry them to far until they spent too many engineering cycles and too much time building something that’s pretty big. Then it’s probably too flawed, because you didn’t integrate rapidly and you didn’t basically get feedback from production fast enough. This is like a service that I’ve been sort of the developer for, on call for, for the better part of the last two years. It doesn’t matter what e-service it really does, but you know when I developed this thing. I don’t really bring up all these services on the laptop, what I really do is just bring up this one thing and then test the rest in production.

Surprisingly it’s worked well enough for us, I mean, there’s never been an outage because I’ve been developing this way. The other pretty weird thing is I’ve worked at companies where people have come and asked us like, can you give us your whole service as a Docker container. You’re talking about like a SaaS service. Like can it just give it your service as a Docker container so that we can integration test our services with basically yours. Most of us these days just integrate with a whole bunch of third party services. As we should, because if it’s not a core competency for your startup then you should probably be asserting it at someone else. Right?

Who’s basic business model and who’s basically the reason they even exist is because they’re providing the service and then it begs the question. How do you even test these services? Right? In my opinion, the way you test is not fast enough to give you an operating container that then you sort of spin up in a test environment. You need to either have better fractions in place that you know, where you can sort of understand your vendors failure modes and sort of program against that. Or you just directly talk to your … directly make like an ecro call to this third party service that we’re talking to. Which, again becomes harder because a lot of services don’t really provide a test mode. Sort of like features or they don’t really allow you to test it like every time you make a call to them you get charged, which it kind of sucks.

That’s again, that’s a user experience thing right? In my opinion not a lot of vendors do this thing where they really think about how someone is gonna be integrating with them or gonna test well. So, yeah, that’s pretty much all I’ve got and a lot of these thoughts are like probably better explained and better articulated in the blog post if anyone really wants to spend 40 minutes reading that. Go read that, I won’t do that, but yeah. Thank you.

Kim is a writer and editor. She’s pretty excited about the technology that LaunchDarkly is building and sharing that story with the community. Before LaunchDarkly, Kim did marketing for other SaaS companies, like Intuit and Pentaho. She earned her BA in sociology from UC Berkeley.