05 Feb 2018

Testing Microservices: A Sane Approach Pre-Production & In Production

At LaunchDarkly we host a monthly MeetUp for anyone interested in testing in production. (If you’re interested in joining us, sign up here.) In January we met at the Heavybit Clubhouse in San Francisco, and had 3 guest speakers talk about how they test in production. Cindy Sridharan (aka @copyconstruct) talked about how she approaches testing microservices—including how to think about testing, what to consider when setting up tests, and other best practices.

Her talk was inspired by an article she wrote in December 2017 on testing microservices. She wrote it in response to an article that showcased a team’s incredibly complex testing infrastructure. The piece described how a team built something that helped their developers run a lot of tests very quickly. And while it helped tremendously, Cindy wondered why it had to be so complex.

“The whole point of microservices is to enable teams to develop, deploy and scale independently. Yet when it comes to testing, we insist on testing *everything* together by spinning up *identical* environments, contradicting the mainspring of why we do microservices.”

Check out her talk below.

TRANSCRIPT

So, my name is Cindy and as Andrea said, you probably best know me from Twitter or from my blog. So, couple days ago, probably was last, the last week of 2017. I wrote this blog post called Testing Microservices the Sane Way. It was a 40 minute blog post and I am still astounded how many people read it. I mean, because I personally do not read anything that’s longer than 10 minutes.

I mean, I can write stuff that’s 40 minutes or 60 minutes long, but I can’t read. I mean my attention span is like 10 minutes. Right, so kudos to everyone who’s read this. So, like I said my name is Cindy. You probably know me on Twitter and pretty much everywhere on the internet as copyconstruct, and I’m here to talk about microservices.

The main reason why I wrote this blog post was because I read another blog post about someone who described the testing infrastructure. It was, the blog post of a company, it’s fairly largish, but you know we’re not talking about Google, or you know like Facebook or anything. But pretty largish for a San Francisco company, and you know, I sort of read the whole blog post, and they were talking about like you know, they had this pretty cool infrastructure there.

You know, they had like Mesos that was like scheduling Jenkins and EC2 and you know it had like … it was doing bin packing. It was running a million tests, it was like … what else did it do? Let’s see, it was like streaming all the logs to Kafka and then sending all to S3. It was like this incredibly complex system that people had built and basically the whole blog post was about how it just greatly helped the developers to run millions of tests in a very, sort of fast way. Right, which is probably always seen as a positive thing.

When I finished reading that my sort of reaction was like, oh my God, why is this so complex? Why is it that this industry thinks of testing as this thing that needs to, you know, just necessarily be done this way. Where it’s just like, you know, you have a system, which is probably even more complex than an actual production system. That you spin up just to run a bunch of tests. Right? So, once I read that I was like, oh my God. I was just completely surprised, I was just pretty unsettled. I was like why are we doing this, because this is not just one company. Right? Pretty much everyone attempts to do this or you know get somewhere close to this, and bigger companies like Google probably have something an order of magnitude more complex than this.

At least from my personal experience trying to build these kinds of systems, at least the way I see it, it’s not really the right way to be testing microservices. So, it really started with this conversation that I was having on Twitter. I asked someone, “Okay, so what is the best way in which you test microservices?” Rather, I think the initial conversation was how do you test service mesh architectures, because that is a strong area of interest of mine. So, when I was asking people that I got a lot of answers, but I got this one reply from someone called David. I believe he works at Amazon, that’s what his Twitter bio says. He mentioned that he’s increasingly convinced that integration testing at scale in a microservices environment is unattainable.

This also completely matches with my own experience as well as what I strongly believe. The whole point of microservices is to enable individual teams to build things separately, to ship things separately. Right? To deploy things separately so that you have these self-contained systems with a bounded context that really encapsulate a certain amount of business functionality. Yet, when it comes to testing these things, we test it all together. Right? The default is that we need these incredibly complex systems just to integration test all these before we can deploy something.

So, before we even get to testing. Another thing that I’ve seen people do, and this is also something that I’ve seen happen at one of the previous companies that I’ve worked at. Not just testing microservices, but even when it comes to developing microservices, what a lot of people do is, they try to replicate the entire topology on developer laptops. Right? So, a few, four or five years ago I tried … The company that I was working at, we tried to do this with Vagrant. The whole idea was like, with a single Vagrant up you should be able to boot the entire cloud up on a laptop. These days pretty much everyone’s doing Docker compose.

So, the thing is, in my opinion, this is really the wrong mindset. Not only is it not scalable, and not only is it completely … It’s absolutely nothing like a cloud environment. Right? It’s also really the wrong mindset to be thinking about it. I had this blog post reviewed by someone called Fred Hebert. One of the things he had pointed out was that really trying to boot a cloud on your laptop is almost like you’re supporting the worst possible cloud provider ever. That’s like you know, a single laptop. Right? Cause that’s kind of what you’re trying to replicate. When you try to boot all of like say, a dozen microservices on a laptop, that sort of becomes the worst possible cloud ever.

Even your actual cloud is going to be far more effective than that. So, why are we even doing this? It’s because, even when we build microservices, we are really building distributed monolithic applications. Right? I mean, this is kind of what we’re doing. The whole idea that we need … If we cannot decouple these different sort of boundaries when we’re developing and when we’re testing, then we cannot expect that when we’re running these things together the sort of decoupling that we’re trying to achieve will actually work, because from the ground up we’re doing it wrong.

So, how do we do it right? Right? If I’m saying, okay this is all wrong so what is the right way to do it? Well, couple caveats, my background is startups. Really early stage startups, we’re not even talking about like the unicorns. We’re talking about companies that are small and scrappy and you know, almost always resource constrained. Where we have always had to make smart bets and smart choices and where we’ve always had to do a lot with a little. What that means is that we’ve almost never had the resources to just go out and do what some of these companies who blog about some of these things do. Right? What we’ve always had to do is think about the opportunity cost, the maintenance cost, as well as what the trade-offs and the payoffs are before we make any sort of decision. This also applies to testing.

So, you know, I was just drawing this out when I was writing this blog post about like just the taxonomy of testing. Right? How do we really think of testing? For a lot of people the way … Especially a lot of software engineers think of testing. Is that you write a bunch of automated tests and pre production testing is like the only thing that they really sort of focus on and even then you don’t focus on all of this. It’s primarily just writing automated tests, writing tests in the form of code. Right? One of the reasons I even sort of like, created this illustration is because we really have to see testing as a spectrum. Not as this thing that you do when you write code, as this thing that you run every time that you create a branch. The thing that you sort of like need this thing to go green before you merge it deploy, but it needs to be seen as a spectrum.

That really starts when you start writing code, and it doesn’t really end until you’ve sort of decommissioned that code from production. Right? It’s just something that’s … it’s a continuum, and it’s not this sort of thing that you just do. Once we start thinking of testing in that manner, and especially with this other sort of paradigm that I’m seeing in San Francisco at least. This big San Francisco bubble that I’m very cozily ensconced in, which is that software engineers are now … Back in the day you probably had testers. Right? You had QA engineers and like software engineers in test. I’m guessing bigger companies still have this, where the developer writes the code, and the engineers in test actually write the tests. Then you know, if it was a service you handed it over to someone in operations to actually run it.

This is not the sort of like, I’ve never done this at any company that I’ve worked at. I’ve always been a person writing the tests. You know, deploying the code, monitoring and also sort of rolling back and fire fighting. I agree that this is unusual and this probably doesn’t apply to everyone, but that is the position from which a lot of my opinions are sort of like formed. Right? So, when I think of it that way, I personally find this model to be incredibly powerful because for the first time ever it allows us to think about testing in a very holistic manner. It allows us to think about testing across the breadth of it and not just in these little types of silos.

So, one of the reasons I even created this was because I sort of like wanted … I primarily see all of these sort of like mechanisms, all of these techniques as a form of testing. Right? You know, this is again the distinction isn’t quite as binary. Like, you could probably be doing some sort of A, B testing your test environment, you’ll probably do you know, profiling in production. So, this is a very fluid sort of border here, but this is how I see testing. I don’t think anyone here, at least anyone that I know of, does all forms of testing. Really the goal is to really pick the subset of testing techniques that we need given our applications requirements.

For really safety critical and health critical systems for systems that deal with money, probably the vast majority of testing is going to be done pre production. I personally only have experience running services, running backend services and running services whose deployment I control. Which means that if something goes bad, I can immediately sort of roll back. Right? That is super powerful and that is not something that a lot of people actually enjoy, because if you, I guess if you’re shipping a desktop application. You ship it to people and if you ship it with a bug then you know, unless they all sort of, upgrade, or you know, sort of change their thing it’s not going to change. Right?

If you’re just deploying a service, and you’re controlling the deployment, you can ship something, and if it’s not great, then you roll it back. Right? So. When you think of testing services, I kind of think, at least from my perspective. I tend to give … I tend to put testing in production a lot higher in the sort of like priority list than I necessarily do pre production testing. So, the other sort of thing that I’ve been thinking about was this thread by Sarah where she actually said that … She was responding to someone else, was saying that you know, there’s probably a lot of code that you don’t necessarily have to test that you can just deploy. Right?

This was pretty controversial when it sort of came out because a whole bunch of people were like, don’t ever do this. Right? The reason why I kind of felt that there was this reaction to what she was saying, was because a lot of software engineers, what we’re taught at our mother’s knee, is that testing is a good thing. We have to be testing and if you don’t test, if you don’t write … By test I mean, like, a very narrow definition of test, then you’re doing it wrong. Right? Which means that people find it really hard to even accept that automated pre production testing can actually sometimes be insufficient. To the point where it can be sometimes entirely useless.

I think that really was the reason why there was this huge uproar to what Sarah was saying, which was that people are stuck in a different mindset. For us to really sort of embrace this more holistic form of testing, it really becomes important to start thinking about it differently. What I mean thinking about it differently, is that I think reliability should be … it’s not an SRE concern. I personally see it as a software engineering concern. I don’t personally think that the code that I write is going to be made any more reliable by hiring a SRE team. You know? I think that this whole top down approach where you have certain people writing the code and certain people sort of make it reliable. I don’t think that works and I really think reliability is a feature of software. Right?

Like testing, it’s like something that you build into the code that you write, so that then you can understand what it’s doing in a lot better way. Understanding what it’s doing is also a form of testing, because we’re just seeing … do our assumptions hold good? You know, did everything that we sort of like base our system on, does that all still hold water? Right? So, this is again something that I very strongly believe in. That to be able to sort of push certain amounts of regression testing to post production monitoring requires not just a change in a mindset, but and a certain appetite for risk obviously, but it really requires software engineers to code for failure.

I think that is the thing that I’m not seeing enough and to be honest I don’t do it enough. This is like, I’m not saying I’m doing everything perfectly. I probably don’t do this well enough and that’s probably something that I would like to see. Not just me, but also in general the industry, just do better. It involves a whole different mindset. It really involves thinking from the ground up how we build software. Again, this is probably true for all kinds of software, this is especially true for microservices. Right? Because these crazy service-oriented architectures that we’re building, what they really connote is this incredibly complex system that’s always going to exist in some form of unhealthiness. Right?

It’s almost like saying unhealthy is the new healthy, because we you know, have all of these crazy mechanisms like retries and circuit breaking and multi-tiered caching and eventual consistency and all these relaxed guarantees just to make sure that our service is going to be sort of, healthy. So, we’re by default in a mode where we are thinking you know, in a way where unhealthy is really the norm, but when we test, we test as if healthy is the norm. Right? Which, is really kind of bizarre when you think about it. So, again something that Fred Hebert said, which was that, when you have a large scale system, if your tests require 100% health then you know you got a problem because you’re not really testing the system. You’re not even testing what you’re building for, but you’re really testing something totally different. Which, is like the success case, whereas what you’re actually doing is architecting for failure.
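One way to picture a test where unhealthy is the norm: the sketch below is not from the talk, and `FlakyDependency` and `call_with_retries` are hypothetical names invented for illustration. It assumes a downstream call that fails a couple of times before succeeding, and makes that failure path the thing under test rather than the 100% healthy case.

```python
import time


class FlakyDependency:
    """Hypothetical stand-in for a downstream service that fails its first few calls."""

    def __init__(self, failures_before_success):
        self.failures_left = failures_before_success
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("simulated downstream failure")
        return "ok"


def call_with_retries(fn, attempts=3, base_delay=0.0):
    """Call fn, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the failure to the caller
            time.sleep(base_delay * (2 ** attempt))


# The unhealthy case is the one under test: two failures, then success.
dep = FlakyDependency(failures_before_success=2)
result = call_with_retries(dep.fetch, attempts=3)
```

The same helper can be pointed at a dependency that never recovers, to check that the failure is surfaced after the last attempt instead of being swallowed.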

That’s completely bizarre to me. Right? So, code accordingly. This was another tweet that I saw and I think I retweeted this. Failure is going to happen, it may not be something you did, it’s probably something your cloud provider did. The reason is not deployment, there is no internet. Code accordingly. So I see a lot of these. Whenever I see tweets like this my initial instinct is like hell yeah, this is so true. My second instinct is like, fuck no, this is not true. What is this person even talking about? This is like so hand wavy, and this is like so like yeah code accordingly. Like what am I supposed to do? Like, I used to ask these questions like a couple years ago. Like what the hell am I supposed to do? Like give me answers, don’t just give me these little like, taut literary sound bites. That then get fucking tweeted a hundred times. Like, what am I supposed to be doing? Like, code accordingly.

I mean, I’m ranting here, but I mean, this is what I personally feel. So, you know, the more I think about it. There’s certain things that I feel software engineers can really do, really try to understand when they’re actually coding in order to code accordingly. So what does that mean? Well, in my opinion, it comes down to three things. It really comes down to coding for debugging, being able to instrument your services well enough so that you can really … so that when you’re actually debugging something you’re not really starting with the conjecture and you’re not really starting with this hypothesis, but you can ask questions of your system. That you can ask questions and then you can sort of refine your hypothesis based on this question.

Keep asking questions, keep asking questions until you come to something, which is probably actually really true to the root cause. Rather than sort of like looking at the wrong things or like just you know, trying to sort of like fit this failure into this mental model that you already have built. Because you built the software, so you have this mental model of how it should work, and something is failing and then you try to retrofit that to this mental model that you have, which in my opinion, is the completely wrong way to do it. So, number one, instrument your services better. Number two, understand the operational semantics of the application. How is it deployed? You know, how is a process started on a new host? Like, because if you think of a lot of outages, they’re actually like deployment outages. It’s basically bad code getting deployed for some reason.

Again, a lot of this code gets deployed after going through like this huge like intricate sort of like testing pipeline. Then why is it still the case that deployments are like the number one cause of outages or like bad deploys are the number one cause of like services going down? Right? It’s again because when you’re coding and also when you’re testing, what you’re really validating is your flawed mental model. Not really the reality. Right? I mean, does anyone disagree with that? No? Okay, cool, because that’s what I strongly believe. Is that you still have these outages just by doing everything correctly and that’s because what you’re really testing is not the reality, but what you’re testing is just your bias. Right?

It’s just this big form of confirmation bias. That’s what you’re really dealing with. So, you know, just being able to understand the operational semantics, I personally feel, allows us to think about how services fail in a better way, which means again, so what process manager are you using? Like, what’s the restart mechanism? How many times is it going to try to restart? What’s your deployment mechanism? Is it like some sort of like blue, green deploy or are you doing some sort of like red, black deploy? Are you doing those rolling deploys? Like what are the release engineering sort of like practices that are being used? How does that lead to failure?

There’s one more in my opinion, and that really comes down to the operational semantics of the dependencies we use. Right? We build on top of a crazy amount of dependencies. That’s in my opinion, at least what I’ve mostly been doing is gluing together a bunch of dependencies. Basically it’s just some libraries on top of this and then you have like this huge sort of like tower of abstraction, and on top of that we build a little tiny abstraction. The best way to deal with abstractions, at least in my opinion, even the leakiest ones, is to sort of like have some amount of trust that it’s gonna do what it says on the tin. Right?

I mean, it’s not ideal, but if you’re just gonna like start doubting every single abstraction layer that we’re dealing with right down to TCP or your ARP packets or something, then we’re probably never going to get anything shipped. So the best way we deal with abstractions, is sort of like having a certain level of trust in the abstraction, but failure still happens. In my opinion that happens because we don’t fully understand the boundaries between the abstractions well enough. Spending a little more time there, I personally feel, has sort of like helped me understand the services and the systems that I build better.

So, what do I mean by that? Well, a good example would be, if you’re using a service like Consul for service discovery. Right? It has certain sort of like default consistency modes in which it sort of like queries for other services. The default case is strongly consistent, which is probably not what you need when you are trying to do service discovery, because that is more fundamentally an eventually consistent problem. Right? Which means that the developer using a Consul library, or someone who provides Consul as a service, needs to understand the right defaults. Different people who are using this library need to understand these defaults and be able to sort of like reason about whether that is the right default for their needs.
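To make the Consul example concrete: Consul's HTTP API lets a read opt into a different consistency mode by adding a bare `stale` or `consistent` flag to the query string, and uses its `default` mode otherwise. The sketch below only builds the URL and makes no network call. The helper name `health_query_url` is an assumption for illustration, as is the base address (port 8500 is the usual local agent port).

```python
def health_query_url(service, mode="default", base="http://localhost:8500"):
    """Build a Consul health-endpoint URL for the given consistency mode.

    Hypothetical helper: it only constructs the URL so that the choice of
    consistency mode is explicit at the call site.
    """
    if mode not in ("default", "stale", "consistent"):
        raise ValueError(f"unknown consistency mode: {mode}")
    url = f"{base}/v1/health/service/{service}"
    # Non-default modes are selected with a bare query flag.
    if mode != "default":
        url += f"?{mode}"
    return url


# Service discovery can usually tolerate slightly stale answers,
# so a caller might opt in to stale reads explicitly.
discovery_url = health_query_url("web", mode="stale")
```

Making the mode a named parameter is one way to force the "is this the right default for my needs?" question at the point where the library is used.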

So, just spending some time and understanding a lot of these operational semantics of both the dependencies as well as the application, and programming for debuggability, is in my opinion what denotes coding accordingly or coding for failure. Another point Sarah made was that you know, the writing and running of tests is like not a goal in itself. Like, tests themselves pretty much are, I wouldn’t say useless, but like in and of themselves they’re not like particularly useful. The reason is, you know again, I stated this previously. It’s just a big form of confirmation bias, all the tests I write, because we’re sort of like predicting beforehand what are all the failure and success modes. Tests are only as good as their ability to predict. Right? The reason why systems fail is because we’re not able to predict that failure or we didn’t try to understand that well enough.

So what this means in my opinion is that just having more tests doesn’t make your service any better. It just means that you can think of more cases, but there’s probably still a case that’s gonna take your service down and if you don’t understand that then, you know. Having like a million tests doesn’t help you in any way. Right? So, and when we talk about tests, the other thing that really annoys me is this test pyramid. So, this was probably proposed back in like, I don’t know the early 2000’s along with the whole agile movement and everything. I was probably really small then so I don’t really understand or like, know the history of this. Every single testing blog post or like every single like testing conversation people have, like this thing gets wheeled out. Every time I look at this I’m like okay, yeah if you have like a Rails app or like a single Django app, if you have like a CRUD app basically, yeah this kind of makes sense. Right?

When it comes to really dealing with this kind of service, like is this the best we can do when it comes to testing? Seriously? Like, why do we still talk about this thing, right? This is like, in my opinion, super old and this is completely the wrong way to think about testing this kind of a service-oriented architecture. This is completely hypothetical by the way, but it’s not very far removed from some of the sort of topologies that I’ve personally worked with. So in my opinion, I’m totally wearing my thought leader hat now, this is what testing should look like.

I see it more as an old style, the pyramid is more like a funnel, I don’t know why I made it a funnel. It could have probably been this big rectangle, but for some reason this is a funnel. In my opinion, to really test these kinds of crazy architectures, we need to think of testing as a spectrum and it just doesn’t end when you write some code. It just doesn’t end when you deploy something, it’s this large wide spectrum, where it sort of continues until you really decommission your applications. Right? Again, this isn’t like sort of comprehensive in any way, but this is sort of like my mental model for testing this kind of system.

You have your unit tests, you have your benchmarking, you have property based testing, you have your contract testing. All of these are things that you can do in development. Right? You can do all of this without necessarily spinning up this whole crazy topology locally, you can just like run one service and still have all these kinds of tests. Testing pre production, and personally I believe that massive integration tests are an anti pattern. I mean, because think of it, what we’re really trying to do when people try to like build this whole dev environment then an identical test environment, or rather try to keep your test environment as close as possible to production. It’s not really the real thing, and it becomes this huge maintenance burden and especially if we’re smaller companies, we just can’t do it. It’s just like, it’s just not attainable, it’s like probably it ends up requiring a team in its own right to just maintain. Right?

That’s kind of hard and the fact is, doing all of that doesn’t give you any more confidence in your system. So, if you’re going to be integration testing in microservice architecture then you should be able to test that against production. This has also the added benefit in that it helps you really think about your boundaries. It makes you certain that if you should be able to test a potentially buggy sort of like service in production with the rest of them, then it really forces you to think of what the boundaries should be and how the blast radius or how widespread the blast radius might be.

The goal is that it should be really small in that even if you push out this really shitty sort of like patch. Or like you know, you sort of like push out this big buggy feature. It really shouldn’t take down the rest of your service. Right? If you actually think about it that’s really what the goal of microservices architectures is. You have like these stand alone services that do this one thing well, they don’t really affect or really greatly affect like the other services. So, this is like a very positive architectural pressure in my mind. Being able to integration test your single service with your production environment.

Then there’s config tests and this is again something that I’ve learned from you know, just my operations colleagues, that configuration changes can sometimes be more dangerous than a code change. Right? Especially when it comes to things like Nginx configs or even HAProxy configs or even, given that we live in the era of service meshes, like Envoy or Linkerd. The thing about service meshes is that it becomes this huge single point of failure across like your whole entire fleet, because it sits in between every damn service that you have. If you’re not gonna be testing the config changes before pushing out a change to Envoy, it’s not just gonna take down one service. It has the potential to take down your entire architecture.
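A config test does not need the real proxy to be useful. The sketch below is an illustration only: the `clusters` and `routes` layout is a made-up, simplified schema, not Envoy's, Linkerd's, or any other proxy's actual format, and `validate_routing_config` is a hypothetical helper. It shows the kind of cheap check that could run before a routing change is pushed to the whole fleet.

```python
def validate_routing_config(config):
    """Return a list of human-readable problems found in a routing config.

    The schema here is hypothetical: a dict with a list of named clusters
    and a list of routes that each point at one of those clusters.
    """
    errors = []
    clusters = {c["name"] for c in config.get("clusters", [])}
    if not clusters:
        errors.append("no clusters defined")
    for i, route in enumerate(config.get("routes", [])):
        prefix = route.get("prefix", "")
        if not prefix.startswith("/"):
            errors.append(f"route {i}: prefix must start with '/'")
        if route.get("cluster") not in clusters:
            errors.append(f"route {i}: unknown cluster {route.get('cluster')!r}")
    return errors


good = {
    "clusters": [{"name": "users"}, {"name": "billing"}],
    "routes": [{"prefix": "/users", "cluster": "users"}],
}
# A typo in a cluster name is exactly the kind of change that passes code review.
bad = {
    "clusters": [{"name": "users"}],
    "routes": [{"prefix": "billing", "cluster": "biling"}],
}
problems = validate_routing_config(bad)
```

A check like this only catches structural mistakes; whether the new routing behaves correctly under real traffic is still something to verify with a staged rollout.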

That’s pretty scary to me, and I’m surprised that more people don’t talk about like you know, how to build resilient service meshes. That’s something that I’m hoping will happen in the next year or two as like this whole paradigm gets more traction. There’s shadowing, which is … data … production traffic and actually replay it against tests. Right? I’m also arguing that let’s not have this whole crazy test cluster in place, which means that integration testing and shadowing sort of like combine together. When you’re integration testing with production, don’t sort of generate your own load or don’t try to sort of like create like your own request response cycle. Try to use something from production, that’s just happened like maybe five seconds ago, cause that’s as real as it can get.
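The core of shadowing can be sketched in a few lines, assuming production request and response pairs have already been captured somewhere. The recorded entries, the pricing handler, and the `shadow_compare` helper below are all hypothetical and exist only to show the replay-and-compare loop; a real setup would also have to deal with side effects and non-deterministic responses.

```python
def shadow_compare(recorded, candidate_handler):
    """Replay recorded production requests against a candidate and collect mismatches."""
    mismatches = []
    for entry in recorded:
        try:
            got = candidate_handler(entry["request"])
        except Exception as exc:
            # A crash on real traffic is a finding, not a reason to stop the replay.
            got = {"error": type(exc).__name__}
        if got != entry["response"]:
            mismatches.append(
                {"request": entry["request"], "expected": entry["response"], "got": got}
            )
    return mismatches


# Hypothetical traffic captured from production a few seconds ago.
recorded = [
    {"request": {"path": "/price", "qty": 2}, "response": {"total": 20}},
    {"request": {"path": "/price", "qty": 0}, "response": {"total": 0}},
]


def candidate(request):
    """New version of the handler, with a bug nobody wrote a test for."""
    if request["qty"] == 0:
        raise ValueError("empty order")
    return {"total": request["qty"] * 10}


mismatches = shadow_compare(recorded, candidate)
```

Here the empty-order request is the case a hand-written test might never have included, which is the argument for using real traffic as the input.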

Then there’s testing in production, obviously. There’s A, B testing, feature flagging, you know, canarying, distributed tracing, monitoring. I think monitoring is a form of testing, that’s again why I think software engineers should be doing it. You know, it’s not an operational concern, it’s just understanding what your software is doing in production. Because you can build something, but it is completely constrained by all the biases that you have and your mental models and basically how your team thinks about it. If you think about it, a lot of failed projects are just like big architectural flaws.
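Feature flagging and canarying both come down to exposing a change to a small, stable slice of traffic first. The sketch below shows one common way to do that with deterministic hashing. It is an illustration under stated assumptions, not how LaunchDarkly or any particular flagging product works, and `in_rollout`, the flag key, and the user IDs are hypothetical.

```python
import hashlib


def in_rollout(flag_key, user_id, percent):
    """Deterministically decide whether a user falls inside a percentage rollout."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in the range 0..99
    return bucket < percent


# The same user always gets the same answer for the same flag, and widening
# the rollout only ever adds users, it never removes them.
users = [f"user-{n}" for n in range(1000)]
at_10 = {u for u in users if in_rollout("new-checkout", u, 10)}
at_50 = {u for u in users if in_rollout("new-checkout", u, 50)}
```

Because the bucket depends only on the flag key and the user ID, rolling back is a matter of lowering the percentage, with no redeploy involved.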

It’s primarily because people didn’t validate soon enough in production and it’s probably because people let their biases carry them too far until they spent too many engineering cycles and too much time building something that’s pretty big. Then it’s probably too flawed, because you didn’t integrate rapidly and you didn’t basically get feedback from production fast enough. This is like a service that I’ve been sort of the developer for, on call for, for the better part of the last two years. It doesn’t matter what the service really does, but you know when I develop this thing, I don’t really bring up all these services on the laptop, what I really do is just bring up this one thing and then test the rest in production.

Surprisingly it’s worked well enough for us, I mean, there’s never been an outage because I’ve been developing this way. The other pretty weird thing is I’ve worked at companies where people have come and asked us like, can you give us your whole service as a Docker container. You’re talking about like a SaaS service. Like can you just give us your service as a Docker container so that we can integration test our services with basically yours. Most of us these days just integrate with a whole bunch of third party services. As we should, because if it’s not a core competency for your startup then you should probably be outsourcing it to someone else. Right?

Whose basic business model and basically the reason they even exist is because they’re providing the service, and then it raises the question. How do you even test these services? Right? In my opinion, the way you test is not to ask them to give you a Docker container that then you sort of spin up in a test environment. You need to either have better abstractions in place, you know, where you can sort of understand your vendor’s failure modes and sort of program against that. Or you just directly talk to your … directly make like an actual call to this third party service that we’re talking to. Which, again becomes harder because a lot of services don’t really provide test mode sort of like features, or they don’t really allow you to test it, like every time you make a call to them you get charged, which kind of sucks.

That’s again, that’s a user experience thing right? In my opinion not a lot of vendors do this thing where they really think about how someone is gonna be integrating with them or gonna test well. So, yeah, that’s pretty much all I’ve got and a lot of these thoughts are like probably better explained and better articulated in the blog post if anyone really wants to spend 40 minutes reading that. Go read that, I won’t do that, but yeah. Thank you.

05 Feb 2018

To Be Continuous: Continuous Integration at Microsoft

In this episode of To Be Continuous, Edith and Paul meet up with Microsoft program managers Simina Pasat and Joshua Weber to discuss how continuous integration plays a role behind the scenes at Microsoft. Hear how they develop across platforms, use feature flags, approach user feedback, and more.


12 Jan 2018

What Makes a Failure a Disaster?

It was December 31, 1999, and a young sysadmin had just bet her boss that their systems wouldn’t go down. The stakes were her job. She was driving 4 hours to go to a New Year’s Eve party with her friends, and she wasn’t worried about Y2K.

Though our sysadmin had her pager, if something went down in her systems she wasn’t going to be able to fix them before people noticed. Her boss told her that if anything happened, she didn’t need to come back, ever. She left anyway, because she knew everything would be fine. This wasn’t an act of faith; she knew nothing would go down, because for the whole of 1999, and even before, the system had been preparing.

Everyone knew Y2K was coming. Our sysadmin had patched all her scripts, installed all the software updates the vendors had recommended, and traced back every piece of hardware and firmware to make sure it had been patched and updated.

As we have moved further from the reality of Y2K, we have forgotten. We forget just how many millions of dollars and person-hours went into reducing our risk. Utilities did fail, but that was in July of 1999, because groups were doing planned tests. They even staggered the tests so the whole grid wouldn’t go down. The SEC threatened to cut off non-compliant banks if they didn’t meet a series of rolling upgrade deadlines. Thousands of programmers were hired to remediate systems that couldn’t be automatically upgraded.

Now it’s easy to think Y2K was an overblown story. Nothing really terrible happened, so it must not have been that bad. But it was that bad, and we just managed to fix it in time through an effort more intense and widespread than putting humans on the moon. Often we know that things will fail, but what makes a failure a disaster?

Risk Reduction

In the case of Y2K, all the upgrade focus was about risk reduction. We knew there was something that could cause problems, so we did everything we could to keep a major disaster from occurring.

While it’s impossible to avoid all risk, we recognize that we can at least reduce risks that we’re aware of. But what are the essential elements of reducing risk? And how do we know what we need to prevent?

Secure Your Zone

Rather than becoming overwhelmed at the idea of trying to avoid risk, we need to break our world into zones, decide how much risk we can tolerate in each zone, and then secure it as much as possible.

In my job, I work on adding documentation to our APIs. There is a risk I could break someone’s integration if I add or delete something in the wrong place. To avoid that, we have a process where someone else reviews my changes before they go live.

Pull requests and reviews are so standard that we don’t usually think of them as a risk reduction strategy, but they are. We are adding a little friction to the act of committing code in order to prevent easy mistakes. We also prevent mistakes or malicious activity by limiting who has “commit bits” to critical parts of our code, and by keeping backups of our code so that we can restore a working state quickly.

At LaunchDarkly, we care about reducing risk by making it easy to do the right thing and hard to make unrecoverable errors. Wrapping feature code and product settings in feature flags makes it easy to change the state of your software without adding the confounding effects of a deployment to the process.

Your zone is only as big as the area you have control over. You want to be sure that people who are changing your software are authorized to do so. I can change the APIs and their documentation, but no one has or should give me access to change other parts of our code.

Predict States

Another way to reduce risk is to address issues before they become problems by evaluating your system and identifying where they might occur. The formal tool for predicting the states something could end up in is the finite state machine (FSM). An FSM is defined by a list of its states, its initial state, and the conditions for each transition from one state to another. Many process flowcharts are effectively state machines for systems.

To use a state machine to assess risk, you enumerate all the possible states your system could be in, and what would make that happen. You can then address the transitions that lead to states you don’t want.
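As a rough illustration of that enumeration, here is a minimal Python sketch. The states and events describe a made-up deployment lifecycle and are assumptions chosen for the example, not a description of any real system.

```python
from collections import deque

# Hypothetical deployment lifecycle; state and event names are illustrative only.
INITIAL = "committed"
TRANSITIONS = {
    ("committed", "review_passed"): "merged",
    ("committed", "review_failed"): "rejected",
    ("merged", "deploy_ok"): "live",
    ("merged", "deploy_failed"): "outage",
    ("live", "bad_config"): "outage",
    ("outage", "rollback"): "merged",
}
UNWANTED = {"outage"}


def reachable_states(initial, transitions):
    """Breadth-first walk over the transition table, starting at the initial state."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for (source, _event), target in transitions.items():
            if source == state and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def risky_transitions(transitions, unwanted):
    """List the (state, event) pairs that lead into a state we do not want."""
    return sorted(pair for pair, target in transitions.items() if target in unwanted)


print(sorted(reachable_states(INITIAL, TRANSITIONS)))
print(risky_transitions(TRANSITIONS, UNWANTED))
```

The second list is the useful one for risk work: each pair it returns is a transition to guard, for example with a review step or a staged rollout.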

After code is committed, there are still several ways it could fail. For instance, we try to have more than one way to serve it, so that if there is some kind of infrastructure problem, we have a second server in another location already online. Sometimes, the risk is that the page or product won’t work for everyone, so we deploy it to just a small percentage of our users—also known as a canary launch. That means that if there was a risk we didn’t foresee, we only have to repair a small part of our user base, not the whole thing.

Being able to test our assumptions on a small scale gives us the confidence that our predicted states are accurate and we can handle them.
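A common way to pick that small percentage is to hash each user into a stable bucket. The sketch below is one such approach in Python; the function names and the choice of SHA-256 are assumptions made for illustration, not how any particular product implements it.

```python
import hashlib


def bucket(user_id: str, flag_key: str) -> int:
    """Map a user to a stable bucket in [0, 100) by hashing the flag key and user id."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_canary(user_id: str, flag_key: str, percent: int) -> bool:
    """A user is in the canary when their bucket falls below the rollout percentage."""
    return bucket(user_id, flag_key) < percent


# Usage: roll a hypothetical "new-page" feature out to roughly 5% of users.
users = [f"user-{n}" for n in range(1000)]
canary_users = [u for u in users if in_canary(u, "new-page", 5)]
```

Because the bucket depends only on the user and the flag, the same user stays in (or out of) the canary across requests, and raising the percentage only ever adds users.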

Make Low-Stakes Bets

We can’t entirely avoid risk and stay safe. One of the ways we can reduce risk is to make small bets: not an all-or-nothing wager, but a little bet that this will go well, or at least generate a minor improvement. Risk is not an all-or-nothing proposition; it’s a matter of what we can tolerate, what we can afford to lose. Once we can define that for ourselves, we can be much more adventurous within those limits.

26 Dec 2017

Talking Nerdy with Technically Speaking

Our own developer advocate sat down with the Technically Speaking podcast at the Agile Midwest conference in St Louis, MO. They discussed what a developer advocate is, the LaunchDarkly feature management platform, and how teams use feature flags.

“Decoupling that deployment from the activation of code gives people so much more security and reduces the risk of deployment. I think a lot of reasons people resist continuous deployment is the fear that they could break something and not be able to fix it in a hurry. So with this we’re saying, ‘Deploy all you want, activate when you’re ready.’” – Heidi Waterhouse

Check it out to hear how companies like Microsoft and GoPro are using feature flags to dark launch, do internal testing, and use feature flags as kill switches.

TRANSCRIPT

Zach Lanz: Live from St. Louis, Missouri, it’s the Technically Speaking podcast, brought to you by Daugherty Business Solutions. Get ready ’cause it’s time to talk nerdy on the Technically Speaking podcast.

Welcome back in, Technically Speaking. We are talking nerdy today. We are talking all day. We’re live at the Agile Midwest Conference here in beautiful, beautiful downtown St. Charles, Missouri at the St. Charles Convention Center. I mean, I can overlook the river right out the window. It’s picturesque. Today we have quite a collection of some of the most forefront on the agile front lines and thought leadership bringing you some content and their perspectives. This episode is no different because today we have Heidi Waterhouse. Heidi, welcome to the podcast.

Heidi Waterhouse: Thank you. I’m excited to be here.

ZL: Listen to your adoring fans.

HW: My adoring fans.

ZL: We might have to shut the door so they leave us alone. So Heidi, you are in town from Minneapolis, coming all the way from Minneapolis, correct?

HW: Yes.

ZL: Just for the conference?

HW: Just for this.

ZL: You had your talk earlier today and I understand that you had a couple technical snafus.

HW: Oh well, I think it’s not really a presentation if something doesn’t go a little weird, but I did get to talk about the things that I cared about so that worked out.

ZL: You are a developer advocate at LaunchDarkly. From what I understand, it’s a company that specializes in feature flags. Tell us a little bit about what a developer advocate is, what you do there, and a little bit about LaunchDarkly.

HW: A developer advocate has three different roles that we serve. The first is I go out to conferences like this. This is my third this month. I have two more to go and I have five next month. I talk to developers about ways that I could improve their workflow and make their lives easier. I also spend a lot of time listening to developers and figuring out what it is that they need so I can take that back to my team and say, “Hey, there’s this unmet need the developers have that we need to look into.” The third is recruiting. I meet a lot of developers and I can say to them, “Hey, we have this opening for a full stack developer in Oakland.” They’re sometimes interested in it. Midwesterners seldom take us up on, “You could move to Oakland.”

When I say LaunchDarkly and feature flags, what I’m talking about is LaunchDarkly is a company that provides feature flags as a service. You wrap your code in a little snippet of our code, connect it to the SDK and then you deploy the whole package and it sits out there like Schrödinger’s code, both alive and dead, until you flip the switch to turn it on. Decoupling that deployment from the activation of code gives people so much more security and really reduces the risk of deployment because I think a lot of reasons people resist continuous deployment is the fear that they could break something and not be able to fix it in a hurry. With this we’re saying like, “Deploy all you want, activate when you’re ready.”

ZL: Just generally I guess, I think you touched on it a little bit, but explain a feature flag and what that does.

HW: A feature flag, or a toggle, or a feature management, it’s called a bunch of different things, is a way to turn a feature on and off from a central command. We work with Fastly and we have edge servers that help people to do this. Imagine you deployed a feature that say gave you the ability to do holiday snow on your webpage, like CSS that made your webpage snow. Well, it turns out that has a conflict with some browsers and you’re crashing people’s browsers. You want to be able to turn that off instantly. We like to say that you can kill a feature faster than you can say, “Kill the feature.” It’s about 200 milliseconds. All you have to do is flip the toggle and the feature turns off almost instantly for everybody, instead of having to deploy again and hope that you get it right this time and didn’t leave anything in.
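The kill-switch idea described here can be sketched in a few lines of Python. This is not the LaunchDarkly SDK: `FlagStore` is a hypothetical in-memory stand-in, and the flag key `holiday-snow` is just a name borrowed from the example above.

```python
class FlagStore:
    """Hypothetical in-memory flag store; a real service would fetch flag state remotely."""

    def __init__(self):
        self._flags = {}

    def set(self, key: str, enabled: bool) -> None:
        self._flags[key] = enabled

    def is_enabled(self, key: str, default: bool = False) -> bool:
        # Unknown flags fall back to the default, so a missing flag means "off".
        return self._flags.get(key, default)


def render_page(flags: FlagStore) -> str:
    """The snow code is deployed either way, but only runs while the flag is on."""
    if flags.is_enabled("holiday-snow"):
        return "page with snow"
    return "page"


flags = FlagStore()
flags.set("holiday-snow", True)   # activate the already-deployed feature
flags.set("holiday-snow", False)  # kill switch: turn it off without redeploying
```

The point of the design is that turning the feature off is a data change rather than a code change, so no second deployment is involved.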

ZL: You mainly work with companies large and small, I would imagine, that have a need to be able to turn those features on and off. Are there any great success stories that you’d like to share of being able to prevent an issue before it happened, or turn it around before customers even knew?

HW: We work with Microsoft and they do a lot of dark launching and sneak the features in so that they can do what we call canary testing, where you test with a small percentage of your population. But we also worked a lot with GoPro to be able to let them develop in their main line and do internal testing. Then, when they were ready, they pushed out their new version with no problems. Not a lot of people tell us when they have to hit the kill switch. We can see it in the logs, but I don’t want to call anybody out because it’s uncomfortable to have to say, “I deployed something that was bad,” but it’s not as uncomfortable as it is if you have to say, “I deployed something that was bad and it took forever to fix it.”

ZL: Your session today was a choose your own adventure interactive. I love that concept. I love games and I love just the interaction in a session like that. Walk through the concept there of how the choose your own adventure maps into this idea of deploying in a dark manner and the feature flags and things like that.

HW: We have a little mascot, an astronaut named Toggle. I created an adventure for Toggle where you can choose how you want to get to a planet based on different routes, like which ones are safer, or which ones are faster, or which ones are most scenic. If you think about the way you can sometimes get a map to show you like, “Don’t put me on any freeways,” that’s sort of the way that I designed this talk. This is the first time I’ve used the Twine gaming platform to design a talk. It turned out my CSS is really rusty, but there’s a lot of good help out there, so I got through it. It was really interesting to be able to say like, “Hey audience, pick which direction you want to go. Let’s talk about the thing you’re interested in. I’ve created 60 some slides of content, but the path through it is unique every time to your experience.”

ZL: How did they react? Which routes did you go?

HW: Unsurprisingly, the first route that we took was development. We talked about canary launches, which is the small launch, and we talked about multivariate flags. You could say, “I want to launch to 10% of the people in Germany.” We are basically doing access control lists for your product. You can slice and dice your audience however you’d like. Then, I really love this concept, I got it from Fastly, the albatross launch where you have a legacy client that’s very important to you that you want to keep happy, but you also want to keep your code base moving forward, you can switch them over to the older code base and keep your mainline moving forward without having to actually split your lines, split your code instances.
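A rule like "10% of the people in Germany" combines an attribute match with a percentage rollout. The Python sketch below shows one way that could work; the `Rule` type, the `country` attribute, and the hashing scheme are all assumptions for illustration and not the LaunchDarkly API.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical targeting rule: match one attribute, then roll out to a percentage."""

    attribute: str
    value: str
    percent: int


def _bucket(flag_key: str, user_key: str) -> int:
    """Stable bucket in [0, 100) derived from the flag key and the user key."""
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def evaluate(flag_key: str, user: dict, rule: Rule) -> bool:
    """Serve the feature only to matching users whose bucket is under the percentage."""
    if user.get(rule.attribute) != rule.value:
        return False
    return _bucket(flag_key, user["key"]) < rule.percent


# Usage: launch a hypothetical "new-checkout" feature to 10% of users in Germany.
germany_10 = Rule(attribute="country", value="DE", percent=10)
shown = evaluate("new-checkout", {"key": "user-42", "country": "DE"}, germany_10)
```

Users outside the matched group never see the feature regardless of their bucket, which is what makes the rule behave like an access control list.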

ZL: Well, awesome. Also, my crack team of researchers also dug up a little bit. It looks like outside of maybe the workplace, you are maybe an aspiring designer or seamstress.

HW: Oh yeah. I sew 100% of the clothes that I wear to conferences because I am really tired of dresses that don’t have pockets. If you wear a mic pack you have to have a belt of pockets. I just do all my own sewing and I find it super satisfying to make something tangible after a long day of software, which is a little fuzzy around the edges. Also, it means that I get to do tourism shopping. I just came back from New York and I honestly have a dedicated fabric suitcase.

ZL: Wow. Now it’s all coming out.

HW: My regular carry-on, rollaboard-size suitcase fits in the big suitcase. Then I’ve come back from, oh, Tel Aviv or Australia or London, with a second bag, essentially, full of fabric. In fact, the dress that I’m wearing right now has fabric from Australia and London.

ZL: Wow. It’s a trip around the world for your clothes.

HW: It is. It’s really a nice reminder that we aren’t just here to do technology but also to experience the places that we go.

ZL: Yeah, absolutely. Do you have any places in mind for this trip? Have you eyed any places?

HW: Actually, I’ve never been to St. Louis before, but I don’t have a lot of time on this trip. I actually have to leave in a few hours.

ZL: Oh, that stinks.

HW: Yeah, it’s very sad.

ZL: Well, hopefully you get back to the airport in time and no delays there.

HW: Yeah, no. It’s only a couple minutes so it should be fine.

ZL: If people have heard about LaunchDarkly and are curious, want to hear more about these feature flags, it sounds like a fantastic offering. I don’t know who wouldn’t be interested in something like that, to be able to control your launch a little bit and pull stuff back if disasters happen. If people want to get in contact with you, or with the company, what’s the best way to do that?

HW: They could write me, Heidi@launchdarkly.com, or you can find us on Twitter @LaunchDarkly, or you can find me on Twitter @WiredFerret.

ZL: WiredFerret.

HW: I know.

ZL: What’s the story behind that name?

HW: Well, it turns out that I am both excitable and high strung, so yeah. Anybody who’s seen me at a conference party understands.

ZL: Got you. Do you bite?

HW: I don’t. I don’t. That’s definitely against the code of conduct.

ZL: Yeah, because I mean, that’s the MO on ferrets.

HW: It is. It is. We had ferrets for years and they were terrible sock thieves. Whether or not your foot was in the sock, they were going to steal it.

ZL: It was gone. Well, I appreciate you stopping by and giving up some of your lunch hour. I know this is high priority time, so I appreciate you coming on, sharing your perspective. It sounds like your presentation went great despite a couple technological glitches. I hope you have a good flight back and thank you for taking some time out and sharing your perspective with us today.

HW: All right. Thank you.

ZL: Thank you for listening to the Technically Speaking podcast. Get involved with the show by following us on Twitter @SpeakTech, or like our page at Facebook.Com/SpeakTechPodcast. If you have suggestions or questions related to the show, or would like to be considered as a future guest, send feedback and inquiries to hey@speaktechpodcast.com. I’m Zach Lanz and thank you for listening to The Technically Speaking podcast.

21 Dec 2017

Toggle Talk with HappyCo’s VP of Product Phill Claxton

Phill Claxton has implemented feature flags in three different organizations. As a driving force on the business side, he’s seen his company evolve and witnessed firsthand the difference feature management has made in the overall progress of the company. When it comes to the most fundamental benefits of feature flagging, he points to continuous delivery and the speed of learning. Phill is VP of Product at HappyCo, COO at Storekat.com and a cross-functional consultant for early startups.

How long have you been feature flagging?

I’ve been a fanboy of feature flags for a little while. They’re a strange thing to love but it’s the result of implementing them that has made such a huge difference in my life, and the companies I’ve been involved with.

I think I can trace my first exposure back to a HubSpot article in 2015. We were faced with a challenge at a startup that I founded down here in New Zealand called DeskDirector. We had maybe 150 – 200 customers/accounts at the time, and we were hitting a challenge that’s a telltale sign for those on the other side of learning about feature flags. We had a very small development team, and it was very hard for us to do multiple things. We needed to decide what to build next and measure the impact from it.

We had made some pretty substantial failures, it’d be fair to say. We’d launch features that we thought were going to help the many but when they failed we realized it was the false wisdom of the few.

“So we were trying to work out a way to learn faster.”

At the same time we were thinking about how we can restructure our pricing, and it sort of came together when we read that Hubspot article. We realized we needed to control the release of features we were working on to cohorts of customers. That way we could quickly decide whether this had the broader appeal to launch to our greater market.

When do you think feature flagging is most useful?

Once you hit that 100 customer mark there’s some natural segmentation that lends itself to feature flagging to test out cohorts. The customer grouping starts to look about the same as it will at 1,000 customers. I would suggest that’s probably when you want to start really thinking about feature flagging.

I’d caveat that by saying we were in the B2B space. You might get there much sooner if you were selling a product to consumers.

“That being said—the last times we’ve done it, it was always too late. We could have started earlier. I suggest you start learning really early without worrying too much about cohorts.”

For me it’s always been two things, which is about continuous delivery and the speed of learning. Specifically, allowing a product team or engineering team, to launch something and learn at a much faster rate from the customer base. The same thing goes for pricing inside your product.

In the SaaS world, the need to experiment with pricing is a part of your standard business practice. And the earlier you can do that, the better. Sometimes people forget that there’s many different use cases for feature flagging.

Are there any cases where feature flagging is not a good idea?

Our practice was to flag anything that was impacting on the user’s experience. So, if the impact on the end-user experience is low and there’s a very high level of confidence in the change, then we’re comfortable not flagging. If the change itself is completely hidden from the users’ experience and server side, then that’s not normally something that we would need to consider feature flagging.

“I would say more often than not there are situations where it isn’t thought of enough. The default position should be that it is flagged.”

Best use of a feature flag—a personal story?

By this stage we were feeling mature in our development cycle and had flagging in place. It was a documentation platform that tracked expirations of devices that an IT company was looking after.

We were working on a new notification engine which would trigger expiration alerts to customers. And so we did feature flag this feature, and we’d gone through a lot of initial wireframes with customers. We’d even built out an early stage MVP without production data. All was looking great, and we launched it into production. It was feature flagged and enabled only for those in our Customer Advisory Council.

The moment we did that every single one of them had their inboxes flooded with anywhere from a few hundred to, in one case, about 15,000 emails. We’d built this notification engine to let them know when things were about to expire and embarrassingly we didn’t think about what was going to happen when things had already expired. And so every device they were managing, which counted, for many, into the thousands would suddenly trigger this alert to say, “This is about to expire”—when it actually already had. And so that was one of those cases where we went, “Hallelujah.” We had gone through a lot of exhaustive QA and unit tests, etc., so we knew things weren’t going to break and the upside was our notification engine worked at a massive scale. It still had a lot of work to do when we turned it on.
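The interview does not show the actual code, so the following Python sketch is only a hypothetical reconstruction of the kind of check involved. The function names and the 30-day window are assumptions; the contrast is between a check with no lower bound and one that excludes devices that have already expired.

```python
from datetime import date, timedelta


def naive_expiring_soon(expiry: date, today: date, window_days: int = 30) -> bool:
    """Hypothetical buggy check: anything expiring before the window ends matches,
    including devices that expired long ago."""
    return expiry <= today + timedelta(days=window_days)


def expiring_soon(expiry: date, today: date, window_days: int = 30) -> bool:
    """Match only devices that have not expired yet but will within the window.
    A device expiring today is treated as not yet expired."""
    return today <= expiry <= today + timedelta(days=window_days)


today = date(2018, 1, 15)
long_expired = date(2016, 6, 1)
naive_expiring_soon(long_expired, today)  # matches, so an alert would be sent
expiring_soon(long_expired, today)        # does not match, so no alert
```

A test against an account holding thousands of already-expired devices is exactly the kind of case that production data surfaces and wireframes do not.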

The reality was that the output for the end-user was certainly not the experience we would’ve wanted. If we had turned it on at the time to our north of 2,000 customers, we would’ve flooded our support team with a lot of requests. Also we would have had a lot of very unhappy users.

“So because of feature flags we were able to quickly go back and amend that feature then test again with that Customer Advisory Council before launching as a general release.”

The real benefit obviously came from all the learning that we got from the process as well. Not just for problems, but also just day-to-day functionality that we thought we’d scoped correctly: when given to customers in production, it’s totally different.

“This is when the real learning happens. Showing them some wireframes is totally different from showing them it working with their data.”

What do you think is the number one mistake that’s made around feature flagging?

The #1 mistake for me would be implementing it and actually not remembering to ensure that the features being built are actually flagged. It sounds simple, but it’s actually a common trap that happens all too frequently. You end up not flagging enough of the functionality in the product and it becomes more work later on.

#2 You need to flag more often than you probably were planning to.

“You need to ask yourself why are these features not being flagged rather than the other way around.”

Probably the only other thing is not having a program for removing flags as well. For all the benefit it offers it does make your code base a lot more verbose, and there’s a small amount of arguable tech debt that you’re buying with all this. So not having a program for removing the flags is probably the other mistake.
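A program for removing flags can start as something very simple, such as a periodic report of flags that look finished. The Python sketch below is one hypothetical version: the `FlagRecord` fields, the 90-day threshold, and the rule that a flag is "settled" at 0% or 100% are all assumptions, and measuring age from the creation date is a simplification.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FlagRecord:
    """Hypothetical inventory entry for one flag in the code base."""

    key: str
    created: date
    rollout_percent: int


def removal_candidates(flags, today: date, max_age_days: int = 90):
    """Flags that are fully on or fully off and older than the threshold
    are candidates for deleting from the code."""
    stale = []
    for flag in flags:
        settled = flag.rollout_percent in (0, 100)
        age_days = (today - flag.created).days
        if settled and age_days > max_age_days:
            stale.append(flag.key)
    return sorted(stale)


inventory = [
    FlagRecord("new-notifications", date(2017, 6, 1), 100),
    FlagRecord("pricing-test", date(2017, 1, 1), 50),
    FlagRecord("beta-dashboard", date(2017, 12, 1), 100),
    FlagRecord("old-import", date(2017, 3, 1), 0),
]
candidates = removal_candidates(inventory, today=date(2018, 1, 1))
```

Flags still mid-rollout are left alone no matter how old they are, since removing them would change behavior for some users.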

How do you think feature flags play into the DevOps movement? Continuous Delivery?

Early last year I started a company called IT Glue, as the Chief Operating Officer. And I would certainly say that feature flags would be one of those reasons we were able to scale and scale so fast. We went from 500 accounts and about seven on staff, to 2,500 accounts and 63 staff within one year. So you often get asked, how were you able to scale? What’s the magic bullet?

“And I’d never say there was a silver bullet, but if I listed off the ten things that drive towards continuous delivery, a dev process supported by feature flags would always be on that list.”

They’re really a very critical part of DevOps. Having not worked in very large engineering teams I can only imagine what the impact would be and I would say it plays a fairly substantial role. But it tended to be our DevOps engineers who gained some of the best advantages from actually having these feature flags in place.

Are you seeing feature flagging evolving? If so how? And how do you expect it to change in the future?

I’ve certainly seen companies like LaunchDarkly launch into the market, which is great. So much of what was having to be done in-house can now be managed by third parties. It’s definitely evolving, yes. Through no fault of anyone’s, though, I’m not seeing it evolve fast enough. Part of the challenge, I think, is that the greater majority of companies I speak to still don’t really know and think about feature flagging as an important part of their growth strategy and engineering strategy.

Know a feature flag enthusiast we should talk to? Email us or let us know on Twitter.

13 Nov 2017

To Be Continuous: Getting Acquired, Founder Conviction, DevOps in Continuous Delivery

In this episode of To Be Continuous, Edith and Paul are joined by Keith Ballinger and Thomas Dohmke from Microsoft. They share their experience of being acquired by Microsoft and discuss the role of DevOps in the continuous delivery process. They then consider the importance of balancing founder convictions of product value and how the world is changing, with necessary market validation activities. Keith also shares his thoughts on why early-stage founders should ignore larger companies for as long as possible, focusing instead on building their business.

This is episode #40 in the To Be Continuous podcast series all about continuous delivery and software development.
