18 Apr 2018

Konmari Your Code Base

Konmari is a new name for an old concept—we shouldn’t live with STUFF that isn’t serving a purpose in our lives. It’s championed by Marie Kondo, a woman passionate about throwing things away and organizing what’s left. Her breakout book was titled “The Life-Changing Magic of Tidying Up.”

Though Kondo’s focus leans heavily on physical things, I think her concepts are important for our digital lives as well. Because storage is effectively free, we never feel the pain of being hoarders, at least not directly. Never throwing things away is a personal choice when it’s about blurry pictures of our cats in our own folders, but it’s a different problem when it’s about our code base.

How do you know when you should remove flags from your code base? How do you clean up after all your experiments?

Build for Future Ease

The easiest way to avoid having big messes to clean up is to not make a mess in the first place. We all know that sounds simpler than it is, but I also believe that any improvement is a help. Just as it’s better to put your dirty clothes in a laundry hamper than on the floor, it’s better to tag new features with meaningful identifiers. Here are some best practices for future ease. You may already know them, but just in case:

  • Add comments about what the code is meant to do
  • Add uniform tags for search-ability
  • Use meaningful variable names
  • If a feature is dependent on another function, mention that in the comments
  • Add dates about when something was added, don’t just depend on commit messages to indicate that
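As a rough illustration of uniform, searchable tags, here is a minimal sketch. The `FLAG:` comment convention, its `added=` and `depends=` fields, and the sample flag names are all hypothetical; any consistent format that a script can search for would do.

```python
import re
from datetime import date

# Hypothetical tag convention, written as a code comment next to the flag:
#   FLAG:<name> added=<YYYY-MM-DD> depends=<other-flag>   (depends is optional)
TAG_PATTERN = re.compile(
    r"FLAG:(?P<name>[\w-]+)\s+added=(?P<added>\d{4}-\d{2}-\d{2})"
    r"(?:\s+depends=(?P<depends>[\w-]+))?"
)


def find_flag_tags(source: str):
    """Return one dict per tagged flag found in the given source text."""
    tags = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        match = TAG_PATTERN.search(line)
        if match:
            tags.append({
                "name": match.group("name"),
                "added": date.fromisoformat(match.group("added")),
                "depends": match.group("depends"),  # None when not given
                "line": line_number,
            })
    return tags


SAMPLE = '''
# FLAG:new-checkout added=2018-03-02 depends=payments-v2
if flag_enabled("new-checkout"):
    pass
# FLAG:dark-mode added=2018-04-10
'''

for tag in find_flag_tags(SAMPLE):
    print(tag["name"], tag["added"], tag["depends"])
```

Because the date and the dependency live next to the code, a cleanup script does not need to dig through commit history to find out how old a flag is.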

Automate Deletion

We have emotions about deleting things. We have to accept that at some point the feature (or picture or paragraph) will no longer serve a purpose. Since the things we delete always represent memory, potential, or work, we are often reluctant to delete them—we know that moment in time will never happen again or that work would have to be repeated.

The solution to this is not to avoid deleting things, but to avoid the emotional impact of making the decision. Automate the deletion of anything that you can. Meeting notices vanish when the meeting is past–deployment flags should evaporate when the feature is fully adopted. By configuring your code to delete outdated elements, you keep the clutter from accumulating in your codebase.
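One way to take the decision out of your hands is to let a script nominate flags for removal. The sketch below assumes each flag records the date it reached full adoption; the `Flag` class, the 30-day grace period, and the sample flags are illustrative assumptions, not part of any particular flag service.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Flag:
    name: str
    added: date
    fully_adopted_on: Optional[date] = None  # None while still rolling out


def flags_due_for_removal(flags: List[Flag], today: date, grace_days: int = 30) -> List[str]:
    """Names of flags that have been fully adopted for at least grace_days."""
    cutoff = today - timedelta(days=grace_days)
    return [
        flag.name
        for flag in flags
        if flag.fully_adopted_on is not None and flag.fully_adopted_on <= cutoff
    ]


# Made-up example data.
FLAGS = [
    Flag("new-checkout", date(2018, 1, 8), fully_adopted_on=date(2018, 2, 20)),
    Flag("dark-mode", date(2018, 3, 1), fully_adopted_on=date(2018, 4, 10)),
    Flag("beta-search", date(2018, 4, 2)),
]

print(flags_due_for_removal(FLAGS, today=date(2018, 4, 18)))
```

Run on a schedule, a report like this could open a cleanup ticket or fail a build, so the outdated flag gets removed without anyone having to decide that today is the day.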

Consider being even more radical—if you have a policy of only supporting a certain number of versions, automatically roll the oldest version off support when you release a new version. If you make it a policy, everyone will know that they must upgrade when a new version is coming out.

Cultivate Tidy Habits

Wash on Monday. Iron on Tuesday. Bake on Wednesday…

This was an old expectation about when housework happened. Have some set times in your development cycle for cleaning up after yourself. What about that hour at the end of the workweek when you don’t want to start anything? Or the day you have mostly blocked off for sprint exit/entry? Those are good natural break points to go in and clean up the things you have been meaning to get to. When the cadence is regular, you’ll better remember what you were doing. Making cleanup a regular chore instead of a “when I get to it” behavior means it’s much less annoying when you do it.

Reward Good Behavior

Are you the sort of person who internally yells at yourself (or others) for making mistakes? Consider a new way of training yourself to do the right thing. Instead of being angry or frustrated about mistakes, catch yourself doing the right thing and reward it. Deleted some old code? Stand up and look at the sun, or have a hard candy, or trigger a micropayment to a fun account—whatever you find motivational. Associating good behavior with something you enjoy makes it easier to keep doing the good behavior. Even if you know what you OUGHT to do, most people don’t get a dopamine hit from fulfilling obligations.

11 Apr 2018

I Don’t Always Test My Code, But When I Do It’s In Production

How using automated canary analysis can help increase development productivity, reduce risk and keep developers happy

At our March Meetup Isaac Mosquera, CTO at Armory.io, joined us to talk about canary deployments. He shared lessons learned, best practices for doing good canaries, and how to choose the right metrics for success/failure.

“The reason it’s pretty hard is because it’s a pretty stressful situation when you’re deploying into production and you’re looking at your dashboard, just like Homer is here. And he’s trying to figure out what to do next. Does this metric look similar to this metric? Should I roll back? Should I continue moving forward? Is this okay? And more importantly, what’s actually happening is a human being, when they’re looking at these metrics in their Canary, they’re biased, right?”

Watch his talk below to learn more about automated canaries. If you’re interested in joining us at a future Meetup, you can sign up here.


Alright, so a little about me, besides being super nerdy. I have been doing start-ups for the last 10, maybe 15 years, and in doing so, I’ve been always taking care of the infrastructure, and all the other engineers get to have fun writing all the good application code. So a big part of just my background is infrastructure as code, writing deployment systems, all the things that you need to do in order to make code get into production, be in production, and be stable. That’s a little about me.

I’m currently the CTO of Armory. So this is the latest start-up, and what we do is we deliver a software delivery platform, so that you can ship better software, faster. The important point here is it’s done through safety and automation. It’s not all about just velocity. It’s about having seat belts on while you’re driving really, really fast. And the core of this platform is called Spinnaker, which was open-sourced by Netflix in 2015. So by a show of hands, who’s heard of Spinnaker? Alright, so roughly half. Alright, are you guys using it in production at all? Oh, zero. Okay. Well, hopefully I can get you interested in using it in production.

Spinnaker was open-sourced in 2015 by Netflix. It’s by far the most popular open-source project by Netflix. Netflix has many open-source projects, but this by far is the one that’s gained the most steam. It’s got support from Google, Oracle, Microsoft, the big Cloud providers. It’s also got support from little start-ups like Armory. And so we’re all working to make a really hardened deployment software delivery system. It’s being used in production by very, very large companies, medium-sized companies, and even some small ones. So you have an idea of the scope and reach and all the people who are involved in this project. It’s pretty amazing to see where it came from 2015 to where it is today.

But what I’m going to talk about today is one small component of Armory, which is Canaries. It’s all around testing in production. Who is familiar with Canarying? Alright, so roughly half aren’t. The idea of Canarying comes from the canary in the coal mine, when coal miners would go into the mine, and there was a poisonous gas, they would die because they wouldn’t know the poisonous gas was there. So instead of them dying, they decided to bring a canary down into the coal mine, because the canary would die first, which sucks for the canary but great if you’re a human coal miner.

And so the same ideas apply to software delivery. So instead of deploying a hundred nodes into production to replace the old version that you have there before, instead what you’re going to do is deploy one node into production, assess the situation, make sure that it doesn’t fall over or start reporting bad metrics, and if it’s doing well, continue to scale out. And if not, kill the canary and roll back. And so what most people are doing today is that very process. They deploy one node into production. They scramble over to Data Dog or New Relic, and they have some human being looking at an array of dashboards, and then the human being has to make an assessment to actually keep going forward with the deployment or roll back.
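The deploy, assess, then scale out or roll back loop described here can be sketched in a few lines. This is an illustration only: the `deploy`, `looks_healthy`, and `roll_back` callables and the stage sizes are hypothetical stand-ins for whatever a real deployment system provides.

```python
from typing import Callable, Sequence


def canary_rollout(
    deploy: Callable[[int], None],
    looks_healthy: Callable[[], bool],
    roll_back: Callable[[], None],
    stages: Sequence[int] = (1, 10, 100),
) -> str:
    """Deploy in growing stages; roll back at the first unhealthy check."""
    for node_count in stages:
        deploy(node_count)
        if not looks_healthy():
            roll_back()
            return "rolled_back"
    return "promoted"


# Simulated run: the single canary node looks fine, the 10-node stage does not.
events = []
health_checks = iter([True, False])
result = canary_rollout(
    deploy=lambda n: events.append(f"deploy {n}"),
    looks_healthy=lambda: next(health_checks),
    roll_back=lambda: events.append("roll back"),
)
print(result, events)
```

The point of the sketch is that `looks_healthy` is a function, not a person staring at dashboards, so the rollback happens whether or not anyone is watching.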

But it’s actually pretty hard to do really, really good Canaries. And you actually see it as we talk to more and more customers, Canaries, for them, is this pinnacle of a place that they want to get to, and it’s pretty hard. The reason it’s pretty hard is because it’s a pretty stressful situation when you’re deploying into production and you’re looking at your dashboard, just like Homer is here. And he’s trying to figure out what to do next. Does this metric look similar to this metric? Should I roll back? Should I continue moving forward? Is this okay? And more importantly, what’s actually happening is a human being, when they’re looking at these metrics in their Canary, they’re biased, right? You got that product manager breathing down your back to get that feature into production. It’s a month late. So even if they kind of look a little bit off, you’re still going to push that button to go forward. You’re still going to make mistakes like old Homer here.

So who here has kids? Yeah, so you guys are all familiar with this kind of crazy scene here, which is just kind of an out of control situation. And this is what a lot of deployments are like in general. So in order to do a nice, good, clean deployment, you need to have control of the situation and the environment. And when this is happening, there’s no way to do a clean Canary, because when somebody deploys something and it’s a mess and you’re trying to compare metrics, and you’re just unsure, where did that metric come from? Did we deploy it correctly? Did somebody change the config on the way out and it’s not reproducible? You end up in this crazy situation. And most people, or most customers or most companies, their deployments, in fact, are just unstable. So the idea of doing a Canary on top of that, doing some sort of analysis, it becomes very, very untenable.

So why Canaries in the first place? Why test in production? Why not just have a bunch of unit tests and integration tests? And the reason why is that integration tests are actually incredibly hard to maintain, to create. They fail all the time, and they’re very brittle. And engineers don’t like writing tests in the first place anyway, so it’s unlikely that you have that many of those. But these are easy, right? The unit tests provide you a lot of coverage. They’re easy, and as you go up the stack here, the tests will become harder and harder and harder to do, and that’s why you don’t want to write as many of them. And you want to rely on other systems, like Feature Flagging and Canarying to help you out at this top part of the stack.

The other thing about an integration test, each individual integration test that you write only provides you incremental more coverage. You’re not going to get so much more coverage that an integration is ripping through the code base all the way in and all the way back out. You can only do that so many times, right? So they’re hard to create. They’re hard to maintain. They’re brittle. And like I said, I mean, we all know engineers don’t like writing tests in the first place. So it’s a constant battle between devops and engineers on the integration tests.

So then why not only just do Canaries? Well, the problem with only just doing Canaries and getting rid of the unit tests and integration tests and all the other tests, is that you start finding out that there’s huge problems right before you get all the way into production, at this line right here. And that’s a very costly place to find out that you have problems. So while Canarying will be helpful and useful, it’s more of a safety net. It doesn’t replace integration tests. It doesn’t replace unit tests. You still have to figure out how to get those engineers to write integration tests, which is difficult to do. So it’s a combination of all these tools that help you deploy safely into production.

So what makes a good Canary? So the first one is that it’s fully automated. And again, it’s got to be fully automated. The more manual process that you introduce into your Canary process, it doesn’t add any safety. You just have more human beings, and we all know that human beings are error-prone. Literally, as engineers, we build software to automate other human beings out of work, but we won’t do that to ourselves. So we constantly continue to put our human assessment into the situation, into the release process, and that’s pretty painful.

And in fact, actually, a story of a customer that I had, where they had a manual Canary release process, they put out a node, they started putting out more nodes, and this engineer got hungry, so she went to lunch. And she went to lunch for about an hour and a half, and in that time, she took down a very, very large Fortune 100 company site, losing millions of dollars. Again, there’s no rollback mechanism. It was a fully manual system, and so there’s no point in having the Canary if there’s no way to actually automate the process.

The third through the fifth part are about the deployment system itself. It needs to be reproducible, meaning if you run the same Canary twice, you should have the same result, roughly. Obviously, because you’re testing in production and it’s live traffic, it won’t exactly be the same. But you generally want it to be roughly about the same. It needs to be transparent. You need to know what’s happening inside of the Canary, like why is the Canary process deciding to move forward or roll back? Because if the Canary system makes a mistake, the engineer needs to know why it made that mistake so that it can be corrected. And the last thing is reliable, right? There’s a lot of homegrown systems out there in terms of deployments and software delivery, and what ends up happening, if it’s not reliable, is you have that one software engineer that’s gone rogue and is going to start building their own Canary system on the side. And now you end up with three broken Canary systems instead of just one. So this is what makes a good Canary system.

So what are some of the lessons we’ve learned doing Canaries? The first approach is like, let’s apply machine learning to everything, because that’s just the answer nowadays I see. But actually, simple statistics just works. And this actually comes back to the previous statement that I was making about it being transparent. The moment you apply machine learning, it turns a bit into a black box, and that black box can vary in degree depending on who made it, how it was created. And it actually doesn’t really get you much. So what we instead learned was that using simple statistics, we apply now a Mann-Whitney U test to the time series data that we get from the servers, from the metrics, so that we can actually make a decision as to whether to move forward or back. And this is much easier for someone to understand and comprehend when the Canary fails. If it’s a black box and I don’t know what to do and I don’t know why it broke or what metric was actually failing, it’s of no use to me as an engineer who wants to just get my application code out.
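To make the statistics concrete, here is a minimal, standard-library sketch of a two-sided Mann-Whitney U test comparing one metric across two samples. It is not Armory's implementation: it uses the normal approximation without a tie correction, so the p-value is only approximate for small or heavily tied samples, and the latency numbers and the 0.05 cutoff are made up for illustration.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(baseline: Sequence[float], canary: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for the baseline sample, approximate two-sided p-value)."""
    n1, n2 = len(baseline), len(canary)
    combined = sorted(
        [(value, 0) for value in baseline] + [(value, 1) for value in canary]
    )

    # Sum the ranks of the baseline values, giving tied values their average rank.
    rank_sum_baseline = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            if combined[k][1] == 0:
                rank_sum_baseline += average_rank
        i = j

    u_baseline = rank_sum_baseline - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_baseline - mean_u) / std_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return u_baseline, p_value


# Made-up latency samples (ms) from a baseline node and a canary node.
baseline_latency = [101, 98, 103, 99, 100, 102, 97, 104]
canary_latency = [118, 121, 117, 125, 119, 122, 120, 123]

u, p = mann_whitney_u(baseline_latency, canary_latency)
print("metric differs" if p < 0.05 else "no significant difference", round(p, 4))
```

Because the test works on ranks rather than raw values, an engineer can see exactly which metric diverged and by how much, which is the transparency argument made above.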

So another lesson we learned is that this is what we see normally when people do manual Canaries. And I think it makes sense when you do a manual Canary, because it’s simple to orchestrate, you have your existing production cluster, you roll out your one node in the Canary cluster, and then, again, you scramble to Data Dog or New Relic or whatever metric stat store you’re using to just start comparing some of these metrics. But this is like comparing apples to apples, except one of them is actually a green apple. And it’s like slightly different, and you don’t really know why. And it just doesn’t really make sense. But they look kind of the same, so screw it, let’s just go to production, right? And you wonder why the Canary failed. But again, you can’t have a human making decisions about numbers and expect it to be rational.

So instead, what we do is we grab a node from the … Well, actually we don’t grab the node. We grab the image from the existing production cluster. We replicate it into its own cluster called the baseline cluster, and then we deploy the Canary. Because now we can compare one node to one node. They are running different code, which is the whole point, but we’re comparing the same size and metrics. Like if this side is over here ripping through 100,000 requests per second, there’s no way that this is going to be able to be compared to that with simple statistics. So this is what makes also Canarying harder, is that you have to be able to orchestrate an event like this, and that isn’t trivial to do if you don’t have sophisticated software delivery.

So other lessons we learned is that a blue green deployment is actually good enough for smaller services. You don’t need to Canary everything. The blue green is good enough. You release it. If it fails, and alarms start going off, you roll it back. The impact is actually low. And the thing I didn’t mention is all that we’re doing here is trying to figure out how to reduce overall risk to our customers and to the company, right? And sometimes blue green is enough to just actually reduce enough risk if you have that out of the box.

The other thing that we’ve found is people like to do Canaries without enough data coming out of their application. If there is no data coming out of an application into Data Dog or whatever metric store you’re using, you can’t use that information in order to have a good Canary. It’s just impossible. I can’t make data up. And then the last one is choosing the right metrics is important. Each application that our company has has a very different profile. What it means to succeed or fail is very, very different whether it’s a front-end check-out application versus a back-end photo processing application. You know, you might want to look at revenue. Revenue might be a great metric to look at for the front-end customer check-out application, but it has no meaning to the image processing application. So making sure you understand what it means to succeed or fail inside of your application is really important, and it’s very surprising to see how much people don’t realize what it means, and they only learn through failed Canaries to understand what success or failure looks like.

You can do this with Armory. So this is what a pipeline looks like with Armory. I think these might … Yeah. So this is what a pipeline looks like with Armory and Spinnaker. Is everyone familiar with the term baking? No? Okay, so you guys are familiar with the term AMI, an image? So the idea of an image is just an immutable box in which you can put files and all your code, and you take a snapshot of your computer at that one time, and it becomes an image that you use to push into production. That term has been coined, I think by Netflix, bake. So it’s called baking an image. I have no idea why it’s called that actually, now that I’m thinking about it.

And then the next step, when we run a Canary, and what the Canary step will do, we’ll actually send out … We’ll grab that node from production, take the image from production, create a new cluster, and then push out the Canary as well, the new change set, and then run the statistical analysis against it for as long as you configure it. And I’ll show you what that looks like next. You can also see that we also create a dashboard in whatever metric store that you’re using, so that again, back to the transparency, if this thing were to fail, you’re going to need to know what metrics were failing. So we automatically create a dashboard, so you have that transparency.

The reliability here actually comes from open source, the fact that there’s 60, maybe even 70 developers now just working on this. You’ve got the world’s best Cloud engineers building this software, so the reliability comes in the fact that a ton of engineers are working on it. They’re not just going to leave to go to another company in another six months for higher pay. We’re working on this, and it’s going to be around for a very long time.

And so those are the properties that you see right there. This is what it looks like to have a failed deployment. So again, back to each application, we give it a score. An 83 might be a good score for a lot of applications. For this one though, for whomever configured it, decided that an 83 is a failing score. And so what a failing score will do will actually destroy the Canary infrastructure. So it will destroy the baseline and the Canary, and then that should return production back into the state that it was at automatically for you. So you don’t have to do anything. If you get hungry and you go to lunch, you can trust that this thing will take care of it for you.
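The talk does not say how the score is computed, so the sketch below simply assumes it is the percentage of metrics that passed their comparison. What it is meant to show is the per-application threshold: the same 83 can pass for one service and fail for another. The function names, action names, and metric names are hypothetical.

```python
from typing import Dict


def canary_score(metric_passed: Dict[str, bool]) -> float:
    """Assumed scoring rule: percentage of metrics that passed, 0 to 100."""
    passed = sum(1 for ok in metric_passed.values() if ok)
    return 100.0 * passed / len(metric_passed)


def next_action(score: float, pass_threshold: float) -> str:
    """Continue on a passing score; otherwise tear down canary and baseline."""
    if score >= pass_threshold:
        return "continue_rollout"
    return "destroy_canary_and_baseline"


# Made-up results: five of six metrics passed, roughly an 83.
results = {
    "latency_p99": True,
    "error_rate": True,
    "cpu": True,
    "memory": True,
    "queue_depth": True,
    "gc_pause": False,
}
score = canary_score(results)

print(round(score), next_action(score, pass_threshold=90))  # strict application
print(round(score), next_action(score, pass_threshold=75))  # lenient application
```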

So this is what it looks like to config a Canary. So you can choose the data provider. We allow you to config where you look for the metrics and alarms, what’s going to set this Canary to be good or bad, how you want to do the metrics comparison, how long you want to do an analysis period. For instance, there’s some applications that you only want to do a one hour or two hour Canary, and that might be okay to reduce risk, to get an understanding of how the application is going to behave. There’s other applications that are mission critical, and you may want it to run for 48 hours, right? So every application will have a different property and will have to be configured differently.
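The options listed here map naturally onto a small configuration object. The field names below are invented for illustration and are not Armory's or Spinnaker's actual configuration keys; the 30-minute interval between analysis runs is also an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CanaryConfig:
    data_provider: str          # where the metrics come from
    metrics: List[str]          # which metrics to compare
    analysis_hours: int         # how long the canary runs in total
    interval_minutes: int = 30  # assumed gap between analysis runs

    def analysis_runs(self) -> int:
        """Number of comparisons made over the whole analysis period."""
        return (self.analysis_hours * 60) // self.interval_minutes


# A low-risk service might get a short canary, a mission-critical one a long canary.
quick = CanaryConfig("datadog", ["error_rate", "latency_p99"], analysis_hours=1)
critical = CanaryConfig("datadog", ["error_rate", "latency_p99", "revenue"], analysis_hours=48)

print(quick.analysis_runs(), critical.analysis_runs())
```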

And then the metrics that are associated with the Canary, like what do you want us to be looking at. As you start querying, as you’re running in production, we’ll start looking at these metrics. Interestingly enough, we used to have a deviation. That actually no longer exists. We automatically can detect now if you’re falling out of bounds. You don’t need to tell us what a good score is anymore. We just apply more statistics to the time series we get. And you don’t have to apply deviation or any thresholds. We kind of automatically figure that out for you. So the engineer doesn’t really want to be doing this type of work. They want to get back to writing application code.

So the thing that we get a lot is why Canaries vs Feature Flags, and it’s not a mutually exclusive thing. They’re actually used together. And in fact, we actually use Feature Flags all the time with our Canary product, and the way that we use it is whenever we build a feature, we put it behind a Feature Flag immediately, and we’re constantly deploying to production behind this Feature Flag. Everything that we do is under continuous delivery. We’re a continuous delivery company, so we obviously should practice what we preach. So we use a Feature Flag, and then we start iterating. The Canary helps as other commits are getting pushed into production, that you’re not just affecting anybody blindly. You’re not actually making mistakes. And then especially on this last part. It’s really great for code changes that affect large, large portions of the code base, like a database driver change. I don’t know how many times I’ve seen deployments fail because of somebody just switching a Flag, and next thing you know, production is out.

So provide info for every release. If you’re doing a Canary, it’s giving you a good or a bad for every single release. Spans typically a single deployment. You’re not typically going for multiple deployments, although you can be able to do Canaries. And then Feature Flags are great for user visible features. It kind of lends itself to the product team, where a Canary is more of an engineering activity. There’s not a product manager that’s going to do a Canary, at least not that I’ve seen. And then Feature Flags are great for testing against cohorts and more sophisticated analysis. Canaries aren’t always that sophisticated to be able to test against users. A Canary doesn’t know necessarily what a user is in the first place. Alright, so that’s it.

09 Apr 2018

The Customers Won’t Be Happy: How Atlassian Rolled Out a Large Scale UI Change for Confluence Cloud

At our March Meetup we visited Atlassian’s new offices in Mountain View. Aaron Gentleman, Software Development Team Lead at Atlassian, shared how his team actually tested in production when they rolled out ADG3 to Confluence Cloud.

“There was sort of three main concerns. There was: is it ready in the first place, is it actually ready for customers to start using? What’s the impact going to be like, not just for them but actually for us and sort of what does it look like when it’s complete, when we say we’re done. What’s the definition of done actually look like?”

Watch his talk below to learn more about how the team addressed these concerns throughout the testing and rollout process. If you’re interested in joining us at a future Meetup, you can sign up here.


And I’ve been at Atlassian for about two and a half years now. More recently, been here in the States at Mountain View for seven months or so. And I’m really here to talk to you today about one of the ways that we actually tested in production. Some of the customers didn’t like it.

More specifically, I’m actually gonna be talking about how we rolled out ADG3 to Confluence Cloud, so can I get a show of hands, who actually uses the Confluence Cloud product today?

Alright, around 15 to 20 percent of the crowd. That’s good. Can I get a show of hands who used it before we brought out the design change. Alright, about half of that again. And how many of you were frustrated with the actual rollout. Like the actual UI change. Alright, only two. That’s you know, narrowing down so it’s alright.

What I really wanted to bring up is the fact that established users are actually really frustrated with big changes like that. So for those that are unaware, what we really did was take out some big key navigation components from the top of the application that everyone’s used to clicking every single day, like a create button, a dashboard button, a profile icon and we completely ripped that out and put it to the side of the product and a lot of them have gone “why are you doing this? This doesn’t make any sense, I can’t use the product at all.” The feedback from our design team was “well this is awesome, you just gotta get with it”.

That was useful feedback from both customers and design but it still wasn’t very nice for the customers themselves. For those that really aren’t aware what the product used to look like was that slide on the left which is really tiny to use. You probably can’t actually see it. But what we can see up the top on the left is the old navigation components. We had an app switcher, we had a create button and a search button. It’s all there really easy to use and then on the right hand side we have the new UI. What you can see is what we call a portable navigation experience. How you navigate around the whole product and other products. And then on the right portion of that sort of more content directive related to what you do on a day to day basis. And the design team have a really good reasoning as to why they’ve done this but it was very painful for a lot of customers originally.

So what I wanted to talk about when I actually get this back. Yep that’s good, is how we actually roll this change out to production for customers.

Rolling out and testing production, we had sort of, we had a lot of concerns but there was sort of three main concerns. There was: is it ready in the first place, is it actually ready for customers to start using? What’s the impact going to be like, not just for them but actually for us and sort of what does it look like when it’s complete, when we say we’re done. What’s the definition of done actually look like?

When we ask the question “Is it ready?”, what we’re trying to actually ask ourselves is “Was this change something that we ourselves would be willing to actually accept and take on?” In our particular case the answer was yes. Internally we’d actually been using it for about three months on all of our development sites. We were very conscious of the limitations that it would actually bring. We wanted to make sure that when we do roll out we don’t actually roll out to stuff that people aren’t ready for.

One of the other questions is what does that actual percentage rollout look like? Some of our considerations were previous tests that were done on customers. Before rolling ADG3 we’d started a new trial for a single page app in the product. So for those that use ADG3 the whole application, well not the whole but the general application is single page application but before that we’d started rolling out an experiment where it was the UI but single app. Those customers weren’t too happy about receiving that change because again it was a big change and they weren’t happy. But then we wanted to make sure that when we started rolling this out, we don’t actually roll that change out again to those new customers that had already received the previous experience. In LaunchDarkly we actually created sort of a prerequisite group, so an actual feature flag in LaunchDarkly with a set of all those customers that got that old change and made sure we set a prerequisite in LaunchDarkly to say “do not serve this up to these people at all until we’re actually really ready for it”.
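The prerequisite idea can be modeled in a few lines. This is a toy evaluator, not the LaunchDarkly SDK or its data format: the flag keys, the dictionary layout, and the site IDs are all invented to show how one flag can gate another.

```python
from typing import Any, Dict

# Hypothetical flag definitions. "received-spa-experiment" marks sites that
# already got the earlier single-page-app change; "adg3-navigation" is only
# evaluated for a site when that prerequisite flag returns False.
FLAGS: Dict[str, Dict[str, Any]] = {
    "received-spa-experiment": {
        "targets": {"site-a": True, "site-b": True},
        "default": False,
        "off_value": False,
    },
    "adg3-navigation": {
        "prerequisites": [("received-spa-experiment", False)],
        "default": True,
        "off_value": False,
    },
}


def evaluate(flag_key: str, site_id: str, flags: Dict[str, Dict[str, Any]]) -> Any:
    """Serve off_value unless every prerequisite flag returns its required value."""
    flag = flags[flag_key]
    for prerequisite_key, required_value in flag.get("prerequisites", []):
        if evaluate(prerequisite_key, site_id, flags) != required_value:
            return flag["off_value"]
    if site_id in flag.get("targets", {}):
        return flag["targets"][site_id]
    return flag["default"]


print(evaluate("adg3-navigation", "site-a", FLAGS))  # site had the earlier change
print(evaluate("adg3-navigation", "site-z", FLAGS))  # site did not
```

Lifting the restriction later means changing one flag definition rather than editing a list of sites in application code.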

One of our strong considerations was also the definition of done for the customers. So we needed to make sure that when we rolled out those changes, our definition of the customer was actually not an individual user, it was the full site. So, everyone that actually collaborated and belonged on that site got all the same experience from it at the same time because what you don’t want is someone that suddenly gets a UI change and then all of a sudden goes “How do I find anything? How do I use this?” And then they look to the person sitting next to them and they’re still seeing the old version. “What the -? How come you’ve got this and I don’t have this? That’s really frustrating. You can’t even help me do my day to day work.” That was a really strong consideration for us when picking the appropriate key for our percentage rollouts.
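Keying a percentage rollout on the site rather than the user can be sketched with a deterministic hash. This is a generic bucketing scheme written for illustration, not LaunchDarkly's actual algorithm; the flag key, site IDs, and user records are hypothetical.

```python
import hashlib


def bucket(key: str, flag_key: str) -> float:
    """Map a key deterministically to a number in [0, 100)."""
    digest = hashlib.sha256(f"{flag_key}:{key}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000 * 100


def site_in_rollout(site_id: str, flag_key: str, percent: float) -> bool:
    """A site is in the rollout when its bucket falls below the percentage."""
    return bucket(site_id, flag_key) < percent


def user_sees_new_ui(user: dict, percent: float) -> bool:
    # The rollout key is the site, not the user, so teammates always match.
    return site_in_rollout(user["site_id"], "adg3-navigation", percent)


alice = {"name": "alice", "site_id": "site-42"}
bob = {"name": "bob", "site_id": "site-42"}
print(user_sees_new_ui(alice, 30), user_sees_new_ui(bob, 30))
```

Since the bucket for a site never changes, raising the percentage only ever adds sites; nobody who already has the new UI gets flipped back when the rollout moves from 10 to 20 percent.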

We also wanted to make sure that, actually that was about it for that.

We wanted to make sure, another consideration was who gets what version. Who should be getting the change? Should it be new customers, should it be existing customers? What we actually did was we purposely started rolling out to existing customers first but then when we got to a certain threshold of happiness and completeness from a feature point of view, we then started looking at how we roll this change out just for new customers. Our actual rollout pattern hit one hundred percent for new customers before it even hit all of our legacy, existing customers.

By the end of the actual project, we were rolling out at about ten percent every single week from a rollout percentage to see how often people actually got the change.

Another question was “What’s the impact?” Are we fucking the customer? Which is actually one of the Atlassian values which James just brought up and talked about before. It’s don’t fuck the customer. So that’s a really strong consideration for us is we know that these changes are going to hurt people but you know sometimes it’s for the greater good, but really you’ve got to, there’s a fine line between “she’ll be right, they’ll get used to it” versus “no this is actually not right, they’re not going to get used to it, they’re going to leave us.” That was a really strong consideration.

Another consideration was if they do run into problems with this, how can we help them? How can we make sure that this change is easily rolled back for them, or for the entire user base. And another change was that, how can we support the actual support load? Given that we’re already a product that’s been around for a while, one of our primary considerations was what’s the impact of the change? Can this change, wait sorry. I apologize I’ve already gone through that bit.

I’ll jump into the fucking with customers because that’s sort of the one that gets all the quibbles from everyone but you know. As I was talking about, how we test huge changes in production. We had a bunch of research that showed that this design and this layout was actually going to be better for users in the long run, but for existing customers that wasn’t really the case and people were like “No, I’m really used to this, you’re just completely stuffing up how I actually do my day to day work.” We needed a way for them to be able to opt out of that. One of our considerations was actually do not roll out this to people that have actually selected to opt out. We controlled that by the context that LaunchDarkly lets us provide, so that when we actually contact LaunchDarkly and, you know, check the features for that customer, we provide a whole bunch of context and then in our actual feature rules we say “for this context, don’t serve this to a customer who has said don’t give it to me”. That way, the customer can still be happy and still get done what they want to do while we can still progressively roll out in a non micro managed way. We don’t have to do it for every single user.

Another consideration was how can we help them. So if a customer was truly negatively impacted, we needed to be able to roll it back, and support needed to be able to just go in, add the customer’s site ID to a list of prerequisites and bang, the whole site doesn’t actually get it any more. That was really important for customer support to be able to really help customers when the problem happened.
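
The opt-out and the support override described above can be sketched as rules that are checked before the rollout itself. This is a minimal illustration, not LaunchDarkly's rule engine; the function name, the context fields (`site_id`, `opted_out`) and the site IDs are assumptions.

```python
def serves_new_ui(context: dict, excluded_sites: set, rolled_out_sites: set) -> bool:
    """Decide whether a request context should get the new UI."""
    # 1. Support override: a site that support has pulled out never gets it.
    if context["site_id"] in excluded_sites:
        return False
    # 2. Opt-out: respect a customer who said "don't give it to me".
    if context.get("opted_out", False):
        return False
    # 3. Otherwise follow the progressive rollout.
    return context["site_id"] in rolled_out_sites


excluded = {"site-7"}             # maintained by support (hypothetical IDs)
rolled_out = {"site-7", "site-9"}  # sites currently in the rollout

print(serves_new_ui({"site_id": "site-9"}, excluded, rolled_out))
print(serves_new_ui({"site_id": "site-9", "opted_out": True}, excluded, rolled_out))
print(serves_new_ui({"site_id": "site-7"}, excluded, rolled_out))
```

Because the exclusions are evaluated first, the rollout percentage can keep growing without anyone having to manage individual users.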

And one of our last considerations was, can we handle the support load. So rolling out changes and testing changes in production isn’t just an engineering feat and isn’t just a growth experiment feat and isn’t just for design and marketing. It’s really also support, and one of our checklist items wasn’t just “are we ready from an engineering point of view.” It’s “is support ready to handle this increased load.” So when we first started rolling out, we had massive spikes in support because the feature wasn’t complete. It wasn’t as thorough as it was previously and there were a couple of edge cases that we just couldn’t catch when testing in spec and testing locally in development.

They were quite quickly fixed; with the front end and what we’ve got in Confluence we can actually roll out a change in a minute. We were really quick to actually recover and fix the problems for the customers, but the support load and the support impact is still actually there. That was a really strong consideration for us every single time we sat down and had a meeting. Should we increase it to five percent? Should we increase it to ten percent, twenty, fifty, a hundred? So, just when you start testing something in production, keep in mind it’s not just an engineering decision. It’s a decision for the whole team, and that includes potentially your support team, if you’ve got one.

So, lastly we get to “Are we done?” We’ve rolled this feature out to production. We’re a hundred percent, we’re done right? There’s nothing else to do. That’s actually sort of not the way that we consider it. We consider it, it’s only just started now. We’re actually here, we’re a hundred percent. That’s actually now the base experience for customers. What’s next? For us, in Confluence, we’re actually experimenting right now with a bunch of navigational changes for customers because not everyone likes the change in how they can actually navigate pages and page trees and stuff like that. That’s one of the experiences that we’re trialing right now.

We’re also trialing a newer navigational experience on the side so that people can actually experience both the new Atlassian Home offering as well as easily get to Jira if they use Jira and other products. We’re actually also asking the question, was it the right choice? We got to a hundred percent of customers back in August, which is over six months ago now. And all the data is showing that okay yes it looks like it’s generally the right choice. There are a couple of holdouts that are still not quite happy. But over time we feel that with the right level of communication with them and the right feedback from them, we can actually get the right answer from them.

But it was also the right choice technically. So, this was also a forcing change for us to re-envisage how we do the front end in the product. It was a really big engineering feat to change what was an old legacy code base built with jQuery, and technology before jQuery, all the way up to Backbone and luckily not Angular, but. I don’t mean that facetiously.

But we actually got to re-envisage what that front end code base is and what the whole delivery pipeline is for the front end, and now we’re actually a React-based code base in the front end and it’s a single page application for the majority of user experiences, and across the board the users are a lot happier with performance of the product right now. Outside of initially loading the page, navigating to other pages is a lot quicker than it ever used to be.

One of the questions that we didn’t really expect and sort of hit us in the butt and really knocked us for a six is “What needs to be cleaned up?”. So we’d gone through, we’d been making these changes, we’d been building it since October 2016 and what we didn’t expect is the fact that okay it’s behind a feature flag, we can actually turn this on and off via LaunchDarkly in production without any problems at all. It’s actually with a single click of the finger, you can actually turn it off or turn it on. But developers weren’t actually using that experience because again it was behind a feature flag. One of the things that we sort of weren’t expecting was the fact that we’d actually have to go back through our code base, make that the default experience and then actually fix our continuous integration to actually make sure it’s also testing that new experience, which is actually the default experience for customers.

There was almost as much effort pulling out that old experience within our build pipeline and our tests and local development as it was actually rolling out that feature in the first place. So that’s a strong consideration that anyone that wants to do this level of testing in production should keep in mind to a degree.

What else I wanted to talk about was a couple of other ways that we at Atlassian also test in production. We experiment on people. We get a lot of flak on Twitter for that. Some people are like “why are you doing this experiment on me? Why are you doing this experiment on that?” And we don’t intentionally mean to upset customers, but we’ve got teams that are actually dedicated to trying to improve the integration experience between all of our products and sign up experiences and making sure that when you buy Confluence, the first email you get sends you directly to Confluence and those kind of things. They’re actually trying to evaluate this. That’s another way that we sort of test in production, those are little growth experiments.

We also use it to control risky changes. Sometimes there’s changes in the product that we can’t control, well not we can’t control, we can’t test effectively in development or in staging. And the only way we can really effectively test this is when we really get to production. We come up with a design, we undertake the underlying product and have a flag and when the feature flag’s turned on for that customer, they start running that new experience and then when it goes bad, we get so many pages and then we wake up at 3 in the morning and fix it.

That’s another way that we use that. We also, like I mentioned before, are still using it to actually build on and improve the new UI that we’ve brought to Confluence today. We’re working on changing the space navigation for customers and working on a better happiness measure, so that we use UMUX instead of NPS to also check people’s happiness.

I did want to actually go into a couple of gotchas that we did run into when, not just using LaunchDarkly, but just rolling out feature flag type changes to customers in production. So basically having multiple code paths in your product. One gotcha, or one really cool thing to know about with LaunchDarkly, is actually the dev console. For those that are unaware, when you actually set up your group in LaunchDarkly you can actually use the dev console to see a stream of all feature flag activity. So people that are actually requesting flag x, flag y, flag z. If you pause that and actually click on any one of those records, you can actually see the associated context with it as well, which helps you easily sort of drill down into going “okay, I want to target these specific people. I want to start targeting these specific features.” And that helps you sort of correlate between what your product or your service is sending to LaunchDarkly versus what your feature flag should be set as in LaunchDarkly as well.

Another sort of gotcha is using prerequisites. One thing that’s bitten a lot of people in the butt in the past is you might create a feature flag with a prerequisite based on another feature flag, and then later on in the future that feature flag is no longer relevant so you delete that one, but if that is a prerequisite for another flag all it does is remove that entry, and then that feature has a blank prerequisite. What that ends up doing is causing the feature to serve up the false condition or the default condition, so just be careful of that when actually using and removing prerequisites.
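
One way to guard against that gotcha is to look for dependents before deleting a flag. A minimal sketch, assuming flag definitions are available as a plain dictionary; the data shape, flag names and helper names here are hypothetical and not a LaunchDarkly API.

```python
# Hypothetical flag definitions: each flag lists the flags it depends on.
flags = {
    "new-nav": {"prerequisites": []},
    "new-sidebar": {"prerequisites": ["new-nav"]},
}


def dependents_of(flag_key: str, flags: dict) -> list:
    """Return the flags that list flag_key as a prerequisite."""
    return sorted(
        key for key, flag in flags.items() if flag_key in flag["prerequisites"]
    )


def safe_to_delete(flag_key: str, flags: dict) -> bool:
    """A flag is only safe to delete when nothing depends on it."""
    return not dependents_of(flag_key, flags)


print(dependents_of("new-nav", flags))
print(safe_to_delete("new-sidebar", flags))
```

Running a check like this before cleanup surfaces the flags that would otherwise be left with a blank prerequisite.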

With percentage rollouts, make sure that you actually click the Advanced button. You can’t really see it in the screenshot there but down the bottom after you’ve actually done your percentage rollouts, you’ve chosen 20%, 80% or whatever. If you’re multivariate, you’ve got a big breakdown there. Make sure you check the Advanced option, make sure you’re actually rolling out on the expected keys. For us at Atlassian, or at least Confluence, our key was actually based on the users, whereas for ADG3 we wanted to actually roll out on a site by site basis. We had to make sure we chose the appropriate percentage rollout based on sites rather than on customers. Just check that Advanced button and make sure you’re actually doing the right thing there.

And again the last one was really about builds and devs. I think I touched on it just before we started. If you’re actually building these multiple code paths in your product, make sure you’re testing them and making sure you’ve got a plan that’s part of your definition of done to actually clean that code path up when you’re happy with it either being removed or being the default.
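
A minimal sketch of what testing both code paths can look like, assuming a hypothetical `render_navigation` function that takes the flag state as a parameter. The function and menu items are invented for illustration.

```python
def render_navigation(new_nav_enabled: bool) -> list:
    """Hypothetical feature with two code paths behind a flag."""
    if new_nav_enabled:
        return ["Home", "Spaces", "People"]
    return ["Dashboard", "Spaces"]


def check_navigation(new_nav_enabled: bool) -> None:
    """Checks that must hold whichever path a customer is served."""
    items = render_navigation(new_nav_enabled)
    assert items, "navigation must never be empty"
    assert "Spaces" in items, "Spaces must be reachable on both paths"


# Run the same checks with the flag on and off, so CI covers the path
# customers actually get as well as the one still awaiting cleanup.
for flag_state in (True, False):
    check_navigation(flag_state)
```

Once the new path becomes the default, deleting the old branch and the `False` case from the loop is the cleanup step that belongs in the definition of done.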

That’s it. But the last thought is really please think of the customers when you’re actually rolling out changes. Thank you.

06 Apr 2018

Never Read the Comments: Why Comments are Important

In March we visited Atlassian’s new Mountain View offices for our monthly Test in Production Meetup. One of our own software engineers, Traci Lopez, opened the session with a talk on how helpful comments can be for getting visibility into why an event happened the way it did, especially when troubleshooting.

“Maybe you have a clear understanding of where to go, but if someone’s coming onto your team you get a lot more context of what the state of the world is and where you should go.”

Watch her talk below. If you’re interested in joining us at a future Meetup, you can sign up here.


I’m going to talk about comments and this new feature I worked on that we just rolled out and why we think it’s so great.

So, a little bit about myself. Like she said, I’m a software engineer at LaunchDarkly and we’re always looking for great people. If you know anyone that wants to work on developer tools and wants to work in Oakland, come talk to me.

Alright, so for those of you who don’t know, this is flipping the table. So, what happened last week was I received a notification in our change log Slack channel that someone had turned on the firehose flag for one of our customers that has a bunch of those events that Erin’s mentioning in the dev console. So, what this means is basically, we’re sending all of these events to a third party and that can cost us a lot of money and be a big problem. But, what was great was the person who did this wrote a comment. So, “Hey, I turned on this flag to send all this data because I’m trying to troubleshoot something.” And this immediately gives us context into what’s going on, so I know who turned on the flag, why they turned on the flag importantly, and we were able to immediately respond to him and be like, “Hey wait, don’t do that. That’s a lot of data.” And he was able to immediately turn it off.

So, okay, I’m done troubleshooting. Great, so this happened within five seconds, you can kind of see the log up there. He could’ve provided a little more context here, like, “Hey, I collected over 30,000 events, got the data I needed.” But that’s kind of beside the point. What was great about this was it’s immediately giving us some context into what’s happening within our application, which is super useful for collaboration. So, I’m a big climber. There’s gonna be some climbing pictures coming through. But I think it just lends more into working as a team. You have somebody lead climbing and you’re really trusting that person that’s belaying you.

And this comes with visibility. So, why did he turn on that feature flag? Alright, he’s troubleshooting something. So immediately, we know what’s happening, and we also use comments. It’s another way of keeping a good paper trail to understand why we’ve made all the different decisions to get to where we wanna go. And in this example, see all the different routes that could be taken and why we decided to take a certain route to get there?

So, another thing we have is the audit log, and within the audit log … so all of these changes are also captured within our change log channel in Slack, but we also have this audit log to see everything that’s been going on in the application. So, for example this guy said “Flip it.” Not that useful of a comment. It’s like if I’m belaying somebody and they’re saying, “Pull, pull, pull” or something, these aren’t really useful comments, either. We kind of have this understanding like, “Take” means, “Hey, pull up the slack.” And “slack” means “Give me more slack so I can clip my bolt” or something.

So what might be a better comment is “initial launch.” So, this is Alexis, he’s our Front End Team Lead. So he’s turning on this new feature, rolling it out to everybody, and he’s like “Hey, this is our initial launch.” I’m like, “Great, you’re probably the person in charge of this feature, I know that we’re initially launching. If anything goes wrong or I have any comments, I know to talk to you most likely.”

And that’s not to say that he’s not delegating this to other people, because there’s a lot of people that worked on it, but with that responsibility of, “If something happens with that flag, I know who to talk to.” And maybe if he wasn’t the one even working on it, he’ll know who to point me to, in the right direction. And that is extremely useful when you’re working on a big team of people.

So, to get back to the new feature we rolled out, that I worked on, that we’re so excited about, is comments are the context to understand what’s been going on in your application. So, if I’m looking back for an extreme example is that firehose flag, I can understand “Why the heck did we turn on this flag?” “Oh, he was troubleshooting something and immediately afterwards, oh, this is done, resolved. I don’t have to think about this anymore.” I immediately know what was the problem and what was happening, and an easy example is “Hey, I’m rolling out this new feature. We’re so excited about it.” Great.

Comments can kind of be like a climbing ground, where you have an understanding. If you just show up to the climbing wall, like “Where should I go?” You see all these bolts and you can kinda get like a picture of where you’re trying to go, versus if you’re already on a team and you’re with everybody climbing, maybe you have a clear understanding of where to go, but if someone’s coming onto your team, you get a lot more context of what the state of the world is, and maybe where you should go.

So, another example I wanted to bring up, is we do A/B testing, and often times if you have some hypothesis of how you think your test might go and that turns out to be really wrong, you’re like, “Wait, what happened? What changed?” So, I can go back into my audit log and see if somebody played with targeting rules. Like, maybe we were changing our testing purposes because of something, and that can all be documented right there. So, if I think the testing isn’t going how it should have gone, I can immediately see why that happened, who made those changes, so I can talk to them and be like, “Hey, I don’t think we’re supposed to do that.” Or “Maybe that’s what we were supposed to do,” et cetera.
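
That lookup can be sketched as a filter over audit entries. This is a minimal illustration assuming the audit log has been exported as a list of dictionaries; the field names, flag name and entries are invented and this is not LaunchDarkly's audit log API.

```python
from datetime import datetime

# Hypothetical audit entries: who changed which flag, when, and why.
audit_log = [
    {"flag": "checkout-test", "who": "dev-a", "when": datetime(2018, 3, 1, 9, 0),
     "comment": "Initial launch of the A/B test"},
    {"flag": "checkout-test", "who": "dev-b", "when": datetime(2018, 3, 5, 14, 30),
     "comment": "Changed targeting rules to exclude internal users"},
    {"flag": "search-v2", "who": "dev-a", "when": datetime(2018, 3, 6, 11, 0),
     "comment": "Flip it"},
]


def changes_during(log: list, flag: str, start: datetime, end: datetime) -> list:
    """Return the entries for one flag inside the experiment window."""
    return [e for e in log if e["flag"] == flag and start <= e["when"] <= end]


# Did anyone touch the flag while the experiment was running?
suspects = changes_during(
    audit_log, "checkout-test", datetime(2018, 3, 2), datetime(2018, 3, 10)
)
for entry in suspects:
    print(entry["who"], "-", entry["comment"])
```

The comment on each entry is what turns the result from a list of names into an explanation of what changed and why.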

And this also depends on what environment you’re in, so we use LaunchDarkly internally, and when I’m in my dev environment, I’m basically never putting comments in any flag updates in Twig, because I’m probably debugging locally, just turning things on and off, messing around, doesn’t matter. And staging, still not really writing comments. We have a change log for staging, but still not as much, whereas in production, every change we make is pretty clearly documented why we’re making changes. Besides that flippant comment, but we just rolled this out, so… we’re still kind of developing our new best practices on how to use this new functionality.
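
The per-environment habit described above could be turned into a simple guard. The talk describes a team practice rather than an enforced rule, so the enforcement here is an assumption, as are the names `REQUIRE_COMMENT` and `validate_change`.

```python
# Assumption: only production changes must carry a comment.
REQUIRE_COMMENT = {"production"}


def validate_change(environment: str, comment: str) -> None:
    """Reject a flag change that lacks a comment where one is required."""
    if environment in REQUIRE_COMMENT and not comment.strip():
        raise ValueError(f"a comment is required for changes in {environment}")


validate_change("dev", "")                        # fine: local debugging
validate_change("staging", "")                    # fine: not enforced here
validate_change("production", "Initial launch")   # fine: has context

try:
    validate_change("production", "")
except ValueError as error:
    print(error)
```

A check like this only guarantees that a comment exists; whether it says more than “Flip it” is still up to the person writing it.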

And that’s it. I guess we’ll do comments later.

03 Apr 2018

Changing Your Engineering Culture Through People, Process and Technology

Guest post by Isaac Mosquera, Armory CTO and Co-Founder.

My co-founder DROdio likes to say, “Business runs on relationships, and relationships run on feelings.” It’s easy to misjudge the unseen force of human emotions when changing the culture of your organization. Culture is the operating system of the company; it’s how people behave when leadership isn’t in the room. 

The last time I was part of driving a cultural change at a large organization it took over three years to accomplish with many employees lost—both voluntary and involuntary. It’s hard work and not for the faint of heart. Most importantly, cultural change isn’t something that happens by introducing new technology or process; containers and microservices aren’t the solution to all your problems. In fact, introducing those concepts with a team that isn’t ready will just amplify your problems. 

Change always starts with people. When I was at that large organization, we used deployment velocity as our guiding metric. The chart below illustrates our journey. The story of how we got there was riddled with missteps and learning opportunities, but in the end we surpassed our highest expectations by following the People → Process → Technology framework.  


Does your organization really want to change? It’s likely that most people in your organization don’t notice any major issues. If they did, they probably would’ve done something about it. 

In that big change experience, I observed engineering teams that had no automated testing, no code reviews, suffered from outages, infrequent deployments and, worst of all, an ops vs devs mentality. There was complete absence of trust masked by processes that would make the DMV look innovative. In order to deploy anything into production, it took 2-3 approvals from Directors and VPs who were distant from the code.  

This culture resulted in missed deadlines; creating new services in AWS took months due to unnecessary approvals. We saw a revolving door of engineers, losing 1-2 engineers every month on a team of ~20. There was a lack of ownership: when services crashed, the ops team would blame engineers. And we experienced a “Not Invented Here” mentality that resulted in a myriad of home grown tools with Titanic levels of reliability.

In my attempt to address these issues, I made the common mistake of trying to solve for them before I got consensus. And I got what I deserved—I was attacked by upper management. From their perspective, all of this was a waste of their time since nothing was wrong. Also, by extension, asking to change any of these processes and tools, I was attacking their leadership skills.

Giving Your Team A Voice

The better approach was to fix what the team thought was broken. While upper management was completely ignorant to these issues, engineers in the trenches were experiencing the pain of missed deadlines, pressure from product and disappointed customers. For upper management, ignorance was bliss. 

So we asked the team what they thought. By surveying them, we were able to quantify the problems. It was no longer about me vs upper management. It was the team identifying their biggest problem and asking for change—the Ops vs Devs mentality. We were one team, and we should act like it. 

So what are some of the questions you should ask in your survey? Each organization’s surveys should be tailored to their needs. But we recommend open ended questions, since multiple choice questions typically result in “leading the witness”.  Some questions you might want to consider include:

  • What are the biggest bottlenecks in getting the most time?
  • What is stopping you from collaborating with your teammates?
  • Are you clear on your weekly, monthly and yearly objectives?
  • If you were the VP of Engineering what would change?
  • What deployment tools are you using to deploy?

After summarizing your survey feedback, you’ll have a rich set of data points that nobody can argue with because they would just be arguing with the team.

Picking Your One Metric

Now comes the hard part: selecting a single metric that represents what you’re trying to change. The actual metric doesn’t matter; what matters more is that it’s a metric you can rally your team around.  

For us, the “Ops vs Devs” mentality was causing significant bottlenecks, so we chose number of deployments as our guiding metric, or as we called it, “deployment velocity”. When we started measuring this, it was five deployments per month. After a year of focusing on that single metric we increased it to an astounding 2,400 deployments per month.


If culture is a set of shared values by your organization, then your processes are a manifestation of those values. Once we understood that the “Ops vs Devs” culture was the focus, we then wanted to understand why it was there in the first place.  

As it turns out, years earlier there were even more frequent outages, so management decided to put up gates or a series of manual checklists before going to production. This resulted in developers having their access to production systems revoked because they were not trustworthy.

Trusting Your Developers

I don’t understand why engineering leaders insist on saying “I trust my developers to make the right decisions,” while at the same time creating processes that prevent them from doing so and making talented engineers unhappy. To that end, we began by reinstating production access to all developers. But this also came with a new responsibility: on-call rotation. No longer was the operations team responsible for developer mishaps. Everyone was held responsible for their own code. We immediately saw an uptick in both deployments and new services created by making that change in process.

Buy First, Then Build

We also made the decision to buy all the technology we could so development teams could focus on building differentiated software for the company. If anyone wanted to build in-house, the decision and process had to be dramatically cheaper than buying. This process change not only had an impact on our ability to add value to the company but it actually made developers much happier in their jobs.

A True DevOps Culture

What formed was a new team which was truly DevOps. The team was created by developers who truly were interested in building software to help operationalize our infrastructure and software delivery. New processes were created to get the DevOps team involved whenever necessary, but they alone were not responsible for SLA’s or uptime of developer applications. A partnership was born.


Too many engineers like to start a process like this by finding a set of new and interesting technologies, and then forcing them onto their organization. And why not? As engineers we love to tinker. In fact, I made this very mistake at the beginning. I started with technical solutions to problems only I could see, and that got me nowhere. But when I started with people and process first, the technology part became easier.  

Rewriting Our Entire Infrastructure

Though things were getting better, we weren’t out of the woods yet. We lost a core operations engineer who didn’t share the process to deploy the core website! We literally couldn’t deploy our own website. This was a catastrophic failure on the engineering team, but it was also a blessing in disguise. We were able to ask the business for 6 weeks to fix the infrastructure so we would never be in this position again. We stopped all new product development. It was a major hit to the organization, but we knew it had to get done. We ultimately moved everything to Kubernetes and had massive productivity gains and cost reduction.

Replace Homegrown with Off-the-Shelf

In this period of time we also moved everything we could into managed services by AWS or other vendors. The prediction was the bill would be incrementally larger, but in the end we actually saved money on infrastructure. This move allowed the team to focus on building interesting and valuable software instead of supporting databases like Cassandra and Kafka. 

Continuous Deployment and Feature Flagging

We also decided to rewrite our software delivery pipeline and heavily depend on Jenkins 2.0, mostly because a suitable solution like Spinnaker wasn’t available. We still got to throw away much of our old homegrown tooling, since the original developer was no longer there to support it. And while this helped us gain velocity, ultimately we needed to have safer deployments: when our SLAs started decreasing, we were exposing our customers to too much risk. To address that issue, we built our homegrown feature flagging solution. This too was because off the shelf tools like LaunchDarkly didn’t exist at the time (recall our preferred approach to build vs buy). The combination of the tooling and process allowed us to reach deployment velocity that surprised even ourselves.


The chart below speaks for itself. Each new color is a new micro-service. You’ll notice we started with a few large monoliths and then quickly moved to more and more microservices because the friction to create them was so small. In that time, deployment velocity went way up! Most importantly the business benefited too. We built a whole new revenue stream due to our newfound velocity.

Approaching these changes people-first made the rest of this transition easier and more enjoyable; our team was motivated throughout the entire journey. In the end, the people, process and tools were all aligned to achieve a single goal: Deployment Velocity.

Isaac Mosquera is the CTO and Co-Founder of Armory.io. He likes building happy & productive teams.

02 Apr 2018

Targeting Users with Feature Flags

Companies exploring feature management are looking for control over their releases. A common theme is using feature flags to roll out a feature to a small percentage of users, or quickly roll a feature back if it is not working properly. These feature flaggers also seek to control a feature further by limiting its visibility to individual users or a group of users. In this piece, we’ll explore how LaunchDarkly lets you control your releases in these ways.

When launching a new feature, or simply updating an existing feature, you may not always want everyone to receive the same experience. When working with a new front-end marketing design, a back-end improvement to search, or anything in between, targeting specific users can help ensure the best experience for all of your customers. Below are a few common scenarios where targeting can come into play. We’d love to hear your thoughts on other targeting best practices—let us know what you come up with!

Targeting users within LaunchDarkly’s UI.

Internal QA

Perhaps the most common use case for individual user targeting is internal QA. Testing in production is a scary thought, but so is launching to production without knowing how the rollout will go. Feature flags enable safe testing in production. By targeting only your internal QA teams to receive the new feature, you can experience how it will function in a production environment without exposing the untested feature to your customers.

Beta Releases

When releasing a feature or product in beta, you can target users in your beta group to receive this update while not releasing it to anyone else. Essentially, you would set the flag to “true” for all beta users to enable the new feature. Everyone else would have the flag set to “false”. When Kevin from LaunchDarkly asks to be part of your beta group, it’s as easy as adding that user to the true variation to enable the new feature.

Attribute-Based Targeting

Sometimes, you’ll want to target users, but may not want to dive into as many specifics as we just visited. Perhaps you are constrained by regulations, and need to deliver a different experience to customers from different states. Or, maybe your marketing team is launching a campaign specifically to Gmail users. You can place the email domain “gmail.com” in your targeting rules to deliver an experience only to that group.
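
The two kinds of targeting described so far, individual users and attribute rules, can be sketched with a toy evaluator. This is an illustration only, not the LaunchDarkly SDK; the flag structure, the `evaluate` function and the user keys are hypothetical, and checking individual targets before rules is an assumption of this sketch.

```python
def evaluate(flag: dict, user: dict) -> bool:
    """Evaluate a boolean flag for a user: targets first, then rules."""
    # Individually targeted users (for example, a beta group).
    if user["key"] in flag["targets"]:
        return True
    # Attribute rules (for example, an email domain).
    for rule in flag["rules"]:
        value = str(user.get(rule["attribute"], "")).lower()
        if value.endswith(rule["ends_with"]):
            return True
    # Everyone else gets the default variation.
    return flag["default"]


beta_flag = {"targets": {"kevin"}, "rules": [], "default": False}
gmail_flag = {
    "targets": set(),
    "rules": [{"attribute": "email", "ends_with": "@gmail.com"}],
    "default": False,
}

print(evaluate(beta_flag, {"key": "kevin"}))
print(evaluate(gmail_flag, {"key": "u1", "email": "someone@gmail.com"}))
print(evaluate(gmail_flag, {"key": "u2", "email": "someone@example.com"}))
```

Adding a user to the beta is a one-line change to `targets`, while the Gmail campaign needs no per-user list at all.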

Targeting in a Canary Launch

Many companies are seeking to slowly roll out their releases to a subset of users before releasing to everybody. Rolling a new feature out to 10% of users, for example, is a great way to gain validation before a complete launch. You can target your internal QA team to be a part of the 10% initial rollout, or specifically include members from a beta group to see the new feature first. Conversely, you can target specific users or groups to exclude from the launch, and give them the new feature only when it is rolled out to everybody.

A common challenge with canary launches is delivering a consistent experience to users from the same group, especially in B2B applications. LaunchDarkly’s platform provides a bucketing feature that solves this problem. By bucketing users, you are ensuring that all users from the same group receive the same variation for the duration of a canary launch.

Other Considerations

The examples we have covered so far are typically used for short-term flags, either upon initial rollout or for a shorter-lived feature. As your company’s feature flag use continues to mature, wrapping a feature or set of features in a long-term or permanent flag is a logical next step. A common use case here applies to entitlements. For example, if you have Basic, Pro, and Elite versions of your product, user targeting helps manage these tiers effectively across your organization. Many products with different account tiers already have a way of controlling functionality, but with a feature management platform, this can be easily managed by anyone on your team. Customer success team members can quickly dial functionality up or down without the need to go to engineering for support. When a customer upgrades from Pro to Elite, a user simply flips the appropriate flags on.
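
As a rough sketch of entitlements, a long-lived mapping from plan tier to features can stand in for a set of permanent flags. The tier names come from the example above; the feature names and the `has_feature` helper are invented for illustration.

```python
# Hypothetical entitlements: which features each plan tier unlocks.
PLAN_FEATURES = {
    "basic": {"reports"},
    "pro": {"reports", "sso"},
    "elite": {"reports", "sso", "audit-export"},
}


def has_feature(user: dict, feature: str) -> bool:
    """True when the user's plan tier includes the feature."""
    return feature in PLAN_FEATURES.get(user.get("plan", ""), set())


customer = {"key": "acme", "plan": "pro"}
print(has_feature(customer, "audit-export"))

# An upgrade from Pro to Elite only changes the plan attribute;
# the entitlement rules themselves stay untouched.
customer["plan"] = "elite"
print(has_feature(customer, "audit-export"))
```

Keeping the tiers in one place is what lets a non-engineer dial functionality up or down without a code change.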

Managing targeting rules at scale is an important consideration, especially as the number of feature flags you are managing continues to grow. We introduced a new feature we call Segments to help tackle this very topic. Segments enables you to build reusable targeting rules to make this practice quick and scalable. Borrowing from a few of my examples above, you may want to flip a feature on for everyone with an Elite subscription, but also include all beta members as well. Because this is an important feature, you may want to also target internal users to ensure rollouts go smoothly. This same rollout strategy will likely apply to future features and you can simply target your Segment to receive the true or false variation. We’ll be featuring Segments in a future post, so stay tuned for more.
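
The idea of a reusable targeting rule can be sketched as a single predicate shared by several flags. This is a conceptual illustration and not how Segments are implemented; the attribute names (`plan`, `beta`, `internal`), the flag names and the helper functions are assumptions.

```python
def in_launch_segment(user: dict) -> bool:
    """Reusable rule: Elite subscribers, beta members and internal users."""
    return (
        user.get("plan") == "elite"
        or user.get("beta", False)
        or user.get("internal", False)
    )


# Several flags point at the same segment instead of repeating the rule.
FLAG_SEGMENTS = {
    "new-dashboard": in_launch_segment,
    "faster-search": in_launch_segment,
}


def is_enabled(flag_key: str, user: dict) -> bool:
    """Serve the true variation to users inside the flag's segment."""
    segment = FLAG_SEGMENTS.get(flag_key)
    return bool(segment and segment(user))


print(is_enabled("new-dashboard", {"key": "a", "plan": "elite"}))
print(is_enabled("faster-search", {"key": "b", "beta": True}))
print(is_enabled("faster-search", {"key": "c", "plan": "basic"}))
```

Changing the definition of the segment in one place updates every flag that targets it.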

Build reusable targeting rules with Segments

Data Privacy and Targeting

Targeting often involves customer data, and we launched Private Attributes in January to add an additional layer of protection. Private Attributes allows you to shield your customer data from LaunchDarkly while still enabling targeting rules on the shielded data. One thing I’d like to point out is that you never need to upload a database of users into LaunchDarkly to enable targeting. LaunchDarkly simply stores your targeting rules, which are evaluated based on the user objects that you provide. Ben Nadel wrote a wonderful piece describing InVision’s use of targeting and Private Attributes: Using LaunchDarkly To Target Personally Identifiable Information (PII) During Feature Flag Evaluation Without Leaking Sensitive Data.
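
The general idea behind targeting on shielded data can be sketched as: evaluate the rule locally against the full user object, then strip the sensitive attributes from anything that leaves your system. This is a conceptual illustration, not the LaunchDarkly SDK or its Private Attributes implementation; the function names, attribute names and email domain are hypothetical.

```python
def matches_rule(user: dict) -> bool:
    """Evaluate a targeting rule locally, with the full user object."""
    return user.get("email", "").endswith("@example.com")


def event_payload(user: dict, private_attributes: set) -> dict:
    """Build the payload to send onward, omitting private attributes."""
    return {k: v for k, v in user.items() if k not in private_attributes}


user = {"key": "user-123", "email": "pat@example.com", "plan": "pro"}

served = matches_rule(user)                # the rule still sees the email
payload = event_payload(user, {"email"})   # the email never leaves

print(served)
print(payload)
```

The rule result is the same either way; only the data that travels with it changes.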

To wrap things up, there are a number of different ways targeting users can deliver a more valuable experience. We visited just a few above, but the opportunities here are endless. At the end of the day, you’ll target users based on what makes the most sense for your product. We’d love to hear your thoughts and how you think user targeting can help. Until then and for more pointers, you can visit our documentation here.