15 Jan 2018

If You’re Going to Fail, Fail Safely

Harm mitigation is the flip side of risk reduction. We want to keep bad things from happening and bad choices from being made, but we live in a world full of humans, software, and events that require the use of the passive voice. So when something does go wrong, we want to minimize the impact as much as possible.

Seatbelts are the perfect example of harm mitigation. We don't want anyone to get in a car accident, but if they do, we want them to survive without injury. So we created and mandated seatbelts that keep us from flying out of cars, windshields that break into small pellets instead of death spears, and airbags. Our cars have a lot of harm mitigation technology that never gets used until and unless something goes wrong.

Assume Failure

If you don't know how your product is going to fail, then you're missing something. Moreover, because failure modes may have different severity for different people, it's important to anticipate every failure mode you can think of, and also to ask your users what they might experience and how severe those events would be for them. In a SaaS setting, what seems like a harmless bug around graphics might break business-critical reports…or it might be in a feature that the vast majority of your users never access.

For example, the vast majority of social media users may not mind exposing their names and approximate location, but for those who have experienced intimate partner violence or stalking, it could be catastrophic and life-threatening. Be sure you understand the worst cases of failure, not just the common cases, so you can mitigate the most drastic outcomes.

Fail Safe, Fail Secure

When we mitigate harm, we need to think about what it is we're trying to protect. For example, nuclear power plants are designed to fail safe. In the absence of positive control and power, they shut themselves down as quickly as possible. On the other hand, time-lock safes are designed to fail secure. They aren't designed to have people in them; rather, they are designed to protect money, so without a set of conditions including time and codes, they will stay locked.

One of the ways feature flags help everyone fail safely is that a deployment can no longer fail into a half-finished state. If you are always deploying, there is always good code going out, and new features are turned on as they are approved. Older styles of deployment could fail in the middle and leave systems neither stable nor updated. CI/CD helps eliminate the disaster of a failed build by keeping the changes in each build relatively small and by making the activation of features a collaboration between development and business.

The Killswitch

Going back to predicting possible states—when you know the potential places for failure, you can better understand which transition states can be altered so that when failure happens, it hurts less. The easiest way to do this in software is the kill switch.

We used to do this by re-deploying a known good build, but that's painfully slow when you're in the middle of an ongoing crisis. Instead, you can always have a transition state of "hit the kill switch." If a feature is flagged in a way that allows you to kill it immediately, you don't need to redeploy. It's the SaaS equivalent of a cat falling off a bed but still walking away with great dignity.
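To make that concrete, here is a minimal kill-switch sketch in Python. The flag store and function names are hypothetical; in practice the flag value would come from your feature management platform rather than an in-memory dict.

Python
# Minimal kill-switch sketch. FLAGS stands in for a real flag store;
# flipping the value to False turns the feature off without a redeploy.
FLAGS = {"new-checkout-flow": True}

def is_enabled(flag_key):
    # Fail safe: a missing flag evaluates to off.
    return FLAGS.get(flag_key, False)

def render_old_checkout():
    return "old checkout"  # the known-good fallback path

def render_new_checkout():
    return "new checkout"  # the risky new path

def checkout_page():
    if is_enabled("new-checkout-flow"):
        return render_new_checkout()
    return render_old_checkout()

print(checkout_page())  # "new checkout" until someone hits the kill switch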

Balancing Act

In our careers, all of us have to balance competing priorities like speed, risk, ease of use, reliability, and planning for the future. We can look at previous large projects at our organizations to see which kinds of risks were accepted and which were avoided or mitigated. It's not possible to create anything without making mistakes, but at least we won't have to switch calendar digits again.

Our best practices include small, rapid iteration, testing in a variety of environments, understanding the business cases we are solving, and listening to everyone on our team as they talk about risks, harm, and avoidance.

12 Jan 2018

Launched: Comments, Adding Context to Your Actions

A key part of problem identification is knowing what changed in the system. For developers, this is often accomplished with comments in code that remind your "future self" why you made the choices you did. As we have built LaunchDarkly, we have tried to follow our own best practices: naming flags with good descriptions, assigning clear owners to flags that map to specific areas of responsibility, and adding comments in our code as reminders of what a given flag is for.

However, as more teams using LaunchDarkly have adopted feature-flag-driven development practices, the platform has become a crucial part of their change management process. Anyone with change management experience will acknowledge that context matters. Standard audit logging is great for knowing who changed what, and when, but it lacks the why. In small systems or on small teams this often falls into the category of tribal knowledge, but in large systems and large organizations, 30 seconds spent formally documenting a change can save hours or days in the troubleshooting process.

Today, we are happy to roll out the ability to add comments when updating flags. Team members now have the option to add a brief comment explaining why a change is being made.

As we designed this feature, we also thought it would be helpful to include these comments in your established output paths. So when you do decide to add a little extra context, it will appear along with the who, what, and when sent to the audit log, and flow through your webhooks, Slack channels, and HipChat rooms.

If it’s for a developer, build it on top of an API

In keeping with our mantra of API-first development, the new comments functionality is built on top of the REST API. We added the ability to submit these optional comments with PATCH changes. With this launch, the Update feature flag resource supports comments; support in other PATCH resources is forthcoming.

Here’s an example from our API docs for submitting a comment along with a JSON Patch document:

JSON
{
  "comment": "This is a comment string",
  "patch": [ { "op": "replace", "path": "/description", "value": "The new description" } ]
}
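For illustration, here is one way you might send that request from Python with the requests library. The endpoint path, project and flag keys, and token below are placeholders, not values from this post; check the API docs for the exact resource path.

Python
# Sketch: submitting a flag update with a comment via the REST API.
# Replace the token, project key, and flag key with your own values.
import requests

API_TOKEN = "api-xxxxxxxx"  # hypothetical access token
URL = "https://app.launchdarkly.com/api/v2/flags/my-project/my-flag"

body = {
    "comment": "This is a comment string",
    "patch": [
        {"op": "replace", "path": "/description", "value": "The new description"}
    ],
}

resp = requests.patch(
    URL,
    json=body,
    headers={"Authorization": API_TOKEN, "Content-Type": "application/json"},
)
resp.raise_for_status()  # raises if the update was rejected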

Oh, and one more thing

While we're strong proponents of JSON Patch as a protocol for describing partial updates in our REST API, we've found that it can be pretty verbose for simple changes. We've added support for the JSON Merge Patch protocol to address this. Here's an example merge patch document that changes a feature flag's description:

JSON
{
  "description": "New flag description"
}

Compare that to the equivalent JSON Patch document:

JSON
[
  { "op": "replace", "path": "/description", "value": "New flag description" }
]

This addition also means that you can get fancy and combine these two new features and submit a comment along with a merge patch document:

JSON
{
  "comment": "This is a comment string",
  "merge": { "description": "New flag description" }
}

Achievement unlocked. And now you’ll know why.

12 Jan 2018

What Makes a Failure a Disaster?

It was December 31, 1999, and a young sysadmin had just bet her boss that their systems wouldn’t go down. The stakes were her job. She was driving 4 hours to go to a New Year’s Eve party with her friends, and she wasn’t worried about Y2K.

Though our sysadmin had her pager, if something went down in her systems she wasn't going to be able to fix it before people noticed. Her boss told her that if anything happened, she didn't need to come back, ever. She left anyway, because she knew everything would be fine. This wasn't an act of faith; she knew nothing would go down because for the whole of 1999, and even before, she had been preparing.

Everyone knew Y2K was coming. Our sysadmin had patched all her scripts, installed all the software updates her vendors had issued, and traced every piece of hardware and firmware to make sure it had been patched and updated.

As we have moved further from the reality of Y2K, we have forgotten. We forget just how many millions of dollars and person-hours went into reducing our risk. Utilities did fail, but that was in July of 1999, during planned tests; groups even staggered the tests so the whole grid wouldn't go down. The SEC threatened to cut off non-compliant banks if they didn't meet a series of rolling upgrade deadlines. Thousands of programmers were hired to remediate systems that couldn't be automatically upgraded.

Now it’s easy to think Y2K was an overblown story. Nothing really terrible happened, so it must not have been that bad. But it was that bad, and we just managed to fix it in time through an effort more intense and widespread than putting humans on the moon. Often we know that things will fail, but what makes a failure a disaster?

Risk Reduction

In the case of Y2K, all the upgrade focus was on risk reduction. We knew there was something that could cause problems, so we did everything we could to prevent it from happening and to keep a major disaster from occurring.

While it’s impossible to avoid all risk, we recognize that we can at least reduce risks that we’re aware of. But what are the essential elements of reducing risk? And how do we know what we need to prevent?

Secure Your Zone

Rather than becoming overwhelmed at the idea of trying to avoid risk, we need to break our world into zones, decide how much risk we can tolerate in each zone, and then secure it as much as possible.

In my job, I work on adding documentation to our APIs. There is a risk I could break someone’s integration if I add or delete something in the wrong place. To avoid that, we have a process where someone else reviews my changes before they go live.

Pull requests and reviews are so standard that we don’t usually think of them as a risk reduction strategy, but they are. We are adding a little friction to the act of committing code in order to prevent easy mistakes. We also prevent mistakes or malicious activity by limiting who has “commit bits” to critical parts of our code, and by keeping backups of our code so that we can restore a working state quickly.

At LaunchDarkly, we care about reducing risk by making it easy to do the right thing and hard to make unrecoverable errors. Wrapping feature code and product settings in feature flags makes it easy to change the state of your software without adding the confounding effects of a deployment to the process.

Your zone is only as big as the area you have control over. You want to be sure that people who are changing your software are authorized to do so. I can change the APIs and their documentation, but no one has or should give me access to change other parts of our code.

Predict States

Another way to reduce risk is to address issues before they become problems by evaluating your system and identifying where they might occur. The formal tool for predicting the states a system can end up in is the finite state machine (FSM). An FSM is defined by a list of its states, its initial state, and the conditions for each transition from one state to another. Many process flowcharts are, in effect, state machines for their systems.

To use a state machine to assess risk, you enumerate all the possible states your system could be in and the events that would put it there. You can then address the transitions that lead to states you don't want.
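As a sketch of the idea, here is a toy state machine for a deployment pipeline in Python. The states and events are illustrative, not a prescribed model; the point is that any transition you haven't enumerated is a surprise you haven't planned for.

Python
# Toy FSM: a list of states, an initial state, and allowed transitions.
ALLOWED = {
    ("committed", "tests_pass"): "built",
    ("built", "canary_ok"): "released",
    ("built", "canary_fail"): "rolled_back",
    ("released", "incident"): "rolled_back",
}

def step(state, event):
    # Any (state, event) pair not listed is a transition we never
    # planned for, i.e. a risk to design away or explicitly handle.
    try:
        return ALLOWED[(state, event)]
    except KeyError:
        raise ValueError(f"unhandled transition: {state!r} on {event!r}")

state = "committed"
for event in ("tests_pass", "canary_ok"):
    state = step(state, event)
print(state)  # released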

After code is committed, there are still several ways it could fail. For instance, we try to have more than one way to serve it, so that if there is some kind of infrastructure problem, we have a second server in another location already online. Sometimes the risk is that the page or product won't work for everyone, so we deploy it to just a small percentage of our users, also known as a canary launch. That means that if there is a risk we didn't foresee, we only have to repair the experience for a small part of our user base, not the whole thing.
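A canary launch can be as simple as bucketing users deterministically by a hash of their key, so each user consistently gets the same answer while only a slice of them sees the new code. A rough sketch, with illustrative names:

Python
# Sketch of a percentage rollout: hash the user key with a per-flag
# salt so each user lands in a stable bucket, then enable the feature
# for the first N percent of buckets.
import hashlib

def in_canary(user_key, percent, salt="new-recommendations"):
    digest = hashlib.sha1(f"{salt}:{user_key}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform-ish in [0, 1]
    return bucket < percent / 100.0

print(in_canary("user-123", 5))  # True for roughly 5% of user keys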

Being able to test our assumptions on a small scale gives us the confidence that our predicted states are accurate and we can handle them.

Make Low-Stakes Bets

We can't entirely avoid risk and stay safe. One of the ways we can reduce risk is to make small bets: not all-or-nothing, but a little bet that this will go well, or at least generate a minor improvement. Risk is not an all-or-nothing proposition; it's a matter of what we can tolerate, what we can afford to lose. Once we can define that for ourselves, we can be much more adventurous within those limits.

21 Dec 2017

Toggle Talk with HappyCo's VP of Product Phill Claxton

Phill Claxton has implemented feature flags in three different organizations. As a driving force on the business side, he’s seen his company evolve and witnessed firsthand the difference feature management has made in the overall progress of the company. When it comes to the most fundamental benefits of feature flagging, he points to continuous delivery and the speed of learning. Phill is VP of Product at HappyCo, COO at Storekat.com and a cross-functional consultant for early startups.

How long have you been feature flagging?

I've been a fanboy of feature flags for a little while. They're a strange thing to love, but it's the result of implementing them that has made such a huge difference in my life and in the companies I've been involved with.

I think I can trace my first exposure back to a HubSpot article in 2015. We were faced with a challenge at a startup that I founded down here in New Zealand called DeskDirector. We had maybe 150 – 200 customers/accounts at the time, and we were hitting a challenge that’s a telltale sign for those on the other side of learning about feature flags. We had a very small development team, and it was very hard for us to do multiple things. We needed to decide what to build next and measure the impact from it.

We had made some pretty substantial failures, it’d be fair to say. We’d launch features that we thought were going to help the many but when they failed we realized it was the false wisdom of the few.

“So we were trying to work out a way to learn faster.”

At the same time we were thinking about how we could restructure our pricing, and it sort of came together when we read that HubSpot article. We realized we needed to control the release of features we were working on to cohorts of customers. That way we could quickly decide whether a feature had the broader appeal to launch to our greater market.

When do you think feature flagging is most useful?

Once you hit that 100-customer mark, there's some natural segmentation that lends itself to feature flagging to test out cohorts. The customer grouping starts to look about the same as it will at 1,000 customers. I would suggest that's probably when you want to start really thinking about feature flagging.

I’d caveat that by saying we were in the B2B space. You might get there much sooner if you were selling a product to consumers.

“That being said—the last times we’ve done it, it was always too late. We could have started earlier. I suggest you start learning really early without worrying too much about cohorts.”

For me it's always been two things: continuous delivery and the speed of learning. Specifically, allowing a product team or engineering team to launch something and learn at a much faster rate from the customer base. The same thing goes for pricing inside your product.

In the SaaS world, the need to experiment with pricing is a part of your standard business practice. And the earlier you can do that, the better. Sometimes people forget that there’s many different use cases for feature flagging.

Are there any cases where feature flagging is not a good idea?

Our practice was to flag anything that impacted the user's experience. So if the impact on the end-user experience is low and there's a very high level of confidence in the change, then we're comfortable not flagging. If the change is completely hidden from the user's experience, on the server side, then that's not normally something we would need to consider feature flagging.

“I would say more often than not there are situations where it isn’t thought of enough. The default position should be that it is flagged.”

Best use of a feature flag—a personal story?

By this stage we were feeling mature in our development cycle and had flagging in place. This was at a documentation platform that tracked expirations of devices that an IT company was looking after.

We were working on a new notification engine which would trigger expiration alerts to customers. And so we did feature flag this feature, and we’d gone through a lot of initial wireframes with customers. We’d even built out an early stage MVP without production data. All was looking great, and we launched it into production. It was feature flagged and enabled only for those in our Customer Advisory Council.

The moment we did that, every single one of them had their inboxes flooded with anywhere from a few hundred to, in one case, about 15,000 emails. We'd built this notification engine to let them know when things were about to expire, and embarrassingly we didn't think about what would happen when things had already expired. And so every device they were managing, which for many counted into the thousands, would suddenly trigger this alert to say, "This is about to expire"—when it actually already had. And so that was one of those cases where we went, "Hallelujah." We had gone through a lot of exhaustive QA and unit tests, etc., so we knew things weren't going to break, and the upside was our notification engine worked at a massive scale. It still had a lot of work to do when we turned it on.

The reality was that the output for the end user was certainly not the experience we would've wanted. If we had turned it on at the time for our north of 2,000 customers, we would've flooded our support team with a lot of requests. Also, we would have had a lot of very unhappy users.

“So because of feature flags we were able to quickly go back and amend that feature then test again with that Customer Advisory Council before launching as a general release.”

The real benefit obviously came from all the learning we got from the process as well. Not just for problems, but also day-to-day functionality that we thought we'd scoped correctly: when given to customers in production, it's totally different.

“This is when the real learning happens. Showing them some wireframes is totally different from showing them it working with their data.”

What do you think is the number one mistake that’s made around feature flagging?

The #1 mistake for me would be implementing it and then not remembering to ensure that the features being built are actually flagged. It sounds simple, but it's a common trap that happens all too frequently. You end up not flagging enough of the functionality in the product, and it becomes more work later on.

#2 You need to flag more often than you probably were planning to.

“You need to ask yourself why are these features not being flagged rather than the other way around.”

Probably the only other thing is not having a program for removing flags. For all the benefit it offers, flagging does make your code base a lot more verbose, and there's a small amount of arguable tech debt that you're buying with all this. So not having a program for removing the flags is probably the other mistake.

How do you think feature flags play into the DevOps movement? Continuous Delivery?

Early last year I started at a company called IT Glue as the Chief Operating Officer. And I would certainly say that feature flags would be one of the reasons we were able to scale, and scale so fast. We went from 500 accounts and about seven on staff to 2,500 accounts and 63 staff within one year. So you often get asked: how were you able to scale? What's the magic bullet?

“And I’d never say there was a silver bullet, but if I listed off the ten things that drive towards continuous delivery, a dev process supported by feature flags would always be on that list.”

They're really a very critical part of DevOps. Having not worked in very large engineering teams, I can only imagine what the impact would be there, and I would say it plays a fairly substantial role. But it tended to be our DevOps engineers who gained some of the best advantages from actually having these feature flags in place.

Are you seeing feature flagging evolving? If so how? And how do you expect it to change in the future?

I've certainly seen companies like LaunchDarkly launch into the market, which is great. So much of what had to be done in-house can now be managed by third parties. It's definitely evolving, yes. I'm not seeing the evolution happen fast enough, though, and not through any fault of anyone else. Part of the challenge, I think, is that the great majority of companies I speak to still don't really know about feature flagging or think of it as an important part of their growth strategy and engineering strategy.

Know a feature flag enthusiast we should talk to? Email us or let us know on Twitter.

04 Dec 2017

LaunchDarkly, #1 Feature Management Platform, Gets $21M in Series B Funding

As founder and CEO of the leading feature management platform, I’ve seen how our customers use LaunchDarkly to help them innovate more quickly, reduce risk, and break down the barriers between developer, product, marketing, and sales. Ultimately, LaunchDarkly helps our customers around the world help their own customers succeed. We take the feature flagging platform that the biggest tech giants (Facebook, Amazon, Netflix, Google) build in-house, and provide it as a service for everyone else. Thanks to a tremendous response—we serve over 10 Billion (with a B) features EVERY DAY— I’m pleased to announce that we’ve raised $21M in our Series B to accelerate our own growth in engineering, customer success, and category education.

Feature flagging/toggling is a deceptively simple idea—by separating code deployment from release with a "flag" or "toggle", companies can control who gets what feature in their software. LaunchDarkly allows customers to manage their feature flags at scale, giving them the in-house platform that the big tech giants have. Companies start by using LaunchDarkly for an initial "dark launch" by selectively releasing a feature to a group of their customers. This is something tech giants like Facebook and Netflix do constantly, and it allows them to manage what features we see and use in their products with minimal risk to their business. Once a company becomes comfortable with our platform and services, the product team can use LaunchDarkly feature flags for fast feedback, the marketing team can use LaunchDarkly for betas and launches, and sales can use us for contract management.

LaunchDarkly is a unified platform where developer, product, marketing, sales, and customer success teams can manage code in real time. Our three main types of customers are:

Disruptors
These startups want the same feature management superpowers Facebook and Netflix have. We've worked with startups as they've grown from four people to thriving, successful businesses, like Troops.AI. They've used us for every feature, usually starting with risk mitigation, then moving into limited rollouts, and then allowing everyone in the business to control their own features. As one startup CEO told me, "We originally started using you as an 'oh shoot' button; now we use you everywhere." Another VP of Product said, "Using LaunchDarkly for feature development is like night and day."

Transitioners
These customers built their own feature management infrastructure and are tired of maintaining it. When companies like Lanetix and a leading ecommerce car-buying portal made the switch to our platform, suddenly their developers could do all the things they wanted, like role-based authorization and complex rule sets. And what's more, the rest of the company can also use our tools. Now developers can focus on building, and the entire company benefits from access to control and flexibility. When I was in Australia, one company told me, "Now when we build a feature, everyone asks 'did you LD it?'"

Innovators
These modernizing companies know they need to move faster to innovate. They are at the forefront of their industries and know that constantly iterating will help them stay competitive. Last year, I gave a talk at NDC Sydney on how to use feature flags. An engineer from a huge IoT conglomerate immediately asked me for a demo and became a customer. This year, he gave a talk on how he'd moved from annual releases to weekly releases.

It’s extremely gratifying for our entire team at LaunchDarkly to see how much customers rely on us to run their own businesses. Customers small and large are looking to us not just as a developer tool, but as a platform that their entire company can use to deliver functionality to the right person at the right time.

While we were in Sydney, a customer's CEO sent me a personal thank-you note after a salesperson visited them and educated them on how best to use feature flags. If you've been in enterprise developer software, like I have, you know that the usual reaction to a salesperson's visit is not kudos. However, our customers view us as a trusted advisor for expertise in feature management. I am so proud of our team, and I hope our funding will help us continue to be the #1 trusted feature flag management platform, as well as invest in more education for our customers and the broader market. Want to join our team? We're hiring!

We found perfect partners in Scott Raney (Redpoint) and Jonathan Heiliger (Vertex). Scott has been a long time friend of LaunchDarkly, giving me advice and guesting on my podcast, “To Be Continuous”. Jonathan is a cloud infrastructure pioneer who is very familiar with the value LaunchDarkly provides from his time at Facebook. I’m looking forward to working closely with them both through the next chapter of LaunchDarkly.

So what’s next? LaunchDarkly has an incredibly broad base of cross-industry customers, from banking to insurance to shipping to ecommerce to hardware. The appeal of feature management is truly game-changing. Instead of code being a static object that’s changed only once a year or quarter, suddenly, code is a living, evolving power. Developers can build, marketing can launch, product can iterate, and sales can sell. Equipping businesses with the ability to move at the speed of every deploy allows an entire company to learn rapidly, deliver value to their customers faster, and produce more value. With this funding, we hope to support more customers and teach the world that there is a better way to build software—feature flagging.

*Header image credit: NASA astronaut Sunita Williams, Expedition 32 flight engineer.

17 Nov 2017

The Only Constant in Modern Infrastructure is Change

Photo by Federico Beccari on Unsplash

We all know what an audit log is…right? We think of the audit log as a chronological list of changes in a system. Martin Fowler defines it as:

“…the simplest, yet also one of the most effective forms of tracking temporal information. The idea is that any time something significant happens you write some record indicating what happened and when it happened.”¹

While event logs will tell you that a thing happened, the audit log should also tell you what happened, when it happened, and who set that event into motion.
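As a sketch of what such a record might contain (the field names are illustrative, not any particular product's schema):

Python
# Minimal audit record: the what, when, who, and (ideally) why.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditEntry:
    actor: str         # who set the event into motion
    action: str        # what happened
    target: str        # the resource that changed
    comment: str = ""  # why, if the actor supplied one
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

entry = AuditEntry(
    actor="alice@example.com",
    action="deleted targeting exclusion rule",
    target="flags/new-notifications",
    comment="rollout going well, pushing to 100%",
)
print(entry)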

A Practice, Not a Product

I'm sure you can appreciate why having this level of detail is important. By understanding the what, when, and who behind an event in your system, you have a timeline of how things changed, can infer why the change happened, and can decide what your next steps should be.

A great example of this is security and compliance, where it's important to show a record of what transpired and to have accountability. But the audit log is a practice, not a product. You have to think about how you track and record this information so that you can access it later.

One of the things that distinguishes LaunchDarkly from homegrown feature flagging systems is that people who build their own often don't take the time to build an audit log. This is usually because an audit log requires a role-based auth system to track accountability. A standard database doesn't inherently have a way to track accountability; you must add that to the schema. LaunchDarkly, on the other hand, enforces this by requiring all users to authenticate into the system; even API access is authenticated and scoped through the use of personal access tokens.

Let’s Play ‘What-If?’

Your team has a new feature, and you’re releasing it to your end users. You’ve decided to do a percentage rollout because you’re also testing in production—you want to be sure this new feature isn’t going to have a negative impact on any customers. You’ve also explicitly excluded some specific users. You know they’re in a quiet period, and this new feature would not be beneficial to them right now. Maybe you’ll roll this out to them in two weeks when they’re ready.

Just when you think the rollout is going smoothly, you get a call from a customer. They’ve reached out to support wanting to know what’s going on. “There’s a new feature and it’s impacting our performance!” How do you find out what happened?

What if you didn't have an audit log? This process would look very different. It would be a lengthy one in which you (and your team) must look through everything to try to figure out what happened. What flag impacted the change? What was deleted or changed? Do you even have a dashboard to help you review this information? Who made the change—was it on the engineering side or the product team? How many people in support, product, and engineering do you need to talk to to sort out what could be the cause of this error? And keep in mind, you're probably relying on people's memory of what they might have done with hundreds of flags… This is not an easy task!

The More You Know

But if you did have an audit log in place, you could quickly go back into your records and identify where a change was made, what the change was, and who made it. In this case, someone thought the rollout was doing so well, they decided to push it to 100%. However, they didn’t realize there were customers excluded from the rollout because they shouldn’t be experiencing this kind of change just yet. You could see that the exclusion rule was deleted, and you now know what you can do to fix this and move forward.

So now I'm sure you can appreciate the value of an audit log over a simple event log, and what your team could be doing instead of hunting down elusive unknowns and what-ifs. With this valuable knowledge in hand, you can quickly identify why an event happened and decide how to proceed with more confidence.

¹Martin Fowler (2004), The Audit Log