15 Jan 2018

If You’re Going to Fail, Fail Safely

Harm mitigation is the flip side of risk reduction. Though we want to prevent bad things from ever happening and bad choices from ever being made, we live in a world full of humans, software, and events that require the use of the passive voice. So when something does go wrong, we want to minimize the impact as much as possible.

Seat belts are the perfect example of harm mitigation. We don’t want anyone to get into a car accident, but if they do, we want them to survive without injury. So we created and mandated seat belts that keep us from flying out of cars, windshields that break into small pellets instead of death spears, and airbags that cushion the impact. Our cars have a lot of harm mitigation technology that never gets used unless and until something goes wrong.

Assume Failure

If you don’t know how your product is going to fail, then you’re missing something. Moreover, because failure modes may have different severity for different people, it’s important to assume that every failure mode you can think of will eventually happen, and to ask your users what they might experience and how severe those events would be for them. In a SaaS setting, what seems like a harmless graphics bug might break business-critical reports…or it might be in a feature that the vast majority of your users never access.

For example, the vast majority of social media users may not mind exposing their names and approximate locations, but for someone who has experienced intimate partner violence or stalking, that exposure could be catastrophic and life-threatening. Be sure you understand the worst cases of failure, not just the common cases, so you can mitigate the most drastic outcomes.

Fail Safe, Fail Secure

When we mitigate harm, we need to think about what it is we’re trying to protect. For example, nuclear power plants are designed to fail safe: in the absence of positive control and power, they shut themselves down as quickly as possible. Time-lock safes, on the other hand, are designed to fail secure. They aren’t meant to have people inside them; they’re meant to protect money, so unless a set of conditions, including time and codes, is met, they stay locked.

One of the ways feature flags help everyone fail safely is that there is no longer a state in which a deployment fails and leaves you stuck in a bad state. If you are always deploying, there is always good code going out, and new features are turned on only as they are approved. Older styles of deployment could fail in the middle and leave systems neither stable nor updated. CI/CD helps eliminate the disaster of a failed build by keeping each build’s changes relatively small and pushing the activation of features toward a collaboration between development and business.
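Here’s a minimal sketch of what that looks like in practice (the flag name, the flag_store dictionary, and the helpers are invented for the example): the new code path ships with every deploy, but users only see it once the flag is turned on, and if the flag data is missing or unreadable, the code falls back to a safe default, which is the old path.

    # Minimal sketch of deploying dark behind a flag (names are hypothetical).
    # The new code path ships with the deploy, but nothing changes for users
    # until the flag is turned on; if the flag store is unreachable, we fall
    # back to the old, known-good path.

    FLAG_DEFAULTS = {"new-checkout-flow": False}  # safe default: feature off

    def flag_enabled(flag_key: str, flag_store: dict) -> bool:
        """Return the flag state, falling back to a safe default on any error."""
        try:
            return bool(flag_store[flag_key])
        except (KeyError, TypeError):
            return FLAG_DEFAULTS.get(flag_key, False)

    def render_checkout(flag_store: dict) -> str:
        if flag_enabled("new-checkout-flow", flag_store):
            return "new checkout page"   # code is deployed, activated later
        return "existing checkout page"  # stable path stays the default

    print(render_checkout({}))                           # existing checkout page
    print(render_checkout({"new-checkout-flow": True}))  # new checkout page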

The Kill Switch

Going back to predicting possible states: when you know the potential places for failure, you can better understand which transition states can be altered so that when failure happens, it hurts less. The easiest path for this in software is the kill switch.

We used to handle this by redeploying a known good build, but that’s painfully slow when you’re in the middle of an ongoing crisis. Instead, you can always have a transition state of “hit the kill switch.” If a feature is flagged in a way that allows you to kill it immediately, you don’t need to redeploy. It’s the SaaS equivalent of a cat falling off a bed but still walking away with great dignity.
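As a rough sketch (the class and feature names here are hypothetical, not LaunchDarkly’s SDK), the risky code path checks the switch on every request, so turning the feature off takes effect immediately, with no redeploy:

    # Hypothetical kill switch sketch: the risky feature is checked on every
    # request, so flipping one value (normally from a flag dashboard, here a
    # simple setter) disables it immediately without a redeploy.

    class KillSwitch:
        def __init__(self) -> None:
            self._killed: set[str] = set()

        def kill(self, feature: str) -> None:
            """Flip the switch; takes effect on the next evaluation."""
            self._killed.add(feature)

        def is_live(self, feature: str) -> bool:
            return feature not in self._killed

    switch = KillSwitch()

    def handle_request() -> str:
        if switch.is_live("image-resizer"):
            return "resized image"   # the risky new feature
        return "original image"      # graceful fallback

    print(handle_request())          # resized image
    switch.kill("image-resizer")     # something went wrong in production
    print(handle_request())          # original image, no redeploy needed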

Balancing Act

In our careers, all of us have to balance competing priorities like speed, risk, ease of use, reliability, and planning for the future. We can look at previous large projects at our organizations to see what kinds of risks were accepted and which were avoided or mitigated. It’s not possible to create anything without making mistakes, but at least we won’t have to switch calendar digits again.

Our best practices include small, rapid iteration, testing in a variety of environments, understanding the business cases we are solving, and listening to everyone on our team as they talk about risks, harm, and avoidance.

12 Jan 2018

What Makes a Failure a Disaster?

It was December 31, 1999, and a young sysadmin had just bet her boss that their systems wouldn’t go down. The stakes were her job. She was driving 4 hours to go to a New Year’s Eve party with her friends, and she wasn’t worried about Y2K.

Though our sysadmin had her pager, if something went down in her systems she wasn’t going to be able to fix it before people noticed. Her boss told her that if anything happened, she didn’t need to come back, ever. She left anyway, because she knew everything would be fine. This wasn’t an act of faith; she knew nothing would go down, because for the whole of 1999, and even before, she had been preparing the system.

Everyone knew Y2K was coming. Our sysadmin had patched all her scripts, installed all the software updates the vendors had issued, and traced back every piece of hardware and firmware to make sure it had been patched and updated.

As we have moved further from the reality of Y2K, we have forgotten. We forget just how many millions of dollars and person-hours went into reducing our risk. Utilities did fail, but that was in July of 1999, because groups were doing planned tests. They even staggered the tests so the whole grid wouldn’t go down. The SEC threatened to cut off non-compliant banks if they didn’t meet a series of rolling upgrade deadlines. Thousands of programmers were hired to remediate systems that couldn’t be automatically upgraded.

Now it’s easy to think Y2K was an overblown story. Nothing really terrible happened, so it must not have been that bad. But it was that bad, and we just managed to fix it in time through an effort more intense and widespread than putting humans on the moon. Often we know that things will fail, but what makes a failure a disaster?

Risk Reduction

In the case of Y2K, all the upgrade focus was about risk reduction. We knew there was something that could cause problems, so we did everything we could to keep it from happening and to prevent a major disaster from occurring.

While it’s impossible to avoid all risk, we recognize that we can at least reduce risks that we’re aware of. But what are the essential elements of reducing risk? And how do we know what we need to prevent?

Secure Your Zone

Rather than becoming overwhelmed at the idea of trying to avoid risk, we need to break our world into zones, decide how much risk we can tolerate in each zone, and then secure it as much as possible.

In my job, I work on adding documentation to our APIs. There is a risk I could break someone’s integration if I add or delete something in the wrong place. To avoid that, we have a process where someone else reviews my changes before they go live.

Pull requests and reviews are so standard that we don’t usually think of them as a risk reduction strategy, but they are. We are adding a little friction to the act of committing code in order to prevent easy mistakes. We also prevent mistakes or malicious activity by limiting who has “commit bits” to critical parts of our code, and by keeping backups of our code so that we can restore a working state quickly.

At LaunchDarkly, we care about reducing risk by making it easy to do the right thing and hard to make unrecoverable errors. Wrapping feature code and product settings in feature flags makes it easy to change the state of your software without adding the confounding effects of a deployment to the process.

Your zone is only as big as the area you have control over. You want to be sure that people who are changing your software are authorized to do so. I can change the APIs and their documentation, but no one has given me, or should give me, access to change other parts of our code.

Predict States

Another way to reduce risk is to address issues before they become problems by evaluating your system and identifying where they might occur. The formal tool for predicting the states something could end up in is the finite state machine (FSM). An FSM is defined by a list of its states, its initial state, and the conditions for each transition from one state to another. Many process flowcharts are really state machines for systems.

To use a state machine to assess risk, you enumerate all the possible states your system could be in, and what would make that happen. You can then address the transitions that lead to states you don’t want.
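As a toy illustration (the states and events below are invented, not a real pipeline), enumerating transitions in a table makes it easy to see which events you accept from each state and to refuse everything else:

    # A toy finite state machine for a release. Enumerating the transitions
    # makes it obvious which events are allowed from each state, and anything
    # not listed is refused instead of landing in an undefined state.

    TRANSITIONS = {
        ("committed", "tests_pass"):  "built",
        ("committed", "tests_fail"):  "rejected",
        ("built", "deploy_ok"):       "deployed_dark",   # shipped, flag still off
        ("built", "deploy_fail"):     "rolled_back",
        ("deployed_dark", "flag_on"): "live",
        ("live", "kill_switch"):      "deployed_dark",   # harm mitigation path
    }

    def next_state(state: str, event: str) -> str:
        # Unknown transitions leave the state unchanged rather than guessing.
        return TRANSITIONS.get((state, event), state)

    state = "committed"
    for event in ["tests_pass", "deploy_ok", "flag_on", "kill_switch"]:
        state = next_state(state, event)
        print(event, "->", state)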

After code is committed, there are still several ways it could fail. For instance, we try to have more than one way to serve it, so that if there is some kind of infrastructure problem, we have a second server in another location already online. Sometimes the risk is that the page or product won’t work for everyone, so we deploy it to just a small percentage of our users, also known as a canary launch. That means that if there is a risk we didn’t foresee, we only have to repair the experience for a small part of our user base, not the whole thing.

Being able to test our assumptions on a small scale gives us the confidence that our predicted states are accurate and we can handle them.
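Here is a rough sketch of how that small-percentage rollout can work (the flag key and bucketing scheme are illustrative, not LaunchDarkly’s exact algorithm): hash each user into a bucket and only show the new code to users below the rollout percentage, so a given user always gets a consistent experience.

    # Rough canary rollout sketch: each user hashes into a bucket from 0-99,
    # and only users below the rollout percentage see the new code. Hashing
    # is deterministic, so the same user always lands in the same bucket.
    import hashlib

    def in_canary(user_id: str, flag_key: str, percent: int) -> bool:
        digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < percent

    users = [f"user-{n}" for n in range(1000)]
    canary = [u for u in users if in_canary(u, "new-report-engine", 5)]
    print(f"{len(canary)} of {len(users)} users get the canary")  # roughly 5%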

Make Low-Stakes Bets

We can’t entirely avoid risk and stay safe. One way we can reduce risk is to make small bets: not all-or-nothing wagers, but a little bet that this will go well, or at least generate a minor improvement. Risk is not an all-or-nothing proposition; it’s a matter of what we can tolerate, what we can afford to lose. Once we can define that for ourselves, we can be much more adventurous within those limits.

09 Nov 2017

Measure Twice, Launch Once

You want all your developers to have access to the main trunk of code to deploy — that’s the point of trunk-based development. It’s important they can put their code out as often as they want and iterate on their projects. However, you don’t always want developers turning on features that will have customer impact without some way to reverse course.

Secured activation is an under-appreciated part of feature management. Your developers can deploy code whenever they want, but when it comes time to test it externally, or to turn it on for everyone, you can use settings to make sure that only a select group of people has the permission to do so. All activation changes should be tracked and audited so that every activation has an accountability chain.

At LaunchDarkly, we have found that it’s good to be permissive about who can use and create feature flags, and restrictive about who can activate them. If you are starting to use feature flags more broadly, think about how to implement a repeatable process. You might also want to leverage LaunchDarkly’s tags to help with organization, and custom roles to assist with delegation and access.
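As a generic sketch of that split between creating flags and activating them (this is illustrative logic, not LaunchDarkly’s custom-roles API; the role names are invented), activation is gated by role, and every attempt, allowed or not, lands in an audit trail:

    # Illustrative permission gate for flag activation with an audit trail.
    from datetime import datetime, timezone

    ACTIVATORS = {"release-manager", "product-owner"}   # hypothetical roles
    audit_log: list[dict] = []

    def activate_flag(flag_key: str, user: str, role: str, flags: dict) -> bool:
        allowed = role in ACTIVATORS
        audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "flag": flag_key,
            "user": user,
            "role": role,
            "allowed": allowed,
        })
        if allowed:
            flags[flag_key] = True
        return allowed

    flags = {"new-billing-page": False}
    print(activate_flag("new-billing-page", "sam", "developer", flags))        # False
    print(activate_flag("new-billing-page", "ana", "release-manager", flags))  # True
    print(len(audit_log), "entries in the accountability chain")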

You want the following qualities in the people who have permission to change your user experience:

  • Understands the business reason for making the change
  • Has the technical knowledge or advisors to know when the code is ready to go live
  • Has a process in place for making the change and then testing it

You don’t want to have only one person who can do this, because they’ll inevitably become a bottleneck. Make sure your process can keep releasing even if a key team member is unavailable.

In the beginning you may put a process around every change and then look for ways to optimize that process. Over time, though, you should determine what level of change merits process and what can be executed more easily. In some cases this might even allow small changes to be approved or executed by individual engineers. Usually, features that have anything to do with money, user data collection, or changes in the user process should have a formal approval process. Changes to backend operations can be quieter, and therefore need less formal process and can lean more heavily on automated testing and peer review.

Think about your current deployment process. What happens if someone releases something too early? How do you protect against that? How will you port that control over to the access control that LaunchDarkly offers? What is the failure case if something doesn’t launch properly?

Feature flags are easy to implement in code, but managing them well across an organization takes some planning and forethought.

30 Oct 2017

Deploying Rapidly for Continuous Integration

Velocity is a great conference for ideas about where software and companies are going in the future, and I got a chance to talk with Mike Hendrickson about LaunchDarkly and feature management in that context.

We think that feature management is a best practice for anyone who is trying to do continuous integration and deployment. Feature management gives you a way to test features in production quietly before making them visible to everyone. At the end of this interview we talked about what I think will happen in the future—I’m excited about the idea of giving users the exact page or experience they want and need.

If you’re interested in learning a bit more about feature management or LaunchDarkly, take a few minutes to watch this video.

19 Sep 2017

OpenAPI and Transparent Process

At LaunchDarkly, we’ve put a bunch of time into making our console fast and usable, and we’re pretty proud of it.

However, we’re aware there are lots of reasons people would want to use an API to create and manage feature flags. Since we are using our API to drive the dashboard, it’s easy for us to keep everything in sync if we make changes to the API. And we wanted to make it easy for our customers to do the same.

We started out with ReadMe, which is an excellent industry standard. That let us get our documents published and dynamic. For further refinements, we went to Swagger/OpenAPI. We liked it because:

  • It’s a well-known and widely-used format
  • It allows us to generate usable code snippets and examples in many languages automatically
  • It’s easy to add context and documentation to as we go
  • We can host it on readme.io or other places, depending on our traffic needs

We created our REST specification using OpenAPI, and you can find our documents here: LaunchDarkly OpenAPI.
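One nice side effect of a machine-readable spec is that simple tooling becomes a few lines of code. For example, this sketch (the filename is a placeholder for a local copy of the published spec) lists every operation in an OpenAPI document by walking its standard "paths" section:

    # List every operation in an OpenAPI document.
    import json

    with open("launchdarkly-openapi.json") as handle:  # placeholder: local copy of the spec
        spec = json.load(handle)

    # OpenAPI documents keep their operations under "paths", keyed by HTTP method.
    for path, operations in sorted(spec.get("paths", {}).items()):
        for method, details in operations.items():
            if method in {"get", "put", "post", "patch", "delete"}:
                print(f"{method.upper():6} {path}  {details.get('summary', '')}")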

We’re still working on adding examples, descriptions, and context, but we think that the documentation is stronger and more usable already.

As always, if you have any comments, you can contact us here, or make comments directly in the repository.

28 Aug 2017

All the Pretty Ponies

August is always full of security awareness in the wake of DefCon, Black Hat USA, and their associated security conference satellites. Las Vegas fills with people excited about digital and physical lockpicking, and breakout talks feature nightmare-inducing security vulnerabilities and trivially simple voting machine hacking. The ultimate in backhanded awards are given out to companies and organizations who made the world less secure, usually because of an overlooked flaw.

The Pwnie Awards (pronounced “pony”) are the security industry’s way of recognizing and mocking organizations that have failed to protect their data and their users. There are categories for:

  • Server-side Bug
  • Client-side Bug
  • Privilege Escalation Bug
  • Cryptographic Attack
  • Best Backdoor
  • Best Branding
  • Most Epic Achievement
  • Most Innovative Research
  • Lamest Vendor Response
  • Most Over-hyped Bug
  • Most Epic Fail
  • Most Epic 0wnage
  • Lifetime Achievement Award

And last but not least:

  • Best Song

This year’s Best Song is a parody of Adele’s “Hello”: “Hello From The Other Side” comes complete with a demonstration of the exploit and a lyrical summation of what the attackers are doing.

Across the industry, security vulnerabilities are given a tracking number (a Common Vulnerabilities and Exposures identifier, or CVE) and described in a semi-standard system. This helps everyone understand which category a vulnerability falls into, so they can read and understand vulnerabilities that may be outside their area of expertise. This also helps us talk about vulnerabilities without resorting to sensational names like “Heartbleed”.

Since it is August and the 2017 awards have been announced, I got to wondering whether any of the Pwnie-winning vulnerabilities could have been prevented by feature flags. There are several that could possibly have been mitigated by the ability to turn a feature off, but I chose this one:

Pwnie for Epic 0wnage
0wnage, measured in owws, can be delivered in mass quantities to a single organization or distributed across the wider Internet population. The Epic 0wnage award goes to the hackers responsible for delivering the most damaging, widely publicized, or hilarious 0wnage. This award can also be awarded to the researcher responsible for disclosing the vulnerability or exploit that resulted in delivering the most owws across the Internet.

Credit: North Korea(?)

Shutting down German train systems and infrastructure was Child’s play for WannaCry. Take a legacy bug that has patches available, a leaked (“NSA”) 0day that exploits said bug, and let it loose by a country whose offensive cyber units are tasked with bringing in their own revenue to support themselves and yes, we all do wanna cry.

An Internet worm that makes the worms of the late 1990s and early 2000s blush has it all: ransomware, nation state actors, un-patched MS Windows systems, and copy-cat follow-on worms. Are you not entertained?!?!?

WannaCry was especially interesting because it got “sinkholed” by a security researcher called MalwareTech. (I know, he’s under indictment. Security researchers are interesting people.) He noticed that the ransomware was calling out to a domain that wasn’t actually registered, so he registered the domain himself. That gave lots of people time to patch their systems instead of getting infected.

As an industry, we tend to think of server uptime as a good thing. But is it? Not for any single server, because if it’s been up for a thousand days, it also hasn’t been rebooted for patches in almost three years. You may remember when Amazon S3 went down in early 2017? Part of what made that outage so painful was restarting systems that no one had rebooted in ages.

So how can feature flags help keep production servers current?

  • It’s safer to make a change in production if you know you can revert it instantly using a feature flag kill switch.
  • Small, iterative development means that you can ship changes more quickly with less risk—the more often you reboot a server, the less important that particular server’s uptime is.
  • Many infrastructures today are built with the concept of ‘infrastructure as code’ through the use of automation tooling. These automation configurations can also benefit from the use of feature flags to roll out system or component versions alongside your application updates (a rough sketch follows this list).
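To make that last idea concrete, here is an invented sketch (the flag structure, host names, and version numbers are all made up) of an automation run consulting a rollout flag to pick a component version per host, so rolling back is a flag change rather than another full deployment:

    # Invented example: automation consults a rollout flag to decide which
    # component version each host should run.

    ROLLOUT_FLAGS = {
        "proxy-version": {
            "canary_hosts": {"web-03"},  # hosts that get the new version first
            "canary": "2.1.0",
            "stable": "2.0.4",
        },
    }

    def desired_version(host: str, flag: dict) -> str:
        return flag["canary"] if host in flag["canary_hosts"] else flag["stable"]

    for host in ["web-01", "web-02", "web-03"]:
        print(host, "->", desired_version(host, ROLLOUT_FLAGS["proxy-version"]))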

Second we have:

Pwnie for Best Cryptographic Attack (new for 2016!)
Awarded to the researchers who discovered the most impactful cryptographic attack against real-world systems, protocols, or algorithms. This isn’t some academic conference where we care about theoretical minutiae in obscure algorithms, this category requires actual pwnage.

The first collision for full SHA-1
Credit: Marc Stevens, Elie Bursztein, Pierre Karpman, Ange Albertini, Yarik Markov

The SHAttered attack team generated the first known collision for full SHA-1. The team produced two different PDF documents that produced the same SHA-1 hash. The techniques used to do this led to a 100,000x speed increase over the brute force attack that relies on the birthday paradox, making this attack practical for a reasonably (Valasek-rich?) well funded adversary. A practical collision like this moves folks still relying on a deprecated protocol to action.

SHA-1 was a cryptographic standard for years, and you can still select it as the hashing algorithm in a lot of software. This exploit makes it clear that we need to stop letting anyone use it, for their own good.

How a feature flag might have made this better:

  • Turn off the ability to select SHA-1 as a hashing option
  • Force current SHA-1 users to choose a new, stronger option

Some feature flags are intended to be short-term and will be removed from the code once the feature is fully incorporated. Others exist longer-term, as a way to segment off possibly dangerous sections of code that may need to be removed in a hurry. Feature flags could be combined with detection, as I suggested for the SHA-1 problem, and used to drive updates and vulnerability analysis.
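As a hypothetical sketch of how that might look (the flag key and algorithm list are invented for the example), a long-lived flag controls which hash algorithms are selectable, so deprecating SHA-1 and moving its users to a stronger default is a flag change rather than an emergency release:

    # Hypothetical long-lived flag gating which hash algorithms users may select.

    ALGORITHM_FLAGS = {"allow-sha1": False}   # flipped off after SHAttered
    SUPPORTED = ["sha256", "sha512", "sha1"]

    def selectable_algorithms(flags: dict) -> list[str]:
        return [a for a in SUPPORTED if a != "sha1" or flags.get("allow-sha1", False)]

    def choose_algorithm(requested: str, flags: dict) -> str:
        allowed = selectable_algorithms(flags)
        # Force existing SHA-1 users onto a stronger default instead of failing.
        return requested if requested in allowed else "sha256"

    print(selectable_algorithms(ALGORITHM_FLAGS))     # ['sha256', 'sha512']
    print(choose_algorithm("sha1", ALGORITHM_FLAGS))  # 'sha256'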

It’s easy for us to think of deployment as something a long way from security exploits that involve physical access to computers and networks, but security exploits are “moving left” at the same time as the rest of our technology. Fewer of us manage physical servers, and fewer security vulnerabilities involve inserting unknown USB sticks into our laptops. Instead, we’re moving to virtual machines and containers, and so are our vulnerabilities. Constructing code, cookbooks, and scripts that allow us to change the path of execution after deployment gives us more options for staying far, far away from the bright light of the Pwnie Awards.