09 Nov 2017

Measure Twice, Launch Once

You want all your developers to have access to the main trunk of code to deploy — that’s the point of trunk-based development. It’s important they can put their code out as often as they want and iterate on their projects. However, you don’t always want developers turning on features that will have customer impact without some way to reverse course.

Secured activation is an under-appreciated part of feature management. Your developers can deploy code whenever they want, but when it comes time to test a feature externally or turn it on for everyone, you can use settings to ensure that only a select group of people has permission to do so. Every activation change should be tracked and audited so there is a clear accountability chain.

At LaunchDarkly, we have found that it’s good to be permissive about who can use and create feature flags, and restrictive about who can activate them. If you are just starting to use feature flags more broadly, think about how to implement a repeatable process. You might also want to use LaunchDarkly’s ‘Tags’ feature to help with organization, and custom roles to help with delegation and access.
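
To make the ‘restrictive about who can activate’ part concrete, here is a minimal sketch of a custom role policy that lets a team create and edit flags everywhere but denies the on/off toggle in production. It is written as a Python structure purely for readability; the statement shape follows LaunchDarkly’s custom roles documentation, but the specific action and resource names here are assumptions, so verify them against the current docs.

```python
# Hypothetical custom-role policy: broad flag permissions everywhere,
# but deny turning flags on or off in production.
# Action and resource names are illustrative; check the custom roles
# documentation for the exact values.
release_guardrail_policy = [
    {
        "effect": "allow",
        "actions": ["*"],                               # create, edit, tag flags...
        "resources": ["proj/*:env/*:flag/*"],           # ...in any environment
    },
    {
        "effect": "deny",
        "actions": ["updateOn"],                        # the targeting on/off toggle
        "resources": ["proj/*:env/production:flag/*"],  # ...except in production
    },
]
```

The deny statement is the important part: the small group that actually activates features for customers holds a separate role without it.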

You want people with permission to change your user experience to:

  • Understand the business reason for making the change
  • Have the technical knowledge, or advisors, to know when the code is ready to go live
  • Have a process in place for making the change and then testing it

You don’t want to have only one person who can do this, because they’ll inevitably become a bottleneck. Make sure your process can keep releasing even if a key team member is unavailable.

In the beginning you may put a process around every change and then look for ways to optimize that process. Over time, though, you should determine what level of change merits a formal process and what can be executed more easily. In some cases this might even allow individual engineers to approve or execute small changes. Features that touch money, user data collection, or the user’s workflow usually deserve a formal approval process. Changes to backend operations are quieter, need less formal process, and can lean more heavily on automated testing and peer review.

Think about your current deployment process. What happens if someone releases something too early? How do you protect against that? How will you port that control over to the access control that LaunchDarkly offers? What is the failure case if something doesn’t launch properly?

Feature flags are easy to implement in code, but managing them well across an organization takes some planning and forethought.

30 Oct 2017

Deploying Rapidly for Continuous Integration

Velocity is a great conference for ideas about where software and companies are headed, and I got a chance to talk with Mike Hendrickson about LaunchDarkly and feature management in that context.

We think that feature management is a best practice for anyone who is trying to do continuous integration and deployment. Feature management gives you a way to test features in production quietly before making them visible to everyone. At the end of this interview we talked about what I think will happen in the future—I’m excited about the idea of giving users the exact page or experience they want and need.

If you’re interested in learning a bit more about feature management or LaunchDarkly, take a few minutes to watch this video.

19 Sep 2017

OpenAPI and Transparent Process

At LaunchDarkly, we’ve put a bunch of time into making our console fast and usable, and we’re pretty proud of it.

However, we’re aware there are lots of reasons people would want to use an API to create and manage feature flags. Since we are using our API to drive the dashboard, it’s easy for us to keep everything in sync if we make changes to the API. And we wanted to make it easy for our customers to do the same.

We started out with ReadMe, which is an excellent industry standard. That let us publish our docs and keep them dynamic. For further refinement, we moved to Swagger/OpenAPI. We liked it because:

  • It’s a well-known and widely-used format
  • It allows us to generate usable code snippets and examples in many languages automatically
  • It’s easy to add context and documentation to it as we go
  • We can host it on readme.io or other places, depending on our traffic needs

We created our REST specification using OpenAPI, and you can find our documents here: LaunchDarkly OpenAPI.
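
If you would rather script against the API directly than use a generated client, a plain HTTP call works as well. Below is a minimal sketch that lists the flags in a project with Python and the requests library; the endpoint path, header, and response shape are based on the documented v2 REST API, but confirm them against the published spec before relying on them.

```python
# Minimal sketch: list the feature flags in a project via the REST API.
# Endpoint, header, and response shape should be verified against the
# published OpenAPI spec.
import requests

API_TOKEN = "api-xxxxxxxx"    # a LaunchDarkly API access token, not an SDK key
PROJECT_KEY = "default"

resp = requests.get(
    f"https://app.launchdarkly.com/api/v2/flags/{PROJECT_KEY}",
    headers={"Authorization": API_TOKEN},
)
resp.raise_for_status()

for flag in resp.json().get("items", []):
    print(flag["key"], flag.get("name", ""))
```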

We’re still working on adding examples, descriptions, and context, but we think that the documentation is stronger and more usable already.

As always, if you have any comments, you can contact us here, or make comments directly in the repository.

28 Aug 2017

All the Pretty Ponies

August is always full of security awareness in the wake of DefCon, BlackHat USA, and their associated security conference satellites. Las Vegas fills with people excited about digital and physical lockpicking, and breakout talks feature nightmare-inducing security vulnerabilities and trivially simple voting-machine hacks. The ultimate backhanded awards go to companies and organizations that made the world less secure, usually because of an overlooked flaw.

The Pwnie Awards (pronounced “pony”) are the security industry’s way of recognizing and mocking organizations that have failed to protect their data and their users. There are categories for:

  • Server-side Bug
  • Client-side Bug
  • Privilege Escalation Bug
  • Cryptographic Attack
  • Best Backdoor
  • Best Branding
  • Most Epic Achievement
  • Most Innovative Research
  • Lamest Vendor Response
  • Most Over-hyped Bug
  • Most Epic Fail
  • Most Epic 0wnage
  • Lifetime Achievement Award

And last but not least:

  • Best Song

This year’s best song is a parody of Adele’s ‘Hello’. ‘Hello From The Other Side’ is complete with a demonstration of the exploit and a lyrical summation of what they’re doing.

Across the industry, security vulnerabilities are given a tracking number (Common Vulnerabilities and Exposures, or CVE) and described in a semi-standard format. This helps everyone understand which category a vulnerability falls into, so they can read and understand vulnerabilities that may be outside their area of expertise. It also lets us talk about vulnerabilities without resorting to sensational names like “Heartbleed”.

Since it is August and the 2017 awards have been announced, I got to wondering whether any of the Pwnie-winning vulnerabilities could have been prevented by feature flags. There are several that could possibly have been mitigated by the ability to turn a feature off, but let’s start with this one:

Pwnie for Epic 0wnage
0wnage, measured in owws, can be delivered in mass quantities to a single organization or distributed across the wider Internet population. The Epic 0wnage award goes to the hackers responsible for delivering the most damaging, widely publicized, or hilarious 0wnage. This award can also be awarded to the researcher responsible for disclosing the vulnerability or exploit that resulted in delivering the most owws across the Internet.

WannaCry
Credit: North Korea(?)

Shutting down German train systems and infrastructure was Child’s play for WannaCry. Take a legacy bug that has patches available, a leaked (“NSA”) 0day that exploits said bug, and let it loose by a country whose offensive cyber units are tasked with bringing in their own revenue to support themselves and yes, we all do wanna cry.

An Internet worm that makes the worms of the late 1990s and early 2000s blush has it all: ransomware, nation state actors, un-patched MS Windows systems, and copy-cat follow-on worms. Are you not entertained?!?!?

WannaCry was especially interesting because it got “sinkholed” by a security researcher called MalwareTech. (I know, he’s under indictment. Security researchers are interesting people.) He noticed that the ransomware was calling out to a domain that wasn’t actually registered, so he registered the domain himself. That gave lots of people time to patch their systems instead of getting infected.

As an industry, we tend to think of server uptime as a good thing. But is it? Not for any single server: if it has been up for a thousand days, it also hasn’t been rebooted for patches in almost three years. You may remember when Amazon S3 went down in early 2017? Part of why that outage lasted so long was that key subsystems hadn’t been fully restarted in years, and the restart took much longer than expected.

So how can feature flags help keep production servers current?

  • It’s safer to make a change in production if you know you can revert it instantly using a feature flag kill switch (see the sketch after this list).
  • Small, iterative development means that you can ship changes more quickly with less risk—the more often you reboot a server, the less important that particular server’s uptime is.
  • Many infrastructures today are built as ‘infrastructure as code’ through automation tooling. These automation configurations can also benefit from feature flags, to roll out system or component versions alongside your application updates.
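
To make the first bullet concrete, here is a minimal kill-switch sketch using the LaunchDarkly Python SDK’s variation call: the risky new code path stays wrapped in a boolean flag with a safe default, so it can be shut off instantly without a deploy or a reboot. The flag key and the serve_from_* functions are hypothetical, and SDK initialization details vary by version, so treat the setup lines as illustrative.

```python
# Minimal kill-switch sketch. Initialization details vary by SDK version.
import ldclient
from ldclient.config import Config

ldclient.set_config(Config(sdk_key="sdk-xxxxxxxx"))
client = ldclient.get()

def serve_from_old_cache(user_key):
    # Known-good path (stub for illustration).
    return f"old-cache:{user_key}"

def serve_from_new_cache(user_key):
    # The risky new path we want to be able to shut off (stub for illustration).
    return f"new-cache:{user_key}"

def handle_request(user_key):
    user = {"key": user_key}
    # Default to False: if flag data is unavailable, the known-good path runs.
    if client.variation("use-new-cache-layer", user, False):
        return serve_from_new_cache(user_key)
    return serve_from_old_cache(user_key)
```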

Second we have:

Pwnie for Best Cryptographic Attack (new for 2016!)
Awarded to the researchers who discovered the most impactful cryptographic attack against real-world systems, protocols, or algorithms. This isn’t some academic conference where we care about theoretical minutiae in obscure algorithms, this category requires actual pwnage.

The first collision for full SHA-1
Credit: Marc Stevens, Elie Bursztein, Pierre Karpman, Ange Albertini, Yarik Markov

The SHAttered attack team generated the first known collision for full SHA-1. The team produced two different PDF documents that produced the same SHA-1 hash. The techniques used to do this led to a 100k speed increase over the brute force attack that relies on the birthday paradox, making this attack practical for a reasonably (Valasek-rich?) well funded adversary. A practical collision like this moves folks still relying on a deprecated protocol to action.

SHA-1 was a cryptographic standard for years, and you can still select it as a hashing or signing option in a lot of software. This exploit makes it clear that we need to stop letting anyone use it, for their own good.

How a feature flag might have made this better:

  • Turn off the ability to select SHA-1 as a hashing option
  • Force current SHA-1 users to choose a stronger algorithm

Some feature flags are intended to be short-term and will be removed from the code once the feature is fully incorporated. Others exist longer-term, as a way to segment off possibly dangerous sections of code that may need to be removed in a hurry. Feature flags could be combined with detection, like I suggested for the SHA-1 problem, and used to drive updates and vulnerability analysis.
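
As a rough sketch of what that longer-term flag could look like in application code, the snippet below removes SHA-1 from the list of selectable algorithms and pushes existing SHA-1 users toward something stronger. The flag_enabled helper and the flag key are hypothetical placeholders for whatever flag evaluation call your system provides.

```python
# Illustrative only: a long-lived flag that retires SHA-1.
# flag_enabled() is a placeholder for your flag system's evaluation call.
def flag_enabled(flag_key, default=False):
    return default  # wire this up to LaunchDarkly or your own flag store

SUPPORTED_ALGORITHMS = ["sha256", "sha512", "sha1"]

def selectable_algorithms():
    algos = list(SUPPORTED_ALGORITHMS)
    if flag_enabled("retire-sha1"):
        algos.remove("sha1")  # stop offering SHA-1 to anyone
    return algos

def validate_choice(algorithm):
    if algorithm == "sha1" and flag_enabled("retire-sha1"):
        # Existing SHA-1 users are forced to pick a stronger option.
        raise ValueError("SHA-1 is no longer supported; choose sha256 or sha512")
    return algorithm
```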

It’s easy for us to think of deployment as something a long way from security exploits that involve physical access to computers and networks, but security exploits are “moving left” at the same time as all the rest of our technology. Fewer of us manage physical servers, and fewer security vulnerabilities relate to inserting unknown USB sticks into our laptops. Instead, we’re moving to virtual machines and containers, and so are our vulnerabilities. Constructing code, cookbooks, and scripts that allow us to change the path of execution after deployment gives us more options for staying far, far away from the bright light of the Pwnie Awards.

18 Aug 2017

Risk Reduction and Harm Mitigation

Risk Reduction is trying to make sure bad things happen as rarely as possible. It is anti-lock brakes, vaccinations, clothing irons that turn off by themselves, and all sorts of things that we think of as safety modifications in our life. We are trying to build lives where bad things happen less often.

Harm Mitigation is what we do to make sure that when bad things do happen, they are less catastrophic. Fire sprinklers in buildings, seat belts, and needle exchanges are all about making the consequences of something bad less terrible.

What does that mean for a DevOps world where our risks and harms are very different? Charity Majors says we shouldn’t split the world into developers and operations, but into product and infrastructure—and I think that’s a useful way to think about risk too.

Product risk is the problems that users experience. We can usually predict and mitigate the danger by testing and by being aware of common failings. For example, we can expect and plan for users who are on a flaky connection, or who try to exit a page without saving their work. We can work around these problems because we know they’re out there. But our deployments are not something users should see until we’re ready for prime time. For this kind of risk, we use feature flags to control what content is delivered.

Infrastructure risk is more about the inherent fragility of delivering software. CDNs, fiber, servers, switches, towers…the whole system of getting data to people has failure zones. When we are trying to reduce infrastructure risk, we assume that latency is ever-present, networks are intermittent, and we won’t always be able to count on everything working right the first time. We try to build in robust and flexible ways that can route around failures. This is the place we might use feature flags to control failovers or to create circuit breakers to prevent flooding a fragile sector.
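
One concrete version of that last idea: a multivariate flag can act as a manually operated circuit breaker, letting an on-call engineer route traffic to a failover or shed load entirely while a dependency recovers. The sketch below is generic; flag_variation and the helper functions are hypothetical stand-ins for your flag system and your real services.

```python
# Illustrative circuit-breaker sketch: a multivariate flag chooses between
# the primary dependency, a degraded failover, and shedding load entirely.
# flag_variation() is a placeholder for your flag system's evaluation call.
def flag_variation(flag_key, default):
    return default  # wire this up to LaunchDarkly or your own flag store

failover_cache = {}  # stand-in for a local cache or read replica

def primary_service_call(user_key):
    return ["item-1", "item-2"]  # stand-in for the real downstream dependency

def fetch_recommendations(user_key):
    mode = flag_variation("recommendations-backend", default="primary")
    if mode == "off":
        return []  # shed load: an empty but valid response beats a pile-up
    if mode == "failover":
        return failover_cache.get(user_key, [])  # degraded but cheap
    return primary_service_call(user_key)  # normal path
```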

Product harm reduction is about making sure that users can have a positive experience, even if something happens that makes it less than ideal. We want to preserve their blog drafts, keep them from committing errors, only show them things they are allowed to change, and above all, avoid giving them a blank page. Something has gone wrong in the trip to the user, but they shouldn’t have to suffer for it.
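
A small, concrete version of this is to pair every flag with a safe default and a known-good fallback path, so that missing flag data or a failure in the new code degrades to the old experience rather than a blank page. The function names below are hypothetical.

```python
# Illustrative harm-mitigation sketch: the user never sees a blank page.
# The render_* functions are hypothetical view functions.
def render_classic_editor(draft):
    return f"<textarea>{draft}</textarea>"  # known-good experience

def render_new_editor(draft):
    return f"<div class='editor'>{draft}</div>"  # the new, riskier experience

def editor_page(draft, new_editor_enabled=False):
    # new_editor_enabled comes from a flag evaluation with False as the default,
    # so missing flag data degrades to the classic editor, not an error.
    if not new_editor_enabled:
        return render_classic_editor(draft)
    try:
        return render_new_editor(draft)
    except Exception:
        # If the new path fails, fall back instead of showing nothing.
        return render_classic_editor(draft)
```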

Infrastructure harm reduction is the ability to pull back breaking fixes, shunt users away from vulnerabilities, and respond near-immediately to things that have gone very wrong. Harm reduction at this level is the kind of action a pager-responder can take to get things back on track before doing more intensive repairs and investigations in the morning.

In my first week, I spent a lot of time thinking about how to summarize our product for a variety of audiences. “Feature flags as a service” is short and pithy, but only works if you can bring your own definition of “feature flag” and a business understanding of why you would use them. What about “Feature flags allow you to make changes in near real-time instead of waiting for a deployment, and LaunchDarkly helps you manage and track flags across an organization”?

Well, that works, but it still doesn’t get to the core of why an organization would want to use feature flags. What I’ve come up with so far is this:

Feature flags segment the risk of creating a product into manageable parts.

Creating and deploying software is risky. We can accidentally build in errors, we can deliver it badly, or to the wrong people, and it can interact in unfortunate ways with existing software or hardware. As organizations, we want to do our best to do no harm and provide benefit. Using feature flags lets us wrap our features in decision points that we can then use to make life easier for our users.

Here are some types of risks that are reduced by using feature flags:

  • Server falls over from too much traffic
  • Canary launch is not well-tracked, problems are missed
  • Old features and workarounds are invisible and get left in place
  • Feature with vulnerable content is deployed
  • API endpoints are exposed to unauthorized users

Managing your feature flags is a post for another day—today I ask that you take a few minutes and think about how you can reduce risk and minimize harm in your organization, your project, or your code. How can you make things robust enough to resist failure, instrumented sufficiently to identify a failure spot, and flexible enough to reduce harmful consequences on the fly?