17 Jan 2018

Launched: SOC 2 Type II Certified

This week we are happy to announce that we have achieved SOC 2 Type II certification. You may recall that last year we worked with an independent auditor to receive our SOC 2 Type I report. There are actually two kinds of SOC 2 reports—Type I and Type II.

What’s the Difference?

A Type I report examines whether your systems are suitably designed, providing an overview of the systems a SaaS platform has in place to satisfy the Trust Principles it’s being audited on. This type of report looks at your company and your procedures to determine whether they are designed to do the job you say they do. For instance, does your service correctly control access the way you say it does? However, a Type I report doesn’t take into consideration how well these controls actually work in practice. For that, you need a Type II report.

A Type II report looks at the actual outcomes of the system and whether it is operating as designed. The report audits how well your software, team, and procedures worked with actual data within the evaluated timeframe. Were there any significant service outages? Were there any security incidents? Was the data handled effectively? Did your service manage features effectively for the whole period the auditors evaluated your system (six months, in our case)?

Type II Is All About CONSISTENCY

The Type II reports are what your customers really want. When a customer trusts your service to run a portion of their business, they want to see proof that your system operates effectively over the long term. And enterprise organizations like working with products that conform to the highest standards of the AICPA, and they want evidence that your practices back up your promises.

With a SOC 2 Type II report, you can provide customers evidence that you accomplish what you claim. You can show that external auditors have agreed that the control systems you have in place work as you say they do, and that they were observed doing so over an extended period. This is not just about closing a deal; it is a key part of establishing trust that our customers can feel confident passing along to their own customers.

15 Jan 2018

If You’re Going to Fail, Fail Safely

Harm mitigation is the flip side of risk reduction. Though we want to keep bad things from ever happening and bad choices from ever being made, we live in a world full of humans, software, and events that require the use of the passive voice. So when something does go wrong, we want to minimize the impact as much as possible.

Seat belts are the perfect example of harm mitigation. We don’t want anyone to get in a car accident, but if they do, we want them to survive without injury. So we created and mandated seat belts that keep us from flying out of cars, windshields that break into small pellets instead of death spears, and airbags. Our cars have a lot of harm mitigation technology that never gets used unless and until something goes wrong.

Assume Failure

If you don’t know how your product is going to fail, then you’re missing something. Moreover, because failure modes can have different severity for different people, it’s important to anticipate all the failure modes you can think of, and also to ask your users what they might experience and how severe those events would be for them. In a SaaS setting, what seems like a harmless bug around graphics might break business-critical reports, or it might sit in a feature that the vast majority of your users never access.

For example, the vast majority of social media users may not mind exposing their names and approximate location, but for those who have experienced intimate partner violence or stalking, it could be catastrophic and life-threatening. Be sure you understand the worst cases of failure, not just the common cases, so you can mitigate the most drastic outcomes.

Fail Safe, Fail Secure

When we mitigate harm, we need to think about what it is we’re trying to protect. For example, nuclear power plants are designed to fail safe: in the absence of positive control and power, they shut themselves down as quickly as possible. Time-lock safes, on the other hand, are designed to fail secure. They aren’t designed to have people in them; rather, they are designed to protect money, so without a set of conditions including time and codes they will stay locked.

One of the ways that feature flags help everyone fail safely is that a deployment can no longer fail and leave the system stuck in a bad state. If you are always deploying, there is always good code going out, and as it is approved, the new features are turned on. Older styles of deployment could fail in the middle and leave systems in a state neither stable nor updated. CI/CD helps eliminate the disaster of a failed build by keeping the changes in each build relatively small and by making the activation of features a collaboration between development and business.

The Killswitch

Going back to predicting possible states: when you know the potential places for failure, you can better understand which transition states can be altered so that when failure happens it hurts less. The easiest path for this in software is the kill switch.

We used to do this by re-deploying a known good build, but that’s painfully slow when you’re in the middle of an ongoing crisis. Instead, you can always have a transition state of “hit the kill switch.” If you have a feature flagged in a way that allows you to kill it immediately, you don’t need to redeploy. It’s the SaaS equivalent of a cat falling off a bed but still walking away with great dignity.
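
Here’s what that looks like in practice: a minimal TypeScript sketch of a kill switch, assuming a LaunchDarkly server-side SDK. The package name, the "holiday-snow" flag key, and the page markup are hypothetical stand-ins. The important part is the safe default: if flag evaluation fails for any reason, the feature stays off.

import * as LaunchDarkly from 'launchdarkly-node-server-sdk';

const client = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY ?? '');

// The risky feature is wrapped in a flag. The default value (false) is the
// safe state: if the flag or the service is unreachable, the feature is off.
async function renderPage(user: { key: string }): Promise<string> {
  await client.waitForInitialization();
  const snowEnabled = await client.variation('holiday-snow', user, false);
  return snowEnabled
    ? '<body class="snow">...</body>' // killable in milliseconds, no redeploy
    : '<body>...</body>';             // the known-good fallback
}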

Balancing Act

In our careers, all of us have to balance competing priorities like speed, risk, ease of use, reliability, and planning for the future. We can look at previous large projects at our organizations to see which kinds of risks were accepted and which were avoided or mitigated. It’s not possible to create anything without making mistakes, but at least we won’t have to switch calendar digits again.

Our best practices include small, rapid iteration, testing in a variety of environments, understanding the business cases we are solving, and listening to everyone on our team as they talk about risks, harm, and avoidance.

12 Jan 2018

Launched: Comments, Adding Context to Your Actions

A key part of problem identification is knowing what changed in the system. For developers, this is often accomplished with comments in code, which remind your ‘future self’ why you made the choices that you did. As we have built LaunchDarkly, we have tried to follow our own best practices around naming flags with good descriptions, assigning clear owners to flags that map to specific areas of responsibility, and adding comments in our code as reminders of what a given flag is for.

However, as more teams using LaunchDarkly have implemented feature flag driven development practices, the platform has become a crucial part of their change management process. Anyone with change management experience acknowledges that context matters. Standard audit logging is great for knowing who changed what, and when, but it lacks the why. In small systems or on small teams this often falls in the category of tribal knowledge, but in large systems and large organizations, 30 seconds spent formally documenting a change can save hours or days in the troubleshooting process.

Today, we are happy to roll out the ability to add comments on updates to flags. Now team members have the option to add a brief comment explaining why a change is being made.

As we designed this feature, we also thought it would be helpful to surface these comments in your established output paths. So, when you do decide to add a little extra context, folks will see it appear along with the who, what, and when: sent to the audit log, and on through your webhooks, Slack channels, and HipChat rooms.

If it’s for a developer, build it on top of an API

In keeping with our mantra around API-first development, the new comments functionality is built on top of the REST API. We added the ability to submit these optional comments with PATCH changes. With this launch, the Update feature flag resource supports comments, and support in other PATCH resources is forthcoming.

Here’s an example from our API docs for submitting a comment along with a JSON Patch document:

{
  "comment": "This is a comment string",
  "patch": [ { "op": "replace", "path": "/description", "value": "The new description" } ]
}
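
If it helps to see that request in context, here is a hedged TypeScript sketch that submits the document above; the v2 flags endpoint path and the "my-project" and "my-flag" keys are illustrative assumptions, so check the API docs for the exact resource URL and authentication details.

// Sketch: PATCH a feature flag and attach a comment to the change.
// The endpoint path and keys below are placeholders, not confirmed values.
async function updateFlagDescription(apiToken: string): Promise<void> {
  const res = await fetch('https://app.launchdarkly.com/api/v2/flags/my-project/my-flag', {
    method: 'PATCH',
    headers: { Authorization: apiToken, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      comment: 'This is a comment string',
      patch: [{ op: 'replace', path: '/description', value: 'The new description' }],
    }),
  });
  if (!res.ok) throw new Error(`PATCH failed with status ${res.status}`);
}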

Oh, and one more thing

While we’re strong proponents of JSON Patch as a protocol for describing partial updates in our REST API, we’ve found that it can be pretty verbose for simple changes. To address this, we’ve added support for the JSON Merge Patch protocol. Here’s an example merge patch document that changes a feature flag’s description:

{
  "description": "New flag description"
}

Compare that to the equivalent JSON Patch document:

[
  { "op": "replace", "path": "/description", "value": "New flag description" }
]

This addition also means that you can get fancy and combine these two new features and submit a comment along with a merge patch document:

{
  "comment": "This is a comment string",
  "merge": { "description": "New flag description" }
}

Achievement unlocked. And now you’ll know why.

12 Jan 2018

What Makes a Failure a Disaster?

It was December 31, 1999, and a young sysadmin had just bet her boss that their systems wouldn’t go down. The stakes were her job. She was driving 4 hours to go to a New Year’s Eve party with her friends, and she wasn’t worried about Y2K.

Though our sysadmin had her pager, if something went down in her systems she wasn’t going to be able to fix them before people noticed. Her boss told her that if anything happened, she didn’t need to come back, ever. She left anyway, because she knew everything would be fine. This wasn’t an act of faith; it was that she knew nothing would go down, because for the whole of 1999, and even before, she had been preparing the system.

Everyone knew Y2K was coming. Our sysadmin had patched all her scripts, installed all the software updates the vendors had issued, and traced back every piece of hardware and firmware to make sure it had been patched and updated.

As we have moved further from the reality of Y2K, we have forgotten. We forget just how many millions of dollars and person-hours went into reducing our risk. Utilities did fail, but that was in July of 1999, because groups were doing planned tests. They even staggered the tests so the whole grid wouldn’t go down. The SEC threatened to cut off non-compliant banks if they didn’t meet a series of rolling upgrade deadlines. Thousands of programmers were hired to remediate systems that couldn’t be automatically upgraded.

Now it’s easy to think Y2K was an overblown story. Nothing really terrible happened, so it must not have been that bad. But it was that bad, and we just managed to fix it in time through an effort more intense and widespread than putting humans on the moon. Often we know that things will fail, but what makes a failure a disaster?

Risk Reduction

In the case of Y2K, all the upgrade focus was about risk reduction. We knew there was something that could cause problems, so we did everything we could to prevent it from happening, and to keep a failure from becoming a major disaster.

While it’s impossible to avoid all risk, we recognize that we can at least reduce risks that we’re aware of. But what are the essential elements of reducing risk? And how do we know what we need to prevent?

Secure Your Zone

Rather than becoming overwhelmed at the idea of trying to avoid risk, we need to break our world into zones, decide how much risk we can tolerate in each zone, and then secure it as much as possible.

In my job, I work on adding documentation to our APIs. There is a risk I could break someone’s integration if I add or delete something in the wrong place. To avoid that, we have a process where someone else reviews my changes before they go live.

Pull requests and reviews are so standard that we don’t usually think of them as a risk reduction strategy, but they are. We are adding a little friction to the act of committing code in order to prevent easy mistakes. We also prevent mistakes or malicious activity by limiting who has “commit bits” to critical parts of our code, and by keeping backups of our code so that we can restore a working state quickly.

At LaunchDarkly, we care about reducing risk by making it easy to do the right thing and hard to make unrecoverable errors. Wrapping feature code and product settings in feature flags makes it easy to change the state of your software without adding the confounding effects of a deployment to the process.

Your zone is only as big as the area you have control over. You want to be sure that people who are changing your software are authorized to do so. I can change the APIs and their documentation, but no one has or should give me access to change other parts of our code.

Predict States

Another way to reduce risk is to address issues before they become problems by evaluating your system and identifying where they might occur. The formal tool for predicting the states something could end up in is the finite state machine (FSM). An FSM is defined by a list of its states, its initial state, and the conditions for each transition from one state to another. Many process flowcharts are effectively state machines for systems.

To use a state machine to assess risk, you enumerate all the possible states your system could be in, and what would make that happen. You can then address the transitions that lead to states you don’t want.
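
To make that concrete, here is a toy deployment-pipeline state machine in TypeScript. The states and allowed transitions are invented for illustration, but enumerating them this way makes the transitions you want to forbid, and the ones you forgot entirely, easy to spot.

type State = 'committed' | 'building' | 'canary' | 'live' | 'rolled-back';

// Each state lists the states it may legally transition to.
const transitions: Record<State, State[]> = {
  'committed': ['building'],
  'building': ['canary', 'rolled-back'], // a failed build rolls back cleanly
  'canary': ['live', 'rolled-back'],     // a bad canary rolls back cleanly
  'live': ['rolled-back'],               // the kill-switch path
  'rolled-back': ['committed'],
};

function canTransition(from: State, to: State): boolean {
  return transitions[from].includes(to);
}

console.log(canTransition('building', 'live')); // false: no skipping the canary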

After code is committed, there are still several ways it could fail. For instance, we try to have more than one way to serve it, so that if there is some kind of infrastructure problem, we have a second server in another location already online. Sometimes the risk is that the page or product won’t work for everyone, so we deploy it to just a small percentage of our users, also known as a canary launch. That means that if there is a risk we didn’t foresee, we only have to repair the experience for a small part of our user base, not the whole thing.
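
Mechanically, a canary can be as simple as deterministic bucketing on the user key. The sketch below shows the general idea (an assumption for illustration, not LaunchDarkly’s actual rollout algorithm): hashing the key means each user consistently lands in or out of the canary across requests.

import { createHash } from 'crypto';

// Deterministically assign a user to the canary group: same key, same answer.
function inCanary(userKey: string, percentage: number): boolean {
  const hash = createHash('sha1').update(userKey).digest();
  const bucket = hash.readUInt32BE(0) % 100; // map the hash to a 0-99 bucket
  return bucket < percentage;
}

inCanary('35935', 5); // true for roughly 5% of user keys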

Being able to test our assumptions on a small scale gives us the confidence that our predicted states are accurate and we can handle them.

Make Low-Stakes Bets

We can’t entirely avoid risk and stay safe. One of the ways we can reduce risk is to make small bets—not all-or-nothing—but a little bet that this will go well, or at least generate a minor improvement. Risk is not an all-or-nothing proposition, it’s a matter of what we can tolerate, what we can afford to lose. Once we can define that for ourselves, we can be much more adventurous within those limits.

10 Jan 2018

Launched: Private User Attributes

Today we are launching a new feature called Private User Attributes. This feature allows our customers to collect event data from feature flag evaluation while controlling which user attribute fields are stored or exposed in other parts of the LaunchDarkly user interface.

In our feature management platform, we use the stream of flag impression events generated by our SDKs to power many useful features, including attribute autocomplete, our users page, the flag status dashboard, and A/B testing. Some of our customers have requirements to restrict Personally Identifiable Information (PII) from being sent to LaunchDarkly. Until now, these customers’ only choice had been to completely disable any event data that might contain PII fields (emails, user names, etc.).

The introduction of Private User Attributes means all customers now have the ability to selectively choose not to send back some or all user attributes. With this selective approach, customers can continue to view flag statuses, use A/B testing, and view and override attributes for targeted users while protecting sensitive data. We have also made several small changes to the LaunchDarkly web UI to show when attributes are not available and to offer reasonable choices when data has been marked as private and is unavailable for display.

What is a Private User Attribute?

When a user attribute is declared private, it can continue to be used for targeting, but its value will not be sent back to LaunchDarkly. For example, consider the following targeting rule:

if e-mail endsWith "@launchdarkly.com" serve true

To implement this in your application, you’d make this call to LaunchDarkly:

ldclient.variation("alternate.page", {
  key: "35935",
  email: "bob@launchdarkly.com"
}, false)

When this code is executed, a flag impression event would be sent back to LaunchDarkly containing the following data:

  {
    "key": "alternate.page",
    "kind": "feature",
    "user": {
      "email": "bob@launchdarkly.com",
      "key": "35935"
    },
    "value": true
  }

In the example above, LaunchDarkly has received the email address of a user. Suppose now that you want to target by email address but not send that personally identifiable address back to LaunchDarkly.

This is what private attributes are for—you can mark the email field ‘private’. If your application marks the email field private, you can still target by email address but LaunchDarkly will receive the following data:

  {
    "key": "alternate.page",
    "kind": "feature",
    "user": {
      "key": "35935",
      "privateAttrs": ["email"]
    },
    "value": true
  }

This data object no longer contains the targeted user’s email address. Instead, there is a record that an email attribute was provided during feature flag evaluation but that its value has been excluded from the event data.

Ok, let’s hide some stuff.

There are three different ways to configure private attributes in our SDKs:

  • You can mark all attributes private globally in your LDClient configuration object.
  • You can mark specific attributes private by name globally in your LDClient configuration object.
  • You can also mark specific attributes private by name for individual users when you construct LDUser objects.

It should be noted that the user key attribute cannot be marked private; for this reason, we encourage the best practice of not using any PII as the user key. For more in-depth docs, you can look at the SDK Reference Guides.
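
As a rough sketch, those three options look something like this in a server-side SDK. The package and option names here follow the Node SDK of this era but should be treated as illustrative; confirm them against the SDK Reference Guide for your language.

import * as LaunchDarkly from 'launchdarkly-node-server-sdk';

// 1. Mark ALL attributes private, globally.
const clientAllPrivate = LaunchDarkly.init('your-sdk-key', {
  allAttributesPrivate: true,
});

// 2. Mark specific attributes private by name, globally.
const clientEmailPrivate = LaunchDarkly.init('your-sdk-key', {
  privateAttributeNames: ['email'],
});

// 3. Mark specific attributes private for an individual user.
const user = {
  key: '35935', // the key itself can never be private; keep PII out of it
  email: 'bob@launchdarkly.com',
  privateAttributeNames: ['email'],
};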

How do private attributes work for mobile and browser SDKs?

You might recall that our mobile and browser SDKs work slightly differently—flags are evaluated on LaunchDarkly’s servers, not in the SDKs. We still support private user attributes for these SDKs; when evaluation happens, we do not record any data about the user being evaluated. For analytics events, these SDKs strip out private user attributes the same way our server-side SDKs do, so your data is still kept private.

How will marking attributes private affect my experience?

At LaunchDarkly, usability is critically important to us (after all, we use LaunchDarkly to build LaunchDarkly). In thinking through the impact of reducing the amount of data that is visible, we wanted to still provide an intuitive UX for existing features. Because we record the name, but not the value, of a private attribute, name autocomplete will continue to function for private attributes in the UI, but autocomplete will not be offered for attribute values. We also indicate which attributes on a user are marked private on the users page, and when we are unable to predict a user’s flag settings because a targeting rule uses a private attribute. Below is an example of the users page for a user with attributes marked as private:

Why shouldn’t I just make ALL my attributes private?

Making all your attributes private would limit some of the product features that rely on having attribute values available. For example:

  • The UI will not offer to autocomplete values. You’ll be able to see the names of private attributes, but not their values. So you’ll know that (e.g.) “email” is available for targeting, but you won’t know (in the UI) what email addresses exist.
  • Flag settings (on the users page) won’t be exact. We’ll be able to know if the user has a specific setting enabled, but we won’t be able to predict what variation a user is receiving, because we can’t evaluate targeting rules when private attributes are in play.

Given these limitations, our recommendation is that you selectively use Private User Attributes only for the purpose of securing PII.

We’d like to thank all the customers who provided feedback on this feature. We hope you enjoy not sharing your data.

26 Dec 2017

Talking Nerdy with Technically Speaking

Our own developer advocate sat down with the Technically Speaking podcast at the Agile Midwest conference in St. Louis, MO. They discussed what a developer advocate is, the LaunchDarkly feature management platform, and how teams use feature flags.

“Decoupling that deployment from the activation of code gives people so much more security and reduces the risk of deployment. I think a lot of reasons people resist continuous deployment is the fear that they could break something and not be able to fix it in a hurry. So with this we’re saying, ‘Deploy all you want, activate when you’re ready.’” – Heidi Waterhouse

Check it out to hear how companies like Microsoft and GoPro are using feature flags to dark launch, do internal testing, and use feature flags as kill switches.

TRANSCRIPT

Zach Lanz: Live from St. Louis, Missouri, it’s the Technically Speaking podcast, brought to you by Daugherty Business Solutions. Get ready ’cause it’s time to talk nerdy on the Technically Speaking podcast.

Welcome back in, Technically Speaking. We are talking nerdy today. We are talking all day. We’re live at the Agile Midwest Conferences here in beautiful, beautiful downtown St. Charles, Missouri at the St. Charles Convention Center. I mean, I can overlook the river right out the window. It’s picturesque. Today we have quite a collection of some of the most forefront on the agile front lines and thought leadership bringing you some content and their perspectives. This episode is no different because today we have Heidi Waterhouse. Heidi, welcome to the podcast.

Heidi Waterhouse: Thank you. I’m excited to be here.

ZL: Listen to your adoring fans.

HW: My adoring fans.

ZL: We might have to shut the door so they leave us alone. So Heidi, you are in town from Minneapolis, coming all the way from Minneapolis, correct?

HW: Yes.

ZL: Just for the conference?

HW: Just for this.

ZL: You had your talk earlier today and I understand that you had a couple technical snafus.

HW: Oh well, I think it’s not really a presentation if something doesn’t go a little weird, but I did get to talk about the things that I cared about so that worked out.

ZL: You are a developer advocate at LaunchDarkly. From what I understand, it’s a company that specializes in feature flags. Tell us a little bit about what a developer advocate is, what you do there, and a little bit about LaunchDarkly.

HW: A developer advocate has three different roles that we serve. The first is I go out to conferences like this. This is my third this month. I have two more to go and I have five next month. I talk to developers about ways that I could improve their workflow and make their lives easier. I also spend a lot of time listening to developers and figuring out what it is that they need so I can take that back to my team and say, “Hey, there’s this unmet need the developers have that we need to look into.” The third is recruiting. I meet a lot of developers and I can say to them, “Hey, we have this opening for a full stack developer in Oakland.” They’re sometimes interested in it. Midwesterners seldom take us up on, “You could move to Oakland.”

When I say LaunchDarkly and feature flags, what I’m talking about is LaunchDarkly is a company that provides feature flags as a service. You wrap your code in a little snippet of our code, connect it to the SDK, and then you deploy the whole package and it sits out there like Schrödinger’s code, both live and dead, until you flip the switch to turn it on. Decoupling that deployment from the activation of code gives people so much more security and really reduces the risk of deployment because I think a lot of reasons people resist continuous deployment is the fear that they could break something and not be able to fix it in a hurry. With this we’re saying like, “Deploy all you want, activate when you’re ready.”

ZL: Just generally I guess, I think you touched on it a little bit, but explain a feature flag and what that does.

HW: A feature flag, or a toggle, or a feature management, it’s called a bunch of different things, is a way to turn a feature on and off from a central command. We work with Fastly and we have edge servers that help people to do this. Imagine you deployed a feature that say gave you the ability to do holiday snow on your webpage, like CSS that made your webpage snow. Well, it turns out that has a conflict with some browsers and you’re crashing people’s browsers. You want to be able to turn that off instantly. We like to say that you can kill a feature faster than you can say, “Kill the feature.” It’s about 200 milliseconds. All you have to do is flip the toggle and the feature turns off almost instantly for everybody, instead of having to deploy again and hope that you get it right this time and didn’t leave anything in.

ZL: You mainly work with companies large and small, I would imagine, that have a need to be able to turn those features on and off. Are there any great success stories that you’d like to share of being able to prevent an issue before it happened, or turn it around before customers even knew?

HW: We work with Microsoft and they do a lot of dark launching and sneak the features in so that they can do what we call canary testing, where you test with a small percentage of your population. But we also worked a lot with GoPro to be able to let them develop in their main line and do internal testing. Then, when they were ready, they pushed out their new version with no problems. Not a lot of people tell us when they have to hit the kill switch. We can see it in the logs, but I don’t want to call anybody else because it’s uncomfortable to have to say, “I deployed something that was bad,” but it’s not as uncomfortable as it is if you have to say, “I deployed something that was bad and it took forever to fix it.”

ZL: Your session today was a choose your own adventure interactive. I love that concept. I love games and I love just the interaction in a session like that. Walk through the concept there of how the choose your own adventure maps into this idea of deploying in a dark manner and the feature flags and things like that.

HW: We have a little mascot, an astronaut named Toggle. I created an adventure for Toggle where you can choose how you want to get to a planet based on different routes, like which ones are safer, or which ones are faster, or which ones are most scenic. If you think about the way you can sometimes get a map to show you like, “Don’t put me on any freeways,” that’s sort of the way that I designed this talk. This is the first time I’ve used the Twine gaming platform to design a talk. It turned out my CSS is really rusty, but there’s a lot of good help out there, so I got through it. It was really interesting to be able to say like, “Hey audience, pick which direction you want to go. Let’s talk about the thing you’re interested in. I’ve created 60 some slides of content, but the path through it is unique every time to your experience.”

ZL: How did they react? Which routes did you go?

HW: Unsurprisingly, the first route that we took was development. We talked about canary launches, which is the small launch, and we talked about multivariate flags. You could say, “I want to launch to 10% of the people in Germany.” We are basically doing access control lists for your product. You can slice and dice your audience however you’d like. Then, I really love this concept, I got it from Fastly, the albatross launch where you have a legacy client that’s very important to you that you want to keep happy, but you also want to keep your code base moving forward, you can switch them over to the older code base and keep your mainline moving forward without having to actually split your lines, split your code instances.

ZL: Well, awesome. Also, my crack team of researchers also dug up a little bit. It looks like outside of maybe the workplace, you are maybe an aspiring designer or seamstress.

HW: Oh yeah. I sew 100% of the clothes that I wear to conferences because I am really tired of dresses that don’t have pockets. If you wear a mic pack you have to have a belt of pockets. I just do all my own sewing and I find it super satisfying to make something tangible after a long day of software, which is a little fuzzy around the edges. Also, it means that I get to do tourism shopping. I just came back from New York and I honestly have a dedicated fabric suitcase.

ZL: Wow. Now it’s all coming out.

HW: My regular check on, roller board size suitcase fits in the big suitcase. Then I’ve come back from, oh, Tel Aviv or Australia or London, with a second bag, essentially, full of fabric. In fact, the dress that I’m wearing right now has fabric from Australia and London.

ZL: Wow. It’s a trip around the world for your clothes.

HW: It is. It’s really a nice reminder that we aren’t just here to do technology but also to experience the places that we go.

ZL: Yeah, absolutely. Do you have any places in mind for this trip? Have you eyed any places?

HW: Actually, I’ve never been to St. Louis before, but I don’t have a lot of time on this trip. I actually have to leave in a few hours.

ZL: Oh, that stinks.

HW: Yeah, it’s very sad.

ZL: Well, hopefully you get back to the airport in time and no delays there.

HW: Yeah, no. It’s only a couple minutes so it should be fine.

ZL: If people have heard about either LaunchDarkly and are curious, want to hear more about these feature flags, it sounds like a fantastic offering. I don’t know who wouldn’t be interested in something like that, to be able to control your launch a little bit and pull stuff back if disasters happen. If people want to get in contact with you, or with the company, what’s the best way to do that?

HW: They could write me, Heidi@launchdarkly.com, or you can find us on Twitter @LaunchDarkly, or you can find me on Twitter @WiredFerret.

ZL: WiredFerret.

HW: I know.

ZL: What’s the story behind that name?

HW: Well, it turns out that I am both excitable and high strung, so yeah. Anybody who’s seen me at a conference party understands.

ZL: Got you. Do you bite?

HW: I don’t. I don’t. That’s definitely against the code of conduct.

ZL: Yeah, because I mean, that’s the MO on ferrets.

HW: It is. It is. We had ferrets for years and they were terrible sock thieves. Whether or not your foot was in the sock, they were going to steal it.

ZL: It was gone. Well, I appreciate you stopping by and giving up some of your lunch hour. I know this is high priority time, so I appreciate you coming on, sharing your perspective. It sounds like your presentation went great despite a couple technological glitches. I hope you have a good flight back and thank you for taking some time out and sharing your perspective with us today.

HW: All right. Thank you.

ZL: Thank you for listening to the Technically Speaking podcast. Get involved with the show by following us on Twitter @SpeakTech, or like our page at Facebook.Com/SpeakTechPodcast. If you have suggestions or questions related to the show, or would like to be considered as a future guest, send feedback and inquiries to hey@speaktechpodcast.com. I’m Zach Lanz and thank you for listening to The Technically Speaking podcast.