14 Sep 2017

Beta Testing with Feature Toggles: Testing in Production Like a Pro

We all know beta testing is important, not just for understanding your customers’ needs, but also for stability and security. Every launch essentially asks two questions: “Are there bugs? Is there feedback?” Both serve the goal of making your product better.

Testing in production will give you the most information about the success of your new functionality. And because feature flags help separate deployment from release, they make such testing safe and easy. When it comes to beta testing, a lot of the top companies tend to adhere to a similar paradigm—test early, test often, and do it in your production environment.

So how do companies transition smoothly from alpha to beta testing, and then to full rollout? Read on to learn how top companies approach beta testing using deployment tools with feature flags, with links out to more in-depth descriptions.

But before we get started, here’s a quick terminology review. Pete Hodgson refers to this use of feature flags for betas as “permissioning toggles.” A release to a randomly chosen cohort, such as a percentage rollout, is also known as a “canary launch.” A release to a set group, such as internal users or another chosen segment, is a “champagne brunch.”
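To make the two terms concrete, here is a minimal Python sketch of both targeting styles. The function names, the flag key, and the hashing scheme are illustrative assumptions, not the API of any particular feature flag product.

```python
import hashlib


def bucket(flag_key: str, user_id: str) -> float:
    """Map a (flag, user) pair to a stable value in [0, 100)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    # First 8 hex digits -> integer -> one of 10,000 slots (0.01% granularity).
    return (int(digest[:8], 16) % 10_000) / 100.0


def in_canary(flag_key: str, user_id: str, percent: float) -> bool:
    """Canary launch: an effectively random but repeatable slice of users."""
    return bucket(flag_key, user_id) < percent


def in_champagne_brunch(user_groups: set, allowed_groups: set) -> bool:
    """Champagne brunch: a hand-picked set, for example internal staff."""
    return not user_groups.isdisjoint(allowed_groups)


if __name__ == "__main__":
    print(in_canary("new-nav", "user-42", 10))
    print(in_champagne_brunch({"staff"}, {"staff", "beta"}))
```

Hashing the flag key together with the user ID keeps each user's answer stable between requests, while different flags pick different slices of users.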

6 Approaches to Product Launching

#1 Facebook is the prime example of dark launching. Their release management has to be impeccable to operate at such massive scale. Their betas often reach a million users or more.

“Although we push to production only once a week, it’s still important to test the code early in real-world settings so that engineers can get quick feedback. We make mobile release candidates available every day for canary users, including 1 million or so Android beta testers.”

Read their article on Rapid Release at Massive Scale to learn more about how they do continuous delivery at scale.

#2 Hootsuite gives a typical rollout pattern for its features—starting internally and then slowly exposing to a larger audience.

Push new code then:
– Dark launch to yourself or your team to test
– Launch to the whole Hootsuite organization
– 10% of all users
– Watch graphs
– 50%
– 100%
– Simple means of rollback if necessary
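The steps above can be sketched as a small state machine. This is only an illustration under assumed names (`Stage`, `Rollout`, `User`, and the CRC-based bucketing are hypothetical), not Hootsuite's actual tooling.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: str
    on_team: bool = False
    in_org: bool = False


@dataclass(frozen=True)
class Stage:
    name: str
    team: bool     # visible to the feature's own team
    org: bool      # visible to the whole organization
    percent: int   # share of all users who see the feature


STAGES = [
    Stage("off", False, False, 0),
    Stage("team", True, False, 0),
    Stage("organization", True, True, 0),
    Stage("10%", True, True, 10),
    Stage("50%", True, True, 50),
    Stage("100%", True, True, 100),
]


def percent_bucket(user_id: str) -> int:
    """Stable bucket in 0..99 derived from the user ID."""
    return zlib.crc32(user_id.encode("utf-8")) % 100


class Rollout:
    def __init__(self) -> None:
        self.index = 0  # start dark

    @property
    def stage(self) -> Stage:
        return STAGES[self.index]

    def advance(self, healthy: bool) -> Stage:
        """After watching the graphs: step forward if healthy, else roll back to off."""
        if not healthy:
            self.index = 0
        elif self.index < len(STAGES) - 1:
            self.index += 1
        return self.stage

    def is_enabled(self, user: User) -> bool:
        stage = self.stage
        if stage.team and user.on_team:
            return True
        if stage.org and user.in_org:
            return True
        return percent_bucket(user.id) < stage.percent
```

Rolling back is a single state change rather than a redeploy, which is what makes the last step in the list simple.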

Check out Bill Monkman’s full deck on dark launching here.

#3 Etsy calls feature flags “config flags,” and gives a lot of credit for their process to Flickr.

“Key system-level and business level metrics (like checkout/listing/registration/sign-in rates) are projected on screens in the office and we have a number of internal dashboards that the team uses (we mainly use Ganglia and Graphite). We also have lots of switches and knobs to help us roll features out to percentages of users and ramp them up slowly, or quickly. Features are used and tested by us here at Etsy for some period of time before they are rolled out publicly.”

They have custom built a feature flagging API, “Feature API,” to enable this. The buckets they use include admin, internal, users, and groups. Read more about Etsy’s deployment practices and check out their Feature API on GitHub.

#4 Beta can also apply to back-end rollouts. Instagram does canary deployments to a subset of servers, using feature flags as a continuous delivery tool. These tests are an important part of continuous delivery and are key in helping them avoid failed deployments.

But Instagram hasn’t always had this system. Read here to learn how they evolved from a “mish-mash of manual steps and scripts” to a system they could depend on. And check this out if you want more recipes for database migration with feature flags.

#5 Niantic’s Pokemon Go betas are well known and rabidly tracked by its fans. They famously roll out by region—a field test in Japan here, a limited beta in Australia, and then something in New Zealand. Sometimes these betas for features are invite-only. Here’s a write up of how they approached the rollout of the game Ingress.

#6 GoPro released their GoPro Plus product early using feature flags. By breaking the larger release into smaller features with their own testing timelines, they were able to iterate and improve continuously. The video below walks through the technology they used and the timeline from dogfood to a “big bang” marketing announcement.

“At GoPro you can kind of tell we don’t [do] things lightly. We want to do big announcements and we want to come out with great products…we actually had smaller features that would go out, and then go for alpha testing and beta testing along the way. Shortly after March, we actually had most of the applications done from a core feature standpoint, but we kept iterating and improving those core features that we knew we were going to launch with.”


Controlling Your Rollout Like a Boss

Did you notice some trends there? These larger companies are using beta testing to do one or more of the following:

  • Test in production with feature flags
  • Release early and test small pieces of functionality before a broader release
  • Run internal tests that easily become external canaries
  • Roll out by region

As more companies start to use feature management, these incremental rollouts are not the headaches they once were. Companies can be safer and smarter with how and when they expose features to their end users.

If you want to get started with feature flagging, check out featureflags.io, a resource we made for the community to learn best practices.

14 Jun 2016

DevOps West – DevOps goes international!


DevOpsWest ran concurrently with AgileWest and had a good mix of international and domestic attendees who wanted to get better at software. As one attendee from Florida said, “I’m not working at Google or Facebook because I want to be in Florida, but I still want to be as sharp and nimble as them.” And the fact that it allowed people from London and India to visit Las Vegas seemed to be a selling point too.

23 Apr 2016

Is it a feature flag or a feature toggle?


The evolution of feature toggles to feature flags and what it holds for the future of software development.

History of the Toggle

Jez Humble and Martin Fowler are best known for promoting the separation of feature rollout from code deployment.  Within the context of continuous delivery, they provided the foundation for a framework that would allow developers to release software faster and with less risk.

Fowler is well-known for championing the notion of feature toggles, which are ways to wrap features in conditionals that allow you to toggle features on and off for users.  This would enable developers to take full control of their feature rollouts, dark launch, and roll back poorly performing features.


19 Apr 2016

Why Leading Companies Dark Launch


When it comes to releasing new features, there is nothing worse than deploying a feature that cripples your application, degrades performance, and turns away customers.

With the rise of continuous delivery, software teams are embracing faster, more iterative feature releases.  It’s now imperative for teams to ensure their features will be well-received by customers and maintain their application’s performance.

This is why companies like Google, Facebook, and Amazon have embraced dark launching to ensure the efficacy of their feature releases and the stability of their app infrastructure.


11 Aug 2015

Secret to Facebook’s Hacker Engineering Culture

Facebook’s engineering is legendary for its speed and execution. You too can be as quick and smart as Facebook, if you know their hacker engineering secret. Originally they lived by “Move Fast and Break Things”, which has now evolved with wisdom to “Move Fast With Stable Infra.” Speed is important, as are stability and providing a good experience to users. Facebook engineer Kent Beck wrote a great Facebook Note on how Facebook embraces reversibility to scale up. I highly recommend you read his entire post.

Facebook has a secret sauce: an in-house system called Gatekeeper that allows them to get quick feature feedback and quickly iterate based on that feedback. Engineering changes are wrapped with a feature flag and pushed live to production. The features are deployed live but switched off, then turned on via Gatekeeper for different users. Facebook’s seemingly simple system of separating deployment from rollout unlocks many powerful ways to move faster with more stability. All items in italics below are quotes from Kent Beck, followed by my analysis of how Facebook uses Gatekeeper.

Internal usage. Engineers can make a change, get feedback from thousands of employees using the change, and roll it back in an hour.

Initially, the engineer uses Gatekeeper to turn the feature on for internal users only. Interestingly, I’ve heard that Facebook is too large for changes to be effectively communicated EXCEPT by actually making the change. Instead of flurries of emails or blasts in chat rooms notifying other groups, Facebook engineers make the code change and wait for impacted parties to notify them that something is broken, or to fix their own dependencies. Separating changes from bigger releases with feature flags means that any change can be rolled back at any time.

Staged rollout. We can begin deploying a change to a billion people and, if the metrics tank, take it back before problems affect most people using Facebook.

Staged rollout depends on feature flags to encapsulate a change and a feature flagging system (like Gatekeeper) to take it back.

Dynamic configuration. If an engineer has planned for it in the code, we can turn off an offending feature in production in seconds. Alternatively, we can dial features up and down in tiny increments (i.e. only 0.1% of people see the feature) to discover and avoid non-linear effects.

The key to turning features off in seconds (rather than hours or, in the best case, minutes) is “if the engineer has planned for it in the code”. By using feature flags to separate code deployment from functionality, Facebook can quickly kill malignant features. Without feature flags and Gatekeeper, Facebook would have to do a full redeployment.
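As a rough illustration of what “planned for it in the code” can look like, here is a minimal Python sketch. `FlagStore` is a hypothetical in-memory stand-in for a flag service, and the flag key and `render_feed` function are invented for the example; none of this reflects Gatekeeper's real interface.

```python
class FlagStore:
    """Hypothetical in-memory stand-in for a flag service."""

    def __init__(self) -> None:
        self._flags = {}

    def set(self, key: str, enabled: bool) -> None:
        self._flags[key] = enabled

    def is_on(self, key: str, default: bool = False) -> bool:
        # Unknown flags fall back to the default, so new code ships "dark".
        return self._flags.get(key, default)


def render_feed(flags: FlagStore) -> list:
    items = ["story-1", "story-2"]
    # The flag is checked at request time, so flipping it takes effect
    # on the next request without another deployment.
    if flags.is_on("right-hand-unit"):
        items.append("right-hand-unit")
    return items


if __name__ == "__main__":
    flags = FlagStore()
    print(render_feed(flags))           # deployed, but dark
    flags.set("right-hand-unit", True)
    print(render_feed(flags))           # turned on
    flags.set("right-hand-unit", False)
    print(render_feed(flags))           # killed again
```

The code path for the new unit is already in production in all three calls; only the flag value changes.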

Right hand side units. We can add a little bit of functionality to the website and turn it on and off in seconds, all without interfering with people’s primary interaction with NewsFeed.

Facebook smartly uses microservices and avoids monolithic code. Small changes in functionality, wrapped in feature flags, can quickly be toggled on and off using Gatekeeper.

Shadow production. We can experiment with new services under real load, from a tiny trickle to the whole flood, without affecting production.

Facebook pioneered dark launches, the ability to expose features to load without exposing them to users. I’ve heard that it’s impossible to simulate Facebook’s production load as it’s so large. Gatekeeper allows Facebook to use feature flags to separate load testing from user visibility.
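One common way to expose new code to real load while keeping it invisible to users is to call it alongside the existing code and discard its result. The sketch below assumes a hypothetical `shadow_call` helper and plain Python callables; it is not a description of Facebook's internals.

```python
def shadow_call(query, primary, candidate, shadow_enabled, mismatches):
    """Serve users from `primary`; optionally exercise `candidate` in the dark."""
    result = primary(query)
    if shadow_enabled:
        try:
            shadow_result = candidate(query)
            if shadow_result != result:
                # Record the difference for engineers; users never see it.
                mismatches.append((query, result, shadow_result))
        except Exception as exc:
            # A failure in the new path must not affect the user-facing response.
            mismatches.append((query, result, repr(exc)))
    return result


if __name__ == "__main__":
    def old_search(q):
        return [q.upper()]

    def new_search(q):
        return [q.lower()]

    log = []
    print(shadow_call("Abc", old_search, new_search, True, log))
    print(log)
```

In a real service the shadow call would usually run asynchronously so it cannot add latency; this sketch keeps it synchronous for brevity.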

Data-informed decisions. Data-informed decisions are inherently reversible. “We expect this feature to affect this metric. If it doesn’t, it’s gone.”

By wrapping a feature with a flag, it’s possible to isolate its effect on the system. Data-informed decision making, which ties an individual feature to metrics, is made possible by Gatekeeper and feature flags. Without feature flags, it’s impossible to see the impact of a change – if you release five features and twenty bug fixes at once, and engagement drops by 5%, which feature is to blame? Could one of the bug fixes actually have caused a 10% drop and one of the features a 15% gain? Only by separating out each change can true causation (not just correlation) be seen. Yammer also follows data-informed decision making in its product development. Again, it’s necessary to encapsulate the feature both to measure it and to enable the rollback.
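To illustrate tying one feature to one metric, here is a small Python sketch that compares a conversion rate between users who had a flag on and users who had it off. The function name and the data shape are assumptions made for the example, and a real analysis would also need a significance test before anyone acts on the difference.

```python
def conversion_by_cohort(exposures):
    """exposures: iterable of (flag_on, converted) booleans, one pair per user.

    Returns (rate_with_flag_on, rate_with_flag_off, difference).
    """
    totals = {True: 0, False: 0}
    wins = {True: 0, False: 0}
    for flag_on, converted in exposures:
        totals[bool(flag_on)] += 1
        wins[bool(flag_on)] += int(bool(converted))
    if not totals[True] or not totals[False]:
        raise ValueError("need at least one user in each cohort")
    rate_on = wins[True] / totals[True]
    rate_off = wins[False] / totals[False]
    return rate_on, rate_off, rate_on - rate_off


if __name__ == "__main__":
    sample = [(True, True)] * 3 + [(True, False)] + [(False, True)] + [(False, False)] * 3
    print(conversion_by_cohort(sample))
```

Because only the one flag differs between the two cohorts, the difference can be attributed to that feature rather than to everything else that shipped in the same release.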

Advance countries. We can roll a feature out to a whole country, generate accurate feedback, and roll it back without affecting most of the people using Facebook.

Gatekeeper and feature flags enable canary launches – using an entire country as a “canary in a coal mine” to see if there are issues with a release. Rather than having a world-wide failure, Facebook can iterate quickly and roll back.

Soft launches. When we roll out a feature or application with a minimum of fanfare it can be pulled back with a minimum of public attention.

Facebook, after many misfires like Facebook Beacon, now follows Eric Ries (Don’t launch – separate out a marketing launch from a product launch). With feature flags, Facebook can get feedback from their own users, and control the story. Facebook has avoided the flameouts of Google, which has had epic failures with Google Wave, Google Buzz, and most recently Google Plus – all expensively launched, then expensively decommissioned. With feature flags and Gatekeeper, Facebook is always in control of who sees what when.

Want to be as smart as Facebook at developing software? Want to integrate reversibility, dark launches, and data-informed decisions into your own development cycle? The smartest companies like Facebook, Medium, Dropbox, and LinkedIn have in-house feature-flagging systems custom built for them. You can build your own system, or simply use LaunchDarkly, “Gatekeeper for everyone else”.



08 Jul 2015

Dark Launching Meetup: Lessons Learned

We hosted the first Dark Launching meetup in May with a surprisingly large turnout. We’d originally planned the meetup to be a user group of our current LaunchDarkly users sharing how they were using LaunchDarkly for dark launches. We were very pleasantly surprised by how many people joined our meetup because they wanted to learn more about dark launching itself! Dark launching is a best practice used by Facebook to launch new features “dark” (off), then slowly light up (turn on) features for different users. A key component of dark launches is the ability to easily turn a feature off again if issues are found.

Why dark launch? Shouldn’t all releases always be thoroughly tested and production ready? No matter how good or thorough your QA and performance testing is, you will never find all issues before production. Dark launches are a recognition that the real world exists. Rather than exposing your release into the full light of day to get blasted, shouldn’t you control access yourself?

I started the Dark Launch meetup by asking for a show of hands from everyone who’d had a bad release. Every hand went up, way up. I then asked if anyone wanted to share their story of a bad release. Everyone put their hands back down. I think there’s too much shame attached to bad releases, when we should 1) admit that bad releases happen to good teams, 2) learn from bad releases, and 3) put processes like dark launches into place to help mitigate bad releases.

So I’ll share a tale of two releases – one where I used dark launching, and one where I should have used dark launching. At TripIt, our users emailed us their travel confirmations. We’d parse the emails, extract the useful information like flight and hotel, and make them a beautiful trip itinerary. Our users loved us; we were the #1 Travel app in the App Store. I was the product manager on a daring new feature – users would connect their Gmail accounts directly to TripIt, skipping the step of manually forwarding itineraries. It was considered extremely risky to have people authorize us to scan their email inbox. So I found a group of frequent travelers who were willing to give us feedback. We pushed the feature “live” to production, but only granted the opted-in users access, as a dark launch.

Internally, we carefully monitored our speed scanning real world inboxes. I also followed up with all users on whether we were importing the right (or wrong) email items. Some early mistakes were being too aggressive with the word “ticket” and importing a Tiffany order or a Turkish Airlines promotion. When we’d gotten enough real world feedback, we then truly launched, with a TechCrunch article. I was pretty excited both that we’d launched something to help our users and that my picture made it into TechCrunch!

We didn’t dark launch another email feature at TripIt, with very bad results. A persistent user complaint was that TripIt would import ALL travel confirmations, whether or not you were the actual traveler. We called this “Other People’s Travel” (OPT). For example, if my sister Margaret Harbaugh emailed me her flight information – bam! her trip would appear in my TripIt account. Our system recognized that there was a travel plan, but there was no way for our system to tell the difference between Margaret and Edith. Or was there? Our engineers implemented some logic to compare names on the account with names on the itinerary and, if there wasn’t a close enough match, to not import the trip. Sounds easy – “Margaret” is clearly different from “Edith”, so let’s ship it, right?

However, we quickly had a flood of complaints. We had over a million users using auto-import (the largest Gmail-authorized company worldwide). It turns out that email itineraries often had very odd concatenations like MsEdith or EdithMs, etc., which TripIt was rejecting as not the same as “Edith”. We were skipping too many emails and our users were very unhappy. They’d relied on us to “automagically” work, and they had tolerated some false positive imports. Now we were ignoring many emails they expected us to import. We had to do an emergency patch and quickly revert the change. If we’d done a dark launch, we could have tested the change with a smaller batch of users and monitored their reaction. The lesson I learned is that dark launching is a powerful tool to ensure user satisfaction.
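To show why the name check was trickier than it sounded, here is a hypothetical Python reconstruction of the problem. It is not TripIt's code: the honorific list, the function names, and the ASCII-only cleanup are all assumptions made for illustration.

```python
import re

# Assumed list of titles that may be fused onto a first name in an itinerary.
HONORIFICS = ("mr", "mrs", "ms", "miss", "dr")


def _letters(name: str) -> str:
    """Lowercase and keep ASCII letters only (a simplification)."""
    return re.sub(r"[^a-z]", "", name.lower())


def strict_match(account_first: str, itinerary_first: str) -> bool:
    """The 'sounds easy' version: names must be identical."""
    return _letters(account_first) == _letters(itinerary_first)


def tolerant_match(account_first: str, itinerary_first: str) -> bool:
    """Also accept the account name with a title fused before or after it."""
    account = _letters(account_first)
    itinerary = _letters(itinerary_first)
    if not account:
        return False
    if account == itinerary:
        return True
    return any(
        itinerary in (title + account, account + title) for title in HONORIFICS
    )


if __name__ == "__main__":
    for seen in ("Edith", "MsEdith", "EdithMs", "Margaret"):
        print(seen, strict_match("Edith", seen), tolerant_match("Edith", seen))
```

Even the tolerant version only covers the patterns someone thought of in advance, which is the argument for trying a change like this on a small batch of users first.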

We’re looking forward to our next dark launching meetup to share more stories and lessons learned. You can join the Dark Launch meetup group here, and we hope to see you soon!