09 Feb 2018

Visibility and Monitoring for Machine Learning Models

Josh Willis, an engineer at Slack, spoke at our January Meetup about testing machine learning models in production. (If you’re interested in joining this Meetup, sign up here.)

Josh has worked as the Director of Data Science at Cloudera, he wrote the Java version of Google’s A/B testing framework, and he recently held the position of Director of Data Engineering at Slack. On the subject of machine learning models, he thinks the most important question is: “How often do you want to deploy this?” You should never deploy a machine learning model once. If the problem is not important enough to keep working on it and deploy new models, then it’s not important enough to pay the cost of putting it into production in the first place.

“The tricky thing, though, is in order to get good at machine learning, you need to be able to do deploys as fast as humanly possible and repeatedly as humanly possible. Deploying a machine learning model isn’t like deploying a regular code patch or something like that, even if you have a continuous deployment system.” -Josh

Watch his entire talk below.


How’s it going, everybody? Good to see you. Thanks for having me here. A little bit about me, first and foremost. Once upon a time, I was an engineer at Google. I love feature flags, and I love experiments. I love A/B testing things. I love them so much that I wrote the Java version of Google’s A/B testing framework, which is a nerdy, as far as I know … I don’t know. Does anyone here work at Google? Any Googlers in the audience? I know there’s at least one because my best friend is here, and he works at Google. As far as I know, that is still used in production and probably gets exercised a few trillion times or so every single day, which is kind of a cool thing to hang my nerd hat on.

I used to work at Cloudera, where I was the director of data science. I primarily went around and talked to people about Hadoop and big data and machine learning and data science-y sorts of things. I am Slack’s former director of data engineering. I’ve been at Slack for about two and a half years. I am a recovering manager. Any other recovering managers in the audience? I was going up the management hierarchy from first-line management to managing managers, and I started to feel like I was in a pie-eating contest where first prize is more pie. I didn’t really like it so much. I wanted to go back to engineering. So about six months ago, I joined our machine learning team, and now I’m doing machine learning-ish sorts of things at Slack as well as trying to make Slack search suck less than it does right now. So if anyone’s done a search on Slack, I apologize. We’re working hard on fixing it.

That’s all great, but what I’m really most famous for … Like most famous people, I’m famous for tweeting. I wrote a famous tweet once, which is a proper, defensible definition of a data scientist: someone who is better at statistics than any software engineer and better at software engineering than any statistician. That’s been retweeted a lot and is widely quoted and all that kind of good stuff. Are there any … Is this sort of a data science, machine learning audience or is this more of an engineering ops kind of audience? Any data scientists here? I’m going to be making fun of data scientists a lot, so this is going to be … Okay, good. So mostly, I’ll be safe. That’s fine. If that guy makes a run at me, please block his way.

So anyway, that’s my cutesy, pithy definition of what a data scientist is. If you’re an engineer, you’re sort of the natural opposite of that, which is someone who is worse at software engineering than an actual software engineer and worse at statistics than an actual statistician. That’s what we’re talking about here. There are some negative consequences of that. Roughly speaking at most companies, San Francisco, other places, there are two kinds of data scientists, and I call them the lab data scientists and the factory data scientists. This is my own nomenclature. It doesn’t really mean anything.

So you’re hiring your first data scientist for your startup or whatever. There’s two ways things can go. You can either hire a lab data scientist, which is like a Ph.D., someone who’s done a Ph.D. in statistics or political science, maybe, or genetics or something like that, where they were doing a lot of data analysis, and they got really good at programming. That’s a fairly common data science path. A lot of people end up that way. That wasn’t how I ended up. I’m in the latter category. I’m a factory data scientist. I was a software engineer. I’ve been a software engineer for 18 years now. I was the kind of software engineer when I was young where I was reasonably smart and talented but not obviously useful. I think we all know software engineers like this, smart, clearly smart but not obviously useful, can’t really do anything. This is the kind of software engineer who ends up becoming a data scientist because someone has an idea of hey, let’s give this machine learning recommendation engine spam detection project to the smart, not obviously useful person who’s not doing anything obviously useful and see if they can come up with something kind of cool. That’s how I fell into this field. That’s the two kinds. You’ve got to be careful which one you end up with.

Something about data scientists and machine learning. All data scientists want to do machine learning. This is the problem. Rule number one of hiring data scientists: Anyone who wants to do machine learning isn’t qualified to do machine learning. Someone comes to you and is like, “Hey, I really want to do some machine learning.” You want to run hard the other direction. Don’t hire that person because anyone who’s actually done machine learning knows that it’s terrible, and it’s really the absolute worst. So wanting to do machine learning is a signal that you shouldn’t be doing machine learning. Ironically, rule two of hiring data scientists: if you can convince a data scientist that what they’re doing is machine learning, you can get them to do anything you want. It’s a secret manager trick. It’s one of the things I learned in my management days.

Let’s talk about why, briefly. Deep learning for shallow people like ourselves. Deep learning, AI, big stuff in the news. I took a snapshot here of the train from my favorite picture, “Back to the Future, Part III,” a truly excellent film. Machine learning is not magic. Machine learning is, it’s basically the equivalent of a steam engine. That’s really what it is, especially deep learning in particular. What machine learning lets us do is stuff that we could’ve done ourselves, manually, by hand over the course of months or years, much, much, much faster in the same way a steam engine lets us move a bunch of rocks from point A to point B. It’s not something we couldn’t do. We knew how to move a bunch of rocks from point A to point B. That’s how we built the pyramids and stuff like that. But this lets us do it much, much faster and much, much cheaper. That’s what machine learning fundamentally is.

There are consequences of that. One of the nasty consequences of it. Machine learning … There’s a great paper that I highly recommend you read by this guy named D. Sculley, who is a professor at Tufts, engineer at Google. He says machine learning is the high interest credit card of technical debt because machine learning is basically spaghetti code that you deploy on purpose. That’s essentially what machine learning is. You’re taking a bunch of data, generating a bunch of numbers, and then putting it into production intentionally. And then trying to figure out, to reverse engineer, how this thing actually works. There are a bunch of terrible downstream consequences to this. It’s a risky thing to do. So you only want to do it when you absolutely have to.

Lab data scientists want to do machine learning. Factory data scientists want to do machine learning. Their backgrounds mean they have different failure modes for machine learning. There’s a yin and yang aspect to it. Lab data scientists are generally people who have a problem with letting the perfect be the enemy of the good, broadly speaking. They want to do things right. They want to do things in a principled way. They want to do things the best way possible. Most of us who live in the real world know that you hardly ever have to do things the right way. You can do a crappy Band-Aid solution, and it basically works. That’s the factory data scientist attitude. The good news, though, about people who want to do things perfectly: they don’t really know anything about visibility and monitoring. Despite knowing a bunch of stuff about linear algebra and tensors, they don’t know how to count things. But you can teach them how to use Graphite and Grafana. You can teach them how to use Logstash. They can learn all these kinds of things, and they want to learn, and they have no expectation that they know what they’re doing, so they’re very easy to teach. That’s a good thing.

Factory data scientists have the opposite problem. They’re very practical. They’re very pragmatic. So they’ll build things very quickly in a way that will work in your existing system. However, they overestimate their ability to deploy things successfully the way most not obviously useful software engineers do. As a result, they are much more likely to completely bring down your system when they deploy something. So that’s what you want to watch out for there.

Another really great paper, “What’s your ML test score? A rubric for production ML systems.” I love this paper. This is a bunch of Google people who basically came up with a checklist of things you should do before you deploy a machine learning system into production. I love it. Great best practices around testing, around experimentation, around monitoring. It covers a lot of very common problems. My only knock against this paper is they came up with a bunch of scoring criteria for deciding whether or not a model was good enough to go into production that was basically ludicrous. So I took their scoring system and redid it myself. So you’ll see down there, if you don’t do any of the items on their checklist, you’re building a science project. If you do one or two things, it’s still a science project. Three or four things are a more dangerous science project. Five to 10 points, you have the potential to destroy Western civilization. And then finally, once you do at least 10 things on their checklist, you’ve built a production system. So it’s kind of a u-shaped thing.
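Josh’s redone scoring tiers can be sketched as a simple function. The tier names and counts come from his description above; the code itself is just an illustration, and it resolves the overlap at 10 items in favor of “at least 10 things” meaning a production system:

```python
def ml_test_score_tier(items_done: int) -> str:
    """Map a count of completed checklist items to the redone tiers."""
    if items_done >= 10:   # "at least 10 things": a production system
        return "production system"
    if items_done >= 5:    # five to nine: potential to destroy Western civilization
        return "potential to destroy Western civilization"
    if items_done >= 3:    # three or four: a more dangerous science project
        return "more dangerous science project"
    return "science project"  # zero to two: still a science project
```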

This is a great paper. If you have people at your company who want to deploy machine learning into production, highly, highly recommend reading it and going through it and doing as much of the stuff they recommend as you possibly can. More than anything, for the purposes of this talk, I want to get you in the right headspace for thinking about what it means to take a machine learning model and deploy it into production. The most important question by far when someone wants to deploy a machine learning model is, how often do you want to deploy this? If the answer is once, that is a bad answer. You should never deploy a machine learning model once. You should deploy it never or prepare to deploy it over and over and over and over and over again, repeatedly forever, ad infinitum.

If the problem is not important enough to keep working on it and keep deploying new models, it’s not important enough to pay the cost of putting it into production in the first place. That’s thing one. The tricky thing, though, is in order to get good at machine learning, you need to be able to do deploys as fast as humanly possible and repeatedly as humanly possible. Deploying a machine learning model isn’t like deploying a regular code patch or something like that, even if you have a continuous deployment system. The analogy I would use is it’s kind of like someone coming to you and saying, “Hey listen. We’re going to migrate over our database system from MySQL to Postgres, and then next week, we’re going to go back to MySQL again. And then the week after that, we’re going to go back.” And just kind of like back and forth, back and forth. I’m exaggerating slightly, but I’m trying to get you in the right headspace for what we’re talking about here. Different machine learning models are basically complicated, opaque systems that are nominally similar to each other but slightly different in ways that can be critically bad for the overall performance and reliability of your systems. That’s the mentality I want you to be in when it comes to deploying machine learning models. Think about it that way.

The good news is that we can all stop worrying and learn to love machine learning, whatever the line is from “Dr. Strangelove,” that kind of thing. You get good at this kind of stuff after a while. I love doing machine learning, and I love doing it in production in particular, because it makes everything else better: the standards around how you operate, how you deploy production systems, how you test, how you monitor have to be so high, just across the board, for regular stuff in order to do it really, really well. Despite all the horrible consequences and the inevitable downtime that the machine learning engineers will cause, I swear, I promise, it’s ultimately worth doing, and in particular, companies should do it more so I get paid more money to do it. That’s kind of a self-interested argument.

If you like to do monitoring, if you like to do visibility, if you like to do DevOps stuff in general and you want to do it at a place that’s done it really, really well: slack.com/jobs. Thank you very much. I appreciate it.

31 Aug 2016

Secrets of Netflix’s Engineering Culture

Netflix is not just known for the cultural phenomenon of “Netflix and chill”, but for its legendary engineering team that releases hundreds of times a day in a data-driven culture. Netflix is the undisputed winner in the video wars, having driven Blockbuster into the “return” bin of history. Netflix won by iterating quickly and innovating with numerous micro-deployments. Could what worked for Netflix work for you?

Netflix had a virtuous cycle of product innovation. Every change made to the product has the goal of getting new users to become subscribers. Netflix has a constant flow of new users every month, so they always have new users to test on. Also, they have a vast store of past data to optimize on. Did someone who liked “Princess Bride” also like “Monty Python and the Holy Grail”? When is the right time to prompt for a subscription? Interesting tests that Netflix can run include whether TV ads drive Netflix signups, or whether requiring Facebook to create an account drives enough social activity to counteract the drop in subscriptions from people who don’t have Facebook. If a change increased new user subscriptions, it went into the product. If it didn’t increase new user subscriptions, it didn’t make it in – hypothesis-driven development.

However, what if you’re not Netflix? What if you’re a steady SaaS business with 1,000 business customers, onboarding 30 new customers a month? This is a healthy business, doubling in size annually. However, what if you wanted to test whether you get more subscriptions with a one-step or two-step process to add a credit card? With a sample of 30 a month and a 90% current success rate, it will take you three months to determine success. Not everything can be tested at small scale. Tomasz Tunguz talks more about the perils of testing early here.
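To see how hard the numbers work against a small sample, here is a back-of-envelope power calculation. The 5-point lift, 5% significance level, and 80% power are illustrative assumptions, not figures from the article:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_arm(p_control: float, p_variant: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm sample size for a two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_variant) / 2
    effect = abs(p_variant - p_control)
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_power * sqrt(p_control * (1 - p_control)
                           + p_variant * (1 - p_variant))) / effect) ** 2
    return ceil(n)

# Detecting a 90% -> 95% improvement needs roughly 435 customers per arm;
# at ~15 new customers per arm per month, that's a long wait.
n_needed = sample_size_per_arm(0.90, 0.95)
```

The smaller the effect you need to detect, the faster the required sample grows, which is exactly why not everything can be tested at small scale.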

The other “gotcha” to watch out for with Netflix-style development is that obsessive focus on one metric can degrade other metrics. For example, focusing on optimizing new user signup might mean degrading the experience for existing users. Let’s say that 10,000 customers could be served with “good” speed, or 2,000 with “superfast” speed and 8,000 with “not good” speed. Or 1,000 with lightning-fast and 9,000 with terrible speed. You might make the 1,000 new customers very happy, but piss off the 9,000 existing customers and have them quit. A good counterweight is to always have a contra-metric to keep an eye on. It’s okay if it dips slightly while the main metric rises. However, if the other metric tanks, reconsider whether the overall gains are worth it.
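The contra-metric idea amounts to a two-sided ship decision; a minimal sketch, where the 2% tolerance floor is a hypothetical threshold you would tune for your own business:

```python
def accept_change(primary_lift: float, contra_delta: float,
                  contra_floor: float = -0.02) -> bool:
    """Ship only if the main metric improves AND the contra-metric
    (e.g. existing-user experience) hasn't dropped below the floor."""
    return primary_lift > 0 and contra_delta >= contra_floor

# +8% signups with a 1% dip in existing-user speed: a slight dip is okay
ok = accept_change(0.08, -0.01)   # True
# +8% signups but existing-user speed tanks 15%: reconsider
bad = accept_change(0.08, -0.15)  # False
```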

So what lessons can you take from Netflix to help your own business?

One, have a clear idea of why you’re making changes, even if it’s not something that you can A/B test. Is it to increase stability in your system? Make it quicker for someone to onboard? Know what your success criteria are, even if there’s not a statistically significant “winner”.

Two, break down projects into easily quantifiable chunks of value. Velocity can be as important (if not more important) than always being right. For example, if you try 20 small changes and half are right, you’ll end up 50% better. If you try one big change and it’s not accretive, you’ll end up with a zero percent gain. Or, as Netflix architect Adrian Cockcroft says, “If you’re doing quarterly releases and your competitor is doing daily releases, you will fall so far behind.”
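The arithmetic behind the many-small-bets claim is straightforward expected value; a sketch, assuming each winning change is worth an illustrative 5% and losing changes are reverted:

```python
def expected_gain(n_changes: int, hit_rate: float, gain_per_win: float) -> float:
    """Expected additive gain when losing changes are reverted (contribute zero)."""
    return n_changes * hit_rate * gain_per_win

# 20 small changes, half of them wins worth 5% each: +50% expected
small_bets = expected_gain(20, 0.5, 0.05)
# one big change with a 50% chance of paying off the same 50%: +25% expected,
# realized as a coin flip between +50% and nothing
big_bet = expected_gain(1, 0.5, 0.50)
```

The expected values differ, but the bigger point is variance: the small-bets portfolio almost always delivers something, while the big bet can leave you with zero.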

Three, don’t underestimate the importance of your own domain expertise. If you’re constantly testing ideas, even without having enough data, you’re quicker to get onto the right path. Let your competitors copy your past mistakes while you move forward. As Kris Gale, co-founder and CTO of Clover Health, said, “You will always make better decisions with more information, and you will always have more information in the future.” But the way to get more information is to iterate.



03 Dec 2015

To Be Continuous: Continuous Delivery and Mobile Development

On this episode of To Be Continuous, Edith and Paul are joined by Kevin Henrikson, Partner Director of Engineering at Microsoft, to discuss what’s necessary to be successful with continuous delivery in mobile development. This is episode #8 in the To Be Continuous podcast series all about continuous delivery and software development.

This episode of To Be Continuous, brought to you by Heavybit and hosted by Paul Biggar of CircleCI and Edith Harbaugh of LaunchDarkly. To learn more about Heavybit, visit heavybit.com. While you’re there, check out their library, home to great educational talks from other developer company founders and industry leaders.

Paul: All right, Kevin. So, what’s your favorite thing about Continuous Delivery?

Kevin: Good question. So, at Acompli, when we first started, it was just a couple people. And our deal was like, “Hey, you’re trying to get this thing to work: send emails.” So it was like, keep checking it in, get closer and closer to kind of the definition of “it was working.”

And then as we started to experiment farther out, we said, you know, “What’s the right sprint cadence?” And we’d come from kind of traditional enterprise companies, where, you know you ship every six weeks or every four weeks and we’re like, oh, we can go faster.

People are doing this agile thing. It’s like two-week sprints, and we tried two weeks and realized that most of the work came in on Thursday afternoon or Friday, ’cause it was the end of the second week. And so, what if we just do one week? Then most of that work will come in, you know, the first Friday. And sure enough, it worked, right? So 80% of the work still comes in on the last day, or kind of completes on the last day, or gets to the point of ready. And so, what we liked is that it gave us a much smaller kind of chunk of deliverable, right?

So, like the train keeps moving but the size of what’s on the actual boxcars is smaller and so if you have a problem, it’s easier to roll back, easier to fix.

Or if you have a problem, you can just patch it quickly ’cause you’re used to moving quickly.

Edith: Wow. Okay.

Kevin: So unlike in a traditional world where you ship something every six weeks, you’re like oh, we made a mistake, we need to go patch this. Like, the patch felt like an exception and you’re like, your muscle memory for patching was not there, right? So you’re like do we do it now? Is it worth it?

And then it’s like well, if you’re five weeks in and you discover that thing, it’s like well, the six-week thing is out there where with, if you’re shipping every week, you’re like oh let’s patch it. We shipped yesterday? We just click the button again, and it builds and it goes. So you’ve built the muscle memory to move quickly, right. And so I think, what I like the most, kind of going back to the specific question, is just the fact that it lets you get good things done quickly, but also correct mistakes quickly.

Edith: Wow. So this is a good time for you to introduce yourself.

Kevin: Cool. So, Kevin Henrikson. I currently manage the Outlook, kind of non-Windows teams at Microsoft. So Outlook for iOS, Android, and Mac. Previously, I was VP of Engineering at Acompli, which was a mobile startup, focused on iOS and Android.

Edith: Yeah so you had a lot of good thoughts right then. One of the things I wanted you to talk more about is why do you think 70-90% of the work is always on the last day?

Kevin: Yeah, I think people are kind of deadline driven. I think generally they want to work until the last minute and I think software development ends up being this thing where you’re never done. And so the only way to define done is to actually draw a line and say this is done and then still some things miss that line. And so the tighter you give the window for things to get done, the more often they’ll hit that window because you’ve kind of broke things into smaller chunks.

And also, because software’s kind of this long-lasting thing that never is quite done done, if the chunks are small, people are able to estimate and think, is that going to take me a day? Very rarely does somebody think something’s going to take them a day that takes months.

But very often, like if you think something will take months, it may take many, many months. Because it’s hard to guess, there’s so many variables, so many things are going to change in the environment around what you’re doing before you get there.

So, I think the short answer is if you start with a smaller window and more discrete small tasks, those tasks are easier to estimate, they’re easier to consume, they’re easier if you get disturbed, like oh I’m going to go down and get a soda, or somebody asks me a random question ’cause some fire comes up, I can kind of make it up because I was only expecting to spend half my day working on that anyways.

So, I think that kind of works out in our favor.

Paul: So, there’s an implication with what you said there, that the tasks sort of grew because it was two weeks, rather than–

Kevin: Yes.

Paul: So, I guess the phrasing is, when it was two weeks, was it that a one-week task had grown to two weeks? Or was it that things got cut when it got shrunk down to one week, and you ended up with technical debt, or something along those lines?

Kevin: I think it was a couple combinations. I think if you feel like you have two weeks to work on a task, you’ll spend more time polishing or potentially doing things that aren’t on the critical path to getting it done.

And I think that’s natural, right? Like, maybe it is you’re paying off too much technical debt, right? In the start-up world, you have to have that lever. And even just in general software development, like, you have to have a lever. We’re not launching the space shuttle here. There’s room for some wiggle room and imperfection.

And I think breaking it into small pieces lets people focus on, “Hey, I’m only gonna do this” and “I have this much time to do it.” Where you say, “Hey, it’s got two weeks,” you may be deep into the second week and saying, “You know, I just don’t like my approach. I’m gonna change it.” And you basically end up throwing away a week’s of work, whereas, if you’re thinking that “Hey, the deadline’s coming in a week,” you’re gonna make that kind of epiphany, like, “Oh, I made a mistake,” or “I’ve taken the wrong path” much quicker.

So, I think it allows you to break your decision-making and optimize on getting it done, because the window is kind of quickly approaching, being a one-week spread versus two.

Edith: This might sound like Seven-Minute Abs, but why not two-day sprints?

Kevin: Yeah, so I think in two days we– or one-day sprints, right, how do you continue to make it faster? So, for us, there was sort of one artificial gate, which was the fact that you really can’t ship quicker than about a week with Apple.

And even then, it sometimes slips, because, you know, it’s four or five days, and then during a peak season, if there’s lots of apps, like a Christmas holiday, or around a platform update, you’ll see a spike in the review times. So Apple kind of gated us, and we ran the iOS app first before we ported it to Android. So that ended up driving us a little bit toward a week. ’Cause it was like, if you shipped on Monday, the app would get in the store some time at the end of that week. And that was a natural cadence to start the next one.

So, it’s really hard, without pushing the emergency release button, which Apple only lets you do once or twice a year, to ship more than once a week. And then, we kind of let the rest of the team fall in behind that. If you’re only shipping the app once a week, let’s keep the big chunks of new work, or big bug fixes, on our service.

The way our architecture is, and we can talk about that more, as much as 77% of the logic of the app is down deep in the service, because that’s what we can update all the time. So, when we push a release and then wanna fix something, we fix it from the service the same day.

Edith: Interesting.

Kevin: So, that allows us to have a much more agile approach, because the service is constantly being updated, but the app is kind of gated in one-week chunks.


Paul: So, what you’re saying is that the biggest impediment to your shipping frequently is Apple.

Kevin: The platform, yeah. The platform, basically, Apple’s– And specifically, it’s the fact that Apple has a review process, which is great. That Apple review process generally keeps the bar of apps higher and puts some good discipline on the wider range of developers that are coming to the Apple platform.

But, it also, when you wanna move quickly, you can’t. The platform itself kind of enforces this four to seven-day lag in shipping.

Paul: So, you’re now running teams that are shipping on non-Apple platforms. Did you guys make any changes to your release cadence as a result of that?

Kevin: Yeah, so I think on Android, for example, we still kind of keep to the one-week chunk of: we plan a sprint, build for a week, release to our beta group or dogfood users, and then release to production. But Android gives you this advantage where you can do percent-based rollout.

So, with Android, we end up releasing usually two or three times a week. So, you’ll release kind of The Release on Monday, at like 1-2%, kind of walk that up to three, five, 10%. Usually at five or 10%, you can pick up some new signal of a crash or behavior that you wanna adjust, make a hotfix, push another release, start at 10%, 20%. Oh, you see another one you wanna tweak at 20%. And finally, from 20 or 30, you end up going all the way forward.
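Percent-based rollout like the one Kevin describes is typically implemented with stable hashing, so that widening the percentage keeps earlier users in the cohort. A sketch of the idea (an illustration of the technique, not how Google Play actually buckets users):

```python
import hashlib

def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the bottom `percent` of 100
    buckets, so walking a release from 2% to 5% to 10% only ever
    adds users; nobody who had the release loses it."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Walk the release out: the 2% cohort is a subset of the 10% cohort.
early_cohort = {u for u in ("ann", "bob", "carol") if in_rollout(u, 2)}
wider_cohort = {u for u in ("ann", "bob", "carol") if in_rollout(u, 10)}
```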

Paul: Do those Android users help you on the other platforms, as well? Are they finding crashes that also exist on the other platforms?

Kevin: Today we’re 100% native on both, so there’s no shared code between iOS and Android from the mobile app perspective. Obviously from the service, it’s 100% shared code. So, what we’ve actually done in the past is, when we launch a new feature that we think is either controversial or has a high chance of risk, for example, when we launched IMAP support.

We launched generic IMAP support on Android first, knowing that we could patch it quickly, and adjust if things went wrong, simply because there was no way to test the millions of random IMAP servers in the world and get any kind of focus. When we talked to Gmail or Yahoo, or– Single point.

Paul: That’s really interesting, actually, the idea that because the rollout, or the sort of safety features are only supported on Android, that people will end up building features there first.

Edith: How deliberate did you have to be about putting stuff in the service, versus in the apps? Was this the lesson that you learned?

Kevin: So, we had always thought of the service as the place where we always wanted to do as much of the logic that could break us– You just don’t wanna do it twice. You don’t wanna implement ActiveSync twice. You don’t wanna implement an IMAP stack twice.

And it also allowed us to pick a protocol between our cloud service and the mobile app that was optimized for mobile. So, think of the API: the shape of the API looks closer to the screen of the app. There’s a request for “Get my inbox,” and then make a change. Where, if you think of IMAP, or the myriad of email protocols that we talk today, they don’t have that very linear type of approach of “Give me this screen. Give me that screen.”

They’re thought of in a way where you’re going to sync all the mail down to your desktop client, and then manipulate it locally, and then sync changes back up in some either connected or disconnected state. So, the service was basically where we put complex logic, but it also allowed us to get leverage, ’cause everything we could build in the service was one less thing we’d have to do twice on the client, because we picked a path of not sharing client code. We basically didn’t have a lot of shared client code, essentially none, just the API.

Paul: And it’s the same service that sits behind all the different clients?

Kevin: Yes, so we have a single cloud service that basically powers both the iPhone client and the Android clients.

Edith: It’s interesting, so at TripIt we had a lot of similar issues. Like, we were at one time one of the biggest; a lot of our users are on Gmail.

Kevin: Okay, awesome.

Edith: People would sync their Gmail with us, and their calendars. Did you find there was a difference between the UI or the app experience that Android and iOS users wanted? How much did you bifurcate that?


Kevin: Sure, so, a lot of people ask, like, “Hey, did you launch a feature first on Android or iOS, or do you always have to make them come out identical?” My answer to that is pretty simple. Most normal people carry one phone. They’re either an Android user or an iPhone user.

So, they don’t know that you released it first on Android or iPhone, because they don’t carry both phones. So, we think of that, from a feature point of view, that when they’re done, then we ship it. Now, if we’re trying to line up a press event around something, we’ll wanna make sure they’re both there before we do the press event. One may come out a day or two before the other. But, generally speaking, we don’t try to, you know, legislate you have to do it this way, ’cause it works on iPhone.

In fact, we actually took a different approach: we wrote it one way on iPhone and then we went to Android. It wasn’t just like, “Oh, go port all the screens over.” It was like, “What would this screen look like for an Android user?”

So, we want our apps to fit the mobile device, and most mobile users wanna look at an app and feel like it fits in with the rest of their phone. So, the way that Android navigation works, the way that people use the back button, or not, is very different on iPhone and Android. The fact that there are widgets, and the way the widgets are exposed on Android, and the way that those people want to integrate deeper into the platform.

Those kind of things we let go with the platform. Our view is write it in native code, ’cause it allows us to take advantage, and move quickly to the platform-specific features. But it also lets us tailor the experience, and the UI and UX to whatever that user base wants.

Paul: It’s interesting that you talk about it that way, because in the old days, kind of ten years ago, when people were making multi-platform applications, there was a tendency to try to make it all match. Firefox is an example here, or Chrome: one of those tools that people use on all platforms was designed to look the same on every machine, even though people didn’t have that machine.

I think one of the nice things that’s happened with the mobile ecosystem is people have thrown away that concept, and they’ve made it feel actually native for each of the platforms.

Kevin: Yeah, and I think that also goes to feature set, right? You would think that with a traditional app, if it works this way on Mac, or on Windows, or on the web, you would expect kind of 100% feature parity between all those platforms. And today on mobile, we select based on where you actually use it. There are features that exist on mobile only.

Because mobile has all this additional context of your location and where you are, and what you’re doing. Are you running? Are you driving? There’s other sensors in the devices today that allow you to build specific features that are richer. So, we always thought of a commonality.

So, rather than sending more emails, send better email, right. Don’t say, “I’m running late.” Say, “I’m running late and I’m here,” and send a quick tap of a button to put the map and show you that, “Oh, I’m 20 minutes away,” or, “I’m on the train,” or, “I’m at the train station,” so that people can understand you don’t have to type out how many minutes it’s gonna take. You don’t know. Or, you walk in, or you drive in–

Edith: Plus, people lie all the time. Like, “I’m on my way” has a pretty loose meaning of anywhere from “I might leave my house in 30 minutes” to–

Kevin: And depending on the context, right? What that person’s thinking, so yeah, good point.

Paul: One of the interesting things that you said there is that it might take a day or two … When you’re trying to launch something, they might actually land a day or two apart. On this show, one of the things we try to do is pitch Edith’s business, which is feature flags as a service, LaunchDarkly.com.

Edith:     I never knew you did that, Paul.

Paul: It would seem to me that, kind of, best practice there is to release the code and not have the code be a blocker on the launch. The code is there, but it’s behind a feature flag. Do you guys do that much?

Kevin: We have feature flags. We have flighting, we call them flightings, same idea. We launch all kinds of dark features on the server, in particular. There are all kinds of things that we’ve built that are currently in production on the service that the production app may not take advantage of, right? A great example is when we launched avatar support and it fit with the UI of Android better. We turned it on there and didn’t turn it on in iOS. We’re still working through how we want that UI to lay out. In that kind of case, maybe we’ll turn it on for some percentage of users and test it. We basically think through what the features are that make sense.

I think the one piece where we’re a little different from people that kind of go crazy with tons of feature flags is that we actually try to think through, “Is this something we’re going to want to potentially toggle on and off?” rather than adding a flag for everything. Simply because when you end up with a large number of flags, the interdependencies, the combinations, become very hard to test. Even at the scale we’re at with millions and millions of users, you just can’t get to all the combinations, and then you make a mistake and developers are like, “Oh, well, you didn’t have XYZ defaulted on. You weren’t running experiment 3.” Those kinds of things.

For certain things, we don’t feature flag. People used to ask in the early days, “Do you guys do A/B testing?” We’re like, “Yeah, we do A/B testing every Friday. We ship a new build.” As a startup, we were able to just make the change. Now that the user base is bigger, with millions and millions of users using the app, we need to be a little more thoughtful in terms of how you roll that out. In this case, we are using some more flighting-type feature flag stuff too, to enable users to see features in kind of waves.
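Flighting of the kind Kevin describes, turning a feature on for some percentage of users and then widening it in waves, is commonly implemented by hashing the user ID into a stable bucket, so a given user stays in the rollout as the percentage grows. A minimal sketch of that idea (the function and flag names here are hypothetical, not Acompli’s actual system):

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Deterministically bucket a user into a feature rollout.

    Hashing user_id + feature gives each user a stable bucket in
    [0, 100); raising `percent` only ever adds users, never removes
    them, which is what lets a rollout widen in waves.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000 / 100.0  # 0.00 .. 99.99
    return bucket < percent

# Widening the wave: 1% -> 10% -> 100%. A user enabled at 1%
# stays enabled at 10%, because their bucket never changes.
assert in_rollout("user-42", "avatar_support", 100.0)
```

Because the bucketing is deterministic, the same user sees the same behavior across sessions, which matters when you are testing a wave before widening it.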

Edith: Let’s walk through this. I think you called it “Shipping Friday.” Every Friday you shipped. What would you do if there was a bad release out in the field for a week?

Kevin: I mean, it’d be very hard for it to ever show up in a week. We would notice it … Our process basically is: we plan the day on Monday and review the previous week’s work. Code Tuesday through Friday. We ship to testing on Friday. Test Friday night, Saturday night, Sunday night. We have a team that basically does testing overnight on those three days. Monday morning we get the report and then we decide to push it. We push it to our beta users. There are a lot of gates. There’s a set of QA and kind of manual testing that happens over the weekend. There’s a beta test, kind of dogfood, that happens the first few days of the week. Then we push it live. Like I said, for those kinds of things, we can split the platform. If it’s an Android bug, we usually can fix it within an hour. It’s very quick to push a new Android app through Google Play.

With an iPhone issue, or iOS issue, we have to determine the severity of it. If it’s something that needs a critical fix, we’ll push the red button with Apple that says we need an expedited review, which usually can bring the time down from a week to about 24 hours, but no guarantees. It’s just best effort. That’s what we do. Now, there are many cases where we disable features. IMAP, for example, we turned on. We started seeing a lot of problems. We turned it off in a few countries that we saw more problems from; there were very popular IMAP servers in those countries that we didn’t support.

Edith:     Oh, interesting.

Kevin: We just disabled it on a geo basis. There are places where we’ve turned off things where we had put a feature flag in place to do that, but it really depends on the feature. Some big UI changes are hard to feature flag, so those you have to deal with. Fortunately those cause a lot fewer crashes, because UI is just not as unstable code as, kind of, the core code.
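A geo-based kill switch like the IMAP one Kevin describes can be as simple as a flag check that also consults the user’s country, with the blocked-country list kept in remotely updatable config so it can change without a release. A hypothetical sketch (the names and country codes here are illustrative, not the actual Outlook configuration):

```python
# Countries where the feature is disabled. In practice this set would
# live in remotely updatable config, not a constant baked into a build.
IMAP_DISABLED_COUNTRIES = {"XX", "YY"}  # placeholder country codes

def imap_enabled(country_code: str, globally_enabled: bool = True) -> bool:
    """Feature is on only if the global flag is on AND the user's
    country is not on the geo blocklist."""
    return globally_enabled and country_code not in IMAP_DISABLED_COUNTRIES

assert imap_enabled("US")
assert not imap_enabled("XX")
assert not imap_enabled("US", globally_enabled=False)
```

Layering the geo check under a global flag keeps one switch for “turn it all off” while still allowing the narrower per-country disable.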

Edith:     Is there ever something that you really wish that you’d put a feature flag on that you didn’t?

Kevin: Yeah, I’m sure there is. We shipped iOS 9 support on the day iOS 9 launched and had a crasher. There was a migration bug. I don’t know how you could have turned off iOS 9. Some of these things, you’re just … Then we pushed the button. It took us 2 days, so we had to eat 1-star reviews for 2 days, which was super painful.

Edith:     They hurt.

Kevin:    Yeah. They’re pretty permanent.

Edith: Well, they hurt, and then they write mean things and it just stings. You know, at TripIt we were overall 5 stars, but when you get a 1-star, you feel the pain personally.

Kevin:    Yeah. You get thick skin.

Paul: You were saying you’re releasing something to testers over the weekend and then it goes out to the beta channel the next week. If the testers gave you a bad report and you needed to pull it, then you end up in the situation, I presume, of instead of a week’s changes going ahead, you’re going to have two weeks’ changes going ahead. And if that’s bad, it might stack up with three weeks’ changes, and then suddenly you’re out of continuous delivery. You’ve got a big risky release going in. Any experiences with that?

Kevin: Yeah, we’ve had that frequently. Usually what happens, though, is because you’ve only written code for 4 days, we always say: if it takes us 4 days to write the code, we should be able to fix it within 4 hours. We fix it Monday and just ship it and take the risk. Extra code reviews, a very aggressive look. Then we ship it to dogfood. We very rarely skip an entire week. We almost always find a way to ship it. Everybody kind of rallies around that as a cultural thing. Like, “Hey, if there’s something wrong, or a crasher comes through and we can’t fix it,” it’s kind of all hands on deck to sort that out.

Paul: Got you.

Kevin: One thing that I did miss is we internally dogfood every check-in, right? The way that our system works is when you check in code, whether it’s on the app or on the service, within about 3 minutes it’s pushed to your phone, or the service has been updated. The service is constantly updating with every check-in. Things that are very critical … We say, “What’s a critical bug?” Well, I can’t send email, or I can’t reply, or forward, or a button doesn’t work. That stuff’s found within minutes of it being checked in because the dev team’s relatively large and we’re all running that app, so we’re seeing those kinds of things. Like, “Whoa, I don’t like this.” Jump on chat and tell everybody, “Hey, let’s go look at this.”

Paul: Got you. You just stop shipping your things if there’s a problem. You’re all hands on deck to fix the actual problems.

Kevin: Yup. Like I said, usually when you’ve only had 4 days of change, kind of at the maximum, usually within a few hours we can figure out what’s going on and either disable it … For example, in the case of when we shipped the iOS 9 crasher, it was with Spotlight indexing, the new indexing feature in iOS 9; we just disabled Spotlight indexing, cut a new build, and shipped it within an hour. Again, it’s never perfect, because software’s an imperfect trade, but most things we can figure out soon enough, or be able to roll back or disable.

Paul: Right. One of the big releases that I was part of that went badly on this kind of thing was the Firefox 4 release. It took 18 months to go from Firefox 3 to Firefox 4. The main reason was, there’d be a big feature that would go in … The big feature would be buggy and would need a lot of work. It just wasn’t ready to go ahead. In the meantime, other teams were working on their own things, things were getting included. In the end, something that was supposed to be, I think, a 6-month release ended up taking 18 months before it went out.

Kevin:    Yeah, it’s rough.

Paul: There was stuff that was ready, new CSS features. The sort of thing that you don’t want to wait 18 months to be released.

Kevin:    Yeah, compatibility stuff.

Paul: Yeah. That was the catalyst for completely changing to the continuous release model.

Edith: You talked about dogfooding. How much do you dogfood? Do you require everybody to only do email on their mobile?

Kevin: We don’t. There were cases where we could block, but we don’t. Fortunately, everybody uses mobile email, so it’s a pretty light ask. It’s not like we’re building an app that people would not normally use on a daily basis. I think the biggest thing is that generally developers on the team don’t have the email volume that matches our user base. Generally developers don’t get crazy long 120-message threads and a kajillion attachments and super long complex HTML in the messages. I think the bigger problem is not about people using the app. It’s about the data, kind of the representation of the data, matching our wider user base.

Edith:     I ask because Facebook legendarily does not allow people to use Google products. For example, they use Microsoft Bing instead of Google.

Kevin: Yeah. I mean, I think companies have those rules. For a long time at Microsoft, they would only pay for your cell phone if you used a Windows Mobile phone. These are common ways of trying to encourage folks to use what you build. I think those are great and they work, but there are always exceptions to the rule. You can also become blind to competitors by only focusing on your own stuff, and you end up with tunnel vision, saying, “Hey, all I know is how my stuff works with everything else. I’ve never worked in a ‘real world’ universe where everything’s different.” Whereas the world you live in now … You have Mac. You have Windows. You have this phone. I have Android devices. I have …

With Android in particular, how many different versions of Android are out there in the wild today on all kinds of different hardware? Even from the same vendor, there could be … We can see bugs between Samsung devices. I think we encourage people to use what they have. People are allowed to expense or buy whatever phone they want. Lots of the team members, especially those of us in support dealing with customers, have multiple phones. We have different tablets and phones that are personal devices that are connected, and we’re using them, switching between them, because that’s what customers are doing. They’re looking at all these platforms. Being familiar with them is one thing, but it’s also about being able to catch issues.

Paul: One of the things that we talked about with Kris last week was the trade-off between doing continuous delivery and delivering polish. Acompli had this reputation as being this incredibly polished product that people loved using. It was the best Gmail app on iOS, I think, was the description I’ve heard. How do you feel that trade-off works? When you have something with such a high degree of polish, how does it affect what you’re shipping and how you ship?

Kevin: First of all, thank you, that’s awesome. We got lots of accolades for our polish, and I think we always thought we could do better. The founders were all kind of plumbers. We all worked on infrastructure and back-end systems. UI wasn’t our thing. The thanks really goes to the folks on that mobile team, the experts we hired in that space, who really own their craft. It was a constant tension. There’s always this notion of the startup lever: you have to get function and utility. We’re building a utility. We’re building an email client. If it’s missing big features, people just can’t use it. If delete is missing and reply-all is missing and certain features are missing, you can’t use it. No matter how polished the send button is, it doesn’t matter if you can’t do reply-all and it doesn’t support BCC and those kinds of things. There was always this tension between how we get polish and utility. I think we could have had a more beautiful app, but potentially a less functional app, and who knows how that would have affected the outcome if we had spent more time on polish, and vice versa. I think what I’ve learned now …

Another team joined Microsoft just after we did, called Sunrise. It was a calendar app. They were world-renowned for design. Really, really beautiful design. You ask them how they went about their sprints and delivery, and their model was more like every 6 or 7 weeks. “It was done when it was beautiful” was kind of their mantra. Versus us: it was done on Friday. Now we have a combined team, we’re working together. They brought this notion of, “Oh my gosh, they basically spend all this time getting the design perfect and honing it, but we’re able to ship every week.” The way that we do that now is the design starts about 2 or 3 weeks ahead of the actual work.

We’re able to give the designs much more time to allow that polish to happen, and those designs and mocks come back very, very high fidelity, so that when we implement it, we know what we’re doing. Acompli kind of did design and implementation almost in the same week: spec it on Monday, design it on Tuesday, build it on Wednesday and Thursday, quickly put some paint on it on Friday, and then ship it on Monday. It just didn’t give you time to get to completely beautiful. I think now, with Sunrise, integrating that process and the team they’ve brought over has really upped our game on the UI side.

Paul: If you’re looking for functional and beautiful, I guess, then shipping constantly would seem to be a helping hand there. They literally just spent 6 or 7 weeks working it internally?

Kevin:    Yeah. They would work it internally and continue to polish it and get the feature and the UX just right until they decided, “Hey, now this new chunk of work is done.”

Paul: Would that involve like user testing and data?

Kevin: Tons of data. User testing, data research. And then it also helped because, as a startup, the press becomes immune to you. If you’re releasing every week and you’re like, “Hey, look at this, look at this, look at this,” the press stops writing about you. Sunrise did an amazing job of packing up those 6 weeks into something a little bigger, dropping a bigger hammer, and saying, “This is something big and amazing. Write about it.” A lot of the press and attention they got was because they were able to drop these slightly larger releases, rather than a traditional one-week bite, and that gave them the additional attention.

Paul: I think that’s really interesting. I remember one of the software marketing 101 things, I think Joel Spolsky talked about it years and years ago, was that the biggest thing they could do to get press was just increase the version number. If FogBugz moved from 5 to 6, it would get a 20% or 30% customer increase just by existing. Now with continuous delivery, we’ve kind of killed that. No one talks about the new version of almost anything. I don’t think anyone’s ever talked about the new version of Facebook. There’s the new news feed and things you can yell about, but it’s not like we’ve gone to Facebook 37 now and it’s amazing.

Edith: I disagree, Paul. I think what’s gone away is the notion of versioning, but Facebook does a very good job of announcing new features. So does Twitter. They push Moments. The number’s an arbitrary thing, but they’re definitely still pushing at least functionality.

Paul: Sorry, I think there are two things I’m saying. One is that releasing features feels like less than releasing a whole new version. The second is that companies of Facebook’s size, and Google’s size, do good jobs of it, whereas it is very difficult for small startups to do the same thing when they have that continuous delivery cadence.

Edith:     I think it’s just hard for small startups to get press, period.

Kevin: Yeah, you basically need … You talked about Moments. We always talked about, what’s the next product moment? Even inside a company like Microsoft, that’s important too. At Microsoft, you have different competing events; there are always lots of things to talk about. Aligning your moment with, “Oh, they’re launching Windows 10, or they’re launching Office, or they’re launching the new Xbox.” There are very, very large events, like, “Oh, the NFL is switching to the next version of Surface.” These are huge product moments, and I want to talk about my little feature in my app. I say little, but even at scale with millions and millions of users, next to the NFL switching to the new Surface Pros, that’s pretty cool. You have to kind of …

There’s a marketing strategy and PR strategy, and how that aligns with continuous delivery is tough. I think we struggle with that all the time: trying to pick when to turn things on, what things you group together, and what’s the story. We always heard it as a startup: it’s not about the feature or the moment or what version you are, it’s the story you’re telling. The version was the easiest way to tell the story 10 years ago. Today the story is, “Oh, we’re launching this new capability, or this new feature, or this new thing.” That’s the story now.

Paul: I think it’s very easy for companies like Apple, and I see AWS doing this as well: they just have this event that everybody pays attention to, everybody listens to, and they get to announce the list of features. I remember there were some changes to CodeDeploy at the last re:Invent. I don’t know if people really care that much about the new features that have gone into CodeDeploy, but it got a mention because it got packaged in with the 50 other changes that happened to AWS, and all of the press and everyone was paying attention to it that week.

Kevin: Yeah, I think they do a great job of focusing their moments and having a train of moments. Say, “Hey, this is going to be the week.” They get people to come show up. They’ve got a writer assigned to sit in every session and write about it; they’re going to write about it. There’s nothing else they’re going to do all day. They’re able to create that.

Paul: You almost have to justify the fact that you’ve sat in this session for the last hour. You have to write something, even if it’s the worst thing ever.

Edith: Yeah, it goes back to how a lot of startups get paralyzed by fear. They’re like, “What if we release a feature and TechCrunch writes about it?” I’m like, “TechCrunch is not going to write about your feature.”

Kevin: Right, right. Generally speaking, right? There are just too many other things to write about. They’re more interested in writing about the startup you’ve never heard of that got 5 million bucks. That’s more interesting than your feature. They’ve already written about you and people roughly know what you do. You’re not completely changing the world.

Edith:     On a detour from press, back to something interesting you said …

Kevin:    Sure.

Edith:     You told me that your app was loved by users and by IT. In this world of rapid builds, how did you also win the love of IT?

Kevin: Yeah, so our view, coming from the enterprise world, is that you don’t get to choose what software you use at work. We felt that a big sea change was going to come over the next couple of decades: people, especially with mobile, are getting to choose what apps they run on their phone and using them for work, whether IT has sanctioned them or not. You look at the Dropboxes of the world, obviously …

Paul: Consumerization of the enterprise.

Kevin: Consumerization, right? As you do that, you say, “Okay, let’s go build things that people love.” Again, Dropbox is a great example of building something that people love, but they’ve had trouble convincing IT that it’s the right solution, whereas Box, I think, has done a better job of that, because Box went down a much more IT-focused checklist. When I look at it, there’s a security and IT checklist required to get to the IT person. There’s no checklist, or easy way, to get users to love you. We spent the better part of the first year, year and a half, focusing on getting users to love what we built, iterating like crazy, and moving continuously to get them there. Then the plan, and kind of the thinking, was, since we had all these enterprise trials and one very large enterprise customer with Acompli before Microsoft acquired us: “Hey, we’re starting to work down the enterprise checklist.”

They’re going to tell you what they want. It’s very clear. “I need this SSL thing. I need this HIPAA thing. I need this FedRAMP thing.” They’re going to tell you what those requirements are. They’re not fun, but they’re very explicit, and in fact it’s your choice whether to negotiate: “Hey, can I do a smaller version of that?” The mix there is … Once you get in there, one of your sales angles is, “Hey, by the way, your users are already using this at home or on their work accounts. They’re going to love it when you roll it out. It’s by far the best thing you’ll trial in kind of a bake-off.” That was our go-to-market strategy: “Hey, let’s make it so users love it. Spend a bunch of time on that. Use that for press and branding, and then go to the IT folks and get them to buy it.”

Edith:     Did they ever challenge you on the weekly build and how that would affect the stability?

Kevin: They didn’t. I think the challenge was more around security and data locality and things of that nature. Speed of change didn’t seem to be a huge deal. In mobile in particular, like I said, for the actual thing that the users were touching, people are so used to mobile apps updating that it’s not a problem. I think it was more around data security and compliance around the data platform. We had strategies to deal with that: keeping the data … accessing it through a secure line, or VPN, or dedicated lines, or using dedicated tenants in AWS or Azure, those types of things. There were ways to get through the compliance thing, but the key thing was really where the data was, and data compliance, and less about how quickly you were updating.

Paul: Awesome.

Edith:     Well thanks, Kevin, for coming in. Do you have any final thoughts around software development or continuous delivery?

Kevin: I think it’s cool. My thing is: the faster you go, things break, but the right things break. When we got acquired by Microsoft, everybody was like, “Well, no other team can ship every week.” Like, “Good. That was fun. That was your fun little startup thing.” I had this deck that I presented where it’s like, “Hey, we ship every week. We drink beer on Fridays and release on Mondays.” They’re like, “That’s cool.” It was a novelty thing. They were like, “It’ll just never work.” Things like, “It’s going to take a week to sign the build because we have all these special requirements.” Yes, but at the end of the day, we’re going to push a button and it takes 22 seconds. That’s what it used to take before.

They’re like, “No, no. We’ll give you a special deal. It’ll be like one day.” We’re like, “No, that’s not possible. We need 22 seconds.” We finally got it to work, but for those kinds of things we had to break these internal processes and reviews. A lot of times it wasn’t just a process but a team involved. You can’t have humans involved in some of these processes. A lot of the continuous thing is how you automate it and say, “Look, we’ll do everything you’re asking us to do, but just let us codify it, so that we can push the button and it does it. Trust that we’re doing what you asked.” That’s the way we got back to that high-speed delivery. We’re shipping every 7 days in a very, very large company.

Edith:     Wow. I’d love to hear a little bit more about the tools that make all this happen.

Kevin: Yeah, there’s all kinds of craziness in there. A lot of the tools involved were tools that most of us know today: Perl scripts and shell scripting and PowerShell and .NET. It’s just code. What does it take to sign an Apple build? You have to run Apple’s code-signing tools, and you have to have the cert and access to it. A lot of it was network VPNs, to have the right access to the certs on the right boxes when you’re making that build. You used to have to use a smart card; we needed two-factor auth to sign the build. You had to literally stick your smart card in my Windows laptop to get the cert to sign the Android build. For those kinds of things, we said, “Look, there’s got to be a way to automate this. How about we get two developers? We’re going to do two factors: two of us are going to both say the build shipped and it’s been reviewed. Then the automated process kicks off.” Those kinds of things we just had to re-envision. The beauty was in setting the goal: “No, no. The goal is 7 days. We’re going to ship every 7 days.”
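The two-developer sign-off Kevin describes, replacing the physical smart card with two people approving before the automated signing kicks off, boils down to gating the signing step on a quorum of distinct approvers. A rough sketch of that gate (all names here are hypothetical; the real pipeline was Perl, shell, and PowerShell scripts):

```python
def can_sign(approvals: set, required: int = 2) -> bool:
    """Release signing proceeds only with approvals from `required`
    distinct developers: the two-factor replacement for the smart card.
    Using a set means the same developer approving twice doesn't count."""
    return len(approvals) >= required

approvals = set()
approvals.add("dev-one")
assert not can_sign(approvals)   # one approval is not enough
approvals.add("dev-two")
assert can_sign(approvals)       # quorum reached; kick off signing
```

Keeping the approvals in a set, keyed by developer identity, is what makes the check a quorum rather than a click count.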

Just finding all the things that broke, going after them, and breaking them. Saying, “Nope, we can’t do that. We can’t do that. We need this access. We don’t need that access.” One step was like, “Oh, we need to review for political correctness and make sure there are no offensive words, because we ship to so many different countries. How do we do that?” Well, there are tools for that, and that requires somebody to get to this team and wait. It’s like, “No, no. At the end of the day it’s going to be code and run. How do we just ship that over there and automate it?” It ends up being an automation thing. It’s much like DevOps today. There are teams all around the world with one or two people running thousands and thousands of servers with automation. If you box the requirements into “it has to go quickly and it has to just work,” you have to just write the code to close that gap. That’s what we did.

Paul: When you’re making that transition, I guess when you went in, you were shipping once a week.

Kevin:    Yup.

Paul: Did you just refuse to break that once-a-week thing, or did you submit to their thing and then try to get it back to once a week?

Kevin: We refused. When we first got acquired, we were still shipping as Acompli, under our own rules. We kept shipping. In parallel, we knew we were making the switch to rebrand as Outlook. As we started to rebrand, it took us about 6 weeks to go from rebranding to getting Outlook to ship. We knew that as that was going … It was taking us 6 weeks. We’re like, “Well, where’s the cert at? How do you rein this in? What font should you use? Where’s the logo file?” A lot of things were one-time things, and those did take a while. I think the first couple weeks, we were probably shipping more like every 10 days. It was a little longer. There was a lot of manual elbow grease and escalation emails and phone calls, more elbow grease than should have been required, to get us through the process. We got it to a point now where … It wasn’t fun developing at first …

Paul: Right, right, right. Did you have any stuff where, on your end, it was automation, but the API that you were hitting was a manual process?

Kevin: Multiple. Yeah. “Just send us an email and we’ll sign your build.” We’re like, “No, no. Well, what are you going to do?” It was literally kind of an interrogation: “Okay, when I send you an email and ask for this request, or fill out this form, what actually happens?” You had to sit there and pull on the string. “Okay, then you copy it to this. Okay, so you have a password for this thing. How does this thing work? Okay, great. Then you’re going to copy it to here. Okay, great. It ends up needing to go into this directory that only you have access to. Okay, we need access to that directory and we’ll copy it there. Okay, great. Then the tool automatically picks it up and signs it. Oh, okay, perfect.” A lot of it was figuring out the human mechanics of it, for sure. Then figuring out where the code is so we could go write it.

Edith: So, Kevin, you felt strongly that 7 days versus 10 was worth it?

Kevin: For sure. It was incredibly valuable to getting to where we were at Acompli. We felt that if we had any chance of hitting the goals we had set for ourselves, and that the board had set for us when they acquired the company, in terms of growth, we had to get to 7 days and basically iterate that quickly. Otherwise, the targets we set were going to take us 3 or 4 times longer, because every time you lengthen the development cycle, I’m a pretty firm believer now, it takes you that much longer to actually get your results. The whole game is how many shots you have on goal. If you can do 50 a year, you’re a lot better off than the team that can only do 20. That was the fundamental thing: we could make 50 changes a year. The size of the change is not as important as how many you can make and how quickly you can iterate.

Edith:     Well, great. We really enjoyed you sharing your stories.

Paul :     Yeah, that was great.

Kevin:    Cool. Thanks for inviting me.

Paul :     Thanks for listening to this episode of, “To Be Continuous,” brought to you by Heavybit and hosted by me, Paul Biggar, of CircleCI, and Edith Harbaugh of LaunchDarkly.

Edith:     To learn more about Heavybit, visit Heavybit.com. While you’re there, check out their library, home to great educational talks from other developer company founders and industry leaders.


21 Oct 2015

To Be Continuous: The Fear of Shipping

In this segment of To Be Continuous, Edith Harbaugh and Paul Biggar talk about the fear of shipping and whether code is an asset. This is podcast #5 in the To Be Continuous podcast series, all about continuous delivery and software development.

This episode of To Be Continuous is brought to you by Heavybit and hosted by Paul Biggar of CircleCI and Edith Harbaugh of LaunchDarkly. To learn more about Heavybit, visit heavybit.com. While you’re there, check out their library, home to great educational talks from other developer company founders and industry leaders.

Paul: Okay, so one of the things we wanted to talk about this week was the idea of shipping velocity and how it’s affected by team structure, sort of miscellaneous factors that aren’t part of continuous delivery.

Paul: So at the end of the last episode, Edith, you said, I think this was a quote from Yammer VP, “The organization you design is the software you built.”

Edith: Yeah, it was actually from the CTO, Adam Pisoni, and it really struck home for me because I see this at so many companies: they have a lot of engineers and they wonder why they have a lot of code but not a lot of product. And it ties back to what you just said about specializing the product management role.

Paul: Right. I think that this has a name. I think it’s Conway’s Law, and the way that I saw it expressed is: if you’re building a compiler and you have four different teams building it, you’ll end up with a four-pass compiler.

Edith: There’s no one vision.

Paul: Right. When looking at lots of different teams and team structures, the interesting one that I found was the Heroku one.

And they have a language team and they have an add-ons team, and they have sort of sharp delineations in their software, or in their stack, that allow them to really focus on one particular area, because there are such sharp demarcations between the different areas of the product.

Edith: I think that’s good if you’re a fairly mature product. I think in the early days of Heroku that would not have worked at all.

Paul: I wouldn’t say it’s quite from the early days, but relative to now it was quite early, I think they had that in Cedar, which was around 2010.

Edith: I just more meant when you’re an early stage start-up, sometimes you change your entire product.

Paul: Well okay, yes, yes. I mean, absolutely. I think once you get into, once you get past the first stage of the product, and if you’re able to draw very good interfaces between how your customers understand what your product is.

Edith: I don’t know, I mean I’ve seen this go bad in so many organizations, where you have entrenched engineering organizations that care more about staying on their current project than actually about where the market is going.

You know, like we’ve always worked on this, so we need to stay here because we don’t know anything else versus being able to evolve to where the market is going.

Paul: Right. This reminds me a little bit of something that I’m working on at the moment. We brought in some UX experts to look at our app and to help us sort of transform it into something that was a little more usable.

And they did a fantastic job and I spent this afternoon reading some of the reports. But what was difficult was understanding where the product needed to be. So for us in particular, there’s not enough focus on the deployment.

There’s a lot of focus on the build and there isn’t really sort of a broader look at what do engineers actually do when they’re trying to do continuous delivery. And so we ended up with what was in the product was redone in a really fantastic way, but there wasn’t much affordance made for here’s the thing that actually needs to be in the product.

And when you talk to customers, it’s hard for customers to tell you oh here’s the thing that you actually need to be or they look within the box you’ve drawn for them.

Edith: Yeah. I say this ’cause I had a similar evolution to you, I actually started off in engineering. And when I was in engineering it was very obvious what we should build next, extremely obvious. And so I always thought that our product manager was an idiot for not seeing as clearly as me. When I became a product manager, I realized how myopic I had been as an engineer.

Paul: Can you give an example of what was the next thing to build that in retrospect was wrong?

Edith: I would see all the little bug fixes that we should be doing instead of the next big features. Or not even the next big features, but the next big product.

Paul: I think big or small, or big picture versus small picture, is a good way to distinguish these two. I think that it’s very easy when you’re talking about product management to get the idea that a product manager knows everything and the engineer is just an implementer.

And I think this is where a lot of the resistance to product managers comes from within an engineering organization: the idea that they’re going to be relegated to mere kind of peasants in the…

Edith: Code monkeys.

Paul: Code monkeys, there we go.

Edith: You know, nobody wants to be a code monkey. That doesn’t sound very fun.

Paul: Right, right. I would disagree with that, but I think that’s–

Edith: Wait a second, we never disagree.

Paul: Right, right. It is very frustrating trying to understand everything. And on the other hand, it’s very satisfying to ship things and to get your stuff in front of customers. So very often, the ability to just be a code monkey for a certain period of time is this sort of soothing feeling of just shipping software that fixes a lot of small problems.

I remember reading one of the famous GitHub guys, I don’t remember which one it was, but let’s assume it was Kyle Neath, who spends a lot of time on big projects. And in between the big projects, he needs to find what is the next big project to work on.

And it’s often very frustrating, or you go down certain rabbit holes and whatever, and you end up kind of not shipping things, or you end up getting frustrated or whatever. And what he likes to do then is just reach into the backlog and take a bunch of small fixes. And he spent like two weeks just implementing very small things.

You don’t need to think about it, and it’s cathartic and it lets you ship and so a couple of weeks as a code monkey I think is a very useful thing to sort of refresh the head and that sort of thing.

Edith: I agree, but I think nobody wants to do that full-time and I’ll also challenge something else you said, which is everybody wants to ship.

I think there are a lot of people who find shipping terrifying, and they’d rather keep holding stuff back until it’s perfect… Like, I’ve certainly been in situations like this where it’s like–

Paul: Right, where we can’t ship because it’s not perfect yet or it’s not complete.

Edith: Yeah, and as an engineer you have this real battle of well what if people want this, or what if they want this or this might be not quite right.

Paul: Yeah. The personal strategy that I use to manage that is to try to write the blog post that you’re going to launch this with, because very often you’ll be like, “Oh, I can’t launch this because it hasn’t got this feature, it hasn’t got that feature.”

And in the blog post, assuming that you’re going to tell people how to use it (or you’re writing the doc, maybe, if not the blog post), you get that feedback as you’re trying to explain to your user how to do this. You’re going to say, “All you need to do is this,” and you’ll realize that this is seven steps long instead of one step long.

Edith: Yeah, the Amazon model. So at Amazon, they actually start with writing the press release first.

Paul: Okay, right.

Edith: And everything and that’s a really good guide back.

Because too often, people do the other end: they’ve built this gargantuan thing and they’re trying to write a press release or blog post and they’re like, “Whoa, we’ve built all this stuff, but there’s nothing actually to talk about.”

Paul: Right. And part of that, and something which I think engineers have a difficult time thinking about, is how to get this widely in use. You can build the feature, but it’s no use having built it if no one uses it. So you need to build the breadcrumbs: you need to figure out the ways to subtly hint that this is the feature you want when it’s the feature you want, and to draw people’s attention to it.

And sometimes that’s putting it in docs and sometimes that’s doing a big announcement, but more often it’s trying to get the product in a place where the UX naturally implies the right path or the right direction for users to discover parts of the product.

Edith: Yeah, I mean the whole idea of responsive design, and I think even more, and this goes back to why I started LaunchDarkly, is you might have built it, but nobody might want to use it.

Like, you could have put all this effort into building it and done all these breadcrumbs that nobody follows. So that’s the idea of you actually start doing the bread crumbs first and see if people start following that path.

Paul: So with LaunchDarkly, I’m guessing that the way that you see whether someone is using it is whether it’s enabled for them. Is that right?

Edith: Well no, so what we do is we allow people to turn on features for certain users.

Just turning on a feature for a certain user doesn’t necessarily mean that they start using it.

Paul: Right, exactly. So do you tie this to Mixpanel usage? Or some sort of analytics stuff?

Edith: We could tie it to different back ends. We tie it to New Relic, we tie it to Optimizely so you can see if people are even using it, and we have our own internal analytics.

Paul: Gotcha, okay. So this is the thing for me: I started a project recently, and the first thing I did was build the dashboards for adoption. And we’re still at the stage in the project where there’s no adoption, or there’s a tiny amount of adoption amongst early users.

A trickle. A trickle that you can’t even see on the graphs.

Edith: So it’s more like a fine mist.

Paul: A fine mist. But what you need to get is you need to get to the place where everyone is using this. ‘Cause if you just build it, they’re not going to come.

They need to be told about it, they need to understand how to use it, and getting those first customers using it, and seeing where it’s deployed amongst them, gives you incredible feedback about how one actually ships that software to the larger customer base.

Edith: Totally agree. I mean, this is classic Lean principles of just making sure some people can use it well before rolling it out further.

Paul: We discovered a part of the product that exactly three customers were using.

Edith: How did that make you feel?

Paul: Well, we didn’t actually know it was a feature. This was the idea that you could do deployments in parallel. So at CircleCI we parallelize your build, and so the idea is that basically we take your test suite and split it across 20 or 50 or whatever machines.

But it turns out that that applied to deployment as well. And there were exactly three customers using that. And one of them had a valid use case for it. Out of thousands of customers, exactly one valid use case was there.

Edith: So what did you do with the feature?

Paul: We killed it.

Edith: Did you tell him?

Paul: I hope so. Yeah, I know. I think we reached out to that guy. There was another way for them to do it.


Edith: The painful part of product management is also when you have a feature that you’d like to kill but that a subset of power users loves.

So like at TripIt, we were a mobile travel itinerary, but we let people do a printout. And one time we were like, oh, nobody prints anymore, let’s just kill it. Turns out that people print and they really, really like printouts.

Paul: Right, right. I understand that, yeah.

Edith: Like particularly if you’re traveling to a foreign country.

Paul: Yeah, you’re not going to have internet or your phone’s going to be dead.

Edith: Or you want to show something to a passport guard.

Paul: Yeah, or a local. Without handing over your phone, like here’s what I’m doing.

Edith: Yeah, so they were furious with us.

Paul: Right. So, had you killed it at that stage?

Edith: Oh we killed it. Like we were just like oh, we didn’t have good analytics on people printing, so we just said oh nobody’s printing, let’s kill it.

Paul: Oh, wow wow.

Edith: So our analytics later was that people complained. Quite loudly.

Paul: And so had you properly killed it at that point, or had you merely disabled it to see if it went away?

Edith: Let’s see. We disabled it. I think we could get it back but people were really really upset.


Paul: I like the thing of shipping something turned off, rather than actually deleting the code. Even though it’s incredibly cathartic to delete the code and to hide it and remove it. But turning something off with a feature flag is just a lot better way to sunset something.

Edith: Why do you think it’s cathartic?

Paul: Oh deleting code is amazing. It’s like my favorite thing to do.

Edith: It’s funny. My cofounder John is an ex-Atlassian, and he said the winner of their hack competition was always the person who deleted the most code.

Paul: Right, right. That makes perfect sense.

Edith: Because what they wanted to reward is tidiness.

Paul: Yep, yep. It’s very much related to the idea of product management and validating things and making sure that you don’t build too much of the product. Code is not an asset.

Code is an asset only in the sense that you think you want it, but you actually don’t. You actually want the best performance with the least amount of code, slash, asset available.

Edith: Yeah. So a friend asked me once, “Should I pay my developers more if they write more lines of code?” I was like, no. That’s a really easy metric to game.

Paul: Right. So we were talking about deleting features by feature flagging them. I think that this is an awesome way to delete a feature because it’s very, very easy to get back. It’s much easier to get back than a rollback.

Rollbacks are this thing that nobody wants to do because they’re very, very painful, especially if a ton of code has come in between. So if you ship something, or if you delete something, by literally putting a feature flag around it, then you ship the code with it still on, you turn it off for a certain number of people, see if anyone complains, see that it still works, you know, show it to ten people, and then you delete it for everyone just by flicking the flag.

And then if you get someone saying, “You know, we really, really, really need printing,” you can turn it back on for them while you have a think about how you really want to solve this problem.
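
The pattern Paul describes, killing a feature by flipping a flag for everyone and then re-enabling it per user, can be sketched in a few lines. This is an illustrative toy, not any real LaunchDarkly API; the flag names, class, and store are made up.

```python
# Sketch of sunsetting a feature behind a flag instead of deleting the code.
# All names here are hypothetical.

class FeatureFlags:
    def __init__(self):
        self._defaults = {}   # flag name -> on/off for everyone
        self._overrides = {}  # (flag, user) -> per-user exception

    def set_default(self, flag, enabled):
        self._defaults[flag] = enabled

    def override(self, flag, user, enabled):
        self._overrides[(flag, user)] = enabled

    def is_enabled(self, flag, user):
        # A per-user override beats the global default.
        return self._overrides.get((flag, user),
                                   self._defaults.get(flag, False))

flags = FeatureFlags()
flags.set_default("print-itinerary", False)                 # "killed" for everyone
flags.override("print-itinerary", "angry-traveler", True)   # turned back on for one user

print(flags.is_enabled("print-itinerary", "typical-user"))    # False
print(flags.is_enabled("print-itinerary", "angry-traveler"))  # True
```

The code stays in the tree the whole time, so "undeleting" the feature is a data change rather than a rollback.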

Edith: Yeah, I think feature flags is a really misleading term.

So feature flags implies that it’s always on or off, when really it’s more of a feature control.

It’s a way to encapsulate a portion of functionality such that you have total control over it, from the sunrise of it, you know, from launching it to certain people, seeing their reaction, getting analytics, and then all the way to the end, as you said, every feature eventually you want to kill.

Paul :     There’s an interesting parallel here between feature flags and configuration variables. By configuration variables, what I mean is: in a Rails app, you often have a set of four different environments. You have dev, test, staging and production. You often get in trouble in that you fill your code base full of, “If dev, this,” or, “If env is production, then do this.” What you really want is to be able to say, “If we have enabled the X feature.” “If we’re using SSL” is one example of a thing that might be on in some configurations, but wouldn’t be on in other ones. In the Clojure ecosystem, there’s this idea of a component, where you build your application as a set of components and all of the variables for your components are passed into them. The variables are essentially feature flags. Are things disabled? Are they enabled? What is the setting that it’s built on? It makes it very easy to compartmentalize functionality and to only expose a simple interface that allows you to control how the functionality works, without necessarily having to dive into the functionality all the time.
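
Paul's component idea can be sketched outside Clojure too. This is a minimal illustration (the names `HttpConfig`, `make_url`, and the hosts are invented for the example): instead of scattering environment checks, the switches are passed into the component as explicit settings.

```python
# Sketch: settings passed into a component instead of "if env == production"
# checks scattered through the code base. All names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class HttpConfig:
    use_ssl: bool    # effectively a feature flag
    base_host: str

def make_url(cfg: HttpConfig, path: str) -> str:
    # The component never asks "am I in production?"; it only
    # looks at the switches it was handed.
    scheme = "https" if cfg.use_ssl else "http"
    return f"{scheme}://{cfg.base_host}{path}"

production = HttpConfig(use_ssl=True, base_host="example.com")
development = HttpConfig(use_ssl=False, base_host="localhost:8080")

print(make_url(production, "/signup"))   # https://example.com/signup
print(make_url(development, "/signup"))  # http://localhost:8080/signup
```

The environment names appear exactly once, where the configurations are constructed, and everything downstream only sees flags.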

Edith:     That’s exactly my vision for LaunchDarkly. That everything should be controlled.

Paul :     Right. You should look into the Clojure idea of components. It’s a very sort of … The Clojure people speak in very theoretical abstractions and that sort of thing. They use weird words like “complecting.” It’s a weird thing, but they actually really know what they’re talking about, which is even more annoying.

Edith:     That’s always the worst, right?

Paul :     Yeah, so someone invents their own vocabulary and then they’re right, so you actually have to discover what they mean by this vocabulary. Then no one else understands the vocabulary. So “complecting” is an interesting word, in this case. It means unnecessarily tied together. It’s not that something is complex, but that it’s two things that are like … It’s the opposite of simple, basically.

Edith:     That was the …

Paul :     The idea that you have two components and they’re too widely connected, or complected.

Edith:     Yeah. What do you think is a better word than feature flag, when really it’s more about feature controlling or feature wrapping?

Paul :     I used to feel that there were different concepts for feature flags versus AB testing. That they were actually different concepts. I’m now convinced that they are the same concept.

Edith:     This is interesting, because I think AB testing is just something that is enabled by having a wrapped feature.

Paul :     Something that is … Say that to me again.

Edith:     If you’ve wrapped all your features, as I talked before, you can launch them, you can monitor them, you could sunset them, and you could also compare them.

Paul :     Right, right, right. Yes, so an AB test is really a feature flag which is enabled randomly for a certain subset of customers and you’re looking at the analytics.

Edith:     Yeah, so it’s just an extension of … If everything is an object, and nicely, as you said, a wrapped object, then you could say, “Oh, is this object doing better than this other object?”

Paul :     Right. Okay. The thing that I found really weird in looking at how people talked about AB tests versus feature flags, weird because they’re the same concept, is that AB tests are a marketing thing, or a growth thing. You tie them to business goals or to KPIs, or to funnels, or something along those lines. Features, and especially the kind of operational side of features, you tie to database latency and basically operational metrics. There’s no difference between business and operational metrics. Every AB test should be tied to operational metrics, because it’s no good knowing, “Oh, no one buys on this thing,” if the reason no one buys on it is because the exceptions are through the roof.

Edith:     Yeah, you …

Paul :     Similarly, there’s no concept of, “This feature works really well if the database load is really low.” Oh, the database load is really low because no one is clicking down that path. Your business metrics are through the floor.

Edith:     You’re absolutely right. I think one of the goals of LaunchDarkly is to provide analytics on everything. I do think, from what I found when I was talking to customers, there’s a lot of fear around AB testing. Just the word, I think, has been overloaded. Take Max from Heroku, who sits down the stairs from us. They love feature flagging. They feature flag everything. He doesn’t think of it as AB testing. If you asked him, “Do you AB test?” he would say no, because that implies that you’re really doing more of a test of an old version versus a new one and picking which one is better. Really, it’s more that he has a new feature and he wants to make sure the metrics are correct.

Paul :     Right. When I would advocate for AB tests in the past, it was mostly to say, “Does this perform not worse than what was there before?” Someone would have a new design of something and they think it’s great. We’d all agree that it looked a lot better, but does it convert better? In fact, not even does it convert better, because if it looks better and it converts the same, we should definitely go with it. Does it convert not worse?

Edith:     Yeah, the issue is most people don’t have the traffic for AB testing.

Paul :     You can make AB tests work at a small scale.

Edith:     If you’re a SaaS company with maybe 300 customers, you don’t have enough volume to do AB testing.

Paul :     You don’t have … No.

Edith:     You could be quite profitable and very happy with those 300 customers that are each, like, 100K a year.

Paul :     AB testing is a test of statistical significance, right? To be able to tell small differences in the result, you know, 5% versus 6%, you need a large number of people to have any confidence in the statistical significance.

Edith:     Yeah, so if you have a feature …

Paul :     You can tell the difference between 80% of people getting through your funnel and 20% of people getting through your funnel with 50 customers.
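
Paul's point about scale can be put in rough numbers with the standard normal-approximation sample-size formula for comparing two proportions. This is a back-of-the-envelope sketch (the specific rates and the 0.05/0.8 alpha-and-power choices are illustrative, not from the episode):

```python
# Rough sample size per variant to detect a difference between two
# conversion rates, using the normal-approximation formula
# (two-sided alpha = 0.05, power = 0.8). Illustrative numbers only.
import math

Z_ALPHA = 1.96    # two-sided 5% significance
Z_BETA = 0.8416   # 80% power

def n_per_arm(p1: float, p2: float) -> int:
    p_bar = (p1 + p2) / 2
    a = Z_ALPHA * math.sqrt(2 * p_bar * (1 - p_bar))
    b = Z_BETA * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(((a + b) / (p1 - p2)) ** 2)

print(n_per_arm(0.05, 0.06))  # thousands of visitors per variant
print(n_per_arm(0.20, 0.80))  # a handful per variant
```

Telling 5% from 6% needs on the order of eight thousand people per variant, while 20% versus 80% is detectable with roughly ten, which is why the 50-customer funnel comparison works but the small-lift test does not.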

Edith:     Say you have a feature that isn’t at the very top of the funnel. What I was trying to say is, if you have something that people might not see very often, it’s very hard to AB test it. If you look at a funnel …

Paul :     It’s also very hard to get …

Edith:     You have your marketing site, which gets the most traffic. You have your signup, which gets the next most. Then the further down you get into the product, the harder and harder it is to do AB tests.

Paul :     Especially if it’s something where you don’t have the funnel, necessarily, or where you have a funnel, but you haven’t constructed a funnel. One of the ways I think about software is that everything is a funnel.

Edith:     Yeah, life is a funnel.

Paul :     Life is a funnel. Hiring is a funnel. Your jobs page is part of a funnel, but you don’t actually have a Mixpanel funnel built on your jobs page. Making any change to your jobs page means that you don’t actually know if it’s had a positive effect or a negative effect. You can’t really tell anything about it.

Edith:     That’s very true. On the other hand, if you’re getting only like one person applying a month anyway, it doesn’t really matter. That’s still in the realm of statistical insignificance.

Paul :     Right, so when something goes out you have no controls, you just have opinions. Whereas on the home page you can have data. Then on your jobs page, you can argue that this word that we use here is really off-putting to female developers, or something along those lines. If someone disagrees with you, you’ve got nothing to back up your arguments either way.

Edith:     Well actually, a company, Textio, now is doing a lot of statistical analysis of job postings to see if they have … If the words in them are gender neutral, or not.

Paul :     That’s based across a massive corpus of job pages, versus … It’s not monitoring your actual throughput, or something like that?

Edith:     No, it’s just based on … She’s a machine learning PhD.

Paul :     Right. That works for something like gender, or things that are … diversity in the general case. It’s not going to work on, “Are all the Clojure developers … Are they aware that this is a Clojure shop?” as an example.

Edith:     Yeah and that’s why you can’t AB test life.

Paul :     If only.

Edith:     No, I mean, that would be horrible. You’d have to do everything a thousand times.

Paul :     Right, right, right.

Edith:     What if some of those thousands were just terrible?

Paul :     Right. You can AB test web pages.

Edith:     No. Not if they don’t get enough traffic.

Paul :     Not if they don’t get enough … Yeah. Yeah.

Edith:     It goes back to what you said. You can have throughput or latency.

Paul :     Right.

Edith:     You could test every page if you have a thousand years. If you have a low traffic page … Sometimes, to go to what you said before, you just have to go with, “I feel this color’s better. I feel this color pops.”

Paul :     This is one of the most frustrating things about developing software. Everyone knows that you should have data and analytics, and it’s just very difficult to really have any idea of whether you should have analytics for a particular thing. You know you should have it on your funnel. You know you should be measuring what customers are using. In almost every instance, you can make a justification for just doing it this way.

Edith:     You have to at a certain point.

Paul :     That’s what’s frustrating. If you could uniformly say, “In every situation, we’re going to use data. We’re going to use the funnel. We’re going to use these analytics,” then you’d be in a great place. Whereas 90% of your webpage or your product actually doesn’t get enough use to get any statistical significance, and then you end up with only having a funnel on your signup page. Then you don’t really have a very data-driven company as a result of that.

Edith:     That’s the hard truth. There’s still a lot of art in the science. At some point you have to make a decision, like I like …

Paul :     Unless you’re at Google.

Edith:     I’m sorry?

Paul :     Unless you’re at Google.

Edith:     Yeah, unless you literally have the world’s population using you, you do have to make a decision like, “I think our jobs page should look like this.”

Paul :     Right.

Edith:     “I think that our onboarding should look like this.”

Paul :     Right. Why? Don’t know.

Edith:     Other people do it that way.

Paul :     I just kind of feel it. I think this pops more.

Edith:     Was that an American accent?

Paul :     No.

Edith:     I thought you slipped it on in.

Paul :     No. I think this color pops a little bit more. That’s what most arguments about UX end up with, if you don’t have high level principles and goals and personas that you’re building the product around.

Edith:     That’s fair. I think it goes back to what you said before. There’s always this tension of data versus gut. And to what we talked about with engineers, of making it perfect versus shipping it now. I think they go together. The more data you watch, the more you can be convinced that now is the time to ship, versus the person who’s like, “Okay, it’s just time.”

Paul :     Right, right, right. When I’m trying to ship something, I try to make sure that my fears are addressed, more than I try to make sure that the thing is feature perfect.

Edith:     That’s a really good way to look at it.

Paul :     We shipped this feature that I’ve been working on and it was just supposed to go to the first 5 customers. What I wanted to do was make sure that the back end was shipped, and then I could test it on our own project and validate, “Does it actually work at all? What does it look like when someone actually uses this in production?” The first thing that I really needed to validate is, “Can I turn it on without causing any problems for everyone?” That meant that I had to make sure that there was no problem for anyone. Basically, what I had to do with the feature was insulate it so that if it went wrong, in the thousand ways that I couldn’t expect it to go wrong, I knew for sure that it wouldn’t affect the rest of the customers.

Edith:     Did you use a feature flag?

Paul :     A feature flag, but also … The problem in a lot of languages is it’s not just a feature flag. You have to wrap the exceptions and make sure that the exceptions get caught and just end nicely, versus propagating. It was something that was on the critical path of everyone’s build. If the code went wrong, everyone’s build could be affected. I just needed to make sure: no matter what goes wrong in this code base, let the builds continue.
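
The insulation Paul describes, a flag plus a catch-all so the critical path survives any failure in the new code, looks roughly like this. The function names (`run_build`, `do_build`, `flaky_feature`) are invented for the sketch; this is not CircleCI's actual code.

```python
# Sketch: flagged code on the critical path is wrapped so failures are
# logged, never propagated. All names are hypothetical.
import logging

log = logging.getLogger("builds")

def do_build(build):
    # The existing, trusted path.
    return f"built {build}"

def run_build(build, new_feature_enabled: bool, new_feature):
    if new_feature_enabled:
        try:
            new_feature(build)
        except Exception:
            # Whatever goes wrong in the new code, the builds must flow.
            log.exception("new feature failed; continuing build")
    return do_build(build)

def flaky_feature(build):
    raise RuntimeError("one of the thousand unexpected failure modes")

print(run_build("project-1", True, flaky_feature))  # built project-1
```

The flag controls who runs the new code; the try/except guarantees that even for them, the old path still completes.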

Edith:     The builds must flow.

Paul :     The builds must flow. What I was addressing wasn’t, “How do we feature flag this off?” It was just like, “Can I ship this without being fearful of something going wrong?”

Edith:     I think you sum it up very well. Fearful.

Paul :     Right. Continuous delivery is mostly about fear more than anything, I think.

Edith:     I think it’s about mitigating fear because it’s very freeing that if you could turn something off at any time, you could move forward.


Paul :     Right and if you ship something and you know that it’s not going to break things …

Edith:     You could ship it.

Paul :     Right. Exactly.

Edith:     I think it’s when you have these big bang releases where it’s everything all together in one kluge, that you have a lot of fear.

Paul :     Yup.

Edith:     I had a customer who said they stopped doing continuous integration, sorry, because it was faster to not. They were just …

Paul :     Let me just predict how this ended up.

Edith:     Have you seen this movie already?

Paul :     I’ve seen this movie so many times.

Edith:     You saw…

Paul :     Every iteration of this movie.

Edith:     Every thousand?

Paul :     Yup.

Edith:     How does the story end?

Paul :     The story ends with software not being able to be delivered. Engineers quitting is a common ending, or at least a last-chapter surprise. It’s like, “Oh, we’re shipping so fast.” No, we can’t ship anything, because it’s so frustrating to ship things, because we just don’t know if it’s going to work.

Edith:     How do you … That’s the end of the story. What’s the next chapter?

Paul :     After the end of the … What’s the sequel?

Edith:     What’s the sequel?

Paul :     The sequel is they bring in a new VP of Engineering. The VP of Engineering says, “What the hell? There’s no testing.” The VP of Engineering sets up testing. Everyone complains about testing and that everything was much faster before testing, but they have to do what the VP of Engineering says. In about a month, they realize their velocity is about 10 times faster than it used to be and everyone who complained about testing is actually happy.

Edith:     Yeah. That’s what happened at our customer. They said they stopped doing testing because it seemed to take too much time. They got to the point where they couldn’t ship. They were like, “Actually, this is the same as basic housekeeping.” You can’t just not do your dishes every day for 2 months and expect …

Paul :     Was there anyone fired on this journey?

Edith:     No, they came to a pretty quick realization.

Paul :     I think they probably got lucky there. People have been fired for less.

Edith:     Oh, really?

Paul :     If you’re in charge of a software team and you bring in something that brings the team to a halt, it might take a little convincing of your boss that you actually did know what you were talking about in the first place.

Edith:     I heard this legendary story that Salesforce, they got to such a state that it took them two years to ship anything.

Paul :     Wow.

Edith:     They fixed it. They fixed it because they realized this is a problem. It takes us two years to ship anything.

Paul :     Right.

Edith:     Well Paul, it was fun catching up with you about product management and how to AB test, or not.

Paul :     Whether AB testing is the same as feature flags. It is.

Edith:     Oh, Paul, we always agree, but not on this one. I would say that you can AB test if you have feature flags, but I wouldn’t say you have to AB test if you have feature flags.

Paul :     I think this is just language. AB testing is … The way that I look at feature flags … We have feature flags that sit in a bunch of different places. We have feature flags of, “Do I enable this on this machine?” We have feature flags for, “Do I enable it for this customer?” Then we have feature flags for, “Do we randomly enable this across the customer base and what proportion do we use?” They’re all some form of AB testing. They’re all some form of feature flag. I don’t see a distinction between them at all.
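
The three flag shapes Paul lists, per-machine, per-customer, and a random percentage across the customer base, can be sketched together. This is an illustrative toy, not LaunchDarkly's implementation; the hashing trick is one common way to make the "random" percentage bucket stable per customer.

```python
# Sketch of three flag targeting shapes: per-machine, per-customer,
# and a stable percentage rollout. Names are made up for the example.
import hashlib

def in_percentage_rollout(flag: str, customer_id: str, percent: int) -> bool:
    # Hash flag + customer so each customer lands in a fixed bucket 0-99;
    # the same customer always gets the same answer for a given flag.
    digest = hashlib.sha256(f"{flag}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

def flag_enabled(flag, machine=None, customer=None,
                 machines=(), customers=(), percent=0):
    if machine in machines and machine is not None:
        return True                      # per-machine targeting
    if customer in customers and customer is not None:
        return True                      # per-customer targeting
    if customer is not None and in_percentage_rollout(flag, customer, percent):
        return True                      # percentage rollout
    return False

# A 50% rollout hits roughly half of customers, deterministically.
sample = [in_percentage_rollout("new-search", f"cust-{i}", 50)
          for i in range(1000)]
print(sum(sample))  # roughly half of 1000
```

Seen this way, an AB test is just the percentage-rollout shape with analytics attached, which is Paul's point that they are the same mechanism.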

Edith:     I think because in customers’ minds they think of AB testing as some sort of statistical thing, where they’ll get a deterministic result. Whereas, sometimes they’re using a feature flag just to roll out a new feature and expose it to some users.

Paul :     Would you classify it as an AB test if it’s statistical and not an AB test if it’s not statistical?

Edith:     Paul, I actually agree with you. I was just saying, I’ve talked to a lot of customers and when you say, “AB testing,” it puts a lot of fear into them that they have to …

Paul :     Right, right. I agree that a lot of people see a distinction between AB tests and feature flags. I think my point is this. I’ve come to the conclusion that there is no distinction.

Edith:     I think we actually agree, after all.

Paul: Thanks for listening to this episode of “To Be Continuous,” brought to you by Heavybit and hosted by me, Paul Biggar of CircleCI and Edith Harbaugh of LaunchDarkly.