09 Feb 2018

Visibility and Monitoring for Machine Learning Models

Josh Wills, an engineer at Slack, spoke at our January MeetUp about testing machine learning models in production. (If you’re interested in joining this MeetUp, sign up here.)

Josh previously worked as the Director of Data Science at Cloudera, wrote the Java version of Google’s A/B testing framework, and recently held the position of Director of Data Engineering at Slack. On the subject of machine learning models, he thinks the most important question is: “How often do you want to deploy this?” You should never deploy a machine learning model once. If the problem is not important enough to keep working on it and deploy new models, then it’s not important enough to pay the cost of putting it into production in the first place.

“The tricky thing, though, is in order to get good at machine learning, you need to be able to do deploys as fast as humanly possible and repeatedly as humanly possible. Deploying a machine learning model isn’t like deploying a regular code patch or something like that, even if you have a continuous deployment system.” -Josh

Watch his entire talk below.

TRANSCRIPT

How’s it going, everybody? Good to see you. Thanks for having me here. A little bit about me, first and foremost. Once upon a time, I was an engineer at Google. I love feature flags, and I love experiments. I love A/B testing things. I love them so much that I wrote the Java version of Google’s A/B testing framework, which is a nerdy, as far as I know … I don’t know. Does anyone here work at Google? Any Googlers in the audience? I know there’s at least one because my best friend is here, and he works at Google. As far as I know, that is still used in production and probably gets exercised a few trillion times or so every single day, which is kind of a cool thing to hang my nerd hat on.

I used to work at Cloudera, where I was the director of data science. I primarily went around and talked to people about Hadoop and big data and machine learning and data science-y sorts of things. I am Slack’s former director of data engineering. I’ve been at Slack for about two and half years. I am a recovering manager. Any other recovering managers in the audience? I was going up the management hierarchy from first line management to managing managers, and I started to feel like I was in a pie eating contest, where first prize is more pie. I didn’t really like it so much. Wanted to go back and engineer. So about six months ago, I joined our machine learning team, and now I’m doing machine learning-ish sorts of things at Slack as well as trying to make Slack search suck less than it does right now. So if anyone’s done a search on Slack, I apologize. We’re working hard on fixing it.

That’s all great, but what I’m really most famous for … Like most famous people, I’m famous for tweeting. I wrote a famous tweet once, which is a proper, defensible definition of a data scientist: someone who is better at statistics than any software engineer and better at software engineering than any statistician. That’s been retweeted a lot and is widely quoted and all that kind of good stuff. Are there any … Is this sort of a data science, machine learning audience or is this more of an engineering ops kind of audience? Any data scientists here? I’m going to be making fun of data scientists a lot, so this is going to be … Okay, good. So mostly, I’ll be safe. That’s fine. If that guy makes a run at me, please block his way.

So anyway, that’s my cutesy, pithy definition of what a data scientist is. If you’re an engineer, you’re sort of the natural opposite of that, which is someone who is worse at software engineering than an actual software engineer and worse at statistics than an actual statistician. That’s what we’re talking about here. There are some negative consequences of that. Roughly speaking, at most companies, San Francisco, other places, there are two kinds of data scientists, and I call them the lab data scientists and the factory data scientists. This is my own nomenclature. It doesn’t really mean anything.

So you’re hiring your first data scientist for your startup or whatever. There’s two ways things can go. You can either hire a lab data scientist, which is like a Ph.D., someone who’s done a Ph.D. in statistics or political science, maybe, or genetics or something like that, where they were doing a lot of data analysis, and they got really good at programming. That’s a fairly common data science background. A lot of people end up that way. That wasn’t how I ended up. I’m in the latter category. I’m a factory data scientist. I was a software engineer. I’ve been a software engineer for 18 years now. I was the kind of software engineer when I was young who was reasonably smart and talented but not obviously useful. I think we all know software engineers like this, clearly smart but not obviously useful, can’t really do anything. This is the kind of software engineer who ends up becoming a data scientist because someone has an idea of hey, let’s give this machine learning recommendation engine spam detection project to the smart, not obviously useful person who’s not doing anything obviously useful and see if they can come up with something kind of cool. That’s how I fell into this field. Those are the two kinds. You’ve got to be careful which one you end up with.

Something about data scientists and machine learning: all data scientists want to do machine learning. This is the problem. Rule number one of hiring data scientists: anyone who wants to do machine learning isn’t qualified to do machine learning. Someone comes to you and is like, “Hey, I really want to do some machine learning.” You want to run hard the other direction. Don’t hire that person, because anyone who’s actually done machine learning knows that it’s terrible, it’s really the absolute worst. So wanting to do machine learning is a signal that you shouldn’t be doing machine learning. Ironically, rule two of hiring data scientists: if you can convince a data scientist that what they’re doing is machine learning, you can get them to do anything you want. It’s a secret manager trick. It’s one of the things I learned in my management days.

Let’s talk about why, briefly. Deep learning for shallow people like ourselves. Deep learning, AI, big stuff in the news. I took a snapshot here of the train from my favorite picture, “Back to the Future, Part III,” a truly excellent film. Machine learning is not magic. Machine learning is, it’s basically the equivalent of a steam engine. That’s really what it is, especially deep learning in particular. What machine learning lets us do is stuff that we could’ve done ourselves, manually, by hand over the course of months or years, much, much, much faster in the same way a steam engine lets us move a bunch of rocks from point A to point B. It’s not something we couldn’t do. We knew how to move a bunch of rocks from point A to point B. That’s how we built the pyramids and stuff like that. But this lets us do it much, much faster and much, much cheaper. That’s what machine learning fundamentally is.

There are consequences of that. One of the nasty consequences of it: machine learning … There’s a great paper that I highly recommend you read by this guy named D. Sculley, who is a professor at Tufts, engineer at Google. He says machine learning is the high interest credit card of technical debt, because machine learning is basically spaghetti code that you deploy on purpose. That’s essentially what machine learning is. You’re taking a bunch of data, generating a bunch of numbers, and then putting it into production intentionally, and then trying to reverse engineer how this thing actually works. There are a bunch of terrible downstream consequences to this. It’s a risky thing to do. So you only want to do it when you absolutely have to.

Lab data scientists want to do machine learning. Factory data scientists want to do machine learning. Their backgrounds mean they have different failure modes for machine learning. There’s a yin and yang aspect to it. Lab data scientists are generally people who have a problem with letting the perfect be the enemy of the good, broadly speaking. They want to do things right. They want to do things in a principled way. They want to do things the best way possible. Most of us who live in the real world know that you hardly ever have to do things the right way. You can do a crappy Band-Aid solution, and it basically works. That’s the factory data scientist attitude. The thing about people who want to do things perfectly is that they don’t really know anything about visibility and monitoring; despite knowing a bunch of stuff about linear algebra and tensors, they don’t know how to count things. But the good news is you can teach them how to use Graphite and Grafana. You can teach them how to use Logstash. They can learn all these kinds of things, and they want to learn, and they have no expectation that they know what they’re doing, so they’re very easy to teach. That’s a good thing.
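
To make that concrete, here is a minimal sketch of the kind of instrumentation you might teach a modeling team to add: counting predictions and errors and timing each model call through a StatsD client that feeds Graphite and Grafana. It assumes the statsd Python package and a local agent on the default port; the metric names and the model object are hypothetical, not anything from the talk.

```python
# Minimal sketch (assumed setup): emit model-serving metrics to a local
# StatsD agent that forwards to Graphite/Grafana. Metric names and the
# model object are hypothetical.
import time
import statsd

metrics = statsd.StatsClient("localhost", 8125, prefix="ml.spam_model")

def score(model, features):
    start = time.time()
    try:
        prediction = model.predict(features)
        metrics.incr("predictions")   # how often the model is being called
        return prediction
    except Exception:
        metrics.incr("errors")        # failures show up on a dashboard, not just in logs
        raise
    finally:
        # latency of every call, so a slow new model is visible immediately
        metrics.timing("latency_ms", (time.time() - start) * 1000)
```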

Factory data scientists have the opposite problem. They’re very practical. They’re very pragmatic. So they’ll build things very quickly in a way that will work in your existing system. However, they overestimate their ability to deploy things successfully the way most not obviously useful software engineers do. As a result, they are much more likely to completely bring down your system when they deploy something. So that’s what you want to watch out for there.

Another really great paper, “What’s your ML test score? A rubric for production ML systems.” I love this paper. This is a bunch of Google people who basically came up with a checklist of things you should do before you deploy a machine learning system into production. I love it. Great best practices around testing, around experimentation, around monitoring. It covers a lot of very common problems. My only knock against this paper is they came up with a bunch of scoring criteria for deciding whether or not a model was good enough to go into production that was basically ludicrous. So I took their scoring system and redid it myself. So you’ll see down there, if you don’t do any of the items on their checklist, you’re building a science project. If you do one or two things, it’s still a science project. Three or four things are a more dangerous science project. Five to 10 points, you have the potential to destroy Western civilization. And then finally, once you do at least 10 things on their checklist, you’ve built a production system. So it’s kind of a u-shaped thing.

This is a great paper. If you have people at your company who want to deploy machine learning into production, highly, highly recommend reading it and going through it and doing as much of the stuff they recommend as you possibly can. More than anything, for the purposes of this talk, I want to get you in the right headspace for thinking about what it means to take a machine learning model and deploy it into production. The most important question by far when someone wants to deploy a machine learning model is, how often do you want to deploy this? If the answer is once, that is a bad answer. You should never deploy a machine learning model once. You should deploy it never or prepare to deploy it over and over and over and over and over again, repeatedly forever, ad infinitum.

If the problem is not important enough to keep working on it and keep deploying new models, it’s not important enough to pay the cost of putting it into production in the first place. That’s thing one. The tricky thing, though, is in order to get good at machine learning, you need to be able to do deploys as fast as humanly possible and repeatedly as humanly possible. Deploying a machine learning model isn’t like deploying a regular code patch or something like that, even if you have a continuous deployment system. The analogy I would use is it’s kind of like someone coming to you and saying, “Hey listen. We’re going to migrate our database system from MySQL to Postgres, and then next week, we’re going to go back to MySQL again. And then the week after that, we’re going to go back.” And just kind of like back and forth, back and forth. I’m exaggerating slightly, but I’m trying to get you in the right headspace for what we’re talking about here. Different machine learning models are basically systems that are complicated and opaque, that are nominally similar to each other but slightly different in ways that can be critically bad for the overall performance and reliability of your systems. That’s the mentality I want you to be in when it comes to deploying machine learning models. Think about it that way.

The good news is that we can all stop worrying and learn to love machine learning, whatever the line is from “Dr. Strangelove,” that kind of thing. You get good at this kind of stuff after a while, and it really … I love doing machine learning, and I love doing it in production in particular, because it makes everything else better: the standards around how you operate, how you deploy production systems, how you test, how you monitor have to be so high just across the board for regular stuff in order to do it really, really well. Despite all the horrible consequences and the inevitable downtime that the machine learning engineers will cause, I swear, I promise, it’s ultimately worth doing, and in particular, companies should do it more so I get paid more money to do it. That’s kind of a self-interested argument.

If you like to do monitoring, if you like to do visibility, if you like to do DevOps stuff in general and you want to do it at a place that’s done it really, really well, slack.com/jobs. Thank you very much. I appreciate it.

07 Feb 2018

When a Necessary Evil becomes Indispensable: Testing in Production at Handshake

In January at our Test in Production MeetUp, we invited Maria Verba, QA Automation Engineer at Handshake, to talk about why and how her team tests in production. (If you’re interested in joining us, sign up here.)

At Handshake, Maria is focused on making sure her organization ships a product that customers love to use, as well as supporting the engineers with anything test related. In her presentation, she discusses the requirements her team has put in place to test safely. She also covers some specific examples of when testing in production is better than in pre-production.

“Whenever I hear ‘test in production,’ I definitely don’t cringe anymore. But I know that there are right ways to do this, and wrong ways to do it. Right ways will always include a healthy ecosystem. A healthy ecosystem implies you have good, thorough tests. You have feature toggles. You have monitoring in place. We have to consider our trade-offs. Sometimes it does make more sense to create a mock production environment, sometimes it doesn’t. There are definitely things that cannot be tested in pre-production environments.” -Maria

Watch her talk below.

TRANSCRIPT

Hello everybody! Again, good evening. Thank you for having me. My name is Maria Verba. I work at a company called Handshake here in San Francisco.

Today, I’m going to talk to you about how we do, and why we do, testing in production: my own personal journey from thinking about it as a necessary evil, something that you have to do but you don’t want to do, to thinking about it as an indispensable, quite valuable tool.

To start off, let’s talk a little bit about Handshake. Most of you have probably never heard of it. It’s a small company, it’s understandable. Handshake is a career platform for students that helps connect students with employers, and helps students find meaningful jobs, and jump-start their careers. At Handshake, we believe that talent is distributed equally, but opportunity is not, so we work very hard to bridge that gap, and to help students find their dream jobs, and make a career out of it.

My role at Handshake is quality assurance engineer. Besides making sure that we ship very good products that our customers love to use, I think that my responsibility as a QA engineer is also supporting the engineering team with everything and anything test-related, test infrastructure, test frameworks, as well as keeping their productivity and happiness high.

With my background with being in QA for about six years, whenever I heard “Oh, let’s just test this in production,” this is pretty much the reaction I would get. No! No way! Let’s not do that. Why don’t we want to do this? First of all, I do not like customers to find any issues. I do not want to expose them to any potential even small bugs. Yeah, let’s not test in production.

The second thing is I personally create a lot of gibberish records. I don’t care when I test. Sometimes it makes sense, sometimes it doesn’t make sense. I wouldn’t want anybody to see those records. Besides, it has the potential to mess up reporting and analytics. It’s also very, very easy to delete records and modify records, so making a mistake here is very costly. It’s very expensive, and the engineers will probably spend a couple of hours trying to dig out a recent database backup and restore it. So yeah, let’s not do that in production. The last one is data security. We could actually expose some records to people who shouldn’t see them.

Then I started realizing that maybe it’s not actually that bad. Sometimes we have to make the sacrifice, and sometimes we have to test in production. I thought, “Yeah, maybe it’s okay as long as we have automated testing.” Before deploying, we have something that verifies that our stuff works. We also want to make sure that we have a very good early detection system, monitoring and alerting, which will find the mistakes that we make before our customers do. We absolutely have to put it behind a feature toggle. There is absolutely no way around it. Even if there are no issues with our new feature, we do not want to surprise our customers. We do not want to throw a new feature at them and say, deal with it. We want to have that material ready for them before they see it.

In recent times, the way that we’ve done it, it has historically happened that most of our features were read-only. We would not create any new records. We would not modify any records. We also made some trade-offs. We thought about it. We realized that yes, it is possible to create those separate production-like environments and test there, but for us, it’s not realistic to do that. We move very fast. We don’t have that many engineers to spend time creating new environments. Basically, it’s a very expensive process for us. We really don’t want to do it. Besides that, we are very careful about our data records, and we take security very seriously. We constantly do data privacy training. We have multiple checks in place in our code. This helps us safeguard against any issues.

Here are a couple of examples of where we did test in production in the past. When we needed real-life traffic: general performance testing and performance monitoring on a daily basis. We also needed our production data; sometimes it’s very hard to reproduce and replicate the same data that we have in production. Examples would be the Elasticsearch 5 upgrade and the job recommendation service. Lastly, we do a lot of A/B testing and experimentation at Handshake. We validate our product ideas.

I’m going to talk a little bit more in detail about a couple of these examples here. The Elasticsearch 5 upgrade was a pretty good example because, first of all, why did we need to do this? We rely heavily on our search engine. Search-based features are pretty much our core. Whenever students log in, most of the time, what they want to do is find a job. Our search service cannot be down. However, we were using Elasticsearch 2, which became deprecated. That put us in a position where we had to switch to Elasticsearch 5 very quickly.

Let’s talk about why we can’t really test this in production. The update is too large. Basically, we cannot afford to make any mistakes here. It touches everything. It has the potential to affect every single customer we have, every single user we have. It’s extremely important for us to keep the features functionally working.

How did we do this? First of all, it took us a while to figure out where Elasticsearch versions 3 and 4 went. Then we created new classes, the classes for the new search syntax. We started the upgrade by changing the syntax for our… We put those classes, corresponding to each index, behind a feature toggle, and we thoroughly tested them with our unit and integration frameworks. We rolled it out to production incrementally, index by index, and we tested it in production, because production has the amount of data that we needed. We were able to verify that we get correct results after the update.
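
As an illustration of the rollout pattern she describes (not Handshake’s actual code), here is a hypothetical sketch of routing one index’s queries to the new Elasticsearch 5 classes behind a per-index feature toggle, so the upgrade can be rolled out, and rolled back, one index at a time. The flag name, clients, and query builders are assumptions.

```python
# Hypothetical sketch of a per-index toggle for the Elasticsearch 5 upgrade.
# `flags`, the two index clients, and the query builders are stand-ins.

def search_jobs(query_text, student, flags, es2_jobs_index, es5_jobs_index):
    # The toggle is checked per index, so the rollout (and any rollback)
    # happens index by index.
    if flags.is_enabled("es5-jobs-index", user=student):
        return es5_jobs_index.search(build_jobs_query_es5(query_text))
    return es2_jobs_index.search(build_jobs_query_es2(query_text))

def build_jobs_query_es2(text):
    # Placeholder for the query built with the old Elasticsearch 2 classes.
    return {"query": {"match": {"title": text}}}

def build_jobs_query_es5(text):
    # Placeholder for the query built with the new Elasticsearch 5 classes;
    # results are verified against production data before widening the rollout.
    return {"query": {"match": {"title": text}}}
```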

The second case study is slightly different. The job recommendation service is … at Handshake, we use machine learning. We apply machine learning to recommend jobs to students based on a student’s behavior, their major, skills, maybe previous job history, job viewing history, location interests, etc. We take that whole melting pot of information, and we suggest a job to a student based on that. The question that we were trying to ask here is “Does this job make sense to this particular student?” A lot of times, you cannot go through all of those possible combinations and computations to answer that question.

To recreate that environment, it would take us, again, a lot of time and effort. We, again, rolled out incrementally. We heavily used our interns, because they’re still in school. After verifying that the jobs make sense, we started rolling it out to other schools as well.

Talking about these two examples, I can hear the question boiling in the audience: you could still do this. You could still create a mock environment and do the same kind of testing not in production. You could have done it pre-production. And yes, that’s true. I agree with that. We had to make those choices based on what our business needs are, based on our speed. We could not afford spending time and effort in setting up, maintaining, and securing those environments. But yeah, it is possible.

However, what if we changed the algorithm behind the job recommendation service? How do we make sure that those jobs are still relevant to the students? Do we ask the students after we change everything? I don’t know.

Here is where our experimentation framework came in. We use experiments and A/B testing a lot at Handshake. We pretty much test every major product hypothesis that we have. For us, it’s a two-step process. We have impressions and conversions. An impression is a call to action, or a button on a prompt, something that the user sees. And then a conversion is an action taken based on seeing that prompt. An example would be showing a profile completion prompt to a student on the dashboard; that would be an impression. A conversion would be the student filling out their profile.

The way that we do it is, first of all, we determine which users, which students, we want to target. We separate them into a different pool. Sometimes we want our experiment to work only for, or to target, only seniors, for example. Based on that, we remove other students from the test group. Then we split it into treatment and control groups, and we track impressions and conversions for both the control and treatment groups. This allows us to get very accurate results.
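
To make the two-step flow concrete, here is a minimal sketch under stated assumptions: filter to the target population, split students deterministically into control and treatment, and record impressions and conversions for both groups. The hashing split and the analytics object are illustrative, not Handshake’s implementation.

```python
# Illustrative sketch of targeting, splitting, and tracking. The hashing
# split and the `analytics` object are assumptions, not Handshake's code.
import hashlib

EXPERIMENT = "profile-completion-prompt"

def bucket(student_key, treatment_pct=50):
    # Deterministic hash: the same student always lands in the same group.
    digest = hashlib.sha1(f"{EXPERIMENT}:{student_key}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 100 < treatment_pct else "control"

def maybe_show_prompt(student, analytics):
    if student["user_type"] != "student" or not student["is_senior"]:
        return False  # outside the targeted pool, not part of the experiment
    group = bucket(student["key"])
    # Impressions are tracked for both groups so the comparison is fair.
    analytics.track(student["key"], "impression",
                    {"experiment": EXPERIMENT, "group": group})
    return group == "treatment"  # only the treatment group sees the prompt

def on_profile_completed(student, analytics):
    analytics.track(student["key"], "conversion",
                    {"experiment": EXPERIMENT, "group": bucket(student["key"])})
```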

We are using LaunchDarkly to do this. We’re using targeting rules. As you can see here, maybe it’s hard to see, but basically, we’re specifying the user type as student, thus removing every other user type; maybe there are administrators or something like that. And then we say we want 5% of students to be on variation A of the feature, and 95% on variation B.

Unfortunately, we can’t do all of it in LaunchDarkly. Some of our experiments, maybe a good chunk of our experiments, rely on more dynamic information, things that change over time. For example, a student has seen a job recently: we have to define “recently,” and we have to define what exactly we want to track about that job. In order to solve this problem, we create records for LaunchDarkly, and this allows us to add additional conditions and configurations, and split our user groups into more detailed cohorts.
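
Here is a rough sketch of what evaluating such a flag might look like with the LaunchDarkly Python SDK of that era (the user-dict API); the percentage split and the user-type rule live in the LaunchDarkly dashboard, and the code only supplies the attributes the rules need. The flag key, attribute names, and the dynamic “days since last job view” attribute are hypothetical.

```python
# Rough sketch, assuming the LaunchDarkly Python SDK's user-dict API of that
# era. Flag key and attribute names are hypothetical.
import ldclient
from ldclient.config import Config

ldclient.set_config(Config("YOUR_SDK_KEY"))
client = ldclient.get()

def recommendation_variation(student, days_since_last_job_view):
    user = {
        "key": student["id"],
        "custom": {
            "userType": "student",
            # A dynamic attribute computed at request time lets the targeting
            # rules split students into more detailed cohorts.
            "daysSinceLastJobView": days_since_last_job_view,
        },
    }
    # Returns whichever variation the dashboard rules assign; the last
    # argument is the fallback if the flag can't be evaluated.
    return client.variation("new-job-recommendation-algorithm", user, False)
```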

As a result, we get a very, very powerful experimentation framework, and we can only do this in production. There is no way around it.

Going back to my personal QA perspective and key takeaways. Whenever I hear “test in production,” I definitely don’t cringe anymore. But I know that there are right ways to do this, and wrong ways to do it. Right ways will always include a healthy ecosystem. A healthy ecosystem implies you have good, thorough tests. You have feature toggles. You have monitoring in place. We have to consider our trade-offs. Sometimes it does make more sense to create a mock production environment, sometimes it doesn’t. There are definitely things that cannot be tested in pre-production environments.

That also is in line with my personal goals of, first of all, keeping our customers happy and the product stable. Our customers would love to use a product that is driven by data, and not by our own hypotheses. I also think that it keeps the engineers on the team productive and happy, because they don’t have to spend time creating and maintaining all of these different environments.

What’s next for us? Canary testing. We want to be able to divert some portion of our traffic to new versions, for example, Rails 5 versus Rails 4. Right now, we are working on moving to Kubernetes to achieve that.

Thank you so much for your time. If you have any questions, find me, I’m going to be around, and reach out. Thank you.

05 Feb 2018

Testing Microservices: A Sane Approach Pre-Production & In Production

At LaunchDarkly we host a monthly MeetUp for anyone interested in testing in production. (If you’re interested in joining us, sign up here.) In January we met at the Heavybit Clubhouse in San Francisco, and had 3 guest speakers talk about how they test in production. Cindy Sridharan (aka @copyconstruct) talked about how she approaches testing microservices—including how to think about testing, what to consider when setting up tests, and other best practices.

Her talk was inspired by an article she wrote in December 2017 on testing microservices. She wrote it in response to an article that showcased a team’s incredibly complex testing infrastructure. The piece described how a team built something that helped their developers run a lot of tests very quickly. And while it helped tremendously, Cindy wondered why it had to be so complex.

“The whole point of microservices is to enable teams to develop, deploy and scale independently. Yet when it comes to testing, we insist on testing *everything* together by spinning up *identical* environments, contradicting the mainspring of why we do microservices.”

Check out her talk below.

TRANSCRIPT

So, my name is Cindy and as Andrea said, you probably best know me from Twitter or from my blog. So, a couple of days ago, probably the last week of 2017, I wrote this blog post called Testing Microservices, the Sane Way. It was a 40-minute blog post, and I am still astounded by how many people read it. I mean, because I personally do not read anything that’s longer than 10 minutes.

I mean, I can write stuff that’s 40 minutes or 60 minutes long, but I can’t read it. I mean, my attention span is like 10 minutes. Right, so kudos to everyone who’s read this. So, like I said, my name is Cindy. You probably know me on Twitter and pretty much everywhere on the internet as copyconstruct, and I’m here to talk about microservices.

The main reason why I wrote this blog post was because I read another blog post where someone described their testing infrastructure. It was a blog post from a company that’s fairly largish, but you know, we’re not talking about Google or Facebook or anything. But pretty largish for a San Francisco company. And you know, I sort of read the whole blog post, and they were talking about how they had this pretty cool infrastructure there.

You know, they had Mesos that was scheduling Jenkins and EC2 and it had like … it was doing bin packing. It was running a million tests, it was like … what else did it do? Let’s see, it was streaming all the logs to Kafka and then sending them all to S3. It was this incredibly complex system that people had built, and basically the whole blog post was about how it greatly helped the developers run millions of tests in a very fast way. Right, which is probably always seen as a positive thing.

When I finished reading that, my reaction was like, oh my God, why is this so complex? Why is it that this industry thinks of testing as this thing that needs to necessarily be done this way? Where you have a system, which is probably even more complex than the actual production system, that you spin up just to run a bunch of tests. Right? So, once I read that I was like, oh my God. I was just completely surprised, I was just pretty unsettled. I was like, why are we doing this? And because this is not just one company. Right? Pretty much everyone attempts to do this, or get somewhere close to this, and bigger companies like Google probably have something an order of magnitude even more complex than this.

At least from my personal experience trying to build these kinds of systems, at least the way I see it, it’s not really the right way to be testing microservices. So, it really started with this conversation that I was having on Twitter. I asked someone, “Okay, so what is the best way in which you test microservices?” Rather, I think the initial conversation was how do you test service mesh architectures, because that is a strong area of interest of mine. So, when I was asking people that, I got a lot of answers, but I got this one reply from someone called David (I believe he works at Amazon; that’s what his Twitter bio says), who mentioned that he’s increasingly convinced that integration testing at scale in a microservices environment is unattainable.

This also completely matches my own experience as well as what I strongly believe. The whole point of microservices is to enable individual teams to build things separately, to ship things separately. Right? To deploy things separately, so that you have these self-contained systems with bounded contexts that really encapsulate a certain amount of business functionality. Yet, when it comes to testing these things, we test it all together. Right? The default is that we need these incredibly complex systems just to integration test all of these services before we can deploy something.

So, before we even get to testing. Another thing that I’ve seen people do, and this is also something that I’ve seen happen at one of the previous companies that I’ve worked at: not just testing microservices, but even when it comes to developing microservices, what a lot of people do is try to replicate the entire topology on developer laptops. Right? So, four or five years ago, the company that I was working at, we tried to do this with Vagrant. The whole idea was that with a single “vagrant up” you should be able to boot the entire cloud up on a laptop. These days pretty much everyone’s doing Docker Compose.

So, the thing is, in my opinion, this is really the wrong mindset. Not only is it not scalable, and not only is it completely … It’s absolutely nothing like a cloud environment. Right? It’s also really the wrong mindset to be thinking about it in. When I had this blog post reviewed by someone called Fred Hebert, one of the things he pointed out was that trying to boot a cloud on your laptop is almost like you’re supporting the worst possible cloud provider ever. That’s, you know, a single laptop. Right? Because that’s what you’re trying to replicate. When you try to boot all of, say, a dozen microservices on a laptop, that sort of becomes the worst possible cloud ever.

Even your actual cloud is going to be far more effective than that. So, why are we even doing this? It’s because, even when we build microservices, we are really building distributed monolithic applications. Right? I mean, this is kind of what we’re doing. The whole idea that we need … If we cannot decouple these different boundaries when we’re developing and when we’re testing, then we cannot expect that when we’re running these things together, the decoupling that we’re trying to achieve will actually work, because from the ground up we’re doing it wrong.

So, how do we do it right? Right? If I’m saying, okay, this is all wrong, then what is the right way to do it? Well, a couple of caveats: my background is startups. I’ve really always stayed at startups; we’re not even talking about the unicorns. We’re talking about companies that are small and scrappy and, you know, almost always resource constrained. Where we have always had to make smart bets and smart choices, and where we’ve always had to do a lot with a little. What that means is that we’ve almost never had the resources to just go out and do what some of these companies who blog about these things do. Right? What we’ve always had to do is think about the opportunity cost, the maintenance cost, as well as what the trade-offs and the payoffs are before we make any sort of decision. This also applies to testing.

So, you know, I was just drawing this out when I was writing this blog post, just the taxonomy of testing. Right? How do we really think of testing? For a lot of people, especially a lot of software engineers, the way they think of testing is that you write a bunch of automated tests, and pre-production testing is the only thing that they really focus on, and even then you don’t focus on all of this. It’s primarily just writing automated tests, writing tests in the form of code. Right? One of the reasons I even created this illustration is because we really have to see testing as a spectrum. Not as this thing that you do when you write code, this thing that you run every time you create a branch, this thing that needs to go green before you merge and deploy; it needs to be seen as a spectrum.

It really starts when you start writing code, and it doesn’t really end until you’ve decommissioned that code from production. Right? It’s a continuum, and it’s not this sort of thing that you just do once. Once we start thinking of testing in that manner, and especially with this other paradigm that I’m seeing, in San Francisco at least, this big San Francisco bubble that I’m very cozily ensconced in, which is that software engineers are now … Back in the day you probably had testers. Right? You had QA engineers and software engineers in test. I’m guessing bigger companies still have this, where the developer writes the code, and the engineers in test actually write the tests. Then, you know, if it was a service, you handed it over to someone in operations to actually run it.

This is not … I’ve never done this at any company that I’ve worked at. I’ve always been the person writing the tests, you know, deploying the code, monitoring, and also rolling back and firefighting. I agree that this is unusual and this probably doesn’t apply to everyone, but that is the position from which a lot of my opinions are formed. Right? So, when I think of it that way, I personally find this model to be incredibly powerful, because for the first time ever it allows us to think about testing in a very holistic manner. It allows us to think about testing across the breadth of it and not just in these little silos.

So, one of the reasons I even created this was because I sort of wanted … I primarily see all of these mechanisms, all of these techniques, as a form of testing. Right? You know, again, the distinction isn’t quite as binary. Like, you could probably be doing some sort of A/B testing in your test environment, you’ll probably do, you know, profiling in production. So, this is a very fluid border here, but this is how I see testing. I don’t think anyone here, at least anyone that I know of, does all forms of testing. Really the goal is to pick the subset of testing techniques that we need given our application’s requirements.

For really safety-critical and health-critical systems, for systems that deal with money, probably the vast majority of testing is going to be done pre-production. I personally only have experience running backend services, running services whose deployment I control. Which means that if something goes bad, I can immediately roll back. Right? That is super powerful, and that is not something that a lot of people actually enjoy, because, I guess, if you’re shipping a desktop application, you ship it to people, and if you ship it with a bug then, unless they all upgrade or change their thing, it’s not going to change. Right?

If you’re just deploying a service, and you’re controlling the deployment, you can ship something, and if it’s not great, then you roll it back. Right? So, when you think of testing services, at least from my perspective, I tend to put testing in production a lot higher in the priority list than I necessarily do pre-production testing. So, the other thing that I’ve been thinking about was this thread by Sarah, where she was responding to someone else and saying that, you know, there’s probably a lot of code that you don’t necessarily have to test, that you can just deploy. Right?

This was pretty controversial when it came out, because a whole bunch of people were like, don’t ever do this. Right? The reason why I felt that there was this reaction to what she was saying was because a lot of software engineers, what we’re taught at our mother’s knee, is that testing is a good thing. We have to be testing, and if you don’t test (and what I mean by test is a very narrow definition of test), then you’re doing it wrong. Right? Which means that people find it really hard to even accept that automated pre-production testing can actually sometimes be insufficient, to the point where it can sometimes be entirely useless.

I think that really was the reason why there was this huge uproar to what Sarah was saying: people are stuck in a different mindset. For us to really embrace this more holistic form of testing, it really becomes important to start thinking about it differently. What I mean by thinking about it differently is that I think reliability should be … it’s not an SRE concern. I personally see it as a software engineering concern. I don’t personally think that the code that I write is going to be made any more reliable by hiring an SRE team. You know? I think that this whole top-down approach where you have certain people writing the code and certain people making it reliable, I don’t think that works, and I really think reliability is a feature of software. Right?

Like testing, it’s something that you build into the code that you write, so that then you can understand what it’s doing in a lot better way. Understanding what it’s doing is also a form of testing, because we’re just seeing, do our assumptions hold? You know, everything that we based our system on, does that all still hold water? Right? So, this is again something that I very strongly believe in: that to be able to push certain amounts of regression testing to post-production monitoring requires not just a change in mindset and a certain appetite for risk, obviously, but it really requires software engineers to code for failure.

I think that is the thing that I’m not seeing enough, and to be honest I don’t do it enough. This is like, I’m not saying I’m doing everything perfectly. I probably don’t do this well enough, and that’s probably something that I would like to see, not just me, but also the general industry, just do better. It involves a whole different mindset. It really involves thinking from the ground up about how we build software. Again, this is probably true for all kinds of software, but this is especially true for microservices. Right? Because these crazy service-oriented architectures that we’re building really amount to incredibly complex systems that are always going to exist in some form of unhealthiness. Right?

It’s almost like saying unhealthy is the new healthy, because we, you know, have all of these crazy mechanisms like retries and circuit breaking and multi-tiered caching and eventual consistency and all these relaxed guarantees just to make sure that our service is going to be sort of healthy. So, we’re by default in a mode where we are thinking in a way where unhealthy is really the norm, but when we test, we test as if healthy is the norm. Right? Which is really kind of bizarre when you think about it. So, again, something that Fred Hebert said was that when you have a large-scale system, if your tests require 100% health, then you know you’ve got a problem, because you’re not really testing the system. You’re not even testing what you’re building for; you’re really testing something totally different, which is the success case, when what you’re actually doing is architecting for failure.

That’s completely bizarre to me. Right? So, code accordingly. This was another tweet that I saw, and I think I retweeted this. Failure is going to happen, it may not be something you did, it’s probably something your cloud provider did. The reason is not the deployment, there is no internet. Code accordingly. So I see a lot of these. Whenever I see tweets like this my initial instinct is like, hell yeah, this is so true. My second instinct is like, fuck no, this is not true. What is this person even talking about? This is so hand wavy, it’s like, yeah, code accordingly. Like what am I supposed to do? I used to ask these questions a couple of years ago. Like what the hell am I supposed to do? Give me answers, don’t just give me these little thought leadery sound bites that then get fucking tweeted a hundred times. Like, what am I supposed to be doing? Like, code accordingly.

I mean, I’m ranting here, but I mean, this is what I personally feel. So, you know, the more I think about it, there are certain things that I feel software engineers can really do, really try to understand, when they’re actually coding in order to code accordingly. So what does that mean? Well, in my opinion, it comes down to three things. It really comes down to coding for debugging: being able to instrument your services well enough so that when you’re actually debugging something, you’re not starting with a conjecture, you’re not starting with a fixed hypothesis, but you can ask questions of your system. You can ask questions and then refine your hypothesis based on the answers.

Keep asking questions, keep asking questions until you come to something which is probably actually really close to the root cause. Rather than looking at the wrong things, or just trying to fit this failure into the mental model that you already have built. Because you built the software, you have this mental model of how it should work, and something is failing, and then you try to retrofit that onto the mental model that you have, which, in my opinion, is completely the wrong way to do it. So, number one: instrument your services better. Number two: understand the operational semantics of the application. How is it deployed? How is a process started on a new host? Because if you think about it, a lot of outages are actually deployment outages. It’s basically bad code getting deployed for some reason.
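
To make the first point, instrumenting so you can ask questions later, a bit more concrete, here is a minimal sketch using only the standard library: every request gets an ID that is propagated downstream, and one structured log line with enough fields (route, outcome, latency) to slice by afterwards. The field names and handler shape are illustrative assumptions, not from the talk.

```python
# Minimal sketch of instrumenting for debuggability: a request ID plus one
# structured log line per request. Field names and handler shape are
# illustrative.
import json
import logging
import time
import uuid

log = logging.getLogger("service")

def handle_request(route, call_downstream):
    request_id = str(uuid.uuid4())
    start = time.time()
    outcome = "ok"
    try:
        # Propagate the ID so downstream services can be queried by it later.
        return call_downstream(request_id)
    except Exception as exc:
        outcome = type(exc).__name__
        raise
    finally:
        log.info(json.dumps({
            "request_id": request_id,
            "route": route,
            "outcome": outcome,
            "latency_ms": round((time.time() - start) * 1000, 1),
        }))
```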

Again, a lot of this code gets deployed after going through these huge, intricate testing pipelines. Then why is it still the case that bad deploys are the number one cause of outages, the number one cause of services going down? Right? It’s again because when you’re coding, and also when you’re testing, what you’re really validating is your flawed mental model, not the reality. Right? I mean, does anyone disagree with that? No? Okay, cool, because that’s what I strongly believe: you still have these outages even when you do everything correctly, and that’s because what you’re really testing is not the reality; what you’re testing is just your bias. Right?

It’s just this big form of confirmation bias. That’s what you’re really dealing with. So, you know, just being able to understand the operational semantics, I personally feel, allows us to think about how services fail in a better way. Which means, again: what process manager are you using? What’s the restart mechanism? How many times is it going to try to restart? What’s your deployment mechanism? Is it some sort of blue-green deploy, or are you doing some sort of red-black deploy? Are you doing rolling deploys? What are the release engineering practices that are being used? How does that lead to failure?

There’s one more, in my opinion, and that really comes down to the operational semantics of the dependencies you use. Right? We build on top of a crazy amount of dependencies. That’s, in my opinion, at least what I’ve mostly been doing: gluing together a bunch of dependencies. Basically it’s just some libraries, and then you have this huge tower of abstraction, and on top of that we build a little tiny abstraction. The best way to deal with abstractions, at least in my opinion, even the leakiest ones, is to have some amount of trust that they’re going to do what it says on the tin. Right?

I mean, it’s not ideal, but if you’re just going to start doubting every single abstraction layer that we’re dealing with, right down to TCP or your ARP packets or something, then we’re probably never going to get anything shipped. So the best way we deal with abstractions is having a certain level of trust in the abstraction, but failure still happens. In my opinion, that happens because we don’t fully understand the boundaries between the abstractions well enough. Spending a little more time there, I personally feel, has helped me understand the services and the systems that I build better.

So, what do I mean by that? Well, a good example would be if you’re using a service like Consul for service discovery. Right? It has certain default consistency modes in which it queries for other services. The default case is strongly consistent, and that’s probably not what you need when you’re trying to do service discovery, because that is fundamentally an eventually consistent problem. Right? Which means that the developer using a Consul library, or whoever provides Consul as a service, has to understand the right defaults. Different people who are using this library need to understand these defaults and be able to reason about whether that is the right default for their needs.
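
To illustrate the default she is pointing at, here is a small sketch against Consul’s HTTP health API: a plain read is served through the leader, while adding the stale parameter lets any server answer, which is usually the right trade-off for service discovery. The endpoint and parameters are Consul’s documented API; the service name is hypothetical.

```python
# Sketch of choosing Consul's consistency mode explicitly when discovering
# healthy instances of a service. The service name is hypothetical.
import requests

CONSUL = "http://localhost:8500"

def discover(service, allow_stale=True):
    params = {"passing": ""}      # only return instances passing health checks
    if allow_stale:
        params["stale"] = ""      # any server may answer; reads may lag slightly
    resp = requests.get(f"{CONSUL}/v1/health/service/{service}",
                        params=params, timeout=2)
    resp.raise_for_status()
    return [
        (entry["Service"]["Address"] or entry["Node"]["Address"],
         entry["Service"]["Port"])
        for entry in resp.json()
    ]

# instances = discover("job-recommendations")
```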

So, just spending some time understanding a lot of these operational semantics, of both the dependencies and the application, and programming for debuggability, is, in my opinion, what denotes coding accordingly, or coding for failure. Another point Sarah made was that the writing and running of tests is not a goal in itself. Tests themselves are, I wouldn’t say useless, but in and of themselves they’re not particularly useful. The reason is, again, as I stated previously, it’s just a big form of confirmation bias: all the tests we write are us predicting beforehand what the failure and success modes are. Tests are only as good as their ability to predict. Right? The reason systems fail is because we’re not able to predict that failure, or we didn’t try to understand it well enough.

So what this means, in my opinion, is that just having more tests doesn’t make your service any better. It just means that you can think of more cases, but there’s probably still a case that’s going to take your service down, and if you don’t understand that, then having a million tests doesn’t help you in any way. Right? So, when we talk about tests, the other thing that really annoys me is this test pyramid. This was probably proposed back in, I don’t know, the early 2000s, along with the whole agile movement and everything. It was probably really small then, so I don’t really know the history of this. Every single testing blog post, every single testing conversation people have, this thing gets wheeled out. Every time I look at it I’m like, okay, yeah, if you have a Rails app or a single Django app, if you have a CRUD app basically, yeah, this kind of makes sense. Right?

When it comes to really dealing with this kind of service, is this the best we can do when it comes to testing? Seriously? Why do we still talk about this thing, right? This is, in my opinion, super old, and this is completely the wrong way to think about testing this kind of service-oriented architecture. This is completely hypothetical, by the way, but it’s not very far removed from some of the topologies that I’ve personally worked with. So, in my opinion, and I’m totally wearing my thought leader hat now, this is what testing should look like.

I see it more as, instead of the old-style pyramid, it’s more like a funnel. I don’t know why I made it a funnel. It could probably have been this big rectangle, but for some reason this is a funnel. In my opinion, to really test these kinds of crazy architectures, we need to think of testing as a spectrum. It just doesn’t end when you write some code. It just doesn’t end when you deploy something. It’s this large, wide spectrum that continues until you really decommission your applications. Right? Again, this isn’t comprehensive in any way, but this is my mental model for testing this kind of system.

You have your unit tests, you have your benchmarking, you have property-based testing, you have your contract testing. All of these are things that you can do in development. Right? You can do all of this without necessarily spinning up this whole crazy topology locally; you can just run one service and still have all these kinds of tests. Then there’s testing pre-production, and personally I believe that massive integration tests are an anti-pattern. Think about it: what we’re really doing when we try to build this whole dev environment and then an identical test environment, or try to keep the test environment as close as possible to production, it’s not really the real thing, and it becomes this huge maintenance burden, and especially if we’re smaller companies, we just can’t do it. It’s just not attainable; it probably ends up requiring a team in its own right just to maintain it. Right?
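
As one example of the in-development techniques just listed, here is a small property-based test sketched with the hypothesis library: instead of hand-picking cases, you state a property and let the library search for counterexamples. The function under test is a stand-in, not anything from the talk.

```python
# Property-based testing sketch with `hypothesis`; merge_sorted is a stand-in.
from hypothesis import given, strategies as st

def merge_sorted(a, b):
    # Stand-in implementation; a real service function would go here.
    return sorted(a + b)

@given(st.lists(st.integers()), st.lists(st.integers()))
def test_merge_keeps_everything_in_order(a, b):
    merged = merge_sorted(a, b)
    assert merged == sorted(merged)            # result is ordered
    assert len(merged) == len(a) + len(b)      # nothing lost, nothing invented
```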

Maintaining all of that is kind of hard, and the fact is, doing it doesn’t give you any more confidence in your system. So, if you’re going to do integration testing in a microservices architecture, then you should be able to test against production. This also has the added benefit that it helps you really think about your boundaries. If you should be able to test a potentially buggy service in production alongside the rest of them, then it really forces you to think about what the boundaries should be and how widespread the blast radius might be.

The goal is that it should be really small, in that even if you push out this really shitty patch, or you push out this big buggy feature, it really shouldn’t take down the rest of your services. Right? If you actually think about it, that’s really the goal of microservices architectures. You have these standalone services that do one thing well, and they don’t really affect, or greatly affect, the other services. So, this is a very positive architectural pressure in my mind: being able to integration test your single service against your production environment.

Then there are config tests, and this is again something that I’ve learned from my operations colleagues: configuration changes can sometimes be more dangerous than a code change. Right? Especially when it comes to things like Nginx configs or HAProxy configs, or, given that we live in the era of service meshes, something like Envoy or Linkerd. The thing about a service mesh is that it becomes this huge sea of failure across your whole entire fleet, because it sits in between every damn service that you have. If you’re not going to test the config changes before pushing out a change to Envoy, it’s not just going to take down one service. It has the potential to take down your entire architecture.
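
A hedged sketch of what a config test can look like in practice: a CI check that shells out to each proxy’s own validation mode before a change ships. nginx -t, haproxy -c -f, and Envoy’s validate mode are the tools’ documented config checks; the file paths here are hypothetical.

```python
# CI-style config test: ask each proxy to validate its own config before the
# change is pushed anywhere. File paths are hypothetical.
import subprocess

CHECKS = [
    ["nginx", "-t", "-c", "deploy/nginx.conf"],
    ["haproxy", "-c", "-f", "deploy/haproxy.cfg"],
    ["envoy", "--mode", "validate", "-c", "deploy/envoy.yaml"],
]

def test_proxy_configs():
    for cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        assert result.returncode == 0, f"{cmd[0]} rejected the config:\n{result.stderr}"
```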

That’s pretty scary to me, and I’m surprised that more people don’t talk about how to build resilient service meshes. That’s something that I’m hoping will happen in the next year or two as this whole paradigm gets more traction. Then there’s shadowing, which is capturing production traffic and actually replaying it against tests. Right? I’m also arguing that we shouldn’t have this whole crazy test cluster in place, which means that integration testing and shadowing sort of combine together. When you’re integration testing with production, don’t generate your own load, don’t try to create your own request-response cycle. Try to use something from production that just happened maybe five seconds ago, because that’s as real as it can get.
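
To sketch the shadowing idea under stated assumptions: take a request production served moments ago, replay it against the candidate build, and compare answers, without the candidate ever serving a real user. Where the recent requests come from (a log, a queue) and the URLs are hypothetical.

```python
# Shadowing sketch: replay a recent production request against a candidate
# and report differences. Hosts and the request source are hypothetical.
import requests

PRODUCTION = "https://search.internal.example.com"
CANDIDATE = "http://search-candidate.internal:8080"

def shadow(recent_request, report_mismatch):
    path, params = recent_request["path"], recent_request["params"]
    prod = requests.get(f"{PRODUCTION}{path}", params=params, timeout=2)
    cand = requests.get(f"{CANDIDATE}{path}", params=params, timeout=2)
    if (prod.status_code, prod.text) != (cand.status_code, cand.text):
        # The candidate's answer is never returned to a user; we only report.
        report_mismatch(path, params, prod.status_code, cand.status_code)
```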

Then there’s testing in production, obviously. There’s A/B testing, feature flagging, canarying, distributed tracing, monitoring. I think monitoring is a form of testing; that’s again why I think software engineers should be doing it. It’s not just an operational concern, it’s understanding what your software is doing in production. Because you can build something, but it is completely constrained by all the biases that you have and your mental models and basically how your team thinks about it. If you think about it, a lot of failed projects are just big architectural flaws.

It’s primarily because people didn’t validate soon enough in production, and it’s probably because people let their biases carry them too far, until they spent too many engineering cycles and too much time building something that’s pretty big. Then it’s probably too flawed, because you didn’t iterate rapidly and you didn’t get feedback from production fast enough. This is a service that I’ve been the developer for, and on call for, for the better part of the last two years. It doesn’t matter what the service really does, but when I develop this thing, I don’t really bring up all these services on my laptop. What I really do is just bring up this one thing and then test against the rest in production.

Surprisingly, it’s worked well enough for us. I mean, there’s never been an outage because I’ve been developing this way. The other pretty weird thing is I’ve worked at companies where people have come and asked us, can you give us your whole service as a Docker container. We’re talking about a SaaS service. Like, can you just give us your service as a Docker container so that we can integration test our services with yours. Most of us these days just integrate with a whole bunch of third-party services. As we should, because if it’s not a core competency for your startup, then you should probably be outsourcing it to someone else. Right?

Whose basic business model, and basically the reason they even exist, is that they’re providing that service. And then it begs the question: how do you even test against these services? Right? In my opinion, the way you test is not for them to give you a container that you then spin up in a test environment. You either need to have better abstractions in place, where you can understand your vendor’s failure modes and program against them, or you just directly make an actual call to this third-party service you’re talking to. Which, again, becomes harder because a lot of services don’t really provide test-mode sort of features, or they don’t really let you test it, because every time you make a call to them you get charged, which kind of sucks.

That’s again a user experience thing, right? In my opinion, not a lot of vendors really think about how someone is going to integrate with them or test against them. So, yeah, that’s pretty much all I’ve got. A lot of these thoughts are probably better explained and better articulated in the blog post, if anyone really wants to spend 40 minutes reading that. Go read it. But yeah. Thank you.

26 Dec 2017

Talking Nerdy with Technically Speaking

Our own developer advocate sat down with the Technically Speaking podcast at the Agile Midwest conference in St Louis, MO. They discussed what a developer advocate is, the LaunchDarkly feature management platform, and how teams use feature flags.

“Decoupling that deployment from the activation of code gives people so much more security and reduces the risk of deployment. I think a lot of reasons people resist continuous deployment is the fear that they could break something and not be able to fix it in a hurry. So with this we’re saying, ‘Deploy all you want, activate when you’re ready.’” – Heidi Waterhouse

Check it out to hear how companies like Microsoft and GoPro are using feature flags for dark launches, internal testing, and kill switches.

TRANSCRIPT

Zach Lanz: Live from St. Louis, Missouri, it’s the Technically Speaking podcast, brought to you by Daugherty Business Solutions. Get ready ’cause it’s time to talk nerdy on the Technically Speaking podcast.

Welcome back in to Technically Speaking. We are talking nerdy today. We are talking all day. We’re live at the Agile Midwest Conference here in beautiful, beautiful downtown St. Charles, Missouri at the St. Charles Convention Center. I mean, I can overlook the river right out the window. It’s picturesque. Today we have quite a collection of people at the forefront of the agile front lines and thought leadership, bringing you some content and their perspectives. This episode is no different because today we have Heidi Waterhouse. Heidi, welcome to the podcast.

Heidi Waterhouse: Thank you. I’m excited to be here.

ZL: Listen to your adoring fans.

HW: My adoring fans.

ZL: We might have to shut the door so they leave us alone. So Heidi, you are in town from Minneapolis, coming all the way from Minneapolis, correct?

HW: Yes.

ZL: Just for the conference?

HW: Just for this.

ZL: You had your talk earlier today and I understand that you had a couple technical snafus.

HW: Oh well, I think it’s not really a presentation if something doesn’t go a little weird, but I did get to talk about the things that I cared about so that worked out.

ZL: You are a developer advocate at LaunchDarkly. From what I understand, it’s a company that specializes in feature flags. Tell us a little bit about what a developer advocate is, what you do there, and a little bit about LaunchDarkly.

HW: A developer advocate has three different roles that we serve. The first is I go out to conferences like this. This is my third this month. I have two more to go and I have five next month. I talk to developers about ways that I could improve their workflow and make their lives easier. I also spend a lot of time listening to developers and figuring out what it is that they need so I can take that back to my team and say, “Hey, there’s this unmet need the developers have that we need to look into.” The third is recruiting. I meet a lot of developers and I can say to them, “Hey, we have this opening for a full stack developer in Oakland.” They’re sometimes interested in it. Midwesterners seldom take us up on, “You could move to Oakland.”

When I say LaunchDarkly and feature flags, what I’m talking about is LaunchDarkly is a company that provides feature flags as a service. You wrap your code in a little snippet of our code, connect it to the SDK and then you deploy the whole package and it sits out there like Schrödinger’s code, both live and dead, until you flip the switch to turn it on. Decoupling that deployment from the activation of code gives people so much more security and really reduces the risk of deployment, because I think a lot of the reason people resist continuous deployment is the fear that they could break something and not be able to fix it in a hurry. With this we’re saying, “Deploy all you want, activate when you’re ready.”

ZL: Just generally I guess, I think you touched on it a little bit, but explain a feature flag and what that does.

HW: A feature flag, or a toggle, or feature management, it’s called a bunch of different things, is a way to turn a feature on and off from a central command. We work with Fastly and we have edge servers that help people do this. Imagine you deployed a feature that, say, gave you the ability to do holiday snow on your webpage, like CSS that made your webpage snow. Well, it turns out that has a conflict with some browsers and you’re crashing people’s browsers. You want to be able to turn that off instantly. We like to say that you can kill a feature faster than you can say, “Kill the feature.” It’s about 200 milliseconds. All you have to do is flip the toggle and the feature turns off almost instantly for everybody, instead of having to deploy again and hope that you get it right this time and didn’t leave anything in.

ZL: You mainly work with companies large and small, I would imagine, that have a need to be able to turn those features on and off. Are there any great success stories that you’d like to share of being able to prevent an issue before it happened, or turn it around before customers even knew?

HW: We work with Microsoft and they do a lot of dark launching and sneak the features in so that they can do what we call canary testing, where you test with a small percentage of your population. But we also worked a lot with GoPro to be able to let them develop in their mainline and do internal testing. Then, when they were ready, they pushed out their new version with no problems. Not a lot of people tell us when they have to hit the kill switch. We can see it in the logs, but I don’t want to call anybody out, because it’s uncomfortable to have to say, “I deployed something that was bad,” but it’s not as uncomfortable as it is if you have to say, “I deployed something that was bad and it took forever to fix it.”

ZL: Your session today was a choose your own adventure interactive. I love that concept. I love games and I love just the interaction in a session like that. Walk through the concept there of how the choose your own adventure maps into this idea of deploying in a dark manner and the feature flags and things like that.

HW: We have a little mascot, an astronaut named Toggle. I created an adventure for Toggle where you can choose how you want to get to a planet based on different routes, like which ones are safer, or which ones are faster, or which ones are most scenic. If you think about the way you can sometimes get a map to show you like, “Don’t put me on any freeways,” that’s sort of the way that I designed this talk. This is the first time I’ve used the Twine gaming platform to design a talk. It turned out my CSS is really rusty, but there’s a lot of good help out there, so I got through it. It was really interesting to be able to say like, “Hey audience, pick which direction you want to go. Let’s talk about the thing you’re interested in. I’ve created 60 some slides of content, but the path through it is unique every time to your experience.”

ZL: How did they react? Which routes did you go?

HW: Unsurprisingly, the first route that we took was development. We talked about canary launches, which is the small launch, and we talked about multivariate flags. You could say, “I want to launch to 10% of the people in Germany.” We are basically doing access control lists for your product. You can slice and dice your audience however you’d like. Then there’s the albatross launch, a concept I really love that I got from Fastly: where you have a legacy client that’s very important to you that you want to keep happy, but you also want to keep your code base moving forward. You can switch them over to the older code base and keep your mainline moving forward without having to actually split your lines, split your code instances.

ZL: Well, awesome. My crack team of researchers also dug up a little bit. It looks like outside of the workplace, you are maybe an aspiring designer or seamstress.

HW: Oh yeah. I sew 100% of the clothes that I wear to conferences because I am really tired of dresses that don’t have pockets. If you wear a mic pack you have to have a belt of pockets. I just do all my own sewing and I find it super satisfying to make something tangible after a long day of software, which is a little fuzzy around the edges. Also, it means that I get to do tourism shopping. I just came back from New York and I honestly have a dedicated fabric suitcase.

ZL: Wow. Now it’s all coming out.

HW: My regular carry-on, roller-board-size suitcase fits in the big suitcase. Then I’ve come back from, oh, Tel Aviv or Australia or London, with a second bag, essentially, full of fabric. In fact, the dress that I’m wearing right now has fabric from Australia and London.

ZL: Wow. It’s a trip around the world for your clothes.

HW: It is. It’s really a nice reminder that we aren’t just here to do technology but also to experience the places that we go.

ZL: Yeah, absolutely. Do you have any places in mind for this trip? Have you eyed any places?

HW: Actually, I’ve never been to St. Louis before, but I don’t have a lot of time on this trip. I actually have to leave in a few hours.

ZL: Oh, that stinks.

HW: Yeah, it’s very sad.

ZL: Well, hopefully you get back to the airport in time and no delays there.

HW: Yeah, no. It’s only a couple minutes so it should be fine.

ZL: If people have heard about LaunchDarkly and are curious, or want to hear more about these feature flags, it sounds like a fantastic offering. I don’t know who wouldn’t be interested in something like that, to be able to control your launch a little bit and pull stuff back if disasters happen. If people want to get in contact with you, or with the company, what’s the best way to do that?

HW: They could write me, Heidi@launchdarkly.com, or you can find us on Twitter @LaunchDarkly, or you can find me on Twitter @WiredFerret.

ZL: WiredFerret.

HW: I know.

ZL: What’s the story behind that name?

HW: Well, it turns out that I am both excitable and high strung, so yeah. Anybody who’s seen me at a conference party understands.

ZL: Got you. Do you bite?

HW: I don’t. I don’t. That’s definitely against the code of conduct.

ZL: Yeah, because I mean, that’s the MO on ferrets.

HW: It is. It is. We had ferrets for years and they were terrible sock thieves. Whether or not your foot was in the sock, they were going to steal it.

ZL: It was gone. Well, I appreciate you stopping by and giving up some of your lunch hour. I know this is high priority time, so I appreciate you coming on, sharing your perspective. It sounds like your presentation went great despite a couple technological glitches. I hope you have a good flight back and thank you for taking some time out and sharing your perspective with us today.

HW: All right. Thank you.

ZL: Thank you for listening to the Technically Speaking podcast. Get involved with the show by following us on Twitter @SpeakTech, or like our page at facebook.com/SpeakTechPodcast. If you have suggestions or questions related to the show, or would like to be considered as a future guest, send feedback and inquiries to hey@speaktechpodcast.com. I’m Zach Lanz and thank you for listening to the Technically Speaking podcast.

09 Nov 2017

Measure Twice, Launch Once

You want all your developers to have access to the main trunk of code to deploy — that’s the point of trunk-based development. It’s important they can put their code out as often as they want and iterate on their projects. However, you don’t always want developers turning on features that will have customer impact without some way to reverse course.

Secured activation is an under-appreciated part of feature management. Your developers can deploy code whenever they want—but when it comes time to test it externally, or turn it on for everyone, you can use settings to make sure that only a select group of people has the permissions to do so. All the activation changes should be tracked and audited to ensure that all activations have an accountability chain.

At LaunchDarkly, we have found that it’s good to be permissive about who can use and create feature flags, and restrictive about who can activate them. If you’re getting started with using feature flags more broadly, think about how to implement a repeatable process. You might also want to leverage LaunchDarkly’s ‘Tags’ feature to help with organization, and custom roles to assist with delegation and access.
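For instance, a custom role might let a small group of release managers toggle flags in production while everyone else can still create and edit flags freely. The policy below is only a sketch: the action and resource names are assumptions about the custom-role policy format, so check the custom roles documentation for the exact syntax.

```js
// Hypothetical "release manager" custom role policy (a sketch; the action and
// resource names here are assumptions -- verify against the custom roles docs).
const releaseManagerPolicy = [
  {
    effect: 'allow',
    actions: ['updateOn'],                        // toggle flags on and off...
    resources: ['proj/*:env/production:flag/*']   // ...but only in production
  }
];
```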

You want the following qualities in people who have the permissions to change your user experience:

  • Understands the business reason for making the change
  • Has the technical knowledge or advisors to know when the code is ready to go live
  • Has a process in place for making the change and then testing it

You don’t want to have only one person who can do this, because they’ll inevitably become a bottleneck. Make sure your process can keep releasing even if a key team member is unavailable.

In the beginning you may want to put a process around every change and then look for optimizations within that process. Over time, however, you should determine what level of change merits formal process and what can be executed more lightly. In some cases this might even allow small changes to be approved or executed by individual engineers. Usually, features that have anything to do with money, user data collection, or changes in the user workflow should have a formal approval process. Changes to backend operations can be quieter, and can therefore rely on a less formal process that leans more heavily on automated testing and peer review.

Think about your current deployment process. What happens if someone releases something too early? How do you protect against that? How will you port that control over to the access control that LaunchDarkly offers? What is the failure case if something doesn’t launch properly?

Feature flags are easy to implement in code, but managing them well across an organization takes some planning and forethought.

20 Oct 2017

Keeping Client-Side Feature Flags Secure

LaunchDarkly’s JavaScript SDK allows you to access feature flags in your client-side JavaScript. If you’re familiar with our server-side SDKs, setup for our JS SDK looks superficially similar—you can install the SDK via npm, yarn, or via CDN and be up and running with a few lines of code. However, if you dig a bit deeper into the source code and do an apples-to-apples comparison with our node SDK, you’ll see that our client-side SDK works completely differently under the hood.

There’s one simple reason for this difference—security. Because browsers are an untrusted environment, our client-side SDK uses a completely different approach that allows us to serve feature flags to your users securely. While we’ve made our approach as secure as possible out of the box, there are three best practices that we recommend all companies follow to maximize security when using client-side feature flags.

1. Choose Appropriate Use Cases

First, in a browser environment your entire JavaScript source code is fully exposed to the public. It may be minified and obfuscated, but obfuscation is not security. This means that if you’re using client-side feature flags, you must be aware that anyone can modify your source code directly and bypass your feature flags. For example, if you have a code snippet like this:
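A minimal sketch of that kind of snippet, with the flag key brand.new.feature and a placeholder showBrandNewFeature() helper standing in for your own code:

```js
// Client-side flag check -- everything here ships to the browser in plain sight.
var user = { key: 'user@example.com' };
var client = LDClient.initialize('YOUR_CLIENT_SIDE_ID', user);

client.on('ready', function () {
  if (client.variation('brand.new.feature', false)) {
    showBrandNewFeature(); // placeholder for your own rendering code
  }
});
```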

You can expect that any of your end users could inline edit the above code snippet in their browser window to show the brand new feature, bypassing the feature flag and its rollout rules entirely. This is a fact about how client-side feature flags work—not a limitation in the LaunchDarkly platform.

This doesn’t mean that client-side feature flags are always a security vulnerability—it just means that there are some use cases where a client-side feature flag alone isn’t appropriate. For example:

  • Launching a feature that is press-, publicity-, or competitor-sensitive. In this situation, having a user enable the feature early could derail a launch, or tip your company’s hand. Even the feature flag key itself (e.g. brand.new.feature) could be sensitive. In this scenario, you can’t even afford to push the new front-end code to production prior to launch, even if it’s kept dark.
  • Entitlements. A client-side feature flag alone is not a sufficient control for locking users out of functionality that they shouldn’t be able to access. In this case, a client-side feature flag is fine as long as it’s coupled to a backend feature flag. For example, you can use the same feature flag (with the same rollout rules) on both the front-end (to disable a UI element) and the back-end (to control access to a REST endpoint triggered by that same UI element), as in the sketch after this list.
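Here is a rough sketch of the back-end half of that pairing, assuming an Express app and the Node server-side SDK; the route, flag key, and user lookup are invented for illustration:

```js
// Server-side enforcement of the same flag that hides the UI element.
const express = require('express');
const LaunchDarkly = require('ldclient-node'); // server-side SDK package name at the time of writing

const app = express();
const ldClient = LaunchDarkly.init('YOUR_SDK_KEY');

app.get('/api/export-report', (req, res) => {
  // Identify the user the same way the front-end does, so both sides of the
  // flag evaluate identical targeting rules.
  const user = { key: req.query.userKey || 'anonymous' };

  ldClient.variation('export.report', user, false, (err, enabled) => {
    if (err || !enabled) {
      // Flag off (or error): the endpoint behaves as if the feature doesn't
      // exist, even if someone bypassed the client-side check.
      return res.sendStatus(404);
    }
    res.json({ report: 'generated' }); // stand-in for the real feature
  });
});

app.listen(3000);
```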

Using client-side feature flags securely starts with choosing appropriate use cases. Ultimately, the choice depends on whether or not you can expose your new feature’s code to users that may not have the feature enabled.

2. Enable Secure Mode to Protect Customer Data

There’s a corollary to the observation that all JS code is fully exposed to the public—there is no way to securely pass authentication credentials to a 3rd party API from the front-end alone. Because anyone can view source, any API key or token that you pass to a 3rd party API via client-side JS is immediately exposed to users.

One solution to this problem is to sign requests—use a shared secret on the server-side to sign the data being passed to the 3rd party API to ensure that the data hasn’t been tampered with, or the request hasn’t been forged. Tools like Intercom take this approach. In Intercom’s case, request signing guarantees that a user can’t fetch chat conversations for a different user.

LaunchDarkly follows this approach as well. We call this Secure Mode, and when enabled, it guarantees that a user can’t impersonate another user and fetch their feature flag settings. Our server-side SDKs work in tandem with our client-side JavaScript SDK to make Secure Mode easy to implement. We recommend that all customers enable Secure Mode in their production environments.
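As a rough sketch of how the two halves fit together (the method and option names are taken from the Node and JS SDKs, but treat this as a sketch rather than copy-paste code): the server computes a signed hash of the user, and the browser passes that hash to the JS SDK when it initializes.

```js
// Server side (Node SDK): compute a signed hash of the user with your
// environment's secret.
const LaunchDarkly = require('ldclient-node'); // package name at the time of writing
const ldClient = LaunchDarkly.init('YOUR_SDK_KEY');

const hash = ldClient.secureModeHash({ key: 'user@example.com' });
// ...render `hash` into the page (e.g. as a template variable) alongside the user key...

// Client side (JS SDK): pass the same user plus the hash when initializing.
// LaunchDarkly rejects requests where the user doesn't match the hash, so one
// user can't fetch another user's flag settings.
var client = LDClient.initialize('YOUR_CLIENT_SIDE_ID', { key: 'user@example.com' }, {
  hash: hash
});
```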

3. Use a Package Manager to Bundle the SDK

There’s another aspect of security that comes with using any 3rd party service in your JavaScript—protecting yourself in case that service is attacked. Our recommended best practice is to embed our SDK in your code using a package manager, rather than injecting the SDK via script tag. We offer easy installation with popular package managers like npm, yarn, and bower. This does put the onus on you to upgrade when we release new versions of our SDK, but at the same time it can reduce latency to fetch the SDK and give you control in accepting updates.
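For example, with npm the setup might look roughly like this (the package name, ldclient-js, is the one in use at the time of writing; check the current docs if it has since changed):

```js
// npm install --save ldclient-js
var LDClient = require('ldclient-js');

var client = LDClient.initialize('YOUR_CLIENT_SIDE_ID', { key: 'user@example.com' });
client.on('ready', function () {
  // The SDK is bundled with your app, so there's no third-party <script> tag
  // to be tampered with, and you decide when to pick up SDK upgrades.
  console.log(client.variation('brand.new.feature', false));
});
```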

Conclusion

We go to great lengths to ensure the security of our customers and their end users. We follow industry best practices in the operation of our service (we recently completed our SOC2 certification) in addition to providing advanced security controls in our product such as multi-factor authentication, scoped API access tokens, and role-based access controls. This approach to security extends to our client-side JavaScript SDK. By following the practices outlined here, you can ensure that your use of client-side feature flags keeps your data and your customers’ data safe.