11 Apr 2018

I Don’t Always Test My Code, But When I Do It’s In Production

How using automated canary analysis can help increase development productivity, reduce risk and keep developers happy

At our March Meetup Isaac Mosquera, CTO at Armory.io, joined us to talk about canary deployments. He shared lessons learned, best practices for doing good canaries, and how to choose the right metrics for success/failure.

“The reason it’s pretty hard is because it’s a pretty stressful situation when you’re deploying into production and you’re looking at your dashboard, just like Homer is here. And he’s trying to figure out what to do next. Does this metric look similar to this metric? Should I roll back? Should I continue moving forward? Is this okay? And more importantly, what’s actually happening is a human being, when they’re looking at these metrics in their Canary, they’re biased, right?”

Watch his talk below to learn more about automated canaries. If you’re interested in joining us at a future Meetup, you can sign up here.


Alright, so a little about me, besides being super nerdy. I have been doing start-ups for the last 10, maybe 15 years, and in doing so, I’ve been always taking care of the infrastructure, and all the other engineers get to have fun writing all the good application code. So a big part of just my background is infrastructure as code, writing deployment systems, all the things that you need to do in order to make code get into production, be in production, and be stable. That’s a little about me.

I’m currently the CTO of Armory. So this is the latest start-up, and what we do is we deliver a software delivery platform, so that you can ship better software, faster. The important point here is it’s done through safety and automation. It’s not all about just velocity. It’s about having seat belts on while you’re driving really, really fast. And the core of this platform is called Spinnaker, which was open-sourced by Netflix in 2015. So by a show of hands, who’s heard of Spinnaker? Alright, so roughly half. Alright, are you guys using it in production at all? Oh, zero. Okay. Well, hopefully I can get you interested in using it in production.

Spinnaker was open-sourced in 2015 by Netflix. It’s by far the most popular open-source project from Netflix. Netflix has many open-source projects, but this by far is the one that’s gained the most steam. It’s got support from Google, Oracle, Microsoft, the big Cloud providers. It’s also got support from little start-ups like Armory. And so we’re all working to make a really hardened deployment software delivery system. It’s being used in production by very, very large companies, medium-sized companies, and even some small ones. So you have an idea of the scope and reach and all the people who are involved in this project. It’s pretty amazing to see how far it has come from 2015 to where it is today.

But what I’m going to talk about today is one small component of Armory, which is Canaries. It’s all around testing in production. Who is familiar with Canarying? Alright, so roughly half aren’t. The idea of Canarying comes from the canary in the coal mine. When coal miners would go into the mine, and there was a poisonous gas, they would die because they wouldn’t know the poisonous gas was there. So instead of them dying, they decided to bring a canary down into the coal mine, because the canary would die first, which sucks for the canary but is great if you’re a human coal miner.

And so the same ideas apply to software delivery. So instead of deploying a hundred nodes into production to replace the old version that you had there before, what you’re going to do is deploy one node into production, assess the situation, make sure that it doesn’t fall over or report bad metrics, and if it’s doing well, continue to scale out. And if not, kill the canary and roll back. And so what most people are doing today is that very process. They deploy one node into production. They scramble over to Datadog or New Relic, and they have some human being looking at an array of dashboards, and then the human being has to make an assessment to actually keep going forward with the deployment or roll back.

But it’s actually pretty hard to do really, really good Canaries. And you actually see it as we talk to more and more customers, Canaries, for them, is this pinnacle of a place that they want to get to, and it’s pretty hard. The reason it’s pretty hard is because it’s a pretty stressful situation when you’re deploying into production and you’re looking at your dashboard, just like Homer is here. And he’s trying to figure out what to do next. Does this metric look similar to this metric? Should I roll back? Should I continue moving forward? Is this okay? And more importantly, what’s actually happening is a human being, when they’re looking at these metrics in their Canary, they’re biased, right? You got that product manager breathing down your back to get that feature into production. It’s a month late. So even if they kind of look a little bit off, you’re still going to push that button to go forward. You’re still going to make mistakes like old Homer here.

So who here has kids? Yeah, so you guys are all familiar with this kind of crazy scene here, which is just kind of an out of control situation. And this is what a lot of deployments are like in general. So in order to do a nice, good, clean deployment, you need to have control of the situation and the environment. And when this is happening, there’s no way to do a clean Canary, because when somebody deploys something and it’s a mess and you’re trying to compare metrics, and you’re just unsure, where did that metric come from? Did we deploy it correctly? Did somebody change the config on the way out and it’s not reproducible? You end up in this crazy situation. And most people, or most customers or most companies, their deployments, in fact, are just unstable. So the idea of doing a Canary on top of that, doing some sort of analysis, it becomes very, very untenable.

So why Canaries in the first place? Why test in production? Why not just have a bunch of unit tests and integration tests? And the reason why is that integration tests are actually incredibly hard to maintain, to create. They fail all the time, and they’re very brittle. And engineers don’t like writing tests in the first place anyway, so it’s unlikely that you have that many of those. But these are easy, right? The unit tests provide you a lot of coverage. They’re easy, and as you go up the stack here, the tests will become harder and harder and harder to do, and that’s why you don’t want to write as many of them. And you want to rely on other systems, like Feature Flagging and Canarying to help you out at this top part of the stack.

The other thing about an integration test, each individual integration test that you write only provides you incremental more coverage. You’re not going to get so much more coverage that an integration is ripping through the code base all the way in and all the way back out. You can only do that so many times, right? So they’re hard to create. They’re hard to maintain. They’re brittle. And like I said, I mean, we all know engineers don’t like writing tests in the first place. So it’s a constant battle between devops and engineers on the integration tests.

So then why not only just do Canaries? Well, the problem with only just doing Canaries and getting rid of the unit tests and integration tests and all the other tests, is that you start finding out that there’s huge problems right before you get all the way into production, at this line right here. And that’s a very costly place to find out that you have problems. So while Canarying will be helpful and useful, it’s more of a safety net. It doesn’t replace integration tests. It doesn’t replace unit tests. You still have to figure out how to get those engineers to write integration tests, which is difficult to do. So it’s a combination of all these tools that help you deploy safely into production.

So what makes a good Canary? So the first one is that it’s fully automated. And again, it’s got to be fully automated. The more manual process that you introduce into your Canary process, it doesn’t add any safety. You just have more human beings, and we all know that human beings are error-prone. Literally, as engineers, we build software to automate other human beings out of work, but we won’t do that to ourselves. So we constantly continue to put our human assessment into the situation, into the release process, and that’s pretty painful.

And in fact, actually, a story of a customer that I had, where they had a manual Canary release process, they put out a node, they started putting out more nodes, and this engineer got hungry, so she went to lunch. And she went to lunch for about an hour and a half, and in that time, she took down a very, very large Fortune 100 company site, losing millions of dollars. Again, there’s no rollback mechanism. It was a fully manual system, and so there’s no point in having the Canary if there’s no way to actually automate the process.

The third through the fifth part are about the deployment system itself. It needs to be reproducible, meaning if you run the same Canary twice, you should have the same result, roughly. Obviously, because you’re testing in production and it’s live traffic, it won’t exactly be the same. But you generally want it to be roughly about the same. It needs to be transparent. You need to know what’s happening inside of the Canary, like why is the Canary process deciding to move forward or roll back? Because if the Canary system makes a mistake, the engineer needs to know why it made that mistake so that it can be corrected. And the last thing is reliable, right? There’s a lot of homegrown systems out there in terms of deployments and software delivery, and what ends up happening, if it’s not reliable, is that you have that one software engineer that’s gone rogue and is going to start building their own Canary system on the side. And now you end up with three broken Canary systems instead of just one. So this is what makes a good Canary system.

So what are some of the lessons we’ve learned doing Canaries? The first approach is like, let’s apply machine learning to everything, because that’s just the answer nowadays I see. But actually, simple statistics just works. And this actually comes back to the previous statement that I was making about it being transparent. The moment you apply machine learning, it turns a bit into a black box, and that black box can vary in degree depending on who made it and how it was created. And it actually doesn’t really get you much. So what we instead learned was that using simple statistics works: we now apply a Mann-Whitney U test to the time series data that we get from the servers, from the metrics, so that we can actually make a decision as to whether to move forward or back. And this is much easier for someone to understand and comprehend when the Canary fails. If it’s a black box and I don’t know what to do and I don’t know why it broke or what metric was actually failing, it’s of no use to me as an engineer who wants to just get my application code out.
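To make the statistics concrete, here is a minimal sketch of a two-sided Mann-Whitney U test comparing one metric from a baseline node against the same metric from a canary node. It is an illustration only, not Armory's or Spinnaker's implementation: the function names, the sample latencies, the normal approximation without tie correction, and the 0.05 cut-off are all assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist


def rank_with_ties(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def mann_whitney_u(baseline, canary):
    """Two-sided test via the normal approximation (no tie correction)."""
    n1, n2 = len(baseline), len(canary)
    ranks = rank_with_ties(list(baseline) + list(canary))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u  # z <= 0 because u is the smaller statistic
    return u, min(1.0, 2 * NormalDist().cdf(z))


# Invented latency samples in milliseconds, one value per time bucket.
baseline = [101, 99, 103, 98, 102, 100, 97, 104, 101.5, 99.5]
similar_canary = [100.5, 98.5, 102.5, 97.5, 101.2, 99.8, 96.5, 103.5, 100.8, 99.2]
shifted_canary = [x + 50 for x in baseline]  # canary that is clearly slower

ALPHA = 0.05  # assumed significance level
for name, canary in [("similar", similar_canary), ("shifted", shifted_canary)]:
    u, p = mann_whitney_u(baseline, canary)
    verdict = "looks different from baseline" if p < ALPHA else "no significant difference"
    print(f"{name}: U={u:.0f}, p={p:.4f}, {verdict}")
```

Because the test works on ranks rather than raw values, it can point to exactly which metric diverged, which is the transparency argument made above.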

So another lesson we learned is that this is what we see normally when people do manual Canaries. And I think it makes sense when you do a manual Canary, because it’s simple to orchestrate, you have your existing production cluster, you roll out your one node in the Canary cluster, and then, again, you scramble to Datadog or New Relic or whatever metric store you’re using to just start comparing some of these metrics. But this is like comparing apples to apples, except one of them is actually a green apple. And it’s like slightly different, and you don’t really know why. And it just doesn’t really make sense. But they look kind of the same, so screw it, let’s just go to production, right? And you wonder why the Canary failed. But again, you can’t have a human making decisions about numbers and expect it to be rational.

So instead, what we do is we grab a node from the … Well, actually we don’t grab the node. We grab the image from the existing production cluster. We replicate it into its own cluster called the baseline cluster, and then we deploy the Canary. Because now we can compare one node to one node. They are running different code, which is the whole point, but we’re comparing the same size and metrics. Like if this side is over here ripping through 100,000 requests per second, there’s no way that this is going to be able to be compared to that with simple statistics. So this is what makes also Canarying harder, is that you have to be able to orchestrate an event like this, and that isn’t trivial to do if you don’t have sophisticated software delivery.

So other lessons we learned is that a blue green deployment is actually good enough for smaller services. You don’t need to Canary everything. The blue green is good enough. You release it. If it fails, and alarms start going off, you roll it back. The impact is actually low. And the thing I didn’t mention is all that we’re doing here is trying to figure out how to reduce overall risk to our customers and to the company, right? And sometimes blue green is enough to just actually reduce enough risk if you have that out of the box.

The other thing that we’ve found is people like to do Canaries without enough data coming out of their application. If there is no data coming out of an application into Datadog or whatever metric store you’re using, you can’t use that information in order to have a good Canary. It’s just impossible. I can’t make data up. And then the last one is that choosing the right metrics is important. Each application that a company has has a very different profile. What it means to succeed or fail is very, very different whether it’s a front-end check-out application versus a back-end photo processing application. You know, you might want to look at revenue. Revenue might be a great metric to look at for the front-end customer check-out application, but it has no meaning to the image processing application. So making sure you understand what it means to succeed or fail inside of your application is really important, and it’s very surprising to see how many people don’t realize what it means, and they only learn through failed Canaries what success or failure looks like.

You can do this with Armory. So this is what a pipeline looks like with Armory. I think these might … Yeah. So this is what a pipeline looks like with Armory and Spinnaker. Is everyone familiar with the term baking? No? Okay, so you guys are familiar with the term AMI, an image? So the idea of an image is just an immutable box in which you can put files and all your code, and you take a snapshot of your computer at that one time, and it becomes an image that you use to push into production. That term has been coined, I think by Netflix, bake. So it’s called baking an image. I have no idea why it’s called that actually, now that I’m thinking about it.

And then the next step, when we run a Canary, and what the Canary step will do, we’ll actually send out … We’ll grab that node from production, take the image from production, create a new cluster, and then push out the Canary as well, the new change set, and then run the statistical analysis against it for as long as you configure it. And I’ll show you what that looks like next. You can also see that we also create a dashboard in whatever metric store that you’re using, so that again, back to the transparency, if this thing were to fail, you’re going to need to know what metrics were failing. So we automatically create a dashboard, so you have that transparency.

The reliability here actually comes from open source, the fact that there’s 60, maybe even 70 developers now just working on this. You’ve got the world’s best Cloud engineers building this software, so the reliability comes in the fact that a ton of engineers are working on it. They’re not just going to leave to go to another company in another six months for higher pay. We’re working on this, and it’s going to be around for a very long time.

And so those are the properties that you see right there. This is what it looks like to have a failed deployment. So again, back to each application, we give it a score. An 83 might be a good score for a lot of applications. For this one though, for whomever configured it, decided that an 83 is a failing score. And so what a failing score will do will actually destroy the Canary infrastructure. So it will destroy the baseline and the Canary, and then that should return production back into the state that it was at automatically for you. So you don’t have to do anything. If you get hungry and you go to lunch, you can trust that this thing will take care of it for you.
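The score-and-threshold idea can be sketched in a few lines. This is a hypothetical illustration, not how Armory computes its score: the scoring rule (percentage of metrics that passed), the metric names, and the threshold values are all assumptions chosen so that one failing metric out of six yields roughly the 83 mentioned above.

```python
def canary_score(metric_results):
    """Percentage of metrics judged healthy (an assumed scoring rule)."""
    passed = sum(1 for ok in metric_results.values() if ok)
    return 100.0 * passed / len(metric_results)


def decide(score, pass_threshold):
    """Map a score to the next pipeline action for a given application."""
    if score >= pass_threshold:
        return "continue rollout"
    # A failing score tears down both temporary clusters, which leaves
    # production in the state it was in before the Canary started.
    return "destroy baseline and canary"


# Invented per-metric verdicts for one Canary run.
results = {
    "latency_p99": True,
    "error_rate": False,
    "cpu": True,
    "memory": True,
    "requests_per_second": True,
    "gc_pause": True,
}

score = canary_score(results)
print(f"score={score:.0f}")
print("strict app (threshold 90):", decide(score, 90))
print("lenient app (threshold 75):", decide(score, 75))
```

The same score leads to different outcomes depending on the threshold each application owner configures, which is why an 83 can pass for one service and fail for another.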

So this is what it looks like to config a Canary. So you can choose the data provider. We allow you to config where you look for the metrics and alarms, what’s going to set this Canary to be good or bad, how you want to do the metrics comparison, how long you want to do an analysis period. For instance, there’s some applications that you only want to do a one hour or two hour Canary, and that might be okay to reduce risk, to get an understanding of how the application is going to behave. There’s other applications that are mission critical, and you may want it to run for 48 hours, right? So every application will have a different property and will have to be configured differently.

And then the metrics that are associated with the Canary, like what do you want us to be looking at. As you start querying, as you’re running in production, we’ll start looking at these metrics. Interestingly enough, we used to have a deviation. That actually no longer exists. We automatically can detect now if you’re falling out of bounds. You don’t need to tell us what a good score is anymore. We just apply more statistics to the time series we get. And you don’t have to apply deviation or any thresholds. We kind of automatically figure that out for you. So the engineer doesn’t really want to be doing this type of work. They want to get back to writing application code.

So the thing that we get a lot is why Canaries vs Feature Flags, and it’s not a mutually exclusive thing. They’re actually used together. And in fact, we actually use Feature Flags all the time with our Canary product, and the way that we use it is whenever we build a feature, we put it behind a Feature Flag immediately, and we’re constantly deploying to production behind this Feature Flag. Everything that we do is under continuous delivery. We’re a continuous delivery company, so we obviously should practice what we preach. So we use a Feature Flag, and then we start iterating. The Canary helps as other commits are getting pushed into production, that you’re not just affecting anybody blindly. You’re not actually making mistakes. And then especially on this last part. It’s really great for code changes that affect large, large portions of the code base, like a database driver change. I don’t know how many times I’ve seen deployments fail because of somebody just switching a Flag, and next thing you know, production is out.

So provide info for every release. If you’re doing a Canary, it’s giving you a good or a bad for every single release. Spans typically a single deployment. You’re not typically going for multiple deployments, although you can be able to do Canaries. And then Feature Flags are great for user visible features. It kind of lends itself to the product team, where a Canary is more of an engineering activity. There’s not a product manager that’s going to do a Canary, at least not that I’ve seen. And then Feature Flags are great for testing against cohorts and more sophisticated analysis. Canaries aren’t always that sophisticated to be able to test against users. A Canary doesn’t know necessarily what a user is in the first place. Alright, so that’s it.

14 Mar 2018

Tonight We Monitor, For Tomorrow, We Test in Production!

In February Steven Czerwinski, Head of Engineering at Scalyr, spoke at our Test in Production Meetup. This session was focused on monitoring and observability while testing in production, and Steve shows why he feels monitoring is an important element within that process. If you’re interested in joining us at a future Meetup, you can sign up here.

Steve presented a case study around latency issues a Scalyr customer recently faced. He shares how his colleague, John Hart, explored the issue, and then reviews some key learnings realized after the event.

“Monitoring is so important to testing in production. I want to evoke the idea that you need to get your monitoring in place before testing in production. If you’re not really monitoring, you’re not really testing—you’re just hoping that things go right.”

Watch his talk below.


Thanks both to Andrea and Heavybit both for organizing this. This is a topic near and dear to our hearts at Scalyr. As Andrea said, my name is Steven Czerwinski, I’m Head of Engineering at Scalyr. Tonight I’m actually going to be presenting some work done by my colleague, John Hart.

In particular tonight, I’m going to talk about some, essentially lessons that John uncovered while he was performing the deep dive into some query latency issues that one of our customers was facing. And, with this, I’m not going to focus on the particular performance issues that he uncovered. What I’d rather like to do is talk about how our monitoring impacted that investigation. Often our monitoring actually helped out with the investigation, as you would hope it would. It helped make the investigation smoother, uncovered some interesting things. However, there are other times where our monitoring actually got in our way, where it actually hindered and misled us in the investigation. And those are the more interesting examples. In retrospect, those issues were because we violated some known best practices with monitoring. And this happens. So what we want to do is go down to the specifics of those examples to better reinforce that it’s a good idea to follow good practices in monitoring.

And, in general, monitoring is so important to testing in production. The little bit of mashup title that I used that Andrea was referring to, I really want to use that to invoke the idea that you really have to get your monitoring in place, before you do testing in production. If you’re not doing monitoring, then you’re not really testing, you’re just hoping that things go right. John Hart also likes to talk about this idea of performance dark matter. When you’re running a complex distributed system, like we do, there’s a lot of performance dark matter that’s just kind of hidden in the system. And it’s only through best practices in monitoring that you can really shed light on that dark matter and figure out what’s going on.

This slide here illustrates the problem that John was looking at. One of our customers, I’m going to refer to them as Acme Corp, just to protect the innocent, was facing bad query latency for a certain class of their queries. And here in this graph, you can kind of see that. The blue line is the average latency for Acme Corp’s queries for this class of queries over time. The red line is for all customers other than Acme Corp. You can see that for Acme Corp, we had times where the latencies were peaking over five, ten seconds. And for us, that’s unacceptable. We really strive to make sure that 99% of our customers’ queries are answered in less than 1 second. For us, this is a challenging problem. We have some very large customers. This customer, in particular, sends us tens of terabytes of log volume every day. We have hundreds of their engineers logging in every day, issuing queries in order to see what’s going on in their system.

Now before I dive into the details of the best practices, I want to give a little bit of overview of our backend system, because it’s going to give you a little bit of information to put the rest of the best practices in context. One of the fun things that we’ve done at Scalyr, is we’ve actually built our own NoSQL database from scratch, that’s optimized for this particular problem domain. And for us, this is one of our competitive advantages. It is what allows us to give orders of magnitude better performance than a lot of our competitors. And, for our NoSQL design, we followed a lot of the normal ways that other NoSQL databases, like Bigtable and Cassandra, are structured.

For us, we take all of a particular account’s data, which is essentially the logs coming in for that account and we break it up into small five minute chunks, which we refer to as Epochs. Each Epoch is assigned to a particular shard of servers in our system. And we sprinkle these Epochs for an account all across all of these Shards that we have in our infrastructure. To answer a query, we … at the account master, receive the query from our customer. We … the account master knows where all the Epochs are stored, what Shards are … hold all the appropriate data. And the account master forwards the query to the appropriate Shards in order to execute the query on the appropriate Epochs. Now, in our world a basic Shard of servers has both masters and slaves. And a given query can be satisfied at either a master or a slave.

The right side of the diagram blows up the query engine of a particular slave or master. And, in here, you can see that there’s a few block diagrams in the flow of executing a query. One of the first things that happens when a … the server receives the query, is to do some admission control policies. So this is enforcing rate limits, in order to make sure customers aren’t abusing our system, acquiring account blocks, that sort of thing. After the query passes admission control, then it gets farmed off to the query execution engine, the query execution engine essentially tries to execute the query over data in a RAM cache. In order to satisfy that query, often you have to pull in Epochs or the data blocks that make up Epochs into that RAM cache. So that’s why you see us pulling in blocks off disk into the RAM cache.

Just to briefly talk about some of the things that did work well for our monitoring. First of all, we actually already have an A/B testing framework for our queries. On a per query basis, we can apply different execution strategies in order to experiment with effects of small modifications. We actually have this integrated with our logging, as well. So, very quickly, we can analyze the effects of different execution strategies on query latencies. One of the other things that we do is we’re very careful about how we run our experiments. John is a big believer in markdown files, so every time he starts up a new investigation, he starts up a new markdown file along with a new dashboard. Everything that he does during the experiment gets dumped in there. He uses our dashboards and our Scalyr command line interface extensively in order to populate information to that markdown file and to add results to the dashboard.

And finally, one of the other things that we have in our system, is the ability to modify the server configuration on the fly. So all these experiments that we’re running, all these things that we’re doing in order to test our strategies on our real users queries, we can adjust over time, through some simple updates.

Alright, let’s talk about the more interesting points. The monitoring lessons that we essentially had to relearn. So the first lesson I want to talk about is the importance of consistency. And the other way I like to think about this lesson is that there should be no surprises in how information is communicated. When you look for a piece of information, it should exist where you expect it to exist. It should exist in the form that you expect it to. And the performance issue that really reinforced this for us was the discovery of an RPC rate limit gate issue that we had. In our system, as I kind of alluded to earlier, we have rate limits that are applied to all incoming queries to make sure that there’s no abuse. We don’t want to have too many queries from one particular customer executing on the query engine, because they’re getting unfair advantage of the system then.

So, normally what happens is the gate keeps track of the number of current queries that are being executed per second. If it exceeds a certain threshold, then the gate will artificially block a given query in order to slow it down. Now, it turned out for Acme Corp, we were actually experiencing wait times of multiple seconds or more at the gate. And this was a big contributor to their latency. It was slowing them down. But, we didn’t notice the issue that quickly, which is surprising because actually all the information that we needed was in the logs, we just didn’t see it. Let’s talk about why.

Essentially it boils down to multiple issues with consistency. First of all, we had inconsistency with how our metrics were laid out. We already did have a good model for reporting query latencies broken down by various sub components. We had a systematic way of reporting that. But for this feature, this RPC rate limit, it was not part of the query system. It was part of the RPC subsystem. So we actually reported it in a different way. And, when it came down to it, we were looking at the breakdown of the query latencies, we were just missing the fact that there was time stalled out while we were waiting for the gate.

Now, we actually did have the gate wait latency in the logs. In fact, John even thought to check it out. He had a long list of all the places where we could be missing performance. And he did some manual checks. He knew what he was looking for in the logs in order to check to see whether or not this was an issue or not. And he did some scans and saw numbers like four and five and he’s like, “Oh, okay, four or five milliseconds, that’s fine”. That’s not contributing to the multiple seconds that we’re seeing.

But, the problem here was that latency was actually being reported in seconds and it was inconsistent with how we report most of our latencies. Everywhere else in the system, we report them in milliseconds. But here we were being misled by our results, because we were just inconsistent with units.
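One way to guard against this kind of unit mix-up is to route every latency through a single helper that converts to one unit and puts that unit in the field name. The sketch below is a hypothetical illustration, not Scalyr's logging code; the helper name and the `gate_wait` field are invented for the example.

```python
def latency_field(name, seconds):
    """Format a latency as a log field that always reports milliseconds.

    Callers pass seconds; the unit suffix in the field name means a
    reader never has to guess what a bare number like 4 stands for.
    """
    return f"{name}_ms={seconds * 1000:.0f}"


# A four-second stall at the gate can no longer be mistaken for 4 ms.
print(latency_field("gate_wait", 4.0))    # gate_wait_ms=4000
print(latency_field("gate_wait", 0.004))  # gate_wait_ms=4
```

With the unit carried in the name, a scan of the logs for `gate_wait_ms` shows 4000 rather than 4, which is the signal that was missed in the investigation.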

Okay. This is just kind of before and after actually. So John, after figuring this out, did some fixes to how we handle the gate. The red bars are essentially the counts, or number of times, that the waits at the gate exceeded one second. That was before the fix. And the blue bars are the number of times we waited after the fix. So you can see there’s a significant reduction.

The next lesson. The second lesson we learned. Essentially it boils down to what I like to describe as you have to analyze based on what matters. You have to have … when you design your monitoring, you really have to think about what really matters in terms of the behavior of your system. Another way people talk about this is averages versus distributions. And I’ll explain that more in a minute. The performance issue that reinforced this lesson was an issue we were having with our RAM block cache utilization. I mentioned earlier that in order to execute a query for a given Epoch, all the blocks for that Epoch have to be read into a RAM cache.

Well, it turned out that because of an odd interaction between how we decide what Epochs should be executed on the master versus the slave and how we had architected the structure of our RAM cache, we were only using half of the cache at any time. And essentially, just to give you a little more detail, our RAM cache was actually composed of numerous two gigabyte pools. And it turned out that if we had an even number of RAM pools, then only the even RAM pools were being used on the masters and only the odd RAM pools were being used on the slaves. And it was just because of this odd interaction. But it resulted in the fact that we were effectively using only half of our RAM for our cache. We had 50 gigabytes dedicated to that cache, and we were only using 25.

And so, why did it take so long for us to figure this out? It comes down to the fact that we were measuring the wrong thing. We had some metrics that we were looking at that would have uncovered this sooner. We had a dashboard that essentially talked about the cache fill rates. How many blocks were we inserting into the cache per second? If there was a problem, if we weren’t really utilizing the cache, this would have dropped to zero. And so we look at this graph. We look at the average of the cache fill rates across all the RAM pools. Everything looks fine, okay, we’re inserting blocks at a pretty decent rate.

However, this graph tells a different story. What this graph shows is the cache fill rate for all the odd-numbered RAM pools, that’s the one in blue. And the even is in the red. You can see right there that there’s a huge difference between the fill rates for the even RAM pools and the odd RAM pools. And what this really gets down to is, what really matters? It doesn’t matter that we’re inserting blocks into the cache at a decent rate. What matters is that we’re inserting blocks in for all the RAM pools, that all the RAM pools are effectively having blocks added. And so this is where you get to the idea of averages versus distributions. You can’t just take the average across something. You really, in some cases, have to look at the distributions where that matters.
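To make the averages-versus-distributions point concrete, here is a minimal sketch. It is not code from the talk: the pool numbering, the fill-rate values, and the `idle_pools` helper are all made up for illustration. It shows how a healthy-looking average can hide the fact that half of the pools on one host never receive a block.

```python
from statistics import mean

# Hypothetical per-pool fill rates (blocks inserted per second) on one host.
# The values are invented to mimic the even/odd split described in the talk:
# even-numbered pools are busy, odd-numbered pools receive nothing.
fill_rates = {pool: (0.0 if pool % 2 else 400.0) for pool in range(26)}


def idle_pools(rates, threshold=1.0):
    """Return the pools whose fill rate is below the threshold."""
    return sorted(pool for pool, rate in rates.items() if rate < threshold)


average = mean(fill_rates.values())
idle = idle_pools(fill_rates)

# The average alone looks fine; the per-pool view exposes the problem.
print(f"average fill rate: {average:.1f} blocks/s")
print(f"idle pools: {len(idle)} of {len(fill_rates)}")
```

An alert on the number of idle pools, rather than on the average, is the kind of check that reflects what actually matters here.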

Okay. After 20 hours of investigation, basically this boiled down to a single character fix for John. The easy fix was just actually changing the shared RAM pool count down from 26 to 25 to give an odd number. And so, in effect, it ended up reducing the total RAM that we’re using for our cache, but actually resulted in more blocks being cached, as counterintuitive as that is. And that’s it. Those are the lessons I wanted to go over.
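The talk does not spell out the exact mechanism, so the following is only one plausible way such a parity interaction could arise, under two loud assumptions: that a master handles even-numbered Epochs, and that a pool is picked as `epoch % pool_count`. Neither detail is confirmed by the talk, and `pools_touched` is a hypothetical helper. The sketch shows why an odd pool count would spread the same Epochs across every pool.

```python
def pools_touched(pool_count, epochs):
    """Return the set of pool indexes that the given epochs map to."""
    return {epoch % pool_count for epoch in epochs}


# Assumption: this host only ever executes even-numbered epochs.
master_epochs = range(0, 1000, 2)

for pool_count in (26, 25):
    used = pools_touched(pool_count, master_epochs)
    print(f"{pool_count} pools configured, {len(used)} actually used")
```

With 26 pools, an even epoch modulo an even count is always even, so only 13 pools are ever reached. With 25 pools, the step of 2 and the count of 25 share no common factor, so all 25 pools are reached, which matches the observation that fewer configured pools led to more blocks being cached.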

If you want to learn more about our system, feel free to visit our blog. The obligatory, we’re hiring. And that. So …

12 Mar 2018

Instrumenting CI Pipelines

In February, we invited New Relic Developer Advocate, Clay Smith, to our Test in Production Meetup to talk about instrumenting CI pipelines. If you’re interested in joining us at a future meetup, you can sign up here.

Clay took a look at the three pillar approach in monitoring—metrics, tracing, and logging. He wanted to explore what tracing looks like within a CI pipeline, and so he observed a single run of a build with multiple steps kicked off by a code commit.

“I wanted to try and apply some of this stuff to understanding AWS CodePipeline that I was using to build some Docker images. The question that I wanted to explore and kind of get into and try to learn more about observability by trying to build something is, if we take these really interesting ideas that were outlined in these posts and apply them to a CI tool, what can we learn and what does that actually look like?”

Watch his talk below.


Clay Smith:

I’ve had this very long and checkered history with attempting to build CI pipelines. My background’s engineering, not release or operations. It’s been a mixed bag of trying to build pipelines most disastrously with iOS and the Xcode build server. Trying to build something fast and reliable to do these checks that makes it easier to actually deliver software.

I revisited that fairly recently after spending a lot of time in 2017 reading a lot about this notion of observability and just going over some really interesting material on that. The inspiration for this was basically three things I read, kind of what was my reading list in 2017 for observability.

The really interesting thing is a lot of the really interesting posts and thought leadership I guess you could call it, has been very much centered in San Francisco. I think we can more or less blame Twitter to some extent for it.

Back in September 2013, they described the situation where Twitter was undergoing rapid growth. They were having issues managing and understanding their distributed systems. They introduced this notion of observability, which isn’t necessarily something new, but it was new in this kind of IT distributed systems context.

In 2017, there were two really great posts I highly recommend you read. They were pretty widely circulated. The first was from Copy Construct’s Cindy Sridharan, she wrote a really amazing post that kind of described that these three things, metrics, logs, and traces are really central to the notion of understanding the work your system does.

We had the three pillars conversation or post, and then slightly before that this Venn diagram from Peter Bourgon. I thought these posts were super cool because again my background isn’t necessarily in operations and caring really deeply about log, or metric, or trace data. I thought the way they presented these ideas was super interesting.

In particular, this Venn diagram that was presented in this post, I thought was really interesting because it got this idea that when we’re talking about metrics, or logs, or traces, which we heard about in the previous talk, there is some sort of relationship between all of them.

I had a couple days right before New Years, and I wanted to try and apply some of this stuff to understanding AWS CodePipeline that I was using to build some Docker images. The question that I wanted to explore and kind of get into and try to learn more about observability by trying to build something is, if we take these really interesting ideas that were outlined in these posts and apply them to a CI tool, what can we learn and what does that actually look like?

I was at re:Invent this year, which was very, very large, I think around 50,000 people. There was a really cool dashboard that Capital One was showing off. I took a photo on my phone, it’s open source. I think they were calling it something like the single view of the DevOps pipeline.

They have some really interesting metrics and graphs around things like build failures, what’s the ticket backlog, what’s the build speed and success, things you would expect. Typically, if you use Jenkins or all these other tools, there’s almost always a way to inspect log output.

Taking the three pillar approach, it seemed like in this view and in other common systems and tools, there wasn’t much necessarily going on with getting a trace of what work is actually going on inside some sort of CI pipeline.

I really wanted to explore that and try and build something in a day or two. The one thing that I kind of changed from the Venn diagram, instead of scoping a trace to a request, what if we just scope it to a single run of a build. Multiple steps kicked off by something like a code commit.

I was using AWS CodeBuild at the time, this is managed infrastructure from AWS. How it works is you have a YAML file, you can give it a container, and you basically give a script. It can do things like build an image, compile code, you can configure it in a lot of different ways.

The infrastructure itself, like a lot of AWS services, is fully managed so there’s nothing to SSH into. You don’t have access to the host, no root privileges. You’re kind of just locked into that container environment, similar to SaaS based CI tools.

What I wanted from that, as it goes through its build steps, is the trace view. One of the things that I had a lot of fun doing was I realized there was no way I could really natively instrument the CodeBuild process. It’s fully managed by AWS, they’re not going to give me access to the code.

Inspired by the diagram, if you can log an event and if you can log the relationship between different events, you can get something that kind of approximates traces. I just wrote a really stupid thing, there’s a verb at the front, you capture different events, and you’re writing it to a file.

The idea there is you’re writing this formatted log, you’re doing this as each build step progresses. You can have write access to the file system in CodeBuild so nothing big there. From there, we can actually build these traces. There was also a huge hack, so you could actually capture those events in real time. It would just tail the log file that you’re writing events to, and send it up to the back end, which in this case is just New Relic APM.
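The talk does not show the actual log format, so here is a minimal sketch of the idea under assumed conventions: each line is `VERB step timestamp`, with `START` and `END` as the verbs. The function names and the format are hypothetical, and an in-memory buffer stands in for the file that would be tailed and shipped to a back end.

```python
import io


def log_event(stream, verb, step, timestamp):
    """Append one formatted event line: '<VERB> <step> <timestamp>'."""
    stream.write(f"{verb} {step} {timestamp:.3f}\n")


def spans_from_log(lines):
    """Pair START/END lines into (step, start, duration) tuples."""
    open_steps = {}
    spans = []
    for line in lines:
        verb, step, raw_ts = line.split()
        ts = float(raw_ts)
        if verb == "START":
            open_steps[step] = ts
        elif verb == "END" and step in open_steps:
            start = open_steps.pop(step)
            spans.append((step, start, ts - start))
    return spans


# Each build step writes its own events as it runs; StringIO stands in
# for the log file on the build container's file system.
log = io.StringIO()
log_event(log, "START", "install", 0.0)
log_event(log, "END", "install", 12.5)
log_event(log, "START", "build", 12.5)
log_event(log, "END", "build", 95.0)

spans = spans_from_log(log.getvalue().splitlines())
slowest = max(spans, key=lambda span: span[2])
print(f"slowest step: {slowest[0]} ({slowest[2]:.1f}s)")
```

Because each line carries both the event and the step it belongs to, the relationship between events can be recovered later, which is what lets plain log lines approximate a trace.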

Once all that’s in place, you can actually get this tracing specific view of different events inside the AWS CodeBuild pipeline. It’s really interesting because all of this stuff was designed very much for an application. I think this view has been around in New Relic for more than seven years.

When you apply it to the pipeline, you actually still get some pretty interesting views of what’s going on. First is just the frequency and duration, but then you actually see the breakdown in time between each step. Not surprisingly, the build step which is actually building the Docker image takes the most time.

From there, because we’re actually building a Docker container, we know which commit in source control actually built the image, and we use that to actually connect it to production performance.

The hack, or the trick, or the thing here with instrumentation is when it’s actually building the Docker image, we tag that trace with the git commit hash of what’s actually being built. When we run that code in production, we also capture that as well so we have traces of how the code is behaving in production. We also have a trace of how that build artifact, that Docker container that’s running in production, was actually being built.

Here you have this interesting view of you see code running, this is different deploys, there’s a spike as [inaudible 00:07:50] scales up and down and all that. You also see next to it what was actually happening when that Docker image was being built in the first place.

An interesting connection between connecting these potentially complicated processes of actually building the image that you’re going to get gradually deployed to production. If you can annotate both traces with something like a git commit hash or a version number, you can connect them together, which I think is kind of interesting.
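As a rough illustration of connecting the two kinds of traces, the sketch below joins build records and production records on a shared commit hash. The record shapes, field names, hashes, and numbers are all invented; a real tracing back end would do this join for you.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trace records, both tagged with the commit that was built.
build_traces = [
    {"commit": "a1b2c3d", "build_seconds": 312.0},
    {"commit": "e4f5a6b", "build_seconds": 298.0},
]
production_traces = [
    {"commit": "a1b2c3d", "endpoint": "/orders", "latency_ms": 120.0},
    {"commit": "a1b2c3d", "endpoint": "/orders", "latency_ms": 80.0},
    {"commit": "e4f5a6b", "endpoint": "/orders", "latency_ms": 250.0},
]


def join_on_commit(builds, requests):
    """Summarize production latency next to the build that produced it."""
    latencies_by_commit = defaultdict(list)
    for request in requests:
        latencies_by_commit[request["commit"]].append(request["latency_ms"])

    report = {}
    for build in builds:
        latencies = latencies_by_commit.get(build["commit"], [])
        report[build["commit"]] = {
            "build_seconds": build["build_seconds"],
            "requests": len(latencies),
            "mean_latency_ms": mean(latencies) if latencies else None,
        }
    return report


report = join_on_commit(build_traces, production_traces)
for commit, summary in report.items():
    print(commit, summary)
```

The only requirement is that both sides carry the same annotation; a version number would work just as well as a commit hash.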

To wrap up this experiment, I think we talk more and more to different customers and people that are building very complex pipelines. Often at the end of that pipeline, there’s a very complex deploy strategy. Blue green, I read a really interesting post the other day that was talking about, this is a blue green rainbow deploys, 15 colors, or 26 colors. Canary deploys, lots of different strategies.

With that complexity, it feels like the stuff that we all know and are hearing about managing systems and microservices could potentially apply in some respects to complex pipelines too. I think this idea of understanding and monitoring your production performance and then being able to have some relationship where you connect it back to whatever it was that built it, ideally ran through some automated tests, test suites, that seems pretty interesting too.

It was a really fun exploration. It was fun to get my hands dirty with these ideas around observability. So many people that go through this to learn about it, it seems really important and also really interesting. Looking forward to continuing the conversation about how people are attacking this and applying it to things we’re all building.

On that note, thanks very much.

08 Mar 2018

Testing and Debugging in Production with Distributed Tracing

At LaunchDarkly we host a monthly Meetup for anyone interested in testing in production. At our Meetup in February, we focused on monitoring and observability while testing in production. If you’re interested in joining this Meetup, sign up here.

Priyanka Sharma, Product Marketing, Partnerships & Open Source at LightStep, kicked off the event with a discussion around how she sees tracing as an essential tool for testing in production. She pointed out how software systems have become more complex in recent years, especially with the rise of CI/CD and microservices.

“There’s an explosion of data. The more services there are, the more touchpoints there are, and the more data you have to collect about what’s happening in your system. It’s very hard to manage all of this.”

Watch her talk to see how distributed tracing can help teams get a better understanding of their systems and the responses they’re seeing when testing in production.


Hi, everybody. How’s everyone doing?

Awesome, awesome. I love Heavybit, just like Andrea. LightStep also was here until a few months ago and we miss the space and the community. And today, I’ll be talking about tracing being an essential tool for testing in production. So before we deep dive into tracing, let’s take a brief overview of what are the big challenges in production, especially around debugging and performance.

As people here probably know better than anybody else, software is changing. Software workflows are changing, especially with the advent of CI/CD and also microservices. Things are done completely differently now. This diagram you see here, the first one, is what a monolithic architecture would look like where it’s one giant box and a request from start to finish and its lifecycle goes through various parts of this box, which, while it’s huge and has a lot inside it, is at least contained, and viewing something through it was much easier.

But now as systems are breaking up into fragments whether they’re microservices or larger services, however you call your service or you call it core projects, whatever, the point is that there’s more and more complexity introduced in today’s software systems.

What also happens with this is that there’s an explosion of data. The more services there are, the more touchpoints there are, and the more data you have to collect about what’s happening in your system. It’s very hard to manage all of this.

This is where distributed tracing can help. So if you have anomalies or outliers that you want to detect, it gets very challenging in that very fragmented system over here because is it in that corner, or this corner here? Where is the actual problem? With distributed tracing, you can examine the exact edge case and see what happened end to end in that request lifecycle. If you’re running a distributed system, out of curiosity, how many people here have a distributed system with at least two services? All right. A bunch of you. So you know what I’m talking about where that observability can be really lost the minute there is any fragmentation. So with tracing, you can get a full picture of your system from the clients all the way to the back end and the responses that come out.

How many of you here run CI/CD? All right. A bunch here too. So when you’re running CI/CD, it’s great that you can ship faster, but also there can be problems with your builds and you need to understand what is the issue. Oftentimes, it’s not the code, but actually the resources being utilized. And with tracing, you can pinpoint that as the issue if it is the case.

Andrea mentioned gradual software rollouts with feature flags, things like that. If you’re doing that, you want to analyze the full transaction data of each rollout to look at the performance, see what errors were there, and make informed choices about the next rollout that happens and that is, again, something that distributed tracing can provide you.

So just to make sure we’re all on the same page, before deep diving further, I’d like to do a quick intro of distributed tracing as I’m defining it in this context and make sure we’re all aligned. So what is distributed tracing? So tracing as a technology has actually been around since the 70s. But what’s brought it mainstream into industry is the coming of distributed systems into the internet age.

So when you need to know the lifecycle of a request end to end in a distributed system, you use distributed tracing. It’s a view of the request as it travels across multiple parts of the system and comes back out with the response. This here is an example of a relatively simple request going through a small system. So you can imagine if you 10x it how big this will be and how complicated it’ll be to follow the path of a request.

So what’s a trace exactly? A trace is the entire life cycle of a request and it’s composed of parts that are called spans. Now a span is a named, timed operation that represents a piece of workflow. So think of it as Point A to Point B and then there’s a jump in the context to a different span and then Point C to D and it goes from there. This diagram here, it should give you a little bit of a sense of how a trace is structured where there’ll be the parent span that starts a lot and there can be child spans or logs, etc. in here. So you can add in a lot of information that is important to you and your system. So there’s a lot of flexibility. Ultimately though, the TL;DR of this all is that the trace is a collection of spans.
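The "trace is a collection of spans" idea can be sketched as a small data structure. This is an illustration only, not the OpenTracing data model: the `Span` and `Trace` classes, their fields, and the service names and timings are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """A named, timed operation; parent links spans into a tree."""
    name: str
    start: float
    end: float
    parent: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class Trace:
    """The whole request lifecycle: just a collection of spans."""
    spans: List[Span] = field(default_factory=list)

    def root(self) -> Span:
        # The root span has no parent and covers the entire request.
        return next(span for span in self.spans if span.parent is None)

    def children_of(self, name: str) -> List[Span]:
        return [span for span in self.spans if span.parent == name]


trace = Trace([
    Span("client", 0.0, 1.0),
    Span("web-server", 0.1, 0.9, parent="client"),
    Span("auth", 0.2, 0.3, parent="web-server"),
    Span("billing", 0.3, 0.6, parent="web-server", tags={"currency": "USD"}),
    Span("database", 0.6, 0.8, parent="web-server"),
])

print(trace.root().name, trace.root().duration)
print([span.name for span in trace.children_of("web-server")])
```

Tags and logs attach to individual spans, which is where the flexibility mentioned above comes from.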

So here, I’ve tried to visualize what a request looks like in a very simple architecture and then the trace view. So on, I guess your left, my right, your left, there’s a very simple service. There’s a client, then a web server that talks to an auth service, billing, and a database. If you’re looking at the trace view over there, which is my left, your right, that’s on the time continuum. And the top you see over there, the client, or the root span, is the topmost thing because that is the beginning and end of the request. And these dashes that you see are all spans of work that happened within this whole trace. So this can be quite useful when you’re debugging a problem because you can deep dive into exactly where the problem is: in the process workflow, or in the auth, the DB, the client, whatever.

So that’s great, right? You can deep dive into things, you can append information and logs and baggage. There’s so much you can do with tracing, and systems are fragmenting so much and changing. So then why is tracing not ubiquitous? The key reason here is that tracing instrumentation has just been too hard. Until recently, most of the instrumentation libraries were vendor specific. And so if you went through all the work of instrumenting each part of your code base, you were connected to this vendor and you didn’t really have a choice to change options should this one not work out for you.

Then monkey patching is insufficient. So an automated agent can do so much, but you need instrumentation by humans who understand the system and the problems that they run into for it to be a comprehensive system.

Then there’s the problem of inconsistent APIs. Today’s systems… How many of you use more than one language in your code base? A bunch of you. So the minute there’s more than one API, if you have to transfer between different APIs or different languages, which don’t work nicely together, you have a problem because you’re only seeing a snapshot of your system. And the same applies to open source projects. So most people’s code is a ton of open source projects with glue code, that’s your application code, right? And so in that situation, if you’re not seeing anything through those projects, then you’re, again, flying in the dark. And tracing libraries for different projects, if they’re not the same … You have the same problem that you had with different language APIs that don’t play nice with each other. So all this is a lot of difficulty and has dissuaded most developers from going into tracing.

And this is where OpenTracing, the project I work on, is your ticket to omniscience. So the OpenTracing API is a vendor-neutral open standard for distributed tracing. It’s open source and part of the Cloud Native Computing Foundation, as I mentioned.

It enables developers to instrument their existing code for tracing functionality without committing to an implementation at the start. So you could use any tracer and visualization system such as LightStep, Zipkin, Jaeger, New Relic, Datadog. Whatever choice you have, if they have a binding for OpenTracing, you can go ahead and swap it out with just a change in the main function.
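The "swap it out with just a change in the main function" pattern can be sketched without any tracing library. To be clear, this is not the OpenTracing API: `NoopTracer`, `RecordingTracer`, `set_tracer`, and `order_donut` are hypothetical names used only to show the shape of the idea, where application code depends on a neutral interface and a single line in `main` chooses the implementation.

```python
import time
from contextlib import contextmanager


class NoopTracer:
    """Default tracer: instrumented code runs, nothing is recorded."""

    @contextmanager
    def span(self, name):
        yield


class RecordingTracer:
    """Stand-in for a vendor binding; keeps finished spans in memory."""

    def __init__(self):
        self.finished = []

    @contextmanager
    def span(self, name):
        started = time.monotonic()
        try:
            yield
        finally:
            self.finished.append((name, time.monotonic() - started))


_tracer = NoopTracer()


def set_tracer(tracer):
    """Install the process-wide tracer implementation."""
    global _tracer
    _tracer = tracer


def order_donut():
    # Instrumented application code: it only knows the neutral interface.
    with _tracer.span("order_donut"):
        return "glazed"


def main():
    backend = RecordingTracer()
    set_tracer(backend)  # the single line that picks an implementation
    order_donut()
    return backend


if __name__ == "__main__":
    print(main().finished)
```

Switching vendors in this sketch means changing the class constructed in `main`; `order_donut` and every other instrumented function stay untouched.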

And this standard has been doing really well, and part of the reason why is that it’s backed by folks who’ve been around the tracing landscape since it became mainstream in industry. So Google released this project called Dapper, which is its in-production tracing system that was created by Ben Sigelman and a few of his colleagues. And they released a paper about it in the 2000s sometime. Inspired by that, Twitter built Zipkin, which it open sourced around 2009. And then now these people who’ve worked on these projects, including Ben Sigelman, including other folks from industry like Yuri Shkuro from Uber, other folks came together to build the OpenTracing spec in 2016 and it was built by people who had solved these problems before, had lived these problems multiple times, and the result is there for others to see.

So let’s look at the OpenTracing architecture. This is it. Jk. So there’s the OpenTracing API. It speaks to the application logic, the frameworks, the RPC libraries, you name it, and they all connect to OpenTracing. And OpenTracing talks to the tracing visualizer of choice. Here, the examples listed, like Zipkin, LightStep, Jaeger, are very, very few. As I mentioned just before, multiple vendors such as New Relic, Datadog, Dynatrace, and [inaudible 00:09:57] have created bindings for OpenTracing. So the options for the end user community are very many.

A lot of end user companies are finding this very useful in production. So this logo wall is by no means complete. It’s all kinds of companies from hipster engineering cultures to large scale enterprises to somewhere in between. You see the Under Armours of the world. You also see the Lyfts and the CockroachDBs, etc.

Along the way, the open source side has really pushed the project along too. The open source project maintainers have adopted OpenTracing and accepted PRs for it because this is an opportunity for them to allow their users to have visibility through the system without having to build the whole thing themselves. Some notable ones would be Spring by Pivotal, gRPC, and there’s a bunch of language libraries. Fueled by this open source and end user adoption, vendors have been jumping onboard.

So that all sounds great. Clearly there’s some social proof, people seem to like it, but what does this really mean? Let’s look at OpenTracing in action with some traces. So I’m now presenting the next unicorn of Silicon Valley, Donut zone, which is a DaaS, Donuts as a Service. It’s the latest innovation, it’s donut delivery at its very best, it’s backed by Yummy Tummy Capital with a lot of money. This is the beautiful application we’ve built and through this, you can order donuts. Now I’m going to be asking you guys for help very soon to help me create a success disaster on donut zone.

Before that, I quickly want to explain the DaaS architecture, which is built to move fast in big things so that you know when we look at traces, what we’re really looking at. So we have bootstrapped with a single donut mixer for our topper, start up life. But the good news is that we’ve built with OpenTracing. There’s an OpenTracing mutex wrapper that’s included in this. So it will hopefully help us debug any issues that may come up. But who knows?

So now we are going to get on to the success disaster part of the piece. So if you can please pull out your mobile phones or laptop, whatever you may have and go to donut.zone. I’m going to do this too. And then once you’re there at donut.zone, just start ordering a bunch of donuts, as many as you would like in a dream world where there are no calories in donuts and they were just like water but tasted like they do. So let’s get started. Click away because if you don’t do this, there will be no good traces to see and so this is all on you. Okay, so tap, tap, tap. Click, click, click. Okay, let’s just do a little more for good measure. Don’t be shy. Click a lot. Okay, great. So thank you so much.

Now let’s look at some traces. I have a bunch of traces here. Now you’ll see that this is a full trace and the yellow line that you see here … Can people see the yellow line? The yellow line is the critical path, which means the things that … the blocker, so to speak, for the process to be completed. So here you see that the donut browser is taking up in this specific trace, which is 10.2 seconds long. Seven seconds are gone just by the browser and client interaction. What does that tell us? Something that’s not very surprising, which is that the internet’s a little slow. So nothing moved beyond the client until seven seconds in. Then you see here this is like the payment is being processed because you paid fake money for your donuts.

Then here is an interesting one. Here we see that from 7.6 seconds to 9 seconds was taken up by the donut fryer. But, the interesting point here is that the mutex acquire says that that itself took close to about, I would say, two and a half seconds, which means … And here in the tags, you will see that it was waiting on six other donut requests. This is where you created the success disaster by clicking so many times. So this request was just waiting for its chance to be fried and that took up the most time. So that should tell you that if you were debugging over here, that it’s a resource allocation issue as opposed to an issue in your code. The code that worked just took this tiny amount of time. So you’ll be better off optimizing by adding a fryer. And this is something you knew right away because you had this distributed trace available and you could debug in the moment to figure out why it is so slow to get these donuts from donut.zone.
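The demo’s mutex wrapper is not shown in the talk, so the following is a guess at the general shape of such a wrapper, using only the Python standard library. `TracedMutex`, its fields, and `fry_donut` are hypothetical names. The point it illustrates is recording the time spent waiting for the lock separately from the work done while holding it, along with how many other requests were in line.

```python
import threading
import time


class TracedMutex:
    """A lock that records how long each caller waited to acquire it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._meta = threading.Lock()  # guards _in_line and records
        self._in_line = 0              # callers waiting for or holding the lock
        self.records = []

    def __enter__(self):
        with self._meta:
            ahead = self._in_line
            self._in_line += 1
        started = time.monotonic()
        self._lock.acquire()
        waited = time.monotonic() - started
        with self._meta:
            # In a real tracer these would be tags on an "acquire" span.
            self.records.append({"waited_on": ahead, "wait_seconds": waited})
        return self

    def __exit__(self, *exc_info):
        with self._meta:
            self._in_line -= 1
        self._lock.release()
        return False


fryer = TracedMutex()


def fry_donut():
    with fryer:
        time.sleep(0.01)  # the actual work is short


threads = [threading.Thread(target=fry_donut) for _ in range(5)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

slowest = max(fryer.records, key=lambda record: record["wait_seconds"])
print(f"longest wait: {slowest['wait_seconds']:.3f}s "
      f"behind {slowest['waited_on']} other request(s)")
```

When the wait dominates the work, as in the donut fryer trace, the data points at contention for a shared resource rather than at slow code.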

So I hope you found this useful. For those of you who are interested in seeing those kinds of traces for your system and debugging efficiently and quickly, I recommend going to opentracing.io to learn about the project. We have documentation, we have links to all our GitHub repos, and lots of advice, etc. But if you’re interested more in the community aspect, you want to talk to people, hear their experience, then I’d recommend the Gitter community where we all hang out. It’s gitter.im/opentracing. This is where people help each other and you can get good advice on how to get started. I hope you found this useful. If you have any questions, feel free to reach out to me, priyanka at lightstep.com. I am also on Twitter at @pritianka. I reply really quickly. So please reach out, check out distributed tracing, and let me know if I can help in any way. Thank you so much.

09 Feb 2018

Visibility and Monitoring for Machine Learning Models

Josh Willis, an engineer at Slack, spoke at our January MeetUp about testing machine learning models in production. (If you’re interested in joining this Meetup, sign up here.)

Josh has worked as the Director of Data Science at Cloudera, he wrote the Java version of Google’s A/B testing framework, and he recently held the position of Director of Data Engineering at Slack. On the subject of machine learning models, he thinks the most important question is: “How often do you want to deploy this?” You should never deploy a machine learning model once. If the problem is not important enough to keep working on it and deploy new models, then it’s not important enough to pay the cost of putting it into production in the first place.

“The tricky thing, though, is in order to get good at machine learning, you need to be able to do deploys as fast as humanly possible and repeatedly as humanly possible. Deploying a machine learning model isn’t like deploying a regular code patch or something like that, even if you have a continuous deployment system.” -Josh

Watch his entire talk below.


How’s it going, everybody? Good to see you. Thanks for having me here. A little bit about me, first and foremost. Once upon a time, I was an engineer at Google. I love feature flags, and I love experiments. I love A/B testing things. I love them so much that I wrote the Java version of Google’s A/B testing framework, which is a nerdy, as far as I know … I don’t know. Does anyone here work at Google? Any Googlers in the audience? I know there’s at least one because my best friend is here, and he works at Google. As far as I know, that is still used in production and probably gets exercised a few trillion times or so every single day, which is kind of a cool thing to hang my nerd hat on.

I used to work at Cloudera, where I was the director of data science. I primarily went around and talked to people about Hadoop and big data and machine learning and data science-y sorts of things. I am Slack’s former director of data engineering. I’ve been at Slack for about two and a half years. I am a recovering manager. Any other recovering managers in the audience? I was going up the management hierarchy from first line management to managing managers, and I started to feel like I was in a pie eating contest, where first prize is more pie. I didn’t really like it so much. Wanted to go back and engineer. So about six months ago, I joined our machine learning team, and now I’m doing machine learning-ish sorts of things at Slack as well as trying to make Slack search suck less than it does right now. So if anyone’s done a search on Slack, I apologize. We’re working hard on fixing it.

That’s all great, but what I’m really most famous for … Like most famous people, I’m famous for tweeting. I wrote a famous tweet once: Which is a proper, defensible definition of a data scientist? Someone who is better at statistics than any software engineer and better at software engineering than any statistician. That’s been retweeted a lot and is widely quoted and all that kind of good stuff. Are there any … Is this sort of a data science, machine learning audience or is this more of an engineering ops kind of audience? Any data scientists here? I’m going to be making fun of data scientists a lot, so this is going to be … Okay, good. So mostly, I’ll be safe. That’s fine. If that guy makes a run at me, please block his way.

So anyway, that’s my cutesy, pithy definition of what a data scientist is. If you’re an engineer, you’re sort of the natural opposite of that, which is this is someone who is worse at software engineering than an actual software engineer and worse at statistics than an actual statistician. That’s what we’re talking about here. There are some negative consequences of that. Roughly speaking at most companies, San Francisco, other places, there are two kinds of data scientists, and I call them the lab data scientists and the factory data scientists. This is my own nomenclature. It doesn’t really mean anything.

So you’re hiring your first data scientist for your startup or whatever. There’s two ways things can go. You can either hire a lab data scientist, which is like a Ph.D., someone who’s done a Ph.D. in statistics or political science, maybe or genetics or something like that, where they were doing a lot of data analysis, and they got really good at programming. That’s fairly common data science standard. A lot of people end up that way. That wasn’t how I ended up. I’m the latter category. I’m a factory data scientist. I was a software engineer. I’ve been a software engineer for 18 years now. I was the kind of software engineer when I was young where I was reasonably smart and talented but not obviously useful. I think we all know software engineers like this, smart, clearly smart but not obviously useful, can’t really do anything. This is the kind of software engineer who ends up becoming a data scientist because someone has an idea of hey, let’s give this machine learning recommendation engine spam detection project to the smart, not obviously useful person who’s not doing anything obviously useful and see if they can come up with something kind of cool. That’s how I fell into this field. That’s the two kinds. You’ve got to be careful which one you end up with.

Something about data scientists and machine learning. All data scientists want to do machine learning. This is the problem. Rule number one of hiring data scientists: Anyone who wants to do machine learning isn’t qualified to do machine learning. Someone comes to you and is like, “Hey, I really want to do some machine learning.” You want to run hard the other direction. Don’t hire that person because anyone who’s actually done machine learning knows that it’s terrible, and it’s really the absolute worst. So wanting to do machine learning is a signal that you shouldn’t be doing machine learning. Ironically, rule two of hiring data scientists: if you can convince a data scientist that what they’re doing is machine learning, you can get them to do anything you want. It’s a secret manager trick. It’s one of the things I learned in my management days.

Let’s talk about why, briefly. Deep learning for shallow people like ourselves. Deep learning, AI, big stuff in the news. I took a snapshot here of the train from my favorite picture, “Back to the Future, Part III,” a truly excellent film. Machine learning is not magic. Machine learning is, it’s basically the equivalent of a steam engine. That’s really what it is, especially deep learning in particular. What machine learning lets us do is stuff that we could’ve done ourselves, manually, by hand over the course of months or years, much, much, much faster in the same way a steam engine lets us move a bunch of rocks from point A to point B. It’s not something we couldn’t do. We knew how to move a bunch of rocks from point A to point B. That’s how we built the pyramids and stuff like that. But this lets us do it much, much faster and much, much cheaper. That’s what machine learning fundamentally is.

There are consequences of that. One of the nasty consequences of it. Machine learning … There’s a great paper that I highly recommend you read by this guy named D. Sculley, who is a professor at Tufts, engineer at Google. He says machine learning is the high interest credit card of technical debt because machine learning is basically spaghetti code that you deploy on purpose. That’s essentially what machine learning is. You’re taking a bunch of data, generating a bunch of numbers and then putting it in a rush intentionally. And then trying to figure out, reverse engineer how does this thing actually work. There are a bunch of terrible downstream consequences to this. It’s a risky thing to do. So you only want to do it when you absolutely have to.

Lab data scientists want to do machine learning. Factory data scientists want to do machine learning. Their backgrounds mean they have different failure modes for machine learning. There’s a yin and yang aspect to it. Lab data scientists are generally people who have a problem with letting the perfect be the enemy of the good, broadly speaking. They want to do things right. They want to do things in a principled way. They want to do things the best way possible. Most of us who live in the real world know that you hardly ever have to do things the right way. You can do a crappy Band-Aid solution, and it basically works. That’s the factory data scientist attitude. The good news, though, about people who want to do things perfectly: they don’t really know anything about visibility or monitoring. Despite knowing a bunch of stuff about linear algebra and tensors, they don’t know how to count things. But you can teach them how to do Graphite and Grafana. You can teach them how to do Logstash. They can learn all these kinds of things, and they want to learn, and they have no expectation that they know what they’re doing, so they’re very easy to teach. That’s a good thing.

Factory data scientists have the opposite problem. They’re very practical. They’re very pragmatic. So they’ll build things very quickly in a way that will work in your existing system. However, they overestimate their ability to deploy things successfully the way most not obviously useful software engineers do. As a result, they are much more likely to completely bring down your system when they deploy something. So that’s what you want to watch out for there.

Another really great paper, “What’s your ML test score? A rubric for production ML systems.” I love this paper. This is a bunch of Google people who basically came up with a checklist of things you should do before you deploy a machine learning system into production. I love it. Great best practices around testing, around experimentation, around monitoring. It covers a lot of very common problems. My only knock against this paper is they came up with a bunch of scoring criteria for deciding whether or not a model was good enough to go into production that was basically ludicrous. So I took their scoring system and redid it myself. So you’ll see down there, if you don’t do any of the items on their checklist, you’re building a science project. If you do one or two things, it’s still a science project. Three or four things are a more dangerous science project. Five to 10 points, you have the potential to destroy Western civilization. And then finally, once you do at least 10 things on their checklist, you’ve built a production system. So it’s kind of a u-shaped thing.

This is a great paper. If you have people at your company who want to deploy machine learning into production, highly, highly recommend reading it and going through it and doing as much of the stuff they recommend as you possibly can. More than anything, for the purposes of this talk, I want to get you in the right headspace for thinking about what it means to take a machine learning model and deploy it into production. The most important question by far when someone wants to deploy a machine learning model is, how often do you want to deploy this? If the answer is once, that is a bad answer. You should never deploy a machine learning model once. You should deploy it never or prepare to deploy it over and over and over and over and over again, repeatedly forever, ad infinitum.

If the problem is not important enough to keep working on it and keep deploying new models, it’s not important enough to pay the cost of putting it into production in the first place. That’s thing one. The tricky thing, though, is in order to get good at machine learning, you need to be able to do deploys as fast as humanly possible and as repeatedly as humanly possible. Deploying a machine learning model isn’t like deploying a regular code patch or something like that, even if you have a continuous deployment system. The analogy I would use is it’s kind of like someone coming to you and saying, “Hey listen. We’re going to migrate over our database system from MySQL to Postgres, and then next week, we’re going to go back to MySQL again. And then the week after that, we’re going to go back.” And just kind of like back and forth, back and forth. I’m exaggerating slightly, but I’m trying to get you in the right headspace for what we’re talking about here. Basically, different machine learning models are systems that are complicated and opaque, that are nominally similar to each other but slightly different in ways that can be critically bad for the overall performance and reliability of your systems. That’s the mentality I want you to be in when it comes to deploying machine learning models. Think about it that way.

The good news is that we all stop worrying and learn to love machine learning, whatever the line is from “Dr. Strangelove,” that kind of thing. You get good at this kind of stuff after a while, and it really … I love doing machine learning, and I love doing it in production in particular because it makes everything else better because the standards around how you operate, how you deploy production systems, how you test, how you monitor have to be so high just across the board for regular stuff in order to do it really, really well. Despite all the horrible consequences and the inevitable downtime that the machine learning engineers will cause, I swear, I promise, it’s ultimately worth doing it, and in particular, companies should do it more so I get paid more money to do it. That’s kind of a self-interested argument.

If you like to do monitoring, if you like to do visibility, if you like to do DevOps stuff in general and you want to do it at a place that’s done it really, really well, slack.com/jobs. Thank you very much. I appreciate it.

07 Feb 2018

When a Necessary Evil becomes Indispensable: Testing in Production at Handshake

In January at our Test in Production MeetUp, we invited Maria Verba, QA Automation Engineer at Handshake, to talk about why and how her team tests in production. (If you’re interested in joining us, sign up here.)

At Handshake, Maria is focused on making sure her organization ships a product that customers love to use, as well as supporting the engineers with anything test related. In her presentation, she discusses the requirements her team has put in place to test safely. She also covers some specific examples of when testing in production is better than in pre-production.

“Whenever I hear test in production, I definitely don’t cringe anymore. But I know that there are right ways to do this, and wrong ways to do it. Right ways will always include a healthy ecosystem. A healthy ecosystem implies you have good thorough tests. You have feature toggles. You have monitoring in place. We have to consider our trade-offs. Sometimes it does make more sense to create the mock production environment, sometimes it doesn’t. There are definitely things that cannot be tested in pre-production environments.” -Maria

Watch her talk below.


Hello everybody! Again, good evening. Thank you for having me. My name is Maria Verba. I work at a company called Handshake here in San Francisco.

Today, I’m going to talk to you about how we do, and why we do, testing in production. It’s my own personal journey from thinking about it as a necessary evil, something that you have to do but don’t want to do, to thinking about it as more of an indispensable, quite valuable tool.

To start off, let’s talk a little bit about Handshake. Most of you have probably never heard of it. It’s a small company, it’s understandable. Handshake is a career platform for students that helps connect students with employers, and helps students find meaningful jobs, and jump-start their careers. At Handshake, we believe that talent is distributed equally, but opportunity is not, so we work very hard to bridge that gap, and to help students find their dream jobs, and make a career out of it.

My role at Handshake is quality assurance engineer. Besides making sure that we ship very good products that our customers love to use, I think that my responsibility as a QA engineer is also supporting engineers, engineering team with everything and anything test-related, test infrastructure, test framework, as well as keeping their productivity and happiness high.

With my background with being in QA for about six years, whenever I heard “Oh, let’s just test this in production,” this is pretty much the reaction I would get. No! No way! Let’s not do that. Why don’t we want to do this? First of all, I do not like customers to find any issues. I do not want to expose them to any potential even small bugs. Yeah, let’s not test in production.

The second thing is I personally create a lot of gibberish records. I don’t care when I test. Sometimes it makes sense, sometimes it doesn’t make sense. I wouldn’t want anybody to see those records. Besides, it has a potential to mess up reporting and analytics. It’s also very, very easy to delete records, and modify records, so making this mistake is very costly. It’s very expensive, and probably engineers will spend like a couple of hours trying to dig out a recent database backup, and try to restore it. So yeah, let’s not do that in production. The last one is data security. We are actually exposing some records to other people that shouldn’t see that.

Then I started realizing that maybe it’s not actually that bad. Sometimes we have to make the sacrifice, and sometimes we have to test in production. I thought, “Yeah, maybe it’s okay as long as we have automation testing.” Before deploying, we have something that verifies that our stuff works. We also want to make sure that we have a very good early detection, monitoring, and alerting system, so that we find the mistakes we make before our customers do. We absolutely have to put it behind the feature toggle. There is absolutely no way around it. Even if there are no issues with our new feature, we do not want to surprise our customers. We do not want to throw a new feature at them and say, deal with it, do whatever you want. We want to have that material ready for them. We want to expose them to it before they see it.

In recent times, the way that we’ve done it, it historically happened that most of our features were read-only. We would not create any new records. We would not modify any records. We also made some trade-offs. We thought about it. We realized that yes, it is possible to create those separate production-like environments and test it there, but for us, it’s not realistic to do that. We move very fast. We don’t have that many engineers to spend time creating new environments. Basically, it’s a very expensive process for us. We really don’t want to do it. Besides that, we are very careful about our data records, and we take security very seriously. We constantly do data privacy training. We have multiple checks in place in code. This helps us safeguard against any issues.

In the past, where we did test in production, this would be a couple of examples. When we needed real-life traffic: general performance testing, performance monitoring on a daily basis. We also needed our production data. Sometimes it’s very hard to reproduce and replicate the same data that we have in production. Examples would be the Elasticsearch 5 upgrade and the job recommendation service. Lastly, we do a lot of A/B testing and experimentation at Handshake. We validate our product ideas.

I’m going to talk a little bit more in detail about a couple of these examples here. The Elasticsearch 5 upgrade was a pretty good example. First of all, why did we need to do this? We rely heavily on our search engine. Search-based features are pretty much our core. Whenever students log in, most of the time, what they want to do is find a job. Our search service cannot be down. However, we were using Elasticsearch 2, which became deprecated. That put us in a position where we had to switch to Elasticsearch 5 very quickly.

Let’s talk about why we can’t really test this in production. The update is too large. Basically, we cannot afford making any mistakes here. It touches everything. It has a potential to affect every single customer we have, every single user we have. It’s extremely important for us to keep the features functionally working.

How did we do this? First of all, it took us a while to figure out where Elasticsearch versions 3 and 4 went. Then we created new classes, the classes for the new search syntax. We started the upgrade with changing the syntax for our… We put those classes corresponding to each index behind the feature toggle, and we thoroughly tested with our unit and integration framework. We rolled it out to production incrementally, index by index, and we tested that in production, because we have the amount of data in production that we needed. We were able to verify that we get correct results after the update.
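To make the index-by-index rollout concrete, here is a minimal sketch of a per-index feature toggle. It is not Handshake’s code: the `SearchToggles` class, the two query builders, and the index names `jobs` and `employers` are all hypothetical stand-ins, and a real system would read toggles from a flag service rather than from memory.

```python
from dataclasses import dataclass, field


@dataclass
class SearchToggles:
    """Hypothetical in-memory toggle store: which indexes use the new syntax."""
    enabled: set = field(default_factory=set)

    def enable(self, index: str) -> None:
        self.enabled.add(index)

    def is_enabled(self, index: str) -> bool:
        return index in self.enabled


def legacy_query(index: str, term: str) -> dict:
    # Stand-in for the old query builder (illustrative shape only).
    return {"index": index, "syntax": "legacy", "term": term}


def new_query(index: str, term: str) -> dict:
    # Stand-in for a new query class written for the upgrade.
    return {"index": index, "syntax": "new", "term": term}


def build_query(toggles: SearchToggles, index: str, term: str) -> dict:
    """Pick the query builder for this index based on its toggle."""
    if toggles.is_enabled(index):
        return new_query(index, term)
    return legacy_query(index, term)


toggles = SearchToggles()
toggles.enable("jobs")  # roll out one index at a time
jobs_query = build_query(toggles, "jobs", "intern")
employers_query = build_query(toggles, "employers", "intern")
```

Because each index has its own toggle, a bad result on one index can be switched back to the legacy path without touching the others.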

The second case study is slightly different. Job recommendation service is … at Handshake, we use machine learning. We apply machine learning to recommend jobs to students based on students’ behavior, based on students’ major, skills, maybe previous job history, job viewing history, location interest, etc. We take that whole melting pot of information, and we suggest a job to a student based on that. The question that we were trying to ask here is “Does this job make sense to this particular student?” A lot of times, you cannot go through all of those possible combinations and computations to answer that question.

To recreate that environment, it would take us, again, a lot of time and effort. We, again, rolled out incrementally. We heavily used our interns, because they’re still in school. After verifying that the jobs make sense, we started rolling it out to other schools as well.

Talking about these two examples, I can hear the question in the audience boiling. You could still do this. You could still create a mock environment, and do the same kind of testing not in production. You could have done it pre-production. And yes, that’s true. I agree with that. We had to make those changes based on what our business needs are, based on our speed. We could not afford spending time and effort in setting up, maintaining, and securing those environments. But yeah, it is possible.

However, what if we changed the algorithm behind the job recommendation service? How do we make sure that those jobs are still relevant to the students? Do we ask the students after we change everything? I don’t know.

Here is where our experimentation framework came in. We use experiments and A/B testing a lot at Handshake. We pretty much test every major product hypothesis that we have. For us, it’s a two-step process. We have impressions and conversions. Impressions are calls to action, a button or a prompt, something that the user sees. And then a conversion is an action based on seeing that prompt. An example would be showing a profile completion prompt to a student on a dashboard, and that would be an impression. A conversion would be if the student fills out their profile.

The way that we do it is we, first of all, determine which users, which students we want to target. We separate them into a different pool. Sometimes we want our experiment to target only seniors. Based on that, we remove other students from the test group. Then we split it into treatment and control groups. Then we track their conversions and impressions. We track them for both control and treatment group. This allows us to get very accurate results.
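The target, split, and track steps can be sketched roughly as follows. This is an illustrative assumption rather than Handshake’s framework: `assign_groups`, `ExperimentLog`, and the `year` field are hypothetical names, and the random split is seeded only to keep the example repeatable.

```python
import random
from collections import Counter


def assign_groups(students, eligible, treatment_share, seed=0):
    """Keep only targeted students, then split them at random."""
    rng = random.Random(seed)
    groups = {}
    for student in students:
        if not eligible(student):
            continue  # removed from the test group entirely
        in_treatment = rng.random() < treatment_share
        groups[student["id"]] = "treatment" if in_treatment else "control"
    return groups


class ExperimentLog:
    """Counts impressions and conversions per group."""

    def __init__(self):
        self.impressions = Counter()
        self.conversions = Counter()

    def record_impression(self, group):
        self.impressions[group] += 1

    def record_conversion(self, group):
        self.conversions[group] += 1

    def conversion_rate(self, group):
        shown = self.impressions[group]
        return self.conversions[group] / shown if shown else 0.0


students = [
    {"id": 1, "year": "senior"},
    {"id": 2, "year": "junior"},
    {"id": 3, "year": "senior"},
]
groups = assign_groups(students, lambda s: s["year"] == "senior", treatment_share=0.5)

log = ExperimentLog()
for _ in range(4):
    log.record_impression("treatment")  # the prompt was shown
log.record_conversion("treatment")      # one student completed the profile
```

Comparing `conversion_rate("treatment")` with `conversion_rate("control")` is what tells you whether the prompt changed behavior; a real analysis would also check that the difference is statistically meaningful.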

We are using LaunchDarkly to do this. We’re using targeting rules. As you can see here, maybe it’s hard to see, but basically, we’re specifying the user type as students, thus we’re removing every other user type; maybe there are administrators or something like that. And then we say we want 5% of students to be on variation A of the feature, and 95% on variation B of the feature.
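A rule like “students only, 5% on A and 95% on B” is commonly implemented by hashing each user into a stable bucket. The sketch below shows that general technique only; it is not LaunchDarkly’s API or its actual bucketing algorithm, and the `bucket` and `variation` functions, the `type` and `key` fields, and the flag key `profile-prompt` are all assumptions made for illustration.

```python
import hashlib


def bucket(user_key: str, flag_key: str) -> float:
    """Map a user to a stable number in [0, 100) for this flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode()).hexdigest()
    return int(digest[:8], 16) % 10000 / 100.0


def variation(user: dict, flag_key: str, percent_a: float = 5.0):
    """Return 'A', 'B', or None if the user is not targeted."""
    if user.get("type") != "student":
        return None  # e.g. administrators are excluded by the rule
    return "A" if bucket(user["key"], flag_key) < percent_a else "B"


admin = {"key": "u-1", "type": "administrator"}
student = {"key": "u-2", "type": "student"}

admin_variation = variation(admin, "profile-prompt")
student_variation = variation(student, "profile-prompt")
```

Hashing on the user key means the same student sees the same variation on every request, which keeps impressions and conversions attributable to one group.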

Unfortunately, we can’t do all of it on LaunchDarkly. Some of our experiments or yeah, maybe like a good chunk of our experiments rely on some more dynamic information, some things that change over time. For example, a student has seen a job recently. We have to define recently. We have to define what exactly we want to track about that job. In order to solve this problem, we create records for LaunchDarkly, and this allows us to take additional conditions and configurations, and split our user groups into more detailed cohorts.
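As a small illustration of what defining “recently” can look like, the sketch below turns a last-viewed timestamp into a boolean attribute that a targeting rule could then split on. The seven-day window, the `last_job_view` field, and both function names are assumptions for the example, not details from the talk.

```python
from datetime import datetime, timedelta
from typing import Optional

RECENT_WINDOW = timedelta(days=7)  # assumed definition of "recently"


def viewed_recently(
    last_viewed_at: Optional[datetime],
    now: datetime,
    window: timedelta = RECENT_WINDOW,
) -> bool:
    """True if the student viewed a job within the window before `now`."""
    if last_viewed_at is None:
        return False
    age = now - last_viewed_at
    return timedelta(0) <= age <= window


def experiment_attributes(student: dict, now: datetime) -> dict:
    """Build the extra attributes used to split students into cohorts."""
    return {
        "type": "student",
        "viewed_job_recently": viewed_recently(student.get("last_job_view"), now),
    }


now = datetime(2018, 2, 7, 12, 0)
active = {"id": 1, "last_job_view": datetime(2018, 2, 5, 9, 0)}
inactive = {"id": 2, "last_job_view": datetime(2018, 1, 1, 9, 0)}

active_attrs = experiment_attributes(active, now)
inactive_attrs = experiment_attributes(inactive, now)
```

Computing the attribute at evaluation time keeps the cohort current as students’ behavior changes.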

As a result, we get a very, very powerful experimentation framework, and we can only do this in production. There is no way around it.

Going back to my personal QA perspective and key takeaways. Whenever I hear test in production, I definitely don’t cringe anymore. But I know that there are right ways to do this, and wrong ways to do it. Right ways will always include a healthy ecosystem. A healthy ecosystem implies you have good thorough tests. You have feature toggles. You have monitoring in place. We have to consider our trade-offs. Sometimes it does make more sense to create the mock production environment, sometimes it doesn’t. There are definitely things that cannot be tested in pre-production environments.

That also is in line with my personal goals of, first of all, keeping our customers happy, and product stable. Our customers would love to use the product that is driven by the data, and not our own hypothesis. I also think that it keeps the engineers on the team productive and happy, because they don’t have to spend time creating and maintaining all of these different environments.

What’s next for us? It’s Canary testing. We want to be able to divert some portion of our traffic into new features, for example, Rails 5 versus Rails 4. Right now, we are working on moving to Kubernetes to achieve that.

Thank you so much for your time. If you have any questions, find me, I’m going to be around, and reach out. Thank you.