Companies exploring feature management are looking for control over their releases. A common theme is using feature flags to rollout a feature to a small percentage of users, or quickly roll a feature back if it is not working properly. These future flaggers also seek to control a feature further by limiting its visibility to individual users or a group of users. In this piece, we’ll explore how LaunchDarkly lets you control your releases in these ways.
When launching a new feature, or simply updating an existing feature, you may not always want everyone to receive the same experience. When working with a new front-end marketing design, a back-end improvement to search, or anything in between, targeting specific users can help ensure the best experience for all of your customers. Below are a few common scenarios where targeting can come into play. We’d love to hear your thoughts on other targeting best practices—let us know what you come up with!
Targeting users within LaunchDarkly’s UI.
Perhaps the most common use case for individual user targeting is internal QA. Testing in production is a scary thought, but so is launching to production without knowing how the rollout will go. Feature flags enable safe testing in production. By targeting only your internal QA teams to receive the new feature, you can experience how it will function in a production environment without exposing the untested feature to your customers.
When releasing a feature or product in beta, you can target users in your beta group to receive this update while not releasing it to anyone else. Essentially, you would set the flag to “true” for all beta users to enable the new feature. Everyone else would have the flag set to “false”. When Kevin from LaunchDarkly asks to be part of your beta group, it’s as easy as adding that user to the true variation to enable the new feature.
Sometimes, you’ll want to target users, but may not want to dive into as many specifics as we just visited. Perhaps you are constrained by regulations, and need to deliver a different experience to customers from different states. Or, maybe your marketing team is launching a campaign specifically to Gmail users. You can place the email domain “gmail.com” in your targeting rules to deliver an experience only to that group.
Targeting in a Canary Launch
Many companies are seeking to slowly rollout their releases to before releasing to everybody. Rolling a new feature out to 10% of users, for example, is a great way to gain validation before a complete launch. You can target your internal QA team to be a part of the 10% initial rollout, or specifically include members from a beta group to see the new feature first. Conversely, you can target specific users or groups to exclude from the launch, and give them the new feature only when it is rolled out to everybody.
A common challenge with canary launches is delivering a consistent experience to users from the same group, especially in B2B applications. LaunchDarkly’s platform provides a bucketing feature that solves this problem. By bucketing users, you are ensuring that all users from the same group receive the same variation for the duration of a canary launch.
The examples we have covered so far are typically used for short-term flags, either upon initial rollout or for a shorter-lived feature. As your company’s feature flag use continues to mature, wrapping a feature or set of features in a long-term or permanent flag is a logical next step. A common use case here applies to entitlements. For example, if you have Basic, Pro, and Elite versions of your product, user targeting helps manage these tiers effectively across your organization. Many products with different account tiers already have a way of controlling functionality, but with a feature management platform, this can be easily managed by anyone on your team. Customer success team members can quickly dial functionality up or down without the need to go to engineering for support. When a customer upgrades from Pro to Elite, a user simply flips the appropriate flags on.
Managing targeting rules at scale is an important consideration, especially as the amount of feature flags you are managing continues to grow. We introduced a new feature we call Segments to help tackle this very topic. Segments enables you to build reusable targeting rules to make this practice quick and scalable. Borrowing from a few of my examples above, you may want to flip a feature on for everyone with an Elite subscription, but also include all beta members as well. Because this is an important feature, you may want to also target internal users to ensure rollouts go smoothly. This same rollout strategy will likely apply to future features and you can simply target your Segment to receive the true or false variation. We’ll be featuring Segments in a future post, so stay tuned for more.
To wrap things up, there are a number of different ways targeting users can deliver a more valuable experience. We visited just a few above, but the opportunities here are endless. At the end of the day, you’ll target users based on what makes the most sense for your product. We’d love to hear your thoughts and how you think user targeting can help. Until then and for more pointers, you can visit our documentation here.
Last week I attended the QCon London conference from Monday to Wednesday. It was a thoroughly interesting and informative three days. The sessions I heard ranged from microservice architectures to Chaos Engineering, and from how to retain the best people to new features of the Windows console. But there was one talk that really stood out to me—it took a hard look at whether we are Software Developers or Software Engineers.
“QCon London is a conference for senior software engineers and architects on the patterns, practices, and use cases leveraged by the world’s most innovative software shops.”
QCon London describes itself as a software conference, not necessarily a developer conference. It focuses more on the practices of creating software as opposed to showing off the latest and greatest frameworks and languages, and how to work with them. This came through in the talks I attended where most showed very little code and focused on how we work as teams, how we can interact with tools, and the idea that how we treat our code can have a huge impact on what we ultimately deliver.
With that in mind I went to a talk titled “Taking Back ‘Software Engineering’” by Dave Farley. I wanted to understand the differences he sees between being a software developer and an engineer, and learn how those differences can help create better code. During his 50 minute presentation, Farley outlined three main phases of production we have gone through. The first was craft, where one person builds one thing. The next was mass production, which produced a lot of things but wastefully resulted in stockpiles of products that weren’t necessarily used. The final type of production was lean mass production and Just In Time (JIT) production. This is the most common form of production today and is possible because of tried and tested methodologies ensuring high quality and efficient production lines. JIT production requires a good application of the Scientific Method to be applied to enable incremental and iterative improvements that result in a high-quality product at the end.
Without this JIT production approach and the Scientific Method, Farley pointed out that NASA would never have taken humans to the moon and back. It is only through robust and repeatable experiments that NASA could understand the physics and engineering required to build a rocket that could enter Earth’s orbit, then the moon’s, land humans on the moon, and then bring them back to Earth. It’s worth noting that NASA achieved this feat within 10 years of when President John F. Kennedy declared the US would do it—at which point NASA had not yet even launched an orbital rocket successfully.
Farley surmised that engineering and the application of the Scientific Method has led to some of humanity’s greatest achievements, and yet when it comes to software there is a tendency to shy away from the title of “Engineering”. For many the title brings with it ideas of strict regulations and standards that hinder or slow creativity and progress rather than enable them. To demonstrate his point Farley asked the audience, “How many of you studied Computer Science at university?” Most of the room raised their hand. He followed up with, “How many of you were taught the Scientific Method when studying Computer Science?” There were now very few hands up.
Without a scientific approach to software development it’s perhaps an industry that follows a craft-like approach to production where because it works, it’s good enough. For example, if a software specification was given to several organisations to build, one could expect widely different levels of quality with the product. However, the same cannot be said of giving a specification for a building, car or rocket to be built—they could appear different but would be quality products based on rigorous tests and standards.
Farley went on to talk about how Test-Driven Development and Continuous Delivery are great at moving the software industry to be more scientific and rigorous in its testing standards. Though they are helping the industry to be better at engineering, there is perhaps another step needed—Hypothesis-Driven Development (HDD)—to truly move the industry to being one of engineers instead of developers.
Through HDD, theories would be created with expected outcomes before the software development aspect was even considered. This allows some robust testing to do be done further down the line, if the hypothesis stands up to the testing then it can be concluded that this appears to be correct. Further testing of the same hypothesis could be done too, allowing for repeatable tests that demonstrate the theory to be correct. The theories could be approached on a highly iterative basis, following a MVP like approach, if at any point the theory no longer holds up then the work on that feature could be stopped.
The theories wouldn’t need to come from developers and engineers themselves, although they could, but could come from other aspects of the business and stakeholders who request work to be done on the products being built. This would result in more accountability for what is being requested with a clear expectation around the success criteria.
Whilst I and my colleagues apply some of these aspects to the way we work, we don’t do everything and don’t approach working with software with such a scientific view. I can see clear benefits to using the Scientific Method when working with software. When I think about how we might better adopt this way of working I am drawn to LaunchDarkly.
We use LaunchDarkly at work for feature rollouts, changing user journeys and for A/B testing. The ease and speed of use make it a great tool for running experiments, both big and small. When I think about how we could be highly iterative with running experiments to test a hypothesis, LaunchDarkly would be an excellent way to control that test. A feature flag could be set up for a very small test with strict targeting rules, and if the results match or exceed the hypothesis then the targeting could be expanded. However, if the results are not matching what was expected, then the flag could be turned off. This approach allows for small changes to be made, with minimal amount of time and effort being spent, but for useful for results to be collected before any major investment was made into developing a feature.
I found Farley’s talk at QCon London to be an interesting and thought provoking look at how I could change the way I work with software. I’m now thinking about ways to be more scientific in how I approach my work, and I think LaunchDarkly will be a very useful tool when working with a Hypothesis Driven Development approach to software engineering.
In February Steven Czerwinski, Head of Engineering at Scalyr, spoke at our Test in Production Meetup. This session was focused on monitoring and observability while testing in production, and Steve shows why he feels monitoring is an important element within that process. If you’re interested in joining us at a future Meetup, you can sign up here.
Steve presented a case study around latency issues a Scalyr customer recently faced. He shares how his colleague, John Hart, explored the issue, and then reviews some key learnings realized after the event.
“Monitoring is so important to testing in production. I want to evoke the idea that you need to get your monitoring in place before testing in production. If you’re not really monitoring, you’re not really testing—you’re just hoping that things go right.”
Watch his talk below.
Thanks both to Andrea and Heavybit both for organizing this. This is a topic near and dear to our hearts at Scalyr. As Andrea said, my name is Steven Czerwinski, I’m Head of Engineering at Scalyr. Tonight I’m actually going to be presenting some work done by my colleague, John Hart.
In particular tonight, I’m going to talk about some, essentially lessons that John uncovered while he was performing the deep dive into some query latency issues that one of our customers was facing. And, with this, I’m not going to focus on the particular performance issues that he uncovered. What I’d rather like to do is talk about how our monitoring impacted that investigation. Often our monitoring actually helped out with the investigation, as you would hope it would. It helped make the investigation smoother, uncovered some interesting things. However, there are other times where our monitoring actually got in our way, where it actually hindered and misled us in the investigation. And those are the more interesting examples. In retrospect, those issues were because we violated some known best practices with monitoring. And this happens. So what we want to do is go down to the specifics of those examples to better reinforce that it’s a good idea to follow good practices in monitoring.
And, in general, monitoring is so important to testing in production. The little bit of mashup title that I used that Andrea was referring to, I really want to use that to invoke the idea that you really have to get your monitoring in place, before you do testing in production. If you’re not doing monitoring, then you’re not really testing, you’re just hoping that things go right. John Hart also likes to talk about this idea of performance dark matter. When you’re running a complex distributed system, like we do, there’s a lot of performance dark matter that’s just kind of hidden in the system. And it’s only through best practices in monitoring that you can really shed light on that dark matter and figure out what’s going on.
This slide here illustrates the problem that John was looking at. One of our customers, I’m going to refer to them as Acme Corp, just to protect the innocent, was facing bad query latency for certain class of their queries. And here in this graph, you can kind of see that. The blue line is the average latency for Acme Corps queries for this class of queries over time. The red line is for all customers other than Acme Corp. You can see that for Acme Corp, we had times where the latencies were picking over five, ten seconds. And for us, that’s unacceptable. We really strive to make sure that 99% of our query … customer’s query latencies are answered in less than 1 second. For us, this is a challenging problem. We have some very large customers. This customer, in particular, sends us tens of terabytes of logged volume every day. We have hundreds of their engineers logging in every day, issuing queries in order to see what’s going on in their system.
Now before I dive into the details of the best practices, I want to give a little bit of overview of our backend system, because it’s going to give you a little bit of information to put the rest of the best practices in context. One of the fun things that we’ve done at Scalyr, is we’ve actually built our own no SQL database from scratch, that’s optimized for this particular problem domain. And for us, this is one of our competitor advantages. It is what allows us to give orders of magnitude better performance than a lot of our other competitors. And, for our no SQL design, we followed a lot of the normal way that other no SQL databases are structured like Bigtable, Cassandra, that sort of thing.
For us, we take all of a particular account’s data, which is essentially the logs coming in for that account and we break it up into small five minute chunks, which we refer to as Epochs. Each Epoch is assigned to a particular shard of servers in our system. And we sprinkle these Epochs for an account all across all of these Shards that we have in our infrastructure. To answer a query, we … at the account master, receive the query from our customer. We … the account master knows where all the Epochs are stored, what Shards are … hold all the appropriate data. And the account master forwards the query to the appropriate Shards in order to execute the query on the appropriate Epochs. Now, in our world a basic Shard of servers has both masters and slaves. And a given query can be satisfied at either a master or a slave.
The right side of the diagram blows up the query engine of a particular slave or master. And, in here, you can see that there’s a few block diagrams in the flow of executing a query. One of the first things that happens when a … the server receives the query, is to do some admission control policies. So this is enforcing rate limits, in order to make sure customers aren’t abusing our system, acquiring account blocks, that sort of thing. After the query passes admission control, then it gets farmed off to the query execution engine, the query execution engine essentially tries to execute the query over data in a RAM cache. In order to satisfy that query, often you have to pull in Epochs or the data blocks that make up Epochs into that RAM cache. So that’s why you see us pulling in blocks off disk into the RAM cache.
Just to briefly talk about some of the things that did work well for our monitoring. First of all, we actually already have an A/B testing framework for our queries. On a per query basis, we can apply different execution strategies in order to experiment with effects of small modifications. We actually have this integrated with our logging, as well. So, very quickly, we can be able to analyze the difference of … the effects of different execution strategies on query latencies. One of the other things that we do is we’re very careful about how we run our experiments. John is a big believer on markdown files, so every time he starts up a new investigation, starts up a new markdown file along with a new dashboard. Everything that he does during the experiment, gets dumped in there. He uses our dashboards and our Scalyr command line interface extensively in order to populate information to that markdown file and to add results to the dashboard.
And finally, one of the other things that we have in our system, is the ability to modify the server configuration on the fly. So all these experiments that we’re running, all these things that we’re doing in order to test our strategies on our real users queries, we can adjust over time, through some simple updates.
Alright, let’s talk about the more interesting points. The monitoring lessons that we essentially had to relearn. So the first lesson I want to talk about is the importance of consistency. And the other way I like to think about this lesson is that, there should be no surprises in how information is communicated. When you look for a piece of information, it should exist where you expect it to exist. It should exist in the form that you expect it to. And, the performance issue that really reinforced us for this, was the discovery of an RPC rate limit gate issue that we had. In our system, I kind of eluded to earlier, we have rate limits that are applied to all incoming queries to make sure that there’s no abuse. We don’t want to have too many queries from one particular customer executing on the query engine, because they’re getting unfair advantage of the system then.
So, normally what happens is the gate keeps track of the number of current queries that are being executed per second. If it exceeds a certain threshold, then the gate will artificially block a given query in order to slow it down. Now, it turned out for Acme Corp, we were actually experiencing wait times of multiple seconds or more at the gate. And this was a big contributor to their latency. It was slowing them down. But, we didn’t notice the issue that quickly, which is surprising because actually all the information that we needed was in the logs, we just didn’t see it. Let’s talk about why.
Essentially it boils down to multiple issues with consistency. First of all, we had inconsistency with how our metrics were laid out. We already did have a good model for reporting query latencies broken down by various sub components. We had a systematic way of reporting that. But for this feature, this RPC rate limit, it was not part of the query system. It was part of the RPC subsystem. So we actually reported it in a different way. And, when it came down to it, we were looking at the breakdown of the query latencies, we were just missing the fact that there was time stalled out while we were waiting for the gate.
Now, we actually did have the gate wait latency in the logs. In fact, John even thought to check it out. He had a long list of all the places where we could be missing performance. And he did some manual checks. He knew what he was looking for in the logs in order to check to see whether or not this was an issue or not. And he did some scans and saw numbers like four and five and he’s like, “Oh, okay, four or five milliseconds, that’s fine”. That’s not contributing to the multiple seconds that we’re seeing.
But, the problem here was that latency was actually being reported in seconds and it was inconsistent with how we report most of our latencies. Everywhere else in the system, we report them in milliseconds. But here we were being misled by our results, because we were just inconsistent with units.
Okay. This is just kind of before and after actually. So John, after figuring this out, did some fixes to how we handle the gate and the red bars are essentially the counts or number of times that the waits at a gate are exceeded one second. That was before the fix. And the blue are the number of times we waited after the fix. So you can see there’s a significant reduction.
The next lesson. The second lesson we learned. Essentially it boils down to what I like to describe as you have to analyze based on what matters. You have to have … when you design your monitoring, you really have to think about what really matters in terms of the behavior of your system. Another way people talk about this is averages versus distributions. And I’ll explain that more in a minute. The performance issue that reinforced this lesson was an issue we were having with our RAM block cache utilization. I mentioned earlier that in order to execute a query for a given Epoch, all the blocks for that Epoch have to be read into a RAM cache.
Well, it turned out that because of an odd interaction with how we decide what Epochs should be executed on what … on the master versus the slave and how we had architected this structure of our RAM cache, we were only using half of the cache at the time. And essentially, just to give you a little more detail, our RAM cache was actually composed of numerous two gigabyte pools. And it turned out that if we had an odd number of RAM pools, then only the even RAM pools were being used on the masters and only the odd RAM pools were being used on the slaves. And it was just because of this odd interaction. And, but it resulted in the fact that we were effectively using only half of our RAM for RAM … I’m sorry, half of our RAM for our cache. We had 50 gigabytes delivered to that cache, we were only using 25.
And so, why did it take so long for us to figure this out. It comes down to the fact that we were measuring the wrong thing. We had some metrics that we were looking at that would have uncovered this sooner. We had a dashboard that essentially talked about the cache fill rates. How many blocks were we inserting into the cache of the second? If there was a problem, if we weren’t really utilizing the cache, this would have dropped to zero. And so we look at this graph. We look at the average of the cache fill rates across all the RAM pools. Everything looks fine, okay, we’re inserting blocks at a pretty decent rate.
However, this graph tells a different story. And what this graph shows, is it’s the graph fill rate for all the odd number RAM pools, that’s the one in blue. And the even is in the red. You can see right there, that there’s a huge difference between the fill rates for the even RAM pools and the odd RAM pools. And what this really gets down to, is what really matters? It doesn’t matter that we’re inserting blocks into the cache at a decent rate. What matters is that we’re inserting blocks in for all the RAM pools. All the RAM pools were effectively having blocks added. And so this is where I get … you get to the idea of averages of distributions. You can’t take the average across something. You really, in some cases, have to look at the distributions where that matters.
Okay. After 20 hours of investigation, basically this boiled down to a single character fix for John. The easy fix was just actually changing the shared RAM pool count down from 26 to 25 to give an odd one. And so, in effect, it ended up reducing the total RAM that we’re using for our cache, but actually resulted in more blocks being cached as anti-intuitive as that is. And that’s it. Those are the lessons I wanted to go over.
If you want to learn more about our system, feel free to visit our blog. The obligatory, we’re hiring. And that. So …
In February, we invited New Relic Developer Advocate, Clay Smith, to our Test in Production Meetup to talk about instrumenting CI pipelines. If you’re interested in joining us at a future meetup, you can sign up here.
Clay took a look at the three pillar approach in monitoring—metrics, tracing, and logging. He wanted to explore what tracing looks like within a CI pipeline, and so he observed a single run of a build with multiple steps kicked off by a code commit.
“I wanted to try and apply some of this stuff to understanding AWS CodePipeline that I was using to build some Docker images. The question that I wanted to explore and kind of get into and try to learn more about observability by trying to build something is, if we take this really interesting ideas that were outlined in these posts and apply them to a CI tool, what can we learn and what does that actually look like?”
Watch his talk below.
I’ve had this very long and checkered history with attempting to build CI pipelines. My background’s engineering, not release or operations. It’s been a mixed bag of trying to build pipelines most disastrously with iOS and the Xcode build server. Trying to build something fast and reliable to do these checks that makes it easier to actually deliver software.
I revisited that fairly recently after spending a lot of time in 2017 reading a lot about this notion of observability and just going over some really interesting material on that. The inspiration for this was basically three things I read, kind of what was my reading list in 2017 for observability.
The really interesting thing is a lot of the really interesting posts and thought leadership I guess you could call it, has been very much centered in San Francisco. I think we can more or less blame Twitter to some extent for it.
Back in September 2013, they described the situation where Twitter was undergoing rapid growth. They were having issues managing and understanding their distributive systems. They introduced this notion of observability, which isn’t necessarily something new, but it was new in this kind of IT distributive systems context.
In 2017, there were two really great posts I highly recommend you read. They were pretty widely circulated. The first was from Copy Construct’s Cindy Sridharan, she wrote a really amazing post that kind of described that these three things, metrics, logs, and traces are really central to the notion of understanding the work your system does.
We had the three pillars conversation or post, and then slightly before that this Venn diagram from Peter Bourgon. I thought these posts were super cool because again my background isn’t necessarily in operations and caring really deeply about log, or metric, or trace data. I thought the way they presented these ideas was super interesting.
In particular, this Venn diagram that was presented in this post, I thought was really interesting because it got this idea that when we’re talking about metrics, or logs, or traces, which we heard about in the previous talk, there is some sort of relationship between all of them.
I had a couple days right before New Years, and I wanted to try and apply some of this stuff to understanding AWS CodePipeline that I was using to build some Docker images. The question that I wanted to explore and kind of get into and try to learn more about observability by trying to build something is, if we take this really interesting ideas that were outlined in these posts and apply them to a CI tool, what can we learn and what does that actually look like?
I was at Re:invent this year, which was very, very large, I think around 50,000 people. There was a really cool dashboard that Capital One was showing off. I took a photo on my phone, it’s open source. I think they were calling it something like the single view of the DevOps pipeline.
They have some really interesting metrics and graphs around things like build failures, what’s the ticket backlog, what’s the build speed in success, things you would expect. Typically, if you use Jenkins or all these other tools, there’s almost always a way to inspect log output.
Taking the three pillar approach, it seemed like in this view and in other common systems and tools, there wasn’t much necessarily going on with getting a trace of what work is actually going on inside some sort of CI pipeline.
I really wanted to explore that and try and build something in a day or two. The one thing that I kind of changed from the Venn diagram, instead of scoping a trace to a request, what if we just scope it to a single run of a build. Multiple steps kicked off by something like a code commit.
I was using AWS CodeBuild at the time, this is managed infrastructure from AWS. How it works is you have a YAML file, you can give it a container, and you basically give a script. It can do things like build an image, compile code, you can configure it in a lot of different ways.
The infrastructure itself, like a lot of AWS services, is fully managed so there’s nothing to SSH into. You don’t have access to the host, no root privileges. You’re kind of just locked into that container environment, similar to SaaS based CI tools.
What I wanted from that, as it goes it through it’s build steps, I want the trace view. One of the things that I had a lot of fun doing was I realized there was no way I could really natively instrument the code build process. It’s fully managed by AWS, they’re not going to give me access to the code.
Inspired by the diagram, if you can log an event and if you can log the relationship between different events, you can get something that kind of approximates traces. I just wrote a really stupid thing, there’s a verb at the front, you capture different events, and you’re writing it to a file.
The idea there is you’re writing this formatted log, you’re doing this as each build step progresses. You can have write access to the file system in CodeBuild so nothing big there. From there, we can actually build these traces. There was also a huge hack, so you could actually capture those events in real time. It would just hail the log file that you’re writing events to, and send it up to the back end, which in this case is just New Relic APM.
Once all that’s in place, you can actually get this tracing specific view of different events inside the AWS CodeBuild pipeline. It’s really interesting because all of this stuff was designed very much for an application. I think this view has been around in New Relic for more than seven years.
When you apply it to the pipeline, you actually still get some pretty interesting views of what’s going on. First is just the frequency and duration, but then you actually see the breakdown in time between each step. Not surprisingly, the build step which is actually building the Docker image takes the most time.
From there, because we’re actually building a Docker container, we know from what commits and source control actually builds the image, and we use that to actually connect it to production performance.
The hack, or the trick, or the thing here with instrumentation is when it’s actually building the Docker image, we tag that trace with the get commit hash of what’s actually being built. When we run that code in production, we also capture that as well so we have traces of how the code is behaving in production. We also have a trace of how that build artifact, that Docker container that’s running in production, was actually being built.
Here you have this interesting view of you see code running, this is different deploys, there’s a spike as [inaudible 00:07:50] scales up and down and all that. You also see next to it what was actually happening when that Docker image was being built in the first place.
An interesting connection between connecting these potentially complicated processes of actually building the image that you’re going to get gradually deployed to production. If you can annotate both traces with something like a git commit hash or a version number, you can connect them together, which I think is kind of interesting.
To wrap up this experiment, I think we talk more and more to different customers and people that are building very complex pipelines. Often at the end of that pipeline, there’s a very complex deploy strategy. Blue green, I read a really interesting post the other day that was talking about, this is a blue green rainbow deploys, 15 colors, or 26 colors. Canary deploys, lots of different strategies.
With that complexity, it feels like the stuff that we all know and are hearing about managing systems who need services could potentially apply in some respects to complex pipelines too. I think this idea of understanding and monitoring your production performance and then being able to have some relationship where you connect it back to whatever it was that built it, ideally ran through some automated tests, test suites, that seems pretty interesting too.
It was a really fun exploration. It was fun to get my hands dirty with these ideas around observability. So many people that go through this to learn about it, it seems really important and also really interesting. Looking forward to continuing the conversation about how people are attacking this and applying it to things we’re all building.
At LaunchDarkly we host a monthly Meetup for anyone interested in testing in production. At our Meetup in February, we focused on monitoring and observability while testing in production. If you’re interested in joining this Meetup, sign up here.
Priyanka Sharma, Product Marketing, Partnerships & Open Source at LightStep, kicked off the event with a discussion around how she sees tracing as an essential tool for testing in production. She pointed out how software systems have become more complex in recent years, especially with the rise of CI/CD and microservices.
“There’s an explosion of data. The more services there are, the more touchpoints there are, and the more data you have to collect about what’s happening in your system. It’s very hard to manage all of this.”
Watch her talk to see how distributed tracing can help teams get a better understanding of their systems and the responses they’re seeing when testing in production.
Hi, everybody. How’s everyone doing?
Awesome, awesome. I love Heavybit, just like Andrea. LightStep also was here until a few months ago and we miss the space and the community. And today, I’ll be talking about tracing being an essential tool for testing in production. So before we deep dive in to tracing, let’s take a brief overview of what are the big challenges in production, especially around debugging and performance.
As people here probably know better anybody else, software is changing. Software workflows are changing, especially with the event of CICD and also microservices. Things are done completely different now. This diagram you see here, the first one, is what a monolithic architecture would look like where it’s one giant box and a request from start to finish and its lifecycle goes through various parts of this box, which, while it’s huge and has a lot inside it is at least contained and viewing something through it was much easier.
But now as systems are breaking up into fragments whether they’re microservices or larger services, however you call your service or you call it core projects, whatever, the point is that there’s more and more complexity introduced in today’s software systems.
What also happens with this is that there’s an explosion of data. The more services there are, the more touchpoints there are, and the more data you have to collect about what’s happening in your system. It’s very hard to manage all of this.
This is where distributed tracing can help. So if you have anomalies or outliers that you want to detect, it gets very challenging in that very fragmented system over here because is it in that corner, on this corner here? Where is the actual problem? With distributed tracing, you can examine the exact edge case and see what happened end to end in that request lifecycle. If you’re running a distributed system, out of curiosity, how many people here have a distributed system with at least two services? All right. A bunch of you. So you know what I’m talking about where that observability can be really lost the minute there is any fragmentation. So with tracing, you can get a full picture off your system from the plans all the way to the back end and the responses that come out.
How many of you here run CICD? All right. A bunch here too. So when you’re running CICD, it’s great that you can ship faster, but also there can be problems with your bills and you need to understand what is the issue. Often times, it’s not the code, but actually the resources being utilized. And with tracing, you can pinpoint that as the issue if it is the case.
Andrea mentioned gradual software rollouts with feature flags, things like that. If you’re doing that, you want to analyze the full transaction data off each rollout to look at the performance, see what errors were there, and make informed choices about the next rollout that happens and that is, again, something that distributed tracing can provide you.
So just to make sure we’re all on the same page, before deep diving further, I’d like to do a quick intro of distributed tracing as I’m defining it in this context and make sure we’re all aligned. So what is distributed tracing? So tracing as a technology has actually been around since the 70s. But what’s brought it mainstream into industry is the coming of distributed systems into the internet age.
So then you need to know the lifecycle of a request and to end in a distributed system, you use distributed tracing. It’s a view of the request as it travels across multiple parts of the systems and come back out with the response. This here is an example of a relatively simple request going through a small system. So you can imagine if you 10x it how big this will be and how complicated it’ll be to follow the path of a request.
So what’s a trace exactly? A trace is the entering life cycle of a request and its composed of parts that are called spans. Now a span is a named timed operation that represents a piece of workflow. So think of it as Point A to Point B and then there’s a jump in the context to a different span and then Point C to D and it goes from there. This diagram here, it should give you a little bit of a sense of how a trace is structured where there’ll be the parent span that starts a lot and there can be child spans or logs, etc. in here. So you can add in a lot of information that is important to you and your system. So there’s a lot of flexibility. Ultimately though, the TL;DR of this all is that the trace is a collection of spans.
So here, I’ve tried to visualize what a request looks like in a very simple architecture and then the trace view. So on, I guess your left, my right, your left, there’s a very simple service. There’s a client then a rep server that talks to an Auth service, billing, and a database. If you’re looking at the trace we offered, which is my left, your right, that’s on the time continuum. And the top you see over there, the client, or the root span is the top most thing because that is the beginning and end of the request. And these dashes that you see are all spans of work that happened within this whole trace. So this can be quite useful when you’re debugging a problem because you can deep dive exactly is the problem and the process workflow or in the art in the DP, client, whatever.
So that’s great, right? You can deep dive into things, you can append information and logs and baggage. There’s so much you can do with tracing and systems of fragmenting so much as changing. So then why is tracing not ubiquitous? The key reason here is that tracing instrumentation has just been too hard. Until recently, most of the instrumentation libraries were vendor specific. And so if you went through all the work of instrumenting each part of your code base, you were connected to this vendor and you didn’t really have a choice to change options should this one not work out for you.
Then monkey patching is insufficient. So an automated agent can do so much, but you need instrumentation by humans who understand the system and the problems that they run into for it to be a comprehensive system.
Then there’s the problem of inconsistent APIs. Today’s systems… How many of you use more than one language in your code base? A bunch of you. So the minute there’s more than one API, if you have to transfer between different APIs or different languages, which don’t work nice together, you have a problem because you’re only seeing a snapshot of your system. And the same applies to open source projects. So most people’s code is a ton of open source projects with Glue code, that’s your application code, right? And so in that situation, if you’re not seeing anything through those project, then you’re, again, flying in the dark. And tracing libraries for different projects if they’re not the same … You have the same problem that you had with different language APIs that don’t play nice with each other. So all this is a lot of difficulty and dissuaded most developers from going into tracing.
And this where open tracing, the project I work on is your ticket to omniscience. So the open tracing API is a vendor-neutral open standard for distributed tracing. It’s open source and part of the Cloud Native Foundation, as I mentioned.
It enables developers to instrument their existing code for tracing functionality without committing to an implementation in the start. So you could use any tracer and visualization system such as LightStep, Zipkin, Jaeger, New Relic, Datadog. Whatever choice you have, if they have a binding for open tracing, you can go ahead and swap out with just a change in the main function.
And this standard has been doing really well and part of the reason why I that it’s backed by folks who’ve been around the tracing landscape since it became mainstream industry. So Google release this project called Dapper, which it’s in production tracing system that was created by Ben Sigelman and a few of his colleagues. And they released a paper about it in the 2000s sometime. Inspired by that, Twitter built Zipkin, which it open sourced around 2009. And then now these people who’ve worked on these project including Ben Sigelman, including other folks from industry like Yuri Shkuro from Uber, other folks came together to build the open tracing spec in 2016 and it was built by people who had solved these problems before, had lived these problems multiple times, and the result is there for others to see.
So let’s look at the open tracing architecture. This is it. Jk. So there’s the open tracing API. It speaks to the application logic, the frameworks, the RPC libraries, you name it and they all connection to open tracing. And open tracing talks to the tracing visualizer off choice. Here, the examples listed like Zipkin, Lightstep, Jaeger, very, very few. As I mentioned just before, multiple vendors such as New Relic, Datadog, Dynatrace, and [inaudible 00:09:57] have created bindings for open tracing. So the options for the end user community are very many.
A lot of end user companies are finding this very useful in production. So this logo wall is by no means complete. It’s all kinds of companies from hipster engineering cultures to large scale enterprises to somewhere in between. You see the Under Armors of the world. You also see the Lyfts and the CockroachDB, etc.
Along the way, the open source site has really pushed the project along too. The open source project maintainers have adopted open tracing and accepted PRs for it because this an opportunity for them to allow their users to have visibility through the system without having to build the whole thing themselves. Some notable ones will be Spring by Pivotal, GRPC, and there’s a bunch of language libraries fueled by this open source and end user adoption vendors have been jumping onboard.
So that all sounds great. Clearly there’s some social proof, people seem to like it, but what does this really mean? Let’s look at open tracing in action with some traces. So I’m now presenting the next unicorn of Silicon Valley, Donut zone, which is a DaaS, Donuts as a Service. It’s the latest innovation, it’s donut delivery at its very best, it’s backed by Yummy Tummy Capital with a lot of money. This is the beautiful application we’ve built and through this, you can order donuts. Now I’m going to be asking you guys for help very soon to help me create a success disaster on donut zone.
Before that, I quickly want to explain the DaaS architecture, which is built to move fast in big things so that you know when we look at traces, what we’re really looking at. So we have bootstrapped with a single donut mixer for our topper, start up life. But the good news is that we’ve built with open tracing. There’s an open tracing of our Mutex wrapper that’s included in this. So it will hopefully help us debug any issues that may come up. But who knows?
So now we are going to get on to the success disaster part of the piece. So if you can please pull out your mobile phones or laptop, whatever you may have and go to donut.zone. I’m going to do this too. And then once you’re there at donut.zone, just start ordering a bunch of donuts, as many as you would like in a dream world where there are no calories in donuts and they were just like water but tasted like they do. So let’s get started. Click away because if you don’t do this, there will be no good traces to see and so this is all on you. Okay, so tap, tap, tap. Click, click, click. Okay, let’s just do a little more for good measure. Don’t be shy. Click a lot. Okay, great. So thank you so much.
Now let’s look at some traces. I have a bunch of traces here. Now you’ll see that this is a full trace and the yellow line that you see here … Can people see the yellow line? The yellow line is the critical path, which means the things that … the blocker, so to speak, for the process to be completed. So here you see that the donut browser is taking up in this specific trace, which is 10.2 seconds long. Seven seconds are gone just by the browser and client interaction. What does that tell us? Something that’s not every surprising, which is that the internet’s a little slow. So nothing moved beyond the client until seven seconds in. Then you see here this is like the payment is being processed because you paid fake money for your donuts.
Then here is an interesting one. Here we see that from 7.6 seconds to 9 second was taken up by the donut fryer. But, the interesting point here is that the Mutex acquire says that that itself took close to about, I would say, two and a half seconds, which mean … And here in the tags, you will see that I was waiting on six other donut requests. This is where you created the success disaster by clicking so many times. So this request was just waiting for its chance to be fried and that took up the most time. So that should tell you that if you were debugging over here, that it’s a resource allocation issue as opposed to an issue in your code. The code that worked just took this tiny amount of time. So you’ll be better off optimizing by adding a fryer. And this is something you knew right away because you had this distributed trace available and you could debug on the moment to figure out why is it so slow to get these donuts from donut.zone.
So I hope you found this useful. For those of you who are interested in seeing those kinds of traces for your system and debugging efficiently and quickly, I recommend going to open tracing.io to learn about the project. We have documentation, we have links to all our GitHub Repos, and lots of advice, etc. But if you’re interested more in the community aspect, you want to talk to people, hear their experience, then I’d recommend the Gitter community where we all hang out. It’s gitter.im/opentracing. This is where people help each other and you can get good advice on how to get started. I hope you found this useful. If you have any questions, feel free to reach out to me, priyanka at lightstep.com. I am also on Twitter at @pritianka. Reply really quickly. So please reach out, check out distributed tracing, and let me know if I can help in any way. Thank you so much.
Josh Willis, an engineer at Slack, spoke at our January MeetUp about testing machine learning models in production. (If you’re interested in joining this Meetup, sign up here.)
Josh has worked as the Director of Data Science at Cloudera, he wrote the Java version of Google’s AB testing framework, and he recently held the position of Director of Data Engineering at Slack. On the subject of machine learning models, he thinks the most important question is: “How often do you want to deploy this?” You should never deploy a machine learning model once. If the problem is not important enough to keep working on it and deploy new models, then its not important enough to pay the cost of putting it into production in the first place.
“The tricky thing, though, is in order to get good at machine learning, you need to be able to do deploys as fast as humanly possible and repeatedly as humanly possible. Deploying a machine learning model isn’t like deploying a regular code patch or something like that, even if you have a continuous deployment system.” -Josh
Watch his entire talk below.
How’s it going, everybody? Good to see you. Thanks for having me here. A little bit about me, first and foremost. Once upon a time, I was an engineer at Google. I love feature flags, and I love experiments. I love A/B testing things. I love them so much that I wrote the Java version of Google’s A/B testing framework, which is a nerdy, as far as I know … I don’t know. Does anyone here work at Google? Any Googlers in the audience? I know there’s at least one because my best friend is here, and he works at Google. As far as I know, that is still used in production and probably gets exercised a few trillion times or so every single day, which is kind of a cool thing to hang my nerd hat on.
I used to work at Cloudera, where I was the director of data science. I primarily went around and talked to people about Hadoop and big data and machine learning and data science-y sorts of things. I am Slack’s former director of data engineering. I’ve been at Slack for about two and half years. I am a recovering manager. Any other recovering managers in the audience? I was going up the management hierarchy from first line management to managing managers, and I started to feel like I was in a pie eating contest, where first prize is more pie. I didn’t really like it so much. Wanted to go back and engineer. So about six months ago, I joined our machine learning team, and now I’m doing machine learning-ish sorts of things at Slack as well as trying to make Slack search suck less than it does right now. So if anyone’s done a search on Slack, I apologize. We’re working hard on fixing it.
That’s all great, but what I’m really most famous for … Like most famous people, I’m famous for tweeting. I wrote a famous tweet once: Which is a proper, defensible definition of a data scientist? Someone who is better at statistics than any software engineer and better at software engineering than any statisticians. That’s been retweeted a lot and is widely quoted and all that kind of good stuff. Are there any … Is this sort of a data science, machine learning audience or is this more of an engineering ops kind of audience? Any data scientists here? I’m going to be making fun of data scientists a lot, so this is going to be … Okay, good. So mostly, I’ll be safe. That’s fine. If that guy makes a run at me, please block his way.
So anyway, that’s my cutesy, pithy definition of what a data scientist is. If you’re an engineer, you’re sort of the natural opposite of that, which is this is someone who is worse at software engineering than an actual software engineer and worse at statistics than an actual statistician. That’s what we’re talking about here. There are some negative consequences of that. Roughly speaking at most companies, San Francisco, other places, there are two kinds of data scientists, and I call them the lab data scientists and the factory data scientists. This my own nomenclature. It doesn’t really mean anything.
So you’re hiring your first data scientist for your startup or whatever. There’s two ways things can go. You can either hire a lab data scientist, which is like a Ph.D., someone who’s done a Ph.D. in statistics or political science, maybe or genetics or something like that, where they were doing a lot of data analysis, and they got really good at programming. That’s fairly common data science standard. A lot of people end up that way. That wasn’t how I ended up. I’m the latter category. I’m a factory data scientist. I was a software engineer. I’ve been a software engineer for 18 years now. I was the kind of software engineer when I was young where I was reasonably smart and talented but not obviously useful. I think we all know software engineers like this, smart, clearly smart but not obviously useful, can’t really do anything. This is the kind of software engineer who ends up becoming a data scientist because someone has an idea of hey, let’s give this machine learning recommendation engine spam detection project to the smart, not obviously useful person who’s not doing anything obviously useful and see if they can come up with something kind of cool. That’s how I fell into this field. That’s the two kinds. You’ve got to be careful which one you end up with.
Something about data scientists and machine learning. All data scientists want to do machine learning. This is the problem. Rule number one of hiring data scientists: Anyone who wants to do machine learning isn’t qualified to do machine learning. Someone comes to you and is like, “Hey, I really want to do some machine learning.” You want to run hard the other direction. Don’t hire that person because anyone who’s actually done machine learning knows that it’s terrible, and it’s really the absolute worse. So wanting to do machine learning is a signal that you shouldn’t be doing machine learning. Ironically, rule two of hiring data scientists, if you can convince a data scientist that what they’re doing is machine learning, you can get them to do anything you want. It’s a secret manager trick. It’s one of the things learned in my management days.
Let’s talk about why, briefly. Deep learning for shallow people like ourselves. Deep learning, AI, big stuff in the news. I took a snapshot here of the train from my favorite picture, “Back to the Future, Part III,” a truly excellent film. Machine learning is not magic. Machine learning is, it’s basically the equivalent of a steam engine. That’s really what it is, especially deep learning in particular. What machine learning lets us do is stuff that we could’ve done ourselves, manually, by hand over the course of months or years, much, much, much faster in the same way a steam engine lets us move a bunch of rocks from point A to point B. It’s not something we couldn’t do. We knew how to move a bunch of rocks from point A to point B. That’s how we built the pyramids and stuff like that. But this lets us do it much, much faster and much, much cheaper. That’s what machine learning fundamentally is.
There are consequences of that. One of the nasty consequences of it. Machine learning … There’s a great paper that I highly recommend you read by this guy named D. Sculley, who is a professor at Tufts, engineer at Google. He says machine learning is the high interest credit card of technical debt because machine learning is basically spaghetti code that you deploy on purpose. That’s essentially what machine learning is. You’re taking a bunch of data, generating a bunch of numbers and then putting it in a rush intentionally. And then trying to figure out, reverse engineer how does this thing actually work. There are a bunch of terrible downstream consequences to this. It’s a risky thing to do. So you only want to do it when you absolutely have to.
Lab data scientists want to do machine learning. Factory data scientists want to machine learning. Their backgrounds mean they have different failure modes for machine learning. There’s a yin and yang aspect to it. Lab data scientists are generally people who have a problem with letting the perfect be the enemy of the good, broadly speaking. They want to do things right. They want to do things in a principled way. They want to do things the best way possible. Most of us who live in the real world know that you hardly ever have to do things the right way. You can do a crappy Band-Aid solution, and it basically works. That’s the factory data scientist attitude. The good news, though, of people who want to do things perfectly, they don’t really know anything about visibility monitoring, despite knowing a bunch of stuff about linear algebra and tensors, they don’t know how to count things. But you can teach them how to do Graphite Grafana. You can teach them how to do Logstash. They can learn all these kinds of things, and they want to learn, and they have no expectation that they know what they’re doing, so they’re very easy to teach. That’s a good thing.
Factory data scientists have the opposite problem. They’re very practical. They’re very pragmatic. So they’ll build things very quickly in a way that will work in your existing system. However, they overestimate their ability to deploy things successfully the way most not obviously useful software engineers do. As a result, they are much more likely to completely bring down your system when they deploy something. So that’s what you want to watch out for there.
Another really great paper, “What’s your ML test score? A rubric for production ML systems.” I love this paper. This is a bunch of Google people who basically came up with a checklist of things you should do before you deploy a machine learning system into production. I love it. Great best practices around testing, around experimentation, around monitoring. It covers a lot of very common problems. My only knock against this paper is they came up with a bunch of scoring criteria for deciding whether or not a model was good enough to go into production that was basically ludicrous. So I took their scoring system and redid it myself. So you’ll see down there, if you don’t do any of the items on their checklist, you’re building a science project. If you do one or two things, it’s still a science project. Three or four things are a more dangerous science project. Five to 10 points, you have the potential to destroy Western civilization. And then finally, once you do at least 10 things on their checklist, you’ve built a production system. So it’s kind of a u-shaped thing.
This is a great paper. If you have people at your company who want to deploy machine learning into production, highly, highly recommend reading it and going through it and doing as much of the stuff they recommend as you possibly can. More than anything, for the purposes of this talk, I want to get you in the right headspace for thinking about what it means to take a machine learning model and deploy it into production. The most important question by far when someone wants to deploy a machine learning model is, how often do you want to deploy this? If the answer is once, that is a bad answer. You should never deploy a machine learning model once. You should deploy it never or prepare to deploy it over and over and over and over and over again, repeatedly forever, ad infinitum.
If the problem is not important enough to keep working on it and keep deploying new models, it’s not important to pay the cost of putting it into production in the first place. That’s thing one. The tricky thing, though, is in order to get good at machine learning, you need to be able to do deploys as fast as humanly possible and repeatedly as humanly possible. Deploying a machine learning model isn’t like deploying a regular code patch or something like that, even if you have a continuous deployment system. The analogy I would use is it’s kind of like someone coming to you and saying, “Hey listen. We’re going to migrate over our database system from MySQL to Postgres, and then next week, we’re going to go back to MySQL again. And then the week after that, we’re going to go back.” And just kind of like back and forth, back and forth. I’m exaggerating slightly, but I’m trying to get you in the right headspace for what we’re talking about here. It’s basically different machine learning models are systems that are complicated and are opaque, that are nominally similar to each other but slightly different in ways that can be critically bad for the overall performance and reliability of your systems. That’s the mentality I want you to be in when it comes to deploying machine learning models. Think about it that way.
The good news is that we all stop worrying and learn to love machine learning, whatever the line is from “Dr. Strangelove,” that kind of thing. You get good at this kind of stuff after a while, and it really … I love doing machine learning, and I love doing it production in particular because it makes everything else better because the standards around how you operate, how you deploy production systems, how you test, how you monitor have to be so high just across the board for regular stuff in order to do it really, really well. Despite all the horrible consequences and the inevitable downtime that the machine learning engineers will cause, I swear, I promise, it’s ultimately worth doing it, and in particular, companies should do it more so I get paid more money to do it. That’s kind of a self-interested argument.
If you like to do monitoring, if you like to do visibility, if you like to do devOps stuff in general and you want to do it at a place that’s done it really, really well, slack.com/jobs. Thank you very much. I appreciate it.