14 Mar 2018

Tonight We Monitor, For Tomorrow, We Test in Production!

In February Steven Czerwinski, Head of Engineering at Scalyr, spoke at our Test in Production Meetup. The session focused on monitoring and observability while testing in production, and Steve shares why he feels monitoring is an essential part of that process. If you’re interested in joining us at a future Meetup, you can sign up here.

Steve presented a case study around latency issues a Scalyr customer recently faced. He shares how his colleague, John Hart, explored the issue, and then reviews some key lessons learned along the way.

“Monitoring is so important to testing in production. I want to evoke the idea that you need to get your monitoring in place before testing in production. If you’re not really monitoring, you’re not really testing—you’re just hoping that things go right.”

Watch his talk below.


Thanks to both Andrea and Heavybit for organizing this. This is a topic near and dear to our hearts at Scalyr. As Andrea said, my name is Steven Czerwinski, I’m Head of Engineering at Scalyr. Tonight I’m actually going to be presenting some work done by my colleague, John Hart.

In particular tonight, I’m going to talk about some lessons that John uncovered while he was performing a deep dive into some query latency issues that one of our customers was facing. I’m not going to focus on the particular performance issues that he uncovered. What I’d rather do is talk about how our monitoring impacted that investigation. Often our monitoring actually helped out with the investigation, as you would hope it would. It helped make the investigation smoother and uncovered some interesting things. However, there were other times where our monitoring actually got in our way, where it hindered and misled us in the investigation. And those are the more interesting examples. In retrospect, those issues came up because we violated some known best practices with monitoring. And this happens. So what we want to do is go into the specifics of those examples to reinforce that it’s a good idea to follow good practices in monitoring.

And, in general, monitoring is so important to testing in production. The little mashup title that I used, the one Andrea was referring to, is meant to evoke the idea that you really have to get your monitoring in place before you do testing in production. If you’re not doing monitoring, then you’re not really testing, you’re just hoping that things go right. John Hart also likes to talk about this idea of performance dark matter. When you’re running a complex distributed system, like we do, there’s a lot of performance dark matter that’s just kind of hidden in the system. And it’s only through best practices in monitoring that you can really shed light on that dark matter and figure out what’s going on.

This slide here illustrates the problem that John was looking at. One of our customers, who I’m going to refer to as Acme Corp just to protect the innocent, was facing bad query latency for a certain class of their queries. And here in this graph, you can kind of see that. The blue line is the average latency for Acme Corp’s queries for this class of queries over time. The red line is for all customers other than Acme Corp. You can see that for Acme Corp, we had times where the latencies were spiking over five, ten seconds. And for us, that’s unacceptable. We really strive to make sure that 99% of our customers’ query latencies are answered in less than one second. For us, this is a challenging problem. We have some very large customers. This customer, in particular, sends us tens of terabytes of log volume every day. We have hundreds of their engineers logging in every day, issuing queries in order to see what’s going on in their system.

Now before I dive into the details of the best practices, I want to give a little bit of an overview of our backend system, because it’s going to give you the context to put the rest of the best practices in. One of the fun things that we’ve done at Scalyr is we’ve actually built our own NoSQL database from scratch that’s optimized for this particular problem domain. And for us, this is one of our competitive advantages. It is what allows us to give orders of magnitude better performance than a lot of our competitors. And, for our NoSQL design, we followed a lot of the normal ways that other NoSQL databases are structured, like Bigtable, Cassandra, that sort of thing.

For us, we take all of a particular account’s data, which is essentially the logs coming in for that account, and we break it up into small five-minute chunks, which we refer to as epochs. Each epoch is assigned to a particular shard of servers in our system. And we sprinkle these epochs for an account all across all of the shards that we have in our infrastructure. To answer a query, the account master receives the query from our customer. The account master knows where all the epochs are stored and which shards hold the appropriate data, and it forwards the query to the appropriate shards in order to execute the query on the appropriate epochs. Now, in our world a basic shard of servers has both masters and slaves. And a given query can be satisfied at either a master or a slave.

The right side of the diagram blows up the query engine of a particular slave or master. And, in here, you can see that there are a few blocks in the flow of executing a query. One of the first things that happens when the server receives the query is to apply some admission control policies. This is enforcing rate limits, in order to make sure customers aren’t abusing our system, acquiring account locks, that sort of thing. After the query passes admission control, it gets farmed off to the query execution engine. The query execution engine essentially tries to execute the query over data in a RAM cache. In order to satisfy that query, often you have to pull the epochs, or the data blocks that make up those epochs, into that RAM cache. So that’s why you see us pulling blocks off disk into the RAM cache.
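
As a rough sketch of that flow, here is a toy block cache in Python. The structure is hypothetical, not Scalyr’s actual engine: the cache is a fixed set of pools, a block lands in a pool chosen by hashing its id, and a miss is filled by reading the block off disk.

```python
# Toy sketch of "pull blocks into the RAM cache" (hypothetical structure).

class BlockCache:
    def __init__(self, num_pools=26):
        # Each pool stands in for a fixed-size slab of RAM in the real system.
        self.pools = [dict() for _ in range(num_pools)]

    def _pool_for(self, block_id):
        # A block is always cached in the pool chosen by hashing its id.
        return self.pools[hash(block_id) % len(self.pools)]

    def get_or_load(self, block_id, load_from_disk):
        pool = self._pool_for(block_id)
        if block_id not in pool:
            pool[block_id] = load_from_disk(block_id)  # cache miss: hit disk
        return pool[block_id]

cache = BlockCache()
block = cache.get_or_load("epoch-42/block-7", lambda bid: f"<data for {bid}>")
```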

Just to briefly talk about some of the things that did work well for our monitoring. First of all, we already have an A/B testing framework for our queries. On a per-query basis, we can apply different execution strategies in order to experiment with the effects of small modifications. We actually have this integrated with our logging as well, so very quickly we can analyze the effects of different execution strategies on query latencies. One of the other things that we do is we’re very careful about how we run our experiments. John is a big believer in markdown files, so every time he starts a new investigation, he starts a new markdown file along with a new dashboard. Everything that he does during the experiment gets dumped in there. He uses our dashboards and our Scalyr command line interface extensively in order to populate that markdown file and to add results to the dashboard.
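
A minimal sketch of per-query A/B testing along those lines might look like the following. The names and strategies are hypothetical, not Scalyr’s framework: the point is simply that the chosen strategy is logged next to the latency, so the two variants can be compared directly from the logs.

```python
import logging
import random
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("query")

STRATEGIES = {
    "baseline": lambda query: time.sleep(0.010),    # stand-in for current path
    "experiment": lambda query: time.sleep(0.005),  # stand-in for the new path
}

def run_query(query, experiment_fraction=0.1):
    strategy = "experiment" if random.random() < experiment_fraction else "baseline"
    start = time.monotonic()
    STRATEGIES[strategy](query)
    latency_ms = (time.monotonic() - start) * 1000.0
    # Logging strategy and latency together is what makes the A/B analysis easy.
    log.info("query=%r strategy=%s latency_ms=%.1f", query, strategy, latency_ms)

run_query("search error logs")
```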

And finally, one of the other things that we have in our system is the ability to modify the server configuration on the fly. So all these experiments that we’re running, all these things that we’re doing in order to test our strategies on real users’ queries, we can adjust over time through some simple updates.

Alright, let’s talk about the more interesting points, the monitoring lessons that we essentially had to relearn. The first lesson I want to talk about is the importance of consistency. The other way I like to think about this lesson is that there should be no surprises in how information is communicated. When you look for a piece of information, it should exist where you expect it to exist, and it should exist in the form that you expect. The performance issue that really reinforced this for us was the discovery of an RPC rate-limit gate issue that we had. In our system, as I alluded to earlier, we have rate limits that are applied to all incoming queries to make sure that there’s no abuse. We don’t want to have too many queries from one particular customer executing on the query engine, because then they’re taking unfair advantage of the system.

So, normally what happens is the gate keeps track of the number of queries being executed per second. If it exceeds a certain threshold, then the gate will artificially block a given query in order to slow it down. Now, it turned out that for Acme Corp, we were actually experiencing wait times of multiple seconds or more at the gate. And this was a big contributor to their latency; it was slowing them down. But we didn’t notice the issue that quickly, which is surprising, because all the information that we needed was actually in the logs. We just didn’t see it. Let’s talk about why.
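
To make that concrete, here is a minimal sketch of such a gate, simplified to a cap on concurrent queries rather than true per-second accounting, and not Scalyr’s implementation. Callers wait when the cap is exceeded, and the time spent waiting is exactly the “gate latency” that needs to show up alongside the other query latencies.

```python
import threading
import time

class RateLimitGate:
    def __init__(self, max_concurrent=10):
        self.max_concurrent = max_concurrent
        self.in_flight = 0
        self.cond = threading.Condition()

    def acquire(self):
        start = time.monotonic()
        with self.cond:
            while self.in_flight >= self.max_concurrent:
                self.cond.wait()     # this is where queries stall
            self.in_flight += 1
        # Report the wait in milliseconds, consistent with other latencies.
        return (time.monotonic() - start) * 1000.0

    def release(self):
        with self.cond:
            self.in_flight -= 1
            self.cond.notify()
```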

Essentially it boils down to multiple issues with consistency. First of all, we had inconsistency with how our metrics were laid out. We already did have a good model for reporting query latencies broken down by various subcomponents. We had a systematic way of reporting that. But this feature, this RPC rate limit, was not part of the query system. It was part of the RPC subsystem. So we actually reported it in a different way. And, when it came down to it, we were looking at the breakdown of the query latencies and just missing the fact that there was time stalled out while we were waiting at the gate.

Now, we actually did have the gate wait latency in the logs. In fact, John even thought to check it out. He had a long list of all the places where we could be missing performance, and he did some manual checks. He knew what he was looking for in the logs in order to check whether or not this was an issue. And he did some scans and saw numbers like four and five and he was like, “Oh, okay, four or five milliseconds, that’s fine.” That’s not contributing to the multiple seconds that we’re seeing.

But the problem here was that this latency was actually being reported in seconds, which was inconsistent with how we report most of our latencies. Everywhere else in the system, we report them in milliseconds. So here we were being misled by our results, simply because we were inconsistent with units.
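
One way to illustrate the consistency point is a single, shared reporting helper. This is a hypothetical sketch, not Scalyr’s code: if every latency in the system goes through one helper that fixes the unit and the naming convention, a stray value in seconds can’t hide in a log full of milliseconds.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("metrics")

def report_latency_ms(name, elapsed_seconds):
    # One place decides the unit (milliseconds) and the "_ms" suffix,
    # for every subsystem, including the RPC gate.
    log.info("%s_ms=%.1f", name, elapsed_seconds * 1000.0)

@contextmanager
def timed(name):
    start = time.monotonic()
    try:
        yield
    finally:
        report_latency_ms(name, time.monotonic() - start)

# Usage (hypothetical): with timed("rpc_gate_wait"): gate.acquire()
```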

Okay. This is just kind of a before and after, actually. After figuring this out, John did some fixes to how we handle the gate. The red bars are essentially the number of times that the waits at the gate exceeded one second before the fix, and the blue bars are the number of times after the fix. So you can see there’s a significant reduction.

The next lesson, the second lesson we learned, essentially boils down to what I like to describe as: you have to analyze based on what matters. When you design your monitoring, you really have to think about what really matters in terms of the behavior of your system. Another way people talk about this is averages versus distributions, and I’ll explain that more in a minute. The performance issue that reinforced this lesson was an issue we were having with our RAM block cache utilization. I mentioned earlier that in order to execute a query for a given epoch, all the blocks for that epoch have to be read into a RAM cache.

Well, it turned out that because of an odd interaction between how we decide which epochs should be executed on the master versus the slave and how we had architected the structure of our RAM cache, we were only using half of the cache at a time. Just to give you a little more detail, our RAM cache was actually composed of numerous two-gigabyte pools. And it turned out that if we had an even number of RAM pools, then only the even-numbered RAM pools were being used on the masters and only the odd-numbered RAM pools were being used on the slaves. It was just because of this odd interaction, but it resulted in us effectively using only half of our RAM for our cache. We had 50 gigabytes dedicated to that cache, and we were only using 25.

And so, why did it take so long for us to figure this out? It comes down to the fact that we were measuring the wrong thing. We had some metrics that we were looking at that should have uncovered this sooner. We had a dashboard that essentially showed the cache fill rates: how many blocks were we inserting into the cache per second? If there was a problem, if we weren’t really utilizing the cache, this would have dropped to zero. And so we look at this graph, the average of the cache fill rates across all the RAM pools, and everything looks fine. Okay, we’re inserting blocks at a pretty decent rate.

However, this graph tells a different story. What this graph shows is the cache fill rate for all the odd-numbered RAM pools, that’s the one in blue, and the even-numbered pools are in red. You can see right there that there’s a huge difference between the fill rates for the even RAM pools and the odd RAM pools. And what this really gets down to is: what really matters? It doesn’t matter that we’re inserting blocks into the cache at a decent rate overall. What matters is that we’re inserting blocks into all of the RAM pools, that every RAM pool is effectively having blocks added. And so this is where you get to the idea of averages versus distributions. You can’t always just take the average across something. In some cases, you really have to look at the distribution, because that’s what matters.
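
Here is a toy illustration of the averages-versus-distributions point, with made-up numbers: the average fill rate across RAM pools looks healthy even when half of the pools are completely idle, while a per-pool breakdown makes the problem impossible to miss.

```python
# Pretend half the pools fill at 120 blocks/sec and the other half are dead.
fill_rates = {pool: (120.0 if pool % 2 == 0 else 0.0) for pool in range(26)}

average = sum(fill_rates.values()) / len(fill_rates)
print(f"average fill rate: {average:.1f} blocks/sec")  # 60.0, looks fine

idle_pools = [pool for pool, rate in fill_rates.items() if rate == 0.0]
print(f"idle pools: {idle_pools}")  # every odd-numbered pool is dead
```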

Okay. After 20 hours of investigation, this basically boiled down to a single-character fix for John. The easy fix was just changing the shared RAM pool count down from 26 to 25 to give us an odd number. In effect, it ended up reducing the total RAM that we’re using for our cache, but it actually resulted in more blocks being cached, as counterintuitive as that is. And that’s it. Those are the lessons I wanted to go over.

If you want to learn more about our system, feel free to visit our blog. And the obligatory: we’re hiring. That’s it.

12 Mar 2018

Instrumenting CI Pipelines

In February, we invited New Relic Developer Advocate, Clay Smith, to our Test in Production Meetup to talk about instrumenting CI pipelines. If you’re interested in joining us at a future meetup, you can sign up here.

Clay took a look at the three-pillar approach to monitoring: metrics, tracing, and logging. He wanted to explore what tracing looks like within a CI pipeline, and so he observed a single run of a build with multiple steps kicked off by a code commit.

“I wanted to try and apply some of this stuff to understanding an AWS CodePipeline that I was using to build some Docker images. The question that I wanted to explore and kind of get into, and try to learn more about observability by trying to build something, is: if we take these really interesting ideas that were outlined in these posts and apply them to a CI tool, what can we learn and what does that actually look like?”

Watch his talk below.


Clay Smith:

I’ve had this very long and checkered history with attempting to build CI pipelines. My background’s engineering, not release or operations. It’s been a mixed bag of trying to build pipelines, most disastrously with iOS and the Xcode build server, trying to build something fast and reliable to do these checks that make it easier to actually deliver software.

I revisited that fairly recently, after spending a lot of time in 2017 reading about this notion of observability and going over some really interesting material on it. The inspiration for this was basically three things I read, kind of my 2017 reading list for observability.

The really interesting thing is that a lot of these posts, and the thought leadership, I guess you could call it, have been very much centered in San Francisco. I think we can more or less blame Twitter to some extent for it.

Back in September 2013, they described a situation where Twitter was undergoing rapid growth and they were having issues managing and understanding their distributed systems. They introduced this notion of observability, which isn’t necessarily something new, but it was new in this kind of IT distributed-systems context.

In 2017, there were two really great posts I highly recommend you read; they were pretty widely circulated. The first was from Copy Construct’s Cindy Sridharan. She wrote a really amazing post that described how these three things, metrics, logs, and traces, are really central to the notion of understanding the work your system does.

So we had the three pillars post, and then slightly before that, this Venn diagram from Peter Bourgon. I thought these posts were super cool because, again, my background isn’t necessarily in operations or in caring really deeply about log, metric, or trace data. I thought the way they presented these ideas was super interesting.

In particular, the Venn diagram presented in that post I thought was really interesting, because it got at this idea that when we’re talking about metrics, or logs, or traces, which we heard about in the previous talk, there is some sort of relationship between all of them.

I had a couple days right before New Year’s, and I wanted to try and apply some of this stuff to understanding an AWS CodePipeline that I was using to build some Docker images. The question that I wanted to explore and kind of get into, and try to learn more about observability by trying to build something, is: if we take these really interesting ideas that were outlined in these posts and apply them to a CI tool, what can we learn and what does that actually look like?

I was at re:Invent this year, which was very, very large, I think around 50,000 people. There was a really cool dashboard that Capital One was showing off. I took a photo on my phone; it’s open source. I think they were calling it something like the single view of the DevOps pipeline.

They have some really interesting metrics and graphs around things like build failures, the ticket backlog, and build speed and success, things you would expect. Typically, if you use Jenkins or all these other tools, there’s almost always a way to inspect log output.

Taking the three-pillar approach, it seemed like in this view, and in other common systems and tools, there wasn’t much going on around getting a trace of the work that actually happens inside a CI pipeline.

I really wanted to explore that and try and build something in a day or two. The one thing that I kind of changed from the Venn diagram: instead of scoping a trace to a request, what if we just scope it to a single run of a build, multiple steps kicked off by something like a code commit?

I was using AWS CodeBuild at the time, which is managed infrastructure from AWS. How it works is you have a YAML file, you can give it a container, and you basically give it a script. It can do things like build an image or compile code, and you can configure it in a lot of different ways.

The infrastructure itself, like a lot of AWS services, is fully managed, so there’s nothing to SSH into. You don’t have access to the host, no root privileges. You’re kind of just locked into that container environment, similar to SaaS-based CI tools.

What I wanted from that, as it goes through its build steps, is the trace view. One of the things that I had a lot of fun with was that I realized there was no way I could really natively instrument the CodeBuild process. It’s fully managed by AWS; they’re not going to give me access to the code.

Inspired by the diagram, if you can log an event, and if you can log the relationship between different events, you can get something that kind of approximates traces. So I just wrote a really stupid thing: there’s a verb at the front, you capture different events, and you write them to a file.

The idea there is you’re writing this formatted log as each build step progresses. You have write access to the file system in CodeBuild, so nothing big there. From there, we can actually build these traces. There was also a huge hack so you could actually capture those events in real time: it would just tail the log file that you’re writing events to and send it up to the back end, which in this case is just New Relic APM.
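
A rough approximation of that approach, as my own sketch rather than the speaker’s exact code: append structured start/finish events to a log file as each build step runs, with a shared build id so the events can be stitched into something trace-like later. Shipping the file to a backend (by tailing it, as described) is left out.

```python
import json
import time
import uuid

BUILD_ID = str(uuid.uuid4())
EVENT_LOG = "/tmp/build-events.log"

def log_event(verb, step, **fields):
    # Each line is one event: a verb, the step it belongs to, and a shared id.
    event = {"verb": verb, "step": step, "build_id": BUILD_ID,
             "ts": time.time(), **fields}
    with open(EVENT_LOG, "a") as f:
        f.write(json.dumps(event) + "\n")

log_event("start", "docker-build")
# ... the actual build step runs here ...
log_event("finish", "docker-build", status="ok")
```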

Once all that’s in place, you can actually get this tracing-specific view of the different events inside the AWS CodeBuild pipeline. It’s really interesting, because all of this stuff was designed very much for an application; I think this view has been around in New Relic for more than seven years.

When you apply it to the pipeline, you actually still get some pretty interesting views of what’s going on. First is just the frequency and duration, but then you actually see the breakdown in time between each step. Not surprisingly, the build step which is actually building the Docker image takes the most time.

From there, because we’re actually building a Docker container, we know what commit in source control actually built the image, and we use that to connect it to production performance.

The hack, or the trick, or the thing here with instrumentation is that when it’s actually building the Docker image, we tag that trace with the git commit hash of what’s actually being built. When we run that code in production, we capture that as well, so we have traces of how the code is behaving in production, and we also have a trace of how that build artifact, that Docker container that’s running in production, was actually built.
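
As a sketch of that tagging trick, again my own illustration with hypothetical names rather than the speaker’s code: read the git commit hash of what is being built, attach it to the build-time events, and bake the same hash into the image so the running service can report it on its own telemetry.

```python
import subprocess

def current_commit():
    # Requires running inside the git repository being built.
    return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()

commit = current_commit()

# Build-time side: every event for this build carries the commit hash.
print(f"event: step=docker-build git_commit={commit}")

# Bake the hash into the image; the service exposes it (e.g. via an env var)
# so its production traces can be tagged with the same value and joined later.
build_cmd = ["docker", "build", "--build-arg", f"GIT_COMMIT={commit}",
             "-t", f"myapp:{commit[:12]}", "."]
print("would run:", " ".join(build_cmd))
```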

Here you have this interesting view where you see code running, these are different deploys, and there’s a spike as [inaudible 00:07:50] scales up and down and all that. You also see next to it what was actually happening when that Docker image was being built in the first place.

It’s an interesting way of connecting these potentially complicated processes: actually building the image and gradually deploying it to production. If you can annotate both traces with something like a git commit hash or a version number, you can connect them together, which I think is kind of interesting.

To wrap up this experiment: we talk more and more to different customers and people who are building very complex pipelines. Often at the end of that pipeline there’s a very complex deploy strategy. Blue-green, for example; I read a really interesting post the other day about blue-green rainbow deploys with 15 colors, or 26 colors. Canary deploys, lots of different strategies.

With that complexity, it feels like the stuff that we all know and are hearing about for managing systems and services could potentially apply, in some respects, to complex pipelines too. I think this idea of understanding and monitoring your production performance, and then being able to have some relationship where you connect it back to whatever it was that built it, ideally something that ran through automated test suites, seems pretty interesting too.

It was a really fun exploration. It was fun to get my hands dirty with these ideas around observability. So many people are going through this to learn about it; it seems really important and also really interesting. I’m looking forward to continuing the conversation about how people are attacking this and applying it to the things we’re all building.

On that note, thanks very much.

08 Mar 2018

Testing and Debugging in Production with Distributed Tracing

At LaunchDarkly we host a monthly Meetup for anyone interested in testing in production. At our Meetup in February, we focused on monitoring and observability while testing in production. If you’re interested in joining this Meetup, sign up here.

Priyanka Sharma, Product Marketing, Partnerships & Open Source at LightStep, kicked off the event with a discussion around how she sees tracing as an essential tool for testing in production. She pointed out how software systems have become more complex in recent years, especially with the rise of CI/CD and microservices.

“There’s an explosion of data. The more services there are, the more touchpoints there are, and the more data you have to collect about what’s happening in your system. It’s very hard to manage all of this.”

Watch her talk to see how distributed tracing can help teams get a better understanding of their systems and the responses they’re seeing when testing in production.


Hi, everybody. How’s everyone doing?

Awesome, awesome. I love Heavybit, just like Andrea. LightStep was also here until a few months ago, and we miss the space and the community. Today, I’ll be talking about tracing as an essential tool for testing in production. So before we deep dive into tracing, let’s take a brief overview of the big challenges in production, especially around debugging and performance.

As people here probably know better than anybody else, software is changing. Software workflows are changing, especially with the advent of CI/CD and also microservices. Things are done completely differently now. This diagram you see here, the first one, is what a monolithic architecture would look like: it’s one giant box, and a request, from start to finish, goes through various parts of this box over its lifecycle. While that box is huge and has a lot inside it, it is at least contained, and viewing something through it was much easier.

But now systems are breaking up into fragments, whether they’re microservices or larger services, however you refer to your services or core projects. The point is that there’s more and more complexity introduced in today’s software systems.

What also happens with this is that there’s an explosion of data. The more services there are, the more touchpoints there are, and the more data you have to collect about what’s happening in your system. It’s very hard to manage all of this.

This is where distributed tracing can help. If you have anomalies or outliers that you want to detect, it gets very challenging in that very fragmented system over here, because is it in that corner, or this corner here? Where is the actual problem? With distributed tracing, you can examine the exact edge case and see what happened end to end in that request lifecycle. If you’re running a distributed system, out of curiosity, how many people here have a distributed system with at least two services? All right. A bunch of you. So you know what I’m talking about, where that observability can be really lost the minute there is any fragmentation. With tracing, you can get a full picture of your system, from the clients all the way to the back end and the responses that come out.

How many of you here run CI/CD? All right. A bunch here too. So when you’re running CI/CD, it’s great that you can ship faster, but there can also be problems with your builds, and you need to understand what the issue is. Oftentimes it’s not the code, but actually the resources being utilized. And with tracing, you can pinpoint that as the issue, if that is the case.

Andrea mentioned gradual software rollouts with feature flags, things like that. If you’re doing that, you want to analyze the full transaction data of each rollout to look at the performance, see what errors were there, and make informed choices about the next rollout. That is, again, something that distributed tracing can provide you.

So just to make sure we’re all on the same page, before deep diving further, I’d like to do a quick intro of distributed tracing as I’m defining it in this context and make sure we’re all aligned. So what is distributed tracing? So tracing as a technology has actually been around since the 70s. But what’s brought it mainstream into industry is the coming of distributed systems into the internet age.

So when you need to know the lifecycle of a request end to end in a distributed system, you use distributed tracing. It’s a view of the request as it travels across multiple parts of the system and comes back out with the response. This here is an example of a relatively simple request going through a small system. So you can imagine, if you 10x it, how big this will be and how complicated it’ll be to follow the path of a request.

So what’s a trace exactly? A trace is the entire lifecycle of a request, and it’s composed of parts that are called spans. A span is a named, timed operation that represents a piece of the workflow. So think of it as Point A to Point B, and then there’s a jump in context to a different span, and then Point C to D, and it goes from there. This diagram here should give you a little bit of a sense of how a trace is structured: there’s the parent span that starts it all, and there can be child spans, logs, etc. in here. So you can add in a lot of information that is important to you and your system. There’s a lot of flexibility. Ultimately, though, the TL;DR of all this is that a trace is a collection of spans.
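
A minimal sketch of that structure, using the OpenTracing Python API (the opentracing package): one root, or parent, span for the whole request, with child spans for the pieces of work inside it. The operation names and tags here are made up, and global_tracer() is a no-op tracer unless a real one is installed.

```python
import opentracing

tracer = opentracing.global_tracer()

with tracer.start_active_span("handle_request") as root:      # parent span
    root.span.set_tag("http.method", "GET")

    with tracer.start_active_span("auth.check") as child:     # child span
        child.span.log_kv({"event": "token_validated"})

    with tracer.start_active_span("db.query") as child:       # another child
        child.span.set_tag("db.statement", "SELECT * FROM donuts")
```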

So here, I’ve tried to visualize what a request looks like in a very simple architecture, and then the trace view. On, I guess your left, my right, there’s a very simple service: there’s a client, then a web server that talks to an auth service, billing, and a database. If you’re looking at the trace view, which is my left, your right, that’s on the time continuum. At the top you see the client, or the root span; it’s the topmost thing because that is the beginning and end of the request. And these dashes that you see are all spans of work that happened within this whole trace. So this can be quite useful when you’re debugging a problem, because you can deep dive into exactly where the problem is: in the process workflow, or in the auth service, the DB, the client, whatever.

So that’s great, right? You can deep dive into things, you can append information and logs and baggage. There’s so much you can do with tracing, and systems are fragmenting and changing so much. So then why is tracing not ubiquitous? The key reason here is that tracing instrumentation has just been too hard. Until recently, most of the instrumentation libraries were vendor-specific. So if you went through all the work of instrumenting each part of your code base, you were tied to that vendor, and you didn’t really have a choice to change options should that one not work out for you.

Then, monkey patching is insufficient. An automated agent can only do so much; you need instrumentation by humans who understand the system and the problems they run into for it to be comprehensive.

Then there’s the problem of inconsistent APIs. How many of you use more than one language in your code base? A bunch of you. So the minute there’s more than one API, if you have to transfer between different APIs or different languages which don’t work nicely together, you have a problem, because you’re only seeing a snapshot of your system. And the same applies to open source projects. Most people’s code is a ton of open source projects with glue code, and that glue code is your application code, right? In that situation, if you’re not seeing anything through those projects, then you’re, again, flying in the dark. And if the tracing libraries for different projects are not the same, you have the same problem that you had with different language APIs that don’t play nicely with each other. So all this is a lot of difficulty, and it dissuaded most developers from getting into tracing.

And this is where OpenTracing, the project I work on, is your ticket to omniscience. The OpenTracing API is a vendor-neutral open standard for distributed tracing. It’s open source and part of the Cloud Native Computing Foundation, as I mentioned.

It enables developers to instrument their existing code for tracing functionality without committing to an implementation at the start. So you could use any tracer and visualization system, such as LightStep, Zipkin, Jaeger, New Relic, or Datadog. Whatever your choice is, if they have a binding for OpenTracing, you can go ahead and swap it out with just a change in the main function.
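
A hedged sketch of what that swap can look like with the OpenTracing Python API: the instrumentation only ever talks to the opentracing package, and the concrete tracer (Jaeger here, purely as one example) is chosen once at startup. The configuration values and service name are illustrative.

```python
import opentracing
from jaeger_client import Config  # any OpenTracing-compatible tracer works

def init_tracer(service_name):
    # The only vendor-specific code in the whole application.
    config = Config(
        config={"sampler": {"type": "const", "param": 1}},
        service_name=service_name,
    )
    tracer = config.initialize_tracer()
    opentracing.set_global_tracer(tracer)

init_tracer("donut-service")

# Everything below is vendor-neutral and never needs to change when the
# tracer implementation is swapped:
with opentracing.global_tracer().start_active_span("order_donut") as scope:
    scope.span.set_tag("topping", "sprinkles")
```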

And this standard has been doing really well, and part of the reason why is that it’s backed by folks who’ve been around the tracing landscape since it became mainstream in industry. Google released this project called Dapper, which is its in-production tracing system that was created by Ben Sigelman and a few of his colleagues, and they released a paper about it in the 2000s sometime. Inspired by that, Twitter built Zipkin, which it open sourced around 2009. And now the people who’ve worked on these projects, including Ben Sigelman and other folks from industry like Yuri Shkuro from Uber, came together to build the OpenTracing spec in 2016. It was built by people who had solved these problems before, had lived these problems multiple times, and the result is there for others to see.

So let’s look at the OpenTracing architecture. This is it. Just kidding. There’s the OpenTracing API. It speaks to the application logic, the frameworks, the RPC libraries, you name it, and they all connect to OpenTracing. And OpenTracing talks to the tracing visualizer of your choice. The examples listed here, like Zipkin, LightStep, and Jaeger, are just a very few. As I mentioned just before, multiple vendors such as New Relic, Datadog, Dynatrace, and [inaudible 00:09:57] have created bindings for OpenTracing. So the options for the end-user community are very many.

A lot of end-user companies are finding this very useful in production. This logo wall is by no means complete. It’s all kinds of companies, from hipster engineering cultures to large-scale enterprises to somewhere in between. You see the Under Armours of the world. You also see the Lyfts and the CockroachDBs, etc.

Along the way, the open source side has really pushed the project along too. Open source project maintainers have adopted OpenTracing and accepted PRs for it, because this is an opportunity for them to allow their users to have visibility through the system without having to build the whole thing themselves. Some notable ones would be Spring by Pivotal, gRPC, and a bunch of language libraries. Fueled by this open source and end-user adoption, vendors have been jumping on board.

So that all sounds great. Clearly there’s some social proof, people seem to like it, but what does this really mean? Let’s look at open tracing in action with some traces. So I’m now presenting the next unicorn of Silicon Valley, Donut zone, which is a DaaS, Donuts as a Service. It’s the latest innovation, it’s donut delivery at its very best, it’s backed by Yummy Tummy Capital with a lot of money. This is the beautiful application we’ve built and through this, you can order donuts. Now I’m going to be asking you guys for help very soon to help me create a success disaster on donut zone.

Before that, I quickly want to explain the DaaS architecture, which is built to move fast in big things, so that when we look at traces you know what we’re really looking at. We have bootstrapped with a single donut mixer for our topper, startup life. But the good news is that we’ve built it with OpenTracing; there’s an OpenTracing Mutex wrapper that’s included in this. So it will hopefully help us debug any issues that may come up. But who knows?
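
As a rough sketch of what such a wrapper might look like (assuming an OpenTracing-style design, not the actual donut.zone code): every mutex acquire becomes a span whose duration is the wait time, tagged with roughly how many other requests were already waiting, which is the signal that shows up in the trace walkthrough below.

```python
import threading
import opentracing

class TracedMutex:
    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self._waiters = 0  # approximate count, good enough for illustration

    def __enter__(self):
        tracer = opentracing.global_tracer()
        with tracer.start_active_span("mutex.acquire/" + self.name) as scope:
            scope.span.set_tag("waiting_requests", self._waiters)
            self._waiters += 1
            self._lock.acquire()   # the span's duration is the wait time
            self._waiters -= 1
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._lock.release()

fryer_lock = TracedMutex("donut-fryer")
with fryer_lock:
    pass  # fry the donut while holding the lock
```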

So now we are going to get on to the success disaster part of the piece. So if you can please pull out your mobile phones or laptop, whatever you may have and go to donut.zone. I’m going to do this too. And then once you’re there at donut.zone, just start ordering a bunch of donuts, as many as you would like in a dream world where there are no calories in donuts and they were just like water but tasted like they do. So let’s get started. Click away because if you don’t do this, there will be no good traces to see and so this is all on you. Okay, so tap, tap, tap. Click, click, click. Okay, let’s just do a little more for good measure. Don’t be shy. Click a lot. Okay, great. So thank you so much.

Now let’s look at some traces. I have a bunch of traces here. Now, you’ll see that this is a full trace, and the yellow line that you see here … Can people see the yellow line? The yellow line is the critical path, which means the blocker, so to speak, for the process to be completed. So here you see that in this specific trace, which is 10.2 seconds long, the donut browser is taking up seven seconds just on the browser and client interaction. What does that tell us? Something that’s not very surprising, which is that the internet’s a little slow. So nothing moved beyond the client until seven seconds in. Then you see here, this is the payment being processed, because you paid fake money for your donuts.

Then here is an interesting one. Here we see that the time from 7.6 seconds to 9 seconds was taken up by the donut fryer. But the interesting point here is that the Mutex acquire itself took close to about, I would say, two and a half seconds. And here in the tags, you will see that it was waiting on six other donut requests. This is where you created the success disaster by clicking so many times. So this request was just waiting for its chance to be fried, and that took up the most time. That should tell you that if you were debugging over here, it’s a resource allocation issue as opposed to an issue in your code. The code that did the work took just a tiny amount of time. So you’d be better off optimizing by adding a fryer. And this is something you knew right away because you had this distributed trace available and could debug in the moment to figure out why it’s so slow to get donuts from donut.zone.

So I hope you found this useful. For those of you who are interested in seeing these kinds of traces for your system and debugging efficiently and quickly, I recommend going to opentracing.io to learn about the project. We have documentation, we have links to all our GitHub repos, and lots of advice, etc. But if you’re interested more in the community aspect, you want to talk to people and hear about their experiences, then I’d recommend the Gitter community where we all hang out. It’s gitter.im/opentracing. This is where people help each other, and you can get good advice on how to get started. I hope you found this useful. If you have any questions, feel free to reach out to me, priyanka at lightstep.com. I’m also on Twitter at @pritianka. I reply really quickly. So please reach out, check out distributed tracing, and let me know if I can help in any way. Thank you so much.