21 Jul 2015

Golang pearl: Thread-safe writes and double-checked locking in Go

Channels should be your first choice for structuring concurrent programs in Go. Remember the Go concurrency credo– “do not communicate by sharing memory; instead, share memory by communicating.” That said, sometimes you just need to roll up your sleeves and share some memory. Lock-based concurrency is pretty old-school stuff, and battle-hardened Java veterans switching to Go will undoubtedly feel nostalgic reading this. Still, many brand-new Go converts probably haven’t encountered low-level concurrency primitives before. So let’s sit down and program like it’s 1999.

To start, let’s set up a simple lazy initialization problem. Imagine that we have a resource that is expensive to construct- it’s read often but only written once. Our first attempt at lazy initialization will be completely broken:
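
(A minimal sketch; Resource stands in for the expensive type, and Broken just drives two concurrent readers.)

    package main

    type Resource struct {
        // imagine something expensive to construct here
    }

    var instance *Resource

    // GetInstance lazily constructs the resource. This is the broken part:
    // two goroutines can both observe instance == nil and race on the write.
    func GetInstance() *Resource {
        if instance == nil {
            instance = &Resource{}
        }
        return instance
    }

    func Broken() {
        go GetInstance()
        go GetInstance()
    }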

This doesn’t work, as both goroutines in Broken may race in GetInstance. There are many incorrect (or semantically correct but inefficient) solutions to this problem, but let’s focus on two approaches that work. Here’s one using read/write locks:
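
(A sketch, reusing the Resource type from above.)

    import "sync"

    var (
        instance *Resource
        mu       sync.RWMutex
    )

    func GetInstance() *Resource {
        // Fast path: take the read lock and check whether we are already initialized.
        mu.RLock()
        r := instance
        mu.RUnlock()
        if r != nil {
            return r
        }

        // Slow path: take the write lock and check again, since another goroutine
        // may have initialized instance while we were waiting for the lock.
        mu.Lock()
        defer mu.Unlock()
        if instance == nil {
            instance = &Resource{}
        }
        return instance
    }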

If you’re a Java developer you might recognize this as a safe approach to double-checked locking. In Java, the volatile keyword is typically used on instance instead of a read/write lock, but since Go does not have a volatile keyword (there is sync/atomic, and we’ll get to that) we’ve gone with a read lock.

The reason for the additional synchronization around the first read is the same in Go as it is in Java. The Go memory model does not guarantee that the initialization of instance is visible to other goroutines unless there is a happens-before relation that makes the write visible. The read lock ensures this.

Now back to sync/atomic. Among other things, the sync/atomic package provides utilities for atomically visible writes. We can use this to achieve the same effect as the volatile keyword in Java, and eliminate the read/write lock. The cost is one of readability– we have to change instance to an unsafe.Pointer to make this work, which is aesthetically displeasing. But hey, it’s Go– we’re not here for aesthetics (I’m looking at you, interface{}):
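
(Another sketch with the same Resource type; a plain mutex still guards the slow path, and sync/atomic makes the fast-path read safe.)

    import (
        "sync"
        "sync/atomic"
        "unsafe"
    )

    var (
        instance unsafe.Pointer // always holds a *Resource; accessed only via sync/atomic
        mu       sync.Mutex
    )

    func GetInstance() *Resource {
        // Fast path: an atomic load is enough to safely observe a fully
        // constructed *Resource published by the StorePointer below.
        if p := atomic.LoadPointer(&instance); p != nil {
            return (*Resource)(p)
        }

        mu.Lock()
        defer mu.Unlock()
        if p := atomic.LoadPointer(&instance); p != nil {
            return (*Resource)(p)
        }
        r := &Resource{}
        atomic.StorePointer(&instance, unsafe.Pointer(r))
        return r
    }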

Astute Gophers might recognize that we’ve re-derived a utility in the sync package called Once. Once encapsulates all of the locking logic for us so we can simply write:
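
(Same Resource type as before.)

    import "sync"

    var (
        instance *Resource
        once     sync.Once
    )

    func GetInstance() *Resource {
        once.Do(func() {
            instance = &Resource{}
        })
        return instance
    }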

Lazy initialization is a fairly basic pattern, but once we understand how it works, we can build safe variations like a resettable Once. Remember though– this is all last-resort stuff. Prefer channels to using any of these low-level synchronization primitives.


19 Jun 2015

Launched: Teams support

LaunchDarkly now supports teams! Teams support means that you don’t have to use shared credentials to access your account. Each individual member of your team gets their own login and password. If you already have a LaunchDarkly account, you can manage your team from your account settings page. You can invite new members here, or remove members that are no longer on your team.

[Screenshot: the account settings page]

We’re going to continue adding features for teams over time– stay tuned for more!



 

01 Jun 2015

Launched: CORS support

We’ve upgraded our REST APIs to support CORS. This lets anyone build web pages that call LaunchDarkly’s APIs directly– no need to proxy through a server.

You can make authenticated CORS calls just as you would make same-origin calls, using either token or session-based authentication. If you’re using session auth, you should set the withCredentials property for your xhr request to true.

One last thing– for security reasons, all POST, PATCH, and PUT calls to our API require a Content-Type of application/json.

Happy launching!

 


25 Mar 2015

Golang pearl: It’s dangerous to go alone!

[Image: “It’s dangerous to go alone! Take this.”]

In Go, the panic function behaves somewhat like an unrecoverable exception: panic propagates up the call stack until it reaches the topmost function in the current goroutine, at which point the program crashes.

This is reasonable behavior in some environments, but programs that are structured as asynchronous handler functions (like daemons and servers) need to continue processing requests even if individual handlers panic. This is what recover is for, and if you inspect the source you’ll see that Go’s built-in HTTP server package recovers from panics for you, meaning that bugs in your handler code will never take down your entire HTTP server.

Unless, of course, your handler code spawns a goroutine that panics. Then your server is screwed. Let’s demonstrate with a trivial example:
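
(Roughly like this; the handler name and port are arbitrary.)

    package main

    import (
        "log"
        "net/http"
    )

    func hello(w http.ResponseWriter, r *http.Request) {
        panic("uh oh")
    }

    func main() {
        http.HandleFunc("/", hello)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }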

This server will happily chug along, panicking every time the root resource is hit, but never crashing.

But if hello panics in a goroutine, the entire server goes down:
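
(For example, by moving the panic into a goroutine spawned by the handler; net/http’s recover only covers the goroutine running the handler itself.)

    func hello(w http.ResponseWriter, r *http.Request) {
        go func() {
            panic("uh oh")
        }()
    }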

In our web services, we can never really trust a naked use of go. We wrap all our goroutine creation in a utility function we call GoSafely:
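
(A minimal version might look like this; the real logging and error-reporting details will vary.)

    import (
        "log"
        "runtime/debug"
    )

    // GoSafely runs fn in a new goroutine and recovers from any panic,
    // logging it instead of letting it take down the process.
    func GoSafely(fn func()) {
        go func() {
            defer func() {
                if err := recover(); err != nil {
                    log.Printf("recovered from panic: %v\n%s", err, debug.Stack())
                }
            }()
            fn()
        }()
    }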

Here’s how we use it:
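
(Using the panicking handler from before.)

    func hello(w http.ResponseWriter, r *http.Request) {
        GoSafely(func() {
            panic("uh oh") // recovered and logged; the server keeps serving requests
        })
    }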

Not as syntactically sweet as a naked go statement, but it does the trick. The unfortunate thing (which we don’t really have a solution for) is that any third-party code that spawns a goroutine could potentially panic– and we have no way of protecting ourselves from that.


17 Mar 2015

What we talk about when we talk about API reliability

One of the first questions we get from customers is about reliability– how can you make a system that uses a remote API reliable? What happens to our site if your service goes down?

The usual approach to reliability is to pick an SLA, say 99.99% uptime, turn on a monitoring tool like Pingdom to measure uptime against that SLA, and then play a montage while your engineers do a lot of hard work to meet that SLA.

My issue with this is that uptime is a one-dimensional view of reliability. Counting nines is almost certainly a necessary condition, but it’s often not sufficient. It ignores other important questions about reliability (How much variability is there in response times? What happens to performance as the number of concurrent clients increases?).

LaunchDarkly provides feature flags as a service, so our API delivers rules that determine which features our customers’ end-users see. If our API is unreliable, these end-users could see changes in their UI, missing functionality, or worse. Of course, uptime is critical for our service, and we have to do all the normal “montage work”– zero downtime deploys, extensive integration and load testing, and ensuring redundancy in all critical systems. But reliability for us also means handling things like sporadic packet loss, latency spikes, and other issues that are invisible to a tool like Pingdom, but potentially visible to end-users.

For us, thinking about reliability as a number line that goes from 0 to 100 wasn’t sufficient. We started thinking about the problem from the ground up– starting with what was even possible in terms of designing for reliability.

“It’s happened again— you’ve wasted another perfectly good hour listening to CAP Talk.”

If you’re familiar with distributed systems, the CAP theorem is probably near and dear to your heart. In a nutshell, the CAP theorem states that a distributed system can have at most two of three properties:

  • Consistency: The system agrees on one up-to-date copy of data
  • Availability: The system returns a response to every request, indicating whether it succeeded or failed
  • Partition tolerance: The system continues to function in the presence of network failures

Most people think about CAP in relation to database design, but it’s important to keep in mind that the theorem applies to any shared-network shared-data distributed system. This means that any HTTP API and its clients are bound by the CAP theorem. And this is one point we often forget when we talk about API reliability.

In the case of an HTTP API running over the Internet, we have to accept two facts:

  • Occasional large spikes in latency are inevitable
  • Latency is often indistinguishable from network failure

Practically speaking, designing an API with CAP in mind means that we’ll encounter a situation where something bad happens, and we have to choose between consistency and availability. In the case of a GET request, for example, this means:

  • Choosing consistency and returning an error
  • Choosing availability and returning stale content

The LaunchDarkly API returns “rulesets” defining how feature flags should be rolled out. For example, a ruleset might contain logic like “roll this feature out to 25% of all users, plus everyone in the beta testers group”. Serving stale data means delivering an older copy of those rules, which is preferable to returning an error or waiting an unbounded amount of time (the consistency route) and forcing customers to fall back on a “default” variant for the feature flag that they’ve hardcoded. With that choice in mind, we had a clearer mandate to work with:

  • Maximize availability by trading off consistency
  • Minimize the possibility of partitions

The dream of the 90’s is alive

The 90’s were a great time. Java came along and changed the web forever with applets. I rocked JNCO jeans and had blue hair. And CDNs hit the scene, dramatically decreasing the probability that clicking on a Slashdot link would lead to a server timeout.

90’s reminiscence aside, CDNs were designed precisely to provide caching and redundancy. The challenge, though, is that CDNs are (traditionally) designed to serve static content. Proxying an API through a CDN leads to a host of problems. For one thing, CDNs typically aren’t able to cache authenticated content, which is a must for our APIs. Another problem is that CDNs usually aren’t well suited to managing rapidly changing content– purge latencies on the order of 5-15 minutes are common. While we’d already gone down the path towards eventual consistency, 15 minutes is a little too eventual for us.

Luckily, we found a solution in Fastly. Fastly gives us the option to specify caching rules using VCL, which gives us the fine-grained control we need to safely cache authenticated content. All we need to do is specify Vary: Authorization headers, telling Fastly to provide separate content caches for distinct API keys. Fastly also offers purge latencies on the order of milliseconds, as well as an API we could call programmatically to purge content whenever a customer updated the rules for their feature flag.

You might have noticed that using a CDN actually increases the chance of a network partition occurring, because CDNs introduce multiple POPs as intermediaries that may fail. Here we leverage another consistency tradeoff that reduces the impact of partitions– we try to avoid network calls altogether.

The best remote call is the one you don’t make

One advantage that we have with LaunchDarkly is that our API’s “mission-critical” clients (the ones that serve feature flags) are under our control: we build and distribute SDKs for our supported platforms. This means that we can introduce another layer of caching in these clients, with a completely different set of cache directives from what we give Fastly, our intermediate cache.

First, we use a simple max-age directive to set a TTL for feature flag changes. This is another area where we trade off consistency for availability: our SDKs can avoid making network calls as long as the max-age of the locally cached content hasn’t been exceeded. In production, we recommend setting the TTL to 5 minutes, which means that our SDKs only make one remote call every 5 minutes. No matter what the frequency of feature flag requests is, only one call every 5 minutes will have any significant overhead. We can even hide that overhead by using stale-while-revalidate to serve stale content while we fetch the new flag rules in the background. The main tradeoff here is that changes to feature flag rollout rules on the LaunchDarkly dashboard don’t take effect for 5 minutes.

We also use stale-if-error, which tells our client caches to prefer stale content to error responses. stale-if-error gives our SDKs some resilience to things like Fastly POP outages.
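
Put together, the response headers our SDKs cache against look roughly like this (max-age=300 is the 5-minute TTL mentioned above; the stale-* lifetimes here are illustrative values, not production ones):

    Cache-Control: max-age=300, stale-while-revalidate=60, stale-if-error=86400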

Conclusion

The takeaway here is that we shouldn’t be satisfied with measuring reliability in terms of uptime alone. Definitely do continue using tools like Pingdom to monitor uptime– that’s table stakes. But other factors, like baseline error rates, median + 90th percentile response times, and failure mode under load need to be considered as well. In the case of HTTP APIs over the Internet, design tradeoffs between consistency and availability can be made to improve reliability according to these metrics, independent of uptime numbers.


24 Feb 2015

Packaging Go Microservices for AWS Deployment using CircleCI

Most of LaunchDarkly’s backend systems are written in Go. We have a microservice-based architecture, so we have about 10 distinct standalone binaries (either web services or asynchronous worker processes) that we deploy to AWS.

We have a small engineering team, so it’s important that we can package and deploy our code with minimal overhead. Out of the box, Go provides an amazing set of tools that make this manageable, but we found we needed to hunt around to fill in a few gaps in the toolchain. Here are the extra tools we use to get our Go code packaged and ready for deployment.

Repeatable builds with Godep

One of the first issues we saw with Go was that out of the box, the go tool doesn’t give you any way to version dependencies. This seems like a big oversight to me, but perhaps it makes more sense in a world where you have one monolithic repository.

At any rate, we have code in about 20 distinct repositories, so versioned dependencies and repeatable builds are essential for us. We use Godep to handle versioning in our Go builds.

Godep is fairly simple. godep save stores version metadata (for us, git SHAs) for your dependencies into a Godeps/Godeps.json file in your repository root. It also stores all your dependency sources in a Godeps/_workspace directory. You commit this entire directory structure, including the _workspace, into your repository. Instead of running go build, you compile with godep go build. When you want to update a dependency, do so in your normal GOPATH and then invoke godep update IMPORT_PATH. Simple and effective.

One thing that wasn’t obvious to us was how to structure “library” packages to work best with go and godep. By that, I mean repositories that consist of several packages that aren’t necessarily tied together with a top-level “main” package. We have a repository that serves as our ubiquitous “bucket” of essential utility code. We call it foundation, and it has packages for things like epoch times and richer error types. The directory structure for foundation looks like this:
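
Roughly like this (the package names here are illustrative):

    foundation/
        Godeps/
            Godeps.json
            _workspace/
        epoch/
            epoch.go
            epoch_test.go
        errors/
            errors.go
            errors_test.go
        README.md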

There’s no Go source in the top level of the repository, so go build complains:
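
Something like (the path is whatever your GOPATH checkout happens to be):

    $ godep go build
    can't load package: package github.com/launchdarkly/foundation: no buildable Go source files in /home/you/go/src/github.com/launchdarkly/foundation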

The right thing to do is to point go to each of the package subdirectories:
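
    $ godep go build ./...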

Typing ./... is pretty thoroughly ingrained into our muscle memories now. We have to type it every time we build our foundation, as well as every time we want to do a wholesale update of all the foundation packages in a dependent repository: godep update github.com/launchdarkly/foundation/...

Cross compiling

Go compiles to machine code, so binaries are platform-dependent. We do all our local development on OS X, but for staging and production we need to build Linux targets. Again, reproducible builds are important to us– we want to be able to build our production artifacts for Linux on our OS X boxes if necessary. For that, we use a cross-compilation tool called goxc.

goxc is pretty straightforward to set up. We store our configuration in .goxc.json files in each of our repositories. For a simple worker or web service, our .goxc.json files look like this:
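
Something like the following; the exact key names and template syntax vary by goxc version, so treat this as a sketch rather than a copy-paste config. The BuildConstraints value is meant to read “Linux except ARM, plus OS X”:

    {
        "ConfigVersion": "0.9",
        "ArtifactsDest": "artifacts",
        "BuildConstraints": "linux,!arm darwin",
        "GoPath": "{{.PWD}}{{.PS}}Godeps{{.PS}}_workspace{{.PLS}}{{.Env.GOPATH}}"
    }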

This simple configuration runs the cross compiler and packages a .tar.gz archive of our binary. The BuildConstraints line uses Go’s build constraints notation— the line we use compiles for Linux and OS X, but disables Linux ARM (which we don’t need to target). The GOPATH setting took a bit of figuring out– goxc lets you specify a template format including variables like the current working directory (PWD), a platform-specific path separator (PS) and path list separator (PLS), etc. Our GOPATH setting points goxc to our Godeps directory and falls back to our environment-specified GOPATH. The Godeps directory alone might be preferable– ensuring that your build doesn’t depend on local copies of packages:
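
(Same caveat about exact syntax as above.)

    "GoPath": "{{.PWD}}{{.PS}}Godeps{{.PS}}_workspace"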

Another useful trick that we use is to pass a build-ldflags flag to goxc. We use this to inject a git SHA into our binaries in a Version variable. This makes it extremely easy to figure out *exactly* what version of a service is running at runtime. Versions for us are just git SHAs– we don’t bother with the overhead of maintaining semantic version numbers for our services. In our Go code, all we have to do is this:
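
For example, assuming the variable lives in package main:

    package main

    // Version is the git SHA of the build. The "DEV" default is overwritten
    // at packaging time via -ldflags (see package.sh below).
    var Version = "DEV"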

Once we’ve got this, we wrap up our goxc invocation into a small script called package.sh that sets the VERSION variable to our current git sha:
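
A sketch of that script; the exact goxc flags may be spelled differently in your setup, and newer Go toolchains want the -X flag written as main.Version=$VERSION:

    #!/bin/bash
    set -e

    # Stamp the binary with the current git SHA.
    VERSION=$(git rev-parse HEAD)

    # Cross-compile and package, overriding the Version variable in main, and drop
    # the archives into CircleCI's artifacts directory (or ./artifacts locally).
    goxc -pv="$VERSION" \
         -build-ldflags="-X main.Version $VERSION" \
         -d="${CIRCLE_ARTIFACTS:-artifacts}"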

When built locally with go build or godep go build, Version stays "DEV"; in the packaged binary, the linker overwrites it with the git SHA.

CircleCI setup

Notice how we set the destination directory for our archive based on the $CIRCLE_ARTIFACTS environment variable. We use CircleCI to build our binaries. The symlink trickery we do is necessary to make our “library” repositories build on CircleCI– by default, checked-out repositories on Circle don’t seem to be part of the GOPATH. Here’s what a basic circle.yml setup looks like for most of our workers:
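
A sketch of that setup; the import path and workspace location are placeholders for whatever your repository and CircleCI’s Go image use:

    machine:
      environment:
        IMPORT_PATH: github.com/launchdarkly/myservice

    dependencies:
      pre:
        # Symlink the checkout into the GOPATH so imports resolve.
        - mkdir -p "$HOME/.go_workspace/src/github.com/launchdarkly"
        - ln -sf "$(pwd)" "$HOME/.go_workspace/src/$IMPORT_PATH"
        - go get github.com/tools/godep
        - go get github.com/laher/goxc

    test:
      override:
        - godep go test ./...
      post:
        - ./package.sh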

Uploading artifacts

We use ansible for our deploy scripts, and they’re set up to pull artifacts from S3. All of our Circle builds run the following upload_to_s3.sh script:
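
A sketch; the bucket and service names are placeholders, and the aws CLI stands in for whatever S3 client your build image provides:

    #!/bin/bash
    set -e

    SERVICE=myservice
    BUCKET=my-artifact-bucket
    VERSION=$(git rev-parse HEAD)

    # CircleCI exposes the credentials we configured as environment variables.
    export AWS_ACCESS_KEY_ID="$S3_KEY"
    export AWS_SECRET_ACCESS_KEY="$S3_SECRET"

    # Push everything we packaged into a per-service, per-version prefix.
    aws s3 cp "$CIRCLE_ARTIFACTS/" "s3://$BUCKET/$SERVICE/$VERSION/" --recursive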

This script should be pretty re-usable– change the service and bucket names, and enter your S3_KEY and S3_SECRET into CircleCI’s environment variables, and you’re good to go. We tell CircleCI to upload the artifacts as part of a “deployment” step (even though we don’t use CircleCI to actually deploy to EC2):
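
In circle.yml, that looks roughly like:

    deployment:
      artifacts:
        branch: /.*/
        commands:
          - ./upload_to_s3.sh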

Again, you can customize this– for example, you may only want to push artifacts built from master to S3.

Conclusion

None of this was incredibly complex, and that’s a testament to how good the Go tooling is out of the box. Going from an empty repository to a new Go microservice running in production is remarkably easy. In fact, it’s so easy that we’ve automated it (well, everything except writing the actual service). We’ve created a giter8 template (open-sourced on GitHub) that sets up a simple Go program with goxc cross-compilation, CircleCI tests and artifact packaging, and S3 artifact upload.

Once you’ve configured your build on CircleCI (including setting the S3_KEY and S3_SECRET environment variables) and pushed your new repository to GitHub, you’ll see artifacts uploaded to your S3 bucket. From there, we use ansible scripts (which I haven’t covered in this post) to actually deploy artifacts onto EC2 instances.

