30 Sep

The Design Void of the Developer World

The challenge of effectively designing for continuous delivery.

Developers are expected to use increasingly powerful tools to deliver increasingly complex and innovative software. So, why do developer tools often feel like an IRS tax form?

I argue that it stems from a culture that treats aesthetics and functionality as mutually exclusive, when in fact they are deeply intertwined. When software is consumer-facing, we see a sweeping trend of minimalist elegance aimed at cultivating an emotional connection with the user. But when the target user is a developer, design seems to take a back seat to functionality.

“It just needs to work! Not look good!” is what I’ve heard in the past. This is the notion that performing a task accurately is the whole of function; that putting in the correct inputs and receiving a reasonable output is the benchmark of success.

But shouldn’t there be other indicators of success? What about time saved? What about ease of use? What about scalability and reliability? What about the ability to make informed decisions? I can show someone a spreadsheet of tabular data or I could show them some informative charts. Both present the same data. Both functionally deliver. But one works much better at delivering the information and fulfilling its purpose.

Let’s look at continuous delivery: the art of releasing better software in shorter cycles. It’s the amalgam of time management, planning, and iterating. It would make sense, then, for a product built for continuous delivery to reflect this interplay.

At LaunchDarkly, we could have taken two approaches with our continuous delivery tool.

  1. Functional Model: “Let’s just make the product work.”
  2. Empowerment Model: “Let’s make sure our product saves time, mitigates aggravation, and allows for better decision making.”

Let’s assume that both models have equivalent functionality and backend logic. Truly harnessing the entire functional suite requires intuitive access. This is where design comes in. Design is about access and empowerment. It is about empowering the user through access to features, knowledge, and informed decision-making.

Developer tools do not have to be barren and sparse. They can be fun, interactive, and powerful without sacrificing efficiency. While designing for developers is different from designing for consumers, those differences manifest in how information is delivered; they should not come at the cost of empowerment.

17 Aug

Best practices for testing Stripe webhook event processing

While writing the code to integrate our application with Stripe, I was very impressed with the level of polish Stripe has put on their API: the documentation, the language-specific SDK ergonomics, and how easy they make it to integrate with something as inherently complex as payment processing.

However, there were two rough spots in the development/testing process, both surrounding webhooks:

  • Testing webhook processing
  • Webhooks from one test environment being sent to another environment

Handling webhooks from Stripe

Stripe can be configured to send events to your application via webhooks. In this way, you can maintain the internal state of your customers as they transition through the payment process. However, there is no way to know that the webhook request actually came from Stripe.

In order to verify the authenticity of the webhook payload, we need to fetch the event from Stripe’s API, ignoring the payload from the webhook (except the event ID, which we use to perform the lookup). Since we initiate the call to Stripe’s API, and it is secured over HTTPS, we can trust that we are getting accurate data from Stripe. Once we have a valid event, we can do whatever processing we need to on it. This is the process suggested by Stripe:

If security is a concern, or if it’s important to confirm that Stripe sent the webhook, you should only use the ID sent in your webhook and should request the remaining details from the API directly. We also advise you to guard against replay-attacks by recording which events you receive, and never processing events twice.

(I hope all of Stripe’s customers consider security to be ‘a concern’, since we are dealing with payment processing.) The second point, about guarding against replay attacks, is also worth heeding, but it is relatively easy to handle: record each webhook event in a database collection with a unique index on the event ID, and check that the insert succeeded before proceeding.
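Here’s a minimal sketch of that flow in Go, using nothing but net/http and Stripe’s REST endpoint for retrieving events (GET /v1/events/{id}, authenticated with the secret key as the basic-auth username). The handler shape, route, and names are illustrative, not from a Stripe SDK:

    package main

    import (
        "encoding/json"
        "io/ioutil"
        "log"
        "net/http"
        "os"
    )

    // handleWebhook trusts only the event ID from the incoming payload and
    // re-fetches the full event from Stripe over HTTPS before processing.
    func handleWebhook(w http.ResponseWriter, r *http.Request, secretKey string) {
        var payload struct {
            ID string `json:"id"`
        }
        if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
            http.Error(w, "bad payload", http.StatusBadRequest)
            return
        }

        // Re-fetch the event from Stripe; the secret key is the basic-auth
        // username, with an empty password.
        req, err := http.NewRequest("GET", "https://api.stripe.com/v1/events/"+payload.ID, nil)
        if err != nil {
            http.Error(w, "internal error", http.StatusInternalServerError)
            return
        }
        req.SetBasicAuth(secretKey, "")
        resp, err := http.DefaultClient.Do(req)
        if err != nil || resp.StatusCode != http.StatusOK {
            http.Error(w, "could not verify event", http.StatusBadGateway)
            return
        }
        defer resp.Body.Close()

        // Record payload.ID in a collection with a unique index first; if the
        // insert fails, we've already seen this event and should skip it.
        event, _ := ioutil.ReadAll(resp.Body)
        log.Printf("verified event: %s", event)
        w.WriteHeader(http.StatusOK)
    }

    func main() {
        secretKey := os.Getenv("STRIPE_SECRET_KEY")
        http.HandleFunc("/stripe/webhook", func(w http.ResponseWriter, r *http.Request) {
            handleWebhook(w, r, secretKey)
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }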

Testing webhooks from Stripe

The problem with this approach for validating webhook data is that it makes integration testing difficult, because Stripe doesn’t send invoice webhook events right away:

If you have configured webhooks, the invoice will wait until one hour after the last webhook is successfully sent (or the last webhook times out after failing).

So, imagine a test script that does the following:

  • Sign up a user with a new plan
  • Update the user’s credit card to be one that will fail to charge (4000000000000341 is the test card number that Stripe provides for this purpose)
  • Change the trial end date to end in one second (other tests will ensure that the trial period works properly, but this is meant to test the renewal flow)
  • Wait 5 seconds, to be sure the trial has ended, and the webhook has been sent
  • Ensure that the account is put into the ‘charge failed’ mode

However, the ‘Wait 5 seconds’ step isn’t right, since we might need to wait up to an hour. This is way too long to wait to know if our tests pass. So, what else can we do? We can’t just fake the webhook event, since our code needs to fetch the event from Stripe to be sure it is authentic.

Disable authenticity check in test mode

The solution we settled on was to disable the authenticity check in test mode. Since testing just that we can fetch an event from Stripe isn’t a terribly interesting test (and presumably, it is covered by their SDK test suite), I’m comfortable with this deviation between the test and production flows. In the end, the test script listed above looks more like this (the test app has the ‘test mode’ flag enabled, which disables the event fetching):

  • Sign up a user with a new plan
  • Update the user’s credit card to be one that will fail to charge (4000000000000341 is the test card number that Stripe provides for this purpose)
  • Change the trial end date to end in one second (other tests will ensure that the trial period works properly, but this is meant to test the renewal flow)
  • Send a fake webhook event that looks like one Stripe would send for a failed invoice charge (be sure to use a unique event ID, so that it won’t trigger the replay-attack detection code; test that code in a separate test). A sketch of such a sender appears after this list.
  • Ensure that the account is put into the ‘charge failed’ mode
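Sending that fake event might look something like the sketch below. The payload is a trimmed-down approximation of an invoice.payment_failed event, not Stripe’s full schema, and the URL and customer ID are placeholders:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "log"
        "net/http"
        "strconv"
        "time"
    )

    // sendFakeInvoiceFailedEvent posts a minimal fake Stripe event to our
    // webhook endpoint. The timestamp-based event ID keeps each test run
    // unique, so it won't trip the replay-attack detection.
    func sendFakeInvoiceFailedEvent(webhookURL, customerID string) error {
        evt := map[string]interface{}{
            "id":   "evt_test_" + strconv.FormatInt(time.Now().UnixNano(), 10),
            "type": "invoice.payment_failed",
            "data": map[string]interface{}{
                "object": map[string]interface{}{
                    "customer": customerID,
                },
            },
        }
        body, err := json.Marshal(evt)
        if err != nil {
            return err
        }
        resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(body))
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("webhook returned %d", resp.StatusCode)
        }
        return nil
    }

    func main() {
        if err := sendFakeInvoiceFailedEvent("http://localhost:8080/stripe/webhook", "cus_test"); err != nil {
            log.Fatal(err)
        }
    }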

How could this system be improved?

If I were to implement a webhook-sending service, I would include a header on the request containing an HMAC value that could be used to verify that the request came from a trusted origin. The HMAC process is detailed in RFC 2104, but it can be summarized as follows (a code sketch appears after the list):

1. The sender prepares the message to be sent (the webhook payload, in this case).
2. The sender computes a signature using the message payload and a shared secret (this could be the Stripe secret key, or it could be a separate secret used only for this purpose, as long as it is known to both Stripe and your application, and no one else).
3. The sender then sends the message along with the signature (usually in an HTTP header).
4. The receiver (i.e., your application) takes the message and computes its own HMAC signature, using the shared secret.
5. The receiver compares the signature it computed with the one it received; if they match, the message is authentic.
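In Go, the whole scheme fits in a few lines. A sketch, with a placeholder secret and payload:

    package main

    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/hex"
        "fmt"
    )

    // sign computes an HMAC-SHA256 signature over the payload.
    func sign(payload, secret []byte) string {
        mac := hmac.New(sha256.New, secret)
        mac.Write(payload)
        return hex.EncodeToString(mac.Sum(nil))
    }

    // verify recomputes the signature and compares it to the one received.
    // hmac.Equal compares in constant time, which avoids leaking timing info.
    func verify(payload []byte, receivedSig string, secret []byte) bool {
        return hmac.Equal([]byte(sign(payload, secret)), []byte(receivedSig))
    }

    func main() {
        secret := []byte("shared-secret")
        payload := []byte(`{"id":"evt_123"}`)
        sig := sign(payload, secret)              // sender attaches this, e.g. in a header
        fmt.Println(verify(payload, sig, secret)) // true
    }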

Avoiding webhook confusion

The next problem we faced was renewal webhooks being sent to our staging server, referencing unknown accounts. The problem can be summarized like this:

  • Stripe only supports two modes: Live and Test
  • We have many non-production systems: staging, dogfood, each developer’s local instance, etc
  • Stripe retries each configured webhook endpoint until it succeeds. So if you have three endpoints configured and one succeeds while the other two fail, the two failing endpoints will keep being retried.
  • In production, a failing webhook would be a problem that would require investigation—we don’t want something like that to fail silently.

So, if I am testing/developing the signup flow on my laptop, and my local app is configured with the ‘Test’ Stripe credentials, webhooks resulting from these interactions will be sent to our Staging server (since we have webhooks in the ‘Test’ mode configured to go there).

Staging will get the webhook payload, validate it, and then look at the account ID to do its work. Then, it will find that it doesn’t know about the referenced account, log an error message, and return a non-successful status to Stripe, indicating that it should retry. This is the desired workflow in production if processing a webhook fails in this way.

So, how do we avoid this noise in our alerting system? We don’t want to disregard all errors in staging; how would we catch issues before they get to production?

The answer we settled on is simple, but it still feels a little hacky: sign up for a second Stripe account. Don’t use the ‘Live’ mode in this account, and don’t configure any real bank info. Don’t keep any webhooks configured in it (though for testing ad hoc issues, you can add one, possibly using an ngrok URL pointing to your local instance). Use this new dummy account for local developer configs, and use the real account’s ‘Test’ mode for staging.

How could this be better?

Considering how thoroughly Stripe covers the bases in other areas, and what an amazingly easy-to-use and powerful system they provide, it is somewhat surprising that they don’t have better support for customers with even moderately complex testing requirements. We recently implemented our own support for different environments, so we did some research into how other services solve this problem:

  • SendWithUs – they differentiate between production and non-production, but they allow you to create many non-production API keys that behave differently. So my local test key can have all emails sent to my email (regardless of the original recipient), Alexis’s key can have emails sent from his laptop go to his email, and John’s key can send no real emails at all. All of these test emails end up in the same bucket of ‘test’ emails in the logs on the SendWithUs console (so they aren’t siloed all the way through), but analytics are not tracked on test emails, so they don’t interfere with your production metrics. All test and production keys can share the same templates, drip campaigns, etc. You can create as many keys as you like. I think this scheme works well for SendWithUs’s product, but it likely wouldn’t work for products that need environments’ data to remain fully siloed.
  • NewRelic – You can create different ‘applications’ for each environment, and they remain fully siloed. As far as NewRelic is concerned, the same codebase running in production vs. staging is no different from two completely separate applications.
  • Mailchimp – You can create different lists, and have staging subscribe people to the staging list, while production subscribes people to the production list. This is very similar to the NewRelic approach, but in a different domain.
  • LaunchDarkly – This blog post isn’t meant to go into depth about our environments feature; there are other places for that. But we did something more like SendWithUs: environments can share common data, like goals or the basic existence of features, while each environment has its own rules and all data is kept siloed.

So that’s how a few other companies have solved this problem. How could Stripe improve their solution? My first suggestion is to allow me to create as many environments as I need, with all data kept siloed. Alternatively, they could allow me to create groups of webhooks, such that only one webhook in each group must succeed before the event is considered delivered. That would solve my immediate problem, but it feels less flexible than multiple environments and would likely not solve other people’s problems.



03 Aug

Launched: Environments

LaunchDarkly now supports multiple environments!

Environments let you manage your feature flags throughout your entire continuous delivery pipeline, from local development to QA, staging and production. When you create your LaunchDarkly account, we’ll provide you with two environments, called Test and Production. Out of the box, you can use Test to set up feature flags for your non-production environments, while keeping your rules and data separate from your Production environment.

It’s incredibly easy to switch environments— just select your environment from the dropdown on the sidebar. When you create a flag, you’ll get a copy of that flag in every environment. Each flag can have different targeting and rollout rules for each environment. So you can roll a flag out to 100% of your traffic in staging, while keeping it ‘off’ in production.


Each environment has its own API key— use your Test key in your SDK for local development, staging, and QA, and reserve your Production API key for your production environment.

Environments are also completely customizable— you can create new environments, rename them, or delete them. You can even change the sidebar swatch color for your environment— make your Production environment red, for example, to give yourself a visual reminder that you’re modifying the rules for your live customer base.


Check out our documentation to learn more!



21 Jul

Golang pearl: Thread-safe writes and double-checked locking in Go

Channels should be your first choice for structuring concurrent programs in Go. Remember the Go concurrency credo– “do not communicate by sharing memory; instead, share memory by communicating.” That said, sometimes you just need to roll up your sleeves and share some memory. Lock-based concurrency is pretty old-school stuff, and battle-hardened Java veterans switching to Go will undoubtedly feel nostalgic reading this. Still, many brand-new Go converts probably haven’t encountered low-level concurrency primitives before. So let’s sit down and program like it’s 1999.

To start, let’s set up a simple lazy initialization problem. Imagine that we have a resource that is expensive to construct: it’s read often but written only once. Our first attempt at lazy initialization, sketched below with a placeholder Resource type, will be completely broken:
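    package lazy

    type Resource struct{} // placeholder for something expensive to construct

    var instance *Resource

    // GetInstance lazily constructs the resource with no synchronization.
    func GetInstance() *Resource {
        if instance == nil { // data race: two goroutines can both observe nil
            instance = &Resource{}
        }
        return instance
    }

    // Broken spawns two goroutines that race in GetInstance.
    func Broken() {
        for i := 0; i < 2; i++ {
            go func() {
                _ = GetInstance()
            }()
        }
    }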

This doesn’t work, as both goroutines in Broken may race in GetInstance. There are many incorrect (or semantically correct but inefficient) solutions to this problem, but let’s focus on two approaches that work. Here’s a sketch of one using read/write locks:
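    package lazy

    import "sync"

    type Resource struct{} // placeholder for something expensive to construct

    var (
        lock     sync.RWMutex
        instance *Resource
    )

    func GetInstance() *Resource {
        lock.RLock()
        r := instance
        lock.RUnlock()
        if r != nil {
            return r // fast path: already initialized, only the read lock taken
        }

        lock.Lock()
        defer lock.Unlock()
        if instance == nil { // re-check: another goroutine may have beaten us here
            instance = &Resource{}
        }
        return instance
    }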

If you’re a Java developer, you might recognize this as a safe approach to double-checked locking. In Java, the volatile keyword is typically used on instance instead of a read/write lock, but since Go has no volatile keyword (there is sync/atomic, and we’ll get to that), we’ve gone with a read lock.

The reason for the additional synchronization around the first read is the same in Go as it is in Java. The Go memory model does not guarantee that the initialization of instance is visible to other goroutines unless there is a happens-before relation that makes the write visible; the read lock establishes exactly that.

Now back to sync/atomic. Among other things, the sync/atomic package provides utilities for atomically visible writes. We can use this to achieve the same effect as the volatile keyword in Java and eliminate the read/write lock (in the sketch below, a plain mutex still guards the one-time initialization, but reads no longer take any lock). The cost is one of readability: we have to change instance to an unsafe.Pointer to make this work, which is aesthetically displeasing. But hey, it’s Go; we’re not here for aesthetics (I’m looking at you, interface{}):
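    package lazy

    import (
        "sync"
        "sync/atomic"
        "unsafe"
    )

    type Resource struct{} // placeholder for something expensive to construct

    var (
        lock     sync.Mutex
        instance unsafe.Pointer // holds a *Resource, read and written atomically
    )

    func GetInstance() *Resource {
        if p := atomic.LoadPointer(&instance); p != nil {
            return (*Resource)(p) // fast path: atomic load, no locks at all
        }

        lock.Lock()
        defer lock.Unlock()
        if p := atomic.LoadPointer(&instance); p != nil {
            return (*Resource)(p) // another goroutine initialized it first
        }
        r := &Resource{}
        atomic.StorePointer(&instance, unsafe.Pointer(r))
        return r
    }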

Astute Gophers might recognize that we’ve re-derived a utility in the sync package called Once. Once encapsulates all of the locking logic for us so we can simply write:
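    package lazy

    import "sync"

    type Resource struct{} // placeholder for something expensive to construct

    var (
        once     sync.Once
        instance *Resource
    )

    func GetInstance() *Resource {
        once.Do(func() { instance = &Resource{} }) // runs exactly once, safely
        return instance
    }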

Lazy initialization is a fairly basic pattern, but once we understand how it works, we can build safe variations like a resettable Once. Remember though– this is all last-resort stuff. Prefer channels to using any of these low-level synchronization primitives.


25 Mar

Golang pearl: It’s dangerous to go alone!


In Go, the panic function behaves somewhat like an unrecoverable exception: panic propagates up the call stack until it reaches the topmost function in the current goroutine, at which point the program crashes.

This is reasonable behavior in some environments, but programs that are structured as asynchronous handler functions (like daemons and servers) need to continue processing requests even if individual handlers panic. This is what recover is for, and if you inspect the source you’ll see that Go’s built-in HTTP server package recovers from panics for you, meaning that bugs in your handler code will never take down your entire HTTP server.

Unless, of course, your handler code spawns a goroutine that panics. Then your server is screwed. Let’s demonstrate with a trivial example, something like this:
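    package main

    import (
        "log"
        "net/http"
    )

    func hello(w http.ResponseWriter, r *http.Request) {
        panic("oops") // net/http recovers this; the server keeps serving
    }

    func main() {
        http.HandleFunc("/", hello)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }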

This server will happily chug along, panicking every time the root resource is hit, but never crashing.

But if hello panics in a goroutine, the entire server goes down. Swapping in a handler along these lines crashes the whole process on the first request:
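    func hello(w http.ResponseWriter, r *http.Request) {
        go func() {
            panic("oops") // nothing recovers a panic in a raw goroutine
        }()
    }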

In our web services, we can never really trust a naked use of go. We wrap all our goroutine creation in a utility function we call GoSafely. A sketch of the idea:
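    package safego

    import (
        "log"
        "runtime/debug"
    )

    // GoSafely runs fn in its own goroutine, recovering from any panic and
    // logging the error and stack trace instead of crashing the process.
    func GoSafely(fn func()) {
        go func() {
            defer func() {
                if err := recover(); err != nil {
                    log.Printf("recovered from panic in goroutine: %v\n%s", err, debug.Stack())
                }
            }()
            fn()
        }()
    }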

Here’s how we use it:
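    func hello(w http.ResponseWriter, r *http.Request) {
        GoSafely(func() {
            panic("oops") // recovered and logged by GoSafely; the server survives
        })
    }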

Not as syntactically sweet as a naked go statement, but it does the trick. The unfortunate thing (which we don’t really have a solution for) is that any third-party code that spawns a goroutine could potentially panic, and we have no way of protecting ourselves from that.


17 Mar

What we talk about when we talk about API reliability

One of the first questions we often get from customers is about reliability– how can you make a system that uses a remote API reliable? What happens to our site if your service goes down?

The usual approach to reliability is to pick an SLA, say 99.99% uptime, turn on a monitoring tool like Pingdom to measure uptime against that SLA, and then play a montage while your engineers do a lot of hard work to meet that SLA.

My issue with this is that uptime is a one-dimensional view of reliability. Counting nines is almost certainly a necessary condition, but it’s often not sufficient. It ignores other important questions about reliability (How much variability is there in response times? What happens to performance as the number of concurrent clients increases?).

LaunchDarkly provides feature flags as a service, so our API delivers rules that determine which features our customers’ end-users see. If our API is unreliable, these end-users could see changes in their UI, missing functionality, or worse. Of course, uptime is critical for our service, and we have to do all the normal “montage work”– zero downtime deploys, extensive integration and load testing, and ensuring redundancy in all critical systems. But reliability for us also means handling things like sporadic packet loss, latency spikes, and other issues that are invisible to a tool like Pingdom, but potentially visible to end-users.

For us, thinking about reliability as a number line that goes from 0 to 100 wasn’t sufficient. We started thinking about the problem from the ground up– starting with what was even possible in terms of designing for reliability.

“It’s happened again— you’ve wasted another perfectly good hour listening to CAP Talk.”

If you’re familiar with distributed systems, the CAP theorem is probably near and dear to your heart. In a nutshell, the CAP theorem states that a distributed system can have at most two of three properties:

  • Consistency: The system agrees on one up-to-date copy of data
  • Availability: Every request receives a response, whether it succeeded or failed
  • Partition tolerance: The system continues to function in the presence of network failures

Most people think about CAP in relation to database design, but it’s important to keep in mind that the theorem applies to any shared-network shared-data distributed system. This means that any HTTP API and its clients are bound by the CAP theorem. And this is one point we often forget when we talk about API reliability.

In the case of an HTTP API running over the Internet, we have to accept two facts:

  • Occasional large spikes in latency are inevitable
  • Latency is often indistinguishable from network failure

Practically speaking, designing an API with CAP in mind means that we’ll encounter a situation where something bad happens, and we have to choose between consistency and availability. In the case of a GET request, for example, this means:

  • Choosing consistency and returning an error
  • Choosing availability and returning stale content

The LaunchDarkly API returns “rulesets” defining how feature flags should be rolled out. For example, a ruleset might contain logic like “roll this feature out to 25% of all users, plus everyone in the beta testers group”. Serving stale data means delivering an older copy of those rules, which is preferable to returning an error or waiting an unbounded amount of time (the consistency route) and forcing customers to fall back on a “default” variant for the feature flag that they’ve hardcoded. With that choice in mind, we had a clearer mandate to work with:

  • Maximize availability by trading off consistency
  • Minimize the possibility of partitions

The dream of the 90’s is alive

The 90’s were a great time. Java came along and changed the web forever with applets. I rocked JNCO jeans and had blue hair. And CDNs hit the scene, dramatically decreasing the probability that clicking on a Slashdot link would lead to a server timeout.

90’s reminiscence aside, CDNs were designed precisely to provide caching and redundancy. The challenge, though, is that CDNs are (traditionally) designed to serve static content. Proxying an API through a CDN leads to a host of problems. For one thing, CDNs typically aren’t able to cache authenticated content, which is a must for our APIs. Another problem is that CDNs usually aren’t well suited to managing rapidly changing content– purge latencies on the order of 5-15 minutes are common. While we’d already gone down the path towards eventual consistency, 15 minutes is a little too eventual for us.

Luckily, we found a solution in Fastly. Fastly gives us the option to specify caching rules using VCL, which gives us the fine-grained control we need to safely cache authenticated content. All we need to do is specify Vary: Authorization headers, telling Fastly to keep separate content caches for distinct API keys. Fastly also offers purge latencies on the order of milliseconds, as well as an API we can call programmatically to purge content whenever a customer updates the rules for their feature flag.

You might have noticed that using a CDN actually increases the chance of a network partition occurring, because CDNs introduce multiple POPs as intermediaries that may fail. Here we leverage another consistency tradeoff that reduces the impact of partitions– we try to avoid network calls altogether.

The best remote call is the one you don’t make

One advantage that we have with LaunchDarkly is that our API’s “mission-critical” clients (the ones that serve feature flags) are under our control: we build and distribute SDKs for our supported platforms. This means that we can introduce another layer of caching in these clients, with a completely different set of cache directives from what we give Fastly, our intermediate cache.

First, we use a simple max-age directive to set a TTL for feature flag changes. This is another area where we trade off consistency for availability: our SDKs can avoid making network calls as long as the max-age of the locally cached content hasn’t been exceeded. In production, we recommend setting the TTL to 5 minutes, which means that our SDKs only make one remote call every 5 minutes. No matter what the frequency of feature flag requests is, only one call every 5 minutes will have any significant overhead. We can even hide that overhead by using stale-while-revalidate to serve stale content while we fetch the new flag rules in the background. The main tradeoff here is that changes to feature flag rollout rules on the LaunchDarkly dashboard don’t take effect for 5 minutes.

We also use stale-if-error, which tells our client caches to prefer stale content to error responses. stale-if-error gives our SDKs some resilience to things like Fastly POP outages.
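Putting the pieces together, a flag response from our API might carry headers roughly like the sketch below. The values and ruleset are illustrative, not our exact configuration:

    package main

    import (
        "log"
        "net/http"
    )

    var ruleset = []byte(`{"rollout": 25}`) // placeholder flag rules

    func serveFlags(w http.ResponseWriter, r *http.Request) {
        // Vary on Authorization so the CDN keeps a separate cache per API key.
        w.Header().Set("Vary", "Authorization")
        // 5-minute TTL; serve stale while refreshing; prefer stale to errors.
        w.Header().Set("Cache-Control", "max-age=300, stale-while-revalidate=60, stale-if-error=86400")
        w.Write(ruleset)
    }

    func main() {
        http.HandleFunc("/flags", serveFlags)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }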

Conclusion

The takeaway here is that we shouldn’t be satisfied with measuring reliability in terms of uptime alone. Definitely do continue using tools like Pingdom to monitor uptime– that’s table stakes. But other factors, like baseline error rates, median + 90th percentile response times, and failure mode under load need to be considered as well. In the case of HTTP APIs over the Internet, design tradeoffs between consistency and availability can be made to improve reliability according to these metrics, independent of uptime numbers.

