08 Sep 2017

Saving Private Instances

You are a new Director of Engineering at an enterprise company. The company has just moved your product to the cloud and is hosting it in Microsoft Azure. You come from a tech startup where you ran all of your software in the cloud and focused on building product. So naturally you have just implemented instrumented monitoring with Honeycomb, adopted Kubernetes to manage your containers, and started thinking about leveraging serverless architecture.

You moved to this established enterprise company because they are providing technology to the healthcare industry and you are passionate about this space. They want you because you are good.

Now you face a challenge: how can you bring the good things from your past role and pair them with the security standards and compliance requirements of your new role? Well, let’s first think about what those good things entail. Your previous team:

  • Deployed 5 times a day
  • Used feature branching
  • Practiced feature-flag-first development
  • Implemented continuous integration

You realize you used third-party software for most of these good things and homegrown tools for the rest. You decide your first project is to research what software the new team is using, what it has built, and what you will need to build or buy.

Security and the cloud.

Now, before we dive in, let’s consider the security standards and compliance you need to think about in your new role:

 

  • Where is my data stored?
  • What data is being sent to the third party? Is it PII?
  • Where are connections made to third parties?
  • How can I set different access controls?
  • How will it work when a connection fails? Is there redundancy?
  • Is it HIPAA compliant?
  • How is all of this audited?

 

Like your new company, many companies are just starting to move key operations to the cloud. This is SCARY. Moving into Azure will allow your company to free up floor space, add server redundancy, scale up and down easily, pay only for what you use, and focus resources on revenue-generating activities. These are all great things, but security is still your responsibility, and you are not a security company.

Now you are tasked with bringing in these good processes, and they involve third-party software providers. Products in this space are embedded in your development lifecycle and sit very close to your core application, so you need to ensure they are secure.

As we all know, most software providers in this space are cool tech companies in the Valley, far removed from the realities of your secure enterprise world. You look at on-premises options, then remember that you’re driving change in healthcare, not looking to manage infrastructure. The cloud options require a significant, time-consuming audit. And a SaaS provider carries ongoing risk, because it requires you to store some data on third-party servers, which is a non-starter in healthcare.

Is there a middle ground?

There is one more option many software providers offer: private SaaS, also known as a managed instance, dedicated VPC, or private instance. You surmise that if a vendor does not offer on-prem, private SaaS, or at the very least a really, really good security team, its team is probably not accustomed to working on enterprise challenges.

What exactly is a private SaaS offering? It is a dedicated, single-tenant deployment of cloud software in which the vendor manages the infrastructure.

Why choose a private instance?

  • Single Tenant—You will have a dedicated set of infrastructure contained within a VPC. This eliminates the risk of noisy neighbors.
  • Data Storage—Data can be stored in your AWS account or in an isolated section of the vendor’s AWS account. This allows flexibility: if you are in the EU, the vendor can spin up the instance in an EU AWS region.
  • Residency—If you have a preference on where the instance is located or need to ensure proximity.
  • Compliance—If you have additional security or compliance regulations, a private instance can be customized to fit your needs.
  • Change Cadence—If you need to know when the software will be updated, you will have better insight and greater flexibility with a private instance.
  • Integrations—If you have custom tools, integrations can be built.

What’s next?

You have narrowed down a few software providers for each job you are trying to solve (continuous integration, branching, feature flagging). You understand the value, the costs, and the security implications. You have chosen to stay in the cloud, which makes your infrastructure team happy. You have chosen to use private instances and SaaS where it makes sense, which makes your security team happy. And you have the tools you need to bring the good things into your new role, so you are happy.

Now you can focus on helping your company deliver product faster, eliminate risk in your release process, speed up your product feedback loop, and do it all securely.

21 Aug 2017

How to Comply

Everyone loves a little affirmation (Image credit: twitter)

Things we wish we had known and things we were happy that we implemented early. 

At LaunchDarkly we recently embarked on the journey to SOC 2 Type 2 compliance. While the reasons we chose to pursue this certification were primarily business driven, the tasks and actions incorporated into the process most directly impacted our engineering operations and development teams. And thanks to the experience and philosophy of the founding engineering team, the actual impact of the process was minimal.

Once you decide to sally forth on SOC certification (or any security and compliance certification, for that matter), you will be in a much better position if you view the certification’s criteria as a benefit (or imperative) of good business practice. If you are in search of the certification primarily as a way to sell into certain customers or verticals, you will likely be frustrated with the entire process. Security is like diversity—you need to inherently believe in the value to be successful in implementation and outcome.

“So, what are these criteria you speak of?”

Need vs. want

The principle of least privilege is a well-known concept: you provide access only to the people who actually need it. This also extends to the level of access that is given, which ultimately distills to view vs. control. I like to start here because it is a forcing function for so many of the other criteria. When you are thinking about who should have access to each system or service, you are inclined to do the following (a minimal sketch follows the list):

  • Define roles
  • Log activity based on user/account
  • Build review process for accounts
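
Here is a rough illustration of what “define roles” and the view-vs.-control distinction can look like in code. The role names, permissions, and the require helper are hypothetical and purely for illustration; a real system would typically lean on its IAM or authorization layer rather than a hand-rolled map.

```python
# Hypothetical least-privilege sketch: roles map to explicit permissions,
# and every check is logged with the acting account for later review.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("access")

ROLE_PERMISSIONS = {
    "viewer":   {"dashboard:view"},                                      # view
    "operator": {"dashboard:view", "deploy:trigger"},                    # control
    "admin":    {"dashboard:view", "deploy:trigger", "account:manage"},  # full control
}

def require(account: str, role: str, permission: str) -> bool:
    """Return True only if the account's role grants the permission."""
    allowed = permission in ROLE_PERMISSIONS.get(role, set())
    # Record who asked for what, so account reviews have real data to work with.
    log.info("account=%s role=%s permission=%s allowed=%s",
             account, role, permission, allowed)
    return allowed

if require("alice@example.com", "viewer", "deploy:trigger"):
    print("triggering deploy...")
else:
    print("access denied")
```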

On-boarding each new employee gives you an opportunity to review the tools you use to run your business, question who needs access, and to what extent.

This is also a great time to introduce new employees to another key security maxim: Trust, but verify. This concept applies both to the access that accounts or component services assert they need, and to the way that you validate user accounts.

For services—either third party or just components of a larger service—you should know what information is being stored, and where. Ideally, you are being thoughtful about the first principle, least privilege.

This is especially true of customer data. The easiest way to protect data is to limit what you collect. The second is to make sure that you are intentional and explicit about where you store the data. Finally, you need to look at the access policy you put around that data.

For user accounts, multi-factor authentication (MFA) or two factor authentication (2FA) significantly increases your ability to validate that accounts are only being accessed by account owners.

Another time to plan for is employee off-boarding—ideally, before you off-board your first employee. This is also a good time to review your tools and access privileges.

Context, not blame

When a single developer is working on building a service, logs are primarily useful for understanding how well things are working (or where they broke). As the number of individuals working on the system grows, the ability to know who changed what becomes increasingly important.

One caveat: as you incorporate the ability to know who, make sure you leverage this data to build context around why things changed, rather than simply using it as a means to place blame. If one person can bring down your service, then you should probably direct blame at the architect of the system. (Unless that destructive individual and the architect are the same person ¯\_(ツ)_/¯).

“A log without account context is like a novel without characters.”

A log without account context is like a novel without characters. You can build a picture of what happened, but you will likely miss why it happened. If you don’t know why, you’re unlikely to prevent it from happening again.
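
As a small illustration (the field names and the log_change helper are hypothetical, not from any particular system), a structured log entry that carries the acting account and the reason for a change gives you both the characters and the plot:

```python
# Minimal structured-logging sketch: every change event records who acted,
# what changed, and why, so later reviews can reconstruct the story.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("audit")

def log_change(account: str, action: str, target: str, reason: str) -> None:
    """Emit one JSON audit record per change."""
    log.info(json.dumps({
        "account": account,   # who
        "action": action,     # what
        "target": target,     # where
        "reason": reason,     # why -- the part a bare log line usually misses
    }))

log_change(
    account="deploy-bot",
    action="config.update",
    target="rate-limiter.max_rps",
    reason="raised limit ahead of expected launch traffic",
)
```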

Built for toddlers… or failure

Failure is trivial compared to proactive destruction (image credit: Lego City)

The stability of a service is often a strong indicator of its inherent security. After all, the most common exploitations are based on overloading some resource. Continuing the theme of removing blame from building a secure and stable service, failures should be viewed as opportunities for increased robustness. This is where building for toddlers comes in.

Toddlers are the ultimate destroyers. It is the developmental stage where everyone starts to experimentally test the laws of physics. Gravity, entropy, projectile motion, harmonic oscillation—they’re all put through a battery of tests.

Ideally, you are thinking about your service from the perspective of a parent (or guardian) that is toddler-proofing their home. Bolt things down, put breakable things out of reach, lock up the flamethrower, and embrace the fact that you will miss things. For the items you miss, have an emergency procedure in place and appropriate medical supplies on hand.

Back in the context of your service…

Write it down… or it never happened

Your code/feature is not ‘done’ until the docs are written or updated. Services require constant supervision and are never ‘done’. But if a developer builds or changes something and doesn’t write it down, then they effectively become the only individual that is able to monitor or operate the entire service (at least with all the context).

So, what should you write down? “Everything,” is the easy answer, but often not the complete one. You want to write down enough to provide context if any component starts doing something unexpected.

If you don’t know where to start, a good approach is to write down what would need to happen if your service was deleted. How would you rebuild and restart your service? How would you restore your data?

Next you can look at the impact/process of the loss of each individual component service. The important part of this is to incorporate the documentation into the development process to ensure that as your service evolves your documentation is always up to date—otherwise, the change doesn’t exist after a failure.

Great, now you know what to do next time AWS S3 needs to reboot. But, what about your customers? The next step is to write down an action plan for service interruption. Make sure you have a process and plan in place for keeping folks in the loop.

Security is not the french fries

If you are in a situation where you are looking to “add” security, you are likely going to be in good company with Sisyphus. Security needs to be a part of your foundation—it is not an “add-on”. But if you are realizing this now—that security is a requirement for your business—you can do more than wish you had considered it sooner. It is not too late, but it is not a quick fix that you can solve with a certification.

First you need to build security into your foundation and your processes. Make sure that it is part of your culture. Once you have a culture of security and process, compliance is just providing proof of that culture.

Now… about that certificate

You don’t show up to your official Guinness Book of World Records judging day having never practiced juggling 9 clubs. The same goes for when you decide to get your SOC certification.

However, when you build a strong foundation and culture of security and compliance, the steps to certification are rather straightforward. You call up your friendly neighborhood SOC auditor and get a copy of their checklist.

In the case of LaunchDarkly, we worked with the fine folks at A-Lign. After an initial conversation we retained their services to conduct an independent audit of our systems and practices.

A few months prior to the audit, A-Lign provided our team with a checklist of all the documentation and proof they would need to see when they came onsite for our assessment. This afforded us the opportunity to ensure that all of our practices were organized and in a state that could be easily evaluated.

When the time came for the audit, the auditor spent three days on-site interviewing members of the team and reviewing our practices. After the on-site visit, we were informed that we had passed the initial compliance certification.

Of course, now that we have gone through the validation process for one certification, it seems like a good time to keep going for a few more. Many of the certifications have a significant overlap in requirements. They are all looking to establish trust and ensure the service provider is operating in the best interest of the customer. And it turns out, most customers define trust in a very similar way.

18 Aug 2017

Risk Reduction and Harm Mitigation

Risk Reduction is trying to make sure bad things happen as rarely as possible. It is anti-lock brakes, vaccinations, clothing irons that turn off by themselves, and all sorts of things that we think of as safety modifications in our life. We are trying to build lives where bad things happen less often.

Harm Mitigation is what we do to make sure that when bad things do happen, they are less catastrophic. Fire sprinklers in buildings, seat belts, and needle exchanges are all about making the consequences of something bad less terrible.

What does that mean for a DevOps world where our risks and harms are very different? Charity Majors says we shouldn’t split the world into developers and operations, but into product and infrastructure—and I think that’s a useful way to think about risk too.

Product risk is problems that users experience. We can usually predict and mitigate the danger by testing and being aware of common failings. For example, we can expect and plan for users that may be on a flaky connection, or that they may try to exit a page without saving information. We can work around these problems because we know they’re out there. But our deployments are not something they should see until we’re ready for prime time. For this kind of risk, we use feature flags to control what content is delivered.

Infrastructure risk is more about the inherent fragility of delivering software. CDNs, fiber, servers, switches, towers…the whole system of getting data to people has failure zones. When we are trying to reduce infrastructure risk, we assume that latency is ever-present, networks are intermittent, and we won’t always be able to count on everything working right the first time. We try to build in robust and flexible ways that can route around failures. This is the place we might use feature flags to control failovers or to create circuit breakers to prevent flooding a fragile sector.

Product harm reduction is about making sure that users can have a positive experience, even if something happens that makes it less than ideal. We want to preserve their blog drafts, keep them from committing errors, only show them things they are allowed to change, and above all, avoid giving them a blank page. Something has gone wrong in the trip to the user, but they shouldn’t have to suffer for it.

Infrastructure harm reduction is the ability to pull back breaking fixes, shunt users away from vulnerabilities, and respond near-immediately to things that have gone very wrong. Harm reduction at this level is the kind of action a pager-responder can take to get things back on track before doing more intensive repairs and investigations in the morning.

In my first week, I spent a lot of time thinking about how to summarize our product for a variety of audiences. “Feature flags as a service” is short and pithy, but only works if you can bring your own definition of “feature flag” and a business understanding of why you would use them. What about “Feature flags allow you to make changes in near real-time instead of waiting for a deployment, and LaunchDarkly helps you manage and track flags across an organization”?

Well, that works, but it still doesn’t get to the core of why an organization would want to use feature flags. What I’ve come up with so far is this:

Feature flags segment the risk of creating a product into manageable parts.

Creating and deploying software is risky. We can accidentally build in errors, we can deliver it badly, or to the wrong people, and it can interact in unfortunate ways with existing software or hardware. As organizations, we want to do our best to do no harm and provide benefit. Using feature flags lets us wrap our features in decision points that we can then use to make life easier for our users.
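
As a minimal sketch of what such a decision point looks like (the flag key and the flag_enabled stand-in are hypothetical; in practice the lookup would be a call to your feature-flag SDK):

```python
# Hypothetical decision point: the new checkout flow ships dark and only
# runs when the flag is on for this particular user.

def flag_enabled(flag_key: str, user: dict, default: bool = False) -> bool:
    """Stand-in for a real feature-flag evaluation against targeting rules."""
    return default

def legacy_checkout(user: dict) -> str:
    return f"legacy checkout for {user['key']}"

def new_checkout(user: dict) -> str:
    return f"new checkout for {user['key']}"

def checkout(user: dict) -> str:
    # The flag is the decision point: turn it off and every user is back on
    # the known-good path, with no redeploy.
    if flag_enabled("new-checkout-flow", user):
        return new_checkout(user)
    return legacy_checkout(user)

print(checkout({"key": "user-123"}))
```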

Here are some types of risks that are reduced by using feature flags:

  • Server falls over from too much traffic
  • Canary launch is not well-tracked, problems are missed
  • Old features and workarounds are invisible and get left in place
  • Feature with vulnerable content is deployed
  • API endpoints are exposed to unauthorized users

Managing your feature flags is a post for another day—today I ask that you take a few minutes and think about how you can reduce risk and minimize harm in your organization, your project, or your code. How can you make things robust enough to resist failure, instrumented sufficiently to identify a failure spot, and flexible enough to reduce harmful consequences on the fly?

08 Aug 2017

Flexible Infrastructure with Continuous Integration and Feature Flagging


I’m incredibly excited to be LaunchDarkly’s first solutions engineer. During my first week I got to learn about some of the clever ways we do feature management. Not only do we use feature flags to control the release of fixes and new features, but we also use them to manage the health of our infrastructure in production. I’ve been a part of a number of teams, and I’ve never seen a more advanced development pipeline.

Normally, dealing with issues in production can be a frightening and time-consuming experience, but adopting a mature continuous delivery pipeline allows you to react faster and be proactive. Continuous integration and deployment make getting fixes into your production code a trivial task, and feature flagging takes it to the next level: you can put fixes in place for potential future pain points and enable them without having to do another deploy.

One common problem is handling extreme server load, and it is managed easily with LaunchDarkly. Imagine you have a server that is pulling time-sensitive jobs from a queue, but the queue fills faster than the server can drain it, causing all jobs to fail. In situations like these it would be better to get at least some of the jobs done instead of none of them. This concept is known as “bend-don’t-break”.

I built a proof of concept using Python and RabbitMQ that demonstrates how you could use LaunchDarkly’s dashboard to control what percentage of jobs get done and what percentage get thrown away. If the worker takes too long to get to a job, the job fails. As you see the queue grow, you can manage it easily with feature flags.

It consists of two scripts, taskQueuer and taskWorker. The taskQueuer adds imaginary time-sensitive jobs to the queue; the rate is configurable using feature flags.

The taskWorker removes one job from the queue and processes it; each job takes one second. If tasks are queued faster than the worker can process them, the queue fills up and the worker will begin failing. To protect against this, you can use the “skip rate” feature flag to allow the worker to drop a certain percentage of jobs on the floor.
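
The original scripts aren’t reproduced here, but a stripped-down version of the worker side might look something like this. It assumes a local RabbitMQ broker and the pika client; the queue name and the flag lookup are illustrative stand-ins (in the real POC the skip rate comes from a LaunchDarkly flag evaluation):

```python
# Minimal "bend-don't-break" worker sketch: drop a flag-controlled percentage
# of jobs when the queue is filling faster than the worker can drain it.
# Assumes `pip install pika` and a RabbitMQ broker on localhost.
import random
import time

import pika

def get_skip_rate() -> int:
    """Stand-in for evaluating the 'skip rate' feature flag (0-100)."""
    return 0  # with the flag off, every job is processed

def handle(channel, method, properties, body):
    if random.randint(1, 100) <= get_skip_rate():
        # Bend, don't break: drop this job so the remaining ones stay fresh.
        channel.basic_ack(delivery_tag=method.delivery_tag)
        return
    time.sleep(1)  # each job takes one second of "work"
    print(f"processed job: {body!r}")
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="tasks")
channel.basic_consume(queue="tasks", on_message_callback=handle)
channel.start_consuming()
```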

The concept of using LaunchDarkly as a control panel to manage the operation of your app is really cool and opens up a world of possibilities beyond simply percentage rollouts and canary releases. If you have an interesting implementation, get in touch with us at hello@launchdarkly.com and maybe we’ll feature it!

More about tough devops: https://insights.sei.cmu.edu/devops/2015/04/build-devops-tough.html
More about using Python with rabbitMQ: https://www.rabbitmq.com/tutorials/tutorial-one-python.html
More about LaunchDarkly’s stack: https://stackshare.io/launchdarkly/how-launchdarkly-serves-over-4-billion-feature-flags-daily

12 Apr 2017

How Spinnaker and Feature Flags Together Power DevOps

It’s very common for customers to be excited about both Spinnaker (a continuous delivery platform) and feature flags. But wait: aren’t they both continuous delivery tools? Yes, they are both trying to solve the same overall pain: getting code quickly, in a repeatable, non-breaking fashion, from the hands of the developers into the arms of hopefully excited end users, with a minimal amount of pain and heartache for everyone along the toolchain. But they solve different pain points:

  • Spinnaker helps you deploy functionality to clusters of machines.
  • Feature Flags help you connect that functionality to clusters of USERS.

Spinnaker helps with “cluster management and deployment management”. With Spinnaker, it is possible to push out code changes rapidly, sometimes hundreds (if not thousands) of times a day. As Keanu Reeves would say, “Whoa.” That’s great! All code is live in production! Spinnaker even has handy tools to run red/black deployments, where traffic can be shunted from cluster to cluster based on benchmarks. Dude! For those who remember the “Release to Manufacturing” days, when binaries had to be put on an FTP server (in the hope that someone would download and install them in the next quarter or so), code being live within a few minutes of being written is amazing. For those who remember “master disks” and packaged software, this is even more amazing.

Nevertheless, with dazzling speed comes another set of problems. All code can be pushed anytime, but many times you do not want everyone to have access to it; you want to run a canary release on actual users, not just machines. You might want QA to try your code in production instead of on a test server with partial data. If you’re a SaaS product, you might want your best customers to get access first so you can get their feedback. For call center software, you want an opportunity to test in a few call centers. You might want a marketing push in a certain country days (or weeks or months) after another country. You might want to fine-tune a feature with some power users, or see how new users react to a complicated use case. None of these scenarios can be handled at the server level. This is where feature flags come in: by feature flagging, you can gate off a code path, deploy using Spinnaker, and then use the flag to control actual access.
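
A rough sketch of that last step, with a hypothetical flag key and flag_enabled stand-in: Spinnaker ships the new code path to every cluster, while the flag decides which users actually exercise it. The targeting rules themselves (QA group first, then your best customers, then one country at a time) would live in the flag’s configuration rather than in code, which is why the call passes along user attributes.

```python
# Hypothetical gate around a newly deployed code path. The deploy puts the
# code on every machine; the flag controls which USERS actually reach it.

def flag_enabled(flag_key: str, user: dict, default: bool = False) -> bool:
    """Stand-in for a feature-flag SDK evaluation against targeting rules."""
    return default

def search(user: dict, query: str) -> str:
    # Pass the attributes the targeting rules need (group, plan, country, ...).
    if flag_enabled("new-search-backend", user):
        return f"new backend results for {query!r}"
    return f"legacy backend results for {query!r}"

qa_user = {"key": "qa-7", "custom": {"group": "qa", "country": "US"}}
print(search(qa_user, "call center analytics"))
```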

Together, Spinnaker and feature flagging make an amazing combination. You can quickly get code to “production”, and from there decide who gets it, when.

14 Mar 2017

My Agile Launch

Starting at a company that helps software teams release faster with less risk has reminded me of my first foray into agile development.

One of my earliest professional experiences was as an intern at HBO, where I reported directly to the VP of Emerging Technologies. Over the course of the summer, it became clear that there were more promising new technologies to explore than there were engineers. Undaunted college student that I was, I convinced my boss that I should own one of these projects. Every day I would demo my work on a screen in his office, and he would provide feedback. A few weeks later the VP presented to senior management and the company officially green-lit the project. Though we didn’t refer to it as such, that cycle was my introduction to Agile. Yet more importantly, it was a daily routine in which a non-technical business stakeholder provided direct feedback to a technical resource, an arrangement that I’ve come to realize is quite uncommon.

Working at large and small companies in a variety of engineering roles, I’ve identified two overall trends:

  1. Companies often separate engineering from the business side.
  2. Most development-related failures (e.g. missed deadline, bad or buggy feature, release causing a security vulnerability, etc.) are a result of miscommunication.

Neither of these statements is profound in isolation, but taken together they genuinely pique my curiosity.

Over time the broader engineering community has developed numerous tools and processes to mitigate risk.  Missing deadlines?  Use story points to measure team output (velocity) over time.  Releasing buggy features?  Try test-driven development.  Want to avoid downtime during a release?  Set up application performance monitoring in your staging environment.  While these “quality assurance” measures are not guarantees of perfection, they make development more predictable.

Problems arise when we cut corners in response to misaligned expectations.  Let’s say there’s a feature request for a relatively straightforward user enhancement.  The development team has done everything right to this point (features properly designed and scoped), but towards the end of the sprint the team finds a scalability issue with no clear path forward.  Engineering management notifies the business side of this issue and insists that there should be resolution within five business days.  Five days pass and the developers have made no progress.  Engineering rushes to fix the issue through a refactor but skips unit testing to release earlier.  Two new bugs slip into production.  Instead of working on the next sprint the development team now works to get a hotfix release out.  One unexpected event can change everything.

The separation of the business side from its engineering counterpart sets the stage for frustration and missed opportunities.  Any tool that can bridge the gap between these two groups offers immense value to an organization or a working relationship.  Cucumber, a Behavior-Driven Development test tool, empowers non-technical stakeholders to define requirements, in plain English, that double as executable test guidelines for engineers.  By regularly reviewing Cucumber test results, a business stakeholder could easily assess the current status of a project.  Nevertheless, Cucumber facilitates one-way communication and offers no clear guidance on iteration or state.

The daily iterations I had at HBO were extremely effective in moving the project forward quickly. In a perfect world we could repeat this process as often as possible with senior management to win their approval as early as possible.  What if we schedule a daily meeting with the entire senior management team and show up with a buggy build?  It would quickly become apparent that our good intentions are far less valuable than executive time.  Instead, what if we gave each one of the senior managers a version of the technology that we could push updates to over time?  What if I could focus on building features and my boss could choose and deploy the ones that he thought were ready?  What if after realizing that there was a problem with a deployed feature he could hide that feature without involving me?  LaunchDarkly does all of this at the enterprise level.

To me LaunchDarkly is about much more than feature flagging or even quality assurance; the platform empowers companies to reconnect the technical and non-technical departments in order to shorten feedback cycles with customers and make better decisions.

This mission is a game changer.

That its founders and team are all extraordinary yet humble makes the opportunity twice as appealing.