
The How and Why of Flapjack

In October @rodjek asked on Twitter:

"I've got a working Nagios (and maybe Pagerduty) setup at the moment. Why and how should I go about integrating Flapjack?"

Flapjack will be immediately useful to you if:

  • You want to identify failures faster by rolling up your alerts across multiple monitoring systems.
  • You monitor infrastructures that have multiple teams responsible for keeping them up.
  • Your monitoring infrastructure is multitenant, and each customer has a bespoke alerting strategy.
  • You want to dip your toe in the water and try alternative check execution engines like Sensu, Icinga, or cron in parallel to Nagios.

The double-edged Nagios sword (or why monolithic monitoring systems hurt you in the long run)

One short-term advantage of Nagios is how much it can do for you out of the box. Check execution, notification, downtime, acknowledgements, and escalations can all be handled by Nagios if you invest a small amount of time understanding how to configure it.

This short-term advantage can turn into a long-term disadvantage: because Nagios does so much out of the box, you heavily invest in a single tool that does everything for you. When you hit cases that fit outside the scope of what Nagios can do for you easily, the cost of migrating away from Nagios can be quite high.

The biggest killer when migrating away from Nagios is that you either have to:

  • Find a replacement tool that matches Nagios's feature set very closely (or at least the subset of features you're using)
  • Find a collection of tools that integrate well with one another

Given the composable monitoring world we live in, the second option is preferable, but not always possible.

Enter Flapjack

flapjack logo

Flapjack aims to be a flexible notification system that handles:

  • Alert routing (determining who should receive alerts based on interest, time of day, scheduled maintenance, etc)
  • Alert summarisation (with per-user, per-media summary thresholds)
  • Your standard operational tasks (setting scheduled maintenance, acknowledgements, etc)

Flapjack sits downstream of your check execution engine (like Nagios, Sensu, Icinga, or cron), processing events to determine if a problem has been detected, who should know about the problem, and how they should be told.
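
To make this concrete, here's a rough sketch of an event heading into Flapjack, expressed as a small Ruby script. The field names and the Redis list name ('events') are illustrative; see the data structures documentation for the authoritative format:

require 'json'
require 'redis'

# An illustrative event, roughly what a check execution gateway hands to Flapjack
event = {
  'entity'  => 'web01.example.org',   # hypothetical host
  'check'   => 'HTTP Port 80',        # hypothetical check name
  'type'    => 'service',
  'state'   => 'critical',
  'summary' => 'CRITICAL - connection refused',
  'time'    => Time.now.to_i
}

# Flapjack's processor pops events off a Redis list and works out who to tell
redis = Redis.new
redis.lpush('events', JSON.generate(event))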

A team player (composable monitoring pipelines)

Flapjack aims to be composable - you should be able to easily integrate it with your existing monitoring check execution infrastructure.

There are three immediate benefits you get from Flapjack's composability:

  • You can experiment with different check execution engines without needing to reconfigure notification settings across all of them. This helps you be more responsive to customer demands and try out new tools without completely writing off your existing monitoring infrastructure.
  • You can scale your Nagios horizontally. Nagios can be really performant if you don't use notifications, acknowledgements, downtime, or parenting. Nagios executes static groups of checks efficiently, so scale the machines you run Nagios on horizontally and use Flapjack to aggregate events from all your Nagios instances and send alerts.
  • You can run multiple check execution engines in production. Nagios is well suited to some monitoring tasks. Sensu is well suited to others. Flapjack makes it easy for you to use both, and keep your notification settings configured in one place.

While you're getting familiar with how Flapjack and Nagios play together, you can even do a side-by-side comparison of how Flapjack and Nagios alert by configuring them both to alert at the same time.

Multitenant monitoring

If you work for a service provider, you almost certainly run shared infrastructure to monitor the status of the services you sell your customers.

Exposing the observed state to customers from your monitoring system can be a real challenge - most monitoring tools simply aren't built for this particular requirement.

Bulletproof spearheaded the reboot of Flapjack because multitenancy is a core requirement of Bulletproof's monitoring platform - we run a shared monitoring platform, and we have very strict requirements about segregating customers and their data from one another.

To achieve this, we keep the security model in Flapjack extraordinarily simple - if you can authenticate against Flapjack's HTTP APIs, you can perform any action.

Flapjack pushes authorization complexity to the consumer, because every organisation is going to have very particular security requirements, and Flapjack wants to make zero assumptions about what those requirements are going to be.

If you're serious about exposing this sort of data and functionality to your customers, you will need to do some grunt work to provide it through whatever customer portals you already run. We provide a very extensive Ruby API client to help you integrate with Flapjack, and Bulletproof has been using this API client in production for over a year in our customer portal.

One shortfall of Flapjack right now is we perhaps take multitenancy a little too seriously - the Flapjack user experience for single tenant users still needs a little work.

In particular, there are some inconsistencies and behaviours in the Flapjack APIs that make sense in a multitenant context, but are pretty surprising for single tenant use cases.

We're actively improving the single tenant user experience for the Flapjack 1.0 release.

One other killer feature of Flapjack that's worth mentioning: updating any setting via Flapjack's HTTP API doesn't require any sort of restart of Flapjack.

This is a significant improvement over tools like Nagios that require full restarts for simple notification changes.

Multiple teams

Flapjack is useful for organisations who segregate responsibility for different systems across different teams, much in the same way Flapjack is useful in a multitenant context.

For example:

  • Your organisation has two on-call rosters - one for customer alerts, and one for internal infrastructure alerts.
  • Your organisation is product focused, with dedicated teams owning the availability of those products end-to-end.

You can feed all your events into Flapjack so operationally you have a single aggregated source of truth of monitoring state, and use the same multitenancy features to create custom alerting rules for individual teams.

We're starting to experiment with this at Bulletproof as development teams start owning the availability of products end-to-end.

Summarisation

Probably the most powerful Flapjack feature is alert summarisation. Alerts can be summarised on a per-media, per-contact basis.

What on earth does that mean?

Contacts (people) are associated with checks. When a check alerts, a contact can be notified on multiple media (Email, SMS, Jabber, PagerDuty).

Each medium has a summarisation threshold that allows a contact to specify when alerts should be "rolled up" so the contact doesn't receive multiple alerts during incidents.
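
As a purely illustrative sketch (these names are hypothetical, not Flapjack's actual schema), the relationship looks something like this:

# Hypothetical illustration of per-contact, per-medium rollup thresholds
contact = {
  name:  'Ada',
  media: {
    sms:   { address: '+61400000000',    rollup_threshold: 3 },
    email: { address: 'ada@example.org', rollup_threshold: 10 }
  }
}

# With 5 unacknowledged failing checks, SMS alerts roll up into a single
# summary (5 >= 3), while email keeps alerting per check (5 < 10)
failing = 5
contact[:media].each do |medium, settings|
  mode = failing >= settings[:rollup_threshold] ? 'rollup summary' : 'individual alerts'
  puts "#{medium}: #{mode}"
end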

If you've used PagerDuty before, you've almost certainly experienced similar behaviour when you have multiple alerts assigned to you at a time.

Summarisation is particularly useful in multitenant environments where contacts only care about a subset of things being monitored, and don't want to be overwhelmed with alerts for each individual thing that has broken.

To generalise, large numbers of alerts indicate either a total system failure of the thing being monitored, or false positives in the monitoring system.

In either case, nobody wants to receive a deluge of alerts.

Mitigating the effects of monitoring false positives is especially important when you consider how failures in the monitoring pipeline cascade into surrounding stages of the pipeline.

Monitoring alert recipients generally don't care about the extent of a monitoring system failure (how many things are failing simultaneously, as evidenced by an alert for each thing), they care that the monitoring system can't be trusted right now (at least until the underlying problem is fixed).

What Flapjack is not

  • Check execution engine. Sensu, Nagios, and cron already do a fantastic job of this. You still need to configure a tool to run your monitoring checks - Flapjack just processes events generated elsewhere and does notification magic.
  • PagerDuty replacement. Flapjack and PagerDuty complement one another. PagerDuty has excellent on-call scheduling and escalation support, which is something that Flapjack doesn't try to go near. Flapjack can trigger alerts in PagerDuty.

At Bulletproof we use Flapjack to process events from Nagios, and work out if our on-call or customers should be notified about state changes. Our customers receive alerts directly from Flapjack, and our on-call receive alerts from PagerDuty, via Flapjack's PagerDuty gateway.

The Flapjack PagerDuty gateway has a neat feature: it polls the PagerDuty API for alerts it knows are unacknowledged, and will update Flapjack's state if it detects alerts have been acknowledged in PagerDuty.

This is super useful for eliminating the double handling of alerts, where an on-call engineer acknowledges an alert in PagerDuty, and then has to go and acknowledge the alert in Nagios.

In the Flapjack world, the on-call engineer acknowledges the alert in PagerDuty, Flapjack notices the acknowledgement in PagerDuty, and Flapjack updates its own state.

How do I get started?

Follow the quickstart guide to get Flapjack running locally using Vagrant.

The quickstart guide will take you through basic Flapjack configuration, pushing check results from Nagios into Flapjack as events, and configuring contacts and entities.

Once you've finished the tutorial, check out the Flapjack Puppet module and manifest that sets up the Vagrant box.

Examining the Puppet module will give you a good starting point for rolling out Flapjack into your monitoring environment.

Where to next?

We're gearing up to release Flapjack 1.0.

If you take a look at Flapjack in the next little while, please let us know any feedback you have on the Google group, or ping @auxesis or @jessereynolds on Twitter.

Jesse and I are also running a tutorial at linux.conf.au 2014 in Perth next Wednesday, and we'll make the slides available online.

Happy Flapjacking!

CLI testing with RSpec and Cucumber-less Aruba

At Bulletproof, we are increasingly finding home-brew systems tools are critical to delivering services to customers.

These tools are generally wrapping a collection of libraries and other general Open Source tools to solve specific business problems, like automating a service delivery pipeline.

Traditionally these systems tools tend to lack good tests (or simply any tests) for a number of reasons:

  • The tools are quick and dirty
  • The tools model business processes that are often in flux
  • The tools are written by systems administrators

Sysadmins don't necessarily have a strong background in software development. They are likely proficient in Bash, and have hacked a little Python or Ruby. If they've really gotten into the infrastructure as code thing they might have delved into the innards of Chef and Puppet and been exposed to those projects' respective testing frameworks.

In a lot of cases, testing is seen as "something I'll get to when I become a real developer".

The success of technical businesses can be tied to the quality of their tools.

Ask any software developer how they've felt inheriting an untested or undocumented code base, and you'll likely hear wails of horror. Working with such a code base is a painful exercise in frustration.

And this is what many sysadmins are doing on a daily basis when hacking on their janky scripts that have evolved to send and read email.

So let's build better systems tools:

  • We want to ensure our systems tools are of a consistent high quality
  • We want to ensure new functionality doesn't break old functionality
  • We want to verify we don't introduce regressions
  • We want to streamline peer review of changes

We can achieve much of this by skilling up sysadmins on how to write tests, helping them adopt a developer mindset when writing systems tools, and providing them with a good framework that helps frame questions that can be answered with tests.

We want our engineers to feel confident that their changes are going to work, and that they are consistently meeting our quality standards.

But what do you test?

We've committed to testing, but what exactly do we test?

Unit and integration tests are likely not relevant unless the CLI tool is large and unwieldy.

The user of the tool doesn't care whether the tool is tested. The user cares whether they can achieve a goal. Therefore, the tests should verify that the user can achieve those goals.

Acceptance tests are a good fit because we want to treat the CLI tool as a black box and test what the user sees.

Furthermore, we don't care how the tool is actually built.

We can write a generic set of high level tests that are decoupled from the language the tool is implemented in, and refactor the tool to a more appropriate language once we're more familiar with the problem domain.

How do you test command line applications?

Aruba is a great extension to Cucumber that helps you write high level acceptance tests for command line applications, regardless of the language those CLI apps are written in.

There are actually two parts to Aruba:

  1. Pre-defined Cucumber steps for running + verifying behaviour of command line applications locally
  2. An API to perform the actual testing, which is called by the Cucumber steps

The pre-defined steps let you write scenarios like this:

Scenario: create a file
  Given a file named "foo/bar/example.txt" with:
    """
    hello world
    """
  When I run `cat foo/bar/example.txt`
  Then the output should contain exactly "hello world"

The other player in the command line application testing game is serverspec. It can do very similar things to Aruba, and provides some fancy RSpec matchers and helper methods to make the tests look neat and elegant:

describe package('httpd') do
  it { should be_installed }
end

describe service('httpd') do
  it { should be_enabled   }
  it { should be_running   }
end

describe port(80) do
  it { should be_listening }
end

The cool thing about serverspec that sets it apart from Aruba is it can test things locally and remotely via SSH.

This is useful when testing automation that creates servers somewhere: run the tool, connect to the server created, verify conditions are met.

But what happens when we want to test the behaviour of tools that create things both locally and remotely? For local testing Aruba is awesome. For remote testing, serverspec is a great fit.

But Aruba is Cucumber, and serverspec is RSpec. Does this mean we have to write and maintain two separate test suites?

Given we're trying to encourage people who have traditionally never written tests before to write tests, we want to remove extraneous tooling to make testing as simple as possible.

A single test suite is a good start.

This test suite should be able to run both local + remote tests, letting us use the powerful built-in tests from Aruba, and the great remote tests from serverspec.

There are two obvious ways to slice this:

  1. Use serverspec like Aruba - build common steps around serverspec matchers
  2. Use the Aruba API without the Cucumber steps

We opted for the second approach - use the Aruba API from within RSpec, sans the Cucumber steps.

Opinions on Cucumber within Bulletproof R&D are split between love and loathing. There's a reasonable argument to be made that Cucumber adds a layer of abstraction to tests that increases maintenance of tests and slows down development. On the other hand, Cucumber is great for capturing high level user requirements in a format those users are able to understand.

Again, given we are trying to keep things as simple as possible, eliminating Cucumber from the testing setup to focus purely on RSpec seemed like a reasonable approach.

The path was pretty clear:

  1. Do a small amount of grunt work to allow the Aruba API to be used in RSpec
  2. Provide a small amount of coaching to developers on workflow
  3. Let the engineers run wild

How do you make Aruba work without Cucumber?

It turns out this was easier than expected.

First, add Aruba to your Gemfile:

# Gemfile
source 'https://rubygems.org'

group :development do
  gem 'rake'
  gem 'rspec'
  gem 'aruba'
end

Run the obligatory bundle to ensure all dependencies are installed locally:

bundle

Add a default Rake task to execute tests, to speed up the developer's workflow, and make tests easy to run from CI:

# Rakefile

require 'rspec/core/rake_task'

RSpec::Core::RakeTask.new(:spec)

task :default => [:spec]

Bootstrap the project with RSpec:

$ rspec --init

Require and include the Aruba API bits in the specs:

# spec/template_spec.rb

require 'aruba'
require 'aruba/api'

include Aruba::Api

This pulls in just the API helper methods in the Aruba::Api namespace. These are what we'll be using to run commands, test outputs, and inspect files. The include Aruba::Api makes those methods available in the current namespace.

Then we set up PATH so the tests know where executables are:

# spec/template_spec.rb
require 'pathname'

root = Pathname.new(__FILE__).parent.parent

# Allows us to run commands directly, without worrying about the CWD
ENV['PATH'] = "#{root.join('bin').to_s}#{File::PATH_SEPARATOR}#{ENV['PATH']}"

The PATH environment variable is used by Aruba to find commands we want to run. We could specify a full path in each test, but by setting PATH above we can just call the tool by its name, completely pathless, like we would be doing on a production system.

How do you go about writing tests?

The workflow for writing stepless Aruba tests that still use the Aruba API is pretty straightforward:

  1. Find the relevant step from Aruba's cucumber.rb
  2. Look at how the step is implemented (what methods are called, what arguments are passed to the method, how is output captured later on, etc)
  3. Take a quick look at how the method is implemented in Aruba::Api
  4. Write your tests in pure-RSpec

Here's an example test:

# spec/template_spec.rb

require 'yaml' # needed for YAML.parse below

# genud is the name of the tool we're testing
describe "genud" do
  describe "YAML templates" do
    it "should emit valid YAML to STDOUT" do
      fqdn     = 'bprnd-test01.bulletproof.net'
      # Path to a single example template (root is defined earlier in this spec)
      template = root.join('templates', 'test.yaml.erb')

      # Run the command with Aruba's run_simple helper
      run_simple "genud --fqdn #{fqdn} --template #{template}"

      # Test the YAML can be parsed
      lambda {
        userdata = YAML.parse(all_output)
        userdata.should_not be_nil
      }.should_not raise_error
      assert_exit_status(0)
    end
  end
end

Multiple inputs, and DRYing up the tests

Testing multiple inputs and outputs of the tool is important for verifying the behaviour of the tool in the wild.

Specifically, we want to know the same inputs create the same outputs if we make a change to the tool, and we want to know that new inputs we add are valid in multiple use cases.

We also don't want to write test cases for each instance of test data - generating the tests automatically would be ideal.

Our first approach at doing this was to glob a bunch of test data and test the behaviour of the tool for each instance of test data:

# spec/template_spec.rb

describe "genud" do
  describe "YAML templates" do
    it "should emit valid YAML to STDOUT" do

      # The inputs we want to test
      templates = Dir.glob(root + 'templates' + "*.yaml.erb") do |template|
        fqdn     = 'hello.example.org'

        # Run the command with Aruba's run_simple helper
        run_simple "genud --fqdn #{fqdn} --template #{template}"

        # Test the YAML can be parsed
        lambda {
          userdata = YAML.parse(all_output)
          userdata.should_not be_nil
        }.should_not raise_error
        assert_exit_status(0)
      end
    end
  end
end

This worked great provided all the tests were passing, but the tests themselves became a black box when one of the test data inputs caused a failure.

The engineer would need to add a bunch of puts statements all over the place to determine which input was causing the failure. And even worse, early test failures mask failures in later test data.

To combat this, we DRY'd up the tests by doing the Dir.glob once in the outer scope, rather than in each test:

# spec/template_spec.rb

describe "genud" do
  templates = Dir.glob(root + 'templates' + "*.yaml.erb") do |template|
    describe "YAML templates" do
      describe "#{File.basename(template)}" do
        it "should emit valid YAML to STDOUT" do

          fqdn     = 'hello.example.org'

          # Run the command with Aruba's run_simple helper
          run_simple "genud --fqdn #{fqdn} --template #{template}"

          # Test the YAML can be parsed
          lambda {
            userdata = YAML.parse(all_output)
            userdata.should_not be_nil
          }.should_not raise_error
          assert_exit_status(0)
        end
      end
    end
  end
end

This produces a nice clean test output that decouples the tests from one another while providing the engineer more insight into what test data triggered a failure:

$ be rake

genud
  YAML templates
    test.yaml.erb
      should emit valid YAML to STDOUT
  YAML templates
    test2.yaml.erb
      should emit valid YAML to STDOUT

Where to from here?

The above test rig is a good first pass at meeting our goals for building systems tools:

  • We want to ensure our systems tools are of a consistent high quality
  • We want to ensure new functionality doesn't break old functionality
  • We want to verify we don't introduce regressions
  • We want to streamline peer review of changes

… but we want to take it to the next level: integrating serverspec into the same test suite.

Having a quick feedback loop to verify local operation of the tool is essential to engineer productivity, especially when remote operations of these types of systems tools can take upwards of 10 minutes to complete.

But we have to verify the output of local operation actually creates the desired service at the other end. serverspec will help us do this.
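
As a rough sketch of where this could land, a serverspec file like the one below could sit alongside the Aruba-based specs and be run from the same Rakefile. The SSH backend setup follows serverspec's documented set :backend, :ssh style, but the exact setup lines vary between serverspec versions, and the host and user here are placeholders:

# spec/remote_spec.rb
require 'serverspec'
require 'net/ssh'

# Placeholder connection details - in practice these would come from the
# output of the local tool run (e.g. the FQDN of the server it created)
set :backend, :ssh
set :host, 'bprnd-test01.bulletproof.net'
set :ssh_options, :user => 'deploy'

describe service('httpd') do
  it { should be_running }
end

describe port(80) do
  it { should be_listening }
end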

Just post mortems

Earlier this week I gave a talk at Monitorama EU on psychological factors that should be considered when designing alerts.

Dave Zwieback pointed me to a great blog post of his on managing the human side of post mortems, which bookends nicely with my talk:

Imagine you had to write a postmortem containing statements like these:

We were unable to resolve the outage as quickly as we would have hoped because our decision making was impacted by extreme stress.

We spent two hours repeatedly applying the fix that worked during the previous outage, only to find out that it made no difference in this one.

We did not communicate openly about an escalating outage that was caused by our botched deployment because we thought we were about to lose our jobs.

While the above scenarios are entirely realistic, it’s hard to find many postmortem write-ups that even hint at these “human factors.” Their absence is, in part, due to the social stigma associated with publicly acknowledging their contribution to outages.

Dave's third example dovetails well with some of the examples in Dekker's Just Culture.

Dekker posits that people fear the consequences of reporting mistakes because:

  • They don't know what the consequences will be
  • The consequences of reporting can be really bad

The last point can be especially important when you consider how things like hindsight bias elevate the importance of proximity.

Simply put: when looking at the consequences of an accident, we tend to blame people who were closest to the thing that went wrong.

In the middle of an incident, unless you know your organisation has your back if you volunteer mistakes you have made or witnessed, you are more likely to withhold situationally helpful but professionally damaging information.

This limits the team's operational effectiveness and perpetuates a culture of secrecy, thwarting any organisational learning.

I think for Dave's first example to work effectively ("our decision making was impacted by extreme stress"), you would need to quantify what the causes and consequences of that stress are.

At Bulletproof we are very open to customers in our problem analyses about the technical details of what fails, because our customers are deeply technical themselves, appreciate the detail, and would cotton on quickly if we were pulling the wool over their eyes.

This works well for all parties because all parties have comparable levels of technical knowledge.

There is risk when talking about stress in general terms, because psychological knowledge is not evenly distributed.

Because every man and his dog has experienced stress, every man and his dog feel qualified to talk about and comment on other people's reactions to stress. Furthermore, it's a natural reaction to distance yourself from bad qualities you recognise in yourself by attacking and ridiculing those qualities in others.

I'd wager that outsiders would be more reserved in passing judgement when unfamiliar concepts or terminology is used (e.g. talking about confirmation bias, the Semmelweis reflex, etc).

You could reasonably argue that by using those concepts or terminology you are deliberately using jargon to obfuscate information to those outsiders and Cover Your Arse, however I would counter that it's a good opportunity to open a dialog with those outsiders on building just cultures, eschewing the use of labels like human error, and how cognitive biases are amplified in stressful situations.

Counters not DAGs

Monitoring dependency graphs are fine for small environments, but they are not a good fit for nested complex environments, like those that make up modern web infrastructures.

DAGs are a very alluring data structure to represent monitoring relationships, but they fall down once you start using them to represent relationships at scale:

  • There is an assumption of a direct causal link between edges of the graph. It's very tempting to believe that you can trace failure from one edge of the graph to another. Failures in one part of a complex system all too often have weird effects and induce failure in other components of the same system that are quite removed from one another.

  • Complex systems are almost impossible to model. With time and an endless stream of money you can sufficiently model the failure modes within complex systems in isolation, but fully understanding and predicting how complex systems interact and relate with one another is almost impossible. The only way to model this effectively is to have a closed system with very few external dependencies, which is the opposite of the situation every web operations team is in.

  • The cost of maintaining the graph is non-trivial. You could employ a team of extremely skilled engineers to understand and model the relationships between each component in your infrastructure, but their work would never be done. On top of that, given the sustained growth most organisations experience, whatever you model will likely change within 12-18 months. Fundamentally it would not provide a good return on investment.

check_check

This isn't a new problem.

Jordan Sissel wrote a great post as part of Sysadvent almost three years ago about check_check.

His approach is simple and elegant:

  • Configure checks in Nagios, but configure a contact that drops the alerts
  • Read Nagios's state out of a file + parse it
  • Aggregate the checks by regex, and alert if a percentage is critical

It's a godsend for people who manage large Nagios instances, but it starts falling down if you've got multiple independent Nagios instances (shards) that are checking the same thing.

You still end up with a situation where each of your shards alert if the shared entity they're monitoring fails.

Flapjack

This is the concrete use case behind why we're rebooting Flapjack - we want to stream the event data from all Nagios shards to Flapjack, and do smart things around notification.

The approach we're looking at in Flapjack is pretty similar to check_check - set thresholds on the number of failure events we see for particular entities - but we want to take it one step further.

Entities in Flapjack can be tagged, so we automatically create "failure counters" for each of those tags.

When checks on those entities fail, we simply increment each of those failure counters. Then we can set thresholds on each of those counters (based on absolute value like > 30 entities, or percentage like > 70% of entities), and perform intelligent actions like:

  • Send a single notification to on-call with a summary of the failing tag counters
  • Rate limit alerts and provide summary alerts to customers
  • Wake up the relevant owners of the infrastructure that is failing
  • Trigger a "workaround engine" that attempts to resolve the problem in an automated way

The result of this is that on-call aren't overloaded with alerts, we involve the people who can fix the problems sooner, and it all works across multiple event sources.
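
To make the counter idea concrete, here's a minimal sketch of how per-tag failure tracking and thresholds could be built on top of Redis (which Flapjack happens to be built on). This illustrates the technique rather than Flapjack's actual implementation - the key names and thresholds are made up:

require 'redis'

redis = Redis.new

# Made-up thresholds per tag: absolute count, or percentage of tagged entities
THRESHOLDS = {
  'db'  => { count: 5 },
  'web' => { percent: 70 }
}

# On each failure event, record the failing entity against each of its tags.
# Using a set per tag means the same entity isn't counted twice.
def record_failure(redis, entity, tags)
  tags.each { |tag| redis.sadd("failing:#{tag}", entity) }
end

# Has this tag's failure counter crossed its threshold?
def over_threshold?(redis, tag, threshold)
  failing = redis.scard("failing:#{tag}")
  total   = redis.scard("entities:#{tag}") # all entities carrying this tag
  return failing >= threshold[:count] if threshold[:count]
  total > 0 && (failing.to_f / total * 100) >= threshold[:percent]
end

record_failure(redis, 'db01.example.org', ['db', 'customer-acme'])

THRESHOLDS.each do |tag, threshold|
  if over_threshold?(redis, tag, threshold)
    puts "#{tag}: over threshold - send a single rolled-up alert instead of many"
  end
end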

One note on complexity: I am not convinced that automated systems that try to derive meaning from relationships in a graph (or even tag counters) and present the operator with a conclusion are going to provide anything more than a best-guess abstraction of the problem. In the real world, that best guess is most likely wrong.

We need to provide better rollup capabilities that give the operator a summarised view of the current facts, and allow the operator to do their own investigation untainted by the assumptions of the programmer who wrote the inaccurate heuristic.

Flapjack's (and check_check's) approach also minimises the maintenance burden, as tagging entities becomes the only thing required to build smarter aggregation + analysis tools. This information can easily be pulled out of configuration management.

More metadata == more granularity == faster resolution times.

How we do Kanban

At my day job, I run a distributed team of infrastructure coders spread across Australia + one in Vietnam. Our team is called the Software team, but we're more analogous to a product focused Research & Development team.

Other teams at Bulletproof are a mix of office and remote workers, but our team is a little unique in that we're fully distributed. We do daily standups using Google Hangouts, and try to do face to face meetups every few months at Bulletproof's offices in Sydney.

Intra-team communication is something we're good at, but I've been putting a lot of effort lately into improving how our team communicates with others in the business.

This is a post I wrote on our internal company blog explaining how we schedule work, and why we work this way.


our physical wallboard in the office

What on earth is this?

This is a Kanban board.

A Kanban board is a tool for implementing Kanban. Kanban is a scheduling system developed at Toyota in the 1970s as part of the broader Toyota Production System.

Applied to software development, the top three things Kanban aims to achieve are:

  • Visualise the flow of work
  • Limit the Work-In-Progress (WIP)
  • Manage and optimise the flow of work

How does Kanban work for the Software team?

In practical terms, work tends to be tracked in:

  • RT tickets, as created using the standard request process, or escalated from other teams
  • GitHub issues, for product improvements, and work discovered while doing other work
  • Ad-hoc requests, through informal communication channels (IM, email)

Because Software deals with requests from many audiences, we use a Kanban board to visualise work from request to completion across all these systems.

Managing flow

As of writing, we have 5 stages a task progresses through:

the board

  • To Do - tasks triaged, and scheduled to be worked on next
  • Doing - tasks being worked on right now
  • Deployable - completed tasks that need to be released to production in the near future (generally during change windows)
  • Done - completed tasks

That's only 4 - there is another stage called the Icebox. This is for tasks we're aware of, but that haven't been triaged and aren't scheduled to be worked on yet.

Done tasks are cleaned out once a week on Mondays, after the morning standup.

Triage is the process of taking a request and:

  • Determining the business priority
  • Breaking it up into smaller tasks
  • (Tentatively) allocating it to someone
  • Classifying the type of work (Internal, Customer, BAU)
  • Estimating a task completion time

We use the board exclusively to visualise the tasks - we don't communicate with the stakeholder through the board.

Each task has a pointer to the system the request originated from:

detailed view

…and a little bit of metadata about the overall progress.

Communication with the stakeholder is done through the RT ticket / GitHub issue / email.

Limiting WIP

The WIP Limit is an artificial limit on the number of tasks the whole team can work on simultaneously. We currently calculate the WIP as:

(Number of people in Software) x 2

The goal here is to ensure no one person is ever working on more than 2 tasks at once.

I can hear you thinking "That's crazy and will never work for me! I'm always dealing with multiple requests simultaneously".

The key to making the WIP Limit work is that tasks are never pushed through the system - they are pulled by the people doing the work. Once you finish your current task, you pull across the next highest priority task from the To Do column.

The WIP Limit is particularly useful when coupled with visualising flow because:

  • If people need to work on more than 2 things at once, it's indicative of a bigger scheduling contention problem that needs to be solved. We are likely context switching rapidly, which drastically reduces our delivery throughput.
  • If the team is constantly working at the WIP limit, we need more people. We always aim to have at least 20% slack in the system to deal with ad-hoc tasks that bubble up throughout the day. If we're operating at 100% capacity, we have no room to breathe, and this severely reduces our operational effectiveness.

Visualising flow

Work makes its way from left to right across the board.

This is valuable for communicating to people where their requests sit in the overall queue of work, but also in identifying bottlenecks where work isn't getting completed.

The Kanban tool we use colour codes tasks based on how long they have been sitting in the same column:

colour coding of tasks

This is vital for identifying work that people are blocked on completing, and tends to be indicative of one of two things:

  • Work that is too large and needs to be broken down into smaller tasks
  • Work that is more complex or challenging than originally anticipated

The latter is an interesting case, because it may require pulling people off other work to help the person assigned that task push through and complete that work.

Normally as a manager this isn't easy to discover unless you are regularly polling your people about their progress, but that behaviour is incredibly annoying to be on the receiving end of.

The board is updated in real time as people in the team do work, which means as a manager I can get out of their way and let them Get Shit Done while having a passive visual indicator of any blockers in the system.

Escalating Complexity

Back in 2009 when I was backpacking around Europe I remember waking up on the morning of June 1 and reading about how an Air France flight had disappeared somewhere over the Atlantic.

The lack of information on what happened to the flight intrigued me, and given the traveling I was doing, I was left wondering "what if I was on that plane?"

Keeping an ear out for updates, in December 2011 I stumbled upon the Popular Mechanics article describing the final moments of the flight. I was left fascinated by how a technical system so advanced could fail so horribly, apparently because of the faulty meatware operating it.

Around the same time I began reading the works of Sidney Dekker. I was left in a state of cognitive dissonance, trying to reconcile the mainstream explanation of what happened in the final moments of AF447 (the pilots were poorly trained, inexperienced, and simply incompetent) with the New View that the operators were merely locally rational actors within a complex system, and that "root cause is simply the place you stop looking further" - with that cause far too commonly attributed to humans.

I decided to do my own research, which resulted in me producing a talk that has received the strongest reaction of any talk I've ever given.

On June 1, 2009 Air France 447 crashed into the Atlantic ocean killing all 228 passengers and crew. The 15 minutes leading up to the impact were a terrifying demonstration of how thick the fog of war is in complex systems.

Mainstream reports of the incident put the blame on the pilots - a common motif in incident reports that conveniently ignore a simple fact: people were just actors within a complex system, doing their best based on the information at hand.

While the systems you build and operate likely don't control the fate of people's lives, they share many of the same complexity characteristics. Dev and Ops can learn an abundance from how the feedback loops between these aviation systems are designed and how these systems are operated.

In this talk Lindsay will cover what happened on the flight, why the mainstream explanation doesn't add up, how design assumptions can impact people's ability to respond to rapidly developing situations, and how to improve your operational effectiveness when dealing with rapidly developing failure scenarios.

The subject matter is heavy, and I while it's something I'm passionate about, it was an emotionally taxing talk to prepare, and a talk that angers me when presenting.

Time to let it sit and rest.

Data failures, compartmentalisation challenges, monitoring pipelines

To recap, pipelines are a useful way of modelling monitoring systems.

Each compartment of the pipeline manipulates monitoring data before making it available to the next.

At a high level, this is how data flows between the compartments:

basic pipeline

This design gives us a nice separation of concern that enables scalability, fault tolerance, and clear interfaces.

The problem

What happens when there is no data available for the checks to query?

In this very concrete case, we can divide the problem into two distinct classes of failure:

  • Latency when accessing the metric storage layer, manifested as checks timing out.
  • Latency or failure when pushing metrics into the storage layer, manifested as checks being unable to retrieve fresh data.

There are two outcomes from this:

  • We need to provide clearer feedback to the people responding to alerts, to give them more insight into what's happening within the pipeline
  • We need to make the technical system more robust when dealing with either of the above cases

Alerting severity levels aren't granular or accurate in a modern monitoring context

There are entire classes of monitoring problems (like the one we're dealing with here) that map poorly into the existing levels. This is an artefact of an industry-wide cargo-culting of the alerting levels from Nagios, and these levels may not make sense in a modern monitoring pipeline with distinctly compartmentalised stages.

For example, the Nagios plugin development guidelines state that UNKNOWN from a check can mean:

  • Invalid command line arguments were supplied to the plugin
  • Low-level failures internal to the plugin (such as unable to fork, or open a tcp socket) that prevent it from performing the specified operation.

"Low-level failures" is extremely broad, and it's important operationally to provide precise feedback to the people maintaining the monitoring system.

Adding an additional level (or levels) with contextual debugging information would help close this feedback loop.

In defence of the current practice, there are operational benefits to mapping problems into just 4 levels. For example, there are only ever 4 levels that an engineer needs to be aware of, as opposed to a system where there are 5 or 10 different levels that capture the nuance of a state, but engineers don't understand what that nuance actually is.

Compartmentalisation as the saviour and bane

The core idea driving the pipeline approach is compartmentalisation. We want to split out the different functions of monitoring into separate reliable compartments that have clearly defined interfaces.

The motivation for this approach comes from the performance limitations of traditional monitoring systems where all the functions essentially live on a single box that can only be scaled vertically. Eventually you will reach the vertical limit of hardware capacity.

This is bad.

a monolithic monitoring system

Thus the pipeline approach:

Each stage of the pipeline is handled by a different compartment of monitoring infrastructure that analyses and manipulates the data before deciding whether to pass it onto the next compartment.

This sounds great, except that now we have to deal with the relationships between each compartment both in the normal mode of operation (fetching metrics, querying metrics, sending notifications, etc), but during failure scenarios (one or more compartments being down, incorrect or delayed information passed between compartments, etc).

The pipeline attempts to take this into account:

Ideally, failures and scalability bottlenecks are compartmentalised.

Where there are cascading failures that can't be contained, safeguards can be implemented in the surrounding compartments to dampen the effects.

For example, if the data storage infrastructure stops returning data, this causes the check infrastructure to return false negatives. Or false positives. Or false UNKNOWNs. Bad times.

We can contain the effects in the event processing infrastructure by detecting a mass failure and only sending out a small number of targeted notifications, rather than sending out alerts for each individual failing check.

While the design is in theory meant to allow this containment, the practicalities of doing this are not straightforward.

Some simple questions that need to be asked of each compartment:

  • How does the compartment deal with a response it hasn't seen before?
  • What is the adaptive capacity of each compartment? How robust is each compartment?
  • Does a failure in one compartment cascade into another? How far?

The initial answers won't be pretty, and the solutions won't be simple (ideal as that would be) or easily discovered.

Additionally, the robustness of each compartment in the pipeline will be different, so making each compartment fault tolerant is a hard slog with unique challenges.

How are people solving this problem?

Netflix recently open sourced a project called Hystrix:

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

Specifically, Netflix talk about how they make this happen:

How does Hystrix accomplish this?

  • Wrap all calls to external systems (dependencies) in a HystrixCommand object (command pattern) which typically executes within a separate thread.
  • Time-out calls that take longer than defined thresholds. A default exists but for most dependencies is custom-set via properties to be just slightly higher than the measured 99.5th percentile performance for each dependency.
  • Maintain a small thread-pool (or semaphore) for each dependency and if it becomes full commands will be immediately rejected instead of queued up.
  • Measure success, failures (exceptions thrown by client), timeouts, and thread rejections.
  • Trip a circuit-breaker automatically or manually to stop all requests to that service for a period of time if error percentage passes a threshold.
  • Perform fallback logic when a request fails, is rejected, timed-out or short-circuited.
  • Monitor metrics and configuration change in near real-time.

Potential Solutions

We can apply many of the strategies from Hystrix to the monitoring pipeline:

  • Wrap all monitoring checks with a timeout that returns an UNKNOWN, assuming you stick with the existing severity levels (see the sketch after this list)
  • Add some sort of signalling mechanism to the checks so they fail faster, e.g.
    • Stick a load balancer like HAProxy or Nginx in front of the data storage compartment
    • Cache the state of the data storage compartment that all monitoring checks check before querying the compartment
  • Detect mass failures, and notify on-call and the monitoring system owners directly to shorten the MTTR. This is something Flapjack aims to do as part of the reboot.
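
As a rough illustration of the first point, here's a minimal wrapper that runs an arbitrary check command under a timeout and maps a timeout to UNKNOWN (exit code 3). The timeout value is an arbitrary assumption, and a production version would also need to reap the timed-out child process:

#!/usr/bin/env ruby
# check_with_timeout.rb - usage: check_with_timeout.rb <check command> [args...]
require 'open3'
require 'timeout'

TIMEOUT_SECONDS = 10 # assumption: tune this per check

begin
  output = nil
  status = nil
  Timeout.timeout(TIMEOUT_SECONDS) do
    # Run the wrapped check, capturing stdout + stderr and its exit status
    output, status = Open3.capture2e(*ARGV)
  end
  puts output
  exit status.exitstatus
rescue Timeout::Error
  puts "UNKNOWN: check did not complete within #{TIMEOUT_SECONDS} seconds"
  exit 3 # UNKNOWN in the Nagios plugin convention
end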

I don't profess to have all (or even any) of the answers. This is new ground, and I'm very curious to hear how other people are solving this problem.

Pipelines: a modern approach to modelling monitoring

Over the last few years I have been experimenting with different approaches for scaling systems that monitor large numbers of heterogenous hosts, specifically in hosting environments.

This post outlines a pipeline approach for modelling and manipulating monitoring data.


Monitoring can be represented as a pipeline which data flows through, and is eventually turned into a notification for a human.

This approach has several benefits:

  • Failures are compartmentalised
  • Compartments can be scaled independently from one another
  • Clear interfaces are required between compartments, enabling composability

Each stage of the pipeline is handled by a different compartment of monitoring infrastructure that analyses and manipulates the data before deciding whether to pass it onto the next compartment.

These components are the bare minimum required for a monitoring pipeline:

  • Data collection infrastructure, is generally a collection of agents on target systems, or standalone tools that extract metrics from opaque systems (preferably via an API).

  • Data storage infrastructure, provides a place to push collected metrics. These metrics are almost always numerical. These metrics are then queried and fetched for graphing, monitoring checks, and reporting - thus enabling "We alert on what we draw".

  • Check execution infrastructure, runs the monitoring checks configured for each host, which query the data storage infrastructure (see the sketch after this list). Checks that query textual data often poll the target system directly, which can have effects on latency.

  • Notification infrastructure, processes check results from the check execution infrastructure to send notifications to engineers or stakeholders. Ideally the notification infrastructure can also feed back actions from engineers to acknowledge, escalate, or resolve alerts.
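
To illustrate the check execution compartment above, here's a hedged sketch of a check that alerts on what we draw: it queries the data storage compartment rather than the target host. Graphite's render API is assumed purely as an example backend, and the metric name and thresholds are made up:

#!/usr/bin/env ruby
# A check that queries the data storage compartment (Graphite assumed here)
require 'json'
require 'net/http'
require 'uri'

GRAPHITE   = 'http://graphite.example.org'        # assumption: your Graphite host
TARGET     = 'collectd.web01.load.load.shortterm' # assumption: metric name
WARN, CRIT = 4.0, 8.0                             # made-up thresholds

uri    = URI("#{GRAPHITE}/render?target=#{TARGET}&from=-5min&format=json")
series = JSON.parse(Net::HTTP.get(uri)).first
# Datapoints come back as [value, timestamp] pairs; take the latest non-nil value
value  = series && series['datapoints'].map(&:first).compact.last

if value.nil?
  puts "UNKNOWN: no recent datapoints for #{TARGET}"
  exit 3
elsif value >= CRIT
  puts "CRITICAL: #{TARGET} is #{value}"
  exit 2
elsif value >= WARN
  puts "WARNING: #{TARGET} is #{value}"
  exit 1
else
  puts "OK: #{TARGET} is #{value}"
  exit 0
end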

At a high level, this is how data flows between the compartments:

basic pipeline

When using Nagios, the check + notification infrastructure are generally collapsed into one compartment (with the exception of NRPE).

Many monitoring pipelines start out with the data collection + storage infrastructure decoupled from the check infrastructure. Monitoring checks query the same targets that are being graphed, but:

  • Because the check intervals don't necessarily match up to the data collection intervals, it can be hard to correlate monitoring alerts to features on the graphs.
  • The more systems poll the target system, the more the observer effect is amplified.

There are two other compartments that are becoming increasingly common:

  • Event processing infrastructure. Sitting between the check execution and notification infrastructure, this compartment processes events generated from the check infrastructure, identifies trends and emergent behaviours, and forwards the alerts to the notification infrastructure. It may also make decisions on who to send alerts to.

  • Management infrastructure, provides command + control facilities across all the compartments, as well as being the natural place for graphing and dashboards of metrics in the data storage infrastructure to live. If the target audience is non-technical or strongly segmented (e.g. many customers on a shared monitoring infrastructure), it can also provide an abstracted pretty public face to all the compartments.

This is how event processing + management fit into the pipeline:

event processing + management added to the pipeline

The management infrastructure can likely be broken up into different compartments as well, but for now it serves as a placeholder.

Let's explore the benefits of this pipeline design.

Failures are compartmentalised

Ideally, failures and scalability bottlenecks are compartmentalised.

Where there are cascading failures that can't be contained, safeguards can be implemented in the surrounding compartments to dampen the effects [1].

For example, if the data storage infrastructure stops returning data, this causes the check infrastructure to return false negatives. Or false positives. Or false UNKNOWNs. Bad times.

We can contain the effects in the event processing infrastructure by detecting a mass failure and only sending out a small number of targeted notifications, rather than sending out alerts for each individual failing check.

This problem is tricky, interesting, and fodder for further blog posts. :-)

Compartments can be scaled independently

Monolithic monitoring architectures are a pain to scale. Viewing a monolithic architecture through the prism of the pipeline model, all of the compartments are squeezed onto a single machine. Quite often there isn't a data collection or storage layer either.

a monolithic monitoring system

Monolithic architectures often use the same moving parts under the hood, but they tend to be very closely entwined. Each tool has very distinct performance characteristics, but because they all run on a single machine and are poorly separated, the only way to improve performance is by throwing expensive hardware at the problem.

If you've ever worked with a monolithic monitoring system, you will likely be experiencing painful flashbacks right about now.

To generalise the workload of the different compartments:

  • Check execution, notifications, and event processing tends to be very CPU intensive + network latency sensitive
  • Data storage is IO intensive + disk space expensive

Making sure each compartment is humming along nicely is super important when providing a consistent and reliable monitoring service.

Splitting the compartments onto separate infrastructure enables us to:

  • Optimise the performance of each component individually, either through using hardware that's more appropriate for the workloads (SSDs, multi-CPU physical machines), or tuning the software stack at the kernel and user space level.
  • Expose data through well defined APIs, which leads into the next point:

Clear interfaces are required between compartments

I like to think of this as "the Duplo approach" - compartments with well defined interfaces you can plug together to compose your pipeline.

a Duplo brick

Clear interfaces abstract the tools used in each compartment of the pipeline, which is essential for chaining tools in a composable way.

Clear interfaces help us:

  • Replace underperforming tools that have reached their scalability limits
  • Test new tools in parallel with the old tools by verifying their inputs + outputs
  • Better identify input that could be considered erroneous, and react appropriately

Concepts like Design by Contract, Service Oriented Architecture, or Defensive Programming then have direct applicability to the design of individual components and the pipeline overall.


It's not all rainbows and unicorns. There are some downsides to the pipeline approach.

Greater Cost

There will almost certainly be a bigger initial investment in building a monitoring system with the pipeline approach.

You'll be using more components, thus more servers, thus the cost is greater. While the cost of scaling out may be greater up-front, you limit the need to scale up later on.

You can counteract some of these effects by starting small and dividing up compartments over time as part of a piecemeal strategy, but this takes time + persistence.

I can tell you from personal project management experience rolling out this pipeline design that it's hard work keeping a model of the complexity both in your head and well documented.

More Complexity

The pipeline makes it easier to eliminate scalability bottlenecks at the expense of more moving parts. The more moving parts, the greater the likelihood of failure.

Operationally it will be more difficult to troubleshoot when failures occur, and this becomes worse as you increase the safeguards and fault tolerance within your compartments.

This is the cost of scalability, and there is no easy fix.

Conclusion

The pipeline model maps nicely to existing monitoring infrastructures, but also to larger distributed monitoring systems.

It provides scalability, fault tolerance, and composability at the cost of a larger upfront investment.


[1]: This is a vast simplification of a very complex topic. Thinking of failure as an energy to be contained by barriers was a popular perspective in accident prevention circles from the 1960s to the 1980s, but the concept doesn't necessarily apply to complex systems.

Rebooting Flapjack

This is the first time I've actually blogged about Flapjack.

The past

In 2008 I started talking with Matt Moor about building a "next generation monitoring system" that would be simple to setup & operate, and provide obvious paths to scale.

In 2009 I started hacking on Flapjack while backpacking, and by mid 2009 I had a working prototype running basic monitoring checks.

The fundamental idea was simple: decouple the check execution from the alerting and notification, and use message queues to distribute the check execution across lots of machines.

It seems simple and obvious now, but at the time nobody was really talking about doing this, so Flapjack gathered a reasonable amount of attention relatively quickly after I started talking about it at conferences.

2010 rolled around and I was unable to maintain a good development pace and hold the attention gained by talking at conferences, due to some fairly significant life changes. Pretty much all of my open source projects suffered, and in the space of 12 months Flapjack was wound up.

There were plenty of other interesting projects like Sensu that were achieving similar goals excellently, so while winding up Flapjack was a source of bitter personal disappointment, it was offset by seeing other people doing awesome work in the monitoring space.

The present

Mid last year, an interesting problem arose at work:

In a modern "monitoring system", how do you:

  • Notify a dynamic group of people on a variety of media based on monitoring events? Bulletproof has thousands of people that may need to be notified by our monitoring system, depending on what monitoring checks are failing. While the thresholds on each monitoring check are universal, each of these people can have different notification settings based on time of day or week, the type of service affected, or the severity of the failure.

  • Dampen or roll up common events so on-call isn't bombarded during outages? When one system deep in the stack fails, it has significant flow-on effects to everything else that depends on it. This generally manifests as thousands (or tens of thousands, in extremely bad cases) of alerts being sent to on-call in a very short period of time (<60 seconds). Obviously this is bad, and we simply want to detect cases like these, and wake up people involved in the incident response process.

  • Do the above in an API driven way? We need to solve both problems in a way that works in a multitenant environment with strong segregation between customers, and integrates with an existing monitoring & customer self-service stack.

Thus, Flapjack was rebooted with a significantly altered focus:

  • Event processing
  • Correlation & rollup
  • API driven configuration

We've been actively working on the reboot since July last year, and have been sending alerts from Flapjack to customers since January.

We're developing Flapjack as a fully Open Source, composable platform that you can adapt and build on to suit your organisation's needs, by hooking it into your existing check execution infrastructure (we ship a Nagios event processor) and your self service and provisioning automation tools.
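
To make that concrete, here's a rough sketch of the sort of event an upstream check execution engine hands to Flapjack. It's based on my reading of Flapjack's documented JSON-over-Redis event format; the queue name, field names, and example values are illustrative and may not match your Flapjack version exactly.

# A minimal sketch of pushing a check result onto Flapjack's event queue.
# The queue name ("events") and field names are assumptions based on the
# documented event format - check the docs for your version.
require 'redis'
require 'json'

event = {
  'entity'  => 'app-01.example.com',   # hypothetical host
  'check'   => 'HTTP',
  'type'    => 'service',
  'state'   => 'critical',
  'summary' => 'HTTP CRITICAL: 500 Internal Server Error',
  'time'    => Time.now.to_i
}

Redis.new.lpush('events', event.to_json)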

Because we care deeply about people integrating Flapjack into their existing environments, we have invested a lot of time and energy into writing quality documentation that covers working with the API, debugging production issues, and the data structures used behind the scenes. That's all on top of the usage documentation, of course.

Flapjack is built on Redis, and funnily enough R.I. Pienaar did a post earlier this year that investigates using Redis to solve the same problem in an extremely similar way. R.I.'s post provides a good primer on some of the thinking behind Flapjack, so I recommend giving it a read.

The future

Fundamentally, Flapjack is trying to plug a notification hole in the monitoring ecosystem that I don't believe is being adequately addressed by other tools, but the key to doing this is to play nicely with other tools and build a composable pipeline.

The above is merely a glimpse of Flapjack that leaves quite a few questions unanswered (e.g. "Why aren't you using $x feature of $y check execution engine to do roll-up?", "Do Flapjack and Riemann play nicely with one another?"), so stay tuned for more:

more waffles

Upcoming speaking engagements and travel

My next 2 months is going to be jam packed with conferences and travel!

  • Devopsdays NZ, March 8 2013. I will be giving a talk that analyses AA261 through a DevOps lens, looking at the collaborative maintenance and operation of the MD-83 involved in the crash.
  • Monitorama, March 28-29 2013. I'm looking forward to slowing down and listening at Monitorama, which has a tremendous line up of speakers. I'll be keen to hear what others think of the work we've been doing on Flapjack the last 6 months.
  • Mountain West Ruby Conf 2013, April 3-5 2013. MWRC has added an extra day of DevOps content to the conference this year, and I'll be joining an esteemed speaker lineup to talk about what both dev and ops can learn from AF447 when responding to rapidly evolving failure scenarios.
  • I'll be staying in the Netherlands for a little under a week between conferences, visiting family and friends. Hopefully I can visit a meetup or two.
  • Open Source Data Center Conference 2013, April 17-18 2013. This will be my first time in Nürnberg, and I'm really looking forward to saying I have attended both OSDCs. I'll be talking about Ript, a DSL for describing firewall rules, and a tool for incrementally applying them.
  • Puppet Camp Nürnberg 2013, April 19 2013. Straight after OSDC I'll be talking about how we are using Puppet at Bulletproof Networks in multi-tenant, isolated environments.

How I make interesting technical presentations

Whenever I talk at conferences, I am routinely asked how I go about preparing and making my presentations.

There are no hard and fast rules, but these are some things I have learnt:

Start analog

The most limiting thing you can do when you start putting together a presentation is to reach for slideware. I use a paper notebook to brainstorm my ideas with multicoloured pens, then scan it so I can refer back to it quickly when putting the slides together.

mindmapping a talk

Don't create slides linearly

I focus on an idea in the brainstorm that surprised me the most when I wrote it down, and use it as a jump-off point for creating slides. I've found exploring that initial idea helps set the tone for the rest of the presentation.

Weave a story

Kathy Sierra used to bang on about this heaps. We're wired as a species to find stories interesting, so use this to your advantage.

But don't concoct a story just for the talk - try to relate the content back to your own experiences. Nobody wants to hear about Alice and Bob; they want to hear how you and your co-workers rose above adversity, and about the setbacks you had along the way.

Chris Fegan's NBNCo talk at Puppet Camp Sydney 2013 was a good example of how to weave technical detail into an organisational growth story.

Use slides appropriately

They are a visual aid, and a visual aid alone. People's attention should be on you - you are the speaker after all! Use lots of supporting visuals, and minimal text. No bullet point lists! Put each point on a separate slide.

I use Flickr's Creative Commons search to find relevant images, and favourite them when I want to use them again across multiple presentations. Sometimes they even provide a visual trigger that moves the presentation in a direction I wasn't expecting.

If I post the slides after the presentation, it's always nice to comment on the picture on Flickr to let the photographer know I appreciate their contributions to Open Culture.

Don't rely on the slides

Ideally if your laptop died 5 minutes before the talk, you should know your material well enough that you could deliver it by voice alone.

Be thorough

Shortcuts are obvious to your audience. I spend at least 20 hours preparing each presentation.

A lot of that time is research (I spent 10 hours alone doing research on AF447 before I created a single slide, and that research was probably too little given the depth of subject matter), and a lot of it is finding images on Flickr. :-)

Maybe 20 hours sounds like a lot, but every minute you put into preparation pays off.

Tailor your content

It's ok to give the same talk at multiple conferences, but make sure you alter the content so it's relevant to your audience.

I gave my cucumber-nagios talk tens of times over an 18 month period, but the talk was different every time.

If I was at a developer conference, I would talk about how to reuse your existing tests as monitoring checks. If I was at a sysadmin conference, I would talk about testing systems infrastructure. If I was at a DevOps conference, I would talk about encoding & communicating business processes in your monitoring.

Practice, practice, practice

Know the timing of your talk, and work out the average time you should spend on each slide. I generally rehearse each talk at least 3-5 times before I give it the first time, and will revise and rehearse at least 1-2 times before subsequent presentations.

Don't wait until you've finished the presentation before you start practicing. I'll often practice the 20% I've put together and discover it feels mechanical, or the ideas don't flow well into one another. Refactor.

Test your equipment

Plug your laptop into the projector at least once, preferably twice, before your talk. I carry multiple adapters for every conceivable display type out there, some display cables, a power board, and a clicker. Test everything, then test it again.

Mirror your display

It's tempting to use your laptop screen for presenter notes and stopwatch widgets. Don't. Know your material. Use a physical stopwatch. Split displays will break unexpectedly, and you'll lose your flow. Besides, mirroring is always easier than craning your neck to see what your audience is seeing.

Watch yourself

If you're lucky enough to talk at a conference where your talk is recorded, go back and watch it. This is vitally important for working out which bits flowed well and which bits were stilted.

The most important thing is to speak at as many events as possible. You're only going to get better at presenting if you present. Start working towards that 10,000 hours of mastery!

DevOps Down Under 2012 - what happened?

Almost 2 days ago Patrick kicked off a discussion about organising another Australian DevOps conference in 2013 amongst a small group of passionate DevOps who are actively involved in the Australian community.

While the discussion was trundling on without me, I felt I owed everyone involved an explanation of what happened with this year's unrealised conference, and why the conference fell flat.

Let's start at the beginning.

Having come back from a year of backpacking around Europe and attending the first DevOpsDays conference, I took it upon myself to try and replicate the success by organising the first DevOps Down Under conference in 2010.

It was a relatively small affair held downstairs at Atlassian's Corn Exchange offices in Sydney, and I put the thing together on a shoestring budget in my spare time with some on-the-ground help from Atlassian's Nicholas Muldoon.

The event was successful, with people from all across Australia and New Zealand attending. At the end of the conference, each attendee was asked to write down one thing they loved, and one thing they hated about the conference.

Stacks of love and hate

This gave me a great starting point to build another conference on, and in early 2011 I started getting the itch to do another. At the same time, Evan Bottcher pinged me about ThoughtWorks lending a hand to organise another DevOps Down Under in Melbourne later in 2011.

The most consistent feedback we got from the 2010 conference was that the coffee was "a little bit shit", so we fixed that by moving the whole conference to Melbourne.

After an initial planning meeting, ThoughtWorks kindly lent Chris Bushell and Natalie Drucker to assist with organising.

I was just starting a new position at work, and wasn't able to dedicate nearly as much time to organising as I had in 2010. I provided the initial vision and direction, but without Chris and Natalie's tireless efforts and persistent pestering of me to get my arse into gear, the conference would have been but a shadow of itself.

Attendees at #dodu2011

By the time DevOps Down Under 2011 wrapped up in July, I was tired and wasn't feeling fired up about putting on another conference just yet. I decided to wait and see how I felt in the new year.

Around March this year I started thinking about doing another conference, but the spark wasn't there like in other years. I decided to press on regardless, motivated by my perception that people expected another conference.

The vision for DevOps Down Under 2012 was to build a quiet, intimate, and safe atmosphere that was removed from the rat race. To achieve this, the plan was to cap the number of attendees at 140, find a venue outside a major capital city, and source high quality talks.

Venue shot for #dodu2012

The venue & budget was in place, and we got a really great collection of talks submitted. I simply failed to execute on anything beyond that.

The main reasons why execution failed were:

  • I had lost the passion for organising the conference, and was motivated by the wrong reasons.
  • I had even less time to commit.
  • Everyone involved was similarly time poor.
  • There was no organisational cadence.
  • I didn't lean enough on other people to help me do the grunt work.
  • I didn't have the time to fix any of these problems.

With the benefit of hindsight, I simply shouldn't have tried to put it on.

Seeing people putting their hands up to organise a 2013 conference takes a huge mental weight off my shoulders.

Through my own actions and inactions, I have felt that the responsibility of leading the conference organisation has fallen to me year-on-year. In 2012 that pressure became paralysing, and my eventual coping mechanism was to ignore the conference entirely.

As for my future involvement: I am still burnt out, and it would simply be unfair to myself, the organisers, speakers, and attendees to commit to taking an active role in organising a 2013 conference.

I have provided the current crop of potential organisers a collection of resources to get them started, and I am extremely confident they will manage to pull off something spectacular.

Drawing on my battered experience of organising several conferences, these are the key actionable things I believe you need to make an event like DevOps Down Under happen:

  • Have at least 3 people who can each dedicate 2+ hours a week to doing the grunt work. Anyone who tells you organising a conference is anything but a hard slog is either lying to you, or doesn't know what they are talking about.
  • Do weekly catchup meetings to keep things on track. Increase the frequency of these closer to the conference date.
  • Use a mailing list for asynchronous organisation.
  • Nominate someone to lead & own the conference vision & organisation.

I hope the above arms you with enough information to avoid falling into the same traps I did.

Ript: quick, reliable, and painless firewalling

Running your own servers? Hate managing firewall rules?

For the last year at Bulletproof Networks I've been working on a little tool called Ript to make writing firewall rules a joy, and applying them quick, reliable, and painless.

Ript is a clean and opinionated Domain Specific Language for describing firewall rules, and a tool with database migrations-like functionality for applying these rules with zero downtime.

The DSL

At Ript's core is an easy to use Ruby DSL for describing both simple and complex sets of iptables firewall rules. After defining the hosts and networks you care about:

partition "joeblogsco" do
  label "www.joeblogsco.com",      :address => "172.19.56.216"
  label "app-01",                  :address => "192.168.5.230"
  label "joeblogsco uat subnet",   :address => "192.168.5.0/24"
  label "joeblogsco stage subnet", :address => "10.60.2.0/24"
  label "joeblogsco prod subnet",  :address => "10.60.3.0/24"
  label "bad guy",                 :address => "172.19.110.247"
  label "bad guys",                :address => "10.0.0.0/8"
end

...you use Ript's helpers for accepting, dropping, & rejecting packets, as well as for performing DNAT and SNAT:

partition "joeblogsco" do
  label "www.joeblogsco.com",      :address => "172.19.56.216"
  label "app-01",                  :address => "192.168.5.230"
  label "joeblogsco uat subnet",   :address => "192.168.5.0/24"
  label "joeblogsco stage subnet", :address => "10.60.2.0/24"
  label "joeblogsco prod subnet",  :address => "10.60.3.0/24"
  label "bad guy",                 :address => "172.19.110.247"
  label "bad guys",                :address => "10.0.0.0/8"

  rewrite "public website + ssh access" do
    ports 80, 22
    dnat  "www.joeblogsco.com" => "app-01"
  end

  rewrite "private to public" do
    snat  [ "joeblogsco uat subnet",
            "joeblogsco stage subnet",
            "joeblogsco prod subnet"  ] => "www.joeblogsco.com"
  end

  reject "bad guy" do
    from "bad guy"
    to   "www.joeblogsco.com"
  end

  drop "bad guys" do
    protocols "udp"
    from      "bad guys"
    to        "www.joeblogsco.com"
  end
end

The DSL provides many helpful shortcuts for DRYing up your firewall rules, and tries to do as much of the heavy lifting for you as possible.

Part of Ript being opinionated is that it doesn't expose all the underlying features of iptables. This was done for several reasons:

  • The DSL would become complex, and thus harder to use.
  • Not all features within iptables map to Ript's DSL.
  • Ript caters for the simple-to-moderately complex use cases that 80% of users have. If you need to use iptables features documented deep within the man pages, Ript is almost certainly not the tool for you.

Rule application

While the DSL is pretty, we didn't write Ript because of it - we wrote it because we're working with tens of thousands of iptables rules & making several changes a day to those rules, and the traditional way of applying changes doesn't cut it at scale.

Most tools try to apply firewall rules by flushing all the loaded rules and loading in new ones. This works fine if you only have a few hundred rules, but as soon as you start scaling into thousands of rules, the load time becomes very noticeable.

The effects of this are fairly simple: the rule load time manifests itself as downtime.

Because the ruleset has to be applied serially, rules at the end of the set are held up by rules still being applied at the beginning of the set. From a service provider's perspective, this means that a rule change for one customer can end up causing downtime for other completely unrelated customers. Not cool.

iptables-save and iptables-restore help with this, but you still end up writing + applying rules by hand - a tedious task if you're making lots of firewall changes every day.

Ript's killer feature is incrementally applying rules.

Ript generates firewall chains in a very specific way that allows it to apply new rules incrementally, and clean out old rules intelligently. Here's an example session:

# Output all the generated rules by interpreting all files under /etc/firewall
ript rules generate /etc/firewall
# Output a diff of rules to apply, based on what rules are currently loaded in memory
ript rules diff /etc/firewall
# Apply the aforementioned diff
ript rules apply /etc/firewall
# Output the currently loaded rules in iptables-restore format
ript rules save
# Output a diff of rules to delete
ript clean diff /etc/firewall
# Apply the aforementioned diff
ript clean apply /etc/firewall

Getting started

Ript has been Open Sourced under an MIT license, and is available on GitHub. To get you going, Ript ships with extensive DSL usage documentation, and a boatload of examples used by the tests.

I'll also be giving a talk about Ript at linux.conf.au in Canberra in January 2013.

Happy Ripting!

Incentivising automated changes

Matthias Marschall wrote a great piece last week on the pitfalls of making manual changes to production systems. TL;DR: making manual changes in the heat of the moment will bite you at the most inopportune times.

The article finishes with this suggestion:

You should have your configuration management tool (like Puppet or Chef) setup so that you can try out possible solutions without having to go in and do it manually.

In my experience, this is the key to solving the problem.

Rather than coercing people to follow a "no manual changes" policy, you make the incentives for making changes with automation better than for making changes manually.

Specifically:

  • Make it simple. Reduce the number of steps to make the change with automation. It should be quicker to find the place in your Chef or Puppet code and deploy than logging into the box, editing a file, and restarting a service.
  • Make it fast. The time from thinking about the change to the change being applied should be shorter with automation than by doing it manually.
  • Make it safe. Provide a rollback mechanism for changes. A safety harness can be as simple as a thin process around "git revert" + deploy (see the sketch below).
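
As a sketch of how thin that safety harness can be, here's a hypothetical rollback wrapper. The repository path and deploy command are assumptions for illustration - substitute whatever your Puppet or Chef workflow actually uses.

#!/usr/bin/env ruby
#
# Hypothetical rollback wrapper: revert the last configuration management
# commit, then redeploy. The paths and commands below are illustrative only.
repo   = '/srv/config-management'   # where your Puppet or Chef code lives
deploy = './bin/deploy'             # whatever kicks off your normal deploy

Dir.chdir(repo) do
  system('git', 'revert', '--no-edit', 'HEAD') or abort('revert failed')
  system(deploy) or abort('deploy failed')
end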

It's a perfect example of how tools should complement culture.

Instrumenting your monitoring checks with New Relic

This post is part 3 of 3 in a series on monitoring scalability.

In parts 1 and 2 of this series I talked about check latency and how you can mitigate its effects by splitting data collection + storage out from alerting, while looking at monitoring systems through the prism of an MVC web application.

This final post in the series provides a concrete example of how to instrument your monitoring checks so you can identify which exact parts of your checks are inducing latency in your monitoring system.

When debugging performance bottlenecks, I tend to use a simple but effective workflow:

  1. observe the system
  2. analyse the results
  3. optimise the bottleneck that is having the most impact
  4. rinse and repeat until the system is performing within the expected performance parameters

What if we continue to look at monitoring checks as micro MVC web applications? What tools exist to aid this optimisation workflow, and how can we hook instrumentation into our checks?

The crème de la crème of web app performance monitoring + optimisation tools is New Relic, boasting an incredibly rich feature set that lets you drill down deep into your application while also providing a high level view of app-wide performance.

But is it possible to hook New Relic into applications that aren't web apps? Let's give it a go.

Here's an example monitoring check:

#!/usr/bin/env ruby
#
# Usage: check.rb <time>

class Check
  attr_reader :opts

  def initialize(opts={})
    @opts = opts
  end

  def model(opts={})
    i = opts[:time]
    sleep(1)
    raise [Exception, RuntimeError, StandardError][rand(2)] if rand(i) == 1
    return i
  end

  def view(data)
    i = data
    sleep(rand(i) / 5)
    raise [Exception, RuntimeError, ArgumentError][rand(2)] if rand(i) == 2

    puts "OK: we made it!"
  end

  def run
    data = model(@opts)
    view(data)
  end
end

Check.new(:time => ARGV[0].to_i).run

As you can see, it's flat out like a lizard drinking, inducing latency by sleeping and spicing things up by randomly throwing exceptions. All things considered, it's actually a pretty good example of a monitoring check that aims to misbehave.

Let's start instrumenting!

First up we need to load some libraries:

#!/usr/bin/env ruby

require 'rubygems'
require 'newrelic_rpm'

class Check
  include NewRelic::Agent::Instrumentation::ControllerInstrumentation

Reading through the New Relic API documentation...

# When the app environment loads, so does the Agent. However, the
# Agent will only connect to the service if a web front-end is found. If
# you want to selectively monitor ruby processes that don't use
# web plugins, then call this method in your code and the Agent
# will fire up and start reporting to the service.

...it looks like we need to manually start up the agent:

class Check
  # ...
end

NewRelic::Agent.manual_start

Now we need to tell the New Relic agent what to instrument. The API provides methods to do this at the transaction and method level:

class Check
  # ...

  add_transaction_tracer :run,   :name => 'run', :class_name => '#{self.class}'
  add_method_tracer      :model, 'Nagios/#{self.class.name}/model'
  add_method_tracer      :view,  'Nagios/#{self.class.name}/view'
end

In New Relic parlance, a transaction is an end-to-end process that is comprised of many smaller units of work, and a method is an individual unit of work. In this monitoring check scenario, a transaction is an invocation of the check.

When using the New Relic agent with Rails, by default it captures the query parameters passed to the controller action. This helps massively when debugging why a certain transaction takes longer to complete on particular inputs.

Wouldn't it be cool if we could treat the command line arguments to the monitoring check as query parameters to the controller action? That way we could identify which services are running slowly and holding up the check.

Turns out this is just another option to add_transaction_tracer:

add_transaction_tracer :run, :name => 'run', :class_name => '#{self.class}', :params => 'self.opts'

Provided you store all your options in an instance variable with an attr_reader, you can capture whatever data is passed to the check on execution.

One piece of data the New Relic agent captures is an Apdex score for each request. An Apdex score is a measurement of user satisfaction when interacting with an application or service.

In this particular scenario, the "user" is actually a monitoring system, so the score may not be that meaningful. Let's disable it for now:

class Check
  # ...

  newrelic_ignore_apdex
end

So far everything has been very smooth - we've taken an existing check and added some instrumentation points with New Relic - but we're about to hit a complication.

Internally the New Relic agent spawns a separate thread from which it sends all this instrumented data to the New Relic service. Establishing a connection to the New Relic service actually takes a while (15+ seconds in the worst cases), which doesn't quite fit the paradigm we're working in where monitoring checks are returning sub-second results.

Essentially this means that we're collecting all this interesting data with the New Relic agent but it's never actually sent to the New Relic service.

In the PHP world this is a very real problem as PHP processes will exit at the end of each request. In the PHP edition of New Relic there's quite a cute workaround for exactly this problem - each PHP process sends data to a daemon running in the background that buffers it and sends it to New Relic at a regular interval.

Let's emulate this functionality in Ruby:

at_exit do
  NewRelic::Agent.save_data
end

This will serialise the captured data to log/newrelic_agent_store.db as a marshalled Ruby object. The last step is to send this data to New Relic at a regular interval:

#!/usr/bin/env ruby
#
# Usage: collector.rb
#

require 'rubygems'
require 'newrelic_rpm'

module NewRelic
  module Agent
    def self.connected?
      agent.connected?
    end
  end
end

$stdout.sync = true
NewRelic::Agent.manual_start

print "Waiting to connect to the NewRelic service"
until NewRelic::Agent.connected? do
  print '.'
  sleep 1
end
puts

NewRelic::Agent.load_data
NewRelic::Agent.shutdown(:force_send => true)

This waits for the New Relic agent to establish a connection to the New Relic service, loads the data serialised by the checks, and sends it to New Relic.

Just for testing, we can run our pseudo collector like this:

while true; do echo "Sending" && ruby send.rb && echo "Sleeping 30" && sleep 30 ; done

And invoke the monitoring check like this:

while true ; do RACK_ENV=development bundle exec ruby main.rb 5 ; done

Now we've got all this set up, we can log into New Relic to view some pretty visualisations of our monitoring check latency:

New Relic dashboard screenshot

New Relic automatically identifies which transactions are the slowest, and lets you deep dive to identify where the slowness is:

New Relic transaction deep dive screenshot

If you haven't got a brass razoo there are plenty of Open Source alternatives to New Relic, but you'll have to do a bit more grunt work to get them going.

This post concludes this series on monitoring scalability! The TL;DR series summary:

  • Check latency is the monitoring system killer.
  • Even in simple environments check latency slows down your monitoring system and obfuscates incidents.
  • To eliminate latency, separate data collection from alerting.
  • Make your monitoring checks as non-blocking as possible.
  • Whenever debugging monitoring performance problems, think of your monitoring system as an MVC web app.
  • Instrument your monitoring checks to identify sources of latency.

You can find the above code examples on GitHub.

If you've enjoyed this series of posts, you can find more of my keen insights, witty banter, and Australian colloquialisms on Twitter, or subscribe to my blog.