How We Introduced Guardrail Metrics for Our Continuous Integration Pipeline

The Problem

I’m part of the Engineering Productivity team at Split, where our mission is to provide tools and systems to empower teams to ship code faster.

One of our repositories is a monorepo where different teams introduce changes for various backend services.

Build times were long, and we wanted to shorten them. The underlying problem was a lack of observability into the current state of the build, which we discovered while looking for opportunities to improve things. After coming up with a few possible solutions, we needed a way to measure their effect and validate our hypotheses.

We divided this effort into two milestones. The first, which this article focuses on, was to measure our continuous integration pipeline. The second is about continuous delivery and will be tackled later in this article series.

Our Approach

To help our Product Engineering teams move faster, we had a few questions to answer.

  • How many builds do we run per day? Per week?
  • How often does the build fail?
  • Which part of the build fails most often?
  • What is the average build duration?
  • Which part of the build is the slowest?

To answer these questions, our idea was to introduce guardrail metrics. These would track the duration of the builds over time, the number of executed workflows, and build success and failure rates. As a result, engineers could spot the bottlenecks and act on them.

Capturing Data

In our current setup, we use GitHub Actions for the continuous integration step. Although it offers some out-of-the-box metrics, there is no easy way to analyze the duration of the builds over time, the number of executed workflows, the build success/failure rate, queued duration, and so on.

Our first goal was to provide an easy way to collect continuous integration metrics so they could be analyzed later in a more digestible form. Since we use Datadog for managing all of our metrics, we decided to use it for this purpose as well.

To get information from the GitHub Actions workflows, we added the Datadog Actions Metrics plugin (int128/datadog-actions-metrics) to our continuous integration pipeline to collect the guardrail metrics. The plugin gives us the following information.

Workflows

  • The number of workflow runs (e.g., # CI workflow runs)
  • The number of successful workflow runs
  • The number of failed workflow runs
  • The duration of a workflow run (time from when a workflow is started until it is updated)
  • The queued duration of a workflow (time from when a workflow is created until its first job is started)

Jobs

  • The number of job runs (e.g., # Build job runs)
  • The number of successful job runs
  • The number of failed job runs
  • The duration of a job run (time from when a job is started until it is completed)
  • The queued duration of a job (time from when a job is started until its first step is started)

Steps

  • The number of step runs
  • The number of successful step runs
  • The number of failed step runs
  • The duration of a step run (time from when a step is started until it is completed)
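
The workflow duration definitions above can be illustrated with a minimal Python sketch. It is not the plugin's implementation. The timestamp field names (created_at, run_started_at, updated_at, started_at) are assumed to follow the GitHub REST API's workflow run and job objects, and the sample payload and helper names are hypothetical.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2024-01-15T10:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def workflow_durations(run: dict, jobs: list) -> dict:
    """Compute run and queued durations (in seconds) for one workflow run."""
    created = parse_ts(run["created_at"])
    started = parse_ts(run["run_started_at"])
    updated = parse_ts(run["updated_at"])
    # Queued duration ends when the earliest job starts.
    first_job_start = min(parse_ts(job["started_at"]) for job in jobs)
    return {
        "run_duration_seconds": (updated - started).total_seconds(),
        "queued_duration_seconds": (first_job_start - created).total_seconds(),
    }


# Hypothetical sample data, for illustration only.
run = {
    "created_at": "2024-01-15T10:00:00Z",
    "run_started_at": "2024-01-15T10:00:00Z",
    "updated_at": "2024-01-15T10:12:30Z",
}
jobs = [
    {"started_at": "2024-01-15T10:00:20Z"},
    {"started_at": "2024-01-15T10:00:45Z"},
]

durations = workflow_durations(run, jobs)
print(durations)
```

In this sample the run takes 750 seconds and spends 20 seconds queued before its first job starts.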

To configure the plugin, we added the following YAML workflow to our continuous integration setup so that we could start capturing metrics.

# Run whenever any workflow in this repository completes.
on:
  workflow_run:
    workflows:
      - "**"
    types:
      - completed
jobs:
  send:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - name: Send GitHub Actions metrics to Datadog
        # https://github.com/int128/datadog-actions-metrics/
        uses: int128/datadog-actions-metrics@v1
        with:
          # Datadog API key, stored as a GitHub Actions secret.
          datadog-api-key: ${{ secrets.DATADOG_API_KEY }}
          # Collect job and step metrics only for runs on the default branch.
          collect-job-metrics: ${{ github.event.workflow_run.head_branch == github.event.repository.default_branch }}
          collect-step-metrics: ${{ github.event.workflow_run.head_branch == github.event.repository.default_branch }}

For calculating durations, we were only interested in successful workflows, which gave us a consistent view across runs. We still needed the failed workflows to calculate success and failure rates.

Since we do trunk-based development, the only branch that mattered to us was the main branch.

After setting this up in our continuous integration pipeline, we had new metrics available in Datadog. The next step was to present this data so that engineers could understand it and use it to make improvements.

Displaying Information

The second goal of this project was to make this information easy to consume, so we created a dashboard in Datadog with all the new metrics.

Based on our conversations with the Product Engineering teams, we wanted to surface the following information.

CI Builds Success/Failure

We used the metrics github.actions.workflow_run.conclusion.success_total and github.actions.workflow_run.conclusion.failure_total to calculate the failure rate.
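
As a rough illustration of that calculation, the Python sketch below derives a failure rate from the two counters. The article does not spell out the formula, so this assumes the rate is failures divided by the sum of successes and failures, ignoring any other conclusions such as cancelled runs. The function name and sample counts are hypothetical.

```python
def failure_rate(success_total: int, failure_total: int) -> float:
    """Return the share of completed runs that failed, as a value in [0, 1]."""
    completed = success_total + failure_total
    if completed == 0:
        # No completed runs in the window: report 0 rather than divide by zero.
        return 0.0
    return failure_total / completed


# Hypothetical counts for one time window.
rate = failure_rate(success_total=180, failure_total=20)
print(f"{rate:.1%}")  # prints "10.0%"
```

In Datadog itself, the same ratio would be expressed as a formula over the two metric queries in a dashboard widget rather than in code.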

Run Duration for Each One of the Workflows

We used the metric github.actions.job.duration_second, grouped by workflow_name.

Workflow Total Run Duration

We used the metric github.actions.job.duration_second.

Jobs Duration

We used the metric github.actions.job.duration_second, grouped by job_name.
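
Datadog performs this grouping inside the dashboard widget. To show what grouping by job_name amounts to, here is a small offline sketch in Python that averages durations per job over successful runs only and picks out the slowest job. The record layout, the conclusion field, and the sample values are assumptions made for illustration, not data from our pipeline.

```python
from collections import defaultdict
from statistics import mean


def average_duration_by_job(samples: list) -> dict:
    """Average duration per job name, counting successful runs only."""
    grouped = defaultdict(list)
    for sample in samples:
        if sample["conclusion"] == "success":
            grouped[sample["job_name"]].append(sample["duration_second"])
    return {name: mean(values) for name, values in grouped.items()}


# Hypothetical job samples, for illustration only.
samples = [
    {"job_name": "build", "conclusion": "success", "duration_second": 300},
    {"job_name": "build", "conclusion": "success", "duration_second": 360},
    {"job_name": "build", "conclusion": "failure", "duration_second": 45},
    {"job_name": "test", "conclusion": "success", "duration_second": 600},
    {"job_name": "test", "conclusion": "success", "duration_second": 660},
]

averages = average_duration_by_job(samples)
slowest = max(averages, key=averages.get)
print(averages, slowest)
```

Excluding the failed build run keeps an early failure (45 seconds here) from pulling the average down, which matches the choice to look only at successful runs when measuring duration.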

Number of PRs Merged

We used the metric github.actions.workflow_run.total.

We didn’t include any information about steps since that could impact Datadog costs.

One nice thing about building this dashboard in Datadog is that template variables let us replicate it for other repositories and filter by repository or by specific engineers.

Learnings

By using this dashboard, the team could understand how build execution was evolving over time. By surfacing the workflow duration, we could identify the bottlenecks. From now on, new improvements can use these metrics to validate our hypotheses.
