The Lattice engineering org has grown quite a lot over the last few years, and we expect headcount to at least double each year. But to make the most of all of these wonderful engineers, we need to provide the tooling and support that make development at Lattice a joy.
At Lattice, we have a group of engineers called the DX (Developer Experience) guild that leads the charge on understanding and improving the experience of building software at Lattice. But with the high growth that our engineering org is experiencing, it's become quite challenging to make sense of all the inputs. If our engineers are telling us that running tests locally (on their laptops) is slow, there's a lot of work that needs to be done to understand exactly what that means. It's not immediately clear how slow they are or if some tests are slower than others. Put another way, we need some quantitative data to understand what the issue really is, and today collecting that data involves manual exploration. Now, imagine scaling this manual process up to an org with 100+ engineers and beyond.
If we think of these engineers as our customers and our mission is making them happy and productive, then what we need is a systematic way to measure DX and analyze these variables to understand what's really going on. Sound a bit familiar? That's because this is the same problem we face in our production environments! If we're to have any hope of knowing which way is up, we need to be able to observe our production systems. So why shouldn't we apply the same observability practices to the tools we provide for our own engineers?
Solution
When it comes to the developer's experience at Lattice, there are two key environments that we need to consider:
The first is CI/CD (Continuous Integration, Continuous Deployment) — where we build, test, and deploy our applications after we merge changes into our main branch. It's of interest to us because the health of our CI/CD pipeline greatly affects the DX at Lattice. Slow build times, flaky tests, or configuration issues can cause whole groups of engineers to become blocked on delivering value to our customers.
The second environment is local: the laptops our engineers use each day to develop. There are various workflows that developers rely on frequently in local environments, such as incrementally building NextJS pages while running our front-end application, or waiting for a back-end service to spin up so they can test out their changes.
Overall, the goal we set was to be able to observe development metrics in Datadog, our cloud monitoring solution. Datadog offers powerful analytical tools that will prove useful for understanding DX in a more objective way. Further, Datadog allows us to define monitors against future SLAs that alert us when DX becomes unhealthy.
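To give a flavor of what such a monitor could look like, here is a purely illustrative Datadog metric monitor query. The metric name, grouping tag, and threshold are hypothetical placeholders, not anything we've actually configured:

```
avg(last_4h):avg:dev.local.test_run.avg{*} by {package} > 600000
```

Read as: alert if the average local test run for any package (durations in milliseconds) exceeds ten minutes over the past four hours.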
CI/CD Environment
Lucky for us, Datadog provides an integration for CI visibility that was easy enough to configure for CircleCI. Once set up, we were able to use Datadog to explore all sorts of CI-related metrics including job durations, job success rates, individual test durations, flaky test detection, and more. Neat!
Local Environment
Unlucky for us, local environments were a bit more challenging. Before we can discuss the solution we've built, let's start with a look at how we interact with Datadog in production.
We run a Datadog agent in a Kubernetes cluster that our production apps communicate with. The agent then passes along our application metrics to the Datadog API where we can view them in graphs and dashboards.
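To make that flow concrete, here's a minimal sketch of how an app might report a timing metric to the agent over the DogStatsD protocol. The metric name and tags are made up for illustration, and a real service would typically use a DogStatsD client library rather than raw UDP:

```typescript
import { createSocket } from "node:dgram";

// The Datadog agent listens for DogStatsD datagrams on UDP port 8125 by default.
const AGENT_HOST = process.env.DD_AGENT_HOST ?? "localhost";
const AGENT_PORT = 8125;

// Send a timing metric using the DogStatsD plaintext format:
//   <metric.name>:<value>|ms|#<tag>:<value>,<tag>:<value>
function sendTiming(metric: string, millis: number, tags: Record<string, string>): void {
  const tagString = Object.entries(tags)
    .map(([key, value]) => `${key}:${value}`)
    .join(",");
  const payload = `${metric}:${millis}|ms|#${tagString}`;
  const socket = createSocket("udp4");
  socket.send(payload, AGENT_PORT, AGENT_HOST, () => socket.close());
}

// e.g. report that a hypothetical request handler took 124ms
sendTiming("app.request.duration", 124, { service: "example-service" });
```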
For this project, the goal is to collect metrics from development environments, which means our “production apps” (the development tools themselves) are actually running on our engineers' laptops, not within our Kubernetes cluster. So the question becomes, “How can we observe local environments in a way that is both secure and low-friction for our engineers?”
In order to communicate with the Datadog API, you need to be in possession of an API key. We don't exactly want to hand our app's API key over to every engineer, so we extended our existing architecture to allow for local metrics without exposing that key. By putting a "Development Telemetry" service in front of our agent, we can accept metric requests from engineers' laptops and pass them through to our Datadog agent, much like our production apps do.
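For a sense of what that pass-through looks like, here's a minimal sketch assuming an Express endpoint and the same DogStatsD datagram format shown above. The route, payload shape, and metric prefix are hypothetical, not our exact implementation:

```typescript
import express from "express";
import { createSocket } from "node:dgram";

const app = express();
app.use(express.json());

// Forward a duration metric to the Datadog agent sitting next to this service,
// using the same DogStatsD timing format as the production sketch above.
function forwardToAgent(metric: string, millis: number, tags: Record<string, string>): void {
  const tagString = Object.entries(tags)
    .map(([key, value]) => `${key}:${value}`)
    .join(",");
  const payload = `${metric}:${millis}|ms|#${tagString}`;
  const socket = createSocket("udp4");
  socket.send(payload, 8125, process.env.DD_AGENT_HOST ?? "localhost", () => socket.close());
}

// Engineers' laptops POST events here; the Datadog API key never leaves our infrastructure.
app.post("/v1/metrics", (req, res) => {
  const { name, durationMs, tags = {} } = req.body;
  if (typeof name !== "string" || typeof durationMs !== "number") {
    res.status(400).json({ error: "name and durationMs are required" });
    return;
  }
  forwardToAgent(`dev.local.${name}`, durationMs, tags);
  res.status(202).json({ accepted: true });
});

app.listen(3000);
```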
Allowing Lattice engineers to post metrics to our Development Telemetry service implies that the service is exposed on the public Internet. So how do we ensure that only Lattice engineers can post metrics? One solution would be to implement some sort of authentication system for our service, but that adds complexity and doesn't change the fact that our service is publicly exposed. Instead, we opted to expose the Development Telemetry service through our company's VPN. This way, the security team can manage roles and permissions the same way they do for our other internal systems.
Great, now that we have an architecture in place, we need to figure out what we're going to measure exactly. Stepping back for a moment, let's remind ourselves of why we're here: we aim to understand DX in hopes of improving it. What we're certainly not building here is spyware.
So the metrics we collect should be intentional, directly serve our mission, and remain anonymous. We should target areas that we consider key events in the developer's workflow. At a minimum, we want to know how long a particular event takes and some basic information about the engineer's environment, like which version of NodeJS they are running or how much total system memory their machine has. Each metric might also carry extra context. For example, when tracking how long it takes engineers to spin up a back-end application on their laptops, we'll want to know which application.
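As a sketch of what an event payload might contain (the field names here are illustrative, not our exact schema), each event carries a name, a duration, some anonymous environment details, and event-specific tags:

```typescript
// Hypothetical shape of a development telemetry event. Note that there is
// no user identifier anywhere in the payload: events stay anonymous.
interface DevTelemetryEvent {
  name: string;                 // e.g. "test_run" or "app_startup"
  durationMs: number;           // how long the event took
  environment: {
    nodeVersion: string;        // e.g. process.version
    totalMemoryBytes: number;   // e.g. os.totalmem()
    platform: string;           // e.g. "darwin"
  };
  tags: Record<string, string>; // event-specific context, e.g. { app: "api" }
}
```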
After studying past DX survey results, we came up with a list of metrics to start with:
- Test runner duration - How long does it take to run tests for various apps and packages locally?
- Linting duration - How long does it take to lint our apps and packages locally?
- NextJS page compilation duration - How long does it take to build a NextJS page locally so an engineer can interact with it during development?
- Application/package build duration - How long does it take for engineers to build apps and packages locally?
- Application startup duration - How long does it take to lift an application locally?
In the future, our team can evolve this list to include even more key development events.
Implementing the above metrics was a matter of hooking into our existing tooling and scripts and inserting a timer that measures when an event begins and when it completes. Upon completion, we simply fire an event off to our Development Telemetry service, which passes it along to Datadog. And just like that, we now have local metrics in Datadog!
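As a rough sketch of that hook (the helper, endpoint URL, and event name below are made up; our real scripts look different), a wrapper can time a task and fire the event off with a best-effort POST. This assumes Node 18+ for the global fetch:

```typescript
import os from "node:os";

// Hypothetical helper: time an async dev task and report it to the
// Development Telemetry service. Failures are swallowed so that a
// telemetry outage never blocks local development.
async function withDevTiming<T>(
  name: string,
  tags: Record<string, string>,
  task: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    return await task();
  } finally {
    const durationMs = Date.now() - start;
    fetch("https://dev-telemetry.internal.example/v1/metrics", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        name,
        durationMs,
        environment: {
          nodeVersion: process.version,
          totalMemoryBytes: os.totalmem(),
          platform: process.platform,
        },
        tags,
      }),
    }).catch(() => {
      // best effort; ignore network errors
    });
  }
}

// e.g. wrap a hypothetical local build step:
// await withDevTiming("app_build", { app: "api" }, () => runBuild());
```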
There's one thing I want to address that you might be thinking: is measuring local environments reliable? Aren't there other variables that could throw our measurements off? What if an engineer is running Slack, jamming to Spotify, and building a Docker container all while running one of these events? It's true; we can expect variation across these metrics. But across a large enough set of data points over a window of time, that noise washes out, and trends in duration become a valid indicator of the overall health of DX.
What's next
The goal of this project was to build an extensible system for development telemetry. Now that the system is in place, we can analyze the available data and make decisions about how to address DX pain from a data-driven perspective. For example, it became apparent that DX pain isn't felt equally by all engineers. Looking at how long it takes engineers to build apps and packages, it appears some engineers with older hardware are waiting longer than others. Is this gap enough to justify a call for new laptops for everyone? Maybe, but at least we can approach that conversation with clear eyes now that we have real data. In fact, we've begun to pilot our local environment on new M1 MacBooks, and we can now point to data to help us understand whether the performance improvements we see are worth the cost of upgrading hardware.
Another interesting find is that when we look at the slowest NextJS pages to compile locally, some team names pop up a number of times. This might mean those particular teams are feeling extra pain developing within our front-end application. This pattern gives us a lead: if it holds, we can focus our efforts on understanding why the pages these teams own are so much slower to compile than other areas of our app.
In the near future, we hope to define SLAs for our development environments, using these metrics from CI/CD and local environments to craft appropriate SLIs. I'm excited to see how this system evolves over time and how it can help us continue to scale up our engineering org. When building internal tools for developers, it's incredibly beneficial to think of those developers as the end-users of your team's "product." Thinking this way allows us to apply everything we know about developing our external products to our internal ones and drive them towards our definitions of success. If you're passionate about improving developer experience or building great products for any kind of user, we're hiring!