Saturday, February 18, 2017

Revamping SAFE's Program Level PI Metrics - Conclusion

Base controls on relative indicators and trends, not on variances against plan” – Bjarte Bogsnes, Implementing Beyond Budgeting


The series began with an overview of a metric model defined to address the following question:
"Is the ART sustainably improving in its ability to generate value through the creation of a passionate, results-oriented culture relentlessly improving both its engineering and product management capabilities?"
The ensuing posts delved into the definitions and rationale for the Business Impact, Culture, Quality and Speed quadrants.  In this final article, I will address dashboard representation, implementation and application.

Dashboard Representation

The model is designed such that the selected set of metrics will be relatively stable unless the core mission of the ART changes.  The only expected change would result from either refinement of the fitness function or incorporation of the advanced measures as the ART becomes capable of measuring them.

Given that our focus is on trend analysis rather than absolutes, my recommendation is that for each measure the dashboard reflects the vale for the PI just completed, the previous PI and the average of the last 3 PI’s.   Given the assumption that most will initially implement the dashboard in Excel (sample available here), I would further suggest the use of conditional formatting to color-code movement (dark green for strongly positive through dark red for strongly negative). 


In The Art of Business Value, Mark Schwartz proposes the idea of “BI-Driven Development (BIDD?)”.  His rationale?  “In the same sense that we do Test-Driven-Development, we can set up dashboards in the BI or reporting system that will measure business results even before we start writing our code”.

I have long believed that if we are serious about steering product strategy through feedback, every ART should either have embedded analytics capability or a strong reach into the organisation’s analytics capability.  While the applicability extends far beyond the strategic dashboard (ie per Feature), I would suggest the more rapidly one can move from a manually collated and completed spreadsheet to an automated analytics solution the more effective the implementation will be.

Virtually every metric on the dashboard can be automatically captured, whether it be from the existing enterprise data-warehouse for Business Metrics, the Feature Kanban in the agile lifecycle management tool, Sonarqube, the logs of the Continuous Integration and Version Control tools or the Defect Management System.  Speed and Quality will require deliberate effort to configure tooling such that the metrics can be captured, and hints as to approach were provided in the rationales of the relevant deep-dive articles.  NPS metrics will require survey execution, but are relatively trivial to capture using such tools as Survey Monkey.


I cannot recommend base-lining your metrics prior to ART launch strongly enough.  If you do not know where you are beginning, how will you understand the effectiveness of your early days?  Additionally, the insights derived from the period from launch to end of first PI can be applied in improving the effectiveness of subsequent ART launches across the enterprise.

With sufficient automation, the majority of the dashboard can be in a live state throughout the PI, but during the period of manual collation the results should be captured in the days leading up to the Inspect & Adapt workshop.


The correct mindset is essential to effective use of the dashboard.  It should be useful for multiple purposes:
  • Enabling the Portfolio to steer the ART and the accompanying investment strategy
  • Enabling enterprise-level trend analysis and correlation across multiple ARTs
  • Improving the effectiveness of the ART’s Inspect and Adapt cycle
  • Informing the strategy and focus areas for the Lean Agile Centre of Excellence (LACE)
Regardless of application specifics, our focus is on trends and global optimization.  Are the efforts of the ART yielding the desired harvest, and are we ensuring that our endeavors to accelerate positive movement in a particular area are not causing sub-optimizations elsewhere in the system?

It is vital to consider the dashboard as a source not of answers, but of questions.   People are often puzzled by the Taiichi Ohno quote “Data is of course important … but I place the greatest emphasis on facts”.   Clarity lies in appreciating his emphasis on not relying on reports, but rather going to the “gemba”.  For me, the success of the model implementation lies in the number and quality of questions it poses.  The only decisions made in response to the dashboard should be what areas of opportunity to explore – and of course every good question begins with why.   For example:
  • Why is our feature execution time going down but our feature lead time unaffected?
  • Why has our deployment cycle time not reduced in response to our DevOps investment?
  • Why is Business Owner NPS going up while Team NPS is going down?
  • Why is our Program Predictability high but our Fitness Function yield low?
  • Why is our Feature Lead Time decreasing but our number of production incidents rising?


It’s been quite a journey working through this model, and I’m grateful for all the positive feedback I have received along the way.   The process has inspired me to write a number of supplementary articles.  

The first of these is a detailed coverage of the Feature Kanban (also known as the Program Kanban).  Numerous people have queried me as to the most effective way of collecting the Speed Metrics, and this becomes trivial with the development an effective Feature Kanban (to say nothing of the other benefits).

I’ve also wound up doing a lot of digging into “Objectives and Key Results” (OKR’s).  Somehow the growing traction of this concept had passed me by, and when my attention was first drawn to it I panicked at the thought it might invalidate my model before I had even finished publishing it.  However, my research concluded that the concepts were complementary rather than conflicting.  You can expect an article exploring this to follow closely on the heels of my Feature Kanban coverage.

There is no better way to close this series than with a thought from Deming reminding us of the vital importance of mindset when utilising any form of metric.
People with targets and jobs dependent upon meeting them will probably meet the targets – even if they have to destroy the enterprise to do it.” – W. Edwards Deming


  1. Love this! Curious why unit test coverage is included. I only ask this because that's a metric that can be completely misused. We can have 100% test coverage but *not assert anything*. I'd like to see "Running Tested Features" as described in: Thank you for the great post.

    1. G'day Tim, Thanks for the feedback. Let me take a crack at your question.

      I start with the premise that any metric can be mis-used. I'm very aware that I can achieve 100% unit test coverage with completely useless unit tests. On the "gaming and misuse" front, I believe that the combinations of metrics on the dashboard will help with this (more later).

      However, I wanted metrics that satisfied a few needs:
      * As much as possible, could be captured automatically
      * Could be applied almost universally in terms of tech stack diversity
      * Could be captured by an ART early in its maturity journey, even if the answer was 0%.

      I have yet to see anyone come up with a useful, consistently applicable measure for functional test automation available early in lifecycle. Nor have I seen someone come up with an automatic way to verify usefulness of a test. The closest has been the work google did in running analytics over their test cycle results to try to drive focus on high value tests (see, but this is hardly for the beginner. On the functional test side, Ron's stuff is great thinking but getting to consistent quantitative assessment of it at scale would require significant discipline in the specification and automation of feature level acceptance tests - which most trains take some time to get to.

      So .. in short .. if you are in a context where you can get some higher order quantitative test automation assessment consistently across many teams/trains, I envy you and by all means extend the dashboard :) If not, the path to great test automation begins with unit test automation and TDD.

      Now to the gaming front. This is where metric combinations should be helping me. Assuming my unit test coverage is high, my late phase defect count should be low and my fault feedback ratio should be low. If this is not the case, I have a smell! If I was really concerned about patterns and trends, I'd probably monitor my duplication and cyclomatic complexity in isolation for both my production code and my test code. Helping people understand that their test-code is a first-class citizen and deserving of refactoring and craftsmanship is often a much needed corrective action when they've pursued high code coverage with limited understanding :)

      Hope this helps.