Saturday, February 18, 2017

Revamping SAFe's Program Level PI Metrics - Conclusion

“Base controls on relative indicators and trends, not on variances against plan” – Bjarte Bogsnes, Implementing Beyond Budgeting


The series began with an overview of a metric model defined to address the following question:
"Is the ART sustainably improving in its ability to generate value through the creation of a passionate, results-oriented culture relentlessly improving both its engineering and product management capabilities?"
The ensuing posts delved into the definitions and rationale for the Business Impact, Culture, Quality and Speed quadrants.  In this final article, I will address dashboard representation, implementation and application.

Dashboard Representation

The model is designed such that the selected set of metrics will be relatively stable unless the core mission of the ART changes.  The only expected change would result from either refinement of the fitness function or incorporation of the advanced measures as the ART becomes capable of measuring them.

Given that our focus is on trend analysis rather than absolutes, my recommendation is that for each measure the dashboard reflect the value for the PI just completed, the previous PI and the average of the last 3 PIs.  Given the assumption that most will initially implement the dashboard in Excel (sample available here), I would further suggest the use of conditional formatting to color-code movement (dark green for strongly positive through dark red for strongly negative). 
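As a sketch of this trend-first representation (the function and field names are my own illustration, not a prescribed format), the movement for any metric can be derived from its PI history:

```python
from statistics import mean

def trend_signal(history):
    """Given a list of PI-end values for one metric (oldest first), return
    (current PI value, previous PI value, rolling 3-PI average, % change of
    the current value against that rolling average)."""
    current, previous = history[-1], history[-2]
    rolling = mean(history[-3:])
    pct_change = (current - rolling) / rolling * 100 if rolling else 0.0
    return current, previous, rolling, pct_change

# Example: feature lead time (days) over the last four PIs
current, previous, rolling, pct = trend_signal([120, 110, 95, 80])
```

The percentage change against the rolling average is what drives the colour-coding: strongly negative movement on a "lower is better" metric like lead time would map to dark green.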


In The Art of Business Value, Mark Schwartz proposes the idea of “BI-Driven Development (BIDD?)”.  His rationale?  “In the same sense that we do Test-Driven-Development, we can set up dashboards in the BI or reporting system that will measure business results even before we start writing our code”.

I have long believed that if we are serious about steering product strategy through feedback, every ART should either have embedded analytics capability or a strong reach into the organisation’s analytics capability.  While the applicability extends far beyond the strategic dashboard (i.e. per Feature), I would suggest that the more rapidly one can move from a manually collated spreadsheet to an automated analytics solution, the more effective the implementation will be.

Virtually every metric on the dashboard can be automatically captured, whether it be from the existing enterprise data-warehouse for Business Metrics, the Feature Kanban in the agile lifecycle management tool, Sonarqube, the logs of the Continuous Integration and Version Control tools or the Defect Management System.  Speed and Quality will require deliberate effort to configure tooling such that the metrics can be captured, and hints as to approach were provided in the rationales of the relevant deep-dive articles.  NPS metrics will require survey execution, but are relatively trivial to capture using such tools as Survey Monkey.


I cannot recommend base-lining your metrics prior to ART launch strongly enough.  If you do not know where you are beginning, how will you understand the effectiveness of your early days?  Additionally, the insights derived from the period from launch to end of first PI can be applied in improving the effectiveness of subsequent ART launches across the enterprise.

With sufficient automation, the majority of the dashboard can be in a live state throughout the PI, but during the period of manual collation the results should be captured in the days leading up to the Inspect & Adapt workshop.


The correct mindset is essential to effective use of the dashboard.  It should be useful for multiple purposes:
  • Enabling the Portfolio to steer the ART and the accompanying investment strategy
  • Enabling enterprise-level trend analysis and correlation across multiple ARTs
  • Improving the effectiveness of the ART’s Inspect and Adapt cycle
  • Informing the strategy and focus areas for the Lean Agile Centre of Excellence (LACE)
Regardless of application specifics, our focus is on trends and global optimization.  Are the efforts of the ART yielding the desired harvest, and are we ensuring that our endeavors to accelerate positive movement in a particular area are not causing sub-optimizations elsewhere in the system?

It is vital to consider the dashboard as a source not of answers, but of questions.   People are often puzzled by the Taiichi Ohno quote “Data is of course important … but I place the greatest emphasis on facts”.   Clarity lies in appreciating his emphasis on not relying on reports, but rather going to the “gemba”.  For me, the success of the model implementation lies in the number and quality of questions it poses.  The only decisions made in response to the dashboard should be what areas of opportunity to explore – and of course every good question begins with why.   For example:
  • Why is our feature execution time going down but our feature lead time unaffected?
  • Why has our deployment cycle time not reduced in response to our DevOps investment?
  • Why is Business Owner NPS going up while Team NPS is going down?
  • Why is our Program Predictability high but our Fitness Function yield low?
  • Why is our Feature Lead Time decreasing but our number of production incidents rising?


It’s been quite a journey working through this model, and I’m grateful for all the positive feedback I have received along the way.   The process has inspired me to write a number of supplementary articles.  

The first of these is a detailed coverage of the Feature Kanban (also known as the Program Kanban).  Numerous people have queried me as to the most effective way of collecting the Speed Metrics, and this becomes trivial with the development of an effective Feature Kanban (to say nothing of the other benefits).

I’ve also wound up doing a lot of digging into “Objectives and Key Results” (OKRs).  Somehow the growing traction of this concept had passed me by, and when my attention was first drawn to it I panicked at the thought it might invalidate my model before I had even finished publishing it.  However, my research concluded that the concepts were complementary rather than conflicting.  You can expect an article exploring this to follow closely on the heels of my Feature Kanban coverage.

There is no better way to close this series than with a thought from Deming reminding us of the vital importance of mindset when utilising any form of metric.
“People with targets and jobs dependent upon meeting them will probably meet the targets – even if they have to destroy the enterprise to do it.” – W. Edwards Deming

Sunday, February 5, 2017

Revamping SAFe's Program Level PI Metrics Part 5/6 - Speed

“Changing the system starts with changing your vantage point so you can ‘see’ the system differently.  Development speed is often attributed to quick decisions.  Early definition of the requirements and freezing specification quickly are often highlighted as keys to shortening the product development cycle.  Yet the key steps required to bring a new product to market remain the creation and application of knowledge, regardless of how quickly the requirements are set.  The challenge in creating an effective and efficient development system lies in shortening the entire process.” – Dantar Oosterwal, The Lean Machine.

Series Context

Part 1 – Introduction and Overview
Part 2 – Business Impact Metrics
Part 3 – Culture Metrics
Part 4 – Quality Metrics
Part 5 – Speed Metrics (You are here)
Part 6 – Conclusion and Implementation


As mentioned in my last post, the categorization of metrics went through some significant reshaping in the review process.  The “Speed” (or “Flow”) quadrant didn’t exist, with its all-important metrics divided between “Business Impact” and “Deployment Health”.  

Lead Time is arguably the most important metric in Lean, as evidenced by Taiichi Ohno’s famous statement that “All we are doing is looking at the customer time line, from the moment the customer gives us the order to the point when we collect the cash”.  Not only does it measure our (hopefully) increasing ability to respond rapidly to opportunity, but it is also a critical ingredient in enabling a focus on global rather than local optimization.

In this quadrant, the core focus is to employ two perspectives on Lead Time.  The first (Feature Lead Time) relates to the delivery of feature outcomes, and the second (MTTR from Incident) to our ability to rapidly recover from production incidents.

The other proposed metrics highlight the cycle time of key phases in the idea-to-value life-cycle as an aid to understanding “where we are slow, and where we are making progress”.  In particular, they will highlight failure to gain traction in XP and DevOps practices.

There is, however, a caveat.  Many (if not most) Agile Release Trains do not begin life in control of the entire idea-to-value life-cycle.  On the one hand, it’s very common for features to be handed off to an enterprise release management organisation for production release.  On the other, whilst Lean principles are at the heart of SAFe, the framework centers on hardware/software development.  The (traditionally business) skill-sets in areas such as operational readiness, marketing and sales required to move from “deployed product” to “value-generating product” are nowhere on the big picture. 

ARTs focused on bringing to life the SAFe principles will address these gaps as they inspect and adapt, but in the meantime  there is a temptation to “not measure what we are not in control of”.  As a coach, I argue that ARTs should “never let go until you’ve validated the outcome”.  You may not be in control, but you should be involved – if for nothing else than in pursuit of global optimization.  

Basic Definitions

Basic Metrics Rationale

Average Feature Lead Time (days)

This is the flagship metric.   However, the trick is determining "when the timer starts ticking".   For an ART maintaining the recommended 3-PI roadmap, feature lead time would rarely be shorter than a depressing 9 months.  
To measure it, one needs 2 things: a solid Feature Kanban, and agreement on which stage triggers the timer.  A good feature kanban will of necessity be more granular than the sample illustrated in the framework (fuel for a future post), but the trigger point I most commonly look for is "selection for next PI".  In classic kanban parlance, this is the moment when a ticket moves from "backlog" to "To Do", and in most ARTs it triggers the deeper preparation activities necessary to prepare a feature for PI planning.  The end-point for the measure is the moment at which the feature starts realizing value; it is dependent on solution context, often triggered by deployment for digital solutions but only after business change management activities for internal solutions.
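A minimal sketch of the calculation, assuming the two trigger timestamps above can be exported from the Feature Kanban (the data structure here is illustrative, not a tool-specific format):

```python
from datetime import date

def average_feature_lead_time(features):
    """Average lead time in days across completed features. Each feature is
    a (selected_for_next_pi, value_realized) date pair, matching the start
    and end triggers described above."""
    durations = [(done - start).days for start, done in features]
    return sum(durations) / len(durations)

# Illustrative export: two completed features
features = [
    (date(2016, 10, 3), date(2017, 1, 20)),
    (date(2016, 11, 14), date(2017, 2, 10)),
]
avg = average_feature_lead_time(features)
```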

Average Deployment Cycle Time (days)

This metric was inspired by the recently released DevOps Handbook by Gene Kim and friends.  In essence, we want to measure “time spent in the tail”.  I have known ART after ART that accelerated their development cycle whilst never making inroads on their path to production.  If everything you build has to be injected into a 3-month enterprise release cycle, it’s almost pointless accelerating your ability to build!  
Whilst our goal is to measure this in minutes, I have selected days as the initial measure, as for most large enterprises the starting point will be weeks if not months.

Average Mean Time to Restore (MTTR) from Incident (mins)

When a high severity incident occurs in production, how long does it take us to recover?  In severe cases, these incidents can cause losses of millions of dollars per hour.  Gaining trust in our ability to safely deploy regularly can only occur with demonstrated ability to recover fast from issues.  Further, since these incidents are typically easy to quantify in bottom-line impact, we gain the ability to start to measure the ROI of investment in DevOps enablers.
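Assuming detection and restoration timestamps can be exported from the incident management system, the calculation itself is a simple average (the record structure below is illustrative):

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean time to restore, in minutes, across high-severity production
    incidents. Each entry is a (detected_at, restored_at) pair."""
    total = sum((restored - detected).total_seconds() / 60
                for detected, restored in incidents)
    return total / len(incidents)

# Illustrative export: two Sev 1/2 incidents in the PI
incidents = [
    (datetime(2017, 2, 1, 9, 0), datetime(2017, 2, 1, 9, 45)),
    (datetime(2017, 2, 8, 14, 0), datetime(2017, 2, 8, 16, 15)),
]
avg = mttr_minutes(incidents)
```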

Prod Deploys Per PI (#)

Probably the simplest measure of all listed on the dashboard - how frequently are we deploying and realizing value?

Advanced Definitions

Advanced Metrics Rationale

Average Feature Execution Cycle Time (days)

This is one of the sub-phases of the lead time which are worth measuring in isolation, and is once again dependent on the presence of an appropriately granular feature kanban.  
The commencement trigger is "first story played", and the finalization trigger is "feature ready for deployment packaging" (satisfies Feature Definition of Done).  The resultant measure will be an excellent indicator of train behaviors when it comes to Feature WIP during the PI.  Are they working on all features simultaneously throughout the PI or effectively collaborating across teams to shorten the execution cycles at the feature level?

One (obvious) use of the metric is determination of PI length.  Long PIs place an obvious overhead on Feature Lead Time, but if average Feature Execution time is 10 weeks it’s pointless to consider an 8-week PI.  

Average Deploy to Value Cycle Time (days)

This sub-phase of feature lead time measures "how long a deployed feature sits on the shelf before realizing value".  
The commencement trigger is "feature deployed", and the finalization trigger is "feature used in anger".  It will signal the extent to which true system level optimization is being achieved, as opposed to local optimization for software build.  In a digital solution context it is often irrelevant (unless features are being shipped toggled-off in anticipation of marketing activities), but for internal solution contexts it can be invaluable in signalling missed opportunities when it comes to organizational change management and business readiness activities.

Average Deployment Outage (mins)

How long an outage will our users and customers experience in relation to a production deployment?  Lengthy outages will severely limit our aspirations to deliver value frequently.  


We’ve now covered all 4 quadrants and their accompanying metrics.  The next post will conclude the series with a look at dashboard representation, implementation and utilisation. 

“High performers [in DevOps practices] were twice as likely to exceed profitability, market share, and productivity goals.  And, for those organizations that provided a stock ticker symbol, we found that high performers had 50% higher market capitalization growth over three years.” – Gene Kim, Jez Humble, Patrick Debois and John Willis, The DevOps Handbook

Monday, January 30, 2017

Revamping SAFe's Program Level PI Metrics Part 4/6 - Quality

“The Systems Science Institute at IBM has reported that the cost to fix an error after product release was four to five times as much as one uncovered during design, and up to 100 times more than one identified in the maintenance phase” – iSixSigma magazine

Series Context


Given the central nature of the “build quality in” mindset to Lean and Agile, my early drafts of the metrics dashboard devoted 3 full categories to quality:
  • Technical Health 
  • Quality 
  • Deployment Health 
The “quality” aspect of the original cut took a lean lens on the traditional “defect/incident” quality metrics, whilst the other two focused on technical quality and “devops” type quality respectively.

I was fortunate enough to both get some great review feedback from +Dean Leffingwell on the drafts and spend some time at a whiteboard brainstorming with him. He dwelt on the fact that I had “too many metrics and too many quadrants” :) As we brainstormed, we came to two conclusions. Firstly, the 3 concepts listed above were just different perspectives on quality – and secondly, we could separate my individual metrics into “the basics everyone should have” and “the advanced things people should have but might take time to incorporate”. The result is the set of basic and advanced definitions below.

One might question the incorporation of highly technical metrics in an executive dashboard, however there are three very good reasons to do so:
  • If our technical practices are not improving, no amount of process improvement will deliver sustainable change. 
  • If our teams are taking lots of shortcuts to deliver value fast, there is no sustainability to the results being achieved and we will wind up “doing fragile not agile”. 
  • If the executives don’t care, the teams are unlikely to. 
The only non-subjective way I know to approach this is through static code analysis. Given the dominance of Sonarqube in this space, I have referenced explicit Sonarqube measures in the definitions. Additionally, effective adoption of Continuous Integration (CI) amongst the developers is not only a critical foundation for DevOps but also an excellent way to validate progress in the “build quality in” mindset space.

On the “traditional quality measurement” front, my core focus is “are we finding defects early or late”? Thus, I look to both evaluate the timing of our validation activities and the level of quality issues escaping the early life-cycle. For deployment health, all inspiration was sourced from DevOps materials and as we re-structured the overall model it became apparent that many of these measures really belonged in the “Speed” quadrant – all that remained in the quality quadrant was clarity on production incidents.

Basic Definitions

Basic Metrics Rationale

Unit Test Coverage %

As I regularly inform participants in the training room, "if you do not aggressively pursue automated testing your agile implementation will fail!"  It is impossible to sustainably employ an iterative and incremental approach to software development without it.

Static analysis tools will not tell you the quality of the unit tests or the meaningfulness of the coverage, but simply having coverage will give the developers confidence to refactor - the key to increasing maintainability.  It should also increase the ratio of first fix resolution, giving confidence that defects can be resolved fast and minor enhancements made without causing unintended side effects.
Further, even if automated functional tests are still on the to-do list, testers who can read unit tests will be able to more effectively adopt risk-based manual testing and thus reduce manual test effort.

Mean Time Between Green Builds (mins)

Note that many ARTs will implement multiple CI cycles – local ones executing on branches and a central master cycle on the mainline.   Whilst branch-level CI cycles might be of interest at the team level, the only one we are interested in at the ART level is the master on the mainline.

Red CI builds are of course an indicator of poor developer quality practices (failure to locally validate code prior to check-in), and most believe the full CI cycle should complete in under 10 minutes to provide timely feedback to the developers.  Failure on either of these fronts will naturally extend the time between green builds, so neither needs to be discretely measured on the dashboard.

Mean Time to Recover from Red build (mins)

Two things will cause this metric to trend in the wrong direction.  One is lack of the Andon mindset (it's someone else's fault, or even worse, it's always red, just ignore it).  The second is failure to regularly commit, resulting in complex change-sets and difficult debugging.  The second is easily identified through the Mean Time Between Green Builds, so this metric enables measurement of the establishment of the Andon mindset among developers.
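Both CI metrics can be derived from a chronological export of mainline build results. The sketch below assumes a simple (finished_at, status) record per build; adapt it to whatever your CI tool's logs actually provide:

```python
from datetime import datetime
from statistics import mean

def ci_metrics(builds):
    """builds: chronological list of (finished_at, status) records for the
    mainline CI pipeline, status in {"green", "red"}. Returns (mean minutes
    between green builds, mean minutes to recover from red). A recovery is
    measured from the first red build to the next green build."""
    greens = [t for t, s in builds if s == "green"]
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(greens, greens[1:])]

    recoveries, red_start = [], None
    for t, s in builds:
        if s == "red" and red_start is None:
            red_start = t                      # build just went red
        elif s == "green" and red_start is not None:
            recoveries.append((t - red_start).total_seconds() / 60)
            red_start = None                   # mainline restored
    return mean(gaps), mean(recoveries)

# Illustrative log: green, red, recovery, green
builds = [
    (datetime(2017, 2, 6, 9, 0), "green"),
    (datetime(2017, 2, 6, 9, 30), "red"),
    (datetime(2017, 2, 6, 10, 0), "green"),
    (datetime(2017, 2, 6, 11, 0), "green"),
]
gap_avg, recover_avg = ci_metrics(builds)
```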

Late Phase Defects #

The identification and resolution of defects during the execution of a story is evidence of good team quality practices, and should be excluded from any strategic treatment of defect trends.  However, defects identified in functionality associated with a story after its acceptance, or in late-phase (integration, performance, security, UAT etc.) testing, are indicators of a failure to "build quality in".   
Whilst many teams do not formally log defects identified during story development, where this is done there will be a need for classification in the defect management system to separate late phase defects for reporting purposes.

Validation Capacity %

Great agile means a story is accepted once it is in production.  Good agile means it is accepted once it is ready for production.  For most enterprises in the early years of their agile adoption, this seems like a fairy-tale - the DevOps definition of "Unicorns" such as Amazon and Netflix resonates strongly!   
The reality is for some time there will be testing and packaging activities which get batched up and executed late in development.  Typical examples include:
  • User Acceptance Testing - of course, the Product Owner as the embedded customer is meant to do this in good agile but for many they are neither sufficiently knowledgeable nor sufficiently empowered.
  • Integration Testing - in theory redundant if the team is practicing good full-stack continuous integration.  But for all too many, environment constraints prohibit this and lead to extensive use of stubs until late phase.
  • Performance Testing - for many organisations, the performance test environments are congested, hard to book, and take days if not weeks to configure for a performance test run.  
  • Penetration Testing - a highly specialised job with many organisations possessing a handful of skilled penetration testers spread across thousands of developers.
  • Release Documentation
  • Mandated Enterprise Level Integration and Deployment preparation cycles for all changes impacting strategic technology assets.
Given that the backlog "represents the collection of all the things a team needs to do", all of these activities should appear in backlogs, estimated and prioritized to occur in the appropriate iterations.   It is a simple matter to introduce a categorization to the backlog management tool to flag these items as hardening activities.
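With hardening items flagged in the backlog management tool as suggested, the measure reduces to a capacity ratio. A sketch (the point values and flag are illustrative):

```python
def validation_capacity_pct(backlog_items):
    """Percentage of delivered capacity consumed by late-phase validation
    and hardening. Each item is a (story_points, is_hardening) pair, where
    the flag corresponds to the backlog categorization described above."""
    total = sum(points for points, _ in backlog_items)
    hardening = sum(points for points, flag in backlog_items if flag)
    return hardening / total * 100 if total else 0.0

# Illustrative PI: 40 points delivered, 6 of them hardening
items = [(8, False), (5, False), (3, True), (5, False),
         (3, True), (8, False), (8, False)]
pct = validation_capacity_pct(items)
```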

Average Severity 1 and 2 Incidents per Deploy

High severity incidents associated with deployments are a critical quality indicator.  Measurement is generally fairly trivial with the appropriate flagging in incident management systems.  However, some debate may exist as to whether an incident is associated with a deployment or simply the exposition of a preexisting condition.  An organisation will need to agree on clear classification standards in order to produce meaningful measures.

Advanced Definitions

Advanced Metrics Rationale

Duplication %

Duplicate code is bad code.  It's simple.  One line of duplicated business logic is a time-bomb waiting to explode.  If this number is trending down, it's an indicator that developers are starting to refactor, the use of error-prone copy/paste techniques is falling and the maintainability of the source code is going up.  It's potentially debatable whether one measures duplicate blocks or duplicate lines, but given the amount of logic possible to embed in a single line of code I prefer the straight-up measurement of duplicated lines.  

Average Cyclomatic Complexity

Cyclomatic complexity is used to measure the complexity of a program by analyzing the number of linearly independent paths through a program's code.    More complexity leads to more difficulty in maintaining or extending functionality and greater reliance on documentation to understand intent.  It can be measured at multiple levels, however from a dash-boarding perspective my interest is in function or method level complexity.  

Average Branch Age at Merge (days)

This metric may take a little more work to capture, but it is well worth the effort.  The modern ideal is of course not to branch at all (branching by abstraction), however the technical sophistication required of developers to achieve this takes some time to build.  
Code living in a branch is code that has not been integrated, and thus code that carries risk.  The longer the code lives in a branch, the more effort it takes to merge it back into the mainline and the greater the chance that the merge process will create high levels of late-phase defects.
Whiteboard spotted at Pivotal Labs by @testobsessed
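One hedged sketch of the calculation, assuming you can recover each branch's first-commit date and merge date from your version control history (e.g. from git merge commits); the data structure is illustrative:

```python
from datetime import date

def average_branch_age(merges):
    """Average age of branches at merge, in days. Each entry pairs the date
    of the branch's first commit with the date it was merged to mainline."""
    ages = [(merged - created).days for created, merged in merges]
    return sum(ages) / len(ages)

# Illustrative history: one short-lived branch, one two-week branch
merges = [(date(2017, 1, 2), date(2017, 1, 6)),
          (date(2017, 1, 9), date(2017, 1, 23))]
avg = average_branch_age(merges)
```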

Fault Feedback Ratio (FFR) %

When it comes to defects, we are interested in not just when we find them but how we respond to them.  In his book "Quality Software Management Vol 2: First-Order Measurement", Gerry Weinberg introduced me to the concept (along with many other fascinating quality metrics).  Our goal is to determine what happens when we address a defect.  Do we resolve it completely?  Do we introduce other new defects in resolving the first one?  A rising FFR value can indicate poor communication between testers and developers, hacked-in fixes, and deterioration in the maintainability of the application, among other things.  According to +Johanna Rothman in this article, a value of <= 10% is a good sign.
Measuring it should be trivial with appropriate classifications of defect sources and resolution verification activities in the defect management system.
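A sketch of the arithmetic, treating a "bounced" fix as one that either failed resolution verification or introduced a new defect (the classification scheme is illustrative and would come from your defect management system):

```python
def fault_feedback_ratio(fix_records):
    """FFR as a percentage: the share of defect fixes that bounced, i.e.
    failed verification or introduced a new defect. Each record is a bool,
    True if the fix bounced."""
    bounced = sum(1 for bad in fix_records if bad)
    return bounced / len(fix_records) * 100

# Illustrative PI: 2 bad fixes out of 20
ffr = fault_feedback_ratio([False] * 18 + [True] * 2)
```

On this data the FFR sits exactly at the 10% "good sign" threshold Rothman describes.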

Average Open Defects #

When it comes to open defects, one needs to make a number of local decisions.  Firstly, what severity are we interested in?  Restricting it to high severity defects can hide all kinds of quality risk, but at the same time many low severity defects tend to be more matters of interpretation and often represent minor enhancement requests masquerading as defects.
Further, we need to determine whether we are interested in the open count at the end of the PI or the average throughout the PI.  A Lean focus on building quality in leads me to be more interested in our every-day quality position rather than what we've cleaned up in our end-of-PI rush.


More than for any other quadrant, I wrestled to find a set of quality metrics small enough not to be overwhelming yet comprehensive enough to provide meaningful insight.  At the team level, I would expect significantly more static code analysis metrics (such as “Code Smells”, “Comment Density” and “Afferent Coupling” ) to be hugely valuable.  Kelley Horton of Net Objectives suggested a Defect Density measure based on “# of production defects per 100 story points released”, and “% capacity allocated to technical debt reduction”.  For further inspiration, I can recommend nothing so much as the “Quality Software Management” series by +Gerald Weinberg.

“You should name a variable with the same care with which you name a first-born child” – Robert C. Martin, Clean Code

Wednesday, January 25, 2017

Revamping SAFe's Program Level PI Metrics Part 3/6: Culture

"Organizational culture can be a major asset or a damaging liability that hinders all efforts to grow and become more successful. Measuring and managing it is something few companies do well." - Mark Graham Brown, Business Finance Magazine


After exploring the Business Impact quadrant in Part 2 of this series, our focus now moves to Culture. I have been involved with over 30 release trains since I started working with SAFe in early 2012, and I have come to the passionate belief over that time that positive movement in culture is the most accurate predictor of sustained success.

While most agree that it is impossible to truly measure culture, there are certainly indicators that can be measured which help us in steering our path.

In selecting the mix of measures proposed, I was looking for a number of elements:
  • Are our people happy?
  • Are our stakeholders happy?
  • Are we becoming more self-organizing?
  • Are we breaking down silos?

The basic metrics address the first 2 elements, while the advanced metrics tackle self-organization and silos.

Basic Definitions

Basic Metrics Rationale

Team Net Promoter Score (NPS) - "Are our people happy?"

In his book The Ultimate Question 2.0, Fred Reichheld describes the fashion in which many companies also apply NPS surveys to their employees - altering the question from "how likely are you to recommend [Company Name]" to "how likely are you to recommend working for [Company Name]".

My recommendation is that the question is framed as "how likely are you to recommend being a member of [Release Train name]?". Survey Monkey provides a very easy mechanism for running the surveys.
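For reference, the standard NPS arithmetic applies regardless of how the question is framed: promoters score 9-10, detractors 0-6, and passives (7-8) count only in the denominator.

```python
def nps(scores):
    """Net Promoter Score from 0-10 survey responses:
    % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return round((promoters - detractors) / len(scores) * 100)

# Illustrative survey: 5 promoters, 3 passives, 2 detractors
score = nps([10, 9, 9, 8, 7, 7, 6, 5, 9, 10])
```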

For a more detailed treatment, see this post by my colleague +Em Campbell-Pretty. Pay particular attention to the value of the verbatims and the inclusion of vendor staff in the survey – they’re team members too!

As a coach, I often ponder what “mission success” looks like. What is the moment when the ART I’ve been nurturing is set for greatness and my job is done? Whilst not enough of my ARTs have adopted the team NPS discipline to give me great data, I have developed a belief based on the data I do have that the signal is Team NPS moving above +20.

Business Owner Net Promoter Score (NPS) - "Are our stakeholders happy?"

This is a more traditional treatment of NPS based on the notion that business owners are effectively internal customers of the ART. The question is framed as "how likely are you to recommend the services of [Release Train Name] to a friend or colleague?"

If you’re truly serious about the Lean mindset, you will be considering your vendors when you identify the relevant Business Owners for this metric. There is vendor involvement in virtually every ART I work with, team-members sourced from vendors are a key part of our culture, and vendor management need to be satisfied the model is working for their people and their organization.

Staff Turnover %

In one sense, this metric could be focused on "Are our people happy", however I believe it is more holistic in nature. Staff turnover can be triggered either by people being unhappy and leaving, or by lack of organizational commitment to maintaining long-lived train membership. Either will have negative impacts.

Advanced Definitions

Advanced Metrics Rationale

Developer % (IT) - "Are we becoming more self-organizing?"

When an ART is first formed it classically finds “a role in SAFe” for all relevant existing IT staff (often a criticism of SAFe from the "anti-SAFe crowd"). However, as it matures and evolves the people might stay but their activities change. People who have spent years doing nothing but design start writing code again. Great business analysts move from the IT organisation to the business organisation. Project managers either return to a practical skill they had prior to becoming project managers or roll off the train. In short, the only people who directly create value in software development are software developers. All other IT roles are useful only in so far as they enable alignment (and the greater our self-organisation maturity, the less the need for dedicated alignment functions). If we seek true productivity gains, we seek a greater proportion of doers.

One of my customers started using this metric to measure progress on this front and I loved it. One of the early cost-saving aspects of agile is reduction in management overhead, whether it be the instant win of preventing duplication of management functions between the implementing organization and their vendors or the conversion of supervision roles (designers, project managers) to contribution roles.

Obviously, this is a very software-centric view of the ART. As the “Business %” metric will articulate, maturing ARTs will tend to deliberately incorporate more people with skills unrelated to software development. Thus, this measure focuses on IT-sourced Train members (including leadership) who are developers.

As a benchmark, the (Federal Government) organization who inspired the incorporation of this metric had achieved a ratio of 70%.

Business % - "Are we breaking down silos?"

While most ARTs begin life heavily staffed by IT roles, as the mission shifts towards global optimization of the “Idea to Value” life-cycle they discover the need for more business-related roles. This might be the move from “proxy Product Owners” to real ones, but equally powerfully it sees the incorporation of business readiness skill-sets such as business process engineering, learning and development, and marketing.

Whilst the starting blueprint for an ART incorporates only 1 mandatory business role (the Product Manager) and a number of recommended business roles (Product Owners), evolution should see this mix change drastically.

The purpose of this measure could easily have been written as "Are we achieving system-level optimization?"; however, my personal bent for eliminating the terms "business" and "IT" led to the silo focus in the question.


When it comes to culture, I have a particular belief in the power of a change in language employed to provide acceleration. A number of ARTs I coach are working hard to eliminate the terms “Business” and “IT” from their vocabulary, but the most powerful language change you can make is to substitute the word “person” for “resource”!

Series Context

Part 1 – Introduction and Overview
Part 2 – Business Impact Metrics
Part 3 – Culture Metrics (You are here)
Part 4 – Quality Metrics
Part 5 – Speed Metrics 
Part 6 – Conclusion and Implementation

“Instead of trying to change mindsets and then change the way we acted, we would start acting differently and the new thinking would follow.” – David Marquet, Turn the Ship Around!

Thursday, January 19, 2017

Revamping SAFe's Program Level PI Metrics Part 2/6: Business Impact

“Managers shape networks’ behavior by emphasizing indicators that they believe will ultimately lead to long term profitability” – Philip Anderson, Seven Levers for Guiding the Evolving Enterprise


In Part 1 of this series, we introduced the Agile Release Train (ART) PI metrics dashboard and gave an overview of the 4 quadrants of measurement. This post explores the first and arguably most important quadrant – Business Impact.

As you may have guessed from the rather short list, there can be no useful generic set of metrics. They must be context driven for each ART based on both the mission of the ART and the organisational strategic themes that gave birth to it. As Mark Schwartz put it so elegantly in The Art of Business Value, “Business value is a hypothesis held by the organization’s leadership as to what will best accomplish the organization’s ultimate goals or desired outcomes”.



Fitness Function

When reading The Everything Store: Jeff Bezos and the Age of Amazon, I was particularly taken by the concept of the fitness function. Each team was required to propose “…a linear equation that it could use to measure its own impact without ambiguity… A group writing software code for the fulfillment centers might home in on decreasing the cost of shipping each type of product and reducing the time that elapsed between a customer's making a purchase and the item leaving the FC in a truck”. Amazon has since moved to more discrete measures rather than equations (I suspect in large part due to the bottlenecks caused by Bezos' insistence on personally signing off each team's fitness function equation), but I believe the “fitness function mindset” has great merit in identifying the set of business impact metrics which best measure the performance of an ART.

To illustrate, based on three ARTs I work with:
  • An ART at an organisation which ships parcels uses "First time delivery %". They implement numerous digital features enabling pre-communication with customers to avoid delivery vans arriving at empty houses. Moving this a percentage point has easily quantifiable bottom-line ROI impacts.
  • An ART focused on Payment Assurance at an organisation which leverages "Delivery Partners" to execute field installation and service work. Claims for payment submitted by these partners are complex and require payment within tight SLAs. A fitness function based on payment lead time and cost savings from successful claim disputes would again be easily mapped to quantifiable ROI.
  • A telco ART focused on self-service diagnostics for customers. The fitness function in this case would reference “reduced quantity of fault-related calls to call centers” (due to the customer having self-diagnosed and used the tool to make their own service booking if required), “reduced quantity of no-fault-found truck rolls” (due to the tool having aided the customer in identifying ‘user error’), and “increased first call resolution rates for truck rolls” (due to the detailed diagnostic information available to service technicians).
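To make the “fitness function mindset” concrete, here is a sketch of the Amazon-style linear equation: a weighted sum of an ART's business impact measures, compared PI over PI. The metric names and weights below are invented for illustration and are not drawn from any of the ARTs above:

```python
# Illustrative fitness function: a linear combination of impact measures.
# Weights are a hypothetical expression of relative monetary impact;
# negative weights mark measures where lower is better.

WEIGHTS = {
    "first_time_delivery_pct": 0.5,
    "payment_lead_time_days": -2.0,
    "fault_calls_per_1000": -1.0,
}

def fitness(measures):
    """Single figure of merit for the PI's business impact measures."""
    return sum(WEIGHTS[name] * value for name, value in measures.items())

pi_12 = {"first_time_delivery_pct": 92.0, "payment_lead_time_days": 4.0, "fault_calls_per_1000": 7.0}
pi_13 = {"first_time_delivery_pct": 94.0, "payment_lead_time_days": 3.0, "fault_calls_per_1000": 5.0}

print(fitness(pi_13) - fitness(pi_12))  # 5.0 (positive delta: fitness improving)
```

As with the dashboard as a whole, the trend in the figure of merit matters more than its absolute value.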
Considerations when selecting fitness function components
Obviously, the foremost consideration is identifying a number of components from which one can model a monetary impact. However, I believe two other factors should be considered in the identification process:
  • Impact on the Customer
  • Ensuring a mix of both Leading and Lagging Measures
Net Promoter Score (NPS) is rapidly becoming the default customer loyalty metric. Whilst Reichheld argues in The Ultimate Question 2.0 that mature NPS implementations gain the ability to quantify the value of a movement in a specific NPS demographic, I have yet to work with an organization that has reached this maturity. However, most have access to reasonably granular NPS metrics. The trick is identifying the NPS segments impacted by the customer interactions associated with the ART’s mission and incorporating those measures.
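For reference, the standard NPS arithmetic is the percentage of promoters (scores of 9-10) minus the percentage of detractors (scores of 0-6); passives (7-8) count only in the denominator. A quick sketch with made-up survey responses:

```python
# Standard Net Promoter Score calculation on a list of 0-10 survey scores.

def net_promoter_score(scores):
    promoters = sum(1 for s in scores if s >= 9)   # scores 9-10
    detractors = sum(1 for s in scores if s <= 6)  # scores 0-6
    return round(100 * (promoters - detractors) / len(scores))

responses = [10, 9, 9, 8, 7, 7, 6, 4, 10, 9]  # invented sample
print(net_promoter_score(responses))  # 30
```

The same calculation applied per segment is what lets you watch the NPS demographics your ART's mission actually touches.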

When it comes to identifying useful leading metrics, there can be no better inspiration than the Innovation Accounting concepts explained by Eric Ries in The Lean Startup. In some cases (particularly Digital), it can also be as simple as taking the popular Pirate Metrics for inspiration. For many trains with digital products, I also believe abandonment rate is an extremely valuable metric given that an abandoned transaction tends to equate to either a lost sale or added load on a call center.

Program Predictability

This is the standard proxy result measure defined in SAFe. It is a great way of ensuring focus on objectives whilst leaving space for Product Owners and Managers to make trade-off calls during PI execution. In short, I paraphrase it as "a measure of how closely the outcomes from PI execution live up to the expectations established with Business Owners at PI planning and how clear those expectations were".
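As a rough sketch of how that measure is computed: each team's ratio of actual to planned business value for its PI objectives (as scored by Business Owners), averaged across the train. This is simplified from SAFe's full definition (stretch-objective handling is omitted), and the team names and scores are invented:

```python
# Simplified sketch of SAFe's program predictability measure.

def team_predictability(objectives):
    """Percentage of planned business value actually achieved by one team."""
    planned = sum(o["planned"] for o in objectives)
    actual = sum(o["actual"] for o in objectives)
    return 100 * actual / planned

def program_predictability(teams):
    """Average of the team predictability ratios across the train."""
    ratios = [team_predictability(objs) for objs in teams.values()]
    return round(sum(ratios) / len(ratios))

teams = {
    "Team A": [{"planned": 10, "actual": 8}, {"planned": 5, "actual": 5}],
    "Team B": [{"planned": 8, "actual": 8}, {"planned": 7, "actual": 4}],
}
print(program_predictability(teams))  # 83
```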

But wait, there's more!

A good train will use far more granular results metrics than those listed above. Each feature should come with specific success measures that teams, product owners and managers should be using to steer their strategy and tactics (fuel for another post), but I am seeking here a PI level snapshot that can be utilized consistently at portfolio levels to understand the success or otherwise of investment strategy.

A closing note on the Fitness Function

I believe the fitness function definition should be identified and agreed at the ART Launch Workshop. Well-launched ARTs will have all the key Business Owners present at this workshop, and I strongly believe that agreement on how the business impact of the ART will be measured is a critical component of mission alignment.

Series Context

Part 1 – Introduction and Overview
Part 2 – Business Impact Metrics (You are here)
Part 3 – Culture Metrics
Part 4 – Quality Metrics
Part 5 – Speed Metrics 
Part 6 – Conclusion and Implementation

“The gold standard of metrics: Actionable, Accessible and Auditable ... For a report to be considered actionable it must demonstrate clear cause and effect … Make the reports as simple as possible, so everyone understands them … We must ensure that the data is credible” – Eric Ries, The Lean Startup

Saturday, January 14, 2017

Revamping SAFe's Program Level PI Metrics Part 1/6 - Overview

"Performance of management should be measured by potential to stay in business, to protect investment, to ensure future dividends and jobs through improvement of product and service for the future, not by the quarterly dividend" - Deming

Whilst the Scaled Agile Framework (SAFe) has evolved significantly over the years since inception, one area that has lagged is that of metrics. Since the Agile Release Train (ART) is the key value-producing vehicle in SAFe, I have a particular interest in Program Metrics - especially those produced on the PI boundaries.

In tackling this topic, I have numerous motivations. Firstly, the desire to acknowledge that it is easier to critique than create. I have often harassed +Dean Leffingwell over the need to revamp the PI metrics, but not until recently have I developed a set of thoughts which I believe meaningfully contribute to progress. Further, I wish to help organisations avoid the all-too-common traps of mistaking velocity for productivity or simply carrying over the default “on time, on budget, on scope” measures inherited from phase-gate delivery. It is one thing to tout Principle 5 – Base milestones on objective evaluation of working systems, and quite another to provide a sample set of measures which offers a convincing alternative to traditional milestones and measures.

Scorecard Design

It is not enough to look at value alone. One must take a balanced view not just of the results being achieved but of the sustainability of those results. In defining the PI scorecard represented here, I was in pursuit of a set of metrics which answered the following question:

"Is the ART sustainably improving in its ability to generate value through the creation of a passionate, results-oriented culture relentlessly improving both its engineering and product management capabilities?"

After significant debate, I settled on 4 quadrants, each focused on a specific aspect of the question above:

For each quadrant, I have defined both a basic and advanced set of metrics.  The basics represent “the essentials”, the bare minimum that should be measured for a train.  However, if one desires to truly use metrics to both measure and identify opportunities for improvement some additional granularity is vital – and this is the focus of the additional advanced metrics.

Business Impact

Whilst at first glance this quadrant might look sparse, the trick is in the “Fitness Function”. Wikipedia defines it as “a particular type of objective function that is used to summarise, as a single figure of merit, how close a given design solution is to achieving the set aims”. Jeff Bezos quite famously applied it at Amazon, insisting that every team in the organization develop a fitness function to measure how effectively it was impacting the customer. It will be different for every ART, but should at minimum identify the key business performance measures that will be impacted as the ART fulfils its mission.


Culture

The focus in culture is advocacy. Do our people advocate working here? Do our stakeholders advocate our services? Are we managing to maintain a stable ART?


Quality

For quality, our primary question is “are we building quality in?” Unit test coverage demonstrates progress with test automation, while “Mean time between Green Builds” and “MTTR from Red Build” provide good clues as to the establishment of an effective Continuous Integration mindset. From there we look at late phase defect counts and validation capacity to understand the extent to which our quality practices are “backloaded” – in short, how much is deferred to “end-to-end” feature validation and pre-release validation activities. And finally, we are looking to see incidents associated with deployments dropping.


Speed

This quadrant is focused on responsiveness - how rapidly can our ART respond to a newly identified opportunity or threat?  Thus, we start with Feature Lead Time - "how fast can we realise value after identifying a priority feature?". Additionally, we are looking for downward trends in time spent “on the path to production”, mean time to recover from incidents and frequency of deployments as our DevOps work pays dividends.


In parts 2 through 5 of this series, I will delve into each quadrant in turn, exploring the definitions of and rationale for each measure and in part 6 wrap it all up with a look at usage of the complete dashboard.

Series Context

Part 1 – Introduction and Overview (You are here)
Part 2 – Business Impact Metrics
Part 3 – Culture Metrics
Part 4 – Quality Metrics
Part 5 – Speed Metrics 
Part 6 – Conclusion and Implementation 

"Short term profits are not a reliable indicator of performance of management. Anybody can pay dividends by deferring maintenance, cutting out research, or acquiring another company" – Deming

Thursday, January 5, 2017

Tips for Designing and Leveraging Great Kanban Boards


I’ve been working on an article about the SAFe Program Kanban, and found myself mixing in a number of basic Kanban techniques. As I read through the (overly lengthy) first draft and realised the fuzzy focus being caused by a mix of “Kanban 101” and “Program Kanban”, I found myself reflecting on the fact that a lot of people kind of “fall into Kanban”. The two most common cases I encounter are the dev team that evolves their Scrum implementation to the point that the arbitrary batching mechanism of the Sprint Boundary seems suboptimal and the Agile Release Train (ART) Product Management team taking their first crack at a Program Kanban. For whatever reason, many start to use it without ever understanding any of the basic tools available other than “use WIP limits”.

In this article, I’m going to cover two of the basic Kanban concepts every team should take advantage of and a third which tends to be more applicable for strategic Kanban systems than those operated at the dev team level.

Doing and Done

One of the simplest improvements you can make to a Kanban is the separation of states into “Doing” and “Done”. This separation enables far more accurate visualization of the true state of a piece of work, adds immensely to the granularity of the metrics that can be derived, and most importantly is critical to enabling the move from "push" to "pull" mindset.
Consider the simple yet very common 2-state Team Kanban below:

When the developer completes story S1, they will signal this by pushing it into Test. However, the system is now lying. The fact that the developer has placed it in Test does not mean testing has commenced (the clue lies in the use of the word "pushed").

Consider an alternative:

Now, when the developer completes story S1, they place it in "Dev Done". This sends a signal to the testers that when they are ready, they can "pull" story S1 into Test. If we see a big queue building in Dev Done, we can see a bottleneck emerging in front of Test. If (over time), we discover that stories are spending significant amounts of time in "Dev Done" it should trigger some root cause analysis.

You could also achieve the same effect by making a 4-state Kanban as follows:

  • Dev
  • Ready for Test
  • Test
  • Ready for Acceptance
To be brutally honest, the difference is largely academic. Aesthetically, I tend to prefer the “Doing|Done” approach, partly because it leaves me with fewer apparent states in the Kanban and mainly because I tend to assign WIP limits spanning “Doing and Done”. In fact, when designing complex Kanbans I will often use a mix of “Single State” and “Multi-State (Doing|Done)” columns for clarity. The “Single State” columns tend to be those in which no activity is occurring – they’re just a queue (eg “Backlog”).
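To illustrate why I span the WIP limit across “Doing and Done”, here is a minimal Python sketch of the pull mechanics; the class, states and limits are invented for illustration:

```python
# Sketch of a Doing|Done column pair whose WIP limit spans both sub-states.

class Column:
    def __init__(self, name, wip_limit):
        self.name = name
        self.doing, self.done = [], []
        self.wip_limit = wip_limit  # spans Doing + Done combined

    def wip(self):
        return len(self.doing) + len(self.done)

    def start(self, card):
        # Starting work is only legal while combined WIP is under the limit,
        # so a queue building in "Done" throttles new work upstream.
        if self.wip() >= self.wip_limit:
            raise RuntimeError(f"WIP limit {self.wip_limit} reached in {self.name}")
        self.doing.append(card)

    def finish(self, card):
        self.doing.remove(card)
        self.done.append(card)  # visible as ready to be pulled downstream

    def pull_from(self, upstream):
        # Downstream pulls from the upstream "Done" queue, never mid-work cards.
        self.start(upstream.done.pop(0))

dev = Column("Dev", wip_limit=3)
test = Column("Test", wip_limit=2)
dev.start("S1"); dev.finish("S1")
test.pull_from(dev)  # testers pull S1 when THEY are ready
print(test.doing)    # ['S1']
```

Because the limit counts “Dev Done” as well as “Dev”, a bottleneck in front of Test quickly stops developers starting new stories - exactly the signal we want.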

Exit Policies

The creator of the Kanban Method (David Anderson) identified 5 core properties of successful Kanban implementations, one of which was “Make Process Policies Explicit”. In designing the states of your Kanban system, you are beginning to fulfill this need by making the key stages of your workflow explicit (and supporting another of the key properties – “Manage Flow”). For the evolving Scrum Team, this is often sufficient as it will be supported by their “Definition of Done” (another explicit policy).

However, at the strategic level (or for a Kanban system that crosses team boundaries) we benefit by taking it to another level and defining “Exit Policies”. An Exit Policy is effectively the “Definition of Done” for a single state. Whilst it is up to the team member(s) (or teams) exactly how they do the work for that state, it is not considered “Done” until it meets the exit policies for the state. These policies should be made visible as part of the Kanban design, and should be subject to review and evolution as part of continuous improvement activities. In the words of Taiichi Ohno – “Without standards there can be no Kaizen”.
Note the explicit exit policies below each state heading in this Portfolio Kanban
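Were you to encode exit policies in a digital tool, the idea reduces to a simple gate: a card may only leave a state once every policy for that state passes. A hypothetical sketch (states and policies invented, not prescriptions from the Kanban Method):

```python
# Sketch: exit policies as explicit, reviewable checks per state.

EXIT_POLICIES = {
    "Analysis": [
        lambda card: card.get("acceptance_criteria"),
        lambda card: card.get("sized"),
    ],
    "Dev": [
        lambda card: card.get("unit_tests_green"),
        lambda card: card.get("peer_reviewed"),
    ],
}

def may_exit(state, card):
    """True only when the card satisfies every exit policy for the state."""
    return all(policy(card) for policy in EXIT_POLICIES[state])

card = {"acceptance_criteria": True, "sized": True, "unit_tests_green": True}
print(may_exit("Analysis", card))  # True
print(may_exit("Dev", card))       # False: not yet peer reviewed
```

Keeping the policies in one visible, editable structure mirrors the physical board: they stay explicit and cheap to evolve during improvement activities.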


Avatars

Every piece of information you can add to a Kanban board is valuable in conveying extra information to support conversation and insight. Most teams are familiar with the practice of creating a set of laminated “avatars” for each team member. When a team member is participating in the work on a card, they add their avatar to the card as a signal. Thus, anyone observing a Kanban and wanting to know who is working on a card gets instant gratification. Incidentally, this is for me one of the biggest failure areas of digital Kanban boards. To my knowledge, the only digital Kanban tool that supports multiple avatars on a single card is LeanKit – a very strange condition in a world centred on collaboration :)

Now to extend the concept. There is no reason to restrict avatars to the representation of individuals. If we create avatars for Dev Teams, we can (for example) understand which dev teams are involved in the implementation of a feature on a feature Kanban. Take it up a layer, and we can create avatars for ARTs and other delivery organizations. Suddenly, we can look at a portfolio Kanban and understand which delivery organizations are involved in the implementation of an Epic.

The cards above are Epics on a Portfolio Kanban. The "Standard Avatars" (with pictures) represent individual BAs, whilst the smaller solid-color avatars represent the impacted delivery organisations (an ART in one case, Business Units in others).


There are many more tips and tricks to creating powerful Kanban visualisations, but these are the three I find myself introducing again and again as I help Scrum teams leverage Kanban techniques and ART Leadership teams implement strategic flow Kanban systems.

Always remember, as +Inbar Oren put it so well, a good Kanban board TALKS:
  • Tells
  • Always Visible
  • Lives
  • Keeps it Simple
  • Self-explanatory