Monday, January 30, 2017

Revamping SAFe's Program Level PI Metrics Part 4/6 - Quality

The Systems Science Institute at IBM has reported that the cost to fix an error after product release was four to five times as much as one uncovered during design, and up to 100 times more than one identified in the maintenance phase”- iSixSigma magazine

Series Context




Introduction


Given the central nature of the “build quality in” mindset to Lean and Agile, my early drafts of the metrics dashboard devoted 3 full categories to quality:
  • Technical Health 
  • Quality 
  • Deployment Health 
The “quality” aspect of the original cut took a lean lens on the traditional “defect/incident” quality metrics, whilst the other two focused on technical quality and “devops” type quality respectively.

I was fortunate enough to both get some great review feedback from +Dean Leffingwell on the drafts and spend some time at a whiteboard brainstorming with him. He dwelt on the fact that I had “too many metrics and too many quadrants” :) As we brainstormed, we came to two conclusions. Firstly, the 3 concepts listed above were just different perspectives on quality – and secondly, we could separate my individual metrics into “the basics everyone should have” and “the advanced things people should have but might take time to incorporate”. The result is the set of basic and advanced definitions below.

One might question the incorporation of highly technical metrics in an executive dashboard, however there are three very good reasons to do so:
  • If our technical practices are not improving, no amount of process improvement will deliver sustainable change. 
  • If our teams are taking lots of shortcuts to deliver value fast, there is no sustainability to the results being achieved and we will wind up “doing fragile not agile”. 
  • If the executives don’t care, the teams are unlikely to. 
The only non-subjective way I know to approach this is through static code analysis. Given the dominance of Sonarqube in this space, I have referenced explicit Sonarqube measures in the definitions. Additionally, effective adoption of Continuous Integration (CI) amongst the developers is not only a critical foundation for DevOps but also an excellent way to validate progress in the “build quality in” mindset space.

On the “traditional quality measurement” front, my core focus is “are we finding defects early or late”? Thus, I look to both evaluate the timing of our validation activities and the level of quality issues escaping the early life-cycle. For deployment health, all inspiration was sourced from DevOps materials and as we re-structured the overall model it became apparent that many of these measures really belonged in the “Speed” quadrant – all that remained in the quality quadrant was clarity on production incidents.

Basic Definitions



Basic Metrics Rationale


Unit Test Coverage %

As I regularly inform participants in the training room, "if you do not aggressively pursue automated testing your agile implementation will fail!"  It is impossible to sustainably employ an iterative and incremental approach to software development without it.

Static analysis tools will not tell you the quality of the unit tests or the meaningfulness of the coverage, but simply having coverage will give the developers confidence to refactor - the key to increasing maintainability.  It should also increase the ratio of first fix resolution, giving confidence that defects can be resolved fast and minor enhancements made without causing unintended side effects.
Further, even if automated functional tests are still on the to-do list, testers who can read unit tests will be able to more effectively adopt risk-based manual testing and thus reduce manual test effort.

Mean Time Between Green Builds (mins)

Note that many ARTs will implement multiple CI cycles – local ones executing on branches and a central master cycle on the mainline.   Whilst branch-level CI cycles might be of interest at the team level, the only one we are interested in at the ART level is the master on the mainline.

Red CI builds are of course an indicator of poor developer quality practices (failure to locally validate code prior to check-in), and most believe the full CI cycle should occur in under 10 minutes to provide an adequate level of timely feedback to the developers, but failure on either of these fronts will naturally extend the time between green builds, so they need not be discretely measured on the dashboard.


Mean Time to Recover from Red build (mins)

Two things will cause this metric to trend in the wrong direction.  One is lack of the Andon mindset (its someone else's fault, or even worse its always red, just ignore it).  The second is failure to regularly commit, resulting in complex change-sets and difficult debugging.  The second is easily identified through the mean time between Green Builds, so the metric enables measurement of the establishment of the Andon mindset among developers.

Late Phase Defects #

The identification and resolution of defects during the execution of a story is evidence of good team quality practices, and should be excluded from any strategic treatment of defect trends.  However, defects identified in functionality associated with a story after its acceptance or in late-phase (integration, performance, security, UAT etc) testing are indicators of a failure to "build quality in".   
Whilst many teams do not formally log defects identified during story development, where this is done there will be a need for classification in the defect management system to separate late phase defects for reporting purposes.

Validation Capacity %

Great agile means a story is accepted once it is in production.  Good agile means it is accepted once it is ready for production.  For most enterprises in the early years of their agile adoption, this seems like a fairy-tale - the DevOps definition of "Unicorns" such as Amazon and Netflix resonates strongly!   
The reality is for some time there will be testing and packaging activities which get batched up and executed late in development.  Typical examples include:
  • User Acceptance Testing - of course, the Product Owner as the embedded customer is meant to do this in good agile but for many they are neither sufficiently knowledgeable nor sufficiently empowered.
  • Integration Testing - in theory redundant if the team is practicing good full-stack continuous integration.  But for all too many, environment constraints prohibit this and lead to extensive use of stubs until late phase.
  • Performance Testing - for many organisations, the performance test environments are congested, hard to book, and take days if not weeks to configure for a performance test run.  
  • Penetration Testing - a highly specialised job with many organisations possessing a handful of skilled penetration testers spread across thousands of developers.
  • Release Documentation
  • Mandated Enterprise Level Integration and Deployment preparation cycles for all changes impacting strategic technology assets.
Given that the backlog "represents the collection of all the things a team needs to do" , all of these activities should appear in backlogs, estimated and prioritized to occur in the appropriate iterations.   It is a simple matter to introduce a categorization to the backlog management tool to flag these items as hardening activities.


Average Severity 1 and 2 Incidents per Deploy

High severity incidents associated with deployments are a critical quality indicator.  Measurement is generally fairly trivial with the appropriate flagging in incident management systems.  However, some debate may exist as to whether an incident is associated with a deployment or simply the exposition of a preexisting condition.  An organisation will need to agree on clear classification standards in order to produce meaningful measures.

Advanced Definitions



Advanced Metrics Rationale

Duplication %

Duplicate code is bad code.  Its simple.  One line of duplicated business logic is a time-bomb waiting to explode.  If this number is trending down, its an indicator developers are starting to refactor, the use of error-prone copy/paste techniques is falling and the maintainability of the source code is going up.  Its potentially debatable whether one measures duplicate blocks or duplicate lines, but given the amount of logic possible to embed in a single line of code I prefer the straight up measurement of duplicated lines.  

Average Cyclomatic Complexity

Cyclomatic complexity is used to measure the complexity of a program by analyzing the number of linearly independent paths through a program's code.    More complexity leads to more difficulty in maintaining or extending functionality and greater reliance on documentation to understand intent.  It can be measured at multiple levels, however from a dash-boarding perspective my interest is in function or method level complexity.  


Average Branch Age at Merge (days)

This metric may take a little more work to capture, but it is well worth the effort.  The modern ideal is of course not to branch at all (branching by abstraction), however the technical sophistication required by developers to achieve this takes some time to achieve.  
Code living in a branch is code that has not been integrated, and thus code that carries risk.  The longer the code lives in a branch, the more effort it takes to merge it back into the mainline and the greater the chance that the merge process will create high levels of late-phase defects.
Whiteboard spotted at Pivotal Labs by @testobsessed

Fault Feedback Ratio (FFR) %

When it comes to defects, we are interested in not just when we find them but how we respond to them.  In his book "Quality Software Management vol 2: First-Order Measurement, Gerry Weinberg introduced me to the concept (along with many other fascinating quality metrics).  Our goal is to determine what happens when we address a defect.  Do we resolve it completely?  Do we introduce other new defects in resolving the first one?  A rising FFR value can indicate poor communication between testers and developers, hacked-in fixes, and deterioration in the maintainability of the application among other things.  According to +Johanna Rothman in this article (), a value of <= 10% is a good sign.
Measuring it should be trivial with appropriate classifications of defect sources and resolution verification activities in the defect management system.

Average Open Defects #

When it comes to open defects, one needs to make a number of local decisions.  Firstly, what severity are we interested in?  Restricting it to high severity defects can hide all kinds of quality risk, but at the same time many low severity defects tend to be more matters of interpretation and often represent minor enhancement requests masquerading as defects.
Further, we need to determine whether we are interested in the open count at the end of the PI or the average throughout the PI.  A Lean focus on building quality in leads me to be more interested in our every-day quality position rather than what we've cleaned up in our end-of-PI rush.

Conclusion

More than for any other quadrant, I wrestled to find a set of quality metrics small enough not to be overwhelming yet comprehensive enough to provide meaningful insight.  At the team level, I would expect significantly more static code analysis metrics (such as “Code Smells”, “Comment Density” and “Afferent Coupling” ) to be hugely valuable.  Kelley Horton of Net Objectives suggested a Defect Density measure based on “# of production defects per 100 story points released”, and “% capacity allocated to technical debt reduction”.  For further inspiration, I can recommend nothing so much as the “Quality Software Management” series by +Gerald Weinberg.


You should name a variable with the same care with which you name a first-born child” – Robert C. Martin, Clean Code










No comments:

Post a Comment