Sunday, March 12, 2017

Improving SAFe Product Management Strategy with OKRs

Introduction

In my last article, I introduced the concept of Objectives and Key Results (OKRs) and provided an illustration based on personal application.  I started with personal OKRs because I wanted to “learn by doing” before testing my thinking on incorporating them into SAFe.  This article will explore the first application in SAFe: Improving Product Strategy.

Late last year, Product Management guru John Cutler wrote a brilliant article called “12 signs you’re working in a Feature Factory”.  Some highlights include:

  • No connection to core metrics.  Infrequent discussions about desired customer and business outcomes.
  • “Success Theater” around shipping with little discussion about impact.
  • Culture of hand-offs.  Front-loaded process in place to “get ahead of the work” so that items are “ready for engineering”.  Team is not directly involved in research, problem exploration, or experimentation and validation.
  • Primary measure of success is delivered features, not delivered outcomes.
  • No measurement.  Teams do not measure the impact of their work.
  • Mismatch between prioritization rigour (deciding what gets worked on) and validation rigour (deciding if it was, in fact, the right thing to work on).

His blog is a gold-mine for Product Managers, and I found much of his thinking resonated with me when it came to the difference between a healthy ART and a sick one.  OKRs are a very handy tool in combating the syndrome.

Introducing OKRs to Feature Definitions

The official guidance says “every feature must have defined benefits”.  We teach it in every class, and our Cost of Delay estimates should be based on those defined benefits, but it is frightening how few ARTs actually define benefits or expected outcomes for their features.  At best, most scratch out a few rarely referenced qualitative statements.

Imagine if every feature that went into a Cost of Delay (COD) estimation session or PI planning had a clearly defined set of objectives and key results!

Following is an illustration based on an imaginary feature for a restaurant wanting to introduce an automated SMS-based reservation reminder capability.
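
As a purely hypothetical sketch (the objective wording and all target figures below are my own invented assumptions, expressed as a simple Python structure for convenience), such a feature OKR might be captured along these lines:

```python
# Hypothetical feature OKR for the SMS reservation reminder feature.
# The objective wording and all target figures are illustrative assumptions.
feature_okr = {
    "feature": "Automated SMS-based reservation reminders",
    "objective": "Make every booking feel confirmed and slash revenue lost to no-shows",
    "key_results": [
        {"name": "No-show rate",                             "baseline": 0.12, "target": 0.06},
        {"name": "Bookings receiving a reminder within 24h", "baseline": 0.00, "target": 0.95},
        {"name": "Diners confirming via SMS reply",          "baseline": 0.00, "target": 0.60},
    ],
}

for kr in feature_okr["key_results"]:
    print(f"{kr['name']}: {kr['baseline']:.0%} -> {kr['target']:.0%}")
```

Whatever form you capture it in, the point is that the feature now carries explicit, measurable outcomes into every Cost of Delay conversation.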


Applying OKRs to PI Objectives

I’ve lost count of the number of long discussions I’ve had about the difference between PI Objectives and Features.  They almost always start the same way.  “Surely our PI Objective is to deliver Feature X”.  Or, in an ART with a lot of component teams it might be “Support Team A in delivering Feature X”.

WRONG! 
 
Good PI objectives are based on outcomes.  Over the course of the planning event, the team unpacks their features and generates stories to capture the ongoing conversation and identify key inter-dependencies.  Eventually, we consolidate what we’ve learned about the desired outcomes by expressing them as PI objectives.  The teams don’t commit to delivering the stories, they commit to achieving the objectives.  They should be working with their product owners all the way through the PI to refine and reshape their backlogs to maximize their outcomes.  

If every feature that enters PI planning has a well-defined set of OKRs, we have a great big hint as to what our objectives are.  An ART with good DevOps maturity may well transpose the Feature OKRs straight into their PI objectives (subject to trade-off decisions made during planning).  They will ship multiple times during the PI, leveraging the feedback from each deployment to guide their ongoing backlog refinement.

SAFe Principle 9 suggests enabling decentralization by defining the economic logic behind the decision rather than the decision itself.   Using well-defined OKRs for our PI objectives provides this economic logic, enabling Product Owners and teams to make far more effective trade-off decisions as they execute.

However, without good DevOps maturity, the features may not have been deployed in market long enough by the end of the PI to measure the predefined Key Results.  In this case, the team needs to define a set of Key Results they will be able to measure by the time the PI is over.  These should at minimum demonstrate some level of incremental validation of the target KRs.  Perhaps they could be based on results from user testing (the UX kind, not UAT), or on measured results in integration test environments.  For example:
  • 50% of test users indicated they were highly likely to recommend the feature to a friend or colleague
  • End-to-end processing time reduced by 50% in test environments. 
Last but not least, we’ve started to solve some of the challenges ARTs experience with the “Business Value” measure on PI objectives.  It should be less subjective during the assignment stage at PI Planning and completely quantitative once we reach Inspect and Adapt!

Outcome Validation

By this point, we’ve started to address a number of the warning signs.  But, unless we’re some distance down the path to good DevOps, we haven’t really done much about validation.  If you’ve read my recent Feature Kanban article, you’ll have noticed that the final stage before we call a feature “done” is “Impact Validation”.

This is the moment for the final leverage of our Feature OKRs.  What happens once the feature is deployed and used in anger?  Do the observed outcomes match expectations?  Do they trigger the generation of an enhancement feature to feed back through our Program Backlog for the next Build/Measure/Learn cycle?  Do they affect the priorities and value propositions of other features currently in the Program Backlog?

Linking it back to the ART PI Metrics dashboard

In the business impact quadrant of my PI metrics model, I proposed that every ART should have a “fitness function” defining the fashion in which the ART could measure its impact on the business.  This function is designed to be enduring rather than varying feature by feature – the intended use is to support trend-based analysis.  The job of effective product managers is, of course, to identify features which will move the strategic needles in the desired direction.  Business Features should be contributing to Business Impacts, Enablers might be either generating learning or contributing to the other three quadrants.

Thus, what we should see is Key Result measures in every feature that exert force on the strategic success measures for the ART.  Our team-level metric models should vary with feature content to enable the teams to steer.  This helps with the first warning sign Cutler identifies:
  1. No measurement.  Teams do not measure the impact of their work.  Or, if measurement happens, it is done in isolation by the product management team and selectively shared.  You have no idea if your work worked.

Conclusion

Whilst OKRs obviously have a lot to offer in the sphere of Product Strategy, this is not the only area in which they might help us with SAFe.  In my next article, I’ll explore applying them to Relentless Improvement.

Sunday, March 5, 2017

An Introduction to Objectives and Key Results (OKRs)

Introduction

In late January I was working on a final draft of one of my metrics articles when someone suggested I take a look at OKRs.   I’d heard of them, but hadn’t really paid much attention other than to file it on my always long backlog of “things I need to learn something about”.  Then a couple of days later I was writing my abstract for a metrics session at the Agile Alliance 2017 conference and in scanning the other submissions in the stream discovered a plethora of OKR talks.  This piqued my interest, so I started reading the OKR submissions.   At this point I panicked a little.  The concept sounded fascinating (and kind of compelling when you realize it’s how Google and Intel work).  I was facing into the idea that my lovingly crafted “SAFe PI Metrics revamp” might be invalidated before I even finished the series.

So, I went digging.  The references that seemed to dominate were Radical Focus by Christina Wodtke, a presentation by Rick Klau of Google, and an OKR guide by Google.  A few days of reading later, I relaxed a little.   I hadn’t read anything that had invalidated my metric model, but I could see a world of synergies.  A number of obvious and compelling opportunities to apply OKR thinking both personally and to SAFe had emerged.

In this article, I will provide a brief overview of the concept and use my early application to personal objective setting as a worked example.  My next article will detail a number of potentially powerful applications in the context of SAFe.

OKRs – an Overview

As Wodtke states in Radical Focus, “this is a system that originated at Intel and is used by folks such as Google, Zynga, LinkedIn and General Assembly to promote rapid and sustained growth.  O stands for Objective, KR for key results.  Objective is what you want to do (Launch a killer game!), Key Results are how you know if you’ve achieved them (Downloads of 25K/day, Revenue of 50K/day). OKRs are set annually and/or quarterly and unite the company behind a vision.”

Google provides some further great clarifying guidance:

  • “Objectives are ambitious and may feel somewhat uncomfortable”
  • “Key results are measurable and should be easy to grade with a number (Google uses a scale of 0-1.0)”
  • “The sweet spot for an OKR grade is 60-70%; if someone consistently fully attains their objectives, their OKRs are not ambitious enough and they need to think bigger”
  • “Pick just three to five objectives – more can lead to overextended teams and a diffusion of effort”.
  • “Determine around three Key Results for each objective”
  • “Key results express measurable milestones which, if achieved, will directly advance the objective”.
  • “Key results should describe outcomes, not activities”

Wodtke dials it up a notch when it comes to ambition:

  • OKRs are always stretch goals.  A great way to do this is to set a confidence level of five out of ten on the OKR.  By a confidence level of five out of ten, I mean “I have confidence I only have a 50/50 shot of making this goal”.
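
To make the grading mechanics concrete, here is a minimal sketch of scoring key results on the 0-1.0 scale quoted above, treating the grade as linear progress toward target.  The targets echo Wodtke’s “killer game” example; the “actual” figures are invented purely for illustration.

```python
# Minimal sketch of grading Key Results on a 0-1.0 scale as linear progress
# toward target, capped at 1.0. Targets echo the "killer game" example above;
# the "actual" figures are invented for illustration.

def grade(actual, target, baseline=0.0):
    """Grade a key result as linear progress from baseline toward target."""
    if target == baseline:
        return 1.0
    return max(0.0, min(1.0, (actual - baseline) / (target - baseline)))

key_results = [
    {"name": "Downloads per day", "target": 25_000, "actual": 17_000},
    {"name": "Revenue per day",   "target": 50_000, "actual": 31_000},
]

grades = [grade(kr["actual"], kr["target"]) for kr in key_results]
for kr, g in zip(key_results, grades):
    print(f"{kr['name']}: {g:.2f}")
print(f"Objective grade: {sum(grades) / len(grades):.2f}  (sweet spot ~0.6-0.7)")
```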


Personal OKRs – a worked example

I’m a great believer in experimenting on myself before I experiment on my customers, so after finishing Radical Focus I decided to try quarterly personal OKRs.  Over the years, I’ve often been asked when I’m going to write a book.  After my colleague +Em Campbell-Pretty  published Tribal Unity last year, the questions started arriving far more frequently.

After a particularly productive spurt of writing over my Christmas holidays, I was starting to think it might be feasible, so I began to toy with the objective of publishing a book by the end of the year.  Em had used a book-writing retreat as a focusing tool, so it seemed like planning to hole up in a holiday house for a few weeks late in the year would be a necessary ingredient, but that was far away (and a large commitment), so the whole thing still felt very daunting.

As I focused in on the first quarter, the idea for an objective to commit to scheduling the writing retreat emerged.  Then the tough part came – what were the quantitative measures that would work as Key Results?  Having now facilitated a number of OKR setting workshops, I'm coming to learn that Objectives are easy, but Key Results are where the brain really has to engage.  My first endeavors to apply it personally were no exception.  Eventually, I came to the realization that I needed to measure confidence.  Confidence that I could generate enough material, and confidence that people would be interested in reading the book.  After a long wrestle, I reached a definition:



Working the Objectives

As Wodtke states when concluding Radical Focus, “The Google implementation of OKRs is quite different than the one I recommend here”.  Given that I had started with her work, I was quite curious to spot the differences as I read through Google’s guidance.  At first, it seemed very subtle.  The biggest difference I could see was that for Google not all objectives were stretch.  Then the gap hit me.  Wodtke added a whole paradigm for working your way towards the objective with a lightweight weekly cycle.   I had loved it, as I’d mentally translated it to “weekly iterations”.

The model was simple.  At the start of each week, set three “Priority 1” goals and two “Priority 2” goals that would move you towards your objectives.  At the end of the week, celebrate your successes and reflect on your insights.  Whilst she had a lot more to offer on application and using the weekly goals as a lightweight communication tool, the simplicity was very appealing personally.  After all, I had a “day job” training and coaching.  My personal objectives were always going to be extra-curricular so I wanted a lightweight focusing tool.

Whilst writing a book was not my only objective, it was one of two, so it features heavily in my weekly priorities.  Following is an example excerpt:

Clarifying notes:
  • Oakleigh is a suburb near me with a big Greek population.  They have a bunch of late-night coffee shops, and it’s my favorite place to write.  I head up at 10pm after the kids are in bed and write until they kick me out around 2am.  This article is being written from my newly found “Brisbane Oakleigh”.
  • In week 2 I utilised OKRs in an executive strategy workshop I was facilitating (with a fantastic result), and shared the concept with a friend who was struggling with goal clarity.  In both cases, I used my writing OKR as an example.

Conclusion

Hopefully by this point you are both starting to understand the notion of the OKR and feeling inspired enough to read some of the source material that covers it more fully.  I can’t recommend Radical Focus highly enough.  It’s an easy read, fable style (think The Goal, The Phoenix Project).

You might have noticed along the way that my priority goal this week was to “Publish OKR article”.  As seems to happen regularly to me, once I start exploring an idea it becomes too big for one post.  I never actually got to the “SAFe” part of OKRs.  To whet your appetite, here are some things I plan to address in the second article:

  • Specifying OKRs as part of a Feature Definition
  • Applying OKR thinking to PI Objectives
  • Applying OKR thinking to Inspect and Adapt and Retrospective Outcomes


Sunday, February 26, 2017

Getting from Idea to Value with the ART Program Kanban

Introduction


Readers of this blog will be no strangers to the fact that I'm a strong believer in the value of kanban visualization at all levels of the SAFe implementation.  Further, if you've been reading my Metrics Series you may have been wondering how to collect all those Speed Metrics.

Given that the Feature is the primary unit of "small batch value flow" in SAFe, effective application of kanban tools to the feature life-cycle is critical in supporting study and improvement of the flow of value for an Agile Release Train (ART).

The first application for most ARTs is enabling flow in the identification and preparation of Features for PI planning.  Many ARTs emerge from their first PI planning promising themselves to do a better job of starting early on identifying and preparing features for their next PI, only to hit panic stations two weeks out from the start of PI 2.  Introducing a kanban to visualize this process is extremely valuable in creating the visibility and momentum needed to solve the problem.

However, I vividly remember the debate my colleague +Em Campbell-Pretty and I had with +Dean Leffingwell and +Alex Yakyma  regarding their proposed implementation of the SAFe 4 Program Kanban over drinks at the 2015 SAFe Leadership Retreat in Scotland.  Their initial cut positioned the job of the program kanban as "delivering features to PI planning", whilst we both felt the life-cycle needed to extend all the way to value realization.  This was in part driven by our shared belief that a feature kanban made a great visualization to support Scrum-of-Scrums during PI execution but primarily by our drive to enable optimization of the full “Idea to Value” life-cycle.   Dean bought in and adjusted the representation in the framework (the graphic depicting the backlog as an interim rather than an end-state was in fact doodled by Em on her iPad during the conversation).

A good Kanban requires a level of granularity appropriate to exposing the bottlenecks, queues and patterns throughout the life-cycle.  Whilst the model presented in SAFe acts much like the Portfolio Kanban in identifying the overarching life-cycle states, it leaves a more granular interpretation as an exercise for the implementer.

Having now built (and rebuilt) many Program Kanban walls over the years while coaching, I've come to a fairly standard starting blueprint (depicted below).  This article will cover the purpose and typical usage of each column in the blueprint.



Note: My previous article on Kanban tips-and-tricks is worthwhile pre-reading in order to best understand and leverage the presented model.  Avatars should indicate both Product Management team members and Development teams associated with the Feature as it moves through its life-cycle.

Background thinking

The fundamental premise of lean is global optimization of the flow from idea to value.  Any truly valuable Feature Kanban will cover this entire life-cycle.  The reality is that many ARTs do not start life in control of the full cycle, but I believe this should not preclude visualization and monitoring of the full flow.  In short, “don’t let go just because you’re not in control”.  If you’re not following the feature all the way into production and market realization you’re missing out on vital feedback.

The Kanban States


Funnel

This is the entry point for all new feature ideas.  They might arrive here as features decomposed out of the Epic Kanban, features decomposed from Value Stream Capabilities, or as independently identified features.  In the words of SAFe, "All new features are welcome in the Feature Funnel".
No action occurs in this state; it is simply a queue with (typically) no exit policies.

Feature Summary

In this state, we prepare the feature for prioritization.  My standard recommendation is that ARTs adopt a "half to one page summary" Feature Template (sample coming soon in a future article).

Exit Policies would typically dictate that the following be understood about the feature in order to support an effective Cost of Delay estimation and WSJF calculation:

  • Motivation (core problem or opportunity)
  • Desired outcomes
  • Key stakeholders and impacted users
  • Proposed benefits (aligned to Cost of Delay drivers)
  • Key dependencies (architectural or otherwise)
  • Very rough size.


Prioritization


Features are taken to a Cost of Delay estimation workshop, WSJF is calculated, and they are either rejected or approved to proceed to the backlog.

Exit Policies would typically indicate:

  • Initial Cost of Delay agreed
  • WSJF calculated
  • Feature has requisite support to proceed.

Backlog

This is simply a holding queue.  We have a feature summary and a calculated WSJF.  Features are stored here in WSJF order, but held to avoid investing more work in analysis until the feature is close to play.  If applying a WIP limit to this state, it would likely be based on ART capacity and limited to 2-3 PIs’ capacity.

Exit Policies would typically surround confirmation that the Feature has been selected as a candidate for the next PI and any key dependencies have been validated sufficiently to support the selection.  I find most Product Management teams will make a deliberate decision at this point rather than just operating on “pull from backlog when ready”.

Next PI Candidate

Again, this state is simply a holding queue.  Movement from the “Backlog” to this state indicates that the Feature can be pulled for “Preparation” when ready.

Generally, there are no exit policies, but I like to place a spanning WIP limit over this and the following state (Preparing).  The logical WIP limit is based on capacity rather than number of features, and should roughly match the single-PI capacity of the ART.
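
As a minimal sketch of how such a capacity-based spanning WIP limit might be checked (the single-PI capacity figure and feature sizes below are assumed numbers, not a recommendation):

```python
# Capacity-based spanning WIP check over "Next PI Candidate" + "Preparing".
# The single-PI capacity and feature sizes are assumed figures for illustration.
PI_CAPACITY = 400  # e.g. normalized points the ART can deliver in a single PI

in_candidate_or_preparing = {
    "Feature A": 120,
    "Feature B": 90,
    "Feature C": 150,
    "Feature D": 80,
}

wip = sum(in_candidate_or_preparing.values())
print(f"Capacity in flight: {wip} / {PI_CAPACITY}")
if wip > PI_CAPACITY:
    print("Spanning WIP limit exceeded - finish preparing before pulling more.")
```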

Preparing

Here, we add sufficient detail to the Feature to enable it to be successfully planned.  The Exit Policy is equivalent to a “Feature Definition of Ready”.  Typically, this would specify the following:
  • Acceptance Criteria Complete
  • Participating Dev Teams identified and briefed
  • Dependencies validated and necessary external alignment reached
  • High level Architectural Analysis complete
  • Journey-level UX complete
  • Required Technical Spikes complete

This is the one state in the Feature Kanban that is almost guaranteed to be decomposed to something more granular when applied.  The reality is that feature preparation involves a number of activities, and the approach taken will vary significantly based on context.   A decomposition I have often found useful is as follows:

  • Product Owner onboarding (affected Product Owners are briefed on the Feature by Product Management and perform some initial research, particularly with respect to expected benefits)
  • Discovery Workshops (led by Product Owner(s) and including affected team(s), architecture, UX and relevant subject matter experts to explore the feature and establish draft acceptance criteria and high level solution options)
  • Finalization (execution of required technical spikes, validation of architectural decisions, finalization of acceptance criteria, updates to size and benefit estimates).


Planned 

The planning event itself is not represented on the Kanban, but following the conclusion of PI planning all features which were included in the plan are pulled from "Preparing" to "Planned".

This is a queue state.  Just because a feature was included in the PI plan does not mean teams are working on it from Day 1 of the PI.  We include it deliberately to provide more accuracy (particularly with respect to cycle time) to the following states.  There are generally no exit policies.
  

Executing 

A feature is pulled into this state the instant the first team pulls the first story for it into a Sprint, and the actual work done here is the build/test of the feature.

Exit policies are based on the completion of all story level build/test activities and readiness for feature level validation.  Determination of appropriate WIP limit strategies for this state will emerge with time and study.  In the beginning, the level of WIP observed here provides excellent insight into the alignment strategy of the teams and the effectiveness of their observation of Feature WIP concepts during PI planning.

Feature Validation

A mature ART will eliminate this state (given that maturity includes effective DevOps).  However, until such time as the ART reaches maturity, the type of activities we expect to occur here are:

  • Feature-level end-to-end testing
  • Feature UAT
  • Feature-level NFR validation

Exit Policies for this state are equivalent to a “Feature Definition of Done”.  They are typically based around readiness of the feature for Release level hardening and packaging.   The size of the queues building in this Done state will provide excellent insights into the batch-size approach being taken to deployments (and "time-in-state" metrics will reveal hard data about the cost of delay of said batching).

Release Validation

Once again, a mature ART will eliminate this state.  Until this maturity is achieved we will see a range of activities occurring here around pre-deployment finalization.
 
Exit Policies will be the equivalent of a "Release Definition of Done", and might include:

  • Regression Testing complete
  • Release-level NFR Validation (eg Penetration, Stress and Volume Testing) complete
  • Enterprise-level integration testing complete
  • Deployment documentation finalized 
  • Deployment approvals sought and granted and deployment window approved


The set of features planned for release packaging will be pulled as a batch from "Feature Validation" into this state, and the set of features to be deployed (hopefully the same) will progress together to “Done” once the exit policies are fulfilled.

Deployment

Yet another "to-be-eliminated" state.  When the ARTs DevOps strategy matures, this state will last seconds - but in the meantime it will often last days.  The batch of features sitting in "Release Hardening" will be simultaneously pulled into this state at the commencement of Production Deployment activities, and moved together to Done at the conclusion of post-deployment verification activities.

Exit Policies will be based on enterprise deployment governance policy.  For many of my clients, they are based on the successful completion of a Business Verification Testing (BVT) activity where a number of key business SMEs manually verify a set of mission-critical scenarios prior to signalling successful deployment.

Operational Readiness

This state covers the finalization of operational readiness activities.  An ART that has matured well along Lean lines will already have performed much of the operational readiness work prior to deployment, but we are interested in the gap between "Feature is available" and "Feature is realizing value".   Typical activities we might see here depend on whether the solution context is internal or external, but might include:

  • Preparation and introduction of Work Instructions
  • Preparation and Delivery of end-user training
  • Preparation and execution of marketing activities
  • Education of sales channel

Exit Policies should be based around “first use in anger” by a production user in a real (non-simulated) context.

Impact Validation

A (hopefully not too) long time ago, our feature had some proposed benefits.  It's time to see whether the hypothesis was correct (the Measure and Learn cycles in Lean Startup).  I typically recommend this state be time-boxed to 3 months.  During this time, we are monitoring the operational metrics which will inform the correlation between expected and actual benefits.

Whilst learning should be harvested and applied regularly throughout this phase, it should conclude with some form of postmortem, with participants at minimum including Product Management and Product Owners but preferably also relevant subject matter experts, the RTE and representative team members.  Insights should be documented, and fed back into the Program Roadmap.
Exit Policies would be based upon the completion of the “learning validation workshop” and the incorporation of the generated insights into the Program Roadmap.

Done

Everybody needs a "brag board"!


Conclusion

Once established, this Kanban will provide a great deal of value.  Among other things, it can support:

  • Visualization and maintenance of the Program Roadmap
  • Management of the flow of features into the PI
  • Visualization of the current state of the PI to support Scrum of Scrums, PO Sync, executive gemba walks, and other execution steering activities.
  • Visualization and management (or at least monitoring) of the deployment, operational rollout and outcome validation phases of the feature life-cycle.
  • Collection of cumulative flow and an abundance of Lead and Cycle time metrics at the Feature level.


Saturday, February 18, 2017

Revamping SAFe's Program Level PI Metrics - Conclusion

“Base controls on relative indicators and trends, not on variances against plan” – Bjarte Bogsnes, Implementing Beyond Budgeting

Introduction

The series began with an overview of a metric model defined to address the following question:
"Is the ART sustainably improving in its ability to generate value through the creation of a passionate, results-oriented culture relentlessly improving both its engineering and product management capabilities?"
The ensuing posts delved into the definitions and rationale for the Business Impact, Culture, Quality and Speed quadrants.  In this final article, I will address dashboard representation, implementation and application.

Dashboard Representation

The model is designed such that the selected set of metrics will be relatively stable unless the core mission of the ART changes.  The only expected change would result from either refinement of the fitness function or incorporation of the advanced measures as the ART becomes capable of measuring them.


Given that our focus is on trend analysis rather than absolutes, my recommendation is that for each measure the dashboard reflects the value for the PI just completed, the previous PI and the average of the last 3 PIs.   Given the assumption that most will initially implement the dashboard in Excel (sample available here), I would further suggest the use of conditional formatting to color-code movement (dark green for strongly positive through dark red for strongly negative).
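
As a minimal sketch of that trend view (the sample figures are invented; note that the direction of “good” movement differs per measure, so it is flagged explicitly):

```python
# Minimal sketch of the trend view described above: latest PI, previous PI,
# a rolling 3-PI average, and a simple colour band for the movement.
history = {
    # (higher_is_better, values oldest -> newest); all sample values are invented
    "Average Feature Lead Time (days)": (False, [190, 170, 150, 120]),
    "Prod Deploys Per PI (#)":          (True,  [1, 1, 2, 3]),
    "Team NPS":                         (True,  [-10, 0, 5, 15]),
}

def colour(delta_pct):
    """Map percentage movement to the conditional-formatting bands described above."""
    if delta_pct >= 10:
        return "dark green"
    if delta_pct > 0:
        return "green"
    if delta_pct == 0:
        return "neutral"
    if delta_pct > -10:
        return "red"
    return "dark red"

for name, (higher_is_better, values) in history.items():
    latest, previous = values[-1], values[-2]
    rolling_3 = sum(values[-3:]) / 3
    delta_pct = 100 * (latest - previous) / abs(previous) if previous else 0.0
    if not higher_is_better:
        delta_pct = -delta_pct  # a falling lead time is a positive movement
    print(f"{name}: latest={latest}, previous={previous}, "
          f"3-PI avg={rolling_3:.1f}, movement={colour(delta_pct)}")
```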

Implementation

In The Art of Business Value, Mark Schwartz proposes the idea of “BI-Driven Development (BIDD?)”.  His rationale?  “In the same sense that we do Test-Driven-Development, we can set up dashboards in the BI or reporting system that will measure business results even before we start writing our code”.

I have long believed that if we are serious about steering product strategy through feedback, every ART should either have embedded analytics capability or a strong reach into the organisation’s analytics capability.  While the applicability extends far beyond the strategic dashboard (i.e. per Feature), I would suggest that the more rapidly one can move from a manually collated and completed spreadsheet to an automated analytics solution, the more effective the implementation will be.

Virtually every metric on the dashboard can be automatically captured, whether it be from the existing enterprise data-warehouse for Business Metrics, the Feature Kanban in the agile lifecycle management tool, Sonarqube, the logs of the Continuous Integration and Version Control tools or the Defect Management System.  Speed and Quality will require deliberate effort to configure tooling such that the metrics can be captured, and hints as to approach were provided in the rationales of the relevant deep-dive articles.  NPS metrics will require survey execution, but are relatively trivial to capture using such tools as Survey Monkey.
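
For example, here is a hedged sketch of pulling two of the technical measures from Sonarqube, assuming the standard measures Web API (api/measures/component) available in recent versions; the server URL, project key and token are placeholders, and the metric keys should be verified against your own instance.

```python
# Hedged sketch: fetch dashboard measures from a SonarQube server, assuming the
# standard measures Web API (api/measures/component). Verify the endpoint and
# metric keys against your server version. URL, project key and token are
# placeholders.
import requests

SONAR_URL = "https://sonarqube.example.com"   # placeholder
PROJECT_KEY = "my-art-component"              # placeholder
TOKEN = "your-api-token"                      # placeholder

resp = requests.get(
    f"{SONAR_URL}/api/measures/component",
    params={"component": PROJECT_KEY,
            "metricKeys": "coverage,duplicated_lines_density"},
    auth=(TOKEN, ""),  # token is passed as the basic-auth username
)
resp.raise_for_status()

for measure in resp.json()["component"]["measures"]:
    print(f"{measure['metric']}: {measure['value']}")
```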

Timing

I cannot recommend base-lining your metrics prior to ART launch strongly enough.  If you do not know where you are beginning, how will you understand the effectiveness of your early days?  Additionally, the insights derived from the period from launch to end of first PI can be applied in improving the effectiveness of subsequent ART launches across the enterprise.

With sufficient automation, the majority of the dashboard can be in a live state throughout the PI, but during the period of manual collation the results should be captured in the days leading up to the Inspect & Adapt workshop.

Application

The correct mindset is essential to effective use of the dashboard.  It should be useful for multiple purposes:
  • Enabling the Portfolio to steer the ART and the accompanying investment strategy
  • Enabling enterprise-level trend analysis and correlation across multiple ARTs
  • Improving the effectiveness of the ART’s Inspect and Adapt cycle
  • Informing the strategy and focus areas for the Lean Agile Centre of Excellence (LACE)
Regardless of application specifics, our focus is on trends and global optimization.  Are the efforts of the ART yielding the desired harvest, and are we ensuring that our endeavors to accelerate positive movement in a particular area are not causing sub-optimizations elsewhere in the system?

It is vital to consider the dashboard as a source not of answers, but of questions.   People are often puzzled by the Taiichi Ohno quote “Data is of course important … but I place the greatest emphasis on facts”.   Clarity lies in appreciating his emphasis on not relying on reports, but rather going to the “gemba”.  For me, the success of the model implementation lies in the number and quality of questions it poses.  The only decisions made in response to the dashboard should be what areas of opportunity to explore – and of course every good question begins with why.   For example:
  • Why is our feature execution time going down but our feature lead time unaffected?
  • Why has our deployment cycle time not reduced in response to our DevOps investment?
  • Why is Business Owner NPS going up while Team NPS is going down?
  • Why is our Program Predictability high but our Fitness Function yield low?
  • Why is our Feature Lead Time decreasing but our number of production incidents rising?

Conclusion

It’s been quite a journey working through this model, and I’m grateful for all the positive feedback I have received along the way.   The process has inspired me to write a number of supplementary articles.  

The first of these is a detailed coverage of the Feature Kanban (also known as the Program Kanban).  Numerous people have queried me as to the most effective way of collecting the Speed Metrics, and this becomes trivial with the development of an effective Feature Kanban (to say nothing of the other benefits).

I’ve also wound up doing a lot of digging into “Objectives and Key Results” (OKRs).  Somehow the growing traction of this concept had passed me by, and when my attention was first drawn to it I panicked at the thought it might invalidate my model before I had even finished publishing it.  However, my research concluded that the concepts were complementary rather than conflicting.  You can expect an article exploring this to follow closely on the heels of my Feature Kanban coverage.

There is no better way to close this series than with a thought from Deming reminding us of the vital importance of mindset when utilising any form of metric.
“People with targets and jobs dependent upon meeting them will probably meet the targets – even if they have to destroy the enterprise to do it.” – W. Edwards Deming
 

Sunday, February 5, 2017

Revamping SAFe's Program Level PI Metrics Part 5/6 - Speed

“Changing the system starts with changing your vantage point so you can ‘see’ the system differently.  Development speed is often attributed to quick decisions.  Early definition of the requirements and freezing specification quickly are often highlighted as keys to shortening the product development cycle.  Yet the key steps required to bring a new product to market remain the creation and application of knowledge, regardless of how quickly the requirements are set.  The challenge in creating an effective and efficient development system lies in shortening the entire process.” – Dantar Oosterwal, The Lean Machine.

Series Context

Part 1 – Introduction and Overview
Part 2 – Business Impact Metrics
Part 3 – Culture Metrics
Part 4 – Quality Metrics
Part 5 – Speed Metrics (You are here)
Part 6 – Conclusion and Implementation


Introduction

As mentioned in my last post, the categorization of metrics went through some significant reshaping in the review process.  The “Speed” (or “Flow”) quadrant didn’t exist, with its all-important metrics divided between “Business Impact” and “Deployment Health”.  

Lead Time is arguably the most important metric in Lean, as evidenced by Taiichi Ohno’s famous statement that “All we are doing is looking at the customer time line, from the moment the customer gives us the order to the point when we collect the cash”.  Not only does it measure our (hopefully) increasing ability to respond rapidly to opportunity, but it is also a critical ingredient in enabling a focus on global rather than local optimization.

In this quadrant, the core focus is on two perspectives on Lead Time.  The first (Feature Lead Time) relates to the delivery of feature outcomes, and the second (MTTR from Incident) to our ability to rapidly recover from production incidents.

The other proposed metrics highlight the cycle time of key phases in the idea-to-value life-cycle as an aid to understanding “where we are slow, and where we are making progress”.  In particular, they will highlight failure to gain traction in XP and DevOps practices.

There is, however, a caveat.  Many (if not most) Agile Release Trains do not begin life in control of the entire idea-to-value life-cycle.  On the one hand, it’s very common for features to be handed off to an enterprise release management organisation for production release.  On the other, whilst Lean principles are at the heart of SAFe, the framework centers on hardware/software development.  The (traditionally business) skill-sets in areas such as operational readiness, marketing and sales required to move from “deployed product” to “value-generating product” are nowhere on the big picture.

ARTs focused on bringing to life the SAFe principles will address these gaps as they inspect and adapt, but in the meantime  there is a temptation to “not measure what we are not in control of”.  As a coach, I argue that ARTs should “never let go until you’ve validated the outcome”.  You may not be in control, but you should be involved – if for nothing else than in pursuit of global optimization.  

Basic Definitions


Basic Metrics Rationale

Average Feature Lead Time (days)

This is the flagship metric.   However, the trick is determining "when the timer starts ticking".   For an ART maintaining the recommended 3-PI roadmap, feature lead time would rarely be shorter than a depressing 9 months.  
To measure it, one needs two things: a solid Feature Kanban, and agreement on which stage triggers the timer.  A good feature kanban will of necessity be more granular than the sample illustrated in the framework (fuel for a future post), but the trigger point I most commonly look for is "selection for next PI".  In classic kanban parlance, this is the moment when a ticket moves from "backlog" to "to do", and in most ARTs it triggers the deeper preparation activities necessary to prepare a feature for PI planning.  The end-point for the measure is the moment at which the feature starts realizing value; this is dependent on solution context, often triggered by deployment for digital solutions but only after business change management activities for internal solutions.
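
A minimal sketch of the calculation, assuming the kanban tool can export state-entry timestamps; the trigger states follow the blueprint described in my Feature Kanban article and the dates are invented:

```python
# Minimal sketch of computing Average Feature Lead Time from kanban
# state-transition timestamps. Start trigger: entry to "Next PI Candidate"
# (selection for next PI); end trigger: entry to "Done" (value being realized).
# The transition data below is invented for illustration.
from datetime import date

transitions = {
    "Feature A": {"Next PI Candidate": date(2016, 8, 1),  "Done": date(2017, 1, 20)},
    "Feature B": {"Next PI Candidate": date(2016, 9, 15), "Done": date(2017, 2, 10)},
}

lead_times = [
    (states["Done"] - states["Next PI Candidate"]).days
    for states in transitions.values()
    if "Done" in states and "Next PI Candidate" in states
]

print(f"Average Feature Lead Time: {sum(lead_times) / len(lead_times):.0f} days")
```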

Average Deployment Cycle Time (days)

This metric was inspired by the recently released DevOps Handbook by Gene Kim and friends.  In essence, we want to measure “time spent in the tail”.  I have known ART after ART that accelerated their development cycle whilst never making inroads on their path to production.  If everything you build has to be injected into a 3-month enterprise release cycle, it’s almost pointless accelerating your ability to build!
Whilst our goal is to measure this in minutes, I have selected days as the initial unit because, for most large enterprises, the starting point will be weeks if not months.

Average Mean Time to Restore (MTTR) from Incident (mins)

When a high severity incident occurs in production, how long does it take us to recover?  In severe cases, these incidents can cause losses of millions of dollars per hour.  Gaining trust in our ability to safely deploy regularly can only occur with demonstrated ability to recover fast from issues.  Further, since these incidents are typically easy to quantify in bottom-line impact, we gain the ability to start to measure the ROI of investment in DevOps enablers.

Prod Deploys Per PI (#)

Probably the simplest measure of all listed on the dashboard - how frequently are we deploying and realizing value?

Advanced Definitions


Advanced Metrics Rationale

Average Feature Execution Cycle Time (days)

This is one of the sub-phases of the lead time which are worth measuring in isolation, and is once again dependent on the presence of an appropriately granular feature kanban.  
The commencement trigger is "first story played", and the finalization trigger is "feature ready for deployment packaging" (satisfies Feature Definition of Done).  The resultant measure will be an excellent indicator of train behaviors when it comes to Feature WIP during the PI.  Are they working on all features simultaneously throughout the PI or effectively collaborating across teams to shorten the execution cycles at the feature level?

One (obvious) use of the metric is determination of PI length.  Long PIs place an obvious overhead on Feature Lead Time, but if average Feature Execution time is 10 weeks it’s pointless considering an 8-week PI.

Average Deploy to Value Cycle Time (days)

This sub-phase of feature lead time measures "how long a deployed feature sits on the shelf before realizing value".  
The commencement trigger is "feature deployed", and the finalization trigger is "feature used in anger".  It will signal the extent to which true system level optimization is being achieved, as opposed to local optimization for software build.  In a digital solution context it is often irrelevant (unless features are being shipped toggled-off in anticipation of marketing activities), but for internal solution contexts it can be invaluable in signalling missed opportunities when it comes to organizational change management and business readiness activities.

Average Deployment Outage (mins)

How long an outage will our users and customers experience in relation to a production deployment?  Lengthy outages will severely limit our aspirations to deliver value frequently.  

Conclusion

We’ve now covered all 4 quadrants and their accompanying metrics.  The next post will conclude the series with a look at dashboard representation, implementation and utilisation. 

“High performers [in DevOps practices] were twice as likely to exceed profitability, market share, and productivity goals.  And, for those organizations that provided a stock ticker symbol, we found that high performers had 50% higher market capitalization growth over three years.” – Gene Kim, Jez Humble, Patrick Debois, John Willis, The DevOps Handbook


Monday, January 30, 2017

Revamping SAFe's Program Level PI Metrics Part 4/6 - Quality

“The Systems Science Institute at IBM has reported that the cost to fix an error after product release was four to five times as much as one uncovered during design, and up to 100 times more than one identified in the maintenance phase” – iSixSigma magazine

Series Context




Introduction


Given the central nature of the “build quality in” mindset to Lean and Agile, my early drafts of the metrics dashboard devoted 3 full categories to quality:
  • Technical Health 
  • Quality 
  • Deployment Health 
The “quality” aspect of the original cut took a lean lens on the traditional “defect/incident” quality metrics, whilst the other two focused on technical quality and “DevOps”-type quality respectively.

I was fortunate enough to both get some great review feedback from +Dean Leffingwell on the drafts and spend some time at a whiteboard brainstorming with him. He dwelt on the fact that I had “too many metrics and too many quadrants” :) As we brainstormed, we came to two conclusions. Firstly, the 3 concepts listed above were just different perspectives on quality – and secondly, we could separate my individual metrics into “the basics everyone should have” and “the advanced things people should have but might take time to incorporate”. The result is the set of basic and advanced definitions below.

One might question the incorporation of highly technical metrics in an executive dashboard, however there are three very good reasons to do so:
  • If our technical practices are not improving, no amount of process improvement will deliver sustainable change. 
  • If our teams are taking lots of shortcuts to deliver value fast, there is no sustainability to the results being achieved and we will wind up “doing fragile not agile”. 
  • If the executives don’t care, the teams are unlikely to. 
The only non-subjective way I know to approach this is through static code analysis. Given the dominance of Sonarqube in this space, I have referenced explicit Sonarqube measures in the definitions. Additionally, effective adoption of Continuous Integration (CI) amongst the developers is not only a critical foundation for DevOps but also an excellent way to validate progress in the “build quality in” mindset space.

On the “traditional quality measurement” front, my core focus is “are we finding defects early or late”? Thus, I look to both evaluate the timing of our validation activities and the level of quality issues escaping the early life-cycle. For deployment health, all inspiration was sourced from DevOps materials and as we re-structured the overall model it became apparent that many of these measures really belonged in the “Speed” quadrant – all that remained in the quality quadrant was clarity on production incidents.

Basic Definitions



Basic Metrics Rationale


Unit Test Coverage %

As I regularly inform participants in the training room, "if you do not aggressively pursue automated testing, your agile implementation will fail!"  It is impossible to sustainably employ an iterative and incremental approach to software development without it.

Static analysis tools will not tell you the quality of the unit tests or the meaningfulness of the coverage, but simply having coverage will give the developers confidence to refactor - the key to increasing maintainability.  It should also increase the ratio of first fix resolution, giving confidence that defects can be resolved fast and minor enhancements made without causing unintended side effects.
Further, even if automated functional tests are still on the to-do list, testers who can read unit tests will be able to more effectively adopt risk-based manual testing and thus reduce manual test effort.

Mean Time Between Green Builds (mins)

Note that many ARTs will implement multiple CI cycles – local ones executing on branches and a central master cycle on the mainline.   Whilst branch-level CI cycles might be of interest at the team level, the only one we are interested in at the ART level is the master on the mainline.

Red CI builds are of course an indicator of poor developer quality practices (failure to locally validate code prior to check-in), and most believe the full CI cycle should complete in under 10 minutes to provide timely feedback to the developers.  Failure on either of these fronts will naturally extend the time between green builds, so they need not be discretely measured on the dashboard.


Mean Time to Recover from Red build (mins)

Two things will cause this metric to trend in the wrong direction.  One is lack of the Andon mindset (it’s someone else’s fault, or even worse, it’s always red, just ignore it).  The second is failure to commit regularly, resulting in complex change-sets and difficult debugging.  The latter is easily identified through the Mean Time Between Green Builds, so this metric enables measurement of the establishment of the Andon mindset among developers.
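
A minimal sketch of deriving both build metrics from a mainline CI history; the build log entries below are invented, and parsing your CI tool’s actual export format is left out:

```python
# Minimal sketch of deriving "Mean Time Between Green Builds" and
# "Mean Time to Recover from Red Build" from a mainline CI build history.
# The build log below is invented for illustration.
from datetime import datetime

builds = [  # (finished_at, result) for the mainline CI cycle, oldest -> newest
    (datetime(2017, 1, 30, 9, 0),  "green"),
    (datetime(2017, 1, 30, 9, 40), "red"),
    (datetime(2017, 1, 30, 10, 5), "green"),
    (datetime(2017, 1, 30, 11, 0), "green"),
]

green_times = [t for t, result in builds if result == "green"]
gaps = [(b - a).total_seconds() / 60 for a, b in zip(green_times, green_times[1:])]
print(f"Mean time between green builds: {sum(gaps) / len(gaps):.0f} mins")

recoveries = []
red_since = None
for t, result in builds:
    if result == "red" and red_since is None:
        red_since = t                      # build just went red
    elif result == "green" and red_since is not None:
        recoveries.append((t - red_since).total_seconds() / 60)
        red_since = None                   # back to green
print(f"Mean time to recover from red build: {sum(recoveries) / len(recoveries):.0f} mins")
```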

Late Phase Defects #

The identification and resolution of defects during the execution of a story is evidence of good team quality practices, and should be excluded from any strategic treatment of defect trends.  However, defects identified in functionality associated with a story after its acceptance or in late-phase (integration, performance, security, UAT etc) testing are indicators of a failure to "build quality in".   
Whilst many teams do not formally log defects identified during story development, where this is done there will be a need for classification in the defect management system to separate late phase defects for reporting purposes.

Validation Capacity %

Great agile means a story is accepted once it is in production.  Good agile means it is accepted once it is ready for production.  For most enterprises in the early years of their agile adoption, this seems like a fairy-tale - the DevOps definition of "Unicorns" such as Amazon and Netflix resonates strongly!   
The reality is for some time there will be testing and packaging activities which get batched up and executed late in development.  Typical examples include:
  • User Acceptance Testing - of course, the Product Owner as the embedded customer is meant to do this in good agile but for many they are neither sufficiently knowledgeable nor sufficiently empowered.
  • Integration Testing - in theory redundant if the team is practicing good full-stack continuous integration.  But for all too many, environment constraints prohibit this and lead to extensive use of stubs until late phase.
  • Performance Testing - for many organisations, the performance test environments are congested, hard to book, and take days if not weeks to configure for a performance test run.  
  • Penetration Testing - a highly specialised job with many organisations possessing a handful of skilled penetration testers spread across thousands of developers.
  • Release Documentation
  • Mandated Enterprise Level Integration and Deployment preparation cycles for all changes impacting strategic technology assets.
Given that the backlog "represents the collection of all the things a team needs to do", all of these activities should appear in backlogs, estimated and prioritized to occur in the appropriate iterations.   It is a simple matter to introduce a categorization to the backlog management tool to flag these items as hardening activities.
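
One plausible reading of the Validation Capacity measure is then the share of delivered capacity carrying that hardening categorization; a minimal sketch (the backlog items and point values are invented, and "hardening" is the assumed category flag):

```python
# Minimal sketch of computing Validation Capacity % as the share of delivered
# capacity flagged as late-phase validation/hardening work. The backlog items
# and point values are invented; "hardening" is the assumed category flag.

backlog_items = [
    {"name": "Story: reminder opt-out",    "points": 5, "category": "feature"},
    {"name": "UAT cycle for reservations", "points": 8, "category": "hardening"},
    {"name": "Performance test run",       "points": 5, "category": "hardening"},
    {"name": "Story: SMS gateway retry",   "points": 3, "category": "feature"},
]

total = sum(item["points"] for item in backlog_items)
hardening = sum(item["points"] for item in backlog_items if item["category"] == "hardening")
print(f"Validation Capacity: {100 * hardening / total:.0f}%")
```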


Average Severity 1 and 2 Incidents per Deploy

High severity incidents associated with deployments are a critical quality indicator.  Measurement is generally fairly trivial with the appropriate flagging in incident management systems.  However, some debate may exist as to whether an incident is associated with a deployment or simply the exposition of a preexisting condition.  An organisation will need to agree on clear classification standards in order to produce meaningful measures.

Advanced Definitions



Advanced Metrics Rationale

Duplication %

Duplicate code is bad code.  It’s simple.  One line of duplicated business logic is a time-bomb waiting to explode.  If this number is trending down, it’s an indicator that developers are starting to refactor, the use of error-prone copy/paste techniques is falling, and the maintainability of the source code is going up.  It’s potentially debatable whether one measures duplicate blocks or duplicate lines, but given the amount of logic possible to embed in a single line of code I prefer the straight-up measurement of duplicated lines.

Average Cyclomatic Complexity

Cyclomatic complexity is used to measure the complexity of a program by analyzing the number of linearly independent paths through a program's code.    More complexity leads to more difficulty in maintaining or extending functionality and greater reliance on documentation to understand intent.  It can be measured at multiple levels, however from a dash-boarding perspective my interest is in function or method level complexity.  
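
As a small worked example, using the common "decision points + 1" counting rule (exact counting rules vary slightly between tools, so treat the number as indicative; the function itself is invented):

```python
# Invented function to illustrate the "decision points + 1" counting rule.
def classify_booking(party_size, has_reminder):
    if party_size > 8:        # decision point 1
        return "large"
    elif not has_reminder:    # decision point 2
        return "at-risk"
    else:
        return "standard"

# Cyclomatic complexity = 2 decision points + 1 = 3 for this function.
# Keeping the method-level average low is what the dashboard trend tracks.
```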


Average Branch Age at Merge (days)

This metric may take a little more work to capture, but it is well worth the effort.  The modern ideal is of course not to branch at all (branching by abstraction), however the technical sophistication required by developers to achieve this takes some time to achieve.  
Code living in a branch is code that has not been integrated, and thus code that carries risk.  The longer the code lives in a branch, the more effort it takes to merge it back into the mainline and the greater the chance that the merge process will create high levels of late-phase defects.
Whiteboard spotted at Pivotal Labs by @testobsessed

Fault Feedback Ratio (FFR) %

When it comes to defects, we are interested in not just when we find them but how we respond to them.  In his book "Quality Software Management vol 2: First-Order Measurement", Gerry Weinberg introduced me to the concept (along with many other fascinating quality metrics).  Our goal is to determine what happens when we address a defect.  Do we resolve it completely?  Do we introduce other new defects in resolving the first one?  A rising FFR value can indicate poor communication between testers and developers, hacked-in fixes, and deterioration in the maintainability of the application, among other things.  According to +Johanna Rothman in this article, a value of <= 10% is a good sign.
Measuring it should be trivial with appropriate classifications of defect sources and resolution verification activities in the defect management system.
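
A minimal sketch under one common reading of FFR (the share of verified fixes that bounce back, either reopened or spawning a new defect); the defect records are invented, and the exact definition should be standardized locally against Weinberg’s and Rothman’s treatments:

```python
# Minimal sketch of a Fault Feedback Ratio calculation, reading FFR as the
# share of fixes that bounce back (reopened, or spawning a new defect) out of
# all fixes verified in the period. The defect records below are invented.

fixes_verified = [
    {"id": "D-101", "outcome": "closed"},
    {"id": "D-102", "outcome": "reopened"},
    {"id": "D-103", "outcome": "closed"},
    {"id": "D-104", "outcome": "closed"},
    {"id": "D-105", "outcome": "caused_new_defect"},
]

bounced = sum(1 for f in fixes_verified if f["outcome"] != "closed")
ffr = 100 * bounced / len(fixes_verified)
print(f"Fault Feedback Ratio: {ffr:.0f}%  (<= 10% is generally considered healthy)")
```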

Average Open Defects #

When it comes to open defects, one needs to make a number of local decisions.  Firstly, what severity are we interested in?  Restricting it to high severity defects can hide all kinds of quality risk, but at the same time many low severity defects tend to be more matters of interpretation and often represent minor enhancement requests masquerading as defects.
Further, we need to determine whether we are interested in the open count at the end of the PI or the average throughout the PI.  A Lean focus on building quality in leads me to be more interested in our every-day quality position rather than what we've cleaned up in our end-of-PI rush.

Conclusion

More than for any other quadrant, I wrestled to find a set of quality metrics small enough not to be overwhelming yet comprehensive enough to provide meaningful insight.  At the team level, I would expect significantly more static code analysis metrics (such as “Code Smells”, “Comment Density” and “Afferent Coupling”) to be hugely valuable.  Kelley Horton of Net Objectives suggested a Defect Density measure based on “# of production defects per 100 story points released”, and “% capacity allocated to technical debt reduction”.  For further inspiration, I can recommend nothing so much as the “Quality Software Management” series by +Gerald Weinberg.


“You should name a variable with the same care with which you name a first-born child” – Robert C. Martin, Clean Code










Wednesday, January 25, 2017

Revamping SAFe's Program Level PI Metrics Part 3/6: Culture

"Organizational culture can be a major asset or a damaging liability that hinders all efforts to grow and become more successful. Measuring and managing it is something few companies do well." - Mark Graham Brown, Business Finance Magazine



Introduction

After exploring the Business Impact quadrant in Part 2 of this series, our focus now moves to Culture. I have been involved with over 30 release trains since I started working with SAFe in early 2012, and I have come to the passionate belief over that time that positive movement in culture is the most accurate predictor of sustained success.

While most agree that it is impossible to truly measure culture, there are certainly indicators that can be measured which help us in steering our path.

In selecting the mix of measures proposed, I was looking for a number of elements:
  • Are our people happy?
  • Are our stakeholders happy?
  • Are we becoming more self-organizing?
  • Are we breaking down silos?

The basic metrics address the first 2 elements, while the advanced metrics tackle self-organization and silos.

Basic Definitions



Basic Metrics Rationale

Team Net Promoter Score (NPS) - "Are our people happy?"

In his book The Ultimate Question 2.0, Fred Reichheld describes the fashion in which many companies also apply NPS surveys to their employees - altering the question from "how likely are you to recommend [Company Name]" to "how likely are you to recommend working for [Company Name]".

My recommendation is that the question is framed as "how likely are you to recommend being a member of [Release Train name]?". Survey Monkey provides a very easy mechanism for running the surveys.
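
For reference, the underlying arithmetic is simple: respondents scoring 9-10 are promoters, 0-6 are detractors, and NPS is the percentage of promoters minus the percentage of detractors.  A minimal sketch with invented responses:

```python
# Minimal sketch of calculating a Team NPS from survey scores (0-10 scale):
# promoters score 9-10, detractors 0-6, and NPS = %promoters - %detractors.
# The survey responses below are invented for illustration.

responses = [10, 9, 9, 8, 7, 7, 6, 5, 9, 10, 3, 8]

promoters  = sum(1 for r in responses if r >= 9)
detractors = sum(1 for r in responses if r <= 6)
nps = 100 * (promoters - detractors) / len(responses)
print(f"Team NPS: {nps:+.0f}")  # ranges from -100 to +100
```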

For a more detailed treatment, see this post by my colleague +Em Campbell-Pretty. Pay particular attention to the value of the verbatims and the inclusion of vendor staff in the survey – they’re team members too!

As a coach, I often ponder what “mission success” looks like.  What is the moment when the ART I’ve been nurturing is set for greatness and my job is done?  Whilst not enough of my ARTs have adopted the Team NPS discipline to give me great data, I have developed a belief based on the data I do have that the signal is Team NPS moving above +20.

Business Owner Net Promoter Score (NPS) - "Are our stakeholders happy?"

This is a more traditional treatment of NPS based on the notion that business owners are effectively internal customers of the ART. The question is framed as "how likely are you to recommend the services of [Release Train Name] to a friend or colleague?"

If you’re truly serious about the Lean mindset, you will be considering your vendors when you identify the relevant Business Owners for this metric. There is vendor involvement in virtually every ART I work with, team-members sourced from vendors are a key part of our culture, and vendor management need to be satisfied the model is working for their people and their organization.

Staff Turnover %

In one sense, this metric could be focused on "Are our people happy", however I believe it is more holistic in nature. Staff turnover can be triggered either by people being unhappy and leaving, or by lack of organizational commitment to maintaining long-lived train membership. Either will have negative impacts.

Advanced Definitions


Advanced Metrics Rationale

Developer % (IT) - "Are we becoming more self-organizing?"

When an ART is first formed, it classically finds “a role in SAFe” for all relevant existing IT staff (often a criticism of SAFe from the “anti-SAFe crowd”). However, as it matures and evolves, the people might stay but their activities change. People who have spent years doing nothing but design start writing code again. Great business analysts move from the IT organisation to the business organisation. Project managers either return to a practical skill they had prior to becoming project managers or roll off the train. In short, the only people who directly create value in software development are software developers. All other IT roles are useful only in so far as they enable alignment (and the greater our self-organisation maturity, the less the need for dedicated alignment functions). If we seek true productivity gains, we seek a greater proportion of doers.

One of my customers started using this metric to measure progress on this front and I loved it. One of the early cost-saving aspects of agile is reduction in management overhead, whether it be the instant win of preventing duplication of management functions between the implementing organization and their vendors or the conversion of supervision roles (designers, project managers) to contribution roles.

Obviously, this is a very software-centric view of the ART. As the “Business %” metric will articulate, maturing ARTs will tend to deliberately incorporate more people with skills unrelated to software development. Thus, this measure focuses on IT-sourced Train members (including leadership) who are developers.

As a benchmark, the (Federal Government) organization who inspired the incorporation of this metric had achieved a ratio of 70%.

Business % - "Are we breaking down silos?"

While most ARTs begin life heavily staffed by IT roles, as the mission shifts towards global optimization of the “Idea to Value” life-cycle they discover the need for more business-related roles. This might be the move from “proxy Product Owners” to real ones, but equally powerfully it sees the incorporation of business readiness skill-sets such as business process engineering, learning and development, and marketing.

Whilst the starting blueprint for an ART incorporates only 1 mandatory business role (the Product Manager) and a number of recommended business roles (Product Owners), evolution should see this mix change drastically.

The purpose of this measure could easily have been written as "Are we achieving system-level optimization?", however my personal bent for the mission of eliminating the terms "business" and "IT" led to the silo focus in the question.

Conclusion

When it comes to culture, I have a particular belief in the power of a change in language employed to provide acceleration. A number of ARTs I coach are working hard to eliminate the terms “Business” and “IT” from their vocabulary, but the most powerful language change you can make is to substitute the word “person” for “resource”!


Series Context

Part 1 – Introduction and Overview
Part 2 – Business Impact Metrics
Part 3 – Culture Metrics (You are here)
Part 4 – Quality Metrics
Part 5 – Speed Metrics 
Part 6 – Conclusion and Implementation

“Instead of trying to change mindsets and then change the way we acted, we would start acting differently and the new thinking would follow.” – David Marquet, Turn the Ship Around.