Chapter 1 - Failure is not an option (more likely, it's a requirement)
RFC 1925-1: It has to work.
This is not a book about a new software development methodology or paradigm. There are plenty of those, each hawking its author's supposedly exclusive solution to developing high quality software under a constrained (or at least predictable) schedule and budget. Instead, this is a book about how human nature and the nature of collaborative problem solving conspire to constrain how rapidly a given software development effort can be accomplished under any methodology, process, paradigm, technological tool set, or accidental collection of practices. These constraints actually apply to solving any complex engineering problem, but it is the world of software development that continually defies past lessons to the contrary and attempts to solve too big a problem in too little time.
Severely schedule constrained projects rarely succeed and often take longer, cost more and produce a lower quality product than if a more rational effort had been attempted. Miracles sometimes occur (although rarely on demand) and software teams continue to attempt such projects regardless of the odds against successfully meeting the project goals within the abbreviated schedule. The only predictable result of such folly will be a staff so badly burnt out that such projects have earned the sobriquet of “death march.”
The Death March
The following story is a real life example of just such an effort: the result of engineering hubris, bad planning, predictable mistakes and managerial edict. I was not privy to the management decision to embark on this fiasco although I’m sure the reasons were compelling and the options apparently few. On the other hand, this was just the latest and most absurd demand from upper management for the development organization to produce not just a product but a miracle. I have kept a variety of details out even though, as this book was being written, I decided to end my employment rather than endure the continued pressure to manifest such miracles. The dates, proposed and actual schedules, code size, technical details and, especially, the eventual outcome of the development effort are all very real.
The kick-off meeting in mid-October 2004 was intended to get the team fired up over the challenge of the upcoming effort. Contractors would be brought in, along with lunches, dinners and snacks. If the ship date of December 31 was achieved, the reward was to be an all-expenses-paid weekend at the Snowmass ski resort along with a hefty cash bonus. The “consolation prize” if we shipped by mid-January 2005 would be just the cash bonus. This effort would approximately double the product's code base to nearly half a million lines of C, C++, Java, Perl, and shell scripts, and the future success, even survival, of the company rested on our ability to successfully execute the plan. Management hoped that several chunks of code that had already been written as prototypes or that were in progress would provide the badly needed “head start” required for meeting the schedule. Some of this “early code” involved a re-write of the product's core “engine” and had originally been slated for a separate release that was now being folded into the new, massive effort.
Throughout the development effort, short-term thinking prevailed over long-term consequences, with everyone focused on making the ship date even after both of our original ship dates were long past. Given the size of the monetary reward, the short-term thinking wasn’t a surprise, and the management pressure to complete the project in spite of having “blown” the original deadlines became even more intense after those milestones sailed past. Corners were cut at every opportunity, with any pretense of following accepted software engineering practices thrown unceremoniously out the window. Development techniques such as “cut-and-paste sub-classing” showed that even an object oriented programming language such as Java presented no obstacle to bad development practices. Negative feedback, or even pointing out obvious flaws in user interface design reviews (there were no open design or code reviews for internal functions), was generally met with a response of, “It’s not in the MRD (Marketing Requirements Document).” This occurred even for trivial functionality that had no business in such a system level specification. Of course, there were no testable functional specifications to arbitrate such disputes since there obviously wasn't time to develop such documents, just collections of programmer notes along with a few e-mails of various vintages from the product manager.
As the holidays approached, tensions within the development team flared. Those with families or an outside life were not as willing to put in long hours or work weekends. What little team spirit had previously existed within the development group dissolved into a number of competing, almost warring cliques. After the holidays, with the original ship dates now history, new targeted completion dates sprang into existence and then rapidly sailed into the past. This occurred as newly identified but obviously required functionality was found, and the obligatory counter argument that it wasn't called for in the MRD added to the delay in implementation.
Throughout the product there was no guarantee that apparently identical functionality (e.g., go to next page of report) would be implemented by common code. Instead, some minor, nuanced difference in how a function was being used would frequently result in a cut-and-paste, almost duplicate of the original function. In addition, minor changes to functionality in one part of the program frequently resulted in unpredictable breakage elsewhere as hard coded values and data range assumptions weren’t met. Unexpected values in the input data frequently crashed various pieces of the application since there wasn’t time to develop a single robust access method to the raw data. Again, finding the error and fixing it in one location was no guarantee that the identical error wouldn’t occur elsewhere due to yet another copy of the originally flawed logic that had not yet been tested.
Integration and testing was the absolute hell that is the one predictable output from such development techniques. Previous releases had hardly been examples of the software engineering art, so contract developers frequently misunderstood what little abstract design existed before this effort and failed to realize that broken, leaky abstractions[1] were the rule, not the exception. Every user action had to be fully tested in every location it occurred and with as many possible data values as the test team could imagine and find time to throw at it. Such testing was required since extensive cut-and-paste sub-classing meant there was no guarantee that any action or function that worked in one location would be implemented as common code and could reasonably be expected to work in another location. The “head start” early code rarely “fit” with the newly developed functionality, and the two pieces of code were spliced together with patches that neither developer considered correct.
Every piece of code that handled the raw data had to be independently patched to prevent ill-behaved raw data from crashing that particular piece of the program. No attempt had been made to capture all such processing in a well-engineered collection of robust access methods. According to management, determining the requirements and design for such a common module would have obviously taken far too long. Code leaks spewed everything from lost file handles to megabytes of memory. Throughout the program, each of the various components of the program tried to separately handle all possible data cases with no clear, common abstraction. Cut-and-pasted copies of code would invariably duplicate the logic errors of the original, while unique approaches would further baffle both testers and developers by their own unique behavior and bugs.
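As an illustration of what a single robust access method might have looked like, here is a minimal sketch in Python. Everything in it is hypothetical: the record layout (an identifier and a numeric value separated by a comma), the names Reading, parse_reading and load_readings, and the valid range are assumptions made for the example, not details of the actual product. The point is only that every caller goes through one validating function, so ill-behaved raw data is rejected in one place and a fix made there applies everywhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    """One validated record. The fields are hypothetical."""
    source: str
    value: float


def parse_reading(raw: str, low: float = 0.0, high: float = 1000.0) -> Optional[Reading]:
    """Return a validated Reading, or None if the raw line is ill-behaved.

    The range limits are assumed values; in a real system they would come
    from the (missing) functional specification.
    """
    parts = raw.strip().split(",")
    if len(parts) != 2:
        return None                      # wrong number of fields
    source, text = parts[0].strip(), parts[1].strip()
    if not source:
        return None                      # missing identifier
    try:
        value = float(text)
    except ValueError:
        return None                      # non-numeric value
    if not (low <= value <= high):       # out of range; this also rejects NaN
        return None
    return Reading(source, value)


def load_readings(lines):
    """Split raw lines into (validated records, rejected raw lines)."""
    good, rejected = [], []
    for line in lines:
        reading = parse_reading(line)
        if reading is None:
            rejected.append(line)
        else:
            good.append(reading)
    return good, rejected
```

With an access layer along these lines, the report code, the paging code and every other consumer would see only validated records, and a newly discovered kind of bad input would be handled by changing parse_reading once rather than patching each pasted copy.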
What was intended to be the initial “beta release” was thankfully never shipped to a real customer. The second beta was a vast improvement but was still so unstable that it was virtually unusable. Other than glib marketing quotes, customer feedback was limited to describing glaring bugs and instabilities since we were already so far behind schedule that only truly broken capabilities could be addressed. Ease-of-use enhancements to alleviate the inscrutable, inconsistent user interface were not even considered. Internal testing cycles continued to require full regression testing of all capabilities as even the process architecture was revised in the last few weeks before product general availability. The change in process architecture resulted in three new processes being added in the final weeks before release. This change was required in order to address nagging problems by hiding a particularly intractable instability from the user in a background process that could be restarted out of sight. Not surprisingly, the cure was worse than the disease, with the resulting hastily created daemon process leaking memory by the megabyte.
At least two patch releases were required after what was euphemistically called the general availability release. The first patch managed to decrease the memory leaks to the point that a nightly restart of the affected processes could be used to hide the remaining slow drip. This brought the product up to something approaching “beta” quality. This was soon followed by the second patch release which attempted to make some of the worst parts of the user interface a little less “user hostile” while applying fixes to the multitude of latent bugs discovered by customers and the QA group as testing continued. Shooting for making the interface “user friendly” wasn’t planned until some yet to be defined future release. Management fall-out continued with a burned-out staff and long denied vacations wreaking havoc on follow-on plans that were behind schedule before they could even be started.
Not surprisingly, sales and revenue goals were not met. This resulted in the majority of the perpetrators of this ill-conceived abomination either leaving voluntarily or being cashiered. This included the CEO, the vice presidents of sales and engineering, and the director of software development. Technical reviews in trade journals placed the product squarely in the middle of the pack even though the company had initially had a head start by being a first mover in the problem space. Comparisons between the product and those of competitors showed it as slower, harder to use and less accurate. Addressing these problems would eventually require rewriting significant portions of the only recently completed code since most of it was a morass of poorly designed, pretzel logic liberally infected with kludges, band-aid patches and poorly thought through quick fixes.
Almost exactly one year to the day after the kick-off meeting for the previous absurdity, we had a meeting of the development organization. The meeting topic was to lay out the near term and long term plans for the product. It was absolutely critical that the next release be completed by the end of the year (2005) and the release after that before the end of the first quarter of 2006. These releases would just get us back to feature parity with our primary competitors so the future success of the company rested on our ability to successfully execute the plan....
The stone hadn’t quite made it over the top of the hill and we were back at the bottom of the hill once again. Nothing had changed except the cast of characters in management roles. Given the latent bugs and instabilities, the problem of getting the next release out meant the stone would be heavier and the hill steeper. Ugh.
It is important to note that responsibility for this fiasco rests with those who insisted on attempting what was obviously not feasible rather than work for something somewhat less ambitious but at least possible. It remains to be seen whether the resulting damage to the overall business can be survived.
The final outcome of the development effort described above was as predictable as the sun coming up in the east. The resulting low quality product easily took as long to debug and test as the entire project would have taken had appropriate development methodologies been applied, with fewer resources, over what eventually turned out to be the achieved schedule. It is also very likely that a shorter schedule than what was finally achieved would have sufficed with better planning and some reasonable level of software engineering discipline. The consensus among the surviving participants was that, at a high level, this was simply the result of attempting to accomplish everything at once and on far too short a schedule. On the positive side, this exercise in futility served as the inspiration for this book by providing me with first-hand experience of a classic, schedule constrained, death march project.
Is there ever a rationale?
To the extent that people make a rational decision to embark on a death march development effort, there seems to be a mistaken belief that the called for, abbreviated schedule is somehow achievable. This belief can be traced to either (and frequently both) of the following invalid assumptions:
- The productivity possible with small projects or agile methodologies can be achieved regardless of the specific problem at hand, the size of the effort required for solving it or the methodology (or lack thereof) applied, and
- Any problem is infinitely divisible. Thus, the schedule for developing a program that solves “the problem” can be shortened as much as desired by just increasing the number of developers.
These assumptions may be valid for some projects. A project may be small and simple enough that high productivity will be achieved and a reasonable quality product produced in spite of the methodology used by the development organization. Likewise, within reason, adding developers will frequently somewhat shorten the schedule required for a development effort.
For larger, more complex projects, these assumptions rarely are true and running the development effort for such a project using these assumptions is a recipe for disaster. When these assumptions are applied to any project, regardless of the project's size or in spite of the complexity of the problem to be solved, a death march results. As the “Snowmass” development effort described above shows, the result is not just a blown schedule, a burnt out development team, and a few management changes. The worst consequences of such a development approach affect the quality and stability of the very core of the product.
Many software development theorists dismiss the death march or constant crisis (think death march but on a continuing basis) approaches to software development because the practices and processes of these approaches, such as they are, don’t follow a recognized theory or methodology. Obviously, there is no theory or methodology; only the practice of throwing as many bodies as seems feasible at the problem and then attempting to ride herd on the chaotic development process that results. This may explain why these projects rarely succeed but it adds nothing to understanding the larger consequences of such an effort. These consequences can be determined by recognizing that the collected practices that arise in such a development environment end up following a predictable process. This process is obviously not well defined, thought-out or documented but is simply the consequence of the human nature reaction of developers to prolonged, absurd schedule pressure.
Any size software development organization can embark on a death march development effort although organizations familiar with doing large scale development projects typically know better and avoid such exercises in futility. The death march software development approach appears most frequently within development organizations that are attempting to outgrow doing only small projects with small team development methodologies. Unfortunately for such organizations, small scale software development techniques differ fundamentally from traditional development methodologies that have evolved for dealing with larger scale software and systems development efforts. This difference is not just in the management structures, documentation requirements or other externally visible elements of the project but it is also inherent in the types of problems that the different methodologies can solve.
More than just size matters
Besides just the size of the required project, certain characteristics of problems that can be solved by small versus large development efforts are very different. Applying small scale development techniques to a problem that can be characterized as needing the more robust problem solving capabilities of a traditional development methodology is, at best, inefficient and, more probably, simply will fail. Attempting this same mistake on too short a schedule is the underlying cause of the poor software engineering practices that typically arise in a death march. These same characteristics of a project also determine whether the project is amenable to agile development methodologies. Thus, the projects that fail miserably when attempted by a death march frequently[2] also aren’t amenable to development using agile methodologies. Such projects can’t be rushed to completion using any methodology or tool set, or even by outsourcing, but I'm getting ahead of myself.
A number of factors obscure the fact that these characteristics of the problem constrain how fast it can be solved. Foremost among these obscuring factors is how the scope of the impact on the user organization that will be using the program increases as the size of the program grows. Further, this increase in the impact on the user organization is usually non-linear, with larger programs having much larger impacts. That is, as the size of the problem to be solved grows, chances are that there will be much more at stake in business processes, operational procedures, or even jobs that will be impacted by the program being developed. Small scale, rapid, agile or lightweight development methodologies tend to be most appropriate when the extent of the problem to be solved and the scope of the program for solving it are virtually identical. These methodologies typically lack mechanisms for dealing with such non-software impacts on the users and others.
Thus, there are actually two factors that drive larger projects to both longer schedules and apparently lower productivity. The most visible of these factors consists of the large number of people working on a larger project who will not be writing any code. These are the middle managers, configuration managers, technical writers, trainers, schedulers, and so on. Some of these roles (middle managers, schedulers, etc.) are required simply to control the project as the size of the effort increases. Other roles result from the specialization of tasks such as configuration management that are handled by the developers on smaller projects. To the extent that some of the specialists are performing tasks that the developers on a smaller project would have handled somewhat inefficiently on their own, productivity may actually be increased. Unfortunately, the sheer size and the increase in the relative size of some of these tasks for larger projects mean that these productivity gains will not be noticeable.
This very non-linear growth in the size of some tasks is caused by the larger impact on the user's organization of larger programs. This is the second factor causing apparently lower productivity and longer schedules for larger projects, and it consists of the non-software aspects of the larger project’s impact. These impacts may include jobs being redefined or going away completely, training for those who will use the new system and possible impacts on vendors and other organizations within or even outside of the actual user group. As the project progresses, ensuring that the final project fits into the user's existing organization structure or that its impact is understood also becomes part of the cost of the project. These “extended impacts” are frequently also the source of requirement changes and scope growth as the people and organizations affected by the program demand a voice in the direction of the project.
In general, most small scale and agile methodologies only address the issue of coordinating coding plus providing a framework for ensuring that the customer’s immediate expectations are met. This isn’t the fault of the methodology (and, to adherents of agile methodologies, it's a feature) but rather a characteristic of the types of problems that are conducive to being solved by a relatively small team in a short time period. Successful, large scale methodologies, conversely, recognize that there is frequently more to a project than just cranking out code and provide a consistent management framework in which larger system level issues and organizational impacts can be understood and solved.
Few large-scale system development methodologies state their purpose this way since they tend to emphasize just the predictability of the process[3]. This management framework is about solving the complete problem, with the resulting software program being only the expression of a specific portion of the solution. When the larger problem that formal methods address (verifiability, facilities impacts, training, staffing, documentation, transition planning, the list goes on) is excluded from the equation, predictive methodologies appear to be just a lot of overhead for simply writing some code.
Learning to swim
Time and again, different organizations, and frequently the same organizations, confront the deep water of attempting to develop a significantly larger software project than they have in the past. The organization bases its estimate on what it has known in the past, if it attempts to estimate the effort at all, and then goes blindly to slaughter. Its previous experience rarely prepares it for larger scale projects and sets it up for failure if it has not encountered the increased external complexity of such projects before. The extended impacts of larger software development efforts mean that the more difficult management problem of running larger software development projects goes beyond just the expected “friction”[4] of simply organizing larger efforts. Development organizations attempting such projects must learn to manage both the external elements of the project and the effort to develop a larger, more complex solution to the problem itself at the same time.
Contributing to the difficulty of quantifying software development planning for such projects is the number of human factors involved. Intangible and subjective characteristics (e.g., the development team's experience level, morale, individual ability, etc.) can have a profound impact on the ability of a particular team to complete a project as planned. Smaller, shorter efforts are highly dependent on the specific people involved while large, longer efforts tend to be subject to organizational norms and development methodology constraints. In between are the bulk of the projects by which small technology companies and internal development organizations live and die. These are the projects that are too big for just assigning a handful of developers and a “project manager” but supposedly too small (and generally assumed to be on too short a schedule) to allow for or warrant the classic practices called for by any traditional, predictive software development methodology.
Unfortunately, moderately sized projects in this range are the most intractable for a variety of reasons. In a self-fulfilling prophecy, many software development organizations are caught in a conundrum: they are unable to effectively execute larger software projects, yet competitive pressures and an increasingly complex world force them to grow beyond the small projects that are within their team's limited ability. This is not to say the team lacks talent. It is their development methodology that is unable to cope with larger problems. Like a person learning to swim, they venture toward the deep water of larger and more ambitious projects only to sink when they attempt to stand on their old methodologies. Like the beginning swimmer, the organization then sputters to the surface, thrashing around to stay afloat and cursing the deep water for their problems when, in fact, it is their own inability that causes their difficulties.
Often, an organization's most visible inability to solve a larger problem stems from the more diverse and less visible impacts of the larger project as described above. Project failures due to the inability of the development organization to fulfill explicit terms of the contract tend not to be easily missed. Even once the organization learns to manage the myriad external impacts of larger projects, it frequently continues to fail or, at best, flail at larger projects. This is due to not understanding that the nature of the problem to be solved is, itself, now the constraint. This constraint becomes fully apparent only when dealing with larger projects, but then its effect is frequently obscured by the additional managerial and customer or end-user impact work associated with larger projects.
People who have only experienced small-scale software development projects tend to discount the impact the nature of the problem to be solved has on the schedule. In their experience, a particularly difficult part of the problem to be solved just means that one routine or a small collection of routines will be a tad more difficult to code. At worst, this limits which developers can be assigned to develop these more difficult portions of the program. This will be the extent of the impact and any remaining code to be developed can easily be parceled out to another developer. Based on such experience, a belief develops that schedule duration and effort applied can usually be traded fairly cleanly. This belief is further reinforced if the small project consists almost entirely of coding which, as noted above, is fairly likely for such a project. Finally, for smaller projects and reasonable applications of additional resources, adding developers actually does shorten the expected schedule duration in more or less direct proportion to the number of additional developers since these projects consist almost exclusively of cranking out code to solve the problem.
And adding people doesn't help
Unfortunately, for larger projects, as additional resources are applied, the relationship between the additional resources and the resulting schedule impact becomes noticeably less than linear. As noted above, the external factors of a larger project tend to get blamed for this, but the nature of the problem itself is the ultimate or governing constraint on the development schedule even if the external or non-software factors are somehow removed from consideration. This is the underlying cause that explains why, at some point, adding more developers has no beneficial schedule impact and may, instead, actually lengthen the schedule. This effect was first described by Fred Brooks in The Mythical Man-Month when he noted that adding personnel to a late software project usually will only make it later[5].
This limitation can best be illustrated by looking at an absurd extreme. Why not bring in a programmer for each projected line of code? If the project is expected to take 10,000 lines of code, hire 10,000 programmers and it should be possible to complete the job in a few minutes with each programmer writing their one line of code. Everyone recognizes the absurdity of this suggestion but this recognition implies that, beyond a certain point, adding more developers to a project does not shorten the schedule duration required to complete the effort.
Even if feasible, hiring a programmer for each line of code to be developed isn't going to reduce the schedule duration to almost nothing regardless of the size of the program to be developed. On the other hand, past multi-developer successes obviously mean that appropriately sized teams of developers can successfully complete projects in less time than it would take for a single developer to implement the entire project. Between these two extremes there must be a level of staffing that minimizes the schedule duration required to implement the project[6]. This last point is important since it implies that this staffing level results in a schedule that is the shortest possible for the project being attempted. Intuitively, this schedule will be achieved if the development staff consists of far fewer programmers than one person per line of code. More likely, it is closer to the staff calculated to be able to complete the project on a schedule that is no less than seventy-five percent of the optimal schedule duration[7]. This is based on the observation that projects that attempt to complete sooner than this rarely succeed[8].
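The existence of a schedule-minimizing staffing level can be made concrete with a toy model. The sketch below is an illustration under stated assumptions, not a calibrated estimating method. It borrows one idea from Brooks, that the number of pairwise communication paths among n people is n(n-1)/2, and then assumes (purely for illustration) that each developer contributes one developer-month of work per month while every pair of developers costs a fixed fraction of a developer-month per month in coordination. The function names, the 100 developer-month effort and the 0.03 overhead figure are all hypothetical.

```python
import math


def duration_months(effort_dev_months, team_size, overhead_per_pair=0.03):
    """Calendar months needed under the toy model.

    Assumed model: net output per month = team_size minus the coordination
    cost of every pair of developers. overhead_per_pair is a made-up figure.
    """
    pairs = team_size * (team_size - 1) / 2          # n(n-1)/2 communication paths
    net_rate = team_size - overhead_per_pair * pairs
    if net_rate <= 0:
        return math.inf                              # coordination eats all the output
    return effort_dev_months / net_rate


def fastest_team(effort_dev_months, max_team=200, overhead_per_pair=0.03):
    """Team size (1..max_team) giving the shortest duration in the toy model."""
    return min(
        range(1, max_team + 1),
        key=lambda n: duration_months(effort_dev_months, n, overhead_per_pair),
    )


if __name__ == "__main__":
    effort = 100  # hypothetical total effort in developer-months
    for n in (1, 10, fastest_team(effort), 60, 10_000):
        print(n, duration_months(effort, n))
```

With these assumed numbers, a single developer needs 100 months, the minimum falls at 34 developers (just under six months), a team of 60 is slower than a team of 34, and the one-programmer-per-line extreme of 10,000 developers never finishes at all because coordination consumes more than the team produces. The specific values mean nothing; the shape of the curve, with a floor below which no amount of staffing will push the schedule, is the point.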
Something has to give
Throwing additional developers at a problem cannot succeed in completing larger projects in less than the minimum schedule. Unfortunately for organizations that only have experience with smaller projects, this is exactly the approach they take since, in their experience, this approach has worked on past projects. The only way a larger project can be completed in less than its minimal schedule duration is by removing functionality. This removal of functionality can either be accomplished consciously, by deferring sufficient specific capabilities so that the available schedule becomes feasible, or it will result from random bugs and instabilities that restrict the actual functionality available to the user and are still present when the product is shipped. Either way, the user will not have a fully functional product as defined by the original product specification. It should also be clear that users will probably not be amused to find random bugs and instabilities in the product instead of the fully functional product they thought they were getting. Judging by the quality of much commercial software, the belief that throwing more people at the problem will work isn't a fallacy that is confined to small software companies and development organizations.
As we will see in the next chapter, the effects of forcing a project onto a schedule that is less than the minimal schedule for the planned effort extend far beyond what can be predicted when additional personnel are added to a late software project in an attempt to “bring in the schedule.” Attempts to force a project to completion in less than the minimal schedule duration will impact both the quality of the current software product and the ability of the organization to provide revisions and additional functionality in later versions of the software. The impacts on both the current release and future releases result from attempting to develop too much functionality in too little time, even though the resulting bugs and instabilities that plague the current release mean that the desired schedule also won't be achieved. All too often, management seems to be oblivious to all of these effects and sees the worst consequence of forcing a project onto a death march schedule as only a possible increase in staff turnover[9] since the project is needed now and, of course, it has to work.
[1] Spolsky, J., Joel on Software, essay titled “The Law of Leaky Abstractions”. Computers work with ones and zeros. Everything from images to personnel files is an abstraction created by defining certain collections of ones and zeros to have a specific meaning. When the underlying ones and zeros leak up into a personnel application (e.g., someone with no middle name becomes “John NULL Doe”), an abstraction has leaked.
[2] Just because a project failed when attempted using a death march does not mean that the problem can't be solved using agile methodologies. There have probably been a number of unsuccessful “death march” software development projects that could have been successfully developed using an appropriate agile methodology. Many of the bad software engineering practices that arise during a death march and sow the seeds of the project's failure would be prevented by any valid software development methodology.
[3] I will use the term “predictive methodology” to describe traditional software methodologies such as the RUP, classic waterfall and spiral model. These methodologies seek to predict the effort and schedule required in building a particular piece of software as well as its behavior and performance once implemented.
[4] See On War, Book I, Chapter 7, Carl von Clausewitz. In our context, friction is the tendency of large efforts (of any kind) to bog down or, at least, not execute efficiently simply due to the effort required to keep everyone involved synchronized. Clausewitz noted that, in larger efforts, something will go wrong and that the plan for the effort needs to take the “friction” thus created into account. Software developers, possibly due to their exposure to the “perfect” execution of computers, have a tendency to ignore the effects of real-world friction when planning or estimating.
[6] The existence of a lower bound to a continuous function (schedule duration as a function of effort applied) implies that there exists a greatest lower bound.
[7] The optimal schedule duration for a project is the project schedule and implied staffing profile that gives the lowest cost for the project. Adding developers will, at a minimum, result in inefficiencies that increase the cost while employing fewer developers means that the project's management structure must be kept in place longer, again increasing costs.
[9] See Peopleware by Tom DeMarco and Timothy Lister for a more comprehensive description of the long-term consequences this practice has on the development organization.
This work is copyrighted by David G. Miller, and is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/2.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.