Ideal Transform Rule
Always work from ideal input.
Developers familiar with the power of pipeline operations central to the UNIX operating system know how simple, modular tools can be chained together to accomplish a wide variety of complex tasks. The small-scope and single-purposed blocks of code that make up a pipeline maximize the ability to maintain the code. Even if you don’t understand the workings of a single stage, it’s likely that you can understand the stage’s place in the context of the pipeline process.
One of the biggest challenges of code maintenance is handling changing requirements without breaking the existing code base. When you have pipelines composed of stages with clear input and output definitions, the introduction of additional stages becomes much safer. As long as you respect the input and output expectations of each stage, the rest of the pipeline can remain unchanged (in most situations).
I use a principle I call the “Ideal Transform Rule” when designing a complex process. I start by asking the question, “What’s the simplest input I could use to produce the desired output?”
If I were working on a report (gack!) I would want a data source with a “shape” that corresponded to the output format, required no normalization or exception processing, and that spoke in terms the consumer of the report understood. With that “ideal” input in hand, the report formatting would be as simple as possible. The report definition would not need to concern itself with any serious data transformation.
My next step, having established an ideal input for the final stage, is to work my way backward from it toward the original input, asking the same question at each step, until a reasonable number of discrete stages carries the original input to the ideal input.
For example, the initial data may be coming from a mainframe (double-gack!) and adhere to a dated column-naming convention. You know, one of those indecipherable short character-coded names like BNF_TY_CD (not to be confused with BEN_TY_CD or BNF_TYP). The input data may also include extraneous elements or have codes that need to be expanded to readable descriptions. I might choose to “normalize” the data to include data elements named with business terms (e.g. BenefitTypeCode), discard unneeded data elements, and expand codes with descriptions (e.g. A – Active). This step may seem frivolous, but I believe in leaving legacy conventions behind because of the terrible impact they can have on communication in the present. By using current business terminology in a modern naming convention, developers will greatly reduce the learning curve of the data they are working with.
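A normalizing stage like the one described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: BNF_TY_CD, BenefitTypeCode, and the "A" code for Active come from the example, while BNF_STS_CD, FILLER_01, the BenefitStatus name, and the rest of the code table are invented here.

```python
# Hypothetical mapping from legacy column names to business terms.
# Any column not listed here is treated as unneeded and dropped.
COLUMN_NAMES = {
    "BNF_TY_CD": "BenefitTypeCode",
    "BNF_STS_CD": "BenefitStatus",  # invented for this sketch
}

# Hypothetical code table; only "A" for Active appears in the text.
STATUS_DESCRIPTIONS = {"A": "Active", "I": "Inactive"}


def normalize(records):
    """Rename legacy columns, drop the rest, and expand status codes."""
    for record in records:
        row = {new: record[old] for old, new in COLUMN_NAMES.items()}
        code = row["BenefitStatus"]
        if code in STATUS_DESCRIPTIONS:
            # Keep the code and add its readable description.
            row["BenefitStatus"] = f"{code} - {STATUS_DESCRIPTIONS[code]}"
        yield row


legacy = [{"BNF_TY_CD": "MED", "BNF_STS_CD": "A", "FILLER_01": "    "}]
normalized = list(normalize(legacy))
```

Every later stage can then speak in terms of BenefitTypeCode and BenefitStatus without knowing the legacy names ever existed.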
The interim stages could include sorting, filtering, aggregation, cross-tabulation, etc. It all depends on what needs to be accomplished. Eventually, you may face a piece of complexity that can’t easily be broken down further. With this pipeline approach you at least isolate the complexity to as few stages as possible.
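As a rough sketch of how such interim stages might chain together, here is a small Python pipeline that filters, sorts, and aggregates. The field names (BenefitType, Status, Amount), the stage functions, and the sample rows are all hypothetical, chosen only to show the shape of the idea.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter


def only_active(rows):
    """Filtering stage: keep active rows only."""
    return (r for r in rows if r["Status"] == "Active")


def sort_by_type(rows):
    """Sorting stage: order rows by benefit type."""
    return sorted(rows, key=itemgetter("BenefitType"))


def total_by_type(rows):
    """Aggregation stage: expects input already sorted by benefit type."""
    for benefit_type, group in groupby(rows, key=itemgetter("BenefitType")):
        yield {"BenefitType": benefit_type,
               "Total": sum(r["Amount"] for r in group)}


def run_pipeline(rows, stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda data, stage: stage(data), stages, rows)


rows = [
    {"BenefitType": "Medical", "Status": "Active", "Amount": 100},
    {"BenefitType": "Dental", "Status": "Active", "Amount": 40},
    {"BenefitType": "Medical", "Status": "Inactive", "Amount": 999},
    {"BenefitType": "Medical", "Status": "Active", "Amount": 50},
]
report_input = list(run_pipeline(rows, [only_active, sort_by_type, total_by_type]))
```

Each stage states what it expects and what it produces, so a new stage can be slotted into the list without touching its neighbors.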
With the report example above, it’s quite possible that the report could be produced from the original mainframe input in a single (complex!) step in the report definition. If all of the complexity lives in the report definition, you’ve created code that has many concerns, does not communicate clearly to future maintainers, and perhaps hides errors. Additionally, report definition tools like Crystal Reports are often opaque…not the best way to manage code. Working with pipeline stages “shows your work”, a best practice you may have learned in school.
You don’t need to limit your use of the Ideal Transform Rule to multi-stage pipelines. Creating just a two-step process has benefits as well. A quick Perl script can often clean up data for use with a database bulk import tool…the Perl script saves you from limitations in both the input data and the import tool. Similarly, database view definitions simplify reports against a database…the view created for a report should be the ideal report input, removing much of the join logic from the concerns of the developer.
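The clean-up half of such a two-step process might look something like this. It is sketched in Python rather than Perl, and the three-field layout and sample data are assumptions made for illustration. The bulk import tool itself is left out.

```python
import csv
import io


def clean_for_import(raw_text, expected_fields=3):
    """Trim whitespace and skip blank or incomplete lines."""
    reader = csv.reader(io.StringIO(raw_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for fields in reader:
        fields = [f.strip() for f in fields]
        # Skip rows the import tool would choke on.
        if len(fields) != expected_fields or not any(fields):
            continue
        writer.writerow(fields)
    return out.getvalue()


# Hypothetical messy input: padded fields, a blank line, a short row.
raw = "  1001 , Medical ,A\n\n1002,Dental\n1003, Vision , I \n"
cleaned = clean_for_import(raw)
```

The import step then receives only well-formed rows, so its configuration can stay as plain as the tool allows.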
There’s a joy in writing simple code. Working from ideal input keeps your code simple. Simple code is more accessible and therefore more maintainable and less likely to break. When faced with a complex task, start by defining the ideal input, then close the gap from actual to ideal.