Updating Legacy Data Pipelines: Patterns

Lynn Langit
6 min read · Jul 17, 2022

Increasingly I am working with technical teams to modernize legacy data analysis pipelines. I’ve done most, but not all, of this work in the bioinformatics domain. Despite this, the emergent patterns are applicable to legacy pipeline projects in a number of domains.

Below I’ve listed the key stages, activities and expected goals for each stage. The diagram that follows summarizes the overall pattern.

Updating Legacy Data Pipelines Stages Map

Stage 0 — Assess Current State

Legacy pipelines are, by definition, those in current production use. They are differentiated from new research pipelines by number of users, volume of input data and reuse.

Often, resource constraints in existing systems (e.g. running out of storage space and/or compute capacity) drive the need to update existing pipelines. Frequently, teams need to move these workloads to the public cloud.

The key activities at this stage are the following:
- accurate capture of current usage patterns / key user stories
- historical analysis of data storage growth (see the projection sketch after this list)
- identification of hard project deadlines (due to technology retirement or contract renewals)
- review of current pipeline architectures, configurations
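
For the storage-growth analysis, even a rough projection of when current capacity will be exhausted helps justify the move and size the target environment. Below is a minimal sketch assuming monthly storage totals have already been exported to a CSV; the file name, column name and capacity figure are all placeholders.

```python
# Rough storage-growth projection from historical monthly totals.
# Assumes a CSV like: month,total_tb (e.g. "2022-01,410") -- names are illustrative.
import csv

def months_until_capacity(history_csv: str, capacity_tb: float) -> int:
    """Estimate an average monthly growth rate and project months until capacity."""
    with open(history_csv) as f:
        totals = [float(row["total_tb"]) for row in csv.DictReader(f)]
    # Compound month-over-month growth rate across the recorded history.
    rate = (totals[-1] / totals[0]) ** (1 / (len(totals) - 1)) - 1
    current, months = totals[-1], 0
    while current < capacity_tb and months < 120:  # cap projection at 10 years
        current *= 1 + rate
        months += 1
    return months

print(months_until_capacity("storage_history.csv", capacity_tb=900))
```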

GOAL: Written docs summarizing the information collected

Stage 1 — Train and Prepare Development Environments

In the rush to implement updated solutions, teams often miss key steps at this point in the process. Vendor patterns, best practices, sample architectures and examples are often useful in selecting best-fit candidate technologies for upgrading. I’ve collected a set of links for Google Cloud pipeline patterns.

GCP Pattern for Pipelines

Key activities at this stage are the following:
- review and select candidate technologies & architectures for testing
- schedule technical training (including hands-on exercises) for the technical teams using the candidate technologies
- allocate budget for technical training environment, set up training cloud project(s)
- collect & prepare test datasets
- create dev env with baseline security settings
- identify subject matter experts for user testing

GOALS: Team hands-on training, Dev env functional (start/use VM, view/upload file to bucket, use required APIs)
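
A quick way to confirm the “Dev env functional” goal is a smoke-test script the whole team can run before training starts. The sketch below uses the google-cloud-storage client; the project ID, bucket name and object path are placeholders for your dev environment.

```python
# Dev-environment smoke test: can we reach the project, write to a bucket and read it back?
# Placeholder project/bucket names -- replace with your dev environment values.
from google.cloud import storage

PROJECT_ID = "my-dev-project"
BUCKET_NAME = "my-dev-pipeline-bucket"

client = storage.Client(project=PROJECT_ID)
bucket = client.bucket(BUCKET_NAME)

# Upload a tiny object, then list it back to confirm read/write permissions.
blob = bucket.blob("smoke-test/hello.txt")
blob.upload_from_string("dev environment smoke test")
names = [b.name for b in client.list_blobs(BUCKET_NAME, prefix="smoke-test/")]
assert "smoke-test/hello.txt" in names, "upload succeeded but object not listed"
print("Dev environment storage check passed:", names)
```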

Stage 2 — Develop, Test Updated Pipelines

In this phase, it is common to engage with partners who have expertise in technologies that are new to the local development team. Here it’s recommended to have developers work in pairs or small groups to speed up knowledge transfer within the team.

Teams benefit from simple architecture diagrams for rapid prototyping and testing. An example is shown below. I’ve collected a few more simple cloud system visualization examples as well.

Diagram Simple Prototypes for Rapid Testing

Key activities here include these:
- Create technology validation ‘hello Tech’ (single-task) pipelines on the dev environment (run in < 5 min) (a sketch follows this list)
- Create single-task pipelines which upgrade single tasks from the existing full pipelines on the dev environment (run in < 15 min)
- Create single scattered-task pipelines which upgrade scattered (parallelized) tasks from the existing full pipelines on the dev environment (run in < 15 min)
- Create minimal written documentation (for developers)
- Create multi-step pipelines
- Create full step pipelines with minimal input data (for rapid testing)
- Create full step pipelines with full input data (for load testing)
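
The ‘hello Tech’ pipeline mentioned above is deliberately trivial: one task, tiny input, done in minutes. The team’s real pipelines will live in whatever workflow technology was selected in Stage 1, but the shape is the same; below is a plain-Python stand-in that runs a single command, captures its log and fails loudly on error. The command and paths are illustrative.

```python
# 'hello Tech' single-task validation pipeline: one task, tiny input, < 5 minutes.
# Plain-Python stand-in for the candidate workflow technology; command and paths are illustrative.
import subprocess
from pathlib import Path

def run_task(name: str, command: list[str], output_dir: Path) -> Path:
    """Run a single pipeline task, capture its output and fail loudly on error."""
    output_dir.mkdir(parents=True, exist_ok=True)
    log_path = output_dir / f"{name}.log"
    result = subprocess.run(command, capture_output=True, text=True)
    log_path.write_text(result.stdout + result.stderr)
    if result.returncode != 0:
        raise RuntimeError(f"task {name} failed, see {log_path}")
    return log_path

if __name__ == "__main__":
    # The simplest possible task: prove the runtime, logging and outputs work end to end.
    log = run_task("hello-tech", ["echo", "hello from the candidate technology"], Path("results"))
    print("hello task completed, log at", log)
```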

GOALS: Devs can run Tiny, One-task, One-scattered-task, Small and Full pipelines to successful completion.

Stage 3 — Test and Optimize Pipelines

In this phase, it is common to engage with partners who have expertise in optimization of technologies used. Also dev teams often lack the time to fully understand service configuration options. Misconfigured options for key services (storage, compute) can result in cost overruns and poor performance.

Optimized Genomics Pipelines require careful attention to service configuration options

Additionally, a frequent oversight at this point is omitting one or more types of verification testing.

Key verification tests include these:
- Functional? Does each pipeline type run to successful completion?
- Valid? Do the pipelines produce identical output to their predecessors? (see the comparison sketch after this list)
- Secure? Can a tester with end user permissions successfully run all pipelines?
- Recoverable? Can the pipelines be restarted if they terminate early with an error? Are the error messages meaningful to end users?
- Configurable? How complex is the process to update required configuration files? Could this be automated (scripts, etc…)?
- Comprehensible? If moving to a new technology are the updated data analysis tasks comparable to those in the legacy pipeline?
- Cost-effective? What are the costs to run these pipelines?
- Faster? How long does each pipeline take to run? Do certain tasks require certain configuration parameter ranges to run optimally?
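
For the ‘Valid?’ check above, a low-effort starting point is to checksum the outputs of a legacy run and an updated run and diff the two sets. The sketch below assumes both runs wrote results to local (or locally synced) directories; the paths are placeholders, and outputs containing run-specific metadata such as timestamps will need a smarter comparison.

```python
# Compare legacy vs. updated pipeline outputs by checksum.
# Paths are placeholders; outputs with embedded timestamps/run IDs need custom handling.
import hashlib
from pathlib import Path

def checksums(output_dir: Path) -> dict[str, str]:
    """Map each output file (relative path) to its SHA-256 digest."""
    return {
        str(p.relative_to(output_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(output_dir.rglob("*")) if p.is_file()
    }

legacy = checksums(Path("results/legacy_run"))
updated = checksums(Path("results/updated_run"))

missing = legacy.keys() - updated.keys()
extra = updated.keys() - legacy.keys()
changed = [name for name in legacy.keys() & updated.keys() if legacy[name] != updated[name]]

print(f"missing: {sorted(missing)}\nextra: {sorted(extra)}\nchanged: {sorted(changed)}")
```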

GOALS: Testers have verified and run all pipelines in a predictable, reproducible and cost-optimized way. Updated developer documentation.

Stage 4 — End-user Testing and Pre-Launch

In this phase, the group of end-user testers begins to work with the team to learn to use the updated pipelines. Because the dev team has become familiar with the new technologies and tools through their own work, it is common to underestimate the amount of new technology being presented to the end users.

End Users Benefit from Technical Training Too!

Key activities here include the following:
- Dev team creates a process to add end users to test environments (see the access sketch after this list)
- Dev team creates prod environment
- Dev team runs all pipelines on prod env
- Dev team runs load test (using workload analysis) on prod env
- Project status communicated (business drivers for change, reasons for tech selection, list of technologies, progress on pipeline updates)
- End user hands-on technical training (example shown above for GCP)
- End users can run all pipelines
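
The exact process for adding end-user testers depends on the cloud and services chosen. As one hedged example on GCP, granting a tester read access to a results bucket with the google-cloud-storage client could look like the sketch below; the project, bucket and user values are placeholders, and the role should follow the security baseline set in Stage 1.

```python
# Grant an end-user tester read access to a results bucket in the test environment.
# Project, bucket and user values are placeholders; adapt roles to your security baseline.
from google.cloud import storage

PROJECT_ID = "my-test-project"
BUCKET_NAME = "my-test-results-bucket"
TESTER = "user:tester@example.org"

client = storage.Client(project=PROJECT_ID)
bucket = client.bucket(BUCKET_NAME)

# Append an IAM binding giving the tester object read access, then save the policy.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({"role": "roles/storage.objectViewer", "members": {TESTER}})
bucket.set_iam_policy(policy)
print(f"Granted objectViewer on {BUCKET_NAME} to {TESTER}")
```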

GOALS: End users can use new technologies (VMs, Buckets, APIs…), Dev and Test can run key pipelines on Dev Env as expected.

Stage 5 — Initial Launch

In this phase, the testing end users begin to run production workloads on the updated pipeline environment. Invariably, bugs and further optimization opportunities are discovered in this phase of the project.

Key activities here include these:
- End users run top 3 workloads on new prod env and verify success, correctness of data output, and cost and time to run as expected (see the check sketched after this list)
- End users test end user documentation for initial end user setup
- End users test end user documentation for top 3 workloads (pipeline job runs)
- Dev team monitors prod cluster and prod jobs for unexpected results and tunes as needed
- Dev team updates dev documentation if/when updates occur
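
The verification in the first activity above is easier when the expectations captured during Stage 3 load testing are written down in a form a script can check. The sketch below is illustrative only; in practice the run times and costs would come from the scheduler’s logs or the cloud provider’s billing and monitoring exports.

```python
# Check early production runs against expectations captured during load testing.
# Workload names, limits and run values are illustrative placeholders.
EXPECTED = {  # workload -> (max hours, max cost in USD)
    "workload_a": (2.0, 40.0),
    "workload_b": (6.0, 150.0),
    "workload_c": (1.0, 15.0),
}

def check_run(workload: str, hours: float, cost_usd: float) -> list[str]:
    """Return a list of human-readable issues for one production run."""
    max_hours, max_cost = EXPECTED[workload]
    issues = []
    if hours > max_hours:
        issues.append(f"{workload}: ran {hours:.1f}h, expected <= {max_hours:.1f}h")
    if cost_usd > max_cost:
        issues.append(f"{workload}: cost ${cost_usd:.2f}, expected <= ${max_cost:.2f}")
    return issues

print(check_run("workload_a", hours=2.4, cost_usd=35.0))
```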

GOALS: End users on-boarded within expected timeframe. End users run pipeline jobs for top 3 workloads as expected.

Stage 6 — Review and Cut Over

In this phase end users are guided, trained and directed to use the updated prod environment.

Key activities here include the following:
- Project status vehicle (visual dashboard) showing xx% movement, updated weekly and communicated to the complete team
- Dashboard includes metrics for the following: total # users, pipelines run, data processed, etc… (see the sketch after this list)
- Pilot end users conduct training for new end users
- Dev team monitors and optimizes as needed (cluster and pipelines)
- Dev team forecasts schedule for retirement of Legacy Env
- End user and dev documentation updated as needed
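
The dashboard metrics listed above do not require a full monitoring stack on day one; a simple run log can feed the weekly numbers while one is stood up. The sketch below assumes each pipeline run is recorded as a small dictionary; the records shown are illustrative.

```python
# Weekly cut-over metrics from a simple run log (one record per pipeline run).
# The records below are illustrative; in practice they would be exported from job logs.
from collections import defaultdict
from datetime import date

runs = [
    {"week": date(2022, 7, 4), "user": "ana", "pipeline": "variant-calling", "gb": 120.0},
    {"week": date(2022, 7, 4), "user": "raj", "pipeline": "alignment", "gb": 300.0},
    {"week": date(2022, 7, 11), "user": "ana", "pipeline": "alignment", "gb": 280.0},
]

weekly = defaultdict(lambda: {"users": set(), "runs": 0, "gb": 0.0})
for run in runs:
    metrics = weekly[run["week"]]
    metrics["users"].add(run["user"])
    metrics["runs"] += 1
    metrics["gb"] += run["gb"]

for week, m in sorted(weekly.items()):
    print(f"{week}: {len(m['users'])} users, {m['runs']} pipelines run, {m['gb']:.0f} GB processed")
```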

GOALS: Weekly dashboard updates, increasing usage of new prod environment

Stage 7 — Retirement of Legacy Pipelines

In this stage, the legacy environment is retired, or partially retired. If the latter, then the legacy environment is no longer available for a subset of all pipelines, i.e. ‘these 3 pipelines run in the new env, those 4 pipelines run in the legacy env, etc…’

Key activities here include the following:
- Verify all (or key) pipelines are running on the prod env and NOT on the legacy env
- Dev team to remove (and archive) files, access, etc… for legacy pipelines
- If complete, dev team to decommission legacy env

GOALS: Pipeline jobs are running on new Prod Env

Next Steps

To learn more about my work modernizing pipelines, follow me on Medium. To read the technical details of a specific example, see my short series about the work I did with a team to build an updated (AWS) pipeline. The result was going from on-premises pipeline job runs that took 500 hours per run down to 10 minutes per run, using a data lake pattern.
