5 golden rules

If you follow these five golden rules, building and especially maintaining a workflow becomes far faster and more efficient:

  1. Specification → Specification before doing!

    1. Think about the end result

    2. Define your data structure

    3. Divide and conquer

  2. Documentation → Always work clean!

    1. Understand your input data

    2. Workflow and data naming

    3. Processor Naming, Grouping and Color Coding

  3. Know-how → Know what you’re doing!

    1. Parallel computing

    2. SQL knowledge (especially Joins)

    3. Don’t play around → change, execute and wait…

  4. Monitor and debug → Double-check your results!

    1. Think about expected results first

    2. Debug your workflow and validate

    3. Validate your running workflow when data change regularly

  5. Re-use → Don’t duplicate, standardize!

    1. Use variables

    2. Don’t duplicate paths

    3. Have grouping


Take your time to specify your workflows, your reports and how you want to reach them!

Think backwards from the end result

  • What does your final report look like?
  • Which configurations do you need?

Define data structure and path

  • What data sources do you need?
  • What is the necessary data structure for the end result?
  • How can I get there?
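One way to pin down the target structure up front is to write it out as a schema sketch before building anything. The fields below are hypothetical placeholders for your actual report columns:

```python
from dataclasses import dataclass

# Hypothetical target row for the final report -- replace the fields
# with the columns your report actually needs.
@dataclass
class ReportRow:
    region: str
    month: str        # "YYYY-MM"
    revenue: float
    order_count: int

row = ReportRow(region="EMEA", month="2024-01", revenue=12500.0, order_count=42)
print(row.region, row.order_count)
```

Once the target row is fixed, every data source and intermediate step can be judged by whether it moves the data closer to this structure.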

Divide and conquer

  • What sub problems do I have to solve?
  • Can I mock some sub-problems in separate workflows?
  • How many workflows do I need? What’s their purpose?
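The divide-and-conquer idea can be sketched in plain Python: each sub-problem becomes its own small step, and a step that will live in another workflow is mocked until it exists. All names and data here are hypothetical:

```python
# Sketch: split the pipeline into sub-steps so each can be built and
# tested on its own; "enrich" is mocked until its workflow exists.
def load():
    return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.0}]

def enrich_mock(rows):
    # Placeholder for a separate enrichment workflow (hypothetical).
    return [dict(r, region="UNKNOWN") for r in rows]

def aggregate(rows):
    return sum(r["amount"] for r in rows)

print(aggregate(enrich_mock(load())))  # 15.0
```

Each function answers one sub-problem, which also answers the question of how many workflows you need: roughly one per step that has its own purpose.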


Understand your data and work with discipline - ALWAYS

Understand your input data

  • What do my data look like?
  • Which data can be joined?
  • What are possible errors, preprocessing steps?
  • What is the amount? What would I expect in the end?
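A quick profile answers most of these questions before any workflow is built. A minimal sketch over a hypothetical sample, using only the standard library:

```python
from collections import Counter

# Hypothetical input rows -- replace with a sample of your real data.
rows = [
    {"customer": "A", "amount": 10.0},
    {"customer": "B", "amount": None},   # possible error: missing amount
    {"customer": "A", "amount": 7.5},
]

print("row count:", len(rows))
print("distinct customers:", len({r["customer"] for r in rows}))
print("missing amounts:", sum(1 for r in rows if r["amount"] is None))
print("per customer:", Counter(r["customer"] for r in rows))
```

The distinct-key count tells you which data can be joined; the missing-value count tells you which preprocessing steps you need; the row count is the baseline for what you expect at the end.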

Workflow and data naming

  • Are all workflows and data named properly?

Processor Naming, Grouping and Color Coding

  • Does another user understand what I did?
  • Will I understand what I did a month ago?


Get to know your tool and build deep knowledge of execution paths, SQL and your way of working

Parallel computing/Spark

  • What is the most efficient way for parallel computing?
  • What libraries/processors can I use?
  • What leads to full vs. partial execution?
  • What are RDDs?
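The partition idea behind RDDs can be illustrated without Spark: split the data into chunks, process each chunk independently, then combine the partial results. A minimal sketch using Python threads (the chunk size and worker count are arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of partition-wise processing, as Spark does with RDD partitions:
# the data is split into chunks and each chunk is processed independently.
data = list(range(100))
partitions = [data[i:i + 25] for i in range(0, len(data), 25)]

def process_partition(part):
    # Work that only needs its own partition parallelizes well.
    return sum(x * x for x in part)

with ThreadPoolExecutor(max_workers=4) as ex:
    partial_sums = list(ex.map(process_partition, partitions))

print(sum(partial_sums))  # same result as the sequential sum of squares
```

Computations that stay within one partition are cheap; anything that needs data from other partitions (joins, global sorts) forces data movement, which is why join strategy matters so much for speed.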

SQL and processor knowledge (especially Joins)

  • What effects will a join have?
  • What are my keys within the structures?
  • What is a normal form?
  • What functions does SQL provide?
  • What processors improve my speed?
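The row-multiplying effect of a join is easy to demonstrate: joining on a non-unique key produces one output row per matching pair, so the output can have more rows than either input. A small sqlite3 sketch with hypothetical tables:

```python
import sqlite3

# Sketch: an inner join on a non-unique key multiplies rows --
# always check the expected row count before and after a join.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders(order_id INTEGER, customer TEXT);
    INSERT INTO orders VALUES (1, 'A'), (2, 'A'), (3, 'B');
    CREATE TABLE contacts(customer TEXT, email TEXT);
    INSERT INTO contacts VALUES ('A', 'a@x'), ('A', 'a@y'), ('B', 'b@x');
""")
n = con.execute(
    "SELECT COUNT(*) FROM orders JOIN contacts USING(customer)"
).fetchone()[0]
print(n)  # 2 orders of A x 2 contacts of A + 1 x 1 for B = 5 rows, not 3
```

Knowing your keys (and whether they are unique, i.e. whether the table is in a normal form with one row per key) tells you in advance whether a join will preserve, reduce or multiply the row count.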

Don’t play around

  • Can I do something in parallel?
  • Can I work with sampling for setting up the workflow?
  • Will my workflow run? Can I foresee errors?
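Sampling while setting up a workflow can be sketched like this: develop and test against a small reproducible sample, and run the full input only once the logic is stable. The data and workflow function here are hypothetical:

```python
import random

# Sketch: build the workflow against a small reproducible sample first;
# the fixed seed makes every development run comparable.
random.seed(42)
full_input = list(range(1_000_000))
sample = random.sample(full_input, k=1_000)

def workflow(rows):
    return sum(rows) / len(rows)

print(workflow(sample))        # quick feedback while developing
# workflow(full_input)         # full run only after the logic is stable
```

The sample run gives feedback in seconds instead of a full change-execute-wait cycle, which is exactly the "playing around" the rule warns against.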


Monitor and debug

Double-check all (intermediate) results for consistency, plausibility and expectations

Think about expected results first

  • Are my intermediate results reasonable?
  • Does the amount of data make sense?
  • What would I expect, what is the result?
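Writing the expectation down before running turns "looks about right" into a concrete check. A minimal sketch with hypothetical numbers:

```python
# Sketch: state the expectation first, then compare the actual result
# against it (all values here are hypothetical).
expected_rows = 3            # e.g. one row per region
result = [("EMEA", 120), ("NA", 95), ("APAC", 60)]

assert len(result) == expected_rows, f"expected {expected_rows} rows, got {len(result)}"
assert all(count > 0 for _, count in result), "counts must be positive"
print("result matches expectation")
```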

Debug your workflow and validate

  • What is the sum?
  • Is there a condition that should not happen?
  • What can go wrong?
  • Do I have to ensure data consistency?
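These debugging questions translate directly into defensive checks between workflow steps: a checksum, a guard against a condition that should never happen, and a consistency check. Data and thresholds here are hypothetical:

```python
# Sketch of defensive checks between workflow steps (hypothetical data).
rows = [{"id": 1, "amount": 40.0}, {"id": 2, "amount": 60.0}]

total = sum(r["amount"] for r in rows)
assert total == 100.0, f"checksum failed: {total}"                  # What is the sum?
assert all(r["amount"] >= 0 for r in rows), "negative amount"       # should not happen
assert len({r["id"] for r in rows}) == len(rows), "duplicate ids"   # consistency
print("all checks passed")
```

A failed assertion stops the workflow at the step where the data went wrong, instead of letting the error surface in the final report.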

Validate your running workflow

  • How much data would I expect?
  • Are the sums, counts etc. reasonable?
  • Did I double-check my output?
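For a workflow that re-runs on regularly changing data, fixed expected values don't work; plausibility bounds do. A sketch with hypothetical thresholds:

```python
# Sketch: validate each run of a recurring workflow against plausibility
# bounds instead of fixed values (the thresholds are hypothetical).
def validate_run(row_count, total):
    assert 900 <= row_count <= 1100, f"row count out of range: {row_count}"
    assert total > 0, f"total must be positive, got {total}"

validate_run(row_count=1024, total=56_000.0)   # today's run: passes
print("run validated")
```

If a run one day delivers 10 rows instead of ~1000, the bounds check catches it even though every individual row may look fine.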


Think about workflows like code. Don’t copy-paste anything; merge the paths

Use variables

  • Do I use filter conditions often?
  • Do I want to change parameters?
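Keeping often-used filter conditions and parameters as variables means a change touches one definition instead of every path. A sketch with hypothetical parameters:

```python
# Sketch: parameters live in one place, so changing a threshold or a
# country code touches a single definition (names are hypothetical).
PARAMS = {"min_amount": 50.0, "country": "DE"}

rows = [
    {"amount": 80.0, "country": "DE"},
    {"amount": 20.0, "country": "DE"},
    {"amount": 90.0, "country": "FR"},
]

def keep(r, p=PARAMS):
    return r["amount"] >= p["min_amount"] and r["country"] == p["country"]

print([r for r in rows if keep(r)])  # only the first row survives
```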

Don’t duplicate paths

  • Can I combine paths?
  • How easy is it to change parameters?
  • Can I work with identifiers?
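Instead of duplicating a processing path per source, the sources can be tagged with an identifier and sent through one shared path, as sketched here with hypothetical data:

```python
# Sketch: tag rows with a source identifier and run one shared path,
# instead of maintaining two copies of the same logic.
source_a = [{"amount": 10.0}, {"amount": 20.0}]
source_b = [{"amount": 5.0}]

tagged = ([dict(r, source="A") for r in source_a]
          + [dict(r, source="B") for r in source_b])

# One shared path; split back out by identifier only where needed.
doubled = [dict(r, amount=r["amount"] * 2) for r in tagged]
print([r["amount"] for r in doubled if r["source"] == "A"])  # [20.0, 40.0]
```

A parameter change now happens once in the shared path instead of once per duplicated copy.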

Have grouping, standardize

  • Can I reuse my combination of processors?
  • What are problems I have to solve often?
  • Can I create templates?
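A recurring combination of processors can be packaged once as a template and reused everywhere, as in this hypothetical sketch:

```python
# Sketch: a recurring processor combination (drop missing values, then
# aggregate per key) packaged as one reusable template function.
def clean_and_aggregate(rows, key, value):
    """Drop rows with a missing `value`, then sum `value` per `key`."""
    out = {}
    for r in rows:
        if r.get(value) is None:
            continue
        out[r[key]] = out.get(r[key], 0) + r[value]
    return out

rows = [{"k": "x", "v": 1}, {"k": "x", "v": None}, {"k": "y", "v": 2}]
print(clean_and_aggregate(rows, key="k", value="v"))  # {'x': 1, 'y': 2}
```

Problems you solve often are exactly the ones worth turning into such templates.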