5 golden rules

If you follow these five golden rules, building and especially maintaining a workflow becomes far more efficient and faster:

  1. Specification → Specification before doing!

    1. Think about the end result

    2. Define your data structure

    3. Divide and conquer

  2. Documentation → Always work clean!

    1. Understand your input data

    2. Workflow and data naming

    3. Processor naming, grouping and color coding

  3. Know-how → Know what you do!

    1. Parallel computing

    2. SQL knowledge (especially Joins)

    3. Don’t play around → don’t just change, execute and wait…

  4. Monitor and debug → Double-check everything!

    1. Think about expected results first

    2. Debug your workflow and validate

    3. Validate your running workflow when data change regularly

  5. Re-Use → Don’t duplicate, standardize!

    1. Use variables

    2. Don’t duplicate paths

    3. Use grouping


Specification

Take your time to specify your workflows, your reports and how you want to get there!


Think starting from the end result

  • What does your final report look like?
  • Which configurations do you need?


Define data structure and path

  • What data sources do you need?
  • What is the necessary data structure for the end result?
  • How can I get there?


Divide and conquer

  • What sub problems do I have to solve?
  • Can I mock some sub-problems in other workflows?
  • How many workflows do I need? What’s their purpose?



Documentation

Understand your data and work with discipline - ALWAYS


Understand your input data

  • What do my data look like?
  • Which data can be joined?
  • What are possible errors, preprocessing steps?
  • How much data is there? What amount would I expect in the end? (see the profiling sketch below)
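
A quick way to answer these questions is to profile the input before building on top of it. A minimal sketch, assuming you prototype against a Spark backend with PySpark (the file name and columns are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("profiling").getOrCreate()

    # Placeholder input path -- replace with your actual data source.
    df = spark.read.csv("input_data.csv", header=True, inferSchema=True)

    df.printSchema()                 # structure: column names and inferred types
    print("row count:", df.count())  # amount: is it in the expected order of magnitude?
    df.describe().show()             # basic statistics to spot errors and outliers
    df.show(5)                       # eyeball a few rows for preprocessing needs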


Workflow and data naming

  • Are all workflows and data named properly?
  • Do the names make purpose and content clear?
  • Can I find the right workflow and dataset again later?


Processor naming, grouping and color coding

  • Does another user understand what I did?
  • Will I understand what I did a month ago?



Know-how

Get to know your tool and build deep knowledge about execution paths, SQL and your way of working


Parallel computing/Spark

  • What is the most efficient way for parallel computing?
  • What libraries/processors can I use?
  • What leads to a full execution, what to a partial one?
  • What are RDDs? (see the sketch below)
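
The questions above map onto a few Spark basics: an RDD is a partitioned, distributed collection; transformations on it are lazy; and only actions trigger execution, sometimes just a partial one. A minimal PySpark sketch (all names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallel").getOrCreate()

    # An RDD (resilient distributed dataset) is partitioned across workers.
    rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

    # Transformations (map, filter) are lazy -- nothing executes yet.
    even_squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

    # Actions trigger execution: take() may evaluate only a few partitions
    # (partial execution), while count() evaluates all of them (full execution).
    print(even_squares.take(5))
    print(even_squares.count())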


SQL and processor knowledge (especially Joins)

  • What effects will a join have? (see the example below)
  • What are my keys within the structures?
  • What is a normal form?
  • What functions does SQL provide?
  • What processors improve my speed?
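
As an example of join effects: joining on a key that is not unique fans rows out, which is why you need to know your keys. A small PySpark sketch with made-up data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("joins").getOrCreate()

    orders = spark.createDataFrame(
        [(1, "A"), (2, "B"), (2, "C")], ["cust_id", "order"])
    customers = spark.createDataFrame(
        [(1, "Alice"), (2, "Bob")], ["cust_id", "name"])

    # cust_id is unique in `customers` but not in `orders`, so the inner
    # join yields one row per order (3 rows), not one per customer (2).
    joined = orders.join(customers, on="cust_id", how="inner")
    joined.show()
    print("before:", orders.count(), "after:", joined.count())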


Don’t play around

  • Can I do something in parallel?
  • Can I work with sampling while setting up the workflow? (see the sketch below)
  • Will my workflow run? Can I foresee errors?
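
While setting a workflow up, you can develop against a small sample and switch to the full data only once the logic is stable. A sketch of the idea in PySpark (fraction and seed are arbitrary choices):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sampling").getOrCreate()
    df = spark.range(1_000_000)  # stand-in for your real input

    # Develop on roughly 1% of the data; the seed keeps it reproducible.
    sample = df.sample(fraction=0.01, seed=42)
    print("sample size:", sample.count())

    # Once the workflow is validated on `sample`, run it on `df` instead.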



Monitor and debug

Double-check all (intermediate) results for consistency, plausibility and your expectations


Think about expected results first

  • Are my intermediate results reasonable?
  • Does the amount of data make sense?
  • What do I expect, and what is the actual result?


Debug your workflow and validate

  • What is the sum, and does it match expectations? (see the check sketch below)
  • Is there a condition that should not happen?
  • What can go wrong?
  • Do I have to ensure data consistency?
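
Such checks can be made explicit instead of eyeballed. A minimal sketch of assertion-style validation in PySpark; the columns and conditions are assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 10.0), (2, 25.5), (3, 7.2)], ["id", "amount"])

    # A condition that should never happen: negative amounts.
    bad = df.filter(F.col("amount") < 0).count()
    assert bad == 0, f"{bad} rows with negative amount"

    # Consistency: the key column must be unique.
    assert df.count() == df.select("id").distinct().count(), "duplicate ids"

    # Sum check against what you expect.
    print("total amount:", df.agg(F.sum("amount")).first()[0])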


Validate your running workflow

  • How much data would I expect?
  • Are the sums, counts etc. reasonable?
  • Did I double-check my output? (see the sketch below)
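
When the input data changes regularly, these checks are worth automating on every run rather than doing them once. One possible pattern, with an invented plausibility range:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    result = spark.range(48_000)  # stand-in for the workflow's output

    # Plausibility range derived from past runs (placeholder values).
    EXPECTED_MIN, EXPECTED_MAX = 40_000, 60_000

    n = result.count()
    if not EXPECTED_MIN <= n <= EXPECTED_MAX:
        raise ValueError(f"output row count {n} outside plausible range")
    print(f"run ok: {n} rows")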


Re-Use

Think about workflows like code. Don’t copy-paste anything; merge the paths


Use variables

  • Do I use filter conditions often?
  • Do I want to change parameters in one place? (see the sketch below)
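
Instead of hard-coding the same filter condition into several processors, keep it in one variable (or workflow parameter) and reference it everywhere. A plain-Python sketch of the idea; all names are illustrative:

    # Central parameters -- change them here, every path picks them up.
    PARAMS = {"country": "DE", "min_amount": 100}

    def filter_rows(rows, params=PARAMS):
        """Apply the shared filter condition."""
        return [r for r in rows
                if r["country"] == params["country"]
                and r["amount"] >= params["min_amount"]]

    rows = [{"country": "DE", "amount": 150},
            {"country": "FR", "amount": 200}]
    print(filter_rows(rows))  # -> [{'country': 'DE', 'amount': 150}]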


Don’t duplicate paths

  • Can I combine paths?
  • How easy is it to change parameters?
  • Can I work with identifiers? (see the sketch below)
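
Rather than copying a processing path for every category, keep one path and carry the category as an identifier column. A hedged PySpark sketch (the data and columns are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    sales_de = spark.createDataFrame([(1, 100.0)], ["id", "amount"])
    sales_fr = spark.createDataFrame([(2, 80.0)], ["id", "amount"])

    # Tag each source with an identifier, union once, process once --
    # instead of maintaining two identical paths.
    combined = (sales_de.withColumn("region", F.lit("DE"))
                .unionByName(sales_fr.withColumn("region", F.lit("FR"))))
    combined.groupBy("region").sum("amount").show()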


Use grouping, standardize

  • Can I reuse my combination of processors?
  • What are problems I have to solve often?
  • Can I create templates? (see the sketch below)
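
In code terms, a recurring combination of processors is a reusable function. A minimal sketch of such a template; the cleaning steps inside are placeholders for whatever your workflows repeat:

    from pyspark.sql import DataFrame, functions as F

    def clean_and_deduplicate(df: DataFrame, key: str) -> DataFrame:
        """Template for a processor group used across workflows:
        trim strings, drop rows without a key, remove duplicates."""
        for c, t in df.dtypes:
            if t == "string":
                df = df.withColumn(c, F.trim(F.col(c)))
        return df.filter(F.col(key).isNotNull()).dropDuplicates([key])

    # Usage wherever the pattern recurs:
    # cleaned = clean_and_deduplicate(raw_df, key="customer_id")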