5 golden rules
If you follow these five golden rules, building and especially maintaining workflows becomes much faster and more efficient:
Specification → Specification before doing!
- Think about the end result
- Define your data structure
- Divide and conquer
Documentation → Always work clean!
- Understand your input data
- Workflow and data naming
- Processor Naming, Grouping and Color Coding
Know-how → Know what you do!
- Parallel computing
- SQL knowledge (especially Joins)
- Don't play around: don't just change, execute and wait…
Monitor and debug → Specification before doing!!!
- Think about expected results first
- Debug your workflow and validate
- Validate your running workflow when data change regularly
Re-Use → Don't duplicate, standardize!
- Use variables
- Don't duplicate paths
- Have grouping
Specification
Take your time to specify your workflows and reports, and how you want to get there!
Think starting from the end result
- What does your final report look like?
- Which configurations do you need?
Define data structure and path
- What data sources do you need?
- What is the necessary data structure for the end result?
- How can I get there?
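A sketch of pinning the target structure down as an explicit schema before anything is built, assuming Spark as the engine (the column names are made up):
```python
from pyspark.sql.types import StructType, StructField, StringType, LongType, DateType

# Hypothetical target schema for the final report, written down
# before the workflow is built. Column names are illustrative.
report_schema = StructType([
    StructField("customer_id", StringType(), nullable=False),
    StructField("order_date",  DateType(),   nullable=False),
    StructField("order_count", LongType(),   nullable=False),
])
```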
Divide and conquer
- What sub-problems do I have to solve?
- Can I mock some sub-problems in separate workflows?
- How many workflows do I need? What’s their purpose?
Documentation
Understand your data and work with discipline - ALWAYS
Understand your input data
- What do my data look like?
- Which data can be joined?
- What are possible errors and necessary preprocessing steps?
- How much data is there, and how much would I expect in the end?
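A quick profiling sketch for these questions, assuming Spark and a Parquet input (the path is a placeholder):
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("orders.parquet")   # placeholder input path

df.printSchema()                            # What do my data look like?
print("rows:", df.count())                  # What is the amount?
df.select([F.countDistinct(c).alias(c)      # candidate join keys?
           for c in df.columns]).show()
df.summary("count", "min", "max").show()    # ranges, outliers -> preprocessing?
```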
Workflow and data naming
- Are all workflows and data named properly?
- Do the names tell me at a glance what each workflow does and what the data contains?
Processor Naming, Grouping and Color Coding
- Does another user understand what I did?
- Will I understand what I did a month ago?
Know-how
Get to know your tool and build deep knowledge of execution paths, SQL and your way of working
Parallel computing/Spark
- What is the most efficient way for parallel computing?
- What libraries/processors can I use?
- What leads to a full execution, what to a partial one?
- What are RDDs?
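A minimal PySpark sketch of these ideas: an RDD is a lazy, partitioned collection, and the chosen action decides between partial and full execution:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# An RDD is a partitioned, distributed collection. Transformations
# such as map/filter are lazy: they only build an execution plan.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
evens = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# Actions trigger execution. take() may evaluate only a few
# partitions (partial execution); count() evaluates all of them.
print(evens.take(5))     # partial execution
print(evens.count())     # full execution
```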
SQL and processor knowledge (especially Joins)
- What effects will a join have?
- What are my keys within the structures?
- What is a normal form?
- What functions does SQL provide?
- What processors improve my speed?
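A small, made-up example of the effect a join can have when the key is not unique:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, "A"), (2, "B")], ["customer_id", "order"])
# Two address rows for customer 1 -> the join duplicates that order.
addresses = spark.createDataFrame(
    [(1, "home"), (1, "work")], ["customer_id", "addr"])

joined = orders.join(addresses, on="customer_id", how="left")
joined.show()
# 3 rows, not 2: a non-unique key on the right side multiplies rows.
assert joined.count() == 3
```
One duplicate key on the right side turned 2 rows into 3; on real data this silently multiplies sums and counts.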
Don’t play around
- Can I do something in parallel?
- Can I work with a sample while setting up the workflow?
- Will my workflow run? Can I foresee errors?
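One way out of the change-execute-wait loop is developing on a sample; a sketch with an assumed input and an illustrative dev flag:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("orders.parquet")   # placeholder input path

# Build and debug on a small, reproducible sample first; switch to
# the full data only once the logic is settled. DEV_MODE is made up.
DEV_MODE = True
working = df.sample(fraction=0.01, seed=42) if DEV_MODE else df
```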
Monitor and debug
Double-check all (intermediate) results for consistency, plausibility and expected values
Think about expected results first
- Are my intermediate results reasonable?
- Does the amount of data make sense?
- What would I expect, and what is the actual result?
Debug your workflow and validate
- What is the sum?
- Is there a condition that should not happen?
- What can go wrong?
- Do I have to ensure data consistency?
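A sketch of such fail-fast checks, using a stand-in for the workflow's real output (names and conditions are assumptions):
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
# Stand-in for the workflow's real output; names are made up.
result = spark.createDataFrame([(1, 10), (2, 5)], ["customer_id", "order_count"])

# Fail fast on a condition that should never happen.
bad = result.filter(F.col("order_count") < 0).count()
assert bad == 0, f"{bad} rows with negative order_count"

# Control sum to compare against the input side / expectations.
print("control sum:", result.agg(F.sum("order_count")).first()[0])
```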
Validate your running workflow
- How much data would I expect?
- Are the sums, counts etc. reasonable?
- Did I double-check my output?
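For recurring runs, counts can be checked against an expected band rather than a fixed number; the values here are illustrative:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
result = spark.range(98_000)          # stand-in for today's output

# When the workflow runs regularly, compare counts against an
# expected band; expected value and tolerance are assumptions.
expected, tolerance = 100_000, 0.05
n = result.count()
if abs(n - expected) > expected * tolerance:
    raise ValueError(f"row count {n} outside expected range around {expected}")
```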
Re-Use
Think about workflows like code. Don't copy-paste anything; merge the paths
Use variables
- Do I use filter conditions often?
- Do I want to change parameters?
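A sketch of keeping often-used filter conditions in one place instead of hard-coding them per processor (names and values are made up):
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("DE", 1), ("FR", 2)], ["country", "amount"])

# One set of parameters, changed in a single place.
PARAMS = {"country": "DE", "min_amount": 1}

filtered = df.filter(
    (F.col("country") == PARAMS["country"]) &
    (F.col("amount") >= PARAMS["min_amount"])
)
```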
Don’t duplicate paths
- Can I combine paths?
- How easy is it to change parameters?
- Can I work with identifiers?
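Rather than copying the downstream path per source, the sources can be tagged with an identifier and sent through one shared path; a hypothetical sketch:
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
web   = spark.createDataFrame([(1, 9.9)], ["order_id", "amount"])
store = spark.createDataFrame([(2, 5.0)], ["order_id", "amount"])

# Tag each source with an identifier, then run a single shared path.
combined = (
    web.withColumn("channel", F.lit("web"))
       .unionByName(store.withColumn("channel", F.lit("store")))
)
totals = combined.groupBy("channel").agg(F.sum("amount").alias("total"))
totals.show()
```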
Have grouping, standardize
- Can I reuse my combination of processors?
- What are problems I have to solve often?
- Can I create templates?
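Recurring processor combinations can be wrapped as one reusable template; a sketch with a made-up cleaning step:
```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

# A recurring combination of steps (trim strings, drop duplicates)
# wrapped as a reusable "template"; the cleaning steps are made up.
def clean(df: DataFrame, key: str) -> DataFrame:
    for name, dtype in df.dtypes:
        if dtype == "string":
            df = df.withColumn(name, F.trim(F.col(name)))
    return df.dropDuplicates([key])

# Usage (hypothetical): cleaned = clean(raw_orders, key="order_id")
```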