Building a deep and specialized Workflow in ONE DATA can lead to many combined and connected Processors. As the original data runs through the Workflow, the applied Processors can change its shape and format, which can be hard to track by hand. It might also be desirable to check performance information along a Processor path in order to find out which of the operations takes longest and thereby slows down the overall Workflow.
The Debug Mode of ONE DATA facilitates this by letting you "look into" your built Workflows. This can be done at any point in the Workflow, giving you information about any of its steps. The Debug Mode offers the following key benefits:
- Improved maintainability, by letting you check which data, with which datatypes, flows into and out of Processors
- Partial runs of only parts of the Workflow, without having to wait for the whole Workflow to finish
- Overall runtime information of your Processors as well as Spark details
- Dependencies of the underlying operations, in order to track down the technical calls made by a Workflow
How to Use the Debug Mode
There are two ways of accessing the debug functionality of ONE DATA, which differ in the level of detail of the produced information and in performance. Requesting more information requires more time, so you should always think about what and where you want to debug. The Debug Mode can be started in the following ways:
- In addition to simply running the Workflow, there is the possibility of doing a "Save & Debug" run. This option fully debugs the whole Workflow. Note that this might take some time depending on the size of your Workflow.
- Smaller debug operations can be performed by right-clicking on the Processor for which you want the debug information. This allows for smaller and more direct debugging, which needs less time than a Save & Debug run. This option offers three different debug modes, described below.
For the specialized debugging of a single Processor (right-click on the Processor), three different debug modes are available:
- Fast Debug: Debugs exactly the target Processor (and no other Processors) and delivers the Data Debug Info for that Processor.
- Full Debug: Debugs the Processor and the path leading to it, delivering both the Data Debug Info and the Spark Debug Info for all Processors on that path.
Performing a "Save & Debug" run on your Workflow is equivalent to a "Full Debug" on every Processor.
- Get Schema: This debug mode differs from the other two in that it does not deliver either of the debug infos described below. Instead, it is helpful whenever a Workflow uses a Processor whose schema cannot be directly inferred by the ONE DATA client, for example a Query Processor executing an SQL query, which might change the available columns or datatypes of the data running through it. Using Get Schema on such a Processor enables the client to forward the relevant schema information, allowing the subsequent Processors to be configured accordingly.
As an example, imagine a Query Processor on arbitrary data with three columns. If the query forwards only two of them, this information cannot be directly inferred by the ONE DATA client, since the query is not executed. Before running the Get Schema debug mode, a subsequent Processor's configuration would therefore still offer the no longer existing column. Running the Get Schema debug mode leads to a correct choice of only the two remaining columns.
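The schema change described above can be illustrated outside of ONE DATA with plain SQL. The following is a minimal sketch, not ONE DATA code: it uses Python's built-in sqlite3 module and a hypothetical table with three columns to show that the resulting column set of a query is only known once the query itself is inspected by the database engine, which is essentially what Get Schema does for a Query Processor.

```python
import sqlite3

# Hypothetical stand-in for a dataset with three columns a, b, c.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (a INTEGER, b TEXT, c REAL)")

# A query like one a Query Processor might run: only two of the
# three columns survive. Without handing the query to the engine,
# a client cannot know this from the query text alone.
cursor = conn.execute("SELECT a, c FROM data")

# The engine reports the schema of the result set, even though
# no rows were fetched.
columns = [desc[0] for desc in cursor.description]
print(columns)  # ['a', 'c']
```

A subsequent step configured against the original three-column schema would still offer column `b`, which no longer exists in the query result; refreshing the schema from the engine removes it.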
The debug modes can produce two different kinds of debug results:
- Data Debug Info: Shows the data produced by a Processor together with its datatypes, i.e. all columns, a sample of their content, and the number of rows generated.
- Spark Debug Info: Lists various Spark information about the Processor, such as Processor metrics (Spark jobs, stages, tasks, execution runtimes, ...), the Spark plan from the parsed logical plan to the actual physical plan, the schema of the data produced by the Processor, RDD dependencies, and Spark runtime types.
Disclaimer: Entries may be added to this info in the future, and its structure might change.
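To make the Data Debug Info described above concrete, here is a minimal, hypothetical sketch in plain Python of the kind of summary it provides: column names with datatypes, a sample of the content, and the row count. The dataset and field names are invented for illustration; this is not ONE DATA's actual output format.

```python
# Invented example rows standing in for a Processor's output.
rows = [
    {"id": 1, "name": "Alice", "score": 91.5},
    {"id": 2, "name": "Bob", "score": 78.0},
    {"id": 3, "name": "Carol", "score": 84.25},
]

# Columns with their datatypes, derived from the first row.
schema = {col: type(val).__name__ for col, val in rows[0].items()}

sample = rows[:2]       # a small sample of the content
row_count = len(rows)   # the number of rows generated

print(schema)     # {'id': 'int', 'name': 'str', 'score': 'float'}
print(row_count)  # 3
```

The Spark Debug Info goes beyond such a data summary and exposes engine internals (jobs, stages, plans, RDD dependencies), which is why it is only gathered by the more expensive Full Debug and Save & Debug runs.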