TABLE OF CONTENTS
- General Information
- Dataset Analysis
The dataset overview is an interface for exploring and analyzing the corresponding Data Table in more depth.
This article walks through what the feature offers and explains its different functionalities.
The General Information part of the dataset overview is displayed at the top and stays visible throughout the feature. In order, it contains the name of the Data Table, its business owner, creator, creation and last modification dates, quality, source, type, and row count.
It also provides a "Back to Overview" button for returning to the General Overview.
The analysis part of the dataset overview contains six tabs. Each tab either presents detailed information about the Data Table or enables a specific way of analyzing the data at hand.
Each tab has an info button.
Using the Data tab, the user can inspect a sample of the dataset (the first 100 rows).
The dataset columns are displayed, along with the corresponding Datatype for each column.
The Data Information tab is used to evaluate and maintain user-defined metadata for the dataset in a structured format, which adds value to the dataset.
Some metadata fields, such as tags and business owner, can help improve the searchability within the General Overview of the Data Catalogue.
Data in fields with the prefix 'Data Catalogue' will be kept only in the Catalogue. All other information will be synchronized with ONE DATA.
The Column Information tab contains the metadata describing each column individually. It consists of two separate tables: a first table for column descriptions in the Data Catalogue (1), and a second table that statistically summarizes the data columns (2).
The first table contains two columns: the Data Table column name and a short description that exists only in the Data Catalogue, meaning that it cannot be seen in the Data Table Use Case view.
Descriptions can be added manually using the button "Edit Table" on the right side.
Use the button "SAVE CHANGES" to save any changes to the column descriptions.
The second table contains automatically generated statistics about each individual column in the dataset (number of distinct values, number of nulls, minimum and maximum values, etc.), along with other relevant information (Datatypes, constraints, etc.).
Statistics columns are sortable and filterable.
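To make the meaning of these statistics concrete, here is a minimal Python sketch that computes the same kind of per-column summary for a small in-memory sample. The function name `column_statistics`, the sample rows, and the column names are hypothetical and are not part of the Data Catalogue; the way the Catalogue actually computes its statistics is not described here.

```python
from typing import Any


def column_statistics(rows: list[dict[str, Any]], column: str) -> dict[str, Any]:
    """Summarize one column: data type, distinct values, nulls, minimum, maximum."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "column": column,
        # Assumption: the type of the first non-null value stands in for the Datatype.
        "datatype": type(non_null[0]).__name__ if non_null else None,
        "distinct": len(set(non_null)),
        "nulls": len(values) - len(non_null),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


# Hypothetical sample rows standing in for a Data Table.
rows = [
    {"customer_id": 1, "country": "DE"},
    {"customer_id": 2, "country": None},
    {"customer_id": 3, "country": "FR"},
    {"customer_id": 4, "country": "DE"},
]

stats = [column_statistics(rows, name) for name in ("customer_id", "country")]

# Filtering and sorting the statistics, comparable to what the table offers in the UI.
with_nulls = [s for s in stats if s["nulls"] > 0]
by_distinct = sorted(stats, key=lambda s: s["distinct"], reverse=True)
```

In this sample, filtering on the null count keeps only the `country` column, and sorting by distinct values places `customer_id` first.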
With the Query tab, the user can preprocess the dataset further and create their own ad hoc analyses. It is also possible to download the query result or save it within the Data Catalogue for later use.
In this section, we will further explain the different parts that constitute the Query functionality and give some use case examples.
SQL Query Editor: gives the user the ability to execute SQL queries on the selected table. The editor comes with a number of functional buttons (Shortcuts, Download SQL, Import SQL, Refresh Schema), along with a tab for previously executed queries.
It is possible to involve more than one table within the same query, provided that both tables originate from the same connection (Hive, Oracle, etc.). In that case, the second table is referred to by its database name instead of "InputTable".
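As an illustration of a two-table query, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a real connection. It assumes that the selected table is addressed as `InputTable`, as the paragraph above implies. The second table name `customer_regions` and all column names are hypothetical, and the SQL dialect accepted on a Hive or Oracle connection may differ from SQLite's.

```python
import sqlite3

# In-memory SQLite database standing in for a single shared connection.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE InputTable (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customer_regions (customer_id INTEGER, region TEXT);
    INSERT INTO InputTable VALUES (1, 10, 50.0), (2, 11, 20.0), (3, 10, 30.0);
    INSERT INTO customer_regions VALUES (10, 'EMEA'), (11, 'APAC');
    """
)

# The selected table is referenced as InputTable; the second table
# (hypothetical) is referenced by its own name in the database.
QUERY = """
    SELECT r.region, SUM(i.amount) AS total_amount
    FROM InputTable AS i
    JOIN customer_regions AS r ON r.customer_id = i.customer_id
    GROUP BY r.region
    ORDER BY r.region
"""

result = conn.execute(QUERY).fetchall()
conn.close()
```

Only the `SELECT` statement corresponds to what would be typed into the editor; the surrounding setup exists just to make the example runnable on its own.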
Query Preview: a table that displays the raw dataset by default and a preview of the query result once a query has been executed. The preview can be downloaded as an Excel file using the button "Download" on the right side of the table.
Save Query Data: set the query metadata and save the executed query under the table (4) for later use.
Saved Queries: a table that displays all previously saved queries along with the corresponding table, the owner, and the specified metadata.
Start Workflow: create and save a Workflow with the selected query. The Workflow contains a Data Table Load Processor, a Query Processor and a Result Table Processor.
Navigate to the created workflow using the button "Created WF" that appears right next to the button "Start Workflow".
The Lineage tab belongs to the visual part of the Data Catalogue and contains information about the data flow for the selected table.
It allows the user to explore the complete flow of the dataset from its source(s) to its final destination(s) through an interactive data lineage graph.
One of four types of data lineage diagrams can be selected:
- All Nodes: shows full data lineage including predecessors, successors, and side streams of the selected dataset.
- Standard Lineage: shows all datasets that are predecessors of the selected dataset.
- Extended Lineage: shows all predecessors and successors of the selected dataset.
- Consumers only: shows all successors of the selected dataset.
Within the same type of graph, the user can also select different depth levels to change the number of displayed nodes and therefore the size of the diagram.
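The four diagram types and the depth setting can be read as graph traversals with different directions and a hop limit. The following Python sketch illustrates that reading on a small hypothetical lineage graph. The dataset names, the `lineage` and `walk` functions, and the interpretation of "side streams" as everything reachable when edges are followed in either direction are assumptions made for illustration, not a description of how the Data Catalogue builds its graph.

```python
from collections import deque
from typing import Optional

# Hypothetical lineage edges: (source dataset, derived dataset).
EDGES = [
    ("raw_orders", "clean_orders"),
    ("raw_customers", "clean_orders"),
    ("clean_orders", "sales_report"),
    ("clean_orders", "churn_features"),
    ("raw_customers", "customer_dim"),  # a side stream relative to clean_orders
]


def neighbours(node: str, direction: str) -> list[str]:
    """Return adjacent datasets upstream ("up"), downstream ("down"), or "both"."""
    up = [src for src, dst in EDGES if dst == node]
    down = [dst for src, dst in EDGES if src == node]
    if direction == "up":
        return up
    if direction == "down":
        return down
    return up + down


def walk(start: str, direction: str, depth: Optional[int] = None) -> set[str]:
    """Breadth-first walk; returns datasets within `depth` hops, excluding `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if depth is not None and distance >= depth:
            continue  # hop limit reached, do not expand this node further
        for nxt in neighbours(node, direction):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return seen - {start}


def lineage(selected: str, kind: str, depth: Optional[int] = None) -> set[str]:
    """Map each diagram type to a traversal of the hypothetical graph."""
    if kind == "Standard Lineage":
        return walk(selected, "up", depth)
    if kind == "Consumers only":
        return walk(selected, "down", depth)
    if kind == "Extended Lineage":
        return walk(selected, "up", depth) | walk(selected, "down", depth)
    if kind == "All Nodes":
        return walk(selected, "both", depth)
    raise ValueError(f"unknown diagram type: {kind}")


standard = lineage("clean_orders", "Standard Lineage")
consumers = lineage("clean_orders", "Consumers only")
extended = lineage("clean_orders", "Extended Lineage")
all_nodes = lineage("clean_orders", "All Nodes")
all_nodes_depth_1 = lineage("clean_orders", "All Nodes", depth=1)
```

In this example, `customer_dim` appears only in the unrestricted All Nodes result, because it is neither a predecessor nor a successor of `clean_orders`. Limiting the depth to 1 drops it again, which mirrors how a lower depth level shrinks the diagram.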
Finally, the user can open each displayed resource separately by clicking the button on the corresponding node within the graph.
The user can download an SVG copy of the graph using the download button displayed in the top right corner.
Joinable Data Landscape
Through this tab, the user can explore the data landscape and find datasets that are related to the current selection. The user can also see how two datasets are related and how well they match by clicking on the links between elements.
Once the user clicks on the link, the following table appears: