Purpose


The purpose of the Data Hub within ONE DATA is to connect, clean and aggregate data from various sources. It serves as the central data management module and helps a data engineer set up the basis for sophisticated analysis. The following data sources can be connected to ONE DATA:

  • File upload (CSV, Excel)
  • Model upload (e.g., Nimbus BPMN)
  • Relational databases (e.g., PostgreSQL or Oracle)
  • Web APIs (e.g., REST or SOAP)
  • Streaming data (e.g., via Kafka queues)
  • NoSQL connectors (e.g., Cassandra)

The system is not limited to the above data sources and is extended as required within projects. Once the data connection is in place, the user can either persist the data in the ONE DATA storage or load it transiently for each usage scenario to avoid capacity issues.

Once the data is loaded into the system, the data engineer most likely performs one of the following actions:

  • Data cleansing is the effort to harmonize the data (e.g., bringing all date values into one specific format such as yyyy-mm-dd)
  • Data aggregation describes the challenge of bringing data from diverse and potentially different technologies into one central place in a connected way (cf. data lake)
  • Data wrangling is the step in which portions of the data are prepared for further, more complex analysis. This typically also requires applying algorithms or other methods.
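
The data cleansing step above can be illustrated with a minimal Python sketch that harmonizes date values into yyyy-mm-dd. It is not ONE DATA processor code: the list of input formats, the function names `to_iso_date` and `cleanse_dates`, and the record layout are assumptions chosen for illustration only.

```python
from datetime import datetime

# Assumed input formats; extend this tuple to match the actual data sources.
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y", "%Y%m%d")


def to_iso_date(raw: str) -> str:
    """Return the date as yyyy-mm-dd, or raise ValueError if no format matches."""
    value = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    raise ValueError(f"unrecognized date format: {raw!r}")


def cleanse_dates(rows, column):
    """Harmonize one date column across a list of dict records."""
    return [{**row, column: to_iso_date(row[column])} for row in rows]
```

Formats are tried in order, so ambiguous inputs such as 03/05/2024 are read according to the first matching entry (here month/day/year). Which order is right depends on the source system.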



To reach the above goal, the data engineer can interact with the following resource types:

  • Credentials store the sensitive data needed to establish a connection
  • Connections actually open and maintain the link to a data source
  • Data Tables actually store the retrieved and prepared data
  • Workflows and functions help to perform data processing or aggregation tasks. This can be done 
  • Production Lines help to define an execution order of workflows and functions that can be equipped with quality gates to implement quality management principles
  • Schedules are used to continuously load data diffs or entirely new datasets from the data sources
  • Reports generate first insights, such as dataset summaries, for the retrieved and prepared data
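
The idea behind Production Lines, an ordered sequence of processing steps with a quality gate after each one, can be sketched in plain Python. This is a conceptual illustration only and does not use any ONE DATA API: `run_production_line`, the stage tuples, and the sample steps `drop_incomplete` and `total_by_customer` are hypothetical names.

```python
from typing import Callable, List, Tuple

Rows = List[dict]
Step = Callable[[Rows], Rows]   # a workflow or function that transforms the data
Gate = Callable[[Rows], bool]   # a quality check on the step's output


def run_production_line(rows: Rows, stages: List[Tuple[str, Step, Gate]]) -> Rows:
    """Run each step in order and stop as soon as a quality gate rejects a result."""
    for name, step, gate in stages:
        rows = step(rows)
        if not gate(rows):
            raise RuntimeError(f"quality gate failed after stage {name!r}")
    return rows


def drop_incomplete(rows: Rows) -> Rows:
    """Cleansing step: discard records without an amount."""
    return [r for r in rows if r.get("amount") is not None]


def total_by_customer(rows: Rows) -> Rows:
    """Aggregation step: sum the amounts per customer."""
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return [{"customer": c, "amount": a} for c, a in totals.items()]


stages = [
    ("cleanse", drop_incomplete, lambda rows: len(rows) > 0),
    ("aggregate", total_by_customer, lambda rows: all(r["amount"] >= 0 for r in rows)),
]
```

Calling `run_production_line(records, stages)` returns the aggregated records, or raises an error that names the stage whose gate failed, so that later stages never run on data that did not pass the check.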

To perform the data loading and manipulation tasks within workflows, the following processor groups can be used as a starting point:

Of course all other processors of ONE DATA can be included and applied in this stage.

The Data Hub secures the data through the underlying user management concepts:

  • User Management to make use of projects, user groups and sharing of resources within a project structure
  • Analysis Authorization to secure several dimensions within the data and restrict their visibility for certain user groups

Standard projects 

With the Data Hub, the following standard projects are implemented:

  • Access Layer
  • Data Catalog
  • Restricted Data
  • Data Lake