Overview
What are the key benefits of ONE DATA?
- A central tool, especially suited to large enterprises, capable of implementing a wide variety of use cases
- From Garage to Production: ONE DATA provides central coordination of all data analytics topics, ranging from data connection and distribution management, through AI capabilities, to rolling out live apps on machines or end-user devices
- ONE DATA reuses and interacts with existing industry standards and ecosystems and is therefore not a closed-down monolith
- ONE DATA helps start initiatives with a clear focus while retaining the ability to take projects live
- ONE DATA brings the perspectives of various stakeholders together and involves them early in the journey from an idea to a live system
ONE DATA's Value Pyramid
To ensure the ROI (return on investment) of a project, you as a manager need a clear idea and a strategy for how to monetize your corporation's data. If you look at the digitization journey, or data-driven journey, as it unfolds in large corporations, it typically starts with what we call "strategic intent".
The board of directors thinks "we need to do something with the data thing".
"Let's ask someone what to do with this data thing!"
That is the strategic intent from which new business models are developed as a prerequisite and starting point for data-driven business cases. Once you have identified a use case, the key is to obtain enriched data, which means you need to unify your data as an intermediate product for value creation.
ONE DATA supports data scientists in integrating heterogeneous data sources with its so-called Data Hub as a single source of enriched data. Two of the most time-consuming phases in the value creation process of data-driven projects are "data cleansing" (cleaning the data at hand so it can be used in further workflows) and "model serving" (taking an existing, trained model to production, generally for inference). So that you as a data scientist can focus on data analysis instead of losing time to data cleansing and model serving, ONE DATA provides the opportunity to create data models or to update and customize predefined ones. To get your data ready for value creation, data scientists and business analysts may also need to run analyses with updated data and variables from outside ONE DATA. For this purpose, ONE DATA uses microservices.
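To make the idea of model serving via a microservice more tangible, the following is a minimal, purely illustrative sketch of how a trained model could be wrapped in a small HTTP service. The model file, route, and port are hypothetical assumptions and do not describe ONE DATA's actual interfaces.

```python
# Minimal sketch of serving a trained model behind an HTTP endpoint.
# The model file ("model.pkl") and route ("/predict") are illustrative assumptions.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a previously trained model from disk (assumed to exist and to expose .predict()).
with open("model.pkl", "rb") as fh:
    model = pickle.load(fh)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"features": [[5.1, 3.5, 1.4, 0.2]]}.
    payload = request.get_json(force=True)
    predictions = model.predict(payload["features"])
    # Assumes numeric predictions; convert to plain floats for JSON serialization.
    return jsonify({"predictions": [float(p) for p in predictions]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```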
At this point we have enriched data as well as proven data models that run "internally" with the data of the Data Hub and externally from outside ONE DATA. Now it is all about scalability, sustainability, and efficiency to generate the most value from your projects. ONE DATA serves managers, data scientists, and business analysts with access to data-driven applications. Additionally, you gain numerous reporting possibilities for instant insights without the need for methodological expertise.
ONE DATA Architectural Thoughts
From an external view, ONE DATA is a single product, but from a technical perspective ONE DATA follows a microservice architecture. This means that the application is subdivided into several modules that perform specific tasks. This architectural pattern is a variant of the service-oriented architecture (SOA) style, which structures an application as a collection of loosely coupled services. In a microservice architecture, services are fine-grained and the protocols are lightweight. The benefit of decomposing an application into smaller services is improved modularity: the application becomes easier to understand, develop, and test, and more resilient to architecture erosion.
In a highly simplified way, we work in three process steps. The first step is to collect the data in the Data Hub. From there, the data is prepared and processed in the production step and then visualized in clear, easy-to-understand dashboards within the applications. Finally, the customer is provided with easy-to-use evaluations of their originally complex data in a browser-based user interface.
In a nutshell, this means that everything starts with the question of how the data can be imported into the system. There are many possibilities to connect different data sources. The data can, for example, come from CSV/ZIP/XML files, MySQL, PostgreSQL, SAP HANA, Oracle Database, Microsoft SQL Server, REST/SOAP APIs, Kafka, Cassandra, Q-DAS or TDMS, and many more. This data is stored in the Data Hub, which allows masses of data to be stored conveniently. ONE DATA also works with delta loads and on-demand loads of data in order to conserve resources and maintain higher performance. The most valuable advantage of the Data Hub is having all data in exactly one place so that it can be accessed easily and efficiently.
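As an illustration of the delta load idea only (this is not ONE DATA's connector API), the sketch below pulls just the records that changed since the last successful load from a relational source; the table name, timestamp column, and connection string are assumptions.

```python
# Illustrative sketch of a delta load from a relational source.
# Table, column, and connection details (sensor_readings, last_modified, ...) are assumptions.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@source-host:5432/production")

# Timestamp of the last successful load, e.g. taken from the target's own metadata.
last_load = "2023-01-01 00:00:00"

# Pull only rows changed since the last load instead of reloading everything.
query = """
    SELECT *
    FROM sensor_readings
    WHERE last_modified > %(last_load)s
"""
delta = pd.read_sql(query, engine, params={"last_load": last_load})

# Append the delta to the central store (here simply a Parquet file).
delta.to_parquet("data_hub/sensor_readings_delta.parquet", index=False)
```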
Because the Data Hub is this single central place, the models are also stored there. PostgreSQL is employed as the database management system for user information and metadata in order to make the best possible use of its advantages, such as the processing of enormous amounts of data, the innovative power of the open-source community, platform independence, and, in particular, its integrated security features.
In the next step, Production, state-of-the-art data processing methods for data cleansing, wrangling, and mapping are applied, as well as our artificial intelligence suite, which includes model training. At this point we work with a reporting system and enforce strict quality gates. The models and methods can be chained and executed on a schedule, so that our customers benefit from always up-to-date results and automation effects. This data processing step is highly customizable by data scientists to meet any possible business case, and there is a choice between Hadoop HDFS and Apache Parquet for storing the data.
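As a minimal sketch, assuming illustrative paths and column names, a cleansing and wrangling step of this kind could look roughly as follows in Spark, with the curated result written to Parquet:

```python
# Illustrative Spark sketch of a cleansing/wrangling step that ends in Parquet.
# Paths and column names (data_hub/..., machine_id, temperature) are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleansing-example").getOrCreate()

raw = spark.read.parquet("data_hub/sensor_readings_delta.parquet")

cleaned = (
    raw.dropDuplicates()                                # remove exact duplicates
       .dropna(subset=["machine_id"])                   # drop rows without a key
       .filter(F.col("temperature").between(-50, 200))  # discard implausible values
       .withColumn("temperature_rounded",               # simple mapping/derived column
                   F.round(F.col("temperature"), 1))
)

# Persist the curated result for downstream workflows, model training, and apps.
cleaned.write.mode("overwrite").parquet("production/sensor_readings_clean")
```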
Ultimately the applications serve two target groups. On the one hand, there is the user who benefits from interactive dashboards of data-driven evaluations and who can then download the underlying data and models for management reports. On the other hand, there is the technical component through which microservices and APIs can serve data and data models for further processing; it also provides excellent system monitoring capabilities. The data is finally stored in the Apache Kudu format and thus benefits from advantages such as low latency for random access combined with high speed for sequential access.
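To illustrate the technical consumption path, a third-party system might fetch served results over HTTP roughly like this; the endpoint, token, and response structure are hypothetical and not ONE DATA's documented API.

```python
# Hypothetical example of a third-party system consuming served results over HTTP.
# The base URL, token, and response fields are illustrative assumptions.
import requests

BASE_URL = "https://onedata.example.com/api/v1"
headers = {"Authorization": "Bearer <access-token>"}

response = requests.get(
    f"{BASE_URL}/datasets/sensor-kpis/latest",
    headers=headers,
    timeout=30,
)
response.raise_for_status()

# Process the returned rows in the consuming system.
for row in response.json()["rows"]:
    print(row)
```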
A sophisticated user management is implemented, which enables ONE DATA to be used at all levels of the company. It not only supports user-based authentication and analysis authorization, but also allows users to be assigned to different groups and roles, thus separating projects from each other. It is also possible to restrict resources per user, and initial access to the system is provided via an open registration and invite process.
Throughout the entire process there is the option to choose between the computation power of R, Spark, and Python to apply the most suitable solution for the respective challenges of our customers. This ensures that the latest scripts can always be applied and that the advantages of open source are incorporated in the best possible way.
All this is made available to our customers in a browser-based user interface, so that customers eventually have full access to their data, even on the move, accelerating their own core business.
ONE DATA from a User's Perspective
ONE DATA can be installed either on-premise or in cloud environments (e.g., AWS or MS Azure). A specific installation of ONE DATA is called an instance and is reachable via its own entry URL. A specific software module, the portal, serves as a generic entry point to ONE DATA. The portal thus acts as the main entrance and fulfills tasks like login, logout, password reset, domain selection, or user settings.
Furthermore, the major role of the portal is to be the central point of navigation between the (enabled) modules. Once logged in, the modules overview appears. Each module can be accessed directly via its corresponding subdomain.
Why the Modules Within ONE DATA?
Modules are there to focus on the specific tasks a certain user group should work on with regard to their skill set and responsibility. For better illustration, the following user groups are envisioned for certain modules:
- A data engineer is a person who focuses on processes that generate, save, maintain, clean, enrich, or publish data. Therefore this user interacts mostly with the Data Hub module.
- A data scientist / analyst is a person who extracts value out of given data. To do so, interdisciplinary methods are applied, such as AI algorithms or processes. This user group most likely interacts with the Model Hub, Processing Library, or Use Cases module.
- A manager does not want to consume every technical facet. This user group is interested in data-driven apps for fast discovery of insights and KPIs, with the possibility to drill down as well as to adjust certain parameters. To do so, this user group interacts with the App Viewer module for consumption, whereas data scientists use the App Builder module to create such apps.
From a technical perspective the modules use the same system functions in the background, but offer them in a purpose-oriented environment with a focus on usability. Not all modules have the same possibilities and resources to interact with, since some are simply not needed to conduct the envisioned tasks. The modules also have specialized resources to share with each other.
What Do Modules Use at Their Core?
The following set of resources is available in ONE DATA:
- Credentials: store authentication information for different authentication mechanisms, in the simplest case basic auth (user name and password), to connect to source systems and to share them without exposing clear text, e.g. for technical users
- Connections: set up connections to source systems with the mandatory technical connection details based on the technology used (e.g. databases of specific types or APIs), so they can be shared and used further while being maintained at a single source
- Datasets: either virtual datasets, derived from a connection and enriched with details such as a table or a specific API, or datasets physically stored on the instance itself, either newly created or uploaded/imported → the central point where all data can be screened, documented, quality-assured, and then passed on
- Workflows: data processing via different execution contexts; build processing pipelines consisting of steps such as data load, data wrangling and cleansing, machine learning algorithms, and transformations
- Reports: a what-you-see-is-what-you-get editor to plot and build lightweight reports (single pages) with simple navigation (dashboards) based on the available and created datasets
- Models: machine learning and business process models which can be created within or uploaded to ONE DATA and then, like datasets, managed in a versioned way within the model catalog and reused internally or served externally for deployment
- Schedules: timed executions of e.g. workflows, scripts, or even whole pipelines/production lines (connected functions with defined input and output)
- Production Lines: a sequence of single executions, e.g. workflows, which represents a running live use case from A to Z (e.g. from data load to feeding data to an app, with intermediate checks, messaging, and decision-based logic); see the sketch after this list
- Apps: a fully packaged set of results to spread and integrate into other systems and devices, with a focus on UI and UX
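As a rough, purely conceptual sketch (not ONE DATA syntax), a production line can be thought of as a chain of steps with defined inputs and outputs, including intermediate checks, that is executed as one unit by a schedule; all function names below are hypothetical.

```python
# Purely conceptual sketch of a production line: a chain of steps with defined
# inputs/outputs plus a simple quality gate, run as one scheduled unit.
# None of these functions correspond to actual ONE DATA resources.
def load_data():
    # e.g. a delta load from a source system into the Data Hub
    return [{"machine_id": 1, "temperature": 73.2}]

def cleanse(rows):
    # e.g. drop incomplete records
    return [r for r in rows if r.get("machine_id") is not None]

def quality_gate(rows):
    # intermediate check with decision-based logic: abort if nothing is left
    if not rows:
        raise RuntimeError("Quality gate failed: no rows after cleansing")
    return rows

def feed_app(rows):
    # e.g. hand the result over to an app or report
    print(f"Publishing {len(rows)} rows to the app")

def production_line():
    feed_app(quality_gate(cleanse(load_data())))

if __name__ == "__main__":
    production_line()  # in practice triggered by a schedule instead of manually
```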
How do Modules Interact?
As shown above, there exist seven modules (6 + 1 technical module). Four of them form a specialized subset (Data Hub, Model Hub, Processing Library, App Builder).
- These four serve the central collection, building, and distribution of data, models, and functions as well as visual app templates:
  - Data Hub: everything about connecting to systems, getting data, preprocessing data, organizing data, checking quality, and labeling data
  - Model Hub: everything about storing models, versioning models, running models, checking quality and labeling, importing models, R/Python function dockerization, …
  - Processing Library: everything about processing templates for reuse, reuse and distribution, R/Python script organization, reusable functions with defined I/O, microservices, …
  - App Builder: build up the logic / UI / UX to provide multiple consistent, centrally maintained app templates and frames
- The additional module covers the technical usage of ONE DATA for technical consumers:
  - API Layer: set up and publish maintained, versioned interfaces to make resources available to third-party applications and to operate ONE DATA from third-party systems in a broader IT infrastructure
- Visual usage of ONE DATA is aimed at human consumers:
  - App Viewer: everything that presents results visually and interactively to end users and can be consumed on different devices; it is also very stable, maintainable, and enhanceable from a creator perspective and allows fast distribution
- The module to emphasize the most is the one for business case development:
  - Use Cases: everything covering a use case in terms of a certain business goal and domain; it uses all the other modules and, via a production line, forms a running productive application
Use Cases is the module that combines all the building blocks from the servicing modules (e.g. using prepared data, maintained models, tested functions, and reusable, visually compliant app templates). Use cases can themselves create new data, models, services, and visualization templates and make them reusable in the servicing modules.
Picking up the Value Pyramid
- DATA-DRIVEN BUSINESS CASE is covered by the Use Cases module. As already mentioned, it is the center and foundation and therefore supports the full functionality of ONE DATA (which explains how a use case can contribute new resources such as data, models, and processing functions back to the specific modules for reuse by others).
- ENRICHED DATA is part of the Data Hub, where the data is centrally connected and brought to a level suitable for reuse.
- DATA MODELS are located in the Model Hub, which, like the Data Hub, offers catalog, search, and organizing functions to maintain them.
- MICROSERVICES are arbitrarily complex or simple functions/algorithms with dedicated input and output, which can either be referenced or copied (the difference between sharing and copying, with its respective advantages and disadvantages, is covered later). All of them are stored in the Processing Library.
- APPLICATIONS are either newly built or customized from app templates in the App Builder.
The special modules (the API Layer for technical consumers and the App Viewer for visual consumption) pay into all four top sections of the pyramid and can be used to open ONE DATA up to different building blocks of the IT ecosystem (e.g. a data warehouse). Whereas the API Layer is a management layer (like the model and data catalogs) to maintain, search, and organize all the data, models, and microservices used inside or outside of ONE DATA, especially by other systems and programs, the App Viewer is the interface for end users to consume results on displays and interact with them.