Content

  1. Overview
  2. Usage of Global Variables Within a Python Script
    1. OD Variables
    2. Examples
  3. Input
  4. Output
    1. Dataset
    2. Image
    3. Model
  5. Dataset Structure
    1. Available Operations
    2. Restrictions
  6. Example


Overview

This article covers the functionality of the ONE DATA Python Framework (ODPF) that is available to Python scripts within the ONE DATA (OD) Python Processors.

It can be used to:

  • access datasets passed from OD
  • use OD Variables
  • push any of the following structures to OD:
    • datatables
    • images
    • Models

Additionally, any output to Python's stdout (e.g. print() calls) and stderr (e.g. raised Exceptions) is also passed back to the OD server.


Usage of Global Variables Within a Python Script

The OD Python Framework registers the following global variables, which are accessible from Python scripts:

  • od_variables: Support for OD Variables, with the possibility of developing scripts in a custom IDE.
  • od_input: Dict structure whose keys are dataset names (as defined in the OD Python Processors) and whose values are instances of the framework's dataset structure.
  • od_output: Contains a set of methods for adding supported output structures.

The following sections explain each of these in more depth.


More information about Variables in ONE DATA can be found here.



OD Variables

ONE DATA Variables can be used in the Python script in two ways:

  1. Directly using @variable_name@ syntax.
  2. Using ODPF's od_variables global variable

The disadvantage of the direct approach is that IDEs do not understand this syntax, so developing and testing a script outside OD would require replacing the placeholders by hand before every run. For this reason, the od_variables global variable was added to ODPF and made available for Python script development. It is used as follows:

variable = od_variables.get("@variableName@", default_value)

where

  • "variableName" is the technical name of the Variable as defined in OD
  • "default_value" serves two purposes:
    • during Python script development in an IDE, it is the value returned when the script runs
    • in ODPF, the actual variable value (which is passed to ODPF as a string) is cast to the type of the default value

As OD provides multiple types of Variables, the table below lists the supported default_value types for auto-casting the actual variable value. Because the variable value is passed to ODPF as a string, every OD Variable type can be auto-cast to str.


OD Variable Type    Supported default_value Python type
Integer             int, float, str
Double              float, str
Boolean             bool, str
String              str
Date                int, float, datetime, str
Datatable           uuid.UUID, str


In addition, any subtypes of the types mentioned above are also supported, as long as they pass the isinstance check. If casting fails, script execution ends with an exception, which is propagated back to OD.
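
The casting rules above can be approximated with a small local stand-in, which may help when running a script in an IDE. This is only a sketch under stated assumptions, not the ODPF implementation: the names LocalODVariables and cast_like are hypothetical, and the Date handling assumes the value arrives as seconds since the epoch in UTC.

```python
from datetime import datetime, timezone
import uuid


def cast_like(raw_value, default_value):
    """Cast the string raw_value to the type of default_value (assumed rules)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(default_value, bool):
        if raw_value.lower() in ("true", "false"):
            return raw_value.lower() == "true"
        raise ValueError(f"cannot cast {raw_value!r} to bool")
    if isinstance(default_value, int):
        return int(raw_value)
    if isinstance(default_value, float):
        return float(raw_value)
    if isinstance(default_value, datetime):
        # Assumption: the value is passed as epoch seconds (UTC).
        return datetime.fromtimestamp(int(raw_value), tz=timezone.utc)
    if isinstance(default_value, uuid.UUID):
        return uuid.UUID(raw_value)
    if isinstance(default_value, str):
        return raw_value
    raise TypeError(f"unsupported default type: {type(default_value).__name__}")


class LocalODVariables:
    """Hypothetical stand-in for od_variables, intended for local runs only."""

    def get(self, raw_value, default_value):
        # An unreplaced "@name@" placeholder means the script runs outside OD.
        if raw_value.startswith("@") and raw_value.endswith("@"):
            return default_value
        return cast_like(raw_value, default_value)
```

Because the checks use isinstance, subclasses of the listed types are accepted as well, and a failed conversion surfaces as an exception rather than a silent fallback.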


Note that not all data types mentioned here are supported as output. For more information, have a look at the Output and Dataset Structure sections.



Examples

Below are examples of od_variables usage for the various OD Variable types. During development in an IDE, the returned value is the default value (the second parameter passed to the get method).


String OD Variable

Name               Technical Name    Data Type    Value
String Variable    stringVariable    string       "string"


od_variables.get("@stringVariable@", "str")  # "string": str


Integer OD Variable

Name                Technical Name    Data Type    Value
Integer Variable    intVariable       int          100


od_variables.get("@intVariable@", 1)         # 100: int
od_variables.get("@intVariable@", 1.0)       # 100.0: float
od_variables.get("@intVariable@", "1")       # "100": str


Double OD Variable

Name               Technical Name    Data Type    Value
Double Variable    doubleVariable    double       100.0


od_variables.get("@doubleVariable@", 1.0)    # 100.0: float
od_variables.get("@doubleVariable@", "1")    # "100.0": str


Boolean OD Variable

Name                Technical Name     Data Type    Value
Boolean Variable    booleanVariable    boolean      true


od_variables.get("@booleanVariable@", False)     # True: bool
od_variables.get("@booleanVariable@", "false")   # "true": str


Date OD Variable 

Name             Technical Name    Data Type    Value
Date Variable    dateVariable      datetime     18-09-2018


from datetime import datetime

od_variables.get("@dateVariable@", 1)                # 1537228800: int
od_variables.get("@dateVariable@", 1.0)              # 1537228800.0: float
od_variables.get("@dateVariable@", "str")            # "1537228800": str
od_variables.get("@dateVariable@", datetime.now())   # 2018-09-18 00:00:00+00:00 : datetime
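
The integer shown above is consistent with the date being interpreted as midnight UTC and expressed in seconds since the Unix epoch. The following standard-library sketch reproduces that arithmetic; the midnight-UTC interpretation is an assumption inferred from the example values, not a documented guarantee.

```python
from datetime import datetime, timezone

# 18-09-2018 interpreted as midnight UTC (assumption based on the values above).
midnight_utc = datetime(2018, 9, 18, tzinfo=timezone.utc)

# Seconds since 1970-01-01 00:00:00 UTC.
epoch_seconds = int(midnight_utc.timestamp())
print(epoch_seconds)  # 1537228800

# Converting back yields a timezone-aware datetime again.
round_trip = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
print(round_trip)  # 2018-09-18 00:00:00+00:00
```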


Datatable OD Variable

Name                  Technical Name     Data Type    Value
Datatable Variable    datasetVariable    dataset      The UUID of a dataset (e.g. b9edb619-3a26-4113-be5d-74241d1fa0f6)


import uuid

od_variables.get("@datasetVariable@", "str")         # "b9edb619-3a26-4113-be5d-74241d1fa0f6": str
od_variables.get("@datasetVariable@", uuid.uuid4())  # UUID('b9edb619-3a26-4113-be5d-74241d1fa0f6'): UUID


Input

Only datasets can be passed from OD to Python scripts. To get access to a registered dataset use:

dataset = od_input['dataset-name']

where "dataset-name" is a name registered in the OD Python Processor configuration for a specific input dataset.

The returned value is an instance of the framework's dataset structure.
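
Since od_input is accessed like a dictionary, a misspelled dataset name would presumably surface as a bare KeyError. The sketch below wraps the lookup so that the error lists the names that are registered. The helper name require_input is hypothetical, and a plain dict stands in for od_input so that the snippet runs outside OD.

```python
def require_input(inputs, name):
    """Return inputs[name], or raise a KeyError that lists the available names."""
    try:
        return inputs[name]
    except KeyError:
        available = ", ".join(sorted(inputs)) or "<none>"
        raise KeyError(
            f"no input dataset named {name!r}; available: {available}"
        ) from None


# Local demonstration: a plain dict stands in for the od_input global.
fake_od_input = {"sales": object(), "customers": object()}
dataset = require_input(fake_od_input, "sales")
```

Inside a Python Processor the call would be require_input(od_input, "sales"), assuming od_input supports item access and iteration over its keys like a regular dict.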


Output

Dataset

To pass a dataset created in a Python script back to OD, use:

od_output.add_data("dataset-name", dataset, col_names)

where

  • "dataset-name" is the name under which it will be available in an OD Processor (e.g. for registering with the Processor's output)
  • "dataset" is one of the following:
    • an ODPF dataset structure
    • 2D matrix of data as list of rows where each row is a list of column values
    • Pandas DataFrame
  • "col_names" is an optional list of column names (required for 2D matrix content)


Image

To pass an image from a Python script back to OD, use:

from onelogic.odpf import ImageType
image_type = ImageType.PNG # or ImageType.JPG
od_output.add_image("image-name", image_type, image_content)

where

  • "image-name" is the name under which the image will be available in an OD Processor (e.g. for adding to a report)
  • "image-type" is the type of the image (one of the values in onelogic.odpf.ImageType: JPG or PNG)
  • "image-content" is one of the following:
    • a byte array with the image content
import io
from PIL import Image

# Render a small red rectangle and serialize it to PNG bytes
roi_img = Image.new('RGB', (60, 30), color='red')
image_bytearr = io.BytesIO()
roi_img.save(image_bytearr, format='PNG')
image_content = image_bytearr.getvalue()


    • a matplotlib Figure object resulting from plot creation
import pandas as pd

# DataFrame.plot returns matplotlib Axes; the Figure is obtained from it
df = pd.DataFrame({'lab':['A', 'B', 'C'], 'val':[10, 30, 20]})
ax = df.plot.bar(x='lab', y='val', rot=0)
image_content = ax.get_figure()
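
When the image content is a byte array, the declared image type should match the actual encoding of the bytes. The sketch below inspects the leading bytes to tell PNG from JPEG before calling add_image. The helper detect_image_format is hypothetical and not part of ODPF; it relies only on the well-known file signatures of the two formats.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # first 8 bytes of every PNG file
JPEG_SIGNATURE = b"\xff\xd8\xff"       # first 3 bytes of every JPEG file


def detect_image_format(image_content):
    """Return "PNG" or "JPG" based on the leading bytes, else raise ValueError."""
    head = bytes(image_content[:8])
    if head.startswith(PNG_SIGNATURE):
        return "PNG"
    if head.startswith(JPEG_SIGNATURE):
        return "JPG"
    raise ValueError("content is neither PNG nor JPEG")
```

Because the returned strings match the member names ImageType.PNG and ImageType.JPG, the result could be turned into an image type with getattr(ImageType, detect_image_format(image_content)).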


Model

To pass Model data from a Python script to OD, use:

od_output.add_model("model-name", model_content)

where

  • "model-name" is the name under which the Model will be available in an OD Processor (e.g. for exporting / saving it for later use)
  • "model-content" is the content of the Model (as string)


Model Groups

To pass Model Group data from a Python script to OD, use:

od_output.add_model("model-name", "model-content", "model-group-name")

where

  • "model-name" is the name under which the Model will be available in an OD Processor (e.g. for exporting / saving it for later use)
  • "model-content" is the content of the Model (as string)
  • "model-group-name" is the name of the Model Group to which the Model will be assigned


For now, a Model in the OD Python Framework can be an arbitrary string. This may change in future versions!
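
To try the three output calls described above outside OD, a recording stand-in for od_output can be useful. The class below is a hypothetical sketch for local development only: it mirrors the call shapes shown in this section (including the optional column names and Model Group name) but does nothing beyond storing the arguments.

```python
class LocalOutput:
    """Hypothetical stand-in for od_output that only records what was added."""

    def __init__(self):
        self.data = {}
        self.images = {}
        self.models = {}

    def add_data(self, name, dataset, col_names=None):
        self.data[name] = (dataset, col_names)

    def add_image(self, name, image_type, image_content):
        self.images[name] = (image_type, image_content)

    def add_model(self, name, model_content, model_group_name=None):
        self.models[name] = (model_content, model_group_name)


# Fall back to the stub only when the real global is not defined.
try:
    od_output
except NameError:
    od_output = LocalOutput()

od_output.add_model("model", "This is a Model content.", "demo-group")
od_output.add_data("dataset", [[1, "a"], [2, "b"]], ["id", "label"])
```

Inside a Python Processor the real od_output global already exists, so the fallback branch is never taken and the script behaves as usual.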


Dataset Structure

The Input and Output datasets are represented as onelogic.odpf.ODDataset.

Within a Python script, datasets can use any available Python / Numpy / Pandas data type. As OD does not support all of these types, datasets used for OD input / output are deserialized / serialized in the following manner:

  • od_input

    OD Type             ODDataset Column Type
    INT                 np.int64 / np.float64 (if None values present)
    DOUBLE / Numeric    np.float64
    DATETIME            np.datetime64
    STRING              np.object

  • od_output

    ODDataset Column Type      OD Type
    any integer type           INT
    any floating point type    DOUBLE
    any datetime type          DATETIME
    np.object / string type    STRING
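
The od_output mapping can be illustrated for plain Python values with a small classifier. This is a simplified sketch and not ODPF's serializer: the function od_type_for is hypothetical, it looks at single values rather than whole columns, and it ignores Numpy and Pandas types.

```python
from datetime import datetime
import numbers


def od_type_for(value):
    """Map one plain Python value to the OD type name from the od_output table."""
    # Note: Python bools are Integral too; how ODPF treats them is not
    # stated in the source, so no special case is made here.
    if isinstance(value, numbers.Integral):
        return "INT"
    if isinstance(value, numbers.Real):
        return "DOUBLE"
    if isinstance(value, datetime):
        return "DATETIME"
    if isinstance(value, str):
        return "STRING"
    raise TypeError(f"no OD type for {type(value).__name__}")


row = [1, 2.0, "test", datetime(2018, 9, 18)]
print([od_type_for(v) for v in row])  # ['INT', 'DOUBLE', 'STRING', 'DATETIME']
```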


Available Operations

Create new dataset

A new dataset can be created in two ways:

  • 2D matrix representation of data (list of rows where each row is a list of column values) and list of column names
from onelogic.odpf.common import ODDataset
from datetime import datetime

dataset = ODDataset([[1, 2.0, "test", datetime.now()],
                     [2, 3.0, "sample", datetime.now()]],
                    ["int_col", "double_col", "str_col", "timestamp_col"])


  • Pandas DataFrame
from onelogic.odpf.common import ODDataset
from datetime import datetime
from pandas import DataFrame

d = {'int_col': [1, 2], 'double_col': [2.0, 3.0], 'str_col': ['test', 'sample'], 
     'timestamp_col': [datetime.now(), datetime.now()]}
dataset = ODDataset(DataFrame(data=d))


Current restrictions:

  • If content is passed as 2D matrix, column names must be specified and have the same size as each row
  • Data types in columns must be of supported type
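
The first restriction can be checked before constructing a dataset from a 2D matrix. The helper below is a hypothetical pre-check, not something ODPF provides; it only verifies that column names are present and that every row has one value per column.

```python
def validate_matrix(rows, col_names):
    """Raise ValueError unless every row has exactly one value per column name."""
    if not col_names:
        raise ValueError("column names are required for 2D matrix content")
    for index, row in enumerate(rows):
        if len(row) != len(col_names):
            raise ValueError(
                f"row {index} has {len(row)} values, expected {len(col_names)}"
            )


# Passes silently: two rows, each with two values for two columns.
validate_matrix([[1, "test"], [2, "sample"]], ["int_col", "str_col"])
```

Running the check first gives an error message that points at the offending row, which may be easier to act on than a failure raised deeper in the framework.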


Get list of column names

To retrieve a list of the dataset's column names, call:

column_names = dataset.column_names()


Get dataset as 2D matrix

To retrieve values of the dataset as a list of rows where each row is a list of column values (in same order as column names), call:

matrix = dataset.get_as_matrix()
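
Because each row lists its values in the same order as the column names, the two results can be combined into one record per row. The sketch below uses literal lists in place of the values returned by column_names() and get_as_matrix(), so that it runs outside OD; the helper name rows_as_dicts is hypothetical.

```python
def rows_as_dicts(column_names, matrix):
    """Pair each row's values with the column names, one dict per row."""
    return [dict(zip(column_names, row)) for row in matrix]


# Stand-ins for dataset.column_names() and dataset.get_as_matrix().
column_names = ["int_col", "str_col"]
matrix = [[1, "test"], [2, "sample"]]

records = rows_as_dicts(column_names, matrix)
print(records[1]["str_col"])  # sample
```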


Get dataset as Pandas DataFrame

To retrieve the values of a dataset as a Pandas DataFrame, call:

df = dataset.get_as_pandas()


Important: Columns in the returned Pandas DataFrame are in arbitrary order, so to access values of specific columns, the column name should be used instead of indexes.


Restrictions

get_as_* Operations

Calling any of the get_as_* operations on an ODDataset returns a copy of the dataset's actual state. After this, any changes done to a 2D matrix or a DataFrame version of the dataset are not synchronized with the original!


A copy of the inner ODDataset representation is created only if necessary. Once a copy is created, it is stored separately from the original inner representation within the ODDataset. Below is the table of get_as_* operations behaviour based on the ODDataset origin:


ODDataset origin                              Get as 2D matrix    Get as Pandas DataFrame
ODDataset constructed with 2D matrix input    no copy             copy
ODDataset constructed with DataFrame          copy                no copy
ODDataset from od_input                       copy                no copy


Example

The following Python script is a simple example that shows how to use the input / output global variables to read and pass data between the OD server and the script.

import io
from PIL import Image
from onelogic.odpf import ImageType
from onelogic.odpf.common import ODDataset
from datetime import datetime
import pandas as pd

print("Hello, world!")

# print input dataset as list of lists
print(od_input['input'].get_as_matrix())

# print input dataset as Pandas DataFrame
print(od_input['input'].get_as_pandas())

od_output.add_model("model", "This is a Model content.")

df = pd.DataFrame({'lab':['A', 'B', 'C'], 'val':[10, 30, 20]})
ax = df.plot.bar(x='lab', y='val', rot=0)
od_output.add_image("plot", ImageType.JPG, ax.get_figure())

roi_img = Image.new('RGB', (60, 30), color='red')
image_bytearr = io.BytesIO()
roi_img.save(image_bytearr, format='PNG')
image_bytearr = image_bytearr.getvalue()
od_output.add_image("image", ImageType.PNG, image_bytearr)

od_output.add_data("dataset", ODDataset([[1, 2.0, "test", datetime.now()],
                                         [2, 3.0, "sample", datetime.now()]],
                                        ["int_col", "double_col", "str_col", "timestamp_col"]))

The following happens during / after script execution:

  1. The string "Hello, world!" is printed out.
  2. The input dataset "input" is printed to stdout as a 2D matrix and then as a Pandas DataFrame.
  3. The Model "model" with the content "This is a Model content." is added to the output of the script.
  4. A JPG image containing a bar plot of a sample dataset is added to the script's output under the name "plot".
  5. A PNG image with a 60 x 30 red rectangle is added to the output under the name "image".
  6. A new dataset (2 rows; 4 columns) with all supported data types is added to the script's output as "dataset".