Besides the Filesystem Connection (see the Connections article) in ONE DATA, it is also possible to read CSV files from the file system with the Python Script Data Generator Processor. Because the Python code can be written freely, the reading can be fine-tuned more precisely. However, if the CSV files to be read have a standard structure, the Filesystem Connection should be the way to go. Also consider the following notes.


To read CSV files via the processor, the target directory needs to be whitelisted so that it is accessible for reads. This needs to be done in both the ONE DATA Server container and the pydata container.
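
Before reading any files, it can help to confirm from inside a Python processor that the directory is actually reachable. The following is a minimal sketch under stated assumptions: the helper name `check_directory_access` is hypothetical, "/sampleData/" is only an example path, and the check covers access at the operating-system level from inside the container. It does not inspect the ONE DATA whitelist configuration itself.

```python
import os


def check_directory_access(path):
    """Return a dict describing whether `path` is usable from this process."""
    return {
        "exists": os.path.isdir(path),
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }


# "/sampleData/" is only an example path; replace it with your whitelisted directory
print(check_directory_access("/sampleData/"))
```

If "exists" or "readable" is reported as False, the directory is most likely not available to the pydata container yet.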

IMPORTANT INFO, PLEASE READ CAREFULLY BEFORE USING THIS FEATURE:
Once a directory is whitelisted for the pydata container, all Python processors have access to this directory and can both read and write data in it! As no rights management can be supported for this functionality, ALL USERS that have the right to create workflows and Python processors can thereby access ALL DATA that is contained in the whitelisted directory!
If all users of your ONE DATA instance are known to you and trusted, you can still use this feature, but you should be aware of its consequences.

The most basic workflow to read a CSV file with the Python Script Data Generator Processor (and afterwards persist it as a data table in ONE DATA) is shown in the following figure, simply connecting the processor with a subsequent Data Table Save Processor:



Note that the Python Script Data Generator has only one output, which means that only one data table can be created and supplied to ONE DATA. Multiple CSVs can be read inside the code, but only one data table can be written. If multiple data tables are required, multiple processors can be used.
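
As an illustration of reading several CSVs into a single output table, the sketch below stacks all CSV files of one directory into one DataFrame. It assumes that all files share the same columns and separator. The function name `read_csvs_as_one_table` and the extra `source_file` column are hypothetical and not part of ONE DATA.

```python
import glob
import os

import pandas as pd


def read_csvs_as_one_table(directory, separator=","):
    """Read every *.csv file in `directory` and stack them into one DataFrame."""
    frames = []
    for csv_path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        frame = pd.read_csv(csv_path, sep=separator)
        # Hypothetical helper column to remember which file each row came from
        frame["source_file"] = os.path.basename(csv_path)
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"No CSV files found in {directory}")
    # ignore_index=True gives the combined table one continuous row index
    return pd.concat(frames, ignore_index=True)
```

Inside the processor, the returned DataFrame would then be passed to the single output, for example with od_output.add_data("output", df).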

No further configuration of the Python processor is necessary, other than writing the Python code that reads the CSV file, alters it depending on your use case, and finally supplies it as the output of the processor.

The following code snippet shows example Python code that reads conventionally structured CSV files from the file system. In the example, the CSV "test_python_read.csv" is read from the path "/sampleData/" (which needs to be whitelisted in both the ONE DATA Server container and the pydata container!). Its separator is set to a comma ",".


import pandas as pd
import os
from io import StringIO


# Turn the first line of the CSV file into a list of column names for the data table
def create_col_names(line, separator):
    # Strip only the trailing line break; line[:-2] would also cut off a real character,
    # because text mode already converts "\r\n" to "\n"
    line = line.rstrip("\r\n")
    return line.split(separator)


# The directory that contains the CSV file to be read
# IMPORTANT NOTE: Make sure that it is whitelisted/accessible for OD and the pydata container
path = "/sampleData/"

list_of_files = os.listdir(path)

# The name of the CSV file to be read
file_name = "test_python_read.csv"

# Specify the separator that is used in the CSV file to separate entries
separator = ","

if file_name in list_of_files:
    print('Read file: ' + file_name)
    
    lines = []
    
    filename = path + file_name
    with open(filename, "r", encoding="utf-8") as dfile:
        for line in dfile:
            lines.append(line)

    columns = create_col_names(lines[0], separator)
    lines.pop(0)
    
    df = pd.read_csv(StringIO("".join(lines)), sep=separator, header=None)
    
    df.columns = columns
    
#    print(df)
    
    
else:
    # Fail with a clear message; otherwise df would be undefined in the output step below
    raise FileNotFoundError(f"File {file_name} was not found in directory {path}")
    
    
# Write the dataframe as output of this processor
od_output.add_data("output", df)
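
For files with a plain header row, the manual handling of the first line is usually not required, because pandas can take the column names from the header itself. The following is a more compact sketch under that assumption. The function name `read_csv_with_header` is hypothetical, and the directory still has to be whitelisted as described above.

```python
import os

import pandas as pd


def read_csv_with_header(directory, file_name, separator=","):
    """Read one CSV file whose first line holds the column names."""
    full_path = os.path.join(directory, file_name)
    if not os.path.isfile(full_path):
        raise FileNotFoundError(f"File {file_name} was not found in {directory}")
    # header=0 tells pandas to use the first line as column names
    return pd.read_csv(full_path, sep=separator, header=0, encoding="utf-8")
```

In the processor, this could be used as df = read_csv_with_header("/sampleData/", "test_python_read.csv"), followed by od_output.add_data("output", df) as in the example above.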