
Python processor

Version 3.0

You are viewing an outdated version; see the latest version.

Connector

The connector can be used with the platform background agent.

Connector processing type: Both (Row by row & Bulk). Default type: Row by row.

Debug script enabled.

The Python processor lets you run Python scripts on the CPython runtime (version 3.11). Although the Python ecosystem is very large, for security reasons your code can import only a limited yet powerful set of modules. The permitted standard-library modules appear in the list of allowed imports below.

We provide the following third-party modules:

  • pandas (2.0.3) Powerful data structures for data analysis, time series, and statistics

  • numpy (1.25.2) Fundamental package for array computing in Python

  • PyYAML (6.0.1) For parsing & building YAML content

  • openai (0.28.0) Client library for the OpenAI API

  • deepdiff (6.3.1) Deep Difference and Search of any Python object/data.

  • python-jose[cryptography] (3.3.0) JOSE implementation in Python

  • passlib (1.7.4) Comprehensive password hashing framework supporting over 30 schemes

  • httpx (0.25.0) Fully featured HTTP client library

  • matplotlib (3.8.2) Comprehensive library for creating static and animated visualizations in Python

Allowed imports are:

(
    #
    # STD
    "string",
    "math",
    "itertools",
    "random",
    "warnings",
    "base64",
    "io",
    "json",
    "xml",
    "ssl",
    "time",
    "datetime",
    #
    # 3RD PARTY
    "yaml",
    "httpx",
    "pandas",
    "numpy",
    "deepdiff",
    "passlib.hash",
    "jose",
    "jose.backends",
    "jose.constants",
    "jose.utils",
    "openai",
    "matplotlib",
    "matplotlib.pyplot",
)

All other imports are blocked. For example, you cannot import the os or sys modules; only the modules listed above can be imported. The following example shows which imports work and which do not:

# works
import string
import pandas as pd
from json import dumps
from xml import etree
# does not work
import os
import sys
from xml.etree import ElementTree

As the example shows, you can import the xml module but not the xml.etree module, because xml.etree is not in the list of allowed imports. The rule is an exact match on the module name: if module a is listed, you can import a but not a.b; if module a.b is listed, you can import a.b. Names can still be imported from an allowed module, as in from xml import etree.

The restricted Python processor also does not allow type hints. For example:

# works
number = 1
# does not work
number: int = 1

With the type hint, the code above fails with a syntax error, although it would run in the standard Python runtime. Remember to remove all type hints from your code before using it in the Python processor.
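Before pasting a script into the processor, it can help to check it locally against the two restrictions described above. The following is a minimal sketch, not the processor's actual validation logic: it assumes module names are matched exactly against the allow-list, and it flags only annotated assignments of the form shown above. The helper name find_problems is hypothetical. The sketch uses the ast module, which is not in the allow-list, so it is meant for a standard Python environment rather than the processor itself.

```python
import ast

# Allow-list copied from the "Allowed imports" listing in this document.
ALLOWED_IMPORTS = {
    "string", "math", "itertools", "random", "warnings", "base64", "io",
    "json", "xml", "ssl", "time", "datetime",
    "yaml", "httpx", "pandas", "numpy", "deepdiff", "passlib.hash",
    "jose", "jose.backends", "jose.constants", "jose.utils",
    "openai", "matplotlib", "matplotlib.pyplot",
}


def find_problems(source):
    """Hypothetical helper: list imports and annotated assignments
    that the processor would probably reject."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b" must match an allow-list entry exactly.
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # "from a.b import c" is checked on the module part "a.b" only.
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            if module not in ALLOWED_IMPORTS:
                problems.append(
                    f"line {node.lineno}: import of '{module}' is not allowed"
                )
        if isinstance(node, ast.AnnAssign):
            problems.append(
                f"line {node.lineno}: type hint on assignment is not allowed"
            )
    return problems


script = """import string
from xml import etree
import os
from xml.etree import ElementTree
number: int = 1
"""
for problem in find_problems(script):
    print(problem)
```

The sample script mixes allowed and disallowed lines, so only the os import, the xml.etree import, and the annotated assignment are reported.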

In addition, we also provide the following APIs:

# Async sleep
sleep(seconds: float) -> Coroutine

# Input data are stored in
INPUT_DATA

# Data checkpoint is stored in
DATA_CHECKPOINT

# Connector logger API
log.trace(message: str) -> None
log.debug(message: str) -> None
log.info(message: str) -> None
log.warn(message: str) -> None
# Calling this method will force the connector to mark statement execution as failed
log.error(message: str) -> None
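To try out a statement that uses these APIs outside the platform, you can wrap it in a small local harness. The sketch below rests on several assumptions: the _ListLogger class, the sample INPUT_DATA (assumed here to be a list of dictionaries), and the use of asyncio.sleep as a stand-in for the provided sleep are all illustrative and not part of the processor. The asyncio module is used only by the harness; it is not in the allow-list, so only the body of statement() would go into the processor.

```python
import asyncio


class _ListLogger:
    """Local stand-in (assumption) that records log calls in a list."""

    def __init__(self):
        self.records = []

    def _add(self, level, message):
        self.records.append((level, message))

    def trace(self, message):
        self._add("trace", message)

    def debug(self, message):
        self._add("debug", message)

    def info(self, message):
        self._add("info", message)

    def warn(self, message):
        self._add("warn", message)

    def error(self, message):
        # On the platform, this call also marks the execution as failed.
        self._add("error", message)


# Stand-ins for the names the processor provides at run time.
log = _ListLogger()
sleep = asyncio.sleep
INPUT_DATA = [{"name": "Alice"}, {"name": ""}, {"name": "Bob"}]  # assumed shape


async def statement():
    # The body of this function is what would be pasted into the processor.
    log.info(f"Received {len(INPUT_DATA)} input rows")
    result = []
    for row in INPUT_DATA:
        if not row["name"]:
            log.warn("Skipping a row with an empty name")
            continue
        await sleep(0.01)  # sleep returns a coroutine, so it must be awaited
        result.append({"name": row["name"].upper()})
    return result


output = asyncio.run(statement())
print(output)
```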

Examples

Examples of Python processor usage can be found in Python processor examples.

Configuration

Python statement configuration

Statement

The Python statement to be executed by the Python processor service. The connector expects the statement to return a list of objects whose structure matches the output schema. The Statement is the body of an async function, and the value it returns is parsed and returned as the output of the connector. The last line of the Statement must therefore be:

return items_in_list

Example

users = [
    { "name": "Alice" },
    { "name": "Bob" },
    { "name": "Charlie" },
]
return users

Note

Because the statement is async, you must await every coroutine. Code that is not awaited may not finish before the statement returns.
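The effect of a missing await can be seen in a small local sketch. It assumes asyncio.sleep as a stand-in for the provided sleep, and save is a hypothetical helper representing any coroutine-based call. Calling save(...) without await only creates a coroutine object; the body runs once it is awaited.

```python
import asyncio

sleep = asyncio.sleep  # local stand-in (assumption) for the provided sleep API
saved = []


async def save(item):
    """Hypothetical async helper standing in for any coroutine-based call."""
    await sleep(0)
    saved.append(item)


async def statement():
    pending = save("first")           # creates a coroutine object; nothing runs yet
    saved_before_await = list(saved)  # snapshot taken before the await: still empty
    await pending                     # the body of save() executes here
    await save("second")              # the usual form: call and await in one step
    return [{"item": item} for item in saved], saved_before_await


output, saved_before_await = asyncio.run(statement())
print(saved_before_await)
print(output)
```

The snapshot taken before the await stays empty, which is the situation the note above warns about: work that has not been awaited has not happened yet.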

Data checkpoint column

The data checkpoint column is a column (field) from which the platform takes the value of the last row after each executed task run and stores it as the data checkpoint. You can use the data checkpoint value in Python statements to control which data is processed in the next run; refer to it through the predefined variable DATA_CHECKPOINT.

Example of use: processing data in cycles, where each cycle handles only a subset of the data set because the whole set is too large. If you use, for example, the record ID as the data checkpoint column, the platform stores the last processed ID from the subset after each task run. If your statement compares the data checkpoint value with the IDs of the records in the data set, only records that have not yet been processed are considered in the next task run.
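The cyclic processing described above can be sketched as a small filter function. This is an illustration under stated assumptions, not the platform's behavior: select_unprocessed is a hypothetical helper, records are assumed to be dictionaries with a numeric "id" field, and the checkpoint is assumed to be empty (None or an empty string) before the first run and possibly delivered as a string afterwards.

```python
def select_unprocessed(rows, checkpoint, batch_size):
    """Return the next batch of rows whose "id" is greater than the checkpoint."""
    # Assumption: an empty checkpoint means nothing has been processed yet.
    last_id = int(checkpoint) if checkpoint not in (None, "") else 0
    pending = [row for row in rows if row["id"] > last_id]
    # Sort ascending so the last returned row carries the highest ID,
    # which is the value the platform would store as the new checkpoint.
    pending.sort(key=lambda row: row["id"])
    return pending[:batch_size]


rows = [{"id": 3}, {"id": 1}, {"id": 5}, {"id": 2}, {"id": 4}]

# Simulate two consecutive task runs with a batch size of 2.
first_run = select_unprocessed(rows, None, 2)
second_run = select_unprocessed(rows, first_run[-1]["id"], 2)
print(first_run)
print(second_run)
```

Inside a statement, the same function could end with return select_unprocessed(INPUT_DATA, DATA_CHECKPOINT, 100), assuming INPUT_DATA holds such records and "id" is configured as the data checkpoint column.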

Input & Output Schema

Input

Data schema is optional

The connector does not expect a specific schema. The required data structure can be achieved by correct configuration. Although the connector does not generally require a schema, an individual integration task step may need to match the output data structure of the preceding task step, using either a data schema selected from the repository or a newly created input schema.

Output

Data schema is mandatory

The connector requires an input or output data schema, which the user must either select from the existing data schema repository or create as a new one. The connector will fail without structured data.

Release notes

3.0.2

  • Fixed processing sensitive errors

3.0.0

  • First release