# Using Data Pools in Services
This guide explains how to use the DataPool feature to work with datasets and file collections within your services.
## What is a Data Pool?
A Data Pool is a managed collection of files, similar to a directory or a folder, that can be attached to your service at runtime. It provides a simple and efficient way to access large datasets, pre-trained models, or any other file-based resources without having to include them directly in your service's deployment package.
When you use a Data Pool, the platform mounts the specified file collection into your service's runtime environment. The planqk-commons library provides a convenient DataPool abstraction to interact with these mounted files.
## Data Pool Limits
Data Pools are designed to handle large datasets, but there are some limits to keep in mind:
- The maximum size of a single file in a Data Pool is 500 MB.
- The files are mounted using a blob storage technology, which means performance may vary based on the size and number of files.
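If you want to catch oversized files before they ever reach a Data Pool, a small local check is enough. The helper below is a hypothetical sketch (it is not part of planqk-commons) and only encodes the 500 MB per-file limit stated above:

```python
# Hypothetical helper: check a local directory against the 500 MB per-file limit
# described in this guide before uploading its contents to a Data Pool.
from pathlib import Path

MAX_FILE_SIZE_BYTES = 500 * 1024 * 1024  # 500 MB per file


def find_oversized_files(directory: str) -> list[Path]:
    """Return all files in `directory` that exceed the per-file limit."""
    return [
        path
        for path in Path(directory).rglob("*")
        if path.is_file() and path.stat().st_size > MAX_FILE_SIZE_BYTES
    ]


if __name__ == "__main__":
    for oversized in find_oversized_files("./input/my_dataset"):
        print(f"{oversized} exceeds the 500 MB per-file limit")
```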
## How to Use the DataPool Class
To use a Data Pool in your service, you simply need to declare a parameter of type DataPool in your run method. The runtime will automatically detect this and inject a DataPool object that corresponds to the mounted file collection.
### The DataPool Object
The `DataPool` object, found in `planqk.commons.datapool`, provides the following methods and properties to interact with the files in the mounted directory:

- `list_files() -> Dict[str, str]`: Returns a dictionary of all files in the Data Pool, where the keys are the file names and the values are their absolute paths.
- `open(file_name: str, mode: str = "r")`: Opens a specific file within the Data Pool and returns a file handle, similar to Python's built-in `open()` function.
- `path`: A property that returns the absolute path to the mounted Data Pool directory.
- `name`: A property that returns the name of the Data Pool (which corresponds to the parameter name in your `run` method).
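Putting these together, here is a minimal sketch of a `run` method that exercises the methods and properties above; the file name `example.txt` is only an assumption for illustration:

```python
from planqk.commons.datapool import DataPool


def run(my_dataset: DataPool) -> dict:
    # `name` is the parameter name ("my_dataset"), `path` the mounted directory
    summary = {"name": my_dataset.name, "path": my_dataset.path}

    # list_files() maps file names to their absolute paths
    summary["files"] = list(my_dataset.list_files().keys())

    # open() works like the built-in open() for files inside the Data Pool
    if "example.txt" in summary["files"]:
        with my_dataset.open("example.txt") as f:
            summary["first_line"] = f.readline().strip()

    return summary
```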
## Tutorial: Building a Service with a Data Pool
Let's walk through an example of a service that reads data from a Data Pool.
### 1. Initialize a New Project
If you haven't already, create a new service project. You can use the CLI to set up a new service:
```bash
planqk init
cd [user_code]
uv venv
source .venv/bin/activate
uv sync
```

For the rest of this guide, we assume that you created your service in a directory named `user_code`, with the main code in `user_code/src/`.
### 2. Update the run Method
In your program.py, define a run method that accepts a DataPool parameter. The name of the parameter (e.g., my_dataset) is important, as it will be used to identify the Data Pool in the API call.
```python
# user_code/src/program.py
from planqk.commons.datapool import DataPool
from pydantic import BaseModel


class InputData(BaseModel):
    file_to_read: str


def run(data: InputData, my_dataset: DataPool) -> str:
    """
    Reads the content of a specified file from a Data Pool.
    """
    try:
        # Use the open() method to read a file from the Data Pool
        with my_dataset.open(data.file_to_read) as f:
            content = f.read()
        return content
    except FileNotFoundError:
        return f"File '{data.file_to_read}' not found in the Data Pool."
```

In this example, the `run` method expects a Data Pool to be provided for the `my_dataset` parameter.
### 3. Local Testing with Data Pools
When developing and testing your service locally, you don't have access to the platform's Data Pool mounting system. However, you can easily simulate this by creating a local directory and passing it to your run method.
#### Steps for Local Testing
1. Create a local directory for your Data Pool. This directory should be placed inside the `user_code/input` directory. The name of this directory can be anything, but for this example, we'll name it `my_dataset` to match the parameter in the `run` method.
2. Populate the directory with your test files. Place any files you need for your test inside this directory (e.g., `user_code/input/my_dataset/hello.txt`). Add the value `Hello` to the `hello.txt` file.
3. Update the `__main__.py` file. Modify your main entrypoint to manually create a `DataPool` instance and pass it to the `run` function. You will create the `DataPool` object with a relative path to your local Data Pool directory.
4. Run your service. You can now run your service directly without setting any environment variables.

```bash
# Run your service's main entrypoint
cd user_code
python -m src
```
#### Example
Let's assume your project has the following structure:
```
user_code
├── src/
│   ├── __main__.py
│   └── program.py
└── input/
    ├── data.json
    └── my_dataset/
        └── hello.txt
```

And `user_code/input/data.json` contains:

```json
{
  "file_to_read": "hello.txt"
}
```

Update your `user_code/src/__main__.py` to look like this:
```python
# user_code/src/__main__.py
import json
import os

from planqk.commons.constants import OUTPUT_DIRECTORY_ENV
from planqk.commons.datapool import DataPool
from planqk.commons.json import any_to_json
from planqk.commons.logging import init_logging

from .program import InputData, run

init_logging()

# This file is executed if you run `python -m src` from the project root. Use this file to test your program locally.
# You can read the input data from the `input` directory and map it to the respective parameter of the `run()` function.

# Redirect the platform's output directory for local testing
directory = "./out"
os.makedirs(directory, exist_ok=True)
os.environ[OUTPUT_DIRECTORY_ENV] = directory

with open("./input/data.json") as file:
    data = InputData.model_validate(json.load(file))

result = run(data, my_dataset=DataPool("./input/my_dataset"))

print(any_to_json(result))
```

The `__main__.py` script now manually creates the `DataPool` object and passes it to your `run` function, simulating the behavior of the platform and allowing you to test your `run` method's logic with local files.
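If you prefer an automated check over running the entrypoint manually, the same idea works in a unit test. The following is a hypothetical pytest sketch; it assumes that a `DataPool` can be constructed from any local directory path (as shown above) and that the tests are run from the `user_code` directory so that the `src` package is importable:

```python
# user_code/tests/test_program.py (hypothetical test module)
from planqk.commons.datapool import DataPool

from src.program import InputData, run


def test_run_reads_file_from_local_datapool(tmp_path):
    # Create a temporary directory that stands in for the mounted Data Pool
    (tmp_path / "hello.txt").write_text("Hello")

    result = run(InputData(file_to_read="hello.txt"), my_dataset=DataPool(str(tmp_path)))

    assert result == "Hello"
```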
Now run the service:
```bash
python -m src
```

### 4. Use Multiple Data Pools
If your service needs to work with multiple Data Pools, you can simply add more parameters of type DataPool to your run method.
```python
import os

from planqk.commons.datapool import DataPool
from pydantic import BaseModel


class InputData(BaseModel):
    file_to_read_from_my_dataset: str
    file_to_read_from_another_dataset: str


def run(data: InputData, my_dataset: DataPool, another_dataset: DataPool, output_datapool: DataPool) -> str:
    """
    Combines the content of the specified files from two different Data Pools in a third Data Pool.
    """
    try:
        with my_dataset.open(data.file_to_read_from_my_dataset) as f1:
            content1 = f1.read()
        with another_dataset.open(data.file_to_read_from_another_dataset) as f2:
            content2 = f2.read()

        # You can also write to the output Data Pool if needed
        concatenated_file = os.path.join(output_datapool.path, "concatenated_output.txt")
        with open(concatenated_file, "w") as out_file:
            out_file.write(content1)
            out_file.write(content2)

        return f"Content from my_dataset: {content1}\nContent from another_dataset: {content2}"
    except FileNotFoundError as e:
        return str(e)
```

For local development, you would create the additional directories in your input folder (e.g., `another_dataset` and `output_datapool`) and pass each as a separate `DataPool` parameter in the `__main__.py`:
```python
# user_code/src/__main__.py
# ... see above ...
result = run(data, my_dataset=DataPool("./input/my_dataset"), another_dataset=DataPool("./input/another_dataset"), output_datapool=DataPool("./input/output_datapool"))
# ... see above ...
```

Create the `another_dataset` and `output_datapool` directories in your input folder. Then, create the file `input/another_dataset/world.txt` with the content `World`.
Before running the service, update the data in input/data.json to include the new file names:
```json
{
  "file_to_read_from_my_dataset": "hello.txt",
  "file_to_read_from_another_dataset": "world.txt"
}
```

After running the service, you should see the concatenated output file at `input/output_datapool/concatenated_output.txt`, containing `HelloWorld`.
```bash
python -m src
```

### 5. Configuring the Data Pool in the API Call
When you execute this service via the platform API, you need to specify which Data Pool to mount. This is done by providing a special JSON object in the request body. The key of this object must match the DataPool parameter name in your run method (my_dataset in our example).
The JSON object has two fields:
- `id`: The unique identifier (UUID) of the Data Pool you want to use.
- `ref`: A static value that must be `"DATAPOOL"`.
Here is an example of a request body for our first version of the service:
```json
{
  "data": {
    "file_to_read": "hello.txt"
  },
  "my_dataset": {
    "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  }
}
```

When the service is executed with this input, the platform will:
- Identify that the `my_dataset` parameter is a Data Pool reference.
- Mount the Data Pool with the specified `id`.
- Instantiate a `DataPool` object pointing to the mounted directory.
- Inject this `DataPool` object into the `run` method as the `my_dataset` argument.
Your code can then use the my_dataset object to interact with the files in the mounted Data Pool.
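For illustration, here is a hedged client-side sketch of submitting such a request with Python's `requests` library. The endpoint URL, authentication header, and token are placeholders, not the platform's actual API; only the request body structure is taken from this guide:

```python
# Hypothetical client-side sketch: the URL and authentication details are
# placeholders and must be replaced with the values from your platform account.
import requests

SERVICE_ENDPOINT = "https://<platform-host>/<your-service>/executions"  # placeholder
API_TOKEN = "<your-api-token>"  # placeholder

payload = {
    "data": {"file_to_read": "hello.txt"},
    "my_dataset": {
        "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
        "ref": "DATAPOOL",
    },
}

response = requests.post(
    SERVICE_ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},  # placeholder auth scheme
)
response.raise_for_status()
print(response.json())
```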
Here is an example of a request body for our second version of the service:
```json
{
  "data": {
    "file_to_read_from_my_dataset": "hello.txt",
    "file_to_read_from_another_dataset": "world.txt"
  },
  "my_dataset": {
    "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  },
  "another_dataset": {
    "id": "b1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  },
  "output_datapool": {
    "id": "c1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  }
}
```

## OpenAPI Specification for Data Pools
When you generate an OpenAPI specification for a service that uses a DataPool, the planqk openapi library automatically creates the correct schema for the Data Pool parameter. Instead of showing the internal structure of the DataPool class, it generates a schema that reflects the expected API input format.
For the my_dataset: DataPool parameter, the generated OpenAPI schema will look like this:
```yaml
my_dataset:
  type: object
  properties:
    id:
      type: string
      format: uuid
      description: UUID of the Data Pool to mount
    ref:
      type: string
      enum: [DATAPOOL]
      description: Reference type indicating this is a Data Pool
  required:
    - id
    - ref
  additionalProperties: false
```

This ensures that the API documentation accurately represents how to use the service and provides a clear contract for API clients.

