# Using Data Pools in Services
This guide explains how to use the DataPool feature to work with datasets and file collections within your services.
## What is a Data Pool?
A Data Pool is a managed collection of files, similar to a directory or a folder, that can be attached to your service at runtime. It provides a simple and efficient way to access large datasets, pre-trained models, or any other file-based resources without having to include them directly in your service's deployment package.
When you use a Data Pool, the platform mounts the specified file collection into your service's runtime environment. The planqk-commons library provides a convenient DataPool abstraction to interact with these mounted files.
## Data Pool Limits
Data Pools are designed to handle large datasets, but there are some limits to keep in mind:
- The maximum size of a single file in a Data Pool is 500 MB.
- The files are mounted using a blob storage technology, which means performance may vary based on the size and number of files.
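If you want to catch oversized files before they ever reach a Data Pool, a small local check is enough. The helper below is a hypothetical sketch (it is not part of planqk-commons) and only encodes the 500 MB per-file limit stated above:

```python
# Hypothetical helper: check a local directory against the 500 MB per-file limit
# described in this guide before uploading its contents to a Data Pool.
from pathlib import Path

MAX_FILE_SIZE_BYTES = 500 * 1024 * 1024  # 500 MB per file


def find_oversized_files(directory: str) -> list[Path]:
    """Return all files in `directory` that exceed the per-file limit."""
    return [
        path
        for path in Path(directory).rglob("*")
        if path.is_file() and path.stat().st_size > MAX_FILE_SIZE_BYTES
    ]


if __name__ == "__main__":
    for oversized in find_oversized_files("./input/my_dataset"):
        print(f"{oversized} exceeds the 500 MB per-file limit")
```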
## How to Use the DataPool Class
To use a Data Pool in your service, you simply need to declare a parameter of type DataPool in your run method. The runtime will automatically detect this and inject a DataPool object that corresponds to the mounted file collection.
### The DataPool Object
The `DataPool` object, found in `planqk.commons.datapool`, provides the following methods and properties to interact with the files in the mounted directory:

- `list_files() -> Dict[str, str]`: Returns a dictionary of all files in the Data Pool, where the keys are the file names and the values are their absolute paths.
- `open(file_name: str, mode: str = "r")`: Opens a specific file within the Data Pool and returns a file handle, similar to Python's built-in `open()` function.
- `path`: A property that returns the absolute path to the mounted Data Pool directory.
- `name`: A property that returns the name of the Data Pool (which corresponds to the parameter name in your `run` method).
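Putting these together, here is a minimal sketch of a `run` method that exercises the methods and properties above; the file name `example.txt` is only an assumption for illustration:

```python
from planqk.commons.datapool import DataPool


def run(my_dataset: DataPool) -> dict:
    # `name` is the parameter name ("my_dataset"), `path` the mounted directory
    summary = {"name": my_dataset.name, "path": my_dataset.path}

    # list_files() maps file names to their absolute paths
    summary["files"] = list(my_dataset.list_files().keys())

    # open() works like the built-in open() for files inside the Data Pool
    if "example.txt" in summary["files"]:
        with my_dataset.open("example.txt") as f:
            summary["first_line"] = f.readline().strip()

    return summary
```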
## Tutorial: Building a Service with a Data Pool
Let's walk through an example of a service that reads data from a Data Pool.
### 1. Initialize a New Project
If you haven't already, create a new service project. You can use the CLI to set up a new service:
```bash
planqk init
cd [user_code]
uv venv
source .venv/bin/activate
uv sync
```

For the rest of this guide, we assume that you created your service in a directory named `user_code`, with the main code in `user_code/src/`.
### 2. Update the run Method
In your program.py, define a run method that accepts a DataPool parameter. The name of the parameter (e.g., my_dataset) is important, as it will be used to identify the Data Pool in the API call.
```python
# user_code/src/program.py
from planqk.commons.datapool import DataPool
from pydantic import BaseModel


class InputData(BaseModel):
    file_to_read: str


def run(data: InputData, my_dataset: DataPool) -> str:
    """
    Reads the content of a specified file from a Data Pool.
    """
    try:
        # Use the open() method to read a file from the Data Pool
        with my_dataset.open(data.file_to_read) as f:
            content = f.read()
        return content
    except FileNotFoundError:
        return f"File '{data.file_to_read}' not found in the Data Pool."
```

In this example, the `run` method expects a Data Pool to be provided for the `my_dataset` parameter.
### 3. Local Testing with Data Pools
When developing and testing your service locally, you don't have access to the platform's Data Pool mounting system. However, you can easily simulate this by creating a local directory and passing it to your run method.
#### Steps for Local Testing
1. Create a local directory for your Data Pool. This directory should be placed inside the `user_code/input` directory. The name of this directory can be anything, but for this example, we'll name it `my_dataset` to match the parameter in the `run` method.
2. Populate the directory with your test files. Place any files you need for your test inside this directory (e.g., `user_code/input/my_dataset/hello.txt`). Add the value `Hello` to the `hello.txt` file.
3. Update the `__main__.py` file. Modify your main entrypoint to manually create a `DataPool` instance and pass it to the `run` function. You will create the `DataPool` object with a relative path to your local Data Pool directory.
4. Run your service. You can now run your service directly without setting any environment variables.

```bash
# Run your service's main entrypoint
cd user_code
python -m src
```
#### Example
Let's assume your project has the following structure:
```
user_code
├── src/
│   ├── __main__.py
│   └── program.py
└── input/
    ├── data.json
    └── my_dataset/
        └── hello.txt
```

And `user_code/input/data.json` contains:

```json
{
  "file_to_read": "hello.txt"
}
```

Update your `user_code/src/__main__.py` to look like this:
```python
# user_code/src/__main__.py
import json
import os

from planqk.commons.constants import OUTPUT_DIRECTORY_ENV
from planqk.commons.datapool import DataPool
from planqk.commons.json import any_to_json
from planqk.commons.logging import init_logging

from .program import InputData, run

init_logging()

# This file is executed if you run `python -m src` from the project root. Use this file to test your program locally.
# You can read the input data from the `input` directory and map it to the respective parameter of the `run()` function.

# Redirect the platform's output directory for local testing
directory = "./out"
os.makedirs(directory, exist_ok=True)
os.environ[OUTPUT_DIRECTORY_ENV] = directory

with open("./input/data.json") as file:
    data = InputData.model_validate(json.load(file))

result = run(data, my_dataset=DataPool("./input/my_dataset"))

print(any_to_json(result))
```

The `__main__.py` script now manually creates the `DataPool` object and passes it to your `run` function, simulating the behavior of the platform and allowing you to test your `run` method's logic with local files.
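If you prefer an automated check over running the entrypoint manually, the same idea works in a unit test. The following is a hypothetical pytest sketch; it assumes that a `DataPool` can be constructed from any local directory path (as shown above) and that the tests are run from the `user_code` directory so that the `src` package is importable:

```python
# user_code/tests/test_program.py (hypothetical test module)
from planqk.commons.datapool import DataPool

from src.program import InputData, run


def test_run_reads_file_from_local_datapool(tmp_path):
    # Create a temporary directory that stands in for the mounted Data Pool
    (tmp_path / "hello.txt").write_text("Hello")

    result = run(InputData(file_to_read="hello.txt"), my_dataset=DataPool(str(tmp_path)))

    assert result == "Hello"
```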
Now run the service:
```bash
python -m src
```

### 4. Use Multiple Data Pools
If your service needs to work with multiple Data Pools, you can simply add more parameters of type DataPool to your run method.
```python
import os

from planqk.commons.datapool import DataPool
from pydantic import BaseModel


class InputData(BaseModel):
    file_to_read_from_my_dataset: str
    file_to_read_from_another_dataset: str


def run(data: InputData, my_dataset: DataPool, another_dataset: DataPool, output_datapool: DataPool) -> str:
    """
    Combines the content of the specified files from two different Data Pools in a third Data Pool.
    """
    try:
        with my_dataset.open(data.file_to_read_from_my_dataset) as f1:
            content1 = f1.read()
        with another_dataset.open(data.file_to_read_from_another_dataset) as f2:
            content2 = f2.read()

        # You can also write to the output Data Pool if needed
        concatenated_file = os.path.join(output_datapool.path, "concatenated_output.txt")
        with open(concatenated_file, "w") as out_file:
            out_file.write(content1)
            out_file.write(content2)

        return f"Content from my_dataset: {content1}\nContent from another_dataset: {content2}"
    except FileNotFoundError as e:
        return str(e)
```

For local development, you would create the additional directories in your input folder (e.g., `another_dataset` and `output_datapool`) and pass each as a separate `DataPool` parameter in the `__main__.py`:
```python
# user_code/src/__main__.py
# ... see above ...
result = run(data, my_dataset=DataPool("./input/my_dataset"), another_dataset=DataPool("./input/another_dataset"), output_datapool=DataPool("./input/output_datapool"))
# ... see above ...
```

Create the `another_dataset` and `output_datapool` directories in your input folder. Then, create the file `input/another_dataset/world.txt` with the content `World`.
Before running the service, update the data in input/data.json to include the new file names:
```json
{
  "file_to_read_from_my_dataset": "hello.txt",
  "file_to_read_from_another_dataset": "world.txt"
}
```

After running the service, you should see the concatenated output file at `input/output_datapool/concatenated_output.txt`, containing `HelloWorld`.
```bash
python -m src
```

### 5. Configuring the Data Pool in the API Call
When you execute this service via the platform API, you need to specify which Data Pool to mount. This is done by providing a special JSON object in the request body. The key of this object must match the DataPool parameter name in your run method (my_dataset in our example).
The JSON object has two fields:
- `id`: The unique identifier (UUID) of the Data Pool you want to use.
- `ref`: A static value that must be `"DATAPOOL"`.
Here is an example of a request body for our first version of the service:
```json
{
  "data": {
    "file_to_read": "hello.txt"
  },
  "my_dataset": {
    "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  }
}
```

When the service is executed with this input, the platform will:
- Identify that the `my_dataset` parameter is a Data Pool reference.
- Mount the Data Pool with the specified `id`.
- Instantiate a `DataPool` object pointing to the mounted directory.
- Inject this `DataPool` object into the `run` method as the `my_dataset` argument.
Your code can then use the my_dataset object to interact with the files in the mounted Data Pool.
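For illustration, here is a hedged client-side sketch of submitting such a request with Python's `requests` library. The endpoint URL, authentication header, and token are placeholders, not the platform's actual API; only the request body structure is taken from this guide:

```python
# Hypothetical client-side sketch: the URL and authentication details are
# placeholders and must be replaced with the values from your platform account.
import requests

SERVICE_ENDPOINT = "https://<platform-host>/<your-service>/executions"  # placeholder
API_TOKEN = "<your-api-token>"  # placeholder

payload = {
    "data": {"file_to_read": "hello.txt"},
    "my_dataset": {
        "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
        "ref": "DATAPOOL",
    },
}

response = requests.post(
    SERVICE_ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},  # placeholder auth scheme
)
response.raise_for_status()
print(response.json())
```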
Here is an example of a request body for our second version of the service:
```json
{
  "data": {
    "file_to_read_from_my_dataset": "hello.txt",
    "file_to_read_from_another_dataset": "world.txt"
  },
  "my_dataset": {
    "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  },
  "another_dataset": {
    "id": "b1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  },
  "output_datapool": {
    "id": "c1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  }
}
```

## OpenAPI Specification for Data Pools
When you generate an OpenAPI specification for a service that uses a DataPool, the planqk openapi library automatically creates the correct schema for the Data Pool parameter. Instead of showing the internal structure of the DataPool class, it generates a schema that reflects the expected API input format.
For the my_dataset: DataPool parameter, the generated OpenAPI schema will look like this:
```yaml
my_dataset:
  type: object
  properties:
    id:
      type: string
      format: uuid
      description: UUID of the Data Pool to mount
    ref:
      type: string
      enum: [DATAPOOL]
      description: Reference type indicating this is a Data Pool
  required:
    - id
    - ref
  additionalProperties: false
```

This ensures that the API documentation accurately represents how to use the service and provides a clear contract for API clients.

