ETL
CLI Reference for ETLs
This section documents ETL management operations with ais etl
. But first, note:
As with global rebalance, dSort, and download, all ETL management commands can also be executed via
ais job
andais show
—the commands that, by definition, support all AIS xactions, including AIS-ETL.
For background on AIS-ETL, getting started, working examples, and tutorials, please refer to:
Table of Contents
- Commands
- Initialize ETL with a specification file
- Initialize ETL with code
- List all ETLs
- View ETL details
- View ETL logs
- Stop ETL
- Start ETL
- Transform an object on-the-fly
- Transform an entire bucket
Commands
Top-level ETL commands include init
, stop
, show
, and more:
$ ais etl --help
NAME:
ais etl - Execute custom transformations on objects
USAGE:
ais etl command [arguments...] [command options]
COMMANDS:
init Start ETL job: 'spec' job (requires pod yaml specification) or 'code' job (with transforming function or script in a local file)
show Show ETL(s)
view-logs View ETL logs
start Start ETL
stop Stop ETL
rm Remove ETL
object Transform an object
bucket Transform entire bucket or selected objects (to select, use '--list', '--template', or '--prefix')
OPTIONS:
--help, -h Show help
Additionally, use --help
to display any specific command, e.g.:
Init ETL with a specification file
ais etl init spec --from-file=SPEC_FILE --name=ETL_NAME [--comm-type=COMMUNICATION_TYPE] [--wait-timeout=TIMEOUT] [--arg-type=ARGUMENT_TYPE]
or
ais start etl init
Initializes an ETL from a Pod YAML specification file. The --name
parameter assigns a unique name to the ETL. See ETL name specifications for valid names.
Example
Initialize an ETL that computes the MD5 hash of an object.
$ cat spec.yaml
apiVersion: v1
kind: Pod
metadata:
name: transformer-md5
spec:
containers:
- name: server
image: aistore/transformer_md5:latest
ports:
- name: default
containerPort: 80
command: ['/code/server.py', '--listen', '0.0.0.0', '--port', '80']
$ ais etl init spec --from-file=spec.yaml --name=transformer-md5 --comm-type=hpull:// --wait-timeout=1m
transformer-md5
Init ETL with code
ais etl init code --name=ETL_NAME --from-file=CODE_FILE --runtime=RUNTIME [--chunk-size=NUM_OF_BYTES] [--transform=TRANSFORM_FUNC] [--before=BEFORE_FUNC] [--after=AFTER_FUNC] [--deps-file=DEPS_FILE] [--comm-type=COMMUNICATION_TYPE] [--wait-timeout=TIMEOUT] [--arg-type=ARGUMENT_TYPE]
This initializes an ETL from a provided CODE_FILE
that contains:
transform(input_bytes)
: The main transformation function.before(context)
: An optional pre-processing function.after(context)
: An optional post-processing function.
The --name
parameter assigns a unique name to the ETL (see ETL name specifications).
Note:
- The default value for
--transform
is"transform"
. - Available runtimes are listed here.
Example
Initialize an ETL that computes the MD5 hash of an object.
$ cat code.py
import hashlib
def transform(input_bytes):
md5 = hashlib.md5()
md5.update(input_bytes)
return md5.hexdigest().encode()
$ ais etl init code --from-file=code.py --runtime=python3.11v2 --name=transformer-md5 --comm-type hpull
transformer-md5
With before(context)
and after(context)
functions using streaming (CHUNK_SIZE
> 0):
$ cat code.py
import hashlib
def before(context):
context["before"] = hashlib.md5()
return context
def transform(input_bytes, context):
context["before"].update(input_bytes)
def after(context):
return context["before"].hexdigest().encode()
$ ais etl init code --name=etl-md5 --from-file=code.py --runtime=python3.11v2 --chunk-size=32768 --before=before --after=after --comm-type hpull
List ETLs
ais etl show
or equivalently:
ais job show etl
Lists all available ETLs.
View ETL details
ais etl show details <ETL_NAME>
Displays details about a specific ETL, including:
- ETL Name
- Communication Type
- Specification or Code
- Argument Type
View ETL Logs
ais etl view-logs ETL_NAME [TARGET_ID]
Outputs logs for the given ETL. An optional TARGET_ID
can be specified to retrieve logs from a particular target node.
Stop ETL
ais etl stop ETL_NAME
Stops the specified ETL.
Start ETL
ais etl start ETL_NAME
Starts the specified ETL.
Transform an object on-the-fly with a given ETL
ais etl object ETL_NAME BUCKET/OBJECT_NAME OUTPUT
Examples
Transform object to STDOUT
Compute the MD5 hash of shards/shard-0.tar
and print it.
$ ais etl object transformer-md5 ais://shards/shard-0.tar -
393c6706efb128fbc442d3f7d084a426
Transform object and save to file
$ ais etl object transformer-md5 ais://shards/shard-0.tar output.txt
$ cat output.txt
393c6706efb128fbc442d3f7d084a426
Transform a bucket offline with the given ETL
ais etl bucket ETL_NAME SRC_BUCKET DST_BUCKET
Transforms all or selected objects and places them in another bucket.
Available Flags
Flag | Description |
---|---|
--list |
Comma-separated list of object names (e.g., ‘obj1,obj2’). |
--template |
Template for matching object names (e.g., ‘obj-{000..100}.tar’). |
--ext |
Mapping for extension transformation (e.g., {jpg:txt} ). |
--prefix |
Prefix for transformed objects. |
--wait |
Wait for operation to finish. |
--requests-timeout |
Timeout for a single object transformation. |
--dry-run |
Show transformation results without applying changes. |
--num-workers |
Number of concurrent workers. |
Examples
Transform an entire bucket
$ ais etl bucket transformer-md5 ais://src_bucket ais://dst_bucket
$ ais wait xaction <XACTION_ID>
Transform selected objects
$ ais etl bucket transformer-md5 ais://src_bucket ais://dst_bucket --template "shard-{10..12}.tar"
Transform bucket with extension mapping
$ ais etl bucket transformer-md5 ais://src_bucket ais://dst_bucket --ext="{in1:out1, in2:out2}" --prefix="etl-" --wait
Perform a dry-run
$ ais etl bucket transformer-md5 ais://src_bucket ais://dst_bucket --dry-run --wait
[DRY RUN] No modifications on the cluster
2 objects (20MiB) would have been put into bucket ais://dst_bucket