BUCKET
When objects are called shards
In this document:
- commands to read, write, extract, and list archives - objects formatted as
TAR
,TGZ
(orTAR.GZ
) ,ZIP
, orTAR.LZ4
.
For the most recently updated list of supported archival formats, please refer to this source.
The corresponding subset of CLI commands starts with ais archive
, from where you can <TAB-TAB>
to the actual (reading, writing, etc.) operation.
$ ais archive --help
NAME:
ais archive get - get a shard and extract its content; get an archived file;
write the content locally with destination options including: filename, directory, STDOUT ('-'), or '/dev/null' (discard);
assorted options further include:
- '--prefix' to get multiple shards in one shot (empty prefix for the entire bucket);
- '--progress' and '--refresh' to watch progress bar;
- '-v' to produce verbose output when getting multiple objects.
'ais archive get' examples:
- ais://abc/trunk-0123.tar.lz4 /tmp/out - get and extract entire shard to /tmp/out/trunk/*
- ais://abc/trunk-0123.tar.lz4 --archpath file45.jpeg /tmp/out - extract one named file
- ais://abc/trunk-0123.tar.lz4/file45.jpeg /tmp/out - same as above (and note that '--archpath' is implied)
- ais://abc/trunk-0123.tar.lz4/file45 /tmp/out/file456.new - same as above, with destination explicitly (re)named
'ais archive get' multi-selection examples:
- ais://abc/trunk-0123.tar 111.tar --archregx=jpeg --archmode=suffix - return 111.tar with all *.jpeg files from a given shard
- ais://abc/trunk-0123.tar 222.tar --archregx=file45 --archmode=wdskey - return 222.tar with all file45.* files --/--
- ais://abc/trunk-0123.tar 333.tar --archregx=subdir/ --archmode=prefix - 333.tar with all subdir/* files --/--
USAGE:
ais archive get [command options] BUCKET[/SHARD_NAME] [OUT_FILE|OUT_DIR|-]
OPTIONS:
--checksum validate checksum
--yes, -y assume 'yes' to all questions
--latest check in-cluster metadata and, possibly, GET, download, prefetch, or copy the latest object version
from the associated remote bucket:
- provides operation-level control over object versioning (and version synchronization)
without requiring to change bucket configuration
- the latter can be done using 'ais bucket props set BUCKET versioning'
- see also: 'ais ls --check-versions', 'ais cp', 'ais prefetch', 'ais get'
--refresh value time interval for continuous monitoring; can be also used to update progress bar (at a given interval);
valid time units: ns, us (or µs), ms, s (default), m, h
--progress show progress bar(s) and progress of execution in real time
--blob-download utilize built-in blob-downloader (and the corresponding alternative datapath) to read very large remote objects
--chunk-size value chunk size in IEC or SI units, or "raw" bytes (e.g.: 4mb, 1MiB, 1048576, 128k; see '--units')
--num-workers value number of concurrent blob-downloading workers (readers); system default when omitted or zero (default: 0)
--archpath value extract the specified file from an object ("shard") formatted as: .tar, .tgz or .tar.gz, .zip, .tar.lz4;
see also: '--archregx'
--archmime value expected format (mime type) of an object ("shard") formatted as: .tar, .tgz or .tar.gz, .zip, .tar.lz4;
especially usable for shards with non-standard extensions
--archregx value string that specifies prefix, suffix, substring, WebDataset key, _or_ a general-purpose regular expression
to select possibly multiple matching archived files from a given shard;
is used in combination with '--archmode' ("matching mode") option
--archmode value enumerated "matching mode" that tells aistore how to handle '--archregx', one of:
* regexp - general purpose regular expression;
* prefix - matching filename starts with;
* suffix - matching filename ends with;
* substr - matching filename contains;
* wdskey - WebDataset key
example:
given a shard containing (subdir/aaa.jpg, subdir/aaa.json, subdir/bbb.jpg, subdir/bbb.json, ...)
and wdskey=subdir/aaa, aistore will match and return (subdir/aaa.jpg, subdir/aaa.json)
--extract, -x extract all files from archive(s)
--inventory list objects using _bucket inventory_ (docs/s3inventory.md); requires s3:// backend; will provide significant performance
boost when used with very large s3 buckets; e.g. usage:
1) 'ais ls s3://abc --inventory'
2) 'ais ls s3://abc --inventory --paged --prefix=subdir/'
(see also: docs/s3inventory.md)
--inv-name value bucket inventory name (optional; system default name is '.inventory')
--inv-id value bucket inventory ID (optional; by default, we use bucket name as the bucket's inventory ID)
--prefix value get objects that start with the specified prefix, e.g.:
'--prefix a/b/c' - get objects from the virtual directory a/b/c and objects from the virtual directory
a/b that have their names (relative to this directory) starting with 'c';
'--prefix ""' - get entire bucket (all objects)
--cached get only in-cluster objects - only those objects from a remote bucket that are present ("cached")
--archive list archived content (see docs/archive.md for details)
--limit value maximum number of object names to display (0 - unlimited; see also '--max-pages')
e.g.: 'ais ls gs://abc --limit 1234 --cached --props size,custom (default: 0)
--units value show statistics and/or parse command-line specified sizes using one of the following _units of measurement_:
iec - IEC format, e.g.: KiB, MiB, GiB (default)
si - SI (metric) format, e.g.: KB, MB, GB
raw - do not convert to (or from) human-readable format
--verbose, -v verbose output
--silent server-side flag, an indication for aistore _not_ to log assorted errors (e.g., HEAD(object) failures)
--help, -h show help
Table of Contents
- Archive files and directories
- Append files and directories to an existing archive
- Archive multiple objects
- List archived content
- Get archived content
- Get archived content: multiple-selection
- Generate shards
Archive files and directories
Archive multiple files.
$ ais archive put --help
NAME:
ais archive put - archive a file, a directory, or multiple files and/or directories as
(.tar, .tgz or .tar.gz, .zip, .tar.lz4)-formatted object - aka "shard".
Both APPEND (to an existing shard) and PUT (a new version of the shard) are supported.
Examples:
- 'local-file s3://q/shard-00123.tar.lz4 --append --archpath name-in-archive' - append file to a given shard,
optionally, rename it (inside archive) as specified;
- 'local-file s3://q/shard-00123.tar.lz4 --append-or-put --archpath name-in-archive' - append file to a given shard if exists,
otherwise, create a new shard (and name it shard-00123.tar.lz4, as specified);
- 'src-dir gs://w/shard-999.zip --append' - archive entire 'src-dir' directory; iff the destination .zip doesn't exist create a new one;
- '"sys, docs" ais://dst/CCC.tar --dry-run -y -r --archpath ggg/' - dry-run to recursively archive two directories.
Tips:
- use '--dry-run' if in doubt;
- to archive objects from a ais:// or remote bucket, run 'ais archive bucket' (see --help for details).
USAGE:
ais archive put [command options] [-|FILE|DIRECTORY[/PATTERN]] BUCKET/SHARD_NAME
OPTIONS:
--list value comma-separated list of object or file names, e.g.:
--list 'o1,o2,o3'
--list "abc/1.tar, abc/1.cls, abc/1.jpeg"
or, when listing files and/or directories:
--list "/home/docs, /home/abc/1.tar, /home/abc/1.jpeg"
--template value template to match object or file names; may contain prefix (that could be empty) with zero or more ranges
(with optional steps and gaps), e.g.:
--template "" # (an empty or '*' template matches eveything)
--template 'dir/subdir/'
--template 'shard-{1000..9999}.tar'
--template "prefix-{0010..0013..2}-gap-{1..2}-suffix"
and similarly, when specifying files and directories:
--template '/home/dir/subdir/'
--template "/abc/prefix-{0010..9999..2}-suffix"
--wait wait for an asynchronous operation to finish (optionally, use '--timeout' to limit the waiting time)
--timeout value maximum time to wait for a job to finish; if omitted: wait forever or until Ctrl-C;
valid time units: ns, us (or µs), ms, s (default), m, h
--progress show progress bar(s) and progress of execution in real time
--refresh value time interval for continuous monitoring; can be also used to update progress bar (at a given interval);
valid time units: ns, us (or µs), ms, s (default), m, h
--append-or-put append to an existing destination object ("archive", "shard") iff exists; otherwise PUT a new archive (shard);
note that PUT (with subsequent overwrite if the destination exists) is the default behavior when the flag is omitted
--append add newly archived content to the destination object ("archive", "shard") that must exist
--archpath value filename in an object ("shard") formatted as: .tar, .tgz or .tar.gz, .zip, .tar.lz4
--num-workers value number of concurrent client-side workers (to execute PUT or append requests);
use (-1) to indicate single-threaded serial execution (ie., no workers);
any positive value will be adjusted _not_ to exceed twice the number of client CPUs (default: 10)
--dry-run preview the results without really running the action
--recursive, -r recursive operation
--verbose, -v verbose output
--yes, -y assume 'yes' to all questions
--units value show statistics and/or parse command-line specified sizes using one of the following _units of measurement_:
iec - IEC format, e.g.: KiB, MiB, GiB (default)
si - SI (metric) format, e.g.: KB, MB, GB
raw - do not convert to (or from) human-readable format
--include-src-dir prefix the names of archived files with the (root) source directory
--skip-vc skip loading object metadata (and the associated checksum & version related processing)
--cont-on-err keep running archiving xaction (job) in presence of errors in a any given multi-object transaction
--help, -h show help
The operation accepts either an explicitly defined list or template-defined range of file names (to archive).
NOTE:
ais archive put
works with locally accessible (source) files and shall not be confused withais archive bucket
command (below).
Also, note that ais put
command with its --archpath
option provides an alternative way to archive multiple objects:
For the most recently updated list of supported archival formats, please see:
Append files and directories to an existing archive
APPEND operation provides for appending files to existing archives (shards). As such, APPEND is a variation of PUT (above) with additional two boolean flags:
Name | Description |
---|---|
--append |
add newly archived content to the destination object ("archive", "shard") that must exist |
--append-or-put |
if destination object ("archive", "shard") exists append to it, otherwise archive a new one |
Example 1: add file to archive
step 1. create archive (by archiving a given source dir)
$ ais archive put sys ais://nnn/sys.tar.lz4
Warning: multi-file 'archive put' operation requires either '--append' or '--append-or-put' option
Proceed to execute 'archive put --append-or-put'? [Y/N]: y
Files to upload:
EXTENSION COUNT SIZE
.go 11 17.46KiB
TOTAL 11 17.46KiB
APPEND 11 files (one directory, non-recursive) => ais://nnn/sys.tar.lz4? [Y/N]: y
Done
step 2. add a single file to existing archive
$ ais archive put README.md ais://nnn/sys.tar.lz4 --archpath=docs/README --append
APPEND README.md to ais://nnn/sys.tar.lz4 as "docs/README"
step 3. list entire bucket with an --archive
option to show all archived entries
$ ais ls ais://nnn --archive
NAME SIZE
sys.tar.lz4 16.84KiB
sys.tar.lz4/api_linux.go 1.07KiB
sys.tar.lz4/cpu.go 1.07KiB
sys.tar.lz4/cpu_darwin.go 802B
sys.tar.lz4/cpu_linux.go 2.14KiB
sys.tar.lz4/docs/README 13.85KiB
sys.tar.lz4/mem.go 1.16KiB
sys.tar.lz4/mem_darwin.go 2.04KiB
sys.tar.lz4/mem_linux.go 2.81KiB
sys.tar.lz4/proc.go 784B
sys.tar.lz4/proc_darwin.go 369B
sys.tar.lz4/proc_linux.go 1.40KiB
sys.tar.lz4/sys_test.go 3.88KiB
Listed: 13 names
Alternatively, use regex to select:
$ ais ls ais://nnn --archive --regex docs
NAME SIZE
sys.tar.lz4/docs/README 13.85KiB
Example 2: use --template
flag to add source files
Generally, the --template
option combines (an optional) prefix and/or one or more ranges (e.g., bash brace expansions).
In this case, the template we use is a simple prefix with no ranges.
$ ls -l /tmp/w
total 32
-rw-r--r-- 1 root root 14180 Dec 11 18:18 111
-rw-r--r-- 1 root root 14180 Dec 11 18:18 222
$ ais archive put ais://nnn/shard-001.tar --template /tmp/w/ --append
Files to upload:
EXTENSION COUNT SIZE
2 27.70KiB
TOTAL 2 27.70KiB
APPEND 2 files (one directory, non-recursive) => ais://nnn/shard-001.tar? [Y/N]: y
Done
$ ais ls ais://nnn/shard-001.tar --archive
NAME SIZE
shard-001.tar 37.50KiB
shard-001.tar/111 13.85KiB
shard-001.tar/222 13.85KiB
shard-001.tar/23ed44d8bf3952a35484-1.test 1.00KiB
shard-001.tar/452938788ebb87807043-4.test 1.00KiB
shard-001.tar/7925bc9b5eb1daa12ed0-2.test 1.00KiB
shard-001.tar/8264574b49bd188a4b27-0.test 1.00KiB
shard-001.tar/f1f25e52c5edd768e0ec-3.test 1.00KiB
Example 3: add file to archive
In this example, we assume that arch.tar
already exists.
# contents _before_:
$ ais archive ls ais://abc/arch.tar
NAME SIZE
arch.tar 4.5KiB
arch.tar/obj1 1.0KiB
arch.tar/obj2 1.0KiB
# add file to existing archive:
$ ais archive put /tmp/obj1.bin ais://abc/arch.tar --archpath bin/obj1
APPEND "/tmp/obj1.bin" to object "ais://abc/arch.tar[/bin/obj1]"
# contents _after_:
$ ais archive ls ais://abc/arch.tar
NAME SIZE
arch.tar 6KiB
arch.tar/bin/obj1 2.KiB
arch.tar/obj1 1.0KiB
arch.tar/obj2 1.0KiB
Example 4: add file to archive
# contents _before_:
$ ais archive ls ais://nnn/shard-2.tar
NAME SIZE
shard-2.tar 5.50KiB
shard-2.tar/0379f37cbb0415e7eaea-3.test 1.00KiB
shard-2.tar/504c563d14852368575b-5.test 1.00KiB
shard-2.tar/c7bcb7014568b5e7d13b-4.test 1.00KiB
# append and note that `--archpath` can specify a fully qualified destination name
$ ais archive put LICENSE ais://nnn/shard-2.tar --archpath shard-2.tar/license.test
APPEND "/go/src/github.com/NVIDIA/aistore/LICENSE" to "ais://nnn/shard-2.tar[/shard-2.tar/license.test]"
# contents _after_:
$ ais archive ls ais://nnn/shard-2.tar
NAME SIZE
shard-2.tar 7.50KiB
shard-2.tar/0379f37cbb0415e7eaea-3.test 1.00KiB
shard-2.tar/504c563d14852368575b-5.test 1.00KiB
shard-2.tar/c7bcb7014568b5e7d13b-4.test 1.00KiB
shard-2.tar/license.test 1.05KiB
Archive multiple objects
This is a yet another archive-creating operation that:
- takes in multiple objects from a given source bucket, and
-
archives them all as a shard in the specified destination bucket,
where:
- source and destination buckets may not necessarily be different;
- both
--list
and--template
options are supported - supported archival formats include
.tar
,.tar.gz
(or, same,.tgz
), and.zip
; more extensions may be added in the future. - archiving is carried out asynchronously, in parallel by all AIS targets.
As such, ais archive bucket
is one of the supported multi-object operations.
NOTE:
ais archive bucket
multi-object bucket-to-bucket archiving shall not be confused withais archive put
command - the latter is used to archive multiple source files from a local (or locally accessible) source directory.
$ ais archive bucket --help
NAME:
ais archive bucket - archive multiple objects from SRC_BUCKET as (.tar, .tgz or .tar.gz, .zip, .tar.lz4)-formatted shard
USAGE:
ais archive bucket [command options] SRC_BUCKET DST_BUCKET/SHARD_NAME
OPTIONS:
--template value template to match object or file names; may contain prefix (that could be empty) with zero or more ranges
(with optional steps and gaps), e.g.:
--template "" # (an empty or '*' template matches eveything)
--template 'dir/subdir/'
--template 'shard-{1000..9999}.tar'
--template "prefix-{0010..0013..2}-gap-{1..2}-suffix"
and similarly, when specifying files and directories:
--template '/home/dir/subdir/'
--template "/abc/prefix-{0010..9999..2}-suffix"
--list value comma-separated list of object or file names, e.g.:
--list 'o1,o2,o3'
--list "abc/1.tar, abc/1.cls, abc/1.jpeg"
or, when listing files and/or directories:
--list "/home/docs, /home/abc/1.tar, /home/abc/1.jpeg"
--dry-run preview the results without really running the action
--include-src-bck prefix the names of archived files with the source bucket name
--append-or-put if destination object ("archive", "shard") exists append to it, otherwise archive a new one
--cont-on-err keep running archiving xaction in presence of errors in a any given multi-object transaction
--wait wait for an asynchronous operation to finish (optionally, use '--timeout' to limit the waiting time)
--help, -h show help
Examples
- Archive a list of objects from a given bucket:
$ ais archive bucket ais://bck/arch.tar --list obj1,obj2
Archiving "ais://bck/arch.tar" ...
Resulting ais://bck/arch.tar
contains objects ais://bck/obj1
and ais://bck/obj2
.
- Archive objects from a different bucket, use template (range):
$ ais archive bucket ais://src ais://dst/arch.tar --template "obj-{0..9}"
Archiving "ais://dst/arch.tar" ...
ais://dst/arch.tar
now contains 10 objects from bucket ais://src
: ais://src/obj-0
, ais://src/obj-1
... ais://src/obj-9
.
- Archive 3 objects and then append 2 more:
$ ais archive bucket ais://bck/arch1.tar --template "obj{1..3}"
Archived "ais://bck/arch1.tar" ...
$ ais archive ls ais://bck/arch1.tar
NAME SIZE
arch1.tar 31.00KiB
arch1.tar/obj1 9.26KiB
arch1.tar/obj2 9.26KiB
arch1.tar/obj3 9.26KiB
$ ais archive bucket ais://bck/arch1.tar --template "obj{4..5}" --append
Archived "ais://bck/arch1.tar"
$ ais archive ls ais://bck/arch1.tar
NAME SIZE
arch1.tar 51.00KiB
arch1.tar/obj1 9.26KiB
arch1.tar/obj2 9.26KiB
arch1.tar/obj3 9.26KiB
arch1.tar/obj4 9.26KiB
arch1.tar/obj5 9.26KiB
List archived content
NAME:
ais archive ls - list archived content (supported formats: .tar, .tgz or .tar.gz, .zip, .tar.lz4)
USAGE:
ais archive ls [command options] BUCKET[/SHARD_NAME]
List archived content as a tree with archive (“shard”) name as a root and archived files as leaves. Filenames are always sorted alphabetically.
Options
Name | Type | Description | Default |
---|---|---|---|
--props |
string |
Comma-separated properties to return with object names | "size" |
--all |
bool |
Show all objects, including misplaced, duplicated, etc. | false |
Examples
$ ais archive ls ais://bck/arch.tar
NAME SIZE
arch.tar 4.5KiB
arch.tar/obj1 1.0KiB
arch.tar/obj2 1.0KiB
Example: use ‘--prefix’ that crosses shard boundary
For starters, we recursively archive all aistore docs:
$ ais put docs ais://A.tar --archive -r
To list a virtual subdirectory inside this newly created shard (e.g.):
$ ais archive ls ais://nnn --prefix "A.tar/tutorials"
NAME SIZE
A.tar/tutorials/README.md 561B
A.tar/tutorials/etl/compute_md5.md 8.28KiB
A.tar/tutorials/etl/etl_imagenet_pytorch.md 4.16KiB
A.tar/tutorials/etl/etl_webdataset.md 3.97KiB
Listed: 4 names
or, same:
$ ais ls ais://nnn --prefix "A.tar/tutorials" --archive
NAME SIZE
A.tar/tutorials/README.md 561B
A.tar/tutorials/etl/compute_md5.md 8.28KiB
A.tar/tutorials/etl/etl_imagenet_pytorch.md 4.16KiB
A.tar/tutorials/etl/etl_webdataset.md 3.97KiB
Listed: 4 names
Get archived content
$ ais get --help
ais get - (alias for "object get") get an object, a shard, an archived file, or a range of bytes from all of the above;
write the content locally with destination options including: filename, directory, STDOUT ('-'), or '/dev/null' (discard);
assorted options further include:
- '--prefix' to get multiple objects in one shot (empty prefix for the entire bucket);
- '--extract' or '--archpath' to extract archived content;
- '--progress' and '--refresh' to watch progress bar;
- '-v' to produce verbose output when getting multiple objects.
USAGE:
ais get [command options] BUCKET[/OBJECT_NAME] [OUT_FILE|OUT_DIR|-]
OPTIONS:
--offset value object read offset; must be used together with '--length'; default formatting: IEC (use '--units' to override)
--checksum validate checksum
--yes, -y assume 'yes' to all questions
--refresh value interval for continuous monitoring;
valid time units: ns, us (or µs), ms, s (default), m, h
--progress show progress bar(s) and progress of execution in real time
--archpath value extract the specified file from an archive (shard)
--extract, -x extract all files from archive(s)
--prefix value get objects that start with the specified prefix, e.g.:
'--prefix a/b/c' - get objects from the virtual directory a/b/c and objects from the virtual directory
a/b that have their names (relative to this directory) starting with c;
'--prefix ""' - get entire bucket
--cached get only those objects from a remote bucket that are present ("cached") in AIS
--archive list archived content (see docs/archive.md for details)
--limit value limit object name count (0 - unlimited) (default: 0)
--units value show statistics and/or parse command-line specified sizes using one of the following _units of measurement_:
iec - IEC format, e.g.: KiB, MiB, GiB (default)
si - SI (metric) format, e.g.: KB, MB, GB
raw - do not convert to (or from) human-readable format
--verbose, -v verbose outout when getting multiple objects
--help, -h show help
Example: extract one file
$ ais archive get ais://dst/A.tar.gz /tmp/w --archpath 111.ext1
GET 111.ext1 from ais://dst/A.tar.gz as "/tmp/w/111.ext1" (12.56KiB)
$ ls /tmp/w
111.ext1
Alternatively, use fully qualified name:
$ ais archive get ais://dst/A.tar.gz/111.ext1 /tmp/w
Example: extract one file using its fully-qualified name::
$ ais archive get ais://nnn/A.tar/tutorials/README.md /tmp/out
Example: extract all files from a single shard
Let’s say, we have a certain shard in a certain bucket:
$ ais ls ais://dst --archive
NAME SIZE
A.tar.gz 5.18KiB
A.tar.gz/111.ext1 12.56KiB
A.tar.gz/222.ext1 12.56KiB
A.tar.gz/333.ext2 12.56KiB
We can then go ahead to GET and extract it to local directory, e.g.:
$ ais archive get ais://dst/A.tar.gz /tmp/www --extract
GET A.tar.gz from ais://dst as "/tmp/www/A.tar.gz" (5.18KiB) and extract to /tmp/www/A/
$ ls /tmp/www/A
111.ext1 222.ext1 333.ext2
But here’s an alternative syntax to achieve the same:
$ ais get ais://dst --archive --prefix A.tar.gz /tmp/www
or even:
$ ais get ais://dst --archive --prefix A.tar.gz /tmp/www --progress --refresh 1 -y
GET 51 objects from ais://dst/tmp/ggg (total size 1.08MiB)
Objects: 51/51 [==============================================================] 100 %
Total size: 1.08 MiB / 1.08 MiB [==============================================================] 100 %
The difference is that:
- in the first case we ask for a specific shard,
- while in the second (and third) we filter bucket’s content using a certain prefix
- and the fact (the convention) that archived filenames are prefixed with their parent (shard) name.
Example: extract all files from all shards (with a given prefix)
Let’s say, there’s a bucket ais://dst
with a virtual directory abc/
that in turn contains:
$ ais ls ais://dst
NAME SIZE
A.tar.gz 5.18KiB
B.tar.lz4 247.88KiB
C.tar.zip 4.15KiB
D.tar 2.00KiB
Next, we GET and extract them all in the respective sub-directories (note --verbose
option):
$ ais archive get ais://dst /tmp/w --prefix "" --extract -v
GET 4 objects from ais://dst to /tmp/w (total size 259.21KiB) [Y/N]: y
GET D.tar from ais://dst as "/tmp/w/D.tar" (2.00KiB) and extract as /tmp/w/D
GET A.tar.gz from ais://dst as "/tmp/w/A.tar.gz" (5.18KiB) and extract as /tmp/w/A
GET C.tar.zip from ais://dst as "/tmp/w/C.tar.zip" (4.15KiB) and extract as /tmp/w/C
GET B.tar.lz4 from ais://dst as "/tmp/w/B.tar.lz4" (247.88KiB) and extract as /tmp/w/B
Example: use ‘--prefix’ that crosses shard boundary
For starters, we recursively archive all aistore docs:
$ ais put docs ais://A.tar --archive -r
To list a virtual subdirectory inside this newly created shard (e.g.):
$ ais archive ls ais://nnn --prefix A.tar/tutorials
NAME SIZE
A.tar/tutorials/README.md 561B
A.tar/tutorials/etl/compute_md5.md 8.28KiB
A.tar/tutorials/etl/etl_imagenet_pytorch.md 4.16KiB
A.tar/tutorials/etl/etl_webdataset.md 3.97KiB
Listed: 4 names
Now, extract matching files from the bucket to /tmp/out:
$ ais archive get ais://nnn --prefix A.tar/tutorials /tmp/out
GET 6 objects from ais://nnn/tmp/out (total size 17.81MiB) [Y/N]: y
$ ls -al /tmp/out/tutorials/
total 20
drwxr-x--- 4 root root 4096 May 13 20:05 ./
drwxr-xr-x 3 root root 4096 May 13 20:05 ../
drwxr-x--- 2 root root 4096 May 13 20:05 etl/
-rw-r--r-- 1 root root 561 May 13 20:05 README.md
drwxr-x--- 2 root root 4096 May 13 20:05 various/
Get archived content: multiple selection
Generally, both single and multi-selection from a given source shard is realized using one of the following 4 (four) options:
--archpath value extract the specified file from an object ("shard") formatted as: .tar, .tgz or .tar.gz, .zip, .tar.lz4;
see also: '--archregx'
--archmime value expected format (mime type) of an object ("shard") formatted as: .tar, .tgz or .tar.gz, .zip, .tar.lz4;
especially usable for shards with non-standard extensions
--archregx value string that specifies prefix, suffix, substring, WebDataset key, _or_ a general-purpose regular expression
to select possibly multiple matching archived files from a given shard;
is used in combination with '--archmode' ("matching mode") option
--archmode value enumerated "matching mode" that tells aistore how to handle '--archregx', one of:
* regexp - general purpose regular expression;
* prefix - matching filename starts with;
* suffix - matching filename ends with;
* substr - matching filename contains;
* wdskey - WebDataset key
example:
given a shard containing (subdir/aaa.jpg, subdir/aaa.json, subdir/bbb.jpg, subdir/bbb.json, ...)
and wdskey=subdir/aaa, aistore will match and return (subdir/aaa.jpg, subdir/aaa.json)
In particular, ‘--archregx’ and ‘--archmode’ pair defines multiple selection that can be further demonstrated on the following examples.
But first, note that in all multi-selection cases, the result is (currently) invariably formatted as .TAR (that contains the aforementioned selection).
Example: suffix match
Select all *.jpeg
files from a given shard and return them all as 111.tar:
$ ais archive get ais://abc/trunk-0123.tar 111.tar --archregx=jpeg --archmode=suffix
Example: WebDataset key
Select all files that have a given WebDataset key; return the result as 222.tar:
$ ais archive get ais://abc/trunk-0123.tar 222.tar --archregx=file45 --archmode=wdskey
Example: prefix match
Similar to the above except that in this case ‘--archregx’ value specifies virtual subdirectory inside a given named shard:
$ ais archive get ais://abc/trunk-0123.tar 333.tar --archregx=subdir/ --archmode=prefix
Generate shards
ais archive gen-shards "BUCKET/TEMPLATE.EXT"
Put randomly generated shards that can be used for dSort testing.
The TEMPLATE
must be bash-like brace expansion (see examples) and .EXT
must be one of: .tar
, .tar.gz
.
Warning: Remember to always quote the argument ("..."
) otherwise the brace expansion will happen in terminal.
Options
Flag | Type | Description | Default |
---|---|---|---|
--fsize |
string |
Single file size inside the shard, can end with size suffix (k, MB, GiB, ...) | 1024 (1KB ) |
--fcount |
int |
Number of files inside single shard | 5 |
--fext |
string |
Comma-separated list of file extensions (default “.test”), e.g.: --fext ‘.mp3,.json,.cls’ | .test |
--cleanup |
bool |
When set, the old bucket will be deleted and created again | false |
--num-workers |
int |
Limits the number of shards created concurrently | 10 |
Examples
Generate shards with varying numbers of files and file sizes
Generate 10 shards each containing 100 files of size 256KB and put them inside ais://dsort-testing
bucket (creates it if it does not exist).
Shards will be named: shard-0.tar
, shard-1.tar
, ..., shard-9.tar
.
$ ais archive gen-shards "ais://dsort-testing/shard-{0..9}.tar" --fsize 262144 --fcount 100
Shards created: 10/10 [==============================================================] 100 %
$ ais ls ais://dsort-testing
NAME SIZE VERSION
shard-0.tar 25.05MiB 1
shard-1.tar 25.05MiB 1
shard-2.tar 25.05MiB 1
shard-3.tar 25.05MiB 1
shard-4.tar 25.05MiB 1
shard-5.tar 25.05MiB 1
shard-6.tar 25.05MiB 1
shard-7.tar 25.05MiB 1
shard-8.tar 25.05MiB 1
shard-9.tar 25.05MiB 1
Generate shards using custom naming template
Generates 100 shards each containing 5 files of size 256KB and put them inside dsort-testing
bucket.
Shards will be compressed and named: super_shard_000_last.tgz
, super_shard_001_last.tgz
, ..., super_shard_099_last.tgz
$ ais archive gen-shards "ais://dsort-testing/super_shard_{000..099}_last.tar" --fsize 262144 --cleanup
Shards created: 100/100 [==============================================================] 100 %
$ ais ls ais://dsort-testing
NAME SIZE VERSION
super_shard_000_last.tgz 1.25MiB 1
super_shard_001_last.tgz 1.25MiB 1
super_shard_002_last.tgz 1.25MiB 1
super_shard_003_last.tgz 1.25MiB 1
super_shard_004_last.tgz 1.25MiB 1
super_shard_005_last.tgz 1.25MiB 1
super_shard_006_last.tgz 1.25MiB 1
super_shard_007_last.tgz 1.25MiB 1
...
Multi-extension example
$ ais archive gen-shards 'ais://nnn/shard-{01..99}.tar' -fext ".mp3, .json, .cls"
$ ais archive ls ais://nnn | head -n 20
NAME SIZE
shard-01.tar 23.50KiB
shard-01.tar/541701ae863f76d0f7e0-0.cls 1.00KiB
shard-01.tar/541701ae863f76d0f7e0-0.json 1.00KiB
shard-01.tar/541701ae863f76d0f7e0-0.mp3 1.00KiB
shard-01.tar/8f8c5fa2934c90138833-1.cls 1.00KiB
shard-01.tar/8f8c5fa2934c90138833-1.json 1.00KiB
shard-01.tar/8f8c5fa2934c90138833-1.mp3 1.00KiB
shard-01.tar/9a42bd12d810d890ea86-3.cls 1.00KiB
shard-01.tar/9a42bd12d810d890ea86-3.json 1.00KiB
shard-01.tar/9a42bd12d810d890ea86-3.mp3 1.00KiB
shard-01.tar/c5bd7c7a34e12ebf3ad3-2.cls 1.00KiB
shard-01.tar/c5bd7c7a34e12ebf3ad3-2.json 1.00KiB
shard-01.tar/c5bd7c7a34e12ebf3ad3-2.mp3 1.00KiB
shard-01.tar/f13522533ecafbad4fe5-4.cls 1.00KiB
shard-01.tar/f13522533ecafbad4fe5-4.json 1.00KiB
shard-01.tar/f13522533ecafbad4fe5-4.mp3 1.00KiB
shard-02.tar 23.50KiB
shard-02.tar/095e6ae644ff4fd1778b-7.cls 1.00KiB
shard-02.tar/095e6ae644ff4fd1778b-7.json 1.00KiB
...