PERFORMANCE
AIStore is all about performance - performance and reliability, to be precise. The tips and recommendations in this text will go a long way toward ensuring that AIS delivers.
But first, one important consideration:
Currently(**) AIS utilizes local filesystems - local formatted drives of any kind. You could use HDDs and/or NVMes, and any Linux filesystem: xfs, zfs, ext4... you name it. In all cases, the resulting throughput of an AIS cluster will be the sum of throughputs of individual drives.
Usage of local filesystems has its pros and cons - its inevitable trade-offs. One immediate implication is `ulimit` - the limit on the maximum number of open file descriptors. We recommend fixing that specific tunable right away, as described below.
The second related option is `noatime` - and the same argument applies. Please make sure to review and tune up both.
(**) The interface between AIS and a local filesystem is extremely limited and boils down to writing/reading key/value pairs (where values are, effectively, files) and checking existence of keys (filenames). Moving AIS off of POSIX to some sort of (TBD) key/value storage engine should be a straightforward exercise. Related expectations/requirements include reliability and performance under stressful workloads that read and write “values” of 1MB and greater in size.
- Operating System
- CPU
- Network
- Smoke test
- Maximum number of open files
- Storage
- Virtualization
- Metadata write policy
- PUT latency
- GET throughput
- aisloader
Operating System
There are two types of nodes in an AIS cluster:
- targets (i.e., storage nodes)
- proxies (aka gateways)
An AIS target must have a data disk, or disks. AIS proxies do not “see” a single byte of user data - they implement most of the control plane and provide access points to the cluster.
The question, then, is how to get the maximum out of the underlying hardware - how to improve datapath performance. This and similar questions have one simple answer: tune up the target’s operating system.
Specifically, `sysctl` selected system variables, such as `net.core.wmem_max`, `net.core.rmem_max`, `vm.swappiness`, and more. The approximate list is maintained as part of a separate repository that serves the (specific) purposes of deploying AIS on bare-metal Kubernetes. That repo includes a number of playbooks to assist in a full deployment of AIStore; in particular, there is a section of pre-deployment playbooks to prepare AIS nodes for deployment on bare-metal Kubernetes.
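As a quick illustration, here is a minimal sketch of how such overrides are typically applied (the values below are placeholders for illustration only - the authoritative, tuned list lives in the playbooks mentioned above):
$ # example sysctl overrides; adjust values to your hardware and workload
$ echo 'net.core.rmem_max = 268435456' | sudo tee -a /etc/sysctl.d/90-ais.conf
$ echo 'net.core.wmem_max = 268435456' | sudo tee -a /etc/sysctl.d/90-ais.conf
$ echo 'vm.swappiness = 10' | sudo tee -a /etc/sysctl.d/90-ais.conf
$ sudo sysctl --system # apply without a reboot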
General references:
- Ubuntu Performance Tuning (archived) <- a good guide on general optimizations (some of them are described below)
- RHEL 7 Performance Tuning Guide <- a detailed view on how to tune RHEL; a lot of the tips and tricks apply to other Linux distributions
CPU
Setting the CPU governor (P-States) to `performance` may make a big difference and, in particular, result in much better network throughput.
On Linux:
$ echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
or using the `cpupower` package (on Debian and Ubuntu):
$ apt-get install -y linux-tools-$(uname -r) # install `cpupower`
$ cpupower frequency-info # check current settings
$ cpupower frequency-set -r -g performance # set `performance` for all CPUs
$ cpupower frequency-info # check settings after the change
Once the packages are installed (a step that depends on your Linux distribution), you can then follow the tuning instructions from the referenced PDF (above).
Network
AIStore supports 3 (three) logical networks:
- public (default port `51081`)
- intra-cluster control (`51082`), and
- intra-cluster data (`51083`)
Ideally, all 3 are provisioned (physically) separately - to reduce contention, avoid head-of-line (HoL) blocking, and ultimately optimize intra-cluster performance.
Separately, and in addition:
- MTU should be set to `9000` (Jumbo frames) - this is one of the most important configurations
- Optimize TCP send buffer sizes on the target side (`net.core.wmem_max`, `net.ipv4.tcp_wmem`)
- Optimize TCP receive buffer sizes on the client (reading) side (`net.core.rmem_max`, `net.ipv4.tcp_rmem`)
- Set `net.ipv4.tcp_mtu_probing = 2`
The last setting is especially important when the client’s MTU is greater than 1500.
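For instance, assuming the data interface is named `eth1` (a placeholder), the MTU and probing settings might be applied as follows:
$ sudo ip link set dev eth1 mtu 9000 # requires end-to-end Jumbo frame support (NICs and switches)
$ sudo sysctl -w net.ipv4.tcp_mtu_probing=2 # runtime change; persist it via /etc/sysctl.d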
The list of tunables (above) cannot be considered exhaustive. Optimal (high-performance) choices always depend on the hardware, Linux kernel, and a variety of factors outside the scope of this document.
Smoke test
To ensure client ⇔ proxy, client ⇔ target, proxy ⇔ target, and target ⇔ target connectivity, you can use `iperf` (make sure to use Jumbo frames and disable fragmentation).
Here is an example use of `iperf`:
$ iperf -P 20 -l 128K -i 1 -t 30 -w 512K -c <IP-address>
NOTE: `iperf` must show at least 95% of the bandwidth of a given physical interface. If it does not, try to find out why - it makes little sense to run any benchmark before resolving the issue.
Maximum number of open files
This must be done before running benchmarks, let alone deploying AIS in production: the maximum number of open file descriptors must be increased.
The corresponding system configuration file is `/etc/security/limits.conf`.
In Linux, the default per-process maximum is 1024. It is strongly recommended to raise it to at least 100,000.
Here’s a full replica of `/etc/security/limits.conf` that we use for development and production.
To check your current settings, run `ulimit -n` or `tail /etc/security/limits.conf`.
To increase the limits, make sure `/etc/security/limits.conf` contains the following `nofile` lines:
$ tail /etc/security/limits.conf
#ftp hard nproc 0
#ftp - chroot /ftp
#@student - maxlogins 4
root hard nofile 999999
root soft nofile 999999
* hard nofile 999999
* soft nofile 999999
# End of file
Once done, re-login and double-check that both soft and hard limits have indeed changed:
$ ulimit -n
999999
$ ulimit -Hn
999999
For further references, google:
- `docker run --ulimit`
- `DefaultLimitNOFILE`
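For example, a quick sketch of both (the image name and values are placeholders):
$ # containerized deployment: raise the limit for the container explicitly
$ docker run --ulimit nofile=999999:999999 <aisnode-image> ...
$ # systemd-managed hosts: raise the system-wide default and re-execute systemd
$ echo 'DefaultLimitNOFILE=999999' | sudo tee -a /etc/systemd/system.conf
$ sudo systemctl daemon-reexec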
Storage
Storage-wise, each local `ais_local.json` config must look as follows:
{
    "net": {
        "hostname": "127.0.1.0",
        "port": "8080"
    },
    "fspaths": {
        "/tmp/ais/1": {},
        "/tmp/ais/2": {},
        "/tmp/ais/3": {}
    }
}
- Each local path from the `fspaths` section above must be (or contain as a prefix) a mountpath of a local filesystem.
- Each local filesystem (above) must utilize one or more data drives, whereby none of the data drives is shared between two or more local filesystems.
- Each filesystem must be fine-tuned for reading large files/blocks/xfersizes - see the sketch right after this list.
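To make the above concrete, here is a minimal sketch of provisioning one mountpath per data drive (device name, filesystem choice, and paths are placeholders):
$ # one drive per mountpath; no drive is shared between filesystems
$ sudo mkfs.xfs /dev/nvme0n1
$ sudo mkdir -p /ais/1
$ sudo mount -o noatime /dev/nvme0n1 /ais/1 # see the noatime section below for the full set of mount options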
Disk priority
AIStore can be sensitive to spikes in I/O latencies, especially when running bursty traffic that carries large numbers of small-size I/O requests (resulting from writing and/or reading small-size objects).
To ensure that latencies do not spike because of other processes running on the system, it is recommended to give the `aisnode` process the highest disk priority. This can be done using the `ionice` tool:
$ # Best effort, highest priority
$ sudo ionice -c2 -n0 -p $(pgrep aisnode)
Benchmarking disk
TIP: double check that rootfs/tmpfs of the AIStore target are not used when reading and writing data.
When running a benchmark, make sure to run and collect the following on each and every target:
$ iostat -cdxtm 10
For local hard drive read performance, the fastest block-level read smoke test is:
$ hdparm -Tt /dev/<drive-name>
To read a block from a certain offset (in gigabytes), use the `--direct` argument to bypass the drive’s buffer cache and read directly from the disk:
$ hdparm -t --direct --offset 100 /dev/<drive-name>
More: Tune hard disk with `hdparm`.
Local filesystem
Another way to increase storage performance is to benchmark different filesystems: `ext`, `xfs`, `openzfs`.
Tuning the corresponding I/O scheduler can also prove important.
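For example, a minimal sketch of inspecting and switching the scheduler (the device name `sda` is a placeholder, and the best choice depends on the drive type and kernel version):
$ cat /sys/block/sda/queue/scheduler # available schedulers; the active one is shown in brackets
$ echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler # switch at runtime; persist via a udev rule if needed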
noatime
One of the most important performance improvements can be achieved by simply turning off `atime` (access time) updates on the filesystem.
Example: mount xfs with the following performance-optimized parameters (that also include `noatime`):
noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier
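A minimal sketch of how these options might appear in `/etc/fstab` (the device and mountpoint are placeholders; note that recent kernels no longer accept `nobarrier` for xfs - drop it if the mount fails):
/dev/nvme0n1 /ais/1 xfs noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier 0 0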
As an aside, it is worth noting that AIStore itself lazily updates access times. The complete story has two pieces:
- On the one hand, AIS can be deployed as a fast cache tier in front of any of the multiple supported backends. To support this specific usage scenario, AIS tracks access times and the remaining capacity. The latter has 3 (three) configurable thresholds: low-watermark, high-watermark, and OOS (out of space) - by default, 75%, 90%, and 95%, respectively. Running low on free space automatically triggers the LRU job that starts visiting evictable buckets (if any), sorting their content by access times, and yes - evicting. Bucket “evictability” is also configurable and, in turn, has two different defaults: true for buckets that have a remote backend, and false otherwise.
- On the other hand, we certainly don’t want to have any writes upon reads. That’s why object metadata is cached and atime gets updated in memory to a) avoid extra reads and b) absorb multiple accesses and multiple atime updates. Further, if and only if we update atime in memory, we also mark the metadata as `dirty` - so that later, if for whatever reason we need to un-cache it, we flush it to disk (along with the most recently updated access time).
  - The actual mechanism of when and whether to flush the corresponding metadata is driven by two conditions: the time since last access and the amount of free memory. If there’s plenty of memory, we flush `dirty` metadata after 1 hour of inactivity.
  - Finally, in AIS all configuration-related defaults (e.g., the default watermarks mentioned above) are also configurable - but that’s a different story and a different scope...
Virtualization
There must be no sharing of host resources between two or more VMs that are AIS nodes.
Even if there is a single virtual machine, the host may decide to swap it out when idle, or give it a single hyperthreaded vCPU instead of a full-blown physical core - this must be prevented.
A virtualized AIS node needs to have its physical resources - memory, CPU, network, storage - in their entirety; the hypervisor must limit itself to the absolutely required minimum.
Make sure to use PCI passthrough to assign a device (NIC, HDD) directly to the AIS VM.
AIStore’s primary goal is to scale with clustered drives. Therefore, the choice of a drive type and its capabilities remains very important.
Finally, when initializing virtualized disks it’d be advisable to set an optimal block size.
Metadata write policy
There will be times when it makes sense to keep the storage system’s metadata strictly in memory or, maybe, cache and flush it on a best-effort basis. The corresponding use cases include temporary datasets - transient data of any kind that can (and will) be discarded. There’s also the case when AIS is used as a high-performance caching layer where the original data is already sufficiently replicated and protected.
Metadata write policy - json tag `write_policy.md` - was introduced specifically to support those and related scenarios. Configurable both globally and for each individual bucket (or dataset), the current set of supported policies includes:
| Policy | Description |
| --- | --- |
| `immediate` | write immediately upon updates (global default) |
| `delayed` | cache and flush when not accessed for a while (see also: noatime) |
| `never` | never write but always keep metadata in memory (aka “transient”) |
For the most recently updated enumeration, please see the source.
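For example, switching a bucket to delayed metadata writes via the CLI (the bucket name is a placeholder; the `ais bucket props set` command is also shown in the GET throughput section below):
$ ais bucket props set ais://abc write_policy.md=delayed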
PUT latency
AIS provides checksumming and self-healing - the capabilities that ensure that user data is end-to-end protected and that data corruption, if it ever happens, will be properly and timely detected and - in presence of any type of data redundancy - resolved by the system.
There’s a price, though, and there are scenarios where you could make an educated choice to trade checksumming for performance.
In particular, let’s say that we are massively writing new content into a bucket. The type of the bucket doesn’t matter - it may be an `ais://` bucket, or `s3://`, or any other supported backend including HDFS and HTTP.
What matters is that we know we’ll be overwriting very few objects, percentage-wise. It then stands to reason that AIS, on its end, should refrain from trying to load the destination object’s metadata - that is, skip loading the existing object’s metadata to compare checksums (and thus possibly avoid writing altogether if the checksums match) and/or update the object’s version, etc.
- API: PUT(object) - look for the `SkipVC` option
- CLI: PUT(object) - the `--skip-vc` option
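For example (a sketch; the file path and bucket name are placeholders):
$ # PUT a local file while skipping the checksum/version comparison with any existing object
$ ais put /tmp/data/file.bin ais://abc/file.bin --skip-vc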
GET throughput
AIS is an elastic cluster that can grow and shrink at runtime when you attach (or detach) disks and add (or remove) nodes. Drives are prone to sudden failures, while nodes can experience unexpected crashes. At all times, though, and in the presence of any and all events, AIS tries to maintain a fair and balanced distribution of user data across all clustered nodes and drives.
The idea to keep all drives equally utilized is the absolute cornerstone of the design.
There’s one simple trick, though, to improve drive utilization even further: n-way mirroring. In fact,
- if you observe 90% and higher utilizations,
- and if adding more drives (or more nodes) is not an option,
- and if you still have some spare capacity to create additional copies of data
if all of the above is true, it’d be a very good idea to mirror the corresponding bucket, e.g.:
# reconfigure existing bucket (`ais://abc` in the example) for 3 copies
$ ais bucket props set ais://abc mirror.enabled=true mirror.copies=3
Bucket props successfully updated
"mirror.copies" set to: "3" (was: "2")
"mirror.enabled" set to: "true" (was: "false")
In other words, under extreme loads n-way mirroring is especially, and strongly, recommended. (Data protection would then be an additional bonus, needless to say.) The way it works is simple:
- AIS target constantly monitors drive utilizations
- Given multiple copies, AIS target always selects a copy (and a drive that stores it) based on the drive’s utilization.
Ultimately, the drive that has fewer outstanding I/O requests and is less utilized will always win.
aisloader
AIStore includes `aisloader` - a powerful benchmarking tool that can be used to generate a wide variety of workloads closely resembling those produced by AI apps.
For numerous command-line options and usage examples, simply run `aisloader` with no arguments and see its online help:
$ make aisloader; aisloader
AIS loader (aisloader v1.3, build ...) is a tool to measure storage performance.
It's a load generator that has been developed to benchmark and stress-test AIStore
but can be easily extended to benchmark any S3-compatible backend.
For usage, run: `aisloader`, or `aisloader usage`, or `aisloader --help`.
Further details at https://github.com/NVIDIA/aistore/blob/main/docs/howto_benchmark.md
Command-line options
====================
...
...
Note as well that `aisloader` is fully StatsD-enabled - collected metrics can be forwarded to any StatsD-compliant backend for visualization and further analysis.
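For instance, a typical mixed-workload run might look like this (the bucket name and all flag values below are illustrative - consult aisloader's online help for the authoritative list of options):
$ # 50% PUT / 50% GET for 2 minutes against a placeholder bucket, 16 workers, 1MB..8MB objects
$ aisloader -bucket=ais://abc -duration=2m -numworkers=16 -pctput=50 -minsize=1MB -maxsize=8MB -cleanup=false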