When I try to create an index with the QAT codec:
curl --cacert ~/node-cm0.pem -XPUT "https://admin:admin@localhost:9200/qat_index" -H 'Content-Type:application/json' -d'
{
"settings": {
"index": {
"codec.qatmode" : "hardware",
"codec": "qat_lz4"
}
}
}'
I get this error:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"unknown value for [index.codec] must be one of [default, lz4, best_compression, zlib] but was: qat_lz4"}],"type":"illegal_argument_exception","reason":"unknown value for [index.codec] must be one of [default, lz4, best_compression, zlib] but was: qat_lz4"},"status":400}
The opensearch log contains nothing relevant that explains the cause of this error. By building upstream opensearch and some of its components with additional logging, I eventually found the reason: the qat java plugin cannot be loaded properly because it lacks the appropriate permission:
Caused by: java.security.AccessControlException: access denied ("org.opensearch.secure_sm.ThreadPermission" "modifyArbitraryThread")
at java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:488) ~[?:?]
at java.base/java.security.AccessController.checkPermission(AccessController.java:1071) ~[?:?]
at java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:411) ~[?:?]
To fix this issue, we have to grant the qat java plugin this specific permission by modifying the file: usr/share/opensearch/plugins/opensearch-custom-codecs/plugin-security.policy
Here are the contents of this file after modification:
...
grant codeBase "${codebase.qat-java}" {
permission java.lang.RuntimePermission "loadLibrary.*";
permission org.opensearch.secure_sm.ThreadPermission "modifyArbitraryThread";
};
...
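A quick way to confirm that the grant is really present in the policy file the node loads is to search for it. This is only a minimal sketch: has_thread_permission is a hypothetical helper name, and the path to plugin-security.policy has to be supplied for your installation.

```shell
# Hypothetical helper: succeeds if the given policy file contains the
# ThreadPermission grant quoted in the excerpt above.
has_thread_permission() {
    grep -qF 'org.opensearch.secure_sm.ThreadPermission "modifyArbitraryThread"' "$1" 2>/dev/null
}

# Usage (the path is whatever your installation uses):
#   has_thread_permission /path/to/plugin-security.policy && echo "grant present"
```

The match is a fixed string (-F), so the dots in the class name are not treated as regex wildcards.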
The next issue is that opensearch cannot see any QAT devices, despite the fact that QAT is properly set up on the host and the required interfaces (intel-qat and process-control) are connected:
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: No devices found
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: No device found
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: Error userStarMultiProcess(-1), switch to SW if permitted
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: g_process.qz_init_status = QZ_NOSW_NO_HW
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: [2025-03-12T02:24:47,802][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [cm0] fatal error in thread [opensearch[cm0][clusterApplierService#updateTask][T#1]], exiting
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: java.lang.ExceptionInInitializerError: null
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: at org.opensearch.index.codec.customcodecs.QatZipperFactory.isQatAvailable(QatZipperFactory.java:177) ~[?:?]
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: at org.opensearch.index.codec.customcodecs.CustomCodecPlugin.getCustomCodecServiceFactory(CustomCodecPlugin.java:53) ~[?:?]
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: at org.opensearch.index.engine.EngineConfigFactory.<init>(EngineConfigFactory.java:102) ~[opensearch-2.18.0.jar:2.18.0]
This is because the opensearch daemon runs neither as root nor as a regular user belonging to the group qat, which is required to access the device files under /dev/vfio/.
In fact, the opensearch daemon runs as the snap_daemon user, which does not belong to the group qat.
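The identity and groups the running daemon actually has can be read from /proc, which reflects the live process rather than the account database. A minimal sketch, assuming a Linux host; proc_creds is a hypothetical helper name, and the PID is whatever your daemon currently has (the number in brackets in the journal lines).

```shell
# Hypothetical helper: print the UID, GID and supplementary groups of a
# running process, as recorded by the kernel in /proc/<pid>/status.
proc_creds() {
    grep -E '^(Uid|Gid|Groups):' "/proc/$1/status"
}

# Usage:
#   proc_creds <pid-of-opensearch>
# The numeric IDs can be compared with the output of: getent group qat
```

An empty Groups line means the process has no supplementary groups at all.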
We can see this in the start.sh script, which launches the daemon as snap_daemon:
"${SNAP}"/usr/bin/setpriv \
--clear-groups \
--reuid snap_daemon \
--regid snap_daemon -- \
"${OPENSEARCH_BIN}"/opensearch
A quick fix is to grant snap_daemon access to the files under /dev/vfio using ACLs:
sudo setfacl -m group:snap_daemon:rw /dev/vfio/devices/*
sudo setfacl -m group:snap_daemon:rw /dev/vfio/*
This is only a work-around and I think we should find a proper way to handle this permission issue.
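Whether the ACL entries were actually applied can be checked by inspecting getfacl output. The sketch below only parses that output; acl_grants_group_rw is a hypothetical helper name, and it does not take the ACL mask into account.

```shell
# Hypothetical helper: reads getfacl output on stdin and succeeds if at
# least one listed file has a read/write entry for the named group.
# Note: it ignores the ACL mask, so it is an approximation.
acl_grants_group_rw() {
    grep -Eq "^group:$1:rw"
}

# Usage (same paths as in the setfacl commands above):
#   getfacl /dev/vfio/* /dev/vfio/devices/* 2>/dev/null | acl_grants_group_rw snap_daemon \
#       && echo "snap_daemon has rw via ACL"
```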
We are not done yet: opensearch now complains that it cannot allocate DMA memory:
Mar 12 12:18:38 corsair-741103 opensearch.daemon[1129295]: dma_map_slab:200 VFIO_IOMMU_MAP_DMA failed va=71e15d000000 iova=a00000 size=200000 -- errno=12
Mar 12 12:18:38 corsair-741103 opensearch.daemon[1129295]: [error] Lac_MemPoolCreate() - : Unable to allocate contiguous chunk of memory
Mar 12 12:18:38 corsair-741103 opensearch.daemon[1129295]: [error] SalCtrl_CompressionInit() - : Failed to create dc memory pool
Mar 12 12:18:38 corsair-741103 opensearch.daemon[1129295]: [error] SalCtrl_ServiceInit() - : Failed to initialise all service instances
The reason is that the MEMLOCK limit is exceeded (errno=12 is ENOMEM).
In fact, for some reason, the opensearch memlock limit is fixed at 8192.
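The limit that applies to the running daemon can be read from /proc/<pid>/limits. A minimal sketch, assuming a Linux host; memlock_soft_limit is a hypothetical helper name. Keep in mind that /proc reports this limit in bytes, whereas the shell's ulimit -l reports it in units of 1024 bytes, so the two numbers differ by that factor.

```shell
# Hypothetical helper: print the soft "Max locked memory" limit of a
# running process, in bytes (or the word "unlimited").
memlock_soft_limit() {
    awk '/^Max locked memory/ { print $4 }' "/proc/$1/limits"
}

# Usage:
#   memlock_soft_limit <pid-of-opensearch>
```

Running the same check after a restart shows whether a changed limit took effect for the daemon process itself.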
We have to increase this limit, and one way to do this is to add the following ulimit line to the start.sh script, just before the daemon is started:
ulimit -l unlimited
# start
"${SNAP}"/usr/bin/setpriv \
--clear-groups \
--reuid snap_daemon \
--regid snap_daemon -- \
"${OPENSEARCH_BIN}"/opensearch
With all three fixes in place, we can finally create an index with the QAT codec:
{
"acknowledged": true,
"shards_acknowledged": true,
"index": "qat_index"
}