@hector-cao
Last active March 13, 2025 03:33

Problem statement

When I try to create an index with a QAT codec:

curl --cacert ~/node-cm0.pem -XPUT "https://admin:admin@localhost:9200/qat_index" -H 'Content-Type:application/json' -d'
{
  "settings": {
    "index": {
      "codec.qatmode" : "hardware",
      "codec": "qat_lz4"
    }
  }
}'

I get this error:

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"unknown value for [index.codec] must be one of [default, lz4, best_compression, zlib] but was: qat_lz4"}],"type":"illegal_argument_exception","reason":"unknown value for [index.codec] must be one of [default, lz4, best_compression, zlib] but was: qat_lz4"},"status":400}

Fixes

Issue 1

The OpenSearch log contains nothing that explains the cause of this error. After rebuilding some components of upstream OpenSearch with extra logging, I found the reason: the QAT Java plugin cannot be loaded because it lacks a required permission:

Caused by: java.security.AccessControlException: access denied ("org.opensearch.secure_sm.ThreadPermission" "modifyArbitraryThread")                                                                                             
    at java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:488) ~[?:?]                                                                                                                        
    at java.base/java.security.AccessController.checkPermission(AccessController.java:1071) ~[?:?]                                                                                                                               
    at java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:411) ~[?:?]  

To fix this, grant the QAT Java plugin this specific permission by editing the file usr/share/opensearch/plugins/opensearch-custom-codecs/plugin-security.policy. Here are the contents of the file after the change:


...
grant codeBase "${codebase.qat-java}" {
  permission java.lang.RuntimePermission "loadLibrary.*";
  permission org.opensearch.secure_sm.ThreadPermission "modifyArbitraryThread";
};
...
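To check that the edit landed inside the right grant block, a small helper can scan the policy file. This is only a sketch: the function name `has_thread_permission` is made up for illustration, and it assumes the grant block is closed by a line starting with `};` as in the excerpt above.

```shell
# Hypothetical helper: succeed (exit 0) only if the qat-java grant block
# in the given policy file contains the ThreadPermission line.
has_thread_permission() {
    policy_file="$1"
    awk '
        # Enter the block on the literal codeBase line for qat-java.
        index($0, "grant codeBase \"${codebase.qat-java}\"") { in_block = 1 }
        # Remember whether the permission appears while inside that block.
        in_block && index($0, "ThreadPermission \"modifyArbitraryThread\"") { found = 1 }
        # Leave the block on its closing line.
        in_block && /^[ \t]*};/ { in_block = 0 }
        END { exit !found }
    ' "$policy_file"
}
```

For example, `has_thread_permission usr/share/opensearch/plugins/opensearch-custom-codecs/plugin-security.policy && echo "permission present"`.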

Issue 2

The next issue is that OpenSearch cannot see any QAT devices, despite the fact that QAT is properly set up on the host and the required interfaces (intel-qat and process-control) are connected:

Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: No devices found                                                                                                                                                       
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: No device found                                                                                                                                                        
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: Error userStarMultiProcess(-1), switch to SW if permitted                                                                                                              
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: g_process.qz_init_status = QZ_NOSW_NO_HW                                                                                                                               
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: [2025-03-12T02:24:47,802][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [cm0] fatal error in thread [opensearch[cm0][clusterApplierService#updateTask][T#1]], exiting
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]: java.lang.ExceptionInInitializerError: null                                                                                                                            
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]:         at org.opensearch.index.codec.customcodecs.QatZipperFactory.isQatAvailable(QatZipperFactory.java:177) ~[?:?]                                                   
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]:         at org.opensearch.index.codec.customcodecs.CustomCodecPlugin.getCustomCodecServiceFactory(CustomCodecPlugin.java:53) ~[?:?]                                    
Mar 12 02:24:47 corsair-741103 opensearch.daemon[707672]:         at org.opensearch.index.engine.EngineConfigFactory.<init>(EngineConfigFactory.java:102) ~[opensearch-2.18.0.jar:2.18.0] 

This happens because the OpenSearch daemon runs neither as root nor as a regular user in the qat group, and one of the two is required to access the device files under /dev/vfio/. The daemon actually runs as the snap_daemon user, which does not belong to the qat group.

We can see it in the start.sh script:

    "${SNAP}"/usr/bin/setpriv \
        --clear-groups \
        --reuid snap_daemon \
        --regid snap_daemon -- \
        "${OPENSEARCH_BIN}"/opensearch

A quick fix is to grant the snap_daemon group (the group the daemon runs with) read/write access to the files in /dev/vfio using ACLs:

sudo setfacl -m group:snap_daemon:rw /dev/vfio/devices/*
sudo setfacl -m group:snap_daemon:rw /dev/vfio/*

This is only a workaround; we should find a proper way to handle this permission issue.
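To see whether the ACLs are effective, one option is to list the device files that the current user still cannot open read/write. The helper below is a rough sketch: the name `vfio_inaccessible` is invented here, and the check is only meaningful when run from a shell that has the same user and group as the daemon (root passes it regardless of ACLs).

```shell
# Hypothetical helper: print every non-directory entry under a directory
# (default /dev/vfio) that the current user cannot both read and write.
vfio_inaccessible() {
    dir="${1:-/dev/vfio}"
    find "$dir" ! -type d | while IFS= read -r f; do
        if [ ! -r "$f" ] || [ ! -w "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Empty output means every entry is accessible. Running `getfacl /dev/vfio/*` additionally shows the ACL entries themselves.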

Issue 3

We are not done yet. OpenSearch now complains that it cannot allocate DMA memory:

Mar 12 12:18:38 corsair-741103 opensearch.daemon[1129295]: dma_map_slab:200 VFIO_IOMMU_MAP_DMA failed va=71e15d000000 iova=a00000 size=200000 -- errno=12
Mar 12 12:18:38 corsair-741103 opensearch.daemon[1129295]: [error] Lac_MemPoolCreate() - : Unable to allocate contiguous chunk of memory
Mar 12 12:18:38 corsair-741103 opensearch.daemon[1129295]: [error] SalCtrl_CompressionInit() - : Failed to create dc memory pool
Mar 12 12:18:38 corsair-741103 opensearch.daemon[1129295]: [error] SalCtrl_ServiceInit() - : Failed to initialise all service instances

The reason is that the MEMLOCK limit is exceeded (errno=12 is ENOMEM). For some reason, the memlock limit of the OpenSearch process is set to 8192.

We have to raise this limit. One way to do this is to add a ulimit line to the start.sh script, just before the daemon is started:


    # lift the locked-memory limit; the daemon started below inherits it
    ulimit -l unlimited
    # start
    "${SNAP}"/usr/bin/setpriv \
        --clear-groups \
        --reuid snap_daemon \
        --regid snap_daemon -- \
        "${OPENSEARCH_BIN}"/opensearch
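After restarting the service, the limit that the running process actually has can be read from /proc. The snippet below is a sketch with an invented function name, `memlock_limit`. Note that on Linux /proc/<pid>/limits reports the value in bytes, whereas `ulimit -l` in bash reports it in KiB.

```shell
# Hypothetical helper: print the soft "Max locked memory" limit of a process.
# $1 is the PID; $2 optionally overrides the limits file (handy for testing).
memlock_limit() {
    pid="$1"
    limits_file="${2:-/proc/${pid}/limits}"
    # Columns: "Max" "locked" "memory" <soft> <hard> <units>
    awk '/^Max locked memory/ { print $4 }' "$limits_file"
}
```

With the fix applied, `memlock_limit <pid-of-opensearch>` should print `unlimited`; how the PID is obtained (for instance with `pgrep`) depends on the setup.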

Conclusion

With these three fixes in place, we can finally create an index with the QAT codec:


{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "qat_index"
}