TL;DR: Don't use Ubuntu Server 24.04 if you mount a blob storage container that needs a constant connection during a long job (hours) and your data files are large (tens to hundreds of GB each). Use Ubuntu Server 22.04 instead.
Ans:
After a lot of back and forth with people inside MS support, I got an interesting request: use Ubuntu Server 22.04 instead of Ubuntu Server 24.04. It worked! I don't know what the catch is, but apparently the MS start script has some conflict with Ubuntu Server 24.04 that makes the node lose its lease on the storage container. When the connection to the storage container is lost, the blobfuse.py scripts go into a nested call that leads to OOM. I will do more test runs, but it was stable in my initial 10 test runs.

Below is a sample of the winning configuration for my specific case. Note that I didn't create it with the Azure CLI; I copied it from the portal and modified it, leaving out the values that are only relevant to me (e.g. the account key and things like that).
{
  "id": "your-pool",
  "displayName": null,
  "vmSize": "STANDARD_E20ds_V4",
  "virtualMachineConfiguration": {
    "imageReference": {
      "publisher": null,
      "offer": null,
      "sku": null,
      "version": null,
      "virtualMachineImageId": "/subscriptions/yoursubs/Microsoft.Compute/galleries/batchcomputegallery/images/yourcustom-ubuntu-22_04-lts-server/versions/1.0.0",
      "exactVersion": null
    },
    "nodeAgentSKUId": "batch.node.ubuntu 22.04",
    "securityProfile": {
      "securityType": "TrustedLaunch",
      "encryptionAtHost": null,
      "uefiSettings": {
        "secureBootEnabled": true,
        "vTpmEnabled": true
      }
    }
  },
  "enableAutoScale": true,
  "autoScaleFormula": "startingNumberOfVMs = 0; maxNumberofVMs = 10; pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second); pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second)); $TargetDedicatedNodes=min(maxNumberofVMs, pendingTaskSamples); $NodeDeallocationOption = taskcompletion",
  "autoScaleEvaluationInterval": "PT5M",
  "enableInterNodeCommunication": false,
  "networkConfiguration": {
    "subnetId": "/subscriptions/yoursubs/Microsoft.Network/virtualNetworks/net-batch-001/subnets/default",
    "dynamicVNetAssignmentScope": null,
    "publicIPAddressConfiguration": {
      "provision": "batchmanaged"
    },
    "enableAcceleratedNetworking": true
  },
  "applicationPackageReferences": [
    {
      "applicationId": "testApp",
      "version": null
    }
  ],
  "taskSlotsPerNode": 1,
  "taskSchedulingPolicy": {
    "nodeFillType": "spread"
  },
  "targetNodeCommunicationMode": "simplified",
  "mountConfiguration": [
    {
      "azureBlobFileSystemConfiguration": {
        "accountName": "yourStorageAcc",
        "accountKey": "yourKey",
        "blobfuseOptions": "--allow-other --log-level=LOG_DEBUG --read-only=false --disable-writeback-cache=false --file-cache-timeout=15000 --sync-to-flush=true -o attr_timeout=15000 -o entry_timeout=15000 -o negative_timeout=15000",
        "containerName": "yourContainer",
        "relativeMountPath": "yourContainer"
      }
    }
  ]
}
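
In case the autoScaleFormula in the config above is hard to read as a one-liner, here is a rough Python sketch of the arithmetic as I understand it. This is not Batch code: the function name and the sample lists are made up for illustration, the empty-sample guard is my own addition, and I am not modelling how Batch turns the result into a whole number of nodes.

```python
# Rough model of the autoScaleFormula above. Names and inputs are
# hypothetical; only the three constants come from the formula itself.

STARTING_NUMBER_OF_VMS = 0   # startingNumberOfVMs
MAX_NUMBER_OF_VMS = 10       # maxNumberofVMs
MIN_SAMPLE_PERCENT = 70      # threshold used in the ternary


def target_dedicated_nodes(pending_task_samples, sample_percent):
    """Return the value the formula would assign to $TargetDedicatedNodes.

    pending_task_samples: $PendingTasks samples from the last 180 seconds.
    sample_percent: what GetSamplePercent reports for the same window.
    """
    # Too few samples in the window: fall back to the starting value.
    # The empty-list check is my own guard, not part of the formula.
    if sample_percent < MIN_SAMPLE_PERCENT or not pending_task_samples:
        pending = STARTING_NUMBER_OF_VMS
    else:
        pending = sum(pending_task_samples) / len(pending_task_samples)

    # Never ask for more nodes than the cap.
    return min(MAX_NUMBER_OF_VMS, pending)


if __name__ == "__main__":
    print(target_dedicated_nodes([4, 6, 5], 100))  # enough samples, below the cap
    print(target_dedicated_nodes([4, 6, 5], 50))   # too few samples
    print(target_dedicated_nodes([30, 40], 90))    # average above the cap
```

So the pool scales to roughly the average number of pending tasks, capped at 10 nodes, and drops back to 0 when there aren't enough samples to trust.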