Bare Metal Machine Platform Commands

2025-04-04

This article describes how to perform lifecycle management operations on Bare Metal Machines (BMM). These steps should be used for troubleshooting purposes to recover from failures or when taking maintenance actions.

First, read the advice in the article Best Practices for Bare Metal Machine Operations before proceeding with operations.

The following actions are considered disruptive: Power off, Restart, Reimage, and Replace. The Cordon action is disruptive only when it's run with the evacuate parameter; Cordon without the evacuate parameter isn't considered disruptive.

- Power off a Bare Metal Machine
- Start a Bare Metal Machine
- Restart a Bare Metal Machine
- Make a Bare Metal Machine unschedulable (cordon without evacuate, doesn't drain the node)
- Make a Bare Metal Machine unschedulable (cordon with evacuate, drains the node)
- Make a Bare Metal Machine schedulable (uncordon)
- Reimage a Bare Metal Machine
- Replace a Bare Metal Machine

Caution

Don't perform any action against control plane or management plane servers without first consulting Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.

Important

Multiple disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected. This check maintains the integrity of the Nexus Cluster instance and prevents multiple KCP nodes from becoming nonoperational at once due to simultaneous disruptive actions. A disruptive command is rejected either because another disruptive action is already running against a different KCP node or because the full KCP isn't available. If multiple nodes become nonoperational, the Kubernetes Control Plane falls below its healthy quorum threshold.

The following actions are considered disruptive to Bare Metal Machines (BMM):

- Power off a BMM
- Restart a BMM
- Make a BMM unschedulable (cordon with evacuate, drains the node)
- Reimage a BMM
- Replace a BMM

The remaining actions are nondisruptive:

- Start a BMM
- Make a BMM unschedulable (cordon without evacuate, doesn't drain the node)
- Make a BMM schedulable (uncordon)

Prerequisites

Install the latest version of the appropriate CLI extensions.
Request access to run the Azure Operator Nexus network fabric (NF) and network cloud CLI extension commands.
Sign in to the Azure CLI and select the subscription where the cluster is deployed.
Collect the following information:
- Subscription ID (SUBSCRIPTION)
- Cluster name (CLUSTER)
- Resource group (CLUSTER_RG)
- Managed resource group (CLUSTER_MRG). Bare Metal Machine (BMM) resources are located in the managed resource group.
- Bare Metal Machine name (BMM_NAME) of the machine that requires lifecycle management operations
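
The values collected above can be kept in shell variables so that later commands reuse them. The following is a minimal sketch rather than part of the product tooling: the function name check_bmm_env is hypothetical, and the variable names are the ones suggested in the list above.

```shell
# Sketch: fail early if any value from the prerequisites list is unset.
# check_bmm_env is a hypothetical helper name, not an Azure CLI command.
check_bmm_env() {
  missing=0
  for var in SUBSCRIPTION CLUSTER CLUSTER_RG CLUSTER_MRG BMM_NAME; do
    # Indirectly read the variable whose name is stored in $var.
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

After exporting the variables and signing in (for example with az login followed by az account set --subscription "$SUBSCRIPTION"), running check_bmm_env before a maintenance action catches an unset value before any command is sent.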

Power off a Bare Metal Machine

Important

There are rare cases where running Nexus VMs fail to relaunch after BMM shutdown or restart. To prevent these cases, power off any virtual machines on the BMM before powering off or restarting the BMM. See the cordon command for instructions on finding the workloads running on a BMM.

This command powers off the Bare Metal Machine specified by the --name parameter.

az networkcloud baremetalmachine power-off \
  --name <BareMetalMachineName> \
  --resource-group <resourceGroup> \
  --subscription <subscriptionID>
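
The recommendation to power off virtual machines before the BMM can be scripted. The following is a hedged sketch, not an official procedure: the function name power_off_bmm_with_vms is hypothetical, and it assumes that the virtualMachinesAssociatedIds field holds full resource IDs and that az networkcloud virtualmachine power-off accepts them through the --ids parameter.

```shell
# Sketch: power off every VM hosted on a BMM, then the BMM itself.
# Assumes the Azure CLI with the networkcloud extension is installed and signed in.
power_off_bmm_with_vms() {
  bmm_name="$1"; resource_group="$2"; subscription="$3"

  # Resource IDs of the VMs associated with this BMM, one per line.
  vm_ids=$(az networkcloud baremetalmachine show \
    --name "$bmm_name" --resource-group "$resource_group" \
    --subscription "$subscription" \
    --query "virtualMachinesAssociatedIds[]" --output tsv) || return 1

  # Word splitting is intentional here: resource IDs contain no whitespace.
  for vm_id in $vm_ids; do
    echo "Powering off VM: $vm_id"
    az networkcloud virtualmachine power-off --ids "$vm_id" || return 1
  done

  # Only reached when every VM power-off call succeeded.
  az networkcloud baremetalmachine power-off \
    --name "$bmm_name" --resource-group "$resource_group" \
    --subscription "$subscription"
}
```

Stopping at the first failed VM power-off is a deliberate choice in this sketch, so that the BMM isn't powered off while a VM is still running.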

Start a Bare Metal Machine

This command starts the Bare Metal Machine specified by the --name parameter.

az networkcloud baremetalmachine start \
  --name <BareMetalMachineName> \
  --resource-group <resourceGroup> \
  --subscription <subscriptionID>
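
To confirm that a start (or power-off) request took effect, the machine's reported power state can be polled. This is a minimal sketch under assumptions: the function name wait_for_power_state is hypothetical, and it assumes the show output exposes a powerState field with values such as On and Off.

```shell
# Sketch: poll the BMM until its powerState matches the wanted value.
# Arguments: name, resource group, subscription, wanted state,
#            attempts (default 30), seconds between attempts (default 20).
wait_for_power_state() {
  bmm_name="$1"; resource_group="$2"; subscription="$3"
  wanted="$4"; attempts="${5:-30}"; interval="${6:-20}"

  state=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # --query and --output are global Azure CLI arguments.
    state=$(az networkcloud baremetalmachine show \
      --name "$bmm_name" --resource-group "$resource_group" \
      --subscription "$subscription" \
      --query "powerState" --output tsv)
    if [ "$state" = "$wanted" ]; then
      echo "powerState is $state"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done

  echo "timed out waiting for powerState=$wanted (last: $state)" >&2
  return 1
}
```

For example, wait_for_power_state "$BMM_NAME" "$CLUSTER_MRG" "$SUBSCRIPTION" On would return once the machine reports that it's powered on, or fail after the configured number of attempts.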

Restart a Bare Metal Machine

Important

This command restarts the Bare Metal Machine specified by the --name parameter.

az networkcloud baremetalmachine restart \
  --name <BareMetalMachineName> \
  --resource-group <resourceGroup> \
  --subscription <subscriptionID>

Make a Bare Metal Machine unschedulable (cordon)

You can make a Bare Metal Machine unschedulable by executing the cordon command. While the Bare Metal Machine is cordoned, Operator Nexus workloads aren't scheduled on it. Any attempt to create a workload on a cordoned Bare Metal Machine results in the workload being set to a pending state. Existing workloads continue to run on the Bare Metal Machine unless they're drained.

Drain Bare Metal Machine workloads

The cordon command supports the evacuate parameter. Its default value, False, means that the cordon command only prevents new workloads from being scheduled. To also drain existing workloads, set the evacuate parameter to True. The workloads running on the Bare Metal Machine are then stopped and the Bare Metal Machine is set to pending state.

Note

Nexus Management Workloads continue to run on the Bare Metal Machine even when the server is cordoned and evacuated.

It's a best practice to set the evacuate value to True before attempting any maintenance operation on the Bare Metal Machine. For more best practices to follow, read through Best Practices for Bare Metal Machine Operations.

az networkcloud baremetalmachine cordon \
  --evacuate "True" \
  --name <BareMetalMachineName> \
  --resource-group <resourceGroup> \
  --subscription <subscriptionID>

To identify whether any workloads are currently running on a Bare Metal Machine, run the following commands.

For Virtual Machines:

az networkcloud baremetalmachine show -n <nodeName> \
  --resource-group <resourceGroup> \
  --subscription <subscriptionID> | jq '.virtualMachinesAssociatedIds'

For Nexus Kubernetes cluster nodes (requires signing in to the Nexus Kubernetes cluster):

kubectl get nodes <resourceName> -o json | jq '.metadata.labels."topology.kubernetes.io/baremetalmachine"'
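
The two lookups above can be wrapped so that they start from the BMM name instead of from an individual node. This is a sketch under assumptions: the function names vms_on_bmm and nodes_on_bmm are hypothetical, and the second function assumes the value of the topology.kubernetes.io/baremetalmachine label is the BMM name.

```shell
# Sketch: list the workloads hosted on one Bare Metal Machine.

# VM resource IDs associated with the BMM (uses --query instead of jq).
vms_on_bmm() {
  az networkcloud baremetalmachine show \
    --name "$1" --resource-group "$2" --subscription "$3" \
    --query "virtualMachinesAssociatedIds[]" --output tsv
}

# Nexus Kubernetes nodes whose label points at the BMM.
# Run this with kubectl already connected to the Nexus Kubernetes cluster.
nodes_on_bmm() {
  kubectl get nodes \
    --selector "topology.kubernetes.io/baremetalmachine=$1" \
    --output name
}
```

Selecting by label lists every node on the machine in one call, whereas the command shown earlier reports the hosting machine for a single node.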

Make a Bare Metal Machine schedulable (uncordon)

You can make a Bare Metal Machine "schedulable" (the server can host workloads) by executing the uncordon command. All workloads in a pending state on the Bare Metal Machine are restarted when the Bare Metal Machine is uncordoned.

az networkcloud baremetalmachine uncordon \
  --name <BareMetalMachineName> \
  --resource-group <resourceGroup> \
  --subscription <subscriptionID>
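
The cordon and uncordon commands are often used as a pair around a maintenance step. The following sketch shows one way to express that pairing; the function name with_bmm_cordoned is hypothetical, and leaving the machine cordoned when the step fails is a design choice of this example, not documented platform behavior.

```shell
# Sketch: cordon and drain a BMM, run a maintenance command, then uncordon.
# Arguments: name, resource group, subscription, then the command to run.
with_bmm_cordoned() {
  bmm_name="$1"; resource_group="$2"; subscription="$3"
  shift 3

  az networkcloud baremetalmachine cordon --evacuate "True" \
    --name "$bmm_name" --resource-group "$resource_group" \
    --subscription "$subscription" || return 1

  # Run the caller's maintenance command and remember its exit status.
  status=0
  "$@" || status=$?

  if [ "$status" -ne 0 ]; then
    echo "maintenance step failed ($status); leaving $bmm_name cordoned" >&2
    return "$status"
  fi

  az networkcloud baremetalmachine uncordon \
    --name "$bmm_name" --resource-group "$resource_group" \
    --subscription "$subscription"
}
```

A call such as with_bmm_cordoned "$BMM_NAME" "$CLUSTER_MRG" "$SUBSCRIPTION" ./run-maintenance.sh (where run-maintenance.sh stands for your own script) makes the machine schedulable again only after the step succeeds.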

Reimage a Bare Metal Machine

You can restore the runtime version on a Bare Metal Machine by executing the reimage command. The reimage action doesn't affect the tenant workload files on the Bare Metal Machine. This process redeploys the runtime image on the target Bare Metal Machine and executes the steps to rejoin the cluster with the same identifiers.

As a best practice, ensure the Bare Metal Machine's workloads are drained using the cordon command, with evacuate set to True, before executing the reimage command. For more best practices to follow, read through Best Practices for Bare Metal Machine Operations.

Important

Avoid making write or edit changes on the node through Bare Metal Machine access. A reimage is required to restore Microsoft support, and any changes made to the Bare Metal Machine are lost when the node is restored to its expected state.

Warning

Don't run more than one baremetalmachine replace or reimage command at the same time for the same Bare Metal Machine (BMM) resource. Executing replace at the same time as a reimage leaves servers in a nonoperational state. Make sure any replace or reimage on the BMM completes fully before starting another one. Additionally, avoid executing sequential reimage actions on a BMM that just completed a replace action unless a specified maintenance operation is being performed.

az networkcloud baremetalmachine reimage \
  --name <BareMetalMachineName> \
  --resource-group <resourceGroup> \
  --subscription <subscriptionID>

Replace a Bare Metal Machine

Use the replace command when a server encounters hardware issues requiring a complete or partial hardware replacement. After replacing components such as the motherboard or a Network Interface Card (NIC), the MAC address of the Bare Metal Machine changes; however, the iDRAC IP address and hostname remain the same. A replace must be executed after each hardware maintenance operation. For more details, read through Best practices for a Bare Metal Machine replace.

Warning

az networkcloud baremetalmachine replace \
  --name <BareMetalMachineName> \
  --resource-group <resourceGroup> \
  --bmc-credentials password=<IDRAC_PASSWORD> username=<IDRAC_USER> \
  --bmc-mac-address <IDRAC_MAC> \
  --boot-mac-address <PXE_MAC> \
  --machine-name <OS_HOSTNAME> \
  --serial-number <SERIAL_NUMBER> \
  --subscription <subscriptionID>
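
Because the replace command takes several hand-typed hardware values, a local sanity check before submitting can catch typos. This is a sketch under assumptions: the function names is_mac_address and replace_bmm_checked are hypothetical, the environment variable names mirror the placeholders above, and the pattern accepts six hex octets separated by hyphens or colons, which matches the examples in this article but isn't a statement of every format the service accepts.

```shell
# Sketch: check MAC address shape locally before calling replace.
is_mac_address() {
  printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}[-:]){5}[0-9A-Fa-f]{2}$'
}

# Arguments: BMC (iDRAC) MAC address, boot (PXE) MAC address.
# Reads BMM_NAME, CLUSTER_MRG, SUBSCRIPTION, IDRAC_USER, IDRAC_PASSWORD,
# OS_HOSTNAME, and SERIAL_NUMBER from the environment.
replace_bmm_checked() {
  bmc_mac="$1"; boot_mac="$2"

  for mac in "$bmc_mac" "$boot_mac"; do
    if ! is_mac_address "$mac"; then
      echo "not a MAC address: $mac" >&2
      return 2
    fi
  done

  az networkcloud baremetalmachine replace \
    --name "$BMM_NAME" \
    --resource-group "$CLUSTER_MRG" \
    --bmc-credentials password="$IDRAC_PASSWORD" username="$IDRAC_USER" \
    --bmc-mac-address "$bmc_mac" \
    --boot-mac-address "$boot_mac" \
    --machine-name "$OS_HOSTNAME" \
    --serial-number "$SERIAL_NUMBER" \
    --subscription "$SUBSCRIPTION"
}
```

The check only validates the shape of the two addresses. It can't tell whether they belong to the replaced hardware, so hardware validation can still fail on the service side.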

If the replace action fails due to a hardware validation failure, the specific error or test failure is shown in the replace response, as in the following examples. This information can also be found in the Activity Log for the Bare Metal Machine (Operator Nexus). The error code and error message are included in the JSON properties of the corresponding BareMetalMachines_Replace operation.

Example 1: hardware validation fails due to invalid Baseboard Management Controller (BMC) credentials provided

$ az networkcloud baremetalmachine replace --name rack1compute02 --resource-group hostedRG --bmc-credentials password=REDACTED username=root --bmc-mac-address 00-00-5E-00-01-00 --boot-mac-address 00-00-5E-00-02-00 --machine-name RACK1COMPUTE02 --serial-number SN123435
(None) BMC login unsuccessful: Fail - Unauthorized; System health test(s) failed: [Additional logs: Server power down at end of test failed with: Unauthorized]
Code: None
Message: BMC login unsuccessful: Fail - Unauthorized; System health test(s) failed: [Additional logs: Server power down at end of test failed with: Unauthorized]

Example 2: hardware validation fails due to networking failure

$ az networkcloud baremetalmachine replace --name rack1compute02 --resource-group hostedRG --bmc-credentials password=REDACTED username=root --bmc-mac-address 00-00-5E-00-01-00 --boot-mac-address 00-00-5E-00-02-00 --machine-name RACK1COMPUTE02 --serial-number SN123435
(None) Networking test(s) failed: [NIC.Slot.6-1-1_LinkStatus] expected: up; observed: Down; [Additional logs: Link failure detected on NIC.Slot.6-1-1; Unable to perform cabling check on PCI Slot 6]
Code: None
Message: Networking test(s) failed: [NIC.Slot.6-1-1_LinkStatus] expected: up; observed: Down; [Additional logs: Link failure detected on NIC.Slot.6-1-1; Unable to perform cabling check on PCI Slot 6]

For more information about troubleshooting hardware validation failures, see Troubleshoot Hardware Validation Failure.
