This article describes how to perform lifecycle management operations on Bare Metal Machines (BMM). These steps should be used for troubleshooting purposes to recover from failures or when taking maintenance actions.
First, read the advice in the article Best Practices for Bare Metal Machine Operations before proceeding with operations.
The Power off, Restart, Reimage, and Replace actions in the following list are considered disruptive.
The Cordon action without the evacuate parameter isn't considered disruptive, while Cordon with the evacuate parameter is.
- Power off a Bare Metal Machine
- Start a Bare Metal Machine
- Restart a Bare Metal Machine
- Make a Bare Metal Machine unschedulable (cordon without evacuate, doesn't drain the node)
- Make a Bare Metal Machine unschedulable (cordon with evacuate, drains the node)
- Make a Bare Metal Machine schedulable (uncordon)
- Reimage a Bare Metal Machine
- Replace a Bare Metal Machine
Caution
Don't perform any action against control or management plane servers without first consulting Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.
Important
Multiple disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected. This check maintains the integrity of the Nexus Cluster instance and prevents multiple KCP nodes from becoming nonoperational at once due to simultaneous disruptive actions. A disruptive command is rejected either because another disruptive action is already running against a different KCP node or because the full KCP isn't available. If multiple nodes become nonoperational, the healthy quorum threshold of the Kubernetes Control Plane is broken.
The following actions are considered disruptive to Bare Metal Machines (BMM):
- Power off a BMM
- Restart a BMM
- Make a BMM unschedulable (cordon with evacuate, drains the node)
- Reimage a BMM
- Replace a BMM
The remaining actions are nondisruptive:
- Start a BMM
- Make a BMM unschedulable (cordon without evacuate, doesn't drain the node)
- Make a BMM schedulable (uncordon)
Prerequisites
- Install the latest version of the appropriate CLI extensions.
- Request access to run the Azure Operator Nexus network fabric (NF) and network cloud CLI extension commands.
- Sign in to the Azure CLI and select the subscription where the cluster is deployed.
- Collect the following information:
  - Subscription ID (SUBSCRIPTION)
  - Cluster name (CLUSTER)
  - Resource group (CLUSTER_RG)
  - Managed resource group (CLUSTER_MRG). Bare Metal Machine (BMM) resources are present in the managed resource group.
  - Bare Metal Machine name (BMM_NAME) that requires lifecycle management operations
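The values collected above can be kept in shell variables so that later commands reuse them. The following is a minimal sketch, assuming a POSIX shell. The variable names match the placeholders in the list, the values are placeholders to fill in, and check_bmm_env is a hypothetical helper, not part of the Azure CLI.

```shell
# Placeholder values: replace each one with the value for your environment.
SUBSCRIPTION="<subscriptionID>"
CLUSTER="<clusterName>"
CLUSTER_RG="<resourceGroup>"
CLUSTER_MRG="<managedResourceGroup>"
BMM_NAME="<BareMetalMachineName>"

# Hypothetical helper: report every variable above that is empty.
# Returns 0 when all are set, 1 otherwise. The body runs in a subshell
# so its working variables don't leak into the calling shell.
check_bmm_env() (
  status=0
  for var in SUBSCRIPTION CLUSTER CLUSTER_RG CLUSTER_MRG BMM_NAME; do
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "Missing value for $var" >&2
      status=1
    fi
  done
  exit "$status"
)
```

After signing in, az account set --subscription "$SUBSCRIPTION" selects the subscription, and running check_bmm_env before any lifecycle command catches a value that was left empty.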
Power off a Bare Metal Machine
Important
There are rare cases where running Nexus VMs fail to relaunch after a BMM shutdown or restart. To prevent these cases, power off any virtual machines on the BMM before powering off or restarting the BMM. See the cordon command for instructions on finding the workloads running on a BMM.
This command powers off the specified bareMetalMachineName.
az networkcloud baremetalmachine power-off \
--name <BareMetalMachineName> \
--resource-group <resourceGroup> \
--subscription <subscriptionID>
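The advice to power off virtual machines first can be turned into a guard around the power-off command. The following is a sketch under stated assumptions: power_off_if_no_vms is a hypothetical helper, and it assumes the virtualMachinesAssociatedIds property (the one queried in the cordon section) lists every VM placed on the BMM. The check is conservative, because a VM that is associated but already powered off also blocks the action.

```shell
# Hypothetical helper: power off a BMM only when no virtual machines are
# associated with it.
# Arguments: <BareMetalMachineName> <resourceGroup> <subscriptionID>
power_off_if_no_vms() (
  bmm_name="$1"
  resource_group="$2"
  subscription="$3"

  # List associated VM resource IDs, one per line (empty output means none).
  vm_ids=$(az networkcloud baremetalmachine show \
    --name "$bmm_name" \
    --resource-group "$resource_group" \
    --subscription "$subscription" \
    --query "virtualMachinesAssociatedIds" \
    --output tsv) || exit 1

  if [ -n "$vm_ids" ]; then
    echo "Virtual machines are still associated with $bmm_name:" >&2
    echo "$vm_ids" >&2
    exit 1
  fi

  az networkcloud baremetalmachine power-off \
    --name "$bmm_name" \
    --resource-group "$resource_group" \
    --subscription "$subscription"
)
```

The helper returns a nonzero status without calling power-off when any VM ID is reported, so it can be used as a precondition in a maintenance script.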
Start a Bare Metal Machine
This command starts the specified bareMetalMachineName.
az networkcloud baremetalmachine start \
--name <BareMetalMachineName> \
--resource-group <resourceGroup> \
--subscription <subscriptionID>
Restart a Bare Metal Machine
Important
There are rare cases where running Nexus VMs fail to relaunch after a BMM shutdown or restart. To prevent these cases, power off any virtual machines on the BMM before powering off or restarting the BMM. See the cordon command for instructions on finding the workloads running on a BMM.
This command restarts the specified bareMetalMachineName.
az networkcloud baremetalmachine restart \
--name <BareMetalMachineName> \
--resource-group <resourceGroup> \
--subscription <subscriptionID>
Make a Bare Metal Machine unschedulable (cordon)
You can make a Bare Metal Machine unschedulable by executing the cordon command.
While a Bare Metal Machine is cordoned, Operator Nexus workloads aren't scheduled on it.
Any attempt to create a workload on a cordoned Bare Metal Machine results in the workload being set to a pending state.
Existing workloads continue to run on the Bare Metal Machine unless the workloads are drained.
Drain Bare Metal Machine workloads
The cordon command supports the evacuate parameter. Its default value, False, means that the cordon command only prevents new workloads from being scheduled.
To drain workloads with the cordon command, the evacuate parameter must be set to True.
The workloads running on the Bare Metal Machine are stopped and the Bare Metal Machine is set to a pending state.
Note
Nexus Management Workloads continue to run on the Bare Metal Machine even when the server is cordoned and evacuated.
It's a best practice to set the evacuate value to True when attempting to do any maintenance operations on the Bare Metal server.
For more best practices to follow, read through Best Practices for Bare Metal Machine Operations.
az networkcloud baremetalmachine cordon \
--evacuate "True" \
--name <BareMetalMachineName> \
--resource-group <resourceGroup> \
--subscription <subscriptionID>
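The two cordon modes described above differ only in the evacuate value. The following sketch wraps the command so the choice is explicit. cordon_bmm is a hypothetical helper, and it assumes the parameter accepts exactly the strings True and False used in this article.

```shell
# Hypothetical helper: cordon a BMM, optionally draining its workloads.
# Arguments: <BareMetalMachineName> <resourceGroup> <subscriptionID> [True|False]
cordon_bmm() (
  evacuate="${4:-False}"   # mirrors the documented default of False

  case "$evacuate" in
    True|False) ;;
    *)
      echo "evacuate must be True or False, got: $evacuate" >&2
      exit 2
      ;;
  esac

  # Draining is the disruptive variant, so say so before running it.
  if [ "$evacuate" = "True" ]; then
    echo "Draining workloads from $1 (disruptive)" >&2
  fi

  az networkcloud baremetalmachine cordon \
    --evacuate "$evacuate" \
    --name "$1" \
    --resource-group "$2" \
    --subscription "$3"
)
```

Calling the helper with three arguments cordons without draining. Passing True as the fourth argument drains the node, which is the recommended mode before maintenance.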
To identify if any workloads are currently running on a Bare Metal Machine, run the following commands:
For Virtual Machines:
az networkcloud baremetalmachine show -n <nodeName> \
--resource-group <resourceGroup> \
--subscription <subscriptionID> | jq '.virtualMachinesAssociatedIds'
For Nexus Kubernetes cluster nodes (requires signing in to the Nexus Kubernetes cluster):
kubectl get nodes <resourceName> -o json | jq '.metadata.labels."topology.kubernetes.io/baremetalmachine"'
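The kubectl command above reads the label for one node. The reverse lookup, finding every node placed on a given BMM, can be sketched with a label selector. nodes_on_bmm is a hypothetical helper, and it assumes the value of the topology.kubernetes.io/baremetalmachine label equals the Bare Metal Machine name.

```shell
# Hypothetical helper: list the Nexus Kubernetes nodes placed on a given BMM.
# Argument: <BareMetalMachineName>
# Assumes the label value matches the BMM name exactly.
nodes_on_bmm() {
  kubectl get nodes \
    --selector "topology.kubernetes.io/baremetalmachine=$1" \
    --output name
}
```

An empty result suggests that no node of the current Nexus Kubernetes cluster is placed on that BMM. Other clusters need to be checked separately.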
Make a Bare Metal Machine schedulable (uncordon)
You can make a Bare Metal Machine "schedulable" (the server can host workloads) by executing the uncordon command.
All workloads in a pending state on the Bare Metal Machine are restarted when the Bare Metal Machine is uncordoned.
az networkcloud baremetalmachine uncordon \
--name <BareMetalMachineName> \
--resource-group <resourceGroup> \
--subscription <subscriptionID>
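The cordon, restart, and uncordon commands can be chained into one maintenance sequence. The following is a sketch, not a recommended procedure. restart_with_drain is a hypothetical helper, and it assumes each az command returns only after its operation finishes, which is the usual behavior when --no-wait isn't passed. It doesn't replace the advice to power off virtual machines before a restart.

```shell
# Hypothetical helper: drain, restart, and uncordon a BMM, stopping at the
# first step that fails.
# Arguments: <BareMetalMachineName> <resourceGroup> <subscriptionID>
restart_with_drain() (
  for action in cordon restart uncordon; do
    echo "Running $action on $1" >&2
    if [ "$action" = "cordon" ]; then
      # Drain workloads before the disruptive restart.
      az networkcloud baremetalmachine cordon \
        --evacuate "True" \
        --name "$1" \
        --resource-group "$2" \
        --subscription "$3" || exit 1
    else
      az networkcloud baremetalmachine "$action" \
        --name "$1" \
        --resource-group "$2" \
        --subscription "$3" || exit 1
    fi
  done
)
```

If the restart step fails, the helper exits before uncordon, so the BMM stays cordoned until the failure is investigated.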
Reimage a Bare Metal Machine
You can restore the runtime version on a Bare Metal Machine by executing the reimage command. The reimage action doesn't affect the tenant workload files on the Bare Metal Machine.
This process redeploys the runtime image on the target Bare Metal Machine and executes the steps to rejoin the cluster with the same identifiers.
As a best practice, ensure the Bare Metal Machine's workloads are drained using the cordon command, with evacuate set to True, before executing the reimage command.
For more best practices to follow, read through Best Practices for Bare Metal Machine Operations.
Important
Avoid write or edit actions on the node through Bare Metal Machine access.
The reimage action is required to restore Microsoft support, and any changes made to the Bare Metal Machine are lost when the node is restored to its expected state.
Warning
Don't run more than one baremetalmachine replace or reimage command at the same time for the same Bare Metal Machine (BMM) resource.
Executing replace at the same time as a reimage leaves servers in a nonoperational state.
Make sure any replace or reimage on the BMM completes fully before starting another one.
Additionally, avoid executing sequential reimage actions on a BMM that just completed a replace action unless a specified maintenance operation is being performed.
az networkcloud baremetalmachine reimage \
--name <BareMetalMachineName> \
--resource-group <resourceGroup> \
--subscription <subscriptionID>
Replace a Bare Metal Machine
Use the replace command when a server encounters hardware issues requiring a complete or partial hardware replacement.
After replacing components such as the motherboard or Network Interface Card (NIC), the MAC address of the Bare Metal Machine changes; however, the iDRAC IP address and hostname remain the same.
A replace must be executed after each hardware maintenance operation. Read through Best practices for a Bare Metal Machine replace for more details.
Warning
Don't run more than one baremetalmachine replace or reimage command at the same time for the same Bare Metal Machine (BMM) resource.
Executing replace at the same time as a reimage leaves servers in a nonoperational state.
Make sure any replace or reimage on the BMM completes fully before starting another one.
Additionally, avoid executing sequential reimage actions on a BMM that just completed a replace action unless a specified maintenance operation is being performed.
az networkcloud baremetalmachine replace \
--name <BareMetalMachineName> \
--resource-group <resourceGroup> \
--bmc-credentials password=<IDRAC_PASSWORD> username=<IDRAC_USER> \
--bmc-mac-address <IDRAC_MAC> \
--boot-mac-address <PXE_MAC> \
--machine-name <OS_HOSTNAME> \
--serial-number <SERIAL_NUMBER> \
--subscription <subscriptionID>
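Because replace takes two MAC addresses, a mistyped value is easy to submit. The following sketch checks the format locally before the command runs. is_mac_address is a hypothetical helper, and it is deliberately strict: it accepts only six hexadecimal octets separated by hyphens, the form used in this article's examples. Whether the CLI also accepts other separators isn't covered here.

```shell
# Hypothetical helper: succeed when the argument is six hexadecimal octets
# separated by hyphens, for example 00-00-5E-00-01-00.
is_mac_address() {
  printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}-){5}[0-9A-Fa-f]{2}$'
}

# Hypothetical helper: check both MAC addresses intended for a replace.
# Arguments: <IDRAC_MAC> <PXE_MAC>
check_replace_macs() {
  is_mac_address "$1" || { echo "Invalid BMC MAC address: $1" >&2; return 1; }
  is_mac_address "$2" || { echo "Invalid boot MAC address: $2" >&2; return 1; }
}
```

Running check_replace_macs with the values intended for --bmc-mac-address and --boot-mac-address, and calling replace only when it succeeds, avoids submitting a request with an obviously malformed address.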
If the replace action fails due to a hardware validation failure, the specific error or test failure is shown in the replace response, as shown in the following examples.
This information can also be found in the Activity Log for the Bare Metal Machine (Operator Nexus).
The error code and error message are included in the JSON properties of the corresponding BareMetalMachines_Replace operation.
Example 1: hardware validation fails due to invalid Baseboard Management Controller (BMC) credentials provided
$ az networkcloud baremetalmachine replace --name rack1compute02 --resource-group hostedRG --bmc-credentials password=REDACTED username=root --bmc-mac-address 00-00-5E-00-01-00 --boot-mac-address 00-00-5E-00-02-00 --machine-name RACK1COMPUTE02 --serial-number SN123435
(None) BMC login unsuccessful: Fail - Unauthorized; System health test(s) failed: [Additional logs: Server power down at end of test failed with: Unauthorized]
Code: None
Message: BMC login unsuccessful: Fail - Unauthorized; System health test(s) failed: [Additional logs: Server power down at end of test failed with: Unauthorized]
Example 2: hardware validation fails due to networking failure
$ az networkcloud baremetalmachine replace --name rack1compute02 --resource-group hostedRG --bmc-credentials password=REDACTED username=root --bmc-mac-address 00-00-5E-00-01-00 --boot-mac-address 00-00-5E-00-02-00 --machine-name RACK1COMPUTE02 --serial-number SN123435
(None) Networking test(s) failed: [NIC.Slot.6-1-1_LinkStatus] expected: up; observed: Down; [Additional logs: Link failure detected on NIC.Slot.6-1-1; Unable to perform cabling check on PCI Slot 6]
Code: None
Message: Networking test(s) failed: [NIC.Slot.6-1-1_LinkStatus] expected: up; observed: Down; [Additional logs: Link failure detected on NIC.Slot.6-1-1; Unable to perform cabling check on PCI Slot 6]
For more information about troubleshooting hardware validation failures, see Troubleshoot Hardware Validation Failure.