
SynqNet Node Failure

For information about packet errors, please see the Packet Error Counters section.

A SynqNet node failure occurs when a node on the network reaches its Fail Limit threshold. The example below explains what happens on a SynqNet network when a node reaches its Fail Limit.

The default downstream Fail Limit is reached when the Packet Error Rate Counter = 12.
The default upstream Fail Limit is reached when the Packet Error Rate Counter = 8.

The table below shows the Packet Error Rate Counter at the controller and nodes during each controller period. For simplicity, the example below assumes that the Packet Error Counter and Packet Error Rate Counter are identical.

 

Controller Period | Node 0 | Node 1 | Controller 0   | Comments
------------------|--------|--------|----------------|---------------------------------------------------
P                 | 0      | 0      | N0 = 0, N1 = 0 | All nodes receive their packets. No packet errors.

The SynqNet network is operating normally and zero packets are being lost.

The next few controller periods show a problem with the downstream flow of packets from N0 to N1. Packets are not being received by N1.

Controller Period | Node 0 | Node 1 | Controller 0   | Comments
------------------|--------|--------|----------------|----------------------------------------------------------------------
P + 1             | 0      | 3      | N0 = 0, N1 = 0 | N1 missed 3 packets.
P + 2             | 0      | 6      | N0 = 0, N1 = 0 | N1 missed 3 more packets.
P + 3             | 0      | 9      | N0 = 0, N1 = 0 | N1 missed 3 more packets.
P + 4             | 0      | 12     | N0 = 0, N1 = 0 | N1 missed 3 more packets. The default Fail Limit has been reached on N1.

The default downstream Fail Limit for each node is reached when the Packet Error Rate Counter = 12. In the example above, N1 reaches its Fail Limit after four consecutive controller periods. Once the Fail Limit has been reached, the node enters the SYNQ Lost state and disables its outputs. The controller then stops sending packets over the network because communication with N1 has been lost.

 

MPI Software Perspective

Node failure tolerance is supported with string, ring, or other topologies. If one or more nodes fail, the network will continue to operate in SYNQ mode, sending and receiving data to/from the good nodes. Typically, a node failure indicates a serious problem: for instance, a broken cable in a string topology, loss of power to a node, or a large number of packet errors. A node failure is not automatically recoverable, because the state of the node is unknown and the network timing schedule can only be transmitted to the node at initialization time. To recover a system with failed nodes, the node or cable hardware must be repaired or replaced, and then the network must be shut down and re-initialized.

Events
When any node fails, the controller generates an MPIEventTypeSYNQNET_NODE_FAILURE status/event for the network. The status can be read with mpiSynqNetStatus(...), decoding the eventMask with mpiEventMaskBitGET(eventMask, MPIEventTypeSYNQNET_NODE_FAILURE). An application can determine which nodes failed by reading the failedNodeMask (one bit per node) with mpiSynqNetStatus(...). Generation of the network node failure event is configured by setting the eventMask with mpiSynqNetEventNotifySet(...). After the SYNQNET_NODE_FAILURE event occurs, the status/event can be cleared with mpiSynqNetEventReset(...), allowing another SYNQNET_NODE_FAILURE status/event to be triggered.

If a node fails, the controller will also generate an MPIEventTypeSQNODE_NODE_FAILURE status/event for each failed node. The status can be read with mpiSqNodeStatus(...), decoding the eventMask with mpiEventMaskBitGET(eventMask, MPIEventTypeSQNODE_NODE_FAILURE). Generation of the node failure event is configured by setting the eventMask with mpiSqNodeEventNotifySet(...). After the SQNODE_NODE_FAILURE event occurs, the status/event can be cleared with mpiSqNodeEventReset(...), allowing another SQNODE_NODE_FAILURE status/event to be triggered after the network has been re-initialized.

Node Failure Action
When any node fails, the controller can generate an action (NONE, STOP, E_STOP, or ABORT) for the motors associated with the good nodes. The default action is NONE. The nodeFailureAction configuration is a member of the MPIMotorConfig{...} structure and can be set with mpiMotorConfigSet(...).

If a node fails, the controller will automatically generate an ABORT action for the motor objects that are mapped to that node. This makes it easier for users and applications to identify node problems from the motor object.

Network Shutdown
Since the SynqNet network will continue to operate in SYNQ mode even if nodes fail, the only way to shut down a network is to call mpiSynqNetShutdown(...). To re-initialize the network, use mpiSynqNetInit(...). Alternatively, mpiControlReset(...) performs a network shutdown, resets the controller, and re-initializes the network.

Examples
Here are a few possible scenarios for how an application could handle a network with failed nodes. For ring topology cases, the recovery mode is AUTO_ARM. For string topology cases, the recovery mode is DISABLED.

Ring Topology - One node fails due to loss of power

 
  1. Network traffic is automatically re-directed around the failed node. Network continues to operate in SYNQ mode.
  2. Controller generates a SYNQNET_RECOVERY event to the application, notifying that a re-direction occurred.
  3. Controller generates a SYNQNET_NODE_FAILURE and SQNODE_NODE_FAILURE status/event to the application, notifying that a node has failed.
  4. Application queries controller to determine the location of the fault.
  5. Application moves motors (on good nodes) to a safe location and disables servo control.
  6. User fixes node power.
  7. Application re-initializes network, re-starts machine operation.

String Topology - Single cable break causing node failures

 
  1. Network continues to operate in SYNQ mode, but nodes downstream from the break do not receive data. The downstream nodes eventually fail when their packet error rate failure limits are exceeded.
  2. Controller generates a SYNQNET_NODE_FAILURE and SQNODE_NODE_FAILURE status/event to the application, notifying that a node has failed.
  3. Application queries controller to determine the location of the fault.
  4. Application moves motors (on good nodes) to a safe location, disables servo control.
  5. User fixes broken cable.
  6. Application re-initializes network, re-starts machine operation.

There are several other variations. The motor's nodeFailureAction feature could be used to stop the motors that are not located on the faulted node(s), or to stop a select group of critical motors. Or, nodeFailureAction could be set to NONE, allowing the application to decide how to respond to node failures. The general concept is to keep as much of the network and as many nodes functioning as possible when failures occur, so that an application can deal with critical axes, non-critical axes, and axis relationships differently.

For example, suppose a machine has an X, Y gantry with 3 feeder axes. If the X or Y node fails, the application will want to abort the X and Y axes immediately and stop machine operation so that the problem can be fixed. But if a feeder axis fails, the application may want to continue X, Y control until the axes can be moved to a safe location and the feeder axes can be serviced.

Expected Behavior

A node will almost always receive all of its packets. An acceptable Packet Error Rate is roughly 0-1 errors per day. If you are experiencing one or more errors per hour, you should check the data integrity of your system. A high error rate is almost always caused by a faulty cable/connector or a bad connection.

Troubleshooting

 
  1. Check all cables and connectors. Ensure that no cable or connector is damaged or has bent pins. If the LEDs are off at a connection, it means that communication has been lost at that port. If the cable is properly connected at that port, change the cable.
  2. Check to make sure that each node has power and that it has not been accidentally turned off.
  3. If communication problems persist, contact the node manufacturer.

 

Copyright © 2001-2010 Motion Engineering