
SynqNet Fault Recovery

On a SynqNet network, it is important that communication between the nodes and the controller is not compromised by missing packets. If a significant number of packets are being lost, the network must be able to reroute the flow of packets so that a sufficient level of data integrity can be maintained. One of the key safety features of a Ring topology is that when packets are lost in either the upstream or downstream flow, the idle link can be used as an alternative path for sending and receiving packets.

The SynqNet Fault Recovery feature allows the network to use the idle link to correct faulty communication without having to perform an emergency shutdown of the network. It essentially gives the system a buffer of tolerance that allows the network to continue to operate, as long as the Fault Limit has not been reached. Since the presence of an idle link is required for SynqNet Fault Recovery, this feature is NOT available on networks with String topologies.

WARNING!
If the idle cable is broken, and a second cable fails, then SynqNet Fault Recovery will not be able to fully recover from the fault. SynqNet will try to recover, but some nodes will be stranded. Presently, SynqNet does not support an event to notify the application if an idle cable fails. In the meantime, an application can periodically poll the idle cable with mpiSynqNetIdleCableStatus(...) to test the connection.

The example below explains what happens on a SynqNet network when a node enters the Fault Recovery mode after its Packet Error Rate Counter has reached its Fault Limit threshold. The table below shows the Packet Error Rate Counters at the controller and nodes during each controller period. For simplicity, the example below assumes that the Packet Error Counter and Packet Error Rate Counter are identical.

The default Downstream Fault Limit is reached when the Packet Error Rate Counter = 6.
The default Upstream Fault Limit is reached when the Packet Error Rate Counter = 4.

For information about packet errors, please see the Packet Error Counters section.

Controller Period | Node 0 | Node 1 | Controller 0   | Comments
P                 | 0      | 0      | N0 = 0, N1 = 0 | All nodes receive their packets. No packet errors.

In the example above, the SynqNet network is operating normally and all packets are being received. Since no packets are being lost, the network behaves just like a network with a String topology. Although packets are sent from both ports of the controller in a Ring topology, the last node's Repeater, while OFF, blocks the packets arriving at its OUT port from being received. For example, although packets are being sent to the OUT port of N1, N1 only receives packets from its IN port. Therefore, the Repeater remains OFF at the last node in the network until the Fault Limit has been reached for a node.

The next two controller periods show a problem with the downstream flow of packets from N0 to N1. Packets are not being received by N1.

Controller Period | Node 0 | Node 1 | Controller 0   | Comments
P + 1             | 0      | 3      | N0 = 0, N1 = 0 | N1 missed 3 packets. The Repeater on N1 is OFF.
P + 2             | 0      | 6      | N0 = 0, N1 = 0 | N1 missed 3 more packets. The default Fault Limit has been reached on N1.

Once the Fault Limit has been reached (n = 6), N1 enters the Fault Recovery mode. The repeater on N1 is turned ON and N1 starts to receive its packets from its OUT port.

Controller Period | Node 0 | Node 1 | Controller 0   | Comments
P + 3             | 0      | 6      | N0 = 0, N1 = 0 | The Repeater at N1 is turned ON. N1 receives its packets from its OUT port.

In the next controller period (P + 3), zero packets are lost and the Packet Error Rate Counter for N1 remains at 6. N1 remains in Fault Recovery Mode until the system is shut down and reset.

 

MPI Software Perspective

Network fault recovery is only supported with ring topologies. If any single network connection fails, the network traffic will be automatically re-routed around the faulty connection via the idle link. The controller can be configured to notify the host application when the fault recovery occurs.


Events
When network fault recovery occurs, the controller will generate an MPIEventTypeSYNQNET_RECOVERY status/event. The status can be read with mpiSynqNetStatus(...), decoding the eventMask with mpiEventMaskBitGET(eventMask, MPIEventTypeSYNQNET_RECOVERY). The recovery event generation is configured by setting the eventMask with mpiSynqNetEventNotifySet(...). After the RECOVERY event occurs, the status/event can be cleared with mpiSynqNetEventReset(...). This will allow another SYNQNET_RECOVERY status/event to be triggered.
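
A minimal sketch of this sequence is shown below. It assumes an already-created MPISynqNet handle and approximates the argument lists of the calls named above (mpiSynqNetEventNotifySet, mpiSynqNetStatus, mpiSynqNetEventReset); the exact prototypes, header names, and error codes depend on your MPI version, so treat this as an outline rather than a drop-in implementation.

    /* Sketch: enable, detect, and clear the SYNQNET_RECOVERY status/event.
       Argument lists are assumptions based on the function names above. */
    #include "stdmpi.h"   /* MPI headers; actual include names vary by installation */

    long checkRecoveryEvent(MPISynqNet synqNet)
    {
        MPIEventMask     eventMask;
        MPISynqNetStatus status;
        long             rc;

        /* Ask the controller to generate the RECOVERY status/event. */
        mpiEventMaskCLEAR(eventMask);
        mpiEventMaskBitSET(eventMask, MPIEventTypeSYNQNET_RECOVERY);
        rc = mpiSynqNetEventNotifySet(synqNet, eventMask);        /* assumed args */
        if (rc != MPIMessageOK) return rc;

        /* Read the SynqNet status and decode the event mask. */
        rc = mpiSynqNetStatus(synqNet, &status);                   /* assumed args */
        if (rc != MPIMessageOK) return rc;

        if (mpiEventMaskBitGET(status.eventMask, MPIEventTypeSYNQNET_RECOVERY)) {
            /* A re-direction occurred. After handling it, clear the status/event
               so another SYNQNET_RECOVERY can be triggered. */
            rc = mpiSynqNetEventReset(synqNet, eventMask);         /* assumed args */
        }
        return rc;
    }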

Fault Location (Idle Cable)
The fault location can be determined using mpiSynqNetIdleCableGet(...). A network with a ring topology has one idle cable. After network initialization, the default idle cable is the connection from the last node to the controller. Since no data is transmitted across this cable, it is considered to be idle. After network fault recovery, the idle cable is the connection that had the fault.

If the faulted cable is replaced or you want to test the idle cable, use mpiSynqNetIdleCableStatus(...). This will send a special test packet across the idle cable to verify that the upstream and downstream data paths are operational. The status will report:

MPISynqNetCableStatusGOOD - communication test passed.
MPISynqNetCableStatusBAD_UPSTREAM - upstream communication test failed.
MPISynqNetCableStatusBAD_DOWNSTREAM - downstream communication test failed.
MPISynqNetCableStatusBAD - communication test failed.
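
The sketch below shows one way an application might locate the idle cable and re-test it after a repair. The output-parameter style and the use of an MPISynqNetCable type are assumptions; consult the MPI reference for the actual prototypes.

    /* Sketch: locate the idle cable and verify it after a repair.
       Output-parameter style and type usage are assumptions. */
    long testIdleCable(MPISynqNet synqNet)
    {
        MPISynqNetCable       idleCable;
        MPISynqNetCableStatus cableStatus;
        long                  rc;

        /* The faulted connection (or, by default, last-node-to-controller) is idle. */
        rc = mpiSynqNetIdleCableGet(synqNet, &idleCable);          /* assumed args */
        if (rc != MPIMessageOK) return rc;

        /* Send the special test packet across the idle cable. */
        rc = mpiSynqNetIdleCableStatus(synqNet, &cableStatus);     /* assumed args */
        if (rc != MPIMessageOK) return rc;

        switch (cableStatus) {
        case MPISynqNetCableStatusGOOD:            /* both directions passed */
            break;
        case MPISynqNetCableStatusBAD_UPSTREAM:    /* upstream test failed   */
        case MPISynqNetCableStatusBAD_DOWNSTREAM:  /* downstream test failed */
        case MPISynqNetCableStatusBAD:             /* both directions failed */
        default:
            /* Cable is still faulty: report the location to the operator. */
            break;
        }
        return rc;
    }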

Recovery Mode
The controller supports several network fault recovery modes. It can be configured with mpiSynqNetConfigGet/Set(...), using the MPISynqNetRecoveryMode enumeration:

MPISynqNetRecoveryModeDISABLED (default for string topology) - Network does not attempt to redirect network traffic around a fault.

MPISynqNetRecoveryModeSINGLE_SHOT - Network will redirect network traffic around a fault one time. A second fault will cause all nodes downstream from the fault to fail.

MPISynqNetRecoveryModeAUTO_ARM (default for ring topology) - Network will redirect network traffic around a fault each time a fault occurs. After the network traffic redirection, the controller will wait for the node(s)' upstream and downstream packet error rate counters to decrement to zero before re-arming the recovery. Then, the network will be able to respond to another fault.

Most applications will want to use the default recovery mode. If a ring topology network has marginal operational characteristics (a large number of packet errors), it might be useful to set the mode to SINGLE_SHOT or DISABLED during troubleshooting. It is easier to determine the network behavior when it is not trying to recover from faults.
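
For illustration, a hedged sketch of switching the recovery mode (for example to SINGLE_SHOT during troubleshooting) is shown below; the "recoveryMode" member name is an assumption about the MPISynqNetConfig structure.

    /* Sketch: change the network fault recovery mode via the SynqNet config.
       The "recoveryMode" member name is an assumption. */
    long setRecoveryMode(MPISynqNet synqNet, MPISynqNetRecoveryMode mode)
    {
        MPISynqNetConfig config;
        long             rc;

        rc = mpiSynqNetConfigGet(synqNet, &config);    /* read current settings */
        if (rc != MPIMessageOK) return rc;

        config.recoveryMode = mode;                    /* e.g. MPISynqNetRecoveryModeSINGLE_SHOT */

        return mpiSynqNetConfigSet(synqNet, &config);  /* write settings back */
    }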

Examples
Here are a few possible scenarios for how an application could handle a faulted network with a ring topology. In these cases, the recovery mode is AUTO_ARM.

Single cable break causing a network fault.

 
  1. Network traffic is automatically re-directed around the broken cable. Network continues to operate in SYNQ mode.
  2. Controller generates a SYNQNET_RECOVERY status/event to the application, notifying that a re-direction occurred.
  3. The application queries the controller to determine the location of the fault.
  4. The application decides how to recover.

    Recovery Option A:
    a) Application moves motors to a safe location and disables servo control.
    b) Application notifies the user, who fixes the broken network cable.
    c) Application re-initializes by performing a network shutdown and initialization or a controller reset. Re-start machine operation.

    Recovery Option B (see the sketch after this list):
    a) Application notifies the user, who fixes the broken network cable.
    b) Application verifies the cable is good using mpiSynqNetIdleCableStatus(...).
    c) Controller automatically re-arms fault recovery after the packet error rate counters decrement to zero. The network is ready for another recovery.
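
The sketch below outlines Recovery Option B in code. The helper functions promptUserToFixCable and sleepMilliseconds are hypothetical placeholders, and the MPI argument lists are assumptions, as in the earlier sketches.

    /* Sketch of Recovery Option B: locate the fault, wait for the repair,
       verify the cable, and let AUTO_ARM re-arm recovery on its own.
       promptUserToFixCable() and sleepMilliseconds() are hypothetical helpers. */
    void handleRecoveryOptionB(MPISynqNet synqNet)
    {
        MPISynqNetCable       idleCable;
        MPISynqNetCableStatus cableStatus;

        /* 1. The faulted connection is now the idle cable. */
        mpiSynqNetIdleCableGet(synqNet, &idleCable);               /* assumed args */

        /* 2. Notify the user and wait for the broken cable to be replaced. */
        promptUserToFixCable(&idleCable);                          /* hypothetical */

        /* 3. Verify the repaired cable before declaring the network healthy. */
        for (;;) {
            mpiSynqNetIdleCableStatus(synqNet, &cableStatus);      /* assumed args */
            if (cableStatus == MPISynqNetCableStatusGOOD) break;
            sleepMilliseconds(500);                                /* hypothetical */
        }

        /* 4. With AUTO_ARM, the controller re-arms fault recovery automatically
              once the packet error rate counters decrement to zero. */
    }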

Expected Behavior

A node will almost always receive ALL of its packets. An acceptable Packet Error Rate is roughly 0-1 errors per day. However, if you are experiencing one or more errors per hour, you should definitely check the data integrity of your system. A high error rate is almost always caused by a faulty cable/connector or a bad connection.

Troubleshooting

 
  1. Check all cables and connectors. Ensure that no cable or connector is damaged or has bent pins. If the LEDs are off at a connection, it means that communication has been lost at that port. If the cable is properly connected at that port, change the cable.
  2. Check to make sure that each node has power and that it has not been accidentally turned off.
  3. If communication problems persist, contact the node manufacturer.

 
