SynqNet HotReplace

Introduction

The SynqNet HotReplace feature allows one or more consecutive nodes to be shut down, serviced, and then reattached to the system without affecting the operation of the other nodes. The HotReplace feature is supported in the 03.04.00 software release and later.

The SynqNet HotReplace feature is especially useful for modular systems where modules are occasionally taken offline for regular maintenance or replacement. Many high-performance systems require the entire system to be shut down in order to service a module. On a SynqNet network, however, you can use the HotReplace feature to safely take a module offline, service it, and then replace it while the rest of the system remains fully operational.

Overview

HotReplace begins when an application commands the shutdown of the node(s) to be serviced by calling the mpiSynqNetNodeShutdown(...) method. Once the node or set of contiguous nodes has been shut down in software, it may be disconnected or powered off while the remaining nodes maintain communication over the SynqNet network. An application may also be customized so that the disconnected node(s) are bypassed while the remaining nodes perform controlled motion. After service or repair of the disconnected node(s) is complete, some or all of the nodes may be reconnected. The reconnected node(s) are powered up, and the application restarts them by calling the mpiSynqNetNodeRestart(...) method. The MPI and firmware rediscover all replaced nodes, restore their original configuration, and return the SynqNet network to normal operation. An application must call both mpiSynqNetNodeShutdown(...) and mpiSynqNetNodeRestart(...), since the MPI and firmware have no way of knowing when a node is about to be removed for service or has been returned to service.
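Because only a single node or a set of contiguous nodes may be shut down together, an application may want to check its selection before commanding the shutdown. The following is a minimal sketch of such a check. The helpers node_range_mask and mask_is_contiguous are hypothetical and are not part of the MPI, and representing the selection as a 32-bit mask (bit n for node n) is an assumption; the argument list of mpiSynqNetNodeShutdown(...) is not given here, so the sketch does not call it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not an MPI call): build a mask with one bit per node
 * for the contiguous node range [first, last]. Returns 0 for an invalid range. */
uint32_t node_range_mask(unsigned first, unsigned last)
{
    if (first > last || last >= 32u) {
        return 0u;
    }
    unsigned width = last - first + 1u;
    uint32_t ones = (width == 32u) ? UINT32_MAX
                                   : ((UINT32_C(1) << width) - 1u);
    return ones << first;
}

/* Hypothetical helper: true when the set bits form one unbroken run,
 * i.e. the selected nodes are consecutive in network order. */
bool mask_is_contiguous(uint32_t mask)
{
    if (mask == 0u) {
        return false;
    }
    uint32_t lowest = mask & (~mask + 1u); /* isolate the lowest set bit */
    uint32_t sum = mask + lowest;          /* carry ripples through the run */
    return (mask & sum) == 0u;             /* leftover bits mean a second run */
}
```

For example, node_range_mask(1, 3) selects nodes 1 through 3, while a mask that selects nodes 1 and 3 but not node 2 is rejected by mask_is_contiguous.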

HotReplace does not allow changes to the original network topology. Any restarted node must match the node originally discovered by the controller in node type, FPGA type, location within the network, and cable lengths. Adding more nodes, or nodes of a different type, requires a re-initialization of the full network. This restriction is necessary because SynqNet calculates timing schedules based on these factors and optimizes the system to match the exact packet sizes and propagation delays of each particular network. These schedules cannot be changed while the controller is in normal cyclic operation.
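The matching rule can be illustrated with a small comparison routine. This is only a sketch: the NodeRecord structure, its field names, and node_matches_original are hypothetical and do not correspond to MPI types, and the cable-length tolerance is passed in by the caller rather than read from any controller configuration.

```c
#include <stdbool.h>

/* Hypothetical record of what was discovered for one node. */
typedef struct {
    unsigned node_type;  /* identifier of the node model */
    unsigned fpga_type;  /* identifier of the runtime FPGA image type */
    unsigned position;   /* location of the node in network order */
    double   cable_m;    /* measured cable length to the node, in meters */
} NodeRecord;

/* True when the replacement matches the originally discovered node in
 * type, FPGA type, position, and cable length (within tol_m meters). */
bool node_matches_original(const NodeRecord *orig,
                           const NodeRecord *repl,
                           double tol_m)
{
    double diff = orig->cable_m - repl->cable_m;
    if (diff < 0.0) {
        diff = -diff; /* absolute difference without needing <math.h> */
    }
    return orig->node_type == repl->node_type
        && orig->fpga_type == repl->fpga_type
        && orig->position  == repl->position
        && diff <= tol_m;
}
```

A replacement that differs only in a slightly shorter cable passes with a tolerance of 1.0, while a node of a different type or in a different position fails regardless of cable length.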

HotReplace is supported for ring, string, and dual string topologies. See SynqNet Topologies. Nodes may be replaced with different nodes of identical type (for example, a broken node may be swapped for a working one of the same type). Restart is allowed for subsets of the original system (e.g. removing two modules and then replacing and restarting one), but allowed subsets must grow in network order. For example, assume a full system is a ring of nodes 0, 1, 2, and 3. If nodes 1, 2, and 3 are disconnected, then node 1 can be restarted by itself, but node 2 requires the presence of node 1 since it is part of the original topology. The typical tolerance for "identical" cable lengths is +/- 1 meter; however, greater tolerances can be configured.
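The "grow in network order" rule for a ring can be sketched as a neighbor check. The sketch assumes a ring wired as controller, node 0, node 1, and so on up to the last node, which connects back to the controller, so that the first and last nodes each have the controller as one neighbor. The function ring_node_can_restart and the active-node mask (bit n set when node n is running) are hypothetical and not part of the MPI.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: a stopped node in a ring may be restarted only if
 * at least one of its original neighbors (the controller counts as one)
 * is currently running. */
bool ring_node_can_restart(uint32_t active_mask,
                           unsigned node,
                           unsigned node_count)
{
    if (node_count == 0u || node_count > 32u || node >= node_count) {
        return false; /* not a node of this network */
    }
    if ((active_mask >> node) & 1u) {
        return false; /* node is already running */
    }

    /* Neighbor toward the start of the ring: controller for node 0. */
    bool before_ok = (node == 0u)
                  || (((active_mask >> (node - 1u)) & 1u) != 0u);

    /* Neighbor toward the end of the ring: controller for the last node. */
    bool after_ok = (node == node_count - 1u)
                 || (((active_mask >> (node + 1u)) & 1u) != 0u);

    return before_ok || after_ok;
}
```

With four nodes and only node 0 running (mask 0x1), the check accepts node 1 and rejects node 2, matching the example above. Under the wiring assumption it also accepts node 3, which would be reached from the other end of the ring.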

Fault recovery is only supported for the full original system. Fault recovery requires a full ring for redundant data paths. Restarting a subset of the original ring results in a dual string topology. Adding a cable to close the ring without restarting the full set of nodes would be a topology change (it changes the propagation delays) and is therefore not supported.
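An application that tracks which nodes are running could use a check like the one below before relying on fault recovery. This is a sketch only: ring_fault_recovery_available and the active-node mask (bit n set when node n is running) are hypothetical and not part of the MPI.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: fault recovery needs the full original ring, so it
 * is available only when every one of the node_count nodes is running. */
bool ring_fault_recovery_available(uint32_t active_mask, unsigned node_count)
{
    if (node_count == 0u || node_count > 32u) {
        return false;
    }
    uint32_t full = (node_count == 32u) ? UINT32_MAX
                                        : ((UINT32_C(1) << node_count) - 1u);
    return (active_mask & full) == full;
}
```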

Node download is an “offline” operation and is not supported during normal cyclic operation. Therefore, it is not possible to update the node FPGA during a restart. Thus, if restart is used to replace a failed node, the node must already have the correct runtime FPGA image loaded onto the node.

See the following diagrams for correct and incorrect wiring and operation.

Example 1: Servicing One Module

Step 1: Initial Setup

Step 2: Module B is removed for servicing.

Step 3: Module B is reconnected and restarted.

Example 2: Servicing Two or More Contiguous Modules

Step 1: Initial Setup

Step 2: Modules B, C, and D are removed for service.

Step 3: Module B is reconnected and restarted.

Step 4: Module D is reconnected and restarted.

Step 5: Module C is reconnected and restarted.

It is also possible to restart modules B, C, and D all at the same time. The above example was chosen to show that modules may be reconnected to either string, but must match the original topology.

Example 3: Cannot Add Additional Nodes with HotReplace

Step 1: Initial Setup

Step 2: Cannot add additional nodes with HotReplace.
Adding new nodes would change the SynqNet topology and timing, so it is not supported by HotReplace. A full network reset is required to add nodes.

Example 4: Fault Recovery is Not Supported with a Removed Module

Step 1: Initial Setup

Step 2: Module B is removed for servicing.
Fault Recovery is not possible because there is no redundant path (idle link) available.

Step 3: Cannot connect Module A to C.
Attempting to connect Module A to C would change the SynqNet topology (timing) and is not allowed.

Example 5: Removing Non-adjacent Modules Disconnects the Modules Between Them

Step 1: Initial Setup

Step 2: Remove Modules B and D while attempting to use Module C.
If Modules B and D are disconnected, communication to Module C will be lost. In order to communicate with Module C, either Module B or D must remain connected.
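The loss of communication can be modeled by walking the ring from both controller ports and stopping at the first missing node on each side. The sketch below assumes a ring wired as controller, node 0, up to the last node, and back to the controller. The function ring_reachable_mask and the connected-node mask (bit n set when node n is physically connected) are hypothetical and not part of the MPI.

```c
#include <stdint.h>

/* Hypothetical model: return the set of nodes the controller can still
 * reach in a ring, given which nodes are physically connected. */
uint32_t ring_reachable_mask(uint32_t connected_mask, unsigned node_count)
{
    uint32_t reachable = 0u;

    if (node_count == 0u || node_count > 32u) {
        return 0u;
    }

    /* Walk from the start of the ring until the first missing node. */
    for (unsigned i = 0u; i < node_count; ++i) {
        if (((connected_mask >> i) & 1u) == 0u) {
            break;
        }
        reachable |= UINT32_C(1) << i;
    }

    /* Walk from the end of the ring until the first missing node. */
    for (unsigned i = node_count; i > 0u; --i) {
        unsigned n = i - 1u;
        if (((connected_mask >> n) & 1u) == 0u) {
            break;
        }
        reachable |= UINT32_C(1) << n;
    }

    return reachable;
}
```

Assuming, for illustration, a five-node ring in which Modules A through E are nodes 0 through 4, removing B and D gives a connected mask of 0x15 and a reachable mask of 0x11: Module C is still connected but cannot be reached. Leaving B connected (mask 0x17) makes C reachable again.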

Example 6: Cannot Restart Non-adjacent Modules

Step 1: Initial Setup

Step 2: Modules B, C, and D are removed for servicing.

Step 3: Cannot restore Module C by itself.
In order to communicate with Module C, you must first restore either Module B or D.