Data Center Network Automation

The goal is to automate the data center network using open-source software.

1. The network switches use ZTP (Zero Touch Provisioning).  The management port of the switch uses DHCP to obtain an IP address, and a DHCP option provides the address of a TFTP server from which the switch fetches its configuration.
2. Cisco, Juniper, and Arista switches support gRPC.  Use gRPC to collect telemetry from the switches when debugging switch issues.  SONiC switches also support gRPC and gNMI for telemetry data.  Note that no telemetry can catch microbursts.
3. Use ONOS, which has a web UI that shows the network topology (from LLDP), learns devices, learns BGP, and supports clustering, HA, target configuration update, etc.  For its northbound interface, ONOS supports a REST API, gRPC, and a CLI.  ONOS uses OpenConfig and YANG models.  gRPC with ONOS allows gNMI for network management and gNOI for network operations.  Routers cannot be learnt from LLDP.  LLDP-MED has a capabilities knob that discovers routers.
4. gNMI provides a Capabilities RPC, which can be extended to support any node (e.g., server, storage).  The set of OpenConfig models behind it is very extensive (https://github.com/openconfig/public/tree/master/release/models).  YANG allows new capabilities to be defined.  gNOI can be extended for new network operations.  An IETF draft is an excellent document for learning gNMI: https://tools.ietf.org/html/draft-openconfig-rtgwg-gnmi-spec-01.  No IETF draft exists for gNOI.
5. gNOI supports upgrading the firmware of a node.  See the SetPackage() API in gNOI.
6. If any of Puppet, Chef, CFEngine, or Ansible is planned, use Ansible, because only Ansible uses an agentless architecture.  Ansible supports managing servers and can be extended to manage networking nodes.  Ansible takes care of firmware versions and firmware upgrades of devices.  Any network automation and management also requires integration with Ansible.
7. It is desirable to manage servers and network switches using the same tools.  Why not use gRPC?  Strive to manage storage with gRPC as well.
8. Most often, data center issues occur during a network management operation (70% of Google's failures; see slide 21 at https://tinyurl.com/y8fssfdx).  For the long term, consider using Intent to manage the network.
9. Monitoring the WAN link is important: monitor link bandwidth and routing flaps.
10. Simplify the network: use IPv6 internally with NAT64 at the network edge.  Open-source NAT64 (stateless and stateful) software exists.
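
To make item 1 concrete, here is a minimal sketch of how the TFTP details can ride in DHCP options. It assumes the standard option codes from RFC 2132 (66 for the TFTP server name, 67 for the bootfile name); individual switch vendors may use other options. The function names, the server address, and the file name are made up for illustration.

```python
# DHCP options are type-length-value records (RFC 2132):
# one byte of option code, one byte of length, then the value.
OPT_PAD = 0                 # single-byte padding, no length field
OPT_TFTP_SERVER_NAME = 66   # where the switch fetches its config from
OPT_BOOTFILE_NAME = 67      # which file to fetch
OPT_END = 255               # marks the end of the options field


def encode_options(options: dict) -> bytes:
    """Serialize {code: value_bytes} into a DHCP options field."""
    out = bytearray()
    for code, value in options.items():
        if not 0 < code < 255:
            raise ValueError(f"invalid option code {code}")
        if len(value) > 255:
            raise ValueError(f"option {code} value too long")
        out += bytes([code, len(value)]) + value
    out.append(OPT_END)
    return bytes(out)


def decode_options(data: bytes) -> dict:
    """Parse a DHCP options field back into {code: value_bytes}."""
    options = {}
    i = 0
    while i < len(data):
        code = data[i]
        if code == OPT_END:
            break
        if code == OPT_PAD:
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated option header")
        length = data[i + 1]
        value = data[i + 2 : i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated option value")
        options[code] = value
        i += 2 + length
    return options


def ztp_settings(data: bytes) -> tuple:
    """Return (tftp_server, bootfile) as a ZTP client might read them."""
    opts = decode_options(data)
    server = opts[OPT_TFTP_SERVER_NAME].decode("ascii")
    bootfile = opts[OPT_BOOTFILE_NAME].decode("ascii")
    return server, bootfile
```

For example, `ztp_settings(encode_options({66: b"192.0.2.10", 67: b"ztp/switch.cfg"}))` gives back the server and file name the switch would then request over TFTP.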
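
For the gNMI items (2 through 4), it helps to see what a gNMI path looks like once it is broken down. gNMI addresses data as a list of path elements, each with a name and optional keys. The sketch below is a simplified parser for the common string form; it ignores escaping rules, and the function names and the interface name are illustrative only, not part of any gNMI library.

```python
import re

# One path element: a name followed by zero or more [key=value] selectors.
_ELEM_RE = re.compile(r"^(?P<name>[^\[\]/]+)(?P<keys>(\[[^\[\]=]+=[^\[\]]*\])*)$")
_KEY_RE = re.compile(r"\[([^\[\]=]+)=([^\[\]]*)\]")


def split_path(path: str) -> list:
    """Split on '/' characters that are not inside [...] key selectors."""
    parts, buf, depth = [], [], 0
    for ch in path:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "/" and depth == 0:
            if buf:
                parts.append("".join(buf))
                buf = []
        else:
            buf.append(ch)
    if buf:
        parts.append("".join(buf))
    return parts


def parse_path(path: str) -> list:
    """Turn a path string into a list of (name, keys) path elements."""
    elems = []
    for part in split_path(path):
        match = _ELEM_RE.match(part)
        if match is None:
            raise ValueError(f"malformed path element: {part!r}")
        keys = dict(_KEY_RE.findall(match.group("keys")))
        elems.append((match.group("name"), keys))
    return elems
```

Calling `parse_path("/interfaces/interface[name=Ethernet1/1]/state/counters")` yields four elements, with the `name` key attached to the `interface` element. A real client would build the equivalent protobuf `Path` message instead of tuples.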
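
For item 9, link bandwidth is usually derived from interface octet counters sampled at two points in time. The sketch below assumes you already have the counter values (from telemetry, SNMP, or similar); the function name and parameters are hypothetical, and counter wrap-around is handled with modular arithmetic.

```python
def link_utilization(prev_octets: int, curr_octets: int,
                     interval_s: float, link_bps: float,
                     counter_bits: int = 64) -> float:
    """Fraction of link capacity used between two counter samples.

    prev_octets / curr_octets: octet counter at the start and end of the interval
    interval_s: seconds between the two samples
    link_bps: link capacity in bits per second
    counter_bits: counter width, so a wrapped counter still gives the right delta
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    # Modulo handles a single wrap of the counter between samples.
    delta_octets = (curr_octets - prev_octets) % (1 << counter_bits)
    bits_per_second = delta_octets * 8 / interval_s
    return bits_per_second / link_bps
```

For instance, 375,000,000 octets in 30 seconds on a 1 Gb/s link works out to 100 Mb/s, or 10% utilization. Alerting thresholds on top of this are a local policy choice.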
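
For item 10, NAT64 maps an IPv4 address into IPv6 by embedding it in a prefix. The sketch below assumes the well-known prefix 64:ff9b::/96 from RFC 6052 and handles only /96 prefixes; a deployment may use its own network-specific prefix. The function names are illustrative.

```python
import ipaddress

# Well-known NAT64 prefix from RFC 6052.
WELL_KNOWN_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def synthesize(ipv4: str, prefix=WELL_KNOWN_PREFIX) -> ipaddress.IPv6Address:
    """Embed an IPv4 address in the low 32 bits of a /96 NAT64 prefix."""
    if prefix.prefixlen != 96:
        raise ValueError("this sketch only handles /96 prefixes")
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address(int(prefix.network_address) | int(v4))


def extract(ipv6: str, prefix=WELL_KNOWN_PREFIX) -> ipaddress.IPv4Address:
    """Recover the IPv4 address from a /96 NAT64-synthesized IPv6 address."""
    if prefix.prefixlen != 96:
        raise ValueError("this sketch only handles /96 prefixes")
    addr = ipaddress.IPv6Address(ipv6)
    if addr not in prefix:
        raise ValueError(f"{addr} is not inside {prefix}")
    return ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)
```

With this mapping, the IPv4 host 192.0.2.33 is reached from the IPv6-only side as `synthesize("192.0.2.33")`, and `extract()` reverses it at the edge.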
