Why I Joined Rafay

Recently I made the decision to join Rafay Systems. I had been in Enterprise IT for over two decades (all in networking), and most recently at multicloud networking pioneer Aviatrix Systems. So what made me want to join Rafay? In a nutshell – application modernization.

Although Multicloud Networking has grown to the point where Gartner now has a formal definition for the Multicloud Networking Software market, it’s important to remember that networking will always need to respond to the needs of modern application development. It’s always playing catch-up with the app.

I had always been fascinated by technologies that enable Enterprises to build applications with greater agility, whether they are in the cloud, for IoT, or for 5G. Containerization provides exactly this, along with other benefits such as:

  • Continuous integration, development, and deployment
  • Loosely coupled microservices
  • Cloud and OS distribution portability

However, what containerization, alone, does not offer is:

  • Load Balancing
  • Automated rollouts and rollbacks
  • Self-healing

This is where Kubernetes fits in. Kubernetes (k8s for short) is a framework for achieving all of the above goals and more to build modern applications.
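
To make the self-healing idea concrete, here is a minimal Python sketch of a reconciliation loop, the general pattern Kubernetes controllers follow: compare the desired state with the observed state and act on the difference. It is a toy model under stated assumptions; the `reconcile` function and the `web-N` pod names are hypothetical, and nothing here calls a real Kubernetes API.

```python
import itertools

def reconcile(desired_replicas, running_pods, name_counter):
    """Return the pod set after one reconciliation pass.

    Toy model: starts replacement pods when too few are running and
    stops surplus pods when too many are.
    """
    pods = set(running_pods)
    while len(pods) < desired_replicas:
        pods.add(f"web-{next(name_counter)}")  # start a replacement pod
    while len(pods) > desired_replicas:
        pods.remove(max(pods))                 # stop a surplus pod (arbitrary pick)
    return pods

counter = itertools.count(1)
pods = reconcile(3, set(), counter)  # initial rollout: web-1, web-2, web-3
pods.discard("web-2")                # simulate a crashed pod
pods = reconcile(3, pods, counter)   # self-healing: a new pod replaces it
print(sorted(pods))                  # ['web-1', 'web-3', 'web-4']
```

The operator only declares that three replicas should exist; the loop works out what to do after a failure.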

However, k8s has a very steep learning curve. Here are the components of a Kubernetes cluster:

None of these components are optional! At a small scale (such as POC level), managing this complexity is hard enough. However, when an Enterprise decides to take the plunge, they often find themselves falling down a slippery slope to a bottomless pit. Here are just a few reasons why:

  1. Multicloud reality – While each of the CSPs has its own flavor of managed Kubernetes service, none has an incentive to support multicloud. I’ve previously written about what the big deal with multicloud networking is. Taking a step back, Enterprises face the same challenge with operationalizing apps in multiple clouds. How do you perform lifecycle management of all cluster types across all clouds, be they private or public?
  2. Lack of centralized policy management controls – While there are native k8s constructs for network and security policy, they lack unified definition and enforcement across fleets of clusters. How do you configure enterprise-grade policies that can be enforced across all Kubernetes infrastructure while allowing for centralized detection and reporting of policy violations?
  3. Limited Role-Based Access Control (RBAC) – The kubectl CLI tool does not provide RBAC by default. Executed commands are not logged by user account, and generally speaking, kubectl is difficult to access outside firewalls. Moreover, using it to manage entire fleets is cumbersome and error-prone. How do you ensure that developer, QA, DevOps, and Ops/SRE teams have the right access based on their roles and responsibilities?

Here’s how Rafay solves the above problems:

  1. Lifecycle Management of any Kubernetes cluster type, be it Public Cloud (EKS, AKS, or GKE) or Private Cloud On-Premises. Rafay provides a single pane of glass for Operations teams to deploy, manage, and upgrade all of an Enterprise’s Kubernetes clusters across all environments from a single console.
  2. Centralized governance through cluster Blueprints. These ensure that clusters are always in compliance with company policies. Blueprints allow centralized configurations for cluster standardization that can encompass security policies; software add-ons such as service mesh, ingress controllers, monitoring, and logging; and backup and restore strategies.
  3. Zero-Trust Access. This service enables controlled, audited access for developers, SREs, and automation systems to the Kubernetes infrastructure. It integrates tightly with enterprise-grade RBAC/SSO solutions and is continuously validated for security configuration and posture to ensure compliance.

These are just a few of the rich suite of turnkey services that the Rafay Kubernetes Operations Platform provides.

Networking will always hold a special place in my heart and I’ll still get some of that exposure at Rafay. However, it will be less about BGP AS Path Prepend and more about CNI plugins.

At the end of the day, technology features and benefits are one thing, but what really excites me is what app modernization ultimately means for organizations. We all use many of these modern apps every day. Enterprises build them for a number of critical business reasons, such as to serve customers, leverage cloud computing, and better compete in the market.

To date, Kubernetes has been more of a hurdle than an enabler. Rafay’s goal is to change that and help make Kubernetes the accelerator to modernization that it was intended to be. And that movement, to me, is worth joining.

I’m absolutely thrilled to begin my journey in the exciting world of enterprise-grade Kubernetes operations management with Rafay!

Introducing ACE Cloud Operations

Recently Aviatrix developed a new course in the Aviatrix Certified Engineer (ACE) program. Aviatrix Certified Engineer – Multi-Cloud Network Operations (or ACE Cloud Ops for short) is geared towards cloud operations practitioners who need to successfully run, operate, and manage business-critical Day-2 workloads in the cloud.

The ACE program recently announced its 10,000th certified engineer. That’s a phenomenal achievement considering our stretch goal for the year 2020 was only 500. It’s amazing how COVID-19 has resulted in expanding our reach to hundreds of students per week.

ACE Cloud Ops takes a unique view on operating cloud infrastructure, which is necessarily different from operating on-prem infrastructure.

Operations in the On-Prem World

In the On-prem world, enterprises own the underlay. They have full control over traffic patterns and have a familiar toolkit regardless of what vendor they use on-prem.

Of course, some tools, such as SNMP, died away, but ICMP-based tools such as Ping and Traceroute are still going strong 40 years after RFC 792. IP doesn’t go away when you move to the cloud and neither should the network engineering toolkit.

Key skills for Infrastructure Operations engineers include:

  • Hardware (knowledge of cables, transceivers, switches, routers, racks, real estate, physical security, power, cooling)
  • Layer 2 (Spanning Tree is the worst use of an Operations Engineer’s time)
  • OSPF, BGP
  • Repeatability achieved by scripting tools such as Expect (which is really screen-scraping), Shell, Perl, Python (still invaluable). This is not true automation.

Capacity planning in the on-prem world often involves ordering the right number of spares to plan for outages, so that there is some form of high availability, although it does result in higher RPOs and RTOs.

We all know the financial benefits (when done well) of moving apps to the cloud. But while the cloud offers great agility for Developers (you can spin up a database within minutes), networking has been slow to catch up. Moreover, as we see a rapid shift towards multi-cloud, Operations teams are left on their own without guidance.

Operations in the Cloud World

Operations engineers have a harder time doing their job because of the lack of toolsets afforded to them by Cloud Service Providers (CSPs). Each CSP has proprietary tools that are intended to keep their customers locked into their cloud. Moreover, networking is not a source of revenue for CSPs. They don’t make networking easy and their networking tools are, simply put, not enterprise-ready. 

For example, consider what it takes just to view a route table in Azure. An intuitive approach would be to list the routes from the VNet, or at least to have a direct link to them from there. However, you would be mistaken to think that way.

Instead, buried in a list of connected devices in that VNet, you have to select the appropriate NIC, which may have an obscure ID.

Next, you have to select an even more obscure option called ‘Effective Routes’.

Only then can you see the routes.

It is a very clunky approach to what is a routine task in the On-prem world. Of course the problem grows exponentially when you have to deal with the oddities of each cloud once the enterprise goes multi-cloud. Each CSP abandons the networking toolkit and offers its platform as a black box to Operations teams.

When moving to the cloud, an Operations Engineer must have these new skills at a minimum:

  • Agile mindset
  • Infrastructure as Code (read Terraform)
  • CI/CD
  • VCS
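
For readers coming from scripting, the core idea behind Infrastructure as Code can be sketched in a few lines of Python: you declare the desired state, and the tool computes the actions needed to get there. This is only an illustration of the declarative approach, not how Terraform is implemented; the `plan` function and the resource names are hypothetical.

```python
def plan(desired, current):
    """Compute the actions needed to move `current` to `desired`.

    Both arguments map a resource name to a dict of attributes.
    """
    actions = []
    for name in sorted(desired.keys() | current.keys()):
        if name not in current:
            actions.append(("create", name))
        elif name not in desired:
            actions.append(("delete", name))
        elif desired[name] != current[name]:
            actions.append(("update", name))
    return actions

# Hypothetical resources, for illustration only.
desired = {
    "vpc-prod": {"cidr": "10.0.0.0/16"},
    "subnet-a": {"cidr": "10.0.1.0/24"},
}
current = {
    "vpc-prod": {"cidr": "10.0.0.0/16"},
    "subnet-a": {"cidr": "10.0.2.0/24"},
    "subnet-old": {"cidr": "10.0.9.0/24"},
}
for action, name in plan(desired, current):
    print(action, name)
# update subnet-a
# delete subnet-old
```

Because the desired state lives in a file, it can be reviewed and versioned like any other code, which is where CI/CD and VCS come in.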

Capacity planning takes place with cloud-native principles, such as elasticity and auto-scaling. It requires a new way of thinking, not just for Developers, but also for Operations teams. 
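
As a rough illustration of how auto-scaling replaces ordering spares, here is a minimal Python sketch of a proportional scaling rule, similar in spirit to the formula documented for the Kubernetes Horizontal Pod Autoscaler. The function name, the load units, and the default limits are assumptions made for this example, not any CSP's actual policy.

```python
import math

def desired_capacity(current_instances, observed_load, target_load,
                     min_instances=1, max_instances=10):
    """Proportional scaling: size the fleet so per-instance load nears the target."""
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    wanted = math.ceil(current_instances * observed_load / target_load)
    # Clamp to the configured floor and ceiling.
    return max(min_instances, min(max_instances, wanted))

print(desired_capacity(4, observed_load=90, target_load=60))  # 6 (scale out)
print(desired_capacity(4, observed_load=15, target_load=60))  # 1 (scale in)
```

Capacity becomes a function of measured load and a target, bounded by limits you choose, rather than a one-time hardware purchase.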

ACE Cloud Ops

The ACE Cloud Ops course better equips Cloud Operations teams to run a multi-cloud network in their daily jobs. It builds on the immensely popular ACE program with some of the most common use cases we see our customers face when operating in any cloud:

  • How to Ensure Business Continuity with an Enterprise-class Transit Solution
  • How to Strengthen Compliance and Audit Initiatives by providing Monitoring and Troubleshooting for Cloud Security Appliances
  • How to Efficiently Connect Remote Sites to Cloud
  • How to Improve your Cloud Egress Security posture
  • Best Practices for Platform Operations Management
  • DevOps for Network Engineers

There are also hands-on labs focused on break-fix scenarios that are based on this topology:

The source code of the Terraform that built this topology is here.

ACE Associate is a prerequisite for ACE Cloud Ops.

Submit interest for taking ACE Cloud Ops here.