Category Archives: AWS

Announcing the Rafay Certification program

Today I’m extremely happy to announce the launch of the Rafay Certification program – the industry’s first and only multi-cloud Kubernetes operations certification. This is a unique program for platform teams, infrastructure engineers, SREs, and application developers to develop competencies in application modernization using Kubernetes.

Let’s face it. Kubernetes is difficult! Enterprises are finding it difficult to translate the skills learned in the Certified Kubernetes Administrator (CKA) exam offered by CNCF to large scale environments. It’s hard enough managing a single cluster. To manage hundreds of clusters with enterprise-grade best practices along with governance and security can be a daunting task.

But don’t hate k8s. This is where Rafay helps. And this is where the Rafay Certification program will help. It provides ongoing education to enable customers in their digital transformation initiatives by gaining rapid efficiencies from Kubernetes.

We have partnered with Credly by Pearson to build this program. Credly provides a digital badging platform that comprises 95% of the top IT certifications. Many of our customers already use Credly to flaunt their hard-earned accomplishments with other vendors. The Rafay Certified Associate is the latest entry there.

To learn more, check out the blog piece I wrote for Rafay as well as general information on the Rafay Certification program itself.


Why I Joined Rafay

Recently I made the decision to join Rafay Systems. I had been in Enterprise IT for over two decades (all in networking), and most recently at multicloud networking pioneer Aviatrix Systems. So what made me want to join Rafay? In a nutshell – application modernization.

Although Multicloud Networking has grown to the point where Gartner now has a formal definition for the Multicloud Networking Software market, it’s important to remember that networking will always need to respond to the needs of modern application development. It’s always playing catch-up with the app.

I had always been fascinated by technologies that enable Enterprises to build applications with greater agility, whether they are in the cloud, for IoT, or for 5G. And containerization provides exactly this along with other benefits, such as:

  • Continuous integration, development, and deployment
  • Loosely coupled microservices
  • Cloud and OS distribution portability

However, what containerization, alone, does not offer is:

  • Load Balancing
  • Automated rollouts and rollbacks
  • Self-healing

This is where Kubernetes fits in. Kubernetes (k8s for short) is a framework for achieving all of the above goals and more to build modern applications.
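To make the self-healing idea concrete, here is a minimal sketch of a reconcile loop: compare the desired state with the actual state and correct any drift. The ToyCluster class and its method names are hypothetical and exist only for illustration. This is not the Kubernetes API, just the general control-loop pattern that its controllers are built around.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a cluster; not a real Kubernetes API."""
    desired_replicas: int
    running: list = field(default_factory=list)
    _next_id: int = 0

    def start_replica(self) -> None:
        # Give every new replica a unique, illustrative name.
        self.running.append(f"pod-{self._next_id}")
        self._next_id += 1

    def reconcile(self) -> int:
        """Move actual state toward desired state; return replicas changed."""
        changed = 0
        while len(self.running) < self.desired_replicas:
            self.start_replica()
            changed += 1
        while len(self.running) > self.desired_replicas:
            self.running.pop()
            changed += 1
        return changed


cluster = ToyCluster(desired_replicas=3)
cluster.reconcile()              # initial rollout: starts 3 replicas
cluster.running.remove("pod-1")  # simulate a crashed replica
healed = cluster.reconcile()     # self-healing: starts one replacement
print(healed, sorted(cluster.running))
```

The operator only declares how many replicas should exist; the loop works out what has to change to get there.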

However, k8s has a very steep learning curve. Here are the components of a Kubernetes cluster:

None of these components are optional! At a small scale (such as POC level), managing this complexity is hard enough. However, when an Enterprise decides to take the plunge, they often find themselves falling down a slippery slope to a bottomless pit. Here are just a few reasons why:

  1. Multicloud reality – While each of the CSPs has its own flavor of managed Kubernetes service, none is incentivized to support multicloud. I’ve previously written about what the big deal with multicloud networking is. Taking a step back, Enterprises face the same challenge with operationalizing apps in multiple clouds. How do you perform lifecycle management of cluster types across all clouds, be it private or public?
  2. Lack of centralized policy management controls – While there are native k8s constructs for network and security policy, they lack unified definition and enforcement across fleets of clusters. How do you configure enterprise-grade policies that can be enforced across all Kubernetes infrastructure while allowing for centralized detection and reporting of policy violations?
  3. Limited Role Based Access Control (RBAC) – The kubectl CLI tool does not provide RBAC by default. Executed commands are not logged by user account and, generally speaking, kubectl is difficult to access outside firewalls. Moreover, using it to manage entire fleets is cumbersome and error-prone. How do you ensure that developer, QA, DevOps, and Ops/SRE teams have the right access based on their roles and responsibilities?

Here’s how Rafay solves the above problems:

  1. Lifecycle Management of any Kubernetes cluster type, be it Public (EKS, AKS, or GKE) or Private Cloud On-Premises. Rafay provides a single pane of glass for Operations teams to deploy, manage, and upgrade all of an Enterprise’s Kubernetes clusters across all environments from a single console.
  2. Centralized governance through cluster Blueprints. These ensure that clusters are always in compliance with company policies. Blueprints allow centralized configurations for cluster standardization that can encompass security policies, software add-ons such as service mesh, ingress controllers, monitoring, logging and backup and restore strategies.
  3. Zero-Trust Access. This service enables controlled, audited access for developers, SREs, and automation systems to the Kubernetes infrastructure. It integrates tightly with enterprise-grade RBAC/SSO solutions and is continuously validated for security configuration and posture to ensure compliance.

These are just a few of the rich suite of turnkey services that the Rafay Kubernetes Operations Platform provides.

Networking will always hold a special place in my heart and I’ll still get some of that exposure at Rafay. However, it will be less with BGP AS Path Prepend and more with CNI plugins.

At the end of the day, technology features and benefits are one thing, but what really excites me is what app modernization ultimately means for organizations. We all use many of these modern apps every day. Enterprises build them for a number of critical business reasons, such as to serve customers, leverage cloud computing, and better compete in the market.

To date, Kubernetes has been more of a hurdle than an enabler. Rafay’s goal is to change that and help make Kubernetes the accelerator to modernization that it was intended to be. And that movement, to me, is worth joining.

I’m absolutely thrilled to begin my journey in the exciting world of enterprise-grade Kubernetes operations management with Rafay!

Learnings from teaching multi-cloud networking and security to thousands

Last week was my 2-year anniversary at Aviatrix. I thought I would take a moment to reflect on my role and what it has meant to me.

I lead the technical enablement for the Aviatrix Certified Engineer (ACE) Training and Security program. When I joined the company, there were fewer than 500 certified individuals. I’m very proud to announce that Aviatrix has 18,000 ACEs just two years later.

  1. What’s the big deal about Aviatrix anyway?
  2. What’s the big deal about ACE anyway?
  3. What does ACE have in common with Peloton?
  4. ACE IaC – Bridging the gap between Developers and Network Engineers
  5. What are the desired outcomes of Customers in ACE trainings?
  6. What do our Customers think of ACE?

What’s the big deal about Aviatrix anyway?

Who are these 18,000 people and why did they invest their time in Aviatrix? For the most part, they represent Enterprise IT professionals who are facing a challenge of managing a multi-cloud infrastructure.

Earlier this year, I talked about it at length in a webinar titled ‘Getting Ahead in the Cloud: Use the Skills Gap to Your Advantage‘. In that talk, I identified some personas that I have typically encountered:

  1. On-prem networking professionals who need to adapt to the needs of the business in order to stay relevant. They know networking inside out, but since their company has recently moved to the cloud, network engineers find themselves having to play catch up.
  2. Cloud Infrastructure architects and engineers who need enterprise-grade networking with visibility, which is something the CSPs struggle to deliver because of their multi-tenancy model.

In general, they all come to Aviatrix to achieve their business goals by adopting Public Cloud. These goals include:

  • Application turnaround and uptime – Just about every enterprise finds cloud a strategic enabler for their business. They move to the cloud to gain better agility and unearth new sources of revenue. This means that household names, such as Fortune 100 companies, are now technology companies. It doesn’t matter what industry or vertical they are in. But to get there, their applications need to be secure, highly performant, highly scalable, and highly available.
  • The massive skills gap in multi-cloud – Enterprises will adopt the best of each cloud to improve their business initiatives. And as soon as an enterprise goes multi-cloud, the IT team is put under immense pressure to re-tool with very little time.

Moreover, they want to adopt Aviatrix because they face operational challenges in the cloud such as:

  • CSPs disincentivized to support multi-cloud – That’s pretty obvious, but most important. Customers don’t want a different architecture for each of the 5 CSPs they are in. They want a single architecture that does it all.
  • Difficulty scaling out – Networking and automation have historically never gotten along well. DIY methods were hard enough in on-premises. In the cloud, where they don’t have control and visibility, it is impossible.

Aviatrix offers enterprises instant benefits with multi-cloud optionality. Even in a single region of a single cloud, Customers get a unified control, management, and automation plane for all their accounts, subscriptions, projects, or tenancies.

What’s the big deal about ACE anyway?

Simply put, Customers pursue the ACE training and certification program because they want to learn more about Aviatrix in a structured and standardized way.

When I joined Aviatrix, there were 2 ACE tracks – Associate and Professional. ACE Associate is an introductory course that fast-tracks cloud networking knowledge. It covers cloud networking for all CSPs along with a brief overview of Aviatrix. ACE Professional is deep product training with a blend of lectures, labs, and design exercises, which is great for network engineers and architects.

However, soon after I joined, it was becoming clear that our Customers needed more. They wanted hands-on training for their operators, so that they could be enabled to do their job in the cloud with better insights and better visibility. They needed this so that they could solve problems very quickly and securely build their multi-cloud infrastructure.

What does ACE have in common with Peloton?

The result was ACE Cloud Operations – an 8-hour training with 10 labs that walks students through CoPilot, which is the Day 2 Operations component of the Aviatrix platform. I like to compare this hands-on ACE Cloud Operations training with a Peloton bootcamp, where there are efforts and recoveries for optimum performance. The labs are analogous to efforts – fast-paced and focused on troubleshooting. The lectures are analogous to the recoveries – a quick recap of what the feature is all about.

One of the best parts about ACE Cloud Operations is how certification is awarded. It is 100% based on how well the student did in their labs. There are no facts to memorize, and no exams to study for. We believe that the components of a hands-on certification should be hands-on. And this approach has been very well received by our Customers and Partners.

ACE IaC – Bridging the gap between Developers and Network Engineers

However, there was still something significant missing. For decades, network engineers have felt out of place when interacting with software developers. The problem typically starts from college when they feel uncomfortable with programming language courses. They are more at ease with data in transit (i.e. networking) than writing thousands of lines of code. I most certainly was like that in school, and thousands of Customers I’ve worked with are like that as well.

But nowadays when application developers are relying extensively on the speed and agility that the cloud has to offer, they find it very frustrating when networking and security teams are slow to respond to the needs of the enterprise. Networking needs to codify their approach to building in the cloud.

And often just as soon as network engineers learn how Infrastructure as Code (IaC) works in one CSP (such as CloudFormation in AWS), they need to re-tool on very short notice when their company goes multi-cloud. This has happened with so many of my Customers. They need a cloud-agnostic approach. Enter Terraform.

We came up with ACE Infrastructure as Code (IaC) to bridge the gap between network engineers and developers. It is built on the principle of teaching DevOps for Network Engineers. We teach the concepts of DevOps, VCS, and CI/CD pipelines from a network engineer’s perspective. There are tons of free learning resources out there that cover these topics, but none that cover them so well for network engineers. This training assumes absolutely no pre-requisite in programming, but we sprinkle it with just the right amount of Terraform.

There are 3 hands-on labs with the goals of Build, Enhance, and Secure in mind respectively. By no coincidence, they map out neatly to Day 0, Day 1, and Day 2 Operations. The 3rd lab also covers a soft skill – Collaboration, and why it is important for the various stakeholders of an organization (Network Engineers, InfoSec, and Developers) to work closely together to build an enterprise-grade network.

Perhaps best of all, this training is available for free to consume at your own pace. This is more appealing to Customers who have different backgrounds in programming. I am especially proud of ACE IaC as there is nothing like it in the industry.

What are the desired outcomes of Customers in ACE trainings?

New customers are typically more interested in use cases like

  • How to get unstuck with cloud-specific implementations (such as AWS TGW or Azure Virtual WAN) by building on a repeatable architecture – Aviatrix Multi-Cloud Network Architecture (MCNA).
  • How to secure Egress traffic by filtering FQDNs.
  • How to build a solution for remote users to VPN to their cloud network that is cloud-agnostic.
  • How to leverage Single Pane of Glass embedded Threat Intelligence.
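As an illustration of the egress filtering use case above, here is a minimal sketch of matching destination FQDNs against an allowlist with wildcard patterns. The allowlist entries and the egress_allowed function are hypothetical and only show the idea. This is not how Aviatrix implements the feature.

```python
from fnmatch import fnmatch

# Hypothetical allowlist; the patterns are illustrative only.
ALLOWED_FQDNS = ["*.ubuntu.com", "api.github.com"]


def egress_allowed(fqdn: str, allowlist=ALLOWED_FQDNS) -> bool:
    """Return True if the destination FQDN matches any allowlist pattern."""
    # DNS names are case-insensitive and may carry a trailing root dot.
    name = fqdn.lower().rstrip(".")
    return any(fnmatch(name, pattern) for pattern in allowlist)


for destination in ["security.ubuntu.com", "api.github.com", "example.org"]:
    print(destination, egress_allowed(destination))
```

Anything that does not match a pattern is denied by default, which is the usual posture for egress control.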

Existing customers, on the other hand, are more interested in deeper integrations with SD-WAN vendors. This means moving more towards the edge of the cloud network and learning how Aviatrix can work more closely in the on-prem Data Center ecosystem.

Lack of Visibility and Control in native CSP offerings is something all ACE attendees are concerned with.

What do our Customers think of ACE?

I have delivered live instructor-led training on multi-cloud networking and security to over a thousand Customers and Partners. Self-paced ACE trainings have been consumed by over 75,000 students. And I read every piece of feedback in post-training surveys.

Instructor-led training has given me the opportunity to understand the pain points of our Customers. And by and large, they come to ACE trainings because they find it impossible to build a secure cloud infrastructure at scale, at high performance, with visibility, and in multiple clouds without using Aviatrix.

The accolades I’ve received for ACE are overwhelming to say the least. Customers routinely make statements like this in surveys:

  • One of the best trainings I’ve ever had!
  • I use the skills I learned in ACE daily. In addition to providing training on Aviatrix products, the coursework took a deeper dive under the cloud providers’ covers. Thanks to this training, I have a better understanding of their underlay networks, which simplifies troubleshooting.
  • This post by a veteran in the industry.

It has been the most rewarding learning experience of my career and I’m excited with what lies ahead.

Introducing ACE Cloud Operations

Recently Aviatrix developed a new course in the Aviatrix Certified Engineer (ACE) program. Aviatrix Certified Engineer – Multi-Cloud Network Operations (or ACE Cloud Ops for short) is catered towards cloud operations practitioners who need to successfully run, operate, and manage business-critical Day-2 workloads in the cloud.

The ACE program recently announced its 10,000th certified engineer. That’s a phenomenal achievement considering our stretch goal for the year 2020 was only 500. It’s amazing how COVID-19 has resulted in expanding our reach to hundreds of students per week.

ACE Cloud Ops takes a unique view on operating cloud infrastructure, which is necessarily different from operating on-prem infrastructure.

Operations in the On-Prem World

In the On-prem world, enterprises own the underlay. They have full control over traffic patterns and have a familiar toolkit regardless of what vendor they use on-prem.

Of course some tools, such as SNMP, died away, but ICMP-based tools such as Ping and Traceroute are still going strong 40 years after RFC 792. IP doesn’t go away when you move to the cloud and neither should the network engineering toolkit.

Key skills for Infrastructure Operations engineers include:

  • Hardware (knowledge of cables, transceivers, switches, routers, racks, real estate, physical security, power, cooling)
  • Layer 2 (Spanning Tree is the worst use of an Operations Engineer’s time)
  • Repeatability achieved by scripting tools such as Expect (which is really screen-scraping), Shell, Perl, Python (still invaluable). This is not true automation.

Capacity planning in the on-prem world often involves ordering the right number of spares to plan for outages, so that there is some form of high availability, although it does result in higher RPOs and RTOs.

We all know the financial benefits (when done well) of moving apps to the cloud. But while it offers great agility for Developers (you can spin up a database within minutes), networking has been slow to catch up. Moreover, as we see a rapid shift towards multi-cloud, Operations teams are left on their own without guidance.

Operations in the Cloud World

Operations engineers have a harder time doing their job because of the lack of toolsets afforded to them by Cloud Service Providers (CSPs). Each CSP has proprietary tools that are intended to keep their customers locked into their cloud. Moreover, networking is not a source of revenue for CSPs. They don’t make networking easy and their networking tools are, simply put, not enterprise-ready. 

For example, consider what it takes just to view a route table in Azure. An intuitive approach would be to list the routes from the VNet or at least have a direct link to it. However, you would be mistaken to think that way.

Instead, buried in a list of connected devices in that VNet, you have to select the appropriate NIC, which may have an obscure ID.

Next, you have to select an even more obscure term called ‘Effective Routes’.

Only then can you see the routes.

It is a very clunky approach to what is a routine task in the On-prem world. Of course the problem grows exponentially when having to deal with the oddities of each cloud when the enterprise goes multi-cloud. Each CSP abandons the networking toolkit and offers its platform as a black box to Operations teams.
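For readers wondering what an operator does with those routes once they are finally on screen, here is a minimal sketch of the lookup they perform in their head: a longest-prefix match for a given destination. The route entries and next hop names are hypothetical sample data, and the sketch deliberately ignores any tie-breaking between route sources.

```python
import ipaddress

# Hypothetical effective routes for one NIC: (prefix, next hop type).
EFFECTIVE_ROUTES = [
    ("0.0.0.0/0", "Internet"),
    ("10.0.0.0/16", "VnetLocal"),
    ("10.1.0.0/16", "VNetPeering"),
    ("10.1.2.0/24", "VirtualAppliance"),
]


def lookup(destination: str, routes=EFFECTIVE_ROUTES):
    """Return the (prefix, next hop) chosen by longest-prefix match."""
    addr = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    if not candidates:
        return None
    # The most specific matching prefix wins.
    net, hop = max(candidates, key=lambda item: item[0].prefixlen)
    return str(net), hop


for destination in ["10.1.2.5", "10.1.9.9", "8.8.8.8"]:
    print(destination, lookup(destination))
```

On-prem, this is a single show command on a router. The point of the example is how little is being asked for.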

When moving to the cloud, an Operations Engineer must have these new skills at a minimum:

  • Agile mindset
  • Infrastructure as Code (read Terraform)
  • CI/CD
  • VCS

Capacity planning takes place with cloud-native principles, such as elasticity and auto-scaling. It requires a new way of thinking, not just for Developers, but also for Operations teams. 

ACE Cloud Ops

The ACE Cloud Ops course better equips Cloud Operations teams to run a multi-cloud network in their daily jobs. It builds on the immensely popular ACE program with some of the most common use cases we see from our customers when operating in any cloud:

  • How to Ensure Business Continuity with an Enterprise-class Transit Solution
  • How to Strengthen Compliance and Audit Initiatives by providing Monitoring and Troubleshooting for Cloud Security Appliances
  • How to Efficiently Connect Remote Sites to Cloud
  • How to Improve your Cloud Egress Security posture
  • Best Practices for Platform Operations Management
  • DevOps for Network Engineers

There are also hands-on labs focused on break-fix scenarios that are based on this topology:

The source code of the Terraform that built this topology is here.

ACE Associate is a pre-requisite for ACE Cloud Ops. 

Submit interest for taking ACE Cloud Ops here.

AppIQ – Unprecedented visibility that Aviatrix CoPilot brings

Earlier in my career, I worked as a Network Engineer in the high-frequency trading industry at a capital market exchange. It was the time when electronic trading was gaining heavy momentum as open outcry was receding. This was thanks in large part to vendors such as Arista, who leveraged merchant silicon from Broadcom to lead the charge in low-latency networking.

Scores of trading firms would set up their equipment in one of the exchange’s many data centers inside the building to practice latency arbitrage. Speed was the name of the game and livelihoods were hedged on the network’s ability to pass packets as quickly as possible.

In the early days, any time there was a significant delay (could be as low as 1-2 seconds), the exchange would get hit with hefty fines. However, if we could prove that it was not the fault of the network, but rather the application that caused a trade to execute slowly, then we were off the hook. So my team invested in several network taps and sniffers from NETSCOUT and Gigamon to perform forensic analysis on these low-latency, high-throughput financial systems.

But there were never enough taps. Taps allowed us to pinpoint the location and cause of delays and retransmissions if we were lucky enough to have placed them at the exact spot in the network where the delay was incurred. It was like playing a game of whack-a-mole. Providing evidential data was a nightmare in those days. There was so little visibility.

Did I mention we owned the entire network?

Fast forward to today’s public clouds, which are complete black boxes. They provide very little visibility, and the network had no way to prove it was not at fault because no tool could extract meaningful data until Aviatrix CoPilot came along. It already had the ability to display NetFlow records to provide such empirical data. Take this screenshot as an example.

If I were to see a flow with a few SYNs coming in, for example, I could use that information to ask the Application team whether everything is okay on their end. Or if I see a SYN followed immediately by a RST, that might point in the direction of a firewall blocking something. Or maybe if PSH packets are going through fine and data is being passed for a while, it might be another indication of the network doing its job and the application developer needing to be pulled in. It’s a very powerful feature.
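The reasoning above can be sketched as a small triage function. It is only a rough heuristic and assumes each flow record carries the cumulative OR of the TCP flags seen in that flow, as NetFlow exports commonly do. The triage function and its messages are hypothetical, not a CoPilot feature.

```python
# Standard TCP flag bits from the TCP header.
FIN, SYN, RST, PSH, ACK = 0x01, 0x02, 0x04, 0x08, 0x10


def triage(flags: int) -> str:
    """Map the cumulative TCP flags of one flow to a first guess."""
    if flags & SYN and flags & RST and not flags & PSH:
        return "reset after SYN: check firewall or security rules"
    if flags & SYN and not flags & (ACK | PSH | RST):
        return "SYN only: ask the application team if the server is answering"
    if flags & PSH and flags & ACK:
        return "data flowing: network looks healthy, involve the app developer"
    return "inconclusive: inspect the flow in more detail"


for observed in (SYN, SYN | RST | ACK, SYN | ACK | PSH):
    print(f"{observed:#04x}", triage(observed))
```

None of these outcomes is proof on its own, but each one tells you which team to talk to first.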

But with the new AppIQ feature released this week in CoPilot, visibility is taken to the next level. AppIQ allows you to generate a comprehensive report of latency, traffic, and performance monitoring data between any two cloud instances connected via your Aviatrix transit network, such as shown here with an SSH test.

Now you can see latencies on a hop-by-hop basis. AWS us-east-1 (N. Virginia) to us-east-2 (Ohio) regions are about 12 ms away on average. And each of those green links represents an encrypted tunnel.

End-to-end encryption in the cloud with visibility: that’s what every network engineer dreams of having.