NAN077: Network Observability: Tools, Automation, and Insights | Packet Pushers (2024)

Network optimization starts with observing, but how are networks observed and what tools are used? Joining the podcast today are the authors behind the book “Modern Network Observability.” Eric Chou welcomes David Flores, Christian Adell, and Josh VanDeraa to help uncover practical strategies and real-world case studies for network observability.

Episode Guests: David Flores, Christian Adell, Josh VanDeraa

David Flores, Sr. Network Developer, CoreWeave

David Flores started out as a network engineer and has since ventured into network automation, DevOps, and software development. That foundational knowledge in networking has been invaluable, helping him shape better automation solutions, especially in the NetOps realm. His recent pursuits have been network telemetry and observability, building automation tools that draw from this data. He’s also part of a hands-on team of automation engineers, and together they roll up their sleeves to solve the diverse automation challenges clients bring to them.

Christian Adell, Principal Network Architect, Network to Code

Christian Adell is a network software engineer who has played multiple roles related to networking and IT automation. Currently, as Principal Architect at Network to Code, he is focused on building network automation solutions for diverse use cases, with great emphasis on open source software. Christian is co-author of O’Reilly’s Network Programmability & Automation book and the Network Automation with Nautobot book by Packt.

Josh VanDeraa, Services Director, Network to Code

Josh VanDeraa is a network engineer and automation engineer who got his start in large retail but has also worked in travel and professional services. Josh is now a Services Director at Network to Code. He is the author of the self-published book Open Source Network Management and co-author of the Network Automation with Nautobot book by Packt.

AdSpot: torero

Torero is a free automation gateway designed to allow network engineers to begin the transition from localized automation development to team-oriented development. It focuses on creating and delivering a uniform experience for operationalizing disparate approaches to achieving network automation. Whether it’s Ansible playbooks, OpenTofu plans, or internally developed Python scripts and tools, torero is purposely built to launch and execute automation services without burdening teams that need to maintain multiple vertically built and costly operational stacks. Learn more at www.torero.dev.

Episode Links:

Josh VanDeraa’s Blog

Modern Network Observability book

Network Automation Nerds Podcast Episode 33 with Josh VanDeraa

Episode Transcript:

This episode was transcribed by AI and lightly formatted. We make these transcripts available to help with content accessibility and searchability but we can’t guarantee accuracy. There are likely to be errors and inaccuracies in the transcription.

Automatically Transcribed With Podsqueeze

Ethan Banks 00:00:01 Today’s episode is brought to you by torero. Stop worrying if your peers have the environment to run your automation. Torero is a free tool that enables you to launch Python, playbooks, and plans against production while handling dependencies. Find out more at torero.dev, t-o-r-e-r-o.dev.

Eric Chou 00:00:30 Hello and welcome to another episode of Network Automation Nerds podcast, where we explore the latest in our conversation from a practitioner’s perspective. I’m your host, Eric Chou, a network engineer who loves everything about network automation. In today’s hyper-connected world, maintaining and optimizing network performance has never been more critical. But how do we optimize our network without knowing what is going on? We can’t, right? How do we observe our network? What tool stacks to use? Joining us today are the brilliant minds behind the book Modern Network Observability, a hands-on approach using open source tools such as Telegraf, Prometheus, Grafana and other tools. We welcome David Flores, Christian Adell, Josh VanDeraa, all network veterans to the show. Together we will uncover some of the practical strategies and real-world case studies for network observability.

Eric Chou 00:01:19 Without further ado, let’s dive into today’s topic. I want to start with just welcoming you guys. I want to give you guys some background story. All three of these guys, we work together some shorter, some longer. And I’ve known Josh for longer. And we’ll have his previous episode on network monitoring, open source network monitoring projects in the show notes. But I do want to welcome all of you to the show, you know, and we want to start with some origin stories. So since, you know, we just kind of go through the boxes laying out on my screen, why don’t we start with you, David? Just give us an overview of your background. And how did you come about being passionate about network observability?

David Flores 00:01:53 Yeah, sure. Well, thank you for having us on the show, Eric. I would say maybe the origin story is not that different from many engineers out there. I’m originally from Venezuela. My first job was actually on the network operation center, so I was doing shifts, doing a lot of day to day operations, day two operations, replacing routers, fixing configurations and so on.

David Flores 00:02:14 And at the time, I had faced hard facts around incidents. And the monitoring tooling at that time was pretty poor. Okay. It was sufficient enough in order to be reactive about it, but if you wanted to see deeper into it, if you wanted to manipulate the data, if you wanted to do other kinds of stuff, getting information was pretty difficult because it was a vendor-based world, and they were pretty closed on what you could do with that information. I would say that that kind of marked a little bit of my background on data operations, and I’ve been working with cloud providers, financial institutions, and I’ve been doing automation for a while, and I always try to come back to the story around, you know, how can we make observability better for the networking space? I usually have that job doing normal NetDevOps. What you build is basically what you operate. So it was crucial that if you do automation for building stuff, you need to automate as well for your operations stuff. And I found out myself that we were kind of stuck in like ten years ago with the amount of tooling that the network space had.

David Flores 00:03:19 So I started looking around at new modern tools that the DevOps cloud engineers were using at the time. So that’s where it started. That’s where it all started. So I’m pretty happy right now with the book coming out. But I would say that definitely there’s a lot of experience working with Christian and with Josh at NTC. That’s actually where we all worked together, at NTC, and we did a lot of projects around observability. That’s kind of where everything started. Now, the book itself is actually a pretty funny story. I will actually let Christian talk more about it, but we were at this Cisco Live event. I think it was maybe one year ago. No, two years ago. Well, it was at the Cisco Live event, and I don’t know if it was good advice or bad advice, because it can be a little bit stressful writing a book, for sure.

Eric Chou 00:04:06 Was Christian drunk or.

David Flores 00:04:09 Well, he was definitely high on enthusiasm.

Eric Chou 00:04:12 He was drunk on knowledge.

David Flores 00:04:13 So he was really trying to get, you know, this is something that it might be good to get out there for other people to actually know and to have more experience on it. That’s basically where I think it all started.

Eric Chou 00:04:25 Yeah, yeah, that’s kind of funny, David, because, I mean, I’ve worked in a NOC before. What is the first question people ask when you call them from the NOC? Like what changed? Right. And it’s so surprising that we can’t even answer that question. And I kind of chuckle when you told that story because you’re saying when you build something, you need to be able to observe it. But that’s not always the case. And that’s where we are today. So and that’s why, you know, you have these stories and, you know, tools that you’re building up with. But I really think that’s a good way to approach it is just to, you know, kind of build it in bundle, right? Like eat your own dog food. You built it. It’s not that something you could throw over the wall and somebody else would have to bear the consequence of not knowing what’s going on. Christian, so David gave a good intro, so why don’t we go ahead and have you tell your overall background and kind of tie into what David was saying?

Christian Adell 00:05:11 Yes. So in my case, I have a very similar background because I also came from network operations: traditional data center, campus, traditional network operations. But before jumping into a full network automation project, I had three years of experience in a company that was focusing on delivering application software in the cloud environment. So they were applying DevOps principles from day one. So to me, that was opening my mind to discover, not the traditional monitoring application that I was used to, but a different approach where in most of the cases you don’t even know what you have to monitor, because the services, the applications, were jumping from unexpected places and you have to be aware of what is actually happening or what actually happened while you were sleeping, and have to go back and see and document all these kinds of things. Then I was coming from this pure DevOps approach into another network environment that in this case was a worldwide deployment where we had to be very focused on keeping a high level of service. And in terms of availability, we had to change the traditional approach a bit in terms of being more proactive, trying to apply these self-healing concepts to make the network run unattended as much as we can.

Christian Adell 00:06:31 So we want to have always the network going on live and up and running. But sometimes you need some fixes, some adjustments and all these things. If you can be proactive and use the data in order to take, let’s say, the good decision that you are expecting in an automated way. This closes the cycle and makes you capable of implementing this end to end self-healing networks. This is at the end what you get with the data and the power of this data, and how you can reuse not only the data that you are getting from the operational network, but also the data that you have in other places, like a source of truth, where you combine all these things in order to enrich the experience. For the people operating this network, is what we have been doing for a while in Network to Code with David and Josh, and was the seed, the inspiration for us to say all this knowledge, all these concepts that in the book we are representing or implementing in some specific tools. But at the end, the concept can be extrapolated to different stacks and what we want to share with the community.

Christian Adell 00:07:37 And it was what David explained that was initially thought as an idea to go to conferences and share the idea, but we said maybe it is not enough. One half an hour, one hour is not enough to share all that we want to share, and we just jump back into this idea of creating a book where we can collect the learnings and the ideas that we have in order to put into action this concept of modern network observability.

Eric Chou 00:08:01 I also want to do a plug in for you, Christian. This is not the first book that you wrote, right? So what are some of the other books that you’ve written?

Christian Adell 00:08:08 Actually, this is my third one in a row in one year, let’s say published. And the first one was the second edition of the Network Programmability and Automation book by O’Reilly. That book was my first experience and I was very excited about that, because I had to learn how to write a book. And this connects with the question from David about how stressed you get when you have to write a book.

Christian Adell 00:08:30 I would say that the first one is the worst, but then you get used to the experience and you also appreciate and enjoy the experience. Aside from that, together with Josh, I am co-author of the Network Automation with Nautobot book by Packt. In this case, I had the opportunity to bring together Josh and David in order to create this book, focused on this modern network observability. But it’s a modern network observability for networks that today are heterogeneous in nature. So they are not only on premise, they are virtual, they are in the cloud. And you have to take all these things into account.

Eric Chou 00:09:07 Yeah. Back to what you were talking about on writing a book or taking on any project for that matter. It’s kind of a muscle that you just have to exercise. You know, if you use the gym analogy, the first time you do a bench press, that whole motion is unfamiliar to you. It’s not so much about the weight or the actual exercise, it’s that the whole motion is different. So now once you get into that motion, you know, you get more used to it. But the challenge for me is to keep that motion going. You know, once you write that book and you get into that cadence of, you know, writing whatever words that you have per day, and then you stop and then you start again, it’s almost like you’re in that perpetual starting point. So I admire you for writing three books in a year, one at a time.

Christian Adell 00:09:45 Actually, I have a point here that it’s the same experience that I have today in the podcast. So I see that you are very used to that. But for me it is my second experience, and it’s always taking a bit of effort to get used to the conversation in this format.

Eric Chou 00:09:59 No worries, no worries. Virtual hug right here. Virtual high five. You’re doing great. All right, let’s move on to our third author on the podcast where oh I’m sorry, fourth, counting myself. But you know, Josh, you’re old friends to the show. You know, why don’t you give us a little bit of overview for yourself and maybe your story with the book as well?

Josh VanDeraa 00:10:16 I just want to dive into my journey on the observability side of things of many years I had myself get trained that when we got a notification that something was down or a call came in that something was down. My first instinct had been trained not to go to a monitoring tool, but to go to the command prompt and type ping and start to just ping everything. I go back to at that point, I would say the observability tool at that place was not trusted as far as being able or it was not exposed. And as David said, some of the other tooling that I got to see from the DevOps side of things, of the new graphs and live data coming in, really, once I saw that that was probably in the middle of 2010s, right? Got my first exposure and was, how can I get this for networking? And then along my journey of, okay, get into some other organizations where they had built and put some care and feeding into the network observability tool. Got the experience of having really a sidekick that was there to say, hey, well, how does the network look? I still remember as I got in the habit of doing extended pings, sending 10,000 pings of 1500 bytes across a link, and if that’s good, then everything else is good.

Josh VanDeraa 00:11:27 I turn up a new ten gig circuit where all my ping tests all came out fine, was about to pack up to leave for the night, and want to check the monitoring tool one more time. And it detected some errors. You know, being able to get some feedback, those tools were still a little bit more every five minute polling type thing. And so really start to look at the tooling that we’ve been looking at with Telegraf, with Grafana and putting things together really to get that much more real time. You know, it’s not quite exact real time, but it’s close to real time. And so the opportunity to put these ideas and concepts into writing for others in the network community, to be able to take and implement and take these features, it’s been a pleasure to do so.

Eric Chou 00:12:09 Yeah, definitely. I think getting to real time, that’s like the holy grail and just the amount of data that we observe. I mean, I had a peek into the book and, you know, I know what tools would be using the stack.

Eric Chou 00:12:21 For myself, you know, I’m more familiar with the ELK Stack, and I know how difficult it is for, you know, let’s just, for example, NetFlows to be exported, even a moderate or small amount of NetFlow will just overwhelm the Elastic Stack. So this is probably a good segue into our next topic, which was tell us, what are the tools that’s being used in the stack that, you know, we mentioned a little bit Telegraf, Grafana and Prometheus, but why don’t you go a little deeper into that, David, on some of the characteristics or the reasons that you pick these tools, or what experience led you to pick those tools, let us know.

David Flores 00:12:56 Yeah, no, that’s actually a great question. And that’s something that also I want to convey the messaging that I think it’s really important at the end of the day, kind of like the pick of the tool set for the observability stack has been based on, you know what, it’s out there, open source that people can, you know, go ahead and start running.

David Flores 00:13:16 But it’s by no means a golden recommendation, okay? Because there’s a plethora of considerations that you need to take into account in order to pick or choose the toolset that, you know, best fits your needs or your environment, right. There’s a particular chapter that Josh actually wrote around the build versus buy aspect of things, you know? When does it make sense to actually go with the vendor? When does it make sense to actually build it yourself? What kind of skill set do you need in order to operate it? Okay. If you’re the one that is operating, what do you need to have to operate it, to orchestrate it and to use it? But coming back to your question around the toolset, one of the few things is that there was some care on what kind of tools we use, because we wanted to make it as hybrid as possible. You will see that the toolset has open source tooling that is coming from different companies, from different vendors. For example, Telegraf is part of the InfluxDB family, okay.

David Flores 00:14:15 And we use it mainly as a way for us to extract, collect, and transform the data from the device. So there are a couple of chapters talking about only Telegraf. There’s also Logstash for log parsing. You mentioned ELK Stack and actually I’m also a fan of the ELK Stack. Actually I did a lot of Elastic Stack back in the day, so there’s a little place in my heart for Logstash. I think it’s a great tool and provides a lot of capabilities for parsing and ingesting log data, and being able to also send it to multiple destinations. We have also Prometheus, okay, which is from the Cloud Native Computing Foundation now. And also there are big projects based on Prometheus like cloud offerings like Grafana Mimir or the open source channel. So Prometheus has been kind of like the de facto standard for time series databases. That’s on the metrics side of things. For logs, we use Loki, Grafana Loki. Okay, in order to go with the Grafana stack. Okay, so we use Grafana Loki in this case for storing logs. For visualization, Grafana; for alerts, we go with Alertmanager.

David Flores 00:15:16 Okay. So if you think about it, Alertmanager also comes from the Prometheus project, and it’s also compatible with Loki. So if you think about it, a lot of these projects are kind of mixed and matched from different sources. And what I really like is the effort of the community to bring the interoperability of these different components together, okay? Because at the end of the day they’re just passing data. So yeah, I will say that the one thing that I also want to say about the book is that it’s more like a guide. Okay. It might go deep into subtopics, but it will not explore all the, I don’t know how many plugins Telegraf has. It’s not going to go through the 250 plugins that Telegraf has.

Eric Chou 00:15:56 More than my hands could count.

David Flores 00:15:57 Yeah for sure. So definitely it’s not going to go into every single option because there’s a lot of them, but it provides a good way for you to see, hey, how can you monitor devices with SNMP or how do you monitor with gNMI? It is to provide some streaming telemetry, so the idea is to guide you and to also test you, it is a practical book. So it’s also so you can actually, you know, spin it up yourself, test it, run configurations, see dashboards, see alerts happening in real time. And, you know, experiment a little bit.

Eric Chou 00:16:25 So if I take what you just said and kind of reframe it into a hierarchy in my mind, correct me if I’m wrong, Christian. So the first part would be, you know, kind of data collection, so that’s your Telegraf, that’s the data shipper, so to speak, and you have the data storage, which is Prometheus. And then on top of that you have the visualization, which is the Grafana layer. And I would say probably parallel to Grafana, you know, you kind of have the added service on alerting with Alertmanager and so on. Is that an accurate picture, Christian, or am I missing some components there?

Christian Adell 00:16:59 To be honest, we start first trying to introduce why we call it observability and not monitoring. Right. So we try to go into high level understanding of what’s the next step on the traditional monitoring.

Christian Adell 00:17:11 So what observability means: this extra understanding of what’s actually happening on the network, not leaning just on the metrics on the logs that we are used to do, but try to present the different options that you have around operational data that include flows include synthetic monitoring, includes also traces, packet captures. All the things that you can collect. Doesn’t mean that we are going to cover in detail in the book all of them, because that will be more like a collection of books. So in this case, we focus, as David said in this book, mostly around metrics and logs, but we present first all the options that you have available that can be included on that. Then on the meat and potato part of the book that is on the second part is where we just go through all these different tools that David introduced, and you already recap. However, it’s very important to notice that the first chapter on this second part of the book focuses on the architecture, because, as David said, we choose a few of the available options, the ones that maybe we are more familiar with, but a few of them, and they can be reused in other cases for other tools.

Christian Adell 00:18:19 Because something that is crucial in this new ecosystem is the interoperability. You can choose one tool today, but next week, next year you will find a better option for your stack. So we start first with a framework architecture definition of what are the different components. For example, we mentioned about the collectors. We have different options depending on the type of the metric or the logs. These data have different types of collectors. Just as a way to show that you can bring different types of collectors depending on what you want to capture. There is also another layer that we call data processing and enrichment. That is how you scale out this data management. When you have a lot of metrics coming in, you have to distribute it into many different services ingesting that data. All of these things are always represented in an architecture that can help you, can guide you in order to say, okay, that tool in the book represents that functionality. But if I have another stack ELK, or we have the InfluxDB time series database, whatever you have can also fit into the picture. And we use always the architectural reference to make all these things easier to understand.
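The architecture-first idea Christian describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Metric`, `Collector`, `Store`, `StaticCollector`, `InMemoryStore`, and `run_pipeline` names are hypothetical and do not come from the book or from any of the tools mentioned. The point is that each layer is defined by its role, so a tool filling that role can be swapped without touching the rest.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Metric:
    """A single normalized measurement with descriptive labels."""
    name: str
    value: float
    labels: dict[str, str] = field(default_factory=dict)


class Collector(Protocol):
    """Role: anything that can produce metrics (SNMP poller, gNMI subscriber...)."""
    def collect(self) -> list[Metric]: ...


class Store(Protocol):
    """Role: anything that can persist metrics (a time series database)."""
    def write(self, metrics: list[Metric]) -> None: ...


class StaticCollector:
    """Hypothetical stand-in for a real collector; returns canned metrics."""
    def __init__(self, metrics: list[Metric]) -> None:
        self._metrics = metrics

    def collect(self) -> list[Metric]:
        return list(self._metrics)


class InMemoryStore:
    """Hypothetical stand-in for a real time series database."""
    def __init__(self) -> None:
        self.rows: list[Metric] = []

    def write(self, metrics: list[Metric]) -> None:
        self.rows.extend(metrics)


def run_pipeline(collectors: list[Collector], store: Store) -> int:
    """Pull from every collector, push to the store, return the metric count."""
    total = 0
    for collector in collectors:
        batch = collector.collect()
        store.write(batch)
        total += len(batch)
    return total


# Two different "collection methods" feeding the same store.
snmp_like = StaticCollector([Metric("interface_in_octets", 1200.0, {"device": "rtr1"})])
gnmi_like = StaticCollector([Metric("interface_in_octets", 3400.0, {"device": "rtr2"})])
store = InMemoryStore()
written = run_pipeline([snmp_like, gnmi_like], store)
```

Because `run_pipeline` depends only on the `Collector` and `Store` roles, replacing the in-memory store with a different backend would not change the collection code, which is the interoperability property discussed above.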

Eric Chou 00:19:23 That’s a great point. I think what you mentioned was the tools that you represent, it’s a representation of that function. So therefore if you have another tool that you know works out better for you, vendor neutral or open source or whatnot, that could be in parallel. And it’s important to have that plug and play mindset. So it’s important to have that architectural oversight so that in your mind you know where that tool fits into place. Which brings me to the next point about taking the tools. I think David mentioned that a little bit about open source. And I know at Network to Code, open source, a big theme around the company. So, Josh, maybe you could dive a little bit into the decision of going with these tools. And as David mentioned, open source plays a role in that, right?

Josh VanDeraa 00:20:10 Yeah, it definitely does. So when we take a look at the open source side of things, first off is the licensing side. What do you need to do from a licensing perspective, that gets a little bit in there. But as a typical enterprise, you should be able to use many of these open source tools. Right. And so when we take a look at the tooling, first and foremost is what role do you need to have it fit into inside of that architecture? So when we take a look at Telegraf, for example, which is a big part of the extract and getting data, it’s got many different input plugins and output plugins. But then it also makes it really easy to transform that data, and I really want to dive in a little bit more on that transformation and giving additional context. Right. Because the reason why you want to have additional context on the data is that as a human, you start to understand, hey, you know what the core uplink is from your router. Hey, this is either an interface or this is a core uplink, or hey, this is just out-of-band management. Most traditional tools probably don’t have the capability to do that.

Josh VanDeraa 00:21:07 But by integrating with the source of truth that says, okay, this role is a LAN uplink, or this role is the core uplink, you start to be able to take the contextual data that humans make and put that into your alerting side of things, so you can start to say, hey, I don’t want to be woken up when an access interface goes down unless it’s like 50% of the access interfaces. Set a threshold on that sort of thing of what is important. Whereas for one access interface going down, you probably don’t need to call and wake somebody up from getting some sleep. That’s a big part of the work life balance of network engineers and that pager responsibility, and really just finishing the tooling. It might fit today in one of the architectural roles. There might be something else that will replace it in the future. And so keeping that architectural mindset of it’s not just one tool for it all.
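Josh's paging rule can be sketched as a small function. This is a minimal illustration, not how Alertmanager or any tool in the book implements it: the `InterfaceState` record, the role strings, and the `should_page` helper are all hypothetical, and the 50% threshold is simply the figure Josh mentions. It assumes each interface has already been enriched with a role from a source of truth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceState:
    device: str
    name: str
    role: str      # assumed to come from a source of truth, e.g. "core_uplink" or "access"
    is_up: bool


def should_page(interfaces: list[InterfaceState], access_down_ratio: float = 0.5) -> bool:
    """Decide whether to wake someone up, based on interface role."""
    # A core uplink going down is always worth a page.
    if any(i.role == "core_uplink" and not i.is_up for i in interfaces):
        return True
    # Access interfaces only page once a large enough share of them is down.
    access = [i for i in interfaces if i.role == "access"]
    if not access:
        return False
    down = sum(1 for i in access if not i.is_up)
    return down / len(access) >= access_down_ratio


one_access_down = [
    InterfaceState("sw1", "Gi1/0/1", "access", False),
    InterfaceState("sw1", "Gi1/0/2", "access", True),
    InterfaceState("sw1", "Gi1/0/3", "access", True),
    InterfaceState("sw1", "Gi1/0/4", "access", True),
    InterfaceState("sw1", "Te1/1/1", "core_uplink", True),
]
half_access_down = [
    InterfaceState("sw1", "Gi1/0/1", "access", False),
    InterfaceState("sw1", "Gi1/0/2", "access", False),
    InterfaceState("sw1", "Gi1/0/3", "access", True),
    InterfaceState("sw1", "Gi1/0/4", "access", True),
]
uplink_down = [
    InterfaceState("sw1", "Gi1/0/1", "access", True),
    InterfaceState("sw1", "Te1/1/1", "core_uplink", False),
]
```

With this data, `should_page(one_access_down)` is false (one of four access ports down), while `should_page(half_access_down)` and `should_page(uplink_down)` are both true. Without the role label, all three cases would look like the same "interface down" event.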

Eric Chou 00:21:55 I like the fact that you mentioned the enhancing part. Both you and Christian mentioned that part. And this is what drives me nuts. Sometimes when I see people say data is the new oil, I want to, you know, reach over to the screen and say, no, data is not the new oil, business insight is, right? Like the insight that the data provides is the new oil, is the valuable part. Data itself means nothing. So therefore the enhancement, the ETL and, you know, just enriching that data. An IP address means nothing to me until we put it into context. This IP address represents this data center, and this IP address resides on this distribution switch and that is a management IP. And therefore we care more about that IP than, you know, just like some random, you know, IoT device. So I think that’s a big component of picking the right tool where it’s scalable, and nothing against vendor tools. But in my experience, vendor tools usually work best with their stuff, and rightfully so that they work best with their stuff. And it has the problem of scaling up or scaling out. But, you know, I think you have something else to say on that front, David. So go ahead.
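Eric's point about an IP address gaining meaning only in context can be illustrated with Python's standard `ipaddress` module. This is a sketch under assumptions: the `PREFIX_CONTEXT` table stands in for data that would normally come from a source of truth, and its prefixes, site names, and roles are invented for the example. The `enrich_ip` helper is hypothetical as well.

```python
import ipaddress

# Hypothetical source-of-truth data: prefix -> context for that prefix.
PREFIX_CONTEXT = {
    "10.10.0.0/16": {"site": "dc1", "role": "management"},
    "10.10.20.0/24": {"site": "dc1", "role": "distribution"},
    "192.168.50.0/24": {"site": "branch7", "role": "iot"},
}


def enrich_ip(address: str) -> dict[str, str]:
    """Attach site and role to an IP address using the longest matching prefix."""
    ip = ipaddress.ip_address(address)
    best_network = None
    best_context = None
    for prefix, context in PREFIX_CONTEXT.items():
        network = ipaddress.ip_network(prefix)
        if ip in network and (best_network is None or network.prefixlen > best_network.prefixlen):
            best_network, best_context = network, context
    if best_context is None:
        # No context available: the address stays "just an IP".
        return {"ip": address, "site": "unknown", "role": "unknown"}
    return {"ip": address, **best_context}
```

Here `enrich_ip("10.10.20.5")` picks the more specific /24 and is labeled as a distribution address in `dc1`, `enrich_ip("10.10.1.1")` falls back to the /16 management entry, and an address outside every prefix comes back as unknown. Those labels are what let an alerting rule treat a management IP differently from an IoT device.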

David Flores 00:23:00 Actually, that’s exactly it. What I think, especially on the vendor side of things, it has been my experience as well. They provide better insight on their stuff for sure. They might be able to even give you pretty much detail of what is happening. But when you have a multi-vendor network environment and not only that, you need to also think about how the value of the network monitoring aspect of things affects also the infrastructure like your system, your nodes, your storage, your cloud infrastructure. So there’s also value of choosing the toolset that also correlates that you can actually do correlation with other infrastructure elements, okay. That can provide an extra level of value to your organization as a whole, okay. Because at the end of the day is really important to keep some of these key aspects of availability, not only from a network perspective, but also a whole infrastructure vertical of our business.

Ethan Banks 00:23:53 A quick sponsor break to hear about torero. Torero is a free tool from the network automation and orchestration folks at Itential. What does torero do? Torero builds an execution environment around your Python scripts, Ansible playbooks, or OpenTofu plans so that you can run them as a service. This means that you don’t have to worry whether or not your peer will be able to run your automation. You’re getting rid of the “it works on my laptop, but nowhere else” problem, because once that automation exists as a service, all the dependencies have been handled. So I worked through the torero Hello World demo and it was straightforward. I installed torero on an Ubuntu box (Linux and Mac are both supported, including Apple Silicon) and connected torero to a public GitHub repo that I created for the test. In that repo was a simple Python script and a requirements.txt. I told torero to create a service around that Python script, and torero read the requirements file and built the service. Now the script is executable via torero run. Torero can be run on a single machine your team shares in local mode, or in a server mode that dispatches automation jobs to torero clients. Build your automation, turn it into a torero service and share it with your team. Download free torero at torero.dev. That’s t-o-r-e-r-o.dev.

Christian Adell 00:25:12 Actually, David, not only infrastructure. The beauty of collapsing onto the same tooling that you use for your applications is that it makes you capable of converging these network alerts with an application problem, because they are in the same place and they can be correlated in a very easy way.

Josh VanDeraa 00:25:29 That could be a whole nother piece, but just real quick, to complete the other thought around transformation of data: even within a single vendor, the collection method will give you different keys. And with different keys, basically it’s how do you index it. Right. So when you take a look at an SNMP type thing, the main key will be ifHCInOctets or something like that. And when you look at gNMI it’s going to be something a little bit shorter. And so that transformation layer is another big part of what we’ve written in this book: to be able to normalize that SNMP, gNMI, and then across multiple vendors into a single metric, which then brings up other areas as well, that I think David and Christian can talk to a little bit better.
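The normalization Josh describes can be sketched in plain Python. In the book this work is done in the collection layer, so the code below only illustrates the idea and is not Telegraf configuration. `ifHCInOctets` and `ifHCOutOctets` are the IF-MIB counter names; the gNMI leaf names `in-octets` and `out-octets`, the normalized names, and the `NAME_MAP` and `normalize` helpers are assumptions made for this example.

```python
# Hypothetical mapping: (collection method, native key) -> normalized metric name.
NAME_MAP = {
    ("snmp", "ifHCInOctets"): "interface_in_octets",
    ("snmp", "ifHCOutOctets"): "interface_out_octets",
    ("gnmi", "in-octets"): "interface_in_octets",
    ("gnmi", "out-octets"): "interface_out_octets",
}


def normalize(source: str, device: str, interface: str, fields: dict) -> dict:
    """Rename native keys to shared names and coerce counter values to int."""
    record = {"device": device, "interface": interface}
    for native_key, raw_value in fields.items():
        name = NAME_MAP.get((source, native_key))
        if name is None:
            continue  # drop keys that have no normalized equivalent
        record[name] = int(raw_value)  # sources may report counters as str or int
    return record


from_snmp = normalize(
    "snmp", "rtr1", "Ethernet1",
    {"ifHCInOctets": "1500", "ifHCOutOctets": 900, "ifDescr": "uplink"},
)
from_gnmi = normalize(
    "gnmi", "rtr1", "Ethernet1",
    {"in-octets": 1500, "out-octets": "900"},
)
```

Both calls produce the same record, with the same keys and the same integer types, so a dashboard or alert written against `interface_in_octets` does not need to know which collection method, or which vendor, produced the sample.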

David Flores 00:26:15 The normalization and enrichment part is definitely one of the most important and also, I would say, hardest to achieve, because normalizing is not an easy feat. And we’ll be having SNMP for a while now, and we have vendor MIBs. So a lot of people go to the vendor MIB, which has its own structure and its own components. We were hoping that OpenConfig YANG also provided kind of like a standard, and I think it’s going in the right direction. But you also need to complement it with vendor-based YANG models in order to get the data that you need. So the normalizing is still a thing, right? It has been for collecting data from the devices and also for monitoring purposes. So one aspect that I think is pretty important is the normalization. Because when you are in a multi-vendor environment, you definitely need to normalize in order to treat the same signals the same way. Okay. Because they’re saying the same thing in different formats. That’s the key around normalization.

Eric Chou 00:27:10 Let me double click on that a little bit, because not everybody is familiar with the tools that you’re talking about. So if I double click on that normalization part, can you describe a little bit where that normalization happens? Is it at the collection stage or is it at the aggregation stage? Does it have its own little toolsets, or is it happening in the data store, and so on? I know it can happen in multiple stages, but in your architecture, can you explain a little bit about where that happens and maybe give an example?

David Flores 00:27:37 So in the book it happens at the collection stage, because we’re focusing on Telegraf. As you correctly said, you can actually have this in different stages. I mean, you can have it at an aggregation layer, but right now we focus on the collection stage. We use Telegraf for that. Okay. Telegraf is a great tool for ETL operations: extract, transform, load. And transform is the big part of this. We are able to grab the data and manipulate it in a way that we can normalize it. Okay. So the key aspect is that we configure Telegraf to collect a set of metrics via SNMP and to collect another set, the same metrics, via gNMI.

David Flores 00:28:16 And with processors inside Telegraf, we’re able to create the same data structure with the same types and the same kind of information. So that way we’re normalizing the metric even when it comes from different technologies. And the same happens with different network devices. But yeah, you can definitely do it in different layers for sure.
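A minimal sketch of the normalization idea described above, written in plain Python rather than as Telegraf configuration. The SNMP object names are standard IF-MIB counters and the gNMI paths follow the OpenConfig interfaces model; the normalized metric names, the record shape, and the `normalize` helper are assumptions made for illustration, not the book's actual pipeline.

```python
# Hypothetical mapping from source-specific keys to one normalized metric name.
NORMALIZED_NAMES = {
    ("snmp", "ifHCInOctets"): "interface_in_octets",
    ("snmp", "ifHCOutOctets"): "interface_out_octets",
    ("gnmi", "/interfaces/interface/state/counters/in-octets"): "interface_in_octets",
    ("gnmi", "/interfaces/interface/state/counters/out-octets"): "interface_out_octets",
}


def normalize(source, key, device, interface, value):
    """Return a record with the same shape and types regardless of the source."""
    name = NORMALIZED_NAMES.get((source, key))
    if name is None:
        raise KeyError(f"no normalized name for {source}:{key}")
    return {
        "name": name,
        "tags": {"device": device, "interface": interface},
        "value": int(value),  # collectors may hand over strings or integers
    }


snmp_sample = normalize("snmp", "ifHCInOctets", "rtr1", "Ethernet1", "1024")
gnmi_sample = normalize(
    "gnmi", "/interfaces/interface/state/counters/in-octets", "rtr1", "Ethernet1", 1024
)
```

Both samples end up as the same record, which is what lets dashboards and alerts treat the two collection methods as one signal.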

Eric Chou 00:28:35 Maybe Christian, you could jump in a little bit and give us an enrichment example. In my mind I think about an IP address and querying maybe the MaxMind DB to get geolocation data from the IP address or whatnot. Can you give us another example of the, you know, normalization part?

Christian Adell 00:28:51 So normalization is more connected to what Josh said. You want to get a metric where it doesn’t matter if it’s coming from vendor A, vendor B, or vendor C, right, or even from a different provider. So imagine that you want to correlate your traffic coming from an AWS VPC and from your VPN running on premises, a different variable. At the end of the day, they can be taking data from different perspectives.

Christian Adell 00:29:15 Maybe some data is taken by SNMP, by gNMI, or by an API. Or why not, sometimes you could even go into an SSH connection to get specific data from the CLI. That could happen. For all this data, the meaning is the same, and you want to be able to compare the data and make the same decisions on it. So this is what we mean by normalization: try to bring data that is the same under the same identifier, so it’s kind of the same data model. That’s the normalization part. On the enrichment, that’s connected with what Josh said. Once you get operational data from the network, you get basic data. You get maybe the device that the data came from and the interface. That’s all the context that you have. But you want to bring in extra context that is available to you as a human. In the old times you had that as tribal knowledge. Later you maybe had it in a document. And finally, what we want to do is get it into structured data, so that you can connect that reference to the data that you have.

Christian Adell 00:30:15 So what I mean is that if you know that the interface is Gigabit Ethernet one, that’s not giving you a lot of information. But if, in the data structure where you define the state of your network, this gigabit interface is a transit link, you will take some decision on this transit link, for example. This actually connects to something that is very important throughout the book. Through the different chapters we are trying to provide the best practices that we have been observing and implementing for a while. This means that sometimes more data is not the answer. What we want to do is look for actionable data that is normalized, that is enriched, so you can use that data to do something, because that’s usually the question: what do you want to do with this data? It’s what you said. It’s not only the data, it’s the business value that you take from that data. And we think that with this enrichment, with this context that you add to the data, is where you can start taking the decisions that you want to take.
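To make the interface example concrete, here is a small sketch of enrichment: a metric that only knows its device and interface gets extra tags looked up from structured intended-state data. The `SOURCE_OF_TRUTH` dictionary, the tag names `role` and `site`, and the `enrich` helper are hypothetical stand-ins for whatever source of truth is in use.

```python
# Hypothetical source-of-truth data: (device, interface) -> intended context.
SOURCE_OF_TRUTH = {
    ("rtr1", "GigabitEthernet1"): {"role": "transit", "site": "pop-a"},
    ("rtr1", "GigabitEthernet2"): {"role": "access", "site": "pop-a"},
}


def enrich(metric, sot=SOURCE_OF_TRUTH):
    """Return a copy of the metric with context tags added from the source of truth."""
    key = (metric["tags"]["device"], metric["tags"]["interface"])
    context = sot.get(key, {})  # unknown interfaces pass through unchanged
    return {**metric, "tags": {**metric["tags"], **context}}


raw = {
    "name": "interface_in_octets",
    "tags": {"device": "rtr1", "interface": "GigabitEthernet1"},
    "value": 1024,
}
enriched = enrich(raw)
```

After enrichment the metric carries `role="transit"`, so an alert rule or automation can select on the link's purpose instead of on its name.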

Christian Adell 00:31:09 For example, in a self-healing network, maybe you can put logic into the system so that when a transit link has some problems, you automatically apply some BGP route map configuration that will shift the traffic from that link to the other links. Which other links? If you have the context that this is a transit link, that it belongs to this site, this POP, then the traffic will go to the other POPs. You need all this context to take the decisions. And these things can be codified in a manual way. Or, and we can open this up for the next topic, we also bring some ideas about how you can leverage artificial intelligence and machine learning to empower these things.
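The manually codified version of that logic could look roughly like the sketch below. It only produces a recommendation string and pushes nothing to the network. The `Alert` fields, the `recommend` function, and the shape of `healthy_transit_by_site` are all assumptions for illustration; the actual BGP route map change is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    device: str
    interface: str
    role: str     # enrichment tag, e.g. "transit"
    site: str     # enrichment tag, e.g. "pop-a"
    problem: str  # e.g. "high_error_rate"


def recommend(alert, healthy_transit_by_site):
    """Return a human-readable recommendation; nothing is changed on any device."""
    if alert.role != "transit":
        return f"no automated action for {alert.role} link {alert.device}:{alert.interface}"
    # Only transit links at other sites are candidates to take the traffic.
    alternatives = [
        link
        for site, links in sorted(healthy_transit_by_site.items())
        if site != alert.site
        for link in links
    ]
    if not alternatives:
        return f"escalate: no healthy transit outside {alert.site}"
    return (
        f"drain {alert.device}:{alert.interface} ({alert.problem}); "
        f"shift traffic to {', '.join(alternatives)}"
    )
```

Without the `role` and `site` tags the function has nothing to decide on, which is the point being made about context.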

Eric Chou 00:31:52 Oh my God, I was waiting for that. When are we going to say artificial intelligence?

Christian Adell 00:31:57 No, it’s a buzzword, but it’s something that you have to understand. And that is what we are trying to do in this book, in a small chapter. Just to align expectations, this is not a book about that topic.

Christian Adell 00:32:08 The book just tries to ground you in a basic understanding of what these things are doing, so you can infer what the tools that the vendors are trying to sell you are actually doing. So what can this technology bring you? Maybe it can provide you some forecasting capabilities that you can use for operational validation. Or maybe you can leverage large language models to provide an enriched root cause analysis that takes data not only from the event, but also from the rest of your observability system, and produces an educated direction for troubleshooting. Again, it’s not super deep on the topic, but it’s good enough for you to understand how these things can support you.

Eric Chou 00:32:52 I think Javier Antich said in his book that what AI or machine learning does best is help you make decisions, and I like what you said about actionable data, right? Like, you know, the data itself, great, but data itself is not gold. Like I said, it’s the insights that it provides. And one step further is the action that you could take based on that insight. And then if you integrate some kind of intelligence into it, then you have this, you know, as you mentioned, self-healing network. But I don’t know. I don’t think we’re there. I mean, it’s a good goal to have, but I haven’t been in an environment where people feel comfortable about just, you know, draining your BGP uplink.

Christian Adell 00:33:32 I was doing this from 2018 to 2020 at my former employer. The question here is, what is this self-healing covering? It’s not covering everything. It’s not covering all the use cases. You just try to focus on the important use cases that you need to keep your SLA, your availability, at the highest level. So you focus on the problem that happens most often and has the highest impact, and then you try to solve this problem. And this connects to actually an important point. And the point is that as you start getting more data.

Christian Adell 00:34:04 You can get into this rabbit hole where you get a lot of data and you get scalability problems. In the end it is a cost problem that you can outsource to someone else. But the problem is that having more data is not the solution, and enriching the data with more metadata and labels is not always the solution. It’s going to bring complexity. What you have to understand is what the right data and metadata are that you need on your metrics and on your logs in order to take the actions that you have to take in your environment. So what I’m trying to say is that, in the same example of enriching the interface, understanding the interface role, like transit, is good data for me to take a decision on BGP, for example. Maybe I don’t care about the speed of the link, because I know that all my links are homogeneous and there are no differences, so the speed will not be interesting metadata to add to this metric. So both things have to be connected: what do you want to do with the data, and then what information you have to put into the data.

David Flores 00:35:02 And to add to that point, the book talks a little bit about the pros and cons and tips, because the performance of the technology behind it really depends on what kind of data you actually output. Okay. So for example, when we talk about time series databases, particularly ones like Prometheus, the more labels, the more enrichment that you add, the more the dimensions of that metric grow. And the more dimensions, the more the cardinality can grow. We talk about dimensionality and cardinality in the book. But the one thing that I want to express here is that the higher those numbers are, the less performant the databases are. Okay. And it’s really important, because with almost any platform, when you see low performance or a bad user experience, adoption goes down. So it’s really important, and it’s something that Josh said at the beginning, that you don’t look at the tool, you just do a ping in order to see that.

David Flores 00:35:58 It’s really important that these tools create trust for the users, right? I think that’s one of the few things that is really important. And you might not need to be an expert in order to have these high level tenets or mission objectives when you’re creating or working with these platforms, because it’s more important that you know this and create an action upon those statements like, okay, I’m not going to add too much enrichment, only the necessary enrichment that I need to add. I don’t need to create, I don’t need to put in all these metrics, because the ones that we really care about are these ones, you know, and take it from there.
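The cardinality point can be illustrated with simple arithmetic. The sketch below computes a worst-case upper bound on the number of time series for one metric, assuming every label varies independently of the others. A label whose value is fully determined by existing labels (for example, a fixed role per interface) adds a dimension without adding series, so real numbers are often lower. The device, interface, and queue counts are made up.

```python
from math import prod


def series_upper_bound(label_values):
    """Worst-case number of series: product of the distinct values per label."""
    return prod(len(values) for values in label_values.values())


base_labels = {
    "device": [f"rtr{i}" for i in range(100)],
    "interface": [f"Ethernet{i}" for i in range(48)],
}
# Adding one independently varying label multiplies the bound by its value count.
with_queue = {**base_labels, "queue": [str(q) for q in range(8)]}

base_series = series_upper_bound(base_labels)
queue_series = series_upper_bound(with_queue)
```

Here one extra eight-valued label takes the bound from 4,800 to 38,400 series for a single metric, which is why only the enrichment that drives a decision is worth adding.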

Eric Chou 00:36:32 Yeah, I think that’s a balance between how much data you want and how much overhead you want to incur on this layer. Right. So at the end of the day we’re driving that decision. But that decision rests on these data points, and how many of these data points do you need to feel confident enough to make this decision? It’s always a probability. So I know, Josh, you recently moved into a leadership role. Can you tell us a little bit about the relationship between observability and driving business outcomes?

Josh VanDeraa 00:37:01 What it comes down to is the reliability of data and being able to make actionable decisions from the data. So when I’m taking a look at it from a leadership perspective, it’s going to get into a little bit of what Christian was talking about with using AI. There’s a portion of the book where we talk about forecasting as well, projecting out what things are going to look like based on the historical side of things. And so it’s about being able to put real data to that and understand what the anomalies are, and not just, okay, there’s been one day of high traffic. You know, there might have been something going on, maybe there were some Windows updates going out that caused that to go up. So let’s exclude that from the data and be able to forecast out. And then it really gets into having the capability, and again that work-life balance, to have a system that, and I love the word David used here, you trust. Anything that I use, it’s all about trust. And gaining that trust, trusting that, hey, it’s going to alert you when something’s wrong, means you can sleep easy as long as you know that the pager is not going off. If the pager is going off every night and it’s not actionable, you’re not going to have a good work-life balance, and you’re going to have some turnover from a team member perspective.
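A rough sketch of the forecasting idea mentioned above: drop the one-off spike, then project the trend forward. This swaps in two simple techniques chosen for illustration, a median-absolute-deviation filter and a least-squares line, and is not the method used in the book. The traffic figures and the threshold of 3 are invented.

```python
from statistics import median


def drop_outliers(values, threshold=3.0):
    """Keep (index, value) pairs within threshold * MAD of the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(enumerate(values))
    return [(i, v) for i, v in enumerate(values) if abs(v - med) <= threshold * mad]


def linear_forecast(points, at_index):
    """Fit a least-squares line through (index, value) points and evaluate it."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y + slope * (at_index - mean_x)


# Daily peak traffic in Mbps; index 4 is a one-day spike (say, an update rollout).
daily_peak_mbps = [100, 102, 104, 106, 400, 110, 112]
clean = drop_outliers(daily_peak_mbps)
forecast_day_10 = linear_forecast(clean, 10)
```

With the spike removed, the remaining points follow a steady 2 Mbps per day trend, so the projection is not skewed upward by a single unusual day.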

Eric Chou 00:38:10 Yeah. As somebody who worked for Microsoft, we know about Patch Tuesday. You know, it’s not just the rest of the world, right? We do Patch Tuesday too, and we know the patch traffic spikes up. So I like what you mentioned about correlating it to your business cycle and to other data that may be personalized or individualized for your company, so that you build, as all three of you mentioned, trust into that data. And once you trust it enough, I don’t know, I’m not there yet. But Christian, maybe at some point we could do that auto-mitigation that you’re talking about, the self-healing network that we’ve been hearing so much about for the last, I don’t know, ten years.

David Flores 00:38:45 Yeah. One thing for sure that you can start doing without the full cycle of self-healing networks is automate. So there’s a chapter in the book around automation. I hope that readers actually enjoy putting it into practice, because we definitely enjoyed creating it. It’s about having the capability of querying your observability data and tying it up with your intended data, coming from a source of truth, and being able to use them together in order to create actions. For example: hey, you have devices reporting these kinds of issues, and I want to notify anyone who is actually working on those devices, in my ticket system or even my source of truth, and say, hey, this device is on alert or under maintenance, don’t change anything on it. For example, in the book we work with Nautobot. So if Nautobot is running other workflows or other jobs, it can actually, you know, stop and halt because these devices are under maintenance or actually alerted. So having that capability of using the data and then acting on it via automation, that’s great.
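One way to picture the "halt the job if the device is on alert" workflow is the sketch below. It builds a URL for the Prometheus instant-query endpoint (`/api/v1/query`) asking for the built-in `ALERTS` series in the firing state, parses a response body in the documented instant-query shape, and decides whether a change job should stop. The `device` label name, the helper names, and the way a job lists its devices are assumptions; no Nautobot API is called here, and the HTTP request itself is left out.

```python
from urllib.parse import urlencode


def alerts_query_url(prometheus_base_url):
    """URL for an instant query that returns currently firing alerts."""
    params = urlencode({"query": 'ALERTS{alertstate="firing"}'})
    return f"{prometheus_base_url.rstrip('/')}/api/v1/query?{params}"


def devices_on_alert(response, label="device"):
    """Extract device names from a parsed instant-query response body."""
    if response.get("status") != "success":
        return set()
    return {
        item["metric"][label]
        for item in response["data"]["result"]
        if label in item["metric"]  # skip alerts without the assumed label
    }


def should_halt(job_devices, alerted):
    """A change job halts if it touches any device that is currently on alert."""
    return bool(set(job_devices) & alerted)


sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"alertname": "HighErrorRate", "alertstate": "firing", "device": "rtr1"},
                "value": [1700000000, "1"],
            }
        ],
    },
}
alerted = devices_on_alert(sample_response)
```

A job touching `rtr1` would be halted, while a job limited to other devices would proceed.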

Josh VanDeraa 00:39:53 And really tying that together, down to the business side of things: self-healing networks aren’t going to be a matter of saying, go buy a button and we have self-healing networks. It’s going to be a lot about the process. Start with the process, with the tooling that you’ve got or the architecture, decide on that process, have it written down, and then automate that process from that point. So from a network automation perspective, it really comes down to first identifying how you see it. Is it from a metric or is it from a log? And then, what do you do when you see that as a human? Then write the automation behind it.

Christian Adell 00:40:31 Here, the point is that even though we can see this idea of self-healing networks as a very complex thing to reach, my experience tells me otherwise. I still remember when we implemented the first version of that, and as you can imagine, it was a read-only approach.

Christian Adell 00:40:48 So what we had is that the outcome of the self-healing automation process was a recommendation of what the automation would do for you. And then for a while we were just checking: okay, is what the automation is telling me to do what I would have to do? Is it totally matching my expectations? So after you run this for a while and you build trust that the automation is actually concluding the same thing that you as a human being would conclude from the same data, that is when you start moving into write operations, where you change the state. And the second thing that I want to highlight about the book, maybe it’s already implicit in our comments, but just to make it very clear for everyone: this is a very, very practical book. So I would say that 90% of the chapters have a big chunk of hands-on things to do. David mentioned how you can use automation via programmability to access APIs, to consume data from Prometheus and to query Loki, from different places.

Christian Adell 00:41:54 There are many use cases, but in other places we also explain how to set up all these different tools and how to tune them. And for metrics exploration, or network operational data exploration, we explain how to interact with these different types of data. So we try to make it strong in the foundations, to give you a clear reference for understanding the ideas and concepts, but also to provide hands-on experience to solidify the learning process.

Eric Chou 00:42:22 We could nerd out on this topic all day, I feel, but we are coming up on time, so I want to give each of you one last segment. If there’s something that we have not mentioned, feel free to jump in, or you can use that as an opportunity for a call to action. So why don’t we start with you, Josh. Do you have anything else to add, or is there any call to action for people?

Josh VanDeraa 00:42:40 My call to action is to explore and experiment with the tooling. It keeps getting better and better. My general take if I can’t do something in an hour with open source, I start to get frustrated and move on. But you know, components of the book build it up, explore it, take a look.

Eric Chou 00:42:55 All right, Christian, any call to action or anything else you want to add.

Christian Adell 00:42:58 So my recommendation for anyone taking on this book is that you should not limit yourself or your imagination. You have to be very open minded in order to think that everything is possible. And it’s up to you to understand the requirements that you need and work out what you can do to implement them. Without any limitation, try to be overoptimistic. Eventually you will hit some block, but you will do more than you actually expect.

Eric Chou 00:43:25 Nice one. David, for this whole book project, from what I could observe, David is the main guy behind the book writing process, the one who initiated the idea. So it’s fitting that David has the last call to action. Is there anything that you want to cover that we have not mentioned, or any call to action?

David Flores 00:43:43 Basically what I would say is it’s been a pleasure actually writing the book. I mean, it’s a little bit stressful, but it’s also a pleasure, because putting a lot of the learnings and experiences on paper is gratifying, actually. And not only that, I kind of cheated because I already had two authors with me on it, so the reviews were pretty good. One of the reviewers is even a mentor of mine, which I appreciate a lot. So one thing that I also want to leave as a message for the community is to look at observability in your day-to-day operations as the next step for automation. We have been seeing a lot of automation in the configuration and building phases, and there’s a lot that you can do in day two operations, okay? From alerting to PSU replacements by data center technicians, you have automated jobs, and even AI as well, to help you with some of these operations. But what I would say is it’s definitely a rich environment.

David Flores 00:44:44 There’s a lot of rich data that you can actually use and it is good that there’s now a new trend, new tools coming out there that also are working on these day two operations. So as Josh said, test it, explore it. One of the goals of the book was to be practical enough for people to actually explore it, to use it, and to actually, you know, experiment a little bit and see what actually fits your environment and what doesn’t.

Eric Chou 00:45:07 Yeah, I don’t care what anybody else says, like people don’t read books anymore. As an author, there is always that chest-pumping feeling when you see that thing, the physical artifact that you’ve made, and it’s just great. So I roger that. I mean, I’m jealous of you, David, for having experienced that, right? Like, you know, for me it may be in the past, but I’m happy for you and for all of you guys that this book project came about and was realized. One quick bit, let’s go around. Where can people find you? Where can people follow you? Let’s start with Josh.

Josh VanDeraa 00:45:39 Probably the best place to find me is on LinkedIn, Josh-Vanderaa. It’s a unique last name, so there shouldn’t be too many of us. You can also find links to all my other socials on josh-v.com. I try to blog relatively frequently there.

Eric Chou 00:45:50 So yeah. What about you Christian? You’re a social butterfly, right?

Christian Adell 00:45:54 Not too much. I have a very limited amount of time available to be on social media, but you can just contact me. I will be more than happy for anyone to reach out to me via LinkedIn. Christian Adell is my name. Actually, it’s hard to pronounce. My surname is Catalan, so it’s Adell, but you can just reach out to me via LinkedIn. Awesome.

Eric Chou 00:46:16 David, what about you? Where can people follow you? Where can people find you?

David Flores 00:46:19 Usually over LinkedIn as well, David Flores. Yeah, I’m also on X, davidban77. I have a blog on Hashnode, davidban77, where you can also reach out, but mostly on LinkedIn.

Eric Chou 00:46:32 Yeah, I imagine those other social channels have been, like, growing weeds because you’re so busy writing the book and, you know, working full time, all of that. But it seems like LinkedIn is the best place for all of my guests nowadays, so we’ll obviously put those into the show notes, and if you so choose, you can follow them on Twitter, X, whatever you want to call it. Thanks again, guys, for being here. I really enjoyed our conversation. It’s great to see all of you again, even though I see some of you on a daily basis. So thank you, guys.

Josh VanDeraa 00:47:00 Thanks for having us.

David Flores 00:47:01 Thanks for having us.

Eric Chou 00:47:02 Thank you for listening to this episode of Network Automation Nerds. You awesome listeners, do you have any feedback for Network Automation Nerds, our guests today, or this episode? Send us some follow-ups at packetpushers.net/FU, where the FU stands for follow up, of course. We want to hear from you. Check out our website at packetpushers.net to find other network engineering focused podcasts, blogs, YouTube videos, and much more.

Eric Chou 00:47:26 Join our weekly reviews of Network Break, deep network dives on Heavy Networking and so much more. Last but not least, remember that too much network automation will never be enough.
