By Dan Newton, CDS
When Gartner predicted that “by 2025, 85% of infrastructure strategies will integrate on-premises, colocation, cloud, and edge delivery options, compared with 20% in 2020,” it was clear that a seismic shift in how we monitor and manage infrastructure was on the horizon. We’d already lived through multiple generational shifts within data centers: from monolithic to client-server; from LAN to WAN; from physical systems to virtualized infrastructures; and even to compute and storage modes that served temporary, containerized workloads. The increasingly distributed and diverse landscape would put even more pressure on IT teams to meet and exceed availability, performance, and security SLAs.
Systems monitoring has always been foundational to delivering on IT SLAs and, just as importantly, to reducing business impact by quickly identifying the symptoms, and ultimately the root cause, of application delivery disruptions. Whether it was basic heartbeat monitoring or advanced correlation of incidents, monitoring has always played a front-line role.
But monitoring itself has changed over the years. Fortunately gone, for the most part, are the days of highly intrusive agents that would consume memory, compute, and disk resources – and potentially degrade the health of the very systems they were monitoring. Our monitoring tools have also been shaped by the evolution and sophistication of the underlying platforms. Today, server monitoring tools have multiple layers of device inspection – from high-level events to lower BIOS-level alerts – that give systems administrators greater flexibility and granularity in how they manage a server farm. Network monitoring has progressed as well, particularly as workloads now regularly stretch across data center and cloud boundaries. Storage monitoring, which has always been a highly specialized domain, has expanded too. Fibre Channel and the management of specialized connectivity were always unique to the storage estate – but today, as more “open” storage protocols like file and object storage have become mainstream in data centers, more IP-based management options exist for administrators.
The other important evolution in the monitoring arena is the arrival of artificial intelligence (AI) and machine learning (ML). AI allows data to be processed in real time to detect anomalies against normal benchmarks, to identify and understand dependencies, and even to assist in root-cause analysis (RCA). It can help automate the processing of machine-generated events to focus on the “signal” rather than the noise. In today’s data center, with more and more devices generating alert and event data, the volume is too great for manual processing.
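To make “detecting anomalies against normal benchmarks” concrete, here is a minimal sketch in Python. It is not how any particular monitoring product works; it simply flags a sample when it sits more than a chosen number of standard deviations away from a trailing baseline. The function name `find_anomalies`, the window size, the threshold, and the latency figures are all illustrative assumptions.

```python
from statistics import mean, stdev


def find_anomalies(samples, baseline_size=10, threshold=3.0):
    """Return (index, value) pairs that deviate sharply from the trailing baseline.

    Hypothetical helper for illustration: the baseline is simply the previous
    `baseline_size` samples, and "sharply" means more than `threshold`
    standard deviations from the baseline mean.
    """
    anomalies = []
    for i in range(baseline_size, len(samples)):
        baseline = samples[i - baseline_size:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline gives no scale to measure against;
            # this sketch skips the sample rather than divide by zero.
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append((i, samples[i]))
    return anomalies


# Made-up response times in milliseconds, with one obvious spike.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 19, 21, 20, 20, 95, 21, 20]
print(find_anomalies(latency_ms))
```

In practice the hard part is the one the paragraph above points to: doing this across thousands of devices and correlating the results, which is where purpose-built tooling earns its keep.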
Machine learning, on the other hand, makes it possible to analyze historical data to predict the future. For example, is a performance slowdown emerging in a specific infrastructure pool? ML can learn from the patterns in the data and, ultimately, pinpoint issues before they become system critical. ML can also help when planning the introduction of a new workload – helping forecast resource requirements to avoid future bottlenecks and/or outages.
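As a deliberately simple stand-in for that kind of forecasting, the sketch below fits a straight line to historical utilization and estimates how long until a threshold is crossed. Real ML models account for seasonality and far more signals; the function names, the daily sampling interval, and the numbers here are assumptions made purely for illustration.

```python
def fit_trend(values):
    """Least-squares line through evenly spaced samples; returns (slope, intercept)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den
    return slope, y_mean - slope * x_mean


def days_until_threshold(values, threshold):
    """Estimate days until the fitted trend reaches `threshold`.

    Hypothetical helper: assumes one sample per day, and returns None when
    there is too little data or usage is flat or falling.
    """
    if len(values) < 2:
        return None
    slope, intercept = fit_trend(values)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    current_day = len(values) - 1
    return max(0.0, crossing_day - current_day)


# Made-up daily utilization (percent) for a storage pool.
pool_used_pct = [50, 52, 54, 56, 58]
print(days_until_threshold(pool_used_pct, threshold=90))
```

Even a rough estimate like this turns a future outage into a planning conversation, which is the point of forecasting resource requirements ahead of a new workload.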
Today, minimizing business disruption requires monitoring storage, network, server, and application processes; correlating the entire stack across platforms; and continuous learning and improvement, whether driven by AI or ML.
The human layer is also important and cannot be forgotten. Over time, models can be improved, but not every scenario can be modeled today. Experienced engineers who understand how applications and infrastructure interact continue to play a vital role in issue resolution.
Having worked with enterprise customers to deploy and optimize monitoring solutions, from ScienceLogic, Datadog, and LogicMonitor to PRTG and Nagios, I learned that while the ultimate objective was to know about an issue before the customer did, this goal was often unattainable. Enterprise customers have hundreds or thousands of end users accessing their applications 24x7. If there is an application issue, those users will often be contacting the helpdesk at the same moment the monitoring solution is detecting the problem.
Ultimately, these customers focused on outcomes and the minimization of business disruption. Quickly determining and resolving the root cause of an application issue relies on real-time access to monitoring data and logs, and on experienced engineers who can diagnose across numerous data center and hyperscaler platforms. This requires not only the deployment of a highly capable monitoring solution but also access to years of performance data.
Systems monitoring is like proactive healthcare. It requires work, but it has never been more important.
About The Author
Dan is the CEO of CDS. Previously, he was at HiveIO, a developer of intelligent virtualization technology, where he served as CEO. Prior to HiveIO, Dan was Chief Operating Officer for the managed services company Datapipe, supporting customers globally in over 20 data centers and across private and public cloud solutions. Datapipe was acquired by Rackspace, where Dan served as GM and SVP of Service Delivery. Dan also has held key leadership roles with Dell and Perot Systems.