Rolling out Proactive Monitoring using Datadog
A while ago at Yahoo!, while I was automating the release, deployment, and provisioning of MobStor, an internal cloud storage platform, a colleague on the operations team introduced me to YMon. YMon was a version of Nagios hardened for Yahoo!'s specific security requirements, and I was told at the time that it also scaled better.
That was my first encounter with a proper monitoring system. Yahoo! also had homegrown tools, such as Groucho, written in Perl, that were used for monitoring chores. The name of that tool alone hints at how painful monitoring was in those days!
I didn’t do much with YMon on that project, but in the next group I worked in, I was heavily involved in expanding the use of YMon to implement comprehensive monitoring. Later, I also managed an SRE team at Yahoo! that depended entirely on YMon to monitor and alert on critical issues.
Though YMon/Nagios was very useful, the limitations of first-generation monitoring tools were already apparent: a huge volume of spurious alerts that led to alert fatigue, no historical monitoring data, scalability issues, a minimal or unintuitive user interface, and so on. Still, they worked, and I remember running a few projects to mitigate some of these shortcomings and make life a little easier for the on-call support engineers.
Later, at other companies, I worked with more monitoring tools, including Zenoss, Cacti, and Zabbix. In the absence of a comprehensive monitoring application, I ended up writing a lot of scripts and plugins to extend existing products and roll out monitoring solutions that met the operational requirements.
Then, from 2013 through 2015, I got the opportunity to use Splunk, Apica, and Datadog, a sample of the next-generation monitoring products that finally started addressing the industry’s long-pending ask for something better than Nagios and similar products. With its path-breaking log aggregation and search features, Splunk changed the way we look at solving monitoring problems. Just as DevOps evolved as a result of the widespread adoption of delivering software as SaaS with backend infrastructure on public cloud platforms, the next-generation monitoring tools, both open source and licensed, came about to meet similar industry requirements.
The first-generation monitoring tools were compatible only with a host-centric, static, bare-metal infrastructure housed in your own data centers or colocation facilities. There is hardly anything host-centric or static in an elastic, service-oriented public cloud environment.
With dynamic infrastructure to deal with, the monitoring problems became far more complex. With the use of microservices as the building blocks of the runtime environments of software systems, the complexity only increased. While there is a flood of monitoring tools available in the market right now, only a few products can address this complexity gracefully and at scale. Datadog is one of those products, and it continues to add features and enhance existing ones to meet those rapidly shifting demands.
I have had the opportunity to roll out monitoring solutions using Datadog at multiple companies. Tapping into that experience, I recently wrote a book that covers how Datadog can be used to roll out proactive monitoring in a variety of scenarios, ranging from bare-metal, data-center-based environments to state-of-the-art Kubernetes clusters running distributed workloads, with many hybrid configurations in between.
In the following summary, let’s see how the book covers the usefulness of Datadog in rolling out proactive monitoring; the details are in the book of course.
Before writing this book, I had published a good deal about monitoring, including a well-read article on proactive monitoring on devops.com. In a nutshell, this book is all about rolling out proactive monitoring using Datadog.
The book starts with a comprehensive description of the monitoring terminology commonly used in the industry. It also explores the different types of monitoring, including infrastructure monitoring, platform monitoring, application monitoring, log aggregation and analytics, and last-mile monitoring, and looks at how popular monitoring tools cater to these categories in practice.
After the general treatment of monitoring topics in the first chapter, the book quickly moves on to Datadog-specific topics, such as installing the Datadog Agent in different scenarios and reviewing basic features such as the Datadog Dashboard, account management, and the use of metrics, events, and tags.
Infrastructure monitoring is core to any monitoring effort, and the book explains in detail how Datadog’s out-of-the-box support can be used for it. Monitors and alerts are also described in detail, with tutorials for setting them up in Datadog.
The basic concepts of monitoring and how they are implemented in Datadog are covered by the end of chapter 7, which is the halfway mark of the book. Advanced topics follow, and some of the important chapters cover these topics:
- Using Datadog REST API
- Working with monitoring standards
- Integrating with Datadog
- Monitoring containers
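As a rough illustration of the first topic in that list, here is a minimal Python sketch of what submitting a tagged custom metric over the REST API can look like. It is not taken from the book. It assumes the v1 metrics submission endpoint on the default US site and an API key passed in the `DD-API-KEY` header; the metric name, tags, and helper function names are made up for the example, and the request is only built, not sent.

```python
import json
import time
import urllib.request

# Assumption: v1 metrics endpoint on the default US site; other sites use other hosts.
DATADOG_SERIES_URL = "https://api.datadoghq.com/api/v1/series"


def build_gauge_payload(metric, value, tags, timestamp=None):
    """Build the JSON body for a single gauge data point (hypothetical helper)."""
    ts = int(timestamp if timestamp is not None else time.time())
    return {
        "series": [
            {
                "metric": metric,
                "type": "gauge",
                "points": [[ts, value]],
                "tags": list(tags),
            }
        ]
    }


def build_request(payload, api_key):
    """Wrap the payload in an HTTP POST request; nothing is sent here."""
    return urllib.request.Request(
        DATADOG_SERIES_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )


# Hypothetical metric name and tags, with a fixed timestamp for repeatability.
payload = build_gauge_payload(
    "demo.queue.depth", 42, ["env:dev", "service:demo"], timestamp=1700000000
)
request = build_request(payload, api_key="<your-api-key>")
```

With a real API key in place, passing `request` to `urllib.request.urlopen` would actually submit the data point. In practice, the official client libraries are usually a more convenient route than hand-built requests.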
The last chapter provides an overview of some of the advanced features that Datadog has released recently, such as security monitoring, APM, observability features, and synthetic monitoring.
The book, Datadog Cloud Monitoring Quick Start Guide, is published by Packt Publishing. It is available in both print and ebook formats and is carried by major online book retailers, including Amazon and Barnes & Noble.