As data centers grow and staff is reduced, the need for efficient tools to monitor resources is greater than ever. The term monitoring, when applied to data centers, can be confusing because it means different things depending on who is speaking and who is listening. For example:
- The person running applications on the cluster thinks, "When will my job run? When will it finish? And how is it behaving compared with the last run?"
- The operator in the network operations center thinks, "When will we see a red light that means something needs fixing and that we need to call technical support?"
- The person in the systems engineering group thinks, "How are our machines doing? Are all services working properly? What trends do we see? And how can we make the best use of our IT resources?"
One way or another, we end up digging through terabytes of code to find something that monitors exactly what we want. On top of that, there are hundreds of products and services to choose from. Fortunately, many monitoring tools are open source; in fact, many of them do a better job than some of the commercial applications that try to achieve the same goal.
The hardest part of using open source monitoring tools is installing and configuring them so that they work for our environment. The two main problems we usually face are:
- No tool monitors exactly what we want in the way we want. Why? Because each user defines monitoring differently.
- Because of the previous problem, we may need a lot of customization work to get exactly what we want. Why? Because every environment, no matter how standard we think it is, is unique.
Incidentally, the same two problems apply to commercial monitoring tools.
This is where our couple, Ganglia and Nagios, two tools for monitoring data centers, comes on stage. Both are widely used in high-performance computing (HPC), but they have qualities that make them attractive in other environments as well. Each has taken a different position on the definition of monitoring: Ganglia is more concerned with gathering metrics and tracking them over time, while Nagios focuses on being an alerting mechanism.
As the two projects have evolved separately, their functionality has come to overlap in places. For example:
- Ganglia used to require an agent running on each host in order to gather information from it, but now metrics for almost anything can be fed into Ganglia through its spoofing mechanism.
- Nagios used to only poll its target hosts for information, but it now has plug-ins that run agents on the target hosts.
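As a rough illustration of the spoofing idea, the sketch below builds (but does not run) a `gmetric` command line that reports a value on behalf of another device. The helper `build_gmetric_command`, the metric name, the address, and the host name are all hypothetical, and the long option names are an assumption that should be checked against `gmetric --help` on your installation.

```python
import shlex


def build_gmetric_command(name, value, metric_type, units, spoof_ip, spoof_host):
    """Build an argument list for gmetric that reports a metric for another host.

    Assumption: gmetric accepts --name, --value, --type, --units and
    --spoof "IP:hostname"; verify against your local gmetric version.
    """
    return [
        "gmetric",
        "--name", name,
        "--value", str(value),
        "--type", metric_type,
        "--units", units,
        # The spoof argument names the device the metric should be filed under.
        "--spoof", f"{spoof_ip}:{spoof_host}",
    ]


# Hypothetical example: a temperature read from a device that cannot run an agent.
cmd = build_gmetric_command("rack_temp", 23.5, "float", "Celsius",
                            "192.0.2.10", "pdu-rack7")
print(shlex.join(cmd))
```

The command is only printed here so the example has no side effects; on a host with Ganglia installed, the same list could be handed to `subprocess.run`.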
While the tools have converged in some functional areas, they still differ enough that we gain by using both together. Deploying them jointly fills the gaps in each product:
- Ganglia does not have a built-in notification system, while Nagios excels at this.
- Nagios does not seem to have scalable built-in agents on target hosts (though some may argue this point), whereas this was part of Ganglia's original design.
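To make the division of labor concrete, here is a minimal sketch of the alerting half: a function that takes a metric value (of the kind Ganglia collects) and turns it into a Nagios-style status. It relies on the standard Nagios plug-in exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name `evaluate_metric`, the thresholds, and the sample value are hypothetical, and how the value is actually fetched from Ganglia is deliberately left out.

```python
# Standard Nagios plug-in return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_metric(name, value, warn, crit):
    """Return (exit_code, message) for a metric where higher values are worse.

    Assumption: `value` was already obtained from Ganglia by some other
    means; None means the metric could not be read.
    """
    if value is None:
        return UNKNOWN, f"UNKNOWN - {name} not available"
    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    # Text after the pipe follows the Nagios performance-data layout:
    # label=value;warn;crit
    message = f"{STATUS_NAMES[code]} - {name}={value} | {name}={value};{warn};{crit}"
    return code, message


# Hypothetical sample: a one-minute load average checked against made-up thresholds.
code, message = evaluate_metric("load_one", 6.2, warn=4.0, crit=8.0)
print(code, message)
```

A real plug-in would print only the message and exit with `code` (for example via `sys.exit(code)`), since Nagios reads the status from the process exit code.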
There are other open source projects that do what our couple does, and some are even better in certain areas. Popular open source monitoring solutions include Cacti, Zenoss, Zabbix, Performance Co-Pilot (PCP), and Clumon (and I'm sure you have a favorite that is not on this list). Many (including Ganglia and some Nagios plug-ins) use RRDTool or Tobi Oetiker's MRTG (Multi Router Traffic Grapher) to store data and generate attractive graphs.
With so many open source solutions for monitoring a data center, it is surprising how many companies develop their own solutions and ignore the work that others have already done. If you have this need and do not know what to choose, our winning couple, Ganglia and Nagios, can be your solution.