Thursday, September 27, 2007

Adding host info to ganglia and gmond

GridWay was not reporting the OS name and OS version for mileva. Once you have confirmed that gmond is running and that MDS is actually being populated with the available information (an earlier post covers this), make sure that Ganglia provides all the information GridWay needs.
For example, on the problem host, telnet localhost 8649 gave the following output:
[METRIC NAME="sample-metric" VAL="Linux mileva-0-3.local 2.6.9-55.0.2.ELsmp #1 SMP Tue Jun 26 14:14:47 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
[METRIC NAME="sample-metric" VAL="Linux mileva-0-2.local 2.6.9-55.0.2.ELsmp #1 SMP Tue Jun 26 14:14:47 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

On a working host, the same command gave the following output:
[METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS="" TN="401" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/]
[METRIC NAME="sample-metric" VAL="Linux everest-0-3.local 2.4.21-20.ELsmp #1 SMP Sat Sep 18 18:28:16 PDT 2004 x86_64 x86_64 x86_64 GNU/Linux
[METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS="" TN="943" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/]
[METRIC NAME="sample-metric" VAL="Linux everest-0-13.local 2.4.21-20.ELsmp #1 SMP Sat Sep 18 18:28:16 PDT 2004 x86_64 x86_64 x86_64 GNU/Linux

The working host reports an os_name metric and the problem host does not, which means Ganglia is not providing the needed information. To include it, do the following (this was done on Ganglia v3.0.4).
Edit /etc/gmond.conf (the default location) and add the following text toward the bottom of the file, among the other similar entries:
metric {
name = "os_name"
value_threshold = 10.0
}

It is likely that os_release will need to be added in the same way as os_name.

The edited config file (/etc/gmond.conf) needs to be copied to all the nodes of the cluster before the change takes effect.

Then restart gmond (/etc/init.d/gmond restart) on all the nodes and wait for the GridWay info to be updated.
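
Instead of reading the telnet output by eye after the restart, the check can be scripted. The following is only a rough sketch in Python, not something from the original setup: it assumes gmond answers on localhost port 8649 with its usual XML dump (HOST elements containing METRIC elements), and it assumes os_name and os_release are the metrics GridWay is waiting for. The function names are made up for this example.

```python
import socket
import xml.etree.ElementTree as ET

# Assumption: these are the metrics GridWay needs for OS name and version.
REQUIRED_METRICS = ("os_name", "os_release")


def fetch_gmond_xml(host="localhost", port=8649, timeout=10.0):
    """Read the XML dump that gmond sends to any client that connects."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def missing_metrics(xml_data, required=REQUIRED_METRICS):
    """Map each HOST name to the sorted list of required metrics it lacks."""
    root = ET.fromstring(xml_data)
    report = {}
    for host in root.iter("HOST"):
        present = {metric.get("NAME") for metric in host.iter("METRIC")}
        missing = sorted(set(required) - present)
        if missing:
            report[host.get("NAME")] = missing
    return report
```

Calling missing_metrics(fetch_gmond_xml()) on any node should return an empty dictionary once every host reports both metrics; any host still listed has not picked up the new gmond.conf yet.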

Saturday, September 15, 2007

Populating gwhost in GridWay

The information shown by the gwhost command in GridWay comes from two major sources:
$GLOBUS_LOCATION/libexec/globus-scheduler-provider-[sge OR fork] and $GLOBUS_LOCATION/etc/globus_wsrf_mds_usefulrp/gluerp.xml. The gluerp.xml file is the one that supplies static resource information, which it obtains from a provider such as Ganglia (the gmond process) or Hawkeye. The information available from Ganglia can be checked by executing: telnet localhost 8649
gluerp.xml needs to have a line such as the following enabled in order for gwhost to be properly populated:
.../...
java org.globus.mds.usefulrp.glue.GangliaElementProducer
.../...
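
A quick way to confirm that the producer line is really enabled, rather than sitting inside an XML comment, is a small check like the one below. This is a sketch under assumptions: "enabled" is taken to mean "present outside of any comment", the helper names are invented for this example, and the path is built from the GLOBUS_LOCATION variable mentioned above.

```python
import os
import re

PRODUCER = "org.globus.mds.usefulrp.glue.GangliaElementProducer"


def gluerp_path():
    """Build the gluerp.xml path from $GLOBUS_LOCATION (must be set)."""
    return os.path.join(os.environ["GLOBUS_LOCATION"],
                        "etc", "globus_wsrf_mds_usefulrp", "gluerp.xml")


def ganglia_producer_enabled(xml_text, producer=PRODUCER):
    """Return True if the producer class name appears outside XML comments."""
    uncommented = re.sub(r"<!--.*?-->", "", xml_text, flags=re.DOTALL)
    return producer in uncommented
```

To use it, read the file at gluerp_path() and pass its text to ganglia_producer_enabled(). A False result suggests the line is missing or commented out, which would leave gwhost without the static host information.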

The globus-scheduler-provider-[sge OR fork] script provides mostly dynamic information used for scheduling.