Vizury

Implementing Dynamic Thresholds Using Bischeck

When you work with tens of publishers in this industry, there are multiple metrics to track per ad-exchange, such as QPS, bid rate, average bid value and hit rate, and they need to be tracked across all of the exchanges. What makes this challenging is that these metrics are highly dynamic, varying with the time of day and the day of the week. So, how did we handle these widely varying metrics?

Here’s an implementation guide to help you out with that. Say hello to Bischeck!

What is Bischeck?

Bischeck is a Nagios plugin that gives us the flexibility of using dynamic thresholds, based on historical data, for sending alerts. Let us look at an example. One of the ad-exchange metrics that we constantly monitor is publisher QPS. This varies widely across publishers and also with the time of day and the day of the week. For example, Baidu QPS varies from 7000 during non-peak hours to 70000 during peak hours, while FBX QPS varies from 2000 during non-peak hours to 5000 during peak hours.

[Figure: FBX_qps]

[Figure: Baidu_qps]

So, for Baidu, a QPS of 6000 would seem normal during non-peak hours but would indicate a problem during peak hours. For FBX, the same 6000 could indicate an unexpected surge in bid requests during non-peak hours but is expected during peak hours. While we can use Nagios time-based thresholds, scaling them across multiple publishers and multiple business metrics involves a lot of maintenance overhead. This is where Bischeck proved to be a boon. I will walk through an example of how to set up bischeck for monitoring metrics on an Ubuntu box.
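To make the 6000 QPS example concrete, here is a minimal Python sketch of why a single static number cannot work. It only uses the typical QPS figures quoted above; the 25% tolerance and the function names are assumptions made up for illustration and have nothing to do with how bischeck itself is implemented.

```python
# Typical QPS figures quoted in the article, keyed by (publisher, period).
EXPECTED_QPS = {
    ("Baidu", "non-peak"): 7000,
    ("Baidu", "peak"): 70000,
    ("FBX", "non-peak"): 2000,
    ("FBX", "peak"): 5000,
}

# Hypothetical tolerance: anything within 25% of the expected value is "normal".
TOLERANCE = 0.25


def deviation(publisher, period, observed):
    """Relative deviation of the observed QPS from the expected QPS."""
    expected = EXPECTED_QPS[(publisher, period)]
    return (observed - expected) / expected


def looks_normal(publisher, period, observed):
    """True if the observed QPS is within the tolerance band."""
    return abs(deviation(publisher, period, observed)) <= TOLERANCE


# The same observation, 6000 QPS, judged against each context.
for publisher, period in EXPECTED_QPS:
    verdict = "normal" if looks_normal(publisher, period, 6000) else "suspicious"
    print(publisher, period, verdict)
```

The same reading of 6000 comes out as normal for Baidu off-peak and FBX at peak, and as suspicious for the other two cases, which is why the threshold has to follow the expected value rather than stay fixed.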

Pre-requisites

1. Bischeck needs Java and uses Redis as its storage backend.

$ apt-get install default-jdk redis-server

2. Currently, bischeck uses Nagios for alerting. Future versions will make bischeck a standalone service, but the current version has to be integrated with Nagios. As bischeck reports through passive checks, we also install nsca and configure Nagios to check external commands.

$ apt-get install nagios3 nsca
$ sed -i 's/check_external_commands=0/check_external_commands=1/' /etc/nagios3/nagios.cfg
$ service nagios3 restart
$ service nsca restart
$ dpkg-reconfigure nsca # See /usr/share/doc/nsca/README.Debian

Installation and Basic Setup

Installation of Bischeck is pretty straightforward. The latest version as of writing this article is 1.1.1. A newer version, 2.0.0, with many more features [1] is in the pipeline. You will have to be logged in as root.

$ wget http://gforge.ingby.com/gf/download/frsrelease/132/505/bischeck-1.1.1.tar.gz
$ tar -xvzf bischeck-1.1.1.tar.gz
$ cd bischeck-1.1.1
$ chmod 755 install
$ ./install -u # Get usage
$ ./install -I /opt/bischeck

The bischeck init script runs “su nagios” to retrieve certain configuration parameters as the nagios user. However, if the nagios user's shell in /etc/passwd is not a real shell, su will not work. So, ensure that it is set to a proper shell such as /bin/bash before starting bischeck. Alternatively, if changing the nagios user's shell is a security concern, you can pass the option “--shell /bin/bash” to the su command wherever it is used in /etc/init.d/bischeckd (I haven't tried this though).

$ usermod --shell /bin/bash nagios

Also, bischeck submits the results to nsca. The default installation of nsca does not have any password, so we need to remove the password from the bischeck config file /opt/bischeck/etc/servers.xml.

$ sed -i 's/password<\/value>/password<\/value>/' /opt/bischeck/etc/servers.xml

Double check /opt/bischeck/etc/servers.xml to ensure the password is indeed removed, and edit the file by hand if it is not. Finally, start bischeck.

$ service bischeckd start
$ tail -f /var/tmp/bischeck.log

More details can be found in the Official Installation and Administration Guide [2].

Nagios Configuration

Let us create a nagios host and service to monitor. Create a file /etc/nagios3/conf.d/publisher-metrices.cfg with the following content:

# Dummy command used to alert if passive checks have not received data
# for a specific period of time
define command {
        command_name    check_bischeck
        command_line    /usr/lib/nagios/plugins/check_dummy 1 "Results were not reported"
}

# "FBX" is not a physical host but one of our publishers (ad-exchange).
# Business metrics may not be associated with a physical machine, but we
# would still want to group different business metrics based on some
# criteria. In our case we have multiple publishers, and each publisher
# has certain metrics associated with it (qps, error_rate, etc.). So, we have
# made the publisher name a virtual host.
define host {
        use                     generic-host
        host_name               FBX
        active_checks_enabled   0
}

# Bischeck works with passive checks. We have also added freshness check (See
# http://nagios.sourceforge.net/docs/3_0/freshness.html for more info)
define service {
        host_name               FBX
        service_description     QPS
        use                     generic-service
        active_checks_enabled   0
        passive_checks_enabled  1
        check_freshness         1
        freshness_threshold     900
        check_command           check_bischeck
}

After configuring nagios, restart it for changes to take effect.

$ service nagios3 restart
$ tail -f /var/log/nagios3/nagios.log

You can visit http://<server_ip>/nagios3/ to check that the “FBX” host and the “QPS” service have been added. The service state will turn to warning after a while, as it has not yet received any metrics.

Configuring Bischeck

First, let us create a script /usr/local/bin/FBX_qps.sh which generates dummy QPS numbers based on time of day. This will be used by bischeck.

#!/bin/bash
# Minutes elapsed since 00:00 UTC (0 to 1439)
current_minute=$(( ($(date +%s) % 86400) / 60 ))

# multiplier is 720 at 00:00 UTC, falls to zero at noon and climbs back towards 720 at midnight
if [ "$current_minute" -le 720 ]; then
    multiplier=$((720 - current_minute))
else
    multiplier=$((current_minute - 720))
fi

# qps varies between 5000 and 55000 depending on time of day
qps=$((5000 + (50000 * multiplier / 720)))
echo "OK | qps=$qps;0.0;0.0;;"

Make it executable and test it.

$ chmod +x /usr/local/bin/FBX_qps.sh
$ /usr/local/bin/FBX_qps.sh
OK | qps=20972;0.0;0.0;;
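If you want to sanity check the shape of the dummy curve without waiting for the clock, the same arithmetic can be sketched in Python. This is only an illustration; the function name qps_at is made up here and is not part of bischeck or of the script above.

```python
def qps_at(minute_of_day):
    """Dummy QPS for a given minute since 00:00 UTC (0 to 1439)."""
    # Distance in minutes from noon: 720 at midnight, 0 at noon.
    multiplier = abs(720 - minute_of_day)
    # Integer division mirrors the integer arithmetic of the shell script.
    return 5000 + (50000 * multiplier) // 720


# Print the dummy QPS at a few points of the day.
for label, minute in [("00:00", 0), ("08:10", 490), ("12:00", 720), ("23:59", 1439)]:
    print(label, qps_at(minute))
```

The value for 08:10 UTC is 20972, the same number as in the sample output above, and the curve runs from 55000 at midnight down to 5000 at noon.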

Next, we need to configure bischeck services. Create the file /opt/bischeck/etc/bischeck.xml with the following content (back up the existing bischeck.xml before overwriting it):

<?xml version='1.0' encoding='UTF-8'?>
<bischeck>

  <host>
    <name>FBX</name>
    <service><template>qpstemplate</template></service>
  </host>

  <servicetemplate templatename="qpstemplate">
    <name>QPS</name>
    <schedule>120S</schedule>
    <url>shell://localhost</url>

    <serviceitem>
      <template>publishermetrictemplate</template>
      <serviceitemoverride>
        <name>qps</name>
      </serviceitemoverride>
    </serviceitem>
  </servicetemplate>

  <serviceitemtemplate templatename="publishermetrictemplate">
    <name>$$SERVICEITEMNAME$$</name>
    <execstatement>{"check":"/usr/local/bin/$$HOSTNAME$$_$$SERVICEITEMNAME$$.sh","label":"$$SERVICEITEMNAME$$"}</execstatement>
    <thresholdclass>Twenty4HourThreshold</thresholdclass>
    <serviceitemclass>CheckCommandServiceItem</serviceitemclass>

    <cache>
      <aggregate>
        <method>avg</method>
        <useweekend>true</useweekend>
        <retention>
          <period>H</period>
          <offset>720</offset>
        </retention>
        <retention>
          <period>D</period>
          <offset>30</offset>
        </retention>
      </aggregate>

      <purge>
        <offset>30</offset>
        <period>D</period>
      </purge>
      <!-- Max count = 30 days * 24 hours per day * 30 items per hour -->
      <!--
      <purge>
        <maxcount>21600</maxcount>
      </purge>
      -->
    </cache>
  </serviceitemtemplate>

</bischeck>
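The retention numbers in the cache section follow from the 120S schedule. Here is a small Python sketch of that arithmetic, assuming one value is cached per scheduled run; the helper names are made up for illustration.

```python
def samples_per_hour(schedule_seconds):
    """Number of scheduled runs per hour for a given schedule interval."""
    return 3600 // schedule_seconds


def max_count(retention_days, schedule_seconds):
    """Raw values collected over the retention window."""
    return retention_days * 24 * samples_per_hour(schedule_seconds)


print(samples_per_hour(120))  # runs per hour with a 120S schedule
print(max_count(30, 120))     # raw values in 30 days
print(30 * 24)                # hourly aggregates in 30 days
```

With a 120 second schedule this gives 30 items per hour and 21600 items in 30 days, matching the commented-out maxcount, and 720 hourly aggregates, matching the H retention offset.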

Set the appropriate thresholds by configuring /opt/bischeck/etc/24thresholds.xml. Here is a sample configuration (be sure to back up 24thresholds.xml before overwriting it):

&lt;?xml version="1.0" encoding="UTF-8" standalone="yes"?&gt;
&lt;twenty4threshold&gt;
    &lt;!-- QPS --&gt;
    &lt;servicedefgroup&gt;
        &lt;member&gt;
            &lt;hostname&gt;FBX&lt;/hostname&gt;
            &lt;servicename&gt;QPS&lt;/servicename&gt;
            &lt;serviceitemname&gt;qps&lt;/serviceitemname&gt;
        &lt;/member&gt;

        &lt;template&gt;response-qps&lt;/template&gt;
    &lt;/servicedefgroup&gt;

    &lt;servicedeftemplate templatename="response-qps"&gt;
         &lt;period&gt;
             &lt;calcmethod&gt;&amp;gt;&lt;/calcmethod&gt;
             &lt;warning&gt;20&lt;/warning&gt;
             &lt;critical&gt;40&lt;/critical&gt;
             &lt;hoursIDREF&gt;10&lt;/hoursIDREF&gt;
         &lt;/period&gt;
    &lt;/servicedeftemplate&gt;

    &lt;hours hoursID="10"&gt;
        &lt;hourinterval&gt;
            &lt;from&gt;00:00&lt;/from&gt;
            &lt;to&gt;23:00&lt;/to&gt;
            &lt;!-- See http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_configuration_guide.html#toc-Chapter-4 --&gt;
            &lt;threshold&gt;avg($$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-24H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-96H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-168H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-336H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-504H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-672H])&lt;/threshold&gt;
        &lt;/hourinterval&gt;
    &lt;/hours&gt;
&lt;/twenty4threshold&gt;
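The threshold expression above averages the values seen 24, 96, 168, 336, 504 and 672 hours ago, that is 1 day, 4 days and 1 to 4 weeks back. The Python sketch below shows roughly how such a baseline could turn into alert states. It rests on an assumption: that with calcmethod ">" the measured value is expected to stay above the baseline, and that warning 20 and critical 40 mean 20% and 40% below it. Please check the configuration guide [3] for the exact semantics. The history values and function names are made up for illustration.

```python
# Offsets used in the threshold expression, in hours back from now.
OFFSETS_HOURS = [24, 96, 168, 336, 504, 672]


def baseline(history):
    """Average of the historical values, like avg(...) in the threshold."""
    return sum(history) / len(history)


def classify(measured, history, warning_pct=20, critical_pct=40):
    """Assumed semantics: alert when measured falls too far below the baseline."""
    expected = baseline(history)
    if measured < expected * (1 - critical_pct / 100):
        return "CRITICAL"
    if measured < expected * (1 - warning_pct / 100):
        return "WARNING"
    return "OK"


# Made-up QPS values, one per offset in OFFSETS_HOURS.
history = [20000, 21000, 19000, 20500, 19500, 20000]

for measured in (17000, 15000, 11000):
    print(measured, classify(measured, history))
```

With a baseline of 20000, the sketch treats anything below 16000 as a warning and anything below 12000 as critical. Because the baseline is recomputed from history on every run, the same percentages work at peak and non-peak hours.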

Finally, restart bischeck.

$ service bischeckd restart
$ tail -f /var/tmp/bischeck.log

If everything is fine, you should start seeing the QPS values in the Nagios web console.

Bischeck also provides integration with pnp4nagios [5]. Configuring this is beyond the scope of this article though.

Bischeck Command Line Utilities

Bischeck provides a couple of very useful command line tools for debugging [6].

The first is for threshold testing.

$ /opt/bischeck/bin/bischeck threshold.Twenty4HourThreshold -h FBX -s QPS -i qps -d 20150513 -H 08 -M 30 -m 10000

The second is the cache CLI, which lets you check whether values are getting populated properly and evaluate expressions:

$ /opt/bischeck/bin/bischeck cli.CacheCli
cachecli> avg(FBX-QPS-qps[0:4])
[1/1/2 ms] avg(10972,11111,11250,11388,11527) = 11249.6
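The CacheCli output can be cross-checked by hand. Here is a short Python sketch using the five cached values shown above; the variable names are made up for illustration.

```python
from statistics import mean

# The five cached values returned for FBX-QPS-qps[0:4].
latest = [10972, 11111, 11250, 11388, 11527]

# Average of the five values, as computed by avg(...) in the CLI.
print(mean(latest))

# Difference between consecutive cached values.
steps = [b - a for a, b in zip(latest, latest[1:])]
print(steps)
```

The average comes to 11249.6, as reported by the CLI. Consecutive values differ by 138 or 139, which is what the dummy script produces over two minutes (50000 * 2 / 720 is about 138.9), so the 120S schedule is doing its job.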

Closing Comments

As can be seen, bischeck is a pretty powerful plugin that provides a lot of functionality beyond just dynamic thresholds. Moreover, the developers provide excellent support and have been very responsive in incorporating feedback as well as fixing bugs. For any organization with highly dynamic business metrics, bischeck is definitely a must-have!

References

[1] http://www.bischeck.org/?p=946
[2] http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html
[3] http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_configuration_guide.html
[4] http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Section-4.4
[5] http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Section-3.3
[6] http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Section-4.3

About the author

Nishant is a marketer at Vizury Interactive, where he helps ecommerce business owners to increase revenue. Vizury helps ecommerce companies to increase sales with new channels like web push notifications.

nishant.gupta@vizury.com
