.TH knoerre 1 "March 2012" "ngtx"
.SH NAME
knoerre \- fast check tool and HTTP server for nagios remote checks
.SH SYNOPSIS
.B knoerre
[ key ]
.br
.SH DESCRIPTION
.B knoerre
is a tool for checking many different parameters of a server.
Its primary purpose is to serve check values to a (remote) requesting
instance like
.B nagios
by using simplified HTTP.
.br
It was developed as a replacement for the oversized, sometimes very buggy,
sometimes difficult to configure and often slow net-snmp package.
.br
knoerre uses (should use)
.B tcpserver
of DJB's software suite ucspi-tcp.
Only the brave among you will have the heart to do the daring deed of
using (x)inetd.
.br
The usage of DJB's daemontools and ucspi-tcp (for tcpserver) is strongly recommended.
.br
knoerre can be easily set up with
.B knoerre-conf(1).
.br
Access restrictions by IP address can be set up with
.B knoerre-update-tcprules(1).
.br
A
.B key
is a specific request to knoerre, e.g. "load1".
All keys can be used locally or by HTTP request, e.g.
knoerre load1,
knoerre diskusage/home
or
.B GET /load1 HTTP/1.1
as an HTTP request.
A key given on the command line takes precedence over reading an HTTP request
from stdin (by tcpserver). An HTTP request is internally limited to 512 bytes.
.br
Because keys can be given on the command line, knoerre can also be used with
other nagios remote check methods: called by
.B ssh,
.B NRPE
and the slow
.B snmpd.
Nevertheless the usage of
.B tcpserver
is strongly recommended. Using tcpserver and a request like
.I load1
you'll receive a response approx. 25% faster than a local
"/bin/cat /proc/loadavg". A local "knoerre load1" is 4 times faster
than "/bin/cat".
.br
Here's a short speed comparison, 5000 remote requests of "load1":
.RS
net-snmp default, default nagios check_snmp:
.B 8 mins 50 secs
.br
NRPE:
.B 43 secs
.br
tcpserver/knoerre:
.B 3 secs
.RE
.br
.SS Process control
With the recommended usage of daemontools and ucspi-tcp you don't have to care
about starting, stopping or restarting knoerre.
Started on demand by
.B tcpserver(1)
there is no continuously running knoerre process as with other daemons.
The controlling tcpserver process can be managed with
.B svc(8).
.PP
.SS Built-In checks
Some basic checks are built into knoerre. These built-in checks don't need to
call an external program.
.PP
.TP
.B cachedvalue
Return cached value from a file
.br
.B Format:
cachedvalue/XXXXX/absolute/path/to/file
.br
where XXXXX is the maximum age of the file in minutes.
.br
Return the contents of the given file. The file should contain one line
beginning with "OK ", "WARNING " or "CRITICAL ", causing knoerre to exit with
the matching exit code.
.br
These conditions will also cause a critical exit code:
.br
lstat error, file's modtime is older than XXXXX minutes, not a regular file,
empty file, file too large, open error, read error
.TP
.B cat
Cat content of a file.
.br
.B Format:
cat/absolute/path/to/file
.br
"Cat" the content of the file given after "cat/". The first line contains the
filename and also the date of the file (if no error occurred). The last line of
the file should contain an integer value to be checked by nagios. You can also
use this check to test whether an NFS-mounted filesystem is actually working by
"cat"ting a file which should contain just "1" in a line.
But to prevent blocking knoerre processes you should rather use the
.B nfs
check. If an error or timeout happens then 9999 or a bigger value is returned.
.TP
.B cmdline
Return the number of instances of a process by cmdline match.
.br
.B Format:
cmdline/XXXX
.br
where XXXX is a string which should be part of the cmdline.
.br
Like
.B process
but uses /proc/.../cmdline to also detect script processes, e.g.
.I "python loadlogger.py"
whose process name is only "python".
.TP
.B cmp
Compare a string to the content of a file.
.br
.B Format:
cmp/string/absolute/path/to/file
.br
Compare a string to the content of a file. If the string is equal to the
content (LF is ignored) then 0 is returned, otherwise 1.
If an error or timeout happens then 9999 or a bigger value is returned.
.TP
.B cpu
Show CPU usage in percent values.
.br
.B Format:
cpuXY/SECONDS
.br
where X is one of (u|n|s|i|w|I), Y is one of (t|c) and the optional SECONDS is
the measuring interval.
.br
The CPU usage times can be shown as 't'otal since kernel start or as 'c'urrent
values over a measuring interval (default 10 seconds).
The CPU times are 'u'ser, 'n'ice, 's'ystem, 'i'dle or I/O 'w'ait. The 'I'
values are "inverted" against 100 percent, e.g. 99 is printed for an idle value
of 1%.
If you need an immediate response with an up-to-date measured value you should
use
.B knoerred
which has a special measuring thread.
.TP
.B ctxtswitch
Return context switches per second.
.br
.B Format:
ctxtswitch
.br
.B Format:
ctxtswitch/SECONDS
.br
Count the context switches per second. If no seconds are given a default of 10
is used.
.TP
.B direntries
Return the number of entries recursively in a directory.
.br
.B Format:
direntries/absolute/path/to/dir
.br
It counts entries in a directory, not inodes.
This check is equal to "direntries" in
.I recursive
mode. See
.B direntries(1).
.TP
.B dirlevels
Return the maximum recursion level
.br
.B Format:
dirlevels/absolute/path/to/dir
.br
Step recursively into dir, count the recursion level and print the max count.
One "@" can be used as wildcard (like asterisk in a shell).
.TP
.B diskinodes
Return used disk inodes percentage.
.br
.B Format:
diskinodes/absolute/path/to/fs
.br
Like
.B diskusage
but for inodes and not disk space.
.TP
.B diskusage
Return used disk space percentage.
.br
.B Format:
diskusage/absolute/path/to/fs
.br
Return the amount of used space on the filesystem given after "diskusage/".
NOTE: Because just one simple stat() call is used, you can also use this check
for testing the existence of files, e.g. "/var/lib/mysql/mysql.sock".
See
.B nagios-check-diskfree(1).
.TP
.B diskusagelocal
Return highest used local disk space percentage.
.br
.B Format:
diskusagelocal
.br
Get stats of mounted local (e.g.
ext3) filesystems, print the two fullest filesystems and in the last line the
highest fill rate in percent.
.TP
.B dmesg
Kernel errors
.br
.B Format:
dmesg
.br
Search the kernel message ring buffer for "bad" lines, kernel errors and
warnings. Like
.B kernellog
the search results are weighted, e.g. "Hardware Error" gets 2000 points while
a "segmentation fault" gets 1 point. See also
.B kernellog.
.TP
.B fileexists
Return whether file exists.
.br
.B Format:
fileexists/X/absolute/path/to/file
.br
with X one of [fdcbplsaFDCBPLSA] for f(ile), d(ir), c(har dev), b(lock dev),
p(ipe), l(ink), s(ocket) or a(ny type).
.br
If the file exists and matches the type then 0 is returned, otherwise 1.
An upper case letter for the file type inverts the test.
If the file is a small regular file then its content is also printed before the
last line.
.TP
.B filesizes
Return (max) filesize(s) in KB.
.br
.B Format:
filesizes/absolute/path/to/file
.br
Get the filesize in KB of a single file or the maximum filesize of a group of
files.
You can use
.B one
dot or '@' as one wildcard (like asterisk in a shell). See
.B Examples.
.TP
.B filesizesbypattern
Return (max) filesize(s) by given filename pattern.
.br
.B Format:
filesizesbypattern/XXXXX/Y/absolute/path/to/file-or-dir
.br
.B Format:
filesizesbypatternmaxage/XXXXX/Y/ZZ/absolute/path/to/file-or-dir
.br
where XXXXX is a filename pattern, e.g. log, the digit Y is the recursive
search depth and the number ZZ is the max age (modtime) in days from 1 to 99.
.br
Get the filesize in KB of a single file or the maximum filesize of a group of
files by a given filename pattern and a maximum depth to search in.
You can use
.B one
dot or '@' as one wildcard (like asterisk in a shell).
.TP
.B filesizesbysuffix
Return (max) filesize(s) by given filename suffix.
.br
.B Format:
filesizesbysuffix/XXXXX/Y/absolute/path/to/file-or-dir
.br
where XXXXX is a filename suffix, e.g. .gif, and the digit Y is the recursive
search depth.
.br
Get the filesize in KB of a single file or the maximum filesize of a group of
files by a given filename suffix and a maximum depth to search in.
You can use
.B one
dot or '@' as one wildcard (like asterisk in a shell). See
.B Examples.
.TP
.B filetimestamp
Return age of file in minutes.
.br
.B Format:
filetimestamp/X/absolute/path/to/file
.br
with X one of [acmoACMO] using access, change or modification time or the
oldest of these.
.br
Upper case means that no error but just 0 is returned if the file does not
exist.
If the file is a small regular file then its content is also printed before the
last line.
.TP
.B kernellog
Count "bad lines" in kernellog.
.br
.B Format:
kernellog/XX/absolute/path/to/kernellog
.br
where XX is a two-digit number.
.br
Like
.B tslogentries
you can specify as first parameter the number of chars from the beginning of a
log line which must be equal to the beginning of the last line of kernellog.
If you use e.g.
.I kernellog/07/var/log/kernel
on Aug 29, then all lines starting with "Aug 29 " are scanned but not lines
with "Aug 28".
.br
"Bad entries" are hardcoded in the source and are strings like
"access beyond end of device", "ector repair", "kernel BUG" and more.
.br
Up to 10 "bad lines" of kernellog are returned in lines above the count return
value for nagios.
On very big files only the last part (default 1MB) is searched. See also
.B dmesg.
.TP
.B load1 load5 load15
Return load average per 1/5/15 minutes.
.br
Just return the requested load average value in the last line and all of
.I /proc/loadavg
in the line above.
.br
If knoerre was compiled with the gcc option
.I \-DOPENVZDEFAULT
then the load value will be divided by the number of CPU cores online as listed
in /proc/stat. Additionally the number of cores will be appended to the line
with loadavg data.
.\" .TP
.\" .B loadmulti
.\" Return load and many other values as one multicheck.
.\" .br
.\" .B Format: loadmulti/XXX/YYY
.\" .br
.\" where XXX is the time on the requesting host in seconds since epoch
.\" and YYY the hostname the local host should have.
.\" .br
.TP
.B loaduser
Return most processes per one account
.br
.B Format:
loaduser/XXX/YYY
.br
where XXX and YYY are the min/max uid of the processes to be checked.
.br
Return the highest number of running processes of a single account.
For every uid in the given range all processes are counted.
32-bit UIDs are also supported.
Up to 3 top users and their process counts are printed, and the value in the
last line is the max process count.
.TP
.B logcheckerr
Count lines with errors in a logfile
.br
.B Format:
logcheckerr/absolute/path/to/logfile
.br
Lines with "error" or "fail" are counted with a weight of 100 and "warning"
lines with a weight of 10.
Up to 10 "faulty" lines of the logfile are returned in lines above the count
return value for nagios.
On very big files only the last part (default 1MB) is searched.
.TP
.B longprocp
Return minutes of the longest running user process.
.br
.B Format:
longprocp/XXX/YYY[/A[/B[/C]]]
.br
where XXX and YYY are the min/max uid of the processes to be checked and the
optional A, B, ... are names of processes to be excluded from the check (up to
15).
.br
Check for long running processes. This check returns the time in minutes of the
longest running user process. Its goal is to detect suspicious processes like
PHP shells of hacked user accounts.
The only difference to
.B longprocs
is that min/max uid and process excludes are given by the HTTP request and are
not configured in /etc/knoerrerc. It's useful in cases when you want to build a
monolithic version of knoerre which does not read knoerrerc.
.TP
.B longprocs
Return minutes of the longest running user process.
.br
.B Format:
longprocs
.br
Check for long running processes. This check returns the time in minutes of the
longest running user process. Its goal is to detect suspicious processes like
PHP shells of hacked user accounts.
The values for min/max uid and optional exclude process names must be specified
in /etc/knoerrerc.
See
.B nagios-check-longuserprocesses(1).
.TP
.B mailqsize
Return postfix mailqueue size.
.br
.B Format:
mailqsize
.br
.B Format:
mailqsize/XXXXX
.br
Return the size of the mailqueue (active and deferred subdirs) on a postfix
server.
See
.B postfix-mailqsize(1).
With the second format you can specify up to 4 subdirs to check and an optional
mode character. Just use any combination of single chars like a(ctive),
d(eferred), m(aildrop) or i(ncoming). With 'M' as mode char for maximum count
you won't get the sum of all emails but the maximum count of one of the
specified dirs.
.TP
.B maxdirentries
Return the maximum number of entries recursively in directories.
.br
.B Format:
maxdirentries/X/absolute/path/to/dir
.br
where the digit X is the recursive search depth.
.br
This check is equal to "direntries" in
.I max
mode. See
.B direntries(1).
.TP
.B maxfilesizes
Return biggest file size recursively.
.br
.B Format:
maxfilesizes/X/absolute/path/to/dir
.br
.B Format:
maxfilesizessum/X/absolute/path/to/dir
.br
where the digit X is the recursive search depth.
.br
Find the biggest files and print paths and sizes in MB.
The return value is the size of the biggest file in MB or the sum of the sizes
of the scanned files.
.TP
.B mountopts
Check mountpoint and options
.br
.B Format:
mountopts/XXXXX/absolute/path/to/mountpoint
.br
where XXXXX is an option string which should match the beginning of the mount
options
.br
/proc/mounts is used for the actual mount options and mountpoint. If the given
option string matches the actual mount options then 0 will be returned,
otherwise 1.
If an error happens, e.g. a non-existent mountpoint or a timeout, then 9999 or
a bigger value is returned.
.TP
.B mounts
Check mounts of fstab
.br
.B Format:
mounts
.br
.B mounts
checks whether all entries of /etc/fstab are actually mounted and does a
statfs() to check whether a (nfs) mount is lost.
Return 1 if a mount is missing and return 2 if a mount is listed in
/proc/mounts but is actually lost.
Uses fork() like the key
.B nfs
to avoid blocking on lost mounts. See also
.B procmounts.
.TP
.B mysqlerr
Count errors in mysqld errlog
.br
.B Format:
mysqlerr/absolute/path/to/mysqld.err
.br
Like
.B kernellog
you must specify the absolute path to the MySQL daemon error logfile. Only
lines with a timestamp of the current day are examined.
Every "Note" counts once, "Warnings" count ten times and every "ERROR" has a
weight of 100.
.TP
.B netlinksdown
Count net interfaces without link
.br
.B Format:
netlinksdown
.br
Check all network interfaces for missing link (cable).
.TP
.B nettraf
Count network traffic
.br
.B Format:
nettraf/XXXX/SECONDS
.br
where XXXX is the device name and the optional SECONDS is the measuring
interval.
.br
Traffic data is read from /proc/net/dev. Units are KiB and KiB/s.
The line before the last shows the total traffic during the measuring interval
and the measuring interval itself.
If you need an immediate response with an up-to-date measured value you should
use
.B knoerred
which has a special measuring thread.
.TP
.B nfs
Check availability of an NFS-mounted filesystem.
.br
.B Format:
nfs/absolute/path/to/file
.br
Check the availability of an NFS-mounted filesystem. It does this by "cat"ting
the content of the file given after "nfs/", which should contain "1".
If this file does not exist, or NFS is not available and the timeout of
2 seconds expires, then a value bigger than 1 is returned.
For NFS this check should be preferred over
.B cat
because it forks a child which may block and is then killed afterwards.
See
.B nagios-check-nfs(1).
.TP
.B proccount
Number of all processes
.br
.B Format:
proccount
.br
.B Format:
proccounttg
.br
.B Format:
proccountovz
.br
"proccount" shows the count of all processes as shown by /proc/loadavg
(including "threads").
"proccounttg" counts processes by stepping through /proc and counting every PID
dir (no "threads", just processes with pid==tgid).
The alternative "proccountovz" is disabled by default. It additionally shows
the three "top" instances of OpenVZ in the line before the last line.
.TP
.B process
Count instances of a process.
.br
.B Format:
process/XXXXX
.br
.B Format:
process0/XXXXX
.br
.B Format:
processd/XXXXX
.br
.B Format:
process/OpenVZ-CTID_YYYY/XXXXX
.br
.B Format:
processd/OpenVZ-CTID_YYYY/XXXXX *** CURRENTLY NOT IMPLEMENTED ***
.br
where XXXXX is the name of a process as in /proc/.../stat and YYYY is the CTID
to match on an OpenVZ host.
.br
If the key is "processd" then only "real" daemons running as session/process
leader with PPID 1 are counted.
.br
With "process" a value of 999999999999999999 will be returned if no such
process runs. To return just 0 you must use "process0".
.br
See
.B nagios-check-process(1).
.TP
.B procmounts
Check mounts of /proc/mounts
.br
.B Format:
procmounts
.br
See also
.B mounts.
.B procmounts
checks all mounts of /proc/mounts for being alive. It returns 2 if a mount is
lost.
.TP
.B rsbackup
Return the minutes since the last backup.
.br
.B Format:
rsbackup
.br
The last backup time in format YYYYMMDD is taken from
"/var/log/backup.timestamp" and the difference to the current time is returned.
See
.B nagios-check-backup(1).
.TP
.B time
Measure execution time of a command
.br
.B Format:
time/XXXXX
.br
where XXXXX must be the executable /usr/bin/XXXXX which will be measured.
.br
The return value is the execution time of the command from fork()/execve()
until SIGCHLD. The execution time is measured in microseconds.
.TP
.B timediff
System clock difference between local and remote.
.br
.B Format:
timediff/XXXXX
.br
where XXXXX must be the unix timestamp from the requesting server in seconds
since epoch.
.br
The difference between remote and local system time is returned as a (positive)
value in seconds.
.br
A sample check in a shell:
.br
lynx -dump http://172.16.1.1:8888/timediff/$(date +%s)
.TP
.B tslogentries
Count last lines in a logfile with the same beginning of line.
.br
.B Format:
tslogentries/XY/absolute/path/to/file
.br
where the digit X is the field count and the optional Y is a separator char.
.br
If you have logfiles with a timestamp at the beginning of every log line then
you can count e.g. how many mails were sent or files were transferred today.
The first argument must be a digit as field count and an optional char taken as
field separator to create a matching pattern.
The pattern is created from the last line, the field count and the separator.
If no separator char is specified then ' ' (space) will be used as default.
The second argument is the path. You can use
.B one
dot or '@' as one wildcard like asterisk in a shell. See
.B Examples.
.TP
.B sockets
Count sockets / sockets per port
.br
.B Format:
sockets/PROTO/XXXXXX/YYYY
.br
.B Format:
sockets/PROTO/XXXXXX/YYYY/ZZZZZZZZ
.br
.B Format:
sockets/set-WWWW[/XXXXXX[/YYYY]]
.br
where WWWW is a set of protocols, XXXXXX is local, remote, wlocal, wremote, all
or wall. YYYY is the port as a 4-digit hex string and ZZZZZZZZ is an optional
IP address to be excluded from counting.
.br
PROTO is one of tcp, udp, tcp6, udp6 or set-WWWW. It is also the name of the
proc file in /proc/net/ which is read to get socket data.
If you specify a set of protocols then "t" stands for tcp, "T" for tcp6, "u"
for udp and "U" for udp6.
With the set syntax the specification of remote/local and port number is
optional and all sockets are counted, e.g. sockets/set-tTuU gives you all
sockets.
If you want to know e.g. the number of sockets of a locally running apache then
you should use the key sockets/tcp/local/0050, and if you want to count
outgoing ssh connections excluding connections to 172.16.0.1 then you should
use sockets/tcp/remote/0016/010010AC .
Sockets in state "06" (TIME_WAIT) are ignored unless you prefix
local/remote/all with 'w'.
.TP
.B swap
Used swap space in MB
.br
.B Format:
swap
.br
Used swap space in MB is calculated from the values of /proc/meminfo.
MemTotal and SwapTotal in MB are printed in the line before the last.
If you don't need this data you should use
.B swaps
because /proc/swaps holds just swap information and nothing else.
The "swap" key is disabled by default.
.TP
.B swaps
Used swap(s) space in MB
.br
.B Format:
swaps
.br
This is an alternative version to
.B swap.
The amount of used swap space is calculated by adding the "Used" fields in
/proc/swaps. The number of active swaps is printed in the line before the last.
This should be preferred over
.B swap
unless you need the MemTotal output.
.TP
.B tcp
Check for open TCP port
.br
.B Format:
tcp/XXXXXXXX/YYYY
.br
where XXXXXXXX is the IP address and YYYY the port to connect to.
.br
Check for an open port and return an error code if connect() fails. If
connect() succeeds the time needed is returned in microseconds.
This is useful to check e.g. a local (127.0.0.1) running tomcat server on port
8080 with tcp/127.0.0.1/8080.
.TP
.B uptime
Return uptime
.br
.B Format:
uptime
.br
.B Format:
uptimeI
.br
.B Format:
uptimeI/INVERSIONLEVEL
.br
Return uptime or an "inverted" uptime in seconds.
The inverted value is (INVERSIONLEVEL - uptime) or 0 if the value would be
negative. The inversion level may be specified in the key string, e.g.
uptimeI/3600.
If no inversion level was specified then a default of 86400 will be used.
.TP
.B wc-l
Count lines of a file.
.br
.B Format:
wc-l/absolute/path/to/file
.br
Just like the shell command "wc -l" it counts the lines of a file. You can use
it for checking e.g. whether apache is running out of semaphores with
.I wc-l/proc/sysvipc/sem.
.PP
.SS knoerrerc
The (optional) resource config file is "/etc/knoerrerc". It only holds some
basic settings like external commands or parameters for "longprocs".
.br
To specify an external program which is called by knoerre please use
"CMD programurl command arg1 arg2 .. arg15", e.g.
.RS
CMD loadavg cat /proc/loadavg
.RE
.br
NOTE1: The number of args is limited to 15.
.br
NOTE2: knoerre doesn't use insecure and oversized popen(). You don't get a
shell to execute the external program.
.br
NOTE3: You can't specify a path to your external program. For security reasons
knoerre uses an internal path list to search for the program.
.br
Parameters for the
.I longprocs
function can be specified like this:
.RS
LONGPROC_UID_MIN 630
.br
LONGPROC_UID_MAX 65533
.br
LONGPROC_EXCLUDES vsftpd bash sftp-server
.RE
.SH FILES
knoerre uses one configuration file and one access restrictions file for its
tcpserver daemon:
.TP
/etc/knoerrerc
rc-file for non-monolithic knoerre
.PP
.TP
/etc/knoerre.tcprules.cdb
tcprules for use with tcpserver
.PP
.SH "SEE ALSO"
tcpserver(1), knoerre-conf(1), knoerre-update-tcprules(1), svc(8),
check_remote_by_http(1), check_remote_by_http_time(1)
.PP
http://cr.yp.to/ucspi-tcp.html
.br
http://cr.yp.to/daemontools.html
.SH EXAMPLES
Here's a simple example of a client and server communication:
.RS
server$ tcpserver -v -RHl localhost 0 8888 knoerre
.br
client$ lynx -dump -mime-header http://server:8888/load1
.br
HTTP/1.0 200 OK
.br
Server: knoerre/0.8.5m
.br
Content-Type: text/plain
.br
.br
1.51
.RE
.br
You can also use something like
.RS
echo "GET /loadavg HTTP/1.1" | knoerre
.RE
or
.RS
knoerre loadavg
.RE
.br
This example shows the usage of a
.B @
as wildcard:
.RS
$ knoerre filesizes/home/www/@/log/access_log
.br
/home/www/user_hans/log/access_log
.br
52222
.RE
.br
A more "complex" example with three arguments (suffix, depth and path) and
wildcard usage is this:
.RS
$ knoerre filesizesbysuffix/.gif/2/home/@/html/typo3temp
.br
/home/www/user_hans/html/typo3temp/pics/30363cbb32.gif
.br
201
.RE
.br
Also filesizesbysuffix:
.RS
$ knoerre filesizesbysuffix/cache_pages.ibd/1/var/lib/mysql
.br
/var/lib/mysql/user-database-1/cache_pages.ibd=3022848
.br
3022848
.br
$ knoerre filesizesbysuffix/.ibd/1/var/lib/mysql
.br
/var/lib/mysql/user-database-2/index_rel.ibd=3248128
.br
3248128
.RE
.br
Which user sent the most emails today?
.RS
$ knoerre tslogentries/1/home/www/@/log/mail.log
.br
/home/www/user_hans/log/mail.log
.br
858
.RE
.br
Which user runs the most processes?
.RS
$ knoerre loaduser/1/60000
.br
hans=32 jack=3 john=1
.br
32
.RE
.br
Is /home rw-mounted and nosuid?
.RS
$ grep home /proc/mounts
.br
/dev/sda7 /home ext3 rw,nosuid,nodev,data=ordered 0 0
.br
$ knoerre/knoerre mountopts/rw,nosuid/home
.br
/home==rw,nosuid?
.br
/dev/sda7 /home ext3 rw,nosuid,nodev,data=ordered 0 0
.br
0
.RE
.br
.SH SECURITY
.B knoerre
does not support dropping of rights. When it is used as a remote check tool
with tcpserver you can drop rights with tcpserver.
.B knoerre
does not need to be run as root, but for different checks and different dirs
and files you may need different rights.
Don't use setuid bits, uid/euid checks are not made.
.PP
Keys that are too long are truncated or answered with an HTTP redirection.
HTTP requests are limited to 512 bytes.
Keys containing ".." are answered with an HTTP redirection.
.PP
All stat calls are lstat() calls.
No writes are made to filesystem(s), all open() calls are read-only.
Data is only written to stdout/stderr.
No external libs are used. Only the standard C lib is used.
No stdio functions are used.
"External" input data is used with bounds checks.
Arrays are "oversized" to avoid off-by-one errors.
An internal timeout prevents "dead"
.B knoerre
processes which block in read() and wait for data which will never come.
The number of syscalls and the number of different syscalls is low.
The source code and also the executable file are small.
.PP
Using external commands with "CMD" in /etc/knoerrerc can be a security risk
because the external program is forked/exec'ed by knoerre.
.B knoerre
doesn't use insecure and oversized popen() to execute external commands.
You don't get a shell to execute an external program.
You can't put strings in quotes. A space always separates arguments.
You can't specify a path to your external program.
.B knoerre
uses an internal path list to search for the program.
.PP
It's strongly recommended that you only allow access by tcp for your nagios
server.
Use one entry "knoerre: ALL" in /etc/hosts.deny and one entry with the nagios
server IP address in /etc/hosts.allow. After changing it you
.B must
use knoerre-update-tcprules(1) to update tcpserver's cdb file.
Always keep in mind that host-based authentication is not real authentication.
To encrypt network traffic please use e.g. ipsec or vpn.
.SH CAVEATS
Due to "leaf optimization" the direntries recursive mode can produce wrong
results on non-unix-like filesystems.
The maximum internal absolute pathname length is 16384 chars.
.SH AUTHOR
Frank Bergmann, http://www.tuxad.com
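.SH NOTES
The hexadecimal port and address fields of the
.B sockets
key can be derived with a small helper.
The following Python sketch is only an illustration and not part of knoerre.
The function names are hypothetical, and the byte order of the address assumes
a little-endian host, which is what the 010010AC example for 172.16.0.1
suggests.
.PP
.nf
```python
import socket
import struct


def port_hex(port):
    # Port as 4-digit upper-case hex string, e.g. 22 -> "0016".
    return "%04X" % port


def addr_hex(dotted):
    # Assumption: little-endian host, so the four address bytes
    # appear in reversed order, as in the 010010AC example.
    (value,) = struct.unpack("<I", socket.inet_aton(dotted))
    return "%08X" % value


def sockets_key(proto, side, port, exclude=None):
    # Build a key of the form sockets/PROTO/XXXXXX/YYYY[/ZZZZZZZZ].
    parts = ["sockets", proto, side, port_hex(port)]
    if exclude is not None:
        parts.append(addr_hex(exclude))
    return "/".join(parts)


if __name__ == "__main__":
    print(sockets_key("tcp", "local", 80))
    print(sockets_key("tcp", "remote", 22, "172.16.0.1"))
```
.fi
.PP
Run as a script, the sketch prints sockets/tcp/local/0050 and
sockets/tcp/remote/0016/010010AC, the two keys used in the description of
.B sockets.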