Preface
This set of examples, or cookbook, shows how to use parts of the NetLogger Toolkit. For more details, downloads, etc., see the NetLogger home page at http://acs.lbl.gov/NetLoggerWiki/.
Conventions
- Italic: Used for file and directory names, email addresses, and new terms where they are defined.
- Constant Width: Used for code listings and for keywords, variables, functions, command options, parameters, class names, and HTML tags where they appear in the text. Used with double quotes for literal values like "True", "10", and "netlogger.modules". In code listings, user input to the terminal will be prefixed with a $.
- Constant Width Italic: Used to indicate items that should be replaced by actual values.
- Link text: Used for URLs and cross-references.
Data Mining Tools
Overview
The NetLogger Data Mining Tools (DMT) provide a rich library of functions for processing logs. The functionality is concentrated in two basic types of modules: parsing and loading. The first converts a variety of log formats into the "Best Practices" (BP) log format; the second performs analysis using BP logs. The programs that drive inputs and outputs for these modules are called nl_parse and nl_load. The examples below describe different ways to use these programs, separately or together, to perform common data processing tasks.
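As a quick preview, the two programs are typically chained with a shell pipe: nl_parse turns a raw log into BP events on standard output, and nl_load hands those events to an analysis or loading module. A minimal sketch (the input path and output file name here are only placeholders; working examples follow in the sections below):

# Parse a BP-format log and hand the events to the CSV loader module
$ nl_parse bp /path/to/input.bp | nl_load csv_loader > events.csv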

Getting the code
- Check out the code from subversion:
  svn co https://bosshog.lbl.gov/repos/netlogger/trunk/python/
- Set up the environment:
  cd python/
  source dev-setup.sh   # or dev-setup.csh
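After sourcing the setup script, the nl_* command-line tools used in the rest of this cookbook (nl_parse, nl_load, nl_broker, nl_write, nl_check, ...) should be on your PATH. A quick sanity check, assuming the setup succeeded, is to list the available parser modules:

$ nl_parse --list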
Basic Parsing
Once you have set up your development environment, parsing logs is mostly a matter of picking the name of your parser module and feeding it inputs.
Choose a module
First, list the available modules with the -l/--list option.
$ nl_parse --list
## Output
Available modules:
bestman, bp, condor_dag, csa_acct, dynamic, generic, gensim, gk, globus_condor,
gram_acct, gridftp, gridftp_auth, guc, hsi_ndapi, hsi_xfer, jobstate, kickstart,
ks, netstat, pbs, sge, sge_rpt, vmstat, wsgram
Get more information on a module with the -i/--info option. This will list parameters that can be given on the command-line for that module.
$ nl_parse -i bestman
## Output
* Parser name: bestman
* Description: Parse logs from Berkeley Storage Manager (BeStMan).
  See also http://datagrid.lbl.gov/bestman/
* Parameters:
    version        Version 1 is anything before bestman 2.2.1.r3, Version 2 is
                   that version and later ones. values=(1,2) [2]
    transfer_only  For Version2, report only those events needed for transfer
                   performance. values=(yes,no) [no]
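Module parameters like these are passed as name=value arguments on the nl_parse command line, in the same style as the verify=false example below. As a hedged sketch, using the bestman parameters listed above on a hypothetical log file (the file path is a placeholder, and the exact argument order may differ in your version):

$ nl_parse bestman /path/to/bestman-event.log transfer_only=yes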
Parse a file
This example will use the nl_parse bp module to convert BP logs to themselves.
First, create a small input file:
$ nl_write -n 3 event=example.1 > /tmp/ex1.bp
Parse the file, sending results to standard output.
$ nl_parse bp /tmp/ex1.bp
## Output
ts=2010-09-14T13:17:56.606554Z event=example.1 level=Info n=0
ts=2010-09-14T13:17:56.606595Z event=example.1 level=Info n=1
ts=2010-09-14T13:17:56.606620Z event=example.1 level=Info n=2
Options to this parser control, e.g., whether the input is verified. Another way to verify is to use the specialized tool nl_check.
$ cp /tmp/ex1.bp /tmp/ex2.bp
$ echo "bogus=true" >> /tmp/ex2.bp
$ nl_parse bp /tmp/ex2.bp
## Output
ts=2010-09-14T13:17:56.606554Z event=example.1 level=Info n=0
ts=2010-09-14T13:17:56.606595Z event=example.1 level=Info n=1
ts=2010-09-14T13:17:56.606620Z event=example.1 level=Info n=2
2010-09-14T13:23:57.795922Z WARN netlogger.NLParser.unparsed.event - \
    msg=missing ts,value=bogus=true
Parse without verification.
$ nl_parse bp /tmp/ex2.bp verify=false
## Output
ts=2010-09-14T13:17:56.606554Z event=example.1 level=Info n=0
ts=2010-09-14T13:17:56.606595Z event=example.1 level=Info n=1
ts=2010-09-14T13:17:56.606620Z event=example.1 level=Info n=2
// Note: NetLogger generated missing fields
ts=2010-09-14T13:26:25.148633Z event=event level=Info bogus=true
Use nl_check to verify.
$ nl_check /tmp/ex2.bp
## Output
*** Parser error on line 3:
Missing one or more required elements (Group:({"ts" Suppress:("=") {quoted string, starting with " ending with " ^ {!W:( )}...}}),
 Group:({"event" Suppress:("=") {quoted string, starting with " ending with " ^ {!W:( )}...}})) (at char 0), (line:1, col:1)
1 errors found
Use nl_check to filter out bad input.
$ nl_check -c -f -q /tmp/ex2.bp
## Output
ts=2010-09-14T13:17:56.606554Z event=example.1 level=Info n=0
ts=2010-09-14T13:17:56.606596Z event=example.1 level=Info n=1
ts=2010-09-14T13:17:56.606620Z event=example.1 level=Info n=2
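Because the filtered output is itself valid BP, nl_check can also serve as a cleanup step in front of other tools. A small sketch reusing the files above (the name of the intermediate file is arbitrary):

# Keep only the well-formed lines, then parse them as usual
$ nl_check -c -f -q /tmp/ex2.bp > /tmp/ex2-clean.bp
$ nl_parse bp /tmp/ex2-clean.bp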
Send logs to AMQP broker
NetLogger has support for the Advanced Message Queuing Protocol (AMQP), a very useful component in a distributed logging architecture. To send logs to AMQP, use the -a/--amqp-host option for nl_parse.
$ nl_parse -a localhost bp /tmp/ex1.bp
Data loading basics
The loader program (nl_load) serves a variety of purposes. In a use-case where multiple sources are streaming data to a centralized information broker for processing (database loading, for example), nl_load functions as a snap-in piece of the server architecture: it is the piece that handles the actual processing and database loading. However, when you want to connect to a remote information broker to monitor and filter event data during the run of a particular job, nl_load works as a piece of client software.
The client loader program consists of two parts. The main program (nl_load) is located in the scripts directory, alongside the broker. The code that handles the various loading/processing tasks is located in the netlogger/analysis/modules/ subdirectory. The loader program is invoked with the desired loader module named on the command line at run time. For example, the following command:
nl_load -c localhost csv_loader > bp_outfile.csv
will invoke the loader client program, attach to the information broker running on localhost, and load the csv_loader module (located in netlogger/analysis/modules/), which transforms the incoming NetLogger data into CSV format and writes the result out to a file. A different processing module could load the streamed data into a database rather than producing an output file.
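For instance, swapping in the mongodb module described under "Advanced usage" below would send the same stream into a MongoDB database instead of a CSV file (the database and collection names here are just the ones used in that later example):

$ nl_load -c localhost mongodb database=testdb collection=testcoll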
A simple example of how loader modules are written can be found in the bp.py module, which merely writes the incoming NetLogger BP data out to a file. The process method of the subclass is where the events are passed in by the broker that nl_load is connected to.
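For orientation, a loader module has roughly the following shape. This is only an illustrative sketch, not the actual bp.py source: the class name, base class, constructor, and output handling are assumptions; the only detail taken from the description above is that events arrive through a process method.

# Hypothetical sketch of a minimal loader module (not the real bp.py).
# Real modules live in netlogger/analysis/modules/ and subclass a loader
# base class provided by NetLogger; the names below are placeholders.
class ExampleWriter(object):
    """Write each incoming BP event back out as name=value pairs."""

    def __init__(self, filename="out.bp"):
        # open the output file once; events are appended as they arrive
        self._out = open(filename, "a")

    def process(self, event):
        # 'event' is assumed here to be a dict-like mapping of BP fields;
        # the broker (or a pipe through nl_load) hands each event to this method
        line = " ".join("%s=%s" % (k, v) for k, v in event.items())
        self._out.write(line + "\n")

    def close(self):
        self._out.close()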
It is also possible to feed BP-formatted data directly to the loader module via a pipe rather than connecting the loader to the broker. If you have only a single source/stream of data (a directory of logs, for example) and you only want to send the data to a single processing loader on the same machine, you can do that without connecting the loader to the broker. For example, the following command:
nl_parse bp input.bp | nl_load csv_loader > output_bp.csv
would take the output from an nl_parse process (described in the Basic Parsing section above), pipe it to nl_load, which sends the data to the appropriate loading/processing module, and write the results to a file.
Information broker (nl_broker)
The NetLogger information broker (nl_broker) is located in the scripts directory of the source checkout. The broker accepts incoming streams of NetLogger BP-formatted data and hands the events off to one or more processing (loader) modules. It can be run in the background by simply invoking it without any arguments:
nl_broker &
This will start the broker on the localhost interface with preset default ports. Invoking it with the -h flag will show the various arguments, mostly for changing the default ports. After the broker has been started, it will accept incoming streams of NetLogger BP-formatted data. However, if there are no client "loader" processes attached to the broker, any incoming data will not be processed. If a tree falls in the forest, in this case it does not make a sound.
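To have incoming data actually processed, start at least one loader and point it at the broker, exactly as in the loader examples above (the output file name here is arbitrary):

$ nl_broker &
$ nl_load -c localhost csv_loader > events.csv &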
Advanced usage
Below are some examples of using the DMT components. Some additional examples can be found in the examples/dmt subdirectory of the source code.
Here is an example of how to load 5000 events into MongoDB using the mongodb loader module.
# If necessary, start the MongoDB server.
# If you installed from a package (e.g., apt-get on Debian), there will already be
# a boot script that does this automatically.
$ mongod &
# Run the broker. It makes sense to do this first, but the loader is smart enough
# to try reconnecting if the broker is not there when it starts up.
$ nl_broker &
# Now connect to the broker with a loader that sends data to Mongo. The database
# and collection names are sort of like database and table for an RDBMS.
# The 'intvals' parameter tells the mongodb module that values for those attributes
# should be coerced to an integer before inserting to the DB. There is also 'floatvals'.
# In this case, 'n' will be an index generated by nl_write.
$ nl_load -c localhost mongodb database=testdb collection=testcoll intvals=n &
# Note: if running 'mongod' in the foreground, it will now spit out a "connection accepted" message
# Write some logs to the broker
$ nl_write -T -n 5000
# Check that the data got there
# $ is shell, > is mongo prompt
$ mongo testdb
MongoDB shell version: 1.4.3
url: testdb
connecting to: testdb
type "help" for help
> db.testcoll.count()
5000
In real operation, you would replace the nl_write with one or more nl_parse processes (using whatever parser modules you need) that send parsed data to the broker.
Here is an example of how to load 5000 events into SQLite using the nl_sql loader module. This module creates and uses a generic schema:
CREATE TABLE event (
    event_id INTEGER NOT NULL,
    ts NUMERIC(16, 6),
    event VARCHAR(255),
    level SMALLINT,
    startend SMALLINT,
    status INTEGER,
    PRIMARY KEY (event_id)
);
CREATE TABLE ident (
    id INTEGER NOT NULL,
    event_id INTEGER NOT NULL,
    name VARCHAR(255),
    value VARCHAR(255),
    PRIMARY KEY (id),
    FOREIGN KEY(event_id) REFERENCES event (event_id)
);
CREATE TABLE value (
    id INTEGER NOT NULL,
    event_id INTEGER NOT NULL,
    name VARCHAR(255),
    value VARCHAR(255),
    PRIMARY KEY (id),
    FOREIGN KEY(event_id) REFERENCES event (event_id)
);
The commands to load and send the data are similar to the MongoDB loading example.
# Run the loader. Although there is no broker yet, it will connect to it when it appears.
$ nl_load -c localhost nl_sql dsn=sqlite:///test.db &
# Run the broker.
$ nl_broker &
# Wait a second for the loader to connect, then write some log data
$ nl_write -T -n 5000
# In order to get the last few events flushed, it is sometimes
# necessary to send one extra event
$ sleep 2; nl_write -T -n 1
# Check that the data arrived
$ sqlite3 test.db
SQLite version 3.6.19
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> select count(*) from event;
5001
sqlite> select max(value + 0) from value where name = 'n';
4999
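Since the generic schema splits event attributes out into the value table, ad-hoc queries usually join it back to event. A small sketch, run in the same sqlite3 session as above, that lists a few events together with their 'n' attribute:

sqlite> select e.ts, e.event, v.value from event e join value v on v.event_id = e.event_id where v.name = 'n' limit 5;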
NetLogger Analysis
This section provides some examples of NetLogger analysis.
Installing on Ubuntu
You will need R (the R language) for most of these examples. Here is an example of how to install R and the NetLogger R packages on Ubuntu.
% sudo bash       # ..or become root by other means
$ apt-get install r-base-core
# If you want to use MySQL, you'll need that installed too.
# If you don't, take the "RMySQL" out of the package list for R, below
$ apt-get install mysql-server mysql-client libmysqlclient16-dev
# I find it easier to install add-on packages in R itself
$ R
> install.packages(c("ggplot2", "lattice", "RMySQL", "RSQLite"))
> quit(save="no")
$ exit            # ..back to yourself
Next, check out and install the NetLogger R packages:
% mkdir -p /tmp/nlr
% cd /tmp/nlr
% svn co https://svn.globus.org/repos/netlogger/trunk/analysis/R-packages
% cd R-packages/
# Become root
% sudo bash
$ R CMD INSTALL nlbase_0.0.0.tar.gz
$ R CMD INSTALL nlgridftp_0.0.0.tar.gz
$ exit            # .. become yourself again
The easiest way to test that the install worked at all is to run R and ask for help on a couple of functions.
% R
> library(nlgridftp)
> ?convertRaw
# ... should display a bunch of help ..
Lifelines
One can create NetLogger “lifelines” with the nl_lifeline program (in the Python part of the NetLogger distribution). This program takes a log file on standard input and produces a second program, with data embedded, that can be run by either gnuplot or R to produce the plot.
In the example below, the event sequence is a,b,c, and the item field is used to group events into lines, i.e., a sequence of events having the same value for item will be placed on the same line.
$ cat file.log
ts=2009-04-17T19:52:17.558026Z event=a level=Info item=1
ts=2009-04-17T19:52:21.790675Z event=b level=Info item=1
ts=2009-04-17T19:52:26.566290Z event=c level=Info item=1
ts=2009-04-17T19:52:30.992154Z event=a level=Info item=2
ts=2009-04-17T19:52:35.942049Z event=b level=Info item=2
ts=2009-04-17T19:52:47.864045Z event=c level=Info item=2
# Create plot using R
$ nl_lifeline -l item -g item -e a,b,c -o file.rplot -t R < file.log
# (output) To create the final plot, run: R CMD BATCH file.rplot
# Note: output will be in plot.pdf
# Create plot using Gnuplot
$ nl_lifeline -l item -g item -e a,b,c -o file.gnuplot -t g < file.log
# (output) To create the final plot, run: gnuplot file.gnuplot
# Note: output will be in plot.png
Database queries
This section provides some sample database queries.
Database event types
This query reports database event types and their attributes. Some of the syntax is MySQL-specific, in particular the group_concat function.
--
-- Put event types in a temporary table
--
create temporary table event_types (id integer, name varchar(255))
  select id, count(*) 'num', min(time) 'first', max(time) 'last',
         (case when startend = 0 then concat(name,'.start')
               when startend = 1 then concat(name,'.end')
               else name end) name
  from event
  group by name, startend;
--
-- Join with attr(ibutes) table
--
create temporary table event_attrs (event varchar(255), names varchar(4096))
  select e.name as 'event',
         group_concat(attr.name order by attr.e_id separator ',') as 'names'
  from event_types e left join attr on e.id = attr.e_id
  group by e.name;
--
-- Join with ident(ifiers) table
--
create temporary table event_idents (event varchar(255), names varchar(4096))
  select e.name as 'event',
         group_concat(ident.name order by ident.e_id separator ',') 'names'
  from event_types e left join ident on e.id = ident.e_id
  group by e.name;
--
-- Join with DN table
--
create temporary table event_dn (event varchar(255), has_dn varchar(3))
  select e.name as 'event',
         (case when isnull(dn.id) then 'no' else 'yes' end) 'has_dn'
  from event_types e left join dn on e.id = dn.e_id
  group by e.name;
--
-- Join with text table
--
create temporary table event_text (event varchar(255), has_text varchar(3))
  select e.name as 'event',
         (case when isnull(text.id) then 'no' else 'yes' end) 'has_text'
  from event_types e left join text on e.id = text.e_id
  group by e.name;
--
-- Project them all into the same table
--
select x.event, e.num,
       from_unixtime(e.first, "%Y-%m-%dT%H:%i:%S") 'first',
       from_unixtime(e.last, "%Y-%m-%dT%H:%i:%S") 'last',
       x.names 'attributes', y.names 'identifiers',
       d.has_dn, t.has_text
from event_attrs x
  join event_idents y on x.event = y.event
  join event_types e on x.event = e.name
  join event_dn d on d.event = e.name
  join event_text t on t.event = e.name
order by num;