Topic Reading web server log files

robertfolkerts started the topic:
2012-04-09 16:55

Reading web server log files

I'm trying to read an IIS log file in a Kettle job, and as usual, this means inspecting the data first in DataCleaner to prevent errors halfway through the ETL job (following Bouman's book Pentaho Solutions).

But the W3C has defined the format as blank-separated, with the field metadata in a row that looks like:
#Fields: time cs-method cs-uri (more columns)

I can 'almost' read this as a CSV with a blank separator, but the metadata has 15 columns and there are only 14 columns of data (the extra column in the metadata is the literal #Fields: label).
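To illustrate the off-by-one (with a made-up three-column sample rather than a real 15-column IIS log), splitting both lines on whitespace shows the directive line carrying exactly one extra token, the literal #Fields: label, which can simply be dropped:

```python
# Hypothetical sample of the W3C extended log format described above.
header = "#Fields: time cs-method cs-uri"
data = "16:55:00 GET /index.html"

# The directive line has one token more than the data row --
# the "#Fields:" label itself (4 vs 3 here; 15 vs 14 in the real file).
header_tokens = header.split()
data_tokens = data.split()
print(len(header_tokens), len(data_tokens))

# Dropping the leading "#Fields:" token realigns names with values.
field_names = header_tokens[1:]
record = dict(zip(field_names, data_tokens))
print(record)
```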

My quick and dirty workaround is to uncheck "Fail on inconsistent column count" and to mentally shift all the column names by one. Is there a cleaner workaround? I appreciate the auto-discovery, but I need a way to add a directive like 'ignore leading #Fields:' to the file format descriptor.
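Absent such a directive, one cleaner workaround is to preprocess the log into a plain CSV before the profiler ever sees it. A minimal Python sketch (the function name and the three-column sample log are made up for illustration):

```python
import csv
import io

def w3c_log_to_csv(log_lines, out):
    """Rewrite a W3C extended log as plain CSV: the #Fields: directive
    becomes the header row (minus the "#Fields:" label) and all other
    #-directives (#Version:, #Date:, ...) are dropped."""
    writer = csv.writer(out)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            writer.writerow(line.split()[1:])  # drop the leading label
        elif line.startswith("#"):
            continue                           # skip other directive lines
        else:
            writer.writerow(line.split())

# Usage with a made-up log:
log = [
    "#Version: 1.0",
    "#Fields: time cs-method cs-uri",
    "16:55:00 GET /index.html",
]
buf = io.StringIO()
w3c_log_to_csv(log, buf)
print(buf.getvalue())
```

Once the #Fields: label is stripped and the other #-directives are dropped, header and data rows have the same column count, so the inconsistent-column-count check can stay enabled.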

Can I define a new datastore type (say, W3C Logs)? If so, how do I write one of these?
deleted user replied:
2012-04-09 18:30
Hi Robert,

I've actually had it on my agenda for a while to write up a tutorial on how to create a custom datastore. So I think I will do that tonight :) But it will require basic Java development skills, hope that is also what you are looking for?

How is this done in Kettle? Is there a specific input step for W3 log files or an option in the CSV input step that you can use? If we can come up with a generic way of fixing the issue it would be nice.
robertfolkerts replied:
2012-04-09 18:41
There is a "Text file input" step, which is basically a superclass of the "CSV file input" step. The text file input step allows hand-editing of the fields in a 'Fields' tab: it auto-populates the fields, but I can go in and alter the name, data type, length, format string, etc.
deleted user replied:
2012-04-09 19:09
OK, gotcha. Hmm, this seems like a valid requirement actually, especially in light of the W3C making it a standard. But it's not anything I've come across before, and in a way I think it's a dirty format, but whatever :-P

In future versions I do envision some way of manually defining/overriding the CSV headers, though. But it won't fix your issue right now. Instead I will let you know once there's a blog/tutorial available on making custom datastores. Will start cooking on it now (I've actually told the story a ton of times via email, to colleagues etc., so I just need to make it pretty).
deleted user replied:
2012-04-09 20:23
Hi Robert,

For the custom datastore thingie, please take a look at my blog which I've just published a new entry on:

Implementing a custom datastore in DataCleaner.

Hope it can help you on your way.
robertfolkerts replied:
2012-04-10 15:35
Wow, that was fast :-) Thanks!