Topic: DQ monitoring server?
Dear DC users and developers,
At Human Inference we've been thinking of ways to build upon the foundation of DataCleaner to make it a more complete data quality solution. An idea has been taking shape, and we would like to hear your opinions on it and use them to decide whether or not we're going to build it!
The idea is to have a server-side counterpart for the DataCleaner application. The purpose of the server-app would be to be able to schedule jobs, gather and persist results and show trends over time. I would call this functionality "DQ monitoring". The current DC app would be extended with a way to upload jobs to the server so that you can still work with your jobs in the regular DataCleaner application, but for enterprise deployment you would probably run them in batches on the server.
In terms of reporting we have in mind that you should of course be able to see the results for a single run, but you should ALSO be able to see the evolution of your profiling metrics. For example you might be interested in seeing trends in the patterns found or in the metrics available in the various analyzers.
Another possible feature would be to have email bursting built-in, so that if you have a threshold value for some particular metric, you could receive email alerts when your metrics no longer live up to your goals.
What is your opinion on such a DQ monitoring application? Do you think it would fit in nicely with DataCleaner? Or would it not add a lot of value?
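To make the threshold idea a bit more concrete, here is a minimal sketch of how metric values gathered over several runs might be checked against a goal. Everything in it is an assumption for illustration: the class and method names, the "null count" metric name and the "at or below a maximum" style of goal are hypothetical and not part of DataCleaner. Sending the actual email is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from DataCleaner itself.
class MetricThresholdSketch {

    /** One observed value of a named metric from a single job run. */
    static final class MetricValue {
        final String metricName;
        final int runNumber;
        final double value;

        MetricValue(String metricName, int runNumber, double value) {
            this.metricName = metricName;
            this.runNumber = runNumber;
            this.value = value;
        }
    }

    /** A goal for a metric: its value should stay at or below maxAllowed. */
    static final class Threshold {
        final String metricName;
        final double maxAllowed;

        Threshold(String metricName, double maxAllowed) {
            this.metricName = metricName;
            this.maxAllowed = maxAllowed;
        }
    }

    /** Returns the runs in which the threshold's metric exceeded the goal. */
    static List<MetricValue> findBreaches(List<MetricValue> history, Threshold threshold) {
        List<MetricValue> breaches = new ArrayList<>();
        for (MetricValue v : history) {
            boolean sameMetric = v.metricName.equals(threshold.metricName);
            if (sameMetric && v.value > threshold.maxAllowed) {
                breaches.add(v);
            }
        }
        return breaches;
    }

    /** Builds the text of an alert; delivering it by email is not shown. */
    static String alertMessage(MetricValue breach, Threshold threshold) {
        return "Metric '" + breach.metricName + "' was " + breach.value
                + " in run " + breach.runNumber
                + ", above the goal of " + threshold.maxAllowed;
    }

    public static void main(String[] args) {
        // Made-up history of one metric across three runs.
        List<MetricValue> history = Arrays.asList(
                new MetricValue("null count", 1, 3),
                new MetricValue("null count", 2, 4),
                new MetricValue("null count", 3, 12));
        Threshold goal = new Threshold("null count", 10);
        for (MetricValue breach : findBreaches(history, goal)) {
            System.out.println(alertMessage(breach, goal));
        }
    }
}
```

The same stored history could also feed the trend view, since each value keeps the run it came from.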
Kasper,
Not sure if you guys have made a decision on this, but I would definitely love to have something like this in my tool list.
-tach4
I also think this would be an excellent addition to DataCleaner.
Sounds reasonable, but I hope that the client will still be available alongside these server enhancements.
The evolution and/or trend reporting should also be available in the client version. This would be very useful.
Christian
Automated monitoring of data quality is necessary, because in some cases the information can change at any moment and, depending on the customer's needs, the quality of the data then has to be checked.
On a related note, there's a discussion on the DataCleaner-dev mailing list about this ... Take a look here, and feel free to join the conversation:
http://groups.google.com/group/datacleaner-dev/browse_thread/thread/fcb16c4f86f482d2
Happy to say that this work is now going on :) You can find it in the 3.0 branch of DataCleaner's source:
http://eobjects.org/svn/DataCleaner/branches/3.0-monitor/
The current situation is that we support a timeline view, manual building of the repository (but there is an example), and drill-to-details from the timeline view, which brings up a single (historic) profiling result. Pretty neat! Will try and blog about it soon.
I think this is an awesome idea. I am very interested in learning more about this.
I have checked out the 3.0-monitor branch but am unable to build it. The build seems unable to find some classes that used to be in AnalyzerBeans-core (e.g. org.eobjects.analyzer.result.PatternFinderResult). Am I being too impatient?
Great! To get the build working you also need to check out and build AnalyzerBeans, which is located at http://eobjects.org/svn/AnalyzerBeans/trunk
This is because we are currently developing both AB and DC, so the dependency is snapshot-based.
Would love to hear more from you. Share your impressions and thoughts.
Oh and by the way - we've merged the branch into trunk already, so don't check out from the branch, but get trunk directly. From here:
http://eobjects.org/svn/DataCleaner/trunk
Notice both the changes in the desktop app and the whole new "monitor" webapp.
Kasper, thank you for such quick responses!
I got the DataCleaner trunk and after a few hiccups with Maven not resolving some of the dependencies, I have the war file deployed.
I have just logged in and will poke around more tonight/tomorrow.
Thanks again!
Thanks everyone who has so far participated in this thread.
A heads-up: An alpha version of DC 3 (with the major new monitoring feature) has just been uploaded to SourceForge. Please also see the installation instructions there.
https://sourceforge.net/projects/datacleaner/files/datacleaner%20%28unstable%29/3.0-alpha/
Any feedback, ideas, comments are greatly appreciated!
Update: A beta is now available:
https://sourceforge.net/projects/datacleaner/files/datacleaner%20%28unstable%29/3.0-beta/
