The most popular data quality toolkit meets Big Data

DataCleaner on Hadoop and other big data stores

Data management is changing fast. One of the driving forces is Big Data, and in particular Hadoop as the technology platform behind it. The ability to scale horizontally for greater size and speed means that new territory is being explored, disrupting both the businesses and the technologies associated with data processing.

But with Big Data comes big responsibility - a responsibility to master your data rather than let your new data lake turn into a data swamp.

DataCleaner helps you with data quality, data ingestion, standardization and monitoring. We can leverage the computing power of your Hadoop cluster to overcome infrastructure and performance hurdles.

How does it work?

DataCleaner can connect to the Hadoop Distributed File System (HDFS) and read and write your data, just as on any other file system. Moreover, you can submit your DataCleaner jobs to execute on the Hadoop cluster itself.

DataCleaner interoperates with Apache Spark and YARN to get the most out of Hadoop and to give you an industry-standard execution platform which fits in with all of the major Hadoop distributions - Hortonworks, Cloudera and MapR.

In addition to Hadoop, we support other major Big Data and NoSQL databases such as Elasticsearch, Cassandra, HBase, MongoDB and CouchDB.

Watch the video above to see how DataCleaner works with Hadoop.

What can I use it for?

DataCleaner is a great tool for data profiling and data processing. Use it with big data to: