Reference documentation

5.0

Copies of this document may be made for your own use and for distribution to others, provided that you do not charge any fee for such copies and further provided that each copy contains this Copyright Notice, whether distributed in print or electronically.


Table of Contents

I. Introduction to DataCleaner
1. Background and concepts
What is data quality (DQ)?
What is data profiling?
What is data wrangling?
What is a datastore?
Composite datastore
What is data monitoring?
What is master data management (MDM)?
2. Getting started with DataCleaner desktop
Installing the desktop application
Connecting to your datastore
Adding components to the job
Wiring components together
Transformer output
Filter requirement
Output data streams
Executing jobs
Saving and opening jobs
Template jobs
Writing cleansed data to files
3. Getting started with DataCleaner monitor
Installing the monitoring web application
Connecting to your datastore
Building a job
Scheduling jobs
Adding metric charts on the dashboard
II. Analysis component reference
4. Transform
JavaScript transformer
Invoke child Analysis job
Equals
Max rows
Not null
Union
5. Improve
Duplicate detection
'Model training' mode
'Detection' mode
'Untrained detection' mode
Merge duplicates
Merge duplicates
Conclusion
Synonym lookup
DE movers and deceased check
Address and Mail Suppression data sources
Output
UK movers, deceased and mailing preferences check
Address and Mail Suppression data sources
Output
US movers, deceased and do-not-mail check
Address and Mail Suppression data sources
Output
Table lookup
National identifiers
6. Analyze
Boolean analyzer
Completeness analyzer
Character set distribution
Date gap analyzer
Date/time analyzer
Number analyzer
Pattern finder
Reference data matcher
Referential integrity
String analyzer
Unique key check
Value distribution
Value matcher
Weekday distribution
7. Write
Create CSV file
Create Excel spreadsheet
Create staging table
Insert into table
Update table
III. Reference data
8. Dictionaries
9. Synonyms (a.k.a. synonym catalogs)
Text file synonym catalog
Datastore synonym catalog
10. String patterns
IV. Configuration reference
11. Configuration file
XML schema
Datastores
Database (JDBC) connections
Comma-Separated Values (CSV) files
Fixed width value files
Excel spreadsheets
XML file datastores
ElasticSearch index
MongoDB databases
CouchDB databases
Composite datastore
Reference data
Dictionaries
Synonym catalogs
String patterns
Task runner
Storage provider
12. Analysis job files
XML schema
Source section
13. Logging
Logging configuration file
Default logging configuration
Modifying logging levels
Alternative logging outputs
14. Database drivers
Installing database drivers in DataCleaner desktop
Installing database drivers in DataCleaner monitor
V. DataCleaner monitor repository
15. Repository configuration
Configure repository location
Directory-based repository
Database-backed repository
Providing signed Java WebStart client files
Producing the signed JARs
Configuring DataCleaner monitor to use the signed JARs
Cluster configuration (distributed execution)
16. Repository layout
Multi-tenant layout
Tenant home layout
VI. DataCleaner monitor web services
17. Job triggering
Trigger service
Polling for execution status
18. Repository navigation
Job files
Result files
Uploading content to the repository
Modifying result metadata
Renaming jobs
Copying jobs
Deleting jobs
19. Metric web services
Metrics background
Getting a list of available metrics
Getting the values of particular metrics
20. Atomic transformations (data cleaning as a service)
What are atomic transformation services?
Invoking atomic transformations
VII. Invoking DataCleaner jobs
21. Command-line interface
Executables
Usage scenarios
Executing an analysis job
Listing datastore contents and available components
Parameterizable jobs
Dynamically overriding configuration elements
22. Apache Hadoop and Spark interface
Hadoop deployment overview
Setting up Spark and DataCleaner environment
Upload configuration file to HDFS
Upload job file to HDFS
Upload executables to HDFS
Launching DataCleaner jobs using Spark
Limitations of the Hadoop interface
VIII. Third party integrations
23. Pentaho integration
Configure DataCleaner in Pentaho Data Integration
Launch DataCleaner to profile Pentaho Data Integration steps
Run Pentaho Data Integration jobs in DataCleaner monitor
Run DataCleaner jobs in Pentaho Data Integration
IX. Developer's guide
24. Architecture
Data access
Processing framework
25. Executing jobs through code
Overview of steps and options
Step 1: Configuration
Step 2: Job
Step 3: Execution
Step 4: Result
26. Developer resources
Extension development tutorials
Building DataCleaner
27. Extension packaging
Annotated components
Single JAR file
Extension metadata XML
Component icons
28. Embedding DataCleaner

List of Tables

4.1. JavaScript variables
4.2. JavaScript data types
5.1. Model training properties
5.2. DE movers and deceased check output
5.3. UK movers, deceased and mailing preferences check output
5.4. US movers, deceased and do-not-mail check output
6.1. Completeness analyzer properties
6.2. Pattern finder properties
6.3. Referential integrity properties
6.4. Unique key check properties
6.5. Value distribution properties
17.1. Job triggering HTTP parameters