Abstract
Data cleaning has long faced the problem of duplicate elimination: the task of removing multiple tuples that refer to the same real-world entity. Earlier solutions relied on standard similarity functions applied to these multiple tuples; they were, however, unreliable, often failing to produce correct results. In this paper, we present an algorithm for eliminating duplicates in dimensional tables, with a main focus on the data warehouse. The algorithm exploits dimensional hierarchies and is evaluated on real datasets drawn from an operational data warehouse (Ananthakrishna et al. 586).
Introduction
It is of paramount importance that data received in a data warehouse contain as few errors as possible, e.g. spelling mistakes or inconsistent conventions, since this reduces the effort spent on data cleaning and plays a key role in enhancing the accuracy of the decision-support analysis carried out in data warehouses. The process of identifying and eliminating identical data in the quest for quality is always tricky and problematic, because one constantly encounters entities that have multiple representations in a data warehouse. An example is a customer Lisa being entered twice in a SuperMart database after purchases, e.g. [Lisa Simpson, Seattle, WA, USA, 98025] (Ananthakrishna et al. 586).
Delphi
Delphi is an algorithm that operates as a straightforward duplicate detector. It scans for duplicate entities at each level of the hierarchy, followed by a scan over the entire hierarchy. Two entities are considered duplicates if particular pairs of tuples in each relation are in fact duplicates. In Figure 1, the Organization, City, State, and Country tables can be treated as different relations in order to determine these duplicate pairs of tuples. A top-down traversal of the hierarchy is employed: the topmost relation is processed first, and its results are used to partition the child relation into smaller groups for comparison, as the sketch below illustrates.
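To make the traversal concrete, here is a minimal Python sketch of grouping child tuples under parents that have already been merged. The dictionary layout, the children_of helper, and the pre-merged country group are illustrative assumptions, not Delphi's actual data structures.

    # Toy dimensional hierarchy: Country -> State -> City.
    # Each relation maps a tuple id to (description, parent id);
    # the topmost relation has no parent.
    country = {1: ("USA", None), 2: ("United States", None)}
    state = {10: ("WA", 1), 11: ("WA", 2)}
    city = {100: ("Seattle", 10), 101: ("Seattle", 11)}

    def children_of(relation, parent_ids):
        """Collect the child tuples whose parent lies in one group."""
        return {tid: desc for tid, (desc, pid) in relation.items()
                if pid in parent_ids}

    # Top-down: once the duplicate countries {1, 2} are merged into one
    # group, only the states under that group are compared with each
    # other, never with states under unrelated countries.
    duplicate_country_group = {1, 2}
    print(children_of(state, duplicate_country_group))  # {10: 'WA', 11: 'WA'}

This is what keeps the number of pairwise comparisons small: groups shrink as the traversal descends the hierarchy.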
In the section below, we briefly take a look at groupwise duplicate detection:
This is done by first pairing the tuples in a group G and then comparing each pair using the token containment metric (TCM) and the foreign key containment metric (FKCM). Pairs whose similarity exceeds the corresponding threshold are marked as duplicates.
While carrying out duplicate detection on a group, a subset of the group is pinpointed and its tuples are compared against those in the group. We therefore employ a potential duplicate identification filter to isolate G', the subset of all possible duplicates within the group. This is efficient since only |G| comparisons are needed to identify G', which is smaller than G (Ananthakrishna et al. 590). The sketch below illustrates the idea.
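The following is a minimal sketch of such a filter under simplifying assumptions: tuples are plain strings, similarity is unweighted token containment, and the potential_duplicates name and threshold value are ours, not the paper's.

    def token_set(text):
        """Tokenize a tuple's textual attributes (simplified)."""
        return set(text.lower().split())

    def potential_duplicates(G, threshold=0.6):
        """One comparison per tuple: a tuple whose tokens are not
        sufficiently contained in the union of the other tuples' tokens
        cannot be a duplicate of any of them, so it is dropped before
        the more expensive pairwise phase."""
        token_sets = [token_set(t) for t in G]
        G_prime = []
        for i, toks in enumerate(token_sets):
            rest = set().union(*(token_sets[:i] + token_sets[i + 1:]))
            containment = len(toks & rest) / len(toks) if toks else 0.0
            if containment >= threshold:
                G_prime.append(G[i])
        return G_prime

    group = ["Lisa Simpson Seattle", "Lisa Simson Seattle", "Bart Corp Tacoma"]
    print(potential_duplicates(group))  # the two Lisa entries survive

Only the surviving tuples in G' then enter the pairwise TCM/FKCM comparisons.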
We use token tables to efficiently compute TCM for any tuple belonging to a particular group. A token table contains the set of tokens occurring in the group, each with its frequency and a list of the tuples it appears in (Ananthakrishna et al. 591). The stop-token frequency is the frequency at which we begin disregarding a token; it is set to 10% of the number of tuples in G. A possible construction is sketched below.
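The sketch below renders the token table and TCM in Python. The 1/frequency weighting is a simplified stand-in for the paper's weighting scheme, the function names are ours, and the 10% cutoff is only meaningful for reasonably large groups.

    from collections import Counter

    def build_token_table(group_texts, stop_fraction=0.10):
        """Token table for a group G: token -> (frequency, ids of the
        tuples containing it). Tokens whose frequency reaches the
        stop-token frequency behave like stop words and are dropped."""
        freq, postings = Counter(), {}
        for tid, text in enumerate(group_texts):
            for tok in set(text.lower().split()):
                freq[tok] += 1
                postings.setdefault(tok, []).append(tid)
        stop_freq = stop_fraction * len(group_texts)
        return {t: (f, postings[t]) for t, f in freq.items() if f < stop_freq}

    def tcm(tokens_v, tokens_w, token_table):
        """Token containment metric: the weighted fraction of v's
        non-stop tokens that also occur in w; rarer tokens weigh more."""
        considered = [t for t in tokens_v if t in token_table]
        if not considered:
            return 0.0
        weight = lambda t: 1.0 / token_table[t][0]
        hit = sum(weight(t) for t in considered if t in tokens_w)
        return hit / sum(weight(t) for t in considered)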
Duplicates can also be detected using FKCM: a tuple v is a duplicate of another tuple v' in G if FKCM(v, v') exceeds the FKCM threshold. A simplified version of such a containment measure is sketched below.
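In this sketch the metric is computed over the children sets reached through foreign keys; the 1/frequency weighting and the example threshold are illustrative assumptions rather than the paper's exact definitions.

    def fkcm(children_v, children_w, child_freq):
        """Foreign key containment metric, simplified: the weighted
        fraction of v's children contained in w's children. A child
        occurring under many parents is weak evidence and weighs less."""
        if not children_v:
            return 0.0
        weight = lambda c: 1.0 / child_freq.get(c, 1)
        hit = sum(weight(c) for c in children_v if c in children_w)
        return hit / sum(weight(c) for c in children_v)

    # Two State tuples judged through their City children:
    children = {"WA#1": {"seattle", "tacoma"}, "WA#2": {"seattle"}}
    freq = {"seattle": 2, "tacoma": 1}
    fkcm_threshold = 0.8  # illustrative value, not from the paper
    print(fkcm(children["WA#2"], children["WA#1"], freq) > fkcm_threshold)  # True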
The children of the tuples in G are collected in a single children table, which contains a subset of the union of the children sets of the tuples in G. The table records every child tuple together with its frequency, as well as the tuples in the group with which the child tuple joins. Only child tuples with frequency less than the stop-children frequency are maintained. A possible construction follows.
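Under the same assumptions as above, the children table could be built roughly as follows; build_children_table and its input layout are ours, not the paper's.

    from collections import Counter

    def build_children_table(child_relation, group_ids, stop_fraction=0.10):
        """Children table for a group G of parent tuples. child_relation
        maps a child id to (description, parent id). For each child
        description the table keeps its frequency and the parents in G
        it joins with; children at or above the stop-children frequency
        are discarded, mirroring the stop-token rule for TCM."""
        freq, joins = Counter(), {}
        for desc, pid in child_relation.values():
            if pid in group_ids:
                freq[desc] += 1
                joins.setdefault(desc, set()).add(pid)
        stop_freq = stop_fraction * len(group_ids)
        return {d: (f, joins[d]) for d, f in freq.items() if f < stop_freq}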
The duplicate predictions obtained from TCM and FKCM for each pair of tuples are then combined. Partitioning the group into sets places duplicate pairs together and helps determine the representative tuple for each set; this is done by identifying maximal connected sets of duplicates, as in the union-find sketch below.
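Union-find is a standard way to obtain maximal connected sets; the sketch below assumes the duplicate pairs have already been produced by the TCM/FKCM phase.

    def connected_duplicate_sets(tuples, duplicate_pairs):
        """Group tuples into maximal connected sets of duplicates with
        union-find; each set can then elect a representative tuple."""
        parent = {t: t for t in tuples}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for a, b in duplicate_pairs:  # pairs flagged by TCM/FKCM
            parent[find(a)] = find(b)

        sets = {}
        for t in tuples:
            sets.setdefault(find(t), []).append(t)
        return list(sets.values())

    names = ["Lisa Simpson", "Lisa Simson", "L. Simpson", "Bart"]
    pairs = [("Lisa Simpson", "Lisa Simson"), ("Lisa Simson", "L. Simpson")]
    print(connected_duplicate_sets(names, pairs))
    # [['Lisa Simpson', 'Lisa Simson', 'L. Simpson'], ['Bart']]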
The top-down traversal method is then applied: relations are processed starting from the topmost one, and duplicate detection is carried out on each resulting group (Ananthakrishna et al. 591).
Dynamic thresholding is a method for deriving the limits (thresholds) for each group, applied when users are not in a position to set effective TCM and FKCM thresholds themselves. It treats each group independently, which allows qualitatively better thresholds to be set. One plausible realization is sketched below.
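The paper's exact rule is not reproduced here; as one plausible per-group calibration, a cutoff can be derived from the group's own similarity distribution, e.g. flagging values that stand far above the mean. The mean-plus-k-deviations rule and the constant k are our assumptions.

    def dynamic_threshold(similarities, k=1.5):
        """Per-group threshold sketch: values far above the group's own
        mean are treated as duplicate evidence. Mean + k standard
        deviations is an illustrative rule, not the paper's."""
        n = len(similarities)
        if n < 2:
            return 1.0  # nothing to calibrate against
        mean = sum(similarities) / n
        var = sum((s - mean) ** 2 for s in similarities) / n
        return mean + k * var ** 0.5

    sims = [0.05, 0.10, 0.08, 0.90]  # one clearly outlying pair
    print(dynamic_threshold(sims))   # ~0.82, isolating the 0.90 pair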
Discussion
Small children sets can be sources of error; these are detected by modifying the construction and processing of the children table. Correlated errors may also introduce spurious duplicates; they are corrected, at significant computational overhead, by measuring co-occurrence through lower-level relations (Ananthakrishna et al. 595), as the sketch below illustrates.
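As a rough illustration of the co-occurrence idea, and not the paper's actual measure, one can compare the lower-level descendant sets of two candidate parents; Jaccard overlap is our stand-in.

    def cooccurrence(descendants_v, descendants_w):
        """If two parents share many descendants at a lower level, a
        match between them is more likely genuine than a coincidental,
        correlated error (e.g. two different states each containing a
        city named Springfield)."""
        union = descendants_v | descendants_w
        return len(descendants_v & descendants_w) / len(union) if union else 0.0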
Experimental Evaluation
The quality and efficiency of Delphi are evaluated by introducing equivalence, spelling, and truncation errors.
Conclusions
In summary, dimensional hierarchies in data warehouses can be exploited to develop an efficient algorithm for detecting fuzzy duplicates (Ananthakrishna et al. 597).
Reference
Ananthakrishna, Rohit, Surajit Chaudhuri, and Venkatesh Ganti. "Eliminating Fuzzy Duplicates in Data Warehouses." Proceedings of the 28th International Conference on Very Large Data Bases (VLDB 2002): 586-597. Print.