Data transformation and discretization are critical steps in the data preprocessing pipeline. They prepare raw data for analysis by converting it into forms suitable for mining, improving the efficiency and accuracy of data mining algorithms. This article dives deep into the concepts, techniques, and practical applications of data transformation and discretization.
1. What is Data Transformation?
Data transformation involves converting data into appropriate forms for mining. This step is essential because raw data is often noisy, inconsistent, or unsuitable for direct analysis. Common data transformation strategies include normalization, discretization, and concept hierarchy generation.
Normalization scales numeric attributes to a specific range, such as [0.0, 1.0] or [-1.0, 1.0]. This is particularly useful for distance-based mining algorithms (e.g., k-nearest neighbors, clustering) to prevent attributes with larger ranges from dominating those with smaller ranges.
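3.1.1 Min-Max Normalization
Formula:
\[ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A \]
v: Original value of the attribute.
v': Normalized value of the attribute.
min_A: Minimum value of attribute A.
max_A: Maximum value of attribute A.
new_min_A: Minimum value of the new range (e.g., 0.0).
new_max_A: Maximum value of the new range (e.g., 1.0).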
Example:
Suppose the attribute "income" has a minimum value of $12,000 and a maximum value of $98,000.
We want to normalize an income value of $73,600 to the range [0.0, 1.0].
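\[ v' = \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0.0 = \frac{61{,}600}{86{,}000} \approx 0.716 \]
The normalized value is approximately 0.716.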
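The income example can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation; the function name `min_max_normalize` and its parameter names are chosen here for illustration only.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from the range [min_a, max_a] onto [new_min, new_max]."""
    if max_a == min_a:
        # All values are identical, so the scaling factor is undefined.
        raise ValueError("min_a and max_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# Income example: minimum $12,000, maximum $98,000, value $73,600.
normalized_income = min_max_normalize(73_600, 12_000, 98_000)
print(round(normalized_income, 3))  # 0.716
```

Passing `new_min=-1.0, new_max=1.0` would map the same value onto the range [-1.0, 1.0] instead.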
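Z-score normalization (also called zero-mean normalization) rescales the values of attribute A using its mean and standard deviation.
Formula:
\[ v' = \frac{v - \bar{A}}{\sigma_A} \]
\( \bar{A} \): Mean of attribute A.
\( \sigma_A \): Standard deviation of attribute A.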
Example:
Suppose the mean income is $54,000 and the standard deviation is $16,000.
We want to normalize an income value of $73,600.
Using the formula:
\[ v' = \frac{73{,}600 - 54{,}000}{16{,}000} = \frac{19{,}600}{16{,}000} = 1.225 \]
The normalized value is 1.225.
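The same calculation, z-score normalization of the income value, can be sketched in Python. The helper names below are illustrative. The use of the population standard deviation (`pstdev`) in `z_score_column` is an assumption, since the text does not say whether the population or the sample form is intended.

```python
from statistics import fmean, pstdev


def z_score_normalize(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    if std_a == 0:
        raise ValueError("standard deviation must be non-zero")
    return (v - mean_a) / std_a


def z_score_column(values):
    """Normalize a whole column, estimating mean and std from the data.

    Assumption: population standard deviation (pstdev) is used.
    """
    mean_a = fmean(values)
    std_a = pstdev(values)
    return [z_score_normalize(v, mean_a, std_a) for v in values]


# Income example: mean $54,000, standard deviation $16,000, value $73,600.
z = z_score_normalize(73_600, 54_000, 16_000)
print(round(z, 3))  # 1.225
```

Unlike min-max normalization, the result is not confined to a fixed range, which can be useful when the true minimum and maximum are unknown or when outliers dominate them.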
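Normalization by decimal scaling moves the decimal point of the values of attribute A.
Formula:
\[ v' = \frac{v}{10^j} \]
j: Smallest integer such that \( \max(|v'|) < 1 \).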
Example:
Suppose the attribute "price" has values ranging from -986 to 917.
The maximum absolute value is 986.
The smallest integer \( j \) such that \( 986 / 10^j < 1 \) is \( j = 3 \).
Normalize the value -986:
\[ v' = \frac{-986}{10^3} = -0.986 \]
The normalized value is -0.986.
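A short Python sketch of decimal scaling for the price example follows. The function names are illustrative, and the search for j starts at 0, which assumes j is non-negative (values already below 1 in absolute value are left unchanged).

```python
def decimal_scaling_exponent(values):
    """Smallest non-negative integer j such that max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10**j >= 1:
        j += 1
    return j


def decimal_scale(values):
    """Divide every value by 10**j, with j chosen from the data."""
    j = decimal_scaling_exponent(values)
    return [v / 10**j for v in values]


# Price example: values range from -986 to 917.
prices = [-986, 917]
print(decimal_scaling_exponent(prices))  # 3
print(decimal_scale(prices))             # [-0.986, 0.917]
```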
Discretization replaces numeric values with interval or conceptual labels. This is useful for simplifying data and making patterns easier to understand.
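3.2.1 Binning
Binning divides the range of an attribute into bins (intervals). There are two main types: equal-width binning, which splits the range into intervals of the same size, and equal-frequency (equal-depth) binning, which places roughly the same number of values in each bin.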
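As a rough illustration, the sketch below implements two common binning strategies, equal-width and equal-frequency, in plain Python. The function names and the sample `prices` list are made up for this example, and each function returns a bin index per value rather than interval labels.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values are identical, so everything falls in one bin.
        return [0 for _ in values]
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign bin indices so each bin holds roughly len(values) / k values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


# Illustrative data only.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(prices, 3))      # [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(equal_frequency_bins(prices, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Equal-width bins are sensitive to outliers, because a single extreme value stretches every interval. Equal-frequency bins adapt to the data, but this simple version may split tied values across neighboring bins.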
Histograms partition the values of an attribute into disjoint ranges (buckets). The histogram analysis algorithm can be applied recursively to generate a multilevel concept hierarchy.
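One way to picture the recursive application is the sketch below, which splits a range into equal-width buckets and then splits each bucket again. It is an illustration under stated assumptions rather than a standard algorithm: the function name, the nested-dict output, the fixed `buckets` and `depth` parameters, and the sample `ages` list are all invented for this example.

```python
def histogram_hierarchy(values, lo, hi, buckets=2, depth=2):
    """Recursively partition [lo, hi] into equal-width buckets.

    Each node records its range, how many values fall in it, and its
    child buckets. Buckets are half-open [b_lo, b_hi), except the last
    one at each level, which also includes its upper bound.
    """
    node = {"range": (lo, hi), "count": len(values), "children": []}
    if depth == 0 or not values or lo == hi:
        return node
    width = (hi - lo) / buckets
    for b in range(buckets):
        is_last = b == buckets - 1
        b_lo = lo + b * width
        b_hi = hi if is_last else lo + (b + 1) * width
        members = [v for v in values
                   if b_lo <= v < b_hi or (is_last and v == hi)]
        node["children"].append(
            histogram_hierarchy(members, b_lo, b_hi, buckets, depth - 1)
        )
    return node


# Illustrative data only.
ages = [3, 12, 25, 37, 48, 51, 66, 79, 92, 100]
tree = histogram_hierarchy(ages, 0, 100)
print([child["count"] for child in tree["children"]])  # [5, 5]
```

Each level of the resulting tree corresponds to one level of a concept hierarchy: the root covers the full range, its children cover coarse intervals, and their children cover finer ones.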
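Concept hierarchies generalize nominal attributes to higher-level concepts (e.g., street → city → country). They can be generated manually or automatically based on the number of distinct values per attribute: the attribute with the most distinct values is placed at the lowest level of the hierarchy, and the attribute with the fewest at the top.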
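The automatic approach can be sketched as a sort on distinct-value counts. The function name and the sample records below are hypothetical. The ordering is only a heuristic, because an attribute with few distinct values is not always the most general one.

```python
def hierarchy_by_distinct_values(records, attributes):
    """Order attributes from most general to most specific.

    Heuristic: fewer distinct values means a higher (more general) level.
    """
    distinct_counts = {a: len({r[a] for r in records}) for a in attributes}
    return sorted(attributes, key=lambda a: distinct_counts[a])


# Hypothetical sample records, for illustration only.
records = [
    {"street": "1 Main St", "city": "Springfield", "country": "USA"},
    {"street": "9 Oak Ave", "city": "Springfield", "country": "USA"},
    {"street": "4 Elm Rd", "city": "Portland", "country": "USA"},
    {"street": "7 Rue Cler", "city": "Paris", "country": "France"},
    {"street": "2 Rue Monge", "city": "Paris", "country": "France"},
]
levels = hierarchy_by_distinct_values(records, ["street", "city", "country"])
print(" -> ".join(reversed(levels)))  # street -> city -> country
```

In practice the generated ordering is usually reviewed by someone who knows the domain, and levels are swapped by hand where the heuristic gets it wrong.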
Data transformation and discretization are essential steps in data preprocessing. They improve data quality, enhance mining efficiency, and facilitate better insights. By normalizing, discretizing, and generating concept hierarchies, you can transform raw data into a form that is ready for analysis.