Analyzing numerical data validating identification numbers
Data cleaning is the process of preventing and correcting errors in the data.
Common tasks include record matching, identifying inaccurate data, and assessing the overall quality of existing data. Such data problems can also be identified through a variety of analytical techniques.
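As an illustration of record matching, the following is a minimal sketch that treats two records as the same entity when their names agree after case and whitespace are normalized. The function names and the sample customer list are hypothetical, and real record matching usually needs fuzzier comparison than this exact-match rule.

```python
def normalize(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def find_matches(records):
    """Return (first_index, later_index) pairs whose normalized names agree."""
    first_seen = {}
    matches = []
    for idx, name in enumerate(records):
        key = normalize(name)
        if key in first_seen:
            matches.append((first_seen[key], idx))
        else:
            first_seen[key] = idx
    return matches


# Hypothetical customer names, invented for illustration only.
customers = ["Acme Corp", "acme  corp", "Globex", " ACME CORP "]
duplicate_pairs = find_matches(customers)
```

Each pair points back to the first occurrence of a record, so an analyst can review the later entries before merging or removing them.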
Mathematical formulas or models called algorithms may be applied to the data to identify relationships among the variables, such as correlation or causation.
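One common way to quantify a relationship between two numerical variables is the Pearson correlation coefficient. The sketch below computes it from first principles; the function name and sample values are assumptions made for illustration. A coefficient near 1 or -1 indicates a strong linear association, but it does not by itself establish causation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical values: ys is exactly 2 * xs, so the association is perfectly linear.
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson_r([1, 2, 3], [3, 2, 1])
```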
In general terms, models may be developed to evaluate a particular variable in the data based on other variable(s) in the data, with some residual error depending on model accuracy (i.e., Data = Model + Error).
Data may be numerical or categorical (i.e., a text label for numbers). The requirements may be communicated by analysts to custodians of the data, such as information technology personnel within an organization.
In mathematical terms, Y (sales) is a function of X (advertising).
It may be described as Y = aX + b + error, where the model is designed such that a and b minimize the error when the model predicts Y for a given range of values of X.
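Reading the model as Y = aX + b + error, one standard way to choose a and b is ordinary least squares, which minimizes the sum of squared errors. The sketch below assumes that criterion (the text does not name a specific one), and the advertising and sales figures are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of Y = a*X + b; returns (a, b)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("X must take at least two distinct values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = y_bar - a * x_bar    # intercept
    return a, b


# Hypothetical advertising spend (X) and sales (Y), invented for illustration.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_line(advertising, sales)

# The error term for each observation: Data minus Model.
residuals = [y - (a * x + b) for x, y in zip(advertising, sales)]
```

With an intercept in the model, the least squares residuals sum to zero (up to floating-point rounding), which is a quick sanity check on the fit.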
For example, with financial information, the totals for particular variables may be compared against separately published numbers believed to be reliable.
Unusual amounts above or below pre-determined thresholds may also be reviewed.
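The two checks just described can be sketched as follows: one compares a computed total against a separately published figure, and the other lists amounts that fall outside pre-determined thresholds. The function names, tolerance, thresholds, and amounts are all hypothetical.

```python
from math import fsum


def total_matches(values, published_total, tolerance=0.01):
    """True if the sum of values is within tolerance of the published total."""
    return abs(fsum(values) - published_total) <= tolerance


def flag_unusual(values, lower, upper):
    """Return (index, value) pairs for amounts below lower or above upper."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical transaction amounts, invented for illustration.
amounts = [120.0, 95.5, 4300.0, 88.25, -15.0]

totals_agree = total_matches(amounts, published_total=4588.75)
to_review = flag_unusual(amounts, lower=0.0, upper=1000.0)
```

Flagged amounts are candidates for review rather than confirmed errors; a large or negative value may be legitimate.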
The data are necessary as inputs to the analysis, which is specified based upon the requirements of those directing the analysis or of the customers who will use its finished product.