When a data warehouse is created, not enough attention is usually paid to cleaning the information that flows into it. Apparently, it is assumed that the bigger the storage, the better. This is a surefire way to turn your data warehouse into a garbage dump.
Data must be cleaned! After all, information is heterogeneous and almost always collected from a variety of sources. It is precisely the presence of multiple collection points that makes the cleaning process so complex and pressing.
Errors will always occur, and it is impossible to get rid of them completely. Sometimes it may even make sense to put up with them rather than spend time fixing them. In general, though, we should strive to reduce the number of errors to an acceptable level. The analysis methods themselves are already fraught with inaccuracies, so why aggravate the situation?
In addition, we need to take the psychological aspect of the problem into account. If an analyst or decision-maker is not confident in the numbers they get from the data storage, they will rely on information obtained from other sources. This severely reduces the value of the storage.
We will not consider simple errors such as type mismatches or differences in input formats and encodings. These are cases where information comes from several sources and different conventions are adopted to denote the same fact.
A typical example of such an error is the designation of a person's gender: somewhere it is denoted as M/W, somewhere as 1/0, and somewhere as True/False. To prevent this kind of error, we can define rules for transcoding and typecasting, so such errors are relatively easy to resolve. We are interested in more complex problems that cannot be solved in such elementary ways.
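A transcoding rule like this can be sketched as a simple lookup table that maps every source-specific encoding onto one canonical value. The mapping below uses the encodings mentioned above (M/W, 1/0, True/False); which raw code corresponds to which gender is an assumption for illustration — in practice, each source's documentation defines that.

```python
# Canonical transcoding table for gender values arriving from different
# sources. The specific correspondences (e.g. True -> "male") are assumed
# here for illustration only.
GENDER_MAP = {
    "M": "male", "W": "female",
    "1": "male", "0": "female",
    "True": "male", "False": "female",
}

def normalize_gender(raw):
    """Return the canonical gender value, or None for unknown encodings."""
    # str() folds 1, True, "1", etc. into a single key space.
    return GENDER_MAP.get(str(raw).strip())
```

Unknown encodings deliberately come back as `None` rather than a guess, so they can be flagged for review instead of silently polluting the storage.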
The types of complex errors are numerous. In addition to universal errors, some are specific to a particular subject area or task. Here, let us look at those that do not depend on the task.
There are proven methods for solving each of these problems. Of course, errors can be corrected manually, but with large amounts of data this becomes time-consuming and problematic. Therefore, we will consider ways to solve these problems automatically, with minimal human involvement.
First, we need to decide what exactly to consider a contradiction. Oddly enough, this is a non-trivial task. For example, in Russia, a pension card needs to be changed if someone changes their last name, first name, patronymic or gender. It turns out that there is no contradiction in the fact that a person was born as a woman and retired as a man!
After we decide what to consider a contradiction and detect such records, there are several options for action.
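The pension example shows that what counts as a contradiction is a business decision, not a technical one. A minimal sketch of contradiction detection, then, is to declare explicitly which fields are immutable for a given key and flag only conflicts in those. The record layout and the choice of `birth_date` as the immutable field are assumptions for illustration — a gender change, as in the example above, is deliberately not flagged.

```python
from collections import defaultdict

# Fields that, by business rule, must never differ between records with
# the same ID. "birth_date" is a hypothetical choice for this sketch;
# note that "gender" is intentionally absent.
IMMUTABLE_FIELDS = {"birth_date"}

def find_contradictions(records):
    """Return (id, field, first_value, conflicting_value) tuples for
    every record that disagrees with an earlier one on an immutable field."""
    seen = defaultdict(dict)   # id -> {field: first value observed}
    conflicts = []
    for rec in records:
        for field in IMMUTABLE_FIELDS:
            prev = seen[rec["id"]].get(field)
            if prev is None:
                seen[rec["id"]][field] = rec[field]
            elif prev != rec[field]:
                conflicts.append((rec["id"], field, prev, rec[field]))
    return conflicts
```

Once the conflicting records are listed, any of the usual follow-up actions (keep the newest value, escalate to a human, quarantine the record) can be applied to that list.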
This problem is the scourge of many data storages. Most forecasting methods assume that data arrives in a constant, uniform stream; in practice, this rarely happens. As a result, one of the most popular applications of data storages, forecasting, is implemented poorly or with significant limitations. There are several methods for combating this phenomenon.
Quite often, we come across data or events that are significantly out of character, for example, a product price 10 times higher than the average. Such values need to be adjusted. The trouble is that analysis algorithms know nothing about the nature of the underlying processes, so any anomaly will be perceived as a perfectly normal value. Because of this, the model will be greatly distorted: a random failure or success will be taken for a pattern.
There is a way to deal with this problem: robust estimates, methods that are resistant to strong perturbations. A median filter can serve as an example.
We evaluate the available data and apply a corrective action to everything that falls outside the acceptable limits.
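The median filter mentioned above can be sketched in a few lines: each point is replaced by the median of its neighborhood, so a single spike cannot drag the result the way it would drag a mean. The window size of 3 is a common minimal choice, not a prescription.

```python
import statistics

def median_filter(values, window=3):
    """Replace each value with the median of its surrounding window —
    a classic robust estimate: one outlier cannot dominate a median."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(statistics.median(values[lo:hi]))
    return out
```

Applied to a series like `[10, 12, 11, 110, 12, 11, 10]`, the anomalous `110` is replaced by a value from its neighborhood, while the rest of the series is barely disturbed.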
When carrying out an analysis, we almost always come across noise. Most often, the noise carries no useful information and only prevents us from seeing the picture clearly. There are several methods for combating this phenomenon.
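A simple illustration of one such method is a moving average: averaging each point with its neighbors damps high-frequency noise at the cost of some fine detail. This is only one option among many (frequency-domain filtering, wavelets, and others follow the same goal); the window size here is again an assumed minimal choice.

```python
def moving_average(values, window=3):
    """Smooth a series by replacing each point with the mean of its
    window — a basic low-pass filter that suppresses jitter."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

Note the trade-off: unlike the median filter, the mean is not robust, so this smoothing suits small random noise rather than large anomalies.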
This is a topic for a separate conversation. The types of such errors are too numerous: typos, deliberate distortion of data, inconsistent formats, plus errors related to the peculiarities of the data-entry application.
There are proven methods for dealing with most of them. Some are obvious: for example, we can check formats before adding data to the storage. Others are more sophisticated: for instance, we can correct typos using various kinds of thesauri. In any case, these types of errors need to be cleaned up.
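Thesaurus-based typo correction can be sketched with the standard library: snap each incoming value to the closest entry in a reference vocabulary, and leave it untouched when nothing is close enough. The city list and the similarity cutoff below are invented for illustration; a real vocabulary would come from the storage's reference data.

```python
import difflib

# Hypothetical reference vocabulary for this sketch.
CITIES = ["Moscow", "Saint Petersburg", "Novosibirsk"]

def correct(value, vocabulary, cutoff=0.8):
    """Return the closest vocabulary entry if it is similar enough,
    otherwise return the value unchanged for manual review."""
    match = difflib.get_close_matches(value, vocabulary, n=1, cutoff=cutoff)
    return match[0] if match else value
```

With this, a typo such as `"Moscwo"` is corrected to `"Moscow"`, while a value that resembles nothing in the vocabulary passes through unchanged rather than being forced onto a wrong entry.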
Dirty data is a very big problem. In fact, it can negate all our efforts to fill the data storage. At the same time, we should keep in mind that we are talking not about a one-time cleanup but about constant work in this direction. As the saying goes, "to keep the place clean, you don't need to tidy it up but to stop making a mess there". The ideal option is to build special procedures into the ETL process that ensure the data is cleaned before being loaded into the storage.
If data was changed during the cleaning process, it must be labeled so that this can be taken into account in subsequent analysis. Otherwise, there is a risk of relying on it as real information, which can lead to incorrect conclusions.
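Such labeling can be as simple as carrying a per-record list of cleaned fields alongside the data. The field name `_cleaned_fields` and the record layout below are assumptions for this sketch; the point is only that every corrected value stays distinguishable from an observed one.

```python
def apply_correction(record, field, new_value):
    """Replace a value and record that it was changed during cleaning.
    Returns a new dict; the original record is left untouched."""
    corrected = dict(record)
    corrected[field] = new_value
    # Accumulate the names of all fields touched by cleaning.
    flags = set(corrected.get("_cleaned_fields", ()))
    flags.add(field)
    corrected["_cleaned_fields"] = sorted(flags)
    return corrected
```

Downstream analysis can then exclude or down-weight records whose `_cleaned_fields` list is non-empty, instead of treating reconstructed values as real observations.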
The solutions described above are not the only ones; there are quite a few other processing methods that can help improve data quality, ranging from expert systems to neural networks. We should also keep in mind that cleaning methods can be strongly tied to the subject area: what is noise to some is highly valuable information to others.
In view of the above, the ETL process should be understood more broadly, not merely as procedures for extracting data from sources, transforming it minimally, and loading it into the storage. Data cleaning should be a mandatory stage of the work. The value of the storage is determined not so much by the volume of data as by the quality of the collected information.