For a data analysis to yield meaningful results, the data must have adequate integrity: it must be complete, correct and consistent. For example, all order lines to be handled by the system must be recorded in the transaction data, and the product dimensions recorded in the master data should be as complete as possible and always refer to the same product unit (usually the picking unit).

Overview of scope of products with incomplete dimensions in the article master (example)

If product dimensions are missing, the first thing to check is how many and which products are affected. If only a few products are affected, and these are never or rarely sold anyway, the incomplete master data is not a major problem. If fast-moving items are among them, however, the missing data must be collected retrospectively or, more commonly, assumptions must be made by the analyst together with the customer.

Overview of proportion of products with incomplete dimensions in order lines and quantities (example)
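Such a completeness check can be sketched in a few lines. The following is a minimal illustration, not a real extract: the product IDs, dimension fields and order lines are invented, and real master data would come from a database or file.

```python
# Sketch: flag products with incomplete dimensions and measure how much
# of the order-line volume they affect. All data here is illustrative.
products = {
    "A": {"length": 30.0, "width": 20.0, "height": 10.0},
    "B": {"length": None, "width": 15.0, "height": 5.0},   # incomplete
    "C": {"length": 40.0, "width": None, "height": None},  # incomplete
}
order_lines = ["A", "A", "B", "A", "C", "A"]  # one product ID per order line

# Products for which at least one dimension is missing.
incomplete = {pid for pid, dims in products.items()
              if any(v is None for v in dims.values())}

# How much of the transaction volume do they represent?
affected_lines = sum(1 for pid in order_lines if pid in incomplete)
share = affected_lines / len(order_lines)

print(f"{len(incomplete)} of {len(products)} products incomplete")
print(f"{share:.0%} of order lines affected")
```

Weighting the gaps by order-line frequency is the crucial step: it separates the harmless case (rarely sold products) from the critical one (fast movers).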

Frequently, data is complete but incorrect. This applies in particular to product dimensions: if master data quality was of little importance in the previous, manually operated system, the accuracy of data entry has often suffered. It is therefore important to identify outliers and check them individually. To this end, various plausibility checks are performed to flag presumably or obviously incorrect values. Where incorrect entries are found, it must be decided whether to correct the values or to exclude the affected products from the analysis; the importance of these products to system throughput is essential to this decision.

Plausibility check: distribution of product dimensions with obvious outliers (example)
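One common form of such a plausibility check is a simple distribution-based outlier rule. The sketch below uses the interquartile-range (IQR) criterion on invented length values; real checks would also include hard sanity bounds (e.g. no zero or negative dimensions, nothing larger than the picking unit allows).

```python
# Sketch: flag product dimensions that lie far outside the bulk of the
# distribution (1.5 * IQR rule). Values are illustrative, in millimetres.
import statistics

lengths_mm = [120, 135, 128, 60000, 140, 118, 125, 0, 132]

def iqr_outliers(values, k=1.5):
    """Return values more than k * IQR outside the quartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

suspects = iqr_outliers(lengths_mm)
print("Values to check manually:", suspects)  # flags 60000 and 0
```

The flagged values are not corrected automatically; they are the shortlist for the individual manual check described above.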

Sometimes the plausibility check reveals that the numerical values are correct but the units are not specified consistently. The units must then be standardized.

Different units of measurement in the master data require standardization
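Standardization usually means converting everything to one base unit before any further analysis. A minimal sketch, assuming metric length units and millimetres as the target unit (the unit codes and sample values are invented):

```python
# Sketch: normalize mixed units of measure to millimetres.
# Unit codes and conversion factors are assumptions for this example.
TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}

def to_mm(value, unit):
    try:
        return value * TO_MM[unit]
    except KeyError:
        # Fail loudly rather than silently keep an unconverted value.
        raise ValueError(f"Unknown unit of measure: {unit!r}")

raw = [(30.0, "cm"), (0.4, "m"), (250.0, "mm")]  # (value, unit) pairs
lengths_mm = [to_mm(v, u) for v, u in raw]
print(lengths_mm)  # [300.0, 400.0, 250.0]
```

Raising an error on an unknown unit code is deliberate: an unrecognized unit is itself a data-quality finding, not something to convert with a guessed factor.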

Furthermore, the data set must be comprehensive enough to draw robust conclusions. For warehouse planning, a data set should cover at least two months, but preferably six to twelve months. The resulting data sets can become so large that they can no longer be opened in the usual way with Office software such as Microsoft Excel (although they can still be processed with Power Query or Access, if you want to stay within Office). Data that depicts only a short time horizon is often subject to seasonal effects, which impair the validity of the analyses because the observed behavior is not representative of business over the course of the year. Imagine looking at a retailer’s sales data for November and December only: the effect of Christmas sales is likely to create a strong bias and not provide a meaningful basis for planning the rest of the year.
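When a file is too large for spreadsheet tools, one alternative is to stream it and aggregate on the fly instead of loading it whole. A minimal sketch with Python's standard csv module; the file contents, column names and aggregation are illustrative (an in-memory buffer stands in for a multi-gigabyte file on disk):

```python
# Sketch: process a large order-line CSV row by row, keeping memory flat.
import csv
import io

# Stand-in for a large file; with a real file use open("orders.csv").
data = io.StringIO(
    "date,product,qty\n"
    "2023-01-02,A,5\n"
    "2023-01-03,B,2\n"
)

total_qty = 0
for row in csv.DictReader(data):
    total_qty += int(row["qty"])  # aggregate without holding all rows

print("Total quantity:", total_qty)
```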

Data sets regularly contain individual order lines that were picked on the weekend (or merely posted then), even though business is normally limited to Monday through Friday. These lines must be cleaned out so that they do not distort the analysis.
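Filtering such lines out is straightforward once the posting date is parsed. A small sketch, assuming ISO-formatted dates and a Monday-to-Friday business week (the dates and quantities are invented):

```python
# Sketch: drop order lines posted on Saturday or Sunday.
from datetime import date

order_lines = [
    ("2023-03-03", 10),  # Friday   -> keep
    ("2023-03-04", 2),   # Saturday -> drop
    ("2023-03-06", 7),   # Monday   -> keep
]

def is_weekday(iso_day):
    # weekday(): 0 = Monday ... 6 = Sunday
    return date.fromisoformat(iso_day).weekday() < 5

cleaned = [(d, q) for d, q in order_lines if is_weekday(d)]
print(cleaned)
```

In practice it is worth inspecting the dropped lines briefly first: systematic weekend postings can also indicate a batch job that books Friday's picks on Saturday, which would be kept, not deleted.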

Thus, the quality of the data set has a significant impact on the effort that must be expended to draw meaningful conclusions from an analysis. The effort required to clean the data set and the demands on data integrity and quality are of course dependent on the question under investigation. Nevertheless, it can generally be stated that a good data set considerably reduces the analysis effort – and thus the costs.

Also take a look at our handout on data collection for data analysis. We are happy to support you with your data collection!