Accurate, high-quality data is critical for any application that relies heavily on data, be it machine learning, artificial intelligence, or digital twins. Poor-quality or erroneous data can result in inaccurate predictions even if the model is otherwise robust. Ensuring data quality is even more critical in real-time applications, where there is no human in the loop to perform sense checks on the data or results. A real-time digital twin implementation for a floating system uses time-series data from numerous measurements such as wind, waves, GPS, vessel motions, mooring tensions, and draft. Statistics computed from these data are used in the digital twin. An extensive data checking and cleaning routine was written that performs quality checks and corrections on the time-series data before the statistics are computed.
Errors that typically occur in a time series include noise, flat-lined data, clipped data, outliers, and discontinuities. Statistical procedures were developed to check the raw time series for all of these errors. The procedures are generic and robust so that they can be used for different types of data. Some data types are slowly varying (e.g., GPS), while others are fast-varying random processes. A measurement classified as an error in one type of data is not necessarily an error in another. For example, GPS data can be discontinuous by nature, but a discontinuity in the wave data indicates an error. Likewise, checking for white noise in mooring tension data is not particularly meaningful. We therefore developed parameterized procedures so that the same routine can handle different types of data and their errors. Outlier removal routines use the standard deviation of the time series, which can itself be biased by the errors. Therefore, a method to compute unbiased statistics from the raw data was developed and implemented for robust outlier removal.
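As an illustration of how such parameterized checks might look, the following is a minimal sketch and not the routine described above. It assumes a scale estimate based on the median absolute deviation (MAD) as a stand-in for the unbiased statistics, and all function names, thresholds, and the per-channel settings in `CHANNEL_CONFIG` are hypothetical values chosen only for the example.

```python
import numpy as np

# Hypothetical per-channel settings: the same routine, different parameters.
CHANNEL_CONFIG = {
    "wave": {"n_sigma": 5.0, "min_run": 10},
    "gps": {"n_sigma": 8.0, "min_run": 600},
}


def robust_sigma(x):
    """Scale estimate from the MAD (assumed stand-in for unbiased statistics).

    The factor 1.4826 makes the MAD consistent with the standard
    deviation for normally distributed data.
    """
    x = np.asarray(x, dtype=float)
    med = np.nanmedian(x)
    return 1.4826 * np.nanmedian(np.abs(x - med))


def flag_flatline(x, min_run=10, tol=0.0):
    """Flag samples that belong to runs of at least min_run constant values."""
    x = np.asarray(x, dtype=float)
    mask = np.zeros(x.shape, dtype=bool)
    if x.size < 2:
        return mask
    same = np.abs(np.diff(x)) <= tol  # same[i] is True if x[i+1] repeats x[i]
    run_start = 0
    for i in range(1, x.size + 1):
        # A run ends at the end of the series or where the value changes.
        if i == x.size or not same[i - 1]:
            if i - run_start >= min_run:
                mask[run_start:i] = True
            run_start = i
    return mask


def flag_outliers(x, n_sigma=5.0):
    """Flag samples farther than n_sigma robust deviations from the median."""
    x = np.asarray(x, dtype=float)
    sigma = robust_sigma(x)
    if sigma == 0.0:
        return np.zeros(x.shape, dtype=bool)
    return np.abs(x - np.nanmedian(x)) > n_sigma * sigma


def clean(x, n_sigma=5.0, min_run=10):
    """Return (cleaned series with flagged samples set to NaN, flag mask)."""
    x = np.asarray(x, dtype=float)
    flat = flag_flatline(x, min_run=min_run)
    work = x.copy()
    work[flat] = np.nan  # keep flat-lined samples out of the scale estimate
    bad = flat | flag_outliers(work, n_sigma=n_sigma)
    cleaned = x.copy()
    cleaned[bad] = np.nan
    return cleaned, bad
```

A channel would then be cleaned with its own settings, for example `clean(series, **CHANNEL_CONFIG["wave"])`. Because the median and the MAD are only weakly affected by a small fraction of bad samples, the outlier threshold does not inflate the way a threshold based on the raw standard deviation would.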
Extensive testing on years of measured data and on hundreds of data channels was performed to ensure that the data cleaning procedures function as intended. Statistics (mean, standard deviation, maximum, and minimum) were computed from both the raw and the cleaned data. The comparison showed significant differences between the raw and cleaned statistics, with the latter being more accurate.
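The comparison of raw and cleaned statistics can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and it assumes that samples rejected by the cleaning step are marked as NaN in the cleaned series.

```python
import numpy as np


def summary_stats(x):
    """Mean, standard deviation, maximum, and minimum, ignoring NaN samples."""
    x = np.asarray(x, dtype=float)
    return {
        "mean": float(np.nanmean(x)),
        "std": float(np.nanstd(x)),
        "max": float(np.nanmax(x)),
        "min": float(np.nanmin(x)),
    }


def compare_stats(raw, cleaned):
    """Tabulate each statistic for the raw and cleaned series and their difference."""
    raw_s = summary_stats(raw)
    clean_s = summary_stats(cleaned)
    return {
        name: {
            "raw": raw_s[name],
            "cleaned": clean_s[name],
            "diff": raw_s[name] - clean_s[name],
        }
        for name in raw_s
    }
```

A single uncorrected spike is enough to shift the maximum and the standard deviation of a record, which is why the difference column is most informative for those two statistics.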
Data cleaning, while it may not sound as high-tech as other analytics algorithms, is a critical foundation of any data science application. Using cleaned time-series data and the corresponding statistics ensures that a data analytics model provides actionable results. Clean data and statistics help achieve the intended purpose of the digital twin, which is to inform operators of the health and condition of the asset and to flag any anomalous events.