This notebook is an appendix to our study. Its aim is to demonstrate the data characteristics of the SKAB dataset. To extract this information, we perform Exploratory Data Analysis (EDA) on the data using DataPrep.EDA [1], an easy-to-use tool that is well integrated with Python and Jupyter Notebook and lets us inspect and understand the data interactively.
[1] Jinglin Peng, Weiyuan Wu, Brandon Lockhart, Song Bian, Jing Nathan Yan, Linghao Xu, Zhixuan Chi, Jeffrey M. Rzeszotarski, and Jiannan Wang. DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python. SIGMOD 2021.
import pandas as pd
from dataprep.eda import create_report
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
Skoltech Anomaly Benchmark (SKAB) is a multivariate dataset designed for evaluating anomaly detection algorithms. The dataset contains both point outliers and changepoints.
The data is collected from a water circulation system testbed that simulates a real industrial scenario with its control system. Anomalies are induced in the system by partially closing valves, varying the temperature, reducing motor power, drastically changing the water level, and creating scenarios that lead to cavitation (the formation of small vapor-filled cavities in the liquid).
# A forward slash avoids the invalid '\d' escape sequence and works on every platform
ds = pd.read_csv('SKAB/ds.csv', index_col=0)
The data analysis can be run with the following command. The report consists of the following sections:
protocol
feature belongs to this category. Here only the Stats, PieChart and Word Frequency tabs carry information, as word length is not important in the case of this feature.
create_report(ds)
Number of Variables | 9 |
---|---|
Number of Rows | 45001 |
Missing Cells | 0 |
Missing Cells (%) | 0.0% |
Duplicate Rows | 701 |
Duplicate Rows (%) | 1.6% |
Total Size in Memory | 5.9 MB |
Average Row Size in Memory | 136.7 B |
Variable Types | Numerical: 8, Categorical: 1 |
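
The overview figures above can be cross-checked with plain pandas. The following is a minimal sketch, not DataPrep's own implementation: the helper name `dataset_overview` and the toy frame are hypothetical, and it assumes that a duplicate row means any repeat of an earlier row, which may differ from how DataPrep counts them.

```python
import pandas as pd

def dataset_overview(df: pd.DataFrame) -> dict:
    """Recompute a few overview figures (hypothetical helper, plain pandas)."""
    n_rows, n_cols = df.shape
    missing_cells = int(df.isna().sum().sum())
    # duplicated() flags every repeat of an earlier row; the first copy is not counted
    duplicate_rows = int(df.duplicated().sum())
    return {
        "variables": n_cols,
        "rows": n_rows,
        "missing_cells": missing_cells,
        "missing_cells_pct": 100.0 * missing_cells / (n_rows * n_cols),
        "duplicate_rows": duplicate_rows,
        "duplicate_rows_pct": 100.0 * duplicate_rows / n_rows,
    }

# Tiny stand-in frame (not SKAB data), used only to show the call
toy = pd.DataFrame({"a": [1.0, 1.0, 2.0, None], "b": [0, 0, 1, 1]})
overview = dataset_overview(toy)
print(overview)
```

Calling `dataset_overview(ds)` on the loaded SKAB frame should give figures comparable to the table above, for example 701 / 45001 ≈ 1.56% duplicate rows.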
Accelerometer1RMS and Accelerometer2RMS have similar distributions | Similar Distribution |
---|---|
Accelerometer1RMS is skewed | Skewed |
Accelerometer2RMS is skewed | Skewed |
Current is skewed | Skewed |
Pressure is skewed | Skewed |
Temperature is skewed | Skewed |
Voltage is skewed | Skewed |
Volume Flow RateRMS is skewed | Skewed |
Dataset has 701 (1.56%) duplicate rows | Duplicates |
y has constant length 3 | Constant Length |
Accelerometer1RMS has 24707 (54.9%) negatives | Negatives |
Accelerometer2RMS has 24702 (54.89%) negatives | Negatives |
Current has 8118 (18.04%) negatives | Negatives |
Temperature has 30534 (67.85%) negatives | Negatives |
Thermocouple has 24700 (54.89%) negatives | Negatives |
Volume Flow RateRMS has 25328 (56.28%) negatives | Negatives |
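
The "Negatives" insights are simple counts of values below zero per column. As a rough sketch (the helper `negative_share` and the toy columns `u` and `v` are hypothetical, and the rule DataPrep uses to decide which columns to flag is not reproduced here), the counts and percentages could be recomputed like this:

```python
import pandas as pd

def negative_share(df: pd.DataFrame) -> pd.DataFrame:
    """Count negative values per numeric column and express them as a percentage."""
    numeric = df.select_dtypes(include="number")
    counts = (numeric < 0).sum()
    return pd.DataFrame({
        "negatives": counts,
        "negatives_pct": (100.0 * counts / len(numeric)).round(2),
    })

# Tiny stand-in frame (not SKAB data), used only to show the call
toy = pd.DataFrame({"u": [-1.0, -2.0, 3.0, 4.0], "v": [1.0, 2.0, 3.0, 4.0]})
shares = negative_share(toy)
print(shares)
```

For Accelerometer1RMS, for instance, 24707 negatives out of 45001 rows gives the 54.9% listed above.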
Accelerometer1RMS (numerical)
Approximate Distinct Count | 17200 |
---|---|
Approximate Unique (%) | 38.2% |
Missing | 0 |
Missing (%) | 0.0% |
Infinite | 0 |
Infinite (%) | 0.0% |
Memory Size | 3270574 |
Mean | -1.6485 |
Minimum | -4 |
Maximum | 5 |
Zeros | 2 |
Zeros (%) | 0.0% |
Negatives | 24707 |
Negatives (%) | 54.9% |
Minimum | -4 |
---|---|
5-th Percentile | -4 |
Q1 | -4 |
Median | -3.5471 |
Q3 | 0.6796 |
95-th Percentile | 1.7435 |
Maximum | 5 |
Range | 9 |
IQR | 4.6796 |
Mean | -1.6485 |
---|---|
Standard Deviation | 2.628 |
Variance | 6.9065 |
Sum | -74185.9889 |
Skewness | 0.4885 |
Kurtosis | -1.1319 |
Coefficient of Variation | -1.5941 |
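
Several of the descriptive statistics follow directly from the others: the range is maximum minus minimum (5 - (-4) = 9), the IQR is Q3 minus Q1 (0.6796 - (-4) = 4.6796), and the coefficient of variation is the standard deviation divided by the mean (2.628 / -1.6485 ≈ -1.594). The sketch below recomputes them with pandas; `describe_column` is a hypothetical helper, and it assumes pandas' sample-based estimators, which may differ slightly from the ones DataPrep uses.

```python
import pandas as pd

def describe_column(s: pd.Series) -> dict:
    """Recompute range, IQR, skewness, excess kurtosis and coefficient of variation."""
    q1, q3 = s.quantile([0.25, 0.75])
    mean, std = s.mean(), s.std()  # std() is the sample standard deviation (ddof=1)
    return {
        "range": s.max() - s.min(),
        "iqr": q3 - q1,
        "mean": mean,
        "std": std,
        "skewness": s.skew(),
        "kurtosis": s.kurt(),  # excess kurtosis: close to 0 for a normal distribution
        "cv": std / mean,      # negative whenever the mean is negative
    }

# Tiny stand-in series (not SKAB data), used only to show the call
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
stats = describe_column(toy)
print(stats)
```

A negative coefficient of variation, as reported for several SKAB features, only reflects the negative mean and carries no extra meaning.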
Accelerometer2RMS (numerical)
Approximate Distinct Count | 16459 |
---|---|
Approximate Unique (%) | 36.6% |
Missing | 0 |
Missing (%) | 0.0% |
Infinite | 0 |
Infinite (%) | 0.0% |
Memory Size | 3270574 |
Mean | -1.694 |
Minimum | -4 |
Maximum | 5 |
Zeros | 1 |
Zeros (%) | 0.0% |
Negatives | 24702 |
Negatives (%) | 54.9% |
Minimum | -4 |
---|---|
5-th Percentile | -4 |
Q1 | -4 |
Median | -3.6494 |
Q3 | 0.6437 |
95-th Percentile | 1.5091 |
Maximum | 5 |
Range | 9 |
IQR | 4.6437 |
Mean | -1.694 |
---|---|
Standard Deviation | 2.5842 |
Variance | 6.678 |
Sum | -76232.4873 |
Skewness | 0.4961 |
Kurtosis | -1.0773 |
Coefficient of Variation | -1.5255 |
Current (numerical)
Approximate Distinct Count | 40997 |
---|---|
Approximate Unique (%) | 91.1% |
Missing | 0 |
Missing (%) | 0.0% |
Infinite | 0 |
Infinite (%) | 0.0% |
Memory Size | 3270574 |
Mean | 0.004256 |
Minimum | -0.002151 |
Maximum | 1.0457 |
Zeros | 1 |
Zeros (%) | 0.0% |
Negatives | 8118 |
Negatives (%) | 18.0% |
Minimum | -0.002151 |
---|---|
5-th Percentile | -0.001036 |
Q1 | 0.00057529 |
Median | 0.002301 |
Q3 | 0.007052 |
95-th Percentile | 0.008857 |
Maximum | 1.0457 |
Range | 1.0479 |
IQR | 0.006476 |
Mean | 0.004256 |
---|---|
Standard Deviation | 0.02741 |
Variance | 0.00075125 |
Sum | 191.5148 |
Skewness | 35.0778 |
Kurtosis | 1250.1814 |
Coefficient of Variation | 6.4404 |
Pressure (numerical)
Approximate Distinct Count | 10 |
---|---|
Approximate Unique (%) | 0.0% |
Missing | 0 |
Missing (%) | 0.0% |
Infinite | 0 |
Infinite (%) | 0.0% |
Memory Size | 3270574 |
Mean | 0.5097 |
Minimum | 0 |
Maximum | 1.125 |
Zeros | 7 |
Zeros (%) | 0.0% |
Negatives | 0 |
Negatives (%) | 0.0% |
Minimum | 0 |
---|---|
5-th Percentile | 0.375 |
Q1 | 0.5 |
Median | 0.5 |
Q3 | 0.625 |
95-th Percentile | 0.625 |
Maximum | 1.125 |
Range | 1.125 |
IQR | 0.125 |
Mean | 0.5097 |
---|---|
Standard Deviation | 0.09883 |
Variance | 0.009767 |
Sum | 22938.3915 |
Skewness | -0.07541 |
Kurtosis | 0.7088 |
Coefficient of Variation | 0.1939 |
Temperature (numerical)
Approximate Distinct Count | 19961 |
---|---|
Approximate Unique (%) | 44.4% |
Missing | 0 |
Missing (%) | 0.0% |
Infinite | 0 |
Infinite (%) | 0.0% |
Memory Size | 3270574 |
Mean | -1.8749 |
Minimum | -4 |
Maximum | 1.9248 |
Zeros | 3 |
Zeros (%) | 0.0% |
Negatives | 30534 |
Negatives (%) | 67.8% |
Minimum | -4 |
---|---|
5-th Percentile | -4 |
Q1 | -4 |
Median | -0.8796 |
Q3 | 0.1397 |
95-th Percentile | 0.5672 |
Maximum | 1.9248 |
Range | 5.9248 |
IQR | 4.1397 |
Mean | -1.8749 |
---|---|
Standard Deviation | 2.0132 |
Variance | 4.053 |
Sum | -84371.4542 |
Skewness | -0.008067 |
Kurtosis | -1.844 |
Coefficient of Variation | -1.0738 |
Thermocouple (numerical)
Approximate Distinct Count | 24336 |
---|---|
Approximate Unique (%) | 54.1% |
Missing | 0 |
Missing (%) | 0.0% |
Infinite | 0 |
Infinite (%) | 0.0% |
Memory Size | 3270574 |
Mean | -0.06129 |
Minimum | -1.7123 |
Maximum | 2.4124 |
Zeros | 1 |
Zeros (%) | 0.0% |
Negatives | 24700 |
Negatives (%) | 54.9% |
Minimum | -1.7123 |
---|---|
5-th Percentile | -1.683 |
Q1 | -0.7663 |
Median | -0.402 |
Q3 | 0.8671 |
95-th Percentile | 1.1343 |
Maximum | 2.4124 |
Range | 4.1247 |
IQR | 1.6334 |
Mean | -0.06129 |
---|---|
Standard Deviation | 0.9165 |
Variance | 0.8399 |
Sum | -2757.9979 |
Skewness | 0.111 |
Kurtosis | -1.08 |
Coefficient of Variation | -14.9534 |
Voltage (numerical)
Approximate Distinct Count | 26414 |
---|---|
Approximate Unique (%) | 58.7% |
Missing | 0 |
Missing (%) | 0.0% |
Infinite | 0 |
Infinite (%) | 0.0% |
Memory Size | 3270574 |
Mean | 0.907 |
Minimum | -0.001631 |
Maximum | 1.01 |
Zeros | 1 |
Zeros (%) | 0.0% |
Negatives | 6 |
Negatives (%) | 0.0% |
Minimum | -0.001631 |
---|---|
5-th Percentile | 0.8235 |
Q1 | 0.8837 |
Median | 0.9087 |
Q3 | 0.9328 |
95-th Percentile | 0.9854 |
Maximum | 1.01 |
Range | 1.0116 |
IQR | 0.04908 |
Mean | 0.907 |
---|---|
Standard Deviation | 0.05043 |
Variance | 0.002543 |
Sum | 40814.3653 |
Skewness | -4.6654 |
Kurtosis | 81.3505 |
Coefficient of Variation | 0.0556 |
Volume Flow RateRMS (numerical)
Approximate Distinct Count | 727 |
---|---|
Approximate Unique (%) | 1.6% |
Missing | 0 |
Missing (%) | 0.0% |
Infinite | 0 |
Infinite (%) | 0.0% |
Memory Size | 3270574 |
Mean | -1.9065 |
Minimum | -4 |
Maximum | 1.5153 |
Zeros | 3 |
Zeros (%) | 0.0% |
Negatives | 25328 |
Negatives (%) | 56.3% |
Minimum | -4 |
---|---|
5-th Percentile | -4 |
Q1 | -4 |
Median | -4 |
Q3 | 0.7727 |
95-th Percentile | 0.9659 |
Maximum | 1.5153 |
Range | 5.5153 |
IQR | 4.7727 |
Mean | -1.9065 |
---|---|
Standard Deviation | 2.3741 |
Variance | 5.6365 |
Sum | -85794.0342 |
Skewness | 0.2596 |
Kurtosis | -1.9227 |
Coefficient of Variation | -1.2453 |
y (categorical)
Approximate Distinct Count | 2 |
---|---|
Approximate Unique (%) | 0.0% |
Missing | 0 |
Missing (%) | 0.0% |
Memory Size | 5610626 |
Mean | 3 |
---|---|
Standard Deviation | 0 |
Median | 3 |
Minimum | 3 |
Maximum | 3 |
1st row | 0.0 |
---|---|
2nd row | 0.0 |
3rd row | 0.0 |
4th row | 0.0 |
5th row | 0.0 |
Count | 0 |
---|---|
Lowercase Letter | 0 |
Space Separator | 0 |
Uppercase Letter | 0 |
Dash Punctuation | 0 |
Decimal Number | 90002 |
The report shows that all 8 features are numerical (excluding the target variable y). The dataset has no missing values. The empirical distribution of the features appears to be a mixture of several normal distributions, with multiple bell curves visible in the histograms. This could be a sign of data drift or possible anomalies (however, it may also be explained by normal behavior).
Looking at the Correlation Matrix we observe that about a third of the features are highly correlated while the rest of the pairings show very little correlation.
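
To make the correlation observation concrete, the strongly correlated pairs can be listed from a pandas correlation matrix. This is a minimal sketch under stated assumptions: the helper `strongly_correlated_pairs` is hypothetical, the 0.8 cut-off for "highly correlated" is an arbitrary choice that the report does not define, and Pearson correlation is used although the report also offers other coefficients.

```python
import pandas as pd

def strongly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8,
                              method: str = "pearson") -> list:
    """Return (col_a, col_b, r) for each column pair with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr(method=method)
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle only, so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], r))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Tiny stand-in frame (not SKAB data): b is a multiple of a, c is only weakly related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = strongly_correlated_pairs(toy)
print(pairs)
```

On the SKAB frame, `strongly_correlated_pairs(ds.drop(columns="y"))` would list the feature pairs behind the observation above; dropping `y` assumes the target is stored under that column name, as the report suggests.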