79775093

Date: 2025-09-25 16:37:20
Score: 0.5
Natty:
Report link

Since this question is a bit old and doesn't seem to have a clear answer, here is my proposed approach.

First, I would segment the large dataset into smaller, more manageable chunks based on a time window (for example, a separate DataFrame for each month). For each chunk, I would perform exploratory data analysis (EDA) to understand its distribution, using tools such as histograms, Shapiro-Wilk/Kolmogorov-Smirnov normality tests, and Q-Q plots.
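
As a rough illustration of that chunking-and-EDA step, here is a minimal sketch in Python/pandas; the stand-in DataFrame, its DatetimeIndex, and the "value" column are assumptions made for the example, not details from the question:

    import numpy as np
    import pandas as pd
    from scipy import stats

    # Stand-in data: three months of hourly readings (replace with the real dataset)
    idx = pd.date_range("2025-01-01", "2025-03-31 23:00", freq="h")
    df = pd.DataFrame({"value": np.random.normal(20.0, 2.5, len(idx))}, index=idx)

    def normality_summary(values, alpha=0.05):
        """Shapiro-Wilk and Kolmogorov-Smirnov checks on one chunk."""
        values = np.asarray(values)
        _, shapiro_p = stats.shapiro(values[:5000])   # SciPy warns that Shapiro p-values are unreliable above ~5000 samples
        _, ks_p = stats.kstest(values, "norm", args=(values.mean(), values.std(ddof=1)))
        return {"shapiro_p": round(shapiro_p, 4), "ks_p": round(ks_p, 4),
                "looks_normal": shapiro_p > alpha and ks_p > alpha}

    # One chunk per calendar month
    for month, chunk in df.groupby(pd.Grouper(freq="MS")):
        print(month.strftime("%Y-%m"), normality_summary(chunk["value"].dropna()))

If matplotlib is available, chunk["value"].hist(bins=50) and stats.probplot(chunk["value"], dist="norm", plot=plt) cover the histogram and Q-Q plot parts of the same check.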

In a real-world scenario with high-frequency data, such as a sensor recording at 100 Hz (i.e., one reading every 0.01 seconds), processing the entire dataset at once is often impractical on a local machine. Therefore, I would take a representative sample of the data, conduct the EDA on that sample, and calculate the normalization parameters from it. Those parameters would then be used to normalize the rest of the data for that period (e.g., the entire month).
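
Continuing the same hypothetical DataFrame, the sample-then-normalize idea could look like the following sketch (min-max scaling is my choice here; the sampling fraction and random_state are arbitrary):

    normalized_chunks = []
    for month, chunk in df.groupby(pd.Grouper(freq="MS")):
        sample = chunk["value"].dropna().sample(frac=0.1, random_state=42)  # representative sample
        lo, hi = sample.min(), sample.max()            # normalization parameters from the sample only
        scaled = (chunk["value"] - lo) / (hi - lo)     # min-max scaling of the full month
        normalized_chunks.append(scaled.clip(0, 1))    # readings outside the sampled range get clipped

    normalized = pd.concat(normalized_chunks)          # all months now share the [0, 1] scale

Readings in the full month that fall outside the sampled min/max would land outside [0, 1], hence the clip; an alternative is to fit scikit-learn's MinMaxScaler on the sample and call transform on the full chunk.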

By normalizing the data to a consistent range, such as [0,1], the different segments of data should become directly comparable.

Reasons:
  • Long answer (-1):
  • No code block (0.5):
  • Low reputation (1):
Posted by: Roberto Priego