Raw PurpleAir PM 2.5 time series data will occasionally contain visible outliers. Some monitors are noisier than others, but there are often at least a handful of measurements that stand out among the rest. Here are the PM 2.5 values from the past two days for a PurpleAir sensor in Ashland, OR.
The time series above should be smooth on a minute-to-minute scale. But every so often a measurement comes in significantly higher or lower than its neighbors. This is a classic case in which to use a Hampel filter to mark these values as outliers and either remove or replace them.
The Hampel filter is a robust outlier detector based on the Median Absolute Deviation (MAD). For each point, a median and a MAD-based estimate of the standard deviation are calculated from all neighboring values within a window of size n. If the point of interest lies multiple standard deviations from the median, it is flagged as an outlier. The values returned by the Hampel filter can be loosely interpreted as the number of standard deviations each value lies from its window median. The higher the Hampel value, the more likely it is that the value is an outlier, but there is no universal threshold for deciding this. It’s up to us to choose our own window size and threshold based on the characteristics of our time series.
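To make the mechanics concrete, here is a minimal sketch of how a single Hampel value could be computed by hand. The hampel_value() helper is hypothetical (it is not part of any package), and seismicRoll may scale things slightly differently; this is only meant to show the idea, assuming an odd window width and the conventional 1.4826 MAD-to-sigma scale factor.

# Hypothetical helper illustrating the Hampel calculation for one point.
# Assumes an odd window width n and the standard 1.4826 MAD scale factor.
hampel_value <- function(x, i, n = 11) {
  k <- (n - 1) / 2                        # neighbors on each side of point i
  window <- x[(i - k):(i + k)]            # values within the window
  windowMedian <- median(window)
  windowSigma <- 1.4826 * median(abs(window - windowMedian))  # robust sigma estimate
  abs(x[i] - windowMedian) / windowSigma  # "standard deviations" from the median
}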
In the case of our PM 2.5 time series, we should ask ourselves:
“What is a time window in which a substantial change in air quality is very unlikely?”
Let’s actually consider the opposite question for a second: what is a window in which you could accept a lot of variation? PM 2.5 can vary considerably over the course of hours, but over a span of just a few minutes it would be unusual to have the air quality readings spike suddenly.
So if we decide on a window of 10 minutes, how does that translate to a number of neighboring values? Looking at the time stamps of our data, we can see that we get a reading about once a minute, meaning each neighbor accounts for around one minute. Roughly 10 neighbors therefore covers about 10 minutes. We will use a window of length 11 so that each point of interest has an equal number of neighbors on its left and right.
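As a quick sanity check on that once-a-minute assumption, you could inspect the gaps between consecutive timestamps. The data frame and column name used here (df, datetime_A) are assumptions borrowed from the table shown later in this post.

# Assumed: a data frame `df` with a POSIXct column named datetime_A.
# The typical gap between consecutive readings should be close to one minute.
gapsInMinutes <- as.numeric(diff(df$datetime_A), units = "mins")
median(gapsInMinutes)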
Here’s an example of a window around a point from our data:
Hampel filter values are similar to sigma values in normal distributions. The higher they are, the less likely it is that your reading fits within your data distribution. Data values that don’t deviate far from their window median have lower Hampel values, but the exact numbers you get highly depend on the nature of your time series. Let’s see what sort of Hampel values PurpleAir data has by using some Hampel filter functions.
The R package seismicRoll provides us with two Hampel functions we can use to perform outlier filtering.

seismicRoll::roll_hampel() returns a vector of Hampel values, one for each value in your data. It takes 2 main arguments, x and n:

x is your data vector.
n is the width of the Hampel filter window.
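If you want to follow along, seismicRoll is available on CRAN. Loading it looks like this (the install line is only needed if the package isn’t already on your system):

# install.packages("seismicRoll")   # one-time installation from CRAN
library(seismicRoll)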
Here is a sample of our PM 2.5 data with nominal units of micrograms per cubic meter (μg/m3):
## [1] 41.68 42.63 43.59 44.53 43.86 43.49 42.48 43.51 43.38 43.55 43.68
## [12] 44.15 42.27 42.29 42.66 42.25 42.32 42.17 41.12 40.79 41.36 41.17
## [23] 41.05 41.33 40.86
Then we can find the Hampel values for this sample using the window size we decided upon earlier:
hampelValues <- roll_hampel(x = data, n = 11)
Here are the first twenty-five values:
## [1] NA NA NA NA NA 0.08 4.25 0.16 0.25 0.11 0.28 2.45 0.54 0.06
## [15] 0.67 0.03 0.12 0.00 0.28 0.77 0.41 0.13 0.00 0.67 0.08
The bit of our vector we see here contains numbers ranging from 0 to 4.25. Since the element and value indices correspond, we can see, for example, that measurement #12 of 44.15 (μg/m3) has a Hampel value of 2.45. The NA values at the beginning come from the fact that the filter uses a window of 11 points: 5 on each side of the central point. The first and last 5 entries in our input vector don’t have enough neighbors for this test, so their Hampel values are not calculated.
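A quick way to confirm where those unevaluated points sit (a small illustrative check, not part of the original analysis):

# The first and last (n - 1) / 2 = 5 Hampel values are NA:
which(is.na(hampelValues))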
Let’s look at the distribution of all 1077 Hampel values in our single day of measurements:
Our distribution begins at zero, with mostly low values and very few high ones. We can pick a threshold for defining an outlier by finding where higher Hampel values start to become very infrequent, meaning those readings deviate from their window medians by an uncharacteristically large amount.
We can do this either visually, using the histogram, or numerically. Looking at the plot, in what range are most values located? Most can be found below 3, with only a few outliers above.
Numerically, we can work through the histogram bar by bar and see what percentage of Hampel values falls in each range. Our first bar, for example, from 0 to 0.5, accounts for 544 values out of the total 1077, or about 50.51%. We can keep adding the percentage each bar represents until we reach a range where most small Hampel values (non-outliers) are covered and only the extremely high values (outliers) are left. For instance, if you wanted to mark the readings with the highest 5% of Hampel values as outliers, you could find the limit that covers the lowest 95% of them. Anything above that limit is then an outlier; a short code sketch of this calculation follows the bar summary below.
## Bar #1: Values from 0.00-0.50 represent 50.51% - cumulative: 50.51%
## Bar #2: Values from 0.50-1.00 represent 27.39% - cumulative: 77.9%
## Bar #3: Values from 1.00-1.50 represent 9.56% - cumulative: 87.47%
## Bar #4: Values from 1.50-2.00 represent 3.99% - cumulative: 91.46%
## Bar #5: Values from 2.00-2.50 represent 3.34% - cumulative: 94.8%
## Bar #6: Values from 2.50-3.00 represent 0.65% - cumulative: 95.45%
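These cumulative percentages can also be computed directly. Here is a sketch assuming the same 0.5-wide bins and, for the quantile route, a 95% / 5% split between "normal" readings and outliers:

hv <- hampelValues[!is.na(hampelValues)]

# Bar-by-bar cumulative percentages using 0.5-wide bins
h <- hist(hv, breaks = seq(0, ceiling(max(hv)), by = 0.5), plot = FALSE)
round(cumsum(h$counts) / length(hv) * 100, 2)

# Or jump straight to the cutoff that covers the lowest 95% of Hampel values
quantile(hv, probs = 0.95)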
Let’s fill in these bars to get a visual sense:
This shows us that readings with Hampel values greater than 3 (the uncolored bars) are considered outliers according to our rules. We could remove these readings ourselves by finding the indices of the outlier Hampel values, but the seismicRoll package already offers us the handy findOutliers() function, which can do all of this for us!
findOutliers() returns a vector of the indices of values determined to be outliers. It uses roll_hampel() under the hood, so we don’t have to do the threshold testing and index extraction ourselves. It takes 3 main arguments, x, n, and thresholdMin:

x is your data vector.
n is the width of the Hampel filter window.
thresholdMin is the lower Hampel value cutoff.
Let’s use the threshold we determined visually: 3. Calling findOutliers(x = data, n = 11, thresholdMin = 3) returns:
## [1] 7 45 50 59 214 232 248 358 363 378 379 395 407 416
## [15] 418 445 502 533 558 590 665 674 675 678 704 712 751 786
## [29] 815 832 895 916 920 936 947 1005 1038 1070 1072
These are the indices of all the outliers in our data, which we can now pull out to get the actual outlier values. Let’s see the first few:
data[findOutliers(x = data, n = 11, thresholdMin = 3), ]
|     | datetime_A          | pm25_A |
|-----|---------------------|--------|
| 7   | 2018-09-09 21:09:19 | 42.48  |
| 45  | 2018-09-09 21:59:59 | 33.10  |
| 50  | 2018-09-09 22:06:00 | 62.00  |
| 59  | 2018-09-09 22:18:00 | 36.54  |
| 214 | 2018-09-10 01:44:26 | 79.00  |
| 232 | 2018-09-10 02:08:40 | 84.00  |
Let’s mark all these points on our graph:
There may be some spikes that appear to be outliers visually but are not marked. It is important to remember that in periods of rapidly changing data, the standard deviation within the window becomes larger. This means that, to be marked as an outlier, a value has to be significantly further away from the window median during periods of rapid change than during periods of relative quiet.
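The hypothetical hampel_value() sketch from earlier illustrates this: a bump of similar size produces a much smaller Hampel value when the surrounding window is itself changing rapidly. These are toy numbers, not PurpleAir data.

quiet    <- c(40, 40, 41, 40, 41, 55, 40, 41, 40, 40, 41)    # flat window with a spike
changing <- c(20, 30, 40, 50, 60, 75, 70, 80, 90, 100, 110)  # steep ramp with a bump
hampel_value(quiet, i = 6, n = 11)      # roughly 9.4: easily flagged
hampel_value(changing, i = 6, n = 11)   # roughly 0.2: not flagged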
Now that we know how to identify outliers, what should we do with them? One option is to simply remove them. Another option is to replace them with the median of their neighbors. The seismicRoll function roll_median() can do exactly this, with arguments x and n identical to roll_hampel():

x is your data vector.
n is the width of the window.
outlierIndices <- findOutliers(x = data, n = 11, thresholdMin = 3)
data[outlierIndices] <- roll_median(data, n = 11)[outlierIndices]
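If you would rather take the first option and simply drop the outliers, that is a one-liner (dataCleaned is just an illustrative name):

dataCleaned <- data[-outlierIndices]   # keep everything except the flagged readings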
Identifying and removing outliers is useful enough that we have a dedicated function for this in the MazamaSpatialUtils package. This function accepts a “PurpleAir Timeseries” or pat object and returns the same object with outliers removed. By default, it also generates a plot showing the outliers it removed.
The following line of code both generates a plot and returns the pat object with outliers removed. The underlying C++ code is fast enough to work with a week’s worth of data without breaking a sweat.
pat_cleaned <- pat_outlierDetection(pat_oneWeek, n = 11, thresholdMin = 6)
If we decide to create an hourly averaged time series, it’s important to keep in mind that the result will be different depending on whether we have fixed our data’s outliers or not.
PurpleAir PM 2.5 outliers typically spike up rather than down. So you may see that the fixed hourly average time series usually measures a little lower overall than the unfixed time series. But the differences should be very small unless there are a huge number of outliers.
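As a rough way to quantify that difference, you could compare hourly means computed from the raw and corrected series. The data frame and column names below (df, datetime_A, pm25_A, pm25_fixed) are assumptions for illustration only.

# Assumed: df has a POSIXct datetime_A column, the raw pm25_A readings, and a
# pm25_fixed column holding the series after outlier replacement.
df$hour <- format(df$datetime_A, "%Y-%m-%d %H")
hourlyRaw   <- tapply(df$pm25_A, df$hour, mean, na.rm = TRUE)
hourlyFixed <- tapply(df$pm25_fixed, df$hour, mean, na.rm = TRUE)
summary(hourlyRaw - hourlyFixed)   # usually small, mostly positive differences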