Choosing Monitors

When comparing the readings from two different types of air qualitiy monitors it is important to first pick two monitors that are as close together as possible, helping to ensure that both are sucking in and analyzing the same air. Any differences seen in the readings are then more likely to originate from the monitors themselves rather than discrepencies in their environments.

Finding a pair of colocated monitors is no trouble at all. The MazamaPurpleAir package has a function to load in a table of Purpleair monitors which also contains the two AirNow monitors closest to each. Using this, here are all the PurpleAir monitors in the United States that are within 100m of an AirNow monitor. Click on any marker to get more information about that monitor and its closest neighbor.

In this report we will compare AirNow monitor ID: 530470010_01 and PurpleAir ID: 13681, which are both located close to Winthrop-Chewuch Rd in WA.

PurpleAir Channels

One important thing to consider before we start comparing these two monitors is the fact that PurpleAir monitors actually collect two different sets of data. Each monitor houses two sensors, one on channel A and the other on channel B, and we need to decide which we should use to measure against AirNow. Since both are analyzing the same air, their values should be clearly correlated, perhaps differing by a scale factor. If this is true, then a scatter plot of measurements taken at identical times will form a vey straight line. Otherwise, if their readings are uncorrelated, the points will be spread out.

Let’s make a scatter plot between channel A and channel B to see how well they match:

It looks like channel B tends to measure the same as channel A. If the R squared value is very high, meaning there is a very good correlation between channels, then we can choose whichever channel we want to compare now with the AirNow monitor. Let’s go with channel A for this report.

Hourly Averaging of PurpleAir Data

Now that we have chosen which PurpleAir channel to use, let’s take a look at the raw PM 2.5 readings measured by our monitors in the past week:

The first thing you might notice is that PurpleAir readings are at a much higher resolution and look a bit “fuzzy”. This is because the AirNow monitor data shows us one reading for every hour, while PurpleAir monitors take readings about every forty seconds. Therefore, the PurpleAir plot has more than 60x the number of points. It will make things much easier for both humans and computers to interpret this data if we average the PurpleAir readings by hour. The openair R package makes this extremely simple by providing a timeAverage() function to mutate timestamped readings into desired intervals.

hourlyPurpleAirData <- openair::timeAverage(rawPurpleAirData, avg.time = "hour")

Now both monitor datasets show hourly readings:

That’s more easy to look at! Let’s plot them both on the same axis to better see how their values compare:

Comparing Readings

The two plots don’t align exactly with one another, but you might also notice that they tend to share the same dips and spikes. Their shapes, in general, are quite similar. It might even look like you could move one plot up or down to match the other. We will explore that idea in just a moment. First, let’s look at this relationship through another lens. How do the two sets of readings correlate on a scatterplot?

If the monitors were reading the exact same values, then the points on this scatterplot should be arranged straight along the line y = x. This probably isn’t what you are seeing. The points are scattered but might form a general trend line off in a different direction.

Reading this plot is quite simple. Let’s look at a single point for starters:

Looking at the highlighted point, we can interpret from its position along the axes that:

“When our PurpleAir monitor reads the value 7.3 (ug/m3), the corresponding AirNow monitor reads 6.5 (ug/m3)”

If every value pair had the same linear relationship, then they would all fall on a single, perfectly straight line. This plot is a bit spread out, though, which means that we will have to come up with a relationship which best approximates the general trend of the data. This is easily accomplished by generating a linear model.

Making a Linear Model

Generating a linear model is quite simple in R. The lm() (Linear Model) function can be applied to a dataframe to find how to transform one of its variables into another. First lets put both the AirNow and PurpleAir readings into a single dataframe:

Combined PM 2.5 readings from both monitors into “monitors_data”
Date	PurpleAir PM 2.5 (ug/m3)	AirNow PM 2.5 (ug/m3)
2018-08-30 20:00:00	7.3	6.5
2018-08-30 21:00:00	4.8	5.1
2018-08-30 22:00:00	3.2	4.5
2018-08-30 23:00:00	2.5	3.9
2018-08-31 00:00:00	3.5	4.5
2018-08-31 01:00:00	3.7	4.1

And then run lm():

linear_model <- lm(data = monitors_data, airnow_pm25 ~ purpleair_pm25)

This creates a model, linear_model, which fits PurpleAir values to AirNow values. We can find out the details of this model by calling summary(linear_model).

## 
## Call:
## lm(formula = airnow_pm25 ~ purpleair_pm25, data = monitors_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7674 -0.8011  0.1091  0.3630  3.9069 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.813682   0.210230   8.627 7.63e-13 ***
## purpleair_pm25 0.512980   0.006575  78.020  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.284 on 75 degrees of freedom
## Multiple R-squared:  0.9878, Adjusted R-squared:  0.9877 
## F-statistic:  6087 on 1 and 75 DF,  p-value: < 2.2e-16

A bunch of information is presented here. We can find out more about the model’s residuals, variable coefficients, R values, and more. It is a very quick way to provide a basic overview of your model and give a sense of its accuracy. Now what does this approximation look like though? Does it describe our data well? Let’s see how it fits the scatterplot from earlier:

Not too bad! Of course the line doesn’t perfectly describe every point, but it at least captures the general trend of the data. The goal now is to actually apply this transformation to our PurpleAir data and see if it puts it any closer to the AirNow values. We can use R’s predict.lm(linear_model, monitors_data) to use our model on our original PurpleAir readings.

The PurpleAir plot should now be a little more aligned with that of the AirNow monitor. Again, the fit isn’t perfect, but overall it should match better than the orginal readings did. As described by our model coefficients, the PurpleAir values are first scaled by 0.513 and then shifted 1.814 units.

Linear Model Diagnostics

Now that we have an equation to approximate AirNow values from PurpleAir readings, lets look a bit deeper into just how well it does this approximation. summary(linear_model) gave us textual insight into its performance, but lets get some visual confirmation now with plot(linear_model), which displays a series of plots describing the model’s residual analytics.

These plots help determine how well our linear model describes our data in the following ways:

Residuals vs Fitted: Shows us if the data is well-approximated by our linear equation. Ideally our line should have run straight through the points on the original correlation scatterplot, so in our residual plot there should be about as many points above zero as there are below. Typically though you may notice the points in this plot form a curve, which means that the scatterplot is better described by a nonlinear equation. Here is some more information on residual vs fitted plots.
Normal Q-Q: Shows us if the model residuals follow a normal distribution. If they do, then the points on this plot should fall along the straight dotted line. Typically you will see that the points tend to curve away at the extremities, meaning that our values are not normally distributed. This tutorial does an excellent job explaining Q-Q plots.
Scale-Location: Also known as a spread-location plot, this shows us how the variability of our fitted data changes along its range. We want to see that our residuals after fitting are not dependent on the actual fitted value. Like in the residuals vs fitted plot, the points should be evenly scattered vertically so the red trend line is horizontal. Another way to visualize this is to look at the original scatterplot and see if it forms a “cone”, which would mean that lower fitted values have less variability than higher values. Ideally the variability is equal for any value you fit (homoscedasticity). More about scale-location plots and homoscedasticity.
Residuals vs Leverage: Shows us if there are any particularly influential readings in our data–outliers whose presence heavily impact our model. If any point in this plot lies outside of the Cook’s Distance boundary than it is an influential reading.

Here are a few more general tutorials on interpreting the results of plot(linearModel):

Understanding Diagnostic Plots for Linear Regression Analysis (text)
How to use Diagnostic Plots for Regression Models (text)
Checking Linear Regression Assumptions in R (video)

Multivariate Models

Sensors are sensitive. A PurpleAir monitor could give different PM 2.5 readings depending on a variety of factors such as temperature and humidity. The lm() function allows you to easily toss more variables into your model to see if that results in a better fit. Let’s gather up some other measurements taken by PurpleAir and put those in our combined monitor data table:

Combined readings from both monitors into “monitors_data”
Date	PurpleAir PM 2.5 (ug/m3)	PurpleAir temp (F)	PurpleAir humidity (%)	AirNow PM 2.5 (ug/m3)
2018-08-30 20:00:00	7.3	80.6	22.4	6.5
2018-08-30 21:00:00	4.8	82.0	20.2	5.1
2018-08-30 22:00:00	3.2	83.1	19.3	4.5
2018-08-30 23:00:00	2.5	84.7	17.9	3.9
2018-08-31 00:00:00	3.5	78.9	21.7	4.5
2018-08-31 01:00:00	3.7	76.6	23.6	4.1

Generating a multivariate model follows a similar process to that of single variable models. Instead of relating AirNow PM 2.5 values to just PurpleAir PM 2.5 readings, we’ll relate it to three separate PurpleAir measurements – PM 2.5, temperature, and humidity like so:

multivariate_model <- lm(data = monitors_data, airnow_pm25 ~ purpleair_pm25 + purpleair_temp + purpleair_humidity)

The lm() function will then find the best fit coefficients for each of our three variables (plus a y-intercept). The resulting equation has the format: fit = A*pm25 + B*temp + C*humidity + D, where A, B and C are the coefficients and D is the intercept.

Let’s look at summary(multivariate_model) to see how the extra variables affect our model:

## 
## Call:
## lm(formula = airnow_pm25 ~ purpleair_pm25 + purpleair_temp + 
##     purpleair_humidity, data = monitors_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5308 -0.6127  0.1803  0.4961  2.3809 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -5.566532   5.591514  -0.996   0.3228    
## purpleair_pm25      0.525606   0.005623  93.480   <2e-16 ***
## purpleair_temp      0.097157   0.053849   1.804   0.0753 .  
## purpleair_humidity  0.003000   0.059915   0.050   0.9602    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9474 on 73 degrees of freedom
## Multiple R-squared:  0.9935, Adjusted R-squared:  0.9933 
## F-statistic:  3748 on 3 and 73 DF,  p-value: < 2.2e-16

Taking the summary() provides an easy way to understand the significance of certain variables. At the far right of each coefficient you might see a series of stars, a dot, or a blank. These align with the signif. codes which show you that coefficients with more stars are more important to fitting values. You’ll probably notice that the purpleair_pm25 coefficient has at least a few stars since that variable is a very good indicator of what a airnow_pm25 value is.

But enough with all this text. How does the multivariate model look compared to the univariate one?

You probably won’t see an astounding change between the two. The multivariate model might be closer in some areas and farther away in others. Its R² value of 0.9935 is greater than the univariate model’s R² value of 0.9878, so it does, in fact, do a better job at describing the data.

That’s it for this tour of how to use linear fitting to scale PurpleAir data to a federally regulated AirNow monitor.

Fitting PurpleAir to AirNow with Linear Models