Air Pollution in Haifa
Exploratory data analysis
As part of a project evaluating the effect of Haifa’s low emission zone (LEZ) using the synthetic control method (coming soon!), I ran a short exploration of the data. It’s always beneficial to have a thorough look at the data before starting to model it. I think it’s safe to say that any time invested in exploratory data analysis is returned at a later stage in the form of better modelling insights, fewer bugs and more valid conclusions.
Disclaimer: I have no expertise in air quality analysis; in fact, this is the first time I have looked at data collected by an air pollution monitoring system.
This post was generated from a Jupyter notebook, which can be downloaded from here.
The notebook goes through the following steps:
- Dataset cleaning and tidying
- Handling of missing values
- Analyzing trends of the time series (daily, weekly, annual, etc.)
- Finding correlations between pairs of variables
- Data fusion with meteorological dataset
At a later stage, these sections will be added:
- Simple air pollution modelling
- Validation of the inference conducted for the ministry regarding the effect of Haifa’s LEZ.
Dataset sources
- Air quality monitoring data was downloaded from the website of Israel’s Ministry of Environmental Protection. The interface can be found here.
- Temperature data was downloaded from Israel Meteorological Service’s website. The interface can be found here.
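The code excerpts below rely on a standard set of imports, roughly the following (the warnings line reflects the fact that warnings are restored to their default state near the end of the notebook):

import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
warnings.filterwarnings(action='ignore')  # silence noisy library warnings; restored to 'default' later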
The raw data looks like this:
df_aqm = pd.read_csv('Data/Haifa-Atzmaut.csv')
df_aqm
תחנה:עצמאות חיפה - | SO2 | Benzene | PM2.5 | PM10 | CO | NO2 | NOX | NO | תאריך \ שעה | |
---|---|---|---|---|---|---|---|---|---|---|
0 | ppb | ppb | µg/m³ | µg/m³ | ppm | ppb | ppb | ppb | ||
1 | 29.7 | 44.4 | 0.6 | 41.2 | 114.8 | 73.8 | 31/12/2011 24:00 | |||
2 | 30.6 | 45.5 | 0.6 | 35.6 | 81.1 | 45.6 | 01/01/2012 00:05 | |||
3 | 30.6 | 45.5 | 0.5 | 40.7 | 107.9 | 67.0 | 01/01/2012 00:10 | |||
4 | 30.2 | 45.1 | 0.5 | 41.7 | 127.9 | 86.2 | 01/01/2012 00:15 | |||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
883289 | 23/04/2020 02:15 | 31/03/2014 22:00 | 02/03/2014 17:00 | 07/02/2012 14:30 | 06/01/2020 13:40 | 18/01/2017 13:05 | 22/01/2015 07:30 | 22/01/2015 07:30 | תאריך מקסימלי | |
883290 | 1.7 | 0.24 | 19.8 | 49.7 | 0.4 | 20.8 | 42.6 | 21.8 | Avg | |
883291 | 237295 | 602180 | 800202 | 97432 | 844796 | 827176 | 826981 | 826945 | כמות ערכים | |
883292 | 27 | 68 | 91 | 11 | 96 | 94 | 94 | 94 | נתון[%] | |
883293 | 2.8 | 0.3 | 17.8 | 50.0 | 0.2 | 16.3 | 58.6 | 46.0 | סטיית תקן |
883294 rows × 10 columns
Dataset Cleaning
The following steps are needed to clean the dataset:
- Delete the first column, which contains only the station’s name
- Strip the column names of leading and trailing whitespaces
- Translate the Hebrew date/time column
- Replace the representation of midnight from 24:00 to 00:00 to comply with Python’s DateTime
- Drop the first row, which contains the units of measurements for the variables
- Drop the last 8 rows, which contain summary statistics
- Numerify the measurements, setting missing values to NaN
- Set the datetime variable as index
station_name = df_aqm.columns[0]
# drop the station-name column and strip whitespace from the remaining column names
df_aqm.drop(columns=[station_name], inplace=True)
df_aqm.rename(columns={colname: colname.strip() for colname in df_aqm.columns}, inplace=True)
# translate the Hebrew date/time column and replace the 24:00 midnight notation with 00:00
df_aqm.rename(columns={df_aqm.columns[-1]: 'DateTime'}, inplace=True)
df_aqm['DateTime'] = df_aqm['DateTime'].apply(lambda x: x.strip().replace('24:00', '00:00'))
pollutants = list(df_aqm)
pollutants.remove('DateTime')
# keep the measurement units (first row), then drop it together with the 8 summary rows at the end
units = {colname: df_aqm.loc[0, colname].strip() for colname in pollutants}
df_aqm = df_aqm.iloc[1:-8]
# numerify the measurements (anything unparsable becomes NaN) and index the frame by timestamp
df_aqm[pollutants] = df_aqm[pollutants].apply(pd.to_numeric, errors='coerce')
df_aqm['DateTime'] = pd.to_datetime(df_aqm['DateTime'], dayfirst=True, infer_datetime_format=True)  # dates are DD/MM/YYYY
df_aqm = df_aqm.set_index('DateTime')
print(f'The dataset spans {len(np.unique(df_aqm.index.date))} days, '
      f'from {df_aqm.index[0].date()} to {df_aqm.index[-1].date()}')
The dataset spans 3068 days, from 2011-12-31 to 2020-05-24
Let’s have a look at a 2-month slice of the cleaned dataset:
df_aqm.loc["2018-04":"2018-05"]
SO2 | Benzene | PM2.5 | PM10 | CO | NO2 | NOX | NO | |
---|---|---|---|---|---|---|---|---|
DateTime | ||||||||
2018-04-01 00:05:00 | 1.8 | 0.18 | 7.4 | NaN | 0.3 | 22.2 | 23.6 | 1.4 |
2018-04-01 00:10:00 | 2.0 | 0.18 | 9.5 | NaN | 0.3 | 26.3 | 27.9 | 1.5 |
2018-04-01 00:15:00 | NaN | 0.18 | 13.7 | NaN | 0.3 | 24.6 | 25.2 | 0.5 |
2018-04-01 00:20:00 | NaN | 0.18 | 16.8 | NaN | 0.3 | 25.0 | 25.9 | 0.9 |
2018-04-01 00:25:00 | NaN | 0.18 | 17.4 | NaN | 0.4 | 24.2 | 24.7 | 0.6 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2018-05-31 23:40:00 | 0.0 | 0.08 | 8.0 | NaN | 0.3 | 7.2 | 7.5 | 0.5 |
2018-05-31 23:45:00 | 0.0 | 0.08 | 7.5 | NaN | 0.3 | 6.7 | 7.1 | 0.2 |
2018-05-31 23:50:00 | 0.0 | 0.08 | 6.4 | NaN | 0.3 | 10.2 | 11.2 | 1.2 |
2018-05-31 23:55:00 | 0.0 | 0.08 | 6.6 | NaN | 0.3 | 9.1 | 10.1 | 1.0 |
2018-05-31 00:00:00 | 0.0 | 0.08 | 9.8 | NaN | 0.3 | 7.7 | 7.4 | 0.0 |
17568 rows × 8 columns
Not bad! Nice and tidy.
Some notes about the data
- Measurements are recorded at 5-minute intervals. That’s a very high resolution, much more than needed to answer the research question. But it’ll be very helpful - it will allow for aggressive smoothing of the noisy measurements. In total, we have almost 1 million measurements of each pollutant.
- From the slice above we can suspect that the measurement resolution for CO (carbon monoxide) might be coarse relative to the spread of its values. It’s easy to check:
df_aqm['CO'].value_counts(normalize=True).head(7)
0.3 0.338305
0.2 0.186482
0.4 0.182812
0.5 0.073091
0.1 0.051982
0.6 0.039140
0.0 0.035468
Name: CO, dtype: float64
And indeed, more than 70% of the CO measurements fall into one of 3 distinct values. This implies that the CO readings are quantized too coarsely to capture this pollutant’s distribution well. We should keep that in mind when drawing inferences based on CO.
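That 70% figure is simply the sum of the three largest shares above:

df_aqm['CO'].value_counts(normalize=True).head(3).sum()  # ≈ 0.71, from the counts above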
Moving on to a general overview of the data, let’s use the convenient ‘describe’ method:
df_aqm.describe()
SO2 | Benzene | PM2.5 | PM10 | CO | NO2 | NOX | NO | |
---|---|---|---|---|---|---|---|---|
count | 237295.000000 | 602180.000000 | 800202.000000 | 97432.000000 | 844796.000000 | 827176.000000 | 826981.000000 | 826945.000000 |
mean | 1.718825 | 0.242009 | 19.774215 | 49.704957 | 0.366657 | 20.821325 | 42.579953 | 21.816400 |
std | 2.763984 | 0.300786 | 17.774355 | 49.961126 | 0.249257 | 16.310098 | 58.644492 | 46.070798 |
min | -0.800000 | -0.010000 | -29.500000 | -10.100000 | -0.400000 | -4.800000 | -4.600000 | -5.000000 |
25% | 0.100000 | 0.060000 | 10.800000 | 24.000000 | 0.200000 | 7.700000 | 8.500000 | 0.300000 |
50% | 0.800000 | 0.150000 | 17.200000 | 37.500000 | 0.300000 | 15.400000 | 19.300000 | 3.300000 |
75% | 2.400000 | 0.320000 | 25.400000 | 59.400000 | 0.400000 | 31.900000 | 51.300000 | 18.400000 |
max | 136.200000 | 12.700000 | 562.000000 | 810.900000 | 23.400000 | 262.400000 | 865.500000 | 763.800000 |
We can immediately notice negative measurements for all pollutants. This is not a good sign: while negative values themselves are easy to fix, their presence raises the suspicion of more serious problems in the dataset that would be harder to identify.
That broader concern will have to be addressed later. For now, let’s deal with the negative measurements themselves, starting with the extent of the problem:
percent_negative = 100*(df_aqm[pollutants] < 0).mean(axis=0).values
print(f'{pollutants[percent_negative.argmax()]} has {percent_negative.max():.1f}% negative values')
NO has 8.8% negative values
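The 8.8% refers to the worst-affected pollutant (NO); the full per-pollutant breakdown is one extra line, if you want the complete picture:

print(pd.Series(percent_negative, index=pollutants).round(1).sort_values(ascending=False))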
That’s quite bad. Two of the simplest ways to fix this issue are:
- Clip the negative values to 0
- Replace them with NaNs
However, these two approaches may or may not be the best thing to do, depending on:
- The mechanism that has caused this issue
- The quality of data required to answer the research question
Since I don’t have a substantiated answer to either of these questions, I tried to find out what domain practitioners recommend doing. Here is a thread with some opinions and insights on the matter. The authors propose differentiating between two situations, and acting accordingly:
- If the datapoints exhibit sudden short drops below 0, this probably indicates an instrument malfunction. In this case, it is advised to replace the invalid value with an average of some neighborhood.
- If values below 0 are recorded consecutively for a long time compared to the instrument’s integration time, this is probably rooted in the true air conditions. In this case, negative values imply that the real value was below the instrument’s detection limit. The suggested approach is to replace the negative values with the known detection limit, or with 0 if it isn’t known (as in our case). More justification for this approach can be found in Polissar et al., 1998.
If we want to follow these guidelines, we need to decide, for each negative measurement, whether it is the former case or the latter. This will be coded properly at a later stage; a rough sketch of what such a rule could look like is given below.
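Just to make the rule concrete, here is what such logic could look like; the one-hour run-length threshold separating short dips from sustained negative stretches (12 samples at 5-minute logging) is an arbitrary placeholder, not a recommendation:

def handle_negatives(series, max_dip=12, detection_limit=0.0):
    """Short negative dips are treated as instrument glitches and filled from their
    neighborhood; sustained negative stretches are treated as below-detection-limit
    readings and set to the limit. max_dip is the run length (in samples) separating
    the two cases; 12 samples = 1 hour at 5-minute logging."""
    s = series.copy()
    negative = s < 0
    # label consecutive runs of negative/non-negative values and measure their lengths
    run_id = (negative != negative.shift()).cumsum()
    run_len = negative.groupby(run_id).transform('size')
    short_dip = negative & (run_len <= max_dip)
    long_run = negative & (run_len > max_dip)
    # short dips: replace with values interpolated from the surrounding measurements
    interpolated = s.mask(negative).interpolate(limit_area='inside')
    s[short_dip] = interpolated[short_dip]
    # sustained runs: the true value was below the detection limit
    s[long_run] = detection_limit
    return s

# for example: df_aqm['NO'] = handle_negatives(df_aqm['NO'])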
Checking for systematic data collection flaws
It is important to make sure that our dataset does not suffer from systematic flaws in the data collection process. For example, later we would like to demonstrate how pollutant levels fluctuate as a function of the weekday. If some days are recorded significantly less often than others, any conclusion we would like to draw about the weekday trend might be invalidated.
An easy way to check this is to print the value counts of the weekdays:
df_aqm['Weekday Name'] = df_aqm.index.day_name()
df_aqm['Weekday Name'].value_counts()
Sunday 126420
Saturday 126145
Friday 126144
Monday 126144
Wednesday 126144
Thursday 126144
Tuesday 126144
Name: Weekday Name, dtype: int64
We see that the dataset is pretty much balanced in terms of weekday logging, and that is reassuring. However, it could also be that the logging system simply adds an entry for every 5-minute interval, while the measurement instruments malfunction with some systematic (i.e. not-at-random) pattern. A more complete examination would be:
df_aqm[pollutants].isna().groupby(df_aqm['Weekday Name']).mean()
SO2 | Benzene | PM2.5 | PM10 | CO | NO2 | NOX | NO | |
---|---|---|---|---|---|---|---|---|
Weekday Name | ||||||||
Friday | 0.731030 | 0.271008 | 0.100235 | 0.892718 | 0.040160 | 0.061295 | 0.061430 | 0.061430 |
Monday | 0.731204 | 0.394097 | 0.094646 | 0.887097 | 0.046114 | 0.067835 | 0.068121 | 0.068398 |
Saturday | 0.730992 | 0.344128 | 0.100820 | 0.894621 | 0.043846 | 0.064632 | 0.064767 | 0.064767 |
Sunday | 0.734813 | 0.313724 | 0.100791 | 0.884710 | 0.046757 | 0.060141 | 0.060354 | 0.060354 |
Thursday | 0.728350 | 0.297026 | 0.093805 | 0.892662 | 0.043442 | 0.062254 | 0.062571 | 0.062579 |
Tuesday | 0.733376 | 0.305429 | 0.084047 | 0.886312 | 0.043363 | 0.069159 | 0.069413 | 0.069413 |
Wednesday | 0.729674 | 0.302345 | 0.084071 | 0.889745 | 0.041334 | 0.059353 | 0.059559 | 0.059559 |
We see that the fraction of missing data changes significantly from one pollutant to another. In addition, for each pollutant, the fraction of missing values changes per weekday, but for now it’s hard to say if these differences are consequential. Meanwhile, it’s good to keep this table in mind, and we might return to it depending on the analysis and inference that we would like to conduct.
So we can conclude this step with the reassurance that the data does not exhibit major systematic data collection flaws, and we can move on to the analysis. Let’s plot the data in its rawest form - just a scatterplot of every datapoint:
axes=df_aqm[pollutants].plot(marker='.', alpha=0.5, linestyle='None', figsize=(11, 11), subplots=True, colormap='Dark2');
We can notice a few things from this plot:
- PM10 and SO2 are not continuously monitored over the 8-year period. The plots and analyses that follow will not ignore them, as we can still draw some conclusions from these limited measurements
- We can see a first hint of a seasonal trend: it seems like NOx and NO follow an annual cycle. Because of the scatterplot’s density, however, it could be that only a few outliers paint a picture that looks like an annual trend while contributing little to the overall signal, so a more detailed analysis is required to validate this claim
- It seems like NO2 and NOx are identical (or almost identical) measurements. This would be rather amusing, as NOx is defined as the sum of the nitrogen oxides NO and NO2. But it is the visualization that is misleading, and a closer look reveals that NOx is always higher than NO2 (a quick check follows this list)
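That 'closer look' can be made explicit by comparing the two columns wherever both are present:

both = df_aqm[['NOX', 'NO2']].dropna()
print(f"NOX >= NO2 in {100 * (both['NOX'] >= both['NO2']).mean():.1f}% of the measurements")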
Let’s see if on average there is a real annual trend for some substances:
log_interval = 5  # minutes between consecutive measurements
samples_per_week = int(60 / log_interval * 24 * 7)
window = samples_per_week * 4        # 4-week windows...
overlap = int(2 * samples_per_week)  # ...stepped every 2 weeks
rolling_mean = df_aqm.rolling(window=window, center=False, min_periods=samples_per_week).mean()
rolling_mean[samples_per_week*2::overlap].plot(
    marker='.', alpha=0.5, linestyle='None', figsize=(11, 11), subplots=True, colormap='Dark2');
This plot shows rolling averages of the measurements, with windows of 4 weeks and an overlap of 2 weeks. A similar result could be obtained using a simple groupby with a monthly frequency:
df_aqm.groupby(pd.Grouper(freq='M')).mean()
But the overlapping windows add a degree of smoothing which is (arguably) preferable in this case.
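For what it’s worth, a roughly equivalent construction - not identical to the rolling version above - is to resample into 2-week bins first and then average adjacent pairs of bins:

binned = df_aqm[pollutants].resample('2W').mean()  # 2-week bin averages
binned.rolling(window=2).mean().plot(  # average adjacent bins: ~4-week windows, stepped by 2 weeks
    marker='.', alpha=0.5, linestyle='None', figsize=(11, 11), subplots=True, colormap='Dark2');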
If you are disappointed too with pandas' lack of an overlap parameter in the rolling function, you can contribute to the discussion here.
Anyway, enough with the ranting. The above plot shows that all nitrogen oxide measurements (and, to a lesser extent, benzene) follow a clear annual trend peaking in winter. It is still not clear what causes this modulation, and we will revisit it later in the analysis.
The same trends can be demonstrated using a different visualization: averaging each pollutant over the day of the year, across all 8 years.
df_aqm.groupby(df_aqm.index.dayofyear).mean().plot(marker='.', alpha=0.5, linestyle='None', figsize=(11, 11), subplots=True, colormap='Dark2');
plt.xlabel('Day of the Year');
The annual trend of NO, NO2, NOx and benzene is validated through this plot. In addition, we can notice a peculiar drop in CO values around day 240. A closer look reveals that this drop is at day 243. This day is August 31st, the last day of summer vacation for 1.5 million kids. Coincidence? I don’t know. What do you think?
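The day-of-year arithmetic is easy to verify for a non-leap year:

pd.Timestamp('2019-08-31').dayofyear  # 243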
Let’s move now to a weekly analysis:
ax = sns.boxplot(data=df_aqm, x='Weekday Name', y='NOX',
                 order=['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'],
                 palette=sns.cubehelix_palette(7, start=.5, rot=-.75, dark=0.3, reverse=True))
ax.set_ylim([0, 70]);
For NOx, there is a clear weekly trend in the amount of air pollution measured. Note that the outliers have been cropped out of the plot. For other pollutants, the trend is not as impressive:
ax = sns.boxplot(data=df_aqm, x='Weekday Name', y='CO',
                 order=['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'],
                 palette=sns.cubehelix_palette(7, start=.5, rot=-.75, dark=0.3, reverse=True))
ax.set_ylim([0, 1]);
And we also see the discrete nature of the CO measurements, as noted earlier. Another way to look at the weekly fluctuations is to plot the daily medians for a specific period of time:
df_sub = df_aqm.loc['2019-07':'2019-09']
fig, ax = plt.subplots(figsize=(20,5))
ax.plot(df_sub.groupby(df_sub.index.date).median()['NOX'], marker='o', linestyle='-')
ax.set_ylabel('NOx [ppb]')
ax.set_title('')
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.SATURDAY))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'));
For this 3-month period, the lowest measured values were always on Saturdays, agreeing with the boxplot above.
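Rather than eyeballing the plot, we can check the 'always on Saturdays' observation directly by finding, for each week in the slice, the weekday with the lowest daily median:

daily = df_sub['NOX'].resample('D').median().dropna()
weekly_min_day = daily.resample('W').agg(lambda s: s.idxmin().day_name() if len(s) else None)
print(weekly_min_day.value_counts())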
The third and last seasonal trend to be analyzed is the dependence on the hour of the day.
df_aqm.groupby(df_aqm.index.time).mean().plot(marker='.', alpha=0.5, linestyle='None', figsize=(11, 11), subplots=True, colormap='Dark2');
plt.xticks(["08:00","12:00", "16:00", "20:00", "00:00", "04:00"]);
Some notes about this plot:
- Most pollutants peak in the morning. Some right at the rush hour (NO, NO2, NOX, Benzene) and some a bit later (SO2, PM10)
- It doesn’t seem like PM2.5 adheres to any logical pattern.
- CO peaks around midnight. It seems as if it has a slow buildup phase, followed by a slow decay.
- If nitrogen oxides are caused mainly by transportation, then the rush-hour attribution above could make sense. In that case, we could also attribute the slow increase and plateau in the afternoon/evening to the fact (or assumption?) that the commute back home is spread over longer hours, compared to the narrow morning rush hour.
- I’m no expert on the sources of each pollutant, be it transportation, industry, the port or natural factors. So I don’t know if any of these hypotheses make any sort of sense.
Air pollution and temperatures
The above trends show a clear correlation between the season/month and the average measured pollutant levels (at least for nitrogen oxides and benzene). But what is the cause of this modulation - is it the time of year itself? Or maybe the temperature? There could be many potential causes, and estimating each of these effects is a difficult identification problem in the causal-inference sense. Instead, we can choose to stay in the realm of mere correlations; while they do not imply causation, we can still speculate about which claims are reasonable and which are not.
A simple examination is to evaluate how much of this modulation can be accounted for by the change in temperature (disregarding the day of the year). To this end, I downloaded a second dataset, this time from the Israel Meteorological Service. It includes daily measurements (daily low and high temperatures) from the meteorological station closest to the air quality monitoring station, situated at the BAZAN Group’s plant in Haifa.
A straightforward cleaning yields a dataset that looks like this:
df_temp = pd.read_csv('Data/ims_data.csv', sep=",", encoding="ISO-8859-8")
# translate df from Hebrew
df_temp = df_temp.rename(columns={'משתנה': 'variable', 'תאריך': 'DateTime', 'חיפה בתי זיקוק (600)': 'value'})
df_temp.loc[df_temp['variable'] == 'טמפרטורת מינימום(C°)', 'variable'] = 'min temp'
df_temp.loc[df_temp['variable'] == 'טמפרטורת מקסימום(C°)', 'variable'] = 'max temp'
df_temp.loc[df_temp['variable'] == 'טמפרטורת מינימום ליד הקרקע(C°)', 'variable'] = 'ground temp'
# Delete the record 'ground temp' (first make sure it's empty)
assert(df_temp.loc[df_temp['variable'] == 'ground temp','value'].values == '-').all()
df_temp = df_temp.drop(df_temp[df_temp['variable'] == 'ground temp'].index)
# Convert to datetime
df_temp['DateTime'] = pd.to_datetime(df_temp['DateTime'],infer_datetime_format=True)
df_temp['value'] = df_temp['value'].apply(pd.to_numeric, errors='coerce')
df_temp.tail()
variable | DateTime | value | |
---|---|---|---|
8638 | min temp | 2020-04-27 | 11.6 |
8640 | max temp | 2020-04-28 | 23.2 |
8641 | min temp | 2020-04-28 | 11.0 |
8643 | max temp | 2020-04-29 | 22.9 |
8644 | min temp | 2020-04-29 | 17.1 |
Let’s plot a histogram of the daily lows and highs, just to see that the dataset makes sense (it does):
g = sns.FacetGrid(df_temp, hue='variable', height=6)
g.map(sns.distplot, 'value', bins=30, hist=True, kde=False, hist_kws={"linewidth": 3, "alpha": 0.7})
g.add_legend(title=' ')
g.axes[0,0].set_title("Distribution of minimal and maximal temperatures")
g.axes[0,0].set_xlabel('Temperature [Celsius]');
To merge (fuse?) the datasets, we need to calculate an aggregate value of the pollutants for each day, since the temperature data has only daily resolution. Averaging the measurements of each day is the simplest thing to do.
df_pol_day = df_aqm.groupby(df_aqm.index.date).mean()
df_pol_day.index = pd.to_datetime(df_pol_day.index)
df_max = df_temp.loc[df_temp['variable'] == 'max temp', ['DateTime', 'value']]
df_max.rename(columns={'value': 'max temp'}, inplace=True)
df_max.set_index('DateTime', inplace=True, drop=True)
df_min = df_temp.loc[df_temp['variable'] == 'min temp', ['DateTime', 'value']]
df_min.rename(columns={'value': 'min temp'}, inplace=True)
df_min.set_index('DateTime', inplace=True, drop=True)
df_pol_temp = df_pol_day.merge(right=df_max, how='inner', left_index=True, right_index=True)
df_pol_temp = df_pol_temp.merge(right=df_min, how='inner', left_index=True, right_index=True)
df_pol_temp.tail()
SO2 | Benzene | PM2.5 | PM10 | CO | NO2 | NOX | NO | max temp | min temp | |
---|---|---|---|---|---|---|---|---|---|---|
2020-04-25 | 0.031802 | 0.049213 | 10.824653 | NaN | 0.279152 | 1.761702 | 1.509220 | 0.010284 | 22.3 | 17.1 |
2020-04-26 | 0.089753 | 0.036250 | 9.450694 | NaN | 0.295406 | 6.074823 | 7.617376 | 1.686170 | 23.0 | 16.1 |
2020-04-27 | 0.217314 | 0.079306 | 12.359028 | NaN | 0.310247 | 18.504255 | 28.706383 | 10.386525 | 23.0 | 11.6 |
2020-04-28 | 0.595760 | 0.091632 | 13.841319 | NaN | 0.332155 | 17.207801 | 26.422340 | 9.412057 | 23.2 | 11.0 |
2020-04-29 | 0.102827 | 0.020035 | 10.708333 | NaN | 0.260777 | 1.787943 | 1.271986 | 0.000000 | 22.9 | 17.1 |
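The corr_coef passed to stat_func below is a small helper that is not shown in this excerpt; a minimal version could look something like the following (note that recent seaborn releases have removed the stat_func argument, in which case the correlation has to be annotated manually):

from scipy import stats

def corr_coef(x, y):
    # Pearson correlation coefficient and p-value, in the (stat, p-value)
    # format that older seaborn versions expected from stat_func
    return stats.pearsonr(x, y)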
sns.jointplot(x='min temp', y='NOX', data=df_pol_temp, kind="reg", stat_func=corr_coef);
warnings.filterwarnings(action='default')  # restore the warnings silenced at the top of the notebook
With a substantial correlation of -0.58 between the daily minimum temperature and NOx, the measured NOx values can be explained to some degree by the temperatures. It would be interesting to see whether other meteorological factors, like humidity and wind, can explain even more of the NOx fluctuations.
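In the meantime, with only temperatures in the fused dataset, a quick way to get a broader (still purely correlational) view is the pairwise correlation matrix, for example:

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(df_pol_temp.corr(), annot=True, fmt='.2f', center=0, ax=ax)
ax.set_title('Pairwise correlations of daily pollutant averages and temperatures');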
Closing remarks
This notebook described a preliminary exploration of the air quality monitoring data in Haifa. It is by no means an exhaustive analysis; it is just a set of steps used to deepen our understanding of the dataset. If you have any comment - an analysis you think should be done, a peculiar result that should be examined, or anything else - I would love to know!
This is not a conclusion, as it is only the start of the analysis. In the next chapter, I will evaluate the causal effect of Haifa’s new low emission zone using the synthetic control method. Stay tuned!