Turns out that predicting forest fire size from scant met data is hard.¶

Every year, there are many wildfires in my native Washington (state, obviously). Many aren't a big deal. They spread anemically and predictibly, and fizzle out in the autumn. Fires like that are a good thing. It's a natural process that is very beneficial to many pine species. Especially in drier areas, it keeps a healthy, disease-resistent forest composed of numerous tree species. Wildfire supression in the past century has contributed to a monoculture of Douglas firs in the Cascades, which has caused diseases like mistletoe and bark beetle to explode. An additional unfortunate aspect of fire supression has been that in many areas, fuel has built up to incredible levels. In such places, when a fire does occur, it ain't the good kind: You get hot fires that kill even old growth trees and sterilize the soil. These kind of fires also take incredible resources to fight. It's a primary reason why the U.S. Forest Service has no money to do things like maintain roads anymore. In addition, they can often threaten property and even lives.

Accurate predictions of fire danger over time can enable managers more accurately target interventions such as burn bans. However, fire prediction isn't easy.

Obviously, fires like to start in dry and hot conditions. Fire danger depends heavily on precip, humidity, and temperature in the prior several weeks. The thicker or deeper the material, the longer it takes to dry, and the longer it takes to wet during a rain event. This acts as sort of a low-pass filter for fire danger.

Indices based on the meteorological factors are the basis of fire danger assessment. Is it possible to use machine learning to predict fire size based on met data alone? Cortez and Morais, 2007 tried to predict fire size with several regression models on data from 517 fires in the Montesinho Natural Park of Portugal from 2000 to 2003. Their best perfoming model was support vector regression. This is a really interesting dataset to work with. Since it is so small, it poses many challenges that you don't get in the typical Kaggle competition.

It turns out that Portugal could really use better wildfire risk assessment. It is a country bedeviled by wildfires, more than any other country in Europe. It has a great deal of forest. Worse, much of the country is a patchwork of small private plots and public ownership. It's not like the US, where most such land is Forest Service or BLM. This presents a severe problem for management and fire supression. Population loss in rural areas has made much privately owned land overgrown and neglected. In addition, invasive eucalypus trees have taken over many forested areas. Eucalypus is highly flammable.

Fire risk management has also been ineffective. There has been little emphasis on proven strategies like fire breaks around roads and encouragement of private landowners to clear land around houses. These shortcomings were tragically illustrated this June: During a heat wave with temperatures exceeding $40^\circ C$, a series of fires started within hours of each other, possibly due to arson. 67 people were killed, most in their cars as the fire swept across a roadway.

The results of Cortez and Morais weren't fantastic. The task of accurately predicting fire size based on only met data is essentially impossible, which the authors note. The spread of fires is often dynamically chaotic. Fire spread depends on everything from the brush immediately next to the ignition site to whether somebody spilled beer next to the fire pit. Hillslope has a huge effect. Fires don't like to go downhill. In addition, the final burned area obviously depends heavily on the what supression efforts are taken.

However, aggregate fire risk is more predictible, just as the climate is more predictible than the chaotic weather. In this post, I am examining the same data Cortez and Morais used. Instead of trying to predict fire size, I am only trying to estimate the probability of whether a fire start will spread or not. In these data, fires of less than 0.01 hectares are set to zero (this put a big wrinkle and Cortez and Morais' regression efforts, as we shall see). First, we'll go through the data, and then fit a logistic regression and XGBoost model to the data.

In [18]:
import pandas as pd
import scipy as sp
import numpy as np
import sklearn.gaussian_process as gp
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.cross_validation import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import scipy.stats.distributions as dist
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')

Load it, get the column names. Also, let's make a categorical feature for whether the burned area is greater than zero.

In [2]:
fire = pd.read_csv("forestfires.csv")
#shuffle. Set a random seed so we get consistent results.
fire = fire.sample(frac=1,random_state=0)
#make a variable for whether area >0. This is what we will predict.
fire['nonzero'] = fire['area'] > 0
#make another categorical feature for if it is late summer or not.
fire['summer'] = (fire['month']=='aug')|(fire['month']=='sep')
fire.columns
Out[2]:
Index([u'X', u'Y', u'month', u'day', u'FFMC', u'DMC', u'DC', u'ISI', u'temp',
       u'RH', u'wind', u'rain', u'area', u'nonzero', u'summer'],
      dtype='object')

Looking at the data¶

These data give burned areas of fire starts along with temperature, humidity, wind speed, a binary variable of whether was raining or not, and some fire danger indices based on met data with different time lags.

What conditions are favorable to wildfires? A big thing is moisture content of the fuel, which is primarily affected by temperatures and precip. Shallow, loose leaf litter dries out quickly after rain. Deeper, more compact material, or thicker wood, takes longer to dry out. However, it does not wet as easily. Deeper material therefore is more sensitive to lower-frequency components of temp and precip. Because of this, late summer is the worst time for wildfires.

FFMC is the "fine fuel moisture content," a measure of the moisture content of leaf litter and similar material. In other words, the tinder that can start a fire. DMC is the "Duff moisture code," which is the moisture content of deeper, but not compact organic matter. This gives you an idea of the flammability of deeper layers of organic matter, or slightly thicker woody material. Next, DC is the "drought code," which gives the moisture content of more compact, deeper layers. These measures are important, because forest fires often start at the ground. They get going by smouldering, and eventually something bigger catches on fire.

Let's plot the temperatures during the fire starts by month:

In [3]:
month_means = fire.groupby('month').aggregate(np.mean)
sb.swarmplot(data=fire,x='month', y='temp',hue='nonzero')
plt.show()

Fires definitely like warmer weather. There are more fires when it is hot. Important for our prediction later, we can kind of see a relationship between fire spreading beyond 0.01 hectares ('nonzero') and temperature.

The RH during these fires is pretty consistent throughout the year, except higher in winter and spring. These values are pretty consistent with average humidity levels in nearby Bragança.

The Montesihno Natural Park has a pretty typical Mediterranean climate, albeit at a relatively high altitude. You can get a general idea of the climate from the nearby town of Bragança, which is around 500m lower than most of the park. This should mean that temps in the park should run about 2-3 degrees cooler. Precip is heavily concentrated during the winter.

The few fires in the winter months don't really seem to be associated with abnormally warm temperatures. However, they do seem to be associated with abnormally dry conditions. We can see this by plotting the FFMC (loose material moisture content):

In [4]:
sb.swarmplot(data=fire,x='month', y='FFMC',hue='nonzero')
plt.legend(title='nonzero',loc='lower left')
plt.show()

FFMC doesn't differ much throughout the year, with the exception of a few outliers. It's above 80 for the vast majority of fires, showing that it is a pretty diagnostic measure for fire danger. Given typical Mediterranean winter conditions, I bet usually FFMCs during the winter are more like 30-50 instead of 80-100. They must differ significantly because it is much rainier, more humid, and colder during the winter.

Let's stop and think about the causes of these fires. According to this paper, arson is the cause of 42% of human-caused wildfires in Portugal. In Spain, only 4% of fires are lightning-triggered, which is probably pretty similar for Portugal. Thus, we can probably guess that a third or more of the fires in Montesihno are due to arson. This could be a good explanation for some of the outliers occuring during low FFMC.

Next let's look at relative humidity during these fire events, broken down by month:

In [5]:
sb.swarmplot(data=fire,x='month',y='RH',hue='nonzero')
plt.show()

The humidity is fairly consistent for all months that have more than a handful of observed fires. May and January have the highest humidity. However, I think that is just an artifact of sample size: Both of these months only have two observed fires. While the RH during fires is nearly consistent throughout the year the average monthly RH in Bragança varies greatly throughout the year. It runs about 50% during the summer, 70% in the spring, and 80% in the winter. Just like we saw with the FFMC, this indicates that when fires do occur outside summer, it is usually during abnormally dry periods.

How about fires by day of the week? Most fires in Portugal are human caused. Since it is a Mediterranean climate, not a humid continental climate, there is very little lightning. We should have more fires on the weekends:

In [6]:
sb.countplot(data=fire,x='day',hue='nonzero')

plt.show()

No big surprise: Sunday has the most fires. Even the vast majority of service industry workers in most European countries have Sunday off, so people can take their kids to the forest and set it on fire. I really would have thought this effect would have been bigger, considering that the vast majority of fires are human-caused. Perhaps arsons are more concentrated during the middle of the week, when there are less visitors. It's also possible that people feel that they can get away with flouting fire-safety regulations (no campfires during burn bans, etc) during the week. Lastly, there also can be time lag between the start of the fire and when it is detected. Fires can sometimes slowly smoulder in leaf matter (next to or below a campfire, say) for a while before actually catching.

Day of the week is probably irrelevant to fire spread. While more fires get started on weekends, the day of the week shouldn't matter to whether they spread significantly or not. To take a look at this, I'm going to get bootstrap estimates of the sampling distribution of the mean of 'nonzero' (whether fires spread beyond 0.01 Ha). This tells us whether the chances of fires spreading actually differs significantly based on day of the week.

There is not an implementation of bootstrapping in Pandas, but it just takes a few lines to write one.

In [7]:
def bootstrap(ser, stat_fun, n_resamps = 500, columns=None, size_resamp=None):
    if size_resamp == None:
        size_resamp = ser.size
    boot_series = pd.Series(np.empty(n_resamps))
    for i in xrange(n_resamps):
        resamp = ser.sample(size_resamp,replace=True)
        boot_series[i] =  stat_fun(resamp)
    return boot_series

#we'll be doing this a couple times, so I'm putting it in a function.
def bootstrap_col(data, grouping_col, boot_col):
    return data.groupby(grouping_col).apply(\
        lambda x: bootstrap(x[boot_col],np.mean,n_resamps=1000)).transpose()

def boot_report(data, grouping_col, boot_col):
    samples = bootstrap_col(data, grouping_col, boot_col)
    for key in samples.columns:
        sb.kdeplot(samples[key])
    print 'Standard deviation of \'nonzero\' by {0}:\n'.format(grouping_col), samples.apply(np.std)
    print '\nMean of \'nonzero\' by {0}:\n'.format(grouping_col), samples.apply(np.mean)
    plt.xlabel("Mean of 'nonzero'")
    plt.show()

boot_report(fire, boot_col='nonzero', grouping_col='day')
Standard deviation of 'nonzero' by day:
day
fri    0.053496
mon    0.058927
sat    0.055567
sun    0.051176
thu    0.064028
tue    0.064509
wed    0.068116
dtype: float64

Mean of 'nonzero' by day:
day
fri    0.505282
mon    0.529162
sat    0.498833
sun    0.496032
thu    0.509967
tue    0.560344
wed    0.594204
dtype: float64

We can see that more fires started on Wednesday spread beyond 0.01 hectares (59% versus the mean for all days of 51%). However, it doesn't look significant at all, since this difference is just a bit over one standard deviation of the bootstrap distribution of 'nonzero' for Wednesday. I think we can throw this out, especially considering I already have prior reasons to think it is irrelevant.

Next, we'll look at dependence of spatial location X (integers ranging from 1 to 9) and Y (ranging from 2 (?) to 9) on fire frequency, and fires that start, spread ('nonzero'). First, we'll look at a pivot table to more easily see the counts for each combination of 'X' and 'Y.'

In [8]:
pd.pivot_table(fire,values='nonzero',columns=['X'],index=['Y'],aggfunc=np.sum,fill_value=0)
Out[8]:
X 1 2 3 4 5 6 7 8 9
Y
2 4 11 0 0 0 0 0 0 0
3 7 0 1 13 0 9 1 3 0
4 10 17 15 20 13 6 25 1 4
5 4 14 1 10 0 32 3 3 1
6 0 0 0 4 2 1 1 29 1
8 0 0 0 0 0 0 0 1 0
9 0 0 0 0 0 0 0 0 3

Clearly, most fires start in particular areas, probably near roads and such. Let's see if there is a relationship between 'nonzero' and spatial location:

In [9]:
sb.countplot(data=fire,x='X',hue='nonzero')
plt.show()
sb.countplot(data=fire,x='Y',hue='nonzero')
plt.show()

There seems to be a relationship between X and probability of spreading beyond 0.01 hectares. This could be due to different vegetation or any number of reasons. I'll get bootstrap estimates of the standard deviation of the mean of 'nonzero' for each X, to see if the differences are significant. I'll also plot the sampling distribution of the mean of 'nonzero' for each 'X.'

In [10]:
boot_report(fire, 'X', 'nonzero')
boot_report(fire, 'Y', 'nonzero')
Standard deviation of 'nonzero' by X:
X
1    0.073212
2    0.057685
3    0.062723
4    0.050511
5    0.091576
6    0.051881
7    0.064054
8    0.063052
9    0.132286
dtype: float64

Mean of 'nonzero' by X:
X
1    0.525125
2    0.574849
3    0.304782
4    0.513813
5    0.496600
6    0.559814
7    0.501717
8    0.607475
9    0.694154
dtype: float64
Standard deviation of 'nonzero' by Y:
Y
2    0.075654
3    0.062563
4    0.036078
5    0.044318
6    0.058433
8    0.000000
9    0.209029
dtype: float64

Mean of 'nonzero' by Y:
Y
2    0.343614
3    0.533750
4    0.546030
5    0.545536
6    0.516946
8    1.000000
9    0.498833
dtype: float64

It's pretty clear from this that there is a relationship between fire spread and spatial location for X, while Y is a bit more borderline.

While fire areas are related to X and Y, it is nonlinear and non-monotic. This is a little bit different than met variables, where hotter and drier makes fires bigger. Because of this, we will be one-hot encoding X and Y.

Now let's look at burned areas of the fires. In this dataset, fires burning less than 0.01 hectares (100 sq. meters) are set to zero. About half of fires in this dataset are below this threshold. If we plot the log areas of the fires above this cutoff, it is beautifully normal (i.e., the areas are lognormal):

In [11]:
fire_nonzero = fire.query('area>0')
n_nonzero = fire_nonzero.shape[0]
print 'Fires with area > 0.01 Ha:', float(fire_nonzero.shape[0])/fire.shape[0]
fire['logarea']=np.log(fire_nonzero['area'])
sb.kdeplot(fire['logarea'])
plt.show()
Fires with area > 0.01 Ha: 0.522243713733

An area of 0.01 hectares corresponds to a log area of -4.6, which is off the end of this chart. However, about half of the observed fires have areas of less than 0.01 hectares (area=0, in our data). This points to a very bimodal distribution of fire sizes: either fires catch and really get going, or they fizzle out. This might be like lighting a campfire with a match: One of two things usually happens. Either the match goes out, and your total burned area is a few square millimeters, or the bigger fuel catches, and you have a burned area (hopefully confined to the pit) on the order of a square meter.

This lognormal-ish distribution of areas makes me think that Cortez and Morais' use of a $\log(x+1)$ transform was not a great choice. They did this to ameriolate some of the problems with having a huge spike in areas at zero. However, it doesn't actually do much to solve the problem. Worse, it implicitly assumes that areas can be less than zero! Considering we have a great lognormal distribution when the area is greater than zero, why not just work with that?

A compound distribution makes a lot more sense to me:

$\Pr(x > 0) \sim \mathrm{Bernoulli}(p)$

$\Pr(\log x | x>0) \sim \mathrm{Normal}(\mu,\sigma)$

This also decouples the problem of predicting fire sizes into a classification problem (predict whether size > 0), and a regression problem (predict the size given that it is greater than zero).

Now let's do a pairplot to look at pairwise relationships. To save space, I'm only plotting DC, FFMC, temp, X, Y, and DMC, which look like they are the more relevant variables for predicting fire spread. As the sort of dependent variable, I also am plotting area. The coloring is whether the fire has a nonzero area (e.g. less than 0.01 hectares) or not:

In [12]:
sb.pairplot(data=fire, diag_kind='kde', vars={'DC','FFMC','DMC','temp','X','Y','area'}, hue='nonzero')
plt.show()

The vast majority of fires start with high FFMC (dry conditions), and usually high DC. There's not much of a relationship with wind here. Fast winds can have a huge effect on fires, but wind speeds in these data are all slow; I don't think 10 km/h would do much.

None of these are great predictors of whether a fire spreads or not ('nonzero'). I can't really see any immediately visible relationships. This does not bode well for my efforts at predicting fire spread, which we will see in the next section.

Predicting fire spread¶

Cortez and Marais only used the meterological variables FFMC, DC, DMC, and ISI, and spatial variables X and Y with their best-performing model (support vector regression). Here, I'm going to try logistic regression on with cubic features (after scaling the data), and XGBoost. I'm doing about a 60/40 test/train split, rather than a more standard 80/20, because the data is so small.

This dataset is so small, and it is so noisy, that fitting the test set is really, really easy. Reshuffling the data, or tweaking parameters can make a huge difference on the test set fit just by random chance. Implicitly fitting the test set can give misleading estimates of true generalization performance. Therefore, it's really important fit the training set "blind" to the test set. To do this, I'm using randomized parameter search with 10-fold cross-validation.

In [13]:
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.cross_validation import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.preprocessing import OneHotEncoder
import scipy.stats.distributions as dist

#number of training points
n_test = 100
pred_features =['RH','FFMC','DC','DMC','ISI','temp','summer','X','Y']
fire_test=fire[0:n_test]
fire_arr = np.array(fire[pred_features])
#One-hot encode X and Y. With 'categorical_features', this can't be put
# in a pipeline. https://github.com/scikit-learn/scikit-learn/issues/8539
oh = OneHotEncoder(categorical_features=[-1,-2],sparse=False)
fire_arr = oh.fit_transform(fire_arr)

#do a test-train split. Already shuffled.
Xtest = np.array(fire_arr[0:n_test,:])
Xtrain = np.array(fire_arr[n_test+1:,:])
Ytest = np.array(fire['nonzero'])[0:n_test]#,np.newaxis]
Ytrain = np.array(fire['nonzero'])[n_test+1:]#,np.newaxis]

#Do regularized logistic regression with cubic features
pipe_log = Pipeline(steps=[('scaler',StandardScaler()),\
                           ('poly',PolynomialFeatures(degree=3)),
                           ('logreg', LogisticRegression(dual=True))])
logistic_param = {'logreg__C':np.logspace(-3,-2,100)}
logistic_cv = RandomizedSearchCV(pipe_log,logistic_param,cv=10,n_iter=30)
logistic_cv.fit(Xtrain,Ytrain)



#no reason not to use a huge number of trees. It's still quick this small.
xgb_model = XGBClassifier(n_estimators=1000)
#cross-validate on lambda (L2 regularzation) and alpha (min gain from splitting).
#also, find the best learning rate for our budget of 100 estimators.
param = {'subsample':np.logspace(-0.3,0,100),'learning_rate':np.logspace(-3,-1.5,100),'reg_lambda':np.logspace(-2,1,100),'reg_alpha':np.logspace(-2,1,100)}
cv_xg = RandomizedSearchCV(xgb_model,param,cv=10,n_iter=100)
cv_xg.fit(Xtrain,Ytrain)
rf_preds=cv_xg.predict(Xtest)
Ytest = np.array(fire['nonzero'])[0:n_test]
In [15]:
fire_test=fire[0:n_test]
fire_test['correct'] = rf_preds == Ytest

print 'Logistic reg. train accuracy: ', (logistic_cv.predict(Xtrain) == Ytrain).mean()
print 'Logistic reg. test accuracy: ', (logistic_cv.predict(Xtest) == Ytest).mean()

print 'Log. reg. Best C: ', logistic_cv.best_params_
print 'XGBoost best params:', cv_xg.best_params_

print 'XGB Test accuracy:', np.mean(fire_test['correct'])
print 'XGB Train accuracy:', (cv_xg.predict(Xtrain)==Ytrain).mean()
 Logistic reg. train accuracy:  0.798076923077
Logistic reg. test accuracy:  0.49
Log. reg. Best C:  {'logreg__C': 0.0018307382802953678}
XGBoost best params: {'learning_rate': 0.002750387840931946, 'subsample': 0.8168103123052588, 'reg_lambda': 0.13219411484660293, 'reg_alpha': 0.061359072734131728}
XGB Test accuracy: 0.6
XGB Train accuracy: 0.800480769231

XGBoost is accurately predicting whether a fire spreads beyond 0.01 hectares (the zero cutoff) or not 59% of the time. Based on the test/train split, the accuracy can vary between around 55%-70%. I could take a larger test set, but that sacrifices even more training data, and adds some stochasticity onto the training set anyway.

Subsampling with XGBoost seems to make it a bit more robust to sampling issues, although random forests actually peformed poorly here. One-hot encoding spatial variables X and Y is crucial: It tends to improve accuracy by several percent.

The logistic regression comes in at 50% accuracy. This seems to be due to the test-train split and the shuffle. Without setting "random_seed=0" in the initial shuffle, I usually get more like 55% accuracy with logistic regression. Keep in mind that this dataset is so noisy that no algorithm on the planet can get good accuracy: There's very important unobserved variables that help determine fire spread.

Next, let's do another pairplot with the predictive variables versus the predicted probability that a fire size will exceed 0.01 hectares ("predict_prob_nonzero").

I'll also plot the feature importances from the random forest.

In [16]:
fire_test['pred_prob_nonzero'] = cv_xg.predict_proba(Xtest)[:,1]
print 'Predicted proportion of correct answers:', (np.maximum(fire_test['pred_prob_nonzero'],1.-fire_test['pred_prob_nonzero'])).mean()
print 'Actual proportion of correct answers:', fire_test['correct'].mean()
best_xg = cv_xg.best_estimator_
sb.pairplot(data=fire_test, x_vars=['FFMC','DC','temp','summer','X','Y'],y_vars=['pred_prob_nonzero'],diag_kind='kde',hue='correct')#, plot_kws={"s": 8})
xgb.plot_importance(best_xg)
plt.show()
Predicted proportion of correct answers: 0.59068
Actual proportion of correct answers: 0.6