A methodological approach to assess the best weather spatialization technique

May 28, 2018

6 minutes read

1. Context & objectives

1.1. Context

European directive 2009/128/CE : establishing a framework for Community action to achieve the sustainable use of pesticides

The European directive 2009/128/CE imposes member-states to set up tools that allow for a more rational use of crop protection products. Among these tools, agricultural warning systems, based on crop monitoring models for the control of pests and diseases are widely adopted and have proved their efficiency. However, due to the difficulty to get meteorological data at high spatial resolution (at the parcel scale), they still are underused. The use of geostatistical tools (Kriging, Multiple Regressions, Artificial Neural Networks, etc.) makes it possible to interpolate data provided by physical weather stations in such a way that a high spatial resolution network (mesh size of 1 km2) of virtual weather stations could be generated. That is the objective of the AGROMET project.

1.2. Objective

Provide hourly 1km² gridded datasets of weather parameters with the best accuracy (i.e. spatialize hourly records from the stations on the whole area of Wallonia) = SPATIALIZATION

The project aims to set up an operational web-platform designed for real-time agro-meteorological data dissemination at high spatial (1km2) and temporal (hourly) resolution. To achieve the availability of data at such a high spatial resolution, we plan to “spatialize” the real-time data sent by more than 30 connected physical weather stations belonging to the PAMESEB and RMI networks. This spatialization will then result in a gridded dataset corresponding to a network of 16 000 virtual stations uniformly spread on the whole territory of Wallonia. These “spatialized” data will be made available through a web-platform providing interactive visualization widgets (maps, charts, tables and various indicators) and an API allowing their use on the fly, notably by agricultural warning systems providers. An extensive and precise documentation about data origin, geo-statistic algorithms used and uncertainty will also be available.

Best suited tools :

~~physical atmospherical models~~ (not straight forward to develop an explicit physical model describing how the output data can be derived from the input data)
supervised machine learning regression algorithms that given a set of continuous data, find the best relationship that represents the set of continuous data (common approach largely discussed in the academic litterature)

Our main goal will be to choose, for each weather parameter, the best suited supervised machine learning regression method

2. Key definitions

2.1. Spatialization

Spatialization or spatial interpolation creates a continuous surface from values measured at discrete locations to predict values at any location in the interest zone with the best accuracy.

In the chapter The principles of geostatistical analysis of the Using ArcGis Geostatistical analyst, K. Johnston gives an efficient overview of what spatialization is and what are the two big groups of techniques (deterministic and stochastic).

2.2. Supervised machine learning

From machinelearningmastery.com :

Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output : Y = f(X)
The goal is to approximate the mapping function so well that when you have new input data (x), you can predict the output variables (Y) for that data.
It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process

Also check this worth reading post

3. Defining the best supervised machine learning regression method

3.1. Our general approach

Inspired from work of ZEPP + Arvalis + IRCeline
… to transpose into R-code
… using supervised machine learning techniques
… as proposed in the excellent geocomputation with R book from PhD Robin Lovelace.

3.2. Step-by-step workflow

From our historical dataset of hourly weather records (Pameseb db)
filter a representative subset of records (e.g. 5 years of continuous hourly records) + select the “good” stations
For each hourly set of records (30 stations - or more (by integrating IRM network? )
run a benchmark experiment where different desired regression learning algorithms are applied to various regression tasks (i.e. datasets with different combinations of explanatory variables + the target weather parameter) with the aim to compare and rank the combinations of algorithm + used explanatory variables using a cross validation resampling strategy (LOOCV) that provides the desired performance metrics (RMSE or MAE?)

Then aggregate, by calculating the mean, all the hourly performance measures on the whole representative subset to choose the method (= regression learning algorithm + regression task) that globally performs the best
For each desired hourly dataset, apply the choosen method to build a model to make spatial predictions
Use maps to vizualize the predictions and their uncertainty
Make the predictions available on the platform together with its uncertainty indicator

3.3. workflow activity diagrams

spatialization methodology viewer

3.4. Which target dependent variables ?

… or variables to be spatialized

temperature (a lot of litterature with expertise from KNMI + RMI)
relative humidity (performed by Arvalis + ZEPP)
~~rainfall~~ (RMI rain radar)
leaves wetness (none of our partners)

3.5. Which independent variables ?

… or explanatory variables

digital elevation model and its derivatives like aspect and slope (available from R command line using getData from Raster package)
solar irradiance (available from eumetsat - lsa saf)
other ? (distance to sea, CORINE land cover, temporal series, etc…)

3.6. Which R config and packages ?

In order to ensure science reproductibility (why it is important), the code (R) is developed in a self-maintained and publicly available Docker image

In addition to the famous tidyverse packages suite, we use bleeding edge R packages :

from sp to the new sf (perfect integration with dplyr verbs and with the OGC simple feature standard) for spatial data handling
mlr : an umbrella-package providing a unified interface to dozens of learning algorithms

4. Conclusion and perspectives

4.1. Conclusion

Thanks to exchanges with our partners (Steering Committee and KNMI + ZEPP + Arvalis) and an extensive review (both in terms of spatial prediction theory and R-coding),
we have figured out how to setup & code an R-facility to find the best suited interpolation method for each of our weather params

4.2. Perspectives

by the end of summer 2018 : benchmark of various combinations of learning algorithms & ancillary data
you can follow the advancement of this work in progress on github

5. Colofon an terms of services

5.1. Colofon

This document was generated using R software with the knitr library
The source code of the document is availbale on github

Terms of service

To use the AGROMET API you need to provide your own user token.
The present script is available under the GNU-GPL V3 license and comes with ABSOLUTELY NO WARRANTY.

(This document was generated using R software with the knitr library).

Tags:

methodology

Categories:

Back to posts

Geographer/Data-analyst/Coder