VectorNet Data Series 3: Culicoides Abundance Distribution Models for Europe and Surrounding Regions

This is the third in a planned series of data papers presenting modelled vector distributions produced during the ECDC- and EFSA-funded VectorNet project. The data package presented here includes those Culicoides vector species first modelled in 2015 as part of the VectorNet gap analysis work, namely C. imicola, C. obsoletus, C. scoticus, C. dewulfi, C. chiopterus, C. pulicaris, C. lupicaris, C. punctatus, and C. newsteadi. The known distributions of these species within the project area (Europe, the Mediterranean Basin, North Africa, and Eurasia) are currently incomplete to a greater or lesser degree. The models are designed to fill the gaps with predicted distributions, to provide a) a first indication of vector species distributions across the project geographical extent, and b) assistance in targeting surveys to collect distribution data for those areas with no field-validated information. The models are based on input data from light trap surveillance of adult Culicoides across continental Europe and surrounding regions (71.8°N to 33.5°S, -11.2°W to 62°E), concentrated in Western countries, supplemented by transect samples in eastern and northern Europe. Data from central EU are relatively sparse.
Funding statement: This work was carried out with support from the VectorNet framework contract OC/EFSA/AHAW/2013/02-FWC1 funded by the European Centre for Disease Prevention and Control (ECDC) and the European Food Safety Authority (EFSA) and the PALE-Blu H2020 Project ID: 727393.


Introduction/Study Description
VectorNet [1] is a joint initiative of the European Food Safety Authority (EFSA) and the European Centre for Disease Prevention and Control (ECDC), which started in May 2014. The project supports the collection of distribution data on tick, sandfly, mosquito and Culicoides midge vectors, related to both animal and human health.
While VectorNet and its predecessor VBORNET [2] have made substantial progress collating European data on key vector species, the coverage is still incomplete. The 'Gap Analysis' work within these projects aims to identify those areas of likely species distribution within the project extent where there are no current data. These estimates were produced throughout the project and were intended to meet two objectives: firstly, to help direct extensive VectorNet sampling efforts in the field, and secondly, to provide first indications of the current likely extent and distribution of key vector species within continental Europe and its surrounding regions. The models provided here are the latest iteration, using the distribution data available at the end of 2018. It is hoped that publishing these models will help experts to engage the wider research and professional community in the drive to expand and validate the VectorNet database, and will also contribute to veterinary and public health planning for Europe and its neighbouring countries. Readers are encouraged to contact the authors or visit the VectorNet website [1] for further details of the project, and to view distribution maps of arthropod disease vectors (midges, ticks, mosquitoes and sandflies).
For each model, abundance maps with a resolution of 1 km were generated using both Boosted Regression Trees and Random Forest spatial modelling techniques available through the VECMAP [3] system. The outputs from each technique were ensembled to create a 'consensus' output of Ln Maximum Annual number per trap per day.
Culicoides imicola is a proven bluetongue virus (BTV) vector: it is a livestock-associated species, numerous isolations of the virus have been made from field-collected individuals, and the entire transmission cycle has been reproduced experimentally for this species [4,5]. The other listed species, belonging to the Avaritia and Culicoides subgenera, are considered probable vectors based on their ecological habits, on virus isolation or viral genome detections from field-collected individuals, and on experimental infections. BTV was isolated from field-collected C. obsoletus [6][7][8] and C. pulicaris [9]; it was, however, not clear whether these taxa referred to single species or groups of species. BTV-8 genome has been identified by real-time RT-PCR in field-collected C. dewulfi and C. chiopterus individuals in the Netherlands [10,11] and in France [12]. In the Basque country, BTV-1 genome was detected by real-time RT-PCR in C. obsoletus/C. scoticus, C. pulicaris and C. lupicaris parous females [13].

Context
Culicoides obsoletus and C. scoticus from the United Kingdom have been experimentally infected by BTV-8 and BTV-9, C. scoticus showing higher viral titers [14]. Pools of C. pulicaris were found infected with BTV-2 in Sicily [15], and BTV genome was detected in C. punctatus and C. newsteadi field-collected specimens in Italy [16].

Steps

Model training data
The reported distributions of each vector species held in the VectorNet archive in May 2018 were used as the basis for the species presence training data for the analysis. They were formally released to the authors on request to ECDC (reference number 18-1421).
The raw input data were provided by light trap surveillance of adult Culicoides, set up mostly on ruminant farms across continental Europe and surrounding regions (72°N to 33.5°S, -11.2°W to 62°E), concentrated in Western countries, supplemented by transect samples in eastern and northern Europe. Data from central EU are relatively sparse (see maps in Appendix 1). These data were obtained either from national surveillance systems or from surveys carried out by the VectorNet project. Species were identified using a morphological identification key [17], either from field collections or, in some cases, retrospectively from stored collections held by national surveillance systems.
Midge abundance varies throughout the year, and several metrics may be used to represent abundance. The one used here for every species is the mean annual maximum number per trap per day. Only data from locations sampled with at least one collection per month throughout the season of peak abundance were used. If data from more than a single year were available, the annual values were averaged. For each species, zero values from the abundance datasets were included in the input data, but these were not supplemented by absence records from datasets for which only presence/absence data were available. These values represent a standardised measure of abundance at the annual resolution, and so represent one aspect of absolute abundance. They are not, however, comparable with traditional absolute abundance measures, as they are not associated with a specific date.
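The abundance metric described above can be illustrated with a minimal Python sketch. It is not the authors' processing code: the record layout (site, year, month, catch per trap per day), the function name and the choice of peak-season months are all assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean


def mean_annual_maximum(records, peak_months):
    """Return {site: mean of the yearly maximum catch per trap per day}.

    records     -- iterable of (site, year, month, catch_per_trap_per_day);
                   this layout is hypothetical.
    peak_months -- set of month numbers taken as the peak abundance season
                   (an assumption; the paper does not list the months).
    A site-year is kept only if every peak month has at least one collection.
    """
    by_site_year = defaultdict(list)
    for site, year, month, catch in records:
        by_site_year[(site, year)].append((month, catch))

    yearly_maxima = defaultdict(list)
    for (site, _year), observations in by_site_year.items():
        sampled_months = {month for month, _ in observations}
        if not peak_months <= sampled_months:
            continue  # incomplete peak-season coverage: site-year dropped
        yearly_maxima[site].append(max(catch for _, catch in observations))

    # Average the annual maxima when several years are available.
    return {site: mean(values) for site, values in yearly_maxima.items()}


records = [
    ("A", 2016, 6, 10.0), ("A", 2016, 7, 40.0), ("A", 2016, 8, 25.0),
    ("A", 2017, 6, 5.0), ("A", 2017, 7, 20.0), ("A", 2017, 8, 60.0),
    ("B", 2016, 6, 3.0), ("B", 2016, 8, 7.0),   # July missing: excluded
    ("C", 2016, 6, 0.0), ("C", 2016, 7, 0.0), ("C", 2016, 8, 0.0),  # true zero kept
]
result = mean_annual_maximum(records, peak_months={6, 7, 8})
print(result)
```

In this toy example site A has annual maxima of 40 and 60, site B is dropped because one peak month was not sampled, and site C contributes a genuine zero from an abundance dataset.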
Maps of the recorded distributions at that time are presented as overlays to the model outputs, in Appendix 1 available within this data package.

Modelling procedure
A range of modelling techniques are available in the VECMAP [3] system, of which Boosted Regression Trees (BRT) and Random Forest (RF) [18], using 10-25 repeated bootstraps per replicate, were used. Five replicates were implemented for each method. Each model was run using a 25% holdback for validation, which also ensured variability between replicates. BRT model parameters were adjusted to result in 1000 trees; the RF parameters were set to the system defaults, namely 100 trees, the best 15% of the available covariates, and each tree using approximately 70% of available sample data with replacement. An ensembled average (and an associated standard deviation image) was then produced from the ten replicates. The standard deviation maps provide useful indicators of uncertainty in the model outputs.
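The holdback and ensembling steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the VECMAP implementation: the function names are hypothetical, rasters are represented as small nested lists, and the population standard deviation is used because the source does not say which form was applied.

```python
import random
from statistics import mean, pstdev


def holdback_split(samples, fraction=0.25, seed=0):
    """Randomly reserve `fraction` of the samples for validation.

    A different seed per replicate gives a different training subset,
    which is one way variability between replicates can arise.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_hold = round(len(shuffled) * fraction)
    return shuffled[n_hold:], shuffled[:n_hold]  # (training, validation)


def ensemble(replicates):
    """Per-pixel mean and standard deviation across replicate grids.

    replicates -- list of equally sized 2-D grids (lists of rows).
    The population standard deviation is an assumption.
    """
    n_rows, n_cols = len(replicates[0]), len(replicates[0][0])
    mean_grid = [[mean([r[i][j] for r in replicates]) for j in range(n_cols)]
                 for i in range(n_rows)]
    sd_grid = [[pstdev([r[i][j] for r in replicates]) for j in range(n_cols)]
               for i in range(n_rows)]
    return mean_grid, sd_grid


training, validation = holdback_split(range(8), fraction=0.25, seed=1)

# Two toy 1 x 2 replicate outputs standing in for the ten BRT and RF replicates.
mean_grid, sd_grid = ensemble([[[1.0, 2.0]], [[3.0, 2.0]]])
print(mean_grid, sd_grid)
```

Pixels where the replicates disagree receive a larger standard deviation, which is why the standard deviation image can be read as an indicator of uncertainty.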
The covariates offered to the modelling procedures were drawn from a standardised set of environmental parameters, and in particular a suite of Fourier processed MODIS satellite imagery [19] which provides a range of biologically interpretable variables related to levels and seasonality of temperature and vegetation related factors during the period 2001-2015. These are summarised in Table 1 and are all available to registered members of the PALE-Blu Data Website [20]. Each BRT model was run with the top ten predictors identified in the trial model runs for each species, which are listed at the end of Appendix 1.

Quality Control
As indicated above, only raw data with sufficient samples per site to ensure reliability were used as model inputs. The model outputs were evaluated using the standard, and very extensive, accuracy metrics (e.g. R-squared, AIC, Kappa, confusion matrices) provided by the VECMAP [3] software. Provided that the accuracy metrics indicated sufficient statistical reliability, the outputs were ensembled as described above. AUCs for the training sets for all the models exceed 0.85.
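Two of the metrics named above can be written out explicitly. The sketch below uses the textbook definitions of R-squared and Cohen's kappa; how VECMAP computes them internally is not described in the source, and the function names and example numbers are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    obs_mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    confusion[i][j] -- number of samples observed in class i and
                       predicted as class j.
    """
    n = sum(sum(row) for row in confusion)
    k = len(confusion)
    observed_agreement = sum(confusion[i][i] for i in range(k)) / n
    expected_agreement = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(k)
    ) / (n * n)
    return (observed_agreement - expected_agreement) / (1.0 - expected_agreement)


# Hypothetical holdback values (Ln abundance) and a hypothetical 2 x 2 matrix.
r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
kappa = cohens_kappa([[40, 10], [5, 45]])
print(round(r2, 3), round(kappa, 3))
```

R-squared applies directly to the continuous abundance predictions, whereas kappa and confusion matrices require the predictions to be grouped into classes first.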

Sampling strategy
The abundance data used to train the models were collected by longitudinal UV-light trap collections, a method commonly used to survey adult Culicoides populations at a wide scale. The reliability of UV-light trap collections for assessing the 'aggressive density' on animals (which is the abundance parameter related to the risk of transmission) is still under debate and may be species dependent [24][25][26][27][28]. However, it is worth highlighting that abundances assessed by UV-light traps have been used for more than a decade to manage animal movements under EU regulations, and that this system has demonstrated its utility.

Constraints
There were no constraints in data production.

Privacy
Not applicable. No human or personal data were used in the analyses or are provided in these datasets, and no animal welfare constraints apply to entomological sampling.

Dataset description
Object name
VectorNet/PALE-Blu Midge Abundance Models

Data type
Processed data; interpretation of data

Ontologies
N/A.

Creation dates
01/05/2018 to 01/04/2019.

Dataset creators
The modelling work was led by William Wint (ERGO, the Environmental Research Group Oxford) using data assembled and processed by Thomas Balenghien (CIRAD) and provided by the authors listed above together with additional collaborators of the VectorNet project as listed, with literature sources in the table in Appendix 2.

Licence
CC-BY 4.

Accessibility criteria
The data are distributed as GeoTIFF files, a standard, non-proprietary GIS raster format. GeoTIFFs can be read directly by most GIS software and by some other software packages for access and analysis.

Reuse potential
These layers have been created in an attempt to identify probable areas of species distribution where there are currently no sample data. The maps therefore attempt to identify the actual distribution of each species, and so could be useful in identifying areas at risk from the diseases for which each species is a vector and in identifying suitable areas for further sampling. The VectorNet project plans to utilise these datasets in such a way.
The covariates of the models are also mainly climate orientated. A possible avenue of further work, therefore, could be to use the models to assess the potential change in distribution after a shift in climate parameters.