Big Data Analysis in Earth Sciences - EarthBiAs2014

SCHOOL OF SCIENCES - Department of Mathematics
7 Jul 2014 to 11 Jul 2014

A Case Study

General Description

The objective of the summer school is to help students understand both the complexities and the advantages of managing Big Data when solving a real problem. The main tool for achieving this is the final project, which will be discussed and built up during the 5-day programme. Students will be provided with data and a case study so that they can prepare the final version of their project after the completion of the summer school in Rhodes.


The Problem

Heat and cold waves can have serious societal, agricultural, economic, and environmental impacts, with heat being the number one weather-related killer. In addition to temperature, high humidity can increase the impacts of heat waves, while high winds can increase the impacts of cold waves. Many agricultural products exhibit direct temperature threshold responses and can be indirectly affected through threshold responses of agricultural pests. Heat and cold waves are typically defined as events exceeding specified temperature thresholds over some minimum number of days. Chosen thresholds may be statistical or absolute; absolute thresholds depend on geography and sector.
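The threshold-and-duration definition above can be sketched in a few lines of Python. This is only an illustration: the function name `find_heat_waves`, the 35 °C threshold, the 3-day minimum and the sample temperatures are assumptions made for the example, not values prescribed by the case study.

```python
from typing import List, Tuple


def find_heat_waves(daily_max: List[float], threshold: float,
                    min_days: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) day indices of every run in which the
    daily maximum stays above `threshold` for at least `min_days` days."""
    events = []
    run_start = None  # index of the first day of the current run, if any
    for day, temp in enumerate(daily_max):
        if temp > threshold:
            if run_start is None:
                run_start = day
        else:
            # The run (if any) ended yesterday; keep it if it was long enough.
            if run_start is not None and day - run_start >= min_days:
                events.append((run_start, day - 1))
            run_start = None
    # A run may still be open when the series ends.
    if run_start is not None and len(daily_max) - run_start >= min_days:
        events.append((run_start, len(daily_max) - 1))
    return events


# Hypothetical daily maximum temperatures in degrees Celsius.
temps = [31.0, 36.5, 37.2, 38.0, 33.0, 36.1, 35.5, 36.0, 37.5, 39.0]
print(find_heat_waves(temps, threshold=35.0, min_days=3))
# prints [(1, 3), (5, 9)]
```

A cold wave could be detected the same way by reversing the comparison, and a statistical threshold (for example a high percentile of the historical record) could be passed in place of the absolute value.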

In order to monitor and forecast disasters and to assess the impact of such phenomena, there is a strong need to acquire, process and analyse appropriate data. In this case, datasets may include:

  • Real-time temperature data for the area under examination (wide scales allow tracking the evolution of the phenomena),
  • Temperature thresholds for the different sectors of this area that may be affected,
  • Long historical temperature records to feed the simulation models and predict potential future disasters.

Furthermore, datasets capturing disaster-driven incidents connected to the phenomena under examination (human and animal deaths, human and animal health problems, damage to farming, etc.) will be investigated in order to support robust impact assessment.


Limitations and problems

1. A key issue of this case study is how to acquire these datasets. The data may be spread across diverse and dispersed data sources.

2. Once the datasets are available, how do you integrate the information (pre-processing and harmonisation) so that it is consistent? The data may come in different formats with different semantic annotations (data models, metadata schemata).

3. How do you process different types of datasets? In this case, real-time data (daily temperatures) need different processing techniques and tools than static data (temperature thresholds).

4. Understanding the impact – does the data show the impact? Irregularities and thresholds may not actually indicate significant impacts, or could underestimate impacts because of the case-specific and/or area-specific evolution of an event (e.g. a fire that was triggered by a heat wave).

However, there is an immense amount of information related to the direct and indirect impacts of such extreme phenomena that could be used to understand the vulnerabilities (exposure to hazard) and minimize the impacts.
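The harmonisation step raised in point 2 can be illustrated with a small Python sketch that maps records from two differently formatted sources onto one common schema. The two sources ("A" and "B"), their field names, date formats and units are all hypothetical and were invented for this example.

```python
from datetime import datetime


def harmonise(record: dict) -> dict:
    """Map a source-specific record onto a common schema:
    station name, ISO 8601 date, daily maximum in degrees Celsius."""
    if record["source"] == "A":
        # Hypothetical source A: ISO dates, temperatures already in Celsius.
        date = datetime.strptime(record["date"], "%Y-%m-%d").date()
        tmax_c = float(record["tmax_c"])
    elif record["source"] == "B":
        # Hypothetical source B: DD/MM/YYYY dates, temperatures in Fahrenheit.
        date = datetime.strptime(record["day"], "%d/%m/%Y").date()
        tmax_c = (float(record["tmax_f"]) - 32.0) * 5.0 / 9.0
    else:
        raise ValueError(f"unknown source: {record['source']!r}")
    return {
        "station": record["station"],
        "date": date.isoformat(),
        "tmax_c": round(tmax_c, 1),
    }


raw = [
    {"source": "A", "station": "Rhodes", "date": "2014-07-07", "tmax_c": "34.2"},
    {"source": "B", "station": "Rhodes", "day": "08/07/2014", "tmax_f": "98.6"},
]
for rec in raw:
    print(harmonise(rec))
```

Once every record follows the same schema, real-time and historical series can be compared against the same thresholds regardless of where they came from.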


Big Data

Big Data [1] is the term for a collection of data sets so large, diverse and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend towards larger data sets is driven by the additional information that can be derived from analysing a single large set of related and diverse data, compared with separate smaller sets. The integrated information allows correlations to be found to "spot business trends, determine quality of research, mitigate diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."


Technologies applied to Big Data Analysis include massively parallel processing (MPP) databases, search-based applications, data-mining grids, distributed file systems, distributed databases, cloud-based infrastructure (applications, storage and computing resources) and the Internet.


Linked Data

Today's vision of a common Web of Data is largely attributable to the Linked Open Data movement. The first wave of the movement transformed silo-based portions of data into a plethora of openly accessible and interlinked data sets. The community itself provided guidelines (e.g., 5-star Open Data) as well as open source tools to foster interaction with the Web of Data. Harmonization between those data sets has been established at the modelling level, with unified description schemes that define a formal syntax and common data semantics.

Without doubt, Linked Open Data is the de facto standard for publishing and interlinking distributed datasets on the Web, commonly exposed through SPARQL endpoints.
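As a rough illustration of what querying a SPARQL endpoint involves, the Python sketch below prepares an HTTP GET request following the SPARQL protocol, where the query text is passed in the `query` parameter. The choice of the public DBpedia endpoint, the example query and the helper name `build_sparql_request` are assumptions made for this example. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the public DBpedia endpoint is used purely as an example.
ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: English labels of the DBpedia resource for Rhodes.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?label WHERE {
  dbr:Rhodes rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Prepare a SPARQL-protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
# To actually run the query, the request would be passed to
# urllib.request.urlopen(request) and the JSON response decoded.
```

Keeping the request construction separate from the network call makes it easy to inspect or log the exact query before it reaches a remote endpoint.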

LODRefine [2] is the gateway that connects us to the Linked Open Data cloud.

Data in spreadsheets, flat files or databases that are not linked to anything else are of limited use. Opening them up enables users to create new datasets, or to combine existing ones, and to produce various mashup applications or services. LODRefine supports the full life cycle of Linked Data, from extraction, authoring, enrichment and interlinking to visualization and sharing. Besides data loading, understanding, cleansing, reconciliation and extension, LODRefine allows you to reconcile your data with DBpedia, extend reconciled columns with data from DBpedia, extract named entities from columns containing text (e.g. descriptions, biographies) using services such as Zemanta, DBpedia Lookup and AlchemyAPI, and export the data to RDF or even upload it to CrowdFlower, the popular crowdsourcing service.
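To give a feel for what exporting tabular data to RDF means, the Python sketch below turns spreadsheet-like rows into N-Triples, one of the standard RDF serialisations. It is not how LODRefine works internally. The `http://example.org/` namespace, the `station/` and `property/` IRI patterns and the sample row are hypothetical, and the sketch assumes that identifiers and column names are already safe to use inside an IRI.

```python
from typing import Dict, Iterable


def escape_literal(value: object) -> str:
    """Escape the characters that N-Triples does not allow raw in literals."""
    text = str(value)
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriples(rows: Iterable[Dict[str, object]],
                base: str = "http://example.org/") -> str:
    """Emit one triple per non-id column of every row.

    Assumption: `base` is a placeholder namespace, and the `id` values and
    column names contain only IRI-safe characters.
    """
    lines = []
    for row in rows:
        subject = f"<{base}station/{row['id']}>"
        for column, value in row.items():
            if column == "id":
                continue
            predicate = f"<{base}property/{column}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return "\n".join(lines)


rows = [{"id": "rhodes-01", "name": "Rhodes", "tmax_c": 37.0}]
print(to_ntriples(rows))
```

In a real workflow the plain literals would usually be replaced by links to reconciled resources (for example DBpedia IRIs), which is what makes the exported data linked rather than merely converted.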

Finally, Linked Open Data [3] has increased the availability of semantic data, including huge flows of real-time information from many sources. Processing systems must be able to cope with such incoming data, while simultaneously providing efficient access to a live data store including both this growing information and pre-existing data. The SOLID architecture has been designed to handle such workflows, managing big semantic data in real-time.