Big Data Analytics

Most of BioSense’s research in Big Data currently concerns two types of datasets, both of which require huge storage capacity and highly efficient algorithms that can process them within the required time.

The first type is satellite imagery. The Sentinel-2 satellites deliver a new 13-band image of any given point on Earth approximately every 5 days. The images come in 100 × 100 km tiles, with a spatial resolution of up to 10 m. Within the Cybele project, for example, we are analysing satellite images of the whole of Europe to classify soya fields, which means running the classification algorithm over hundreds of tiles. In such cases we parallelise the code to better suit the multi-core server architecture, speed up critical parts using Cython, and adjust the implementations to lower the computational complexity of the algorithm.
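
To give a sense of scale, a 100 km tile at 10 m resolution is 10,000 × 10,000 pixels per band, so processing tiles one after another quickly becomes the bottleneck. The sketch below illustrates one possible way to spread per-tile work over the cores of a server using Python's standard library. It is not the project's actual pipeline: the function names, the tile identifiers and the placeholder "classification" are all hypothetical and stand in for the real algorithm.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Dict, Iterable, List, Optional


def classify_tile(tile_id: str) -> Dict[str, object]:
    """Hypothetical stand-in for the per-tile classification step.

    A real implementation would load the tile's bands and run the
    classifier; here a deterministic dummy value is returned instead.
    """
    dummy_crop_pixels = sum(ord(ch) for ch in tile_id) % 1000
    return {"tile": tile_id, "crop_pixels": dummy_crop_pixels}


def classify_tiles(
    tile_ids: Iterable[str],
    max_workers: Optional[int] = None,
    executor_cls=ProcessPoolExecutor,
) -> List[Dict[str, object]]:
    """Run classify_tile over many tiles in parallel.

    executor_cls: ProcessPoolExecutor (the default) uses one process per
    worker and suits CPU-bound work; ThreadPoolExecutor can be passed
    instead for I/O-bound steps such as reading tiles from disk.
    """
    tile_list = list(tile_ids)
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(classify_tile, tile_list))
```

When this is run as a script with the default process pool, the call to classify_tiles should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Tiles are independent of one another, which is why a simple map over tile identifiers is enough here.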

The second type is call detail records (CDRs) of mobile phone users, which satisfy all the V’s of big data analytics. These records list established calls, exchanged text messages and registered internet traffic, and they are enormous in size. Privacy is an important issue, and extracting useful information from anonymised or aggregated datasets is a major challenge. Moreover, although the data is semi-structured, traditional RDBMS technology is not applicable to analytics on datasets of this volume, so we are developing processing workflows using the Apache family of technologies, such as Spark, Hadoop and Hive. These workflows give us deeper insight into human behaviour and mobility patterns and let us infer the interaction between rural and urban areas.
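
As a rough illustration of working with anonymised, aggregated records, the sketch below counts how many distinct users move between areas and suppresses flows seen for too few users. It is a single-machine toy in plain Python, not the Spark workflow itself; at real volumes the same grouping would be expressed in a distributed engine. The record layout, the function name and the `min_users` threshold are assumptions made for the example, not details taken from the actual datasets.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical record layout: (anonymised_user_id, timestamp, area_name)
Record = Tuple[str, int, str]
Flow = Tuple[str, str]


def area_flows(records: Iterable[Record], min_users: int = 2) -> Dict[Flow, int]:
    """Count distinct users per (origin, destination) area pair.

    Flows observed for fewer than min_users distinct users are dropped,
    so that rare individual movements do not appear in the output.
    """
    # Collect each user's observations.
    visits_by_user: Dict[str, List[Tuple[int, str]]] = {}
    for user, timestamp, area in records:
        visits_by_user.setdefault(user, []).append((timestamp, area))

    # For every user, walk consecutive observations in time order and
    # record a flow whenever the area changes.
    users_per_flow: Dict[Flow, Set[str]] = {}
    for user, visits in visits_by_user.items():
        visits.sort()
        for (_, origin), (_, destination) in zip(visits, visits[1:]):
            if origin != destination:
                users_per_flow.setdefault((origin, destination), set()).add(user)

    # Keep only flows shared by enough distinct users.
    return {
        flow: len(users)
        for flow, users in users_per_flow.items()
        if len(users) >= min_users
    }
```

Counting distinct users rather than raw events keeps one very active subscriber from dominating a flow, and the threshold is one simple way to trade detail for privacy in the aggregated result.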