How to replace GDAL with more efficient tools? ★ Posted on October 28, 2020
The GDAL library (developed by OSGeo) is more or less the standard for processing geospatial data. Although it implements almost all the tools needed for this purpose, it is notorious for operational problems: it is practically impossible to install quickly (it requires many system-level dependencies), its API is obsolete (designed in the late nineties), and unexpected behaviours are easy to run into. So it is not surprising that many developers have actively started to seek different tools. This post presents some alternatives for the most common tasks (mainly focused on Python users).
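For raster data, one frequently suggested alternative is rasterio, which hides GDAL's raster capabilities behind a clean Python API and ships as a pre-built wheel, so the installation pain disappears. Below is a minimal sketch of reading a raster; the file name `input.tif` is an illustrative placeholder.

```python
import rasterio  # pip install rasterio (bundles its own GDAL binaries)

# Open a raster and read its first band as a NumPy array;
# "input.tif" is a placeholder path for illustration.
with rasterio.open("input.tif") as dataset:
    band = dataset.read(1)        # 2D NumPy array of pixel values
    print(dataset.crs)            # coordinate reference system
    print(dataset.transform)      # affine transform (pixel -> world coordinates)
    print(band.shape, band.dtype)
```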
Dimension order problem when storing big data ★ Posted on September 26, 2020
One of the common challenges when dealing with big multidimensional data sets is how to store them effectively. There are many problems to deal with, but perhaps the most important is choosing the proper dimension order. Although the choice of dimension order typically does not impact the size of the output file (the situation may differ if you use compression), it has a profound effect on the response time when accessing the stored data.
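The effect is easy to demonstrate with plain NumPy, a minimal sketch: in a C-ordered array the last dimension is contiguous in memory, so a slice that fixes the first axis reads one block, while a slice that fixes the last axis jumps across the whole buffer.

```python
import time
import numpy as np

# C-ordered (row-major) array: the last axis is contiguous in memory.
data = np.random.rand(256, 256, 256)  # ~134 MB of float64

start = time.perf_counter()
_ = data[:, :, 0].sum()   # one value per row -> strided, slow access
print("fixing the last axis: ", time.perf_counter() - start)

start = time.perf_counter()
_ = data[0, :, :].sum()   # one contiguous block -> fast access
print("fixing the first axis:", time.perf_counter() - start)
```

The same logic carries over to NetCDF, HDF5 or Zarr files: the dimension you slice along most often should be the fastest-varying one (or the chunking should match the access pattern).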
Prediction of renewable energy from the software engineering perspective ★ Posted on May 25, 2020
Many open-source tools can help with the prediction of renewable energy production. For example, the Python library PVLIB, which computes the production of photovoltaic solar panels, implements the essential scientific methods. However, when it comes to predicting energy from wind turbines, there is no similar library - but the computation itself is simple enough that everyone can write their own methods. There is also the question of how precise the predictions are, which requires a probabilistic approach.
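As a sketch of how simple the wind computation is, here is the textbook formula P = 0.5 * rho * A * Cp * v^3 in Python; the default power coefficient and air density are illustrative assumptions, and a production model would use the manufacturer's power curve instead.

```python
import math

def wind_turbine_power(wind_speed: float,
                       rotor_diameter: float,
                       power_coefficient: float = 0.4,  # illustrative Cp value
                       air_density: float = 1.225       # kg/m^3 at sea level
                       ) -> float:
    """Power output in watts from P = 0.5 * rho * A * Cp * v^3.

    Ignores cut-in/cut-out speeds and the rated-power plateau,
    which a real model must take from the turbine's power curve.
    """
    rotor_area = math.pi * (rotor_diameter / 2) ** 2
    return 0.5 * air_density * rotor_area * power_coefficient * wind_speed ** 3

# An 8 m/s wind on a 100 m rotor yields roughly 0.99 MW:
print(wind_turbine_power(8.0, 100.0) / 1e6, "MW")
```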
Design of a system for on-demand processing of large datasets ★ Posted on April 26, 2020
One of the common challenges of on-demand processing of large datasets is the time the processing requires. The naive approach computes results on the same component that serves them (and horizontally scales the system when overloaded). A more involved (and, generally speaking, the only correct) approach is to use an asynchronous task queue based on distributed message passing (a message broker). Another option is to stream data continuously as they arrive.
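A minimal sketch of the task-queue approach, here using Celery with Redis as the broker; the broker URL and the task body are illustrative assumptions, not a specific recommendation.

```python
# tasks.py - worker-side definition of an asynchronous task.
from celery import Celery

app = Celery("tasks",
             broker="redis://localhost:6379/0",    # message broker
             backend="redis://localhost:6379/1")   # result store

@app.task
def process_dataset(dataset_id: str) -> str:
    # The long-running computation happens on a worker process,
    # not on the component that serves the results.
    return f"processed {dataset_id}"

# On the serving component: enqueue the job and return immediately,
# then poll (or get notified) for the result.
#   result = process_dataset.delay("dataset-42")
#   result.get(timeout=600)
```

Run a worker with `celery -A tasks worker`; since the serving component only talks to the broker, both sides can be scaled independently.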
Pros and Cons of using xarray when accessing NetCDF files ★ Posted on April 10, 2020
The popular library xarray is more or less the standard in the data-analysis branch. During the last few years it has become prevalent and has been deployed in many projects, often without any careful decision on whether it is needed or whether there is another way to treat the issue. One of the essential xarray functionalities is reading and writing NetCDF files. That is also the focus of this article; by the end, you should be capable of making an informed decision on whether or not to use xarray.
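For reference, this is the typical xarray NetCDF workflow, a minimal sketch in which the file name and the variable name are placeholders:

```python
import xarray as xr

# Open lazily (delegates to the netCDF4 or h5netcdf backend),
# select by labels, and write the subset back to NetCDF.
ds = xr.open_dataset("data.nc")
subset = ds["temperature"].sel(time="2020-01")
subset.to_netcdf("subset.nc")
ds.close()
```

A common lower-level alternative is the netCDF4 package, which trades the labelled-selection convenience for a smaller dependency footprint.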
DevOps challenges of the system for processing big data ★ Posted on March 07, 2020
There are many common challenges in systems that process large data sets like videos, images, large feeds and similar. One of the most important decisions is whether to deploy the infrastructure on a cloud service or whether an on-premises solution is better. If you decide on a cloud-based solution, further challenges arise: which provider would be the best one, how to deploy (using infrastructure as code, or just some simple manual setup), etc. There are also many interesting services that are cheap or free and can be helpful, like Dropbox.
Acceleration of frequently accessed multi-dimensional values in Python using REDIS ★ Posted on February 03, 2019
The in-memory database REDIS provides a simple interface for caching various data types. In addition, it supports simple manipulation of arrays and scalar values, including complex data constructs (like queues). What is not implemented, however, is support for multi-dimensional objects. Such objects are beneficial when you need to cache frequently used constants (or interim results) for computations (like matrices).
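A common workaround is to store the raw bytes of a NumPy array together with the metadata needed to rebuild it. A minimal sketch, assuming a local Redis instance; the key layout is an illustrative convention, not the approach from the article.

```python
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)

def cache_array(key: str, arr: np.ndarray) -> None:
    # Store the raw buffer plus dtype and shape so the array
    # can be reconstructed on retrieval.
    r.set(f"{key}:data", arr.tobytes())
    r.set(f"{key}:dtype", str(arr.dtype))
    r.set(f"{key}:shape", ",".join(map(str, arr.shape)))

def load_array(key: str) -> np.ndarray:
    dtype = np.dtype(r.get(f"{key}:dtype").decode())
    shape = tuple(int(s) for s in r.get(f"{key}:shape").decode().split(","))
    return np.frombuffer(r.get(f"{key}:data"), dtype=dtype).reshape(shape)

# Usage: cache a matrix of constants and read it back.
cache_array("weights", np.eye(3))
print(load_array("weights"))
```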