Software engineering perspective of the system for renewable energy prediction

★ Posted on April 02, 2021

This article explores problems related to designing and developing the system to predict renewable energy (or similar projects) - the outcome of the presented project is a web application (not a desktop app). Many things have to be considered, like the optimal technologies, team composition, budget, timetable. It might be helpful for you to read this article before you choose to develop a similar system, as otherwise, it can cost you a lot of time and money.

What is the best programming language?

In theory, there are multiple good options. Strictly speaking, the main languages worth considering are Python, Go, C++ and Java. However, it probably does not make sense to start any new project in Java as it has been in the state of clinical death for years, which reduces available options.

Probably the first choice would be Python (in version 3.10 or higher). There are many helpful libraries in Python (mainly: GDAL, Shapely, GeoPandas, NumPy/Pandas, Django, pvlib, xarray, Celery and others). So technically, all that is needed has already been implemented. What remains is to put these building blocks together. And here start the first problem - as you develop something continuously, versions of your (sub)dependencies change, which causes problems (as the main dependency stops working when its sub dependency is updated). You often spend days chasing what went wrong to discover that some sub-sub-dependency was updated and no longer backwards compatible.

There is also another well-known problem of Python - performance. Python still keeps GIL that disallows you to use a standard multithreading approach. As you are processing big data sets (often tens of GBs in size), this is a very significant limitation. Plus, it is interpreted language, which generally worsens performance. Just some helpful data - based on the most common use-cases - if you have some scenario composed of 10 generators (like wind turbines, solar PVs), and you need to simulate their behaviour on ten years of weather data (with resolution 10 minutes) - you need to expect running time around an hour (considering you have some standard hardware resources available by any cloud provider).

Another option would be the Go language. It is modern, simple and quite similar to Python - it also does not suffer from performance. So what is the problem? The problem is that there are almost no libraries that you need. There is no PVLIB, no GDAL, no Shapely, etc. That makes work with geospatial data complicated (as you need to implement everything independently). Maybe at some point in the future, some company like Google publishes codes for these purposes, making Go the first choice language. On the other hand, it is not that difficult to overcome this issue - as it is not difficult to build libraries that would make these computations. All codes are available in Python - they merely need to be converted to Go.

The last option can be C++. In this case, all things that you need are available there. The only problem is that library GDAL (one of the few reasonable tools for working with geospatial data in C++) is implemented in the Cpp98 standard. There is no sign that developers will update it any soon - that technically excludes the possibility of using the latest version of C++ and makes your code particularly vulnerable (to memory leaks and other unexpected errors).

Other languages, especially tools based on .NET, are intentionally excluded, as it is difficult to assess their possibilities. Mainly C# is not a very popular choice for geospatial data processing.

So we choose Python, what next (system design)?

The first thing you need to know is the overall design of your system - what are the main components, and how they are mutually related. Generally, it might be helpful to write some standardized System Design Document (aka SDD) and the Requirements Document plus the Algorithm Description Document (describing the mathematical and algorithmic side of the problem). In addition, it should include UML diagrams (including use-cases, database design, estimates).

If it comes to main (functional) components:

  • Gateway: using Ngnix or similar technology.
  • Web Application (framework): you would probably choose to implement it using Python/Django combination.
  • Relational database: in this case, probably PostgreSQL (you can also choose some commercial DB solution, but it is useless).
  • In-memory database: the usual choice is Redis (or Memcached).
  • Map-tile server: probably GeoServer (unfortunately, it is the only available simple ready-made solution).
  • Message broker: probably using Celery (on Redis) of Faust.
  • Front-end solution: probably based on the Vue.js framework.
  • Storage solution: in cloud probably using Amazon S3 bucket.

Regarding Gateway, you do not have to expect anything surprising - it can be helpful to have some special routes accessing some big data directly - but the rest is just a standard system.

If it comes to the web application (in the strict sense web framework), you have two options. The caveman approach would use Flask/FastAPI or a similar light-weighted framework. The more sophisticated one is to use Django. Each approach has its pros and cons. There are good reasons to suggest using Django (as you save a lot of time implementing trivial utilities, plus there are many advantages related to build-in migration logic) and creating a monolithic application. On the other hand, Flask or FastAPI code is easier to maintain and allow micro-service architecture.

Regarding the relational database, the reasonable option is PostgreSQL - it is free, open-source and of very high quality (plus continuously maintained). On the other hand, it is also good to avoid using MySQL - as its quality deteriorated in the past ten years (and because of well-known errors like restricted primary key size). Plus, be sure that you keep the database technology the same for the development and production stack. It can prevent many unexpected errors.

The in-memory database is typically helpful as an external service for other applications (like Celery asynchronous task queue) or sub-dependencies of Django like django-axes (for logging and preventing attacks on any user account). You can use it directly as well for caching frequently accessed values.

If it comes to the map-tile server, it is good to design your system in a way that allows you to have some custom-made solution. Currently, the only reasonable full-scale solution is GeoServer. But this Java-based application is based on code written around the year 2007. So it is an absolute hell to deploy it in the cloud environment. Plus, you have to expect strange behaviour (when it fails to serve one task properly, it switches its state and does not work anymore - very inconvenient for production). So, consider having some own solution to avoid using GeoServer if possible.

A message broker is an essential part of the system - as you cannot address computations directly (because of the risk of timeouts - typical computation can take you more than an hour without any problems). If you use Django, there is a very well integrated message broker called Celery (based on Redis or RabitMQ, prefer Redis if you can as it is more popular and used by many other dependencies applications as well). You can use it immediately by adding just two dependencies to your requirements - but despite how nice it sounds, expect many problems with Celery. Like unexpected behaviour, not the best documentation, a lot of refactoring from its source code. Another option can be Faust (based on Apache Kafka) - which allows you to use many other interesting features.

Without a doubt, the front-end should be completely separated from the back-end using some restful API access data (use, for example, the Django REST framework for implementation). The Vue.js framework is a popular, modern, and simple solution for the front-end. There are many advantages in using Vue - it is simple to find new developers, code is readable, there are many ready-made tools (especially the D3.js package is essential).

You have to contemplate what storage solution to choose carefully at the beginning. You will have to store hundreds of gigabits of data (or rather terabytes). That means you have to expect astronomical prices for cloud-based solutions in production. However, there is usually a cheaper cloud-based option for storing data not accessed frequently (helpful for data science purposes). Another option for storing not frequently accessed data is Dropbox or similar cloud-based storage. Dropbox can work as a cheap lake for your data from which you can pull them to your local machine - there is also an interface in Python for work with Dropbox (it is slightly tricky, but it works).

Some general remarks regarding the development

It is helpful to fix versions of your dependencies in the requirementes.txt file. But do not believe that it saves you any trouble (as many sub-dependencies do not have a fixed version, and you cannot change this). Also, expect a lot of unexplainable behaviour - as many dependencies are not of the highest quality (for example, when you use xarray, it rarely works like you want to). Also, in the design phase, think carefully about backup and restore procedures (you need to implement them in a working and complete way). Finally, you will frequently need to profile your code for performance and memory issues (so read about some libraries and tools designed for this, which can save you a lot of time).

Infrastructure

You have to consider many things when it comes to designing infrastructure. One of them can be: do not believe you can stay cloud-agnostic for long. Also, try not to believe there is a significant difference between cloud providers (AWS, G-Cloud, Azure are more or less the same). It is also necessary to use infrastructure as a code approach (for example, Terraform, Kubernetes). Finally, regarding deployment in the cloud, you need to be aware of the astronomical prices - expect at least a £1,000 bill per month (in 2021 prices) plus some other costs related to operational work. You can also easily hire a full-time DevOps engineer who will be busy enough.

It is also perfectly meaningful to have a powerful local machine for data science computations. However, when using Python, be aware that you cannot rely on multithreading because of GIL. So it is better to have a machine with a good Turbo Boost than with many cores.

Team and budget

Generally, it does not make sense to start your project if you cannot hire at least five full-time employees for at least two (or rather three) years. You will need at least one or two people dedicated to data science, one or two dedicated to back-end development and the same for front-end development. Above that, it is helpful to have applied meteorologists in your team and a full-time project manager. If it comes to the organization of people, SCRUM/agile is arguably the only possible way to success. The project price is usually between £200,000 to £400,000 annually (in 2021 prices).

All these things are necessary to consider when you are preparing a budget. Also, be aware that geospatial data are costly (free data are almost useless for practical purposes).

Summary

Optimal programming languages for systems are discussed. There are many options available, but the most popular is Python. This is because open-source libraries are available in Python that can significantly help with development. Other languages (mainly Go) have advantages, but they lack essential libraries. The standard choice for web application framework is Django (another option is to use FastAPI and design system as microservice). Before starting the project, knowing the price and required team size is essential - generally, at least five people are needed to be successful.

❋ Tags: Python Design Renewable energy Geospatial Web application