One of the common challenges when dealing with big multidimensional data sets is how to store them effectively. There are many problems to deal with, but perhaps the most important one is choosing the proper dimension order. Although the choice of dimension order typically does not affect the size of the output file (the situation may differ if you use compression), it has a profound effect on the response time when accessing the stored data.
What is the optimal rule to follow when choosing the dimension order?
Note that this section deals only with multidimensional data (at least two or three dimensions). In that case, there is often a simple rule to follow. First of all, you need to know the form of your typical read (or write) request in advance. Then, there are a few things to consider:
- Do you select just one point most of the time?
- Do you often select a data series that varies along only one dimension?
- Do you follow any pattern when selecting data in more than one dimension?
Generally, the first case is the best one, as the dimension order you choose hardly matters. More precisely, the most convenient choice is the dimension order of the input (usually whatever is already in memory and waiting to be stored) - it makes writing fast without affecting the reading rate (a win-win on both sides).
The second case is the most common one. Typically, it amounts to selecting a time series at a single point - one dimension varies while the others are fixed at specific values. A typical example is selecting a temperature time series from a sequence of two-dimensional maps (defined by longitude, latitude, and time) - a usual source could be a large netCDF file with a weather forecast. There are three dimensions in this case: time (the varying dimension), longitude (fixed for the selection) and latitude (fixed for the selection). A similar example is the selection of whole columns in a database for analytical processing (like computing averages or sums).
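For concreteness, here is a minimal sketch of such a selection using the Python netCDF4 library; the file name forecast.nc, the variable name temperature, and the grid indices are illustrative assumptions, not a real dataset.

```python
# Minimal sketch: selecting a temperature time series at one grid point
# from a netCDF weather-forecast file. File, variable and index names
# are assumptions made only for illustration.
from netCDF4 import Dataset

ds = Dataset("forecast.nc", mode="r")
lat_idx, lon_idx = 120, 240          # fixed latitude/longitude indices

# Time is the varying dimension; latitude and longitude stay fixed.
# With dimension order (time, latitude, longitude) this slice maps to
# one compact region of the file.
series = ds.variables["temperature"][:, lat_idx, lon_idx]
ds.close()
```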
There is a simple rule for these cases: keep the varying dimension as the last one when writing data. The rationale is simple. When storing data (say, on a hard drive, or sending them over a network), the system has to serialize them as a one-dimensional series of bytes - that is ultimately how all data are stored. So technically, when you select, you just pick the proper slice of this one-dimensional data - the system interprets the read command as: fetch the data (say, on disk) from address x to address y. If your varying dimension is in the wrong position, the system has to either read the whole block of data into memory and then pick the correct elements, or access the device (disk or network) repeatedly to fetch every point of the series. The first can cause memory issues; the second has a significant performance impact once you account for latencies (accessing stored data always takes some time).
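The effect is easy to see with NumPy (which stores arrays in C order, i.e. the last axis is contiguous in memory). The shapes below are arbitrary; the point is only the difference in strides between the two layouts.

```python
import numpy as np

# Synthetic data cube: 365 daily maps of 200 x 200 points.
# Order (time, lat, lon): the time axis is the slowest-varying one in memory
# (C order), so a time series at a fixed point is strided, not contiguous.
cube_time_first = np.zeros((365, 200, 200), dtype=np.float32)

# Order (lat, lon, time): the time axis is last, so the same time series
# occupies one contiguous block of 365 values.
cube_time_last = np.zeros((200, 200, 365), dtype=np.float32)

strided_series = cube_time_first[:, 100, 150]     # 365 values scattered across the array
contiguous_series = cube_time_last[100, 150, :]   # one contiguous block of 365 values

print(strided_series.strides)     # 200 * 200 * 4 bytes between consecutive elements
print(contiguous_series.strides)  # 4 bytes between consecutive elements
```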
Unfortunately, writing data whose source dimension order differs from the desired one is slow and potentially memory-hungry. If you are lucky and know the size and composition of the data set in advance, there is room for optimization. For example, you can reduce memory requirements by serializing the data in chunks. However, if the source is a stream of data, you might quickly run out of memory.
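One possible chunking scheme, assuming the full shape is known in advance, is sketched below with NumPy memory-mapped files standing in for the real storage layer; the file names, shapes and chunk size are made up.

```python
import numpy as np

# Source arrives as (time, lat, lon); target layout is (lat, lon, time).
n_time, n_lat, n_lon = 365, 200, 200

# Assumes "source.dat" already holds the data in (time, lat, lon) order.
src = np.memmap("source.dat", dtype=np.float32, mode="r",
                shape=(n_time, n_lat, n_lon))
dst = np.memmap("target.dat", dtype=np.float32, mode="w+",
                shape=(n_lat, n_lon, n_time))

chunk = 50  # number of latitude rows held in memory at once
for start in range(0, n_lat, chunk):
    stop = min(start + chunk, n_lat)
    # Read a slab of the source, move the time axis to the end, write it out.
    slab = np.asarray(src[:, start:stop, :])          # shape (time, chunk, lon)
    dst[start:stop, :, :] = np.moveaxis(slab, 0, -1)  # shape (chunk, lon, time)

dst.flush()
```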
The third case is the most problematic one. There is only one rule: order the dimensions from the least varying to the most varying. The rationale is the same as above (data are, in the end, just a one-dimensional series). However, there are many exceptional cases where even this rule is not the best one to follow (see the use case below).
How are data stored technically?
As mentioned above, data are technically stored in just one dimension. This raises the question: how can multidimensional data be stored in just one dimension and reconstructed afterwards? The answer is simpler than one might think. Suppose you have just three dimensions of sizes (l1, l2, l3). Then each point P has coordinates (p1, p2, p3) - each pi indexed from 0 to li (upper bound excluded). You can easily map these three coordinates to a single coordinate, say q, using the relation:
q = p1 + l1 ⋅ p2 + l1 ⋅ l2 ⋅ p3
Draw a cube and try it in practice if you do not believe it. A similar relation gives the inverse mapping (from one dimension back to three dimensions):
p3 = FLOOR(q / (l1 ⋅ l2))
p2 = FLOOR((q - p3 ⋅ l1 ⋅ l2) / l1)
p1 = q - p3 ⋅ l1 ⋅ l2 - p2 ⋅ l1
where FLOOR is the floor function (take the integer part and discard the fractional one).
The same logic can be used for any number of dimensions, although it is harder to picture for more than three, since there is no simple way to visualize the problem geometrically.
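The relations above translate directly into a few lines of Python; the small loop at the end just verifies that the two mappings are inverses of each other on a toy cube.

```python
import math

def flatten(p1, p2, p3, l1, l2, l3):
    """Map coordinates (p1, p2, p3) of an l1 x l2 x l3 cube to one index q."""
    return p1 + l1 * p2 + l1 * l2 * p3

def unflatten(q, l1, l2, l3):
    """Inverse mapping: recover (p1, p2, p3) from the flat index q."""
    p3 = math.floor(q / (l1 * l2))
    p2 = math.floor((q - p3 * l1 * l2) / l1)
    p1 = q - p3 * l1 * l2 - p2 * l1
    return p1, p2, p3

# Quick check on a small cube: every point maps to a unique q and back.
l1, l2, l3 = 4, 3, 2
for p3 in range(l3):
    for p2 in range(l2):
        for p1 in range(l1):
            q = flatten(p1, p2, p3, l1, l2, l3)
            assert unflatten(q, l1, l2, l3) == (p1, p2, p3)
```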
Use case
A typical problem is a time series of maps. For example, say that you want to monitor irradiance at some location from a satellite. You have a 2D map of irradiance for each time step (time itself is the third dimension), so the dimensions are longitude, latitude and time. Such maps are crucial in practice for predicting the production of solar panels.
The following selection requests are the most common ones:
- Select the irradiance time series at some geographical location (to analyze average values, peaks, etc.).
- Select some polygon on the map and compute the average irradiance at each time step (returning just a 1D time series of these average values). This is crucial when building large solar installations (where you do not have just one point).
The first problem is the simple one; the best approach is to have the time dimension in the last place.
The second use case, on the other hand, is quite problematic. The optimal dimension order depends on the average size of the polygons. If your polygons are big enough, having the time dimension in the first place is the best option (and having it in the last place is the worst). This is due to the computation performed at each time step. If you have a big polygon and the time dimension comes first, you can compute the average value at each time step very swiftly, because each time step is one compact 2D map. If the time dimension comes last, you must read each column (in the geometric representation) separately, which is slow.
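Here is a rough NumPy sketch of the second request, using a rectangular sub-region as a stand-in for the polygon; the shapes and indices are arbitrary and serve only to contrast the two layouts.

```python
import numpy as np

n_time, n_lat, n_lon = 365, 200, 200
rng = np.random.default_rng(0)

# Layout A: time first -> each time step is one contiguous 2D map,
# so the per-time-step polygon average touches one compact block.
maps_time_first = rng.random((n_time, n_lat, n_lon), dtype=np.float32)
avg_a = maps_time_first[:, 50:150, 80:180].mean(axis=(1, 2))   # shape (365,)

# Layout B: time last -> the same polygon is scattered across the array,
# and every time step requires strided reads.
maps_time_last = rng.random((n_lat, n_lon, n_time), dtype=np.float32)
avg_b = maps_time_last[50:150, 80:180, :].mean(axis=(0, 1))    # shape (365,)
```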
Column-oriented data stores (and DBMS)
There is one surprising place where dimension order matters a lot: the environment of Apache Hadoop-based databases (and similar proprietary environments). Technically, a database table is just a two-dimensional array, and again the order of dimensions matters. Typically, rows are the last index when the data are serialized. That makes sense because the typical request is to select rows based on some condition. However, in analytical processing (OLAP), the standard request may look very different. For example, if you mainly compute sums and averages or select whole columns, the optimal order is to have columns as the last dimension. There are analytical tools in the Hadoop environment that take this into account - like Apache Kudu, which implements precisely this logic. Many other database systems are also available (not only HDFS-based) - for example, time-series databases (most commonly InfluxDB).
A typical example of storing massive data sets is the telemetry of a large e-shop. Business analysts often need to analyse how many items have been sold, the average price of a package, the weight of items, how many things are in stock, and so on. These are typical examples of analytical queries performed primarily over columns. They also take a lot of time to run (if you have more than a million items), so the optimization makes perfect sense.
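In miniature, the column-oriented idea looks like this: each column is kept as its own contiguous array, so a per-column aggregate never has to touch the other columns. The table and column names below are invented for illustration.

```python
import numpy as np

n_orders = 1_000_000
rng = np.random.default_rng(42)

# Column-oriented layout: one contiguous array per column.
orders = {
    "price":  rng.random(n_orders) * 100.0,   # price of each order
    "weight": rng.random(n_orders) * 10.0,    # weight of each order
    "items":  rng.integers(1, 20, n_orders),  # number of items in each order
}

# Analytical queries scan only the columns they actually need.
total_items = orders["items"].sum()
average_price = orders["price"].mean()
```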
Read vs write trade-off
Again, it is important to know that if your data arrive in some specific dimension order, it is not trivial to store them in a different order. Always be aware of the trade-off between fast reading and fast writing. If you write data just for archival purposes (and do not expect frequent processing), it is handier to write them as they are and not permute any dimensions. The problem of dimension permutation is even more challenging if you write a massive amount of data (that cannot fit in operational memory). In that case, the data should be written in chunks. Generally, there is no simple manual for dealing with this issue - you have to use your common sense.
Remarks and summary
If you are lucky enough to have a simple problem that fits one of the above categories, you can achieve a significant performance improvement just by swapping the dimension order of your data. In many cases, unfortunately, there is no simple solution. You may often need more than one copy of the data to optimize performance (each with a different dimension order). The trade-off between writing data in a changed dimension order (which is slow) and reading them must also be considered. Do not be afraid to use your common sense when dealing with this issue, as there is generally no precise manual for your particular problem.
❋ Tags: Dimensions Big Data Performance Geospatial NetCDF