Sequential and parallel programming are two different approaches, each with its own advantages and considerations, especially when dealing with tasks like raster transformation using GDAL.
Loop programming involves iterating through data sequentially, performing operations step by step. It’s straightforward and easy to understand, making it accessible for many developers. On the other hand, parallel programming involves breaking tasks into smaller parts and executing them simultaneously, often speeding up processes significantly but requiring careful management of resources and potential synchronization issues.
We are going to generate NDVI GeoTiffs from the Copernicus Sentinel-2 bands 04 (red) and 08 (NIR-1).
The data we are using represents a zone next to Barcelona.
The images from Copernicus are taken on October 12th, 2023.
The input files come from ten Sentinel-2 tiles.
The tile size is 10,980 × 10,980 pixels.
The band tiffs weight from 8 to 253 MB.
Here the list of the ten tiles we are using:
The True Color visualization of these tiles:
I’ve created a script called “sequential.sh”.
#!/bin/bash
# Directory containing Sentinel-2 image files
export base_dir=/some_dir
# Directory where NDVI output will be saved
export destination_dir=/some_other_dir
# Choose the date that matches the downloaded tiles
export date=20231012
# Get unique tiles from files in base_dir
tiles=$(ls $base_dir | grep $date | grep '.tif$' | cut -d "." -f 1 | cut -d "_" -f 1-3 | sort -u)
# Loop through each unique date
for tile in $tiles; do
# Define file paths for B04, B08, and NDVI for the current date
b04="${base_dir}/${tile}.B04.tif"
b08="${base_dir}/${tile}.B08.tif"
ndvi="${destination_dir}/${tile}.ndvi.tif"
# Calculate NDVI using gdal_calc.py tool
gdal_calc.py --overwrite -A "${b04}" --A_band=1 -B "${b08}" --B_band=1 --calc="(B-A)/(B+A)" --outfile="${ndvi}"
done
#generate a virtual raster to visualize in QGIS
gdalbuildvrt ${destination_dir}/${date}.ndvi.vrt $(find $destination_dir | grep .tif | grep $date )
Bash has a timing feature called “time”. We’ll use it to measure the performances.
$ time sequential.sh
...
real 0m28.547s
user 0m29.108s
sys 0m17.317s
Here the second script called “parallel.sh”
#!/bin/bash
# Directory containing Sentinel-2 image files
export base_dir=/some_dir
# Directory where NDVI output will be saved
export destination_dir=/some_other_dir
# Choose the date that matches the downloaded tiles
export date=20231012
# Extract unique dates from files in base_dir
tiles=$(ls $base_dir | grep $date | grep '.tif$' | cut -d "." -f 1 | cut -d "_" -f 1-3 | sort -u)
# Define function to calculate NDVI for a given date
ndvi(){
tile=$1
echo $date
# Define file paths for B04, B08, and NDVI for the current date
b04="${base_dir}/${tile}.B04.tif"
b08="${base_dir}/${tile}.B08.tif"
ndvi="${destination_dir}/${tile}.ndvi.tif"
# Calculate NDVI using gdal_calc.py tool
gdal_calc.py --overwrite -A "${b04}" --A_band=1 -B "${b08}" --B_band=1 --calc="(B-A)/(B+A)" --outfile="${ndvi}"
}
# Exporting the 'ndvi' function to be used with parallel processing
export -f ndvi
# Execute the 'ndvi' function in parallel for each date using 'parallel' command
parallel ndvi ::: $tiles
#generate a virtual raster to visualize in QGIS
gdalbuildvrt ${destination_dir}/${date}.ndvi.vrt $(find $destination_dir | grep .tif | grep $date )
When running the script with the “time” function.
$ time parallel.sh
...
real 0m11.892s
user 1m46.785s
sys 0m8.576s
The resulting NDVI image:
The evident difference between parallel programming and sequential programming on a modern multi-threaded processor is clear.
In our case we can process the 10 files in a fraction of the time (11.892 sec vs 28.547 sec). An increase in speed by 240%.
With a big dataset to process the impact would be even higher.
The advantage increases the more data we have to process.
Not all aspects of a geospatial pipeline can be seamlessly parallelized. While pixel-by-pixel calculations often lend themselves well to parallel processing, other operations might pose challenges.
Features that initiate in one tile and extend into another can complicate or entirely change the process of merging resulting datasets. This boundary handling becomes a crucial consideration during parallel processing.
Certain GDAL packages are coded in Python and may not inherently support parallel execution. Understanding the limitations of these packages is vital when designing a parallel processing workflow.
Efficiently managing large datasets involves considering strategies like processing smaller tiles separately and then integrating them into a virtual raster. However, this strategy might not be universally applicable, especially with complex feature interactions.
Incorporating parallelism into geospatial workflows offers significant performance gains, yet it demands a nuanced understanding of data structure, feature interactions, and limitations within the processing tools employed.