Merging Rasters using Rasterio
In this blog, we’ll examine how to merge or mosaic rasters using Python, the modern way. Additionally, we would look at a few nuances and internal workings of rasterio’s merge functionality along with saving your rasters in-memory.
By “modern way”, it is implied that you have an improved workflow and data management. And that you can experiment with various scenarios quickly and efficiently.
The traditional way to mosaic data is by downloading multiple intersecting tileset in its entirety. Downloading an entire tileset is itself a cost prohibitive task, added to already lost time in searching desired satellite imagery on GUI.
To overcome these traditional challenges, there has been significant improvement in storing metadata of satellite imagery (namely STAC) which has enabled querying them much smoother and made it imagery-provider agnostic.
TL;DR
We would perform the following task in this blog — - Use pystac to query items over our AOI - Plot the tiles on map using hvplot - Merge tiles without data download on local machine - Save the merged tile in-memory using rasterio’s MemoryFile - Internals of rasterio’s merge methods
Problem at hand
I wish to access sentinel-2 True Color Image for the month of January over my area of interest (AOI), which is a highway network across Karnataka and Andhra Pradesh (Figure 1).
We start by fetching sentinel-2 tiles over our AOI from sentinel-s2-l2a-cogs
STAC catalog using pystac-client. This library allows us to crawl STAC catalog and enables rapid access to the metadata we need.
# STAC API root URL
# Thanks to element84 for hosting the API for sentinel-2 catalog.
= 'https://earth-search.aws.element84.com/v0/'
URL
= Client.open(URL)
client
= client.search(
search = 10,
max_items = "sentinel-s2-l2a-cogs",
collections = gdf.total_bounds, # geodataframe for our region of study
bbox = '2022-01-01/2022-01-24'
datetime )
In the above code, we search for 10 sentinel-s2-l2a-cogs over our AOI for the date between January 1 and 24 of 2022.
Now we need to know which of our 10 queried images covers our area of interest in its entirety. To do that, we can plot all the search results on the map and visually inspect.
We see that our AOI is not covered by a single tile in entirety, and that there is a need to merge adjacent tiles.
Note that we have so far only queried the metadata of our desired imagery
We use rasterio’s merge functionality, which would enable us to combine all of them seamlessly.
First, we get all the tiles for a single day and look for True Color Image (TCI) band
# retrieve the items as dictionaries, rather than Item objects
items = list(search.items_as_dicts())
# convert found items to a GeoDataFrame
items_gdf = items_to_geodataframe(items)
tiles_single_day = items_gdf.loc['2022-01-23', "assets.visual.href"]
# print(tiles_single_day)
properties.datetime
2022-01-23 05:25:14+00:00 https://sentinel-cogs.s3.us-west-2.amazonaws.c...
2022-01-23 05:25:11+00:00 https://sentinel-cogs.s3.us-west-2.amazonaws.c...
2022-01-23 05:24:59+00:00 https://sentinel-cogs.s3.us-west-2.amazonaws.c...
2022-01-23 05:24:56+00:00 https://sentinel-cogs.s3.us-west-2.amazonaws.c...
Name: assets.visual.href, dtype: object
Next, read the remote files via the URL in the above output using rasterio.open
and save the returned file handlers as a list. This is the first instance where we are dealing with the actual imagery. Although, we are not reading the values stored in the data just yet.
# open images stored on s3
= [rasterio.open(row) for row in tiles_single_day] file_handler
Finally we can merge all of the tiles and get the clipped raster stored in memory.
from rasterio.io import MemoryFile
from rasterio.merge import merge
= MemoryFile()
memfile
=file_handler, # list of dataset objects opened in 'r' mode
merge(datasets=tuple(gdf.set_crs("EPSG:4326").to_crs(file_handler[0].crs).total_bounds), # tuple
bounds=None, # float
nodata='uint16', # dtype
dtype=Resampling.nearest,
resampling='first', # strategy to combine overlapping rasters
method=memfile.name, # str or PathLike to save raster
dst_path={'blockysize':512, 'blockxsize':512} # Dictionary
dst_kwds )
There are really interesting things to look at in the above code. Overall, the code above returns a MemoryFile
object which contains a uint16
raster with bounds of our AOI and blocksize of 512. The attribute dst_path
allows us to specify a path to save the output as a raster. What is interesting is we can not only pass a file path to save on local disk but also a virtual path and save the merged raster in-memory, avoiding clutter of additional files on disk.
To define a virtual path, we use rasterio’s MemoryFile
class. When we create a MemoryFile
object, it has a name
attribute which gives us a virtual path, thus treating it as a real file (using GDALs vsimem internally). This MemoryFile object (memfile
here) provides us all the methods and attributes of rasterio’s file handler, which is extremely helpful.
print(memfile.open().profile)
'driver': 'GTiff', 'dtype': 'uint16', 'nodata': 0.0, 'width': 4110, 'height': 3211, 'count': 3, 'crs': CRS.from_epsg(32643), 'transform': Affine(10.0, 0.0, 788693.4700669964,
{0.0, -10.0, 1500674.3670768766), 'blockxsize': 512, 'blockysize': 512, 'tiled': True, 'compress': 'deflate', 'interleave': 'pixel'}
The method='first'
tells us the strategy used to determine the value of the pixel where the rasters overlap. In this case, the pixel value from the first imagery of the overlapping region in the list, is used as the value for the output raster.
The entire algorithm to merge rasters is illustrated in the figure below by taking an example of combining two rasters with method=first
.
From the above figure, for each raster in the list: - it finds the intersection with the Output Bounds (named region
in the figure)
next, it gets a boolean mask of invaild pixel over the
region
(namedregion_mask
in the figure).next, it copies over all the existing values from the raster for the
region
to an array (namedtemp
in the figure)It gets a boolean mask for the valid pixels in the
temp
array. (namedtemp_mask
in the figure)With these four arrays, it runs the
method=first
, which is to- create the same shaped array as that of
region
and fill values with negation ofregion_mask
(namedA
in the figure) - create a filter by combining
region_mask
andA
with a AND gate (namedB
in the figure) - copy over the values from
temp
toregion
usingB
as the filter
- create the same shaped array as that of
These series of steps are performed for all the rasters in the list. Finally, the output at the end of each iteration is combined to produce dest
raster.
Notice the dark strip bands for each array which represents the overlapping region. Also notice that values from the dark strip in step
1
did not change at the end of step2
Custom combining strategy for overlapping regions
We can have arbitrary conditions on how to combine the overlapping region. By default rasterio uses values of the first overlapping raster from the list of Input files as pixel values for the output raster file. It has several other options in its utility such as min
, max
, sum
, count
, last
.
To define our custom method, say in this case, I want to take the average of all the pixel values over my overlapping region and copy them to the output file. To do that, we can override the method by defining our custom method. Let us see how —
We take a look at the source code of built-in methods which make use of two or more rasters to make decisions on the output pixel values. Few such methods which do that are copy_sum
, copy_min
, copy_max
, copy_count
.
Looking at the copy_min from source code, we see that it performs two logical operations each before and after the custom logic we wish to apply.
We would replace our custom logic of averaging with that of minimum
in the above code and that is all there is to it. We can now use this function to manipulate the values of overlapping region!
def custom_method_avg(merged_data, new_data, merged_mask, new_mask, **kwargs):
"""Returns the average value pixel."""
= np.empty_like(merged_mask, dtype="bool")
mask =mask)
np.logical_or(merged_mask, new_mask, out=mask)
np.logical_not(mask, out=0, out=merged_data, where=mask)
np.nanmean([merged_data, new_data], axis=mask)
np.logical_not(new_mask, out=mask)
np.logical_and(merged_mask, mask, out=mask, casting="unsafe") np.copyto(merged_data, new_data, where
Endnote
The modern approach to merge rasters in python is to only stream the data for your region of interest, process and perform analysis on the raster in memory. This would save you a huge cost and time. This is possible because of COGs and STAC.
We looked at the merge method in depth and also explored the techniques used to combine the overlapping data. Finally, we created a custom method for merging rasters by modifying the existing code to suit our requirements. The code associated with this post can be found here.