Reading large tiles from S3 directly with `rasterio`
At SpotR, we make heavy use of rasterdata containing gridded height measurements.
When working in python
, the rasterio
package is useful. This package is
essentially a more pythonic binding to the GDAl library, as explained in their
introduction. The file below was obtained from data.gov.uk and shows a 1x1km
patch of height measurements in the UK. The resolution is 1000x1000 pixels,
every pixel represents the (maximum) height of 1x1m. The lighter dots along the
top of the image are houses, there are some ragged parts where there is no data.
In the middle there is a depression, could be a riverbed, and then at the bottom
the terrain is rising a little.
import rasterio import matplotlib.pyplot as plt import numpy as np with rasterio.open("/tmp/sd9863_DSM_1M.tiff") as dataset: heights = dataset.read(1) fn = "images/heights.png" # replace nodata values with nan heights[np.where(heights==dataset.nodata)] = np.nan plt.imshow(heights) plt.title("Example of a 1km x 1km raster") plt.tight_layout() plt.savefig(fn) fn
Rastertiles, typically GeoTiff files, can become quite large in terms of memory
size. This grid above takes up \~4Mb as an uncompressed GeoTiff file, down from
6.5Mb as a .asc
file, which is a simple text-based format. There are a couple
of interesting compression techniques like DEFLATE and LZW that can bring the
size of the data down further. It is possible to convert rasters with
rasterio
, but the gdal_translate
utility is the tool for the job.
gdal_translate /tmp/sd9863_DSM_1M.asc /tmp/sd9863_DSM_1M.tiff > /dev/null gdal_translate /tmp/sd9863_DSM_1M.asc /tmp/sd9863_DSM_1M_lzw.tiff -co COMPRESS=LZW > /dev/null gdal_translate /tmp/sd9863_DSM_1M.asc /tmp/sd9863_DSM_1M_def.tiff -co COMPRESS=DEFLATE > /dev/null gdal_translate /tmp/sd9863_DSM_1M.asc /tmp/sd9863_DSM_1M_def_pred.tiff -co COMPRESS=DEFLATE -co PREDICTOR=2 > /dev/null ls -lha /tmp/sd*
-rw-rw-r-- 1 gijs gijs 6,5M jun 13 2018 /tmp/sd9863_DSM_1M.asc -rw-rw-r-- 1 gijs gijs 1,1M mei 9 09:11 /tmp/sd9863_DSM_1M_def_pred.tiff -rw-rw-r-- 1 gijs gijs 1,5M mei 9 09:11 /tmp/sd9863_DSM_1M_def.tiff -rw-rw-r-- 1 gijs gijs 1,8M mei 9 09:11 /tmp/sd9863_DSM_1M_lzw.tiff -rw-rw-r-- 1 gijs gijs 3,9M mei 9 09:11 /tmp/sd9863_DSM_1M.tiff
Interestingly, all compression techniques available in GDAL
are lossless.
There are JPEG based compression systems, but they can only be applied to 8bit
unsigned data, in other words, images, and these height measurements which are
organized as floating point numbers cannot be stored using JPEG compression. I
can definitely think of some usecases where some distortion of these
measurements is fine, as long as it's bounded somehow, but I haven't come across
examples of a lossy compression for rasters of floating points.
Partial reads
Compression can save us almost an order of magnitude, but to store this data at our scale, things still add up. I live in the Netherlands which has an area of 41,543 km2. That's 40k+ tiles at 1Mb+ each, 50Gb in total. Perfect to save on cloud storage such as S3.
aws s3 ls s3://heights-tiles/tiles/sd980
2022-04-29 23:08:55 2903641 sd9800_DSM_1M.tiff 2022-04-29 23:08:54 2871755 sd9801_DSM_1M.tiff 2022-04-29 23:08:54 2938302 sd9802_DSM_1M.tiff 2022-04-29 23:08:55 2719476 sd9803_DSM_1M.tiff 2022-04-29 23:08:55 2643684 sd9804_DSM_1M.tiff 2022-04-29 23:08:55 2533681 sd9805_DSM_1M.tiff 2022-04-29 23:08:55 2715498 sd9806_DSM_1M.tiff 2022-04-29 23:08:55 2818095 sd9807_DSM_1M.tiff 2022-04-29 23:08:55 2755601 sd9808_DSM_1M.tiff 2022-04-29 23:08:56 468739 sd9809_DSM_1M.tiff
When doing a calculation, we're typically not interested in the whole of the tile. For example, we only want to know the height of a single pixel in the raster file. It is possible to avoid downloading the whole file, this operation can be done using a partial read. This is possible because S3 allows random-access reads, and GDAL supports reading over a network with virtual file systems.
Depending on how large your tiles are, this can make a big difference. Let's benchmark this.
import rasterio from rasterio.windows import Window with rasterio.open("s3://heights-tiles/tiles/sd9800_DSM_1M.tiff") as raster: dt = raster.read(1, window=Window(500, 500, 501, 501))
time python src/read_raster_window.py
real 0m17,300s user 0m3,026s sys 0m1,038s
Wait a minute .. 17 seconds is still a long time. It turns out that GDAL
will
scan the whole folder for other files before opening a file. This is interesting
behaviour that makes sense when geodata files are often accompanied by other
files that include information about transformation, possibly some indexes and
more. We can disable this behaviour by setting an environment value.
time GDAL_DISABLE_READDIR_ON_OPEN=YES python src/read_raster_window.py
real 0m1,230s user 0m0,400s sys 0m0,948s