In 2000, NASA Space Shuttle Endeavour had a 11 days mission named SRTM. It was about acquiring topographic information of earth
surface, from 56°S to 60°N. Data has been released in public domain. It's one of the
main sources for free global topographic data.
I want to play with these data, and the first part of the game is
putting data into a postgis database. Fortunately, this
had already been done as part of Route
altitude profiles SRTM. It's a 2008 Google Summer of Code project by
Sjors Provoost. Srtm2postgis is a small utility that
downloads srtm data, and loads it into postgis.
Download urls for SRTM had changed, so original srtm2postgis does not work as-is.
Fortunately, a fork with updated urls is available on github.
So, I started tickering with srtm2postgis. I fixed a few minor bugs. For
example, srtm data is split into 14546 files. 14540 of them have all upper
case name, and 6 have lower letters name.
Those 6 files could not be downloaded. I also fixed some typos in the README,
and a few failing tests, but overall, the script was running smoothly.
Then I spent some time trying to improve import time. During its summer
of code, Sjors Provoost reported
importing data at one tile a minute speed. A 75 tiles import (44°N to 50°N
and 3°E to 7°W) took me 80 minutes. This about the same as Sjors Provoost result.
To connect to postgres, srtm2postgis uses pygresql, but also uses psycopg because of its copy_from
method. But copy_from method was ultimately not used, and a psql subprocess
was launched to run the copy instead. A comment explained that psycopg copy_from method
was hanging. I tried to use it and it was not hanging. As
the comment was nearly 3 years old, I assume this is a now fixed bug. I'm also
glad this was clearly documented. Otherwise, this workaround would have been
impossible to understand. With this modification, my 75 tiles import was
only 49 minutes. I really don't understand why this improves
performances so much, but I'm happy about it.
Then, I totally remove pygresql, and only used psycopg. This allows me, first
to open only one connection, then, to run everything in a single transaction, and commit everything at the end. Import was now 42 minutes.
After improving database-related performances, I tried to speed up things in other parts of the code. SRTM data is read by gdal library. It returns a 1200 * 1200 arrays of numpy.int16. Converting int16 to string is something really slow. On a 1200 * 1200 grid, many points have the same altitude. So, I use a cache to store string representations of numpy.int16 altitudes. This resulted in a major performance improvement. My 75 tile import is now 28 minutes.
I also use string buffer instead of a temporary file to store data to copy, but this does not improve performance significantly. I also tried to remove table primary key constraint and recreate it after import. I thought it would improve copy speed. But actually, it damaged performances a lot.
Eventually, I run
ANALYZE after import. This takes 10 minutes, but this will probably
improve future performances.
I could improve import time from 80 to 28 minutes (38 minutes with
VACUUM ANALYZE), with is nearly 3 times faster!!! My updated srtm2postgis
version is available on