As part of my AI for Earth software development, I am working hard on a system that simultaneously applies a supervised classification algorithm, an inverted radiative transfer model, an albedo calculation and an energy balance model pixel-wise to satellite imagery of the Greenland Ice Sheet. This is hugely computationally expensive and very memory hungry. I will write another post about how computational speed-up was achieved using vectorization and distribution over cores. In this post I want to talk briefly about some of the methods I’ve used to avoid running out of memory.
- RAM

RAM is the computer’s fast-access working memory. Many Python operations hold data in RAM because reading from and writing to it is far faster than searching for and retrieving data from the hard drive. However, the large arrays I’m using for the satellite data processing quickly fill up all the available RAM, mostly because some of the algorithmic processes applied to the data spawn very large intermediate arrays. Thankfully, I was able to make use of the Python package “xarray”, which allows computation on large multidimensional arrays to occur out of memory. This means data is only loaded into memory as and when it is actually required, and written back to disk storage at other times. Because xarray datasets can be backed by dask arrays, the computation is lazy and easily distributed across multiple cores, so computation speed is maintained while the available RAM is not chewed up. I’ve also spent a lot of time finding optimal positions in-script to set unused variables to None (which allows the garbage collector to deallocate the memory associated with that object, provided no other references to it remain) and in some cases explicitly calling the Python garbage collector to ensure the deallocated memory is made available.
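To make the pattern concrete, here is a minimal sketch of dask-backed lazy computation in xarray plus the manual memory-release trick. The array sizes, chunk sizes and variable names are illustrative, not taken from my actual pipeline (which opens NetCDF files with `xr.open_dataset(..., chunks=...)` rather than building arrays in memory).

```python
# Sketch: lazy, chunked computation with xarray/dask, then manual cleanup.
# All names and sizes here are illustrative.
import gc
import numpy as np
import xarray as xr

# Wrap an array as a dask-backed DataArray; operations on it stay lazy and
# are evaluated chunk-by-chunk instead of materialising giant intermediates.
band = xr.DataArray(np.random.rand(2000, 2000), dims=("y", "x")).chunk(
    {"y": 500, "x": 500}
)

# Building this expression allocates no large arrays yet - it is a task graph
normalised = (band - band.mean()) / band.std()

# .compute() triggers the actual work, scheduled by dask across cores
result = normalised.compute()

# Once an intermediate is no longer needed, drop the reference and
# (optionally) force a garbage-collection pass to reclaim memory promptly
band = None
gc.collect()
```

The same idea scales to arrays much larger than RAM, because only the chunks currently being processed are ever held in memory.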
- Disk storage
The code stores results in NetCDF files. Each NetCDF file is large (>1 GB) and a separate file is produced for each tile on each day, so the available local disk storage is at risk of filling up pretty fast when the code is deployed over many tiles and many dates. To resolve this, I’ve made use of Azure blob storage. Now, a small script in the workflow sends each result file to a container in a storage account on Azure (Microsoft’s cloud computing platform) and then deletes it from local storage. This keeps the disk storage free for other operations and prevents the code from failing by running out of space. From an Azure VM the upload and download speed is pretty incredible – a 1 GB file can be transferred in or out of blob storage in a few seconds. I’ve copied a few lines of code below to demonstrate how I’m doing this (it took me a while to figure it out from the docs – specifically, it wasn’t clear how to upload into specific subfolders within a container). Bear in mind also that there is a separate function, not shown here, that connects to my specific blob container using the account credentials held in an external config file; this is easy to understand from the docs. Using this blob i/o method, I only keep the image that is actually being worked on stored locally; all other images and the associated output files are held in the cloud in an Azure storage container.
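A minimal sketch of that upload-then-delete pattern is below. It assumes the v12 `azure-storage-blob` SDK; the container client is created elsewhere from the account credentials (as described above), and `make_blob_name`/`upload_and_delete` are hypothetical helper names. The "subfolder" trick is that blob storage has a flat namespace: a path-like blob name such as `tile/date/file.nc` is what renders as nested folders in the portal.

```python
import os
import posixpath


def make_blob_name(tile, date, filename):
    # Blob storage has a flat namespace: "subfolders" are just "/"-separated
    # prefixes baked into the blob name itself.
    return posixpath.join(tile, date, filename)


def upload_and_delete(container_client, local_path, blob_name):
    # container_client is an azure.storage.blob.ContainerClient, obtained
    # elsewhere from credentials in a config file (assumed v12 SDK).
    with open(local_path, "rb") as f:
        container_client.upload_blob(name=blob_name, data=f, overwrite=True)
    # Free the local disk once the blob is safely in the cloud
    os.remove(local_path)
```

For example, `make_blob_name("tileA", "2017-07-25", "result.nc")` gives `"tileA/2017-07-25/result.nc"`, which appears as nested folders inside the container (file names here are purely illustrative).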
As always, I’d be delighted to hear if anyone can refine my understanding or suggest better methods.