AI Adventures in Azure: making use of blob i/o

As part of my AI for Earth software development, I am working hard on a system that simultaneously applies a supervised classification algorithm, an inverted radiative transfer model, an albedo calculation and an energy balance model pixel-wise over satellite imagery of the Greenland Ice Sheet. This is hugely computationally expensive and very memory hungry. I will write another post about how computational speed-up was achieved using vectorization and distribution over cores; in this post I want to talk briefly about some of the methods I’ve used to avoid running out of memory.

  1. RAM

RAM is the computer’s fast-access memory. In Python, many operations are performed on data held in RAM because it is fast – the operating system does not have to search for and retrieve the data from the hard drive, since everything relevant is already sitting in this quick-access store. However, the large arrays I’m using for the satellite data processing quickly fill up all the available RAM, mostly because some of the algorithmic processes applied to the data spawn very large intermediate arrays. Thankfully, I was able to make use of the Python package “xarray”, which allows computation on large multidimensional arrays to happen out of memory. This means data is only loaded into memory as and when it is actually required, and is written back to disk at other times. Because xarray datasets are built on top of dask arrays, the computation is lazy and can easily be distributed across multiple cores, which keeps the computation fast while preventing it from chewing up all the available RAM. I’ve also spent a lot of time finding the optimal positions in-script to set unused variables to None (which allows the garbage collector to deallocate the memory associated with them) and, in some cases, explicitly calling the Python garbage collector to make sure the deallocated memory is actually returned.
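To make that concrete, here is a minimal sketch of the kind of lazy, out-of-memory pattern I mean. The file names, variable names and chunk sizes are placeholders rather than the ones from my actual pipeline:

import gc
import xarray as xr

# open the dataset lazily: passing chunks makes xarray back each variable with a
# dask array, so nothing is read from disk until a computation actually needs it
ds = xr.open_dataset('example_tile.nc', chunks={'x': 2048, 'y': 2048})

# operations like this build a lazy task graph instead of allocating huge
# intermediate arrays in RAM (band_a / band_b are placeholder variable names)
mean_band = (ds['band_a'] + ds['band_b']) / 2

# .compute() evaluates the graph chunk by chunk, optionally across several cores
result = mean_band.compute()

# write the result back to disk, then drop references and collect, so the RAM
# is handed back before the next tile is processed
result.to_netcdf('example_output.nc')
ds.close()
ds = None
mean_band = None
result = None
gc.collect()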

  2. Disk memory

The code stores its results in NetCDF files. Each netCDF file is large (>1GB) and a separate file is produced for each tile on each day, so the available local disk storage is at risk of filling up pretty fast when the code is deployed over many tiles and many dates. To resolve this, I’ve made use of Azure blob storage. A small script in the workflow now sends each result file to a container in a storage account on Azure (Microsoft’s cloud computing platform) and deletes it from local storage. This keeps the disk storage free for other operations and prevents the code from failing when the local disk fills up. From an Azure VM the upload and download speed is pretty incredible – a 1GB file can be transferred in or out of blob storage in a few seconds. I’ve copied a few lines of code below to demonstrate how I’m doing this (it took me a while to figure it out from the docs – specifically, it wasn’t clear how to upload into specific subfolders within a container). Bear in mind also that there is a separate function, not shown here, that connects to my specific blob container using the account name and key held in an external config file. This is easy to understand from the docs, and a sketch of that kind of connection function is included after the upload code below. Using this blob i/o method, I only keep the image that is actually being worked on stored locally; all other images and the associated output files are held in the cloud in an Azure storage container.

def dataset_to_blob(self, path_to_ds, delete_local_nc=True):
    """
    Upload the output spatial datasets to blob storage and delete them from local storage.
    This was introduced because running everything locally was using up all available disk
    space on the VM. Requires os and gc imported at module level and an authenticated
    self.block_blob_service (a BlockBlobService from the azure.storage.blob package).
    """
    print("\nUploading netCDF to blob storage\n")
    container_name = 'bisc-outputs'  # name of container in blob store to collect datasets into

    # get list of existing containers
    existing_containers = self.block_blob_service.list_containers()
    existing_container_names = []
    for item in existing_containers:
        existing_container_names.append(item.name)

    # check to see if the bisc-outputs blob container already exists
    if any(container_name in p for p in existing_container_names):
        print('\nCONTAINER {} ALREADY EXISTS IN STORAGE ACCOUNT'.format(container_name))
    else:
        # if the container does not exist, create it
        print('container does not exist, now creating it\n')
        self.block_blob_service.create_container(container_name)

    # with the container in place, send every file in the output directory to the blob
    # container, skipping the 'interpolated' subdirectory itself
    for file in os.listdir(path_to_ds):
        if file != 'interpolated':
            self.block_blob_service.create_blob_from_path(container_name, file,
                                                          os.path.join(path_to_ds, file))
            print("Uploading {}".format(file))

    # if toggled, delete the uploaded files from local storage
    if delete_local_nc:
        files = os.listdir(path_to_ds)
        for f in files:
            if f != 'interpolated':  # do not delete the subdirectory itself, just the files inside
                try:
                    os.remove(os.path.join(path_to_ds, f))
                except OSError:
                    print("did not delete {}".format(f))

    # explicitly call the garbage collector to deallocate memory
    print("GARBAGE COLLECTION\n")
    gc.collect()
    return
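
For reference, the connection function mentioned above looks something like the sketch below. This is a sketch based on the legacy azure-storage-blob SDK (BlockBlobService) and a made-up config file layout; the config section and key names are placeholders, not my exact implementation.

import configparser

from azure.storage.blob import BlockBlobService

def connect_to_blob_service(config_path='azure_credentials.cfg'):
    # read the storage account credentials from an external config file
    # (the 'azure' section and key names here are hypothetical)
    config = configparser.ConfigParser()
    config.read(config_path)
    account_name = config['azure']['account_name']
    account_key = config['azure']['account_key']
    # return an authenticated client of the kind used as self.block_blob_service above
    return BlockBlobService(account_name=account_name, account_key=account_key)

On the subfolder question: blob storage containers have no real directories, so a “subfolder” is just a prefix on the blob name. Passing something like 'outputs/' + file as the blob name to create_blob_from_path is enough to make the file appear inside a virtual folder in the container.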


As always, I’d be delighted to hear if anyone can refine my understanding or suggest better methods.
