Saving arrays

Beyond the basic data types of dictionaries, lists, strings and numbers, the most important thing ASDF can save is arrays. It’s as simple as putting a numpy array somewhere in the tree. Here, we save an 8x8 array of random floating-point numbers (using numpy.random.rand). Note that the resulting YAML output contains information about the structure (size and data type) of the array, but the actual array content is in a binary block.

from asdf import AsdfFile
import numpy as np

tree = {'my_array': np.random.rand(8, 8)}
ff = AsdfFile(tree)
ff.write_to("test.asdf")

Note

In the file examples below, the YAML header is shown as it appears in the file. The BLOCK sections are stored as binary data in the file, but are presented here in human-readable form.

test.asdf:

#ASDF 1.2.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 2.2.0.dev1526}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: {name: asdf, version: 2.2.0.dev1526}
my_array: !core/ndarray-1.0.0
  source: 0
  datatype: float64
  byteorder: little
  shape: [8, 8]
...
BLOCK 0:
    allocated_size: 512
    used_size: 512
    data_size: 512
    data: b'52ed7b97ded3db3f75b46049786eed3f4c3e7b80...'
#ASDF BLOCK INDEX
%YAML 1.1
--- [523]
...

Sharing of data

Arrays that are views on the same data automatically share the same data in the file. In this example an array and a subview on that same array are saved to the same file, resulting in only a single block of data being saved.

from asdf import AsdfFile
import numpy as np

my_array = np.random.rand(8, 8)
subset = my_array[2:4,3:6]
tree = {
    'my_array': my_array,
    'subset':   subset
}
ff = AsdfFile(tree)
ff.write_to("test.asdf")

test.asdf:

#ASDF 1.2.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 2.2.0.dev1526}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: {name: asdf, version: 2.2.0.dev1526}
my_array: !core/ndarray-1.0.0
  source: 0
  datatype: float64
  byteorder: little
  shape: [8, 8]
subset: !core/ndarray-1.0.0
  source: 0
  datatype: float64
  byteorder: little
  shape: [2, 3]
  offset: 152
  strides: [64, 8]
...
BLOCK 0:
    allocated_size: 512
    used_size: 512
    data_size: 512
    data: b'803217de555ad43f5a3be0a73276d23f0ef68d7c...'
#ASDF BLOCK INDEX
%YAML 1.1
--- [652]
...

Saving inline arrays

As of asdf-2.2.0, small numerical arrays are automatically stored inline. The default threshold size for inline versus internal arrays can be found with the following:

>>> from asdf.block import _DEFAULT_INLINE_THRESHOLD_SIZE
>>> print(_DEFAULT_INLINE_THRESHOLD_SIZE)
50

The default threshold can be overridden by passing the inline_threshold argument to the asdf.AsdfFile constructor. Setting inline_threshold=0 forces all arrays, regardless of size, to be stored in internal blocks:

from asdf import AsdfFile
import numpy as np

# Ordinarily an array this size would be automatically inlined
my_array = np.ones(10)
tree = {'my_array': my_array}
# Set the inline threshold to 0 to force internal storage
with AsdfFile(tree, inline_threshold=0) as ff:
   ff.write_to("test.asdf")

test.asdf:

#ASDF 1.2.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 2.2.0.dev1526}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: {name: asdf, version: 2.2.0.dev1526}
my_array: !core/ndarray-1.0.0
  source: 0
  datatype: float64
  byteorder: little
  shape: [10]
...
BLOCK 0:
    allocated_size: 80
    used_size: 80
    data_size: 80
    data: b'000000000000f03f000000000000f03f00000000...'
#ASDF BLOCK INDEX
%YAML 1.1
--- [521]
...

The set_array_storage method can be used to set or override the default storage type of a particular data array. The allowed values are internal, external, and inline.

  • internal: The default. The array data will be stored in a binary block in the same ASDF file.
  • external: Store the data in a binary block in a separate ASDF file (also known as “exploded” format, which is discussed below in Saving external arrays).
  • inline: Store the data as YAML inline in the tree.
from asdf import AsdfFile
import numpy as np

my_array = np.random.rand(8, 8)
tree = {'my_array': my_array}
ff = AsdfFile(tree)
ff.set_array_storage(my_array, 'inline')
ff.write_to("test.asdf")

test.asdf:

#ASDF 1.2.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 2.2.0.dev1526}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: {name: asdf, version: 2.2.0.dev1526}
my_array: !core/ndarray-1.0.0
  data:
  - [0.45782551569644714, 0.7633025502276053, 0.7254753394851623, 0.9821566220720997,
    0.18718865739211277, 0.5746702220691329, 0.24070533627409674, 0.14658140052230195]
  - [0.27129699724425216, 0.08634235997981876, 0.9230251313186303, 0.899583614121922,
    0.9256137230289652, 0.3046377648966896, 0.37833786797538926, 0.7243346473390603]
  - [0.35438635104472493, 0.6447064069898568, 0.6999080353994773, 0.05296781283296903,
    0.0962381299055779, 0.06347949906817751, 0.2832278839075333, 0.5886037196677254]
  - [0.03778547638419516, 0.0017606684546414009, 0.19746475546997566, 0.8950557733658656,
    0.4315971537144243, 0.22655818962981078, 0.29020484516304146, 0.6564729835635267]
  - [0.5062814004109937, 0.6790461276125752, 0.338465697909228, 0.08717418885512562,
    0.7410203972629893, 0.7886773060610927, 0.35448014152945106, 0.5513357595878581]
  - [0.07942284712286207, 0.24084316603635758, 0.9319369270567768, 0.2624172649716906,
    0.9074245759870824, 0.5612188111914184, 0.3732991601667718, 0.4444902079812102]
  - [0.47903663400309826, 0.16596938966903418, 0.6694562534770692, 0.10877846460293517,
    0.5801094842535167, 0.9358856252372156, 0.8699171032963758, 0.5577373778739075]
  - [0.21237428137859837, 0.42685889787020614, 0.18197599110795548, 0.310147309587676,
    0.3894900683597923, 0.26783582908591574, 0.001582725352663461, 0.8314048584059968]
  datatype: float64
  shape: [8, 8]
...

Alternatively, it is possible to use the all_array_storage parameter of AsdfFile.write_to and AsdfFile.update to control the storage format of all arrays in the file.

# This controls the output format of all arrays in the file
ff.write_to("test.asdf", all_array_storage='inline')

Saving external arrays

ASDF files may also be saved in “exploded form”, which creates multiple files corresponding to the following data items:

  • One ASDF file containing only the header and tree.
  • n ASDF files, each containing a single array data block.

Exploded form is useful in the following scenarios:

  • Not all text editors may handle the hybrid text and binary nature of the ASDF file, and therefore either can’t open an ASDF file or would break an ASDF file upon saving. In this scenario, a user may explode the ASDF file, edit the YAML portion as a pure YAML file, and implode the parts back together.
  • Over a network protocol, such as HTTP, a client may only need to access some of the blocks. While reading a subset of the file can be done using HTTP Range headers, it still requires one (small) request per block to “jump” through the file to determine the start location of each block. This can become time-consuming over a high-latency network if there are many blocks. Exploded form allows each block to be requested directly by a specific URI.
  • An ASDF writer may stream a table to disk, when the size of the table is not known at the outset. Using exploded form simplifies this, since a standalone file containing a single table can be iteratively appended to without worrying about any blocks that may follow it.

To save a block in an external file, set its block type to 'external'.

from asdf import AsdfFile
import numpy as np

my_array = np.random.rand(8, 8)
tree = {'my_array': my_array}
ff = AsdfFile(tree)

# On an individual block basis:
ff.set_array_storage(my_array, 'external')
ff.write_to("test.asdf")

# Or for every block:
ff.write_to("test.asdf", all_array_storage='external')

test.asdf:

#ASDF 1.2.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 2.2.0.dev1526}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: {name: asdf, version: 2.2.0.dev1526}
my_array: !core/ndarray-1.0.0
  source: test0000.asdf
  datatype: float64
  byteorder: little
  shape: [8, 8]
...

test0000.asdf:

#ASDF 1.2.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 2.2.0.dev1526}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: {name: asdf, version: 2.2.0.dev1526}
...
BLOCK 0:
    allocated_size: 512
    used_size: 512
    data_size: 512
    data: b'07c1f4c83c9ce83f0cbbe67a7cdedd3fa42dc1fd...'
#ASDF BLOCK INDEX
%YAML 1.1
--- [425]
...

Like inline arrays, this can also be controlled using the all_array_storage parameter of AsdfFile.write_to and AsdfFile.update.

Streaming array data

In certain scenarios, you may want to stream data to disk, rather than writing an entire array of data at once. For example, it may not be possible to fit the entire array in memory, or you may want to save data from a device as it comes in to prevent data loss. The ASDF standard allows exactly one streaming block per file where the size of the block isn’t included in the block header, but instead is implicitly determined to include all of the remaining contents of the file. By definition, it must be the last block in the file.

To use streaming, rather than including a Numpy array object in the tree, you include an asdf.Stream object, which sets up the structure of the streamed data but does not write out the actual content. The file handle’s write method is then used to manually write out the binary data.

from asdf import AsdfFile, Stream
import numpy as np

tree = {
    # Each "row" of data will have 128 entries.
    'my_stream': Stream([128], np.float64)
}

ff = AsdfFile(tree)
with open('test.asdf', 'wb') as fd:
    ff.write_to(fd)
    # Write 100 rows of data, one row at a time.  ``write``
    # expects raw binary bytes, not an array, so we use
    # ``tobytes()``.
    for i in range(100):
        fd.write(np.array([i] * 128, np.float64).tobytes())

test.asdf:

#ASDF 1.2.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 2.2.0.dev1526}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: {name: asdf, version: 2.2.0.dev1526}
my_stream: !core/ndarray-1.0.0
  source: -1
  datatype: float64
  byteorder: little
  shape: ['*', 128]
...
BLOCK 0:
    flags: BLOCK_FLAG_STREAMED
    allocated_size: 0
    used_size: 0
    data_size: 0
    data: b'0000000000000000000000000000000000000000...'

A case where streaming may be useful is when converting large data sets from a different format into ASDF. In these cases it would be impractical to hold all of the data in memory as an intermediate step. Consider the following example that streams a large CSV file containing rows of integer data and converts it to numpy arrays stored in ASDF:

import csv
import numpy as np
from asdf import AsdfFile, Stream

tree = {
    # We happen to know in advance that each row in the CSV has 100 ints
    'data': Stream([100], np.int64)
}

ff = AsdfFile(tree)
# open the output file handle
with open('new_file.asdf', 'wb') as fd:
    ff.write_to(fd)
    # open the CSV file to be converted
    with open('large_file.csv', 'r') as cfd:
        # read each line of the CSV file
        reader = csv.reader(cfd)
        for row in reader:
            # convert each row to a numpy array
            array = np.array([int(x) for x in row], np.int64)
            # write the array to the output file handle
            fd.write(array.tobytes())

Compression

Individual blocks in an ASDF file may be compressed.

All blocks can easily be compressed with zlib or bzip2:

from asdf import AsdfFile
import numpy as np

tree = {
    'a': np.random.rand(32, 32),
    'b': np.random.rand(64, 64)
}

target = AsdfFile(tree)
target.write_to('target.asdf', all_array_compression='zlib')
# Overwrites the previous file, using bzp2 compression instead
target.write_to('target.asdf', all_array_compression='bzp2')

target.asdf:

#ASDF 1.2.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 2.2.0.dev1526}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: {name: asdf, version: 2.2.0.dev1526}
a: !core/ndarray-1.0.0
  source: 0
  datatype: float64
  byteorder: little
  shape: [32, 32]
b: !core/ndarray-1.0.0
  source: 1
  datatype: float64
  byteorder: little
  shape: [64, 64]
...
BLOCK 0:
    compression: bzp2
    allocated_size: 8341
    used_size: 8341
    data_size: 8192
    data: b'a019d451e2efc13f2acf2e02b03ded3f50895c01...'
BLOCK 1:
    compression: bzp2
    allocated_size: 31735
    used_size: 31735
    data_size: 32768
    data: b'be4cfd487fabdd3f64d5beb7f1a4ee3f2278eaf4...'
#ASDF BLOCK INDEX
%YAML 1.1
--- [611, 9006]
...

The lz4 compression algorithm is also supported, but requires the optional lz4 package to be installed.

When reading a file with compressed blocks, the blocks will be automatically decompressed when accessed. If a file with compressed blocks is read and then written out again, by default the new file will use the same compression as the original file. This behavior can be overridden by explicitly providing a different compression algorithm when writing the file out again.

import asdf

# Open a file with some compression
af = asdf.open('compressed.asdf')

# Use the same compression when writing out a new file
af.write_to('same.asdf')

# Or specify the (possibly different) algorithm to use when writing out
af.write_to('different.asdf', all_array_compression='lz4')

Memory mapping

By default, all internal array data is memory mapped using numpy.memmap. This allows for the efficient use of memory even when reading files with very large arrays. The use of memory mapping means that the following usage pattern is not permitted:

import asdf

with asdf.open('my_data.asdf') as af:
    ...

# This fails, because the memory-mapped arrays are no longer available:
af.tree

Specifically, if an ASDF file has been opened using a with context, it is not possible to access the file contents outside of the scope of that context, because any memory mapped arrays will no longer be available.

It may sometimes be useful to copy array data into memory instead of using memory maps. This can be controlled by passing the copy_arrays parameter to either the AsdfFile constructor or asdf.open. By default, copy_arrays=False.