Data storage

Data is stored in three main places:

The database
The local filesystem
The data archive

Provenances

If you don’t understand provenances, you will probably do the wrong thing when you query the database.

The concept of Provenances is fundamental to the SeeChange data model. SeeChange is designed with the understanding that we are likely to want to run the same data through the pipeline more than once. (We’ll redo things with different parameters, or we may want to redo a year’s worth of supernova searching once we have better references.) To avoid the chaos of overwriting data when something is redone, SeeChange uses a “provenance” system to keep track of different versions of data products that come from the same original data. Each data product produced by the pipeline is associated with a Provenance object. A Provenance is defined (in models/provenance.py) by the version of the code used to produce the data product, the name of the process (one of a set of standardized names used by the code), the parameters that configure the details of how the process runs, and, crucially, immediately upstream provenances.

Thus, when searching the database, it’s not enough to search for (say) measurements of an object at a given RA an Dec in a given time range. You must also specify the provenance of the measurements you wish to find. If you don’t, you may well get reudndant data, and you will probably get a mess of heterogeneous data.

For example, Image objects usually have a provenance whose process is preprocessing. The parameters include which steps were applied (bias subtraction, flatfielding, nonlinearity correction, etc.) and any configuration parameters for those steps (e.g. what set of flatfields were used). From the image, we might run something like SExtractor to produce a SourceList object. The SourceList provenance process is “extraction”, the parameters include which code was used to generate the source list (SExtractor or something else), and the configuration settings for that code. An extraction provenance has a preprocessing provenance as an upstream. Even if you run exactly the same code, with exactly the same parameters, on an image, but you changed something about the way the image was preprocessed, you don’t expect to get exactly the same source list out the other end! This is why the upstream preprocessing provenance is included as part of the definition of a downstream provenance. If you change the parameters for preprocessing, that will change the provenances for every step of the pipeline (as they are all downstream of preprocessing, so the provenance change will cascaed through each step).

(Technically, the parameters for a Provenance are defined by the pipeline object that performs the process. See Development and Testing for more information.)

Provenance Tags

In practice, provenances are treated as (mostly) opaque objects, and are a field of objects of classes associated with data products. Every Provenance has a simple string id which can be used to refer to it, and to use when searching for objects. These strings are hashes, so are unpleasant for humans to use. The database also allows for the definition of “provenance tags”, which are human readable strings that point to a provenacne for each process. You might, for instance, use the “DR1” provenance tag to get all of the provenacne ids for all processes, and then use those provenance ids to get all the data products associated with “data release 1”. It’s up to the managers of the pipeline to maintain which provenance tags exist and to keep track of what they mean.

Database

We use postgres as the database backend, which allows us to use the q3c extension for fast spatial queries, as well as types like JSONB for storing dictionaries. A SeeChange instance must have a database server associated with it, and any running SeeChange code needs to be able to connect to the PostgreSQL port (5432) of that server. If you’re doing this for local development, see Development and Testing for a self-contained docker environment that creates a throw-away database. Alternatively, you can run PostgreSQL on your local system, or you can set up a server somewehre else you have access.

Database models are defined using SQLAlchemy, and database migrations are tracked with Alembic. Database communication is done using a combination of SQLALchmey and direct SQL using psycopg. The database communications is defined in models/base.py, where we define a Base class for all database mapped objects. Each class that inherits from Base can have columns mapped to object attributes.

Unless you’re writing one-off code that you expect to run by itself (i.e. not as part of a pipline that is mass-processing data), do not hold database connections open for a long time. Sometimes you do want to have a connection open for several operations, either because they’re linked, or because they’re happening fast enough that the inefficiency of opening a new connection between the operations isn’t worth it. However, if code that runs in a pipeline designed to run many times concurrently hold open database connections for extended periods of time, we run the risk of exhausting database connection resources. See SQLAlchemy sessions and Psycopg connections below.

Database communcation should usually be done inside with SmartSession(...) as session: or with PsycopgConnection(...) as conn: blocks, where these two functions are defined in models/base.py. See SQLAlchemy sessions and Psycopg connections below.

Many of our basic database functions take either a SQLAlchemy Session object, or a psycopg connection object, as an optional argument. If you are inside a with block that holds a database connection, pass that connection to the function, and the function will use the same session or connection that you’re already using. (Otherwise, the function will open a new connection.) For example, see test.py::SeeChangeBase.insert(). (SeeChangeBase is a class that most (all?) of our database model objects inherit from, so insert (and others) are defined for all models.) Often, however, you will not want to open your own session, but just let these functions open sessions themselves. This is usually the case when you are doing any non-trivial computation between database calls, as the overhead of establishing a new database connection will become small compared to the computation you’re doing.

SQLAlchemy sessions

While the SQLAlchemy syntactic sugar can, in simple cases, simplify the mapping of database rows to object properties, we aren’t using the full infrastructure of SQLAlchemy. The reason is that SQLAlchemy implicitly assumes you’re going to keep a database connection open as long as an object that might want to communicate with the database exists. This is fine for web applications whose total runtime is less than a second. It’s a total disaster for a pipeline that might run for a couple of minutes, and for which there may more than a hundred copies of that pipeline all running at once. (And, no, connection pools does not solve the issue, but in fact makes it worse because each parallel process ends up keeping its own pool, so connections don’t even get closed when you close them!) Previously, use used SQLAlchemy relations, but those caused constant development nightmares and debugging tangles when we didn’t hold open database connections; eventually, we refactored the code to get rid of those, and it was the cause of much relief. However, we find that there are things you want to do in PostgreSQL that the SQLAlchemy intermediary makes very difficult, and for all but the simplest queries, the SQLAlchmy syntax is just as byzantine as SQL (and frustratingly different for somebody who already knows SQL). What’s more, whereas you might have some idea what you’re doing with SQL, you can’t really be sure what an SQLAlchmey construction is actually going to do with the database. Sometimes, it’s very frustrating to figure out how to get SQAlchemy to actually close conections on the backend in a reproducible manner, meaning we were stuck with lingering locks (which only appeared intermittently, to make debugging as hard as possible) that would have been trivial to deal with if we hadn’t foolishly wedded ourselves to SQLAlchemy. As such, at some point we may refactor the code to remove SQLAlchemy entirely (to more great relief)— see Issue #516. For the most part, except for simplest things, when we write new code we use direct connections with psycopg (see Psycopg connections below).

If you want to communicate with the databse using SQLAlchemy constructions, this section describes how to get a connection.

The file models/base.py defines two SQLAlchemy session starting functions, Session() and SmartSession(). If you’re using SQLAlchemy, we strongly urge developers to use SmartSession(); see below. To open a regular session, use the Session() function, and make sure to close it when you are done. This is useful mainly for interactive use, and highly discouraged for using in the code base. The session is a sqlalchemy.orm.session.Session object:

from models.base import Session
session = Session()
# do stuff with the session
obj = MappedObject()  # can be any of the models
session.add(obj)
session.commit()

# make sure to close at the end
session.close()

The Session() function can also be used in a context manager, which will automatically close the session at the end of the block:

from models.base import Session
with Session() as session:
    # do stuff with the session
    obj = MappedObject()  # can be any of the models
    session.add(obj)
    session.commit()

However, we encourage developers to use the SmartSession() function for such uses. The main difference is that a SmartSession accepts inputs that can be either another session object or None. If the input is None, it will open a new session, which will close at the end of the context. If the input given to it is a session object, it will not close it at the end of the context, rather it will depend on the external context that made the session to close it when done.

This is particularly useful in nested functions:

from models.base import SmartSession

def my_function(session=None):
    with SmartSession(session) as session:
        # do stuff with the session
        obj = MappedObject()  # can be any of the models
        session.add(obj)
        session.commit()

# call the function with a self-contained session
my_function()  # session opens and closes inside the function scope

# call the function with an externally created session
with SmartSession() as session:
    my_function(session)
    # session will be closed here, not in the function

Make sure not to hold open connections for a long time. In normal operation of the pipeline, many processes are running in parallel. If they all hold connections open at once, they may exhaust database resources. There is some overhead in creating a databse connection, so if you’re going to do several fast operations in a row, do them within one connection. But, if you’re going to do substantial computations between two calls to the database, make sure to close the database connection in between. This usually means exiting the with block that started the connection. Additionally, be careful about passing a session nested functions that do substantial computation, as that will hold the database connection open throughout the running of the nested function.

Psycopg connections

While it’s possible to do direct SQL with SQLAlchemy sessions, either with the session itself, or by (painfully) extracting the underlying database connection, if what you really need to do is direct SQL, you’re probably better off with a simple psycopg connection. If you are going to do anything that locks tables, you must do it this way. Empirically, SQLAlchemy does not seem to actually close the underlying databse connection when you call the close method of a session, but rather markes it for closing sometime later during garbage collection. The result is that if you lock a table using SQL passed to an SQLAlchemy session, most of the time it will work, but rarely the table will not unlock when you close the session (or exist the relevantg with block). If you really need control over your database, use an actual database connection and save yourself the complication of another opininated layer between you and the database.

models/base.py provides a function PsycopgConnection() that must be used as a context manager (i.e. in a with block). It takes one optional argument, and returns a standard psycopg.connection object. If that argument is None, it opens a new connection to the databse, and then calls rollback and closes that connection when the context exits. (This means that if you want any changes you made to presist, you must make sure to call commit on the connection.) If that argument is an existing psycopg connection, it just immediately returns that, and does not automatically rollback or close it. (In that case, whatever created the connection in the first place is responsible for doing that.) This latter use is analgous to passing an existing SQLAlchemy session to SmartSession, as described above.

As a trivial example:

with PsycopgConnection() as conn:
    cursor = conn.cursor()
    cursor.execute( "SELECT _id FROM images WHERE mjd>60000. AND Mjd<60001." )
    foundids = [ row[0] for row in cursor.fetchall() ]

For a less trivial example, see models/provenance.py::ProvenanceTag.addtag() (which locks a table).

Defining mapped classes

(Users working with the code to analyze existing data can probably skip this section; this is necessary for developers who are adding to the pipeline code itself.)

The models folder contains a module for each database table, which defines a mapped class for that table. Each mapped class inherits from the Base class defined in models/base.py, and has a __tablename__ attribute that defines the table name.

The class definition of each mapped class includes columns and relationships that map to the database table columns and relationships. For example, a SourceList will have a num_sources column attribute that corresponds to a column in the database table.

Do not use SQLAlchemy relationships. Previously, the code defined things like a image field of the SourceList table that use SQLAlchemy lazy loading to load the Image record associated with the source list when you accessed it. This led to a constant stream of painful debugging of detached instances and other mysterious and hard-to-debug errors that were the result of SQLAlchemy’s design assuming that you were holding a connection open throughout your entire computation (which, while reasonable for a webap that completes its computations in less than a second, is not reasonable for a data analysis pipeline that will run for two minutes). Instead, code relationships with standard SQL foreign keys. For example, SourceList has the image_id field, which is a foreign key pointing at the _id column of the images table (the table behind the Images model). If you want to get the actual image data, just run:

image = Image.get_by_id( sources.image_id )

(This method is defined in models/base.py::UUIDMixin.get_by_id.)

The __init__() method of each mapped class can define other attributes that are not mapped to the database. Also, the init_on_load() function should be defined to initialize any such attributes when an object is loaded, rather than created from scratch. Use the orm.reconstructor decorator to define this function:

from sqlalchemy.orm import reconstructor

class MyObject:
    def __init__(self):
        self.my_attribute = None

    @reconstructor
    def init_on_load(self):
        self.my_attribute = None

This makes sure those properties are defined and initialized even when an object is not created with its __init__() function.

If you change a mapped class, or define a new mapped class, make sure also to update the database migrations. See “Database migrations” in TODO FIGURE OUT WHERE THESE ARE, PROBABLY DEVELOPMENT.

Files on disk

Some of the data products are far too large to store in PostgreSQL tables. For example, the image data itself is stored in FITS files on disk. Each such resource must also be mapped to a database object, so it can be queried and loaded when needed. These files have an “official” location in the archive (see Data archive below), as well as a (potentially transitory) location in a local file store.

In models/base.py, we define a FileOnDisk mixin class, that allows a database row to also store a file (or multiple files) on disk in a local directory. The mixin class also has the ability to save and load the data from the archive. The FileOnDisk class defines a filepath attribute, which is relative to some base directory, defined by FileOnDisk.local_path. Changing the local_path allows the same database entries to correspond to different filesystems, but with identical relative paths. The relative filepath can include subdirectories. If the FileOnDisk subclass needs to store multiple files, it should have a filepath that includes the first part of the filename, and filepath_extensions that is an array of strings with the ends of the filenames, which are different for each file. Image is an example of a class that does this:

from models.image import Image
image = Image()      # is a FileOnDisk subclass
image.filepath = 'images/2020-01-01/some_image'
image.filepath_extensions = ['image.fits', 'weight.fits', 'flags.fits']

To save the file to disk, use the save() method of the object. Most of the time, the subclass will implement a save() method, which will call the save() method of the FileOnDisk class after it has actually done the saving of the data (which is specific to each subclass and the data it contains). The FileOnDisk class will make sure the file is in the right place, will check the MD5 checksum, and will push the file to the archive (see below). The object will generally not be saved to the database until after it has a legal filepath and MD5 checksum. This makes sure database objects are mapping to actual files on disk. It should be noted that those files could later be removed without the database knowing about it.

Usually, to load data from disk, you just access the relevant properties of the object. For instance, if you have an Image object that corresponds to a pre-existing image file, just access the .data property of the Image object; if there isn’t already data in memory, just accessing the property triggers it to be laoded from disk. (It will also make a copy from the archive to the local file store if necessary.)

To remove the file from local storage only, use remove_data_from_disk(). To remove the file from local disk, archive and database, use delete_from_disk_and_database().

(From an implementation point of view, note that loading of the data is usually not done entirely in the FileOnDisk mixin, as that is specific to each subclass. In general, we add a private _data attribute to the subclass, which is None when the object is initialized or loaded from the database. Then a public data property will cause a lazy load of the data from disk when it is first accessed, putting it into _data. To get the full path of the file stored on disk (e.g., for loading it) use the FileOnDisk.get_fullpath() method, which attaches the local_path to the filepath attribute, and includes extensions if they exist.)

Exposures and data files

An Exposure object encapsulates the (usually) raw data taken by an instrument/telescope at one point in time and space. It can include multiple sub-images if the instrument has multiple sections, e.g., for a mosaic camera, or a multi-channel instrument, e.g., using a dichroic.

The Exposure must have some basic properties such as the time of observation (which is represented as the modified Julian date, or MJD, when the image was taken), the coordinates the telescope was pointing at (RA and Dec), and some housekeeping information like the project/proposal under which the images were taken, the name of the object/target/field that was observed, the exposure time, and the filter used (or filter array).

The Exposure object will contain the path(s) to the underlying raw data files. Ideally, these data files are not modified at all from when they were generated at the telescope. Additional processing steps, such as bias/flat should be done on the next processing level (i.e., when generating Image objects) so the details of the processing can be tracked. However, the code does support downloading pre-reduced exposures, e.g. exposures processed through the standard NOIRLab pipeline and available on the NOIRLab archive.

The most common use case for Exposure objects is to (somehow) acquire a file with data, make a new Exposure object that links to it, and then commit that object to track the file. There are many tools built into the Exposure class to read the header data to automatically fill in all the information needed to populate the Exposure object. The Exposure can then be used directly or loaded at a different time from the database.

Once linked to a file, the Exposure object’s data attribute can be accessed, which will lazy download the file from server and then lazy load the image data into numpy arrays. For example:

from models.exposure import Exposure
exp = Exposure(filename='Camera_Project_2021-01-01T00:00:00.000.fits')

print(type(exp.data))  # this is a SectionData object, defined in models/exposure.py
print(type(exp.data[0]))  # numpy array for section zero

for section_id in exp.instrument_object.get_section_ids():
    print(exp.data[section_id].shape)  # print the shape of each section

In some cases, the data is in memory and needs to be saved to disk. An Exposure can then be generated without a file using Exposure(nofile=True). The user will then need to fill in the column data (such as MJD) and use some method like Exposure.save() to generate the file.

Images

An Image encapsulates a single two-dimensional array of data, typically from one chip (though we call it “sensor section” in the code) of an exposure. For instances, for DECam, an Exposure has all of the data (usually raw) taken for all the chips for one pointing of the telescope. That will be broken into 60 or 62 (modulo bad chips) different Image objects, each of which encapsulates the data for each chip.

Although we violate this principle in a few places, for the most part SeeChange is designed so that data products tracked in the database are WORM (“Write Once, Read Many”). That means that once an Image object is created, neither the data, nor the header in the data file, nor the columns of the database table row for that image, ever change again in the future. For this reason, we perform basic preprocessing (flatfielding, etc.) before loading an Image object into the database. If you want to get access to raw image data, you have to do so through the Exposure object. (Another side effect of this is that things like the best WCS and the best zeropoint will not generally be in the image headers! Rather, you will need to get the WCS and zeropoint products separately from the database. This is covered in Using SeeChange.)

Exposure and image headers

Each Exposure object is associated with a single FITS file (or sometimes multiple files, for different sections, though at the moment the code assumes that all of an exposure is packed into a single file). Header information is loaded from the FITS file, but this information can be accessed in a few different places. First, and what should be used most of the time (because they should have the same meanings for exposures from any source, whereas detailed header keywords are always observatory-specific), some of the properties of the Exposure object probably originally came from the exposure file’s header (e.g. mjd, filter, exp-time). Exposure object have a info property which is a dictionary that contains a small subset of the full FITS header. Generally only the properties we intend to query on will be saved here. The choice of keywords that get copied into the info dictionary are the combination of the keys in exposure.EXPOSURE_HEADER_KEYS and Instrument.get_auxiliary_exposure_header_keys(). The latter is used to let each Instrument subclass to define additional keywords that are useful to have in the info dictionary. The info dictionary is saved to the database as a JSONB column. Since some of this information is saved as independent columns (like exp_time), the info column does not necessarily keep much information (information is not duplicated). Note that this dictionary is filled using the global Exposure header, not the header of individual sections. The keywords in this dictionary are all lower-case, and are translated to standardized names using the _get_header_keyword_translations() method of the Instrument class. This makes it easy to tell them apart from the header information (in upper case) which also uses instrument-specific keywords. The value of the info values are also converted to standard units using the _get_header_values_converters()

In addition to the info column which is saved to the database, the Exposure also has a header and a section_headers attributes. The header is a dictionary-like object (using astropy.io.fits.Header) that contains the full FITS header of the file. This is not saved to the database, but is lazy loaded from the exposure FITS file when needed. The headers use all upper case keywords, as per the FITS standard. The section_headers property is a SectionHeaders object (also defined in models/exposure.py) which acts like a dictionary that lazy loads the FITS header from file for a specific section when needed. In cases where multiple section data is saved in one FITS file, there would usually be a primary HDU that contains the global exposure header information, and additional extension HDUs with their own image data and headers. In this case the section_headers are all different from the Exposure’s header. If the raw data includes a separate FITS file for each section, then each file would have a different header, and the “exposure global header” would arbitrarily be the header of the first file. If the instrument only has one section, this is trivially true as well.

After running basic pre-processing, we split each Exposure object into one or more Image objects. These are already section-specific, so we have fewer properties to track when looking for the data or headers. The Image object’s data property contains the pixel values (usually after some pre-processing). In addition to the pixel values, we also keep some more data arrays relevant to the image. These include the flags array, which is an integer bit-flag array marking things like bad pixels, the weight array, giving the inverse variance of each pixel (noise model), and additional, optional arrays like the score array which is a “match-filtered” image, normalized to units of signal-to-noise. These arrays are all numpy arrays, and are saved to disk using the format defined in the config file.

The Image object’s header property contains the section-specific header, copied directly from the Exposure object’s section_headers property. Some header keywords may be added or modified in the pre-processing step. This header is saved to the file, and not the database. The Image object’s info attribute contains a subset of the section-specific header. Again, this dictionary uses standardized names, in lower case, that are searchable on the database. This info dictionary is produced during the pre-processing step and contains only the important (searchable) keywords, but not those that are important enough to get their own columns in the images table (e.g., the MJD of the image).

Exposure and Image Data on disk

The Image data is saved to disk using the format defined in the config file (using the config options under storage.images). Currently, the code base only supports FITS files. Unlike the Exposure object, which is linked to files that were created by an instrument we are not in control of, the files associated with an Image object are created by the pipeline. We can also choose to save all the different arrays (data, weight, flags, etc.) in different files, or in the same file (using multiple extensions). (Currently, only saving to different files is well-supported and well-tested.) This is defined in the config file using storage.images.single_file. In either case, the additional arrays are saved with their own headers, which are all identical to the Image object’s header dictionary. (…Maybe. To be safe, totgally ignore weight and flags headers, and only use image headers.)

Many subsequenct data products (including source lists, world coordinate systems, difference images, detections on difference images, etc.) are also stored on disk; some subsequent data projects (e.g. zeropoints) only have database records.

Data archive

Local storage may be limited. It may, for instance, be on a “scratch” filesystem where old files are purged, or it may be on a small enough file system that occasionally you have to clean it out manually. What’s more, you may want to run the pipeline in multiple different places that don’t share the same filesystem. For this reason, the pipeline defines an “archive” where all of the files referenced the database are supposed to exist. The idea is that the archive may be less performant than the local filesystem. The local filesystem can be thought of as a cache for the archive. When an object is saved and comitted to the database, it is also pushed up to the archive; it’s not considered fully saved until it exists on the archive. (Unless you’ve defined your entire archive as null in the config YAML file, in which case you don’t have an archive.) When the code (in particular, the FileOnDisk mixin in models/base.py) loads a file, it first looks for it on the local filesystem; if it doesn’t find it, it pulls it down from the archive, and then loads it. Optionally, if the file is found on the local filesystem, the code verifies that the md5sum of the local file matches what’s in the database; if there’s a mismatch, it’s either an error, or the file is pulled from the archive and overwritten, based on how the archive functions are called.

Multiple different installations of SeeChange can work on the same collection of data as long as they’re configured in a consistent manner, using consistent versions of the code, and point to the same database and archive. They do not need to share a local data store.

The class that defines interaction with the archive is in Util/archive.py (which is actually a symbolic link to the same file in a submodule located in the extern subdirectory.) Users of the pipeline will usually not need to directly run the archive, as it’s all run transparently in the data saving and loading functions in the pipeline. (However, it might be useful if you want to make sure you have a local copy of a file for, e.g., viewing with an image viewer.) If you do want to use it, rather than instantiating an Archive object directly, call models/base.py::get_archive_object(). That will get you an archive object based on the configuration parameters on the configuration YAML file.

The data archive could be a local filesystem, for example on an external hard drive with lots of storage space, or it can be a remote server. The archive parameters must be defined in the config YAML file, under the archive key; see the comments there for more documentation on keywords that define the archive.

In general, whenever using the save() command, specify no_archive=True to avoid writing to the archive (only to local storage). This is useful for testing, or doing one-off calculations where you want to write data files, but are not saving records to the database. If not specified, the archive will be used. Whenever adding something to the databse, you must use the archive.