SFM stores data in multiple directories, which may be set up as separate volumes:
sfm-db-data: Postgres database for sfm-ui data
sfm-export-data: exports storage
sfm-containers-data: Docker containers data
sfm-collection-set-data: collection set data, including WARCs
sfm-mq-data: RabbitMQ data
sfm-processing: The processing volume is where processed data is stored when using a processing container. (See Command-line exporting/processing.) It is available within containers as /sfm-processing.
There are 2 types of volumes:
Internal to Docker. The files on the volume will only be available from within Docker containers.
Linked to a host location. The files on the volumes will be available from within Docker containers and from the host operating system.
The type of volume is specified in the .env file. When selecting a link to a host location, the path on the host environment must be specified:
# Docker internal volume DATA_VOLUME_COLLECTION_SET=/sfm-collection-set-data # Linked to host location #DATA_VOLUME_COLLECTION_SET=/src/sfm-data/sfm-collection-set-data:/sfm-collection-set-data # Docker internal volume PROCESSING_VOLUME=/sfm-processing # Linked to host location #PROCESSING_VOLUME=/src/sfm-processing:/sfm-processing
We recommend that you use an internal volume only for development; for other uses linking to a host location is recommended. This make it easier to place the data on specific storage devices (e.g., NFS or EBS) and to backup the data.
SFM files are owned by the sfm user (default uid 990) in the sfm group (default gid 990). If you use a link to a host location and list the files, the uid and gid may be listed instead of the user and group names.
If you shell into a Docker container, you will be the root user. Make sure that any operations you perform will not leave behind files that do not have appropriate permissions for the sfm user.
Note then when using Docker for Mac and linking to a host location, the file ownership may not appear as expected.
Directory structure of SFM data¶
The following is a outline of the structure of sfm data:
/sfm-collection-set-data/ collection_set/ <collection set id> README.txt (README for collection set) <collection id>/ README.txt (README for collection) state.json (Harvest state record) records/ JSON records for the collection metadata <year>/<month>/<day>/<hour>/ WARC files /sfm-containers-data containers/ <container id>/ Working files for individual containers /sfm-export-data export/ <export id>/ Export files /sfm-db-data postgresql/ Postgres db files /sfm-mq-data rabbitmq RabbitMQ files
SFM will monitor free space on data volumes and sfm-processing. Administrators will be notified when the amount of free space crosses a configurable threshold. The threshold is set in the .env file:
# sfm-data free space threshold to send notification emails. Values must end with MB,GB,TB. eg. 500MB,10GB,1TB # Use DATA_THRESHOLD_SHARED when all data volumes are on the same filesystem and DATA_SHARED_USED is True. #DATA_THRESHOLD_SHARED=6GB DATA_VOLUME_THRESHOLD_DB=10GB DATA_VOLUME_THRESHOLD_MQ=10GB DATA_VOLUME_THRESHOLD_EXPORT=10GB DATA_VOLUME_THRESHOLD_CONTAINERS=10GB DATA_VOLUME_THRESHOLD_COLLECTION_SET=10GB # sfm-processing free space threshold to send notification emails,only ends with MB,GB,TB. eg. 500MB,10GB,1TB PROCESSING_VOLUME_THRESHOLD=10GB