SFM stores data on 2 volumes:
- sfm-data: The data volume is where SFM stores the harvested social media content and web resources, the db files, and exports. This is described in more detail below. It is available within containers as /sfm-data.
- sfm-processing: The processing volume is where processed data is stored when using a processing container. (See Command-line exporting/processing.) It is available within containers as /sfm-processing.
There are 2 types of volumes:
- Internal to Docker. The files on the volume will only be available from within Docker containers.
- Linked to a host location. The files on the volumes will be available from within Docker containers and from the host operating system.
The type of volume is specified in the .env file. When selecting a link to a host location, the path on the host environment must be specified:
# Docker internal volume DATA_VOLUME=/sfm-data # Linked to host location #DATA_VOLUME=/src/sfm-data:/sfm-data # Docker internal volume PROCESSING_VOLUME=/sfm-processing # Linked to host location #PROCESSING_VOLUME=/src/sfm-processing:/sfm-processing
We recommend that you use an internal volume only for development; for other uses linking to a host location is recommended. This make it easier to place the data on specific storage devices (e.g., NFS or EBS) and to backup the data.
SFM files are owned by the sfm user (default uid 990) in the sfm group (default gid 990). If you use a link to a host location and list the files, the uid and gid may be listed instead of the user and group names.
If you shell into a Docker container, you will be the root user. Make sure that any operations you perform will not leave behind files that do not have appropriate permissions for the sfm user.
Note then when using Docker for Mac and linking to a host location, the file ownership may not appear as expected.
Directory structure of sfm-data¶
The following is a outline of the structure of sfm-data:
/sfm-data/ collection_set/ <collection set id> README.txt (README for collection set) <collection id>/ README.txt (README for collection) state.json (Harvest state record) heritrix_job/ Web harvest state records records/ JSON records for the collection metadata <year>/<month>/<day>/<hour>/ WARC files containers/ <container id>/ Working files for individual containers elk/ <container id>/ ELK files export/ <export id>/ Export files postgresql/ Postgres db files
SFM will monitor free space on sfm-data and sfm-processing. Administrators will be notified when the amount of free space crosses a configurable threshold. The threshold is set in the .env file:
# sfm-data free space threshold to send notification emails,only ends with MB,GB,TB. eg. 500MB,10GB,1TB DATA_VOLUME_THRESHOLD=10GB # sfm-processing free space threshold to send notification emails,only ends with MB,GB,TB. eg. 500MB,10GB,1TB PROCESSING_VOLUME_THRESHOLD=10GB