Social Feed Manager (SFM) documentation¶
Social Feed Manager is open source software for libraries, archives, cultural heritage institutions and research organizations. It empowers those communities’ researchers, faculty, students, and archivists to define and create collections of data from social media platforms. Social Feed Manager will harvest from Twitter, Tumblr, Flickr, and Sina Weibo and is extensible for other platforms. In addition to collecting data from those platforms’ APIs, it will collect linked web pages and media.
This site provides documentation for installation and usage of SFM. See the Social Feed Manager project site for full information about the project’s objectives, roadmap, and updates.
Contents:¶
Installation and configuration¶
Overview¶
The supported approach for deploying SFM is Docker containers.
Each SFM service will provide images for the containers needed to run the service (in the form of Dockerfiles). These images will be published to Docker Hub. GWU-created images will be part of the GWUL organization and be prefixed with sfm-.
sfm-docker provides the necessary docker-compose.yml files to compose the services into a complete instance of SFM.
For a container, there may be multiple flavors of the container. In particular, there may be the following:
For more information, see Docker.
SFM can be deployed without Docker. The various Dockerfiles should provide reasonable guidance on how to accomplish this.
Configuration¶
Passwords are kept in secrets.env. A template for this file (example.secrets.env) is provided.
Debug mode for sfm-ui is controlled by the DEBUG environment variable in docker-compose.yml. If setting DEBUG to false, the SFM_HOST environment variable must be provided with the host. See the Django documentation for ALLOWED_HOSTS.
The default timezone is Eastern Standard Time (EST). To select a different timezone, change TZ=EST in docker-compose.yml.
Email is configured by providing the SFM_HOST, SFM_SMTP_HOST, SFM_EMAIL_USER, and SFM_EMAIL_PASSWORD environment variables. SFM_HOST is used to determine the host name when constructing links contained in the emails.
Application credentials for social media APIs are configured by providing the TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET, WEIBO_API_KEY, and/or WEIBO_API_SECRET. For more information, see API Credentials.
The data volume strategy is used to manage the volumes that store SFM’s data. By default, normal Docker volumes are used; to use a host volume instead, add the host directory to the volumes field. This will allow you to access the data outside of Docker. For example:
Local installation¶
Installing locally requires Docker and Docker-Compose. See Installing Docker.
Either clone this repository:
or just download docker-compose.yml and example.secrets.env:
Put real secrets in secrets.env.
Bring up the containers:
Amazon EC2 installation¶
To launch an Amazon EC2 instance running SFM, follow the normal procedure for launching an instance. In Step 3: Configure Instance Details, under Advanced Details, paste the following into user data and modify as appropriate:
When the instance is launched, SFM will be installed and started.
Note the following:
- To make changes to docker-compose.yml, you can ssh into the EC2 instance and edit it there.
- docker-compose.yml and secrets.env will be in the default user’s home directory.
Authentication¶
Social Feed Manager allows users to self-sign up for accounts. Those accounts are stored and managed by SFM. Future versions of SFM will support authentication against external systems, e.g., Shibboleth.
By default, a group is created for each user and the user is placed in that group. To create additional groups and modify group membership, use the Admin interface.
In general, users and groups can be administered from the Admin interface.
The current version of SFM is not very secure. Future versions of SFM will more tightly restrict what actions users can perform and what they can view. In the meantime, you are encouraged to take other measures to secure SFM, such as restricting access to the IP range of your institution.
API Credentials¶
Accessing the APIs of social media platforms requires credentials for authentication (also known as API keys). Social Feed Manager supports managing those credentials.
Most API credentials have two parts: an application credential and a user credential. (Flickr is the exception – only an application credential is necessary.)
It is important to understand how credentials/authentication affect which API methods can be invoked and what rate limits apply. For more information, consult the documentation for each social media platform’s API.
Managing credentials¶
SFM supports two approaches to managing credentials: adding credentials and connecting credentials. Both of these options are available from the Credentials page.
Adding credentials¶
For this approach, a user gets the application and/or user credential from the social media platform and provides them to SFM by completing a form. More information on getting credentials is below.
Connecting credentials¶
For this approach, SFM is configured with the application credentials for the social media platform. The user credentials are obtained by redirecting the user to the social media website, where they give SFM permission to access their account.
SFM is configured with the application credentials in the docker-compose.yml. If additional management is necessary, it can be performed using the Social Accounts section of the Admin interface.
This is the easiest approach for users. Configuring application credentials is encouraged.
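As a sketch, the application credentials are provided as environment variables in docker-compose.yml (the service name below is an illustrative assumption; the variable names are those listed in the Configuration section):

```yaml
sfmuiapp:
    environment:
        # Placeholder values; substitute your application credentials.
        - TWITTER_CONSUMER_KEY=yourconsumerkey
        - TWITTER_CONSUMER_SECRET=yourconsumersecret
```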
Platform specifics¶
Twitter¶
Twitter credentials can be obtained from https://apps.twitter.com/.
Weibo¶
For instructions on obtaining Weibo credentials, see this guide.
To use the connecting credentials approach for Weibo, the redirect URL must match the application’s actual URL and use port 80.
Flickr¶
Flickr credentials can be obtained from https://www.flickr.com/services/api/keys/.
Flickr does not require user credentials.
Processing¶
Your social media data can be used in a processing/analysis pipeline. SFM provides several tools and approaches to support this.
Tools¶
Warc iterators¶
A warc iterator tool provides an iterator to the social media data contained in WARC files. When used from the command line, it writes out the social media items one at a time to standard out. (Think of this as cat-ing a line-oriented JSON file. It is also equivalent to the output of Twarc.)
Each social media type has a separate warc iterator tool. For example, twitter_rest_warc_iter.py extracts tweets recorded from the Twitter REST API. For example:
Warc iterator tools can also be used as a library.
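Since a warc iterator emits one JSON object per line, downstream code can consume its output very simply. A minimal sketch (the sample records and field names below are hypothetical, not taken from SFM):

```python
import io
import json

def iter_items(stream):
    """Yield one decoded social media item per non-empty line of JSON."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical sample of line-oriented JSON, standing in for warc iterator output.
sample = io.StringIO(
    '{"id_str": "1", "text": "first tweet"}\n'
    '{"id_str": "2", "text": "second tweet"}\n'
)

items = list(iter_items(sample))
```

In practice, `sample` would be `sys.stdin` when piping from a warc iterator on the command line.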
Find Warcs¶
find_warcs.py helps put together a list of WARC files to be processed by other tools, e.g., warc iterator tools. (It gets the list of WARC files by querying the SFM API.)
Here are the arguments it accepts:
For example, to get a list of the WARC files in a particular collection, provide some part of the collection id:
(In this case there is only one WARC file. If there were more than one, they would be space-separated.)
The collection id can be found from the SFM UI.
Note that if you are running find_warcs.py from outside a Docker environment, you will need to supply --api-base-url.
Approaches¶
Processing in container¶
To bootstrap processing, a processing image is provided. A container instantiated from this image runs Ubuntu 14.04 and comes pre-installed with the warc iterator tools, find_warcs.py, jq, and Twarc (for access to the Twarc utilities). It will also have access to the data from /sfm-data/collections.
To instantiate:
The arguments will need to be adjusted depending on your Docker environment.
You will then be provided with a bash shell inside the container from which you can execute commands:
Processing locally¶
In a typical Docker configuration, the data directory will be linked into the Docker environment. This means that the data is available both inside and outside the Docker environment. Given this, processing can be performed locally (i.e., outside of Docker).
The various tools can be installed locally:
Development¶
Setting up a development environment¶
SFM is composed of a number of components. Development can be performed on each of the components separately. The following describes setting up a development environment for a component.
Step 1: Pick a development configuration¶
For SFM development, it is recommended to run components within a Docker environment (rather than directly on your OS). Docker runs natively (and cleanly) on Ubuntu; on OS X, Docker requires Docker Toolbox.
Since Docker can’t run natively on OS X, Docker Toolbox runs it inside a VirtualBox VM, which is largely transparent to the user. Note that GWU’s configuration of the Cisco AnyConnect VPN client breaks Docker Toolbox. You can work around this with vpn_fix.sh, but this is less than optimal.
Depending on your development preferences and the OS you develop on, you may want to consider one of the following configurations:
- Develop locally and run Docker locally: optimal if using an IDE and not using OS X/Cisco AnyConnect.
- Both develop and run Docker in an Ubuntu VM, either local (e.g., in VMware Fusion) or remote (e.g., a WRLC or AWS VM): optimal if using a text editor.
- Develop locally and run Docker in a local VM with the local code shared into the VM: optimal if using an IDE.
Step 2: Install Docker and Docker Compose¶
See Installing Docker.
Step 3: Clone the component’s repo¶
For example:
git clone https://github.com/gwu-libraries/sfm-ui.git
Step 4: Configure docker-compose.yml¶
Each SFM component should provide a development Docker image and an example dev.docker-compose.yml file (in the docker/ directory).
The development Docker image will run the component using code that is shared with the container. That is, the code is made available at container run time, rather than at build time (as it is for master or production images). This allows you to change code and have it affect the running component if the component (e.g., a Django application) is aware of code changes. If the component is not aware of code changes, you will need to restart the container to pick them up (docker kill <container name> followed by docker-compose up -d).
The development docker-compose.yml will bring up a container running the component and containers for any additional components that the component depends on (e.g., a RabbitMQ instance). Copy dev.docker-compose.yml to docker-compose.yml and update it as necessary. At the very least, you will need to change the volumes link to point to your code:
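As a sketch, the volumes entry in docker-compose.yml might look like the following (the service name, host path, and container path are illustrative assumptions; check your component's dev.docker-compose.yml for the actual values):

```yaml
sfmuiapp:
    image: gwul/sfm-ui:dev
    volumes:
        # Host path to your checked-out code : path expected inside the container
        - "/path/to/your/sfm-ui:/opt/sfm-ui"
```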
You may also need to change the defaults for exposed ports to ports that are available in your environment.
Step 5: Run the code¶
cd docker
docker-compose up -d
For additional Docker and Docker-Compose commands, see below.
Development tips¶
Admin user accounts¶
When running a development docker-compose.yml, each component should automatically create any necessary admin accounts (e.g., a django admin for SFM UI). Check dev.docker-compose.yml for the username/passwords for those accounts.
RabbitMQ management console¶
The RabbitMQ management console can be used to monitor the exchange of messages. In particular, use it to monitor the messages that a component sends, create a new queue, bind that queue to sfm_exchange using an appropriate routing key, and then retrieve messages from the queue.
The RabbitMQ management console can also be used to send messages to the exchange so that they can be consumed by a component. (The exchange used by SFM is named sfm_exchange.)
For more information on the RabbitMQ management console, see RabbitMQ.
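Binding a queue to sfm_exchange uses AMQP topic routing keys, where `*` matches exactly one dot-separated word and `#` matches zero or more. A minimal sketch of those matching rules (the routing keys shown are hypothetical, not SFM's actual keys):

```python
def topic_match(pattern, routing_key):
    """Return True if an AMQP topic binding pattern matches a routing key.

    '*' matches exactly one dot-separated word; '#' matches zero or more words.
    """
    def match(pat, key):
        if not pat:
            return not key
        if pat[0] == "#":
            # '#' may consume any number of remaining words, including none.
            return any(match(pat[1:], key[i:]) for i in range(len(key) + 1))
        if not key:
            return False
        if pat[0] == "*" or pat[0] == key[0]:
            return match(pat[1:], key[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))
```

For example, a queue bound with the pattern `harvest.status.#` would receive a message published with the routing key `harvest.status.twitter.search`.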
Blocked ports¶
When running on a remote VM, some ports (e.g., 15672 used by the RabbitMQ management console) may be blocked. SSH port forwarding can help make those ports available.
Django logs¶
Django logs for SFM UI are written to the Apache logs. In the docker environment, the level of various loggers can be set from environment variables. For example, setting SFM_APSCHEDULER_LOG to DEBUG in the docker-compose.yml will turn on debug logging for the apscheduler logger. The logger for the SFM UI application is called ui and is controlled by the SFM_UI_LOG environment variable.
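In docker-compose.yml, this looks like the following sketch (the service name is an illustrative assumption):

```yaml
sfmuiapp:
    environment:
        # Turn on debug logging for the apscheduler and ui loggers.
        - SFM_APSCHEDULER_LOG=DEBUG
        - SFM_UI_LOG=DEBUG
```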
Apache logs¶
In the SFM UI container, Apache logs are sent to stdout/stderr which means they can be viewed with docker-compose logs or docker logs <container name or id>.
Initial data¶
The development and master docker images for SFM UI contain some initial data. This includes a user (“testuser”, with password “password”). For the latest initial data, see fixtures.json. For more information on fixtures, see the Django docs.
Runserver¶
There are two flavors of the development docker image for SFM UI. gwul/sfm-ui:dev runs SFM UI with Apache, just as it will in production. gwul/sfm-ui:dev-runserver runs SFM UI with runserver, which dynamically reloads changed Python code. To switch between them, change the image field in your docker-compose.yml.
Job schedule intervals¶
To assist with testing and development, a 5 minute interval can be added by setting SFM_FIVE_MINUTE_SCHEDULE to True in the docker-compose.yml.
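For example, in the environment section of the relevant service in docker-compose.yml (service name is an illustrative assumption):

```yaml
sfmuiapp:
    environment:
        - SFM_FIVE_MINUTE_SCHEDULE=True
```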
Docker tips¶
Building vs. pulling¶
Containers are created from images. Images are either built locally or pre-built and pulled from Docker Hub. In both cases, images are created based on the docker build (i.e., the Dockerfile and other files in the same directory as the Dockerfile).
In a docker-compose.yml, pulled images will be identified by the image field, e.g., image: gwul/sfm-ui:dev. Built images will be identified by the build field, e.g., build: app-dev.
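The two cases look like this in a docker-compose.yml (v1 compose syntax, matching the examples above; the service name is an illustrative assumption):

```yaml
# Option 1: pull a pre-built image from Docker Hub.
app:
    image: gwul/sfm-ui:dev

# Option 2: build locally from the Dockerfile in ./app-dev
# (replace the image field with a build field):
# app:
#     build: app-dev
```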
In general, you will want to use pulled images. These are automatically built when changes are made to the Github repos. You should periodically execute docker-compose pull to make sure you have the latest images.
You may want to build your own image if your development requires a change to the docker build (e.g., you modify fixtures.json).
Killing, removing, and building in development¶
Killing a container stops the process running in the container. Running the container again restarts the process. Generally, you will kill and run a development container to get the process to run with changes you’ve made to the code.
Removing a container will delete all of the container’s data. During development, you will remove a container to make sure you are working with a clean container.
Building a container creates a new image based on the Dockerfile. For a development image, you only need to build when making changes to the docker build.
Docker¶
This page contains information about Docker that is useful for installation, administration, and development.
Installing Docker¶
Install Docker Engine and Docker Compose.
On OS X: install Docker Toolbox, which includes both.
On Ubuntu:
- If you have difficulties with the apt install, try the pip install.
- To run Docker commands without sudo, your user must be a member of the docker group (check /etc/group).
Helpful commands¶
- docker-compose up -d
- docker-compose pull
- docker-compose build: add --no-cache to re-build the entire image (which you might want to do if the image isn’t building as expected).
- docker ps: add -a to also list stopped containers.
- docker-compose kill
- docker kill <container name>
- docker-compose rm -v --force
- docker rm -v <container name>
- docker rm $(docker ps -a -q) -v
- docker-compose logs
- docker logs <container name>
- docker-compose -f <docker-compose.yml filename> <command>
- docker exec -it <container name> /bin/bash
- docker rmi <image name>
- docker rmi $(docker images -q)
- docker-compose scale <service name>=<number of instances>
Scaling up with Docker¶
To create multiple instances of a service, use docker-compose scale. This can be used to create multiple instances of a harvester when the queue for that harvester is too long.
To spread containers across multiple hosts, use Docker Swarm.
Using compose in production provides some additional guidance.
Writing a harvester¶
Requirements¶
Suggestions¶
Notes¶
Messaging¶
RabbitMQ¶
RabbitMQ is used as a message broker.
The RabbitMQ management console is exposed at http://<your docker host>:15672/. The username is sfm_user. The password is the value of RABBITMQ_DEFAULT_PASS in secrets.env.
Publishers/consumers¶
The hostname for the RabbitMQ instance is mq and the port is 5672. A publisher/consumer should wait for the rabbit service to become available before connecting; see appdeps.py for docker application dependency support.
Exchange¶
sfm_exchange is a durable topic exchange to be used for all messages. All publishers/consumers must declare it:
All queues must be declared durable:
Messaging Specification¶
Introduction¶
SFM is architected as a number of components that exchange messages via a messaging queue. To implement functionality, these components send and receive messages and perform certain actions. The purpose of this document is to describe this interaction between the components (called a “flow”) and to specify the messages that they will exchange.
Note that as additional functionality is added to SFM, additional flows and messages will be added to this document.
General¶
Harvesting social media content¶
Harvesting is the process of retrieving social media content from the APIs of social media services and writing it to WARC files. It also includes extracting urls for other web resources from the social media content so that they can be harvested by a web harvester. (For example, the link for an image may be extracted from a tweet.)
Background information¶
Flow¶
The following is the flow for a harvester performing a REST harvest and creating a single warc:
The following is the message flow for a harvester performing a stream harvest and creating multiple warcs:
Messages¶
Harvest start message¶
Harvest start messages specify for a harvester the details of a harvest. Example:
Another example:
Harvesters will extract urls from the harvested social media content and publish a web resource harvest start message. This message is similar to other harvest start messages, with the differences noted below. Example:
Harvest stop message¶
Harvest stop messages tell a harvester performing a stream harvest to stop. Example:
Harvest status message¶
Harvest status messages allow a harvester to provide information on the harvests it performs. Example:
Warc created message¶
Warc created messages allow a harvester to provide information on the warcs that are created during a harvest. Example:
Exporting social media content¶
Exporting is the process of extracting social media content from WARCs and writing it to export files. The exported content may be a subset or derivative of the original content. A number of different export formats will be supported.
Background information¶
Flow¶
The following is the flow for an export:
Export start message¶
Export start messages specify the requests for an export. Example:
Another example:
Export status message¶
Export status messages allow an exporter to provide information on the exports it performs. Example: