Social Feed Manager (SFM)¶
Social Feed Manager is open source software for libraries, archives, cultural heritage institutions and research organizations. It empowers those communities’ researchers, faculty, students, and archivists to define and create collections of data from social media platforms. Social Feed Manager will harvest from Twitter, Tumblr, Flickr, and Sina Weibo and is extensible for other platforms. In addition to collecting data from those platforms’ APIs, it will collect linked web pages and media.
This site provides documentation for installation and usage of SFM. See the Social Feed Manager project site for full information about the project’s objectives, roadmap, and updates.
Quick Start Guide¶
This quick start guide describes how you can start using Social Feed Manager to select, harvest, explore, export, process and analyze social media data. This covers just the basics of using the software; technical information about installing and administering SFM can be found in the technical-documentation.
Prerequisites¶
SFM in operation¶
This quick start guide assumes SFM is already set up and running. For details about installing and administering SFM, see technical-documentation.
An SFM account¶
You can sign up for an account by clicking the Sign Up link from within SFM.
If you’d like to set up shared collecting at your institution, you’ll need to have your systems administrator set up groups in SFM.
API credentials¶
You will need API credentials for each of the social media platforms from which you want to collect. These are separate from the Twitter/Flickr/Weibo account that you may already have. To get API credentials:
Setting up collections¶
Hopefully you’ve considered what you want to use SFM to collect: which social media accounts, which queries/hashtags/searches/etc., and on which platform(s). You may also have learned a bit about the social media platforms’ APIs and best practices for collecting from social media APIs. Now you’d like to set up your collections in SFM.
Create a collection set¶
At the top of the page, go to Collection Sets and click the Add Collection Set button. A collection set is just a group of collections around a particular topic or theme. For example, you might set up a “2016 U.S. Elections” collection set.
Create a collection¶
On the collection set detail page, under Collections click the Add Collection button and select a type.
Collection types differ based on the social media platform and the part of the API from which the social media is to be collected. For more information, see Collection types.
The collection types supported by SFM include:
SFM allows you to create multiple collections of each type within a collection set. For example, you might create a “Democratic candidate Twitter user timelines” collection and a “Republican candidate Twitter user timelines” collection. Collections are one way of organizing harvested content.
Each collection’s harvest type has specific options, which may include:
Add seeds¶
Some harvest types require seeds, which are the specific targets for collection.
As shown in the chart below, what a seed is and the number of seeds vary by harvest type. Note that some harvest types don’t have any seeds.
Start harvesting¶
Each collection’s detail page has a Turn On button.
Once you turn on the collection, harvesting will proceed in the background according to the collection’s schedule. It will stop when it hits the end date or you turn it off.
The collection’s detail page will also show a message noting when the next harvest is scheduled.
As harvesting progresses, SFM will list the results of harvests on the collection’s detail page.
During harvesting¶
Within SFM, harvesting is performed by (you guessed it) harvesters. Harvesters make calls to the social media platforms’ APIs and record the social media data in WARC files. (WARC is a standard file format used for web archiving.)
Depending on the collection options you selected, SFM may also extract URLs from the posts; these URLs link to web resources such as images, web pages, etc. SFM passes the URLs to the web harvester, which will collect these web resources (similar to more traditional web archiving).
To monitor harvesting:
If you want to make changes to the collection’s options and/or its seeds after harvesting is started, turn off the collection and then click the Edit button.
You’ll be able to turn it back on and resume collecting afterwards.
Exploring, exporting, processing and analyzing your social media data¶
SFM provides several mechanisms for exporting collected social media data or feeding the social media data into your own processing pipelines. It also provides some basic tools for exploring and analyzing the collected content within the SFM environment.
Exports¶
To export collected social media data, click the Export button on the collection detail page. Exports are available in a number of formats, including Excel, CSV, and JSON.
The “Full JSON” format provides the posts (e.g. tweets) in their original form, whereas the other export formats provide a subset of the metadata for each social media item. For example, for a tweet, the CSV export includes the tweet’s “coordinates” value but not the “geo” value.
Dehydration (exporting a list of just the IDs of social media items) is supported for certain data-sharing purposes.
Exports are run in the background, and larger exports may take a significant amount of time. You will receive an email when the export completes, or you can monitor the status on the Exports page, where you can view details about the export. This is also where you will find a link to download the export file once it becomes available.
Processing¶
If you’ve set up a processing container, or if you’ve installed SFM tools locally, then you have access to the collected social media data from the command line. You can then feed the data into your own processing pipeline and use your own tools.
More on this topic can be found in the Processing section.
Exploration and analysis¶
While SFM does not provide a comprehensive toolset for exploring and analyzing the collected social media data, it provides some basic exploration and analysis tools and allows you to export social media data for use with your own tools.
Tools provided by SFM are:
The ELK stack is a general-purpose framework for exploring data. It provides support for loading, querying, analysis, and visualization. SFM provides an instance of ELK that has been customized for exploring social media data, in particular, Twitter and Weibo data.
ELK may be particularly useful for monitoring and adjusting the targets of ongoing social media collections. For example, it can be used to discover additional relevant Twitter hashtags or user accounts to collect, based on what has been collected so far.
ELK requires some additional setup. More on this topic can be found in the Exploring social media data with ELK section.
A processing container allows you to have access to the collected social media content from the command line. The processing container has been provisioned with a handful of analysis tools such as Twarc utils.
The following shows piping some tweets into a wordcloud generator from within a processing container:
# find_warcs.py 4f4d1 | xargs twitter_rest_warc_iter.py | python /opt/twarc/utils/wordcloud.py
More on this topic can be found in the Processing section.
Access and display¶
SFM does not currently provide a web interface to the collected social media content. However, this should be possible, and we welcome your ideas and contributions.
API Credentials¶
Accessing the APIs of social media platforms requires credentials for authentication (also known as API keys). Social Feed Manager supports managing those credentials.
Most API credentials have two parts: an application credential and a user credential. (Flickr is the exception – only an application credential is necessary.)
It is important to understand how credentials/authentication affect what API methods can be invoked and rate limits. For more information, consult the documentation for each social media platform’s API.
Managing credentials¶
SFM supports two approaches to managing credentials: adding credentials and connecting credentials. Both of these options are available from the Credentials page.
Adding credentials¶
For this approach, a user gets the application and/or user credential from the social media platform and provides them to SFM by completing a form. More information on getting credentials is below.
Connecting credentials¶
For this approach, SFM is configured with the application credentials for the social media platform. The user credentials are obtained by the user being redirected to the social media website to give permission to SFM to access her account.
SFM is configured with the application credentials in the docker-compose.yml. If additional management is necessary, it can be performed using the Social Accounts section of the Admin interface.

This is the easiest approach for users. Configuring application credentials is encouraged.
Platform specifics¶
Twitter¶
Twitter credentials can be obtained from https://apps.twitter.com/. It is recommended to change the application permissions to read-only. You must provide a callback URL, but the URL you provide doesn’t matter.
Weibo¶
For instructions on obtaining Weibo credentials, see this guide.
To use the connecting credentials approach for Weibo, the redirect URL must match the application’s actual URL and use port 80.
Flickr¶
Flickr credentials can be obtained from https://www.flickr.com/services/api/keys/.
Flickr does not require user credentials.
Tumblr¶
Tumblr credentials can be obtained from https://www.tumblr.com/oauth/apps.
Collection types¶
Each collection type connects to one of a social media platform’s APIs, or methods for retrieving data. Understanding what each collection type provides is important to ensure you collect what you need and are aware of any limitations. Reading the social media platform’s documentation provides further important details.
Twitter search¶
Queries the Twitter Search API to retrieve public tweets from a sampling of tweets from the most recent 7-9 days. This is not a comprehensive search of all tweets.
To formulate a search query, use the Twitter Advanced Search query builder. Pay careful attention to the query syntax described in Twitter’s documentation. This query is the seed for the collection; Twitter search collections only have one seed.
Due to Twitter’s rate limits and the amount of data available from the Search API, broad Twitter searches may take a long time to complete (up to multiple days). In choosing a schedule, make sure that there is enough time between searches. In some cases, you may only want to run the search once and then turn off the collection.
If the incremental option is selected, only new tweets (i.e., tweets that have not yet been harvested in this collection) will be harvested. In general, you will want to select the incremental option.
See the Collecting Web resources guidance below for deciding whether to collect media or web resources.
Twitter filter¶
Collects public tweets from the Twitter filter streaming API, matching keywords, locations, or users. The tweets are from the current time going forward. Tweets from the past are not available with this collection type. (Create a Twitter search collection with the same terms to collect tweets from the recent past).
When creating a Twitter filter, pay careful attention to the query syntax described in Twitter’s documentation. The filter query is the seed for the collection; Twitter filter collections only have one seed.
There are limits on how many tweets Twitter will supply, so filters on high-volume terms/hashtags will not return all tweets available. Thus, you will want to strategize about how broad/narrow to construct your filter. Twitter only allows you to run one filter at a time with a set of Twitter API credentials; SFM enforces this for you.
SFM captures the filter stream in 30-minute chunks and then momentarily stops. Between rate limiting and this momentary stop, you should never assume that you are getting every tweet.
Unlike other collection types, Twitter filter collections are either turned on or off; they do not operate according to a schedule.
See the Collecting Web resources guidance below for deciding whether to collect media or web resources.
Twitter user timeline¶
Collects tweets by a particular Twitter user account using Twitter’s user_timeline API. Twitter provides up to the most recent 3,200 tweets by that account, giving a limited ability to collect tweets from the past.
Each Twitter user timeline collection can have multiple seeds, where each seed is a user timeline. To identify a user timeline, you can provide a screen name or Twitter user ID (UID). If you provide one, the other will be looked up and displayed in SFM UI. The Twitter user ID is a number and does not change; a user may change her screen name. The user ID will be used for retrieving the user timeline.
While a large number of user timeline seeds is supported in a collection, they may take a long time to collect due to Twitter’s rate limits.
If the incremental option is selected, only new tweets (i.e., tweets that have not yet been harvested for that user timeline) will be harvested, meaning you will not collect duplicate tweets. If the incremental option is not selected, you will collect the most recent 3,200 tweets, meaning you will get duplicates across harvests. However, you may be able to examine differences across time in a user’s timeline, e.g., deleted tweets, or track changes in follower or retweet counts.
In choosing a schedule, you may want to consider how prolific a tweeter the Twitter user is. In general, the more frequent the tweeter, the more frequent you’ll want to schedule harvests.
See the Collecting Web resources guidance below for deciding whether to collect media or web resources.
Twitter sample¶
Collects tweets from the Twitter sample stream. The Twitter sample stream returns approximately 0.5-1% of public tweets, which is approximately 3GB a day (compressed).
See the Collecting Web resources guidance below for deciding whether to collect media or web resources.
Flickr user¶
Collects metadata about public photos by a specific Flickr user.
Each Flickr user collection can have multiple seeds, where each seed is a Flickr user. To identify a user, you can provide either a username or an NSID. If you provide one, the other will be looked up and displayed in the SFM UI. The NSID is a unique identifier and does not change; a user may change her username.
For each user, the user’s information will be collected using Flickr’s people.getInfo API and the list of her public photos will be retrieved from people.getPublicPhotos. Information on each photo will be collected with photos.getInfo.
Depending on the image sizes you select, the actual photo files will be collected as well.
If the incremental option is selected, only new photos will be collected.
Weibo timeline¶
Collects Weibos posted by the user whose credentials are provided, and by that user’s friends, using the Weibo friends_timeline API.
Note that because collection is determined by the user whose credentials are provided, there are no seeds for a Weibo timeline collection. To change what is being collected, change the user’s friends from the Weibo website or app.
See the Collecting Web resources guidance below for deciding whether to collect image or web resources.
Tumblr blog posts¶
Collects blog posts by a specified Tumblr blog.
Each Tumblr blog post collection can have multiple seeds, where each seed is a blog. The blog can be specified with or without the .tumblr.com extension.
If the incremental option is selected, only new blog posts will be collected.
See the Collecting Web resources guidance below for deciding whether to collect image or web resources.
Collecting Web resources¶
Each collection type allows you to select an option to collect web resources such as images, web pages, etc. that are included in the social media post. When a social media post includes a URL, SFM will harvest the web page at that URL. It will harvest only that web page, not any pages linked from that page.
Be very deliberate in collecting web resources. Performing a web harvest both takes longer and requires significantly more storage than collecting the original social media post.
Processing¶
Your social media data can be used in a processing/analysis pipeline. SFM provides several tools and approaches to support this.
Tools¶
Warc iterators¶
A warc iterator tool provides an iterator to the social media data contained in WARC files. When used from the command line, it writes out the social media items one at a time to standard out. (Think of this as cat-ing a line-oriented JSON file. It is also equivalent to the output of Twarc.)

Each social media type has a separate warc iterator tool. For example, twitter_rest_warc_iter.py extracts tweets recorded from the Twitter REST API.

Here is a list of the warc iterators:

- twitter_rest_warc_iter.py: Tweets recorded from the Twitter REST API.
- twitter_stream_warc_iter.py: Tweets recorded from the Twitter Streaming API.
- flickr_photo_warc_iter.py: Flickr photos
- weibo_warc_iter.py: Weibos
- tumblr_warc_iter.py: Tumblr posts

Warc iterator tools can also be used as a library.
Find Warcs¶
find_warcs.py helps put together a list of WARC files to be processed by other tools, e.g., warc iterator tools. (It gets the list of WARC files by querying the SFM API.)

Here are the arguments it accepts:

For example, to get a list of the WARC files in a particular collection, provide some part of the collection id:

(In this case there is only one WARC file. If there were more than one, they would be space separated.)

The collection id can be found from the SFM UI.

Note that if you are running find_warcs.py from outside a Docker environment, you will need to supply --api-base-url.

Approaches¶
Processing in container¶
To bootstrap processing, a processing image is provided. A container instantiated from this image runs Ubuntu 14.04 and comes pre-installed with the warc iterator tools, find_warcs.py, and some other useful tools. It will also have read-only access to the data from /sfm-data.

The other tools are:

To instantiate:

You will then be provided with a bash shell inside the container from which you can execute commands:

Setting PROCESSOR_VOLUME in .env to a host volume will link /sfm-processing to your local filesystem. You can place scripts in this directory to make them available inside the processing container or write output files to this directory to make them available outside the processing container.

Note that once you exit the processing container, the container will be automatically removed. However, if you have saved all of your scripts and output files to /sfm-processing, they will be available when you create a new processing container.

Processing locally¶
In a typical Docker configuration, the data directory will be linked into the Docker environment. This means that the data is available both inside and outside the Docker environment. Given this, processing can be performed locally (i.e., outside of Docker).
The various tools can be installed locally:
Recipes¶
Extracting URLs¶
The “Extracting URLs from #PulseNightclub for seeding web archiving” blog post provides some useful guidance on extracting URLs from tweets, including unshortening and sorting/counting.
Exporting to line-oriented JSON files¶
This recipe is for exporting social media data from WARC files to line-oriented JSON files. There will be one JSON file for each WARC. This may be useful for some processing or for loading into some analytic tools.
This recipe uses parallel for parallelizing the export.
Create a list of WARC files:
Replace 7c37157 with the first few characters of the collection id that you want to export. The collection id is available on the collection detail page in SFM UI.
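One way to produce such a list is to reshape find_warcs.py’s space-separated output into one path per line. A self-contained sketch (the echo stands in for find_warcs.py so the example runs without an SFM instance, and the paths are invented):

```shell
# find_warcs.py prints matching WARC paths separated by spaces; tr turns
# that into one path per line, which the later steps expect.
echo '/sfm-data/a.warc.gz /sfm-data/b.warc.gz' | tr ' ' '\n' > source.lst
```

In practice the pipeline would start with find_warcs.py (e.g., find_warcs.py 7c37157) instead of echo.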
Create a list of JSON destination files:
This command puts all of the JSON files in the same directory, using the filename of the WARC file with a .json file extension.
If you want to maintain the directory structure, but use a different root directory:
Replace sfm-processing/export with the root directory that you want to use.
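Both flavors of destination list can be sketched with sed. This is illustrative only: a hypothetical source.lst is created inline so the example is runnable; real entries come from the previous step.

```shell
# A made-up WARC path standing in for real find_warcs.py output
printf '%s\n' '/sfm-data/abc/2016/05/example.warc.gz' > source.lst

# Flat layout: keep only the base filename, swap .warc.gz for .json
sed 's|.*/||; s|\.warc\.gz$|.json|' source.lst > dest-flat.lst

# Same directory structure, but under a different root
sed 's|^/sfm-data|sfm-processing/export|; s|\.warc\.gz$|.json|' source.lst > dest-tree.lst
```

dest-flat.lst now holds example.json; dest-tree.lst holds sfm-processing/export/abc/2016/05/example.json.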
Perform the export:
Replace twitter_stream_warc_iter.py with the name of the warc iterator for the type of social media data that you are exporting.
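The export pairs line N of source.lst with line N of dest.lst and runs the warc iterator on each pair. Stripped of parallel and the SFM-specific tools, the pairing can be sketched like this (cat stands in for twitter_stream_warc_iter.py, and the input files are fabricated):

```shell
# Fake "WARCs" so the sketch runs anywhere
printf 'tweet-from-a\n' > a.warc
printf 'tweet-from-b\n' > b.warc
printf '%s\n' a.warc b.warc > source.lst
printf '%s\n' a.json b.json > dest.lst

# Pair the two lists line by line (assumes paths contain no whitespace)
paste source.lst dest.lst | while read -r src dst; do
  cat "$src" > "$dst"   # in practice: twitter_stream_warc_iter.py "$src" > "$dst"
done
```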
You can also filter during export using jq. For example, this only exports tweets in Spanish:
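The idea is a jq select on the tweet’s lang field, which drops everything that isn’t Spanish. A runnable sketch with two fabricated tweets (requires jq):

```shell
# Two fake tweets; only the Spanish one should survive the filter
printf '%s\n' '{"id": 1, "lang": "es"}' '{"id": 2, "lang": "en"}' > tweets.json

# Keep only tweets whose lang field is "es"; -c emits compact one-line JSON
jq -c 'select(.lang == "es")' tweets.json > spanish.json
```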
And to save space, the JSON files can be gzip compressed:
You might also want to change the file extension of the destination file to “.json.gz” by adjusting the command used to create the list of JSON destination files. To access the tweets in a gzipped JSON file, use:
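Compression and re-reading can be sketched with gzip and gunzip -c (the file here is fabricated for illustration):

```shell
# A fake line-oriented JSON export
printf '%s\n' '{"id": 1}' '{"id": 2}' > export.json

gzip export.json                    # replaces export.json with export.json.gz
gunzip -c export.json.gz | wc -l    # stream the tweets without decompressing on disk
```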
Counting posts¶
wc -l can be used to count posts. To count the number of tweets in a collection:
To count the posts from line-oriented JSON files created as described above:
wc -l gotcha: When doing a lot of counting, wc -l will output a partial total and then reset the count. The partial totals must be added together to get the grand total. For example:
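One way to avoid adding partial totals by hand is to funnel all files through a single count (files fabricated for illustration):

```shell
# Two fake line-oriented JSON files
printf '%s\n' '{"id":1}' '{"id":2}' > a.json
printf '%s\n' '{"id":3}' > b.json

# cat concatenates everything first, so wc -l emits exactly one grand total
cat a.json b.json | wc -l
```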
Using jq to process JSON¶
For tips on using jq with JSON from Twitter and other sources, see:
Exploring social media data with ELK¶
The ELK (Elasticsearch, Logstash, Kibana) stack is a general-purpose framework for exploring data. It provides support for loading, querying, analysis, and visualization.
SFM provides an instance of ELK that has been customized for exploring social media data. It currently supports data from Twitter and Weibo.
One possible use for ELK is to monitor data that is being harvested to discover new seeds to select. For example, it may reveal new hashtags or users that are relevant to a collection.
Though you can use Logstash and Elasticsearch directly, in most cases you will interact exclusively with Kibana, which is the exploration interface.
Enabling ELK¶
ELK is not available by default; it must be enabled as described here.
You can enable one or more ELK Docker containers. Each container can be configured to be loaded with all social media data or the social media data for a single collection set.
To enable an ELK Docker container, it must be added to your docker-compose.yml and then started by:

An example container is provided in example.docker-compose.yml and example.prod.docker-compose.yml. These examples also show how to limit to a single collection set by providing the collection set id.

By default, Kibana is available at http://<your hostname>:5601/app/kibana. (Also, by default Elasticsearch is available on port 9200 and Logstash is available on port 5000.)
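For orientation only, an ELK service entry might look roughly like the following. The image name is an assumption (the docs note only that GWU images are prefixed with sfm-), so copy the actual service definition from example.docker-compose.yml rather than this fragment:

```yaml
elk:
  image: gwul/sfm-elk:master   # assumed image name; verify against the example files
  ports:
    - "5601:5601"   # Kibana
    - "9200:9200"   # Elasticsearch
    - "5000:5000"   # Logstash
```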
If enabling multiple ELK containers, add multiple containers to your docker-compose.yml. Make sure to give each a unique name and map each to different ports.

Loading data¶
ELK will automatically be loaded as new social media data is harvested. (Note, however, that there will be some latency between the harvest and the data being available in Kibana.)
Since only new social media data is added, it is recommended that you enable the ELK Docker container before beginning harvesting.
If you would like to load social media data that was harvested before the ELK Docker container was enabled, use the resendwarccreatedmsgs management command:

usage: manage.py resendwarccreatedmsgs [-h] [--version] [-v {0,1,2,3}]
                                       [--settings SETTINGS]
                                       [--pythonpath PYTHONPATH] [--traceback]
                                       [--no-color]
                                       [--collection-set COLLECTION_SET]
                                       [--harvest-type HARVEST_TYPE] [--test]
                                       routing_key

The resendwarccreatedmsgs command resends warc_created messages, which will trigger the loading of data by ELK.

To use this command, you will need to know the routing key. The routing key is elk_loader_<container id>.warc_created. The container id can be found with docker ps.

The loading can be limited by collection set (--collection-set) and/or harvest type (--harvest-type). You can get collection set ids from the collection set detail page. The available harvest types are twitter_search, twitter_filter, twitter_user_timeline, twitter_sample, and weibo_timeline.

This shows loading the data limited to a collection set:

docker exec docker_sfmuiapp_1 python sfm/manage.py resendwarccreatedmsgs --collection-set b438a62cbcf74ad0adc09be3b07f039e elk_loader_26ce21fa2e43.warc_created
Overview of Kibana¶
The Kibana interface is extremely powerful. However, with that power comes complexity. The following provides an overview of some basic functions in Kibana. For some advanced usage, see the Kibana Reference or the Kibana 101: Getting Started with Visualizations video.
When you start Kibana, you probably won’t see any results.
This is because Kibana defaults to only showing data from the last 15 minutes. Use the date picker in the upper right corner to select a more appropriate time range.
Tip: At any time, you can change the date range for your query, visualization, or dashboard using the date picker.
Discover¶
The Discover tab allows you to query the social media data.
By default, all social media types are queried. To limit to a single type (e.g., tweets), click the folder icon and select the appropriate filter.
You will now only see results for that social media type.
Notice that each social media item has a number of fields.
You can search against a field. For example, to find all tweets containing the term “archiving”:
or having the hashtag #SaveTheWeb:
or mentioning @SocialFeedMgr:
Visualize¶
The Visualize tab allows you to create visualizations of the social media data.
The types of visualizations that are supported include:
- Area chart
- Data table
- Line chart
- Pie chart
- Map
- Vertical bar chart
Describing how to create visualizations is beyond the scope of this overview.
A number of visualizations have already been created for social media data. (The available visualizations are listed on the bottom of the page.)
For example, here is the Top 10 hashtags visualization:
Dashboard¶
The Dashboard tab provides a summary view of data, bringing together multiple visualizations and searches on a single page.
A number of dashboards have already been created for social media data. To select a dashboard, click the folder icon and select the appropriate dashboard.
For example, here is the top of the Twitter dashboard:
Caveats¶
Installation and configuration¶
Overview¶
The supported approach for deploying SFM is Docker containers. For more information on Docker, see Docker.
Each SFM service will provide images for the containers needed to run the service (in the form of Dockerfiles). These images will be published to Docker Hub. GWU created images will be part of the GWUL organization and be prefixed with sfm-.

sfm-docker provides the necessary docker-compose.yml files to compose the services into a complete instance of SFM.

The following will describe how to set up an instance of SFM that uses the latest release (and is suitable for a production deployment). See the development documentation for other SFM configurations.
SFM can be deployed without Docker. The various Dockerfiles should provide reasonable guidance on how to accomplish this.

Local installation¶
Installing locally requires Docker and Docker-Compose. See Installing Docker.
Either clone the sfm-docker repository and copy the example configuration files:

or just download example.prod.docker-compose.yml and example.env:

Update configuration in .env as described in Configuration.

Bring up the containers:
It is also recommended that you scale up the Twitter REST Harvester container:
Notes:
Amazon EC2 installation¶
To launch an Amazon EC2 instance running SFM, follow the normal procedure for launching an instance. In Step 3: Configure Instance Details, under Advanced Details paste the following in user details and modify as appropriate as described in Configuration:
When the instance is launched, SFM will be installed and started.
Note the following:
- To make changes to docker-compose.yml, you can ssh into the EC2 instance and edit it; docker-compose.yml and .env will be in the default user’s home directory.
Configuration¶
Configuration is documented in example.env. For a production deployment, pay particular attention to the following:
- Change the default passwords: SFM_SITE_ADMIN_PASSWORD, RABBIT_MQ_PASSWORD, POSTGRES_PASSWORD, and HERITRIX_PASSWORD.
- The DATA_VOLUME and PROCESSING_VOLUME settings. Host volumes are recommended for production because they allow access to the data from outside of Docker.
- Set SFM_HOSTNAME and SFM_PORT appropriately. These are the public hostname (e.g., sfm.gwu.edu) and port (e.g., 80) for SFM.
- The email settings SFM_SMTP_HOST, SFM_EMAIL_USER, and SFM_EMAIL_PASSWORD. (If the configured email account is hosted by Google, you will need to configure the account to “Allow less secure apps.” Currently this setting is accessed, while logged in to the Google account, via https://myaccount.google.com/security#connectedapps.)
- The API keys TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET, WEIBO_API_KEY, WEIBO_API_SECRET, and/or TUMBLR_CONSUMER_KEY, TUMBLR_CONSUMER_SECRET. These are optional, but will make acquiring credentials easier for users. For more information and alternative approaches, see API Credentials.
- SFM_SITE_ADMIN_EMAIL.
- HERITRIX_CONTACT_URL.
Note that if you make a change to configuration after SFM is brought up, you will need to restart containers. If the change only applies to a single container, you can stop the container with docker kill <container name>. If the change applies to multiple containers (or you’re not sure), you can stop all containers with docker-compose stop. Containers can then be brought back up with docker-compose up -d and the configuration change will take effect.
Authentication¶
Social Feed Manager allows users to self-sign up for accounts. Those accounts are stored and managed by SFM. Future versions of SFM will support authentication against external systems, e.g., Shibboleth.
By default, a group is created for each user and the user is placed in that group. To create additional groups and modify group membership, use the Admin interface.
In general, users and groups can be administered from the Admin interface.
The current version of SFM is not very secure. Future versions of SFM will more tightly restrict what actions users can perform and what they can view. In the meantime, it is encouraged to take other measures to secure SFM such as restricting access to the IP range of your institution.
Docker¶
This page contains information about Docker that is useful for installation, administration, and development.
Installing Docker¶
Docker Engine and Docker Compose
On OS X:
On Ubuntu: if you run into problems with the apt install, try the pip install. To run Docker commands without sudo, add your user to the docker group in /etc/group.
Helpful commands¶
- docker-compose up -d: bring up all of the containers specified in the docker-compose.yml file, in the background.
- docker-compose pull: pull the latest images for the containers specified in the docker-compose.yml file.
- docker-compose build: build the images for the containers specified in the docker-compose.yml file. Add --no-cache to re-build the entire image (which you might want to do if the image isn’t building as expected).
- docker ps: list running containers. Add -a to also list stopped containers.
- docker-compose kill: stop all containers.
- docker kill <container name>: stop a single container.
- docker-compose rm -v --force: delete the containers and volumes.
- docker rm -v <container name>: delete a single container and its volume.
- docker rm $(docker ps -a -q) -v: delete all containers.
- docker-compose logs: list the logs from all containers. Add -f to follow the logs.
- docker logs <container name>: list the logs from a single container. Add -f to follow the logs.
- docker-compose -f <docker-compose.yml filename> <command>: run a docker-compose command against a docker-compose file with a non-default name.
- docker exec -it <container name> /bin/bash: open a shell inside a running container.
- docker rmi <image name>: delete an image.
- docker rmi $(docker images -q): delete all images.
- docker-compose scale <service name>=<number of instances>: create multiple instances of a service.
Scaling up with Docker¶
Most harvesters and exporters handle one request at a time; requests for exports and harvests queue up waiting to be handled. If requests are taking too long to be processed you can scale up (i.e., create additional instances of) the appropriate harvester or exporter.
To create multiple instances of a service, use docker-compose scale.
The harvester most likely to need scaling is the Twitter REST harvester since some harvests (e.g., broad Twitter searches) may take a long time. To scale up the Twitter REST harvester to 3 instances use:
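A sketch of the scale-up, assuming the service is named twitterrestharvester in your docker-compose.yml (the name is an assumption; check your compose file):

```shell
# Run three instances of the Twitter REST harvester service
docker-compose scale twitterrestharvester=3
```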
To spread containers across multiple hosts, use Docker Swarm.
Using compose in production provides some additional guidance.
Limitations and Known Issues¶
To make sure you have the best possible experience with SFM, you should be aware of the limitations and known issues:
- docker-compose scale command (Ticket 408)
We are planning to address these in future releases. In the meantime, there are work-arounds for many of these issues. For a complete list of tickets, see https://github.com/gwu-libraries/sfm-ui/issues.
In addition, you should be aware of the following:
Troubleshooting¶
General tips¶
- Check that all containers are up with docker ps.
- Check the logs with docker-compose logs and docker logs <container name>.
- Check your configuration in .env.
Specific problems¶
Bind error¶
If when bringing up the containers you receive something like:
it means another application is already using a port configured for SFM. Either shut down the other application or choose a different port for SFM. (Chances are the other application is Apache.)
Bad Request (400)¶
If you receive a Bad Request (400) when trying to access SFM, your SFM_HOSTNAME environment variable is not configured correctly. For more information, see ALLOWED_HOSTS.
Social Network Login Failure for Twitter¶
If you receive a Social Network Login Failure when trying to connect a Twitter account, make sure that the Twitter app from which you got the Twitter credentials is configured with a callback URL. The URL you provide doesn’t matter.
Docker problems¶
If you are having problems bringing up the Docker containers (e.g., driver failed programming external connectivity on endpoint), restart the Docker service. On Ubuntu, this can be done with sudo service docker restart.
Still stuck?¶
Contact the SFM team. We’re happy to help.
Development¶
Setting up a development environment¶
SFM is composed of a number of components. Development can be performed on each of the components separately.
For SFM development, it is recommended to run components within a Docker environment (instead of directly in your OS, without Docker).
Step 1: Install Docker and Docker Compose¶
See Installing Docker.
Step 2: Clone sfm-docker and create copies of docker-compose files¶
For example:
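A sketch, assuming the gwu-libraries/sfm-docker repository on GitHub and the example.docker-compose.yml development compose file shipped with sfm-docker:

```shell
git clone https://github.com/gwu-libraries/sfm-docker.git
cd sfm-docker
# Copy the development example files to their expected names
cp example.docker-compose.yml docker-compose.yml
cp example.env .env
```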
For the purposes of development, you can make changes to docker-compose.yml and .env. This will be described more below.
Step 3: Clone the component repos¶
For example:
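For example, to clone the SFM UI component (repository name assumed to follow the gwu-libraries naming convention):

```shell
# Clone into a sibling directory of sfm-docker
git clone https://github.com/gwu-libraries/sfm-ui.git
```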
Repeat for each of the components that you will be working on. Each of these should be in a sibling directory of sfm-docker.
Running SFM for development¶
To bring up an instance of SFM for development, change to the sfm-docker directory and execute:
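This is the standard compose bring-up, run from the sfm-docker directory:

```shell
docker-compose up -d
```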
You may not want to run all of the containers. To omit a container, simply comment it out in docker-compose.yml.
By default, the code that has been committed to master for each of the containers will be executed. To execute your local code (i.e., the code you are editing), you will want to link in your local code. To link in the local code for a container, uncomment the volume definition that points to your local code. For example:
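A sketch of such a volume definition, assuming SFM UI code cloned into a sibling sfm-ui directory and an in-container path of /opt/sfm-ui (both paths are assumptions; the actual docker-compose.yml will have the correct ones commented out):

```yaml
ui:
  image: gwul/sfm-ui:master
  volumes:
    # Mount your local checkout over the code baked into the image
    - "../sfm-ui:/opt/sfm-ui"
```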
sfm-utils and warcprox are dependencies of many components. By default, the code that has been committed to master for sfm-utils or warcprox will be used for a component. To use your local code as a dependency, you will want to link in your local code. Assuming that you have cloned sfm-utils and warcprox, to link in the local code as a dependency for a container, change SFM_REQS in .env to “dev” and comment the volume definition that points to your local code. For example:
Note: As a Django application, SFM UI will automatically detect code changes and reload. Other components must be killed and brought back up to reflect code changes.
Running tests¶
Unit tests¶
Some components require a test_config.py file that contains credentials. For example, sfm-twitter-harvester requires a test_config.py containing Twitter credentials.
Note that if this file is not present, unit tests that require it will be skipped. Each component’s README will describe the test_config.py requirements.
Unit tests for most components can be run with:
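Unit tests for the components use Python’s standard test discovery; a likely invocation, run from the component’s directory:

```shell
python -m unittest discover
```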
The notable exception is SFM UI, which can be run with:
Integration tests¶
Many components have integration tests, which are run inside docker containers. These components have a ci.docker-compose.yml file which can be used to bring up a minimal environment for running the tests.
As described above, some components require a test_config.py file.
To run integration tests, bring up SFM:
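Using the component’s ci.docker-compose.yml mentioned above:

```shell
docker-compose -f ci.docker-compose.yml up -d
```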
Run the tests:
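A sketch, using docker exec to run the tests inside the running container (the unittest discovery invocation is an assumption; substitute the real container name):

```shell
docker exec -it <container name> python -m unittest discover
```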
You will need to substitute the correct name of the container. (docker ps will list the containers.)
And then clean up:
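Cleanup is the usual compose teardown against the same file:

```shell
# Stop the containers, then delete them and their volumes
docker-compose -f ci.docker-compose.yml kill
docker-compose -f ci.docker-compose.yml rm -v --force
```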
For reference, see each component’s .travis.yml file, which shows the steps of running the integration tests.
Smoke tests¶
sfm-docker contains some smoke tests which will verify that SFM is running correctly.
To run the smoke tests, first bring up SFM:
and then run the tests:
Note that the smoke tests are not yet complete.
For reference, the continuous integration deploy instructions show the steps of running the smoke tests.
Requirements files¶
This will vary depending on whether a project has warcprox and sfm-utils as dependencies, but in general:
- requirements/common.txt contains dependencies, except warcprox and sfm-utils.
- requirements/release.txt references the last released versions of warcprox and sfm-utils.
- requirements/master.txt references the master versions of warcprox and sfm-utils.
- requirements/dev.txt references local versions of warcprox and sfm-utils in development mode.
To get a complete set of dependencies, you will need common.txt and either release.txt, master.txt, or dev.txt. For example:
Development tips¶
Admin user accounts¶
Each component should automatically create any necessary admin accounts (e.g., a django admin for SFM UI). Check .env for the usernames/passwords for those accounts.
RabbitMQ management console¶
The RabbitMQ management console can be used to monitor the exchange of messages. In particular, use it to monitor the messages that a component sends, create a new queue, bind that queue to sfm_exchange using an appropriate routing key, and then retrieve messages from the queue.
The RabbitMQ management console can also be used to send messages to the exchange so that they can be consumed by a component. (The exchange used by SFM is named sfm_exchange.)
For more information on the RabbitMQ management console, see RabbitMQ.
Blocked ports¶
When running on a remote VM, some ports (e.g., 15672 used by the RabbitMQ management console) may be blocked. SSH port forwarding can help make those ports available.
Django logs¶
Django logs for SFM UI are written to the Apache logs. In the docker environment, the level of various loggers can be set from environment variables. For example, setting SFM_APSCHEDULER_LOG to DEBUG in the docker-compose.yml will turn on debug logging for the apscheduler logger. The logger for the SFM UI application is called ui and is controlled by the SFM_UI_LOG environment variable.
Apache logs¶
In the SFM UI container, Apache logs are sent to stdout/stderr which means they can be viewed with docker-compose logs or docker logs <container name or id>.
Initial data¶
The development and master docker images for SFM UI contain some initial data. This includes a user (“testuser”, with password “password”). For the latest initial data, see fixtures.json. For more information on fixtures, see the Django docs.
Runserver¶
There are two flavors of the development docker image for SFM UI. gwul/sfm-ui:master runs SFM UI with Apache, just as it will in production. gwul/sfm-ui:master-runserver runs SFM UI with runserver, which dynamically reloads changed Python code. To switch between them, change UI_TAG in .env.
Note that as a byproduct of how runserver dynamically reloads Python code, there are actually 2 instances of the application running. This may produce some odd results, like 2 schedulers running. This will not occur with Apache.
Job schedule intervals¶
To assist with testing and development, a 5 minute interval can be added by setting SFM_FIVE_MINUTE_SCHEDULE to True in the docker-compose.yml.
Connecting to the database¶
To connect to postgres using psql:
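A sketch, assuming the postgres container is named sfm_db_1 and the default postgres user (both are assumptions; docker ps will show the actual container name):

```shell
docker exec -it sfm_db_1 psql -U postgres
```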
You will be prompted for the password, which you can find in .env.
Docker tips¶
Building vs. pulling¶
Containers are created from images. Images are either built locally or pre-built and pulled from Docker Hub. In both cases, images are created based on the docker build (i.e., the Dockerfile and other files in the same directory as the Dockerfile).
In a docker-compose.yml, pulled images will be identified by the image field, e.g., image: gwul/sfm-ui:master. Built images will be identified by the build field, e.g., build: app-dev.
In general, you will want to use pulled images. These are automatically built when changes are made to the Github repos. You should periodically execute docker-compose pull to make sure you have the latest images.
You may want to build your own image if your development requires a change to the docker build (e.g., you modify fixtures.json).
Killing, removing, and building in development¶
Killing a container will cause the process in the container to be stopped. Running the container again will cause process to be re-started. Generally, you will kill and run a development container to get the process to be run with changes you’ve made to the code.
Removing a container will delete all of the container’s data. During development, you will remove a container to make sure you are working with a clean container.
Building a container creates a new image based on the Dockerfile. For a development image, you only need to build when making changes to the docker build.
Writing a harvester¶
Requirements¶
Suggestions¶
Notes¶
Messaging¶
RabbitMQ¶
RabbitMQ is used as a message broker.
The RabbitMQ management console is exposed at http://<your docker host>:15672/. The username is sfm_user. The password is the value of RABBITMQ_DEFAULT_PASS in secrets.env.
Publishers/consumers¶
- The hostname for RabbitMQ is mq and the port is 5672.
- Since the RabbitMQ container may take some time to start, consumers should wait for rabbit to be available. See appdeps.py for docker application dependency support.
Exchange¶
sfm_exchange is a durable topic exchange to be used for all messages. All publishers/consumers must declare it:
Queues¶
All queues must be declared durable:
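The declarations can also be illustrated with the rabbitmqadmin CLI (the queue name sfm_ui is a hypothetical example; each component declares its own queues in code):

```shell
# Declare the shared durable topic exchange
rabbitmqadmin declare exchange name=sfm_exchange type=topic durable=true
# Declare a durable queue (name is illustrative only)
rabbitmqadmin declare queue name=sfm_ui durable=true
```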
Messaging Specification¶
Introduction¶
SFM is architected as a number of components that exchange messages via a messaging queue. To implement functionality, these components send and receive messages and perform certain actions. The purpose of this document is to describe this interaction between the components (called a “flow”) and to specify the messages that they will exchange.
Note that as additional functionality is added to SFM, additional flows and messages will be added to this document.
General¶
Harvesting social media content¶
Harvesting is the process of retrieving social media content from the APIs of social media services and writing it to WARC files. It also includes extracting URLs for other web resources from the social media content so that they can be harvested by a web harvester. (For example, the link for an image may be extracted from a tweet.)
Background information¶
Flow¶
The following is the flow for a harvester performing a REST harvest and creating a single warc:
The following is the message flow for a harvester performing a stream harvest and creating multiple warcs:
Messages¶
Harvest start message¶
Harvest start messages specify for a harvester the details of a harvest. Example:
Another example:
Web resource harvest start message¶
Harvesters will extract urls from the harvested social media content and publish a web resource harvest start message. This message is similar to other harvest start messages, with the differences noted below. Example:
Harvest stop message¶
Harvest stop messages tell a harvester performing a stream harvest to stop. Example:
Harvest status message¶
Harvest status messages allow a harvester to provide information on the harvests it performs. Example:
Warc created message¶
Warc created messages allow a harvester to provide information on the WARCs created during a harvest. Example:
Exporting social media content¶
Exporting is the process of extracting social media content from WARCs and writing it to export files. The exported content may be a subset or derivative of the original content. A number of different export formats will be supported.
Background information¶
Flow¶
The following is the flow for an export:
Export start message¶
Export start messages specify the requests for an export. Example:
Another example:
Export status message¶
Export status messages allow an exporter to provide information on the exports it performs. Example: