Friday, July 08, 2011

Twitter Archiver

Twitter is great for getting and sharing information quickly. But it is all web 2.0, so you cannot use a simple cat or grep to view or search your tweets. I would like to have tweets saved in a simple text format: date, user, text. I would also like a simple program that gives me the list of followers/following of a user.

Fortunately, Twitter provides a web API, and people have written clients for it in different languages, such as Mike Verdone (@sixohsix) with his great Python Twitter Tools.

So here come twitter-archiver and twitter-follow, two Python programs added to Python Twitter Tools to archive any public timeline of tweets in a simple text format and to view the list of following/followers of a user.

Why not use twitter-log?

Python Twitter Tools already includes twitter-log to back up tweets, but it does not display them the way I want (compact, one per line) and it cannot save to a file and resume (every time, you need to retrieve all tweets again). It also does not include retweets, and I want them (include_rts=1).

Authentication?

By default, the script does not authenticate to Twitter, so you can only browse public timelines. Also note that the API rate limit is lower when you are not authenticated.

If you want to archive a protected timeline you have access to, just pass the -o (--oauth) option. The script will guide you through the OAuth process, which is only needed on the first run (tokens are saved in ~/.twitter-archiver_oauth and reused next time).

Installation

You can find twitter-archiver and twitter-follow in my fork of Python Twitter Tools on GitHub.
You can choose to install the programs as explained in the README and then use twitter-archiver, or run it directly by using:
$ python -uc 'from twitter import archiver; archiver.main()' [<args...>]
Tip: you can use the PYTHONPATH environment variable to point at the Python Twitter Tools (parent) directory.

Usage

$ twitter-archiver
USAGE
    twitter-archiver [options] <-|user> [<user> ...]

DESCRIPTION
    Archive tweets of users, sorted by date from oldest to newest, in
    the following format: <id> <date> <<screen_name>> <tweet_text>
    Date format is: YYYY-MM-DD HH:MM:SS TZ. Tweet <id> is used to
    resume archiving on next run. Archive file name is the user name.
    Provide "-" instead of users to read users from standard input.

OPTIONS
 -o --oauth            authenticate to Twitter using OAuth (default no)
 -s --save-dir <path>  directory to save archives (default: current dir)
 -a --api-rate         see current API rate limit status
 -t --timeline <file>  archive own timeline into given file name (requires
                       OAuth, max 800 statuses).

AUTHENTICATION
    Authenticate to Twitter using OAuth to archive tweets of private profiles
    and have higher API rate limits. OAuth authentication tokens are stored
    in ~/.twitter-archiver_oauth.

Examples

Give a Twitter user name to retrieve all of that user's tweets and archive them:
$ twitter-archiver stalkr_
* Archiving stalkr_ tweets in ./stalkr_
Browsing stalkr_ timeline, new tweets: 200
[...]
Browsing timeline, new tweets: 186
Total tweets for stalkr_: 982 (982 new)

It saves tweets in a format you can grep:
$ grep -i winbuilder stalkr_
89264285980696576 2011-07-08 11:27:18 CEST <stalkr_> RT @irongeek_adc: Dual booting Winbuilder/Win7PE SE and Backtrack 5 on a USB flash drive with XBOOT http://t.co/SdcxMIu
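
Because each line follows the documented format (<id> <date> <<screen_name>> <tweet_text>), it is also easy to parse from a script. Here is a minimal Python sketch under that assumption; the names LINE_RE, Tweet and parse_line are mine for illustration and are not part of Python Twitter Tools:

```python
import re
from typing import NamedTuple, Optional

# Matches "<id> <date> <<screen_name>> <tweet_text>", where the date is
# "YYYY-MM-DD HH:MM:SS TZ" as described in the usage text.
LINE_RE = re.compile(
    r"^(?P<id>\d+) "
    r"(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \S+) "
    r"<(?P<user>[^>]+)> "
    r"(?P<text>.*)$"
)


class Tweet(NamedTuple):
    """One archived tweet (hypothetical helper type)."""
    id: int
    date: str
    user: str
    text: str


def parse_line(line: str) -> Optional[Tweet]:
    """Return the parsed tweet, or None if the line does not match the format."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    return Tweet(int(m.group("id")), m.group("date"), m.group("user"), m.group("text"))
```

The date is kept as a string on purpose: time zone abbreviations such as CEST are ambiguous, so converting them is left to the caller.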

See your current unauthenticated API rate limit status:
$ twitter-archiver --api-rate
Remaining API requests: 139/150 (hourly limit)
Next reset in 3508s (Mon Jul 11 02:18:07 2011)

Run it again and it automatically resumes archiving:
$ twitter-archiver stalkr_
* Archiving stalkr_ tweets in ./stalkr_
Browsing stalkr_ timeline, new tweets: 16
Total tweets for stalkr_: 998 (16 new)
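
Resuming relies on the tweet id stored at the start of each archive line. The following Python sketch illustrates the idea only; it is not the actual twitter-archiver code, and the function name last_tweet_id is hypothetical:

```python
from typing import Iterable


def last_tweet_id(lines: Iterable[str]) -> int:
    """Return the id found on the last well-formed line, or 0 if there is none.

    Assumes each line starts with a numeric tweet id followed by a space,
    and that lines are ordered from oldest to newest.
    """
    last = 0
    for line in lines:
        head = line.split(" ", 1)[0].strip()
        if head.isdigit():
            last = int(head)
    return last
```

With an existing archive, `with open("stalkr_") as f: last_tweet_id(f)` would give the id to resume from. One way to use it (an assumption, not a description of the tool) is to send it as the since_id parameter of the timeline request, so that only newer tweets are returned.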

You cannot archive tweets of a protected timeline:
$ twitter-archiver protected_user
* Archiving protected_user tweets in ./protected_user
Fail: 401 Unauthorized (tweets of that user are protected)
Total tweets for protected_user: 0 (0 new)

Authenticate with OAuth to archive protected timelines you follow:
$ twitter-archiver --oauth protected_user
* Archiving protected_user tweets in ./protected_user
Browsing timeline, new tweets: 134
Total tweets for protected_user: 134 (134 new)

You can also see that your API rate limit is higher when authenticated:
$ twitter-archiver --oauth --api-rate
Remaining API requests: 343/350 (hourly limit)
Next reset in 752s (Mon Jul 11 01:38:22 2011)

And of course, you can archive as many users as you want:
$ twitter-archiver --oauth stalkr_ Ivanlef0u [etc.]
* Archiving stalkr_ tweets in ./stalkr_
[...]
* Archiving Ivanlef0u tweets in ./Ivanlef0u
[...]
Total: X tweets (Y new) for 2 users

The script automatically retries when a request fails, and automatically waits for the reset when the API rate limit is reached.
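
The retry-and-wait behaviour can be sketched in Python as follows. This is an illustration under assumptions, not the code of twitter-archiver: the RateLimited exception, the call_with_retry function and its default values are all hypothetical, and a real client would have to build them from the API's rate limit status.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error carrying the Unix time at which the limit resets."""

    def __init__(self, reset_at: float) -> None:
        super().__init__("rate limit reached")
        self.reset_at = reset_at


def call_with_retry(
    fetch: Callable[[], T],
    retries: int = 3,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.time,
) -> T:
    """Call fetch(), retrying on failures and waiting out rate limits."""
    failures = 0
    while True:
        try:
            return fetch()
        except RateLimited as e:
            # Not counted as a failure: wait until the reset, then try again.
            sleep(max(0.0, e.reset_at - now()))
        except OSError:
            # Network-style failure: retry a bounded number of times.
            failures += 1
            if failures > retries:
                raise
            sleep(delay)
```

The sleep and now parameters exist only so that the waiting logic can be exercised without actually sleeping.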

Simple list of following/followers

In the same spirit as the Archiver, I wanted a simple script to retrieve:
  • the list of users a particular user follows: the following page
  • the list of users that follow a particular user: the followers page
All that in a simple text format: one user name per line.

Twitter kindly provides the friends/ids and followers/ids API methods for that, but they return lists of user ids, not user names. One has to use the users/lookup API to resolve user ids into screen names, with no more than 100 user ids at a time.
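
The 100-ids-at-a-time constraint simply means splitting the id list into batches. A minimal Python sketch of that batching, where lookup is a hypothetical callable standing in for the users/lookup request (it takes a batch of ids and returns a dict mapping id to screen name):

```python
from typing import Callable, Dict, Iterator, List, Sequence


def batches(ids: Sequence[int], size: int = 100) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def resolve_names(
    ids: Sequence[int],
    lookup: Callable[[Sequence[int]], Dict[int, str]],
) -> List[str]:
    """Resolve ids to screen names, one lookup call per batch of 100."""
    names: Dict[int, str] = {}
    for chunk in batches(ids):
        names.update(lookup(chunk))
    # Keep the original order and skip ids the lookup did not return.
    return [names[i] for i in ids if i in names]
```

For 1022 followers this makes 11 lookup calls: ten batches of 100 ids and a last one of 22.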

So along with twitter-archiver also comes twitter-follow, also with OAuth support:
$ twitter-follow
USAGE
    twitter-follow [options] <user>

DESCRIPTION
    Display all following/followers of a user, one user per line.

OPTIONS
-o --oauth            authenticate to Twitter using OAuth (default no)
-r --followers        display followers of the given user (default)
-g --following        display users the given user is following
-a --api-rate         see your current API rate limit status

AUTHENTICATION
    Authenticate to Twitter using OAuth to see following/followers of private
    profiles and have higher API rate limits. OAuth authentication tokens
    are stored in the file .twitter-follow_oauth in your home directory.


Usage example:
$ twitter-follow --oauth --followers stalkr_
Browsing followers, new: 1022
Resolving user ids to screen names: 100/1022
[...]
Resolving user ids to screen names: 1022/1022
0vercl0k
[...]
Total followers for stalkr_: 1022

$ twitter-follow --oauth --following stalkr_
Browsing following, new: 314
Resolving user ids to screen names: 100/314
[...]
Resolving user ids to screen names: 314/314
Ivanlef0u
[...]
Total users stalkr_ is following: 314

Pulling pieces together

You are now able to:
  • archive all your tweets
  • archive tweets of users you follow (following)
    • you can rebuild your own timeline (something the Twitter API does not allow, because it is limited to the last 800 statuses)
  • archive tweets of users who follow you (followers)
  • do that regularly with a cron script
  • grep tweets!

As an example, here is the script I use to build my personal twitter database:
#!/bin/bash
# Your username + directory to work in
ME=stalkr_
DIR=~/twitter-db

error() { echo "Error: $*" >&2; exit 1; }
cd "$DIR" || error "unable to cd to $DIR"
S=$(date '+%s')

# Save following/followers
twitter-follow -o -g "$ME" > following.lst
twitter-follow -o -r "$ME" > followers.lst

# Archive all tweets of self + following/followers + others
mkdir -p all || error "failed mkdir all"
{
  echo "$ME"
  cat following.lst followers.lst others.lst 2>/dev/null
} | twitter-archiver -o -s "$PWD/all" -- -

# Build subsets of following/followers using symlinks
rm -rf following followers
mkdir following followers || error "failed mkdir following followers"
while read N; do
  [ -f "all/$N" ] && ln -s "../all/$N" "following/$N"
done < following.lst
while read N; do
  [ -f "all/$N" ] && ln -s "../all/$N" "followers/$N"
done < followers.lst

# Rebuild timeline by sorting all following tweets
find following -not -type d -print0 | xargs -0 cat | sort -n > timeline

# Execution time
D=$(( $(date '+%s') - S ))
[ -x "$(which duration)" ] && D=$(duration $D) || D="${D}s"
echo "Total time: $D"

# Crontab entry to update twitter-db every day at 03:25 (goes in your crontab, not in db.sh)
25 3 * * * cd /home/stalkr/twitter-db; flock -nox lock bash -c './db.sh > log 2>&1'
# Ensure you already performed OAuth authentication before:
#   twitter-archiver -o -a; twitter-follow -o -a
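
The timeline step above concatenates all archives and sorts them numerically by tweet id. Since each archive is already ordered from oldest to newest, the same result could also be obtained with a merge instead of a full sort. A Python sketch of that alternative, assuming tweet ids increase over time within each archive; the names tweet_id and merge_timelines are hypothetical:

```python
import heapq
from typing import Iterable, Iterator


def tweet_id(line: str) -> int:
    """Return the numeric tweet id at the start of an archive line."""
    return int(line.split(" ", 1)[0])


def merge_timelines(archives: Iterable[Iterable[str]]) -> Iterator[str]:
    """Lazily merge already-sorted archives into one id-ordered stream."""
    return heapq.merge(*archives, key=tweet_id)
```

With open file objects as archives, lines are read on demand, so the whole database never has to fit in memory at once.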

Data liberation for Twitter!

2 comments:

  1. Hi mate, thank you for this nice post!

    I am looking for the best way to save/archive all my tweets (mine and people I follow) to a log file continuously (using an interval) avoiding duplicates.
    When I use:
    "twitter friends -d -t -r -R 30 --format default >> twitter.log"
    Duplicate entries are included since the same tweets are logged more than once.

    I tried your db.sh script (https://github.com/StalkR/misc/blob/master/twitter/db.sh) which is mentioned above too, but when the followers/following users are many it takes hours to create the timeline since the hourly limit for requests is reached and actually the file is created each time from scratch.
    So "tail -f" doesn't help to check and maybe then grep the latest tweets.

    Any ideas/suggestions? Thanks in advance!

  2. Hi mate,

    It is slow because it has to make at least 1 request per user (to check if there are new tweets) and the request limit is low, unfortunately.

    The timeline is rebuilt from scratch, but don't worry: this is just a concatenation and sort by time of all the tweet files created by twitter-archiver, and these are not recreated every time. They resume automatically and only new tweets are saved.

    The reason it is rebuilt from scratch is that you might have new followers/following, so new tweets with an older date need to be inserted in the timeline. Similarly, you may have removed followers/following, and their tweets need to be removed from the timeline.

    If you ignore these cases, then yes, you can build an append-only timeline that you can "tail -f". The algorithm can be as simple as appending from the last known tweet id.

    Let me know if you need help doing that,
    Cheers.
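
The append-only variant described in this reply could be sketched in Python as follows. It is only an illustration, under the assumption that every line starts with a numeric tweet id; the function names are hypothetical, and it deliberately ignores the new/removed follower cases mentioned above.

```python
from typing import Dict, Iterable, List


def tweet_id(line: str) -> int:
    """Return the numeric tweet id at the start of an archive line."""
    return int(line.split(" ", 1)[0])


def lines_to_append(last_id: int, candidates: Iterable[str]) -> List[str]:
    """Return candidate lines newer than last_id, de-duplicated and id-sorted."""
    fresh: Dict[int, str] = {}
    for line in candidates:
        i = tweet_id(line)
        if i > last_id:
            fresh[i] = line
    return [fresh[i] for i in sorted(fresh)]
```

The returned lines can be written to the timeline file opened in append mode, which keeps "tail -f" usable and avoids duplicates because every id at or below the last known one is skipped.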
