How to give up Twitter (but get all the data out first)

Why I want to give up Twitter

I have been maintaining a Twitter account that will turn 10 years old this year (2017). It contains over 93,500 tweets as of today. Yes, ninety-three freaking thousand. That’s roughly 9,300 tweets per year, or an average of about 25 tweets every single day. Assuming my average tweet runs about 100 characters, that adds up to roughly 9 megabytes of plain text, the equivalent of about two 1,000-page books. So much for a bit of statistics.

Although I still use it as a kind of notebook to quickly jot down ideas and thoughts, or to capture something that grabs my attention, I have become increasingly unhappy with the way Twitter has developed, and I have wanted to give up using it altogether for many years.

That has not happened (yet), despite quite a few earnest attempts to quit.

This article gives some insight into my reasons for wanting to quit, and into the (possibly long and winding) road to reach that goal. (If it sounds like I’m talking about some kind of addiction, that’s probably not far from the truth.)

Part of the reason is that Twitter is still the most convenient way to do the things I am using it for. But that is just laziness on my part; I’m still using it simply because it’s there.

I have even stopped following people on Twitter because I either get too upset by what I read, or too lost in it, which causes me to spend way too much time there, dramatically cutting into my productivity and creative output. Of course, cutting off the conversation on one side does not exactly help keep it up, but there are other reasons.

There is no longer any dialogue on Twitter; it’s all broadcasting

The social aspect of Twitter (exchanging and discussing ideas, making connections) has been working less and less well for me, and this development began long before I started unfollowing everyone (which was a reaction to this development rather than its cause; if it were the cause, I’d be a fool to complain about it).

Twitter used to be a place where I could engage with others about stuff that I was interested in, and it was great fun for many years, but that is no more.

Most people who are still active on Twitter (and who regularly post high-quality and original content) use the service as a one-way broadcast channel. Unless you have some kind of active following or social network outside of Twitter, you will essentially be holding a monologue (which is what I’ve been doing for the last one or two years).

If that is so, I don’t need Twitter; I might as well publish my notes to any other platform. Ironically, I have been having much more engaging and interesting discussions on Facebook lately. (Why ironically? Because when I was still using Twitter very actively, Facebook was a place where people sent each other Farmville requests.)

Single-theme accounts probably still work

There are situations where Twitter still works well as an engagement platform. If you limit the content of your tweets to a very restricted set of themes or topics, you are still likely to find and build your audience, and there will be a fruitful exchange of ideas.

But I personally never wanted to tweet monothematically. On the contrary, I enjoy hopping wildly between the most diverse areas of interest to me, and I think this only works on Twitter if, again, you already have a following outside of Twitter, such as when you are a celebrity or have made a name for yourself in other ways, because then people tend to be more interested in the person than in what the person says.

I’m no celebrity, and with every new tweet, it seems I am alienating a good part of the people who still follow me. I believe most of my followers are interested in either X or Y or Z (but not in all of them), and since only about every 10th or 20th tweet of mine touches any one specific, recurring topic, my Twitter updates are not interesting to most of my followers most of the time, so eventually they leave, and I don’t blame them at all.

Content sharing in a post-Snowden world

There is another issue that not only pertains to Twitter. It’s the fact that when using a content sharing platform, I’m essentially supplying the companies running the service with loads of free data for them to perform mass-scale data-mining on. I understand full well that this is their business model, and it’s the price I’m paying while I’m using their service at no monetary cost.

But in a post-Snowden world where we know all this data could one day be used against us, no matter how well these companies protect this data (or claim to do so), I feel increasingly uncomfortable with posting any content to any service that is not fully under my own control.

(Yes, I’m aware I’m publishing this article on wordpress.com, which is just that very kind of service. Until about five years ago, I used to be all post-privacy. I used to have literally hundreds of online accounts, and I uploaded tons of data everywhere, and I liked it. Today, I think I may have made a mistake, and it will take a lot of time to undo all of this — if it can be undone, because once the data is out there, it’s out.)

So — Twitter is all but dead to me, and I seriously want to leave. What’s holding me up?


I want to keep my data — how do I do it?

To be able to move on to something else, I will have to make it inconvenient for myself to keep using Twitter, and the only way I believe I will achieve this is by deleting my account altogether. This is where the other reason comes in: there is too much data in this account that is important to me, and I couldn’t stand losing it all. If I’m going to delete my account, I want to save my data first: all of it.

Downloading your tweet archive

It’s pretty straightforward to get all tweets out of an account. You can download a tweet archive that contains everything you have ever posted; it’s an official Twitter feature. You request the archive download, and it usually takes a day or two until you get a download link. The download is a zipped archive that contains a small web application shell for viewing the archive, while the archived tweet data itself sits in a directory full of JSON files. That data is already very cleanly structured and complete; there is nothing left to ask for.
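
For illustration, here is a minimal PHP sketch of how you could pull all tweet IDs and texts out of such an archive. It assumes the archive layout current at the time of writing, with monthly data files under data/js/tweets/ whose JavaScript assignment prefix has to be stripped before the payload parses as JSON; adjust the paths if your archive differs.

```php
<?php
// Minimal sketch: collect tweet IDs and texts from a downloaded
// archive. Assumption: monthly data files under data/js/tweets/,
// each starting with a JavaScript assignment that must be removed
// before the remainder parses as JSON.

$tweets = [];

foreach (glob('archive/data/js/tweets/*.js') as $file) {
    $raw = file_get_contents($file);
    // Drop everything before the first '[' (the assignment prefix).
    $json = substr($raw, strpos($raw, '['));
    foreach (json_decode($json, true) as $tweet) {
        $tweets[$tweet['id_str']] = $tweet['text'];
    }
}

printf("Collected %d tweets.\n", count($tweets));
```

The IDs collected here are exactly what the scraping approaches described below need as input.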

So where’s the problem? The archive only contains my own tweets. It does not include any of the interactions I have had with other people on Twitter. It does not contain fav or retweet counts, nor the IDs or screen names of the people who faved or RT’d my tweets. It does not contain replies or threaded conversations, which naturally involve other people’s tweets, that is, content that is not my own. Without this interaction data, however, a Twitter archive is only half of the story. I could download my tweet archive and delete my account, and all of that other half would be lost. Many people would be happy with that; I am not.

Using Favstar to get Fav and RT data

For a couple of years, I have been attacking this problem from different angles. For favs and RTs, one solution is Favstar. This non-free service keeps track of all favs and RTs for an account (including lists of who fav’d or RT’d), and it looks like it would be easy to get that data out, either using their API, or by scraping their web content. I’ve not explicitly attempted this, but it appears feasible. (The only reason I keep renewing my Favstar subscription is that I’m hoping I’ll get easy access to this data to accomplish my larger goal.)

Using web scraping to get conversations

For replies and conversations, the only viable solution I have come up with is to scrape Twitter’s public web content. The advantage is that it’s actually public, that is, you don’t even need to be logged in to access this data. All you need is a Twitter account’s user ID or screen name, and a list of the tweet IDs for which you want to get the associated conversations (be it just a single reply or a long thread of back-and-forth replies). The tweet IDs are easily obtained from your tweet archive, as each tweet is identified by its ID (which is unique across all of Twitter). It would not be too challenging to hack together a script that downloads the HTML for each tweet ID and then performs a bit of nested regular expression processing to extract the parts that I want.
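
To make this concrete, here is a bare-bones sketch of such a script. The permalink URL scheme is Twitter’s public one; the extraction pattern, on the other hand, is a placeholder that you would have to adapt to the markup actually being served (which changes whenever Twitter redesigns its pages).

```php
<?php
// Sketch: fetch the public permalink page for one tweet and pull the
// conversation texts out of it. The regex below is a placeholder;
// inspect the real markup first and adapt the class names.

$screenName = 'myaccount';           // your screen name
$tweetId    = '123456789012345678';  // one ID from the archive

$html = file_get_contents("https://twitter.com/{$screenName}/status/{$tweetId}");

// Hypothetical pattern: matches paragraphs carrying a tweet-text class.
preg_match_all('/<p[^>]*class="[^"]*tweet-text[^"]*"[^>]*>(.*?)<\/p>/s', $html, $m);

foreach ($m[1] as $tweetHtml) {
    echo strip_tags($tweetHtml), "\n\n";
}
```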

(Yes, I know that in computer science theory it’s not possible to actually parse HTML using regular expressions, but that only applies to arbitrary HTML, not to HTML whose nesting structure is already known and constant. I’ve written lots of concise and easy-to-understand code that correctly and efficiently extracts data from HTML and XML using nothing but regular expressions. So, academics who keep repeating the old HTML-and-regex advice: please go back to your enterprise applications with your 2-megabyte XML parsing library behemoths if it makes you happy and gives you that nice smug feeling of the righteous who know best.)

Flying under the radar

With web scraping, I’d be careful to fly under the radar. I’d expect any service such as Twitter to have measures in place that detect large numbers of automated requests (as would be the case if I tried to download 90,000 HTML pages all at once). To avoid detection, it’s best to design your requests in such a way that they are hard to distinguish from regular access to the public web content. This isn’t hard to do, either. Usually, it appears sufficient to simulate the request headers that your web browser would transmit: in particular, the user agent and, where applicable, a refer(r)er URI. (To see a full set of headers, take a look at the network pane of your browser’s developer tools when you submit a particular request.)
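
As a sketch, a browser-like request with cURL could look as follows; the header values are examples copied from a browser session, and you should substitute whatever your own developer tools show.

```php
<?php
// Sketch: a cURL request dressed up with the headers a real browser
// would send. Copy current values from your browser's network pane.

function fetch_like_browser(string $url, string $referer = ''): string {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:54.0) Gecko/20100101 Firefox/54.0',
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
        ],
    ]);
    if ($referer !== '') {
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }
    $body = curl_exec($ch);
    curl_close($ch);
    return (string) $body;
}
```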

To make automated requests appear non-automated, it is probably a good idea to randomise them across time. To achieve this, I simply space them out using a call to PHP’s sleep() or usleep() (or the equivalent in your language of choice), with pauses random enough between requests that it would be non-trivial to detect a pattern on the server side. (I use the cURL library for scraping. In PHP, you could simply use file_get_contents() with a URL, and you can set the user agent for URL downloads via a configuration option (an ini setting), but cURL is the easiest choice if you need to go beyond that.)
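
Putting the two together, the download loop might look something like this. The pause bounds are arbitrary (tune them to your patience), and the user agent is set via a stream context here, which amounts to the same thing as the ini setting mentioned above.

```php
<?php
// Sketch: work through a list of tweet IDs with randomised pauses so
// the requests don't form an obvious pattern on the server side.

$context = stream_context_create([
    'http' => ['user_agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:54.0) Gecko/20100101 Firefox/54.0'],
]);

$tweetIds = ['123456789012345678', '123456789012345679']; // from the archive

foreach ($tweetIds as $id) {
    $html = file_get_contents("https://twitter.com/myaccount/status/{$id}", false, $context);
    file_put_contents("html/{$id}.html", $html);

    // Pause between roughly 2 and 15 seconds before the next request.
    usleep(mt_rand(2000000, 15000000));
}
```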

Using Twitter’s API — or probably not

What about Twitter’s public API? Isn’t it the first place I should have looked? Well, after studying it for a while, I came to the conclusion that it’s not particularly helpful for what I want to achieve. First of all, it uses an API key/token access mechanism involving their OAuth system, something I couldn’t be bothered to deal with, despite the existence of various third-party libraries that handle all of the intricacies. (However, you can manually create API access tokens using their API explorer. These tokens expire after a while, so you’d have to keep renewing them.)

Still, once you have API access, you’re quite limited in what you can do. The API (understandably) limits the number of requests you are allowed to make within a certain timeframe. That by itself is less of a problem, since you could just spread your requests out over time, but with 90,000 tweets, we are talking about a long time (days, weeks, possibly months).

The more serious restriction is that the API will by default only grant you access to the latest 3,200 tweets; at least this was the case when I last checked the documentation (years ago). Using the since_id and max_id parameters, you can tell the API the range of tweet IDs you want data for, and it appears that this would allow you to go beyond the latest-3,200-tweets limit, but I have not tested this sufficiently. All in all, using the API did not seem to be an efficient approach at all, even though it gives you the data in a very well-structured format, so I mostly gave up trying.
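
For completeness, here is roughly what paging backwards through statuses/user_timeline with max_id would look like. This assumes you already have a valid access token (see above); expect the API to stop returning results once you hit its history limit, no matter how far back you ask.

```php
<?php
// Sketch: page backwards through statuses/user_timeline (REST API 1.1)
// using max_id. Requires 64-bit PHP, since tweet IDs exceed 32 bits.

$token = 'YOUR-ACCESS-TOKEN'; // placeholder; see the API explorer note above
$maxId = null;
$all   = [];

do {
    $url = 'https://api.twitter.com/1.1/statuses/user_timeline.json'
         . '?screen_name=myaccount&count=200&trim_user=true'
         . ($maxId !== null ? '&max_id=' . $maxId : '');

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => ["Authorization: Bearer {$token}"],
    ]);
    $batch = json_decode(curl_exec($ch), true);
    curl_close($ch);

    if (!is_array($batch) || isset($batch['errors'])) {
        break; // rate-limited, expired token, or end of accessible history
    }

    foreach ($batch as $tweet) {
        $all[] = $tweet;
        // Batches are newest-first; ask for everything older next time.
        $maxId = $tweet['id'] - 1;
    }

    sleep(mt_rand(5, 15)); // stay well clear of the rate limit
} while (!empty($batch));

printf("Fetched %d tweets.\n", count($all));
```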

Using Twitter’s non-public (?) internal (?) REST interface

Only quite recently did I discover that there is a more direct way to access Twitter’s content than full-page scraping. When you scroll down someone’s timeline in the web UI, additional timeline items are loaded by the web application via background requests to a number of REST-like URLs.

Like the regular HTML content, and unlike Twitter’s API, these URLs can be accessed publicly and do not require a login session. They return snippets of JSON that contain up to 20 items (tweets, etc.) at once. For some reason, the items are not provided as structured data, but as pre-rendered HTML, so you’d still need to distill that HTML back into clean data.
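
As a rough sketch, a direct request to one of these URLs might look like the following. The URL pattern and the JSON keys used here are assumptions copied from watching the network pane while scrolling a profile; none of this is a documented interface, and it may change or vanish at any time.

```php
<?php
// Sketch: call one of the background timeline URLs directly and
// unwrap the JSON envelope. URL pattern and keys are assumptions
// taken from the browser's network pane, not a documented API.

$screenName = 'myaccount';
$url = "https://twitter.com/i/profiles/show/{$screenName}/timeline/tweets"
     . '?include_available_features=1&include_entities=1';

$ch = curl_init($url);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12) AppleWebKit/537.36',
]);
$data = json_decode(curl_exec($ch), true);
curl_close($ch);

// The tweets come back as pre-rendered HTML that still needs to be
// distilled into clean data (see the scraping section above).
file_put_contents('items.html', $data['items_html']);

// 'min_position' is the paging token: pass it as max_position on the
// next request to fetch the next (older) batch of up to 20 items,
// for as long as 'has_more_items' stays true.
echo $data['min_position'], "\n";
```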

The flying-under-the-radar advice applies here, too, because I assume that these URLs are not meant to be accessed from outside of Twitter’s web UI. However, and somewhat to my surprise, I was able to call these URLs directly via cURL (even without setting a referer header, if I recall correctly) and got the expected data back.

I have some working PHP code that implements this for media timelines, but it’s still too hackish for publication — the screenshot I used as the article image shows a small part of this hack (it contains a stupid bug in the last line; do you see it?).

All in all, it looks like this is the most promising approach, and it’s the one I’m most likely to pursue further.

Using 3rd-party »backup« services

There are numerous third-party web services that promise to back up all of your Twitter data. I have looked at a number of them, and none actually keeps this promise. They simply store any tweets you post from the time you register with them, but they don’t access older tweets at all, and none of them accesses anything other than the tweets themselves, so there is exactly zero benefit to using these services over Twitter’s own archive feature. As such, I did not venture any deeper into this territory.

That’s all I have, so far

If you came to this post in the hope of finding a recipe or even a ready-to-use tool, I am sorry to disappoint you. This is still very much a work in progress. However, in case you’re trying to do the same thing I’ve described here, why don’t we join our efforts? In any case, I’d appreciate your feedback, and thanks for reading.

(By the way, I’m trying to do the same kind of thing for my Facebook account as well; and for my Flickr account, and… you know what I’m getting at. Any ideas or suggestions are highly welcome.)
