A Twitter Storm is Approaching

It seems that Twitter is going to open-source a very powerful data processing framework that is superficially similar to Hadoop.

The article from Twitter states:

“A Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run “MapReduce jobs”, on Storm you run “topologies”.”

Go check out the article right now.

Scraping and Analyzing Social Data

[Update: the tool is now featured on First Monday! http://www.firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3937]

I’m currently involved in a National Science Council (Taiwan) funded project that investigates public communication in times of crisis.

I’m in charge of building a tool that helps non-technical users mine and analyze data related to disasters, such as the recent Japanese earthquake, on new social media outlets such as Twitter, Facebook and blogs.

I’ve given it the nickname “Scraper”, because it scrapes data from social media. Not the most creative name, but it delivers. =)

I do have some experience building similar tools, but on a smaller scale, so it’s certainly a challenge to be involved in a project like this.

Here are some of my thoughts about this project:

Here’s some of my thoughts about this project:

Architecture

The tool should contain the following components:

  1. A web-based GUI, so that any user can use it
  2. A visualization layer, so that non-technical users can make sense of the data
  3. A mining layer, so that users can keep track of the topics they are interested in
  4. A computation layer, so that users can perform various analyses, such as searching for keyword frequency over a certain period of time

Potential Challenges

Computation Layer

The computation layer will need to perform large calculations in very short periods of time. I’m currently tracking data about the Japanese earthquake (using Twitter’s search API) and have a little over 300,000 tweets. Calculating the most frequent keywords and users on a quad-core personal computer with 4GB of RAM can take more than 40 seconds.

So the solution will most probably make use of Hadoop or some other distributed computing platform to speed up the process.
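To make the idea concrete, here is a rough sketch of how the keyword and user counts could be split into a map step and a reduce step, which is the shape a Hadoop-style job would take. It runs on a single machine and is not the tool’s actual code; the sample tweets, the field names `user` and `text`, and the helper names are all assumptions made for illustration.

```python
from collections import Counter
import re

# Hypothetical sample records; the real tool would read stored tweets.
TWEETS = [
    {"user": "alice", "text": "Earthquake hits Japan"},
    {"user": "bob", "text": "Praying for Japan after the earthquake"},
    {"user": "alice", "text": "Nuclear plant update from Japan"},
    {"user": "carol", "text": "earthquake death toll rising"},
]

WORD_RE = re.compile(r"[a-z']+")


def map_chunk(chunk):
    """Map step: count keywords and users within one chunk of tweets."""
    keywords, users = Counter(), Counter()
    for tweet in chunk:
        keywords.update(WORD_RE.findall(tweet["text"].lower()))
        users[tweet["user"]] += 1
    return keywords, users


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counters into overall totals."""
    total_keywords, total_users = Counter(), Counter()
    for keywords, users in partials:
        total_keywords.update(keywords)
        total_users.update(users)
    return total_keywords, total_users


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Each chunk could be handled by a separate worker; here they run in sequence.
keywords, users = reduce_counts(map_chunk(c) for c in chunked(TWEETS, 2))
print(keywords.most_common(3))
print(users.most_common(1))
```

Because each chunk is counted independently and the partial counts are simply added together, the map step is the part that could be spread across several machines.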

Visualization Layer

This is fairly subjective. It depends on what kinds of analysis end users would want to perform. In my case, I foresee that users would want to perform time-series analysis.

For example, based on my current data, I noticed that different keywords “bubble up” on different days. On the first day or so of the Japanese earthquake, keywords such as “earthquake” and “death” appear.

Then, by the 14th of March, approximately 3 to 4 days after the earthquake, keywords like “nuclear” begin to bubble up.

So a time series analysis of the data will show different topics which are of concern to users at different points in time of the event.
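A minimal sketch of this kind of time-series count, assuming each tweet is stored as a timestamp string plus its text. The sample tweets, their dates, and the function name `keyword_counts_by_day` are made up for illustration and are not taken from the real data set.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical records: (timestamp, text). Dates and texts are illustrative only.
TWEETS = [
    ("2011-03-11 06:10:00", "Huge earthquake in Japan"),
    ("2011-03-11 09:30:00", "earthquake death toll feared high"),
    ("2011-03-14 12:00:00", "nuclear plant situation worsening"),
    ("2011-03-14 18:45:00", "nuclear fears after the earthquake"),
]


def keyword_counts_by_day(tweets, keywords):
    """Return {date: Counter} of how many tweets mention each keyword per day."""
    series = defaultdict(Counter)
    for timestamp, text in tweets:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
        words = set(text.lower().split())
        for keyword in keywords:
            if keyword in words:
                series[day][keyword] += 1
    return series


series = keyword_counts_by_day(TWEETS, ["earthquake", "death", "nuclear"])
for day in sorted(series):
    print(day, dict(series[day]))
```

The per-day counters are the kind of rows a charting tool would need: one row per day, one column per keyword.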

I’m considering using the Google Visualization APIs, as it should do the job. I think the motion chart tool is very cool.

Technologies used

I intend to add more APIs, such as Facebook and Google Buzz, over the next few months.

EndNote

I’m currently focused on Twitter and have almost completed the mining portion of the tool.

I’m now looking forward to implementing the computation and visualization components of the tool, and perhaps adding other media outlets like Facebook and Google Buzz.

Any thoughts? Talk to me here using the comments box.

Visualizing Social Data

[UPDATE on 21st March 2011]

I’m now working on a similar but more powerful tool; read more about it here.

Introduction

This is a public post about a CS project that I did in the previous semester for the course “Special Topics on Computer Science (B)”.

The idea behind this project is to visualize data from Twitter, Facebook and Google Buzz in the form of a TreeMap.

Usage

The usage is as follows:

  1. User inputs topics of interest
  2. Program scrapes data across Facebook, Twitter and Google Buzz
  3. Results from step 2 are aggregated according to their attributes
  4. User selects the attributes to visualize
  5. After selecting attributes, the TreeMap is created based on the data and attributes selected.
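The aggregation and TreeMap steps above can be sketched roughly as follows. This is not the project’s code: the item fields (`source`, `topic`, `author`) and the function name `build_treemap` are assumptions, and each node’s size is simply the number of items that fall under it.

```python
from collections import defaultdict

# Hypothetical scraped items; field names are assumptions for illustration.
ITEMS = [
    {"source": "Twitter", "topic": "earthquake", "author": "alice"},
    {"source": "Twitter", "topic": "earthquake", "author": "bob"},
    {"source": "Facebook", "topic": "earthquake", "author": "carol"},
    {"source": "Google Buzz", "topic": "nuclear", "author": "alice"},
]


def build_treemap(items, attributes):
    """Nest items by the chosen attributes; each node's size is its item count."""
    if not attributes:
        return []
    first, rest = attributes[0], attributes[1:]
    groups = defaultdict(list)
    for item in items:
        groups[item[first]].append(item)
    return [
        {"name": value, "size": len(members), "children": build_treemap(members, rest)}
        for value, members in sorted(groups.items())
    ]


# The user picks which attributes to visualize, and in what order.
tree = build_treemap(ITEMS, ["source", "topic"])
for node in tree:
    print(node["name"], node["size"], [c["name"] for c in node["children"]])
```

Changing the attribute list, for example to `["topic", "author"]`, regroups the same scraped data into a different TreeMap without scraping again.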

Video Demonstration

Here’s a video demo: http://www.youtube.com/watch?v=5DNRd4wrLL0

Environment and Tools Used

Here’s the stuff which I’ve used:

Thoughts on Twitter’s API – Twython and Tweepy

Everything is great with Twython, but it’s a pity that Twython doesn’t support Twitter’s Streaming API at this point in time.

So if you want to make use of Twitter’s Streaming API, you might want to check out Tweepy.

But after testing Tweepy, I noticed that Tweepy’s OAuth handler is not working.

Therefore, to use Twitter’s Streaming API, you can use Twython for OAuth authentication and then use the resulting token and key in Tweepy.

This will help you get the best of both worlds.

Store the token and key (obtained using Twython’s OAuth methods) in your database, and then reuse them in Tweepy’s Streaming API.
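Here is a minimal sketch of the “store and reuse” part only, using Python’s built-in `sqlite3` module. It deliberately leaves out the Twython and Tweepy calls themselves; the table name, column names, function names and the example values are all hypothetical.

```python
import sqlite3


def open_token_store(path=":memory:"):
    """Open (or create) a small table holding one token/secret pair per user."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS oauth_tokens ("
        "user_id TEXT PRIMARY KEY, token TEXT NOT NULL, secret TEXT NOT NULL)"
    )
    return conn


def save_token(conn, user_id, token, secret):
    """Insert or overwrite the stored token pair for this user."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO oauth_tokens (user_id, token, secret) "
            "VALUES (?, ?, ?)",
            (user_id, token, secret),
        )


def load_token(conn, user_id):
    """Return (token, secret) for this user, or None if nothing is stored."""
    return conn.execute(
        "SELECT token, secret FROM oauth_tokens WHERE user_id = ?", (user_id,)
    ).fetchone()


# Placeholder values; in practice these would come from the OAuth step.
conn = open_token_store()
save_token(conn, "demo-user", "example-token", "example-secret")
print(load_token(conn, "demo-user"))
```

The pair returned by `load_token` is what you would hand to the streaming client once the OAuth step has been completed elsewhere.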

An easy way to hook Twitter API on Django

I was working on a project and needed to hook the Twitter API into Django.

I researched quite a bit and came across Twython.

Twython is the easiest Twitter client for Django in terms of:

  1. installation
  2. executing common functions such as searching and retrieving public timeline results.

If you do not want to install using the official method described here, you can simply drop the files streaming.py, twitter_endpoints.py and twython.py (found in the folder twython, right here) into your Django app.

Give it a try, it should work for you.

Twitter JavaScript API (and a small error).

Yes, it might be old news, but Twitter has a pure JavaScript API, found at http://platform.twitter.com/js-api.html. I wanted to point out that there’s an error in their sample code: on that page, the “facade method example” has a small mistake.

status.screenName

should read

status.user.screenName

If you use the former, you will receive an undefined value. That’s all for now.