Scraping and Analyzing Social Data
[Update Feb 2013: this tool has evolved into Querybox, which came in first at StartUp Weekend 2013 in Tainan, Taiwan.] [Update: the tool is now featured in First Monday! http://www.firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3937 ]
I'm currently involved in a National Science Council (Taiwan) funded project that investigates public communications in times of crisis.
I'm in charge of building a tool that helps non-technical users mine and analyze data related to disasters, such as the recent Japanese Earthquake, on new social media outlets such as Twitter, Facebook and blogs.
I've given it the nickname "Scraper", because it scrapes data from social media. Not the most creative name, but it delivers. =).
I do have some experience building similar tools, but on a smaller scale. Hence it's certainly a challenge to be involved in a project like this.
Here are some of my thoughts about this project:
The tool should contain the following components:
- A web-based GUI, so that any user can use it
- A visualization layer, so that non-technical users can make sense of the data
- A mining layer, so users can keep track of the topics they are interested in
- A computation layer, to allow users to perform various analyses, such as searching for keyword frequency over a certain period of time
The computation layer will be required to do massive calculations in very short periods of time. I'm currently tracking data about the Japanese Earthquake (using Twitter's search API) and have a little over 300,000 tweets. Calculating the most frequent keywords and users on a quad-core personal computer with 4GB of RAM can take more than 40 seconds.
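To give a feel for what this calculation involves, here is a minimal Python sketch of counting the most frequent keywords and users. It is not the tool's actual code: the tweet structure (dicts with `user` and `text` keys), the tiny stopword list, and the function names are all assumptions made for illustration, and the sample tweets are made up.

```python
import re
from collections import Counter

# Hypothetical stopword list (assumption); a real one would be much longer.
STOPWORDS = {"the", "a", "an", "in", "of", "to", "and", "is", "was", "rt"}

def tokenize(text):
    """Lowercase the text, extract word tokens, and drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def top_keywords_and_users(tweets, n=10):
    """Return the n most frequent keywords and the n most active users."""
    keyword_counts = Counter()
    user_counts = Counter()
    for tweet in tweets:
        keyword_counts.update(tokenize(tweet["text"]))
        user_counts[tweet["user"]] += 1
    return keyword_counts.most_common(n), user_counts.most_common(n)

# Made-up sample data, only to show the expected input shape.
sample = [
    {"user": "alice", "text": "Earthquake in Japan"},
    {"user": "bob", "text": "The earthquake was huge"},
    {"user": "alice", "text": "Nuclear plant news after the earthquake"},
]
keywords, users = top_keywords_and_users(sample, n=2)
```

A single pass like this is linear in the number of tweets, so the cost on a few hundred thousand tweets depends mostly on how the data is loaded and tokenized.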
So the solution would most probably make use of Hadoop or some other distributed computing platform to speed up the process.
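The idea behind a platform like Hadoop is to split the counting into a map step that runs independently on chunks of the data and a reduce step that merges the partial results. The sketch below imitates that pattern in plain Python on a single machine. It does not use Hadoop itself, and the function names (`chunked`, `map_chunk`, `reduce_counts`, `word_count`) are hypothetical.

```python
from collections import Counter
from functools import reduce

def chunked(items, size):
    """Yield successive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def map_chunk(texts):
    """Map step: count the words in one chunk of tweet texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial word counts."""
    return left + right

def word_count(texts, chunk_size=2):
    """Count words by mapping over chunks and reducing the partial counts."""
    partials = map(map_chunk, chunked(texts, chunk_size))
    return reduce(reduce_counts, partials, Counter())

# Made-up sample data.
texts = ["earthquake japan", "earthquake tsunami", "nuclear japan"]
totals = word_count(texts)
```

Because each chunk is counted independently, the map step is the part that a distributed platform could spread across many machines.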
Which analyses to support is fairly subjective: it depends on what kinds of analysis end users would want to perform. In my case, I foresee that users would want to perform time-series analysis.
For example, based on my current data, I noticed that different keywords "bubble up" on different days: on the first day or so of the Japanese Earthquake, keywords such as "earthquake" and "death" appear.
Then by 14 March, approximately 3 to 4 days after the earthquake, keywords like "nuclear" begin to bubble up.
So a time series analysis of the data will show different topics which are of concern to users at different points in time of the event.
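A time-series analysis of this kind boils down to grouping tweets by day and counting how many mention each keyword. Here is a minimal Python sketch, assuming each tweet is a dict with a `created_at` timestamp string in `YYYY-MM-DD HH:MM:SS` format and a `text` field. That structure, the function name, and the sample tweets are assumptions for illustration, not real data from the project.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

def keyword_counts_by_day(tweets, keywords):
    """Count, for each day, how many tweets mention each tracked keyword."""
    daily = defaultdict(Counter)
    for tweet in tweets:
        # Assumed timestamp format; adjust to match the actual data source.
        stamp = datetime.strptime(tweet["created_at"], "%Y-%m-%d %H:%M:%S")
        day = stamp.date().isoformat()
        words = set(re.findall(r"[a-z]+", tweet["text"].lower()))
        for keyword in keywords:
            if keyword in words:
                daily[day][keyword] += 1
    # ISO date strings sort chronologically, so sorting the keys orders the days.
    return dict(sorted(daily.items()))

# Made-up sample tweets, only to show the shape of the input and output.
sample = [
    {"created_at": "2011-03-11 15:02:00", "text": "Huge earthquake in Japan"},
    {"created_at": "2011-03-11 18:40:00", "text": "Earthquake death toll rising"},
    {"created_at": "2011-03-14 09:15:00", "text": "Nuclear plant fears grow"},
]
series = keyword_counts_by_day(sample, ["earthquake", "death", "nuclear"])
```

The result maps each day to a counter of keyword mentions, which is the kind of table a charting tool can plot as one line per keyword.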
I'm considering using the Google Visualization APIs, as they should do the job. I think the motion chart tool is very cool.
I intend to add more APIs, such as Facebook and Google Buzz, over the next few months.
I'm currently focused on Twitter and have almost completed the mining portion of the tool.
I'm now looking forward to implementing the computation and visualization components of the tool, and perhaps adding support for other media outlets like Facebook and Google Buzz.
Any thoughts? Talk to me here using the comments box.