Thursday, March 02, 2006
We decided on the high-level architecture for the Blog crawler. The crawler's operations can be listed as follows.
1. Start with the subscription pages of Bloglines
2. Crawl each Blog listed on a subscription page
3. For each crawled Blog, store its Blog-Rolls and add the subscribed Blog URLs to the Blog frontier
4. Repeat until the Blog frontier is empty
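The loop above can be sketched roughly as follows. This is only a minimal illustration in Python, not our crawler code: the names `crawl` and `fetch_blogroll` are hypothetical, the frontier is a plain in-memory queue rather than the Distributed Queue, and fetching is passed in as a function so the loop can be tried without network access.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(seed_urls: Iterable[str],
          fetch_blogroll: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Crawl blogs breadth-first until the frontier is empty.

    fetch_blogroll is a hypothetical callback that returns the blogroll
    URLs found on one blog. Returns a mapping from each crawled blog URL
    to its blogroll URLs.
    """
    frontier = deque(seed_urls)   # blogs waiting to be crawled
    visited = set()               # blogs already crawled
    blogrolls: Dict[str, List[str]] = {}

    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue              # the same blog can be queued more than once
        visited.add(url)

        links = fetch_blogroll(url)
        blogrolls[url] = links    # step 3: store the blogroll for this blog

        for link in links:
            if link not in visited:
                frontier.append(link)   # step 3: grow the frontier

    return blogrolls
```

Checking the visited set when a URL is taken off the frontier, and not only when it is added, keeps the loop correct even if two blogs link to the same blog before it has been crawled.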
We made the following design decisions
1. We use a Distributed Queue developed by Chathura as the Blog frontier
2. We use a MySQL database to store data about crawled Blogs and the Blog-Rolls of each Blog.
3. We use the database to store the visited Blogs, and keep a local hash map in memory for fast access. The database protects the data from being lost in case of system failure.
4. We use Apache Commons HttpClient (http://jakarta.apache.org/commons/httpclient/) to fetch the Blogs
5. We use HTML Parser (http://htmlparser.sourceforge.net/) to extract the Blog-Rolls and subscription URLs from Blogs
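Decision 3, a visited set kept in the database with an in-memory copy for fast lookups, might look something like the sketch below. It is an assumption-laden illustration: it uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server, and the class name `VisitedStore` and the table name `visited` are made up for the example.

```python
import sqlite3


class VisitedStore:
    """Visited-URL set stored in a database table and cached in memory.

    sqlite3 stands in for MySQL here; the table layout is hypothetical.
    """

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)"
        )
        self._conn.commit()
        # Reload earlier progress so a restart does not re-crawl blogs.
        self._cache = {
            row[0] for row in self._conn.execute("SELECT url FROM visited")
        }

    def __contains__(self, url: str) -> bool:
        # Lookups only touch the in-memory set, never the database.
        return url in self._cache

    def add(self, url: str) -> None:
        if url in self._cache:
            return
        # Write to the database first, then update the in-memory copy,
        # so the cache never holds a URL that was not saved.
        self._conn.execute(
            "INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,)
        )
        self._conn.commit()
        self._cache.add(url)
```

With this shape, membership tests such as `url in store` stay cheap during the crawl, while every newly visited Blog is committed to the database before the crawler moves on.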
We have completed the following steps in the last two weeks
1. Set up a MySQL database and designed the tables
2. Set up an svn repository at http://www.cvsdude.org/; our code is available in the svn repository
3. Wrote the code to fetch and parse the Blogs
4. Chathura is working on the database encapsulation layer for the data
5. We have an initial skeleton of the crawler, which fetches the Blogs, parses them, and prints the data to the screen. We are working on saving the information to the database and making the Blog frontier persistent.
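The parse step of such a skeleton could be approximated as below. This is a sketch using Python's standard `html.parser`, not the HTML Parser library our crawler uses. The names `LinkExtractor` and `extract_links` are hypothetical, and it collects every HTTP link on a page, whereas picking out only the Blog-Roll would need rules specific to each Blog layout.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) URLs from the <a href="..."> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                absolute = urljoin(self.base_url, value)
                # Skip mailto:, javascript: and other non-web links.
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the links found in html, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Printing the result of `extract_links(page_text, page_url)` for each fetched page gives the kind of screen output described in step 5, and the same list is what would later be saved to the database and pushed onto the frontier.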
We have yet to decide on the finer details of how the Blog analysis will be done. We plan to finish the data gathering steps first and then start on the data analysis steps.