Tuesday, May 02, 2006

 
Our final results are List of names of clusters and List of urls of clusters. Code we developed can be found in here.

The lists do not themselves demonstrate success of our algorithm. We tried to find a known cluster of our personal blogs, however on the thresholding most of the their blogs are removed and we could not derive a reliable conclusion from that information.

We ran the clustering algorithm there is a one single cluster left with about 2/3 of the nodes. May be we should try to do the clustering using lesser number of blogs for better understanding about the algorithm. In the light of the final exams we had to stop our effort without without empirical justification for the tequniques we presented.

Finally few possible future work

 
Results

The result of the project is a dendogram of the blog nodes. We try to summerize the finding below.
The following results are such that the original graph of 40000 nodes is threshold in such a way that it is reduced to a graph of 1638 nodes. Then following are the results obtained by parsing the dendogram file that was produced.
Below is the statistics of the size of the cluster against the frequency of the clusters of that size.












Size of the Cluster Frequency
2 168
3 62
4 14
5 1
6 2
8 1
10 1
11 1
1014 1

Monday, April 24, 2006

 
Modifying Radicchi Algorithm to handle large connectivity graphs

The connectivity graph that we have obtained from the blogs was too densed as shown in the statistics earlier. Radicchi algorithm is an much simplfied version of the original Grivan Newman algorithm. But for a graph of over 40,000 nodes and few million edges, even Radicchi algorithm becomes computationally unfeasible. SO we have adopted few of the following optimization mechanisms and managed to bring the computation time from few weeks to little less than a day. Following are the optimizations

* Prune the connectivity of the graph by increasing the threshold of the edges. In other words remove all the edges that are smaller than a threshold value.
* Do batch elmination of edges in the Radicchi algorithm. That is instead of removing the edge with the smallest clustering coefficient, we remove all the edges which are in that small proximity which allows us to drastically reduce the computational time.

Thursday, April 20, 2006

 
Subscription count for a Feed in our data set have a power law with exponet 1.9

let x - number of subscriptions a feed has. The graph on the right is log-log plot of x against number of occurences of each x. 1 on X axis means 0=<10, 2 means 10=<20 and so on.

Note the staight line chatersrise the power law, and notice exponet is about 1.9!

 
Construing the co-reference Graph

  1. We collected the information about Blogs in to a database table Blogs(id,name,url,blogrolls), here Blog rolls is a # separated list of feeds the given Blog is subscribed to. As the number of feed is about 0.2 Million we can not perform a SQL join on the table, fil instruct us to load everything to memory
  2. We go though table Blogs and create a text file with subscriptions for each Blog. The text file had a two dimensional array of integers. ith row of the array lists all the feeds, the ith Blog is subscribed to. Each feed is represented by the hash of it's name. Also we create subscription table sub(id,subid), id is the Blog id and subid is the has of the feed.
  3. We wrote a code to load the above text file to memory. As 2341974 subscriptions are available that take about 2.3MB*4 = 9.2MB
  4. We need count of subscriptions each feed has, we use select subcribed, count(*) from hssub group by subcribed having count(*)>1 query to store number of subscriptions each feed has as (subid,f) in a file. ith row of the file list subid and f separated by a space. The list is sorted subid.
  5. As we need to lookup the number of frequency using subid, we load frequency data to the two dimensional integer array, each row having subid and frequency. As there are 168932 entries and that take 0.16MB*4*2 = 1.28MB. We implemented binary search on the sorted array for faster lookup.
  6. Using subscription count for each feed and subscriptions for each Blog we implemented our algorithm to calculate the relatedness between each pair of Blogs. To represent the graph as a adjacency matrix thresholding the calculated relatedness at 150. 150 is chosen such that matrix is spare enough for clustering. To store data we use a spare matrix representation learned at scientific computing class. (matrix is represented as a two arrays iindex[?] and data[?][2], when each entry in the adjacency matrix is represented by (i,j,w), iindex[x] is the start of entries with i=x in data. w is found by matching j value in data matrix. w = data[iindex[i+x][1] where iindex[i+x][0] =j for smallest positive x ]). We implemented a search that does a lookup on i and binary search for matching j.

Tuesday, April 18, 2006

 

Clustering

The size of the connectivity graph that we are dealing is too big and it simply rules out the possibility of Girvan Newman's Clustering algorithm based on Edge betweenness. Thus Radicchi algorithm was chosen as the clustering algorithm. One problem with the Radicchi algorithm is, that it was originally intended for unweighted graphs. Our Graph is a weighted graph and the weights carry the significance of the connection between the two blogs.
One suggestion from Fillipo was to extend the Radicchi algorithm to a weighted graph. If the weight of the edge was W, one option is to take a measure sqrt(2*(i-w)) as the measure for the weight and then do radicchi based on that. Above transformation is necessary because higher weight implies strong connection and we want those connections to stick around towards the end when you do the Radicchi clustering.
One other option was to do the Radicchi as if the graph was un weighted and at the time of removing the edge select the highest value by multiplying with the edge weight and then considering the highest betweenness approximation.


Sunday, April 09, 2006

 
Data Processing

We are done with the data gathering and have started the processing. But with 40000 records in the data base we can not perform a query with a SQL JOIN, which make complex SQL queries impossible . We talked with fil and he suggest try to use all the data in to the memory and do the processing. We were able to load the data about subscriptions to the memory and currently working on creation of the graph.

Algorithm

After the discussion with Fil this is our revised algorithm. We groups feeds as private Blogs and public Blogs, in this project we are making an effort to derive the community structure from the blog-roll of private blogs. We plan to create a co-reference graph for the private blogs and perform a clustering algorithm the created graph.

Our graph creation is based on two heuristics, first we assume if two blogs have a subscribed to same feed they are related to each other. Second, we take in to account that if Blogs A and B are subscribed to Slashdot, that does not imply a strong relationship, however if two Blogs are subscribed to another Blog which has just five total subscriptions, that possibly means a stronger relationship.

To model this fact, we calculated the weighted co-reference graph from the subscriptions data using following formula.
Let SUB (A AND B), be subscription both A and B share. Let P(x), be probability that feed x is selected if Blog C add blogrolls randomly.


We use self information, log(1/P(X)) as defined by the information theory to approximate the information provided by each feed .

Tuesday, April 04, 2006

 
I am collecting information about the public blog subscriptions from set of personal blogs. The code is still running adding subscriptions to the data base. So far 11349 persoanl blogs has subscribed to 203462 public blogs, which means about 20 subscriptions per blog. Here are few statistics

Public Blogs

blogs have more than 100 refernaces 843/203462
blogs have more than 75 refernaces 1205/203462
blogs have more than 50 refernaces 1836/203462
blogs have more than 32 refernaces 2954/203462
blogs have more than 16 refernaces 6089/203462
blogs have more than 8 refernaces 11842/203462
blogs have more than 4 refernaces 18281/203462
blogs have more than 3 refernaces 18281/203462
blogs have more than 2 refernaces 39378/203462
blogs have more than 1 refernaces 65985/203462
blogs have 1 refernaces 137477/203462


Personal blogs

having >5 blog rolls 10825/11349
having >10 blog rolls 10064/11349
having >20 blog rolls 8466/11349
having >30 blog rolls 7121/11349
having >50 blog rolls 5997/11349
having >100 blog rolls 2447/11349
having >200 blog rolls 800/11349
having >400 blog rolls 800/11349
having >1000 blog rolls 26/11349

Note there are 26 personal blogs that have >1000 subscriptions!!

Monday, April 03, 2006

 
We have make a mistake on naming the graph we plan to analyze. Our document said we are analyzing a co-citation graph. But It should be co-reference graph. Our heuristic is if two blogs subscribed to same feed, there have some level of smiler interests.

 
We have been reading about the related publications to our project. Among them we find following very interesting

  1. Finding "the life between buildings": An approach for defining a weblog community by Lilia Efimova
  2. Tomographic Clustering To Visualize Blog Communities as Mountain Views by Belle L. Tseng,Junichi Tatemura,and Yi Wu
  3. NusEye: Designing for Social Navigation in Syndicated Content by Azzari C. Jarrett, and Brian M. Dennis

Even though not directly related, following are few interesting papers

  1. Conversations in the Blogosphere: An Analysis "From the Bottom Up" Susan C. Herring, Inna Kouper, John C. Paolillo, Lois Ann Scheidt, Michael Tyworth, Peter Welsch, Elijah Wright, and Ning Yu
  2. Power Laws, Weblogs and Inequality by Clay Shirky
  3. On Webfeed Aggregators and Social Navigation Brian M. Dennis
  4. Audience, structure and authority in the weblog community by Cameron Marlow
  5. The EigenRumor Algorithm for Ranking Blogs by Ko Fujimura, Takafumi Inoue,and Masayuki Sugisaki
  6. STRUCTURE AND EVOLUTION OF Blogspace By RAVI KUMAR, JASMINE NOVAK, PRABHAKAR RAGHAVAN,
  7. AND ANDREW TOMKINS
  8. How to search a social network by Lada Adamic and Eytan Adar
  9. Discovering Important Bloggers based on Analyzing Blog Threads by Shinsuke Nakajima, Junichi Tatemura , and Yoichiro Hino
  10. Implicit Structure and the Dynamics of Blogspace by Eytan Adar, Li Zhang, Lada A. Adamic Rajan, and M. Lukose

We decide to process the downloaded blogs from the data base itself. We have develop a code to extract the blog-rolls from the data base entries and create a new table that shows the subscriptions for each blogs. Right now we are working on following.

  1. Code to build co-citation graph in the data base tables
  2. Code to gather statistics about blogs from the data base tables and provide statistical analysis about blogs
  3. Code to do clustering

As a method of evaluating the results of the clustering we decided to compare resulting groups against groups derived from the subscription (spare graph which we originally plan to analyize). As the subsriptions between private blogs suggest stronger connections among blogs, we expect subscription groups to be included in clustering groups.

This page is powered by Blogger. Isn't yours?