<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-21734325</id><updated>2011-04-21T20:58:10.037-07:00</updated><title type='text'>Web Mining (B659) 2006</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>18</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-21734325.post-114659146347784707</id><published>2006-05-02T10:23:00.000-07:00</published><updated>2006-05-02T10:37:43.756-07:00</updated><title type='text'></title><content type='html'>Our final results are &lt;a href="http://www.cs.indiana.edu/%7Ehperera/research/final.txt"&gt;List of names of clusters&lt;/a&gt; and &lt;a href="http://www.cs.indiana.edu/%7Ehperera/research/url.txt"&gt;List of urls of clusters&lt;/a&gt;. 
The code we developed can be found &lt;a href="http://www.cs.indiana.edu/~hperera/research/blogs.zip"&gt;here&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;The lists by themselves do not demonstrate the success of our algorithm. We tried to find a known cluster of our own personal blogs; however, the thresholding removed most of those blogs, so we could not draw a reliable conclusion from that information.&lt;br /&gt;&lt;br /&gt;After we ran the clustering algorithm, a single cluster was left with about 2/3 of the nodes. Maybe we should try clustering a smaller number of blogs to understand the algorithm better. In light of the final exams we had to stop our effort without empirical justification for the techniques we presented.&lt;br /&gt;&lt;br /&gt;Finally, some possible future work:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Based on the clustering we did, pick the subset of the 40 000 blogs with the highest connectivity and perform the clustering again&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Our relatedness measure favors a high number of shared subscriptions over a few rare subscriptions, due to the logarithm. 
Maybe we should compare the two approaches using samples from the collected blogs and refine the relatedness measure&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Visualize the communities&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Come up with a measure of success for the relatedness algorithm&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114659146347784707?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114659146347784707/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114659146347784707' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114659146347784707'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114659146347784707'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/05/our-final-results-are-list-of-names-of.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-114658114875034117</id><published>2006-05-02T07:25:00.000-07:00</published><updated>2006-05-02T07:45:48.770-07:00</updated><title type='text'></title><content type='html'>&lt;span style="font-size:130%;"&gt;Results&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The result of the project is a dendrogram of the blog nodes. 
We summarize the findings below.&lt;br /&gt;For these results, the original graph of 40000 nodes was thresholded so that it was reduced to a graph of 1638 nodes. The results were then obtained by parsing the dendrogram file that was produced.&lt;br /&gt;The table below lists each cluster size against the number of clusters of that size.&lt;br /&gt;&lt;table&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;Size of the Cluster&lt;/td&gt;&lt;td&gt;       Frequency&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;2      &lt;/td&gt;&lt;td&gt;      168&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;3     &lt;/td&gt;&lt;td&gt;        62&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;4       &lt;/td&gt;&lt;td&gt;      14&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;5      &lt;/td&gt;&lt;td&gt;        1&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;6      &lt;/td&gt;&lt;td&gt;        2&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;8      &lt;/td&gt;&lt;td&gt;        1&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;10     &lt;/td&gt;&lt;td&gt;       1&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;11      &lt;/td&gt;&lt;td&gt;      1&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;1014   &lt;/td&gt;&lt;td&gt;       1&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;br /&gt;&lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114658114875034117?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114658114875034117/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114658114875034117' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/21734325/posts/default/114658114875034117'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114658114875034117'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/05/results-result-of-project-is-dendogram.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-114591417663653439</id><published>2006-04-24T14:29:00.000-07:00</published><updated>2006-04-24T14:29:36.646-07:00</updated><title type='text'></title><content type='html'>&lt;span style="font-size:130%;"&gt;Modifying the Radicchi Algorithm to handle large connectivity graphs&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The connectivity graph that we obtained from the blogs was too dense, as shown in the statistics earlier. The Radicchi algorithm is a much simplified version of the original Girvan-Newman algorithm. But for a graph of over 40,000 nodes and a few million edges, even the Radicchi algorithm becomes computationally infeasible. So we adopted the following optimizations and managed to bring the computation time down from a few weeks to a little less than a day:&lt;br /&gt;&lt;br /&gt;    * Prune the graph by raising the edge-weight threshold. In other words, remove all edges whose weight is below a threshold value.&lt;br /&gt;    * Do batch elimination of edges in the Radicchi algorithm. 
That is, instead of removing only the edge with the smallest clustering coefficient, we remove all the edges whose coefficients lie close to that smallest value, which drastically reduces the computation time.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114591417663653439?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114591417663653439/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114591417663653439' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114591417663653439'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114591417663653439'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/04/modifying-radicchi-algorit_114591417663653439.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-114554493767947216</id><published>2006-04-20T07:47:00.000-07:00</published><updated>2006-04-20T07:56:50.216-07:00</updated><title type='text'></title><content type='html'>&lt;span style="font-weight: bold;font-size:130%;" &gt;Subscription counts for feeds in our data set follow a power law with exponent 1.9&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://photos1.blogger.com/blogger/6539/2201/1600/blogrolls.0.jpg"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; 
cursor: pointer;" src="http://photos1.blogger.com/blogger/6539/2201/320/blogrolls.jpg" alt="" border="0" /&gt;&lt;/a&gt;Let x be the number of subscriptions a feed has. The graph on the right is a log-log plot of x against the number of occurrences of each x. 1 on the X axis means 0 &amp;lt;= x &amp;lt; 10, 2 means 10 &amp;lt;= x &amp;lt; 20, and so on.&lt;br /&gt;&lt;br /&gt;Note that the straight line characterizes the power law, and that the exponent is about 1.9!&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114554493767947216?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114554493767947216/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114554493767947216' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114554493767947216'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114554493767947216'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/04/subscription-count-for-feed-in-our.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-114553812288231773</id><published>2006-04-20T06:01:00.000-07:00</published><updated>2006-04-20T06:02:02.903-07:00</updated><title type='text'></title><content type='html'>Constructing the co-reference Graph&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;We collected the information about Blogs in 
to a database table Blogs(id,name,url,blogrolls), where blogrolls is a #-separated list of the feeds the given Blog is subscribed to. As the number of feeds is about 0.2 million, we cannot perform a SQL join on the table; Fil advised us to load everything into memory.&lt;/li&gt;&lt;li&gt;We go through the table Blogs and create a text file with the subscriptions for each Blog. The text file holds a two-dimensional array of integers. The ith row of the array lists all the feeds the ith Blog is subscribed to. Each feed is represented by the hash of its name. We also create a subscription table sub(id,subid), where id is the Blog id and subid is the hash of the feed.&lt;/li&gt;&lt;li&gt;We wrote code to load the above text file into memory. As there are 2341974 subscriptions, that takes about 2.3M * 4 bytes = 9.2MB.&lt;/li&gt;&lt;li&gt;We need the count of subscriptions each feed has, so we use the query &lt;span style="font-style: italic;"&gt;select subcribed, count(*) from hssub group by subcribed having count(*)&gt;1 &lt;/span&gt;to store the number of subscriptions each feed has as (subid,f) in a file. The ith row of the file lists subid and f separated by a space. The list is sorted by subid.&lt;/li&gt;&lt;li&gt;As we need to look up the frequency using subid, we load the frequency data into a two-dimensional integer array, each row holding a subid and its frequency. As there are 168932 entries, that takes about 0.16M * 4 bytes * 2 = 1.28MB. We implemented binary search on the sorted array for faster lookup.&lt;/li&gt;&lt;li&gt;Using the subscription count for each feed and the subscriptions for each Blog, we implemented our algorithm to calculate the relatedness between each pair of Blogs. We represent the graph as an adjacency matrix, thresholding the calculated relatedness at 150. The value 150 was chosen so that the matrix is sparse enough for clustering. To store the data we use a sparse matrix representation learned in a scientific computing class. (The matrix is represented as two arrays, iindex[?] 
and data[?][2]. When each entry in the adjacency matrix is represented by (i,j,w), iindex[x] is the start of the entries with i=x in data, and each row of data holds a (j,w) pair. w is found by matching the j value in data: w = data[iindex[i]+x][1], where data[iindex[i]+x][0] = j for the smallest non-negative x.) We implemented a search that does a lookup on i and a binary search for the matching j.&lt;/li&gt;&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114553812288231773?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114553812288231773/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114553812288231773' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114553812288231773'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114553812288231773'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/04/construing-co-reference-graph-we.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-114538571356359287</id><published>2006-04-18T11:27:00.000-07:00</published><updated>2006-04-18T11:41:53.576-07:00</updated><title type='text'></title><content type='html'>&lt;p class="MsoNormal"&gt;Clustering&lt;br /&gt;&lt;br /&gt;The connectivity graph that we are dealing with is too big, which simply rules out Girvan and Newman's clustering algorithm based on edge 
betweenness. Thus the Radicchi algorithm was chosen as the clustering algorithm. One problem with the Radicchi algorithm is that it was originally intended for unweighted graphs. Our graph is a weighted graph, and the weights carry the significance of the connection between two blogs.&lt;br /&gt;One suggestion from Filippo was to extend the Radicchi algorithm to a weighted graph. If the weight of an edge is w, one option is to take sqrt(2*(i-w)) as the measure for that edge and then run Radicchi based on that. This transformation is necessary because a higher weight implies a stronger connection, and we want those connections to stick around towards the end of the Radicchi clustering.&lt;br /&gt;Another option was to run Radicchi as if the graph were unweighted and, at the time of removing an edge, multiply the betweenness approximation by the edge weight and select the edge with the highest value.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114538571356359287?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114538571356359287/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114538571356359287' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114538571356359287'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114538571356359287'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/04/clustering-size-of-connectivity-graph.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-114463762078671680</id><published>2006-04-09T19:48:00.000-07:00</published><updated>2006-04-09T19:58:02.450-07:00</updated><title type='text'></title><content type='html'>&lt;span style="font-size:130%;"&gt;Data Processing&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We are done with the data gathering and have started the processing. But with 40000 records in the database we cannot perform a query with a SQL JOIN, which makes complex SQL queries impossible. We talked with Fil and he suggested loading all the data into memory and doing the processing there. We were able to load the subscription data into memory and are currently working on creating the graph.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Algorithm&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;After the discussion with Fil, this is our revised algorithm. We group feeds into private Blogs and public Blogs; in this project we are trying to derive the community structure from the blog-rolls of private blogs. We plan to create a co-reference graph for the private blogs and run a clustering algorithm on the created graph.&lt;br /&gt;&lt;br /&gt;Our graph creation is based on two heuristics. First, we assume that if two blogs have subscribed to the same feed, they are related to each other. Second, we take into account that if Blogs A and B are both subscribed to Slashdot, that does not imply a strong relationship; however, if two Blogs are subscribed to another Blog which has just five total subscriptions, that possibly means a stronger relationship.&lt;br /&gt;&lt;br /&gt;To model this, we calculate the weighted co-reference graph from the subscription data using the following formula.&lt;br /&gt;Let SUB(A AND B) be the set of subscriptions that both A and B share. 
Let P(x) be the probability that feed x is selected if a Blog C adds blogrolls at random.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://photos1.blogger.com/blogger/6539/2201/1600/img1.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://photos1.blogger.com/blogger/6539/2201/320/img1.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;We use the self-information, log(1/P(x)), as defined in information theory, to approximate the information provided by each feed.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114463762078671680?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114463762078671680/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114463762078671680' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114463762078671680'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114463762078671680'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/04/data-processing-we-are-done-with-data.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-114416106024272258</id><published>2006-04-04T07:26:00.000-07:00</published><updated>2006-04-04T07:31:00.256-07:00</updated><title type='text'></title><content type='html'>I am 
collecting information about public blog subscriptions from a set of personal blogs. The code is still running, adding subscriptions to the database. So far 11349 personal blogs have subscribed to 203462 public blogs, which means about 20 subscriptions per blog. Here are a few statistics:&lt;br /&gt;&lt;br /&gt;Public Blogs&lt;br /&gt;&lt;br /&gt;blogs with more than 100 references: 843/203462&lt;br /&gt;blogs with more than 75 references: 1205/203462&lt;br /&gt;blogs with more than 50 references: 1836/203462&lt;br /&gt;blogs with more than 32 references: 2954/203462&lt;br /&gt;blogs with more than 16 references: 6089/203462&lt;br /&gt;blogs with more than 8 references: 11842/203462&lt;br /&gt;blogs with more than 4 references: 18281/203462&lt;br /&gt;blogs with more than 3 references: 18281/203462&lt;br /&gt;blogs with more than 2 references: 39378/203462&lt;br /&gt;blogs with more than 1 reference: 65985/203462&lt;br /&gt;blogs with exactly 1 reference: 137477/203462&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Personal blogs&lt;br /&gt;&lt;br /&gt;having &gt;5 blog rolls: 10825/11349&lt;br /&gt;having &gt;10 blog rolls: 10064/11349&lt;br /&gt;having &gt;20 blog rolls: 8466/11349&lt;br /&gt;having &gt;30 blog rolls: 7121/11349&lt;br /&gt;having &gt;50 blog rolls: 5997/11349&lt;br /&gt;having &gt;100 blog rolls: 2447/11349&lt;br /&gt;having &gt;200 blog rolls: 800/11349&lt;br /&gt;having &gt;400 blog rolls: 800/11349&lt;br /&gt;having &gt;1000 blog rolls: 26/11349&lt;br /&gt;&lt;br /&gt;Note that there are 26 personal blogs that have &gt;1000 subscriptions!!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114416106024272258?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114416106024272258/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114416106024272258' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114416106024272258'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114416106024272258'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/04/i-am-collecting-information-about.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-114411506298091118</id><published>2006-04-03T18:41:00.000-07:00</published><updated>2006-04-03T18:44:22.993-07:00</updated><title type='text'></title><content type='html'>We made a mistake in naming the graph we plan to analyze. Our document said we are analyzing a co-citation graph, but it should be a co-reference graph. 
Our heuristic is that if two blogs are subscribed to the same feed, they have some level of similar interests.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114411506298091118?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114411506298091118/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114411506298091118' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114411506298091118'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114411506298091118'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/04/we-have-make-mistake-on-naming-graph.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-114411230970387645</id><published>2006-04-03T17:05:00.000-07:00</published><updated>2006-04-03T17:58:29.716-07:00</updated><title type='text'></title><content type='html'>We have been reading publications related to our project. Among them we found the following very interesting:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Finding "the life between buildings": An approach for defining a weblog community by Lilia Efimova&lt;/li&gt;&lt;li&gt;Tomographic Clustering To Visualize Blog Communities as Mountain Views by Belle L. Tseng, Junichi Tatemura, and Yi Wu&lt;/li&gt;&lt;li&gt;NusEye: Designing for Social Navigation in Syndicated Content by Azzari C. 
Jarrett, and Brian M. Dennis&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;Even though they are not directly related, the following are a few other interesting papers:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Conversations in the Blogosphere: An Analysis "From the Bottom Up" by Susan C. Herring, Inna Kouper, John C. Paolillo, Lois Ann Scheidt, Michael Tyworth, Peter Welsch, Elijah Wright, and Ning Yu&lt;/li&gt;&lt;li&gt;Power Laws, Weblogs and Inequality by Clay Shirky&lt;/li&gt;&lt;li&gt;On Webfeed Aggregators and Social Navigation by Brian M. Dennis&lt;/li&gt;&lt;li&gt;Audience, structure and authority in the weblog community by Cameron Marlow&lt;/li&gt;&lt;li&gt;The EigenRumor Algorithm for Ranking Blogs by Ko Fujimura, Takafumi Inoue, and Masayuki Sugisaki&lt;/li&gt;&lt;li&gt;Structure and Evolution of Blogspace by Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins&lt;/li&gt;&lt;li&gt;How to search a social network by Lada Adamic and Eytan Adar&lt;/li&gt;&lt;li&gt;Discovering Important Bloggers based on Analyzing Blog Threads by Shinsuke Nakajima, Junichi Tatemura, and Yoichiro Hino&lt;/li&gt;&lt;li&gt;Implicit Structure and the Dynamics of Blogspace by Eytan Adar, Li Zhang, Lada A. Adamic, and Rajan M. Lukose&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;We decided to process the downloaded blogs from the database itself. We have developed code to extract the blog-rolls from the database entries and create a new table that shows the subscriptions for each blog. 
Right now we are working on the following:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Code to build the co-citation graph in the database tables&lt;/li&gt;&lt;li&gt;Code to gather statistics about blogs from the database tables and provide statistical analysis of the blogs&lt;/li&gt;&lt;li&gt;Code to do the clustering &lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;To evaluate the results of the clustering, we decided to compare the resulting groups against groups derived from the subscriptions (the sparse graph which we originally planned to analyze). As subscriptions between private blogs suggest stronger connections among blogs, we expect the subscription groups to be included in the clustering groups.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114411230970387645?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114411230970387645/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114411230970387645' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114411230970387645'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114411230970387645'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/04/we-have-been-reading-about-related.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-114263102091605299</id><published>2006-03-17T13:21:00.000-08:00</published><updated>2006-03-17T13:30:20.926-08:00</updated><title type='text'></title><content type='html'>We have finished developing the Blog crawler and started collecting Blogs. At the time of writing the crawler is running and has collected 17164 Blogs.&lt;br /&gt;&lt;br /&gt;The Blogs crawled so far seem loosely connected, and the Blog graph consists of a number of loosely connected components. So we might be able to analyze the graph component by component without handling 50 000 Blogs at once.&lt;br /&gt;&lt;br /&gt;Right now we are reading about work already done on identifying communities in personal Blogs.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114263102091605299?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114263102091605299/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114263102091605299' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114263102091605299'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114263102091605299'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/03/we-have-finish-developing-blog-crawler.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-114191830669890927</id><published>2006-03-09T07:27:00.000-08:00</published><updated>2006-03-09T07:31:46.710-08:00</updated><title type='text'></title><content type='html'>Quick update; more detail will follow soon.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;We decided to use the approximation of the GN (Girvan and Newman) algorithm by Filippo Radicchi et al.* for identifying the communities. &lt;/li&gt;&lt;li&gt;Our crawler is now working and able to crawl pages successfully. Chathura is writing the database layer to save the crawled data in the database. &lt;/li&gt;&lt;li&gt;We asked Fil about graph analysis tools, and he gave us a list. We prefer a Java tool but have not decided on a specific one yet. &lt;br /&gt;&lt;/li&gt;&lt;/ol&gt; *&lt;span style="font-size:85%;"&gt;&lt;span style="font-style: italic;"&gt; &lt;span style="font-weight: bold;"&gt;Defining and identifying communities in networks&lt;/span&gt;, Filippo Radicchi, Claudio Castellano, Federico Cecconi, Vittorio Loreto, and Domenico Parisi&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114191830669890927?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114191830669890927/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114191830669890927' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114191830669890927'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114191830669890927'/><link rel='alternate' type='text/html' 
href='http://cs-b659.blogspot.com/2006/03/quick-update-more-detai-will-follow.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-114132232214241711</id><published>2006-03-02T09:58:00.000-08:00</published><updated>2006-03-02T09:58:42.160-08:00</updated><title type='text'></title><content type='html'>We decided on the high-level architecture for the Blog crawler. The crawler's operation can be outlined as follows.&lt;br /&gt;&lt;br /&gt;   1. Start with the subscription pages of Bloglines&lt;br /&gt;   2. Crawl each Blog subscribed to on a subscription page&lt;br /&gt;   3. For each crawled Blog, store its Blog-Roll and add the subscribed Blog URLs to the Blog frontier&lt;br /&gt;   4. Repeat until the Blog frontier is empty&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;We took the following design decisions&lt;br /&gt;&lt;br /&gt;   1. We use a Distributed Queue developed by Chathura as the Blog frontier&lt;br /&gt;   2. We use a MySQL database to store the data about crawled Blogs and the Blog-Rolls for each Blog.&lt;br /&gt;   3. We use the database to store the visited Blogs, and keep a local hash-map in memory for fast access. The database protects the data from being lost in case of system failure.&lt;br /&gt;   4. We use Apache Commons HTTP Client (&lt;a class="moz-txt-link-freetext" href="http://jakarta.apache.org/commons/httpclient/"&gt;http://jakarta.apache.org/commons/httpclient/&lt;/a&gt;)&lt;br /&gt;   5. 
We use HTML Parser (&lt;a class="moz-txt-link-freetext" href="http://htmlparser.sourceforge.net/"&gt;http://htmlparser.sourceforge.net/&lt;/a&gt;) to extract the Blog-Rolls and subscription URLs from Blogs&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;We have completed the following steps in the last two weeks&lt;br /&gt;&lt;br /&gt;   1. Set up a MySQL database and designed the tables&lt;br /&gt;   2. Set up an svn repository at &lt;a class="moz-txt-link-freetext" href="http://www.cvsdude.org/"&gt;http://www.cvsdude.org/&lt;/a&gt; (our code is available there)&lt;br /&gt;   3. Wrote the code to fetch and parse the Blogs&lt;br /&gt;   4. Chathura is working on the database encapsulation layer for the data&lt;br /&gt;   5. We have an initial skeleton of the crawler, which fetches the Blogs, parses them and prints the data to the screen. We are working on saving the information to the database and making the Blog frontier persistent.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;We have yet to decide on the finer details of how the Blog analysis is to be done; we plan to finish the data gathering steps and then start on the data analysis steps.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114132232214241711?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114132232214241711/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114132232214241711' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114132232214241711'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114132232214241711'/><link rel='alternate' type='text/html' 
href='http://cs-b659.blogspot.com/2006/03/we-decided-on-high-level-architecture_02.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-114126827046213617</id><published>2006-03-01T17:58:00.000-08:00</published><updated>2006-03-01T18:57:50.480-08:00</updated><title type='text'></title><content type='html'>We decided on the high-level architecture for the Blog crawler. The crawler's operation can be outlined as follows.&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;   &lt;li&gt;Start with the subscription pages of Bloglines&lt;/li&gt;   &lt;li&gt;Crawl each Blog subscribed to on a subscription page&lt;/li&gt;   &lt;li&gt;For each crawled Blog, store its Blog-Roll and add the subscribed Blog URLs to the Blog frontier&lt;/li&gt;   &lt;li&gt;Repeat until the Blog frontier is empty&lt;/li&gt; &lt;/ol&gt;&lt;br /&gt;We took the following design decisions&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;   &lt;li&gt;We use a Distributed Queue developed by Chathura as the Blog frontier&lt;/li&gt;   &lt;li&gt;We use a MySQL database to store the data about crawled Blogs and the Blog-Rolls for each Blog.&lt;/li&gt;   &lt;li&gt;We use the database to store the visited Blogs, and keep a local hash-map in memory for fast access. 
The database protects the data from being lost in case of system failure.&lt;/li&gt;   &lt;li&gt;We use Apache Commons HTTP Client (http://jakarta.apache.org/commons/httpclient/)&lt;/li&gt;   &lt;li&gt;We use HTML Parser (http://htmlparser.sourceforge.net/) to extract the Blog-Rolls and subscription URLs from Blogs&lt;/li&gt; &lt;/ol&gt;&lt;br /&gt;We have completed the following steps in the last two weeks&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;   &lt;li&gt; Set up a MySQL database and designed the tables&lt;/li&gt;   &lt;li&gt; Set up an svn repository at http://www.cvsdude.org/ (our code is available there)&lt;/li&gt;   &lt;li&gt; Wrote the code to fetch and parse the Blogs&lt;/li&gt;   &lt;li&gt; Chathura is working on the database encapsulation layer for the data&lt;/li&gt;   &lt;li&gt; We have an initial skeleton of the crawler, which fetches the Blogs, parses them and prints the data to the screen. We are working on saving the information to the database and making the Blog frontier persistent.&lt;/li&gt; &lt;/ol&gt;&lt;br /&gt;We have yet to decide on the finer details of how the Blog analysis is to be done; we plan to finish the data gathering steps and then start on the data analysis steps.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114126827046213617?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114126827046213617/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114126827046213617' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114126827046213617'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114126827046213617'/><link rel='alternate' type='text/html' 
href='http://cs-b659.blogspot.com/2006/03/we-decided-on-high-level-architecture_01.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-114048839729255430</id><published>2006-02-20T18:18:00.000-08:00</published><updated>2006-02-20T18:19:57.303-08:00</updated><title type='text'></title><content type='html'>We discussed the Bloglines Terms of Service with Fil, and he said it should be OK and that we should go ahead. We also decided to use random waits with a period of 60 seconds when crawling the Bloglines site.&lt;br /&gt;&lt;br /&gt;We are planning to use a MySQL server to store the information from the crawl, and we have set one up. I am working on code to extract the links from the Bloglines pages.&lt;br /&gt;--Srinath&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-114048839729255430?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/114048839729255430/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=114048839729255430' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114048839729255430'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/114048839729255430'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/02/we-discuss-with-fil-terms-of-service.html' title=''/><author><name>Web Mining (B659) 
2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-113997715171324268</id><published>2006-02-14T20:09:00.000-08:00</published><updated>2006-02-20T18:21:12.576-08:00</updated><title type='text'></title><content type='html'>&lt;span style="font-size:130%;"&gt;&lt;span style="font-style: italic;font-size:100%;" &gt;Based on the feedback on the proposal we chose a new topic; the basic idea is to cluster the Bloglines blogs. We have submitted a new proposal, which follows.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Mining for Blog communities&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:100%;"&gt;Personal Blogs give rise to interesting social networks. When we consider a personal blog as a vertex of a graph, there are two types of edges connected to that vertex. Given a particular blog X:&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt; &lt;ol&gt;   &lt;li&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-size:100%;"&gt;Subscriptions to blog X&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-size:100%;"&gt;Subscriptions of the owner of blog X&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;/ol&gt; &lt;span style="font-size:130%;"&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;Apart from these, there are subscriptions to popular news feeds, which could potentially reveal information about the user. In this project we try to analyze the connectivity of public blogs at www.bloglines.com to identify the communities and to observe the interests of the groups based on the public news feeds they refer to. 
The information will be mined from the blogs of individuals, starting from seeds harvested by querying the subscribers to popular news feeds. Starting from these seeds, the system will build a blog connectivity graph: a directed graph in which the vertices represent the individual bloggers and the edges represent the subscriptions to blogs. In other words, if A has subscribed to B's blog then (A, B) is a directed edge.&lt;br /&gt;&lt;br /&gt;The analysis is geared towards&lt;br /&gt;&lt;/span&gt;&lt;/span&gt; &lt;ul&gt;   &lt;li&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-size:100%;"&gt;Community identification&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-size:100%;"&gt;Identifying meta-groups of similar interest within a community by filtering the individuals who have subscribed to popular feeds of the same or similar interest.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-size:100%;"&gt;Given initial seeds containing blogs of individuals who share the same interest or may be interested in the same area of technology, identify a community that, with high probability, consists of individuals who share that interest.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;/ul&gt; &lt;span style="font-size:130%;"&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tasks&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt; &lt;ol&gt;   &lt;li&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-size:100%;"&gt;Identifying seed blogs - This would be based on either harvesting the blogs by going to subscribers to popular news feeds or on user input, depending on the use case.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span 
style="font-size:130%;"&gt;&lt;span style="font-size:100%;"&gt;Crawling the blogs - Involves crawling the blogs based on initial seeds and building a frontier by analyzing already crawled blogs.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-size:100%;"&gt;Social network analysis using graph algorithms - Involves connectivity graph analysis and graph overlap analysis based on use case.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-size:100%;"&gt;Visualization and filtering of communities - Using a visualization package to visualize the output graph and to emphasize the relationships and clusters/communities identified.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;/ol&gt; &lt;span style="font-size:130%;"&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Note:&lt;/span&gt; We have looked into the robots.txt file at www.bloglines.com and we can extract the necessary information without violating it.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-113997715171324268?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/113997715171324268/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=113997715171324268' title='0 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/113997715171324268'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/113997715171324268'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/02/based-on-feedback-for-proposal-we.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-113915351904604987</id><published>2006-02-05T07:29:00.000-08:00</published><updated>2006-02-05T07:31:59.056-08:00</updated><title type='text'></title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://photos1.blogger.com/blogger/6539/2201/1600/graph1.0.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://photos1.blogger.com/blogger/6539/2201/320/graph1.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;I came across a graph analysis framework called JUNG&lt;br /&gt;(http://jung.sourceforge.net/). If any of you plan to do social network analysis&lt;br /&gt;it might be helpful. And it is very well documented!&lt;br /&gt;&lt;br /&gt;I wrote a little program to parse the message board and build a network based on&lt;br /&gt;the discussions that took place on the message board. 
Here is a visualization of it made with JUNG!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-113915351904604987?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/113915351904604987/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=113915351904604987' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/113915351904604987'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/113915351904604987'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/02/i-come-across-grpah-analysis-framework.html' title=''/><author><name>Web Mining (B659) 2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21734325.post-113876661329578384</id><published>2006-01-31T20:01:00.000-08:00</published><updated>2006-01-31T20:25:54.706-08:00</updated><title type='text'></title><content type='html'>&lt;h1 style="text-align: left;"&gt;Proposal&lt;/h1&gt;&lt;br /&gt;&lt;br /&gt;&lt;h1 style="text-align: center;" align="center"&gt;&lt;span style="font-size:14;"&gt;Comparison of research communities derived from co-author networks with clusters of related publications derived from citation graphs&lt;/span&gt;&lt;/h1&gt;   &lt;p class="MsoNormal"&gt;&lt;b style=""&gt;&lt;i style=""&gt;Group : Srinath Perera and Chathura Hearth&lt;br /&gt;Blog    : &lt;a 
href="http://cs-b659.blogspot.com/"&gt;http://cs-b659.blogspot.com/&lt;/a&gt;&lt;/i&gt;&lt;/b&gt;&lt;/p&gt; &lt;p class="MsoNormal"&gt;In this project we plan to apply existing community identification algorithms to co-author networks and citation graphs constructed from research publications of a particular domain. In building the co-author network, we consider two authors connected if they have co-authored a publication. Thus the author community graph will consist of author nodes, and the edges of the graph will carry a value that represents the publications the two authors have co-authored. Secondly, in the citation graph of the publications of the domain concerned, the nodes represent the publications and the edges represent the citations. &lt;/p&gt;    &lt;p class="MsoNormal"&gt;Both cluster extraction techniques will be based on social network analysis techniques developed in previous research. The identification of communities is expected to yield the following. &lt;/p&gt;   &lt;ul&gt;   &lt;li&gt;&lt;b style=""&gt;Community of authors - &lt;/b&gt;Communities in the first graph are likely to correspond to research communities; further, they will represent research groups that are collaborating in a particular area of research. &lt;/li&gt;   &lt;li&gt;&lt;b style=""&gt;Cluster of related publications - &lt;/b&gt;Communities in the second graph will identify clusters of research papers that belong to one broad research area or to a meta-research area within a broader research area. 
&lt;/li&gt; &lt;/ul&gt;   &lt;p class="MsoNormal"&gt;Once the two graphs are constructed, we can identify the authors in the clusters of related publications and compare the two types of author groups. The final outcome of the project will be to observe the correlation between the two different sets of information extracted from the academic publications of a particular domain, and to identify meta research groups within a given research area and the topics they are working on. &lt;/p&gt;   &lt;h3&gt;Task List&lt;/h3&gt;   &lt;ul&gt;   &lt;li&gt;Decide on and obtain the publication set on which the network analysis will be performed &lt;/li&gt;   &lt;li&gt;We have not yet pinpointed the algorithms to use for identifying the communities in each case, but for the co-author graph we consider the approach presented by Ebel, Davidsen and Bornholdt [1] a possible candidate, and for the co-citation graph we plan to pick one of the algorithms from references [3], [4] or [5]. &lt;/li&gt;   &lt;li&gt;We are thinking about using a focused topic, such as Web Services within Computer Science, and gathering all related papers. 
&lt;/li&gt;   &lt;li&gt;If time permits, we will provide a GUI to visualize the communities identified in the analysis&lt;/li&gt; &lt;/ul&gt;     &lt;h3&gt;References&lt;/h3&gt;   &lt;p class="Reference" style=""&gt;[1] H. Ebel, J. Davidsen, S. Bornholdt, &lt;i style=""&gt;“Dynamics of Social Networks”&lt;/i&gt;&lt;/p&gt;   &lt;p class="Reference"&gt;[2] Yuan An, Jeannette Janssen, Evangelos E. Milios, &lt;i style=""&gt;“Characterizing and Mining the Citation Graph of the Computer Science Literature”&lt;/i&gt;&lt;/p&gt;   &lt;p class="Reference"&gt;[3] &lt;span style="" lang="DE"&gt;Gary William Flake, Steve Lawrence, C. Lee Giles, Frans M. 
Coetzee, &lt;i style=""&gt;“Self-Organization and Identification of Web Communities”&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;   &lt;p class="Reference"&gt;[4] Filippo Radicchi, Claudio Castellano, et al., &lt;i style=""&gt;“Defining and identifying communities in networks Community structure in social and biological networks”&lt;/i&gt;&lt;/p&gt;   &lt;p class="Reference"&gt;[5] M. Girvan and M. E. J. Newman, &lt;i style=""&gt;“Co-authorship networks and patterns of scientific collaboration”&lt;/i&gt; &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21734325-113876661329578384?l=cs-b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cs-b659.blogspot.com/feeds/113876661329578384/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21734325&amp;postID=113876661329578384' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/113876661329578384'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21734325/posts/default/113876661329578384'/><link rel='alternate' type='text/html' href='http://cs-b659.blogspot.com/2006/01/praposal-comparison-of-research.html' title=''/><author><name>Web Mining (B659) 
2006</name><uri>http://www.blogger.com/profile/01885444802448702473</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
