[Freedombox-discuss] since we mentioned YaCy: Common Crawl Foundation



http://www.readwriteweb.com/archives/common_crawl_foundation_announces_5_billion_page_w.php

New 5 Billion Page Web Index with Page Rank Now Available for Free from
Common Crawl Foundation

By Marshall Kirkpatrick / November 7, 2011 3:42 PM / 0 Comments

A freely accessible index of 5 billion web pages, their page rank, their link
graphs and other metadata, hosted on Amazon EC2, was announced today by the
Common Crawl Foundation. "It is crucial [in] our information-based society
that Web crawl data be open and accessible to anyone who desires to utilize
it," writes Foundation director Lisa Green on the organization's blog.

The Foundation is an organization dedicated to leveraging the falling costs
of crawling and storage for the benefit of "individuals, academic groups,
small start-ups, big companies, governments and nonprofits." It's led by
Gilad Elbaz, the forefather of Google AdSense and the CEO of data platform
startup Factual. Joining Elbaz on the Foundation board are internet public
domain champion Carl Malamud and semantic web serial entrepreneur Nova
Spivack. Director Lisa Green came to the Foundation by way of Creative
Commons.

The Foundation explains the scope of the project as follows:

    "Common Crawl is a Web Scale crawl, and as such, each version of our
crawl contains billions of documents from the various sites that we are
successfully able to crawl. This dataset can be tens of terabytes in size,
making transfer of the crawl to interested third parties costly and
impractical. In addition to this, performing data processing operations on a
dataset this large requires parallel processing techniques, and a potentially
large computer cluster.

    "Luckily for us, Amazon's EC2/S3 cloud computing infrastructure provides
us with both a theoretically unlimited storage capacity coupled with
localized access to an elastic compute cloud."

The organization was formed three years ago, has only just started talking
about itself publicly, and believes that free access to all this information
could lead to "a new wave of innovation, education and research."

Open Web Advocate James Walker agrees: "An openly accessible archive of the
web - that's not owned and controlled by Google - levels the playing field
pretty significantly for research and innovation."
