Every Hosting Company can serve Public Datasets

Amazon is doing a really good job of turning hosted data sets into a big marketing win. A lot of sentiment gets repeated, like “What a friendly environment! They host those data sets for free!” For the most part, I have to agree. Like so many other good ideas, it’s one of those “Well, duh” concepts, and it makes sense not just for Amazon but for every medium- and large-sized hosting company.

Why should I host large datasets?

Hosting large sets of data is not only an incentive for people to use your service; there are also economics at play. If more than a couple of people use a dataset, you can save money by storing it internally. This is because moving data over the LAN is much, much cheaper than moving it over the WAN, and storage is cheaper still. Add up the cost of LAN traffic and local storage, compare that to the cost of moving the same data over the WAN for every user, and you come out ahead.
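To make that arithmetic concrete, here is a rough sketch in Python. Every number in it is made up for illustration: the prices (in cents per GB), the dataset size, and the function names are all assumptions, not real hosting rates. It also assumes the host pulls the dataset over the WAN once to seed its local copy, which is my assumption about how a mirror would get populated.

```python
# Hypothetical prices in whole cents per GB. These are illustrative
# assumptions, not real rates from any hosting company.
WAN_CENTS_PER_GB = 10      # moving data in over the WAN
LAN_CENTS_PER_GB = 1       # moving data over the internal LAN
STORAGE_CENTS_PER_GB = 5   # keeping one local copy for the period


def wan_cost(dataset_gb, users):
    """Cost when every user pulls the dataset over the WAN."""
    return dataset_gb * users * WAN_CENTS_PER_GB


def local_cost(dataset_gb, users):
    """Cost when the host keeps one local copy and serves it over the LAN.

    Assumes a single WAN transfer to seed the copy, one copy's worth of
    storage, and one LAN transfer per user.
    """
    seed = dataset_gb * WAN_CENTS_PER_GB
    storage = dataset_gb * STORAGE_CENTS_PER_GB
    lan = dataset_gb * users * LAN_CENTS_PER_GB
    return seed + storage + lan


def break_even_users(dataset_gb, max_users=1000):
    """Smallest user count at which hosting locally is cheaper, or None."""
    for users in range(1, max_users + 1):
        if local_cost(dataset_gb, users) < wan_cost(dataset_gb, users):
            return users
    return None


if __name__ == "__main__":
    size_gb = 1000  # a hypothetical 1 TB dataset
    for n in (1, 2, 5):
        print(n, "users:", wan_cost(size_gb, n), "vs", local_cost(size_gb, n))
    print("local hosting wins from", break_even_users(size_gb), "users")
```

With these invented prices the local copy costs more for a single user, because you pay for the seed transfer and the storage up front, and it becomes the cheaper option from the second user onward. Different prices shift the break-even point, but the shape of the comparison stays the same.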

There are public repositories of data sets, but they’re fragmented, poorly documented, and scattered across many different locations. Bringing this all under one roof is a fantastic idea.

Something I wish Datamob would do is accept, catalog, and store these large datasets so that other large hosting companies can mirror them. One issue here is that to crunch this huge amount of data you need ‘on-demand’ compute capacity, because parsing, crunching, and perhaps visualizing 10TB of data can take a few cycles. This is where Amazon really shines. There are also licensing issues at play, which is something I need to spend more time on. Why you’d want to require that your dataset be privately hosted, yet still publicly and anonymously queryable, is beyond me.

  • http://datamob.org Lauren

    Hey Trevor, thanks for the mention!

    I’m intrigued by your suggestion that Datamob should store and catalog large datasets, but I’m not sure I understand why hosting companies like Amazon and its competitors would need us as a middleman. Shouldn’t they be doing just that with their rows of servers and excess computing power?

  • http://trevoro.ca Trevor Orsztynowicz

    Hi Lauren!

    Good to meet you – I’m a fan of Datamob.

    The reason I suggested having Datamob be a middleman is that while companies like Amazon do have a ton of excess computing and storage resources, I still see the need for a centralized catalog and repository for large data sets and APIs. Whether or not you host the data or API yourself is another issue. Ideally Datamob would be the one place everyone knows they could look for these kinds of data sets, either for a download or to submit something interesting. Then datacenters with large amounts of compute and storage resources (i.e. utility compute centers) could mirror that data, and then people could mirror off of them. Same way we distribute Linux distributions, ISOs, etc.

    It’s largely a cloud economics thing, but I think having a great CC-licensed group like Datamob for the marketing aspect would work really well.