Big Data and Turkeys

Since a lot of people grumble about the “Big Data” meme — “what is this _big_ data anyhow” — I thought an analogy might help.

Big Data:Data::turkeys:chickens

A turkey is “really” just a big chicken.  Same limbs.  Same white and dark meat, same spices and herbs, similar taste.

But the scale of the turkey introduces new problems and requires new solutions:

  • Will it fit in your oven?
  • Will anything else fit in your oven if the turkey is there?
  • Where will you cook the other things if they won’t fit?
  • Do you have a roasting pan and rack big enough for a turkey?
  • Can you muscle the turkey up and down the stairs to brine it in the cooler (the only place it will fit)?

Ok, I won’t belabor the point: Big Data is different from data because the scale means your old techniques won’t always work.

Have a great holiday.


Big Data, Big Dreams

We’ve got to be at or near the Peak of Inflated Expectations in the Hype Cycle for Big Data.  It’s the point where the meme seems so powerful that everyone wants to associate themselves with it.

But, as happened with data mining, unstructured data mining, and other fevered dreams of extracting ponies from the manure heap of raw data, what if the insights we all believe are lurking in our data… aren’t lurking, or can’t be lured out of hiding?

I ran across a couple of posts this week that bear on the issue.

A post from Jeff Jonas, who can always be relied on to smash false idols, deals with this question.  As Jonas says:

The problem being; often the business objectives (e.g., finding a bomb) are simply not possible given the proposed observation space (data sources).

Dan Woods re-posts another variation on this theme:

…the data created and maintained outside your company is becoming much more important than the data that you can acquire from internal sources. Yet, few companies realize this and fewer are taking action. Instead, they are suffering from the Data Not Invented Here Syndrome.

In other words, there’s a difference between Big Data techniques and magic.  Sigh.

Your thoughts?

Where is the Big Data market at today?

Valhalla has been looking over the Big Data market, trying to answer the question: “how far along is the market?”  Are there really only four or so Big Data users (the likes of Google, Yahoo, Facebook, and Twitter), or are there more?  Is it an Early Adopter market (or even merely a Tech Enthusiast market), or has it crossed the chasm?  What are the use cases?

Here are some of our findings:

1.     The Big Data market is an Innovator/Early Adopter market overall, with possible Early Majority beachheads in web analytics and adtech

Although our interviewees described a larger number of use cases – “voice of the customer” analytics in marketing, M2M sensor processing, fraud and risk analysis, predictive analytics of various types – there was no hard evidence for widespread uses of Big Data today in these use cases, and many of the interviewees described them as “nascent” or “near-future” use cases.

There was, however, agreement that web analytics and adtech platforms were much further along in terms of using Big Data techniques for projects which were important to the customers’ businesses and mainstream today.

  • AdTech users employ Big Data technologies for real-time bidding (RTB) and for managing and matching 3rd-party data to ad inventory or online user data. This area seems to be called “data management platforms”, and DemDex (which was acquired by Adobe for $xxxM) is perhaps the poster child.

  • Web analytics users employ Big Data technologies for indexing web pages and extracting performance indicators from raw weblogs.
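
To make the weblog case concrete, here is a minimal sketch of the map-and-reduce pattern that Hadoop-style platforms apply to raw logs. It is an illustration only, not any interviewee's actual pipeline: the sample lines, the function names, and the choice of indicators (hits per page, responses per status code) are all assumptions, and a real job would run the same two steps across many machines rather than in one Python process.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample lines in Common Log Format; real weblogs would be
# spread across many files and machines.
SAMPLE_LOG = [
    '10.0.0.1 - - [01/Jan/2000:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '10.0.0.2 - - [01/Jan/2000:10:00:02 +0000] "GET /missing HTTP/1.1" 404 312',
    '10.0.0.1 - - [01/Jan/2000:10:00:03 +0000] "GET /index.html HTTP/1.1" 200 5120',
]


def map_line(line):
    """Map step: turn one raw log line into (key, 1) pairs."""
    parts = line.split('"')
    if len(parts) < 3:
        return []  # skip malformed lines
    request = parts[1].split()        # e.g. ['GET', '/index.html', 'HTTP/1.1']
    status_fields = parts[2].split()  # e.g. ['200', '5120']
    if len(request) < 2 or not status_fields:
        return []
    path, status = request[1], status_fields[0]
    return [(("hits", path), 1), (("status", status), 1)]


def reduce_pairs(pairs):
    """Reduce step: sum the counts for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def weblog_indicators(lines):
    """Run the map step over every line, then reduce the combined output."""
    return reduce_pairs(chain.from_iterable(map_line(line) for line in lines))


indicators = weblog_indicators(SAMPLE_LOG)
```

The point of splitting the work this way is that the map step looks at one line at a time and the reduce step only adds numbers, so both can be spread over as many machines as the log volume requires.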

2.     Informants believe that Hadoop and its stack are likely to remain the central platform for the Big Data market, but there is contradictory evidence

I don’t personally agree with this finding, but our interviewees all said, implicitly and explicitly, that the Hadoop stack was going to be the basis for Big Data technologies going forward.

One very thoughtful analyst said explicitly that the MapReduce/Hadoop stack would evolve over time, and that new technologies – like Dremel or Storm or Spanner and so forth – would be incorporated into the Hadoop ecosystem rather than creating new ecosystems of their own.

The only problem with this point of view is that “legacy” Big Data techniques – data warehousing, RDBMS, classic Business Intelligence suites – have a vast market share and a long history of productive use cases.   How these platforms will interoperate in the future is unknown, and it is still too early to call whether an approach like Hadapt’s (where a “classic” RDBMS or BI technology suite runs within the Hadoop stack) will prevail.

3.     Wikibon’s analysis sizes the Big Data market today at $5B

A quantitative Wikibon analysis, which is quite thoughtful, concludes that $480M of this revenue comes from what they call “pure play” vendors (i.e., Hadoop infrastructure vendors and some other NoSQL or NewSQL) and the balance from legacy players.
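
For a sense of proportion, the split implied by those two figures can be worked out directly. This is only arithmetic on the numbers quoted above ($5B total, $480M pure play); the variable names are mine, and nothing here comes from Wikibon beyond those two figures.

```python
# Figures as quoted in the post, in millions of US dollars.
TOTAL_MARKET_M = 5_000  # "$5B"
PURE_PLAY_M = 480       # "$480M"

# The balance is attributed to legacy players.
legacy_m = TOTAL_MARKET_M - PURE_PLAY_M
pure_play_share = PURE_PLAY_M / TOTAL_MARKET_M

print(f"Legacy players: ${legacy_m / 1000:.2f}B")    # Legacy players: $4.52B
print(f"Pure-play share: {pure_play_share:.1%}")     # Pure-play share: 9.6%
```

In other words, on these figures the pure-play vendors account for a bit under a tenth of the revenue.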

Very curious about your thoughts on this.


Maybe SQL is the SQL of NoSQL

Derrick Harris has written a great deal over the last couple of days about SQL front ends for MapReduce platforms.  This is a particularly meaty post.

What does it all mean?  That SQL support is a must-have for a self-respecting MR implementation, and everyone is rushing to provide it.

I’ve posted here, here, and here about the function that SQL plays in the legacy data fabric (a fence separating data management from data analysis, for example) and wondered out loud what will take its place in a NoSQL or PostSQL world.

This motion suggests that SQL may have some life in it yet.  Despite its RDBMS-ism, it is a rich data-analysis language, and it is the canvas upon which millions of data-analysis paintings have been painted.  It’s asking a lot to just throw that away and go back to writing software in what are really still 3GLs to get at data.
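
A small, admittedly toy, comparison shows what is at stake. The sketch below answers the same question (average time on page, per page) twice: once with hand-written grouping code of the kind a 3GL forces on you, and once with a single SQL statement run through Python's built-in sqlite3 module. The table, the column names, and the sample rows are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical page-view records: (page, visitor, seconds_on_page).
ROWS = [
    ("/home", "alice", 30),
    ("/home", "bob", 50),
    ("/pricing", "alice", 120),
    ("/pricing", "carol", 60),
    ("/pricing", "bob", 90),
]


def average_by_page_3gl(rows):
    """Hand-written grouping and averaging, the '3GL' way."""
    totals = defaultdict(lambda: [0, 0])  # page -> [sum of seconds, row count]
    for page, _visitor, seconds in rows:
        totals[page][0] += seconds
        totals[page][1] += 1
    return sorted((page, s / n) for page, (s, n) in totals.items())


def average_by_page_sql(rows):
    """The same question asked declaratively in SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE views (page TEXT, visitor TEXT, seconds INTEGER)")
        conn.executemany("INSERT INTO views VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT page, AVG(seconds) FROM views GROUP BY page ORDER BY page"
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

Both functions return the same per-page averages for the sample rows. The difference is that the SQL version states what is wanted and leaves the how to the engine, which is exactly the property the MapReduce vendors are now rushing to offer.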

In any case, it’s an admission that the data fabric will be more PostSQL (including and building upon SQL) than NoSQL in the future.  It also suggests that we need an expressive model of PostSQL data before we’ll have an expressive interface language for it.

Your thoughts?

The (Increasingly Worthless) Network Effect

The other day I did what I do with increasing frequency: I wanted to meet an exec (call him “Exec A”) at a startup company (call it “Company X”) where Valhalla might invest, so I looked in LinkedIn to see who was connected with them.

An old friend from Palo Alto days was indeed one degree of separation from Exec A, but when I contacted my friend — and, by the way, it was great to catch up with him on all kinds of things — he said, “I hardly know A and I know nothing about X”.  He had LinkedIn with A because they had worked together once, but it was not a meaningful connection.

There are pressures to make meaningless connections: pressures on LinkedIn, on Facebook, on Twitter.  And a kind of Gresham’s Law takes over: the bad links drive out the good.

I’ve watched it happen with Usenet, with email, with the Web, with portals, with Quora, with the social sites cited above.

So maybe there isn’t an absolute “network effect”.  Maybe above a certain size the debasement of links takes over and the value of the network declines.

I’m certain that smarter folks than I have worked this problem.  I’d welcome any links to discussions.

But, please, only the good links.

Is JSONiq the SQL of NoSQL?

In response to my post on “What will be the SQL of NoSQL”, William Candillon of 28msec wrote:

Our take is JSONiq, an extension of XQuery for JSON:

First I’d heard about JSONiq, and, truth be told, I didn’t know a heck of a lot about JSON except the name.  Or about XQuery for that matter.

(One drawback of crossing over to the VC side is I fall inexorably behind on tech, despite my wishes and hopes.)

So I followed the link, looked at some of the code examples, and looked at 28msec.

I’m just digging into this, so would welcome any further pointers from the community.  SQL was not just a query language, but a frontier between the data layer and the rest of the app.  How does JSONiq get to that status?

Your thoughts?

Beginning of the End for the CIO?

As cloud computing really starts to take hold, some features of the landscape are becoming clearer. At least they seem clearer to me.

First, the disruption happens disruptively:

  • Through consumers and smaller business customers, rather than through the biggest enterprises.
  • Through applications converting to SaaS delivery and non-critical, experimental, or bursty infrastructure rather than mission-critical infrastructure.
  • Through “consumerization of IT”, where demand for iPads and cool apps drives the need for new delivery models.

Second, the disruption will move upward to larger organizations, first through private clouds, then hybrid clouds, and then full-dress clouds.  (Of course, larger organizations use SaaS applications and support iPads and cool apps today, so they need some kind of cloud solutions today.)

Third, however, is interesting: if IT moves out of the enterprise and into the cloud, why do you need an IT organization over time… and why do you need a CIO?

The CIO’s job is to make sure that the house IT plant supports the mission of the organization.  Already other functions are trying to make IT decisions without the CIO: implementing Salesforce, constructing mobile apps, buying kit for websites.  It stands to reason that the CIO’s job loses power over time.

CIOs have barely been able to pry themselves free from reporting to CFOs.  Will they report to CMOs next?