Don’t bind too early

On Tuesday evening, Gigaom and Collective[i] hosted a dinner in Boston for area executives, with Ray Ozzie as the special guest and speaker. This post isn’t a recap of that dinner, but rather a discussion of a concept Ozzie raised that is very important in the world of big data and also NoSQL: late binding.

Better late than ever?
As it happens, Ozzie did not present the concept of late binding within the context of big data. In fact, the discussion in which the concept arose centered around competing social communication tools and at which point organizations should standardize on one product in that category. Ozzie’s recommendation: don’t bind too early.

In other words, don’t rush to commit. Play the field. Use lots of tools and see which works best, for the greatest number of business units, and then make a decision. And if different groups are each getting value out of different tools, don’t be afraid to use multiple tools.

Romancing the code
“Late binding” is a programming term. It refers to a technique whereby a variable’s data type is determined at execution time (late) rather than at coding time (early). By declaring a generic variable and assigning it a text, numeric, Boolean, or other value, the data type of that variable is implicitly defined on the fly. You keep your options open until concrete values are assigned.
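A minimal sketch of the idea in Python, where a name carries no fixed type and is bound only when a value is assigned at runtime:

```python
# Late binding of a variable's type: the type is decided at
# assignment time, not when the code is written.
value = 42            # bound to an integer here...
print(type(value))    # <class 'int'>

value = "forty-two"   # ...rebound to a string later
print(type(value))    # <class 'str'>

value = True          # and now a Boolean; the name itself has no fixed type
print(type(value))    # <class 'bool'>
```

The same name happily holds an integer, a string, and a Boolean in turn; nothing about its type was committed in advance.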

So if late binding can be applied to both tool standardization and variable typing, why mention it in a weekly big data update? Because late binding is what so-called “unstructured” data is all about. In the world of Hadoop and NoSQL, schemas are determined on a just-in-time basis.

It’s not that the data is unstructured. Without structure, you couldn’t perform analytics on the data. But the structure is determined at query time rather than at the time the database is designed. It’s interpretive, and it’s not stored.

As it turns out, determining structure at the last minute allows more analyses to be performed, and eliminates the inefficiencies of negotiating a schema of consensus. The power to structure is delegated. It’s more inclusive that way, and it eliminates a lot of bureaucratic process that would kick in if that power were centrally consolidated.
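This “structure at query time” pattern can be sketched in a few lines of Python. The records and field names here are hypothetical; the point is that the data is stored as opaque text, and the schema lives only inside the query:

```python
import json

# Hypothetical raw records, stored as opaque JSON strings.
# No schema was negotiated up front: fields can appear, disappear,
# or vary from record to record.
raw_records = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "region": "EU"}',  # extra field, no schema change needed
    '{"user": "cat"}',                               # missing field tolerated
]

def query_total_clicks(records):
    # The "schema" (each record has a numeric clicks field, default 0)
    # is imposed here, at query time, and is never stored.
    return sum(json.loads(r).get("clicks", 0) for r in records)

print(query_total_clicks(raw_records))  # 10
```

A different analyst could run a different query over the same raw records with a different implied schema, with no coordination between them.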

Many programming languages support both early and late binding of variables because they provide for both static and dynamic typing. So what does that say about databases? It says that the status quo, where NoSQL databases supply late binding and relational databases are essentially statically typed, is silly. Having a few pure-play products is fine, but just as many programming languages support both early and late binding, so too should many databases.

Which is to say that if the major relational databases were to add late-bound structure capabilities, we could eliminate the situation where relational and NoSQL engines can’t run on a common platform. We could eliminate data silos, respect existing skill sets, and get the industry to focus on quality rather than just competing in what is still an immature product market.

In short, we could mitigate some of the fragmentation that now exists in the industry. While it’s lovely to have so many choices, eventually we have to commit to a database standard. The decision can be late bound, but it can’t be unbound.

As unstructured data heats up, will you need a license to webcrawl?

Cheap computing and the ability to store a lot of data fairly cheaply have made the concept of big data a big business. But amid the frenzy to gather such information, especially unstructured information, are companies pushing the boundaries of polite (or ethical) behavior?

How the AP got a hold of its big, old data

Holding onto millions of pieces of archived content it still wanted to monetize, the Associated Press turned to MarkLogic’s NoSQL database, which is designed for XML documents. As publishers try to leverage years’ worth of archived, often untagged content, they’ll need new tools.

Like your data big? How about 5 trillion records?

1010data says it now hosts more than 5 trillion records for its customers. If 1010data’s growth is a microcosm of the greater market, it’s no wonder there’s so much excitement around scalable data stores such as Hadoop, NoSQL databases and massively parallel analytic databases.

Why the big data startup boom will likely be short-lived

There has been a remarkable flowering of companies over the past year or two, all riding a wave of developer and investor enthusiasm for the loosely defined concept of “big data.” But given that the big data startup market is probably overvalued and headed for a lot of consolidation, these new companies’ days might be numbered.