Real-life use-case examples show where big data is headed

In two weeks, Gigaom brings its big data conference, Structure Data, back to New York City in a big way. Growth in the big data industry has moved beyond infrastructure to include intelligent data applications and products built on robust platforms with algorithms producing remarkable results. For many industries, this use of big data is not only useful; it has become vital to transforming everyday life.

Structure Data brings in not only major players like Cloudera, AWS and Google but also speakers from companies that we use every day—adding a lively use-case component to the program. We’re excited to hear from the University of Chicago Crime Lab on how they’re using big data to predict social behavior, how Hampton Creek is categorizing naturally occurring plant substances to help combat future climate change impacts on agriculture, and how Spotify is building predictive analytics based on use of its services.

And that’s not all. Here’s a look at what else to expect at this year’s Structure Data:

  • We’ll hear how big data analytics is changing everything from the NFL to swimming, with Krish Dasgupta, VP Data Platforms and Technology at ESPN, and Bill Squadron, Executive VP, Pro Analytics at STATS.
  • We’ll cover privacy and security with Julie Brill, Commissioner at the Federal Trade Commission.
  • Google Glass may be over, but hardware is alive and well. We’ll sit down with Naveen Rao, CEO at Nervana Systems, and Anthony Lewis, Senior Director of Technology at Qualcomm, to talk about how hardware is getting smarter.
  • In a three-part series, we’ll explore data through diet, disease and healthcare with Dan Zigmond, VP of Data at Hampton Creek; Ahna Girshick, Senior Data Scientist at Enlitic; and Steven Horng, Instructor of Medicine at Harvard Medical School.
  • Not to mention BuzzFeed, Twitter, Goldman Sachs, Hortonworks, Yahoo!, GE, IBM, NASA, and more (yes, there’s a lot more: check out the full schedule and speaker list).

Preferred Pricing

Remember: Gigaom Research subscribers get 25% off all Gigaom events.  Register using this link today.

Structure Data is March 18th & 19th. Learn more at StructureData.com

The Hadoop wars, HP cloud(s) and IBM’s big win

[soundcloud url="https://api.soundcloud.com/tracks/194323297" params="color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false" width="100%" height="166" iframe="true" /]

If you are confused about Hewlett-Packard’s cloud plan of action, this week’s guest will lay it out for you. Bill Hilf, the SVP of Helion product management, makes his second appearance on the show (does that make him our Alec Baldwin?) to talk about HP’s many-clouds-one-management layer strategy.

The lineup? Helion Eucalyptus for private clouds that need Amazon Web Services API compatibility; Helion OpenStack for the rest; and [company]HP[/company] Cloud Development Platform (aka Cloud Foundry) for platform as a service. Oh, and there’s HP Public Cloud, which I will let him tell you about himself.

But first Derrick Harris and I are all over IBM’s purchase of AlchemyAPI, the cool deep learning startup that does stuff like identifying celebs and wanna-be celebs from their photos. It’s a win for [company]IBM[/company] because all that coolness will be sucked into Watson and expand the API set Watson can parlay for more useful work. (I mean, winning Jeopardy is not really a business model, as IBM Watson exec Mike Rhodin himself has pointed out.)

At first glance, a system that can tell the difference between Will Ferrell and Chad Smith might seem similarly narrow, but on reflection you can see how that fine-grained, self-teaching technology could find broader uses.

AlchemyAPI CEO Elliot Turner and IBM Watson sales chief Stephen Gold shared the stage at Structure Data last year. Who knows what deals might be spawned at this year’s event?

Also we’re happy to follow the escalating smack talk in the Hadoop arena as Cloudera CEO Tom Reilly this week declared victory over the new [company]Hortonworks[/company]-IBM-Pivotal-backed Open Data Platform effort, which we’re now fondly referring to as the ABC or “Anyone But Cloudera” alliance.

It’s a lively show so have a listen and (hopefully) enjoy.

SHOW NOTES

Hosts: Barb Darrow, Derrick Harris and Jonathan Vanian

Download This Episode

Subscribe in iTunes

The Structure Show RSS Feed

PREVIOUS EPISODES:

Mark Cuban on net neutrality: the FCC can’t protect competition. 

Microsoft’s machine learning guru on why data matters sooooo much 

No, you don’t need a ton of data to do deep learning 

VMware wants all those cloud workloads “marooned” in AWS

Don’t like your cloud vendor? Wait a second.

DJ Patil joins White House as Chief Data Scientist to work on precision medicine

As we reported two weeks ago, after White House staff leaked the news on a conference call, DJ Patil, the former Chief Data Scientist at LinkedIn and most recently a VP of Product at RelateIQ, is now the Chief Data Scientist at the White House. In that role, he will work on the Administration’s Precision Medicine Initiative, which aims to use advances in data and health care to personalize medical treatments. He will also work on issues relating to open data in government. For a deeper dive on precision medicine and open data, check out our Structure Data conference in New York next month.

How a $125 billion market is evolving

In 2014 we saw an increasing number of enterprises engage with and implement big data strategies. Simultaneously, big data pioneers made some big plays that further inspired big data adoption and use. The underlying theme is that machine learning and cognitive computing are the next big developments in big data.

Big data could be a $125 billion market in 2015, according to our friends at IDC, and it shows no signs of slowing down. Intelligent data applications, natural language processing and predictive semantic search are all fast becoming a reality, built on the aggregates of data we produce in enterprise applications, the Internet of Things and mobile.

Gigaom’s Structure Data conference next month promises to illuminate these developments with the same high standard of speakers and discussion we bring to our news and research, diving into the technologies being built today as well as the applications companies are working toward tomorrow. We’ll talk about issues that shape policy, including crime, health care and security. We’ll sit down with the big players who put big data and analytics at the forefront of every enterprise CTO’s mind. And Big Data Research Director Andrew Brust and his team of analysts will be on site with insights and analysis of how these markets are continuing to develop.

I hope you can join us next month at Structure Data, March 18-19 in New York City. It’ll be a great show!

Remember: Gigaom Research subscribers get 25% off all Gigaom events. You can register using this link today.

 

Oracle paid north of $1.2B for Datalogix, says WSJ

Right before Christmas, when [company]Oracle[/company] announced plans to buy Datalogix, it didn’t detail the price. Now The Wall Street Journal (paywall) is filling in the blanks, reporting that the price tag is $1.2 billion, according to two anonymous sources “familiar with the deal.”

Oracle did not comment for this story.

Datalogix collects consumer sentiment by the truckload via agreements with [company]Facebook[/company], [company]Twitter[/company] and other sources. In its press release, Oracle said Datalogix “aggregates and provides insights on over $2 trillion in consumer spending from 1,500 data partners across 110 million households to provide purchase-based targeting and drive more sales.”

It also said that Datalogix has more than 650 customers, including “82 of the top 100 US advertisers such as [company]Ford[/company] and [company]Kraft[/company],” and that 7 of the top 8 digital media publishers, including the aforementioned Facebook and Twitter, use Datalogix to “enhance their media.”

Hmmm. Given the tremendous amount of money, time and effort companies spend to better target marketing and ad pitches, that consumer info is probably worth a pretty penny. But the fact that Oracle, the database market leader and a power in enterprise software, is now buying up consumer data raised more than a few eyebrows, especially coming as it did after a number of other deals that seem to be consolidating vast troves of consumer data in a few powerful hands. Oracle made this move after having already acquired BlueKai, a Cambridge, Mass. startup that parses information about what consumers are looking to buy both online and in the real world.

In particular, watchdog group The Center for Digital Democracy has asked the FTC to review this deal with an eye to whether it gives Oracle too much access to too much customer data. The CDD also cited Acxiom’s purchase of LiveRamp in May, [company]Adobe Systems[/company]’ purchase of Neolane in June 2013, and Oracle’s acquisitions of Eloqua in December 2012 and Responsys a year later as signs that data aggregation is becoming a problem.

The Datalogix-Facebook tie-in is of special note since, under an existing settlement with the FTC, Facebook agreed to obtain consumers’ permission before sharing their data. The specter of that agreement surfaced when Facebook bought WhatsApp.

The consolidation of data aggregators is a topic we can bring up next month with FTC Commissioner Julie Brill, who will speak at Structure Data.

For what it’s worth on the pricing issue, public companies do not have to disclose the purchase price of a private company unless that price is “material” to their business. The definition of materiality is a bit loosey-goosey, however. If you want to delve into the niceties, check out concept statement 2 from the Financial Accounting Standards Board.

 

What happens when too few databases become too many databases?

So here’s some irony for you: For years, Andy Palmer and his longtime startup partner Michael Stonebraker have pointed out that database software is not a one-size-fits-all proposition. Companies, they said, would be better off with a specialized database for certain tasks rather than using a general-purpose database for every job under the sun.

And what happened? Lots of specialized databases popped up, such as Vertica (which Stonebraker and Palmer built for data warehouse query applications and which is now part of HP). There are read-oriented databases and write-oriented databases and relational databases and non-relational databases … blah, blah, blah.

The unintended consequence of that was the proliferation of new data silos, in addition to those already created by older databases and enterprise applications. And the existence of those new silos poses a next-generation data integration problem for people who want to create massive pools of data they can cull for those big data insights we keep hearing about.

Meet new-style ETL

In an interview, Palmer acknowledged that we’ve gone from one extreme — not enough database engines — to too many options, with customers getting increasingly confused. And in that complexity, there is opportunity. Palmer’s latest startup with Stonebraker — called Tamr — as well as other young companies like ClearStory, Paxata and Trifacta are attacking the task of cleaning up data in a process traditionally called Extract, Transform, Load, or ETL.

Tamr combines machine learning smarts with human subject matter experts to create what Palmer calls a sort of self-teaching system. The startup is one of seven winners of our Structure Data Awards and will be on hand at the Structure Data event next month to discuss the new era of ETL, along with other trends in data.
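
To make that problem concrete, here is a minimal, hypothetical sketch of one entity-resolution step in new-style ETL: match customer records from two silos on a crude string similarity and route uncertain matches to a human reviewer. It illustrates the general idea only, not Tamr’s actual algorithm; the records and thresholds are invented.

```python
# A minimal, hypothetical sketch of one entity-resolution step in
# "new-style ETL": match records from two silos and route uncertain
# matches to a human reviewer. Illustration only, not Tamr's algorithm.
from difflib import SequenceMatcher

crm_records = [
    {"id": "c1", "name": "Acme Corp.", "city": "New York"},
    {"id": "c2", "name": "Globex Corporation", "city": "Springfield"},
]
erp_records = [
    {"id": "e9", "name": "ACME Corporation", "city": "New York"},
    {"id": "e7", "name": "Initech LLC", "city": "Austin"},
]

def similarity(a, b):
    """Crude string similarity on normalized name + city."""
    key_a = f"{a['name']} {a['city']}".lower()
    key_b = f"{b['name']} {b['city']}".lower()
    return SequenceMatcher(None, key_a, key_b).ratio()

AUTO_MATCH, NEEDS_REVIEW = 0.85, 0.60  # thresholds are arbitrary

for crm in crm_records:
    best = max(erp_records, key=lambda erp: similarity(crm, erp))
    score = similarity(crm, best)
    if score >= AUTO_MATCH:
        print(f"{crm['id']} <-> {best['id']}: auto-merged ({score:.2f})")
    elif score >= NEEDS_REVIEW:
        print(f"{crm['id']} <-> {best['id']}: send to a human expert ({score:.2f})")
    else:
        print(f"{crm['id']}: no likely match ({score:.2f})")
```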

The data sharing economy

As more companies share select information with supply chain and other trusted partners, ensuring that key data is clean will become more important. According to a new Accenture survey of 2,000 IT professionals, 35 percent of those surveyed said they’re already using partner APIs to integrate data and work with those partners, while another 38 percent said they plan to do that.

Per the survey:

One example is Home Depot, which is working with manufacturers to ensure that all of the connected home products it sells are compatible with the Wink connected home system – thereby creating its own connected home ecosystem and developing potential new services and unique experiences for Wink customers.

And 74 percent of those respondents said they are using or experimenting with new technologies that integrate data with digital business partners. Also from the Accenture report:

 “Rapid advances in cloud and mobility are not only eliminating the cost and technology barriers associated with such platforms, but opening up this new playing field to enterprises across industries and geographies.”

As the velocity and types of data flowing to and from applications increase, “old style careful ETL curation doesn’t work anymore but [the data] still needs to be cleansed and prepped,” said Gigaom Research Director Andrew Brust.

In other words, big data is big, no doubt. But in some cases, the old adage “Garbage In, Garbage Out” holds true even in the era of big data. If you really want the best insights out of the information you have, getting that data cleaned and spiffed up can be a very big deal.

 

As social media gets quantified, more people use Twitter to trade

Professional investors known as quants use hard facts about companies —  share price, EBITDA, and so on — to inform the algorithms that carry out their automated trading strategies. But softer sources of information such as reports and rumors have long proved much harder to quantify.

Now, however, a major change is underway thanks to custom financial applications that treat social media discussions as data and turn them into hard stats.

“The clear trend we’re seeing is the quantification of qualitative aspects of the world,” Claudio Storelli, who oversees Bloomberg’s app portal, told me last week in New York, where he led a presentation on technical analysis applications.

He pointed in particular to Twitter, which throws off millions of data points (“inputs for black box consumption,” in Storelli’s words) that can provide big clues about stock movements. Here is a screenshot showing Twitter sentiment about Apple, as parsed by an app called iSense:

Bloomberg iSense app

The result is that computer-based trading tools are using social media signals not only to react to market events, but to predict them as well.
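
For a sense of what “treating social media discussions as data” can look like, here is a minimal, hypothetical sketch that scores tweets against a tiny word list and aggregates them into a single bullish-or-bearish number. Commercial tools like iSense use far richer natural-language models; the tweets, lexicon and scoring below are invented for illustration.

```python
# Hypothetical sketch: turn raw tweets about a ticker into a single
# sentiment number a trading model could consume. The lexicon and
# tweets are invented; commercial tools use much richer NLP.
POSITIVE = {"beat", "upgrade", "bullish", "record", "surge"}
NEGATIVE = {"miss", "downgrade", "bearish", "recall", "plunge"}

tweets = [
    "AAPL earnings beat expectations, record iPhone sales",
    "Analyst downgrade weighs on AAPL, shares plunge at open",
    "Feeling bullish on AAPL ahead of the event",
]

def score(tweet):
    """+1 for each positive word in the tweet, -1 for each negative word."""
    words = {w.strip(",.").lower() for w in tweet.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)

scores = [score(t) for t in tweets]
aggregate = sum(scores) / len(scores)  # > 0 leans bullish, < 0 leans bearish
print(f"per-tweet scores: {scores}, aggregate sentiment: {aggregate:+.2f}")
```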

While Bloomberg has hosted such sentiment analysis tools for some time, Storelli said their use is more prevalent than ever. And this is converging with another trend in big-league investing: applications that let traders who lack a background in math or coding deploy technical analysis or academic theories that have traditionally been the purview of quants.

“Our mission is to eliminate the coding barrier,” he said, adding that new applications now allow anyone with a basic knowledge of markets and statistics to apply complex technical theories to real-time events.

One example he cited is an application that lets traders integrate the theories of Tom DeMark, who is known for using esoteric mathematical models to predict market timing, into run-of-the-mill financial charts.

Together, the two trends Storelli cited — applications integrating technical analysis and the use of social media sentiment — reflect more widespread access to opposite ends of a spectrum of expertise. On one hand, traders can deploy the knowledge of elite experts while, on the other hand, they can act on the collective hunches of millions of average people on social media.

In practice, of course, these approaches are still far beyond the reach of average investors, in no small part due to Bloomberg’s hefty price tag. But they also appear to be laying the groundwork for democratizing the tools that supply insight into financial markets.

To learn more about how tools powered by big data are changing finance and other industries, join us at Gigaom’s Structure Data event in New York City on March 18.

DataStax’s first acquisition is a graph-database company

DataStax, the rising NoSQL database vendor that hawks a commercial version of the open-source Apache Cassandra distributed database, plans to announce on Tuesday that it has acquired graph-database specialist Aurelius, which maintains the open-source graph database Titan.

All of Aurelius’s eight-person engineering staff will be joining DataStax, said Martin Van Ryswyk, DataStax’s executive vice president of engineering. This is DataStax’s first acquisition since it was founded in 2010. The company did not disclose the purchase price, but Van Ryswyk said that a “big chunk” of DataStax’s recent $106 million funding round was used to help finance the purchase.

Although DataStax has been making a name for itself amid the NoSQL market, where it competes with companies like MongoDB and Couchbase, it’s apparent that the company is branching out a little bit by purchasing a graph-database shop.

Cassandra is a powerful and scalable database used for online or transactional purposes (Netflix and Spotify are users), but it lacks some of the features that make graph databases attractive for some organizations, explained DataStax co-founder and chief customer officer Matt Pfeil. These features include the ability to map out relationships between data points, which is helpful for social networks like Pinterest or [company]Facebook[/company], which use graph architecture to learn about user interests and activities.

Financial institutions are also interested in graph databases as a way to detect fraud and malicious behavior in their infrastructure, Pfeil said.
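
As a rough illustration of the relationship-mapping that makes graph databases attractive, here is a small sketch using the open-source networkx library rather than Titan or any DataStax product. It builds a toy interest graph and walks two hops out from one user, the kind of traversal a graph database is designed to make cheap at scale. The nodes and edges are invented.

```python
# Toy illustration of the graph model described above, using networkx
# rather than Titan or DSE Graph. Nodes and edges are invented.
import networkx as nx

g = nx.Graph()
# user-to-interest edges, like a social network's interest graph
g.add_edges_from([
    ("alice", "cycling"), ("alice", "espresso"),
    ("bob", "cycling"), ("bob", "jazz"),
    ("carol", "jazz"), ("carol", "espresso"),
])

# Two-hop traversal: interests of users who share an interest with alice,
# excluding interests alice already has.
recommendations = set()
for interest in g.neighbors("alice"):
    for user in g.neighbors(interest):
        if user != "alice":
            recommendations |= set(g.neighbors(user)) - set(g.neighbors("alice"))

print(recommendations)  # {'jazz'} -- surfaced purely from relationships
```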

As DataStax “started to move up the stack,” the company noticed that its customers were using graph database technology, and DataStax felt it could come up with a product that could give customers what they wanted, said Pfeil.

DataStax Enterprise

Customers don’t just want one database technology, they want a “multi-dimensional approach” that includes Cassandra, search capabilities, analytics and graph technology, and they are willing to plunk down cash for commercial support, explained Van Ryswyk.

Because some open-source developers were already figuring out ways for both Cassandra and the Titan database to be used together, it made sense for DataStax and the Aurelius team to work together on making the enterprise versions of the technology compatible with each other, Van Ryswyk said.

Together, DataStax and the newly acquired Aurelius team will develop a commercial graph product called DataStax Enterprise (DSE) Graph, which they will try to “get to the level of scalability that people expect of Cassandra,” said Van Ryswyk. As of now, there is no release date for the technology, but Pfeil said work on the new product is already taking place.

If you’re interested in learning more about what’s going on with big data in the enterprise and what other innovative companies are doing, you’ll want to check out this year’s Structure Data conference from March 18-19 in New York City.

Cloud monitoring category gets busier

Server monitoring gets hot

SolarWinds, which monitors multi-vendor technologies running in-house, last week bought Librato to extend its reach into the cloud. Librato is noted for its ability to watch workloads running in Heroku and Amazon Web Services as well as in internally run Node.js, Ruby, Rails and Java applications.

Austin, Texas-based SolarWinds is betting that, despite the hype, most companies will not move everything to a public cloud or a SaaS vendor but will want ways to monitor workloads whether they are running in the server room down the hall, in AWS US-East or wherever.

For the $40 million purchase price, SolarWinds “gets a way to bridge on-prem and cloud worlds and provide visibility across both,” Kevin Thompson, SolarWinds president (pictured above), said in an interview.

The goal is to give “IT and devops pros a way to manage everything from on-prem to cloud and everything in between or we can’t guarantee a level of performance,” he said.

That’s a tall order. But it’s also a potentially huge market as it’s increasingly clear that most big tech buyers will continue to spread their bets between private and public resources.

[company]Solarwinds[/company] will not, however, meld Librato’s cloud monitoring service into its existing services but will instead field discrete services so customers can buy what they need, he said.

SolarWinds started down this road to the cloud by purchasing Pingdom, a website and application monitoring company, in June. Librato is more about monitoring infrastructure at all layers in cloud environments, according to SolarWinds.

There’s been a flurry of activity in server monitoring over the past year. In May, Google purchased StackDriver, which provided monitoring tools for AWS and Google Cloud Platform. Three months later Idera bought Copperegg, another server monitoring product.

Investors are paying attention. Last week another server monitoring entry, Datadog, snagged $31 million in new funding bringing its total funding to more than $50 million. Other entries in this space include Boundary and Server Density.

What’s that again? Amazon to break out its cloud numbers?

But the really big news in cloud last week was all about AWS accounting. [company]Amazon[/company] always talks about how transparent it is. And yet the size of its clearly huge AWS business was treated like a state secret, hidden inside another category — one that also includes sales of various and sundry other stuff, including co-branded credit cards. That left pundits to guesstimate its size. (My favorite anecdote is when an Amazon employee complained to me about [company]Microsoft[/company] being opaque in claiming Azure was a $1 billion business a few years back. FWIW, I agreed that the Azure number was fuzzy at best, but oh the irony that Amazon, of all companies, would complain about that.)

Sooo, when Amazon’s CFO said on the Q4 earnings call Thursday that the company would at last break out AWS numbers, starting this quarter, I had to triple-check the news. Make no mistake, this is a big deal, but the breakout of cloud numbers won’t necessarily illuminate all mysteries. But baby steps, people.

The downside? Our nifty “Amazon North America Net Sales (other)” chart can now be retired. So here it is one more time (along with the associated growth chart).

[dataset id="911024"]

[dataset id="911074"]

Structure Show! All about data!

Check out this week’s Structure Show as data nerd Derrick Harris talks to data nerd Matt Ocko, of Data Collective Venture Capital, about real opportunities (as in, beyond the hype) for big data technologies. DCVC has put seed money into database companies (MemSQL) and satellite companies (Planet Labs).

And if that chat leaves you wanting more, you can get it when Ocko speaks at Structure Data next month in New York. Last week’s guest, Hilary Mason, will be there as well. So come for an embarrassment of data riches; stay for the parties!

 

[soundcloud url="https://api.soundcloud.com/tracks/188532972" params="color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false" width="100%" height="166" iframe="true" /]

SHOW NOTES

Hosts: Barb Darrow and Derrick Harris.

Download This Episode

Subscribe in iTunes

The Structure Show RSS Feed

 

A model that can predict the unpredictable New England Patriots

It’s said that familiarity breeds contempt in personal relationships. In the NFL, it might also breed predictability. Although the New England Patriots and their coach Bill Belichick are often called unpredictable, it turns out that machine learning models are actually pretty good at guessing what they’ll do.

Alex Tellez, who works for machine learning startup H2O, built a model he says can predict with about 75 percent accuracy whether the Patriots will run the ball or pass it on any given play. He used 13 years of data — all available on NFL.com — that includes 194 games and 14,547 plays. He considered a dozen variables for each play, including things such as time, score and opposing team.

Tellez thinks it might be possible to build a model that predicts plays with even more accuracy. He noted, while slyly touting his company’s software, that this one was created with just a few clicks using the H2O platform. Spending more time and tweaking some of the features might improve accuracy, and he suggested that feeding data into a recurrent neural network (which would have some ability to remember some results from one play to the next) might help account for the emergence of players like running back LeGarrette Blount, who can skew play-calling in the short term.
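
For readers curious what such a model looks like in code, here is a heavily simplified sketch that trains a gradient-boosted classifier on a handful of situational features. It uses scikit-learn rather than the H2O platform Tellez used, and the toy plays and feature set are invented; only the shape of the problem (situational inputs, a binary run-or-pass label) mirrors the model described above.

```python
# Simplified sketch of a run/pass classifier, in the spirit of the model
# described above. Uses scikit-learn instead of H2O; the toy plays and
# feature set are invented for illustration.
from sklearn.ensemble import GradientBoostingClassifier

# Features per play: [quarter, seconds_left_in_half, down, yards_to_go,
#                     score_differential]; label: 1 = pass, 0 = run.
X = [
    [1, 1700, 1, 10,  0], [1, 1500, 3,  8,  0], [2,  110, 2,  6, -4],
    [2,   45, 3, 12, -7], [3, 1400, 1, 10,  3], [3,  900, 2,  3, 10],
    [4,  600, 3,  1,  7], [4,  120, 2,  9, -3], [4,   35, 3, 15, -6],
    [1,  800, 2,  5,  0],
]
y = [0, 1, 1, 1, 0, 0, 0, 1, 1, 0]

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Hypothetical situation: 4th quarter, 90 seconds left, 3rd-and-7, down by 4.
situation = [[4, 90, 3, 7, -4]]
print("pass probability:", model.predict_proba(situation)[0][1])
```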

Tellez’s model works, in part, because of how long Belichick and quarterback Tom Brady have been together — 15 seasons now. That’s a lot of time to amass data about what types of plays the team will call in any given situation, with at least two constant — and important — variables in the coach and the quarterback.

“Realistically, Bill Belichick and Tom Brady, those are like the only dynamic duo,” explained Tellez. “You couldn’t do it with the Raiders,” he added, alluding to that team’s revolving door of coaches and quarterbacks.

Or even the New England Patriots’ Super Bowl competitor, the Seattle Seahawks, who are working with a fifth-year coach and third-year quarterback.

Last year, I wrote about Brian Burke, the founder of Advanced NFL Analytics and the guy whose models power the New York Times 4th Down Bot. “The number of variables, it explodes geometrically,” he said about the challenges of predicting football plays.

Still, even if predicting the likelihood of a run or a pass remains an unsolvable challenge for most of the NFL, the proliferation of data has already changed the face of football — and sports overall — in some very significant ways, and will likely continue to do so. Some obvious examples are the advanced metrics used by Major League Baseball teams to rate players beyond just their batting averages or earned-run averages, the now-trite “moneyball” method of building rosters, and the remarkable success of expert statisticians such as Burke and FiveThirtyEight’s Nate Silver.

At our Structure Data conference in March, data executives from ESPN and real-time player-tracking specialist STATS will discuss how access to so much data is changing the fan experience as well, and even the on-court decision-making in sports such as professional basketball.

Tellez actually suggested that, if anyone can build accurate-enough models, we could see live sports broadcasts include predictions of the next play, similar to how ESPN predicts outcomes in its World Series of Poker broadcasts. While his Patriots model took about 30 seconds to run, live-broadcast models would have the benefit of being able to pre-load data for the specific game situation and run only against that data, he said.

Richard Sherman

He also has another idea for applying advanced data analysis to the NFL — predicting rookie performance at the NFL combine. That’s where draft prospects go to show off to NFL scouts how big, fast and strong they are. However, not all prospects participate in all the events, which can give teams an incomplete view of their athletic prowess.

Tellez built a special type of neural network, called a self-organizing map, to analyze all combine performances for cornerbacks, specifically, and then fill in the blanks when players opt to skip a particular exercise. Think of it like Google’s Auto-Fill feature, which predicts missing values in spreadsheets. He says he discovered that good 40-yard dash, shuttle run and 3-cone times tend to correlate with high draft picks and future success, so being able to predict those times even when a prospect doesn’t run them could be valuable.
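
As an illustration of that fill-in-the-blanks idea, here is a small sketch that imputes a skipped drill from similar prospects. It swaps in scikit-learn’s k-nearest-neighbors imputer rather than the self-organizing map Tellez built, and all of the numbers below are invented.

```python
# Illustrative sketch of filling in skipped combine drills. Tellez used a
# self-organizing map; this swaps in scikit-learn's simpler k-nearest-
# neighbors imputer to show the idea. All numbers below are invented.
import numpy as np
from sklearn.impute import KNNImputer

# Columns: 40-yard dash (s), shuttle (s), 3-cone (s), vertical jump (in).
combine = np.array([
    [4.38, 4.01, 6.74, 38.5],
    [4.45, 4.10, 6.90, 36.0],
    [4.52, 4.19, 7.02, 34.5],
    [4.41, np.nan, 6.80, 37.0],   # prospect skipped the shuttle run
    [4.60, 4.25, np.nan, 33.0],   # prospect skipped the 3-cone drill
])

imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(combine)
print(np.round(filled, 2))  # missing drills estimated from similar prospects
```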

Of course, Tellez noted, stats don’t always tell us the truth. His model, as well as NFL scouts, predicted Seattle Seahawks cornerback Richard Sherman would be a mid-round draft pick. The Seahawks drafted him in the fifth round. He’s now considered one of the league’s most dominant cornerbacks and most recognizable players.