This woman and her machine learning tech could make Box a whole lot smarter

When cloud collaboration golden boy Box acquired a stealth content-discovery startup called dLoop last month, the deal was about a lot more than helping Box become this generation’s version of Autonomy. Box sees a lot of potential in dLoop’s machine learning technology to help the company evolve into a platform for developers and maybe even a pioneer in mobile content management. Divya Jain, dLoop’s co-founder and now a software engineer at Box, will help lead that charge.

The dLoop technology uses machine learning algorithms and graph analysis to detect the similarities among documents and organize them into clusters that are searchable not just by keyword, but by relevance. The immediate benefits are obvious: Box users will be able to search for documents more confidently, knowing they’ll see everything they might need even if a document doesn’t feature, or even include, a particular term. It’s helpful for security, too, Jain explained in a recent interview, because companies can, for example, find all their documents relating to intellectual property, including those they might otherwise have missed, and apply the proper permissions to them.
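The article does not describe dLoop's actual algorithms, so the following is only a minimal sketch of the general idea: score document similarity, link similar documents in a graph, and treat connected groups as clusters. It assumes TF-IDF weighting, cosine similarity and a fixed threshold, none of which come from Box or dLoop; the function names (`cluster_documents`, `search_with_clusters`) and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per document."""
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    return [
        {t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1) for t, c in counts.items()}
        for counts in counts_per_doc
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_documents(docs, threshold=0.2):
    """Link documents whose similarity meets the threshold (an assumed value)
    and return the connected components of that graph as lists of indices."""
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def search_with_clusters(docs, term, threshold=0.2):
    """Return every document in any cluster that contains a keyword hit."""
    hits = {i for i, d in enumerate(docs) if term.lower() in tokenize(d)}
    results = set()
    for group in cluster_documents(docs, threshold):
        if hits & set(group):
            results.update(group)
    return sorted(results)
```

In this sketch, a search for "filing" would also return a closely related patent document that never uses the word, because both land in the same cluster. A production system would need far more than this (scalable indexing, better text features, tuned thresholds), which is the scale problem Jain describes.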

It’s machine learning doing (arguably) what it does best — churning through and classifying data that might take humans untold days, weeks or months to do — but doing it on some rather difficult data types.

Divya Jain

Given the complicated nature of dealing with documents and content, perhaps it’s fitting that Jain, along with dLoop co-founder Anurag Maunder, was the one to try tackling it. She has been working with content for years, first at Sun Microsystems on the SunONE Portal Server team from 2001 through 2005, then at e-discovery startup Kazeon and EMC (which acquired Kazeon in 2009) from 2005 through 2011. She got interested in machine learning in 2009 and by 2010 had completed a graduate certificate in data mining and analysis from Stanford.

Jain’s turn toward machine learning wasn’t just a matter of personal interest, though; it was also a matter of timing. Technologies like Hadoop were coming of age, and easy access to cloud computing resources (the same ones that helped spur Box’s growth after it launched in 2005) made it feasible for the first time to do large-scale machine learning without buying and managing all the resources internally.

“Earlier, it was not even possible to cluster the amount of data we were seeing in the enterprise,” Jain said.

By early 2012, she and Maunder (who also founded Kazeon) had started dLoop, which was itself the product of advances in big data, the growing amount of content people were storing in cloud services, and Jain’s own document-search experience at Kazeon. The idea behind dLoop was to connect data from all those different places (Google, Box, Dropbox, etc.) and, as explained above, use machine learning to correlate them.

Within Box, though (which just raised another $100 million in venture capital and is eyeing an IPO), dLoop’s technology could serve purposes less obvious than just automating document classification. “We care a great deal about making sure our platform continues to develop and evolve,” said Box Vice President of Engineering Sam Schillace.

Box wants to be a developer platform, and embedding the dLoop capabilities into the platform will make it that much more appealing, he added. Machine learning could also help Box power better content recommendations and search results ranked by relevance rather than by keyword. Documents’ metadata could be valuable in this regard (e.g., who created a document and when might be telling), and Schillace also suggested some of the information gleaned via the process could be used to annotate documents.

Sam Schillace at Structure 2013 (c) 2013 Pinar Ozger

All of this potentially helps Box a great deal as it tries to tackle the mobile space, too. Schillace has previously discussed — during an interview earlier this year and again at our Structure conference in June — the importance of evolving the desktop-era Box platform into a more natural fit for the mobile era. He thinks the technology Jain helped create can directly impact the mobile experience “just by helping people find what they’re looking for faster.” (Box Founder and CEO Aaron Levie also discussed this topic, among others, during an appearance on our Structure Show podcast in September.)

Search would be a natural starting point because mobile users need the best stuff up high in the search results so they’re not forced to work through multiple screens. If Box were to get more into the structure of the documents, Schillace added, it could theoretically start surfacing relevant paragraphs, or even somehow treating paragraphs as separate documents, so users wouldn’t have to scan an entire file to find out whether it’s actually relevant to what they need.
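To make the paragraph idea concrete, here is a small, hypothetical sketch that treats each paragraph as its own document and surfaces the best match for a query. The scoring rule (the share of a paragraph's words that match the query) is an assumption chosen for brevity, not anything Schillace or Box described, and `rank_paragraphs` is an invented name.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_paragraphs(document):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def rank_paragraphs(document, query, top_k=1):
    """Score each paragraph by the fraction of its tokens that match a query
    term (an assumed, simplistic measure) and return the top_k matches as
    (paragraph_index, paragraph_text) pairs."""
    query_terms = set(tokenize(query))
    scored = []
    for index, paragraph in enumerate(split_paragraphs(document)):
        counts = Counter(tokenize(paragraph))
        total = sum(counts.values())
        if total == 0:
            continue
        score = sum(counts[t] for t in query_terms) / total
        if score > 0:
            scored.append((score, index, paragraph))
    # Highest score first; ties fall back to document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(index, paragraph) for _, index, paragraph in scored[:top_k]]
```

On a phone, returning only the top paragraph rather than the whole file is the kind of server-side selectivity the article has in mind; a real ranking function would be considerably more sophisticated.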

In mobile, he said, “There’s now a premium about the server being intelligent about what it serves to you.”

Jain, who’s now leading Box’s data classification and advanced content analysis efforts, is happy to be in a company that’s thinking about these types of uses for her technology and still nimble enough to actually pull them off. Even pushing 1,000 employees, Box is still very much a startup compared with EMC in terms of size, legacy technologies and businesses, and culture.

“Personally,” Jain said, “I like the startup environment and startup culture much more than the big company thing.”