How your company can start using data: Think small and internal first

Session Name: The Data Guru’s Panel: Give and Take

Chris Albrecht
Derrick Harris
Sam Hamilton
John Foreman
Andrew Fogg

Chris Albrecht 00:00

Get our next panel all prepped and ready. Please mute your cellphones, keep fire exits clear, Wi-Fi you can join the network at Confinement, and enter the password Gringe1990, and you can follow us on Twitter. There’s actually been a lot of discussion, and what I really like seeing is, if you follow the Structure Europe hashtag, aside from all the discussion with Joe Weinman, there’s also a lot of people connecting with other people. So it’s been a really cool tool, for lack of a word that doesn’t rhyme, for people connecting with one another, and I think that’s really awesome, so please continue to do that and keep the conversation going. But right now, I want to bring out my colleague again, Derrick Harris, and this is the data guru’s panel, give and take, the data gives and it takes away. And he’s going to be speaking with Andrew Fogg, the founder and chief data officer for Import.io, John Foreman, the chief data scientist for MailChimp, and Sam Hamilton, the VP of data for PayPal. Please welcome our next panel to the stage.

Derrick Harris 01:03

Boy, it is much brighter in here this afternoon, I’m floored. Okay so, we have a really good panel I think. Andrew Fogg, co-founder, chief data officer of Import.io. John Foreman, chief data scientist at MailChimp. And Sam Hamilton, VP of data at PayPal. So before we get into the meat of the discussion, I just want to– instead of having everyone introduce themselves and what they do, I’d like, starting with Sam, to start with one great example of data science in action, I guess, that you’ve seen. Or that you’ve been involved with, ideally.

Sam Hamilton 01:38

Sure. At PayPal, with all of our data, we have 190 million plus consumer wallets online. And security and safety of the data and the transaction data with PayPal is very, very important, that’s the core of our business. The data science is used pretty much in fraud prevention or fraud detection. So taking all the transaction data, putting them together to make sure we are a bit ahead of what fraud crooks could do on the internet. So that’s the top one, I would say. There are many other data usages, but I’ll stop there.

Derrick Harris 02:16

Alright, John.

John Foreman 02:18

So MailChimp is an e-mail service provider so naturally there’s a fraud prevention piece. We’ve got about 4 million customers now so it’s impossible to check everyone manually. A lot of ESPs that only have up to a thousand customers would. But one of the fun parts about that is we have a complementary model that actually predicts if a customer is going to be extremely well behaved, and reduces friction for them signing up completely. So they just sail through the compliance process. So I really like that one.

Andrew Fogg 02:45

So what we do in [PlatayA?], we help people get data from the web. So it’s kind of an extraction of data from the web, so we’re very often involved in the very early stages of a data science project with clients. One of my favorite parts of what I do is actually working with people and seeing the crazy ways in which they use web data to solve business problems. Probably one of my favorites, this is a very early example, a year ago now, was with a company who were kind of competing with Airbnb, in a market with Airbnb. And what they used to do was basically– they built connectors on the platform to all of the booking pages on all of these five star hotels around the world. And what they wanted to do was devise a pricing model for themselves. So they essentially booked a hotel for every night, for the next 365 days, across these baskets of hotels, brought all this pricing data back, and it allowed them to basically inspect the pricing models of all these hotel competitors. And it’s very interesting, some of these hotels have nice flat lines in variable pricing. Others had weekly variable, seasonal stuff. And that was their starting off point to devise a variable pricing model based on evidence and data from the competitors, taken from the web.

Derrick Harris 04:13

So, you know, at some point, PayPal is obviously a relatively large organization within a very large organization, eBay. MailChimp, 4 million users, right? So you guys have some resources to do this, you have some very smart people involved in the process. If you’re a company, just a random company, what’s the process where you get started thinking about data science? Because it seems scary, there’s a shortage of them, no one has the skills. At some point, I guess if you were advising a company, what would you say? “Here’s how to do data science.” Is it like find a problem, and then build your way up, or what’s the workflow?

Sam Hamilton 04:56

Maybe I can start. At PayPal, we don’t necessarily look for a problem to solve for the data scientist. It’s more like data scientists started finding some of those problems. There are some business problems we throw at the data scientists and say, “Hey, try to find a solution for that.” But I think the primary, the first step when it comes to a data science team is to identify the problem themselves, one that can be solved by data. So the process, there’s not specifically a process I would say. Actually some of the methodology that I have found to be working is to work very closely with the business, with business knowledge, very good business knowledge. With the data scientist, identifying here are the areas that we could improve on, here are the areas that we could solve with the data. So the knowledge of data, a background in computers helping to deal with large amounts of data, and a good domain knowledge, the combination eases into the solution.

Sam Hamilton 05:58

So now you asked about how big the problem should be, how big the team should be and how big the organization should be. I don’t think there’s a specific size for that. As we can look at this as an opportunity that we can bring in technologies as we bring in the platform, some of these, we need to bring the science in as well to solve the problem.

John Foreman 06:14

Alright, for MailChimp it’s pretty interesting. We have a fairly rigorous process, which is funny when you think there’s a monkey in our name, we actually are very rigorous about it. The data science team considers two classes of customers. We have internal customers which are other teams, and then we have external customers which are our users. And we provide analysis; so just reports, data sets, etcetera. And then tools, which are a lot more fun, so capabilities that allow the customer to do data science themselves, but they might not know it. And so we always start with just staying in constant conversation with the customer. If that’s internal teams it’s very easy. If it’s users we do that through every means we have of communicating with users. Identify a problem very specifically. This is the problem we’re going to solve. Figure out what techniques we’re going to use, then do an audit of our data and make sure we have the appropriate data. If we don’t, figure out how to get that data and how it’s going to be structured to solve the problem. And only then do we really start thinking about which tools we use to implement the solution.

Derrick Harris 07:22

Do you think it’s a case where the average company– a company that doesn’t have a data science team right? They’re in a different position. How much of that can a regular company actually do without building a data science team or hiring a chief data scientist or something? Is there a way to explain that process if you don’t have the data science team to lead it? How would just a regular data team handle this?

Andrew Fogg 07:48

I can answer that. I think as an industry, if we have to hire data scientists before we can start looking at data in the context of our businesses, I think we’ve got a bit of a problem. I would argue that one of the most sophisticated data science tools we’ve had in the past 20 years has been Microsoft Excel. Businesses run on this, and this is not just small businesses, it ranges from very small to very large. And I think it’s always an interesting question to just comment on. You said earlier it’s not about how a data science project starts. Maybe I misinterpreted you, you said at PayPal, you start with the data. From seeing this, we see a lot of data science projects in a number of different companies, I think it’s always best to start with the business. I know you said that’s a rip-out, is it definitely an aspect of what you do? But start with the business, understand the questions you’re asking. Once a data science project is defined, the workflow’s pretty straightforward, but I think you’ve got to ask prior to that, the questions that you’re asking as a business, what do you expect, what prior beliefs do you have about your data and how it–

Derrick Harris 09:09

So it’s not technology. It’s not like, “Well we have Hadoop so therefore we have data science.” And then your point about Excel, it’s not even like data science means big. If it’s something I can fit in a spreadsheet then–

Andrew Fogg 09:23

I mean, I remember the time when Excel had 64,000 rows and you’d see spreadsheets where there were macros to manage multiple rows of multiple worksheets, and the processing on it– it was one big table. I spent lots of time in big banks, and I remember we did an audit once in the strategy partner of just how many Excel spreadsheets there were in the organization. And there were tens of thousands. And these were running business-critical business processes. But they expanded it to a million rows, and this is what you can do with a million rows, and I think maybe the first, it’s interesting to think about the limitations of a tool like Excel. And then work out the points at which you need to move on to more developed things. Is it that you need more rows? That’s the obvious one, Excel’s only got a million. Or–

Derrick Harris 10:11

John is the Excel master.

John Foreman 10:13

Yeah, I’ve got a book coming out on Excel and data science called “Data Smart”. Go buy it. But it’s really interesting because there’s been this conflation between data science, i.e. the use of techniques on data to create some sort of intelligence or make decisions, and the concept of big data, which is just making decisions off of large data sets. But I think for most companies, small businesses, mid-sized businesses, they’re not generating a mountain of transactional data so they don’t need Hadoop necessarily. A SQL database will often suffice, and so I’ve seen instances of, “Oh yeah, I put a gigabyte in Hadoop.” No, just use a SQL database if you like or maybe use a spreadsheet. A lot of the data science packages and software that folks use, like R, run completely in memory, so how are we doing data science completely in memory if everything requires Hadoop? It doesn’t. Usually training sets for models are much smaller than an entire transactional database. So I think data science just requires an understanding of exactly what you’re trying to solve, and finding the data that does that regardless of the size of that data set.
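
The point about a SQL database often sufficing can be illustrated with a minimal sketch. It assumes a hypothetical table of purchase records (the customer names and amounts are invented for illustration, not taken from the panel) and uses Python's built-in sqlite3 module to hold the whole data set in memory, with no cluster involved.

```python
import sqlite3

# Hypothetical purchase records (customer, amount). These values are made up
# for illustration; a small business's history often fits in a table like this.
PURCHASES = [
    ("alice", 30.0),
    ("bob", 12.5),
    ("alice", 20.0),
    ("carol", 7.5),
    ("bob", 2.5),
]


def total_spend_by_customer(rows):
    """Load rows into an in-memory SQLite database and total spend per customer."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT customer, SUM(amount) AS total FROM purchases "
            "GROUP BY customer ORDER BY total DESC, customer"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for customer, total in total_spend_by_customer(PURCHASES):
        print(customer, total)
```

The same query would run unchanged against an on-disk SQLite file, which is usually the next step before anything heavier is considered.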

Sam Hamilton 11:30

Just to clarify, the errors would be like, we don’t start with the data to come up with the data sciences. It’s more like data scientists, understanding the domain and the space they’re entering to solve the problem, will be able to even ask the question, what is the problem that they need to solve. To the point, I think there’s the combination of the three skills that we look for in data scientists to start with. One is the math, stats and economics, to come to the point, if we have time to talk a little bit about economics here, because of the space that we are in. And the second one is programming, being able to deal with the data. The third one is domain knowledge. Those three are equally important to get a good output of a–

Derrick Harris 12:17

So those things have been around. Domain expertise, math, stats have been around forever. So what is it? Why today, are we talking about data science and 10 years ago we were talking about BI statistics or something. There must have been some change in something right? What catalyzed that shift?

Sam Hamilton 12:37

So I think the progression that I’m seeing is, analytics was a big thing. Which was basically you have a whole bunch of data, you can see a pattern, that’s the analytics if you would. And then we brought in science, starting with search and various other things that started the science becoming a little more popular, more predictive of what is going to be beyond that. Extending the curve of analytics beyond to the future, that’s what we call a science. So this is what I was talking about with economics in the commerce space, well you have analytics, you have extending that beyond today. That’s the prediction of what should be. And the economics will influence the direction of how this is going to be. So you’re on the curve, predicting the curve, and bending the curve. So that’s where the economics comes in, you’ll be able to turn some of the knobs, if you would, in the metrics to get to that. I think right now, I think we’re seeing this as a need for us to forecast, a need for us to figure out where it’s going, and big arrow sciences and big arrow data. All these kinds of things, you listen to the point of making data science to be the front-ender rather than being a bad contraceptive.

Derrick Harris 13:54

So I was just saying, so it sounds like the availability of a lot of this data may be change, it’ll help open it up too because all of a sudden you need to take in seven different things to try to solve one–

John Foreman 14:05

Well, there’ve been examples of data science being used for decades, right? But they were all large companies, so when we think about yield management and revenue management in hotel chains, and in airlines where you’re actually optimizing, “How do I get the most butts in my seats and do I overbook or not?” Those examples have been around for decades but they were large companies with lots of resources, they could hire Teradata, etcetera. And so recently what’s happened is there’ve been advances in storage technology, storage has gotten cheaper, and we started to see use cases from small companies, specifically around things like collaborative filtering, so recommenders, “Oh, you should watch this movie, or you should listen to this song.” I think those examples have made it realistic. Other companies have seen that, sort of, “Okay, wow, this is a smaller business doing this.” The data makes sense to me, it’s not highly specialized the way hotel chains might be, and that’s really allowed people to see how they can use it, and then get interested in it.

Andrew Fogg 15:05

I think, so you’re planning the Web’s changed, that’s a lot say– we’re still in the early days of the web, but small companies all of a sudden can have lots and lots of transactional data. And it’s not just banks, we have this problem with airlines or– and that rate, etcetera.

Derrick Harris 15:18

And everyone seems to want to use social data. That seems like– the web has at least opened up at least these new avenues of data, I think are at that [?]. So okay, how do we all of a sudden take this Twitter data and get some value on the ride.

John Foreman 15:31

Everyone wants to use social data. What I would say for most companies is that’s fine if there’s a use for you. Definitely look internally and see if you have valuable data internally, such as purchase data. That will probably be more worthwhile first, before just immediately starting with, “Okay, none of our data matters. Let’s just go find some social data somewhere and use that instead.” You add social data on–

Derrick Harris 15:59

I was going to say you add it on top of that right?

Andrew Fogg 16:01

I’m sorry. I was just going to say it’s an interesting question as to which data you use for which project. And certainly in large organizations where you have a choice, so you’ve got data in all sorts of places, the head of data role sort of becomes less of a data science role. I know a guy who’s a head of data in a large bank, and he doesn’t work with data, he works with people, and his job is to make sure that all of the different departments are sharing and can actually collaborate. I think that’s an interesting– depends on the size of the business. But to the point of, the internal data can often add a lot of value quite quickly, as long as it’s being shared in the organization right away, and it’s not siloed off.

Sam Hamilton 16:45

So what we are seeing at PayPal is that there are four kinds of data that we are bringing together. When we put all four of them together, it’s very valuable. One is the transaction data, as you can imagine at PayPal, we have a cluster of commerce data that we can bring in, what people are transacting on. The second kind of data is the system data, what our systems are doing, and performance, speed and all those kinds of stuff. The third kind is the events that we [?] when people are there on the site. What are they doing with their merchant sites? What are they doing, all these events that we have? The fourth is the social data that comes in. These four as a combination together inform quite a bit about how we can help other consumers, how we can help our merchants. To bring them together is the value that we can bring in to help commerce as a whole. Social data by itself is valuable but when we can bring it to your data, I think when you join them, it’s not easy, but once you join them, that’s going to be very beneficial, right?

Derrick Harris 17:52

So one of the things that I like to write about and think about is – it’s kind of a generic term – but the democratization of data. And just this idea that it’s easier now than ever to access data, like we’re talking about. It’s easier, the technologies to process and store it are better and cheaper. How realistic is it, I guess, to see data science as something that’s less this ivory tower sort of thing and more something where you can take your data analyst and, with the right tools, the right simplified Web tools even, make it something that’s more tangible?

Andrew Fogg 18:28

We don’t really have a choice. I think we kind of have to do this. If big data’s going to deliver on the promise, then we’ve got to make it easier for more people. And I think this will happen. You see this in technologies, it starts off with custom solutions, very often done by research in universities, used in large corporations and it’s kind of like skunkworks, it’s kind of projects. Then the bigger size starts to deliver these as custom things, and then you’ve got productization and it kind of goes to light. I think it will happen. There will always be, certainly, unorthodox techniques, there will always be a cutting edge that requires a Ph.D. and requires a data scientist to keep abreast of research, try different things. But I think the 80%, the large majority, the hump of problems can be tackled readily with tools and techniques that can be easily productionized. There are companies like BigML for example who I know are doing nice things around machine learning and stuff.

Derrick Harris 19:32

This still, yeah, it’s very simple. I use BigML and if I can use it–

Andrew Fogg 19:37

Yeah, but it’s a step in the right direction. And I think it’s where we’ve got to go as an industry. My opinion.

John Foreman 19:46

So there’s some software that came out recently called Wizard. That’s a ridiculous name but it’s statistical software, and the author was using R and felt like, “Oh man, most people cannot use R.” And ended up putting together Wizard, which is just another example of a way where just a normal layman, who doesn’t have a statistical background, can actually load their data into there and do really advanced statistical tests for A/B testing, etcetera. So I think we’re headed in that direction. I know for MailChimp the idea has been, if data science is going to be useful, then it needs to hold its weight against products that are not data science products. So I’m competing internally with designers to get features released. Because if a graphic designer can put a feature into MailChimp that’s more useful to the user than my big data feature, then it shouldn’t go in the app. And so what that’s ended up doing for me is it’s caused me to edit.
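
As an illustration of the kind of statistical test used in A/B testing, here is a minimal sketch of a two-proportion z-test. It is not taken from Wizard, R, or MailChimp; the function name and the sample counts are hypothetical, and the normal approximation it relies on assumes reasonably large samples.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B.

    conv_a / conv_b are conversion counts, n_a / n_b are visitor counts.
    Returns (z statistic, two-sided p-value) under the normal approximation.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal, via the error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical experiment: 100 of 1000 convert on A, 150 of 1000 on B.
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the two variants is unlikely to be chance alone; deciding whether the difference matters to the business is a separate question.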

John Foreman 20:40

So now we’ve got features in the app where on the back end I’ve got a huge graph database, and I’m doing K-Medians clustering and all this fun stuff, but then on the front end there’s just a button that just says “Discover More Subscribers Like These Subscribers.” And it’s just a simple button and people use it and can see that segment. So I think that’s where we’re headed is that a lot of this data science is going to get very specific. So folks who don’t understand data science can’t just have the world of data science at their fingertips. But if they know this specific problem they want solved and data science can solve it, a product can be built around that, such that they don’t have to understand the science, they just have to press the button.
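
The panel does not describe how MailChimp's feature is built, so the following is only a generic sketch of k-medians clustering behind a "more like these" lookup. The two subscriber features (opens and clicks per month), the data points, and the starting centers are all hypothetical.

```python
from statistics import median


def manhattan(a, b):
    """Manhattan (L1) distance, the metric k-medians minimizes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: manhattan(p, centers[j]))
        for p in points
    ]


def k_medians(points, initial_centers, max_iterations=20):
    """Alternate assignment and per-coordinate median updates until stable."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iterations):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(tuple(median(col) for col in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center as-is
        if new_centers == centers:
            break
        centers = new_centers
    return assign(points, centers), centers


def more_like(points, labels, index):
    """Return the other points that share a cluster with points[index]."""
    return [
        p for i, (p, label) in enumerate(zip(points, labels))
        if label == labels[index] and i != index
    ]


# Hypothetical subscribers as (opens per month, clicks per month).
SUBSCRIBERS = [(1, 1), (2, 1), (1, 2), (9, 9), (10, 8), (9, 10)]

if __name__ == "__main__":
    labels, centers = k_medians(SUBSCRIBERS, [(0, 0), (10, 10)])
    print(labels, centers)
    print(more_like(SUBSCRIBERS, labels, 3))
```

From the user's side, all of this would sit behind a single button; only the returned segment is visible.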

Derrick Harris 21:18

So things like, yeah, clustering becomes a product that’s easy, or a feature that’s easy enough. It’s like form meets function at that point.

Sam Hamilton 21:27

So I just wanted to say, I think it’s the company that can critical stats [?]. We were basically simplifying a–

Andrew Fogg 21:33

Big fan of what they do, they’re based in San Francisco.

Derrick Harris 21:33

Statwing, check them out. I like them.

Andrew Fogg 21:38

If you ever used SPSS, you press the button that says, analyze or whatever it is, and you get a bunch of stuff, half of which, even if you have the training, you’re like, “Hmm, I have to get my textbook out. It’s kind of–”

Derrick Harris 21:49

It’s the wall of statistics.

Andrew Fogg 21:50

But you know, recently there are companies like Statwing that are beginning to make this easier. I just want to click and have something there. I think what they’re doing is in this line of democratization of data science and I think they’re doing a good job.

Sam Hamilton 22:02

At PayPal, to democratize some of the data, we’re building data platforms to make it easier, to a point. If you can make that look as simple as Excel, then we have accomplished quite a bit. So we’re building a layer of infrastructure so that you don’t have to worry about where the data is being stored. We have multiple platforms to store the data. And on top of that, what we have is the data platform, to provide access to the data, any kind of data, anytime, anybody, not anybody in the sense of anyone, but as long as you have the need for it. So we can build the pieces in place so we’ll be able to access the data, and if the scientist can spend less time figuring out how to access the data and spend more time figuring out what methodology he could use to solve the problem, that’s going to be beneficial. So for that matter, we are building that as a platform so that the solutions can build on top of it. And to provide interfaces like R and Scala, which are very native for the scientists to use, so that’s a big time usage of the technology as well.

Derrick Harris 23:08

And then we have a little bit of time left, sorry I just want to end with one question that you could say is contrarian, this is the contrarian question. We talked about this in the most important stuff of big data science, what’s the most overhyped or maybe overplayed aspect of data science. Like one of the skills or one of the things that you can achieve, what’s a lot of hot air.

John Foreman 23:27

Data visualization. In my opinion. A lot of people disagree with me on that because pictures are so pretty. But I’ve seen a lot of products come out recently where it’s, “Oh, we took our data and made a heat map.” And it doesn’t provide any benefit to the customer. Or, “Oh, we took all your contacts and we put them in a graph and now you can see that they’re all connected to you.” That’s great but it doesn’t provide me any use from the data, so I don’t think a pretty picture by itself is useful. But there’s been a lot of hype around pretty pictures because they’re easy to write about and understand, and stuff like Gephi. You need to find a real use for that. And I think there are real use cases but we’re just not there yet a lot of times when we release them.

Sam Hamilton 24:09

So I would say there’s quite a bit of science that we talk about in isolation, but I think connecting that to the business and how the scientists are solving a business problem is not only useful for the business of course, it also motivates the scientists themselves. They’re solving a real world problem. So most of the folks, as scientists, they come to the industry because they want to solve a real world problem. They’re applied scientists, I call them, not necessarily the lab scientists who want to write papers for a citation on this. So for them, connecting then, how is this improving the revenue or the operating costs? What are the other business metrics going to be? If we can connect that, then it’s going to be really useful. Otherwise it’s going to be tending to be hype, here’s a data scientist, it’s a pretty picture, something new that we have invented, here’s a new algorithm that will [?], that’s all great. I think to make it real we kind of–

Derrick Harris 25:03

Do you have a one word answer?

Andrew Fogg 25:04

I’ll end with a Josh Wills comment about how a lot of data science begins with data cleaning. And he just tweets, like, “I’m not a data scientist, I’m a data janitor.” And so at Import.io we’re trying to help give people tools to standardize data from the web at the point at which you bring it into your organization. But if you then want to combine that, you can do that, but if you want to combine that with internal data, it often needs a little cleanup. I’d say data janitor is the 21st century’s sexiest job.

Derrick Harris 25:33

Alright sounds like it. Thanks