Strong Russian word: Nyet [No]. By the end of this post I hope you'll agree. Worst case, you'll have food for thought.

This in-depth post covers a complex topic that might not apply to everyone, but it covers an area where companies have struggled to try to show return on the investments made in skills, technology and time. The post promises clarity and guidance that hopefully will result in you saving tons of aggravation and yes even a nice chunk of change.

Data Mining and Predictive Analytics have promised the earth, the moon and the sun for some time now, in every channel we do business in. My personal point of view is that on the web they fall far short of even the most pessimistic promises. For now.

As someone who has grown up in the world of traditional decision support systems (massively large data warehouses, business intelligence systems and tools, ERP & CRM systems) I have had the opportunity to be on both the marketing / business side and the development and implementation side of things.

There is nothing cooler than imagining all the wonderful things that will come if you simply move beyond reporting, and even analysis, to doing true data mining and predictive analytics. It is hard but can be rewarding.

Lots of consultants (yes I realize the irony here) will sell you this very effectively.

On pure web data, though, sadly it does not work.

Much as you might desire it, much as you might will it to happen, your traditional data mining efforts and resources and $$$ spent on doing predictive analytics will yield few, and rare, actionable insights. Most of the time it will prove to be a suboptimal use of time and energy.

[I can see the smart analysts amongst you get off your chair and mutter obscenities under your breath.]

There are a few very powerful, and non-obvious, elements working against you when it comes to finding exploitable trends and patterns in your web data, the kind that you are used to in offline and ERP/CRM type environments. Before you decide to pour $$$ and systems and people into your web predictive analytics efforts, please consider the rest of this post.

I recently had the great opportunity to present at the Bay Area ACM Data Mining Special Interest Group. Here is the last slide of my presentation:

[Slide: data mining and predictive analytics challenge]

The slide, in my view, captures the essence of the challenge when it comes to doing Predictive Analytics with web data. Let me explain.

#1 Type of Data:

It is important to realize that web data for the most part is completely anonymous, usually incomplete and really really unstructured. When you want to do traditional data mining (and not just analysis) and predictive analytics all of these things are poison.

You are looking for larger complex trends and patterns in the data for people, products, outcomes, behavior over large enough periods of time so that you can find something insightful that can also be exploitable.

That is really hard to do when the core things you are relying on to capture data are anonymous cookies and JavaScript tags that can be very, shall we say, sensitive. And that's just the tip of the iceberg.

All this makes it much much harder to tie the behavior of people to the outcomes they might be driving (on any kind of website, ecommerce or not). Yes, if you capture login IDs, have connected them to an actual human's details from your offline system, and do this for every single person who visits, this problem eases a bit (the anonymity part), but most of it is still there.

#2 Number of Variables:

People behave in crazy ways offline: they have multiple touch points and don't use perfect names and addresses etc. All this is much more insane in the online world.

We have discussed on this blog how it is not an online world or an offline world but rather it is a nonline world! This means people flow between channels and touch points, and there could be an outcome (lead, purchase, problem resolution) in a completely different channel than where most of the interaction was. You can imagine how this will completely screw up your SAS or SPSS or Clementine or other home grown solutions.

Here is another thing that lots of us underestimate. It is easier to Mine and then Predict when there is a certain amount of non-siloed existence. On the web Google is competing with a guy and his pony putting together a new search engine. Not only are there pretty much no barriers to entry but it is easy for your customers to flirt with your competitors and for your competitors to react to you in a massively efficient manner.

So are three visits to purchase typical? (What about two visits to a store in between?) Is $15 off to people from Florida the best strategy? (What happens to that when your competitors run aggressive PPC?) Is "Tony" and all visits attributed to Tony really Tony? (What about cookies and my wife and I and Damini all surfing Amazon on the same login?)

And here is what happens: by the time you control for the variables you can count and account for (while throwing away all that you can't), you are literally left with a glass of water (and you started with an ocean full of water), and your ability to predict anything scalable for massively actionable insights is deeply limited. It is just a glass of water after all. :)
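To make the glass of water a bit more concrete, here is a minimal sketch of the arithmetic. Every number in it is a hypothetical placeholder, not measured data, and the filter names are only examples of the kind of conditions you might need before a visit is usable for mining. The point is simply that a handful of reasonable-sounding filters multiply together quickly.

```python
def remaining_after_filters(total_visits, keep_fractions):
    """Return how many visits are left after applying each filter in turn.

    keep_fractions holds, for each filter, the share of visits that survive it.
    """
    remaining = float(total_visits)
    for fraction in keep_fractions:
        remaining *= fraction
    return round(remaining)


# Hypothetical keep rates, invented purely for illustration.
hypothetical_filters = {
    "cookie persisted across visits": 0.60,
    "visitor was logged in": 0.20,
    "campaign was tagged correctly": 0.70,
    "offline record could be matched": 0.30,
}

ocean = 1_000_000  # hypothetical number of raw visits
glass = remaining_after_filters(ocean, hypothetical_filters.values())
print(f"{glass} of {ocean} visits remain ({glass / ocean:.1%})")
```

With these made-up rates only a few percent of the original visits survive, and those survivors are unlikely to be a representative sample of everyone else.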

#3 Multiple Primary Purposes:

On the web this issue complicates things. We are trying to predict the outcome of our website, a complex being that exists to do lots (even things that your website was not created for).

So if it is unlike your other channels, where a visit and an outcome are fairly easily identifiable at the highest level, then how do you Mine and Predict?

I have often stressed the importance of measuring Primary Purpose because of the power that comes from real understanding of why people visit the website. Two things connected to Primary Purpose mess up your Mining and Prediction efforts:

1) You don't know all of the primary purposes (click here to find out how you can find out).
2) It is incredibly difficult to take your massive collection of clicks and visits and then assign them into each primary purpose bucket and then predict on top of that.

3) See below.

#4 Multiple Visit Behavior:

This really screws things up. You can predict frames of mind (primary purpose) when you send people pieces of mail. You can predict what people want/think when they walk into your supermarket / store. You can make up even more examples of things we all analyze and Mine and Predict.

It is a pain to go to a store and then go there six more times. On the web this is trivial. Hardly any website converts in one visit.

It is also a pain to go to the store for every problem you have or every question you have. On the web this is trivial. You can have the same person come to your website as a different persona many times to solve a different issue.

The question as you get ready to analyze your multi-terabyte database is: How can you isolate this behavior in your clicks? With how much confidence?

On paper it sounds easy, but in practice it is incredibly hard to account for multiple visit behavior, even if you have nixed the problem of collecting data accurately for each person and for each of their visits.
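Here is a minimal sketch of the "on paper" version, assuming you already have clean visit records as (visitor_id, timestamp, converted) tuples. The field names and the sample rows are hypothetical. It counts how many visits each visitor needed to reach a first conversion.

```python
from collections import defaultdict


def visits_before_first_conversion(visits):
    """Count visits up to and including each visitor's first conversion.

    visits: iterable of (visitor_id, timestamp, converted) tuples.
    Visitors who never convert are left out of the result.
    """
    by_visitor = defaultdict(list)
    for visitor_id, timestamp, converted in visits:
        by_visitor[visitor_id].append((timestamp, converted))

    result = {}
    for visitor_id, rows in by_visitor.items():
        rows.sort()  # order each visitor's visits by timestamp
        for count, (_, converted) in enumerate(rows, start=1):
            if converted:
                result[visitor_id] = count
                break
    return result


# Hypothetical sample rows; visitor_id stands in for an anonymous cookie id.
sample = [
    ("c1", 3, True),
    ("c1", 1, False),
    ("c1", 2, False),
    ("c2", 1, False),
    ("c3", 5, True),
]
print(visits_before_first_conversion(sample))
```

The code is the easy part. The hard part is that visitor_id here is only as trustworthy as the cookie behind it: a deleted cookie splits one person into several visitors, a shared login merges several people into one, and nothing in the rows tells you which purpose each visit served.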

#5 Missing Primary Keys, Data Silos, Lack of Holistic Datasets:

One way to get better at prediction is to take your data out of the web analytics silos and merge it with other sets of customer data in your company (stores and supermarkets, phone channels, others). If you knew all the customer touch points and had merged the data, it would get much much easier to understand current behavior and predict future behavior and outcomes.

This nirvana scenario is crushed by a couple of rather rotten tomatoes.

We are all familiar with untagged campaigns and pages. We also know that URL parameters don't always work in helping us collect data. The issue that causes more problems is the fact that most companies don't put in the forethought required to create the right "primary keys" that will allow data from different channels to be hooked up together.

There are even problems with names, addresses and phone numbers being collected and stored differently, causing both a data reconciliation nightmare and, more specific to this post, major challenges in analyzing outcomes.

For data mining and predictive analytics to yield positive ROI your company will have to put a lot of forethought into the process of data collection and storage across channels and in the deep bowels of your web / ERP / CRM systems. If that action item is not marked completed then it is optimal to focus on that first, before cutting a check for tools / people to do Mining and Predictions.
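As a small illustration of the reconciliation problem, here is a minimal sketch that normalizes an email and a phone number into a shared key so that a web record and a CRM record can be matched. The record fields, the sample values, and the rule of keeping the last ten digits of a phone number are all assumptions made for this example; real data needs rules that fit your own countries and systems.

```python
import re


def normalize_phone(raw):
    """Strip everything but digits and keep the last 10.

    Assumption: 10-digit national numbers with an optional country code.
    """
    digits = re.sub(r"\D", "", raw)
    return digits[-10:]


def normalize_email(raw):
    """Trim surrounding whitespace and lowercase the address."""
    return raw.strip().lower()


def match_key(record):
    """Build a hypothetical cross-channel key from email and phone."""
    return (normalize_email(record["email"]), normalize_phone(record["phone"]))


# Hypothetical records for the same person, stored differently per system.
web_record = {"email": " Tony@Example.com ", "phone": "(415) 555-0100"}
crm_record = {"email": "tony@example.com", "phone": "+1 415.555.0100"}

print(match_key(web_record) == match_key(crm_record))
```

A derived key like this is a patch, not a substitute for a primary key that was designed into every channel from the start, which is the forethought this section is asking for.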

#6 Massive Pace of Change on the Web:

Sure, Google, Yahoo, CNN, Craigslist, Amazon, eBay and the New York Times are always going to be there. It might even seem like things never change.

Unfortunately for you and me the game is not quite the same. The web is constantly changing. The way people experience it, the way people compete, the way people read and recommend and buy, the way everything happens.

Doing mining and predictive analytics on past behavior requires a certain amount of "stability" about your future (customers, business, outcomes etc etc). But if the "environment" changes too much, or even enough, then your predictions on past behavior will have only tiny chances of success.

For now this is perhaps one of the biggest challenges to Analysts and Statisticians who are working hard to get some of the traditional mining and predictive algorithms to work on our web data.

The Wikipedia article on Predictive Analytics ends with this statement:

"Predictive analytics adds great value to a businesses decision making capabilities by allowing it to formulate smart policies on the basis of predictions of future outcomes. A broad range of tools and techniques are available for this type of analysis and their selection is determined by the analytical maturity of the firm as well as the specific requirements of the problem being solved."

I'll leave that thought with you and stress that you consider:

1] maturity of your firm

2] requirements of the problem you are solving

3] the six items mentioned in this post, and whether

4] you have fixed all the "low hanging fruit".

OK, now it's your turn.

What do you all think? Do you agree this is hard? Perhaps you have already subdued this tough problem? Perhaps there is a flaw in my hypothesis?

Please share your tips, tricks, war stories, critique, brickbats via comments.

[Like this post? For more posts like this please click here, if it might be of interest please check out my book: Web Analytics: An Hour A Day.]
