June 2007

28 Jun 2007 12:20 am

This probably impacts a minority of web analytics practitioners, normally only those with large websites with millions of page views per month / week / day. I know that there are lots of you out there. :)

When you are generating that much data from your website with your javascript tag based solutions there are a couple of "delightful" problems:

  1. It starts costing you lots of money because most javascript tag based solutions are pay for play (seems fair, it costs them money to collect your data).
  2. Your reports and queries from your web analytics solutions start slowing down, especially if you are segmenting the data or looking at vast amounts of history (and most definitely if your vendor has a backend that actually allows you to write custom queries against those massive amounts of data).

It will be a rare vendor that will admit that this challenge afflicts them, but it does.

To deal with the above two problems the standard operating procedure is to do data sampling.

I find that there is a bunch of confusion about sampling your data and the implications of making that decision (other than that if you sample the data you'll save money).

So here's the 411 on data sampling. There are three primary ways of sampling your data.

  • Code Red: Sampling web pages on your site.
  • Code Orange: Sampling data collected from each page.
  • Code Green: Sampling data processed when you run the query / report.

Here are more details on each option…

Code Red: Sampling web pages on your site:

Under this option either by choice or on advice from your vendor you only add the javascript tag to some pages on your website.

Typically you might add the javascript tags to a bunch of your busiest pages and forget the rest (CEO: "we should at least track our important pages even if we can't afford to track the site!").

Implications: Perhaps the least palatable of the three options. If you ever want to know anything about a page that was not tagged, even one you find interesting, you will have no data for it.

You also probably will not have a complete picture of your website. Maybe you forgot to tag page x and your marketing department sent off a million direct marketing emails pointing to that page, or page y got indexed by Google and is attracting a bunch of traffic now, and you have no idea.

Code Orange: Sampling data collected from each page:

Rather than every single page view on your site being collected there are ways to say in the javascript tag code: "just collect every tenth page view" (or every fiftieth or hundredth).


So when the page loads only every tenth time it will send data to your vendor. This means less data is collected for the vendor to store and process, and when reports run you get that sampled data.

Now in the report you have "lower" numbers than your real numbers but there is usually some approximation applied (say multiply every number by ten) to get the "correct" numbers for you.
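Here is a minimal sketch of that collect-then-multiply idea. It is only an illustration: the function name `simulate_sampled_collection` and its parameters are made up for this example, and I am assuming the tag makes a random one-in-ten decision on each page load rather than counting literally every tenth view. Your vendor's tag may well do it differently.

```python
import random

def simulate_sampled_collection(true_page_views, sample_every=10, seed=42):
    """Simulate a tag that sends data for roughly 1 in `sample_every` page views."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    collected = 0
    for _ in range(true_page_views):
        # Assumption: each page load independently decides whether to send a hit
        if rng.random() < 1.0 / sample_every:
            collected += 1
    # The report scales the collected count back up by the same factor
    estimated = collected * sample_every
    return collected, estimated

collected, estimated = simulate_sampled_collection(100_000)
print(f"collected={collected}, reported estimate={estimated}, true=100000")
```

The vendor stores only about a tenth of the hits, and the number in the report is close to the real number but almost never exactly equal to it.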

Implications: Better than not collecting data at all for some pages on your site. In this case at least you have some representative data for all your pages. Even with the multiplier you are getting an "approximately ok" view of your overall metrics.

For pages that don't get lots of page views (say beyond your top twenty or so pages) it also means that if you segment data it might not be of optimal quality.
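To put a rough number on "not of optimal quality", here is a small sketch. It assumes random one-in-N sampling (a binomial model, which is my assumption and not necessarily what your vendor does), and the helper name `relative_error` and the page labels are hypothetical. Under that model the scaled-up count has a relative standard error of sqrt((1 - p) / (n * p)), where p is the sampling rate and n is the true number of page views.

```python
import math

def relative_error(true_page_views, sample_every=10):
    """Approximate relative standard error of a scaled-up count
    under random 1-in-N sampling (binomial model, an assumption)."""
    p = 1.0 / sample_every
    # estimate = collected / p, so the relative standard error is
    # sqrt((1 - p) / (true_page_views * p))
    return math.sqrt((1.0 - p) / (true_page_views * p))

# Hypothetical pages with very different traffic levels
pages = [("home page", 1_000_000), ("mid-tier page", 10_000), ("small segment", 200)]
for label, views in pages:
    print(f"{label}: {views} page views -> about +/-{relative_error(views):.1%}")
```

With one-in-ten sampling a page with a million page views is off by only about 0.3 percent, while a segment with 200 page views can easily be off by around 21 percent. That is why small pages and small segments suffer the most.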

If you have a choice between red and orange, always choose orange.

Code Green: Sampling data processed when you run the query / report:

All data from your website is collected and stored by your vendor.

But your web analytics application allows you to select the amount of data you want to statistically sample so that your queries run faster and your reports come back quicker even when you hit humongous amounts of data.

To extend the example above, in this scenario you would simply say: "use statistical sampling of 10".

[Image: ClickTracks data sampling options]

Now when the query runs it will use statistical sampling to run the query really really fast and get you close to correct data.

Implications: You collect all the data from your website. If you ever want to wait a loooong time for your query to come back with God's perfect answer, you can.

As in the case of ClickTracks above, it is best if you can choose the sampling level yourself, rather than having your vendor choose it, because you can fine tune that sampling in a very white box way to your own comfort level.

The nice thing about this methodology is that you can sample the data and get reasonably fast results if you are querying massive amounts of data (in terms of months of history or number of users or page views etc).

But if you are querying small segments of the data (say everyone who came from source x or everyone who visited only these pages and purchased or just the last week's data or… you get the idea) then you have the option of saying "don't sample the data, just run the query against all the data" and you will get confident results.
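A minimal sketch of the query-time choice described above, under some loud assumptions: the function `count_visits`, the `sample_every` knob, and the made-up visit records are all hypothetical and only illustrate the idea. No vendor's actual implementation is implied.

```python
import random

def count_visits(visits, predicate, sample_every=1, seed=7):
    """Count visits matching `predicate`, optionally using a random sample."""
    sample_size = len(visits) // sample_every if sample_every > 1 else 0
    if sample_size == 0:
        # No sampling requested (or too little data to sample): exact scan
        return sum(1 for v in visits if predicate(v))
    sample = random.Random(seed).sample(visits, sample_size)
    matches = sum(1 for v in sample if predicate(v))
    # Scale the sampled count back up to the size of the full data set
    return round(matches * len(visits) / sample_size)

# Hypothetical stored data: every visit is kept, along with its traffic source
rng = random.Random(1)
sources = ["search", "email", "direct", "partner-x"]
visits = [{"source": rng.choice(sources)} for _ in range(50_000)]

def is_search(visit):
    return visit["source"] == "search"

fast = count_visits(visits, is_search, sample_every=10)  # scans a tenth of the data
exact = count_visits(visits, is_search)                  # scans all of it
print(f"sampled estimate={fast}, exact={exact}")
```

Because all the data is still stored, the same question can be asked both ways: sampled when you are crunching huge amounts of history, unsampled when the segment is small and you want the exact answer.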

If your vendor permits it, always choose code green over code red or orange, knowing that if you are using an ASP-based paid javascript tag solution then you'll still have to pay per page view to get this benefit. Or you go free, go to web log files, or bring data collection in-house (which vendors like ClickTracks, Unica and WebTrends allow you to do).

Your Action Item:

Is your web analytics vendor sampling data? If so, what method are they using? Find out and ensure that you are aware of the implications. Not every company can make the best possible choice between red, orange and green, but at least now you are well informed.

What do you all think? Please share your feedback via comments.

[Like this post? For more posts like this please click here, if it might be of interest please check out the book.]

PS: For non-American readers: 411 is the telephone number you would dial to get information such as a phone number, and I have come to learn from my time here that the term is used commonly, as in "give me the 411 on that". Wikipedia 4-1-1.

26 Jun 2007 12:21 am

A couple of weeks ago I spent a little bit of time talking to industry guru Bryan Eisenberg about a bunch of really interesting stuff. The result is two podcasts that I, humbly, think are pretty fun.

If you know Bryan then you'll know that he does not shy away from tough topics or asking controversial questions. He does that and more in these two podcasts.

Bryan is not just an industry colleague but also a good friend. Like me, Bryan is also a vegetarian, and at conferences he always scopes out places we can have dinner. He has also been very generous in sharing his advice and help, something a newbie Independent Consultant can clearly use lots of.

Doing a podcast with a friend is so much more fun. You'll hear a lot of laughter, Bryan (like a good friend) selling the book, and general good humor. I really did enjoy myself.

You'll also notice that around minute ten in the second podcast Bryan says "thanks for taking the time to talk to us today Avinash", and then we continue chatting!!

Podcast One:

This podcast focuses on:

  • Why "looking beyond the click" to optimize the experience is so necessary.

  • How technology has leveled the playing field, so companies of all sizes can be data-driven.

  • The importance of being data-driven, yet customer-focused.

  • The new "data democracy," and how it's created an environment where Google needs an Analytics Evangelist.

  • Exploiting "the long tail".

  • Controversial blog topics, such as "What is enterprise-class web analytics software?"

  • Documenting processes in your company, so you can fix them by measuring and optimizing intelligently.


Podcast Two:

This podcast focuses on:


I hope that you'll enjoy these podcasts as much as I did, and at the same time find them to be educational.

You'll hear about Persuasion Architecture in the podcast. In case you don't know what it is, please visit the Future Now Inc website. On that page is also a link to a very informative newsletter, so check that out as well.

[On a related note, Bryan also suggested an excellent idea to my friend Mike Moran, author of the new book Do It Wrong Quickly: From real people out there like you and me get the sorriest, the most egregious tales of delay, indecision, paralysis by analysis, and refusal to try things out. Please e-mail Mike your stories (mike at mikemoran dot com), there are prizes involved!]

Please share your feedback on the podcasts via comments. Were they interesting? Did you learn anything? Should I slow down my talking speed to something less than six hundred miles per hour? :)
