Data Discovery: Warning, Batteries Not Included

There were few things worse than the Christmas disappointment of frantically tearing open a present only to find it dead in the water - no batteries. Worse still, back when I was a kid, no stores were open on Christmas Day. So unless you'd planned ahead and kept batteries on hand (unlikely), it meant a grindingly slow wait until the following day to get any satisfaction. From anticipation to disappointment in a few short seconds. These days toy manufacturers are smarter – they just include them, thankfully.

Sometimes software is prone to the same issue - most recently, data discovery tools in particular. Data discovery has been one of the fastest growing segments within analytics, expanding substantially faster than its traditional Business Intelligence counterparts. And with good reason: adoption typically starts as a bottom-up, business-user-driven initiative. It begins with a frustrated but enterprising analyst looking to explore or share some insight. Caught between spreadsheets and the absence of a useful (or even existent) analytics initiative - one that's usually too costly, too rigid, or simply sitting on the shelf - data discovery just makes sense as a way to get quick results.

The great thing about data discovery tools is that they provide near-instant gratification: quick and easy setup, rich data visualization and exploration capabilities, easy ad-hoc analysis, and cool geospatial visualizations and heat maps. With tools like Tableau you can get eye-catching results incredibly quickly against spreadsheets, or by connecting to a database or a cloud source like Salesforce. A business user can typically go from data to dashboard significantly faster than with traditional Business Intelligence tools, which require complex mappings, semantic layers, and IT setup before delivering any joy.

In contrast to traditional BI tools, they eschew centralized data integration, metrics layers, and IT-maintained business mapping layers. That's the unglamorous stuff that, once it's all done (which takes a lot of time!), is often too rigid to accommodate new ad-hoc data requirements, or misses the mark in helping answer the questions analysts actually have when the need arises. The simple fact is that it is difficult to design an analytics initiative a priori, because you don't necessarily know all the questions analysts will ask. That's why data discovery has been so successful and adopted so quickly.

What About Those Batteries?

It's true: setting up all of that data integration and the semantic layers users interact with slows traditional BI deployments down. And having to prepare data or optimize database schemas to get decent query performance is just plain thankless. Analysts want to answer the questions they have right now, and all of that plumbing just gets in the way of speed and autonomy.

So data discovery tools typically dispense with all that, but in doing so they throw the baby out with the bath water – and there are consequences. Their value proposition is simple: point the tool at a spreadsheet, a text file, a simple data source, or perhaps a cloud source like Salesforce, and start analyzing. The problem is that life in the long run is rarely that simple, and that nice shiny product demo often hides the real data integration work it takes to get there. Even spreadsheets and text files need cleansing; opportunities or accounts in Salesforce need de-duping. Never mind joining accounts across CRM and ERP systems, or resolving complex joins across multiple tables (or databases). In emphasizing speed and autonomy, what's lost is reuse, repeatability, and the sharing of clean data.
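To make that concrete, here's a minimal sketch in Python with pandas - file names and columns are hypothetical - of the kind of cleansing, de-duping, and joining an analyst typically ends up doing by hand before a single chart gets built:

```python
import pandas as pd

# Hypothetical extracts pulled from a CRM and an ERP system.
crm = pd.read_csv("salesforce_accounts.csv")   # e.g. columns: account_name, owner, stage
erp = pd.read_csv("erp_customers.csv")         # e.g. columns: customer_name, region, revenue

# Cleanse: normalize names so "Acme Corp." and "acme corp" actually match.
for df, col in [(crm, "account_name"), (erp, "customer_name")]:
    df["match_key"] = (df[col].str.strip()
                              .str.lower()
                              .str.replace(r"[.,]|\b(?:inc|corp|llc)\b", "", regex=True)
                              .str.strip())

# De-dupe: keep one row per account in the CRM extract.
crm = crm.drop_duplicates(subset="match_key")

# Join accounts across the two systems on the normalized key.
merged = crm.merge(erp, on="match_key", how="left")
```

None of this is hard in isolation; the problem is that each analyst re-invents it privately, with slightly different rules, every time.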

It’s Like Making a Battery Run to the Store. Daily.

What often happens, especially when data discovery tools get virally deployed across departments, is that IT, or the administrator of the data source in question (e.g. the Salesforce or ERP admin), gets left carrying the bag. It means repeated requests for ad-hoc data extracts, or the analyst repeatedly grabbing an updated extract and then trying to join it with other sources and cleanse it in spreadsheet hell. Over, and over again.

The organization turns into a culture of one-offs – a one-off extract of a few periods of data for some win-loss analysis, another extract for some product discounting analysis. Analysts may end up performing weekly or monthly data prep and cleansing just for their own use, with no shared benefit for the rest of the organization. The business ends up with multiple data silos and a lot of redundant effort. Multiple versions of the truth get created, with every data discoverer applying his or her own logic to cleanse, transform, and visualize the data.

Everyone ends up with cool visualizations to share (and impress the management team with!), but the organizational cost is high: wasted time and redundant sets of conflicting data.

But things can be different with a little planning ahead.

Three Steps to Building a Batteries-Included Approach to Data Discovery

1)     Create a sustainable Data Discovery strategy

I'm not advocating building old-school centralized BI (though it does have a role as part of a broader analytics strategy – more on that later), because data discovery tools fill a need to understand and explore data quickly. But organizations do need a strategy around data, one that encourages sharing not just of dashboards but of data too, to optimize for reuse. Then, when the organization hits an inflection point in data discovery adoption, it is ready to roll out user-driven data prep tools like Paxata and Alteryx. These tools provide relief by enabling business users to prepare their own data, automate common preparation activities, and share the results with others. The outcome is shared pools of data that have been refined to handle common business questions. Better yet, compared to traditional data warehouse initiatives, when data is prepared from the bottom up and shared, you'll often end up with far more pragmatic and useful data for real-world business questions, based on a more democratic (and continually improving) process for curating the data pool.
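As a rough illustration of what "sharing data, not just dashboards" can look like, here's a minimal Python sketch - paths and column names are hypothetical - in which the cleansing logic lives in one agreed-upon routine and the refined output is published to a shared location, instead of staying trapped in a personal spreadsheet:

```python
import pandas as pd

def refine_opportunities(raw_path: str) -> pd.DataFrame:
    """One agreed-upon cleansing routine, kept in one place."""
    df = pd.read_csv(raw_path)
    df = df.drop_duplicates(subset="opportunity_id")      # de-dupe
    df["close_date"] = pd.to_datetime(df["close_date"])   # consistent types
    df["stage"] = df["stage"].str.strip().str.title()     # consistent labels
    return df

# Publish to a shared pool (a network share, S3 bucket, database, etc.)
# so every analyst starts from the same refined data.
refined = refine_opportunities("salesforce_opportunities.csv")
refined.to_parquet("//shared/data-pool/opportunities.parquet", index=False)
```

Tools like Paxata and Alteryx offer the same idea without code; the point is that the refinement happens once and is shared, rather than redone per analyst.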

2)     Identify data sources that need to be frequently analyzed, and optimize for reuse

One of the other keys is to identify which data requests have tipped into inefficiency and dysfunction. For example, run a quick poll among application administrators – ask the Sales Ops Salesforce or Dynamics GP admins which data pulls for business users have become onerous. Perhaps there is a month-end extract from multiple ERPs that has to be merged anew every single month, sucking up cycles in finance or ops. It's also worth polling analysts to understand what kinds of recurring transformation and merging they're performing – and which ones are duplicated across team members. The answers reveal which data tasks are candidates for consolidation across teams, and which are opportunities for automation.
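That recurring month-end merge, for instance, is exactly the kind of task worth scripting once and scheduling rather than repeating by hand. A minimal sketch, assuming hypothetical CSV extracts from multiple ERP systems sharing a common column layout:

```python
import glob
import pandas as pd

# Hypothetical month-end extracts, one CSV per ERP system,
# dropped into a folder such as extracts/2024-06/.
frames = []
for path in glob.glob("extracts/2024-06/*.csv"):
    df = pd.read_csv(path)
    df["source_system"] = path   # keep lineage: which ERP each row came from
    frames.append(df)

# Merge once, consistently, instead of re-doing it by hand each month.
merged = pd.concat(frames, ignore_index=True)
merged.to_csv("merged/month_end_2024-06.csv", index=False)
```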

3)     Think Holistically about Analytics, Create a Journey

As we've seen, while laissez-faire adoption of discovery tools can create results quickly, it's often not sustainable as adoption scales up. The truth is that there typically needs to be some ownership and data stewardship. In mid-size organizations that may mean an analytics strategy led by finance, perhaps consisting of analytics embedded in the transactional apps, some centralized BI/reporting (for hardened, shared metrics and reports), collaborative data pools, and data discovery tools. In larger organizations, it's a prime area for IT to lay the foundation for a sustainable, bottom-up data discovery strategy.

So before you go out shopping for that shiny new data discovery tool for the holidays and think about rolling it out across your organization, consider stocking up on batteries first – so your team spends more time playing with visualizations, and less time tripping over each other's data.