Pontifications on Microsoft and the Tech Industry

Putting the “BI” in Big Data

Last week, at the PASS (Professional Association for SQL Server) Summit in Seattle, Microsoft held a coming-out party, not only for SQL Server 2012 (formerly “Denali”), but also for the company’s “Big Data” initiative. Microsoft’s banner headline announcement: it is developing a version of Apache Hadoop that will run on Windows Server and Windows Azure. Hadoop is the open source implementation of Google’s proprietary MapReduce parallel computation engine and environment, and it's used (quite widely now) to process streams of data that go well beyond even the largest enterprise data sets in size. Whether it’s sensor, clickstream, social media, location-based or other data that is generated and collected in large gobs, Hadoop is often on the scene in the service of processing and analyzing it.

Microsoft’s Hadoop release will be a bona fide contribution to the venerable open source project. It will be built in conjunction with Hortonworks, a company with an appropriately elephant-themed name (“Hadoop” was the name of the toy elephant of its inventor’s son) and strong Yahoo-Hadoop pedigree.  Even before PASS, Microsoft had announced Hadoop connectors for its SQL Server Parallel Data Warehouse Edition (SQL PDW) appliance.  But last week Microsoft announced things that would make Hadoop its own – in more ways than one.

Yes, Hadoop will run natively on Windows and integrate with PDW. But Microsoft will also make available an ODBC driver for Hive, the data warehousing front-end for Hadoop developed by Facebook. What’s the big deal about an ODBC driver? The combination of that driver and Hive will allow PowerPivot and SQL Server Analysis Services (in its new “Tabular mode”) to connect to Hadoop and query it freely. And that, in turn, will allow any Analysis Services front end, including PowerView (until last week known by its “Crescent” code name), to perform enterprise-quality analysis and data visualization on Hadoop data. Not only is that useful, it’s even a bit radical.
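To make the driver point concrete, here is a minimal sketch of what the client side of such a connection looks like. It is only an illustration: Python's built-in sqlite3 module stands in for a Hive ODBC connection, and the `clicks` table, its columns and the function names are all hypothetical. What matters is that `page_counts` sees nothing but a connection and a SQL statement, which is why any SQL-speaking tool can sit on top of a driver like this.

```python
import sqlite3


def page_counts(conn):
    """Run an aggregate SQL query through a DB-API style connection.

    The function knows nothing about the engine behind the connection;
    it only issues SQL, as a BI front end would through a driver.
    """
    cur = conn.cursor()
    cur.execute("SELECT page, COUNT(*) FROM clicks GROUP BY page ORDER BY page")
    return cur.fetchall()


def demo_connection():
    """Build an in-memory stand-in database (assumption: not a real Hive source)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clicks (visitor TEXT, page TEXT)")
    conn.executemany(
        "INSERT INTO clicks VALUES (?, ?)",
        [("alice", "/home"), ("bob", "/home"), ("alice", "/pricing")],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    for page, hits in page_counts(demo_connection()):
        print(page, hits)
```

Swapping the stand-in for a real driver-backed connection would, in principle, leave `page_counts` untouched, and that is the whole appeal.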

As powerful as Hadoop is, it’s more of a computer scientist’s or academically-trained analyst’s tool than it is an enterprise analytics product.  Hadoop tends to deal in data that is less formally schematized than an enterprise’s transactional data, and Hadoop itself is controlled through programming code rather than anything that looks like it was designed for business unit personnel.  Hadoop data is often more “raw” and “wild” than data typically fed to data warehouse and OLAP (Online Analytical Processing) systems.  Likewise, Hadoop practitioners have had to be a bit wild too, producing analytical output perhaps a bit more raw than what business users are accustomed to.
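For a sense of what “controlled through programming code” means, here is a toy, single-machine sketch of the map-reduce pattern in plain Python. It is not Hadoop code and uses no Hadoop API; the clickstream records and function names are made up for illustration. It answers one simple question, how many clicks each page received, by hand-writing the map, shuffle and reduce steps.

```python
from collections import defaultdict

# Hypothetical clickstream records: (visitor, page) pairs.
CLICKS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/pricing"),
    ("carol", "/home"),
    ("bob", "/pricing"),
    ("alice", "/home"),
]


def map_phase(records):
    """Emit a (page, 1) pair for every click record."""
    for _visitor, page in records:
        yield page, 1


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_clicks(records):
    """Chain the three steps into one job."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    print(count_clicks(CLICKS))
```

In a SQL-style language the same question is roughly one line, `SELECT page, COUNT(*) FROM clicks GROUP BY page`, and that gap is what separates a programmer's tool from a business user's.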

But assuming Microsoft makes good on its announcements (and I have pretty specific knowledge that indicates it will), then business users will be able to get at big data, on premises and in the cloud, and will be able to do so using Excel, PowerPivot, and other tools that they already know and like, and with which they are productive.

Microsoft’s Big Data announcements show that Redmond’s BI (Business Intelligence) team keeps on moving. They’re building great products, and they’re doing so in a way that makes powerful technology accessible to a wide commercial audience. For the last seven years, SQL Server’s biggest innovations have been on the BI side of the product. This shows no sign of stopping any time soon, especially since Microsoft saw fit to promote Amir Netz, the engineering brain trust behind Microsoft BI since its inception, to Technical Fellow. This distinction is well-deserved by Mr. Netz and its bestowal is a move well-played by Microsoft.

Last week’s announcements aren’t about just Big Data; they’re about Big BI, now open for Big Business.



Feedback

# re: Putting the “BI” in Big Data

"As powerful as Hadoop is, it’s more of a computer scientist’s or academically-trained analyst’s tool than it is an enterprise analytics product." - I imagine there are a few people chuckling at that bit. Hadoop, with or without a variety of stand-atop add-ins like Hive, is the very basis of analytics at some of the biggest of big data shops, especially web metrics shops. Check out a good full distro like Cloudera (www.cloudera.com) for details on what's possible.

That being said, personally I like the relational, SQL, and analytic SQL model. I welcome SQL Server PDW and can't wait to give it a run. 10/28/2011 10:21 AM | Sal M.

# re: Putting the “BI” in Big Data

Sal,

You may be right about the chuckle, but I stand by my assertion. If you look at what working with Hadoop entails, and the whole notion of map-reduce development just to query your data, that is just not up to the standard of mainstream BI products in terms of usability, especially in a self-service scenario. You're right in saying that many of "the biggest of big data shops" use Hadoop...that's the whole reason Microsoft is adopting it. But my point is that there's some absurdity in that, and it represents the immature state the big data space is in, relative to other enterprise computing technologies. BI products (especially data mining tools) were like this too, at one point. But they evolved, matured and became more accessible...and then adoption increased. And while things like Hive certainly do help, they mask Hadoop's rough edges more than they solve them.

We seem to be in agreement though about the utility of SQL and the promise of a parallel approach like PDW vs. the map-reduce approach of Hadoop. 10/28/2011 11:37 AM | Andrew Brust

# re: Putting the “BI” in Big Data

Andrew, I agree; I think we are in complete agreement. I personally think Hadoop is a stopgap from a world where there was no affordable way to do what map-reduce does fairly cheaply, though not easily from an implementation POV. My concern was with the comment's presentation. With Hadoop characterized as a scientific tool, readers would immediately flinch, and that would undermine the value of the piece. 10/28/2011 11:59 AM | Sal M.