Data warehousing tools
The term data warehousing was coined just six years ago by W.H. Inmon, a noted database expert and founder of Pine Cone Systems Inc. of Edgewood, Colo. Though widely used, it's also largely misunderstood. Before we look at specific types of data warehousing tools, let's clarify a few terms.
The term data warehousing was coined just six years ago by W.H. Inmon, a noted database
expert and founder of Pine Cone Systems Inc. of Edgewood, Colo. Though widely used,
its also largely misunderstood. Before we look at specific types of data warehousing
tools, lets clarify a few terms.
A data warehouse is a large collection of data from multiple sources within an
organization. The data is used to support enterprise decision-making, forecasting and
planning. It is the end use of that datathe support of enterprise-level
decision-makingthat defines a data warehouse.
A related, sometimes misused, term is data mart, which is a smaller, more localized
version of a data warehouse. A data mart is designed to serve a division, department,
location or other entity within an enterprise.
Typically, agencies will implement a data warehouse by assembling data from legacy and
operational database systems throughout the enterprise, although the warehouse itself is
not a production database. Because the original data comes from multiple sources in a
variety of usually incompatible formats, a lot of processing is needed both to create the
data warehouse and to use it.
Hundreds of products are at play in the data warehousing market, which today may be
second only to the Internet as a fertile ground for marketers. Products range from
all-in-one integrated suites, which, their makers claim, can handle the entire data
warehousing jobcompilation, management and analysisto specialized niche tools
that do only one job. Broadly, these specialized tools form three groups:
A data warehouse is an enterprise database thats gone about as far as it can go.
To take it to these ends, youre going to need more than a few software tools.
Were talking about serious computer hardware and massive amounts of online storage,
sometimes measured in multiple terabytes and, in some cases, conceivably even exabytes.
The database management system component of data warehousing is not fundamentally
different from any other database.
According to the Meta Group Inc. of Stamford, Conn., most data warehouses today are
managed with standard Unix relational DBMSes from Sybase Inc., Oracle Corp., IBM Corp. or
Informix Software Inc. of Menlo Park, Calif. It is worth noting, however, that some
DBMSes, such as Informixs Red Brick Warehouse, are designed for data warehousing.
In this Buyers Guide, I will concentrate on tools for assembling and converting the
data from its original sources, and products that perform sophisticated queries and
analyses.
Ive also eliminated those products oriented toward smaller-scale projects. For
example, if a particular server is specified to run on Microsoft Windows 95 and use
Microsoft Access as its main DBMS, its just not a realistic candidate for managing
anything that should be called a data warehouse. Database, yes; data mart, possibly; data
warehouse, no way.
When creating a data warehouse, its almost inevitable that youll get data
from a lot of systems that dont usually talk to one anotherassorted
mainframes, distributed departmental-level systems such as IBM AS/400s, servers running a
variety of Unix flavors on a wide variety of hardware, and even from LAN-based data on
Novell NetWare, Banyan Vines or Microsoft Windows NT systems, and smaller, proprietary
systems, too.
And you may well be getting data from outside sources: customers, suppliers, and state
and federal agencies.
Assembling data from such multiple sources requires a serious effort at conversion,
both in making the physical transfers workable and in mapping the data structures into
something that will handle all the incoming data. The second of those two jobs is where
the greatest amount of work is involved.
Keep in mind that you may have to accommodate both fixed-length data fields and
recordsthose 80-column card systems persistand variable-length fields. Data
from multiple-table relational databases, maybe normalized and maybe not, will have to be
matched with data from hierarchical network-based and object-oriented databases.
For example, youll encounter a variety of date structures that have to be
reconciled over and above year 2000 issues. Expect also to encounter different accounting
systems, different ways of handling null entries, mismatched file-naming conventions and
duplicated part numberseverything except a simple answer.
Before you can select a product, you need to create a comprehensive inventory of all
the machines, physical locations, DBMSes and networks from which you will get the data to
populate the warehouse. And youll also have to figure out how much data will have to
be moved and how often.
After youve done such an inventory, youre ready to look at some of the
acquisition tools in the accompanying table. Given the uniqueness of each organization and
its warehousing needs, theres no shortcut to doing your homework and seeing just
which products might be useful in your situation.
OK, youve collected the data from all your sources, massaged it so that you can
make comparisons, and its all assembled in a database managed by a good DBMS. Now
what?
Most queries that are run against databases outside of warehouses are specific, so
youre looking for definite information thats presented in a definite form.
Thats fine as far as it goes, but its generally not enough to justify having a
data warehouse.
The value of warehousing lies in its ability to let you discover information you
didnt know you had, explore new possibilities, follow a line of reasoning wherever
it takes you, and do what-if analyses and create new scenarios. For this work, you need a
package that slices and dices the data in as much detail and as many dimensions as you
need.
The process is generally called OLAP, but other related terms are data mining, drilling
down, or, for one vendor, data diving. One factor that differentiates these OLAP tools
from query and reporting packages is that you use them interactively.
With most OLAP products, when you want more detail, you just double-click on a category
to drill down. Or you can show data as a graph by simply clicking on a button, or bring in
a new data dimension with a click or two. Equally important, you can usually save the OLAP
queries in a form that lets others continue the interaction to extract even more
information from the data.
Using OLAP tools, you often can discover correlations and relationships in your data
that you didnt even suspect existed. You can analyze trends and population factors
more accurately and in finer detail, look at time-series and statistical factors, and view
different types of graphical representations and visualizations to see what they reveal.
Some OLAP tools are designed to work directly with warehouse databases on networked
workstations, while others are designed to download pieces of a data warehouse onto a
desktop PC and do most processing and analysis locally.
Establishing and maintaining a data warehouse is one of the biggest information
technology projects your organization may ever tackle, short of a complete re-engineering
of procedures. Its mandatory that you put a lot of time into investigating
alternatives and products to find which ones will serve you best. This guide is only a
starting pointa product that isnt listed here may be the right one for your
organization.
More information is available on the Web. Three good starting points are: Data
Warehousing at http://www.datawarehousing.com,
the Data Warehousing Institute at http://www.dw-institute.com,
and the International Data Warehousing Association at http://www.idwa.org/.
The following books also can help:
Russell Kay writes about and reviews software and hardware in Worcester, Mass. E-mail
him at russellk@bix.com.




