Data Warehousing
Data warehouses contain consolidated data from many sources, spanning long time periods, and augmented with summary information. Warehouses are much larger than other kinds of databases; sizes ranging from several gigabytes to terabytes are common. Typical workloads involve ad hoc, fairly complex queries, and fast response times are important. These characteristics differentiate warehouse applications from OLTP applications, and different DBMS design and implementation techniques must be used to achieve satisfactory results. A distributed DBMS with good scalability and high availability (achieved by storing tables redundantly at more than one site) is required for very large warehouses.
An organization's daily operations access and modify operational databases. Data from these operational databases and other external sources (e.g., customer profiles supplied by external consultants) is extracted by using gateways, or standard external interfaces supported by the underlying DBMS. Standards such as Open Database Connectivity (ODBC) from Microsoft are emerging for gateways; ODBC is an application program interface that allows client programs to generate SQL statements to be executed at a server.
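The gateway pattern can be sketched with Python's DB-API, the interface that ODBC-style drivers (such as pyodbc) follow. Here the standard library's sqlite3 module stands in for an ODBC connection to an operational database; the table and data are hypothetical.

```python
import sqlite3

# sqlite3 stands in for an ODBC-style gateway connection; with a real
# driver this line would be something like pyodbc.connect("DSN=OperationalDB").
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The client program generates SQL statements that are executed at the server.
cur.execute("CREATE TABLE Sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO Sales VALUES (?, ?)",
                [("EU", 120.0), ("US", 340.0), ("EU", 80.0)])
conn.commit()

# Extract data through the same interface, ready to ship to the warehouse.
cur.execute("SELECT region, SUM(amount) FROM Sales GROUP BY region ORDER BY region")
rows = cur.fetchall()
print(rows)  # [('EU', 200.0), ('US', 340.0)]
```

The key point is that the client never touches the source DBMS's internals: it only generates SQL through the standard interface, so the same extraction code can run against any source that exposes a gateway.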
There are many challenges in creating and maintaining a large data warehouse. A good database schema must be designed to hold an integrated collection of data copied from diverse sources. For example, a company warehouse might include the Inventory and Personnel departments' databases, together with Sales databases maintained by offices in different countries. Since the source databases are often created and maintained by different groups, there are a number of semantic mismatches across these databases, such as different currency units, different names for the same attribute, and differences in how tables are normalized or structured; these differences must be reconciled when data is brought into the warehouse. After the warehouse schema is designed, the warehouse must be populated, and over time, it must be kept consistent with the primary data sources.
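Reconciling such mismatches can be sketched as a simple mapping step. In this hypothetical example, one sales office reports prices in euros under the attribute name "preis" while another reports dollars under "price"; the exchange rate is illustrative, not real.

```python
# Hypothetical source records with semantic mismatches: different
# attribute names ("preis" vs. "price") and different currency units.
eu_rows = [{"item": "widget", "preis": 10.0}]
us_rows = [{"item": "widget", "price": 12.0}]

EUR_TO_USD = 1.1  # assumed rate, for illustration only

def reconcile(row):
    """Map a source record to the warehouse's unified schema (USD, 'price')."""
    if "preis" in row:
        return {"item": row["item"], "price": round(row["preis"] * EUR_TO_USD, 2)}
    return {"item": row["item"], "price": row["price"]}

warehouse_rows = [reconcile(r) for r in eu_rows + us_rows]
print(warehouse_rows)
# [{'item': 'widget', 'price': 11.0}, {'item': 'widget', 'price': 12.0}]
```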
Data extracted from operational databases and external sources is first cleaned to minimize errors and fill in missing information when possible, and transformed to reconcile semantic mismatches. Transforming data is typically accomplished by defining a relational view over the tables in the data sources (the operational databases and other external sources). Loading data consists of materializing such views and storing them in the warehouse. Unlike a standard view in a relational DBMS, therefore, the view is stored in a database (the warehouse) that is different from the database(s) containing the tables it is defined over.
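A minimal sketch of materializing a view into a separate database, using sqlite3 (the source schema, view, and data are hypothetical): the view is defined over the source tables, but its evaluated result is stored in a different database that plays the role of the warehouse.

```python
import sqlite3

# An operational source database with a hypothetical Orders table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE Orders (cust TEXT, amount REAL)")
source.executemany("INSERT INTO Orders VALUES (?, ?)",
                   [("a", 5.0), ("b", 7.0), ("a", 3.0)])

# A relational view over the source tables defines the transformation.
source.execute("""CREATE VIEW CustTotals AS
                  SELECT cust, SUM(amount) AS total
                  FROM Orders GROUP BY cust""")

# Materialize the view: evaluate it at the source, then store the
# result in a *different* database -- the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE CustTotals (cust TEXT, total REAL)")
view_rows = source.execute("SELECT cust, total FROM CustTotals").fetchall()
warehouse.executemany("INSERT INTO CustTotals VALUES (?, ?)", view_rows)

loaded = warehouse.execute(
    "SELECT cust, total FROM CustTotals ORDER BY cust").fetchall()
print(loaded)  # [('a', 8.0), ('b', 7.0)]
```

Once materialized, queries against CustTotals in the warehouse no longer touch the operational database, which is exactly the separation the warehouse architecture is after.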
The cleaned and transformed data is finally loaded into the warehouse. Additional preprocessing such as sorting and generation of summary information is carried out at this stage. Data is partitioned and indexes are built for efficiency. The large volume of data to be loaded means that loading is a slow process; loading a terabyte of data sequentially can take weeks. Parallelism is therefore important for loading warehouses.
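The load-time preprocessing described above can be sketched as follows, with hypothetical monthly sales records: rows are hash-partitioned, each partition is sorted independently, and summary information is computed during the pass over the data. In a real warehouse, each partition would be loaded in parallel across disks or nodes.

```python
import zlib
from collections import defaultdict

# Hypothetical (month, amount) records arriving from the transform step.
records = [("2024-01", 100.0), ("2024-03", 50.0),
           ("2024-01", 25.0), ("2024-02", 75.0)]

# Hash-partition the data; a deterministic hash (crc32) keeps the
# example reproducible. Each partition can be loaded independently.
NUM_PARTITIONS = 2
partitions = defaultdict(list)
for month, amount in records:
    pid = zlib.crc32(month.encode()) % NUM_PARTITIONS
    partitions[pid].append((month, amount))

# Sort within each partition (e.g., to build clustered indexes cheaply).
for pid in partitions:
    partitions[pid].sort()

# Summary information generated during the same load pass.
summary = defaultdict(float)
for month, amount in records:
    summary[month] += amount

monthly_totals = dict(sorted(summary.items()))
print(monthly_totals)
# {'2024-01': 125.0, '2024-02': 75.0, '2024-03': 50.0}
```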