ETL also means Express Toll Lanes, see Express Toll Lanes.
Extract, Transform, and Load (ETL) is a process in data warehousing that
involves
- extracting data from outside sources,
- transforming it to fit business needs (which can include quality levels), and
ultimately
- loading it into the end target, i.e. the data warehouse.
ETL is important, as it is the way data actually gets loaded into the warehouse. This article assumes that data is always
loaded into a data warehouse, whereas the term ETL can in fact refer to a process that loads any database. ETL can also be used for the integration with legacy systems. Usually ETL implementations store an
audit trail on positive and negative process runs. In almost all designs, this audit trail is not at the level of granularity
which would allow to reproduce the ETL's result if the raw data were not available.
Extract
The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate
data from different source systems. Each separate system may also use a different data organization / format. Common data source formats are relational databases and
flat files, but may include non-relational database structures such as
IMS or other data structures such as VSAM or ISAM. Extraction converts the data into a format for
transformation processing.
An intrinsic part of the extraction is the parsing of extracted data, resulting in a check if the data meets an expected
pattern or structure. If not, the data is rejected entirely.
Transform
The transform stage applies a series of rules or functions to the extracted data from the source to derive the data to be
loaded to the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more
of the following transformations types to meet the business and technical needs of the end target may be required:
- Selecting only certain columns to load (or selecting null columns not to load)
- Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M
for male and F for female), this is called automated data cleansing; no manual cleansing
occurs during ETL
- Encoding free-form values (e.g., mapping "Male" and "1" and "Mr" into M)
- Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
- Joining together data from multiple sources (e.g., lookup, merge, etc.)
- Summarizing multiple rows of data (e.g., total sales for each store, and for each region)
- Generating surrogate key values
- Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
- Splitting a column into multiple columns (e.g., putting a comma-separated
list specified as a string in one column as individual values in different columns)
- Applying any form of simple or complex data validation; if failed, a full, partial or no rejection of the data, and thus no,
partial or all the data is handed over to the next step, depending on the rule design and exception handling. Most of the above
transformations itself might result in an exception, e.g. when a code-translation parses an unknown code in the extracted
data.
Load
The load phase loads the data into the end target, usually being the data warehouse
(DW). Depending on the requirements of the organization, this process ranges widely. Some data warehouses might weekly overwrite
existing information with cumulative, updated data, while other DW (or even other parts of the same DW) might add new data in a
historized form, e.g. hourly. The timing and scope to replace or append are strategic design choices dependent on the time
available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded
in the DW.
As the load phase interacts with a database, the constraints defined in the database schema as well as in triggers activated
upon data load apply (e.g. uniqueness, referential integrity, mandatory fields), which also contribute to the overall data
quality performance of the ETL process.
Challenges
ETL processes can be quite complex, and significant operational problems can occur with improperly designed ETL systems.
The range of data values or data quality in an operational system may be outside the expectations of designers at the time
validation and transformation rules are specified. Data profiling of a source
during data analysis is recommended to identify the data conditions that will need to be managed by transform rules
specifications. This will lead to an amendment of validation rules explicitely and implicitely implemented in the ETL
process.
DW are typically fed asyncronously by a variety of sources which all serve a different purpose, resulting in e.g. different
reference data. ETL is a key process to bring heterogeneous and asyncronous source extracts to a homogenous environment.
The scalability of an ETL system across the lifetime of its usage needs to be established
during analysis. This includes understanding the volumes of data that will have to be processed within Service Level Agreements, (SLAs). The time available to extract from source systems may change,
which may mean the same amount of data may have to be processed in less time. Some ETL systems have to scale to process terabytes
of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale
from daily batch to intra-day micro-batch to integration with message queues or real-time change data capture (CDC) for continuous transformation and update.
Parallel Processing
A recent development in ETL software is the implementation of parallel
processing. This has enabled a number of methods to improve overall performance of ETL processes when dealing with large
volumes of data.
There are 3 main types of parallelisms as implemented in ETL applications:
- Data
- By splitting a single sequential file into smaller data files to provide parallel access.
- Pipeline
- Allowing the simultaneous running of several components on the same data stream. An
example would be looking up a value on record 1 at the same time as adding together two fields on record 2.
- Component
- The simultaneous running of multiple processes on different data streams in the
same job. Sorting one input file while performing a reduplication on another file would be an example of component
parallelism.
All three types of parallelism are usually combined in a single job.
An additional difficulty is making sure the data being uploaded is relatively consistent. Since multiple source databases all
have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be
required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled to
the contents in a source system or with the general ledger, establishing synchronization and reconciliation points is
necessary.
Tools
While an ETL process can be created using almost any programming language,
creating them from scratch is quite complex. Increasingly, companies are buying ETL tools to help in the creation of ETL
processes.
A good ETL tool must be able to communicate with the many different relational
databases and read the various file formats used throughout an organization. ETL tools have started to migrate into
Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation
and loading of data. Many ETL vendors now have data profiling, data quality and metadata capabilities.
Drawbacks to ETL
As the number of highly-connected computers in any data exchange grows, ETL suffers from exponentially increasing costs. See
Metcalf's Law. A solution to ETL cost growth is to use XML standards on an
Enterprise Service Bus.
See also
References
- Ralph Kimball and Margy Ross The Data Warehouse
Toolkit: The Complete Guide to Dimensional Modeling (Second Edition), Wiley, 2002. ISBN 0-471-20024-7.
The Data Warehouse Toolkit by Second Edition. ETL is discussed on pages 358 to 362.
- Ralph Kimball , et al. "The Data Warehouse Lifecycle Toolkit", Wiley, 1998. ISBN
0-471-25547-5.
- Ralph Kimball , Caserta. "The Data Warehouse ETL Toolkit", Wiley. 2004. ISBN
0-7645-6757-8.
External links
This entry is from Wikipedia, the leading user-contributed encyclopedia. It may not have been reviewed by professional editors (see full disclaimer)