In the ELT pipeline, the transformation occurs in the target data store. Instead of using a separate transformation engine, the processing capabilities of the target data store are used to transform the data. This simplifies the architecture by removing the transformation engine from the pipeline. Another benefit of this approach is that scaling the target data store also scales the ELT pipeline performance. However, ELT only works well when the target system is powerful enough to transform the data efficiently.

Typical use cases for ELT fall within the big data realm. For example, you might start by extracting all of the source data to flat files in scalable storage, such as a Hadoop distributed file system (HDFS), Azure Blob Storage, or Azure Data Lake Storage Gen2 (or a combination). Technologies such as Spark, Hive, or PolyBase can then be used to query the source data. The key point with ELT is that the data store used to perform the transformation is the same data store where the data is ultimately consumed. This data store reads directly from the scalable storage, instead of loading the data into its own proprietary storage. This approach skips the data copy step present in ETL, which can often be a time-consuming operation for large data sets.

In practice, the target data store is a data warehouse using either a Hadoop cluster (with Hive or Spark) or dedicated SQL pools in Azure Synapse Analytics. In general, a schema is overlaid on the flat file data at query time and stored as a table, enabling the data to be queried like any other table in the data store. These are referred to as external tables because the data does not reside in storage managed by the data store itself, but in external scalable storage such as Azure Data Lake Storage or Azure Blob Storage. The data store only manages the schema of the data and applies the schema on read. For example, a Hadoop cluster using Hive would describe a Hive table where the data source is effectively a path to a set of files in HDFS. In Azure Synapse, PolyBase can achieve the same result: creating a table against data stored externally to the database itself. Once the source data is loaded, the data present in the external tables can be processed using the capabilities of the data store.
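As a rough sketch of the external-table idea described above, the following DDL shows a Hive table over delimited files in HDFS and a comparable PolyBase definition for an Azure Synapse dedicated SQL pool. The table names, columns, paths, and storage account are hypothetical, and the exact options (file formats, data source types, credentials) will vary with your environment:

```sql
-- Hive: an external table whose "data" is just a path to files in HDFS.
-- The schema is applied on read; dropping the table leaves the files intact.
CREATE EXTERNAL TABLE sales_raw (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(18,2),
    order_date  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/landing/sales/';          -- hypothetical HDFS path

-- Azure Synapse (T-SQL / PolyBase): the same result against external storage.
-- The data source, file format, and table are separate objects.
CREATE EXTERNAL DATA SOURCE SalesLake
WITH (
    LOCATION = 'abfss://landing@examplestorage.dfs.core.windows.net'  -- hypothetical
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

CREATE EXTERNAL TABLE dbo.SalesRaw (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(18,2),
    order_date  DATE
)
WITH (
    LOCATION    = '/sales/',
    DATA_SOURCE = SalesLake,
    FILE_FORMAT = CsvFormat
);
```

In both cases the transformation step then runs inside the target data store itself, for example `SELECT order_date, SUM(amount) FROM dbo.SalesRaw GROUP BY order_date` into a managed table, rather than in a separate transformation engine.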