Data Ingestion and Curation Techniques
The Data Universe
There is a whole area of the abstract data universe that goes by various names: data integration, data movement, data curation or cleansing, data transformation, and so on.
One of the initiators of this movement is a company called Informatica, which originated when the data warehouse became a hot topic during the 1990s, much as Big Data is today. The term ETL (extraction, transformation, loading) became part of the warehouse lexicon, and it broadly describes the first generation of data integration tools.
Data ingestion refers to taking data from the source and placing it in a location where it can be processed. Since we are using Hadoop HDFS as our underlying storage framework and the related ecosystem for processing, we will look into the available data ingestion options. The following are the data ingestion options:
- Batch load from RDBMS using Sqoop
- Data loading from files
- Real-time data ingestion
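As a small illustration of the file-loading option, the sketch below mimics staging files into a landing area, using a local directory as a stand-in for an HDFS landing path (on a real cluster this step would be done with `hdfs dfs -put` or a Sqoop/Flume job); `load_files` is a hypothetical helper, not part of any Hadoop API:

```python
import shutil
import tempfile
from pathlib import Path

def load_files(source_dir: Path, landing_dir: Path) -> list:
    """Copy each non-empty CSV from source_dir into a landing directory,
    a local stand-in for an HDFS landing path."""
    landing_dir.mkdir(parents=True, exist_ok=True)
    loaded = []
    for path in sorted(source_dir.glob("*.csv")):
        if path.stat().st_size == 0:
            continue  # skip empty files instead of ingesting them
        shutil.copy(path, landing_dir / path.name)
        loaded.append(path.name)
    return loaded

# Usage: stage two files, one of them empty, and ingest the directory.
source = Path(tempfile.mkdtemp())
landing = Path(tempfile.mkdtemp()) / "landing"
(source / "orders.csv").write_text("id,amount\n1,9.99\n")
(source / "empty.csv").write_text("")
print(load_files(source, landing))  # ['orders.csv']
```

The empty-file check is a minimal example of the validation that usually accompanies ingestion; a real pipeline would also check schemas and checksums.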
Data Cleaning/Curation and Processing (Exploratory Data Analysis)
After getting the data into HDFS, we should clean the data and bring it to a format that can be processed.
A common traditional approach is to use a sample of the large dataset that fits in memory. But with the arrival of Big Data, processing tools like Hadoop can now be used to run many exploratory data analysis tasks on full datasets, without sampling. Just write a MapReduce job or a Pig or Hive script, launch it directly on Hadoop over the full dataset, and get the results back to your laptop.
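The MapReduce pattern behind such a job can be sketched in plain Python as a local stand-in; on a cluster, the same map and reduce functions would run in parallel over HDFS blocks. The classic word count serves as the example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def reduce_phase(pairs):
    """Reducer: sum the counts per key, as happens after the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["big data needs curation", "data curation at scale"]
counts = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
print(counts["data"])  # 2
```

The same computation is a one-liner in Pig or Hive; the point is that the full dataset, not a sample, flows through the map and reduce steps.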
So How Did ETL Evolve?
Often the T of ETL was the hardest job, as it required business domain knowledge. Data was assembled from a few sources (usually fewer than 20) into the warehouse for offline analysis and reporting. The cost of the data curation (mostly, data cleaning) required to get heterogeneous data into a proper format for querying and analysis was high. Data can be streamed in real time or ingested in batches. When data is ingested in real time, each data item is imported as it is emitted by the source. When data is ingested in batches, data items are imported in discrete chunks at periodic intervals. An effective data ingestion process begins by prioritizing data sources, validating individual files, and routing data items to the correct destination.
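The batch versus real-time distinction above can be sketched in a few lines of Python; `ingest_batch` and `ingest_stream` are hypothetical helpers, not the API of any particular tool:

```python
from typing import Iterable, Iterator, List

def ingest_batch(source: Iterable[dict], chunk_size: int = 3) -> Iterator[List[dict]]:
    """Batch ingestion: collect records into discrete chunks before loading."""
    chunk = []
    for record in source:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # flush the final partial chunk

def ingest_stream(source: Iterable[dict]) -> Iterator[dict]:
    """Real-time ingestion: forward each record as the source emits it."""
    for record in source:
        yield record

rows = [{"id": i} for i in range(7)]
batches = list(ingest_batch(rows, chunk_size=3))
print([len(b) for b in batches])       # [3, 3, 1]
print(len(list(ingest_stream(rows))))  # 7
```

In practice the trade-off is latency versus throughput: batching amortizes load costs over a chunk, while streaming delivers each item as soon as it arrives.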
When numerous big data sources exist in diverse formats (the sources may often number in the hundreds and the formats in the dozens), it can be challenging for businesses to ingest data at a reasonable speed and process it efficiently in order to maintain a competitive advantage. To that end, vendors offer software programs that are tailored to specific computing environments or software applications. When data ingestion is automated, the software used to carry out the process may also include data preparation features to structure and organize data so it can be analyzed immediately or at a later time by Business Intelligence/Analytics programs.
Subsequently, a second generation of ETL systems arrived in which the major ETL products were extended with data cleaning modules and additional adaptors to ingest other kinds of data. Data curation came to involve ingesting data sources, cleaning errors, transforming attributes, integrating schemas to connect disparate data sources, and performing entity consolidation to remove duplicates. It was a whole different perspective on ETL systems, and it still required a professional programmer to handle all these tasks and carry them out efficiently. With the arrival of the Internet, many new sources of data appeared, their diversity increased manifold, and the integration task became much tougher.
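Entity consolidation, the last of the curation steps listed above, can be sketched as follows. The `normalize` key is a deliberately naive stand-in for real record-matching logic (production systems use fuzzy matching, not exact keys):

```python
def normalize(record: dict) -> tuple:
    """Key used to decide whether two records refer to the same entity
    (a toy stand-in for real matching logic)."""
    return (record["name"].strip().lower(), record["city"].strip().lower())

def consolidate(records: list) -> list:
    """Entity consolidation: keep one record per normalized key."""
    seen = {}
    for rec in records:
        key = normalize(rec)
        if key not in seen:
            seen[key] = rec  # first occurrence wins
    return list(seen.values())

customers = [
    {"name": "Acme Corp", "city": "Boston"},
    {"name": "ACME corp ", "city": "boston"},  # duplicate with noisy casing
    {"name": "Globex", "city": "Springfield"},
]
print(len(consolidate(customers)))  # 2
```

Even this toy version shows why curation needed programmers: the normalization rules encode domain knowledge about which differences matter.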
Now there is talk of a third generation of tools, termed "scalable data curation", which can scale to hundreds or even thousands of data sources. Experts note that such tools can use statistics and machine learning to make automatic decisions wherever possible, requiring human interaction only when needed, which cuts cost while preserving accuracy.
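The human-in-the-loop idea behind such tools reduces to a confidence threshold; the sketch below is a hedged illustration of the routing decision, with hypothetical names and scores rather than any vendor's actual API:

```python
def route_decision(match_score: float, threshold: float = 0.9) -> str:
    """Accept an automatic decision when the model is confident enough;
    otherwise queue the case for human review."""
    return "auto" if match_score >= threshold else "human_review"

# Hypothetical model confidence scores for five candidate record matches.
scores = [0.99, 0.95, 0.42, 0.91, 0.60]
routed = [route_decision(s) for s in scores]
print(routed.count("human_review"))  # 2 of 5 cases need a person
```

Raising the threshold trades lower error rates for more human effort, which is exactly the cost/accuracy dial the third-generation tools expose.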
Start-ups such as NodeLogix (?), Metascale, and Paxata emerged, attracted by this revolution and applying such techniques to data preparation, an approach subsequently embraced by the incumbents Informatica, IBM, Cloudera, and Solix. A new startup called TamR, which received $16M in funding last year from Google Ventures and NEA, claims to deliver true "curation at scale". It has adopted a similar approach but applied it to a different upstream problem: curating data from multiple sources. IBM has also publicly stated its intention to develop a "Big Match" capability for Big Data that would complement its MDM (master data management) tools. More companies are expected to join this effort, and the pool has since grown.
In summary, ETL systems originally arose to deal with the transformation challenges in early data warehouses. On the surface, this seems quite the opposite of the concept of a "data lake", where data is stored in its native format. However, the so-called "data refinery" is no different from the curation process. ETL systems evolved into second-generation data curation systems with an expanded scope of offerings. Now a new generation of data curation systems is emerging to address the challenge of the Big Data world, where sources have multiplied and become more heterogeneous.