The Evolution of Data Engineering

The past decade has dramatically changed the way data is perceived and used. Organizations are pursuing data-centric initiatives and relying more and more on data-driven decisions to create business value and drive innovation. The focus of data engineering has evolved from managing the flow of data into and out of databases to making data more accessible and providing the right data to its users at the right time.

The Evolution

Data engineering came to prominence a few years ago, when a shifting data landscape increased the demand for data handling and engineering support. However, the original conceptualization of the discipline can be traced back to the late ’90s. At the time, data engineering was a subset of a number of emerging data technologies and techniques in the analytics landscape. It centered on ETL (Extract, Transform and Load) and managed the movement of data to destinations such as databases and warehouses.
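To make the ETL idea concrete, here is a minimal sketch in Python using only the standard library. The sample CSV, the `orders` table, and the function names are all assumptions made up for illustration. They are not taken from any particular tool of that era.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; the second row is missing its amount on purpose.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,
3,carol,5.50
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount and normalise types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append(
            (int(row["order_id"]), row["customer"].title(), float(row["amount"]))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    """Run the three stages in order and return (row count, total amount)."""
    load(transform(extract(text)), conn)
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()


if __name__ == "__main__":
    print(run_etl(RAW_CSV, sqlite3.connect(":memory:")))
```

Keeping the three stages as separate functions mirrors how the steps are usually described, and it lets each stage be tested or replaced on its own.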

It was in the early 2000s that the need for scalability became pronounced, as easy access to the internet increased online engagement between companies and their consumers. This ushered in a new age for data, as opportunities for analytics increased dramatically. Organizations began focusing on collecting and storing data at scale, an approach that later took shape as the data lake. Newer technologies such as Apache Hive would subsequently gain popularity and bring increased flexibility and scalability.

In 2006, it became easier and cheaper to store and manage huge amounts of data with the arrival of Big Data and its sibling technologies. The open-sourcing of Hadoop became a turning point in the way data was processed. The new complexities of processing this data gave rise to a new breed of data and backend engineers. The same year, Amazon Web Services launched S3, its storage layer, which later became a common storage backend for Hadoop workloads. The second turning point, which bridged the gap between these engineers and Big Data, came when Hive, open-sourced by Facebook in 2008, became a top-level Apache project in 2010. This led to a new era of data engineering.

This new era made it more complex to source data from different storage locations, and organizations faced the new challenge of operating this intricate flow of data. The challenge gave rise to data orchestration engines, which coordinate the flow of data from multiple storage locations and combine it so that it is readily available for various data operations.
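The core idea behind an orchestration engine, running tasks in dependency order and handing upstream results downstream, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any specific engine is implemented. The task names, the sample data, and `run_pipeline` are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each receives a dict of its upstream results.
TASKS = {
    "read_orders": lambda up: [("alice", 20), ("bob", 5)],
    "read_customers": lambda up: {"alice": "DE", "bob": "US"},
    "join": lambda up: [
        (name, up["read_customers"][name], amount)
        for name, amount in up["read_orders"]
    ],
    "publish": lambda up: len(up["join"]),
}

# Each task maps to the tasks that must finish before it can start.
DEPENDENCIES = {
    "read_orders": (),
    "read_customers": (),
    "join": ("read_orders", "read_customers"),
    "publish": ("join",),
}


def run_pipeline(tasks, dependencies):
    """Run every task after its upstream tasks; return the order and results."""
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return order, results


if __name__ == "__main__":
    order, results = run_pipeline(TASKS, DEPENDENCIES)
    print(order)
    print(results["join"])
```

Real engines add scheduling, retries, and monitoring on top of this, but the dependency graph is the part that combines data from several sources into one flow.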

This massive explosion of data helped usher in a new age for machine learning. Machine learning models transitioned from being trained on a single machine to being trained on the abundance of data collected from the internet. This evolved further in 2014, when Spark's MLlib became available to Python users, which helped democratize ML computation on Big Data. In addition, Spark gave data engineers a new way to compute and process streaming data with relative ease, advancing towards the era of real-time processing.

2014 was a significant year in the development of how data can be used. In addition to Spark’s impact, the introduction of AWS Lambda on Amazon Web Services gave rise to the serverless movement, in which data ingestion could be done without managing infrastructure. This freed data engineers from much of the work of managing infrastructure and let them spend considerably more time on scaling and development. In 2016, the release of Amazon Athena took this further by allowing queries to run directly against data in S3 without the need to set up a cluster.

The data landscape has gone through a number of iterations, and this has changed the way data is now sourced, collected, and processed. In addition to working with data from external sources, organizations today also have to manage data that is generated internally. Making sure that the right data is available for experimentation has become substantially more complex. The Gartner Data Science Team Survey 2018 shows that in data projects, a substantial amount of time goes into tasks such as data collection, data preparation, and problem analysis before development of the data models can begin. Data engineering has become a critical practice that helps bridge gaps in data accessibility and ensure success in data and analytics initiatives.

With the gap between new innovations growing ever shorter, the existing layers of data and technology keep shifting towards adaptability. The foreseeable increase in data volumes from various sources and collection points will pave the way for newer techniques for reconfiguring data into usable forms. It is becoming vital for organizations to embrace data engineering as a practice in order to drive data and analytics success.