Data Engineering | Towards Data Science https://towardsdatascience.com/category/data-science/data-engineering/
Publish AI, ML & data-science insights to a global community of data professionals.
Mon, 15 Dec 2025 13:17:00 +0000

Geospatial exploratory data analysis with GeoPandas and DuckDB https://towardsdatascience.com/geospatial-exploratory-data-analysis-with-geopandas-and-duckdb/ Mon, 15 Dec 2025 13:17:00 +0000 https://towardsdatascience.com/?p=607897 In this article, I’ll show you how to use two popular Python libraries to carry out some geospatial analysis of traffic accident data within the UK. I was a relatively early adopter of DuckDB, the fast OLAP database, after it became available, but only recently realised that, through an extension, it offered a large number […]

The post Geospatial exploratory data analysis with GeoPandas and DuckDB appeared first on Towards Data Science.

Bootstrap a Data Lakehouse in an Afternoon https://towardsdatascience.com/bootstrap-a-data-lakehouse-in-an-afternoon/ Thu, 04 Dec 2025 13:30:00 +0000 https://towardsdatascience.com/?p=607806 Using Apache Iceberg on AWS with Athena, Glue/Spark and DuckDB

JSON Parsing for Large Payloads: Balancing Speed, Memory, and Scalability https://towardsdatascience.com/json-parsing-for-large-payloads-balancing-speed-memory-and-scalability/ Tue, 02 Dec 2025 15:30:00 +0000 https://towardsdatascience.com/?p=607786 Benchmarking JSON libraries for large payloads

Building a Geospatial Lakehouse with Open Source and Databricks https://towardsdatascience.com/building-a-geospatial-lakehouse-with-open-source-and-databricks-2/ Sat, 25 Oct 2025 14:00:00 +0000 https://towardsdatascience.com/?p=607488 An example workflow for vector geospatial data science

10 Data + AI Observations for Fall 2025 https://towardsdatascience.com/10-data-ai-observations-to-watch-in-fall-2025/ Fri, 10 Oct 2025 13:44:13 +0000 https://towardsdatascience.com/?p=607371 What's happening, and what's next, for data and AI at the close of 2025.

Data Mesh Diaries: Realities from Early Adopters https://towardsdatascience.com/data-mesh-diaries-realities-from-early-adopters/ Wed, 13 Aug 2025 19:30:59 +0000 https://towardsdatascience.com/?p=606861 Early-adopter realities gathered from real data mesh implementations

The Mythical Pivot Point from Buy to Build for Data Platforms https://towardsdatascience.com/mythical-pivot-point-from-buy-to-build-for-data-platforms/ Thu, 26 Jun 2025 17:51:39 +0000 https://towardsdatascience.com/?p=606428 For companies with data-intensive architectures, there often comes a pivotal point where building in-house data platforms makes more sense than buying off-the-shelf solutions

From Configuration to Orchestration: Building an ETL Workflow with AWS Is No Longer a Struggle https://towardsdatascience.com/from-configuration-to-orchestration-building-etl-workflow-with-aws-is-no-longer-struggling/ Thu, 19 Jun 2025 17:04:15 +0000 https://towardsdatascience.com/?p=606368 A step-by-step guide to leveraging AWS services for efficient data pipeline automation

How to Reduce Your Power BI Model Size by 90% https://towardsdatascience.com/how-to-reduce-your-power-bi-model-size-by-90/ Mon, 26 May 2025 23:37:32 +0000 https://towardsdatascience.com/?p=606106 Have you ever wondered what makes Power BI so fast and powerful when it comes to performance? Learn from a real-life example of data model optimization, along with general rules for reducing data model size

The Geospatial Capabilities of Microsoft Fabric and ESRI GeoAnalytics, Demonstrated https://towardsdatascience.com/geospatial-capabilities-of-microsoft-fabric-and-esri-geoanalytics-demonstrated/ Thu, 15 May 2025 05:34:57 +0000 https://towardsdatascience.com/?p=606021 A step closer to spatial AI through geospatial processing in Fabric

Parquet File Format – Everything You Need to Know! https://towardsdatascience.com/parquet-file-format-everything-you-need-to-know/ Wed, 14 May 2025 18:48:02 +0000 https://towardsdatascience.com/?p=606014 New data flavors require new ways of storing them! Learn everything you need to know about the Parquet file format

The Shape‑First Tune‑Up Provides Organizations with a Means to Reduce MongoDB Expenses by 79% https://towardsdatascience.com/the-shape%e2%80%91first-tune%e2%80%91up-provides-organizations-with-a-means-to-reduce-mongodb-expenses-by-79/ Fri, 02 May 2025 18:52:40 +0000 https://towardsdatascience.com/?p=605890 A real-world engineering fix that saved over $12K/month on MongoDB without upgrading infrastructure.

The Difference between Duplicate and Reference in Power Query https://towardsdatascience.com/the-difference-between-duplicate-and-reference-in-power-query/ Fri, 02 May 2025 18:38:22 +0000 https://towardsdatascience.com/?p=605886 In Power Query, we can duplicate or reference existing tables. But what are the differences between them? Let's dive into it to find out.

AWS: Deploying a FastAPI App on EC2 in Minutes https://towardsdatascience.com/aws-deploying-a-fastapi-app-on-ec2-in-minutes/ Fri, 25 Apr 2025 00:40:41 +0000 https://towardsdatascience.com/?p=605805 From zero to EC2: easy steps to launching an AWS Instance

Exporting MLflow Experiments from Restricted HPC Systems https://towardsdatascience.com/exporting-mlflow-experiments-from-restricted-hpc-systems/ Thu, 24 Apr 2025 01:45:05 +0000 https://towardsdatascience.com/?p=605792 A workaround method that bypasses direct communication

MapReduce: How It Powers Scalable Data Processing https://towardsdatascience.com/mapreduce-how-it-powers-scalable-data-processing/ Tue, 22 Apr 2025 19:29:25 +0000 https://towardsdatascience.com/?p=605770 An overview of the MapReduce programming model and how it can be used to optimize large-scale data processing.

Beginner’s Guide to Creating a S3 Storage on AWS https://towardsdatascience.com/beginners-guide-to-creating-a-s3-storage-on-aws/ Tue, 22 Apr 2025 04:44:52 +0000 https://towardsdatascience.com/?p=605764 How to quickly create cloud storage and access it remotely

Mastering Hadoop, Part 3: Hadoop Ecosystem: Get the most out of your cluster https://towardsdatascience.com/mastering-hadoop-part-3-hadoop-ecosystem-get-the-most-out-of-your-cluster/ Sat, 15 Mar 2025 01:20:01 +0000 https://towardsdatascience.com/?p=599606 Exploring the Hadoop ecosystem — key tools to maximize your cluster’s potential

Forget About Cloud Computing. On-Premises Is All the Rage Again https://towardsdatascience.com/forget-about-cloud-computing-on-premises-is-all-the-rage-again/ Fri, 14 Mar 2025 18:19:36 +0000 https://towardsdatascience.com/?p=599593 From startups to enterprises, companies are lowering costs and regaining control over their operations

Mastering Hadoop, Part 2: Getting Hands-On — Setting Up and Scaling Hadoop https://towardsdatascience.com/mastering-hadoop-part-2-getting-hands-on-setting-up-and-scaling-hadoop/ Thu, 13 Mar 2025 20:42:01 +0000 https://towardsdatascience.com/?p=599577 Understanding Hadoop’s core components before installation and scaling
