Statistics | Towards Data Science

Geospatial exploratory data analysis with GeoPandas and DuckDB

Thomas Reid — Mon, 15 Dec 2025 13:17:00 +0000

In this article, I’ll show you how to use two popular Python libraries to carry out some geospatial analysis of traffic accident data within the UK. I was a relatively early adopter of DuckDB, the fast OLAP database, but only recently realised that, through an extension, it offered a large number […]

The post Geospatial exploratory data analysis with GeoPandas and DuckDB appeared first on Towards Data Science.

The Machine Learning “Advent Calendar” Day 14: Softmax Regression in Excel

Angela Shi — Sun, 14 Dec 2025 18:12:00 +0000

Softmax Regression is simply Logistic Regression extended to multiple classes.

By computing one linear score per class and normalizing them with Softmax, we obtain multiclass probabilities without changing the core logic.

The loss, the gradients, and the optimization remain the same.
Only the number of parallel scores increases.

Implemented in Excel, the model becomes transparent: you can see the scores, the probabilities, and how the coefficients evolve over time.
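
As a rough companion to the spreadsheet, the score-then-normalize step can be sketched in plain Python. The weights, biases, and input below are made-up illustration values, not numbers from the article.

```python
import math

def softmax(scores):
    """Turn one linear score per class into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(x, weights, biases):
    """One linear score per class: w_k . x + b_k."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]

# Hypothetical coefficients for 3 classes and 2 features (illustration only)
weights = [[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]]
biases = [0.0, 0.1, -0.1]
x = [2.0, 1.0]

probs = softmax(class_scores(x, weights, biases))
print(probs)
```

With two classes and the second score fixed at 0, `softmax` reduces to the logistic function, which is the sense in which softmax regression extends logistic regression.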

The Machine Learning “Advent Calendar” Day 12: Logistic Regression in Excel

Angela Shi — Fri, 12 Dec 2025 17:15:00 +0000

In this article, we rebuild Logistic Regression step by step directly in Excel.
Starting from a binary dataset, we explore why linear regression struggles as a classifier, how the logistic function fixes these issues, and how log-loss naturally appears from the likelihood.
With a transparent gradient-descent table, you can watch the model learn at each iteration—making the whole process intuitive, visual, and surprisingly satisfying.
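
The gradient-descent table can be mimicked in a few lines of Python. This is a sketch under assumed values: the tiny dataset, the learning rate, and the step count are illustrative choices, not the article's.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(xs, ys, a, b):
    """Average negative log-likelihood of the labels under the model."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(a * x + b), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

def gradient_step(xs, ys, a, b, lr):
    """One gradient-descent update of slope a and intercept b."""
    n = len(xs)
    errors = [sigmoid(a * x + b) - y for x, y in zip(xs, ys)]
    grad_a = sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(errors) / n
    return a - lr * grad_a, b - lr * grad_b

# Hypothetical binary dataset (one feature, labels not perfectly separable)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

a, b = 0.0, 0.0
history = []  # one row per iteration, like the spreadsheet table
for step in range(200):
    history.append(log_loss(xs, ys, a, b))
    a, b = gradient_step(xs, ys, a, b, lr=0.1)

print(history[0], history[-1], a, b)
```

Each entry of `history` plays the role of one row in the table: the loss starts at log 2 (both coefficients are zero) and falls as the coefficients update.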

The Machine Learning “Advent Calendar” Day 8: Isolation Forest in Excel

Angela Shi — Mon, 08 Dec 2025 18:26:42 +0000

Isolation Forest may look technical, but its idea is simple: isolate points using random splits. If a point is isolated after only a few splits, it is likely an anomaly; if it takes many splits, it is likely normal.

Using the tiny dataset 1, 2, 3, 9, we can see the logic clearly. We build several random trees, measure how many splits each point needs, average the depths, and convert them into anomaly scores. Short depths become scores close to 1, long depths close to 0.

The Excel implementation is painful, but the algorithm itself is elegant. It scales to many features, makes no assumptions about distributions, and even works with categorical data. Above all, Isolation Forest asks a different question: not “What is normal?”, but “How fast can I isolate this point?”
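
The procedure on the dataset 1, 2, 3, 9 can be sketched in Python as follows. This is a one-dimensional toy version that assumes distinct values; the tree count and random seed are arbitrary choices, and the scoring uses the standard normalization from the original Isolation Forest paper rather than anything confirmed by this excerpt.

```python
import random

def tree_depths(values, rng, depth=0):
    """Build one random tree; return {value: number of splits to isolate it}.

    Assumes all values are distinct (true for 1, 2, 3, 9).
    """
    if len(values) <= 1:
        return {v: depth for v in values}
    split = rng.uniform(min(values), max(values))
    left = [v for v in values if v < split]
    right = [v for v in values if v >= split]
    if not left or not right:  # degenerate split: draw again
        return tree_depths(values, rng, depth)
    depths = tree_depths(left, rng, depth + 1)
    depths.update(tree_depths(right, rng, depth + 1))
    return depths

def average_path_length(n):
    """Normalizing constant c(n) = 2*H(n-1) - 2*(n-1)/n."""
    if n <= 1:
        return 0.0
    harmonic = sum(1.0 / i for i in range(1, n))  # H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_scores(values, n_trees=500, seed=0):
    """Average depths over many trees, then map them to scores in (0, 1)."""
    rng = random.Random(seed)
    totals = {v: 0 for v in values}
    for _ in range(n_trees):
        for v, d in tree_depths(values, rng).items():
            totals[v] += d
    c = average_path_length(len(values))
    return {v: 2.0 ** (-(totals[v] / n_trees) / c) for v in values}

scores = anomaly_scores([1, 2, 3, 9])
print(scores)
```

The point 9 is usually cut off by the first split, so its average depth is the shortest and its score the highest of the four.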

The Greedy Boruta Algorithm: Faster Feature Selection Without Sacrificing Recall

Nicolas Vana — Sun, 30 Nov 2025 13:00:00 +0000

A modification to the Boruta algorithm that dramatically reduces computation while maintaining high sensitivity

Metric Deception: When Your Best KPIs Hide Your Worst Failures

Shafeeq Ur Rahaman — Sat, 29 Nov 2025 15:00:00 +0000

The most dangerous KPIs aren’t broken; they’re the ones trusted long after they’ve lost their meaning.

The Absolute Beginner’s Guide to Pandas DataFrames

Ibrahim Salami — Mon, 17 Nov 2025 14:00:00 +0000

Learn how to initialize dataframes from dictionaries, lists, and NumPy arrays

Spearman Correlation Coefficient for When Pearson Isn’t Enough

Nikhil Dasari — Thu, 13 Nov 2025 12:30:00 +0000

Not all relationships are linear, and that is where Spearman comes in.
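
To make the contrast concrete, here is a small standard-library sketch. It treats Spearman's coefficient as the Pearson correlation of the ranks, which is the usual definition; the exponential data is an invented example of a relationship that is monotonic but not linear.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Invented data: y grows exponentially with x (monotonic, not linear)
xs = [1, 2, 3, 4, 5, 6]
ys = [math.exp(x) for x in xs]

print(pearson(xs, ys), spearman(xs, ys))
```

Because the ordering of `ys` matches the ordering of `xs` exactly, the rank-based coefficient is 1 (up to floating-point error), while the Pearson coefficient stays noticeably below 1.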

Evaluating Synthetic Data — The Million Dollar Question

Andrew Skabar — Fri, 07 Nov 2025 20:23:50 +0000

Learn how to evaluate synthetic data quality using the Maximum Similarity Test — a simple, quantitative approach for assessing fidelity, utility, and privacy in synthetic datasets.

Expected Value Analysis in AI Product Management

Chinmay Kakatkar — Thu, 06 Nov 2025 16:00:00 +0000

An introduction to key concepts and practical applications

What to Do When Your Credit Risk Model Works Today, but Breaks Six Months Later

Javier Marin — Tue, 04 Nov 2025 18:29:20 +0000

Here’s why it happens — and how to fix it

The Pearson Correlation Coefficient, Explained Simply

Nikhil Dasari — Sat, 01 Nov 2025 16:00:00 +0000

A simple explanation of the Pearson correlation coefficient with examples

Using NumPy to Analyze My Daily Habits (Sleep, Screen Time & Mood)

Ibrahim Salami — Tue, 28 Oct 2025 18:19:04 +0000

Can I use NumPy to figure out how my habits affect my mood and productivity?

Building a Monitoring System That Actually Works

Mariya Mansurova — Mon, 27 Oct 2025 18:31:18 +0000

A step-by-step guide to catching real anomalies without drowning in false alerts

The Power of Framework Dimensions: What Data Scientists Should Know

Chinmay Kakatkar — Sun, 26 Oct 2025 16:00:00 +0000

Practical guidance and a case study

Hidden Gems in NumPy: 7 Functions Every Data Scientist Should Know

Ibrahim Salami — Tue, 21 Oct 2025 17:33:00 +0000

I’ve been learning data analytics for a year now. So far, I consider myself confident in SQL and Power BI. The transition to Python has been quite exciting: I’ve been exposed to some neater, smarter approaches to data analysis. After brushing up on Python fundamentals, the ideal next step […]

Statistical Method mcRigor Enhances the Rigor of Metacell Partitioning in Single-Cell Data Analysis

Jingyi Jessica Li — Fri, 17 Oct 2025 12:30:00 +0000

mcRigor detects dubious metacells within each metacell partition and selects the optimal metacell partitioning method and hyperparameter for a given dataset

What Makes a Language Look Like Itself?

Kenneth McCarthy — Thu, 02 Oct 2025 14:00:00 +0000

How simple statistics reveal the visual fingerprints of 20 languages

The Gini Coefficient: From Lorenz Curves to Model Evaluation

Nikhil Dasari — Tue, 30 Sep 2025 15:30:00 +0000

Understanding how the Gini coefficient and the Lorenz curve help measure how well a model separates defaulters from non-defaulters.
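
One common way to compute a Gini coefficient for a scoring model is through the identity Gini = 2 × AUC − 1. The sketch below uses that route rather than drawing the Lorenz curve, which may differ from the article's own construction; the scores and default labels are invented.

```python
def auc(scores, labels):
    """Chance that a random defaulter outscores a random non-defaulter.

    Ties count as half a win. Labels: 1 = defaulter, 0 = non-defaulter.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(scores, labels):
    """Gini coefficient of a scoring model via Gini = 2 * AUC - 1."""
    return 2.0 * auc(scores, labels) - 1.0

# Invented predicted default probabilities and observed outcomes
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

print(auc(scores, labels), gini(scores, labels))
```

A Gini of 1 means every defaulter is scored above every non-defaulter, and a value near 0 means the scores separate the two groups no better than chance.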

The Kolmogorov–Smirnov Statistic, Explained: Measuring Model Power in Credit Risk Modeling

Nikhil Dasari — Mon, 22 Sep 2025 20:56:13 +0000

Understanding how banks use the KS statistic in loan approvals.
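
For reference, the two-sample KS statistic is the largest gap between the cumulative score distributions of the two groups. A minimal sketch, with invented scores and labels, could look like this:

```python
def ks_statistic(scores, labels):
    """Largest gap between the cumulative score distributions of
    defaulters (label 1) and non-defaulters (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(1 for s in pos if s <= t) / len(pos)
        cdf_neg = sum(1 for s in neg if s <= t) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

# Invented predicted default probabilities and observed outcomes
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

print(ks_statistic(scores, labels))
```

A value of 0 means the two score distributions coincide, and a value of 1 means some threshold separates the groups perfectly.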
