Statistics | Towards Data Science https://towardsdatascience.com/tag/statistics/ Publish AI, ML & data-science insights to a global community of data professionals. Mon, 15 Dec 2025 16:33:40 +0000

Geospatial exploratory data analysis with GeoPandas and DuckDB https://towardsdatascience.com/geospatial-exploratory-data-analysis-with-geopandas-and-duckdb/ Mon, 15 Dec 2025 13:17:00 +0000 https://towardsdatascience.com/?p=607897 In this article, I’ll show you how to use two popular Python libraries to carry out some geospatial analysis of traffic accident data within the UK. I was a relatively early adopter of DuckDB, the fast OLAP database, after it became available, but only recently realised that, through an extension, it offered a large number […]

The post Geospatial exploratory data analysis with GeoPandas and DuckDB appeared first on Towards Data Science.

The Machine Learning “Advent Calendar” Day 14: Softmax Regression in Excel https://towardsdatascience.com/the-machine-learning-advent-calendar-day-14-softmax-regression-in-excel/ Sun, 14 Dec 2025 18:12:00 +0000 https://towardsdatascience.com/?p=607910 Softmax Regression is simply Logistic Regression extended to multiple classes.

By computing one linear score per class and normalizing them with Softmax, we obtain multiclass probabilities without changing the core logic.

The loss, the gradients, and the optimization remain the same.
Only the number of parallel scores increases.

Implemented in Excel, the model becomes transparent: you can see the scores, the probabilities, and how the coefficients evolve over time.
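The "one linear score per class, then Softmax" step described above can be sketched in a few lines of plain Python. This is only an illustration, not the article's Excel workbook: the function names and the coefficient values are hypothetical.

```python
import math

def softmax(scores):
    """Turn a list of real-valued class scores into probabilities."""
    m = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(x, weights, biases):
    """One linear score per class: w_k . x + b_k."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]

# Hypothetical coefficients for 3 classes and 2 features.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]

probs = softmax(class_scores([2.0, 1.0], weights, biases))
print(probs)  # three probabilities that sum to 1
```

With two classes and one score fixed at zero, the same `softmax` reduces to the logistic function, which is the sense in which Softmax Regression extends Logistic Regression.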

The Machine Learning “Advent Calendar” Day 12: Logistic Regression in Excel https://towardsdatascience.com/the-machine-learning-advent-calendar-day-12-logistic-regression-in-excel/ Fri, 12 Dec 2025 17:15:00 +0000 https://towardsdatascience.com/?p=607901 In this article, we rebuild Logistic Regression step by step directly in Excel.
Starting from a binary dataset, we explore why linear regression struggles as a classifier, how the logistic function fixes these issues, and how log-loss naturally appears from the likelihood.
With a transparent gradient-descent table, you can watch the model learn at each iteration—making the whole process intuitive, visual, and surprisingly satisfying.
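As a rough Python counterpart to the gradient-descent table mentioned above, the sketch below fits a one-feature logistic regression and records the log-loss at every iteration. The toy dataset, learning rate, and variable names are assumptions made for illustration; they are not taken from the article.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(xs, ys, a, b):
    """Mean negative log-likelihood of the labels under p = sigmoid(a*x + b)."""
    eps = 1e-12  # keep log() away from 0
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(a * x + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def gradient_step(xs, ys, a, b, lr):
    """One gradient-descent update of slope a and intercept b."""
    n = len(xs)
    grad_a = sum((sigmoid(a * x + b) - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(sigmoid(a * x + b) - y for x, y in zip(xs, ys)) / n
    return a - lr * grad_a, b - lr * grad_b

# Hypothetical binary dataset (deliberately not perfectly separable).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

a, b = 0.0, 0.0
losses = []
for _ in range(200):
    losses.append(log_loss(xs, ys, a, b))
    a, b = gradient_step(xs, ys, a, b, lr=0.1)

print(losses[0], losses[-1])  # the loss starts at ln(2) and decreases
```

Each pass through the loop corresponds to one row of such a table: current coefficients, current loss, then the updated coefficients.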

The Machine Learning “Advent Calendar” Day 8: Isolation Forest in Excel https://towardsdatascience.com/the-machine-learning-advent-calendar-day-8-isolation-forest-in-excel/ Mon, 08 Dec 2025 18:26:42 +0000 https://towardsdatascience.com/?p=607851 Isolation Forest may look technical, but its idea is simple: isolate points using random splits. If a point is isolated quickly, it is an anomaly; if it takes many splits, it is normal.

Using the tiny dataset 1, 2, 3, 9, we can see the logic clearly. We build several random trees, measure how many splits each point needs, average the depths, and convert them into anomaly scores. Short depths become scores close to 1, long depths close to 0.

The Excel implementation is painful, but the algorithm itself is elegant. It scales to many features, makes no assumptions about distributions, and even works with categorical data. Above all, Isolation Forest asks a different question: not “What is normal?”, but “How fast can I isolate this point?”
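The procedure summarized above (random splits, depth per point, averaged depths, anomaly scores) can be sketched for the same one-dimensional dataset 1, 2, 3, 9. Two assumptions are worth flagging: each point is given its own independent random splitting paths rather than shared trees, and the depth-to-score conversion uses the normalization from the original Isolation Forest paper, s = 2^(-E[h]/c(n)), which may differ from the article's Excel version.

```python
import random

def isolation_depth(point, data, rng, depth=0):
    """Number of random splits needed until `point` is alone."""
    if len(data) <= 1 or min(data) == max(data):
        return depth
    split = rng.uniform(min(data), max(data))
    # Keep only the side of the split that still contains the point.
    if point < split:
        side = [v for v in data if v < split]
    else:
        side = [v for v in data if v >= split]
    return isolation_depth(point, side, rng, depth + 1)

def average_path_length(n):
    """Normalizer c(n) = 2*H(n-1) - 2*(n-1)/n, with H the harmonic number."""
    if n <= 1:
        return 0.0
    harmonic = sum(1.0 / i for i in range(1, n))
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_scores(data, n_trees=500, seed=0):
    """Average the isolation depth over many random paths, then score."""
    rng = random.Random(seed)
    c = average_path_length(len(data))
    scores = {}
    for point in data:
        depths = [isolation_depth(point, data, rng) for _ in range(n_trees)]
        mean_depth = sum(depths) / n_trees
        scores[point] = 2.0 ** (-mean_depth / c)  # short depth -> score near 1
    return scores

scores = anomaly_scores([1, 2, 3, 9])
print(scores)  # 9 receives the highest score
```

Because 9 sits far from the other values, most random splits land in the gap between 3 and 9 and isolate it at depth 1, which is why its score is the largest.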

The Greedy Boruta Algorithm: Faster Feature Selection Without Sacrificing Recall https://towardsdatascience.com/the-greedy-boruta-algorithm-faster-feature-selection-without-sacrificing-recall/ Sun, 30 Nov 2025 13:00:00 +0000 https://towardsdatascience.com/?p=607762 A modification to the Boruta algorithm that dramatically reduces computation while maintaining high sensitivity

Metric Deception: When Your Best KPIs Hide Your Worst Failures https://towardsdatascience.com/metric-deception-when-your-best-kpis-hide-your-worst-failures/ Sat, 29 Nov 2025 15:00:00 +0000 https://towardsdatascience.com/?p=607765 The most dangerous KPIs aren’t broken; they’re the ones trusted long after they’ve lost their meaning.

The Absolute Beginner’s Guide to Pandas DataFrames https://towardsdatascience.com/the-absolute-beginners-guide-to-pandas-dataframes/ Mon, 17 Nov 2025 14:00:00 +0000 https://towardsdatascience.com/?p=607651 Learn how to initialize dataframes from dictionaries, lists, and NumPy arrays

Spearman Correlation Coefficient for When Pearson Isn’t Enough https://towardsdatascience.com/spearman-correlation-coefficient-for-when-pearson-isnt-enough/ Thu, 13 Nov 2025 12:30:00 +0000 https://towardsdatascience.com/?p=607617 Not all relationships are linear, and that is where Spearman comes in.
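A small example may make the contrast concrete. Spearman's coefficient is the Pearson correlation computed on ranks, so it reaches 1 for any strictly increasing relationship, linear or not. The data below (y = exp(x)) and the helper names are chosen here for illustration and do not come from the article.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# A monotonic but strongly non-linear relationship.
xs = [1, 2, 3, 4, 5, 6]
ys = [math.exp(x) for x in xs]

print(pearson(xs, ys))   # noticeably below 1
print(spearman(xs, ys))  # 1 up to floating-point rounding
```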

Evaluating Synthetic Data — The Million Dollar Question https://towardsdatascience.com/evaluating-synthetic-data-the-million-dollar-question-a54701d1b621/ Fri, 07 Nov 2025 20:23:50 +0000 https://towardsdatascience.com/?p=607583 Learn how to evaluate synthetic data quality using the Maximum Similarity Test — a simple, quantitative approach for assessing fidelity, utility, and privacy in synthetic datasets.

Expected Value Analysis in AI Product Management https://towardsdatascience.com/expected-value-analysis-in-ai-product-management/ Thu, 06 Nov 2025 16:00:00 +0000 https://towardsdatascience.com/?p=607575 An introduction to key concepts and practical applications

What to Do When Your Credit Risk Model Works Today, but Breaks Six Months Later https://towardsdatascience.com/your-credit-risk-model-works-today-it-breaks-in-six-months/ Tue, 04 Nov 2025 18:29:20 +0000 https://towardsdatascience.com/?p=607556 Here’s why it happens — and how to fix it

The Pearson Correlation Coefficient, Explained Simply https://towardsdatascience.com/pearson-correlation-coefficient-explained-simply/ Sat, 01 Nov 2025 16:00:00 +0000 https://towardsdatascience.com/?p=607538 A simple explanation of the Pearson correlation coefficient with examples

Using NumPy to Analyze My Daily Habits (Sleep, Screen Time & Mood) https://towardsdatascience.com/using-numpy-to-analyze-my-daily-habits-sleep-screen-time-mood/ Tue, 28 Oct 2025 18:19:04 +0000 https://towardsdatascience.com/?p=607513 Can I use NumPy to figure out how my habits affect my mood and productivity?

Building a Monitoring System That Actually Works https://towardsdatascience.com/building-a-monitoring-system-that-actually-works/ Mon, 27 Oct 2025 18:31:18 +0000 https://towardsdatascience.com/?p=607498 A step-by-step guide to catching real anomalies without drowning in false alerts

The Power of Framework Dimensions: What Data Scientists Should Know https://towardsdatascience.com/the-power-of-framework-dimensions-what-data-scientists-should-know/ Sun, 26 Oct 2025 16:00:00 +0000 https://towardsdatascience.com/?p=607491 Practical guidance and a case study

Hidden Gems in NumPy: 7 Functions Every Data Scientist Should Know https://towardsdatascience.com/hidden-gems-in-numpy-7-functions-every-data-scientist-should-know/ Tue, 21 Oct 2025 17:33:00 +0000 https://towardsdatascience.com/?p=607447 I’ve been learning data analytics for a year now. So far, I can consider myself confident in SQL and Power BI. The transition to Python has been quite exciting. I’ve been exposed to some neat and smarter approaches to data analysis. After brushing up on my skills on the Python fundamentals, the ideal next step […]

Statistical Method mcRigor Enhances the Rigor of Metacell Partitioning in Single-Cell Data Analysis https://towardsdatascience.com/statistical-method-mcrigor-enhances-the-rigor-of-metacell-partitioning-in-single-cell-data-analysis/ Fri, 17 Oct 2025 12:30:00 +0000 https://towardsdatascience.com/?p=607407 mcRigor detects dubious metacells within each metacell partition and selects the optimal metacell partitioning method and hyperparameter for a given dataset

What Makes a Language Look Like Itself? https://towardsdatascience.com/what-makes-a-language-look-like-itself/ Thu, 02 Oct 2025 14:00:00 +0000 https://towardsdatascience.com/?p=607322 How simple statistics reveal the visual fingerprints of 20 languages

The Gini Coefficient: From Lorenz Curves to Model Evaluation https://towardsdatascience.com/beyond-roc-auc-and-ks-gini-coefficient-explained-simply/ Tue, 30 Sep 2025 15:30:00 +0000 https://towardsdatascience.com/?p=607304 Understanding how the Gini and Lorenz curves help measure how well a model separates defaulters from non-defaulters.
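In model evaluation, the Gini coefficient is commonly computed from the ROC AUC as Gini = 2 * AUC - 1. The sketch below uses that relationship on a made-up set of scores and default labels; whether the article derives it the same way from Lorenz curves is an assumption, and the function names are hypothetical.

```python
def auc(scores, labels):
    """Probability that a random defaulter scores above a random non-defaulter."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(scores, labels):
    """Gini coefficient of a scoring model: 2 * AUC - 1."""
    return 2.0 * auc(scores, labels) - 1.0

# Hypothetical predicted default probabilities; label 1 = defaulter.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]

print(auc(scores, labels))   # 8 of 9 defaulter/non-defaulter pairs are ordered correctly
print(gini(scores, labels))
```

A model that ranks every defaulter above every non-defaulter gets a Gini of 1, while random scoring gives a value near 0.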

The Kolmogorov–Smirnov Statistic, Explained: Measuring Model Power in Credit Risk Modeling https://towardsdatascience.com/kolmogorov-smirnov-statistic-explained-measuring-model-power-in-credit-risk-modeling/ Mon, 22 Sep 2025 20:56:13 +0000 https://towardsdatascience.com/?p=607209 Understanding how banks use the KS statistic in loan approvals.
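For a scoring model, the KS statistic is the largest gap between the cumulative score distributions of the two groups (here, defaulters and non-defaulters). The following is a minimal sketch with invented scores and labels; the helper name `ks_statistic` is hypothetical, and the article's banking workflow may bin scores differently.

```python
def ks_statistic(scores, labels):
    """Maximum distance between the empirical CDFs of the two classes."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(1 for s in pos if s <= t) / len(pos)
        cdf_neg = sum(1 for s in neg if s <= t) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

# Hypothetical predicted default probabilities; label 1 = defaulter.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]

print(ks_statistic(scores, labels))  # largest gap between the two CDFs
```

A KS of 0 means the two score distributions coincide, and a KS of 1 means some threshold separates the groups perfectly.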
