
Hidden Gems in NumPy: 7 Functions Every Data Scientist Should Know

NumPy tricks you didn’t know you needed (until now)

I’ve been learning data analytics for a year now, and so far I consider myself confident in SQL and Power BI. The transition to Python has been exciting: it has exposed me to some neat, smarter approaches to data analysis.

After brushing up on Python fundamentals, the ideal next step was to start learning some of the Python libraries for data analysis. NumPy is one of them, and being a math nerd, I naturally enjoyed exploring it.

This library is designed for individuals who want to perform mathematical computations using Python, from basic mathematics and algebra to advanced concepts like calculus. NumPy can pretty much do it all.

In this article, I want to introduce you to some NumPy functions I’ve been playing around with. Whether you’re a data scientist, financial analyst, or research nerd, these functions will help you out a lot. Without further ado, let’s get to it.

Sample Dataset (used throughout)

Before diving in, I’ll define a small dataset that will anchor all examples:

import numpy as np
temps = np.array([30, 32, 29, 35, 36, 33, 31])

Using this small temperature dataset, I’ll be sharing 7 functions that make array operations effortless.

1. np.where() — The Vectorized If-Else

Before I define what this function is, here’s a quick showcase:

arr = np.array([10, 15, 20, 25, 30])
indices = np.where(arr > 20)
print(indices)

Output: (array([3, 4]),)

np.where() is a condition-based function. When a condition is specified, it outputs the index (or indices) where that condition is true.
For instance, in the example above, I declared an array and called np.where() to retrieve the positions where the element is greater than 20. The output is (array([3, 4]),), because those are the indices where the condition holds: the elements 25 and 30.

Conditional selection/replacement

It’s also useful when you’re trying to define a custom output for the outputs that meet your condition. This is used a lot in data analysis. For instance:

import numpy as np
arr = np.array([1, 2, 3, 4, 5])
result = np.where(arr % 2 == 0, 'even', 'odd')
print(result)

Output: ['odd' 'even' 'odd' 'even' 'odd']

The example above checks each element for evenness. Where the condition is true, the value is replaced with 'even'; where it is false, it’s replaced with 'odd'.
Alright, let’s apply this to our small dataset.

Problem: Replace all temperatures above 35°C with 35 (cap extreme readings).

In real-world data, especially from sensors, weather stations, or user inputs, outliers are quite common: sudden spikes or values that simply aren’t realistic.

For example, a temperature sensor might momentarily glitch and record 42°C when the actual temperature was 35°C.

Leaving such anomalies in your data can:

  • Skew averages — a single high value can pull your mean upward.
  • Distort visualisations — charts might stretch to accommodate a few extreme points.
  • Mislead models — machine learning algorithms are sensitive to unexpected ranges.
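To see that first point in action, here’s a quick sketch: a hypothetical week of readings where one sensor glitch recorded 42°C, and how capping it with np.where() brings the mean back in line.

```python
import numpy as np

# Hypothetical readings: one glitched value of 42 in an otherwise normal week
readings = np.array([30, 32, 29, 35, 42, 33, 31])

print(readings.mean())  # about 33.14, pulled up by the glitch

capped = np.where(readings > 35, 35, readings)
print(capped.mean())    # about 32.14 once the spike is capped
```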

Let’s fix that:

adjusted = np.where(temps > 35, 35, temps)

Output: array([30, 32, 29, 35, 35, 33, 31])

Looks much better now. With a single line of code, we fixed the unrealistic outlier in our dataset.

2. np.clip() — Keep Values in Range

In many practical datasets, values can spill outside the expected range, probably due to measurement noise, user error, or scaling mismatches.

For instance:

  • A temperature sensor might read −10°C when the lowest possible is 0°C.
  • A model output might predict probabilities like 1.03 or −0.05 due to rounding.
  • When normalising pixel values for an image, a few might go beyond 0–255.

These “out-of-bounds” values may seem minor, but they can:

  • Break downstream calculations (e.g., log or percentage computations).
  • Cause unrealistic plots or artefacts (especially in signal/image processing).
  • Distort normalisation and make metrics unreliable.

np.clip() neatly solves this problem by constraining all elements of an array within a specified minimum and maximum range. It’s kinda like setting boundaries in your dataset.

Example:
Problem: Ensure all readings stay within [28, 35] range.

clipped = np.clip(temps, 28, 35)
clipped

Output: array([30, 32, 29, 35, 35, 33, 31])

Here’s what it does:

  • Any value below 28 becomes 28.
  • Any value above 35 becomes 35.
  • Everything else stays the same.

Sure, this could also be done with np.where(), like so:
temps = np.where(temps < 28, 28, np.where(temps > 35, 35, temps))
But I’d rather go with np.clip() because it’s cleaner and easier to read.
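If you want to convince yourself the two approaches agree, here’s a quick sanity check on our temperature dataset:

```python
import numpy as np

temps = np.array([30, 32, 29, 35, 36, 33, 31])

via_clip = np.clip(temps, 28, 35)
via_where = np.where(temps < 28, 28, np.where(temps > 35, 35, temps))

print(np.array_equal(via_clip, via_where))  # True
```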

3. np.ptp() — Find Your Data’s Range in One Line

np.ptp() (peak-to-peak) basically shows you the difference between the maximum and minimum elements.

It’s basically:
np.ptp(a) == np.max(a) - np.min(a)

But in one clean, expressive function.

Here’s how it works

arr = np.array([[1, 5, 2],
                [8, 3, 7]])

# Calculate the peak-to-peak range of the entire array
range_all = np.ptp(arr)
print(f"Peak-to-peak range of the entire array: {range_all}")

So that’ll be our maximum value (8) minus our minimum value (1):

Output: Peak-to-peak range of the entire array: 7

So why is this useful? Averages tell you where your data sits, but how much your data varies is often just as important. In weather data, for instance, the range shows you how stable or volatile conditions were.

Instead of separately calling max() and min() and subtracting manually, np.ptp() makes it concise, readable, and vectorised, which is especially useful when you’re computing ranges across multiple rows or columns.
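Here’s a quick sketch of that axis behaviour, reusing the 2-D array from above (the "one sensor per row" framing is just for illustration):

```python
import numpy as np

# Think of each row as one sensor's readings over three days
arr = np.array([[1, 5, 2],
                [8, 3, 7]])

print(np.ptp(arr, axis=0))  # range down each column: [7 2 5]
print(np.ptp(arr, axis=1))  # range across each row:  [4 5]
```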

Now, let’s apply this to our dataset.

Problem: How much did the temperature vary this week?

temps = np.array([30, 32, 29, 35, 36, 33, 31])
np.ptp(temps)

Output: np.int64(7)

This tells us the temperature fluctuated by 7°C over the period, from 29°C to 36°C.

4. np.diff() — Detect Daily Changes

np.diff() is the quickest way to measure momentum, growth, or decline across time. It basically calculates differences between elements in an array.

To paint a picture for you, if your dataset were a journey, np.ptp() tells you how far you’ve travelled overall, while np.diff() tells you how far you moved between each stop.

Essentially:
np.diff([a1, a2, a3, ...]) = [a2 - a1, a3 - a2, ...]

Let’s apply this to our dataset.

Let’s look at our temperature data again:

temps = np.array([30, 32, 29, 35, 36, 33, 31])
daily_change = np.diff(temps)
print(daily_change)

Output: [ 2 -3 6 1 -3 -2]

In the real world, np.diff() is used for:

  • Time series analysis — Track daily changes in temperature, sales, or stock prices.
  • Signal processing — Identify spikes or sudden drops in sensor data.
  • Data validation — Detect jumps or inconsistencies between consecutive measurements.
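The data-validation use case is easy to sketch on our dataset: pair np.diff() with np.where() to flag suspiciously large day-to-day swings (the 5°C threshold here is an arbitrary choice for illustration):

```python
import numpy as np

temps = np.array([30, 32, 29, 35, 36, 33, 31])
daily_change = np.diff(temps)

# Flag any day-to-day swing larger than 5°C (threshold picked for illustration)
jumps = np.where(np.abs(daily_change) > 5)[0]
print(jumps)  # [2] -> the 6°C jump from 29°C to 35°C
```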

5. np.gradient() — Capture Smooth Trends and Slopes

To be honest, when I first came across this, I found it hard to grasp. Essentially, np.gradient() computes the numerical gradient (a smooth estimate of change, or slope) across your data. It’s similar to np.diff(); however, np.gradient() works even if your x-values are unevenly spaced (e.g., irregular timestamps), and it provides a smoother signal, making trends easier to interpret visually.

For instance:

time = np.array([0, 1, 2, 4, 7])
temp = np.array([30, 32, 34, 35, 36])
np.gradient(temp, time)

Output: array([2. , 2. , 1.5 , 0.43333333, 0.33333333])

Let’s break this down a bit.

Normally, np.gradient() assumes the x-values (your index positions) are evenly spaced — like 0, 1, 2, 3, 4, etc. But in the example above, the time array isn’t evenly spaced: notice the jumps are 1, 1, 2, 3. That means temperature readings weren’t taken every hour.

By passing time as the second argument, we’re essentially telling NumPy to use the actual time gaps when calculating how fast temperature changes.

To explain the output above: between 0 and 2 hours, the temperature rose quickly (about 2°C per hour), and between 2 and 7 hours the rise slowed, from about 1.5°C per hour down to roughly 0.3°C per hour.

Let’s apply this to our dataset.

Problem: Estimate the rate of temperature change (like slope).

temps = np.array([30, 32, 29, 35, 36, 33, 31])
grad = np.gradient(temps)
np.round(grad, 2)

Output: array([ 2. , -0.5, 1.5, 3.5, -1. , -2.5, -2. ])

You can read it like:

  • +2 → temp rising fast (early warm-up)
  • -0.5 → slight drop (minor cooling)
  • +1.5, +3.5 → strong rise (big heat jump)
  • -1, -2.5, -2 → steady cooling trend

So this tells a story of the week’s temperature. Let’s visualise this real quick with matplotlib.
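Here’s a minimal sketch of that plot, assuming matplotlib is installed (I save to a file here; plt.show() works the same interactively):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless-safe backend; drop this line when running interactively
import matplotlib.pyplot as plt

temps = np.array([30, 32, 29, 35, 36, 33, 31])
grad = np.gradient(temps)
days = np.arange(len(temps))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(days, temps, marker='o', label='Temperature (°C)')
ax.plot(days, grad, marker='s', linestyle='--', label='Gradient (°C/day)')
ax.axhline(0, color='grey', linewidth=0.8)  # above the line = warming, below = cooling
ax.set_xlabel('Day')
ax.legend()
fig.savefig('temps_gradient.png')
```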

Notice how easy it is to interpret the visualisation. That’s what makes np.gradient() so useful.

6. np.percentile() – Spot Outliers or Thresholds

This is one of my favourite functions. np.percentile() tells you the value below which a given percentage of your data falls. NumPy’s documentation defines it well:

numpy.percentile computes the q-th percentile of data along a specified axis, where q is a percentage between 0 and 100.

In other words, the q-th percentile is a threshold: q% of your records fall at or below it. Asking for the 25th, 50th, and 75th percentiles, for example, gives you the quartiles of your data.

Let’s try this out with sales records

Let’s say your monthly sales target is $60,000.

You can use np.percentile() to understand how often and how strongly you’re hitting or missing that target.

import numpy as np
sales = np.array([45, 50, 52, 48, 60, 62, 58, 70, 72, 66, 63, 80])
np.percentile(sales, [25, 50, 75, 90])

Output: [51.5 61.  67.  71.8]

To break this down:

  • 25th percentile = $51.5k → 25% of your months were below $51.5k (low performers)
  • 50th percentile = $61k → half of your months were below $61k (around your target)
  • 75th percentile = $67k → your top-performing months are comfortably above target
  • 90th percentile = $71.8k → your best months hit $71.8k or more

So now you can say:
“We hit or exceeded our $60k target in roughly half of all months.”

This can also be visualised using a KPI card. That’s KPI storytelling with data, and it’s pretty powerful.

Let’s apply that to our temperature dataset.

import numpy as np
temps = np.array([30, 32, 29, 35, 36, 33, 31])
np.percentile(temps, [25, 50, 75])

Output: [30.5 32.  34. ]

Here’s what it means:

  • 25% of the readings are below 30.5°C
  • 50% (the median) are below 32°C
  • 75% are below 34°C
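One common way to turn these percentiles into outlier thresholds is Tukey’s IQR rule: flag anything beyond 1.5 × IQR outside the quartiles. Here’s a minimal sketch on our dataset:

```python
import numpy as np

temps = np.array([30, 32, 29, 35, 36, 33, 31])

q1, q3 = np.percentile(temps, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey's fences

outliers = temps[(temps < lower) | (temps > upper)]
print(lower, upper)  # 25.25 39.25
print(outliers)      # [] -- every reading sits inside the fences
```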

7. np.unique() — Quickly Find Unique Values and Their Counts

This function is perfect for cleaning, summarising, or categorising data. np.unique() finds all the unique elements in your array. It can also check how often those elements appear in your array.

For instance, let’s say you have a list of product categories from your store:

import numpy as np
products = np.array([
    'Shoes', 'Bags', 'Bags', 'Hats',
    'Shoes', 'Shoes', 'Belts', 'Hats'
])
np.unique(products)

Output: array(['Bags', 'Belts', 'Hats', 'Shoes'], dtype='<U5')

You can take things further by counting how many times each value appears using the return_counts parameter:

np.unique(products, return_counts=True)

Output: (array(['Bags', 'Belts', 'Hats', 'Shoes'], dtype='<U5'), array([2, 1, 2, 3]))
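A pattern I often reach for is zipping the two arrays into a plain dictionary for a quick frequency summary:

```python
import numpy as np

products = np.array(['Shoes', 'Bags', 'Bags', 'Hats',
                     'Shoes', 'Shoes', 'Belts', 'Hats'])

values, counts = np.unique(products, return_counts=True)

# Convert NumPy scalars to plain Python types for a clean summary
summary = {str(v): int(c) for v, c in zip(values, counts)}
print(summary)  # {'Bags': 2, 'Belts': 1, 'Hats': 2, 'Shoes': 3}
```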

Let’s apply that to my temperature dataset. Currently, there aren’t any duplicates, so we’ll just get the same values back.

import numpy as np
temps = np.array([30, 32, 29, 35, 36, 33, 31])
np.unique(temps)

Output: array([29, 30, 31, 32, 33, 35, 36])

Notice how the figures come back sorted in ascending order, too.

You can also ask NumPy to count how many times each value appears:

np.unique(temps, return_counts=True)

Output: (array([29, 30, 31, 32, 33, 35, 36]), array([1, 1, 1, 1, 1, 1, 1]))

Wrapping up

These are the functions I’ve stumbled upon so far, and I find them pretty helpful in data analysis. The beauty of NumPy is that the more you play with it, the more you uncover these tiny one-liners that replace pages of code. So next time you’re wrangling data or debugging a messy dataset, step away from Pandas for a bit and try dropping in one of these functions. Thanks for reading!


Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

