Histograms can be misleading


TOGETHER WITH ASSEMBLYAI

Speech-to-text at unmatched accuracy with AssemblyAI

AssemblyAI has made it much easier to distinguish speakers and determine what each of them said in a conversation, resulting in:

13% more accurate transcript than previous versions.
85.4% reduction in speaker count errors.
5 new languages (total 16 supported languages).

A demo is shown below.

First, import the package, set the API key, and transcribe the file while setting speaker_labels parameter to True:

Next, print the speaker labels:

AssemblyAI’s speech-to-text models rank top across all major industry benchmarks. You can transcribe 1 hour of audio in ~35 seconds at an industry-leading accuracy of 92.5% (for English).

Sign up today and get $50 in free credits!

Start building with AssemblyAI


TODAY'S ISSUE

Today's daily dose of data science

Histograms can be misleading

Histograms are quite common in data analysis and visualization. Yet, they can be highly misleading at times.

Why?

To begin, a histogram represents an aggregation of one-dimensional data points based on a specific bin width:

This means that setting different bin widths on the same dataset can generate entirely different histograms.

This is evident from the image below:

Altering the bin width changes the type of histogram created

As shown above, each histogram conveys a different story, even though the underlying data is the same.
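The effect of bin width can be reproduced with a small sketch. The dataset, the `histogram_counts` helper, and the three bin widths below are illustrative assumptions, not the data behind the figure above, and the code uses only the standard library so it runs without plotting packages.

```python
import math

# Hypothetical dataset with two clusters, one near 2 and one near 8.
DATA = [1.0, 1.5, 2.0, 2.2, 2.5, 3.0, 7.0, 7.5, 7.8, 8.0, 8.5, 9.0]


def histogram_counts(data, bin_width, low, high):
    """Count how many points fall into each half-open bin [start, start + bin_width)."""
    n_bins = math.ceil((high - low) / bin_width)
    counts = [0] * n_bins
    for x in data:
        if low <= x < high:
            counts[int((x - low) // bin_width)] += 1
    return counts


# Same data, three bin widths, three different pictures.
for width in (5, 2, 1):
    counts = histogram_counts(DATA, width, low=0, high=10)
    print(f"bin width {width}: {counts}")
    for i, count in enumerate(counts):
        start = i * width
        print(f"  [{start:>2}, {start + width:>2}) {'#' * count}")

# bin width 5: [6, 6]                          -> looks uniform
# bin width 2: [2, 4, 0, 3, 3]                 -> two clusters with a gap
# bin width 1: [0, 2, 3, 1, 0, 0, 0, 3, 2, 1]  -> finer detail inside each cluster
```

With a bin width of 5 the data looks evenly spread, while the narrower bins reveal two separate clusters.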

Thus, solely looking at a histogram to understand the data distribution may lead to incorrect or misleading conclusions.

Here, the takeaway is not that histograms should not be used. Instead, it is that whenever you generate any summary statistic, you lose essential information.

In our case, every bin of a histogram is itself a summary statistic: an aggregated count of the points that fall inside it. How those points are spread within the bin is lost.
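To make the information loss concrete, here is a minimal sketch with two made-up datasets that produce exactly the same bin counts even though their values differ. The datasets and the `bin_counts` helper are assumptions chosen for illustration.

```python
import statistics

# Two hypothetical datasets: every value differs, but each bin receives the same count.
data_a = [0.1, 0.2, 0.3, 2.1, 2.2, 2.3]
data_b = [1.7, 1.8, 1.9, 3.7, 3.8, 3.9]


def bin_counts(data, bin_width, n_bins):
    """Aggregate points into n_bins bins of equal width, starting at 0."""
    counts = [0] * n_bins
    for x in data:
        counts[int(x // bin_width)] += 1
    return counts


counts_a = bin_counts(data_a, bin_width=2, n_bins=2)
counts_b = bin_counts(data_b, bin_width=2, n_bins=2)

print(counts_a, counts_b)  # identical counts for both datasets
print(statistics.mean(data_a), statistics.mean(data_b))  # the means are not identical
```

A histogram with this bin width cannot tell the two datasets apart, although one sits near the left edge of each bin and the other near the right edge.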

Thus, it is always important to look at the underlying data distribution.

For instance, to understand the data distribution, I prefer a violin (or KDE) plot. It gives me a clearer picture of the data distribution than a histogram does.

Visualizing density provides more information and clarity about the data distribution than a histogram.
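As a rough sketch of what a KDE computes, the snippet below places a Gaussian kernel on every data point and averages them. The dataset, the `gaussian_kde` helper, and the bandwidth of 0.5 are assumptions for illustration. In practice a plotting library would draw the curve, and the bandwidth is itself a smoothing choice worth varying.

```python
import math

# Hypothetical dataset with two clusters, one near 2 and one near 8.
DATA = [1.0, 1.5, 2.0, 2.2, 2.5, 3.0, 7.0, 7.5, 7.8, 8.0, 8.5, 9.0]


def gaussian_kde(data, bandwidth):
    """Return a density function: the average of Gaussian kernels centred on each point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density


density = gaussian_kde(DATA, bandwidth=0.5)

# Text rendering of the smooth density curve on a grid from 0 to 10.
for i in range(21):
    x = i * 0.5
    print(f"{x:4.1f} {'#' * round(density(x) * 100)}")
```

The estimate is a continuous curve with a clear dip between the two clusters, rather than a set of counts tied to bin edges.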

👉 Over to you: What other measures do you take when using summary statistics?

Extended Piece #1

Beyond grid and random search

Grid search and random search have several limitations.

They are computationally expensive: grid search evaluates every combination exhaustively, and neither method uses the results of earlier trials to choose the next one.
The search is restricted to the specified hyperparameter range. But what if the ideal hyperparameter exists outside that range?
They only ever evaluate a finite set of candidate values, even if the hyperparameter is continuous.
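The last limitation can be seen in a minimal grid search sketch. The `validation_score` objective, its peak at (0.37, 0.012), and the grid values are hypothetical stand-ins for a real model's validation score.

```python
import itertools


def validation_score(learning_rate, regularization):
    """Hypothetical smooth objective; its best value (0) is at (0.37, 0.012)."""
    return -((learning_rate - 0.37) ** 2) - 100 * (regularization - 0.012) ** 2


# Assumed grid: three candidate values per hyperparameter.
grid = {
    "learning_rate": [0.1, 0.5, 1.0],
    "regularization": [0.001, 0.01, 0.1],
}


def grid_search(objective, grid):
    """Evaluate every combination in the grid and keep the best one."""
    names = list(grid)
    best_params, best_score, n_evals = None, float("-inf"), 0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        n_evals += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evals


best_params, best_score, n_evals = grid_search(validation_score, grid)
print(best_params, n_evals)  # best grid point, found after 3 * 3 = 9 evaluations
print(best_score < validation_score(0.37, 0.012))  # True: the grid never reaches the peak
```

Every combination is evaluated, yet the search can only return values that were listed in the grid, so the true optimum between grid points is missed.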

Bayesian optimization solves this.

It’s fast, informed, and performant, as depicted below:

Learning and using optimized hyperparameter tuning will be extremely helpful if you wish to build large ML models quickly.

Learn Bayesian Optimization from scratch here →

Extended Piece #2

Beyond linear regression

Linear regression makes some strict assumptions about the type of data it can model, as depicted below.

Can you be sure that these assumptions will never break?

Nothing stops real-world datasets from violating these assumptions.

That is why being aware of linear regression’s extensions is immensely important.

Generalized linear models (GLMs) are one such extension.

They relax the assumptions of linear regression to make linear models more adaptable to real-world datasets.

Learn Generalized linear models from scratch here →

TIP OF THE DAY

Short-circuiting in Python

Consider two functions that take a decent amount of time to execute and return a boolean:

long_function
longer_function

We want to run a conditional if either of them returns True. Rather than calling both functions up front and storing their results, place the calls directly in the if condition, with the cheaper one first.

This way, if long_function() returns True, longer_function() is never executed, because Python's or operator stops evaluating as soon as the result is known. This reduces run time.

A similar optimization applies to and: if the first call returns False, the second is skipped.
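A minimal sketch of the difference is shown below. The function bodies, the `result` parameter, and the `calls` list are assumptions added so the example can record which functions actually ran. Real versions would do expensive work instead.

```python
calls = []  # records which functions were executed


def long_function(result=True):
    """Stand-in for a slow check; `result` is a hypothetical knob for the demo."""
    calls.append("long_function")
    return result


def longer_function(result=True):
    """Stand-in for an even slower check."""
    calls.append("longer_function")
    return result


# Eager version: both functions run before the condition is evaluated.
calls.clear()
a = long_function()
b = longer_function()
if a or b:
    pass
eager_calls = list(calls)

# Short-circuit version: `or` stops as soon as long_function() returns True.
calls.clear()
if long_function() or longer_function():
    pass
or_calls = list(calls)

# Same idea with `and`: a False first operand means the second is skipped.
calls.clear()
if long_function(result=False) and longer_function():
    pass
and_calls = list(calls)

print(eager_calls)  # ['long_function', 'longer_function']
print(or_calls)     # ['long_function']
print(and_calls)    # ['long_function']
```

Putting the cheaper function first gives the largest saving, since it is the one that always runs.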

THAT'S A WRAP

