Histograms can be misleading




TODAY'S ISSUE

Today's daily dose of data science

Histograms can be misleading

Histograms are quite common in data analysis and visualization. Yet, they can be highly misleading at times.

Why?

To begin, a histogram represents an aggregation of one-dimensional data points based on a specific bin width:

This means that setting different bin widths on the same dataset can generate entirely different histograms.

This is evident from the image below:

As shown above, each histogram conveys a different story, even though the underlying data is the same.
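To make the effect concrete, here is a minimal sketch in plain Python. The helper `histogram_counts` and the ten data points are made up for illustration; in practice you would likely reach for `numpy.histogram` or a plotting library instead.

```python
import math
from collections import Counter

def histogram_counts(values, bin_width, start=0.0):
    """Map each bin's left edge to the number of values in [edge, edge + bin_width)."""
    bin_ids = Counter(math.floor((v - start) / bin_width) for v in values)
    return {start + k * bin_width: bin_ids[k] for k in sorted(bin_ids)}

# Hypothetical data: two tight clusters, one near 2 and one near 8.
data = [1.8, 1.9, 2.0, 2.1, 2.2, 7.8, 7.9, 8.0, 8.1, 8.2]

narrow = histogram_counts(data, bin_width=1.0)   # bars in two separate groups
medium = histogram_counts(data, bin_width=5.0)   # two equal bars
wide = histogram_counts(data, bin_width=10.0)    # a single bar

for name, hist in [("width 1", narrow), ("width 5", medium), ("width 10", wide)]:
    print(name)
    for edge, count in hist.items():
        print(f"  bin starting at {edge}: {'#' * count}")
```

With a width of 1 the two clusters are visible, with a width of 5 the data looks evenly spread, and with a width of 10 everything collapses into one bar.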

Thus, solely looking at a histogram to understand the data distribution may lead to incorrect or misleading conclusions.

Here, the takeaway is not that histograms should not be used. Instead, it is that whenever you generate a summary statistic, you lose essential information.

In our case, every bin of a histogram is itself a summary statistic: an aggregated count of the points that fall inside it.

Thus, it is always important to look at the underlying data distribution.

For instance, to understand the data distribution, I prefer a violin (or KDE) plot, which gives me a clearer view of the distribution than a histogram.

Visualizing density provides more information and clarity about the data distribution than a histogram.
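As a rough illustration of what a KDE computes, the sketch below builds a Gaussian kernel density estimate by hand in plain Python. The function name `gaussian_kde`, the sample data, and the bandwidth of 0.5 are all assumptions chosen for this example; libraries such as SciPy and seaborn provide ready-made versions.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function that estimates density as an average of Gaussian bumps,
    one bump centered on each data point."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return density

# Hypothetical data: two tight clusters, one near 2 and one near 8.
data = [1.8, 1.9, 2.0, 2.1, 2.2, 7.8, 7.9, 8.0, 8.1, 8.2]

# The bandwidth is an arbitrary choice for illustration.
density = gaussian_kde(data, bandwidth=0.5)

for x in (2.0, 5.0, 8.0):
    print(f"x={x}: density={density(x):.4f}")
```

The estimate is high near both clusters and close to zero between them, and it is a smooth curve rather than a set of counts tied to bin edges.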

👉 Over to you: What other measures do you take when using summary statistics?

Extended Piece #1

Beyond grid and random search

There are several issues with grid search and random search:

  • They are computationally expensive: grid search exhaustively evaluates every combination, and neither method uses earlier results to decide what to try next.
  • The search is restricted to the specified hyperparameter range. But what if the ideal hyperparameter exists outside that range?
  • They only ever evaluate a finite set of candidate values, even if the hyperparameter is continuous.
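The cost and the discreteness can be seen in a small grid search sketch. Everything here is hypothetical: `validation_loss` is a stand-in objective with a known minimum, and the grid values are arbitrary.

```python
from itertools import product

def validation_loss(learning_rate, depth):
    """Hypothetical objective whose true minimum is at learning_rate=0.03, depth=5."""
    return 1000 * (learning_rate - 0.03) ** 2 + 0.01 * (depth - 5) ** 2

# Hypothetical search grid: only these values will ever be tried.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "depth": [2, 4, 8],
}

# Grid search evaluates every combination: 3 x 3 = 9 model fits here,
# and the count multiplies with each extra hyperparameter.
candidates = list(product(grid["learning_rate"], grid["depth"]))
best = min(candidates, key=lambda c: validation_loss(*c))

print(f"evaluations: {len(candidates)}")
print(f"best on grid: {best}")
```

The search returns the best point on the grid, but the true minimum of this objective sits between grid values and is never evaluated.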

Bayesian optimization solves this.

It’s fast, informed, and performant, as depicted below:

Learning and applying optimized hyperparameter tuning will be extremely helpful if you want to build large ML models quickly.

Learn Bayesian Optimization from scratch here →

Extended Piece #2

Beyond linear regression

Linear regression makes some strict assumptions about the type of data it can model, as depicted below.

Can you be sure that these assumptions will never break?

Nothing stops real-world datasets from violating these assumptions.

That is why being aware of linear regression’s extensions is immensely important.

Generalized linear models (GLMs) are one such extension.

They relax the assumptions of linear regression to make linear models more adaptable to real-world datasets.
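In standard textbook notation (the symbols x, β, μ, and g are introduced here for illustration and do not come from the original piece), the relaxation can be written as follows:

```latex
% Linear regression: Gaussian response whose mean is linear in the features
y \mid x \sim \mathcal{N}(\mu, \sigma^2), \qquad \mu = x^\top \beta

% GLM: the response follows an exponential-family distribution, and a
% link function g connects its mean to the linear predictor
y \mid x \sim \text{ExponentialFamily}(\mu), \qquad g(\mu) = x^\top \beta

% Example: Poisson regression for count data, with a log link
y \mid x \sim \text{Poisson}(\mu), \qquad \log \mu = x^\top \beta
```

Linear regression is the special case with a Gaussian response and the identity link.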

Learn Generalized linear models from scratch here →

TIP OF THE DAY

Short-circuiting in Python

Consider two functions that take a decent amount of time to execute and return a boolean:

  • long_function
  • longer_function

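We want to run a conditional block if at least one of them returns True. A better way to do this is to move the function calls into the if condition itself, instead of calling both functions beforehand and testing the stored results.

This way, if long_function() returns True, longer_function() will not be executed, because Python's or operator stops evaluating as soon as the result is known. This reduces run-time.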

A similar optimization can be achieved if we intend to use AND.
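A minimal sketch of the difference is shown below. The function bodies are placeholders (a call log stands in for the slow work), and the return values are assumed for the example.

```python
calls = []  # records which functions actually ran

def long_function():
    calls.append("long_function")    # placeholder for slow work
    return True

def longer_function():
    calls.append("longer_function")  # placeholder for even slower work
    return False

# Eager version: both functions run before the condition is checked.
first = long_function()
second = longer_function()
if first or second:
    pass
eager_calls = list(calls)

calls.clear()

# Short-circuit version: longer_function() is skipped once
# long_function() has returned True.
if long_function() or longer_function():
    pass
lazy_calls = list(calls)

print(eager_calls)  # ['long_function', 'longer_function']
print(lazy_calls)   # ['long_function']
```

With and, the second call is skipped whenever the first one returns False, so the same ordering idea applies.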

THAT'S A WRAP


Daily Dose of Data Science

Daily no-fluff issues that help you succeed and stay relevant in DS/ML roles.
