Histograms can be misleading




TODAY'S ISSUE

Today's daily dose of data science

Histograms can be misleading

Histograms are quite common in data analysis and visualization. Yet, they can be highly misleading at times.

Why?

To begin, a histogram represents an aggregation of one-dimensional data points based on a specific bin width:

This means that setting different bin widths on the same dataset can generate entirely different histograms.

This is evident from the image below:

As shown above, each histogram conveys a different story, even though the underlying data is the same.

Thus, solely looking at a histogram to understand the data distribution may lead to incorrect or misleading conclusions.
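
The bin-width effect can be reproduced with a few lines of NumPy. This is a minimal sketch on synthetic data (a hypothetical mixture of two normal distributions, not the dataset from the image above), and it prints text bars instead of drawing a plot. Note that np.histogram's bins argument sets the number of equal-width bins, so fewer bins means wider bins.

```python
import numpy as np

# Hypothetical data: two well-separated clusters (a bimodal distribution).
rng = np.random.default_rng(seed=0)
data = np.concatenate([
    rng.normal(loc=-2.0, scale=0.5, size=500),
    rng.normal(loc=2.0, scale=0.5, size=500),
])

def bin_counts(values, n_bins):
    """Return the count in each of n_bins equal-width bins."""
    counts, _edges = np.histogram(values, bins=n_bins)
    return counts

# Same data, three different bin widths.
counts_by_bins = {n_bins: bin_counts(data, n_bins) for n_bins in (2, 3, 12)}

for n_bins, counts in counts_by_bins.items():
    print(f"{n_bins} bins:")
    for count in counts:
        # One '#' per 10 data points in the bin.
        print("  " + "#" * (int(count) // 10), int(count))
```

With 2 bins the two bars come out nearly equal, which hides the gap between the clusters. With 3 bins the middle bar is much shorter than its neighbours, and only then does the two-cluster structure become visible.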

Here, the takeaway is not that histograms should not be used. Instead, it is that every bin of a histogram is a summary statistic (an aggregated count), and whenever you generate any summary statistic, you lose essential information.

Thus, it is always important to look at the underlying data distribution.

For instance, to understand the data distribution, I prefer a violin (or KDE) plot. Visualizing density gives me more information and clarity about the data distribution than a histogram does.
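
To make the idea of a density estimate concrete, here is a minimal sketch of a Gaussian KDE written directly in NumPy. The data is synthetic, and the helper name gaussian_kde and the bandwidth of 0.3 are assumptions chosen for illustration. In practice you would more likely call a plotting library such as seaborn (kdeplot or violinplot) than write this by hand.

```python
import numpy as np

def gaussian_kde(values, grid, bandwidth):
    """Estimate density on `grid` by averaging one Gaussian bump per data point."""
    # z[i, j] = scaled distance between grid point i and data point j
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical bimodal data, as in a two-cluster dataset.
rng = np.random.default_rng(seed=0)
data = np.concatenate([
    rng.normal(loc=-2.0, scale=0.5, size=500),
    rng.normal(loc=2.0, scale=0.5, size=500),
])

grid = np.linspace(-5.0, 5.0, 201)  # points at which to evaluate the density
density = gaussian_kde(data, grid, bandwidth=0.3)

# The estimate is a smooth curve: high near -2 and 2, close to zero around 0.
for x_value in (-2.0, 0.0, 2.0):
    index = int(np.argmin(np.abs(grid - x_value)))
    print(f"density at {x_value:+.1f}: {density[index]:.3f}")
```

A KDE still has a smoothing parameter (the bandwidth), so it is worth trying a couple of values here as well.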

👉 Over to you: What other measures do you take when using summary statistics?

Extended Piece #1

Beyond grid and random search

There are many issues with grid search and random search.

  • They are computationally expensive due to exhaustive search.
  • The search is restricted to the specified hyperparameter range. But what if the ideal hyperparameter exists outside that range?
  • They can ONLY perform discrete searches, even if the hyperparameter is continuous.
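
The last two limitations can be seen in a toy setting. The sketch below assumes a hypothetical one-dimensional loss with a known optimum (make_loss is invented for illustration, not a real tuning objective) and searches a fixed list of candidate values.

```python
import random

def make_loss(optimum):
    """Build a hypothetical validation loss whose minimum sits at `optimum`."""
    def loss(value):
        return (value - optimum) ** 2
    return loss

def grid_search(loss, candidates):
    """Evaluate every candidate and return the one with the lowest loss."""
    return min(candidates, key=loss)

def random_search(loss, candidates, n_trials, seed=0):
    """Evaluate a random subset of the candidates and return the best one."""
    sampled = random.Random(seed).sample(candidates, n_trials)
    return min(sampled, key=loss)

candidates = [0.1, 0.2, 0.3, 0.4, 0.5]

# Optimum between grid points: the search can only return a listed value.
best_inside = grid_search(make_loss(0.37), candidates)
print(best_inside)   # 0.4, never 0.37

# Optimum outside the specified range: the search stops at the boundary.
best_outside = grid_search(make_loss(0.80), candidates)
print(best_outside)  # 0.5

# Random search over the same candidates has the same restrictions.
best_random = random_search(make_loss(0.37), candidates, n_trials=3)
print(best_random in candidates)  # True
```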

Bayesian optimization solves this.

It’s fast, informed, and performant, as depicted below:

Learning and applying optimized hyperparameter tuning will be extremely helpful if you wish to build large ML models quickly.

Learn Bayesian Optimization from scratch here →

Extended Piece #2

Beyond linear regression

Linear regression makes some strict assumptions about the type of data it can model, as depicted below.

Can you be sure that these assumptions will never break?

Nothing stops real-world datasets from violating these assumptions.

That is why being aware of linear regression’s extensions is immensely important.

Generalized linear models (GLMs) are one such extension.

They relax the assumptions of linear regression to make linear models more adaptable to real-world datasets.
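
As one concrete example of a GLM, here is a minimal sketch of Poisson regression (count-valued targets with a log link) fitted with Newton's method in NumPy. The data is synthetic, and the function name fit_poisson_glm and the coefficients are assumptions made for illustration. It is not the implementation from the linked guide.

```python
import numpy as np

# Hypothetical count data: y ~ Poisson(exp(0.5 + 1.0 * x)).
rng = np.random.default_rng(seed=0)
n_samples = 500
x = rng.uniform(0.0, 2.0, size=n_samples)
X = np.column_stack([np.ones(n_samples), x])  # intercept column + feature
true_beta = np.array([0.5, 1.0])
y = rng.poisson(np.exp(X @ true_beta))

def fit_poisson_glm(X, y, n_iter=25):
    """Fit Poisson regression with a log link using Newton's method."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start from the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)               # predicted mean counts
        gradient = X.T @ (y - mu)           # gradient of the log-likelihood
        hessian = X.T @ (X * mu[:, None])   # negative Hessian: X^T diag(mu) X
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

beta_hat = fit_poisson_glm(X, y)
print("estimated coefficients:", beta_hat)
print("true coefficients:     ", true_beta)
```

Unlike linear regression, this model does not assume normally distributed targets with constant variance: the predicted mean is always positive, and the variance grows with the mean.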

Learn Generalized linear models from scratch here →

TIP OF THE DAY

Short-circuiting in Python

Consider two functions that take a decent amount of time to execute and return a boolean:

  • long_function
  • longer_function

We want to run a conditional if one of them returns True. An efficient way to do this is to place the function calls directly in the if condition, joined with OR, with the faster function first.

This way, if long_function() returns True, longer_function() will not be executed, because OR stops evaluating as soon as one operand is true. This reduces run-time.

A similar optimization can be achieved if we intend to use AND: if the first function returns False, the second one is never executed.
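
The behaviour described above can be sketched as follows. The function bodies are stand-ins (short sleeps instead of real work), and failing_function is a hypothetical extra helper added only to show the AND case.

```python
import time

calls = []  # records which functions actually ran

def long_function():
    time.sleep(0.1)  # stand-in for slow work
    calls.append("long_function")
    return True

def longer_function():
    time.sleep(0.3)  # stand-in for even slower work
    calls.append("longer_function")
    return True

def failing_function():
    """Hypothetical check that returns False, used for the AND case."""
    calls.append("failing_function")
    return False

# OR: the first call returns True, so the second is never evaluated.
if long_function() or longer_function():
    print("at least one check passed")
or_calls = list(calls)
print(or_calls)  # ['long_function']

# AND: the first call returns False, so the second is never evaluated.
calls.clear()
if failing_function() and longer_function():
    print("both checks passed")
and_calls = list(calls)
print(and_calls)  # ['failing_function']
```

In both cases longer_function() is skipped, so putting the cheaper call first saves the most time.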

THAT'S A WRAP
