💽 DATA IS KEY to Machine Learning 🤖

But in a world where data is the new oil, let's understand 6️⃣ issues with Machine Learning data⛔
{ Insufficient Data }

🔵 Even in a world full of data, insufficient data is still a problem

🟠 Models trained on insufficient data perform poorly in the real world

🟢 Insufficient data also leads to either overfitting or underfitting
{ Too Much Data }

🔵 Too much data also presents its own set of challenges, such as:

🟠 Data can be old & outdated, and so no longer relevant

🟢 Curse of dimensionality, i.e. too many features that are useless or less relevant
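One simple way to spot an obviously useless feature is to look for columns that never change. The sketch below assumes pandas and a made-up toy dataset (the column names are hypothetical); it is only a first-pass filter, not a full feature selection method.

```python
import pandas as pd

# Hypothetical toy dataset; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "country": ["IN", "IN", "IN", "IN"],  # same value in every row
    "income": [30, 45, 80, 62],
})

# A column with a single unique value cannot help a model tell rows apart.
constant_cols = [col for col in df.columns if df[col].nunique() <= 1]
reduced = df.drop(columns=constant_cols)

print("Dropped:", constant_cols)
print("Kept:", list(reduced.columns))
```

Features that are merely "less relevant" need more careful checks than this, such as looking at how each one relates to the target.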
{ Non-representative Data }

🔵 ML is simple: if you feed in garbage data, you get garbage output.

🟠 So, inaccurate or non-representative data leads to poor models

🟢 Selecting relevant data is a key skill
{ Missing Data }

🔵 Data is key, so missing values are a big problem

🟠Data cleaning solves this problem by substituting missing values using various techniques.

🟢 Substitution may lead to bias & hence poor accuracy
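A minimal sketch of one substitution technique, mean imputation, assuming pandas and a made-up series of temperature readings. It also shows one way substitution can introduce bias: filling every gap with the mean makes the data look less spread out than it really is.

```python
import pandas as pd

# Hypothetical temperature readings with two missing values.
temps = pd.Series([30.0, float("nan"), 34.0, 32.0, float("nan")])

print("Missing values:", int(temps.isna().sum()))

# Mean imputation: replace each missing value with the mean of the known ones.
filled = temps.fillna(temps.mean())

# The filled series has no gaps, but its spread has shrunk.
print("Std before filling:", temps.std())
print("Std after filling: ", filled.std())
```

Other options include filling with the median, forward-filling time series data, or dropping the incomplete rows, each with its own trade-offs.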
{ Duplicate Data }

🔵 Duplication of data is another major problem.

🟠 Removal of duplicates is easy using Pandas

🟢 What matters is how much clean & relevant data remains after removing duplicates
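A short sketch of duplicate removal with pandas, using a made-up table (the city names and values are hypothetical). It counts the duplicates first, so you can see how much data is left afterwards.

```python
import pandas as pd

# Hypothetical dataset with repeated rows.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
    "temp_c": [31, 38, 31, 33, 38],
})

# duplicated() marks every repeat of an earlier row as True.
n_dupes = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = df.drop_duplicates().reset_index(drop=True)

print(f"{n_dupes} duplicate rows removed, {len(deduped)} of {len(df)} rows remain")
```

By default rows must match on every column to count as duplicates; the `subset` parameter of `drop_duplicates` restricts the comparison to chosen columns.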
{ Outliers }

🔵 Outliers are data points which differ significantly from other data

🟠 e.g. for temperature data in India, which ranges from 1 to 45 degrees Celsius, -60 or +60 is an outlier

🟢 Understanding the nature of outlier data is a problem ML engineers have to solve
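The temperature example above can be sketched as a simple range check, assuming pandas and made-up readings. The 1 to 45 bounds come from the example in this thread; in practice the valid range depends on your own data, and a statistical rule (such as the interquartile range) may fit better when no natural bounds are known.

```python
import pandas as pd

# Hypothetical temperature readings in degrees Celsius.
temps = pd.Series([12, 28, 45, -60, 33, 60, 1])

# Expected range taken from the example in the thread.
LOW, HIGH = 1, 45

# between() includes both end points by default.
in_range = temps.between(LOW, HIGH)
outliers = temps[~in_range]
clean = temps[in_range]

print("Outliers:", outliers.tolist())
print("Kept:", clean.tolist())
```

Flagging is only the first step: whether an outlier is a sensor error to drop or a rare but real event to keep is the judgment call the last point refers to.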
Hello 👋

I am Jaydeep from India 🇮🇳

Full time Software Engineer & part time content creator on
🐦Twitter
🖧 LinkedIn
🎥YouTube

Follow me for content on
🐍 Python
🤖 AI/ML
🎨Data Visualization
🌟Content creation

Subscribe To My YouTube🔽
youtu.be/FLdS-kBt88M