---
title: Understand your datasets
titleSuffix: Azure Machine Learning
description: Perform exploratory data analysis to understand feature biases and imbalances with the Responsible AI dashboard's Data Explorer.
services: machine-learning
ms.service: machine-learning
ms.subservice: enterprise-readiness
ms.topic: how-to
ms.author: mesameki
author: mesameki
ms.date: 05/10/2022
ms.custom: responsible-ml, event-tier1-build-2022
---

# Understand your datasets (preview)

Machine learning models "learn" from historical decisions and actions captured in training data. As a result, their real-world performance is heavily influenced by the data they're trained on. When the feature distribution in a dataset is skewed, a model can incorrectly predict datapoints that belong to an underrepresented group, or be optimized along an inappropriate metric. For example, in a housing-price prediction model, 75 percent of the training set consisted of newer houses priced below the median. As a result, the model was much less successful at identifying more expensive historic houses. The fix was to add older, more expensive houses to the training data and to augment the features with insights about a house's historic value. That data augmentation improved the results.
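The kind of skew described above can be spotted before training with a simple cohort count. The following minimal sketch uses hypothetical labels (the group names and the 20 percent threshold are illustrative assumptions, not part of the dashboard):

```python
from collections import Counter

# Hypothetical training rows, each tagged by cohort.
# 75% are newer, below-median-price houses -- the skew described above.
rows = (["new/below_median"] * 75
        + ["new/above_median"] * 10
        + ["historic/above_median"] * 15)

counts = Counter(rows)
total = len(rows)
for group, n in counts.items():
    share = n / total
    # Flag cohorts that fall under an illustrative 20% representation threshold.
    flag = "  <-- underrepresented" if share < 0.20 else ""
    print(f"{group:25s} {share:.0%}{flag}")
```

A check like this is a rough first pass; the Data Explorer described below gives the same view interactively across any feature.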

The Data Explorer component of the Responsible AI dashboard helps you visualize datasets based on predicted and actual outcomes, error groups, and specific features. It lets you identify issues of over- and underrepresentation and see how your data is clustered. Data visualizations consist of aggregate plots or individual datapoints.

## When to use Data Explorer

Use Data Explorer when you need to:

- Explore your dataset statistics by selecting different filters to slice your data into different dimensions (also known as cohorts).
- Understand the distribution of your dataset across different cohorts and feature groups.
- Determine whether your findings related to fairness, error analysis, and causality (derived from other dashboard components) are a result of your dataset’s distribution.
- Decide in which areas to collect more data to mitigate errors arising from representation issues, label noise, feature noise, label bias, and similar problems.
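The cohort-based comparison in the points above can be sketched in plain Python. This hypothetical example (the cohort names, records, and error metric are illustrative assumptions) slices predictions by cohort and compares error rates, which is the kind of pattern Data Explorer surfaces visually:

```python
from collections import defaultdict

# Hypothetical (cohort, actual, predicted) records for a price classifier.
records = [
    ("new",      "below_median", "below_median"),
    ("new",      "below_median", "below_median"),
    ("new",      "above_median", "above_median"),
    ("historic", "above_median", "below_median"),
    ("historic", "above_median", "below_median"),
    ("historic", "above_median", "above_median"),
]

# cohort -> [wrong, total]
errors = defaultdict(lambda: [0, 0])
for cohort, actual, predicted in records:
    errors[cohort][0] += actual != predicted
    errors[cohort][1] += 1

for cohort, (wrong, total) in errors.items():
    print(f"{cohort:10s} error rate: {wrong / total:.0%}")
```

A large gap between cohort error rates, as in this sketch, suggests the findings from the error-analysis component may trace back to the dataset's distribution rather than to the model alone.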

## Next steps