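Nemotron-Personas-Brazil: Co-Designed Data for Sovereign AI

Hugging Face

About 1 month ago

AI-generated summary

Hugging Face and NVIDIA have released Nemotron-Personas-Brazil, an open dataset of 6 million synthetic personas whose statistics are grounded in Brazilian census data. The release aims to provide high-quality, locally relevant training data to advance sovereign AI development in Brazil.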

Grounding Brazil’s AI with Real Data

Building AI systems that serve national populations requires data that reflects local language, demographics, and cultural context. For Brazil—home to more than 200 million people across diverse regions—this remains a persistent challenge, as much of today’s high-quality training data is English-centric or unavailable for commercial use.

Nemotron-Personas-Brazil helps close that gap. It is an open dataset (CC BY 4.0) of 6 million fully synthetic personas, statistically grounded in official census and labor data from the Brazilian Institute of Geography and Statistics (IBGE). Every persona is aligned to real demographic, geographic, and occupational distributions—but no real person is represented.

This release extends NVIDIA's growing Nemotron-Personas Collection, which already includes the USA, Japan, India, and Singapore. Like others in the collection, the Brazil dataset covers attributes such as age, sex, education, occupation, and location.

The dataset is designed for Brazilian developers and researchers building sovereign AI, with data that is locally grounded, culturally informed, and commercially usable (CC BY 4.0). It was built in collaboration with WideLabs, an NVIDIA Inception member with deep experience supporting government and regulated-sector AI deployments across Latin America.

What’s in the Dataset?

At a glance:

Each persona is written in natural Brazilian Portuguese and includes cultural background, skills, goals, hobbies, and interests.

How We Built It

Data Generation Pipeline

Nemotron-Personas-Brazil was built using NeMo Data Designer, NVIDIA’s compound AI system for synthetic data generation. The pipeline supports structured generation, validation, and retry mechanisms required to produce large-scale, population-aware datasets.

Key components include:

An extended version of Nemotron-Personas-Brazil will be available directly within NeMo Data Designer, enabling developers to generate, refine, and extend Brazilian Portuguese personas as part of their own synthetic data pipelines.
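The generate, validate, and retry pattern mentioned in this section can be illustrated with a minimal sketch. This is not NeMo Data Designer's API: the field names, the `validate` rules, and the `flaky_generator` stand-in for a model call are all hypothetical, chosen only to show the control flow.

```python
import random
from typing import Callable, Optional

# Hypothetical schema: these field names are illustrative,
# not the dataset's actual columns.
REQUIRED_FIELDS = ("age", "occupation", "persona")


def validate(record: dict) -> bool:
    """Return True if every required field is present and the age is plausible."""
    if any(field not in record for field in REQUIRED_FIELDS):
        return False
    return isinstance(record["age"], int) and 0 <= record["age"] <= 120


def generate_with_retry(
    generate: Callable[[], dict], max_attempts: int = 3
) -> Optional[dict]:
    """Call `generate` until its output validates or the attempts run out."""
    for _ in range(max_attempts):
        candidate = generate()
        if validate(candidate):
            return candidate
    return None  # the caller decides whether to drop or log the failure


def flaky_generator(rng: random.Random) -> Callable[[], dict]:
    """Stand-in for a model call that sometimes omits a required field."""
    def _generate() -> dict:
        record = {"age": rng.randint(18, 90), "occupation": "professora"}
        if rng.random() > 0.4:
            record["persona"] = "placeholder persona text"
        return record
    return _generate
```

Calling `generate_with_retry(flaky_generator(random.Random(0)))` returns either a record that passed validation or `None` after three failed attempts. Returning `None` rather than raising keeps a large batch job running when a single record cannot be produced.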

Enhanced Cultural Context

To capture the socio-demographic and geographic diversity of Brazil's population, Nemotron-Personas-Brazil draws on population census and labor data published by the Brazilian Institute of Geography and Statistics (IBGE).

The result is a dataset that is statistically grounded, culturally representative, and fully synthetic by design.
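As a rough illustration of what "statistically grounded" can mean in practice, the sketch below draws persona attributes in proportion to population shares. The shares are invented placeholders, not IBGE figures, and the attributes are sampled independently for simplicity; a real pipeline would more likely use joint or conditional distributions. All function and variable names here are hypothetical.

```python
import random

# Placeholder shares invented for illustration; these are NOT IBGE figures.
REGION_SHARES = {
    "Norte": 0.1,
    "Nordeste": 0.3,
    "Sudeste": 0.4,
    "Sul": 0.1,
    "Centro-Oeste": 0.1,
}
AGE_BAND_SHARES = {"18-29": 0.3, "30-44": 0.3, "45-59": 0.2, "60+": 0.2}


def sample_attribute(shares: dict, rng: random.Random) -> str:
    """Draw one category with probability proportional to its share."""
    categories = list(shares)
    weights = list(shares.values())
    return rng.choices(categories, weights=weights, k=1)[0]


def sample_skeleton(rng: random.Random) -> dict:
    """Build the demographic skeleton that a persona text would be written around.

    Attributes are drawn independently here, which is a simplifying assumption.
    """
    return {
        "region": sample_attribute(REGION_SHARES, rng),
        "age_band": sample_attribute(AGE_BAND_SHARES, rng),
    }
```

Over many draws, the proportion of each region in the sampled skeletons approaches its configured share, which is the property that lets a synthetic population mirror a real one without copying any individual.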

Private By Design

This dataset contains no personally identifiable information. While we use real-world distributions of ages, names, and occupations from official public sources, nothing is tied to any real person, living or deceased. Every persona is fully synthetic, so you can train on authentic cultural patterns without compromising privacy.

Who This Data Is For

Nemotron-Personas-Brazil is designed primarily for Brazilian developers and researchers building sovereign AI systems. By providing high-quality, population-representative data in Brazilian Portuguese, the dataset addresses gaps left by predominantly English-language training corpora.

Global developers may also leverage the dataset to improve model performance and alignment in Brazilian cultural and linguistic contexts.

Practical AI Applications

Why It Matters

AI model builders have long struggled with access to diverse, high-quality training data that reflects real-world populations. Proprietary datasets dominate enterprise AI, creating barriers for researchers, startups, and developers in underrepresented regions.

By releasing Nemotron-Personas-Brazil under CC BY 4.0, we're democratizing access to enterprise-grade synthetic data—enabling anyone to build culturally authentic AI without barriers of cost, privacy concerns, or geography.

Start Building with Nemotron-Personas-Brazil

You can load the dataset directly from Hugging Face:
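A minimal sketch of loading with the Hugging Face `datasets` library follows. The repository id `nvidia/Nemotron-Personas-Brazil` and the `train` split name are assumptions and should be checked against the dataset page on the Hub; `load_personas` and `preview` are hypothetical helper names.

```python
from itertools import islice
from typing import Iterable, List

# Assumed repository id; verify it on the Hugging Face Hub before use.
REPO_ID = "nvidia/Nemotron-Personas-Brazil"


def load_personas(repo_id: str = REPO_ID, split: str = "train"):
    """Stream the dataset so the 6 million rows are not downloaded up front.

    The split name "train" is an assumption.
    """
    from datasets import load_dataset  # requires: pip install datasets

    return load_dataset(repo_id, split=split, streaming=True)


def preview(records: Iterable[dict], n: int = 3) -> List[dict]:
    """Return the first n records from any iterable of rows."""
    return list(islice(records, n))
```

With network access and `datasets` installed, `preview(load_personas())` would return the first three persona records as dictionaries. Streaming is a convenient default for a first look; dropping `streaming=True` downloads and caches the full dataset instead.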

Want to learn more about NVIDIA's open data products, or interested in co-designing a future dataset? Join the conversation on NVIDIA's Discord.
