JSONL AI Training Data
A comprehensive guide to preparing JSONL training data for AI and machine learning. Covers OpenAI fine-tuning, Anthropic Claude, HuggingFace datasets, data validation, and large-scale data pipelines.
Last updated: February 2026
Why JSONL Is the Standard Format for AI Training Data
JSONL (JSON Lines) has become the de facto standard for AI and machine learning training data. Every major AI provider, from OpenAI to Anthropic to HuggingFace, uses JSONL as the primary format for fine-tuning datasets. The reason is simple: JSONL stores one example per line, which makes it trivial to stream data during training, shuffle examples, count records, split train/test sets, and append new examples without rewriting the whole file.
Unlike a single JSON array, which must be parsed in full before it can be used, a JSONL file can be processed line by line. That matters once your training dataset grows to millions of examples spanning multiple gigabytes. In this guide you will learn the exact JSONL formats required by OpenAI, Anthropic Claude, and HuggingFace, how to validate and clean your data, and how to prepare large-scale datasets for production training.
OpenAI Fine-tuning JSONL Format
OpenAI uses JSONL for both fine-tuning and the Batch API. For fine-tuning chat models such as GPT-4o-mini, each line must contain a messages array in the standard chat-completion structure. The system, user, and assistant roles define the conversation the model should learn to reproduce.
Each line is one training example. The system message sets the behavior, the user message is the input, and the assistant message is the target output the model should learn.
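The line-by-line property is easy to exploit directly. Below is a minimal sketch of a lazy JSONL reader; `iter_jsonl` and `demo.jsonl` are illustrative names created here for the example, not part of any library:

```python
import json

def iter_jsonl(path):
    """Yield one parsed example at a time; only a single line is ever held in memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Throwaway demo file: three tiny records, one per line.
with open("demo.jsonl", "w", encoding="utf-8") as f:
    for i in range(3):
        f.write(json.dumps({"id": i}) + "\n")

# Count records without ever materializing the whole dataset.
total = sum(1 for _ in iter_jsonl("demo.jsonl"))
```

Because the reader is a generator, the same pattern scales unchanged from a three-line demo file to a multi-gigabyte dataset.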
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for an e-commerce platform."}, {"role": "user", "content": "Where is my order #12345?"}, {"role": "assistant", "content": "I can see that order #12345 was shipped on February 10th via FedEx. The tracking number is FX987654321. Based on the latest update, it should arrive by February 14th."}]}
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for an e-commerce platform."}, {"role": "user", "content": "I want to return my purchase."}, {"role": "assistant", "content": "I would be happy to help with your return. Could you please provide the order number and the reason for the return? Our return policy allows returns within 30 days of delivery for most items."}]}
For multi-turn conversations, include the full conversation history. The model learns from every assistant reply in the messages array. This is how you teach the model to handle follow-up questions.
{"messages": [{"role": "system", "content": "You are a Python programming tutor."}, {"role": "user", "content": "What is a list comprehension?"}, {"role": "assistant", "content": "A list comprehension is a concise way to create lists in Python. Instead of writing a for loop to build a list, you can write it in one line: [expression for item in iterable if condition]."}, {"role": "user", "content": "Can you show me an example?"}, {"role": "assistant", "content": "Sure! Here is an example that creates a list of squares for even numbers:\n\nsquares = [x**2 for x in range(10) if x % 2 == 0]\n# Result: [0, 4, 16, 36, 64]\n\nThis is equivalent to:\nsquares = []\nfor x in range(10):\n if x % 2 == 0:\n squares.append(x**2)"}]}
OpenAI requires at least 10 training examples but recommends 50-100 for a noticeable improvement. Every message must include role and content fields. The system message is optional but recommended for consistent behavior. Validate your file before uploading with the OpenAI CLI: openai tools fine_tunes.prepare_data -f training.jsonl
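Beyond the CLI, a few lines of standard-library Python can catch the two most common rejection causes (too few examples, no assistant turn) before you upload. The `quick_check` helper below is a hypothetical sketch, not an OpenAI tool:

```python
import json

def quick_check(path, minimum=10):
    """Lightweight pre-upload check: example count and presence of an assistant turn."""
    problems = []
    n = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            n += 1
            roles = [m.get("role") for m in json.loads(line).get("messages", [])]
            if "assistant" not in roles:
                problems.append(f"line {line_num}: no assistant message")
    if n < minimum:
        problems.append(f"only {n} examples; at least {minimum} are required")
    return n, problems

# Throwaway demo file with a single, incomplete example.
with open("tiny.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"messages": [{"role": "user", "content": "Hi"}]}) + "\n")

n, problems = quick_check("tiny.jsonl")
```

On the demo file this flags both problems: the lone example has no assistant message, and one example is well below the minimum of ten.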
Anthropic Claude Training Format
Anthropic uses JSONL for fine-tuning Claude models. The format uses human and assistant turns in a structured conversation. While the exact format depends on the fine-tuning method and API version, the core pattern follows the structure of the Messages API.
Each line contains one conversation with a system prompt and alternating user/assistant messages. The model learns to produce the assistant replies given the conversational context.
{"system": "You are a medical information assistant. Provide accurate, general health information while always recommending consulting a healthcare professional.", "messages": [{"role": "user", "content": "What are the symptoms of vitamin D deficiency?"}, {"role": "assistant", "content": "Common symptoms of vitamin D deficiency include fatigue and tiredness, bone pain and muscle weakness, mood changes including depression, slow wound healing, and frequent infections. However, many people with mild deficiency may not notice symptoms. A blood test is the only way to confirm your vitamin D levels. I recommend speaking with your doctor who can order the appropriate test and recommend supplementation if needed."}]}
{"system": "You are a medical information assistant. Provide accurate, general health information while always recommending consulting a healthcare professional.", "messages": [{"role": "user", "content": "Is it safe to take ibuprofen daily?"}, {"role": "assistant", "content": "Taking ibuprofen daily for extended periods can carry risks including stomach ulcers and gastrointestinal bleeding, increased risk of heart attack and stroke, kidney problems with long-term use, and potential interactions with other medications. For occasional pain relief, short-term use at recommended doses is generally considered safe for most adults. However, if you find yourself needing daily pain relief, please consult your doctor. They can evaluate the underlying cause and recommend a safer long-term treatment plan."}]}
For multi-turn training data, include complete conversations with alternating user and assistant messages. The model learns the conversational flow and how to maintain context across turns.
{"system": "You are a cooking assistant specializing in Italian cuisine.", "messages": [{"role": "user", "content": "How do I make carbonara?"}, {"role": "assistant", "content": "Authentic carbonara uses just five ingredients: spaghetti, guanciale (cured pork cheek), eggs, Pecorino Romano cheese, and black pepper. The key technique is tempering the egg mixture with the hot pasta water so it creates a creamy sauce without scrambling."}, {"role": "user", "content": "Can I use bacon instead of guanciale?"}, {"role": "assistant", "content": "While guanciale gives the most authentic flavor, pancetta is a closer substitute. Bacon works in a pinch but its smoky flavor changes the dish significantly. If using bacon, choose an unsmoked variety. Whichever you use, cut it into small pieces and render the fat slowly over medium-low heat until crispy."}]}
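If your data is already in OpenAI's chat format, with the system prompt inside the messages array, converting it to the top-level-system shape shown above takes only a few lines. The helper name `to_anthropic_style` is illustrative, assuming the record shapes shown in this guide:

```python
def to_anthropic_style(record):
    """Hoist the system message out of `messages` into a top-level `system` field."""
    system = ""
    rest = []
    for msg in record["messages"]:
        if msg["role"] == "system":
            system = msg["content"]
        else:
            rest.append(msg)  # keep user/assistant turns in order
    return {"system": system, "messages": rest}

# An OpenAI-style record with the system prompt embedded in `messages`.
record = {"messages": [
    {"role": "system", "content": "You are a cooking assistant."},
    {"role": "user", "content": "How do I make carbonara?"},
    {"role": "assistant", "content": "Spaghetti, guanciale, eggs, Pecorino, pepper."},
]}
converted = to_anthropic_style(record)
```

Run the conversion over each line of your JSONL file and write the results back out one record per line.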
HuggingFace Datasets with JSONL
HuggingFace's datasets library supports JSONL natively as an input format. You can load local JSONL files, stream remote datasets, and convert between formats easily. JSONL is the recommended format for sharing datasets on the HuggingFace Hub.
Use the datasets library to load a JSONL file as a Dataset object. This gives you efficient memory-mapped access, built-in train/test splitting, and seamless integration with the Trainer API.
from datasets import load_dataset

# Load a local JSONL file
dataset = load_dataset('json', data_files='training.jsonl')
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['messages', 'system'],
#         num_rows: 5000
#     })
# })

# Load with train/test split
dataset = load_dataset('json', data_files={
    'train': 'train.jsonl',
    'test': 'test.jsonl'
})

# Stream a large remote dataset
dataset = load_dataset(
    'json',
    data_files='https://example.com/large_dataset.jsonl',
    streaming=True
)
for example in dataset['train']:
    print(example)
    break
Convert any HuggingFace dataset to JSONL for use in other training pipelines. The to_json method writes each example as a separate JSON line.
from datasets import load_dataset

# Load a dataset from the Hub
dataset = load_dataset('squad', split='train')

# Export to JSONL
dataset.to_json('squad_train.jsonl')
print(f'Exported {len(dataset)} examples')

# Export with specific columns
dataset.select_columns(['question', 'context', 'answers']).to_json('squad_filtered.jsonl')

# Process and export
def format_for_finetuning(example):
    return {'messages': [
        {'role': 'user', 'content': example['question']},
        {'role': 'assistant', 'content': example['answers']['text'][0]}
    ]}

formatted = dataset.map(format_for_finetuning, remove_columns=dataset.column_names)
formatted.to_json('squad_chat_format.jsonl')
Data Validation and Cleaning
The quality of your training data directly affects model performance. Invalid JSON, missing fields, overly long examples, and duplicate entries all degrade fine-tuning results. Always validate and clean your JSONL files before starting a training run.
This script checks every line of a JSONL file for common problems: invalid JSON, missing required fields, empty content, and missing assistant messages. Running it before upload catches issues early.
import json
from collections import Counter

def validate_training_data(path: str) -> dict:
    """Validate a JSONL file for AI fine-tuning."""
    stats = Counter()
    errors = []
    with open(path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            stats['total'] += 1

            # Check valid JSON
            try:
                data = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f'Line {line_num}: Invalid JSON - {e}')
                stats['invalid_json'] += 1
                continue

            # Check messages field exists
            if 'messages' not in data:
                errors.append(f'Line {line_num}: Missing "messages" field')
                stats['missing_messages'] += 1
                continue
            messages = data['messages']
            errors_before = len(messages) and len(errors)

            # Check message structure
            for i, msg in enumerate(messages):
                if 'role' not in msg:
                    errors.append(f'Line {line_num}, msg {i}: Missing "role"')
                    stats['missing_role'] += 1
                if 'content' not in msg:
                    errors.append(f'Line {line_num}, msg {i}: Missing "content"')
                    stats['missing_content'] += 1
                elif not msg['content'].strip():
                    errors.append(f'Line {line_num}, msg {i}: Empty content')
                    stats['empty_content'] += 1

            # Check has at least one assistant message
            roles = [m.get('role') for m in messages]
            if 'assistant' not in roles:
                errors.append(f'Line {line_num}: No assistant message')
                stats['no_assistant'] += 1

            # Count the line as valid only if the structure checks found nothing
            if len(errors) == errors_before:
                stats['valid'] += 1

    return {'stats': dict(stats), 'errors': errors[:50]}

result = validate_training_data('training.jsonl')
print(f"Total: {result['stats'].get('total', 0)}")
print(f"Valid: {result['stats'].get('valid', 0)}")
if result['errors']:
    print(f"\nFirst {len(result['errors'])} errors:")
    for err in result['errors']:
        print(f'  {err}')
Duplicate training examples waste compute and can bias the model toward over-represented patterns. This script removes exact duplicates based on a content hash of each line.
import json
import hashlib

def deduplicate_jsonl(input_path: str, output_path: str) -> dict:
    """Remove duplicate training examples from a JSONL file."""
    seen_hashes = set()
    total = 0
    unique = 0
    with open(input_path, 'r') as fin, open(output_path, 'w') as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            total += 1

            # Hash the normalized JSON to catch formatting differences
            data = json.loads(line)
            canonical = json.dumps(data, sort_keys=True)
            content_hash = hashlib.sha256(canonical.encode()).hexdigest()

            if content_hash not in seen_hashes:
                seen_hashes.add(content_hash)
                fout.write(json.dumps(data) + '\n')
                unique += 1

    duplicates = total - unique
    print(f'Total: {total}, Unique: {unique}, Removed: {duplicates}')
    return {'total': total, 'unique': unique, 'duplicates': duplicates}

deduplicate_jsonl('training.jsonl', 'training_deduped.jsonl')
Large-Scale Data Preparation
Production training datasets often contain hundreds of thousands or even millions of examples. At that scale you need automated pipelines to shuffle, split, and shard your JSONL data. Proper preparation prevents training problems such as catastrophic forgetting and keeps experiments reproducible.
This script randomly shuffles the data and splits it into train and test sets. Shuffling matters because JSONL files are often generated in order, and training on ordered data can hurt generalization.
import json
import random
from pathlib import Path

def prepare_training_data(input_path: str,
                          output_dir: str,
                          test_ratio: float = 0.1,
                          seed: int = 42) -> dict:
    """Shuffle and split JSONL into train/test sets."""
    random.seed(seed)
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    # Load all examples
    examples = []
    with open(input_path, 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))

    # Shuffle
    random.shuffle(examples)

    # Split
    split_idx = int(len(examples) * (1 - test_ratio))
    train_data = examples[:split_idx]
    test_data = examples[split_idx:]

    # Write output files
    train_path = output / 'train.jsonl'
    test_path = output / 'test.jsonl'
    for data, path in [(train_data, train_path), (test_data, test_path)]:
        with open(path, 'w') as f:
            for example in data:
                f.write(json.dumps(example) + '\n')

    print(f'Train: {len(train_data)} examples -> {train_path}')
    print(f'Test: {len(test_data)} examples -> {test_path}')
    return {'train': len(train_data), 'test': len(test_data)}

prepare_training_data('all_examples.jsonl',
                      './prepared_data/',
                      test_ratio=0.1,
                      seed=42)
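The approach above loads every example into memory before shuffling. When the file is too large for that, one option is to trade an exact split for a streaming one: route each line independently with a seeded random draw, so memory use stays constant and the split remains reproducible. Split sizes then only approximate test_ratio. The function name `streaming_split` is illustrative:

```python
import json
import random

def streaming_split(input_path, train_path, test_path, test_ratio=0.1, seed=42):
    """Reproducible train/test split that never holds the dataset in memory."""
    rng = random.Random(seed)
    counts = {"train": 0, "test": 0}
    with open(input_path) as fin, \
         open(train_path, "w") as ftrain, \
         open(test_path, "w") as ftest:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            # Each line is routed independently; sizes are approximate.
            if rng.random() < test_ratio:
                ftest.write(line + "\n")
                counts["test"] += 1
            else:
                ftrain.write(line + "\n")
                counts["train"] += 1
    return counts

# Throwaway demo input: 1000 one-field records.
with open("all_examples_demo.jsonl", "w") as f:
    for i in range(1000):
        f.write(json.dumps({"id": i}) + "\n")

counts = streaming_split("all_examples_demo.jsonl",
                         "train_demo.jsonl", "test_demo.jsonl",
                         test_ratio=0.1)
```

Note this does not shuffle the training set itself; pair it with shuffling at training time (for example, a shuffle buffer in your data loader) if input order matters.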
When a dataset is too large for a single file, or you need to distribute training across multiple GPUs, shard the JSONL file into smaller chunks. Each shard can be processed independently.
from pathlib import Path

def shard_jsonl(input_path: str,
                output_dir: str,
                shard_size: int = 50000) -> int:
    """Split a large JSONL file into smaller shards."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    shard_num = 0
    line_count = 0
    current_file = None

    with open(input_path, 'r') as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            if line_count % shard_size == 0:
                if current_file:
                    current_file.close()
                shard_path = output / f'shard_{shard_num:04d}.jsonl'
                current_file = open(shard_path, 'w')
                shard_num += 1
            current_file.write(line + '\n')
            line_count += 1

    if current_file:
        current_file.close()

    print(f'Created {shard_num} shards from {line_count} examples')
    return shard_num

shard_jsonl('large_dataset.jsonl', './shards/', shard_size=50000)
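After sharding, it is worth confirming that no lines were lost: the non-empty line count summed across shards should equal the source file's. The sketch below does that count; `count_shard_lines` and the demo directory are illustrative names:

```python
from pathlib import Path

def count_shard_lines(shard_dir):
    """Sum non-empty lines across every shard file, in deterministic order."""
    total = 0
    for shard in sorted(Path(shard_dir).glob("shard_*.jsonl")):
        with open(shard) as f:
            total += sum(1 for line in f if line.strip())
    return total

# Throwaway demo: two shards holding 2 and 1 records.
demo = Path("demo_shards")
demo.mkdir(exist_ok=True)
(demo / "shard_0000.jsonl").write_text('{"id": 0}\n{"id": 1}\n')
(demo / "shard_0001.jsonl").write_text('{"id": 2}\n')

total = count_shard_lines("demo_shards")
```

Compare the returned total against a line count of the original file before deleting it or kicking off a training run.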
Validate Your Training Data Online
Use our free, browser-based tool to validate, format, and convert your JSONL training data before uploading it to OpenAI, Anthropic, or HuggingFace.