JSONL AI Training Data
A complete guide to preparing AI and machine learning training data in JSONL format. Covers OpenAI fine-tuning, Anthropic Claude, HuggingFace datasets, data validation, and large-scale data pipelines.
Last updated: February 2026
Why JSONL Became the Standard Format for AI Training Data
JSONL (JSON Lines) has become the de facto standard format for AI and machine learning training data. Every major AI vendor, from OpenAI to Anthropic to HuggingFace, uses JSONL as the primary format for fine-tuning datasets. The reason is simple: JSONL stores one example per line, which makes it trivial to stream data during training, shuffle examples, count records, split into training and test sets, and append new examples without rewriting the entire file.
Unlike a single JSON array, which must be parsed in full before it can be used, a JSONL file can be processed line by line. This matters once your training dataset grows to millions of examples and gigabytes in size. In this guide you will learn the exact JSONL formats required by OpenAI, Anthropic Claude, and HuggingFace, how to validate and clean your data, and how to prepare large-scale datasets for production training.
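The line-oriented properties above can be sketched in a few lines of Python. This is a minimal illustration, not part of any SDK; the function names and the file path are hypothetical.

```python
import json

def count_examples(path):
    """Stream a JSONL file line by line; memory use stays constant
    no matter how large the file is."""
    count = 0
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                count += 1
    return count

def append_example(path, example):
    """Append a new training example without rewriting existing lines,
    something a single JSON array cannot do."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(example, ensure_ascii=False) + '\n')
```

The same pattern extends to shuffling and splitting: because each line is independent, any line-level operation works without parsing the whole file.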
OpenAI Fine-tuning JSONL Format
OpenAI uses JSONL for both fine-tuning and the Batch API. For fine-tuning chat models such as GPT-4o-mini, each line must contain a messages array in the standard chat completion structure. Three roles, system, user, and assistant, define the conversation the model should learn to produce.
Each line is one training example. The system message sets the behavior, the user message is the input, and the assistant message is the target output the model should learn to generate.
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for an e-commerce platform."}, {"role": "user", "content": "Where is my order #12345?"}, {"role": "assistant", "content": "I can see that order #12345 was shipped on February 10th via FedEx. The tracking number is FX987654321. Based on the latest update, it should arrive by February 14th."}]}
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for an e-commerce platform."}, {"role": "user", "content": "I want to return my purchase."}, {"role": "assistant", "content": "I would be happy to help with your return. Could you please provide the order number and the reason for the return? Our return policy allows returns within 30 days of delivery for most items."}]}
For multi-turn conversations, include the full conversation history. The model learns from every assistant turn in the messages array. This is how you teach a model to handle follow-up questions.
{"messages": [{"role": "system", "content": "You are a Python programming tutor."}, {"role": "user", "content": "What is a list comprehension?"}, {"role": "assistant", "content": "A list comprehension is a concise way to create lists in Python. Instead of writing a for loop to build a list, you can write it in one line: [expression for item in iterable if condition]."}, {"role": "user", "content": "Can you show me an example?"}, {"role": "assistant", "content": "Sure! Here is an example that creates a list of squares for even numbers:\n\nsquares = [x**2 for x in range(10) if x % 2 == 0]\n# Result: [0, 4, 16, 36, 64]\n\nThis is equivalent to:\nsquares = []\nfor x in range(10):\n if x % 2 == 0:\n squares.append(x**2)"}]}
OpenAI requires a minimum of 10 training examples but recommends 50-100 for noticeable improvement. Every message must include role and content fields. The system message is optional but recommended for consistent behavior. You can validate your file before uploading with the data preparation tool from the legacy (pre-1.0) OpenAI CLI: openai tools fine_tunes.prepare_data -f training.jsonl
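As a quick pre-upload sanity check, the example-count thresholds above can be verified in plain Python. This is an illustrative sketch: check_dataset_size is a hypothetical helper, and the thresholds simply restate the 10-minimum / 50-100-recommended guidance quoted above.

```python
def check_dataset_size(path):
    """Count JSONL examples and warn when the file is below the
    minimums OpenAI documents for fine-tuning (hypothetical helper)."""
    with open(path, 'r', encoding='utf-8') as f:
        n = sum(1 for line in f if line.strip())
    if n < 10:
        print(f'{n} examples: below the 10-example minimum')
    elif n < 50:
        print(f'{n} examples: valid, but 50-100 are recommended')
    else:
        print(f'{n} examples: OK')
    return n
```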
Anthropic Claude Training Format
Anthropic uses JSONL for fine-tuning Claude models, with user and assistant turns in a structured conversation. While the exact format depends on the fine-tuning method and API version, the core pattern follows the Messages API structure.
Each line contains one conversation, including the system prompt and alternating user/assistant messages. The model learns to generate the assistant responses given the conversation context.
{"system": "You are a medical information assistant. Provide accurate, general health information while always recommending consulting a healthcare professional.", "messages": [{"role": "user", "content": "What are the symptoms of vitamin D deficiency?"}, {"role": "assistant", "content": "Common symptoms of vitamin D deficiency include fatigue and tiredness, bone pain and muscle weakness, mood changes including depression, slow wound healing, and frequent infections. However, many people with mild deficiency may not notice symptoms. A blood test is the only way to confirm your vitamin D levels. I recommend speaking with your doctor who can order the appropriate test and recommend supplementation if needed."}]}
{"system": "You are a medical information assistant. Provide accurate, general health information while always recommending consulting a healthcare professional.", "messages": [{"role": "user", "content": "Is it safe to take ibuprofen daily?"}, {"role": "assistant", "content": "Taking ibuprofen daily for extended periods can carry risks including stomach ulcers and gastrointestinal bleeding, increased risk of heart attack and stroke, kidney problems with long-term use, and potential interactions with other medications. For occasional pain relief, short-term use at recommended doses is generally considered safe for most adults. However, if you find yourself needing daily pain relief, please consult your doctor. They can evaluate the underlying cause and recommend a safer long-term treatment plan."}]}
For multi-turn training data, include the full conversation with alternating user and assistant messages. The model learns the conversational flow and how to maintain context across turns.
{"system": "You are a cooking assistant specializing in Italian cuisine.", "messages": [{"role": "user", "content": "How do I make carbonara?"}, {"role": "assistant", "content": "Authentic carbonara uses just five ingredients: spaghetti, guanciale (cured pork cheek), eggs, Pecorino Romano cheese, and black pepper. The key technique is tempering the egg mixture with the hot pasta water so it creates a creamy sauce without scrambling."}, {"role": "user", "content": "Can I use bacon instead of guanciale?"}, {"role": "assistant", "content": "While guanciale gives the most authentic flavor, pancetta is a closer substitute. Bacon works in a pinch but its smoky flavor changes the dish significantly. If using bacon, choose an unsmoked variety. Whichever you use, cut it into small pieces and render the fat slowly over medium-low heat until crispy."}]}
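If your data is already in the OpenAI chat format shown earlier, converting it to the shape above mostly means hoisting the system message out of the messages array into a top-level system key. A minimal sketch follows; openai_to_anthropic and convert_file are hypothetical helpers, not part of any SDK.

```python
import json

def openai_to_anthropic(record):
    """Move an embedded system message to a top-level 'system' key,
    leaving only user/assistant turns in 'messages' (illustrative)."""
    system = None
    turns = []
    for msg in record['messages']:
        if msg['role'] == 'system':
            system = msg['content']
        else:
            turns.append(msg)
    if system is not None:
        return {'system': system, 'messages': turns}
    return {'messages': turns}

def convert_file(input_path, output_path):
    """Convert a whole JSONL file, one record per line."""
    with open(input_path, 'r', encoding='utf-8') as fin, \
         open(output_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            if line.strip():
                converted = openai_to_anthropic(json.loads(line))
                fout.write(json.dumps(converted, ensure_ascii=False) + '\n')
```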
HuggingFace JSONL Datasets
HuggingFace's datasets library supports JSONL natively as an input format. You can load local JSONL files, stream remote datasets, and convert between formats easily. JSONL is the recommended format for sharing datasets on the HuggingFace Hub.
Use the datasets library to load a JSONL file as a Dataset object. This gives you efficient memory-mapped access, built-in train/test splitting, and seamless integration with the Trainer API.
from datasets import load_dataset

# Load a local JSONL file
dataset = load_dataset('json', data_files='training.jsonl')
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['messages', 'system'],
#         num_rows: 5000
#     })
# })

# Load with train/test split
dataset = load_dataset('json', data_files={
    'train': 'train.jsonl',
    'test': 'test.jsonl'
})

# Stream a large remote dataset
dataset = load_dataset('json',
                       data_files='https://example.com/large_dataset.jsonl',
                       streaming=True)
for example in dataset['train']:
    print(example)
    break
Convert any HuggingFace dataset to JSONL for use in other training pipelines. The to_json method writes each example as a separate JSON line.
from datasets import load_dataset

# Load a dataset from the Hub
dataset = load_dataset('squad', split='train')

# Export to JSONL
dataset.to_json('squad_train.jsonl')
print(f'Exported {len(dataset)} examples')

# Export with specific columns
dataset.select_columns(['question', 'context', 'answers']).to_json('squad_filtered.jsonl')

# Process and export
def format_for_finetuning(example):
    return {'messages': [
        {'role': 'user', 'content': example['question']},
        {'role': 'assistant', 'content': example['answers']['text'][0]}
    ]}

formatted = dataset.map(format_for_finetuning, remove_columns=dataset.column_names)
formatted.to_json('squad_chat_format.jsonl')
Data Validation and Cleaning
Training data quality directly affects model performance. Invalid JSON, missing fields, overly long examples, and duplicate entries all reduce the effectiveness of fine-tuning. Always validate and clean your JSONL files before starting a training run.
This script checks every line of a JSONL file for common problems: invalid JSON, missing required fields, empty content, and conversations with no assistant message. Run it before uploading to catch issues early.
import json
from collections import Counter

def validate_training_data(path: str) -> dict:
    """Validate a JSONL file for AI fine-tuning."""
    stats = Counter()
    errors = []
    with open(path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            stats['total'] += 1
            # Check valid JSON
            try:
                data = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f'Line {line_num}: Invalid JSON - {e}')
                stats['invalid_json'] += 1
                continue
            # Check messages field exists
            if 'messages' not in data:
                errors.append(f'Line {line_num}: Missing "messages" field')
                stats['missing_messages'] += 1
                continue
            messages = data['messages']
            line_ok = True
            # Check message structure
            for i, msg in enumerate(messages):
                if 'role' not in msg:
                    errors.append(f'Line {line_num}, msg {i}: Missing "role"')
                    stats['missing_role'] += 1
                    line_ok = False
                if 'content' not in msg:
                    errors.append(f'Line {line_num}, msg {i}: Missing "content"')
                    stats['missing_content'] += 1
                    line_ok = False
                elif not msg['content'].strip():
                    errors.append(f'Line {line_num}, msg {i}: Empty content')
                    stats['empty_content'] += 1
                    line_ok = False
            # Check has at least one assistant message
            roles = [m.get('role') for m in messages]
            if 'assistant' not in roles:
                errors.append(f'Line {line_num}: No assistant message')
                stats['no_assistant'] += 1
                line_ok = False
            # Only count the line as valid if no structural errors were found
            if line_ok:
                stats['valid'] += 1
    return {'stats': dict(stats), 'errors': errors[:50]}

result = validate_training_data('training.jsonl')
print(f"Total: {result['stats'].get('total', 0)}")
print(f"Valid: {result['stats'].get('valid', 0)}")
if result['errors']:
    print(f"\nFirst {len(result['errors'])} errors:")
    for err in result['errors']:
        print(f'  {err}')
Duplicate training examples waste compute and can bias the model toward over-represented patterns. This script removes exact duplicates based on a content hash of each line.
import json
import hashlib

def deduplicate_jsonl(input_path: str, output_path: str) -> dict:
    """Remove duplicate training examples from a JSONL file."""
    seen_hashes = set()
    total = 0
    unique = 0
    with open(input_path, 'r') as fin, open(output_path, 'w') as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            total += 1
            # Hash the normalized JSON to catch formatting differences
            data = json.loads(line)
            canonical = json.dumps(data, sort_keys=True)
            content_hash = hashlib.sha256(canonical.encode()).hexdigest()
            if content_hash not in seen_hashes:
                seen_hashes.add(content_hash)
                fout.write(json.dumps(data) + '\n')
                unique += 1
    duplicates = total - unique
    print(f'Total: {total}, Unique: {unique}, Removed: {duplicates}')
    return {'total': total, 'unique': unique, 'duplicates': duplicates}

deduplicate_jsonl('training.jsonl', 'training_deduped.jsonl')
Large-Scale Data Preparation
Production training datasets often contain hundreds of thousands or even millions of examples. At that scale you need automated pipelines to shuffle, split, and shard your JSONL data. Proper preparation prevents training problems such as catastrophic forgetting and ensures reproducible experiments.
This script randomly shuffles the data and splits it into training and test sets. Shuffling is essential because JSONL files are often generated in order, and training on ordered data can lead to poor generalization.
import json
import random
from pathlib import Path

def prepare_training_data(input_path: str,
                          output_dir: str,
                          test_ratio: float = 0.1,
                          seed: int = 42) -> dict:
    """Shuffle and split JSONL into train/test sets."""
    random.seed(seed)
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    # Load all examples
    examples = []
    with open(input_path, 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))

    # Shuffle
    random.shuffle(examples)

    # Split
    split_idx = int(len(examples) * (1 - test_ratio))
    train_data = examples[:split_idx]
    test_data = examples[split_idx:]

    # Write output files
    train_path = output / 'train.jsonl'
    test_path = output / 'test.jsonl'
    for data, path in [(train_data, train_path), (test_data, test_path)]:
        with open(path, 'w') as f:
            for example in data:
                f.write(json.dumps(example) + '\n')

    print(f'Train: {len(train_data)} examples -> {train_path}')
    print(f'Test: {len(test_data)} examples -> {test_path}')
    return {'train': len(train_data), 'test': len(test_data)}

prepare_training_data('all_examples.jsonl',
                      './prepared_data/',
                      test_ratio=0.1,
                      seed=42)
When a dataset is too large for a single file, or you need to distribute training across multiple GPUs, shard the JSONL file into smaller chunks. Each shard can be processed independently.
import json
from pathlib import Path

def shard_jsonl(input_path: str,
                output_dir: str,
                shard_size: int = 50000) -> int:
    """Split a large JSONL file into smaller shards."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    shard_num = 0
    line_count = 0
    current_file = None
    with open(input_path, 'r') as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            # Start a new shard every shard_size lines
            if line_count % shard_size == 0:
                if current_file:
                    current_file.close()
                shard_path = output / f'shard_{shard_num:04d}.jsonl'
                current_file = open(shard_path, 'w')
                shard_num += 1
            current_file.write(line + '\n')
            line_count += 1
    if current_file:
        current_file.close()
    print(f'Created {shard_num} shards from {line_count} examples')
    return shard_num

shard_jsonl('large_dataset.jsonl', './shards/', shard_size=50000)
Validate Your Training Data Online
Use our free browser-based tool to validate, format, and convert your JSONL training data before uploading to OpenAI, Anthropic, or HuggingFace.