JSONL AI Training Data

A complete guide to preparing AI and machine learning training data in JSONL format. Covers OpenAI fine-tuning, Anthropic Claude, HuggingFace datasets, data validation, and large-scale data pipelines.

Last updated: February 2026

Why JSONL Became the Standard Format for AI Training Data

JSONL (JSON Lines) has become the de facto standard format for AI and machine learning training data. Every major AI provider, from OpenAI to Anthropic to HuggingFace, uses JSONL as the primary format for fine-tuning datasets. The reason is simple: JSONL stores one example per line, which makes it trivial to stream data during training, shuffle examples, count records, split into training and test sets, and append new examples without rewriting the entire file.

Unlike a single JSON array, which must be fully parsed before it can be used, a JSONL file can be processed line by line. This matters most once your training dataset grows to millions of examples and several gigabytes in size. In this guide, you will learn the exact JSONL formats required by OpenAI, Anthropic Claude, and HuggingFace, how to validate and clean your data, and how to prepare large-scale datasets for production training.
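The line-at-a-time property is easy to demonstrate. The sketch below (iter_examples and the sample file name are illustrative) parses examples lazily, so memory use stays constant no matter how large the file grows:

```python
import json

def iter_examples(path):
    """Yield one parsed example per line without loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demo: write two sample lines, then count them in constant memory
with open("sample.jsonl", "w", encoding="utf-8") as f:
    for i in range(2):
        f.write(json.dumps({"messages": [], "id": i}) + "\n")
count = sum(1 for _ in iter_examples("sample.jsonl"))
print(count)  # 2
```

The same generator also supports shuffling buffers, train/test splitting, and appending, since none of those operations require the whole file in memory at once.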

OpenAI Fine-tuning JSONL Format

OpenAI uses JSONL for both fine-tuning and the Batch API. For fine-tuning chat models such as GPT-4o-mini, each line must contain a messages array in the standard chat completion structure. The three roles, system, user, and assistant, define the conversation the model should learn to produce.

Each line is one training example. The system message sets the behavior, the user message is the input, and the assistant message is the target output the model should learn to generate.

Chat completion training format
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for an e-commerce platform."}, {"role": "user", "content": "Where is my order #12345?"}, {"role": "assistant", "content": "I can see that order #12345 was shipped on February 10th via FedEx. The tracking number is FX987654321. Based on the latest update, it should arrive by February 14th."}]}
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for an e-commerce platform."}, {"role": "user", "content": "I want to return my purchase."}, {"role": "assistant", "content": "I would be happy to help with your return. Could you please provide the order number and the reason for the return? Our return policy allows returns within 30 days of delivery for most items."}]}

For multi-turn conversations, include the full conversation history. The model learns from every assistant turn in the messages array. This is how you teach a model to handle follow-up questions.

Multi-turn conversation training
{"messages": [{"role": "system", "content": "You are a Python programming tutor."}, {"role": "user", "content": "What is a list comprehension?"}, {"role": "assistant", "content": "A list comprehension is a concise way to create lists in Python. Instead of writing a for loop to build a list, you can write it in one line: [expression for item in iterable if condition]."}, {"role": "user", "content": "Can you show me an example?"}, {"role": "assistant", "content": "Sure! Here is an example that creates a list of squares for even numbers:\n\nsquares = [x**2 for x in range(10) if x % 2 == 0]\n# Result: [0, 4, 16, 36, 64]\n\nThis is equivalent to:\nsquares = []\nfor x in range(10):\n    if x % 2 == 0:\n        squares.append(x**2)"}]}

OpenAI requires at least 10 training examples, but recommends 50-100 for a noticeable improvement. Every message must include role and content fields. The system message is optional but recommended for consistent behavior. Older versions of the OpenAI CLI could validate a file before upload with: openai tools fine_tunes.prepare_data -f training.jsonl (this command was removed in v1.0+ of the openai package, so on current versions validate with a script like the one later in this guide).
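A minimal pre-upload sanity check along these lines might look as follows. preflight, MIN_EXAMPLES, and the rough 4-characters-per-token estimate are illustrative, not part of any official OpenAI tooling; use the tiktoken library when you need exact token counts:

```python
import json

MIN_EXAMPLES = 10  # OpenAI's documented minimum for fine-tuning

def preflight(path, max_tokens=65536):
    """Count examples and flag lines whose estimated token count looks too high.

    Uses a rough 4-characters-per-token heuristic; tiktoken gives exact counts.
    """
    examples = 0
    long_lines = []
    with open(path, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            examples += 1
            data = json.loads(line)
            text = "".join(m.get("content", "") for m in data.get("messages", []))
            if len(text) / 4 > max_tokens:
                long_lines.append(line_num)
    if examples < MIN_EXAMPLES:
        print(f"Warning: only {examples} examples (minimum is {MIN_EXAMPLES})")
    return examples, long_lines

# Demo on a one-line file
with open("check_me.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"messages": [{"role": "user", "content": "hi"}]}) + "\n")
count, flagged = preflight("check_me.jsonl")
```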

Anthropic Claude Training Format

Anthropic uses JSONL for fine-tuning Claude models. The format uses user and assistant turns in a structured conversation. While the exact format depends on the fine-tuning method and API version, the core pattern follows the Messages API structure.

Each line contains one conversation, including a system prompt and alternating user/assistant messages. The model learns to generate the assistant responses given the conversation context.

Claude messages format
{"system": "You are a medical information assistant. Provide accurate, general health information while always recommending consulting a healthcare professional.", "messages": [{"role": "user", "content": "What are the symptoms of vitamin D deficiency?"}, {"role": "assistant", "content": "Common symptoms of vitamin D deficiency include fatigue and tiredness, bone pain and muscle weakness, mood changes including depression, slow wound healing, and frequent infections. However, many people with mild deficiency may not notice symptoms. A blood test is the only way to confirm your vitamin D levels. I recommend speaking with your doctor who can order the appropriate test and recommend supplementation if needed."}]}
{"system": "You are a medical information assistant. Provide accurate, general health information while always recommending consulting a healthcare professional.", "messages": [{"role": "user", "content": "Is it safe to take ibuprofen daily?"}, {"role": "assistant", "content": "Taking ibuprofen daily for extended periods can carry risks including stomach ulcers and gastrointestinal bleeding, increased risk of heart attack and stroke, kidney problems with long-term use, and potential interactions with other medications. For occasional pain relief, short-term use at recommended doses is generally considered safe for most adults. However, if you find yourself needing daily pain relief, please consult your doctor. They can evaluate the underlying cause and recommend a safer long-term treatment plan."}]}

For multi-turn training data, include the full conversation with alternating user and assistant messages. The model learns the conversational flow and how to maintain context across turns.

Multi-turn Claude conversation
{"system": "You are a cooking assistant specializing in Italian cuisine.", "messages": [{"role": "user", "content": "How do I make carbonara?"}, {"role": "assistant", "content": "Authentic carbonara uses just five ingredients: spaghetti, guanciale (cured pork cheek), eggs, Pecorino Romano cheese, and black pepper. The key technique is tempering the egg mixture with the hot pasta water so it creates a creamy sauce without scrambling."}, {"role": "user", "content": "Can I use bacon instead of guanciale?"}, {"role": "assistant", "content": "While guanciale gives the most authentic flavor, pancetta is a closer substitute. Bacon works in a pinch but its smoky flavor changes the dish significantly. If using bacon, choose an unsmoked variety. Whichever you use, cut it into small pieces and render the fat slowly over medium-low heat until crispy."}]}
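If your data is already in the OpenAI chat format with an inline system message, restructuring it into the top-level-system shape shown above takes only a few lines. This is an illustrative sketch (to_top_level_system is a hypothetical helper, not an official conversion tool):

```python
def to_top_level_system(example):
    """Move an inline system message out of the messages array
    into a top-level "system" field, as in the examples above."""
    messages = list(example["messages"])
    system = ""
    if messages and messages[0].get("role") == "system":
        system = messages[0]["content"]
        messages = messages[1:]
    return {"system": system, "messages": messages}

openai_style = {"messages": [
    {"role": "system", "content": "You are a cooking assistant."},
    {"role": "user", "content": "How do I make carbonara?"},
]}
converted = to_top_level_system(openai_style)
print(converted["system"])  # You are a cooking assistant.
```

Run the converter over each line of your JSONL file and write the results to a new file, since the transformation is independent per example.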

HuggingFace JSONL Datasets

HuggingFace's datasets library supports JSONL natively as an input format. You can load local JSONL files, stream remote datasets, and convert between formats easily. JSONL is the recommended format for sharing datasets on the HuggingFace Hub.

Load a JSONL file as a Dataset object with the datasets library. This gives you efficient memory-mapped access, built-in train/test splitting, and seamless integration with the Trainer API.

Loading a JSONL dataset
from datasets import load_dataset

# Load a local JSONL file
dataset = load_dataset('json', data_files='training.jsonl')
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['messages', 'system'],
#         num_rows: 5000
#     })
# })

# Load with train/test split
dataset = load_dataset('json', data_files={
    'train': 'train.jsonl',
    'test': 'test.jsonl'
})

# Stream a large remote dataset
dataset = load_dataset(
    'json',
    data_files='https://example.com/large_dataset.jsonl',
    streaming=True
)
for example in dataset['train']:
    print(example)
    break

Convert any HuggingFace dataset to JSONL for use in other training pipelines. The to_json method writes each example as a separate JSON line.

Exporting to JSONL
from datasets import load_dataset

# Load a dataset from the Hub
dataset = load_dataset('squad', split='train')

# Export to JSONL
dataset.to_json('squad_train.jsonl')
print(f'Exported {len(dataset)} examples')

# Export with specific columns
dataset.select_columns(['question', 'context', 'answers']).to_json(
    'squad_filtered.jsonl'
)

# Process and export
def format_for_finetuning(example):
    return {
        'messages': [
            {'role': 'user', 'content': example['question']},
            {'role': 'assistant', 'content': example['answers']['text'][0]}
        ]
    }

formatted = dataset.map(format_for_finetuning, remove_columns=dataset.column_names)
formatted.to_json('squad_chat_format.jsonl')

Data Validation and Cleaning

Training data quality directly affects model performance. Invalid JSON, missing fields, overly long examples, and duplicate entries all degrade fine-tuning results. Always validate and clean your JSONL files before starting a training run.

This script checks every line of a JSONL file for common problems: invalid JSON, missing required fields, and empty content. Run it before uploading to catch issues early.

JSONL training data validator
import json
from collections import Counter

def validate_training_data(path: str) -> dict:
    """Validate a JSONL file for AI fine-tuning."""
    stats = Counter()
    errors = []
    with open(path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            stats['total'] += 1
            # Check valid JSON
            try:
                data = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f'Line {line_num}: Invalid JSON - {e}')
                stats['invalid_json'] += 1
                continue
            # Check messages field exists
            if 'messages' not in data:
                errors.append(f'Line {line_num}: Missing "messages" field')
                stats['missing_messages'] += 1
                continue
            messages = data['messages']
            line_ok = True
            # Check message structure
            for i, msg in enumerate(messages):
                if 'role' not in msg:
                    errors.append(f'Line {line_num}, msg {i}: Missing "role"')
                    stats['missing_role'] += 1
                    line_ok = False
                if 'content' not in msg:
                    errors.append(f'Line {line_num}, msg {i}: Missing "content"')
                    stats['missing_content'] += 1
                    line_ok = False
                elif not msg['content'].strip():
                    errors.append(f'Line {line_num}, msg {i}: Empty content')
                    stats['empty_content'] += 1
                    line_ok = False
            # Check there is at least one assistant message
            roles = [m.get('role') for m in messages]
            if 'assistant' not in roles:
                errors.append(f'Line {line_num}: No assistant message')
                stats['no_assistant'] += 1
                line_ok = False
            # Only count the line as valid if no check failed
            if line_ok:
                stats['valid'] += 1
    return {'stats': dict(stats), 'errors': errors[:50]}

result = validate_training_data('training.jsonl')
print(f"Total: {result['stats'].get('total', 0)}")
print(f"Valid: {result['stats'].get('valid', 0)}")
if result['errors']:
    print(f"\nFirst {len(result['errors'])} errors:")
    for err in result['errors']:
        print(f'  {err}')

Duplicate training examples waste compute and can bias the model toward over-represented patterns. This script removes exact duplicates based on a content hash of each line.

Deduplication script
import json
import hashlib

def deduplicate_jsonl(input_path: str, output_path: str) -> dict:
    """Remove duplicate training examples from a JSONL file."""
    seen_hashes = set()
    total = 0
    unique = 0
    with open(input_path, 'r') as fin, open(output_path, 'w') as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            total += 1
            # Hash the normalized JSON to catch formatting differences
            data = json.loads(line)
            canonical = json.dumps(data, sort_keys=True)
            content_hash = hashlib.sha256(canonical.encode()).hexdigest()
            if content_hash not in seen_hashes:
                seen_hashes.add(content_hash)
                fout.write(json.dumps(data) + '\n')
                unique += 1
    duplicates = total - unique
    print(f'Total: {total}, Unique: {unique}, Removed: {duplicates}')
    return {'total': total, 'unique': unique, 'duplicates': duplicates}

deduplicate_jsonl('training.jsonl', 'training_deduped.jsonl')

Large-Scale Data Preparation

Production training datasets often contain hundreds of thousands or even millions of examples. At that scale you need automated pipelines to shuffle, split, and shard your JSONL data. Proper preparation prevents training problems such as catastrophic forgetting and ensures reproducible experiments.

This script randomly shuffles the data and splits it into training and test sets. Shuffling is critical because JSONL files are often generated in order, and training on ordered data can lead to poor generalization.

Train/test split and shuffle pipeline
import json
import random
from pathlib import Path

def prepare_training_data(
    input_path: str,
    output_dir: str,
    test_ratio: float = 0.1,
    seed: int = 42
) -> dict:
    """Shuffle and split JSONL into train/test sets."""
    random.seed(seed)
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    # Load all examples
    examples = []
    with open(input_path, 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    # Shuffle
    random.shuffle(examples)
    # Split
    split_idx = int(len(examples) * (1 - test_ratio))
    train_data = examples[:split_idx]
    test_data = examples[split_idx:]
    # Write output files
    train_path = output / 'train.jsonl'
    test_path = output / 'test.jsonl'
    for data, path in [(train_data, train_path), (test_data, test_path)]:
        with open(path, 'w') as f:
            for example in data:
                f.write(json.dumps(example) + '\n')
    print(f'Train: {len(train_data)} examples -> {train_path}')
    print(f'Test: {len(test_data)} examples -> {test_path}')
    return {'train': len(train_data), 'test': len(test_data)}

prepare_training_data(
    'all_examples.jsonl',
    './prepared_data/',
    test_ratio=0.1,
    seed=42
)

When a dataset is too large for a single file, or when you need to distribute training across multiple GPUs, shard the JSONL file into smaller chunks. Each shard can be processed independently.

File sharding for distributed training
import json
from pathlib import Path

def shard_jsonl(
    input_path: str,
    output_dir: str,
    shard_size: int = 50000
) -> int:
    """Split a large JSONL file into smaller shards."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    shard_num = 0
    line_count = 0
    current_file = None
    with open(input_path, 'r') as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            if line_count % shard_size == 0:
                if current_file:
                    current_file.close()
                shard_path = output / f'shard_{shard_num:04d}.jsonl'
                current_file = open(shard_path, 'w')
                shard_num += 1
            current_file.write(line + '\n')
            line_count += 1
    if current_file:
        current_file.close()
    print(f'Created {shard_num} shards from {line_count} examples')
    return shard_num

shard_jsonl('large_dataset.jsonl', './shards/', shard_size=50000)
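Once sharded, each worker can claim a disjoint subset of the files. The sketch below shows a simple round-robin assignment; shards_for_rank and the rank/world_size parameters are illustrative, matching the shard_NNNN.jsonl naming used above:

```python
from pathlib import Path

def shards_for_rank(shard_dir, rank, world_size):
    """Assign shard files round-robin so each worker reads a disjoint subset."""
    shards = sorted(Path(shard_dir).glob("shard_*.jsonl"))
    return [s for i, s in enumerate(shards) if i % world_size == rank]

# Demo: create 5 tiny shards, then split them across 2 workers
demo = Path("demo_shards")
demo.mkdir(exist_ok=True)
for i in range(5):
    (demo / f"shard_{i:04d}.jsonl").write_text("{}\n")
mine = shards_for_rank(demo, rank=0, world_size=2)
print(len(mine))  # 3
```

Sorting before assigning keeps the mapping deterministic across runs, which matters for reproducible distributed training.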

Validate Your Training Data Online

Use our free browser-based tool to validate, format, and convert your JSONL training data before uploading it to OpenAI, Anthropic, or HuggingFace.

Ready to start preparing your training data?

Validate, format, and inspect your JSONL training files right in your browser. No uploads, no signup, 100% private.

