JSONL AI Training Data
A complete guide to preparing AI and machine learning training data in JSONL format. Covers OpenAI fine-tuning, Anthropic Claude, HuggingFace datasets, data validation, and large-scale data pipelines.
Last updated: February 2026
Why JSONL Became the Standard Format for AI Training Data
JSONL (JSON Lines) has become the de facto standard format for AI and machine learning training data. Every major AI vendor, from OpenAI to Anthropic to HuggingFace, uses JSONL as the primary format for fine-tuning datasets. The reason is simple: JSONL stores one example per line, which makes it trivial to stream data during training, shuffle examples, count records, split into training and test sets, and append new examples without rewriting the entire file.
Unlike a single JSON array, which must be parsed in full before it can be used, a JSONL file can be processed line by line. This matters once your training dataset grows to millions of examples and gigabytes in size. In this guide you will learn the exact JSONL formats required by OpenAI, Anthropic Claude, and HuggingFace, how to validate and clean your data, and how to prepare large-scale datasets for production training.
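The line-oriented properties above can be sketched in a few lines of Python. This is a minimal illustration, not part of any SDK; the function names and the file path are hypothetical.

```python
import json

def count_examples(path):
    """Stream a JSONL file line by line; memory use stays constant
    no matter how large the file is."""
    count = 0
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                count += 1
    return count

def append_example(path, example):
    """Append a new training example without rewriting existing lines,
    something a single JSON array cannot do."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(example, ensure_ascii=False) + '\n')
```

The same pattern extends to shuffling and splitting: because each line is independent, any line-level operation works without parsing the whole file.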
OpenAI Fine-tuning JSONL Format
OpenAI uses JSONL for both fine-tuning and the Batch API. For fine-tuning chat models such as GPT-4o-mini, each line must contain a messages array in the standard chat completion structure. Three roles, system, user, and assistant, define the conversation the model should learn to produce.
Each line is one training example. The system message sets the behavior, the user message is the input, and the assistant message is the target output the model should learn to generate.
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for an e-commerce platform."}, {"role": "user", "content": "Where is my order #12345?"}, {"role": "assistant", "content": "I can see that order #12345 was shipped on February 10th via FedEx. The tracking number is FX987654321. Based on the latest update, it should arrive by February 14th."}]}
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for an e-commerce platform."}, {"role": "user", "content": "I want to return my purchase."}, {"role": "assistant", "content": "I would be happy to help with your return. Could you please provide the order number and the reason for the return? Our return policy allows returns within 30 days of delivery for most items."}]}
For multi-turn conversations, include the full conversation history. The model learns from every assistant turn in the messages array. This is how you teach a model to handle follow-up questions.
{"messages": [{"role": "system", "content": "You are a Python programming tutor."}, {"role": "user", "content": "What is a list comprehension?"}, {"role": "assistant", "content": "A list comprehension is a concise way to create lists in Python. Instead of writing a for loop to build a list, you can write it in one line: [expression for item in iterable if condition]."}, {"role": "user", "content": "Can you show me an example?"}, {"role": "assistant", "content": "Sure! Here is an example that creates a list of squares for even numbers:\n\nsquares = [x**2 for x in range(10) if x % 2 == 0]\n# Result: [0, 4, 16, 36, 64]\n\nThis is equivalent to:\nsquares = []\nfor x in range(10):\n if x % 2 == 0:\n squares.append(x**2)"}]}
OpenAI requires a minimum of 10 training examples but recommends 50-100 for noticeable improvement. Every message must include role and content fields. The system message is optional but recommended for consistent behavior. You can validate your file before uploading with the data preparation tool from the legacy (pre-1.0) OpenAI CLI: openai tools fine_tunes.prepare_data -f training.jsonl
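As a quick pre-upload sanity check, the example-count thresholds above can be verified in plain Python. This is an illustrative sketch: check_dataset_size is a hypothetical helper, and the thresholds simply restate the 10-minimum / 50-100-recommended guidance quoted above.

```python
def check_dataset_size(path):
    """Count JSONL examples and warn when the file is below the
    minimums OpenAI documents for fine-tuning (hypothetical helper)."""
    with open(path, 'r', encoding='utf-8') as f:
        n = sum(1 for line in f if line.strip())
    if n < 10:
        print(f'{n} examples: below the 10-example minimum')
    elif n < 50:
        print(f'{n} examples: valid, but 50-100 are recommended')
    else:
        print(f'{n} examples: OK')
    return n
```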
Anthropic Claude Training Format
Anthropic uses JSONL for fine-tuning Claude models, with user and assistant turns in a structured conversation. While the exact format depends on the fine-tuning method and API version, the core pattern follows the Messages API structure.
Each line contains one conversation, including the system prompt and alternating user/assistant messages. The model learns to generate the assistant responses given the conversation context.
{"system": "You are a medical information assistant. Provide accurate, general health information while always recommending consulting a healthcare professional.", "messages": [{"role": "user", "content": "What are the symptoms of vitamin D deficiency?"}, {"role": "assistant", "content": "Common symptoms of vitamin D deficiency include fatigue and tiredness, bone pain and muscle weakness, mood changes including depression, slow wound healing, and frequent infections. However, many people with mild deficiency may not notice symptoms. A blood test is the only way to confirm your vitamin D levels. I recommend speaking with your doctor who can order the appropriate test and recommend supplementation if needed."}]}
{"system": "You are a medical information assistant. Provide accurate, general health information while always recommending consulting a healthcare professional.", "messages": [{"role": "user", "content": "Is it safe to take ibuprofen daily?"}, {"role": "assistant", "content": "Taking ibuprofen daily for extended periods can carry risks including stomach ulcers and gastrointestinal bleeding, increased risk of heart attack and stroke, kidney problems with long-term use, and potential interactions with other medications. For occasional pain relief, short-term use at recommended doses is generally considered safe for most adults. However, if you find yourself needing daily pain relief, please consult your doctor. They can evaluate the underlying cause and recommend a safer long-term treatment plan."}]}
For multi-turn training data, include the full conversation with alternating user and assistant messages. The model learns the conversational flow and how to maintain context across turns.
{"system": "You are a cooking assistant specializing in Italian cuisine.", "messages": [{"role": "user", "content": "How do I make carbonara?"}, {"role": "assistant", "content": "Authentic carbonara uses just five ingredients: spaghetti, guanciale (cured pork cheek), eggs, Pecorino Romano cheese, and black pepper. The key technique is tempering the egg mixture with the hot pasta water so it creates a creamy sauce without scrambling."}, {"role": "user", "content": "Can I use bacon instead of guanciale?"}, {"role": "assistant", "content": "While guanciale gives the most authentic flavor, pancetta is a closer substitute. Bacon works in a pinch but its smoky flavor changes the dish significantly. If using bacon, choose an unsmoked variety. Whichever you use, cut it into small pieces and render the fat slowly over medium-low heat until crispy."}]}
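If your data is already in the OpenAI chat format shown earlier, converting it to the shape above mostly means hoisting the system message out of the messages array into a top-level system key. A minimal sketch follows; openai_to_anthropic and convert_file are hypothetical helpers, not part of any SDK.

```python
import json

def openai_to_anthropic(record):
    """Move an embedded system message to a top-level 'system' key,
    leaving only user/assistant turns in 'messages' (illustrative)."""
    system = None
    turns = []
    for msg in record['messages']:
        if msg['role'] == 'system':
            system = msg['content']
        else:
            turns.append(msg)
    if system is not None:
        return {'system': system, 'messages': turns}
    return {'messages': turns}

def convert_file(input_path, output_path):
    """Convert a whole JSONL file, one record per line."""
    with open(input_path, 'r', encoding='utf-8') as fin, \
         open(output_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            if line.strip():
                converted = openai_to_anthropic(json.loads(line))
                fout.write(json.dumps(converted, ensure_ascii=False) + '\n')
```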
HuggingFace JSONL Datasets
HuggingFace's datasets library supports JSONL natively as an input format. You can load local JSONL files, stream remote datasets, and convert between formats easily. JSONL is the recommended format for sharing datasets on the HuggingFace Hub.
Use the datasets library to load a JSONL file as a Dataset object. This gives you efficient memory-mapped access, built-in train/test splitting, and seamless integration with the Trainer API.
from datasets import load_dataset

# Load a local JSONL file
dataset = load_dataset('json', data_files='training.jsonl')
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['messages', 'system'],
#         num_rows: 5000
#     })
# })

# Load with train/test split
dataset = load_dataset('json', data_files={
    'train': 'train.jsonl',
    'test': 'test.jsonl'
})

# Stream a large remote dataset
dataset = load_dataset('json',
                       data_files='https://example.com/large_dataset.jsonl',
                       streaming=True)
for example in dataset['train']:
    print(example)
    break
Convert any HuggingFace dataset to JSONL for use in other training pipelines. The to_json method writes each example as a separate JSON line.
from datasets import load_dataset

# Load a dataset from the Hub
dataset = load_dataset('squad', split='train')

# Export to JSONL
dataset.to_json('squad_train.jsonl')
print(f'Exported {len(dataset)} examples')

# Export with specific columns
dataset.select_columns(['question', 'context', 'answers']).to_json('squad_filtered.jsonl')

# Process and export
def format_for_finetuning(example):
    return {'messages': [
        {'role': 'user', 'content': example['question']},
        {'role': 'assistant', 'content': example['answers']['text'][0]}
    ]}

formatted = dataset.map(format_for_finetuning, remove_columns=dataset.column_names)
formatted.to_json('squad_chat_format.jsonl')
Data Validation and Cleaning
Training data quality directly affects model performance. Invalid JSON, missing fields, overly long examples, and duplicate entries all reduce the effectiveness of fine-tuning. Always validate and clean your JSONL files before starting a training run.
This script checks every line of a JSONL file for common problems: invalid JSON, missing required fields, empty content, and conversations with no assistant message. Run it before uploading to catch issues early.
import json
from collections import Counter

def validate_training_data(path: str) -> dict:
    """Validate a JSONL file for AI fine-tuning."""
    stats = Counter()
    errors = []
    with open(path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            stats['total'] += 1
            # Check valid JSON
            try:
                data = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f'Line {line_num}: Invalid JSON - {e}')
                stats['invalid_json'] += 1
                continue
            # Check messages field exists
            if 'messages' not in data:
                errors.append(f'Line {line_num}: Missing "messages" field')
                stats['missing_messages'] += 1
                continue
            messages = data['messages']
            line_ok = True
            # Check message structure
            for i, msg in enumerate(messages):
                if 'role' not in msg:
                    errors.append(f'Line {line_num}, msg {i}: Missing "role"')
                    stats['missing_role'] += 1
                    line_ok = False
                if 'content' not in msg:
                    errors.append(f'Line {line_num}, msg {i}: Missing "content"')
                    stats['missing_content'] += 1
                    line_ok = False
                elif not msg['content'].strip():
                    errors.append(f'Line {line_num}, msg {i}: Empty content')
                    stats['empty_content'] += 1
                    line_ok = False
            # Check has at least one assistant message
            roles = [m.get('role') for m in messages]
            if 'assistant' not in roles:
                errors.append(f'Line {line_num}: No assistant message')
                stats['no_assistant'] += 1
                line_ok = False
            # Only count the line as valid if no structural errors were found
            if line_ok:
                stats['valid'] += 1
    return {'stats': dict(stats), 'errors': errors[:50]}

result = validate_training_data('training.jsonl')
print(f"Total: {result['stats'].get('total', 0)}")
print(f"Valid: {result['stats'].get('valid', 0)}")
if result['errors']:
    print(f"\nFirst {len(result['errors'])} errors:")
    for err in result['errors']:
        print(f'  {err}')
Duplicate training examples waste compute and can bias the model toward over-represented patterns. This script removes exact duplicates based on a content hash of each line.
import json
import hashlib

def deduplicate_jsonl(input_path: str, output_path: str) -> dict:
    """Remove duplicate training examples from a JSONL file."""
    seen_hashes = set()
    total = 0
    unique = 0
    with open(input_path, 'r') as fin, open(output_path, 'w') as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            total += 1
            # Hash the normalized JSON to catch formatting differences
            data = json.loads(line)
            canonical = json.dumps(data, sort_keys=True)
            content_hash = hashlib.sha256(canonical.encode()).hexdigest()
            if content_hash not in seen_hashes:
                seen_hashes.add(content_hash)
                fout.write(json.dumps(data) + '\n')
                unique += 1
    duplicates = total - unique
    print(f'Total: {total}, Unique: {unique}, Removed: {duplicates}')
    return {'total': total, 'unique': unique, 'duplicates': duplicates}

deduplicate_jsonl('training.jsonl', 'training_deduped.jsonl')
Large-Scale Data Preparation
Production training datasets often contain hundreds of thousands or even millions of examples. At that scale you need automated pipelines to shuffle, split, and shard your JSONL data. Proper preparation prevents training problems such as catastrophic forgetting and ensures reproducible experiments.
This script randomly shuffles the data and splits it into training and test sets. Shuffling is essential because JSONL files are often generated in order, and training on ordered data can lead to poor generalization.
import json
import random
from pathlib import Path

def prepare_training_data(input_path: str,
                          output_dir: str,
                          test_ratio: float = 0.1,
                          seed: int = 42) -> dict:
    """Shuffle and split JSONL into train/test sets."""
    random.seed(seed)
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    # Load all examples
    examples = []
    with open(input_path, 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))

    # Shuffle
    random.shuffle(examples)

    # Split
    split_idx = int(len(examples) * (1 - test_ratio))
    train_data = examples[:split_idx]
    test_data = examples[split_idx:]

    # Write output files
    train_path = output / 'train.jsonl'
    test_path = output / 'test.jsonl'
    for data, path in [(train_data, train_path), (test_data, test_path)]:
        with open(path, 'w') as f:
            for example in data:
                f.write(json.dumps(example) + '\n')

    print(f'Train: {len(train_data)} examples -> {train_path}')
    print(f'Test: {len(test_data)} examples -> {test_path}')
    return {'train': len(train_data), 'test': len(test_data)}

prepare_training_data('all_examples.jsonl',
                      './prepared_data/',
                      test_ratio=0.1,
                      seed=42)
When a dataset is too large for a single file, or you need to distribute training across multiple GPUs, shard the JSONL file into smaller chunks. Each shard can be processed independently.
import json
from pathlib import Path

def shard_jsonl(input_path: str,
                output_dir: str,
                shard_size: int = 50000) -> int:
    """Split a large JSONL file into smaller shards."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    shard_num = 0
    line_count = 0
    current_file = None
    with open(input_path, 'r') as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            # Start a new shard every shard_size lines
            if line_count % shard_size == 0:
                if current_file:
                    current_file.close()
                shard_path = output / f'shard_{shard_num:04d}.jsonl'
                current_file = open(shard_path, 'w')
                shard_num += 1
            current_file.write(line + '\n')
            line_count += 1
    if current_file:
        current_file.close()
    print(f'Created {shard_num} shards from {line_count} examples')
    return shard_num

shard_jsonl('large_dataset.jsonl', './shards/', shard_size=50000)
Validate Your Training Data Online
Use our free browser-based tool to validate, format, and convert your JSONL training data before uploading to OpenAI, Anthropic, or HuggingFace.