JSONL AI Training Data
A comprehensive guide to preparing JSONL training data for AI and machine learning. Covers OpenAI fine-tuning, Anthropic Claude, HuggingFace datasets, data validation, and large-scale data pipelines.
Last updated: February 2026
Why JSONL Is the Standard Format for AI Training Data
JSONL (JSON Lines) has become the de facto standard for AI and machine learning training data. Every major AI provider, from OpenAI to Anthropic to HuggingFace, uses JSONL as the primary format for fine-tuning datasets. The reason is simple: JSONL stores one example per line, which makes it trivial to stream data during training, shuffle examples, count records, split train/test sets, and append new examples without rewriting the whole file.
Unlike a single JSON array, which must be parsed in full before it can be used, a JSONL file can be processed line by line. That matters once your training dataset grows to millions of examples spanning multiple gigabytes. In this guide you will learn the exact JSONL formats required by OpenAI, Anthropic Claude, and HuggingFace, how to validate and clean your data, and how to prepare large-scale datasets for production training.
OpenAI Fine-tuning JSONL Format
OpenAI uses JSONL for both fine-tuning and the Batch API. For fine-tuning chat models such as GPT-4o-mini, each line must contain a messages array in the standard chat-completion structure. The system, user, and assistant roles define the conversation the model should learn to reproduce.
Each line is one training example. The system message sets the behavior, the user message is the input, and the assistant message is the target output the model should learn.
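The line-by-line property is easy to exploit directly. Below is a minimal sketch of a lazy JSONL reader; `iter_jsonl` and `demo.jsonl` are illustrative names created here for the example, not part of any library:

```python
import json

def iter_jsonl(path):
    """Yield one parsed example at a time; only a single line is ever held in memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Throwaway demo file: three tiny records, one per line.
with open("demo.jsonl", "w", encoding="utf-8") as f:
    for i in range(3):
        f.write(json.dumps({"id": i}) + "\n")

# Count records without ever materializing the whole dataset.
total = sum(1 for _ in iter_jsonl("demo.jsonl"))
```

Because the reader is a generator, the same pattern scales unchanged from a three-line demo file to a multi-gigabyte dataset.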
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for an e-commerce platform."}, {"role": "user", "content": "Where is my order #12345?"}, {"role": "assistant", "content": "I can see that order #12345 was shipped on February 10th via FedEx. The tracking number is FX987654321. Based on the latest update, it should arrive by February 14th."}]}
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for an e-commerce platform."}, {"role": "user", "content": "I want to return my purchase."}, {"role": "assistant", "content": "I would be happy to help with your return. Could you please provide the order number and the reason for the return? Our return policy allows returns within 30 days of delivery for most items."}]}
For multi-turn conversations, include the full conversation history. The model learns from every assistant reply in the messages array. This is how you teach the model to handle follow-up questions.
{"messages": [{"role": "system", "content": "You are a Python programming tutor."}, {"role": "user", "content": "What is a list comprehension?"}, {"role": "assistant", "content": "A list comprehension is a concise way to create lists in Python. Instead of writing a for loop to build a list, you can write it in one line: [expression for item in iterable if condition]."}, {"role": "user", "content": "Can you show me an example?"}, {"role": "assistant", "content": "Sure! Here is an example that creates a list of squares for even numbers:\n\nsquares = [x**2 for x in range(10) if x % 2 == 0]\n# Result: [0, 4, 16, 36, 64]\n\nThis is equivalent to:\nsquares = []\nfor x in range(10):\n if x % 2 == 0:\n squares.append(x**2)"}]}
OpenAI requires at least 10 training examples but recommends 50-100 for a noticeable improvement. Every message must include role and content fields. The system message is optional but recommended for consistent behavior. Validate your file before uploading with the OpenAI CLI: openai tools fine_tunes.prepare_data -f training.jsonl
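Beyond the CLI, a few lines of standard-library Python can catch the two most common rejection causes (too few examples, no assistant turn) before you upload. The `quick_check` helper below is a hypothetical sketch, not an OpenAI tool:

```python
import json

def quick_check(path, minimum=10):
    """Lightweight pre-upload check: example count and presence of an assistant turn."""
    problems = []
    n = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            n += 1
            roles = [m.get("role") for m in json.loads(line).get("messages", [])]
            if "assistant" not in roles:
                problems.append(f"line {line_num}: no assistant message")
    if n < minimum:
        problems.append(f"only {n} examples; at least {minimum} are required")
    return n, problems

# Throwaway demo file with a single, incomplete example.
with open("tiny.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"messages": [{"role": "user", "content": "Hi"}]}) + "\n")

n, problems = quick_check("tiny.jsonl")
```

On the demo file this flags both problems: the lone example has no assistant message, and one example is well below the minimum of ten.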
Anthropic Claude Training Format
Anthropic uses JSONL for fine-tuning Claude models. The format uses human and assistant turns in a structured conversation. While the exact format depends on the fine-tuning method and API version, the core pattern follows the structure of the Messages API.
Each line contains one conversation with a system prompt and alternating user/assistant messages. The model learns to produce the assistant replies given the conversational context.
{"system": "You are a medical information assistant. Provide accurate, general health information while always recommending consulting a healthcare professional.", "messages": [{"role": "user", "content": "What are the symptoms of vitamin D deficiency?"}, {"role": "assistant", "content": "Common symptoms of vitamin D deficiency include fatigue and tiredness, bone pain and muscle weakness, mood changes including depression, slow wound healing, and frequent infections. However, many people with mild deficiency may not notice symptoms. A blood test is the only way to confirm your vitamin D levels. I recommend speaking with your doctor who can order the appropriate test and recommend supplementation if needed."}]}
{"system": "You are a medical information assistant. Provide accurate, general health information while always recommending consulting a healthcare professional.", "messages": [{"role": "user", "content": "Is it safe to take ibuprofen daily?"}, {"role": "assistant", "content": "Taking ibuprofen daily for extended periods can carry risks including stomach ulcers and gastrointestinal bleeding, increased risk of heart attack and stroke, kidney problems with long-term use, and potential interactions with other medications. For occasional pain relief, short-term use at recommended doses is generally considered safe for most adults. However, if you find yourself needing daily pain relief, please consult your doctor. They can evaluate the underlying cause and recommend a safer long-term treatment plan."}]}
For multi-turn training data, include complete conversations with alternating user and assistant messages. The model learns the conversational flow and how to maintain context across turns.
{"system": "You are a cooking assistant specializing in Italian cuisine.", "messages": [{"role": "user", "content": "How do I make carbonara?"}, {"role": "assistant", "content": "Authentic carbonara uses just five ingredients: spaghetti, guanciale (cured pork cheek), eggs, Pecorino Romano cheese, and black pepper. The key technique is tempering the egg mixture with the hot pasta water so it creates a creamy sauce without scrambling."}, {"role": "user", "content": "Can I use bacon instead of guanciale?"}, {"role": "assistant", "content": "While guanciale gives the most authentic flavor, pancetta is a closer substitute. Bacon works in a pinch but its smoky flavor changes the dish significantly. If using bacon, choose an unsmoked variety. Whichever you use, cut it into small pieces and render the fat slowly over medium-low heat until crispy."}]}
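If your data is already in OpenAI's chat format, with the system prompt inside the messages array, converting it to the top-level-system shape shown above takes only a few lines. The helper name `to_anthropic_style` is illustrative, assuming the record shapes shown in this guide:

```python
def to_anthropic_style(record):
    """Hoist the system message out of `messages` into a top-level `system` field."""
    system = ""
    rest = []
    for msg in record["messages"]:
        if msg["role"] == "system":
            system = msg["content"]
        else:
            rest.append(msg)  # keep user/assistant turns in order
    return {"system": system, "messages": rest}

# An OpenAI-style record with the system prompt embedded in `messages`.
record = {"messages": [
    {"role": "system", "content": "You are a cooking assistant."},
    {"role": "user", "content": "How do I make carbonara?"},
    {"role": "assistant", "content": "Spaghetti, guanciale, eggs, Pecorino, pepper."},
]}
converted = to_anthropic_style(record)
```

Run the conversion over each line of your JSONL file and write the results back out one record per line.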
HuggingFace Datasets with JSONL
HuggingFace's datasets library supports JSONL natively as an input format. You can load local JSONL files, stream remote datasets, and convert between formats easily. JSONL is the recommended format for sharing datasets on the HuggingFace Hub.
Use the datasets library to load a JSONL file as a Dataset object. This gives you efficient memory-mapped access, built-in train/test splitting, and seamless integration with the Trainer API.
from datasets import load_dataset

# Load a local JSONL file
dataset = load_dataset('json', data_files='training.jsonl')
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['messages', 'system'],
#         num_rows: 5000
#     })
# })

# Load with train/test split
dataset = load_dataset('json', data_files={
    'train': 'train.jsonl',
    'test': 'test.jsonl'
})

# Stream a large remote dataset
dataset = load_dataset(
    'json',
    data_files='https://example.com/large_dataset.jsonl',
    streaming=True
)
for example in dataset['train']:
    print(example)
    break
Convert any HuggingFace dataset to JSONL for use in other training pipelines. The to_json method writes each example as a separate JSON line.
from datasets import load_dataset

# Load a dataset from the Hub
dataset = load_dataset('squad', split='train')

# Export to JSONL
dataset.to_json('squad_train.jsonl')
print(f'Exported {len(dataset)} examples')

# Export with specific columns
dataset.select_columns(['question', 'context', 'answers']).to_json('squad_filtered.jsonl')

# Process and export
def format_for_finetuning(example):
    return {'messages': [
        {'role': 'user', 'content': example['question']},
        {'role': 'assistant', 'content': example['answers']['text'][0]}
    ]}

formatted = dataset.map(format_for_finetuning, remove_columns=dataset.column_names)
formatted.to_json('squad_chat_format.jsonl')
Data Validation and Cleaning
The quality of your training data directly affects model performance. Invalid JSON, missing fields, overly long examples, and duplicate entries all degrade fine-tuning results. Always validate and clean your JSONL files before starting a training run.
This script checks every line of a JSONL file for common problems: invalid JSON, missing required fields, empty content, and missing assistant messages. Running it before upload catches issues early.
import json
from collections import Counter

def validate_training_data(path: str) -> dict:
    """Validate a JSONL file for AI fine-tuning."""
    stats = Counter()
    errors = []
    with open(path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            stats['total'] += 1

            # Check valid JSON
            try:
                data = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f'Line {line_num}: Invalid JSON - {e}')
                stats['invalid_json'] += 1
                continue

            # Check messages field exists
            if 'messages' not in data:
                errors.append(f'Line {line_num}: Missing "messages" field')
                stats['missing_messages'] += 1
                continue
            messages = data['messages']
            errors_before = len(messages) and len(errors)

            # Check message structure
            for i, msg in enumerate(messages):
                if 'role' not in msg:
                    errors.append(f'Line {line_num}, msg {i}: Missing "role"')
                    stats['missing_role'] += 1
                if 'content' not in msg:
                    errors.append(f'Line {line_num}, msg {i}: Missing "content"')
                    stats['missing_content'] += 1
                elif not msg['content'].strip():
                    errors.append(f'Line {line_num}, msg {i}: Empty content')
                    stats['empty_content'] += 1

            # Check has at least one assistant message
            roles = [m.get('role') for m in messages]
            if 'assistant' not in roles:
                errors.append(f'Line {line_num}: No assistant message')
                stats['no_assistant'] += 1

            # Count the line as valid only if the structure checks found nothing
            if len(errors) == errors_before:
                stats['valid'] += 1

    return {'stats': dict(stats), 'errors': errors[:50]}

result = validate_training_data('training.jsonl')
print(f"Total: {result['stats'].get('total', 0)}")
print(f"Valid: {result['stats'].get('valid', 0)}")
if result['errors']:
    print(f"\nFirst {len(result['errors'])} errors:")
    for err in result['errors']:
        print(f'  {err}')
Duplicate training examples waste compute and can bias the model toward over-represented patterns. This script removes exact duplicates based on a content hash of each line.
import json
import hashlib

def deduplicate_jsonl(input_path: str, output_path: str) -> dict:
    """Remove duplicate training examples from a JSONL file."""
    seen_hashes = set()
    total = 0
    unique = 0
    with open(input_path, 'r') as fin, open(output_path, 'w') as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            total += 1

            # Hash the normalized JSON to catch formatting differences
            data = json.loads(line)
            canonical = json.dumps(data, sort_keys=True)
            content_hash = hashlib.sha256(canonical.encode()).hexdigest()

            if content_hash not in seen_hashes:
                seen_hashes.add(content_hash)
                fout.write(json.dumps(data) + '\n')
                unique += 1

    duplicates = total - unique
    print(f'Total: {total}, Unique: {unique}, Removed: {duplicates}')
    return {'total': total, 'unique': unique, 'duplicates': duplicates}

deduplicate_jsonl('training.jsonl', 'training_deduped.jsonl')
Large-Scale Data Preparation
Production training datasets often contain hundreds of thousands or even millions of examples. At that scale you need automated pipelines to shuffle, split, and shard your JSONL data. Proper preparation prevents training problems such as catastrophic forgetting and keeps experiments reproducible.
This script randomly shuffles the data and splits it into train and test sets. Shuffling matters because JSONL files are often generated in order, and training on ordered data can hurt generalization.
import json
import random
from pathlib import Path

def prepare_training_data(input_path: str,
                          output_dir: str,
                          test_ratio: float = 0.1,
                          seed: int = 42) -> dict:
    """Shuffle and split JSONL into train/test sets."""
    random.seed(seed)
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    # Load all examples
    examples = []
    with open(input_path, 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))

    # Shuffle
    random.shuffle(examples)

    # Split
    split_idx = int(len(examples) * (1 - test_ratio))
    train_data = examples[:split_idx]
    test_data = examples[split_idx:]

    # Write output files
    train_path = output / 'train.jsonl'
    test_path = output / 'test.jsonl'
    for data, path in [(train_data, train_path), (test_data, test_path)]:
        with open(path, 'w') as f:
            for example in data:
                f.write(json.dumps(example) + '\n')

    print(f'Train: {len(train_data)} examples -> {train_path}')
    print(f'Test: {len(test_data)} examples -> {test_path}')
    return {'train': len(train_data), 'test': len(test_data)}

prepare_training_data('all_examples.jsonl',
                      './prepared_data/',
                      test_ratio=0.1,
                      seed=42)
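The approach above loads every example into memory before shuffling. When the file is too large for that, one option is to trade an exact split for a streaming one: route each line independently with a seeded random draw, so memory use stays constant and the split remains reproducible. Split sizes then only approximate test_ratio. The function name `streaming_split` is illustrative:

```python
import json
import random

def streaming_split(input_path, train_path, test_path, test_ratio=0.1, seed=42):
    """Reproducible train/test split that never holds the dataset in memory."""
    rng = random.Random(seed)
    counts = {"train": 0, "test": 0}
    with open(input_path) as fin, \
         open(train_path, "w") as ftrain, \
         open(test_path, "w") as ftest:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            # Each line is routed independently; sizes are approximate.
            if rng.random() < test_ratio:
                ftest.write(line + "\n")
                counts["test"] += 1
            else:
                ftrain.write(line + "\n")
                counts["train"] += 1
    return counts

# Throwaway demo input: 1000 one-field records.
with open("all_examples_demo.jsonl", "w") as f:
    for i in range(1000):
        f.write(json.dumps({"id": i}) + "\n")

counts = streaming_split("all_examples_demo.jsonl",
                         "train_demo.jsonl", "test_demo.jsonl",
                         test_ratio=0.1)
```

Note this does not shuffle the training set itself; pair it with shuffling at training time (for example, a shuffle buffer in your data loader) if input order matters.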
When a dataset is too large for a single file, or you need to distribute training across multiple GPUs, shard the JSONL file into smaller chunks. Each shard can be processed independently.
from pathlib import Path

def shard_jsonl(input_path: str,
                output_dir: str,
                shard_size: int = 50000) -> int:
    """Split a large JSONL file into smaller shards."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    shard_num = 0
    line_count = 0
    current_file = None

    with open(input_path, 'r') as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            if line_count % shard_size == 0:
                if current_file:
                    current_file.close()
                shard_path = output / f'shard_{shard_num:04d}.jsonl'
                current_file = open(shard_path, 'w')
                shard_num += 1
            current_file.write(line + '\n')
            line_count += 1

    if current_file:
        current_file.close()

    print(f'Created {shard_num} shards from {line_count} examples')
    return shard_num

shard_jsonl('large_dataset.jsonl', './shards/', shard_size=50000)
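After sharding, it is worth confirming that no lines were lost: the non-empty line count summed across shards should equal the source file's. The sketch below does that count; `count_shard_lines` and the demo directory are illustrative names:

```python
from pathlib import Path

def count_shard_lines(shard_dir):
    """Sum non-empty lines across every shard file, in deterministic order."""
    total = 0
    for shard in sorted(Path(shard_dir).glob("shard_*.jsonl")):
        with open(shard) as f:
            total += sum(1 for line in f if line.strip())
    return total

# Throwaway demo: two shards holding 2 and 1 records.
demo = Path("demo_shards")
demo.mkdir(exist_ok=True)
(demo / "shard_0000.jsonl").write_text('{"id": 0}\n{"id": 1}\n')
(demo / "shard_0001.jsonl").write_text('{"id": 2}\n')

total = count_shard_lines("demo_shards")
```

Compare the returned total against a line count of the original file before deleting it or kicking off a training run.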
Validate Your Training Data Online
Use our free, browser-based tool to validate, format, and convert your JSONL training data before uploading it to OpenAI, Anthropic, or HuggingFace.