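Why do AI companies use JSONL for training data?

JSONL stores one training example per line, which makes it well suited to streaming during training, random shuffling, counting examples, and appending new data. Unlike a JSON array, a JSONL file does not have to be parsed into memory in full. That matters once a training dataset holds millions of examples across several gigabytes.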

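What is the correct JSONL format for OpenAI fine-tuning?

Each line must be a JSON object containing a messages array. Each message must have a role (system, user, or assistant) and a content field. Every example needs at least one user message and one assistant message. The system message is optional but recommended. OpenAI requires at least 10 examples and recommends 50-100 for good results.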

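How does Anthropic Claude training data differ from OpenAI's?

Anthropic Claude uses a similar messages-based format with user and assistant roles. The system prompt is typically a top-level field rather than a message in the array. Both formats use JSONL with one conversation per line, but the exact field names and structure follow each provider's API conventions.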

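How do I validate JSONL training data?

Check every line for valid JSON, required fields such as messages, role, and content, non-empty content strings, at least one assistant message per example, and a reasonable token length. A Python validation script or an online JSONL validator can catch problems before you upload.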

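How many training examples do I need for fine-tuning?

OpenAI requires a minimum of 10 examples and recommends at least 50-100 for a noticeable improvement. For complex tasks, 500-1000 high-quality examples typically give strong results. Quality matters more than quantity: a small, carefully curated set usually outperforms a large, noisy one.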

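How do I handle large training datasets in JSONL format?

For large datasets, read the file as a stream instead of loading it all into memory. Shuffle the data to avoid ordering bias, split it into train and test sets for evaluation, and shard it into multiple files for distributed training. HuggingFace's Python datasets library supports streaming JSONL files of any size.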

JSONL for AI Training Data

A comprehensive guide to preparing JSONL training data for AI and machine learning. Covers OpenAI fine-tuning, Anthropic Claude, HuggingFace datasets, data validation, and large-scale data pipelines.

Last updated: February 2026

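Why JSONL Is the Standard for AI Training Data

JSONL (JSON Lines) has become the de facto standard for AI and machine learning training data. Major AI providers and platforms, including OpenAI, Anthropic, and HuggingFace, use JSONL as a primary format for fine-tuning datasets. The reason is simple: JSONL stores one example per line, so streaming data during training, shuffling examples, counting records, splitting into train/test sets, and appending new samples without rewriting the whole file are all straightforward.

Unlike a single JSON array, which has to be parsed in full before use, a JSONL file can be processed line by line. That matters once a training dataset grows to millions of examples spread over gigabytes. In this guide you will learn the exact JSONL formats that OpenAI, Anthropic Claude, and HuggingFace expect, how to validate and clean your data, and how to prepare large datasets for production training runs.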
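
The line-by-line property described above can be sketched with nothing but the standard library. The helper names (iter_jsonl, count_examples, append_example) are illustrative, not part of any provider's tooling.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed example at a time; only the current line is held in memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_examples(path: str) -> int:
    """Count non-empty lines without parsing them as JSON."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def append_example(path: str, example: dict) -> None:
    """Append one example to the end of the file; existing lines are not rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

A JSON array offers none of these shortcuts: counting or appending would require parsing, and usually rewriting, the whole document.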

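OpenAI Fine-Tuning JSONL Format

OpenAI uses JSONL for both fine-tuning and the Batch API. When fine-tuning a chat model such as GPT-4o-mini, each line must contain a messages array in the standard Chat Completions structure. The three roles, system, user, and assistant, define the conversation the model learns to reproduce.

Each line is one training example. The system message sets the behavior, the user message is the input, and the assistant message is the target output the model should learn.

Chat completion training format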

{"messages": [{"role": "system", "content": "You are a helpful customer support agent for an e-commerce platform."}, {"role": "user", "content": "Where is my order #12345?"}, {"role": "assistant", "content": "I can see that order #12345 was shipped on February 10th via FedEx. The tracking number is FX987654321. Based on the latest update, it should arrive by February 14th."}]}
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for an e-commerce platform."}, {"role": "user", "content": "I want to return my purchase."}, {"role": "assistant", "content": "I would be happy to help with your return. Could you please provide the order number and the reason for the return? Our return policy allows returns within 30 days of delivery for most items."}]}
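
Lines like the two above are rarely written by hand. As a minimal sketch, assuming your source data is a list of question/answer pairs (the pairs list and both helper names here are hypothetical), each pair can be wrapped in the messages structure and serialized to a single line:

```python
import json


def to_chat_example(system: str, user: str, assistant: str) -> dict:
    """Build one training example in the messages structure shown above."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }


def to_jsonl_line(example: dict) -> str:
    """Serialize to one line; json.dumps escapes newlines inside strings."""
    return json.dumps(example, ensure_ascii=False)


# Hypothetical source records, e.g. exported from a support ticket system.
pairs = [
    ("Where is my order?", "Could you share the order number so I can check?"),
    ("Line one\nLine two", "Multi-line content still fits on one JSONL line."),
]
system_prompt = "You are a helpful customer support agent."
lines = [to_jsonl_line(to_chat_example(system_prompt, q, a)) for q, a in pairs]
```

Going through json.dumps rather than string formatting matters here: quotes and line breaks inside content are escaped, so every example stays on exactly one physical line.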

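For multi-turn conversations, include the full conversation history. The model learns from every assistant turn in the messages array, which teaches it how to handle follow-up questions.

Multi-turn conversation training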

{"messages": [{"role": "system", "content": "You are a Python programming tutor."}, {"role": "user", "content": "What is a list comprehension?"}, {"role": "assistant", "content": "A list comprehension is a concise way to create lists in Python. Instead of writing a for loop to build a list, you can write it in one line: [expression for item in iterable if condition]."}, {"role": "user", "content": "Can you show me an example?"}, {"role": "assistant", "content": "Sure! Here is an example that creates a list of squares for even numbers:\n\nsquares = [x**2 for x in range(10) if x % 2 == 0]\n# Result: [0, 4, 16, 36, 64]\n\nThis is equivalent to:\nsquares = []\nfor x in range(10):\n    if x % 2 == 0:\n        squares.append(x**2)"}]}

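OpenAI requires at least 10 training examples but recommends 50-100 for a noticeable improvement. Every message must have role and content fields. The system message is optional but recommended for consistent behavior. Before uploading, you can run the OpenAI CLI data preparation tool: openai tools fine_tunes.prepare_data -f training.jsonl. That tool originated with the legacy prompt/completion format, so check the current OpenAI documentation to confirm it handles chat-format files.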
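
The requirements listed in this section can also be checked locally. The sketch below covers only the rules stated in this guide (the 10-example minimum, the three roles, and at least one user and one assistant message); check_openai_dataset is a hypothetical helper, and the real API may accept additional roles or fields.

```python
from typing import List

ALLOWED_ROLES = {"system", "user", "assistant"}  # roles named in this guide
MIN_EXAMPLES = 10  # minimum stated in this guide


def check_openai_dataset(examples: List[dict]) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if len(examples) < MIN_EXAMPLES:
        problems.append(
            f"Only {len(examples)} examples; at least {MIN_EXAMPLES} are required"
        )
    for idx, example in enumerate(examples, 1):
        messages = example.get("messages")
        if not isinstance(messages, list):
            problems.append(f"Example {idx}: 'messages' must be a list")
            continue
        roles = []
        for i, msg in enumerate(messages):
            role = msg.get("role") if isinstance(msg, dict) else None
            roles.append(role)
            if role not in ALLOWED_ROLES:
                problems.append(f"Example {idx}, msg {i}: unexpected role {role!r}")
        if "user" not in roles or "assistant" not in roles:
            problems.append(
                f"Example {idx}: needs at least one user and one assistant message"
            )
    return problems
```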

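Anthropic Claude Training Format

Anthropic uses JSONL for fine-tuning Claude models. The format uses user and assistant turns in a structured conversation. The exact format depends on the fine-tuning method and API version, but the core pattern follows the Messages API structure.

Each line holds one conversation: a system prompt plus alternating user/assistant messages. The model learns to generate the assistant response given the conversation context.

Claude Messages format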

{"system": "You are a medical information assistant. Provide accurate, general health information while always recommending consulting a healthcare professional.", "messages": [{"role": "user", "content": "What are the symptoms of vitamin D deficiency?"}, {"role": "assistant", "content": "Common symptoms of vitamin D deficiency include fatigue and tiredness, bone pain and muscle weakness, mood changes including depression, slow wound healing, and frequent infections. However, many people with mild deficiency may not notice symptoms. A blood test is the only way to confirm your vitamin D levels. I recommend speaking with your doctor who can order the appropriate test and recommend supplementation if needed."}]}
{"system": "You are a medical information assistant. Provide accurate, general health information while always recommending consulting a healthcare professional.", "messages": [{"role": "user", "content": "Is it safe to take ibuprofen daily?"}, {"role": "assistant", "content": "Taking ibuprofen daily for extended periods can carry risks including stomach ulcers and gastrointestinal bleeding, increased risk of heart attack and stroke, kidney problems with long-term use, and potential interactions with other medications. For occasional pain relief, short-term use at recommended doses is generally considered safe for most adults. However, if you find yourself needing daily pain relief, please consult your doctor. They can evaluate the underlying cause and recommend a safer long-term treatment plan."}]}

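For multi-turn training data, include the whole conversation with alternating user and assistant messages. The model learns the conversational flow and how to carry context across turns.

Multi-turn Claude conversation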

{"system": "You are a cooking assistant specializing in Italian cuisine.", "messages": [{"role": "user", "content": "How do I make carbonara?"}, {"role": "assistant", "content": "Authentic carbonara uses just five ingredients: spaghetti, guanciale (cured pork cheek), eggs, Pecorino Romano cheese, and black pepper. The key technique is tempering the egg mixture with the hot pasta water so it creates a creamy sauce without scrambling."}, {"role": "user", "content": "Can I use bacon instead of guanciale?"}, {"role": "assistant", "content": "While guanciale gives the most authentic flavor, pancetta is a closer substitute. Bacon works in a pinch but its smoky flavor changes the dish significantly. If using bacon, choose an unsmoked variety. Whichever you use, cut it into small pieces and render the fat slowly over medium-low heat until crispy."}]}
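
If your data already exists in the OpenAI layout, where the system prompt is a message inside the array, it can be reshaped into the layout used in the examples above. This is a minimal sketch: openai_to_claude_style is a hypothetical helper, it assumes the input has already been validated, and the target layout is simply the one shown in this guide, so confirm the exact schema your fine-tuning service expects.

```python
import json


def openai_to_claude_style(example: dict) -> dict:
    """Move system messages to a top-level "system" field; keep user/assistant turns.

    Assumes every message is a dict with "role" and "content" keys.
    """
    system_parts = [
        m["content"] for m in example["messages"] if m["role"] == "system"
    ]
    turns = [
        {"role": m["role"], "content": m["content"]}
        for m in example["messages"]
        if m["role"] in ("user", "assistant")
    ]
    if system_parts:
        # Several system messages are joined with a blank line between them
        return {"system": "\n\n".join(system_parts), "messages": turns}
    return {"messages": turns}


source = {
    "messages": [
        {"role": "system", "content": "You are a Python programming tutor."},
        {"role": "user", "content": "What is a list comprehension?"},
        {"role": "assistant", "content": "A concise way to build a list."},
    ]
}
converted_line = json.dumps(openai_to_claude_style(source), ensure_ascii=False)
```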

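HuggingFace Datasets and JSONL

HuggingFace's datasets library supports JSONL natively as an input format. You can load local JSONL files, stream remote datasets, and convert between formats with little effort. JSONL is also among the formats supported for sharing datasets on the HuggingFace Hub.

Use the datasets library to load JSONL files into Dataset objects. This gives you efficient memory-mapped access, built-in train/test splitting, and smooth integration with the Trainer API.

Loading a JSONL dataset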

from datasets import load_dataset

# Load a local JSONL file
dataset = load_dataset('json', data_files='training.jsonl')
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['messages', 'system'],
#         num_rows: 5000
#     })
# })

# Load with train/test split
dataset = load_dataset('json', data_files={
    'train': 'train.jsonl',
    'test': 'test.jsonl'
})

# Stream a large remote dataset
dataset = load_dataset(
    'json',
    data_files='https://example.com/large_dataset.jsonl',
    streaming=True
)
for example in dataset['train']:
    print(example)
    break

Any HuggingFace dataset can be converted to JSONL for use in other training pipelines. The to_json method writes each example as a separate JSON line.

Exporting to JSONL

from datasets import load_dataset

# Load a dataset from the Hub
dataset = load_dataset('squad', split='train')

# Export to JSONL
dataset.to_json('squad_train.jsonl')
print(f'Exported {len(dataset)} examples')

# Export with specific columns
dataset.select_columns(['question', 'context', 'answers']).to_json(
    'squad_filtered.jsonl'
)

# Process and export
def format_for_finetuning(example):
    return {
        'messages': [
            {'role': 'user', 'content': example['question']},
            {'role': 'assistant', 'content': example['answers']['text'][0]}
        ]
    }

formatted = dataset.map(format_for_finetuning, remove_columns=dataset.column_names)
formatted.to_json('squad_chat_format.jsonl')

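Data Validation and Cleaning

Training data quality directly affects model performance. Invalid JSON, missing fields, overly long examples, and duplicate entries can all degrade fine-tuning results. Always validate and clean your JSONL files before starting a training run.

This script checks each line of a JSONL file for common problems: invalid JSON, missing required fields, empty content, and examples with no assistant message. It does not measure token length, so check that separately. Run it before uploading to catch problems early.

JSONL training data validator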

import json
import sys
from collections import Counter

def validate_training_data(path: str) -> dict:
    """Validate a JSONL file for AI fine-tuning."""
    stats = Counter()
    errors = []

    with open(path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            stats['total'] += 1

            # Check valid JSON
            try:
                data = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f'Line {line_num}: Invalid JSON - {e}')
                stats['invalid_json'] += 1
                continue

            # Check messages field exists
            if 'messages' not in data:
                errors.append(f'Line {line_num}: Missing "messages" field')
                stats['missing_messages'] += 1
                continue

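            messages = data['messages']
            if not isinstance(messages, list):
                errors.append(f'Line {line_num}: "messages" must be a list')
                stats['invalid_messages'] += 1
                continue

            # Remember the error count so we can tell whether this line passed
            errors_before = len(errors)

            # Check message structure
            for i, msg in enumerate(messages):
                if not isinstance(msg, dict):
                    errors.append(f'Line {line_num}, msg {i}: Not a JSON object')
                    stats['invalid_message'] += 1
                    continue
                if 'role' not in msg:
                    errors.append(f'Line {line_num}, msg {i}: Missing "role"')
                    stats['missing_role'] += 1
                if 'content' not in msg:
                    errors.append(f'Line {line_num}, msg {i}: Missing "content"')
                    stats['missing_content'] += 1
                elif not isinstance(msg['content'], str) or not msg['content'].strip():
                    # Non-string content (e.g. null) would crash .strip(), so flag it here
                    errors.append(f'Line {line_num}, msg {i}: Empty or non-string content')
                    stats['empty_content'] += 1

            # Check has at least one assistant message
            roles = [m.get('role') for m in messages if isinstance(m, dict)]
            if 'assistant' not in roles:
                errors.append(f'Line {line_num}: No assistant message')
                stats['no_assistant'] += 1

            # Count the line as valid only if none of the checks above failed
            if len(errors) == errors_before:
                stats['valid'] += 1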

    return {'stats': dict(stats), 'errors': errors[:50]}

result = validate_training_data('training.jsonl')
print(f"Total: {result['stats'].get('total', 0)}")
print(f"Valid: {result['stats'].get('valid', 0)}")
if result['errors']:
    print(f"\nFirst {len(result['errors'])} errors:")
    for err in result['errors']:
        print(f'  {err}')
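
The validator above does not look at example length. As a rough sketch, length can be estimated from character counts. The 4-characters-per-token ratio is only a rule of thumb for English text and MAX_TOKENS is a hypothetical budget; for real limits, use the tokenizer and context size documented for your model.

```python
import json
import math
from typing import List, Tuple

CHARS_PER_TOKEN = 4  # rough heuristic for English text; an assumption, not a tokenizer
MAX_TOKENS = 4096    # hypothetical budget; replace with your model's documented limit


def estimate_tokens(example: dict) -> int:
    """Estimate tokens from the total length of all string content fields."""
    total_chars = 0
    for msg in example.get("messages", []):
        content = msg.get("content") if isinstance(msg, dict) else None
        if isinstance(content, str):
            total_chars += len(content)
    return math.ceil(total_chars / CHARS_PER_TOKEN)


def find_long_examples(path: str, max_tokens: int = MAX_TOKENS) -> List[Tuple[int, int]]:
    """Return (line_number, estimated_tokens) for examples over the budget."""
    too_long = []
    with open(path, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                continue  # invalid JSON is reported by the validator, not here
            if not isinstance(example, dict):
                continue
            estimate = estimate_tokens(example)
            if estimate > max_tokens:
                too_long.append((line_num, estimate))
    return too_long
```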

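Duplicate training examples waste compute and can bias the model toward over-represented patterns. This script removes exact duplicates by hashing the canonicalized JSON content of each line.

Deduplication script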

import json
import hashlib

def deduplicate_jsonl(input_path: str, output_path: str) -> dict:
    """Remove duplicate training examples from a JSONL file."""
    seen_hashes = set()
    total = 0
    unique = 0

    with open(input_path, 'r', encoding='utf-8') as fin, open(output_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            total += 1

            # Hash the normalized JSON to catch formatting differences
            data = json.loads(line)
            canonical = json.dumps(data, sort_keys=True)
            content_hash = hashlib.sha256(canonical.encode()).hexdigest()

            if content_hash not in seen_hashes:
                seen_hashes.add(content_hash)
                fout.write(json.dumps(data) + '\n')
                unique += 1

    duplicates = total - unique
    print(f'Total: {total}, Unique: {unique}, Removed: {duplicates}')
    return {'total': total, 'unique': unique, 'duplicates': duplicates}

deduplicate_jsonl('training.jsonl', 'training_deduped.jsonl')

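Preparing Data at Scale

Production training datasets often contain hundreds of thousands or millions of examples. At that scale you need an automated pipeline for shuffling, splitting, and sharding JSONL data. Careful preparation helps avoid training problems such as catastrophic forgetting and keeps experiments reproducible.

This script shuffles the data randomly and splits it into train and test sets. Shuffling is essential because JSONL files are often generated in order, and training on ordered data can hurt generalization. Note that the script loads every example into memory in order to shuffle, so it suits datasets that fit in RAM.

Train/test split and shuffle pipeline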

import json
import random
from pathlib import Path

def prepare_training_data(
    input_path: str,
    output_dir: str,
    test_ratio: float = 0.1,
    seed: int = 42
) -> dict:
    """Shuffle and split JSONL into train/test sets."""
    random.seed(seed)
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    # Load all examples
    examples = []
    with open(input_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))

    # Shuffle
    random.shuffle(examples)

    # Split
    split_idx = int(len(examples) * (1 - test_ratio))
    train_data = examples[:split_idx]
    test_data = examples[split_idx:]

    # Write output files
    train_path = output / 'train.jsonl'
    test_path = output / 'test.jsonl'

    for data, path in [(train_data, train_path), (test_data, test_path)]:
        with open(path, 'w') as f:
            for example in data:
                f.write(json.dumps(example) + '\n')

    print(f'Train: {len(train_data)} examples -> {train_path}')
    print(f'Test:  {len(test_data)} examples -> {test_path}')
    return {'train': len(train_data), 'test': len(test_data)}

prepare_training_data(
    'all_examples.jsonl',
    './prepared_data/',
    test_ratio=0.1,
    seed=42
)
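
The script above holds every example in memory, which stops working once the file is larger than RAM. One alternative is to decide each line's split from a hash of its content, so the file can be processed in a single streaming pass. This is a sketch of that idea, not a replacement for shuffling: assign_split and stream_split are hypothetical helpers, the output keeps the input order, and the test fraction is only approximately test_ratio.

```python
import hashlib


def assign_split(line: str, test_ratio: float = 0.1) -> str:
    """Deterministically map a line to 'train' or 'test' from its content hash."""
    digest = hashlib.sha256(line.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a fraction in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_ratio else "train"


def stream_split(
    input_path: str,
    train_path: str,
    test_path: str,
    test_ratio: float = 0.1,
) -> dict:
    """Split a JSONL file line by line without loading it into memory."""
    counts = {"train": 0, "test": 0}
    with open(input_path, "r", encoding="utf-8") as fin, \
            open(train_path, "w", encoding="utf-8") as ftrain, \
            open(test_path, "w", encoding="utf-8") as ftest:
        outputs = {"train": ftrain, "test": ftest}
        for line in fin:
            line = line.strip()
            if not line:
                continue
            split = assign_split(line, test_ratio)
            outputs[split].write(line + "\n")
            counts[split] += 1
    return counts
```

Because the assignment depends only on the line's content, the same example lands in the same split on every run, and exact duplicates can never end up on both sides of the split.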

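When a dataset is too large for a single file, or training has to be distributed across multiple GPUs, shard the JSONL file into smaller chunks. Each shard can be processed independently.

Sharding files for distributed training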

import json
from pathlib import Path

def shard_jsonl(
    input_path: str,
    output_dir: str,
    shard_size: int = 50000
) -> int:
    """Split a large JSONL file into smaller shards."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    shard_num = 0
    line_count = 0
    current_file = None

    with open(input_path, 'r', encoding='utf-8') as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue

            if line_count % shard_size == 0:
                if current_file:
                    current_file.close()
                shard_path = output / f'shard_{shard_num:04d}.jsonl'
                current_file = open(shard_path, 'w', encoding='utf-8')
                shard_num += 1

            current_file.write(line + '\n')
            line_count += 1

    if current_file:
        current_file.close()

    print(f'Created {shard_num} shards from {line_count} examples')
    return shard_num

shard_jsonl('large_dataset.jsonl', './shards/', shard_size=50000)
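
Once the shards exist, each training process needs to know which files are its own. A minimal sketch, assuming the shard_NNNN.jsonl naming used above and a rank/world_size pair supplied by your training launcher (shards_for_worker is a hypothetical helper), is a round-robin assignment:

```python
from pathlib import Path
from typing import List


def shards_for_worker(shard_dir: str, rank: int, world_size: int) -> List[Path]:
    """Return the shard files assigned to one worker, round-robin by sorted name."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size}), got {rank}")
    # Sorting gives every worker the same view of the shard order
    shards = sorted(Path(shard_dir).glob("shard_*.jsonl"))
    # Worker 0 takes shards 0, N, 2N, ...; worker 1 takes 1, N+1, ...
    return shards[rank::world_size]
```

With 5 shards and 2 workers, worker 0 receives three shards and worker 1 two, so a shard count that is a multiple of the worker count keeps the load even.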

Validating Training Data Online

Before uploading to OpenAI, Anthropic, or HuggingFace, you can validate, format, and convert JSONL training data with free browser-based tools.
