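What is the correct file extension for JSONL?

Use .jsonl as the standard extension. The .ndjson extension (Newline Delimited JSON) denotes the same format but is less common. Most tools and services, including OpenAI and BigQuery, accept .jsonl files. Pick one convention and use it consistently across a project.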

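Should I use pretty-printed JSON in a JSONL file?

No. Every line in a JSONL file must be a complete JSON value on a single line. Pretty-printed JSON spans multiple lines, which breaks the one-record-per-line rule and causes parse errors. Always serialize compactly, with no indentation; spaces after commas and colons are still valid JSON, but they only add bytes.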

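How do I handle newlines inside JSONL string values?

Use the escaped form \n inside JSON strings, never a raw newline character. A raw newline breaks the line-based structure of JSONL. Standard JSON serializers such as Python's json.dumps() and JavaScript's JSON.stringify() apply this escaping automatically.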

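Does JSONL require a schema?

No, JSONL has no built-in schema requirement. Each line is an independent JSON value and may carry different fields. Even so, keeping a consistent schema across records is strongly recommended. It simplifies downstream processing, prevents runtime type errors, and keeps your data compatible with tools that expect uniform records.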

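What encoding should a JSONL file use?

Always use UTF-8. It is the encoding that nearly every JSONL parser, cloud service, and data tool expects. Avoid UTF-16, Latin-1, and other encodings. If your source data uses a different encoding, convert it to UTF-8 before writing JSONL.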

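How can I quickly validate a JSONL file?

An online tool such as the JSONL validator at jsonl.co can check a file in seconds. It detects invalid JSON lines, encoding problems, and formatting problems. For programmatic validation, wrap the JSON parse of each line in a try/catch block (try/except in Python) and log the line number and error for every failure.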

JSONL Best Practices

Name: JSONL Viewer & Editor Online
Author: jsonl.co

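A complete guide to writing clean, reliable, high-performance JSONL files. Learn formatting rules, schema design, error-handling strategies, and optimization techniques for production workloads.

Last updated: February 2026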

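Why JSONL best practices matter

JSONL (JSON Lines) looks deceptively simple: one JSON value per line, separated by newline characters. But simple does not mean foolproof. Inconsistent schemas, encoding problems, trailing commas, and embedded newlines are the most common causes of parse failures in production data pipelines. Following a clear set of best practices prevents these problems before they occur.

This guide covers the essential rules for producing and consuming JSONL data reliably. Whether you are building machine learning datasets, streaming application logs, or exchanging data between services, these practices help you avoid subtle bugs and get better performance out of your JSONL workflows.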

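Formatting rules

Valid JSONL rests on strict adherence to a few formatting rules. Breaking any one of them produces a file that many parsers will reject.

Every line in a JSONL file must be a complete, self-contained JSON value. Never split a single JSON object across multiple lines; pretty-printed JSON is not valid JSONL. Always serialize compactly, with no indentation and, ideally, no extra whitespace between keys and values.

One JSON object per line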

# Valid JSONL - one complete JSON per line
{"id":1,"name":"Alice","tags":["admin","user"]}
{"id":2,"name":"Bob","tags":["user"]}

# INVALID - pretty-printed JSON spans multiple lines
{
  "id": 1,
  "name": "Alice"
}
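
The rule above can be sketched in Python as follows. This is a minimal illustration, not a canonical implementation: the helper names to_jsonl_line and pretty_json_array_to_jsonl are made up for this example, and the input is assumed to be a pretty-printed JSON array of records.

```python
import json

def to_jsonl_line(record) -> str:
    """Serialize one record as a single compact line (no trailing newline)."""
    # separators=(',', ':') drops the spaces json.dumps inserts by default.
    # Newlines inside string values are escaped as \n by the serializer.
    return json.dumps(record, ensure_ascii=False, separators=(',', ':'))

def pretty_json_array_to_jsonl(pretty_text: str) -> str:
    """Convert a pretty-printed JSON array into JSONL text."""
    records = json.loads(pretty_text)
    return ''.join(to_jsonl_line(record) + '\n' for record in records)

# Hypothetical sample input: a pretty-printed array of two objects
pretty = """[
  {
    "id": 1,
    "name": "Alice"
  },
  {
    "id": 2,
    "name": "Bob"
  }
]"""

print(pretty_json_array_to_jsonl(pretty), end='')
# {"id":1,"name":"Alice"}
# {"id":2,"name":"Bob"}
```

Omitting the separators argument still yields valid JSONL; the lines are simply a few bytes longer.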

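JSONL files must be encoded as UTF-8. This is the encoding assumed by nearly every JSONL parser, streaming tool, and cloud service. Avoid UTF-16, Latin-1, and other encodings. If your source data uses a different encoding, convert it to UTF-8 before writing JSONL.

Always use UTF-8 encoding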

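# Python: always specify UTF-8 when reading/writing
import json

record = {"id": 1, "name": "Alice"}  # sample record
with open('data.jsonl', 'w', encoding='utf-8') as f:
    f.write(json.dumps(record, ensure_ascii=False) + '\n')

// Node.js: fs writes strings as UTF-8 by default; passing it explicitly documents intent
const fs = require('fs');
const record = { id: 1, name: 'Alice' };  // sample record
fs.appendFileSync('data.jsonl', JSON.stringify(record) + '\n', 'utf-8');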

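Use a single line feed character (LF, \n) as the line terminator. This is the standard on Linux, macOS, and most cloud environments. Avoid Windows-style carriage return plus line feed (CRLF, \r\n): many parsers tolerate it because JSON treats the stray \r as whitespace, but tools that split on \n and then compare or hash raw lines can misbehave. Most modern editors and tools handle this automatically, but check your settings if you work across platforms.

Choosing line endings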

# Correct: LF line endings (\n)
{"id":1}\n{"id":2}\n

# Avoid: CRLF line endings (\r\n)
{"id":1}\r\n{"id":2}\r\n

# Tip: configure Git to normalize line endings
# .gitattributes
*.jsonl text eol=lf
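
As a rough sketch of how this can be enforced from Python: in text mode, open() translates '\n' to the platform line separator on write unless newline is set, so passing newline='\n' keeps LF endings even on Windows. The function names write_jsonl_lf and normalize_line_endings are hypothetical helpers for this example.

```python
import json

def write_jsonl_lf(path, records):
    """Write records as JSONL with LF line endings on every platform."""
    # newline='\n' stops Python from turning '\n' into '\r\n' on Windows
    with open(path, 'w', encoding='utf-8', newline='\n') as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

def normalize_line_endings(data: bytes) -> bytes:
    """Convert CRLF terminators to LF in the raw bytes of a JSONL file.

    Valid JSON strings cannot contain raw CR or LF characters, so every
    CRLF pair in a valid JSONL file is a line terminator.
    """
    return data.replace(b'\r\n', b'\n')
```

normalize_line_endings works on the whole byte string at once, so for very large files it would need to be applied chunk by chunk or line by line.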

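Schema consistency

Although JSONL does not enforce a schema, consistency across records makes your data much easier to work with. Inconsistent schemas lead to runtime errors, unexpected null values, and failed imports.

Keep field names, field order, and value types the same across all records. JSON does not require any field ordering, but a consistent order improves readability and compressibility. Never mix types for the same field (for example, a "price" field should not be a string in some records and a number in others).

Consistent field order and types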

# Good: consistent field order and types
{"id":1,"name":"Alice","age":30,"active":true}
{"id":2,"name":"Bob","age":25,"active":false}
{"id":3,"name":"Charlie","age":35,"active":true}

# Bad: inconsistent order, mixed types, missing fields
{"name":"Alice","id":1,"active":true}
{"id":"2","age":25,"name":"Bob"}
{"id":3,"active":"yes","name":"Charlie"}
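
One way to catch this kind of drift is to compare every record against the first one. The sketch below is only an illustration under a few assumptions: every line holds a JSON object, the first record is taken as the reference shape, null is treated as compatible with any type, and field order is not checked. The name find_schema_drift is made up for this example.

```python
import json

def find_schema_drift(lines):
    """Report records whose keys or value types differ from the first record."""
    expected = None
    problems = []
    for line_num, line in enumerate(lines, 1):
        record = json.loads(line)
        # Map each field name to the Python type name of its value
        shape = {key: type(value).__name__ for key, value in record.items()}
        if expected is None:
            expected = shape  # the first record defines the reference shape
            continue
        for key in sorted(expected.keys() - shape.keys()):
            problems.append((line_num, f'missing field: {key}'))
        for key in sorted(shape.keys() - expected.keys()):
            problems.append((line_num, f'unexpected field: {key}'))
        for key in sorted(expected.keys() & shape.keys()):
            want, got = expected[key], shape[key]
            # null (NoneType) is accepted in place of any type
            if 'NoneType' not in (want, got) and want != got:
                problems.append((line_num, f'{key}: expected {want}, got {got}'))
    return problems

# The "bad" records from the example above
bad = [
    '{"name":"Alice","id":1,"active":true}',
    '{"id":"2","age":25,"name":"Bob"}',
    '{"id":3,"active":"yes","name":"Charlie"}',
]
for line_num, message in find_schema_drift(bad):
    print(f'line {line_num}: {message}')
```

For anything beyond a quick check, a declared schema (for example a JSON Schema document) is a sturdier reference than "whatever the first record looks like".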

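When a field has no value, use JSON null rather than omitting the key. This simplifies downstream processing because every record carries the same set of keys, and consumers do not have to distinguish a missing field from a null one.

Handle missing values explicitly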

# Good: include all fields, use null for missing values
{"id":1,"name":"Alice","email":"alice@example.com","phone":null}
{"id":2,"name":"Bob","email":null,"phone":"+1-555-0100"}

# Avoid: omitting keys for missing data
{"id":1,"name":"Alice","email":"alice@example.com"}
{"id":2,"name":"Bob","phone":"+1-555-0100"}
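
A small sketch of how a writer might enforce this, assuming a fixed field list is known in advance. FIELDS and with_explicit_nulls are hypothetical names chosen for the example, not part of any library.

```python
import json

# Hypothetical schema for illustration: the full, ordered list of fields
FIELDS = ['id', 'name', 'email', 'phone']

def with_explicit_nulls(record: dict, fields=FIELDS) -> dict:
    """Return a copy of record with every expected field present, in order."""
    # dict.get returns None for absent keys; json.dumps writes None as null
    return {field: record.get(field) for field in fields}

row = with_explicit_nulls({"id": 2, "name": "Bob", "phone": "+1-555-0100"})
print(json.dumps(row))
# {"id": 2, "name": "Bob", "email": null, "phone": "+1-555-0100"}
```

Note that this version silently drops keys that are not in the field list; a stricter variant could raise an error on unknown keys instead.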

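Error handling

Real-world JSONL files often contain a small number of invalid lines caused by encoding glitches, truncated writes, or upstream bugs. A robust consumer handles them gracefully instead of crashing on the first bad line.

Wrap the parse of each line in a try/catch block (try/except in Python) and log the line number and error message for every failure. This lets you skip invalid lines while keeping a record of what went wrong. For critical pipelines, collect the bad lines in a separate file for later inspection.

Fault-tolerant parsing with line-number tracking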

import json

def parse_jsonl_safe(path: str):
    """Parse JSONL with error tolerance."""
    valid, errors = [], []
    with open(path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                valid.append(json.loads(line))
            except json.JSONDecodeError as e:
                errors.append({'line': line_num, 'error': str(e), 'raw': line})
    print(f'Parsed {len(valid)} records, {len(errors)} errors')
    return valid, errors
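
For the "separate file" approach mentioned above, a streaming variant might look like the sketch below. The function name iter_valid_records and the tab-separated layout of the rejects file (line number, error message, raw line) are assumptions made for this example.

```python
import json

def iter_valid_records(in_path, rejects_path):
    """Yield parsed records one at a time; write unparseable lines to rejects_path."""
    with open(in_path, 'r', encoding='utf-8') as src, \
         open(rejects_path, 'w', encoding='utf-8', newline='\n') as rejects:
        for line_num, line in enumerate(src, 1):
            stripped = line.strip()
            if not stripped:
                continue  # ignore blank lines
            try:
                record = json.loads(stripped)
            except json.JSONDecodeError as e:
                # Keep the line number, the parser message, and the raw text
                rejects.write(f'{line_num}\t{e.msg}\t{stripped}\n')
                continue
            yield record
```

Because this is a generator, memory use stays flat however many lines are rejected, and the rejects file is closed once the generator is exhausted.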

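In data pipelines, add a validation step ahead of the main processing logic. Check that each record has the expected fields and types, and reject or quarantine records that do not conform. This prevents type errors deep inside the pipeline, where they are much harder to debug.

Validate before processing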

def validate_record(record: dict) -> list[str]:
    """Validate a JSONL record against expected schema."""
    issues = []
    required = ['id', 'name', 'timestamp']
    for field in required:
        if field not in record:
            issues.append(f'Missing required field: {field}')
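    # bool is a subclass of int in Python, so reject True/False explicitly
    if 'id' in record and (isinstance(record['id'], bool) or not isinstance(record['id'], int)):
        issues.append(f'Field "id" should be int, got {type(record["id"]).__name__}')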
    return issues

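# Usage in pipeline (process() stands in for your own handler)
import logging

valid_records, _parse_errors = parse_jsonl_safe('data.jsonl')
for record in valid_records:
    issues = validate_record(record)
    if issues:
        logging.warning('Record %s: %s', record.get('id'), issues)
    else:
        process(record)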

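Performance optimization

In data engineering and machine learning workflows, JSONL files can grow to many gigabytes. The right processing strategy keeps memory usage under control and throughput high.

Never load an entire JSONL file into memory at once. Read and process one line (or one batch of lines) at a time. This keeps memory usage roughly constant regardless of file size. File iteration in Python is line-by-line by nature, and Node.js offers the readline and stream APIs for the same purpose.

Stream processing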

# Python: stream with constant memory
import json

count = 0
with open('large.jsonl', 'r', encoding='utf-8') as f:
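    for line in f:  # One line at a time, not f.readlines()!
        if not line.strip():
            continue  # Skip blank lines, which json.loads would reject
        record = json.loads(line)
        process(record)  # process() stands in for your own handler
        count += 1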
print(f'Processed {count} records')
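
The same idea can be wrapped in a generator so that filters and aggregations compose without ever holding the whole file in memory. This is a sketch only; iter_jsonl and count_where are hypothetical helper names.

```python
import json
from typing import Callable, Iterator

def iter_jsonl(path: str) -> Iterator[dict]:
    """Lazily yield one parsed record per non-blank line."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def count_where(path: str, predicate: Callable[[dict], bool]) -> int:
    """Count matching records while keeping only one record in memory at a time."""
    return sum(1 for record in iter_jsonl(path) if predicate(record))
```

For example, count_where('large.jsonl', lambda r: r.get('active') is True) would scan the file once and return the number of active records.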

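When writing to a database or calling an API, group records into batches instead of handling them one at a time. Batching cuts per-request I/O overhead and can raise throughput considerably, often by 10-100x depending on the backend. For most use cases, batches of 1,000 to 10,000 records work well.

Batch operations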

import json

def process_in_batches(path: str, batch_size: int = 5000):
    """Process JSONL records in batches for better throughput."""
    batch = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                bulk_insert(batch)  # Send batch to database
                batch = []  # Start a new list in case bulk_insert keeps a reference to the old one
    if batch:
        bulk_insert(batch)  # Flush remaining records
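
The batching logic can also be separated from the file handling. The sketch below uses itertools.islice to cut any iterable into fixed-size lists; the name batched is chosen for this example (Python 3.12 ships a similar itertools.batched, which yields tuples instead of lists).

```python
from itertools import islice
from typing import Iterable, Iterator, List

def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    """Yield consecutive lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch

print([len(batch) for batch in batched(range(12), 5)])
# [5, 5, 2]
```

Since each batch is a fresh list, the consumer can keep or mutate it without affecting later batches.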

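JSONL compresses very well because adjacent lines usually share the same keys and similar values. Using gzip for storage and transfer typically shrinks files by 5-10x, depending on how repetitive the data is. Most programming languages can read gzip-compressed JSONL directly, without first decompressing it to disk.

Use compression for storage and transfer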

import gzip
import json

# Write compressed JSONL
with gzip.open('data.jsonl.gz', 'wt', encoding='utf-8') as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

# Read compressed JSONL
with gzip.open('data.jsonl.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        process(record)
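
When a pipeline has to accept both plain and compressed files, one option is to choose the opener by file extension. This is a minimal sketch under the assumption that compressed files always end in .gz; open_jsonl and read_jsonl are hypothetical helper names.

```python
import gzip
import json

def open_jsonl(path, mode='rt'):
    """Open a JSONL file in text mode, using gzip when the name ends in .gz."""
    opener = gzip.open if str(path).endswith('.gz') else open
    return opener(path, mode, encoding='utf-8')

def read_jsonl(path):
    """Yield records from a plain or gzip-compressed JSONL file."""
    with open_jsonl(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Callers can then write list(read_jsonl('events.jsonl.gz')) or list(read_jsonl('events.jsonl')) without caring which form is on disk.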

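Common mistakes to avoid

These are the problems we see most often when users validate JSONL files with our tools. Each one causes parse failures and can be hard to diagnose without the right approach.

JSON does not allow a trailing comma after the last element of an object or array. This is one of the most common mistakes, especially among developers coming from JavaScript, whose object and array literals do allow trailing commas. Always remove trailing commas from your output; generating each line with a JSON serializer rather than by string concatenation avoids them entirely.

Trailing commas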

# INVALID: trailing comma after last property
{"id": 1, "name": "Alice",}

# VALID: no trailing comma
{"id": 1, "name": "Alice"}

# INVALID: trailing comma in array
{"tags": ["admin", "user",]}

# VALID: no trailing comma in array
{"tags": ["admin", "user"]}
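
A quick way to see how a strict parser reacts is to feed it these lines. The sketch below uses Python's json module; check_line is a hypothetical helper, and the exact error wording differs between Python versions, so the code reports it rather than relying on it.

```python
import json

def check_line(line: str):
    """Return None if the line is valid JSON, otherwise a short error description."""
    try:
        json.loads(line)
        return None
    except json.JSONDecodeError as e:
        # e.msg and e.colno describe what went wrong and where
        return f'{e.msg} (column {e.colno})'

samples = [
    '{"id": 1, "name": "Alice",}',
    '{"id": 1, "name": "Alice"}',
    '{"tags": ["admin", "user",]}',
    '{"tags": ["admin", "user"]}',
]
for line in samples:
    problem = check_line(line)
    print('VALID  ' if problem is None else 'INVALID', line, problem or '')
```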

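If a string value contains a raw newline character, it breaks the one-record-per-line rule and corrupts your JSONL file. Always use the escaped form \n inside JSON strings, never a raw newline. Most JSON serializers handle this automatically, but take care when building JSON strings by hand.

Embedded newlines in string values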

# INVALID: raw newline inside a string value breaks JSONL
{"id": 1, "bio": "Line one
Line two"}

# VALID: escaped newline keeps everything on one line
{"id": 1, "bio": "Line one\nLine two"}

# Tip: json.dumps() in Python handles this automatically
import json
record = {"bio": "Line one\nLine two"}
print(json.dumps(record))
# Output: {"bio": "Line one\nLine two"}

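Mixing UTF-8 with Latin-1 (or any other encoding) in the same file produces garbled characters and parse errors. This usually happens when data from different sources is appended to one file. Always normalize to UTF-8 before writing. If you receive data in an unknown encoding, detect it with a package such as chardet before converting; detection is a statistical guess, so spot-check the result.

Mixed or incorrect encodings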

# Python: detect and convert encoding
import chardet

def normalize_to_utf8(input_path: str, output_path: str):
    """Detect encoding and convert to UTF-8."""
    with open(input_path, 'rb') as f:
        raw = f.read()
    detected = chardet.detect(raw)
    encoding = detected['encoding'] or 'utf-8'
    print(f'Detected encoding: {encoding}')
    text = raw.decode(encoding)
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(text)
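
Before converting anything, it can help to find out whether a file is already clean UTF-8 and, if not, where it first goes wrong. The sketch below uses only the standard library and reads in binary mode one line at a time; first_invalid_utf8_line is a hypothetical name for this example.

```python
def first_invalid_utf8_line(path):
    """Return the 1-based number of the first line that is not valid UTF-8, or None."""
    with open(path, 'rb') as f:
        # Iterating a binary file splits on b'\n', so memory use stays small
        for line_num, raw in enumerate(f, 1):
            try:
                raw.decode('utf-8')
            except UnicodeDecodeError:
                return line_num
    return None
```

A None result only means the bytes decode as UTF-8; text that was mis-decoded earlier and then re-encoded as UTF-8 will still pass this check.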

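File naming and organization

Good file naming and directory structure make JSONL data easier to discover, manage, and process in automated pipelines.

Use .jsonl as the default extension. It is the most widely recognized extension for JSON Lines files and the one used with the OpenAI fine-tuning API, BigQuery, and most data platforms. The .ndjson extension (Newline Delimited JSON) is technically a different name for the same format. Pick one convention for a project and stick with it.

.jsonl vs. .ndjson extensions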

# Recommended file naming conventions
data.jsonl              # Standard JSONL file
users_2026-02-14.jsonl  # Date-stamped export
train.jsonl             # ML training data
validation.jsonl        # ML validation split
events.jsonl.gz         # Compressed JSONL

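Organize JSONL files by purpose and date. Keep raw input data separate from processed output. For time-series or log data, partition by date so that specific date ranges are easy to process and old data is easy to clean up.

Directory structure for a JSONL project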

project/
  data/
    raw/                  # Original unprocessed files
      events_2026-02-13.jsonl
      events_2026-02-14.jsonl
    processed/            # Cleaned and transformed
      events_clean.jsonl
    schemas/              # Schema documentation
      event_schema.json
  scripts/
    validate.py           # Validation script
    transform.py          # Transformation pipeline
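
With date-stamped names like the ones above, selecting a date range becomes a matter of parsing file names. The following sketch assumes files are named events_YYYY-MM-DD.jsonl inside a raw/ directory under a data root; events_path and files_in_range are hypothetical helpers.

```python
from datetime import date
from pathlib import Path

PREFIX = 'events_'  # assumed naming convention: events_YYYY-MM-DD.jsonl

def events_path(root: Path, day: date) -> Path:
    """Build the path of the raw events file for a given day."""
    return root / 'raw' / f'{PREFIX}{day.isoformat()}.jsonl'

def files_in_range(root: Path, start: date, end: date):
    """Return raw event files whose date stamp falls within [start, end]."""
    selected = []
    for path in sorted((root / 'raw').glob(f'{PREFIX}*.jsonl')):
        stamp = path.stem[len(PREFIX):]
        try:
            day = date.fromisoformat(stamp)
        except ValueError:
            continue  # skip names without a valid date stamp
        if start <= day <= end:
            selected.append(path)
    return selected
```

ISO dates sort correctly as plain strings, which is why sorted() on the paths also yields chronological order.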

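Validate your JSONL files online

Put these best practices into action. Use our free online tools to validate, format, and inspect your JSONL files directly in the browser.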
