JSONL vs Parquet: Choosing the Right Data Format

A comprehensive comparison of JSONL (JSON Lines) and Apache Parquet. Understand the trade-offs in compression, query performance, schema enforcement, and ecosystem support to pick the right format for your data workloads.

Last updated: February 2026

What Is JSONL?

JSONL (JSON Lines) is a text-based data format where each line contains a single, self-contained JSON object separated by newline characters. It is a natural extension of the ubiquitous JSON format, designed specifically for streaming and log-style data. Because every line is an independent JSON document, files can be appended to without rewriting existing data, and they can be processed line by line with minimal memory overhead.

JSONL has become the standard interchange format for machine learning training data (OpenAI fine-tuning, Hugging Face datasets), application logs, event streams, and any scenario where data arrives incrementally. Its human-readable nature makes it easy to inspect with any text editor or with command-line tools like grep, head, and jq.

employees.jsonl
{"id": 1, "name": "Alice", "role": "engineer", "salary": 95000}
{"id": 2, "name": "Bob", "role": "designer", "salary": 88000}
{"id": 3, "name": "Charlie", "role": "manager", "salary": 105000}
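Because every line is a self-contained document, a file like the one above can be appended to and streamed with nothing beyond Python's standard library. A minimal sketch (the file name and new record are illustrative):

```python
import json

# Append a new record without rewriting existing lines
record = {"id": 4, "name": "Dana", "role": "analyst", "salary": 99000}
with open("employees.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")

# Stream the file one record at a time -- memory use stays flat
# no matter how large the file grows
with open("employees.jsonl") as f:
    for line in f:
        line = line.strip()
        if line:
            employee = json.loads(line)
            print(employee["name"], employee["salary"])
```

Opening in append mode is what makes lock-free, incremental ingestion so cheap: no existing bytes are ever touched.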

What Is Parquet?

Apache Parquet is a columnar binary storage format designed for efficient analytical queries on large datasets. Instead of storing data row by row like JSONL, Parquet organizes values by column, which means reading a single field across millions of rows requires scanning only the relevant column rather than every complete record. This columnar layout enables aggressive compression because values within a column tend to be similar in type and distribution.

Parquet was created within the Apache Hadoop ecosystem and is now the de facto storage format for data lakes built on AWS S3, Google Cloud Storage, and Azure Blob Storage. It is deeply integrated with Apache Spark, Apache Hive, Presto, DuckDB, Snowflake, BigQuery, and virtually every modern analytics engine. Parquet files embed a strict schema in their metadata, so consumers always know the exact types and structure of the data without external documentation.

Reading and writing Parquet with Python
import pyarrow.parquet as pq
import pandas as pd

# Write a DataFrame to Parquet
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'role': ['engineer', 'designer', 'manager'],
    'salary': [95000, 88000, 105000]
})
df.to_parquet('employees.parquet')

# Read specific columns (columnar advantage)
df = pq.read_table('employees.parquet', columns=['name', 'salary']).to_pandas()

JSONL vs Parquet: Side-by-Side Comparison

The following table summarizes the key differences between JSONL and Parquet across the dimensions that matter most when choosing a data format for your project.

Feature | JSONL | Parquet
Data Layout | Row-oriented, text-based. Each line is a complete JSON object. | Column-oriented, binary. Values stored by column with row groups.
Encoding | UTF-8 plain text. Human-readable, editable in any text editor. | Binary with dictionary, RLE, and bit-packing encodings. Not human-readable.
Compression | Optional external compression (gzip, zstd). Field names repeated every row. | Built-in columnar compression (Snappy, Zstd, Gzip). 2-10x smaller files.
Query Performance | Must scan full file for any query. No column pruning or predicate pushdown. | Column pruning and predicate pushdown skip irrelevant data. Orders of magnitude faster for analytical queries.
Schema | Schema-free. Each line can have different fields and types. Flexible but error-prone. | Strict typed schema embedded in file metadata. Enforced on read and write.
Streaming / Append | Excellent. Append a new line to the end of the file. Ideal for real-time ingestion. | Poor. Requires rewriting or creating new file partitions to add data.
Human Readable | Yes. Inspect with cat, head, grep, jq, or any text editor. | No. Requires specialized tools (parquet-tools, PyArrow, DuckDB) to inspect.
Ecosystem | Universal. Supported by every programming language with a JSON parser. | Analytics-focused. Deep integration with Spark, Hive, Presto, DuckDB, Snowflake, BigQuery.

Performance Benchmarks

The performance gap between JSONL and Parquet becomes dramatic at scale. Below are representative benchmarks for a 10-million-row dataset with 20 columns (a mix of strings, integers, floats, and timestamps).

File Size (Compression)

JSONL: ~4.2 GB raw, ~1.1 GB with gzip
Parquet: ~0.5 GB with Snappy, ~0.4 GB with Zstd

Full Table Scan

JSONL: ~45 seconds (parse every JSON line)
Parquet: ~8 seconds (binary decode, vectorized reads)

Write Speed

JSONL: ~30 seconds (serialize + write text)
Parquet: ~22 seconds (encode + compress columns)

Single Column Query

JSONL: ~45 seconds (must still read entire file)
Parquet: ~1.5 seconds (reads only the target column)

Benchmarks measured with Python (pandas + PyArrow) on an M2 MacBook Pro with 16 GB RAM. Real-world results vary by hardware, data distribution, and compression codec.

When to Use JSONL vs Parquet

Best for JSONL
  • Real-time log ingestion and event streaming
  • ML training data for OpenAI, Anthropic, and Hugging Face
  • Data interchange between microservices and APIs
  • Semi-structured data with varying schemas per record
  • Quick prototyping and debugging where human readability matters
  • Append-only data where records arrive continuously
  • Small to medium datasets (under 1 GB) that don't need analytical queries
Best for Parquet
  • Data lake storage on S3, GCS, or Azure Blob
  • Analytical queries with Spark, Presto, DuckDB, or Snowflake
  • Columnar aggregations (SUM, AVG, COUNT) across billions of rows
  • Strict schema enforcement and data governance requirements
  • Long-term archival where storage cost matters (2-10x smaller files)
  • Feature stores and ML feature pipelines reading specific columns
  • Datasets larger than 1 GB where query performance is critical

Hybrid Architecture: JSONL Ingest, Parquet Storage

In production data platforms, JSONL and Parquet are not mutually exclusive. A common and effective pattern is to use JSONL for data ingestion and Parquet for long-term storage and analytics. This hybrid approach combines the strengths of both formats: JSONL's simplicity for real-time data capture and Parquet's efficiency for downstream queries.

The pipeline works in three stages. First, raw events or records are appended to JSONL files as they arrive, because JSONL supports fast, lock-free appends. Second, a periodic batch job (hourly, daily, or triggered by file size) reads the accumulated JSONL files, validates and transforms the data, and converts them into Parquet format. Third, the resulting Parquet files are stored in a data lake partitioned by date, region, or other dimensions for efficient querying.

1. Ingest as JSONL

Collect raw events, logs, and API responses as JSONL files. Fast appends, no schema required, and easy to debug in real time.

2. Transform & Validate

Periodically read JSONL batches, apply schema validation, clean and normalize data, and handle malformed records.

3. Store as Parquet

Write validated data as partitioned Parquet files to your data lake. Query with Spark, DuckDB, or any analytics engine.

JSONL to Parquet pipeline
import json
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
from datetime import datetime

def jsonl_to_parquet(jsonl_dir: str, parquet_dir: str):
    """Convert accumulated JSONL files to partitioned Parquet."""
    records = []
    for jsonl_file in Path(jsonl_dir).glob('*.jsonl'):
        with open(jsonl_file, 'r') as f:
            for line in f:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    if not records:
        return

    df = pd.DataFrame(records)
    # Add partition column
    df['date'] = datetime.now().strftime('%Y-%m-%d')

    table = pa.Table.from_pandas(df)
    pq.write_to_dataset(
        table,
        root_path=parquet_dir,
        partition_cols=['date'],
        compression='zstd'
    )
    print(f'Converted {len(records)} records to Parquet')

jsonl_to_parquet('raw_events/', 'data_lake/')
