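Why is JSONL well suited to streaming data?

JSONL stores one complete JSON object per line, so every line is a self-contained unit that can be parsed on its own. A streaming consumer can read, parse, and process one line at a time without loading the full dataset into memory. This line-delimited structure fits naturally with how operating systems read files (line-buffered I/O), how network protocols frame messages, and how logging systems append entries. Unlike a JSON array, you do not have to wait for the closing bracket before you start processing.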

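How do I stream JSONL in Node.js?

Use the built-in readline module with fs.createReadStream to read a JSONL file line by line. Iterate over the readline interface with a for-await-of loop and handle each parsed record as it arrives. For more advanced pipelines, chain several processing steps (filtering, enrichment, re-serialization) with Node.js Transform streams, which handle backpressure automatically. The pipeline() function from stream/promises wires the stages together.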

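How do I stream JSONL in Python?

Write a generator function that opens the file and yields json.loads(line) for every non-empty line. Generators are lazy, so each call to next() reads and parses only as far as the next record. Combine the generator with itertools.islice for slicing, filter() for selecting records, and a custom batching generator for grouped processing. This approach handles files larger than available memory with very little code.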

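What is SSE, and how does JSONL work with it?

Server-Sent Events (SSE) is a web standard, exposed in browsers through the EventSource API, that lets a server push real-time updates to a client over an ordinary HTTP connection. Each SSE event carries a data field, and JSONL is a natural payload format because every record is a single-line JSON object that the client can parse immediately. EventSource reconnects automatically, SSE generally works through HTTP proxies and load balancers, and the server needs no special infrastructure beyond setting Content-Type to text/event-stream.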

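How do I handle backpressure in a JSONL stream?

Backpressure arises when a producer sends data faster than the consumer can process it. In Node.js, the built-in stream module manages backpressure for you when streams are connected with pipe() or pipeline(): when the consumer's internal buffer fills up, the writable stream's write() method returns false, signaling the producer to pause. Once the buffer drains, the drain event fires and the producer resumes. With pipeline(), backpressure propagates through the whole chain. In a custom implementation, always check the return value of write() and wait for the drain event before writing more.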
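The same idea can be sketched outside Node.js. The example below is a minimal Python analogue, not the Node.js stream API: it assumes a bounded asyncio.Queue standing in for the consumer's internal buffer, and the names producer, consumer, and run are hypothetical. When the queue is full, put() suspends the producer, which plays the role of write() returning false followed by drain.

```python
import asyncio

async def producer(queue, items):
    for item in items:
        # put() suspends while the queue is full: this is the backpressure point.
        await queue.put(item)
    await queue.put(None)  # Sentinel: no more records

async def consumer(queue, seen, depths):
    while True:
        item = await queue.get()
        if item is None:
            break
        depths.append(queue.qsize())  # Record how many items are still buffered
        await asyncio.sleep(0.01)     # Simulate slow processing
        seen.append(item)

async def run(items, maxsize=2):
    queue = asyncio.Queue(maxsize=maxsize)
    seen, depths = [], []
    await asyncio.gather(producer(queue, items), consumer(queue, seen, depths))
    return seen, depths

seen, depths = asyncio.run(run([{"id": i} for i in range(5)]))
print(seen)
```

Because the queue never holds more than maxsize records, memory stays bounded however fast the producer could otherwise run.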

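Can I stream JSONL over WebSocket?

Yes. WebSocket provides full-duplex communication, and each message can carry one JSONL record. The server sends JSON.stringify(record), and the client parses each message with JSON.parse(event.data). This works well when you need bidirectional streaming, for example when a client subscribes to specific channels or sends commands while receiving live data. For high-frequency updates WebSocket can have lower latency than SSE, but it tends to need more infrastructure (sticky sessions, special proxy rules).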

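JSONL Streaming: Processing Data in Real Time

A comprehensive guide to streaming JSONL data in real time. Learn to build streaming pipelines with Node.js readline, Python generators, Server-Sent Events, WebSocket, and production-grade log streaming patterns.

Last updated: February 2026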

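Why JSONL Is Well Suited to Streaming

JSONL (JSON Lines) stores one complete JSON object per line, with records separated by newline characters. This line-delimited structure makes it a strong fit for streaming, because every line is a self-contained, parseable unit. Unlike a single large JSON array, you do not need to load the whole dataset into memory before you can start processing records. A consumer can read one line, parse it, process it, and move on to the next, so memory use stays roughly constant no matter how large the data source grows.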

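This property has made JSONL a common choice for log aggregation systems, real-time analytics pipelines, machine learning data feeds, and streaming API responses. The OpenAI API accepts JSONL for fine-tuning and batch input files, and newline-delimited JSON is widely used for structured logs and message payloads alongside services such as AWS CloudWatch and Apache Kafka, because it combines the universal readability of JSON with the streaming efficiency of a line-oriented protocol. In this guide you will learn how to build JSONL streaming solutions in Node.js and Python, push real-time data to browsers with Server-Sent Events, exchange JSONL over WebSocket, and implement a production-style log streaming system.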
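The difference from a JSON array is easy to see with data that has only partly arrived. The following is a minimal sketch using only the Python standard library; iter_records is a hypothetical helper name, and the two inputs are made-up sample data.

```python
import io
import json

def iter_records(stream):
    """Yield one parsed record per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# A JSONL stream that has delivered two complete lines so far.
partial_jsonl = io.StringIO('{"id": 1}\n{"id": 2}\n')
records = list(iter_records(partial_jsonl))

# The same data as a JSON array whose closing bracket has not arrived yet.
partial_array = '[{"id": 1}, {"id": 2}'
try:
    json.loads(partial_array)
    array_parsed = True
except json.JSONDecodeError:
    array_parsed = False

print(records)       # Both JSONL records are already usable
print(array_parsed)  # The unfinished array cannot be parsed at all
```

Every complete JSONL line is usable the moment it arrives, while a standard JSON parser rejects the array until the final bracket is present.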

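Streaming JSONL in Node.js

Node.js is built on an event-driven, non-blocking I/O model, which makes it a good match for streaming workloads. The built-in readline and stream modules provide everything needed to process JSONL files and network streams line by line without loading the full payload into memory.

The readline module, paired with fs.createReadStream, is the standard way to consume a JSONL file in Node.js. It reads the file in chunks and emits complete lines, so memory use stays flat even for multi-gigabyte files.

Streaming reads with readline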

import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

async function streamJsonl(filePath, onRecord) {
  const rl = createInterface({
    input: createReadStream(filePath, 'utf-8'),
    crlfDelay: Infinity,
  });

  let count = 0;
  for await (const line of rl) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    try {
      const record = JSON.parse(trimmed);
      await onRecord(record);
      count++;
    } catch (err) {
      console.error(`Skipping invalid line: ${err.message}`);
    }
  }
  return count;
}

// Usage: process each record as it arrives
const total = await streamJsonl('events.jsonl', async (record) => {
  console.log(`Event: ${record.type} at ${record.timestamp}`);
});
console.log(`Processed ${total} records`);

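Node.js Transform streams let you build composable data pipelines that read JSONL, filter or enrich each record, and write the results back out. When the stages are connected with pipeline(), backpressure is handled automatically: the source pauses when the destination cannot keep up.

Transform streams for pipelines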

import { createReadStream, createWriteStream } from 'node:fs';
import { Transform } from 'node:stream';
import { pipeline } from 'node:stream/promises';

// Split incoming chunks into individual lines
class LineSplitter extends Transform {
  constructor() {
    super({ readableObjectMode: true });
    this.buffer = '';
  }
  _transform(chunk, encoding, cb) {
    this.buffer += chunk.toString();
    const lines = this.buffer.split('\n');
    this.buffer = lines.pop();
    for (const line of lines) {
      if (line.trim()) this.push(line.trim());
    }
    cb();
  }
  _flush(cb) {
    if (this.buffer.trim()) this.push(this.buffer.trim());
    cb();
  }
}

// Parse, transform, and re-serialize each record
class JsonlTransform extends Transform {
  constructor(transformFn) {
    super({ objectMode: true });
    this.transformFn = transformFn;
  }
  _transform(line, encoding, cb) {
    try {
      const record = JSON.parse(line);
      const result = this.transformFn(record);
      if (result) cb(null, JSON.stringify(result) + '\n');
      else cb(); // Filter out null results
    } catch (err) {
      cb(); // Skip invalid lines
    }
  }
}

// Build the pipeline: read -> split -> transform -> write
await pipeline(
  createReadStream('input.jsonl', 'utf-8'),
  new LineSplitter(),
  new JsonlTransform((record) => {
    if (record.level === 'error') {
      return { ...record, flagged: true, reviewedAt: new Date().toISOString() };
    }
    return null; // Filter out non-error records
  }),
  createWriteStream('errors.jsonl', 'utf-8')
);
console.log('Pipeline complete');

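Streaming JSONL in Python

Python generators offer an elegant way to stream JSONL data. Because a generator yields one record at a time and only advances when the consumer asks for the next value, memory use stays minimal. Combined with the built-in json module, a few lines of code are enough to build a capable streaming pipeline.

The generator function below reads the file line by line with a standard for loop. Each call to next() advances to the next non-empty line, parses it, and yields the result (lines that fail to parse are reported and skipped). The file handle stays open, and the read position moves forward only as the consumer pulls records.

Generator-based streaming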

import json
from typing import Generator, Any

def stream_jsonl(file_path: str) -> Generator[dict[str, Any], None, None]:
    """Stream JSONL records one at a time using a generator."""
    with open(file_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                print(f"Skipping line {line_num}: {e}")

# Usage: process records lazily
for record in stream_jsonl('events.jsonl'):
    if record.get('level') == 'error':
        print(f"ERROR: {record['message']}")

# Or collect a subset with itertools
import itertools
first_100 = list(itertools.islice(stream_jsonl('large.jsonl'), 100))
print(f"Loaded first {len(first_100)} records")

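Python's itertools lets you combine streaming operations such as filtering, batching, and slicing without materializing intermediate results. Each step in the chain handles one record at a time (the batching step holds at most one batch), so you can process files larger than available RAM.

Chaining with itertools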

import json
import itertools
from typing import Generator, Iterable

def stream_jsonl(path: str) -> Generator[dict, None, None]:
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def filter_records(records: Iterable[dict], key: str, value) -> Generator[dict, None, None]:
    """Filter records where key equals value."""
    for record in records:
        if record.get(key) == value:
            yield record

def batch(iterable: Iterable, size: int) -> Generator[list, None, None]:
    """Yield successive batches of the given size."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            break
        yield chunk

# Compose a streaming pipeline
records = stream_jsonl('access_logs.jsonl')
errors = filter_records(records, 'status', 500)

for i, error_batch in enumerate(batch(errors, 50)):
    print(f"Batch {i + 1}: {len(error_batch)} errors")
    # Send batch to alerting system, database, etc.
    for record in error_batch:
        print(f"  {record['timestamp']} - {record['path']}")
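The generators above only read. To mirror the Node.js Transform pipeline, which also re-serializes its results, a writer can consume the same kind of iterator. This is a minimal sketch: read_jsonl and write_jsonl are hypothetical helper names, and the file names and sample records are made up for the demo.

```python
import json
import os
import tempfile
from typing import Iterable, Iterator

def read_jsonl(path: str) -> Iterator[dict]:
    """Lazily yield one record per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def write_jsonl(records: Iterable[dict], path: str) -> int:
    """Write records one per line and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count

# Demo in a throwaway directory with made-up records
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "input.jsonl")
dst = os.path.join(workdir, "errors.jsonl")
write_jsonl([{"level": "info", "msg": "ok"}, {"level": "error", "msg": "boom"}], src)

# Filter and enrich lazily, then stream straight to the output file
flagged = ({**r, "flagged": True} for r in read_jsonl(src) if r.get("level") == "error")
written = write_jsonl(flagged, dst)
result = list(read_jsonl(dst))
print(written, result)
```

Because flagged is a generator expression, only one record is in memory at a time between the reader and the writer.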

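Server-Sent Events with JSONL

Server-Sent Events (SSE) provide a simple HTTP-based protocol for pushing real-time updates from server to client. Each SSE event carries a data field, and JSONL is a natural payload format because every line is a complete JSON object that the client can parse immediately. Unlike WebSocket, SSE runs over plain HTTP, usually needs no special proxy configuration, and reconnects automatically when the connection fails.

The server sets Content-Type to text/event-stream and writes each JSONL record as an SSE data field. The connection stays open, and the server pushes new events as they become available.

SSE server (Node.js / Express)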

import express from 'express';

const app = express();

// Simulate a real-time data source
function* generateMetrics() {
  let id = 0;
  while (true) {
    yield {
      id: ++id,
      cpu: Math.random() * 100,
      memory: Math.random() * 16384,
      timestamp: new Date().toISOString(),
    };
  }
}

app.get('/api/stream/metrics', (req, res) => {
  // SSE headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.setHeader('Access-Control-Allow-Origin', '*');
  res.flushHeaders();

  const metrics = generateMetrics();

  const interval = setInterval(() => {
    const record = metrics.next().value;
    // Each SSE event contains one JSONL record
    res.write(`data: ${JSON.stringify(record)}\n\n`);
  }, 1000);

  req.on('close', () => {
    clearInterval(interval);
    res.end();
  });
});

app.listen(3000, () => console.log('SSE server on :3000'));
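The wire format the server writes is small enough to sketch independently of Express. The example below assumes single-line data fields only (the SSE format also allows multi-line data, event, and id fields, which this sketch ignores); format_sse and parse_sse are hypothetical helper names.

```python
import json

def format_sse(record):
    """Serialize one record as an SSE event: a data field plus a blank line."""
    return f"data: {json.dumps(record)}\n\n"

def parse_sse(text):
    """Parse a buffer of complete SSE events back into records."""
    records = []
    for event in text.split("\n\n"):
        for line in event.split("\n"):
            if line.startswith("data:"):
                records.append(json.loads(line[5:].strip()))
    return records

wire = format_sse({"id": 1, "cpu": 12.5}) + format_sse({"id": 2, "cpu": 40.0})
print(repr(wire))
print(parse_sse(wire))
```

json.dumps never emits a raw newline by default, so each record stays on one data line and the blank line reliably marks the end of the event.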

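The browser connects to the SSE endpoint with the EventSource API. Each incoming event contains one JSONL record, which is parsed and rendered in real time. EventSource reconnects automatically when the connection drops.

SSE client (browser)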

// Connect to the SSE endpoint
const source = new EventSource('/api/stream/metrics');

// Parse each JSONL record as it arrives
source.onmessage = (event) => {
  const record = JSON.parse(event.data);
  console.log(`[${record.timestamp}] CPU: ${record.cpu.toFixed(1)}%`);

  // Update your UI in real time
  updateDashboard(record);
};

source.onerror = (err) => {
  console.error('SSE connection error:', err);
  // EventSource automatically reconnects
};

// For Fetch-based SSE (more control over the stream)
async function fetchSSE(url, onRecord) {
  const response = await fetch(url);
  const reader = response.body
    .pipeThrough(new TextDecoderStream())
    .getReader();

  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += value;
    const events = buffer.split('\n\n');
    buffer = events.pop();
    for (const event of events) {
      const dataLine = event.split('\n').find(l => l.startsWith('data: '));
      if (dataLine) {
        const record = JSON.parse(dataLine.slice(6));
        onRecord(record);
      }
    }
  }
}

await fetchSSE('/api/stream/metrics', (record) => {
  console.log('Metric:', record);
});

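WebSocket + JSONL Streaming

SSE handles server-to-client push, whereas WebSocket supports full-duplex communication in which either side can send JSONL records at any time. This suits interactive applications such as collaborative editing, live dashboards that accept user commands, and bidirectional data synchronization. Each WebSocket message contains one JSONL record, which keeps parsing simple on both ends.

WebSocket server with JSONL messages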

import { WebSocketServer } from 'ws';

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  console.log('Client connected');

  // Send JSONL records to the client
  const interval = setInterval(() => {
    const record = {
      type: 'metric',
      value: Math.random() * 100,
      timestamp: new Date().toISOString(),
    };
    ws.send(JSON.stringify(record));
  }, 500);

  // Receive JSONL records from the client
  ws.on('message', (data) => {
    try {
      const record = JSON.parse(data.toString());
      console.log(`Received: ${record.type}`, record);

      // Echo back an acknowledgment
      ws.send(JSON.stringify({
        type: 'ack',
        originalId: record.id,
        processedAt: new Date().toISOString(),
      }));
    } catch (err) {
      ws.send(JSON.stringify({ type: 'error', message: 'Invalid JSON' }));
    }
  });

  ws.on('close', () => {
    clearInterval(interval);
    console.log('Client disconnected');
  });
});

// Client-side usage:
// const ws = new WebSocket('ws://localhost:8080');
// ws.onmessage = (e) => {
//   const record = JSON.parse(e.data);
//   console.log(record);
// };
// ws.send(JSON.stringify({ type: 'subscribe', channel: 'metrics' }));

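Each WebSocket message is a single JSON object, following the JSONL convention of one record per message. The server streams metrics to the client every 500 milliseconds while also accepting commands from the client. Both directions use the same JSON serialization, so the protocol is symmetric and easy to debug.

Log Streaming in Practice

One of the most common real-world uses of JSONL streaming is centralized log collection. Applications write structured log entries as JSONL lines to stdout or a file, and a collector agent tails the output in real time and forwards the records to a central system. Docker, Kubernetes, and most cloud logging services follow this pattern.

Real-time JSONL log collector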
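The per-message handling is independent of the transport, so it can be sketched as a plain function. This is a minimal illustration of the one-record-per-message convention, not tied to any particular WebSocket library; handle_message is a hypothetical name, and the check for non-object payloads is an added assumption.

```python
import json

def handle_message(raw: str) -> str:
    """Return the JSON reply for one incoming text frame."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return json.dumps({"type": "error", "message": "Invalid JSON"})
    if not isinstance(record, dict):
        # Assumption: every message must be a JSON object, as in JSONL records
        return json.dumps({"type": "error", "message": "Expected a JSON object"})
    return json.dumps({"type": "ack", "originalId": record.get("id")})

ack = json.loads(handle_message('{"type": "subscribe", "id": 7}'))
err = json.loads(handle_message("not json"))
print(ack, err)
```

Keeping this logic in a pure function makes it straightforward to unit test before wiring it to a real socket.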

import { createReadStream, watchFile, unwatchFile } from 'node:fs';
import { createInterface } from 'node:readline';
import { stat } from 'node:fs/promises';

class JsonlLogCollector {
  constructor(filePath, handlers = {}) {
    this.filePath = filePath;
    this.handlers = handlers;
    this.position = 0;
    this.running = false;
    this.reading = false; // Prevents overlapping reads from the watcher
  }

  async start() {
    this.running = true;
    // Read existing content first
    await this.readFrom(0);
    // Watch for new appends
    this.watch();
  }

  async readFrom(startPos) {
    // Snapshot the size before reading so bytes appended mid-read
    // are not skipped; they are picked up by the next read instead
    const { size } = await stat(this.filePath);
    if (size <= startPos) return;

    const rl = createInterface({
      input: createReadStream(this.filePath, {
        encoding: 'utf-8',
        start: startPos,
        end: size - 1, // `end` is inclusive
      }),
      crlfDelay: Infinity,
    });

    for await (const line of rl) {
      if (!line.trim()) continue;
      try {
        const record = JSON.parse(line);
        await this.dispatch(record);
      } catch (err) {
        this.handlers.onError?.(err, line);
      }
    }
    this.position = size;
  }

  watch() {
    // Poll the file every 500 ms and read whatever was appended
    watchFile(this.filePath, { interval: 500 }, async (curr, prev) => {
      if (!this.running || this.reading) return;
      if (curr.size > this.position) {
        this.reading = true;
        try {
          await this.readFrom(this.position);
        } finally {
          this.reading = false;
        }
      }
    });
  }

  async dispatch(record) {
    // Route to a level-specific handler, then to the catch-all handler
    const level = record.level || 'info';
    this.handlers[level]?.(record);
    this.handlers.onRecord?.(record);
  }

  stop() {
    this.running = false;
    unwatchFile(this.filePath); // Stop polling so the process can exit
  }
}

// Usage
const collector = new JsonlLogCollector('app.log.jsonl', {
  error: (r) => console.error(`[ERROR] ${r.message}`),
  warn: (r) => console.warn(`[WARN] ${r.message}`),
  onRecord: (r) => forwardToElasticsearch(r), // Placeholder: supply your own forwarding function
  onError: (err, line) => console.error('Bad line:', line),
});

await collector.start();

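This collector tails a JSONL log file in real time and dispatches records to handlers according to log level. It tracks the file position, so after the initial catch-up read it only reads new content. In production you would also want file rotation detection, handling of partially written lines, graceful shutdown, and batched forwarding to reduce network calls to the centralized logging system.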
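One of those production concerns, partially written lines, can be handled by advancing the position only past the last newline. The sketch below shows the idea in Python rather than Node.js; read_new_records is a hypothetical helper, and the file contents are made-up sample data.

```python
import json
import os
import tempfile

def read_new_records(path, position):
    """Read complete lines appended after `position`; return (records, new_position)."""
    records = []
    with open(path, "rb") as f:
        f.seek(position)
        data = f.read()
    # Consume only up to the last newline; a trailing partial line waits for the next call.
    end = data.rfind(b"\n")
    if end == -1:
        return records, position
    for line in data[: end + 1].splitlines():
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except ValueError:
            pass  # Skip malformed lines in this sketch
    return records, position + end + 1

# Demo: the writer is interrupted halfway through the second record
path = os.path.join(tempfile.mkdtemp(), "app.log.jsonl")
with open(path, "wb") as f:
    f.write(b'{"level": "info"}\n{"level": "err')
first, pos = read_new_records(path, 0)

with open(path, "ab") as f:
    f.write(b'or"}\n')  # The writer finishes the line
second, pos = read_new_records(path, pos)
print(first, second)
```

Working in bytes keeps the stored position an exact file offset, which would not hold if offsets were counted in decoded characters.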
