Python 迭代器协议实现自定义数据流

在处理大规模数据时，内存限制是绕不开的痛点。如果一次性加载几个GB的日志文件或数据流，程序很可能直接崩溃。Python 的迭代器协议提供了一种优雅的解决方案——按需加载、逐项处理、内存占用可控。本文将深入讲解迭代器协议的底层原理，并手把手教你实现自定义数据流。

理解迭代器协议的核心机制

迭代器协议是 Python 统一的遍历标准，任何遵循该协议的对象都可以在 for 循环、列表推导式、解包操作等场景中无缝使用。协议的核心在于两个特殊方法：__iter__ 和 __next__。

__iter__ 方法返回迭代器对象本身。当你对一个对象调用 iter() 函数时，Python 会自动执行该方法。这个设计让可迭代对象和迭代器明确区分——可迭代对象的 __iter__ 返回一个新的迭代器实例，而迭代器的 __iter__ 通常返回自身。

__next__ 方法是迭代器的核心。每次调用它时，返回序列中的下一个元素。当没有更多元素可返回时，必须抛出 StopIteration 异常。这个异常是 Python 用来告知遍历结束的标准信号，for 循环正是依赖它来正确终止循环。

# 最简单的迭代器示例：倒计时
class Countdown:
    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current < 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

# 使用方式
for num in Countdown(5):
    print(num)

这段代码的执行结果是打印 5、4、3、2、1。当 current 变为 -1 时，__next__ 抛出 StopIteration，for 循环捕获该异常并正常退出，而非报错终止。

为什么需要自定义迭代器

Python 内置的列表、字典、文件对象等都已经实现了迭代器协议，直接使用即可。但某些场景下，内置类型无法满足需求。

处理实时数据流是最典型的例子。假设你需要一个持续监听网络端口、逐行返回数据的迭代器；或者需要从数据库游标中分批拉取记录，而非一次性加载全部结果。这类场景都需要自定义迭代器来精确控制数据拉取的时机和方式。

另一个常见需求是惰性计算。对于超大规模数据集，一次性生成所有元素会占用巨量内存。通过迭代器按需生成，可以将内存复杂度从 $O(n)$ 降低到 $O(1)$。比如计算斐波那契数列的前 100 万项，使用迭代器可以逐个计算、逐个返回，而不需要维护一个包含 100 万个元素的列表。

实战：实现文件数据流处理器

下面通过一个完整案例，展示如何实现一个自定义文件数据流处理器。这个处理器能够分块读取大型文件、过滤空行、转换数据格式，并且支持链式操作。

class FileStream:
    def __init__(self, filepath, chunk_size=8192, encoding='utf-8'):
        self.filepath = filepath
        self.chunk_size = chunk_size
        self.encoding = encoding
        self._file = None
        self._leftover = ''

    def __enter__(self):
        self._file = open(self.filepath, 'r', encoding=self.encoding)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._file:
            self._file.close()

    def __iter__(self):
        return self

    def __next__(self):
        if self._file is None:
            raise StopIteration

        # 尝试从文件读取数据
        chunk = self._file.read(self.chunk_size)
        if not chunk:
            # 检查最后是否有残留数据
            if self._leftover:
                line = self._leftover
                self._leftover = ''
                return line
            raise StopIteration

        # 处理可能跨块的多行数据
        lines = (self._leftover + chunk).split('\n')
        self._leftover = lines.pop()  # 最后一行可能不完整

        if lines:
            return lines.pop()  # 返回最后一行完整数据
        return self.__next__()  # 当前块没有完整行，继续读取


class LineProcessor:
    def __init__(self, stream):
        self.stream = stream

    def filter(self, condition):
        """过滤符合条件的行"""
        return FilteredStream(self.stream, condition)

    def map(self, transform):
        """对每行应用转换函数"""
        return MappedStream(self.stream, transform)


class FilteredStream:
    def __init__(self, stream, condition):
        self.stream = stream
        self.condition = condition

    def __iter__(self):
        return self

    def __next__(self):
        for line in self.stream:
            if self.condition(line):
                return line
        raise StopIteration


class MappedStream:
    def __init__(self, stream, transform):
        self.stream = stream
        self.transform = transform

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self.stream)
        return self.transform(line)


# 使用示例
with FileStream('access.log') as stream:
    processor = LineProcessor(stream)

    # 链式操作：过滤空行 → 过滤包含"ERROR"的行 → 转换为大写
    for line in processor.filter(lambda x: x.strip()) \
                          .filter(lambda x: 'ERROR' in x) \
                          .map(lambda x: x.upper()):
        print(line)

这段代码实现了三个核心组件。FileStream 负责分块读取文件，它会将跨越块边界的行正确拼接，避免数据截断。LineProcessor 提供链式操作的入口，通过 filter 和 map 方法构建处理管道。FilteredStream 和 MappedStream 是实际的迭代器，它们持有上游迭代器的引用，在 __next__ 中实现过滤和转换逻辑。

这种设计的优势在于惰性求值——只有在真正遍历时才会读取文件、过滤数据、转换格式。如果只创建管道但不遍历，内存中不会有任何实际数据。

生成器：简化迭代器实现的利器

手写迭代器类代码量较大，Python 提供了 yield 关键字来简化这一过程。包含 yield 的函数自动变成生成器，调用时会返回一个迭代器对象。

# 使用生成器实现相同功能
def file_generator(filepath, chunk_size=8192, encoding='utf-8'):
    with open(filepath, 'r', encoding=encoding) as f:
        leftover = ''
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                if leftover:
                    yield leftover
                return

            lines = (leftover + chunk).split('\n')
            leftover = lines.pop()

            for line in lines:
                yield line


def process_lines(filepath):
    for line in file_generator(filepath):
        line = line.strip()
        if not line:
            continue
        if 'ERROR' not in line:
            continue
        yield line.upper()


# 使用生成器表达式（更简洁）
def process_lines_v2(filepath):
    return (
        line.upper()
        for line in file_generator(filepath)
        if line.strip() and 'ERROR' in line
    )

# 使用
for line in process_lines('access.log'):
    print(line)

生成器的核心优势在于函数体内的执行暂停。当 yield 语句执行时，函数的当前状态（包括局部变量、指令指针等）被保存，下次调用 __next__ 时从暂停点继续执行。这种机制让编写状态ful的迭代器变得极为简单。

迭代器的高级应用模式

无限数据流迭代器

迭代器不限于有限数据流。通过精心设计，可以让迭代器永不抛出 StopIteration，实现无限序列的惰性生成。

def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


def prime_sieve(limit):
    sieve = [True] * (limit + 1)
    for p in range(2, limit + 1):
        if sieve[p]:
            yield p
            for multiple in range(p * p, limit + 1, p):
                sieve[multiple] = False


# 获取前10个斐波那契数
fib_iter = fibonacci()
for _ in range(10):
    print(next(fib_iter))

多迭代器协调

某些场景需要多个迭代器协同工作，比如实现 zip 的功能。关键在于使用工厂函数，每次调用返回新的迭代器实例。

def my_zip(*iterables):
    iterators = [iter(it) for it in iterables]
    while True:
        values = []
        for it in iterators:
            try:
                values.append(next(it))
            except StopIteration:
                return
        yield tuple(values)


# 使用
for pair in my_zip([1, 2, 3], 'abc'):
    print(pair)

迭代器与上下文管理器结合

迭代器可以与上下文管理器结合，实现更复杂的数据处理流程。itertools.groupby、contextlib.contextmanager 等工具经常用于这类场景。

from itertools import groupby
from contextlib import contextmanager


@contextmanager
def open_log_chunks(filepath):
    with open(filepath, 'r') as f:
        # 将文件按行分组，每组属于同一个时间段
        def chunk_generator():
            current_date = None
            chunk = []
            for line in f:
                line = line.strip()
                if not line:
                    continue
                parts = line.split()
                date = parts[0] if parts else ''

                if date != current_date and chunk:
                    yield chunk
                    chunk = []
                current_date = date
                chunk.append(line)

            if chunk:
                yield chunk

        yield chunk_generator()


# 使用：按日期分组处理日志
with open_log_chunks('access.log') as chunks:
    for chunk in chunks:
        error_count = sum(1 for line in chunk if 'ERROR' in line)
        print(f"Chunk: {len(chunk)} lines, {error_count} errors")

性能与内存优化建议

设计迭代器时，内存占用是首要考量。迭代器的优势在于惰性求值，但如果实现不当，反而可能适得其反。

避免在迭代器内部积累数据。__next__ 方法应当只返回当前元素，而非缓存整个序列。如果需要回溯功能，考虑使用 itertools.tee 复制迭代器，而非手动实现缓存机制。

from itertools import tee

def process_stream(data):
    # 错误示范：在列表中积累所有数据
    # buffer = []
    # for item in data:
    #     buffer.append(item)
    #     if len(buffer) > 1000:
    #         yield from buffer
    #         buffer = []

    # 正确示范：直接yield
    for item in data:
        yield process(item)


# 需要回溯时使用tee
data = [1, 2, 3, 4, 5]
a, b = tee(data)  # 两个独立的迭代器共享底层数据
print(list(a))  # [1, 2, 3, 4, 5]
print(list(b))  # [1, 2, 3, 4, 5]

正确管理文件和网络连接也是关键。迭代器应当与上下文管理器配合，确保资源在使用完毕后正确释放。上文中的 FileStream 类展示了这一最佳实践——通过 __enter__ 和 __exit__ 实现资源的获取和释放。

常见陷阱与排查方法

迭代器最常见的陷阱是迭代一次后失效。由于迭代器持有内部状态，遍历完毕后再次遍历会立即触发 StopIteration。

# 错误示例：迭代器只能遍历一次
def wrong_example():
    numbers = [1, 2, 3]
    for num in numbers:
        yield num

it = wrong_example()
print(list(it))  # [1, 2, 3]
print(list(it))  # [] 第二次遍历为空！

# 正确做法：每次需要新的迭代器时调用函数
def correct_example():
    numbers = [1, 2, 3]
    for num in numbers:
        yield num

def get_fresh_iterator():
    return correct_example()

print(list(get_fresh_iterator()))  # [1, 2, 3]
print(list(get_fresh_iterator()))  # [1, 2, 3]

另一个陷阱是协程与迭代器的混淆。yield 用于创建生成器/迭代器，但如果需要实现真正的协程（双向通信），应使用 asyncio 或 types.coroutine。

# 生成器：单向数据流
def simple_generator():
    yield 1
    yield 2
    yield 3

# 协程：双向通信
def coroutine_example():
    print("启动协程")
    while True:
        x = yield
        print(f"收到: {x}")

# 协程需要先发送None启动
co = coroutine_example()
next(co)  # 或 co.send(None)
co.send("hello")  # 收到: hello
co.send("world")  # 收到: world

结语

迭代器协议是 Python 异步编程和数据处理的基础设施。掌握这一协议，你能够编写出内存高效、代码优雅的数据处理管道。从简单的倒计时类到复杂的文件流处理器，核心机制始终是 __iter__ 返回迭代器、__next__ 逐项返回、StopIteration 通知结束。生成器语法大幅简化了迭代器的编写，而链式操作和惰性求值则让大规模数据处理变得可控且高效。

文章目录

Python 迭代器协议实现自定义数据流

Python 迭代器协议实现自定义数据流

理解迭代器协议的核心机制

为什么需要自定义迭代器

实战：实现文件数据流处理器

生成器：简化迭代器实现的利器

迭代器的高级应用模式

无限数据流迭代器

多迭代器协调

迭代器与上下文管理器结合

性能与内存优化建议

常见陷阱与排查方法

结语

评论 (0)

文章目录

Python 迭代器协议实现自定义数据流

Python 迭代器协议实现自定义数据流

理解迭代器协议的核心机制

为什么需要自定义迭代器

实战：实现文件数据流处理器

生成器：简化迭代器实现的利器

迭代器的高级应用模式

无限数据流迭代器

多迭代器协调

迭代器与上下文管理器结合

性能与内存优化建议

常见陷阱与排查方法

结语

评论 (0)

扫一扫，手机查看