2009-04-18

Batch Iterator and obscure Python details

I love Python generators and iterators, when they aren't making the easy trivial they are making the impossible possible.

I specially like to use iterators in streaming situations, like when reading from very large files or a database, because you don't have to traverse the sequence twice.

However in very large sequences I have had the need to perform some action every n items. I had the idea of using an special iterator that could split a sequence in sub-sequences but then I have to step over every item twice, once to pack it into the sub-sequence and once again to process it. Using islice was my first idea, but I needed to, somehow, comunicate to the "outer" iterator that the sequence has been exhausted or else I'd be stuck in an infinite loop iterating over empty subsequences.


I tough about adding an is_exhausted property to the sub, sequences, then I found out something interesting, you can't stuff properties into standard iterators, including those you get with generator expressions.


>>> i = iter([])
>>> i.is_exhausted = True
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
i.is_exhausted = True
AttributeError: 'listiterator' object has no attribute 'is_exhausted'
>>> def generator():
yield True
>>> g = generator()
>>> g.is_exhausted = True
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
g.is_exhausted = True
AttributeError: 'generator' object has no attribute 'is_exhausted'
>>>


No prob, I though, I can make everything inside a single generator! Actually I can't, once a generator raises StopIteration it can't do anything else.


>>> def anotherGenerator():
yield 1
yield 2
raise StopIteration
yield 3
yield 4
>>> a = anotherGenerator()
>>> a.next()
1
>>> a.next()
2
>>> a.next()
Traceback (most recent call last):
File "<pyshell#16>", line 1, in <module>
a.next()
File "<pyshell#12>", line 4, in anotherGenerator
raise StopIteration
StopIteration
>>> a.next()
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
a.next()
StopIteration
>>> a.next()
Traceback (most recent call last):
File "<pyshell#18>", line 1, in <module>
a.next()
StopIteration
>>>


Ok, so I thought about using a custom class for the sub-sequences, one that stored a reference to the "parent" iterator, but then I thought, Why making two new classes if the parent is simply returning iterators why don't return self? This lead to the first working implementation:


class ibatch(object):
"""A batch iterator by rgz"""
def __init__(self, sequence, size):
"""ibatch(iterable, size) -> sequence of iterables
splits an iterable into groups of 'size' items lazily"""
self.__sequence = iter(sequence)
self.__size = size
self.__counter = 0
def __repr__(self):
return "<batch iterator at %s>" % hex(id(self))
def __iter__(self):
return self
def next(self):
if self.__counter:
if self.__counter > self.__size:
self.__counter = 0
raise StopIteration
else:
self.__counter += 1
# When this raises StopIteration, it's the end.
return self.__sequence.next()
else:
self.__counter = 1
return self


This one uses two magic constants internally but is overall nice and compact, this is a demonstration of how it runs:


>>> for enum, items in enumerate(ibatch(xrange(10), 3)):
print "Block #%s" % enum
for item in items:
print item,
print '\n--'

Block #0
0 1 2
--
Block #1
3 4 5
--
Block #2
6 7 8
--
Block #3
9
--


Pretty nice, as long as the number of items isn't exactly divisible by the size of the sub-sequence, when that happen we get an empy sub-sequence complete with empty header and footer sections:


>>> for enum, items in enumerate(ibatch(xrange(9), 3)): # 9 items intead of 10...
print "Block #%s" % enum
for item in items:
print item,
print '\n--'


Block #0
0 1 2
--
Block #1
3 4 5
--
Block #2
6 7 8
--
Block #3

--


See that empty block? We can't get rid of it, because we don't know if the current sub-sequence is empty unless we try to get an item from it. This breaks a little of the conceptual cleanness of iterators, if the header depends on the sequence to not be opened first. However most of the time it is not a problem and it is very convenient, what we do is that we preload the first item in the sub-sequence to find out if the sequence is empty or not:


>>> class ibatch(object):
"""A batch iterator by rgz"""
def __init__(self, sequence, size, preloading = False):
"""ibatch(iterable, size) -> sequence of iterables splits an iterable
into groups of 'size' items lazily"""
self.__sequence = iter(sequence)
self.__size = size
self.__counter = 0
if preloading:
self.next = self._next_preloading
else:
self.next = self._next
assert self.next
def __repr__(self):
return "<batch iterator at %s>" % hex(id(self))
def __iter__(self):
return self
def _next(self):
if self.__counter:
if self.__counter > self.__size:
self.__counter = 0
raise StopIteration
else:
self.__counter += 1
# When this raises StopIteration, it's the end.
return self.__sequence.next()
else:
self.__counter = 1
return self
def _next_preloading(self):
if self.__counter == 0:
self.__preloaded = self.__sequence.next()
self.__counter = 1
return self
elif self.__counter == 1:
self.__counter = 2
return self.__preloaded
elif self.__counter <= self.__size:
self.__counter += 1
# When this raises StopIteration, it's the end.
return self.__sequence.next()
else:
self.__counter = 0
raise StopIteration


So this class takes a preloading argument and choses the apropiate next method, I'm soo clever! Except it doesn't work.


>>> for enum, items in enumerate(ibatch(xrange(9), 3)):
print "Block #%s" % enum
for item in items:
print item,
print '\n--'


Traceback (most recent call last):
File "<pyshell#43>", line 1, in <module>
for enum, items in enumerate(ibatch(xrange(9), 3)):
TypeError: iter() returned non-iterator of type 'ibatch'


Wait what? how is it not an iterator? The minimal requisites for the iteration protocol are the __iter__ and next methods and it has both right? Unless iter() expects the class to have a next method, so I added it to the ibatch class:


def next(self):
pass


This still doesn't work, but for an entirely different reason...


>>> for enum, items in enumerate(ibatch(xrange(9), 3)):
print "Block #%s" % enum
for item in items:
print item,
print '\n--'


Block #0
Traceback (most recent call last):
File "<pyshell#48>", line 3, in <module>
for item in items:
TypeError: 'NoneType' object is not iterable


What NoneType? It is talking about the return of the next method, so it is calling the next method in the class! Now it makes sense that iter() looks for it in the class definition, in other words the for statement doesn't call foo.next(), it calls foo.__class__.next(foo).

I understand why they don't want to use method resolution over and over in each iteration but grabbing a reference to the next method in the instance is the right thing to do, in my opinion. A dirty fix is calling the instance method in the class method like this:


def next(self):
return self.next()


But that's innefficient, the most readable solution seems to be using two clases like this:

FINAL VERSION


class ibatch(object):
"""A batch iterator by rgz that doesn't creates empty batches"""
def __init__(self, sequence, size):
"""ibatch(iterable, size) -> sequence of iterables
splits an iterable into groups of 'size' items lazily"""
self.__sequence = iter(sequence)
self.__size = size
self.__counter = 0
def __iter__(self):
return self
def next(self):
if self.__counter == 0:
self.__preloaded = self.__sequence.next()
self.__counter = 1
return self
elif self.__counter == 1:
self.__counter = 2
return self.__preloaded
elif self.__counter <= self.__size:
self.__counter += 1
# When this raises StopIteration, it's the end.
return self.__sequence.next()
else:
self.__counter = 0
raise StopIteration

class ibatch_strict(object):
"""A batch iterator by rgz"""
def __init__(self, sequence, size, preloading = False):
"""ibatch(iterable, size) -> sequence of iterables
splits an iterable into groups of 'size' items lazily
it is strict because it doesn't open the subsequence
before the header is procesed but in turn it can leave
an empty batch at the end if (len(sequence) % size) == 0"""
self.__sequence = iter(sequence)
self.__size = size
self.__counter = 0
def __iter__(self):
return self
def next(self):
if self.__counter:
if self.__counter > self.__size:
self.__counter = 0
raise StopIteration
else:
self.__counter += 1
# When this raises StopIteration, it's the end.
return self.__sequence.next()
else:
self.__counter = 1
return self


As you can see I removed the __repr__ methods since they have nothing interesting to say, I also decided to make preloading the default class because I like it better ^^. Here is how it runs:


>>> for enum, items in enumerate(ibatch(xrange(9), 3)):
print "Block: %s" % enum
for item in items:
print item,
print "\n--"

Block: 0
0 1 2
--
Block: 1
3 4 5
--
Block: 2
6 7 8
--


So the lesson of today is: "python iterators use foo.__class__.next(foo) not foo.next()"
I'll try not forgetting that.

I hope this class is usefull for someone.