2009-04-29

Lady Hope's Lies

 Purging some links from my toBlog list, first comes this: the myth of Darwin turning Christian on his deathbed.


 The Lady Hope Story: A Widespread Falsehood

 In summary: Lady Hope, a hag who never even met Darwin, claimed that she visited him in his last moments, during which he supposedly confessed to Christ and recanted his scientific research.

 Bullshit.

 The story was debunked first by Darwin's daughter, who WAS with him in his last moments, and then by his son, who knew him well. His agnosticism is also more than evident in the many letters he wrote to friends.

2009-04-18

Blog name change

I bumped into a blog named "cat /dev/random". Hey, that's my blog! Wait, no: a quick Google search reveals this is a fairly common blog name among unix users.

So I decided to change the name of this blog to "import random" to fit my Python bias. It seems there aren't any blogs with this name, so I'm original and unique again! ...for the time being...


IntegerDateTime

So I try time and again to get into the Python ORM wave, this time with the elegant Elixir library.

The way I see it, Elixir/SqlAlchemy or any of the other ORM libraries can do a lot for you, provided you do the right things from the beginning. Integrating one into my current workflow is just too much work, and I'm in constant fear that everything will crumble at some point and I'll have to rewrite everything.

Anyway, the problem I had, which I pasted into Stack Overflow, involved our consistent use of integer columns instead of datetime columns in MySQL. If I wanted to make a table wrapper I needed to cover that case, so in the end I wrote my own SQLAlchemy data type (also pasted at Stack Overflow):



import datetime, time
from sqlalchemy.types import TypeDecorator, DateTime

class IntegerDateTime(TypeDecorator):
    """A type that decorates DateTime: converts to unix time on
    the way in and to datetime.datetime objects on the way out."""
    impl = DateTime

    def process_bind_param(self, value, engine):
        """Assumes a datetime.datetime."""
        if value is None:
            return None  # allow NULLs through untouched
        assert isinstance(value, datetime.datetime)
        return int(time.mktime(value.timetuple()))

    def process_result_value(self, value, engine):
        if value is None:
            return None
        return datetime.datetime.fromtimestamp(float(value))

    def copy(self):
        return IntegerDateTime(timezone=self.timezone)
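The conversion the type performs can be checked in isolation. Here's a minimal sketch of the round trip (note that time.mktime and datetime.fromtimestamp both work in local time, so the pair is consistent with itself):

```python
import datetime
import time

# datetime -> integer, as in process_bind_param
dt = datetime.datetime(2009, 4, 18, 12, 30, 0)
stamp = int(time.mktime(dt.timetuple()))

# integer -> datetime, as in process_result_value
back = datetime.datetime.fromtimestamp(float(stamp))

print(back == dt)  # True: lossless down to the second
```

Sub-second precision is lost, of course, but that's inherent to storing timestamps as integers.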

In the end though, it wasn't very useful, because I don't have another table to link this one to, so the main reason to write a wrapper class was void. Also, the query syntax was less nice than the SqlSoup auto-generated one, so I should probably just use SqlSoup.

I still think Elixir/SqlAlchemy mappers are great; I understand they do a lot of stuff for you, like centralizing data definitions. But I just can't get a chance to use them where they aren't a hindrance!

So sad...

Batch Iterator and obscure Python details

I love Python generators and iterators: when they aren't making the easy trivial, they are making the impossible possible.

I especially like to use iterators in streaming situations, like when reading from very large files or a database, because you don't have to traverse the sequence twice.
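As a hypothetical sketch of that streaming style (the file name and filter are made up for illustration), a generator lets you walk a huge file line by line without ever holding the whole thing in memory:

```python
def nonempty_lines(path):
    # Lazily yield stripped, non-empty lines; only one line
    # is ever held in memory at a time.
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield line

# Usage: the file is consumed in a single streaming pass.
# for line in nonempty_lines("huge.log"):
#     process(line)
```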

However, with very large sequences I have had the need to perform some action every n items. I had the idea of using a special iterator that could split a sequence into sub-sequences, but then I'd have to step over every item twice: once to pack it into the sub-sequence and once again to process it. Using islice was my first idea, but I needed to somehow communicate to the "outer" iterator that the sequence had been exhausted, or else I'd be stuck in an infinite loop iterating over empty subsequences.
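For contrast, here is roughly what the eager version I wanted to avoid looks like (a sketch of my own, not code from this post). It solves the exhaustion problem, because an empty slice signals the end, but it packs every item into a list before you get to process it:

```python
from itertools import islice

def eager_batches(iterable, size):
    # Pack items into lists of at most `size` elements;
    # an empty slice means the source is exhausted.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

print(list(eager_batches(range(10), 3)))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```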


I thought about adding an is_exhausted attribute to the sub-sequences. Then I found out something interesting: you can't stuff attributes into standard iterators, including those you get from generator functions.


>>> i = iter([])
>>> i.is_exhausted = True
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    i.is_exhausted = True
AttributeError: 'listiterator' object has no attribute 'is_exhausted'
>>> def generator():
        yield True

>>> g = generator()
>>> g.is_exhausted = True
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    g.is_exhausted = True
AttributeError: 'generator' object has no attribute 'is_exhausted'
>>>


No prob, I thought, I can make everything inside a single generator! Actually I can't: once a generator raises StopIteration, it can't do anything else.


>>> def anotherGenerator():
        yield 1
        yield 2
        raise StopIteration
        yield 3
        yield 4

>>> a = anotherGenerator()
>>> a.next()
1
>>> a.next()
2
>>> a.next()
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    a.next()
  File "<pyshell#12>", line 4, in anotherGenerator
    raise StopIteration
StopIteration
>>> a.next()
Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    a.next()
StopIteration
>>> a.next()
Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    a.next()
StopIteration
>>>


OK, so I thought about using a custom class for the sub-sequences, one that stored a reference to the "parent" iterator. But then I thought: why make two new classes? If the parent is simply returning iterators, why not return self? This led to the first working implementation:


class ibatch(object):
    """A batch iterator by rgz"""
    def __init__(self, sequence, size):
        """ibatch(iterable, size) -> sequence of iterables
        splits an iterable into groups of 'size' items lazily"""
        self.__sequence = iter(sequence)
        self.__size = size
        self.__counter = 0
    def __repr__(self):
        return "<batch iterator at %s>" % hex(id(self))
    def __iter__(self):
        return self
    def next(self):
        if self.__counter:
            if self.__counter > self.__size:
                self.__counter = 0
                raise StopIteration
            else:
                self.__counter += 1
                # When this raises StopIteration, it's the end.
                return self.__sequence.next()
        else:
            self.__counter = 1
            return self


This one uses two magic constants internally but is overall nice and compact. Here is a demonstration of how it runs:


>>> for enum, items in enumerate(ibatch(xrange(10), 3)):
        print "Block #%s" % enum
        for item in items:
            print item,
        print '\n--'

Block #0
0 1 2
--
Block #1
3 4 5
--
Block #2
6 7 8
--
Block #3
9
--


Pretty nice, as long as the number of items isn't exactly divisible by the size of the sub-sequences. When it is, we get an empty sub-sequence, complete with empty header and footer sections:


>>> for enum, items in enumerate(ibatch(xrange(9), 3)):  # 9 items instead of 10...
        print "Block #%s" % enum
        for item in items:
            print item,
        print '\n--'

Block #0
0 1 2
--
Block #1
3 4 5
--
Block #2
6 7 8
--
Block #3

--


See that empty block? We can't get rid of it, because we don't know whether the current sub-sequence is empty unless we try to get an item from it. Doing so breaks a little of the conceptual cleanness of iterators, since we now touch the sub-sequence before its header is processed. However, most of the time that is not a problem and it is very convenient: what we do is preload the first item of the sub-sequence to find out whether the sequence is empty or not:


class ibatch(object):
    """A batch iterator by rgz"""
    def __init__(self, sequence, size, preloading=False):
        """ibatch(iterable, size) -> sequence of iterables
        splits an iterable into groups of 'size' items lazily"""
        self.__sequence = iter(sequence)
        self.__size = size
        self.__counter = 0
        if preloading:
            self.next = self._next_preloading
        else:
            self.next = self._next
        assert self.next
    def __repr__(self):
        return "<batch iterator at %s>" % hex(id(self))
    def __iter__(self):
        return self
    def _next(self):
        if self.__counter:
            if self.__counter > self.__size:
                self.__counter = 0
                raise StopIteration
            else:
                self.__counter += 1
                # When this raises StopIteration, it's the end.
                return self.__sequence.next()
        else:
            self.__counter = 1
            return self
    def _next_preloading(self):
        if self.__counter == 0:
            self.__preloaded = self.__sequence.next()
            self.__counter = 1
            return self
        elif self.__counter == 1:
            self.__counter = 2
            return self.__preloaded
        elif self.__counter <= self.__size:
            self.__counter += 1
            # When this raises StopIteration, it's the end.
            return self.__sequence.next()
        else:
            self.__counter = 0
            raise StopIteration


So this class takes a preloading argument and chooses the appropriate next method. I'm soo clever! Except it doesn't work.


>>> for enum, items in enumerate(ibatch(xrange(9), 3)):
        print "Block #%s" % enum
        for item in items:
            print item,
        print '\n--'

Traceback (most recent call last):
  File "<pyshell#43>", line 1, in <module>
    for enum, items in enumerate(ibatch(xrange(9), 3)):
TypeError: iter() returned non-iterator of type 'ibatch'


Wait, what? How is it not an iterator? The minimal requirements of the iteration protocol are the __iter__ and next methods, and it has both, right? Unless iter() expects the class itself to have a next method. So I added one to the ibatch class:


def next(self):
    pass


This still doesn't work, but for an entirely different reason...


>>> for enum, items in enumerate(ibatch(xrange(9), 3)):
        print "Block #%s" % enum
        for item in items:
            print item,
        print '\n--'

Block #0
Traceback (most recent call last):
  File "<pyshell#48>", line 3, in <module>
    for item in items:
TypeError: 'NoneType' object is not iterable


What NoneType? It is talking about the return value of the next method, so it is calling the next method defined on the class! Now it makes sense that iter() looks for it in the class definition; in other words, the for statement doesn't call foo.next(), it calls foo.__class__.next(foo).
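The same rule survives in modern Python 3, where next is spelled __next__: special methods are looked up on the type, never on the instance, so an instance attribute is invisible to iter() and for. A minimal sketch (class and names are mine):

```python
class Fake(object):
    def __iter__(self):
        return self  # claims to be its own iterator

fake = Fake()
fake.__next__ = lambda: 1  # instance attribute: the protocol never sees it

try:
    iter(fake)  # checks for __next__ on type(fake), not on fake
except TypeError as e:
    print(e)  # iter() returned non-iterator of type 'Fake'
```

CPython caches the slot lookup on the type precisely so that each iteration step doesn't pay for a full attribute search.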

I understand why they don't want to do method resolution over and over on each iteration, but grabbing a reference to the next method on the instance is the right thing to do, in my opinion. A dirty fix is calling the instance method from the class method, like this:


def next(self):
    return self.next()


But that's inefficient. The most readable solution seems to be using two classes, like this:

FINAL VERSION


class ibatch(object):
    """A batch iterator by rgz that doesn't create empty batches"""
    def __init__(self, sequence, size):
        """ibatch(iterable, size) -> sequence of iterables
        splits an iterable into groups of 'size' items lazily"""
        self.__sequence = iter(sequence)
        self.__size = size
        self.__counter = 0
    def __iter__(self):
        return self
    def next(self):
        if self.__counter == 0:
            self.__preloaded = self.__sequence.next()
            self.__counter = 1
            return self
        elif self.__counter == 1:
            self.__counter = 2
            return self.__preloaded
        elif self.__counter <= self.__size:
            self.__counter += 1
            # When this raises StopIteration, it's the end.
            return self.__sequence.next()
        else:
            self.__counter = 0
            raise StopIteration

class ibatch_strict(object):
    """A batch iterator by rgz"""
    def __init__(self, sequence, size):
        """ibatch_strict(iterable, size) -> sequence of iterables
        splits an iterable into groups of 'size' items lazily;
        it is strict because it doesn't open the subsequence
        before the header is processed, but in turn it can leave
        an empty batch at the end if (len(sequence) % size) == 0"""
        self.__sequence = iter(sequence)
        self.__size = size
        self.__counter = 0
    def __iter__(self):
        return self
    def next(self):
        if self.__counter:
            if self.__counter > self.__size:
                self.__counter = 0
                raise StopIteration
            else:
                self.__counter += 1
                # When this raises StopIteration, it's the end.
                return self.__sequence.next()
        else:
            self.__counter = 1
            return self


As you can see I removed the __repr__ methods, since they have nothing interesting to say. I also decided to make preloading the default class because I like it better ^^. Here is how it runs:


>>> for enum, items in enumerate(ibatch(xrange(9), 3)):
        print "Block: %s" % enum
        for item in items:
            print item,
        print "\n--"

Block: 0
0 1 2
--
Block: 1
3 4 5
--
Block: 2
6 7 8
--


So the lesson of today is: "Python iterators use foo.__class__.next(foo), not foo.next()".
I'll try not to forget that.
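For what it's worth, the same no-empty-batch behavior can also be sketched as a plain generator function (my own rewrite, not from the class above; like the class, it assumes each batch is fully consumed before the next one is requested):

```python
def ibatch_gen(iterable, size):
    # Yield sub-iterators of at most `size` items each.
    # Preloading one item up front is what prevents the
    # empty trailing batch.
    it = iter(iterable)
    while True:
        try:
            first = next(it)
        except StopIteration:
            return  # source exhausted: no empty batch is emitted

        def batch(first=first):
            yield first
            for _ in range(size - 1):
                try:
                    yield next(it)
                except StopIteration:
                    return

        yield batch()

print([list(b) for b in ibatch_gen(range(9), 3)])
# [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```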

I hope this class is useful for someone.