Nidelven IT - All about Python, Zope & Plone - and Open Source!

Here you'll find issues related to our services. Mostly about Python, Zope and Plone, as well as hosting-related issues.

"Keeping IT real"






Older entries



Atom - Subscribe - Categories
Previous | Next

Discovering sets. . What? That fast?

There has been some dialogue on the posts on this blog, especially when it comes to Python, ways of coding and speed.

Speed has never been a big deal for me when I've worked with Python, it has been fast enough so the trade-off between programmer time costs and hardware costs has been an easy one.

However, as I look around to learn more about Python I've blogged about it and gotten amongst other, a comment about Python and sets.

Sets looked nice to me, to do unions, intersections and other things that are sometimes necessary.

However, I did not expect to find what I did find when running some speed tests..

If you run this code in a module called time_sets.py:

list_ = [1,2,3,4,5,6,7, 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
set_ = set(list_)
tuple_ = tuple(list_)

import timeit
repeat = range(10000000)

def in_sequence(sequence, item, repeat=repeat):
    for x in repeat:
        item in sequence

if __name__ == '__main__':
    print 3 in set_, 'j' in set_

    runner = timeit.Timer(setup='import time_sets', stmt='time_sets.in_sequence(time_sets.list_, 3)')
    print runner.timeit(number=1)
    runner = timeit.Timer(setup='import time_sets', stmt='time_sets.in_sequence(time_sets.tuple_, 3)')
    print runner.timeit(number=1)
    runner = timeit.Timer(setup='import time_sets', stmt='time_sets.in_sequence(time_sets.set_, 3)')
    print runner.timeit(number=1)

    runner = timeit.Timer(setup='import time_sets', stmt='time_sets.in_sequence(time_sets.list_, "j")')
    print runner.timeit(number=1)
    runner = timeit.Timer(setup='import time_sets', stmt='time_sets.in_sequence(time_sets.tuple_, "j")')
    print runner.timeit(number=1)
    runner = timeit.Timer(setup='import time_sets', stmt='time_sets.in_sequence(time_sets.set_, "j")')
    print runner.timeit(number=1)

Results:

True True
1.43136191368
1.42777395248
0.926907062531
10.305590868
10.6528999805
0.954577922821

To test speed of "if item is in sequence" operations, you'll see it took about 10% of the time with sets compared to lists and tuples. I expected tuples to be faster because they are immutable, but that lists and tuples were over 10 times slower than sets when testing for whether an item was in sequence I never expected. And it looks like to me that the difference in speed will be greater the bigger the sequence is.

Whoa.



[Permalink] [By morphex] [Python-only or > 0.5 Python related (Atom feed)] [2012 06 Aug 12:27 GMT+2]

Comments

hash tables

By: anonymous

this is because sets are basically hashtables. checking for containment of an item doesn't require scanning the entire collection, it just hashes the given value and checks for that bucket.

no suprise

By: Uwe

See http://stackoverflow.com/questions/3949310/how-is-cpythons-set-implemented

Well Daaa...

By: Yoav Glazner

sets are implemented as a Hash Map and so have O(1) lookup while list/tuple have a O(n) lookup...

OK

By: Morten W. Petersen

Yes I think I've read a bit about hashes and thought that could be part of the explanation. Still the discussion on SO for example is a bit too deep for me, but it is nice to know that sets exists and that they can be so much faster than lists and tuples.

Some hash links

By: Morten W. Petersen

I found these searching on Google:

http://en.wikipedia.org/wiki/Hash_table

http://en.wikipedia.org/wiki/Hash_function

Which describe the big and small picture with hashes. At least I now know that hash tables use buckets and that if there is a collision between the hash value of two items the code will do a comparison with each item in the bucket. :)

Data structures

By: Joel N

A quick introduction to basic container efficiencies:

First, we should define some operations on a container c:
1. traversal: for el in c: pass
2. random access: c[key]
3. containment: el in c
4. append: c.append(el)
5. random insertion: c.insert(ind, el); or c[key] = el; or c.add(key)
6. element deletion: del c[key]

Now we can tabulate the O-notation efficiencies for each of these operations:
* lists and tuples: 1=O(n) 2=O(1) 3=O(n) 4=O(1) 5=O(n) 6=O(n)
* dicts and sets: 1=O(n) 2=O(1) 3=O(1) 4=N/A 5=O(1) 6=O(1)

O(n) roughly means the operation will take time proportional to the number of items in the container.
O(1) means the operation will take about the same time irrespective of the size of the container.

So sets and dicts are faster for element insertion, deletion and containment. Their main disadvantage is that they do not maintain order, and require unique keys. Implementations of ordered or multi-valued dict structures may only partially retain their efficiency benefits. Note that sets and dicts take up more memory than lists.

These are theoretical results which may guide your choice of data structure for a particular purpose. In practice there are some other considerations to do with memory, caching.

O(k)

By: Morten W. Petersen

Ah yes, I get it. O(n) and O(1). That was informative. :)

From what I've gleaned, for example a large sequence of items that has to be searched multiple times for containment of something would be a good candidate for sets. And I've learned over time that greater size often means greater speed..

Regarding a multi-valued dictionary or, rather set.. Wouldn't that simply be making an entry in a bucket into some sort of table or list? In the case that you need to count occurrences of value x in a dictionary..



Add comment (text format)

Passphrase

A passphrase is required to comment on this weblog. It is required to make sure that bots aren't doing automatic spamming. It is: nit is the best!.

Title

Name

Email

Comment