Traits wrapper for pandas

Traits wrapper for pandas

Adam Hughes
Hey everyone.  I noticed that EPD comes with pandas, and I have recently been intrigued by this library. Has there ever been talk of converting pandas' main data structures to trait objects?  If not, how difficult would this be? I have already created plot objects for handling series data, and the main problems with these come back to managing the data and label arrays consistently. Since pandas is optimized for speed and so is Chaco, a simple traits wrapper for pandas would probably be the fastest interactive series data analysis option out there.  I have some groundwork towards this end without the incorporation of pandas data structures.  Has anyone ever thought of trying this?

Sorry for spellin and grammar, working from a phone
_______________________________________________
Enthought-Dev mailing list
[hidden email]
https://mail.enthought.com/mailman/listinfo/enthought-dev

Re: Traits wrapper for pandas

Jaidev Deshpande
Hi Adam,

This sounds really interesting.

> I have already created plot objects for handling series data and the main
> problems with these come back to managing the data and label arrays
> consistently.

Can I have a look at your code?

Thanks

Re: Traits wrapper for pandas

Adam Hughes
Hi Jaidev,

I'm actually out of town until mid next week.  When I get back, I'll try to send you an example.  The code is currently in a non-working state but I will get you something out of it to illustrate the concepts.

On Fri, May 11, 2012 at 11:56 PM, Jaidev Deshpande <[hidden email]> wrote:
> Hi Adam,
>
> This sounds really interesting.
>
>> I have already created plot objects for handling series data and the main
>> problems with these come back to managing the data and label arrays
>> consistently.
>
> Can I have a look at your code?
>
> Thanks

Re: Traits wrapper for pandas

Jonathan Rocher
Sorry Adam, for the delay in response. I think your idea is very interesting (I have gone through some of the same steps recently). Here are a few thoughts after talking internally a bit:

1. The simplest way to do this today is with Instance. Something like:

In [1]: from traits.api import HasTraits, Instance
In [2]: from pandas import Series
In [4]: from numpy.random import randn
In [7]: s = Series(randn(10))

In [11]: class myclass(HasTraits):
   ....:     p = Instance(Series)
   ....:    
In [12]: m = myclass()

# This creates no error:
In [13]: m.p = s
In [14]: m.p
Out[14]:
0   -1.297089
1    0.296848
2   -0.507491
3    1.237853
4    0.312502
5   -1.066873
6   -0.476689
7   -0.155301
8    0.827380
9    0.383985

2. The advantage of creating a real trait beyond the Instance would be to gain more validation beyond the class of the object (such as the dtype for numpy arrays given to an Array). You can do this easily in your own code by creating a validate method on that class. For more details see:
http://docs.enthought.com/traits/traits_user_manual/custom.html#trait-subclassing
In particular, it would make sense to have a validation on the number of dimensions, and the dtypes. We should think about what else.

3. Because people are starting to play quite a bit with Pandas, wrapping it into a predefined trait could make sense in the near future. In fact the first obvious place to use it would actually be inside Chaco itself. If more people feel like this could be useful we can implement it.

Jonathan

On Sat, May 12, 2012 at 2:01 PM, Adam Hughes <[hidden email]> wrote:
> Hi Jaidev,
>
> I'm actually out of town until mid next week.  When I get back, I'll try to send you an example.  The code is currently in a non-working state but I will get you something out of it to illustrate the concepts.






--
Jonathan Rocher, PhD
Scientific software developer
Enthought, Inc.
[hidden email]
1-512-536-1057
http://www.enthought.com




Re: Traits wrapper for pandas

Adam Hughes
On Wed, May 16, 2012 at 3:11 PM, Jonathan Rocher <[hidden email]> wrote:
> Sorry Adam, for the delay in response. I think your idea is very interesting (I have gone through some of the same steps recently). Here are a few thoughts after talking internally a bit:

Thanks Jonathan.  I've also posted this to the pandas mailing list to see what they think.
 

> 1. The trivial way to do this already now is by using Instance.

Very interesting!  I didn't realize that Instance took non-Trait classes as inputs; always thought the input class of Instance had to be a pre-built trait type.  This is very good to know.
 


> 2. The advantage of creating a real trait beyond the Instance would be to gain more validation beyond the class of the object (such as the dtype for numpy arrays given to an Array). You can do this easily in your own code by creating a validate method on that class. For more details see:
> http://docs.enthought.com/traits/traits_user_manual/custom.html#trait-subclassing
> In particular, it would make sense to have a validation on the number of dimensions, and the dtypes. We should think about what else.

I see.  If the pandas data structures already handle the majority of bad input scenarios, maybe the validation step in a Trait object could be lax, at least for now.  Of course, I'm a novice on this so I'll have to check out the link you sent to get an idea of what validations we'd like to program.  What I'm most concerned about right now is the listeners.  Will the Instance class build its own listeners so I can monitor changes in the Series object?  That would be nice because the data and its labels will change together, so there's presumably no need to listen to these composite pieces separately.
 

> 3. Because people are starting to play quite a bit with Pandas, wrapping it into a predefined trait could make sense in the near future. In fact the first obvious place to use it would actually be inside Chaco itself. If more people feel like this could be useful we can implement it.

Indeed.  For full integration, it's probably necessary to build some structures that connect the labels/data from the pandas objects to various aspects of the Chaco plot.  For now, I think what I'm going to do is simply use a basic Plot object, and then update the Plot via set_data (and also update the labels) when changes in the Series are triggered.  Down the road, maybe these listeners could be incorporated directly into Chaco objects (I believe this is what you're suggesting, right?), but it certainly will work this way if we can reliably listen to the pandas data.  I'll try to throw a prototype together for the list and pass it on.

~Adam
 


Re: Traits wrapper for pandas

Corran Webster
Hi Adam,

a quick comment on one thing below:

On Wed, May 16, 2012 at 2:55 PM, Adam Hughes <[hidden email]> wrote:
> I see.  If the pandas data structures already handle the majority of bad input scenarios, maybe the validation step in a Trait object could be lax, at least for now.  Of course, I'm a novice on this so I'll have to check out the link you sent to get an idea of what validations we'd like to program.  What I'm most concerned about right now is the listeners.  Will the Instance class build its own listeners so I can monitor changes in the Series object?  That would be nice because the data and its labels will change together, so there's presumably no need to listen to these composite pieces separately.

Because Pandas objects are not HasTraits subclasses, Traits has no way of listening to changes to internal state of the Pandas objects, so the only thing you can listen to right now is if the entire object is replaced.

When you push hard enough with objects like numpy arrays and Pandas objects, you run into the fact that they can contain data allocated in C, possibly by code that is completely unaware of the existence of Python. Values can therefore change inside an array or Pandas data structure without Python being aware of it, let alone Traits. The classic example is a memory-mapped array: any process that modifies the file it is mapped to can change the contents of the array at any time, with no notification to anyone.

If you look closely at Traits, you'll see that Traits has its own subclasses of the basic Python container types (list, dict, set, etc.) which are traits-aware and emit appropriate traits events when their items change.  In 99% of cases this is fine, and in 1% of cases the fact that you aren't using a "real" list bites you.  Because these objects are allocated and controlled by Python, this approach will work except in the most pathological cases.

So if you were prepared to do a bit of work, you could subclass the main Pandas datatypes with your own version which emits Traits events, and write a Trait which silently switches out the Pandas object for your own, the same way that the List trait does.  But even if you were to do this, there is no way of knowing if something outside of Python (or even another Python object sharing the same memory) is changing your data out from underneath you at the C level.
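To make the limitation concrete, here is a small standard-library sketch, not Traits or Pandas code. WatchedArray is a hypothetical wrapper that records an event whenever an item is set through it, roughly in the spirit of a notifying container. A memoryview over the same buffer stands in for "another Python object sharing the same memory": it changes the data and the wrapper never hears about it.

```python
from array import array


class WatchedArray:
    """Hypothetical wrapper that records an event for each item it sets."""

    def __init__(self, values):
        self._data = array("d", values)  # contiguous C doubles
        self.events = []

    def __getitem__(self, index):
        return self._data[index]

    def __setitem__(self, index, value):
        old = self._data[index]
        self._data[index] = value
        # Only writes that go through this method are ever recorded.
        self.events.append((index, old, value))

    def raw_view(self):
        # Exposes the underlying buffer, much as numpy views share memory.
        return memoryview(self._data)


watched = WatchedArray([1.0, 2.0, 3.0])
watched[0] = 10.0            # goes through __setitem__, so it is recorded

view = watched.raw_view()
view[1] = 99.0               # writes straight into the shared buffer

print(watched.events)        # only the first write was seen
print(watched[1])            # yet the data has changed
```

The second write changes the value the wrapper returns, but no event is recorded for it, which is the same situation a Traits-aware Pandas subclass would be in.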

-- Corran




Re: Traits wrapper for pandas

Adam Hughes
Thanks Corran.  I had feared that something along these lines may be an issue.  Although I would be interested in making my own Traits subclasses, I don't think I have the experience to do something like that.  And I would totally be clueless about having any control at the C-level.  I read that pandas does have a lot of optimizations at the C-level, so I really wouldn't know what's going on there.  How does the Array trait avoid these issues, since numpy is also optimized at the C code level?

On Wed, May 16, 2012 at 4:11 PM, Corran Webster <[hidden email]> wrote:
> So if you were prepared to do a bit of work, you could subclass the main Pandas datatypes with your own version which emits Traits events, and write a Trait which silently switches out the Pandas object for your own, the same way that the List trait does.  But even if you were to do this, there is no way of knowing if something outside of Python (or even another Python object sharing the same memory) is changing your data out from underneath you at the C level.


Re: Traits wrapper for pandas

Corran Webster
Hi Adam,

The Array trait can't avoid these issues, and so doesn't even attempt to listen to changes in the underlying memory - that's why you can't set a listener on 'myarray_items' or similar for an Array trait.

-- Corran

On Wed, May 16, 2012 at 3:19 PM, Adam Hughes <[hidden email]> wrote:
> Thanks Corran.  I had feared that something along these lines may be an issue.  Although I would be interested in making my own Traits subclasses, I don't think I have the experience to do something like that.  And I would totally be clueless about having any control at the C-level.  I read that pandas does have a lot of optimizations at the C-level, so I really wouldn't know what's going on there.  How does the Array trait avoid these issues, since numpy is also optimized at the C code level?


Re: Traits wrapper for pandas

Adam Hughes
Ah I see.  Well, I'll read over the document Jonathan sent me and see how feasible it would be to make a HasTraits subclass.  If there are any developers that would be interested in helping and actually have spare time, that would be great :)

On Wed, May 16, 2012 at 4:23 PM, Corran Webster <[hidden email]> wrote:
> Hi Adam,
>
> The Array trait can't avoid these issues, and so doesn't even attempt to listen to changes in the underlying memory - that's why you can't set a listener on 'myarray_items' or similar for an Array trait.
>
> -- Corran

