Chaco/pandas compatibility glitch?


Adam Hughes
Hi,

I've been working with Chaco and pandas together and ran into the following issue.

If I pass numpy arrays to the set_data() method of a Chaco PlotData source, it works splendidly, even if these arrays are rows/columns of a pandas DataFrame.  This is no surprise, since the rows/columns of a pandas DataFrame are numpy arrays.  For example, I have an ArrayDataSource keyed by a timestamp string.

test=self.plotdata.get_data('1970-01-16 226:25:59')
print type(test), test.shape
<type 'numpy.ndarray'> (2048,)
print test
[   0.    212.9   213.9  ...,  225.97  228.73  224.67]

Since I am working with spectral data, it turns out that the index labels of the DataFrame are in fact data that I want to plot.  Therefore, I called set_data() and piped in the dataframe.index.values array.  This is also a numpy array of floats, so I thought it would plot with no problem.  The data looks almost identical.

id=self.plotdata.get_data('Index')
print type(id), id.shape
<type 'numpy.ndarray'> (2048,)
print id
[339.09 339.48 339.86 ..., 1023.08 1023.36 1023.65]

However, there is one major subtlety, which my IDE picks up:

test
array([   0.  ,  212.9 ,  213.9 , ...,  225.97,  228.73,  224.67])
id
array([339.09, 339.48, 339.86, ..., 1023.08, 1023.36, 1023.65], dtype=object)

For now I have found a workaround: I can flush away the object behavior by converting the values from array to list and back to array.
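Concretely, the round-trip looks something like this (a minimal sketch with a made-up DataFrame, not the real data; a direct astype(float) cast does the same job):

import numpy as np
import pandas as pd

# Made-up spectral data: wavelengths as the index, one column of counts.
df = pd.DataFrame({'counts': [0.0, 212.9, 213.9]},
                  index=[339.09, 339.48, 339.86])

vals = df.index.values
print(vals.dtype)     # object on the pandas version in this thread; newer pandas may give float64

# Round-tripping through a plain Python list makes numpy re-infer a float dtype.
clean = np.array(list(vals))
print(clean.dtype)    # float64

# An equivalent, more direct cast:
clean2 = vals.astype(float)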

If one doesn't do this, there is a failure when plot.plot() calls the DataRange1D method _refresh_bounds.  In particular,

mins, maxes = zip(*bounds_list)

returns full arrays rather than floats (see below).  Maybe this has to do with how zip() interacts with pandas Index objects?

zip(*bounds_list)
[(array([[339.09, 339.48, 339.86, ..., 1023.08, 1023.36, 1023.65]], dtype=object),),
 (array([[339.09, 339.48, 339.86, ..., 1023.08, 1023.36, 1023.65]], dtype=object),)]

Instead of what one would expect, e.g.:

(0.0,) (1442.9300000000001,)
(0.0,) (1442.9300000000001,)
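For reference, here is a tiny standalone sketch (with made-up bounds, not taken from Chaco) of what that zip(*bounds_list) line produces when every data source reports a scalar (min, max) pair:

# Hypothetical bounds, as two well-behaved data sources would report them.
bounds_list = [(0.0, 1442.93), (339.09, 1023.65)]

mins, maxes = zip(*bounds_list)
print(mins)    # (0.0, 339.09)
print(maxes)   # (1442.93, 1023.65)

# If a source hands back whole arrays instead of scalars, mins and maxes
# end up holding arrays, which is the failure shown above.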

This was a difficult bug to track down (thank you, Wing), so I wanted to report it for anyone else who may work on this type of thing.

--
Stay thirsty my friends.


_______________________________________________
Enthought-Dev mailing list
[hidden email]
https://mail.enthought.com/mailman/listinfo/enthought-dev

Re: Chaco/pandas compatibility glitch?

Robert Kern
On Tue, Oct 16, 2012 at 12:19 AM, Adam Hughes <[hidden email]> wrote:

> [...]
> If one doesn't do this, there is a failure when plot.plot() calls the
> DataRange1D method, _refresh_bounds.  In particular
>
> mins, maxes = zip(*bounds_list)
>
> returns full arrays rather than floats (see below).  Maybe this has to do
> with compatibility of zip() with pandas index objects?

bounds_list is populated by data_source.get_bounds() for each
data_source that is attached to the DataRange1D. These are your own
subclasses of AbstractDataSource, right? I expect that you are
returning something wrong from that method.
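One quick way to see what each source is reporting is to feed both a plain float array and the object-dtype index values into a stock ArrayDataSource and compare the results; a minimal sketch, assuming Chaco's ArrayDataSource and its (min, max) get_bounds() contract:

import numpy as np
from chaco.api import ArrayDataSource

# The same float values, once as a normal float array and once boxed as objects.
float_data = np.array([339.09, 339.48, 1023.65])
object_data = np.array([339.09, 339.48, 1023.65], dtype=object)

for data in (float_data, object_data):
    ds = ArrayDataSource(data)
    # get_bounds() should return a (min, max) pair of scalars; printing it for
    # both inputs shows whether the object dtype is what breaks DataRange1D.
    print(data.dtype, ds.get_bounds())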

--
Robert Kern
Enthought

Re: Chaco/pandas compatibility glitch?

Adam Hughes


On Tue, Oct 16, 2012 at 6:17 AM, Robert Kern <[hidden email]> wrote:
bounds_list is populated by data_source.get_bounds() for each
data_source that is attached to the DataRange1D. These are your own
subclasses of AbstractDataSource, right? I expect that you are
returning something wrong from that method.

I had thought so as well.  The issue does seem to be that the get_bounds() method gets hung up if a pandas Index object is passed in instead of a numpy array.  Maybe this is not surprising: even though a pandas Index is supposed to behave like a numpy array, it was causing issues in the get_bounds() function.

I just had to take the Index, convert it to a list, then back to a numpy array to drop all the other Index functionality.  This workaround is sensible enough that I probably won't concern myself with why exactly this is happening, and will just be aware of it.

--
Stay thirsty my friends.



Re: Chaco/pandas compatibility glitch?

Robert Kern
On Tue, Oct 16, 2012 at 7:26 PM, Adam Hughes <[hidden email]> wrote:

> [...]
> I had thought so as well.  The issue does seem to be that the get_bounds()
> method gets hung up, if a Pandas Index object is passed in instead of a
> numpy array.  Maybe this is not surprising.  Even though Pandas Index are to
> behave like numpy arrays, they were causing issues in the get_bounds()
> function.
>
> I just had to take the Index, convert to list, then back to a numpy array to
> drop all other Index functionality.  This workaround is sensible enough that
> I probably won't concern myself on why exactly this is happening, and will
> just be aware of it.

I'm not really sure what you are doing, but you just need to implement
get_bounds() correctly for your underlying data. Are you trying to
reuse ArrayDataSource as-is? In any case, to convert an Index to a
numpy array, all you need to do is use np.asarray().

--
Robert Kern
Enthought

Re: Chaco/pandas compatibility glitch?

Adam Hughes


On Wed, Oct 17, 2012 at 6:14 AM, Robert Kern <[hidden email]> wrote:
I'm not really sure what you are doing, but you just need to implement
get_bounds() correctly for your underlying data. Are you trying to
reuse ArrayDataSource as-is? In any case, to convert an Index to a
numpy array, all you need to do is use np.asarray().

I am using ArrayDataSource unchanged, yes.  I've been using np.asarray() to convert the Index, although my editor seems to catch a discrepancy between the return of np.asarray(index) and np.asarray(list(index)):

index=dataframe.index
index
Index([339.09, 339.48, 339.86, ..., 1023.08, 1023.36, 1023.65], dtype=object)

np.asarray(index)
array([339.09, 339.48, 339.86, ..., 1023.08, 1023.36, 1023.65], dtype=object)

np.asarray(list(index))
array([  339.09,   339.48,   339.86, ...,  1023.08,  1023.36,  1023.65])

And for whatever reason, this does seem to cause get_bounds() to trip up.

It isn't actually a big deal, so I don't want to waste your time on it.  When I finish what I've been working on, I'll post some source code, and if this problem creeps back in the future, I'll try to address it then.

Thanks for your help, Robert.

--
Stay thirsty my friends.



Re: Chaco/pandas compatibility glitch?

Pietro Berkes
Hi Adam,

Please give the following a try:

np.asarray(index, dtype=float)

This should do the type conversion without passing through an inefficient conversion to a list.
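For example, with a hypothetical object-dtype index holding float values (mirroring the output shown earlier in this thread):

import numpy as np
import pandas as pd

# Hypothetical index whose float values are stored with an object dtype.
index = pd.Index([339.09, 339.48, 339.86, 1023.65], dtype=object)

print(np.asarray(index).dtype)               # object
print(np.asarray(index, dtype=float).dtype)  # float64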

I'm really not sure why dataframe.index returns an array of type 'object'. Could you please send the output of

dataframe.index

and

dataframe.index.dtype

?

Thank you,
Pietro



--
Pietro Berkes
Scientific software developer
Enthought UK


