[WIP] pandas migration #1347

sstanovnik · 2016-06-17T15:03:49Z

This is the giant ~~(well, WIP)~~ pull request for converting the existing architecture to pandas.
See this wiki page and this Google Sheet for comments on more general architectural decisions.

So far, the majority of the new API is done, but few functions actually work and none have been tested. This will break until I start writing tests and modifying existing code. This currently still does not pass tests, but the Table by itself is completely functional, as are some learners. Now in progress: making sure everything works.

Over 95 % of all base widgets now work! ~~Most unit tests pass, the ones that don't are tied to sparse matrices and SQL support. Up next: (re)adding sparse Table and SQL Table support.~~ All unit tests now pass ~~(not including widget tests)~~!

Multi-type sparse support depends on pandas 0.19.0 (unreleased). That release will fix a lot of sparse issues (some were even fixed by me =D).

Todos and progress, along with code examples, are tracked in the Google Sheet.

To check out and properly use this right now, you have to build pandas from its git master so that you have the necessary sparse changes. If you don't, some sparse features may not work correctly.

Linked pull requests: biolab/orange3-text#97

kernc · 2016-06-21T16:03:24Z

Orange/data/table.py

+    @property
+    def columns_X(self):
+        """A read-only list of X column variables."""
+        return [c for c in self.columns if c in self._columns_X]


I don't think this is OK. If I only need "X", I should get X as a subset dataframe and then query its columns. Like:

    only_data = table.columns_subset(Role.DATA)
    columns_X = only_data.columns

Never mind, this is ok. I, too, would rather have:

    columns_X = data.columns_X
    subset_X = data[columns_X]

than

    subset_X = data.subset_X
    columns_X = subset_X.columns

Or not, I don't know.

What are some pros and cons for either approach?

I think we'll use a variant of the first approach. Getting the columns through Domain (@astaric and I had a talk today about Domain, I'll fill you in tomorrow), then selecting the data through those columns.
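To make the two options in this thread concrete, here is a minimal sketch that uses a toy stand-in for Table (a plain dict of columns, not the real Orange or pandas classes). `ToyTable`, the role strings, the column names and the string argument to `columns_subset` are all assumptions made up for illustration; only the `columns_X` property mirrors the diff above.

```python
class ToyTable:
    """Toy stand-in for Table: column name -> list of values, plus a role per column."""

    def __init__(self, data, roles):
        self._data = dict(data)      # insertion order doubles as column order
        self._roles = dict(roles)    # hypothetical role tags: "x", "y" or "meta"

    @property
    def columns(self):
        return list(self._data)

    @property
    def columns_X(self):
        # Same idea as the property in the diff: filter columns by role, keeping order.
        return [c for c in self.columns if self._roles[c] == "x"]

    def __getitem__(self, names):
        # Select a sub-table from a list of column names.
        return ToyTable({n: self._data[n] for n in names},
                        {n: self._roles[n] for n in names})

    def columns_subset(self, role):
        # Hypothetical helper for the second approach: subset by role first.
        return self[[c for c in self.columns if self._roles[c] == role]]


table = ToyTable(
    {"age": [31, 45], "height": [1.7, 1.8], "diagnosis": ["a", "b"]},
    {"age": "x", "height": "x", "diagnosis": "y"},
)

# Approach 1: ask for the column names first, then select the data.
columns_X = table.columns_X
subset_X = table[columns_X]

# Approach 2: take the subset first, then read its columns.
only_data = table.columns_subset("x")
```

Both routes end up with the same columns; the difference is whether role information is queried as names on the full table or baked into a dedicated subset accessor.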

xiaoerlaigeid · 2017-02-08T17:20:30Z

Hi, I am a university student from China. My major is computer science. I want to participate in GSoC 2017. I am keen on machine learning and Python. Could you tell me something about the Orange program? I want to contribute to it.
Thank you!

AlexS12 · 2017-05-19T08:42:04Z

I have been having a look at Orange3 lately and find this approach (of using pandas dataframes internally) really interesting. However, I see that there has been little activity lately. Is this something that you still want to merge, or has development died due to some major incompatibility? Are there any specific things I could help with to get this pull request finished?

kernc · 2017-05-19T08:52:03Z

I've continued development on a local, not yet pushed, branch. It's stalled for the moment, but it was progressing quite nicely, and I expect to have something to show quite soon!

AlexS12 · 2017-05-19T13:54:18Z

Oh! That's great! I am looking forward to having a look at it =) Thanks for such a quick response! I will subscribe to the pull req!

astaric · 2017-12-20T10:57:54Z

As this PR touches almost every file in the repository and has not been updated for more than a year, it is highly unlikely that it will be merged. We are still interested in porting the Table to pandas, but it will have to be done in a more gradual way.

If anyone is interested in working on this, I would suggest the following steps:

1. Add new methods to Table object that match pandas interface (empty(), .iterrows(), ...)
2. Modify the code to call the new methods while deprecating the old ones (code calls empty(), bool becomes deprecated, ...)
3. Figure out if Table can be ported to pandas without modifying more than 10 files, if not, return to 1. :)

Each "pandas compatible" method should be added in a separate pull request, as that eases reviewing and raises the chances of the PR actually being merged. I would also split steps 1 and 2 into two PRs, as step 1 should only touch a single file, and rebasing it should be much easier than rebasing the step 2 modifications to all the places where the code is used.
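A rough sketch of what steps 1 and 2 could look like for a single method, using a toy class rather than the real Orange Table. `CompatTable` and its row storage are hypothetical, and reading "bool becomes deprecated" as deprecating `__bool__` is an assumption. Note that pandas exposes `empty` as a property, so the sketch does the same.

```python
import warnings


class CompatTable:
    """Toy table (not the real Orange class) that stores rows as a list of lists."""

    def __init__(self, rows):
        self._rows = list(rows)

    # Step 1: add the new, pandas-style API.
    @property
    def empty(self):
        # True when the table has no rows, like DataFrame.empty.
        return len(self._rows) == 0

    def iterrows(self):
        # Yield (index, row) pairs, mirroring the shape of DataFrame.iterrows().
        for index, row in enumerate(self._rows):
            yield index, row

    # Step 2: the old spelling keeps working but tells callers to migrate.
    def __bool__(self):
        warnings.warn(
            "bool(table) is deprecated; use `not table.empty` instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return not self.empty


table = CompatTable([[1, 2], [3, 4]])
rows = list(table.iterrows())

# Record warnings so the deprecation can be observed instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_result = bool(table)
```

Keeping the old behaviour behind a `DeprecationWarning` lets the step 1 PR stay confined to one file, while call sites are migrated separately in step 2.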

vanatteveldt · 2018-01-09T19:25:09Z

I would also love to see pandas integration!

I think @astaric's roadmap makes sense, but I suspect that @kernc and @sstanovnik have the most experience with trying to integrate pandas and Orange. I would love to contribute if there is a feasible path to integration with official support from the biolab team!

ZoraizQ · 2020-07-15T23:54:08Z

Updates?

janezd · 2020-07-16T09:07:27Z

It's still on a would-be-really-nice-to-have list, but nobody is actively working on it.

sstanovnik added easy labels Jun 17, 2016

kernc reviewed Jun 21, 2016
sstanovnik added 20 commits August 26, 2016 11:42

Remove shuffle in favour of .sample(frac=1).

fc9d867

Consolidate usages of the _transferer hack.

b576a7e

Comments and tests to setUpClass, other test fixes.

d0a03dd

Add time component awareness to TimeVariable.

118a8d1

Fix a failing doctest.

a70e2c5

Improve time column display with month and day.

8dbc1c7

Add a pandas git build to travis.

b22e0d8

Some general fixes, report test fixes.

98b745f

Requirements.txt requires a different requirement format.

50e0989

Further improvements to the documentation.

b561d2f

Revert 68b18c5: overriding __iter__.

27c596f

Simplify weight assignment.

6ea711a

Cherry-pick: sstanovnik/orange3:benches.

06bcc77

Weight setting robustness.

2bfca35

Properer sparse handling.

6460ab0

Always convert weights to floats on assignment.

259bfc1

Fix visualizing continuous variables in Data Table.

ac75022

Closes #1518.

Significantly improve feature constructor performance.

3317ff0

Mends #1519.

Domain editor fix and file reader hardening.

fc48858

Fix domain editor in the file widget when non-string discrete values. Also file reader hardening. Xref #1471.

Fix a failing owkmeans test.

3e6030f

sstanovnik mentioned this pull request Aug 26, 2016

Feature request: Way to convert Pandas dataframe to Table object #68

Closed

astaric closed this Dec 20, 2017
