Tag Archives: Planet SciPy

Summer school announcement: 2nd Advanced Scientific Programming in Python (ASPP) Asia Pacific!

The Advanced Scientific Programming in Python (ASPP) summer school has had 10 successful iterations in Europe and one iteration here in Melbourne earlier this year. Another European iteration is starting next week in Camerino, Italy.

Now, thanks to the generous sponsorship of CSIRO, and the efforts of Benjamin Schwessinger and Genevieve Buckley, two alumni from the Melbourne school, and Kerensa McElroy, Agriculture Data School Coordinator at CSIRO, the Asia Pacific fork of ASPP gets its second iteration in Canberra, Jan 20-27, 2019.

Key details

  • The workshop runs January 20-27, 2019 at the Australian National University in Canberra, Australia.
  • topics include git, contributing to open source software with github, testing, debugging, profiling, advanced NumPy, Cython, and data visualisation.
  • hands-on learning using pair programming
  • free to attend (but students are responsible for travel, accommodation, and meals)
  • 30 student places, to be selected competitively
  • application deadline is Oct 7, 2018, 23:59 Anywhere On Earth
  • website: https://scipy-school.org
  • FAQ: https://scipy-school.org/faq
  • apply: https://scipy-school.org/applications

Background

Three years ago, I had the privilege of teaching the 2015 ASPP school in Munich. It turned out to be a fantastic teaching experience (I have taught in 2 more since), and more importantly, it was a fantastic experience for the students. Students are selected for the school to fit a certain profile, neither too novice nor too advanced. As such, participants selected for the school are almost guaranteed to learn a great deal.

Indeed, almost every iteration of the school has been co-organised by former students. Sure enough, with the help of two students from the Melbourne instance, we will be able to have a new iteration in Canberra this January.

Course description

Scientists spend increasingly more time writing, maintaining, and debugging software. While techniques for doing this efficiently have evolved, only few scientists have been trained to use them. As a result, instead of doing their research, they spend far too much time writing deficient code and reinventing the wheel. In this course we will present a selection of advanced programming techniques and best practices that are standard in industry, but especially tailored to the needs of a programming scientist. Lectures are devised to be interactive and to give the students enough time to acquire direct hands-on experience with the materials. Students will work in pairs throughout the school and will team up to practice the newly learned skills in a real programming project — an entertaining computer game.

We use the Python programming language for the entire course. Python works as a simple programming language for beginners, but more importantly, it also works great in scientific simulations and data analysis. We show how clean language design, ease of extensibility, and the great wealth of open source libraries for scientific computing and data visualization are driving Python to becoming a standard tool for scientists.

Who is eligible?

This school is targeted at Master/PhD students, postdocs, and academic staff and technicians from all areas of science. Competence in Python or in another language such as Java, C/C++, MATLAB, or Mathematica is absolutely required. Basic knowledge of Python and of a version control system such as git, subversion, mercurial, or bazaar is assumed. Participants without any prior experience with Python and/or git should work through the proposed introductory material before the course.

We have strived to get a pool of students that is international and gender-balanced, and have succeeded, with gender parity in the last five schools.

More questions

If you have any questions, contact info@scipy-school.org.

Please circulate this announcement widely! And follow @scipyschool for further developments.

Juan.

The road to scikit-image 1.0

This is the first in a series of posts about the joint scikit-image, scikit-learn, and dask sprint that took place at the Berkeley Insitute of Data Science, May 28-Jun 1, 2018.

In addition to the dask and scikit-learn teams, the sprint brought together three core developers of scikit-image (Emmanuelle Gouillart, Stéfan van der Walt, and myself), and two newer contributors, Kira Evans and Mark Harfouche. Since we are rarely in the same timezone, let alone in the same room, we took the opportunity to discuss some high level goals using a framework suggested by Tracy Teal (via Chris Holdgraf): Vision, Mission, Values. I’ll try do Chris’s explanation of these ideas justice:

  • Vision: what are we trying to achieve? What is the future that we are trying to bring about?
  • Mission: what are we going to do about it? This is the plan needed to make the vision a reality.
  • Values: what are we willing to do, and not willing to do, to complete our mission?

So, on the basis of this framework, I’d like to review where scikit-image is now, where I think it needs to go, and the ideas that Emma, Stéfan, and I came up with during the sprint to get scikit-image there.

I will point out, from the beginning, that one of our values is that we are community-driven, and this is not a wishy-washy concept. (More below.) Therefore this blog post constitutes only a preliminary document, designed to kick-start an official roadmap for scikit-image 1.0 with more than a blank canvas. The roadmap will be debated on GitHub and the mailing list, open to discussion by anyone, and when completed will appear on our webpage. This post is not the roadmap.

Part one: where we are

scikit-image is a tremendously successful project that I feel very proud to have been a part of until now. I still cherish the email I got from Stéfan inviting me to join the core team. (Five years ago now!)

Like many open source projects, though, we are threatened by our own success, with feature requests and bug reports piling on faster than we can get through them. And, because we grew organically, with no governance model, it is often difficult to resolve thorny questions about API design, what gets included in the library, and how to deprecate old functionality. Discussion usually stalls before any decision is taken, resulting in a process heavily biased towards inaction. Many issues and PRs languish for years, resulting in a double loss for the project: a smaller loss from losing the PR, and a bigger one from losing a potential contributor that understandably has lost interest.

Possibly the most impactful decision that we took at the BIDS sprint is that at least three core developers will video once a month to discuss stalled issues and PRs. (The logistics are still being worked out.) We hope that this sustained commitment will move many PRs and issues forward much faster than they have until now.

Part two: where we’re going

Onto the framework. What are the vision, mission, and values of scikit-image? How will these help guide the decisions that we make daily and in our dev meetings?

Our vision

We want scikit-image to be the reference image processing and analysis library for science in Python. In one sense I think that we are already there, but there are more than enough remaining warts that they might cause the motivated user to go looking elsewhere. The vision, then, is to increase our customer satisfaction fraction in this space to something approaching 1.0.

Our mission

How do we get there? Here is our mission:

  • Our library must be easily re-usable. This means that we will be careful in adding new dependencies, and possibly cull some existing ones, or make them optional. We also want to remove some of the bigger test datasets from our package, which at 24MB is getting rather unwieldy! (By comparison, Python 3.7 is 16MB.) (Props to Josh Warner for noticing this.)
  • It also means providing a consistent API. This means that conceptually identical function arguments, such as images, label images, and arguments defining whether an input image is grayscale, should have the same name across various the library. We’ve made great strides in this goal thanks to Egor Panfilov and Adrian Sieber, but we still have some way to go.
  • We want to ensure accuracy of our algorithms. This means comprehensive testing, even against external libraries, and engaging experts in relevant fields to audit our code. (Though this of course is a challenge!)
  • Show utmost care with users’ data. Not that we haven’t cared until now, but there are places in scikit-image where too much responsibility (in my view) rests with the user, with insufficient transparency from our functions for new users to predict what will happen to their data. For example, we are quite liberal with how we deal with input data: it gets rescaled whenever we need to change the type, for example from unsigned 8-bit integers (uint8) to floating point. Although we have good technical reasons for doing this, and rather extensive documentation about it, these conversions are the source of much user confusion. We are aiming to improve this in issue 3009. Likewise, we don’t handle image metadata at all. What is the physical extent of the input image? What is the range and units of the data points in the image? What do the different channels represent? These are all important questions in scientific images, but until now we have completely abdicated responsibility in them and simply ignore any metadata. I don’t think this is tenable for a scientific imaging library. We don’t have a good answer for how we will do it, but I consider this a must-solve before we can call ourselves 1.0.

Our values

Finally, how do we solve the thorny questions of API design, whether to include algorithms, etc? Here are our values:

  • We used the word “reference” in our vision. This phrasing is significant. It means that we value elegant implementations, that are easy to understand for newcomers, over obtaining every last ounce of speed. This value is a useful guide in reviewing pull requests. We will prefer a 20% slowdown when it reduces the lines of code two-fold.
  • We also used the word science in our vision. This means our aim is to serve scientific applications, and not, for example, image editing in the vein of Photoshop or GIMP. Having said this, we value being part of diverse scientific fields. (One of the first citations of the scikit-image paper was a remote sensing paper, to our delight: none of the core developers work in that field!)
  • We are inclusive. From my first contributions to the project, I have received patient mentorship from Stéfan, Emmanuelle, Johannes Schönberger, Andy Mueller, and others. (Indeed, I am still learning from fellow contributors, as seen here, to show just one example.) We will continue to welcome and mentor newcomers to the Scientific Python ecosystem who are making their first contribution.
  • Both of the above points have a corrolary: we require excellent documentation, in the form of usage examples, docstrings documenting the API of each function, and comments explaining tricky parts of the code. This requirement has stalled a few PRs in the past, but this is something that our monthly meetings will specifically address.
  • We don’t do magic. We use NumPy arrays instead of fancy façade objects that mask their complexity. We prefer to educate our users over making decisions on their behalf (through quality documentation, particularly in docstrings).
  • We are community-driven, which means that decisions about the API and features will be driven by our users’ requirements, and not the whims of the core team. (For example, I would happily curry all of our functions, but that would be confusing to most users, so I suffer in silence. =P)

I hope that the above values are uncontroversial in the scikit-image core team. (I myself used to fall heavily on the pro-magic side, but hard experience with this library has shown me the error of my ways.) I also hope, but more hesitantly, that our much wider community of users will also see these values as, well, valuable.

As I mentioned above, I hope this blog post will spawn a discussion involving both the core team and the wider community, and that this discussion can be distilled into a public roadmap for scikit-image.

Part three: scikit-image 1.0

I have deliberately left out new features off the mission, except for metadata handling. The library will never be “feature complete”. But we can develop a stable and consistent enough API that adding new features will almost never require breaking it.

For completeness, I’ll compile my personal pet list of things I will attempt to work on or be particularly excited about other people working on. This is not part of the roadmap, it’s part of my roadmap.

  • Near-complete support for n-dimensional data. I want 2D-only functions to become the exception in the library, maybe so much so that we are forced to add a _2d suffix to the function name.
  • Typing support. I never want to move from simple arrays as our base data type, but I want a way to systematically distinguish between plain images, label images, coordinate lists, and other types, in a way that is accessible to automatic tools.
  • Basic image registration functionality.
  • Evaluation algorithms for all parts of the library (such as segmentation, or keypoint matching).

The human side

Along with articulating the way we see the project, another key part of getting to 1.0 is supporting existing maintainers, and onboarding new ones. It is clear that the project is currently straining under the weight of its popularity. While we solve one issue, three more are opened, and two pull requests.

In the past, we have been too hesitant to invite new members to the core team, because it is difficult to tell whether a new contributor shares your vision. Our roadmap document is an important step towards rectifying this, because it clarifies where the library is going, and therefore the decision making process when it comes to accepting new contributions, for example.

In a followup to this post, I aim to propose a maintainer onboarding document, in a similar vein, to make sure that new maintainers all share the same process when evaluating new PRs and communicating with contributors. A governance model is also in the works, by which I mean that Stéfan has been wanting to establish one for years and now Emmanuelle and I are onboard with this plan, and I hope others will be too, and now we just need to decide on the damn thing.

I hope that all of these changes will allows us to reach the scikit-image 1.0 milestone sooner rather than later, and that everyone reading this is as excited about it as I was while we hashed this plan together.

As a reminder, this is not our final roadmap, nor our final vision/mission statement. Please comment on the corresponding GitHub issue for this post if you have thoughts and suggestions! (You can also use the mailing list, and we will soon provide a way to submit anonymous comments, too.) As a community, we will come together to create the library we all want to use and contribute to.

As a reminder, everything in this blog is CC0+BY, so feel free to reuse any or all of it in your own projects! And I want to thank BIDS, and specifically Nelle Varoquaux at BIDS, for making this discussion possible, among many other things that will be written up in upcoming posts.

Update: Anonymous comments are now open at https://pollev.com/juannunezigl611. To summarise, to comment on this proposal you can:

What do scientists know about open source?

A friend recently pointed out this great talk by Matt Bernius, What students know and don’t know about open source. If you have even a minor interest in open source it’s worth a watch, but the gist is: in the US alone, there are about 200,000 students enrolled in a computer science major. Open source communities are a great space to learn real-world programming, so why don’t these numbers translate into massive contributions to open source?

At the core of the issue, Matt identifies two main problems: (1) colleges and universities simply don’t teach open source, or even collaborative coding; and (2), many open source communities make newcomers feel unwelcome in a variety of ways.

I want to comment about this in the context of programming in science. That is, programming where the code is not the main product, but rather a useful tool to obtain a scientific result, for example in biology or physics. Here, we still see relatively little contribution to open source, for related but different cultural issues.

I’ve sent my scientists should code in the open post to a few people and the response from most remains sceptical. I hope this post will address some of their concerns.

Scientific culture is ridiculously secretive

The most common objection is to my assertion that people won’t scoop you by looking at your code. I remember a tweet (that I sadly can’t find now) that really got to the gist of the problem. It went something like this:

Someone in science having a new idea: “Ooh, I hope I don’t get scooped!”
Someone in open source having a new idea: “Ooh, I hope someone has implemented this already!”

Update: I found the source! It’s this tweet by Elizabeth Seiver.

This is a huge gap in culture that won’t soon go away, but there are encouraging steps towards narrowing it. For example, PLOS Biology, a leading journal, recently announced that they would consider “scooped” studies for publication within six months of the “scooping”. That goes some way towards re-aligning incentives towards collaborative and open science.
olga botvinic
I’ve come across many collaborations that have started because of open source. I have not heard of someone getting scooped because of open source, but of course that sort of information would be hard to trace and come by. Several people did write to me that they were concerned about very specific groups rifling through their code expressly for the purpose of scooping them. For me it’s hard to imagine someone even having that attitude, and my advice is that if you do face such a toxic community, it might be wise to change your chosen field of study.

Nevertheless, I want to emphasise here that open source programming can take many forms, with the zip file attached to the paper being the lowest, coding in the open being the highest, and several other models in between. Any steps you can take towards the higher models will ultimately help you. My preferred mode for code that really does have to be private is to use a private GitHub repository, and just make that repo public once the paper is accepted.

A lot of people prefer the “code dump with no revision history” model of post-publication sharing, but this tosses out a lot of valuable information for people coming after you: what have you tried that didn’t work? What issues did you have with the code? Have you considered coding in one style or another? The code dump model also makes you less likely to use GitHub in the first place, depriving you of an opportunity to learn some valuable real-world skills.

For coding, scientists have even more severe impostor syndrome

As I mentioned in my original post, and this I find completely uncontestable, publishing shitty code is not a bad thing. Everybody writes bad code, and nearly everybody knows it. Here’s Hadley Wickham, creator of dplyr, tidy data, ggplot2, among other things; in other words, someone who knows a thing or two about elegant code and about as close as one gets to coding royalty in science:

The only way to write great code is to write lots of shitty code first.

Publishing your raw code is a good thing and will absolutely not be a black mark on your career. Indeed, in open source circles, it is often a bare GitHub contribution history that is a black mark. (And this is another problem, but in my opinion a better one.)

Scientists don’t know about open source

If knowledge of open source is lacking in computer science, what chance does it have in other fields? The truth is that outreach and education need to become a massive part of open source culture, especially in science.

I credit Stéfan van der Walt for my life in open source. After I gave a talk at SciPy 2012, he invited me to join the scikit-image sprint at the end of the conference. If it hadn’t been for that, I probably would have just wandered around the hall, too shy to join any sprint (see “impostor syndrome”, above), and my life would be very different right now.

Anyway, at that point I’d made my code “open source”, which meant it was on GitHub. I had only added a license to submit to the conference. As a reminder, unlicensed code doesn’t count as open source. But I had never really collaborated in open source. My idea of collaboration was my workflow with my colleague: a single branch (master), from which we both pulled and to which we both pushed. When I sat down with Stéfan and Tony Yu, and I figured what I wanted to work on, I asked: “So, should I just push to master, or what?” I still remember, with some embarrassment, the dubious look Stéfan and Tony exchanged, as they silently figured out which of them would introduce this newbie to pull requests.

But that’s the thing: I shouldn’t feel embarrassment. Scientists for the most part don’t get introduced to coding in their education, much less to open source.

What can scientists in open source do?

A lesson from my continued contributions to the SciPy ecosystem, I hope, is that some light mentorship can yield enormous dividends later on. Stéfan and Tony took the time to walk me through the open source contribution process, when they could have dismissively sent me a link to some page explaining it. I’m a big fan of writing good documents for newcomers, but nothing beats a good hand-holding. It’s very easy for me to imagine an alternate reality where I had not felt welcome or rewarded by the scikit-image project and my life had not taken this productive turn.

Continuing on imaginary themes, it is only slightly less plausible that the open source scientific world should be awash with new contributors at every level of science. How do we turn this dream into a reality?

If you are a scientist and this post is among your first encounters with the term “open source”, and you think you might be interested in learning more, here are a few things I recommend, in order of easiest to hardest:

  • Read the preface and epilogue of my book with Stéfan and Harriet Dashnow. (Free online!) I feel a bit icky recommending my own book, but why repeat myself? In those chapters I tried to distill my thoughts on joining the SciPy community, which is a fantastic, rewarding space in which to do open source programming as a scientist. I expect many things we wrote generalise well to e.g. the tidyverse.
  • Look for upcoming software carpentry workshops near you. These are free two-day programming boot camps to introduce you to computational thinking, and, crucially, to version control with git.
  • Go to a SciPy conference. I know of SciPy, EuroSciPy, and SciPy India, but I have a vague memory of offshoots in Africa and South America.

If you are in a boat similar to mine (intermediate/advanced open source contributor in science), and you feel like you would like your work to feel a bit more crowded, I can tell you what I’m going to be doing in response to this talk:

  • Sign up to deliver (more) software carpentry training (or similar). Getting the word out is the number one thing.
  • In software carpentry, emphasise the role of git in collaboration. (I think the official program does not go far enough in this direction, and focuses instead on the initial linear history.)
  • If you are located in a university, talk to your CS department to see whether they have any courses in open source development. If not, see whether you can guest lecture in a suitable course to make students aware of the open source opportunities out there.
  • Similarly, follow up software carpentry with more advanced sessions on open source collaboration. I gained an enormous fraction of my programming skills from collaborating on open source. I really think there is no better tool for long-term learning in this space. An idea that I’d like to try out is to curate a bunch of open issues on prominent repos and get SWC students to sprint on them for a day1. I know about the “good first issue” tag on GitHub. Unfortunately, my experience with it is mixed. I think many repos are overly optimistic with theirs (this includes scikit-image), and, furthermore, a large proportion of these tagged issues get “claimed” quickly — and often half-heartedly!
  • Write, write, write! Did you get a cool PR merged? Write a blog post about it! Or at least tweet! We need to get the message out that writing PRs is for everyone. =)

If you have any further ideas, I’d love to hear them.


  1. Actually I drafted this post a while back, and tried this yesterday, with mixed success. I’ll write about that experience soon. ;) 

Summer School Announcement: ASPP Asia-Pacific 2018

The Advanced Scientific Programming in Python (ASPP) summer school has had 10 extremely successful iterations in Europe. (You can find past materials, schedules, and student evaluations at https://python.g-node.org/archives.) Now, thanks to the INCF, we will be holding its first iteration in Australia, to cater to the Asia Pacific region. (Note: the original ASPP will still take place in Europe next Northern summer; this is a fork of that school.)

Key details

  • The workshop runs January 14-21 at the Melbourne Brain Centre, University of Melbourne, Australia
  • topics include: git, contributing to open source software with github, testing, debugging, profiling, advanced NumPy, Cython, data visualisation.
  • hands-on learning using pair programming
  • free to attend (but students are responsible for travel, accommodation, and meals)
  • 30 student places, to be selected competitively
  • application deadline is Oct 31, 2017, 23:59 UTC.
  • website: https://melbournebioinformatics.org.au/aspp-asia-pacific
  • apply: https://melbournebioinformatics.org.au/aspp-asia-pacific/applications (make sure you read the FAQ on that page)

Background

Two-and-a-bit years ago, Tiziano Zito asked me if I could join the faculty at the 2015 ASPP school in Munich (then in its 8th iteration). It turned out to be a fantastic teaching experience, and, more importantly, it was a fantastic experience for the students. Students selected for the school fit a certain profile, neither novice nor advanced. As such, you can be sure that if you participate in the school, you will learn a great deal. We teach tools that will immediately improve your scientific practice. I decided that I wanted to replicate the school in Australia. Now it is finally here!

Course outline

Scientists spend increasingly more time writing, maintaining, and debugging software. While techniques for doing this efficiently have evolved, only few scientists have been trained to use them. As a result, instead of doing their research, they spend far too much time writing deficient code and reinventing the wheel. In this course we will present a selection of advanced programming techniques and best practices that are standard in industry, but especially tailored to the needs of a programming scientist. Lectures are devised to be interactive and to give the students enough time to acquire direct hands-on experience with the materials. Students will work in pairs throughout the school and will team up to practice the newly learned skills in a real programming project — an entertaining computer game.

We use the Python programming language for the entire course. Python works as a simple programming language for beginners, but more importantly, it also works great in scientific simulations and data analysis. We show how clean language design, ease of extensibility, and the great wealth of open source libraries for scientific computing and data visualization are driving Python to becoming a standard tool for scientists.

Who is eligible?

This school is targeted at Master/PhD students and postdocs from all areas of science. Competence in Python or in another language such as Java, C/C++, MATLAB, or Mathematica is absolutely required. Basic knowledge of Python and of a version control system such as git, subversion, mercurial, or bazaar is assumed. Participants without any prior experience with Python and/or git should work through the proposed introductory material before the course.

We have strived to get a pool of students that is international and gender-balanced, and have succeeded, with gender parity in the last four schools.

More questions

If you have any questions, contact aspp@melbournebioinformatics.org.au.

Please circulate this announcement widely!

Thanks,

Juan.

Prettier LowLevelCallables with Numba JIT and decorators

In my recent post, I extolled the virtues of SciPy 0.19’s LowLevelCallable. I did lament, however, that for generic_filter, the LowLevelCallable interface is a good deal uglier than the standard function interface. In the latter, you merely need to provide a function that takes the values within a pixel neighbourhood, and outputs a single value — an arbitrary function of the input values. That is a Wholesome and Good filter function, the way God intended.

In contrast, a LowLevelCallable takes the following signature:

int callback(double *buffer, intptr_t filter_size, 
             double *return_value, void *user_data)

That’s not very Pythonic at all. In fact, it’s positively Conic (TM). For those that don’t know, pointers are evil, so let’s aim to avoid their use.

“But Juan!”, you are no doubt exclaiming. “Juan! Didn’t you just tell us how to use pointers in Numba cfuncs, and tell us how great it was because it was so fast?”

Indeed I did. But it left a bad taste in my mouth. Although I felt that the tradeoff was worth it for such a phenomenal speed boost (300x!), I was unsatisfied. So I started immediately to look for a tidier solution. One that would let me write proper filter functions while still taking advantage of LowLevelCallables.

It turns out Numba cfuncs can call Numba jitted functions, so, with a little bit of decorator magic, it’s now ludicrously easy to write performant callables for SciPy using just pure Python. (If you don’t know what Numba JIT is, read my earlier post.) As in the last post, let’s look at grey_erosion as a baseline benchmark:

>>> import numpy as np
>>> footprint = np.array([[0, 1, 0],
...                       [1, 1, 1],
...                       [0, 1, 0]], dtype=bool)
>>> from scipy import ndimage as ndi
>>>
>>> %timeit ndi.grey_erosion(image, footprint=fp)
1 loop, best of 3: 160 ms per loop

Now, we write a decorator that uses Numba jit and Numba cfunc to make a LowLevelCallable suitable for passing directly into generic_filter:

>>> import numba
>>> from numba import cfunc, carray
>>> from numba.types import intc, CPointer, float64, intp, voidptr
>>> from scipy import LowLevelCallable
>>>
>>> def jit_filter_function(filter_function):
...     jitted_function = numba.jit(filter_function, nopython=True)
...     @cfunc(intc(CPointer(float64), intp, CPointer(float64), voidptr))
...     def wrapped(values_ptr, len_values, result, data):
...         values = carray(values_ptr, (len_values,), dtype=float64)
...         result[0] = jitted_function(values)
...         return 1
...     return LowLevelCallable(wrapped.ctypes)

If you haven’t seen decorators before, read this primer from Real Python. To summarise, we’ve written a function that takes as input a Python function, and outputs a LowLevelCallable. Here’s how to use it:

>>> @jit_filter_function
... def fmin(values):
...     result = np.inf
...     for v in values:
...         if v < result:
...             result = v
...     return result

As you can see, the fmin function definition looks just like a normal Python function. All the magic happens when we attach our @jit_filter_function decorator to the top of the function. Let’s see it in action:

>>> %timeit ndi.generic_filter(image, fmin, footprint=fp)
10 loops, best of 3: 92.9 ms per loop

Wow! numba.jit is actually over 70% faster than grey_erosion or the plain cfunc approach!

In case you want to use this, I’ve made a package available on PyPI, so you can actually pip install it right now with pip install llc (for low-level callable), and then:

>>> from llc import jit_filter_function

The source code is on GitHub. Currently it only covers ndi.generic_filter‘s signature, and only with Numba, but I hope to gradually expand it to cover all the functions that take LowLevelCallables in SciPy, as well as support Cython. Pull requests are welcome!

SciPy’s new LowLevelCallable is a game-changer

… and combines rather well with that other game-changing library I like, Numba.

I’ve lamented before that function calls are expensive in Python, and that this severely hampers many functions that should be insanely useful, such as SciPy’s ndimage.generic_filter.

To illustrate this, let’s look at image erosion, which is the replacement of each pixel in an image by the minimum of its neighbourhood. ndimage has a fast C implementation, which serves as a perfect benchmark against the generic version, using a generic filter with min as the operator. Let’s start with a 2048 x 2048 random image:

>>> import numpy as np
>>> image = np.random.random((2048, 2048))

and a neighbourhood “footprint” that picks out the pixels to the left and right, and above and below, the centre pixel:

>>> footprint = np.array([[0, 1, 0],
...                       [1, 1, 1],
...                       [0, 1, 0]], dtype=bool)

Now, we measure the speed of grey_erosion and generic_filter. Spoiler alert: it’s not pretty.

>>> from scipy import ndimage as ndi
>>> %timeit ndi.grey_erosion(image, footprint=footprint)
10 loops, best of 3: 118 ms per loop
>>> %timeit ndi.generic_filter(image, np.min, footprint=footprint)
1 loop, best of 3: 27 s per loop

As you can see, with Python functions, generic_filter is unusable for anything but the tiniest of images.

A few months ago, I was trying to get around this by using Numba-compiled functions, but the way to feed C functions to SciPy was different depending on which part of the library you were using. scipy.integrate used ctypes, while scipy.ndimage used PyCObjects or PyCapsules, depending on your Python version, and Numba only supported the former method at the time. (Plus, this topic starts to stretch my understanding of low-level Python, so I felt there wasn’t much I could do about it.)

Enter this pull request to SciPy from Pauli Virtanen, which is live in the most recent SciPy version, 0.19. It unifies all C-function interfaces within SciPy, and Numba already supports this format. It takes a bit of gymnastics, but it works! It really works!

(By the way, the release is full of little gold nuggets. If you use SciPy at all, the release notes are well worth a read.)

First, we need to define a C function of the appropriate signature. Now, you might think this is the same as the Python signature, taking in an array of values and returning a single value, but that would be too easy! Instead, we have to go back to some C-style programming with pointers and array sizes. From the generic_filter documentation:

This function also accepts low-level callback functions with one of the following signatures and wrapped in scipy.LowLevelCallable:

int callback(double *buffer, npy_intp filter_size, 
             double *return_value, void *user_data)
int callback(double *buffer, intptr_t filter_size, 
             double *return_value, void *user_data)

The calling function iterates over the elements of the input and output arrays, calling the callback function at each element. The elements within the footprint of the filter at the current element are passed through the buffer parameter, and the number of elements within the footprint through filter_size. The calculated value is returned in return_value. user_data is the data pointer provided to scipy.LowLevelCallable as-is.

The callback function must return an integer error status that is zero if something went wrong and one otherwise.

(Let’s leave aside that crazy reversal of Unix convention of the past 50 years in the last paragraph, except to note that our function must return 1 or it will be killed.)

So, we need a Numba cfunc that takes in:

  • a double pointer pointing to the values within the footprint,
  • a pointer-sized integer that specifies the number of values in the footprint,
  • a double pointer for the result, and
  • a void pointer, which could point to additional parameters, but which we can ignore for now.

The Numba type names are listed in this page. Unfortunately, at the time of writing, there’s no mention of how to make pointers there, but finding such a reference was not too hard. (Incidentally, it would make a good contribution to Numba’s documentation to add CPointer to the Numba types page.)

So, armed with all that documentation, and after much trial and error, I was finally ready to write that C callable:

>>> from numba import cfunc, carray
>>> from numba.types import intc, intp, float64, voidptr
>>> from numba.types import CPointer
>>> 
>>> 
>>> @cfunc(intc(CPointer(float64), intp,
...             CPointer(float64), voidptr))
... def nbmin(values_ptr, len_values, result, data):
...     values = carray(values_ptr, (len_values,), dtype=float64)
...     result[0] = np.inf
...     for v in values:
...         if v < result[0]:
...             result[0] = v
...     return 1

The only other tricky bits I had to watch out for while writing that function were as follows:

  • remembering that there’s two ways to de-reference a pointer in C: *ptr, which is not valid Python and thus not valid Numba, and ptr[0]. So, to place the result at the given double pointer, we use the latter syntax. (If you prefer to use Cython, the same rule applies.)
  • Creating an array out of the values_ptr and len_values variables, as shown here. That’s what enables the for v in values Python-style access to the array.

Ok, so now what you’ve been waiting for. How did we do? First, to recap, the original benchmarks:

>>> %timeit ndi.grey_erosion(image, footprint=footprint)
10 loops, best of 3: 118 ms per loop
>>> %timeit ndi.generic_filter(image, np.min, footprint=footprint)
1 loop, best of 3: 27 s per loop

And now, with our new Numba cfunc:

>>> %timeit ndi.generic_filter(image, LowLevelCallable(nbmin.ctypes), footprint=footprint)
10 loops, best of 3: 113 ms per loop

That’s right: it’s even marginally faster than the pure C version! I almost cried when I ran that.


Higher-order functions, ie functions that take other functions as input, enable powerful, concise, elegant expressions of various algorithms. Unfortunately, these have been hampered in Python for large-scale data processing because of Python’s function call overhead. SciPy’s latest update goes a long way towards redressing this.

Numba in the real world

Numba is a just-in-time compiler (JIT) for Python code focused on NumPy arrays and scientific Python. I’ve seen various tutorials around the web and in conferences, but I have yet to see someone use Numba “in the wild”. In the past few months, I’ve been using Numba in my own code, and I recently released my first real package using Numba, skan. The short version is that Numba is amazing and you should strongly consider it to speed up your scientific Python bottlenecks. Read on for the longer version.

Part 1: some toy examples

Let me illustrate what Numba is good for with the most basic example: adding two arrays together. You’ve probably seen similar examples around the web.

We start by defining a pure Python function for iterating over a pair of arrays and adding them:

In [1]:
import numpy as np


def addarr(x, y):
    result = np.zeros_like(x)
    for i in range(x.size):
        result[i] = x[i] + y[i]
    return result
How long does this take in pure Python?

In [2]:
n = int(1e6)
a = np.random.rand(n)
b = np.random.rand(n)
In [3]:
%timeit -r 1 -n 1 addarr(a, b)
1 loop, best of 1: 721 ms per loop
About half a second on my machine. Let’s try with Numba using its JIT decorator:

In [4]:
import numba

addarr_nb = numba.jit(addarr)
In [5]:
%timeit -r 1 -n 1 addarr_nb(a, b)
1 loop, best of 1: 283 ms per loop
The first time it runs, it’s only a tiny bit faster. That’s because of the nature of JITs: they only compile code as it is being run, in order to use object type information of the objects passed into the function. (Note that, in Python, the arguments a and b to addarr could be anything: an array, as expected, but also a list, a tuple, even a Banana, if you’ve defined such a class, and the meaning of the function body is different for each of those types.)

Let’s see what happens the next time we run it:

In [6]:
%timeit -r 1 -n 1 addarr_nb(a, b)
1 loop, best of 1: 6.36 ms per loop
Whoa! Now the code takes 5ms, about 100 times faster than the pure Python version. And the NumPy equivalent?

In [7]:
%timeit -r 1 -n 1 a + b
1 loop, best of 1: 5.62 ms per loop
Only marginally faster than Numba, even though NumPy addition is implemented in highly optimised C code. And, for some data types, Numba even beats NumPy:

In [8]:
r = np.random.randint(0, 128, size=n).astype(np.uint8)
s = np.random.randint(0, 128, size=n).astype(np.uint8)
In [9]:
%timeit -r 1 -n 1 r + s
1 loop, best of 1: 2.92 ms per loop
In [10]:
%timeit -r 1 -n 1 addarr_nb(r, s)
1 loop, best of 1: 238 ms per loop
In [11]:
%timeit -r 1 -n 1 addarr_nb(r, s)
1 loop, best of 1: 234 µs per loop
WOW! For smaller data types, Numba beats NumPy by over 10x!

I’m only speculating, but since my clock speed is about 1GHz (I’m writing this on a base Macbook with a 1.1GHz Core-m processor), I suspect that Numba is taking advantage of some SIMD capabilities of the processor, whereas NumPy is treating each array element as an individual arithmetic operation. (If any Numba or NumPy devs are reading this and have more concrete implementation details that explain this, please share them in the comments!)

So hopefully I’ve got your attention now. For years, NumPy has been the go-to library for performance Python in scientific computing. But, if you wanted to do something a little out of the ordinary, you were stuck. Now, Numba generally matches that for arbitrary code and sometimes beats it handily!

In this context, I decided to use Numba to do something a little less trivial, as part of my research.

Part 2: Real Numba

I’ll present below a slightly simplified version of the code present in my library, skan, which is currently available on PyPI and conda-forge. The task is to build an graph out of the pixels of a skleton image, like this one:

In [12]:
%matplotlib inline
In [13]:
import matplotlib.pyplot as plt
plt.rcParams['image.cmap'] = 'gray'
plt.rcParams['image.interpolation'] = 'nearest'
In [14]:
skeleton = np.array([[0, 1, 0, 0, 0, 1, 1],
                     [0, 0, 1, 1, 1, 0, 0],
                     [0, 1, 0, 0, 0, 1, 0],
                     [0, 0, 1, 0, 1, 0, 0],
                     [1, 1, 0, 1, 0, 0, 0]], dtype=bool)
skeleton = np.pad(skeleton, pad_width=1, mode='constant')
In [15]:
fig, ax = plt.subplots(figsize=(5, 5))
ax.imshow(skeleton)
ax.set_title('Skeleton')
ax.set_xlabel('col')
ax.set_ylabel('row')
Out[15]:

Every white pixel in the image will be a node in our graph, and we place edges between nodes if the pixels are next to each other (counting diagonals). A natural way to represent a graph in the SciPy world is as a sparse matrix A: we number the nonzero pixels from 1 onwards — these are the rows of the matrix — and then place a 1 at entry A(i, j) when pixel i is adjacent to pixel j. SciPy’s sparse.coo_matrix format make it very easy to construct such a matrix: we just need an array with the row coordinates and another with the column coordinates.

Because NumPy arrays are not dynamically resizable like Python lists, it helps to know ahead of time how many edges we are going to need to put in our row and column arrays. Thankfully, a well-known theorem of graph theory states that the number of edges of a graph is half the sum of the degrees. In our case, because we want to add the edges twice (once from i to j and once from j to i, we just need the sum of the degrees exactly. We can find this out with a convolution using scipy.ndimage:

In [16]:
from scipy import ndimage as ndi

neighbors = np.array([[1, 1, 1],
                      [1, 0, 1],
                      [1, 1, 1]])

degrees = ndi.convolve(skeleton.astype(int), neighbors) * skeleton
In [17]:
fig, ax = plt.subplots(figsize=(5, 5))
result = ax.imshow(degrees, cmap='magma')
ax.set_title('Skeleton, colored by node degree')
ax.set_xlabel('col')
ax.set_ylabel('row')
cbar = fig.colorbar(result, ax=ax, shrink=0.7)
cbar.set_ticks([0, 1, 2, 3])
cbar.set_label('degree')
There you can see “tips” of the skeleton, with only 1 neighbouring pixel, as purple, “paths”, with 2 neighbours, as red, and “junctions”, with 3 neighbors, as yellow.

Now, consider the pixel at position (1, 6). It has two neighbours (as indicated by its colour): (2, 5) and (1, 7). If we number the nonzero pixels as 1, 2, …, n from left to right and top to bottom, then this pixel has label 2, and its neighbours have labels 6 and 3. We therefore need to add edges (2, 3) and (2, 6) to the graph. Similarly, when we consider pixel 6, we will add edges (6, 5), (6, 3), and (6, 8).

In [18]:
fig, ax = plt.subplots(figsize=(5, 5))
result = ax.imshow(degrees, cmap='magma')
cbar = fig.colorbar(result, ax=ax, shrink=0.7)
cbar.set_ticks([0, 1, 2, 3])

nnz = len(np.flatnonzero(degrees))
pixel_labels = np.arange(nnz) + 1
for lab, y, x in zip(pixel_labels, *np.nonzero(degrees)):
    ax.text(x, y, lab, horizontalalignment='center',
            verticalalignment='center')

ax.set_xlabel('col')
ax.set_ylabel('row')
ax.set_title('Skeleton, with pixel IDs')
cbar.set_label('degree')
Scanning over the whole image, we see that we need row and col arrays of length exactly np.sum(degrees).

In [19]:
n_edges = np.sum(degrees)
row = np.empty(n_edges, dtype=np.int32)  # type expected by scipy.sparse
col = np.empty(n_edges, dtype=np.int32)
The final piece of the puzzle is finding neighbours. For this, we need to know a little about how NumPy stores arrays. Even though our array is 2-dimensional (rows and columns), these are all arrayed in a giant line, each row placed one after the other. (This is called “C-order”.) If we index into this linearised array (“raveled”, in NumPy’s language), we can make sure that our code works for 2D, 3D, and even higher-dimensional images. Using this indexing, neighbouring pixels to the left and right are accessed by subtracting or adding 1 to the current index. Neighbouring pixels above and below are accessed by subtracting or adding the length of a whole row. Finally, diagonal neighbours are found by combining these two. For simplicity, we only show the 2D version below:

In [20]:
def neighbour_steps(shape):
    step_sizes = np.cumprod((1,) + shape[-1:0:-1])
    axis_steps = np.array([[-1, -1],
                           [-1,  1],
                           [ 1, -1],
                           [ 1,  1]])
    diag = axis_steps @ step_sizes
    steps = np.concatenate((step_sizes, -step_sizes, diag))
    return steps
In [21]:
steps = neighbour_steps(degrees.shape)
print(steps)
[  1   9  -1  -9 -10   8  -8  10]
Of course, if we use these steps near the right edge of the image, we’ll wrap around, and mistakenly think that the first element of the next row is a neighbouring pixel! Our solution is to only process nonzero pixels, and make sure that we have a 1-pixel-wide “pad” of zero pixels — which we do, in the image above!

Now, we iterate over image pixels, look at neighbors, and populate the row and column vectors.

In [22]:
def build_graph(labeled_pixels, steps_to_neighbours, row, col):
    start = np.max(steps_to_neighbours)
    end = len(labeled_pixels) - start
    elem = 0  # row/col index
    for k in range(start, end):
        i = labeled_pixels[k]
        if i != 0:
            for s in steps:
                neighbour = k + s
                j = labeled_pixels[neighbour]
                if j != 0:
                    row[elem] = i
                    col[elem] = j
                    elem += 1
In [23]:
skeleton_int = np.ravel(skeleton.astype(np.int32))
skeleton_int[np.nonzero(skeleton_int)] = 1 + np.arange(nnz)
In [24]:
%timeit -r 1 -n 1 build_graph(skeleton_int, steps, row, col)
1 loop, best of 1: 917 µs per loop
Now we try the Numba version:

In [25]:
build_graph_nb = numba.jit(build_graph)
In [26]:
%timeit -r 1 -n 1 build_graph_nb(skeleton_int, steps, row, col)
1 loop, best of 1: 346 ms per loop
In [27]:
%timeit -r 1 -n 1 build_graph_nb(skeleton_int, steps, row, col)
1 loop, best of 1: 14.3 µs per loop
Nice! We get more than a 50-fold speedup using Numba, and this operation would have been difficult if not impossible to convert to a NumPy vectorized operation! We can now build our graph:

In [28]:
from scipy import sparse
G = sparse.coo_matrix((np.ones_like(row), (row, col))).tocsr()
As to what to do with said graph, I’ll leave that for another post. (You can also peruse the skan source code.) In the meantime, though, you can visualize it with NetworkX:

In [29]:
import networkx as nx

Gnx = nx.from_scipy_sparse_matrix(G)
Gnx.remove_node(0)

nx.draw_spectral(Gnx, with_labels=True)
There’s our pixel graph! Obviously, the speedup and n-d support are important for bigger, 3D volumes, not for this tiny graph. But they are important, and, thanks to Numba, easy to obtain.

Conclusion

I hope I’ve piqued your interest in Numba and encouraged you to use it in your own projects. I think the future of success of Python in science heavily depends on JITs, and Numba is a strong contender to be the default JIT in this field.

Note:This post was written using Jupyter Notebook. You can find the source notebook here.

Why scientists should code in the open

All too often, I encounter published papers in which the code is “available upon request”, or “available in the supplementary materials” (as a zip file). This is not just poor form. It also hurts your software’s future. (And, in my opinion, when results depend on software, it is inexcusable.)

Given the numerous options for posting code online, there’s just no excuse to give code in a less-than-convenient format, upon publication. When you publish, put your code on Github or Bitbucket.

In this piece, I’ll go even further: put your code there from the beginning. Put your code there as soon as you finish reading this article. Here’s why:

No, you won’t get scooped

Reading code is hard. Ask any experienced programmer: most have trouble reading code they themselves wrote a few months ago, let alone someone else’s code. It’s extremely unlikely that someone will browse your code looking for a scoop. That time is better spent doing research.

It’s never going to be ready

Another thing I hear is that they want to post their code, but they want to clean it up first, and remove all the “embarrassing” bits. Unfortunately, science doesn’t reward time spent “cleaning up” your code, at least not yet. So the sad reality is that you probably will never actually get to the point where you are happy to post your code online.

But here’s the secret: everybody is in that boat with you. That’s why this document exists. I recommend you read it in full, but this segment is particularly important:

When it comes time to empirically evaluate new research with respect to someone else’s prior work, even a rickety implementation of their work can save grad-student-months, or even grad-student-years, of time.

Matt Might himself is as thorough and high-profile as you get in computer science, and yet, he has this to say about code clean-up:

I kept telling myself that I’d clean it all up and release it some day.

I have to be honest with myself: this clean-up is never going to happen.

Your code might not meet your standards, but, believe it or not, your code will help others, and the sooner it’s out there, the sooner they can be helped.

You will gain collaborators and citations

If anyone is going to be rifling through your code, they will probably end up asking for your help. This happens with even the best projects: have a look at the activity on the mailing lists for scikit-learn or NumPy, two of the best-maintained open-source projects out there.

When you have to go back and explain how a piece of code worked, that’s when you will actually take the time and clean it up. In the process, the person you help will be more likely to contribute to your project, either in code or in bug reports, improvement suggestions, or even citations.

In the case of my own gala project, I guess that about half of the citations it received happened because of its open-source code and open mailing list.

Your coding ability will automagically improve

I first heard this one from Albert Cardona. They say sunlight is the best disinfectant, and this is certainly true of code. Just the very idea that anyone can easily read their code will make most people more careful when programming. Over time, this care will become second nature, and you will develop a taste for nice, easy-to-read code.

In short, the alleged downsides of code-sharing are, at best, longshots, while there are many tangible upsides. Put your code out there. (And use a liberal open-source license!)

The cost of a Python function call

I’ve read in various places that the Python function call overhead is very high. As I was parroting this “fact” to Ed Schofield recently, he asked me what the cost of a function actually was. I had no idea. This prompted us to do a few quick benchmarks.

The short version is that it takes about 150ns to call a function in Python (on my laptop). This doesn’t sound like a lot, but it means that you can make at most 6.7 million calls per second, two to three orders of magnitude slower than your processor’s clock speed.

If you want your function to do something, such as, oh, I don’t know, receive an input argument, this goes up to 350ns, throttling you at 2.8 million calls per second.

Benchmarking function calls

I cleaned up Ed’s and my initial experiments to make a small module and timer to measure all these values. You can clone the repo and run python function-calls/timer.py to check the numbers on your machine.

The benchmarks are variations of comparing the execution time of:

for i in range(n):
    pass

and:

def f():
    pass

for i in range(n):
    f()

for some suitably large n. As I mentioned above, that comes out to an absolute minimum of 150ns per function call.

What this means

I’ve been making a fuss over the past year about the excellent Toolz and the way it enables elegant streaming data processing. (See my demo repo and my EuroSciPy talk.) You can read data from a modern SSD at speeds approaching 500MB/s. If you want to stream each byte through Python functions, you’ll instantly lose two orders of magnitude of speed. And, the more functions you use, the slower you’ll go, which discourages functional programming and modularity — the very things I was trying to promote!

In the DNA sequence processing I demo in the talk, I get a throughput of about 0.5MB/s. On one hand, this is kind of OK because we are using effectively zero RAM, so we can just let the code run over lunch. On the other, it’s starting to bug me that 99% of my processor time is spent on Python function calls, rather than on actual data crunching.

This is a problem for Python. To work on seriously big data, you need to drop into a library written in C, such as NumPy or Pandas. You need to do this on a high level: any per-byte or per-data-element processing cannot be in Python, if you don’t want to waste your processor’s cycles. Python’s ecosystem is Insanely Great, so this is mostly fine, but it does limit your ability to research or implement cool new methods using Python.

As an example, the generic_filter function in SciPy’s ndimage package has infinitely many cool uses, but using it to process a 100MB image (which is small in biology) would take 15 seconds in function call overhead alone. Lest you think this is reasonable, SciPy’s greyscale erosion, implemented in C, takes less than 4 seconds on an image that size. A lot of my once-lackadaisical attitude towards Python performance stemmed from not knowing how long things should take. A lot less than they do, it turns out.

What to do about it

As I mentioned, Python’s high performance libraries are many and great. Look hard for optimised libraries that already do what you want. Try to express what you want to do as combinations of functions from NumPy, SciPy, Pandas, scikit-image, scikit-learn, and so on. Minimise the amount of time spent in Python. This is advice that you learn early on in scientific Python programming, but I didn’t appreciate just how important it is.

At some point, that approach will fail, and you will want to do something cute and custom with your data points. Reach for Cython sooner rather than later. As a primer, I recommend Stefan Behnel’s excellent tutorial from EuroSciPy 2015.

There is also Continuum’s Numba, which is sometimes easier to use than Cython. I don’t have any experience with it so I can’t comment much here. However, I’d consider it a very valuable project to implement generic_filter in Numba.
In the long-run, these are all workarounds, and I hope that the Python interpreter itself becomes faster, though there are few signs of that happening.

If you have other ideas on how to get around Python’s function call cost, please let me know in the comments!

My first use of Python 3’s `yield from`!

I never really understood why yield from was useful. Last weekend, I wanted to use Python 3.5’s new os.scandir to explore a directory (and its subdirectories). Tragically, os.scandir is not recursive, and I find os.walks 3-tuple values obnoxious.
Lo and behold, while I was trying to implement a recursive version of scandir, a yield from use just popped right out!

import os
def rscandir(path):
    for entry in os.scandir(path):
        yield entry
        if entry.is_dir():
            yield from rscandir(entry.path)

That’s it! I have to admit that reads wonderfully. The Legacy Python (aka Python 2.x) alternative is quite a bit uglier:

import os
def rscandir(path):
    for p in os.listdir(path):
        yield p
        if os.path.isdir(p):
            for q in rscandir(p):
                yield q

Yuck. So, yet again: time to move away from Legacy Python! ;)