The road to scikit-image 1.0

This is the first in a series of posts about the joint scikit-image, scikit-learn, and dask sprint that took place at the Berkeley Insitute of Data Science, May 28-Jun 1, 2018.

In addition to the dask and scikit-learn teams, the sprint brought together three core developers of scikit-image (Emmanuelle Gouillart, Stéfan van der Walt, and myself), and two newer contributors, Kira Evans and Mark Harfouche. Since we are rarely in the same timezone, let alone in the same room, we took the opportunity to discuss some high level goals using a framework suggested by Tracy Teal (via Chris Holdgraf): Vision, Mission, Values. I’ll try do Chris’s explanation of these ideas justice:

  • Vision: what are we trying to achieve? What is the future that we are trying to bring about?
  • Mission: what are we going to do about it? This is the plan needed to make the vision a reality.
  • Values: what are we willing to do, and not willing to do, to complete our mission?

So, on the basis of this framework, I’d like to review where scikit-image is now, where I think it needs to go, and the ideas that Emma, Stéfan, and I came up with during the sprint to get scikit-image there.

I will point out, from the beginning, that one of our values is that we are community-driven, and this is not a wishy-washy concept. (More below.) Therefore this blog post constitutes only a preliminary document, designed to kick-start an official roadmap for scikit-image 1.0 with more than a blank canvas. The roadmap will be debated on GitHub and the mailing list, open to discussion by anyone, and when completed will appear on our webpage. This post is not the roadmap.

Part one: where we are

scikit-image is a tremendously successful project that I feel very proud to have been a part of until now. I still cherish the email I got from Stéfan inviting me to join the core team. (Five years ago now!)

Like many open source projects, though, we are threatened by our own success, with feature requests and bug reports piling on faster than we can get through them. And, because we grew organically, with no governance model, it is often difficult to resolve thorny questions about API design, what gets included in the library, and how to deprecate old functionality. Discussion usually stalls before any decision is taken, resulting in a process heavily biased towards inaction. Many issues and PRs languish for years, resulting in a double loss for the project: a smaller loss from losing the PR, and a bigger one from losing a potential contributor that understandably has lost interest.

Possibly the most impactful decision that we took at the BIDS sprint is that at least three core developers will video once a month to discuss stalled issues and PRs. (The logistics are still being worked out.) We hope that this sustained commitment will move many PRs and issues forward much faster than they have until now.

Part two: where we’re going

Onto the framework. What are the vision, mission, and values of scikit-image? How will these help guide the decisions that we make daily and in our dev meetings?

Our vision

We want scikit-image to be the reference image processing and analysis library for science in Python. In one sense I think that we are already there, but there are more than enough remaining warts that they might cause the motivated user to go looking elsewhere. The vision, then, is to increase our customer satisfaction fraction in this space to something approaching 1.0.

Our mission

How do we get there? Here is our mission:

  • Our library must be easily re-usable. This means that we will be careful in adding new dependencies, and possibly cull some existing ones, or make them optional. We also want to remove some of the bigger test datasets from our package, which at 24MB is getting rather unwieldy! (By comparison, Python 3.7 is 16MB.) (Props to Josh Warner for noticing this.)
  • It also means providing a consistent API. This means that conceptually identical function arguments, such as images, label images, and arguments defining whether an input image is grayscale, should have the same name across various the library. We’ve made great strides in this goal thanks to Egor Panfilov and Adrian Sieber, but we still have some way to go.
  • We want to ensure accuracy of our algorithms. This means comprehensive testing, even against external libraries, and engaging experts in relevant fields to audit our code. (Though this of course is a challenge!)
  • Show utmost care with users’ data. Not that we haven’t cared until now, but there are places in scikit-image where too much responsibility (in my view) rests with the user, with insufficient transparency from our functions for new users to predict what will happen to their data. For example, we are quite liberal with how we deal with input data: it gets rescaled whenever we need to change the type, for example from unsigned 8-bit integers (uint8) to floating point. Although we have good technical reasons for doing this, and rather extensive documentation about it, these conversions are the source of much user confusion. We are aiming to improve this in issue 3009. Likewise, we don’t handle image metadata at all. What is the physical extent of the input image? What is the range and units of the data points in the image? What do the different channels represent? These are all important questions in scientific images, but until now we have completely abdicated responsibility in them and simply ignore any metadata. I don’t think this is tenable for a scientific imaging library. We don’t have a good answer for how we will do it, but I consider this a must-solve before we can call ourselves 1.0.

Our values

Finally, how do we solve the thorny questions of API design, whether to include algorithms, etc? Here are our values:

  • We used the word “reference” in our vision. This phrasing is significant. It means that we value elegant implementations, that are easy to understand for newcomers, over obtaining every last ounce of speed. This value is a useful guide in reviewing pull requests. We will prefer a 20% slowdown when it reduces the lines of code two-fold.
  • We also used the word science in our vision. This means our aim is to serve scientific applications, and not, for example, image editing in the vein of Photoshop or GIMP. Having said this, we value being part of diverse scientific fields. (One of the first citations of the scikit-image paper was a remote sensing paper, to our delight: none of the core developers work in that field!)
  • We are inclusive. From my first contributions to the project, I have received patient mentorship from Stéfan, Emmanuelle, Johannes Schönberger, Andy Mueller, and others. (Indeed, I am still learning from fellow contributors, as seen here, to show just one example.) We will continue to welcome and mentor newcomers to the Scientific Python ecosystem who are making their first contribution.
  • Both of the above points have a corrolary: we require excellent documentation, in the form of usage examples, docstrings documenting the API of each function, and comments explaining tricky parts of the code. This requirement has stalled a few PRs in the past, but this is something that our monthly meetings will specifically address.
  • We don’t do magic. We use NumPy arrays instead of fancy façade objects that mask their complexity. We prefer to educate our users over making decisions on their behalf (through quality documentation, particularly in docstrings).
  • We are community-driven, which means that decisions about the API and features will be driven by our users’ requirements, and not the whims of the core team. (For example, I would happily curry all of our functions, but that would be confusing to most users, so I suffer in silence. =P)

I hope that the above values are uncontroversial in the scikit-image core team. (I myself used to fall heavily on the pro-magic side, but hard experience with this library has shown me the error of my ways.) I also hope, but more hesitantly, that our much wider community of users will also see these values as, well, valuable.

As I mentioned above, I hope this blog post will spawn a discussion involving both the core team and the wider community, and that this discussion can be distilled into a public roadmap for scikit-image.

Part three: scikit-image 1.0

I have deliberately left out new features off the mission, except for metadata handling. The library will never be “feature complete”. But we can develop a stable and consistent enough API that adding new features will almost never require breaking it.

For completeness, I’ll compile my personal pet list of things I will attempt to work on or be particularly excited about other people working on. This is not part of the roadmap, it’s part of my roadmap.

  • Near-complete support for n-dimensional data. I want 2D-only functions to become the exception in the library, maybe so much so that we are forced to add a _2d suffix to the function name.
  • Typing support. I never want to move from simple arrays as our base data type, but I want a way to systematically distinguish between plain images, label images, coordinate lists, and other types, in a way that is accessible to automatic tools.
  • Basic image registration functionality.
  • Evaluation algorithms for all parts of the library (such as segmentation, or keypoint matching).

The human side

Along with articulating the way we see the project, another key part of getting to 1.0 is supporting existing maintainers, and onboarding new ones. It is clear that the project is currently straining under the weight of its popularity. While we solve one issue, three more are opened, and two pull requests.

In the past, we have been too hesitant to invite new members to the core team, because it is difficult to tell whether a new contributor shares your vision. Our roadmap document is an important step towards rectifying this, because it clarifies where the library is going, and therefore the decision making process when it comes to accepting new contributions, for example.

In a followup to this post, I aim to propose a maintainer onboarding document, in a similar vein, to make sure that new maintainers all share the same process when evaluating new PRs and communicating with contributors. A governance model is also in the works, by which I mean that Stéfan has been wanting to establish one for years and now Emmanuelle and I are onboard with this plan, and I hope others will be too, and now we just need to decide on the damn thing.

I hope that all of these changes will allows us to reach the scikit-image 1.0 milestone sooner rather than later, and that everyone reading this is as excited about it as I was while we hashed this plan together.

As a reminder, this is not our final roadmap, nor our final vision/mission statement. Please comment on the corresponding GitHub issue for this post if you have thoughts and suggestions! (You can also use the mailing list, and we will soon provide a way to submit anonymous comments, too.) As a community, we will come together to create the library we all want to use and contribute to.

As a reminder, everything in this blog is CC0+BY, so feel free to reuse any or all of it in your own projects! And I want to thank BIDS, and specifically Nelle Varoquaux at BIDS, for making this discussion possible, among many other things that will be written up in upcoming posts.

Update: Anonymous comments are now open at https://pollev.com/juannunezigl611. To summarise, to comment on this proposal you can:

3 thoughts on “The road to scikit-image 1.0

  1. Hi Juan,

    Great text! Many thanks to all of you for the effort to make this.

    Few random comments:
    – Recently, we have also an increasing number of PR related to the coding machinery. It’s important, but this machinery is getting more and more complex, hard to follow the evolutions and it will likely increase the step for newcomers (+ myself).
    – I would wish that PR would be more collaborative. Not just two hands improving a PR based on feedbacks. Often, I have the feeling to wind myself as I’m not as good as the other core dev.
    – I wish also to see more “immediate” features like “skimage, give me the information related to the histogram of the image” and not “skimage, take this image, use 256 bins, give me two arrays. Ok, now, matplotlib, please, draw an histogram of this…”. Something more interactive and single shot, which is useful when developing a data analysis from scratch from my jupyter notebook. Automation comes later with more step-wise functions.

    1. Hey François! Thanks for responding so quickly and sorry for the delay in my response. To answer your concerns:

      Regarding your first comment, I hear you 100%! Actually I myself am not very comfortable in this space. I have rarely succeeded in building the docs on my machine, I don’t know what half the lines in the travis or appveyor files are doing, and releases are way too messy. Unfortunately, the machinery is actually super useful and I can’t imagine working without it! And improving the release process involves, you guessed it, more machinery!

      So I agree, we should strive to simplify wherever possible. Another idea is that we should nominate people “in charge” of particular modules, including the CI. Right now “responsibility” is very diffuse, and I think that contributes to many issues and PRs stalling for so long, and many areas of the code being poorly maintained.

      On the other hand, it says something very good about the modularity of scikit-image that both of us have made many contributions without really understanding the CI! =)

      About your second comment, do not feel pressured about your skill levels! I myself am getting schooled on a regular basis in this project, and part of the fun of contributing and even maintaining is the sheer amount of learning it provides! (See for example the recent issue about the incorrect computation of the inertia tensor — a bug that I introduced!) Ideas about improving the PR process are extremely welcome, and would fit very well with our values around inclusivity and mentorship mentioned above.

      I recently used an issue to coordinate a “live” coding session with two people more experienced than me. You should definitely do the same if you feel you need some help with particular implementations or PRs.

      Finally, about higher-level functions, I’m more hesitant about. It might conflict a little with the idea that we “don’t do magic”. However, high-level does not necessarily mean “magic” or “imprecisely determined”. In many cases, such as the image histogram, I think we could shuffle them into a “viewer” package that depends on matplotlib and ties a lot of computation to display.

      At any rate, of course raise issues and PRs for discussion! I agree that something like skimage.viewer.display_histogram would be handy! =)

Leave a comment