GSoC 2021 Ideas Page

Hameer Abbasi edited this page Mar 14, 2021 · 2 revisions

About PyData/Sparse

PyData/Sparse is a software project that provides sparse arrays for the PyData ecosystem, conforming to the NumPy API. That's a lot to digest, so let's break it down:

What is a sparse array?

A sparse array is one in which most of the entries are zero. In this package, we can also treat another kind of array as sparse: one in which most of the entries share the same non-zero value.
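To make both cases concrete, here is a minimal pure-Python sketch. It is not how PyData/Sparse stores data internally; the helper names `to_sparse` and `to_dense` and the `fill_value` parameter are hypothetical, chosen only for illustration. The idea is to store just the entries that differ from the most common value.

```python
def to_sparse(dense, fill_value=0):
    """Keep only the entries that differ from fill_value, keyed by position."""
    return {i: v for i, v in enumerate(dense) if v != fill_value}


def to_dense(stored, length, fill_value=0):
    """Rebuild the full list; every position not stored gets fill_value."""
    return [stored.get(i, fill_value) for i in range(length)]


mostly_zeros = [0, 0, 3, 0, 0, 0, 7, 0]
mostly_ones = [1, 1, 1, 5, 1, 1, 1, 1]

zeros_stored = to_sparse(mostly_zeros)              # common value is 0
ones_stored = to_sparse(mostly_ones, fill_value=1)  # common value is 1

print(zeros_stored)  # {2: 3, 6: 7}
print(ones_stored)   # {3: 5}
```

Both lists have eight entries, but only two and one of them, respectively, need to be stored.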

Why is this important?

Because we don't have infinite memory or computational power, it's important to make the best use of what we have. If we "skip over" the zeros when doing computations, those computations can be a lot faster. In practice, skipping the zeros means keeping track of where they are, which adds some overhead of its own.
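As a rough illustration of that trade-off, the sketch below computes a dot product of two vectors of length one million while touching only the stored entries. It is plain Python rather than PyData/Sparse code, and the function name `sparse_dot` and the dict-based storage are assumptions made for this example.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: value} dicts.

    Only indices stored in both vectors can contribute a non-zero product,
    so the loop runs over the smaller dict and never touches the zeros.
    """
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[i] for i, value in a.items() if i in b)


length = 1_000_000
a = {10: 2.0, 500_000: 3.0}
b = {10: 4.0, 999_999: 5.0}

result = sparse_dot(a, b)    # only index 10 is stored in both: 2.0 * 4.0
stored_numbers = 2 * len(a)  # overhead: one index plus one value per entry
dense_numbers = length       # a dense vector stores every entry

print(result)                         # 8.0
print(stored_numbers, dense_numbers)  # 4 1000000
```

The index kept alongside each value is the "extra overhead" mentioned above: it pays off only when most entries are zero.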

What does "conforms to the NumPy API" even mean?

It means you can use it mostly as you would use NumPy. In fact, if you try it, you'll find that some familiar functions, such as np.max and np.exp, work on arrays provided by this project.

Who uses this package?

A lot of people, actually. Sparse arrays are important in physics and simulations, as well as electron microscopy. If you look at the public dependents, you'll even find some COVID-19 research done with this package.

How do I get involved?

Look at our contributing page! There are a lot of great instructions there. Our source code is hosted here.

What technologies are used?

Currently, we mainly use Numba, a just-in-time compiler that turns numerical Python code into fast machine code. However, we are considering other approaches, such as building on research by the TACO team to make things faster. For the curious reader, here's a PhD thesis from the pioneer of the topic. Most of our ideas are in that direction.

Getting in Touch

Our Gitter Channel is the best place to get in touch, or to ask if something should go someplace else. We also have an issue tracker for the more experienced among you!

Getting Started

We have a contributing page that we'll link to as the go-to source for how to get started. If you get stuck, just see above on how to contact us!

Writing your GSoC Application

Usually, your GSoC application has to be a true "game plan" of what you'd like to achieve. It has to be worked out in enough detail that we are reasonably sure you can make it to the very end. We'd like to remind you that the name of the sub-org, in this case "PyData/Sparse", must be in the title of your application. We'd also like to point you to Google's own instructions for writing GSoC proposals.

Project Ideas

  1. LLVM Back-end for the Tensor Algebra Compiler (TACO)
    • Description: The TACO project does some JIT compilation in an ad-hoc manner by writing out *.c files, compiling them and dynamically linking them into the executable. We would like to have a back-end for TACO that produces LLVM IR using the LLVM C++ API and compiles it in-memory.
    • Skills: LLVM IR, LLVM C++ API
    • Difficulty Level: Hard
    • Related Readings/Links:
    • Potential mentors: Guilherme Leobas (@guilhermeleobas), Hameer Abbasi (@hameerabbasi)
  2. Completion of Python Bindings for the TACO compiler
    • Description: The TACO project has partial Python bindings, but these are missing tests and API coverage. We'd like to add some tests and more API coverage to the Python bindings.
    • Skills: C++/pybind11 knowledge
    • Difficulty Level: Medium
    • Related Readings/Links:
    • Potential mentors: Dale Tovar (@daletovar), Hameer Abbasi (@hameerabbasi)
  3. Creating a conda-forge package for TACO
    • Description: The TACO project has no conda-forge package. We'd like to have one so we can depend on it in PyData/Sparse.
    • Skills: CMake/conda packaging knowledge
    • Difficulty Level: Medium
    • Related Readings/Links:
    • Potential mentors: John Lee (@leej3), Hameer Abbasi (@hameerabbasi)
  4. CSR/CSC format support and performance
    • Description: The sparse project has the GCXS format, which is a generalization of the CSR/CSC (compressed sparse row/column) formats. Ideally, we'd like to special-case it for CSR/CSC and improve performance for certain operations.
    • Skills: Python knowledge, data structures and algorithms
    • Difficulty Level: Easy
    • Related Readings/Links:
    • Potential mentors: Dale Tovar (@daletovar), Hameer Abbasi (@hameerabbasi)
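
For readers new to the CSR format mentioned in idea 4, here is a minimal pure-Python sketch of plain CSR for a 2-D matrix. It is not the GCXS implementation from the sparse project; the function names `dense_to_csr` and `csr_matvec` are hypothetical, while the array names `data`, `indices` and `indptr` follow the convention commonly used for CSR.

```python
def dense_to_csr(rows):
    """Convert a list of equal-length rows to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)   # the non-zero value itself
                indices.append(col)  # the column it lives in
        indptr.append(len(data))     # where the next row starts in data/indices
    return data, indices, indptr


def csr_matvec(data, indices, indptr, x):
    """Multiply a CSR matrix by a dense vector x, visiting stored entries only."""
    result = []
    for row in range(len(indptr) - 1):
        start, stop = indptr[row], indptr[row + 1]
        result.append(sum(data[k] * x[indices[k]] for k in range(start, stop)))
    return result


matrix = [
    [1, 0, 0, 2],
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [4, 0, 5, 0],
]
data, indices, indptr = dense_to_csr(matrix)
product = csr_matvec(data, indices, indptr, [1, 1, 1, 1])

print(data)     # [1, 2, 3, 4, 5]
print(indices)  # [0, 3, 2, 0, 2]
print(indptr)   # [0, 2, 3, 3, 5]
print(product)  # [3, 3, 0, 9]
```

CSC uses the same three arrays with the roles of rows and columns swapped, so entries are grouped by column instead of by row.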