Optical music recognition (OMR) is the process of taking an image of sheet music and converting it into a computer-readable format (such as MusicXML or MIDI).
OMR is similar to optical character recognition (OCR), the task of converting images of printed or handwritten text into machine-readable text. However, OMR is much more difficult due to the hierarchical and nonlinear structure of sheet music.
There has been a great deal of work in OMR in recent years. For a thorough overview of the field, see [Novotny and Pokorny, 2015].
Sheet music is the language that musicians use to write down music. It consists of staff lines and symbols.
Staff lines (drawn here in blue) are the horizontal lines that span the image. Each set of 5 staff lines is called a "stave". Everything else is a symbol. Symbols tell you what to do, such as what note to play, how loud to play, what instrument is playing, etc.
You read music by reading along the stave from left to right, and whenever you see a symbol, you do something based on that symbol.
Note: the location of a symbol with respect to the staff lines may change its meaning! For example, a notehead on the top line of a stave indicates a different pitch than an identical notehead on the bottom line. The horizontal position of a note determines when to play it.
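As a concrete illustration of the vertical-position rule (a toy example, not part of any OMR system), here is how positions on a treble-clef stave map to pitches, from the bottom line (E4) to the top line (F5):

```python
# Toy illustration: the same notehead symbol means a different pitch
# depending on its vertical position on a treble-clef stave.
# Position 0 is the bottom line; each step of 1 moves up one line or space.
TREBLE_POSITIONS = ["E4", "F4", "G4", "A4", "B4", "C5", "D5", "E5", "F5"]

def pitch_at(position: int) -> str:
    """Return the pitch of a notehead at the given staff position."""
    return TREBLE_POSITIONS[position]

print(pitch_at(0))  # bottom line -> E4
print(pitch_at(8))  # top line    -> F5
```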
For a more thorough introduction to sheet music, see here.
The biggest application of OMR is converting handwritten sheet music into a clean, digital format. Most composers still choose to write music by hand due to a lack of easy-to-use music writing software. Furthermore, there is a great deal of historical music that only exists in handwritten form. Converting these scores into a digital format is time consuming (and therefore expensive).
There are several other interesting applications of OMR such as creating robots that can play music, designing automated page turners for performers, and improving the process of modifying existing sheet music.
There are several commercial and open-source products that claim to perform OMR, but these tend to work only for a very narrow set of clean sheet music scores, such as those generated from other music notation software.
There is a standard pipeline for performing optical music recognition. After preprocessing the image, we do the following:
Each step of this process depends on the output from the last step, which makes this a very challenging task! For example, if we fail to remove all of the staff lines in step 1, we will not be able to find meaningful symbols in step 2, which means we will be unable to classify them correctly in step 3, etc.
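To make the compounding-error point concrete, here is a toy sketch (illustrative only, not the project's code) in which a naive step 1 erases near-solid black rows and step 2 then segments whatever ink remains. If step 1 misses a staff line, step 2's output is polluted:

```python
# Toy end-to-end illustration of the pipeline's sequential dependence.
# On a tiny binary image, "staff lines" are rows that are almost entirely
# black, and "symbols" are the columns where ink remains after erasure.
image = [
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # staff line
    [0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],  # staff line
    [0, 0, 0, 0, 1, 0],
]

# Step 1: remove staff lines (rows that are at least 90% black).
staffless = [row[:] if sum(row) / len(row) < 0.9 else [0] * len(row)
             for row in image]

# Step 2: segment symbols (here, just the columns that still contain ink).
symbol_cols = [c for c in range(len(image[0]))
               if any(row[c] for row in staffless)]
print(symbol_cols)  # prints [2, 4]; a missed row in step 1 would pollute this
```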
In light of this, most research focuses on a single step of the pipeline. There has been some effort to use deep learning for end-to-end optical music recognition, but these attempts have so far been unsuccessful [Calvo-Zaragoza et al., 2017a].
Due to the interconnected nature of the pipeline, I chose to focus on the very first step: finding and removing staff lines.
Staff lines are perhaps the most important part of sheet music, as all other symbols are defined with respect to the staff lines. If we do not locate the staff lines correctly, we will be unable to correctly interpret the rest of the sheet music.
That being said, it will be difficult to segment and classify the symbols with the staff lines in the image, so we will remove the staff lines from the image once we have located them.
First off, not all staff lines are straight. For example, if the sheet music has been scanned from a book, the staff lines may appear curved near the spine (see the figure on the right). This means that standard line-detection methods like the Hough transform will not suffice here.
Also, since staff lines are usually very thin, any distortion or noise (such as from scanning) will severely alter the structure of the staff line.
This means staff line removal algorithms must be robust to distortions and noise.
The first two decades of research in staff line removal focused on developing handcrafted algorithms based on domain knowledge. These algorithms tend to perform very well for perfect sheet music, but perform miserably for music with any visual flaws. As discussed above, most sheet music is distorted or noisy, which means these algorithms are less than ideal for practical applications.
See [Fujinaga et al., 2007] for a thorough overview and comparative study of various handcrafted algorithms.
Over the last two years, researchers have shifted to designing staff line removal algorithms using supervised learning, which uses labeled data to train a predictive model.
For example, one might train a model that goes through the image pixel by pixel and asks “is this pixel part of a staff line?”. To do this, we convert each pixel into a feature vector, a point in n-dimensional space. We then use sheet music where each pixel is labeled as “staff” or “not staff” to learn a rule that discriminates between the two. This technique was used to great success in [Calvo-Zaragoza et al., 2016].
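A minimal sketch of this pixel-wise idea (the features, data, and nearest-centroid rule here are my own illustration, not the method from the paper): featurize each black pixel by the lengths of the horizontal and vertical black runs through it, then learn a decision rule from labeled examples. Staff-line pixels sit on long, thin horizontal runs, so the two classes separate easily:

```python
# Hypothetical illustration of pixel-wise supervised staff detection.
# Feature vector per pixel: (horizontal run length, vertical run length).
labeled = [
    ((120.0, 1.0), "staff"),      # long, thin horizontal run
    ((200.0, 2.0), "staff"),
    ((4.0, 15.0), "not_staff"),   # e.g. part of a note stem
    ((6.0, 6.0), "not_staff"),    # e.g. part of a notehead
]

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

# "Training": compute one centroid per class from the labeled pixels.
centroids = {
    label: centroid([f for f, l in labeled if l == label])
    for label in ("staff", "not_staff")
}

def classify(feature):
    """Nearest-centroid rule: assign the pixel to the closer class."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda l: dist2(feature, centroids[l]))

print(classify((150.0, 1.0)))  # -> staff
print(classify((3.0, 12.0)))   # -> not_staff
```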
Researchers have also applied deep learning to this task, training a model to solve the black-box problem of converting the original image into the staff-less image in a single step. Such a model can be trained in several ways, for example with convolutional neural networks [Calvo-Zaragoza et al., 2017b] or generative adversarial networks [Bhunia et al., 2018].
Both of these supervised learning methods share the same issue: they require a large variety of labeled data for the trained model to be robust (which is essential, as discussed above). Unfortunately, only two labeled datasets are available for staff line removal, and generating new ones is incredibly time consuming (we would need to hand-label every one of the millions of pixels in each training image, and we would need thousands of such images).
This issue led me to ask the question: “how can we harness the predictive power of machine learning without needing so much labeled data?”
My approach uses clustering to find patterns in the data. This method does not require labeled data and is robust to noise, which mitigates our two main concerns.
My algorithm is as follows:
Intuitively, why should this approach work? Clustering does really well when there is a lot of data, and there are hundreds of thousands of black pixels in each image. This means that clusters should naturally form. Furthermore, “staff” and “non-staff” pixels should look very different under the right feature vectors, which means these would indeed be the two primary clusters.
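A minimal sketch of this intuition (the feature vectors and data are illustrative, not the project's actual features): run 2-means on per-pixel feature vectors, then label the cluster with the longer average horizontal run as “staff”. No labeled data is needed:

```python
# Illustrative 2-means clustering of black-pixel feature vectors.
# Feature per pixel: (horizontal run length, vertical run length).
pixels = [
    (118.0, 1.0), (121.0, 1.0), (119.0, 2.0),   # staff-like pixels
    (4.0, 14.0), (6.0, 7.0), (5.0, 11.0),       # symbol-like pixels
]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans2(points, iters=10):
    """Plain 2-means: alternate assignment and centroid updates.
    (No empty-cluster handling; fine for this tiny illustration.)"""
    c0, c1 = points[0], points[3]  # seed with two distinct points
    for _ in range(iters):
        g0 = [p for p in points if dist2(p, c0) <= dist2(p, c1)]
        g1 = [p for p in points if dist2(p, c0) > dist2(p, c1)]
        c0 = tuple(sum(x) / len(g0) for x in zip(*g0))
        c1 = tuple(sum(x) / len(g1) for x in zip(*g1))
    return g0, g1

g0, g1 = kmeans2(pixels)
# Label the cluster with the longer mean horizontal run as "staff".
mean_h = lambda g: sum(p[0] for p in g) / len(g)
staff = g0 if mean_h(g0) > mean_h(g1) else g1
print(len(staff))  # -> 3: the three staff-like pixels, found without labels
```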
I tested the performance of my algorithm on two datasets: one consisting of computer-generated sheet music and one consisting of handwritten sheet music. I compared the results against several other algorithms I implemented, as well as the reported state-of-the-art results (which were only available for the handwritten dataset). Following the state-of-the-art papers, I used F1 score as my metric for success (for an intuitive explanation of F1 score, see this post).
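For reference, here is the standard pixel-wise F1 computation used as the success metric (a generic implementation, not the evaluation code from the papers):

```python
# Pixel-wise F1 score for staff detection: the harmonic mean of
# precision (what fraction of predicted staff pixels were right) and
# recall (what fraction of true staff pixels we found).
def f1_score(true, pred):
    tp = sum(1 for t, p in zip(true, pred) if t and p)
    fp = sum(1 for t, p in zip(true, pred) if not t and p)
    fn = sum(1 for t, p in zip(true, pred) if t and not p)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 1 = staff pixel, 0 = non-staff pixel.
truth = [1, 1, 1, 0, 0, 0]
guess = [1, 1, 0, 1, 0, 0]
print(round(f1_score(truth, guess), 3))  # prints 0.667
```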
Although my clustering algorithm was outperformed on both datasets, it was the only algorithm with decent performance on both. All of my other algorithms performed very well on the typeset dataset but horribly on the handwritten dataset. I was unable to test the state-of-the-art algorithms on the typeset dataset, but my algorithm would likely have compared favorably there, since the state-of-the-art algorithms were all trained on handwritten data.
This is one snapshot of a single output from the typeset dataset, comparing our results (left) with the true staff-less image (right). Some observations:
This is one snapshot of a single output from the handwritten dataset, comparing our results (top) with the true staff-less image (bottom). Some observations:
Overall, the performance of my algorithm was adequate but not fantastic. We give up a bit of accuracy in exchange for flexibility, faster running time, and independence from labeled data. This last point is incredibly important: it means our algorithm should generalize much better than the state-of-the-art algorithms.
This was just a first attempt at a clustering algorithm, and there are many places where it could be improved:
What did I learn from doing this project?
Here are several downloads from my work on this project (these are my own):
Here are links to the download pages for the datasets used in this project (these datasets are not mine):