Optical music recognition (OMR) is the process of taking an image of sheet music and converting it into a computer-readable format (such as MusicXML or MIDI).
OMR is similar to optical character recognition (OCR), the task of converting images of printed or handwritten text into machine-readable text. However, OMR is much more difficult due to the hierarchical and nonlinear structure of sheet music.
There has been a great deal of work in OMR in recent years. For a thorough overview of the field, see [Novotny and Pokorny, 2015].
Sheet music is the language that musicians use to write down music. It consists of staff lines and symbols.
Staff lines (drawn here in blue) are the horizontal lines that span the image. Each set of 5 staff lines is called a "stave". Everything else is a symbol. Symbols tell you what to do, such as what note to play, how loud to play, what instrument is playing, etc.
You read music by reading along the stave from left to right, and whenever you see a symbol, you do something based on that symbol.
Note: the location of a symbol with respect to the staff lines may change its meaning! For example, a notehead on the top line of a stave indicates a different pitch than an identical notehead on the bottom line. The horizontal position of a note determines when to play it.
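As a concrete illustration of the vertical-position rule (a toy example, not part of any OMR system), here is how positions on a treble-clef stave map to pitches, from the bottom line (E4) to the top line (F5):

```python
# Toy illustration: the same notehead symbol means a different pitch
# depending on its vertical position on a treble-clef stave.
# Position 0 is the bottom line; each step of 1 moves up one line or space.
TREBLE_POSITIONS = ["E4", "F4", "G4", "A4", "B4", "C5", "D5", "E5", "F5"]

def pitch_at(position: int) -> str:
    """Return the pitch of a notehead at the given staff position."""
    return TREBLE_POSITIONS[position]

print(pitch_at(0))  # bottom line -> E4
print(pitch_at(8))  # top line    -> F5
```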
For a more thorough introduction to sheet music, see here.
The biggest application of OMR is converting handwritten sheet music into a clean, digital format. Most composers still choose to write music by hand due to a lack of easy-to-use music writing software. Furthermore, there is a great deal of historical music that only exists in handwritten form. Converting these scores into a digital format is time consuming (and therefore expensive).
There are several other interesting applications of OMR such as creating robots that can play music, designing automated page turners for performers, and improving the process of modifying existing sheet music.
There are several commercial and open-source products that claim to perform OMR, but these tend to work only for a very narrow set of clean sheet music scores, such as those generated from other music notation software.
There is a standard pipeline for performing optical music recognition. After preprocessing the image, we do the following:
Each step of this process depends on the output from the last step, which makes this a very challenging task! For example, if we fail to remove all of the staff lines in step 1, we will not be able to find meaningful symbols in step 2, which means we will be unable to classify them correctly in step 3, etc.
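To make the compounding-error point concrete, here is a toy sketch (illustrative only, not the project's code) in which a naive step 1 erases near-solid black rows and step 2 then segments whatever ink remains. If step 1 misses a staff line, step 2's output is polluted:

```python
# Toy end-to-end illustration of the pipeline's sequential dependence.
# On a tiny binary image, "staff lines" are rows that are almost entirely
# black, and "symbols" are the columns where ink remains after erasure.
image = [
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # staff line
    [0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],  # staff line
    [0, 0, 0, 0, 1, 0],
]

# Step 1: remove staff lines (rows that are at least 90% black).
staffless = [row[:] if sum(row) / len(row) < 0.9 else [0] * len(row)
             for row in image]

# Step 2: segment symbols (here, just the columns that still contain ink).
symbol_cols = [c for c in range(len(image[0]))
               if any(row[c] for row in staffless)]
print(symbol_cols)  # prints [2, 4]; a missed row in step 1 would pollute this
```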
In light of this, most research focuses on a single step of the pipeline. There has been some effort to use deep learning for end-to-end optical music recognition, but these attempts have so far been unsuccessful [Calvo-Zaragoza et al., 2017a].
Due to the interconnected nature of the pipeline, I chose to focus on the very first step: finding and removing staff lines.
Staff lines are perhaps the most important part of sheet music, as all other symbols are defined with respect to the staff lines. If we do not locate the staff lines correctly, we will be unable to correctly interpret the rest of the sheet music.
That being said, it will be difficult to segment and classify the symbols with the staff lines in the image, so we will remove the staff lines from the image once we have located them.
First off, not all staff lines are straight. For example, if the sheet music has been scanned from a book, the staff lines may appear curved near the spine (see the figure on the right). This means that standard line-detection methods like the Hough transform will not suffice here.
Also, since staff lines are usually very thin, any distortion or noise (such as from scanning) will severely alter the structure of the staff line.
This means staff line removal algorithms must be robust to distortions and noise.
The first two decades of research in staff line removal focused on developing handcrafted algorithms based on domain knowledge. These algorithms tend to perform very well for perfect sheet music, but perform miserably for music with any visual flaws. As discussed above, most sheet music is distorted or noisy, which means these algorithms are less than ideal for practical applications.
See [Fujinaga et al., 2007] for a thorough overview and comparative study of various handcrafted algorithms.
Over the last two years, researchers have shifted to designing staff line removal algorithms using supervised learning, which uses labeled data to train a predictive model.
For example, one might train a model that goes through the image pixel by pixel and asks “is this pixel part of a staff line?”. To do this, we convert each pixel into a feature vector, a point in n-dimensional space. We then use sheet music where each pixel is labeled as “staff” or “not staff” to learn a rule that discriminates between the two. This technique was used to great success in [Calvo-Zaragoza et al., 2016].
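A minimal sketch of this pixel-wise idea (the features, data, and nearest-centroid rule here are my own illustration, not the method from the paper): featurize each black pixel by the lengths of the horizontal and vertical black runs through it, then learn a decision rule from labeled examples. Staff-line pixels sit on long, thin horizontal runs, so the two classes separate easily:

```python
# Hypothetical illustration of pixel-wise supervised staff detection.
# Feature vector per pixel: (horizontal run length, vertical run length).
labeled = [
    ((120.0, 1.0), "staff"),      # long, thin horizontal run
    ((200.0, 2.0), "staff"),
    ((4.0, 15.0), "not_staff"),   # e.g. part of a note stem
    ((6.0, 6.0), "not_staff"),    # e.g. part of a notehead
]

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

# "Training": compute one centroid per class from the labeled pixels.
centroids = {
    label: centroid([f for f, l in labeled if l == label])
    for label in ("staff", "not_staff")
}

def classify(feature):
    """Nearest-centroid rule: assign the pixel to the closer class."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda l: dist2(feature, centroids[l]))

print(classify((150.0, 1.0)))  # -> staff
print(classify((3.0, 12.0)))   # -> not_staff
```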
Researchers have also applied deep learning to this task, training a model to solve the black-box problem of converting the original image into the staff-less image in a single step. Such a model can be trained in several ways, for example with convolutional neural networks [Calvo-Zaragoza et al., 2017b] or generative adversarial networks [Bhunia et al., 2018].
Both of these supervised learning methods share the same issue: they require a large variety of labeled data for the trained model to be robust (which is essential, as discussed above). Unfortunately, only two labeled datasets are available for staff line removal, and generating new ones is incredibly time consuming (we would need to hand-label every one of the millions of pixels in each training image, and we would need thousands of such images).
This issue led me to ask the question: “how can we harness the predictive power of machine learning without needing so much labeled data?”
My approach uses clustering to find patterns in the data. This method does not require labeled data and is robust to noise, which mitigates our two main concerns.
My algorithm is as follows:
Intuitively, why should this approach work? Clustering does really well when there is a lot of data, and there are hundreds of thousands of black pixels in each image. This means that clusters should naturally form. Furthermore, “staff” and “non-staff” pixels should look very different under the right feature vectors, which means these would indeed be the two primary clusters.
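A minimal sketch of this intuition (the feature vectors and data are illustrative, not the project's actual features): run 2-means on per-pixel feature vectors, then label the cluster with the longer average horizontal run as “staff”. No labeled data is needed:

```python
# Illustrative 2-means clustering of black-pixel feature vectors.
# Feature per pixel: (horizontal run length, vertical run length).
pixels = [
    (118.0, 1.0), (121.0, 1.0), (119.0, 2.0),   # staff-like pixels
    (4.0, 14.0), (6.0, 7.0), (5.0, 11.0),       # symbol-like pixels
]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans2(points, iters=10):
    """Plain 2-means: alternate assignment and centroid updates.
    (No empty-cluster handling; fine for this tiny illustration.)"""
    c0, c1 = points[0], points[3]  # seed with two distinct points
    for _ in range(iters):
        g0 = [p for p in points if dist2(p, c0) <= dist2(p, c1)]
        g1 = [p for p in points if dist2(p, c0) > dist2(p, c1)]
        c0 = tuple(sum(x) / len(g0) for x in zip(*g0))
        c1 = tuple(sum(x) / len(g1) for x in zip(*g1))
    return g0, g1

g0, g1 = kmeans2(pixels)
# Label the cluster with the longer mean horizontal run as "staff".
mean_h = lambda g: sum(p[0] for p in g) / len(g)
staff = g0 if mean_h(g0) > mean_h(g1) else g1
print(len(staff))  # -> 3: the three staff-like pixels, found without labels
```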
I tested the performance of my algorithm on two datasets: one consisting of computer-generated sheet music and one consisting of handwritten sheet music. I compared the results against several other algorithms I implemented, as well as the reported state-of-the-art results (which were only available for the handwritten dataset). Following the state-of-the-art papers, I used F1 score as my metric for success (for an intuitive explanation of F1 score, see this post).
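For reference, here is the standard pixel-wise F1 computation used as the success metric (a generic implementation, not the evaluation code from the papers):

```python
# Pixel-wise F1 score for staff detection: the harmonic mean of
# precision (what fraction of predicted staff pixels were right) and
# recall (what fraction of true staff pixels we found).
def f1_score(true, pred):
    tp = sum(1 for t, p in zip(true, pred) if t and p)
    fp = sum(1 for t, p in zip(true, pred) if not t and p)
    fn = sum(1 for t, p in zip(true, pred) if t and not p)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 1 = staff pixel, 0 = non-staff pixel.
truth = [1, 1, 1, 0, 0, 0]
guess = [1, 1, 0, 1, 0, 0]
print(round(f1_score(truth, guess), 3))  # prints 0.667
```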
Although my clustering algorithm was outperformed on both datasets, it was the only algorithm with decent performance on both. All of my other algorithms performed very well on the typeset dataset but horribly on the handwritten dataset. I was unable to test the state-of-the-art algorithms on the typeset dataset, but my algorithm would likely have compared favorably there, since the state-of-the-art algorithms were all trained on handwritten data.
This is one snapshot of a single output from the typeset dataset, comparing our results (left) with the true staff-less image (right). Some observations:
This is one snapshot of a single output from the handwritten dataset, comparing our results (top) with the true staff-less image (bottom). Some observations:
Overall, the performance of my algorithm was adequate but not fantastic. We give up a bit of accuracy in exchange for flexibility, faster running time, and independence from labeled data. This last point is incredibly important: it means our algorithm should generalize much better than the state-of-the-art algorithms.
This was just a first attempt at a clustering algorithm, and there are many places where it could be improved:
What did I learn from doing this project?
Here are several downloads from my work on this project (these are my own):
Here are links to the download pages for the datasets used in this project (these datasets are not mine):