HUM-TO-CHORD CONVERSION USING CHROMA FEATURES AND HIDDEN MARKOV MODEL

Music is sound arranged to be harmonious and rhythmic. The basis of music is the tone, a natural sound whose frequency differs from one tone to another. Each constant sound represents a tone, and tones can also be represented as chords. Humans are capable of creating a sound or imitating a tone produced by other people, but they generally cannot represent it in musical notation without a musical instrument. This research addresses a model of Hum-to-Chord (H2C) conversion that uses Chroma Features (CF) to extract the characteristics of a hum and a Hidden Markov Model (HMM) to classify them. A 10-fold cross-validation shows that the best model uses a chroma coefficient of 55 and an HMM with a codebook size of 16, which gives an average accuracy of 94.83%. Testing on a 30% test set shows that the best model reaches an accuracy of up to 97.78%. Most errors come from chords that have both high and low octaves, since they are unstable. Compared to a similar model called musical note classification (MNC), the proposed H2C model performs better in terms of both accuracy and complexity.


INTRODUCTION
Many musicians can make a song from a hum, but they have difficulty translating the tone directly into a chord because they lack experience in playing musical instruments. A novice musician needs a musical instrument to match a hum with its tone, or the help of other musicians who have cognitive knowledge of music. To address this problem, many researchers have studied humming, such as in [1,2]. Since 2010, query-by-humming has been part of the research area called Singing Information Processing [3][4][5].
This paper focuses on developing a model that recognizes a hum and converts it into a chord automatically. The model uses the chroma features (CF) method to extract features from the input hum, producing 12-dimensional vectors that represent the diatonic tones. The chroma used in this research is the chroma discrete cosine transform-reduced log pitch (CRP), a variant of CF that is capable of handling timbre changes [6][7][8]. The CF method is used for feature extraction since it produces the pitch information needed to recognize a chord [9].
In this paper, an HMM is selected since it is well known as an excellent classifier for recognizing data sequences without being affected by variations in sequence length [13]. It is also capable of predicting the next frame of data based on the previously recognized frames. HMMs are commonly used in applications that take data sequences as input, such as automatic speech recognition [14][15][16], video processing [17,18], gene sequence classification [19], chord recognition [20], and music analysis [21].

HUM-TO-CHORD CONVERSION
The model of Hum-to-Chord Conversion is illustrated in Fig. 1. It consists of three subprocesses, i.e., Preprocessing, Feature Extraction, and HMM-Based Classifier.

DATASET
The dataset used here is a set of audio .wav files containing hums of male and female singers with a range of E2 to C6. It was collected by recording the hums using a smartphone. It covers four voice types (soprano, alto, tenor, and bass) with three singers each. All singers have cognitive knowledge of music. The recording was supervised by a qualified choir trainer so that the accuracy of each hummed tone could be corrected directly.
Each singer hums the twelve diatonic tones in two octaves chosen according to the singer's range, and each tone is repeated five times. Hence, each singer produces 12×2×5=120 hums. The recorded hums are then manually screened by the choir trainer to keep the 100 best ones per singer. Thus, the total number of selected hums is 4×3×100=1,200, as listed in Table 1.

PREPROCESSING
There are two steps in the preprocessing. Firstly, a hum recorded in .m4a format is converted into .wav format. Secondly, the stereo .wav file is converted into a mono channel, as required by the subsequent feature extraction.
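The second preprocessing step can be sketched as follows. This is a minimal illustration, assuming the decoded audio is already available as a NumPy array of shape (frames, channels) and that the downmix is a plain average of the channels; the paper does not state which downmix is used, and the function name `to_mono` is hypothetical. Decoding the .m4a file itself normally relies on an external tool and is not shown.

```python
import numpy as np

def to_mono(samples):
    """Downmix audio to one channel by averaging across channels.

    Assumption: `samples` has shape (num_frames, num_channels);
    a 1-D array is treated as already mono and returned unchanged.
    """
    samples = np.asarray(samples, dtype=np.float64)
    if samples.ndim == 1:
        return samples
    return samples.mean(axis=1)

# Three stereo frames with made-up sample values.
stereo = np.array([[0.2, 0.4],
                   [-0.5, 0.5],
                   [1.0, 0.0]])
mono = to_mono(stereo)
```

Averaging keeps the result within the original amplitude range, which is one reason it is a common default for a downmix.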

FEATURE EXTRACTION
Chroma features can be used to estimate the pitch of a signal by framing the signal. Each frame is then transformed using a Discrete Cosine Transform (DCT) to get its spectral energy distributed in 12 bins that represent the 12 diatonic tones. Fig. 2 illustrates the block diagram of chroma feature extraction. The output of the feature extraction is a chromagram, which is a spectrogram that shows the intensity of each bin over time. The values of each bin are then added up to get a chroma pitch [8]. In this paper, a chroma pitch is a column vector.
Each person has a unique timbre. Hence, the feature extraction method used here is the CRP, which is capable of handling varying timbres [7]. The CRP works by extracting the signal using Mel Frequency Cepstral Coefficients (MFCC). The extracted signal is then transformed using a DCT to produce 120 pitch logarithmic vectors that produce 120 coefficients. The top 120 coefficients are then selected (n: 120), and the lowest coefficients are discarded to solve the problem of timbre variances [7][8]. An inverse DCT is then performed to return the signal into the time domain. Finally, the chroma extraction process is carried out.
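The CRP steps described above (log compression, DCT, discarding the lowest coefficients, inverse DCT, and folding into 12 chroma bins) can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in this paper: the pitch energies are assumed to be precomputed for 120 bands taken to be MIDI pitches 1 to 120, the coefficient of 55 is read as "keep coefficients 55 to 120", and the compression constant and the function names `dct_matrix` and `crp` are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; its transpose is the inverse transform."""
    k = np.arange(n)[:, None]   # coefficient index
    m = np.arange(n)[None, :]   # input index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    mat[0, :] /= np.sqrt(2.0)
    return mat

def crp(pitch_energy, n=55, compression=1000.0):
    """Sketch of CRP features from a (120, num_frames) pitch-energy array.

    Assumptions: row p holds the energy of MIDI pitch p + 1, and `n`
    means that DCT coefficients n..120 (1-based) are kept.
    """
    v = np.log1p(compression * pitch_energy)   # logarithmic compression
    d = dct_matrix(120)
    coeffs = d @ v                             # DCT along the pitch axis
    coeffs[: n - 1, :] = 0.0                   # discard the lowest coefficients
    v_rec = d.T @ coeffs                       # inverse DCT
    chroma = np.zeros((12, v.shape[1]))
    for p in range(120):
        chroma[(p + 1) % 12] += v_rec[p]       # fold pitches into 12 classes, C = 0
    norms = np.linalg.norm(chroma, axis=0)
    norms[norms == 0] = 1.0                    # avoid division by zero
    return chroma / norms

# Toy input: constant energy at MIDI pitch 69 (A4) over four frames.
pitch = np.zeros((120, 4))
pitch[68, :] = 1.0
features = crp(pitch)
```

In this toy input the largest chroma value falls in bin 9, the pitch class A, which matches the hummed pitch.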
A chroma representation is illustrated in Fig. 3, which is adopted from [6]. An example of a chromagram is illustrated in Fig. 4. It is generated by the stable energy of a chroma bin.

HMM-BASED CLASSIFIER
The trained HMM models are used to calculate the probabilities of converting a tone into a chord. The testing data is classified by selecting the chord model that gives the highest likelihood for the input tone.
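The maximum-likelihood decision can be sketched with a discrete HMM scored by the scaled forward algorithm. This is a minimal illustration, assuming each chord has its own trained model given by start, transition, and emission probabilities, and that the input hum has already been quantized into codebook symbols. The two toy models, their parameter values, and the function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    alpha = start * emit[:, obs[0]]
    scale = alpha.sum()
    log_p = np.log(scale)
    alpha = alpha / scale
    for symbol in obs[1:]:
        # Propagate through the transitions, then weight by the emission.
        alpha = (alpha @ trans) * emit[:, symbol]
        scale = alpha.sum()
        log_p += np.log(scale)
        alpha = alpha / scale
    return log_p

def classify(obs, models):
    """Return the name of the model with the highest log-likelihood."""
    scores = {name: log_likelihood(obs, *params) for name, params in models.items()}
    return max(scores, key=scores.get)

# Two toy chord models with 2 hidden states and 3 codebook symbols.
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
emit_c = np.array([[0.8, 0.1, 0.1],
                   [0.7, 0.2, 0.1]])
emit_g = np.array([[0.1, 0.1, 0.8],
                   [0.1, 0.2, 0.7]])
models = {"C": (start, trans, emit_c), "G": (start, trans, emit_g)}

predicted = classify([0, 0, 1, 0], models)
```

Rescaling the forward variable at each step keeps the computation numerically stable for long sequences, while the accumulated logarithms of the scale factors give the log-likelihood.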

RESULT AND DISCUSSION
In order to get the highest accuracy, the proposed model is evaluated using two scenarios: 1) observing the effect of the chroma parameters on the model accuracy; and 2) observing the model performance using 10-fold cross-validation. The CRP coefficient used here is 55, following the result in [6], which shows that this value reaches the highest accuracy. The CRP coefficient of 55 can also be used for hums, since a hum has characteristics similar to those of music and other instruments used for chord extraction and analysis.

SCENARIO 1
The dataset used in this scenario is split into a training set of 840 (70%) hums and a test set of 360 (30%) hums, with a codebook size of 16. The experimental result shows that the CRP parameters give a much higher accuracy (up to 95.83%) than the default parameters (90.00%). This is achieved using the CRP coefficient of 55. The result suggests that the log compression in CRP makes the signal stable and that the DCT is capable of distinguishing the unique timbres.
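The codebook of size 16 can be read as a vector-quantization step that maps each feature frame to one of 16 discrete symbols before it is passed to the HMM. The sketch below assumes the codebook is learned with plain k-means (Lloyd's algorithm) on 12-dimensional frames; the paper does not name the clustering method, the random frames are placeholders for real chroma features, and the function names are hypothetical.

```python
import numpy as np

def quantize(frames, centroids):
    """Map each frame to the index of its nearest centroid."""
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def train_codebook(frames, size=16, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm) returning `size` centroids."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids with distinct, randomly chosen frames.
    centroids = frames[rng.choice(len(frames), size=size, replace=False)].copy()
    for _ in range(iters):
        labels = quantize(frames, centroids)
        for c in range(size):
            members = frames[labels == c]
            if len(members) > 0:   # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids

# Placeholder data: 500 random 12-dimensional frames.
rng = np.random.default_rng(1)
frames = rng.random((500, 12))
codebook = train_codebook(frames, size=16)
symbols = quantize(frames, codebook)
```

Each hum then becomes a sequence of integers between 0 and 15, which is the kind of observation sequence a discrete HMM consumes.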

SCENARIO 2
The H2C model is then evaluated using k-fold cross-validation. In this research, k = 10, since this value is commonly recommended for model validation [22]. The dataset of 1,200 hums is divided into ten folds of 120 hums each. Hence, ten experiments are conducted. In each experiment, nine folds are used to train a model, and the remaining fold is used to validate it.
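The fold construction can be sketched as follows, assuming the 1,200 hums are shuffled once and then cut into ten equal folds. Whether the original experiments shuffled or stratified the data is not stated, so the shuffling, the seed, and the function name `k_fold_indices` are assumptions.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        validation = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, validation

# 1,200 hums and 10 folds give 1,080 training and 120 validation hums per experiment.
splits = list(k_fold_indices(1200, k=10))
```

Every hum appears in exactly one validation fold, so the averaged accuracy over the ten experiments covers the whole dataset.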
The experimental results of 10-fold cross-validation using the CRP coefficient of 55 and varying codebook sizes are illustrated in Fig. 6. The codebook size of 256 gives the lowest average accuracy of 23.91%, and the codebook size of 16 gives the highest, 94.83%. A bigger codebook does not always give higher accuracy: with too many centroids, the clustering optimization does not converge to an optimum point. Hence, the codebook size of 16 is the optimum parameter, giving stable accuracies across the ten folds.
Furthermore, the proposed H2C model is compared to the musical note classification (MNC) model described in [10], which uses a harmonic product spectrum (HPS) for feature extraction and an ANN as the classifier. Both H2C and MNC have the same task, which is to classify the sounds in the dataset into 12 chords. However, the two models differ in their input and in the size of their datasets. H2C receives a hum as the input, while MNC accepts the sound of an electric guitar and other instruments. H2C is trained on 840 hums (70 per chord) and tested on 360 unseen hums (30 per chord). Meanwhile, MNC is trained on 2,400 samples (200 per chord) and tested on 2,400 unseen samples (200 per chord). It seems that the testing set for MNC is more challenging than that for H2C.
The results are illustrated in Table 2. The H2C model generally performs better than MNC. It gives perfect accuracies of 100% for ten chords and lower accuracies for two chords, F and A#, at 83.33% and 90%, respectively. The errors occur because those chords are unstable: they have both high and low octaves for all singers. Fig. 7 shows the chromagrams of chord A generated from two different hums. The chromagram on the left has stable chroma-bin energy and gives a true conversion, whereas the one on the right has unstable chroma-bin energy and, as a consequence, gives a false conversion. By comparison, MNC with electric guitar input reaches a perfect accuracy of 100% for only four chords: D#, F, F#, and G. It gives low accuracies of 91.30% and 93.60% for chords G# and C, respectively. Meanwhile, MNC with other instruments reaches 100% for only one chord (G) and obtains low accuracies ranging from 84% to 96% for the other 11 chords.
Overall, H2C produces an average accuracy of 97.78%, slightly higher than the 97.50% mean accuracy of MNC with the electric guitar. This result seems to indicate that the performance of H2C is similar to that of MNC. However, in practice, electric guitar sounds produced by guitarists are commonly more stable than hums produced by singers, which means that converting electric guitar sounds into chords is easier than converting hums. This is supported by the result of MNC for other instruments, which yields a much lower average accuracy of 93.60%.
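The overall H2C figure can be cross-checked against the per-chord accuracies reported for the test set. The sketch below assumes 30 test hums per chord, as stated for the 70/30 split, and infers the correct counts for F and A# from their reported accuracies (25 of 30 and 27 of 30); these counts are derived here and are not reported in the paper.

```python
chords = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
per_chord_total = 30  # test hums per chord in the 70/30 split

# Ten chords are converted perfectly; F and A# are inferred from their accuracies.
correct = {chord: per_chord_total for chord in chords}
correct["F"] = 25    # 25 / 30 = 83.33%
correct["A#"] = 27   # 27 / 30 = 90.00%

per_chord_accuracy = {c: 100.0 * correct[c] / per_chord_total for c in chords}
overall_accuracy = 100.0 * sum(correct.values()) / (per_chord_total * len(chords))
```

With these counts, 352 of the 360 test hums are converted correctly, which rounds to the reported 97.78%.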
Hence, the results suggest that the combination of CRP and HMM is better than the combination of HPS and ANN in converting a hum or music into a chord. Firstly, CRP offers much lower complexity (55 feature elements) than HPS (181 feature elements). Secondly, an HMM is easier to tune, since its optimum structure can be found by trying several codebook sizes, whereas an ANN must be trained on a manually designed structure that may not be optimal.

CONCLUSION
The 10-fold cross-validation of the developed H2C model shows that the best model is achieved using a chroma coefficient of 55 and an HMM with a codebook size of 16, which gives an average accuracy of 94.83%. Testing on a more challenging dataset (where the testing set is 30% of the total data) shows that the best model is capable of converting hums into chords with a high accuracy of 97.78%. Feature extraction using the CRP parameters gives significantly higher accuracy than the default parameters. The proposed H2C also performs better than MNC in terms of accuracy as well as complexity. Most errors come from chords that have both high and low octaves, because of the inflections or instability of the sounds.

ACKNOWLEDGMENT
We would like to thank the student singers of the Telkom University Choir for contributing the dataset of hums.