The bioacoustics package contains all the necessary R functions to read audio recordings of various formats (e.g., WAV, WAC, MP3, Zero-Crossing), filter noisy files, display audio signals, detect and extract automatically acoustic features for further analysis such as species identification based on classification of animal vocalizations. This package does not provide R functions to perform classification tasks. Other packages such as randomForest, extraTrees, mclust, or keras can be used in addition with the bioacoustics package to perform these tasks. Complementary functions for audio processing in R are also available in the tuneR, monitorR, warbleR, and seewave packages.
This package was originally built for bat bioacoustics, and thus the default arguments for all the available functions are set for bats. Users will have to set the arguments according to the group of animal vocalizations they which to study[See Table 1 for examples of extraction arguments for different species groups]. The bioacoustics package can be used to display and extract acoustic features from recordings of birds, frogs, insects, marine and other terrestrial mammals as long as their vocalizations can be recorded.
This package is subdivided into three main components containing R functions accessible to users and internal processes that will also be defined below for the sake of clarity and ease of comprehension. The three main components with the current functions are as follow:
Read, write, display, extract data (not yet implemented) from Zero-Crossing files: read_zc, plot_zc, metadata_zc, write_zc
Stand-alone tools to display, convert or resample audio recordings: read_audio, metadata, resample, mp3_to_wav, spectro 
Extract acoustic features from audio recordings: fspec, threshold_detection, blob_detection
The read_zc function can read into R a Zero-Crossing file generated from various bat recorders and by the Kaleidoscope software (Wildlife Acoustics, Inc). This function is a modified version of Peter Wilson's Anabat Tools (2013) and the C++ source code provided by Chris Corben.
The plot_zc function can be used to plot the content of a Zero-Crossing file previously loaded with the read_zc function. This function is a modified version of Peter Wilson's Anabat Tools (2013).
The metadata_zc function automatically extracts file header from a Zero-Crossing file previously loaded with the read_zc function.
The write_zc function writes or re-writes a Zero-Crossing file loaded with the write_zc function.
The read_audio function loads into R mono and stereo audio recordings with MP3, Wave, or proprietary compressed WAC (from Wildlife Acoustics) formats. It can be used as a stand-alone function and is used as an helper in all the functions dealing with audio recordings. <!– The read_wac function was built using the  necesary information from the wac2wavcmd-1.1 (provided by Wildlife Acoustics under the MIT open software license) to decode a WAC file.–>
The metadata function automatically extracts metadata encapsulated in audio recordings that were previously loaded into R with the read_audio function. It can also extract metadata from objects generated with the threshold_detection or blob_detection functions. GUANO metadata extraction is not yet implemented with this function. <!–is a modified version of the wa_meta-1.0 Wildlife Acoustics metadata reader. The metadata encapsulated in a Wave file produced by some of their bioacoustics recorders, including (with the latest firmware) the SM2, SM2+, SM2BAT192x2, SM2BAT384, SM2BAT+, SM3, SM3BAT, EM3, EM3+ and Echo Meter Touch, can be read and extracted with this function.–> 
The resample function up- or down-sample the sample rate of a given audio recording. 
The mp3_to_wav function converts MP3 to Wave files. 
The spectro function generates a frequency spectrum representation of a time wave using the Fast Fourrier Transform (FFT).
The fspec function returns as an array the frequency spectrum representation of a time wave using the Fast Fourrier Transform (FFT). Values are expressed in decibels (dB).
The threshold_detection function is a modified version of the Bat Bioacoustics software developed by Christopher Scott (2012). It combines several detection, filtering and audio feature extraction functions. 
The proposed audio event detection function is a modified spectral sum function for peak-picking, with background noise reduction and echo suppression (Scott, 2012).
A recording typically contains broadband continuous background noise, and discrete pulses of acoustic energy expressed in dB. Any discrete portion of a recorded signal above the level of background noise in a recording is defined here as an 'audio event'. Under this definition, an audio event may include animal vocalizations, call echoes (e.g., for bats), and other noise sources either biotic (e.g., stridulating insects) or abiotic (flowing water, rain, wind, and wind-induced vegetation noise).
Audio events can be detected in a recording by its energy content, and defining a threshold rule for selecting discrete portions of the recording containing energy (dB) above a fixed threshold. The spectral energy content of a recording is typicaly revealed by applying a Short-Time Fourier Transform (STFT), a technique that slides an analysis window of fixed size through the recording. The output from the windowed FFT analysis is a discrete set of values from which the locations of audio events can be identified through simple energy thresholding and peak-picking.
The two conventional methods of bioacoustic signal detection are the spectral peak and the spectral sum functions. For a signal \(x\) at time \(n\), \(X[n]\) is defined as its STFT, where \(|Xk[n]|\) is the spectral magnitude of the \(Kth\) FFT bin at \(n\). The spectral sum function calculates the sum of the STFT magnitudes (the total energy over the entire spectrum) at each consecutive window through the recording to create the following detection function:
This conventional signal detection function depends on both levels of gain and background noise in the recording. This complicates the setting of a consistent detection threshold value, as recordings may be at different levels. Thus, requiring different threshold levels to detect the exact same audio events.
To resolve this problem, the detection function should be normalised for each recording by substracting the median value over all analysis windows from each detection function data point (Skowronski & Fenton, 2009). The median value is an estimate of the noise floor of the recording, and this process of median offsetting allows the use of a fixed threshold parameter, that is then independent of the recording level. However, the process of normalisation requires that the entire recording must be acquired prior to processing, ruling out real-time analysis.
This is why the threshold_detection function estimates the noise floor using only past values of the recording (i.e., fixed windows of previous analysis frames). Prior knowledge is therefore not required for the normalisation of the detection function, and real-time analysis remains a possibility. In addition, by estimating the noise floor locally, the function can dynamically react to changes in the magnitude of background noise winthin the recording. The size of the noise estimation window can be set using the NWS argument. We recommend a time window of 100 ms for short recordings (typically < 1 min length) containing bat echolocation calls. However, we have found that a time window of 5000 ms is ideal in long recordings (typically > 60 min length) containing bird vocalizations.
The noise floor is estimated and subtracted independently for each spectral bin of the FFT spectrum. Environmental noise and microphone self noise is rarely white in nature (i.e., equal power at all frequencies), and is typically weighted more heavily at the low end of the frequency spectrum. Frequency-specific noise substraction can attenuate the noisier low-frequency regions of the spectrum more heavily than the higher frequency and lower noise parts. This process increases the sensitivity of the audio event detection at higher frequency regions of the spectrum, as signals in those regions consequently have a higher Signal to Noise Ratio (SNR).
A threshold function then selects candidate audio event locations from the detection function normalised outputs. This threshold function works by first marking the location at which the detection function crosses the trigger threshold level. The value (in dB above SNR) of this trigger threshold can be set using the threshold argument available in the threshold_detection function. 
If the detection function does not fall below the threshold level (in dB above SNR) after a certain audio event duration, the noise estimation is resumed from this location. This duration threshold can be set using the duration_thr argument. The duration_thr length is usually set at the maximum audio event duration of the targeted group of species. For bats in eastern Canada this parameter is set by default at 80 ms. For calls of Bicknell's Thrush (Catharus bicknelii), it is recommended to set it at 440 ms (see Table 1). 
Candidate audio events are subsequently filtered using the following rule: if the duration of the detected audio event is less than \(x\) milliseconds (ms) it is removed. This minimum duration filter is set with the min_dur argument available in the threshold_detection function. This duration threshold is set to help remove spurious detections caused by transient noise. For bats in eastern Canada this parameter is set by default at 1.5 ms.
A temporal masking is also employed to reduce the influence of echoes on the detection function: an exponential decay curve is applied to the output of the detection function, which acts as an adaptive threshold. Echoes falling below the threshold do not contribute to the detection function as they are masked by the louder preceding audio event. The exponential decay curve is defined as:
where \(F[n]\) is the threshold function, \(D[n]\) is the detection function and \(a\) is the exponential decay gain. This echo suppression function aims to reduce the false alarms caused by echoes exceeding the energy threshold that triggers the detection. Exponential Decay Gain (EDG) values > 0.8 worked well in practice and are set by default at 0.996. The exponential decay gain can be set using the EDG argument in the threshold_detection function.
The detection function \(D[n]\) is generated by first summing all spectral magnitudes in frequency bands that are greater than both their local median values, and the temporal masking threshold. This value is considered to be the audio event content as defined above. A noise estimate is then taken as the sum of all local median values. Finally, the detection function is expressed as SNR in dB as:
The function uses by default a FFT window size of 256 points (which can be set using the FFT_size argument), with a Blackman-Harris 7-term window to reduce spectral leakage (Harris, 1978). Larger FFT windows produce finer frequency resolution (i.e., an increased number of FFT bins, each with a narrower frequency span), but increase computation time and reduce temporal resolution due to Gabor's uncertainty principle (Gabor, 1946). The overlap between two consecutives FFT windows is set by default at 87.5 % (which can be set using the FFT_overlap argument). The bioacoustics package rely on the FFTW library for efficient Fourier transforms.
The above mentioned functions has been tested and their performances evaluated on bat echolocation calls in Scott (2012). Tests on a real-world dataset of field recordings confirmed that the modified spectral sum function with background noise reduction and echo suppression outperformed the two conventional approaches (i.e., spectral sum and spectral peak functions) in terms of accuracy. The noise substraction function performed well with all signal types across a broad range of thresholds, making it simpler to apply in practice where the recording content is not known a priori. As normalisation is applied in real time through local background noise substraction, the length of the recording to be analysed has no effect on function efficiency (Scott, 2012).
High Pass (HPF) and Low Pass filters (LPF) can be employed to reduce the amount of unwanted noise in the recording or to track particuliar audio events within a narrower frequency bandwith than the recording sampling rate. Frequencies below the HPF and above the LPF cutoff are greatly attenuated. These frequency filters can be set using the HPF and LPF arguments in the threshold_detection function. Note that these filters are described in this section, but are used by the threshold_detection function just after the conversion of the recording in the time / frequency domain using the Fast Fourier transform (FFT).
After the recording has been filtered and that the candidate audio events have been located, a second series of functions are employed to filter, extract and smooth the audio event in the time / frequency domain. The audio event extraction function starts a search for the next FFT windows from right-to-left (start) and from left-to-right (end) from each audio event centroid (i.e., location of the audio event at the peak of its maximum energy content), in search for the start and the end of the audio event, respectivelly.
The audio event extraction function relies on three different thresholds to either pursue or stop its search for the next FFT window: the energy content (expressed in dB), the SNR (dB), and the angle of the next FFT window. These thresholds can be set using the start_thr, end_thr, SNR_thr and angle_thr arguments respectively. The extraction stops as soon as the next FFT falls under the minimum threshold value of any of these thresholds. Note that the start and the end thresholds can be set independently, because audio events may have a louder part before (start) or after (end) their peak of maximum energy (dB). A function is then verifying on the temporal X-axis of the recording, if an audio event has been extracted twice, and if so, it keeps only the longest audio event.
The extracted audio event sequence is smoothed with a Kalman filtering function, whose parameters can also be set using the KPE (Kalman Process Error) and KME (Kalman Measurement Error) arguments. Another series of filtering thresholds are applied to the Kalman filtered outputs such as minimum duration (min_dur), maximum duration (max_dur), minimum time between two audio events (TBE). Acoustic features are then extracted from the filtered frequency (Hz) and energy (dB) outputs.
| Feature | Unit | Description | 
|---|---|---|
| starting_time | sec | Location of the audio event in the recording | 
| duration | msec | Duration of the audio event | 
| freq_max_amp | Hz | Frequency of the maximum energy of the audio event | 
| freq_max | Hz | Highest frequency of the audio event | 
| freq_min | Hz | Lowest frequency of the audio event | 
| bandwidth | Hz | Difference between the highest (freq_max) and lowest (freq_min) frequencies | 
| freq_start | Hz | Frequency at the start of the audio event | 
| freq_center | Hz | Frequency at the half of the audio event | 
| freq_end | Hz | Frequency at the end of the audio event | 
| freq_knee | Hz | Frequency at which the slope is the steepest (knee) | 
| freq_c | Hz | Frequency at which the slope is the flatest (caracteristic frequency) | 
| freq_bw_knee_fc | Hz | Frequency bandwith between the knee and caracteristic frequency | 
| bin_max_energy | Hz | Frequency at the maximum of energy where the slope is the flatest | 
| pc_freq_max_amp | Hz | Location of the frequency with the maximum of energy | 
| pc_freq_min | % | Location of the minimum frequency | 
| pc_fmax | % | Location of the maximum frequency | 
| pc_knee | % | Location of the frequency at which the slope is the steepest | 
| temp_bw_knee_fc | % | Temporal bandwith between the knee and caracteristic frequency | 
| slope | ms | Raw slope estimate (frequency bandwith against duration) | 
| kalman_slope | Hz / ms | Smoothed slope estimate after Kalman filtering | 
| curve_pos_start | Hz / ms | Slope estimate at the begining of the audio event | 
| curve_pos_end | Hz / ms | Slope estimate at the end of the audio event | 
| curve_neg | Hz / ms | Slope negative antropy | 
| mid_offset | dB | Mid-offset | 
| snr | dB | Signal to noise ratio | 
| harmonic_distortion | dB | Level of harmonic distortion | 
| smoothness | Time / frequency regularity | 
The metadata embeded in the Wave file header can be extracted by setting the metadata argument to TRUE. Note that the metadata can also be extracted independently from the threshold_detection function by using the metadata function
The detection and extraction settings used with the threshold_detection and blob_detection functions can be saved along with the metadata by setting the settings argument to TRUE.
The blob_detection function is a modified version of the Bat Classify software developed by Christopher Scott (2014). It combines several image processing, filtering and image feature extraction.
A recording typically contains broadband continuous background noise, and discrete pulses of acoustic energy expressed in dB. Any discrete portion of a recorded signal above the level of background noise in a recording is defined here as an 'audio event'. Under this definition, an audio event may include animal vocalizations, call echoes (e.g., for bats), and other noise sources, either biotic (e.g., stridulating insects) or abiotic (flowing water, rain, wind, and wind-induced vegetation noise).
The spectral energy content of a recording is typicaly revealed by applying a Short-Time Fourier Transform (STFT). This technique slides an analysis window of fixed size through the recording. The output from the windowed FFT analysis is a discrete set of values from which the locations of the audio events can be identified through simple thresholding after background noise substraction (Scott, 2012).
The blob_detection function uses by default a FFT window size of 256 points (can be changed using the FFT_size argument), with a Blackman Harris 4-term window to reduce spectral leakage (Harris, 1978). Larger FFT windows produce finer frequency resolution (an increased number of FFT bins, each with a narrower frequency span), but increase computation time and reduce temporal resolution due to Gabor's uncertainty principle (Gabor, 1946). The overlap between two consecutives FFT windows is set by default at 87.5 % (can be changed using the FFT_overlap argument). The bioacoustics package rely on the FFTW library for efficient Fourier transforms.
FFT values are used to display the spectrogram representation of the audio event for image processing. A blur is applied to smooth the spectrogram with a Gaussian function and the level of blur can be changed using the blur argument. This function reduces the level of Gaussian noise in the spectrogram. The background noise is then substracted. The background substraction uses a local average of the energy spectrum intensity to estimate the amount of noise to substract from each spectrogram. This estimate is then multiplied by the value of the bg_substract argument. By estimating the noise floor locally, the function can dynamically react to local variations in the noise floor intensity. A contrast boost function is finally applied on the normalised spectrogram values to increase the definition of the audio event contour against the background noise. The level of contrast boost can be set with the contrast_boost argument.
A 'blob detection' algorithm is applied on the processed spectrogram to detect a Region Of Interest (ROI). This function relies on the linear-time connected component labelling algorithm described in Chang et al. (2004), and adapted from Andrew Brampton (2011). The resulting function simultaneously labels the connected FFT values (or 'blob') and their contours in the spectrogram. Both external, and possibly internal contours of each blob are detected and labeled. Labeling is done in a single pass over the spectrogram, while contour points are revisited more than once and up to four times (Chang et al., 2004). Moreover, the detection function extracts a blob in a sequential order of contour points, i.e. a 'segment', which is useful in the case of animal vocalizations.
The blob_detection function then discards any extracted segment which area is < n pixels. This can be set using the min_area argument, and is set by default at 40 pixels to best extract bat echolocation calls. The values of these filtering parameters may be set differently to extract vocalizations from other group of animals. The segment is finally log-compressed, to convert magnitude into dB, before feature extraction.
A series of filtering functions are also applied to the segments such as minimum duration (min_dur), maximum duration (max_dur), minimum time between two segments (TBE) to reduce the amount of unwanted noise or to track particuliar segments within a narrower temporal window. Like temporal filters, High Pass (HPF) and Low Pass frequency filters (LPF) can also be set using the HPF and LPF arguments in the blob_detection function. Frequencies below the HPF and above the LPF cutoff are greatly attenuated. Acoustic features are then extracted from these filtered segments.
The spectral and temporal moments are extracted from the audio event contour points (called 'segment'), and a gradient histogram is built from the sequence of frequencies (Hz) and energy (expressed in dB) stored in each pixel values (for each extracted audio event). These sequences are available to users in addition to the other acoustic features from the function's output.
| Feature | Unit | Description | 
|---|---|---|
| starting_time | sec | Location of the audio event in the recording | 
| duration | msec | Duration of the audio event | 
| area | pixels | Estimated area of the audio event (in pixels) | 
| freq_centroid | Hz | Frequency at the centroid of the extracted audio event | 
| freq_bandwith | Hz | Difference between the highest (freq_max) and lowest (freq_min) frequencies | 
| freq_skew | Hz | Skewness of the frequency distribution of the audio event | 
| freq_kurtosis | Hz | Kurtosis of the frequency distribution of the audio event | 
| q | Hz | Centroid frequency divided by the frequency bandwith of the audio event | 
| freq_gini_impurity | Hz | Degree of smoothness of the frequency distribution of the audio event | 
| quant_2.5 | Hz | 2.5 percentile of the frequency distribution of the audio event | 
| quant_25 | Hz | 25 percentile of the frequency distribution of the audio event | 
| quant_50 | Hz | 50 percentile of the frequency distribution of the audio event | 
| quant_75 | Hz | 75 percentile of the frequency distribution of the audio event | 
| quant_97.5 | Hz | 97.5 percentile of the frequency distribution of the audio event | 
| freq_bw_95_ci | Hz | Frequency bandwith between the 97.5 and the 2.5 percentiles | 
| freq_bw_75_ci | Hz | Frequency bandwith between the 75 and the 25 percentiles | 
| temp_centroid | msec | Time at the centroid of the extracted audio event | 
| temp_bandwith | msec | Time difference between the begining and the end of the audio event | 
| temp_skew | msec | Skewness of the time distribution of the audio event | 
| temp_kurtosis | msec | Kurtosis of the time distribution of the audio event | 
| temp_gini_impurity | msec | Degree of smoothness of the temporal distribution of the audio event | 
| grad_centroid | Gradient at the centroid of the extracted audio event | |
| grad_bandwith | Gradient difference between the begining and the end of the audio event | |
| grad_skew | Skewness of the gradient distribution of the audio event | |
| grad_kurtosis | Kurtosis of the gradient distribution of the audio event | |
| grad_gini_impurity | Degree of smoothness of the gradient distribution of the audio event | 
The myotis dataset is a Wave file of 10 seconds, 16 bits, mono, with an original sampling rate at 500 kHz (time expanded by 10). It contains 11 echolocation calls of bats from the Myotis genus. The recording was made in United-Kingdom with a D500X bat detector from Pettersson Elektronik AB.
The zc dataset is a Zero-Crossing file of 16384 dots containing a sequence of 24 echolocation calls of a hoary bat (Lasiurus cinereus). This ZC recording was made in Gatineau Park, Quebec, eastern Canada, during the summer 2017 with a Walkabout bat detector from Titley Scientific.
Chang, F., Chen, C.-J., & Lu, C.J. (2004). A linear-time component-labeling algorithm using contour tracing technique. computer vision and image understanding, 93(2), 206-220. Link
Gabor, D. (1946). Theory of communication. Part 1: The analysis of information. Engineers-Part III: Radio and Communication. Link
Harris, F. J. (1978). On then use of windows with the discrete for harmonic analysis Fourier transform. Proceedings of the IEEE, 66(1), 51-83. Link
Scott, C. D. (2012). Automated techniques for bat echolocation call analysis. PhD thesis. The University of Leeds Institute of Integrative and Comparative Biology, University of Leeds, Leeds, UK. 166 pages.
Skowronski, M. D., & Fenton, M. B (2009). Detecting bat calls: An analysis of automated methods. Acta Chiropterologica, 11(1), 191-203.
Table 1. Example of detection, filtering and extraction settings with the threshold_detection function for eastern canadian bats and calls of Bicknell's Thrush (Catharus bicknelii).
| Parameters | Eastern canadian bats | Bicknell's Thrush calls | 
|---|---|---|
| threshold | 14 | 12 | 
| time_exp | 1 | 1 | 
| min_dur | 1.5 | 140 | 
| max_dur | 80 | 440 | 
| TBE | 30 | 300 | 
| EDG | 0.996 | 0.996 | 
| LPF | 250000 | 8000 | 
| HPF | 16000 | 2000 | 
| FFT_size | 256 | 256 | 
| FFT_overlap | 0.875 | 0.875 | 
| start_thr | 40 | 25 | 
| end_thr | 20 | 30 | 
| SNR_thr | 10 | 10 | 
| angle_thr | 40 | 45 | 
| duration_thr | 80 | 440 | 
| NWS | 100 | 1000 | 
| KPE | 1e-05 | 1e-05 | 
| KME | 1e-05 | 1e-04 |