Syntax
This module contains functions for syntax analysis given a table of syllables with timestamps and labels.
avn.syntax module
Created on Wed Oct 20 10:49:15 2021
@author: Therese
-
class avn.syntax.SyntaxData(Bird_ID, syll_df)¶ Bases: object
-
add_file_bounds(song_folder_path)¶ Add rows representing syllable boundaries to self.syll_df. A new row with label value ‘file_start’ and onset and offset values = 0 will be added before the first syllable of a new file and a new row with label value ‘file_end’ and onset and offset values reflecting the duration of the file in question will be added after the last syllable of a file.
- Parameters
song_folder_path (str) – Path to folder containing .wav files of songs in SyntaxData.syll_df. Should end with ‘/’.
- Raises
RuntimeError – If file bounds have already been added to this SyntaxData object, this error is raised to inform the user that file bounds will not be added a second time. This is based on the value of the boolean self.file_bounds_added.
- Returns
- Return type
None.
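The row layout described above can be illustrated with plain pandas. This is a minimal sketch, not avn's implementation: the helper `add_file_bounds_sketch` and the `file_durations` dictionary are hypothetical, and durations are supplied by hand here rather than read from the .wav files in song_folder_path.

```python
import pandas as pd

# Toy syllable table with the columns described for syll_df.
syll_df = pd.DataFrame({
    "files": ["a.wav", "a.wav", "b.wav"],
    "onsets": [0.10, 0.35, 0.20],
    "offsets": [0.30, 0.50, 0.45],
    "labels": ["i", "a", "a"],
})

# Hypothetical file durations in seconds (assumption: given directly
# instead of being measured from the .wav files).
file_durations = {"a.wav": 2.0, "b.wav": 1.5}


def add_file_bounds_sketch(syll_df, file_durations):
    """Return a new table with 'file_start' and 'file_end' rows around each file."""
    pieces = []
    for file_name, file_sylls in syll_df.groupby("files", sort=False):
        duration = file_durations[file_name]
        start_row = pd.DataFrame({"files": [file_name], "onsets": [0.0],
                                  "offsets": [0.0], "labels": ["file_start"]})
        end_row = pd.DataFrame({"files": [file_name], "onsets": [duration],
                                "offsets": [duration], "labels": ["file_end"]})
        pieces.extend([start_row, file_sylls, end_row])
    return pd.concat(pieces, ignore_index=True)


bounded_df = add_file_bounds_sketch(syll_df, file_durations)
print(bounded_df)
```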
-
add_gaps(min_gap=0.2)¶ Add rows representing silent gaps between syllables longer than min_gap to self.syll_df.
- Parameters
min_gap (float, optional) – Minimum duration in seconds for a gap between syllables to be considered syntactically relevant. This value should be selected such that gaps between syllables in a bout are shorter than min_gap, but gaps between bouts are longer than min_gap. The default is 0.2.
- Raises
RuntimeError – If file bounds have already been added to this SyntaxData object, this error is raised to inform the user that file bounds will not be added a second time. This is based on the value of the boolean self.file_bounds_added.
- Returns
- Return type
None.
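The role of min_gap can be sketched with pandas alone. The helper `find_gaps_sketch` below is hypothetical and only illustrates the idea of comparing each syllable's offset with the next syllable's onset within a file; it borrows the ‘silent_gap’ label and the duration column that this module uses for gaps, but it is not avn's code.

```python
import pandas as pd

syll_df = pd.DataFrame({
    "files": ["a.wav"] * 4,
    "onsets": [0.10, 0.35, 1.00, 1.25],
    "offsets": [0.30, 0.50, 1.20, 1.40],
    "labels": ["a", "b", "a", "b"],
})


def find_gaps_sketch(syll_df, min_gap=0.2):
    """Return one row per silence longer than min_gap between consecutive syllables."""
    rows = []
    for file_name, file_sylls in syll_df.groupby("files", sort=False):
        file_sylls = file_sylls.sort_values("onsets")
        # A gap starts where one syllable ends and stops where the next begins.
        gap_onsets = file_sylls["offsets"].iloc[:-1].to_numpy()
        gap_offsets = file_sylls["onsets"].iloc[1:].to_numpy()
        for onset, offset in zip(gap_onsets, gap_offsets):
            duration = offset - onset
            if duration > min_gap:
                rows.append({"files": file_name, "onsets": onset,
                             "offsets": offset, "labels": "silent_gap",
                             "duration": duration})
    return pd.DataFrame(rows, columns=["files", "onsets", "offsets",
                                       "labels", "duration"])


gaps_df = find_gaps_sketch(syll_df, min_gap=0.2)
print(gaps_df)
```

With the default of 0.2 s, only the 0.5 s silence between the second and third syllables is kept; the 0.05 s within-bout gaps are ignored.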
-
drop_calls()¶ This function drops any rows in self.syll_df reflecting syllables that are preceded and followed by silent gaps, as these likely reflect calls.
- Raises
RuntimeError – Gaps must be added to self.syll_df before calls can be identified, so this function will raise an error if gaps have not been added. It will also raise an error if calls have already been dropped from self.syll_df, to avoid repeating this process unnecessarily.
- Returns
- Return type
None.
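A rough sketch of the selection rule, assuming gap rows carry the label ‘silent_gap’ and that only gap rows (not file boundaries) count as flanking silence. The helper `drop_calls_sketch` is hypothetical and avn's own handling of edge cases may differ.

```python
import pandas as pd

syll_df = pd.DataFrame({
    "labels": ["file_start", "a", "b", "silent_gap", "c",
               "silent_gap", "a", "b", "file_end"],
})


def drop_calls_sketch(syll_df):
    """Drop syllable rows whose previous and next rows are both silent gaps."""
    labels = syll_df["labels"]
    prev_is_gap = labels.shift(1) == "silent_gap"
    next_is_gap = labels.shift(-1) == "silent_gap"
    is_syllable = labels != "silent_gap"
    is_call = is_syllable & prev_is_gap & next_is_gap
    return syll_df[~is_call].reset_index(drop=True)


no_calls_df = drop_calls_sketch(syll_df)
print(no_calls_df["labels"].tolist())
```

Here the isolated syllable ‘c’ is removed, while the ‘a’ and ‘b’ syllables that belong to longer sequences are kept.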
-
get_entropy_rate()¶ Calculates the entropy rate of the bird’s syntax based on transition matrices. For more information on entropy rate, refer to online documentation.
- Returns
entropy_rate – Entropy rate of syntax summarised in self.trans_mat. Bounded by 0 and log_2(number of unique syllable states), where larger values reflect more variable / unpredictable syntax.
- Return type
float
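For orientation, the entropy rate of a first-order Markov chain is commonly written as H = -Σ_i p(i) Σ_j P(j|i) log2 P(j|i). The sketch below computes that quantity from a matrix of raw transition counts with NumPy. It assumes p(i) is estimated from how often each syllable occurs as the source of a transition; whether avn weights states in exactly this way is not stated here, so treat `entropy_rate_sketch` as a hypothetical illustration.

```python
import numpy as np


def entropy_rate_sketch(trans_counts):
    """Entropy rate (bits) of a first-order transition count matrix.

    Rows are the current syllable, columns the next syllable.
    """
    counts = np.asarray(trans_counts, dtype=float)
    row_totals = counts.sum(axis=1)
    # Assumption: state probabilities are the empirical source frequencies.
    state_probs = row_totals / row_totals.sum()
    # Conditional probabilities P(next | current); empty rows stay at zero.
    cond_probs = np.divide(counts, row_totals[:, None],
                           out=np.zeros_like(counts),
                           where=row_totals[:, None] > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Treat 0 * log2(0) as 0, as is conventional for entropy.
        log_terms = np.where(cond_probs > 0,
                             cond_probs * np.log2(cond_probs), 0.0)
    return float(-(state_probs[:, None] * log_terms).sum())


# Fully predictable syntax: a -> b -> a -> b ...
print(entropy_rate_sketch([[0, 5], [5, 0]]))
# Maximally variable syntax with 2 states: every transition equally likely.
print(entropy_rate_sketch([[5, 5], [5, 5]]))
```

The two toy matrices hit the stated bounds: the deterministic chain gives 0 and the uniform two-state chain gives log_2(2) = 1.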
-
get_gaps_df()¶ Makes a dataframe with all the gaps between syllables in self.syll_df.
- Returns
gaps_df – Dataframe with columns files, onsets, offsets, labels, and duration which represents each gap as a single row. Onsets and offsets give the timestamps in seconds at which the gap occurs, and duration gives the duration of the gap in seconds. The label for all gaps is ‘silent_gap’.
- Return type
Pandas DataFrame
-
get_intro_notes_df()¶ Determines whether each syllable type in self.syll_df is likely to be an intro note. A syllable is considered a possible intro note if:
- the syllable is among the most common syllables transitioned to from silence
- AND the syllable makes one dominant transition, other than to itself, to a syllable that is not silence.
- Returns
all_intro_notes – Dataframe with columns ‘syllable’, ‘Bird_ID’ and ‘intro_note’ where each row corresponds to a syllable type in self.syll_df, and ‘intro_note’ contains a boolean value reflecting whether or not the syllable meets criteria to be considered an intro note.
- Return type
Pandas DataFrame
-
get_pair_repetition_stats()¶ Analogous to self.get_single_repetition_stats, but with repetitions of a syllable pair, rather than a syllable type. For example, the sequence ‘ababab’ reflects a repetition bout of duration 3 for the syllable pair ‘a’ and ‘b’.
- Returns
rep_count_df (Pandas DataFrame) – Dataframe containing counts of syllable pair occurrences in repetition bouts of different durations for every syllable pair occurring in self.syll_df.labels. There is one row per syllable pair and column names refer to duration of repetition bout. 1 = syllable pair produced but not repeated (eg iabcd). 2 = syllable pair repeated twice in a row only (eg iababcd) etc.
rep_stats_df (Pandas DataFrame) – DataFrame containing statistics about repetition bout length. Columns contain the mean_bout_length, median_bout_length, and CV_bout_length, where bout refers to a repetition bout, ie an instance of the same syllable pair being repeated many times. These values can be used for identifying abnormally repeated song syllables.
-
get_prob_repetitions()¶ Find the probability of transition to self (ie repetition) for each syllable type, and return values in a dataframe.
- Returns
prob_repetition_df – DataFrame with columns Bird_ID, syllable and prob_repetition containing the probability that each syllable type produced by this bird transitions to itself based on the song data in self.syll_df.
- Return type
pandas DataFrame
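The probability of repetition is the share of a syllable's outgoing transitions that lead back to the same syllable. A minimal pandas sketch on a toy label sequence follows; the Bird_ID value ‘bird_1’ is made up, and gap and file-boundary rows are left out for brevity.

```python
import pandas as pd

labels = pd.Series(list("aaabab"))

# Pair each syllable with the one that follows it.
pairs = pd.DataFrame({
    "current": labels.iloc[:-1].to_numpy(),
    "next": labels.iloc[1:].to_numpy(),
})
pairs["is_repeat"] = pairs["current"] == pairs["next"]

# The mean of a boolean column is the fraction of transitions that repeat.
prob_repetition_df = (
    pairs.groupby("current")["is_repeat"]
    .mean()
    .rename("prob_repetition")
    .rename_axis("syllable")
    .reset_index()
)
prob_repetition_df.insert(0, "Bird_ID", "bird_1")  # hypothetical ID
print(prob_repetition_df)
```

In this sequence ‘a’ is followed by ‘a’ in 2 of its 4 outgoing transitions, and ‘b’ never repeats.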
-
get_prop_sylls_in_short_bouts(max_short_bout_len=2)¶ Calculates the proportion of occurrences of each syllable type in self.syll_df that occur in a bout with length equal to or shorter than max_short_bout_len. This can be useful for identifying which syllable types reflect calls.
- Parameters
max_short_bout_len (int, optional) – The maximum length of bout where the bout will be considered ‘short’ and occurrences of syllables within bouts of that length or shorter will contribute to the count of syllables occurring in short bouts. The default value is 2.
- Returns
all_syll_counts_df – Dataframe with columns ‘syllable’, ‘full_count’, ‘short_bout_count’, ‘Bird_ID’ and ‘prop_short_bout’, where ‘syllable’ contains the label of a syllable type in self.syll_df, ‘full_count’ contains the total number of times that syllable occurs in self.syll_df, ‘short_bout_count’ contains the total number of times that syllable occurs in a bout of length max_short_bout_len or shorter, and ‘prop_short_bout’ contains the proportion of all occurrences of the syllable that fall in short bouts. This proportion can be useful for identifying which syllable types represent calls.
- Return type
Pandas DataFrame
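The counting logic can be sketched as follows, assuming bouts are delimited by rows labelled ‘file_start’, ‘file_end’ or ‘silent_gap’. The variable names and the ‘bird_1’ ID are illustrative only, and this is not avn's implementation.

```python
import pandas as pd

max_short_bout_len = 2
labels = pd.Series(["file_start", "a", "b", "c", "silent_gap",
                    "d", "silent_gap", "a", "d", "file_end"])

# Every boundary row starts a new bout, so a running count of boundaries
# gives each syllable a bout identifier.
is_boundary = labels.isin(["file_start", "file_end", "silent_gap"])
bout_id = is_boundary.cumsum()

sylls = pd.DataFrame({"syllable": labels[~is_boundary],
                      "bout_id": bout_id[~is_boundary]})
sylls["bout_len"] = sylls.groupby("bout_id")["syllable"].transform("count")
sylls["in_short_bout"] = (sylls["bout_len"] <= max_short_bout_len).astype(int)

all_syll_counts_df = (
    sylls.groupby("syllable")["in_short_bout"]
    .agg(full_count="size", short_bout_count="sum")
    .reset_index()
)
all_syll_counts_df["Bird_ID"] = "bird_1"  # hypothetical ID
all_syll_counts_df["prop_short_bout"] = (
    all_syll_counts_df["short_bout_count"] / all_syll_counts_df["full_count"]
)
print(all_syll_counts_df)
```

Syllable ‘d’ only ever occurs in bouts of one or two syllables, so its proportion is 1.0, which is the pattern expected of a call.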
-
get_single_repetition_stats()¶ Analyzes repetitions of single syllables. Specifically looks at occurrences of repetition bouts of different durations (ie 2 identical syllables in a row, 3 identical syllables in a row, etc.).
- Returns
rep_count_df (Pandas DataFrame) – Dataframe containing counts of syllable occurrences in repetition bouts of different durations for every syllable type in self.syll_df.labels. There is one row per syllable and column names refer to duration of repetition bout. 1 = syllable produced but not repeated. 2 = syllable repeated twice in a row only etc.
rep_stats_df (Pandas DataFrame) – DataFrame containing statistics about repetition bout length. Columns contain the mean_bout_length, median_bout_length, and CV_bout_length, where bout refers to a repetition bout, ie an instance of the same syllable being repeated many times. These values can be used for identifying introductory notes and/or abnormally repeated song syllables.
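Repetition bouts are runs of identical labels, so they can be found with run-length encoding. The sketch below uses itertools.groupby on a toy sequence and builds tables shaped like the two described above. The choice of population standard deviation (ddof=0) for the CV is an assumption, as the source does not say which estimator avn uses.

```python
from itertools import groupby

import pandas as pd

labels = list("iaaabcaab")

# Run-length encode the sequence: one (syllable, run length) pair per bout.
runs = pd.DataFrame(
    [(syll, len(list(group))) for syll, group in groupby(labels)],
    columns=["syllable", "bout_length"],
)

# One row per syllable, one column per repetition bout length.
rep_count_df = pd.crosstab(runs["syllable"], runs["bout_length"])

grouped = runs.groupby("syllable")["bout_length"]
rep_stats_df = pd.DataFrame({
    "mean_bout_length": grouped.mean(),
    "median_bout_length": grouped.median(),
    # Coefficient of variation = std / mean (ddof=0 is an assumption).
    "CV_bout_length": grouped.std(ddof=0) / grouped.mean(),
})
print(rep_count_df)
print(rep_stats_df)
```

Syllable ‘a’ appears once in a run of 3 and once in a run of 2, giving a mean bout length of 2.5, whereas ‘b’ is never repeated.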
-
make_syntax_raster(alignment_syllable=None, sort_bouts=True)¶ Create a dataframe where each row reflects a song bout (a sequence of syllables flanked by file boundaries or long silent gaps), and each cell contains the label of the song syllable produced at that index in the song bout. This can be plotted using plot_syntax_raster() to get a view of song syntax variability from the subject bird.
- Parameters
alignment_syllable (string, optional) – The alignment syllable should correspond to a syllable label in self.syll_df.labels. If provided, song bouts will be aligned such that the first occurrence of the alignment syllable happens at the same index across bouts. This can make it easier to detect patterns in syntax across bouts. It is generally best to set the alignment syllable to be the first syllable of the dominant song motif, following any intro notes.
sort_bouts (bool, optional) – If True, bouts will be sorted such that bouts with more similar sequences will occupy sequential rows in syntax_raster_df. This can make it easier to detect syntax patterns agnostic to the order in which bouts were produced. If False, the order of bouts in syntax_raster_df will be the order in which the bouts occur in self.syll_df. The default is True.
- Returns
syntax_raster_df – Dataframe where each row reflects a song bout (a sequence of syllables flanked by file boundaries or long silent gaps), and each cell contains the label of the song syllable produced at that index in the song bout, based on self.syll_df.labels. The number of columns depends on the length of the longest song bout. This can be plotted using plot_syntax_raster() to get a view of song syntax variability from the subject bird.
- Return type
Pandas DataFrame
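The alignment step can be pictured with a small pandas sketch. It assumes bouts have already been split into lists of labels, pads the front of each bout so that the first occurrence of the alignment syllable shares a column, and leaves bouts without that syllable unshifted (an assumption). Sorting by sequence similarity is omitted because the sorting method is not described here. `make_raster_sketch` is a hypothetical helper.

```python
import pandas as pd

bouts = [list("iiabc"), list("iabcabc"), list("abc")]


def make_raster_sketch(bouts, alignment_syllable=None):
    """Build a bout-by-position table, optionally aligned on one syllable."""
    if alignment_syllable is None:
        padded = bouts
    else:
        # Position of the first alignment syllable in each bout
        # (0 if the bout does not contain it, which is an assumption).
        first_idx = [bout.index(alignment_syllable)
                     if alignment_syllable in bout else 0
                     for bout in bouts]
        max_idx = max(first_idx)
        # Left-pad each bout so the alignment syllable lands in one column.
        padded = [[None] * (max_idx - idx) + bout
                  for bout, idx in zip(bouts, first_idx)]
    # Ragged rows are filled with missing values on the right.
    return pd.DataFrame(padded)


syntax_raster_sketch = make_raster_sketch(bouts, alignment_syllable="a")
print(syntax_raster_sketch)
```

After alignment on ‘a’, column 2 holds the first ‘a’ of every bout, and the number of columns is set by the longest padded bout.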
-
make_transition_matrix()¶ This function calculates the first order transition matrix of syllables in syll_df. It creates 2 new attributes on self: self.trans_mat, which contains the raw counts of each transition between syllable types, and self.trans_mat_prob, which contains the conditional probability of a transition, given that a particular syllable was just produced.
- Returns
- Return type
None.
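The relationship between the count matrix and the conditional probability matrix can be shown with pandas.crosstab on a toy label sequence. This is a sketch of the general technique rather than avn's code; the local variables reuse the attribute names only for readability.

```python
import pandas as pd

labels = pd.Series(list("iabcabcab"))

# Pair each syllable with its successor to list first-order transitions.
transitions = pd.DataFrame({
    "from_syll": labels.iloc[:-1].to_numpy(),
    "to_syll": labels.iloc[1:].to_numpy(),
})

# Raw counts: rows are the current syllable, columns the next syllable.
trans_mat = pd.crosstab(transitions["from_syll"], transitions["to_syll"])

# Conditional probabilities: divide each row by its total so rows sum to 1.
trans_mat_prob = trans_mat.div(trans_mat.sum(axis=1), axis=0)

print(trans_mat)
print(trans_mat_prob)
```

Because each row of the probability matrix is conditioned on the syllable just produced, every row sums to 1 regardless of how often that syllable occurs.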
-
plot_syntax_raster(syntax_raster_df, figsize=(10, 10), title=None, palette='husl')¶ Plots a syntax_raster_df dataframe.
- Parameters
syntax_raster_df (Pandas DataFrame) – Dataframe where each row reflects a song bout (a sequence of syllables flanked by file boundaries or long silent gaps), and each cell contains the label of the song syllable produced at that index in the song bout, based on self.syll_df.labels. This is returned by self.make_syntax_raster().
figsize (tuple, optional) – Tuple specifying dimensions of output figure. The default is (10, 10).
title (String, optional) – Title of the output figure. The default is None, which will result in a figure without a title.
palette (string or sequence, optional) – String corresponding to the name of a seaborn palette, matplotlib colormap or sequence of colors in any format matplotlib accepts. See seaborn.color_palette() documentation for more information. The default is ‘husl’.
- Returns
- Return type
-
save_syntax_data(output_directory)¶ Saves a copy of .syll_df as a .csv file in the output directory. Also saves CSVs of the transition matrices if they exist, and a syntax analysis metadata.csv file with information on the processes used to modify syll_df and create the transition matrices.
- Parameters
output_directory (string) – Path to a folder in which to save [Bird_ID]_syll_df.csv, [Bird_ID]_syntax_analysis_metadata.csv, and [Bird_ID]_trans_mat.csv and [Bird_ID]_trans_mat_prob.csv, if they exist.
- Returns
syntax_analysis_metadata – Dataframe containing information about the package version and processing steps used in creating the versions of .syll_df and transition matrices that were saved by the function.
- Return type
Pandas DataFrame
-
-
class avn.syntax.Utils¶ Bases: object
Contains syntax analysis utilities.
-
calc_entropy_rate_all_birds(syll_df_folder_path, syll_df_file_name_suffix, song_folder_path, min_gap=0.2, label_column_name=None)¶ Creates a dataframe with the syntax entropy rate of each bird in Bird_IDs.
- Parameters
Bird_IDs (list of strings) – List of Bird_IDs (as strings) for which the entropy rate should be calculated.
syll_df_folder_path (string) – Path to a folder containing a syll_df for each bird in Bird_IDs. The syll_df must be a dataframe with one row for every syllable to be analyzed from the subject bird. It must contain columns onsets and offsets which contain the timestamp in seconds at which the syllable occurs within a file, files which contains the name of the .wav file in which the syllable is found, and labels which contains a categorical label for the syllable type. These can be generated through manual song annotation, or automated labeling methods. The syll_df files must be .csv files named Bird_ID_`syll_df_file_name_suffix`.
syll_df_file_name_suffix (string) – String that specifies the name of the file containing syll_df for each Bird_ID. For example, if syll_df files are named ‘Bird_ID_syll_df.csv’, syll_df_file_name_suffix should be ‘_syll_df.csv’.
song_folder_path (string) – Path to a folder containing subfolders named according to the Bird_IDs, where each subfolder contains the complete set of .wav files used to generate the syll_df loaded from syll_df_folder_path.
min_gap (float, optional) – Minimum duration in seconds for a gap between syllables to be considered syntactically relevant. This value should be selected such that gaps between syllables in a bout are shorter than min_gap, but gaps between bouts are longer than min_gap. The default is 0.2.
label_column_name (string, optional) – If the column of the syll_df containing syllable labels is not called ‘labels’, the name of that column should be specified here as a string. If no value is provided, an existing column called ‘labels’ in syll_df will be used as syllable labels.
- Returns
all_entropy_rates – Dataframe with columns ‘Bird_ID’, ‘entropy_rate’, ‘num_unique_syll_types’ and ‘entropy_rate_norm’. ‘entropy_rate’ contains the raw syntax entropy rate, and ‘entropy_rate_norm’ contains an entropy rate value that is normalized to account for the number of unique syllable types.
- Return type
Pandas DataFrame
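The file layout these batch functions expect can be made concrete with pathlib. The helper `expected_paths_sketch` and the bird ID ‘B123’ are hypothetical; the sketch only restates the naming rules given in the parameter descriptions (a .csv named Bird_ID followed by the suffix, and one song subfolder per Bird_ID).

```python
from pathlib import Path


def expected_paths_sketch(bird_id, syll_df_folder_path,
                          syll_df_file_name_suffix, song_folder_path):
    """Return the syll_df .csv path and the song subfolder for one bird."""
    # The file name is the Bird_ID immediately followed by the suffix.
    syll_df_path = (Path(syll_df_folder_path)
                    / f"{bird_id}{syll_df_file_name_suffix}")
    # Songs live in a subfolder named after the Bird_ID.
    bird_song_folder = Path(song_folder_path) / bird_id
    return syll_df_path, bird_song_folder


syll_df_path, bird_song_folder = expected_paths_sketch(
    "B123", "data/syll_dfs", "_syll_df.csv", "data/songs"
)
print(syll_df_path.name)      # B123_syll_df.csv
print(bird_song_folder.name)  # B123
```

Checking that these paths exist for every entry in Bird_IDs before running a batch function is a cheap way to catch naming mismatches early.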
-
get_syll_stats_all_birds(syll_df_folder_path, syll_df_file_name_suffix, song_folder_path, min_gap=0.2, label_column_name=None, max_short_bout_len=2)¶ Compile all per-syllable syntax statistics from all birds in Bird_IDs into a single dataframe. This dataframe can then be used to detect syllables with abnormal repetition patterns.
- Parameters
Bird_IDs (list of strings) – List of Bird_IDs (as strings) for which per-syllable syntax statistics should be compiled.
syll_df_folder_path (string) – Path to a folder containing a syll_df for each bird in Bird_IDs. The syll_df must be a dataframe with one row for every syllable to be analyzed from the subject bird. It must contain columns onsets and offsets which contain the timestamp in seconds at which the syllable occurs within a file, files which contains the name of the .wav file in which the syllable is found, and labels which contains a categorical label for the syllable type. These can be generated through manual song annotation, or automated labeling methods. The syll_df files must be .csv files named Bird_ID_`syll_df_file_name_suffix`.
syll_df_file_name_suffix (string) – String that specifies the name of the file containing syll_df for each Bird_ID. For example, if syll_df files are named ‘Bird_ID_syll_df.csv’, syll_df_file_name_suffix should be ‘_syll_df.csv’.
song_folder_path (string) – Path to a folder containing subfolders named according to the Bird_IDs, where each subfolder contains the complete set of .wav files used to generate the syll_df loaded from syll_df_folder_path.
min_gap (float, optional) – Minimum duration in seconds for a gap between syllables to be considered syntactically relevant. This value should be selected such that gaps between syllables in a bout are shorter than min_gap, but gaps between bouts are longer than min_gap. The default is 0.2.
label_column_name (string, optional) – If the column of the syll_df containing syllable labels is not called ‘labels’, the name of that column should be specified here as a string. If no value is provided, an existing column called ‘labels’ in syll_df will be used as syllable labels.
max_short_bout_len (int, optional) – The maximum length of bout where the bout will be considered ‘short’ and occurrences of syllables within bouts of that length or shorter will contribute to the count of syllables occurring in short bouts. This is used to identify calls. The default value is 2.
- Returns
syll_stats_all_birds – Dataframe with one row for each unique syllable type produced by each bird in Bird_IDs containing information about the repetition and syntax patterns of each syllable. This can be used for detecting abnormal syllable types with Utils.identify_abnormal_syllables().
- Return type
Pandas DataFrame
-
identify_abnormal_syllables(std_cutoff=2, exclude_calls=True, exclude_intro_notes=True, syll_labels_to_exclude=[-1], prop_short_bout_cutoff=0.5)¶ Identifies syllables that are over std_cutoff standard deviations from the mean in terms of mean_repetition_length or CV_repetition_length, and returns a version of syll_stats_all_birds with a new column ‘abnormal_repetition’ containing a boolean to indicate whether that syllable exhibits unusually high repetition or repetition variability.
- Parameters
syll_stats_all_birds (Pandas DataFrame) – Dataframe with one row for each unique syllable type produced by each bird in Bird_IDs containing information about the repetition and syntax patterns of each syllable.
std_cutoff (float, optional) – The number of standard deviations from the mean a syllable feature must be for that syllable to be identified as ‘abnormal’. The default value is 2.
exclude_calls (bool, optional) – If True, syllables which occur in short bouts more than prop_short_bout_cutoff proportion of the time will be considered calls, and will not be considered when calculating the mean and std used to identify abnormal syllable types. These calls also cannot be identified as ‘abnormal’. If False, syllables occurring in short bouts at high rates will be treated like standard syllables. The default value is True.
exclude_intro_notes (bool, optional) – If True, syllables with intro_note == True will not be considered when calculating the mean and std used to identify abnormal syllable types. These intro notes also cannot be identified as ‘abnormal’. If False, intro notes will be treated like standard syllables. The default value is True.
syll_labels_to_exclude (list, optional) – List of syllable labels that should not be considered when calculating the mean and std used to identify abnormal syllables. For example, if syllables are labeled automatically with HDBSCAN, the label ‘-1’ doesn’t reflect a relevant grouping of syllables, and thus shouldn’t contribute to population statistics about syllable repetition patterns. The default value is [-1].
prop_short_bout_cutoff (float between 0 and 1, optional) – If exclude_calls == True, syllables which occur in short bouts with a proportion greater than this value will be considered calls and be excluded from analysis of abnormal syllables. The default is 0.5.
- Returns
syll_stats_all_birds – Copy of input syll_stats_all_birds dataframe, with a column called ‘abnormal_repetition’ added, which contains a boolean value indicating whether that syllable has a mean_repetition_length or CV_repetition_length over std_cutoff standard deviations from the mean.
- Return type
Pandas DataFrame
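The exclusion and cutoff logic can be sketched in pandas. The helper `flag_abnormal_sketch` and the toy values are hypothetical. Two assumptions are made loudly here: only values above the mean are flagged (matching the ‘unusually high’ wording), and the sample standard deviation is used. avn's implementation may differ on both points.

```python
import pandas as pd

syll_stats = pd.DataFrame({
    "Bird_ID": ["b1"] * 5 + ["b2"] * 6,
    "syllable": [0, 1, 2, 3, -1, 0, 1, 2, 3, 4, 5],
    "mean_repetition_length": [1.0, 1.1, 1.0, 1.2, 3.0,
                               1.1, 1.0, 1.1, 6.0, 4.0, 1.0],
    "CV_repetition_length": [0.10, 0.12, 0.10, 0.11, 0.90,
                             0.10, 0.12, 0.11, 0.10, 0.50, 0.10],
    "intro_note": [False] * 9 + [True, False],
    "prop_short_bout": [0.0] * 10 + [0.9],
})


def flag_abnormal_sketch(stats, std_cutoff=2, exclude_calls=True,
                         exclude_intro_notes=True,
                         syll_labels_to_exclude=(-1,),
                         prop_short_bout_cutoff=0.5):
    """Flag syllables whose repetition stats exceed mean + cutoff * std."""
    stats = stats.copy()
    # Syllables that contribute to the population mean and std.
    eligible = ~stats["syllable"].isin(list(syll_labels_to_exclude))
    if exclude_calls:
        eligible &= stats["prop_short_bout"] <= prop_short_bout_cutoff
    if exclude_intro_notes:
        eligible &= ~stats["intro_note"]

    abnormal = pd.Series(False, index=stats.index)
    for col in ["mean_repetition_length", "CV_repetition_length"]:
        mean = stats.loc[eligible, col].mean()
        std = stats.loc[eligible, col].std()  # sample std (assumption)
        # Excluded syllables can never be flagged.
        abnormal |= eligible & (stats[col] > mean + std_cutoff * std)
    stats["abnormal_repetition"] = abnormal
    return stats


flagged = flag_abnormal_sketch(syll_stats)
print(flagged[["Bird_ID", "syllable", "abnormal_repetition"]])
```

Only syllable 3 of bird ‘b2’, with a mean repetition length of 6.0, is flagged; the intro note, the call and the ‘-1’ label are ignored both when computing the statistics and when flagging.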
-
merge_per_syll_stats(short_bout_counts, intro_notes_df)¶ Merge 3 dataframes containing syntax related measures per syllable type into a single dataframe with all per syllable syntax stats.
- Parameters
single_rep_stats (Pandas DataFrame) – Dataframe with columns ‘Bird_ID’ and ‘syllable’, as well as other columns with summary statistics, which contains one row per unique syllable type in the bird’s repertoire. This could be the single_rep_stats dataframe returned by .get_single_repetition_stats().
short_bout_counts (Pandas DataFrame) – Dataframe with columns ‘Bird_ID’ and ‘syllable’, as well as other columns with summary statistics, which contains one row per unique syllable type in the bird’s repertoire. This could be the ‘short_bout_counts’ dataframe returned by .get_prop_sylls_in_short_bouts().
intro_notes_df (Pandas DataFrame) – Dataframe with columns ‘Bird_ID’ and ‘syllable’, as well as other columns with summary statistics, which contains one row per unique syllable type in the bird’s repertoire. This could be the ‘intro_notes_df’ dataframe returned by .get_intro_notes_df().
- Returns
syllable_syntax_stats – DataFrame resulting from merge of the 3 input dataframes on columns ‘Bird_ID’ and ‘syllable’.
- Return type
Pandas DataFrame
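A merge on ‘Bird_ID’ and ‘syllable’ can be sketched with pandas as below. The toy column values are invented, and the use of an inner join is an assumption, since the join type is not specified in this description.

```python
from functools import reduce

import pandas as pd

keys = {"Bird_ID": ["bird_1", "bird_1"], "syllable": ["a", "b"]}

# Three toy per-syllable tables sharing the two key columns.
single_rep_stats = pd.DataFrame({**keys, "mean_bout_length": [2.5, 1.0]})
short_bout_counts = pd.DataFrame({**keys, "prop_short_bout": [0.5, 0.0]})
intro_notes_df = pd.DataFrame({**keys, "intro_note": [False, False]})

# Chain the merges so each table is joined on both key columns.
syllable_syntax_stats = reduce(
    lambda left, right: left.merge(right, on=["Bird_ID", "syllable"]),
    [single_rep_stats, short_bout_counts, intro_notes_df],
)
print(syllable_syntax_stats)
```

Merging on both columns matters when statistics from several birds are stacked, because the same syllable label can appear in more than one bird.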
-
plot_syntax_raster_all_birds(syll_df_folder_path, syll_df_file_name_suffix, song_folder_path, min_gap=0.2, label_column_name=None, figsize=(10, 10), sort_bouts=True, calc_entropy_rate=True)¶ Plots the syntax raster plot for each bird in Bird_IDs.
- Parameters
Bird_IDs (list of strings) – List of Bird_IDs (as strings) for which the syntax raster should be plotted.
syll_df_folder_path (string) – Path to a folder containing a syll_df for each bird in Bird_IDs. The syll_df must be a dataframe with one row for every syllable to be analyzed from the subject bird. It must contain columns onsets and offsets which contain the timestamp in seconds at which the syllable occurs within a file, files which contains the name of the .wav file in which the syllable is found, and labels which contains a categorical label for the syllable type. These can be generated through manual song annotation, or automated labeling methods. The syll_df files must be .csv files named Bird_ID_`syll_df_file_name_suffix`.
syll_df_file_name_suffix (string) – String that specifies the name of the file containing syll_df for each Bird_ID. For example, if syll_df files are named ‘Bird_ID_syll_df.csv’, syll_df_file_name_suffix should be ‘_syll_df.csv’.
song_folder_path (string) – Path to a folder containing subfolders named according to the Bird_IDs, where each subfolder contains the complete set of .wav files used to generate the syll_df loaded from syll_df_folder_path.
min_gap (float, optional) – Minimum duration in seconds for a gap between syllables to be considered syntactically relevant. This value should be selected such that gaps between syllables in a bout are shorter than min_gap, but gaps between bouts are longer than min_gap. The default is 0.2.
label_column_name (string, optional) – If the column of the syll_df containing syllable labels is not called ‘labels’, the name of that column should be specified here as a string. If no value is provided, an existing column called ‘labels’ in syll_df will be used as syllable labels.
figsize (tuple, optional) – Tuple to specify dimensions of each output syntax raster plot. The default is (10,10).
sort_bouts (bool, optional) – If True, bouts will be sorted such that bouts with more similar sequences will occupy sequential rows in the plot. This can make it easier to detect syntax patterns agnostic to the order in which bouts were produced. If False, the order of bouts in syntax_raster_df will be the order in which the bouts occur in self.syll_df. The default is True.
calc_entropy_rate (bool, optional) – Determines whether entropy rate is calculated for each bird. If True, entropy rate will be calculated and reported in the title of the syntax raster plot for each bird. The default is True.
- Returns
- Return type
-
plot_transition_matrix_all_birds(syll_df_folder_path, syll_df_file_name_suffix, song_folder_path, min_gap=0.2, calc_entropy_rate=True, label_column_name=None, trans_mat_version='prob', figsize=(10, 8))¶ Plots the transition matrices of all birds in Bird_IDs
- Parameters
Bird_IDs (list of strings) – List of Bird_IDs (as strings) for which the transition matrix should be plotted.
syll_df_folder_path (string) – Path to a folder containing a syll_df for each bird in Bird_IDs. The syll_df must be a dataframe with one row for every syllable to be analyzed from the subject bird. It must contain columns onsets and offsets which contain the timestamp in seconds at which the syllable occurs within a file, files which contains the name of the .wav file in which the syllable is found, and labels which contains a categorical label for the syllable type. These can be generated through manual song annotation, or automated labeling methods. The syll_df files must be .csv files named Bird_ID_`syll_df_file_name_suffix`.
syll_df_file_name_suffix (string) – String that specifies the name of the file containing syll_df for each Bird_ID. For example, if syll_df files are named ‘Bird_ID_syll_df.csv’, syll_df_file_name_suffix should be ‘_syll_df.csv’.
song_folder_path (string) – Path to a folder containing subfolders named according to the Bird_IDs, where each subfolder contains the complete set of .wav files used to generate the syll_df loaded from syll_df_folder_path.
min_gap (float, optional) – Minimum duration in seconds for a gap between syllables to be considered syntactically relevant. This value should be selected such that gaps between syllables in a bout are shorter than min_gap, but gaps between bouts are longer than min_gap. The default is 0.2.
calc_entropy_rate (bool, optional) – Determines whether entropy rate is calculated for each bird. If True, entropy rate will be calculated and reported in the title of the transition matrix plot for each bird. The default is True.
label_column_name (string, optional) – If the column of the syll_df containing syllable labels is not called ‘labels’, the name of that column should be specified here as a string. If no value is provided, an existing column called ‘labels’ in syll_df will be used as syllable labels.
trans_mat_version ('prob' or 'count', optional) – Specifies whether to plot transition probabilities in the transition matrix or counts of transitions in the dataset. The default value is ‘prob’ which results in the plotting of transition probabilities between syllables.
figsize (tuple, optional) – Tuple which sets the dimensions of each output transition matrix plot. The default is (10, 8).
- Returns
- Return type
-