‘soundcorrs’ is a small package whose purpose is to help linguists analyse sound correspondences between languages. It does not attempt to draw any conclusions on its own; this responsibility is placed entirely on the user. ‘soundcorrs’ merely automates and facilitates certain tasks, such as preparing the material part of the paper, looking for examples of specific correspondences, or applying series of sound changes, and, by making various functions available, suggests possible paths of analysis which may not be immediately obvious to the more traditional linguist.
This vignette assumes that the reader not only is a linguist and has at least a general idea about what kind of outputs he or she might want from ‘soundcorrs’, but also has at least a passing familiarity with R and a basic understanding of statistics. Most problems can probably be read up on as they appear in the text, but it is nevertheless recommended to start by very briefly acquainting oneself with R by reading the first page of, for example, Quick-R, R Tutorial, or another R primer. In particular, it is assumed that the reader will know how to access and understand the built-in documentation, as not all of the arguments are discussed here. Another highly recommended topic, without which ‘soundcorrs’ cannot be used to its full potential, is regular expressions. An accessible introduction can be found in R.D. Peng’s R Programming for Data Science, as well as in many other places around the Internet, and a handy cheat sheet has been made available by RStudio.
Though a little dated by now, a less technical introduction to ‘soundcorrs’ is also available in Stachowski K. 2020. Tools for Semi-Automatic Analysis of Sound Correspondences: The soundcorrs Package for R. Glottometrics 49. 66–86. If you use ‘soundcorrs’ in your research, please cite this paper.
The first section of this vignette briefly discusses how to prepare data for ‘soundcorrs’. The second section is an overview of all the analytic functions exported by ‘soundcorrs’, organized by their output, and of the helper functions, in alphabetical order.
‘soundcorrs’ functions operate on pairs/triples/… of words which come from different languages. The discussion below will use ‘L1’ to refer to the first language in the dataset, ‘L2’ to the second, etc.
Naturally, all the examples given below assume that ‘soundcorrs’ is installed and loaded:
# install.packages ("soundcorrs")
library (soundcorrs)
#> NOTE. Version 0.2.0 introduced some important changes.
#> Please run vignette("soundcorrs")
#>  and consult https://cran.r-project.org/web/packages/soundcorrs/NEWS.
Whether you intend to use ‘soundcorrs’ for analysis of sound correspondences or sound changes, you will need to define a transcription. For sound correspondences, you will additionally need to segment and align word pairs/triples/…, and for sound changes, define the sound changes. Transcription and word pairs/triples/… are stored in tsv files, i.e. as tab-separated tables in text files. Sound changes are functions, so they are stored as R code.
Under BSD, Linux, and macOS, the recommended encoding is UTF-8. Unfortunately, it has been found to cause problems under Windows, so Windows users are advised to not use characters outside of the ASCII standard. Some issues can be fixed by converting from UTF-8 to UTF-8 (sic!) with ‘iconv()’, but others resist this and other treatments. Future versions of ‘soundcorrs’ hope to include a solution for this problem.
Transcription is not strictly necessary for the very functioning of ‘soundcorrs’, but without it linguistic regular expressions (“wildcards”) could not be defined, and involvement of phonetics in the analysis would be made more difficult. Transcription is stored in tsv files with two or three columns:
GRAPHEME, which contains the graphemes. Characters used by R as metacharacters in regular expressions, i.e. . + * ^ $ ? | ( ) [ ] { }, are not allowed. Multigraphs also should not be used as they can lead to unexpected and incorrect results, especially in the case of metacharacters (“wildcards”).
VALUE, which contains a comma-separated list of features of the given grapheme. These are intended to be phonetic but do not necessarily have to be so. If the column META is missing, it is generated based on the column VALUE.
META, which contains a regular expression covering all the graphemes the given grapheme is meant to represent. In regular graphemes, this is simply the grapheme itself. In a metacharacter, such as ‘C’ for ‘any consonant’, this needs to be a listing of all consonantal graphemes in the transcription file, formatted as a regular expression. It is recommended to leave this column empty, as in such case ‘soundcorrs’ will generate it automatically. Beware, however, that in this process any grapheme whose value is a subset of the value of another grapheme will become a metacharacter for that other grapheme.
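The subset rule described for the META column can be illustrated with a few lines of base R. This is only a sketch of the idea, not the code ‘soundcorrs’ itself uses; the toy graphemes and feature names are made up for the illustration.

```r
# a toy transcription table (hypothetical graphemes and features)
trans <- data.frame(
    GRAPHEME = c("a", "i", "p", "V", "C"),
    VALUE    = c("vowel,low", "vowel,high", "cons,stop", "vowel", "cons"),
    stringsAsFactors = FALSE
)

# explode the comma-separated features into vectors
feats <- strsplit(trans$VALUE, ",", fixed = TRUE)
names(feats) <- trans$GRAPHEME

# for each grapheme, list the graphemes whose features include all of its own
meta <- vapply(seq_along(feats), function (i) {
    covered <- vapply(feats, function (f) all(feats[[i]] %in% f), logical(1))
    hits <- names(feats)[covered]
    # a grapheme always covers itself; drop it when it also covers others
    if (length(hits) > 1) hits <- setdiff(hits, names(feats)[i])
    if (length(hits) == 1) hits else paste0("(", paste(hits, collapse = "|"), ")")
}, character(1))
names(meta) <- trans$GRAPHEME

meta[["a"]]    # "a": a regular grapheme stands for itself
meta[["V"]]    # "(a|i)": the value of "V" is a subset of the values of "a" and "i"
```

In this toy table ‘V’ becomes a metacharacter only because its single feature is shared by ‘a’ and ‘i’; a regular grapheme with an equally underspecified value would be turned into a metacharacter in the same way, which is the pitfall mentioned above.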
‘soundcorrs’ contains two sample transcription files: ‘trans-common.tsv’ and ‘trans-ipa.tsv’. Both only cover the basics and are intended more as an illustration than anything else. The paths of the data files can be established with ‘system.file()’ as in the snippet below, or the entire ‘transcription’ objects can be loaded using ‘loadSampleDataset()’ (see below).
# establish the paths of the samples included in ‘soundcorrs’
path.trans.com <- system.file ("extdata", "trans-common.tsv", package="soundcorrs")
path.trans.ipa <- system.file ("extdata", "trans-ipa.tsv", package="soundcorrs")
# and load them
trans.com <- read.transcription (path.trans.com)
trans.ipa <- read.transcription (path.trans.ipa)
#> Warning in transcription(data, col.grapheme, col.meta, col.value): Missing
#> the metacharacters column. The "META" column was generated.
# transcription needs to be an object of class ‘transcription’
class (trans.com)
#> [1] "transcription"
# a basic summary
trans.com
#> A "transcription" object.
#>   File: /tmp/RtmpuumB5F/Rinst5910145558fb/soundcorrs/extdata/trans-common.tsv.
#>   Graphemes: 75.
# ‘data’ is the original data frame
# ‘cols’ is a guide to column names in ‘data’
# ‘meta’ is a vector of characters which act as metacharacters
# ‘values’ is a named list of the values of graphemes, exploded into vectors
# ‘zero’ are the characters denoting the linguistic zero
str (trans.com, max.level=1)
#> List of 5
#>  $ data  :'data.frame':  75 obs. of  3 variables:
#>  $ cols  :List of 3
#>  $ meta  : Named chr [1:7] "(ᴍ|m|p|ʙ|b|φ|β|f|v|n|ɴ|t|ᴅ|d|c|ʒ|s|ᴢ|z|θ|δ|l|ʟ|č|ǯ|š|ž|ś|ź|ć|r|ʀ|ŋ|k|ɢ|g|χ|γ|h|ɦ)" "(ᴍ|m|n|ɴ|ŋ)" "(p|ʙ|b|t|ᴅ|d|k|ɢ|g)" "(s|ᴢ|z|š|ž|ś|ź)" ...
#>   ..- attr(*, "names")= chr [1:7] "C" "N" "P" "S" ...
#>  $ values:List of 75
#>  $ zero  : chr "-"
#>  - attr(*, "class")= chr "transcription"
#>  - attr(*, "file")= chr "/tmp/RtmpuumB5F/Rinst5910145558fb/soundcorrs/extdata/trans-common.tsv"
Like the transcription, the data are also stored in tsv files. Two formats are theoretically possible: the “long format” in which every word is given its own row, and the “wide format” in which one row holds a pair/triple/… of words (see below).
With the notable exception of sound changes, words need to be segmented for most tasks in ‘soundcorrs’, and all words in a pair/triple/… must have the same number of segments. The default segment separator is ‘|’. If the words are not segmented, the function ‘addSeparators()’ can be used to facilitate the process of manual segmentation and alignment (see below). Tools for automatic alignment also exist (e.g. alineR, LingPy, PyAline), but it is recommended that their results be thoroughly checked by a human. Apart from the segmented and aligned form, each word must be assigned a language.
Hence, the two obligatory columns in the “long format” are
ALIGNED which holds the segmented and aligned word, and
LANGUAGE which holds the name of the language.
In the “wide format”, similarly, a minimum of two columns is necessary, each holding words from a different language. The information about which column holds which language can then be encoded simply as column names (e.g. ‘LATIN’), or in the form of a suffix attached to the names (e.g. ‘ALIGNED.Latin’).
Regarding the two formats, see also ‘long2wide()’ and ‘wide2long()’ below.
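As a rough illustration of the two layouts and of the requirement that aligned words have the same number of segments, here is a sketch in base R. It uses the generic ‘reshape()’ function rather than the ‘long2wide()’ helper from ‘soundcorrs’, and the words are invented; ‘count.segments’ is a hypothetical helper defined only for this example.

```r
# "long format": one word per row (made-up data)
long <- data.frame(
    ID       = c(1, 1, 2, 2),
    LANGUAGE = c("L1", "L2", "L1", "L2"),
    ALIGNED  = c("a|b|c", "a|b|c", "a|b|-", "o|w|c"),
    stringsAsFactors = FALSE
)

# "wide format": one pair of words per row, language encoded as a suffix
wide <- reshape(long, idvar = "ID", timevar = "LANGUAGE", direction = "wide")
names(wide)    # "ID" "ALIGNED.L1" "ALIGNED.L2"

# hypothetical helper: number of segments, assuming the default separator "|"
count.segments <- function (x) lengths(strsplit(x, "|", fixed = TRUE))

# every pair must have the same number of segments in both languages
same.length <- all(count.segments(wide$ALIGNED.L1) == count.segments(wide$ALIGNED.L2))
same.length    # TRUE
```

A check of this kind can be useful before the data are handed to the reader function, since misaligned pairs are easier to fix in the source file.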
It is possible, though not necessarily recommended, to store data from each language in a separate file; it is also possible to use a different transcription for each language. This flexibility can easily lead to a somewhat cumbersome string of arguments for the reader function, so the ‘read.soundcorrs’ function is limited to reading the data for just one language. Individual ‘soundcorrs’ objects can then be combined into one using the ‘merge’ function. The reader function only accepts data in the “wide format”.
‘soundcorrs’ has three sample datasets: 1. the entirely made-up ‘data-abc.tsv’; 2. ‘data-capitals.tsv’ which contains the names of EU capitals in German, Polish and Spanish – from the linguistic point of view, this of course makes no sense; it is merely an example that will hopefully not be seen as too exotic regardless of which language or languages the user specializes in (my gratitude is due to José Andrés Alonso de la Fuente, PhD (Cracow, Poland) for help with Spanish data); and 3. ‘data-ie.tsv’ with a dozen examples of Grimm’s and Verner’s laws (adapted from Campbell L. 2013. Historical Linguistics. An Introduction. Edinburgh University Press. Pp. 136f). The ‘abc’ dataset is in the “long format”, the ‘capitals’ and ‘ie’ datasets are in the “wide format”. The paths of the data files can be established with ‘system.file()’ as in the snippet below, or the entire ‘soundcorrs’ objects can be loaded using ‘loadSampleDataset()’ (see below).
# establish the paths of the three datasets
path.abc <- system.file ("extdata", "data-abc.tsv", package="soundcorrs")
path.cap <- system.file ("extdata", "data-capitals.tsv", package="soundcorrs")
path.ie <- system.file ("extdata", "data-ie.tsv", package="soundcorrs")
# read “capitals”
d.cap.ger <- read.soundcorrs (path.cap, "German", "ALIGNED.German", path.trans.com)
#> Warning in soundcorrs(data, name, col.aligned,
#> read.transcription(transcription), : The following segments are not covered
#> by the transcription: jus, ŋk.
d.cap.pol <- read.soundcorrs (path.cap, "Polish", "ALIGNED.Polish", path.trans.com)
#> Warning in soundcorrs(data, name, col.aligned,
#> read.transcription(transcription), : The following segments are not covered
#> by the transcription: ń, ẃ.
d.cap.spa <- read.soundcorrs (path.cap, "Spanish", "ALIGNED.Spanish", path.trans.com)
#> Warning in soundcorrs(data, name, col.aligned,
#> read.transcription(transcription), : The following segments are not covered
#> by the transcription: ð, ja, ŋk.
d.cap <- merge (d.cap.ger, d.cap.pol, d.cap.spa)
# read “ie”
d.ie.lat <- read.soundcorrs (path.ie, "Lat", "LATIN", path.trans.com)
d.ie.eng <- read.soundcorrs (path.ie, "Eng", "ENGLISH", path.trans.ipa)
#> Warning in transcription(data, col.grapheme, col.meta, col.value): Missing
#> the metacharacters column. The "META" column was generated.
#> Warning in soundcorrs(data, name, col.aligned,
#> read.transcription(transcription), : The following segments are not covered
#> by the transcription: eɪ, ɪ, aʊ, uː, ɑː, ʊ, iː.
d.ie <- merge (d.ie.lat, d.ie.eng)
# read “abc”
tmp <- long2wide (read.table(path.abc,header=T), skip=c("ID"))
d.abc.l1 <- soundcorrs (tmp, "L1", "ALIGNED.L1", trans.com)
d.abc.l2 <- soundcorrs (tmp, "L2", "ALIGNED.L2", trans.com)
d.abc <- merge (d.abc.l1, d.abc.l2)
# some basic summary
d.abc.l1
#> A "soundcorrs" object.
#>   Languages (1): L1.
#>   Entries: 6.
#>   Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.
d.abc
#> A "soundcorrs" object.
#>   Languages (2): L1, L2.
#>   Entries: 6.
#>   Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.
# ‘cols’ are the names of important columns
# ‘data’ is the original data frame
# ‘names’ are the names of the languages
# ‘segms’ are words exploded into segments; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
# ‘segpos’ is a lookup list to check which character belongs to which segment; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
# ‘separators’ are the strings used as segment separators
# ‘trans’ are ‘transcription’ objects
# ‘words’ are words obtained by removing separators from the ‘col.aligned’ column; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
str (d.abc, max.level=1)
#> List of 8
#>  $ cols      :List of 2
#>  $ data      :'data.frame':  6 obs. of  7 variables:
#>  $ names     : chr [1:2] "L1" "L2"
#>  $ segms     :List of 2
#>  $ segpos    :List of 2
#>  $ separators: chr [1:2] "\\|" "\\|"
#>  $ trans     :List of 2
#>  $ words     :List of 2
#>  - attr(*, "class")= chr "soundcorrs"
Unlike the transcription and linguistic data, sound changes in ‘soundcorrs’ are in fact bits of code. (The distinction is less than sharp in R, but this is a separate topic.) This gives the user much greater control, while remaining convenient because simple sound changes can be translated automatically.
A sound change function must take two arguments: ‘x’ and ‘meta’, where ‘x’ is a character string to which the change is to be applied, and ‘meta’ is a piece of metadata that might need to be taken into account. The return value of a sound change function must be a vector of character strings, possibly of length 1, i.e. a single character string. The ability to output multiple strings, however, is important because it allows for regressive changes (e.g. to account for both possible sources of a vowel that merged with another vowel).
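To make the point about multiple return values concrete, below is a minimal sketch of a regressive change written in base R. The merger it undoes (earlier *o and *a both yielding ‘a’) and the name ‘unmerge.a’ are invented for the illustration, and for simplicity the function treats all instances of ‘a’ in a word alike.

```r
# hypothetical regressive change: "a" may continue either earlier *a or earlier *o
unmerge.a <- function (x, meta) {
    # 'meta' is not needed here, but the two-argument signature is kept
    candidates <- c(x, gsub("a", "o", x, fixed = TRUE))
    # if the word contains no "a", both candidates are identical
    unique(candidates)
}

unmerge.a("kat", NULL)    # "kat" "kot"
unmerge.a("kit", NULL)    # "kit"
```

The result is always a character vector: of length 2 when the word could go back to two different forms, and of length 1 otherwise.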
A sound change function is not expected to be used directly by the user. It is intended to be wrapped in a ‘soundchange’ object using the ‘soundchange()’ constructor function. Apart from the function itself, such an object holds the name and a brief description of the change.
Simpler changes, ones that can be written in the form of a single regular expression, need not be explicitly defined as functions. The ‘soundchange()’ constructor accepts both functions and character strings; the latter are then converted into functions automatically. Such a string must contain exactly one ‘>’ or ‘<’, possibly surrounded by spaces.
‘soundcorrs’ contains three sample changes files: ‘change-dl2l.R’, ‘change-palatalization.R’, and ‘change-rhotacism.R’. The paths of the data files can be established with ‘system.file()’ as was done with the ‘transcription’ and ‘soundcorrs’ objects above, or the entire ‘soundchange’ objects can be loaded using ‘loadSampleDataset()’ (see below). Note that these sample changes are stored as pieces of R code, so they must be loaded using ‘source()’ rather than ‘read.table()’ or similar.
# a simple sound change
sc.V2a <- soundchange ("V > a", "V>a", trans.com, "All vowels change into a.")
# basic summary
sc.V2a
#> A "soundchange" object.
#>   Name: V>a.
#>   Description: All vowels change into a.
#>   Transcription: /tmp/RtmpuumB5F/Rinst5910145558fb/soundcorrs/extdata/trans-common.tsv.
# ‘name’ is the name of the sound change
# ‘desc’ is a brief description
# ‘fun’ is the sound change function
# ‘trans’ is the transcription used in the change function
str (sc.V2a, max.level=1)
#> List of 4
#>  $ name : chr "V>a"
#>  $ desc : chr "All vowels change into a."
#>  $ fun  :function (x, meta)  
#>  $ trans:List of 5
#>   ..- attr(*, "class")= chr "transcription"
#>   ..- attr(*, "file")= chr "/tmp/RtmpuumB5F/Rinst5910145558fb/soundcorrs/extdata/trans-common.tsv"
#>  - attr(*, "class")= chr "soundchange"
# if need be, functions inside ‘soundchange’ objects can be applied directly
sc.V2a$fun ("ouroboros", NULL)
#> [1] "aarabaras"
# a slightly more complex change
sc.VV2a <- soundchange ("V{2,} > a", "VV>a", trans.com, "Only diphthongs change into a.")
sc.VV2a$fun ("ouroboros", NULL)
#> [1] "aroboros"
# a slightly more complex change
sc.CV2Ca <- soundchange ("(C)V > \\1a", "CV>Ca", trans.com, "Only postconsonantal vowels change into a.")
sc.CV2Ca$fun ("ouroboros", NULL)
#> [1] "ourabaras"
# a more complex sound change
sc.2ndV2a.fun <- function (x, meta) {
    tmp <- gregexpr (expandMeta(trans.com,"V+"), x)
    regmatches(x,tmp)[[1]][2] <- "a"
    return (x)
}
sc.2ndV2a <- soundchange (sc.2ndV2a.fun, "2ndV>a", trans.com,
    "Only the vowel in the second syllable changes into a.")
sc.2ndV2a$fun ("ouroboros", NULL)
#> [1] "ouraboros"
‘soundcorrs’ exports several functions intended for linguistic analysis. For easier orientation, they are organized below by what kind of outputs they produce, rather than by their names. ‘soundcorrs’ also exports several functions whose use for linguistic analysis, in and of themselves, is rather limited. Those are grouped in one subsection at the end, and discussed in alphabetical order.
There are three different functions in ‘soundcorrs’ that produce contingency tables. There is some logic behind it. ‘summary()’ is only meant to give a general overview of the dataset; ‘coocc()’ is the essential contingency table function; and ‘allCooccs()’ produces an output that is meant to be printed rather than read from the screen.
‘summary()’ produces a segment-to-segment contingency table. The values may represent how many times the given segments correspond to each other (‘unit=“o”’) or in how many words they correspond to each other (‘unit=“w”’). This distinction exists because it is quite possible that there will be a segment which appears more than once in a single word. The argument ‘unit’ accepts nine different values: ‘“o(cc(ur(ence(s))))”’ and ‘“w(or(d(s)))”’. One more argument can be given to ‘summary()’: ‘count’, which determines whether the values are given as absolute or relative. It accepts six values: ‘“a(bs(olute))”’ and ‘“r(el(ative))”’.
Note that ‘summary()’ reports how many times the given segments correspond to each other – not how often they co-occur in the same word. For example, in a pair L1 “a|b” : L2 “c|d”, the “a”/“d” cell will show 0 because L1 “a” never corresponds directly to L2 “d”. This is a different perspective than in ‘coocc()’ below.
# a general overview of the dataset as a whole
summary (d.abc)
#>    L2
#> L1  a b c o u w ə
#>   - 0 0 0 0 0 0 2
#>   a 4 0 0 1 1 0 0
#>   b 0 5 0 0 0 1 0
#>   c 0 0 6 0 0 0 0
# words are the default ‘unit’
summary (d.abc, unit="o")
#>    L2
#> L1  a b c o u w ə
#>   - 0 0 0 0 0 0 2
#>   a 6 0 0 1 2 0 0
#>   b 0 5 0 0 0 1 0
#>   c 0 0 6 0 0 0 0
# in relative values …
rels <- summary (d.abc, count="r")
round (rels, 2)
#>    L2
#> L1     a    b    c    o    u    w    ə
#>   - 0.00 0.00 0.00 0.00 0.00 0.00 1.00
#>   a 0.67 0.00 0.00 0.17 0.17 0.00 0.00
#>   b 0.00 0.83 0.00 0.00 0.00 0.17 0.00
#>   c 0.00 0.00 1.00 0.00 0.00 0.00 0.00
# … relative to entire rows
apply (rels, 1, sum)
#> - a b c 
#> 1 1 1 1
‘coocc()’ has two modes: internal and external comparison. The former, invoked when ‘column=NULL’ (the default), cross-tabulates correspondences with themselves. The latter cross-tabulates correspondences with metadata taken from a column in the dataset whose name is given as the argument ‘column’. Like ‘summary()’ above, ‘coocc()’ has the argument ‘unit’ which has the same meaning, and also the argument ‘count’ which may appear to work a little differently. In actuality, its use with ‘summary()’ was a special case. The general idea is that the entire table is divided into blocks such that all rows represent correspondences of the same segment and, in the internal mode, so do all the columns.
Note that ‘coocc()’ reports how many times the given correspondences co-occur in the same word – not how often they appear in the entire dataset. For example, in a pair L1 “a|b” : L2 “c|d”, the “a:c”/“b:d” cell will show 1 because the correspondence L1 “a” : L2 “c” co-occurs with L1 “b” : L2 “d” in one word. This is a different perspective than in ‘summary()’ above.
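The difference between the two perspectives can be reproduced by hand for the single pair L1 “a|b” : L2 “c|d”. The following base R sketch only mimics the idea behind the two tables; it does not call ‘summary()’ or ‘coocc()’ and makes no claim about how they are implemented.

```r
# one aligned pair, split into segments
s1 <- strsplit("a|b", "|", fixed = TRUE)[[1]]
s2 <- strsplit("c|d", "|", fixed = TRUE)[[1]]

# segment-to-segment perspective: which L1 segment corresponds to which L2 segment
seg <- table(L1 = s1, L2 = s2)
seg["a", "d"]    # 0: "a" never corresponds directly to "d"

# correspondence-to-correspondence perspective: which correspondences share a word
corrs <- paste(s1, s2, sep = ":")
co <- outer(corrs, corrs, FUN = function (x, y) as.integer(x != y))
dimnames(co) <- list(corrs, corrs)
co["a:c", "b:d"]    # 1: the two correspondences co-occur in this one word
```

With more than one word, the second table would be accumulated over all words, which is where the numbers in the ‘coocc()’ output come from.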
# a general look in the internal mode
coocc (d.abc)
#>      L1_L2
#> L1_L2 -_ə a_a a_o a_u b_b b_w c_c
#>   -_ə   0   2   0   0   2   0   2
#>   a_a   2   2   0   0   4   0   4
#>   a_o   0   0   0   0   1   0   1
#>   a_u   0   0   0   1   0   1   1
#>   b_b   2   4   1   0   0   0   5
#>   b_w   0   0   0   1   0   0   1
#>   c_c   2   4   1   1   5   1   0
# now with metadata
coocc (d.abc, "DIALECT.L2")
#>      DIALECT.L2
#> L1_L2 north south std
#>   -_ə     0     2   0
#>   a_a     0     2   2
#>   a_o     1     0   0
#>   a_u     1     0   0
#>   b_b     1     2   2
#>   b_w     1     0   0
#>   c_c     2     2   2
# in the internal mode,
#    the relative values are with regard to segment-to-segment blocks
tab <- coocc (d.abc, count="r")
rows.a <- which (rownames(tab) %hasPrefix% "a")
cols.b <- which (colnames(tab) %hasPrefix% "b")
sum (tab [rows.a, cols.b])
#> [1] 1
# there are four different segments in L1
sum (tab)
#> [1] NaN
# if two correspondences never co-occur, the relative value is 0/0
#    which R represents as ‘NaN’, and prints as empty space
coocc (d.abc, count="r")
#>      L1_L2
#> L1_L2       -_ə       a_a       a_o       a_u       b_b       b_w
#>   -_ə           1.0000000 0.0000000 0.0000000 1.0000000 0.0000000
#>   a_a 1.0000000 0.6666667 0.0000000 0.0000000 0.6666667 0.0000000
#>   a_o 0.0000000 0.0000000 0.0000000 0.0000000 0.1666667 0.0000000
#>   a_u 0.0000000 0.0000000 0.0000000 0.3333333 0.0000000 0.1666667
#>   b_b 1.0000000 0.6666667 0.1666667 0.0000000                    
#>   b_w 0.0000000 0.0000000 0.0000000 0.1666667                    
#>   c_c 1.0000000 0.6666667 0.1666667 0.1666667 0.8333333 0.1666667
#>      L1_L2
#> L1_L2       c_c
#>   -_ə 1.0000000
#>   a_a 0.6666667
#>   a_o 0.1666667
#>   a_u 0.1666667
#>   b_b 0.8333333
#>   b_w 0.1666667
#>   c_c
# in the external mode,
#    the relative values are with regard to blocks of rows, and all columns
tab <- coocc (d.abc, "DIALECT.L2", count="r")
rows.a <- which (rownames(tab) %hasPrefix% "a")
sum (tab [rows.a, ])
#> [1] 1
‘allCooccs()’ splits a table produced by ‘coocc()’ into blocks, each containing the correspondences of one segment. Its primary purpose is to facilitate the application of tests of independence, for which see ‘lapplyTest()’ below.
‘allCooccs()’ takes all the same arguments as ‘coocc()’: ‘column’, ‘count’, and ‘unit’. In addition, it takes the argument ‘bin’ which determines whether the table should be just cut up, or whether all the resulting slices should also be binned.
The return value of ‘allCooccs()’ is a list which holds all the resulting tables, under names composed from the correspondences and connected with underscores. If ‘column = NULL’ and ‘bin = F’, the names are ‘a’, ‘b’, etc.; if ‘bin = T’, they take the form ‘a_b_c_d’, meaning L1 ‘a’ : L2 ‘b’ cross-tabulated with L1 ‘c’ : L2 ‘d’, and so on. If ‘column’ is not ‘NULL’, the names take the form ‘a_b_northern’, meaning L1 ‘a’ : L2 ‘b’ tabulated with the ‘northern’ dialect, and so forth.
# for a small dataset, the result is going to be small
str (allCooccs(d.abc), max.level=0)
#> List of 34
# but it can grow quite quickly with a larger dataset
str (allCooccs(d.cap), max.level=0)
#> List of 5614
# the naming scheme
names (allCooccs(d.abc))
#>  [1] "-_ə_a_a" "-_ə_a_o" "-_ə_a_u" "-_ə_b_b" "-_ə_b_w" "-_ə_c_c" "a_a_-_ə"
#>  [8] "a_a_b_b" "a_a_b_w" "a_a_c_c" "a_o_-_ə" "a_o_b_b" "a_o_b_w" "a_o_c_c"
#> [15] "a_u_-_ə" "a_u_b_b" "a_u_b_w" "a_u_c_c" "b_b_-_ə" "b_b_a_a" "b_b_a_o"
#> [22] "b_b_a_u" "b_b_c_c" "b_w_-_ə" "b_w_a_a" "b_w_a_o" "b_w_a_u" "b_w_c_c"
#> [29] "c_c_-_ə" "c_c_a_a" "c_c_a_o" "c_c_a_u" "c_c_b_b" "c_c_b_w"
# and with ‘column’ not ‘NULL’
names (allCooccs(d.abc,column="DIALECT.L2"))
#>  [1] "-_ə_north" "-_ə_south" "-_ə_std"   "a_a_north" "a_a_south"
#>  [6] "a_a_std"   "a_o_north" "a_o_south" "a_o_std"   "a_u_north"
#> [11] "a_u_south" "a_u_std"   "b_b_north" "b_b_south" "b_b_std"  
#> [16] "b_w_north" "b_w_south" "b_w_std"   "c_c_north" "c_c_south"
#> [21] "c_c_std"
‘soundcorrs’ has three functions to look for specific examples. ‘findExamples()’ searches for words which exhibit a given correspondence; ‘findPairs()’ is a convenience wrapper for ‘findExamples()’ for when there are only two languages in the dataset; and ‘allPairs()’ produces an almost print-ready summary of the dataset, complete with tables and all the examples.
‘findExamples()’ searches a dataset for pairs/triples/… which exhibit a specific sound correspondence. It can take several arguments, which can be divided into three groups.
The first group is just one obligatory argument, ‘data’. This is the dataset, and it must be a ‘soundcorrs’ object.
The second group are the queries. There must be as many of them as there are languages in ‘data’, and they must be given in the same order as the languages in ‘data’. For example, if the dataset contains data from English and Polish, there need to be two queries, the first to be looked for in English data, and the second in Polish data. All queries can be regular expressions, either such as are defined in R or using the custom metacharacters defined in the transcription. They can also be empty strings, which ‘findExamples()’ will interpret as a permission to accept anything.
The third group are optional arguments which define how the data are sifted and displayed. These arguments can only be used with a name (e.g. ‘findExamples(data,“a”,“a”,cols=“all”)’) because R does not know a priori how many queries there are going to be.
The ‘cols’ argument defines which columns of the original data are displayed. It can be a vector of strings, “all”, or “aligned”. The last option is the default one.
The next two arguments are ‘distance.start’ and ‘distance.end’. These define the maximum permitted distance, in segments, between the matches. Let us use as example French “f|r|ã|-|s” and English “f|r|ā|n|s”, and imagine that we want to find cases where a French vowel corresponds to a vowel-n sequence in English. The part of the French word that interests us is segment number 3; its counterpart in the English word starts on segment 3 and ends on segment 4. The distance between the starts is 0, and the distance between the ends is 1. ‘findExamples()’ will find our pair of words if ‘distance.start’ is set to 0 or more, and ‘distance.end’ to 1 or more.
Both arguments can also take negative values, which means that the distance is not checked at all. These are in fact the defaults (-1). Effectively, the default behaviour of ‘findExamples()’ is to find any pair/triple/… in which the first word contains the first query, the second word the second query, and so on, regardless of whether they appear in the corresponding segments. For the example above, ‘findExamples(dataset,"f","r")’ also returns a match. It may therefore seem irresponsible to set the default values of both arguments to -1, but in my experience, it very rarely produces false positives. On the other hand, the opposite behaviour (both arguments set to 0) may easily result in false negatives, ones that are not only of a much less intuitive kind, but also never give the user a chance to spot the problem as they are simply not displayed.
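The distance check described above can be pictured with a few lines of base R. This is only a sketch: ‘withinDistance()’ is a hypothetical helper and not a part of ‘soundcorrs’, and the positions of the matches are typed in by hand for the French and English example rather than found in the data.

```r
# positions (in segments) of the matched stretches in the two aligned words;
#    typed in by hand here, an assumption made for the sake of the example
french <- c (start=3, end=3)     # the French vowel
english <- c (start=3, end=4)    # the English vowel-n sequence

# hypothetical helper, not part of 'soundcorrs';
#    a negative limit means that the distance is not checked
withinDistance <- function (x, y, distance.start=-1, distance.end=-1) {
    ok.start <- distance.start < 0 ||
        abs (x[["start"]] - y[["start"]]) <= distance.start
    ok.end <- distance.end < 0 ||
        abs (x[["end"]] - y[["end"]]) <= distance.end
    ok.start && ok.end
}

# the starts are 0 apart, the ends are 1 apart
withinDistance (french, english, distance.start=0, distance.end=1)
withinDistance (french, english, distance.start=0, distance.end=0)
# with the defaults, nothing is checked
withinDistance (french, english)
```

The first and the last call accept the pair, and the second one rejects it because the ends are one segment apart.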
The next argument is ‘na.value’, which determines how missing values (‘NA’) are treated. This argument can only have one of two values: -1 and 0. The former means that missing values are considered non-matches, the latter that they are considered matches. The latter is the default. Note that an empty string query takes precedence over ‘na.value’, that is, even when ‘na.value’ is set to -1, ‘NA’s will show up in the results when the query is an empty string.
The last optional argument is ‘zeros’ which can be set to ‘TRUE’ or ‘FALSE’ (the default). The former means that search is performed on words with linguistic zeros in them. In the example above, the query “Vs” would find “f|r|ã|-|s” only if ‘zeros’ were set to ‘FALSE’.
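One way to picture the effect of ‘zeros’ is to compare a search on a word with its linguistic zeros kept, and on the same word with the zeros removed. The sketch below uses base R only and a made-up word; it illustrates the idea and is not meant to show how ‘findExamples()’ is implemented.

```r
# a hypothetical aligned word with a linguistic zero in the middle
word <- "a|b|-|c"

# remove only the separators (zeros kept), or both the separators and the zeros
with.zeros <- gsub ("|", "", word, fixed=TRUE)
without.zeros <- gsub ("[-|]", "", word)

# "bc" is only contiguous when the zero has been removed
grepl ("bc", with.zeros)
grepl ("bc", without.zeros)

# and the zero itself can only be found when it has been kept
grepl ("-", with.zeros)
grepl ("-", without.zeros)
```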
The output of ‘findExamples()’ is a list with two fields: ‘data’ which is a data frame with matching examples, and ‘which’, a logical vector showing which examples in the original dataset were a match. The class of the return value is ‘df.findExamples’; this is purely for technical reasons, to allow for a more legible printed output.
See also ‘findPairs()’, a convenience wrapper around ‘findExamples()’.
# “ab” spans segments 1–2, while “a” only occupies segment 1
findExamples (d.abc, "ab", "a", distance.end=0)
#> No matches found.
findExamples (d.abc, "ab", "a", distance.end=1)
#>   ALIGNED.L1 ALIGNED.L2
#> 1      a|b|c      a|b|c
#> 2    a|b|a|c    a|b|a|c
#> 5    a|b|c|-    a|b|c|ə
#> 6  a|b|a|c|-  a|b|a|c|ə
# linguistic zeros cannot be found if ‘zeros’ is set to ‘FALSE’
findExamples (d.abc, "-", "", zeros=T)
#>   ALIGNED.L1 ALIGNED.L2
#> 5    a|b|c|-    a|b|c|ə
#> 6  a|b|a|c|-  a|b|a|c|ə
findExamples (d.abc, "-", "", zeros=F)
#> No matches found.
# both the usual and custom regular expressions are permissible
findExamples (d.abc, "a", "[ou]")
#>   ALIGNED.L1 ALIGNED.L2
#> 3      a|b|c      o|b|c
#> 4    a|b|a|c    u|w|u|c
findExamples (d.abc, "a", "O")
#>   ALIGNED.L1 ALIGNED.L2
#> 3      a|b|c      o|b|c
#> 4    a|b|a|c    u|w|u|c
# the output is actually a list
str (findExamples(d.abc,"a","a"), max.level=1)
#> List of 2
#>  $ data :'data.frame':   4 obs. of  2 variables:
#>  $ which: logi [1:6] TRUE TRUE FALSE FALSE TRUE TRUE
#>  - attr(*, "class")= chr "df.findExamples"
# ‘data’ is what is displayed on the screen
# ‘which’ is useful for subsetting
subset (d.abc, findExamples(d.abc,"a","O")$which)
#> A "soundcorrs" object.
#>   Languages (2): L1, L2.
#>   Entries: 2.
#>   Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.
# ‘which’ can also be used to find examples
#    that exhibit more than one correspondence.
aaa <- findExamples (d.cap, "a", "a", "a", distance.start=0, distance.end=0)$which
bbb <- findExamples (d.cap, "b", "b", "b", distance.start=0, distance.end=0)$which
d.cap$data [aaa & bbb,]
#>         ALIGNED.German ORTHOGRAPHY.German      ALIGNED.Polish
#> 22 b|r|a|t|ī|s|l|a|v|a         Bratislava b|r|a|t|y|s|w|a|v|a
#> 24     b|ū|d|a|p|ä|s|t           Budapest     b|u|d|a|p|e|š|t
#>    ORTHOGRAPHY.Polish     ALIGNED.Spanish ORTHOGRAPHY.Spanish
#> 22         Bratysława b|r|a|t|i|z|l|a|β|a          Bratislava
#> 24          Budapeszt     b|u|ð|a|p|e|s|t            Budapest
#>    OFFICIAL.LANGUAGE
#> 22            Slovak
#> 24         Hungarian
# the ‘cols’ argument can be used to alter the printed output
findExamples (d.abc, "a", "O", cols=c("ORTHOGRAPHY.L1","ORTHOGRAPHY.L2"))
#>   ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> 3            abc           aobc
#> 4           abac           uwucThis is a convenience wrapper around ‘findExamples()’ which can be only used for datasets which contain data from exactly two languages. Instead of the three optional arguments of ‘findExamples()’, ‘findPairs()’ only has one. It is called ‘exact’, and it can take several different values.
The default is the inexact mode (‘exact’ set to 0 or ‘FALSE’). It corresponds to ‘distance.start’ and ‘distance.end’ being both set to -1, ‘na.value’ being set to 0, and ‘zeros’ being set to ‘FALSE’, which are also the default settings in ‘findExamples()’. The risk here are false positives. In my experience, however, those are rare, and because they are displayed, the user has a chance to spot them.
The opposite is the exact mode (‘exact’ set to 1 or ‘TRUE’), which corresponds to ‘distance.start’ and ‘distance.end’ being both set to 0, ‘na.value’ being set to -1, and ‘zeros’ to ‘TRUE’. The risk are false negatives, in my experience both much more common than false positives in the inexact mode, and effectively impossible to spot as they are simply not displayed.
A middle ground is the semi-exact mode (‘exact’ set to 0.5), where ‘distance.start’ and ‘distance.end’ are both set to 1, ‘na.value’ is set to 0, and ‘zeros’ to ‘FALSE’. It decreases the risk of false positives while increasing only a little the risk of false negatives.
Apart from the above, ‘findPairs()’ also has the parameter ‘cols’, whose value is passed directly to ‘findExamples()’.
The output of ‘findPairs()’ is the same as the output of ‘findExamples()’.
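For quick reference, the three modes can be collected in a small table. The data frame below is a hypothetical lookup that merely restates the description above; it is not an object provided by ‘soundcorrs’.

```r
# hypothetical lookup table: which settings of 'findExamples()'
#    each value of 'exact' corresponds to, as described in the text
exact.modes <- data.frame (
    exact = c (0, 0.5, 1),
    distance.start = c (-1, 1, 0),
    distance.end = c (-1, 1, 0),
    na.value = c (0, 0, -1),
    zeros = c (FALSE, FALSE, TRUE))
rownames (exact.modes) <- c ("inexact", "semi-exact", "exact")
exact.modes

# with 'soundcorrs' attached and the sample dataset loaded as 'd.abc',
#    the semi-exact mode would be requested like this:
# findPairs (d.abc, "a", "a", exact=0.5)
```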
‘allPairs()’ does not have great analytic value in itself, but it can be useful when writing a paper e.g. on the phonetic adaptation of loanwords, to prepare its material part.
The output of ‘allPairs()’ consists of sections devoted to each segment, filled with a general contingency table of its various renderings, and followed by subsections which list all pairs exhibiting the given correspondence. ‘soundcorrs’ provides functions to format such output in HTML or in LaTeX, or not at all. Custom formatters are also not very difficult to write.
Tables can show the number of occurrences or the number of words in which the given correspondence manifests itself (‘unit’), in absolute or in relative terms (‘count’; all three with values as with ‘summary()’). Which columns are printed can be modified with ‘cols’, and whether to write to a file or to the screen, with ‘file’ (‘NULL’ meaning the screen). Lastly, the formatting is controlled by a special function, of which ‘soundcorrs’ provides three: ‘formatter.none()’, ‘formatter.html()’, and ‘formatter.latex()’. A custom formatter can also take additional arguments, which will be passed to it from the call to ‘allPairs()’.
# and see what result this gives
allPairs (d.abc, cols=c("ORTHOGRAPHY.L1","ORTHOGRAPHY.L2"))
#> section  [1] "-"
#> table    ə 
#> table    2 
#> subsection   [1] "-" "ə"
#> data.frame     ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame   5            abc           abca
#> data.frame   6           abac          abaca
#> section  [1] "a"
#> table    a o u 
#> table    4 1 1 
#> subsection   [1] "a" "a"
#> data.frame     ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame   1            abc            abc
#> data.frame   2           abac           abac
#> data.frame   5            abc           abca
#> data.frame   6           abac          abaca
#> subsection   [1] "a" "o"
#> data.frame     ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame   3            abc           aobc
#> subsection   [1] "a" "u"
#> data.frame     ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame   4           abac           uwuc
#> section  [1] "b"
#> table    b w 
#> table    5 1 
#> subsection   [1] "b" "b"
#> data.frame     ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame   1            abc            abc
#> data.frame   2           abac           abac
#> data.frame   3            abc           aobc
#> data.frame   5            abc           abca
#> data.frame   6           abac          abaca
#> subsection   [1] "b" "w"
#> data.frame     ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame   4           abac           uwuc
#> section  [1] "c"
#> table    c 
#> table    6 
#> subsection   [1] "c" "c"
#> data.frame     ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame   1            abc            abc
#> data.frame   2           abac           abac
#> data.frame   3            abc           aobc
#> data.frame   4           abac           uwuc
#> data.frame   5            abc           abca
#> data.frame   6           abac          abaca
# a clearer result could be obtained by running
# allPairs (d.cap, cols=c("ORTHOGRAPHY.German","ORTHOGRAPHY.Polish"),
#    file="~/Desktop/d.cap.html", formatter=formatter.html)
As was mentioned, the “capitals” dataset is linguistically absurd, and so it should not matter that all the Polish names of European capitals are listed as borrowed from German. If, however, one wished to fix this problem, and to do it not by copying the output to a word processor and replacing “>” with “:” there, but rather inside ‘soundcorrs’, this wish can be fulfilled easily enough. First, the existing ‘formatter.html()’ function needs to be written to a file to serve as a base for the new formatter: ‘dput(formatter.html, "~/Desktop/myFormatter.R")’. Then, the beginning of the first line of this file needs to be changed to something like ‘myFormatter <- function’…, and finally, the “>” and “<” signs (written in HTML as ‘&gt;’ and ‘&lt;’, respectively) need to be replaced with a colon. All that is then left is to load the new function into R and use it to format the output of ‘allPairs()’:
# load the new formatter function …
# source ("~/Desktop/myFormatter.R")
# … and use it instead of ‘formatter.html()’
# allPairs (d.cap, cols=c("ORTHOGRAPHY.German","ORTHOGRAPHY.Polish"),
#    file="~/Desktop/d.cap.html", formatter=myFormatter)
# note that this time the output will not open in the web browser automatically
Two ‘soundcorrs’ functions help automate fitting models to data: the simpler ‘multiFit()’ and the slightly more complex ‘fitTable()’.
‘multiFit()’ fits multiple models to a single dataset. It takes as arguments a list of models and the dataset. Each element of the list of models is itself a list that contains two named fields: ‘formula’ and ‘start’. The latter is a list of lists of starting estimates for the parameters of the model, to be tested one after another in case the previous ones fail to produce a fit. The user can specify the fitting function, as well as pass additional arguments to it.
The return value of ‘multiFit()’ is a list containing the outputs of the fitting function. Warnings and errors, which are suppressed by ‘multiFit()’, are attached to the individual elements of the output as attributes. Technically, the result is of class ‘list.multiFit’ so that it can be passed to ‘summary()’ to produce a table for easier comparison of the fits. The available metrics are ‘aic’, ‘bic’, ‘rss’ (the default), and ‘sigma’. In addition, the output of ‘multiFit()’ has an attribute ‘depth’; it is intended for ‘summary()’, and should not be changed by the user.
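The idea of falling back on the next set of starting estimates can be sketched with plain ‘nls()’ and ‘tryCatch()’. The helper ‘tryStarts()’ below is hypothetical and much simpler than ‘multiFit()’; in particular, it discards warnings and errors instead of attaching them to the result. The model is also chosen only because its first starting estimate is guaranteed to fail.

```r
# some random data
set.seed (27)
dataset <- data.frame (X=1:10, Y=1:10 + runif(10,-1,1))

# hypothetical helper, not the code used by 'multiFit()':
#    try the starting estimates one by one, and return the first successful fit
tryStarts <- function (formula, data, starts) {
    for (s in starts) {
        fit <- tryCatch (
            suppressWarnings (nls (as.formula(formula), data=data, start=s)),
            error = function (e) NULL)
        if (!is.null(fit)) return (fit)
    }
    NULL
}

# log(-1) is not a number, so the first estimate cannot produce a fit
#    and the second one is used instead
fit <- tryStarts ("Y~log(a)*X", dataset, list (list(a=-1), list(a=2)))
coef (fit)
```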
# prepare some random data
set.seed (27)
dataset <- data.frame (X=1:10, Y=1:10 + runif(10,-1,1))
# prepare models to be tested
models <- list (
    "model A" = list( formula="Y~a+X", start=list(list(a=1)) ),
    "model B" = list( formula="Y~a^X", start=list(list(a=-1),list(a=1)) ))
# normally, (-1)^X would produce an error with ‘nls()’
# fit the models to the dataset
fit <- multiFit (models, dataset)
# inspect the results
summary (fit)
#>      model A  model B
#> rss 4.059485 11.51618
‘fitTable()’ applies ‘multiFit()’ over a table, such as the ones produced by ‘coocc()’ or ‘summary()’. The arguments are: the models, the dataset, margin (as in ‘apply()’: 1 for rows, 2 for columns), the converter function, and additional arguments passed to ‘multiFit()’ (including the fitting function). The converter is a function that turns individual rows or columns of the table into data frames to which models can be fitted. ‘soundcorrs’ provides three simple functions: ‘vec2df.id()’ (the default one), ‘vec2df.hist()’, and ‘vec2df.rank()’. The first one only attaches a list of ‘X’ values, the second one extracts from a histogram the midpoints and counts, and the third one ranks the data. Any function can be used, so long as it takes a numeric vector as the only argument, and returns a data frame. The names of columns in the data frames returned by these three functions are ‘X’ and ‘Y’, something to be borne in mind when defining the formulae of the models.
As with ‘multiFit()’, the return value of ‘fitTable()’ is a list of the outputs of the fitting function, only in the case of ‘fitTable()’ it is nested. It, too, can be passed to ‘summary()’ to produce a convenient table.
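A custom converter only needs to take a numeric vector and return a data frame. The function below is a hypothetical example, not one of the converters shipped with ‘soundcorrs’; it sorts the values in descending order and pairs them with their positions, using the same column names ‘X’ and ‘Y’ so that the formulae of the models do not need to be changed.

```r
# hypothetical custom converter: a numeric vector in, a data frame out
vec2df.sorted <- function (data) {
    data.frame (X=seq_along(data), Y=sort(as.numeric(data), decreasing=TRUE))
}

# a quick check on a made-up vector
vec2df.sorted (c(2, 0, 5, 1))

# with 'soundcorrs' attached, it could then be used in place of 'vec2df.hist':
# fit <- fitTable (models, dataset, 1, vec2df.sorted)
```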
# prepare the data
dataset <- coocc (d.abc)
# prepare the models to be tested
models <- list (
    "model A" = list( formula="Y~a*(X+b)^2", start=list(list(a=1,b=1)) ),
    "model B" = list( formula="Y~a*(X-b)^2", start=list(list(a=1,b=1)) ))
# vanilla nls() often requires fairly accurate starting estimates
# fit the models to the dataset
fit <- fitTable (models, dataset, 1, vec2df.hist)
# inspect the results
summary (fit, metric="sigma")
#>               -_ə      a_a       a_o     a_u       b_b       b_w       c_c
#> model A        NA 1.272453        NA      NA        NA        NA        NA
#> model B 0.4291194  1.03122 0.9342932 0.72328 0.5919122 0.9342932 0.5919122
Currently, ‘soundcorrs’ offers only one function related to sound changes. It is ‘applyChanges()’, and it can be used to automatically apply a series of changes to a series of words.
‘applyChanges()’ does what its name suggests: it applies a series of sound changes to a series of words. It takes up to four arguments: ‘data’ which is a vector of character strings to which the changes will be applied; ‘changes’, a list of ‘soundchange’ objects; ‘target’, a vector of character strings of the same length as ‘data’, to which the results will be compared; and ‘meta’, a vector of the same length as ‘data’, which will be passed on to ‘soundchange’ functions.
The return value of ‘applyChanges()’ is a list, technically of class ‘list.applyChanges’, which contains three elements: ‘end’, the final results of the application of sound changes (this is the only element that is printed by default); ‘match’, which is a named list of the results of comparison to ‘target’; and ‘tree’, which saves the path by which the final forms in ‘end’ have been arrived at. The values in ‘match’ can be: 0 if none of the results matches ‘target’, 0.5 if at least one but not all of the results match ‘target’, and 1 if all the results match ‘target’.
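The three-valued comparison to ‘target’ can be expressed in a few lines of base R. ‘matchValue()’ is a hypothetical helper written for this illustration; it is not exported by ‘soundcorrs’, and the second call uses a made-up pair of results.

```r
# hypothetical helper: compare all the results obtained for one word
#    to the expected form
matchValue <- function (results, target) {
    hits <- results == target
    if (all(hits)) 1 else if (any(hits)) 0.5 else 0
}

# all the results match
matchValue ("arabaras", "arabaras")
# some of the results match
matchValue (c("jormangandr","jarmangandr"), "jarmangandr")
# none of the results matches
matchValue ("jormangandr", "jarmangandr")
```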
# prepare a list of changes, in the order of application
sc.list <- list (sc.VV2a, sc.2ndV2a, sc.CV2Ca)
# prepare the data and the expected results
data <- c ("ouroboros", "jormungandr")
target <- c ("arabaras", "jarmangandr")
# and apply the changes to our data
res <- applyChanges (data, sc.list, target, meta=NULL)
res
#> $ouroboros
#> [1] "arabaras"
#> 
#> $jormungandr
#> [1] "jormangandr"
# see if they match the expectations
res$match
#> $ouroboros
#> [1] 1
#> 
#> $jormungandr
#> [1] 0
# see which change did not work as expected
#    it was CV > Ca because our changes use the sample "common" transcription,
#    and j does not count in it as a consonant (it's a semivowel)
res$tree
#> 1 ouroboros [VV>a]
#> 2 .. aroboros [2ndV>a]
#> 3 .. .. araboros [CV>Ca]
#> 4 .. .. .. arabaras
#> 1 jormungandr [VV>a]
#> 2 .. jormungandr [2ndV>a]
#> 3 .. .. jormangandr [CV>Ca]
#> 4 .. .. .. jormangandr
In addition to analytic functions, ‘soundcorrs’ also exports several helpers. Let us now briefly discuss those, this time simply in alphabetical order.
As was mentioned above, automatic segmentation and alignment requires careful supervision, and it may prove in the end to be easier to do by hand. ‘addSeparators()’ can facilitate the first half of this task by interspersing a vector of character strings with a separator.
It may sometimes happen that the data are insufficient for a test of independence, or that the contingency table is too diversified to draw concrete conclusions from. ‘binTable()’ takes one or more rows and one or more columns as arguments, and leaves those rows and columns unchanged, while summing up all the others.
# build a table for a slightly larger dataset
tab <- coocc (d.cap)
# let us focus on L1 a and o
rows <- which (rownames(tab) %hasPrefix% "a")
cols <- which (colnames(tab) %hasPrefix% "o")
binTable (tab, rows, cols)
#>       o_o_o non-o_o_o
#> a_a_a     0        57
#> a_a_o     0         6
#> a_a_u     0         5
#> other    16      1041
# or on all a-like and o-like vowels
rows <- which (rownames(tab) %hasPrefix% "[aāäǟ]")
cols <- which (colnames(tab) %hasPrefix% "[oōöȫ]")
binTable (tab, rows, cols)
#>       o_o_o ō_o_o ō_y_o other
#> a_a_a     0     1     0    56
#> a_a_o     0     0     0     6
#> a_a_u     0     0     0     5
#> ä_e_e     0     0     0    36
#> ā_-_-     1     0     0     6
#> ā_a_a     0     2     0    47
#> other    15    16     3   931
Metacharacters defined in the transcription (“wildcards”) can be used inside sound changes, as well as inside a ‘findExamples()’ or a ‘findPairs()’ query, but they can also be used with ‘grep()’ or any other function. The only difference is that sound changes, ‘findExamples()’, and ‘findPairs()’ automatically make a call to ‘expandMeta()’ in order to translate those metacharacters into regular expressions that vanilla R can understand, while for ‘grep()’, or any other function from outside ‘soundcorrs’, the user needs to make an explicit call to ‘expandMeta()’.
Beside the metacharacters defined in the transcription, ‘expandMeta()’ can also understand ‘binary notation’, i.e. an enumeration of distinctive features such as “[+cons,-stop]”. The condition is that the enumeration must be enclosed in square brackets, it must contain the same features as are used in the VALUE column in the transcription, each feature must have a “+” or “-” sign in front of it, the features must be separated by commas, and there can be no spaces inside the brackets. Should any of those rules be broken, the would-be wildcard will be kept in the query string as is, and will surely fail to produce any match in the search.
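Because a malformed enumeration is silently kept in the query as is, it may be worth checking its shape before a search. The function below is a hypothetical helper in base R; it assumes that feature names consist of ASCII letters only, and it only checks the formal rules listed above, not whether the features actually exist in the transcription.

```r
# hypothetical check of the formal rules of the binary notation:
#    square brackets, a sign before each feature, commas, and no spaces
isBinaryNotation <- function (x)
    grepl ("^\\[[+-][A-Za-z]+(,[+-][A-Za-z]+)*\\]$", x)

# well-formed
isBinaryNotation ("[+cons,-stop]")
# a space after the comma
isBinaryNotation ("[+cons, -stop]")
# a feature without a sign
isBinaryNotation ("[cons,-stop]")
# no brackets
isBinaryNotation ("+cons,-stop")
```

Only the first of the four calls returns ‘TRUE’.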
# let us search a column other than the one specified as ‘aligned’
orth <- d.abc$data [, "ORTHOGRAPHY.L2"]
# look for all VCC sequences
query <- expandMeta (d.cap$trans[[1]], "VCC")
orth [grep(query,orth)]
#> [1] "abc"  "aobc" "abca"
# look for all VCC words
query <- expandMeta (d.cap$trans[[1]], "^VCC$")
orth [grep(query,orth)]
#> [1] "abc"
# the same in the binary notation
query <- expandMeta (d.cap$trans[[1]], "^[+vow][+cons][+cons]$")
orth [grep(query,orth)]
#> [1] "abc"
‘%hasPrefix%’ checks if a string begins with another string. In ‘soundcorrs’, this can be useful for extracting specific rows and columns from a contingency table.
# build a table for a slightly larger dataset
tab <- coocc (d.cap)
# it is quite difficult to read as a whole, so let us focus
#    on a-like vowels in L1 and s-like consonants in L2
rows <- which (rownames(tab) %hasPrefix% "[aāäǟ]")
cols <- which (colnames(tab) %hasPrefix% "[sśš]")
tab [rows, cols]
#>                      German_Polish_Spanish
#> German_Polish_Spanish s_s_s s_s_z s_z_z s_š_s š_š_s
#>                 a_a_a     1     1     0     1     1
#>                 a_a_o     0     0     0     0     1
#>                 a_a_u     0     0     0     0     0
#>                 ä_e_e     0     0     0     2     0
#>                 ā_-_-     0     0     1     0     0
#>                 ā_a_a     0     0     0     2     0
‘%hasSuffix%’ works nearly the same as ‘%hasPrefix%’, only instead of the beginning of a word, it looks at its end.
# build a table for a slightly larger dataset
tab <- coocc (d.cap)
# it is quite difficult to read as a whole, so let us focus
#    on what corresponds to a-like vowels in L1 and s-like consonants in L2
rows <- which (rownames(tab) %hasSuffix% "[aāäǟ]")
cols <- which (colnames(tab) %hasSuffix% "[sśš]")
tab [rows, cols]
#>                      German_Polish_Spanish
#> German_Polish_Spanish -_-_s s_s_s s_š_s z_s_s z_z_s š_š_s
#>               -_-_a       0     0     0     0     0     0
#>               -_a_a       1     1     0     0     0     0
#>               -_a_ja      0     0     0     0     0     1
#>               -_y_a       1     0     0     0     0     0
#>               a_a_a       1     1     1     1     0     1
#>               jus_o_a     0     0     0     0     0     0
#>               ā_a_a       0     0     2     0     1     0
‘lapplyTest()’ is a variant of ‘base::lapply()’ specifically adjusted for the application of tests of independence. The main difference lies in the handling of warnings and errors.
This function takes a list of contingency tables, such as generated by ‘allCooccs()’ above, and applies to each of its elements a function given in ‘fun’. By default, it is ‘chisq.test()’, but any other test can be used, so long as its output contains an element named ‘p.value’. The result is a list of the outputs of ‘fun’, to each attached as an attribute a warning or an error if any were produced. Additional arguments to ‘fun’ can also be passed in a call to ‘lapplyTest()’.
Technically, the output is of class ‘list.lapplyTest’. It can be passed to ‘summary()’ to sift through the results and only print the ones with the p-value below the specified threshold (the default is 0.05). Those tests which produced a warning are prefixed with an exclamation mark.
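The general technique of running a test while keeping its warnings and errors as attributes can be sketched in base R as follows. ‘applyTest()’ is a hypothetical, single-table helper written for this illustration; it is not the code of ‘lapplyTest()’. The table is the one for ‘a’ from the “abc” dataset, typed in by hand.

```r
# hypothetical helper: apply a test to one table, and attach
#    the warning or the error, if any, to the result as attributes
applyTest <- function (tab, fun=chisq.test, ...) {
    warn <- NULL
    err <- NULL
    res <- tryCatch (
        withCallingHandlers (
            fun (tab, ...),
            warning = function (w) {
                warn <<- w
                invokeRestart ("muffleWarning")
            }),
        error = function (e) {
            err <<- e
            NULL
        })
    if (is.null(res)) res <- list ()
    attr (res, "warning") <- warn
    attr (res, "error") <- err
    res
}

# the contingency table for 'a' in the "abc" dataset, filled column by column
tab <- matrix (c(2,0,0, 4,1,0, 0,0,1, 4,1,1), nrow=3)
out <- applyTest (tab)
out$p.value
attr (out, "warning")
attr (out, "error")
```

The warning is muffled rather than printed, so that it does not interrupt the run, but it is still available for inspection afterwards.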
# let us prepare the tables
tabs <- allCooccs (d.abc, bin=F)
# and apply the chi-squared test to them
chisq <- lapplyTest (tabs)
chisq
#> $`-`
#> 
#>  Chi-squared test for given probabilities
#> 
#> data:  tab
#> X-squared = 6, df = 5, p-value = 0.3062
#> 
#> 
#> $a
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  tab
#> X-squared = 7.7467, df = 6, p-value = 0.2573
#> 
#> 
#> $b
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  tab
#> X-squared = 7.1944, df = 4, p-value = 0.126
#> 
#> 
#> $c
#> 
#>  Chi-squared test for given probabilities
#> 
#> data:  tab
#> X-squared = 6.5714, df = 5, p-value = 0.2545
#> 
#> 
#> attr(,"class")
#> [1] "list.lapplyTest"
# this is only an example on a tiny dataset, so let us be more forgiving
summary (chisq, p.value=0.3)
#> Total results: 4; with p-value ≤ 0.3: 3.
#> ! a: p-value = 0.257
#> ! b: p-value = 0.126
#> ! c: p-value = 0.255
# let us see the problems with ‘a’
attr (chisq$a, "error")
#> NULL
attr (chisq$a, "warning")
#> <simpleWarning in fun(tab, ...): Chi-squared approximation may be incorrect>
# this warning often means that the data were insufficient
tabs$a
#>      L1_L2
#> L1_L2 -_ə b_b b_w c_c
#>   a_a   2   4   0   4
#>   a_o   0   1   0   1
#>   a_u   0   0   1   1
Due to technical limitations of R and CRAN, primarily to do with encoding, sample datasets provided by ‘soundcorrs’ cannot be stored in the preloaded form (non-ASCII characters). They also cannot be automatically loaded when ‘soundcorrs’ is attached (staged install prevents this kind of usage of ‘system.file()’), and they cannot be included in full in the source files, even when Unicode characters are escaped, because Windows does not know how to convert those to native encoding. It seems that the only half-convenient way of making Unicode datasets available is through a separate function that can load them on the user’s request. ‘loadSampleDataset()’ is such a function.
It only takes one argument, ‘x’, which can take one of the following values:
‘long2wide()’ and ‘wide2long()’ are used to convert data frames between the “long format” and the “wide format” (see above). Of the two, ‘long2wide()’ is particularly useful because the “long format” tends to make manual segmentation easier for humans, and is therefore preferable for storing data, while the “wide format” is used internally and required by ‘soundcorrs’.
During the conversion, the number of columns is almost doubled (while the number of rows halved), but because it is unwise to have duplicate column names, they are given suffixes – which are taken from the values in the column ‘LANGUAGE’. The name of the column used for that purpose can be changed using the ‘col.lang’ argument.
Some of the attributes pertain to only one word in a pair or to the pair as a whole. In the “long format” those have to be repeated, but in the “wide format” this is not necessary. ‘long2wide()’ allows for certain columns to be excluded from the conversion, using the ‘skip’ argument.
# the “abc” dataset is in the long format
abc.long <- read.table (path.abc, header=T)
# the simplest conversion unnecessarily doubles the ID column
long2wide (abc.long)
#>   ID.L1 DIALECT.L1 ALIGNED.L1 ORTHOGRAPHY.L1 ID.L2 DIALECT.L2 ALIGNED.L2
#> 1     1        std      a|b|c            abc     1        std      a|b|c
#> 2     2        std    a|b|a|c           abac     2        std    a|b|a|c
#> 3     3        std      a|b|c            abc     3      north      o|b|c
#> 4     4        std    a|b|a|c           abac     4      north    u|w|u|c
#> 5     5        std    a|b|c|-            abc     5      south    a|b|c|ə
#> 6     6        std  a|b|a|c|-           abac     6      south  a|b|a|c|ə
#>   ORTHOGRAPHY.L2
#> 1            abc
#> 2           abac
#> 3           aobc
#> 4           uwuc
#> 5           abca
#> 6          abaca
# but this can be avoided with the ‘skip’ argument
abc.wide <- long2wide (abc.long, skip="ID")
‘ngrams()’ turns a vector of words into a list of n-grams, or a table of their frequencies. The first argument is the vector of words; the second is ‘n’, the length of n-grams to extract (defaults to ‘1’); and the last is ‘as.table’, which determines whether the output is a list of n-grams or a table of their frequencies (defaults to ‘TRUE’).
Two more arguments are available. ‘borders’ is a vector of two character strings: the first to be prepended to all the words, and the second to be appended to them. This way it is clear which n-grams were in the initial, and which in the final position inside the word. ‘borders’ defaults to a vector of two empty strings. Lastly, ‘rm’ is a string of characters that are to be removed from the words before they are cut into n-grams. For instance, to remove all linguistic zeros use ‘rm=“-”’, and to remove zeros and segment separators, use ‘rm=“[-\|]”’.
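What ‘borders’ and ‘rm’ are meant to achieve can be illustrated for a single word in base R. ‘bigrams()’ below is a hypothetical helper limited to n=2; it is only a sketch of the idea and makes no claim about how ‘ngrams()’ works internally.

```r
# hypothetical helper: bigrams of a single word, with optional cleaning
#    and with border marks added at both ends
bigrams <- function (word, borders=c("",""), rm="") {
    if (rm != "") word <- gsub (rm, "", word)
    word <- paste0 (borders[1], word, borders[2])
    chars <- strsplit (word, "") [[1]]
    if (length(chars) < 2) return (character(0))
    paste0 (chars[-length(chars)], chars[-1])
}

# zeros and separators removed, "^" and "$" marking the edges of the word
bigrams ("a|b|c|-", borders=c("^","$"), rm="[-|]")
```

With the borders in place, the initial “^a” and the final “c$” can be told apart from word-internal bigrams.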
# with n==1, ngrams() returns simply the frequencies of segments
ngrams (d.cap$data[,"ORTHOGRAPHY.Spanish"])
#> 
#>  A  B  C  D  E  H  L  M  N  P  R  S  T  V  Z  _  a  b  c  d  e  f  g  h  i 
#>  1  5  2  1  1  1  4  1  1  2  2  1  1  4  1  3 30  5  3  7 14  1  5  1 15 
#>  k  l  m  n  o  p  r  s  t  u  v  x  Á  í 
#>  1 11  5  9 10  2 11 13  7  9  2  1  1  4
# counts can easily be turned into a data frame with ranks
tab <- ngrams (d.cap$data[,"ORTHOGRAPHY.Spanish"], n=2)
mtx <- as.matrix (sort(tab,decreasing=T))
head (data.frame (RANK=1:length(mtx), COUNT=mtx, FREQ=mtx/sum(mtx)))
#>    RANK COUNT       FREQ
#> na    1     4 0.02339181
#> st    2     4 0.02339181
#> ag    3     3 0.01754386
#> ar    4     3 0.01754386
#> da    5     3 0.01754386
#> en    6     3 0.01754386
‘subset()’ does what its name suggests, i.e. it subsets a dataset using the provided condition. It returns a new ‘soundcorrs’ object.
# select only examples from L2’s northern dialect
subset (d.abc, DIALECT.L2=="north") $data
#>   ID DIALECT.L1 ALIGNED.L1 ORTHOGRAPHY.L1 DIALECT.L2 ALIGNED.L2
#> 3  3        std      a|b|c            abc      north      o|b|c
#> 4  4        std    a|b|a|c           abac      north    u|w|u|c
#>   ORTHOGRAPHY.L2
#> 3           aobc
#> 4           uwuc
# select only capitals of countries where German is an official language
subset (d.cap, grepl("German",d.cap$data$OFFICIAL.LANGUAGE)) $data
#>           ALIGNED.German ORTHOGRAPHY.German        ALIGNED.Polish
#> 5  l|u|k|s|ə|m|b|u|r|k|-          Luxemburg l|u|k|s|e|m|b|u|r|k|-
#> 19         v|ī|-|-|-|n|-               Wien         ẃ|-|e|d|e|ń|-
#> 21           b|ä|r|l|ī|n             Berlin           b|e|r|l|i|n
#> 23     b|r|ü|-|s|ə|l|-|-            Brüssel     b|r|u|k|s|e|l|a|-
#>    ORTHOGRAPHY.Polish       ALIGNED.Spanish  ORTHOGRAPHY.Spanish
#> 5          Luksemburg l|u|k|s|e|m|b|u|r|γ|o Ciudad_de_Luxemburgo
#> 19             Wiedeń         b|j|e|-|-|n|a                Viena
#> 21             Berlin           b|e|r|l|i|n               Berlín
#> 23           Bruksela     b|r|u|-|s|e|l|a|s             Bruselas
#>              OFFICIAL.LANGUAGE
#> 5  Luxembourgish,French,German
#> 19                      German
#> 21                      German
#> 23         Dutch,French,German
# select only pairs in which L1 a : L2 a
subset (d.abc, findPairs(d.abc,"a","a")$which) $data
#>   ID DIALECT.L1 ALIGNED.L1 ORTHOGRAPHY.L1 DIALECT.L2 ALIGNED.L2
#> 1  1        std      a|b|c            abc        std      a|b|c
#> 2  2        std    a|b|a|c           abac        std    a|b|a|c
#> 5  5        std    a|b|c|-            abc      south    a|b|c|ə
#> 6  6        std  a|b|a|c|-           abac      south  a|b|a|c|ə
#>   ORTHOGRAPHY.L2
#> 1            abc
#> 2           abac
#> 5           abca
#> 6          abaca
‘wide2long()’ is simply the inverse of ‘long2wide()’. The conversion may not be perfect, as the order of the columns may change.
In ‘long2wide()’, suffixes were taken from the values in the ‘LANGUAGE’ column; this time they must be specified explicitly. They will be stored in a column defined by the argument ‘col.lang’, which defaults to ‘LANGUAGE’. However, the string that separated column names from suffixes will not be removed by default. To strip it, the argument ‘strip’ needs to be set to the length of the separator.
# let us use the converted “abc” dataset
abc.wide
#>   ID DIALECT.L1 ALIGNED.L1 ORTHOGRAPHY.L1 DIALECT.L2 ALIGNED.L2
#> 1  1        std      a|b|c            abc        std      a|b|c
#> 2  2        std    a|b|a|c           abac        std    a|b|a|c
#> 3  3        std      a|b|c            abc      north      o|b|c
#> 4  4        std    a|b|a|c           abac      north    u|w|u|c
#> 5  5        std    a|b|c|-            abc      south    a|b|c|ə
#> 6  6        std  a|b|a|c|-           abac      south  a|b|a|c|ə
#>   ORTHOGRAPHY.L2
#> 1            abc
#> 2           abac
#> 3           aobc
#> 4           uwuc
#> 5           abca
#> 6          abaca
# with the separator preserved
wide2long (abc.wide, c(".L1",".L2"))
#>      ALIGNED DIALECT ORTHOGRAPHY ID LANGUAGE
#> 1      a|b|c     std         abc  1      .L1
#> 2    a|b|a|c     std        abac  2      .L1
#> 3      a|b|c     std         abc  3      .L1
#> 4    a|b|a|c     std        abac  4      .L1
#> 5    a|b|c|-     std         abc  5      .L1
#> 6  a|b|a|c|-     std        abac  6      .L1
#> 7      a|b|c     std         abc  1      .L2
#> 8    a|b|a|c     std        abac  2      .L2
#> 9      o|b|c   north        aobc  3      .L2
#> 10   u|w|u|c   north        uwuc  4      .L2
#> 11   a|b|c|ə   south        abca  5      .L2
#> 12 a|b|a|c|ə   south       abaca  6      .L2
# and with the separator removed
wide2long (abc.wide, c(".L1",".L2"), strip=1)
#>      ALIGNED DIALECT ORTHOGRAPHY ID LANGUAGE
#> 1      a|b|c     std         abc  1       L1
#> 2    a|b|a|c     std        abac  2       L1
#> 3      a|b|c     std         abc  3       L1
#> 4    a|b|a|c     std        abac  4       L1
#> 5    a|b|c|-     std         abc  5       L1
#> 6  a|b|a|c|-     std        abac  6       L1
#> 7      a|b|c     std         abc  1       L2
#> 8    a|b|a|c     std        abac  2       L2
#> 9      o|b|c   north        aobc  3       L2
#> 10   u|w|u|c   north        uwuc  4       L2
#> 11   a|b|c|ə   south        abca  5       L2
#> 12 a|b|a|c|ə   south       abaca  6       L2
If you found a bug, have a remark to make about ‘soundcorrs’, or wishes for its future releases, please write to kamil.stachowski@gmail.com.
If you use ‘soundcorrs’ in your research, please cite it as Stachowski K. [forthcoming]. soundcorrs: Tools for Semi-Automatic Analysis of Sound Correspondences.