02-concordance.qmd

---
title: "Concordance"
author: 
  - name:
      given: "Gede Primahadi Wijaya"
      family: "Rajeg"
    url: https://www.ling-phil.ox.ac.uk/people/gede-rajeg
    orcid: 0000-0002-2047-8621
    affiliation:
      - '<a href="https://www.ling-phil.ox.ac.uk/people/gede-rajeg" target="_blank" style="color:DodgerBlue;">University of Oxford</a> / <a href="https://www.cirhss.org/" target="_blank" style="color:DodgerBlue;">CIRHSS</a> & <a href="https://github.com/complexico" target="_blank" style="color:DodgerBlue;">CompLexico</a>, Udayana University'
format: 
  revealjs:
    slide-number: true
    preview-links: auto
    css: styles.css
date: 2024-07-22
date-modified: now
editor: visual
bibliography: references.bib
csl: "https://raw.githubusercontent.com/citation-style-language/styles/master/unified-style-sheet-for-linguistics.csl"
---

## Roadmap

::: incremental
1.  What is a concordance?

2.  What can we learn from looking at concordance lines?

3.  Common aspects to look at from concordance

4.  My typical workflow in analysing concordance

5.  Practice
:::

## What is a concordance?

::: incremental
-   “a collection of the occurrences of a word-form, each in its own textual environment. In its simplest form, it is an index. Each word-form is indexed, and a reference is given to the place of each occurrence in a text” [@sinclair1991, 32].

-   “gives contextual information about a word” [@barth2021, 82].
:::

## What is a concordance? {.scrollable}

::: columns
::: {.column width="50%"}
```{r shakespeare-conc}
#| echo: false
#| out-width: 5in
#| out-height: 6in
#| fig-cap: "A snippet of a concordance of [*charm* in Shakespeare's works](https://www.opensourceshakespeare.org/search/search-results.php?link=con&works[]=*&keyword1=charm&sortby=WorkName&pleasewait=1&msg=sr)."
#| fig-align: left
knitr::include_graphics("img/05-shakespeare-conc.png")
```
:::

::: {.column width="50%"}
“a collection of the occurrences of a word-form, each in its own textual environment. In its simplest form, it is an index. *Each [**word-form**]{style="background-color:pink;"} is indexed, and a [**reference**]{style="color:#0527C2;"} is given [**to the place of each occurrence**]{style="color:#0527C2;"} in a text*” [@sinclair1991, 32, emphases mine].
:::
:::

## What is a concordance? {.scrollable}

MS Word `find` is a kind of concordance!

```{r msword-conc}
#| echo: false
#| label: msword-conc
#| fig-align: left
#| out-width: 7in
#| out-height: 4in
#| fig-asp: .68
#| fig-width: 7
#| fig-dpi: 300
#| fig-scap: "A concordance in MS Word"

knitr::include_graphics("img/05-word-conc.png")
```

## What is a concordance? Key Word in Context (KWIC)

KWIC: a modern-day format of a concordance.

```{r ske-charm}
#| echo: false
#| out-width: 7in
#| out-height: 4in
#| fig-align: left
#| fig-dpi: 300
#| fig-width: 5
#| fig-height: 3
#| fig-subcap: "A snippet of SE's [concordance of the lemma *CHARM*](https://ske.li/15y) in _English Web 2021_"

knitr::include_graphics("img/05-ske-conc-charm-lemma.png")
```

## What is a concordance?

Take away:

-   essentially a means to display (raw, unanalysed) data more efficiently

-   does not perform the analysis to answer the research question

    -   *we*, the researcher, need to analyse them in such a way that these concordance data could provide answer to our research question(s).

    -   It is insufficient just to know the how-to of *corpus* linguistics, but also the *linguistics* side.

## Roadmap

::: nonincremental
~~1. What is a concordance?~~

2.  What can we learn from looking at concordance lines?

3.  Common aspects to look at from concordance

4.  My typical workflow in analysing concordance

5.  Practice
:::

## What can we learn from looking at concordance lines? {.scrollable}

Electronic concordance can be sorted (alphabetically) in various ways [@tribble2010, 175]:

-   by the node/target word/lemma/phrase

-   by the left context of the node

-   by the right context of the node

These sorting help reveals different kinds of usages of the node [cf. @hunston2002, 42-52].

## 

![](img/05-ske-conc-charm-left-context.png){.absolute top="60" left="0" width="90%"}

::: {.absolute left="30" top="20" style="font-size:.8em;"}
Left-context sorting for the [concordance of *charm*](https://ske.li/15z). Any regularity/pattern?
:::

::: {.absolute left="30" bottom="10" style="font-size:.5em;"}
Concordance link: <https://ske.li/15z>
:::

## What can we learn from looking at concordance lines? {.scrollable}

See Hunston[-@hunston2002, 42-52] for details of the following points.

-   “Observing the ‘central’ and ‘typical’”

    -   [in frequency terms]{style="color:LightSeaGreen"}

-   “Observing meaning distinctions”

    -   [different usage patterns of near-synonyms]{style="color:LightSeaGreen"}

-   “Observing meaning and patterns”

    -   [distinct senses/meanings of a word and their patternings (lexical, morpho-syntactical, morphological, etc.) [cf. @gries2006]]{style="color:LightSeaGreen"}

    -   “[Distinguishing between the meanings is a matter of distinguishing between patterns of usage.” [@hunston2002, 47]]{style="color:SteelBlue"}

-   “Observing detail”

    -   [mass of data, more specific observation of individual words]{style="color:LightSeaGreen"}

    -   e.g., [different semantic type of verbs following the phrase *advice~noun~ as to* [@hunston2002, 52]]{style="color:SteelBlue"}

## Roadmap

::: nonincremental
~~1. What is a concordance?~~

~~2. What can we learn from looking at concordance lines?~~

3.  Common aspects to look at from concordance

4.  My typical workflow in analysing concordance

5.  Practice
:::

## Common aspect to look at from concordance {.scrollable}

See Sinclair [-@sinclair2004, Ch. 2] for further details for “The search for units of meaning”

::: incremental
-   *collocation* [@sinclair2004, 28]: [“a frequent co-occurrence of words”]{style="color:crimson"}

-   *colligation* [@sinclair2004, 32]: [“co-occurrence of grammatical choices”]{style="color:crimson"}

-   *semantic preference*

    -   semantic abstraction/categorisation from a group of collocates

    -   example:

        -   categorising the semantic type of noun collocates of the EXPERIENCER argument of ANGRY vs. MAD [@suari2024]; categorisation based on the cross-linguistic semantic catalogue *Concepticon* [@list]

            -   [visualising the results](https://github.com/complexico/anger-mad-coca/blob/main/figs/08-fig3-8-Experiencer-collocates.png) based on the collocation strength of these collocates with the adjectives in COCA corpus

-   *semantic prosody*

    -   pragmatic aura of the context surrounding the node word

    -   typically whether the context is positive or negative

    -   “connotation” in the traditional semantics [@mcenery2012, 136]
:::

## Roadmap

::: nonincremental
~~1. What is a concordance?~~

~~2. What can we learn from looking at concordance lines?~~

~~3. Common aspects to look at from concordance~~

4.  My typical workflow in analysing concordance

5.  Practice
:::

## My typical workflow in analysing concordance {.scrollable}

-   Define my research question(s) (RQs)

    -   [usually within theoretical contexts, mainly in Cognitive Linguistics (esp. cognitive lexical semantics & construction grammar) and/or Austronesian/Indonesian linguistics.]{style="color:SteelBlue"}

-   Extract concordances of the phenomenon from a corpus (not using Sketch Engine)

    -   remember, [concordance is *just* a way to display the raw data]{style="color:crimson"}

    -   [operationalise your RQs]{style="color:SteelBlue"} such that they [translate into **what** to extract]{style="color:SteelBlue"} from the corpus to retrieve the (candidate) data to be analysed (i.e., qualitatively annotated and further analysed statistically) to answer your RQs.

-   Annotate/Code each line of the concordance in MS Excel/Spreadsheet software

    -   [identify the variable(s) to be analysed based on your research question]{style="color:SteelBlue"}

        -   [annotate that as columns in Excel]{style="color:SteelBlue"}; [**one variable, one column**]{style="color:crimson"}

        -   at this stage, [theoretical framework/linguistic skills will be important]{style="color:SteelBlue"} (at least for me)

-   Statistical/quantitative analyses + visualisation in R

    -   [*any* statistical/programming software can be used **as long as** they provide the statistical techniques you are going to use.]{style="color:SteelBlue"}

    -   [very much dependent on your RQs]{style="color:crimson"}

-   Report the results

## My typical workflow in analysing concordance: *Example* {.scrollable}

::: incremental
-   [Topic]{style="color:MediumVioletRed"}: Indonesian diathesis [cf. @rajeg, *forthcoming*]

-   [Context - Qualitative (theoretical) claim]{style="color:MediumVioletRed"} [@moeliono2017, 257; @sneddon2010, 470]:

    -   In Indonesian passive clause, AGENT can be **optionally** expressed

    -   How would we measure this optionality?

-   [Context - Quantitative (null-hypothesis/theoretical) prediction]{style="color:MediumVioletRed"}:

    -   If we look at a sample of passive clauses (say from a corpus), the presence/absence of AGENT would be roughly equal

-   [Research question]{style="color:MediumVioletRed"}:

    -   To what extent does AGENT is explicitly expressed or suppresed in a sample of passive clauses (with the Indonesian passive *di*- prefix)?

-   [Operationalisation - corpus query]{style="color:MediumVioletRed"}:

    -   What should we extract from the corpus as a concordance? 🤔

        -   passive sentences containing verb marked with *di*-

    -   **No syntactic parsing** in the Indonesian corpus; what to do? 🤷

        -   regular expressions \[RegEx\] “`di[a-z]{5,}`”

    -   50 random concordance lines for each of the total 12 genres

        -   Total raw data to analyse: 600 lines (50 lines \* 12 genres)

    -   save the concordance into tab-separated plain text to be opened in MS Excel

-   [Operationalisation - annotating the concordance]{style="color:MediumVioletRed"}:

    -   Given the RQ, what variable/aspect would we annotate from the concordance sample? How?

        -   we need to annotate the presence/absence of AGENT argument such that later on we can **count**/**quantify** how much is the AGENT present/absent.

    -   Let's see the [concordance data](https://github.com/gederajeg/diatesis-bahasa-indonesia/blob/0.0.1/data/potential_di_passive_sample.tsv)

-   [Results]{style="color:MediumVioletRed"}

    -   Theoretical claim: AGENT is optionally expressed in Indonesian passive with *di*-

        -   How would we measure this optionality?

    -   RQ: To what extent does AGENT is explicitly expressed or suppresed in a sample of passive clauses (with the Indonesian passive *di*- prefix)?

    -   Corpus finding:

```{r}
#| echo: false
#| out-width: 7in
#| out-height: 4in
knitr::include_graphics(path = "img/05-fig-long-short-passive-1.png")
```
:::

## Roadmap

::: nonincremental
~~1. What is a concordance?~~

~~2. What can we learn from looking at concordance lines?~~

~~3. Common aspects to look at from concordance~~

~~4. My typical workflow in analysing concordance~~

5.  Practice
:::

## Practice {.scrollable}

-   Extract concordance of a word or phrase from SkE

-   Retrieve random sample from SkE

-   Try to sort the concordance into more meaningful display to detect pattern

-   Export to .csv/.xlsx

-   Open in MS Excel

-   Annotate co-occurrence of the node in terms of its:

    -   lexical co-occurrence (collocation)

        -   e.g., the subject/object of a verb; the modifier of a head noun; any recurring prepositional combination (?); others (?)

    -   grammatical preference of the node word

        -   e.g., syntactic patterning

    -   semantic preference

    -   semantic prosody

-   share and discuss your finding

# End of `Concordance`

- source files for all materials:

    - <https://github.com/complexico/dipscorling2024>

- pdf version as a handout [here](https://github.com/complexico/dipscorling2024/blob/main/02-concordance.pdf)

- How to cite these materials:

    > Rajeg, Gede Primahadi Wijaya. 2024. Materials for the *Diponegoro Summer Course in Corpus Linguistics* (*DipSCORLING 2024*) (22 - 27 July 2024). R Quarto. Zenodo. [https://doi.org/10.5281/zenodo.12793922](https://doi.org/10.5281/zenodo.12793922). (22 July, 2024).

## References