class: center middle main-title section-title-4

# Text

.class-info[

**Session 13**

.light[PMAP 8921: Data Visualization with R<br>
Andrew Young School of Policy Studies<br>
May 2020]

]

---
name: outline
class: title title-inv-7

# Plan for today

--

.box-6.medium.sp-after[Qualitative text-based data]

--

.box-5.medium.sp-after[Crash course in<br>computational linguistics]

---
layout: false
name: text-data
class: center middle section-title section-title-6 animated fadeIn

# Qualitative text-based data

---
layout: true
class: title title-6

---

# Free responses

.center[
<figure>
  <img src="img/13/free-responses.png" alt="Free responses from a survey" title="Free responses from a survey" width="70%">
  <figcaption>Typical free responses from a survey</figcaption>
</figure>
]

---

# y tho?

.center[
<figure>
  <img src="img/13/word-cloud.png" alt="Bad word cloud" title="Bad word cloud" width="60%">
</figure>
]

---

# Some cases are okay

.center[
<figure>
  <img src="img/13/email-word-cloud.jpg" alt="What Happened? word cloud" title="What Happened? word cloud" width="45%">
</figure>
]

???

https://twitter.com/s_soroka/status/907941270735278085

---

# Word clouds for grownups

.box-inv-6[Count words, but in fancier ways]

.center[
<figure>
  <img src="img/13/cover.png" alt="Tidy text mining with R" title="Tidy text mining with R" width="30%">
</figure>
]

???

https://www.tidytextmining.com/

---
layout: false
class: bg-full
background-image: url("img/13/he-she-julia.png")

???

https://pudding.cool/2017/08/screen-direction/

---
layout: false
class: bg-90
background-image: url("img/13/minimap-1.png")

???
https://juliasilge.com/blog/song-lyrics-across/

---
layout: false
name: computational-linguistics
class: center middle section-title section-title-5 animated fadeIn

# Crash course in<br>computational linguistics

---
layout: true
class: title title-5

---

# Core concepts and techniques

--

.box-inv-5[Tokens, lemmas, and parts of speech]

--

.box-inv-5[Sentiment analysis]

--

.box-inv-5[tf-idf]

--

.box-inv-5[Topics and LDA]

--

.box-inv-5[Fingerprinting]

---

# Regular text

.small-code[
```
THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud
to say that they were perfectly normal, thank you very much. They were the last
people you'd expect to be involved in anything strange or mysterious, because
they just didn't hold with such nonsense. Mr. Dursley was the director of a firm
called Grunnings, which made drills. He was a big, beefy man with hardly any
neck, although he did have a very large mustache. Mrs. Dursley was thin and
blonde and had nearly twice the usual amount of neck, which came in very useful
as she spent so much of her time craning over garden fences, spying on the
neighbors. The Dursleys had a small son called Dudley and in their opinion there
was no finer boy anywhere. The Dursleys had everything they wanted, but they
also had a secret, and their greatest fear was that somebody would discover it.
They didn't think they could bear it if anyone found out about the Potters. Mrs.
Potter was Mrs. Dursley's sister, but they hadn't met for several years; in
fact, Mrs. Dursley pretended she didn't have a sister, because her sister and
her good-for-nothing husband were as unDursleyish as it was possible to be. The
Dursleys shuddered to think what the neighbors would say if the Potters a...
```
]

---

# Tidy text

.box-inv-5[One row for each text element]

.box-5.small[Can be chapter, page, verse, etc.]

.small-code[
```
# A tibble: 6 x 3
  chapter book                       text
    <int> <chr>                      <chr>
1       1 Harry Potter and the Phil… "THE BOY WHO LIVED Mr. and Mrs. Dursley, of number …
2       2 Harry Potter and the Phil… "THE VANISHING GLASS Nearly ten years had passed si…
3       3 Harry Potter and the Phil… "THE LETTERS FROM NO ONE The escape of the Brazilia…
4       4 Harry Potter and the Phil… "THE KEEPER OF THE KEYS BOOM. They knocked again. D…
5       5 Harry Potter and the Phil… "DIAGON ALLEY Harry woke early the next morning. Al…
6       6 Harry Potter and the Phil… "THE JOURNEY FROM PLATFORM NINE AND THREE-QUARTERS …
```
]

---

# Tokens

.box-inv-5[Split the text into even smaller parts]

.box-5.small[Paragraph, line, verse, sentence, n-gram, word, letter, etc.]

.pull-left.small-code[
```
# A tibble: 6 x 3
  word  chapter book
  <chr>   <int> <chr>
1 the         1 Harry Potter...
2 boy         1 Harry Potter...
3 who         1 Harry Potter...
4 lived       1 Harry Potter...
5 mr          1 Harry Potter...
6 and         1 Harry Potter...
```
]

--

.pull-right.small-code[
```
# A tibble: 6 x 3
  bigram    chapter book
  <chr>       <int> <chr>
1 the boy         1 Harry Potter...
2 boy who         1 Harry Potter...
3 who lived       1 Harry Potter...
4 lived mr        1 Harry Potter...
5 mr and          1 Harry Potter...
6 and mrs         1 Harry Potter...
```
]

---

# Stop words

.box-inv-5[Common words that we can generally ignore]

.center.small-code[
```
# A tibble: 1,149 x 2
   word        lexicon
   <chr>       <chr>
 1 a           SMART
 2 a's         SMART
 3 able        SMART
 4 about       SMART
 5 above       SMART
 6 according   SMART
 7 accordingly SMART
 8 across      SMART
 9 actually    SMART
10 after       SMART
# … with 1,139 more rows
```
]

---

# Token frequency: words

<img src="13-slides_files/figure-html/hp-words-1.png" width="100%" style="display: block; margin: auto;" />

---

# Token frequency: n-grams

<img src="13-slides_files/figure-html/hp-bigrams-1.png" width="100%" style="display: block; margin: auto;" />

---

# Token frequency: n-gram ratios

<img src="13-slides_files/figure-html/hp-se-she-1.png" width="100%" style="display: block; margin: auto;" />

---

# Parts of speech

.small-code[
```
# A tibble: 50 x 11
   doc_id   sid   tid token  token_with_ws lemma upos  xpos  feats     tid_source relation
    <dbl> <dbl> <dbl> <chr>  <chr>         <chr> <chr> <chr> <chr>          <dbl> <chr>
 1      1     1     1 THE    THE           the   DET   DT    Definite…          2 det
 2      1     1     2 BOY    BOY           Boy   NOUN  NN    Number=S…         18 nsubj
 3      1     1     3 WHO    WHO           who   PRON  WP    PronType…          4 nsubj
 4      1     1     4 LIVED  LIVED         live  VERB  VBD   Mood=Ind…          2 acl:rel…
 5      1     1     5 Mr.    Mr.           Mr.   PROPN NNP   Number=S…          4 xcomp
 6      1     1     6 and    and           and   CCONJ CC    <NA>               7 cc
 7      1     1     7 Mrs.   Mrs.          Mrs.  PROPN NNP   Number=S…          5 conj
 8      1     1     8 Dursl… Dursley       Durs… PROPN NNP   Number=S…          7 flat
 9      1     1     9 ,      ,             ,     PUNCT ,     <NA>               5 punct
10      1     1    10 of     of            of    ADP   IN    <NA>              11 case
# … with 40 more rows
```
]

.box-inv-5.small[The `xpos` column uses the [Penn Treebank part-of-speech tags](https://cs.nyu.edu/grishman/jet/guide/PennPOS.html)]

---

# Parts of speech frequency

.pull-left-3.small-code[
.box-inv-5.small[Verbs]

```
# A tibble: 1,557 x 2
   lemma     n
   <chr> <dbl>
 1 say     920
 2 get     440
 3 have    417
 4 go      384
 5 look    380
 6 be      310
 7 know    310
 8 see     303
 9 think   230
10 do      227
# … with 1,547 more rows
```
]

--

.pull-middle-3.small-code[
.box-inv-5.small[Nouns]

```
# A tibble: 2,852 x 2
   lemma          n
   <chr>      <dbl>
 1 Harry       1315
 2 Ron          423
 3 Hagrid       258
 4 Professor    167
 5 Snape        154
 6 Hermione     153
 7 Dumbledore   144
 8 time         138
 9 Dudley       136
10 uncle        122
# … with 2,842 more rows
```
]

--

.pull-right-3.small-code[
.box-inv-5.small[Adjectives & adverbs]

```
# A tibble: 1,240 x 2
   lemma     n
   <chr> <dbl>
 1 back    223
 2 so      215
 3 just    180
 4 when    178
 5 very    171
 6 now     166
 7 then    165
 8 all     147
 9 how     136
10 there   123
# … with 1,230 more rows
```
]

---

# Artsy stuff

.center[
<figure>
  <img src="img/13/closeup.jpg" alt="Alice in Wonderland punctuation by Nicholas Rougeux" title="Alice in Wonderland punctuation by Nicholas Rougeux" width="100%">
</figure>
]

???
[*Between the Words*](https://www.c82.net/work/?id=347) by Nicholas Rougeux

---

# Sentiment analysis

.pull-left-3.small-code[
```r
get_sentiments("bing")
```

```
# A tibble: 6,786 x 2
   word        sentiment
   <chr>       <chr>
 1 2-faces     negative
 2 abnormal    negative
 3 abolish     negative
 4 abominable  negative
 5 abominably  negative
 6 abominate   negative
 7 abomination negative
 8 abort       negative
 9 aborted     negative
10 aborts      negative
# … with 6,776 more rows
```
]

--

.pull-middle-3.small-code[
```r
get_sentiments("afinn")
```

```
# A tibble: 2,477 x 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# … with 2,467 more rows
```
]

--

.pull-right-3.small-code[
```r
get_sentiments("nrc")
```

```
# A tibble: 13,901 x 2
   word        sentiment
   <chr>       <chr>
 1 abacus      trust
 2 abandon     fear
 3 abandon     negative
 4 abandon     sadness
 5 abandoned   anger
 6 abandoned   fear
 7 abandoned   negative
 8 abandoned   sadness
 9 abandonment anger
10 abandonment fear
# … with 13,891 more rows
```
]

---

<img src="13-slides_files/figure-html/hp-net-sentiment-1.png" width="100%" style="display: block; margin: auto;" />

---

# tf-idf

.box-inv-5[Term frequency-inverse document frequency]

.box-5.small[How important a term is compared to the rest of the documents]

$$
`\begin{aligned}
tf(\text{term}) &= \frac{n_{\text{term}}}{n_{\text{terms in document}}} \\
idf(\text{term}) &= \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)} \\
tf\text{-}idf(\text{term}) &= tf(\text{term}) \times idf(\text{term})
\end{aligned}`
$$

---

# tf-idf

<img src="13-slides_files/figure-html/hp-tf-idf-1.png" width="100%" style="display: block; margin: auto;" />

---

# Topic modeling

.pull-left.center[
<figure>
  <img src="img/13/laurel-thatcher-ulrich.jpg" alt="Laurel Thatcher Ulrich" title="Laurel Thatcher Ulrich" width="55%">
</figure>
]

.pull-right.center[
<figure>
  <img src="img/13/midwifes-tale.jpg" alt="A Midwife's Tale" title="A Midwife's Tale" width="55%">
</figure>
]

???

https://commons.wikimedia.org/wiki/File:Laurel_Thatcher_Ulrich_(32803708014).jpg

https://commons.wikimedia.org/wiki/File:A_Midwife%27s_Tale_by_Laurel_Thatcher_Ulrich.jpg

---

# Latent Dirichlet Allocation (LDA)

.center[
<figure>
  <img src="img/13/LDA.png" alt="Latent Dirichlet Allocation" title="Latent Dirichlet Allocation" width="80%">
</figure>
]

---

# Clusters of related words

<table>
  <tr>
    <th class="cell-left">Topic label</th>
    <th class="cell-left">Topic words</th>
  </tr>
  <tr>
    <td class="cell-left">Midwifery</td>
    <td class="cell-left">birth safe morn receivd calld left cleverly pm labour …</td>
  </tr>
  <tr>
    <td class="cell-left">Church</td>
    <td class="cell-left">meeting attended afternoon reverend worship …</td>
  </tr>
  <tr>
    <td class="cell-left">Death</td>
    <td class="cell-left">day yesterday informd morn years death expired …</td>
  </tr>
  <tr>
    <td class="cell-left">Gardening</td>
    <td class="cell-left">gardin sett worked clear beens corn warm planted …</td>
  </tr>
  <tr>
    <td class="cell-left">Shopping</td>
    <td class="cell-left">lb made brot bot tea butter sugar carried …</td>
  </tr>
  <tr>
    <td class="cell-left">Illness</td>
    <td class="cell-left">unwell sick gave dr rainy easier care head neighbor …</td>
  </tr>
</table>

???

https://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/

http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/

---

# Track topics over time

.pull-left.center[
<figure>
  <img src="img/13/coldweatherbymonth.png" alt="Cold weather by month" title="Cold weather by month" width="100%">
  <figcaption>Cold weather topic by month</figcaption>
</figure>
]

--

.pull-right.center[
<figure>
  <img src="img/13/emotionbyyear.png" alt="Emotion by year" title="Emotion by year" width="100%">
  <figcaption>Emotion topic over time</figcaption>
</figure>
]

???
https://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/

---

# State of the Union addresses

.center[
<figure>
  <img src="img/13/sotu.png" alt="State of the Union topics over time" title="State of the Union topics over time" width="37%">
</figure>
]

???

https://cran.r-project.org/web/packages/cleanNLP/vignettes/state-of-union.html

---

# Fingerprinting

.box-inv-5[Analyze richness or uniqueness of a document]

.box-5[Punctuation patterns, vocabulary choices, sentence length]

.box-5[Hapax legomenon]

---

# Sentence length

.center[
<figure>
  <img src="img/13/fingerprint-sentence.png" alt="Sentence length heatmaps" title="Sentence length heatmaps" width="75%">
</figure>
]

???

https://kops.uni-konstanz.de/bitstream/handle/123456789/5492/Literature_Fingerprinting.pdf

---

# Hapax legomena

.center[
<figure>
  <img src="img/13/fingerprint-hapax.png" alt="Hapax heatmaps" title="Hapax heatmaps" width="75%">
</figure>
]

---

# Verse length

.center[
<figure>
  <img src="img/13/fingerprint-verse.png" alt="Verse length heatmaps" title="Verse length heatmaps" width="40%">
</figure>
]
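
---

# Fingerprinting: a rough sketch

A minimal base R sketch of two of the fingerprint measures from this section: the share of distinct words that are hapax legomena (words used exactly once) and the mean sentence length. The toy `docs` vector and the helper names `tokenize_words()`, `hapax_share()`, and `mean_sentence_length()` are made up for illustration and are not from the original slides. The regex tokenizer is a simplification, and dividing by the number of distinct words is only one of several reasonable ways to define a hapax share.

```r
# Toy documents, invented for illustration only
docs <- c(
  doc_a = "The cat sat on the mat. The cat slept.",
  doc_b = "A dog barked. The dog ran far away, very far."
)

# Lowercase, then split on any run of characters that is not a letter or apostrophe
tokenize_words <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words[words != ""]
}

# Share of distinct words that appear exactly once (hapax legomena)
hapax_share <- function(text) {
  counts <- table(tokenize_words(text))
  sum(counts == 1) / length(counts)
}

# Mean number of words per sentence, treating ., !, and ? as sentence ends
mean_sentence_length <- function(text) {
  sentences <- strsplit(text, "[.!?]+")[[1]]
  sentences <- sentences[nchar(trimws(sentences)) > 0]
  mean(vapply(sentences, function(s) length(tokenize_words(s)), integer(1)))
}

# One row per document, one column per fingerprint measure
fingerprint <- data.frame(
  doc = names(docs),
  hapax_share = unname(vapply(docs, hapax_share, numeric(1))),
  mean_sentence_length = unname(vapply(docs, mean_sentence_length, numeric(1)))
)

fingerprint
```

With real books, the same measures could be computed per chapter or per block of text and then mapped to fill colors to get heatmaps like the ones in this section.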