Over the last three years we at Causal Map Ltd have developed a range of algorithms for causal mapping. The algorithms are published in the accompanying file cm3functions.R, under an open-source license, as functions for the programming language R. These functions are the heart of the analysis engine in our app Causal Map 3. The app itself is free to use for analysing existing causal mapping data files but it is closed source, so these functions are a reproducible window into the algorithms: anyone who wants to analyse and visualise causal mapping data can use them without using the Causal Map app.
The main purpose of this document is to publish, and provide examples for, a set of algorithms for causal mapping analysis: ways to manipulate, simplify and query causal mapping data.
If you already have causal codings, e.g. an Excel sheet with at least two columns for “cause” and “effect”, these functions could be very useful to you.
Note, though, that if you have such data, you can use the Causal Map app itself for free without installing the functions.
The documentation for the functions is still sparse, but if there is interest we can provide more. We are working on training materials for doing causal mapping with methods which do not need the Causal Map app.
If you want to do causal coding, i.e. creating raw causal data (at minimum an ‘edge list’ of arrows between factors), these functions won’t help you; you’d have to use the Causal Map app or some other method.
When we get a chance, we will release the functions as a full R package. For now, you can read and download a single R script which contains the functions and everything you need to use them. We are still in the process of documenting the functions. To make them easier to use, we provide this vignette, which demonstrates many of the functions as applied to example causal mapping data files. (The script below downloads the latest version of the functions script, the main example file, and a special example file which demonstrates the use of combined opposites.)
This vignette is an .Rmd file: you can process it using the package knitr or by pressing Knit in RStudio. Or you can use it as a source of examples to modify for your own work in the R console.
The functions are used within Causal Map 3, a web app which provides an interface for managing causal mapping projects and also enables autocoding with AI. The functions provided here do not yet include functions for using AI.
These functions completely replace those in the old R package CausalMapFunctions, which is now deprecated. The new set of functions is much simpler.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(shiny)
library(DiagrammeR)
library(textreadr)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
htmlx <- message
if(Sys.getenv("USERDOMAIN") != "STEVE-P") {
url <- "https://www.dropbox.com/scl/fi/ulrtbj9fmgu1ltdamsvhy/example-file.xlsx?rlkey=pczexdevcxgz4if2muqwr0dng&dl=1"
download.file(url = url, destfile = "example-file.xlsx",method="curl",extra='-L')
url <- "https://www.dropbox.com/scl/fi/t9fnja78jo2zb650fesqs/opposites-example.xlsx?rlkey=t9cd4afr77te3bs98hnv6rm5v&dl=1"
download.file(url = url, destfile = "opposites-example.xlsx",method="curl",extra='-L')
url <- "https://www.dropbox.com/scl/fi/mtlaf5rmmdrgp3bxbrydz/cm3functions.R?rlkey=7hli0uaookisr9seahpzczynk&dl=1"
download.file(url = url, destfile = "cm3functions.R",method="curl",extra='-L')
}
notify <- function(message, level) {
  message(message)
}
# for the convenience of the developer
if(Sys.getenv("USERDOMAIN") == "STEVE-P") {
source("Rfiles/cm3functions.R")
ex="assets/import_examples/example-file.xlsx" %>%
get_from_excel()
ex$links$sentiment <- 0
ex2="assets/import_examples/opposites-example.xlsx" %>%
get_from_excel()
} else {
source("cm3functions.R")
ex="example-file.xlsx" %>%
get_from_excel()
ex2="opposites-example.xlsx" %>%
get_from_excel()
}
ex2$links$sentiment <- 0
ex$links$sentiment <- 0
# constants -----------------------------------------------------------------
contrary_color <- "#f2ab73"
ordinary_color <- "#058488"
flipchar <- "~"
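The constant flipchar above marks a factor label as the opposite of another factor (used later with pipe_combine_opposites()). A minimal sketch of how such labels can be detected and stripped with stringr; the labels here are invented for illustration:

```r
library(stringr)

flipchar <- "~"  # same flip marker as in the constants above

labels <- c("Income", "~Income", "Food security")

# Which labels are flipped, i.e. opposites of another factor?
str_detect(labels, fixed(flipchar))
# FALSE TRUE FALSE

# Strip the marker to recover the base factor label:
str_remove(labels, fixed(flipchar))
# "Income" "Income" "Food security"
```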
Once the commands above have downloaded the data files, you can comment them out.
A causal map data file is a list of tibbles (data frames): any of links, statements, sources, questions and settings, but at least links or statements must be present.
These data frames correspond to the tabs in the equivalent Excel format. You can import or export Causal Map files as Excel (“round-tripping”).
An uncoded file might contain just a set of statements for coding.
A coded data file might contain just a set of links which have already been coded.
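The Excel round-tripping can be sketched as follows. This is an illustration, not the package's own export helper: it assumes the writexl package, whose write_xlsx() writes each element of a named list of data frames to its own tab, matching the tab structure described above. All ids and text are invented.

```r
# A causal map file is a named list of data frames, one per Excel tab.
tiny_file <- list(
  statements = data.frame(statement_id = "S1",
                          text = "Rain improved the harvest."),
  links = data.frame(link_id = 1L, statement_id = "S1",
                     from_label = "Rain", to_label = "Harvest",
                     quote = "Rain improved the harvest")
)

# Write one tab per table (guarded, in case writexl is not installed):
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(tiny_file, "tiny-example.xlsx")
}
```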
The optional tibbles sources and questions provide additional metadata about the statements and links.
All the main functions here deal only with the links table, which is of the most interest for causal mapping. The pipe_* functions take a links table as input and return a modified links table, so they can be chained indefinitely.
Each row of the statements table is one chunk of text, usually one or a few paragraphs. The table contains at least the field statement_id, which must be unique for each row, and the field text, which contains the text of the statement to be coded.
Coding means adding rows to a table of links; each row is one coding. This table contains at least the fields link_id (a unique integer), statement_id, which must correspond to a row in the statements table, and quote, which contains a substring of the corresponding statement.
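To make the minimum structure concrete, here is a hand-built pair of tables satisfying the requirements above. All ids, labels and text are invented for illustration:

```r
library(tibble)

statements <- tibble(
  statement_id = c("S1", "S2"),                 # unique per row
  text = c("We earned more because the harvest was good.",
           "Better income meant we could buy more food.")
)

links <- tibble(
  link_id = c(1L, 2L),                          # unique integers
  statement_id = c("S1", "S2"),                 # must match a statement row
  from_label = c("Good harvest", "Income"),
  to_label   = c("Income", "Ability to buy food"),
  quote = c("earned more because the harvest was good",
            "Better income meant we could buy more food")
)

# Every link points back at a real statement:
all(links$statement_id %in% statements$statement_id)
# TRUE
```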
We will inspect the links, sources and statements tables.
ex$links %>%
arrange(desc(source_count)) %>%
select(from_label,to_label,source_count,source_id,link_id,statement_id) %>%
head
## # A tibble: 6 × 6
## from_label to_label source_count source_id link_id statement_id
## <chr> <chr> <int> <chr> <dbl> <chr>
## 1 Health behaviour; Use pi… Improve… 16 MNX-2 939 MNX-2 | 244
## 2 Health behaviour; Use pi… Improve… 16 TWX-1 961 TWX-1 | 169
## 3 Health behaviour; Use pi… Improve… 16 TWX-2 964 TWX-2 | 206
## 4 Health behaviour; Use pi… Improve… 16 MSX-2 977 MSX-2 | 323
## 5 Health behaviour; Use pi… Improve… 16 MSY-1 981 MSY-1 | 363
## 6 Health behaviour; Use pi… Improve… 16 TWX-3 987 TWX-3 | 401
ex$links %>%
arrange(desc(source_count)) %>%
distinct(from_label,to_label,.keep_all = T) %>%
select(from_label,to_label,source_count,source_id,link_id,statement_id) %>%
head
## # A tibble: 6 × 6
## from_label to_label source_count source_id link_id statement_id
## <chr> <chr> <int> <chr> <dbl> <chr>
## 1 Health behaviour; Use pi… Improve… 16 MNX-2 939 MNX-2 | 244
## 2 Community groups/learning Communi… 14 MNY-1 354 MNY-1 | 70
## 3 Health behaviour; Use be… Improve… 14 MNX-2 938 MNX-2 | 244
## 4 Health behaviour; Use mo… Improve… 14 MNX-2 940 MNX-2 | 244
## 5 Farm production Income 13 MNY-1 348 MNY-1 | 62
## 6 Increased knowledge; Far… Planted… 13 TWX-1 358 TWX-1 | 88
ex$sources %>%
select(1:6) %>%
head
## # A tibble: 6 × 6
## source_id `Household code` `#Name of province` `#Name of village`
## <chr> <chr> <chr> <chr>
## 1 MNX-1 MNX1 A 1
## 2 MNX-2 MNX2 A 1
## 3 MNX-3 MNX3 A 1
## 4 MNX-4 MNX4 A 1
## 5 MNX-5 MNX5 A 1
## 6 MNX-6 MNX6 A 1
## # ℹ 2 more variables: `#Age of the main respondent` <chr>,
## # `Household size (residents > 6months)` <chr>
ex$statements %>%
select(1:6) %>%
head
## # A tibble: 6 × 6
## text source_id question_id created modified statement_id
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 This is a fully coded exa… MSY-3 $A8 2023-1… 2023-11… MSY-3 | 1
## 2 There has been a lot of c… MSY-3 $B1 2023-1… 2023-11… MSY-3 | 2
## 3 Improved MSY-3 $*B2 2023-1… 2023-11… MSY-3 | 3
## 4 There has been improvemen… MSY-3 $B3 2023-1… 2023-11… MSY-3 | 4
## 5 Yes there have been some … MSY-3 $C1 2023-1… 2023-11… MSY-3 | 5
## 6 Increased MSY-3 $*C2 2023-1… 2023-11… MSY-3 | 6
The optional sources table contains information about the sources in key-value format.
In Causal Map 3, there is no independent table for factors; factors are implied by the link labels. The function make_factors_from_links() shows which factors are implied by the links in the file.
ex$links %>%
make_factors_from_links() %>%
arrange(desc(source_count)) %>%
head
## # A tibble: 6 × 11
## label found found_source found_target source_count link_count in_degree
## <chr> <lgl> <lgl> <lgl> <int> <int> <int>
## 1 Improved he… FALSE FALSE FALSE 18 150 7
## 2 Diet improv… FALSE FALSE FALSE 17 51 7
## 3 Food consum… FALSE FALSE FALSE 17 37 6
## 4 Health beha… FALSE FALSE FALSE 17 19 16
## 5 Community g… FALSE FALSE FALSE 16 37 37
## 6 Farm produc… FALSE FALSE FALSE 16 158 81
## # ℹ 4 more variables: out_degree <int>, outcomeness <dbl>, flipped_from <int>,
## # flipped_to <int>
These functions use tidyverse conventions and workflows throughout.
There are no special functions for filtering the links because the tidyverse workflow is so simple:
ex$links %>%
filter(str_detect(source_id,"MNX"))
## # A tibble: 212 × 46
## from_label to_label quote link_id hashtags quote_start quote_end
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 Pregnancy refers:Pregn… Unable/… Ther… 435 <NA> 1 238
## 2 Increased knowledge; F… Improve… I pl… 436 <NA> 305 476
## 3 Increased knowledge; F… Improve… I pl… 437 <NA> 305 476
## 4 Farm production Produce… [Inc… 438 <NA> 1 129
## 5 Farm production Produce… [Inc… 439 <NA> 222 513
## 6 Increased knowledge; N… Increas… Ther… 440 <NA> 1 556
## 7 Increased knowledge; F… Able to… Ther… 441 <NA> 1 287
## 8 Receives money from ch… Increas… Yes … 442 <NA> 146 332
## 9 Farm production Income [Pur… 443 <NA> 1 40
## 10 Community groups/learn… Communi… With… 444 <NA> 1 507
## # ℹ 202 more rows
## # ℹ 39 more variables: statement_code <chr>, source_count <int>,
## # link_count <int>, bundle <chr>, statement_id <chr>, text <chr>,
## # source_id <chr>, question_id <chr>, created <chr>, modified <chr>,
## # respondent_id <lgl>, merge_number <dbl>, `Household code` <chr>,
## # `#Name of province` <chr>, `#Name of village` <chr>,
## # `#Age of the main respondent` <chr>, …
ex$links %>%
filter(statement_id=="MNY-1 | 43")
## # A tibble: 2 × 46
## from_label to_label quote link_id hashtags quote_start quote_end
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 Increased knowledge; He… Health … [Hea… 340 <NA> 1 151
## 2 Health behaviour Improve… [Hea… 657 <NA> 1 151
## # ℹ 39 more variables: statement_code <chr>, source_count <int>,
## # link_count <int>, bundle <chr>, statement_id <chr>, text <chr>,
## # source_id <chr>, question_id <chr>, created <chr>, modified <chr>,
## # respondent_id <lgl>, merge_number <dbl>, `Household code` <chr>,
## # `#Name of province` <chr>, `#Name of village` <chr>,
## # `#Age of the main respondent` <chr>,
## # `Household size (residents > 6months)` <chr>, …
However, there are special functions for including (‘focus’) and excluding factors by label; see below.
Formatting the map is done inside the call to make_print_map2(); the structure of the map is not changed.
In this example we make the background colour of the factors depend on their link count (also known as citation count or degree). The example also demonstrates the use of the main function to actually print out a map.
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count", map_label_factors_type="link_count")
map <-
ex$links %>%
pipe_retain_hashtags(hashtags = xc("!doubtful !hypothetical"),keep = F,or="ANY") %>%
pipe_focus("Ability",any = T) %>%
pipe_exclude("food",any = T) %>%
pipe_zoom(1) %>%
pipe_collapse(xc("Borrowing Income")) %>%
pipe_top_links(8) %>%
pipe_top_factors(3) %>%
pipe_combine_opposites() %>%
pipe_remove_brackets(T,T) %>%
pipe_label(map_label_links = "source_id",type = "count_unique")
map %>%
astore()
## $legend_pipe_retain_hashtags
## [1] "Removing links with all of these hashtags: !doubtful; !hypothetical."
##
## $legend_pipe_focus
## [1] "Focussing on Ability 1 steps down and 1 steps up, tracing paths, matching anywhere."
##
## $legend_pipe_exclude
## [1] "Excluding factors: food, matching anywhere."
##
## $legend_pipe_zoom
## [1] "Zooming in to level 1 of the hierarchy. "
##
## $legend_pipe_collapse
## [1] "Collapsing factors: Borrowing , Income."
##
## $legend_pipe_top_links
## [1] "Top 8 links by source count. "
##
## $legend_pipe_top_factors
## [1] "Top 3 factors by source count. "
##
## $legend_pipe_combine_opposites
## [1] "Combining opposites."
##
## $legend_pipe_remove_brackets
## [1] "Removing square and round brackets."
##
## $legend_pipe_label
## [1] ""
map %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count", map_label_factors_type="link_count")
You don’t have to provide an original links file for comparison:
map %>%
make_print_map2(map_color_factors_column = "link_count",map_size_links ="link_count", map_label_factors_type="link_count")
if(F){
map2 <-
map %>%
pipe_magnetic_clusters(embed("facts",api_key=config$open_ai_api_key),factor_embeddings = embed(map %>% get_both_labels(),api_key=config$open_ai_api_key)) %>%
pipe_autocluster(granularity = .177)
map2 %>%
astore()
map2 %>%
make_print_map2(original=map)
}
links <-
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8)
links$from_label <- links$from_label %>% str_replace("Increased knowledge$","Increased knowledge*99")
links$to_label <- links$to_label %>% str_replace("Improved health","Improved health*99")
links$to_label <- links$to_label %>% str_replace("Income","Income*0")
links %>%
make_print_map2(original = links,map_color_factors_column = "link_count",map_size_links ="link_count", map_label_factors_type="link_count")
Calculating surprises is a formatting function because it does not change the structure of the map, only the labels printed on the links and factors.
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
pipe_label(map_label_links = "#Name of village" ,type="surprise_links") %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
pipe_label(map_label_links = "#Name of village" ,type="surprise_links_simple") %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
pipe_label(map_label_links = "#Name of village",type="surprise_sources" ) %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(18) %>%
pipe_label(map_label_links = "#Name of village",type = "surprise_links" ) %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="source_count")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
pipe_label(map_label_links = "#Name of village" ,type="surprise_links") %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count",map_label_factors = "#Name of village" ,map_label_factors_type="surprise_links")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
pipe_label(map_label_links = "#Name of village" ,type="surprise_sources") %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count",map_label_factors = "#Name of village" ,map_label_factors_type="surprise_sources")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count",map_label_factors = "#Name of village" ,map_label_factors_type="surprise_sources_simple")
These functions all (potentially) transform the links table in some way, either by filtering out some links or by changing their labels in order to create a different but related graph.
The focus() function identifies factors by complete label or, with any=TRUE, by part of the label.
By default, focus shows factors one step up and one step down from the focused factor(s).
The function accepts a list of factor labels or parts of labels.
These code examples also show how the functions can be piped into a long chain: the pipe_* functions work on links tables and output modified links tables. The examples include the functions pipe_zoom() and make_print_map2(), which are dealt with later.
These examples show that focus is case sensitive, and that you can set any to match any part of a label. The matched labels are highlighted.
# ex$links %>%
# pipe_zoom() %>%
# pipe_focus(c("AbilitY TO buy fOOD")) %>%
# make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_zoom() %>%
pipe_focus(c("Ability"),any = T) %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_zoom() %>%
pipe_focus(c("Ability to buy food","Diet improved"),any = T) %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
The following examples vary the number of steps traced up and down from the focused factors:
ex$links %>%
pipe_zoom() %>%
pipe_focus(c("Ability"),any = T,up=1,down=2) %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_zoom() %>%
pipe_focus(c("Ability to buy food","Diet improved"),any = T,up=1,down=0) %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
NOTE this doesn’t seem to be working FIXME
ex$links %>%
pipe_zoom() %>%
pipe_focus(c("Ability to buy food","Diet improved"),any = T,up=1,down=0,transforms_tracing_threads = F) %>%
pipe_label(type="source_count") %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_zoom() %>%
pipe_focus(c("Ability to buy food","Diet improved"),any = T,up=1,down=0,transforms_tracing_threads = T) %>%
pipe_label(type="source_count") %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
The exclude function works in a comparable way, to exclude matching factors from the map.
The second map shows the same as the first but with some factors excluded.
ex$links %>%
pipe_zoom() %>%
pipe_top_factors(type="Links",which="Top",top=6) %>%
select(from_label,to_label,link_count,source_count) %>%
distinct %>%
head
## # A tibble: 6 × 4
## from_label to_label link_count source_count
## <chr> <chr> <int> <int>
## 1 Increased knowledge Health behaviour 23 6
## 2 Farm production Income 32 13
## 3 Increased knowledge Increased knowledge 8 6
## 4 Increased knowledge Farm production 24 11
## 5 Increased knowledge Diet improved 30 8
## 6 Diet improved Improved health 8 4
ex$links %>%
pipe_zoom() %>%
pipe_top_factors(type="Links",which="Top",top=6) %>%
pipe_exclude(c("Increased knowledge","Diet improved")) %>%
select(from_label,to_label,link_count,source_count) %>%
distinct %>%
head
## # A tibble: 2 × 4
## from_label to_label link_count source_count
## <chr> <chr> <int> <int>
## 1 Farm production Income 32 13
## 2 Health behaviour Improved health 128 18
# ## just checking it works with [ ]
# links <-
# ex$links %>%
# mutate(from_label=str_replace_all(from_label,"Income","[Income]")) %>%
# mutate(to_label=str_replace_all(to_label,"Income","[Income]"))
#
#
# links %>%
# pipe_zoom() %>%
# pipe_exclude("[Income]",any = T) %>%
# pipe_top_factors(type="Links",which="Top",top=6) %>%
# select(from_label,to_label,link_count,source_count) %>%
# distinct %>%
# head
We take advantage of hierarchical coding to collapse subsidiary factors into global factors. So, for example, the hierarchically coded factors fish; cod and fish; salmon collapse into one factor fish. This means that all the links to the two previous factors are routed into a single factor fish (which may or may not already exist in the dataset).
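The label arithmetic behind this can be sketched directly: hierarchical labels are truncated at the separator, so subsidiary factors share the top-level label. The "; " separator matches the examples above; collapse_top is an illustrative helper, not the package implementation (which is pipe_collapse / pipe_zoom):

```r
library(stringr)

labels <- c("fish; cod", "fish; salmon", "fish", "fruit; apple")

# Keep only the part before the first ";" separator:
collapse_top <- function(x) str_remove(x, ";.*$")

collapse_top(labels)
# "fish" "fish" "fish" "fruit"
```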
The following examples show the most frequent factors before and after zooming:
ex$links %>%
make_factors_from_links() %>%
arrange(desc(source_count)) %>%
select(label,source_count) %>%
head
## # A tibble: 6 × 2
## label source_count
## <chr> <int>
## 1 Improved health 18
## 2 Diet improved; Diversified 17
## 3 Food consumption quantity 17
## 4 Health behaviour; Use pits to dispose of rubbish 17
## 5 Community groups/learning 16
## 6 Farm production 16
ex$links %>%
pipe_zoom() %>%
make_factors_from_links() %>%
arrange(desc(source_count)) %>%
select(label,source_count) %>%
head
## # A tibble: 6 × 2
## label source_count
## <chr> <int>
## 1 Diet improved 19
## 2 Health behaviour 18
## 3 Improved health 18
## 4 Increased knowledge 18
## 5 Food consumption quantity 17
## 6 Community groups/learning 16
ex$links %>%
pipe_zoom() %>%
pipe_top_factors(type="Sources",which="Top",top=3) %>%
make_print_map2(ex$links)
ex$links %>%
pipe_zoom() %>%
pipe_top_factors(type="Sources",which="Top",top=3) %>%
select(from_label,to_label,link_count) %>%
distinct() %>%
head
## # A tibble: 6 × 3
## from_label to_label link_count
## <chr> <chr> <int>
## 1 Increased knowledge Health behaviour 23
## 2 Increased knowledge Increased knowledge 8
## 3 Increased knowledge Diet improved 30
## 4 Diet improved Improved health 8
## 5 Health behaviour Improved health 128
## 6 Diet improved Diet improved 1
ex$links %>%
pipe_zoom() %>%
pipe_top_factors(type="Sources",which="At_least",top=15) %>%
make_factors_from_links() %>%
arrange(desc(source_count)) %>%
head
## # A tibble: 6 × 11
## label found found_source found_target source_count link_count in_degree
## <chr> <lgl> <lgl> <lgl> <int> <int> <int>
## 1 Health beha… FALSE FALSE FALSE 18 151 128
## 2 Improved he… FALSE FALSE FALSE 18 140 0
## 3 Farm produc… FALSE FALSE FALSE 16 99 57
## 4 Increased k… FALSE FALSE FALSE 16 118 110
## 5 Improved/ne… FALSE FALSE FALSE 14 38 17
## 6 Income FALSE FALSE FALSE 13 36 3
## # ℹ 4 more variables: out_degree <int>, outcomeness <dbl>, flipped_from <int>,
## # flipped_to <int>
ex$links %>%
pipe_zoom() %>%
pipe_top_factors(type="Links",which="Top",top=12) %>%
make_factors_from_links() %>%
arrange(desc(link_count)) %>%
head
## # A tibble: 6 × 11
## label found found_source found_target source_count link_count in_degree
## <chr> <lgl> <lgl> <lgl> <int> <int> <int>
## 1 Health beha… FALSE FALSE FALSE 18 151 128
## 2 Increased k… FALSE FALSE FALSE 17 150 142
## 3 Improved he… FALSE FALSE FALSE 18 140 0
## 4 Farm produc… FALSE FALSE FALSE 16 123 58
## 5 Diet improv… FALSE FALSE FALSE 14 75 9
## 6 Planted new… FALSE FALSE FALSE 14 65 36
## # ℹ 4 more variables: out_degree <int>, outcomeness <dbl>, flipped_from <int>,
## # flipped_to <int>
ex$links %>%
pipe_zoom() %>%
pipe_top_factors(type="Links",which="At_least",top=20) %>%
make_factors_from_links() %>%
arrange(desc(link_count)) %>%
head
## # A tibble: 6 × 11
## label found found_source found_target source_count link_count in_degree
## <chr> <lgl> <lgl> <lgl> <int> <int> <int>
## 1 Increased k… FALSE FALSE FALSE 18 193 175
## 2 Health beha… FALSE FALSE FALSE 18 153 130
## 3 Improved he… FALSE FALSE FALSE 18 147 7
## 4 Farm produc… FALSE FALSE FALSE 16 135 70
## 5 Diet improv… FALSE FALSE FALSE 16 78 9
## 6 Income FALSE FALSE FALSE 14 70 32
## # ℹ 4 more variables: out_degree <int>, outcomeness <dbl>, flipped_from <int>,
## # flipped_to <int>
ex$links %>%
pipe_zoom() %>%
pipe_top_links(type="Sources",which="Top",top=2) %>%
select(from_label,to_label,link_count) %>%
distinct() %>%
head
## # A tibble: 2 × 3
## from_label to_label link_count
## <chr> <chr> <int>
## 1 Health behaviour Improved health 128
## 2 Community groups/learning Community works together 20
ex$links %>%
pipe_zoom() %>%
pipe_top_links(type="Sources",which="At_least",top=4) %>%
select(from_label,to_label,link_count,source_count) %>%
head
## # A tibble: 6 × 4
## from_label to_label link_count source_count
## <chr> <chr> <int> <int>
## 1 Increased knowledge Health behaviour 23 6
## 2 Farm production Produced enough food to eat and/o… 8 6
## 3 Income Ability to buy food 14 11
## 4 Increased knowledge Broken taboo 6 5
## 5 Farm production Income 32 13
## 6 Increased knowledge Improved household relationships 11 7
ex$links %>%
pipe_zoom() %>%
pipe_top_links(type="Links",which="Top",top=20) %>%
select(from_label,to_label,link_count,source_count) %>%
arrange(desc(link_count)) %>%
distinct() %>%
head
## # A tibble: 6 × 4
## from_label to_label link_count source_count
## <chr> <chr> <int> <int>
## 1 Health behaviour Improved health 128 18
## 2 Farm production Income 32 13
## 3 Increased knowledge Diet improved 30 8
## 4 Increased knowledge Planted new crop/vegetable variet… 29 13
## 5 Increased knowledge Farm production 24 11
## 6 Increased knowledge Health behaviour 23 6
ex$links %>%
pipe_zoom() %>%
pipe_top_links(type="Sources",which="Top",top=20) %>%
select(from_label,to_label,link_count,source_count) %>%
arrange(desc(source_count)) %>%
distinct() %>%
head
## # A tibble: 6 × 4
## from_label to_label link_count source_count
## <chr> <chr> <int> <int>
## 1 Health behaviour Improved health 128 18
## 2 Community groups/learning Community works … 20 14
## 3 Farm production Income 32 13
## 4 Increased knowledge Planted new crop… 29 13
## 5 Planted new crop/vegetable varieties Farm production 23 12
## 6 Increased knowledge Improved/new far… 21 12
ex$links %>%
pipe_zoom() %>%
pipe_trace(sess_links = ex$links,from_labels = "Farm production",to_labels="Diet improved",steps = 2,transforms_tracing_threads = F) %>%
pipe_label(type="source_id") %>%
make_print_map2(original = ex$links,map_color_factors_column = "outcomeness")
ex$links %>%
pipe_zoom() %>%
pipe_trace(sess_links = ex$links,from_labels = "Farm",to_labels="Diet",steps = 2,transforms_tracing_threads = F,any=T) %>%
pipe_label(type = "source_id") %>%
make_print_map2(original = ex$links,map_color_factors_column = "outcomeness")
ex$links %>%
pipe_zoom() %>%
pipe_trace(sess_links = ex$links,from_labels = "Farm production",to_labels="Diet improved",steps = 2,transforms_tracing_threads = T) %>%
pipe_label(type="source_id") %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_zoom() %>%
pipe_trace(sess_links = ex$links,from_labels = "Farm production",to_labels="Diet improved",steps = 2,transforms_tracing_threads = F) %>%
pipe_label(type="source_count") %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_zoom() %>%
pipe_trace(sess_links = ex$links,from_labels = "Farm production",to_labels="Diet improved",steps = 2,transforms_tracing_threads = T) %>%
pipe_label(type="source_count") %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_zoom() %>%
pipe_trace(sess_links = ex$links,from_labels = "Farm production",to_labels="Diet improved",steps = 1) %>%
make_print_map2(original = ex$links)
ex$links %>%
pipe_zoom() %>%
pipe_trace(sess_links = ex$links,from_labels = "Farm production",to_labels="Diet improved",steps = 3,transforms_tracing_threads = T) %>%
pipe_label("link_count") %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
This file doesn’t actually have factor names with brackets, but we can create some:
tmp <-
ex$links %>%
mutate(from_label=str_replace_all(from_label,"Wellbeing","Wellbeing [square] (round) [[doublesquare]] ((outcome))")) %>%
mutate(to_label=str_replace_all(to_label,"Wellbeing","Wellbeing [square] (round) [[doublesquare]] ((outcome))"))
tmp %>%
pipe_zoom() %>%
pipe_focus(c("Wellbeing"),any=T) %>%
pipe_top_factors(5) %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
tmp %>%
pipe_zoom() %>%
pipe_focus(c("Wellbeing"),any=T) %>%
pipe_top_factors(5) %>%
pipe_remove_brackets(round = T) %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_collapse(c("Ability")) %>%
pipe_zoom() %>%
pipe_top_factors() %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_collapse(c("Ability to buy food","Diet improved"),separate=F) %>%
pipe_zoom() %>%
pipe_top_factors() %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_collapse(c("Increased","Farm"),separate=T) %>%
pipe_zoom() %>%
pipe_top_factors() %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_focus(xc("~"),any = T) %>%
pipe_top_factors(20) %>%
make_print_map2(ex$links)
#
# ex$links %>%
# pipe_focus(xc("~"),any = T) %>%
# pipe_top_factors(20) %>%
# make_print_map2(ex$links,map_colour_opposites_red = T)
ex$links %>%
pipe_focus(xc("~"),any = T) %>%
pipe_combine_opposites() %>%
pipe_top_factors(20) %>%
make_print_map2(ex$links)
ex2$links %>%
pipe_combine_opposites() %>%
pipe_label(type="list_unique",map_label_links = "link_id") %>%
make_print_map2(ex$links)
ex2$links %>%
make_print_map2(ex$links)
#ex$links <- ex$links %>% mutate(sentiment=(row_number()%%3)-1)
ex$links %>%
pipe_focus(xc("~"),any = T) %>%
pipe_top_factors(20) %>%
make_print_map2(ex$links,map_color_links_column = "sentiment")
#
# ex$links %>%
# pipe_focus(xc("~"),any = T) %>%
# pipe_top_factors(20) %>%
# make_print_map2(ex$links,map_colour_opposites_red = T)
ex$links %>%
pipe_top_factors(8) %>%
pipe_top_links(3,type = "at_least",which = "link_id") %>%
make_print_map2(ex$links,map_color_links_column = "sentiment")
Note that labelling links is a pipe function, not part of the call to make_print_map2(), because labelling links actually transforms the links table.
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
make_print_map2(ex$links,map_color_factors_column = "outcomeness")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="none") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="source_count") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="link_count") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="list_unique",map_label_links = "source_id") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="list_all",map_label_links = "source_id") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="count_unique",map_label_links = "source_id") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="count_all",map_label_links = "source_id") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="list_unique",map_label_links = "hashtags") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
pipe_label(map_label_links = "#Name of village" ,type="link_tally") %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(18) %>%
pipe_label(map_label_links = "#Name of village",type = "link_tally" ) %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="source_count")
Most of the underlying tables can also be accessed directly:
ex$links %>%
make_mentions_tabl() %>%
select(label,direction,link_id)
## # A tibble: 3,228 × 3
## label direction link_id
## <chr> <chr> <dbl>
## 1 Health behaviour consequence 340
## 2 Income consequence 341
## 3 Produced enough food to eat and/or sell consequence 342
## 4 Able to buy farming equipment/materials consequence 343
## 5 Ability to buy food consequence 344
## 6 Produced enough food to eat and/or sell consequence 345
## 7 Produced enough food to eat and/or sell consequence 346
## 8 Broken taboo consequence 347
## 9 Income consequence 348
## 10 Increased ability to borrow consequence 349
## # ℹ 3,218 more rows
ex$links %>%
make_mentions_tabl() %>%
select(label,direction,link_id) %>%
pivot_wider(names_from=direction,values_from = link_id,values_fn = length)
## # A tibble: 145 × 4
## label consequence influence either
## <chr> <int> <int> <int>
## 1 Health behaviour 6 15 21
## 2 Income 47 41 88
## 3 Produced enough food to eat and/or sell 11 9 20
## 4 Able to buy farming equipment/materials 6 NA 6
## 5 Ability to buy food 16 20 36
## 6 Broken taboo 7 8 15
## 7 Increased ability to borrow 1 1 2
## 8 Increased ability to save/increased savings 23 9 32
## 9 Improved household relationships 26 7 33
## 10 Shared household decision-making 12 9 21
## # ℹ 135 more rows
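The `pivot_wider()` call above uses `values_fn = length` to count link_ids per label and direction. The same cross-tabulation idea can be sketched in base R with `table()`; the mentions data below is invented purely for illustration:

```r
# Toy mentions data (invented), cross-tabulated as label x direction counts
mentions <- data.frame(
  label     = c("Income", "Income", "Health behaviour", "Income"),
  direction = c("consequence", "influence", "consequence", "influence")
)
counts <- table(mentions$label, mentions$direction)
counts["Income", "influence"]  # 2
```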
The `prepare_upload()` function prepares different kinds of files for upload:
if(F){
tmp <- prepare_upload("test","assets/import_examples/example-file.xlsx")
}
The convenience function `make_upload_table()` shows you the effect of uploading one file into another, either appending to or overwriting the existing data.
if(F){
make_upload_table(ex,tmp$tmp,"Append")
}
More examples:
if(F){
make_upload_table(ex,tmp$tmp,"Overwrite")
tmp <- prepare_upload("test","assets/import_examples/example-file-just-sources.xlsx")
make_upload_table(ex,tmp$tmp,"Append")
make_upload_table(ex,tmp$tmp,"Overwrite")
tmp <- prepare_upload("test","assets/import_examples/cm2.xlsx")
tmp <- prepare_upload("test","assets/import_examples/hybrid.xlsx")
tmp <- prepare_upload("test","assets/import_examples/sources-long.xlsx")
tmp <- prepare_upload("test","assets/import_examples/text_files/source_and_question_ids.docx")
tmp <- prepare_upload("test","assets/import_examples/text_files/only_source_ids.docx")
tmp <- prepare_upload("test","assets/import_examples/text_files/only_question_ids.docx")
tmp <- prepare_upload("test",
paste0("assets/import_examples/text_files/",xc("source_and_question_ids only_question_ids"),".docx")
)
}
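The "Append" versus "Overwrite" options can be illustrated with a minimal base-R sketch. This is not the real `make_upload_table()` implementation, only the conceptual difference between the two modes, using invented data:

```r
# Illustrative sketch only: how "Append" and "Overwrite" differ conceptually
existing <- data.frame(link_id = 1:3, label = c("a", "b", "c"))
incoming <- data.frame(link_id = 4:5, label = c("d", "e"))
append_result    <- rbind(existing, incoming)  # keeps old rows, adds new ones
overwrite_result <- incoming                   # replaces the old rows entirely
nrow(append_result)     # 5
nrow(overwrite_result)  # 2
```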
if(F){
phrases <- xc("good bad")
phrases <- xc("apple pie")
embed(phrases,api_key = api_key)
embed(phrases)
embed(phrases,store="default-embeddings") #same
embed(phrases,api_key = api_key,s3=s3,store="test-embs")
embed(phrases,s3=s3,store="test-embs")
embed(phrases,api_key = api_key,s3=s3)
embed(phrases,s3=s3)
s3$list_objects("cm3embeddings",Prefix="store")
}
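The `store` argument to `embed()` points at a cache of previously computed embeddings, so that only unseen phrases cost an API call. A hypothetical sketch of that caching idea (not the real implementation, and the vectors here are fake):

```r
# Hypothetical sketch of the caching idea behind embed(): serve phrases from a
# local store where possible, computing vectors only for cache misses.
store <- list(good = c(0.1, 0.2))              # pretend persisted embeddings
embed_sketch <- function(phrases, store) {
  lapply(setNames(phrases, phrases), function(p) {
    if (!is.null(store[[p]])) store[[p]]       # cache hit: reuse stored vector
    else stats::runif(2)                       # cache miss: fake a new vector
  })
}
res <- embed_sketch(c("good", "bad"), store)
identical(res[["good"]], c(0.1, 0.2))  # TRUE: served from the store
```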
if(F){
find_typical_string(c(
"health problems and other things; lotsa details just to test out speed",
"healthy problems and more stuff; lotsa details just to test out speed",
"better health; lotsa details just to test out speed",
"more health; lotsa edetails just to test out speed",
"more health; lotsa deetails just to test out speeds",
"more health; lotsa deteails just to test out speed",
"more health; lotsa edetails just to test out speed",
"~ more health; lotsa deetails just to test out speeds",
"~ more health; lotsa deteails just to test out speed",
"~ more health; lotsa edetails just to test out speed",
"~ more health; lotsa deetails just to test out speeds",
"~ more health; lotsa deteails just to test out speed",
"~ more health; lotsa details just to test out speed",
"~ more health; lotsa details just to test out speed",
"~ improved life pleasure; lotsa details just to test out speed",
"improve life pleasure; lotsa details just to test out speed",
"health problems; lotsa details just to test out speed",
"~ health problems"
),method = "lcs")
find_typical_string(c(
"health problems",
"healthy problems",
"better health",
"more health",
"improved life pleasure",
"~ health problems"
))
ex=get_file(conn=get_conn(),"example-file-sp2")
# links <- ex3$links   # alternative dataset (ex3 is not defined in this script)
links <- ex$links
links %>%
pipe_autocluster(api_key=config$open_ai_api_key,granularity = 66) %>%
pipe_top_factors(12) %>%
make_print_map2(links, map_label_factors_type="link_count")
}
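`find_typical_string()` picks a representative label from a set of near-duplicates using string-distance methods such as "lcs" (longest common subsequence, from the stringdist package). The core idea can be sketched in base R with `adist()` (plain edit distance): choose the element with the smallest total distance to all the others.

```r
# Base-R sketch of "most typical string": the element with the smallest
# total edit distance to every other element in the set.
phrases <- c("health problems", "healthy problems", "better health", "more health")
d <- adist(phrases)                        # pairwise edit-distance matrix
typical <- phrases[which.min(rowSums(d))]  # smallest total distance wins
typical
```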
if(F){
kk <- get_file(conn=get_conn(),"example-file-validation-study")
bsdr <- get_file(conn=get_conn(),"example-file")
api_key = config$open_ai_api_key
links <- kk$links
suggested_labels <- bsdr$links %>% get_both_labels() %>% keep(~!str_detect(.,"~"))
current_labels <- kk$links %>% get_both_labels() %>% keep(~!str_detect(.,"~"))
magnet_embeddings <- embed(phrases = suggested_labels,store = "example-file",api_key = api_key,s3 = s3) %>% rename(text=old)
factor_embeddings <- embed(phrases = current_labels,store = "example-file",api_key = api_key,s3 = s3) %>% rename(text=old)
# magnet_embeddings <- get_embedding_simple(suggested_labels,api_key = api_key)
#
# factor_embeddings <- get_embedding_simple(current_labels,api_key = api_key)
mags <-
pipe_magnetic_clusters(links,magnet_embeddings = magnet_embeddings,factor_embeddings = factor_embeddings,magnetism=.5)
slinks <- mags[,colnames(links)]
bsdr$links <- slinks
write_xlsx(bsdr,"tmp.xlsx")
}
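The magnetic clustering above assigns each existing factor to the "magnet" label whose embedding it most resembles. A hedged sketch of that core step, using toy two-dimensional embeddings rather than real ones (this is not the `pipe_magnetic_clusters()` implementation):

```r
# Sketch of the idea behind magnetic clustering: assign a factor to its
# nearest "magnet" label by cosine similarity of the embeddings.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
magnets <- list(health = c(1, 0), income = c(0, 1))  # toy 2-d embeddings
factor_emb <- c(0.9, 0.1)                            # factor to be assigned
sims <- vapply(magnets, cosine, numeric(1), b = factor_emb)
names(which.max(sims))  # "health"
```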
if(F){
library(reticulate)
library(tidyverse)
library(shiny)
library(lsa)
source("C:/Users/Zoom/Dropbox/Projects/CausalMap3/cm3functions.R")
source("C:/Users/Zoom/Dropbox/Projects/CausalMap3/Rfiles/symlink_global_functions.R")
source("C:/Users/Zoom/Dropbox/Projects/CausalMap3/Rfiles/cm3functions-NLP.R")
config = configr::read.config("config.yml")$default
ex="assets/import_examples/example-file.xlsx" %>%
get_from_excel()
embeds <-
pipe_embed_statements(ex$statements,api_key = config$open_ai_api_key)
search_term <- "Keeping healthy"
cosines <-
cosine_match(
api_key=config$open_ai_api_key,
tex=c(search_term),
embeddings=embeds
) %>%
as_tibble
links <- ex$links
res <-
pipe_suggested_clusters(
links=links,
conn = get_conn(),
api_key = config$open_ai_api_key,
slider_magnetism = 6,
labels = c("Improved Health","Improved farming"),
if_zoom = T
)
links %>% select(from_label,to_label,link_id)
res %>% select(from_label,to_label,link_id)
}
# NB rough development notes: at the time of writing this block was broken;
# the save to the usage db was taken out, and timeouts sometimes occurred even with 3.5.
if(F){
config = configr::read.config("config.yml")$default
ex="assets/import_examples/example-file.xlsx" %>%
get_from_excel()
conn <- get_conn()
filname <- "james-dprp-research-sp"
james <-
get_file(filname,conn)
statements <- james$statements[1:2,]
prompt <- readLines("assets/prompts/_prompt_0_simple.txt") %>% collap("\n") %>% remove_lines_starting_with
tmp <- auto_code(statements,start=prompt,model="4",api_key = config$open_ai_api_key,characters_per_prompt = 2000,conn=conn,convert_to_links = T,file=filname)
ll <-
tbl(conn,"cm3usage") %>%
filter(type!="heartbeat") %>%
filter(file==filname) %>%
# filter(str_detect(response,"guess the papers")) %>%
arrange(desc((time_stamp))) %>%
head(99) %>% collect
View(ll)
#timeout(240)
ask_chat("What day is tomorrow in most science fiction? Give an example",api_key = config$open_ai_api_key,conn=conn,user="asdf",file="asdfadsf",type="typ")
ask_chat("What day is tomorrow in most science fiction?",model="3",api_key = config$open_ai_api_key,conn=conn,user="asdf",file="asdfadsf",type="typ")
ask_chat("Continue this sentence:",model="gpt-4-turbo",api_key = config$open_ai_api_key,conn=conn,user="asdf",file="asdfadsf",type="typ",temperature = .9)
prompt <- readLines(paste0("assets/prompts/_prompt_0_simple.txt")) %>% collap("\n")%>% remove_lines_starting_with
statements <- ex$statements[4:9,]
auto_code(statements,start=prompt,model="3",api_key = config$open_ai_api_key,characters_per_prompt = 2000,conn=conn,convert_to_links = T,file="testing_again")
statements <- ex$statements[4:9,];auto_code(statements,start=prompt,model="3",api_key = config$open_ai_api_key,characters_per_prompt = 2000,conn=conn,convert_to_links = T,file="testing_again")
statements <- ex$statements[4:9,];auto_code(statements,start=prompt,model="gpt-4-turbo",api_key = config$open_ai_api_key,characters_per_prompt = 2000,conn=conn,convert_to_links = T,file="testing_again")
statements <- ex$statements[1:1,];tmp <-
auto_code(statements,start=prompt,model="4",api_key = config$open_ai_api_key,characters_per_prompt = 2000,conn=conn,convert_to_links = T,file="testing_again")
tbl(conn,"cm3links") %>% filter(file=="testing3")
auto_code(statements,start=prompt,model="4",api_key = config$open_ai_api_key,characters_per_prompt = 2000,conn=conn)
code_statements(statements,
prompt=prompt,
remove_duplicates=T, # these can happen because of the way we ask for interlocking links
model="4",
characters_per_prompt=2000,
combine_small = T,
prompts_per_batch=1,
api_key=config$open_ai_api_key,
conn=conn,user="id links",file="asdfadsf"
)
##################
library(stringdist)
library(DBI)
config = configr::read.config("config.yml")$default
api_key <- config$open_ai_api_key
generic_assistant <- create_generic_assistant(api_key)
orig <- get_file("stcmalawi24",conn=get_conn())
#upload_tables_from_list(orig,file="ex-ai-holistic222",conn = get_conn(),upload_append_or = "Overwrite",in_session=F)
ex <- get_file("ai-stcmalawi24",conn=get_conn())
ex <- get_file("example-file",conn=get_conn())
naive1_short <- read_lines("C:/Users/Zoom/IFRC/Everyone Counts report - General/ifrc_ec2024_data_and_code/prompts/naive1_short.txt") %>% collap
naive1 <- read_lines("C:/Users/Zoom/IFRC/Everyone Counts report - General/ifrc_ec2024_data_and_code/prompts/naive1.txt") %>% collap
naive1_holistic <- read_lines("C:/Users/Zoom/IFRC/Everyone Counts report - General/ifrc_ec2024_data_and_code/prompts/naive1_holistic.txt") %>% collap
fix <- read_lines("C:/Users/Zoom/IFRC/Everyone Counts report - General/ifrc_ec2024_data_and_code/prompts/fix.txt") %>% collap
fix_holistic <- read_lines("C:/Users/Zoom/IFRC/Everyone Counts report - General/ifrc_ec2024_data_and_code/prompts/fix_holistic.txt") %>% collap
missing <- read_lines("C:/Users/Zoom/IFRC/Everyone Counts report - General/ifrc_ec2024_data_and_code/prompts/missing.txt") %>% collap
# missing_aggressive <- read_lines("C:/Users/Zoom/IFRC/Everyone Counts report - General/ifrc_ec2024_data_and_code/prompts/missing_aggressive.txt") %>% collap
statements <- ex$statements %>% filter(statement_id %in% ex$statements$statement_id[1:1])
original_statements <-statements
statements <- chunkify(statements[,],combine_small = T,characters_per_prompt = 4000,prompts_per_batch = 1) %>% filter(text!="")
#tx1 <- paste0(naive1,"\n\n",statements$text)
tx1 <- paste0(naive1_holistic,"\n\n",statements$text)
#res2_holistic <- three_steps_parallel(api_key=config$open_ai_api_key,tx1,fix,missing)
res2_holistic <- three_steps_parallel(api_key=config$open_ai_api_key,tx1,fix_holistic)
res2_holistic <- three_steps_parallel(api_key=config$open_ai_api_key,tx1)
res2_holistic
# statements <-
# statements %>% mutate(source_id=str_match(statement_id,"^[^ ]*"))
# statements
slinks2 <-
seq_along(res2_holistic) %>%
map(~{
# browser()
stats <- statements[.,] %>% pull(statement_id) %>% str_split(",") %>% pluck(1) %>% str_trim
these_stats <- original_statements %>% filter(statement_id %in% stats)
one_row(res2_holistic[[.]],these_stats)
}) %>%
bind_rows
slinks2 %>%
pipe_add_sentiment(offline = T)
new=list(links=slinks2,statements=statements)
#new %>% write_xlsx("ex-ai-holistic2.xlsx")
upload_tables_from_list(new,file="ex-ai-holistic222",conn = get_conn(),upload_append_or = "Overwrite")
get_file("ex-ai2",get_conn())
}
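`chunkify()` packs statements into prompts so that each prompt stays under a character budget (`characters_per_prompt`). A hypothetical sketch of that packing idea, not the real implementation:

```r
# Hypothetical sketch of the packing idea behind chunkify(): group texts into
# chunks so that each chunk's total character count stays under a budget.
chunk_by_chars <- function(texts, budget) {
  chunks <- list(); current <- character(0); used <- 0
  for (t in texts) {
    if (used + nchar(t) > budget && length(current) > 0) {
      chunks[[length(chunks) + 1]] <- current   # flush the full chunk
      current <- character(0); used <- 0
    }
    current <- c(current, t); used <- used + nchar(t)
  }
  if (length(current) > 0) chunks[[length(chunks) + 1]] <- current
  chunks
}
chunks <- chunk_by_chars(c(strrep("a", 30), strrep("b", 30), strrep("c", 30)), budget = 70)
length(chunks)  # 2: the first two texts fit together, the third overflows
```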
if(F){
library(httr)
# ask_assistant_one_question(api_key=config$open_ai_api_key,create_generic_assistant(config$open_ai_api_key),question = "what is your name",text = "",model="gpt-4o")
time_stamp <- time_stamp()
a1 <- create_assistant(config$open_ai_api_key,"name of assistant",instructions = "be kind",time_stamp = time_stamp,file_id = NULL)
# NB `docs` is defined a few lines below via list.files("export")
assistants <-
docs[1:4] %>%
map(~{create_assistant_and_file(config$open_ai_api_key, instructions="Prepare to answer questions",description="description",file_path=.,time_stamp = time_stamp)
})
delete_assistant_file(config$open_ai_api_key,assistant$id,fil$id) # NB `assistant` and `fil` here come from an earlier interactive session
# delete_assistant(api_key,assistant$id)
assistants <- list_assistants(api_key)
assistants$data %>% map(~.$id) %>% map(~delete_assistant(api_key,.))
thread <- create_thread(config$open_ai_api_key,file_ids = NULL,content="Summarise the document.")
run <- run_thread(config$open_ai_api_key,assistant_id = assistant$id,thread_id = thread$id)
docs <- list.files("export",full.names = T)[1:50]
texts <- docs %>% map(~readLines(.) %>% collap)
stats <-
tibble(statement_id=docs,text=unlist(texts))
embs <- write_and_return_embeddings(new=stats,api_key = config$open_ai_api_key,write=F)
tmp <-
cosine_match(
current_statements=stats,
api_key=config$open_ai_api_key,
tex=c("organisational problems OR climate change"),
statement_embeddings=embs,
conn=NULL
)
View(tmp)
}
if(F){
ex$statements %>% select(text,statement_id) %>% `[`(1:5,) %>% write_csv("statements5.csv")
statements5 <-
ex$statements %>% slice(1:5) %>% select(statement_id,text)
statements5 %>% unite(tex,statement_id,text,remove = T,sep="\n\n ") %>% select(tex) %>% mutate(tex=paste0("Statement ID: ",tex)) %>% pull(tex) %>% collap %>% writeLines("statements5.txt")
fil <- upload_file_a(config$open_ai_api_key,"assets/examples/msc.txt")
fil <- upload_file_a(config$open_ai_api_key,"statements5.txt")
assistant <-
create_assistant(
api_key=config$open_ai_api_key,
name="Your Assistant Name",
description ="Your AssistantDescription",
instructions="Read the document and prepare to answer questions about the content",
file_id= list(fil$id)
)
thread <- create_thread(config$open_ai_api_key,fil$id,content="For each statement in this text, identify ALL the causal claims within the statement where someone says that one thing influences another. Express the claims in the form 'cause >> effect'. For each claim, identify ONE quote which backs it up. Output your results as csv, with one column called 'statement_id', one column called 'claim', one column 'quote_start' giving the number of the character within the statement where the quote starts, and a column 'quote_end' giving the number of the character within the statement where the quote ends.
Do not provide the actual quote, only the start and end locations.
Provide one row with the same statement_id for each claim. So if there are n claims there will be n rows with the same statement_id.
It is ESSENTIAL you find ALL the causal claims. I have no hands, you have to do it for me. If you fail I will lose my job and my family will starve.
Make sure you correctly escape commas and quote marks in your 'claim' column.
Provide ONLY the csv output with no comments or introduction.")
run <- run_thread(config$open_ai_api_key,assistant_id = assistant$id,thread_id = thread$id)
#get_run(config$open_ai_api_key,thread$id, run$id)
complete <- F
res <- NULL
while(!complete){
Sys.sleep(1)
tmp <- get_run(config$open_ai_api_key,thread$id, run$id)
# browser()
if(tmp$status=="completed") {
complete=T
message("completed")
res <- tmp
# return(tmp)
}
}
current_thread <- get_thread(config$open_ai_api_key,thread$id)
csv <- current_thread$data[[1]]$content[[1]]$text$value %>% str_remove("^```csv\n") %>% str_remove("\n```$")
writeLines(csv,"tmp.csv")
cs <- read_csv("tmp.csv") %>%
mutate(nchar=nchar(claim))
cs <- cs %>% left_join(statements5)%>%
mutate(quote=str_sub(text,quote_start,quote_end))
View(cs)
current_thread$data[[1]]$content[[1]]$text$annotations
fil <- upload_file_a(config$open_ai_api_key,"C:/Users/steve/Dropbox/Projects/CausalMap3/www/guide/full_guide.pdf")
assistant <-
create_assistant(
api_key=config$open_ai_api_key,
name="Your Assistant Name",
description ="Your AssistantDescription",
instructions="Read this Causal Map Guide and prepare to answer questions about the content to help users use the Causal Map app. Always provide concise answers without extra verbiage and use simple language wherever possible.",
file_id= list(fil$id)
)
thread <- create_thread(config$open_ai_api_key,fil$id,content="How do I upload a set of Word documents for causal coding to the Causal Map app, based on the information in the Causal Map Guide?")
thread <- create_thread(config$open_ai_api_key,fil$id,content="How can I show if some links are more frequently mentioned by some groups e.g. women?")
run <- run_thread(config$open_ai_api_key,assistant_id = assistant$id,thread_id = thread$id)
#get_run(config$open_ai_api_key,thread$id, run$id)
complete <- F
res <- NULL
while(!complete){
Sys.sleep(1)
tmp <- get_run(config$open_ai_api_key,thread$id, run$id)
# browser()
if(tmp$status=="completed") {
complete=T
message("completed")
res <- tmp
# return(tmp)
}
}
current_thread <- get_thread(config$open_ai_api_key,thread$id)
current_thread
get_run(config$open_ai_api_key,run_id = run$id,thread$id)
}
if(F){
wait_for_run <- function(run, max_tries = 60){
# Poll the run every 5 seconds until it completes, then return the thread.
for(i in seq_len(max_tries)){
Sys.sleep(5)
tmp <- get_run(config$open_ai_api_key,run$thread_id, run$id)
if(tmp$status=="completed") {
message("completed")
return(get_thread(config$open_ai_api_key,run$thread_id))
}
if(tmp$status %in% c("failed","cancelled","expired")) stop("run ",tmp$status)
}
stop("run did not complete after ",max_tries," polls")
}
}
if(F){
stats <- ex$statements %>%
filter(source_id %in% xc("MNX-1 MNX-2"))
tmp <-
semantic_search_embeddings(current_statements=stats,
cosine_minimum=.25,
api_key=config$open_ai_api_key,
tex=c("health problems","relationship problems"),
file_select = "example-file",
# statement_embeddings=NULL,
s3=s3
)
tmp$res #%>% filter(any_cosines %>% as.logical)
embs <- write_and_return_embeddings(new=stats,api_key = config$open_ai_api_key,write=F)
tmp <-
cosine_match(
current_statements=stats,
api_key=config$open_ai_api_key,
tex=c("health problems","relationship problems"),
statement_embeddings=embs,
conn=NULL
)
}
if(F){
config = configr::read.config("config.yml")$default
conn <- DBI::dbConnect(
drv = RMariaDB::MariaDB(),
load_data_local_infile = F,
timeout=6000,
reconnect=T,#FIXME could be dangerous
# drv = RMySQL::MySQL(max.con=1000, fetch.default.rec=1000),
idleTimeout=900000,#15 minutes
interactive_timeout=900000,#15 minutes
wait_timeout=900000,#15 minutes
dbname = config$sql_cm$dbname,
host = config$sql_cm$host,
port = config$sql_cm$port,
username = config$sql_cm$username,
password = config$sql_cm$password,
mysql=F
)
api_key <- config$open_ai_api_key
library(tictoc)
ex="assets/import_examples/example-file.xlsx" %>%
get_from_excel()
ex$statements$file <- "example-file"
q <- paste0("DELETE FROM cm3embeddings_s WHERE file='","example-file","';")
dbExecute(conn,q)
q <- paste0("DELETE FROM cm3embeddings_s_long WHERE file='","example-file","';")
dbExecute(conn,q)
ex$statements %>%
filter(source_id %in% xc("MNX-1 MNX-2")) %>%
write_and_return_embeddings(api_key = api_key,conn = conn,long=F)
ex$statements %>%
filter(source_id %in% xc("MNX-1 MNX-2")) %>%
write_and_return_embeddings(api_key = api_key,conn = conn,long=T)
ex$statements %>%
filter(source_id %in% xc("MNX-1 MNX-2")) %>%
fetch_embeddings(.,api_key=api_key,conn=conn,file="example_file",reset=F,already=NULL)
already <-
ex$statements %>%
filter(source_id %in% xc("MNX-1 MNX-2"))
statements <-
ex$statements %>%
filter(source_id %in% xc("MNX-1 MNX-2"))
fetch_embeddings(statements,api_key=api_key,conn=conn,file="example_file",reset=F,already=already)
already <-
ex$statements %>%
filter(source_id %in% xc("MNX-1"))
fetch_embeddings(statements,api_key=api_key,conn=conn,file="example_file",reset=F,already=already)
q <- paste0("DELETE FROM cm3embeddings_s_long WHERE file='","tearfund-all-2024","';")
dbExecute(conn,q)
tic()
create_new_statement_embedding_rows_long(statements = statements,api_key = api_key,conn = conn,file = "example-file",reset=F,already=already)
toc()
already <-
ex$statements %>%
filter(source_id %in% xc("MNX-1"))
tic()
create_new_statement_embedding_rows_long(statements = statements,api_key = api_key,conn = conn,file = "example-file",reset=F,already=already)
toc()
}
if(F){
tbl(conn,"cm3files") %>%
arrange(desc(created)) %>%
filter(!str_detect(edit,"steve|hello|gabriele|hannah|fiona|tabitha|gcc43|testing|pubale")) %>% collect %>%
write_xlsx("user_files.xlsx")
tbl(conn,"cm3files") %>%
select(-created)%>%
filter(!str_detect(edit,"steve|hello|gabriele|hannah|fiona|tabitha|gcc43|testing|pubale")) %>% left_join(tbl(conn,"cm3links") %>% select(-modified),by="file") %>%
group_by(file,edit) %>%
summarise(n_links=n(),created=min(created,na.rm=T),modified=max(modified,na.rm=T)) %>%
filter(!str_detect(edit,"steve|hello|gabriele|hannah|fiona|tabitha|gcc43|testing|pubale")) %>% collect %>%
arrange(desc(modified)) %>%
write_xlsx("user_files.xlsx")
}