Over the last three years we at Causal Map Ltd have developed a range of algorithms for causal mapping. The algorithms are published in the accompanying file cm3functions.R, under an open-source license, as functions for the programming language R. These functions are the heart of the analysis engine in our app Causal Map 3. The app itself is free to use for analysing existing causal mapping data files but it is closed source, so these functions are a reproducible window into the algorithms: anyone who wants to analyse and visualise causal mapping data can use them without using the Causal Map app.
The main purpose of this document is to publish, and provide examples for, a set of algorithms for causal mapping analysis: ways to manipulate, simplify and query causal mapping data.
If you already have causal codings, e.g. an Excel sheet with at least two columns for “cause” and “effect”, these functions could be very useful to you.
Note, though, that if you have such data, you can use the Causal Map app itself for free without installing the functions.
The documentation for the functions is still sparse, but if there is interest we can provide more. We are working on training materials for doing causal mapping with methods which do not need the Causal Map app.
If you want to do causal coding, i.e. creating raw causal data (at minimum an ‘edge list’ of arrows between factors), these functions won’t help you; you’d have to use the Causal Map app or some other method.
When we get a chance, we will release the functions as a full R package. For now, you can read and download a single R script which contains the functions and everything you need to use them. We are still in the process of documenting the functions. To make them easier to use, we provide this vignette, which demonstrates many of the functions as applied to example causal mapping data files. (The script below downloads the latest version of the functions script, the main example file, and a special example file which demonstrates the use of combined opposites.)
This vignette is an .Rmd file: you can process it using the package knitr or by pressing Knit in RStudio. Or you can use it as a source of examples to modify for your own work in the R console.
The functions are used within Causal Map 3, a web app which provides an interface for managing causal mapping projects and also enables autocoding with AI. The functions provided here do not yet include functions for using AI.
These functions completely replace those in the old R package CausalMapFunctions, which is now deprecated. The new set of functions is much simpler.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(shiny)
library(DiagrammeR)
library(textreadr)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
htmlx <- message
if(Sys.getenv("USERDOMAIN") != "STEVE-P") {
url <- "https://www.dropbox.com/scl/fi/ulrtbj9fmgu1ltdamsvhy/example-file.xlsx?rlkey=pczexdevcxgz4if2muqwr0dng&dl=1"
download.file(url = url, destfile = "example-file.xlsx",method="curl",extra='-L')
url <- "https://www.dropbox.com/scl/fi/t9fnja78jo2zb650fesqs/opposites-example.xlsx?rlkey=t9cd4afr77te3bs98hnv6rm5v&dl=1"
download.file(url = url, destfile = "opposites-example.xlsx",method="curl",extra='-L')
url <- "https://www.dropbox.com/scl/fi/mtlaf5rmmdrgp3bxbrydz/cm3functions.R?rlkey=7hli0uaookisr9seahpzczynk&dl=1"
download.file(url = url, destfile = "cm3functions.R",method="curl",extra='-L')
}
notify <- function(message, level) {
  message(message)
}
# for the convenience of the developer
if(Sys.getenv("USERDOMAIN") == "STEVE-P") {
source("Rfiles/cm3functions.R")
ex="assets/import_examples/example-file.xlsx" %>%
get_from_excel()
ex$links$sentiment <- 0
ex2="assets/import_examples/opposites-example.xlsx" %>%
get_from_excel()
} else {
source("cm3functions.R")
ex="example-file.xlsx" %>%
get_from_excel()
ex2="opposites-example.xlsx" %>%
get_from_excel()
}
ex2$links$sentiment <- 0
ex$links$sentiment <- 0
# constants -----------------------------------------------------------------
contrary_color <- "#f2ab73"
ordinary_color <- "#058488"
flipchar <- "~"
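The constant flipchar above marks a factor label as the opposite of another factor (used later with pipe_combine_opposites()). A minimal sketch of how such labels can be detected and stripped with stringr; the labels here are invented for illustration:

```r
library(stringr)

flipchar <- "~"  # same flip marker as in the constants above

labels <- c("Income", "~Income", "Food security")

# Which labels are flipped, i.e. opposites of another factor?
str_detect(labels, fixed(flipchar))
# FALSE TRUE FALSE

# Strip the marker to recover the base factor label:
str_remove(labels, fixed(flipchar))
# "Income" "Income" "Food security"
```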
Once the commands above have downloaded the data files, you can comment them out.
A causal map data file is a list of tibbles (data frames): any of links, statements, sources, questions and settings, but at least links or statements must be present.
These data frames correspond to the tabs in the equivalent Excel format. You can import or export Causal Map files as Excel (“round-tripping”).
An uncoded file might contain just a set of statements for coding.
A coded data file might contain just a set of links which have already been coded.
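The Excel round-tripping can be sketched as follows. This is an illustration, not the package's own export helper: it assumes the writexl package, whose write_xlsx() writes each element of a named list of data frames to its own tab, matching the tab structure described above. All ids and text are invented.

```r
# A causal map file is a named list of data frames, one per Excel tab.
tiny_file <- list(
  statements = data.frame(statement_id = "S1",
                          text = "Rain improved the harvest."),
  links = data.frame(link_id = 1L, statement_id = "S1",
                     from_label = "Rain", to_label = "Harvest",
                     quote = "Rain improved the harvest")
)

# Write one tab per table (guarded, in case writexl is not installed):
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(tiny_file, "tiny-example.xlsx")
}
```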
The optional tibbles sources and questions provide additional metadata about the statements and links.
All the main functions here deal only with the links table, which is of the most interest for causal mapping. The pipe_* functions take a links table as input and return a modified links table, so they can be chained indefinitely.
Each row of the statements table is one chunk of text, usually one or a few paragraphs. The table contains at least the field statement_id, which must be unique for each row, and the field text, which contains the text of the statement to be coded.
Coding means adding rows to a table of links; each row is one coding. This table contains at least the fields link_id (a unique integer), statement_id, which must correspond to a row in the statements table, and quote, which contains a substring of the corresponding statement.
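To make the minimum structure concrete, here is a hand-built pair of tables satisfying the requirements above. All ids, labels and text are invented for illustration:

```r
library(tibble)

statements <- tibble(
  statement_id = c("S1", "S2"),                 # unique per row
  text = c("We earned more because the harvest was good.",
           "Better income meant we could buy more food.")
)

links <- tibble(
  link_id = c(1L, 2L),                          # unique integers
  statement_id = c("S1", "S2"),                 # must match a statement row
  from_label = c("Good harvest", "Income"),
  to_label   = c("Income", "Ability to buy food"),
  quote = c("earned more because the harvest was good",
            "Better income meant we could buy more food")
)

# Every link points back at a real statement:
all(links$statement_id %in% statements$statement_id)
# TRUE
```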
We will inspect the links, sources and statements tables.
ex$links %>%
arrange(desc(source_count)) %>%
select(from_label,to_label,source_count,source_id,link_id,statement_id) %>%
head
## # A tibble: 6 × 6
## from_label to_label source_count source_id link_id statement_id
## <chr> <chr> <int> <chr> <dbl> <chr>
## 1 Health behaviour; Use pi… Improve… 16 MNX-2 939 MNX-2 | 244
## 2 Health behaviour; Use pi… Improve… 16 TWX-1 961 TWX-1 | 169
## 3 Health behaviour; Use pi… Improve… 16 TWX-2 964 TWX-2 | 206
## 4 Health behaviour; Use pi… Improve… 16 MSX-2 977 MSX-2 | 323
## 5 Health behaviour; Use pi… Improve… 16 MSY-1 981 MSY-1 | 363
## 6 Health behaviour; Use pi… Improve… 16 TWX-3 987 TWX-3 | 401
ex$links %>%
arrange(desc(source_count)) %>%
distinct(from_label,to_label,.keep_all = T) %>%
select(from_label,to_label,source_count,source_id,link_id,statement_id) %>%
head
## # A tibble: 6 × 6
## from_label to_label source_count source_id link_id statement_id
## <chr> <chr> <int> <chr> <dbl> <chr>
## 1 Health behaviour; Use pi… Improve… 16 MNX-2 939 MNX-2 | 244
## 2 Community groups/learning Communi… 14 MNY-1 354 MNY-1 | 70
## 3 Health behaviour; Use be… Improve… 14 MNX-2 938 MNX-2 | 244
## 4 Health behaviour; Use mo… Improve… 14 MNX-2 940 MNX-2 | 244
## 5 Farm production Income 13 MNY-1 348 MNY-1 | 62
## 6 Increased knowledge; Far… Planted… 13 TWX-1 358 TWX-1 | 88
ex$sources %>%
select(1:6) %>%
head
## # A tibble: 6 × 6
## source_id `Household code` `#Name of province` `#Name of village`
## <chr> <chr> <chr> <chr>
## 1 MNX-1 MNX1 A 1
## 2 MNX-2 MNX2 A 1
## 3 MNX-3 MNX3 A 1
## 4 MNX-4 MNX4 A 1
## 5 MNX-5 MNX5 A 1
## 6 MNX-6 MNX6 A 1
## # ℹ 2 more variables: `#Age of the main respondent` <chr>,
## # `Household size (residents > 6months)` <chr>
ex$statements %>%
select(1:6) %>%
head
## # A tibble: 6 × 6
## text source_id question_id created modified statement_id
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 This is a fully coded exa… MSY-3 $A8 2023-1… 2023-11… MSY-3 | 1
## 2 There has been a lot of c… MSY-3 $B1 2023-1… 2023-11… MSY-3 | 2
## 3 Improved MSY-3 $*B2 2023-1… 2023-11… MSY-3 | 3
## 4 There has been improvemen… MSY-3 $B3 2023-1… 2023-11… MSY-3 | 4
## 5 Yes there have been some … MSY-3 $C1 2023-1… 2023-11… MSY-3 | 5
## 6 Increased MSY-3 $*C2 2023-1… 2023-11… MSY-3 | 6
The optional sources table contains information about the sources in key-value format.
In Causal Map 3, there is no independent table for factors; factors are implied by the link labels. The function make_factors_from_links() shows which factors are implied by the links in the file.
ex$links %>%
make_factors_from_links() %>%
arrange(desc(source_count)) %>%
head
## # A tibble: 6 × 11
## label found found_source found_target source_count link_count in_degree
## <chr> <lgl> <lgl> <lgl> <int> <int> <int>
## 1 Improved he… FALSE FALSE FALSE 18 150 7
## 2 Diet improv… FALSE FALSE FALSE 17 51 7
## 3 Food consum… FALSE FALSE FALSE 17 37 6
## 4 Health beha… FALSE FALSE FALSE 17 19 16
## 5 Community g… FALSE FALSE FALSE 16 37 37
## 6 Farm produc… FALSE FALSE FALSE 16 158 81
## # ℹ 4 more variables: out_degree <int>, outcomeness <dbl>, flipped_from <int>,
## # flipped_to <int>
These functions use tidyverse conventions and workflows throughout.
There are no special functions for filtering the links because the tidyverse workflow is so simple:
ex$links %>%
filter(str_detect(source_id,"MNX"))
## # A tibble: 212 × 46
## from_label to_label quote link_id hashtags quote_start quote_end
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 Pregnancy refers:Pregn… Unable/… Ther… 435 <NA> 1 238
## 2 Increased knowledge; F… Improve… I pl… 436 <NA> 305 476
## 3 Increased knowledge; F… Improve… I pl… 437 <NA> 305 476
## 4 Farm production Produce… [Inc… 438 <NA> 1 129
## 5 Farm production Produce… [Inc… 439 <NA> 222 513
## 6 Increased knowledge; N… Increas… Ther… 440 <NA> 1 556
## 7 Increased knowledge; F… Able to… Ther… 441 <NA> 1 287
## 8 Receives money from ch… Increas… Yes … 442 <NA> 146 332
## 9 Farm production Income [Pur… 443 <NA> 1 40
## 10 Community groups/learn… Communi… With… 444 <NA> 1 507
## # ℹ 202 more rows
## # ℹ 39 more variables: statement_code <chr>, source_count <int>,
## # link_count <int>, bundle <chr>, statement_id <chr>, text <chr>,
## # source_id <chr>, question_id <chr>, created <chr>, modified <chr>,
## # respondent_id <lgl>, merge_number <dbl>, `Household code` <chr>,
## # `#Name of province` <chr>, `#Name of village` <chr>,
## # `#Age of the main respondent` <chr>, …
ex$links %>%
filter(statement_id=="MNY-1 | 43")
## # A tibble: 2 × 46
## from_label to_label quote link_id hashtags quote_start quote_end
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 Increased knowledge; He… Health … [Hea… 340 <NA> 1 151
## 2 Health behaviour Improve… [Hea… 657 <NA> 1 151
## # ℹ 39 more variables: statement_code <chr>, source_count <int>,
## # link_count <int>, bundle <chr>, statement_id <chr>, text <chr>,
## # source_id <chr>, question_id <chr>, created <chr>, modified <chr>,
## # respondent_id <lgl>, merge_number <dbl>, `Household code` <chr>,
## # `#Name of province` <chr>, `#Name of village` <chr>,
## # `#Age of the main respondent` <chr>,
## # `Household size (residents > 6months)` <chr>, …
However, there are special functions for including (‘focus’) and excluding factors by label; see below.
Formatting the map is done inside the call to make_print_map2(); the structure of the map is not changed.
In this example we make the background colour of the factors depend on their link count (also known as citation count or degree). The example also demonstrates the use of the main function to actually print out a map.
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count", map_label_factors_type="link_count")
map <-
ex$links %>%
pipe_retain_hashtags(hashtags = xc("!doubtful !hypothetical"),keep = F,or="ANY") %>%
pipe_focus("Ability",any = T) %>%
pipe_exclude("food",any = T) %>%
pipe_zoom(1) %>%
pipe_collapse(xc("Borrowing Income")) %>%
pipe_top_links(8) %>%
pipe_top_factors(3) %>%
pipe_combine_opposites() %>%
pipe_remove_brackets(T,T) %>%
pipe_label(map_label_links = "source_id",type = "count_unique")
map %>%
astore()
## $legend_pipe_retain_hashtags
## [1] "Removing links with all of these hashtags: !doubtful; !hypothetical."
##
## $legend_pipe_focus
## [1] "Focussing on Ability 1 steps down and 1 steps up, tracing paths, matching anywhere."
##
## $legend_pipe_exclude
## [1] "Excluding factors: food, matching anywhere."
##
## $legend_pipe_zoom
## [1] "Zooming in to level 1 of the hierarchy. "
##
## $legend_pipe_collapse
## [1] "Collapsing factors: Borrowing , Income."
##
## $legend_pipe_top_links
## [1] "Top 8 links by source count. "
##
## $legend_pipe_top_factors
## [1] "Top 3 factors by source count. "
##
## $legend_pipe_combine_opposites
## [1] "Combining opposites."
##
## $legend_pipe_remove_brackets
## [1] "Removing square and round brackets."
##
## $legend_pipe_label
## [1] ""
map %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count", map_label_factors_type="link_count")
You don’t have to provide an original links file for comparison:
map %>%
make_print_map2(map_color_factors_column = "link_count",map_size_links ="link_count", map_label_factors_type="link_count")
if(F){
map2 <-
map %>%
pipe_magnetic_clusters(embed("facts",api_key=config$open_ai_api_key),factor_embeddings = embed(map %>% get_both_labels(),api_key=config$open_ai_api_key)) %>%
pipe_autocluster(granularity = .177)
map2 %>%
astore()
map2 %>%
make_print_map2(original=map)
}
links <-
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8)
links$from_label <- links$from_label %>% str_replace("Increased knowledge$","Increased knowledge*99")
links$to_label <- links$to_label %>% str_replace("Improved health","Improved health*99")
links$to_label <- links$to_label %>% str_replace("Income","Income*0")
links %>%
make_print_map2(original = links,map_color_factors_column = "link_count",map_size_links ="link_count", map_label_factors_type="link_count")
Calculating surprises is a formatting function because it does not change the structure of the map, only the labels printed on the links and factors.
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
pipe_label(map_label_links = "#Name of village" ,type="surprise_links") %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
pipe_label(map_label_links = "#Name of village" ,type="surprise_links_simple") %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
pipe_label(map_label_links = "#Name of village",type="surprise_sources" ) %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(18) %>%
pipe_label(map_label_links = "#Name of village",type = "surprise_links" ) %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="source_count")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
pipe_label(map_label_links = "#Name of village" ,type="surprise_links") %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count",map_label_factors = "#Name of village" ,map_label_factors_type="surprise_links")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
pipe_label(map_label_links = "#Name of village" ,type="surprise_sources") %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count",map_label_factors = "#Name of village" ,map_label_factors_type="surprise_sources")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count",map_label_factors = "#Name of village" ,map_label_factors_type="surprise_sources_simple")
These functions all (potentially) transform the links table in some way, either by filtering out some links or by changing their labels in order to create a different but related graph.
The focus() function identifies factors by complete label or, with any=TRUE, by part of the label.
By default, focus shows factors one step up and one step down from the focused factor(s).
The function accepts a list of factor labels or parts of labels.
These code examples also show how the functions can be piped into a long chain: the pipe_* functions work on links tables and output modified links tables. The examples include the functions pipe_zoom() and make_print_map2(), which are dealt with later.
These examples show that focus is case sensitive, and that you can set any to match any part of a label. The matched labels are highlighted.
# ex$links %>%
# pipe_zoom() %>%
# pipe_focus(c("AbilitY TO buy fOOD")) %>%
# make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_zoom() %>%
pipe_focus(c("Ability"),any = T) %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_zoom() %>%
pipe_focus(c("Ability to buy food","Diet improved"),any = T) %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
The following examples vary the number of steps traced up and down from the focused factors:
ex$links %>%
pipe_zoom() %>%
pipe_focus(c("Ability"),any = T,up=1,down=2) %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_zoom() %>%
pipe_focus(c("Ability to buy food","Diet improved"),any = T,up=1,down=0) %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
NOTE this doesn’t seem to be working FIXME
ex$links %>%
pipe_zoom() %>%
pipe_focus(c("Ability to buy food","Diet improved"),any = T,up=1,down=0,transforms_tracing_threads = F) %>%
pipe_label(type="source_count") %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_zoom() %>%
pipe_focus(c("Ability to buy food","Diet improved"),any = T,up=1,down=0,transforms_tracing_threads = T) %>%
pipe_label(type="source_count") %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
The exclude function works in a comparable way, to exclude matching factors from the map.
The second map shows the same as the first but with some factors excluded.
ex$links %>%
pipe_zoom() %>%
pipe_top_factors(type="Links",which="Top",top=6) %>%
select(from_label,to_label,link_count,source_count) %>%
distinct %>%
head
## # A tibble: 6 × 4
## from_label to_label link_count source_count
## <chr> <chr> <int> <int>
## 1 Increased knowledge Health behaviour 23 6
## 2 Farm production Income 32 13
## 3 Increased knowledge Increased knowledge 8 6
## 4 Increased knowledge Farm production 24 11
## 5 Increased knowledge Diet improved 30 8
## 6 Diet improved Improved health 8 4
ex$links %>%
pipe_zoom() %>%
pipe_top_factors(type="Links",which="Top",top=6) %>%
pipe_exclude(c("Increased knowledge","Diet improved")) %>%
select(from_label,to_label,link_count,source_count) %>%
distinct %>%
head
## # A tibble: 2 × 4
## from_label to_label link_count source_count
## <chr> <chr> <int> <int>
## 1 Farm production Income 32 13
## 2 Health behaviour Improved health 128 18
# ## just checking it works with [ ]
# links <-
# ex$links %>%
# mutate(from_label=str_replace_all(from_label,"Income","[Income]")) %>%
# mutate(to_label=str_replace_all(to_label,"Income","[Income]"))
#
#
# links %>%
# pipe_zoom() %>%
# pipe_exclude("[Income]",any = T) %>%
# pipe_top_factors(type="Links",which="Top",top=6) %>%
# select(from_label,to_label,link_count,source_count) %>%
# distinct %>%
# head
We take advantage of hierarchical coding to collapse subsidiary factors into global factors. So, for example, the hierarchically coded factors fish; cod and fish; salmon collapse into one factor fish. This means that all the links to the two previous factors are routed into a single factor fish (which may or may not already exist in the dataset).
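The label arithmetic behind this can be sketched directly: hierarchical labels are truncated at the separator, so subsidiary factors share the top-level label. The "; " separator matches the examples above; collapse_top is an illustrative helper, not the package implementation (which is pipe_collapse / pipe_zoom):

```r
library(stringr)

labels <- c("fish; cod", "fish; salmon", "fish", "fruit; apple")

# Keep only the part before the first ";" separator:
collapse_top <- function(x) str_remove(x, ";.*$")

collapse_top(labels)
# "fish" "fish" "fish" "fruit"
```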
The following examples show the most frequent factors before and after zooming:
ex$links %>%
make_factors_from_links() %>%
arrange(desc(source_count)) %>%
select(label,source_count) %>%
head
## # A tibble: 6 × 2
## label source_count
## <chr> <int>
## 1 Improved health 18
## 2 Diet improved; Diversified 17
## 3 Food consumption quantity 17
## 4 Health behaviour; Use pits to dispose of rubbish 17
## 5 Community groups/learning 16
## 6 Farm production 16
ex$links %>%
pipe_zoom() %>%
make_factors_from_links() %>%
arrange(desc(source_count)) %>%
select(label,source_count) %>%
head
## # A tibble: 6 × 2
## label source_count
## <chr> <int>
## 1 Diet improved 19
## 2 Health behaviour 18
## 3 Improved health 18
## 4 Increased knowledge 18
## 5 Food consumption quantity 17
## 6 Community groups/learning 16
ex$links %>%
pipe_zoom() %>%
pipe_top_factors(type="Sources",which="Top",top=3) %>%
make_print_map2(ex$links)
ex$links %>%
pipe_zoom() %>%
pipe_top_factors(type="Sources",which="Top",top=3) %>%
select(from_label,to_label,link_count) %>%
distinct() %>%
head
## # A tibble: 6 × 3
## from_label to_label link_count
## <chr> <chr> <int>
## 1 Increased knowledge Health behaviour 23
## 2 Increased knowledge Increased knowledge 8
## 3 Increased knowledge Diet improved 30
## 4 Diet improved Improved health 8
## 5 Health behaviour Improved health 128
## 6 Diet improved Diet improved 1
ex$links %>%
pipe_zoom() %>%
pipe_top_factors(type="Sources",which="At_least",top=15) %>%
make_factors_from_links() %>%
arrange(desc(source_count)) %>%
head
## # A tibble: 6 × 11
## label found found_source found_target source_count link_count in_degree
## <chr> <lgl> <lgl> <lgl> <int> <int> <int>
## 1 Health beha… FALSE FALSE FALSE 18 151 128
## 2 Improved he… FALSE FALSE FALSE 18 140 0
## 3 Farm produc… FALSE FALSE FALSE 16 99 57
## 4 Increased k… FALSE FALSE FALSE 16 118 110
## 5 Improved/ne… FALSE FALSE FALSE 14 38 17
## 6 Income FALSE FALSE FALSE 13 36 3
## # ℹ 4 more variables: out_degree <int>, outcomeness <dbl>, flipped_from <int>,
## # flipped_to <int>
ex$links %>%
pipe_zoom() %>%
pipe_top_factors(type="Links",which="Top",top=12) %>%
make_factors_from_links() %>%
arrange(desc(link_count)) %>%
head
## # A tibble: 6 × 11
## label found found_source found_target source_count link_count in_degree
## <chr> <lgl> <lgl> <lgl> <int> <int> <int>
## 1 Health beha… FALSE FALSE FALSE 18 151 128
## 2 Increased k… FALSE FALSE FALSE 17 150 142
## 3 Improved he… FALSE FALSE FALSE 18 140 0
## 4 Farm produc… FALSE FALSE FALSE 16 123 58
## 5 Diet improv… FALSE FALSE FALSE 14 75 9
## 6 Planted new… FALSE FALSE FALSE 14 65 36
## # ℹ 4 more variables: out_degree <int>, outcomeness <dbl>, flipped_from <int>,
## # flipped_to <int>
ex$links %>%
pipe_zoom() %>%
pipe_top_factors(type="Links",which="At_least",top=20) %>%
make_factors_from_links() %>%
arrange(desc(link_count)) %>%
head
## # A tibble: 6 × 11
## label found found_source found_target source_count link_count in_degree
## <chr> <lgl> <lgl> <lgl> <int> <int> <int>
## 1 Increased k… FALSE FALSE FALSE 18 193 175
## 2 Health beha… FALSE FALSE FALSE 18 153 130
## 3 Improved he… FALSE FALSE FALSE 18 147 7
## 4 Farm produc… FALSE FALSE FALSE 16 135 70
## 5 Diet improv… FALSE FALSE FALSE 16 78 9
## 6 Income FALSE FALSE FALSE 14 70 32
## # ℹ 4 more variables: out_degree <int>, outcomeness <dbl>, flipped_from <int>,
## # flipped_to <int>
ex$links %>%
pipe_zoom() %>%
pipe_top_links(type="Sources",which="Top",top=2) %>%
select(from_label,to_label,link_count) %>%
distinct() %>%
head
## # A tibble: 2 × 3
## from_label to_label link_count
## <chr> <chr> <int>
## 1 Health behaviour Improved health 128
## 2 Community groups/learning Community works together 20
ex$links %>%
pipe_zoom() %>%
pipe_top_links(type="Sources",which="At_least",top=4) %>%
select(from_label,to_label,link_count,source_count) %>%
head
## # A tibble: 6 × 4
## from_label to_label link_count source_count
## <chr> <chr> <int> <int>
## 1 Increased knowledge Health behaviour 23 6
## 2 Farm production Produced enough food to eat and/o… 8 6
## 3 Income Ability to buy food 14 11
## 4 Increased knowledge Broken taboo 6 5
## 5 Farm production Income 32 13
## 6 Increased knowledge Improved household relationships 11 7
ex$links %>%
pipe_zoom() %>%
pipe_top_links(type="Links",which="Top",top=20) %>%
select(from_label,to_label,link_count,source_count) %>%
arrange(desc(link_count)) %>%
distinct() %>%
head
## # A tibble: 6 × 4
## from_label to_label link_count source_count
## <chr> <chr> <int> <int>
## 1 Health behaviour Improved health 128 18
## 2 Farm production Income 32 13
## 3 Increased knowledge Diet improved 30 8
## 4 Increased knowledge Planted new crop/vegetable variet… 29 13
## 5 Increased knowledge Farm production 24 11
## 6 Increased knowledge Health behaviour 23 6
ex$links %>%
pipe_zoom() %>%
pipe_top_links(type="Sources",which="Top",top=20) %>%
select(from_label,to_label,link_count,source_count) %>%
arrange(desc(source_count)) %>%
distinct() %>%
head
## # A tibble: 6 × 4
## from_label to_label link_count source_count
## <chr> <chr> <int> <int>
## 1 Health behaviour Improved health 128 18
## 2 Community groups/learning Community works … 20 14
## 3 Farm production Income 32 13
## 4 Increased knowledge Planted new crop… 29 13
## 5 Planted new crop/vegetable varieties Farm production 23 12
## 6 Increased knowledge Improved/new far… 21 12
ex$links %>%
pipe_zoom() %>%
pipe_trace(sess_links = ex$links,from_labels = "Farm production",to_labels="Diet improved",steps = 2,transforms_tracing_threads = F) %>%
pipe_label(type="source_id") %>%
make_print_map2(original = ex$links,map_color_factors_column = "outcomeness")
ex$links %>%
pipe_zoom() %>%
pipe_trace(sess_links = ex$links,from_labels = "Farm",to_labels="Diet",steps = 2,transforms_tracing_threads = F,any=T) %>%
pipe_label(type = "source_id") %>%
make_print_map2(original = ex$links,map_color_factors_column = "outcomeness")
ex$links %>%
pipe_zoom() %>%
pipe_trace(sess_links = ex$links,from_labels = "Farm production",to_labels="Diet improved",steps = 2,transforms_tracing_threads = T) %>%
pipe_label(type="source_id") %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_zoom() %>%
pipe_trace(sess_links = ex$links,from_labels = "Farm production",to_labels="Diet improved",steps = 2,transforms_tracing_threads = F) %>%
pipe_label(type="source_count") %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_zoom() %>%
pipe_trace(sess_links = ex$links,from_labels = "Farm production",to_labels="Diet improved",steps = 2,transforms_tracing_threads = T) %>%
pipe_label(type="source_count") %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_zoom() %>%
pipe_trace(sess_links = ex$links,from_labels = "Farm production",to_labels="Diet improved",steps = 1) %>%
make_print_map2(original = ex$links)
ex$links %>%
pipe_zoom() %>%
pipe_trace(sess_links = ex$links,from_labels = "Farm production",to_labels="Diet improved",steps = 3,transforms_tracing_threads = T) %>%
pipe_label("link_count") %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
This file doesn’t actually have factor names with brackets, but we can create some:
tmp <-
ex$links %>%
mutate(from_label=str_replace_all(from_label,"Wellbeing","Wellbeing [square] (round) [[doublesquare]] ((outcome))")) %>%
mutate(to_label=str_replace_all(to_label,"Wellbeing","Wellbeing [square] (round) [[doublesquare]] ((outcome))"))
tmp %>%
pipe_zoom() %>%
pipe_focus(c("Wellbeing"),any=T) %>%
pipe_top_factors(5) %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
tmp %>%
pipe_zoom() %>%
pipe_focus(c("Wellbeing"),any=T) %>%
pipe_top_factors(5) %>%
pipe_remove_brackets(round = T) %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_collapse(c("Ability")) %>%
pipe_zoom() %>%
pipe_top_factors() %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_collapse(c("Ability to buy food","Diet improved"),separate=F) %>%
pipe_zoom() %>%
pipe_top_factors() %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_collapse(c("Increased","Farm"),separate=T) %>%
pipe_zoom() %>%
pipe_top_factors() %>%
make_print_map2(original = ex$links,map_label_factors = "source_count")
ex$links %>%
pipe_focus(xc("~"),any = T) %>%
pipe_top_factors(20) %>%
make_print_map2(ex$links)
#
# ex$links %>%
# pipe_focus(xc("~"),any = T) %>%
# pipe_top_factors(20) %>%
# make_print_map2(ex$links,map_colour_opposites_red = T)
ex$links %>%
pipe_focus(xc("~"),any = T) %>%
pipe_combine_opposites() %>%
pipe_top_factors(20) %>%
make_print_map2(ex$links)
ex2$links %>%
pipe_combine_opposites() %>%
pipe_label(type="list_unique",map_label_links = "link_id") %>%
make_print_map2(ex$links)
ex2$links %>%
make_print_map2(ex$links)
#ex$links <- ex$links %>% mutate(sentiment=(row_number()%%3)-1)
ex$links %>%
pipe_focus(xc("~"),any = T) %>%
pipe_top_factors(20) %>%
make_print_map2(ex$links,map_color_links_column = "sentiment")
#
# ex$links %>%
# pipe_focus(xc("~"),any = T) %>%
# pipe_top_factors(20) %>%
# make_print_map2(ex$links,map_colour_opposites_red = T)
ex$links %>%
pipe_top_factors(8) %>%
pipe_top_links(3,type = "at_least",which = "link_id") %>%
make_print_map2(ex$links,map_color_links_column = "sentiment")
Note that labelling links is a pipe function, not part of the call to make_print_map2(), because labelling links actually transforms the links table.
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
make_print_map2(ex$links,map_color_factors_column = "outcomeness")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="none") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="source_count") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="link_count") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="list_unique",map_label_links = "source_id") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="list_all",map_label_links = "source_id") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="count_unique",map_label_links = "source_id") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="count_all",map_label_links = "source_id") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_factors(8) %>%
pipe_label(type="list_unique",map_label_links = "hashtags") %>%
make_print_map2(ex$links,map_color_factors_column = "source_count",map_size_factors = "source_count",map_label_factors = "source_count",map_size_links = "source_count",map_wrap_factor_labels = 9)
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(8) %>%
pipe_label(map_label_links = "#Name of village" ,type="link_tally") %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="link_count")
ex$links %>%
pipe_zoom(1) %>%
pipe_top_links(18) %>%
pipe_label(map_label_links = "#Name of village",type = "link_tally" ) %>%
make_print_map2(original = ex$links,map_color_factors_column = "link_count",map_size_links ="source_count")
Most of the underlying tables can also be accessed directly:
ex$links %>%
make_mentions_tabl() %>%
select(label,direction,link_id)
## # A tibble: 3,228 × 3
## label direction link_id
## <chr> <chr> <dbl>
## 1 Health behaviour consequence 340
## 2 Income consequence 341
## 3 Produced enough food to eat and/or sell consequence 342
## 4 Able to buy farming equipment/materials consequence 343
## 5 Ability to buy food consequence 344
## 6 Produced enough food to eat and/or sell consequence 345
## 7 Produced enough food to eat and/or sell consequence 346
## 8 Broken taboo consequence 347
## 9 Income consequence 348
## 10 Increased ability to borrow consequence 349
## # ℹ 3,218 more rows
ex$links %>%
make_mentions_tabl() %>%
select(label,direction,link_id) %>%
pivot_wider(names_from=direction,values_from = link_id,values_fn = length)
## # A tibble: 145 × 4
## label consequence influence either
## <chr> <int> <int> <int>
## 1 Health behaviour 6 15 21
## 2 Income 47 41 88
## 3 Produced enough food to eat and/or sell 11 9 20
## 4 Able to buy farming equipment/materials 6 NA 6
## 5 Ability to buy food 16 20 36
## 6 Broken taboo 7 8 15
## 7 Increased ability to borrow 1 1 2
## 8 Increased ability to save/increased savings 23 9 32
## 9 Improved household relationships 26 7 33
## 10 Shared household decision-making 12 9 21
## # ℹ 135 more rows
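The `pivot_wider()` call above uses `values_fn = length` to count link_ids per label and direction. The same cross-tabulation idea can be sketched in base R with `table()`; the mentions data below is invented purely for illustration:

```r
# Toy mentions data (invented), cross-tabulated as label x direction counts
mentions <- data.frame(
  label     = c("Income", "Income", "Health behaviour", "Income"),
  direction = c("consequence", "influence", "consequence", "influence")
)
counts <- table(mentions$label, mentions$direction)
counts["Income", "influence"]  # 2
```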
The `prepare_upload()` function prepares different kinds of files for upload:
if(F){
tmp <- prepare_upload("test","assets/import_examples/example-file.xlsx")
}
The convenience function `make_upload_table()` shows you the effect of uploading one file into another, either appending to or overwriting the existing data.
if(F){
make_upload_table(ex,tmp$tmp,"Append")
}
More examples:
if(F){
make_upload_table(ex,tmp$tmp,"Overwrite")
tmp <- prepare_upload("test","assets/import_examples/example-file-just-sources.xlsx")
make_upload_table(ex,tmp$tmp,"Append")
make_upload_table(ex,tmp$tmp,"Overwrite")
tmp <- prepare_upload("test","assets/import_examples/cm2.xlsx")
tmp <- prepare_upload("test","assets/import_examples/hybrid.xlsx")
tmp <- prepare_upload("test","assets/import_examples/sources-long.xlsx")
tmp <- prepare_upload("test","assets/import_examples/text_files/source_and_question_ids.docx")
tmp <- prepare_upload("test","assets/import_examples/text_files/only_source_ids.docx")
tmp <- prepare_upload("test","assets/import_examples/text_files/only_question_ids.docx")
tmp <- prepare_upload("test",
paste0("assets/import_examples/text_files/",xc("source_and_question_ids only_question_ids"),".docx")
)
}
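The "Append" versus "Overwrite" options can be illustrated with a minimal base-R sketch. This is not the real `make_upload_table()` implementation, only the conceptual difference between the two modes, using invented data:

```r
# Illustrative sketch only: how "Append" and "Overwrite" differ conceptually
existing <- data.frame(link_id = 1:3, label = c("a", "b", "c"))
incoming <- data.frame(link_id = 4:5, label = c("d", "e"))
append_result    <- rbind(existing, incoming)  # keeps old rows, adds new ones
overwrite_result <- incoming                   # replaces the old rows entirely
nrow(append_result)     # 5
nrow(overwrite_result)  # 2
```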
if(F){
phrases <- xc("good bad")
phrases <- xc("apple pie")
embed(phrases,api_key = api_key)
embed(phrases)
embed(phrases,store="default-embeddings") #same
embed(phrases,api_key = api_key,s3=s3,store="test-embs")
embed(phrases,s3=s3,store="test-embs")
embed(phrases,api_key = api_key,s3=s3)
embed(phrases,s3=s3)
s3$list_objects("cm3embeddings",Prefix="store")
}
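The `store` argument to `embed()` points at a cache of previously computed embeddings, so that only unseen phrases cost an API call. A hypothetical sketch of that caching idea (not the real implementation, and the vectors here are fake):

```r
# Hypothetical sketch of the caching idea behind embed(): serve phrases from a
# local store where possible, computing vectors only for cache misses.
store <- list(good = c(0.1, 0.2))              # pretend persisted embeddings
embed_sketch <- function(phrases, store) {
  lapply(setNames(phrases, phrases), function(p) {
    if (!is.null(store[[p]])) store[[p]]       # cache hit: reuse stored vector
    else stats::runif(2)                       # cache miss: fake a new vector
  })
}
res <- embed_sketch(c("good", "bad"), store)
identical(res[["good"]], c(0.1, 0.2))  # TRUE: served from the store
```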
if(F){
find_typical_string(c(
"health problems and other things; lotsa details just to test out speed",
"healthy problems and more stuff; lotsa details just to test out speed",
"better health; lotsa details just to test out speed",
"more health; lotsa edetails just to test out speed",
"more health; lotsa deetails just to test out speeds",
"more health; lotsa deteails just to test out speed",
"more health; lotsa edetails just to test out speed",
"~ more health; lotsa deetails just to test out speeds",
"~ more health; lotsa deteails just to test out speed",
"~ more health; lotsa edetails just to test out speed",
"~ more health; lotsa deetails just to test out speeds",
"~ more health; lotsa deteails just to test out speed",
"~ more health; lotsa details just to test out speed",
"~ more health; lotsa details just to test out speed",
"~ improved life pleasure; lotsa details just to test out speed",
"improve life pleasure; lotsa details just to test out speed",
"health problems; lotsa details just to test out speed",
"~ health problems"
),method = "lcs")
find_typical_string(c(
"health problems",
"healthy problems",
"better health",
"more health",
"improved life pleasure",
"~ health problems"
))
ex=get_file(conn=get_conn(),"example-file-sp2")
# links <- ex3$links   # alternative dataset (ex3 is not defined in this script)
links <- ex$links
links %>%
pipe_autocluster(api_key=config$open_ai_api_key,granularity = 66) %>%
pipe_top_factors(12) %>%
make_print_map2(links, map_label_factors_type="link_count")
}
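`find_typical_string()` picks a representative label from a set of near-duplicates using string-distance methods such as "lcs" (longest common subsequence, from the stringdist package). The core idea can be sketched in base R with `adist()` (plain edit distance): choose the element with the smallest total distance to all the others.

```r
# Base-R sketch of "most typical string": the element with the smallest
# total edit distance to every other element in the set.
phrases <- c("health problems", "healthy problems", "better health", "more health")
d <- adist(phrases)                        # pairwise edit-distance matrix
typical <- phrases[which.min(rowSums(d))]  # smallest total distance wins
typical
```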
if(F){
kk <- get_file(conn=get_conn(),"example-file-validation-study")
bsdr <- get_file(conn=get_conn(),"example-file")
api_key = config$open_ai_api_key
links <- kk$links
suggested_labels <- bsdr$links %>% get_both_labels() %>% keep(~!str_detect(.,"~"))
current_labels <- kk$links %>% get_both_labels() %>% keep(~!str_detect(.,"~"))
magnet_embeddings <- embed(phrases = suggested_labels,store = "example-file",api_key = api_key,s3 = s3) %>% rename(text=old)
factor_embeddings <- embed(phrases = current_labels,store = "example-file",api_key = api_key,s3 = s3) %>% rename(text=old)
# magnet_embeddings <- get_embedding_simple(suggested_labels,api_key = api_key)
#
# factor_embeddings <- get_embedding_simple(current_labels,api_key = api_key)
mags <-
pipe_magnetic_clusters(links,magnet_embeddings = magnet_embeddings,factor_embeddings = factor_embeddings,magnetism=.5)
slinks <- mags[,colnames(links)]
bsdr$links <- slinks
write_xlsx(bsdr,"tmp.xlsx")
}
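The magnetic clustering above assigns each existing factor to the "magnet" label whose embedding it most resembles. A hedged sketch of that core step, using toy two-dimensional embeddings rather than real ones (this is not the `pipe_magnetic_clusters()` implementation):

```r
# Sketch of the idea behind magnetic clustering: assign a factor to its
# nearest "magnet" label by cosine similarity of the embeddings.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
magnets <- list(health = c(1, 0), income = c(0, 1))  # toy 2-d embeddings
factor_emb <- c(0.9, 0.1)                            # factor to be assigned
sims <- vapply(magnets, cosine, numeric(1), b = factor_emb)
names(which.max(sims))  # "health"
```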
if(F){
library(reticulate)
library(tidyverse)
library(shiny)
library(lsa)
source("C:/Users/Zoom/Dropbox/Projects/CausalMap3/cm3functions.R")
source("C:/Users/Zoom/Dropbox/Projects/CausalMap3/Rfiles/symlink_global_functions.R")
source("C:/Users/Zoom/Dropbox/Projects/CausalMap3/Rfiles/cm3functions-NLP.R")
config = configr::read.config("config.yml")$default
ex="assets/import_examples/example-file.xlsx" %>%
get_from_excel()
embeds <-
pipe_embed_statements(ex$statements,api_key = config$open_ai_api_key)
search_term <- "Keeping healthy"
cosines <-
cosine_match(
api_key=config$open_ai_api_key,
tex=c(search_term),
embeddings=embeds
) %>%
as_tibble
links <- ex$links
res <-
pipe_suggested_clusters(
links=links,
conn = get_conn(),
api_key = config$open_ai_api_key,
slider_magnetism = 6,
labels = c("Improved Health","Improved farming"),
if_zoom = T
)
links %>% select(from_label,to_label,link_id)
res %>% select(from_label,to_label,link_id)
}
# NB rough development notes: at the time of writing this block was broken;
# the save to the usage db was taken out, and timeouts sometimes occurred even with 3.5.
if(F){
config = configr::read.config("config.yml")$default
ex="assets/import_examples/example-file.xlsx" %>%
get_from_excel()
conn <- get_conn()
filname <- "james-dprp-research-sp"
james <-
get_file(filname,conn)
statements <- james$statements[1:2,]
prompt <- readLines("assets/prompts/_prompt_0_simple.txt") %>% collap("\n") %>% remove_lines_starting_with
tmp <- auto_code(statements,start=prompt,model="4",api_key = config$open_ai_api_key,characters_per_prompt = 2000,conn=conn,convert_to_links = T,file=filname)
ll <-
tbl(conn,"cm3usage") %>%
filter(type!="heartbeat") %>%
filter(file==filname) %>%
# filter(str_detect(response,"guess the papers")) %>%
arrange(desc((time_stamp))) %>%
head(99) %>% collect
View(ll)
#timeout(240)
ask_chat("What day is tomorrow in most science fiction? Give an example",api_key = config$open_ai_api_key,conn=conn,user="asdf",file="asdfadsf",type="typ")
ask_chat("What day is tomorrow in most science fiction?",model="3",api_key = config$open_ai_api_key,conn=conn,user="asdf",file="asdfadsf",type="typ")
ask_chat("Continue this sentence:",model="gpt-4-turbo",api_key = config$open_ai_api_key,conn=conn,user="asdf",file="asdfadsf",type="typ",temperature = .9)
prompt <- readLines(paste0("assets/prompts/_prompt_0_simple.txt")) %>% collap("\n")%>% remove_lines_starting_with
statements <- ex$statements[4:9,]
auto_code(statements,start=prompt,model="3",api_key = config$open_ai_api_key,characters_per_prompt = 2000,conn=conn,convert_to_links = T,file="testing_again")
statements <- ex$statements[4:9,];auto_code(statements,start=prompt,model="3",api_key = config$open_ai_api_key,characters_per_prompt = 2000,conn=conn,convert_to_links = T,file="testing_again")
statements <- ex$statements[4:9,];auto_code(statements,start=prompt,model="gpt-4-turbo",api_key = config$open_ai_api_key,characters_per_prompt = 2000,conn=conn,convert_to_links = T,file="testing_again")
statements <- ex$statements[1:1,];tmp <-
auto_code(statements,start=prompt,model="4",api_key = config$open_ai_api_key,characters_per_prompt = 2000,conn=conn,convert_to_links = T,file="testing_again")
tbl(conn,"cm3links") %>% filter(file=="testing3")
auto_code(statements,start=prompt,model="4",api_key = config$open_ai_api_key,characters_per_prompt = 2000,conn=conn)
code_statements(statements,
prompt=prompt,
remove_duplicates=T, # these can happen because of the way we ask for interlocking links
model="4",
characters_per_prompt=2000,
combine_small = T,
prompts_per_batch=1,
api_key=config$open_ai_api_key,
conn=conn,user="id links",file="asdfadsf"
)
##################
library(stringdist)
library(DBI)
config = configr::read.config("config.yml")$default
api_key <- config$open_ai_api_key
generic_assistant <- create_generic_assistant(api_key)
orig <- get_file("stcmalawi24",conn=get_conn())
#upload_tables_from_list(orig,file="ex-ai-holistic222",conn = get_conn(),upload_append_or = "Overwrite",in_session=F)
ex <- get_file("ai-stcmalawi24",conn=get_conn())
ex <- get_file("example-file",conn=get_conn())
naive1_short <- read_lines("C:/Users/Zoom/IFRC/Everyone Counts report - General/ifrc_ec2024_data_and_code/prompts/naive1_short.txt") %>% collap
naive1 <- read_lines("C:/Users/Zoom/IFRC/Everyone Counts report - General/ifrc_ec2024_data_and_code/prompts/naive1.txt") %>% collap
naive1_holistic <- read_lines("C:/Users/Zoom/IFRC/Everyone Counts report - General/ifrc_ec2024_data_and_code/prompts/naive1_holistic.txt") %>% collap
fix <- read_lines("C:/Users/Zoom/IFRC/Everyone Counts report - General/ifrc_ec2024_data_and_code/prompts/fix.txt") %>% collap
fix_holistic <- read_lines("C:/Users/Zoom/IFRC/Everyone Counts report - General/ifrc_ec2024_data_and_code/prompts/fix_holistic.txt") %>% collap
missing <- read_lines("C:/Users/Zoom/IFRC/Everyone Counts report - General/ifrc_ec2024_data_and_code/prompts/missing.txt") %>% collap
# missing_aggressive <- read_lines("C:/Users/Zoom/IFRC/Everyone Counts report - General/ifrc_ec2024_data_and_code/prompts/missing_aggressive.txt") %>% collap
statements <- ex$statements %>% filter(statement_id %in% ex$statements$statement_id[1:1])
original_statements <-statements
statements <- chunkify(statements[,],combine_small = T,characters_per_prompt = 4000,prompts_per_batch = 1) %>% filter(text!="")
#tx1 <- paste0(naive1,"\n\n",statements$text)
tx1 <- paste0(naive1_holistic,"\n\n",statements$text)
#res2_holistic <- three_steps_parallel(api_key=config$open_ai_api_key,tx1,fix,missing)
res2_holistic <- three_steps_parallel(api_key=config$open_ai_api_key,tx1,fix_holistic)
res2_holistic <- three_steps_parallel(api_key=config$open_ai_api_key,tx1)
res2_holistic
# statements <-
# statements %>% mutate(source_id=str_match(statement_id,"^[^ ]*"))
# statements
slinks2 <-
seq_along(res2_holistic) %>%
map(~{
# browser()
stats <- statements[.,] %>% pull(statement_id) %>% str_split(",") %>% pluck(1) %>% str_trim
these_stats <- original_statements %>% filter(statement_id %in% stats)
one_row(res2_holistic[[.]],these_stats)
}) %>%
bind_rows
slinks2 %>%
pipe_add_sentiment(offline = T)
new=list(links=slinks2,statements=statements)
#new %>% write_xlsx("ex-ai-holistic2.xlsx")
upload_tables_from_list(new,file="ex-ai-holistic222",conn = get_conn(),upload_append_or = "Overwrite")
get_file("ex-ai2",get_conn())
}
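`chunkify()` packs statements into prompts so that each prompt stays under a character budget (`characters_per_prompt`). A hypothetical sketch of that packing idea, not the real implementation:

```r
# Hypothetical sketch of the packing idea behind chunkify(): group texts into
# chunks so that each chunk's total character count stays under a budget.
chunk_by_chars <- function(texts, budget) {
  chunks <- list(); current <- character(0); used <- 0
  for (t in texts) {
    if (used + nchar(t) > budget && length(current) > 0) {
      chunks[[length(chunks) + 1]] <- current   # flush the full chunk
      current <- character(0); used <- 0
    }
    current <- c(current, t); used <- used + nchar(t)
  }
  if (length(current) > 0) chunks[[length(chunks) + 1]] <- current
  chunks
}
chunks <- chunk_by_chars(c(strrep("a", 30), strrep("b", 30), strrep("c", 30)), budget = 70)
length(chunks)  # 2: the first two texts fit together, the third overflows
```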
if(F){
library(httr)
# ask_assistant_one_question(api_key=config$open_ai_api_key,create_generic_assistant(config$open_ai_api_key),question = "what is your name",text = "",model="gpt-4o")
time_stamp <- time_stamp()
a1 <- create_assistant(config$open_ai_api_key,"name of assistant",instructions = "be kind",time_stamp = time_stamp,file_id = NULL)
# NB `docs` is defined a few lines below via list.files("export")
assistants <-
docs[1:4] %>%
map(~{create_assistant_and_file(config$open_ai_api_key, instructions="Prepare to answer questions",description="description",file_path=.,time_stamp = time_stamp)
})
delete_assistant_file(config$open_ai_api_key,assistant$id,fil$id) # NB `assistant` and `fil` here come from an earlier interactive session
# delete_assistant(api_key,assistant$id)
assistants <- list_assistants(api_key)
assistants$data %>% map(~.$id) %>% map(~delete_assistant(api_key,.))
thread <- create_thread(config$open_ai_api_key,file_ids = NULL,content="Summarise the document.")
run <- run_thread(config$open_ai_api_key,assistant_id = assistant$id,thread_id = thread$id)
docs <- list.files("export",full.names = T)[1:50]
texts <- docs %>% map(~readLines(.) %>% collap)
stats <-
tibble(statement_id=docs,text=unlist(texts))
embs <- write_and_return_embeddings(new=stats,api_key = config$open_ai_api_key,write=F)
tmp <-
cosine_match(
current_statements=stats,
api_key=config$open_ai_api_key,
tex=c("organisational problems OR climate change"),
statement_embeddings=embs,
conn=NULL
)
View(tmp)
}
if(F){
ex$statements %>% select(text,statement_id) %>% `[`(1:5,) %>% write_csv("statements5.csv")
statements5 <-
ex$statements %>% slice(1:5) %>% select(statement_id,text)
statements5 %>% unite(tex,statement_id,text,remove = T,sep="\n\n ") %>% select(tex) %>% mutate(tex=paste0("Statement ID: ",tex)) %>% pull(tex) %>% collap %>% writeLines("statements5.txt")
fil <- upload_file_a(config$open_ai_api_key,"assets/examples/msc.txt")
fil <- upload_file_a(config$open_ai_api_key,"statements5.txt")
assistant <-
create_assistant(
api_key=config$open_ai_api_key,
name="Your Assistant Name",
description ="Your AssistantDescription",
instructions="Read the document and prepare to answer questions about the content",
file_id= list(fil$id)
)
thread <- create_thread(config$open_ai_api_key,fil$id,content="For each statement in this text, identify ALL the causal claims within the statement where someone says that one thing influences another. Express the claims in the form 'cause >> effect'. For each claim, identify ONE quote which backs it up. Output your results as csv, with one column called 'statement_id', one column called 'claim', one column 'quote_start' giving the number of the character within the statement where the quote starts, and a column 'quote_end' giving the number of the character within the statement where the quote ends.
Do not provide the actual quote, only the start and end locations.
Provide one row with the same statement_id for each claim. So if there are n claims there will be n rows with the same statement_id.
It is ESSENTIAL you find ALL the causal claims. I have no hands, you have to do it for me. If you fail I will lose my job and my family will starve.
Make sure you correctly escape commas and quote marks in your 'claim' column.
Provide ONLY the csv output with no comments or introduction.")
run <- run_thread(config$open_ai_api_key,assistant_id = assistant$id,thread_id = thread$id)
#get_run(config$open_ai_api_key,thread$id, run$id)
complete <- F
res <- NULL
while(!complete){
Sys.sleep(1)
tmp <- get_run(config$open_ai_api_key,thread$id, run$id)
# browser()
if(tmp$status=="completed") {
complete=T
message("completed")
res <- tmp
# return(tmp)
}
}
current_thread <- get_thread(config$open_ai_api_key,thread$id)
csv <- current_thread$data[[1]]$content[[1]]$text$value %>% str_remove("^```csv\n") %>% str_remove("\n```$")
writeLines(csv,"tmp.csv")
cs <- read_csv("tmp.csv") %>%
mutate(nchar=nchar(claim))
cs <- cs %>% left_join(statements5)%>%
mutate(quote=str_sub(text,quote_start,quote_end))
View(cs)
current_thread$data[[1]]$content[[1]]$text$annotations
fil <- upload_file_a(config$open_ai_api_key,"C:/Users/steve/Dropbox/Projects/CausalMap3/www/guide/full_guide.pdf")
assistant <-
create_assistant(
api_key=config$open_ai_api_key,
name="Your Assistant Name",
description ="Your AssistantDescription",
instructions="Read this Causal Map Guide and prepare to answer questions about the content to help users use the Causal Map app. Always provide concise answers without extra verbiage and use simple language wherever possible.",
file_id= list(fil$id)
)
thread <- create_thread(config$open_ai_api_key,fil$id,content="How do I upload a set of Word documents for causal coding to the Causal Map app, based on the information in the Causal Map Guide?")
thread <- create_thread(config$open_ai_api_key,fil$id,content="How can I show if some links are more frequently mentioned by some groups e.g. women?")
run <- run_thread(config$open_ai_api_key,assistant_id = assistant$id,thread_id = thread$id)
#get_run(config$open_ai_api_key,thread$id, run$id)
complete <- F
res <- NULL
while(!complete){
Sys.sleep(1)
tmp <- get_run(config$open_ai_api_key,thread$id, run$id)
# browser()
if(tmp$status=="completed") {
complete=T
message("completed")
res <- tmp
# return(tmp)
}
}
current_thread <- get_thread(config$open_ai_api_key,thread$id)
current_thread
get_run(config$open_ai_api_key,run_id = run$id,thread$id)
}
if(F){
wait_for_run <- function(run, max_tries = 60){
# Poll the run every 5 seconds until it completes, then return the thread.
for(i in seq_len(max_tries)){
Sys.sleep(5)
tmp <- get_run(config$open_ai_api_key,run$thread_id, run$id)
if(tmp$status=="completed") {
message("completed")
return(get_thread(config$open_ai_api_key,run$thread_id))
}
if(tmp$status %in% c("failed","cancelled","expired")) stop("run ",tmp$status)
}
stop("run did not complete after ",max_tries," polls")
}
}
if(F){
stats <- ex$statements %>%
filter(source_id %in% xc("MNX-1 MNX-2"))
tmp <-
semantic_search_embeddings(current_statements=stats,
cosine_minimum=.25,
api_key=config$open_ai_api_key,
tex=c("health problems","relationship problems"),
file_select = "example-file",
# statement_embeddings=NULL,
s3=s3
)
tmp$res #%>% filter(any_cosines %>% as.logical)
embs <- write_and_return_embeddings(new=stats,api_key = config$open_ai_api_key,write=F)
tmp <-
cosine_match(
current_statements=stats,
api_key=config$open_ai_api_key,
tex=c("health problems","relationship problems"),
statement_embeddings=embs,
conn=NULL
)
}
if(F){
config = configr::read.config("config.yml")$default
conn <- DBI::dbConnect(
drv = RMariaDB::MariaDB(),
load_data_local_infile = F,
timeout=6000,
reconnect=T,#FIXME could be dangerous
# drv = RMySQL::MySQL(max.con=1000, fetch.default.rec=1000),
idleTimeout=900000,#15 minutes
interactive_timeout=900000,#15 minutes
wait_timeout=900000,#15 minutes
dbname = config$sql_cm$dbname,
host = config$sql_cm$host,
port = config$sql_cm$port,
username = config$sql_cm$username,
password = config$sql_cm$password,
mysql=F
)
api_key <- config$open_ai_api_key
library(tictoc)
ex="assets/import_examples/example-file.xlsx" %>%
get_from_excel()
ex$statements$file <- "example-file"
q <- paste0("DELETE FROM cm3embeddings_s WHERE file='","example-file","';")
dbExecute(conn,q)
q <- paste0("DELETE FROM cm3embeddings_s_long WHERE file='","example-file","';")
dbExecute(conn,q)
ex$statements %>%
filter(source_id %in% xc("MNX-1 MNX-2")) %>%
write_and_return_embeddings(api_key = api_key,conn = conn,long=F)
ex$statements %>%
filter(source_id %in% xc("MNX-1 MNX-2")) %>%
write_and_return_embeddings(api_key = api_key,conn = conn,long=T)
ex$statements %>%
filter(source_id %in% xc("MNX-1 MNX-2")) %>%
fetch_embeddings(.,api_key=api_key,conn=conn,file="example_file",reset=F,already=NULL)
already <-
ex$statements %>%
filter(source_id %in% xc("MNX-1 MNX-2"))
statements <-
ex$statements %>%
filter(source_id %in% xc("MNX-1 MNX-2"))
fetch_embeddings(statements,api_key=api_key,conn=conn,file="example_file",reset=F,already=already)
already <-
ex$statements %>%
filter(source_id %in% xc("MNX-1"))
fetch_embeddings(statements,api_key=api_key,conn=conn,file="example_file",reset=F,already=already)
q <- paste0("DELETE FROM cm3embeddings_s_long WHERE file='","tearfund-all-2024","';")
dbExecute(conn,q)
tic()
create_new_statement_embedding_rows_long(statements = statements,api_key = api_key,conn = conn,file = "example-file",reset=F,already=already)
toc()
already <-
ex$statements %>%
filter(source_id %in% xc("MNX-1"))
tic()
create_new_statement_embedding_rows_long(statements = statements,api_key = api_key,conn = conn,file = "example-file",reset=F,already=already)
toc()
}
if(F){
tbl(conn,"cm3files") %>%
arrange(desc(created)) %>%
filter(!str_detect(edit,"steve|hello|gabriele|hannah|fiona|tabitha|gcc43|testing|pubale")) %>% collect %>%
write_xlsx("user_files.xlsx")
tbl(conn,"cm3files") %>%
select(-created)%>%
filter(!str_detect(edit,"steve|hello|gabriele|hannah|fiona|tabitha|gcc43|testing|pubale")) %>% left_join(tbl(conn,"cm3links") %>% select(-modified),by="file") %>%
group_by(file,edit) %>%
summarise(n_links=n(),created=min(created,na.rm=T),modified=max(modified,na.rm=T)) %>%
filter(!str_detect(edit,"steve|hello|gabriele|hannah|fiona|tabitha|gcc43|testing|pubale")) %>% collect %>%
arrange(desc(modified)) %>%
write_xlsx("user_files.xlsx")
}