
Create searchable Bluesky bookmarks with R

Thursday, December 19, 2024, 10:00, by InfoWorld
The Bluesky social network is seeping into the mainstream, thanks in part to people unhappy with changes at X (formerly Twitter). “It’s not some untouched paradise, but it’s got a spark of something refreshing,” Jan Beger, global head of AI Advocacy at GE HealthCare, posted on LinkedIn last month. “Like moving to a new neighborhood where the air feels a little clearer, and people actually wave when they see you.”

That includes an increasing number of R users, who often use the hashtag #rstats. “Bluesky is taking off! Please join us there with this handy Posit employees starter pack,” Posit (formerly RStudio) wrote in its December newsletter. Starter packs make it easy to find and follow other Bluesky users based on topics and interests, either one by one or with a single click. Any user can create them. (See my companion FAQ on Bluesky with tips for getting started.)

However, because Bluesky is relatively new (it opened to the general public early this year), some features that are standard on other microblogging platforms aren't yet available. One of the most sorely missed is "bookmarks": being able to save a liked post for easy retrieval later. There are a couple of workarounds, either with third-party apps or custom feeds. In addition, thanks to Bluesky's free API and the atrrr R package, you can use R to save the data from all your liked posts locally and query it however you want. Here's how.

Save and search your Bluesky likes

This project lets you download your Bluesky likes, wrangle the data, save it, and load it into a searchable table. Optionally, you can save the data to a spreadsheet and manually mark the ones you want to consider as bookmarks, so you can then search only those.

For even more functionality, you can send the results to a generative AI chatbot and ask natural language questions instead of searching for specific words.

Let’s get started!

Step 1. Install the atrrr R package

I suggest installing the development version of the atrrr R package from GitHub. I typically use the remotes package for installing from GitHub, although there are other options such as devtools or pak.

remotes::install_github('JBGruber/atrrr', build_vignettes = TRUE)

I’ll also be using dplyr, purrr, stringr, tidyr, and rio for data wrangling and DT to optionally display the data in a searchable table. Install those from CRAN, then load them all with the following code.

library(atrrr)
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)
library(rio)
library(DT) # optional

You will need a free Bluesky application password (different from your account password) to request data from the Bluesky API. The first time you make a request via atrrr, the package should helpfully open the correct page on the Bluesky website for you to add an app password, if you're already logged into Bluesky in your default browser. Or, you can visit https://bsky.app/settings/app-passwords to create an app password in advance.

Either way, when you first use the atrrr package, you’ll be asked to input an application password if there’s not one already stored. Once you do that, a token should be generated and stored in a cache directory at file.path(tools::R_user_dir('atrrr', 'cache'), Sys.getenv('BSKY_TOKEN', unset = 'token.rds')). Future requests should access that token automatically.

You can find more details about Bluesky authentication at the atrrr package website.
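If you'd rather set up authentication explicitly instead of waiting for your first data request to trigger it, here's a minimal sketch, assuming atrrr's auth() helper and a placeholder handle:

library(atrrr)

# Prompts for an app password if no token is cached yet.
# 'yourusername.bsky.social' is a placeholder; use your own handle.
auth(user = 'yourusername.bsky.social')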

Step 2. Download and wrangle your Bluesky likes

The get_actor_likes() function lets you download a user’s likes. Here’s how to download your 100 most recent likes:

# 'yourusername.bsky.social' is a placeholder; use your own handle
my_recent_likes <- get_actor_likes(actor = 'yourusername.bsky.social', limit = 100)

Note: As of this writing, if you ask for more likes than there actually are, you’ll get an error. If that happens, try a smaller number. (You may end up with a larger number anyway.)
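If you'd rather not guess at a safe number, one workaround is to catch that error and retry with a smaller limit. A sketch, assuming the failed request throws an ordinary R error:

# Try a larger limit first, then fall back to a smaller one
# if the request errors out.
my_recent_likes <- tryCatch(
  get_actor_likes(actor = 'yourusername.bsky.social', limit = 100),
  error = function(e) get_actor_likes(actor = 'yourusername.bsky.social', limit = 25)
)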

The get_actor_likes() command above returns a data frame — more specifically a tibble, which is a special type of tidyverse data frame — with 20 columns. However, some of those columns are list columns with nested data, which you can see if you run glimpse(my_recent_likes). You’ll probably want to extract some information from those.

[Screenshot: the likes tibble viewed with glimpse(), showing its nested list columns. IDG]

As shown above, each liked post's text, author_name, author_handle, like_count, and repost_count are arranged in conventional (unnested) columns. However, the timestamp when the post was created is nested inside the post_data list column, while any URLs mentioned in a post are buried deep within the embed_data column.

Here’s one way to unnest those columns using the tidyr package’s unnest_wider() function:

my_recent_likes_cleaned <- my_recent_likes |>
  unnest_wider(post_data, names_sep = '_', names_repair = 'unique') |>
  unnest_wider(embed_data, names_sep = '_', names_repair = 'unique') |>
  unnest_wider(embed_data_external, names_sep = '_', names_repair = 'unique')

That creates a data frame (tibble) with 40 columns. I’ll select and rename the columns I want, and add a TimePulled timestamp column, which could come in handy later.

my_recent_likes_cleaned <- my_recent_likes_cleaned |>
  select(Post = text, By = author_handle, Name = author_name, CreatedAt = post_data_createdAt, Likes = like_count, Reposts = repost_count, URI = uri, ExternalURL = embed_data_external_uri) |>
  mutate(TimePulled = Sys.time())

Another column that would be useful to have is the URL of each liked post. Note that the existing URI column is in a format like at://did:plc:oky5czdrnfjpqslsw2a5iclo/app.bsky.feed.post/3lbd2ee2qvm2r. But the post URL uses the syntax https://bsky.app/profile/{author_handle}/post/{a portion of the URI}.

The R code below creates a URL column from the URI and By columns, adding a new PostID column as well for the part of the URI used for the URL.

my_recent_likes_cleaned <- my_recent_likes_cleaned |>
  mutate(PostID = stringr::str_replace(URI, 'at.*?post/(.*?)$', '\\1'),
         URL = glue::glue('https://bsky.app/profile/{By}/post/{PostID}'))

I typically save data like this as Parquet files with the rio package.

rio::export(my_recent_likes_cleaned, 'my_likes.parquet')

You can save the data in a different format, such as an .rds file or a .csv file, simply by changing the file extension in the file name above.
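For example, to save a CSV copy instead:

# Same data, exported as CSV; rio picks the format from the extension
rio::export(my_recent_likes_cleaned, 'my_likes.csv')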

Step 3. Keep your Bluesky likes file updated

It’s nice to have a collection of likes as a snapshot in time, but it’s more useful if you can keep it updated. There are probably more elegant ways to accomplish this, but a simple approach is to load the old data, pull some new data, find which rows aren’t in the old data frame, and add them to your old data.

previous_my_likes <- rio::import('my_likes.parquet')

Next, get your recent likes by running all the code we did above, up until saving the likes. Change the limit to what’s appropriate for your level of liking activity.

# adjust the limit to suit your level of liking activity
my_recent_likes <- get_actor_likes(actor = 'yourusername.bsky.social', limit = 25) |>
  unnest_wider(post_data, names_sep = '_', names_repair = 'unique') |>
  unnest_wider(embed_data, names_sep = '_', names_repair = 'unique') |>
  unnest_wider(embed_data_external, names_sep = '_', names_repair = 'unique') |>
  select(Post = text, By = author_handle, Name = author_name, CreatedAt = post_data_createdAt, Likes = like_count, Reposts = repost_count, URI = uri, ExternalURL = embed_data_external_uri) |>
  mutate(TimePulled = Sys.time()) |>
  mutate(PostID = stringr::str_replace(URI, 'at.*?post/(.*?)$', '\\1'),
         URL = glue::glue('https://bsky.app/profile/{By}/post/{PostID}'))

Find the new likes that aren’t already in the existing data:

new_my_likes <- anti_join(my_recent_likes, previous_my_likes, by = 'URI')

Combine the new and old data:

deduped_my_likes <- bind_rows(new_my_likes, previous_my_likes)

And, finally, save the updated data by overwriting the old file:

rio::export(deduped_my_likes, 'my_likes.parquet')

Step 4. View and search your data the conventional way

I like to create a version of this data specifically to use in a searchable table. It includes a link at the end of each post’s text to the original post on Bluesky, letting me easily view any images, replies, parents, or threads that aren’t in a post’s plain text. I also remove some columns I don’t need in the table.

my_likes_for_table <- deduped_my_likes |>
  mutate(
    Post = str_glue("{Post} <a href='{URL}' target='_blank'>>></a>"),
    ExternalURL = ifelse(!is.na(ExternalURL),
                         str_glue("<a href='{ExternalURL}' target='_blank'>{substr(ExternalURL, 1, 25)}...</a>"),
                         '')
  ) |>
  select(Post, Name, CreatedAt, ExternalURL)

Here’s one way to create a searchable HTML table of that data, using the DT package:

DT::datatable(my_likes_for_table, rownames = FALSE, filter = 'top', escape = FALSE,
              options = list(pageLength = 25, autoWidth = TRUE,
                             lengthMenu = c(25, 50, 75, 100),
                             searchHighlight = TRUE,
                             search = list(regex = TRUE))
)

This table has a table-wide search box at the top right and search filters for each column, so I can search for two terms at once, such as the #rstats hashtag in the main search bar and then any post whose text contains LLM (the table's search isn't case sensitive) in the Post column filter bar. Or, because I enabled regular expression searching with the search = list(regex = TRUE) option, I could use a single lookahead pattern such as (?=.*rstats)(?=.*LLM) in the main search box.


Generative AI chatbots like ChatGPT and Claude can be quite good at writing complex regular expressions. And with matching text highlights turned on in the table, it will be easy for you to see whether the regexp is doing what you want.

Query your Bluesky likes with an LLM

The simplest free way to use generative AI to query these posts is by uploading the data file to a service of your choice. I’ve had good results with Google’s NotebookLM, which is free and shows you the source text for its answers. NotebookLM has a generous file limit of 500,000 words or 200MB per source, and Google says it won’t train its large language models (LLMs) on your data.

The query “Someone talked about an R package with science-related color palettes” pulled up the exact post I was thinking of — one which I had liked and then re-posted with my own comments. And I didn’t have to give NotebookLM my own prompts or instructions to tell it that I wanted to 1) use only that document for answers, and 2) see the source text it used to generate its response. All I had to do was ask my question.


I formatted the data to be a bit more useful and less wasteful by limiting CreatedAt to dates without times, keeping the post URL as a separate column (instead of a clickable link with added HTML), and deleting the external URLs column. I saved that slimmer version as a .txt file rather than a .csv file, since NotebookLM doesn’t handle the .csv extension.

my_likes_for_ai <- deduped_my_likes |>
  mutate(CreatedAt = substr(CreatedAt, 1, 10)) |>
  select(Post, Name, CreatedAt, URL)

rio::export(my_likes_for_ai, 'my_likes_for_ai.txt')

After uploading your likes file to NotebookLM, you can ask questions right away once the file is processed.


If you really wanted to query the document within R instead of using an external service, one option is the Elmer Assistant, a project on GitHub. It should be fairly straightforward to modify its prompt and source info for your needs. However, I haven’t had great luck running this locally, even though I have a fairly robust Windows PC.
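Another lightweight option is to skip the app and query the data yourself with the ellmer package (the successor to elmer). Here's a minimal sketch, assuming you have an OpenAI API key stored in the OPENAI_API_KEY environment variable and the my_likes_for_ai.txt file created above:

library(ellmer)

# Paste the slimmed-down likes file into the system prompt.
likes_text <- paste(readLines('my_likes_for_ai.txt'), collapse = '\n')

chat <- chat_openai(
  system_prompt = paste(
    'Answer questions using only this list of liked Bluesky posts.',
    'Include the URL of any post you cite.',
    likes_text
  )
)

chat$chat('Someone talked about an R package with science-related color palettes')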

Update your likes by scheduling the script to run automatically

To stay useful, your underlying “posts I’ve liked” data needs to be kept up to date. I run my script manually on my local machine periodically when I’m active on Bluesky, but you can also schedule the script to run automatically every day or once a week. Here are three options:

Run a script locally. If you’re not too worried about your script always running on an exact schedule, tools such as taskscheduleR for Windows or cronR for Mac or Linux can help you run your R scripts automatically (see the cronR sketch after this list).

Use GitHub Actions. Johannes Gruber, the author of the atrrr package, describes how he uses free GitHub Actions to run his R Bloggers Bluesky bot. His instructions can be modified for other R scripts.

Run a script on a cloud server. Or you could use an instance on a public cloud such as DigitalOcean plus a cron job.
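For example, here's a minimal cronR sketch for Mac or Linux, assuming your update code lives in a file called update_likes.R (both the path and the schedule are placeholders):

library(cronR)

# Wrap the R script in a shell command and schedule it to run daily at 8 AM.
cmd <- cron_rscript('/path/to/update_likes.R')
cron_add(cmd, frequency = 'daily', at = '8AM', id = 'bluesky_likes_update')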

Use a spreadsheet or R tool to add Bookmarks and Notes columns

You may want a version of your Bluesky likes data that doesn’t include every post you’ve liked. Sometimes you may click like just to acknowledge you saw a post, to show the author that people are reading, or because you found the post amusing but don’t expect to want to find it again.

However, a caution: It can get onerous to manually mark bookmarks in a spreadsheet if you like a lot of posts, and you need to be committed to keep it up to date. There’s nothing wrong with searching through your entire database of likes instead of curating a subset with “bookmarks.”

That said, here’s a version of the process I’ve been using. For the initial setup, I suggest using an Excel or .csv file.

Step 1. Import your likes into a spreadsheet and add columns

I’ll start by importing the my_likes.parquet file and adding empty Bookmark and Notes columns, and then saving that to a new file.

my_likes <- rio::import('my_likes.parquet')

likes_w_bookmarks <- my_likes |>
  mutate(Bookmark = as.character(''), .before = 1) |>
  mutate(Notes = as.character(''), .after = Bookmark)

rio::export(likes_w_bookmarks, 'likes_w_bookmarks.xlsx')

After some experimenting, I opted to have a Bookmark column as characters, where I can add just “T” or “F” in a spreadsheet, and not a logical TRUE or FALSE column. With characters, I don’t have to worry whether R’s Boolean fields will translate properly if I decide to use this data outside of R. The Notes column lets me add text to explain why I might want to find something again.

Next is the manual part of the process: marking which likes you want to keep as bookmarks. Opening this in a spreadsheet is convenient because you can click and drag F or T down multiple cells at a time. If you have a lot of likes already, this may be tedious! You could decide to mark them all “F” for now and start bookmarking manually going forward, which may be less onerous.
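If you do decide to start them all as “F”, that’s a one-liner in R before the first export, assuming the likes_w_bookmarks data frame from step 1:

# Default every existing like to 'F' (not a bookmark).
likes_w_bookmarks$Bookmark <- 'F'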

Save the file manually back to likes_w_bookmarks.xlsx.

Step 2. Keep your spreadsheet in sync with your likes

After that initial setup, you’ll want to keep the spreadsheet in sync with the data as it gets updated. Here’s one way to implement that.

After updating the new deduped_my_likes likes file, create a bookmark check lookup, and then join that with your deduped likes file.

bookmark_check <- rio::import('likes_w_bookmarks.xlsx') |>
  select(URL, Bookmark, Notes)

my_likes_w_bookmarks <- left_join(deduped_my_likes, bookmark_check, by = 'URL') |>
  relocate(Bookmark, Notes)

Now you have a file with the new likes data joined with your existing bookmarks data, with entries at the top having no Bookmark or Notes entries yet. Save that to your spreadsheet file.

rio::export(my_likes_w_bookmarks, 'likes_w_bookmarks.xlsx')

An alternative to this somewhat manual and intensive process could be using dplyr::filter() on your deduped likes data frame to remove items you know you won’t want again, such as posts mentioning a favorite sports team or posts on certain dates when you know you focused on a topic you don’t need to revisit.
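For example, here’s a sketch with two hypothetical filters, one dropping posts that mention a favorite sports team and one dropping posts from a specific date:

# Both the team name and the date below are hypothetical examples.
curated_likes <- deduped_my_likes |>
  filter(!str_detect(Post, 'Red Sox')) |>
  filter(as.Date(CreatedAt) != as.Date('2024-11-05'))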

Next steps

Want to search your own posts as well? You can pull them via the Bluesky API in a similar workflow using atrrr’s get_skeets_authored_by() function. Once you start down this road, you’ll see there’s a lot more you can do. And you’ll likely have company among R users.
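For example, a starting point with the same placeholder handle as before:

# Download your own most recent posts ('skeets').
my_skeets <- get_skeets_authored_by(actor = 'yourusername.bsky.social', limit = 100)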
https://www.infoworld.com/article/3623509/create-searchable-bluesky-bookmarks-with-r.html
