Using Coereba

Background

Manual (supervised) analysis is widespread in cytometry (especially conventional flow cytometry (CFC)), wherein researchers will draw hierarchial boolean gates around cell populations using the graphical user interface provided by various commercial softwares. By sequentially switching between markers, you end up filtering down to smaller subsets of cells of potential interest.

With the increase in markers (initially by mass cytometry (MC), recently by spectral flow cytometry (SFC)) and accompanying combinatorial combinations needing to be gated, algorithmic (unsupervised) analysis methods have been developed to cluster cells on basis median fluorescent intensity (MFI) of individual markers.

While these handle more combinatorial data, how many identified clusters are present (and/or biologically meaningful) remains a point of concern. The additional dependence on MFI values being consistent across individuals and experiments increases these methods susceptibility to batch effects.

Rationale

SFC datasets in particular, collecting greater numbers of cells than MC and increased number of markers compared to CFC, require new tools to enable comprehensive profiling of phenotypes present within the acquired cells. These datasets loss resolution due to improper unmixing controls poses additional challenges.

Coereba is a collection of tools that attempt to implement a semi-supervised analytical approach. Using automated gating, splitpoints between positive and negative cells for every marker are estimated on an individual basis. Through extensive visualization and a ShinyApp, failures in gate setting by algorithmic gating can be adjusted by the researcher. Individual cells are then classified on the basis of these marker splitpoints, and the identity and metadata information is appended to the .fcs file.

Following unsupervised analysis, this manually defined gating information can extracted and used both for analysis and verify the algorithm performed as predicted across specimens. Due to the researcher intervention, it is more resilient to batch effects than a pure MFI approach.

The author and maintainer in no part claims that Coereba is the solution to all cytometry analysis woes. It is a tool in the open-source toolbelt that can enable you to do useful things and figure out interesting quirks that you wouldn’t be otherwise able to do. Go forth, gain better understanding underlying problems our current paradigms have, and may the next generation of researchers benefit from what we learn.

Getting Started

Installing Coereba

The Coereba R package is in preparation for submission to Bioconductor later this year. Until that time, it is available for install via GitHub.

if(!require("remotes")) {install.packages("remotes")}
remotes::install_github("https://github.com/DavidRach/Coereba")

# install.packages("BiocManager")
# BiocManager::install("Coereba")

Loading Libraries

Coereba relies on underlying infrastructure provided flowWorkspace and other Bioconductor cytometry packages. It leverages functional programming principles implemented in tidyverse packages available via CRAN. It is important to make sure that all the required packages are installed on your computer, and that library is called for each.

library(Coereba)
library(flowCore)
library(flowWorkspace)
library(openCyto)
library(ggcyto)  
library(data.table)
library(dplyr)
library(purrr) 
library(stringr)
library(ggplot2)
library(gt)
library(plotly)
library(htmltools)

Locating .fcs files

To get started, you will first need to provide the location on your computer where the .fcs files of interest are being stored. An example of how the author does this is provided below and can be modified for your own user and desired computer folder.

File_Location <- file.path("C:", "Users", "JohnDoe", "Desktop", "TodaysExperiment")
FCS_Pattern <- ".fcs$"
FCS_Files <- list.files(path = File_Location, pattern = FCS_Pattern,
                        full.names = TRUE, recursive = FALSE)

For this vignette, we will using several small .fcs files that can be found in Coereba’s extdata folder for our example.

File_Location <- system.file("extdata", package = "Coereba")
FCS_Pattern <- ".fcs$"
FCS_Files <- list.files(path = File_Location, pattern = FCS_Pattern,
                        full.names = TRUE, recursive = FALSE)
head(FCS_Files, 3)

The folder contains three unmixed full-stained samples from a 29-color SFC panel, characterizing Innate-like T cells (ILT) in cord blood mononuclear cells (CBMCs). As rare circulating cells, the original .fcs files contained many cellular events, which makes them useful to show off Coereba’s capabilities at characterizing cells that were not the original focus.

Creating a GatingSet for Unmixed .fcs files

Working with the unmixed full-stained samples mentioned above, we will proceed by bringing the .fcs files found within Coereba’s extdata folder into a GatingSet object where we can interact with them.

UnmixedFCSFiles <- FCS_Files[c(1:2)]
UnmixedCytoSet <- load_cytoset_from_fcs(UnmixedFCSFiles, truncate_max_range = FALSE,
                                        transformation = FALSE)
UnmixedGatingSet <- GatingSet(UnmixedCytoSet)
UnmixedGatingSet

Now that we have created a GatingSet object, we will identify the markers/fluorophores present within the .fcs files, and exclude markers from the list don’t require transformation (as is the case with FSC, SSC). We will then bi-exponentially transform the data using flowWorkspace flowWorkspace::flowjo_biexp_trans() before applying a openCytogating template that was customized for this particular panel, which can be found in the extdata folder.

Markers <- colnames(UnmixedCytoSet)
KeptMarkers <- Markers[-grep("Time|FS|SC|SS|Original|-W$|-H$|AF", Markers)]

MyBiexponentialTransform <- flowjo_biexp_trans(channelRange = 256, maxValue = 1000000,
                                               pos = 4.5, neg = 0, widthBasis = -1000)
TransformList <- transformerList(KeptMarkers, MyBiexponentialTransform)
UnmixedGatingSet <- transform(UnmixedGatingSet, TransformList)

UnmixedGates <- fread(file.path(path = File_Location, pattern = 'GatesUnmixed.csv'))
UnmixedGating <- gatingTemplate(UnmixedGates)
gt_gating(UnmixedGating, UnmixedGatingSet)

plot(UnmixedGatingSet)

We can additionally verify that the gating of the cell populations of interest was correct, by visualizing the gates using ggcyto or the Luciernaga package.

library(Luciernaga)

MyPlot <- Utility_GatingPlots(x=UnmixedGatingSet[1], sample.name=c("GROUPNAME", "TUBENAME"),
                              removestrings=".fcs", subset="live", gtFile=UnmixedGates,
                              DesiredGates=NULL, outpath = getwd(), export=FALSE,
                              therows=3, thecolumns=2)

MyPlot

Coereba_GateCutoffs

Having set up a GatingSet with transformations and gates applied, we can proceed to test the Coereba workflow. The first step is to create a .csv file, containing the individual specimens, and the individual markers of interest. Each cell within the .csv file will correspond to the MFI for a given marker for a given individual where positive expression of a marker transitions to becoming a negative (referred here to as a splitpoint).

These could be written out individually, or based on automated calculations. To simplify the process, in Coereba, we have implemented two approaches to estimating these splitpoints. One function is the in-house Coereba_GateCutoffs(), which does an “okay-ish but work-in progress”. The flip approach is leveraging the openCyto pipeline (in development) and retrieving the values.

Both approaches can be computationally heavy, but still faster than eye-balling everything and typing in the values one-by-one. Save the eyestrain for subsequent validation steps.

TheGateCutoff <- Coereba_GateCutoffs(gs=UnmixedGatingSet[1],
  subset="live", sample.name="GROUPNAME", desiredCols =c("BUV805-A"))

TheGateCutoffs <- map(.x=UnmixedGatingSet[1:3], .f=Coereba_GateCutoffs,
  subset="live", sample.name="GROUPNAME", desiredCols =c("BUV805-A")) %>% bind_rows()

TheGateCutoffs

Visualizing Gate Cutoffs

Once we have generated a GateCutoffs.csv file, we can visualize how well the splitpoints were by plotting them as red-lines using the Luciernaga packages Utility_NbyNPlots() to visualize the markers for a given individual and Utility_UnityPlots() to visualize the markers across individuals. This a good way of understanding how well the competing gating-algorithms did in the context of your individual panel, and how much effort will be required for the next step in correcting the gates individually.

Coereba_App

Previously, all adjustments to the above splitpoints had to eye-balled in the pdf, and then adjusted in the .csv file, and repeated ad nauseum. This obviously did not scale well. We have subsequently created a ShinyApp that will bring in the GatingSet object, and plot an interactive version of Utility_UnityPlot() from the Luciernaga package, visualizing the splitpoint for a given marker across all individuals as red vertical lines.

To get started, first make sure you know where the Coereba_GateCutoffs() .csv output is saved at, as you will need to select the file within the App.

File_Location <- system.file("extdata", package = "Coereba")
TheCSV <- file.path(File_Location, "GateCutoffsForNKs.csv")
TheCSVData <- read.csv(TheCSV, check.names=FALSE)
head(TheCSVData)

Next, make sure to remember the name of your GatingSet (“UnmixedGatingSet” for this case), and the sample.name keyword to identify individual specimens, as you will need to input these values within the app.

Then, run the following command in your console:

Coereba_App()

The first tab view will let you upload the GateCutoff.csv file generated by the previous function, by letting you navigate to the storage folder and clicking on it.

Once this is done, navigate to the Upload a GatingSet tab. Enter the information about your GatingSet (“UnmixedGatingSet” for this example), select your x and y parameters, enter the SampleName keyword. Desired bins allows adjustments to the visualization given the number of cells present, clearance multiplier adjust the margins wiggle-room, removestrings arguments. You specify margin and gate subsets like you would for Utility_UnityPlots() and click Display Estimated Gate Cutoffs. Finally, you click on Generate Plots and go retrieve your coffee/tea/beverage-of-choice.

Upon loading of the plots, scroll until you encounter a splitpoint line that was placed incorrectly, you can simply click on the correct location of the axis, which will save the coordinate as a splitpoint for that given marker and individual to a data.frame. This will later be ported to update the corresponding splitpoint value for a given marker and individual in the exisiting GateCutoff .csv. By just clicking, you can therefore quickly correct cases where the algorithm failed to gate properly due to batch effects, bad staining, or algorithmic incompetence (sorry, the coders are still learning as well).

Once you have corrected some/all the markers, navigate to the Click Data tab. You can then export the click-data information by specifying a filename and providing an outpath to the folder you want to save at. Unfortunately, due to security issues in ShinyApps, you will need to type out the designated outpath manually.

And repeat until the markers you are interested have been validated and you have a folder of clickpoint .csvs awaiting conversion into the existing GateCutoff folder.

Since the code is within a ShinyApp running in R, it has both the advantages and disadvantages of a R Shiny App. It scales reasonably well, but can take a while to load the next iterated marker. During testing, loading a new marker to inspect took anywhere from 3-10 additional minutes. This area is a active work in progress to speed up, but the author/maintainer hasn’t had the time to finish reading Mastering Shiny by Hadley Wickham yet or implement it in Rust, if you have, feel free to assist!

Coereba_UpdateGates

Once you have completed validating the GateCutoffs with Coereba_App(), it is time to convert your validated click-data and update the GateCutoff.csv. To fascilitate this, provide the location of the folder containing just the clickpoint .csv’s to Coereba_UpdateGates() and fill out the required arguments. It will proceed to update the GateCutoffs.

For this example, we will use ClickDataExamples.csv stored within Coereba’s extdata folder:

File_Location <- system.file("extdata", package = "Coereba")
TheOldCSV <- file.path(File_Location, "GateCutoffsForNKs.csv")
TheClickInfo <- file.path(File_Location, "ClickDataExample.csv")
TheClickData <- read.csv(TheClickInfo, check.names=FALSE)
TheClickData

The old data:

TheOldData <- read.csv(TheOldCSV, check.names=FALSE)
TheOldData

We can then run Coereba_UpdateGates() and watch the values for BUV469-A are updated.

UpdatedCSV <- Coereba_UpdateGates(Clicks=TheClickInfo, Old=TheOldCSV,
  export=FALSE, outpath=NULL, fileName="UpdatedCSV")

UpdatedCSV

We can further validate this by re-running Luciernaga Utility_UnityPlots() using the updated GateCutoff.csv and seeing if the previous issues have been corrected.

Note on Cell Populations and Splitpoint

During development, we have noticed that for SFC data, the splitpoint location can vary substantially for individual cell populations (B cells, NK cells, T cells, etc), even in the abscence of biology. We believe this is due to uncertainty in the unmixing, as we have observed a correlation of negative splitpoint MFI in relation to the individual cells kappa value (matrix complexity essentially). Try your best to have the gate reflect what you are interested in, reduce the amount of error as much as you can, and in numbers veritas. Keep track of the exceptions (more later) and write up a Cyto paper.

How Coereba works

Having a GateCutoff.csv containing the splitpoint information for each marker validated for each individual is immensely useful, allowing you to do many things you wouldn’t be capable of otherwise. Namely, we can for an individual anotate each cell within their .fcs file.

What do I mean? Let’s first think about all the cells circulating in an individuals bloostream. We don’t have any prior information in this example, but we want to profile these cells into clusters of similar cells. On one end of the spectra, there is a single cluster that contains every single cell from the sample, regardless of their marker expression. At the opposite end of the spectrum, every single cell clusters in an it’s own cluster with just itself. In between these two extremes, depending on our question, lies a meaningful cluster number that will match the underlying biology to reduce the amount of variance that lies in clumping/splitting populations needlessly.

Coereba takes the approacch by individually annotating an individuals cells one by one, on the basis of where their MFI value lands related to the validated splitpoint. So if the splitpoint for FITC is at 50, and the individual cell MFI for FITC is 80, it returns as FITC_pos. If the splitpoint for APC is at 70, and the individual cell MFI is 30, it returns as APC_Neg.

When you iterate over the splitpoints for each marker, each cell derrives an identity. FITC_pos-APC_Neg-BV421_Pos…. etc. We can then group cells with matching identies and derrive information from these terminal nodes.

Individually, this may not mean much. After all, dichotomous splitting of a 30-marker panel is around half-a billion potential terminal nodes or clusters. Similarly, cytometers have instrumental, batch, experimental noise, etc.

But when we scale to all the events collected within a typical sample, we get far fewer terminal nodes, in the hundred to lower thousand range for a 30-color panel. The reason is cells (and the panels by which we investigate them) are not uniformely distributed by markers across high-dimensional space, cells fall within specific phenotypes (cell types) that undergo division of similar clones.

Consequently, we can iteratively leverage the number of markers included to target questions of interest, broad or narrow scope, in context of our individual panels, and rely on methods described below to gain insight into biological systems.

Utility_Coereba

Utility_Coereba() is the function that implements the above approach, labelling for each individual specimen the individual cells on the basis of their splitpoint identity, and then returning the enumeration of the resulting clusters. It iterates over the different specimens in the GatingSet, returning information for every individual that can be used in subsequent steps. MultiReach() is the older version of the proccess.

Both functions are currently waiting on the ILT paper to be sent to collaborators so that David can implement his 5-subject binder worth of notes to implement a S4 OOP class to make manipulating the final outputs easier than the current .qmd files full of tidyverse data tidying.

CoerebaIDs <- Utility_Coereba(x=UnmixedGatingSet[1], subsets="live",
  sample.name="GROUPNAME", reference=TheCSV, starter="Spark Blue 550-A")

head(CoerebaIDs, 5)

Filtering Coereba Clusters

Some things are literally poisson noise. Others are cell clusters with varying abundance across heteregenous human patients. Others are individually unique terminal nodes that can be the result of individual biology…. or unmixing errors … or your lab tech messing up the staining panel. You can leverage the filter functions to identify all of these quickly.

Coereba_MarkerExpressions

Having generated a Marker by Cluster data.frame (binaryData) and a Cluster by Specimen data.frame (dataData), we can leverage the combination of the two data.frames to aggregate the clusters on the basis of marker presence, returning the proportion of of cells that are positive for a given marker.

For this demonstration, we will focus on NKTs, using stored .csv files for their binary and data outputs stored within the Coereba extdata folder.

File_Location <- system.file("extdata", package = "Coereba")

panelPath <- file.path(File_Location, "ILTPanelTetramer.csv")
panelData <- read.csv(panelPath, check.names=FALSE)

binaryPath <- file.path(File_Location, "HeatmapExample.csv")
binaryData <- read.csv(binaryPath, check.names=FALSE)

dataPath <- file.path(File_Location, "ReadyFileExample.csv")
dataData <- read.csv(dataPath, check.names=FALSE)

We will first just return the marker expressions for all markers:

library(Coereba)

AllMarkers <- Coereba_MarkerExpressions(data=dataData, binary=binaryData,
  panel=panelData, starter="SparkBlue550")
  
head(AllMarkers, 5)

We can additionally by specifying returnType = “Combinatorial” derrive proportions within quadrant gates for two markers of interest. To do so, we specify the fluorophores corresponding to the markers as a list in the CombinatorialArgs.

MemoryQuadrants <- Coereba_MarkerExpressions(data=dataData, binary=binaryData,
  panel=panelData, starter="SparkBlue550", returnType = "Combinatorial", 
  CombinatorialArgs = c("BV510", "BUV395"))

head(MemoryQuadrants, 5)

Utility_MarkerPlots

Having returned aggregated data above with Coereba_MarkerExpressions, we can plot this data using Utility_MarkerPlots(). These are combination of ggplot2 geom_boxplot() and ggbeeswarm geom_beeswarm plots. We provide some arguments that allow for filtering and reordering the markers shown in these plots, some basic customizations to the plots, and the ability to save as a .png file to a designated folder.

To do so for your own individual spots, you will need to specify a metadata column by which to factor by, and provide a list of shape and fill values corresponding to each factor level.

shape_ptype <- c("HU" = 22, "HEU-lo" = 21, "HEU-hi" = 21)
fill_ptype <- c("HU" = "white", "HEU-lo" = "darkgray", "HEU-hi" = "black")

We can start by viewing all markers:

ThePlot <- Utility_MarkerPlots(data=AllMarkers, panel=panelData,
  myfactor="ptype", shape_palette = shape_ptype, fill_palette = fill_ptype)

ThePlot

Let’s abbreviate the number of columns by specifying the marker names to include (filterForThese), and rearrange them in desired X-axis order

ThePlot <- Utility_MarkerPlots(data=AllMarkers, panel=panelData,
  myfactor="ptype", shape_palette = shape_ptype, fill_palette = fill_ptype,
  filterForThese=c("CD7", "CD4", "CD8"), XAxisLevels = c("CD7", "CD4", "CD8"))

ThePlot

And to ggsave directly to a desired folder:

Utility_Stats

The Utility_Stats() and the Utility_Behemoth() functions are either: “Great!!!”, or guaranteed to make your friendly-local statistician cry (Opinions vary).

Basically, Utility_Stats() is a coding-attempt to replicate the typical immunologist workflow, cutting out the need to copy-paste data from a .csv file into GraphPad Prism (trademark), run a normality test, determine number of groups, and follow up with the corresponding t-test/anova/non-parametric equivalents. Utility_Behemoth() then takes these outputs, and adds the p-value results to a ggbeeswarm/ggplot2 boxplot of the underlying data. In combination with purrr and Luciernaga Utility_Patchwork(), these can be quickly formatted into a .pdf.

Does this workflow rapidly profile all the columns in your data.frame. Yes it does. Is it emblematic of potential issues arising from basing science entirely on null-hypothesis statistical testing? Also yes, seeing how many plots return “significant” due to a couple outliers. In the end, it mimics the current workflow of many researchers, just speeding it up. Google Statistical Rethinking on YouTube for food-for-thought.

To begin, we will use the Coereba_MarkerExpressions data we generated above to test out Utility_Stats().

Result <- Utility_Stats(data=AllMarkers, var="CD62L",
  myfactor="ptype", normality="dagostino", correction="none")

Result

In combination with purrr we can rapidly iterate over all the markers, but will need to skip over the metadata columns that are present at the begining of the dataframe.

library(purrr)
library(dplyr)
# colnames(AllMarkers)
TheLength <- ncol(AllMarkers)

TheData <- map(names(AllMarkers)[c(9:TheLength)], ~ Utility_Stats(
  data = AllMarkers, var = .x, myfactor = "ptype", 
  normality = "dagostino")) %>% bind_rows()

The results can be filtered using regular dplyr functions.

MaybeSignificant <- TheData %>% dplyr::filter(pvalue < 0.05)

MaybeSignificant

Utility_Behemoth

Utility_Behemoth() continues from where Utility_Stats() left-off, appending the resulting p-value information onto a ggplot2 plot. The plot is a combination of ggplot2 geom_boxplot() and ggbeeswarm geom_beeswarm plots, and accepts arguments to customize these and rearrange the group order. As with most Coereba plots, we need to provide the shape and fill for our respective factors levels.

To do so for your own individual spots, you will need to specify a metadata column by which to factor by, and provide a list of shape and fill values corresponding to each factor level.

shape_ptype <- c("HU" = 22, "HEU-lo" = 21, "HEU-hi" = 21)
fill_ptype <- c("HU" = "white", "HEU-lo" = "darkgray", "HEU-hi" = "black")

SinglePlot <- Utility_Behemoth(data=AllMarkers, var="CD62L",
  myfactor="ptype", normality="dagostino", correction="none",
  shape_palette=shape_ptype, fill_palette=fill_ptype,
  XAxisLevels = c("HU", "HEU-lo", "HEU-hi"))

SinglePlot

In combination with purrr we can rapidly iterate over all the markers (skipping the initial metadata colums) and generate all the plots. These can then be passed to Luciernaga Utility_Patchwork() to rearrange into patchwork objects with our desired layout, and passed to a .pdf file.

# colnames(AllMarkers)
TheLength <- length(AllMarkers)

AllPlots <- map(names(AllMarkers)[c(9:TheLength)], ~ Utility_Behemoth(data=AllMarkers, var=.x,
  myfactor="ptype", normality="dagostino", correction="none",
  shape_palette=shape_ptype, fill_palette=fill_ptype, corral.width=0.7,
  XAxisLevels = c("HU", "HEU-lo", "HEU-hi")))

These in turn can be passed to Luciernaga `Utility_Patchwork() function to arrange desired layout and output to a .pdf file for late reference.

library(Luciernaga)

StorageLocation <- file.path("C:", "Users", "JohnDoe", "Desktop")

TheAssembledPlot <- Utility_Patchwork(x=AllPlots, filename="NKT_Markers",
 outfolder=NULL, thecolumns=2, therows=3, width=7, height=9,
 returntype="patchwork")

TheAssembledPlot[1]

Utility_Heatmap

The original visualization for the Coereba clusters, in a Bananaquit color-scheme. Next time someone insist there are just 5 clusters in their FlowSOM, point them here.

ThePlot <- Utility_Heatmap(binary=binaryData, panel=panelPath,
  export=FALSE, outpath=NULL, filename=NULL)

ThePlot

Diffcyt port

Wrote the function to send the outputs to diffcyt for the edgeR/limma/GLMM modeling of the results (vs the above t-test approach). It works, but does it mean it’s less p-hacky? TBD. Also, given diffcyt’s current maintenance status (looking at you Luke), may not be best approach vs a new implementation.

David Rach

1 June 2025

Background

Rationale

Getting Started

Installing Coereba

Loading Libraries

Locating .fcs files

Creating a GatingSet for Unmixed .fcs files

Coereba_GateCutoffs

Visualizing Gate Cutoffs

Coereba_App

Coereba_UpdateGates

Note on Cell Populations and Splitpoint

How Coereba works

Utility_Coereba

Filtering Coereba Clusters

Coereba_MarkerExpressions

Utility_MarkerPlots

Utility_Stats

Utility_Behemoth

Utility_Heatmap

Diffcyt port