Using Coereba
David Rach
University of Maryland, Baltimoredrach@som.umaryland.edu
1 June 2025
Source:vignettes/Workflow.Rmd
Workflow.Rmd
Background
Manual (supervised) analysis is widespread in cytometry (especially conventional flow cytometry (CFC)), wherein researchers will draw hierarchial boolean gates around cell populations using the graphical user interface provided by various commercial softwares. By sequentially switching between markers, you end up filtering down to smaller subsets of cells of potential interest.
With the increase in markers (initially by mass cytometry (MC), recently by spectral flow cytometry (SFC)) and accompanying combinatorial combinations needing to be gated, algorithmic (unsupervised) analysis methods have been developed to cluster cells on basis median fluorescent intensity (MFI) of individual markers.
While these handle more combinatorial data, how many identified clusters are present (and/or biologically meaningful) remains a point of concern. The additional dependence on MFI values being consistent across individuals and experiments increases these methods susceptibility to batch effects.
Rationale
SFC datasets in particular, collecting greater numbers of cells than MC and increased number of markers compared to CFC, require new tools to enable comprehensive profiling of phenotypes present within the acquired cells. These datasets loss resolution due to improper unmixing controls poses additional challenges.
Coereba is a collection of tools that attempt to implement a semi-supervised analytical approach. Using automated gating, splitpoints between positive and negative cells for every marker are estimated on an individual basis. Through extensive visualization and a ShinyApp, failures in gate setting by algorithmic gating can be adjusted by the researcher. Individual cells are then classified on the basis of these marker splitpoints, and the identity and metadata information is appended to the .fcs file.
Following unsupervised analysis, this manually defined gating information can extracted and used both for analysis and verify the algorithm performed as predicted across specimens. Due to the researcher intervention, it is more resilient to batch effects than a pure MFI approach.
The author and maintainer in no part claims that Coereba is the solution to all cytometry analysis woes. It is a tool in the open-source toolbelt that can enable you to do useful things and figure out interesting quirks that you wouldn’t be otherwise able to do. Go forth, gain better understanding underlying problems our current paradigms have, and may the next generation of researchers benefit from what we learn.
Getting Started
Installing Coereba
The Coereba R package is in preparation for submission to Bioconductor later this year. Until that time, it is available for install via GitHub.
if(!require("remotes")) {install.packages("remotes")}
remotes::install_github("https://github.com/DavidRach/Coereba")
# install.packages("BiocManager")
# BiocManager::install("Coereba")
Loading Libraries
Coereba relies on underlying infrastructure provided flowWorkspace and other Bioconductor cytometry packages. It leverages functional programming principles implemented in tidyverse packages available via CRAN. It is important to make sure that all the required packages are installed on your computer, and that library is called for each.
library(Coereba)
library(flowCore)
library(flowWorkspace)
library(openCyto)
library(ggcyto)
library(data.table)
library(dplyr)
library(purrr)
library(stringr)
library(ggplot2)
library(gt)
library(plotly)
library(htmltools)
Locating .fcs files
To get started, you will first need to provide the location on your computer where the .fcs files of interest are being stored. An example of how the author does this is provided below and can be modified for your own user and desired computer folder.
File_Location <- file.path("C:", "Users", "JohnDoe", "Desktop", "TodaysExperiment")
FCS_Pattern <- ".fcs$"
FCS_Files <- list.files(path = File_Location, pattern = FCS_Pattern,
full.names = TRUE, recursive = FALSE)
For this vignette, we will using several small .fcs files that can be found in Coereba’s extdata folder for our example.
File_Location <- system.file("extdata", package = "Coereba")
FCS_Pattern <- ".fcs$"
FCS_Files <- list.files(path = File_Location, pattern = FCS_Pattern,
full.names = TRUE, recursive = FALSE)
head(FCS_Files, 3)
The folder contains three unmixed full-stained samples from a 29-color SFC panel, characterizing Innate-like T cells (ILT) in cord blood mononuclear cells (CBMCs). As rare circulating cells, the original .fcs files contained many cellular events, which makes them useful to show off Coereba’s capabilities at characterizing cells that were not the original focus.
Creating a GatingSet for Unmixed .fcs files
Working with the unmixed full-stained samples mentioned above, we will proceed by bringing the .fcs files found within Coereba’s extdata folder into a GatingSet object where we can interact with them.
UnmixedFCSFiles <- FCS_Files[c(1:2)]
UnmixedCytoSet <- load_cytoset_from_fcs(UnmixedFCSFiles, truncate_max_range = FALSE,
transformation = FALSE)
UnmixedGatingSet <- GatingSet(UnmixedCytoSet)
UnmixedGatingSet
Now that we have created a GatingSet object, we will identify the
markers/fluorophores present within the .fcs files, and exclude markers
from the list don’t require transformation (as is the case with FSC,
SSC). We will then bi-exponentially transform the data using flowWorkspace
flowWorkspace::flowjo_biexp_trans()
before applying a
openCytogating
template that was customized for this particular panel, which can be
found in the extdata folder.
Markers <- colnames(UnmixedCytoSet)
KeptMarkers <- Markers[-grep("Time|FS|SC|SS|Original|-W$|-H$|AF", Markers)]
MyBiexponentialTransform <- flowjo_biexp_trans(channelRange = 256, maxValue = 1000000,
pos = 4.5, neg = 0, widthBasis = -1000)
TransformList <- transformerList(KeptMarkers, MyBiexponentialTransform)
UnmixedGatingSet <- transform(UnmixedGatingSet, TransformList)
UnmixedGates <- fread(file.path(path = File_Location, pattern = 'GatesUnmixed.csv'))
UnmixedGating <- gatingTemplate(UnmixedGates)
gt_gating(UnmixedGating, UnmixedGatingSet)
plot(UnmixedGatingSet)
We can additionally verify that the gating of the cell populations of interest was correct, by visualizing the gates using ggcyto or the Luciernaga package.
library(Luciernaga)
MyPlot <- Utility_GatingPlots(x=UnmixedGatingSet[1], sample.name=c("GROUPNAME", "TUBENAME"),
removestrings=".fcs", subset="live", gtFile=UnmixedGates,
DesiredGates=NULL, outpath = getwd(), export=FALSE,
therows=3, thecolumns=2)
MyPlot
Coereba_GateCutoffs
Having set up a GatingSet with transformations and gates applied, we
can proceed to test the Coereba
workflow. The first step is
to create a .csv file, containing the individual specimens, and the
individual markers of interest. Each cell within the .csv file will
correspond to the MFI for a given marker for a given individual where
positive expression of a marker transitions to becoming a negative
(referred here to as a splitpoint).
These could be written out individually, or based on automated
calculations. To simplify the process, in Coereba, we have implemented
two approaches to estimating these splitpoints. One function is the
in-house Coereba_GateCutoffs()
, which does an “okay-ish but
work-in progress”. The flip approach is leveraging the openCyto
pipeline (in development) and retrieving the values.
Both approaches can be computationally heavy, but still faster than eye-balling everything and typing in the values one-by-one. Save the eyestrain for subsequent validation steps.
TheGateCutoff <- Coereba_GateCutoffs(gs=UnmixedGatingSet[1],
subset="live", sample.name="GROUPNAME", desiredCols =c("BUV805-A"))
TheGateCutoffs <- map(.x=UnmixedGatingSet[1:3], .f=Coereba_GateCutoffs,
subset="live", sample.name="GROUPNAME", desiredCols =c("BUV805-A")) %>% bind_rows()
TheGateCutoffs
Visualizing Gate Cutoffs
Once we have generated a GateCutoffs.csv file, we can visualize how
well the splitpoints were by plotting them as red-lines using the
Luciernaga
packages Utility_NbyNPlots()
to
visualize the markers for a given individual and
Utility_UnityPlots()
to visualize the markers across
individuals. This a good way of understanding how well the competing
gating-algorithms did in the context of your individual panel, and how
much effort will be required for the next step in correcting the gates
individually.
Coereba_App
Previously, all adjustments to the above splitpoints had to
eye-balled in the pdf, and then adjusted in the .csv file, and repeated
ad nauseum. This obviously did not scale well. We have subsequently
created a ShinyApp that will bring in the GatingSet object, and plot an
interactive version of Utility_UnityPlot()
from the
Luciernaga
package, visualizing the splitpoint for a given
marker across all individuals as red vertical lines.
To get started, first make sure you know where the
Coereba_GateCutoffs()
.csv output is saved at, as you will
need to select the file within the App.
File_Location <- system.file("extdata", package = "Coereba")
TheCSV <- file.path(File_Location, "GateCutoffsForNKs.csv")
TheCSVData <- read.csv(TheCSV, check.names=FALSE)
head(TheCSVData)
Next, make sure to remember the name of your GatingSet (“UnmixedGatingSet” for this case), and the sample.name keyword to identify individual specimens, as you will need to input these values within the app.
Then, run the following command in your console:
The first tab view will let you upload the GateCutoff.csv file generated by the previous function, by letting you navigate to the storage folder and clicking on it.
Once this is done, navigate to the Upload a GatingSet tab. Enter the
information about your GatingSet (“UnmixedGatingSet” for this example),
select your x and y parameters, enter the SampleName keyword. Desired
bins allows adjustments to the visualization given the number of cells
present, clearance multiplier adjust the margins wiggle-room,
removestrings arguments. You specify margin and gate subsets like you
would for Utility_UnityPlots()
and click Display Estimated
Gate Cutoffs. Finally, you click on Generate Plots and go retrieve your
coffee/tea/beverage-of-choice.
Upon loading of the plots, scroll until you encounter a splitpoint line that was placed incorrectly, you can simply click on the correct location of the axis, which will save the coordinate as a splitpoint for that given marker and individual to a data.frame. This will later be ported to update the corresponding splitpoint value for a given marker and individual in the exisiting GateCutoff .csv. By just clicking, you can therefore quickly correct cases where the algorithm failed to gate properly due to batch effects, bad staining, or algorithmic incompetence (sorry, the coders are still learning as well).
Once you have corrected some/all the markers, navigate to the Click Data tab. You can then export the click-data information by specifying a filename and providing an outpath to the folder you want to save at. Unfortunately, due to security issues in ShinyApps, you will need to type out the designated outpath manually.
And repeat until the markers you are interested have been validated and you have a folder of clickpoint .csvs awaiting conversion into the existing GateCutoff folder.
Since the code is within a ShinyApp running in R, it has both the advantages and disadvantages of a R Shiny App. It scales reasonably well, but can take a while to load the next iterated marker. During testing, loading a new marker to inspect took anywhere from 3-10 additional minutes. This area is a active work in progress to speed up, but the author/maintainer hasn’t had the time to finish reading Mastering Shiny by Hadley Wickham yet or implement it in Rust, if you have, feel free to assist!
Coereba_UpdateGates
Once you have completed validating the GateCutoffs with
Coereba_App()
, it is time to convert your validated
click-data and update the GateCutoff.csv. To fascilitate this, provide
the location of the folder containing just the clickpoint .csv’s to
Coereba_UpdateGates()
and fill out the required arguments.
It will proceed to update the GateCutoffs.
For this example, we will use ClickDataExamples.csv stored within
Coereba
’s extdata folder:
File_Location <- system.file("extdata", package = "Coereba")
TheOldCSV <- file.path(File_Location, "GateCutoffsForNKs.csv")
TheClickInfo <- file.path(File_Location, "ClickDataExample.csv")
TheClickData <- read.csv(TheClickInfo, check.names=FALSE)
TheClickData
The old data:
TheOldData <- read.csv(TheOldCSV, check.names=FALSE)
TheOldData
We can then run Coereba_UpdateGates()
and watch the
values for BUV469-A are updated.
UpdatedCSV <- Coereba_UpdateGates(Clicks=TheClickInfo, Old=TheOldCSV,
export=FALSE, outpath=NULL, fileName="UpdatedCSV")
UpdatedCSV
We can further validate this by re-running Luciernaga
Utility_UnityPlots()
using the updated GateCutoff.csv and
seeing if the previous issues have been corrected.
Note on Cell Populations and Splitpoint
During development, we have noticed that for SFC data, the splitpoint location can vary substantially for individual cell populations (B cells, NK cells, T cells, etc), even in the abscence of biology. We believe this is due to uncertainty in the unmixing, as we have observed a correlation of negative splitpoint MFI in relation to the individual cells kappa value (matrix complexity essentially). Try your best to have the gate reflect what you are interested in, reduce the amount of error as much as you can, and in numbers veritas. Keep track of the exceptions (more later) and write up a Cyto paper.
How Coereba works
Having a GateCutoff.csv containing the splitpoint information for each marker validated for each individual is immensely useful, allowing you to do many things you wouldn’t be capable of otherwise. Namely, we can for an individual anotate each cell within their .fcs file.
What do I mean? Let’s first think about all the cells circulating in an individuals bloostream. We don’t have any prior information in this example, but we want to profile these cells into clusters of similar cells. On one end of the spectra, there is a single cluster that contains every single cell from the sample, regardless of their marker expression. At the opposite end of the spectrum, every single cell clusters in an it’s own cluster with just itself. In between these two extremes, depending on our question, lies a meaningful cluster number that will match the underlying biology to reduce the amount of variance that lies in clumping/splitting populations needlessly.
Coereba
takes the approacch by individually annotating
an individuals cells one by one, on the basis of where their MFI value
lands related to the validated splitpoint. So if the splitpoint for FITC
is at 50, and the individual cell MFI for FITC is 80, it returns as
FITC_pos. If the splitpoint for APC is at 70, and the individual cell
MFI is 30, it returns as APC_Neg.
When you iterate over the splitpoints for each marker, each cell derrives an identity. FITC_pos-APC_Neg-BV421_Pos…. etc. We can then group cells with matching identies and derrive information from these terminal nodes.
Individually, this may not mean much. After all, dichotomous splitting of a 30-marker panel is around half-a billion potential terminal nodes or clusters. Similarly, cytometers have instrumental, batch, experimental noise, etc.
But when we scale to all the events collected within a typical sample, we get far fewer terminal nodes, in the hundred to lower thousand range for a 30-color panel. The reason is cells (and the panels by which we investigate them) are not uniformely distributed by markers across high-dimensional space, cells fall within specific phenotypes (cell types) that undergo division of similar clones.
Consequently, we can iteratively leverage the number of markers included to target questions of interest, broad or narrow scope, in context of our individual panels, and rely on methods described below to gain insight into biological systems.
Utility_Coereba
Utility_Coereba()
is the function that implements the
above approach, labelling for each individual specimen the individual
cells on the basis of their splitpoint identity, and then returning the
enumeration of the resulting clusters. It iterates over the different
specimens in the GatingSet, returning information for every individual
that can be used in subsequent steps. MultiReach()
is the
older version of the proccess.
Both functions are currently waiting on the ILT paper to be sent to collaborators so that David can implement his 5-subject binder worth of notes to implement a S4 OOP class to make manipulating the final outputs easier than the current .qmd files full of tidyverse data tidying.
CoerebaIDs <- Utility_Coereba(x=UnmixedGatingSet[1], subsets="live",
sample.name="GROUPNAME", reference=TheCSV, starter="Spark Blue 550-A")
head(CoerebaIDs, 5)
Filtering Coereba Clusters
Some things are literally poisson noise. Others are cell clusters with varying abundance across heteregenous human patients. Others are individually unique terminal nodes that can be the result of individual biology…. or unmixing errors … or your lab tech messing up the staining panel. You can leverage the filter functions to identify all of these quickly.
Coereba_MarkerExpressions
Having generated a Marker by Cluster data.frame (binaryData) and a Cluster by Specimen data.frame (dataData), we can leverage the combination of the two data.frames to aggregate the clusters on the basis of marker presence, returning the proportion of of cells that are positive for a given marker.
For this demonstration, we will focus on NKTs, using stored .csv files for their binary and data outputs stored within the Coereba extdata folder.
File_Location <- system.file("extdata", package = "Coereba")
panelPath <- file.path(File_Location, "ILTPanelTetramer.csv")
panelData <- read.csv(panelPath, check.names=FALSE)
binaryPath <- file.path(File_Location, "HeatmapExample.csv")
binaryData <- read.csv(binaryPath, check.names=FALSE)
dataPath <- file.path(File_Location, "ReadyFileExample.csv")
dataData <- read.csv(dataPath, check.names=FALSE)
We will first just return the marker expressions for all markers:
library(Coereba)
AllMarkers <- Coereba_MarkerExpressions(data=dataData, binary=binaryData,
panel=panelData, starter="SparkBlue550")
head(AllMarkers, 5)
We can additionally by specifying returnType = “Combinatorial” derrive proportions within quadrant gates for two markers of interest. To do so, we specify the fluorophores corresponding to the markers as a list in the CombinatorialArgs.
MemoryQuadrants <- Coereba_MarkerExpressions(data=dataData, binary=binaryData,
panel=panelData, starter="SparkBlue550", returnType = "Combinatorial",
CombinatorialArgs = c("BV510", "BUV395"))
head(MemoryQuadrants, 5)
Utility_MarkerPlots
Having returned aggregated data above with
Coereba_MarkerExpressions
, we can plot this data using
Utility_MarkerPlots()
. These are combination of ggplot2
geom_boxplot() and ggbeeswarm
geom_beeswarm plots. We provide some arguments that allow for filtering
and reordering the markers shown in these plots, some basic
customizations to the plots, and the ability to save as a .png file to a
designated folder.
To do so for your own individual spots, you will need to specify a metadata column by which to factor by, and provide a list of shape and fill values corresponding to each factor level.
shape_ptype <- c("HU" = 22, "HEU-lo" = 21, "HEU-hi" = 21)
fill_ptype <- c("HU" = "white", "HEU-lo" = "darkgray", "HEU-hi" = "black")
We can start by viewing all markers:
ThePlot <- Utility_MarkerPlots(data=AllMarkers, panel=panelData,
myfactor="ptype", shape_palette = shape_ptype, fill_palette = fill_ptype)
ThePlot
Let’s abbreviate the number of columns by specifying the marker names to include (filterForThese), and rearrange them in desired X-axis order
ThePlot <- Utility_MarkerPlots(data=AllMarkers, panel=panelData,
myfactor="ptype", shape_palette = shape_ptype, fill_palette = fill_ptype,
filterForThese=c("CD7", "CD4", "CD8"), XAxisLevels = c("CD7", "CD4", "CD8"))
ThePlot
And to ggsave directly to a desired folder:
Utility_Stats
The Utility_Stats()
and the
Utility_Behemoth()
functions are either: “Great!!!”, or
guaranteed to make your friendly-local statistician cry (Opinions
vary).
Basically, Utility_Stats()
is a coding-attempt to
replicate the typical immunologist workflow, cutting out the need to
copy-paste data from a .csv file into GraphPad Prism (trademark), run a
normality test, determine number of groups, and follow up with the
corresponding t-test/anova/non-parametric equivalents.
Utility_Behemoth()
then takes these outputs, and adds the
p-value results to a ggbeeswarm/ggplot2 boxplot of the underlying data.
In combination with purrr and
Luciernaga
Utility_Patchwork()
, these can be
quickly formatted into a .pdf.
Does this workflow rapidly profile all the columns in your data.frame. Yes it does. Is it emblematic of potential issues arising from basing science entirely on null-hypothesis statistical testing? Also yes, seeing how many plots return “significant” due to a couple outliers. In the end, it mimics the current workflow of many researchers, just speeding it up. Google Statistical Rethinking on YouTube for food-for-thought.
To begin, we will use the Coereba_MarkerExpressions
data
we generated above to test out Utility_Stats()
.
Result <- Utility_Stats(data=AllMarkers, var="CD62L",
myfactor="ptype", normality="dagostino", correction="none")
Result
In combination with purrr we can rapidly iterate over all the markers, but will need to skip over the metadata columns that are present at the begining of the dataframe.
library(purrr)
library(dplyr)
# colnames(AllMarkers)
TheLength <- ncol(AllMarkers)
TheData <- map(names(AllMarkers)[c(9:TheLength)], ~ Utility_Stats(
data = AllMarkers, var = .x, myfactor = "ptype",
normality = "dagostino")) %>% bind_rows()
The results can be filtered using regular dplyr functions.
Utility_Behemoth
Utility_Behemoth()
continues from where
Utility_Stats()
left-off, appending the resulting p-value
information onto a ggplot2 plot. The plot is a combination of ggplot2
geom_boxplot() and ggbeeswarm
geom_beeswarm plots, and accepts arguments to customize these and
rearrange the group order. As with most Coereba
plots, we
need to provide the shape and fill for our respective factors
levels.
To do so for your own individual spots, you will need to specify a metadata column by which to factor by, and provide a list of shape and fill values corresponding to each factor level.
shape_ptype <- c("HU" = 22, "HEU-lo" = 21, "HEU-hi" = 21)
fill_ptype <- c("HU" = "white", "HEU-lo" = "darkgray", "HEU-hi" = "black")
SinglePlot <- Utility_Behemoth(data=AllMarkers, var="CD62L",
myfactor="ptype", normality="dagostino", correction="none",
shape_palette=shape_ptype, fill_palette=fill_ptype,
XAxisLevels = c("HU", "HEU-lo", "HEU-hi"))
SinglePlot
In combination with purrr we can
rapidly iterate over all the markers (skipping the initial metadata
colums) and generate all the plots. These can then be passed to
Luciernaga
Utility_Patchwork()
to rearrange
into patchwork objects with our desired layout, and passed to a .pdf
file.
# colnames(AllMarkers)
TheLength <- length(AllMarkers)
AllPlots <- map(names(AllMarkers)[c(9:TheLength)], ~ Utility_Behemoth(data=AllMarkers, var=.x,
myfactor="ptype", normality="dagostino", correction="none",
shape_palette=shape_ptype, fill_palette=fill_ptype, corral.width=0.7,
XAxisLevels = c("HU", "HEU-lo", "HEU-hi")))
These in turn can be passed to Luciernaga
`Utility_Patchwork() function to arrange desired layout and output to a
.pdf file for late reference.
library(Luciernaga)
StorageLocation <- file.path("C:", "Users", "JohnDoe", "Desktop")
TheAssembledPlot <- Utility_Patchwork(x=AllPlots, filename="NKT_Markers",
outfolder=NULL, thecolumns=2, therows=3, width=7, height=9,
returntype="patchwork")
TheAssembledPlot[1]
Utility_Heatmap
The original visualization for the Coereba clusters, in a Bananaquit color-scheme. Next time someone insist there are just 5 clusters in their FlowSOM, point them here.
ThePlot <- Utility_Heatmap(binary=binaryData, panel=panelPath,
export=FALSE, outpath=NULL, filename=NULL)
ThePlot
Diffcyt port
Wrote the function to send the outputs to diffcyt for the edgeR/limma/GLMM modeling of the results (vs the above t-test approach). It works, but does it mean it’s less p-hacky? TBD. Also, given diffcyt’s current maintenance status (looking at you Luke), may not be best approach vs a new implementation.