Matt Gedye
A Personal Laboratory Information Management System
My first job in research was as a lab tech, tasked with designing a Laboratory Information Management System (LIMS) for the lab. I didn’t really know how to do it, and the result was a clunky Excel dashboard that I cringe at a little in hindsight, but was nonetheless quite proud of. What I’ve designed for myself now is inspired by that experience and is a classic “if I knew then what I know now” kind of moment. I left that job to start my PhD, which coincided with an emerging interest in computer programming. As it happens, the person who inspired that interest recently shared his schema for database management. I’ve borrowed the bones and used them as the scaffolding for a personal LIMS. Full disclosure: I needed this up and running fast so I could use it to its full potential now, in the final year of my PhD, so I utilised the assistance of Claude Code.
The underlying philosophy of why I designed this LIMS is tied to three aims:
- I wanted full independence from any proprietary software.
- I needed it to scale smoothly over the course of a whole research career.
- I wanted something adaptable, so any output can be easily read by my colleagues and supervisors without them needing to learn anything new.
Therefore, everything is written in plain text (SQL, R, Markdown, CSV). I’m not dependent on any proprietary software like LabArchives, OneDrive, Office365, etc.; all I need is a terminal and my favourite open-source software. At the same time, virtually any proprietary program can read plain text, so I can easily share any results file with my colleagues and supervisors and they can open it in whatever they prefer. Because of this versatility, when the time comes for graduation and moving institutions, the data generated is archived safely in the appropriate place, while the schema (~25KB in size) comes with me and I can set up again on a new computer within minutes. I won’t be able to go into everything here, but below are some of the main functions (note: all data are fabricated).
Database and Sample Management
The core is a single SQLite database (research.db) that acts as the universal source of truth across all projects. At the highest layer of the database, every subject, be it a mouse, a participant, a cell culture donor, or anything else I haven’t worked with yet, enters as an anchor. One row. Everything downstream then links to that anchor: samples come from anchors, experiments run on samples, assays link to experiments, which then generate results for analysis and/or new samples for storage (at which point the cycle can begin again).
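Roughly, the core of that layout looks like the sketch below. This is a simplified, illustrative sketch rather than the actual schema, but the table and column names match the queries used later in this post:
import sqlite3

con = sqlite3.connect("research.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS anchors (
    anchor_id  INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL              -- which project the subject belongs to
);
CREATE TABLE IF NOT EXISTS participants (
    participant_id TEXT PRIMARY KEY,          -- e.g. 'P001'
    anchor_id      INTEGER NOT NULL REFERENCES anchors(anchor_id),
    intervention   TEXT                       -- e.g. 'exercise' or 'sedentary'
);
CREATE TABLE IF NOT EXISTS samples (
    sample_id   TEXT PRIMARY KEY,             -- e.g. 'P001_2025-09-16_biopsy'
    anchor_id   INTEGER NOT NULL REFERENCES anchors(anchor_id),
    tissue_type TEXT,
    sample_type TEXT,
    date        TEXT
);
CREATE TABLE IF NOT EXISTS experiments (
    experiment_id  INTEGER PRIMARY KEY,
    project_id     INTEGER NOT NULL,
    date           TEXT,
    procedure_type TEXT
);
CREATE TABLE IF NOT EXISTS cs_assays (
    sample_id             TEXT NOT NULL REFERENCES samples(sample_id),
    experiment_id         INTEGER REFERENCES experiments(experiment_id),  -- assumed link back to the experiment
    cs_umol_min_g_protein REAL,
    excluded              INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS sample_storage (
    sample_id TEXT NOT NULL REFERENCES samples(sample_id),
    freezer TEXT, drawer TEXT, box TEXT, position TEXT
);
""")
con.commit()
con.close()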
For example, suppose I want to see, at a glance, citrate synthase (CS) activity across a cohort of exercise and sedentary participants, as an indication of mitochondrial volume. These participants might belong to a project called RUN, which is recorded as project ‘8’ in the database. By entering the following query:
SELECT p.participant_id, p.intervention, ROUND(cs.cs_umol_min_g_protein, 1) AS cs_umol_min_g_protein
FROM anchors a
JOIN participants p ON p.anchor_id = a.anchor_id
JOIN samples s ON s.anchor_id = a.anchor_id
JOIN cs_assays cs ON cs.sample_id = s.sample_id
WHERE a.project_id = 8
AND cs.excluded = 0
ORDER BY p.intervention, p.participant_id;
I get the following output:
participant_id intervention cs_umol_min_g_protein
P001 exercise 168.4
P002 exercise 182.7
P003 exercise 155.9
P004 exercise 174.2
P005 sedentary 98.3
P006 sedentary 87.6
P007 sedentary 112.5
P008 sedentary 94.8
Another disclosure: I’m still quite new to SQL, so I often open the database in DB Browser to view it with a GUI (kind of like Excel).
Then, suppose I want to find the samples (homogenised muscle) that were used to generate these results, because I know there’s some tissue left over and I want to either run the experiment again or do something else. Again, I’ll enter a query:
SELECT p.participant_id, p.intervention, ss.sample_id, ss.freezer, ss.drawer, ss.box, ss.position
FROM sample_storage ss
JOIN samples s ON s.sample_id = ss.sample_id
JOIN anchors a ON a.anchor_id = s.anchor_id
JOIN participants p ON p.anchor_id = a.anchor_id
WHERE a.project_id = 8
AND s.sample_type = 'muscle_homogenate'
ORDER BY p.participant_id;
And get my results:
participant_id intervention sample_id freezer drawer box position
P001 exercise P001_2026-01-17_homog -80_B 3 musc-2 A1
P002 exercise P002_2026-01-17_homog -80_B 3 musc-2 A2
P003 exercise P003_2026-01-17_homog -80_B 3 musc-2 A3
P004 exercise P004_2026-01-17_homog -80_B 3 musc-2 A4
P005 sedentary P005_2026-01-17_homog -80_B 3 musc-2 B1
P006 sedentary P006_2026-01-17_homog -80_B 3 musc-2 B2
P007 sedentary P007_2026-01-17_homog -80_B 3 musc-2 B3
P008 sedentary P008_2026-01-17_homog -80_B 3 musc-2 B4
Additionally, because all samples link back to the same anchor, if needed, I can pull the full sample history for any subject in one query:
SELECT s.sample_id, s.tissue_type, s.sample_type, s.date
FROM samples s
JOIN anchors a ON a.anchor_id = s.anchor_id
JOIN participants p ON p.anchor_id = a.anchor_id
WHERE p.participant_id = 'P001'
ORDER BY s.date;
Below, I can then see that all the samples I have for this participant originate from the same biopsy of skeletal muscle tissue and that several experiments have been performed on it.
sample_id tissue_type sample_type date
P001_2025-09-16_biopsy skeletal muscle whole_muscle 2025-09-16
P001_2025-09-16_mito skeletal muscle mito_isolate 2025-09-16
P001_2026-01-17_homog skeletal muscle muscle_homogenate 2026-01-17
P001_2026-01-21_mito-rna skeletal muscle mitochondrial_rna 2026-01-21
What’s not shown here is that I can also use the schema to query for ‘parent’ and ‘child’ samples.
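For illustration, assuming the samples table carries a parent_sample_id column pointing at the sample each one was derived from (the real schema may wire this up differently), a recursive query can walk the full lineage from any starting sample:
import sqlite3

con = sqlite3.connect("research.db")
lineage = con.execute("""
    WITH RECURSIVE lineage(sample_id) AS (
        SELECT sample_id FROM samples WHERE sample_id = ?
        UNION ALL
        SELECT s.sample_id
        FROM samples s
        JOIN lineage l ON s.parent_sample_id = l.sample_id   -- assumed parent link column
    )
    SELECT sample_id FROM lineage
""", ("P001_2025-09-16_biopsy",)).fetchall()
for (sample_id,) in lineage:
    print(sample_id)  # the starting biopsy, then every sample derived from it
con.close()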
Generating Experiments and Importing/Exporting Results
The initial setup might look complicated, but day-to-day use is fairly straightforward. From new subject (anchor) to tissue/sample collection to experiment/assay to sample storage to analysis, there’s only ever one schema and one query language. Of course, this depends on my lab notebook directories being structured in exactly the same way, so I use a Python script to keep them organised consistently. All new entries (experiments) generated by the script follow the same logic: plain text, version-controlled, renderable to PDF or HTML (for easy upload to LabArchives or pasting into a physical lab notebook). I’m currently enrolled at two universities simultaneously, one in Australia and one in Denmark, and each has its own procedures for archiving lab notes. My compromise: all protocols (whether from the institution, a kit supplier, or a collaborator) are converted to .Rmd and housed in a protocols/templates directory. Generating a new experiment is then one command handled by the Python script:
python3 scripts/generate_experiment.py \
--project 8 \
--procedure mito_isolation \
--date 2026-04-17 \
--name mito-isolations \
--samples 8 \
--ids P001,P002,P003,P004,P005,P006,P007,P008
This creates a dated directory in the specific project’s lab notebook, pre-populated with the necessary files:
projects/RUN/lab-notebook/experiments/
└── 2026-04-17_mito-isolations/
├── 2026-04-17_mito-isolations.Rmd ← protocol; pre-filled with experiment ID, date, reagent volumes
├── input/
│ └── consumables.csv ← lot numbers & expiry dates to verify before starting
└── results/
├── results.csv ← blank table; sample IDs pre-populated (P001_2026-04-17_mito, …)
└── sample-storage.csv ← freezer/drawer/box/position; fill in after storing
This simultaneously inserts a new row into the
experiments table in the database:
experiment_id project_id date procedure_type
14 8 2026-04-17 mito_isolation
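For context, a stripped-down sketch of what such a generator script does is below. It is illustrative rather than the real thing: the hard-coded RUN directory, the results.csv headers, and the ‘_mito’ suffix all stand in for logic the actual script would derive from its arguments, and creation of consumables.csv and sample-storage.csv is omitted for brevity:
import argparse, csv, shutil, sqlite3
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--project", type=int)
parser.add_argument("--procedure")
parser.add_argument("--date")
parser.add_argument("--name")
parser.add_argument("--samples", type=int)
parser.add_argument("--ids")
args = parser.parse_args()

# 1. dated experiment directory inside the project's lab notebook
exp_dir = Path(f"projects/RUN/lab-notebook/experiments/{args.date}_{args.name}")
(exp_dir / "input").mkdir(parents=True, exist_ok=True)
(exp_dir / "results").mkdir(exist_ok=True)

# 2. copy the protocol template in (template file name assumed to match the procedure)
shutil.copy(f"protocols/templates/{args.procedure}.Rmd",
            exp_dir / f"{args.date}_{args.name}.Rmd")

# 3. pre-populate results.csv with one row per derived sample ID
#    (headers depend on the procedure; these are placeholders)
with open(exp_dir / "results" / "results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sample_id", "result"])
    for pid in args.ids.split(","):
        writer.writerow([f"{pid}_{args.date}_mito", ""])

# 4. register the experiment in the database
con = sqlite3.connect("research.db")
con.execute("INSERT INTO experiments (project_id, date, procedure_type) VALUES (?, ?, ?)",
            (args.project, args.date, args.procedure))
con.commit()
con.close()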
I use .Rmd files because I can import the data in the consumables.csv file and calculate the exact volume or quantity of a specific consumable or reagent I need, based on the number of samples I’m processing in the experiment. So when I render the file for printing to bring to the bench, I get something like this (the logic of that chunk is sketched just after the table):
Table: Reagent Volumes for 8 samples (µL)
reagent per_sample_ul total_ul
Solution 1 18 144
Solution 3 2100 16800
Protease Inhibitor Cocktail 10 80
Anti-TOM22 MicroBeads 50 400
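The chunk behind that table is just multiplication by the sample count. The actual template does it in an R chunk; here is the same logic sketched in Python, with the per-sample volumes copied from the table above purely for illustration:
# per-sample volumes copied from the rendered table above; the real template
# derives these from the experiment's inputs rather than a hard-coded dict
n_samples = 8
per_sample_ul = {
    "Solution 1": 18,
    "Solution 3": 2100,
    "Protease Inhibitor Cocktail": 10,
    "Anti-TOM22 MicroBeads": 50,
}
for reagent, ul in per_sample_ul.items():
    print(f"{reagent}: {ul} µL per sample, {ul * n_samples} µL total")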
Then I simply render the .Rmd file to PDF, print it, bring it to the bench and scribble all over it (pen and paper will always be superior to digital at the bench). When done, I manually fill in results/results.csv and run the experiment-specific insert script (sketched below), and the results land in the appropriate rows in the database, linked back to the experiment and samples automatically. Some results are generated by a piece of equipment specific to the experiment that can export to xlsx or csv; a simple script to ‘rearrange’ that output to match the headers in results.csv can automate this step if required. Any bench notes then get typed back into the .Rmd, which is re-rendered and ready for digital archiving (with whatever my institution prefers to use), notes included. Of course, pasting the notes into a physical lab notebook and scanning it for physical archiving works fine too. If necessary, the template .Rmd file can be updated and version controlled to reflect any permanent changes to the protocol. I would typically do this for any pilot experiment in a private repo until the results are published, then move the ‘final’ protocol to the templates directory. Finally, if the experiment produced ‘child’ samples, their storage location can be updated in the appropriate csv file generated by the experiment generator script. Then, six months later, finding where a specific sample lives is a query rather than a freezer archaeology expedition (see above).
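As for the insert step itself (sketched here for a CS assay), it is little more than reading the filled-in csv and writing rows to the right table. The results.csv header and the experiment_id column are assumptions made for the example, and the ‘rearrange’ script for instrument exports is simply a column rename/reorder that runs before this:
# sketch of an experiment-specific insert script, run from the experiment directory
import csv, sqlite3

EXPERIMENT_ID = 14  # the row created by generate_experiment.py

con = sqlite3.connect("research.db")
with open("results/results.csv") as f:
    for row in csv.DictReader(f):
        con.execute(
            "INSERT INTO cs_assays (sample_id, experiment_id, cs_umol_min_g_protein, excluded) "
            "VALUES (?, ?, ?, 0)",
            (row["sample_id"], EXPERIMENT_ID, float(row["cs_umol_min_g_protein"])),
        )
con.commit()
con.close()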
Data Analysis
I’m a sucker for good organisation and have enjoyed using this to clean up my old project directories and unify everything under a single database. But it’s hard to justify spending so much time on it unless it’s for a more noble reason than just being neat and tidy. By the time I get to analysis, I can see where this LIMS really earns its place. Previously, I had a separate database for each project. They were also ‘Unixy’ and fairly well organised, but things became clumsy and awkward whenever I needed to compare results across projects, or when an R script that analysed results in one project would have worked perfectly for the same experiment in another project, except that the directory paths or naming conventions inside the file were slightly different. I’d end up with a chunk of code and output that looked something like this:
library(readr)   # read_csv()
library(dplyr)   # bind_rows()
proj_a <- read_csv("projects/RUN/results/cs/2024-03-15_cs.csv")
proj_b <- read_csv("projects/DIABETES/analysis/citrate-synthase/cs_final.csv")
combined <- bind_rows(proj_a, proj_b)
#> participant_id CS_activity cs_umol_min_g_prot
#> <chr> <dbl> <dbl>
#> 1 P001 168.4 NA
#> 2 P002 182.7 NA
#> 3 P101 NA 94.2
#> 4 P102 NA 87.6
I could then spend close to an hour (on a good day) just re-writing the new script to update file paths and make sure the new csv files had matching headers. In the end, I’d just copy and paste a csv file with the headers I needed from Project A to Project B and overwrite the rows with the new results. Not only was this prone to human error, but over time, with the addition of more and more experiments, it led to a sea of unorganised floating files. My solution is a more unified approach. I still give each project its own directory for analysis in R (it would get very messy otherwise), but regardless of project, every analysis draws from the same source (research.db). In this way, there’s no copying of data or templates between projects and no stale csv files. Just a universal path to the database in a single config.R file, which every script sources at the top before stating which project ID(s) I’m working with:
db <- dbConnect(SQLite(), here("research.db"))
cs <- dbGetQuery(db, "SELECT ... WHERE e.project_id IN (8, 9)")
dbDisconnect(db)
Then, to keep all results generated from the database uniform and organised, I use an intuitive and straightforward naming convention for what I call my analysis-notebooks (as distinct from my lab-notebooks) in each project’s directory.
Very simply, all output files are prefixed with the script (S) or notebook (N) that generated them: S05_dds_wald.rds and S05_exercise_vs_sedentary.csv came from script 05, and N01_pca.png came from notebook 01. The numbering is sequential, as it’s often the case that the output generated in S01 then serves as input to begin S02:
projects/RUN/analysis/
└── citrate-synthase/
├── analysis-notebooks/
│ ├── N01_cs_qc.Rmd
│ └── N02_cs_group_comparison.Rmd
└── results/
├── figures/
│ └── N02_cs_boxplot.png
└── tables/
└── N02_cs_group_means.csv
It might look like organisational overkill and slightly pedantic,
but it’s already proved its usefulness many times when I’ve had to
scramble to find a figure for a slide moments before a supervisor
meeting.
It’s also useful when creating symlinks inside the directories
that house my thesis chapters or other drafts for papers that need
these figures.
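For example, something like this (the thesis path is hypothetical; at the shell it is just an ln -s):
from pathlib import Path

# link a results figure into a draft's figure directory without copying the file
fig = Path("projects/RUN/analysis/citrate-synthase/results/figures/N02_cs_boxplot.png")
Path("thesis/chapter-3/figures/N02_cs_boxplot.png").symlink_to(fig.resolve())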
Deleting Data
This schema is intended to be highly versatile, capable of automating many of my daily tasks, so I can’t describe them all here and there’s a lot more I could include. I intend to actively update the GitHub repo, so that’s the best place to see regular updates. But one last thing worth mentioning: if, when moving institutions, the data used to generate results must remain behind, then wiping the database once everything is archived in the appropriate place is clean and straightforward:
# remove all results — keeps schema, experiments, samples structure
sqlite3 research.db "
DELETE FROM cs_assays;
DELETE FROM bca_assays;
DELETE FROM rna_extractions;
DELETE FROM rna_qc;
DELETE FROM pipeline_runs;
DELETE FROM sequencing_submissions;
DELETE FROM sequencing_runs;
"
What remains is the schema, the scripts, and the protocols that comprise the workflow pipeline. Everything needed to start again immediately at a new institution.
P.S. What I’m doing now.