Matt Gedye

A Personal Laboratory Information Management System

My first job in research was as a lab tech, tasked with designing a Laboratory Information Management System (LIMS) for the lab. I didn’t really know how to do it, and the result was a clunky Excel dashboard that I cringe a little at in hindsight but was nonetheless quite proud of at the time. What I’ve designed for myself now is inspired by that experience and is a classic “if I knew then what I know now” kind of moment. I left that job to start my PhD, which coincided with an emerging interest in computer programming. Coincidentally, the person who inspired that interest recently shared his schema for database management. I’ve borrowed the bones and used them as the scaffolding for a personal LIMS of my own. Full disclosure: I needed this up and running fast so I could use it to its full potential now, in the final year of my PhD, so I enlisted the help of Claude Code.

The underlying philosophy behind this LIMS is tied to three aims:

  1. I wanted full independence from any proprietary software.
  2. I needed it to scale smoothly over the course of a whole research career.
  3. I wanted something adaptable, so any output can be easily read by my colleagues and supervisors without them needing to learn anything new.

Therefore, everything is written in plain text (SQL, R, Markdown, CSV). I’m not dependent on any proprietary software like LabArchives, OneDrive, Office365, etc. All I need is a terminal and my favourite open-source software. At the same time, those proprietary programs can all read plain text, so I can easily share any results file with my colleagues and supervisors and they can open it in their preferred program. Because of this versatility, when the time comes to graduate and move institutions, the data generated stays safely archived in the appropriate place, while the schema (~25 KB in size) comes with me and I can set up again on a new computer within minutes. I won’t be able to go into everything here, but below are some of the main functions (note: all data are fabricated).

Database and Sample Management

The core is a single SQLite database (research.db) that acts as the universal source of truth across all projects. At the highest layer of the database, every subject, be it a mouse, a participant, a cell culture donor, or anything else I haven’t worked with yet, enters as an anchor. One row. Then everything downstream links to the anchor: samples come from anchors, experiments run on samples, assays link to experiments, and assays generate results for analysis and/or new samples for storage (at which point the cycle can begin again).
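
The full schema has more tables and columns than I can show here, but the core chain can be sketched with a small Python set-up script like the one below. The table and column names are simplified stand-ins for illustration, not the real schema.

import sqlite3

# Minimal sketch of the anchor -> sample -> experiment chain (illustrative names).
# The real schema also holds participants, sample_storage, assay tables, etc.
con = sqlite3.connect("research.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS anchors (
    anchor_id    INTEGER PRIMARY KEY,
    project_id   INTEGER NOT NULL,
    subject_type TEXT                 -- 'participant', 'mouse', 'donor', ...
);

CREATE TABLE IF NOT EXISTS samples (
    sample_id   TEXT PRIMARY KEY,     -- e.g. 'P001_2025-09-16_biopsy'
    anchor_id   INTEGER NOT NULL REFERENCES anchors(anchor_id),
    sample_type TEXT,
    date        TEXT
);

CREATE TABLE IF NOT EXISTS experiments (
    experiment_id  INTEGER PRIMARY KEY,
    project_id     INTEGER NOT NULL,
    date           TEXT,
    procedure_type TEXT
);
""")
con.commit()
con.close()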

For example, suppose I want to see, at a glance, citrate synthase activity across a cohort of exercise and sedentary participants to get an indication of mitochondrial volume. These participants might belong to a project called RUN, which is recorded as project ‘8’ in the database. By entering the following SQL query:

SELECT p.participant_id, p.intervention, ROUND(cs.cs_umol_min_g_protein, 1) AS cs_umol_min_g_protein
FROM anchors a
JOIN participants p ON p.anchor_id  = a.anchor_id
JOIN samples s      ON s.anchor_id  = a.anchor_id
JOIN cs_assays cs   ON cs.sample_id = s.sample_id
WHERE a.project_id = 8
  AND cs.excluded = 0
ORDER BY p.intervention, p.participant_id

I get the following output:

participant_id  intervention  cs_umol_min_g_protein
P001            exercise      168.4
P002            exercise      182.7
P003            exercise      155.9
P004            exercise      174.2
P005            sedentary     98.3
P006            sedentary     87.6
P007            sedentary     112.5
P008            sedentary     94.8

Another disclosure: I’m still quite new to SQL, so I often use DB Browser for SQLite to view the database through a GUI (a bit like Excel).

Then, suppose I want to find the samples (homogenised muscle) that were used to generate these results, because I know there’s some tissue left over and I want to either run the experiment again or do something else with it. Again, I’ll enter a query:

SELECT p.participant_id, p.intervention, ss.sample_id, ss.freezer, ss.drawer, ss.box, ss.position
FROM sample_storage ss
JOIN samples s      ON s.sample_id  = ss.sample_id
JOIN anchors a      ON a.anchor_id  = s.anchor_id
JOIN participants p ON p.anchor_id  = a.anchor_id
WHERE a.project_id = 8
  AND s.sample_type = 'muscle_homogenate'
ORDER BY p.participant_id

And get my results:

participant_id  intervention  sample_id               freezer  drawer  box     position
P001            exercise      P001_2026-01-17_homog   -80_B    3       musc-2  A1
P002            exercise      P002_2026-01-17_homog   -80_B    3       musc-2  A2
P003            exercise      P003_2026-01-17_homog   -80_B    3       musc-2  A3
P004            exercise      P004_2026-01-17_homog   -80_B    3       musc-2  A4
P005            sedentary     P005_2026-01-17_homog   -80_B    3       musc-2  B1
P006            sedentary     P006_2026-01-17_homog   -80_B    3       musc-2  B2
P007            sedentary     P007_2026-01-17_homog   -80_B    3       musc-2  B3
P008            sedentary     P008_2026-01-17_homog   -80_B    3       musc-2  B4

Additionally, because all samples link back to the same anchor, if needed, I can pull the full sample history for any subject in one query:

SELECT s.sample_id, s.tissue_type, s.sample_type, s.date
FROM samples s
JOIN anchors a      ON a.anchor_id = s.anchor_id
JOIN participants p ON p.anchor_id = a.anchor_id
WHERE p.participant_id = 'P001'
ORDER BY s.date

Below, I can then see that all the samples I have for this participant originate from the same biopsy of skeletal muscle tissue and that several experiments have been performed on it.

sample_id                   tissue_type      sample_type        date
P001_2025-09-16_biopsy      skeletal muscle  whole_muscle       2025-09-16
P001_2025-09-16_mito        skeletal muscle  mito_isolate       2025-09-16
P001_2026-01-17_homog       skeletal muscle  muscle_homogenate  2026-01-17
P001_2026-01-21_mito-rna    skeletal muscle  mitochondrial_rna  2026-01-21

What’s not shown here is that I can also use the schema to query for ‘parent’ and ‘child’ samples.
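
The column that encodes this lineage isn’t shown above, but assuming a hypothetical parent_sample_id column on the samples table (my guess at how the relationship would be stored), pulling every sample derived from the original biopsy might look something like this:

import sqlite3

con = sqlite3.connect("research.db")
# Hypothetical: assumes each row in samples records the sample it was derived
# from in a parent_sample_id column.
children = con.execute(
    """
    SELECT child.sample_id, child.sample_type, child.date
    FROM samples AS child
    JOIN samples AS parent ON parent.sample_id = child.parent_sample_id
    WHERE parent.sample_id = 'P001_2025-09-16_biopsy'
    ORDER BY child.date
    """
).fetchall()
con.close()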

Generating Experiments and Importing/Exporting Results

The initial setup might look complicated, but day-to-day use is fairly straightforward. From new subject (anchor), to tissue/sample collection, to experiment/assay, to sample storage, to analysis, there’s only ever one schema and one query language.

Of course, this depends on my lab notebook directories all being structured the same way, so I use a Python script to keep them organised consistently. All new entries (experiments) generated by the script follow the same logic: plain text, version-controlled, renderable to PDF or HTML (for easy upload to LabArchives or pasting into a physical lab notebook). I’m currently enrolled at two universities simultaneously, one in Australia and one in Denmark, and each has its own procedures for archiving lab notes. My compromise: all protocols (whether from the institution, a kit supplier, or a collaborator) are converted to .Rmd and housed in a protocols/templates directory. Generating a new experiment is then one command handled by the Python script:

python3 scripts/generate_experiment.py \
    --project 8 \
    --procedure mito_isolation \
    --date 2026-04-17 \
    --name mito-isolations \
    --samples 8 \
    --ids P001,P002,P003,P004,P005,P006,P007,P008

This creates a dated directory in the specific project’s lab notebook, pre-populated with the necessary files:

projects/RUN/lab-notebook/experiments/
└── 2026-04-17_mito-isolations/
    ├── 2026-04-17_mito-isolations.Rmd   ← protocol; pre-filled with experiment ID, date, reagent volumes
    ├── input/
    │   └── consumables.csv              ← lot numbers & expiry dates to verify before starting
    └── results/
        ├── results.csv                  ← blank table; sample IDs pre-populated (P001_2026-04-17_mito, …)
        └── sample-storage.csv           ← freezer/drawer/box/position; fill in after storing

This simultaneously inserts a new row into the experiments table in the database:

experiment_id  project_id  date        procedure_type
14             8           2026-04-17  mito_isolation
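
I won’t reproduce generate_experiment.py here, but the database side of that step boils down to a parameterised insert, roughly along these lines (a sketch of the idea, not the actual script):

import sqlite3

con = sqlite3.connect("research.db")
cur = con.execute(
    "INSERT INTO experiments (project_id, date, procedure_type) VALUES (?, ?, ?)",
    (8, "2026-04-17", "mito_isolation"),
)
experiment_id = cur.lastrowid   # 14 in the example above
con.commit()
con.close()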

I use .Rmd files because I can then import the data in consumables.csv and calculate the exact volume or quantity of each reagent or consumable I need, based on the number of samples I’m processing in the experiment. So when I render the file for printing to bring to the bench, I get something like this:

Table: Reagent Volumes for 8 samples (µL)

reagent                      per_sample_ul  total_ul
Solution 1                   18             144
Solution 3                   2100           16800
Protease Inhibitor Cocktail  10             80
Anti-TOM22 MicroBeads        50             400

Then I simply render the .Rmd file to PDF, print it, bring it to the bench and scribble all over it (pen and paper will always be superior to digital at the bench). When done, I manually fill in results/results.csv and run the experiment-specific insert script, and the results land in the appropriate rows in the database, linked back to the experiment and samples automatically (a sketch of what that step might look like is below). Some results are generated by a piece of equipment specific to the experiment that can export to xlsx or csv; a simple script to ‘rearrange’ that output to match the headers in results.csv can automate this step if required.

Any bench notes then get typed back into the .Rmd, which is re-rendered and ready for digital archiving (with whatever my institution prefers to use) with my notes included. Simply pasting the notes into a physical lab notebook and scanning it for physical archiving works fine too. If necessary, the template .Rmd file can be updated and version-controlled to reflect any permanent changes to the protocol; for a pilot experiment I typically keep this in a private repo until the results are published, and only then move the ‘final’ protocol to the templates directory. Finally, if the experiment produced ‘child’ samples, their storage locations get recorded in the appropriate csv file generated by the experiment generator script. Then, six months later, finding where a specific sample lives is a query rather than a freezer archaeology expedition (see above).
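
The insert script is specific to each assay, so I won’t show the real one, but the shape of it is simple: read results.csv, then write each row into the relevant assay table with its sample and experiment IDs attached. The sketch below assumes an experiment_id column on cs_assays and a cs_umol_min_g_protein column in results.csv (my guesses; the real files may differ), using the experiment from the example above.

import csv
import sqlite3

con = sqlite3.connect("research.db")
with open("results/results.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Assumed columns: sample_id and cs_umol_min_g_protein in results.csv,
        # plus an experiment_id foreign key on cs_assays.
        con.execute(
            "INSERT INTO cs_assays (sample_id, experiment_id, cs_umol_min_g_protein, excluded) "
            "VALUES (?, ?, ?, 0)",
            (row["sample_id"], 14, float(row["cs_umol_min_g_protein"])),
        )
con.commit()
con.close()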

Data Analysis

I’m a sucker for good organisation and have enjoyed using this to clean up my old project directories and unify everything under a single database. But it’s hard to justify spending so much time on it unless it’s for a more noble reason than just being neat and tidy. By the time I get to analysis, I can see where this LIMS really earns its place. Previously I had a separate database for each project, and these were also ‘Unixy’ and fairly well organised. But it became clumsy and awkward when I needed to compare results across projects, or when an R script written to analyse results in one project would work perfectly for the same experiment in a different project except that the directory paths or naming conventions inside the file were slightly different. I’d end up with a chunk of code and output that looked something like this:

library(tidyverse)  # read_csv() and bind_rows()

proj_a <- read_csv("projects/RUN/results/cs/2024-03-15_cs.csv")
proj_b <- read_csv("projects/DIABETES/analysis/citrate-synthase/cs_final.csv")

combined <- bind_rows(proj_a, proj_b)
#>   participant_id  CS_activity  cs_umol_min_g_prot
#>   <chr>                 <dbl>               <dbl>
#> 1 P001                  168.4                  NA
#> 2 P002                  182.7                  NA
#> 3 P101                     NA                94.2
#> 4 P102                     NA                87.6

I could then spend close to an hour (on a good day) just rewriting the new script to update file paths and make sure the new csv files had matching headers. In the end, I’d just copy and paste a csv file with the headers I needed from Project A to Project B and then overwrite the rows with the new results. Not only was this prone to human error, but over time, with the addition of more and more experiments, it led to a sea of unorganised floating files. My solution is a more unified approach. I still give each project its own directory for analysis in R (it would get very messy otherwise), but regardless of project, every analysis draws from the same source (research.db). In this way, there’s no copying of data or templates between projects and no stale csv files: just a universal path to the database kept in a single config.R file that every script sources at the top, with each script simply stating which project ID(s) it’s working with:

library(DBI)
library(RSQLite)
library(here)
db <- dbConnect(SQLite(), here("research.db"))
cs <- dbGetQuery(db, "SELECT ... WHERE e.project_id IN (8, 9)")
dbDisconnect(db)

Then, to keep all results generated from the database uniform and organised, I use an intuitive and straightforward naming convention for what I call my analysis-notebooks (as distinct from my lab-notebooks) in each project’s directory.

Very simply, all output files are prefixed with the script (S) or notebook (N) that generated them: S05_dds_wald.rds came from script 05, N01_pca.png from notebook 01, S05_exercise_vs_sedentary.csv from script 05. These are numbered sequentially because the output generated in S01 often serves as the input to S02:

projects/RUN/analysis/
└── citrate-synthase/
    ├── analysis-notebooks/
    │   ├── N01_cs_qc.Rmd
    │   └── N02_cs_group_comparison.Rmd
    └── results/
        ├── figures/
        │   └── N02_cs_boxplot.png
        └── tables/
            └── N02_cs_group_means.csv

It might look like organisational overkill, and slightly pedantic, but it has already proven useful many times when I’ve had to scramble to find a figure for a slide moments before a supervisor meeting. It’s also useful when creating symlinks inside the directories that house my thesis chapters or other paper drafts that need these figures.
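
For example, linking a finished figure into a thesis chapter directory (the thesis path here is just a made-up example) can be as simple as:

from pathlib import Path

# Figure path matches the tree above; the thesis directory is hypothetical.
figure = Path("projects/RUN/analysis/citrate-synthase/results/figures/N02_cs_boxplot.png")
link = Path("thesis/chapter-03/figures/N02_cs_boxplot.png")
link.parent.mkdir(parents=True, exist_ok=True)
link.symlink_to(figure.resolve())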

Deleting Data

This schema is intended to be highly versatile, capable of automating many of my daily tasks, so I can’t describe them all here and there’s a lot more I could include. I intend to actively update the GitHub repo, so that’s the best place to see regular updates. But one last thing worth mentioning: if, when moving institutions, the data used to generate results must remain behind, then wiping the database (once everything is archived in the appropriate place) is clean and straightforward:

# remove all results — keeps schema, experiments, samples structure
sqlite3 research.db "
  DELETE FROM cs_assays;
  DELETE FROM bca_assays;
  DELETE FROM rna_extractions;
  DELETE FROM rna_qc;
  DELETE FROM pipeline_runs;
  DELETE FROM sequencing_submissions;
  DELETE FROM sequencing_runs;
"

What remains is the schema, the scripts, and the protocols that comprise the workflow pipeline. Everything needed to start again immediately at a new institution.


P.S. What I’m doing now.