Rust prototype for plant organelle dominant-form draft graph assembly.
The assembler keeps repeats and ambiguous branches in the graph. Plastid uses a single dominant-form pass by default. Mitochondrial assembly defaults to two rounds: build a conservative skeleton first, then remap reads to the skeleton to rescue supported closing links.
/Users/zouyinstein-m4max/.cargo/bin/cargo build --release
If you want plain cargo to work in a new terminal, add Rust to PATH first:
export PATH="$HOME/.cargo/bin:$PATH"
Plastid:
./target/release/simple_draft_asm -p ps -i data/plastid.fastq.gz -o result_plastid_profile_default -t 8
Mitochondrion:
./target/release/simple_draft_asm -p ms -i data/mito.fastq.gz -o result_mito_profile_default -t 8
Use --rounds 1 if you want only the first mitochondrial skeleton pass. Use
--numt-interference low|high to switch the mitochondrial profile; the mito
default is currently high.
Use --preset or -p for the common organelle/profile combinations:
| preset | aliases | meaning |
|---|---|---|
| ml | mito_low | mitochondrial low-data mode, replacing the user-facing compact command |
| ms | mito_standard | mitochondrial standard mode, including the default two-round skeleton workflow |
| mh | mito_high | mitochondrial high-data mode: standard mode plus --min-link-ratio 0.30 --subsets=25,50,100 |
| mx | mito_stable, mito_complex | full mitochondrial complex-repeat mode with internal-junction splitting and stable bridge selection |
| pl | plastid_low | plastid low-data mode, replacing the user-facing compact command |
| ps | plastid_standard | plastid standard mode, keeping the plastid one-round workflow |
| ph | plastid_high | plastid high-data mode: standard mode plus --min-link-ratio 0.30 --subsets=25,50,100 |
Presets are shorthand over the existing options; the older --organelle,
--data-mode compact, --small-dataset, --min-link-ratio, and --subsets
flags are still supported. Later explicit options can still override a preset.
The preset layer does not merge plastid and mito internals: plastid keeps its
one-round graph logic, while mito keeps its two-round skeleton/remapping logic
unless --rounds is explicitly changed.
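The override rule can be pictured with a minimal sketch. The dictionary layout, key names, and the resolve_options helper below are hypothetical and do not come from the assembler's source; only the round counts and the high-data extras (--min-link-ratio 0.30 --subsets=25,50,100) are taken from the table above.

```python
# Hypothetical preset table: only the values are taken from the README;
# the structure and key names are illustrative assumptions.
PRESETS = {
    "ms": {"organelle": "mito", "rounds": 2},
    "mh": {"organelle": "mito", "rounds": 2,
           "min_link_ratio": 0.30, "subsets": [25, 50, 100]},
    "ps": {"organelle": "plastid", "rounds": 1},
    "ph": {"organelle": "plastid", "rounds": 1,
           "min_link_ratio": 0.30, "subsets": [25, 50, 100]},
}


def resolve_options(preset, explicit):
    """Start from the preset defaults, then let explicit options win."""
    options = dict(PRESETS[preset])  # copy so the preset table is not mutated
    options.update(explicit)         # later explicit options override the preset
    return options


if __name__ == "__main__":
    # Equivalent in spirit to: -p ms --rounds 1
    print(resolve_options("ms", {"rounds": 1}))
```

The point of the sketch is only the ordering: the preset fills in defaults first, and anything given explicitly afterwards replaces the preset value.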
Use low presets (-p ml or -p pl) for small corrected-read inputs where
standard high-depth profiles can drop low-support links before the graph is
complete. The older --data-mode compact and --small-dataset flags are kept
as aliases for compatibility. This mode keeps the standard large-data plastid
and mito profiles unchanged and only applies when explicitly requested.
For the corrected MECAT plastid input:
./target/release/simple_draft_asm -p pl \
-i data/mecat_corrected_plastid.fasta.gz \
-o result_mecat_plastid_compact \
-t 8
For the corrected MECAT mitochondrial input:
./target/release/simple_draft_asm -p ml \
-i data/mecat_corrected_mito.fasta.gz \
-o result_mecat_mito_compact \
-t 8
In mitochondrial compact mode only, the second round performs an additional local bridge-completion step after read remapping. It identifies open skeleton ends and small disconnected components, extracts reads touching just those local regions, tests candidate bridges, and keeps only links that improve the main topology without creating high-degree secondary branching. Unsupported small components are omitted from the final main graph rather than force-linked. This compact-mito completion path does not change plastid compact mode or the standard large-data mito/plastid profiles.
Start with standard mode unless the data volume already tells you otherwise. Plastid standard mode is a one-round dominant graph workflow. Mitochondrial standard mode is two-round: the first round builds a conservative skeleton and the second round remaps reads to that skeleton to extend/rescue supported open graph structure.
Use compact mode only for genuinely small corrected-read inputs. Randomly
sampling 500 or 1000 raw reads from a large mitochondrial dataset is not a
replacement for the full standard run; on data/Col-0_mito.fastq.gz, 500-read
and 1000-read compact tests did not match the full standard topology.
Use high-data mode for very large datasets when weak secondary links need to be controlled by read subsetting and link filtering:
./target/release/simple_draft_asm -p mh \
-i data/mito.fastq.gz \
-o result_mito_high \
-t 8
When a mitochondrial standard graph has suspicious 1-2 or 2-1 nodes, use
mx to analyze or repair only the selected repeat neighborhood rather than
continuing to tune global thresholds.
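One way to spot such nodes before rerunning is to count links at each segment end in graph.gfa. The sketch below is not part of the assembler; it assumes that "1-2" and "2-1" refer to left-end and right-end link counts, and it relies only on standard GFA1 S and L records. The function names are illustrative.

```python
from collections import defaultdict


def end_degrees(gfa_lines):
    """Return a 'left-right' degree class for every segment in a GFA1 graph."""
    degree = defaultdict(lambda: {"L": 0, "R": 0})
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            degree[fields[1]]  # register segments even if they have no links
        elif fields[0] == "L":
            src, src_orient, dst, dst_orient = fields[1:5]
            # A link leaves the right end of a forward segment (left if reversed)
            degree[src]["R" if src_orient == "+" else "L"] += 1
            # and enters the left end of a forward segment (right if reversed).
            degree[dst]["L" if dst_orient == "+" else "R"] += 1
    return {name: f"{d['L']}-{d['R']}" for name, d in degree.items()}


def suspicious_nodes(gfa_lines):
    """List segments whose degree class is 1-2 or 2-1."""
    classes = end_degrees(gfa_lines)
    return sorted(name for name, cls in classes.items() if cls in ("1-2", "2-1"))


if __name__ == "__main__":
    with open("result_crr_mito_ms/graph.gfa") as handle:  # path from the example run
        print(suspicious_nodes(handle))
```

The names it reports can then be passed to --selected-nodes, keeping in mind that node names are graph-local and may differ between runs.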
Use -p mx for mitochondrial datasets where the normal standard graph may be
closed but repeat topology is unstable. The recommended workflow is manual: run
standard ms first, inspect the graph, then rerun mx when the standard graph
has suspicious repeat-node structure or sample-to-sample instability.
./target/release/simple_draft_asm -p ms \
-i data/CRR958891_mito.fastq.gz \
-o result_crr_mito_ms \
-t 8
./target/release/simple_draft_asm -p mx \
-i data/CRR958891_mito.fastq.gz \
-o result_crr_mito_mx \
-t 8
The mx rerun is not a global threshold sweep. It changes the topology search
itself:
- Build the normal mitochondrial two-round skeleton, including the first-pass read-walk rescue graph.
- Remap all reads to the skeleton and look for strong internal split points inside existing segments, not only at segment ends.
- Split skeleton segments at supported internal junctions and remap reads when new breakpoints are found.
- Build a candidate graph from existing skeleton links, PAF links, local read-supported bridges, and rescue-graph links.
- For paired 1-2/2-1 nodes, test whether their single sides are already connected through a short 2-2 bridge-chain. When the bridge-chain support strongly dominates single-copy reconnect alternatives, add the direct double-copy chord.
- Score candidate bridge sets globally by connectedness, open ends, endpoint overload, branch count, cycle rank, link count, and support.
- Keep only bridges that improve the main topology without creating excessive secondary links.
- Apply repeat-aware shortcut replacement when a short local chain can explain an otherwise ambiguous connection while preserving valid node degrees.
- Prune closed redundant links only after confirming the graph remains connected, closed, and within degree guards.
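The global scoring step in the list above can be illustrated with a reduced sketch. It covers only four of the listed criteria (connectedness, open ends, cycle rank, support), and the lexicographic ordering, the tuple layout, and both function names are assumptions made for illustration; the assembler's actual scoring is not described in this README.

```python
def topology_score(segments, links):
    """Score a graph as (components, open ends, cycle rank, -support).

    links are (src, src_orient, dst, dst_orient, support) tuples.
    Lower tuples are treated as better; this ordering is an assumption.
    """
    parent = {s: s for s in segments}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    used_ends = set()
    support = 0
    for src, src_orient, dst, dst_orient, link_support in links:
        parent[find(src)] = find(dst)
        used_ends.add((src, "R" if src_orient == "+" else "L"))
        used_ends.add((dst, "L" if dst_orient == "+" else "R"))
        support += link_support
    components = len({find(s) for s in segments})
    open_ends = 2 * len(segments) - len(used_ends)
    cycle_rank = len(links) - len(segments) + components
    return (components, open_ends, cycle_rank, -support)


def best_bridge_set(segments, base_links, candidate_sets):
    """Pick the candidate bridge set that gives the lowest topology score."""
    return min(
        candidate_sets,
        key=lambda extra: topology_score(segments, base_links + list(extra)),
    )
```

In this toy ordering a bridge that closes an open end beats one that only adds a secondary branch, which mirrors the intent of keeping bridges that improve the main topology without creating excessive secondary links.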
mx has three user-facing modes:
- --mx-mode auto: full automatic mito-stable search. This can add bridges, perform repeat-aware shortcut replacement, and prune redundant closed links.
- --mx-mode forbid-selected --selected-nodes LIST: only analyze the selected nodes and try to repair those nodes with local candidate bridges. Automatic repeat expansion and pruning are disabled in this mode.
- --mx-mode allow-selected --selected-nodes LIST: analyze the selected nodes without changing the graph. This is useful when deciding whether selected 1-2/2-1 nodes should remain as repeat-copy structure.
For example, to test whether edge_12 and edge_17 should be repaired:
./target/release/simple_draft_asm -p mx \
--skeleton-gfa data/check/all_mito_500K_before_rr.gfa \
-i data/CRR958891_mito.fastq.gz \
-o benchmarks/crr_mx_repair_20260613/forbid_edge12_edge17 \
-t 8 \
--mx-mode forbid-selected \
--selected-nodes edge_12,edge_17
To inspect whether edge_4 and edge_8 are better interpreted as retained
repeat-copy nodes:
./target/release/simple_draft_asm -p mx \
--skeleton-gfa data/check/all_mito_500K_before_rr.gfa \
-i data/CRR958891_mito.fastq.gz \
-o benchmarks/crr_mx_repair_20260613/allow_edge4_edge8 \
-t 8 \
--mx-mode allow-selected \
--selected-nodes edge_4,edge_8
For unstable mitochondrial repeat data, this is meant to separate stable regions from unstable repeat neighborhoods and then solve those neighborhoods with local split/bridge/repeat repair. The topology scan treats node names as graph-local identifiers; compare roles, degrees, neighbors, and sequence placement between runs rather than assuming names are stable.
The main mx diagnostics are:
- mito_stable_splits.tsv: internal breakpoints used to expose hidden branch endpoints.
- mito_stable_bridge.candidates.tsv and mito_stable_bridge.selected.tsv: tested and accepted bridge links.
- mito_stable_bridge.report.txt: topology score summary for the selected bridge set.
- mito_stable_selected_nodes.tsv: selected node degree, incident base links, and incident read-supported candidates.
- mito_stable_copy_choice.tsv: local single-copy versus double-copy interpretations for three-way node pairs, including added-link support, support source such as read_walk or bridge_chain:*, and topology outcome.
- mito_stable_node_degrees.tsv: physical left/right degree class per final node.
- mito_stable_node_repairs.tsv: before/after degree class and repair events for nodes changed by selected bridges, manual edits, pruning, or repeat expansion.
- mito_stable_topology_scan.tsv: final node-role scan, including accepted three-way merge candidates and invalid node shapes.
- mito_stable_manual_edits.tsv: advanced forced/dropped link edits, when supplied.
- mito_stable_repeat_expansions.tsv: repeat-aware shortcut replacements.
- mito_stable_pruned.tsv: redundant links removed after topology validation.
If you already have an incomplete or suspicious GFA and want to repair that
specific graph, add --skeleton-gfa; this is an input graph to inspect and
repair, not a topology reference:
./target/release/simple_draft_asm -p mx \
--skeleton-gfa result_crr_mito_ms/graph.gfa \
-i data/CRR958891_mito.fastq.gz \
-o result_crr_mito_mx_from_gfa \
-t 8
Advanced --force-link and --drop-link options remain available as local
graph-edit controls. They use from:from_orient:to:to_orient syntax, for
example edge_8:-:edge_4:-, and are recorded in
mito_stable_manual_edits.tsv.
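For scripting around these options, the from:from_orient:to:to_orient syntax can be validated before a run. This is a small sketch; LinkEdit and parse_link_spec are hypothetical helper names, and the sketch assumes segment names contain no colon.

```python
from typing import NamedTuple


class LinkEdit(NamedTuple):
    src: str
    src_orient: str
    dst: str
    dst_orient: str


def parse_link_spec(spec):
    """Parse 'from:from_orient:to:to_orient', e.g. 'edge_8:-:edge_4:-'."""
    parts = spec.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {len(parts)}: {spec!r}")
    src, src_orient, dst, dst_orient = parts
    if not src or not dst:
        raise ValueError(f"empty segment name in {spec!r}")
    for orient in (src_orient, dst_orient):
        if orient not in ("+", "-"):
            raise ValueError(f"orientation must be '+' or '-', got {orient!r}")
    return LinkEdit(src, src_orient, dst, dst_orient)


if __name__ == "__main__":
    print(parse_link_spec("edge_8:-:edge_4:-"))
```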
The manual two-step repeat audit remains useful when reviewing a suspicious GFA:
first run forbid-selected on edge_12,edge_17, then run forbid-selected on
edge_4,edge_8 from that repaired graph. The second run only lists
edge_8:-:edge_4:- as selected because edge_17:-:edge_12:- is already
present in its input graph; the two steps together make both repeat cassettes
2-2.
The current direct CRR reruns show the bridge-chain behavior from direct ms
graphs. mx auto resolves CRR958891 by adding utg12:+:utg14:+ through the
utg13 bridge-chain and utg17:+:utg20:+ through the utg21 bridge-chain.
The same motif also triggers on CRR958893, adding utg12:-:utg14:+ through
utg13 and utg0:+:utg18:+ through utg19. See each run's
mito_stable_copy_choice.tsv for support and source.
For CRR958891, use
CRR958891_mx_auto_from_cleaned_ms/graph.gfa
as the final graph. The cleaned input drops utg10 because its ms depth is 0,
and drops the manually rejected utg8--utg26 direct link before mx auto.
This keeps sample-level cleanup separate from mx: mx auto then only adds
the two repeat-cassette chords utg12:+:utg14:+ and utg17:+:utg20:+.
Why the utg17--utg20 direct link is easy to lose in ms: the evidence is
not absent, but the standard graph representation assigns it to the intervening
short repeat bridge utg21. In the CRR958891 ms graph, utg21 is 1,520 bp
and the retained links are very strong:
utg17:+ -> utg21:- skeleton 146, PAF 117
utg21:- -> utg20:+ skeleton 140, PAF 127
The direct chord utg17:+ -> utg20:+ only appears as a weak raw candidate in
links.tsv with PAF support 3 and link ratios about 0.02, so it does not pass
the normal ms graph filters and is absent from ms/graph.gfa. Biologically
and topologically, however, this is the same double-copy cassette evidence:
utg17:R--utg21:R plus utg21:L--utg20:L says the two single sides are joined
through a short 2-2 bridge. mx auto therefore converts that bridge-chain
support into the direct double-copy chord, writing
utg17:+:utg20:+ with support 140 and support_source=bridge_chain:utg21.
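The conversion can be sketched as follows. Taking the minimum of the two bridge-link supports is an assumption: it is consistent with the reported support of 140 for this example, but the README does not state the rule the assembler applies. The function name and record layout are illustrative.

```python
def bridge_chain_chord(left_link, right_link):
    """Derive a direct chord from two links that share a short bridge node.

    left_link  = (src, src_orient, bridge, bridge_orient, support)
    right_link = (bridge, bridge_orient, dst, dst_orient, support)
    """
    src, src_orient, bridge_a, bridge_orient_a, support_a = left_link
    bridge_b, bridge_orient_b, dst, dst_orient, support_b = right_link
    if (bridge_a, bridge_orient_a) != (bridge_b, bridge_orient_b):
        raise ValueError("links do not pass through the same oriented bridge node")
    return {
        "link": f"{src}:{src_orient}:{dst}:{dst_orient}",
        # Assumption: the weaker of the two bridge links bounds the chord support.
        "support": min(support_a, support_b),
        "support_source": f"bridge_chain:{bridge_a}",
    }


if __name__ == "__main__":
    # Skeleton supports from the CRR958891 ms graph quoted above.
    chord = bridge_chain_chord(
        ("utg17", "+", "utg21", "-", 146),
        ("utg21", "-", "utg20", "+", 140),
    )
    print(chord)
```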
For large plastid datasets, weak secondary links can be retained even when the
dominant graph is already clear. Use --min-link-ratio to require each GFA link
to have support close to the best competing link at both of its endpoints:
./target/release/simple_draft_asm --organelle plastid \
-i data/rice_plastid.fastq.gz \
-o result_rice_plastid_clean \
--min-link-ratio 0.30 \
-t 8
The default is 0, which preserves the previous fixed --min-link-support
behavior. A value such as 0.30 removes low-proportion secondary links while
keeping the primary endpoint-supported links.
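A minimal sketch of this relative filter, under the assumption that a link is kept when its support is at least ratio times the best support seen at each of its two endpoints. The function name, the endpoint labels, and the support numbers are hypothetical and are not taken from any run.

```python
from collections import defaultdict


def filter_by_link_ratio(links, min_link_ratio):
    """Keep links whose support is close to the best link at both endpoints.

    links are (end_a, end_b, support) tuples, where each end is a label
    such as 'seg1:R' for the right end of seg1.
    """
    best = defaultdict(int)
    for end_a, end_b, support in links:
        best[end_a] = max(best[end_a], support)
        best[end_b] = max(best[end_b], support)
    kept = []
    for end_a, end_b, support in links:
        threshold_a = min_link_ratio * best[end_a]
        threshold_b = min_link_ratio * best[end_b]
        if support >= threshold_a and support >= threshold_b:
            kept.append((end_a, end_b, support))
    return kept


if __name__ == "__main__":
    links = [
        ("seg1:R", "seg2:L", 120),  # dominant link
        ("seg2:R", "seg3:L", 110),  # dominant link
        ("seg1:R", "seg3:L", 20),   # weak secondary link, ratio below 0.30
    ]
    print(filter_by_link_ratio(links, 0.30))
```

With a ratio of 0 every link passes this step, which matches the statement that the default leaves the fixed-support behavior unchanged.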
For the current rice plastid dataset, use the cleaned 25% read subset as the working parameter combination:
./target/release/simple_draft_asm --organelle plastid \
-i data/rice_plastid.fastq.gz \
-o result_rice_plastid_best \
--min-link-ratio 0.30 \
--subsets=25 \
-t 8
For the current rice mitochondrial dataset, use the same 25% read subset and relative link filtering with the default two-round mitochondrial workflow:
./target/release/simple_draft_asm --organelle mito \
-i data/rice_mito.fastq.gz \
-o result_rice_mito_best \
--min-link-ratio 0.30 \
--subsets=25 \
-t 8
Use --read-subsets to run deterministic read-level subsampling experiments in
one command. --subsets is the same option with a shorter name. Reads are
selected before syncmer/minimizer discovery, and retained reads follow the
normal assembly path. For formal data-volume checks, use a halving series:
./target/release/simple_draft_asm --organelle plastid \
-i data/rice_plastid.fastq.gz \
-o result_rice_plastid_read_subsets \
--subsets=12.5,25,50,100 \
-t 8
Each subset is written under the output directory as read_subset_25/,
read_subset_50/, and so on. Decimal subsets use underscores in directory
names, for example read_subset_12_5/. The top-level read_subsets.tsv
records the elapsed time and read ID file for each subset. Non-100% subsets also
write read_ids.txt in their subset directory for downstream read extraction;
IDs are the first whitespace-delimited token in each FASTQ/FASTA header. The
default behavior is unchanged when neither --read-subsets nor --subsets is
supplied.
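When post-processing subset runs, the directory naming and read ID conventions described above can be reproduced with two small helpers. Both function names are hypothetical; the behavior follows the description here (decimal points become underscores, and the ID is the first whitespace-delimited token of the header).

```python
def subset_dir_name(percent):
    """Map a subset percentage to its output directory name."""
    text = f"{percent:g}"  # 25 -> '25', 12.5 -> '12.5'
    return "read_subset_" + text.replace(".", "_")


def read_id(header):
    """Return the first whitespace-delimited token of a FASTQ/FASTA header."""
    if header[:1] in (">", "@"):
        header = header[1:]  # drop the record marker
    return header.split()[0]


if __name__ == "__main__":
    for pct in (12.5, 25, 50, 100):
        print(subset_dir_name(pct))
    print(read_id("@read_001 runid=abc ch=12"))
```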
For workflows that need to read the data again, such as the default two-round
mitochondrial mode, non-100% subsets also materialize reads.fasta in the
subset directory. Round 1, skeleton remapping, and rescue all use that same
subset FASTA, so the subset is applied consistently across the full workflow.
- graph.gfa: final draft graph.
- unitigs.fasta: final unitig sequences.
- depth.tsv: segment depth estimated by read remapping when available.
- links.tsv: junction/link support table.
- report.txt: run parameters and summary.
For two-round mitochondrial runs:
- round1_skeleton/: conservative first-pass graph.
- round1_readlinks/: first-pass graph with read-walk links for rescue evidence.
- round2_skeleton/: skeleton remapping and rescued final graph.
- top-level graph.gfa, depth.tsv, and links.tsv are copied from the final second-round result.
For mitochondrial compact runs, round2_skeleton/ also includes local bridge
diagnostics:
- mito_compact_bridge.report.txt: bridge completion summary.
- mito_compact_bridge.links.tsv: accepted local bridge candidates.
- mito_compact_bridge.pruned.tsv: secondary links and small components removed from the final main topology.
- mito_compact_bridge.read_ids.txt and mito_compact_bridge.reads.fasta: reads selected for local bridge inspection.
For experimental mito-stable runs, round2_skeleton/ and the top-level output
also include:
- mito_stable_splits.tsv: high-support internal skeleton breakpoints used to expose hidden branch endpoints before final bridge selection.
- mito_stable_bridge.report.txt: global bridge-selection topology summary.
- mito_stable_bridge.candidates.tsv and mito_stable_bridge.selected.tsv: candidate and accepted bridge links after internal splitting.
- mito_stable_node_degrees.tsv: physical left/right degree class for each final node.
- mito_stable_topology_scan.tsv: self-check of node roles; highlights accepted three-way merge candidates through small 2-2 repeat bridges, ordinary 2-2 repeat nodes, linear nodes, and invalid node shapes.
- mito_stable_repeat_expansions.tsv: copied repeat segments used to replace low-overlap shortcut links with local high-overlap chains; this file can be empty when node-degree constraints reject the repair.
- mito_stable_pruned.tsv: closed redundant links removed after repeat repair.
The latest validation set was regenerated under
benchmarks/simple_refresh_20260612/ and
benchmarks/tool_refresh_20260612/ on 2026-06-12. The simple refresh keeps
Col-0 large datasets in standard mode only, rice large datasets in standard and
--min-link-ratio 0.30 --subsets=25,100 modes, and the two Arabidopsis MECAT
mitochondrial datasets in compact mode. OATK was rerun for all six inputs. Flye
was rerun only for Col-0 mitochondrial and the two compact mitochondrial inputs;
rice Flye and Col-0 plastid Flye are skipped because those inputs are too large
for the current comparison budget. For future Flye comparison rows, runs that
exceed 10 minutes should be recorded as >10m with n/a graph statistics.
The Arabidopsis mitochondrial compact datasets are the positive regression tests for the compact-mito bridge workflow. They should keep the compact-mito specific local bridge behavior while leaving plastid compact and large-data profiles unchanged.
| input | mode | output graph | S | L | bases | components | open endpoints | bridge behavior |
|---|---|---|---|---|---|---|---|---|
| data/mecat_mito_Arb-0.fasta.gz | compact mito | benchmarks/simple_refresh_20260612/results/mecat_mito_Arb-0_compact/graph.gfa | 18 | 40 | 365,716 | 1 | 0 | 179 focused reads, 1 accepted local PAF bridge |
| data/mecat_mito_AUZE-A-5.fasta.gz | compact mito | benchmarks/simple_refresh_20260612/results/mecat_mito_AUZE-A-5_compact/graph.gfa | 17 | 38 | 363,552 | 1 | 0 | already closed by skeleton plus full-read PAF |
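Graph statistics of this kind can be recomputed from any graph.gfa with a short script. The sketch below assumes that S and L count GFA1 segment and link records, that bases is the summed segment length (falling back to the LN:i tag when the sequence is "*"), and that an open endpoint is a segment end with no link. These definitions are assumptions and may differ from how the benchmark tables were produced.

```python
def gfa_stats(gfa_lines):
    """Count segments, links, bases, components, and open ends in a GFA1 graph."""
    lengths = {}
    links = []
    used_ends = set()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            name, seq = fields[1], fields[2]
            length = len(seq)
            if seq == "*":  # sequence omitted: use the LN:i tag if present
                length = 0
                for tag in fields[3:]:
                    if tag.startswith("LN:i:"):
                        length = int(tag[5:])
            lengths[name] = length
        elif fields[0] == "L":
            src, src_orient, dst, dst_orient = fields[1:5]
            links.append((src, dst))
            used_ends.add((src, "R" if src_orient == "+" else "L"))
            used_ends.add((dst, "L" if dst_orient == "+" else "R"))

    parent = {name: name for name in lengths}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for src, dst in links:
        parent[find(src)] = find(dst)

    return {
        "S": len(lengths),
        "L": len(links),
        "bases": sum(lengths.values()),
        "components": len({find(name) for name in lengths}),
        "open_endpoints": 2 * len(lengths) - len(used_ends),
    }
```

The function takes any iterable of lines, so an open file handle on graph.gfa can be passed directly.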
Benchmark date: 2026-06-12. Commands were run on this machine:
- Model: Mac Studio Mac16,9
- Chip: Apple M4 Max, 14 cores (10 performance + 4 efficiency)
- Memory: 36 GB
- OS: macOS 26.5.1 (25F80)
- Threads: -t 8
- Tool versions: Flye 2.9.6-b1802, minimap2 2.30-r1287, OATK/syncasm 1.0, simple_draft_asm 0.1.0
Standard and compact simple rows use /usr/bin/time -p wall time. Subset rows
use the per-subset elapsed_seconds values from read_subsets.tsv. OATK and
Flye rows use /usr/bin/time -p wall time from
benchmarks/tool_refresh_20260612/logs/.
| tool | dataset | profile | elapsed | graph | S | L | bases | components |
|---|---|---|---|---|---|---|---|---|
| simple_draft_asm | Col-0 mito | standard | 2.33s | benchmarks/simple_refresh_20260612/results/col0_mito_standard/graph.gfa | 19 | 46 | 376,128 | 1 |
| OATK/syncasm | Col-0 mito | -k 1001 -c 30 | 0.98s | benchmarks/tool_refresh_20260612/oatk/col0_mito/col0_mito.utg.final.gfa | 9 | 24 | 364,639 | 1 |
| Flye | Col-0 mito | --genome-size 500k | 225.64s | benchmarks/tool_refresh_20260612/flye/col0_mito/assembly_graph.gfa | 12 | 17 | 370,282 | 1 |
| simple_draft_asm | Col-0 plastid | standard | 3.63s | benchmarks/simple_refresh_20260612/results/col0_plastid_standard/graph.gfa | 3 | 8 | 129,833 | 1 |
| OATK/syncasm | Col-0 plastid | -k 1001 -c 30 | 2.77s | benchmarks/tool_refresh_20260612/oatk/col0_plastid/col0_plastid.utg.final.gfa | 3 | 8 | 132,513 | 1 |
| Flye | Col-0 plastid | skipped, too slow | n/a | n/a | n/a | n/a | n/a | n/a |
| simple_draft_asm | rice mito | standard | 17.60s | benchmarks/simple_refresh_20260612/results/rice_mito_standard/graph.gfa | 151 | 179 | 1,127,187 | 75 |
| simple_draft_asm | rice mito | --min-link-ratio 0.30 --subsets=25 | 3.928s | benchmarks/simple_refresh_20260612/results/rice_mito_subsets_ratio030/read_subset_25/graph.gfa | 18 | 44 | 364,738 | 1 |
| simple_draft_asm | rice mito | --min-link-ratio 0.30 --subsets=100 | 27.039s | benchmarks/simple_refresh_20260612/results/rice_mito_subsets_ratio030/read_subset_100/graph.gfa | 151 | 153 | 1,127,187 | 84 |
| OATK/syncasm | rice mito | -k 1001 -c 30 | 4.19s | benchmarks/tool_refresh_20260612/oatk/rice_mito/rice_mito.utg.final.gfa | 263 | 442 | 2,877,837 | 110 |
| Flye | rice mito | skipped, dataset too large | n/a | n/a | n/a | n/a | n/a | n/a |
| simple_draft_asm | rice plastid | standard | 11.40s | benchmarks/simple_refresh_20260612/results/rice_plastid_standard/graph.gfa | 5 | 29 | 115,873 | 1 |
| simple_draft_asm | rice plastid | --min-link-ratio 0.30 --subsets=25 | 18.103s | benchmarks/simple_refresh_20260612/results/rice_plastid_subsets_ratio030/read_subset_25/graph.gfa | 3 | 8 | 115,266 | 1 |
| simple_draft_asm | rice plastid | --min-link-ratio 0.30 --subsets=100 | 18.842s | benchmarks/simple_refresh_20260612/results/rice_plastid_subsets_ratio030/read_subset_100/graph.gfa | 5 | 14 | 115,873 | 1 |
| OATK/syncasm | rice plastid | -k 1001 -c 30 | 7.03s | benchmarks/tool_refresh_20260612/oatk/rice_plastid/rice_plastid.utg.final.gfa | 263 | 326 | 3,505,084 | 150 |
| Flye | rice plastid | skipped, dataset too large | n/a | n/a | n/a | n/a | n/a | n/a |
| simple_draft_asm | mecat mito Arb-0 | compact mito | 0.65s | benchmarks/simple_refresh_20260612/results/mecat_mito_Arb-0_compact/graph.gfa | 18 | 40 | 365,716 | 1 |
| OATK/syncasm | mecat mito Arb-0 | -k 1001 -c 30 | 0.14s | benchmarks/tool_refresh_20260612/oatk/mecat_mito_Arb-0/mecat_mito_Arb-0.utg.final.gfa | 4 | 0 | 286,320 | 4 |
| Flye | mecat mito Arb-0 | --genome-size 500k | 35.46s | benchmarks/tool_refresh_20260612/flye/mecat_mito_Arb-0/assembly_graph.gfa | 2 | 2 | 368,875 | 2 |
| simple_draft_asm | mecat mito AUZE-A-5 | compact mito | 0.67s | benchmarks/simple_refresh_20260612/results/mecat_mito_AUZE-A-5_compact/graph.gfa | 17 | 38 | 363,552 | 1 |
| OATK/syncasm | mecat mito AUZE-A-5 | -k 1001 -c 30 | 0.14s | benchmarks/tool_refresh_20260612/oatk/mecat_mito_AUZE-A-5/mecat_mito_AUZE-A-5.utg.final.gfa | 4 | 0 | 301,896 | 4 |
| Flye | mecat mito AUZE-A-5 | --genome-size 500k | 37.08s | benchmarks/tool_refresh_20260612/flye/mecat_mito_AUZE-A-5/assembly_graph.gfa | 3 | 4 | 364,696 | 1 |
Benchmark command set:
bash benchmarks/simple_refresh_20260612/run_simple_benchmarks.sh
bash benchmarks/tool_refresh_20260612/run_oatk_benchmarks.sh
bash benchmarks/tool_refresh_20260612/run_flye_remaining_benchmarks.sh
OATK uses external/oatk/syncasm -k 1001 -c 30 -t 8. Flye uses
--pacbio-hifi, --extra-params output_gfa_before_rr=1, and
--genome-size 500k for mitochondrial inputs or --genome-size 160k for
plastid inputs. Flye rows that are intentionally skipped or still active after
10 minutes are recorded as n/a.
Common options are shown with:
./target/release/simple_draft_asm --help
Low-level tuning parameters are hidden from the normal interface but documented with:
./target/release/simple_draft_asm --help-advanced