# MILK tutorial: Network visualization of MILK trees PBMC (3k dataset) This notebook provides a basic walkthrough of how MILK can be applied to scRNA-seq datasets. Specifically, using the processed 3k PBMCs from 10X Genomics obtained from the ["Preprocessing and clustering 3k PBMCs (legacy workflow)"](https://scanpy.readthedocs.io/en/stable/generated/scanpy.datasets.pbmc3k_processed.html), which is provided by Scanpy. ## Importing libraries ```python import os import pathlib import numpy as np import pandas as pd import scanpy as sc ``` ## Loading the dataset ```python adata = sc.datasets.pbmc3k_processed() ``` The processed `anndata` object contains the following information: AnnData object with n_obs × n_vars = 2638 × 1838 obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain' var: 'n_cells' uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups' obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr' varm: 'PCs' obsp: 'distances', 'connectivities' Visualizing the processed dataset, colored by annotated cell types: ```python sc.pl.umap(adata,color="louvain") ``` ![png](milk_tutorial_pbmc3k_dataset_files/milk_tutorial_pbmc3k_dataset_6_0.png) ## MILK input We can apply MILK to the PCA embeddings (50 components) of the processed dataset. ```python input_df = pd.DataFrame(adata.obsm["X_pca"],index=adata.obs_names) input_df.to_csv("input.csv",header=False) ```

AAACATACAACCAC-1	5.556233	0.257714	-0.186810	2.800131	-0.033783	-0.189702	0.310228	-1.323691	2.691945	0.125928	...	-0.266174	1.024464	-0.709844	-0.052780	-0.686898	-1.419867	-2.865078	0.027601	2.671032	-0.297620
AAACATTGAGCTAC-1	7.209530	7.481985	0.162706	-8.018575	3.012900	0.322293	2.270888	-0.605055	-0.905611	1.225260	...	0.158161	0.819037	0.578912	-1.169742	0.955408	0.068133	-0.883082	2.930932	0.354197	-1.081801
AAACATTGATCAGC-1	2.694438	-1.583658	-0.663126	2.205649	-1.686360	-1.965395	-1.894999	-1.522103	1.914985	-0.481202	...	-1.054254	0.805932	1.543282	1.504834	-0.831818	-0.236549	1.883515	1.084782	0.381470	0.064662
AAACCGTGCTTCCG-1	-10.143295	-1.368530	1.209812	-0.700096	-2.872336	0.230617	1.278005	0.487900	-0.447965	-0.328465	...	1.297246	0.611073	-0.007878	-0.648735	0.543566	3.156763	1.691134	-0.301377	-0.225427	0.962879
AAACCGTGTATGCG-1	-1.112816	-8.152788	1.332405	-4.252473	2.036407	5.597797	-0.110658	-0.102257	0.014520	0.581409	...	1.191032	1.042533	1.734694	-0.142114	0.586381	0.636326	-1.451625	1.809683	-0.087072	-0.737833

## MILK execution ### Exporting MILK directory to PATH In terminal: ```bash milk -i input.csv ``` The following is an excerpt of the standard output: [ Info: MILK [ Info: Using 1 worker(s) for distributed processing [ Info: Parsed Arguments: [ Info: label: nothing [ Info: percentile: 1.0 [ Info: batch-size: 50 [ Info: output-dir: ./milk.out [ Info: merge-threshold: 100000 [ Info: sample-size: 1 [ Info: threads: 1 [ Info: cache-size-limit: 50000 [ Info: job-time: 24:00:00 [ Info: group-stratification-mode: false [ Info: job-name: milk [ Info: partition-size: 10000 [ Info: hpc-mode: false [ Info: skip-reconstruction: false [ Info: metric: euclidean [ Info: verbose: false [ Info: job-scheduler: nothing [ Info: environment-path: nothing [ Info: force-overwrite: true [ Info: job-account: nothing [ Info: job-memory: 4 [ Info: seed: 21 [ Info: input-path: input.csv [ Info: ========== [ Info: Iteration: 0 (2638 objects) [ Info: Caching 2638 objects from the following path: /home/brett/milk/tutorial/milk.out/input.iteration_00000000.input.csv [ Info: ========== Starting encapsulated recursions... [ Info: Iteration: 0 (2638 objects) [ Info: [input.iteration_00000000] 1723 groups (257 groups optimized); 2638 objects (2638 total); threshold: 10.4644 (computed); 3479378 comparisons (1 CPU(s); cache); runtime: 0.01 min [ Info: 1723 objects after recursive iteration. [ Info: Iteration: 1 (1723 objects) [ Info: [input.iteration_00000001] 958 groups (220 groups optimized); 1723 objects (2638 total); threshold: 12.3272 (computed); 2361059 comparisons (1 CPU(s); cache; previous_groupings); runtime: 0.0 min [ Info: 958 objects after recursive iteration. [ ... [ Info: Iteration: 31 (2 objects) [ Info: [input.iteration_00000031] 1 groups (1 groups optimized); 2 objects (2638 total); threshold: 50.5063 (computed); 2639 comparisons (1 CPU(s); cache; previous_groupings); runtime: 0.0 min [ Info: 1 objects after recursive iteration. ┌ Info: └ Completed recursive downsampling procedure. [ Info: Reconstructing hierarchical graph... [ Info: Hierarchical reconstruction [ Info: Done! ## Visualization of the single-cell MILK tree embedding ```python import numpy as np from glob import glob from tqdm import tqdm import graph_tool.all as gt import seaborn as sns import matplotlib.cm as cm import matplotlib.pyplot as plt import matplotlib.patches as mpatches import matplotlib.colors as mcolors from Bio import Phylo from Bio.Phylo.BaseTree import Tree,Clade from collections import defaultdict,deque ``` ```python gt.openmp_set_num_threads(1) ``` These Python functions construct a BioPython tree object from the MILK output files: ```python def instantiate_node(node_id,info_dict): node = Clade(name=node_id) node_dict = info_dict[node_id] node.representative_id = node_dict["representative_id"] node.group_size = node_dict["group_size"] node.iteration = node_dict["iteration"] node.threshold = node_dict["threshold"] node.spread = node_dict["spread"] node.specificity = node_dict["specificity"] node.resolution = node_dict["resolution"] return node def construct_tree_object(vertices_df,edges_df): """ For each (parent) node, provide list of (children) subnodes (top is root; bottom refers to leaves) """ descendents_dict = edges_df.groupby("source")["target"].apply(list).to_dict() max_iteration_df = vertices_df[vertices_df["iteration"] == vertices_df["iteration"].max()] root_id = max_iteration_df["node_id"].values[0] node_info_dict = vertices_df.set_index("node_id").to_dict(orient="index") root_node = instantiate_node(root_id,node_info_dict) queue = deque([root_node]) while queue: clade = queue.popleft() subclades = [] for sample_id in descendents_dict.get(clade.name,[]): subclade = instantiate_node(sample_id,node_info_dict) subclades.append(subclade) queue.append(subclade) clade.clades = subclades return Tree(root_node) ``` ```python # MILK output directory contains a vertices and edges table vertices_path = os.path.join(".","milk.out","output","vertices.csv.gz") vertices_df = pd.read_csv(vertices_path) edges_path = os.path.join(".","milk.out","output","edges.csv.gz") edges_df = pd.read_csv(edges_path) ``` Table with vertex (i.e., node) information:

	node_id	representative_id	group_size	iteration	threshold	spread	specificity	resolution
0	CAATCGGAGAAACA-1	CAATCGGAGAAACA-1	1	0	0.000000	NaN	NaN	0
1	TCGGTAGAGTAGGG-1	TCGGTAGAGTAGGG-1	1	0	0.000000	NaN	NaN	0
2	TAGCCGCTTACGAC-1	TAGCCGCTTACGAC-1	1	0	0.000000	NaN	NaN	0
3	I1	CAATCGGAGAAACA-1	3	1	10.464395	9.64075	6.0	2
4	CCTCGAACACTTTC-1	CCTCGAACACTTTC-1	1	0	0.000000	NaN	NaN	0

Table including edge information:

	source	target
0	I1	CAATCGGAGAAACA-1
1	I1	TCGGTAGAGTAGGG-1
2	I1	TAGCCGCTTACGAC-1
3	I2	CCTCGAACACTTTC-1
4	I3	GAACCTGATGAACC-1

Refer to the [documentation](https://milk-trees.readthedocs.io/en/latest/usage.html#output) for more information on the output. Creating the tree object: ```python tree = construct_tree_object(vertices_df,edges_df) ``` Cell types annotated in this dataset: ```python labels = sorted(adata.obs["louvain"].unique()) ``` - B cells - CD14+ Monocytes - CD4 T cells - CD8 T cells - Dendritic cells - FCGR3A+ Monocytes - Megakaryocytes - NK cells ```python labels_dict = {} for sample_id,label in zip(adata.obs_names,adata.obs["louvain"]): labels_dict[sample_id] = label ``` ```python color_dict = {} for label,col in zip(adata.obs['louvain'].cat.categories,adata.uns["louvain_colors"]): color_dict[label] = tuple(list(mcolors.to_rgb(col))+[0.75]) ``` ```python neutral_col = (0.75,0.75,0.75,0.25) neutral_threshold = 0.8 g = gt.Graph(directed=False) vertex_labels = g.new_vertex_property("vector") vertex_sizes = g.new_vertex_property("float") edge_widths = g.new_edge_property("int") root = g.add_vertex() vertex_labels[root] = neutral_col # color_dict[labels_dict[tree.clade.representative_id]] vertex_sizes[root] = np.log1p(tree.count_terminals())*2 stack = [(root,tree.clade)] while stack: vertex,clade = stack.pop() for subclade in clade: subvertex = g.add_vertex() edge = g.add_edge(vertex,subvertex) representative_label = labels_dict[subclade.representative_id] representative_col = color_dict[representative_label] leaf_labels = [] for leaf in subclade.get_terminals(): leaf_labels.append(labels_dict[leaf.representative_id]) prop_matching = np.mean([int(representative_label == label) for label in leaf_labels]) ''' I'm requiring at least 0.8 of the leaves (associated with the respective clade) to have the same cell type in order for the clade to be colored. Otherwise, it will just be in gray. This can just help to visualize complex tree structures. ''' if prop_matching >= neutral_threshold: vertex_labels[subvertex] = color_dict[labels_dict[subclade.representative_id]] else: vertex_labels[subvertex] = neutral_col vertex_sizes[subvertex] = np.log1p(subclade.count_terminals())*2 edge_widths[edge] = np.log1p(subclade.count_terminals()) stack.append((subvertex,subclade)) g.vertex_properties["cell_type"] = vertex_labels g.vertex_properties["group_size"] = vertex_sizes g.edge_properties["pen_width"] = edge_widths ``` Scale force-directed network layout provided by the `graph_tool` Python library: ```python pos = gt.sfdp_layout(g,C=10) ``` ```python gt.graph_draw( g, pos=pos, vertex_fill_color=vertex_labels, vertex_color=[1,1,1,0.5], edge_color=[0.75,0.75,0.75,0.25], edge_pen_width=edge_widths, vertex_size=vertex_sizes, output_size=(400,600), bg_color=None ) ``` ![png](milk_tutorial_pbmc3k_dataset_files/milk_tutorial_pbmc3k_dataset_21_0.png) We can compare it back to the UMAP embeddings. ```python sc.pl.umap(adata,color="louvain") ``` ![png](milk_tutorial_pbmc3k_dataset_files/milk_tutorial_pbmc3k_dataset_22_0.png)