This document provides a comprehensive comparison between Python (PM4PY) and R (bupaR) implementations for process mining in healthcare contexts, helping you choose the right tool for your specific needs.
Authors:
Developed at SMAILE, Karolinska Institutet
| Choose Python if you need: | Choose R if you need: |
|---|---|
| ✅ Large-scale data processing (>1M events) | ✅ Interactive visualizations |
| ✅ Production deployment | ✅ Statistical analysis integration |
| ✅ Deep learning integration | ✅ Quick prototyping |
| ✅ 64-bit precision | ✅ Familiar tidyverse workflow |
| ✅ API development | ✅ RMarkdown reporting |
| ✅ Performance optimization | ✅ Shiny dashboards |
# Architecture: Object-oriented
class ProcessMiner:
    def __init__(self):
        self.event_log = None
        self.dfg = None
        self.start_activities = None
        self.end_activities = None

    def discover_process(self, data):
        self.event_log = pm4py.convert_to_event_log(data)
        # discover_dfg returns the graph plus start/end activities
        self.dfg, self.start_activities, self.end_activities = \
            pm4py.discover_dfg(self.event_log)
        return self.dfg
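A minimal usage sketch (assuming `pm4py` is imported and `df` is a pandas DataFrame already formatted with case, activity, and timestamp columns):
miner = ProcessMiner()
dfg = miner.discover_process(df)  # returns the directly-follows graph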
Characteristics:
- Stateful objects that hold the event log and discovered models
- Explicit method calls; intermediate results kept as attributes
- Integrates naturally into larger Python applications
# Architecture: Functional/Pipeline
event_log <- data %>%
  simple_eventlog(case_id = "case",
                  activity_id = "activity",
                  timestamp = "timestamp") %>%
  filter_activity_frequency(percentage = 0.8) %>%
  process_map()
Characteristics:
- Composable pipe (%>%) steps; each stage returns a new object
- Declarative, tidyverse-style workflow
- Well suited to interactive, exploratory analysis
| Aspect | Python | R | Notes |
|---|---|---|---|
| Max Events | ~10M+ | ~1-2M | Python handles larger datasets |
| Max Cases | ~1M+ | ~100K | R may struggle with many cases |
| Memory Usage | Efficient | Higher | R creates more copies |
| 64-bit Support | Full | Partial | Python better for precision |
| Streaming | Yes | Limited | Python can process chunks |
# Python - Memory efficient
import pandas as pd
# Process in chunks
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
process_chunk(chunk)
# R - Loads entire file
library(readr)
# Must load all data
data <- read_csv('large_file.csv')
# Creates copy for each operation
processed <- data %>%
mutate(...) %>% # Copy 1
filter(...) %>% # Copy 2
arrange(...) # Copy 3
Note: The following comparisons are based on general characteristics of the underlying libraries and languages, not empirical benchmarks on this specific implementation.
| Aspect | Python (PM4PY) | R (bupaR) | Notes |
|---|---|---|---|
| Large Datasets | Generally better | May struggle | Python's memory model typically handles large data more efficiently |
| Iterative Processing | Efficient | Less efficient | R's copy-on-modify can impact performance |
| Visualization | Good static plots | Excellent interactive | R/Shiny excels at interactive dashboards |
| Parallel Processing | Good support | Good support | Both support parallelization |
| Characteristic | Python | R |
|---|---|---|
| Memory Model | Reference-based | Copy-on-modify |
| Large Data Handling | Can process in chunks | Typically loads all data |
| Garbage Collection | Automatic | Automatic but different strategy |
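A small Python illustration of the reference-based model in the table above (pandas is used here, but the same holds for plain Python objects):
import pandas as pd
df = pd.DataFrame({"x": range(5)})
alias = df               # no copy: both names reference the same object
alias.loc[0, "x"] = 99
print(df.loc[0, "x"])    # prints 99: the change is visible through both names
In R, `alias <- df; alias$x[1] <- 99` would instead trigger a copy on modification, leaving `df` unchanged.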
Important: Actual performance will vary based on dataset size and shape, hardware, and the specific algorithms and parameters used.
For production deployments, we recommend benchmarking with your specific datasets and workflows.
Disclaimer: Performance comparisons in this document are based on general characteristics of the languages and libraries, not empirical testing of this specific implementation. Actual performance may vary significantly based on your use case.
| Algorithm | Python (PM4PY) | R (bupaR) |
|---|---|---|
| DFG Discovery | ✅ Full | ✅ Full |
| Heuristics Miner | ✅ Full | ⚠️ Limited |
| Alpha Algorithm | ✅ Full | ❌ No |
| Inductive Miner | ✅ Full | ⚠️ Partial |
| Petri Nets | ✅ Full | ❌ No |
| BPMN | ✅ Full | ❌ No |
| Conformance Checking | ✅ Full | ⚠️ Basic |
| Predictive Monitoring | ✅ Full | ❌ No |
import pm4py
# 1. Directly-Follows Graph (DFG) - Most Common
dfg, start_activities, end_activities = pm4py.discover_dfg(event_log)
# 2. Heuristics Miner - Handles Noise Well
heu_net = pm4py.discover_heuristics_net(event_log)
petri_net, initial_marking, final_marking = pm4py.convert_to_petri_net(heu_net)
# 3. Alpha Algorithm - Classical Academic Algorithm
alpha_petri_net, initial_marking, final_marking = pm4py.discover_petri_net_alpha(event_log)
# 4. Alpha+ Algorithm - Extended Alpha (also returns a net with markings)
alpha_plus_net, ap_im, ap_fm = pm4py.discover_petri_net_alpha_plus(event_log)
# 5. Inductive Miner - Guarantees Sound Models
tree = pm4py.discover_process_tree_inductive(event_log)
petri_net, initial_marking, final_marking = pm4py.convert_to_petri_net(tree)
# 6. Inductive Miner - Infrequent (IMf)
tree_imf = pm4py.discover_process_tree_inductive(event_log,
noise_threshold=0.2)
# 7. Inductive Miner with multiprocessing enabled
#    (parallelizes the computation; this is not the IMd variant)
tree_mp = pm4py.discover_process_tree_inductive(event_log,
                                                multi_processing=True)
# 8. ILP Miner - Integer Linear Programming
ilp_net, ilp_im, ilp_fm = pm4py.discover_petri_net_ilp(event_log)
# 9. Fuzzy-style views - PM4PY ships no dedicated fuzzy miner;
#    a similar effect is obtained by pruning infrequent DFG edges
fuzzy_dfg, f_sa, f_ea = pm4py.discover_dfg(event_log)
fuzzy_like_dfg = {edge: freq for edge, freq in fuzzy_dfg.items() if freq >= 10}
# 10. Split Miner - not bundled with PM4PY; it is distributed as an
#     external (Java-based) tool that can be run alongside it
library(bupaR)
library(processmapR)
library(heuristicsmineR)
# 1. Directly-Follows Graph (DFG)
dfg_map <- event_log %>%
  process_map(type = frequency("absolute"))
# 2. Precedence Matrix
precedence <- event_log %>%
  precedence_matrix(type = "absolute")
# 3. Heuristics Miner (heuristicsmineR) - dependency matrix
dependency <- dependency_matrix(event_log, threshold = 0.9)
# 4. Causal Net Discovery
causal_net_model <- causal_net(event_log, threshold = 0.9)
# 5. Fuzzy Miner Approach
fuzzy_map <- event_log %>%
filter_activity_frequency(percentage = 0.95) %>%
process_map(type = performance(median, "hours"))
# 6. Social Network Mining (handover of work between resources)
handover_map <- event_log %>%
  resource_map()
# 7. Dotted Chart Analysis
dotted_chart(event_log)
# 8. Process Animation (processanimateR)
library(processanimateR)
animate_process(event_log)
# 1. Token-Based Replay
from pm4py.algo.conformance.tokenreplay import algorithm as token_replay
replayed_traces = token_replay.apply(event_log, petri_net,
initial_marking, final_marking)
# 2. Alignments - Most Precise
from pm4py.algo.conformance.alignments.petri_net import algorithm as alignments
aligned_traces = alignments.apply_log(event_log, petri_net,
                                      initial_marking, final_marking)
# 3. Multi-threaded Alignments (via the simplified interface)
aligned_traces_mt = pm4py.conformance_diagnostics_alignments(
    event_log, petri_net, initial_marking, final_marking,
    multi_processing=True)
# 4. Decomposed Alignments - For Large Models
from pm4py.algo.conformance.decomp_alignments import algorithm as decomp_align
decomp_aligned = decomp_align.apply(event_log, petri_net,
initial_marking, final_marking)
# 5. Anti-Alignments - Find Deviations
from pm4py.algo.conformance.antialignments import algorithm as anti_align
anti_aligned = anti_align.apply(event_log, petri_net,
initial_marking, final_marking)
# 6. Footprints Conformance
from pm4py.algo.conformance.footprints import algorithm as footprints_conf
footprints_conf_result = footprints_conf.apply(event_log, petri_net,
initial_marking, final_marking)
# Rule-based conformance via processcheckR (bupaR ecosystem)
library(processcheckR)
# 1. Check that cases start with a given activity
# ("Registration" is an illustrative activity name)
start_check <- event_log %>%
  check_rule(starts("Registration"))
# 2. Check a precedence constraint between two activities
precedence_check <- event_log %>%
  check_rule(precedence("Triage", "Treatment"))
# Token replay and alignment-based conformance are not available natively
# in the bupaR ecosystem; use PM4PY (or its R wrapper) for those.
# 1. Case Duration Analysis (durations are returned in seconds)
case_durations = pm4py.get_all_case_durations(event_log)
# 2. Activity Duration Analysis
activity_durations = pm4py.get_activity_durations(event_log)
# 3. Waiting Time Analysis
waiting_times = pm4py.get_waiting_time(event_log)
# 4. Service Time Analysis
service_times = pm4py.get_service_time(event_log)
# 5. Cycle Time Analysis
cycle_times = pm4py.get_cycle_time(event_log)
# 6. Performance DFG
perf_dfg, sa, ea = pm4py.discover_performance_dfg(event_log)
# 7. Sojourn Time Analysis
from pm4py.statistics.sojourn_time.log import get as soj_time_get
sojourn_times = soj_time_get.apply(event_log)
# 8. Resource Performance
resource_activities = pm4py.get_resource_activity_workload(event_log)
# 1. Throughput Time Analysis
throughput <- event_log %>%
throughput_time(level = "log",
units = "hours")
# 2. Processing Time Analysis
processing_time <- event_log %>%
processing_time(level = "activity")
# 3. Idle Time Analysis
idle_time <- event_log %>%
idle_time(level = "case")
# 4. Resource Utilization
resource_util <- event_log %>%
resource_frequency(level = "resource")
# 5. Activity Frequency Analysis
activity_freq <- event_log %>%
activity_frequency(level = "activity")
# 6. Trace Coverage
trace_coverage <- event_log %>%
trace_coverage(level = "trace")
# 7. Performance Dashboard (processmonitR)
library(processmonitR)
performance_dashboard(event_log)
# 1. Clinical Pathway Mining
def discover_clinical_pathway(event_log, outcome_attribute="SepsisLabel"):
    # Separate by outcome
    positive_cases = pm4py.filter_event_attribute_values(
        event_log, outcome_attribute, [1], retain=True)
    negative_cases = pm4py.filter_event_attribute_values(
        event_log, outcome_attribute, [0], retain=True)
    # Discover pathways for each outcome
    pos_dfg, pos_sa, pos_ea = pm4py.discover_dfg(positive_cases)
    neg_dfg, neg_sa, neg_ea = pm4py.discover_dfg(negative_cases)
    return {
        'positive_outcome': (pos_dfg, pos_sa, pos_ea),
        'negative_outcome': (neg_dfg, neg_sa, neg_ea)
    }
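A brief usage sketch for the function above, assuming the log carries a numeric "SepsisLabel" event attribute (as in the public Sepsis event log):
pathways = discover_clinical_pathway(event_log, outcome_attribute="SepsisLabel")
pos_dfg, pos_sa, pos_ea = pathways['positive_outcome']
pm4py.view_dfg(pos_dfg, pos_sa, pos_ea)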
# 2. Treatment Effect Mining
import statistics

def analyze_treatment_effects(event_log, treatment_activity, outcome_metric):
    treated_cases = pm4py.filter_event_attribute_values(
        event_log, "concept:name", [treatment_activity], retain=True)
    control_cases = pm4py.filter_event_attribute_values(
        event_log, "concept:name", [treatment_activity], retain=False)
    # get_all_case_durations returns a plain list of durations (seconds)
    treated_outcomes = pm4py.get_all_case_durations(treated_cases)
    control_outcomes = pm4py.get_all_case_durations(control_cases)
    treated_mean = statistics.mean(treated_outcomes)
    control_mean = statistics.mean(control_outcomes)
    return {
        'treated_mean': treated_mean,
        'control_mean': control_mean,
        'effect_size': treated_mean - control_mean
    }
# 3. Risk Stratification Mining
def risk_stratification_mining(event_log, risk_factors):
    stratified_results = {}
    for risk_level in ['low', 'medium', 'high']:
        # filter_by_risk_level is assumed to be a user-defined helper
        risk_cases = filter_by_risk_level(event_log, risk_factors, risk_level)
        dfg, sa, ea = pm4py.discover_dfg(risk_cases)
        durations = pm4py.get_all_case_durations(risk_cases)
        stratified_results[risk_level] = {
            'dfg': dfg,
            'case_count': len(risk_cases),
            'avg_duration': statistics.mean(durations) if durations else None
        }
    return stratified_results
# 1. Clinical Pathway Analysis
analyze_clinical_pathways <- function(event_log, outcome_var) {
  positive_outcomes <- event_log %>%
    filter(!!sym(outcome_var) == 1) %>%
    process_map(type = frequency("relative"))
  negative_outcomes <- event_log %>%
    filter(!!sym(outcome_var) == 0) %>%
    process_map(type = frequency("relative"))
  list(
    positive = positive_outcomes,
    negative = negative_outcomes
  )
}
# 2. Care Pathway Compliance
pathway_compliance <- function(event_log, reference_pathway) {
  event_log %>%
    traces() %>%
    mutate(compliance = case_when(
      trace %in% reference_pathway ~ "Compliant",
      TRUE ~ "Non-compliant"
    )) %>%
    group_by(compliance) %>%
    summarise(n_cases = sum(absolute_frequency)) %>%
    mutate(percentage = 100 * n_cases / sum(n_cases))
}
# 3. Resource Handover Analysis
resource_handover_analysis <- function(event_log) {
  # resource_map() visualizes handovers of work between resources
  event_log %>%
    resource_map()
}
# 4. Temporal Pattern Analysis
temporal_patterns <- function(event_log) {
  event_log %>%
    dotted_chart(x = "absolute", y = "duration")
}
# 1. Sepsis Progression Miner
class SepsisProgressionMiner:
    def __init__(self, event_log=None):
        self.event_log = event_log

    def discover_sepsis_patterns(self):
        # Filter sepsis cases
        sepsis_cases = pm4py.filter_event_attribute_values(
            self.event_log, "SepsisLabel", [1], retain=True)
        # Time-based analysis
        time_patterns = self.analyze_temporal_patterns(sepsis_cases)
        # Clinical milestone detection
        milestones = self.detect_clinical_milestones(sepsis_cases)
        return {
            'temporal_patterns': time_patterns,
            'clinical_milestones': milestones
        }
# 2. Readmission Pattern Miner
class ReadmissionMiner:
    def discover_readmission_patterns(self, event_log, time_window_days=30):
        readmission_patterns = {}
        # Implementation for readmission analysis
        return readmission_patterns
# 3. Medication Adherence Miner
class MedicationAdherenceMiner:
    def analyze_adherence_patterns(self, event_log, medication_events):
        adherence_metrics = {}
        # Implementation for medication adherence
        return adherence_metrics
# 1. Clinical Protocol Miner
library(survival)
# map_protocol_step() and check_protocol_adherence() are assumed to be
# user-defined helpers for the protocol at hand
clinical_protocol_miner <- function(event_log, protocol_definition) {
  protocol_adherence <- event_log %>%
    mutate(
      protocol_step = map_protocol_step(activity_id, protocol_definition),
      adherence = check_protocol_adherence(protocol_step)
    ) %>%
    group_by(case_id) %>%
    summarise(
      adherence_rate = mean(adherence, na.rm = TRUE),
      protocol_completion = max(protocol_step, na.rm = TRUE)
    )
  return(protocol_adherence)
}
# 2. Survival Analysis Integration
survival_process_mining <- function(event_log, time_to_event, status) {
  # Combine process mining with survival analysis
  pathway_survival <- event_log %>%
    group_by(trace_id) %>%
    summarise(
      pathway_pattern = paste(activity_id, collapse = "->"),
      time_to_event = first(!!sym(time_to_event)),
      status = first(!!sym(status))
    ) %>%
    survfit(Surv(time_to_event, status) ~ pathway_pattern, data = .)
  return(pathway_survival)
}
# 3. Quality Indicator Mining
library(tidyr)
quality_indicator_mining <- function(event_log, quality_measures) {
  quality_analysis <- event_log %>%
    group_by(case_id) %>%
    summarise(
      across(all_of(quality_measures),
             list(mean = ~mean(.x, na.rm = TRUE),
                  min = ~min(.x, na.rm = TRUE),
                  max = ~max(.x, na.rm = TRUE)))
    ) %>%
    pivot_longer(cols = -case_id,
                 names_to = "measure",
                 values_to = "value")
  return(quality_analysis)
}
| Library | Focus | Algorithms | Healthcare Suitability |
|---|---|---|---|
| PM4PY | Comprehensive PM | DFG, Alpha, Inductive, Heuristics, ILP | ⭐⭐⭐⭐⭐ Complete |
| ProM-Lite | Academic research | Research algorithms | ⭐⭐⭐ Research focus |
| PM4Py-GPU | GPU acceleration | High-performance mining | ⭐⭐⭐⭐ Large datasets |
| ProcessLens | Visualization | Interactive exploration | ⭐⭐⭐ Limited healthcare |
| Processminer | Basic mining | Simple algorithms | ⭐⭐ Basic functionality |
| Celonis | Enterprise | Commercial algorithms | ⭐⭐⭐⭐ Enterprise healthcare |
| Library | Focus | Algorithms | Healthcare Suitability |
|---|---|---|---|
| bupaR | Business process analysis | DFG, Performance, Social network | ⭐⭐⭐⭐⭐ Excellent |
| processmapR | Process visualization | Process maps, precedence matrices | ⭐⭐⭐ Good |
| heuristicsmineR | Heuristics mining | Heuristics nets | ⭐⭐⭐ Research focus |
| eventdataR | Event data manipulation | Data preparation | ⭐⭐⭐⭐ Essential for healthcare |
| xesreadR | XES file handling | Data import/export | ⭐⭐⭐ Standard compliance |
| processcheckR | Conformance checking | Constraint checking | ⭐⭐⭐⭐ Clinical guidelines |
| processanimateR | Process animation | Dynamic visualization | ⭐⭐⭐⭐ Patient journey |
| processmonitR | Real-time monitoring | Live dashboards | ⭐⭐⭐⭐ Clinical monitoring |
# 1. Healthcare Process Mining Toolkit (custom)
class HealthcarePMToolkit:
    """Extended PM4PY for healthcare applications"""
    def __init__(self):
        # ClinicalPathwayMiner and TreatmentEffectMiner are assumed to be
        # defined elsewhere, following the same pattern as the miners above
        self.clinical_miners = {
            'sepsis_miner': SepsisProgressionMiner(),
            'pathway_miner': ClinicalPathwayMiner(),
            'readmission_miner': ReadmissionMiner(),
            'treatment_miner': TreatmentEffectMiner()
        }

    def analyze_clinical_outcomes(self, event_log, outcome_measures):
        """Analyze clinical outcomes using process mining"""
        results = {}
        for outcome in outcome_measures:
            miner = self.clinical_miners.get(f"{outcome}_miner")
            if miner:
                results[outcome] = miner.analyze(event_log)
        return results
# 2. Medical Process Discovery
from pm4py.algo.discovery.dfg import algorithm as dfg_discovery

class MedicalProcessDiscoverer:
    """Specialized discovery for medical processes"""
    def discover_care_pathways(self, event_log, specialization="general"):
        """Discover care pathways by medical specialization"""
        # Filter by medical department/specialization
        specialized_log = self.filter_by_department(event_log, specialization)
        # Discover the DFG, then prune clinically insignificant edges
        dfg, start_activities, end_activities = dfg_discovery.apply(specialized_log)
        min_edge_frequency = 5  # clinical significance threshold
        dfg = {edge: freq for edge, freq in dfg.items()
               if freq >= min_edge_frequency}
        return {
            'care_pathway': dfg,
            'entry_points': start_activities,
            'discharge_points': end_activities,
            'clinical_metrics': self.calculate_clinical_metrics(specialized_log)
        }
# 3. Predictive Healthcare Mining
class PredictiveHealthcareMiner:
    """Predictive process monitoring for healthcare"""
    def predict_patient_outcome(self, partial_trace, outcome_models):
        """Predict patient outcomes from partial traces"""
        features = self.extract_clinical_features(partial_trace)
        predictions = {}
        for outcome, model in outcome_models.items():
            probability = model.predict_proba(features)[0][1]
            predictions[outcome] = {
                'probability': probability,
                'risk_level': self.categorize_risk(probability),
                'interventions': self.recommend_interventions(outcome, features)
            }
        return predictions
# 1. Clinical bupaR Extensions
library(bupaR)
library(dplyr)
library(survival)
# Clinical pathway analysis extensions
clinical_bupaR <- function() {
  # Enhanced activity analysis for clinical activities
  # (map_clinical_category() is assumed to be a user-defined helper)
  clinical_activity_frequency <- function(event_log, clinical_categories = NULL) {
    if (!is.null(clinical_categories)) {
      event_log <- event_log %>%
        mutate(clinical_category = map_clinical_category(activity_id, clinical_categories))
    }
    event_log %>%
      activity_frequency(level = "activity") %>%
      arrange(desc(absolute)) %>%
      mutate(
        clinical_significance = case_when(
          relative > 0.8 ~ "Core pathway",
          relative > 0.5 ~ "Common pathway",
          relative > 0.2 ~ "Alternative pathway",
          TRUE ~ "Rare pathway"
        )
      )
  }
  # Resource utilization for healthcare staff
  healthcare_resource_analysis <- function(event_log) {
    workload <- event_log %>%
      resource_frequency(level = "resource") %>%
      mutate(
        workload_category = case_when(
          relative > 0.4 ~ "High workload",
          relative > 0.2 ~ "Moderate workload",
          TRUE ~ "Light workload"
        )
      )
    # The handover-of-work map is built from the log itself, not the summary
    list(workload = workload, handover_map = resource_map(event_log))
  }
  # Temporal analysis for clinical timing
  clinical_time_analysis <- function(event_log) {
    list(
      throughput = event_log %>% throughput_time(level = "log"),
      processing = event_log %>% processing_time(level = "activity"),
      waiting = event_log %>% idle_time(level = "case"),
      timing_compliance = event_log %>%
        filter_processing_time(interval = c(0, 24), units = "hours") %>%
        nrow() / nrow(event_log)
    )
  }
  return(list(
    activity_analysis = clinical_activity_frequency,
    resource_analysis = healthcare_resource_analysis,
    time_analysis = clinical_time_analysis
  ))
}
# 2. Outcome-based Process Mining
outcome_based_pm <- function(event_log, outcome_variable) {
  # Separate by outcomes
  positive_outcomes <- event_log %>%
    filter(!!sym(outcome_variable) == 1)
  negative_outcomes <- event_log %>%
    filter(!!sym(outcome_variable) == 0)
  # Comparative analysis
  comparison_results <- list(
    positive_pathway = positive_outcomes %>% process_map(),
    negative_pathway = negative_outcomes %>% process_map(),
    # Performance comparison
    positive_performance = positive_outcomes %>%
      throughput_time(level = "log") %>%
      summary(),
    negative_performance = negative_outcomes %>%
      throughput_time(level = "log") %>%
      summary(),
    # Activity differences
    activity_differences = bind_rows(
      positive_outcomes %>% activity_frequency(level = "activity") %>% mutate(outcome = "positive"),
      negative_outcomes %>% activity_frequency(level = "activity") %>% mutate(outcome = "negative")
    ) %>%
      select(activity, relative, outcome) %>%
      pivot_wider(names_from = outcome, values_from = relative) %>%
      mutate(difference = positive - negative) %>%
      arrange(desc(abs(difference)))
  )
  return(comparison_results)
}
# 3. Quality Indicators Integration
quality_pm_integration <- function(event_log, quality_indicators) {
  # Map quality indicators to process events
  quality_enhanced_log <- event_log %>%
    left_join(quality_indicators, by = "case_id") %>%
    mutate(
      quality_category = case_when(
        quality_score >= 0.8 ~ "High quality",
        quality_score >= 0.6 ~ "Moderate quality",
        TRUE ~ "Low quality"
      )
    )
  # Quality-stratified process analysis
  quality_analysis <- quality_enhanced_log %>%
    group_by(quality_category) %>%
    do(
      process_map = process_map(.),
      performance = throughput_time(., level = "log"),
      resource_usage = resource_frequency(., level = "resource")
    )
  return(quality_analysis)
}
| Algorithm Type | Python (PM4PY) | R (bupaR) | Clinical Use Case |
|---|---|---|---|
| DFG | ✅ Full support | ✅ Full support | Basic pathway discovery |
| Heuristics Miner | ✅ Complete implementation | ⚠️ Limited (heuristicsmineR) | Noisy clinical data |
| Alpha Algorithm | ✅ Alpha, Alpha+ | ❌ Not available | Academic/research use |
| Inductive Miner | ✅ IM, IMf, IMd | ⚠️ Basic implementation | Guaranteed sound models |
| ILP Miner | ✅ Full implementation | ❌ Not available | Complex clinical processes |
| Split Miner | ⚠️ Via external tools | ❌ Not available | Automated parameter tuning |
| Fuzzy Miner | ⚠️ Approximated via DFG filtering | ⚠️ Approximated via filtering | Spaghetti processes |
| Declare Miner | ✅ Available | ❌ Not available | Constraint-based mining |
| Transition Systems | ✅ Full support | ❌ Not available | State-based modeling |
| Method | Python (PM4PY) | R (bupaR) | Healthcare Application |
|---|---|---|---|
| Token Replay | ✅ Full implementation | ⚠️ Basic via processcheckR | Guideline compliance |
| Alignments | ✅ Cost-based alignments | ❌ Not available | Precise deviation analysis |
| Multi-perspective | ✅ Data + control flow | ⚠️ Limited | Complex clinical rules |
| Decomposed | ✅ Large model support | ❌ Not available | Scalable conformance |
| Anti-alignments | ✅ Available | ❌ Not available | Finding model weaknesses |
| Footprints | ✅ Available | ❌ Not available | Lightweight conformance |
| Feature | Python | R | Winner |
|---|---|---|---|
| Process Discovery | Comprehensive | Good | Python |
| Conformance | Token replay, alignments | Basic fitness | Python |
| Enhancement | Full support | Limited | Python |
| Social Network | Yes | Yes | Tie |
| Time Analysis | Good | Excellent | R |
| Resource Analysis | Good | Excellent | R |
# Static visualization
pm4py.view_dfg(dfg, start_activities, end_activities)
# Customization and export (rankdir/bgcolor are supported keyword
# arguments in recent PM4PY releases; the output format is inferred
# from the file extension)
pm4py.save_vis_dfg(dfg, start_activities, end_activities,
                   "process.png", rankdir="LR", bgcolor="white")
Pros:
- Publication-ready static exports (PNG, SVG, PDF)
- Fully scriptable and reproducible
Cons:
- Limited interactivity compared to R/Shiny
# Interactive visualization (process_map renders by default)
process_map(event_log,
            type = frequency("relative"),
            sec = performance(median),
            rankdir = "LR")
# Animated traces (processanimateR)
animate_process(event_log)
Pros:
- Interactive, explorable process maps
- Built-in animation via processanimateR
Cons:
- Can be slow on very large logs
| Aspect | Support | Notes |
|---|---|---|
| Web APIs | Excellent | Flask, FastAPI, Django |
| Databases | Full | All major DBs supported |
| Cloud | Native | AWS, GCP, Azure ready |
| Docker | Easy | Lightweight containers |
| ML/AI | Seamless | TensorFlow, PyTorch |
| Notebooks | Full | Jupyter, Colab |
| Aspect | Support | Notes |
|---|---|---|
| Web APIs | Limited | Plumber, RestRserve |
| Databases | Good | DBI, odbc packages |
| Cloud | Moderate | RStudio Cloud |
| Docker | Complex | Larger images |
| ML/AI | Good | tidymodels, keras |
| Notebooks | Full | RMarkdown, RStudio |
import pm4py
import pandas as pd
# Load and prepare
df = pd.read_csv('sepsis.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Convert to event log (map the CSV columns to PM4PY's standard names)
df = pm4py.format_dataframe(df, case_id='case',
                            activity_key='activity',
                            timestamp_key='timestamp')
event_log = pm4py.convert_to_event_log(df)
# Keep only cases that reach an expected final activity
# ("Discharge" is an illustrative activity name)
filtered_log = pm4py.filter_end_activities(event_log, ['Discharge'])
# Discover process
dfg, start, end = pm4py.discover_dfg(filtered_log)
# Filter infrequent paths (the DFG maps each edge to its frequency)
dfg_filtered = {edge: freq for edge, freq in dfg.items() if freq >= 10}
# Visualize
pm4py.view_dfg(dfg_filtered, start, end)
# Performance metrics (durations are returned in seconds)
durations = pm4py.get_all_case_durations(filtered_log)
print(f"Avg duration: {sum(durations) / len(durations) / 3600:.2f} hours")
library(bupaR)
library(tidyverse)
# Load and prepare
data <- read_csv('sepsis.csv') %>%
mutate(timestamp = as.POSIXct(timestamp))
# Convert to event log
event_log <- data %>%
  simple_eventlog(case_id = "case",
                  activity_id = "activity",
                  timestamp = "timestamp")
# Drop rare trace variants (a rough proxy for incomplete cases;
# percentage = 1.0 would keep everything)
filtered_log <- event_log %>%
  filter_trace_frequency(percentage = 0.95)
# Discover and filter process
freq_map <- filtered_log %>%
  filter_activity_frequency(interval = c(10, NA)) %>%
  process_map(type = frequency("absolute"))
# Performance metrics
performance <- filtered_log %>%
  throughput_time(level = "case", units = "hours")
print(paste("Avg duration:", mean(performance$throughput_time), "hours"))
# Python: event_log = pm4py.read_xes('log.xes')
# R equivalent (xesreadR):
event_log <- read_xes('log.xes')
# Python: dfg = pm4py.discover_dfg(event_log)
# R equivalent (closest analogue; bupaR exposes no raw DFG object):
precedence_matrix <- event_log %>% precedence_matrix()
# R: event_log %>% filter_activity_frequency(percentage = 0.8)
# Python equivalent (approximate): keep the most frequent activities
freq = pm4py.get_event_attribute_values(event_log, "concept:name")
top = sorted(freq, key=freq.get, reverse=True)
filtered = pm4py.filter_event_attribute_values(
    event_log, "concept:name", top[:int(len(top) * 0.8)], retain=True)
# R: process_map(event_log)
# Python equivalent:
dfg, start, end = pm4py.discover_dfg(event_log)
pm4py.view_dfg(dfg, start, end)
| Criterion | Python | R | Recommendation |
|---|---|---|---|
| Learning Curve | ⭐⭐⭐ | ⭐⭐⭐⭐ | R easier to start |
| Performance | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Depends on use case |
| Scalability | ⭐⭐⭐⭐⭐ | ⭐⭐ | Python for big data |
| Visualization | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | R for interactivity |
| Statistics | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | R for analysis |
| Production | ⭐⭐⭐⭐⭐ | ⭐⭐ | Python for deployment |
| Documentation | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Both good |
| Community | ⭐⭐⭐⭐ | ⭐⭐⭐ | Python larger |
Memory Model: 64-bit native
Threading: GIL (single-threaded for CPU tasks)
Parallel: multiprocessing support
Data Types:
- Integer: int (arbitrary precision); int64 in NumPy/pandas
- Float: float64 (IEEE 754)
- Timestamp: datetime64[ns]
Max Array Size: Limited by RAM
Garbage Collection: Automatic (reference counting + cyclic)
JIT Compilation: Available (Numba, PyPy)
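A quick check of the integer-precision claims above (NumPy assumed installed; pandas columns behave like the NumPy array):
import numpy as np
print(2 ** 100)                                 # native ints grow as needed
print(np.array([2 ** 62], dtype=np.int64) * 2)  # wraps around: fixed 64-bit storage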
Memory Model: Mixed 32/64-bit
Threading: Limited (some packages support)
Parallel: parallel, foreach packages
Data Types:
- Integer: int32 (2^31-1 max)
- Float: float64 (IEEE 754)
- Timestamp: POSIXct/POSIXlt
Max Vector Size: 2^31-1 elements for classic vectors (long vectors since R 3.0 relax this)
Garbage Collection: Automatic (generational)
JIT Compilation: Limited (compiler package)
Recommendation: Python (PM4PY) + R (bupaR) Hybrid
├── Python: Data processing, discovery, conformance
├── R: Statistical analysis, dashboards, reporting
└── Integration: Shared data via Parquet/CSV
Why this combination:
- Python scales to hospital-sized event data and deploys cleanly
- R adds the statistical depth and reporting clinicians expect
- A shared columnar format keeps both sides in sync (see the sketch below)
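A minimal sketch of the Parquet hand-off, assuming pyarrow on the Python side and the arrow package on the R side; file and column names are illustrative:
import pandas as pd
df = pd.read_csv('sepsis.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Python does the heavy processing, then persists a columnar file
df.to_parquet('event_log.parquet')  # requires pyarrow or fastparquet
# R side picks it up for statistics and reporting:
# R> event_log <- arrow::read_parquet("event_log.parquet")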
Recommendation: R (bupaR) Primary
├── bupaR: Core process mining
├── survival: Survival analysis integration
├── tidymodels: Statistical modeling
└── RMarkdown: Publication-ready reports
Why R:
- Tight integration with survival analysis and statistical modeling
- Publication-ready reporting via RMarkdown
Recommendation: Python (PM4PY) Primary
├── PM4PY: Process discovery and monitoring
├── FastAPI: Real-time API endpoints
├── Redis: Caching for performance
└── Docker: Containerized deployment
Why Python:
- Production-grade APIs, caching, and containerized deployment
- Handles continuous, high-volume event streams (a minimal endpoint sketch follows)
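A minimal sketch of a discovery endpoint under this stack; the `/discover` route, upload format, and column names are illustrative assumptions, not part of PM4PY itself:
from fastapi import FastAPI, UploadFile
import pandas as pd
import pm4py

app = FastAPI()

@app.post("/discover")
async def discover(file: UploadFile):
    # Expect a CSV with case/activity/timestamp columns
    df = pd.read_csv(file.file)
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df = pm4py.format_dataframe(df, case_id="case",
                                activity_key="activity",
                                timestamp_key="timestamp")
    dfg, start, end = pm4py.discover_dfg(df)
    # Serialize edge frequencies for the JSON response
    return {"edges": [{"from": a, "to": b, "freq": f}
                      for (a, b), f in dfg.items()]}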
Recommendation: R (bupaR) Primary
├── bupaR: Process analysis
├── processanimateR: Visual storytelling
├── Shiny: Interactive dashboards
└── plotly: Interactive visualizations
Why R:
- Interactive dashboards and animated process maps communicate well to clinical and public audiences
Recommendation: Both Python and R
├── Start with R (easier learning curve)
├── Progress to Python (advanced techniques)
├── Compare results between platforms
└── Teach platform selection criteria
# Python - Full implementation
from pm4py.algo.conformance.tokenreplay import algorithm as token_replay
replayed = token_replay.apply(event_log, petri_net, im, fm)
# Use when: Basic compliance checking, performance important
# Healthcare: Guideline adherence, protocol compliance
# R - rule-based approximation via processcheckR (no native token replay;
# "Registration" is an illustrative activity name)
fitness_check <- event_log %>%
  check_rule(starts("Registration"))
# Use when: Simple guideline checks inside a bupaR workflow
# Python - Advanced implementation
from pm4py.algo.conformance.alignments.petri_net import algorithm as alignments
aligned = alignments.apply_log(event_log, petri_net, im, fm)
# Use when: Precise deviation analysis, root cause analysis
# Healthcare: Detailed clinical audit, malpractice investigation
❌ Not available in R - Use Python for advanced conformance
# 1. Use efficient filtering
filtered_log = pm4py.filter_event_attribute_values(
event_log, "concept:name", frequent_activities, retain=True)
# 2. Chunk processing for large datasets
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
process_chunk(chunk)
# 3. Use multiprocessing where supported
tree = pm4py.discover_process_tree_inductive(event_log, multi_processing=True)
# 4. Pre-filter noise
clean_log = pm4py.filter_variants_top_k(event_log, k=1000)
# 1. Use data.table for large datasets
library(data.table)
event_dt <- as.data.table(event_log)
# 2. Filter early and often
filtered_log <- event_log %>%
filter_activity_frequency(percentage = 0.9) %>%
filter_trace_frequency(percentage = 0.95)
# 3. Use parallel processing
library(parallel)
results <- mclapply(case_list, analyze_case, mc.cores = 4)
# 4. Optimize visualizations
process_map(event_log, render = FALSE) # Skip rendering for analysis
This comprehensive comparison was developed at SMAILE (Stockholm Medical Artificial Intelligence and Learning Environments), Karolinska Institutet.
This comparison is based on PM4PY 2.7+ and bupaR 0.5+ as of 2024.