Energy-aware Swarm Intelligence with Small Language Models (SLMs)

Ship reliable language capabilities on constrained hardware—without guessing the cost to run it.

SLM-Bench is an energy- and efficiency-aware benchmarking product for Small Language Models. It helps you compare SLMs across quality, latency, cost, power, and carbon, so you can pick (or route to) the right model for every device, budget, and workload.

Accuracy · Cost · On-Device Runtime · Power Consumption · CO₂ Emissions

Why this matters

Large models can be overkill when you're deploying to:

  • Edge devices and on-prem GPUs
  • Cost-sensitive inference at scale
  • Battery- and power-limited environments
  • Sustainability- and compliance-driven orgs

Small Language Models can be the pragmatic choice—but only if you can quantify the trade-offs.

*Figure: Overview of the SLM landscape and model evolution*

*Figure: SLM-Bench evaluation framework: correctness and efficiency metrics*

What you get

A single scoreboard for quality + efficiency

SLM-Bench evaluates SLMs on real NLP tasks and datasets, then reports a multi-metric view of performance:

  • Correctness (how good are the answers?)
  • Computation (how fast and how heavy is execution?)
  • Consumption (how much energy, cost, and CO₂ does it take?)

*Figure: Performance comparison of SLMs across correctness dimensions*

Decision-ready metrics (not just "accuracy")

Compare models using a standardized set of metrics, including:

  • Accuracy / F1 (classification & QA)
  • BLEU / ROUGE / METEOR (generation quality)
  • Runtime and compute indicators
  • Energy usage, cost, and CO₂ emissions
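
To make the multi-metric view concrete, here is a minimal sketch of how one model's results could be held in a single record and compared on different axes. The field names and numbers are illustrative assumptions, not SLM-Bench's actual schema or output.

```python
from dataclasses import dataclass

# Hypothetical record of one model's benchmark results; field names
# and values are illustrative, not SLM-Bench's real schema.
@dataclass
class BenchResult:
    model: str
    accuracy: float   # task accuracy (0-1)
    f1: float         # F1 for classification / QA
    bleu: float       # generation quality (0-1)
    runtime_s: float  # wall-clock seconds for the eval run
    energy_wh: float  # measured energy in watt-hours
    cost_usd: float   # inference cost for the run
    co2_g: float      # grams of CO2-equivalent

results = [
    BenchResult("slm-a", 0.82, 0.80, 0.31, 120.0, 15.0, 0.04, 6.2),
    BenchResult("slm-b", 0.78, 0.77, 0.29, 60.0, 7.5, 0.02, 3.1),
]

# Different "winners" depending on the axis you optimize for.
best_quality = max(results, key=lambda r: r.accuracy)  # slm-a
best_energy = min(results, key=lambda r: r.energy_wh)  # slm-b
```

The point of keeping all metrics in one record is exactly the trade-off above: the top-accuracy model and the lowest-energy model are often not the same one.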

Fair comparisons on controlled hardware

Results are produced under controlled hardware conditions with standardized evaluation protocols—so you can trust differences between models as signal, not noise.


Leaderboard

Medal system: fast signal for best-in-class

To make trade-offs easy to scan:

🥇 Gold = best-in-class across metrics
🥈 Silver = strong performance with solid balance
🥉 Bronze = dependable baseline
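
One hedged way to think about how tiers like these could be derived: average a model's per-metric ranks (1 = best) and bucket the result. The thresholds and ranks below are illustrative assumptions, not SLM-Bench's actual medal formula.

```python
# Hypothetical medal assignment: average a model's ranks across
# metrics (1 = best) and bucket. Thresholds are illustrative only.
def medal(avg_rank: float) -> str:
    if avg_rank <= 1.5:
        return "gold"
    if avg_rank <= 2.5:
        return "silver"
    return "bronze"

# Example ranks across (accuracy, runtime, energy) for three models.
ranks = {"slm-a": [1, 2, 1], "slm-b": [2, 1, 2], "slm-c": [3, 3, 3]}
medals = {m: medal(sum(r) / len(r)) for m, r in ranks.items()}
```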

Hardware-aware rankings

Benchmarks are reported per target GPU profile (e.g. L4 and A10G) so teams can evaluate models where they actually run.

GPU - L4

| Model | Provider | Parameters | Context Window | Training Time | Gold | Silver | Bronze |
|-------|----------|------------|----------------|---------------|------|--------|--------|

GPU - A10G

| Model | Provider | Parameters | Context Window | Training Time | Gold | Silver | Bronze |
|-------|----------|------------|----------------|---------------|------|--------|--------|

*Figure: SLM-Bench benchmark overview: tasks, datasets, and domains*

Recommendations (choose the right SLM for the job)

SLM-Bench highlights that different models win for different goals:

  • If you optimize for correctness: pick the top accuracy leaders
  • If you optimize for throughput/latency: pick the best compute-efficient models
  • If you optimize for energy/carbon: pick the models with the lowest measured consumption
  • If you want balanced performance: pick the models that score consistently across all categories

This lets you design an energy-aware swarm strategy: route requests to the best-fit SLM based on quality targets, budget, and power constraints.
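
A minimal sketch of such a routing policy, assuming benchmark-style stats per model: route each request to the cheapest model (by energy) that still clears a per-request quality floor. The model names and numbers are hypothetical.

```python
# Hedged sketch of an energy-aware routing policy. The stats below
# are illustrative placeholders, not real benchmark output.
MODELS = [
    {"name": "slm-small", "accuracy": 0.74, "energy_wh": 0.5},
    {"name": "slm-mid",   "accuracy": 0.81, "energy_wh": 1.2},
    {"name": "slm-large", "accuracy": 0.88, "energy_wh": 3.0},
]

def route(min_accuracy: float) -> str:
    """Pick the lowest-energy model meeting the quality floor."""
    eligible = [m for m in MODELS if m["accuracy"] >= min_accuracy]
    if not eligible:
        # Nothing clears the floor: fall back to best quality.
        return max(MODELS, key=lambda m: m["accuracy"])["name"]
    return min(eligible, key=lambda m: m["energy_wh"])["name"]
```

Easy requests (a low quality floor) land on the smallest model; demanding ones escalate, so the fleet's average energy per request stays low without capping peak quality.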


How SLM-Bench works

A reproducible pipeline turns "model evaluation" into an operational workflow:

1. Universal Data Loader – normalizes diverse datasets into one interface
2. Preprocessing – trims/transforms inputs per task
3. Calling – runs models consistently across tasks
4. Postprocessing – cleans/structures model outputs
5. Evaluation – computes correctness + efficiency metrics
6. Report – produces tables and visualizations
7. Logging – captures runs for traceability and iteration

*Figure: SLM-Bench evaluation pipeline: from data loading to reporting*
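
The stages above can be sketched as composable functions. This is a toy illustration of the pipeline shape, not SLM-Bench's implementation; function bodies are placeholders.

```python
# Toy sketch of the seven-stage pipeline; names mirror the stages,
# bodies are placeholders standing in for the real logic.
def load(dataset):           # 1. Universal Data Loader
    return [{"input": x} for x in dataset]

def preprocess(examples):    # 2. Preprocessing
    return [{**e, "input": e["input"].strip()} for e in examples]

def call(model, examples):   # 3. Calling
    return [model(e["input"]) for e in examples]

def postprocess(outputs):    # 4. Postprocessing
    return [o.lower() for o in outputs]

def evaluate(preds, gold):   # 5. Evaluation
    correct = sum(p == g for p, g in zip(preds, gold))
    return {"accuracy": correct / len(gold)}

def report(metrics):         # 6. Report
    return f"accuracy={metrics['accuracy']:.2f}"

# 7. Logging would capture each stage's inputs/outputs for traceability.
gold = ["yes", "no"]
preds = postprocess(call(str.upper, preprocess(load([" yes ", "no"]))))
print(report(evaluate(preds, gold)))  # accuracy=1.00
```

Because each stage has a fixed interface, swapping in a new model, dataset, or metric touches exactly one stage, which is what makes the workflow reproducible and operational.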

Built for real deployments

Use SLM-Bench to:

  • Select the right SLM for on-device assistants
  • Compare candidate models before production rollout
  • Justify model choice with measurable cost/energy/CO₂ data
  • Track efficiency regressions across model upgrades
  • Build routing policies for "swarm" / multi-model systems

Get started

  • View the open-source implementation and reproduce results locally
  • Run a full benchmark sweep, or evaluate a single model–dataset pair
  • Contribute new models, datasets, and hardware profiles