Energy-aware Swarm Intelligence with Small Language Models (SLMs)

Ship reliable language capabilities on constrained hardware—without guessing the cost to run it.

SLM-Bench is an energy- and efficiency-aware benchmarking product for Small Language Models. It helps you compare SLMs across quality, latency, cost, power, and carbon, so you can pick (or route to) the right model for every device, budget, and workload.

Accuracy · Cost · On-Device Runtime · Power Consumption · CO₂ Emissions

Why this matters

Large models can be overkill when you're deploying to:

  • Edge devices and on-prem GPUs
  • Cost-sensitive inference at scale
  • Battery- and power-limited environments
  • Sustainability- and compliance-driven orgs

Small Language Models can be the pragmatic choice—but only if you can quantify the trade-offs.

*Figure: Overview of the SLM landscape and model evolution*

*Figure: SLM-Bench evaluation framework: correctness and efficiency metrics*

What you get

A single scoreboard for quality + efficiency

SLM-Bench evaluates SLMs on real NLP tasks and datasets, then reports a multi-metric view of performance:

  • Correctness (how good are the answers?)
  • Computation (how fast and how heavy is execution?)
  • Consumption (how much energy, cost, and CO₂ does it take?)

*Figure: Performance comparison of SLMs across correctness dimensions*

Decision-ready metrics (not just "accuracy")

Compare models using a standardized set of metrics, including:

  • Accuracy / F1 (classification & QA)
  • BLEU / ROUGE / METEOR (generation quality)
  • Runtime and compute indicators
  • Energy usage, cost, and CO₂ emissions
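
To make the multi-metric view concrete, here is a minimal sketch of how one model's results could be held in a single record and compared on different axes. The field names and numbers are illustrative assumptions, not SLM-Bench's actual schema or output.

```python
from dataclasses import dataclass

# Hypothetical record of one model's benchmark results; field names
# and values are illustrative, not SLM-Bench's real schema.
@dataclass
class BenchResult:
    model: str
    accuracy: float   # task accuracy (0-1)
    f1: float         # F1 for classification / QA
    bleu: float       # generation quality (0-1)
    runtime_s: float  # wall-clock seconds for the eval run
    energy_wh: float  # measured energy in watt-hours
    cost_usd: float   # inference cost for the run
    co2_g: float      # grams of CO2-equivalent

results = [
    BenchResult("slm-a", 0.82, 0.80, 0.31, 120.0, 15.0, 0.04, 6.2),
    BenchResult("slm-b", 0.78, 0.77, 0.29, 60.0, 7.5, 0.02, 3.1),
]

# Different "winners" depending on the axis you optimize for.
best_quality = max(results, key=lambda r: r.accuracy)  # slm-a
best_energy = min(results, key=lambda r: r.energy_wh)  # slm-b
```

The point of keeping all metrics in one record is exactly the trade-off above: the top-accuracy model and the lowest-energy model are often not the same one.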

Fair comparisons on controlled hardware

Results are produced under controlled hardware conditions with standardized evaluation protocols—so you can trust differences between models as signal, not noise.


Leaderboard

Medal system: fast signal for best-in-class

To make trade-offs easy to scan:

🥇 Gold = best-in-class across metrics
🥈 Silver = strong performance with solid balance
🥉 Bronze = dependable baseline
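
One hedged way to think about how tiers like these could be derived: average a model's per-metric ranks (1 = best) and bucket the result. The thresholds and ranks below are illustrative assumptions, not SLM-Bench's actual medal formula.

```python
# Hypothetical medal assignment: average a model's ranks across
# metrics (1 = best) and bucket. Thresholds are illustrative only.
def medal(avg_rank: float) -> str:
    if avg_rank <= 1.5:
        return "gold"
    if avg_rank <= 2.5:
        return "silver"
    return "bronze"

# Example ranks across (accuracy, runtime, energy) for three models.
ranks = {"slm-a": [1, 2, 1], "slm-b": [2, 1, 2], "slm-c": [3, 3, 3]}
medals = {m: medal(sum(r) / len(r)) for m, r in ranks.items()}
```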

Hardware-aware rankings

Benchmarks are reported per target GPU profile (e.g. L4 and A10G) so teams can evaluate models where they actually run.

GPU - L4

| Model | Provider | Parameters | Context Window | Training Time | Gold | Silver | Bronze |
|-------|----------|------------|----------------|---------------|------|--------|--------|

GPU - A10G

| Model | Provider | Parameters | Context Window | Training Time | Gold | Silver | Bronze |
|-------|----------|------------|----------------|---------------|------|--------|--------|

*Figure: SLM-Bench benchmark overview: tasks, datasets, and domains*

Recommendations (choose the right SLM for the job)

SLM-Bench highlights that different models win for different goals:

  • If you optimize for correctness: pick the top accuracy leaders
  • If you optimize for throughput/latency: pick the best compute-efficient models
  • If you optimize for energy/carbon: pick the models with the lowest measured consumption
  • If you want balanced performance: pick the models that score consistently across all categories

This lets you design an energy-aware swarm strategy: route requests to the best-fit SLM based on quality targets, budget, and power constraints.
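
A minimal sketch of such a routing policy, assuming benchmark-style stats per model: route each request to the cheapest model (by energy) that still clears a per-request quality floor. The model names and numbers are hypothetical.

```python
# Hedged sketch of an energy-aware routing policy. The stats below
# are illustrative placeholders, not real benchmark output.
MODELS = [
    {"name": "slm-small", "accuracy": 0.74, "energy_wh": 0.5},
    {"name": "slm-mid",   "accuracy": 0.81, "energy_wh": 1.2},
    {"name": "slm-large", "accuracy": 0.88, "energy_wh": 3.0},
]

def route(min_accuracy: float) -> str:
    """Pick the lowest-energy model meeting the quality floor."""
    eligible = [m for m in MODELS if m["accuracy"] >= min_accuracy]
    if not eligible:
        # Nothing clears the floor: fall back to best quality.
        return max(MODELS, key=lambda m: m["accuracy"])["name"]
    return min(eligible, key=lambda m: m["energy_wh"])["name"]
```

Easy requests (a low quality floor) land on the smallest model; demanding ones escalate, so the fleet's average energy per request stays low without capping peak quality.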


How SLM-Bench works

A reproducible pipeline turns "model evaluation" into an operational workflow:

1. Universal Data Loader – normalizes diverse datasets into one interface
2. Preprocessing – trims/transforms inputs per task
3. Calling – runs models consistently across tasks
4. Postprocessing – cleans/structures model outputs
5. Evaluation – computes correctness + efficiency metrics
6. Report – produces tables and visualizations
7. Logging – captures runs for traceability and iteration

*Figure: SLM-Bench evaluation pipeline: from data loading to reporting*
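
The stages above can be sketched as composable functions. This is a toy illustration of the pipeline shape, not SLM-Bench's implementation; function bodies are placeholders.

```python
# Toy sketch of the seven-stage pipeline; names mirror the stages,
# bodies are placeholders standing in for the real logic.
def load(dataset):           # 1. Universal Data Loader
    return [{"input": x} for x in dataset]

def preprocess(examples):    # 2. Preprocessing
    return [{**e, "input": e["input"].strip()} for e in examples]

def call(model, examples):   # 3. Calling
    return [model(e["input"]) for e in examples]

def postprocess(outputs):    # 4. Postprocessing
    return [o.lower() for o in outputs]

def evaluate(preds, gold):   # 5. Evaluation
    correct = sum(p == g for p, g in zip(preds, gold))
    return {"accuracy": correct / len(gold)}

def report(metrics):         # 6. Report
    return f"accuracy={metrics['accuracy']:.2f}"

# 7. Logging would capture each stage's inputs/outputs for traceability.
gold = ["yes", "no"]
preds = postprocess(call(str.upper, preprocess(load([" yes ", "no"]))))
print(report(evaluate(preds, gold)))  # accuracy=1.00
```

Because each stage has a fixed interface, swapping in a new model, dataset, or metric touches exactly one stage, which is what makes the workflow reproducible and operational.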

Built for real deployments

Use SLM-Bench to:

  • Select the right SLM for on-device assistants
  • Compare candidate models before production rollout
  • Justify model choice with measurable cost/energy/CO₂ data
  • Track efficiency regressions across model upgrades
  • Build routing policies for "swarm" / multi-model systems

Get started

  • View the open-source implementation and reproduce results locally
  • Run a full benchmark sweep, or evaluate a single model–dataset pair
  • Contribute new models, datasets, and hardware profiles