Research
Engineering
April 2026 · 5 min read

Real-time semantic scoring at scale: our smoothing approach

Agreement scores that jump with every message are distracting. Scores that never move are useless. We describe the exponential moving average that keeps Alethe AI's score bar informative without being noisy.

The problem with raw scores

When you compute a semantic agreement score after every message in a live debate, the raw number bounces wildly. Two models might start at 62% agreement, drop to 48% when one pushes back hard, spike to 79% when they find a shared premise, then settle around 71%. If you display this raw number directly, users see a volatile meter that oscillates every few seconds. In user testing, this produced anxiety rather than insight — people watched the number more than they read the actual responses.

What we want from a score

A useful agreement score should do three things: reflect the current direction of the conversation (trending toward consensus or away from it), remain stable enough that users can act on it (not change 20 points in two seconds), and respond meaningfully when a real shift happens (not lag several messages behind genuine convergence). These requirements are in tension. The first and third want responsiveness; the second wants stability.

The exponential moving average approach

We settled on an exponential moving average (EMA) with a weight of approximately 0.7 on the new score and 0.3 on the previous smoothed score. This means recent messages have significantly more influence than old ones (unlike a simple rolling average), but a single outlier message cannot dominate the display. The formula is straightforward: smoothed_score = α × new_score + (1 − α) × previous_smoothed. We tested α values from 0.5 to 0.9; 0.7 produced the best balance between responsiveness and stability across our test set of debates.
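The smoothing step is small enough to show in full. This is a minimal Python sketch of the formula above; the function name and the seeding behavior on the first message are illustrative assumptions, not Alethe AI's actual code:

```python
from typing import Optional


def smooth(new_score: float, previous_smoothed: Optional[float],
           alpha: float = 0.7) -> float:
    """Exponential moving average with weight alpha on the newest raw score.

    On the first message there is no history yet, so (as an assumption here)
    the raw score is displayed as-is.
    """
    if previous_smoothed is None:
        return new_score
    return alpha * new_score + (1 - alpha) * previous_smoothed


# Walking through the raw scores from the example above: 62 -> 48 -> 79 -> 71.
smoothed = None
for raw in [62, 48, 79, 71]:
    smoothed = smooth(raw, smoothed)
    # The smoothed series tracks each swing but moves less abruptly
    # than the raw one: a single outlier cannot dominate the display.
```

With α = 0.7 the drop to 48 pulls the display down to roughly 52 rather than all the way to 48, while the recovery to 79 still registers within a single message.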

How we compute the raw score

The raw score before smoothing comes from cosine similarity between the vector embeddings of the models' most recent responses. We use a lightweight embedding model that can run inference in under 100ms, which is critical for real-time display during an active debate. The cosine similarity gives us a number between -1 and 1; we normalize this to a 0–100 percentage for display. We do not compare the full conversation history on every message — only the most recent exchange, which keeps computation fast and makes the score reflect where the conversation currently is, not where it has been.
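The similarity and normalization steps can be sketched as follows. The embedding model itself is out of scope here; this assumes you already have two embedding vectors, and the linear map from [-1, 1] to 0–100 is the natural reading of the normalization described above:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def to_display_score(similarity: float) -> float:
    """Linearly map cosine similarity from [-1, 1] to a 0-100 percentage."""
    return (similarity + 1.0) / 2.0 * 100.0
```

Identical directions map to 100, orthogonal vectors to 50, and opposed vectors to 0. Only the two most recent response embeddings are compared, so this runs in time proportional to the embedding dimension, well within the real-time budget.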

What we learned from deployment

In production, the smoothed score proved significantly more useful than both the raw score and a simpler rolling average. Users reported it felt honest — when models were clearly converging, the score rose steadily; when they were circling around an unresolved disagreement, the score plateaued instead of falsely suggesting progress. One unexpected finding: the score plateau itself became a useful signal. When the score stops moving for several rounds at a moderate level (say, 60–75%), it often indicates that the models have found a genuine boundary — an area where reasonable disagreement persists. This is often more valuable to a user than false consensus.
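The plateau signal described above lends itself to a mechanical check. This is a hedged sketch; the window size, tolerance, and band edges are illustrative parameters, not values from production:

```python
def is_plateau(scores: list[float], window: int = 4, tolerance: float = 3.0,
               low: float = 60.0, high: float = 75.0) -> bool:
    """Return True when the last `window` smoothed scores all sit in the
    moderate band [low, high] and vary by no more than `tolerance` points.

    This is the "score stopped moving at a moderate level" condition that
    often marks a genuine boundary of disagreement.
    """
    if len(scores) < window:
        return False
    recent = scores[-window:]
    in_band = all(low <= s <= high for s in recent)
    stable = max(recent) - min(recent) <= tolerance
    return in_band and stable
```

A check like this could drive a subtle UI cue (for example, labeling the bar "stable disagreement") instead of leaving users to infer the plateau from the meter alone.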
Eclipsco — Next Generation AI