Over the Frail Dream

the stepmania chronarticles

Originally written by Mina sometime in late 2015-early 2016. Don't know the exact date.


Preface

Difficulty has been a long-standing and complex issue within many rhythm game communities over the past two decades. It also tends to be a divisive issue between communities. In order to understand difficulty we must first understand the current systems and the reasons for their existence.

Most rhythm game companies or platforms provide a basic difficulty rating system through which new players are guided. However, as players develop and become more advanced, it becomes apparent that the systems provided are almost universally inadequate for anything beyond novice players.

This is due to a lack of incentive for companies to actually spend the time developing and producing a robust difficulty rating system for their game, and even if they were inclined to in spite of this, it’s likely they would lack the experience to follow through.

Most games are constructed with a small set of beginner files in mind which then branch out into successively larger and more difficult pools of files. Most new players take any given difficulty system at face value, and for the most part, systems provided are quite capable of guiding naive rhythm game newcomers through initial stages of progression.

The threshold at which newer players recognize the shortcomings of any specific system tends to be almost by definition the point at which they are experienced enough on the matter to resolve their own methodology. As these players begin to interact with more longstanding members of their respective communities, their concept of difficulty and the system by which they internally apply it to their games becomes further shaped by those who have undergone the same process, albeit with a larger degree of experience.

This is how most communally constructed difficulty systems begin and propagate. These systems largely entail taking each file and assessing its general difficulty relative to other files within the game, given a sample of player opinions on the matter, and creating a list ordering the files from easiest to hardest. As a game’s file pool expands and players become better, systems are expanded or rescaled to keep up. Any player good enough to recognize that “official” difficulties aren’t worth anything immediately gets sucked into the existing engine of player-derived difficulty, and the need for a more refined official system is erased as soon as it is created.

While at first glance this may seem to be perfectly logical and adequate (and to be fair, for the most part it is), it is within this paradigm that the fundamental challenge of difficulty is both birthed and abandoned.

Every player internalizes the experience of their progression through a rhythm game and formulates their concept of difficulty. These concepts of difficulty, while relatable on a global scale, are almost all different. Every player has their own idea of what difficulty is, what particular things affect difficulty and by how much. To compound the issue these notions are rarely ever made explicit, and not only do players adopt differing perspectives on what difficulty means for a given file, but the same players may adopt slightly different perspectives when given a different file, or looking at the same file in a different period of time.

Because this process produces reasonably inoffensive results on a large scale, smaller-scale quibbles generally end in “well that’s your opinion”. Imagine sitting in a room with 20 PhD candidates arguing about the use of a particular word in a thesis. Each has their own perception of what the word means, how it should be applied, and in what connotations it is appropriate. Now imagine this continues for hours or days without a single person taking the initiative to actually check the dictionary definition of the word, and the process ends in everyone agreeing that everyone’s opinion is their own, the matter settled only because it would be more productive to find a new word to argue about.

Sounds ridiculous, yes? Well the only difference is we don’t have a dictionary to consult, and therein lies the first challenge of difficulty. A common language must be created for which there are explicit definitions, and then a definition of difficulty must be declared.

Part 1: Understanding Difficulty

So how is this possible? I’ve already asserted that everyone internalizes their own concept of difficulty. Difficulty is a constructed human concept for which there is no physical measurement. Any definition asserted can be rejected by another party as being an opinion and their own concept of difficulty presented instead. In many cases it’s not who is right that wins out in the end, but who seems more right and garners the most support from others. There is no immediately obvious methodology for determining whether the winner is correct. Yes, this is all true, and no, it wouldn’t be a challenge if it didn’t seem like a futile exercise in battling human subjectivity.

Here comes the fun part. We know difficulty is not a physical property of a chart, but what if it was? Well, the solution would be rather simple then: we would measure it. But we can’t, because it doesn’t exist. However, this alone is not sufficient rationale to dismiss this line of inquiry. If we can’t measure difficulty, is there perhaps a physical property of a chart that, when measured, might produce an estimation of difficulty?

The answer is yes, and undoubtedly you have already both concluded that it is NPS and dismissed the concept as idiotic. For those who haven’t made those two connections yet, a brief background is in order. NPS is a measurement of notes per second; some games use kps (keys per second) or some other variation, but the concept remains the same. If there are more buttons to hit faster, it’s harder. I’ll be referring to the concept universally as NPS henceforth, and to average NPS (the total number of notes divided by the total length in time of the file) as avg NPS, and from now on I’ll also recuse myself from the duty of making sure everything I say is globally applicable to rhythm games by restricting myself to Stepmania. This does not mean, however, that nothing I say is applicable to other games. Most of the underlying concepts and principles are either directly applicable or adaptable to other games of the VSRG genre, and in fact many of the underlying logical principles employed are applicable to any endeavor in life.
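
For concreteness, here is a minimal sketch (in Python, purely illustrative) of what avg NPS means in code; the `avg_nps` helper and the synthetic note list are inventions for this example, not part of any actual Stepmania tool:

```python
# A minimal, purely illustrative sketch of "avg NPS"; avg_nps and the synthetic
# note list are inventions for this example, not part of any Stepmania tool.

def avg_nps(note_times):
    """Average notes per second: total note count divided by chart length in seconds."""
    if len(note_times) < 2:
        return 0.0
    length = max(note_times) - min(note_times)
    return len(note_times) / length if length > 0 else 0.0

# Example: 600 notes, one every 200ms, spanning roughly two minutes -> ~5.0 avg NPS
notes = [i * 0.2 for i in range(600)]
print(avg_nps(notes))
```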

Using avg NPS is the most obvious first step for approaching the problem of difficulty, and it is struck down just as quickly because the results it generates are obviously terrible. Avg NPS inherits one of the major problematic qualities community difficulty systems have: generally acceptable accuracy on a large scale, but weak accuracy on a per-file basis. Avg NPS takes this quality and magnifies it to an intolerable degree. Both the frequency and magnitude of error produced by simple avg NPS ratings are severe (these are important concepts for those of you who aren’t particularly statistically inclined; in fact, they are the tools with which we may measure the “usability” of a system), and most communities would rather rely on nothing at all than average NPS.
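
To make “frequency” and “magnitude” of error concrete, here is a small hedged sketch; the `error_metrics` function, the 3% margin, and the numbers fed into it are all assumptions for illustration, since true difficulties are not something we can actually look up:

```python
# A hedged sketch of the two error metrics named above; the error_metrics helper,
# the 3% margin, and the example numbers are all assumptions for illustration.

def error_metrics(estimated, reference, margin=0.03):
    """Return (fraction of files whose relative error exceeds the margin,
    mean magnitude of relative error)."""
    rel_errors = [abs(e - r) / r for e, r in zip(estimated, reference)]
    freq_outside = sum(err > margin for err in rel_errors) / len(rel_errors)
    mean_magnitude = sum(rel_errors) / len(rel_errors)
    return freq_outside, mean_magnitude

# Hypothetical avg-NPS-derived ratings vs. community consensus ratings
print(error_metrics([10, 14, 22, 30], [12, 14, 18, 31]))
```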

Avg NPS fails on two major fronts: first, it makes no accounting for how notes are configured within files, and second, it makes no accounting for how a file plays, or its pacing. A 5 minute file with a few moderately difficult sections may produce the same avg NPS as a 2 minute file with consistent difficulty. Even worse, a 5 minute file with a single section so difficult it becomes unpassable may produce the same average NPS difficulty rating as both of the above. The magnitude of error is large enough to place files that are too easy for a player and files that are too difficult for that same player in the same category, and while this is unacceptable, it isn’t even the most significant of the issues.

There is an incredible level of variation in the patterning and flow of charts (flow will be the general term used to refer to the specific pacing of individual charts, whether they are 5 minutes of unbroken stamina play, short bursty files, etc). Even packs that are themed around a specific pattern type, with very narrow ranges of musical selection, produce charts that play out much differently from one another. Take all of Icyworld’s Sharpnel files, for instance. They’re all basically the same thing, and yet each of them plays out uniquely.

We’ve identified that pattern configuration and flow are the two wrenches in the avg NPS machine, and if we are concerned with estimating frequency of error (normally we would be estimating frequency of error within a specified margin of error, but that requires our error and the margin thereof to be an actual measurement of something, which comes later. Also, in this case the evidence is so overwhelming an exhaustive inquiry is unnecessary. If you don’t believe me play any 5 ranked maps in Osu! and tell me I’m wrong) we are concerned with the degree to which the intersection of file flow and patterning produces charts that are treated significantly differently by an avg NPS system.

You can either work forward by asserting the degrees to which avg NPS would underrate or overrate a sample of files or backwards by taking files with the same NPS ratings and re-ranking them on your own internally, but the conclusion you will arrive at is the same. In both scenarios your internal ranking (remember, I’ve already claimed that this methodology is systemically flawed) will reveal that avg NPS will either underrate or overrate the vast majority of files.

This is the fatal flaw that renders the system unusable. With a frequency of error likely in excess of 90% (here I’m estimating the frequency of error a rigorous study on the subject might produce) it is impossible to generate a reliable baseline of ratings for comparing other ratings produced by the same system in order to provide any useful information in the first place. This is true of any system with a high frequency of error. If you’re wondering right now, “so what constitutes a high frequency of error?”, you’ve already got the right mindset, but that query requires nuanced investigation into the subject that we aren’t ready for yet so just keep it in the back of your mind for now.

So we know avg NPS sucks, and we know why it sucks. For a time the concept of “NPS + fudge factor” was introduced and in use, and if it sounds like people just took average NPS ratings and pushed them up or down based on what was most obviously overrated or underrated by NPS, that’s because that’s what it was. While conceptually this was the correct idea, in practice the degree of adjustments that had to be made essentially turned it into a community derived system and it inherited all the faults of differing subjective views of difficulty on top of avg NPS being incredibly unreliable. And as with many ideas that might have worked with a little more effort, but didn’t, the practice was abandoned and avg NPS assumed a negative reputation.

Yes, this foray into the annals of history does have a purpose; the following section highlights a fundamental underlying concept behind this discussion of difficulty so pay attention.

We learned that avg NPS is unreliable and we learned why it was unreliable, but in the process we forgot why we wanted to use it in the first place. Every player, whether they are consciously aware of it or not, understands that there is a fundamental relationship between NPS and difficulty. We know avg NPS provides accurate trends of difficulty despite its uselessness on a per-file basis. What’s interesting is that when community derived difficulty values for large enough sample sizes are plotted against average NPS, the result is a strikingly strong linear relationship.
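
As a rough illustration of the kind of plot described here, the sketch below fits a least-squares line to a fabricated sample of community ratings against avg NPS; the data values are invented, only the shape of the exercise matters:

```python
# A rough illustration of the plot described above; the data points are fabricated,
# only the exercise (least-squares fit of community ratings against avg NPS) is real.

import numpy as np

avg_nps_values = np.array([4.0, 6.5, 9.0, 12.0, 15.5, 19.0, 23.0, 27.0])
community_rating = np.array([5.0, 8.0, 11.0, 14.0, 18.0, 21.0, 25.0, 29.0])

slope, intercept = np.polyfit(avg_nps_values, community_rating, 1)
r = np.corrcoef(avg_nps_values, community_rating)[0, 1]

print(f"difficulty ~ {slope:.2f} * avg_nps + {intercept:.2f}, correlation r = {r:.3f}")
```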

Given all of the incredibly problematic issues with avg NPS it’s remarkable that any clear trend is visible at all. Now let’s take a quick step back and imagine something curious: what if avg NPS didn’t have those problems? What would we see? If the community derived difficulties were perfect, we would see a relationship, perhaps linear, perhaps describable by another mathematical function; regardless, we would be observing that there is a mathematical function through which a value for difficulty may be accurately derived from a physical characteristic of a file, average NPS. In essence, we would be able to “measure” difficulty.

“But we can’t!”

Well, we know what the problems are, and we know we get what we want if the problems go away. What is the logical conclusion? No, we don’t give up, we fix the problems.

“Hold on, even if we derive a value for difficulty based on NPS, wouldn’t we just be measuring NPS, what’s the point of difficulty then? Difficulty still has no meaning and is entirely subjective.”

Yes, excellent point that none of you were thinking! Before we can continue, difficulty must be defined. We’ve already established that it is mathematically relatable to NPS under ideal conditions, but is there another measurable element of reality that it is also related to?

The quickest way to arrive at an answer is to ask the question, “why do we care about difficulty in the first place?”. For most players a value for difficulty represents a primary indicator of how much a player will struggle on a given file, or more measurably, their performance on the file; the secondary, and less preferred, indicator of course is actually playing the file. If difficulty values are relative to one another within reasonable bounds, players can make fairly strong estimates of which files they will perform better or worse on, based on their performance on files at similar difficulty ratings. The magnitude of the difference indicates the degree to which a player will struggle; playing files rated far above what they’re comfortable with translates into a proportionate decrease in performance.

For other players, difficulty represents an indicator of player skill. Achieving the same score on progressively higher rated files means you are becoming better as a player. Achieving the same score on a higher rated file than another player achieves on a lower level file means you’re better than them. The magnitude of the difference indicates how much better, or worse, you are, compared to another player.

The bad news is I basically just told you, again, that difficulty is subjective. The good news is that difficulty is a human construction and we can define it however we want. We just need to define it in a manner that actually produces results we care about and root it in measurable terms.

Don’t think about it too hard yet, follow the logic. We want another quantitative concept to mathematically relate to difficulty, and lucky for us, we just found two.

The process of constructing a common language in order to better address the question of “what is difficulty” necessitates explicitly defining various concepts, preferably quantifiably. Even if we know what “difficulty” is, the problem continues to persist if everyone has different definitions of what contributes to difficulty. Also, explicit definitions facilitate accountability. This is important conceptually, however, not very relevant right now.

Let’s deal with player skill first.

“Haven’t we already defined player skill as difficulty, which is just NPS? So player skill = NPS. What’s the point?”

In practice, holding this perspective probably won’t negatively impact your experience navigating a difficulty system; however, here we are primarily concerned with a fundamental understanding of concepts, and only secondarily with practical usage. Difficulty is a concept through which we can make an estimation of the skill level of a player. Given a large sample of player scores on files of various difficulties we can infer a reasonably accurate measurement of that player’s true skill, but it is absolutely not the same thing as player skill.

“So what is player skill, and what do you mean by true skill?”

The article up to this point has been concept introduction and explanatory material, an informational lift hill, if you will. This is now the informational drop.

While most people don’t necessarily think of player skill as a quantifiable value, much like most things we can see or experience in the world, we can establish relationships and trends when controlling for the proper variables. Player skill is an intersection between time invested in playing the game and rate of improvement per hour spent. The more time you play the game, the better you become; the more efficiently you play, the faster you get better. For the set of all players who improve at exactly the same rate, the sole determinant of player skill, and of their strengths relative to one another, is the number of hours invested into the game. For the set of players who have invested the exact same amount of time into the game, the sole determinant of their relative strength is their rate of improvement. The set of players who have invested X hours into the game at Y rate of improvement all have the exact same level of skill, and this is true for any value of X or Y.

Now before objecting, first, recognize that rejecting the above assertions is the logical equivalent of stating that player skill is independent of either time invested or player improvement rate, or both. In any case, the logical conclusion that follows from this is that player skill is either entirely random, or dependent upon a variable not considered. We would then observe large variations in player skill that resulted in players being able to AAA files one day that they couldn’t pass at half the speed the day prior, or the reverse.

And no, I don’t care how badly you randomly played that one day, it’s not statistically relevant. If the asserted relationships between player skill, time played, and player rate of improvement were incorrect, you wouldn’t have even mentally cataloged playing badly one day because it would be the norm.

Now that being said, an important point is brought up. We know players tend to vary in performance from day to day or week to week, sometimes the underlying cause is apparent, such as pain or discomfort in the hands or shoulders, or taking a few months away from the game, but often there is no obvious catalyst for drastically reduced performance. This can be frustrating and is an interesting topic for discussion and conjecture, but currently it is irrelevant.

For our definitional assertion of player skill I am going to make the clear distinction between true player skill and observed player skill. Scores are the product of observed player skill. We use them as estimates of true player skill because we assume that on average, player scores faithfully represent a player’s true skill. However, while this may hold up when discussing averages it is very obviously not true on a score by score basis. So then the question is why?

The game is almost entirely comprised of building muscle memory; the two other major elements of play are executing on that muscle memory and properly parsing the visual triggers for execution. Many things can adversely affect the second or third process: being sick, playing at an uncomfortable angle, a new monitor, being tired, sore muscles, a sour mood, temperature, a Windows update lagging your PC in the background, and so on and so forth. These and many other factors can interfere with muscle memory execution or recall; however, you don’t just lose muscle memory.

Players who quit for a year and resume playing are nearly always, within a month or so of returning, as good as or better than they were when they quit. Maybe you lose some muscle, maybe your setup isn’t exactly the same as when you quit, but eventually you recoup the skill you “lost”, because “skill” is fundamentally a measurement of your written muscle memory for the game. You aren’t regaining “lost” skill; the skill is still there, you just aren’t quite ready to use it yet.

Furthermore, even within the same session a single player will not produce a set of scores that all yield the same estimates of player skill. Given a skill X and file difficulty Y, players will not achieve a constant and predictable score; they will only achieve scores drawn from their performance distribution. This is another important concept that will be revisited soon.
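
A tiny simulation makes the idea of a performance distribution concrete; the normal model, its mean of roughly 93%, and its spread below are assumptions chosen purely for illustration:

```python
# A tiny simulation of a "performance distribution"; the normal model, its ~93% mean,
# and its spread are assumptions chosen purely for illustration.

import random

def one_trial(mean_score=0.93, spread=0.01):
    """One attempt: a score drawn from the player's hypothetical performance distribution."""
    return min(1.0, max(0.0, random.gauss(mean_score, spread)))

session = [one_trial() for _ in range(10)]
print([round(s, 4) for s in session])  # ten attempts by the same player, ten different scores
```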

Now, there are a couple simple facts that we must accept before continuing.

First: We can never know the true skill level of a player, we can only estimate it based on information provided. Second: Even if we were told the true skill level for a player by an omniscient being, we would be incapable of verifying whether or not it was accurate.

Now, yes, this sucks, but if we knew everything we wouldn’t need statistics. Unfortunately, we don’t know everything; thankfully, we do have statistics.

Let’s take a model population: an infinite population whose distribution for any data point or intersection of data points perfectly models our own. This population is infinite, so we have an infinite set of Stepmania players to work with, cool. There are an infinite number of players at all skill levels (including those we have not yet reached) with every permutation of relevant traits, and we know that our collective population is a sample of this model population. Since this is all imaginary, we can even say that all values for all data points for all members of this population are known. We’ll call this population P. The P is for population, so “population P” is redundant, I know.

Following from our establishment of a mathematically representable relationship between time invested and true player skill (or the library of constructed muscle memory), let’s take the set of members of population P who improve at the average rate of skill per hour invested. Since their true skill values are known, we can plot them against hours invested into the game, also known. The result is a distribution that indicates to us, for an average player, the number of hours required to attain a specified level of skill.

True player skill X will be defined as the number of hours required to reach an observed skill level for a member of population P with a rate of improvement per hour representative of the perfect average of the population.
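To make this definition operational, here is a sketch under stated assumptions: if we had the (hypothetical) skill-versus-hours curve of a perfectly average improver, true skill would simply be that curve inverted. The `average_skill_curve` below is a made-up placeholder, not a measured curve:

```python
# A sketch of the definition above: true skill is the number of hours a perfectly
# average improver would need to reach the observed skill level. The skill curve
# used here is a made-up placeholder, not a measured one.

import math

def average_skill_curve(hours):
    """Hypothetical skill reached by an exactly-average improver after `hours` of play."""
    return 10 * math.log1p(hours / 100)  # diminishing returns, purely illustrative

def true_skill_in_hours(observed_skill, lo=0.0, hi=1e6):
    """Invert the average curve by bisection: the hours at which it reaches observed_skill."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if average_skill_curve(mid) < observed_skill:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(true_skill_in_hours(20.0))  # hours an average improver would need to reach "skill 20"
```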

Now that we have an explicitly defined concept of what true player skill is, we can establish its relationship to difficulty.

Recall the second reason we care about difficulty. Players who have a rough idea of their performance on files within a specified difficulty range can extrapolate those performances and formulate a relatively strong approximation of how they will score on files above and below the initial difficulty range. Now, how correct these approximations are depends on a number of things: how accurate the difficulty system is, how confident the player is in their initial estimation of their performance at a given difficulty, and how well they estimate how a given differential in difficulty translates into a differential in performance.

For the moment, we’ll excuse ourselves from reality and return to the universe in which population P exists. In this universe, we know everything, and not only do we know everything, we know it to an infinite degree of accuracy. As each player of population P has a known true skill level, every file ever played by any member of population P has a known true difficulty level. This constitutes the population F (file population) for population P, and it will inherit the same qualities we have previously ascribed to population P.

We’ve asserted that players use difficulty ratings to guess at expected performance on files; this is because there is a clear relationship between the skill level X of a player and the difficulty level Y of a file. Players inherently know this to be true, and it can be observed as a trend on larger scales even within our own population for the data we have access to. Indeed, this is the entire purpose for which difficulty ratings exist in the first place, and if a rating fails to fulfill the purpose for which it is designed, it is useless. Hence the universal movement of rhythm gaming communities away from inadequate difficulty systems.

Anyway, members of population P do not need to guess. In order for the intended purpose of difficulty to be fulfilled, every member of population P at true skill X must perform exactly the same on the same file, which has a known difficulty value of Y. Now, this does not mean that every player given a single attempt on the file in question will obtain the exact same score. Remember, scores are not measurements of skill, but estimates. Furthermore, if each member played the file a second time, not every player will obtain the exact same score on their second attempt, and even the differential between their first and second scores will vary.

This is because a single score is only a single trial of a test for which there are an infinite number of outcomes. Even if every player in question attains a AAAA grade on the file (in Stepmania, given default timing windows, the best score attainable), their scores are only identical from the perspective of the game’s judgement system, or in other words, the degree of accuracy with which we are concerned. Though we do not particularly care about performance differences on the order of +-0.0001ms timing windows, they do exist, and in our imaginary universe these measurements can be known to an infinite degree of accuracy.

Given an infinite set of trials for a random player of true skill X on a random file of true difficulty Y, there exists an infinite set of outcomes spanning all permutations of judgement timings to all degrees of accuracy. Each outcome has a probability of being produced given a single trial by the player, and each outcome has a probability of being produced within N trials by the player.

Now, at the moment this doesn’t really help us very much, even if we had all this information readily at our fingertips we simply do not care about the vast majority of it. We aren’t interested in variation in accuracy below the judgement windows for our game, and we aren’t particularly interested in bivariate distributions.

We need to explicitly state the parameters we are interested in. Most if not all rhythm games operate on some variation of a point system based on accuracy within judgement windows. The closer you are to hitting a note at its exact point in time, the more points you get. The more points you get, the higher your score is.

The specifics of a given scoring system, such as how impactful your deviation from perfect timing is and where grade cutoffs are, or whether there are even grade cutoffs at all, vary from game to game. However, we are unconcerned with the specifics of any given system; we are only interested in the relationships it reveals:

If true skill X increases and file difficulty Y remains constant, higher values of X will produce higher values of accumulated points Z.

If X remains constant, and Y increases, Z decreases.

The increase or decrease of Z is proportional to the increase or decrease of X or Y if the other variable is held constant.

For any two values of X and Y there exists an unchanging value for Z regardless of the specific player or file in question.

The relationships between X, Y, and Z may be expressed mathematically, as in the sketch following this list.
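
One way to read those statements, as a hedged sketch rather than a real scoring model: treat Z as some monotonic function of X and Y. The `expected_points` function below is invented purely to satisfy the listed properties:

```python
# A hedged sketch, not a real scoring model: expected_points is invented purely to
# satisfy the relationships listed above (Z rises with X, falls with Y, and depends
# only on the pair of values, not on which player or file produced them).

def expected_points(skill_x, difficulty_y):
    """Hypothetical expected point proportion Z for skill X on a file of difficulty Y."""
    return min(1.0, max(0.0, skill_x / (skill_x + difficulty_y)))

assert expected_points(20, 10) > expected_points(10, 10)   # X up, Y fixed -> Z up
assert expected_points(10, 20) < expected_points(10, 10)   # Y up, X fixed -> Z down
assert expected_points(15, 15) == expected_points(15, 15)  # same (X, Y) -> same Z, always
```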

Now, let’s be clear. If a specific scoring system in a game produces results that do not follow these trends, then the scoring system is erroneous in producing measurements of difficulty or player skill, or both, not our definition of difficulty.

An obvious example of this is machine score in Stepmania. Over time it has become readily apparent that higher or lower machine scores have very little to do with an actual score, in fact, despite machine score still being displayed in nearly every theme’s score results screen, it’s unlikely that any serious player has looked at it in quite a few years, I know I haven’t.

It is important to note that these relationships are constructed based on collapsed score distributions, or averages. For now we’ll state that a value of difficulty Y is representative of a value for player skill X for which the average of a resulting score distribution first exceeds a specified value.

If we look at any score as a distribution of ms differentials of key inputs from their intended target arrows’ perfect accuracy, or their 0ms point, we know that we can obtain a probability of occurrence for any specified ms distribution; however, we don’t want to be inserting distributions as an input. That’s, like, a lot of work without ms distributions being an output provided by the game engine (hint hint).
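
To show what collapsing an ms distribution into a score might look like in practice, here is a minimal sketch that turns a handful of per-note ms offsets into a point proportion Z; the windows and point weights are loosely in the spirit of a DP-style system but should be treated as illustrative numbers, not Stepmania’s exact values:

```python
# A minimal sketch of collapsing per-note ms offsets into a point proportion Z.
# The windows and point weights below are in the spirit of a DP-style system but
# are illustrative numbers, not Stepmania's exact values.

JUDGEMENTS = [          # (window in ms, points awarded)
    (22.5, 2),
    (45.0, 2),
    (90.0, 1),
    (135.0, 0),
    (180.0, -4),
]
MISS_POINTS = -8
MAX_POINTS_PER_NOTE = 2

def point_proportion(ms_offsets):
    """Z: accumulated points as a proportion of the maximum attainable points."""
    total = 0
    for offset in ms_offsets:
        for window, points in JUDGEMENTS:
            if abs(offset) <= window:
                total += points
                break
        else:
            total += MISS_POINTS  # outside every window
    return total / (MAX_POINTS_PER_NOTE * len(ms_offsets))

print(point_proportion([3.1, -10.2, 25.0, 60.0, -140.0]))  # one sloppy hit drags Z down to 0.3
```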

For the moment, we’ll imagine that our scoring system in our perfect universe is, well, perfect. An ms distribution is used to determine the proportion of total points gained, and the score grade is subsequently determined from that. In our perfect system any ms distribution that results in Z total points accumulated on file Y necessarily requires the same input true skill level X from a player. Also in our perfect system, the relationships between X, Y, and Z are independent of the specific value of Z (the fact that this is untrue in reality becomes important later).

Now instead of having an infinite set of probabilities for an infinite set of possible scores, we have a slightly smaller infinite set of probabilities for any given total proportion of points gained. Thankfully, in reality, judgement windows force upon us a relevant degree of accuracy and furthermore, at least in Stepmania, point proportions on the score screen are rounded to two decimals. Let’s use this as our declared “care level” (degree of accuracy).

What do we have now? Well, now we have roughly 10,000 possible input Z values, for each of which we may calculate the value of X expected to return that Z on average given an infinite set of trials on a file of difficulty Y. Naturally, if we pick 0 as our Z value, the calculated value of X is 0 regardless of our value for Y. This is a fairly long-winded proof that it takes no skill to fail out of a song; thankfully, we’re getting more out of this than that.

What should become obvious at this juncture, if it wasn’t already, is that a single value for difficulty depends on a stated expectation of point gain. While we may assert any value we wish to standardize difficulty, and indeed in our perfect universe it would be irrelevant which value we declared, it makes logical sense to tailor it to the landscape of the current metagame for your game.

In Stepmania the de facto score grade by which players are judged is AA, or 93% of total points accumulated by the player within the scoring system. It is important to recognize why this has become the case. An A grade requires only 80% of total points to be accumulated, while AAA grade requires 100% of points to be accumulated (in our scoring system, the final accuracy judgement has no effect on proportion of accumulated points, there are arguments to be made for and against this system, however that is a discussion for another time).

Given timing windows and the judgement system, A grade requires a low enough proportion of total points to be gained that not only can players “mash” their way through files (keystrokes that are guided not by the arrangement and placement in time of notes, but merely by the existence of them), they can do it on files that they are unable to read properly (if they could read it, they wouldn’t be mashing).

If we refer back to the definition of player skill, when scoring becomes decoupled from reading capability and from guided development of muscle memory through directed action, scoring also becomes decoupled from skill. Yes, this is also a very, very long-winded proof that mashing takes no skill.

AAA grade runs in the opposite direction, requiring players to hit no arrows outside a 90ms timing window. The reality of play is that once players are capable of a high enough skill level to AAA files, the files in question are much less appealing to play (this concept will be expanded greatly upon later). Technically, the probability of a player AAA’ing a file is never 100%. Players can fumble, or stop paying attention, or have a heart attack, or be crushed under a whale that recently materialized in the upper atmosphere.

When considering AAA grades, minor errors with small probabilities often nullify scores. Players often observe a near-AAA score and mentally discount it, citing that an AAA would be more impressive. When we look at this from the perspective of a perfect difficulty system, given a Z value of 100% versus 99.99%, we would find that the expected true skill levels for both queries are virtually identical; however, this fact is not represented in most attitudes.

The reality of the situation is that we don’t have an infinite number of trials to expend on attempting to AAA something; the reality is that minor errors nullify the perceived skill required to obtain a near-AAA score more than they should; the reality is that many players find files they should be able to AAA too far below their skill level to be enjoyable; the reality is that certain file characteristics and pattern configurations affect difficulty very differently depending on which point proportion is targeted, and that in many cases, at the AAA level, the portions of a file (portions, plural, yes) that fall within relevant difficulty ranges become incredibly small relative to the total expanse of the file; and the reality is that, because of all these things, once the requisite level of skill to AAA a file is achieved, whether or not a given player AAA’s it is largely dependent upon their number of trials.

I’ll restate the above for emphasis, and also in case you got lost. AAAs are about either grinding, or about being so much better than you actually had to be to AAA the file that you might as well be trying to quad it. Playing files for which you are significantly above the skill required to AAA tends to be unappealing. Most players tend to play files fairly close to their skill range, and regardless of whether the player in question is focused on speed or accuracy, the point at which just barely AAAing files lies relative to player skill tends to be outside of the normal range of comfort.

Now, grinding is not a measure of skill. Most people know this without knowing precisely why, so let’s approach the issue logically. When we look at scores obtained by players we generally do not look at their number of trials or the average score of their trials; we only consider the highest score obtained, which definitionally had the lowest probability of attainment if player skill stayed constant. Grinding is a method by which players may skew their observed score distribution, and the resulting estimation of player skill, in their favor.
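
The arithmetic behind this is worth spelling out: if a single honest attempt clears some score threshold with probability p, the best of n attempts clears it with probability 1 - (1 - p)^n. The numbers below are invented purely to show how quickly grinding inflates that chance:

```python
# Illustrative only: if one attempt clears a score threshold with probability p,
# the best of n attempts clears it with probability 1 - (1 - p)^n.

def prob_best_clears(p_single_trial, n_trials):
    return 1 - (1 - p_single_trial) ** n_trials

print(prob_best_clears(0.02, 1))    # 0.02  -> a single attempt
print(prob_best_clears(0.02, 200))  # ~0.98 -> the same player grinding the file
```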

This is not to say we should begin taking averages of players’ scores; that’s ridiculous, we would never be able to control for scores grinded out at a specific skill level, and the entire principle is antithetical to much of the game. Indeed, while players who grind scores out may skew their scores in their favor, in many cases this will hurt them in the long run. If a player has obtained most of their best scores through grinding, even if they improve they have essentially forced themselves to continue grinding in order to produce scores that reflect their improvement.

Instead, it is our duty to construct a system for rating players that takes scores as estimations for player skill and aggregates them in a way that produces the best estimation of a player’s skill. This is an interesting topic and there is a lot to discuss when considering determining player ratings from scores which you can read about here.

The result is that, at least in Stepmania, AA grade has been tacitly adopted as the standard to which players are held, and to which they play. The issues that present themselves when considering A grade and AAA grade scoring are also apparent in AA scoring, though their effect is much diminished. Files which are shorter and contain short sections with sharp spikes in difficulty tend to favor a grindy approach, and even within the bounds of a 93% score goal it is certainly possible for players to mash elements of the file they are unable to read. The two issues are not necessarily mutually exclusive either; it is much easier to mash your way through a section you have difficulty reading to get a better score if it’s a short file and the amount you have to mash is limited in scope, so the issue becomes compounded and grinding becomes very appealing. Scores which just barely scrape the edge of AA grade, but don’t officially pass the 93% mark, are held in disproportionately low esteem relative to the virtually indistinguishable level of skill required to achieve them.

Regardless, the magnitude of these effects is much less pronounced and limited to a much smaller selection of files. I will personally state that, as a player, I find the 95-97% score goal mark (depending on the file) to most likely have the strongest correlation with true skill. This is obviously a personal opinion and I do not expect everyone to agree with me; however, the fact that I hold this opinion is important.

In Stepmania’s current competitive metagame, insofar as it actually exists, 93% grade has been adopted as the standard benchmark for “speed” players due to the default AA grade cutoff being universal and easy to point to. It is understood within the community that it is fairly silly to judge accuracy players by their AAs, and the “accuracy” camp has a separate school of thought. For now I’ll only be discussing “speed” players, and I’ll be using the term globally to refer to all skillsets that are not pure accuracy. If you don’t know what skillsets are, it turns out they’re kind of important, and you can read about them and their impact on perception of difficulty and the general metagame here.

While it’s arguable that a more appropriate cutoff could have been established, the decision making of the community tends to be largely unguided and unconscious, and this is fine. However, despite most players understanding that 97% on a file requires a higher level of skill to obtain than 93% on the same file, and placing emphasis on this fact on a per-file basis, when it comes to the larger perspective the scores that surpass 93% by large margins are often viewed as being the same as 93% scores. The result is that players whose best scores lie within the 95%+ range are underrated compared to similarly skilled players whose best scores lie just at the 93% mark.

The issue at hand is not whether or not a 97% score is better than a 93% score on the same file, but how we relate a 97% score on one file to a 93% score on a completely different file. This can be done if the two files and scores are carefully taken into consideration; however, it is not reproducible on a larger scale. Essentially, to be a speed player whose true skill level is most accurately estimated, it is necessary to achieve as many barely obtainable AA scores as possible.

For the moment we’ll ignore the fact that, for most players who strive for a point of equilibrium between speed and accuracy, this sucks, and focus on the statistical implications. Most speed players tend to enjoy playing files they can reliably obtain between 91% and 98% on, and in addition this range of difficulty is the most effective range to play at in order to develop yourself as a speed player. It is my guess that, on average, between warming up and pushing for new AAs, most players divide their hours invested into playing the game fairly equally between files on which they can be expected to score 91%, 92%, 93%, and so on and so forth. If anything the distribution is likely skewed towards easier files, or files players will on average score higher percentages on.

We know a single score is the result of a single trial of a player on a given song, and we know that performance far outside the expected norm, while rare, may occur. Most players refer to such scores as “flukes”, and in turn “flukes” are often the pillars upon which the legacies of players are built. The problem is, if every score above 93% is treated as having the same value as a 93% score, and 91-92.99% scores are similarly underweighted, we force only a tiny fraction of the average player’s time invested to fully “matter”, and their 1/1000 probability scores must lie within this tiny fraction of play to properly influence the perception of their skill. This is ridiculous from both a practical standpoint and a system efficacy/accuracy standpoint.

Returning to the two major camps of players “speed” and “accuracy”, the reason they have been segregated is because the physical characteristics that influence pure accuracy, or AAA+ grade, and speed, are too different to be relatable in a coherent manner. In fact, they require significantly different development and execution of muscle memory as well as reading skill.

Now, we also know that the same principle applies to 91% through 97%, which is why attempts to relate them on a mass scale have been avoided. On the other hand, assuming we properly construct a methodology for accurately determining difficulty for 93%, can we expand it to include 91%-97% scores to a degree of accuracy that is reasonable?

The answer turns out to be yes and it will be discussed more later, but the first question we must answer is, what is a reasonable degree of accuracy?

We know that a true difficulty for every file exists, and we know that as players, given enough time and experience and through consensus, we’re capable of producing estimates of true difficulty within a certain range of error. Independent of what the actual margin of error is, our explicit goal for accuracy for an algorithmically determined difficulty system is simply to exceed the margin of human perception error (for now I’ll just roll frequency and magnitude of error into one, but each is important in its own way). It is essentially the point at which we are no longer capable of discerning whether or not what we’re being told by the system is wrong. As it turns out, this is a fairly low bar; however, it’s important to realize that an algorithmically determined difficulty system that does not produce results within the margin of human error is functionally worthless.

Now let’s backpedal a bit, we know the standard score goal for gameplay in the metagame is 93%. We might surmise that a more apt value can be chosen, however for now the decision is given to us. The difficulty rating we are concerned with is specifically the difficulty to obtain a 93% total proportion of points on a given file.

Ok cool, that was pretty simple, now let’s backpedal a little further, recap, and figure out again what the point of the last 10 pages was.

1. We know every player who plays or has ever played the game has a true skill level. True skill level is measured in the hours an average player would have to invest into the game in order to construct the same library of muscle memory as the player in question. True skill level is independent of observed skill level and any factors that may influence it. True skill level is also independent of the specific muscle memory constructed; it is only a measurement of volume. We can never know the true skill level of any player, only estimate it, and we can never verify whether or not our estimation is correct; we can only assert the degree of confidence we place in it.

2. We know every file has a true difficulty level. If NPS is a perfect representation of difficulty, NPS = difficulty level, and it may be measured and scaled or transformed. Files are configurations of notes that span a length of time, and any subsection of a file also has a true difficulty level. This includes any 30 second section, 3 second section, or 3 note section. We are incapable of knowing the true difficulty level of a file, only estimating it, and again, we can only estimate the accuracy of our estimations, never verify them. Note that while technically each individual note within a file also has a true difficulty level, NPS cannot be used to measure it due to division by a time segment of 0ms.

3. If we could control for every factor that contributes to a discrepancy between observed skill and true skill, players with skill X on files with difficulty Y would have identical score distributions for any combination of the values of X and Y, and the averages of these score distributions are understood to be errorless indicators of true skill level. If we control for any two of the three variables, the resulting values will always be transformable through a single power exponent to fit a line.

Now we are ready to declare a definition of difficulty:

Difficulty values are estimations of the level of true player skill required to produce score distributions that average a cumulative point proportion of 93%, given an estimation of true file difficulty. Due to the relationship between true skill and true difficulty, our value of difficulty may be interpreted as either or both, and scaled to reflect measurements of true skill (hours), true difficulty (NPS), or any system that controls one or more variables in order to produce constants during progression. Player scores on a file with a given difficulty are estimations of player skill, and a system can be created that aggregates scores and produces a single estimate for player skill.

Cool, now we know what we’re talking about. We’ve defined the elements relevant to the discussion of difficulty. In our model universe, we would be able to derive estimations of difficulty based on NPS that would produce identical values to their known true difficulty values. We would be able to take a score distribution from a model player and reverse calculate their player skill level, and verify our results. We would be able to take any score on any file and compare the requisite skill level to obtain this score to any score on any other file. In other words, we have created the perfect system in our perfect universe.

Ok, we got through the easy part. Now the hard part is hard because of a simple truth: our universe isn’t perfect. We don’t know everything, and we don’t have infinite data or the resources to generate it. Even if we did, we’d die before we could finish properly analyzing it.

Now, we didn’t just waste all this time, well, we would have if we chose this moment to give up, but at least I didn’t. We’ve identified relationships and trends that we know we would observe if we knew everything. We’ve created definitions for concepts we can use to discuss difficulty and in the process we have given ourselves the tools we need to understand what difficulty is. None of this changes just because we don’t know everything.

Conceptually the system is bulletproof; it is designed to give us exactly what we want so long as we give it what it wants, and therein lies the second challenge of difficulty. We know the circumstances under which this system will work, and we know the elements of reality that prevent it from working. So now it’s time to fix the problems.

Part 2: Estimating Difficulty

The question of how to proceed is perhaps fairly daunting. Initially it may be unclear what we should do with the understanding of difficulty we have constructed. While for the reader this may be the case, when given three weeks or so to contemplate the issue, multiple approaches to the problem at hand became clear and the problem became selecting the best one. You can trust me on that, but I’ll discuss the different approaches and their relative strengths and weaknesses at another time; for now we’ll ask ourselves: what do we want?

We want a system that provides us with usable difficulty ratings with the least amount of developmental time, effort, and complexity. In short, we want to pick the shortest path that gets us to where we want to be. The use of the word usability here might be confusing, so I’ll elaborate, though it mostly has to do with the concepts of statistical error and confidence; if you don’t need this particular primer feel free to skip ahead (a few pages).

Error is not a concept we can eliminate or avoid. All difficulty systems have error, whether they are created by the original game designers, whether they are community derived, or whether they are created as a by-product of a programmatic system. Error exists whether or not you are aware of it, and accepting that it exists is in no way a sacrifice, it is merely the accommodation of a reality that already exists, you know, basically what we do every day of our lives.

Error comes in three forms, frequency, magnitude, and the intersection of the two, or, the frequency of error given a defined range of magnitude. At lower ends, particularly for novice files, communally derived difficulty delivers excellently on both fronts. There are plenty of players to offer different perspectives on the files in question and large amounts of data regarding performance may be collected and used to reinforce opinions.

Furthermore most games contain few novice level files, largely because certain pattern usage is prohibitive, and because newer players (at least, the ones that continue to play) tend to advance fairly quickly through the ranks. There isn’t a need for a large volume of novice files because by the time novices play through enough of them, they are no longer novices.

This is an odd situation. On the one hand, we wish for our beginner levels to have a very high level of accuracy. Novice players, as stated before, tend to take difficulty ratings at face value and use them to guide their progression through the game. Inaccurate difficulty can be off-putting to new players; they are less likely to blame the difficulty system than to blame themselves for being incapable of performing consistently across similarly rated files, even if in reality those files are much farther from each other in difficulty. They simply lack the experience to be able to make an informed decision for themselves.

On the other hand, from a practical standpoint the lower end of the difficulty spectrum is the section that players advance out of the most quickly. Having perfect accuracy is unnecessary because by the time a player accrues enough scores on enough files, the process of doing so will necessarily have advanced them much further.

It is much more important to have perfect accuracy at the very top end (from the perspective of the active top level players, at least). At the very top ends even small margins of error in difficulty rating can place files years apart in true skill level at the same rating.

Let’s use the concept of perfect NPS as an example: moving from 1 to 2 NPS files may take a novice player a few hours. Moving from 28 to 29 NPS may take a top player with years of experience another 4 months of practice and play. In both cases NPS is increased by 1, however the time required to generate the skill necessary to make the improvement is vastly different. Not only is it vastly different in absolute terms, but the relative difference between 1 and 2 is far greater than the relative difference between 28 and 29. This is the basis for the statement that movement along the NPS line of skill, per hour invested, is logarithmic.
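
A quick numeric sketch of that claim: if the extra hours needed to add one NPS of skill grow roughly geometrically (the base hours and growth factor below are invented), then NPS plotted against total hours invested bends over logarithmically:

```python
# A numeric sketch of the claim above; the base hours and growth factor are invented.
# If each additional NPS of skill costs roughly 20% more hours than the last, NPS as a
# function of total hours invested looks logarithmic.

def hours_to_reach(nps, base_hours=2.0, growth=1.2):
    """Hypothetical cumulative hours needed to reach a given NPS level."""
    return sum(base_hours * growth ** level for level in range(nps))

print(hours_to_reach(2) - hours_to_reach(1))    # ~2.4 extra hours to go from 1 to 2 NPS
print(hours_to_reach(29) - hours_to_reach(28))  # ~330 extra hours to go from 28 to 29 NPS
```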

The underlying concept is that the margin of error should not exceed some defined metric of practicality. We don’t want to be placing two files at the same rating when one takes two more years to be able to AA than the other. This is the magnitude of error. If we wish to base our desired target magnitude of error on a pre-existing model, we have at our disposal communally generated difficulties once again.

In the case of very top end files, difficulty ratings (or if we are looking at “difficulty” from the perspective of “achievement”, the degree of “achievement”), tend to present incredibly high magnitudes of error. It is not particularly the fault of the participants in the endeavor, it is due to a lack of information. When looking at what are perceived to be the top 3 players in any rhythm game (or any game, really) it is often the case that they each have radically different strengths and weaknesses, and skillsets.

This is not because it just so happens that in most games the distribution of top players is guided by some cosmic force to be spread along differing specializations; it is because of another, equally silly cosmic force: human perception bias. Independent of whether a perfect system would label the top 3 players as being the top 3 or not, the default mode of human perception is to take the best player at each skillset and ascribe to all of them the same level of skill. No, it really doesn’t make any logical sense, and yes, we are all guilty of it.

This is an extremely difficult paradigm to break from, and even more so when it is reinforced by the psychological influences of group dynamics. The reality is that if you have 10000 players playing one specific type of file, and 5 players playing another, it is more likely that the top 100 players from your first group are all better than the best player of your second group than it is for the top two to be of equivalent skill level.

Ironically, it is in fact this very concept that so often works against the Stepmania community. In the case of many other VSRGs the argument is put forth that because Stepmania is a relatively small and concentrated community, our players cannot be “as skilled” as those of more populated communities. In this regard Stepmania tends to be an extreme outlier, and it is the result of a single fundamental difference between Stepmania and other rhythm games.

For its entire existence charts played in Stepmania have been created by the community. Conquering old charts creates a need for new and more challenging charts which players can create for themselves at any time in a matter of hours. In most conventional rhythm games, this process of feedback looping is artificially interrupted by level designers and “official” charts. This is however a discussion for another time.

The point is communally derived difficulty systems not only fail miserably at delivering adequate magnitudes of error at the top end, but they are incapable of providing themselves with the necessary information or experience to do so, literally by definition. By the time enough players are good enough at what were formerly the most difficult files or the greatest achievements in order to accurately place them in relation to each other, they are no longer the most difficult!

Now, in order for a system to functionally replace a pre-existing system, it must provide enough improvement in specific areas, or even better, across all areas. In terms of magnitude of error, the bar has been set pretty low. I observed, at the top end, magnitudes of error in the Stepmania community difficulties that I would guess spanned at minimum a year.

This is obviously subjective, and being human I too am affected by human perception bias. That doesn’t stop me from making this statement, it only stops the reader from having absolute confidence in it. This leads, eventually, to an interesting realization. Ultimately, our system only needs to produce values for which, given the range of error incurred by human perception bias, we are unable to determine whether or not error actually exists.

Recall that we cannot create a perfect system, and recall that even if we did, we could never prove that it was perfect. The best we can do is to create a system for which we cannot tell whether or not it is perfect. The result is we are able to place our absolute confidence in it.

So what does this mean functionally? Even in assessing error we cannot truly know what we are talking about in absolute terms, we can only assert estimates of error in our estimates. Thankfully I’ve already done a bunch of the leg-work, have already created a system and exhaustively tested it, and whether or not you choose to accept it, the range of error that I’ve found to be acceptable at the top end is around +-3%.

This is what I view as the borderline of acceptability; this range of error (if you are already familiar with the system it’s roughly a spread of 31.4-34.7) already encompasses years of skill. My guess is a system that produces error within a 1% margin is functionally no different to us than a perfect system. These are the desired benchmarks for magnitude of error for an acceptable, and desirable system, however, they are meaningless without the context of the distribution or frequency of error.

In a way no difficulty we ever assert for a file is correct; we are always wrong (technically you can get around this with clever semantics I know, but let’s stay on target). We cannot care about being right, it is a pointless endeavor, we can only care about how often and the degree to which we are wrong. The previously stated acceptable ranges of error magnitude are only the ranges within which I expect the vast majority, or 95%+, of all files evaluated for difficulty to fall, and with respect to the distribution, I expect a very small portion of those files to have difficulties with magnitudes of error that fall at the ends of the range.

For the files that fall within the specified margin, most will fall within a 2% margin of error, and hopefully enough of them will fall within a 1% margin of error that effective baseline difficulties can be assigned to guide further adjustment. Conceptually a baseline is simple, in fact it is more or less consciously employed in most of the ratings delivered through communal assignment or consensus. You take a file for which you assert the difficulty rating is as close as possible to the true value, and you use it to guide your assessments of other files.

The issue, as previously discussed, is that everyone views every baseline differently. They all have their perspectives and takes on what functionally creates the difficulty in the specified file, and they all approach and utilize the baseline differently. This issue is artificially recreated if the frequency of error of any system exceeds a large enough margin. Baselines only truly help if their relation to other baselines accurately represents the true difference in difficulty. In essence, you need a large enough population of relatable baselines to have any understanding of the values provided by the system.

Failure to do this creates a system that is functionally unusable; it would be like using metric and imperial measurements interchangeably during a physics class. Only instead of just metric and imperial you have an infinite set of differing units of measurement and you don’t know how to relate any of them because there is no standard unit for anything. Right, not only is it unusable, but any reverence or officiality given to such a system is simply harmful.

In other words, ideally we hope to create a system which produces values that do not exceed a 1% margin of error for any file evaluated, and for which the average margin of error is far lower (+-0.2%?). This may not immediately strike the reader as the far-fetched goal that it is, however a much more reasonable goal (actually, the most reasonable, as it is essentially the bare minimum required for any system to be useful at all) is for 90% of evaluated files to fall within 3% margins of error, with the average margin of error closer to 1.5%.

We can expect great outliers to exceed 10-15% margins of error however we expect the frequency of such occurrences to be on the order of 1/10000.

While it may seem superfluous, estimating error is an integral component of any system involving data for which true values are not known. It is the only metric by which we can place a calculated degree of confidence in a system, and more importantly it is the metric we use to construct and improve the system in the first place.

So again, the question is what is the simplest approach we can take in order to produce a usable system? Let’s refer back to the simplest approach we can take, period. Averaging NPS. We know this doesn’t produce usable results, but we also have a fairly good idea of why.

Let’s keep it as simple as possible, in fact, the next logical step is already commonly employed to produce somewhat reliable information about a chart. It is called an NPS graph. An NPS graph is simple (really everything is simple but this is another discussion for another time and place).

NPS graphs splice files into one second segments and display for each segment a value for NPS pertaining to that particular second. This solves a number of issues with using avg NPS as a single value for difficulty indication, however it also retains some of the others. Furthermore it is a visual display of data that must be interpreted by a player before it can have any meaning. These limitations were accepted as facts of life and NPS graphs are taken as they should be, with a grain of salt.

Now, if you’ve been paying attention recall that any set of notes theoretically has a true difficulty. We use and employ NPS graphs because we acknowledge that avg NPS (technically in the case of NPS graphs at 1 second intervals, the avg is redundant) is much more accurate an indicator if the interval of time for which it is being assigned is smaller. Necessarily there exists less variation in pattern and flow on smaller orders of time and we are more confident in the information that a value for NPS relays to us when it only pertains to a single second.
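
To make this starting point concrete, here’s a minimal sketch of both calculations, assuming note timestamps (in seconds) have already been extracted from the .sm file; the function names and structure are illustrative, not the algorithm’s actual code.

```python
# Minimal sketch: average NPS for a whole file and a 1-second NPS "graph",
# given a flat list of note timestamps in seconds. Timestamp extraction
# from the .sm file is assumed to have happened already.

def avg_nps(timestamps):
    """Single average-NPS value for the entire file."""
    length = max(timestamps) - min(timestamps)
    return len(timestamps) / length if length > 0 else 0.0

def nps_graph(timestamps, interval=1.0):
    """Note density per interval-second segment (plain NPS when interval == 1)."""
    start = min(timestamps)
    segments = int((max(timestamps) - start) // interval) + 1
    counts = [0] * segments
    for t in timestamps:
        counts[int((t - start) // interval)] += 1
    return [c / interval for c in counts]
```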

Now, if we want to generate a single value of difficulty from an NPS graph we need to do something with the data we are presented. Averaging the values is only slightly less idiotic than avg NPS on the whole, so what do we do?

We draw upon our experience as players and our understanding of how difficulty is created by the nature of the file, and we translate it into mathematical concepts which we may apply to any file. Remember the discussion of error, it’s ok if we aren’t totally right, we only have to be right enough. As you will see further on, it’s less important to be right than it is to be specific.

A distribution can be drawn from any NPS graph. If you don’t know what a distribution is, then you don’t understand a fundamental concept from which 90% of what is written in these articles is derived because you haven’t googled it yet. So go do that.

When split into intervals, any file can be understood as being comprised of a number of sequentially locked files that span the length of the specified interval. Somewhat like a course, or marathon of files for which an order is predetermined and breaks are timed (yes, I know there aren’t any breaks between successive 1 second intervals in a file, but the point of analogies is to highlight fundamental comparisons between two concepts or objects not commonly related, not to take two things that are entirely equivalent and say “ha, look how similar these equivalent things are!”).

During sessions, players tend to play within a particular range relative to their skill level. We seek to be challenged but not to a degree to which we cannot learn from the experience. This is a fundamental pattern of play we all exhibit, and it is healthy because it provides the greatest framework from within which we may learn and improve at the game.

This concept extends to all games. There’s a reason 9 Dan Go players do not spend all day playing against 20 kyu players. Well, I mean it may happen every now and then, but the experience is non-productive for either player. Very few people in the world are capable of parsing a game out purely in their minds to the degree that they can take strategic concepts and moves employed by a master of a game and properly utilize them on their own such that they skip from novice to master in a few games.

Those concepts and moves took thousands of games to develop the understanding of, a process which likely saw the development and obsoletion of hundreds of tactical approaches to hundreds of different situations against hundreds of different opponents. The concept of “skipping” ahead is antithetical to the process of becoming a master.

In other words, virtually everyone knows that to fire a gun you need to pull the trigger, but this doesn’t mean everyone can pick up a gun and fire it. In fact if you took a random person and asked them to fire a gun you have provided for them (unloaded, safety on) at a can on the fence, you’ll get a lot of very puzzled looks as people squeeze the trigger to no sound.

Assuming you aren’t just selecting members of a police department, very few of your test subjects will think to check if the gun is loaded and the safety is off before pulling the trigger. In this scenario someone inexperienced with firearms learns that there’s more to shooting a gun than just pulling the trigger. For a veteran member of a police department nothing has been learned; the exercise has been irrelevant to their continued development as a law enforcer or their aptitude with firearms. This is why we don’t spend all day asking them to do this.

Part of the appeal of Stepmania as a game or activity is that it is continuously mentally and physically engaging. A new player can be just as challenged by a novice file as an expert player can be challenged by an expert one. The difference is below a certain threshold of difficulty an expert player finds the difficulty of files to be non-relevant. Sort of like playing an RPG and going back and clearing areas significantly below your level, it’s irrelevant how strong they are in relation to each other, functionally there is no difference to you. You just kill everything instantly and move on.

Irrespective of how difficult they are to each other, no file below this threshold of relevance challenges the player on a physical or mental level and nothing is to be learned. In essence, it is the ultimate waste of time.

The inverse is also true, players do not tend to zone into areas where they die instantly repeatedly without any control over their situation. Perhaps some players might choose to do so in the hopes of exploring or finding a bug or exploiting a known one, however in the event that the player knows that nothing of value can be obtained, it is unlikely that they will pursue this activity for any length of time. Similarly players also have a threshold above which they choose not to play.

As I’ve jokingly noted before, it is technically impossible for any player to know for certain they will obtain a perfect score on a file, independent of how skilled they are. Functionally this is irrelevant, we know performance is tied to skill, and we know players stay within fairly strict bounds of file difficulty because progressing further in either direction creates unstimulating and unproductive experiences.

Scoring on files that are far too easy, even for accuracy players, is often difficult not because of the difficulty of the file, but because of the interest of the player. There is a point at which scoring becomes less dependent on playing capability than on player will. Files that are far too hard produce a similarly negative experience; players have no capacity to relate to what they are supposed to play, because in order to do so they would have to be able to play at a level still significantly higher than the level they are currently at. Scoring becomes a function of how effectively a player mashes through parts he cannot play, rather than skill.

Translating these upper and lower bounds into a mathematical concept is simple. We can apply the pattern of play we observe on a file-by-file basis to an NPS distribution, sections that are too easy fall below a certain threshold and become irrelevant to the difficulty of the file, and sections that are too far above a certain threshold can be considered to have significantly more impact on the difficulty of the file than values that fall within the upper and lower bound.

The question then becomes how do we determine the upper and lower bounds? We can’t set the upper bound as the highest NPS value, that defeats the purpose of it. We need to take an approach that is simple yet immune to as many sources of error as possible. We may refer back to our distribution and take a guess.

We know high spikes in difficulty can be overcome if infrequent, so we need to postulate a value for which the frequency of difficulty spikes becomes high enough that they are no longer “spikes” but the upper bound of “relevant” difficulty. Since we’re keeping it simple, the value I guessed would fairly robustly represent the upper bound for most files is the lowest value for which a cumulative 10% of the entire file has been encompassed. If the upper bound is considered to be the maximum at which a player is comfortable playing (and indeed, the logical statement here is that this will always be the case for a player who is capable of barely AA’ing the file) we can take another experience-guided shot in the dark and say the lower bound is likely around 80% of the value of the upper bound.

So if the upper bound is 20 NPS, the lower bound is 16 NPS. The logical assertion we are making is that for any file for which the given distribution of NPS produces these bounds, any interval that falls below 16 NPS is functionally irrelevant to difficulty. Any interval that is above 20 NPS has a greater impact on difficulty.
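
A minimal sketch of that bound-picking heuristic, assuming we already have the per-interval NPS values; whether the 10% is counted in intervals or in notes is my assumption here, not something spelled out above.

```python
# Walk down from the hardest intervals until a cumulative 10% of the file
# has been covered; the NPS at that point is the upper bound, and the
# lower bound is 80% of it.

def difficulty_bounds(interval_nps):
    ordered = sorted(interval_nps, reverse=True)
    cutoff = max(1, int(0.10 * len(ordered)))  # top 10% of intervals
    upper = ordered[cutoff - 1]                # lowest NPS inside that top 10%
    lower = 0.8 * upper
    return lower, upper
```

Fed a file whose hardest 10% of intervals bottoms out at 20 NPS, this returns the 16/20 pair from the example above.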

Now we’re working with a lot less data, and we’re working in a situation in which we know roughly how much each point of data matters compared to the others. The first iteration of the difficulty algorithm was essentially this. NPS values that fell between 16-20 NPS were averaged and values above 20 NPS figured in with increasingly higher weight the greater they were above the upper bound. Upper and lower bounds were calculated for any file in question and the same process repeated.

This produced a base value which was then further modified to account for proportion of irrelevant intervals to the proportion of relevant intervals (or more conventionally known as “free dp”) and stamina and so on and so forth.
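
Roughly, the base value from that first iteration looks like the following sketch; the exact weighting function for intervals above the upper bound is my guess, since the original constants aren’t given here.

```python
# First-iteration base value: ignore intervals below the lower bound,
# average intervals inside the bounds normally, and give intervals above
# the upper bound a weight that grows with how far they exceed it.

def base_difficulty(interval_nps, lower, upper):
    weighted_sum = 0.0
    weight_total = 0.0
    for nps in interval_nps:
        if nps < lower:
            continue                                   # "free dp", irrelevant
        weight = 1.0 if nps <= upper else 1.0 + (nps - upper) / upper
        weighted_sum += weight * nps
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0
```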

This produces a single value, which is what we want. We’ve also further eliminated shortcomings of avg NPS by expanding on the model of an NPS graph, this is good, but the results we get from this system are fundamentally flawed and are unusable.

Anyone with a passing familiarity with programming and data processing can replicate these steps and observe the results on their own. All that’s required is the capacity to process the .sm format and extract the relevant data before applying basic mathematical processes and the subsequent examination of them. This can be done in under a day and one or more issues with the results will be immediately obvious.

And so we need to take another step. Effective problem solving isn’t about fixing everything at once, much less to the degree of perfection. It’s about identifying the issues which impact your desired goal the most and which of those you can make the largest headway in with the least amount of time and effort.

This is meta-optimization, or, the optimization of optimization. We are constructing a system rooted in logic and the process we will work within involves repeatedly identifying and fixing the most problematic issues— not necessarily to the degree of perfection, but to the degree for which results are good enough that further time investment is better spent elsewhere.

We can manipulate NPS values in any way we want, and we can do so in the most logically sound fashion, but ultimately we have to approach the problem of pattern configuration. We know patterns have a significant impact on the difficulty of files alongside NPS, but what is the actual degree or range of the impact? Does NPS matter more than patterns, or do patterns matter more than NPS?

If you’ve ever played the game, you already know the answer. Pattern configuration is much more important than raw NPS. Let’s take a quadjack at 150 bpm. It has 4 times the NPS of a single jack at 150 bpm of the same length. From the perspective of NPS, the quadjack is 4 times more difficult, 10 versus 40.

Forget NPS for a second, what does this differential mean from the perspective of time investment? We’ve already stated we require our system to produce ratings that fall at maximum within a single year estimate of the required amount of time needed to play in order to AA a file. The bad news is in order to fully grasp the difference between the values of 10 and 40 you must have already constructed a system to which you may compare them.

The good news is, I’ve already done that. Within the context of the system I’ve already created the difference of 10 versus 40 likely spans decades, with 10 difficulty AA scores being obtainable within a few months of playing the game. I can’t assert a more exact length of time because the fact is a difficulty of 40 lies outside the realm of anything anyone has achieved playing the game up until this point, and by quite a large margin.

Now, I didn’t have the capacity to make these comparisons when I first started, I was mostly working in the dark, using my experience to give some context and guidance to the decisions I was faced with. While arguably a 150 bpm quadjack and a 150 bpm single jack of the same length are not actually the same difficulty, if we make the assumption that our player is equally skilled with both hands (no player is in reality, however favoring either hand is irresponsible; the logical statement we are making by asserting that both hands are equivalent in skill is that difficulty must be constructed around the player’s weaker hand) they are roughly the same.

Sure one takes slightly more physical exertion but this isn’t something that requires decades of training to overcome. We know that depending on the pattern, NPS can misrate sections to a degree that falls far outside our hopeful range of error. Pattern configuration proves to be the most complex and problematic issue at play. In order to reach our intended goal we need to mitigate the errors produced by pattern configuration so they average out to produce usable results.

We are going to do this through pattern normalization. Pattern normalization can be tricky. Essentially, we know that NPS could be a perfect representation of difficulty if every pattern were equivalent. We also know that pattern configuration has enough of an impact that it renders raw NPS on its own totally useless. What we need to do is devise a method to take any pattern at any NPS and normalize it to a baseline pattern at a given NPS.

The tricky part isn’t the concept, the tricky part is the implementation. The experience of most players with NPS is that jacks are underrated, since their load is concentrated on a single finger, and rolls are overrated, since they can essentially be hit as jumptrills. This doesn’t necessarily mean that jacks are underrated and rolls are overrated, per se, really it’s irrelevant what pattern we choose to normalize all patterns to, however we are pointed in a specific direction through our experience and we might as well follow it.

Jacks tend to be viewed as underrated and rolls overrated because of their prevalence during an average player’s play. Most of us understand NPS through the context of patterns that fall between jacks and rolls or quadmashes/jumptrills. The problem is we don’t really know what pattern we’re normalizing to, only that we’re pushing patterns that we find underrated in one direction and patterns we find overrated in the other.

Let’s do a simple math exercise: how many possible configurations of arrows can exist across 16 note positions in 4 columns? Turns out, it’s a lot. It’s 2^16, or 65536. Thankfully this count holds regardless of NPS, however we still need to normalize every possible pattern at each NPS value to every other by creating a set of rules or equations that must be applied to every pattern. I probably don’t need to tell you, but this is a prohibitive amount of work.

We have neither the time nor expertise to sit down and assign mathematical adjustments to every possible combination of notes and thoroughly check them, actually, if I had spent the near-decade playing the game doing this instead, it might have been possible, but I didn’t.

So instead we’ll do the next best thing, we’ll use our understanding of what drives difficulty in the game to adjust for trends we know exist. We need to keep something important in mind, whatever we do, it must be specific. If we intend to correctly represent the difficulty of jacks and rolls by doubling and halving NPS respectively, we need to make sure that we’re only doing it for the sections for which these actually apply.

Our adjustment is intended to reflect the added or reduced difficulty inherent in any pattern configuration such that every file affected is moved closer to its true NPS value, or difficulty. We would like this to be as true as possible, however hyper-focusing can prove to be just as unproductive as doing nothing.

The question still remains, where do we start? How do we take an interval NPS value and begin to adjust it? Even if we know the patterns we wish to adjust for, and even roughly how much we wish to adjust them by, how do we write logicals that a program can apply to every interval in a manner we see fit?

The interpretation of quad column data is fairly complex, there are many patterns that can be formed and many different ways each of those patterns can be approached, indeed it would be much simpler if we only had to normalize patterns for two columns at a time. Thankfully, this is the case.

I’ve already alluded to the principle, but we assume that both hands are equivalent in skill and that difficulty should be constructed around the weakest hand. The difficulty for any one hand is specifically defined by the difficulty of the notes that hand is designated to hit.

Now we have a clearer idea of how to approach pattern normalization. The first approach I took, which I later abandoned because it didn’t make any sense, was to assert that the difficulty of any interval was largely based on the difficulty of the most difficult hand. This is a large step up in simplicity from trying to manage four columns of data, however it is logically and functionally deficient for a number of reasons.

The better approach is to assign each half of an interval, split along each hand, its own value for difficulty. There’s no reason we can’t do this, so we might as well, it’s also far more accurate. Now it is the case, and experienced players will tell you this because it’s true, that sometimes patterns are much more difficult than the sum of their individual hand difficulties. This mostly pertains to patterns like polyrhythmic trilling where each hand is playing a different speed, and the reason is largely due to the difficulty of reading incurred.

First, insofar as we can, we would like to eliminate reading from our considerations. We want our difficulty to be solely derived from the physical characteristics of a file and any metadata we can further construct or extract from the dataset. We want to avoid introducing a human variable as much as possible, not because it’s irrelevant, but because we want to be able to control its impact when and if we do account for it.

Second, we must keep in mind the principles of conduct we have set forth for ourselves. Just because we know this doesn’t apply in some situations, doesn’t mean it’s unusable. Our expectation is that the error incurred by processing intervals in this fashion is lower than the alternative on the whole. We accept that it will not be a perfect representation but hopefully when pursued we can provide ourselves with a system of difficulty we can place our confidence in.

Parsing a .sm file provides a large volume of information. Much of this data must be further extracted from the base information provided, such as ms differentials, but it isn’t difficult to generate what we need to work with using programming.

I’ll skip over the process of generating the data and get straight to what we need. We want to know how many notes are contained within each interval on each hand. For each interval we want to know how the notes are configured in relation to one another, and what each note’s ms timing is with respect to both the last note on the hand in question and the last note on the finger in question.

A 50 ms differential between two notes on two different fingers on the same hand implies something completely different from a 50 ms differential between two notes on the same finger. If we are going to be making targeted adjustments we need to be sure we only affect the patterns we intend to affect. In this case more is better: the more we push NPS up or down based upon a pattern, the stricter the logicals for that adjustment have to be.
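
As a sketch of the data being described, assuming notes arrive as (time in ms, column) pairs with columns 0-1 on the left hand and 2-3 on the right, and using a gap of 0 for the first note seen on a hand or finger (my own convention, not the original’s):

```python
# For every note, record the ms gap to the previous note on the same hand
# and to the previous note in the same column ("finger").

def ms_differentials(notes):
    last_on_column = {}
    last_on_hand = {}
    rows = []
    for time_ms, column in sorted(notes):
        hand = 0 if column < 2 else 1
        rows.append({
            "time_ms": time_ms,
            "column": column,
            "ms_since_hand": time_ms - last_on_hand.get(hand, time_ms),
            "ms_since_finger": time_ms - last_on_column.get(column, time_ms),
        })
        last_on_hand[hand] = time_ms
        last_on_column[column] = time_ms
    return rows
```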

We also want to know the relative distribution of notes on each finger on each hand for each interval. Originally I attempted to further simplify the process by using the max NPS on either finger as the baseline to which I would normalize. The intent was to bypass the necessity of accounting for tricky formations of jacks, however, as with many of the decisions made before this, it proved to be inadequate in the results produced.

There are a number of fundamental procedural errors that are incurred, “grouping” is one of them. Using the highest NPS for either finger as the baseline exposes your results to high relative error based on the initial assignment of a base value. Two patterns may present very similarly and as a result your pattern normalization may approach both in a similar fashion, however one may have an extra note in the more prolific column.

The result is a significant level of relative error introduced into the most basic elements of the procedure. This also ties in closely with interval sampling error, which will be discussed later. This is something that is unavoidable, we must simply choose the approach that gives us the best possible results for the least amount of effort.

In the end, it’s easier to discard any clever approach and simply take interval NPS per hand at face value for our starting point. From here we know the general direction in which we need to move, we need to systematically push up heavily anchored patterns and jacks and push down jumptrills and rolls.

Let’s deal with the easiest problems first. We know two hand jumptrills need to be pushed down, and we know we roughly want to cut the difficulty of them in half. We also know one hand jumps don’t truly contribute “2 difficulty”. As a general rule, the closer any pattern is to being hittable as a one hand jump the easier it is, and this is particularly true if it holds for the entire interval. Straight one hand jumps on a single hand are not necessarily part of a jumptrill, however we can treat them the same.

We can calculate the density for one hand jumps within an interval for each hand. Take the number of one hand jumps multiplied by two, and divide it by the number of notes, and you are left with the proportion of total notes contained within one handed jumps. If an interval only contains one handed jumps the result should be 1, or, 100% of all notes contained within the interval are a part of a one hand jump.
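
That calculation, written out; grouping notes into one-hand jumps by identical timestamps is my assumption about how the rows are defined.

```python
# One-hand jump density for one hand's notes within one interval:
# (number of one-hand jumps * 2) / (number of notes on that hand).

def one_hand_jump_density(hand_notes):
    """hand_notes: list of (time_ms, column) pairs belonging to one hand."""
    note_count = len(hand_notes)
    notes_per_row = {}
    for time_ms, _column in hand_notes:
        notes_per_row[time_ms] = notes_per_row.get(time_ms, 0) + 1
    jumps = sum(1 for count in notes_per_row.values() if count == 2)
    density = (jumps * 2) / note_count if note_count else 0.0
    return jumps, density
```

An interval made up entirely of one-hand jumps returns a density of exactly 1, matching the check described above.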

Now there’s a tricky thing to consider: as a general rule, the larger the proportion of one handed jumps present in an interval, the easier it is. For intervals consisting purely of one handed jumps we have a fairly good idea of exactly the magnitude of error we need to correct for, and we can be relatively confident in the correction we are making. For intervals with lower proportions we cannot say the same thing; what determines the effect one handed jumps have on an interval is both the proportion and the configuration of consecutive one hand jumps, as well as the context of the notes surrounding them. An interval may have 5 one hand jumps that are consecutive, and single notes afterward, or the jumps may be spaced such that hidden chain patterns are formed within jumpstream. One hand jumps that form hidden chains don’t push down difficulty nearly as much as 5 consecutive one hand jumps do.

Now, we know this is the case, but we also know that patterns that are the worst offenders of this trend tend to be the most rare, and that even when present most files are constructed to avoid repeated use of hidden chains. We can be relatively confident that pushing down interval NPS based on proportion of one handed jumps within the interval won’t incur statistically significant error for the majority of files.

For files specifically constructed around chain-based patterns we’re well aware they will be significantly underrated, however this is fine, because it is known error. The problem with communally derived systems is that every source of error is unknown, you don’t know who underrated or overrated anything or why, and because you don’t know, you can never improve the system.

It’s fine to include adjustments that are largely correct if we know the specific circumstances in which they will fall short, and especially if we know roughly how much they will fall short by. When we’ve exhausted improvement in every area such that the most effective use of our time is to come back and revisit this particular modifier, we can do so knowing what the problems are and how we need to fix them.

This would entail writing logicals that defined sequences of consecutive one handed jumps and further refined the interval modifier to account for them, this isn’t hard, but in relative terms it’s much more time consuming than our simpler yet more error prone alternative. It also would increase processing time, which is a consideration when performing large scale trend analysis.

Furthermore if we are aware of the errors to which our adjustments are prone we can manually compensate for them when doing large scale trend analysis. If we know one specific type of file tends to be underrated due to a specific pattern normalizing modifier we can mentally inventory it when creating or testing other modifiers.

The reality is every file evaluated for difficulty is both underrated and overrated, for many different reasons, and to differing degrees. With algorithmically constructed modifiers we have the advantage of understanding what tends to be overrated or underrated because the same modifiers are applied in the same fashion to everything. If we find that we are pushing down one handed jumps too much or too little, we can adjust.

If we don’t observe the change we intended to effect with our adjustment, then we have failed to effectively translate our understanding of the issue into a mathematical modifier, we have failed to effectively understand the issue, or there’s a completely different issue at hand that we have not considered.

Now that I’ve finished justifying copping out and taking the simplest approach we can continue on that path. The absolute simplest solution would be to state that we should remove 1 NPS for every one handed jump present in an interval, we already know this doesn’t make any logical sense so the next best thing is to scale it by one hand jump density.

We can take the number of one handed jumps in an interval and assert that this is the number we would remove from the interval if it were solely comprised of one handed jumps; functionally we would be subtracting half the NPS value for any interval with a one hand jump density of 1.

We want to scale down the magnitude of our adjustment based on the proportion of one handed jumps within the interval, so we can simply multiply the number of one handed jumps within an interval by the one handed jump density. This doesn’t produce a linear downscale, because we are scaling the number of one handed jumps with the proportion of jumps. We definitely want to shove pure jumpjacks down, but we want to err on the side of caution when dealing with more conventional patterns.

Through trial and error I found it’s necessary to further apply an exponential modifier to the jump density. The modifier we end up with is number of one handed jumps multiplied by one hand jump density squared.
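
Putting the pieces together, here is a sketch of the downscale as described, using the density helper from the previous sketch; the interval length is a parameter of my own, not something specified above.

```python
# Subtract (number of one-hand jumps) * (density squared) from the hand's
# note count before converting to NPS. A pure jumptrill (density 1) loses
# half its notes; sparse jumps are barely touched.

def ohj_adjusted_nps(hand_notes, interval_seconds=1.0):
    jumps, density = one_hand_jump_density(hand_notes)  # helper from the earlier sketch
    adjusted_notes = len(hand_notes) - jumps * density ** 2
    return adjusted_notes / interval_seconds
```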

Yes, this really is pretty simple. In fact, it’s a lot simpler than the reality— the reality is much more nuanced. But we aren’t trying to discover mathematical laws of stepmania physiology, we just want to guess at the form they most closely resemble and move on.

So we can apply this modifier to any interval we process and make sure it doesn’t have any unexpected results. We don’t expect the difficulty of stream files to go down when we toggle this modifier on, because it’s mathematically and definitionally impossible. In this particular case, this is obvious, however it will not always be so cut and dry.

Now we’ve accounted for pushing down jumptrills, and in the process we’ve also scaled our modifier to the proportion of one handed jumps so heavily jumptrillable js is pushed down relative to stream patterns of equivalent NPS. We know this is what we want and we know the problems with it and until we decide to greatly expand on the system we can be confident that this modifier does what we expect it to do.

It’s fundamental to the methodology we are working with to understand that we have effected a mathematical modifier for difficulty based on the presence of certain note configurations and the physiological shift of difficulty that it necessarily implies. We have a specific definition for what the problem is and we’re asserting an estimation of exactly how much one hand jumps affect difficulty within an interval and we can do this only because we have established a logical precursor for what we are doing.

Math is just logic; any mathematical adjustment or calculation we make in any capacity implies a logical statement of fact for some relevant aspect of the game or physical reality that affects the accuracy of NPS values that we know we must account for.

What we absolutely cannot do is make up unguided modifiers to alter difficulty and use improved results to justify our actions. They must be justified through a logical statement which can be argued either for or against. It is entirely irrelevant if you obtain better results by changing something for no reason. Within mathematics and statistics, being right for the wrong reasons is the same thing as being wrong.

Apart from jumptrillable rolls, jumptrills (and by extension quadjacks) produce the largest unwarranted increase in difficulty. The major contributors to underration are jacks, runningmen, and heavily anchored patterns. The common theme between jacks, runningmen, and anchored patterns is that they’re all variations on jacks. If there is a common theme between rolls beyond a specific speed and 2 hand jumptrills it’s they’re both hittable… as 2 hand jumptrills.

Rolls are much trickier to deal with compared to jumptrills. In the case of the latter we are working with explicit definitions of what jumps are, in the case of rolls we need to create our own definitions for patterns that present as able to be hit as jumptrills without actually being jumptrills. We also need to do this in a way that doesn’t adversely affect files that present jumptrillable roll patterns within more complex streams that aren’t practically jumptrillable.

Downscaling rolls is the easy part (kind of, at least). Avoiding downscaling stream files in the process is what proves to be incredibly difficult. When creating a set of pattern normalizing modifiers that we must apply to all intervals for every file we need to make sure we aren’t being redundant, and we also need to make sure we don’t artificially introduce breakpoints. We’re already exposing ourselves to interval sampling error; creating breakpoints only compounds the issue, and greatly at that.

In early iterations of the algorithm roll detection was incredibly strict. It’s easy to place into words how rolls are configured but it’s more difficult to define it mathematically. We know the friendliest incarnation of the roll pattern is configured into sets of repeated 1234 or 4321 notes; this is easy enough to pick up on if we need to, but defining explicit pattern sequences is irritating and processor intensive. Also, an integral part of what defines rolls is the equivalent spacing of each successive note in a column along with the note configuration, and pure configuration matching does nothing to address this.

Thankfully we’re working with sets of data and there exists a nifty method for extracting metadata from a file that can identify roll patterns without sequencing. We can timestamp every note in a file, and from timestamps we can derive the absolute and relative positions in time every note occupies. We aren’t particularly interested in absolute positions in time, but we are very much interested in where each note sits in time relative to the notes surrounding it.

We can calculate ms differentials from the previous note, the previous note on the same finger, and the previous note on the opposite finger. These are all datasets that can be manipulated to provide useful information, they are all useful in different capacities and provide different kinds of information depending on how you manipulate them.

We’re interested in downscaling roll configurations that present as jumptrillable. We know that the defining features of rolls that make them jumptrillable are: equivalent distribution of notes between the two fingers in question, consistent spacing between consecutive notes, and a requisite speed at and beyond which the timing windows can be fully taken advantage of.

Picking up on equivalent spacing between notes within an interval can be done with simple data manipulation. Coefficient of variation (CV) is a basic indicator of variation between values in a dataset. CV is calculated as the standard deviation divided by the average of a dataset. We can calculate the CV for our sets of ms values; a value of 0 indicates the ms values are identical.

There are simpler ways to determine whether or not values in a dataset are identical but recall we aren’t interested in explicit rolls, we are interested in any pattern configuration that presents as being jumptrillable, rolls are simply the most common manifestation of this. We want our detection to be strict enough that it doesn’t push down patterns that aren’t jumptrillable, however we certainly don’t want 32nd rolls for which a single note is offset by a 192nd to be considered “not a roll”.

Using CV to identify jumptrillable configurations has the added benefit of allowing some breathing room for accents or hyper technical offsets within jumptrillable patterns. As a practical matter any interval which presents CVs for ms values from the last note on the same finger for both fingers that sum to below 0.2, and which contains numbers of notes on each finger within 1 of each other (to account for interval sampling error), beyond a certain speed, can be confidently labeled as “definitely jumptrillable”.
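
A rough version of that check for one hand within one interval; the minimum-speed cutoff is a placeholder value, and I’m treating “within 1” as a difference of at most one between the per-column counts.

```python
from statistics import mean, pstdev

# Sum the coefficients of variation of the per-column ms gaps; a sum under
# 0.2, balanced note counts, and sufficient speed flag the interval as
# jumptrillable.

def coefficient_of_variation(values):
    avg = mean(values)
    return pstdev(values) / avg if avg else 0.0

def looks_jumptrillable(col_a_gaps, col_b_gaps, min_nps=14.0, interval_seconds=1.0):
    if not col_a_gaps or not col_b_gaps:
        return False
    cv_sum = coefficient_of_variation(col_a_gaps) + coefficient_of_variation(col_b_gaps)
    balanced = abs(len(col_a_gaps) - len(col_b_gaps)) <= 1
    fast_enough = (len(col_a_gaps) + len(col_b_gaps)) / interval_seconds >= min_nps
    return cv_sum < 0.2 and balanced and fast_enough
```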

We can assign an appropriate downscale value based on our confidence of jumptrillability, but we still have to determine a baseline speed for which any effect is to be taken, and furthermore in order to avoid compounding the problems presented by interval sampling error we need to scale our modifier.

Early iterations of the algorithm had far stricter detection than what has been outlined, and the modifier for rolls wasn’t scaled to anything, it was simply a flat modifier of 0.75 applied to intervals that qualified for the downscale. This was the catalyst for the horrifically incorrect ratings observed early on, where files like OTFS were “easier” on 1.4 than they were on 1.3 or even 1.2. Bumping the speed up slightly triggered application of a flat and overzealous downscale.

In fact jumptrilling rolls presents an interesting paradox, nearly every pattern configuration eventually becomes easier at higher speeds due to taking advantage of timing windows, however rolls configurations are the only patterns for which this phenomenon is observable in practical play, and by players of intermediate skill level no less.

Normally the increased ease of hitting patterns is offset by the necessary physiological impossibility of actually actuating keys to take advantage of it, in other words, 2 fast 2 hard. For rolls this is not the case: below a certain speed rolls must be hit with direct finger motion, and as the speed increases the capacity to jumptrill the roll increases, making scoring on them much easier. As speed further increases difficulty once again increases (remember, these are all increases or decreases to the relative amount of difficulty we must accommodate when pattern normalizing, not absolute values of difficulty) due to the exponential difficulty growth of jack based patterns.

The result is a modifier for rolls has to be designed that scales to the speed of the roll and the speed of the jacks, exercises the greatest influence somewhere around the 440 bpm mark, and has minimal influence on the outer fringes of jumptrillability, 375 and ~650 bpm rolls respectively. Beyond 650 or so bpm we must ease up on the roll downscale for a completely different reason than we did at the lower end, which requires creating a methodology for dealing with jack sequences before it can be applied faithfully.

Even after accomplishing this we still run into significant sources of error when dealing with jumptrillable rolls. The major offender here is interval sampling error. Interval sampling error is the error incurred when you split files into any arbitrary segment lengths. Smaller segments (0.5s) are more likely to present an explicit pattern that can be scaled up or down appropriately, however it is equally as likely that we end up splitting explicit patterns that would be scaled appropriately halfway through the pattern with an interval cutoff.

Take a 0.5 second roll wedged into jumpstream. If our interval cutoffs are such that the entirety of the roll fits within a single interval our pattern modifiers will detect it and accordingly downscale the difficulty of the roll, however if our cutoff splits the roll in half and now we have two intervals with half a roll and half standard jumpstream, our roll detection won’t pick up on the pattern and no downscale will be made.

Larger intervals are more likely to fully encompass explicit patterns, but in doing so they necessarily disguise them with extraneous notes. It’s not as efficient to normalize patterns on larger scales; this principle, extended to files in their entirety, is the reason we are splitting them into intervals in the first place.

The only way to completely remove interval sampling error is to simply not calculate anything within intervals. This is an unavoidable fact. Remember, we aren’t working within intervals because we expect it to provide us with the best possible accuracy, we’re working within intervals because we expect it to provide us with a usable level of accuracy with a minimal amount of time invested. We can’t completely negate interval sampling error, but we can mitigate it to a certain extent. We know roughly which patterns are negatively affected and we know roughly by how much. There are three major approaches to the problem.

The first is to calculate every file twice with differing offsets. If in the above example we did this using a 0.25 offset we would have one instance in which the roll was fully accounted for and another instance in which the roll was unaccounted for. It may be tempting to average the two difficulties and call it a day but we can’t do that. Averaging is a mathematical procedure, as discussed above, it makes a logical statement about what we are doing.

Averaging states that we expect the magnitude and frequency of error to be roughly equivalent for both calculated instances of the same file and that averaging the two results produces a result closer to the true value. This isn’t true in our specific instance. In our specific instance we have two results, one of which is completely wrong and one of which is probably right, assuming we properly reflected the reality of roll downscaling. In this specific instance the value averaging would produce is simply less correct than one of our two input values.

Now this is fine, if we know that averaging the results will provide us with more accurate results on average. The problem is, the above is more or less the best possible scenario in which averaging two different results helps us. It’s not about how accurate an averaged result is compared to one of the initial estimations, it’s that it’s very easy to tell what’s happening and why and exactly how much error averaging introduces compared to the better estimation.

We can’t possibly do the same on a mass scale, and the reality is it is absolutely impossible for us to functionally determine whether or not averaging offset calculations produces better results on the whole, because we are incapable of separating out the sources of error incurred when looking at results. We would be introducing a source of error that we cannot control, or even provide a reasonable estimation of its functional impact on our results, and that means we can’t do this.

The second approach is to calculate difficulty on different intervals. We can hazard guesses at which types of files or pattern configurations will be best picked up by specific interval lengths and make passes on each interval to pick up factors that influence difficulty that are best detected at different intervals and aggregate them based on our levels of confidence in their results.

This means we need to set up differing parameters for each separate interval length we intend to use and analyze the resulting datasets with the same level of rigorousness as we would a single set of data, and then determine whether or not molding a value for difficulty based on three different indicators will actually provide better results than not doing so.

This approach requires a lot of work and is only minimally utilized in the current incarnation of the algorithm. Currently the algorithm calculates based on 0.5, 1, and 2 second intervals and then applies a weighted average based on the expectation that 0.5 interval sampling provides the best estimate of error, however there is enough of a proportion of files for which significant error is introduced at this interval that shifting all difficulties up or down based on 1s and 2s interval passes results in greater accuracy on the whole.
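
For what it’s worth, the aggregation amounts to something like the following; the weights are placeholders of mine, not the ones actually used.

```python
# Blend the 0.5s, 1s and 2s passes, trusting the 0.5s pass the most.

def blend_interval_passes(d_half, d_one, d_two, weights=(0.6, 0.25, 0.15)):
    passes = (d_half, d_one, d_two)
    return sum(w * d for w, d in zip(weights, passes)) / sum(weights)
```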

This is almost certainly the least tested modifier in use in the algorithm and will very likely be omitted in the next iteration of the algorithm.

The third and most practical approach to dealing with interval sampling error is to calculate within a single interval and attempt to adjust all calculations to accommodate interval sampling error and apply a sensitive moving average to the intervals before making any final calculations. This approach is particularly effective at lower intervals and also helps to reduce the effect of grouping created by sampling on smaller intervals in the first place.

It also serves to emulate to some degree the physical reality of playing the game. Intervals aren’t isolated sections of notes that we are thrown into fully prepared for. A very easy section that follows a very difficult section can very realistically carry increased difficulty (remember, as interpreted by a lower proportion of points gained) based on its position relative to the hard section, due to the cb rush mechanic or the physical reality of stamina.
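
A simple centered moving average is enough to illustrate the idea; the window size, and whether the real pass weights neighbours unevenly, are assumptions on my part.

```python
# Smooth per-interval difficulty so that a hard interval bleeds slightly
# into its neighbours, loosely mirroring cb rush and stamina carry-over.

def smooth_intervals(values, window=3):
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed
```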

Speaking of stamina, we have yet to account for it. It’s true that we can take an interval or a file and normalize the NPS value based on the patterns present to reflect perfect difficulty, but this is only true in isolation. In isolation, NPS and note configuration are the only two factors which contribute to difficulty, but the moment we consider sequences of intervals or files stamina immediately begins to influence our calculations.

Remember that currently we are still calculating final difficulty based on averaging relevant NPS intervals and applying a weighted modifier for NPS intervals that are above the upper bound of comfort. The first iteration of stamina modifiers was as simple as it was idiotic. A multiplier of 1 was established at interval 1, and for each successive interval that fell within or above the relevant NPS range this multiplier was further multiplied by a very tiny amount (~1.0005), and for every interval below the lower bound of relevant NPS this multiplier was divided by a much greater amount (~1.0020), in order to simulate the cumulative effect of stress and relief.

After reaching the final interval a single multiplier value was produced, generally somewhere in the 1.0-1.2 range, and this flat modifier was applied to our weighted average for relevant NPS. This is the simplest possible solution that makes any logical sense, and it is better than doing nothing about stamina. We know stamina has an impact on scoring, and we know that it’s roughly tied to how long continuous sections of notes are and how many breaks you are allowed between them, if any.
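
That first pass is short enough to write out in full; the constants are the approximate ones quoted above, and whether the multiplier is floored at 1.0 is not something I know, so it isn’t here.

```python
# Walk the intervals in order: nudge a running multiplier up slightly for
# every relevant interval and divide it down harder for every rest
# interval, then apply whatever is left to the base value.

def stamina_multiplier(interval_nps, lower_bound, gain=1.0005, relief=1.0020):
    multiplier = 1.0
    for nps in interval_nps:
        if nps >= lower_bound:
            multiplier *= gain
        else:
            multiplier /= relief
    return multiplier
```

In this sketch the final value would simply be the base weighted average multiplied by this single number.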

This isn’t to say that it’s a very good system though, it was intended to be a stopgap pending the development of a system that more accurately reflects play. Anyone who has ever played the game can tell you stamina doesn’t just hit you all at once at the end of a file, it has an effect on every interval and every note throughout the entire file and if we wish to properly account for it we must create a system that does this. It also fails to account for a large number of nuances that I’ll skip over for now, as they’ll be covered further on.

Ok let’s stop for a moment and recap what we’re currently doing. We’re taking the simplest possible approach to solving any problem we come across in our estimation of file difficulty and we need to continuously iterate it until we obtain results that are useful.

So far we’re expanding on NPS graphs by splitting up files into intervals and looking at NPS within each interval. We’ve introduced the concept of pattern normalization and we know we can make mathematical adjustments to an interval’s NPS value rooted in logical understanding of how the game is played.

We know the better we understand how patterns affect NPS based difficulty and the better we understand how to translate that into mathematical formulae, the closer our results will be to true difficulty.

We know that if we do this perfectly we’ll obtain perfect values of difficulty for each interval. But we still don’t have a system for properly accounting for stamina, and we still don’t have a system for taking interval estimations, whether or not they are perfect, and accurately producing an overall estimation for difficulty based on those intervals.

Taking relevant NPS sections and averaging them with a weight for more difficult sections is a process which introduces enough error that even if we achieved perfect pattern normalization we would still have erroneous estimations of difficulty. No accounting for how much fodder (free dp) exists within files has been made. The stamina system is only barely better than doing nothing at all.

Averaging relevant pattern normalized NPS has gotten us fairly far, and working within this model has given us a decent amount of insight into the problem at large. The work thus far is re-creatable within a few days by an experienced programmer. Not only is it easily reproducible, but anyone who applies basic logic can expand or improve on the framework currently provided.

The problem is the framework itself introduces too many sources of error. Eventually we will be unable to continue to improve the algorithm because of the failings of this framework. In fact, the framework itself incurs a margin of error that simply exceeds our minimum requirement of error. It is impossible to create a usable set of ratings by continuing on this path, just as it was impossible to create a usable set of ratings with average NPS no matter how hard we fudged them. We need to provide ourselves with the tools to construct a bulletproof framework that we can work within.

Remember that arduous bit earlier on about understanding difficulty? Up until now we’ve only loosely implemented basic concepts in the algorithm. We haven’t utilized any of the meticulously outlined definitions or relationships because up until this point we haven’t needed to. We had to understand that it is impossible to create a usable algorithmically derived difficulty system without creating an explicitly defined and logically bulletproof framework within which to work. Not only does it make it possible to continue, but actually, it makes our lives a lot easier.

Skillset Development and Specialization

One of the most influential and misunderstood elements of the game is the concept of skillset specialization. For the majority of players skillset specialization has an indiscernible effect upon play, however, at expert levels and above skillset specialization has an overwhelming effect on the perceptions and attitudes of players, particularly for those most polarized into a single skillset.

For reference, these are the current skills existent in the metagame; their derivation and definitions will be explained elsewhere, as the focus of this article is what drives their development and the impact it has:

Speed (subdivided into stream vs jumpstream)
Stamina
Jacking
Technique/Technical

The conceptual foundation of skillset specialization is simple, and it should become clear why newer players generally do not even realize the existence of skillsets while it very nearly defines every expert player (even players who are “perfectly rounded” in their skillsets, are still defined by the concept):

Players tend to play what they enjoy, tend to enjoy what they are good at, and tend to become good at what they play.

This play pattern produces a self-reinforcing drive to excel at specific patterns, the stick becomes the carrot and the carrot the stick and both become the vehicle with which players traverse the continuum of skill. A Stepmania carrotstickmobile, if you will.

The same principle also applies to the different styles players use to play. Most players, upon beginning the game, simply play in whichever way feels comfortable, and they continue to do so until something prompts a change. The longer a player plays with one style, the better they get with it relative to another, and the more resistance is created when a switch is desired.

While playstyle forms in parallel with skillset specialization, it also has an amplifying effect on it. The most timeless example is one-hand trilling with index versus spread (yes, technically there is no such thing as one hand trilling with index, but that’s pretty much the point).

Despite spread being the much stronger playstyle overall, spread players will perform more poorly on one-hand based or, more broadly, “index” patterns than an index player of equal skill level. However, high density jumpstream and handstream is virtually impossible with index irrespective of speed. If we look at the files commonly played by players of either playstyle we find that indeed the files they tend to choose reflect the strengths of their styles.

Index vs spread is about as extreme an example of the differences between playstyles as there is, and in fact there are many variations of spread that have their own relative strengths and weaknesses and guide the development of their respective users in kind. The two major types of spread play are wrist up and wrist down. From a physiological standpoint, the latter tends to facilitate individual finger motion, or, “one-hand” patterns, while the former tends to facilitate success in anchored and jacky patterns, and “two-hand” patterns (through the employment of the wrist and forearms to supplant and support individual finger motion). A more detailed discussion on the concept of “technique” can be found here.

While these are important underlying patterns that contribute to the formation of skillsets, the most impactful element is simply time. If we could identify the point of divergence in a player throughout their growth between two or more skills we would find that, unless conscious attempts to shore up weaknesses are made, the relative difference between their skillsets will continue to increase with the passage of time ad infinitum.

Now that we understand what drives the formation of skillsets it’s time to explicitly define what exactly skillset specialization is, and the actual effect it has on the subject with which we are concerned, difficulty.

Stepmania, when broken down to its most basic elements, is merely an infinite exercise in the construction of muscle memory. For every configuration of any number of notes at any speed there is muscle memory that must be both written and refined, to the degree of perfect accuracy as defined by the game engine or the player in question, before it is complete; however, new muscle memory does not always need to be constructed from scratch.

While muscle memory is a complex, nuanced and even controversial phenomenon of our existence its application to video games is relatively simple, particularly in our case. Despite this, the discussion on the topic is still moderately complex and I will be making assertions on the subject derived from logic and a decade of experience. The following section is largely conjectural however I will assert that I’m probably right.

The muscle memory built while playing the game is visually activated, that is, it is triggered by the event of processing notes appearing on the screen. The construction of muscle memory necessarily develops from the execution of an initial directed action processed consciously, however first-time novice players excluded, the process can be expedited by drawing upon previously constructed muscle memory and adjusting the execution based on the visual stimulus.

Let’s for a moment take any pattern y of 5 notes for which muscle memory has been built at 200 bpm. Playing the pattern at 220 bpm does not require a player to construct totally new muscle memory, it requires the player to execute the pre-existing 200 bpm muscle memory 10% faster.
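If you want the arithmetic spelled out, here's a minimal sketch in Python (the helper names are made up purely for illustration) of what "10% faster" means in terms of note spacing, assuming 16th notes as in the examples later on:

    # Rough sketch: how much faster existing muscle memory must be executed
    # when the same pattern is played at a higher bpm (or a higher rate).
    def sixteenth_gap_ms(bpm):
        # ms between successive 16th notes at a given bpm
        return 60000.0 / bpm / 4

    def speedup_factor(old_bpm, new_bpm):
        return new_bpm / old_bpm

    print(sixteenth_gap_ms(200))     # 75.0 ms between notes at 200 bpm
    print(sixteenth_gap_ms(220))     # ~68.2 ms at 220 bpm
    print(speedup_factor(200, 220))  # 1.1 -> execute the same motions 10% faster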

Given the infinite nature of muscle memory that can be constructed (and particularly the degree to which it can be refined and then if you want to add an extra dimension of infinity and discuss transitional permutations you can do that too), the logical path for most players is to construct a core library of strong and highly refined muscle memory for a set of patterns (determined by the underlying force behind skillset specialization as discussed above). Conceptually this is not dissimilar from product specialization in a marketplace or careers in an educated workforce.

Players are then guided to play files constructed predominantly of patterns for which they have strong pre-existing muscle memory, and the construction of muscle memory for less refined or non-existent patterns is facilitated by the familiar atmosphere.

Fundamentally the purpose of muscle memory is to optimize specific and repetitive tasks to the degree that they can be automated to completion without the person needing to “think” about what they are doing, freeing up mental capacity for other purposes.

While this may seem both obvious and a broad textbook definition not particularly relevant to our discussion, the important question to ask is contained within: what are we freed up to do? In the context of Stepmania, and somewhat ironically, it means we are free to construct more muscle memory.

The construction of muscle memory, given that it is necessarily grounded in directed action, is taxing on players mentally. Files for which there are high frequencies of unfamiliar patterns necessitate continuous construction of new muscle memory which drains the amount of mental energy the player has.

This would not be an issue if mental energy, or mental stamina, were reserved purely for the construction of muscle memory. This, however, is not the case. Recall that muscle memory can theoretically be developed to an infinite degree. The muscle memory to perfectly execute a 200 bpm minijack is not the same as the muscle memory to perfectly execute a 200.0000001 bpm minijack. To say otherwise is to logically state that 200.0000001 = 200, which, while incorrect from a mathematical standpoint, is from a practical standpoint true.

So then who cares, you might ask? Well it's less about who cares, and more about the degree to which someone cares. Hitting a 200 bpm minijack at 200 bpm and hitting it at 150 bpm more or less produces the same judgement result due to how timing windows work in Stepmania; for more complex patterns, and for pattern sequences that extend a significant length of time, there is a clearly observable difference in result. For instance, hitting a 200 bpm 25 note jack at 150 bpm will both result in a significant number of misses and in you looking like an idiot for tapping the keyboard when nothing is on the screen.

Let’s try to figure out how much we “care”; 200 bpm 16th notes are separated by 75ms. A 25 note sequence occupies 1875ms in time. For the moment we’ll assume that we hit the first note neither late nor early (or ms point 0), and the timing window to hit a note extends 125ms. We must actuate the relevant key 25 times in 2000ms in order to avoid receiving a miss, requiring us to actuate the key more or less 80ms apart (technically you could still have an average actuation time of 80ms and get misses but we’ll assume that doesn’t happen) giving us a required bpm of (75/80)*200, or 187.5 bpm. So we can get away with hitting a 25 note 200 bpm jack at only 187.5 bpm, great!
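For anyone who wants to poke at the numbers themselves, here's the same back-of-the-envelope model as a small Python sketch. The 125ms window and the divide-by-25 shortcut mirror the paragraph above rather than the exact engine windows, and the function name is just an illustrative stand-in:

    # Slowest average speed at which an N-note jack can be tapped without a miss,
    # using the same rough model as the paragraph above.
    def min_effective_bpm(jack_bpm, note_count, miss_window_ms=125.0):
        gap_ms = 60000.0 / jack_bpm / 4       # 16th spacing: 75 ms at 200 bpm
        span_ms = note_count * gap_ms         # "occupied" time: 1875 ms for 25 notes
        budget_ms = span_ms + miss_window_ms  # 2000 ms to land every tap
        avg_gap_ms = budget_ms / note_count   # ~80 ms between actuations
        return gap_ms / avg_gap_ms * jack_bpm

    print(min_effective_bpm(200, 25))  # 187.5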

This is only, however, if we wish to avoid receiving a miss. In this instance, avoiding a miss is the degree to which we care; the margin of error for the implementation of erroneous muscle memory is strictly relative to the condition for success. Players have different conditions for success, general conditions that apply to their overall play patterns and perhaps more specific conditions for a particular file or instance of playing a file.

An obvious example of this is players who are only interested in AAAA'ing versus players who are only interested in A'ing during their play sessions. The former will not develop consistent muscle memory for playing songs much faster than their norm, because they simply don't play them. The latter will not develop highly refined and accurate muscle memory for the slow sections (if they exist) in their songs because scoring accurately on extremely easy sections of hard files bears no relevance to whether or not you A the file in the end. The drive to create and refine new muscle memory exists for each player, however it only exists relative to the standards and goals they have set for themselves.

This is an extremely important concept for a number of reasons, and it will be referenced in the future, however currently the most pertinent one is its relevance to mental stamina and the concept of reading resolution. Thus far we’ve been dealing with absolute numbers in our examples, 200 bpm, 200.0000001 bpm, and so on and so forth, however this does not comprise an accurate reflection of how players actually play the game. There is no player in the world who could tell the difference between the two speeds given a minijack. This is because no player in the world has the requisite reading resolution to discern the two during play.

While the skill of reading is immensely complex and integral to play and a more in depth discussion can be found here, for now, we will simply appropriate the concept of reading resolution.

Put shortly, reading resolution is the degree to which a player can discern differences in visual spacing for notes on the screen. Given a constant scroll speed (which at least in Stepmania, is the standard for play, for more context go here) the space between successive notes in a single column is an indicator of speed. If the player has adequate reading resolution, even if they have never hit a 200 bpm jack before in their life, they will be much better equipped to hazard a guess at exactly how fast it should be hit than a player with an inadequate reading resolution.

In fact a player with a sufficiently inadequate reading resolution will be functionally unable to tell the difference between 200 bpm and 150 bpm longjacks. And here functionally means during play. Yes, anyone with enough time observing an infinite 200 bpm longjack versus an infinite 150 bpm longjack will be able to tell you which is faster. This is however, independent of the construction of muscle memory to actually hit it, which is the point of reading it in the first place.

Reading is not one dimensional however; that is, a player doesn't develop the capacity to simply read up to a certain speed and that's the end of the story. Since higher accuracy requires a lower margin of error for hits within timing windows, it necessitates a higher reading resolution for the specific patterns in question. Reading resolution is relative both to the desired level of accuracy and to the specific patterns being read, and just like constructing and executing physical muscle memory, the more you deviate from your comfort zone either in speed or pattern configuration, the harder it will be to read.

Now comes the important part. Reading unfamiliar patterns at unfamiliar speeds is mentally taxing. Playing unfamiliar patterns at unfamiliar speeds is mentally taxing. By definition, you are doing both concurrently when playing unfamiliar patterns, or even familiar ones at unfamiliar speeds. Not only are you at a disadvantage when moving out of your comfort zone on two fronts, but the two fronts are co-dependent: the harder you find something to read the harder you will find it to hit, and the more you struggle to hit it the harder it becomes to read. This is the principal foundation for what players perceive as "skill drops", and a more detailed discussion on that particular subject can be found here.

Now you might ask me, given all that is working against me, how can I play anything outside of my element? Am I just doomed to play the same file on successively higher rates for the rest of my life?

The answer is quite obviously no. Unless all you own are OD packs, then yes, yes you are doomed. Despite all that works against you there are a number of important principles that also work for you.

If we look at skill growth for time invested we find that our returns are logarithmic. You spend more time getting less far, and this is always most true for your strongest skill. In many cases, while it may be initially daunting and depressing to work on a weaker skill, your return for time invested sees very high gains. It may take only a month for a top level player who is exceptionally weak in stream based patterns to mostly "catch up" with his other skills, while it may have taken a player exceptionally strong in streaming but weaker overall far longer to traverse the same distance. There are a number of ways to visualize this, however the simplest might just be to think of powerleveling a new party member in an RPG. Given the asserted definition of skill, it just isn't as hard to bring a weaker skill up to slightly below your best skill as it was to get your best skill as good as it is, and this is why.

While reading strength may vary in relative terms depending on the level of accuracy you wish to perform at and how familiar you are with the patterns, the capacity to efficiently process large volumes of visual information and parse them into relevant information is what you spend the greatest amount of time actually developing as a player. If you have only played jumpstream files for the last two years and you play a stream file of equivalent difficulty to your jumpstream skill, yes, you will play badly. Yes, it will be very bad in relation to your jumpstream skill. However, this isn't because you aren't capable of reading it, or even necessarily of hitting it. Barring extreme outliers in capability or incapability with specific patterns (oh-trilling) the mere fact that the file is equivalent to your rating as a player indicates you should be able to hit it; you just have to learn to do it.

New muscle memory has to be constructed and used as a base point for improving reading and so on and so forth, but the process is expedited because the wall of being unable to increase your reading resolution to a degree accurate enough to see visible improvement doesn’t exist. Muscle memory also exists on many different scales. Extremely specific muscle memory for a particular complex pattern configuration can be developed, both good and bad (being unable to read a pattern and developing muscle memory to hit it based on judgements is a quick way to develop very bad habits), however muscle memory is also internalized on the column specific level for two successive notes given a particular distance between them.

Essentially, what makes playing unfamiliar files difficult is that both the frequency of unfamiliar patterns for which no accurate muscle memory has been built and the magnitude of the difference become compounded. Conceptually, the magnitude of difference is defined as the inverse proportion of pre-existing muscle memory that can be used to construct new muscle memory. The more of your muscle memory you can draw upon and the closer it is to what you have to adjust it to, the less mentally taxing the event is and the more comfortable you are playing; in the reverse case, the experience becomes both incredibly frustrating and completely unproductive. Overloading yourself by trying to build too much new muscle memory at once results in accomplishing absolutely nothing at all.

This is why, despite the greatest gain on time investment, most players choose not to actively work on their weaknesses, and only let them "level up" over time to, essentially, the level at which the most basic pre-existing muscle memory combined with directed action is enough to produce adequate performance on relevant files. However, the discrepancy between a player's "minimum" skill and "maximum" skill (for skills that tend not to be polarizing) tends to lie within the margin of error of human perception of file difficulty until very advanced levels of play.

At the very top levels skillset specialization has generally accumulated its effect long enough that it produces very visible differentiations between players playing to their strengths and players playing files they are weaker at. The game mechanic of cbrushing also magnifies the difference, as cbrush cancelling is far and away the most demanding of muscle memory, reading skill, and directed action of any activity in the game. More can be found here.

Now, we know skillset specialization develops from general patterns of play that avoid unpleasant play experiences created by unproductive overload of the brain, but what is it about the different pattern types that actually produces such clearly defined skillsets for players to grow within?

First, let's take stream vs jumpstream. Both broad pattern types can be broken into a number of different subcategories, and players may find themselves excelling at certain subcategories within jumpstream or stream and being very weak at others (oh-trilly js vs 2h-trilly js, rollstream vs runningmen, etc), perhaps even weaker than their general skill in the other category entirely, but what makes these two categories of pattern types worthy of being called different skills entirely?

Over time the community has tacitly segregated the two, probably the end result of the original rift between spread and index players and the need for each to claim a territory. Regardless, the separation is not illogical. The defining characteristic of stream patterns is the sheer number of permutations that a given number of notes may be arranged in. While this may seem unimportant at first, not only is the number of patterns for which specific muscle memory must be developed much larger than for jumpstream, but the resulting patterns formed are more complex. Essentially, there’s more to learn and it’s harder to learn it.

Stream also has much more room for difficulty variation as a result of pattern arrangement. Despite being perfectly valid, many tight (defined here as forcing the player to make very fast motions on a single hand, usually as a result of repeated 131 configurations) stream patterns are rarely used anymore. They have been replaced by “friendlier” stream patterns in the current charting meta, shifting the inferred difficulty of a given value for stream bpm.

Nobody currently hears 400 bpm stream and imagines vertex gamma being played at 400 bpm, but players should be wary of this fact and keep in mind that there is far more variation in pattern difficulty within stream subcategories than within jumpstream subcategories.

The issue becomes muddied even further when “cheatability” is inserted into the issue. This will be discussed in detail separately, here.

Even with the general prevalence of friendly stream patterns, not only are streams generally more difficult to read at the same nps as jumpstream, but they tend to punish mistakes harder with longer cbrushes.

Now that we have established some basic logically derived support for why streams are harder than jumpstream, there are also reasons why jumpstream is easier than stream. Yes you read that right.

Hitting jumps may be difficult for newer players just beginning their exposure to the concept of hitting two notes at once, however, for more experienced players hitting two notes at the same time is essentially the same as hitting a single note from a mental processing perspective, and moreover, given specific configurations of jumps within jumpstream, from a physical perspective too.

Patterns with a decent density of jumps are not only less complex from a mental inscription perspective, but are easier to re-assert yourself within after getting lost or cbrushing.

Given five notes, all of which must lie at different points in time, you will have four different values for the ms distance from the previous note. If one of the five is a jump (two notes sharing a single point in time), you will only have three. The practical result is that it's easier to spot the next jump within jumpstream and figure out "where in time" the note exists even if you have no idea "where in time" you are currently playing, compared to having to single out a note within a stream where there is both more and greater variation in "where in time" that note is relative to the notes around it.

Jumpstream patterns are also much more easily taken advantage of by a wrist-up playstyle. This will be a very brief primer on the concept of technique, which is explained further here. If players learn to play wrists up, they can learn to develop muscle memory that treats every one hand jump as a slam of the wrist or forearm after aligning both fingers together. When perfected, this technique essentially removes one hand jumps from being relevant to difficulty; the larger the proportion of one handed jumps present in the file, the more it helps.

Jumps, and the presence thereof, in line with the above reasonings, are also more easily internalized and more "compatible" with other jumpstream patterns when appropriating existing muscle memory for use in constructing them. There are far fewer permutations to learn and the permutations that exist tend to be more like one another compared to stream patterns.

To be fair to jumpstream though, it also has qualities that make it more difficult than stream, relatively speaking. Hyper dense jumpstream patterns often create heavily anchored chains on the 8th notes. In the worst case, the chains end up forming on one hand producing a need for the player to have strong wristjack control in addition to playing the stream elements of the jumpstream.

While this is not necessarily the case from a theoretical vantage, in the current charting meta jumpstream files tend to be much more heavily anchored than stream files, further accentuating the divide between the two. Extremely fast jumpstream files require significant jack control and extremely fast streams require significant individual finger speed control.

In fact the current charting meta is a very interesting example of population and sample bias that can wildly affect perception of difficulty.

Before we dive in let’s first consider a hypothetical situation in order to underscore a couple important concepts. Two populations exist, both with the same distributions of skill level, and both have been playing exclusively a single type of file for as long as they have existed, in this case, say, 5 years. Population A has only ever played variations of rollstream files, and population B has only ever played hyper dense js files.

Now if both populations have existed for the same period of time and both populations have equal distributions of player skill we know by definition that the average player from both populations is of equal skill. What do you think will happen if we take an average player from each population and swap them into the other, along with the file they are best at? Yeah, sounds like a bad time. But let’s up the stakes.

Let’s now assert that the best player from each population A and B are of equivalent skill level. No, this isn’t a matter of “but how do we know”, stop, wrong idea. We’ve stated it; in this example we are omniscient beings pushing around imaginary pawns to create imaginary drama for our imaginary amusement.

The best player of population A (Bplayer A) goes to population B, and brings with him a variety of rollstreamy goodness. Bplayer A announces he is from an exotic and wonderful land and he has come to bring justice to the heathens of jumpstreamland. Bplayer A posts scores on a number of the files he has brought with him and disseminates his good word upon the masses of population B.

Population B sets to work playing them immediately. Players nearly fail out of every file, constantly cbrushing due to being unable to read any of the patterns. The best player of population B (Bplayer B) champions the efforts of his people and manages to score a high A on the hardest file provided, a file Bplayer A has effortlessly AA'd. Disappointed that a player from a far off land he has never heard of has utterly dominated him, Bplayer B lies at the feet of his new overlord Bplayer A, and relinquishes his kingdom to Bplayer A and his ilk. Members of population A are revered as gods.

There are a couple obvious issues with this example. First, we know that if we flip the tables and Bplayer B becomes the grand emissary to population A, the same thing will happen, only population B becomes revered.

Second, this only holds up if no player from population A ever plays any file from population B. The operating assumption of population B is that “because these players are superior at files we are bad at, it must mean they are superior overall”. This is not an unreasonable assumption, it could be true, but in fact in our exercise it is not.

If Bplayer A posts a bunch of scores of him getting absolutely smashed on what population B would consider to be moderately difficult jumpstream files, the entire charade collapses and it devolves into an, in the short term at least, irreconcilable argument over whose files are actually harder and whose players are actually better. The argument has no resolution because each side’s perception of difficulty has been strictly developed within their respective file metas and there is no player with a commonality of experience to relate the two.

Now, if we assume eventually over time the two populations come together and begin collectively playing both of their charts, and that Bplayer A and Bplayer B both invest the same amount of time with the same amount of skill returned for their time investment, and they realize that they are both as good at their weakness as the other, they might reasonably conclude that in fact they are both approximately the same skill level.

Now that's an if, and that's assuming the absolute proper conditions exist for such a revolutionary exclamation to be made. The more likely scenario is some level of reconciliation will occur and both Bplayer A and Bplayer B will believe that the other is the better player, when the reality (that we have asserted, remember) is they are both equivalent in skill.

It is extremely difficult to dissociate yourself as a player from your strengths as a player when judging how difficult a file is, yes this is true, but “it’s easy for me, therefore it must be easy for everyone” and “I can’t do it, so it must be hard” are not valid lines of reasoning, least of all the most grievous offender, “if I can wreck this so hard but I get smashed on a file another person is good at, then they must be able to super wreck my scores”.

The second and just as if not more pertinent lesson to our current charting meta is that just because on average players are weaker at a specific file type, doesn’t mean it’s inherently more difficult. This is a major flaw in communally constructed difficulty systems. We know in our example population B would have massively overrated the difficulty of every file produced by Bplayer A relative to their own internal rating system, particularly upon initially playing them.

The population bias can work in reverse as well. Let's assume that Bplayer A posted a bunch of scores on files he brought and nobody from population B played any of them. Perhaps those files would be added to their internal rating system, but it's unlikely any of them would particularly care about being thorough; the operating assumption would be that it's highly unlikely Bplayer A was anywhere near as good as Bplayer B, and the relevant files would be underrated as a result.

The problem with population bias isn’t that it can’t be fixed, it’s that, by definition, populations prone to high levels of population bias are incapable of diagnosing it. There’s more to be said on the advantages and disadvantages of communally constructed difficulty systems, but not for now.

For now let’s take what we’ve theorized to have observed in our imaginary and extreme thought experiment and apply it to what we’ve established about file types through logical reasoning and furthermore to the current charting meta and further furthermore how it is relevant to this particular discussion of difficulty.

It’s a given that stream patterns at the same nps as js patterns are on average harder, the question is, by how much? If we don’t know how much, how do we find out?

The current charting meta for Stepmania keyboard play has been almost exclusively dominated by jumpstream in recent years, particularly towards the upper end of experienced players (where all of this actually makes a difference). NB3-5 are the gold standard packs to have and NB rates are the staple food for many players, and often what they revert to when nothing in particular draws their attention. In fact there are a number of players who have essentially played nothing but NB rates for their entire playing careers.

Now we can make a reasonable assumption that the upper echelons of keyboard Stepmania play are likely to be biased towards jumpstream (in terms of proportion of time spent during play for specific patterns). If we conclude that we are biased, we need to make a relatively confident assertion as to by how much. Recall, this is strictly for jumpstream vs stream files, nevermind jack files, for which the process must be repeated.

Thankfully, unlike in our imaginary Stepmania purgatory, we have many players who enjoy both stream and jumpstream files and are relatively even in their strength in both areas. Using these players as initial benchmarks is the first step. Taking their scores and creating benchmark pairs for what are asserted to be roughly equatable difficulties of stream vs jumpstream files will give us target benchmarks for adjusting difficulty trends for stream and jumpstream files respectively. Ideally benchmark pairs would be roughly the same length and have the same file flow, and insofar as is possible, contain similar pattern structure.

After identifying logical underpinnings for what elements of each file (for example, frequency of one hand jumps) make them easier or harder, adjust the algorithm based on what you have theorized. Operating on the assumption that your benchmarks are close to their true values of difficulty, you refine, scrap, and create new modifiers until your algorithm places benchmark files roughly where you think they should be.

Then you go back and observe the results on a larger data sample, and if there are clearly underrated or overrated files, you take them and go back to the drawing board until you've gotten something that both makes sense logically and produces sensible results. The moment you can't tell whether or not anything is overrated or underrated is the moment at which your algorithm has exceeded human perception bias and any further improvement is no longer relevant. And that's the moment when you let your algorithm alter your perception of difficulty rather than continuing to guide it.
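To be concrete about what that loop looks like in the abstract, here's a rough Python sketch. The single multiplicative stream modifier, the field names, and the nudge size are all hypothetical stand-ins; the point is only the iterate-against-benchmark-pairs structure, not any real rating model:

    # Conceptual sketch of calibrating a stream-vs-jumpstream modifier against
    # benchmark pairs asserted to be of equal difficulty.
    def rate(chart, stream_mod):
        base = chart["nps"]  # crude stand-in for whatever the real rating would be
        return base * (stream_mod if chart["type"] == "stream" else 1.0)

    def calibrate(benchmark_pairs, stream_mod=1.0, steps=200, lr=0.001):
        for _ in range(steps):
            error = 0.0
            for stream_chart, js_chart in benchmark_pairs:
                error += rate(js_chart, stream_mod) - rate(stream_chart, stream_mod)
            # if the js benchmarks rate higher than their stream partners, stream
            # is being underrated, so nudge the stream modifier up (and vice versa)
            stream_mod += lr * (error / len(benchmark_pairs))
        return stream_mod

    pairs = [({"type": "stream", "nps": 18.0}, {"type": "js", "nps": 21.0}),
             ({"type": "stream", "nps": 22.0}, {"type": "js", "nps": 26.0})]
    print(calibrate(pairs))  # settles near the ratio that lines the benchmarks up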

Technique

Most reasonably veteran players, or even stream viewers, have been exposed to the concept(s) of technique. Some files are technical, some players are technical, some players have more technique or better technique, or both, and so on and so forth. Through experience and context most people have a rough idea of what technique actually is, but when pressed to explain to someone who does not understand or wishes for a clearer and more precise understanding, they often have difficulty vocalizing the concepts, particularly on scales greater than specific examples.

Like much else in Stepmania, technique is interdependent with other skills, such as reading and physical execution. Visible increases in any category tend to alleviate strain from players, resulting in improved performances in all categories. Technique, as with other skills, can also be broken down into successively more specific subcategories.

The two overarching categories of technique are playing technique and reading technique. Within each are specific techniques for application in particular circumstances, and a player's overall technique is roughly the breadth and strength of their specific techniques. Note that in order for anything to functionally be a technique, it must produce an outcome better than what could otherwise be obtained.

A specific playing technique is any single key actuation or any set of key actuations that are designed to either partially or completely reduce the strain of actuating keys through independent finger movement.

While that may be difficult to digest completely for the moment, it’s easier to understand when presented with various techniques employed during play. Playing technique originated as simply sets of methods for abusing pattern cheatability, however over time it has expanded significantly.

The most common playing technique is referred to as “jumptrilling”. Jumptrilling is mostly employed on rolls of sufficient speed, and it very simply involves executing a jumptrill instead of individually rolling each note. By abusing timing windows and pattern configuration players are able to remove the need for half of the otherwise required individual key actuations.

Jumptrilling technique can be further refined by employing a “staggered jumptrill” in which the jumptrill motion occurs while the fingers on each hand are offset from one another, producing a deviation in the timings of their actuation.

Both forms of jumptrilling incur a significant penalty on accuracy, less so for the latter if mastered, but relative to accurately performing each actuation as a roll at the intended speed it is still severe. The trade-off is significantly reduced probability of error, magnitude of error, and physical and mental strain.

This is great, but consideration of the advantages of jumptrilling rolls versus actually rolling them is a moot point for the majority of players. Most players are jumptrilling rolls simply because they are unable to reliably hit the rolls otherwise. In the scenarios for which this is not the case, the question then essentially becomes "do I care enough about pure accuracy to make this file much more difficult than it has to be?". For accuracy players in the context of accuracy files with jumptrillable rolls within them, the answer is yes; for most everyone else, it is generally no.

Rolls aren’t the only jumptrillable pattern, though. Any pattern that mimics the properties of rolls (even distribution of notes between all fingers and even distribution of timing differentials between notes) can be jumptrilled. The more patterns deviate from those parameters, the harder they become to jumptrill, or, more skilled jumptrilling is required.

It is in these contexts that the question of whether or not to employ jumptrilling becomes more difficult. The more patterns deviate from vanilla roll or jumptrill configurations, the riskier it becomes. Not only is there a significantly higher risk of failing to hit the pattern in question without large point penalties, but the more significant issue is whether or not the player has enough control to transition into and out of the jumptrillable section properly.

The term control is used often in Stepmania for a variety of definitions, but to me it means something very specific. A player’s control is their ability to seamlessly transition to and from any combination of various playing techniques, standard play, and especially directed action (directed action being either adjusting or overriding muscle memory for a more favorable outcome than would otherwise have been produced); the more frequent and distinct transitions are, the greater the level of control required. For more context on control, go here.

Technical files are files for which difficulty is disproportionately reduced for players with strong technique and control. Pattern cheatability plays a large part in determining whether or not a file is technical, however it is not the sole determinant, and files comprised largely of abusable patterns aren't necessarily technical. Control requirement is the defining factor in whether or not a file is technical.

Now, anyone can jumptrill a roll, but what truly separates players who have strength in technique compared to others essentially becomes how much more often they answer the question “can I make this file easier by using technique(s)?” with yes, and the frequency with which they are rewarded for doing so.

Techniques aren’t just about abusing patterns, they’re about streamlining and maximizing the efficiency of play in any form. One of the most difficult techniques is called “Actuation point surfing”, or APS for short.

Conceptually, APS is simple, however in order to understand it it’s important to understand what the standard is.

When actuating a key to hit a note most players bottom out the key and return their finger to a starting point above the keycap, such that the keycap fully returns to its default state. This is a strike cycle, and depending on the player, a strike cycle may span a distance of roughly 8-14mm. Most mechanical switches have a bottom-out depth of 4mm and most players keep their finger starting positions above the keycaps.

The action of bottoming out keys and returning to a standard starting position above the keycap creates a uniformity to striking that promotes accuracy and control. In this case, eventually, in order to play progressively faster patterns players must reduce the duration of their strike cycle by either decreasing the distance between their starting points and the keycaps, or by increasing their strike speed.

Simply put, the duration of a strike cycle should not exceed the ms difference between two successive notes on a single column. If it does, depending on how much the difference is and how long the player is tested, misses or cbrushes might occur.
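As a rough rule of thumb the per-column budget is easy to compute. A small sketch (Python, illustrative names only), with the caveat that a straight jack is the worst case and rolls or streams recycle each column far less often:

    # ms available for one full strike cycle on a single column, i.e. the gap
    # between successive notes in that column.
    def column_budget_ms(bpm, subdivision=4, notes_between_repeats=1):
        gap_ms = 60000.0 / bpm / subdivision
        return gap_ms * notes_between_repeats

    print(column_budget_ms(200))                           # 75 ms: 200 bpm 16th jack
    print(column_budget_ms(200, notes_between_repeats=4))  # 300 ms: 200 bpm 16th roll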

Where pure speed is concerned, the accuracy afforded by consistent strike cycles is generally unneeded, the stability and efficiency for muscle memory development is the real prize. If you know you’re falling behind in a stream you can shorten a cycle to catch up by manually moving your finger back down before it reaches the starting point, and then return back to the normal cycle until another manual correction needs to be made. Eventually the cycles normalize such that no manual corrections need to be made, and the process begins anew on something harder. This is the framework within which nearly every player develops muscle memory, or, learns to play the game.

While this process is the most efficient method for learning the game, it is not the most efficient for playing it. The distinction is most obvious when considering what constitutes playing for improvement and playing for scoring. Playing for scoring involves taking a player at a given skill level and maximizing their capacity to perform well and stripping away anything that interferes with this goal.

Enter APS. APS takes strike cycles and minimizes them to the absolute greatest degree. The starting point becomes the lowest point at which the key is not yet actuated. For most mechanical switches the point of actuation sits between 1.5-2.5mm of depression. This means not only will your fingers never disconnect from the keycaps, but the keycaps should never return to their default position at all. Only enough force is applied to actuate the key from this point, bottoming out no longer occurs, and fingers are reset to just above the point of actuation. Ideally, the strike cycle is reduced to 1-2mm, which negates the requirement for high strike speed.
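To get a sense of how much physical work this strips away, here's a crude comparison of average fingertip speed using the figures above, assuming one strike per 16th note and constant velocity over the cycle (which real strikes obviously aren't):

    # Average fingertip speed for a full strike cycle vs an APS-sized cycle.
    def avg_finger_speed_mm_per_s(cycle_mm, bpm, subdivision=4):
        gap_s = 60.0 / bpm / subdivision  # time available per strike
        return cycle_mm / gap_s

    print(avg_finger_speed_mm_per_s(10.0, 200))  # ~133 mm/s for a ~10 mm full cycle
    print(avg_finger_speed_mm_per_s(1.5, 200))   # 20 mm/s riding the actuation point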

Reduction in both distance to be covered and necessary speed all but completely nullifies the physical strain. Because APS minimizes the physical element of play so greatly physical stamina becomes a non-factor, and performance is entirely contingent upon how long execution can continue before messing up.

Learning to use APS and mastering it seems like the obvious choice for any player who wishes to obtain the best scores they can, however it should be equally obvious why this isn’t easy. Before talking about what is even needed to master APS, it’s important to know the circumstances under which it’s actually usable.

APS is realistically restricted to patterns complex enough to require individual finger motion to hit but fast enough that late or early hits will fall within the timing window of something (think 450+ stream equivalent). Thankfully, this is where it does the most good. Anything too slow and fingers will tire out from being held just above actuation points and accidental presses may occur resulting in poor scoring on sections you would have been able to score better on playing normally anyway. As previously stated, accuracy won’t be very high, and capacity to control accuracy even lower.

APS is also only usable with brown/clear switches, or any switch type that involves a tactile bump upon actuation. The bump is used as a tactile confirmation of actuation and fingers “ride the spring” back to reset points instead of being lifted through motor control. This further cuts in half the necessary motion on the part of the player, and the opportunities for error along with it.

Theoretically it's possible to do this without a tactile switch, however as discussed previously, between auditory, tactile, and visual feedback avenues, tactile is far and away the best. Moreover, APS will create virtually no sound, leaving you with only visual feedback, which is by far the worst. I can imagine a universe in which someone could spend years training themselves to become so refined and accurate they were capable of employing this on linear switches just to prove me wrong; however, that's like spending 5 years learning to ride a bike by only pedaling it on the outside of a space station. Yes, you did, technically, learn to ride a bike in space, but you could have just done it on Earth like everyone else, and a lot faster too.

Now what does APS actually require in order to learn? It needs construction of a totally new set of muscle memory specific to its execution and control. A player needs to be able to repeatedly apply enough force to overcome actuation pressure and then immediately release just enough for the spring to push back on fingers. This requires incredibly fine motor control, as you may imagine. The control required in order to manually adjust for errors is also extremely high, and it has to be built and refined specifically for APS execution. On top of all this, the player also must be able to comfortably read the pattern they intend to use APS to hit.

If any of these skills are too weak, APS becomes entirely unusable and will result in something looking much like mashing. Setting aside the fact that it takes years to become skilled enough to read 450+ equivalent comfortably, it still takes years to master APS even when actively training it. Nevertheless, it is essentially the Stepmania "endgame".

Reading techniques involve reading complex patterns by identifying patterns present on at least a single note denomination level above what the pattern is actually comprised of. Reading technique is much more broad than specific playing technique, and while it’s easy to explain to players exactly how and why they should jumptrill a roll, it’s much more difficult to do the same with reading techniques.

Most players develop and employ some reading techniques whether they realize it or not. By virtue of specialization most players have a more comfortable time reading and playing certain patterns compared to others. Reading technique involves identifying patterns you are strong at within otherwise unmanageable patterns and adjusting execution accordingly.

For example, players that are strong in jacks may find it easier to identify jack sequences formed by anchored stream or jumpstream and hit them as they would jacks of that particular speed and length, while using their confidence in the relative jack timings to hit offshoot notes rather than processing and hitting everything sequentially.

The same is also applicable to runningmen. Rather than timing offshoot notes by their relative distance from each other or the judgement line, identifying which two notes within the jack sequence they fall between and hitting them based on their positions relative to the jack sequence is much less taxing and more accurate.

In order for a player to have strong reading technique they must necessarily have exceptional reading skill. The stronger their reading, the better their ability to quickly identify familiar patterns within unfamiliar ones, the faster they can form and refine reading technique, and the greater the effect it will have on their play.

Wrist up vs wrist down

Spread play is further subdivided into two distinct styles, wrist up and wrist down.

Wrist down tends to be the natural style of spread, particularly for beginner players. Wrist down is exactly what it implies: wrists are left to rest on a surface and individual finger motion comprises most strikes, occasionally supplemented by wrist motion that either pivots on the resting point of the hands or comes from temporarily raising the wrists (particularly for wristjack patterns).

Wrist down overview: Wrist down is adopted so frequently because it replicates natural typing on the keyboard, making it the first option available (for beginners the most immediate option is more or less instantly adopted), however it has its advantages and disadvantages when compared to wrist up.

Advantages:
- Feels natural
- Minimizes total movement and energy expenditure
- Facilitates individual finger motion, making one-hand based patterns easier
- Superior accuracy
- Temporary wrist up play may be incorporated to deal with wristjack heavy patterns
- Simple execution
- Players with predispositions to increased strength or range of motion of their fingers (generally as a result of double jointing) find high reinforcement of natural strength

Disadvantages:
- Focused strain on forearm muscles that control index and middle fingers
- Weakness at heavily anchored/jacky patterns
- Prone to wrist injury
- Lower "skill cap"

Wrist up overview: Wrist up is by far the less frequently employed style of spread. Wrist up involves permanently hovering the hands above the keyboard during play. Generally the arms do not rest on any surface, though resting the elbows on chair arms is an option that can be made to work. It is difficult to pinpoint the origin of wrist up; however, it likely grew out of players who focused on jack heavy files and wished to remove the need to transition their hand positioning in order to maximize their performance.

Advantages:
- Distributing strike motion along the entire arm allows for the use of more complex techniques
- Stamina drain is reduced due to distribution of physical stress
- High strength in anchored/jacky patterns
- High "skill cap"

Disadvantages:
- Weakness at one hand based patterns, particularly those of considerable length or frequency
- Generally requires larger investment of time to play at the same level as wrist down
- Accuracy tends to suffer on easier files
- More "moving parts" means more things can go wrong
- More physically draining overall
- Requires development of most arm muscles, and especially shoulder muscles
- While in theory possible, it is too difficult and unreliable to transition to wrist down to aid in one hand patterns while playing wrist up

Wrist down detail:

By nature of the style, wrist down places the majority of physical stress on the forearm muscles that control individual motion of the index and middle fingers. This is both a boon and a drawback: players will find that this focus develops high strength in the muscle group in question and reinforces their strength at one hand patterns and general individual finger motion; however, the singular focus on these muscles means strain or injury can severely hamper play.

Wrist down is also much simpler in execution; the majority of motion is exercised using only the forearm muscles with occasional supplement from the wrist. This minimizes the amount of muscle memory that needs to be generated and creates consistent strike cycles through which muscle memory can be most quickly formed and improved upon. The consistency with which motions are made creates an idealized framework for focus on accuracy. Errors or mis-timings need to be corrected relative to only one or two independent motor movements, compared to the four or five when playing wrist up. Fewer muscles in play means muscle memory development is focused.

The only nuance to wrist down execution is raising the wrists to more easily hit wristjack heavy patterns. It can be difficult to learn to seamlessly transition between the two, and it is generally inadvisable to do so rapidly, however it can be learned, and it is essentially as far as the "skill cap" for wrist down technique extends.

The majority of accuracy focused players tend to favor wrist down playstyle for this reason. For the purpose in question the advantages conferred by wrist up are inconsequential and the disadvantages are too severe.

Apart from accuracy, the strength of playing wrist down comes from the streamlined process of hitting one hand trills, or patterns based heavily on them. The angle and position of the wrist and fingers during wrist down tends to maximize the strength of, and subsequently the control over, individual strikes, further enhancing accuracy. Greater strength and control results in less stamina expended for each strike in total, but the stamina drain tends to be focused.

Due to the focus on specific parts of the hand and forearm, any genetic traits that contribute to better performance with individual fingers tend to become magnified. The limited space with which to work means double joints and the associated increase in range of motion and/or strength are particularly beneficial. Personally, I have a number of double joints present in my left hand that are absent in my right. I have greatly increased range of motion and strength in my left hand, and indeed during play I am almost entirely gated by the limitations of my right hand relative to my left. In this scenario playing wrist down has effectively zero benefit, as it is irrelevant how much stronger my left hand is than my right; the only relevant factor to scoring is the strength of the weaker hand. It is for this reason that I adopted the wrist up playstyle.

In order to fully understand the purpose and benefits of wrist up a comparison can be drawn to pitching a baseball. Wrist down is like pitching a baseball using only the muscles below your shoulder. This means you are essentially pitching with only your wrist and tricep. Naturally your pitch won't go very far, or very fast, which is why nobody does this, but this is fine if distance and speed are not desired goals. If the desired goal is the ability to launch the baseball at the same trajectory and with the same force every pitch, then this indeed becomes a desirable approach to pitching the baseball, and it can probably be learned fairly quickly. Beyond the scope of this purpose, this is a terribly inefficient manner in which to pitch a baseball.

Wrist up is roughly equatable with pitching a baseball using your full body. Using your entire arm grants additional distance and speed, and further employing other beneficial body movement only gets you farther and faster.

In Stepmania, however, the distance and speed during strikes are mostly fixed. We aren’t using additional muscles to go faster or get further, we’re using additional muscles to distribute stress among various sets of muscles so that either stamina drain can be significantly reduced, or, control can be significantly increased. Motions that would otherwise require the movement of one or more individual fingers can be replaced with movements of the wrist, the forearm, or even the shoulder.

The trade-off is that it takes much longer to learn finer motor movement using those muscles compared to your fingers. Not only does it take much longer, but in order to properly maximize wrist up technique, muscle memory for every permutation of muscle state created through increasing the number of moving parts must be developed. Think of playing frogger with two lanes versus five.

Note: I'll be referring to using the forearm to perform motions, which is anatomically erroneous; technically moving the forearm is the job of the tricep, but that sounds dumb and pretentious and you all get what I mean anyway.

The simplest replacement technique that can be learned when playing wrist up is the "one hand slam". One handed jumps played wrist down require two individual motions to be made simultaneously; this isn't necessarily difficult and it can be done with relative ease, however any pre-existing pattern for which a note is added to form a one handed jump is made harder by its existence. Playing wrist up allows these motions to be replaced by slamming either the forearm or the wrist downward such that the keys become actuated as a result. Doing this only requires the fingers to move such that they are aligned when they actuate the keys, and individual finger motion can resume on the notes afterwards.

If we return to our pattern for which an extra note has been added to form a one handed jump, our technique effectively deletes the two notes and makes the pattern easier, rather than more difficult. Even players not versed in wrist up technique can observe the benefit in isolation by playing a test file with only a single one hand jump and practicing for a little bit. They will likely find it slightly more difficult to hit every other note in the pattern, but the one hand jump will be significantly easier.

Executing in isolation is completely different from practical play; given a large enough time interval between one handed jumps, whichever muscle is employed to slam will have returned to a resting state and be prepared once again. This is generally not reflective of actual play, and particularly not reflective of virtually any jumpstream file played fast enough. In practice, the muscle that you used to slam will likely not be ready before the next one handed jump, forcing the player to either hit the jump in question as they would normally, or to use a different muscle to perform the slam.

If the forearm is initially used, the wrist can then be employed instead while the forearm is returning to resting position. Depending on the frequency of one handed jumps it's possible that neither the wrist nor the forearm will be ready to slam with any accuracy, and in this case it's possible to rotate the entire arm using the shoulder to replicate the motion.

The challenge comes from building muscle memory to effectively delegate which muscle takes on the responsibility for the task, maximizing the reduction of one handed jumps in a manner that adjusts for any possible position of the arm or any muscles being used. It's like using a minigun to shoot fish in a barrel: yes, it's really overkill, but once you get the hang of it you'll be done killing fish faster than the guy next to you with a pistol.

Even within its own context learning to use the one hand slam technique is not particularly daunting. The true challenge comes from learning all the other replacement techniques and using them in conjunction with one another where appropriate; in essence, it would be like learning to seamlessly transition between a minigun, shotgun, sniper rifle, and rocket launcher to shoot fish in barrels of varying sizes and distances.

Wrist up is not particularly strong at one handed patterns. The angle at which hands must be held is fairly unfavorable for individual finger motion, but this is not a consequence that we must accept. The purpose of playing wrist up is to expand the available use of techniques, and there is indeed a replacement technique appropriate for minimizing the ground lost on strength at one hand trilling. Remember, the game doesn’t care how you actuate the keys, it only cares that the keys are actuated when they’re supposed to be.

The motion of one handed trilling can be mimicked almost without moving the fingers at all. Fingers can be held together while the wrist is rotated over the keys. It's likely that the finger intended to be pressed will have to be slid down slightly and the other slid up slightly, and the process repeated as the wrist rotates in the opposite direction, but the action of rotating the wrist to cover the majority of the distance provides ample time to learn to do this seamlessly. In practice, this is incredibly difficult to learn and employ while playing, particularly if you already struggle with one handed patterns, and it is also incredibly unreliable for any extended one hand trills. Extended wrist rotation is exhausting, and beyond a choice few rotations any subsequent ones massively increase the probability of error. Luckily, one hand trill patterns do not exceed this length very often, and the majority of the time the strain of one handed patterns can be alleviated with wrist rotation. Wrist rotation can also be used to artificially realize the "swingyness" of polyrhythmic patterns.

It's important to note that the dexterity present in human fingers is very high, and the precision of movement and strength is far greater than that of any of the other muscles present in the arm. Even if any replacement technique is compensated for with minor adjustments on the part of the fingers, it is unlikely that accuracy on the order of individual finger motion will ever be replicated; in the specific case of wrist rotation it is almost necessary to attempt to hit the trill slightly early to maximize the window for clean transitioning.

Twitch tapping is an offshoot of the one hand slam technique; in execution it is mostly the same, however it is used under different circumstances. Occasionally files may contain fast minijacks (fast being 300bpm+ or so). This is not an issue if they are in isolation and not accompanied by patterns that already challenge the skill of the player, however in the event that they are, they can be effectively nullified through twitch tapping.

Twitch tapping is the motion of actuating a key with one muscle movement, releasing the key with another muscle movement, and then instantly re-actuating the key with a third muscle movement that overrides the key release movement and produces a secondary actuation extremely quickly. Depending on quality of execution the second actuation may come within 20-100ms, but given the timing windows of the game it is irrelevant. With this technique any minijack at any speed can be PA’d on judge 4.

While it may seem complex, and it can be difficult to learn initially, it is one of the easier replacement techniques to effectively use during play and generally doesn't interfere too much with any of the others. The only difficulty comes from rapid successive employment to hit repeated minijacks.

Jacks, jumpjacks, and two hand jumptrills may all be made simpler through wrist up play. Wristjacking was coined for an obvious reason: the patterns were most easily hit with the wrists and as a result people did just that, they hit them with their wrists. This doesn't necessarily have to be the case. The forearms or shoulders can be used to replicate, with slightly reduced accuracy and speed, the motion of wristjacking, and may be used even during wristjacking in order to magnify it. If not used in conjunction they may be used in rotation: successive jack patterns can be approached by alternating jacking among any of the three muscle groups, drastically reducing the physical strain set upon the player. Two hand jumptrills are absolutely the easiest patterns to take advantage of this, and it's nearly always advisable to delegate them to the shoulders. Committing these particular patterns to shoulder specific muscle memory can result in very high accuracy on top of a negligible amount of strain.

The one hand slam, once mastered, can be refined further and applied to any heavily anchored stream or jumpstream patterns. Rather than a decisive slamming motion, the wrists, forearms, or shoulders can be used to "slowjack" the anchored patterns while individual fingers are modulated to produce key actuations between the slowjacks. It may seem like a player is simply jumptrilling the pattern in question, and they would more or less be correct; however, this is a far more nuanced execution that requires mastery on multiple levels to obtain any decidedly improved result.

Learning to play wrist up doesn’t just involve development of additional muscle memory, it requires the development of additional muscle. The first issue that players face when switching from wrist down is hovering hands above the keyboard without support. This requires considerable development of shoulder muscles that the majority of players don’t have, even for players with sufficient muscle mass in the shoulders it will likely take a few sessions before tiredness or soreness from extended play ceases.

Any muscles used in replacement techniques must also be developed to keep up with their counterparts. This is not significant for most players, as the muscle requirement below a certain level is essentially nil; however, above a certain level (somewhere around d7ish) play requires significant muscle mass, and it takes considerable time and effort to develop strength and stamina in the relevant muscles before wrist up techniques can be exploited to the fullest extent.

When discussing which techniques players should use, or really any facet of gameplay, there tend to be two perspectives on the matter. On the one hand players will tell you that what is important is what works for you, and the best option is the one you are the most comfortable with. On the other hand players will tell you there is a best way to play and you should play that way no matter what.

Both perspectives are correct and incorrect in some regards. Yes, technically, if you specifically are incapable of playing a certain style and you will never be better with it regardless of the amount of time invested compared to another style you are using, then there’s no point in switching. Even if other players see considerable improvement.

This however doesn’t mean there isn’t a superior style given an average player, and it certainly doesn’t mean different styles don’t have differing levels of exploitable performance. Nobody will tell you index is better for playing jumpstream than spread given the same time invested into both styles. It’s irrelevant if one person somewhere plays index because they only have two fingers and are unable to play spread. In their case— yes, spread is the inferior playstyle, but it speaks nothing to the average or even the highest possible level of skill that can be attained with a given style for a player who isn’t physically incapable of playing it.