Posts on psychometrics: The Science of Assessment

If you have worked in the field of assessment and psychometrics, you have undoubtedly encountered the word “standard.” While a relatively simple word, it has the potential to be confusing because it is used in three (and more!) completely different but very important ways. Here’s a brief discussion.

Standard = Cutscore

As noted by the well-known professor Gregory Cizek here, “standard setting refers to the process of establishing one or more cut scores on a test.” The various methods of setting a cutscore, like Angoff or Bookmark, are referred to as standard setting studies. In this context, the standard is the bar that separates a Pass from a Fail. We use methods like the ones mentioned to determine this bar in as scientific and defensible fashion as possible, and give it more concrete meaning than an arbitrarily selected round number like 70%. Selecting a round number like that will likely get you sued since there is no criterion-referenced interpretation.

Standard = Blueprint

If you work in the field of education, you often hear the term “educational standards.” These refer to the curriculum blueprints for an educational system, which also translate into assessment blueprints, because you want to assess what is on the curriculum. Several important ones in the USA are noted here, perhaps the most common of which nowadays is the Common Core State Standards, which attempted to standardize the standards across states. These standards exist to standardize the educational system, by teaching what a group of experts have agreed upon should be taught in 6th grade Math classes for example. Note that they don’t state how or when a topic should be taught, merely that 6th Grade Math should cover Number Lines, Measurement Scales, Variables, whatever – sometime in the year.

Standard = Guideline

If you work in the field of professional certification, you hear the term just as often but in a different context, accreditation standards. The two most common are the National Commission for Certifying Agencies (NCCA) and the ANSI National Accreditation Board (ANAB). These two organizations are a consortium of credentialing bodies that give a stamp of approval to credentialing bodies, stating that a Certification or Certificate program is legit. Why? Because there is no law to stop me from buying a textbook on any topic, writing 50 test questions in my basement, and selling it as a Certification. It is completely a situation of caveat emptor, and these organizations are helping the buyers by giving a stamp of approval that the certification was developed with accepted practices like a Job Analysis, Standard Setting Study, etc.

In addition, there are the professional standards for our field. These are guidelines on assessment in general rather than just credentialing. Two great examples are the AERA/APA/NCME Standards for Educational and Psychological Measurement and the International Test Commission’s Guidelines (yes they switch to that term) on various topics.

Also: Standardized = Equivalent Conditions

The word is also used quite frequently in the context of standardized testing, though it is rarely chopped to the root word “standard.” In this case, it refers to the fact that the test is given under equivalent conditions to provide greater fairness and validity. A standardized test does NOT mean multiple choice, bubble sheets, or any of the other pop connotations that are carried with it. It just means that we are standardizing the assessment and the administration process. Think of it as a scientific experiment; the basic premise of the scientific method is holding all variables constant except the variable in question, which in this case is the student’s ability. So we ensure that all students receive a psychometrically equivalent exam, with equivalent (as much as possible) writing utensils, scrap paper, computer, time limit, and all other practical surroundings. The problem comes with the lack of equivalence in access to study materials, prep coaching, education, and many bigger questions… but those are a societal issue and not a psychometric one.

So despite all the bashing that the term gets, a standardized test is MUCH better than the alternatives of no assessment at all, or an assessment that is not a level playing field and has low reliability. Consider the case of hiring employees: if assessments were not used to provide objective information on applicant skills and we could only use interviews (which are famously subjective and inaccurate), all hiring would be virtually random and the amount of incompetent people in jobs would increase a hundredfold. And don’t we already have enough people in jobs where they don’t belong?

The generalized partial credit model (GPCM, Muraki 1992) is a category in a group of families in item response theory (IRT).

It is designed to work with items that are partial credit.  That is, instead of just right/wrong as possible, scoring an examinee can receive partial points for completing some aspects of the item correctly.  For example, a typical multiple-choice item is scored as 0 points for incorrect and 1 point for correct. 

A GPCM item might consist of 3 aspects and be 0 points for incorrect, 3 points for fully correct, and 1 or 2 points if the examinee only completes 1 or 2 of the aspects, but not all three.

Examples of GPCM items

GPCM items, therefore contain multiple point levels starting at 0.  There are several examples that are common in the world of educational assessment.

The first example, which nearly everyone is familiar with, is essay rubrics.  A student might be instructed to write an essay on why extracurriculars are important in school, with at least 3 supporting points.  Such an essay might be scored with the number of points presented (0,1,2,3) as well as on grammar (0=10 or more errors, 1= 3-9 errors, and 2 = 2 errors or less). Here’s a shorter example.

Another example is multiple response items.  For example, a student might be presented with a list of 5 animals and be asked to identify which are Mammals.  There are 2 correct answers, so the possible points are 0,1,2.

Note that this also includes their tech-enhanced equivalents, such as drag and drop; such items might be reconfigured to dragging the animal names into boxes, but that’s just window dressing to make the item look sexier.

The National Assessment of Educational Progress and many other K-12 assessments utilize the GPCM since they so often use item types like this.

Why use the generalized partial credit model?

Well, the first part of the answer is a more general question: why use polytomous items?  Well, these items are generally regarded to be higher-fidelity and assess deeper thinking than multiple-choice items. They also provide much more information than multiple-choice items in an IRT paradigm.

The second part of the answer is the specific question: If we have polytomous items, why use the GPCM rather than other models? 

There are two parts to that answer that refer to the name generalized partial credit model.  First, partial credit models are appropriate for items where the scoring starts at 0, and different polytomous items could have very different performances.  In contrast, Likert-style items are also polytomous (almost always), but start at 1, and apply the same psychological response process on every item.  For example, a survey where statements are presented and examinees are to, “Rate each on a scale of 1 to 5.” 

Second, the “generalized” part of the name means that it includes a discrimination parameter for evaluating the measurement quality of an item.  This is similar to using the 2PL or 3PL for dichotomous items rather than using the Rasch model and assuming items are of equal discrimination.  There is also a Rasch partial credit model that is equivalent and can be used alongside Rasch dichotomous items, but this post is just focusing on GPCM.

Definition of the Generalized Partial Credit Model

The equation below (Embretson & Reise, 2000) defines the generalized partial credit.

Generalized partial credit model equation 

    In this equation: 

      m – number of possible points

      x – the student’s score on the item

      i – index for item

      θ – student ability

      a – discrimination parameter for item i

      gij – the boundary parameter for step j on item i; there are always m-1 boundaries

      r – an index used to manage the summation.

    What do these mean?  The a parameter is the same concept as the a parameter in dichotomous IRT, where 0.5 might be low and 1.2 might be high.  The boundary parameters define the steps or thresholds that explain how the GPCM works, which will become clearer when you see the graph below.

    As an example, let us consider a 4-point item with the following parameters.

    GPCM parameters

    If you use those numbers to graph the functions for each point level as a function of theta, you would see a graph like the one below.  Here, consider Option 1 to be the probability of getting 0 points; this is a very high probability for the lowest examinees but drops as ability increases.

    Generalized-partial-credit-model

    The Option 5 line is for receiving all possible points; high probability for the best examinees, but probability decreases as ability does.  Between, we have probability curves for 1, 2, and 3 points.  If an examinee has a theta of -0.5, they have a high probability of getting 2 points on the item (yellow curve).

    The boundary parameters mentioned earlier have a very real interpretation with this graph; they are literally the boundaries between the curves.  That is the theta level, at which 1 point (purple) becomes more likely that 0 points (red) are at -2.4 where the two lines cross.  Note that this is the first boundary parameter b1 in the image earlier.

    How to use the GPCM

    As mentioned before, the GPCM is appropriate to use as your IRT model for multi-point items in an educational context, as opposed to Likert-style psychological items.  They’re almost always used in conjunction with the 2PL or 3PL dichotomous models; consider a test of 25 multiple-choice items, 3 multiple response items, and an essay with 2 rubrics.

    To implement, you need an IRT software program that can estimate dichotomous and polytomous items jointly, such as Xcalibre.  Consider the screenshot below to specify these. 

    Xcalibre-IRT-model-selection

    If you implement IRT with Xcalibre, it produces a page like this for each GPCM item.

    Xcalibre item response theory

    To score students with the GPCM, you either need to use the IRT program like Xcalibre to score students or a test delivery system that has been specifically designed to support the GPCM in the item banker and implement GPCM in scoring routines.  The former only works when you are doing the IRT analysis after all examinees have completed a test; if you have a continuous deployment of assessments, you will need to use the latter approach.

    Where can I learn more?

    IRT textbooks will provide a treatment of polytomous models like the generalized partial credit model. Examples are de Ayala (2010) and Embretson & Reise (2000). Also, I recommend the 2010 book by Nering and Ostini, which was previously available as a monograph.

      Scaled scoring is a process used in assessment and psychometrics to transform exam scores to another scale (set of numbers), typically to make the scores more easily interpretable but also to hide sensitive information like raw scores and differences in form difficulty (equating).  For example, the ACT test produces scores on a 0 to 36 scale; obviously, there are more than 36 questions on the test, so this is not your number correct score, but rather a repackaging.  So how does this repackaging happen, and why are we doing it in the first place?

      An Example of Scales: Temperature

      First, let’s talk about the definition of a scale.  A scale is a range of numbers for which you can assign values and interpretations.  Scores on a student essay might be 0 to 5 points, for example, where 0 is horrible and 5 is wonderful.  Raw scores on a test, like number-correct, are also a scale, but there are reasons to hide this, which we will discuss.

      An example of scaling that we are all familiar with is temperature.  There are three scales that you have probably heard of: Fahrenheit, Celsius, and Kelvin.  Of course, the concept of temperature does not change, we are just changing the set of numbers used to report it.  Water freezes at 32 Fahrenheit and boils at 212, while these numbers are 0 and 100 with Celsius.  Same with assessment: the concept of what we are measuring does not change on a given exam (e.g., knowledge of 5th grade math curriculum in the USA, mastery of Microsoft Excel, clinical skills as a neurologist), but we can change the numbers.

      What is Scaled Scoring?

      In assessment and psychometrics, we can change the number range (scale) used to report scores, just like we can change the number range for temperature.  If a test is 100 items but we don’t want to report the actual score to students, we can shift the scale to something like 40 to 90.  Or 0 to 5.  Or 824,524 to 965,844.  It doesn’t matter from a mathematical perspective.  But since one goal is to make it more easily interpretable for students, the first two are much better than the third.

      So, if an organization is reporting Scaled Scores, it means that they have picked some arbitrary new scale and are converting all scores to that scale.  Here’s some examples…

      Real Examples

      Many assessments are normed on a standard normal bell curve.  Those which use item response theory do so implicitly, because scores are calculated directly on the z-score scale (there are some semantic differences, but it’s the basic idea).  Well, any scores on the z-score bell curve can be converted to other scales quite easily, and back again.  Here some of the common scales used in the world of assessment.

      z score T score IQ Percentile ACT SAT
      -3 20 55 0.02 0 200
      -2 30 70 2.3 6 300
      -1 40 85 15.9 12 400
      0 50 100 50 18 500
      1 60 115 84.1 24 600
      2 70 130 97.7 30 700
      3 80 145 99.8 36 800

      Note how the translation from normal curve based approaches to Percentile is very non-linear!  The curve-based approaches stretch out the ends.

      Why do Scaled Scoring?

      There are a few good reasons:

      1. Differences in form difficulty (equating) – Many exams use multiple forms, especially across years.  What if this year’s form has a few more easy questions and we need to drop the passing score by 1 point on the raw score metric?  Well, if you are using scaled scores like 200 to 400 with a cutscore of 350, then you just adjust the scaling each year so the reported cutscore is always 350.
      2. Hiding the raw score – In many cases, even if there is only one form of 100 items, you don’t want students to know their actual score.
      3. Hiding the z scale (IRT) – IRT scores people on the z-score scale.  Nobody wants to be told they have a score of -2.  That makes it feel like you have negative intelligence or something like that.  But if you convert it to a big scale like the SAT above, that person gets a score of 300, which is a big number so they don’t feel as bad.  This doesn’t change the fact that they are only at the 2nd percentile though.  It’s just public relations and marketing, really.

      Who uses Scaled Scoring?

      Just about all the “real” exams in the world use this.  Of course, most use IRT, which makes it even more important to use scaled scoring.

      Methods of Scaled Scoring

      There are 4 types of scaled scoring.  The rest of this post will get into some psychometric details on these, for advanced readers.

      1. Normal/standardized
      2. Linear
      3. Linear dogleg
      4. Equipercentile

      Normal/standardized

      This is an approach to scaled scoring that many of us are familiar with due to some famous applications, including the T score, IQ, and large-scale assessments like the SAT. It starts by finding the mean and standard deviation of raw scores on a test, then converts whatever that is to another mean and standard deviation. If this seems fairly arbitrary and doesn’t change the meaning… you are totally right!

      Let’s start by assuming we have a test of 50 items, and our data has a raw score average of 35 points with an SD of 5. The T score transformation – which has been around so long that a quick Googling can’t find me the actual citation – says to convert this to a mean of 50 with an SD of 10. So, 35 raw points become a scaled score of 50. A raw score of 45 (2 SDs above mean) becomes a T of 70. We could also place this on the IQ scale (mean=100, SD=15) or the classic SAT scale (mean=500, SD=100).

      A side not about the boundaries of these scales: one of the first things you learn in any stats class is that plus/minus 3 SDs contains 99% of the population, so many scaled scores adopt these and convenient boundaries. This is why the classic SAT scale went from 200 to 800, with the urban legend that “you get 200 points for putting your name on the paper.” Similarly, the ACT goes from 0 to 36 because it nominally had a mean=18 and SD=6.

      The normal/standardized approach can be used with classical number-correct scoring, but makes more sense if you are using item response theory, because all scores default to a standardized metric.

      Linear

      The linear approach is quite simple. It employs the  y=mx+b  that we all learned as schoolkids. With the previous example of a 50 item test, we might say intercept=200 and slope=4. This then means that scores range from 200 to 400 on the test.

      Yes, I know… the Normal conversion above is technically linear also, but deserves its own definition.

      Linear dogleg

      The Linear Dogleg approach is a special case of the previous one, where you need to stretch the scale to reach two endpoints. Let’s suppose we published a new form of the test, and a classical equating method like Tucker or Levine says that it is 2 points easier and the slope of Form A to Form B is 3.8 rather than 4. This throws off our clean conversion of 200 to 400 scale. So suppose we use the equation SCALED = 200 + 3.8*RAW but only up until the score of 30. From 31 onwards, we use SCALED = 185 + 4.3*RAW. Note that the raw score of 50 then still comes out to be scaled of 400, so we still go from 200 to 800 but there is now a slight bend in the line. This is called the “dogleg” similar to the golf hole of the same name.

      Dogleg example

      Equipercentile

      Lastly, there is Equipercentile, which is mostly used for equating forms but can similarly be used for scaling.  In this conversion, we match the percentile for each, even if it is a very nonlinear transformation.  For example, suppose our Form A had a 90th percentile of 46, which became a scale of 384.  We find that Form B has a 90th percentile at 44 points, so we call that a scaled score of 384, and calculate a similar conversion for all other points.

      Why are we doing this again?

      Well, you can kind of see it in the example of having two forms with a difference in difficulty. In the Equipercentile example, suppose there is a cut score to be in the top 10% to win a scholarship. If you get 45 on Form A you will lose, but if you get 45 on Form B you will win. Test sponsors don’t want to have this conversation with angry examinees, so they convert all scores to an arbitrary scale. The 90th percentile is always a 384, no matter how hard the test is. (Yes, that simple example assumes the populations are the same… there’s an entire portion of psychometric research dedicated to performing stronger equating.)

      How do we implement scaled scoring?

      Some transformations are easily done in a spreadsheet, but any good online assessment platform should handle this topic for you.  Here’s an example screenshot from our software.

      Scaled scores in FastTest

      A standard setting study is a formal, quantitative process for establishing a performance standard on an exam, such as what score is “proficient” or “passing.”  This is typically manifested as a cutscore which is then used for making decisions about people: hire them, pass them, accept them into university, etc.  Because it is used for such important decisions, a lot of work goes into standard setting, using methods based on scientific research.

      What is NOT standard setting?

      In the assessment world, there are actually three uses of the word standard:

      1. A formal definition of the content that is being tested, such as the Common Core State Standards in the USA.  
      2. A formalized process for delivering exams, as seen in the phrase “standardized testing.”
      3. A benchmark for performance, like we are discussing here.

      For this reason, I prefer the term cutscore study, but the phrase standard setting is used more often.  

      How is a standard setting study used?

      As part of a comprehensive test development cycle, after item authoring, item review, and test form assembly, a cutscore or passing score will often be set to determine what level of performance qualified as “pass” or a similar classification.  This cannot be done arbitrarily, such as setting it at 70% because that’s what you saw when you were in school.  That is a legal landmine!  To be legally defensible and eligible for Accreditation of a Certification Program, it must be done using one of several standard-setting approaches from the psychometric literature.  So, if your organization is classifying examinees into Pass/Fail, Hire/NotHire, Basic/Proficient/Advanced, or any other groups, you most likely need a standard setting study.  This is NOT limited to certification, although it is often discussed in that pass/fail context.

      What are some methods of a standard setting study?

      There have been many methods suggested in the scientific literature of psychometrics.  They are often delineated into examinee-centered and item-centered approaches.  Angoff and Bookmark are designed around evaluating items, while Contrasting Groups and Borderline Groups are designed around evaluating the distributions of actual examinee scores.  The Bookmark approach is sort of both types, however, because it uses examinee performance on the items as the object of interest.

      Angoff

      Modified Angoff analysis

      In an Angoff study, a panel of subject matter experts rates each item, estimating the percentage of minimally competent candidates that would answer each item correctly.  If we take the average of all raters, this then translates into the average percentage-correct score that the raters expect from a minimally competent candidate – a very compelling argument for a cutscore to pass competent examinees!  It is often done in tandem with the Beuk Compromise.  The Angoff method does not require actual examinee data, though the Beuk does.

      Bookmark

      The bookmark method orders the items in a test form in ascending difficulty, and a panel of experts reads through and places a “bookmark” in the book where they think a cutscore should be.  Obviously, this requires enough real data to calibrate item difficulty, usually using item response theory, which requires several hundred examinees.

      Contrasting Groups

      contrasting groups cutscore

      With the contrasting groups approach, candidates are sorted into Pass and Fail groups based on their performance on a different exam or some other unrelated standard.  We can then compare the score distributions on our exam for the two separate groups, and pick a cutscore that best differentiates Pass vs Fail on the other standard.  An example of this is below.  If using data from another exam, a sample of at least 50 candidates is obviously needed, since you are evaluating distributions.

      Borderline Group

      The Borderline Group method is similar to Contrasting Groups, but a borderline group is defined using alternative information such as biodata, and the scores of the group are evaluated.

      Hofstee

      The Hofstee approach is often used as a reality check for the modified-Angoff method, but can be done on its own.  It involves only a few estimates from a panel of SMEs.

      Ebel

      The Ebel approach categorizes items by importance as well as difficulty.  It is very old and not used anymore.

      How to choose an approach?

      There is often no specifically correct answer.  In fact, guidelines like NCCA do not lay out which method to use, they just tell you to use an appropriate method.

      There are several considerations.  Perhaps the most important is whether you have existing data.  The Bookmark, Contrasting Groups, and Borderline Group approaches all assume that we have data from a test already delivered, which we can analyze with the perspective of the latent standard.  The Angoff and Hofstee approaches, in contrast, can be done before a test is ever delivered.  This is arguably less defensible, but is a huge practical advantage.

      The choice also depends on whether you can easily recruit a panel of subject matter experts, as that is required for Angoff and Bookmark.  The Contrasting Groups method assumes we have a gold standard, which is rare.

      How can I implement a standard setting study?

      If your organization has an in-house psychometrician, they can usually do this.  If, for example, you are a board of experts in a profession but lack experience in psychometrics, you need to hire a firm.  We can perform such work for you – contact us to learn more.

       

      If you are dealing with data science, which psychometrics most definitely is, you’ve probably come across  R. It is an environment that allows you to implement packages for many different types of analysis, which are built by a massive community of data scientists around the world.

      R has become one of the two main languages for data science and machine learning (the other being Python) and its popularity has grown drastically.   R for psychometrics is also becoming common.

      I was extremely anti-R for several years, but have recently started using it for several important reasons.

      However, for some even more important reasons, I don’t use it for all of my work.  I’d recommend you do the same.

      Let’s talk a bit about why.

      What is R?

      R is a programming language-like environment for statistical analysis.  Its Wikipedia page defines it as a “programming language and free software environment for statistical computing and graphics”. But I use the term “programming-language-like environment”.

      This is because it is more like command scripting from DOS than an actual compiled language like Java or Pascal.  R has an extremely steep learning curve compared to software that provides a decent UI; it claims that RStudio is a UI, but it really is just a more advanced window to see the same command code!

      R can be maddening for several reasons.  For example, it will not recognize a missing value in data when running a simple correlation and is unable to give you a decent error message explaining this.  This was my first date with R, and turned me off for years.  A similar thing occurred to me the first time I used PARSCALE in 2009 and couldn’t get it to work for days.  Eventually,, I discovered it was because the original codebase was DOS, which limits you to 8-character-file names, which is nowhere in the documentation!  They literally expected all users to be knowledgeable on 1980s DOS rules in 2009.

      BUT… R is free, and everybody likes free.  Even though free never means there is no cost.

      What are packages?

      R comes with some analysis out of the box, but the vast majority is available in packages.  For example, if you want to do factor analysis or item response theory, you install one of several packages that do those.  These packages are written by contributors and uploaded to an R server somewhere.

      There is no code review or anything else to check the packages, so it is entirely a caveat emptor situation.  This isn’t malicious, they’re just taking the scientific approach that assumes other researchers will replicate, disprove, or alternative work.

      For important, commonly used packages (I am a huge fan of caret), this is most definitely the case.  For rarely used packages and pet projects, it is the opposite.

      distance learning

      Why do I use R for psychometrics or elsewhere?

      As mentioned earlier, I use R when there are well-known packages that are accepted in the community.  The caret package is a great example.  Just Google “r caret” and you can get a glimpse of the many resources, blog posts, papers, and other uses of the package.  It isn’t an analysis package usually, it just makes it easier to call existing, proven packages.  Another favorite is the text2vec package, and of course, there is the ubiquitous tidyverse.

      I love to use R in cases of more general data science problems because this means a community several orders of magnitude above psychometricians, which definitely contributes to the higher quality.  The caret package is for regression and classification, which are used in just about every field.

      The text2vec package is for natural language processing, used in fields as diverse as marketing, political science, and education.  One of my favorite projects I’ve heard come across was the analysis of the Jane Austen corpus.  Fascinating.

      When would I use R packages that I might consider less stellar?

      I don’t mind using R when it is a low-stakes situation such as exploratory data analysis for a client.  I would also consider it an acceptable alternative to commercial software when the analysis is something I do very rarely.  No way am I going to pay $10,000 or whatever for something I do 2 hours per year.

      Finally, I would consider it for niche analyses where no other option exists except to write my own code, and it does not make financial sense to do so.  However, in these cases I still try to perform due diligence.

      Why I don’t Use R

      Often it comes down to a single word: quality.

      For niche packages, the code might be 100% from a grad student who was a complete newbie on a topic for their thesis, with no background in software development or ancillary topics like writing a user manual.   Moreover, no one has ever validated a single line of the code.  Because of this, I am very wary when using R packages.

      If you use one like this, I highly recommend you do some QA or background research on it first!   I would love it if R had a community rating system like exists for WordPress plugins.  With those, you can see that one plugin might be used on 1,000,000 sites with a 4.5/5.0 rating, while another is used on 43 sites with a 2.7/5.0 rating.

      This quality thing is of course a continuum.  There is a massive gap between the grad student project and something like caret.  In between, you might have an R package that is a hobby of a professor who devotes some time to it and has extremely deep knowledge of the subject matter—but it remains a part-time endeavor by someone with no experience in commercial software.  For example of this situation, please see this comparison of IRT results with R vs professional tools.

      The issue on User Manuals is of particular concern to me as someone that provides commercial software and knows what it is like to support users.  I have seen user manuals in the R world that literally do not tell the users how to use the package.  They might provide a long-winded description of some psychometrics, obviously copied from a dissertation, as a “manual”, when at best it only belongs as an appendix.  No info on formatting of input files, no provision of example input, no examples of usage, and no description of interpreting the output.

      Even in the cases of an extremely popular package that has high-quality code, the documentation is virtually unreadable.  Check out the official landing page for tidyverse.  How welcoming is that?  I’ve found that the official documentation is almost guaranteed to be worthless – instead, head over to popular blogs or YouTube channels on your favorite topic.

      The output is also famously below average regarding quality.

      testing software

      R stores its output as objects, a sort of mini-database behind the scenes.  If you want to make graphs or dump results to something like a CSV file, you have to write more code just for such basics.  And if you want a nice report in Word or PDF, get ready to write a ton of code, or spend a week doing copy-and-paste.  I noticed that there was a workshop a few weeks ago at NCME (April 2019) that was specifically on how to get useful output reports from R, since this is a known issue.

      Is R turning the corner?

      I’ve got another post coming about R and how it has really turned the corner because of 3 things: Shiny, RStudio, and availability of quality packages.  More on that in the future, but for now:

      • Shiny allows you to make applications out of R code so that the power of R can be available to end-users without them having to write & run code themselves.  Until Shiny, R was limited to people who wanted to write & run code.
      • RStudio makes it easier to develop R code, by overlaying an integrated development environment (IDE) on top of R.  If you have ever used and IDE, you know how important this is.  You’ve got to be incredibly clueless to not use an IDE for development.  Yet the first release of RStudio did not happen until 2011.  This shows how rooted R was in academia.
      • As you might surmise from my rant above, it is the quality packages (and third-party documentation!) that are really opening the floodgates.

      Another hope for  R is jumping on the bandwagon of the API economy.  It might become the lingua franca of the data analytics world from an integration perspective.  This might be the true future of R for psychometrics.

      But there are still plenty of issues.  One of my pet peeves is the lack of quality error trapping.  For example, if you do simple errors, the system will crash with completely worthless error messages.  I found this to happen if I run an analysis, open my output file, and run it again when forgetting to close the output file.  As previously mentioned, there is also the issue with a single missing data point in a correlation.

      Nevertheless, R is still not really consumer facing.  That is, actual users will always be limited to people that have strong coding skills AND deep content knowledge on a certain area of data science or psychometrics.  Just like there will always be a home for more user-friendly statistical software like SPSS, there will always be a home for true psychometric software like Xcalibre.

      Psychometric forensics is a surprisingly deep and complex field.  Many of the indices are incredibly sophisticated, but a good high-level and simple analysis to start with is overall time vs. scores, which I call Time-Score Analysis.  This approach uses simple flagging on two easily interpretable metrics (total test time in minutes and number correct raw score) to identify possible pre-knowledge, clickers, and harvester/sleepers.  Consider the four quadrants that a bivariate scatterplot of these variables would produce.

       

      Quadrant Interpretation Possible threat? Suggested flagging
      Upper right High scores and taking their diligent time Good examinees NA
      Upper left High scores with low time Pre-knowledge Top 50% score and bottom 5% time
      Lower left Low scores with low time “Clickers” or other low motivation Bottom 5% time and score
      Lower right Low scores with high time Harvesters, sleepers, or just very low ability Top 5% time and bottom 5% scores

      An example of Time-Score Analysis

      Consider the example data below.  What can this tell us about the performance of the test in general, and about specific examinees?

      This test had 100 items, scored classically (number-correct), and a time limit of 60 minutes.  Most examinees took 45-55 minutes, so the time limit was appropriate.  A few examinees spent 58-59 minutes; there will usually be some diligent students like that.  There was a fairly strong relationship of time with the score, in that examinees who took longer, scored highly.

      Now, what about the individuals?  I’ve highlighted 5 examples.

      1. This examinee had the shortest time, and one of the lowest scores.  They apparently did not care very much.  They are an example of a low motivation examinee that moved through quickly.  One of my clients calls these “clickers.”
      2. This examinee also took a short time but had a suspiciously high score.  They definitely are an outlier on the scatterplot, and should perhaps be investigated.
      3. This examinee is simply super-diligent.  They went right up to the 60-minute limit and achieved one of the highest scores.
      4. This examinee also went right up to the 60-minute limit but had one of the lowest scores.  They are likely low ability or low motivation.  That same client of mine calls these “sleepers” – a candidate that is forced to take the exam but doesn’t care, so just sits there and dozes. Alternatively, it might be a harvester; some who have been assigned to memorize test content, so they spend all the time they can, but only look at half the items so they can focus on memorization.
      5. This examinee had by far the lowest score, and one of the lowest times.  Perhaps they didn’t even answer every question.  Again, there is a motivation/effort issue here, most likely.
      6. Time-Score example (annotated)

       

      How useful is time-score analysis?

      Like other aspects of psychometric forensics, this is primarily useful for flagging purposes.  We do not know yet if #4 is a Harvester or just low motivation.  Instead of accusing them, we open an investigation.  How many items did they attempt?  Are they repeat test-takers?  What location did they take the test?  Do we have proctor notes, site video, remote proctoring video, or other evidence that we can review? 

      There is a lot that can go into such an investigation.  Moreover, simple analyses such as this are merely the tip of the iceberg when it comes to psychometric forensics.  In fact, so much that I’ve heard some organizations simply stick their head in the sand and don’t even bother checking out someone like #4.  It just isn’t in the budget.

      Some of this analysis is best done with specialized software for psychometric forensics, like SIFT.

      However, test security is an essential aspect of validity.  If someone has stolen your test items, the test is compromised, and you are guaranteed that scores do not mean the same thing they meant when the test was published.  It’s now apples and oranges, even though the items on the test are the same.  Perhaps you might not challenge individual examinees but perhaps institute a plan to publish new test forms every 6 months. Regardless, your organization needs to have some difficult internal discussions and establish a test security plan.

       

      Subject matter experts are an important part of the process in developing a defensible exam.  There are several ways that their input is required.  Here is a list from highest involvement/responsibility to lowest:

      1. Serving on the Certification Committee (if relevant) to decide important things like eligibility pathways
      2. Serving on panels for psychometric steps like Job Task Analysis or Standard Setting (Angoff)
      3. Writing and reviewing the test questions
      4. Answering the survey for the Job Task Analysis

      Who are Subject Matter Experts?

      A subject matter expert (SME) is someone with knowledge of the exam content.  If you are developing a certification exam for widgetmakers, you need a panel of expert widgetmakers, and sometimes other stakeholders like widget factory managers.

      You also need test development staff and psychometricians.  Their job is to guide the process to meet international standards, and make the SME time the most efficient.

      Example: Item Writing Workshop

      psychometric training and workshopsThe most obvious usage of subject matter experts in exam development is item writing and review. Again, if you are making a certification exam for experienced widgetmakers, then only experienced widgetmakers know enough to write good items.  In some cases, supervisors do as well, but then they are also SMEs.  For example, I once worked on exams for ophthalmic technicians; some of the SMEs were ophthalmic technicians, but some of the SMEs (and much of the nonprofit board) were ophthalmologists, the medical doctors for whom the technicians worked.

      An item writing workshop typically starts with training on item writing, including what makes a good item, terminology, and format.  Item writers will then author questions, sometimes alone and sometimes as a group or in pairs.  For higher stakes exams, all items will then be reviewed/edited by other SMEs.

      Example: Job Task Analysis

      Job Task Analysis studies are a key step in the development of a defensible certification program.  It is the second step in the process, after the initial definition, and sets the stage for everything that comes afterward.  Moreover, if you seek to get your certification accredited by organizations such as NCCA or ANSI, you need to re-perform the job task analysis study periodically. JTAs are sometimes called job analysis, practice analysis, or role delineation studies.

      The job task analysis study relies heavily on the experience of Subject Matter Experts (SMEs), just like Cutscore studies. The SMEs have the best tabs on where the profession is evolving and what is most important, which is essential both for the initial JTA and the periodic re-set of the exam. The frequency depends on how quickly your field is evolving, but a cycle of 5 years is often recommended.

      The goal of the job task analysis study is to gain quantitative data on the structure of the profession.  Therefore, it typically utilizes a survey approach to gain data from as many professionals as possible.  This starts with a group of SMEs generating an initial list of on-the-job tasks, categorizing them, and then publishing a survey.  The end goal is a formal report with a blueprint of what knowledge, skills, and abilities (KSAs) are required for certification in a given role or field, and therefore what are the specifications of the certification test.

      • Observe— Typically the psychometrician (that’s us) shadows a representative sample of people who perform the job in question (chosen through Panel Composition) to observe and take notes. After the day(s) of observation, the SMEs sit down with the observer so that he or she may ask any clarifying questions.

        The goal is to avoid doing this during the observation so that the observer has an untainted view of the job.  Alternatively, your SMEs can observe job incumbents – which is often the case when the SMEs are supervisors.

      • Generate— The SMEs now have a corpus of information on what is involved with the job, and generate a list of tasks that describe the most important job-related components. Not all job analysis uses tasks, but this is the most common approach in certification testing, hence you will often hear the term job task analysis as a general term.
      • Survey— Now that we have a list of tasks, we send a survey out to a larger group of SMEs and ask them to rate various features of each task.

        How important is the task? How often is it performed? What larger category of tasks does it fall into?

      • Analyze— Next, we crunch the data and quantitatively evaluate the SMEs’ subjective ratings to determine which of the tasks and categories are most important.

      • Review— As a non-SME, the psychometrician needs to take their findings back to the SME panel to review the recommendation and make sure it makes sense.

      • Report— We put together a comprehensive report that outlines what the most important tasks/categories are for the given job.  This in turn serves as the foundation for a test blueprint, because more important content deserves more weight on the test.

        This connection is one of the fundamental links in the validity argument for an assessment.

      Example: Cutscore studies

      When the JTA is completed, we have to determine who should pass the assessment, and who should fail. This is most often done using the modified Angoff process, where the SMEs conceptualize a minimally competent candidate (MCC) and then set pass/fail point so that the MCC would just barely pass.  There are other methods too, such as Bookmark or Contrasting Groups.

      For newly-launching certification programs, these processes go hand-in-hand with item writing and review. The use of evidence-based practices in conducting the job task analysis, test design, writing items, and setting a cutscore serve as the basis for a good certification program.  Moreover, if you are seeking to achieve accreditation – a third part stamp of approval that your credential is high quality – documentation that you completed all these steps is required.

      Performing these tasks with a trained psychometrician inherently checks a lot of boxes on the accreditation to-do list, which can position your organization well for the future. When it comes to accreditation— the psychometricians and measurement specialists at Assessment Systems have been around the block a time or two. We can walk you through the lengthy process of becoming accredited, or we can help you perform these tasks a la carte.

      One of the most cliche phrases associated with assessment is “teaching to the test.”  I’ve always hated this phrase, because it is only used in a derogatory matter, almost always by people who do not understand the basics of assessment and psychometrics.  I recently saw it mentioned in this article on PISA, and that was one time too many, especially since it was used in an oblique, vague, and unreferenced manner.

      So, I’m going to come out and say something very unpopular: in most cases, TEACHING TO THE TEST IS A GOOD THING.

      Why teaching to the test is usually a good thing

      If the test reflects the curriculum – which any good test will – then someone who is teaching to the test will be teaching to the curriculum.  Which, of course, is the entire goal of teaching. The phrase “teaching to the test” is used in an insulting sense, especially because the alliteration is resounding and sellable, but it’s really not a bad thing in most cases.  If a curriculum says that 4th graders should learn how to add and divide fractions, and the test evaluates this, what is the problem? Especially if it uses modern methodology like adaptive testing or tech-enhanced items to make the process more engaging and instructional, rather than oversimplifying to a text-only multiple choice question on paper bubble sheets?

      The the world of credentialing assessment, this is an extremely important link.  Credential tests start with a job analysis study, which surveys professionals to determine what they consider to be the most important and frequently used skills in the job.  This data is then transformed into test blueprints. Instructors for the profession, as well as aspiring students that are studying to pass the test, then focus on what is in the blueprints.  This, of course, still contains the skills that are most important and frequently used in the job!

      So what is the problem then?

      Now, telling teachers how to teach is more concerning, and more likely to be a bad thing.  Finland does well because it gives teachers lots of training and then power to choose how they teach, as noted in the PISA article.

      As a counterexample, my high school math department made an edict starting my sophomore year thaborderline method educational assessmentt all teachers had to use the “Chicago Method.” It was pure bunk and based on the fact that students should be doing as much busy work as possible instead of the teachers actually teaching. I think it is because some salesman convinced the department head to make the switch so that they would buy a thousand brand new textbooks.  The method makes some decent points (here’s an article from, coincidentally, when I was a sophomore in high school) but I think we ended up with a bastardization of it, as the edict was primarily:

      1. Assign students to read the next chapter in class (instead of teaching them!); go sit at your desk.
      2. Assign students to do at least 30 homework questions overnight, and come back tomorrow with any questions they have.
      3. Answer any questions, then assign them the next chapter to read.  Whatever you do, DO NOT teach them about the topic before they start doing the homework questions.  Go sit at your desk.

      Isn’t that preposterous?  Unsurprisingly, after two years of this, I went from being a leader of the Math Team to someone who explicitly said “I am never taking Math again”.  And indeed, I managed to avoid all math during my senior year of high school and first year of college. Thankfully, I had incredible professors in my years at Luther College, leading to me loving math again, earning a math major, and applying to grad school in psychometrics.  This shows the effect that might happen with “telling teachers how to teach.” Or in this case, specifically – and bizarrely – to NOT teach.

      What about all the bad tests out there?

      Now, let’s get back to the assumption that a test does reflect a curriculum/blueprints.  There are, most certainly, plenty of cases where an assessment is not designed or built well.  That’s an entirely different problem, and is an entirely valid concern. I have seen a number of these in my career.  This danger why we have international standards on assessments, like AERA/APA/NCME and NCCA.  These provide guidelines on how a test should be build, sort of like how you need to build a house according to building code and not just throwing up some walls and a roof.

      ansi accreditation certification exam candidates

      For example, there is nothing that is stopping me from identifying a career that has a lot of people looking to gain an edge over one another to get a better job… then buying a textbook, writing 50 questions in my basement, and throwing it up on a nice-looking website to sell as a professional certification.  I might sell it for $395, and if I get just 100 people to sign up, I’ve made $39,500!!!! This violates just about every NCCA guideline, though. If I wanted to get a stamp of approval that my certification was legit – as well as making it legally defensible – I would need to follow the NCCA guidelines.

      My point here is that there are definitely bad tests out there, just like there are millions of other bad products in the world.  It’s a matter of caveat emptor. But just because you had some cheap furniture on college that broke right away, doesn’t mean you swear off on all furniture.  You stay away from bad furniture.

      There’s also the problem of tests being misused, but again that’s not a problem with the test itself.  Certainly, someone making decisions is uninformed. It could actually be the best test in the world, with 100% precision, but if it is used for an invalid application then it’s still not a good situation.  For example, if you took a very well-made exam for high school graduation and started using it for employment decisions with adults. Psychometricians call this validity – that we have evidence to support the intended use of the test and interpretations of scores.  It is the #1 concern of assessment professionals, so if a test is being misused, it’s probably by someone without a background in assessment.

      So where do we go from here?

      Put it this way, if an overweight person is trying to become fitter, is success more likely to come from changing diet and exercise habits, or from complaining about their bathroom scale?  Complaining unspecifically about a high school graduation assessment is not going to improve education; let’s change how we educate our children to prepare them for that assessment, and ensure that the assessment reflects the goals of the education.  Nevertheless, of course, we need to invest in making the assessment as sound and fair as we can – which is exactly why I am in this career.

      Item response theory is the predominant psychometric paradigm for mid or large scale assessment.  As noted in my introductory blog post, it is actually a family of models.  In this post, we discuss the two parameter IRT model (IRT 2PL).

      Consider the following 3PL equation (simplified from Hambleton & Swaminathan, 1985, Eq. 3.3).  The IRT 2PL simply removes the c and (1-c) elements, so that probability is only a function of a and b.

      3PL irt equation

      This equation is predicting the probability of a certain response based on the examinee trait/ability level, the item discrimination parameter a, and the item difficulty/location parameter b.  If the examinee’s trait level is higher than the item location, the person has more than a 50% chance of responding in the keyed direction.

      This phrase “in the keyed direction” is one you might often hear with the IRT 2PL.  This is because it is not often used with education/knowledge/ability assessments where items usually have a correct answer and guessing is often possible.  The IRT 2PL is used more often in attitudinal or other psychological assessments where guessing is irrelevant and there is no correct answer.  For example, consider an Extroversion scale, where examinees are responding Yes/No to statements like “I love to go to parties” or “I prefer to read books in my free time.”  There is not much to guess here, and the sense of “correct” is not relevant.

      However, it is quite clear that the first statement is keyed in the direction of extroversion while the second statement is the reverse.  In fact, you would get the 1 point of response for saying No to that statement rather than Yes.  This is often called reverse-scored.

      There are other aspects that go into whether you should use the 2PL model, but this is one of the most important.  In addition, you should also examine model fit indices and take sample size into account.

      How do I implement the two parameter IRT model?

      Like other IRT models, the 2PL requires specialized software.  Not all statistical packages will do it.  And while you can easily calculate classical statistics in Excel, there is no way to do IRT (well, unless you want to write your own VBA programs to do so).  As mentioned in this article on the three parameter model, there are a lot of IRT software programs available, but not all meet the required standards.

      You should evaluate cost and functionality.  If you are a fan of R, there are packages to estimate IRT there.  However, I recommend our Xcalibre program for both newbies and professionals.  For newbies, it is much easier to use, which means you spend more time learning the concepts of IRT and not fighting command code that might be 30 years old.  For professionals, Xcalibre saves you from having to create reports by copy and paste which is incredibly expensive.

      Item response theory (IRT) is an extremely powerful psychometric paradigm that addresses many of the inadequacies of classical test theory (CTT).  If you are new to the topic, there is a broad intro here, where you will learn that IRT is actually a family of mathematical models rather than one specific one.  Today, I’m talking about the 3PL.

      One of the most commonly used models is called the three parameter IRT model (3PM), or the three parameter logistic model (3PL or 3PLM) because it is almost always expressed in a logistic form.  The equation for this is below (Hambleton & Swaminathan, 1985, Eq. 3.3).

      3PL irt equation

       

      Like all IRT models, it is seeking to predict the probability of a certain response based on examinee ability/trait level and some parameters which describe the performance of the item.  With the 3PL, those parameters are a (discrimination), b (difficulty or location), and c (pseudo-guessing).  For more on these, check out the descriptions in my general IRT article.

      The remaining point then is what we mean by the probability of a certain response.  The 3PL is a dichotomous model which means that it is predicting a binary outcome such as correct/incorrect or agree/disagree.

      When should I use the three parameter IRT model?

      The applicability of the 3PL to a certain assessment depends on the relevance of the components just discussed.  First, the response to the items must be binary.  This eliminates Likert-type items (“Rate on a scale of 1 to 5”), partial credit items (scoring an essay as 0 to 5 points), and performance assessments where scoring might include a range of points, deductions, or timing (number of words typed per minute).

      Next, you should evaluate the applicability of the use of all three parameters.  Most notably, are the items in your assessment susceptible to guessing?  Because the thing that differentiates the 3PL from its sisters the 1PL and 2PL is that it attempts to model for guessing.  This, of course, is highly relevant for multiple-choice items on knowledge or ability assessments, so the 3PL is often a great fit for those.

      Even in this case, though, there are a number of practitioners and researchers that still prefer to use the 1PL or 2PL models.  There are some deeper methodological issues driving this choice.  The 2PL is sometimes chosen because it works well with an estimation method called Joint Maximum Likelihood.

      The 1PL, also known as the Rasch model (yes, I know the Rasch people will say they are not the same, I am grouping them together for simplicity in comparison), is often selected because adherents to the model believe in certain advantages such as it providing “objective measurement.”  Also, the Rasch model works far better for smaller samples (see this technical report by Guyer & Thompson and this one by Yoes).  Regardless, you should probably evaluate model fit when selecting models.

      I am from a camp that is pragmatic in choice rather than dogmatic.  While training on the 3PL in graduate school, I have no qualms against using the 2PL or 1PL/Rasch if the test type and sample size warrant it or if fit statistics indicate they are sufficient.

      How do I implement the three parameter IRT model?

      If you want to implement the three parameter IRT model, you need specialized software.  General statistical software such as SPSS does not always produce IRT analysis, though some do.  Even in the realm of IRT-specific software, not all produce the 3PL.  And, of course, the software can vary greatly in terms of quality.  Here are three important ways it can vary:

      1. Accuracy of results: check out this research study which shows that some programs are inaccurate
      2. User-friendliness: some programs require you to write extensive code, and some have a purely graphical interface
      3. Output usability and interpretability: some programs just give simple ASCII text, others provide extensive Word or HTML reports with many beautiful tables and graphs.

      For more on this topic, head over to my post on how to implement IRT in general.

      Want to get started immediately?  Download a free copy of our IRT software Xcalibre.