Posts on psychometrics: The Science of Assessment

A psychometrist plays an important role in the world of assessment and psychology.  Their primary responsibility is to administer and interpret assessments.  For example, they might give IQ tests to children to identify those who qualify as gifted, then explain the results to parents and teachers.  Of course, many assessments do not require one-on-one, in-person delivery like this; psychometrists are distinguished by their training in how to deliver these more complex types of assessments.  This post describes the role of a psychometrist in more detail.

Psychometrist: What do they do?

A psychometrist is someone involved in the use and administration of assessments, in most cases working in the field of psychological testing. This is someone who uses tests every day, is familiar with how to administer them (especially complex ones like IQ tests), and interprets their results to provide feedback to individuals. Some hold doctoral degrees in clinical or counseling psychology and have extensive expertise in that role; for example, using an autism-spectrum screening test to diagnose patients and develop individualized plans.

Consider the following definition from the National Association of Psychometrists:

A psychometrist is responsible for the administration and scoring of psychological and neuropsychological tests under the supervision of a clinical psychologist or clinical neuropsychologist. 

Source: https://www.napnet.org/what

Where do psychometrists work?

The vast majority of psychometrists work in a clinical setting.  One might work in an autism center, another at a psychiatric hospital, another at a neurological clinic.  Some school psychologists also perform this work, operating directly in schools.  In all cases, they work directly with the examinee (patient, student, etc.).

Psychometrist Training and Certification

Psychometrists have at least a Bachelor’s degree in psychology or a related field, and often a Master’s.  There is typically a clinical training component.  Learn more at the National Association of Psychometrists.

There is a specific certification for psychometrists, offered by the Board of Certified Psychometrists.  This involves passing a certification exam of 120 questions over 2.5 hours; the test is professionally designed and administered to meet best practices for credentialing exams.

Psychometrist vs. Related Roles

One misconception that I often see on the internet is the distinction, or lack thereof, between these related job titles.  Some professionals are only involved with the engineering of assessments, often not even in the field of psychology; they do not work with patients.  Others work with patients but focus on counseling rather than assessment.  The most flagrant offender, curiously, is Google: like most companies, we utilize AdWords, and find that some of these job titles and terms are treated interchangeably when they are not related.

A psychometrist usually works under the direction of a psychologist, though sometimes a psychologist serves as their own psychometrist.  For example, a psychologist at a mental health clinic is in charge of screening patients and treating them, but might have staff to deliver psychological assessments.  But a psychologist in a school might not have staff for that, and also delivers IQ tests to students.

For clarification, here is a comparison of major job titles:

Involvement with assessment
  • Psychometrician: engineering and validation
  • Psychometrist: administration and interpretation
  • Psychologist: patient treatment

Education
  • Psychometrician: PhD in psychometrics, psychology, or education
  • Psychometrist: Bachelor’s or Master’s in psychology (often counseling)
  • Psychologist: PhD in psychology (often counseling or clinical)

Quantitative skills
  • Psychometrician: complex analyses like item response theory or factor analysis; complex designs such as adaptive testing
  • Psychometrist: interpreting scores with summary statistics (mean, standard deviation, z-scores, correlations)
  • Psychologist: quantitative research outside of assessment, such as comparing treatment methods

Soft skills
  • Psychometrician: often a pure data analyst, though some work with expert panels for topics like job analysis or Angoff studies; never with patients or students
  • Psychometrist: works extensively with patients and students, often in a counseling role, and can be highly trained on those aspects
  • Psychologist: works extensively with patients and students, often in a counseling role, and can be highly trained on those aspects

Example
  • Psychometrician: researcher involved in designing high-stakes exams such as medical certification or university admissions
  • Psychometrist: staff in a clinic that delivers IQ and other assessments to patients
  • Psychologist: supervisory staff in a clinic that treats patients

 

Need help in designing an assessment?  Contact us.

 

 

The American Chiropractic Board of Sports Physicians™ (ACBSP™: acbsp.com) has partnered with Assessment Systems Corporation (ASC: assess.com) to advance the written testing procedure to a digitalized assessment model with secure remote proctoring for the Certified Chiropractic Sports Physician® (CCSP®) and Diplomate American Chiropractic Board of Sports Physicians® (DACBSP®) certifications.  While this will provide the immediate benefit that ACBSP certification examinations will continue during the COVID-19 restrictions, future exams will also be held in this format. 

The certifications have traditionally been delivered via paper-and-pencil at specific locations; ASC will work with the ACBSP™ to digitalize the assessments and then deliver them with secure remote proctoring.  The remote proctoring provided by MonitorEDU will utilize two video streams for monitoring, one for the room and one for the candidate, as well as a lockdown browser to maintain security on the computer, supporting the validity and integrity of the assessment.

Dr. Anne Sorrentino, the President of ACBSP™, states: “ACBSP™ is excited to announce the upcoming November 7, 2020 written exams will be remotely proctored computer-based testing (RP-CBT). This means travel to an exam site is not necessary. The exam may now be taken in the privacy of one’s own home! COVID temporarily closed down the opportunity for the advancement of the CCSP® and DACBSP® testing process this past April. The ACBSP™ moved to RP-CBT to enable everyone the opportunity to continue the certification process.”

Dr. Nathan Thompson, CEO of ASC, notes: “ASC has helped many high-stakes assessment programs pivot during 2020, and we are enthusiastic about working with the ACBSP certification team to modernize the assessment process.  This will not only enable continuity during COVID but increase accessibility of the credentials by making them more widely available, with shorter turnaround times.”

About ACBSP

Since 1980, the American Chiropractic Board of Sports Physicians™ (ACBSP™: acbsp.com) has led the development of sports medicine certification and has managed a world-class credentialing process that ensures certified sports chiropractors meet competency standards to effectively work with and treat athletes and those engaged in athletic activities. In addition, the ACBSP™ offers continuing education and research seminars to facilitate the dissemination of the latest scientific knowledge, treatment trends, and best practices for patient care.

The ACBSP™ is the governing board for the Certified Chiropractic Sports Physician® (CCSP®) and Diplomate American Chiropractic Board of Sports Physicians® (DACBSP®) certifications. Various accredited chiropractic colleges offer the curricula and training leading to qualification for taking the certification exams. The ACBSP™ governs the administration of the examinations and certifications. These certifications are designed for chiropractic doctors who want to specialize in chiropractic sports medicine.

About ASC

Assessment Systems Corporation (ASC: assess.com) has been a leader in the assessment industry since 1979, providing both world-class software for test development, secure delivery, and psychometric analytics, as well as extensive consulting services for testing organizations. ASC’s co-founder, Dr. David Weiss, is considered the father of computerized adaptive testing (CAT), and ASC was the first company to offer item banking and adaptive testing software to the public, in the early 1980s. ASC focuses on organizations that require high quality and high security from their exams, across industries such as certification, licensure, higher education, K-12 education, and pre-employment testing.  ASC drives innovation internationally, partnering with Ministries of Education in countries as diverse as Iceland, United Arab Emirates, Singapore, and Botswana. Follow us on LinkedIn, or sign up for a free account in our test development platform Assess.ai.

The foundation of a good assessment program is the ability to develop and manage strong item banks. Item banks are a central repository of test questions, each stored with important metadata such as author or difficulty. They are designed to treat items as reusable objects, which makes it easier to publish new exam forms.

Of course, the storage of metadata is very useful as well and provides validity documentation evidence. Most importantly, a true item banking system will make the process of developing new items more efficient (lower cost) and effective (higher quality).

1. Item writers are screened for expertise

Make sure the item writers (authors) recruited for the program meet minimum levels of expertise.  Often this means several years of experience in the field.  You might also want to ensure their demographics are sufficiently distributed, such as specialty area or geographic region.

2. Item writers are trained on best practices

Item writers must be trained on best practices in item writing, as well as any guidelines provided by the organization. A great example is this book from TIMSS. ASC has provided their guidelines for download here. This facilitates higher quality item banks.

3. Items go through review workflow to check best practices

After items are written, they should proceed through a standardized workflow and quality assurance. This is the best practice in developing any products. The field of software development uses a concept called the Kanban Board, which ASC has implemented in its item banking platform.

Review steps can include psychometric review, bias/sensitivity review, language editing, and content review.

4. Items are all linked to blueprint/standards

All items in the item banks should be appropriately categorized. This guarantees that no items are measuring an unknown or unneeded concept. Items should be written to meet blueprints or standards.

5. Items are piloted

Items are all written with good intent. However, we all know that some items are better than others. Items need to be given to some actual examinees so we can obtain feedback, and also obtain data for psychometric analysis.

Often, they are piloted as unscored items before eventual use as “live” scored items. But this isn’t always possible.

6. Psychometric analysis of items

After items are piloted, you need to analyze them with classical test theory and/or item response theory to evaluate their performance. I like to say there are three possible choices after this evaluation: hold, revise, and retire. Items that perform well are preserved as-is.

Those of moderate quality might be modified and re-piloted. Those that are unsalvageable are slated for early retirement.

How to accomplish all this?

This process can be extremely long, involved, and expensive. Many organizations hire in-house test development managers or psychometricians; those without that option will hire organizations such as ASC to serve as consultants.

Regardless, it is important to have a software platform in place that can effectively manage this process. Such platforms have been around since the 1980s, but many organizations still struggle by managing their item banks with Word, PowerPoint, and Email!

ASC provides an item banking platform for free, which is used by hundreds of organizations. Click below to sign up for your own account.


Sign Up For Free Account

The COVID-19 pandemic is drastically changing all aspects of our world, and one of the most impacted areas is educational assessment, along with other types of assessment. Many organizations still deliver tests with methodologies from 50 years ago, such as putting 200 examinees in a large room with desks, paper exams, and pencils. COVID-19 is forcing many organizations to pivot, which provides an opportunity to modernize assessments. But how can we maintain assessment security – and therefore validity – through these changes? Here are some suggestions, all of which can be easily implemented on ASC’s industry-leading assessment platforms. Get started by signing up for a free account at https://assess.com/assess-ai/.

True item banking with content access

A good assessment starts with good items. While learning management systems (LMSs) and other not-really-assessment platforms include some item authoring functionality, they usually don’t meet the basic requirements for real item banking. There are best practices regarding item banking that are standard at large-scale assessment organizations (e.g., US State Departments of Education), but these are surprisingly rare for professional certification/licensure exams, universities, and other organizations. Here are some examples.

  • Items are reusable (you don’t have to upload for each test where they are used)
  • Item version tracking
  • User edit tracking and audits
  • Author content controls (Math teachers can only see Math items)
  • Store metadata such as item response theory parameters and classical statistics
  • Track item usage across tests
  • Item review workflow

Role based access

All users should be limited by roles, such as Item Author, Item Reviewer, Test Publisher, and Examinee Manager. So, for example, someone in charge of managing the roster of examinees/students might never see any test questions.

Data Forensics

There are many ways you can analyze your test results to look for possible security/validity threats. Our free SIFT software provides a platform to help you implement this modern methodology. You can evaluate collusion indices, which quantify how similar the responses are for any pair of examinees. You can evaluate response times, group performance, and roll-up statistics as well.


Randomization

When tests are delivered online, you should have the option to randomize both item order and answer order. There should be a similar option when printing to paper, though of course you are much more limited in this respect with paper.

Linear on the fly testing (LOFT)

LOFT will create a uniquely randomized test for each examinee. For example, you might have a pool of 300 items spread across 4 domains, and specify that each examinee will receive 100 items, 25 from each domain. This greatly increases security.
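The LOFT assembly logic just described can be sketched as follows; the function name and pool structure are hypothetical, not how ASC's platforms implement it:

```python
import random

def build_loft_form(pool, per_domain, rng=None):
    """Assemble a linear-on-the-fly (LOFT) form by randomly sampling
    a fixed number of items from each content domain.

    pool: dict mapping domain name -> list of item IDs
    per_domain: dict mapping domain name -> number of items to draw
    """
    rng = rng or random.Random()
    form = []
    for domain, n in per_domain.items():
        form.extend(rng.sample(pool[domain], n))  # draw without replacement
    rng.shuffle(form)  # randomize item order across domains
    return form
```

Because each examinee receives a different sample from the pool, no single fixed form can be memorized and shared.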

Computerized adaptive testing (CAT)

CAT takes the personalization even further, and adapts the difficulty of the exam and the number of items seen to each student, based on certain psychometric goals and algorithms. This makes the test extremely secure.

Lockdown browser

Want to ensure that the student can’t surf for answers, or take screenshots of items? You need a lockdown browser. ASC’s assessment platforms, Assess.ai and FastTest, both come with this out of the box and at no extra cost.

Examinee test codes

Want to make sure the right person takes the right exam? Generate unique one-time passwords to be given out by a proctor after identity verification. This is especially useful with remote proctoring; the student never receives any communication before the exam about how to get in, other than to start the virtual proctoring session. Once the proctor verifies the examinee identity, then they provide the unique one-time password.
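The one-time password idea can be sketched like this; an illustrative snippet only, not how ASC's platforms actually generate codes:

```python
import secrets
import string

def generate_test_code(length=8):
    """Generate a random one-time exam code for a proctor to hand
    out after identity verification. Uses a cryptographically
    secure random source so codes cannot be predicted."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
```

Each code would be stored server-side, marked as consumed on first use, and tied to a specific examinee and exam.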

Proctor codes

Want an extra layer on the test kickoff procedure? After a student has their identity verified, and enters their code, then the proctor needs to also enter a different password that is unique to them on that day.

Date/Time Windows

Want to prevent examinees from logging in early or late? Set up a specific time window, such as 9 AM to 12 PM on Friday.

AI-based proctoring

This level of proctoring is relatively inexpensive and scalable, and does a great job of validating the results of an individual examinee. However, it does nothing to protect the intellectual property of your exam questions. If an examinee steals all the questions, you won’t know until later. So, it is very useful for low or mid-stakes exams, but not as useful for high-stakes exams such as certification or licensure. Learn more about our remote proctoring options. I also recommend this blog post for an overview of the remote proctoring industry.

Live Online Proctoring

If you can’t do in-person test centers because of COVID, this is the next best option. Live proctors can check in the candidate, verify identity, and implement all the other things above. In addition, they can verify the examinee’s environment and stop the exam if they see the examinee stealing questions or other major issues. MonitorEDU is a great example of this.


Contact Us To Improve Assessment

How can I start?

Need a hand implementing some of these measures? Or just want to talk about the possibilities? Email ASC at solutions@assess.com

This collusion detection (test cheating) index, Responses in Common (RIC), simply counts the number of responses in common between a given pair of examinees.  For example, both answered ‘B’ to a certain item, regardless of whether it was correct or incorrect.  There is no probabilistic evaluation that can be used to flag examinees.  However, it can be of good use from a descriptive or investigative perspective.
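As a minimal sketch, RIC for a pair of response strings can be computed like this (the function name is ours, not SIFT's):

```python
def responses_in_common(resp_a, resp_b):
    """Responses in Common (RIC): count of items where two examinees
    gave the identical response, whether correct or incorrect."""
    return sum(a == b for a, b in zip(resp_a, resp_b))
```

Note that no answer key is needed, since RIC ignores correctness entirely.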

It has a major flaw in that we expect it to be very high for high-ability examinees.  If two strong examinees both get 99/100 correct, the minimum RIC they could have is 98/100, even if they have never met each other and had no possibility of collusion or cheating.

Note that RIC is not standardized in any way, so its range and relevant flag cutoff will depend on the number of items in your test, and how much your examinee responses vary.  For a 100-item test, you might want to set the flag at 90 items.  But for a 50-item test, this is obviously irrelevant, and you might want to set it at 45.

Problems such as these with Responses In Common have led to the development of much more sophisticated indices of examinee collusion and copying, such as Holland’s K index and variants.

Need an easy way to calculate this?  Download our SIFT software for free.

Exact Errors in Common (EEIC) is an extremely basic collusion detection index that simply counts the number of identical incorrect responses between a given pair of examinees.

For example, suppose two examinees each got 80/100 correct on a test. Of the 20 each got wrong, they had 10 in common. Of those, they gave the same wrong answer on 5 items. This means that the EEIC would be 5. Why does this index provide evidence of collusion? Well, if you and I both get 20 items wrong on a test (same score), that’s not going to raise any eyebrows. But what if we get the same 20 items wrong? A little more concerning. What if we gave the exact same wrong answers on all of those 20? Definitely cause for concern!
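The count described above can be sketched as a small function (a hypothetical helper, not SIFT's code); unlike RIC, it requires the answer key:

```python
def exact_errors_in_common(resp_a, resp_b, key):
    """EEIC: count of items where both examinees chose the SAME
    wrong answer, given the keyed correct response for each item."""
    return sum(a == b and a != k for a, b, k in zip(resp_a, resp_b, key))
```

The condition `a == b and a != k` captures exactly the "identically wrong" case: the two responses match each other but not the key.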

There is no probabilistic evaluation that can be used to flag examinees.  However, it could be of good use from a descriptive or investigative perspective. Because it is of limited use by itself, it was incorporated into more advanced indices, such as Harpp, Hogan, and Jennings (1996).

Note that because Exact Errors in Common is not standardized in any way, its range and relevant flag cutoff will depend on the number of items in your test, and how much your examinee responses vary.  For a 100-item test, you might want to set the flag at 10 items.  But for a 20-item test, that cutoff is obviously irrelevant, and you might want to set it at 5 (because most examinees will probably not even make more than 10 errors).

EEIC is easy to calculate by hand, but you can also download the SIFT software for free.

This exam cheating index (collusion detection), Errors in Common (EIC), simply counts the number of errors in common between a given pair of examinees.  For example, if two examinees each got 80/100 correct (20 errors) and missed exactly the same 20 questions, the EIC would be 20. If they both scored 80/100 but had only 10 wrong questions in common, the EIC would be 10.  There is no probabilistic evaluation that can be used to flag examinees, as with more advanced indices; in fact, EIC is used inside some other indices, such as Harpp & Hogan.  However, this index can be of good use from a descriptive or investigative perspective.
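A sketch of the EIC count (again a hypothetical helper, not SIFT's code): here only joint incorrectness matters, not whether the two examinees chose the same wrong option:

```python
def errors_in_common(resp_a, resp_b, key):
    """EIC: count of items that BOTH examinees answered incorrectly,
    regardless of which wrong option each selected."""
    return sum(a != k and b != k for a, b, k in zip(resp_a, resp_b, key))
```

By construction, EIC is always greater than or equal to EEIC for the same pair of examinees.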

Note that EIC is not standardized in any way, so its range and relevant flag cutoff will depend on the number of items in your test, and how much your examinee responses vary.  For a 100-item test, you might want to set the flag at 10 items.  But for a 30-item test, this is obviously irrelevant, and you might want to set it at 5 (because most examinees will probably not even get more than 10 errors).

Learn more about applying EIC with SIFT, a free software program for exam cheating detection and other assessment issues.

Harpp, Hogan, and Jennings (1996) revised their Response Similarity Index somewhat from Harpp and Hogan (1993), producing a new equation for a statistic to detect collusion and other forms of exam cheating.

Explanation of Response Similarity Index

The index is the ratio

EEIC / D

where EEIC denotes the number of exact errors in common (identically wrong responses), and D is the number of items with a different response.

Note that D is calculated across all items, not just incorrect responses, so it is possible (and likely) that D > EEIC.  Therefore, the authors suggest utilizing a flag cutoff of 1.0 (Harpp, Hogan, & Jennings, 1996):

Analyses of well over 100 examinations during the past six years have shown that when this number is ~1.0 or higher, there is a powerful indication of cheating.  In virtually all cases to date where the exam has ~30 or more questions, has a class average <80% and where the minimum number of EEIC is 6, this parameter has been nearly 100% accurate in finding highly suspicious pairs.

However, Nelson (2006) has evaluated this index in comparison to Wesolowsky’s (2000) index and strongly recommends against using the HHJ.  It is notable that neither makes any attempt to evaluate probabilities or standardize.  Cizek (1999) notes that both Harpp-Hogan methods do not even receive attention in the psychometric literature.

This approach has very limited ability to detect cheating when the source has a high ability level. While individual classroom instructors might find the EEIC/D straightforward and useful, there are much better indices for use in large-scale, high-stakes examinations.
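Given response strings and a key, the EEIC/D ratio can be sketched as follows (a hypothetical helper for illustration, not SIFT's code):

```python
def hhj_index(resp_a, resp_b, key):
    """Harpp-Hogan-Jennings (1996) statistic: EEIC / D, where EEIC is
    the number of identical wrong answers and D is the number of items
    (across the whole test) where the two responses differ.
    Values around 1.0 or higher are flagged as suspicious."""
    eeic = sum(a == b and a != k for a, b, k in zip(resp_a, resp_b, key))
    d = sum(a != b for a, b in zip(resp_a, resp_b))
    return eeic / d if d else float("inf")  # identical records: undefined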

Harpp and Hogan (1993) suggested a response similarity index defined as

EEIC / EIC

Response Similarity Index Explanation

EEIC denotes the number of exact errors in common (identically wrong responses), and

EIC is the number of errors in common.

This is calculated for all pairs of examinees that the researcher wishes to compare. 

One advantage of this approach is that it is extremely simple to interpret: if examinees A and B each get 10 items wrong, 5 of which are in common, and gave the same answer on 4 of those 5, then the index is simply 4/5 = 0.80.  A value of 1.0 would therefore be perfect “cheating” – on all items that both examinees answered incorrectly, they happened to select the same distractor.
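The worked example above can be reproduced with a small sketch (hypothetical helper, not SIFT's code), reusing the EEIC and EIC counts:

```python
def hh_1993_index(resp_a, resp_b, key):
    """Harpp & Hogan (1993) response similarity: EEIC / EIC.
    A value of 1.0 means every jointly-missed item received the
    same wrong answer from both examinees."""
    eic = sum(a != k and b != k for a, b, k in zip(resp_a, resp_b, key))
    eeic = sum(a == b and a != k for a, b, k in zip(resp_a, resp_b, key))
    return eeic / eic if eic else 0.0  # no joint errors: nothing to compare
```

Note the defensive zero when EIC is 0; two examinees with no errors in common provide no evidence either way.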

The authors suggest utilizing a flag cutoff of 0.75, with the following reasoning (Harpp & Hogan, 1993, p. 307):

The choice of 0.75 is derived empirically because pairs with less than this fraction were not found to sit adjacent to one another while pairs with greater than this ratio almost always were seated adjacently.

The cutoff can differ from dataset to dataset, so SIFT allows you to specify the cutoff you wish to use for flagging pairs of examinees.  However, because this cutoff is completely arbitrary, a very high value (e.g., 0.95) is recommended, as this index can easily lead to many flaggings, especially if the test is short.  False positives are likely, and this index should be used with great caution.  Wesolowsky (unpublished PowerPoint presentation) called this method “better but not good.”

This index evaluates error similarity analysis (ESA), namely estimating the probability that a given pair of examinees would have the same exact errors in common (EEIC), given the total number of errors they have in common (EIC) and the aggregated probability P of selecting the same distractor.  Bellezza and Bellezza utilize the notation of k = EEIC and N = EIC, and calculate the probability as a binomial upper tail:

Probability = Σ (from j = k to N) C(N, j) · P^j · (1 − P)^(N − j)

Note that this is summed from k to N; the example in the original article is that a pair of examinees had N=20 and k=18, so the equation above is calculated three times (k=18, 19, 20) to estimate the probability of having 18 or more EEIC out of 20 EIC.  For readers of the Cizek (1999) book, note that N and k are presented correctly in the equation but their definitions in the text are transposed.
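Assuming the standard binomial form of this index, the probability of k or more EEIC out of N EIC can be computed directly (a sketch, not SIFT's implementation):

```python
from math import comb

def esa_probability(k, n, p):
    """Bellezza & Bellezza error similarity analysis: probability of
    observing k or more exact errors in common (EEIC) out of n errors
    in common (EIC), given probability p that a single jointly-missed
    item receives the same distractor. Binomial upper tail, summed
    from k to n."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))
```

For the article's example with N = 20 and k = 18, this sums the three terms j = 18, 19, 20, and yields a very small probability under independent test taking.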

The calculation of P is left to the researcher to some extent.  Published resources on the topic note that if examinees always selected randomly amongst distractors, the probability of an examinee selecting a given distractor is 1/d, where d is the number of incorrect answers, usually one less than the total number of possible responses.  Two examinees randomly selecting the same distractor would be (1/d)(1/d).  Summing across d distractors by multiplying by d, the calculation of P would be

P = d · (1/d)(1/d) = 1/d

That is, for a four-option multiple choice item, d=3 and P=0.3333.  For a five-option item, d=4 and P=0.25.

However, examinees most certainly do not select randomly amongst distractors. Suppose a four-option multiple-choice item was answered correctly by 50% (0.50) of the sample.  The first distractor might be chosen by 0.30 of the sample, the second by 0.15, and the third by 0.05.  SIFT calculates these probabilities and uses the observed values to provide a more realistic estimate of P.
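One plausible way to estimate P from observed distractor proportions — an assumption on our part, as SIFT's exact formula is not documented here — is to condition on both examinees having missed the item:

```python
def observed_match_probability(distractor_props):
    """Estimate P: the probability that two examinees who both missed
    an item chose the same distractor, given the observed proportion
    of the sample choosing each distractor. (Assumed formula, not
    necessarily SIFT's.)"""
    total = sum(distractor_props)  # proportion answering incorrectly
    # Probability both land on distractor i, given both are incorrect,
    # summed over distractors.
    return sum((q / total) ** 2 for q in distractor_props)
```

With the example proportions 0.30, 0.15, and 0.05, this gives P = 0.46, and with equal proportions it reduces to the random-selection value 1/d, which is a useful sanity check.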

SIFT therefore calculates this error similarity analysis index using both the observed probabilities and the random-selection assumption, labeling them as B&B Obs and B&B Ran, respectively.  The indices are calculated for all possible pairs of examinees, or all pairs in the same location, depending on the option selected in SIFT. 

How to interpret this index?  It is estimating a probability, so a smaller number means that the event can be expected to be very rare under the assumption of no collusion (that is, independent test taking).  So a very small number is flagged as possible collusion.  SIFT defaults to 0.001.  As mentioned earlier, implementation of a Bonferroni correction might be prudent.

The software program Scrutiny! also calculates this ESA index.  However, it utilizes a normal approximation rather than exact calculations, and details are not given regarding the calculation of P, so its results will not agree exactly with SIFT.

Cizek (1999) notes:

          “Scrutiny! uses an approach to identifying copying called “error similarity analysis” or ESA—a method which, unfortunately, has not received strong recommendation in the professional literature. One review (Frary, 1993) concluded that the ESA method: 1) fails to utilize information from correct response similarity; 2) fails to consider total test performance of examinees; and 3) does not take into account the attractiveness of wrong options selected in common. Bay (1994) and Chason (1997) found that ESA was the least effective index for detecting copying of the three methods they compared.”

Want to implement this statistic? Download the SIFT software for free.