Category Archives: Assessment


For the latest Ontario Ministry of Education document on assessment of learning, for learning and as learning please see Growing Success (2010).

Note: This is a PERSONAL blog, not an official Ministry of Education website. This is a forum for sharing.

Please add comments and your favourite resources (and let me know if there are any dead links!) Thank you!

The following is a summary of my talk about Assessment.  This is my own work, and based on Special Education in Ontario Schools (2008) by Ken Weber & Sheila Bennett. 

©Angela onthebellcurve (2012)

Assessment and Evaluation:  What is it?
•An Educated Guess
•A snapshot of a student’s strengths and needs at a particular time
•Tools to get some understanding of the students’ abilities, strengths and needs
•Assessment can be Formal (Standardised) or Informal (teacher created)
•Essentially, a description of a student’s performance or ability
Assessment: Why do it?
•To develop an IEP and make program and placement decisions
•To get information about a student’s academic abilities, intelligence (cognitive ability), behaviour, strengths, needs and so on
Assessment: Who does it?
•Originally only ‘experts’ would conduct ‘assessments’ that would be used to make decisions about placement and program
•Now, assessment is more team-based and involves input from a variety of sources and people
•Team members can include: classroom teacher, the School Special Ed. Teacher, and / or other professionals such as Speech Language Pathologists, Occupational Therapists or Psychologists
Caution: A test is only as good as the person administering it.
•We can only assess what we observe
–Although most testing attempts to assess ‘invisible’ processes such as working memory, visual-spatial ability and vocabulary by having the student demonstrate these abilities, these tests may not capture all of a student’s abilities

There are many things that influence student performance – such as anxiety, inattention, hunger, fatigue, depression

•Assessment gives a snapshot of the student’s performance and ability under particular circumstances at a particular time
•e.g. Students with average intelligence may perform very poorly on standardized cognitive assessments like the C-CAT if they have difficulties with Attention or Impulsivity

Reliability and Validity

  • Reliability – will the assessment measure the same thing repeatedly at different times?  If you measure this item again will you have the same measurement?
  • E.g. a broken ruler will always give the same measurement, time after time
  • Definition of Reliability –  the degree to which a student would obtain the same score if the test were re-administered (assuming no further learning, practice effects or change).
  • Validity  (definition) – the extent to which a test measures what it is designed to measure.
  • In our broken ruler example, the assessment tool is reliable.  It is not valid.
  • Our broken ruler does not correspond with any external measurements.
  • If our broken ruler did correspond with external measurements, it would be valid.
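
The broken ruler idea can be sketched in a few lines of Python. This is only an illustration: the 10 cm object and the 2 cm offset are made-up numbers, and `measure_with_broken_ruler` is a hypothetical helper, not part of any real assessment tool.

```python
# Hypothetical example: a ruler that always reads 2 cm too long.
TRUE_LENGTH_CM = 10.0
BROKEN_OFFSET_CM = 2.0  # assumed size of the ruler's error


def measure_with_broken_ruler(true_length_cm):
    """Every reading is off by the same fixed amount."""
    return true_length_cm + BROKEN_OFFSET_CM


readings = [measure_with_broken_ruler(TRUE_LENGTH_CM) for _ in range(5)]

# Reliable: repeated readings agree with each other.
is_reliable = max(readings) - min(readings) == 0.0

# Not valid: the readings do not agree with the external (true) length.
is_valid = all(reading == TRUE_LENGTH_CM for reading in readings)

print(readings, is_reliable, is_valid)
```

The readings never change (reliable), but they never match the real length either (not valid).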

Validity – IQ and Shoes???

Imagine we created a test for intelligence that found shoe size is positively correlated with our definition of intelligence

-i.e. the bigger the shoe size, the ‘more’ intelligent you are.  Therefore we would measure feet to determine intelligence.

This test would be reliable (we would have the same results over repeated measurements)

However this test is not valid.  Our measure of intelligence and our definition of intelligence do not correspond to any external measurements (pre-existing intelligence tests or research about intelligence)

Informal Assessments

  • May not be as reliable as formal assessments, but may be more valid!
  • Different assessments and evaluations over time give a more comprehensive view of the student
  • Examples include: portfolios, teacher created tests and assignments, running records / miscue analysis, teacher observations

Benefits of Informal Assessments

  • Assessment conducted by person working with student on ongoing basis
  • Assessment can be tailored to meet specific needs (e.g. decoding ability or borrowing in subtraction)
  • May provide a picture of why and when a student fails to demonstrate a specific skill rather than just confirming they cannot do it!!!!!

Formal Assessments

  • May be called ‘standardized’ because the test results are compared to norms (the groups used by publishers that are supposed to be representative of the population)
  • Tests are usually timed (this may be difficult for slow or deep thinkers)
  • Group tests usually have single answers to multiple choice questions (this may be difficult for divergent thinkers)
  • Formal Assessments Include: Rating Scales, Inventories & Checklists, Intelligence tests
  • Specific examples include the WISC, Canadian Achievement Tests (CAT) [This assesses academic achievement] and Canadian Cognitive Abilities Test (C-CAT) [This assesses cognitive (Thinking) ability]

Follow this link for details about the WISC.

Follow this link for details about Norms, Percentiles, Stanines, Grade Equivalents etc..

Important to Know: Percentiles are NOT people

One of my students once said:

Yo Man, I don’t like, um what do you callit… psychologists.  They say this and they say that.  They tell my mom I need meds.  I don’t need meds.  I’ve matured.  (referring to his problems with anger management)

Those psychologists don’t know me.  What do they know about my life?  How can they say that I’m this or I’m that.  They don’t know me.

When reading standardised assessment results be aware of Band of Confidence / Standard Error of Measurement

  • Standard Error of Measurement  – an estimate of how far a subject’s score may be ‘out’ from the score they would obtain on a perfectly accurate test.  This information is in the technical manual for published tests.
  • Band of Confidence – because of the Standard Error of Measurement, a test score can never be considered absolutely correct.  Therefore some test manuals offer a range around a score that can be interpreted with confidence.
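
For the curious, here is a rough sketch of how a band of confidence can be worked out. It uses the standard textbook formula from classical test theory (SEM = SD × √(1 − reliability)), which is not quoted from any particular test manual. The standard deviation of 15, the reliability of 0.91 and the score of 100 are assumed values for illustration only.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Textbook formula: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_band(score, sem, z=1.96):
    """Range around an observed score; z = 1.96 gives roughly a 95% band."""
    return (score - z * sem, score + z * sem)


# Assumed values for illustration only (not taken from a real manual).
sem = standard_error_of_measurement(sd=15.0, reliability=0.91)
low, high = confidence_band(score=100.0, sem=sem)

print(f"SEM = {sem:.1f}, band = {low:.1f} to {high:.1f}")
# SEM = 4.5, band = 91.2 to 108.8
```

With these made-up numbers, a reported score of 100 is better read as "somewhere between about 91 and 109". Always use the values printed in the test's own technical manual.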

Issues Around Formal Assessment

  • Sometimes student performance on Formal Assessments may not reflect their actual ability due to individual student factors such as anxiety, non-compliance, impulsivity, etc.
  • Sometimes Formal Assessments may miss key ecological factors in student performance (e.g. social-emotional issues, triggers for behaviour)
  • Sometimes Formal Assessments may not reflect the frequency or intensity of behaviour
  • To have an accurate profile of the student you need information from a variety of sources and assessment tools
  • You need a balance of Formal and Informal assessments – standardized tests, teacher-created evaluations and observations
  • Information from a variety of sources – parents, teachers, TAs and even the student!

©Angela onthebellcurve (2012)



Assessment: Norms, Percentiles, Stanines, Grade Equivalents etc


Norms

  • The results obtained by a supposedly representative sample of students on this particular test
  • Once the test is published, students who write the test have their results compared to these norms
  • This produces individual scores such as Grade Equivalent, Percentile, Stanine, etc.

Grade Equivalent           

  • A test score is related to the school grade ‘equivalent’
  • e.g. a grade equivalent of 6.2 indicates student performance is comparable to that of a student in the 2nd month of grade 6
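
A small sketch of how to read a grade equivalent, assuming the score always has a single decimal digit that stands for the month. `split_grade_equivalent` is a hypothetical helper name used only for this example.

```python
def split_grade_equivalent(grade_equivalent):
    """Split a score like 6.2 into (grade, month)."""
    tenths = round(grade_equivalent * 10)  # 6.2 -> 62, avoids decimal drift
    return divmod(tenths, 10)


grade, month = split_grade_equivalent(6.2)
print(f"Grade {grade}, month {month}")  # Grade 6, month 2
```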


Percentiles

  • A percentile rank is a type of converted score that expresses a student’s score relative to their group in percentile points.

–      Imagine lining up participants in a race in order of winning – they would line up as First place, Second Place, Third place, Fourth place…etc.

–      BUT, imagine doing this for 100 people!  Your First place winner would be standing in front of 99 other people.  Thus, they would be in the 99th percentile (having performed better than 99% of the group)

– This indicates the percentage of students tested who made scores equal to or lower than the specified score.

– E.g. a student ranking at the 57th percentile performed as well as or better than 57 percent of the students of the same age who wrote this test (the norm group)

Important to know: Percentiles are a ranking system based on a line-up of performers
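
The line-up idea can be sketched in Python. This follows the "equal to or lower than" definition given above; some publishers count only the scores strictly below, so treat this as one possible convention. The norm group of 100 made-up scores (1 to 100) is purely illustrative, and `percentile_rank` is a hypothetical helper name.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring equal to or lower than `score`."""
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100.0 * at_or_below / len(norm_scores)


# Made-up norm group: 100 students with scores 1, 2, 3, ... 100.
norm_group = list(range(1, 101))

print(percentile_rank(57, norm_group))  # 57.0
```

Notice that the result depends entirely on who is in the line-up. The same raw score would get a different percentile against a different norm group.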

Percentile Chart

  • Percentile Chart – ranks the scores from low to high and assigns a percentile ranking to a particular score.  So if someone scored 3% on a test and they were the only one out of 100 people to score this low, they would be at the 1st percentile.  (They ranked the lowest out of 100.)
  • Ex. Your height is at the 2nd percentile.  This means that 98 percent of the population of people your age are taller than you are.

Bell Curve

  • The bell curve rises up over the 40-60th percentiles because most people score within this range.  (e.g. most people are medium-sized if we use height as an example)
  • This is called a ‘normal curve’ because it reflects the ‘normal’ (statistical term) distribution of measurable traits within a population – e.g. height, weight, test scores.  This only applies to things you can quantify (measure).  It would be hard to develop a scale to determine how much ‘kindness’ a person has, never mind score it and plot ‘kindness’ within a population.
  • Percentiles and Stanines are often used together – e.g. 40th percentile, stanine 4

Average: What is it?

  • NOT ‘normal’ (at least in the every day language sense)!
  • A statistical term analogous to Mean (sum of scores divided by the number of scores)
  • On standardized tests with normally distributed scores, the Average or Mean falls at the 50th percentile
  • Therefore scores above the 50th percentile are ‘above average’ and scores below the 50th percentile are ‘below average’
  • ‘Average’ is usually reported as a range (e.g. 40th – 60th percentile)


AVERAGE means the middle range of scores
Someone scoring in the average range would have more than some, but less than others.

Average is a range of scores in the middle of everyone else’s scores.  So people in the middle (40-60th percentile) scored more than the people who scored less than the 40th percentile (the left hand side of the curve).  However, the people in the middle scored less than the people on the right hand side of the curve.

  • Ex. Think of clothing sizes – average is medium.  A medium-sized person wears bigger clothes than a small-sized person (compared to the small-sized people the medium-sized people use more fabric) but they wear smaller clothes than a larger-sized person (compared to the large-sized people the medium-sized people use less fabric)
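
As a tiny sketch, the "average is a range" idea looks like this in Python. It assumes the 40th to 60th percentile range used as the example in this post; real reports often use finer labels (such as "low average"), and `describe_percentile` is a hypothetical helper name.

```python
AVERAGE_LOW, AVERAGE_HIGH = 40, 60  # example range used in this post


def describe_percentile(percentile):
    """Label a percentile as below average, average, or above average."""
    if percentile < AVERAGE_LOW:
        return "below average"
    if percentile > AVERAGE_HIGH:
        return "above average"
    return "average"


for p in (25, 40, 50, 60, 85):
    print(p, describe_percentile(p))
```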

Stanine Chart
The distribution of scores is divided into 9 intervals.


  • “Standard NINE” (an army term)
  • A reporting scheme or way of ranking student performance on a test based on an equal interval scale of 1 to 9.  (5 is average, 6 is slightly above, 4 is slightly below average)
  • Usually used with percentiles
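
Here is a rough sketch of how a percentile can be turned into a stanine. The cut-offs follow the conventional 4-7-12-17-20-17-12-7-4 percent split, which is an assumption on my part and not taken from a specific test manual. Scores that land exactly on a cut-off are placed in the lower stanine here, so the 40th percentile comes out as stanine 4; other sources may round the other way.

```python
from bisect import bisect_left

# Upper percentile limit of stanines 1 to 8 (stanine 9 is everything above).
# Assumed conventional split: 4, 7, 12, 17, 20, 17, 12, 7, 4 percent.
STANINE_UPPER_LIMITS = [4, 11, 23, 40, 60, 77, 89, 96]


def percentile_to_stanine(percentile):
    """Return the stanine (1 to 9) for a percentile between 1 and 99."""
    return bisect_left(STANINE_UPPER_LIMITS, percentile) + 1


for p in (1, 40, 50, 61, 99):
    print(p, percentile_to_stanine(p))
```

The stanines are equal steps along the bottom of the bell curve, but they do not hold equal numbers of people: the middle stanine (5) holds about 20 percent of the group, while stanines 1 and 9 hold about 4 percent each.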

See it all in action together:

Here, the chart is flipped sideways with descriptive qualifiers.


Formal Assessment – the WISC (intelligence test)

One example of a Formal Assessment that assesses cognitive (thinking) ability or intelligence is the WISC.  WISC stands for Wechsler Intelligence Scale for Children.  There are different versions (e.g. WISC – III or IV).

The WISC is an “IQ” or Intelligence Test in which students are assessed on verbal, performance and quantitative ability.

Psychologists may administer the WISC and write a report called a Psychoeducational Assessment (the ‘Psych’ Report).  Teachers do not administer the WISC – but we DO read the reports.

Intelligence Tests  such as the WISC – IV may assess the following psychological processes:

  • A total of five composite scores can be derived with the WISC–IV. The WISC-IV generates a Full Scale IQ (FSIQ), which represents overall cognitive ability. The four other composite scores are the Verbal Comprehension Index (VCI), Perceptual Reasoning Index (PRI), Processing Speed Index (PSI) and Working Memory Index (WMI).

From the WISC-IV:  Verbal Comprehension Index

  • The Verbal Comprehension Index subtests are as follows:
  • Vocabulary – examinee is asked to define a provided word.
  • Similarities – asking how two words are alike/similar.
  • Comprehension – questions about social situations or common concepts.
  • Information (supplemental) – general knowledge questions.
  • Word reasoning (supplemental)- a task involving clues that lead to a specific word, each clue adds more information about the object/word/concept.
  • The Verbal Comprehension Index is an overall measure of verbal concept formation (the child’s ability to verbally reason) and is influenced by knowledge learned from the environment.

From the WISC-IV: Perceptual Reasoning Index

  • The Perceptual Reasoning Index subtests are as follows:
  • Block Design – children put together red-and-white blocks in a pattern according to a displayed model. This is timed, and some of the more difficult puzzles award bonuses for speed.
  • Picture Concepts – children are provided with a series of pictures presented in rows (either two or three rows) and asked to determine which pictures go together, one from each row.
  • Matrix Reasoning – children are shown an array of pictures with one missing square, and select the picture that fits the array from five options.
  • Picture Completion (supplemental) – children are shown artwork of common objects with a missing part, and asked to identify the missing part by pointing and/or naming.

From the WISC-IV: Processing Speed Index

  • The Processing Speed Index subtests are as follows:
  • Coding – children under 8 mark rows of shapes with different lines according to a code, children over 8 transcribe a digit-symbol code. The task is time-limited with bonuses for speed.
  • Symbol Search – children are given rows of symbols and target symbols, and asked to mark whether or not the target symbols appear in each row.
  • Cancellation (supplemental) – children scan random and structured arrangements of pictures and mark specific target pictures within a limited amount of time.

From the WISC-IV: Working Memory Index

  • The Working Memory Index  (formerly known as Freedom from Distractibility Index) subtests are as follows:
  • Digit Span – children are orally given sequences of numbers and asked to repeat them, either as heard or in reverse order.
  • Letter-Number Sequencing – children are provided a series of numbers and letters and asked to provide them back to the examiner in a predetermined order.
  • Arithmetic (supplemental) – orally administered arithmetic questions. Timed.

From the WISC-IV: Scoring

  • Each of the ten core subtests is given equal weighting towards full scale IQ. There are three core subtests each for VCI and PRI, thus they are given 30% weighting each; PSI and WMI have two core subtests each, so they are given 20% weighting each.
  • The WISC-IV also produces seven process scores on three subtests: block design, cancellation and digit span. These scores are intended to provide more detailed information on cognitive abilities that contribute to performance on the subtest. These scores do not contribute to the composite scores.
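
The weighting arithmetic can be checked with a short sketch. It only counts the core subtests listed in this post and works out each index's share of the ten; it is not how the test is actually scored (real scoring uses the publisher's norm tables).

```python
# Core (non-supplemental) subtests as listed in this post.
CORE_SUBTESTS = {
    "VCI": ["Vocabulary", "Similarities", "Comprehension"],
    "PRI": ["Block Design", "Picture Concepts", "Matrix Reasoning"],
    "WMI": ["Digit Span", "Letter-Number Sequencing"],
    "PSI": ["Coding", "Symbol Search"],
}

total_core = sum(len(subtests) for subtests in CORE_SUBTESTS.values())

# Each core subtest counts equally, so an index's share is its subtest count.
weights = {
    index: 100 * len(subtests) / total_core
    for index, subtests in CORE_SUBTESTS.items()
}

print(total_core, weights)
```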

Scores are reported in the psychoeducational report.  Scores are usually presented as percentiles with descriptions (i.e 35th percentile, low average ability).  See Assessment: Norms, Percentiles, Stanines, Grade Equivalents etc for details.

  • Bell Curve illustrating the range of scores on the Wechsler Adult Intelligence Scale
  • The average score is a range from 85 to 115.  That means most people score within this range.  (That is why the high point of the bell is over these scores).  Fewer people score less than 55 or higher than 145 – that is why the low point of the bell is over these scores.
  • A score of less than 70 indicates low cognitive ability (mild intellectual disability) and a score of less than 55 relates to moderate mental retardation (developmental disability).
  • A score over 130 indicates high cognitive ability and intellectual giftedness (the actual criteria for gifted identification depends on your board)
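
To get a feel for how many people fall into each of these ranges, here is a sketch using Python's built-in `statistics.NormalDist`. It assumes the scores follow a perfect normal curve with a mean of 100 and a standard deviation of 15 (the usual scaling for Wechsler composite scores); real score distributions only approximate this.

```python
from statistics import NormalDist

# Assumption: scores follow a normal curve with mean 100 and SD 15.
scores = NormalDist(mu=100, sigma=15)

share_average = scores.cdf(115) - scores.cdf(85)  # between 85 and 115
share_below_70 = scores.cdf(70)
share_above_130 = 1 - scores.cdf(130)
share_below_55 = scores.cdf(55)

print(f"85 to 115: {share_average:.1%}")   # about 68%
print(f"below 70:  {share_below_70:.1%}")  # about 2%
print(f"above 130: {share_above_130:.1%}") # about 2%
print(f"below 55:  {share_below_55:.1%}")  # about 0.1%
```

So under these assumptions roughly two thirds of people score in the 85 to 115 range, and only about 1 person in 1,000 scores below 55.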

How does the Test maker know their test is Valid and Reliable?

Psychometric properties of the WISC-IV

The WISC–IV US standardization sample consisted of 2,200 children between the ages of 6 and 16 years 11 months and the UK sample consisted of 780 children. Both standardizations included special group samples including the following: children identified as gifted, children with mild or moderate mental retardation, children with learning disorders (reading, reading/writing, math, reading/writing/math), children with ADHD, children with expressive and mixed receptive-expressive language disorders, children with autism, children with Asperger’s syndrome, children with open or closed head injury, and children with motor impairment.

WISC–IV is also validated with measures of achievement, memory, adaptive behaviour, emotional intelligence, and giftedness. Equivalency studies were also conducted within the Wechsler family of tests enabling comparisons between various Wechsler scores over the lifespan. A number of concurrent studies were conducted to examine the scale’s reliability and validity. Evidence of the convergent and discriminant validity of the WISC–IV is provided by correlational studies with the following instruments: WISC–III, WPPSI–III, WAIS–III, WASI, WIAT–II, CMS, GRS, BarOn EQ, and the ABAS–II. Evidence of construct validity was provided through a series of exploratory and confirmatory factor-analytic studies and mean comparisons using matched samples of clinical and nonclinical children.
