Development Of Psychometrically Validated Standardized Test Instruments For Outcomes Assessment In Experiential Engineering Education

There is a major trend in engineering education to provide students with realistic hands-on learning experiences. This paper reports on the results of work done to develop standardized test instruments to use for student learning outcomes assessment in an experiential hands-on manufacturing engineering and technology environment. The specific outcomes targeted for assessment are those defined under the MILL (Manufacturing Integrated Learning Laboratory) Manufacturing Competency Model. In a unique feature aimed at experiential learning, the test instruments incorporate the use of a physical manipulative to evaluate attainment of particular hands-on skills. The resulting standardized tests have been subjected to extensive psychometric analysis. The results of the analysis indicate excellent structure of the test instruments. The test instruments have shown high levels of stability, internal consistency, and reliability. These tests can be used as instruments for outcomes assessment to help document attainment of targeted learning outcomes for program assessment, accreditation, and other assessment purposes.


INTRODUCTION
Recent developments in engineering education have focused on improving student learning by providing more hands-on learning experiences. With respect to manufacturing, the Society of Manufacturing Engineers Education and Research Community's Curricula 2015 report examined the state of manufacturing education and industry, emerging issues, and opportunities for improvement (Mott & Hugh 2011). It states that as manufacturing becomes more established as a discipline, it is necessary to work towards a strong yet flexible core curriculum, and that there is a need for a consistent model that can be used to design and assess programs.
In our previous work, we described how a core common curriculum was developed by five departments at four different institutions to provide experiential hands-on manufacturing education. The following major curriculum areas emerged: (1) drafting/design, (2) manufacturing process, (3) process engineering, and (4) CAD/CAM. The associated competencies that students are expected to master are shown in Table 1. This table summarizes what came to be known as the MILL Manufacturing Competency Model (MILL Model for short). Further discussion of the MILL Model is found in Ssemakula, Liao, Ellis, Kim, and Sawilowsky (2009) and Ssemakula et al. (2011a, 2011b).

Copyright by author(s); CC-BY 2
The Clute Institute

The competencies shown in Table 1 form the knowledge base or blueprint from which the standardized test instruments that are the subject of this paper were developed. A collaboration among diverse partners to develop a standardized test is likely to introduce novel issues in arriving at an agreement on matters that impact the psychometric priorities of the instrument. For example, consider evidence for content validity, "the degree that a test measures what it purports to measure" (Sawilowsky, 2000a, p. 155; see also Sawilowsky, 2000b). It may be obtained via a Venn diagram of the test's blueprint of objectives with the following: outline of the curricular modules, subject matter experts' analyses, school district curricula and objectives guides, standards set by blue-ribbon panels, and related resources. However, when there are multiple and diverse partners, there may be little opportunity for agreement on the choice of topics and certainly subtopics required to support claims of content validity.
The purpose of this paper, therefore, is to explicate the approach that was used to develop a standardized test of the core manufacturing competencies detailed in the MILL Model. This collaborative effort included partners from five different departments at three universities and one community college around the United States, and an advisory board of industry representatives. The goal was to develop a content-valid standardized test instrument to help validate the attainment of core student learning outcomes in the manufacturing arena. Among other things, the results from the test can be used to demonstrate that an educational program is addressing the competency gaps that were identified by industry (SME 2003) and meeting student outcomes established by accreditation agencies such as the Accreditation Board for Engineering and Technology (ABET 2016a, ABET 2016b).

DEVELOPMENT OF TEST INSTRUMENTS
Once the test blueprint in Table 1 was finalized, specifications were developed to mark out the different levels of cognitive ability addressed (recall, application, problem solving), and to serve as prompts in developing individual test items to assess the various cognitive levels. To assess whether students were achieving the target outcomes under the MILL Model, a proprietary parallel form standardized test was developed using an experimental design methodology. The drafting/design, manufacturing processes, process engineering, and CAD/CAM knowledge areas from Table 1 formed the test's subscales. The multiple competencies contained within each subscale naturally formed the test blueprint.
This particular test is unique in that it incorporates a hands-on manipulative which can be used to provide a direct assessment of student attainment of specific hands-on skills. This helps to maximize the applicability of the test to the all-important hands-on learning experiences that are so highly sought by industry as well as demanded by modern engineering education pedagogy. The manipulative was designed to encompass the same hands-on skills as those stipulated under the MILL Model. It provides a realistic artifact to be referenced during the test itself, so as to tie in with the corresponding hands-on experiences. To make hands-on experiences explicitly relevant to the test, test questions frequently direct the examinee to inspect the artifact because the question is based directly on it. Multiple copies of the artifact were fabricated and distributed to each partner institution to be made available to students during the testing.
Two versions of a standardized test were developed (Forms A and B). The test length was initially set to be twice the number of items per target competency. A conference was organized where representatives from each partner school were trained in the use of the blueprint approach to test construction as described in Bridge, Musial, Roe, Frank, and Sawilowsky (2003). There were 33 test items written for the Form A scale and 32 test items written for the Form B scale. The items were sent to the advisory board to obtain validity evidence based on their expertise. The test content was also validated by academic and industry experts. The item pool was then administered to target students at all partner institutions. A sample test question (with possible answers) is shown in Figure 1. To maintain the academic integrity of the tests, we are including only one sample question to illustrate the nature of the test instruments. After the tests were administered at the various consortium schools, extensive and rigorous psychometric analysis was carried out on the resulting student performance data as described below.

PSYCHOMETRIC ANALYSIS OF TEST INSTRUMENTS
An examination of the psychometric properties of the test items was initially undertaken by exploratory factor analysis (EFA) methods. Principal components extraction with varimax rotation was used with the number of factors fixed at four to identify the four factors of Manufacturing Processes, Process Engineering, Drafting/Design, and CAD/CAM which constitute the MILL Model. Factor loadings with an absolute value less than .4 were suppressed, and the items were examined to ensure that each loaded on at least one factor and on no more than one factor. Then, an item deletion approach was undertaken wherein an item was retained only if its point-biserial correlation (a measure of the homogeneity of the item with the remainder of the scale) was approximately ≥ +.125.
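The item retention rule can be illustrated with a short sketch. It assumes responses are stored as one list of 0/1 item scores per student (a hypothetical layout, not the study's data format) and computes each item's point-biserial as the Pearson correlation between the item and the total of the remaining items. It performs a single pass with the .125 cutoff, whereas the deletion procedure reported here may have been iterative; the function names are illustrative only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: correlation is undefined, treated as 0 here
    return sxy / sqrt(sxx * syy)

def retained_items(responses, threshold=0.125):
    """Return indices of items whose item-rest point-biserial is >= threshold.

    responses: one list per student of 0/1 item scores (assumed layout).
    """
    n_items = len(responses[0])
    keep = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        # Total score on the remaining items, so the item is not
        # correlated with itself.
        rest = [sum(row) - row[j] for row in responses]
        if pearson(item, rest) >= threshold:
            keep.append(j)
    return keep

# Made-up responses for four students on three items (not study data).
toy = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0]]
kept = retained_items(toy)
```

In this toy data the third item is uncorrelated with the rest of the scale, so only the first two items are retained.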

Reliability
"According to classical measurement theory, instrument reliability is the consistency that a test measures whatever it measures" (Sawilowsky, 2007, p. 516). Two types of instrument reliability, parallel forms and internal consistency, were assessed for the test instruments developed.
Parallel Forms reliability describes the stability of scores over time. It was obtained by administering Form A and then, after a delay (in order to eliminate fatigue), Form B to the same students. It is distinguished from test-retest, where the same test is administered twice to the same students, in that it has the advantage of precluding the memory effect (i.e., relying on recollection of the answer obtained previously instead of skills and abilities). It is computed as the Pearson product-moment coefficient of correlation on the scores of the two parallel forms. The parallel forms reliability, rPF, was .86 (with p = .006), indicating a fairly high level of stability of scores.
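As an illustration of the computation, the following sketch correlates paired total scores on two forms. The scores are made up for five hypothetical students and are not data from this study; the function name is likewise illustrative.

```python
from statistics import mean, pstdev

def parallel_forms_reliability(form_a, form_b):
    """Pearson r between paired total scores on two parallel forms."""
    if len(form_a) != len(form_b):
        raise ValueError("each student needs a score on both forms")
    ma, mb = mean(form_a), mean(form_b)
    sa, sb = pstdev(form_a), pstdev(form_b)
    # The average product of z-scores (population SD) equals Pearson's r.
    return mean(((a - ma) / sa) * ((b - mb) / sb)
                for a, b in zip(form_a, form_b))

# Made-up paired scores (not study data).
scores_a = [12, 15, 9, 18, 14]
scores_b = [11, 16, 10, 17, 13]
r_pf = parallel_forms_reliability(scores_a, scores_b)
```

The scores must be paired by student, which is why the sketch rejects lists of unequal length.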
Internal Consistency is a measure of the homogeneity of subject matter represented by the test items. In its simplest form, the test is randomly split into two parts and the Pearson correlation is computed on the scores for the two parts. This form of internal consistency, called split-halves (rSH), is dependent on the fashion in which the test is split. Cronbach's α, an improvement on split-halves, is the long-run average correlation obtained from all possible ways of splitting the test into two parts. To achieve an unbiased estimate, a parallel model was assumed and also reported. Cronbach's α for Form A was .80, and the unbiased parallel model rPM was .81. The standard error of measurement (SEM) was 1.82.
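For readers who want to reproduce these quantities on their own data, the sketch below uses the standard variance-based formula for Cronbach's α and the classical relation SEM = SD × sqrt(1 − reliability). Both formulas are textbook definitions rather than details taken from this study, the use of sample variance is an assumption, and the unbiased parallel model estimate is not reproduced here.

```python
from math import sqrt
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha from one list of item scores per student.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(responses[0])
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def standard_error_of_measurement(total_sd, reliability):
    """Classical SEM: total-score SD scaled by sqrt(1 - reliability)."""
    return total_sd * sqrt(1 - reliability)

# Made-up responses in which all three items agree perfectly (not study data).
toy = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
alpha = cronbach_alpha(toy)
```

When every item ranks students identically, as in the toy data, α reaches its upper bound of 1; real tests fall below that.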
The item descriptors, level of difficulty, and other summary statistics are compiled in Table 2. The column titled P in that table is also known, in Classical Measurement Theory, as the item's difficulty index (Sawilowsky, 2007, p. 517). P values close to 1 indicate the items are easy, whereas P values close to 0 indicate the items are hard. For example, item #29 would be considered difficult, whereas item #8 is considered to be very easy. Ideally, the average P value in a standardized test should be P = .5. In this case, the mean P value for Form A is .61, which is reasonably close to the desired middle point.
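The difficulty index is simply the proportion of examinees answering an item correctly. A minimal sketch, assuming 0/1 scored responses with one list per student (an illustrative layout, with made-up data):

```python
def item_difficulty(responses):
    """Proportion of students answering each item correctly (the P index)."""
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_students
            for j in range(n_items)]

# Made-up responses for four students on two items (not study data).
toy = [[1, 0], [1, 1], [1, 0], [0, 0]]
p_values = item_difficulty(toy)
mean_p = sum(p_values) / len(p_values)
```

Here the first item is easy (three of four students answer correctly) and the second is hard, so the mean P sits at the middle point.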
The item-total statistics are compiled in Table 3. They predict the impact of further adjustment through item deletion on Cronbach's α. It is clear that no further item deletions would result in materially increasing the reliability of the final Form A scale.
Discrimination statistics via Classical Measurement Theory, namely the discrimination index and the point-biserial correlation, are provided in Table 4. Discrimination is the ability of a test item to differentiate between those who know the correct answer and those who either do not know it or are unable to produce it. These statistics are further refined by grouping students into ability levels. Students scoring in the top 27½% are placed in the high ability group, whereas students scoring in the bottom 27½% are placed in the low ability group. Students in the high ability group should correctly answer an item with a high discriminatory ability, while students in the low ability group should not be able to answer that item correctly. When the item difficulty P = .5, the discrimination index D takes on values from -1 to +1. All negative values are interpreted as unacceptable. As D increases from 0 to +1, the discrimination increases, which is desirable. Note, however, that as the item difficulty rises or falls from P = .5, the minimum and maximum values for D shrink, and their interpretation is dependent on the value of P. Hence, the point-biserial correlation is also given as an aid to interpreting the item discrimination. An inspection of Table 4 indicates strong discriminatory ability for all items.
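The grouping procedure can be sketched as follows. The sketch assumes the common definition D = P(high group) − P(low group), which the text does not state explicitly, and it assumes that ties in total score are broken by the original order of the students. The data are made up and the function name is illustrative.

```python
def discrimination_index(responses, item, fraction=0.275):
    """D = proportion correct in the high group minus that in the low group.

    responses: one list of 0/1 item scores per student (assumed layout).
    fraction: share of students placed in each ability group (27.5%).
    """
    # Rank students by total score, highest first; sorted() is stable,
    # so tied students keep their original relative order.
    ranked = sorted(responses, key=sum, reverse=True)
    n_group = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:n_group], ranked[-n_group:]
    p_upper = sum(row[item] for row in upper) / n_group
    p_lower = sum(row[item] for row in lower) / n_group
    return p_upper - p_lower

# Made-up responses for ten students on three items (not study data).
toy = ([[1, 1, 1]] * 3) + ([[1, 1, 0]] * 4) + [[0, 0, 1], [0, 0, 0], [0, 0, 0]]
d_first = discrimination_index(toy, item=0)
d_third = discrimination_index(toy, item=2)
```

With ten students each group holds three, so the first item separates the groups completely while the third item, which one low-scoring student answers correctly, discriminates less sharply.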

The final Form A total scale statistics are compiled in Table 5. The same process used for obtaining psychometric results for Form A was applied to Form B. For Form B, Cronbach's α was .64, and under the parallel model, the unbiased reliability estimate rPM was .65. The SEM was 1.906. The remaining statistics are compiled in Tables 6-9. It is also apparent in Table 7 that no further item deletion would materially improve the reliability of Form B, and there was adequate discriminatory power between high and low ability groups.

The T-score transformation yields a standard scale of 20 to 80, with a mean of 50 and standard deviation of 10. For comparison purposes, the relevant scores on Form A are presented in Table 10. This table readily permits converting a raw score into T score and percentile equivalents. (Note that not all raw scores were present in the original data, e.g., X = 1, 2, 4. Interpolation, extrapolation, or spline fitting techniques can be used to obtain the corresponding standardized scores for those data points.)
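A T score is conventionally defined as T = 50 + 10z, where z is the raw score's standardized deviation from the mean; a range of 20 to 80 corresponds to z between -3 and +3. The sketch below applies that conventional definition to made-up raw scores. Whether the study standardized with the population or the sample standard deviation is not stated, so the use of the population form here is an assumption.

```python
from statistics import mean, pstdev

def t_scores(raw_scores):
    """Convert raw scores to T scores: T = 50 + 10 * z."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # population SD; an assumption, see lead-in
    if sd == 0:
        raise ValueError("all raw scores are identical; z is undefined")
    return [50 + 10 * (x - m) / sd for x in raw_scores]

# Made-up raw scores (not study data).
raw = [8, 11, 12, 14, 15]
converted = t_scores(raw)
```

By construction the converted scores have a mean of 50 and a standard deviation of 10, whatever the scale of the raw scores.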
The results of all the analyses demonstrate that the test instruments that were developed are internally consistent, and psychometrically reliable and valid.


CONCLUSION
This work shows that standardized tests can be used for outcomes assessment and authentication of the attainment of target hands-on competencies. The paper demonstrates that the test instruments that have been developed are robust and psychometrically valid. For Form A, the instrument's average difficulty index P of .61 is close to the middle point of .5, which is ideal for standardized tests; the parallel forms reliability, rPF, was .86, indicating a high level of stability of test scores; and Cronbach's α was .80, while the unbiased parallel model rPM was .81, indicating a high level of internal consistency. Form B has similar characteristics. The use of well-constructed standardized tests can conveniently document attainment of targeted learning outcomes for accreditation and other assessment purposes. This is an innovative use of standardized testing for programmatic evaluation. The incorporation of a physical manipulative to evaluate attainment of hands-on engineering competencies in a psychometrically validated standardized test is a unique pioneering contribution.

Figure 1. Sample Test Question
Q. What is the proper sequence of G-codes in numerical control programming if the objective is to mill the outer profile of the plate in the figure below?

(Items were eliminated if the point-biserial was negative, or positive but approximately 0 < rPB < +.125.) The final result was two 20-item test instruments, Form A and Form B.

Table 1. Common Competencies and Curriculum Test Blueprint

Table 2. Form A Item Statistics

Table 3. Form A Item-Total Statistics

Table 4. Form A Item Difficulty and Discrimination Statistics

Table 5. Final Form A Total Scale Statistics, rPM = .81

Table 6. MILL Form B Item Statistics

Table 7. MILL Form B Item-Total Statistics

Table 8. MILL Form B Item Difficulty and Discrimination Statistics

Table 9. Final Form B Total Scale Statistics, rPM = .65

Table 10. Standardized Scores for Form A