Predicting the Effectiveness of Pattern-based Entity Extractor Inference




Alberto Bartoli, Andrea De Lorenzo, Eric Medvet, Fabiano Tarlao


Applied Soft Computing (ASOC)
(rank Q1 in Software)




Abstract

An essential component of any workflow that leverages digital data is the identification and extraction of relevant patterns from a data stream. We consider a scenario in which an extraction inference engine generates an entity extractor automatically from examples of the desired behavior, which take the form of user-provided annotations of the entities to be extracted from a dataset. We propose a methodology for predicting the accuracy of the extractor that may be inferred from the available examples. We propose several prediction techniques and analyze them experimentally in great depth, with reference to extractors consisting of regular expressions. The results suggest that reliable predictions for tasks of practical complexity may indeed be obtained quickly and without actually generating the entity extractor.