Automatic Synthesis of Regular Expressions from Examples

Bartoli, Alberto; De Lorenzo, Andrea; Medvet, Eric
2014

Abstract

We propose a system for the automatic generation of regular expressions for text-extraction tasks. The user describes the desired task solely through a set of labeled examples. The generated regexes can be used with common engines, such as those included in Java, PHP, and Perl. Using the system requires no familiarity with regular expression syntax. We performed an extensive experimental evaluation on 12 different extraction tasks applied to real-world datasets and obtained very good precision and recall, even in comparison with earlier state-of-the-art proposals. Our results are highly promising as a step toward a practical surrogate for the specific skills required to write regular expressions, and significant as a demonstration of what can be achieved with Genetic Programming (GP) approaches on modern information technology.
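
To make the notions of labeled examples and of precision and recall for an extraction task concrete, here is a minimal sketch in Python. It is not the authors' system and does not generate regexes; it only scores a candidate regex against examples. The function name `extraction_scores`, the example texts, and the candidate patterns are all assumptions made for illustration, and the exact-span matching criterion is one possible choice rather than the evaluation protocol of the paper.

```python
import re


def extraction_scores(pattern, examples):
    """Return (precision, recall) of `pattern` over labeled examples.

    Each example is a pair (text, wanted), where `wanted` is the set of
    (start, end) character spans that should be extracted from `text`.
    A match counts as correct only if its span equals a wanted span exactly
    (an assumption of this sketch).
    """
    regex = re.compile(pattern)
    correct = extracted = desired = 0
    for text, wanted in examples:
        # Ignore empty matches, which carry no extracted text.
        found = {m.span() for m in regex.finditer(text) if m.group()}
        correct += len(found & wanted)
        extracted += len(found)
        desired += len(wanted)
    precision = correct / extracted if extracted else 0.0
    recall = correct / desired if desired else 0.0
    return precision, recall


# Hypothetical labeled examples: extract phone-like tokens such as 555-1234.
examples = [
    ("Call 555-1234 or 555-9876 today", {(5, 13), (17, 25)}),
    ("Order 12 units by 2014-03-01", set()),
]

strict = extraction_scores(r"\d{3}-\d{4}", examples)  # matches only the wanted spans
loose = extraction_scores(r"\d{3}-\d+", examples)     # also matches "014-03" in the date
```

With these examples, the stricter pattern extracts exactly the two labeled spans, while the looser one also picks up a fragment of the date and so loses precision without gaining recall.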
Files in this item:

2014-Computer-AutomaticSynthesisRegexExamples (2).pdf
Access: open access
Description: post-print PDF
Type: final post-refereeing draft (post-print)
License: Digital Rights Management not defined
Size: 252.75 kB
Format: Adobe PDF

computer2014.pdf
Access: not available
Description: publisher's PDF
Type: published version
License: Digital Rights Management not defined
Size: 1.38 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: http://hdl.handle.net/11368/2758954
Citations
  • PMC: not available
  • Scopus: 42
  • ISI: 32