General description

The student (or group of up to 3 students) is expected to design, develop, and present a solution based on Machine Learning to one problem among a set of problems provided by the teachers.


The student (or group of students) will deliver a single document (as a pdf file), within one week before the exam date, by email, to the teacher. The maximum length of the document is 4 pages (excluding references) if it is drafted according to this LaTeX template, or 1200 words (counting every word: title, authors, ...) otherwise.

The document will contain (not necessarily in the following order):

  • the problem statement
  • a description of one or more performance indexes able to capture the degree to which a(ny) solution solves the problem, or some other evaluation criteria
  • a description of the proposed solution, from the algorithmic point of view, with little focus on the tools used to implement it
  • a description of the experimental evaluation of the solution, including
    • a description of the data used, if any
    • a description of the experimental procedure and the comparison baseline, if any
    • the presentation and the discussion of the results
The students are allowed (and encouraged) to refer to existing (technical/scientific/research) literature for shrinking/omitting obvious parts of the description, if appropriate.
The students are not required to deliver any source code or other material. However, if needed to gain more insight into their work, the students will be asked to provide more material or to answer some questions.
If the project has been done by a group of students, the document must show, for each student of the group, which (at least one) of the following activities the student took part in:
  • problem statement
  • solution design
  • solution development
  • data gathering
  • writing


The teachers will evaluate the project output on a 0-33 scale.
Part of the score (up to 3 points) is determined statically and independently of the document content, as follows:
  • +1, if the project has been done by a single student
  • from +0 to +2, depending on which problem (among the teacher-provided set) has been chosen by the student (see below)
The remaining 30 points are assigned according to these criteria:
  • clarity (from 0 to 15): is the document understandable and easy to read? is the length appropriate? are all non-obvious design choices made explicit? is the solution/experimental campaign repeatable/reproducible based on the provided description?
  • technical soundness (from 0 to 10): are the problem statement, evaluation criteria, evaluation procedure sound? are design choices motivated experimentally, with references, or by other means? are conclusions and findings actually supported by results?
  • results (from 0 to 5): does the solution effectively/efficiently solve the problem? is there a baseline which is improved in some way?
Note that the students' solution is not required to exhibit some degree of novelty (i.e., to advance the state of the art of the specific research field). However, students are expected not to simply "cut-and-paste" an existing (research) project.
Note that, depending on the chosen problem, there could be more or less freedom in some aspects: e.g., problem statement, data gathering, and so on.
If the project has been done by a group of students, each student will be graded (for the project part) according to both the overall project score and the student's contribution, inferred from the activities she/he actually carried out, as specified in the document (see above).


e-Dermatology (+0 score)

Material for this problem is available here, but can be accessed only upon request to the teacher (Eric Medvet), to be sent by email.
Build a test for diagnosis of the disease, where the outcome is one of the following:
  • DAC
  • DCI
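  • eczema atopico (atopic eczema)
  • eczema atopico+DCI (atopic eczema combined with DCI)
  • eczema vescicolare (vesicular eczema)
  • eczema ipercheratosico (hyperkeratotic eczema)
  • eczema nummulare (nummular eczema)
  • psoriasi (psoriasis)
  • pitiriasi rubra pilare (pityriasis rubra pilaris)
  • micosi cutanea superficiale (superficial cutaneous mycosis)
  • cheratodermia palmo-plantare (palmoplantar keratoderma)
  • acrodermatite continua (acrodermatitis continua)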
Further info in the docx file.

Soccer team maker (+1 score)

Material is available as a zip file at the bottom of this page.
A group of friends play soccer weekly. Each week, they heuristically form two teams of 7 or 8 players (the number of players on the field usually varies from 12 to 16, or rarely more): basically, they randomly choose a team for each player, while trying to take into account player roles, age, and attitude for running.
They want to automate the team-forming process, with these goals:
  • goalkeepers should be evenly distributed (if possible) among teams
  • the resulting match should be balanced
The input of the process is given by the list of the players who are on the field in a given week.
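
As a point of reference only, the two goals above suggest a trivial baseline: a random split that spreads goalkeepers evenly and keeps team sizes equal, without any attempt at balancing the match. The following is a minimal sketch under stated assumptions: each player is represented as a hypothetical `(player_id, usual_position)` pair, with "P" marking a goalkeeper as in the player data; the function name `random_split` is illustrative and not part of the provided material.

```python
import random

def random_split(players, seed=None):
    """Randomly split players into two teams, spreading goalkeepers evenly.

    `players` is a list of (player_id, usual_position) pairs (an assumed
    representation); position "P" marks a goalkeeper.
    """
    rng = random.Random(seed)
    keepers = [p for p in players if p[1] == "P"]
    others = [p for p in players if p[1] != "P"]
    rng.shuffle(keepers)
    rng.shuffle(others)
    teams = ([], [])
    # Deal goalkeepers first, then everyone else, always to the smaller team:
    # goalkeeper counts and team sizes then differ by at most one.
    for player in keepers + others:
        smaller = 0 if len(teams[0]) <= len(teams[1]) else 1
        teams[smaller].append(player)
    return teams
```

A learned solution would be expected to improve on such a baseline with respect to match balance, which this sketch ignores entirely.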

The history of past matches and some data about all involved players are available as json files.

Matches data includes, for each match:
  • the id
  • the date
  • two arrays of player performances (the order of the arrays does not matter).
Each player performance includes:
  • id of the player
  • number of own goals (autogoals) scored by the player
  • number of goals scored by the player
  • role (as "position") of the player in the match (P=goalkeeper, D=defender, C=midfielder, A=forward)
Players data includes, for each player:
  • the id
  • the player's attitude for running (P=runs, T=does not run)
  • the birth year, if available (as "birthDate")
  • the usual role (as "position", actual position in a match could differ)
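
One possible way to quantify how balanced a past match was is the absolute goal difference, which can be derived from the player performances described above. The sketch below assumes that a team's score is the goals scored by its own players plus the own goals scored by the opponents; the key names `"goals"` and `"autogoals"` are hypothetical and should be checked against the actual json files.

```python
def team_scores(team_a, team_b, goals_key="goals", own_goals_key="autogoals"):
    """Return (score_a, score_b) for one match.

    Each argument is a list of player performances (dicts). The key names
    are assumptions, not taken from the provided json files.
    """
    def total(team, key):
        return sum(perf.get(key, 0) for perf in team)

    # An own goal counts for the opposing team.
    score_a = total(team_a, goals_key) + total(team_b, own_goals_key)
    score_b = total(team_b, goals_key) + total(team_a, own_goals_key)
    return score_a, score_b

def goal_difference(team_a, team_b):
    """Absolute goal difference: 0 means a draw, larger means less balanced."""
    score_a, score_b = team_scores(team_a, team_b)
    return abs(score_a - score_b)
```

Other notions of balance are equally legitimate; choosing and motivating one is part of the problem statement.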

Syntax-based extractor learning (+1 score)

Material is available as a zip file at the bottom of this page.
The goal is to propose a method for learning an extractor of syntax-based entities from examples of desired behavior.
For example, given a text where a user manually highlighted the dates, the method should learn an extractor of dates, applicable to any text.
A learned extractor receives as input a text t and extracts (i.e., gives as output) a list of substrings of t that it deems to match the syntax pattern inferable from the examples. Here substring is to be understood as a localized portion of the text, i.e., the extractor may extract two substrings which have equal content but appear in different places in t.
The learning method should receive as input one or more pieces of text, each text t along with the list of all the substrings that should be extracted from t.
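
Since extracted substrings are localized, one convenient representation is a pair of character offsets. The following minimal sketch shows this representation together with a possible (not mandated) performance index, precision and recall over exact spans; the names `Span`, `Extractor` and `precision_recall` are illustrative assumptions.

```python
from typing import Callable, Iterable, List, Tuple

# (start, end) character offsets in the text t, with end exclusive.
Span = Tuple[int, int]
# An extractor maps a text to the list of spans it extracts.
Extractor = Callable[[str], List[Span]]

def precision_recall(extracted: Iterable[Span], desired: Iterable[Span]):
    """Precision and recall, counting a span as correct only on exact match."""
    extracted, desired = set(extracted), set(desired)
    correct = len(extracted & desired)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(desired) if desired else 0.0
    return precision, recall
```

With offsets, two occurrences of the same content at different places in t remain distinct, which matches the notion of substring used here.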
A set of example problems is available in the data directory. The problems may or may not be used. Each problem is defined in terms of a single file of text. The name of the file is composed of two pieces separated by a dash: the first piece roughly describes the nature of the text (e.g., log, HTML "code", email headers, ...); the second piece roughly describes the nature of the syntax-based entities to be extracted (e.g., dates, URLs, phone numbers, ...). In the first line of the file, there is a regular expression: when applied to the remaining part of the text, the first capturing group (0-based indexing) matches all and only the corresponding syntax-based entities.
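
The file format just described can be turned into a text plus its desired extractions roughly as follows. This is a sketch under assumptions: the function name `load_problem` is hypothetical, the group index is left as a parameter (defaulting to 0, i.e., the whole match in Python's `re` numbering) because the intended group should be verified on the actual files, and the regular expressions in the files may use a flavor that Python's `re` module does not fully support.

```python
import re

def load_problem(content, group=0):
    """Split the content of a problem file into (text, desired_spans).

    The first line is taken as the regular expression, the rest as the text.
    Spans are (start, end) character offsets in the text, end exclusive.
    """
    pattern, _, text = content.partition("\n")
    regex = re.compile(pattern.rstrip("\r"))
    spans = []
    for match in regex.finditer(text):
        start, end = match.span(group)
        if start >= 0:  # skip matches where the group did not participate
            spans.append((start, end))
    return text, spans
```

The content of a file can be obtained, for instance, with `open(path, encoding="utf-8").read()`, where the encoding is also an assumption.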


  1. Bartoli, Alberto, et al. "Inference of regular expressions for text extraction from examples." IEEE Transactions on Knowledge and Data Engineering 28.5 (2016): 1217-1230.

Citation relevance (+2 score)

There is no material for this problem.
The goal is to build a tool which, given a research paper A citing a research paper B, gives an estimate of the relevance of the citation.
Intuitively, a citation is relevant if the content of paper B is in some way useful for understanding and/or putting in a context the content of paper A.


  1. Bai, Xiaomei, et al. "Identifying Anomalous Citations for Objective Evaluation of Scholarly Article Impact." PloS one 11.9 (2016): e0162364.
  2. Valenzuela, Marco, Vu Ha, and Oren Etzioni. "Identifying meaningful citations." Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015.
Eric Medvet,
Dec 14, 2016, 2:01 PM