This paper describes the design of the first large-scale information retrieval (IR) test collection built for the Czech language. The collection is also very challenging: it is based on a continuous text stream produced by automatic transcription of spontaneous speech and therefore lacks clearly defined document boundaries.
All aspects of building the collection are presented, together with some initial experiments.