Tagging Datasets

In machine learning, almost every data entry has a corresponding metadata entry. Metadata can represent labels (the actual digits in the MNIST case), image size, data source and more. cnvrg.io has out-of-the-box support for tagging objects. How? Simply by adding a YAML file for each object, or by querying a path inside the dataset files.

cnvrg supports two types of queries:

  • Path regex queries
  • Tag queries

File Path Queries:

You can query any file, file type, or folder in your dataset by entering a file path query. For example:

  • *.zip
  • cats/*.png
  • *.tfrecord
  • image-0*.png

and so on.
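To get a feel for which files a wildcard pattern like the ones above would select, here is a minimal sketch using Python's standard fnmatch module. The file listing is made up for illustration, and fnmatch's shell-style matching is only an approximation: the exact matching rules cnvrg.io applies are not described here and may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical file listing, used only for illustration.
files = [
    "archive.zip",
    "cats/tom.png",
    "dogs/rex.png",
    "image-01.png",
    "image-12.png",
]


def select(pattern, paths):
    """Return the paths that match a shell-style wildcard pattern."""
    # fnmatchcase is case-sensitive on every platform.
    return [path for path in paths if fnmatchcase(path, pattern)]


zip_files = select("*.zip", files)
cat_images = select("cats/*.png", files)
numbered = select("image-0*.png", files)

print(zip_files)
print(cat_images)
print(numbered)
```

Trying a pattern locally like this can help you check that it is neither too broad nor too narrow before you type it into the query box.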

Tag Queries:

For example, suppose the image 9.png represents the digit “9” in an MNIST dataset. To tag it with metadata, create a YAML file with the following content:

label: "9"
width: "32"
height: "32"
source: "yann lecun"

Once this file is ready and sits in the same directory as the original 9.png file, run cnvrg data upload, and the tags will be stored and indexed in cnvrg.io.
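When a dataset has many files, writing the tag files by hand gets tedious. The sketch below shows one way to generate the YAML content from a Python dict. The helper name render_tags is hypothetical and not part of cnvrg; it only handles flat key/value tags like the ones in the example above, and it quotes every value as a string.

```python
import json


def render_tags(tags):
    """Render a flat dict as YAML lines of the form key: "value"."""
    # json.dumps produces a double-quoted string, which is also a valid
    # YAML double-quoted scalar for ordinary text values.
    lines = [
        f"{key}: {json.dumps(str(value), ensure_ascii=False)}"
        for key, value in tags.items()
    ]
    return "\n".join(lines) + "\n"


tag_text = render_tags(
    {"label": 9, "width": 32, "height": 32, "source": "yann lecun"}
)
print(tag_text)
```

Save the resulting text as the YAML file next to the image it describes, then run cnvrg data upload as usual.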

Querying Datasets

Once a dataset has been tagged, cnvrg.io allows you to create queries and subsets of your original dataset. This is especially useful when you want to work on a subset of your dataset (for example, only data labeled "train").

To create a query, go to your datasets page and type your query in the query box.

For tags, the syntax is simply key:value. Regex is supported for values.
For path queries, the syntax is any part of a file path. Regex is supported.
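To illustrate what a key:value tag query with a regex value means, here is a small sketch that filters an in-memory tag index. It is not how cnvrg.io implements queries: the index, the function names, and the choice to require the regex to match the whole value are all assumptions made for the example.

```python
import re

# Hypothetical tag index: file path -> tags, used only for illustration.
index = {
    "9.png": {"label": "9", "source": "yann lecun"},
    "7.png": {"label": "7", "source": "yann lecun"},
    "cat.png": {"label": "cat", "source": "web"},
}


def parse_query(query):
    """Split a 'key:value' query into its key and its value pattern."""
    key, _, pattern = query.partition(":")
    return key.strip(), pattern.strip()


def run_tag_query(query, tag_index):
    """Return the sorted paths whose tag value fully matches the regex."""
    key, pattern = parse_query(query)
    regex = re.compile(pattern)
    return sorted(
        path
        for path, tags in tag_index.items()
        # Assumption: the pattern must match the entire tag value.
        if key in tags and regex.fullmatch(tags[key])
    )


digits = run_tag_query("label:[0-9]", index)
from_web = run_tag_query("source:web", index)

print(digits)
print(from_web)
```

Here label:[0-9] selects only the files tagged with a single digit, while source:web selects by an exact value.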

You can also select the commit version (Latest always runs the query on the latest commit in the dataset). Clicking Search takes you to the results page.

Saving Queries

Queries can also be saved, which is great for reproducible data science and for reusability. For each saved query, you can browse its files, see its info, and rerun experiments and jobs.

Using queries in experiments

Use your stored queries by adding the --query flag to your command (or use the Copy Query Command button in the queries list).

$ cnvrg run --data=dataName --query=queryID python train.py

* To obtain your query ID, go to the query info page.
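If you launch experiments from a script, the command above can be assembled programmatically. The sketch below only builds and prints the command line and does not execute it. The function name build_run_command is hypothetical, and dataName and queryID are the same placeholders used above.

```python
import shlex


def build_run_command(data_name, query_id, script="train.py"):
    """Build the argument list for running a script against a saved query."""
    return [
        "cnvrg",
        "run",
        f"--data={data_name}",
        f"--query={query_id}",
        "python",
        script,
    ]


command = build_run_command("dataName", "queryID")
command_line = shlex.join(command)
print(command_line)
```

Keeping the command as a list of arguments avoids quoting problems if a dataset name or script path contains spaces; the list can be passed to subprocess.run once you are ready to execute it.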
