Data engineering
                        CD4ML with Jenkins in DAGsHub
                            
                         
    
agent {
    dockerfile {
        additionalBuildArgs  '--tag rppp:$BRANCH_NAME'
        args '-v $WORKSPACE:/project -w /project -v /extras:/extras -e PYTHONPATH=/project'
    }
}
Details:
# Dockerfile
FROM python:3.8                       # Base image for our job
RUN pip install --upgrade pip && 
    pip install -U setuptools==49.6.0 # Upgrade pip and setuptools
RUN apt-get update && 
    apt-get install unzip groff -y    # Install few system dependencies
COPY requirements.txt ./              # Copy requirements.txt file into image
RUN pip install -r requirements.txt   # Installing project dependencies
When building docker images from a Dockerfile, we can control which files docker needs to consider to create docker context, by defining ignore patterns in a .dockerignore file. This enables faster and lighter Dockerfile builds. In our case, the only external file needed to build the Docker image is the requirements.txt file:# .dockerignore * # Ignores everything !requirements.txt # except for requirements.txt file ;)Now that we’ve defined the docker image we want to use to run our pipeline, Let’s dive into our Jenkins pipeline stages.
 Figure 1: End-to-end Jenkins pipeline stages
Figure 1: End-to-end Jenkins pipeline stages
stage('Run Unit Test') {
    steps {
        sh 'pytest -vvrxXs'
    }
}
stage('Run Linting') {
  steps {
    sh '''
      flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
      flake8 . --count --max-complexity=10 --max-line-length=127 --statistics
      black . --check --diff
    '''
  }
}
stage('Setup DVC Creds') {
    steps {
        withCredentials(
            [
                usernamePassword(
                    credentialsId: 'DVC_Remote_Creds',
                    passwordVariable: 'PASSWORD',
                    usernameVariable: 'USER_NAME'),
            ]
        ) {
            sh '''
                dvc remote modify origin --local auth basic
                dvc remote modify origin --local user $USER_NAME
                dvc remote modify origin --local password $PASSWORD
                dvc status -r origin
            '''
        }
    }
}
Explanation:

stage('Sync DVC Remotes') {
  steps {
    sh '''
      dvc status
      dvc status -r jenkins_local
      dvc status -r origin
      dvc pull -r jenkins_local || echo 'Some files are missing in local cache!'                          # 1
      dvc pull -r origin         # 2
      dvc push -r jenkins_local  # 3
    '''
  }
}
| Options | Pros | Cons | 
| For All Commits | We will never miss any experiment | This will increase build latency. It will be extremely expensive if we use cloud resources for training jobs. Might be overkill to run the DVC pipeline for all commits/changes | 
| Only for changes in the master branch | Only master branch experiments are saved, which ensures only "approved" changes and experiments are tracked. | We can not compare experiments in the feature branch before merging it to master.“Bad” experiments can slip through the PR review process and get merged to master before we could catch it. | 
| Setup a manual trigger | We can decide when we want to run/skip an experiment. | Automation is not complete. There is still room for manual errors. | 
| “Special” Commit message syntax | We can decide when we want to run/skip an experiment. | Automation is not complete. There is still room for manual errors. Commits are immutable, and it would be awkward to amend or create a new commit just to add the instruction. It also mixes MLOps instructions with the real purpose of the commit messages - documenting the history of the code | 
| On Pull Request | We can run and compare experiments before we approve the PR. No “Bad” experiments can now slip through the PR review process. | None | 
stage('Update DVC Pipeline') {
  when { changeRequest() }                                       //# 1
  steps {
    sh '''
      dvc repro --dry -mP
      dvc repro -mP                                                # 2
      git branch -a
      cat dvc.lock
      dvc push -r jenkins_local                                    # 3
      dvc push -r origin                                           # 3
    '''
    sh 'dvc metrics diff --show-md --precision 2 $CHANGE_TARGET' //# 4
  }
}
 Figure 3: Comparing metrics between feature branch and master
Figure 3: Comparing metrics between feature branch and master
stage('Commit back results') {
  when { changeRequest() }
  steps {
    withCredentials(
      [
        usernamePassword(
          credentialsId: 'GIT_PAT',
          passwordVariable: 'GIT_PAT',
          usernameVariable: 'GIT_USER_NAME'),
      ]
    ) {
        sh '''
          git branch -a
          git status
          if ! git diff --exit-code dvc.lock; then              # 1
            git add .
            git status
            git config --local user.email $JENKINS_EMAIL         # 2
            git config --local user.name $JENKINS_USER_NAME      # 2
            git commit -m "$GIT_COMMIT_REV: Update dvc.lock and metrics"
                                                                # 3
            git push https://$GIT_USER_NAME:$GIT_PAT@dagshub.com/puneethp/RPPP HEAD:$CHANGE_BRANCH                           # 4
          else
            echo 'Nothing to Commit!'                           # 5
          fi
      '''
    }
  }
}

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.