Amazon SageMaker: End-to-End Managed Machine Learning Platform

Amazon SageMaker Overview

Offering machine learning (ML) as a service requires expertise in data engineering, ML, and DevOps. Deploying a model to production raises several issues, such as versioning problems and failures in the model pipeline, and resolving them is time-consuming. ML research scientists and practitioners at Amazon built Amazon SageMaker, an AWS-powered service that runs the entire machine learning pipeline. It provides built-in tools for every stage, which eases model building and deployment.

Amazon SageMaker uses Jupyter notebooks and Python, either with Boto to connect to an S3 bucket or with its own high-level Python API for model building. Its compatibility with modern deep learning libraries such as TensorFlow, PyTorch, and MXNet reduces model building time. An entire pipeline can be set up in a notebook using SageMaker, or custom code can be imported to set up the ML pipeline.
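As a minimal sketch of the S3 hand-off described above, the helper below uploads a local file and returns the S3 URI that a training job would read from. It assumes a Boto3 S3 client (for example `boto3.client("s3")`) is passed in; the function name, bucket name, and key prefix are hypothetical and not taken from the article.

```python
import os


def upload_training_data(s3_client, local_path, bucket, prefix="train"):
    """Upload a local file to S3 and return its s3:// URI.

    `s3_client` is expected to behave like boto3.client("s3"); only its
    upload_file(Filename, Bucket, Key) method is used here.
    """
    # Keep the original file name and place it under the chosen prefix.
    key = f"{prefix}/{os.path.basename(local_path)}"
    s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    return f"s3://{bucket}/{key}"


# Hypothetical usage inside a notebook (needs AWS credentials):
#   import boto3
#   uri = upload_training_data(boto3.client("s3"), "train.csv", "my-example-bucket")
```

Passing the client in as an argument keeps the helper easy to try out locally with a stand-in object before pointing it at a real bucket.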

Machine Learning Pipeline

  • Fetch data – Get real-time data from Kafka streams or from a repository hosting the data. SageMaker requires the data to be in an S3 bucket to set up a training job.
  • Data Pre-processing – This step involves data wrangling and preparing data for training. Data wrangling is one of the most time-consuming steps in a machine learning project and often takes up most of a team's time. Amazon SageMaker Processing runs jobs that pre-process data for training, post-process inference output, and perform feature engineering and model evaluation at scale.
  • Model Training – Once the pre-processing pipeline has been set up for both training and test data, the model can be trained. Amazon SageMaker has popular algorithms built in; simply import the library and use it. The training pipeline in Amazon SageMaker works as follows:
      • First, the training data is imported from the S3 bucket.
      • Training then starts on ML compute instances, using the training image stored in Amazon Elastic Container Registry (ECR).
      • The trained model artifacts are saved to a model artifacts S3 bucket.
Figure: Architecture of SageMaker
  • Model Evaluation – A trained model on SageMaker can be evaluated in two ways: offline testing or online testing. In offline testing, requests are sent to the endpoint from a Jupyter notebook using historical data (data set aside previously), either as a validation set or through cross-validation. In online testing, the model is deployed and receives only a set share of live traffic. If it works well, that share is raised to 100%.
  • Model Deployment – Once the model has crossed the baseline, it is time to deploy it. Deployment needs the path of the trained model artifacts and the Docker registry path of the inference code. In SageMaker, the model is deployed by calling the CreateModel API, defining the configuration of an HTTPS endpoint, and then creating the endpoint.
  • Monitoring – The model's performance is monitored in real time: the ground truth values are saved to S3, and the deviation in performance is analyzed. This identifies the point at which drift started; the model is then retrained on the new samples that are being saved to a bucket in real time.
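The deployment and online-testing steps above can be sketched as plain request payloads. This is a hedged illustration, not the article's own code: the helper names, the `-config` and `-endpoint` suffixes, the variant name, and the instance type are assumptions. The dictionary keys follow the CreateModel, CreateEndpointConfig, and CreateEndpoint request shapes, and `traffic_share` shows how variant weights translate into the share of traffic each variant receives.

```python
def build_deployment_requests(name, image_uri, model_data_url, role_arn,
                              instance_type="ml.m5.large"):
    """Build payloads for CreateModel, CreateEndpointConfig and CreateEndpoint."""
    config_name = f"{name}-config"  # hypothetical naming scheme
    create_model = {
        "ModelName": name,
        "PrimaryContainer": {
            "Image": image_uri,              # Docker registry path of the inference code
            "ModelDataUrl": model_data_url,  # S3 path of the trained model artifacts
        },
        "ExecutionRoleArn": role_arn,
    }
    create_endpoint_config = {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
            "InitialVariantWeight": 1.0,
        }],
    }
    create_endpoint = {
        "EndpointName": f"{name}-endpoint",
        "EndpointConfigName": config_name,
    }
    return create_model, create_endpoint_config, create_endpoint


def traffic_share(production_variants):
    """Return the fraction of requests per variant, proportional to its weight."""
    total = sum(v["InitialVariantWeight"] for v in production_variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total
            for v in production_variants}
```

With a Boto3 SageMaker client `sm`, the three payloads would be passed as `sm.create_model(**create_model)`, `sm.create_endpoint_config(**create_endpoint_config)`, and `sm.create_endpoint(**create_endpoint)`. For online testing, a second variant with a small weight (for example 1 against 9) would receive about 10% of the traffic.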

Data Preparation using SageMaker Ground Truth: A machine learning model depends entirely on its data. The higher the data quality, the better the model will be. But obtaining quality (labelled) data is expensive in both cost and time.

In Amazon SageMaker, labelling data is not too difficult. The user can opt for a private, public, or vendor workforce. With a private or vendor workforce, the user runs the labelling job with their own team or through a third party, which requires confidentiality agreements. With the public workforce, the labelling job is carried out by the Amazon Mechanical Turk workforce, and the service reports which labelling tasks succeeded or failed. The steps are:

  • Store the data in an S3 bucket and define a manifest file that lists the data the labelling job will run on.
  • Create a labelling workforce by choosing the workforce type.
  • Create a labelling job by choosing the job type, such as Image Classification, Text Classification, or Bounding Box.
  • For example, if the chosen job type is Bounding Box, draw a bounding box around the desired object and give it a label.
  • Visualize the results by reviewing the confidence score and other metrics.
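The manifest file from the first step is a JSON Lines file in which each line points at one object to label through a `source-ref` key. The sketch below builds such a file's contents; the function name and the example URIs are hypothetical.

```python
import json


def build_manifest(s3_uris):
    """Return manifest text with one {"source-ref": <S3 URI>} object per line."""
    for uri in s3_uris:
        if not uri.startswith("s3://"):
            raise ValueError(f"not an S3 URI: {uri}")
    # JSON Lines: one JSON object per line, newline-terminated.
    return "".join(json.dumps({"source-ref": uri}) + "\n" for uri in s3_uris)


manifest_text = build_manifest([
    "s3://my-example-bucket/images/cat-001.jpg",
    "s3://my-example-bucket/images/dog-002.jpg",
])
```

The resulting text would be written to a file and uploaded to S3 alongside the data before the labelling job is created.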

Hyperparameter Tuning at SageMaker

  • Random Search – As the name implies, a list of hyperparameters is defined, combinations are picked at random, and a training job is run for each combination. SageMaker can run these jobs concurrently to find the best hyperparameters without interrupting the training jobs already running.
  • Bayesian Search – SageMaker has its own Bayesian search algorithm. It checks the performance of the hyperparameter combinations already used in a job and explores new combinations from the supplied list. The steps are:
      • To evaluate a training job, the metrics are specified when creating a hyperparameter tuning job. Up to 20 metrics can be defined for a single job; each metric is given a unique name and a regular expression that extracts its value from the logs.
      • The hyperparameter ranges are defined by parameter type, i.e. the ParameterRanges JSON object distinguishes between types: categorical, continuous, and integer ranges are defined separately.
      • Create a notebook on SageMaker and connect to SageMaker with the Boto3 client.
      • Specify the bucket and the output data location, then launch the hyperparameter tuning job configured in the first two steps.
      • Monitor the progress of the concurrently running training jobs, and find the best model in the SageMaker console by clicking on the best training job.
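The first two steps above can be sketched as follows. The metric name, the log format matched by the regular expression, the hyperparameter names, and the ranges are all hypothetical; the dictionary keys follow the shape of the HyperParameterTuningJobConfig request object, in which range bounds are passed as strings.

```python
import re

# Each metric has a unique name and a regex whose first capture group
# extracts the value from the training logs.
metric_definitions = [
    {"Name": "validation:accuracy", "Regex": r"val_acc=([0-9\.]+)"},
]


def extract_metric(log_line, definition):
    """Apply a metric definition to one log line; return a float or None."""
    match = re.search(definition["Regex"], log_line)
    return float(match.group(1)) if match else None


tuning_job_config = {
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "Type": "Maximize",
        "MetricName": "validation:accuracy",
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 20,
        "MaxParallelTrainingJobs": 2,
    },
    # Integer, continuous and categorical ranges are declared separately.
    "ParameterRanges": {
        "IntegerParameterRanges": [
            {"Name": "max_depth", "MinValue": "3", "MaxValue": "10"},
        ],
        "ContinuousParameterRanges": [
            {"Name": "learning_rate", "MinValue": "0.001", "MaxValue": "0.3",
             "ScalingType": "Logarithmic"},
        ],
        "CategoricalParameterRanges": [
            {"Name": "optimizer", "Values": ["sgd", "adam"]},
        ],
    },
}
```

With a Boto3 SageMaker client, `tuning_job_config` would be passed as the `HyperParameterTuningJobConfig` argument of `create_hyper_parameter_tuning_job`, while the metric definitions belong in the algorithm specification of the training job definition.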

Best Practices for Amazon SageMaker

  • Defining the number of parameters: To limit the search space and find the best variables for a model, SageMaker allows up to 20 hyperparameters in a hyperparameter tuning job.
  • Defining the range of hyperparameters: A broader range for hyperparameters makes it possible to find the best possible values, but searching it is time-consuming. Better results usually come from restricting the range to values known to be useful and limiting the search to that range.
  • Logarithmic scaling of hyperparameters: If the search space is small, define the scaling of hyperparameters as linear. If the search space is large, opt for logarithmic scaling, because it decreases the running time of jobs.
  • Finding the best number of concurrent training jobs: More concurrent jobs get the work done faster, but a tuning job relies on the results of previous runs to choose the next combinations. In other words, running one job at a time achieves the best results with the least compute time.
  • Running training jobs on multiple instances: When a training job runs on multiple instances, hyperparameter tuning uses the last-reported objective metric. Design distributed training jobs so that the metric written to the logs is the one you want.
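To illustrate why the scaling choice matters for a wide range, the sketch below draws random values from a hypothetical learning-rate range of 0.0001 to 1.0 under both scalings. This is a local illustration only, not SageMaker's internal sampler; the range, the threshold, and the function names are assumptions.

```python
import math
import random


def sample(low, high, scaling, rng):
    """Draw one value from [low, high] with linear or logarithmic scaling."""
    if scaling == "linear":
        return rng.uniform(low, high)
    if scaling == "logarithmic":
        # Uniform in log space, so every order of magnitude is equally likely.
        return math.exp(rng.uniform(math.log(low), math.log(high)))
    raise ValueError(f"unknown scaling: {scaling}")


def share_below(values, threshold):
    """Fraction of values that fall below the threshold."""
    return sum(v < threshold for v in values) / len(values)


rng = random.Random(0)
low, high = 1e-4, 1.0
linear_samples = [sample(low, high, "linear", rng) for _ in range(10_000)]
log_samples = [sample(low, high, "logarithmic", rng) for _ in range(10_000)]

# Linear scaling rarely tries small values; logarithmic scaling spends
# about half of its draws below 0.01 for this range.
print(share_below(linear_samples, 1e-2), share_below(log_samples, 1e-2))
```

With linear scaling, roughly 1% of the draws land below 0.01, so small learning rates are barely explored; with logarithmic scaling about half of them do.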

Amazon SageMaker Studio

SageMaker Studio is a fully functional integrated development environment (IDE) for machine learning that unifies all the key features of SageMaker. In SageMaker Studio, the user can write code in a notebook environment, create visualizations, debug, track models, and monitor model performance in a single window. It uses the following SageMaker features:

  • Amazon SageMaker Debugger: The debugger monitors the values of feature vectors and hyperparameters. The logs of a debugging job are stored in CloudWatch, and tensor values can be saved to an S3 bucket to check for exploding tensors and vanishing gradients. SaveConfig from the debugger SDK is placed at the point where tensor values need to be checked, and a SessionHook is attached at the start of every debugging job run.
  • Amazon SageMaker Model Monitor: Model Monitor tracks model performance by examining data drift. The constraints and statistics of the features are defined in JSON files. The constraints.json file lists the features with their types, and whether a feature is required is expressed by its completeness field, whose value ranges from 0 to 1. The statistics.json file contains the mean, median, quantiles, and similar information for each feature. The reports are saved in S3 and can be viewed in detail in constraint_violations.json, which lists the feature names and the type of violation (data type, minimum or maximum value of a feature, etc.).
  • Amazon SageMaker Experiments: Tracking multiple experiments (training jobs, hyperparameter tuning jobs, etc.) is easier on SageMaker. Just initialize an Estimator object and log the experiment values. The values stored in an experiment can be loaded into a pandas DataFrame, which makes analysis easier.
  • Amazon SageMaker Autopilot: ML with Autopilot is just a click away. Specify the path to the data, the target attribute, and the problem type (regression, binary classification, or multi-class classification). If the problem type is not specified, the built-in algorithms infer it automatically and run the data pre-processing and model training accordingly. The data pre-processing step automatically generates Python code that can be reused for further jobs. A custom pipeline for it can also be defined using the DescribeAutoMLJob API.
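The kind of check Model Monitor performs can be sketched locally. The dictionaries below are a simplified, assumed shape and not the exact schema of the files Model Monitor writes; the feature names are hypothetical, and the check-type labels are modelled on the violation types it reports.

```python
def find_violations(constraints, statistics):
    """Compare observed feature statistics with baseline constraints."""
    observed = {f["name"]: f for f in statistics["features"]}
    violations = []
    for expected in constraints["features"]:
        name = expected["name"]
        current = observed.get(name)
        if current is None:
            # The feature is absent from the newly captured data.
            violations.append({"feature_name": name,
                               "constraint_check_type": "missing_column_check"})
            continue
        if current["inferred_type"] != expected["inferred_type"]:
            violations.append({"feature_name": name,
                               "constraint_check_type": "data_type_check"})
        # Completeness is the share of non-missing values, between 0 and 1.
        if current["completeness"] < expected["completeness"]:
            violations.append({"feature_name": name,
                               "constraint_check_type": "completeness_check"})
    return violations


baseline_constraints = {"features": [
    {"name": "age", "inferred_type": "Integral", "completeness": 1.0},
    {"name": "plan", "inferred_type": "String", "completeness": 0.9},
]}
captured_statistics = {"features": [
    {"name": "age", "inferred_type": "Fractional", "completeness": 0.8},
    {"name": "plan", "inferred_type": "String", "completeness": 0.95},
]}
violations = find_violations(baseline_constraints, captured_statistics)
```

Here `age` would be flagged twice, once for its changed type and once for its lower completeness, while `plan` passes both checks.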

Running custom training algorithms

  • Run the dockerized training image on SageMaker.
  • SageMaker calls the CreateTrainingJob API, which runs training for up to a specified period.
  • Name the job with TrainingJobName and specify the hyperparameters in the HyperParameters field.
  • Check the status through TrainingJobStatus.
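The steps above can be sketched as a CreateTrainingJob request payload. The helper names, instance type, volume size, and channel name are assumptions chosen for illustration; the dictionary keys follow the CreateTrainingJob request shape, where hyperparameter values are strings and the time limit is set through the stopping condition.

```python
def build_training_job_request(job_name, image_uri, role_arn, train_s3_uri,
                               output_s3_uri, hyperparameters,
                               max_runtime_seconds=3600):
    """Build a CreateTrainingJob payload for a custom (dockerized) algorithm."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,  # the dockerized training image
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # The API expects every hyperparameter value as a string.
        "HyperParameters": {k: str(v) for k, v in hyperparameters.items()},
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        # Training is stopped once this time limit is reached.
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


def is_finished(training_job_status):
    """True once TrainingJobStatus has reached a terminal state."""
    return training_job_status in {"Completed", "Failed", "Stopped"}
```

With a Boto3 SageMaker client `sm`, the payload would be submitted with `sm.create_training_job(**request)`, and the status read from `sm.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]`.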

Security at SageMaker

  • Cloud Security: AWS uses a shared responsibility model. AWS is responsible for security of the cloud, i.e. securing the infrastructure. The customer is responsible for security in the cloud, which covers the services the customer opts for, IAM key management, the privileges granted to different users, keeping credentials secure, etc.
  • Data Security: SageMaker keeps data and model artifacts encrypted in transit and at rest. Requests to the Amazon SageMaker API and console are made over a secure (SSL) connection. Notebooks and scripts are encrypted using an AWS KMS (Key Management Service) key. If no key is available, they are encrypted with a transient key, which becomes obsolete after decryption.
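As one hedged example of encryption at rest and in transit, the sketch below adds encryption settings to a training job request. The helper name and the key ARN are hypothetical; `KmsKeyId`, `VolumeKmsKeyId`, and `EnableInterContainerTrafficEncryption` are fields of the CreateTrainingJob request.

```python
import copy


def with_encryption(training_request, kms_key_id):
    """Return a copy of a training job request with encryption settings added."""
    request = copy.deepcopy(training_request)  # leave the caller's dict untouched
    # Encrypt the model artifacts written to S3.
    request.setdefault("OutputDataConfig", {})["KmsKeyId"] = kms_key_id
    # Encrypt the storage volume attached to the training instances.
    request.setdefault("ResourceConfig", {})["VolumeKmsKeyId"] = kms_key_id
    # Encrypt traffic between containers during distributed training.
    request["EnableInterContainerTrafficEncryption"] = True
    return request


base_request = {"OutputDataConfig": {"S3OutputPath": "s3://my-example-bucket/output/"}}
secured_request = with_encryption(
    base_request, "arn:aws:kms:us-east-1:123456789012:key/example-key-id")
```

Returning a modified copy rather than changing the request in place makes it easy to compare the secured and unsecured payloads side by side.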

Advantages of SageMaker

  • Its debugger automatically specifies the range of hyperparameters to be used in training.
  • An end-to-end ML pipeline can be deployed with ease.
  • ML models can be deployed at the edge using SageMaker Neo.
  • SageMaker suggests the ML compute instance type to use when running training.


AWS charges each SageMaker customer for the compute, storage, and data processing resources used to build, train, run, and log machine learning models and predictions, along with the S3 costs of storing the data sets used for training and ongoing predictions. SageMaker is designed to support the end-to-end lifecycle of ML applications, from data preparation to model execution. Its modular design also makes it versatile: you can use SageMaker independently for model building, training, or deployment.
